[Paper Review] LSTM & GRU

This post reviews the LSTM & GRU model structures, building on RNN basics.

Thumbnail: SpringerLink

Background: RNN (Recurrent Neural Network)

Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data of varying lengths. Unlike traditional feedforward neural networks, RNNs maintain a form of internal memory through their recurrent hidden states, allowing them to exhibit dynamic temporal behavior. This feature makes RNNs particularly suitable for tasks where the current input is dependent on prior inputs, such as natural language processing or time series analysis.

For any given sequence X = (x_1, x_2, x_3, ..., x_T), the RNN updates its hidden state h_t at each time step t as a function of the current input x_t and the previous hidden state h_{t-1}. Mathematically, this can be represented as:

RNN Formula

where f is a nonlinear activation function, x_t is the input at time step t, and h_{t-1} is the hidden state from the previous time step.
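Written out, the recurrence is h_t = f(x_t, h_{t-1}); a common choice for f is an affine transform of the input and previous state followed by tanh (the weight names W, U, b below are the usual convention, not taken from the post's figure):

```latex
h_t = f(x_t, h_{t-1}), \qquad \text{e.g.} \quad h_t = \tanh(W x_t + U h_{t-1} + b)
```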

RNN Diagram

RNN Structure

RNN Overview
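As a minimal NumPy sketch of this recurrence (the function name, dimensions, and random weights below are made up purely for illustration):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One recurrent update: h_t = tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

# Toy dimensions: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 3, 5
W = rng.normal(size=(hidden_dim, input_dim))
U = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                  # initial hidden state h_0
for x_t in rng.normal(size=(seq_len, input_dim)):
    h = rnn_step(x_t, h, W, U, b)         # h_t depends on x_t and h_{t-1}
print(h)
```

Because every h_t is computed through the same weights W and U, gradients flowing back through many time steps are multiplied by the same factors over and over, which is exactly where the vanishing/exploding issues below come from.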

Cons: RNNs are hard to train with long input sequences due to several issues:

  • Heavy computation when unrolling over long sequences
  • Vanishing gradients
  • Exploding gradients

Solutions: gradient clipping, using gating units

Gradient Clipping

Gradient Clipping

Gradient clipping limits (or "clips") the gradients to a defined range or threshold during backpropagation so that they do not grow too large and explode. (Clipping does not fix vanishing gradients; that problem is addressed by the gating units described next.)

How It Works:

  • By Value: Each gradient component is clipped directly if it exceeds a predefined threshold: components greater than the threshold are set to the threshold, and components smaller than its negative are set to the negative threshold.
  • By Norm: If the norm of the gradient vector exceeds a specified threshold, the whole gradient is scaled down proportionally. This keeps the direction of the gradient while reducing its magnitude to avoid exploding gradients, as shown in the sketch after this list.
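A minimal NumPy sketch of both strategies (the example gradient and thresholds are arbitrary):

```python
import numpy as np

def clip_by_value(grad, threshold):
    """Clip each gradient component into the range [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm):
    """If the L2 norm of the gradient exceeds max_norm, rescale the whole
    gradient so its norm equals max_norm, keeping its direction unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0])        # ||g|| = 5
print(clip_by_value(g, 1.0))     # [ 1. -1.]   (clipped componentwise)
print(clip_by_norm(g, 1.0))      # [ 0.6 -0.8] (same direction, norm 1)
```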

Using Gating Units

Gating Unit

The paper focuses mainly on the GRU, comparing its basic structure with that of the LSTM.

LSTM (Long Short-Term Memory)

LSTM Diagram

  • Forget Gate: Decides what information to discard from the cell state, using a sigmoid function to assign values close to 0 for information to forget and values close to 1 for information to retain.
  • Input Gate: Determines which new information to update in the cell state, combining a sigmoid layer to select values and a tanh layer to create a vector of candidate values.
  • Cell State: Serves as the LSTM's memory, carrying relevant information across the sequence, and is updated based on inputs from the forget and input gates.
  • Output Gate: Controls what part of the cell state is passed to the output, using a sigmoid layer to select parts of the cell state and a tanh layer to scale these selected parts before producing the final output.

LSTM Output Calculation

o_t^j is the output gate, which modulates how much of the memory content is exposed.

Memory Content Exposure

V_o is a diagonal matrix.
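In the notation of the reviewed paper (superscript j indexes the j-th unit, σ is the logistic sigmoid), the output and output gate read:

```latex
h_t^j = o_t^j \tanh\left(c_t^j\right), \qquad
o_t^j = \sigma\left(W_o x_t + U_o h_{t-1} + V_o c_t\right)^j
```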

The memory cell c_t^j forgets part of the existing memory content and adds new memory content c̃_t^j.

Memory Cell State

New Memory Content
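Written out, the memory cell update and the new memory content are:

```latex
c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j, \qquad
\tilde{c}_t^j = \tanh\left(W_c x_t + U_c h_{t-1}\right)^j
```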

The forget gate f_t^j determines how much of the existing memory is forgotten, while the input gate i_t^j controls how much of the new content is added.

Forget Gate Calculation

Here, V_f and V_i are diagonal matrices, just like V_o.
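The two gates are computed as:

```latex
f_t^j = \sigma\left(W_f x_t + U_f h_{t-1} + V_f c_{t-1}\right)^j, \qquad
i_t^j = \sigma\left(W_i x_t + U_i h_{t-1} + V_i c_{t-1}\right)^j
```

Putting the pieces together, a single LSTM step could look like the NumPy sketch below (dimensions, parameter names, and initialization are made up for illustration; the diagonal matrices V_f, V_i, V_o are represented as vectors applied elementwise):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations above.
    p holds the parameters; v_f, v_i, v_o stand in for the diagonal
    matrices V_f, V_i, V_o (elementwise products)."""
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["v_f"] * c_prev)  # forget gate
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["v_i"] * c_prev)  # input gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)                # new memory content
    c = f * c_prev + i * c_tilde                                         # memory cell update
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["v_o"] * c)       # output gate
    h = o * np.tanh(c)                                                   # exposed hidden state
    return h, c

# Toy example with made-up sizes.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
p = {k: rng.normal(size=(d_hid, d_in)) for k in ["W_f", "W_i", "W_c", "W_o"]}
p.update({k: rng.normal(size=(d_hid, d_hid)) for k in ["U_f", "U_i", "U_c", "U_o"]})
p.update({k: rng.normal(size=d_hid) for k in ["v_f", "v_i", "v_o"]})
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c, p)
```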

GRU (Gated Recurrent Unit)

GRU Diagram

  • Update Gate: Determines the amount of past information to keep versus new information to add, using a sigmoid function to balance between the previous activation and the potential new candidate activation.
  • Reset Gate: Decides how much of the past information to forget, allowing the model to drop irrelevant information from the previous steps for the current prediction.
  • Candidate Activation: Combines the current input with the past memory, as filtered by the reset gate, to create a candidate for the new hidden state that blends old and new information.

The GRU activation h_t^j is a linear interpolation between the previous activation h_{t-1}^j and the candidate activation h̃_t^j.

GRU Activation

Here, the update gate z_t^j decides how much the unit updates its content.

Update Gate
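In the paper's notation:

```latex
h_t^j = \left(1 - z_t^j\right) h_{t-1}^j + z_t^j \tilde{h}_t^j, \qquad
z_t^j = \sigma\left(W_z x_t + U_z h_{t-1}\right)^j
```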

This additive update between the existing state and the newly computed candidate state is similar to the LSTM's cell-state update. The difference is that the GRU has no separate output gate, so it cannot control how much of its state is exposed. The candidate activation h̃_t^j is calculated as below.

Candidate Activation Calculation

The reset gate r_t^j is computed in a similar way.

Reset Gate
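Written out (⊙ is the elementwise product):

```latex
\tilde{h}_t^j = \tanh\left(W x_t + U \left(r_t \odot h_{t-1}\right)\right)^j, \qquad
r_t^j = \sigma\left(W_r x_t + U_r h_{t-1}\right)^j
```

As with the LSTM above, here is a single GRU step as a NumPy sketch (dimensions, parameter names, and initialization are again made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step following the equations above (no separate memory cell)."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))  # candidate activation
    return (1.0 - z) * h_prev + z * h_tilde                  # linear interpolation

# Toy example with made-up sizes.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
p = {k: rng.normal(size=(d_hid, d_in)) for k in ["W_z", "W_r", "W"]}
p.update({k: rng.normal(size=(d_hid, d_hid)) for k in ["U_z", "U_r", "U"]})
h = np.zeros(d_hid)
h = gru_step(rng.normal(size=d_in), h, p)
```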

Experiment (Validation)

Sequence Modeling aims at learning a probability distribution over sequences.

Sequence Modeling
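Concretely, the model learns the factorized distribution over a sequence (x_1, ..., x_T):

```latex
p(x_1, \ldots, x_T) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_T \mid x_1, \ldots, x_{T-1})
```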

The authors train three different units (LSTM, GRU, and a tanh unit), using RMSProp as the optimizer with weight noise of fixed standard deviation 0.075. The norm of the gradient is kept from exceeding 1 to prevent exploding gradients.
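As a rough PyTorch sketch of that setup (only the optimizer choice, the gradient-norm limit of 1, and the weight-noise standard deviation of 0.075 come from the paper; the model, dimensions, learning rate, and loss here are placeholders, and the weight-noise injection itself is omitted):

```python
import torch
import torch.nn as nn

# Hypothetical model and sizes; weight noise (std 0.075) would additionally be
# added to the parameters during training but is left out of this sketch.
model = nn.LSTM(input_size=32, hidden_size=100, batch_first=True)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

def train_step(x, target, loss_fn):
    optimizer.zero_grad()
    output, _ = model(x)               # x: (batch, seq_len, input_size)
    loss = loss_fn(output, target)
    loss.backward()
    # Clip the gradient norm at 1 to prevent exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```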

Training Units

Training Results

Result

Experiment Result

In the GRU, the LSTM's three gates (forget, input, output) are reduced to two (update and reset). The GRU also exposes its whole state at each step, with no separate memory cell. Since it has fewer gates, it also has fewer parameters than the LSTM.

In practice, the GRU does not perform dramatically better than the LSTM, but it has the advantage of fewer computations and faster training.
