LSTM

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

LSTMs are RNNs which can control which parts of the information it wants to remember/forget/retain with the help of forget/remembering gates.
This helps them remember information over long periods of time.

![[LSTM-20240905170245598.webp]]
![[LSTM-1777239428642.webp]]
![[LSTM-1777240033334.webp]]

Improving over vanilla RNNs:
Memory allows read and write.
Can store information indefinitely.
Add to memory → no vanishing gradients.

LSTM

Long Short-Term Memory: an RNN that maintains a separate cell state $c_{t}$ , updated by a gated additive rule, so that error can flow across many timesteps without vanishing or exploding. A second hidden state $h_{t}$ is a gated read-out of $c_{t}$ .
$f_{t} i_{t} o_{t} \tilde{c}_{t} c_{t} h_{t} = σ (W_{f} x_{t} + U_{f} h_{t - 1} + b_{f}) = σ (W_{i} x_{t} + U_{i} h_{t - 1} + b_{i}) = σ (W_{o} x_{t} + U_{o} h_{t - 1} + b_{o}) = tanh (W_{c} x_{t} + U_{c} h_{t - 1} + b_{c}) = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ \tilde{c}_{t} = o_{t} ⊙ tanh (c_{t}) (forget gate) (input gate) (output gate) (candidate cell) (new cell state) (new hidden state)$
$x_{t}$ … input vector at time $t$
$c_{t - 1}, c_{t}$ … previous and current cell state (the long-term memory)
$h_{t - 1}, h_{t}$ … previous and current hidden state (short-term, also the output passed to other layers)
$f_{t}, i_{t}, o_{t}$ … forget, input, output gates (values between 0 and 1)
$\tilde{c}_{t}$ … candidate cell content (values between -1 and 1)
$σ, tanh$ … sigmoid, tanh
$⊙$ … elementwise multiplication
$W_{*}, U_{*}$ … input / hidden state weight matrices
$b_{*}$ … bias vectors

The constant error carousel (CEC)

The original LSTM (Hochreiter & Schmidhuber, 1997) has no forget gate. The cell state evolves purely additively:
$c_{t} = c_{t - 1} + i_{t} ⊙ \tilde{c}_{t}$
A self-loop with weight exactly $1$ and identity activation.
Error circulates around the cell unchanged.

The motivation is the vanishing gradients problem in vanilla RNNs. There $h_{t} = tanh (W h_{t - 1} + \dots)$ , so $\partial h_{t} / \partial h_{t - k}$ is a product of $W$ factors and $tanh^{'}$ factors, both of which shrink with depth. In the gateless LSTM,
$\frac{\partial c _{t}}{\partial c _{t - 1}} = 1$
exactly, for arbitrary $k$ .

The input and output gates already exist in this design: $i_{t}$ protects the cell from being overwritten by noisy inputs, $o_{t}$ protects the rest of the network from being interfered with by the cell on timesteps where it shouldn’t speak. The cell is insulated from the outside on both sides.

Why a forget gate was added

Gers, Schmidhuber, Cummins (2000), “Learning to Forget”, identified two problems with the pure CEC:

Unbounded growth. With $i_{t} \approx 1$ over many steps, $c_{t}$ accumulates without bound. The downstream $tanh (c_{t})$ saturates and the cell becomes informationally dead.

No way to reset. When a sequence transitions to a new sub-task, the cell still carries old content. Nothing in the equations can erase it.

The fix: scale $c_{t - 1}$ by a learned gate $f_{t} \in (0, 1)$ before adding new content,
$c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ \tilde{c}_{t} .$
This is the modern LSTM.

Forget gate vs. CEC: a controllable leak

Strictly, the forget gate breaks the constant error carousel:
$\frac{\partial c _{t}}{\partial c _{t - 1}} = f_{t},$
so over $k$ steps the cell-to-cell gradient is $\prod_{j = 1}^{k} f_{t - j + 1}$ — exponential decay whenever $f < 1$ . The pure CEC’s perfect gradient flow is gone.

But $f_{t}$ is learnable. When long-range memory is needed the network can drive $f_{t} \to 1$ and recover the original CEC; when it isn’t, $f_{t} < 1$ gives a controlled, learned forgetting rate. The fixed self-weight $1$ becomes a special case of a more flexible mechanism, and the CEC is recovered adaptively rather than structurally.

A practical consequence: $b_{f}$ is commonly initialised to $1$ (sometimes $2$ ), so that early in training $f_{t} \approx σ (1) \approx 0.73$ and the cell behaves close to “remember everything” by default. With $b_{f} = 0$ the initial $f_{t} \approx 0.5$ halves the cell every step, and the gradient dies before training can learn to open the gate. This single initialisation tweak meaningfully improves LSTM training on long sequences.

Roles of the three gates

$i_{t}$ — what to write: scales how much of the proposed update $\tilde{c}_{t}$ enters the cell.
$f_{t}$ — what to keep: scales how much of the previous cell state survives.
$o_{t}$ — what to read: scales the cell’s contribution to $h_{t}$ , decoupling what is in memory from what is exposed to the rest of the network on this step.

Each gate is a sigmoid over the same $(x_{t}, h_{t - 1})$ pair but with its own weights, so each learns its own conditioning on context.

GRU

Input and forget gates are coupled into a single update gate $z_{t}$ , with $1 - z_{t}$ playing the role of $f_{t}$ and $z_{t}$ the role of $i_{t}$ . Writing more necessarily forgets the same amount; the LSTM can decouple the two.
Cell and hidden states are merged, so the output gate disappears.
A reset gate $r_{t}$ masks $h_{t - 1}$ inside the candidate computation, recovering some of what the LSTM’s input gate did but at a different point in the dataflow.

Empirically the two perform similarly on most tasks. The LSTM has more parameters and more flexibility (decoupled write/forget matters when you want to add new content without erasing old); the GRU is cheaper and often easier to optimise.

Fused-matmul implementation, with the standard forget-bias-to-1 initialisation; $W_{*}$ and $U_{*}$ are concatenated into a single matrix and the input/hidden vectors are stacked, turning the four gate computations into one matmul of size $4 h$ :

class LSTMCell(nn.Module):
	def __init__(self, input_size, hidden_size):
		super().__init__()
		self.hidden_size = hidden_size
		self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
		# forget gate bias = 1 → f_t ≈ 0.73 at init, preserves CEC-like behaviour early
		nn.init.zeros_(self.gates.bias)
		nn.init.constant_(self.gates.bias[hidden_size:2 * hidden_size], 1.0)
 
	def forward(self, x_t, state):  # x_t (B, I), state = (h_prev, c_prev)
		h_prev, c_prev = state
		z = self.gates(torch.cat([h_prev, x_t], dim=-1))
		i_t, f_t, g_t, o_t = z.chunk(4, dim=-1)
		i_t = torch.sigmoid(i_t)
		f_t = torch.sigmoid(f_t)
		o_t = torch.sigmoid(o_t)
		g_t = torch.tanh(g_t)		 # candidate, ~ c̃_t
 
		c_t = f_t * c_prev + i_t * g_t
		h_t = o_t * torch.tanh(c_t)
		return h_t, (h_t, c_t)

Computational limitation (sequential + non-linearities):
After the last hidden step, we perform 2x linear combination + nonlinearity.
We need the previous steps to calculate the next steps and due to the nonlinearity, we cannot jump ahead in calculations or parallelize.

Understanding the pytorch output (SO):
center|500

References

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html

https://youtu.be/x8pW19wKfXQ?t=752

https://people.idsia.ch/~juergen/rnn.html

Graph View

LSTM

References

Backlinks