GRU

The Gated Recurrent Unit (GRU) is an RNN variant that combines the forget and input gates of the LSTM into a single update gate, and merges the cell state and hidden state into one hidden state.
$$
\begin{aligned}
z_{t} &= \sigma (W_{z} x_{t} + U_{z} h_{t-1} + b_{z}) && \text{(update gate)} \\
r_{t} &= \sigma (W_{r} x_{t} + U_{r} h_{t-1} + b_{r}) && \text{(reset gate)} \\
\tilde{h}_{t} &= \tanh (W_{h} x_{t} + U_{h} (r_{t} \odot h_{t-1}) + b_{h}) && \text{(candidate state)} \\
h_{t} &= (1 - z_{t}) \odot h_{t-1} + z_{t} \odot \tilde{h}_{t} && \text{(new hidden state)}
\end{aligned}
$$
$x_{t}$ : input vector at time $t$
$h_{t - 1}, h_{t}$ : previous and current hidden states
$z_{t}, r_{t}$ : update and reset gates (elementwise values between 0 and 1)
$\tilde{h}_{t}$ : candidate hidden state
$\sigma, \tanh$ : sigmoid and hyperbolic tangent activations
$\odot$ : elementwise multiplication
$W_{*}$ : input weight matrices
$U_{*}$ : recurrent (hidden-to-hidden) weight matrices
$b_{*}$ : bias vectors
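
To make the four equations concrete, here is a minimal NumPy sketch that transcribes them one to one, using six separate matrix multiplications per step. The parameter dictionary `p`, the helper `init_params`, and the sizes are assumptions made up for this illustration, not part of any library.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step for a single example; `p` is a hypothetical dict of weights."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate scales h_prev before the recurrent matmul.
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    # New hidden state: elementwise interpolation between old state and candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

def init_params(input_size, hidden_size, rng):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    p = {}
    for g in "zrh":
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(hidden_size, input_size))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        p[f"b_{g}"] = np.zeros(hidden_size)
    return p

rng = np.random.default_rng(0)
params = init_params(input_size=3, hidden_size=4, rng=rng)
x_t = rng.normal(size=3)
h_prev = rng.uniform(-1, 1, size=4)
h_t = gru_step(x_t, h_prev, params)
```

Because $h_{t}$ is an elementwise interpolation between $h_{t-1}$ and $\tilde{h}_{t}$, pushing $z_{t}$ towards 0 (for example with a very negative `b_z`) makes the cell copy its previous state almost unchanged.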

The notation here merges each pair $W_{*}$, $U_{*}$ into a single weight matrix that is applied to the concatenation of the input and hidden state vectors. As so often, biases are omitted for simplicity.
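
As a quick sanity check of the merged notation, the following NumPy sketch shows that one matrix applied to the concatenated vector gives the same result as the two separate products, and that stacking the matrices of two gates yields both pre-activations from a single matmul. All names and sizes are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

# Separate input and recurrent matrices for two gates (z and r).
W_z, W_r = rng.normal(size=(2, hidden_size, input_size))
U_z, U_r = rng.normal(size=(2, hidden_size, hidden_size))
x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)

# Two matmuls per gate.
z_separate = W_z @ x_t + U_z @ h_prev
r_separate = W_r @ x_t + U_r @ h_prev

# Merged form: [U | W] applied to the concatenation [h_prev ; x_t].
hx = np.concatenate([h_prev, x_t])
Wz_cat = np.concatenate([U_z, W_z], axis=1)  # (hidden, hidden + input)
Wr_cat = np.concatenate([U_r, W_r], axis=1)

# Stack both gates vertically so a single matmul produces both pre-activations.
gates_cat = np.concatenate([Wr_cat, Wz_cat], axis=0)  # (2 * hidden, hidden + input)
r_fused, z_fused = np.split(gates_cat @ hx, 2)

print(np.allclose(z_separate, z_fused), np.allclose(r_separate, r_fused))
```

The candidate state needs its own matmul because its input, $r_{t} \odot h_{t-1}$, is only known after the gates have been computed.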

The merged notation matches how it’s typically implemented, with 2 matmuls per step instead of 6:

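import torch
import torch.nn as nn

class GRUCellDiagram(nn.Module):
	def __init__(self, input_size, hidden_size):
		super().__init__()
		# One matmul for both gates: stacked [W_r; W_z], split by chunk() in forward
		self.gates = nn.Linear(input_size + hidden_size, 2 * hidden_size)
		# One matmul for the candidate state: W_h
		self.cand  = nn.Linear(input_size + hidden_size, hidden_size)

	def forward(self, x_t, h_prev): # (batch, input_size), (batch, hidden_size)
		hx = torch.cat([h_prev, x_t], dim=-1) # (batch, hidden_size + input_size)
		r_t, z_t = self.gates(hx).chunk(2, dim=-1)
		r_t = torch.sigmoid(r_t) # reset gate
		z_t = torch.sigmoid(z_t) # update gate

		# The reset gate scales h_prev before it enters the candidate matmul
		rhx   = torch.cat([r_t * h_prev, x_t], dim=-1)
		h_hat = torch.tanh(self.cand(rhx)) # candidate state

		h_t = (1.0 - z_t) * h_prev + z_t * h_hat # interpolate old state and candidate
		return h_t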
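
A cell only computes one time step, so processing a sequence means calling it in a loop and feeding each $h_{t}$ back in as the next $h_{t-1}$. The sketch below shows that unrolling. The helper `run_sequence` and the zero initial state are assumptions for this illustration, and `torch.nn.GRUCell` is used only as a stand-in so the block runs on its own; any module called as `cell(x_t, h_prev)`, such as the class above, should fit the same loop.

```python
import torch
import torch.nn as nn

def run_sequence(cell, xs, hidden_size):
    """Unroll `cell` over xs of shape (seq_len, batch, input_size)."""
    _, batch, _ = xs.shape
    h_t = xs.new_zeros(batch, hidden_size)  # h_0 = 0 (assumed initial state)
    outputs = []
    for x_t in xs:                # iterate over the time dimension
        h_t = cell(x_t, h_t)      # new hidden state becomes the next h_prev
        outputs.append(h_t)
    return torch.stack(outputs), h_t  # all hidden states, and the last one

torch.manual_seed(0)
cell = nn.GRUCell(input_size=3, hidden_size=4)  # stand-in cell for the demo
xs = torch.randn(5, 2, 3)                       # (seq_len=5, batch=2, input_size=3)
hs, h_last = run_sequence(cell, xs, hidden_size=4)
print(hs.shape)  # torch.Size([5, 2, 4])
```

Note that `nn.GRUCell` follows a slightly different formulation than the equations in this note: its documentation defines the update as $h' = (1 - z) \odot n + z \odot h$, so the role of $z$ is flipped, and it applies the reset gate after the recurrent matmul rather than before. Outputs will therefore not match the class above weight for weight.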

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Max Wolf's Second Brain
