RNN

$$h_{t} = \tanh(W \cdot [h_{t - 1}; x_{t}] + b)$$
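A minimal sketch of this recurrence in plain Python, assuming a tanh nonlinearity (the common choice) and toy sizes. The helper names `rnn_step` and `rnn_forward` and all weight values are made up for illustration. Note that the same `W` and `b` are reused at every timestep.

```python
import math

def rnn_step(W, b, h_prev, x):
    """One recurrence step: h_t = tanh(W · [h_{t-1}; x_t] + b)."""
    z = h_prev + x  # list concatenation plays the role of [h_{t-1}; x_t]
    return [math.tanh(sum(w * v for w, v in zip(row, z)) + b_i)
            for row, b_i in zip(W, b)]

def rnn_forward(W, b, h0, xs):
    """Apply the same W and b at every timestep; return all hidden states."""
    hs = []
    h = h0
    for x in xs:
        h = rnn_step(W, b, h, x)
        hs.append(h)
    return hs

# Toy sizes (assumed): hidden size 2, input size 1, so W is 2 x 3.
W = [[0.5, -0.3, 0.8],
     [0.1, 0.4, -0.6]]
b = [0.0, 0.1]
xs = [[1.0], [0.5], [-1.0]]

hs = rnn_forward(W, b, [0.0, 0.0], xs)
```

Unrolling the loop in `rnn_forward` by hand gives a three-layer feedforward network in which every layer uses the same parameters.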

An RNN unrolled over time is a feedforward network whose weights are shared across timesteps.

RNNs are Turing complete, so in theory they can represent any computation (given sufficient memory)

… the hard problem is finding a good representation

Memory footprint and sequential training nature

RNNs have constant memory during inference: earlier hidden states are not needed once the next one has been computed, so each context token is processed once, one at a time, and only the current hidden state is kept (memory: $O(1)$ in the sequence length $T$).
However, this means that the model cannot explicitly look up any single previous token from memory, which is a major limitation of RNNs compared to Transformers. Sometimes you can't compress all the information into a single hidden state like that. The recurrence also leads to vanishing gradients (due to backprop through time). ¹
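The constant-memory property can be sketched as a streaming loop. This is an illustration under assumptions: a tanh RNN with toy weights, and hypothetical helper names `rnn_step` and `stream_inference`. The tokens come from a generator, so the full sequence is never held in memory, and the only state carried forward is the current hidden vector.

```python
import math

def rnn_step(W, b, h_prev, x):
    """One recurrence step with an assumed tanh nonlinearity."""
    z = h_prev + x  # concatenate previous hidden state and current input
    return [math.tanh(sum(w * v for w, v in zip(row, z)) + b_i)
            for row, b_i in zip(W, b)]

def stream_inference(W, b, h0, token_stream):
    """Consume tokens one at a time, keeping only the latest hidden state."""
    h = h0
    steps = 0
    for x in token_stream:  # each token is seen once, then discarded
        h = rnn_step(W, b, h, x)
        steps += 1
    return h, steps

W = [[0.5, -0.3, 0.8],
     [0.1, 0.4, -0.6]]
b = [0.0, 0.1]

# Generators yield one toy "token" (a 1-dimensional input) at a time.
short_stream = ([math.sin(t)] for t in range(10))
long_stream = ([math.sin(t)] for t in range(10_000))

h_short, n_short = stream_inference(W, b, [0.0, 0.0], short_stream)
h_long, n_long = stream_inference(W, b, [0.0, 0.0], long_stream)
```

The state after 10 tokens and after 10,000 tokens has the same size, which is also why no individual earlier token can be retrieved from it afterwards.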

Because they step through time sequentially, RNNs can't be trained in parallel across timesteps.
In contrast, a Transformer with a causal attention mask can train on all $T$ positions of a sequence in parallel: each position is its own example with its own context, whereas in the RNN every step depends on the previous ones.
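A small sketch of why a causal mask turns one sequence of length $T$ into $T$ independent training examples. The token list, the uniform attention scores, and the helper names `causal_mask` and `masked_softmax` are assumptions for illustration, not a real Transformer implementation.

```python
import math

def causal_mask(T):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(T)] for t in range(T)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions only; masked positions get weight 0."""
    m = max(sc for sc, ok in zip(scores, mask_row) if ok)
    exps = [math.exp(sc - m) if ok else 0.0 for sc, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["the", "cat", "sat", "down"]
T = len(tokens)
mask = causal_mask(T)

# Row t is the context visible when predicting the token after position t.
# No row needs the result of another row, so all T rows can be computed
# in one batched pass.
contexts = [[tok for tok, ok in zip(tokens, mask[t]) if ok] for t in range(T)]

# Uniform toy attention scores (assumed), one row of scores per position.
scores = [[0.0] * T for _ in range(T)]
weights = [masked_softmax(scores[t], mask[t]) for t in range(T)]
```

In the RNN, by contrast, the hidden state at position `t` cannot be computed before the one at position `t - 1`.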

vanishing gradients
repeated multiplication by the recurrent weights during backprop through time
gradients from earlier timesteps vanish, so the network doesn't keep long-range information in memory
→ LSTM
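The repeated multiplication can be made concrete with a scalar RNN, $h_t = \tanh(w \cdot h_{t-1} + x_t)$. By the chain rule, $\partial h_t / \partial h_0$ is a product of one factor $w \cdot (1 - h_k^2)$ per timestep. The sketch below assumes toy values ($w = 0.9$, constant input $0.5$) and a hypothetical helper `gradient_through_time`.

```python
import math

def gradient_through_time(w, xs, h0=0.0):
    """Return |d h_t / d h_0| for t = 1..T in h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    magnitudes = []
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: one extra factor per timestep
        magnitudes.append(abs(grad))
    return magnitudes

xs = [0.5] * 50  # 50 timesteps of a constant toy input
mags = gradient_through_time(w=0.9, xs=xs)
```

Every factor here has magnitude below 1, so the gradient with respect to the initial state shrinks geometrically with the number of timesteps.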

Bi-directional RNN (BRNN) Andrew Ng Lesson

(Allows an RNN to process an input sequence not only left to right, but also to use information about the end of the sequence when computing the output for the beginning of the sequence)
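A minimal sketch of the idea, assuming two independent tanh RNNs (one reading left to right, one right to left) whose hidden states are concatenated per position. All names and weights are illustrative; the lesson's exact formulation may differ.

```python
import math

def rnn_step(W, b, h_prev, x):
    """One recurrence step with an assumed tanh nonlinearity."""
    z = h_prev + x  # concatenate previous hidden state and current input
    return [math.tanh(sum(w * v for w, v in zip(row, z)) + b_i)
            for row, b_i in zip(W, b)]

def run_rnn(W, b, xs, hidden_size):
    """Run an RNN over xs from a zero state; return the state at every step."""
    h = [0.0] * hidden_size
    hs = []
    for x in xs:
        h = rnn_step(W, b, h, x)
        hs.append(h)
    return hs

def bidirectional_rnn(W_f, b_f, W_b, b_b, xs, hidden_size):
    """Concatenate forward and backward hidden states at each position."""
    forward = run_rnn(W_f, b_f, xs, hidden_size)
    # Read the sequence right to left, then flip back to align positions.
    backward = run_rnn(W_b, b_b, xs[::-1], hidden_size)[::-1]
    return [f + bk for f, bk in zip(forward, backward)]

W_f = [[0.5, -0.3, 0.8], [0.1, 0.4, -0.6]]
W_b = [[-0.2, 0.7, 0.3], [0.6, 0.1, 0.9]]
b_f = [0.0, 0.0]
b_b = [0.0, 0.0]

xs = [[1.0], [0.0], [0.0]]
xs_changed_end = [[1.0], [0.0], [5.0]]  # only the last input differs

states = bidirectional_rnn(W_f, b_f, W_b, b_b, xs, 2)
states_changed = bidirectional_rnn(W_f, b_f, W_b, b_b, xs_changed_end, 2)
```

Changing only the last input leaves the forward half of the first position's state untouched but changes its backward half, which is how the beginning of the sequence gets to know about the end.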

References

¹ Mentioned in YK RMKV PaperReview
