year: 2018
paper: universal-transformers
website:
code:
connections: adaptive computation time, RNN, transformer, weight-sharing
Decouples parameters from time: Single layer recurrently refines, depth is a dynamic parameter.
Doesn’t need to be a single layer, could also be a deep network (→ looped transformer).
Summary
The core argument
Vanilla Transformer has fixed depth — its number of sequential compute steps doesn’t scale with input length. Two consequences:
- It’s not Turing-complete under finite precision. There’s a fixed budget of sequential operations regardless of how complex the input is.
- It lacks RNN-style inductive bias toward iterative/recursive computation. Empirically: it fails to generalize to longer sequences on algorithmic tasks (copy, reverse, addition), and overfits on bAbI even with extensive tuning.
RNNs have the right inductive bias but no parallelism. NTM/Neural GPU have universality and the right bias but don’t compete on real NLP. The argument is: you can keep Transformer’s parallelism and global receptive field while injecting recurrent inductive bias if you recur over depth instead of over positions.
The mechanism
At each step, refine every position’s representation in parallel using (i) self-attention across positions, (ii) a shared transition function. Weights tied across steps — so it’s literally a per-position RNN whose “time axis” is processing depth, not sequence index. Add a 2D position+time embedding so the model knows where it is in both axes.
Setting T=1 recovers Transformer. Setting T as a function of input length, with parameters tied so you can run more steps than you trained on, recovers Turing-completeness (Appendix B). They also show a direct reduction: zero out self-attention, make the transition a convolution, set T = input length → you get Neural GPU. So UT is positioned as the family that unifies these.
Why this is more expressive than RNNs-over-time
A time-recurrent RNN only carries a fixed-size hidden state into the next step — that’s automaton-like memory. UT’s recurrent step re-attends to the entire previous layer, so it has full memory access inside the recurrence. This is the subtle but key argument for why depth-recurrence beats position-recurrence in expressivity.
Dynamic per-position halting (ACT)
Some positions need more refinement than others (ambiguous words, harder facts). They add per-symbol ACT halting; halted positions just get copied forward. Two findings worth flagging:
- bAbI ponder time correlates with required supporting facts (1 fact → ~2.3 steps, 3 facts → ~3.8). The model genuinely allocates compute to difficulty.
- On LAMBADA, dynamic halting beats a fixed-T model even when fixed T equals or exceeds the average dynamic T. They interpret this as a regularization effect — forcing easy positions to halt early prevents over-processing.
Empirical case
- Algorithmic (copy/reverse/add, LTE): wide margins over Transformer/LSTM; only Neural GPU beats them, and only with curriculum training.
- bAbI: SOTA in all train/eval regimes (1k and 10k, single and joint).
- Subject-verb agreement: gap over Transformer grows monotonically with number of attractors — deeper-hierarchy parsing benefits more from depth-recurrence.
- LAMBADA: SOTA LM perplexity and accuracy (vanilla Transformer collapses here — 7321 perplexity vs 142).
- WMT14 En-De: +0.9 BLEU over Transformer-base at matched params.
What’s actually new (the steelman)
It’s not “Transformer with more layers.” Three things compose:
- Weight tying across depth turns layer count into a runtime hyperparameter, which is what enables universality and length-generalization arguments.
- Per-position halting turns depth into data-dependent compute.
- The combination claims that recurrent inductive bias is the missing ingredient — not capacity — and the experiments where vanilla Transformer simply overfits (bAbI, algorithmic tasks) are the strongest evidence for this.
The weakest part of the case: on MT, halting marginally hurts; the win is small (~1 BLEU), and weight-tying-across-depth alone could plausibly account for much of the gain (i.e., it’s a regularizer at matched params). The paper doesn’t fully isolate “recurrent inductive bias” from “fewer effective parameters.” But on the algorithmic and reasoning side, the case is clean.