year: 2018
paper: universal-transformers
website:
code:
connections: adaptive computation time, RNN, transformer, weight-sharing


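Decouples parameter count from depth: a single layer with shared weights is applied recurrently to refine the representation, so depth (number of refinement steps) becomes a dynamic quantity rather than a fixed architectural choice, e.g. set per input via adaptive computation time.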
The recurrent block doesn’t need to be a single layer; it could also be a deeper stack of layers that is applied repeatedly (→ looped transformer).
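
Minimal sketch of the weight-sharing idea, assuming a toy residual `tanh` layer as a stand-in for the real transformer block (no attention, no per-step position/timestep embeddings, no adaptive halting). All names (`make_layers`, `apply_layer`, `looped_forward`, `count_params`) are hypothetical and only for illustration. The point it shows: the number of refinement steps is a runtime argument, and the parameter count does not depend on it.

```python
import math
import random


def make_layers(dim, num_layers=1, seed=0):
    """Create the shared block: a list of dim x dim weight matrices.

    num_layers=1 mimics the single-layer case; num_layers>1 mimics
    looping over a deeper block.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [
        [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
        for _ in range(num_layers)
    ]


def apply_layer(weights, state):
    """One toy layer: residual update, state + tanh(W @ state)."""
    update = [
        math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights
    ]
    return [s + u for s, u in zip(state, update)]


def looped_forward(layers, state, num_steps):
    """Apply the same block num_steps times; the weights are reused each step."""
    for _ in range(num_steps):
        for weights in layers:
            state = apply_layer(weights, state)
    return state


def count_params(layers):
    """Total number of weights in the shared block (independent of num_steps)."""
    return sum(len(row) for weights in layers for row in weights)


layers = make_layers(dim=4, num_layers=1)
x = [1.0, 0.0, -1.0, 0.5]
shallow = looped_forward(layers, x, num_steps=2)  # 2 refinement steps
deep = looped_forward(layers, x, num_steps=8)     # 8 steps, same parameters
```

Passing `num_layers=3` to `make_layers` gives the looped variant: the block gets more parameters, but the count still stays fixed however many times it is applied.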