looped transformer

Decoupling compute from data and model size.

Demonstrated at scale: Scaling Latent Reasoning via Looped Language Models
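
A minimal sketch of the core idea, assuming a looped transformer means one weight-shared block applied repeatedly: the parameter count is fixed by the block, while compute grows with the number of loops. Everything here (`init_block`, `block`, `looped_forward`, `count_params`, the single-head attention without layer norm or MLP) is a hypothetical simplification for illustration, not the architecture of any specific paper.

```python
import math
import random


def matmul(a, b):
    # Plain list-of-lists matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def init_block(d, seed=0):
    # One set of weights, reused on every loop iteration (assumed setup).
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(d)]

    return {"wq": mat(), "wk": mat(), "wv": mat(), "wo": mat()}


def block(params, x):
    # Single-head self-attention with a residual connection.
    d = len(x[0])
    q = matmul(x, params["wq"])
    k = matmul(x, params["wk"])
    v = matmul(x, params["wv"])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    out = matmul(matmul(attn, v), params["wo"])
    return [[xi + oi for xi, oi in zip(x_row, o_row)] for x_row, o_row in zip(x, out)]


def looped_forward(params, x, n_loops):
    # Compute scales with n_loops; the parameters do not change.
    for _ in range(n_loops):
        x = block(params, x)
    return x


def count_params(params):
    return sum(len(m) * len(m[0]) for m in params.values())


if __name__ == "__main__":
    params = init_block(d=4)
    tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
    for n in (1, 4, 16):
        y = looped_forward(params, tokens, n)
        print(n, "loops,", count_params(params), "parameters, first row:", y[0])
```

In this sketch `n_loops` is a pure inference-time knob: raising it buys more sequential computation without adding parameters or training data.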

Todo

https://arxiv.org/pdf/2301.13196
The appendix also has some interesting material, e.g. a link to a paper showing how deep a ReLU network needs to be to approximate a polynomial to a given precision.
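
The linked paper is not identified in this note, so the following is only an illustration of that kind of depth-versus-precision result, not necessarily the one the appendix cites. It sketches the well-known sawtooth construction (due to Yarotsky) for approximating x^2 on [0, 1]: a tent map built from two ReLU units is composed m times, and the resulting piecewise-linear function interpolates x^2 on a grid of spacing 2^-m, giving a maximum error of at most 2^(-2m-2). The function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def tent(x):
    # Tent map on [0, 1] built from two ReLU units:
    # 2x on [0, 0.5] and 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def approx_square(x, m):
    # x^2 ~= x - sum_{s=1..m} tent^(s)(x) / 4^s, for x in [0, 1].
    # Each extra term corresponds to one more layer of depth.
    total = x
    g = x
    for s in range(1, m + 1):
        g = tent(g)  # the same tent layer, applied again
        total -= g / 4.0 ** s
    return total


def max_error(m, n_points=1001):
    # Largest absolute error against x^2 on a uniform grid over [0, 1].
    xs = [i / (n_points - 1) for i in range(n_points)]
    return max(abs(approx_square(x, m) - x * x) for x in xs)


if __name__ == "__main__":
    for m in range(1, 7):
        print(m, max_error(m), 2.0 ** (-2 * m - 2))
```

The error shrinks by a factor of 4 per added layer, so reaching precision eps needs depth on the order of log(1/eps). The construction also reuses the same tent layer at every step, which is the same weight-sharing pattern as a looped block.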
