Decoupling compute from data and model size.
Demonstrated at scale: "Scaling Latent Reasoning via Looped Language Models"
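A minimal sketch of what "decoupling" means here, assuming a toy weight-tied residual MLP block in NumPy (not the architecture from the paper above; `init_block`, `looped_forward`, and `matmul_flops` are hypothetical names): the same parameters are applied `n_loops` times, so compute grows with the loop count while the parameter count stays fixed.

```python
import numpy as np

def init_block(d_model, d_hidden, seed=0):
    """Create ONE set of weights that every loop iteration reuses."""
    rng = np.random.default_rng(seed)
    return {
        "w_in": rng.normal(0.0, d_model ** -0.5, size=(d_model, d_hidden)),
        "w_out": rng.normal(0.0, d_hidden ** -0.5, size=(d_hidden, d_model)),
    }

def param_count(params):
    """Total number of scalar weights; independent of how often we loop."""
    return sum(w.size for w in params.values())

def looped_forward(params, x, n_loops):
    """Apply the same residual ReLU block n_loops times to x."""
    h = x
    for _ in range(n_loops):
        update = np.maximum(h @ params["w_in"], 0.0) @ params["w_out"]
        h = h + update  # residual connection
    return h

def matmul_flops(params, batch, n_loops):
    """Rough matmul cost: 2 FLOPs per weight per example per loop."""
    return 2 * batch * param_count(params) * n_loops

if __name__ == "__main__":
    params = init_block(d_model=8, d_hidden=16)
    x = np.ones((4, 8))
    for n_loops in (1, 2, 4, 8):
        y = looped_forward(params, x, n_loops)
        print(n_loops, param_count(params), matmul_flops(params, 4, n_loops), y.shape)
```

Doubling `n_loops` doubles the FLOP estimate but leaves `param_count` unchanged, which is the knob a looped model adds on top of data and model size.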
Todo
https://arxiv.org/pdf/2301.13196
The appendix also has some interesting material, e.g. a link to a paper showing how deep a ReLU network needs to be to approximate a polynomial to a given precision.
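To get a feel for the depth-vs-precision trade-off, here is a sketch of one well-known construction, usually attributed to Yarotsky (2017), which may or may not be the result the appendix links to. It approximates x^2 on [0, 1] by composing a ReLU "tent" function with itself; with `depth` compositions the maximum error is 4^-(depth+1), so each extra layer buys two more bits of precision. The function names are my own.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    """Tent map on [0, 1] built from two ReLUs: 2x up to 1/2, then 2(1 - x)."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def approx_square(x, depth):
    """Approximate x**2 on [0, 1] as x - sum_{s=1..depth} tent^(s)(x) / 4**s."""
    g = np.asarray(x, dtype=float)
    out = g.copy()
    for s in range(1, depth + 1):
        g = tent(g)              # one more composition = one more layer
        out = out - g / 4.0 ** s
    return out

def max_error(depth, n_points=100001):
    """Largest absolute error against x**2 on a uniform grid over [0, 1]."""
    x = np.linspace(0.0, 1.0, n_points)
    return float(np.max(np.abs(approx_square(x, depth) - x ** 2)))

if __name__ == "__main__":
    for depth in range(1, 7):
        print(depth, max_error(depth), 4.0 ** -(depth + 1))
```

Since products can be written in terms of squares, xy = ((x + y)^2 - x^2 - y^2) / 2, the same idea extends to general polynomials, at the cost of more layers.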