Refactor some of the backlinks to this note that would fit better under one of the notes below.
Ilya Sutskever’s scaling fallacy
… in my opinion, the best way to think about the question of architecture is not in terms of a binary “is it enough” but “how much effort, what will be the cost of using this particular architecture?” Like at this point, I don’t think anyone doubts that the Transformer architecture can do amazing things, but maybe something else, maybe some modification, could have some compute efficiency benefits. So, it’s better to think about it in terms of compute efficiency rather than in terms of can it get there at all. I think at this point the answer is obviously yes.
…
These are fairly well-known ideas in AI, that the cortex of humans and animals is extremely uniform, and that further supports the idea that you just need one big uniform architecture, that’s all you need. [source]
No, we do not need a big uniform architecture, we need a self-organizing architecture!
Even with unlimited resources, scaling is limited.
You cannot naively scale up a team of smart people: synchronising across people is bounded by communication, and there are always round-trip costs that eat into whatever resources you add.
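As a toy illustration of that round-trip cost (my own sketch, not from any of the linked sources): suppose each person contributes a fixed amount of work, but every pair of people pays a fixed communication overhead. Output grows at first, then stalls and collapses as the quadratic communication term overtakes the linear work term.

```python
# Toy model: linear work per person vs. quadratic pairwise communication cost.
# The specific numbers are arbitrary; only the shape of the curve matters.

def effective_output(n_people, work_per_person=1.0, roundtrip_cost=0.05):
    pairs = n_people * (n_people - 1) / 2          # communication channels
    return n_people * work_per_person - pairs * roundtrip_cost

for n in (1, 5, 10, 20, 40):
    print(n, round(effective_output(n), 2))
# 1 -> 1.0, 5 -> 4.5, 10 -> 7.75, 20 -> 10.5, 40 -> 1.0: returns diminish, then collapse
```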
scaling neural networks
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
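For reference, a minimal sketch of the compound scaling rule that paper proposes: depth, width, and input resolution are scaled together by one coefficient φ, with per-dimension constants α, β, γ found by grid search under the constraint α·β²·γ² ≈ 2. The constants below are the ones reported for EfficientNet-B0; this is a paraphrase of the rule, not a reimplementation of the models.

```python
# Compound scaling as described in the EfficientNet paper (Tan & Le, 2019).
# depth ~ alpha**phi, width ~ beta**phi, resolution ~ gamma**phi.

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    depth_mult = alpha ** phi        # more layers
    width_mult = beta ** phi         # more channels per layer
    resolution_mult = gamma ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

# Each +1 in phi multiplies FLOPs by roughly alpha * beta**2 * gamma**2 ≈ 2.
print(compound_scale(phi=1))   # ≈ (1.2, 1.1, 1.15)
print(compound_scale(phi=3))   # a deeper, wider, higher-resolution variant
```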