Power-law scaling laws probably come from a logarithmic distribution of correlations (more precisely, of correlation lengths).

TL;DR of some papers: Adam and xLSTM improve scaling laws by some constant, but only data (distribution, dataset size) improves the power-law exponent. Data pruning can even do better than power-law scaling:
https://x.com/_katieeverett/status/1925665335727808651
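
To make the constant-vs-exponent distinction concrete, here is a minimal sketch. It assumes the loss follows L(N) = a * N^(-alpha) and that "improves by some constant" means a smaller prefactor a, which is my reading, not something the thread states in this form. The loss curves are made-up numbers and `fit_power_law` is a hypothetical helper, not from any of the papers. In log-log space a change in a shifts the line up or down, while a change in alpha changes its slope.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) in log-log space.

    Returns (a, alpha). Hypothetical helper for illustration only.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    # Slope and intercept of the ordinary least-squares line through (lx, ly).
    slope = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope


# Synthetic (made-up) loss curves over model/dataset size N.
sizes = [10 ** k for k in range(3, 9)]
curves = {
    "baseline": [5.0 * n ** -0.10 for n in sizes],
    # Assumed effect of a better optimizer/architecture: smaller a, same alpha.
    "better_constant": [4.0 * n ** -0.10 for n in sizes],
    # Assumed effect of better data: same a, larger alpha (steeper slope).
    "better_exponent": [5.0 * n ** -0.15 for n in sizes],
}

fits = {name: fit_power_law(sizes, losses) for name, losses in curves.items()}

for name, (a, alpha) in fits.items():
    print(f"{name}: a={a:.3f}, alpha={alpha:.3f}")
```

With a fixed alpha, the gap between two curves stays a constant ratio at every N. With a larger alpha, the gap keeps widening as N grows, which is why the exponent matters more at scale.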

power law