The phenomenon where generalization in a neural network emerges long after the model has overfit the training data.
Grokking is an instance of the minimum description length principle.
If you have a problem, you can just memorize a point-wise input-to-output mapping.
This has zero generalization.
But from there, you can keep pruning that mapping, making it simpler, a.k.a. more compressed.
The program that generalizes best (while still performing well on the training set) is the shortest. (or is it…? See How to build conscious machines)
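For reference, the standard two-part MDL objective (my gloss, not spelled out in the note): pick the hypothesis $H$ that minimizes model bits plus data-given-model bits. Pruning the memorized mapping shrinks $L(H)$ while fitting the training set keeps $L(D \mid H)$ small.

$$
H^{*} = \arg\min_{H} \big[\, L(H) + L(D \mid H) \,\big]
$$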
→ **Generalization is memorization + regularization** ←
(this type of generalization is still limited to in-distribution data, however)
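A minimal sketch of the classic grokking setup (illustrative hyperparameters, not from this note): modular addition $(a + b) \bmod p$, a small network, a small training fraction, and strong weight decay as the regularizer. Train accuracy saturates early; given enough steps, validation accuracy climbs much later.

```python
import torch
import torch.nn as nn

p = 97
torch.manual_seed(0)

# All (a, b) pairs and their labels (a + b) mod p
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Small training fraction: memorization is easy, generalization is hard
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, val_idx = perm[:n_train], perm[n_train:]

class Net(nn.Module):
    def __init__(self, p, d=128):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))
    def forward(self, x):
        return self.mlp(self.emb(x).flatten(1))

model = Net(p)
# Weight decay is the "regularization" half: it keeps compressing the memorized fit
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(1, 50001):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            train_acc = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean().item()
            val_acc = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean().item()
        print(f"step {step}: train acc {train_acc:.2f}, val acc {val_acc:.2f}")
```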
Papers
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
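A minimal sketch of what I take the Grokfast-EMA idea to be (my paraphrase, not the authors' reference code; `alpha` and `lamb` values are illustrative): low-pass filter each parameter's gradient with an exponential moving average and add the amplified slow component back before the optimizer step.

```python
import torch

def grokfast_ema_step(model, ema_grads, alpha=0.98, lamb=2.0):
    """Call after loss.backward() and before optimizer.step().
    ema_grads: dict mapping parameter name -> EMA tensor (pass {} on the first call)."""
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        if name not in ema_grads:
            ema_grads[name] = torch.zeros_like(param.grad)
        # Update the slow (low-frequency) gradient component
        ema_grads[name].mul_(alpha).add_(param.grad, alpha=1 - alpha)
        # Amplify the slow component in the gradient actually used by the optimizer
        param.grad.add_(ema_grads[name], alpha=lamb)
    return ema_grads
```

Dropped between `loss.backward()` and `opt.step()` in a loop like the one above, this should speed up the delayed generalization phase if the paper's claim holds.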
Todo