The phenomenon where generalization in a neural network emerges long after the model has overfit the training data.
Grokking is an instance of the minimum description length principle.
If you have a problem, you can just memorize a point-wise input-to-output mapping.
This has zero generalization.
But from there, you can keep pruning that mapping, making it simpler, a.k.a. more compressed.
The program that generalizes best (while still performing well on the training set) is the shortest. (or is it…? See How to build conscious machines)
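For reference, the standard two-part MDL objective (my gloss, not spelled out in the note): pick the hypothesis $H$ that minimizes model bits plus data-given-model bits. Pruning the memorized mapping shrinks $L(H)$ while fitting the training set keeps $L(D \mid H)$ small.

$$
H^{*} = \arg\min_{H} \big[\, L(H) + L(D \mid H) \,\big]
$$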
→ **Generalization is memorization + regularization** ←
(this type of generalization is still limited to in-distribution data, however)
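A minimal sketch of the classic grokking setup (illustrative hyperparameters, not from this note): modular addition $(a + b) \bmod p$, a small network, a small training fraction, and strong weight decay as the regularizer. Train accuracy saturates early; given enough steps, validation accuracy climbs much later.

```python
import torch
import torch.nn as nn

p = 97
torch.manual_seed(0)

# All (a, b) pairs and their labels (a + b) mod p
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Small training fraction: memorization is easy, generalization is hard
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, val_idx = perm[:n_train], perm[n_train:]

class Net(nn.Module):
    def __init__(self, p, d=128):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))
    def forward(self, x):
        return self.mlp(self.emb(x).flatten(1))

model = Net(p)
# Weight decay is the "regularization" half: it keeps compressing the memorized fit
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(1, 50001):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            train_acc = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean().item()
            val_acc = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean().item()
        print(f"step {step}: train acc {train_acc:.2f}, val acc {val_acc:.2f}")
```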
Papers
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
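A minimal sketch of what I take the Grokfast-EMA idea to be (my paraphrase, not the authors' reference code; `alpha` and `lamb` values are illustrative): low-pass filter each parameter's gradient with an exponential moving average and add the amplified slow component back before the optimizer step.

```python
import torch

def grokfast_ema_step(model, ema_grads, alpha=0.98, lamb=2.0):
    """Call after loss.backward() and before optimizer.step().
    ema_grads: dict mapping parameter name -> EMA tensor (pass {} on the first call)."""
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        if name not in ema_grads:
            ema_grads[name] = torch.zeros_like(param.grad)
        # Update the slow (low-frequency) gradient component
        ema_grads[name].mul_(alpha).add_(param.grad, alpha=1 - alpha)
        # Amplify the slow component in the gradient actually used by the optimizer
        param.grad.add_(ema_grads[name], alpha=lamb)
    return ema_grads
```

Dropped between `loss.backward()` and `opt.step()` in a loop like the one above, this should speed up the delayed generalization phase if the paper's claim holds.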
Todo