The shortest program that performs well on a task generalizes the best.
“The shortest description of the data is the best model.”
compression generalization
Importantly, this generalization is “frozen” generalization. (See ^8c870a / ^a43c70)
Occam’s Razor as the inspiration: Both MDL and Solomonoff Induction can be seen as mathematical formalizations of Occam’s Razor. They replace the vague notion of “simplicity” or “fewest assumptions” with the precise measure of “description length” or “program length.”
Solomonoff as the theoretical ideal: Solomonoff Induction provides the most general and theoretically optimal formalization. It essentially says the “simplest” explanation (shortest program) for the data is the most probable.
MDL as a practical application: MDL takes the core idea from Solomonoff/Kolmogorov complexity and makes it applicable to real-world statistical modeling and machine learning problems by using specific (computable) ways to define and measure the description length of models and data.
Kolmogorov Complexity: The precise, theoretical, but uncomputable measure of “algorithmic simplicity” for an individual object.
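Although Kolmogorov complexity itself is uncomputable, any lossless compressor gives a computable upper bound on description length. A minimal sketch using Python’s zlib (the choice of compressor is illustrative, not canonical):

```python
import os
import zlib

def description_length(data: bytes) -> int:
    """Computable upper bound (in bytes) on the description length
    of `data`, using zlib as a stand-in for an ideal compressor."""
    return len(zlib.compress(data, level=9))

# A highly regular string has a short description...
regular = b"ab" * 500          # 1000 bytes of pure repetition
# ...while random data is essentially incompressible.
random_ish = os.urandom(1000)  # 1000 bytes of noise

print(description_length(regular))     # far below 1000
print(description_length(random_ish))  # near (or above) 1000
```

The gap between the two outputs is the intuition behind “compression = finding structure”: regular data admits a short program, noise does not.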
Occam’s Razor:
Nature: A philosophical principle or heuristic.
Statement (paraphrased): “Among competing hypotheses, the one with the fewest assumptions should be selected.” Or, “Entities should not be multiplied beyond necessity.”
Goal: To guide towards simpler explanations, which are often considered more likely to be true, easier to test, and more generalizable.
Formality: Informal, qualitative. It doesn’t precisely define “simplicity” or “assumption.”

Minimum Description Length (MDL) Principle:
Nature: A formal, information-theoretic, and statistical principle for inductive inference and model selection.
Statement (paraphrased): “The best hypothesis (model) for a given set of data is the one that leads to the shortest overall description of the data and the model itself.”
Goal: To find the model that compresses the data best. This involves a trade-off: a complex model may fit the data closely (short description of the data given the model), but the model itself requires a long description; a simpler model may leave a longer description of the data given the model, but is itself short to describe. MDL seeks the minimum of Length(Model) + Length(Data | Model).
Formality: Formal, quantitative. Description length is measured in bits (using concepts related to Kolmogorov Complexity, though often practical approximations are used).
Computability: While true Kolmogorov complexity is uncomputable, MDL uses computable approximations and specific coding schemes, making it a practical tool in statistics and machine learning.

Solomonoff Induction (Algorithmic Probability):
Nature: A formal, theoretical framework for inductive inference and prediction, often considered the “gold standard” for optimal prediction.
Statement (paraphrased): Given a sequence of observations, the probability of a particular continuation is the sum of probabilities of all programs (for a universal Turing machine) that generate the observed sequence and then that continuation, where the probability of a program is 2^(-length of program). Essentially, shorter programs (simpler explanations) get higher prior probability.
Goal: To provide a universal and optimal method for predicting the next element in a sequence based on prior observations, by weighting all possible explanations (programs) by their simplicity (length).
Formality: Highly formal, based on algorithmic information theory and Turing machines. It provides a Bayesian framework with a universal prior derived from program length (Kolmogorov complexity).
Computability: Fundamentally uncomputable due to its reliance on Kolmogorov complexity and the halting problem (we can’t know if an arbitrary program will halt, let alone its shortest description). It’s a theoretical ideal.
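The MDL trade-off Length(Model) + Length(Data | Model) described above can be made concrete with a toy model-selection sketch. The coding scheme here — a fixed bit cost per polynomial coefficient plus a Gaussian-style code for the residuals — is an illustrative assumption, not a canonical MDL code:

```python
import math
import numpy as np

BITS_PER_PARAM = 32.0  # assumed fixed cost to encode one coefficient

def mdl_score(x, y, degree):
    """Two-part MDL score in bits: L(model) + L(data | model)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n = len(y)
    # L(model): bits to transmit the polynomial coefficients.
    model_bits = BITS_PER_PARAM * (degree + 1)
    # L(data | model): Gaussian code length for the residuals,
    # ~ (n/2) * log2 of the mean squared error (up to constants).
    mse = max(float(np.mean(residuals ** 2)), 1e-12)
    data_bits = 0.5 * n * math.log2(mse * 2 * math.pi * math.e)
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # truly linear data

scores = {d: mdl_score(x, y, d) for d in range(6)}
best = min(scores, key=scores.get)
print(best)  # the linear model (degree 1) should win
```

Degree 0 pays heavily in Length(Data | Model); degrees 2–5 shave only a sliver off the residual code while each extra coefficient costs 32 bits of Length(Model) — so the two-part score bottoms out at the true model complexity.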