From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

Marc Finzi $^{1, *}$ , Shikai Qiu $^{2, *}$ , Yiding Jiang $^{1, *}$ , Pavel Izmailov $^{2}$ , J. Zico Kolter $^{1}$ , Andrew Gordon Wilson $^{2}$
$^{1}$ Carnegie Mellon University $^{2}$ New York University

Abstract

Can we learn more from data than existed in the generating process itself? Can new and useful information be constructed from merely applying deterministic transformations to existing data? Can the learnable content in data be evaluated without considering a downstream task? On these questions, Shannon information and Kolmogorov complexity come up nearly empty-handed, in part because they assume observers with unlimited computational capacity and fail to target the useful information content. In this work, we identify and exemplify three seeming paradoxes in information theory: (1) information cannot be increased by deterministic transformations; (2) information is independent of the order of data; (3) likelihood modeling is merely distribution matching. To shed light on the tension between these results and modern practice, and to quantify the value of data, we introduce epiplexity, a formalization of information capturing what computationally bounded observers can learn from data. Epiplexity captures the structural content in data while excluding time-bounded entropy, the random unpredictable content exemplified by pseudorandom number generators and chaotic dynamical systems. With these concepts, we demonstrate how information can be created with computation, how it depends on the ordering of the data, and how likelihood modeling can produce more complex programs than present in the data generating process itself. We also present practical procedures to estimate epiplexity which we show capture differences across data sources, track with downstream performance, and highlight dataset interventions that improve out-of-distribution generalization. In contrast to principles of model selection, epiplexity provides a theoretical foundation for data selection, guiding how to select, generate, or transform data for learning systems.

1 Introduction

As AI research progresses towards more general-purpose intelligent systems, cracks are beginning to show in mechanisms for grounding mathematical intuitions. Much of learning theory is built around controlling generalization error with respect to a given distribution, treating the training distribution as fixed and focusing optimization effort on the choice of model. Yet modern systems are expected to transfer across tasks, domains, and objectives that were not specified at training time, often after large-scale pretraining on diverse and heterogeneous data. In this regime, success or failure frequently hinges less on architectural choices than on what data the model was exposed to in the first place. Pursuing broad generalization to out-of-distribution tasks forces a shift in perspective: instead of treating data as given and optimizing for in-distribution performance, we need to choose and curate data to facilitate generalization to unseen tasks. This shift makes the value of data itself a central question—how much usable, transferable information can a model acquire from training? In other words, instead of model selection, how do we perform data selection? On this question, existing theory offers little guidance and often naively contradicts empirical observations.

*Equal contribution.

A conceptual diagram illustrating random vs structural information, showing how computation can create apparent randomness and emergent structure, and how structural information relates to out-of-distribution generalization. Figure 1: Illustration of random vs structural information. (Left) Illustration of random vs structural information of different data for computationally-bounded observers, which we formalize with time-bounded entropy and epiplexity (Section 3) and can be estimated from loss curves of neural networks trained on that data (Section 4). (Top Right) Unlike other forms of information, time-bounded entropy and epiplexity can be increased through computational processes, such as simulating dynamical systems (cellular automation, Lorenz equations) and interventions like changing the data ordering, which can produce apparent randomness but also learnable, emergent structures like gliders and the Lorenz attractor invariant measure (Section 5). (Bottom Right) Whereas time-bounded entropy captures the in-distribution randomness and unpredictability, epiplexity measures the amount of structural information the model extracts from the data to its weights, which can be useful for OOD tasks such as by reusing learned circuits shared between the in-distribution and OOD tasks.

Consider synthetic data, crucial for further developing model capabilities (Abdin et al., 2024; Maini et al., 2024)Abdin and colleagues, twenty twenty-four, and Maini and colleagues, twenty twenty-four when existing natural data are exhausted. Existing concepts in information theory like the data processing inequality appear to suggest that synthetic data adds no additional value. Questions about what information is transferred to a given model seem naturally within the purview of information theory, yet, quantifying this information with existing tools proves to be elusive. Even basic questions, such as the source of the information in the weights of an AlphaZero game-playing model (Silver et al., 2018)Silver and colleagues, twenty eighteen, are surprisingly tricky to answer. AlphaZero takes in zero human data, learning merely from the deterministic rules of the game and the AlphaZero RL algorithm, both of which are simple to describe. Yet the resulting models achieve superhuman performance and are large in size. To assert that AlphaZero has learned little to no information in this process is clearly missing the mark, and yet both Shannon and algorithmic information theory appear to say so.

In this paper, we argue that the amount of structural information a computationally bounded observer can extract from a dataset is a fundamental concept that underlies many observed empirical phenomena. As we will show, existing notions from Shannon and algorithmic information theory are inadequate when forced to quantify this type of information. These frameworks often lend intuitive or mathematical support to beliefs that, in fact, obscure important aspects of empirical phenomena. To highlight the limitations of classical frameworks and motivate the role of computational constraints in quantifying information, we identify and demonstrate three apparent paradoxes: statements which can be justified mathematically by Shannon and algorithmic information theory, and yet are in tension with intuitions and empirical phenomena.

Paradox 1: Information cannot be increased by deterministic processes. For both Shannon entropy and Kolmogorov complexity, deterministic transformations cannot meaningfully increase the information content of an object. And yet, we use pseudorandom number generators to produce randomness, synthetic data improves model capabilities, mathematicians can derive new knowledge by reasoning from axioms without external information, dynamical systems produce emergent phenomena, and self-play loops like AlphaZero learn sophisticated strategies from games (Silver et al., 2018)as shown by Silver and colleagues in 2018.

Paradox 2: Information is independent of factorization order. A property of both Shannon entropy and Kolmogorov complexity is that total information content is invariant to factorization: the information from observing first $X$ X and then $Y$ Y is the same as observing $Y$ Y followed by $X$ X. On the other hand, LLMs learn better on English text ordered left-to-right than reverse ordered text, picking out an “arrow of time” (Papadopoulos et al., 2024; Bengio et al., 2019), and we have cryptography built on the existence of functions that are computationally hard to predict in one direction and easy in another.

Paradox 3: Likelihood modeling is merely distribution matching. Maximizing the likelihood is often equated with matching the training data generating process: the true data-generating process is a perfect model of itself, and no model can achieve a higher expected likelihood. As a consequence, it is often assumed that a model trained on a dataset cannot extract more structure or learn useful features that were not used in generating the data. However, we show that a computationally-limited observer can in fact uncover much more structure than is in the data generating process. For example, in Conway’s game of life the data are generated via simple programmatic rules that operate on two-dimensional arrays of bits. Applying these simple rules sequentially, we see emergent structures, such as different species of objects that move and interact in a predictable way. While an unbounded observer can simply simulate the evolution of the environment exactly, a computationally-bounded observer would make use of the emergent structures and learn the different types of objects and their behaviors.

The tension between these theoretical statements and empirical phenomena can be resolved by imposing computational constraints on the observer and separating the random content from the structural content. Drawing on ideas from cryptography, algorithmic information theory, and these unexplained empirical phenomena, we define a new information measure, epiplexity (epistemic complexity), which formally defines the amount of structural information that a computationally-bounded observer can extract from the data (Section 3, Definition 8). Briefly, epiplexity is the information in the model that minimizes the description length of data under computational constraints. A simple heuristic measurement is the area under the loss curve above the final loss, while a more rigorous approach uses the cumulative KL divergence between a teacher and student model (Section 4, Figure 2).

Our definitions capture the intuition that an object contains both random, inherently unpredictable information (entropy), and predictable structured information that enables observers to generalize by identifying patterns (epiplexity). In Figure 1 (left) we illustrate this divide. In the top row, we have highly redundant and repetitive code and simple color gradients, which have little information content, be it structural or random. In the middle row, we have the inner workings of an algorithm and pictures of animals, showing complex, long-range interdependencies between the elements from which a model can learn complex features and subcircuits that are helpful even for different tasks. In contrast, on the bottom, we have random data with little structure: configuration files with randomly generated API keys, file paths, hashes, arbitrary boolean flags have negligible learnable content and no long-range dependencies or complex circuits that result from learning on this task. Similarly, uniformly shuffled pixels from the animal pictures have high entropy but are fundamentally unpredictable, and no complex features or circuits arise from training on these data.

An essential property of our formulation is that information is observer dependent: the same object may appear random or structured depending on the computational resources of the observer. For instance, the output of a strong pseudorandom generator appears indistinguishable from true randomness to any polynomial-time observer lacking the secret key (seed), regardless of the algorithm or function class. In other situations, such as chaotic dynamical systems, both apparently random behavior is produced along with structure: the state of the system cannot be predicted precisely over long time-scales, but such observers may still learn meaningful predictive distributions, as shown by the invariant measure in Figure 1 (top right).

Models trained to represent these distributions are computer programs, and substructures within these programs, like circuits for performing specific tasks, or induction heads (Olsson et al., 2022)Olsson and colleagues, twenty twenty-two, can be reused even for seemingly unrelated data. This view motivates selecting high epiplexity data that induces more structural information in the model, since these structures can then be reused for unseen out-of-distribution (OOD) tasks, as illustrated in Figure 1 (bottom right). We emphasize, however, that epiplexity is a measure of information, not a guarantee of OOD generalization to specific tasks. Epiplexity quantifies the amount of structural information a model extracts, while being agnostic to whether these structures are relevant to a specific downstream task.

To build intuition, we explore a range of phenomena and provide experimental evidence for behaviours that are poorly accounted for by existing information-theoretic tools, yet naturally accommodated by epiplexity. We show that information can be created purely through computation, giving insights into synthetic data (subsection 5.1). We examine how certain factorizations of the same data can increase structural information and downstream OOD performance—even as they result in worse training loss (subsection 5.2). We show why likelihood modeling is more than distribution matching, identifying induction and emergence as two settings where the observer can learn more information than was present in the data generating process (subsection 5.3). By measuring epiplexity, we can better understand why pre-training on text data transfers more broadly than image data, and why certain data selection strategies for LLMs are empirically successful (Section 6). Together, our results provide clarity on the motivating questions: the information content of data can be compared independently of a specific task, new information can be created by computation, and models can learn more information than their generating processes contain.

In short, we identify a disparity between existing concepts in information theory and modern practice, embodied by three apparent paradoxes, and introduce epiplexity as a measurement of structural information acquired by a computationally-bounded observer to help resolve them. We formally define epiplexity in Section 3 (Definition 8) and present measurement procedures in Section 4. In Section 5, we show how epiplexity and time-bounded entropy shed light on these paradoxes, including induction and emergent phenomena. Finally, in Section 6, we demonstrate that epiplexity correlates with OOD generalization, helping explain why certain data enable broader generalization than others.

2 Background

In order to define the interesting, structural, and predictive component of information, we must separate it out from random information—that which is fundamentally unpredictable given the computational constraints of the observer. Along the way, we will review algorithmic randomness as developed in algorithmic information theory as well as notions of pseudo-randomness used in cryptography, and how these concepts crucially depend on the observer.

2.1 What Does it Mean for An Object to Be Random?

Random Variables and Shannon Information. Many common intuitions about randomness start from random variables and Shannon information. A random variable defines a map from a given measurable probability space to different outcomes, with probabilities corresponding to the measure of the space that lead to a certain outcome. Shannon information assigns to each outcome $x$ x a self-information (or surprisal) $lo g 1/ P (x)$ log of one over P of x based on the probability $P$ P, and an entropy for the random variable $H (X) = E [lo g 1/ P (X)]$ H of X, defined as the expected value of log one over P of X, which provides a lower bound on the average code length needed to communicate samples to another party (Shannon, 1948). In Shannon’s theory, information comes only from distributions and random variables. Objects which are not random must contain no information. As a result, non-random information is seemly contradictory, and thus we must draw from a broader mathematical perspective to describe such concepts.

In the mid 1900s, mathematicians were interested in formalizing precisely what it means for a given sample to be a random draw from a given distribution, to ground the theory of probability and random variables (Shafer and Vovk, 2006). A central consideration involves a uniformly sampled binary sequence $u_{1 : \infty}$ u one to infinity from which other distributions of interest can be constructed. This sequence can also be interpreted as the binary expression of a number $[0, 1)$ in the interval zero to one. Intuitively, one might think that all sequences should be regarded as equally random, as they are all equally likely according to the probability distribution: $1111111 \dots$ has the same probability mass as $10011101 \dots$ and also the same self-information. However, looking at statistics on these sequences reveals something missing from this perspective; from the law of large numbers, for example, it must be that $lim_{N \to \infty} \frac{1}{N} \sum_{i = 1}^{N} u_{i} = 0.5$ the limit as N approaches infinity of the average value is zero point five, which is clearly not satisfied by the first sequence of 1s.

Martin-Löf Randomness: No algorithm exists to predict the sequence. Initial attempts were made to formalize randomness as sequences which pass all statistical tests for randomness, such as the law of large numbers for selected substrings. However, under such definitions all sequences fail to be random since tests like $u_{1 : \infty} \neq = y_{1 : in f t y}$ u is not equal to y for any particular sequence $y$ y must also be included (Downey and Hirschfeldt, 2019). The solution to these issues was found by defining random sequences not as those that pass all tests of randomness, but those that pass all computable tests of randomness, in a formalization known as Martin-Löf randomness (Martin-Löf, 1966). As it turned out, this definition is equivalent to a number of seemingly distinct definitions, such as the inability for any gambler to exploit properties of the sequence to make a profit, or that all prefixes of the random sequence should be nearly incompressible (Terwijn, 2016). For this last definition, we must invoke Kolmogorov complexity, a notion of compressibility and a key concept in this paper.

Definition 1 (Prefix Kolmogorov complexity (Kolmogorov, 1968; Chaitin, 1975))

Fix a universal prefix-free Turing machine $U$ U. The (prefix) Kolmogorov complexity of a finite binary string $x$ x is $K (x) = min {∣ p ∣ : U (p) = x}$ K of x equals the minimum length of a program p such that U of p outputs x. That is, $K (x)$ K of x is the length of the shortest self-delimiting program (a program which also encodes its length) that outputs $x$ x and halts.

Due to the universality of Turing machines, the Kolmogorov complexity for two Turing machines (or programming languages) $U_{1}$ and $U_{2}$ differ by at most a constant, $∣ K_{U_{1}} (x) - K_{U_{2}} (x) ∣ \leq C$ , where the constant $C$ depends only on $U_{1}, U_{2}$ , but not on $x$ (Li et al., 2008).

Definition 2 (Martin–Löf random sequence (Martin-Löf, 1966))

An infinite sequence $x_{1 : \infty} \in {0, 1}^{N}$ x from one to infinity in the set of infinite binary strings is Martin–Löf random iff there exists a constant $c$ c such that for all $n$ n, $K (x_{1 : n}) \geq n - c$ K of the prefix x one to n is greater than or equal to n minus c.

Using this criterion, all computable randomness tests are condensed into a single incomputable randomness test concerning Kolmogorov complexity.

One can extend Martin-Löf randomness to finite sequences. We say that a sequence $x \in {0, 1}^{n}$ x of length n is $c$ c-random if $K (x) > n - c$ K of x is greater than n minus c. Equivalently, randomness discrepancy is defined as $δ (x) = n - K (x)$ delta of x equals n minus K of x,

which measures how far away $x$ x is from having maximum Kolmogorov complexity. A sequence $x$ x is $c$ -randomc random if $δ (x) < c$ delta of x is less than c. High Kolmogorov complexity, low randomness discrepancy, sequences are overwhelmingly likely when sampled from uniform randomly sampled random variables. From Kraft’s inequality (Kraft, 1949; McMillan, 1956), there are at most $2^{n - c}$ two to the n minus c (prefix-free) programs of length $L \leq n - c$ L less than or equal to n minus c, therefore in the $2^{n}$ two to the n possibilities in uniformly sampling $X \sim U_{n}$ X distributed as U n, the probability that $K (X)$ K of X is size $L$ or smaller is $P (K (X) \leq n - c) = P (δ (X) \geq c) < 2^{- c}$ the probability that K of X is less than or equal to n minus c, which equals the probability that delta of X is greater than or equal to c, is less than two to the minus c. The randomness discrepancy of the sequence can thus be considered as a test statistic for which one would reject the null hypothesis that the given object $X$ X is indeed sampled uniformly at random (Grünwald et al., 2008). In order for a sequence to have low randomness discrepancy there must be no discernible pattern, and thus there is an objective sense in which 1001011100 is more random than 0101010101.

Given the Martin-Löf definition of infinite random sequences, every random sequence is incomputable; in other words, there is no program that can implement the function $N \to {0, 1}$ from the natural numbers to the set zero one which produces the bits of the sequence. One should contrast such random numbers from those like $π /4$ or $e /3$ pi over four or e over three, which though transcendental, are computable, as there exist programs that can compute the bits of their binary expressions. While the computable numbers in $[0, 1)$ the half open interval zero to one form a countable set, algorithmically random numbers in $[0, 1)$ are uncountably large in number. With the incomputability of random sequences in mind we can appreciate the Von Neumann quote

“Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.” (Von Neumann, 1951)

which anticipates the Martin–Löf formalization that came later. But this viewpoint also misses something essential, as evidenced by the success of pseudorandom number generation, derandomization, and cryptography.

Cryptographic Randomness: No polynomial time algorithm exists to predict the sequence. An important practical and theoretical development of random numbers has come from the cryptography community, by once again limiting the computational model of the observer.

Rather than passing all computable tests as with Martin-Löf randomness, cryptographically secure pseudorandom number generators (CSPRNG or PRG) are defined as functions which produce sequences which pass all polynomial time tests of randomness.

Definition 3 (Non-uniform CSPRNG (Blum and Micali, 1982; Goldreich, 2006)) A function $G$ stretching $k$ input bits into $n$ output bits is a CSPRNG iff over the randomness of the seed $s$ , no $D$ with polynomial-size advice can distinguish the generation from a random sequence more than a negligible fraction of the time. More precisely, $G$ is a (non-uniform) CSPRNG iff for every non-uniform probabilistic polynomial time algorithm $D_{k} : {0, 1}^{n} \to {0, 1}$ (making use of advice strings ${a_{k}}_{k \in N}$ of length $poly (k)$ ) has at most negligible advantage $ϵ (k)$ distinguishing outputs of $G$ from uniformly random sequences $u \sim U_{n}$ :

ϵ (k) = ∣ Pr_{s \sim U_{k}} [D_{n} (G (s)) = 1] - Pr_{u \sim U_{n}} [D_{n} (u) = 1] ∣ < negl (k) . [^{p} 5 - 1] (1)

epsilon of k is the absolute difference between the probability that D n of G of s equals one and the probability that D n of u equals one, which is less than a negligible function of k

The definition of indistinguishability via polynomial time tests is equivalent to a definition on the failure to predict the next element of a sequence given the previous elements: no polynomial time predictor can predict the next bit of the sequence with probability negligibly better than random guessing (Yao, 1982).

Following from the indistinguishability definition, randomness of this kind can be substituted for Martin-Löf randomness in the vast majority of practical circumstances.¹ For example, if a use-case of randomness that runs in polynomial time like quicksort, and takes more iterations to run with CSPRNG sequences than with truly random sequences, and this difference could be determined within polynomial time such as by measuring the quicksort runtime, then this construction could be used as a polynomial time distinguisher, which by the definition of CSPRNG does not exist. If CSPRNGs exist, then quicksort must run nearly as fast using pseudorandom number generation as it does with truly random sequences.

The existence of CSPRNGs hinges on the existence of one way functions, from which CSPRNGs and other cryptographic primitives are constructed, forming the basis of modern cryptography. For example, the backbone algorithm for parallel random number generation in Jax (Bradbury et al., 2018)Bradbury and colleagues, works to create random numbers $u_{1}, u_{2}, \dots u_{N}$ u one, u two, up to u N by simply encrypting the numbers $1, 2, \dots, N$ : $u_{k} = E (k, s)$ u sub k equals E of k and s where the encryption key $s$ s is the random seed and $E$ is the threefish block cypher (Salmon et al., 2011)Salmon and colleagues. Block ciphers, like other primitives, are constructed using one way functions.

Definition 4 (Non-uniform one-way function, OWF (Yao, 1982; Goldreich, 2006)) Let $f : {0, 1}^{n} \to {0, 1}^{m}$ f from binary strings of length n to binary strings of length m (with $m > n$ ) be computable in time $poly (n)$ where $n = ∣ x ∣$ . We say $f$ is one-way against non-uniform adversaries if for every non-uniform PPT algorithm $A_{n}$ (i.e., a polynomial-time algorithm $A$ with advice strings ${a_{n}}_{n \in N}$ of length $poly (n)$ ),

x \sim U_{n} Pr [A_{n} (f (x)) \in f^{- 1} (f (x))] < negl (n),

where the probability is over the uniform choice of $x$ (and the internal randomness of $A$ ).

While cryptographers are most interested in the polynomial versus nonpolynomial compute separations for security, one way functions with respect to less extreme compute separations have been constructed and are believed to exist, for example for quadratic time (Merkle, 1978), quasipolynomial time (Liu and Pass, 2024), and even constraints on circuit depth (Applebaum, 2016). While the results we prove in this paper are based on the polynomial vs nonpolynomial separation in cryptographic primitives, it seems likely that a much wider array of compute separations are relevant for information in the machine learning context even if not as important for cryptography. For example, the separations between quadratic or cubic time and higher order polynomials may be relevant to transformer self attention, or gaps between fixed circuit depth and variable depth as made possible with chain of thought or other mechanisms.

2.2 Random vs Structural Information

With these notions of randomness in hand, we can use what is random to define what is not random. In algorithmic information theory, there is a lesser known concept that captures exactly this idea, known as sophistication (Koppel, 1988), which has no direct analog in Shannon information theory. While several variants of the definition exist, the most straightforward is perhaps the following:

Definition 5 (Naive Sophistication (Mota et al., 2013)) Sophistication, like Kolmogorov complexity, is defined on individual bitstrings, and it uses the compressibility criterion from Martin-Löf randomness to carve out the random content of the bitstring. Sophistication is defined as the smallest Kolmogorov complexity of a set $S$ such that $x$ is a random element from that set (at randomness discrepancy of $c$ ).

nsoph_{c} (x) = S min {K (S) : K (x ∣ S) > lo g ∣ S ∣ - c} (2)

Naive sophistication of x is the minimum Kolmogorov complexity of a set S such that the complexity of x given S is greater than the log of the size of S minus c

Informally, sophistication describes the structural component of an object; however, it is surprisingly difficult to give concrete examples of high sophistication objects. The difficulty of finding high sophistication objects is a consequence of Chaitin’s incompleteness theorem (Chaitin, 1974). This theorem states that in a given formal system there is a constant $L$ L for which there are no proofs that any specific string $x$ x has $K (x) > L$ K of x is greater than L, even though nearly all strings have nearly maximal complexity. Since $nsoph_{c} (x) > L$ n soph c of x greater than L implies $K (x) > L - O (1)$ K of x is greater than L minus order one, there can be no proofs that the sophistication of a particular string exceeds a certain constant either. It is known that high sophistication strings exist by a diagonalization argument (Antunes et al., 2005), but we cannot pinpoint any specific strings which have high sophistication. On typical Turing machines, $L$ L is often not more than a few thousand (Chaitin, 1998), far from the terabytes of information that frontier AI models have encoded.

We look towards complex systems and behaviors as likely examples of high sophistication objects; however in many of these cases the objects could conceivably be produced by simpler descriptions given tremendous amounts of computation. The mixing of two fluids for example can produce extremely complex transient behavior due to the complexities of fluid dynamics; however, with access to unlimited computation and some appropriately chosen random initial data one should be able to reproduce the exact dynamics (Aaronson et al., 2014). Owing to the unbounded compute available for the programs in sophistication, many complex objects lose their complexity. Additionally, for strings that do have high sophistication, the steps of computation required for the optimal program grow faster than any computable function with the sophistication content (Ay et al., 2010). For a computationally-bounded observer, an encrypted message or a cryptographically secure pseudo-random number generator (CSPRNG) output is random, and measurements that do not recognize this randomness do not reflect the circumstances of this observer. These limitations of sophistication leads to a disconnect with real systems with observers that have limited computation, and it is our contention that this disconnect is an essential one, central to phenomena such as emergence, induction, chaos, and cryptography.

2.3 The Minimum Description Length Principle

Finally, we review the minimum description length principle (MDL), used as a theoretical criterion for model selection, which we will use in defining epiplexity. The principle states that among models for the data, the best explanation minimizes the total description length of the data, including both the description of the data using the model and the description of the model itself (Rissanen, 2004). The most common instantiation of this idea is via the statistical two-part code MDL.

Definition 6 (Two-part MDL (Rissanen, 2004; Grünwald, 2007)) Let $x \in {0, 1}^{n \times d}$ x in the set of binary strings of length n by d be the data and $H$ H be a set of candidate models. The two-part MDL is:

L (x) = H \in H min L (H) - lo g P (x ∣ H),

L of x equals the minimum over H in the set of models of L of H minus log P of x given H

where $L (H)$ L of H specifies the number of bits required to encode the model $H$ H, and $- lo g P (x ∣ H)$ minus log P of x given H is the number of bits required to encode the data given the model.

This formulation provides an intuitive implementation of Occam’s Razor: complex models (large $L (H)$ L of H) are penalized unless they provide a reduction in the data’s description length (large $P (x ∣ H)$ P of x given H). If there are repeating patterns in the data, they can be stored in the model $H$ H rather than being duplicated in the code for the data. We review the modern developments of MDL in Appendix H. While MDL is a criterion for model selection given a fixed dataset, epiplexity, which we introduce next, can be viewed as its dual: a criterion for data selection given a fixed computation budget.

3 Epiplexity: Structural Information Extractable by a Computationally Bounded Observer

Keeping in mind the distinction between structural and random information in the unbounded compute setting, and the computational nature of pseudorandomness in cryptography, we now introduce epiplexity. Epiplexity captures the structural information present to a computationally bounded observer. As the computational constraints of this observer change, so too does the division between random and structured content. After introducing epiplexity here, we present ways of measuring epiplexity in Section 4. In Sections 5 and 6 we show how epiplexity can shed light on seeming paradoxes in information theory around the value of data, and OOD generalization.

First we will define what it means for a probability distribution to have an efficient implementation, requiring that it be implemented on a prefix-free universal Turing machine (UTM) and halt in a fixed number of steps.

Definition 7 (Time-bounded probabilistic model) Let $T : N \to N$ T from N to N be non-decreasing time-constructible function and let $U$ U be a fixed prefix-free universal Turing machine. A (prefix-free) program $P$ P is a $T$ -timeT time probabilistic model over ${0, 1}^{n}$ the set of binary strings of length n if it supports both sampling and probability evaluation in time $T (n)$ T of n:

Evaluation. On input $(0, x)$ zero comma x with $x \in {0, 1}^{n}$ x in the set of binary strings of length n, $U (P, (0, x))$ U of P and zero comma x halts within $T (n)$ T of n steps and outputs an element in $[0, 1]$ (with a finite binary expansion), denoted

Prob_{P} (x) := U (P, (0, x)) .

Sampling. On input $(1, u)$ one comma u where $u \in {0, 1}^{\infty}$ u is an infinite random tape, $U (P, (1, u))$ U of P and one comma u halts within $T (n)$ T of n steps and outputs an element of ${0, 1}^{n}$ binary strings of length n, denoted

Sample_{P} (u) := U (P, (1, u)) .

These outputs must define a normalized distribution matching the sampler:

x \in {0, 1}^{n} \sum Prob_{P} (x) = 1 and Pr_{u \sim U_{\infty}} [Sample_{P} (u) = x] = Prob_{P} (x) \forall x \in {0, 1}^{n} .

Let $P_{T}$ P sub T be the set of all such programs. To simplify the notation, we will use italicized $P$ to denote the probability mass function $Prob_{P}$ in contrast with the non-italicized $P$ , which denotes the program.

Here, $n$ n denotes the dimension of the underlying sample space (e.g., the length of the binary string.) This definition allows us to constrain the amount of computation the function class can use. Such a model class enforces that the functions of interest are both efficiently sampleable and evaluable, which include most sequence models. While in this work we focus primarily on computational constraints which we consider most fundamental, other constraints such as memory or within a given function class $F$ F can be accommodated by replacing $P_{T}$ P sub T with $P_{F}$ P sub F, and may be important for understanding particular phenomena.² With these preliminaries in place, we can now separate the random and structural components of information.

We define epiplexity and time-bounded entropy in terms of the program which achieves the best expected compression of the random variable $X$ X, minimizing the two-part code length (model and data given model bits) under the given runtime constraint.

Definition 8 (Epiplexity and Time-Bounded Entropy)

Consider a random variable $X$ X on ${0, 1}^{n}$ the set of binary strings of length n. Let
$P^{⋆} = P \in P_{T} ar g min {∣ P ∣ + E [lo g 1/ P (X)]} (3)$
be the program that minimizes the time bounded MDL with ties broken by the smallest program, and expectations taken over $X$ X. $∣ P ∣$ the length of P denotes the length of the program $P$ P in bits, and logarithms are in base 2. We define the $T$ T-bounded epiplexity $S_{T}$ S sub T and entropy $H_{T}$ H sub T of the random variable $X$ X as
$S_{T} (X) := ∣ P^{⋆} ∣, and H_{T} (X) := E [lo g 1/ P^{⋆} (X)] . (4)$

The time-bounded entropy $H_{T}$ H sub T captures the amount of information in the random variable that is random and unpredictable, whereas the epiplexity $S_{T}$ S sub T captures the amount of structure and regularity visible within the object at the given level of compute. Uniform random variables have trivial epiplexity because a model as simple as the uniform distribution achieves a small two part code length, despite having large time bounded entropy. Explicitly, for a uniform random variable $U_{n}$ U n on ${0, 1}^{n}$ binary strings of length n, and even a constant time bound $T (n) \geq c_{1}$ T of n at least c one, $S_{T} (U_{n}) + H_{T} (U_{n}) \leq n + c_{2}$ S sub T plus H sub T is at most n plus c two where $c_{2}$ c two is the length of a program for the uniform distribution running in time $c_{1}$ c one, and since $H_{T} (U_{n}) \geq H (U_{n}) = n$ H sub T is at least n, it must be that $S_{T} (U_{n}) \leq c_{2}$ S sub T is at most c two. Random variables with very simple patterns, like 0101010101… with probability 1/2 and 1010101010… with probability 1/2, also have low epiplexity because the time bounded MDL minimal model is simple. In this case with linear time $T (n) = Θ (n)$ T of n is order n, both $S_{T} (X) = O (1)$ S sub T of X is constant order and $H_{T} (X) = O (1)$ H sub T of X is constant order. Henceforth, we will abbreviate $M D L_{T} (X) := S_{T} (X) + H_{T} (X)$ MDL sub T of X is defined as S sub T plus H sub T, which is the total time-bounded information content. We will now enumerate a few basic consequences of these definitions.

Basic Properties

(1) $S_{T} (X) \geq 0, H_{T} (X) \geq 0$ ,
(2) $H (X) \leq S_{T} (X) + H_{T} (X) \leq n + c_{1}$ ,
(3) $M D L_{T^{'}} (X) \leq M D L_{T} (X)$ whenever $T^{'} (n) \geq T (n)$ ,
(4) $M D L_{T^{'}} (f^{- 1} (X)) \leq M D L_{T} (X) + ∣ f ∣ + c_{2}$ , with $T^{'} (n) = T (n) + Time (f)$ .

Statement 4 (defined for programs $f$ f that run in a fixed time implementing a bijection) is an analog of the information non-increase property $K (f (x)) \leq K (x) + K (f) + c$ K of f of x is at most K of x plus K of f plus a constant. However, note that while the Kolmogorov complexity for $K (f)$ K of f and $K (f^{- 1})$ K of f inverse are the same to within an additive constant, in our setting of a fixed computational budget having a short program for $f^{- 1}$ f inverse does not imply one for $f$ f, and vice versa. This gap between a function and its inverse has important consequences for the three paradoxes as we will see in Section 5.

Pseudorandom number sequences have high random content and little structure. Unlike Shannon entropy, Kolmogorov complexity, or even resource bounded forms of Kolmogorov complexity (Allender et al., 2011), we show that CSPRNGs have nearly maximal time-bounded entropy for polynomial time observers. Additionally, while CSPRNGs produce random content, they do not produce structured content as the epiplexity is negligibly larger than constant. Formally, let $U_{k}$ U sub k be the uniform distribution on $k$ k bits.

BLUE

Theorem 9 For any $T \in Poly (n)$ T in poly n and $G \in CSPRNG$ G in the set of cryptographically secure pseudorandom number generators that stretches the input to $n = poly (k)$ n equals poly k bits and allowing for an advantage of at most $ε (k)$ epsilon of k, the time bounded entropy is nearly maximal:
$n - 2 - n ε (k) < H_{T} (G (U_{k})) \leq n + c,$
and the epiplexity is nearly constant
$S_{T} (G (U_{k})) \leq c + n ε (k) .$
Proof: see Appendix A.1.

In contrast, the Shannon entropy is $H (G (U_{k})) = k$ H of G of U sub k equals k, polynomial time bounded Kolmogorov complexity will be at most $k + c$ k plus c (assuming $n$ n is fixed or specified ahead of time) as there is a short and efficiently runnable program $G$ G which produces the output, and similarly with other notions such as Levin complexity (Li and Vitányi, 2008)Li and Vitanyi or time bounded Kolmogorov complexity (Allender et al., 2011)Allender and colleagues. Taken together, these results show that epiplexity appropriately characterizes pseudorandom numbers as carrying a large amount of time-bounded randomness but essentially no learnable structure, exactly as intuition suggests.

Existence of Random Variables with High Epiplexity. One may wonder whether any high epiplexity random variables exist at all, and indeed under the existence of one way functions we can show via a counting argument that there exists a sequence of random variables whose epiplexity grows at least logarithmically with the dimension.

BLUE

Theorem 10 Assuming the existence of one-way functions secure against non-uniform probabilistic polynomial time adversaries, there exists a sequence of random variables ${X_{n}}_{n = 1}^{\infty}$ X sub n, from n equals one to infinity over ${0, 1}^{n}$ the set of binary strings of length n such that
$S_{Poly} (X_{n}) = Ω (lo g n) .$
Proof: see Appendix A.4.

From this result we know at least that random variables with arbitrarily large epiplexities indeed exist; however, logarithmically growing information content only admits a very modest amount of information, still far from the power law scaling we see with some natural data.

Conditional Entropy and Epiplexity. To describe situations like image classification, where we are only interested in a function which predicts the label from the image, and not the information in generating the images, we define conditional time-bounded entropy and epiplexity.

Definition 11 (Conditional epiplexity and time-bounded entropy) For a pair of random variables $X$ X and $Y$ Y, define $P_{T (n)}^{X}$ P sub T of n superscript X as the set of probabilistic models $P$ P such that for each fixed $x$ x, the conditional model $P_{Y ∣ x}$ is in $P_{T (n)}$ . The optimal conditional model with access to $X$ is:

P_{Y ∣ X}^{⋆} = P \in P_{T}^{X} ar g min {∣ P ∣ + E_{(X, Y)} [- lo g P (Y ∣ X)]} . (5)

The conditional epiplexity and time-bounded entropy are defined as:

S_{T} (Y ∣ X) := ∣ P_{Y ∣ X}^{⋆} ∣, H_{T} (Y ∣ X) := E_{(X, Y)} [- lo g P_{Y ∣ X}^{⋆} (y ∣ x)] . (6)

These quantities are defined with respect to the time bounded MDL over programs which take as input $X, Y$ X and Y and output the probabilities over $Y$ Y (conditioned on $X$ X), and with expectations taken over both $X$ and $Y$ X and Y. We note that in general this definition is not equivalent to the difference of the joint and individual entropies, $H_{T} (Y, X) - H_{T} (X) \neq = H_{T} (Y ∣ X)$ H sub T of Y comma X minus H sub T of X does not equal H sub T of Y given X. Unlike Shannon entropy, we can also condition on deterministic strings, which will change the values on account of not needing such a large program $P$ P. For example, we may be interested in the conditional epiplexity $S_{T} (X ∣ m)$ S sub T of X given m or entropy $H_{T} (X ∣ m)$ H sub T of X given m given a model $m$ m. For a deterministic string $d \in {0, 1}^{*}$ d in the set of all binary strings we define the conditional epiplexity via

P_{Y ∣ d}^{⋆} = P \in P_{T}^{{0, 1}^{*}} min {∣ P ∣ + E_{Y} [- lo g P (Y ∣ d)]}, (7)

where the minimization is over time bounded functions $P (\cdot ∣ \cdot)$ that take in the string $d$ as the second argument (which we refer to as $P_{T}^{{0, 1}^{*}}$ P T of binary strings).

For the machine learning setting, we take the random variable $X$ X to refer to the entire dataset of interest, i.e. typically a collection $X = [X_{1}, X_{2}, \dots]$ X equals X sub one, X sub two, and so on of many iid samples from a given distribution, rather than a lone sample from, and $E [lo g 1/ P (X)]$ scales with the dataset size. Epiplexity typically grows with the size of the dataset (see detailed arguments for why this is the case in Section B.4) as larger datasets allow identifying and extracting more intricate structure and patterns, mirroring the practice of ML training. Moreover, as we will see later, the epiplexity of a typical dataset is orders of magnitudes smaller than the random information content. While not a focus of this paper, conditioning on deterministic strings opens up the possibility to understand what additional data is most useful for a specific machine learning model, such as for post-training a pre-trained LLM.

4 Measuring Epiplexity and Time-Bounded Entropy

We have now introduced epiplexity and time-bounded entropy as measures of structural and random information of the data. In this section, we present practical procedures to estimate upper bounds and empirical proxies for these quantities. Intuitively, we want to find a probabilistic model $P (\cdot)$ of the data $X$ P of the data X that achieves low expected loss $E [lo g 1/ P (X)]$ expected loss of log one over P of X, is described by a short program $P$ P, and evaluating $P (X)$ takes time at most $T (∣ X ∣)$ T of the length of X, which we will abbreviate as $T$ T. Using this model, we thereby decompose the information of the data into its structural and random components, namely, (1) epiplexity $S_{T} (X)$ S sub T of X: the length of the program $∣ P ∣$ the length of P, accounting for the bits required to model the data distribution, and (2) time-bounded entropy $H_{T} (X)$ H sub T of X: the expected length for entropy coding the data using this model, which accounts for the bits required to specify the particular realization of $X$ within that distribution. We estimate conditional epiplexity analogously, providing random variable conditioning as input into the model.

Since directly searching over the space of programs is intractable, we restrict attention to probabilistic models parameterized by neural networks, as they achieve strong empirical compression across data modalities (MacKay, 2003; Goldblum et al., 2023; Delétang et al., 2023; Ballé et al., 2018) and capture the most relevant ML phenomenology. While a naive approach is to let $P$ be a program that directly stores the architecture and weights of a neural network and evaluates it on the given data, this approach can significantly overestimate the information content in the weights, particularly for large models trained on relatively little data. Instead, we will use a more efficient approach that encodes the training process that produces the weights. We will discuss two approaches for encoding neural network training processes, based on prequential coding (Dawid, 1984)Dawid, 1984 and requential coding (Finzi et al., 2026)Finzi and colleagues, 2026, respectively. The former is more straightforward to understand and evaluate, but relies on a heuristic argument to separate structure bits from noise bits, while the latter is rigorous at the cost of being more difficult to evaluate. Fortunately, both approaches often yield comparable rankings of epiplexity across datasets (Section 4.3).

A composite figure with three panels: (a) showing training NLL curves for prequential and requential estimation, (b) a Pareto frontier for compute-optimal two-part codes, and (c) a comparison of Requential versus Prequential S sub req values across four datasets. Figure 2: How to estimate epiplexity. (a) We consider two approaches for efficiently coding trained neural networks. Prequential estimation estimates information content as the area under the loss curve of a model above its final loss, with the training set matching the test data distribution. Requential coding, which provides an explicit code for $P^{s}$ P s with expected length as the cumulative KL between a student model $P^{s}$ P s and the teacher $P^{t}$ P t that generates its synthetic training data, visualized approximately by their loss gap. We typically choose $P^{t}$ P t to be a model trained on the real training set, as in prequential coding. (b) Using either approach, we optimize hyperparameters (model size $N$ N, training tokens $D$ D, etc.) to find the shortest two-part code for each compute budget, which decomposes into the estimated epiplexity and time-bounded entropy. (c) Comparing prequential and requential coding on four groups of datasets used in this work. The prequential estimate is typically larger, but the two correlate well, particularly within each group.

Moving forward, we will measure time by the number of floating-point operations (FLOPs) and dataset size by number of tokens, so that training a model with $N$ N parameters on $D$ D tokens takes time approximately $6 N D$ six N D (Kaplan et al., 2020), while evaluating it on $X$ X takes time $2 N D$ two N D with $D = ∣ X ∣$ D equals the size of X the number of tokens in $X$ X. To distinguish $X$ X from the training dataset, which we are free to choose, we will refer to $X$ X as the test dataset, as it is the data we need to perform inference on.

4.1 Approximating Model Description Length with Prequential Coding

Prequential coding provides a classic approach for compressing the training process of a neural network. We assume a batch size of one for simplicity, but generalizing to batch sizes larger than one is straightforward. Starting with a randomly initialized network $P_{0}$ P zero (where the subscript indicates timestep), we proceed iteratively: at each step $i$ i, we entropy encode the current training token $Z_{i}$ Z i using $lo g 1/ P_{i} (Z_{i})$ log one over P i of Z i bits, then train the model on this token to produce $P_{i + 1}$ P i plus one. Typically $Z_{i}$ Z i’s are drawn i.i.d. from the same distribution as $X$ X. On the side of the decoder, a synchronized model is maintained; the model decodes $Z_{i}$ Z i using $P_{i}$ P i and then trains on it to produce the identical $P_{i + 1}$ P i plus one. Omitting small constant overheads for specifying the random initialization, architecture, and training algorithm, a total of $L (Z_{: M}, P_{M}) = \sum_{i = 0}^{M - 1} lo g 1/ P_{i} (Z_{i})$ L of Z and P M, defined as the sum from i equals zero to M minus one of log one over P i of Z i bits yields an explicit code for both the training data $Z_{: M} = {Z_{0}, \dots, Z_{M - 1}}$ and the final model weights $P_{M}$ P M, which can be decoded in time $6 N D$ six N D for a model with $N$ N parameters trained on $D$ D tokens (typically $D > M$ as each example contains multiple tokens). Despite having an explicit code for $Z, P_{M}$ Z and P M, we cannot easily separate this into a code for $P_{M}$ P M alone for estimating epiplexity.

To isolate the description length of $P_{M}$ P M alone, we adopt the heuristic in Zhang et al. (2020)Zhang and colleagues and Finzi et al. (2025)Finzi and colleagues: we first estimate the description length of the training data given $P_{M}$ P M as its entropy code length under the final model, $L (Z_{: M} ∣ P_{M}) = \sum_{i = 0}^{M - 1} lo g 1/ P_{M} (Z_{i})$ L of Z given P M, which is the sum of log one over P M of Z i. Then, appealing to the symmetry of information, which states $K (P_{M}) = K (Z_{: M}, P_{M}) - K (Z_{: M} ∣ P_{M})$ up to constant terms, we estimate the description length of $P_{M}$ P M as the difference $L (Z_{: M}, P_{M}) - L (Z_{: M} ∣ P_{M})$ :

∣ P_{preq} ∣ \approx i = 0 \sum M - 1 (lo g 1/ P_{i} (Z_{i}) - lo g 1/ P_{M} (Z_{i})) . (8)

The prequential description length of P is approximately the sum from i equals zero to M minus one of the difference between log one over P i of Z i and log one over P M of Z i

If $Z_{i}$ Z sub i is sampled i.i.d., as is typically the case, then the code length for the model can be visualized as the area under the loss curve above the final loss in Figure 2a. Intuitively, the model absorbs a significant amount of information from the data if training yields a sustained and substantial reduction in loss. For random data, $lo g 1/ P_{i} (Z_{i})$ log one over P i of Z i never decreases, while for simple data, $lo g 1/ P_{i} (Z_{i})$ log one over P i of Z i drops rapidly and stabilizes, both leading to small $∣ P_{preq} ∣$ the length of P preq.

Encoding the test dataset $X$ X (not to be confused with the training data) using this model, we obtain a two-part code of expected length $∣ P_{preq} ∣ + E [lo g 1/ P_{M} (X)]$ the length of P preq plus the expected value of log one over P M of X that runs in time $6 N D + 2 N D$ six N D plus two script N D. We optimize the training hyperparameters (e.g., learning rate) and the trade-off between $N$ N and $D$ D subject to the time bound $6 N D + 2 N D \leq T$ six N D plus two script N D is less than or equal to T to find the optimal $P^{⋆}$ P star that minimizes the two-part code within this family, and estimate epiplexity and time-bounded entropy as $S_{T} (X) = ∣ P_{preq}^{⋆} ∣$ S T of X equals the length of P star preq and $H_{T} (X) = E [lo g 1/ P^{⋆} (X)]$ H T of X equals the expected value of log one over P star of X. The better these hyperparameters are optimized, the more accurate our estimates become. We use the Maximal Update Parameterization ( $μ P$ )mu P (Yang et al., 2022)introduced by Yang and colleagues to ensure the optimal learning rate and initialization are consistent across model sizes, simplifying tuning. We estimate the expectation $E [lo g 1/ P_{M} (X)]$ the expected value of log one over P M of X by its empirical value on held-out validation data, i.e., the validation loss scaled by the size of $X$ X. We detail the full procedure in Section B, such as how we choose the hyperparameters and estimate the Pareto frontier of MDL vs compute.

While conceptually simple, practically useful, and easy to evaluate, this prequential approach to approximating epiplexity is not rigorous for two reasons. First, both $L (Z_{: M}, P_{M})$ and $L (Z_{: M} ∣ P_{M})$ can only upper-bound the respective Kolmogorov complexities, and thus their difference does not yield an upper bound for $K (P_{M})$ .³ Second, even setting this issue aside, the argument only establishes the existence of a program that encodes $P_{M}$ with length $∣ P_{preq} ∣$ , but does not guarantee that its runtime falls within $6 N D$ , since the symmetry of information does not extend to time-bounded Kolmogorov complexity. Nevertheless, prequential coding can serve as a useful starting point for crudely estimating epiplexity, particularly convenient when one already has access to the loss curve from an existing training run.

4.2 Explicitly Coding the Model with Requential Coding

To address the shortcomings of the previous approach based on prequential coding, we adopt requential coding (Finzi et al., 2026)by Finzi and colleagues for constructing an explicit code of the model with a known runtime. Rather than trying to code a particular training dataset, with requential coding one can use the insensitivity to the exact data points sampled to code for a sampled dataset that leads to a performant model but without paying for the entropy of the data. Specifically, it encodes a training run where at step $i$ i a student model $P_{i}^{s}$ P s i is trained on a synthetic token sampled randomly from a teacher model $P_{i}^{t}$ P t i, where the sequence $P_{0}^{t}, \dots, P_{M - 1}^{t}$ P t zero through P t M minus one are arbitrary teacher model checkpoints. We typically choose $P_{i}^{t}$ P t i to be the checkpoints from training on the original real training set, as in prequential coding. Using relative entropy coding (Theis and Ahmed, 2022)by Theis and Ahmed, the synthetic tokens $Z_{i} \sim P_{i}^{t}$ Z tilde i sampled from P t i can be coded given only the student $P_{i}^{s}$ P s i (synchronized between encoder and decoder) using $KL (P_{i}^{t} ∥ P_{i}^{s}) + lo g (1 + KL (P_{i}^{t} ∥ P_{i}^{s})) + 4$ the K L divergence between P t i and P s i, plus the log of one plus the K L divergence, plus four bits in expectation. Summing over all steps gives the requential code length for $P_{M}^{s}$ :

∣ P_{req} ∣ = i = 0 \sum M - 1 KL (P_{i}^{t} ∥ P_{i}^{s}) + lo g (1 + KL (P_{i}^{t} ∥ P_{i}^{s})) + 4 + O (1) \approx i = 0 \sum M - 1 KL (P_{i}^{t} ∥ P_{i}^{s}), (9)

The requential code length is approximately the sum from i equals zero to M minus one of the K L divergence between P t i and P s i

where the logarithmic and constant overheads are typically negligible due to large sequence length and batch size, and as before we omit the small constant cost of specifying the random initialization, architecture, and training algorithm. In addition to providing an explicit code, a key advantage of requential coding is its flexibility in choosing the teacher sequence: by selecting teachers $P_{i}^{t}$ that

remain close to the student $P_{i}^{s}$ P s i while still pointing toward the target distribution, we keep the per-step coding cost $KL (P_{i}^{t} ∥ P_{i}^{s})$ K L divergence between P t i and P s i small while effectively guiding the student’s learning.

Figure 2a connects requential coding to the student’s and teacher’s loss curves: suppose we take as teachers the checkpoints $P_{0}^{t}, \dots, P_{M - 1}^{t}$ P t zero through P t M minus one from a model trained on real data $Z_{0}, \dots, Z_{M - 2} \sim P_{X}$ Z zero through Z M minus two sampled from P X. For visualization, we can then estimate $KL (P_{i}^{t} ∥ P_{i}^{s})$ by the loss gap $lo g 1/ P_{i}^{s} (Z_{i}) - lo g 1/ P_{i}^{t} (Z_{i})$ log of one over P s i of Z i minus log of one over P t i of Z i, which is accurate when $P_{i}^{t} \approx P_{X}$ . We can thus visualize the code length for the student as approximately the area between the teacher’s and student’s loss curves on real data, as shown in Figure 2a.

The two-part code has expected length $∣ P_{req} ∣ + E [lo g 1/ P_{M}^{s} (X)]$ , consisting of first decoding $P_{M}^{s}$ by replaying the training process, which takes time $6 N D$ for a total of $D$ requential training tokens, and then evaluating $P_{M}^{s}$ on the test dataset $X$ , taking an additional time $2 N D$ , for a total runtime of $6 N D + 2 N D$ . We optimize the training hyperparameters, teacher choices, and the trade-off between $N$ and $D$ subject to the specified time bound $T$ to find the optimal model $P^{⋆}$ minimizing the two-part code, and estimate $S_{T} (X) = ∣ P_{req}^{⋆} ∣$ and $H_{T} (X) = E [lo g 1/ P^{⋆} (X)]$ . See details in Section B.1.

4.3 Comparison Between the Two Approaches and Practical Recommendations

Figure 2c compares the estimated epiplexity obtained by the two approaches across four groups of datasets used in this work: ECA (Section 5.1), easy and hard induction (Section 5.3.1), and natural datasets (Section 6.2). While the prequential estimate is typically several times larger than the requential estimate, the two estimates correlate well, particularly within each group where the datasets yield similar learning dynamics. We detail the datasets and time bounds used in Section C.7. This general agreement is expected since the prequential estimate can be viewed as an approximation of requential coding with a static teacher (Section B.2). In general, however, the discrepancy between the two estimates will depend on particular datasets and training configurations, and a good correlation between the two is not guaranteed.

While requential coding is the more rigorous approach, it is typically $2 \times$ to $10 \times$ slower than prequential coding, which requires only standard training. The overhead depends on batch size, sequence length, and inference implementation (smaller overhead for large batches and short sequences), as requential coding requires repeatedly sampling from the teacher, though it is possible that the overhead can be reduced with more efficient algorithms. Therefore, we recommend using prequential coding for crudely estimating epiplexity and ranking the epiplexity of different datasets, particularly when one has access to the loss curve from an existing expensive training run (e.g., see an application in Section 6.2), and requential coding for obtaining the most accurate estimates otherwise.

4.4 How Epiplexity and Time-Bounded Entropy Scale with Compute and Data

Under natural assumptions about neural network training—namely, that larger models are more sample-efficient and that there are diminishing returns to scaling model size or data alone—we expect epiplexity and time-bounded entropy to exhibit certain generic scaling behavior as a function of the compute budget $T$ and dataset size $D$ D. In Section B.4, we show that, under these assumptions, the compute-optimal model size $N^{⋆} (T)$ and training data size $D^{⋆} (T)$ are generally increasing in the compute budget $T$ , which implies that epiplexity $S_{T} (X)$ typically grows with $T$ while time-bounded entropy $H_{T} (X)$ decreases. In the infinite-compute limit, epiplexity $S_{\infty} (X)$ typically grows with the test set size $D = ∣ X ∣$ D equals the size of X, while the per-token time-bounded entropy $H_{\infty} (X) / D$ decreases. These results align with our intuition that larger compute budgets and more data allow the model to extract more structural information from the dataset and reduce the apparent randomness remaining in each sample. However, they should be understood only as typical trends, with a counterexample shown in Section 5.3.2 relating to the phenomenon of emergence.

5 Three Apparent Paradoxes of Information

To illustrate the lacunae in existing information theory perspectives, we highlight three apparent paradoxes of information: (1) information cannot be created by deterministic transformations; (2) total information content of an object is the same regardless of the factorization; and (3) likelihood modeling can only learn to match the data-generating process. Each statement captures some existing sentiment within the machine learning community, can be justified mathematically by Shannon and algorithmic information theory, and yet seems to be in conflict with intuitions and experimental observations. In this section, we will show with both theoretical results and empirical evidence that time bounding and epiplexity help resolve these apparent paradoxes.

5.1 Paradox 1: Information Cannot be Created by Deterministic Transformations

Both Shannon and algorithmic information theory state in some form that the total information cannot be increased by applying deterministic transformations on existing data. The data processing inequality (DPI) states that if some information source $W$ W produces natural data $X$ X that are collected, then no deterministic or stochastic transformations used to produce $Y$ from $X$ Y from X can increase the mutual information with the variable of interest $W$ : $I (Y; W) \leq I (X; W)$ I of Y and W is less than or equal to I of X and W. Similarly, information non-increase states that a deterministic transformation $f$ f can only preserve or decrease the Shannon information, a property that holds pointwise $- lo g P_{Y} (f (x)) \leq - lo g P_{X} (x)$ minus log P sub Y of f of x is less than or equal to minus log P sub X of x and in expectation: $H (f (X)) \leq H (X)$ the entropy of f of X is less than or equal to the entropy of X (we note $X$ X here is a discrete random variable). In algorithmic information theory, there is a corresponding property: $K (f (x)) \leq K (x) + K (f) + c$ K of f of x is less than or equal to K of x plus K of f plus a constant c for a fixed constant $c$ c. These inequalities appear to rule out creating new information with deterministic computational processes.

How can we reconcile this fact with algorithms like AlphaZero (Silver et al., 2018)Silver and colleagues that can be run in a closed environment from a small deterministic program on the game of chess, extracting insights about the game, different openings, the relative values of pieces in different positions, tactics and high level strategy, and requiring megabytes of information stored in the weights? Similarly we have dynamical systems with simple descriptions of the underlying laws that produce rich and unexpected structures, from which we can learn new things about them and mathematics.

We also have evidence that synthetic data is helpful (Liu et al., 2024; Gerstgrasser et al., 2024; Maini et al., 2024; OpenAI, 2025) is helpful for model capabilities and moreover, if we believe that the processes that create natural data could in principle have been simulated to sufficient precision on a large computer, then all data could have been equivalently replaced with synthetic data. For practical synthetic data produced from transformations of samples from a given model and prompt, this sampling is performed with pseudorandom number generators, making the entire transformation deterministic. If we consider $f$ f as the transformations we use to produce synthetic data and $x$ x was the limited real data we started with, these inequalities appear to state very concretely that our synthetic data adds no additional information beyond the model and training data.

Whatever information it is that we mean when we say that AlphaZero has produced new and unexpected insights in chess, or new theoretical results in mathematics, or with synthetic data, it is not Shannon or algorithmic information. We argue that these unintuitive properties of information theory are a consequence of assuming unlimited computation for the observer. With limited computation, a description of the AlphaZero algorithm and the result of running AlphaZero for thousands of TPU hours are distinct. To build intuition, we start with the humble CSPRNG which also creates time-bounded information through computation (albeit random information).

Visualizations of cellular automata rules 15, 30, and 54 alongside plots of their training loss, epiplexity, and time-bounded entropy. Figure 3: Information created with cellular automata. (Left) Example rollouts from random initial conditions of the class II rule 15, class III rule 30, and class IV rule 54. Time flows from up to down. (Right) Measuring epiplexity on data produced by these transformations, we see that rule 15 produces little information (low $H_{T}$ , low $S_{T}$ ), rule 30 produces lots of unpredictable random information (high $H_{T}$ , low $S_{T}$ ), and rule 54 produces both random and structural information (medium $H_{T}$ , high $S_{T}$ ). These observations are reflected in the training loss curve of LLMs, which saturates quickly for rule 15, makes no progress for rule 30, and makes continued progress with compute for rule 54.

Theorem 12

Let $G : {0, 1}^{k} \to {0, 1}^{n}$ G mapping k bits to n bits be a CSPRNG which admits advantage $ε (k)$ epsilon of k and $U_{k}$ U sub k be the uniform distribution. $H_{Poly} (G (U_{k})) > H_{Poly} (U_{k}) + n - n ε (k) - k - O (1)$ .
Proof: see Appendix A.2.

Notably, we have a deterministic function which dramatically increases the time-bounded information content of the input. It is worth contrasting this result with Equation 3, where the time-bounded information content increase from a deterministic function can be bounded if the inverse function has a short program which can run efficiently. The statement highlights an important asymmetry between the function $G$ G and its inverse with fixed computation that does not hold with unlimited computation (e.g. $K (G^{- 1}) = K (G) + O (1)$ ). Simultaneously, it provides some useful guidance for synthetic data: if we want to produce interesting information, we should make sure the functions we use do not have simple and efficiently computable inverses.

As an illustrative example, consider the iterated dynamics of elementary cellular automata (Wolfram and Gad-el Hak, 2003; Zhang et al., 2024)as discussed in prior work. An elementary cellular automaton (ECA) is a one-dimensional array of binary cells that evolves in discrete time steps according to a fixed rule mapping each cell’s current state and the states of its two immediate neighbors to its next state. Despite their simple formulation – only 256 possible rules—these systems can produce a rich variety of behaviors, from stable and periodic patterns to chaotic and computationally universal dynamics. We setup the problem of predicting $Y_{i} = F (X_{i})$ Y i equals F of X i from random initial data $X_{i}$ X i for $F$ F being an ECA iterated 48 times on a grid of size 64, and assemble these pairs into a dataset $X = [X_{1}, \dots, X_{K}]$ X and $Y = [Y_{1}, \dots, Y_{K}]$ Y for a total dataset of $D = 100 M$ D equals 100 million tokens. We measure the conditional information content $Y ∣ X$ Y given X (epiplexity and entropy) for ECA rules 15, 30, and 54 by training LLMs on this dataset. We provide a visualization of these dynamics in Figure 3 (left). For the class II rule 15 in the Wolfram hierarchy (Wolfram and Gad-el Hak, 2003), the produced behavior is periodic and has a simple inverse. Consequently, in Figure 3 (right), we see that training dynamics that rapidly converge to optimal predictions and with little epiplexity or time-bounded entropy. With the class III rule 30, the computation produces outputs that are inherently intractable to predict with limited computation, and as a result we see that there is maximal time-bounded entropy that is produced but no epiplexity. For the class IV rule 54, we see that the dynamics are complex but also partly understandable: the loss decreases slowly and much epiplexity is produced. These results highlight the sensitivity of epiplexity to the generating process. With the same compute spent and with a very similar program

we can have drastically different outcomes, producing simple objects, producing only random content, and producing a mix of random and structured content.

5.2 Paradox 2: Information Content is Independent of Factorization

An important property of Shannon’s information is the symmetry of information, which states that the amount of information content does not change with factorization. The information we acquire when predicting $x$ x and then $y$ y is exactly equal to when predicting $y$ y and then $x$ x: $H (Y ∣ X) + H (X) = H (X, Y) = H (X ∣ Y) + H (Y)$ H of Y given X plus H of X equals H of X comma Y, which equals H of X given Y plus H of Y. An analogous property also holds for Kolmogorov complexity, known as the symmetry of information identity: $K (Y ∣ X) + K (X) = K (X ∣ Y) + K (Y) + O (1)$ .

On the other hand, multiple works have observed that natural text is better compressed (with final model achieving higher likelihoods) when modeled in the left-to-right order (for English) than when modeled in reverse order (Papadopoulos et al., 2024; Bengio et al., 2019)as seen in work by Papadopoulos and colleagues and Bengio and colleagues, picking out an arrow of time in LLMs where one direction of modeling is preferred over the other. It seems likely that for many documents, other orderings may lead to more information extracted by LLMs. Similarly, as we will show later, small rearrangements of the data can lead to substantially different losses and downstream performance. Cryptographic primitives like one way functions and block cyphers also provide examples where the order of conditioning can make all the difference to how entropic the data appears, for example considering autoregressive modeling of two prime numbers followed by their product vs the reverse ordering. These experimental results and cryptographic ideas indicate what can be learned is dependent on the ordering of the data, which in turn suggests that different amounts of “information” are extracted from these different orderings.

Our time-bounded definitions capture this discrepancy. Under the existence of one way permutations, we can prove that a gap in prediction exists over different factorizations for time bounded entropy.

Theorem 13

Let $f$ f be a one-way permutation and let $X = U_{n}$ X equals U n be uniform and $Y = f (X)$ Y equals f of X.
$H_{Poly} (X ∣ Y) + H_{Poly} (Y) > H_{Poly} (Y ∣ X) + H_{Poly} (X) + ω (lo g n)$
Proof: see Appendix A.5.

As a corollary, we show no polynomial time probability model which can fit a one way function’s forward direction can satisfy Bayes theorem (see Theorem 26). Adding to these theoretical results, we look empirically at the gap in time-bounded entropy for one way functions, and the gap in both entropy and epiplexity over two orderings of predicting chess data.

In Figure 4(a), we choose $f$ f to be given by the 8 steps of evolution of the ECA rule 30 with state size $n$ n and periodic boundary conditions (Wolfram and Gad-el Hak, 2003)as described by Wolfram and Gad-el Hak. Though distinct from the one way functions used in cryptography, rule 30 is believed to be one way (Wolfram and Gad-el Hak, 2003) and unlike typical one way functions, the forward pass of rule 30 can be modeled by an autoregressive transformer, which we demonstrate by constructing an explicit RASP-L (Zhou et al., 2023; Weiss et al., 2021)as shown by Zhou and colleagues and Weiss and colleagues program in Appendix D. As shown in Figure 4(a), the model achieves the Shannon entropy (gray) in the forward direction, but has a consistent gap in the reverse direction.

Beyond just how the random information can vary with orderings, the structural information can also differ as we will show next. We demonstrate this fact by training autoregressive transformer models on the Lichess dataset, a large collection of chess games where the moves are recorded in algebraic chess notation. We consider two variants of this dataset: (1) formatting each game as the move sequence followed by final board state in FEN notation, and (2) formatting each game as the final board state followed by the move sequence, as illustrated in Figure 4b. We provide full experiment details in Section C.4. While there is no clear polynomial vs non-polynomial time separation in this setup, the first ordering is analogous to the forward direction as the final board

Three-part figure: (a) line graph showing losses for forward vs reverse one-way functions as state size n increases; (b) diagram comparing forward vs reverse factorization order in chess and string functions; (c) line graph showing entropy and epiplexity for chess orderings over compute. Figure 4: Factorization matters. (a) We compare the losses from modeling a conjectured one way function in forward and reverse as the state size $n$ n is increased. The model reaches Shannon entropy in the forward direction, but with a persistent gap in the reverse direction. (b) The two orderings produce different outcomes. Analogous to the OWF, predicting the moves followed by the final board state is the direction that can be predicted with a straightfoward computation. Predicting the board first and then the moves requires more complex behaviors. (c) As compute increases, the same chess data presented in the reverse order leads to higher time-bounded entropy and epiplexity, showing it becomes more difficult to predict but allows more structure to be learned.

state can be straightforwardly mapped from the moves with a simple function, while the latter ordering is analogous to the reverse direction, where recovering the moves from the final board state requires the inverse function that infers the intermediate moves from the final state. We hypothesize the reverse direction is a more complex task and will lead the model to acquire more structural information, such as a deeper understanding of the board state. Figure 4c confirms this hypothesis, showing that the reverse order has both time-bounded higher entropy and epiplexity. This gap vanishes at small compute budgets where the model likely learns only surface statistics common to both orderings before the additional complexity of the reverse task forces it to develop richer board-state representations.

5.3 Paradox 3: Likelihood Modeling is Merely Distribution Matching

There is a prevailing view that from a particular training distribution, we can at best hope to match the data generating process. If there is a property or function that is not present in the data-generating process, then we should not expect to learn it in our models. As an extension, if the generating process is simple, then so are models that attempt to match it. This viewpoint can be supported by considering the likelihood maximization process abstractly, $ar g min_{P} E_{X \sim Q} [- lo g P (X)] = Q$ the argument minimum over P of the expected value under Q of negative log P of X, equals Q; the test NLL is minimized when the two distributions match. The extent to which the distributions differ is regarded as a failure either from too limited a function class or insufficient data for generalization. From these arguments we could reasonably believe that AI models cannot surpass human intelligence when pretraining on human data. Here we provide two classes of phenomena that seem to contradict this viewpoint: induction, and emergence. In both cases, restricting the compute available to AI models leads them to extract more structural information than what is required for implementing the generating process itself.

5.3.1 Induction

The generative modeling community is often challenged with simultaneously wanting a tractable sampling process and tractable likelihood evaluation, with autoregressors, diffusion models, VAEs,

A three-part figure showing a data generation process diagram and two plots comparing train loss versus compute for different numbers of hidden bits or rows alongside measured epiplexity bar charts. Figure 5: Studying induction through epiplexity. (a) Our setup for creating induction problems. (b) Predicting Rule 30 ECA with hidden inputs. The LLM must induct on the $h$ bits missing from the input, paying a cost exponential in $h$ . For $h$ small enough but $> 0$ , epiplexity is increased. (c) Predicting Markov chain samples with hidden transition probabilities. Models that need to both use the provided probabilities and induct on the missing ones acquire the most epiplexity.

GANs, and normalizing flows each providing different approaches. For natural generative processes, it is often the case that one direction may be much more straightforward than the other. Here we investigate generative processes which can be constructed by transforming latent variables such that computing likelihoods requires inducting on the values of those latents.

A window into the phenomenon can be appreciated through this quote from Ilya Sutskever:

“You’re reading a murder mystery and at some point the text reveals the identity of the criminal. … If the model can predict [the name] then it must have figured out [who perpetrated the murder from the evidence provided].” (Sutskever, 2019)Sutskever, 2019

The author of the book on the other hand, need not have made that same induction. Instead, they may have chosen the murderer first and then painted a compelling story of their actions. This example highlights a gap between the generating process and the requirements of a predictive model, a gap which we explore with the following more mathematical setup.

As we illustrate in Figure 5(a), consider a simple to model random variable $Z$ over ${0, 1}^{n}$ Z over the set of binary strings of length n which we transform with two functions $m$ and $f$ , which are both short in length and efficient to compute, and produce the data $Y = (m (Z), f (Z))$ Y equals m of Z and f of Z. We choose $m : {0, 1}^{n} \to {0, 1}^{n - h}$ m from binary strings of length n to binary strings of length n minus h as a masking function which removes the bits at a total of $h$ h fixed locations in the input, leaving the rest unchanged. The generating process is simple to implement and can be executed efficiently. Now consider a likelihood generative model learning to model $Y$ Y, under any given factorization. With appropriate properties of the function $f$ f, in producing the likelihoods the model must learn to induct on the missing information in the state $Z$ Z, and then apply the transformation given by the data generating process. We consider cases both where the function $f$ f is hard to invert and those where $f$ f is not especially hard to invert. In both cases, predictive circuits must be learned that were not present in the data generating process, but with hard $f$ f these circuits only appear at exponentially high compute.

Induction Hard: Rule 30 ECA. For the first setting we use uniform $Z = U_{n}$ Z equals U n and $f$ f as 4 steps of the rule 30 ECA on state size $n = 32$ n equals 32, $m$ m simply removes the first $h$ h bits, and we also compute the loss only on $f (Z)$ f of Z (conditioned on $m (Z)$ m of Z) as the bits in $m (Z)$ m of Z are uniform and only add noise. We use an LLM, and the loss curves and measured epiplexities are shown in Figure 5b. The loss converges to the number of hidden bits $- lo g_{2} P (f (Z) ∣ m (Z)) = h$ minus log base two of the probability of f of Z given m of Z, which equals h, representing the $2^{h}$ two to the h possible inductions on the hidden state. However, the total compute required for this loss to converge grows exponentially with $h$ h, an overall behavior consistent with a strategy of passing all $2^{h}$ two to the h candidates through $f$ f and then eliminating inconsistent candidates as values of $f (Z)_{i}$ f of Z sub i are observed with the autoregressive factorization. This complex learned function stands in contrast with the mere $f (Z)$ f of Z

and simple postprocessing removing bits with masking. This picture is mirrored by the measured epiplexity: as the model is forced to induct on the missing bits, the epiplexity grows.

Induction Easy: Random Markov Chains. In the second setting, we leverage the statistical induction heads setup (Edelman et al., 2024)by Edelman and colleagues with a few modifications. $Z$ Z is given by a random Markov chain transition matrix with $V = 8$ V equals eight symbols, and $m$ m removes $h$ h columns of the matrix at fixed random locations. The function $f (Z)$ f of Z computes a sampled sequence from the Markov chain of length $n = 512$ n equals five hundred twelve. When $h > 0$ h is greater than zero, the optimal solution involves 1) using the provided rows $Z$ Z to perfectly predict next-token probabilities on $V - h$ V minus h of the symbols, and 2) inducting on the missing rows of $Z$ Z in-context based on the empirically observed transitions to improve remaining predictions. For $h = 0$ h equals zero, the first is sufficient, and for $h = 8$ h equals eight the second is sufficient. In Figure 5c, we find evidence that both strategies are employed whenever $0 < h < 8$ h is between zero and eight as the final loss achieved matches the theoretical loss of both (the lower of the two dotted lines). The higher horizontal line marks the loss achievable using 1) along with a simple unigram strategy (Edelman et al., 2024), showing that the transformer learns 1) first and later the induction strategy 2). While the data generating program only involves strategy one followed by the postprocessing masking step, the model must learn both strategies to reach these values. Measured epiplexity matches this picture, with values $0 < h < 8$ h between zero and eight having higher epiplexity than $h = 0$ h equals zero or $h = 8$ eight. We emphasize that the induction strategy was never present in the data-generating process, yet it is learned by a generative model trained on that same data distribution. In Section G, we argue the induction phenomena are not specific to autoregressive models, but occur more generally for models trained via Maximum Likelihood Estimation as they need to be able to evaluate the likelihood $P (x)$ P of x for an arbitrary data point $x$ x rather than merely sample random $x$ x from $P$ P. VAEs (Kingma et al., 2013)Variational Autoencoders provide a clear example of explicitly performing induction in non-autoregressive models: the encoder is trained specifically to approximate the posterior $P_{Z ∣ X}$ P of Z given X, enabling tractable likelihood estimation, yet this encoder is entirely unnecessary if the goal is merely to sample from the model.

In both of the hard and easy induction examples, the size of the program needed to perform the induction strategy is greater than the size of the program needed generate the data. We can expect that with limited computational constraints, it will not be generically possible to invert the generation process using brute force, and thus, in cases where alternative inverse strategies exist (like the easy induction example with the statistical induction heads), those additional strategies increase the epiplexity. Given that there is likely no single generally applicable strategy for these computationally efficient inverses across problems, it is likely to be possible as a source of epiplexity.

To make these statements more precise, it seems likely that there are no constants $c_{1}$ and $c_{2}$ for which the following property holds:

Limited Epiplexity Increase Property

Given any program $G : {0, 1}^{k} \to {0, 1}^{n}$ running in time at most $T_{1}$ on random variable $Z$ , the epiplexity of $G (Z)$ is increased by at most a constant more than the size of $G$ : $S_{T_{2}} (G (U_{k})) \leq ∣ G ∣ + c_{1}$ for $T_{2} (n) > T_{1} (k) + c_{2}$ .

In other words, there is no bound on how much larger the MDL optimal probability model will be than the generating program even when the model is allowed more compute than the generating program. We present this phenomenon in contrast to Shannon information or Kolmogorov complexity, where a function and its inverse can differ in complexity by at most a fixed constant: $K (F^{- 1}) = K (F) + O (1)$ K of F inverse equals K of F plus big O of one. When the computational constraints are lifted, the brute force inverse is possible, and there is no essential gap between deduction and induction, or between sampling and likelihood computation.

5.3.2 Emergent Phenomena

One of the most striking counterexamples to the “distribution matching” viewpoint is emergence. Even when a system’s underlying dynamics admit a simple description, an observer with limited computation may need to learn a richer, and seemingly unrelated, set of concepts to predict or explain its behavior. As articulated by Anderson (1972)Anderson in nineteen seventy two, reductionism—that a complex object’s behavior follows from its parts—does not guarantee that knowing those parts lets us predict the whole. Across biology and physics, many-body interactions give rise to behaviors (e.g. bird flocking, Conway’s Game of Life patterns, molecular chemistry, superconductivity) that are not apparent from the microscopic laws alone. Here we sketch how emergence critically relates to the computational constraints of the observer, demonstrating how observers predicting future states may be required to learn more than their unbounded counterparts who can execute the full generating process.

Consider Type-Ib emergence in the Carroll and Parola (2024)Carroll and Parola classification, in which higher-level patterns arise from local rules yet resist prediction from those rules. A canonical example is Conway’s Game of Life (see Appendix E for definition), where iterating a simple computational rule $Φ$ phi on a 2D grid leads to complex emergent behavior. For observers that lack the computational resources to directly compute the iterated evolution $Φ^{k}$ phi to the k, an alternate description must be found. In the state evolution, one can identify localized “species” (static blocks, oscillators, gliders, guns) which propagate through space and time. By classifying these species, learning their velocities, and how they are altered under collisions with other species, as well as the ability to identify their presence in the initial state, computationally more limited observers can make predictions about the future state of the system. Doing so, however, requires a more complex program in the sense of description length, and the epiplexity will be higher. We can formalize this intuition into the following definition of emergence.

Definition 14 (Epiplexity Emergent)

Let ${Φ_{n}}_{n \geq 1}$ be a computable family $Φ_{n} : {0, 1}^{n} \to {0, 1}^{n}$ and let ${X_{n}}_{n \geq 1}$ be random variables over ${0, 1}^{n}$ . We say $(Φ, X)$ is epiplexity-emergent if there exist time bounds $T_{1}, T_{2}$ with $T_{1} (n) = o (T_{2} (n))$ and an iteration schedule $k (n)$ such that as $n \to \infty$ ,
$S_{T_{1}} (Φ (X) ∣ X, n) - S_{T_{2}} (Φ (X) ∣ X, n) S_{T_{1}} (Φ^{k} (X) ∣ X, n, k) - S_{T_{2}} (Φ^{k} (X) ∣ X, n, k) = Θ (1), = ω (1), (10)$
where we have suppressed the dependence of $X_{n}$ and $Φ_{n}$ on $n$ for clarity.

In words, $Φ, X$ phi and X displays emergent phenomena if two observers see equivalent structural complexity in the one step map, but asymptotically more structural complexity in the multistep map for the observer with fewer computational resources.

Considering $Φ$ phi from the Game of Life as an example, $P (Φ (X) ∣ X, n)$ could be well estimated by both $T_{1}$ and $T_{2}$ -bounded observers using the exact time evolution rule, using constant bits for both. $P (Φ^{k} (X) ∣ X, n, k)$ could be estimated by $T_{2}$ using the iterated rule, but not by $T_{1}$ . Using knowledge of the different pattern species improves predictions of $Φ^{k} (X) ∣ X$ phi to the k of X, given X, so they would need to be learned; however, the number of patterns that needs to be considered in the time-bounded optimal solution is unbounded, and grows with the size of the board $n$ n, and thus the gap in epiplexity for the two time bounds grows with $n$ n. We have not proven that the Game of Life satisfies this definition, which is likely difficult as small changes to the evolution rule can destroy the emergent behavior; however, we provide empirical evidence for this set being non-empty with the example below.

In Figure 6, we empirically demonstrate the emergence phenomenon by training a transformer to predict the iterated dynamics of ECA rule 54, a class IV rule that produces complex patterns. As in Conway’s Game of Life, a model with sufficient computation can exactly simulate the dynamics by directly iterating the per-step rule—a brute-force solution with a short description length. However, a compute-limited model cannot afford this approach and must instead learn emergent patterns (e.g., gliders and their collision rules) that approximately shortcut the infeasible exact simulation.

Two line graphs showing MDL and epiplexity for non-looped and looped models against compute Figure 6: Emergence in ECA. Compute-constrained models extract high epiplexity from data generated by simple rules, trading increased program length for reduced computation.

The brute-force solution can be naturally implemented by learning to autoregressively unroll intermediate ECA states rather than directly predicting the final state, resembling the use of chain-of-thought (Wei et al., 2022) or looped transformers (Dehghani et al., 2018; Giannou et al., 2023; Saunshi et al., 2025). We provide experiment details in Section C.8. While initially the non-looped model (directly predicting final state) gradually achieves better MDL and higher epiplexity as compute increases, we a identify critical compute threshold beyond which the looped model suddenly becomes favorable, causing an abrupt drop in MDL and epiplexity, likely by learning the simple, brute-force solution. Below this threshold, the looped model underperforms likely because it lacks the compute to fully unroll the dynamics. The non-looped model, unable to rely on brute-force simulation, must instead learn increasingly sophisticated emergent rules, recognizing more species and their interactions, thus causing epiplexity to initially rise with compute before eventually falling.

While this experiment cleanly demonstrates how compute-limited models can learn richer structure from data, it is a rare example where the brute-force solution is experimentally accessible, and where training with more compute can therefore be counterproductive. In most practical settings, such as AlphaZero or modeling natural data, the corresponding simple, brute-force solutions are computationally infeasible, such as performing exhaustive search over an astronomical game tree and predicting natural data by simulating the laws of physics. Therefore, for most settings, models trained with more compute do continue to learn more structure, as the regime where brute-force becomes viable remains out of reach.

We explore other kinds of emergence, such as in chaotic dynamical systems or in the optimal strategies of game playing agents in Appendix F. Each of these examples presents clear evidence that in pursuit of the best probability distribution to explain the data, observers with limited compute will require models with greater description length than the minimal data generating process in order to achieve comparable predictive performance (Martínez et al., 2006; Redeker, 2010). Epiplexity provides a general tool for understanding and quantifying these phenomena of emergence, and how simple rules can create meaningful, complex structures that AI models can learn from, as recently demonstrated empirically by Zhang et al. (2024)Zhang and colleagues.

6 Epiplexity, Pre-Training, and OOD Generalization

Pre-training on internet-scale data has led to remarkable OOD generalization, yet a thorough understanding of this phenomenon remains elusive. What kinds of data provide the best signal for enabling broad generalization? Why does pre-training on text yield capabilities that transfer across domains while image data does not? As high-quality internet data becomes exhausted, what metric should guide the selection or synthesis of new pre-training data? In this section, we show how epiplexity helps answer these foundational questions.

OOD generalization is fundamentally about how much reusable structure the model acquires, not how well it predicts in-distribution. Two models trained on different corpora can achieve the same

in-distribution loss, yet differ dramatically in their ability to transfer to OOD tasks. This happens because loss captures only the residual unpredictability, corresponding to the time-bounded entropy, not how much reusable structure the model has internalized to achieve that loss. Epiplexity measures exactly this missing component: the amount of information in the learned program. Intuitively, loss indicates how random the data looks to the model, while epiplexity indicates how much structure the model must acquire to explain away the non-random part. If OOD generalization depends on reusing learned mechanisms rather than memorizing superficial statistics, then epiplexity is a natural lens through which to understand the relationship between pre-training data and OOD transfer. As a motivating toy example, Zhang et al. (2024)Zhang and colleagues observed that type IV ECA rules tend to benefit downstream tasks, which correlates with Figure 3 where we showed that rule 54 (a type IV rule) induces much larger epiplexity compared to other types of ECAs.

6.1 Epiplexity Correlates with OOD Generalization in Chess

Three bar charts showing Epiplexity, Puzzle Accuracy, and Centipawn Accuracy for forward and reverse model orderings. Figure 7: Epiplexity and OOD performance in chess. Models trained on the higher epiplexity reverse order performs better in OOD tasks.

We finetune models trained on either ordering from Section 5.2 on two downstream tasks: (1) solving chess puzzles, where the model must predict the optimal next move given a board state (Burns et al., 2023), and (2) predicting centipawn evaluation, where the model evaluates positional advantage from FEN notation—a more substantial distribution shift from next-move prediction learned in pre-training. Experiment details are in Section C.4. As shown in Figure 7, the reverse (board-then-moves) ordering yields higher epiplexity and better downstream performance: matching accuracy on chess puzzles but significantly higher accuracy on the centipawn task. This result supports our hypothesis: the reverse order forces the model to develop richer board-state representations needed to infer the intermediate moves, and these representations transfer to OOD tasks like centipawn evaluation that similarly require understanding the board state. This example reflects a more general principle: epiplexity measures the learnable structural information a model extracts from data to its weights, which is precisely the source of the information transferable to novel tasks, making epiplexity a plausible indicator for the potential of OOD generalization. However, we emphasize that higher epiplexity does not guarantee better generalization to any specific task: epiplexity measures the amount of structural information, irrespective of its content. A model trained on high epiplexity data can learn a lot of structures, but these structures may or may not be relevant to the particular downstream task of interest.

6.2 Measuring Structural Information in Natural Data

Among different modalities of natural data, language has proven uniquely fruitful for pre-training, not only for improving in-distribution performance such as language understanding (Radford et al., 2019), but also for out-of-distribution tasks such as robotics control (Ahn et al., 2022), formal theorem proving (Song et al., 2024), and time-series forecasting (Gruver et al., 2023). While equally abundant information (as measured by raw bytes) is available in other modalities, such as images and videos, pre-training on those data sources typically does not confer a similarly broad increase in capabilities. We now show that epiplexity helps explain this asymmetry by revealing differences in structural information across datasets. In Figure 8a, we show the estimated decomposition of the information in 5B tokens of data from OpenWebText, Lichess, and CIFAR-5M (Nakkiran et al., 2020) into epiplexity (structural) and time-bounded entropy (random) with a time-bound of $6 \times 1 0^{18}$ six times ten to the eighteen FLOPs, by training models of up to 160M parameters on at most 5B tokens using sequential coding. In all cases, epiplexity accounts for only a tiny fraction of the total information, with the OpenWebText data carrying the

A series of three charts showing epiplexity in natural data, estimation via scaling laws, and comparison of ADO versus natural data selection. (a) Epiplexity in natural data (b) Estimation via scaling laws (c) ADO: epiplexity and downstream metrics Figure 8: Epiplexity reveals differences in the structural information across data modalities and can guide pre-training data selection. (a) Estimated epiplexity and time-bounded entropy using requential coding for 1B OpenWebText, Chess, and CIFAR-5M tokens at $6 \times 1 0^{18}$ FLOPs. (b) Estimated values based on scaling laws and prequential coding for 1T language, image, and video tokens at $1 0^{25}$ FLOPs. (c) Selecting pre-training data using ADO (Jiang et al., 2025)Jiang and colleagues leads to different loss curves than standard sampling (natural). Our measurement shows ADO selects data with higher epiplexity, in line with the improved downstream performance and OOD perplexity on different text corpora.

most epiplexity, followed by chess data. Despite carrying the most total information, CIFAR-5M data has the least epiplexity, as over 99% of its information is random (e.g., unpredictability of the exact pixels).

6.3 Estimating Epiplexity from Scaling Laws

Two line plots showing epiplexity and optimal training tokens converging as compute increases for different data types. Figure 9: Epiplexity and optimal training tokens for each fixed dataset converge to predictable limits as compute increases.

We can estimate the epiplexities of larger datasets at higher compute budgets using reported scaling laws, which describe the loss achieved by an $N$ N-parameter model trained on $D$ D tokens as

L (N, D) = E + (N / N_{0})^{- α} + (D / D_{0})^{- β}

for some dataset-specific constants $α, β, N_{0}, D_{0}, E$ alpha, beta, N zero, D zero, and E (Hoffmann et al., 2022; Kaplan et al., 2020; Henighan et al., 2020). By estimating the model’s description length via the prequential coding approach (Section 4.3), we obtain estimates for the epiplexity and time-bounded entropy for language, image, and video datasets, with varying resolutions and tokenizations of size $D = 1 0^{12}$ D equals ten to the twelfth (1T) tokens under a compute budget of $1 0^{25}$ FLOPs (equivalent to the training compute of Llama3 70B), illustrated in Figure 8b (see details in Section C.9). Consistent with our smaller-scale experiments, we find that language data has the highest epiplexity, while image data has the least. For image data, applying VQ tokenization leads to a significant increase in epiplexity, likely as a result of allowing the model to focus on higher-level semantic structures. Video data has less time-bounded entropy and epiplexity than image data with the same resolution, likely due to significant redundancy across the temporal dimension.

Using this approach, we can also gain some analytical insights about epiplexity for data admitting scaling laws of this form. As we derive in Section B.3, for a fixed dataset $X$ X with $D$ D tokens, the optimal split of the compute budget between training and inference (evaluating the trained model on $X$ X) approaches a fixed ratio as compute increases, with the optimal asymptotic training tokens $D_{\infty}^{⋆} = D$ D star infinity equals D and asymptotic epiplexity $S_{\infty} (X) = \frac{β}{1 - β} D_{0}^{β} D^{1 - β}$ S infinity of X equals beta over one minus beta, times D zero to the beta, times D to the one minus beta, both illustrated in Figure 9. As expected, the maximum amount of extractable structural information is ultimately capped by the dataset size $D$ D when compute is not the bottleneck, and epiplexity can increase further if we also grow the dataset

size. For large $D$ D, the scale of the asymptotic epiplexity is primarily determined by $β$ beta and $D_{0}$ D zero, with smaller $β$ beta and larger $D_{0}$ D zero leading to higher epiplexity, corresponding to slower improvement in loss and thus more (estimated) information absorbed per token. In line with our discussion on emergence in Section 5.3.2, it is possible that with significantly more compute much simpler programs can model these natural datasets, such as by directly simulating the basic laws of physics from which the natural world emerges, but the amount of required computation is likely so high that such programs remain inaccessible to any physically realizable observer and we must treat natural data as having high epiplexity for all practical purposes.

6.4 Pre-Training Data Selection and Curriculum for Language Models

A crucial step in pretraining a language model is designing the composition of the pretraining data, but there lack clear guidelines for this step. Existing data mixtures are designed through extensive trial-and-error and rely on heuristic guidelines such as “diversity” or “high-quality”. More importantly, the primary way of comparing different training data is via perplexity metrics of held-out datasets and downstream performance. These procedures are highly susceptible to data contamination, overfitting to a narrow set of downstream evaluations, and Goodhart’s law. After all, no suite of downstream evaluations is extensive enough to faithfully capture the range of tasks that a general-purpose language model will encounter in the real world.

As we argued above, epiplexity measures the structural information learned by the model, which could be affected by data selection strategies. Jiang et al. (2025)Jiang and colleagues demonstrated that models of the loss curves for different data subsets can be used to dynamically adjust the data distribution online to favor data subsets whose training losses are decreasing faster. Intuitively, this objective coincides with increasing the prequential approximation of epiplexity described in subsection 4.3 as it leads to a higher area under the loss curve. We hypothesize that the proposed algorithm, Adaptive Data Optimization (ADO), inadvertently achieves higher epiplexity. Experiments of Jiang et al. (2025)Jiang and colleagues are conducted on decoder-only transformers with 1.3B parameters trained on 125B tokens from the Pile dataset (Gao et al., 2020). The models are evaluated on a suite of 7 zero-shot downstream tasks and two OOD validation datasets, SlimPajama (Soboleva et al., 2023) and FineWeb (Penedo et al., 2024).

In Figure 8c(c), we show the estimated epiplexity and the downstream performance as well as perplexity on two OOD datasets, adapted from Jiang et al. (2025)Jiang and colleagues. As shown in Jiang et al. (2025)Jiang and colleagues, ADO achieves higher downstream performance than a standard data sampling strategy that uniformly samples from the entire dataset (denoted by Natural in Figure 8c), despite not being optimized for any of these metrics. Interestingly, we see that ADO indeed achieves higher epiplexity measured by prequential coding. While these downstream evaluations do not capture everything about a pretrained model, they do offer evidence that epiplexity is a potentially useful concept for understanding the intrinsic value of pretraining data without particular downstream evaluations.

Epiplexity builds on a number of related ideas in algorithmic information theory and complexity science that attempt to theoretically characterize meaningful information. A group of closely related concepts are sophistication (subsection 2.2), effective complexity, and logical depth. Similar to sophistication, effective complexity aims to separate random from structural content (Gell-Mann and Lloyd, 1996). From a different starting point, Bennett (1988)Bennett introduced logical depth, measuring the number of time steps required by a nearly optimal program to produce a given string, and which was later shown to be equivalent to sophistication through the busy beaver function (Antunes et al., 2005; Ay et al., 2010). Several other formal measures have been developed to quantify structured

or meaningful complexity. Algorithmic statistics offers a principled decomposition of data into regular versus random components by introducing the notion of an algorithmic sufficient statistic (Vereshchagin and Vitányi, 2004)Vereshchagin and Vitanyi, 2004, a concept closely tied to sophistication. Relatedly, statistical complexity in computational mechanics (Shalizi and Crutchfield, 2001)Shalizi and Crutchfield, 2001 measures the entropy of causal states in an optimally predictive model, capturing structure in time-series data. As we argued above, these existing notions of complexity fail to account for the limited computation available to the observer, which is essential for understanding machine learning algorithms. Being oblivious to computational limits means that they cannot characterize CSPRNGs or encrypted objects as being random. One might think that these failures are surface-level; for example, a plausible strategy would be to upgrade sophistication by replacing Kolmogorov complexity with time-bounded Kolmogorov complexity in (Definition 5). However, this approach does not work for several reasons, the most obvious being that CSPRNG outputs do have short and efficiently runnable generating programs and thus their time-bounded Kolmogorov complexities are small. A more subtle reason is that doing so results in trivial sophistication for all strings, which we discuss in more detail in Appendix A.6.

Our work is also closely related to several lines of work trying to characterize observer-dependent notions of information. In cryptography, Barak et al. (2003)Barak and colleagues and Hsiao et al. (2007)Hsiao and colleagues discuss several possible definitions for computational pseudoentropy, an observer-dependent analogue of entropy. HILL-pseudoentropy (Håstad et al., 1999)Hastad and colleagues, 1999 is defined relative to a class of tests: a source is considered random if no test within the class can distinguish it from a high-entropy distribution with nontrivial advantage, and Yao-pseudoentropy is defined via compressing and decompressing an object for example. Both definitions are closely related to time-bounded entropy, which measures the random content to a given computationally bounded observer; however, our formulation directly maps on to machine learning practice and it allows for separating out the structural information content, a key contribution of our work. More recently, Xu et al. (2020)Xu and colleagues propose $V$ -entropyV-entropy, a generalization of Shannon entropy to the minimum expected negative log probability over a given family of probability models, such as those with given computational constraints. With $V$ -entropyV-entropy, the symmetry of information can be violated, and so too can the data processing inequality, though neither is explicitly proven in the paper. Unlike time-bounded entropy, the computational constraint in $V$ -entropyV-entropy only limits the inference time, and does not account for the time to find such a model. Hence, the minimizer can be far away from the regime that is practically evaluated (such as models that are trained on infinite data or with infinite compute). While these undesirable behaviors can be overcome by imposing further data constraints, we believe our formulation of imposing a single bound on both training and inference time leads to fewer complications. More importantly, both pseudoentropy and $V$ -entropyV-entropy, much like time-bounded entropy, capture only the random component of information since it still measures the unpredictability of the random variable under the best feasible model. For understanding what useful information a model has learned, we are more interested in the non-random component of information as measured by epiplexity. Using existing measures of data complexity, such as the Lempel-Ziv complexity and Wolfram classification, Zhang et al. (2024)Zhang and colleagues showed that models trained on complex data like Class IV ECA rules tend to perform better on downstream tasks.

Several works have also explored how to quantify data complexity. Dziugaite and Roy (2025)Dziugaite and Roy suggests that the complexity of a minimal near-optimal reference model can be viewed as a measure of data complexity under the PAC-Bayes framework and how such data complexity gives rise to empirical scaling laws. This perspective is related to epiplexity in that both associate data complexity with the size of compact models that explain the data well. However, the two notions differ in important ways. In particular, the PAC-Bayes formulation is concerned with the existence of some small reference model achieving good in-distribution performance, whereas epiplexity characterizes the amount of structural information extractable by a computationally bounded observer, formalized through a two-part code that explicitly accounts for the cost of obtaining such a model. Further, our primary interest is not in characterizing in-distribution generalization, but in using epiplexity to measure the

intrinsic value of data in settings that extend beyond supervised learning. Relatedly, Hutter (2021)Hutter shows that power-law learning curves can emerge under specific assumptions on the data-generating distribution, illustrating how properties of the data itself can shape empirical scaling behavior. While this line of work focuses on explaining observed learning dynamics rather than defining a complexity measure, it similarly emphasizes the role of data structure in determining learning outcomes. These perspectives on data complexity can be viewed as instances of coarse graining, where one seeks a compressed representation that preserves some notion of “relevant” structure. A canonical example is the information bottleneck framework, which formalizes coarse graining as a trade-off between compression and retained information about a relevant variable (Tishby et al., 2000). Epiplexity is aligned with this perspective, but rather than defining relevance through a task variable or through distinguishability to tests, it measures the amount of structural information extractable by a computationally bounded learner, while explicitly accounting for the cost of obtaining the model.

More broadly, our work is related to several lines of work on how resource constraints fundamentally alter the notion of simplicity and learnability. In algorithmic information theory, Schmidhuber (2002)Schmidhuber proposes the speed prior, which replaces Solomonoff’s universal prior with a computable semimeasure that favors both shorter program length and smaller computation time, thereby incorporating computational resources directly into the definition of simplicity. In learning theory, a related line of work shows that computational limitations can directly affect what can be learned from data. For instance, in the problem of sparse PCA detection, Berthet and Rigollet (2013)Berthet and Rigollet show that although there exist procedures that succeed with an information-theoretically minimal number of samples, any algorithm that runs in polynomial time necessarily requires more data under widely used average-case hardness assumptions. Memory and space constraints alone can also qualitatively change learnability. Steinhardt et al. (2016)Steinhardt and colleagues show that restricting a learner’s memory can dramatically increase the amount of data required to learn, even when the target concept itself has a very concise description. They identify parity functions as a canonical example where this tension is conjectured to be sharp. Raz (2018)Raz later resolves this conjecture by proving that any learner with sub-quadratic memory requires exponentially many samples to learn parity from random examples.

8 Discussion

Much of classical information theory is concerned with the representation and transmission of information, and abstracts away key aspects of the computational processes by which information is extracted and used. While complexity theory and cryptography treat computation as fundamental, machine learning theory typically does not. Yet learning, whether biological or artificial, is an inherently computational process. What can be learned from data depends not only on statistical feasibility, but on the available resources. This perspective calls for more theoretical tools that place computation on an equal footing with information.

This work reframes information as a property of data relative to a computationally-bounded observer, and demonstrates that information can be decomposed into time-bounded entropy and epiplexity, a formalization of structural information. It also sheds light on how perceived information can be changed through computation. This perspective resolves several tensions between information theory and empirical machine learning—including the usefulness of synthetic data, the dependence of learning on factorization and ordering, and the emergence of structure beyond the data-generating process itself. Technically, epiplexity connects ideas from algorithmic statistics, cryptography, and learning theory, showing that standard assumptions (i.e., existence of one-way functions) suffice to produce distributions with high structural complexity for efficient learners.

Our framework opens several exciting directions for future work. On the theoretical side, it invites a systematic and more fine-grained understanding of how structural information changes with

computational budget, model class, and data transformations, potentially yielding new lower bounds and impossibility results for representation learning and transfer. Taking information and computation as the fundamental resources may offer new explanations for the relative universality observed in large-scale training, including why scaling law exponents depend only weakly on architectural and optimizer details. There is also a possibility of a compute-aware analogue of classical notions such as sufficient statistics and information bottlenecks. More broadly, framing emergence, induction, and generalization through the lens of computationally bounded observers may offer a unifying language across learning theory, algorithmic information theory, cryptography, and complexity theory.

On the empirical side, epiplexity provides a way to reason about why some data sources, formatting, and transformations can lead to more transferable models than others, even when they do not improve training loss. The framework suggests that pretraining data should be evaluated not only by held-out perplexity, but by how much reusable structural information it induces in a computationally bounded model. This perspective helps explain empirical successes of curriculum design, data ordering, augmentation strategies, and even synthetic data that appear counterintuitive from a purely statistical viewpoint. Our empirical estimator offers a concrete starting point for comparing datasets and interventions in data centric research. In the long run, we believe epiplexity could provide guidance on how to generate new synthetic data from existing data.

Finally, representation learning can be understood as the gradual accumulation of epiplexity: the construction of increasingly rich internal programs that approximate a data distribution within a fixed time budget. While epiplexity in isolation is not a measure of generalization, or a complete theory of learning, this perspective raises the possibility of new notions of hardness for learning and transfer that are orthogonal to classical PAC-style measures, capturing not sample complexity but the size of the structure that must be extracted. Such notions may help explain why certain tasks appear to require disproportionately large models or long training horizons despite admitting simple generative descriptions, and why improvements in generalization sometimes correlate more strongly with training dynamics or data structure than with likelihood alone.

**Acknowledgements.** We thank NSF CAREER IIS-2145492, NSF CDS&E-MSS 2134216, and DARPA AIQ HR00112590066 for support, and Scott Aaronson, Alan Amin, Brandon Amos, Martin Marek, Zhili Feng, Vaishnavh Nagarajan, Patrick Shafto, Charlie Chen, Alex Ozdemir, Andres Potapczynski, and Ethan Baron for helpful feedback. This work was supported by Google’s TPU Research Cloud (TRC) program: [https://sites.research.google/trc/](https://sites.research.google/trc/). YJ thanks the support of the Google PhD Fellowship, and SQ thanks the support of the Two Sigma Fellowship.

References

[references omitted]

Appendix Outline

This appendix provides the technical details, proofs, and experimental specifications supporting the main text.

Appendix A presents rigorous proofs of all theoretical results, including properties of cryptographically secure pseudorandom number generators under time-bounded entropy and epiplexity (Theorem 9), creation of information through deterministic transformations (Theorem 12), the existence of high-epiplexity random variables (Theorem 10), the factorization dependence of information content (Theorem 13).

Appendix B details the practical methodology for estimating epiplexity, covering both prequential and requential coding implementations, hyperparameter optimization procedures for compute-optimal two-part codes, the connection between prequential and requential estimates under a static teacher assumption, and a solvable analytical model combining neural scaling laws with prequential coding. We also establish general properties showing that optimal model size and training tokens increase monotonically with compute budget, that optimal training tokens for prequential coding generally saturate at the test set size for large compute budgets, and that epiplexity and per-token entropy exhibit predictable monotonicity with respect to dataset size.

Appendix C provides comprehensive experimental specifications for all empirical results, including architectural choices, hyperparameters, and dataset details for elementary cellular automata experiments, easy and hard variants of induction tasks, chess experiments (with both pre-training data formatting and downstream evaluation tasks), natural data experiments on OpenWebText and CIFAR-5M, comparisons between prequential and requential coding estimates, and scaling law estimation procedures.

Appendix D presents executable RASP-L code demonstrating that elementary cellular automaton evolution rules can be implemented within the transformer computational model, providing constructive evidence that autoregressive transformers are capable of solving these tasks.

Appendix E contains definitions of elementary cellular automata and Conway’s Game of Life, emergence examples referenced in the paper.

Appendix F explores additional examples illustrating the relationship between emergence and epiplexity, including the Lorenz system as a case study in chaotic dynamics where entropy is created at a rate determined by Lyapunov exponents, and chess strategy as exemplified by the contrast between AlphaZero’s multi-million parameter networks solution at moderate compute and the simple minimax algorithm available at very high compute.

Appendix G argues that induction phenomena occur not merely in autoregressive models; instead, the key requirement is maximum likelihood estimation rather than autoregressive factorization specifically.

Appendix H provides a more comprehensive review of MDL, in particular on two-part code, one-part code and the notion of regret which can be interpreted as a generalization of epiplexity for all coding schemes.

Compute Resources. A cluster of 6 2080Ti was used for many of the smaller scale experiments. A cluster of 6 Titan RTX and 32 TPUv4 provided by the Google TPU Research Cloud was used for the more computationally expensive natural data experiments. We refer the reader to Jiang et al. (2025) for computational resources required in evaluating ADO.

Licenses. The Chess data used in Section 5.2 is released under Creative Commons CC0 license (database.lichess.org). The OpenWebText dataset used in Section 6.2 is released under Creative Commons CC0 license.

Appendix A. Proofs

First, we prove two short lemmas about the basic properties of epiplexity and time-bounded entropy.

Lemma 15 (Maximum expected description length)

For any random variable $X$ X on ${0, 1}^{n}$ the set of binary strings of length n there exists constants $c_{1}, c_{2}, c_{3}$ c one, c two, and c three such that:
$S_{T} (X) + H_{T} (X) \leq n + c_{1} (11)$
for time bounds $T (n) \geq c_{2} n + c_{3}$ T of n greater than or equal to c two n plus c three.

Proof Let $U_{n}$ U n be the uniform distribution $Q_{unif} (x) = 2^{- n}$ Q unif of x equals two to the minus n. $Q_{unif}$ Q unif can be computed in linear time (just by outputting $2^{- n}$ two to the minus n for each input) and with a program of constant size $c_{1}$ c one and in time $c_{2} n + c_{3}$ c two n plus c three with constants depending on the Turing machine..

∣ Q_{X}^{⋆} ∣ + E [- lo g Q_{X}^{⋆} (x)] \leq ∣ Q_{unif} ∣ + E [- lo g Q_{unif} (x)] \leq c + n .

$■$

Lemma 16 (Time-bounded entropy of uniform distribution)

Let $X = U_{n}$ X equal U n be the uniform distribution on ${0, 1}^{n}$ the set of binary strings of length n. The time-bounded entropy of $U_{n}$ U n for $T (n) \geq c_{2} n + c_{3}$ T of n greater than or equal to c two n plus c three is:
$n \leq H_{T} (X) \leq n + c_{1} . (12)$

Proof

For the lower bound, we have

E_{X} [- lo g Q (X)] = H (X) + D_{KL} (P_{X} ∥ Q) \geq H (X) = n

given that the KL is always positive. For the upper bound, we have that

H_{T} (X) \leq MDL_{T} (X) \leq n + c

. $■$

A.1 CSPRNGs have (nearly) maximal time-bounded Entropy and low epiplexity

Theorem 17

Let $X = U_{k}$ X equal U k and $n = ℓ (k)$ n equal ell of k for a non-uniform PRG $G$ G that admits advantage $ϵ (n)$ epsilon of n. Then, for every polynomial time bound $T (n)$ T of n,
$H_{T} (G (U_{k})) \geq n - 2 - n ϵ (k) . (13)$

Proof Fix $P \in P_{T}$ P in the set P T and let $L (x) = - lo g P (x)$ L of x equals minus log P of x. For each precision level $t \in {1, 2, \dots, n}$ t from one to n, we define the following distinguisher:

D_{t} (x) = 1 {L (x) \leq n - t} = 1 {P (x) \geq 2^{- (n - t)}} .

For any solution $P$ for $MDL_{T}$ , we have that $MDL_{T} (X) = ∣ P ∣ + E [- lo g P (X)] \leq n + c$ . Since both quantities are positive, it must be the case that $∣ P ∣ \leq n$ , which means that $∣ P ∣ \in poly (n)$ . Since $P$ belongs in $P_{T}$ and cannot be longer than $n$ , each $D_{t}$ is a non-uniform PPT algorithm with polysized advice (i.e., $P$ ) that CSPRNGs are secure against.

Uniform threshold bound. Let $U_{n}$ be uniform on ${0, 1}^{n}$ and set $A_{t} := {x : D_{t} (x) = 1}$ .

1 \geq x \sum P (x) \geq x \in A_{t} \sum P (x) \geq ∣ A_{t} ∣ 2^{- (n - t)} \Rightarrow ∣ A_{t} ∣ \leq 2^{n - t}

Hence,

Pr [D_{t} (U_{n}) = 1] = \frac{∣ A _{t} ∣}{2 ^{n}} \leq \frac{2 ^{n - t}}{2 ^{n}} = 2^{- t} .

CSPRNG transfers bound to $X := G (U_{k})$ . By the security of $G$ , for each $t$ ,

Pr [D_{t} (X) = 1] \leq Pr [D_{t} (U_{n}) = 1] + ϵ (k) \leq 2^{- t} + ϵ (k),

From threshold probabilities to an entropy lower bound. For any non-negative random variable $Z$ , we have the layercake representation:

E [Z] n - E [Z] = u = 0 \sum \infty (1 - P (Z \leq u)) = u = 0 \sum n - 1 1 - u = 0 \sum \infty (1 - P (Z \leq u)) = u = 0 \sum n - 1 1 - u = 0 \sum n - 1 (1 - P (Z \leq u)) - u = n \sum \infty (1 - P (Z \leq u)) = u = 0 \sum n - 1 P (Z \leq u) - u = n \sum \infty (1 - P (Z \leq u)) \leq u = 0 \sum n - 1 P (Z \leq u) . (14) (15) (16) (17) (18)

Now we change the bounds to be in terms of $t$ with $t = n - u$ . The lower bound becomes $t = n$ . The upper bound becomes $t = 1$ , which yields

n - E [Z] \leq u = 0 \sum n - 1 P (Z \leq u) = t = 1 \sum n P (Z \leq n - t) .

Let $Z = L (X) = - lo g P (X)$ :

n - E [Z] \leq t = 1 \sum n P (Z \leq n - t) = t = 1 \sum n P (D_{t} (X) = 1) \leq t = 1 \sum n 2^{- t} + ϵ (k) \leq 1 + n ϵ (k) .

The last two steps come from the fact that $X$ is a CSPRNG. This means that:

n - E [L (X)] \leq 1 + n ϵ (k) \Rightarrow E [- lo g P (X)] \geq n - n ϵ (k) - 1.

Since this is true for any $P \in P_{T}$ , taking the minimum over yields:

H_{T} (X) = H_{T} (G (U_{n})) = P \in P_{T} min E [- lo g P (X)] \geq n - n ϵ (k) - 1.

A.2 Deterministic transformation can increase time bounded entropy and epiplexity

Theorem 18

Let $G : {0, 1}^{k} \to {0, 1}^{n}$ be a CSPRNG which admits advantage $ε (k)$ and $U_{k}$ be the uniform distribution. $H_{poly} (G (U_{k})) > H_{poly} (U_{k}) + n - n ε (k) - k - O (1)$ . Proof: see Appendix A.1.

Proof By Lemma 15 applied to the uniform distribution on ${0, 1}^{k}$ , there is an absolute constant $c$ such that

H_{poly} (U_{k}) \leq k + c .

Rearranging gives $k \geq H_{poly} (U_{k}) - O (1)$ . Combining this with the assumed CSPRNG lower bound (Lemma 17),

H_{poly} (G (U_{k})) \geq n - 2 - n ε (k),

we obtain,

H_{poly} (G (U_{k})) - H_{poly} (U_{k}) \Rightarrow H_{poly} (G (U_{k})) \geq n - 2 - n ε (k) - (k + c) > H_{poly} (U_{k}) + n - n ε (k) - k - O (1) .

■

A.3 CSPRNGs have low epiplexity

Theorem 19

Let $X = U_{k}$ and $n = ℓ (k)$ for CSPRNG $G$ that admits advantage $ε (n)$ . Then, for every polynomial time bound $T (n)$ , the epiplexity of $Y = G (X)$ is,
$S_{T} (Y) \leq c + n ε (k) . (19)$

Proof We know from Theorem 17 that $H_{T} (G (U_{k})) \geq n - n ε (k) - 2$ , which means:

S_{T} (Y) + H_{T} (Y) \geq S_{T} (Y) + n - n ε (k) - 2. (20)

We also have from Lemma 15 that $S_{T} (Y) + H_{T} (Y) \leq n + c$ . Combining these two results yields:

S_{T} (Y) + n - n ε (k) - 1 \leq n + c \Rightarrow S_{T} (Y) \leq c + n ε (k) . (21)

■

A.4 Existence of High Epiplexity random variables

Definition 20 (Pseudorandom functions (PRF)) Let PRF be the class of keyed functions $F : {0, 1}^{k} \times {0, 1}^{n} \to {0, 1}^{m}$ F from the product of binary strings of length k and n, to binary strings of length m that are computable in polynomial time and satisfy the following property: For any probabilistic polynomial-time distinguisher $D$ D with oracle access to the provided function,

Pr_{K \sim U_{k}} [D^{F_{K} (\cdot)}] - Pr_{f \sim F_{n}} [D^{f (\cdot)}] < \frac{1}{n ^{c}}, (22)

for all integers $c > 0$ c greater than zero and sufficiently large $n$ n. Here, $F_{K} (\cdot)$ F sub K denotes the function $F (K, \cdot)$ with the key $K$ K fixed, and $F_{n}$ F sub n is the set of all functions mapping ${0, 1}^{n}$ to ${0, 1}^{m}$ .

Cryptographic assumptions. Assume one-way functions exist (secure against non-uniform PPT adversaries with inversion probability at most $ϵ (n)$ epsilon of n). By standard constructions (Håstad et al., 1999), this implies the existence of PRFs secure against non-uniform PPT distinguishers with advantage $poly (ϵ (n))$ polynomial in epsilon of n (and in particular negligible if $ϵ (n)$ is negligible).

Definition 21 (Heavy set) For a distribution $Q$ Q on ${0, 1}^{n}$ the set of binary strings of length n, $m < n$ , and a fixed threshold $t \geq 0$ t greater than or equal to zero, the $(Q, t)$ -heavy set is:

A_{Q, t} := {z : Q (z) \geq 2^{- 2 (m + t)}} . (23)

Lemma 22

Let $P$ P be a distribution on ${0, 1}^{n}$ the set of binary strings of length n with entropy $H (P) = m$ H of P equals m. If $K L (P, Q) \leq t$ the KL divergence of P and Q is at most t, then $P (A_{Q, t}) \geq \frac{1}{2}$ the probability of the heavy set under P is at least one half.

Proof First, observe the standard inequality:

E_{z \sim P} [lo g \frac{1}{Q ( z )}] = H (P) + K L (P ∥ Q) \leq m + t .

Applying Markov’s inequality, we get:

Pr_{z \sim P} [lo g \frac{1}{Q ( z )} \geq 2 (m + t)] \leq \frac{E [ - lo g Q ( z )]}{2 ( m + t )} \leq \frac{1}{2} . (24)

Taking the complement gives:

Pr_{z \sim P} [lo g \frac{1}{Q ( z )} \leq 2 (m + t)] = Pr_{z \sim P} [Q (z) \geq 2^{- 2 (m + t)}] = P (A_{Q, t}) \geq \frac{1}{2} . (25)

■

Lemma 23

Let $U_{n}$ U n be the uniform distribution over ${0, 1}^{n}$ , the weights of $A_{Q, t}$ under $U_{n}$ is $U_{n} (A_{Q, t}) \leq 2^{- (n - 2 (m + t))}$

Proof For $z \sim U_{n}$ z sampled from U n, we have $E_{z \sim U_{n}} [Q (z)] = \sum_{z} 2^{- n} Q (z) = 2^{- n}$ . Applying Markov’ inequality:

Pr_{z \sim U_{n}} [Q (z) \geq 2^{- 2 (m + t)}] \leq \frac{E _{z \sim U_{n}} [ Q ( z )]}{2 ^{- 2 (m + t)}} \leq 2^{- n + 2 (m + t)} = 2^{- (n - 2 (m + t))} . (26)

■

Theorem 24

If there exists a PRF family $F_{K} : {0, 1}^{m} \to {0, 1}^{k}$ that is indexed by $K \in {0, 1}^{m}$ and secure against a non-uniform PPT distinguisher $D_{m}$ allowing for an advantage of at most $ϵ (m)$ , there exists $n_{0}$ such that for all $n = m + k \geq n_{0}$ , there exists a sequence of random variables ${X_{k}}_{k = 1}^{n}$ over ${0, 1}^{n}$ such that $S_{Poly} (X_{n}) = Ω (lo g n)$ .

Proof We will prove the existence of such $P$ P via a counting argument. First, we define the family of distributions of interest. Concretely, we draw a sample $P_{K}$ as follows:

Sample $x \sim U_{m}$ x sampled from the uniform distribution over m-bit strings
Output $z = (x, F_{K} (x)) \in {0, 1}^{n}$ z equals x and F K of x, in the set of binary strings of length n

Since $F_{K}$ F K is a deterministic function, $H (P_{K}) = m$ the entropy of P K equals m.
We also defined a keyed model $Q_{K}$ Q K that models $P_{K}$ P K by directly storing the key $K$ K and the program for generating PRF from $K$ K inside its program:

Q_{K} (x, y) = 2^{- m} 1 {y = F_{K} (x)} .

This model matches the density of $P_{K}$ P K so $K L (P_{K} ∥ Q_{K}) = 0$ the K L divergence between P K and Q K is zero, and:

L (Q_{K}, P_{K}) = ∣ Q_{K} ∣ + H (P_{K}) \leq m + c_{1} + m = 2 m + c_{1} .

$c_{1}$ c one is the constant overhead to implement the PRF evaluation and sampling wrapper under a fixed encoding (i.e., a UTM).

Constructing distinguisher from $Q$ Q. Given a model $Q$ Q and its heavy set $A_{Q, t}$ A Q t (Definition 21), we can turn $Q$ Q into a single-query distinguisher $D^{O}$ :

Sample $x \sim U_{m}$ x from U m and query the oracle $y = O (x)$ and set $z = (x, y)$ .
Output 1 if $z \in A_{Q, t}$ i.e., $Q (z) \geq 2^{- 2 (m + t)}$ else 0.

If $O$ is a truly random function $R$ , then $(x, R (x))$ follows $U_{n}$ and by Lemma 23:

Pr [D^{R} = 1] = Pr_{z \sim U_{n}} [z \in A_{Q, t}] \leq 2^{- (n - 2 (m + t))} (27)

If $O$ is the PRF $F_{K}$ for a $K$ that satisfies $K L (P_{K} ∥ Q) \leq t$ , Lemma 22 gives:

Pr [D^{F_{K}} = 1 ∣ K L (P_{K} ∥ Q) \leq t] \geq \frac{1}{2} (28)

Let $p_{Q, t} = Pr_{K} [K L (P_{K} ∥ Q) \leq t]$ . We can average over all possible $K$ and obtain the following bound:

Pr [D^{F_{K}} = 1] \geq Pr_{K} [K L (P_{K} ∥ Q) \leq t] Pr [D^{F_{K}} = 1 ∣ K L (P_{K} ∥ Q) \leq t] \geq \frac{1}{2} p_{Q, t} (29)

Therefore, the distinguishing advantage of $D^{O}$ is:

Adv (D^{O}) = Pr [D^{F_{K}} = 1] - Pr [D^{R} = 1] \geq \frac{1}{2} p_{Q, t} - 2^{- (n - 2 (m + t))} (30)

Rearranging:

p_{Q, t} \leq 2 Adv (D^{O}) + 2 \cdot 2^{- (n - 2 (m + t))} (31)

Since $F_{K}$ is a PRF and $D_{O}$ is a PPT distinguisher, the advantage is upperbounded by $ε (m)$ epsilon of m:

p_{Q, t} \leq 2 ε (m) + 2 \cdot 2^{- (n - 2 (m + t))} (32)

Union bound over short models. Given a maximum program length $s$ s, there are at most $2^{s + 1}$ candidate programs $Q$ with $∣ Q ∣ \leq s$ . Applying union bound on all such $Q$ ‘s:

Pr_{K} [\exists Q : ∣ Q ∣ \leq s \land K L (P_{K} ∥ Q) \leq t] \leq 2^{s + 1} p_{Q, t} \leq 2^{s + 1} (2 ε (m) + 2 \cdot 2^{- (n - 2 (m + t))}) (33)

Now, it suffices to choose parameters such that the RHS of equation 33 is smaller than 1, which implies there exists a hard key $K^{⋆}$ K star such that:

K L (P_{K^{⋆}} ∥ Q) > t, \forall Q satisfying ∣ Q ∣ \leq s (34)

MDL lower bound from $K^{⋆}$ K star. For $K^{⋆}$ K star, every $∣ Q ∣ \leq s$ the size of Q less than or equal to s satisfies:

L (Q, P_{K^{⋆}}) = ∣ Q ∣ + H (P^{⋆}) + KL (P_{K^{⋆}} ∥ Q) \geq H (P^{⋆}) + KL (P_{K^{⋆}} ∥ Q) \geq m + t .

Meanwhile, the keyed model $Q_{K^{⋆}}$ Q sub K star satisfies: $L (Q_{K^{⋆}}, P_{K^{⋆}}) \leq 2 m + c_{1}$ L is less than or equal to two m plus c one. If we set:

t = m + c_{1} + Δ,

we get a margin of $Δ$ delta:

L (Q, P_{K^{⋆}}) \geq m + m + c_{1} + Δ > 2 m + c_{1} \geq L (Q_{K^{⋆}}, P_{K^{⋆}}) (35)

This implies that there exists at least one model that achieves a lower description length than any $Q$ with $∣ Q ∣ \leq s$ and the MDL minimizer must have $∣ Q^{⋆} ∣ > s$ .

Choosing parameters. Set:

$s = lo g m$
$Δ = lo g m$
$t = m + c_{1} + Δ = m + c_{1} + lo g m$
$k = 4 m + 4Δ + 2 c_{1}$

We now plug these values into Equation 33. First, $2^{s + 1} = poly (m)$ and $lim_{m \to \infty} 2^{s + 1} \cdot 2 ε (m) = 0$ . For the second term:

= = = = 2^{s + 1} \cdot 2 \cdot 2^{- (n - 2 (m + t))} 2^{l o g m + 1} \cdot 2 \cdot 2^{- (m + 4 m + 4Δ + 2 c_{1} - 2 (m + m + c_{1} + l o g m))} 2^{l o g m + 2} \cdot 2^{- (5 m + 4 l o g m + 2 c_{1} - 2 (2 m + c_{1} + l o g m))} 2^{l o g m + 2} \cdot 2^{- (m + 2 l o g m)} 2^{- m - l o g m + 2} .

This term also approaches 0 as $m$ increases. So for sufficiently large $m$ the RHS of Equation 33 is less than 1 as desired.

A.5 Information Content is not Independent of Factorization

Theorem 25 (OWP induces entropy asymmetry)

Let $f : {0, 1}^{n} \to {0, 1}^{n}$ f from the set of binary strings of length n to itself be a polynomial-time computable one-way permutation secure against non-uniform PPT inverters with negligible success probability. Let $X = U_{n}$ X be a uniform distribution and $Y = f (X)$ . Let $H_{poly} (\cdot)$ and $H_{poly} (\cdot ∣ \cdot)$ be defined as in Definition 8. Then for every constant $c > 0$ there exists $N$ such that for all $n \geq N$ ,
$H_{poly} (X ∣ Y) + H_{poly} (Y) > H_{poly} (Y ∣ X) + H_{poly} (X) + c lo g n .$

Proof We prove bounds on each term.

Unconditional terms $H_{poly} (X)$ H poly of X and $H_{poly} (Y)$ H poly of Y. Since $X = U_{n}$ X equals U n and $f$ f is a permutation, $Y = f (X)$ Y equals f of X is also uniform on ${0, 1}^{n}$ the set of all binary strings of length n. By Lemma 15 (time-bounded entropy of the uniform distribution), there is a constant $c_{0}$ c zero such that

n \leq H_{poly} (X) \leq n + c_{0}, n \leq H_{poly} (Y) \leq n + c_{0} .

In particular, $- c_{0} \leq H_{poly} (Y) - H_{poly} (X) \leq c_{0}$ the difference between the two is bounded by c zero, so $H_{poly} (Y) - H_{poly} (X) = O (1)$ the difference is order one.

Forward conditional term $H_{poly} (Y ∣ X)$ H poly of Y given X. There is a deterministic conditional sampler that on input $x$ x outputs $f (x)$ f of x. For this sampler, $P (Y ∣ X) = 1$ the probability of Y given X is one almost surely, hence $lo g (1/ P (Y ∣ X)) = 0$ almost surely. Since $H_{poly} (Y ∣ X)$ is the expected log-loss of the MDL-optimal conditional sampler, we obtain

H_{poly} (Y ∣ X) = O (1) .

Hard conditional term $H_{poly} (X ∣ Y)$ H poly of X given Y. Let $P^{⋆} := P_{X ∣ Y}^{⋆}$ P star be the MDL-optimal conditional probabilistic model for $X ∣ Y$ over the class of non-uniform PPT model, and define

ϕ (y) := u \sim U_{\infty} Pr [Sample_{P_{X ∣ y}^{⋆}} (u) = f^{- 1} (y)] .

Because $Y = f (X)$ and $f$ is a permutation, we have $X = f^{- 1} (Y)$ almost surely, and thus

P^{⋆} (X ∣ Y) = P^{⋆} (f^{- 1} (Y) ∣ Y) = ϕ (Y) a.s.

Therefore

H_{poly} (X ∣ Y) = E [lo g \frac{1}{P ^{⋆} ( X ∣ Y )}] = E [lo g \frac{1}{ϕ ( Y )}] .

By Jensen’s inequality for the convex function $lo g (1/ t)$ ,

E [lo g \frac{1}{ϕ ( Y )}] \geq lo g \frac{1}{E [ ϕ ( Y )]} .

Now consider the inverter $I$ I that on input $y$ runs the sampler $P^{⋆} (X ∣ Y)$ once and outputs the resulting $x$ . Since $P^{⋆}$ is a non-uniform PPT sampler, $I$ is a non-uniform PPT inverter. Moreover, its inversion success probability is exactly

Pr [I (Y) = f^{- 1} (Y)] = E [ϕ (Y)] .

Equivalently (since $Y = f (X)$ ),

X \sim U_{n} Pr [I (f (X)) = X] = E [ϕ (Y)] .

By one-wayness, this success probability is negligible. In particular, for every constant $c > 0$ there exists $N$ such that for all $n \geq N$ ,

E [ϕ (Y)] \leq n^{- c} .

Plugging into the Jensen bound yields, for all $n \geq N$ ,

H_{poly} (X ∣ Y) \geq lo g \frac{1}{E [ ϕ ( Y )]} \geq c lo g n .

Combine. For $n \geq N$ , we have

H_{poly} (X ∣ Y) + H_{poly} (Y) \geq c lo g n + H_{poly} (Y) \geq c lo g n + H_{poly} (X) - O (1) = H_{poly} (Y ∣ X) + H_{poly} (X) + c lo g n - O (1), (36) (37) (38)

where we used $H_{poly} (Y ∣ X) = O (1)$ and $H_{poly} (Y) - H_{poly} (X) \geq - c_{0}$ . ■

Corollary 26

Let $f$ f be a one-way permutation and let $X = Unif ({0, 1}^{n}), Y = f (X)$ X be a uniform distribution over n-bit strings and Y equals f of X. Define $P$ P as a family of probabilistic generative model that allows for multiple factorizations of the data, i.e. $P \in P$ P in calligraphic P it can make predictions $P_{1 \to 2} (X, Y) = P_{1} (X) P_{2} (Y; X)$ P one to two of X and Y equals P one of X times P two of Y given X and $P_{2 \to 1} (X, Y) = P_{2} (Y) P_{1} (X; Y)$ P two to one of X and Y equals P two of Y times P one of X given Y for the functions $P_{1} (\cdot), P_{1} (\cdot; \cdot), P_{2} (\cdot), P_{2} (\cdot; \cdot)$ that are normalized probability distributions over the first variable.
Suppose that $P$ P fits the forward direction of $f$ f (and the input uniform distributions)
$E [- lo g P_{1} (X)] E [- lo g P_{2} (f (X) ∣ X)] \leq n + ε \leq ε$
then it must violate Bayes theorem $P_{1 \to 2} = P_{2 \to 1}$ by a margin growing with $n$ n. Specifically, for any value of $c$ c there exists $N$ N such that for all $n > N$ , there exists at least one $x \in {0, 1}^{n}$ such that
$P_{1} (x) P_{2} (f (x); x) > n^{c} 2^{- 2 ε} P_{2} (f (x)) P_{1} (x; f (x)) (39)$

Proof From Theorem 25 which applies also for each $P$ P, we have

E [- lo g P_{2} (X; Y)] > c lo g n .

The minimim value of $E [- lo g P_{2} (f (X))]$ is $n$ n since $f$ f is a bijection. Assembling these components,

E [lo g \frac{P _{1} ( X ) P _{2} ( f ( X ) ; X )}{P _{2} ( f ( X )) P _{1} ( X ; f ( X ))}] > c lo g n - 2 ε . (40)

Since the inequality holds in expectation, it also must hold for at least one value of $X$ X. Exponentiating provides the final result. ■

A.6 Problems with time-bounded sophistication

Epiplexity can be seen as a time-bounded and distributional generalization of sophistication. A natural question is whether we can directly define a time-bounded version of sophistication for individual strings. We show below that a naive time-bounded generalization degenerates: it makes the “model” part essentially constant for every string.

Preliminaries. Fix a reference universal (prefix-free or plain) Turing machine $U$ U. For a program $p$ p and auxiliary input $d$ d, we write $U (p, d)$ for the output of running $p$ on input $d$ . The length of a binary string $p$ is denoted $∣ p ∣$ . A program $p$ is total if $U (p, d)$ halts for every input $d$ (i.e., $p$ computes a total function).

We write $K (x)$ for Kolmogorov complexity (plain or prefix; the choice only changes values by $O (1)$ ). For a time bound $t (\cdot)$ , the time-bounded Kolmogorov complexity is

K^{t} (x) := min {∣ q ∣ : U (q) outputs x within t (∣ x ∣) steps} .

K t of x is defined as the minimum length of a program q such that U of q outputs x within t of the length of x steps

(Any standard time-constructible $t$ suffices for the discussion.)

We adopt the definition of sophistication from Koppel (1988)Koppel and Antunes et al. (2005)Antunes and colleagues, phrased for finite strings as in later expositions. For a significance level $c \geq 0$ , the sophistication of $x$ is

Definition 27 (Sophistication at significance $c$ c)

soph_{c} (x) := p min {∣ p ∣ : p is total and \exists d such that U (p, d) = x and ∣ p ∣ + ∣ d ∣ \leq K (x) + c}

Sophistication at significance c of x is the minimum length of a program p, such that p is total and there exists a data string d where the universal machine U of p and d equals x, and the sum of their lengths is at most the Kolmogorov complexity of x plus c.

Intuitively, $(p, d)$ p and d is a near-optimal two-part description of $x$ x. The requirement that $p$ p be total is crucial: it prevents taking $p$ p to be a tiny universal interpreter and pushing all information into $d$ d (since a universal interpreter is not total). One of the most intuitive attempts at “time-bounded sophistication” is to simply replace $K (x)$ K of x by the time-bounded complexity $K^{t} (x)$ K t of x in Definition 27.

Definition 28 (Naive time-bounded sophistication) Fix a time bound $t (\cdot)$ t and significance level $c \geq 0$ c greater than or equal to zero. Define

soph_{c}^{t} (x) := p min {∣ p ∣ : p is total and \exists d such that U (p, d) = x and ∣ p ∣ + ∣ d ∣ \leq K^{t} (x) + c}

The definition above collapses, essentially because time bounds make it easy to “totalize” a universal interpreter by adding a timeout.

Lemma 29 (Naive time-bounded sophistication is $O (1)$ order one)

For every time bound $t (\cdot)$ t and every $c \geq 0$ c greater than or equal to zero, there exists a constant $C_{t}$ C sub t (depending only on $t$ t and the choice of $U$ U) such that for every string $x$ x,
$soph_{c}^{t} (x) \leq C_{t}$
In particular, $soph_{c}^{t} (x)$ naive time-bounded sophistication does not meaningfully distinguish structured strings from random-looking strings.

Proof [sketch] Fix $t$ t. Let $p_{tl}$ p t l be a constant-size program that, on input $d$ d, simulates $U (d)$ U of d for at most $t (∣ x ∣)$ steps (or more generally for the same time budget used in the definition of $K^{t} (x)$ K t of x), and: (i) if the simulation halts within the budget, output the same result; otherwise (ii) output a fixed default string (say 0). By construction, $p_{tl}$ p t l is total (it always halts, because it enforces a timeout).

Now let $d^{⋆}$ d star be a shortest program witnessing $K^{t} (x)$ K t of x, i.e., $∣ d^{⋆} ∣ = K^{t} (x)$ the length of d star equals K t of x and $U (d^{⋆})$ U of d star outputs $x$ x within the allowed time. Then $U (p_{tl}, d^{⋆}) = x$ U of p t l and d star equals x. Moreover,

∣ p_{tl} ∣ + ∣ d^{⋆} ∣ = ∣ p_{tl} ∣ + K^{t} (x) \leq K^{t} (x) + c for all c \geq ∣ p_{tl} ∣

Thus $p_{tl}$ p t l is feasible in Definition 28, giving $soph_{c}^{t} (x) \leq ∣ p_{tl} ∣ = C_{t}$ for all $x$ x. $■$

In the original (unbounded-time) Definition 27, totality prevents a universal interpreter from being used as the “model” part, because such an interpreter cannot halt on inputs that encode non-halting computations. However, once we commit to a time bound in the optimality criterion (i.e., we compare against $K^{t} (x)$ K t of x), the data part $d$ d can be chosen to be a short program that is guaranteed to halt quickly. A constant-size clocked interpreter $p_{tl}$ p t l is then total and suffices for every $x$ x, pushing all of the description length into $d$ d. This is precisely the sense in which the naive time-bounded generalization becomes degenerate.

Appendix B. Measuring Epiplexity

B.1 Further details on estimating epiplexity

Here we provide further details on measuring epiplexity.

Evaluating code lengths and time bounds. As described in Section 4, evaluating the code length for the model boils down to tracking the training losses (prequential) or teacher-student KL (requential) at each step $i$ i:

∣ P_{preq} ∣ ∣ P_{req} ∣ \approx i = 0 \sum M - 1 (lo g 1/ P_{i} (Z_{i}) - lo g 1/ P_{M} (Z_{i})), \approx i = 0 \sum M - 1 KL (P_{i}^{t} ∥ P_{i}^{s}) . (41) (42)

For prequential coding, we need to compute the loss of the final model summed over the entire training dataset, $\sum_{i = 0}^{M - 1} lo g 1/ P_{M} (Z_{i})$ the sum from i equals zero to M minus one of log one over P M of Z i, which is time-consuming if done exactly. Since all of our experiments are in the one-epoch training regime without data repeat and training data $Z_{i}$ Z i are drawn i.i.d. (except for the ADO experiment Section 6.4), we make the assumption that the generalization gap is small and estimate $\sum_{i = 0}^{M - 1} lo g 1/ P_{M} (Z_{i})$ as $M lo g 1/ P_{M} (Z_{M})$ M times log one over P M of Z M, where the latter is a rescaled loss for $P_{M}$ P M on unseen data $Z_{M}$ Z M. The i.i.d. assumption breaks down for the ADO experiment Section 6.4, where we instead compute $\sum_{i = 0}^{M - 1} lo g 1/ P_{M} (Z_{i})$ exactly.

For requential coding, we need to evaluate the teacher-student KL, $KL (P^{t} ∥ P^{s})$ , at each training step. The KL divergence over sequences decomposes as a sum over token positions and is estimated as:

KL (P^{t} ∥ P^{s}) = j = 1 \sum L E_{Z_{< j} \sim P^{t}} Z_{j} \in V \sum P^{t} (Z_{j} ∣ Z_{< j}) lo g \frac{P ^{t} ( Z _{j} ∣ Z _{< j} )}{P ^{s} ( Z _{j} ∣ Z _{< j} )} \approx j = 1 \sum L Z_{j}^{'} \in V \sum P^{t} (Z_{j}^{'} ∣ Z_{< j}) lo g \frac{P ^{t} ( Z _{j}^{'} ∣ Z _{< j} )}{P ^{s} ( Z _{j}^{'} ∣ Z _{< j} )}, (43) (44)

where $Z \sim P^{t}$ Z sampled from P t is a sample from the teacher, $L$ L is the sequence length, and $V$ V is the vocabulary. We evaluate this estimator using the sample $Z$ Z generated by the teacher to train the student, along with their next-token-prediction logits ${P^{t} (Z_{j} ∣ Z_{< j}), P^{s} (Z_{j} ∣ Z_{< j})}_{j}$ P t and P s of Z j given Z less than j recorded on the generated sequence.

Finally, to estimate the expected entropy code length for the test data $E [lo g 1/ P (X)]$ the expected value of log one over P of X under the trained model $P$ P, we use an appropriately scaled empirical entropy code length of a heldout test set $\hat{X}$ X hat. Let $K$ K and $\hat{K}$ K hat denote the number of examples in each dataset. Then:

E [lo g 1/ P (X)] = E [lo g \frac{1}{\prod _{i} P ( X _{i} )}] = i \sum E [lo g 1/ P (X_{i})] = K E [lo g 1/ P (X_{1})] \approx \frac{K}{K ^} i = 1 \sum \hat{K} lo g 1/ P (\hat{X}_{i}) (45) (46) (47) (48)

where we assumed the datasets $X$ X and $\hat{X}$ X hat consist of i.i.d. draws from the same distribution. This estimator is simply a scaled version of the standard empirical test loss, and it converges to the true expectation as $\hat{K}$ K hat becomes large. To speedup evaluation, we typically choose $\hat{K} ≪ K$ K hat much smaller than K, but this choice does not affect our time-bound calculation: for both prequential and requential coding, the total decoding time of the two-part code for the test dataset $X$ X is estimated as $6 N D + 2 N D$ 6 N D plus 2 N calligraphic D where $N$ N is the number of parameters of the (student) model, $D$ D is the number of (student) training tokens,

Four plots showing compute vs. code length, model size, and training tokens, with various Pareto frontiers. Figure 10: Estimating the Pareto frontier from a finite number of training runs. While the exact Pareto frontier is smooth and the optimal model size and training tokens increase smoothly with compute, the empirical frontier is jagged and includes many spurious points due to selecting over only a finite number of hyperparameter combinations. Replacing the empirical Pareto frontier with the lower convex hull and retaining only the median point (ordered by compute) belong to a single training run with a fixed model size results in a much more accurate estimate of the true Pareto frontier. The example training curves are generated using the scaling laws in Hoffmann et al. (2022)Hoffmann and colleagues and prequential coding. The exact frontier is found via root finding for Equation (56).

and $D$ D is the number of tokens in the test dataset. When evaluating conditional epiplexity $S_{T} (Y ∣ X)$ S T of Y given X, decoding time takes into account both the input ( $X$ X) and label ( $Y$ Y) tokens, but code length only needs to be computed for the label tokens (tokens contributing to the training loss).

Finding Hyperparameters for Compute-Optimal Two-Part Code. To identify models that lead to compute-optimal two-part code, we need to optimize several key hyperparameters, including model size ( $N$ N), training tokens ( $D$ D), width-depth ratio, learning rate, etc. Through our early experiments, we found two interventions that reduce the model code length under requential coding: (1) distilling from an exponential moving average (EMA) of teacher checkpoints rather than instantaneous checkpoints, which reduces noise in the distillation signal, and (2) imposing a maximum KL threshold between teacher and student—when exceeded, the teacher is frozen while the student catches up, preventing divergence that would otherwise inflate the code length. The EMA time scale and the maximum KL threshold are additional hyperparameters for requential coding.

In each experiment, we first identify a good learning rate for a small model size and use the Maximum Update Parameterization (Yang et al., 2022)by Yang and colleagues and CompleteP (Dey et al., 2025)by Dey and colleagues to transfer the found learning rate to larger models. We also optimize the EMA time scale and maximum KL threshold for the small model when using requential coding. We then train models of various depths and widths to simultaneously sweep over model size and width-depth ratios, for a total number of training tokens chosen to be larger than the test dataset size $D$ D, motivated by the observation that the optimal training tokens typically grows with the model size but do not exceed $D$ D (see Section B.4). To avoid separately training a model for intermediate training token budgets, we record an EMA of the iterates (for requential coding, this is done for the student) under a constant learning rate schedule, rather than using a decaying learning rate schedule, following Hägele et al. (2024)Hägele and colleagues. Each training run traces a curve in the $∣ P ∣ + E [1/ lo g P (X)]$ vs $T$ code length versus T plane as more training tokens are seen. The Pareto frontier of all such curves yields the optimal hyperparameters ( $N, D$ N, D, width, depth, etc.) as a function of the compute budget.

Estimating the Pareto Frontier. Due to computational constraints, we can only sweep over a limited set of hyperparameter combinations, which makes the empirical Pareto frontier noisy and jagged; we therefore use the lower convex hull of the resulting curves as a smoother approximation to the true Pareto frontier, a strategy often used in the compute-optimal scaling law literature

(Henighan et al., 2020; McLeish et al., 2025) Henighan and colleagues, and McLeish and colleagues, to overcome similar issues. After applying this strategy, we still often observe that multiple checkpoints from a single training run appear on the Pareto frontier. This is an artifact of finite hyperparameter sweeps: we expect both the optimal training tokens $D$ D and model size $N$ N to vary smoothly with compute budget, precluding multiple values of $D$ at the same $N$ from lying on the true Pareto frontier. These spurious points cause noisy, oscillatory trends in the estimated epiplexity, as shown in Figure 10. As a simple workaround, we retain only the median point (ordered by compute) per training run (which has a fixed model size) on the lower convex hull.

Sources of errors. In addition to the artifacts produced by finite $(N, D)$ combinations, our estimated epiplexity may differ from the true value for a few reasons: 1) potential systematic errors introduced by using the lower convex hull and taking the median point, 2) using a fixed architecture (e.g., the transformer) and learning algorithm (e.g., requential training with Adam) rather than considering all possible programs, and 3) suboptimality of other hyperparameters, such as the learning rate, Adam $(β_{1}, β_{2})$ beta one and beta two, etc. In most cases, we believe these sources of errors only contribute sub-leading corrections to the estimated epiplexity that do not impact the result qualitatively. For example, they are unlikely to alter the ordering between datasets if the estimated epiplexity gap is already significant or there is a clear trend along some axis of variation (e.g., number of hidden bits in the induction experiment in Section 5.3.1).

B.2 Prequential Coding Approximates Requential Coding with a Static Teacher

In this section, we show that the prequential coding estimate in Equation (8) can be viewed as an approximation to requential coding with a static teacher, providing an alternative justification for its use beyond the symmetry of information argument.

Consider requential coding with a fixed teacher across all time steps, i.e., $P_{i}^{t} = P^{t}$ for all $i \in {0, \dots, M - 1}$ P i t equals P t for all i from zero to M minus one. The requential code length becomes

∣ P_{req} ∣ \approx i = 0 \sum M - 1 KL (P^{t} ∥ P_{i}^{s}) = i = 0 \sum M - 1 E_{P^{t}} [lo g \frac{1}{P _{i}^{s} ( X )} - lo g \frac{1}{P ^{t} ( X )}] (49)

the requential code length is approximately the sum from i equals zero to M minus one of the K L divergence between the teacher and student distributions

Now suppose the static teacher closely matches the true data distribution, i.e., $P^{t} \approx P_{X_{1}}$ P t is approximately P X one (we use $P_{X_{1}}$ in order to refer to the distribution of a single example, not the dataset). Under this assumption, we can make three simplifying approximations:

The expectation under the teacher can be replaced by the expectation under the data distribution: $E_{P^{t}} [\cdot] \approx E_{P_{X_{1}}} [\cdot]$ .
Training the student on synthetic samples from $P^{t}$ yields similar dynamics to training on real data samples from $P_{X_{1}}$ .
If the student converges to the teacher, then $P_{M}^{s} \approx P^{t}$ , allowing us to estimate the teacher’s loss $E_{P_{X_{1}}} [lo g 1/ P^{t} (X)]$ by the final student’s loss $E_{P_{X_{1}}} [lo g 1/ P_{M}^{s} (X)]$ .

Applying these approximations, the requential code length with a static teacher becomes

∣ P_{req} ∣ \approx i = 0 \sum M - 1 E_{P_{X_{1}}} [lo g \frac{1}{P _{i}^{s} ( X )} - lo g \frac{1}{P _{M}^{s} ( X )}], (50)

the requential code length is approximately the sum over the difference between the student’s intermediate losses and its final loss

which, when estimated empirically on real training data $Z_{0}, \dots, Z_{M - 1} \sim P_{X_{1}}$ Z zero through Z M minus one distributed according to P X one, recovers precisely the prequential estimate from Equation (8):

∣ P_{preq} ∣ \approx i = 0 \sum M - 1 (lo g \frac{1}{P _{i} ( Z _{i} )} - lo g \frac{1}{P _{M} ( Z _{i} )}) . (51)

This connection also lends some justification to treating $6 N D$ as the decoding time for the model in prequential coding, as it relates to a requential scheme that achieves this runtime. Since a static teacher is generally suboptimal compared to the time-varying teachers used in full requential coding, which can remain close to the student throughout training while still guiding it toward the target distribution, we expect the prequential estimate to be an overestimate of the requential code length. This is consistent with the empirical observations in Figure 2c, where the prequential estimate is typically several times larger than the requential estimate.

B.3 A Solvable Model Using Scaling Laws

In this section, we present a simplified analytical model from combining neural scaling laws with prequential coding to gain insight into how epiplexity and compute-optimal hyperparameters typically vary with compute and dataset size, along with their asymptotic behaviors.

We adopt a standard scaling law for the loss as a function of model size $N$ N and training tokens $D$ D:

L (N, D) = E + (\frac{N _{0}}{N})^{α} + (\frac{D _{0}}{D})^{β}, (52)

where $E$ E is the irreducible loss, $N_{0}$ N zero and $D_{0}$ D zero are scaling constants, and $0 < α, β < 1$ alpha and beta are between zero and one are the scaling exponents. The total compute for training and evaluating on $D$ D test tokens is $T = 6 N D + 2 N D = 2 N (3 D + D)$ T equals six N D plus two N D, which equals two N times the quantity three D plus D.

To simplify the analysis, we work in natural units: $n = N / N_{0}$ n equals N over N zero, $d = D / D_{0}$ d equals D over D zero, $δ = D / D_{0}$ delta equals D over D zero, and $t = T / (2 N_{0} D_{0})$ t equals T over two N zero D zero. The loss becomes $L (n, d) = E + n^{- α} + d^{- β}$ L of n and d equals E plus n to the minus alpha plus d to the minus beta, and the compute constraint simplifies to $t = n (3 d + δ)$ t equals n times the quantity three d plus delta.

Two-part code length. The two-part code $P_{tot}$ P tot consists of the model description and the data encoded using the model. The data code length on the test set is $δ D_{0} \cdot L (n, d)$ delta D zero times L of n and d.

For the model description length, we use the prequential estimate from Equation (8), which corresponds to the area under the loss curve above the final loss⁴:

∣ P_{preq} ∣ = i = 1 \sum D [(\frac{i}{D _{0}})^{- β} - (\frac{D}{D _{0}})^{- β}] = \int_{1}^{D} [(\frac{u}{D _{0}})^{- β} - (\frac{D}{D _{0}})^{- β}] d u + O (1), (53)

where the $O (1)$ term remains bounded as $D \to \infty$ . Evaluating the integral and dropping $O (1)$ terms, we obtain the expression valid for large $D$ :

∣ P_{preq} ∣ = \frac{β}{1 - β} D_{0} d^{1 - β} . (54)

Optimality condition. Dropping the constant term $δ D_{0} E$ delta D zero E from the two-part code length and dividing by $D_{0}$ D zero, we seek to minimize

f (n, d) = \frac{β}{1 - β} d^{1 - β} + δ (n^{- α} + d^{- β}) (55)

subject to $t = n (3 d + δ)$ t equals n times the quantity 3 d plus delta.

Solution. Eliminating $n$ n using the constraint $n = t / (3 d + δ)$ n equals t over 3 d plus delta, we obtain a one-dimensional optimization problem in $d$ d. Setting the derivative to zero and simplifying, the optimal $d^{⋆} (t)$ d star of t satisfies

β d^{- β - 1} (δ - d) = 3 α δ t^{- α} (3 d + δ)^{α - 1}, (56)

with the corresponding optimal model size given by

n^{⋆} (t) = \frac{t}{3 d ^{⋆} ( t ) + δ} . (57)

While Equation (56) does not admit a simple closed-form solution in general, we can extract the asymptotic behavior in the large- and small-compute regimes.

Large-compute regime ( $t \to \infty$ t goes to infinity). As $t$ t grows, the right-hand side of Equation (56) scales as $t^{- α} \to 0$ t to the minus alpha goes to zero. For the equation to remain balanced, we require $δ - d \to 0$ delta minus d goes to zero, i.e., $d^{⋆} (t) \to δ$ d star of t approaches delta. The leading-order scaling is therefore:

d^{⋆} (t) = δ - Θ (t^{- α}), n^{⋆} (t) \sim \frac{t}{4 δ} . (58)

In this regime, the optimal training set size saturates at the test set size $δ$ delta, while the model size grows linearly with compute. Correspondingly, the epiplexity saturates to

S_{\infty} (X) = \frac{β}{1 - β} D_{0} δ^{1 - β} = \frac{β}{1 - β} D_{0}^{β} D^{1 - β} . (59)

For the entropy, we have $(n^{⋆})^{- α} \to 0$ n star to the minus alpha goes to zero while $(d^{⋆})^{- β} \to δ^{- β}$ d star to the minus beta goes to delta to the minus beta, so

H_{\infty} (X) = D (E + δ^{- β}) = D E + D_{0}^{β} D^{1 - β} . (60)

The entropy approaches the irreducible entropy $D E$ D E plus a residual term from finite training data that scales sublinearly with the test set size, meaning that the achieved per-token loss is $E + O (D^{- β})$ E plus order D to the minus beta.

Small-compute regime ( $d^{⋆} ≪ δ$ d star is much less than delta). When compute is limited such that $d ≪ δ$ d is much less than delta, we approximate $δ - d \approx δ$ and $(3 d + δ)^{α - 1} \approx δ^{α - 1}$ . Substituting into Equation (56) and solving for $d$ d gives

d^{⋆} (t) = (\frac{β}{3 α})^{\frac{1}{β + 1}} t^{\frac{α}{β + 1}} δ^{\frac{1 - α}{β + 1}} . (61)

Since $3 d^{⋆} ≪ δ$ in this regime, the optimal model size is

n^{⋆} (t) \approx \frac{t}{δ} . (62)

Here, the model size is constrained by the need to evaluate on $δ$ delta tokens, and the optimal training set size grows sublinearly with compute as $d^{⋆} \propto t^{α / (β + 1)}$ d star is proportional to t to the alpha over beta plus one. The epiplexity in this regime scales as

S_{T} (X) = \frac{β}{1 - β} D_{0} (d^{⋆})^{1 - β} \propto T^{\frac{α ( 1 - β )}{β + 1}}, (63)

growing sublinearly with compute.

For the entropy, both the model and data contributions are significant. The model contribution scales as

D (n^{⋆})^{- α} = D (\frac{δ}{t})^{α} \propto T^{- α}, (64)

while the data contribution scales as

D (d^{⋆})^{- β} \propto T^{- \frac{α β}{β + 1}} . (65)

Since $α β / (β + 1) < α$ alpha beta over beta plus one is less than alpha, the data term decays more slowly and dominates for larger $t$ t within this regime. The entropy above the irreducible level is thus

H_{T} (X) - D E \propto T^{- \frac{α β}{β + 1}}, (66)

decaying as a power law with compute.

For typical scaling exponents (e.g., $α \approx 0.34$ and $β \approx 0.28$ from Hoffmann et al. (2022))for example, alpha approximately point three four and beta approximately point two eight from Hoffmann and colleagues, twenty twenty two, the epiplexity grows as $S_{T} \propto T^{0.19}$ S sub T proportional to T to the point one nine and the entropy decays as $H_{T} - D E \propto T^{- 0.07}$ H sub T minus D E proportional to T to the minus point zero seven in the small-compute regime.

B.4 How Epiplexity and Time-Bounded Entropy Scale with Compute and Dataset Size

In this section, we analyze how epiplexity and time-bounded entropy scale with compute budget and dataset size under natural assumptions about neural network training, without relying on specific functional forms for scaling laws. The goal is to provide some general intuitions for how these quantities are expected to vary as a function of the compute budget and dataset size. Section B.3 explicitly demonstrates using scaling laws and prequential coding that (1) epiplexity grows with both compute and dataset size, and (2) for a fixed $X$ X, epiplexity saturates to a finite value in the limit of infinite compute—specifically, to the amount of information acquired by an arbitrarily large model trained on a training set of the same size as the test set $X$ X, while time-bounded entropy decays to the loss achievable by an infinitely large model on this training set. Here, we show that similar or weaker statements hold more generally, requiring only a few natural assumptions about the effect of increasing model size $N$ N and training data size $D$ D. These assumptions capture typically observed regularities in deep learning, such as the smoothly diminishing returns in scaling only model size while holding training set size fixed, but they may fail to capture rare exceptions like grokking and sudden improvement in performance above certain compute thresholds (as in Section 5.3.2).

Denote the code length for an $N$ -parameter model trained on $D$ tokens as $∣ P ∣ (N, D)$ P of N and D, the per-token loss it achieves as $L (N, D)$ L of N and D, the compute-optimal model size as $N^{⋆} (T)$ and training data size as $D^{⋆} (T)$ , so that $S_{T} (X) = ∣ P ∣ (N^{⋆} (T), D^{⋆} (T))$ and $H_{T} (X) = D L (N^{⋆} (T), D^{⋆} (T))$ . We establish the following results as we vary $T$ and $D = ∣ X ∣$ D equals the size of X, fixing the distribution of $X_{i}$ X sub i (only the dataset size changes):

Monotonicity of $N^{⋆} (T), D^{⋆} (T), S_{T} (X)$ , and $H_{T} (X)$ (Section B.4.1): Under natural assumptions on the effect of increasing $N$ and $D$ , the compute-optimal model size $N^{⋆} (T)$ and training data size $D^{⋆} (T)$ are both increasing in the compute budget $T$ . As a result, epiplexity typically grows with $T$ while time-bounded entropy typically decreases with $T$ .
Monotonicity of $S_{\infty} (X)$ and $H_{\infty} (X)$ in $D$ (Section B.4.2): In the infinite-compute limit, epiplexity $S_{\infty} (X)$ is nondecreasing in $D = ∣ X ∣$ , while the per-token time-bounded entropy $h_{\infty} (X_{D}) := H_{\infty} (X_{D}) / D$ h infinity of X sub D, defined as H infinity of X sub D over D is nonincreasing in $D$ .
$D^{⋆} (T)$ D star of T generally approaches $D$ D in prequential coding (Section B.4.3): For prequential coding, the compute-optimal training set size satisfies $D^{⋆} (T) \to D$ D star of T approaches D as $T \to \infty$ T goes to infinity, where $D$ D is the test set size, without assuming the scaling law form. Combined with monotonicity of $D^{⋆} (T)$ D star of T, this implies $D^{⋆} (T) ↑ D$ D star of T increases toward D from below.

B.4.1 Monotonicity of $N^{⋆} (T), D^{⋆} (T), S_{T} (X)$ , and $H_{T} (X)$

The following theorem shows that the compute-optimal model size and training data size are both monotonically increasing in the compute budget under natural assumptions.

Theorem 30 (Monotone growth of compute-optimal $N$ and $D$ )

Define the effective data $\tilde{D} = 6 D + 2 D$ , so that the compute constraint becomes $T = N \tilde{D}$ . Let $J (N, \tilde{D})$ denote the two-part code length as a function of model size $N$ and effective data $\tilde{D}$ , and assume $J$ is twice continuously differentiable. Consider the constrained MDL problem
$N > 0, \tilde{D} \geq 2 D min J (N, \tilde{D}) s.t. N \tilde{D} = T . (67)$
Assume that for each $T$ in the regime of interest there is a unique interior optimizer $(N^{⋆} (T), \tilde{D}^{⋆} (T))$ with $\tilde{D}^{⋆} (T) > 2 D$ and $N^{⋆} (T) \tilde{D}^{⋆} (T) = T$ .
Work in logarithmic coordinates $μ := lo g N$ and $ν := lo g \tilde{D}$ , and by slight abuse of notation write $J (μ, ν) = J (e^{μ}, e^{ν})$ . Assume that for all such $T$ , the following conditions hold at the corresponding optimum $(μ^{⋆} (T), ν^{⋆} (T))$ :

Complementarity (larger models are more sample-efficient):

$\frac{\partial ^{2} J}{\partial μ \partial ν} \leq 0. (68)$

Diminishing returns in model size (in log coordinates):

$\frac{\partial ^{2} J}{\partial μ ^{2}} > 0. (69)$

Diminishing returns in effective data (in log coordinates):

$\frac{\partial ^{2} J}{\partial ν ^{2}} > 0. (70)$
Then both compute-optimal choices are strictly increasing functions of $T$ :
$T_{2} > T_{1} ⟹ N^{⋆} (T_{2}) > N^{⋆} (T_{1}) and \tilde{D}^{⋆} (T_{2}) > \tilde{D}^{⋆} (T_{1}) . (71)$

Proof Work in logarithmic coordinates

μ := lo g N, ν := lo g \tilde{D}, τ := lo g T . (72)

mu is log N, nu is log D tilde, and tau is log T

The compute constraint $N \tilde{D} = T$ becomes the affine constraint

μ + ν = τ ⟺ ν = τ - μ . (73)

mu plus nu equals tau, which is equivalent to nu equals tau minus mu

By slight abuse of notation, write $J (μ, ν) := J (e^{μ}, e^{ν})$ and denote its partial derivatives by $J_{μ}, J_{ν}, J_{μμ}, J_{ν ν}, J_{μν}$ , etc., all taken with respect to the log-coordinates $(μ, ν)$ .

Define the restricted objective along the compute frontier by

f (μ, τ) := J (μ, τ - μ) . (74)

f of mu and tau is defined as J of mu and tau minus mu.

For each $τ$ tau in the regime of interest, let $μ^{⋆} (τ)$ mu star of tau denote the unique interior minimizer of $f (\cdot, τ)$ f, and set $ν^{⋆} (τ) := τ - μ^{⋆} (τ)$ nu star of tau equals tau minus mu star of tau.

Holding $τ$ tau fixed and differentiating $f$ f with respect to $μ$ mu gives

f_{μ} (μ, τ) = \frac{\partial}{\partial μ} J (μ, τ - μ) = J_{μ} (μ, ν) + J_{ν} (μ, ν) \frac{\partial}{\partial μ} (τ - μ) = J_{μ} (μ, ν) - J_{ν} (μ, ν), (75)

where $ν = τ - μ$ nu equals tau minus mu. The optimality condition for $μ^{⋆} (τ)$ mu star of tau is therefore

f_{μ} (μ^{⋆} (τ), τ) = 0 ⟺ J_{μ} (μ^{⋆} (τ), ν^{⋆} (τ)) = J_{ν} (μ^{⋆} (τ), ν^{⋆} (τ)) . (76)

Differentiating the identity $f_{μ} (μ^{⋆} (τ), τ) = 0$ f mu of mu star of tau and tau equals zero with respect to $τ$ tau yields

0 = \frac{d}{d τ} f_{μ} (μ^{⋆} (τ), τ) = f_{μμ} (μ^{⋆} (τ), τ) \frac{d μ ^{⋆}}{d τ} + f_{μτ} (μ^{⋆} (τ), τ) . (77)

Assuming $f_{μμ} (μ^{⋆} (τ), τ) \neq = 0$ f mu mu is non-zero (verified below), we obtain

\frac{d μ ^{⋆}}{d τ} = - \frac{f _{μτ}}{f _{μμ}} evaluated at (μ, τ) = (μ^{⋆} (τ), τ) . (78)

We now express $f_{μτ}$ f mu tau and $f_{μμ}$ f mu mu in terms of second partial derivatives of $J$ J. From (75) and the chain rule, using $\partial_{τ} (τ - μ) = 1$ the partial derivative of tau minus mu with respect to tau is one,

f_{μτ} (μ, τ) = \frac{\partial}{\partial τ} (J_{μ} (μ, ν) - J_{ν} (μ, ν)) = J_{μν} (μ, ν) \frac{\partial ν}{\partial τ} - J_{ν ν} (μ, ν) \frac{\partial ν}{\partial τ} = J_{μν} (μ, ν) - J_{ν ν} (μ, ν), (79)

with $ν = τ - μ$ nu equals tau minus mu. Similarly, differentiating (75) with respect to $μ$ mu while holding $τ$ tau fixed, and using $\partial_{μ} (τ - μ) = - 1$ the partial derivative of tau minus mu with respect to mu is minus one together with symmetry $J_{ν μ} = J_{μν}$ J nu mu equals J mu nu, yields

f_{μμ} (μ, τ) = \frac{\partial}{\partial μ} (J_{μ} (μ, ν) - J_{ν} (μ, ν)) = (J_{μμ} (μ, ν) + J_{μν} (μ, ν) \frac{\partial ν}{\partial μ}) - (J_{ν μ} (μ, ν) + J_{ν ν} (μ, ν) \frac{\partial ν}{\partial μ}) = (J_{μμ} - J_{μν}) - (J_{μν} - J_{ν ν}) = J_{μμ} (μ, ν) + J_{ν ν} (μ, ν) - 2 J_{μν} (μ, ν) . (80)

Substituting (79)–(80) into (78) gives

\frac{d μ ^{⋆}}{d τ} = - \frac{J _{μν} - J _{ν ν}}{J _{μμ} + J _{ν ν} - 2 J _{μν}} = \frac{J _{ν ν} - J _{μν}}{J _{μμ} + J _{ν ν} - 2 J _{μν}}, (81)

with all second partial derivatives of $J$ J evaluated at $(μ, ν) = (μ^{⋆} (τ), ν^{⋆} (τ))$ mu and nu equals mu star of tau and nu star of tau.

By the assumptions $J_{ν ν} > 0$ and $J_{μν} \leq 0$ at the optimum, the numerator in (81) satisfies $J_{ν ν} - J_{μν} > 0$ . By the assumptions $J_{μμ} > 0$ , $J_{ν ν} > 0$ , and $J_{μν} \leq 0$ , the denominator satisfies $J_{μμ} + J_{ν ν} - 2 J_{μν} > 0$ . Hence

\frac{d μ ^{⋆}}{d τ} > 0. (82)

Since $ν^{⋆} (τ) = τ - μ^{⋆} (τ)$ nu star of tau equals tau minus mu star of tau, we also have

\frac{d ν ^{⋆}}{d τ} = 1 - \frac{d μ ^{⋆}}{d τ} = \frac{J _{μμ} - J _{μν}}{J _{μμ} + J _{ν ν} - 2 J _{μν}} > 0, (83)

where positivity follows from $J_{μμ} > 0$ and $J_{μν} \leq 0$ together with the same positive denominator.

Finally, $N^{⋆} (T) = exp (μ^{⋆} (lo g T))$ and $D^{⋆} (T) = exp (ν^{⋆} (lo g T))$ N star of T equals e to the mu star of log T, and D star of T equals e to the nu star of log T, so $d μ^{⋆} / d τ > 0$ and $d ν^{⋆} / d τ > 0$ imply that both $N^{⋆} (T)$ and $D^{⋆} (T)$ are strictly increasing functions of $T$ . ■

Empirical plausibility of the assumptions. The three conditions in Theorem 30 reflect well-documented empirical phenomena in deep learning. The complementarity condition $\partial^{2} J / \partial μ \partial ν \leq 0$ captures the observation that larger models are more sample-efficient: increasing model size leads to faster learning (Kaplan et al., 2020; Yang et al., 2022)as shown by Kaplan and colleagues in 2020 and Yang and colleagues in 2022, which leads to a faster decrease in both the model description length and data code length (final loss), and thus $\partial J / \partial ν$ should decrease with $μ$ mu. The diminishing returns conditions $\partial^{2} J / \partial μ^{2} > 0$ and $\partial^{2} J / \partial ν^{2} > 0$ simply state that there is diminishing return in successive doubling of the model size or training data size, holding the other quantity fixed.

Asymptotic growth of $S_{T}$ and monotone decay of $H_{T}$ . The monotone growth of the compute-optimal $N^{⋆} (T)$ and $D^{⋆} (T)$ does not by itself imply that $S_{T} (X) := ∣ P ∣ (N^{⋆} (T), D^{⋆} (T))$ S T of X, defined as the description length of N star of T and D star of T is monotone for all $T$ . Intuitively, while we expect the model description length $∣ P ∣ (N, D)$ to grow with $D$ , it need not increase with $N$ : larger models can be more sample-efficient, which may reduce the effective complexity of the learned predictor under some coding schemes. However, one should still expect $S_{T} (X)$ to grow with $T$ , at least asymptotically, if we assume (1) the compute-optimal model size diverges while the optimal training horizon converges, as in the scaling-law model of Section B.3, and (2) the existence of infinite-model-size limits of the training dynamics.

That is, assume that along the compute-optimal path,

N^{⋆} (T) \to \infty and D^{⋆} (T) \to D_{\infty} < \infty as T \to \infty. (84)

N star of T goes to infinity and D star of T approaches a finite limit D infinity as T goes to infinity

Assume moreover that for bounded training horizons, the description length admits a well-defined infinite-model-size limit: there exists a function $P_{\infty} (D)$ such that for each fixed $D$ in a neighborhood of $[0, D_{\infty}]$ ,

∣ P ∣ (N, D) \to P_{\infty} (D) as N \to \infty. (85)

the description length of N and D approaches P infinity of D as N goes to infinity

This assumption is motivated by the existence of infinite-width and depth limits of neural networks under appropriate parameterizations (Yang and Littwin, 2023; Dey et al., 2025)discussed by Yang and Littwin in 2023 and Dey and colleagues in 2025, where scalar quantities such as loss and teacher–student KL divergence that determine $∣ P ∣ (N, D)$ admit stable large-model limits for bounded training durations. Under these conditions, any non-monotonic dependence of $∣ P ∣$ on $N$ is a finite-model effect; once $N^{⋆} (T)$ is large enough, $∣ P ∣ (N^{⋆} (T), D^{⋆} (T))$ is well-approximated by the limiting curve $P_{\infty} (D^{⋆} (T))$ . Since $D^{⋆} (T)$ is monotone increasing and

convergent under our earlier assumptions, the large- $T$ T behavior of $S_{T} (X)$ S T of X is therefore governed primarily by the behavior of $D^{⋆} (T)$ D star of T alone, which we have shown is increasing with $T$ T, so one expects $S_{T} (X)$ to increase at large $T$ T as $P_{\infty} (D)$ should increase with $D$ .

For the entropy term $H_{T} (X) := D L (N^{⋆} (T), D^{⋆} (T))$ H T of X, defined as D times the loss function L at N star and D star, the conclusion is simpler and does not require taking $N \to \infty$ N to infinity. Assume only that the loss $L (N, D)$ is nonincreasing in both $N$ and $D$ (more data and parameters cannot make the loss worse). Since $N^{⋆} (T)$ and $D^{⋆} (T)$ are increasing in $T$ , we have

T_{2} > T_{1} ⟹ L (N^{⋆} (T_{2}), D^{⋆} (T_{2})) \leq L (N^{⋆} (T_{1}), D^{⋆} (T_{1})), (86)

and therefore $H_{T} (X)$ is nonincreasing in $T$ T. In particular, whenever $H_{T} (X)$ has a finite large-compute limit $H_{\infty} (X)$ , it approaches this limit from above.

B.4.2 Monotonicity of $S_{\infty} (X)$ and $H_{\infty} (X)$ in $D$

We now show that epiplexity and time-bounded entropy (after appropriate normalization) in the infinite-compute limit are monotonic in the test set size $D = ∣ X ∣$ , regardless of the coding scheme.

Fix a dataset $X_{D}$ of length $D$ tokens. For a two-part code of the form

J (N, D; D) = ∣ P ∣ (N, D) + D L (N, D), (87)

let $(N_{T}^{⋆} (D), D_{T}^{⋆} (D))$ denote the compute-optimal choices. We write

S_{T} (X_{D}) := ∣ P ∣ (N_{T}^{⋆} (D), D_{T}^{⋆} (D)), (88)

H_{T} (X_{D}) := D L (N_{T}^{⋆} (D), D_{T}^{⋆} (D)), (89)

h_{T} (D) := \frac{H _{T} ( X _{D} )}{D} = L (N_{T}^{⋆} (D), D_{T}^{⋆} (D)) . (90)

In the infinite-compute limit $T \to \infty$ , the compute constraint becomes irrelevant, so the limiting quantities coincide with the optimum of the unconstrained problem

(N_{\infty}^{⋆} (D), D_{\infty}^{⋆} (D)) = ar g N > 0, D \geq 0 min ∣ P ∣ (N, D) + D L (N, D) . (91)

Thus

S_{\infty} (X_{D}) = ∣ P ∣ (N_{\infty}^{⋆}, D_{\infty}^{⋆}), h_{\infty} (X_{D}) = L (N_{\infty}^{⋆}, D_{\infty}^{⋆}) . (92)

We claim that $S_{\infty} (X_{D})$ is nondecreasing in $D$ , and $h_{\infty} (X_{D})$ is nonincreasing in $D$ , assuming that for each $D > 0$ the unconstrained problem (91) admits at least one minimizer.

To see this, fix $D_{2} > D_{1}$ and let $(N_{i}, D_{i})$ be minimizers of (91) at $D = D_{i}$ . Write $P_{i} := ∣ P ∣ (N_{i}, D_{i})$ and $L_{i} := L (N_{i}, D_{i})$ . Optimality of $(N_{2}, D_{2})$ at $D_{2}$ implies

P_{2} + D_{2} L_{2} \leq P_{1} + D_{2} L_{1} . (93)

Optimality of $(N_{1}, D_{1})$ at $D_{1}$ implies

P_{1} + D_{1} L_{1} \leq P_{2} + D_{1} L_{2} . (94)

Adding (93) and (94) gives

(P_{2} + D_{2} L_{2}) + (P_{1} + D_{1} L_{1}) D_{2} L_{2} + D_{1} L_{1} (D_{2} - D_{1}) (L_{2} - L_{1}) \leq (P_{1} + D_{2} L_{1}) + (P_{2} + D_{1} L_{2}) \leq D_{2} L_{1} + D_{1} L_{2} \leq 0, (95)

hence $L_{2} \leq L_{1}$ since $D_{2} > D_{1}$ , i.e., the achieved loss $h_{\infty} (X_{D})$ is nonincreasing in $D$ . Substituting $L_{2} \leq L_{1}$ back into (94) yields $P_{2} \geq P_{1}$ , i.e., $S_{\infty} (X_{D})$ is nondecreasing in $D$ .

B.4.3 $D^{⋆} (T)$ D star of T Generally Approaches $D$ D in Prequential Coding

We now show that the compute-optimal training set size for prequential coding generically saturates at $D = D$ D equals caligraphic D as $T \to \infty$ T approaches infinity, without assuming specific scaling laws.

In continuous time, the prequential model description length is the area above the final loss:

∣ P_{preq} (N, D) ∣ := \int_{0}^{D} (L (N, u) - L (N, D)) d u . (96)

The corresponding two-part code length for a test set of size $D$ D is

J_{preq} (N, D; D) = ∣ P_{preq} (N, D) ∣ + D L (N, D) = \int_{0}^{D} L (N, u) d u + (D - D) L (N, D) . (97)

We express $N$ N in terms of $D$ D for fixed $T$ T using the constraint $6 N D + 2 N D = T$ :

N_{T} (D) = \frac{T}{6 D + 2 D} . (98)

Large-compute limit. Assume: (i) $L (N, D)$ is nonincreasing in $N$ and admits a pointwise infinite-model-size limit $L_{\infty} (D) := lim_{N \to \infty} L (N, D)$ ;⁵ (ii) $L_{\infty}$ is continuously differentiable and strictly decreasing, i.e., $L_{\infty}^{'} (D) < 0$ . Along the compute frontier (98), for any fixed $D$ we have $N_{T} (D) \to \infty$ as $T \to \infty$ , hence

J_{preq} (N_{T} (D), D; D) \to J_{\infty} (D) := \int_{0}^{D} L_{\infty} (u) d u + (D - D) L_{\infty} (D) . (99)

Differentiating gives

J_{\infty}^{'} (D) = (D - D) L_{\infty}^{'} (D) . (100)

Since $L_{\infty}^{'} (D) < 0$ , we have $J_{\infty}^{'} (D) < 0$ for $D < D$ and $J_{\infty}^{'} (D) > 0$ for $D > D$ . Thus $J_{\infty}$ is uniquely minimized at $D = D$ , implying

D^{⋆} (T) \to D as T \to \infty. (101)

Approach from below and linear growth of $N^{⋆} (T)$ . By Theorem 30, under the complementarity and diminishing-returns assumptions, the compute-optimal training set size $D^{⋆} (T)$ is strictly increasing in $T$ . Combined with the convergence $D^{⋆} (T) \to D$ , this yields $D^{⋆} (T) ↑ D$ , i.e., $D^{⋆} (T)$ approaches $D$ from below. Finally, since $N^{⋆} (T) = N_{T} (D^{⋆} (T))$ ,

N^{⋆} (T) = \frac{T}{6 D ^{⋆} ( T ) + 2 D} \sim \frac{T}{8 D}, (102)

so the compute-optimal model size grows linearly with $T$ in the large-compute regime.

Appendix C. Experiment Details

Unless otherwise stated, we use the GPT-2 (Radford et al., 2019)Radford and colleagues, 2019, transformer architecture trained with Adam optimizer. In experiments where we vary the model size, we tune the base learning rate on a small model and transfer it to larger models using using $μ$ P (Yang et al., 2022)mu P, described by Yang and colleagues, 2022, and CompleteP (Dey et al., 2025)and Complete P, described by Dey and colleagues, 2025. In $μ$ Pmu P, the per-layer learning rate is base learning rate divided by the input dimension, so our reported base learning rate is larger than typical learning rates used for Adam. The hyperparameters presented below are shared between the teacher and the student for requential coding (width, depth, learning rate, EMA time scale, etc.). As described in Section B.1, the EMA for the teacher is used only for producing the distillation target and does not alter the raw teacher training dynamics, while the EMA for the student model does alter its training dynamics and is used to replace a decaying learning rate schedule.

C.1 ECA

In Figure 3, we train the transformer to predict $Y$ Y given $X$ X where $X$ X is the initial state with a state size of 64 cells and $Y$ Y is obtained by evolving $X$ X for 48 steps. We apply a burnin period of 1000 steps for sampling the initial state $X$ X to eliminate the less uninteresting transient dynamics from random initialization. That is $X$ X is obtained by evolving the ECA on $Z$ Z for 1000 steps where $Z$ Z is a uniform random initial state. For each rule, we train models with width (embedding dimension) $\in {16, 32, 64, 128, 256, 512}$ ranging from 16 to 512 and depth (number of transformer blocks) $\in {1, 2, 4, 6, 9}$ ranging from 1 to 9. We train both teacher and student using batches of 1536 sequences (each an $(X, Y)$ X, Y pair), a base learning rate of 0.03 with 100 warmup steps, and an EMA time scale of 50 steps (half-life divided by $ln (2)$ natural log of two). We did not set a max teacher-student KL as the student smoothly trackes the teacher throughout training. The epiplexity and time-bounded entropy is estimated for a test set of size $D = 100 M$ D equals 100 million tokens (counting $Y$ Y only).

C.2 Easy induction

For this task, we use a sequence length of $n = 512$ n equals 512 (as described in Section 5.3.1). The model has 3 layers and a width of 128, and is trained with a learning rate of 0.03 and a batch size of 384 sequences for 3000 steps with 15 warmup steps and an EMA time scale of 50 steps. We found further increasing the model size led to negligible improvement in the loss, and Figure 5c shows that the model has nearly converged by the end of training to the theoretical minimum loss, so there is no need to further increase the training data. As a result, we expect the epiplexity $S_{T} (X)$ S T of X to stabilize as $T$ T and $D = ∣ X ∣$ D, defined as the size of X, increases (in the relevant regime where $T$ T is still much less than what is required for implementing the brute-force solution that enumerates all possible combinations of hidden entries in the transition matrix), and our estimated epiplexity approximates this stabilized value.

C.3 Hard induction

We modify the ECA experiment in Section C.1 to remove the first $h \in {0, 1, \dots, 5}$ h bits, where h is between zero and five, bits in $X$ X when fed to the model as input. We use a state size of 32, batch size of 1536 sequences, learning of 0.03, EMA time scale of 100 steps. We set the max KL threshold between the teacher and student as 0.03 (nats per token). To construct a forward function that is hard to invert, we use rule 30 iterated for 4 steps. We train models with 3 layers and width 256 for 20000. Further increasing model size or training data led to no improvement in the loss. As Figure 5b shows, the models converge by the end of training (the loss curves shown are for the student models, but the teacher models also converge) to the theoretical minimum values. Therefore, like the case for Section C.2, we expect the epiplexity

$S_{T} (X)$ S T of X to stabilize as $T$ T and $D = ∣ X ∣$ D equals the magnitude of X increases, at least in the relevant regime where $T$ T is still much less than what is required for implementing the brute-force solution that enumerates all possible combinations of hidden bits, and our estimated epiplexity approximates this stabilized value.

C.4 Chess

We train models of varying sizes from 1M to 160M parameters with depth between 3 and 24 layers. The base learning rate is set to 2 and the batch size is 256, with a sequence length of 512. We set the EMA time scale to 50 steps and max KL to 0.1 nats per token. We use character-level tokenization. The teacher models are trained for 5B tokens in total, and the student models are trained for slightly more due to hitting the max KL threshold during training. The test set size is set to 5B tokens.

Pre-Training Data. We use the Lichess dataset available on Hugging Face as pre-training data, formatted as either "<board>|<moves>" or "<moves>|<board>", where moves are in algebraic chess notation and board is the final board state in FEN notation. We use a slightly more concise version of the algebraic notation to further compress the move sequence. An example input where the board appears last is:

e4,e5;Nf3,Nc6;Bb5,a6;Ba4,Nf6;O-O,Be7;Re1,b5;Bb3,d6;c3,O-O;h3,Nb8;d4,Nbd7;
|r1bq1rk1/2pnbppp/p2p1n2/1p2p3/3PP3/1BP2N1P/PP3PP1/RNBQR1K1 w - - 0 10

For downstream evaluation, we evaluate performance on the following two datasets after fine-tuning on 50k examples for a 10M-parameter model with depth 24. We report accuracy under greedy decoding at zero temperature.

Chess Puzzles. We use puzzles from the Lichess puzzle database available at Hugging Face, filtering for puzzles with difficulty rating above 2000. The task is to predict the correct move sequence given the game context. Puzzles are formatted as move sequences where the model must predict the next optimal move, following (Burns et al., 2023)Burns and colleagues, 2023, with only the target moves included in the loss computation via masking. This tests the model’s ability to recognize tactical patterns and calculate forced sequences.

Centipawn Evaluation. We evaluate position understanding using the Lichess chess position evaluations dataset at Hugging Face, where models classify positions into 9 evaluation buckets based on Stockfish centipawn (cp) scores: class 0 ( $\leq - 800$ cp), class 1 ( $- 800$ to $- 400$ cp), class 2 ( $- 400$ to $- 200$ cp), class 3 ( $- 200$ to $- 50$ cp), class 4 ( $- 50$ to $+ 50$ cp), class 5 ( $+ 50$ to $+ 200$ cp), class 6 ( $+ 200$ to $+ 400$ cp), class 7 ( $+ 400$ to $+ 800$ cp), and class 8 ( $\geq + 800$ cp). Examples are formatted as "<board>|<class>" where the model predicts the evaluation class, with mate positions assigned to the extreme classes (0 or 8). Loss during fine-tuning is computed only for predicting the class.

C.5 OpenWebText

We use the OpenWebText dataset at Hugging Face, keeping only documents containing only 96 common alphanumeric symbols, and apply character-level tokenization. The setup is otherwise identical to the chess experiment (Section C.4).

C.6 CIFAR-5M

We use the CIFAR-5M dataset at GitHub. We convert the $32 \times 32 \times 3$ images to greyscale and flatten to a 1D sequence of 1024 in raster-scan order. The vocabulary is the set of pixel intensities ${0, \dots, 255}$ zero to two hundred and fifty-five. The setup is otherwise identical to the chess experiment (Section C.4).

C.7 Prequential vs Resequential Comparison

ECA. The ECA experiment include rules ${0, 32, 4, 15, 22, 30, 41, 54, 106, 110}$ , covering all 4 classes. We train models with width $\in {16, 32, 64, 128}$ and depth $\in {1, 2, 3}$ up to 10000 steps. We use a base learning rate of 0.03 and batch size of 384. Other hyperparameters are identical to Section C.1. We set $D = 250 M$ D to 250 million tokens. For each rule, we report the maximum epiplexity over the resulting compute range.

Induction. Both the easy and hard induction results directly come from the experiments in Section 5.3.1. As explained in Section C.2 and Section C.3, the compute budget $T$ T and test set size $D$ D need not be precisely specified for these two tasks as the epiplexity stabilizes as $T$ and $D$ increase due to the convergent training dynamics.

Natural data. We report the estimated epiplexity on each dataset at the maximum tested compute budget as described in Section C.4, Section C.5, and Section C.6.

C.8 ECA Emergence

We modify the setup in Section C.1 to include models that predict intermediate states and the final state rather than the final state directly. Let $X^{(0)}$ X zero denote the initial ECA state, and $X^{(s)}$ X s denote it evolved for $s$ s steps. For an $ℓ$ -loopl-loop model, we train the model to predict $(X^{(Δ)}, X^{(2Δ)}, \dots, X^{(t)})$ instead of $X^{(t)}$ only, where $Δ = t / ℓ$ delta equals t over l. Its marginal probability on the final state is lower-bounded by its joint probability on the ground truth trajectory:

P (X^{(t)}) = X^{' (Δ)}, \dots, X^{' (t - Δ)} \sum P (X^{' (Δ)}, \dots, X^{' (t - Δ)}, X^{(t)}) (103)

So we upper-bound its NLL as

lo g \frac{1}{P ( X ^{(t)} )} \leq lo g \frac{1}{P ( X ^{(Δ)} , \dots , X ^{(t)} )} = k = 1 \sum ℓ lo g \frac{1}{P ( X ^{(k Δ)} ∣ X ^{((k - 1) Δ)} , \dots , X ^{(Δ)} )} (104)

We account for the intermediate tokens when computing the time bound and the code length (they contribute to the model code length as well as the data entropy code length). In the experiment, we set the ECA steps to $t = 64$ . We train models with width ${16, 32, 64, 128}$ , depth $\in {1, 2, 4, 8, 16, 32}$ , and number of loops $ℓ \in {1, 2, 4, 8, 16}$ l in the set one, two, four, eight, sixteen. We found $ℓ \in {2, 4, 8}$ has no advantage over the non-looped model $(ℓ = 1)$ in terms of the two-part code, only $ℓ = 16$ does. We therefore refer to $ℓ = 1$ as non-looped and $ℓ = 16$ as looped. The fact that a small $ℓ > 1$ is not helpful is likely because the overhead of encoding and generating intermediate states exceeds the savings from only slightly simplifying each prediction step, as the per-step prediction horizon is still significant. We train all models with a base learning rate of 0.06, batch size of 147456 tokens, warmup of 100 steps, and EMA time scale of 50 steps. We did not set a max teacher-student KL. The test set size is set to $D = 100 M$ D equals one hundred million final state tokens.

C.9 Scaling Laws

We estimate epiplexity and time-bounded entropy using the expressions derived in Section B.3 for prequential coding using existing scaling laws for $L (N, D)$ L of N and D. We solve for the optimal training tokens $D^{⋆} (T)$ D star of T as a function of compute using root finding for Equation (56). For language, we use the Chinchilla scaling laws from Hoffmann et al. (2022)Hoffmann and colleagues, which were fit to total parameter counts. For all other modalities (images and video), we use the scaling laws from Henighan et al. (2020)Henighan and colleagues, which follow the methodology of Kaplan et al. (2020)Kaplan and colleagues and report non-embedding parameter counts. We correct these to use total parameters following Pearce and Song (2024)Pearce and Song, as described below.

Correcting for embedding parameters. The scaling laws in Kaplan et al. (2020) and Henighan et al. (2020) are reported in terms of non-embedding parameters $N_{∖ E}$ and non-embedding compute $C_{∖ E}$ , excluding embedding and unembedding parameters. As shown by Pearce and Song (2024), this choice—combined with smaller model scales—accounts for much of the discrepancy between the Kaplan and Chinchilla scaling laws. Following their approach, we relate total parameters $N$ to non-embedding parameters via

N = N_{∖ E} + ω N_{∖ E}^{1/3}, ω = (V + L_{ctx}) (\frac{A}{12})^{1/3}, (105)

where $V$ is the vocabulary size, $L_{ctx}$ is the context length, and $A$ is the aspect ratio (width/depth). We use $A = 5$ as Henighan et al. (2020) showed the optimal aspect ratio is around this value for non-language datasets. We generate points $(C_{∖ E}, N_{∖ E}, L)$ from the original scaling laws, convert to $(C, N, L)$ using this relation (with total compute as $C = C_{∖ E} \cdot N / N_{∖ E}$ ), and refit the power-law exponents and the irreducible loss.

Parameterization conversion. The scaling laws in Henighan et al. (2020) are reported in compute-centric form, expressing the optimal loss $L^{⋆} (C) = (C / C_{0})^{- γ} + E$ and optimal model size $N^{⋆} (C) = (C / \hat{C})^{δ}$ as functions of compute budget $C$ . We convert these to the $(N, D)$ parameterization used in this work:

L (N, D) = (\frac{N}{N _{0}})^{- α} + (\frac{D}{D _{0}})^{- β} + E, (106)

where the exponents transform as $α = γ / δ$ and $β = γ / (1 - δ)$ , and the token scale is given by $D_{0} = \frac{C ^}{6} N_{0}^{α / β} (β / α)^{- 1/ β}$ .

Corrected parameters. Table 1 presents the corrected scaling law parameters used in our final calculations.

Domain	$α$	$β$	$N_{0}$	$D_{0}$	$E$
Image $8 \times 8$	0.331	0.566	$8.0 \times 1 0^{1}$	$2.66 \times 1 0^{6}$	3.14
Image $16 \times 16$	0.307	0.820	$2.8 \times 1 0^{2}$	$8.94 \times 1 0^{7}$	2.68
Image $32 \times 32$	0.258	0.399	$6.3 \times 1 0^{1}$	$1.95 \times 1 0^{6}$	2.30
Image VQ $16 \times 16$	0.322	0.441	$2.7 \times 1 0^{4}$	$4.44 \times 1 0^{7}$	4.23
Image VQ $32 \times 32$	0.287	0.560	$1.9 \times 1 0^{4}$	$1.63 \times 1 0^{8}$	3.32
Video VQ $1 6^{3}$	0.428	0.718	$3.7 \times 1 0^{4}$	$1.79 \times 1 0^{8}$	1.15
Language (Chinchilla)	0.339	0.285	$4.91 \times 1 0^{7}$	$1.49 \times 1 0^{9}$	1.69

Table 1: Final scaling law parameters used. Image and video domains from Henighan et al. (2020) are corrected for embedding parameters using aspect ratio $A = 5$ following (Pearce and Song, 2024); Chinchilla (language) from Hoffmann et al. (2022) was originally fit to total parameter counts and requires no correction. $D_{0}$ is measured in tokens and $E$ is measured in nats.

Appendix D. RASP-L for Elementary Cellular Automata

Below we provide RASP-L code (Zhou et al., 2023)by Zhou and colleagues, twenty twenty three, demonstrating how the evolution rule of an ECA can be implemented, providing evidence that the solution can be expressed within an autoregressive transformer model.

Listing 1: RASPL implementation of a cellular automaton evolution step

from np_rasp import *
 
def int2bits(x, bits=8): # returns LSB first
	""" Helper function to generate fixed bitstring representing a number.
	Not RASP-L, can be assumed constant."""
	bits_str = bin(x)[2:].zfill(bits)
	return np.array(list(map(int, bits_str[::-1])),dtype=np.uint8)
 
sep = -1
sep2 = -2
def evolve_ca(x, rule):
	""" Function to autoregressively output produce the output of one step of the ECA
	rule. Problem encoded as x= --input state--,sep,sep2,--output state--.
	Rule: int (specifying the ECA)"""
	lookup = int2bits(rule, 8)
	in_input = 1 - has_seen(x, full(x, sep))
	in_input2 = 1 - has_seen(x, full(x, sep2))
	width = cumsum(in_input) # only valid after sep
	idx = indices(x)
	circ_x = where(in_input, x, index_select(x, idx - width))
	prev = shift_right(x, 1)
	cprev = where(in_input2, prev, index_select(prev, idx - width))
	prev2 = shift_right(x, 2)
	nbhd = (prev2 << 2) + (cprev << 1) + circ_x
	shifted_nextstate = lookup[nbhd]
	to_select_idx = idx - width
	to_select_idx = where(to_select_idx < 3, idx, to_select_idx)
	outstate = index_select(shifted_nextstate, to_select_idx)
	return outstate

Appendix E. Cellular Automata and Game of Life

Elementary cellular automata Elementary cellular automata (ECA) (Wolfram and Gad-el Hak, 2003)as described by Wolfram and Gad-el Hak in two thousand and three are one-dimensional cellular automata defined on a finite or infinite line of cells, each in one of two states: 0 or 1. The system evolves in discrete time steps according to local rules: a cell’s next state depends only on its current state and those of its two immediate neighbors, yielding $2^{3} = 8$ two cubed, or eight, possible neighborhood configurations. Since each configuration can map to either 0 or 1, there are $2^{8} = 256$ two to the eighth, or two hundred and fifty-six, possible rules, conventionally numbered 0–255 using Wolfram’s notation, where the rule number’s binary representation specifies the output for each neighborhood. Despite their simplicity, ECAs exhibit diverse behaviors ranging from trivial (e.g., Rule 0) to complex and chaotic (e.g., Rule 30), with Rule 54 proven to be Turing-complete. These systems serve as minimal models for studying emergence, computation, and the relationship between local rules and global behavior.

Conways Game of Life Conway’s Game of Life (Gardner, 1970)described by Gardner in nineteen seventy is a cellular automaton defined on an infinite two-dimensional grid of cells, each in one of two states: alive (1) or dead (0). The system evolves in discrete time steps according to deterministic local rules: a cell’s next state depends only on its current state and those of its eight neighbors. Specifically, a live cell survives if it has exactly 2 or 3 live neighbors (otherwise it dies), while a dead cell becomes alive if it has exactly 3 live neighbors (otherwise it remains dead). Despite the simplicity of these rules, the Game of Life exhibits

A comparison between chaotic trajectories and the resulting invariant measure. On the left, (a) shows trajectories from the Lorenz equations diverging from initial conditions. On the right, (b) compares 3000 points sampled from an LLM-modeled distribution to the true invariant measure of the Lorenz system, both showing a similar "butterfly" shape. Figure 11: LLMs can learn the invariant measure of chaotic systems despite unpredictable trajectories. (a) Chaotic systems like the Lorenz equations display sensitive dependence on initial conditions. Tiny perturbations to the initial conditions (orange) diverge exponentially, making long-term predictions impossible when simulating with limited computation and precision on a computer. (b) 3000 sampled points from the distribution modeled by the LLM (left) and from the invariant measure of the Lorenz system (right). Color denotes kernel density estimation of each density.

remarkably complex emergent behavior, including stable structures (blocks), periodic oscillators (blinkers), mobile patterns (gliders), and structures that generate infinite streams of gliders (glider guns). The system also happens to be Turing-complete, with a specific initial configuration specifying the program, it is capable of universal computation.

Appendix F. Emergence

Lorenz System and Chaotic Dynamics For the Lorenz system, a canonical example of a chaotic ODE, we can observe a different kind of emergence (Type-0 in Carroll and Parola (2024)). There exists a canonical invariant measure in dynamical systems (under some regularity conditions) known as the SRB measure (Metzger, 2000). States evolved for a long time in the Lorenz system will converge this measure. As the Lorenz system is chaotic, tiny perturbations are exponentially amplified through time at a rate related to the largest Lyapunov exponent $λ_{1} \approx 0.9$ lambda one, approximately zero point nine. There is a precise sense in which entropy is created in this system at a rate of $λ_{1} lo g_{2} (e)$ lambda one times the base two log of e bits per second, formalized through Pesin’s theorem (Pesin, 1977)Pesin’s theorem, despite the fact that it is a purely deterministic process. Intuitively one can see this picture when simulating the system using fixed precision numbers, and seeing $lo g_{2} (e)$ log base two of e bits of that description replaced with unpredictable random content after every Lyapunov time $1/ λ_{1}$ one over lambda one. On the one hand randomness is produced, but it is not uniformly random. Rather, there is a stationary measure in the shape of a butterfly, and an observer who has lost track of all previous bits due to chaos can still learn the shape of the butterfly. Moreover, the shape of the stationary measure is not immediately obvious from the ODE, it is emergent and cannot easily be understood without intensive numerical simulation of the system (hence why most of chaos theory was developed after computers).

To demonstrate this interplay, we train a language model to predict the first $B = 10$ B equals ten bits of the future state $Φ_{t} (X)$ phi t of X from an initial state sampled uniformly from the box $X \sim U [- 20, 20]^{3} + 20 [0, 0, 1]$ X sampled uniformly from a shifted box quantized to $B$ B bits, in comparison to directly modeling $Φ_{t} (X)$ phi t of X. For both we set the time $t$ t to be 30 Lyapunov times into the future, $t = 30/ λ_{1}$ t equals thirty over lambda one. The resulting model has a nearly identical loss and estimated epiplexity in the two settings. Despite being unable to distinguish the initial conditions, the LLM learns the invariant (SRB) measure to a reasonable approximation as shown in Figure 11b.

With very limited compute the stationary measure is not predictable apriori from the dynamics, but with more compute it is merely a consequence. The epiplexity of the attractor for limited compute may be larger than a description of the dynamics $S_{T} (Φ_{t} (X)) > S_{T} (Φ, t)$ S T of Phi t of X is greater than S T of Phi t.

Chess: AlphaZero and Minimax A qualitatively different kind of example can be had by considering the models produced by AlphaZero (Silver et al., 2018)AlphaZero, described by Silver and colleagues, and the theoretically optimal minimax solution for these two player zero sum perfect information games (von Neumann, 1928; Shannon, 1950). The minimax strategy can be implemented by a short program, and with sufficient compute (exponential in the size of the board (Fraenkel and Lichtenstein, 1981)) the optimal strategy can be found. On the other hand the CNN policy and value network produced by AlphaZero contain 10s of millions of parameters. Given that the rules of chess can be encoded in just a few hundred bytes, and the algorithm used to train the model can be simply described and also implemented by a short program, one may wonder where does this information come from? With the other examples of emergent phenomena in mind, we can make sense of this information being produced by the computational process of the AlphaZero system. In contrast, with unbounded compute, the best strategy contains little information.

To summarize, due to the existence of emergent phenomena, even systems that have simple generating processes or simple descriptions can lead to large amounts of structural information to be learned by computationally constrained observers.

Appendix G. Induction is Not Specific to Autoregressive Factorization

One might get the impression that key constraint that leads to this induction phenomenon is the autoregressive factorization, as it is intuitive to see how such a model needs to perform induction in-context to achieve minimum loss. However, we argue this phenomenon takes place with other classes of generative models trained as long as they are trained with Maximum Likelihood Estimation (MLE) or its approximations.

In MLE, a generative model allowing explicit likelihood evaluation is trained to maximize the likelihood of the data. Computing the likelihood can be significantly more computationally challenging than sampling from the distribution $P$ P. This distinction is particularly clear in the examples we gave where the ground-truth $P$ P is a mixture distribution represented by a latent variable model with the CA initial state or Markov chain transition matrix acting as the latent variable $Z$ Z. Given access to $P_{X ∣ Z}$ P of X given Z (equivalent to some easy to implement forward function $F$ F), sampling is easy as long as $P_{Z}$ P sub Z is a simple, but computing $P_{X} (x)$ P X of x for some input $x$ x requires evaluating an intractable integral $P_{X} (x) = \int P_{X ∣ Z} (x ∣ z) P_{Z} (z) d z$ P X of x equals the integral of P of X given Z times P Z with respect to z due to the high-dimensionality of $Z$ Z. As such, a model given a limited compute-budget is forced to learn a cheaper but more sophisticated algorithm for computing $P_{X} (x)$ P X of x, often involving approximating the inverse $P_{Z ∣ X}$ P of Z given X either explicitly as done in expectation–maximization type algorithms and Variational Autoencoders (Kingma et al., 2013)Variational Autoencoders, or implicitly as we illustrated for the autoregressive transformer.

Appendix H. Minimum Description Legnth

Intuitively, $L (H)$ L of H can be interpreted as the structural information, and $- lo g P (x ∣ H)$ minus log P of x given H can be understood as the remaining random information that cannot be predicted by the best model in $H$ H. A main problem with the crude two-part code is that it does not prescribe how one should design the code for $H$ H (i.e., a procedure for describing $H$ H within $H$ H). The description of a particular $H$ H can be short under one code but very large under another, which could require additional knowledge to resolve. To circumvent this issue, one can use a more refined one-part code that describes the data

thoughtful
with the entire model class $H$ H rather than any single model $H$ H. One of the most important one-part codes is the normalized maximum likelihood code.

Definition 31 (Normalized maximum likelihood code (Grünwald, 2007)as defined by Grunwald in 2007) The NML distribution $P_{H}^{NML} : {0, 1}^{n \times d} \to [0, 1]$ P N M L sub H from binary strings to the interval zero to one of a probabilistic model class $H$ H is:

P_{H}^{NML} (x) = \frac{P ( x ∣ H ( x ))}{\sum _{y \in {0, 1}^{n \times d}} P ( y ∣ H ( y ))}

where $H (x) = ar g max_{H \in H} P (x ∣ H)$ H hat of x is the argument that maximizes the probability of x given H is the maximum likelihood estimator for $x$ x.

Crucially, notice that the NML code only depends on $H$ H rather than any particular $H \in H$ H in H, so we do not have to design a particular code for $H$ H. Unfortunately, the NML code requires integrating over the maximum likelihood estimator for all possible data, which is intractable for most practical models such as deep neural networks. We can instead use a more tractable variant of one-part code based on sequential prediction called prequential coding.

Definition 32 (Prequential code (Grünwald, 2007)) The prequential distribution $P_{H}^{PREQ} : {0, 1}^{n \times d} \to [0, 1]$ P preq sub H of a probabilistic model class $H$ H is:

P_{H}^{PREQ} (x) = k = 1 \prod n P (x_{k} ∣ H (x_{1 : k}))

where $H (x_{1 : k}) = ar g max_{H \in H} P (x_{1 : k} ∣ H)$ H hat of x one through k is the MLE for the first $k$ k elements of $x$ x.

This definition above uses the MLE for updating $H$ H hat but there are in fact no constraints on how the update is performed. We may use any update method of our choice to produce the next model in the sequence, so long as it only depends on the previous data. This means that we can naturally adapt it for deep learning, where we use stochastic gradient descent to update the model sequentially.

A code cannot be optimal simultaneously for all possible data $x$ x unless it has knowledge of the particular $x$ x. Therefore, it is useful to characterize how close a given code is to the optimal model, which can be formalized via the notion of regret.

Definition 33 (Regret (Grünwald, 2007)) The regret of a code $Q$ Q relative to $H$ H for $x$ x is the additional number of bits needed to encode $x$ x using $Q$ Q compared to the best model in hindsight,

Reg (Q, H, x) = - lo g Q (x) - H \in H min {- lo g P (x ∣ H)} .

Under this notion of penalty, the NML is optimal in the sense that it achieves the minimax regret. The regret provides a way to compare different codes. Consider the two-part regret of the crude two-part code $P^{2 P} (\cdot)$ P two P with minimizer $H^{⋆}$ H star and associated predictive distribution $P (\cdot ∣ H^{⋆})$ P given H star,

Reg (P^{2 P}, H, x) = L (H^{⋆}) + lo g \frac{1}{P ( x ∣ H ^{⋆} )} - lo g \frac{1}{P ( x ∣ H )} .

This means that for a two-part code, the regret is an upper bound on the description length of the model. For sufficiently large $n$ n, the last two terms become close to each other and $Reg (P^{2 P}, H, x) \approx L (H^{⋆})$ Regret is approximately L of H star. In the case of NML, the regret is the minimax regret that $Reg (P_{H}^{NML}, H, x) = lo g \sum_{y \in {0, 1}^{n}} P (y ∣ H (y))$ the regret of the NML code equals the log of the sum over all binary strings y of the probability of y given its maximum likelihood estimator. This quantity is independent of $x$ x, which is also called parametric complexity of $H$ H, because

it measures how expressive the entire model class is by counting the total amount of possible data sequences the model class can model well.

As shown above, the regret of a code is fundamentally related to the size of the model. Similarly, the regret of the prequential code can also be interpreted as a measurement of model size even though it does not provide a separate description of the model. While prequential code is not minimax optimal like NML, it is shown that prequential Bayes code can have a small additive $O (lo g n)$ big O of log n regret penalty compared to NML. Furthermore, any non-prequential code (e.g., NML or two-part code) can be converted into a prequential code with only $O (lo g n)$ big O of log n overhead. As such, we can interpret $Reg (P_{H}^{PREQ}, H, x) = lo g 1/ P_{H}^{PREQ} (x) - lo g 1/ P (x ∣ H)$ the regret of the prequential code for H given x, defined as the log of one over the prequential probability minus the log of one over the likelihood of x given H hat as a measurement of the effective model size conditioned on the data $x$ x. The crucial advantage of prequential code is that it is data-dependent, it is fully compatible with sequential model optimization, and it avoids the need for estimating individual model size $L (H)$ L of H, making it an ideal tool for studying deep learning and properties of the data.

In section 4.3, we introduced $Reg (P_{H}^{PREQ}, H, x)$ the prequential regret as a proxy for $L (H^{⋆})$ L of H star. Indeed, since prequential coding does not emit a valid code for the model, only the code for both the data and model, it can only be interpreted as an approximation of epiplexity. However, if we forgo the two-part code formulation and use the regret as a measurement of the effective model size, then prequential epiplexity is no longer an approximation. We will refer to this regret-based formulation as generalized epiplexity. The main benefit of generalized epiplexity is that it applies to a much wider range of coding schemes (i.e., both one-part and two-part codes), as it only requires the total length of the data and model, and a reference probability model. Unfortunately, the theoretical results based on two-part epiplexity do not immediately transfer to generalized epiplexity. It remains an open question whether similar results can be obtained.

Specifically, when the difference between outcomes can be measured in polynomial time. ↩
One such possibility is to constrain the function class to all models reachable by a given optimization procedure with a given neural network architecture. ↩
We have $L (Z_{: M}, P_{M}) + O (1) \geq K (Z_{: M}, P_{M})$ , but not that $L (Z_{: M} ∣ P_{M}) + O (1) \leq K (Z_{: M} ∣ P_{M})$ . ↩
We start the sum and integral at 1 to avoid the singularity at 0, which is an artifact of the scaling law as it typically only holds for large $D$ . ↩
6. This limit provably exists under $μ$ P, but is a reasonable assumption in general as it simply asserts diminishing returns in scaling model size without increasing data. ↩

Graph View

TTS

From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

Abstract

1 Introduction

2 Background

2.2 Random vs Structural Information

2.3 The Minimum Description Length Principle

3 Epiplexity: Structural Information Extractable by a Computationally Bounded Observer

4 Measuring Epiplexity and Time-Bounded Entropy

4.1 Approximating Model Description Length with Prequential Coding

4.2 Explicitly Coding the Model with Requential Coding

4.3 Comparison Between the Two Approaches and Practical Recommendations

4.4 How Epiplexity and Time-Bounded Entropy Scale with Compute and Data

5 Three Apparent Paradoxes of Information

5.1 Paradox 1: Information Cannot be Created by Deterministic Transformations

5.2 Paradox 2: Information Content is Independent of Factorization

5.3 Paradox 3: Likelihood Modeling is Merely Distribution Matching

5.3.1 Induction

6 Epiplexity, Pre-Training, and OOD Generalization

6.1 Epiplexity Correlates with OOD Generalization in Chess

6.2 Measuring Structural Information in Natural Data

6.3 Estimating Epiplexity from Scaling Laws

6.4 Pre-Training Data Selection and Curriculum for Language Models

7 Additional Related Work

8 Discussion

References

Appendix A. Proofs

A.1 CSPRNGs have (nearly) maximal time-bounded Entropy and low epiplexity

A.2 Deterministic transformation can increase time bounded entropy and epiplexity

A.3 CSPRNGs have low epiplexity

A.4 Existence of High Epiplexity random variables

A.5 Information Content is not Independent of Factorization

A.6 Problems with time-bounded sophistication

Appendix B. Measuring Epiplexity

B.1 Further details on estimating epiplexity

B.2 Prequential Coding Approximates Requential Coding with a Static Teacher

B.3 A Solvable Model Using Scaling Laws

B.4 How Epiplexity and Time-Bounded Entropy Scale with Compute and Dataset Size

B.4.2 Monotonicity of S∞​(X) and H∞​(X) in D

B.4.3 D⋆(T)D star of T Generally Approaches DD in Prequential Coding

Appendix C. Experiment Details

C.1 ECA

C.2 Easy induction

C.3 Hard induction

C.4 Chess

C.5 OpenWebText

C.6 CIFAR-5M

C.7 Prequential vs Resequential Comparison

C.8 ECA Emergence

C.9 Scaling Laws

Appendix D. RASP-L for Elementary Cellular Automata

Appendix E. Cellular Automata and Game of Life

Appendix F. Emergence

Appendix G. Induction is Not Specific to Autoregressive Factorization

Appendix H. Minimum Description Legnth

Footnotes

B.4.2 Monotonicity of $S_{\infty} (X)$ and $H_{\infty} (X)$ in $D$

B.4.3 $D^{⋆} (T)$ D star of T Generally Approaches $D$ D in Prequential Coding