How do biological organisms grow their nervous systems?

Neural circuitry is grown through a developmental process.
Each cell contains the same developmental program.

Indirect encodings:

Developmental indirect encodings:

→ Can’t arrive at the phenotype in a single time-step; instead, it unfolds through a series of time-steps.
Conjectured to be analogous to computational irreducibility in some CA rules.
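
A minimal sketch of this unfolding, using an elementary 1D cellular automaton as a stand-in developmental program (everything here is illustrative): the genotype is just an 8-bit rule table, every cell applies the same rule (the same “developmental program”), and the phenotype only exists after many time-steps.

import numpy as np

def unfold(rule: int, width: int = 64, steps: int = 32) -> np.ndarray:
    # Genotype: the 8-bit rule table, mapping each 3-cell
    # neighborhood (0..7) to the next cell state.
    table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    state = np.zeros(width, dtype=np.uint8)
    state[width // 2] = 1  # single-seed initial condition (the "zygote")
    for _ in range(steps):  # development: every cell runs the same program
        left, right = np.roll(state, 1), np.roll(state, -1)
        state = table[4 * left + 2 * state + right]
    return state  # phenotype: only reachable by stepping through development

phenotype = unfold(rule=110)  # rule 110: a standard example of irreducible dynamics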

What’s interesting about indirect encodings?
compression: size of genotype ≪ size of phenotype (genome ≈ 1 GB ≪ connectome ≈ 700 TB)
generalization: genomic bottleneck hypothesis (the bottleneck incentivizes encoding adaptive behaviors)
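
Worked out, the compression ratio implied by the figures above (taken from this note, not independently sourced):

>>> genome_bytes, connectome_bytes = 1e9, 700e12
>>> connectome_bytes / genome_bytes   # nearly 6 orders of magnitude
700000.0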

Generalization and innate abilities through indirect encodings / genomic bottleneck

There is much discussion in the deep learning community about the generalization properties of large neural networks. While larger neural networks generalize better than smaller ones, the reason is not simply that they have more weight parameters; rather, as recent work (e.g. [1, 2, 3]) suggests, larger networks allow the optimization algorithm to find good solutions, or lottery tickets [1], within a small fraction of the allowable solution space. These solutions can then be pruned to form sub-networks with useful inductive biases that have desirable generalization properties.
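
As a concrete illustration of the pruning step, a minimal sketch assuming simple one-shot magnitude pruning (one common criterion in the lottery-ticket literature, not necessarily the exact method of [1]):

import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    # Keep only the largest-magnitude (1 - sparsity) fraction of weights.
    threshold = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) >= threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))          # stand-in for a trained layer's weights
mask = magnitude_prune(W, sparsity=0.9)  # sub-network: ~10% of weights survive
print(mask.mean())                       # ≈ 0.1
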
Recent neuroscience critiques of deep learning (e.g. [4, 5, 6]) point out that animals are born with highly structured brain connectivity that is far too complex to be specified explicitly in the genome, and so must be compressed through a “genomic bottleneck”: information encoded into the genome that specifies a set of rules for wiring up a brain [4]. Innate processes and behaviors are encoded by evolution into the genome, and as such many of the neuronal circuits in animal brains are pre-wired and ready to operate from birth [6]. These innate abilities make it easier for animals to generalize and quickly adapt to different environments [7].

Analogous to the pruning of lottery ticket solutions, indirect encoding methods offer the expressiveness of large neural architectures while minimizing the number of free model parameters.

By encoding the weights of a large model with a small number of parameters, we can substantially reduce the search space of the solution, at the expense of restricting our solution to a small subspace of all possible solutions offered by direct encoding methods. This constraint naturally incorporates into our agent an inductive bias that determines what it does well [5, 4, 6], and this inductive bias depends on the choice of indirect encoding method.
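
A minimal sketch of one such indirect encoding, assuming a hypothetical coordinate-based generator in the spirit of CPPNs/HyperNEAT (names and shapes are illustrative): a tiny network maps each weight’s (i, j) coordinates to its value, so the free parameters are the generator’s, not the phenotype’s.

import numpy as np

rng = np.random.default_rng(0)
h = 16  # hidden width of the generator

# Genotype: parameters of a tiny 2 -> h -> 1 MLP.
G = {"W1": rng.normal(size=(2, h)), "b1": np.zeros(h),
     "W2": rng.normal(size=(h, 1)), "b2": np.zeros(1)}

def generate_weights(G, n_out, n_in):
    # Phenotype: evaluate the generator at every (i, j) weight coordinate.
    i, j = np.meshgrid(np.linspace(-1, 1, n_out), np.linspace(-1, 1, n_in),
                       indexing="ij")
    coords = np.stack([i.ravel(), j.ravel()], axis=1)
    hidden = np.tanh(coords @ G["W1"] + G["b1"])
    return (hidden @ G["W2"] + G["b2"]).reshape(n_out, n_in)

W = generate_weights(G, 256, 256)
print(W.size, sum(p.size for p in G.values()))  # 65536 phenotype weights, 65 free parameters

Here the choice of generator is exactly the inductive bias: a smooth generator can only produce smooth weight patterns, biasing the reachable subspace accordingly.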


Computing an attention matrix as $A = XX^\top$ from the raw token features $X$ produces a Gram matrix. This already captures associations/relational structure between tokens, but since there are no free (learnable) parameters, it isn’t suitable for tasks beyond simple association.
Learnable matrices $W_Q, W_K$ are introduced, projecting the raw features into a new space where $QK^\top = XW_Q(XW_K)^\top$ is more meaningful for the task at hand.
These weight matrices can be seen as the small genotype encoding the much larger phenotype attention matrix: $A = \operatorname{softmax}(QK^\top / \sqrt{d_{qk}})$
When we scale the number of learnable parameters while the input dimension is held constant, the free parameters scale with $O(d_{qk})$, while $A$ grows with $O(n^2)$, and typically $n^2 \gg d_{qk}$, where $n$ is the number of input tokens.
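
A small numpy sketch of this gap (shapes arbitrary, for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d, d_qk = 512, 64, 16                  # tokens, feature dim, projection dim

X = rng.normal(size=(n, d))               # raw token features
W_Q = rng.normal(size=(d, d_qk))          # genotype: learnable projections
W_K = rng.normal(size=(d, d_qk))

scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_qk)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)        # row-wise softmax: phenotype A

print(W_Q.size + W_K.size, A.size)        # 2048 free parameters vs 262144 entries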

E.g. for GPT-3, there’s a 4 OOM difference!

>>> ctx, d_qk = 2048, 2*128   # GPT-3 context length; query + key projection dims per head
>>> ctx**2 / d_qk             # phenotype entries per free parameter (input dim held constant)
16384.0