How do biological organisms grow their nervous systems?
Neural circuitry is grown through a developmental process.
Each cell contains the same developmental program.
Indirect encodings: $\text{encoding}: \text{genotype} \to \text{phenotype}$
Developmental indirect encodings: $\text{encoding}(t): \text{genotype} \to \text{phenotype}$
→ The phenotype cannot be produced in a single time-step; it unfolds through a series of time-steps.
Conjectured to be analogous to computational irreducibility in some cellular automaton (CA) rules.
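A toy sketch of such a developmental encoding (the elementary CA, the rule choice, and all names here are illustrative, not a specific published model): a tiny genotype of 8 rule bits unfolds, step by step, into a much larger space-time pattern, and there is no shortcut to the final state other than running every step.

```python
import numpy as np

def develop(rule_bits, width=256, steps=128):
    """Developmental indirect encoding: unfold an elementary 1-D cellular automaton.

    Genotype:  8 rule bits (plus the convention of a single seed cell).
    Phenotype: the full (steps x width) space-time pattern it grows into.
    """
    state = np.zeros(width, dtype=int)
    state[width // 2] = 1                      # minimal seed
    history = [state]
    for _ in range(steps - 1):
        left, right = np.roll(state, 1), np.roll(state, -1)
        # Each cell's next value is looked up from its 3-cell neighborhood.
        # Computational irreducibility: row t can only be reached via row t-1.
        state = rule_bits[4 * left + 2 * state + right]
        history.append(state)
    return np.stack(history)

# Rule 110 written as 8 lookup bits (neighborhood 000 -> index 0, ..., 111 -> index 7).
rule_bits = np.array([(110 >> i) & 1 for i in range(8)])
phenotype = develop(rule_bits)
print(phenotype.shape)  # (128, 256): far larger than the 8-bit genotype
```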
There is much discussion in the deep learning community about the generalization properties of large neural networks. While larger neural networks generalize better than smaller ones, the reason is not that they have more weight parameters; rather, as recent work (e.g. [1, 2, 3]) suggests, it is because larger networks allow the optimization algorithm to find good solutions, or lottery tickets [1], within a small fraction of the allowable solution space. These solutions can then be pruned to form sub-networks with useful inductive biases that have desirable generalization properties.
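As a rough illustration of the pruning step (plain magnitude pruning in the same spirit, not the exact procedure of [1]): train a large weight matrix, then keep only the largest-magnitude weights as a sparse sub-network.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))  # stand-in for a trained layer's weights

def magnitude_prune(W, keep_fraction=0.1):
    """Keep only the top-|w| fraction of weights; zero out the rest."""
    threshold = np.quantile(np.abs(W), 1.0 - keep_fraction)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W_sparse, mask = magnitude_prune(W, keep_fraction=0.1)
print(mask.mean())  # ~0.1: the surviving sub-network uses 10% of the parameters
```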
Recent neuroscience critiques of deep learning (e.g. [4, 5, 6]) point out that animals are born with highly structured brain connectivity that is far too complex to be specified explicitly in the genome and must instead be compressed through a “genomic bottleneck”: information encoded in the genome that specifies a set of rules for wiring up a brain [4]. Innate processes and behaviors are encoded by evolution into the genome, and as such many of the neuronal circuits in animal brains are pre-wired and ready to operate from birth [6]. These innate abilities make it easier for animals to generalize and quickly adapt to different environments [7].
…
Analogous to the pruning of lottery ticket solutions, indirect encoding methods retain the expressiveness of large neural architectures while minimizing the number of free model parameters.
…
By encoding the weights of a large model with a small number of parameters, we can substantially reduce the search space, at the expense of restricting our solutions to a small subspace of all possible solutions offered by direct encoding methods. This constraint naturally incorporates an inductive bias into our agent that determines what it is good at [5, 4, 6], and this inductive bias depends on the choice of indirect encoding method.
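One concrete way to realize this (a minimal sketch in the spirit of random-subspace encodings; the sizes and names are assumptions, not a specific published method): a small genotype vector is expanded by a fixed, non-learned projection into the full weight matrix, so search happens only in the low-dimensional genotype space.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 256, 256   # phenotype: a 256x256 weight matrix (65,536 values)
d_geno = 64              # genotype: only 64 free parameters

# Fixed expansion map, shared across all candidate solutions; it is not a
# free parameter. Searching over `genotype` explores a 64-dim subspace of
# the full weight space, which builds in an inductive bias.
expand = rng.normal(size=(d_geno, n_out * n_in)) / np.sqrt(d_geno)

genotype = rng.normal(size=d_geno)
W = (genotype @ expand).reshape(n_out, n_in)  # phenotype weights
print(W.shape, genotype.size)                 # (256, 256) 64
```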
Computing an attention matrix as $A = XX^\top$ from the raw token features produces a Gram matrix. This already captures associations/relational structure between tokens, but since there are no free (learnable) parameters, it isn't suitable for tasks beyond simple association.
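In code, for raw token features $X$ of shape $(n, d_{emb})$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_emb = 6, 4
X = rng.normal(size=(n, d_emb))  # raw token features

A = X @ X.T                      # Gram matrix: A[i, j] = <x_i, x_j>
print(A.shape)                   # (n, n) token-token associations, no free parameters
```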
Learnable matrices $W_q, W_k \in \mathbb{R}^{d_{emb} \times d_{qk}}$ are introduced, projecting the raw features into a new space where $A$ is more meaningful for the task at hand.
These weight matrices can be seen as the small genotype encoding the much larger phenotype attention matrix:

$$A = (XW_q)(XW_k)^\top$$
When we scale the model while the input dimension is held constant, the number of free parameters grows as $O(d_{qk})$, while $A$ grows as $O(n^2)$; typically $n^2 \gg d_{qk}$, where $n$ is the number of input tokens.
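A sketch making the counts concrete (the particular sizes are illustrative; standard attention would additionally scale and row-softmax $A$, which is omitted here to match the definition above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_emb, d_qk = 64, 16, 8           # n tokens; input dim held constant
X = rng.normal(size=(n, d_emb))      # raw token features

Wq = rng.normal(size=(d_emb, d_qk))  # genotype: the two small projections
Wk = rng.normal(size=(d_emb, d_qk))

A = (X @ Wq) @ (X @ Wk).T            # phenotype: the full n x n attention matrix

print("free parameters:", Wq.size + Wk.size)  # 2 * d_emb * d_qk = 256
print("entries of A:", A.size)                # n**2 = 4096, i.e. n^2 >> d_qk
```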