year: 2022
paper: https://arxiv.org/pdf/2212.04458
website: https://louiskirsch.com/gpicl / https://x.com/LouisKirschAI/status/1601028625893326853 / http://louiskirsch.com/
code:
connections: meta-learning, transformer, grokking? (the authors say they don’t think so, but they’re not sure)
Replicate the paper and run the ablation on context length vs. plateau length.
What they did test: longer sequences → generalization to more complicated datasets!
Open questions
Does the batch size need to be this high for more complex training tasks / larger models too?
What does the transformer size refer to? The hidden dim?
General-purpose in-context learners need diverse task distributions! (duh)
If you meta-learn on enough tasks, suddenly you can generalize to all tasks.
Three phases: Task memorization, Task identification, Learning to learn.
Image on the right: when meta-training on around 2^14 tasks (on the transition boundary) with multiple seeds, the network at the end of meta-training implements either task identification or general learning-to-learn, but rarely solutions in between. The dots are training runs with different seeds, so there is an unstable range between the two modes.
Task identification?
The name reflects the idea that the model is determining what task it’s solving (based on past learning) and then fine-tuning its behavior accordingly, even though it may not generalize well to tasks that are significantly different from those in its training set. This contrasts with memorization, where the model doesn’t recognize or learn anything about any underlying task; it only recalls specific outputs for seen inputs.
Architectures with bigger state sizes are better general-purpose meta-learners.
Navigating the phase space: high batch size and large task space.
Higher number of tasks → longer plateau length (harder to generalize) + bigger task batch size required.
Bigger task batch size → shorter plateau (less instability and variance in updates).
It’s easier to first memorize and then generalize, rather than trying to generalize right away!
The more you bias your batch, the faster you can learn! There is a sweet spot where you completely avoid the loss-plateau phase by letting the model memorize quickly and then generalize (see the sampling sketch below).
This means curriculum learning works here!
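To make the biasing concrete, here is a minimal sampling sketch (my own construction, not the authors’ code, and `focus`/`bias` are made-up names): a fraction `bias` of each meta-batch is drawn from a small fixed subset of tasks, the rest uniformly from the full task space. `bias=0` is the unbiased regime; larger values let the model memorize the focus tasks quickly before generalizing.

```python
# Hypothetical sketch of biased task-batch sampling (not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
num_tasks = 2**14
focus = rng.choice(num_tasks, size=16, replace=False)  # small fixed subset to bias towards

def sample_task_batch(batch_size, bias):
    """Draw task ids for one meta-batch; `bias` is the fraction drawn from the focus subset."""
    use_focus = rng.random(batch_size) < bias                   # which slots get a biased draw
    uniform_draw = rng.integers(0, num_tasks, size=batch_size)  # draws from the full task space
    focus_draw = rng.choice(focus, size=batch_size)             # draws from the focus subset
    return np.where(use_focus, focus_draw, uniform_draw)

# bias=0.0 -> plain uniform sampling; a moderate bias should shorten or skip the plateau
task_ids = sample_task_batch(batch_size=512, bias=0.3)
```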
Focus on theoretical insights by creating random tasks:
In this work, we take an intermediate step by augmenting existing datasets, in effect increasing the breadth of the task distribution based on existing task regularities. We generate a large number of tasks by taking existing supervised learning datasets, randomly projecting their inputs and permuting their classification labels. While the random projection removes spatial structure from the inputs, this structure is not believed to be central to the task (for instance, the performance of SGD-trained fully connected networks is invariant to projection by a random orthogonal matrix (Wadia et al., 2021)). Task augmentation allows us to investigate fundamental questions about learning-to-learn in the regime of many tasks without relying on huge amounts of existing tasks or elaborate schemes to generate those.
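A minimal sketch of that augmentation scheme (my own code, assuming a flattened [N, D] dataset; not the authors’ implementation): each task is defined by one random projection matrix applied to the inputs and one permutation of the class labels.

```python
# Hedged sketch of the task-augmentation idea quoted above (my own code).
import numpy as np

def make_augmented_task(x, y, num_classes, seed):
    """x: [N, D] flattened inputs, y: [N] integer labels -> one augmented task."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    proj = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))  # random linear projection of inputs
    perm = rng.permutation(num_classes)                     # random relabeling of classes
    return x @ proj, perm[y]

# e.g. 2**14 distinct tasks from one base dataset, one per seed:
# tasks = [make_augmented_task(x_train, y_train, 10, seed=s) for s in range(2**14)]
```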
old / revised soup comments:
I also find it great that soup simply doesn’t have any of the issues he mentioned:
- need big batch sizes? check (probably at least on the order of 10^5 active neurons at a time if we use really small models, ≤10M params, ≤1024 ctx – real neurons are simulated with ~200k params (ref))
- need multiple tasks at the same time? check (exploration vs. exploitation vs. prediction circuits) – hm, that’s just 3… and “task” here probably refers to what the reward is given for, not to roles in inter-cellular communication. It should still be easy to find that many tasks (various verifiable coding tasks, math tasks, games, summarization/retrieval/… (LLM-graded), …?). GATO also had 600 tasks.
- need curricula? check (higher-level soup task == curriculum which biases neuron activities) – I guess it biases to some extent (simple algorithms first, though the biasing then originates from the task-giving LLM/…)
VSML for Input neurons.
The efficiency benefits of this approach made me rethink how I think about the input neurons:
GPICL is much better with features from domain-specific, pretrained feature encoders – as we have planned for soup (at least for initial prototypes). What I hadn’t previously thought through (because of the complexity) is what we will now need to do: tens to hundreds of input neurons for the same modality, so that VSML works for them (this is now the core of soup anyway – the neurons’ context / memory is what defines them). Open question: is it feasible to learn without pre-trained feature encoders?
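A rough sketch of what I mean (my own construction, neither the paper’s nor soup’s actual code; all names hypothetical): many input neurons per modality share one set of weights (VSML-style) but keep separate states; each sees only a slice of a frozen pretrained encoder’s feature vector, so the only thing distinguishing them is their context/memory.

```python
# Hypothetical VSML-style input neurons: shared weights, per-neuron state.
import torch
import torch.nn as nn

class SharedInputNeurons(nn.Module):
    def __init__(self, feature_dim=768, num_neurons=64, state_dim=32):
        super().__init__()
        assert feature_dim % num_neurons == 0
        self.chunk = feature_dim // num_neurons        # feature slice per neuron
        self.num_neurons = num_neurons
        self.cell = nn.GRUCell(self.chunk, state_dim)  # one weight set shared by all neurons

    def forward(self, features, states):
        # features: [batch, feature_dim] from a frozen pretrained encoder
        # states:   [batch, num_neurons, state_dim], the per-neuron memory
        b = features.shape[0]
        x = features.reshape(b * self.num_neurons, self.chunk)
        h = self.cell(x, states.reshape(b * self.num_neurons, -1))
        return h.view(b, self.num_neurons, -1)

# neurons = SharedInputNeurons()
# states = torch.zeros(8, 64, 32)
# states = neurons(torch.randn(8, 768), states)  # one step of per-neuron state updating
```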
The output neurons are still a work in progress (too little RL experience).