UpperBound25 / https://x.com/ID_AA_Carmack/status/1925710474366034326

Some interesting points I had comments on (though lots of other interesting points aren't written down here)

Larger batches of running environments should still learn much faster, and that is important for training an AGI, but if a batch-1 run fails to learn effectively, something is wrong with the algorithm relative to biological intelligences, and we should figure out how to fix it.

Interesting take, but idk — what if this is intractable? A baby doesn't learn from a blank slate: its network assembles from incredibly strong priors, and its cells' algorithms have been evolution-tuned for eons.
Like, we can train at batch 1 once we've meta-learned an algorithm sufficiently capable of in-context learning.

“Reinforcement learning on Atari is like putting a baby in front of an Atari console the first time it opens its eyes – thinking about it like that, it is impressive that it learns within a few (weeks?) of play.” Of course that's a fairer comparison than to a human with life experience. But that's exactly the point: we want an agent that amasses experience. Not fully sold yet that Atari is the right testbed for that, but I guess streaming-algorithm performance would translate to other domains.

Carmack is doing work on the algorithmic front of real-time sequential online learning / streaming RL — as in the real world, which doesn't wait for the agent to execute; latency is a thing…
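
To make "the real world doesn't wait" concrete, here's a minimal toy loop (my sketch, not Carmack's code; `ToyEnv`, `ToyAgent` and the 60 Hz tick are made-up stand-ins): the environment advances on a fixed wall-clock schedule, and if inference or the learning update overruns the tick budget, the stale action just gets repeated.

```python
import time, random

class ToyEnv:
    """Stand-in environment that ticks forward whether or not the agent managed to act."""
    def reset(self):
        return 0.0
    def step(self, action):
        # returns (obs, reward, done)
        return random.random(), random.random(), False

class ToyAgent:
    """Stand-in streaming agent: acts and updates from only the latest transition (no replay)."""
    def default_action(self):
        return 0
    def act(self, obs):
        return int(obs > 0.5)
    def learn_online(self, obs, action, reward):
        pass  # one incremental update per step would go here

TICK = 1 / 60  # the world advances 60x per second, regardless of how slow we are

def realtime_loop(env, agent, ticks=600):
    obs, action = env.reset(), agent.default_action()
    deadline = time.monotonic()
    for _ in range(ticks):
        deadline += TICK
        if time.monotonic() < deadline:           # only think if there's budget left in this tick
            action = agent.act(obs)
        obs, reward, done = env.step(action)      # env moves on either way; a late agent repeats its stale action
        agent.learn_online(obs, action, reward)
        time.sleep(max(0.0, deadline - time.monotonic()))
        if done:
            obs = env.reset()

realtime_loop(ToyEnv(), ToyAgent())
```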

Updating anything in a conventional model essentially changes everything by at least a tiny amount. You can hope that when learning a different task the weight updates will be mostly balanced random noise, but even random walks get destructive after billions of steps.

I think it is underappreciated that EVERY weight update changes EVERY value output for EVERY observation. We just hope that the changes are positive generalizations or at worst random noise that averages out.
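
Toy check of that claim (my sketch, assuming a small PyTorch MLP): take one SGD step on a single example, then look at how predictions move on a batch of unrelated probe inputs.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

probe = torch.randn(1000, 8)                       # stand-in for "every other observation"
before = net(probe).detach()

x, y = torch.randn(1, 8), torch.tensor([[1.0]])    # one unrelated training example
loss = nn.functional.mse_loss(net(x), y)
opt.zero_grad(); loss.backward(); opt.step()       # a single weight update

drift = (net(probe).detach() - before).abs()
print(f"{(drift > 0).float().mean().item():.0%} of probe outputs moved, mean |Δ| = {drift.mean().item():.2e}")
# Typically 100% of outputs shift by a tiny amount; the hope is that over billions of
# such updates the drift generalizes or averages out instead of accumulating destructively.
```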

Interestingly, he doesn't really involve meta-learning at all.
Soup solves this, ideally, by only activating relevant neurons / having a big enough context per neuron.
What gets updated all the time (the parameters) is the learning algorithm – not the memory.

Richard and Joseph are fond of a “streaming RL” ideal that eschews replay buffers, but I am skeptical that they will be able to learn diverse tasks over days of training. Biological brains certainly don’t store raw observations that can be called back up, but long term episodic memory does feel distinct from simple policy evaluation.

perfect memory of all history is viable, we can store terabytes of observation

That’s it, right — we can't afford terabytes of VRAM, but TBs of SSD easily.
So: 99% sparsity, LTM through node contexts. Can remember lots of stuff. Lots and lots of stuff. Not a 1:1 replay buffer, but learnt memory.
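
One toy way to make that concrete (purely my sketch, nothing about the design is settled — `SparseContextLayer`, the top-k routing, and the EMA write rule are all made up for illustration): only the top-k scored nodes fire, so a gradient update touches a small fraction of the network, and each node carries a context buffer that is written without gradients and acts as learned long-term memory rather than a 1:1 replay buffer.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class SparseContextLayer(nn.Module):
    def __init__(self, d_in, n_nodes=1024, k=10, ctx_dim=32):
        super().__init__()
        self.keys = nn.Linear(d_in, n_nodes)             # relevance score per node
        self.values = nn.Parameter(torch.randn(n_nodes, d_in) * 0.02)
        self.register_buffer("ctx", torch.zeros(n_nodes, ctx_dim))  # per-node memory, not a parameter
        self.write = nn.Linear(d_in, ctx_dim)            # gradient-free write head
        self.k = k

    def forward(self, x):                                # x: (batch, d_in)
        scores = self.keys(x)                            # (batch, n_nodes)
        topv, topi = scores.topk(self.k, dim=-1)         # only k of n_nodes are "active"
        gate = torch.zeros_like(scores).scatter(-1, topi, F.softmax(topv, -1))
        out = gate @ self.values                         # gradients reach only the active nodes' rows
        with torch.no_grad():                            # slow memory write, no backprop involved
            upd = self.write(x)                          # (batch, ctx_dim)
            for b in range(x.size(0)):
                self.ctx[topi[b]] = 0.99 * self.ctx[topi[b]] + 0.01 * upd[b]
        return out, self.ctx[topi]                       # activation + the memory carried by the active nodes

layer = SparseContextLayer(d_in=64)
y, mem = layer(torch.randn(4, 64))
print(y.shape, mem.shape)   # torch.Size([4, 64]) torch.Size([4, 10, 32])
```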

Offline RL

Old games in your memory are an offline RL problem.
Even if you have a perfect memory of everything, RL algorithms can go off the rails if they aren’t being constantly “tested against reality”. Offline learning is hard.
Why doesn’t offline optimization work better?
The difference between online (traditional) RL and offline RL is that online RL is constantly “testing” its model by taking potentially new actions as a result of changes to the model, while the offline training can bootstrap itself off into a coherent fantasy untested by reality.
Big LLMs are trained today with effective batch sizes in the millions, drawn as random samples. I propose that such huge batches are not just an unfortunate artifact of cluster scaling, but actually necessary to generate gradients sufficient to steadily learn all the diverse information. You definitely can’t train an LLM by walking online through a corpus, even given infinite time – it will forget each previous book as it “reads” the next. Even if you did IID sampling of contexts with batch 1, you would see massively sublinear performance, because there might be a million optimizer steps between references to two different obscure topics.
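
Toy version of the "forgets each previous book" failure (my sketch at toy scale, not an LLM): train the same small net on synthetic "book A" then "book B" sequentially versus on an IID shuffle of both, and check how well A is remembered at the end.

```python
import torch, torch.nn as nn

torch.manual_seed(0)

def make_book(seed, shift):
    g = torch.Generator().manual_seed(seed)
    w = torch.randn(16, 1, generator=g)
    x = torch.randn(2000, 16, generator=g) + shift    # each "book" lives in its own input region
    return x, x @ w

(xa, ya), (xb, yb) = make_book(1, +3.0), make_book(2, -3.0)

def train(batches):
    net = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=3e-3)
    for x, y in batches:
        opt.zero_grad(); nn.functional.mse_loss(net(x), y).backward(); opt.step()
    return nn.functional.mse_loss(net(xa), ya).item()  # how well is book A remembered?

chunks = lambda x, y: [(x[i:i+20], y[i:i+20]) for i in range(0, len(x), 20)]
sequential = 3 * chunks(xa, ya) + 3 * chunks(xb, yb)   # read A cover to cover, then B
perm = torch.randperm(4000)
mx, my = torch.cat([xa, xb])[perm], torch.cat([ya, yb])[perm]
iid = 3 * chunks(mx, my)                               # shuffled, "big effective batch"-style exposure

print("loss on A after sequential A→B:", round(train(sequential), 2))
print("loss on A after IID mix:       ", round(train(iid), 2))
# Same number of updates, same data; the sequential run should end up noticeably worse
# on A — it "forgot" the first book while reading the second.
```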

And my current understanding is that GPICL / meta learning should solve exactly this.

can’t train an LLM by “walking online through a corpus” due to forgetting

So I take it nobody has really even tried that yet…
I mean, Lucas forked the streaming DRL repo, so maybe they'll try it with an LLM too.
Would be good if they do that, so I can focus on the self-organizing and active data-gathering aspects, which will be even more niche then — outsource what Carmack and team can and will do anyway to them.

GATO showing negative transfer learning – learning a new game is harder if done after learning other games in parallel.

It is a challenge to just not get worse!
This “loss of plasticity” in a network with continuous use is a well-known phenomenon in deep learning, and it may even have biological parallels with aging brains at some levels – old dogs and new tricks. But humans…
It may be necessary to give up some initial learning speed – “learn slow so you can learn fast”.
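
A common way to see this numerically (my sketch, using the dormant-unit fraction as a rough plasticity proxy; the task stream here is made up): train one net on a long sequence of unrelated regression targets and watch how many ReLU units go silent on a fixed probe batch.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
probe = torch.randn(512, 32)

def dormant_fraction():
    with torch.no_grad():
        h = net[1](net[0](probe))             # hidden activations on a fixed probe batch
    return (h.abs().mean(0) < 1e-6).float().mean().item()

for task in range(20):                        # a stream of unrelated regression tasks
    w = torch.randn(32, 1)
    for _ in range(200):
        x = torch.randn(64, 32)
        loss = nn.functional.mse_loss(net(x), x @ w)
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"task {task:2d}: dormant units = {dormant_fraction():.1%}")
# If the dormant fraction climbs, fewer usable units are left for the next task —
# i.e. the net has lost plasticity; "learn slow" style regularizers aim to prevent this.
```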

generalization is an ignoring of details, while plasticity involves recognizing new patterns that are currently meaningless to you.

Adam is surprisingly hard to beat. Lucas has tried a dozen hot new things without any clear advantage across a suite of games.

Big image nets don’t work well for RL → CNNs

No task ID!

Auxiliary losses like Self Predictive Representations are just ways to coerce the function approximator into working better for values and policies.

What does he mean by this?
O3:

Carmack’s point is not that SPR is useless—only that it’s scaffolding. We bolt on these auxiliary objectives because our current nets don’t naturally learn the right invariances. A future architecture that inherently captures temporal structure wouldn’t need this kind of external “coercion.”

Self-Predictive Representations (SPR) add a self-supervised “predict your own future latents” loss that forces the network’s features to become temporally predictive, so that the same weights produce cleaner Q-values and policies. They don’t change the RL algorithm at all—they just shove the function approximator into a shape that works.

| Problem with vanilla value/policy loss | How an auxiliary SPR loss helps |
|---|---|
| Sparse / delayed rewards → almost no gradient early in training. | Every frame supplies a dense self-supervised target (future latent), so the encoder learns useful structure before rewards appear. |
| Representation drift: slight weight updates for one state perturb Q-values everywhere. | Predicting k-step-ahead latents encourages smooth, locally linear dynamics in feature space, making values more stable. |
| Overfitting to pixels: the network can latch onto color patches that correlate with reward but don’t generalize. | To minimize the SPR loss it must keep track of objects that persist and move, implicitly disentangling position/velocity. |
| Bootstrapping noise: value targets are themselves estimates. | SPR provides an independent, low-variance learning signal that regularizes the network. |
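
Simplified sketch of the SPR-style auxiliary loss (loosely after Schwarzer et al.'s Self-Predictive Representations paper, heavily stripped down — the shapes, the single-step prediction, and the plain linear heads here are placeholders, not the actual architecture): predict your own future latents with a transition model against an EMA target encoder, and add that term to whatever RL loss you already have.

```python
import copy
import torch, torch.nn as nn, torch.nn.functional as F

enc = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))   # online encoder
target_enc = copy.deepcopy(enc)                                           # EMA target, no gradients
for p in target_enc.parameters():
    p.requires_grad_(False)
trans = nn.Linear(128 + 4, 128)        # latent transition model: (latent, action) -> next latent
proj = nn.Linear(128, 128)             # prediction head

def spr_loss(obs, actions, next_obs):
    z = enc(obs)                                           # (B, 128)
    z_pred = proj(trans(torch.cat([z, actions], dim=-1)))  # predicted next latent
    with torch.no_grad():
        z_next = target_enc(next_obs)                      # target latent, stop-gradient
    # negative cosine similarity, as in BYOL/SPR-style objectives
    return -F.cosine_similarity(z_pred, z_next, dim=-1).mean()

def ema_update(tau=0.99):
    for p, tp in zip(enc.parameters(), target_enc.parameters()):
        tp.data.mul_(tau).add_(p.data, alpha=1 - tau)

# usage: total_loss = rl_loss + lambda_spr * spr_loss(obs, actions, next_obs); then ema_update()
obs, actions, next_obs = torch.randn(32, 64), torch.randn(32, 4), torch.randn(32, 64)
print(spr_loss(obs, actions, next_obs).item())
```

Note the RL algorithm itself is untouched — the auxiliary term only shapes the shared encoder, which is exactly the "coercing the function approximator" framing above.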