reasoning = program synthesis
reasoning != applying a function with different input
These are two opposing approaches to reasoning / intelligence with basically no overlap. Both do well in their specific domains, but their strengths and weaknesses are very complementary:
Combinatorial search for programs gives you extreme breadth, as opposed to depth via memorization. It also lets you deal with novel problems, because you can learn a generalizable program from just a few examples, but it is very compute inefficient (see the toy sketch below).
Gradient descent, e.g. through LLMs, on the other hand allows shallow combination within extremely large banks of millions of vector programs. It is very compute efficient, giving you a strong signal about where the solution is, but very data inefficient: you need a lot of data to get a good fit, and even then you can't generalize out of distribution.
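To make the search half of this concrete, here is a minimal toy sketch (my own illustration with a made-up four-primitive DSL, not anything from the talk): brute-force enumeration over chains of primitives recovers a program that generalizes from just two examples, but the cost grows exponentially with program depth.

```python
from itertools import product

# Hypothetical toy DSL of composable primitives (names are illustrative only).
DSL = {
    "inc":    lambda x: x + 1,
    "dec":    lambda x: x - 1,
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
}

def run(program, x):
    # A "program" is just an ordered chain of primitive names applied left to right.
    for name in program:
        x = DSL[name](x)
    return x

def search(examples, max_depth=4):
    # Enumerate every chain of primitives until one fits all (input, output) examples.
    for depth in range(1, max_depth + 1):
        for program in product(DSL, repeat=depth):        # |DSL| ** depth candidates
            if all(run(program, i) == o for i, o in examples):
                return program                            # generalizable program from few examples
    return None

# Recovers "double then inc" from two examples; the cost explodes with depth.
print(search([(2, 5), (3, 7)]))   # -> ('double', 'inc')
```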
Gradient descent is great for system-1-type thinking: Pattern cognition, intuition, memorization, …
Discrete program search is a great fit for system-2 thinking: Planning, reasoning, quickly generalizing
The brain is always doing a combination of the two. Chollet thinks that the path to AGI is a mostly system-2 system, guided by the intuition of deep learning… wait, so MCTS?
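A rough sketch of what "system 2 guided by deep-learning intuition" could look like on the same toy DSL (again my own illustration, not Chollet's method: the scoring function is a stand-in for a learned model, not an actual network). A best-first search only expands the partial programs the "intuition" ranks highest, instead of enumerating everything.

```python
# Same toy DSL and runner as in the sketch above.
DSL = {"inc": lambda x: x + 1, "dec": lambda x: x - 1,
       "double": lambda x: x * 2, "negate": lambda x: -x}

def run(program, x):
    for name in program:
        x = DSL[name](x)
    return x

def intuition_score(program, examples):
    # Stand-in for learned intuition: how far are the current outputs from the targets?
    return sum(abs(run(program, i) - o) for i, o in examples)

def guided_search(examples, max_depth=6, beam=10):
    frontier = [((), 0)]                                   # (partial program, score)
    for _ in range(max_depth):
        candidates = []
        for prog, _ in frontier:
            for name in DSL:
                cand = prog + (name,)
                if all(run(cand, i) == o for i, o in examples):
                    return cand
                candidates.append((cand, intuition_score(cand, examples)))
        # Trust the "intuition": keep only the most promising partial programs.
        frontier = sorted(candidates, key=lambda c: c[1])[:beam]
    return None

print(guided_search([(2, 5), (3, 7)]))   # -> ('double', 'inc'), with far less enumeration
```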
…
We have the same split with diffusion models vs. (autoregressive) generative AI sitting at opposite ends of the spectrum.
To me, this “iterative program synthesis” or program search sounds like something quite similar to diffusion. Maybe that’s exactly what will marry search and contextual memory? :deep_thonk:
Note that there's a continuous spectrum between reusing program templates you've learned, and actual reasoning.
You can’t reason out of nothing. All reasoning needs a language — modular functions you can recombine into a new program. The more the better. The key question is, how much recombining takes place? No recombining implies no ability to handle novelty.
Direct prompting of an LLM is equivalent to having a DSL (Domain-Specific Language) of ~10e8 programs, and no ability to recombine them into something new, no matter how trivial. This sucks.
Doing active inference with an LLM is equivalent to having a DSL of ~10e8 programs, and being able to do very shallow recombinations of them (mixing a few memorized programs together). Now you can handle some novelty! So maybe this can solve ARC — we'll see.
Current program synthesis usually uses very small DSLs (on the order of 10e3 programs), but does very extensive recombination (chaining 10-20 functions…). We know this does well on ARC — but it isn't nearly enough.
Meanwhile, a human SWE uses a large-ish bank of modular functions and thought patterns (10e4+), and is capable of a degree of recombination many orders of magnitude above current program synthesis. That’s what AGI is supposed to look like. - F. Chollet, Twitter
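Back-of-the-envelope arithmetic for the DSL sizes and chaining depths above (my own rough model, not Chollet's: a program is treated as an ordered chain of k primitives from a DSL of N, ignoring arguments and typing, and the "few memorized programs" case is read as depth 3). The point is that depth of recombination, not DSL size, is what explodes.

```python
# Search-space size for chaining k primitives from a DSL of N: roughly N ** k.
for label, N, k in [("direct LLM prompting", 10**8, 1),
                    ("shallow LLM recombination", 10**8, 3),
                    ("classic program synthesis", 10**3, 20)]:
    print(f"{label:>27}: N={N:.0e}, depth={k}, ~{N**k:.0e} candidate chains")
```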
The question of whether LLMs can reason is, in many ways, the wrong question… X
The more interesting question is whether they are limited to memorization / interpolative retrieval, or whether they can adapt to novelty beyond what they know. (They can’t, at least until you start doing active inference, or using them in a search loop, etc.)
There are two distinct things you can call “reasoning”, and no benchmark aside from ARC-AGI makes any attempt to distinguish between the two. First, there is memorizing & retrieving program templates to tackle known tasks, such as “solve ax+b=c” — you probably memorized the “algorithm” for finding x when you were in school. LLMs can do this! In fact, this is most of what they do. However, they are notoriously bad at it, because their memorized programs are vector functions fitted to training data that generalize via interpolation. This is a very suboptimal approach for representing any kind of discrete symbolic program. This is why LLMs on their own still struggle with digit addition, for instance — they need to be trained on millions of examples of digit addition, and even then only achieve ~70% accuracy on new numbers. This way of doing “reasoning” is not fundamentally different from purely memorizing the answers to a set of questions (e.g. 3x+5=2, 2x+3=6, etc.) — it’s just a higher-order version of the same. It’s still memorization and retrieval — applied to templates rather than pointwise answers.
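The first kind of "reasoning", caricatured as code (a toy illustration of my own; the task keys and template bodies are made up): a static store of memorized templates, retrieved by task type. Nothing new is ever produced, and anything outside the store simply fails.

```python
# Retrieval of memorized program templates: pure lookup, no recombination.
MEMORIZED_TEMPLATES = {
    "solve ax+b=c for x": lambda a, b, c: (c - b) / a,   # the school algorithm, stored verbatim
    "add two numbers":    lambda p, q: p + q,
}

def retrieve_and_apply(task, *args):
    program = MEMORIZED_TEMPLATES.get(task)               # static program store
    if program is None:
        raise KeyError(f"no memorized template for: {task}")
    return program(*args)

print(retrieve_and_apply("solve ax+b=c for x", 3, 5, 2))   # -> -1.0
# retrieve_and_apply("solve ax^2+bx+c=0 for x", 1, 0, -4)  # KeyError: outside the store
```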
The other way you can define reasoning is as the ability to synthesize new programs (from existing parts) in order to solve tasks you’ve never seen before. Like, solving ax+b=c without having ever learned to do it, while only knowing about addition, subtraction, multiplication and division. That’s how you can adapt to novelty. LLMs cannot do this, at least not on their own. They can however be incorporated into a program search process capable of this kind of reasoning. This second definition is by far the more valuable form of reasoning.
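And the second kind, again as a toy sketch of my own (not an actual ARC solver): given only the four arithmetic primitives and two (a, b, c, x) examples of ax+b=c, a brute-force search over small expression trees rediscovers x = (c - b) / a without ever having been shown the template.

```python
from itertools import product

# Known parts: the four arithmetic primitives and the task variables.
PRIMITIVES = {
    "add": lambda p, q: p + q,
    "sub": lambda p, q: p - q,
    "mul": lambda p, q: p * q,
    "div": lambda p, q: p / q if q != 0 else float("nan"),
}
VARS = ["a", "b", "c"]

def synthesize(examples):
    # Enumerate depth-2 expressions outer(inner(v1, v2), v3) -- enough for this task.
    for outer, inner, v1, v2, v3 in product(PRIMITIVES, PRIMITIVES, VARS, VARS, VARS):
        def evaluate(env):
            return PRIMITIVES[outer](PRIMITIVES[inner](env[v1], env[v2]), env[v3])
        if all(evaluate(env) == x for env, x in examples):
            return f"{outer}({inner}({v1}, {v2}), {v3})"
    return None

examples = [({"a": 2, "b": 3, "c": 13}, 5),    # 2x+3=13  -> x=5
            ({"a": 4, "b": 2, "c": 14}, 3)]    # 4x+2=14  -> x=3
print(synthesize(examples))                    # -> div(sub(c, b), a)
```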
This is the difference between the smart kids in the back of the class that aren’t paying attention but ace tests by improvisation, and the studious kids that spend their time doing homework and get medium-good grades, but are actually complete idiots that can’t deviate one bit from what they’ve memorized. Which one would you hire?
LLMs cannot do this because they are very much limited to retrieval of memorized programs. They're static program stores. However, they can display some amount of adaptability, because not only are the stored programs capable of generalization via interpolation, but the program store itself is also interpolative: you can interpolate between programs, or otherwise "move around" in continuous program space. But this only yields local generalization, not any real ability to make sense of new situations. This is why LLMs need to be trained on enormous amounts of data: the only way to make them somewhat useful is to expose them to a dense sampling of absolutely everything there is to know and everything there is to do. Humans don't work like this — even the really dumb ones are still vastly more intelligent than LLMs, despite having far less knowledge.
Layers of consciousness: Perception vs. reflection + construction; Indexed memory
Perception: Largely geometric. The perceptual system follows a gradient, coalescing into a geometric interpretation of perceptual reality. Gradient descent does not require memory of the process you performed to get there.
Reflection + construction: In reflexive perception, you construct reality. When you reason about things, you cannot just follow a gradient; you need memory in order to understand what you tried and why it didn't work, so you can undo it, try another branch, …
This requires some kind of indexed memory that allows you to keep a protocol of what you did.
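A minimal sketch of what that indexed memory buys you (the subset-sum-style puzzle is purely my own illustrative stand-in): a backtracking search keeps an explicit protocol of every decision, so it can undo choices and explain why a branch was abandoned, which a pure gradient-following process never needs to do.

```python
# Backtracking with an explicit trail: the trail is the "indexed memory" / protocol.
def backtrack(options, target, chosen=None, trail=None):
    chosen = chosen or []
    trail = trail or []                              # every decision and its outcome
    if sum(chosen) == target:
        return chosen, trail
    for i, opt in enumerate(options):
        if sum(chosen) + opt > target:
            trail.append(f"skip {opt} at depth {len(chosen)}: would overshoot")
            continue
        trail.append(f"try {opt} at depth {len(chosen)}")
        result = backtrack(options[i + 1:], target, chosen + [opt], trail)
        if result is not None:
            return result
        trail.append(f"undo {opt} at depth {len(chosen)}: no completion found")   # backtrack
    return None

solution, protocol = backtrack([5, 8, 3, 2], 10)
print(solution)        # -> [5, 3, 2]
for step in protocol:  # the reconstructed "why": which branches were tried and abandoned
    print(step)
```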
In order for something like system-2 thinking to arise from iterated fuzzy pattern cognition, that iteration sequence needs to be highly self-consistent.
For everything you add, you need to double check that it matches what came before it.
If you put up no guardrails, you are basically hallucinating / dreaming: just repeatedly intuiting what comes next, with no regard whatsoever for (self-)consistency with the past.
→ Any deliberate logical processing in the brain needs to involve awareness / consciousness: a sort of self-consistency check, a process that forces the next iteration of this pattern cognition to be consistent with what came before it. The only way to achieve this consistency is via back-and-forth loops that bring the past into the present and bring your prediction of the future into the present, so that you have a nexus point in the present, the thing you are focusing on.
→ Consciousness is the process that forces iterative pattern cognition into something self-consistent, something that is actually like reasoning.
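A toy rendering of that claim (entirely my own stand-ins: a random number generator for "intuition", a strictly-increasing constraint for "consistency with what came before"): the same fuzzy proposer either drifts freely, or, wrapped in a consistency check with retries, produces something that at least behaves like deliberate, self-consistent iteration.

```python
import random

random.seed(0)

def intuit(context):
    # System 1 stand-in: propose whatever feels likely next, with no regard for the past.
    return random.randint(1, 20)

def consistent(candidate, context):
    # Guardrail stand-in: the next step must fit what came before (here: strictly increasing).
    return not context or candidate > context[-1]

def dream(steps=8):
    context = []
    for _ in range(steps):
        context.append(intuit(context))          # no self-consistency check: drift freely
    return context

def reason(steps=8, max_tries=50):
    context = []
    for _ in range(steps):
        for _ in range(max_tries):               # re-sample until the proposal fits the past
            candidate = intuit(context)
            if consistent(candidate, context):
                context.append(candidate)
                break
        else:
            break                                # stop when no consistent continuation is found
    return context

print("dreaming: ", dream())    # unconstrained iteration: repeats, reversals, no coherence
print("reasoning:", reason())   # every step checked against what came before
```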