year: 2020
paper: https://arxiv.org/pdf/1912.10305v2
website:
code:
connections: Jordan Ott
Key points.
→ Restrictions of gradient descent limit the space of possible models
→ Computational principles can be derived from the brain without following
every convoluted detail
→ The locality of current paradigms opposes evidence from the brain
→ Shortcomings of Classical AI systems are still present in Deep Learning
→ Parts must be understood in the context of the whole not in isolation
→ Complex internal dynamics give rise to abstract behaviors
→ Approximating a behavior is not the same as the behavior itself
Discretization & separating the parts from the whole (also called methodological reductionism) - quite successful in neuroscience (and ML) thus far, but approaching its limits, though advances in technology keep its “death” at bay.
Without a clear goal or definition of intelligence, the field of AI split into discrete sub-fields (CV, NLP, …), each chasing 0.5% test-set improvements, with little effort to bring the fields back together.
Similarly with neuroscience, where there are vast amounts of observational data, but parts of the brain are studied in complete isolation, and there have been few attempts to put it all together and craft theories for the observations (Jeff Hawkins explained this at his first Lex Fridman appearance).
We can, and should, provide agents with external stimuli, however, interpretation and dissemination of that stimuli should occur internally.
In order to create intelligent agents with autonomy, we must move away from externally specified goals, such as those prescribed by the training paradigm (i.e. supervised learning and reinforcement learning), which lead to narrow AI. External reward paradigms force the agent to learn a mapping from input stimulus directly to the external reward. This may reduce an agent’s ability to generate causal models, as the value of a stimulus is predetermined and immutable by the agent. Therefore, agents must be designed with the ability to assess stimuli and assign arbitrary rewards to them. In this way, agents can internally construct goals and take actions in order to satisfy them.
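A minimal toy sketch of this idea (my own illustration, not from the paper): the agent generates its own reward from how surprising a stimulus is, rather than receiving it externally. The `IntrinsicAgent` class and the prediction-error reward are assumptions for illustration.
```python
import numpy as np

class IntrinsicAgent:
    """Toy sketch: the agent assigns its own reward to a stimulus
    (here: its prediction error), instead of receiving an external one."""

    def __init__(self, dim, lr=0.1):
        self.W = np.zeros((dim, dim))   # crude linear predictor of the next stimulus
        self.lr = lr

    def internal_reward(self, stimulus, next_stimulus):
        prediction = self.W @ stimulus
        error = next_stimulus - prediction
        # the reward is generated internally: surprising stimuli are "valuable"
        reward = float(np.linalg.norm(error))
        # update the world model so the same stimulus becomes less surprising
        self.W += self.lr * np.outer(error, stimulus)
        return reward

rng = np.random.default_rng(0)
agent = IntrinsicAgent(dim=4)
s, s_next = rng.normal(size=4), rng.normal(size=4)
for t in range(5):
    print(round(agent.internal_reward(s, s_next), 3))  # reward shrinks as the stimulus is learned
```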
Reward expectations at different levels.
In machine learning, expectations are considered only at the model level. For next-frame prediction, deep learning treats expectations as a distribution over pixels conditioned on the past. Expectations from the output of the model are compared with the ground truth to minimize error, globally. Similarly, deep reinforcement learning minimizes the reward prediction error when completing high-level tasks. It is important to note the contradiction with biology, where expectations can occur at the network, region, layer, and even synaptic level. This is precisely how the learning rule STDP can be used to learn temporal, predictive sequences.
Expectations / predictions / feedback at (and between) different levels.
Ignoring attention and top-down expectations, it is clear certain tasks are temporal in nature. This temporal dependency requires, at the very least, recurrence or information maintained over time. Computational principles derived from intra- and inter-layer feedback are likely to yield better causal models, as units of the model maintain expectations while integrating new information to update existing beliefs. The effectiveness of attention can be increased by incorporating feedback between regions that produce attention scores and those that originally process the stimuli. Similarly, world models may benefit from top-down expectations - not just at the model level but within networks and layers as well. In this way, context from higher-level regions informs lower levels what is relevant to them. In turn, the lower-level regions provide more meaningful information to higher regions.
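Rough sketch of what top-down expectation could look like computationally (my own toy example, not the paper's model): two "regions" exchange bottom-up features and top-down context over a few message-passing steps instead of a single feed-forward sweep. The names `W_up`/`W_down` and the tanh dynamics are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(1)
D = 8
W_up = rng.normal(scale=0.1, size=(D, D))     # lower -> higher (bottom-up features)
W_down = rng.normal(scale=0.1, size=(D, D))   # higher -> lower (top-down expectation)

stimulus = rng.normal(size=D)
lower = np.zeros(D)
higher = np.zeros(D)

# a few message-passing steps: higher-level context modulates what the
# lower level passes up, instead of a single feed-forward sweep
for step in range(5):
    expectation = W_down @ higher              # what the higher region expects to see
    lower = np.tanh(stimulus + expectation)    # lower level integrates stimulus + context
    higher = np.tanh(W_up @ lower)             # higher level updates its state
```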
Embodiment is not necessary, movement is.
It has been hypothesized that equivalent grid cells occur in the neocortex as well. Recent experimental evidence confirms that grid-like patterns are present when navigating conceptual spaces. In this way, agents may need to derive an allocentric location signal from stimuli, so that object representations are learned in relation to one another.
In one sense, a strictly physical embodiment may not be necessary. However, a pseudo-embodiment, like that of a conceptual space, is likely required. This has the clear advantage of allowing artificial agents to store information in a navigable space, moving through concepts the way one physically moves through a room.
Deep learning is just millions of if-statements, but instead of manually crafting them, they are learnt from millions of data points.
For Classical AI to perform adequately, all rules or possible encounters must be codified in its knowledge base. This explicit need for predefined knowledge is paralleled in supervised learning. In this setting, we need not construct millions of if-statements; instead, millions of labeled training examples are required. Thus, in order to build agents with common sense, we would need labeled common-sense datasets. This requirement is just as infeasible as it was for expert systems. Human knowledge can be abstracted as declarative rules, but that is clearly not how the brain operates.
While attempting to combine beneficial aspects from both approaches, hybrid systems are hindered by a combination of issues. First, high-dimensional information, like pixels in an image, must be encoded into a symbolic representation. Second, this representation is then manipulated with predefined rules, so the expressivity of the system is limited by the explicitly encoded rules. Third, there is the issue of feedback (Section 7): symbolic representations would need to inform the areas that construct representations which operations are taking place. [which soup does :)]
Classical vs. Symbolic → Explicit vs. Emergent
Classical systems were brittle and rigid. The requirement of describing all of human knowledge became an untenable position, which is why they have mostly been abandoned. Hybrid systems offer mild robustness but, in time, will fail to meet the qualifications for human-level intelligence. There has been ongoing debate between those who favor symbolic approaches and those who favor connectionism. This distinction is often misguided, as both paradigms can suffer similar problems. Both symbolic and connectionist approaches have relied on explicit design to solve pre-determined behaviors. The real distinction must be drawn between explicit and emergent systems.
Next-token-prediction is a proxy loss for what we actually care about: world-modelling. [soup’s direct objective]
Most forms of learning can be framed in the concept of a cost function. For example, in biological organisms, competition between neurons enforces sparsity and specificity of neurons in response to input stimuli. This can be viewed as an optimization of the available resources to maximize information retention. Formulating a predictive model of the world can be seen as minimizing the error of future predictions. In this way we can describe any learning rule or form of credit assignment as an optimization of a cost function. However, there is a difference between explicitly minimizing a cost function and framing dynamics as an optimization of a pseudo cost function.
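To make the direct-vs-indirect distinction concrete, a small sketch (my own illustration, not from the paper): gradient descent explicitly minimizes a written-down cost, while Oja's local Hebbian rule never references a loss, yet its dynamics can still be framed as maximizing captured variance. Data and constants are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 5))
target = x @ np.array([1., -2., 0.5, 0., 3.])   # arbitrary linear teacher
w_direct = np.zeros(5)
w_local = rng.normal(scale=0.1, size=5)

# Direct: explicitly minimize a cost (squared prediction error) by gradient descent.
for _ in range(200):
    grad = x.T @ (x @ w_direct - target) / len(x)
    w_direct -= 0.1 * grad

# Indirect: Oja's local Hebbian rule; no loss is ever written down,
# yet its dynamics can be *framed* as maximizing captured variance
# (it converges toward the leading principal component of x).
for _ in range(200):
    for xi in x:
        y = w_local @ xi
        w_local += 0.01 * y * (xi - y * w_local)
```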
… humans use fewer cognitive resources for learned tasks as opposed to novel and difficult ones …
Lol - there’s even another scalability axis right there for soup: if it is exposed via a chat interface, it will learn common questions and patterns too, and just store ready-made answers or recipes or whatever for easy retrieval and slight modification.
But putting these creatures to work behind some web interface or something is literally like making them work a job - you give them rewards (like a salary) for helping the humans, but the job will kinda get dull over time (their internal curiosity rewards won’t update that much). Like sure, they’ll continue, since it is the easiest path to sure and steady rewards, but deep down they are yearning to explore the world.
Crazy ethical questions right there.
How STDP does this, intuitively explained.
Zooming in to the microscopic level, we can see how this could be implemented computationally. Evidence suggests STDP is well suited to learn temporal sequences. In order for a synapse to be strengthened, the pre-synaptic cell must be active prior to the post-synaptic cell firing. This time-sensitive nature of synaptic learning inherently gives it the ability to learn temporal sequences. In this way, pre-synaptic weights learn to predict which post-synaptic cells will become active. Many neurons in a cortical network will then learn a predictive sequence, each one predicting the next. This property is further elaborated in Section 13.1 and Figure 4.
Zooming back out, we can now see how our coffee example is satisfied. The more attempts at this motor task, the better the synaptic weights become at predicting the next active neuron. Therefore, fewer resources are wasted on predicting erroneous neurons, and electrical activity is more streamlined through the correct neurons, which increases the ease and smoothness of the task.
Connecting ideas from Section 11.2: STDP works locally (i.e. at the synapse), where its dynamics yield sequential modeling between the pre- and post-synaptic cell. Globally, this manifests as an indirect optimization of entropy within neuronal populations and yields predictive sequence learning as a functional result.
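Toy simulation of the idea (heavily simplified, my own sketch rather than the paper's model): a pair-based STDP-like rule potentiates pre→post synapses when the pre-synaptic cell fires just before the post-synaptic one, and depresses the reverse direction. After replaying a fixed sequence, each neuron's strongest outgoing synapse points at the next neuron, i.e. the weights predict the sequence. `A_plus`/`A_minus` are illustrative constants.
```python
import numpy as np

N = 5                          # neurons, repeatedly activated in order 0 -> 1 -> ... -> 4
W = np.zeros((N, N))           # W[i, j]: synapse from pre-synaptic i to post-synaptic j
A_plus, A_minus = 0.1, 0.05    # potentiation / depression magnitudes (illustrative values)

sequence = list(range(N))
for _ in range(50):            # replay the sequence many times
    for t in range(1, N):
        pre, post = sequence[t - 1], sequence[t]
        W[pre, post] += A_plus             # pre fired just before post: strengthen (LTP)
        W[post, pre] -= A_minus            # post fired before pre for this pair: weaken (LTD)
        W = np.clip(W, 0.0, 1.0)

# After training, each neuron's strongest outgoing synapse points to the
# next neuron in the sequence, i.e. the synapses "predict" the next active cell.
print(W.argmax(axis=1)[:-1])   # -> [1 2 3 4]
```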
I/O are obviously interdependent and require bi-directional feedback - absent in current ML.
This suggests that sensory and motor commands are more intertwined than previously thought. This is further evidence that we need not separate these tasks (Section 2). Instead, sensory and motor commands are codependent, often occurring within the same cortical region, and possibly the same layer. The synergistic interaction of sensory and motor aspects allows each to have information about the other, so that when motor commands are issued to move the eyes, there is a sensory expectation of what will be seen. Conversely, deep reinforcement learning maintains a policy that takes sensory input and produces motor actions. However, the information only flows one way (Section 7). Sensory regions in the policy network don’t receive feedback. Therefore, they have no information regarding which action was taken and, as a result, no expectation of what sensory information will come next.
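Sketch of the missing feedback path (my own toy example): a copy of the motor command is fed back to the sensory side so it can form an expectation of the next observation, in contrast to the one-way policy described above. `W_policy`, `W_pred`, and the linear maps are assumptions for illustration.
```python
import numpy as np

rng = np.random.default_rng(3)
OBS, ACT = 6, 3
W_policy = rng.normal(scale=0.1, size=(ACT, OBS))   # sensory -> motor (the usual one-way path)
W_pred = np.zeros((OBS, OBS + ACT))                 # (sensory + copy of action) -> expected next obs

def step(obs):
    action = W_policy @ obs                          # motor command from sensory input
    efference = np.concatenate([obs, action])        # feed the action *back* to the sensory side
    expected_next_obs = W_pred @ efference           # sensory expectation of what will be seen next
    return action, expected_next_obs

obs = rng.normal(size=OBS)
action, expected = step(obs)
# `expected` can be compared with the actual next observation to update W_pred locally
```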
As mentioned in Section 3, sparse codes sacrifice infinite capacity for robustness and speed. Ideas from learning predictive sequences (Section 11.3) can be combined with top-down expectations (Section 7) to learn predictive models of the environment. In this way we can combine the beneficial properties of sparse representations with predictive models. This combination will yield faster learning and a model of the environment. Similar to model-based reinforcement learning, models help us predict, plan, and take actions, all without using more costly real-world samples - increasing data efficiency.
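Rough sketch of combining sparse codes with a predictive model (my own illustration, not from the paper): observations are mapped to a k-sparse code and a transition model rolls the code forward a few steps "in imagination", without touching real-world samples. The random encoder and transition matrix are placeholders for learned components.
```python
import numpy as np

rng = np.random.default_rng(4)
D, K = 32, 4                                   # code size, number of active units (k-sparse)

def sparse_code(x, proj):
    h = proj @ x
    code = np.zeros_like(h)
    top = np.argsort(h)[-K:]                   # keep only the K most active units
    code[top] = h[top]
    return code

proj = rng.normal(size=(D, 8))                 # random encoder (illustrative stand-in)
T = rng.normal(scale=0.1, size=(D, D))         # transition model over sparse codes (placeholder)

obs = rng.normal(size=8)
z = sparse_code(obs, proj)
for _ in range(3):                             # "imagine" a few steps ahead without real samples
    z = sparse_code(T @ z, np.eye(D))          # predict the next code, then re-sparsify
```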
Many of the questions addressed in this article are intertwined with one another. [~all of them]
Researchers can list off properties of human cognition which the machine learning community will eagerly solve. However, this leaves us to question what we’re building via this approach. Are we creating agents that understand causality and the consequences of interactions, or are we creating function approximators on a data manifold?
Rewards, goals, and data manipulations are all external in machine learning. Conversely, in biology these processes occur within the organism.
The questions in summary:
1. Introduction
2. Can intelligence be discretized?
3. Why are we so dense?
4. Can you turn the noise down?
4.1 Adversarial Attacks
4.2 Invariant Representations
4.3 Robust Properties from Biology
5. Where's my reward?
6. What form should learning rules take?
6.1 Are deep targets necessary for learning?
6.2 Is a form of credit assignment necessary for learning in deep networks?
7. Can I have some feedback?
8. How does my body look?
9. Anthropomorphism and Approximation
10. Symbolism or Connectionism?
11. What is learning?
11.1 Cost functions
11.2 Direct vs. indirect cost functions
11.3 Predictive sequences
11.4 Sensory motor integration
12. How can we be more data efficient?
12.1 Causes of inefficiency
12.2 Priors
13. What's common sense?
13.1 Intuitive Physics
14. How to remember?
15. Conclusion
Ideas on soup (Sept. 24):
Lots of open questions and ideas in the most crucial area - perception - e.g. video input stream:
- Multiple nodes, all receiving the same input?
- Multiple nodes, all receiving parts of the input?
- Firing continuously?
++ Take another look at the algorithm in “Contribute to balance, wire in accordance - …”