year: 2022
paper: https://arxiv.org/pdf/2012.14905
website: http://louiskirsch.com/
code: https://github.com/louiskirsch/vsml-neurips2021/tree/main
connections: meta learning, backpropagation, Jürgen Schmidhuber, KAUST, Evolution Strategies as a Scalable Alternative to Reinforcement Learning, parameter-sharing, message passing, self-organization


This paper is doing message passing with global backpropagation and shared weights.

Transclude of The-Future-of-Artificial-Intelligence-is-Self-Organizing-and-Self-Assembling#^db08ab

The variable ratio problem:

Meta RNNs are simple, but they have many more meta variables than learned variables (the meta variables are the full RNN weights, the learned variables only the hidden state), which leaves them overparameterized and prone to overfitting.

Learned learning rules / Fast Weight Networks have the opposite ratio (few meta variables relative to learned variables), but introduce a lot of complexity in the meta-learning architecture, update rules, etc.

→ A variable-sharing and sparsity principle can be used to unify these approaches into a simple framework: VSML.
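A rough back-of-envelope comparison of the two ratios, with all sizes assumed for illustration (they are not figures from the paper):

```python
# Rough variable counts (sizes assumed, not from the paper).

# Meta RNN: meta variables = the RNN's weight matrices, learned variables = only
# its hidden state -> far more meta than learned variables.
hidden, inp = 256, 32
meta_rnn_meta = hidden * hidden + hidden * inp + hidden   # recurrent + input + bias
meta_rnn_learned = hidden                                 # just the state

# VSML: one tiny sub-RNN with shared weights is reused at every weight position
# -> few meta variables, many learned variables (one state per position).
N = 8                             # sub-RNN state size
W = 784 * 128                     # weight positions in one layer
vsml_meta = N * N + N * 2 + N     # recurrent + 2 message inputs + bias
vsml_learned = W * N              # one N-dim state per weight position

print(meta_rnn_meta, meta_rnn_learned)   # 73984 vs 256
print(vsml_meta, vsml_learned)           # 88 vs 802816
```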

Meta training vs. Meta testing


During meta training: optimize the shared meta variables (the sub-RNN weights) across a distribution of tasks, e.g. with backprop through the unroll or with ES.

During meta testing: hold the meta variables fixed; learning happens only through updates of the sub-RNN states.

How does the in-context learning (ICL) work? How does it run purely in forward mode?

→ Errors are also passed in as inputs during meta train and test time! (“backward messages”).
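A minimal sketch of this, under my own simplifications (assumed sizes, a plain tanh sub-RNN instead of the paper's LSTM-based parameterization, and a naive sum aggregation): one small state per weight position, every position sharing the same update matrices, with forward activations and backward error signals both entering as ordinary inputs.

```python
import numpy as np

N, In, Out = 8, 4, 3              # sub-RNN state size, layer fan-in, fan-out (assumed)
rng = np.random.default_rng(0)

# Shared meta variables (the same for every weight position, frozen at meta-test time).
A = rng.normal(0, 0.1, (N, N))    # state -> state
B = rng.normal(0, 0.1, (N, 2))    # [forward msg, backward msg] -> state
C = rng.normal(0, 0.1, N)         # state -> contribution to outgoing forward message

s = np.zeros((In, Out, N))        # learned variables: one state per weight position (a, b)

def step(x, err_back):
    """One tick: x is the layer input (In,), err_back an error signal (Out,)."""
    global s
    msgs = np.stack(np.broadcast_arrays(x[:, None], err_back[None, :]), -1)  # (In, Out, 2)
    s = np.tanh(s @ A.T + msgs @ B.T)          # the same shared update at every (a, b)
    return (s @ C).sum(axis=0)                 # aggregate forward messages -> (Out,)

y = step(rng.normal(size=In), np.zeros(Out))          # forward pass, no error yet
y = step(rng.normal(size=In), rng.normal(size=Out))   # error enters simply as an input
```

At meta-test time only the states s change; A, B and C play the role of the frozen meta variables.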

“Everything is a variable: Variables we thought of as activations can be interpreted as weights.”

Multi-agent systems

In the reinforcement learning setting, multiple agents can be modeled with shared parameters, also in the context of meta learning.

This is related to the variable sharing in VSML, depending on how the agent-environment boundary is drawn. Unlike those works, the paper demonstrates the advantage of variable sharing for meta-learning more general-purpose learning algorithms (LAs) and presents a weight-update interpretation.

Specialization vs generalization through coordinate information

Each sub-RNN can be fed its coordinates (a, b), its position in time, or its layer position (a sketch of this is shown after the list below) to enable:

  1. Specialization: Different neurons develop different functions (like biological neurons)

  2. Implicit architecture learning: Suppress outputs based on position

    However, experiments showed no benefits - the shared parameters without specialization were sufficient.
    → Parameter sharing without positional bias achieves better generalization.
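Purely for illustration, a hypothetical variant of the earlier sketch that appends normalized (a, b) coordinates to every sub-RNN input; as noted above, the experiments found no benefit from this.

```python
import numpy as np

# Hypothetical positional-input variant (sizes assumed, not from the paper).
N, In, Out = 8, 4, 3
rng = np.random.default_rng(0)
A = rng.normal(0, 0.1, (N, N))
B = rng.normal(0, 0.1, (N, 4))        # forward msg, backward msg, coord a, coord b
C = rng.normal(0, 0.1, N)
s = np.zeros((In, Out, N))

# Fixed normalized coordinates for each weight position (a, b).
a_grid, b_grid = np.meshgrid(np.linspace(0, 1, In),
                             np.linspace(0, 1, Out), indexing="ij")
coords = np.stack([a_grid, b_grid], axis=-1)           # (In, Out, 2)

def step(x, err_back):
    global s
    msgs = np.stack(np.broadcast_arrays(x[:, None], err_back[None, :]), -1)
    inp = np.concatenate([msgs, coords], axis=-1)      # (In, Out, 4)
    s = np.tanh(s @ A.T + inp @ B.T)                   # shared weights, positional inputs
    return (s @ C).sum(axis=0)
```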

The computational cost of the current VSML variant is also larger than that of standard backpropagation. If we run a sub-RNN for each weight of a standard NN with W weights, the cost is O(W N^2), where N is the state size of a sub-RNN.
If N is small enough, and our experiments suggest small N may be feasible, this may be an acceptable cost.
However, VSML is not bound to the interpretation of a sub-RNN as one weight. Future work may relax this particular choice.
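A quick sanity check of that cost with assumed sizes (the layer shape and N below are hypothetical):

```python
W = 784 * 128        # weights in a standard 784 -> 128 layer (assumed example)
N = 8                # sub-RNN state size

standard_forward = W            # roughly one multiply-add per weight
vsml_forward = W * N ** 2       # each weight position updates an N-dimensional state

print(vsml_forward // standard_forward)   # N**2 = 64x overhead for N = 8
```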

Meta optimization is also prone to local minima.
In particular, when the number of ticks between input and feedback increases (e.g. deeper architectures), credit assignment becomes harder.
Early experiments suggest that diverse meta task distributions can help mitigate these issues.

Additionally, learning horizons are limited when using backprop-based meta optimization. Using ES allowed training across longer horizons and more stable optimization (the variance of the ES gradient estimate is independent of episode length).
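A minimal antithetic ES sketch in the style of the linked "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" (generic, not the paper's exact setup): the estimate only uses the scalar meta-objective of each perturbed rollout, never backprop through time, so episode length does not enter the estimator's variance the way it does for BPTT.

```python
import numpy as np

def es_gradient(theta, meta_objective, sigma=0.02, pop=64, rng=None):
    """Antithetic ES gradient estimate of a black-box scalar meta-objective."""
    rng = rng or np.random.default_rng(0)
    eps = rng.normal(size=(pop, theta.size))
    returns = np.array([meta_objective(theta + sigma * e) -
                        meta_objective(theta - sigma * e) for e in eps])
    return (eps.T @ returns) / (2 * sigma * pop)

# Toy usage: meta_objective would unroll VSML over a long horizon and return a
# scalar (e.g. negative meta-test loss); here a stand-in quadratic for illustration.
theta = np.zeros(10)
grad = es_gradient(theta, lambda t: -np.sum((t - 1.0) ** 2))
theta = theta + 0.1 * grad   # one gradient-ascent step on the meta objective
```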

Variable sharing enables distributed memory and self-organization

VSML can be viewed as:
- Distributed memory: memory is spread across the network instead of held in external memory banks
- Self-organizing system: many interacting sub-RNNs give rise to an emergent global learning algorithm
- Modular learning: but unlike typical modular systems there is no conditional selection; all modules share the same parameters to implement the learning algorithm

Great work Louis. I'm curious if you are fully committed to the idea of strictly enforcing weight sharing. Have u looked at whether this improves performance for you? Conceptually it makes total sense, and it certainly reduces model complexity, but is there a performance cost? || Good question! It might be interesting to have some degree of specialisation, I discuss one possibility of feeding positional information to each RNN in the appendix. You probably don't want entirely independent params, in particular because of generalisation properties.

Recursive weight replacement hierarchy

“For future work”: Meta variables themselves could be replaced by LSTMs, yielding multi-level hierarchies of arbitrary depth. This would create a fractal-like structure where learning algorithms implement learning algorithms.

References

weight-sharing
RNN
message passing