year: 2021/06
paper: https://arxiv.org/pdf/2104.04657
website:
code: https://github.com/google-research/google-research/tree/master/blur
connections: cma-es, hebbian learning, meta learning, backpropagation, VSML, self-organization


What if neurons could store more than just their activation? What if they could simultaneously track multiple pieces of information - like both their forward signal and the error signal flowing backward? BLUR explores this by giving neurons multiple “states” and discovering what update rules emerge.

Backpropagation is secretly Hebbian

During the forward pass, neuron $j$ computes:

$$a_j = \sigma\Big(\sum_i w_{ij} \, a_i\Big)$$

During backprop, we compute the gradient and update weights:

$$\delta_j = \sigma'(z_j) \sum_k w_{jk} \, \delta_k, \qquad \Delta w_{ij} \propto -\, a_i \, \delta_j$$

where $\sigma'$ is the derivative of the activation function and $z_j = \sum_i w_{ij} a_i$ is the pre-activation.

This is Hebbian! The update depends on pre-synaptic activity ($a_i$) times post-synaptic activity ($\delta_j$). We can reformulate this using two neuron states:

  • State 1: the forward activation $a_i$
  • State 2: the error signal $\delta_i$ - the gradient times the activation derivative

Now backprop is just fixed rules for how these states interact.
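
To make the correspondence concrete, here is a minimal numpy sketch (my notation, not the paper's code) of one layer of backprop written this way: state 1 carries the activation, state 2 carries the error, and the weight update is just pre-synaptic state 1 times post-synaptic state 2.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)                    # activation function

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2         # its derivative

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))              # synapses: 4 upstream -> 3 downstream

# State 1 (forward pass): a_j = sigma(sum_i w_ij a_i)
a_pre = rng.normal(size=4)               # upstream state 1
z = a_pre @ W
a_post = sigma(z)                        # downstream state 1

# State 2 (backward pass): delta_j = sigma'(z_j) * incoming error
err = rng.normal(size=3)                 # stand-in for error from downstream
delta_post = sigma_prime(z) * err        # downstream state 2

# "Hebbian" update: pre-synaptic state 1 times post-synaptic state 2
lr = 0.1
W -= lr * np.outer(a_pre, delta_post)
```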

Generalizing beyond backprop: $m$ states per neuron, learned interactions


Instead of hardcoding that neurons have exactly 2 states (activation + gradient), BLUR allows $m$ states per neuron.
Instead of fixed rules for how states interact, it learns small matrices that control information flow:

Forward pass: Each neuron $j$ updates all its states using:

$$a_j^k \leftarrow \eta_a \, a_j^k + \mu_a \, \sigma\Big(\sum_i \sum_l \nu_f^{lk} \, a_i^l \, w_{ij}\Big)$$

Where:

  • $a_i^l$: State $l$ of neuron $i$ (e.g., $a_i^1$ might be activation, $a_i^2$ might be error)
  • $w_{ij}^c$: Channel $c$ of the synapse from neuron $i$ to $j$ (synapses also have multiple channels!); the equation shows a single channel for readability
  • $\nu_f^{lk}$: Element of the matrix controlling how state $l$ from upstream influences state $k$ here
  • $\eta_a$: Forget gate - how much of the previous state to keep
  • $\mu_a$: Update gate - how strongly new information affects the state
  • $\sigma$: Activation function
  • Sum over $i$: All upstream neurons; sum over $l$: All state dimensions

Example: If $\nu_f^{11} = 1$ and all other entries of $\nu_f$ are zero, only state 1 flows forward (like activations in backprop), while state 2 is isolated.
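
A hedged numpy sketch of this forward update ($\eta_a$, $\mu_a$, $\nu_f$ become `eta_a`, `mu_a`, `nu_f`; these are illustrative names matching the symbols above, with a single synapse channel):

```python
import numpy as np

m = 2                                  # states per neuron
n_pre, n_post = 4, 3
rng = np.random.default_rng(1)
a_pre = rng.normal(size=(n_pre, m))    # a_i^l: upstream neuron states
a_post = np.zeros((n_post, m))         # a_j^k: downstream neuron states
W = rng.normal(size=(n_pre, n_post))   # w_ij (one channel, for readability)

nu_f = np.array([[1.0, 0.0],           # nu_f^{lk}: state 1 feeds state 1,
                 [0.0, 0.0]])          # state 2 is isolated (backprop-like)
eta_a, mu_a = 0.5, 1.0                 # forget / update gates from the genome

# a_j^k <- eta_a * a_j^k + mu_a * sigma(sum_{i,l} nu_f^{lk} a_i^l w_ij)
a_post = eta_a * a_post + mu_a * np.tanh(W.T @ (a_pre @ nu_f))
```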

Backward pass: Similar update, but information flows from downstream neurons $j \in \mathrm{out}(i)$:

$$a_i^k \leftarrow \eta_a \, a_i^k + \mu_a \, \sigma\Big(\sum_{j \in \mathrm{out}(i)} \sum_l \nu_b^{lk} \, a_j^l \, w_{ij}\Big)$$

Matrix $\nu_b$ controls backward information flow. Notice $\nu_b$ can differ from $\nu_f$ - forward and backward routing need not be symmetric! Here $\mathrm{out}(i)$ denotes all neurons downstream of $i$.
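
The backward step looks the same in code, only along outgoing synapses and with its own transform matrix (again, `nu_b` etc. are my illustrative names). Note how the feedback signal is simply written into an output state before the update runs - that is how the label enters during meta-training (see below):

```python
import numpy as np

m, n_pre, n_post = 2, 4, 3
rng = np.random.default_rng(2)
a_pre = rng.normal(size=(n_pre, m))    # a_i^k: upstream neuron states
a_post = rng.normal(size=(n_post, m))  # a_j^l: downstream neuron states
W = rng.normal(size=(n_pre, n_post))   # w_ij, shared with the forward pass
eta_a, mu_a = 0.5, 1.0

nu_b = np.array([[0.0, 0.0],           # nu_b^{lk}: only state 2 flows
                 [0.0, 1.0]])          # backward, into state 2

a_post[:, 1] = rng.normal(size=n_post) # inject a feedback signal at the output

# a_i^k <- eta_a * a_i^k + mu_a * sigma(sum_{j,l} nu_b^{lk} a_j^l w_ij)
a_pre = eta_a * a_pre + mu_a * np.tanh(W @ (a_post @ nu_b))
```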

Weight update: Generalizes Hebbian learning across all state pairs:

$$w_{ij} \leftarrow \eta_w \, w_{ij} + \mu_w \Big(\sum_k \theta_f^k \, a_i^k\Big) \Big(\sum_l \theta_b^l \, a_j^l\Big)$$

This says: the weight change is the product of a learned projection of the pre-synaptic states and a learned projection of the post-synaptic states. With $\theta_f$ selecting the activation state and $\theta_b$ selecting the error state, this reduces exactly to backprop's $a_i \times \delta_j$ update.
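
In code, under the same illustrative notation, with theta values chosen to recover backprop:

```python
import numpy as np

m, n_pre, n_post = 2, 4, 3
rng = np.random.default_rng(3)
a_pre = rng.normal(size=(n_pre, m))    # pre-synaptic states a_i^k
a_post = rng.normal(size=(n_post, m))  # post-synaptic states a_j^l
W = rng.normal(size=(n_pre, n_post))

theta_f = np.array([1.0, 0.0])         # pick pre-synaptic state 1 (activation)
theta_b = np.array([0.0, 1.0])         # pick post-synaptic state 2 (error)
eta_w, mu_w = 1.0, 0.1                 # keep old weights, scale the update

# w_ij <- eta_w * w_ij + mu_w * (sum_k theta_f^k a_i^k)(sum_l theta_b^l a_j^l)
W = eta_w * W + mu_w * np.outer(a_pre @ theta_f, a_post @ theta_b)
# With these theta values this is exactly backprop's a_i * delta_j update.
```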

Forward and backward passes use the same type of update rule, just with different connectivity patterns.
This symmetry is more biologically plausible than backprop’s asymmetry.
The system can discover update rules where information flows bidirectionally during both passes, where forward and backward transforms differ ($\nu_f \neq \nu_b$), or where more than 2 states track different aspects of computation.

Since the genome doesn’t encode a loss function, the backward pass just propagates whatever signal you inject at the output layer. During meta-training, this is the true label, but the learned rules generalize to propagate any feedback signal.

These aren't gradient descent in disguise

In gradient descent, if changing weight $w_1$ affects how much we update weight $w_2$, then the reverse must be equally true - changing $w_2$ must affect $w_1$'s update by the exact same amount. Mathematically:

$$\frac{\partial \, \Delta w_1}{\partial w_2} = \frac{\partial \, \Delta w_2}{\partial w_1}$$

This symmetry is forced by calculus - if you're minimizing any function $L$, the update is $\Delta w_i \propto -\partial L / \partial w_i$, and the order of taking derivatives doesn't matter: $\frac{\partial^2 L}{\partial w_1 \partial w_2} = \frac{\partial^2 L}{\partial w_2 \partial w_1}$.

But BLUR's discovered genomes violate this symmetry! The update to $w_1$ might strongly depend on $w_2$, while $w_2$'s update completely ignores $w_1$. This proves they're not minimizing any hidden loss function - they're doing something fundamentally different.
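
A quick numerical illustration (mine, not from the paper): finite differences confirm that the cross-derivatives of a gradient-descent update are symmetric, while an asymmetric BLUR-style rule breaks the equality.

```python
import numpy as np

def grad_update(w):                    # -grad of L(w) = w1^2 * w2 + w2^2
    w1, w2 = w
    return -np.array([2 * w1 * w2, w1 ** 2 + 2 * w2])

def blur_like_update(w):               # asymmetric rule: dw1 depends on w2,
    w1, w2 = w                         # but dw2 ignores w1 entirely
    return -np.array([2 * w1 * w2, 2 * w2])

def cross_derivs(update, w, eps=1e-5):
    d1_dw2 = (update(w + [0, eps])[0] - update(w - [0, eps])[0]) / (2 * eps)
    d2_dw1 = (update(w + [eps, 0])[1] - update(w - [eps, 0])[1]) / (2 * eps)
    return d1_dw2, d2_dw1

w = np.array([0.7, -1.3])
print(cross_derivs(grad_update, w))       # equal: (-1.4, -1.4)
print(cross_derivs(blur_like_update, w))  # unequal: (-1.4, 0.0)
```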

Traditional optimizers can only move “downhill”. BLUR’s learned rules could temporarily increase loss to escape local minima, exploit asymmetries in forward/backward passes, or optimize for multiple objectives simultaneously (learning speed, final accuracy, robustness). This is why BLUR often learns faster than SGD early in training - it’s following update rules that evolution discovered work well across many tasks, not trying to minimize a specific function.

Meta-learning: Finding the learning algorithm

The “genome” is tiny (~100 parameters) but controls how the entire network learns:

Forget gates $\eta$: Control memory retention. $\eta_a$ for neuron states (how much activation/error to keep between updates), $\eta_w$ for synapse weights (how much of the old weight to retain). $\eta = 1$ means perfect memory, $\eta = 0$ means complete reset.

Update gates $\mu$: Control learning strength. $\mu_a$ scales how much new information affects neuron states, $\mu_w$ scales weight updates. Similar to a learning rate, but state-specific.

Transform matrices $\nu_f, \nu_b$: Route information between neuron states. $\nu_f^{lk}$ controls how upstream state $l$ influences current state $k$ during the forward pass. $\nu_b^{lk}$ does the same for the backward pass. For $m = 2$, backprop uses specific values that keep activations and gradients separate.

Synaptic transforms $\theta_f, \theta_b$: Determine which state combinations drive weight updates. $\theta_b^l$ weights post-synaptic state $a_j^l$, $\theta_f^k$ weights pre-synaptic state $a_i^k$. Backprop would use values that make activation × gradient produce the update.

Meta-training works by sampling many learning problems: initialize random weights, apply the BLUR rules for a number of steps, evaluate performance. The genome that produces networks that learn well across many problems survives. They use either gradient descent on the genome parameters or evolution strategies (CMA-ES), which treat the genome as DNA to evolve.
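
A toy sketch of that outer loop - a plain evolution strategy standing in for CMA-ES, with a placeholder fitness; names and objective are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)
GENOME_DIM = 100                      # gates + transform matrices, flattened

def inner_fitness(genome, n_tasks=8):
    # Placeholder objective: a real run would sample learning problems,
    # apply the genome's update rules for some steps, and return mean
    # performance across tasks.
    tasks = rng.normal(size=(n_tasks, GENOME_DIM))
    return -np.mean((tasks - genome) ** 2)

genome = np.zeros(GENOME_DIM)
sigma, pop, elite = 0.1, 32, 8
for generation in range(50):
    noise = rng.normal(size=(pop, GENOME_DIM))
    candidates = genome + sigma * noise
    scores = np.array([inner_fitness(c) for c in candidates])
    best = candidates[np.argsort(scores)[-elite:]]
    genome = best.mean(axis=0)        # move toward the elite candidates
```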

Trained only on MNIST digits, the discovered genomes generalize to Fashion-MNIST and letter recognition without modification.

Architectural insights from experiments

Genomes trained on deeper networks (4 layers) successfully train shallower networks (1-3 layers) but not vice versa - suggesting they learn more general update rules when forced to handle deeper credit assignment.

Surprisingly, symmetric single-state synapses with backprop initialization often outperform more complex multi-state variants. The sweet spot seems to be minimal deviation from backprop that adds just enough flexibility to discover better dynamics.

In early training, BLUR consistently outpaces SGD - the learned rules excel at rapid initial learning, though SGD often catches up given enough steps (similar to VSML). This suggests the discovered algorithms optimize for different objectives than minimizing final loss.

Difference to VSML

Both approaches meta-learn update rules, but with different philosophies.
VSML says “replace each weight with a tiny RNN and let them figure out backprop”, while BLUR says “give neurons multiple states and learn the wiring diagram between them”.
BLUR’s genome is more interpretable - you can see which states talk to which during the forward and backward passes - but it imposes more structure.

References

hebbian learning | VSML | self-organization