year: 2024/06
paper: https://arxiv.org/pdf/2406.09787 / Evolving Self-Assembling Neural Networks - From Spontaneous Activity to Experience-Dependent Learning (copy)
website: https://x.com/eplantec/status/1808918092497703188
code: https://github.com/erwanplantec/LNDP
connections: neuroevolution, structural plasticity, synaptic plasticity, self-organization, ITU Copenhagen, spontaneous activity, indirect encoding, Sebastian Risi, parameter-sharing, NDP
Tldr
Evolves update rules, instead of weights.
Network with fixed nodes starts with random/empty connections, grown during lifetime based on node states, which get updated via Graph Transformer + GRUs. All nodes, edges share params.
Pre-experience phase uses OU noise (spontaneous activity) to develop structure before seeing real data.
CMA-ES evolves self-organization rules that solve RL. Parameters stay fixed during lifetime, only the structure changes.
Spontaneous Activity & Ornstein-Uhlenbeck Process
The pre-experience developmental phase uses spontaneous activity (SA) - internally generated patterns in input neurons without environmental input. The authors model SA using a learnable Ornstein-Uhlenbeck (OU) stochastic process:
$$x_{t+1} = x_t + \theta\,(\mu - x_t) + \Sigma\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I)$$
The OU process generates “noisy but correlated” signals - like a random walk pulled back toward a mean, where:
- $x_t$: current activity values of the input neurons
- $\theta\,(\mu - x_t)$: mean-reversion term pulling values toward $\mu$ (the further from $\mu$, the stronger the pull back)
- $\Sigma\,\epsilon_t$: Gaussian noise
This creates temporally correlated activity patterns (not just white noise) that mimic spontaneous waves in developing biological brains - like retinal waves before birth that help wire up visual systems. The learnable parameters ($\mu$, $\theta$, $\Sigma$) let evolution discover appropriate activity patterns for each task.
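A minimal sketch of this developmental noise generator, assuming the discretized OU update above (the function name, the zero initial state, and passing `sigma` as a full covariance-like matrix are my choices, not necessarily the repo's):

```python
import jax
import jax.numpy as jnp

def spontaneous_activity(key, n_inputs, n_steps, mu, theta, sigma):
    """Temporally correlated input-neuron activity from a discretized OU process:
    x_{t+1} = x_t + theta * (mu - x_t) + sigma @ eps, with eps ~ N(0, I)."""
    def step(x, k):
        eps = jax.random.normal(k, (n_inputs,))
        x_next = x + theta * (mu - x) + sigma @ eps  # mean reversion + correlated noise
        return x_next, x_next

    keys = jax.random.split(key, n_steps)
    _, trace = jax.lax.scan(step, jnp.zeros(n_inputs), keys)
    return trace  # (n_steps, n_inputs): drives plasticity before any real observation
```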
SA runs for a fixed number of timesteps at network initialization, before any environment interaction. During this one-time phase, the OU-generated patterns drive activity-dependent plasticity - synapses update based on pre/post-synaptic correlations without environmental rewards. This pre-structures the network topology and weights before real sensory data arrives.
Networks evolved with SA show immediate competence. In CartPole, SA-trained networks could balance from episode 1, while non-SA networks failed immediately and needed lifetime learning. The SA phase essentially evolves networks that arrive “pre-wired” with useful dynamics, even though they’ve never seen real observations. Evolution discovers SA parameters that produce beneficial initial configurations for the target task.
SA creates structural diversity in initially identical input neurons. The learnable covariance matrix $\Sigma$ generates different temporal patterns per input neuron during the pre-experience phase (run once before any environment interaction). Evolution shapes these patterns to pre-build distinct pathways - not to match specific sensors, but to create differentiated “slots” that can specialize when real observations arrive. Without SA, all input neurons would remain symmetric, making channel assignment during actual experience more difficult.
Structural plasticity - edges only, nodes fixed
Node count is fixed per network (chosen at initialization). Only edges can change - new synapses form via a learned synaptogenesis rule and are removed via a learned pruning rule. No neurogenesis or cell death.
Initial connectivity is sparse/random, drawn according to a parameter defining the connection probability.
Node structure
Node states: $h^{t+1}_i = \mathrm{GRU}_{\text{node}}\!\left(z^t_i,\, h^t_i\right)$, where $z^t_i$ is node $i$’s embedding produced by the Graph Transformer.
The Graph Transformer processes the full graph state (activations $v$, states $h$, structural features) and feeds into a GRU to update each node’s internal state. All nodes share parameters.

Node activations: $v^{t+1}_i = \tanh\!\big(\textstyle\sum_j w_{ji}\, v^t_j\big)$
Each neuron’s new activation is the weighted sum of all neurons’ previous activations, weighted by the corresponding edge weights.
Input nodes’ activations are forced to match observations, while hidden/output nodes evolve freely.

Node types: All nodes are structurally identical (same parameters). Input nodes get their activations overwritten by observations, hidden nodes process information freely, and output nodes’ activations become actions.
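A rough sketch of one node-side update as described above; `gt_embed` and `gru_node` are placeholders for the shared Graph Transformer and node GRU, and the convention that input nodes occupy the first `obs_dims` slots follows the repo's code:

```python
import jax.numpy as jnp
import jax.nn as jnn

def node_step(v, h, W, obs, obs_dims, gt_embed, gru_node):
    """v: (N,) activations, h: (N, dh) node states, W: (N, N) weights with W[i, j] = i -> j.
    gt_embed: full-graph embedding fn, gru_node: shared per-node GRU cell (placeholders)."""
    v = v.at[:obs_dims].set(obs)   # input nodes are overwritten by the observation
    v_next = jnn.tanh(v @ W)       # weighted sum of presynaptic activations, squashed
    z = gt_embed(v_next, h)        # per-node embeddings computed from the whole graph
    h_next = gru_node(z, h)        # one shared-parameter GRU updates every node state
    return v_next, h_next
```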
Edge structure
Edge states: $e^{t+1}_{ij} = \mathrm{GRU}_{\text{edge}}\!\left(e^t_{ij},\, [h^t_i, h^t_j, r^t]\right)$, where $r^t$ is the global reward.
Each edge has a GRU that updates based on both connected nodes’ (pre- and post-synaptic) states and the global reward. The first element of the edge state directly becomes the synaptic weight, $w_{ij} = e_{ij}[0]$. All edges share parameters.

Structural plasticity:
- Synaptogenesis: $p_{\text{new}}(i \to j) = \mathrm{MLP}_{\text{gen}}\!\big([h_i, h_j]\big)$
- Pruning: $p_{\text{prune}}(i \to j) = \mathrm{MLP}_{\text{prune}}\!\big(e_{ij}\big)$
New edges can form based on node compatibility (how well their states match). Existing edges can be pruned based on their state (weak connections die). Both use learned MLPs.
Bidirectionality: Edges process both nodes’ states simultaneously but the graph remains directed. Edge features include: the edge state $e_{ij}$, a forward-connection bit, a backward-connection bit, and a self-loop indicator.
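A sketch of how the synaptogenesis/pruning step above could be wired up; `gen_mlp` (scores node-state pairs) and `prune_mlp` (scores edge states) stand in for the learned MLPs, and the Bernoulli sampling is my assumption:

```python
import jax
import jax.numpy as jnp

def structural_step(key, A, H, E, gen_mlp, prune_mlp, allowed):
    """A: (N, N) adjacency (0/1), H: (N, dh) node states, E: (N, N, de) edge states,
    allowed: (N, N) mask of legal connections. Both MLPs output one logit per pair/edge."""
    k_add, k_del = jax.random.split(key)
    n = H.shape[0]
    pre = jnp.repeat(H[:, None, :], n, axis=1)    # (N, N, dh) presynaptic states
    post = jnp.repeat(H[None, :, :], n, axis=0)   # (N, N, dh) postsynaptic states
    p_add = jax.nn.sigmoid(gen_mlp(jnp.concatenate([pre, post], axis=-1)))[..., 0]
    p_del = jax.nn.sigmoid(prune_mlp(E))[..., 0]

    grow = (jax.random.uniform(k_add, (n, n)) < p_add) & (A == 0) & allowed
    cut = (jax.random.uniform(k_del, (n, n)) < p_del) & (A == 1)
    return jnp.where(grow, 1, jnp.where(cut, 0, A))  # updated adjacency matrix
```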
Training procedure
Evolution loop (CMA-ES):
- Start of a generation: Sample a population of parameter vectors
- For each individual:
- Load parameters
- Initialize fresh graph: random connections drawn with the initial connection probability
- Run development phase if enabled (a fixed number of spontaneous-activity steps)
- Run multiple episodes, keeping network graph between episodes
- Return fitness, defined as: “average return of the agent over three different trials (i.e. different random seeds)”
- CMA-ES updates its distribution based on the fitnesses
- Repeat for … generations
Each individual gets its own graph that persists across episodes but not across generations.
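A sketch of that outer loop; `cma_ask`/`cma_tell` are generic stand-ins for whatever ES library is used, and `init_graph`/`develop`/`run_episodes` are placeholders for the inner lifetime:

```python
import jax
import jax.numpy as jnp

def evolve(key, cma_ask, cma_tell, cma_state, init_graph, develop, run_episodes,
           n_generations, n_trials=3):
    """Evolved parameters stay fixed within a lifetime; only the graph self-organizes."""
    for _ in range(n_generations):
        key, k_ask = jax.random.split(key)
        population, cma_state = cma_ask(k_ask, cma_state)   # sample parameter vectors
        fitnesses = []
        for params in population:
            graph = init_graph(params)                      # fresh sparse random graph
            graph = develop(params, graph)                  # optional spontaneous-activity phase
            # fitness = average return over three trials (different seeds); the graph
            # persists across episodes within the lifetime, not across generations
            returns = [run_episodes(params, graph, seed) for seed in range(n_trials)]
            fitnesses.append(jnp.mean(jnp.stack(returns)))
        cma_state = cma_tell(population, jnp.stack(fitnesses), cma_state)
    return cma_state
```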
Information flow (per timestep):
- Observations overwrite input-node activations; activations propagate via $v \leftarrow \tanh(W^\top v)$ (repeated `rnn_iters` times)
- Actions = argmax of output activations (discrete) or their raw concatenated values (continuous)
Lifetime dynamics (no gradient updates):
- Structural changes: synaptogenesis and pruning
- Weight changes: via edge state updates
- Both use evolved rules fixed at birth
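Tying the per-timestep flow together, a sketch of action selection (assumes the repo's layout of input nodes first; placing output nodes last is my assumption):

```python
import jax.numpy as jnp
import jax.nn as jnn

def select_action(obs, v, W, obs_dims, act_dims, rnn_iters, discrete=True):
    """Clamp observations onto input nodes, propagate activations rnn_iters times,
    then read the action off the output nodes."""
    for _ in range(rnn_iters):
        v = jnn.tanh(v.at[:obs_dims].set(obs) @ W)
    out = v[-act_dims:]                          # output-node activations (assumed last)
    return (jnp.argmax(out) if discrete else out), v
```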
Core challenge: Discover both structural rules (which connections to form) and learning rules (how to update weights) using only episodic rewards - no supervision on topology or weights.
Limitations of the approach / implementation
There is no activation sparsity!
Performance vs. adaptability tradeoff: LNDPs don’t match traditional RL networks in raw performance - they sacrifice task performance for lifelong adaptability.
Training challenges:
- Requires expensive evolutionary strategies (CMA-ES) rather than gradient descent
- Co-optimizing wiring rules and synaptic plasticity creates “strong dependencies” that complicate training
- Graph generation without supervised feedback is inherently difficult: the search must cover exponentially many topologies without gradients, using only episodic rewards that reflect behavior, not structure - it is unclear which connections deserve credit
Instability issues:
- High variance across random seeds
- Pendulum environment showed frequent collapse to empty networks
- Results lack reproducibility/consistency
Computational cost: Each update runs a Graph Transformer + GRUs for every neuron/synapse - far more expensive than standard neural networks.
Why evolutionary strategies instead of gradients?
Non-differentiable: Can’t backprop through “add/delete edge” decisions
Dynamic graphs: Network topology changes during runtime - no fixed computation graph
Temporal credit: Which structural change 500 steps ago helped performance?
Diversity exploration: ES naturally explores diverse developmental strategies - crucial for discovering novel self-organization principles
High variance: The same rules can produce very different networks - ES handles this stochasticity better than gradient methods
Alternatives?
- RL: Each edge change as action → massive action space
- Differentiable approximations: Continuous edge weights [0,1] instead of binary → still need thresholds, gets stuck with many weak connections.
- Supervised pre-training: No clear target for “correct” development
soup sidesteps this by removing edges entirely - using PKM similarity matching for dynamic connectivity without discrete topology decisions.
Random semi-structured notes / questions / clarifications that popped up during implementation + implementation details not in the paper:
States vs Features: The paper uses these inconsistently. “States” ($h$, $e$) are the evolving representations. “Features” are inputs to the Graph Transformer (states + structural info).
I/O in official implementation:
- Observations overwrite input neuron activations directly (no projection)
- Actions = output neuron activations (argmax for discrete, raw for continuous)
- Network must learn scaling internally
- Minimum nodes = obs_dims + action_dims
Paper mentions distributed actions (`Action_contrib_i = H_i @ W_A_gen`):
- All neurons vote on actions, not just output neurons
- Like soup where any neuron → decoder
- Not implemented, probably overkill for CartPole
Gap: paper vision (distributed everything) vs implementation (fixed I/O neurons)
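For reference, a sketch of what that distributed readout would be (my reading of `Action_contrib_i = H_i @ W_A_gen`; `W_A` would be a generated projection matrix):

```python
import jax.numpy as jnp

def distributed_action(H, W_A):
    """Every neuron votes: per-neuron contribution H_i @ W_A, summed into one action
    vector, instead of reading actions off dedicated output neurons."""
    contribs = H @ W_A            # (N, dh) @ (dh, act_dims) -> (N, act_dims)
    return contribs.sum(axis=0)   # aggregate contributions across all neurons
```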
W_pre projection before Graph Transformer:
- Input: `[in_degree/10, out_degree/10, total_degree/20, node_type_encoding(2D), activation_sequence(rnn_iters+1), node_state(h), current_activation(v)]` (normalization with magic nums)
- Output: projected to node_dim
- Paper only says the GT takes the “concatenation of nodes’ activations $v$, nodes’ states $h$, and structural graph features”
Edge GRUs see activation sequences:
- Edge input includes both nodes’ full activation histories over rnn_iters steps
- Paper says edges only use “pre and post-synaptic neurons’ states as well as their activity and the reward”
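A sketch of the discrepancy: what the implementation's edge GRU actually receives as input (names are mine; the edge state itself is the GRU's hidden state):

```python
import jax.numpy as jnp

def edge_gru_input(h_pre, h_post, act_hist_pre, act_hist_post, reward):
    """Implementation: both endpoint node states, both endpoints' activation histories
    over rnn_iters steps, and the reward. The paper only mentions states, activity, reward."""
    return jnp.concatenate([h_pre, h_post, act_hist_pre, act_hist_post,
                            jnp.atleast_1d(reward)])
```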
Node state as bias:
- Implementation: `rnn = lambda a, _: (jnn.tanh(a.at[:self.obs_dims].set(obs) @ state.w + b), a.at[:self.obs_dims].set(obs))`
- where `b = state.G.h[:, 0]` if `use_bias=True`, else zeros
- Paper shows $v^{t+1} = \tanh(W^\top v^t)$ - no bias term
- Each neuron learns its own bias through the first dimension of its state
Connection restrictions:
- No `input→input`, `output→output`, or `output→*` connections allowed
- Implemented in the `reservoir()` and `mask_A()` functions
- Paper doesn’t mention these constraints
- To reduce the search space, an inductive bias to encourage feedforward flow? Not biologically grounded?
Truncated normal only for hidden→hidden:
- Hidden→hidden initial weights: truncated normal (truncated to [0,1])
- All other connections: regular normal, then clip
- Bounds recurrent connections to prevent them from starting too large
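A sketch of that initialization, assuming the [inputs, hidden, outputs] node ordering; `jax.random.truncated_normal` samples a standard normal truncated to the given bounds, and the clip range for the other weights is illustrative:

```python
import jax
import jax.numpy as jnp

def init_weight_matrix(key, n_nodes, obs_dims, act_dims):
    """Hidden->hidden weights from a normal truncated to [0, 1]; all other weights
    from a regular normal, then clipped (clip bounds illustrative)."""
    k_rest, k_hid = jax.random.split(key)
    W = jnp.clip(jax.random.normal(k_rest, (n_nodes, n_nodes)), -1.0, 1.0)
    hid = slice(obs_dims, n_nodes - act_dims)               # hidden-node index range
    n_hid = n_nodes - obs_dims - act_dims
    W_hid = jax.random.truncated_normal(k_hid, 0.0, 1.0, (n_hid, n_hid))
    return W.at[hid, hid].set(W_hid)                        # keeps recurrent weights small
```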
Ideas / Backlog / Stream of thought:
Replication (date of creation: 2024/06)
- replicate results
Modification ideas:
- add spatial structure: distance regularization & 3D lattice (update 2025/06: would be made obsolete by abolishing edges, see below)
- decentralize information: don’t share weights + don’t give them full graph knowledge → how much worse? (update 2025/06: that’s a bad idea - weight explosion, worse generalization, harder to optimize; local memory / context should be enough .. see biological neurons/humans: same hardware, different context → wildly different function)
Here is where it starts to get soupy / totally different from the paper except for the base structure:
- decentralize optimization: each neuron is an independent reinforcement learner
- → decentralize rewards … this is where it gets very tricky (and where the least prior research has been done); this step kind of depends on this (no need: the neurons need to figure out how to communicate reward as part of their message passing); all the concepts of energy regularization and activation sparsity would come into play here
- structural plasticity for nodes
- excitatory vs inhibitory neurons? (update: can this be learned? is it necessary?)
- Astrocytes? (update: don’t they primarily provide structure / are a biological implementation detail?)
- different random graph initializations (Watts-Strogatz?)
- activation sparsity
- vector-valued activations
- Rewards as inputs?
More remarks on replication / roadmap to soup:
Nodes are the “what”, and Edges are the “who” for sending information.
Especially as the number of edges grows large, they pose a significant overhead.
One obvious optimization is having a common stem, a little bit like a dendrite.
…
Or to remove edges altogether and do an efficient PKM match between QK matrices.
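A sketch of what that could look like: plain top-k query-key matching in place of explicit edges (the product-key factorization that makes PKM efficient is omitted here):

```python
import jax
import jax.numpy as jnp

def qk_route(Q, K, V, k=8):
    """Edge-free message passing: each neuron attends to its top-k most similar
    neurons by query-key dot product instead of maintaining a discrete edge set."""
    scores = Q @ K.T                                       # (N, N) similarity matrix
    top_vals, top_idx = jax.lax.top_k(scores, k)           # per-neuron top-k matches
    weights = jax.nn.softmax(top_vals, axis=-1)            # normalize over the k matches
    return jnp.einsum('nk,nkd->nd', weights, V[top_idx])   # weighted sum of matched values
```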
References
https://claude.ai/chat/61f782ff-1742-4328-8459-7405fedad851
Comments on edge and structural features:
**Structural features** are the in-degree, out-degree, and total degree, as well as a one-hot encoding indicating if the node is an input, hidden, or output node.
Moreover, we introduce **edge features** in the attention layer as proposed in Dwivedi and Bresson (2021). We also augment edge features with structural features which are 2 bits indicating if there is a **forward** or **backward** connection between nodes and a bit indicating if the edge is a self-loop.
Confusing nomenclature mixups:
Node states can be used to define neuron parameters such as biases.
| symbol | name in the paper’s table | dim |
| --- | --- | --- |
| dh | node features | 8 |
| de | edge features | 4 |

“Node features” $h_t$ in the table, yet in the text: $h_t \in \mathcal{H}^N$ and $e_t \in \mathcal{E}^{N^2}$ are the nodes’ and edges’ *states* respectively, with $\mathcal{H} \equiv \mathbb{R}^{d_h}$ and $\mathcal{E} \equiv \mathbb{R}^{d_e}$. Edge states are masked by the adjacency matrix, i.e. set to zero wherever the adjacency matrix is 0.
Moreover, we introduce edge features in the attention layer as proposed in Dwivedi and Bresson (2021). We also augment edge features with structural features (this, in the implementation, is misnamed as `get_node_features`).
Check for episode done so we don’t try to learn the transition from the end of one episode to the start of the next (for finite-horizon envs).