year: 2024/06
paper: https://arxiv.org/pdf/2406.09787 / Evolving Self-Assembling Neural Networks - From Spontaneous Activity to Experience-Dependent Learning (copy)
website: https://x.com/eplantec/status/1808918092497703188
code: https://github.com/erwanplantec/LNDP
connections: neuroevolution, structural plasticity, synaptic plasticity, self-organization, ITU Copenhagen, spontaneous activity, indirect encoding, Sebastian Risi, parameter-sharing, NDP
Tldr
Evolves update rules, instead of weights.
Network with fixed nodes starts with random/empty connections, grown during lifetime based on node states, which get updated via Graph Transformer + GRUs. All nodes, edges share params.
Pre-experience phase uses OU noise (spontaneous activity) to develop structure before seeing real data.
CMA-ES evolves self-organization rules that solve RL. Parameters stay fixed during lifetime, only the structure changes.
Spontaneous Activity & Ornstein-Uhlenbeck Process
The pre-experience developmental phase uses spontaneous activity (SA) - internally generated patterns in input neurons without environmental input. The authors model SA using a learnable Ornstein-Uhlenbeck (OU) stochastic process:
$$x_{t+1} = x_t + \theta\,(\mu - x_t) + \Sigma\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I)$$
The OU process generates “noisy but correlated” signals - like a random walk pulled back toward a mean, where:
- $x_t$: current activity values of the input neurons
- $\theta\,(\mu - x_t)$: mean-reversion term pulling values toward $\mu$ (the further from $\mu$, the stronger the pull back)
- $\Sigma\,\epsilon_t$: Gaussian noise
This creates temporally correlated activity patterns (not just white noise) that mimic spontaneous waves in developing biological brains - like retinal waves before birth that help wire up visual systems. The learnable parameters ($\mu$, $\theta$, $\Sigma$) let evolution discover appropriate activity patterns for each task.
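A minimal sketch of this developmental noise generator, assuming the discretized OU update above (the function name, the zero initial state, and passing `sigma` as a full covariance-like matrix are my choices, not necessarily the repo's):

```python
import jax
import jax.numpy as jnp

def spontaneous_activity(key, n_inputs, n_steps, mu, theta, sigma):
    """Temporally correlated input-neuron activity from a discretized OU process:
    x_{t+1} = x_t + theta * (mu - x_t) + sigma @ eps, with eps ~ N(0, I)."""
    def step(x, k):
        eps = jax.random.normal(k, (n_inputs,))
        x_next = x + theta * (mu - x) + sigma @ eps  # mean reversion + correlated noise
        return x_next, x_next

    keys = jax.random.split(key, n_steps)
    _, trace = jax.lax.scan(step, jnp.zeros(n_inputs), keys)
    return trace  # (n_steps, n_inputs): drives plasticity before any real observation
```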
SA runs for a fixed number of timesteps at network initialization, before any environment interaction. During this one-time phase, the OU-generated patterns drive activity-dependent plasticity - synapses update based on pre/post-synaptic correlations without environmental rewards. This pre-structures the network topology and weights before real sensory data arrives.
Networks evolved with SA show immediate competence. In CartPole, SA-trained networks could balance from episode 1, while non-SA networks failed immediately and needed lifetime learning. The SA phase essentially evolves networks that arrive “pre-wired” with useful dynamics, even though they’ve never seen real observations. Evolution discovers SA parameters that produce beneficial initial configurations for the target task.
SA creates structural diversity in initially identical input neurons. The learnable covariance matrix $\Sigma$ generates different temporal patterns per input neuron during the pre-experience phase (run once before any environment interaction). Evolution shapes these patterns to pre-build distinct pathways - not to match specific sensors, but to create differentiated “slots” that can specialize when real observations arrive. Without SA, all input neurons would remain symmetric, making channel assignment during actual experience more difficult.
Structural plasticity - edges only, nodes fixed
Node count is fixed per network (chosen at initialization). Only edges can change - new synapses form via a learned synaptogenesis rule and are removed via a learned pruning rule. No neurogenesis or cell death.
Initial connectivity is sparse/random, drawn according to a parameter defining the connection probability.
Node structure
Node states: $h^{t+1}_i = \mathrm{GRU}_{\text{node}}\!\left(z^t_i,\, h^t_i\right)$, where $z^t_i$ is node $i$’s embedding produced by the Graph Transformer.
The Graph Transformer processes the full graph state (activations $v$, states $h$, structural features) and feeds into a GRU to update each node’s internal state. All nodes share parameters.

Node activations: $v^{t+1}_i = \tanh\!\big(\textstyle\sum_j w_{ji}\, v^t_j\big)$
Each neuron’s new activation is the weighted sum of all neurons’ previous activations, weighted by the corresponding edge weights.
Input nodes’ activations are forced to match observations, while hidden/output nodes evolve freely.

Node types: All nodes are structurally identical (same parameters). Input nodes get their activations overwritten by observations, hidden nodes process information freely, and output nodes’ activations become actions.
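A rough sketch of one node-side update as described above; `gt_embed` and `gru_node` are placeholders for the shared Graph Transformer and node GRU, and the convention that input nodes occupy the first `obs_dims` slots follows the repo's code:

```python
import jax.numpy as jnp
import jax.nn as jnn

def node_step(v, h, W, obs, obs_dims, gt_embed, gru_node):
    """v: (N,) activations, h: (N, dh) node states, W: (N, N) weights with W[i, j] = i -> j.
    gt_embed: full-graph embedding fn, gru_node: shared per-node GRU cell (placeholders)."""
    v = v.at[:obs_dims].set(obs)   # input nodes are overwritten by the observation
    v_next = jnn.tanh(v @ W)       # weighted sum of presynaptic activations, squashed
    z = gt_embed(v_next, h)        # per-node embeddings computed from the whole graph
    h_next = gru_node(z, h)        # one shared-parameter GRU updates every node state
    return v_next, h_next
```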
Edge structure
Edge states: $e^{t+1}_{ij} = \mathrm{GRU}_{\text{edge}}\!\left(e^t_{ij},\, [h^t_i, h^t_j, r^t]\right)$, where $r^t$ is the global reward.
Each edge has a GRU that updates based on both connected nodes’ (pre- and post-synaptic) states and the global reward. The first element of the edge state directly becomes the synaptic weight, $w_{ij} = e_{ij}[0]$. All edges share parameters.

Structural plasticity:
- Synaptogenesis: $p_{\text{new}}(i \to j) = \mathrm{MLP}_{\text{gen}}\!\big([h_i, h_j]\big)$
- Pruning: $p_{\text{prune}}(i \to j) = \mathrm{MLP}_{\text{prune}}\!\big(e_{ij}\big)$
New edges can form based on node compatibility (how well their states match). Existing edges can be pruned based on their state (weak connections die). Both use learned MLPs.
Bidirectionality: Edges process both nodes’ states simultaneously but the graph remains directed. Edge features include: the edge state $e_{ij}$, a forward-connection bit, a backward-connection bit, and a self-loop indicator.
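A sketch of how the synaptogenesis/pruning step above could be wired up; `gen_mlp` (scores node-state pairs) and `prune_mlp` (scores edge states) stand in for the learned MLPs, and the Bernoulli sampling is my assumption:

```python
import jax
import jax.numpy as jnp

def structural_step(key, A, H, E, gen_mlp, prune_mlp, allowed):
    """A: (N, N) adjacency (0/1), H: (N, dh) node states, E: (N, N, de) edge states,
    allowed: (N, N) mask of legal connections. Both MLPs output one logit per pair/edge."""
    k_add, k_del = jax.random.split(key)
    n = H.shape[0]
    pre = jnp.repeat(H[:, None, :], n, axis=1)    # (N, N, dh) presynaptic states
    post = jnp.repeat(H[None, :, :], n, axis=0)   # (N, N, dh) postsynaptic states
    p_add = jax.nn.sigmoid(gen_mlp(jnp.concatenate([pre, post], axis=-1)))[..., 0]
    p_del = jax.nn.sigmoid(prune_mlp(E))[..., 0]

    grow = (jax.random.uniform(k_add, (n, n)) < p_add) & (A == 0) & allowed
    cut = (jax.random.uniform(k_del, (n, n)) < p_del) & (A == 1)
    return jnp.where(grow, 1, jnp.where(cut, 0, A))  # updated adjacency matrix
```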
Training procedure
Evolution loop (CMA-ES):
- Start of a generation: Sample a population of parameter vectors
- For each individual:
- Load parameters
- Initialize fresh graph: random connections drawn with the initial connection probability
- Run development phase if enabled (a fixed number of spontaneous-activity steps)
- Run multiple episodes, keeping network graph between episodes
- Return fitness, defined as: “average return of the agent over three different trials (i.e. different random seeds)”
- CMA-ES updates its distribution based on the fitnesses
- Repeat for … generations
Each individual gets its own graph that persists across episodes but not across generations.
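A sketch of that outer loop; `cma_ask`/`cma_tell` are generic stand-ins for whatever ES library is used, and `init_graph`/`develop`/`run_episodes` are placeholders for the inner lifetime:

```python
import jax
import jax.numpy as jnp

def evolve(key, cma_ask, cma_tell, cma_state, init_graph, develop, run_episodes,
           n_generations, n_trials=3):
    """Evolved parameters stay fixed within a lifetime; only the graph self-organizes."""
    for _ in range(n_generations):
        key, k_ask = jax.random.split(key)
        population, cma_state = cma_ask(k_ask, cma_state)   # sample parameter vectors
        fitnesses = []
        for params in population:
            graph = init_graph(params)                      # fresh sparse random graph
            graph = develop(params, graph)                  # optional spontaneous-activity phase
            # fitness = average return over three trials (different seeds); the graph
            # persists across episodes within the lifetime, not across generations
            returns = [run_episodes(params, graph, seed) for seed in range(n_trials)]
            fitnesses.append(jnp.mean(jnp.stack(returns)))
        cma_state = cma_tell(population, jnp.stack(fitnesses), cma_state)
    return cma_state
```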
Information flow (per timestep):
- Observations overwrite input-node activations; activations propagate via $v \leftarrow \tanh(W^\top v)$ (repeated `rnn_iters` times)
- Actions = argmax of output activations (discrete) or their raw concatenated values (continuous)
Lifetime dynamics (no gradient updates):
- Structural changes: synaptogenesis and pruning
- Weight changes: via edge state updates
- Both use evolved rules fixed at birth
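Tying the per-timestep flow together, a sketch of action selection (assumes the repo's layout of input nodes first; placing output nodes last is my assumption):

```python
import jax.numpy as jnp
import jax.nn as jnn

def select_action(obs, v, W, obs_dims, act_dims, rnn_iters, discrete=True):
    """Clamp observations onto input nodes, propagate activations rnn_iters times,
    then read the action off the output nodes."""
    for _ in range(rnn_iters):
        v = jnn.tanh(v.at[:obs_dims].set(obs) @ W)
    out = v[-act_dims:]                          # output-node activations (assumed last)
    return (jnp.argmax(out) if discrete else out), v
```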
Core challenge: Discover both structural rules (which connections to form) and learning rules (how to update weights) using only episodic rewards - no supervision on topology or weights.
Limitations of the approach / implementation
There is no activation sparsity!
Performance vs. adaptability tradeoff: LNDPs don’t match traditional RL networks in raw performance - they sacrifice task performance for lifelong adaptability.
Training challenges:
- Requires expensive evolutionary strategies (CMA-ES) rather than gradient descent
- Co-optimizing wiring rules and synaptic plasticity creates “strong dependencies” that complicate training
- Graph generation without supervised feedback is inherently difficult: the search must cover exponentially many topologies without gradients, using only episodic rewards that reflect behavior, not structure - it is unclear which connections deserve credit
Instability issues:
- High variance across random seeds
- Pendulum environment showed frequent collapse to empty networks
- Results lack reproducibility/consistency
Computational cost: Each update runs a Graph Transformer + GRUs for every neuron/synapse - far more expensive than standard neural networks.
Why evolutionary strategies instead of gradients?
Non-differentiable: Can’t backprop through “add/delete edge” decisions
Dynamic graphs: Network topology changes during runtime - no fixed computation graph
Temporal credit: Which structural change 500 steps ago helped performance?
Diversity exploration: ES naturally explores diverse developmental strategies - crucial for discovering novel self-organization principles
High variance: The same rules can produce very different networks - ES handles this stochasticity better than gradient methods
Alternatives?
- RL: Each edge change as action → massive action space
- Differentiable approximations: Continuous edge weights [0,1] instead of binary → still need thresholds, gets stuck with many weak connections.
- Supervised pre-training: No clear target for “correct” development
soup sidesteps this by removing edges entirely - using PKM similarity matching for dynamic connectivity without discrete topology decisions.
Random semi-structured notes / questions / clarifications that popped up during implementation + implementation details not in the paper:
States vs Features: The paper uses these inconsistently. “States” ($h$, $e$) are the evolving representations. “Features” are inputs to the Graph Transformer (states + structural info).
I/O in official implementation:
- Observations overwrite input neuron activations directly (no projection)
- Actions = output neuron activations (argmax for discrete, raw for continuous)
- Network must learn scaling internally
- Minimum nodes = obs_dims + action_dims
Paper mentions distributed actions (`Action_contrib_i = H_i @ W_A_gen`):
- All neurons vote on actions, not just output neurons
- Like soup where any neuron → decoder
- Not implemented, probably overkill for CartPole
Gap: paper vision (distributed everything) vs implementation (fixed I/O neurons)
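For reference, a sketch of what that distributed readout would be (my reading of `Action_contrib_i = H_i @ W_A_gen`; `W_A` would be a generated projection matrix):

```python
import jax.numpy as jnp

def distributed_action(H, W_A):
    """Every neuron votes: per-neuron contribution H_i @ W_A, summed into one action
    vector, instead of reading actions off dedicated output neurons."""
    contribs = H @ W_A            # (N, dh) @ (dh, act_dims) -> (N, act_dims)
    return contribs.sum(axis=0)   # aggregate contributions across all neurons
```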
W_pre projection before Graph Transformer:
- Input: `[in_degree/10, out_degree/10, total_degree/20, node_type_encoding(2D), activation_sequence(rnn_iters+1), node_state(h), current_activation(v)]` (normalization with magic nums)
- Output: projected to node_dim
- Paper only says the GT takes the “concatenation of nodes’ activations $v$, nodes’ states $h$, and structural graph features”
Edge GRUs see activation sequences:
- Edge input includes both nodes’ full activation histories over rnn_iters steps
- Paper says edges only use “pre and post-synaptic neurons’ states as well as their activity and the reward”
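A sketch of the discrepancy: what the implementation's edge GRU actually receives as input (names are mine; the edge state itself is the GRU's hidden state):

```python
import jax.numpy as jnp

def edge_gru_input(h_pre, h_post, act_hist_pre, act_hist_post, reward):
    """Implementation: both endpoint node states, both endpoints' activation histories
    over rnn_iters steps, and the reward. The paper only mentions states, activity, reward."""
    return jnp.concatenate([h_pre, h_post, act_hist_pre, act_hist_post,
                            jnp.atleast_1d(reward)])
```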
Node state as bias:
- Implementation: `rnn = lambda a, _: (jnn.tanh(a.at[:self.obs_dims].set(obs) @ state.w + b), a.at[:self.obs_dims].set(obs))`
- where `b = state.G.h[:, 0]` if `use_bias=True`, else zeros
- Paper shows $v^{t+1} = \tanh(W^\top v^t)$ - no bias term
- Each neuron learns its own bias through the first dimension of its state
Connection restrictions:
- No `input→input`, `output→output`, or `output→*` connections allowed
- Implemented in the `reservoir()` and `mask_A()` functions
- Paper doesn’t mention these constraints
- To reduce the search space, an inductive bias to encourage feedforward flow? Not biologically grounded?
Truncated normal only for hidden→hidden:
- Hidden→hidden initial weights: truncated normal (truncated to [0,1])
- All other connections: regular normal, then clip
- Bounds recurrent connections to prevent them from starting too large
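A sketch of that initialization, assuming the [inputs, hidden, outputs] node ordering; `jax.random.truncated_normal` samples a standard normal truncated to the given bounds, and the clip range for the other weights is illustrative:

```python
import jax
import jax.numpy as jnp

def init_weight_matrix(key, n_nodes, obs_dims, act_dims):
    """Hidden->hidden weights from a normal truncated to [0, 1]; all other weights
    from a regular normal, then clipped (clip bounds illustrative)."""
    k_rest, k_hid = jax.random.split(key)
    W = jnp.clip(jax.random.normal(k_rest, (n_nodes, n_nodes)), -1.0, 1.0)
    hid = slice(obs_dims, n_nodes - act_dims)               # hidden-node index range
    n_hid = n_nodes - obs_dims - act_dims
    W_hid = jax.random.truncated_normal(k_hid, 0.0, 1.0, (n_hid, n_hid))
    return W.at[hid, hid].set(W_hid)                        # keeps recurrent weights small
```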
Ideas / Backlog / Stream of thought:
Replication (date of creation: 2024/06)
- replicate results
Modification ideas:
- add spatial structure: distance regularization & 3D lattice (update 2025/06: would be made obsolete by abolishing edges, see below)
- decentralize information: don’t share weights + don’t give them full graph knowledge → how much worse? (update 2025/06: that’s a bad idea - weight explosion, worse generalization, harder to optimize; local memory / context should be enough .. see biological neurons/humans: same hardware, different context → wildly different function)
Here is where it starts to get soupy / totally different from the paper except for the base structure:
- decentralize optimization: each neuron is an independent reinforcement learner
- → decentralize rewards … this is where it gets very tricky (and where the least prior research has been done); this step kind of depends on this (no need: the neurons need to figure out how to communicate reward as part of their message passing); all the concepts of energy regularization and activation sparsity would come into play here
- structural plasticity for nodes
- excitatory vs inhibitory neurons? (update: can this be learned? is it necessary?)
- Astrocytes? (update: don’t they primarily provide structure / are a biological implementation detail?)
- different random graph initializations (Watts-Strogatz?)
- activation sparsity
- vector-valued activations
- Rewards as inputs?
More remarks on replication / roadmap to soup:
Nodes are the “what”, and Edges are the “who” for sending information.
Especially as the number of edges grows large, they pose a significant overhead.
One obvious optimization is having a common stem, a little bit like a dendrite.
…
Or to remove edges altogether and do an efficient PKM match between QK matrices.
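A sketch of what that could look like: plain top-k query-key matching in place of explicit edges (the product-key factorization that makes PKM efficient is omitted here):

```python
import jax
import jax.numpy as jnp

def qk_route(Q, K, V, k=8):
    """Edge-free message passing: each neuron attends to its top-k most similar
    neurons by query-key dot product instead of maintaining a discrete edge set."""
    scores = Q @ K.T                                       # (N, N) similarity matrix
    top_vals, top_idx = jax.lax.top_k(scores, k)           # per-neuron top-k matches
    weights = jax.nn.softmax(top_vals, axis=-1)            # normalize over the k matches
    return jnp.einsum('nk,nkd->nd', weights, V[top_idx])   # weighted sum of matched values
```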
References
https://claude.ai/chat/61f782ff-1742-4328-8459-7405fedad851
Comments on edge and structural features:
**Structural features** are the in-degree, out-degree, and total degree, as well as a one-hot encoding indicating if the node is an input, hidden, or output node.
Moreover, we introduce **edge features** in the attention layer as proposed in Dwivedi and Bresson (2021). We also augment edge features with structural features which are 2 bits indicating if there is a **forward** or **backward** connection between nodes and a bit indicating if the edge is a self-loop.
Confusing nomenclature mixups:
Node states can be used to define neuron parameters such as biases.
| symbol | name in the paper’s table | dim |
| --- | --- | --- |
| dh | node features | 8 |
| de | edge features | 4 |

“Node features” $h_t$ in the table, yet in the text: $h_t \in \mathcal{H}^N$ and $e_t \in \mathcal{E}^{N^2}$ are the nodes’ and edges’ *states* respectively, with $\mathcal{H} \equiv \mathbb{R}^{d_h}$ and $\mathcal{E} \equiv \mathbb{R}^{d_e}$. Edge states are masked by the adjacency matrix, i.e. set to zero wherever the adjacency matrix is 0.
Moreover, we introduce edge features in the attention layer as proposed in Dwivedi and Bresson (2021). We also augment edge features with structural features (this, in the implementation, is misnamed as `get_node_features`).
Check for episode done so we don’t try to learn the transition from the end of one episode to the start of the next (for finite-horizon envs).