Recurrent Independent Mechanisms

year:
paper: recurrent-independent-mechanisms
website:
code: https://github.com/lucidrains/RIM-pytorch
connections: Yoshua Bengio

Sparsity is at the level of cell activation: only the top-k cells update at each step.
Communication is dense, though: an active cell attends to every other cell, with no masking.
“Causal” modularity → OOD generalization
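
A minimal sketch of the dense communication step noted above, assuming (as in the RIM paper) that only active cells perform the read and that inactive cells keep their state. The function name `communicate` and the identity query/key/value projections are illustrative assumptions; the real model uses learned projections and multiple heads.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def communicate(hidden, active):
    """Active cells read from ALL cells (no mask); inactive cells are unchanged.

    Assumption: queries, keys and values are the hidden states themselves
    (identity projections), purely for illustration.
    """
    d = len(hidden[0])
    new_hidden = [list(h) for h in hidden]
    for i in active:
        # Scores against every cell, including cell i itself: nothing is masked.
        scores = [dot(hidden[i], h_j) / math.sqrt(d) for h_j in hidden]
        weights = softmax(scores)
        read = [sum(w * h_j[c] for w, h_j in zip(weights, hidden)) for c in range(d)]
        # Residual update: the attended read is added to the cell's own state.
        new_hidden[i] = [h + r for h, r in zip(hidden[i], read)]
    return new_hidden

hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = communicate(hidden, active=[0])
```

Here only cell 0 changes; cells 1 and 2 are carried over untouched, which is where the activation sparsity shows up even though the attention itself is dense.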

RIMs each carry persistent recurrent hidden state; MoE experts are stateless feedforward functions re-selected fresh each token.
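MoE has a learned gating/router network producing expert weights; RIMs route by attention instead: each module's query (computed from its hidden state) is scored against the input plus a null token, and the modules that put the least weight on the null are activated. Modules "bid" for activation rather than being assigned.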
Communication: MoE experts don’t talk to each other; RIMs explicitly do.
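
The bidding described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's or the linked repo's implementation: `select_active_modules` is a hypothetical name, each module's query is given directly rather than projected from a recurrent state, there is a single input key, and the null token's key is taken to be the zero vector.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_active_modules(queries, input_key, null_key, k):
    """Pick the k modules that place the most attention on the input.

    Each module attends over two slots, [input, null]. A module that puts
    most of its weight on the null slot is effectively declining to bid.
    """
    d = len(input_key)
    input_weights = []
    for q in queries:
        scores = [dot(q, input_key) / math.sqrt(d), dot(q, null_key) / math.sqrt(d)]
        w_input, _w_null = softmax(scores)
        input_weights.append(w_input)
    ranked = sorted(range(len(queries)), key=lambda i: input_weights[i], reverse=True)
    return sorted(ranked[:k]), input_weights

# Hypothetical queries for three modules (one per module's hidden state).
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
input_key = [2.0, 0.0]
null_key = [0.0, 0.0]  # assumption: the null token maps to a zero key

active, weights = select_active_modules(queries, input_key, null_key, k=2)
```

Unlike an MoE router, nothing here is a separate gating network: the ranking falls out of each module's own attention weights, and the modules left out of `active` would simply carry their hidden state forward.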

On Causal and Anticausal Learning

https://arxiv.org/pdf/1206.6471
Perhaps: https://web.math.ku.dk/~peters/jonas_files/ElementsOfCausalInference.pdf
