year:
paper: recurrent-independent-mechanisms
website:
code: https://github.com/lucidrains/RIM-pytorch
connections: Yoshua Bengio


Sparsity is in cell activation: only the top-k cells update at each step.
But communication is dense: each cell attends to every other cell, no masking.
“Causal” modularity → OOD generalization

RIMs each carry persistent recurrent hidden state; MoE experts are stateless feedforward functions re-selected fresh each token.
MoE has a learned gating/router network producing expert weights; RIMs route by input attention: each module’s query (from its hidden state) attends over the input + a null token, and the modules that put the most weight on the input become active, so modules “bid” for activation rather than being assigned.
Communication: MoE experts don’t talk to each other; RIMs explicitly do.
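
Rough sketch of the "bidding" step, to make the routing contrast with MoE concrete. This is a toy in plain Python, not the code from the repo above: the function name `select_active_modules`, the two-slot `[input, null]` layout, the all-zero null key, and the example vectors are all assumptions for illustration. Each module scores the input and the null slot with its own query, and the k modules with the highest attention weight on the input are treated as active.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_active_modules(queries, input_key, null_key, k):
    """Return (sorted indices of the k active modules, per-module input weights).

    queries: one query vector per module (assumed to come from its hidden state).
    input_key / null_key: key vectors for the real input and the null slot.
    """
    scale = math.sqrt(len(input_key))
    input_weights = []
    for q in queries:
        # Each module attends over two slots: [input, null].
        w_input, _w_null = softmax(
            [dot(q, input_key) / scale, dot(q, null_key) / scale]
        )
        input_weights.append(w_input)
    # Modules "bid": the top-k by attention on the input become active.
    ranked = sorted(
        range(len(queries)), key=lambda i: input_weights[i], reverse=True
    )
    return sorted(ranked[:k]), input_weights


# Toy example: 4 modules, 3-dim keys; the values are arbitrary.
queries = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
input_key = [2.0, 0.0, 0.0]
null_key = [0.0, 0.0, 0.0]  # assumption: null slot maps to an all-zero key

active, weights = select_active_modules(queries, input_key, null_key, k=2)
print(active)  # [0, 3]
```

Unlike an MoE router, there is no separate gating network here: the ranking falls out of each module's own query, and a module whose query does not match the input sends its attention to the null slot and stays inactive.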

On causal and anti-causal learning