Attention Residuals

year: https://youtu.be/iw1VF8HOCrk
paper: https://arxiv.org/abs/2603.15031
website:
code:
connections: residual connection, self-attention,

Learned weighting of residual connections across all layers, replacing the standard sequential additive residual stream. Plain addition mixes outputs that might not go well together and forces later layers to produce larger activations to get heard.
Block-wise variant: full attention within a block, plus full attention over the other blocks' outputs.
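
To make the contrast with plain addition concrete, here is a minimal sketch in pure Python. It assumes each layer owns a learned query vector and scores earlier outputs by dot product followed by a softmax; the paper's exact parameterization (normalization, keys, block handling) may differ, and the names `additive_residual` and `attention_residual` are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_residual(prev_outputs):
    # Standard residual stream: every earlier output is summed with weight 1.
    dim = len(prev_outputs[0])
    return [sum(out[d] for out in prev_outputs) for d in range(dim)]

def attention_residual(prev_outputs, query):
    # Assumed form: score each earlier output against this layer's query,
    # then take the softmax-weighted mix as the layer input.
    scores = [sum(q * x for q, x in zip(query, out)) for out in prev_outputs]
    weights = softmax(scores)
    dim = len(prev_outputs[0])
    mixed = [sum(w * out[d] for w, out in zip(weights, prev_outputs))
             for d in range(dim)]
    return mixed, weights

# Toy example: an embedding plus two earlier layer outputs, hidden size 2.
prev_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
query = [0.0, 0.0]  # a zero query gives uniform weights, i.e. a plain average
mixed, weights = attention_residual(prev_outputs, query)
summed = additive_residual(prev_outputs)
```

Because the weights sum to one, the mixed input stays on the scale of a single layer output no matter how deep the stack is, whereas the additive sum grows with the number of contributors.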

I love this.

Basically, this turns the communication structure between the layers into a graph.
Blockwise, a graph with local connectivity.

But it is still a graph with sequential computation.

Break up the sequential flow by shifting the inputs to t-1, so every layer reads the other layers' outputs from the previous step.
Now you have L layers in any-to-any communication with each other; a soup of heterogeneous “neurons”.
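
A rough sketch of what one such parallel update could look like, again in pure Python. This is my own illustration of the idea, not something from the paper: `soup_step`, `mix`, and the toy layer functions are hypothetical, and the dot-product scoring is an assumption.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix(states, query):
    # Softmax-weighted mix of all layer states, scored against one query.
    scores = [sum(q * x for q, x in zip(query, s)) for s in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]

def soup_step(states, queries, layer_fns):
    # One synchronous update: every layer reads only the step t-1 states,
    # so all L updates are independent and could run in parallel.
    return [fn(mix(states, q)) for q, fn in zip(queries, layer_fns)]

# Three heterogeneous toy "layers" with hidden size 2.
layer_fns = [
    lambda h: [2.0 * x for x in h],
    lambda h: [x + 1.0 for x in h],
    lambda h: [max(0.0, x) for x in h],
]
queries = [[0.0, 0.0] for _ in range(3)]  # zero queries: uniform mixing
states = [[1.0, -1.0], [0.0, 2.0], [2.0, 2.0]]
new_states = soup_step(states, queries, layer_fns)
```

No layer waits on another within a step; depth is replaced by repeated calls to `soup_step`.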

If you now also tie the weights and add an adaptive computation time mechanism, you have a distributed universal transformer.
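
For reference, a toy sketch of the two extra ingredients: weight tying (one shared function applied at every step) and an ACT-style halting rule in the spirit of Graves' adaptive computation time. The scalar state, `shared_fn`, and the halting parameters are all hypothetical simplifications.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shared_fn(h):
    # Tied weights: the same function is applied at every step.
    return math.tanh(h)

def run_with_halting(h, halt_weight, halt_bias, max_steps=10, eps=0.01):
    # Accumulate halting probability; stop once it reaches 1 - eps
    # or the step budget runs out. The output is the probability-weighted
    # mix of intermediate states, with the remainder on the final state.
    cumulative = 0.0
    output = 0.0
    steps = 0
    for steps in range(1, max_steps + 1):
        h = shared_fn(h)
        p = sigmoid(halt_weight * h + halt_bias)
        if cumulative + p >= 1.0 - eps or steps == max_steps:
            output += (1.0 - cumulative) * h  # remainder goes to the last state
            break
        cumulative += p
        output += p * h
    return output, steps

# A confident halting unit stops after one step; a reluctant one uses the budget.
fast_out, fast_steps = run_with_halting(0.5, 0.0, 10.0)
slow_out, slow_steps = run_with_halting(0.5, 0.0, -10.0, max_steps=5)
```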

This new perspective helps me better frame what questions soup is actually asking:

Does the extra parallelism / modularity help?
