year: 2025/03
paper: https://openreview.net/forum?id=stFPf3gzq1
website: https://blog.exolabs.net/day-12/
code:
connections: exolabs, distributed training, ensembles, DiLoCo
- Share / sync / average only a small percentage of randomly sampled parameters each step (see the first sketch after this list)
- Reduces network communication by up to 1000x (e.g. sharing 0.1% of parameters)
- Models stay highly correlated (>0.9, for reasonable percentages like 0.1%)
- This even seems to work with weights from some steps in the past, i.e. asynchronously
- Every node has a full instance of the model; It’s an ensemble of models
- Training is more stable; Works with higher learning rates
- At the end, just do a full average, or use the models for Bayesian approximation @ inference (see the second sketch after this list)
- Doesn’t scale well beyond 16 nodes :(
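
A minimal sketch of the per-step sparse sync, assuming PyTorch with `torch.distributed` already initialized; the function name `sparse_sync` and the seeding scheme are my own, not from the paper. The shared seed lets every node sample the same indices without extra communication, and with `fraction=0.001` each node only exchanges ~0.1% of the weights per step:

```python
import torch
import torch.distributed as dist

def sparse_sync(model: torch.nn.Module, step: int, fraction: float = 0.001) -> None:
    """Average a random `fraction` of each parameter tensor across all nodes."""
    world_size = dist.get_world_size()
    for i, p in enumerate(model.parameters()):
        # Seed from (step, param index) so every node samples the same indices.
        gen = torch.Generator().manual_seed(step * 100_003 + i)
        n = p.numel()
        k = max(1, int(n * fraction))
        idx = torch.randperm(n, generator=gen)[:k].to(p.device)

        flat = p.data.view(-1)
        chunk = flat[idx]                              # gather the sampled entries
        dist.all_reduce(chunk, op=dist.ReduceOp.SUM)   # sum over all nodes
        flat[idx] = chunk / world_size                 # write back the average
```

Called once per optimizer step, after `optimizer.step()`, on every node.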
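And a sketch of the two end-of-training options, assuming each node's weights have been gathered into a list of state dicts / models; the helpers `full_average` and `ensemble_predict` are hypothetical, and the ensemble version assumes a classifier-style output for the softmax:

```python
import torch

def full_average(state_dicts: list[dict]) -> dict:
    """Merge all replicas into one set of weights by simple averaging."""
    avg = {k: torch.zeros_like(v, dtype=torch.float32) for k, v in state_dicts[0].items()}
    for sd in state_dicts:
        for k, v in sd.items():
            avg[k] += v.float() / len(state_dicts)
    return avg

def ensemble_predict(models: list[torch.nn.Module], x: torch.Tensor) -> torch.Tensor:
    """Treat the replicas as an ensemble and average their predictive distributions."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0)
```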
Read the paper.