year: 2025/03
paper: https://openreview.net/forum?id=stFPf3gzq1
website: https://blog.exolabs.net/day-12/
code:
connections: exolabs, distributed training, ensembles, DiLoCo
- Share / sync / average only a small percentage of randomly sampled parameters each step (see the first sketch after this list)
- Reduces network communication by up to 1000x (e.g. sharing 0.1% of parameters)
- Models stay highly correlated (>0.9, for reasonable percentages like 0.1%)
- This even seems to work with weights from some steps in the past, i.e. asynchronously
- Every node has a full instance of the model; It’s an ensemble of models
- Training is more stable; Works with higher learning rates
- At the end, just do a full average, or use the models for Bayesian approximation @ inference (see the second sketch after this list)
- Doesn’t scale well beyond 16 nodes :(
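
A minimal sketch of the per-step sparse sync, assuming PyTorch with `torch.distributed` already initialized; the function name `sparse_sync` and the seeding scheme are my own, not from the paper. The shared seed lets every node sample the same indices without extra communication, and with `fraction=0.001` each node only exchanges ~0.1% of the weights per step:

```python
import torch
import torch.distributed as dist

def sparse_sync(model: torch.nn.Module, step: int, fraction: float = 0.001) -> None:
    """Average a random `fraction` of each parameter tensor across all nodes."""
    world_size = dist.get_world_size()
    for i, p in enumerate(model.parameters()):
        # Seed from (step, param index) so every node samples the same indices.
        gen = torch.Generator().manual_seed(step * 100_003 + i)
        n = p.numel()
        k = max(1, int(n * fraction))
        idx = torch.randperm(n, generator=gen)[:k].to(p.device)

        flat = p.data.view(-1)
        chunk = flat[idx]                              # gather the sampled entries
        dist.all_reduce(chunk, op=dist.ReduceOp.SUM)   # sum over all nodes
        flat[idx] = chunk / world_size                 # write back the average
```

Called once per optimizer step, after `optimizer.step()`, on every node.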
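And a sketch of the two end-of-training options, assuming each node's weights have been gathered into a list of state dicts / models; the helpers `full_average` and `ensemble_predict` are hypothetical, and the ensemble version assumes a classifier-style output for the softmax:

```python
import torch

def full_average(state_dicts: list[dict]) -> dict:
    """Merge all replicas into one set of weights by simple averaging."""
    avg = {k: torch.zeros_like(v, dtype=torch.float32) for k, v in state_dicts[0].items()}
    for sd in state_dicts:
        for k, v in sd.items():
            avg[k] += v.float() / len(state_dicts)
    return avg

def ensemble_predict(models: list[torch.nn.Module], x: torch.Tensor) -> torch.Tensor:
    """Treat the replicas as an ensemble and average their predictive distributions."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0)
```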
Read the paper.