year: 2025/03
paper: https://openreview.net/forum?id=stFPf3gzq1
website: https://blog.exolabs.net/day-12/
code:
connections: exolabs, distributed training, ensembles, DiLoCo


  • Share / sync / average only a small percentage of randomly sampled parameters each step (see the first sketch after this list)
  • Reduces network communication by up to 1000x (e.g. sharing 0.1% of parameters)
  • Models on different nodes stay highly correlated (>0.9 for reasonable fractions like 0.1%)
  • This even seems to work when averaging against weights from a few steps in the past, i.e. asynchronously
  • Every node keeps a full instance of the model; it’s effectively an ensemble of models
  • Training is more stable; works with higher learning rates
  • At the end, just do a full average, or use the models for Bayesian approximation at inference (second sketch below)
  • Doesn’t scale well beyond 16 nodes :(
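
A minimal sketch of the per-step sync, assuming a PyTorch `torch.distributed` setup. `sparse_sync`, the with-replacement sampling, and the 0.1% default are my illustration, not the paper’s exact recipe:

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def sparse_sync(model: torch.nn.Module, step: int, fraction: float = 0.001) -> None:
    """Average a tiny random subset of parameters across all nodes.

    Every node derives the same indices from the step number, so only the
    selected values need to be communicated (~1000x less traffic at 0.1%).
    """
    gen = torch.Generator().manual_seed(step)   # same sample on every node
    world_size = dist.get_world_size()
    for p in model.parameters():
        flat = p.data.view(-1)
        k = max(1, int(flat.numel() * fraction))
        # sample with replacement for simplicity; the paper's exact sampling scheme may differ
        idx = torch.randint(0, flat.numel(), (k,), generator=gen).to(flat.device)
        picked = flat[idx].clone()
        dist.all_reduce(picked, op=dist.ReduceOp.SUM)
        flat[idx] = picked / world_size          # write the averaged values back in place
```

Would be called once per training step (e.g. right after `optimizer.step()`); everything else about local training stays unchanged.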

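A rough sketch of the two end-of-training options from the last bullets (function names are mine; the Bayesian approximation here is just averaging softmax outputs across the ensemble):

```python
import torch

@torch.no_grad()
def merge_by_full_average(models: list[torch.nn.Module]) -> torch.nn.Module:
    """Collapse the ensemble into a single model by averaging every parameter."""
    merged = models[0]
    for group in zip(*(m.parameters() for m in models)):
        group[0].data.copy_(torch.stack([p.data for p in group]).mean(dim=0))
    return merged

@torch.no_grad()
def ensemble_predict(models: list[torch.nn.Module], x: torch.Tensor) -> torch.Tensor:
    """Keep all copies and average their predictive distributions at inference,
    a rough Bayesian-style approximation over the correlated ensemble."""
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)
```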
Read the paper.