year: 2024
paper: https://arxiv.org/abs/2311.08105
website:
code:
connections: distributed training, model merging, federated learning


DiLoCo (Distributed Low-Communication) enables training a model across multiple workers with minimal communication between them. Each worker trains a local copy independently for H steps, then all workers synchronize via parameter averaging.

DiLoCo algorithm

Each worker maintains a full model copy and trains on its data shard for H “inner” steps using AdamW. Every H steps, workers synchronize through an “outer” optimization step:

  1. Average all worker parameters (like federated averaging)

  2. Apply SGD with Nesterov momentum to the averaged parameters

  3. Broadcast updated parameters back to all workers

This reduces communication by a factor of H compared to standard data parallelism, which all-reduces gradients every step.

Key innovation: using different optimizers for the inner (AdamW) and outer (Nesterov SGD) steps. The outer optimizer treats the difference between the previous synchronized parameters and the new worker average as a pseudo-gradient (the “outer gradient”), applying momentum to smooth the trajectory across rounds.
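
A minimal single-process sketch of one round in PyTorch, assuming a toy model and synthetic data; `make_model`, the hyperparameters, and the shapes are illustrative, and a real deployment would run each worker on its own node and replace the Python-level averaging with an all-reduce.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_model():
    return nn.Linear(16, 1)  # stand-in for a real transformer

k, H, outer_rounds = 4, 100, 5           # workers, inner steps per round, rounds
global_model = make_model()
workers = [copy.deepcopy(global_model) for _ in range(k)]
inner_opts = [torch.optim.AdamW(w.parameters(), lr=1e-3) for w in workers]
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)

for _ in range(outer_rounds):
    # Inner phase: each worker trains independently on its own shard for H steps.
    for w, opt in zip(workers, inner_opts):
        for _ in range(H):
            x = torch.randn(32, 16)       # stand-in for the worker's data shard
            loss = nn.functional.mse_loss(w(x), x.sum(-1, keepdim=True))
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Outer phase: average the workers (step 1) and take a Nesterov SGD step (step 2).
    outer_opt.zero_grad()
    for name, p_global in global_model.named_parameters():
        avg = torch.stack([dict(w.named_parameters())[name].data
                           for w in workers]).mean(0)
        # The pseudo-gradient is how far the averaged workers moved away
        # from the last synchronized parameters.
        p_global.grad = p_global.data - avg
    outer_opt.step()

    # Broadcast (step 3): reset every worker to the new global parameters.
    for w in workers:
        w.load_state_dict(global_model.state_dict())
```

With an outer learning rate of 1.0 and no momentum, the outer step reduces to plain parameter averaging; the Nesterov momentum is what carries information across rounds.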

Trade-offs with H

Small H (e.g., 100): Frequent synchronization keeps models aligned but incurs high communication cost
Large H (e.g., 10,000): Rare synchronization saves bandwidth but lets models diverge, hurting performance
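
A back-of-the-envelope comparison of per-worker communication volume, using an assumed 1B-parameter model in bf16 and 100,000 training steps (illustrative numbers, not from the paper):

```python
PARAMS = 1_000_000_000      # assumed model size
BYTES_PER_PARAM = 2         # bf16
STEPS = 100_000             # assumed training length

payload_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~2 GB per full exchange

ddp_total = STEPS * payload_gb                # gradients exchanged every step
for H in (100, 10_000):
    diloco_total = (STEPS // H) * payload_gb  # parameters exchanged every H steps
    print(f"H={H:>6}: DiLoCo ~{diloco_total:>8,.0f} GB  vs  DDP ~{ddp_total:,.0f} GB")
```

The exact all-reduce volume depends on the collective implementation, but the factor-H scaling is the point: roughly 2,000 GB at H=100 versus 20 GB at H=10,000 in this toy calculation.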

Note: SPARTA addresses this by continuously sharing sparse parameter subsets between full syncs, maintaining alignment with 99.9% less communication (demonstrated so far only on smaller models).
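
A rough sketch of that idea, exchanging a small random slice of the parameters each step between full syncs; the index selection, the 0.1% fraction, and the averaging rule here are assumptions for illustration, not SPARTA's published recipe:

```python
import torch

def sparse_sync(workers, fraction=0.001, generator=None):
    """Average a random ~0.1% slice of every parameter tensor across workers.

    Illustrative only: index choice, fraction, and averaging rule are assumed.
    All workers must draw the same indices (e.g., via a shared seed).
    """
    names = [n for n, _ in workers[0].named_parameters()]
    for name in names:
        flats = [dict(w.named_parameters())[name].data.view(-1) for w in workers]
        n = flats[0].numel()
        k = max(1, int(n * fraction))
        idx = torch.randperm(n, generator=generator)[:k]   # shared random indices
        avg = torch.stack([f[idx] for f in flats]).mean(0)
        for f in flats:
            f[idx] = avg     # write the averaged slice back into each worker
```

Called once per inner step alongside local training, something like this keeps workers loosely coupled at a small fraction of the cost of a full parameter exchange.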

DiLoCo works because models training on similar data stay in the same loss basin for reasonable H values. The periodic averaging pulls diverging models back together, while momentum in the outer optimizer provides stability.

Extensions include asynchronous variants, checkpoint streaming to reduce peak bandwidth, and combinations with sparse communication methods. OpenDiLoCo demonstrated training across continents, showing that geographically distributed setups are viable.