year:
paper:
website:
code:
connections: exolabs, distributed training, ensemble


https://blog.exolabs.net/day-12/

  • Share / sync / average only a small, randomly sampled percentage of parameters each step (see the sparse-sync sketch after this list)
  • Reduces network communication by up to 1000x (e.g. sharing 0.1% of parameters)
  • Models on different nodes stay highly correlated (correlation > 0.9 for reasonable sharing fractions like 0.1%)
  • This even seems to work with weights from a few steps in the past, i.e. asynchronously
  • Every node holds a full instance of the model; effectively it’s an ensemble of models
  • Training is more stable; works with higher learning rates
  • At the end, just do a full average, or use the models for Bayesian approximation at inference (see the second sketch below)
  • Doesn’t scale well beyond 16 nodes :(
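
A minimal sketch of what the per-step sparse sync could look like, assuming PyTorch with an already-initialized torch.distributed process group; the default fraction, the shared-seed index sampling, and the all_reduce averaging are assumptions about the mechanism, not exo's actual implementation:

```python
# Sketch: average a random ~0.1% of parameters across nodes each step.
# Assumes torch.distributed is initialized and all nodes run the same model.
import torch
import torch.distributed as dist


def sparse_sync(model: torch.nn.Module, step: int, fraction: float = 0.001):
    """Average a random `fraction` of each parameter tensor across all nodes."""
    world_size = dist.get_world_size()
    # Seeding with the step number gives every node the same random indices.
    gen = torch.Generator().manual_seed(step)
    for param in model.parameters():
        flat = param.data.view(-1)
        k = max(1, int(fraction * flat.numel()))
        idx = torch.randperm(flat.numel(), generator=gen)[:k].to(flat.device)
        # Only these k values are communicated, hence the large bandwidth saving.
        shared = flat[idx].clone()
        dist.all_reduce(shared, op=dist.ReduceOp.SUM)
        flat[idx] = shared / world_size
```

In this reading it would be called once per training step, e.g. right after optimizer.step(); only the k sampled values cross the network, which is where the up-to-1000x communication reduction would come from.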
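
And a sketch of the two end-of-training options from the last bullets, under the same assumptions; "Bayesian approximation" is read here as the usual deep-ensemble trick of averaging predictive distributions, which is my interpretation rather than something the post spells out:

```python
import torch
import torch.distributed as dist


def full_average(model: torch.nn.Module):
    """Option 1: collapse the ensemble into a single model by averaging all weights."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
        param.data /= world_size


@torch.no_grad()
def ensemble_predict(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Option 2: keep the per-node models and average their predictive distributions.

    Assumes every node receives the same input batch `x`.
    """
    probs = torch.softmax(model(x), dim=-1)
    dist.all_reduce(probs, op=dist.ReduceOp.SUM)
    return probs / dist.get_world_size()
```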