year:
paper:
website:
code:
connections: exolabs, distributed training, ensemble
https://blog.exolabs.net/day-12/
- Share / sync / average only a small, randomly sampled percentage of parameters each step (see the sketch after this list)
- Reduces network communication by up to 1000x (e.g. sharing 0.1% of parameters)
- Models stay highly correlated (>0.9 for reasonable fractions like 0.1%)
- This even seems to work with weights from a few steps in the past, i.e. asynchronously
- Every node holds a full instance of the model; it's effectively an ensemble of models
- Training is more stable; works with higher learning rates
- At the end, just do a full average, or use the models as a Bayesian approximation at inference (see the second sketch below)
- Doesn’t scale well beyond 16 nodes :(
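A rough sketch of the per-step sparse sync as I understand it from the post: each replica picks the same random subset of parameter entries and averages only those across nodes. The single-process list of replicas, the function names, and the seed-by-step trick for agreeing on the subset are my assumptions, not from the post; a real setup would gossip/all-reduce over the network instead.

```python
# Minimal sketch: average a random ~0.1% of parameters across model replicas each step.
# Assumes all replicas live in one process; names and the seeding scheme are illustrative.
import torch

@torch.no_grad()
def sparse_sync(models, fraction=0.001, step=0):
    """Average a random `fraction` of parameter entries across all replicas."""
    # Seed with the shared step number so every node draws the SAME random subset
    # without having to communicate the indices (assumption, not from the post).
    gen = torch.Generator().manual_seed(step)
    for params in zip(*(m.parameters() for m in models)):
        flat = [p.data.view(-1) for p in params]   # views share storage with the params
        n = flat[0].numel()
        k = max(1, int(n * fraction))
        idx = torch.randperm(n, generator=gen)[:k]  # shared random subset of entries
        mean = torch.stack([f[idx] for f in flat]).mean(dim=0)
        for f in flat:                              # overwrite the subset with its average
            f[idx] = mean
```

Usage would be one call after each local optimizer step, e.g. `sparse_sync(models, fraction=0.001, step=t)`.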
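And a sketch of the two end-of-training options mentioned above, under the same single-process assumptions: either collapse the replicas with a full parameter average, or keep them and average their predictions as a cheap Bayesian/ensemble approximation at inference. All names are illustrative.

```python
import torch

@torch.no_grad()
def full_average(models):
    """Collapse the replicas into one model by averaging every parameter."""
    for params in zip(*(m.parameters() for m in models)):
        params[0].data.copy_(torch.stack([p.data for p in params]).mean(dim=0))
    return models[0]

@torch.no_grad()
def ensemble_predict(models, x):
    """Average the replicas' predictions at inference (here: softmax over logits)."""
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)
```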