justheuristic t1_j02g9m0 wrote on December 13, 2022 at 4:37 PM

https://github.com/bigscience-workshop/petals - fine-tuning BLOOM-176B Folding@home style

https://github.com/learning-at-home/hivemind - a library for decentralized training with volunteers

https://github.com/epfml/disco - a library for collaborative training in JS (in a browser!)

https://github.com/chavinlo/distributed-diffusion - a project that tries to train diffusion this way

https://bittensor.com/ - a community that turns decentralized training into a cryptocurrency

There are also projects like Together that build networks from university computers for decentralized training.

genuinelySurprised OP t1_j02iznh wrote on December 13, 2022 at 4:54 PM

Thanks for this! I had come across Petals, but it looked to be focused on model usage rather than training, and I didn't see that it used the hivemind library.

justheuristic t1_j02jboc wrote on December 13, 2022 at 4:56 PM

They have a ~~training~~ fine-tuning example here

https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb

dojoteef t1_j02kku4 wrote on December 13, 2022 at 5:04 PM

This is great! Is it realistically possible to train LLMs à la BLOOM from scratch using these, or just do finetuning? I guess I'm wondering how the training speed scales with more compute nodes.

Even if we assume high-end GPUs/TPUs, a frequent bottleneck is throughput limited by network latency and bandwidth. How big of an issue is that? For example, I previously tried scaling to multi-node training on my university's cluster, and it turned out to be faster to do gradient accumulation on a single node than to do multi-node training, because the network switches were not purchased with high throughput in mind.
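
For anyone unfamiliar with gradient accumulation, here is a minimal sketch of the idea in plain Python. The toy 1-D model, the data, and the function names (`grad_mse`, `accumulated_grad`) are assumptions made up for illustration, not code from any of the linked projects. It only shows that averaging micro-batch gradients on one node gives the same gradient as one large batch, which is why it can stand in for multi-node data parallelism when the network is slow.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error w.r.t. w for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Process the batch in small chunks and combine their gradients.

    Each micro-batch gradient is weighted by its size so the result
    matches the gradient over the full batch.
    """
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        xb = xs[start:start + micro_batch_size]
        yb = ys[start:start + micro_batch_size]
        total += grad_mse(w, xb, yb) * len(xb)
    return total / n


# Hypothetical toy data: the true relationship is y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad_mse(w, xs, ys)                             # one large batch
acc = accumulated_grad(w, xs, ys, micro_batch_size=2)  # two micro-batches
print(full, acc)
```

Both values come out equal, so the only cost of accumulating is wall-clock time on the single node; nothing has to cross the network until the optimizer step.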

justheuristic t1_j02ohk6 wrote on December 13, 2022 at 5:28 PM

The first link (petals) is about finetuning.

Others (e.g. distributed diffusion) involve training from scratch -- but they deal with smaller models. Thing is, you need a lot of people to train a 100B model from scratch. Like, a few hundred online on average. There aren't many communities that can do that. With finetuning, by contrast, you can see it work more immediately.

I've heard a talk by Colin Raffel where he proposed an alternative view: instead of training from scratch, an open-source community could gradually improve the model over time. Like GitHub, but for large models. A contributor can fine-tune for a task, then create a "pull request", and then a maintainer runs a special procedure to merge the model without forgetting other tasks. That's how I remember it, anyways.
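
To make the "merge" step a bit more concrete, here is a rough sketch of the simplest baseline I can think of: interpolating the parameters of the base model and the contributed model. This is an assumption for illustration only, not the procedure from the talk (which I don't remember in detail), and the names `average_weights` and `alpha` are hypothetical. Plain averaging does not by itself guarantee that other tasks are not forgotten.

```python
def average_weights(base, contribution, alpha=0.5):
    """Return (1 - alpha) * base + alpha * contribution, parameter by parameter.

    Both models are dicts mapping a parameter name to a flat list of floats.
    """
    if base.keys() != contribution.keys():
        raise ValueError("models must have the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(contribution[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * c
            for b, c in zip(base[name], contribution[name])
        ]
    return merged


# Hypothetical two-parameter "models".
base_model = {"w": [1.0, 3.0]}
fine_tuned = {"w": [3.0, 5.0]}
merged = average_weights(base_model, fine_tuned)
print(merged)
```

A maintainer could pick `alpha` by checking the merged model on held-out data for both the old and the new task before accepting the "pull request".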