Submitted by genuinelySurprised t3_zky7ly in MachineLearning

Given that well-funded groups like Google, Meta, and OpenAI may eventually build an insurmountable lead in services like image classification and NLP that seem to require huge numbers of parameters, I'd be surprised if there weren't an effort underway to make a BOINC-powered distributed system that millions of us mere peons could contribute to collaboratively. But aside from the now-defunct MLC@Home project, I haven't found anything yet. Am I missing something?

6

Comments


justheuristic t1_j02g9m0 wrote

https://github.com/bigscience-workshop/petals - fine-tuning BLOOM-176B Folding@home style

https://github.com/learning-at-home/hivemind - a library for decentralized training with volunteers (a minimal usage sketch is included below)

https://github.com/epfml/disco - a library for collaborative training in JS (in a browser!)

https://github.com/chavinlo/distributed-diffusion - a project that tries to train diffusion this way

https://bittensor.com/ - a community that turns decentralized training into a cryptocurrency

There are also projects like Together that build networks from university computers for decentralized training.
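To make the hivemind entry concrete, here is a minimal sketch adapted from memory of the hivemind quickstart; the exact `hivemind.Optimizer` argument names may differ between library versions, and the run ID, batch sizes, and toy model are arbitrary placeholders:

```python
import torch
import hivemind

# A toy model and a standard PyTorch optimizer.
model = torch.nn.Linear(784, 10)
base_opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Join (or start) a distributed hash table that peers use to find each other.
# A real volunteer would pass initial_peers=[...] with an existing peer's address.
dht = hivemind.DHT(start=True)

# Wrap the optimizer: peers train locally and average parameters once the swarm
# has collectively processed target_batch_size samples.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="demo_run",          # peers with the same run_id train together
    optimizer=base_opt,
    batch_size_per_step=32,     # samples this peer contributes per opt.step()
    target_batch_size=10_000,   # global batch size that triggers an averaging round
    use_local_updates=True,     # apply local steps between averaging rounds
    verbose=True,
)

# A training loop then calls loss.backward() and opt.step() as usual.
```

Roughly speaking, every volunteer runs the same script; whenever the swarm as a whole has processed `target_batch_size` samples, the peers average their parameters in the background and keep going.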

5

dojoteef t1_j02kku4 wrote

This is great! Is it realistically possible to train LLMs à la BLOOM from scratch using these, or only to do finetuning? I guess I'm wondering how training speed scales with more compute nodes.

Even if we assume high-end GPUs/TPUs, a frequent bottleneck is throughput limited by network latency. How big of an issue is that? For example, I had previously tried scaling to multi-node training on my university's cluster, and it turned out to be faster to do gradient accumulation on a single node than to train across nodes, because the network switches were not purchased with high throughput in mind.
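For reference, the single-node workaround mentioned above is plain gradient accumulation; a minimal PyTorch sketch, where the toy model, optimizer, and data stand in for a real pipeline:

```python
import torch
from torch import nn

# Toy stand-ins for a real model and data pipeline (hypothetical).
model = nn.Linear(32, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
data = [(torch.randn(16, 32), torch.randint(0, 2, (16,))) for _ in range(64)]

accumulation_steps = 8  # emulate an 8x larger effective batch on one node

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    loss = criterion(model(inputs), targets)
    # Scale the loss so the summed gradients match a single large-batch update.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one optimizer step per 8 micro-batches
        optimizer.zero_grad()  # no inter-node gradient sync needed
```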

1

justheuristic t1_j02ohk6 wrote

The first link (petals) is about finetuning.

Others (e.g. distributed-diffusion) involve training from scratch -- but they deal with smaller models. The thing is, you need a lot of people to train a 100B model from scratch: a few hundred online on average, and there aren't many communities that can muster that. With finetuning, by contrast, you can see it work much sooner.

I've heard a talk by Colin Raffel where he proposed an alternative: instead of training from scratch, an open-source community could gradually improve a model over time. Like GitHub, but for large models. A contributor fine-tunes the model for a task and opens a "pull request"; the maintainer then runs a special merging procedure that folds in the new capability without forgetting other tasks. That's how I remember it, anyway.
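The merging procedures Raffel's group has studied are more careful than this, but the simplest version of "fold a contributor's fine-tune back into the main model" is plain parameter interpolation. A rough sketch follows; the models and mixing ratio are placeholders, and real procedures (e.g. Fisher-weighted averaging) weight each parameter individually rather than using one scalar:

```python
import torch
from torch import nn


def merge_state_dicts(base_sd, contrib_sd, alpha=0.5):
    """Interpolate between a base model and a contributor's fine-tuned copy.

    alpha=0 keeps the base weights; alpha=1 takes the contribution outright.
    """
    return {
        name: (1 - alpha) * base_sd[name] + alpha * contrib_sd[name]
        for name in base_sd
    }


# Toy stand-ins for the maintainer's model and a contributor's fine-tune.
base = nn.Linear(8, 8)
contribution = nn.Linear(8, 8)

merged = nn.Linear(8, 8)
merged.load_state_dict(
    merge_state_dicts(base.state_dict(), contribution.state_dict(), alpha=0.3)
)
```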

3

dojoteef t1_j0275on wrote

While there is a field of research investigating federated learning, which might one day enable an ML@Home-type project, the current algorithms require too much memory, computation, and bandwidth to train very large models like GPT-3.

I'm hopeful that an improved approach will be devised that mitigates these issues (in fact, I have some ideas I'm considering for my next research project), but for now they render a real ML@Home-type project infeasible.
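For a sense of where the bandwidth cost comes from, here is a toy federated-averaging (FedAvg-style) round: every participant ships its full set of updated weights back for averaging each round, which is what becomes prohibitive at GPT-3 scale. The model and data below are placeholders:

```python
import copy
import torch
from torch import nn


def local_update(global_model, data, lr=0.01, epochs=1):
    """Each volunteer trains a private copy of the global model on local data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()  # the FULL model is sent back -- the bandwidth cost


def federated_average(state_dicts):
    """The coordinator averages all returned weights element-wise."""
    avg = copy.deepcopy(state_dicts[0])
    for name in avg:
        avg[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
    return avg


global_model = nn.Linear(4, 1)
clients = [[(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)] for _ in range(3)]

for round_ in range(2):
    updates = [local_update(global_model, data) for data in clients]
    global_model.load_state_dict(federated_average(updates))
```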

1

genuinelySurprised OP t1_j02iugf wrote

I figured there was some technical catch related to scaling. It's a pity there's no way (yet) to put together a truly open competitor to GPT-3 and whatever comes after it.

1

makeasnek t1_j067hnh wrote

Unfortunately, AI as it stands strongly favors those with large resources. ML training doesn't distribute well because it benefits greatly from low latency; it isn't a map-reduce problem or any other kind of problem that splits cleanly into discrete parts for independent computation. MLC@Home, if I understand correctly, trained a small, self-contained model per machine/work unit rather than training a single model in a distributed fashion.
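A back-of-envelope calculation illustrates the point; the parameter count, gradient precision, and uplink speed below are assumptions, not measurements:

```python
# Naively syncing full gradients for a GPT-3-sized model over a residential
# connection, once per training step (all numbers are assumptions).

params = 175e9               # GPT-3-scale parameter count
bytes_per_grad = 2           # fp16 gradients
uplink_bits_per_s = 100e6    # optimistic 100 Mbit/s home uplink

grad_bytes = params * bytes_per_grad                  # ~350 GB per step
seconds_per_sync = grad_bytes * 8 / uplink_bits_per_s

print(f"{grad_bytes / 1e9:.0f} GB of gradients per step")
print(f"~{seconds_per_sync / 3600:.1f} hours just to upload them once")
# => roughly 8 hours per step, before latency, stragglers, or the download leg.
```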

1

NJank t1_j2e4r4q wrote

Now that there's an open-source parallel to ChatGPT that lacks the training resources (see https://techcrunch.com/2022/12/30/theres-now-an-open-source-alternative-to-chatgpt-but-good-luck-running-it/), I'm wondering whether an @home or BOINC model for automated training will develop, and whether it will be at all effective. Seeing as a lot of ChatGPT's training required curated human feedback, it may work better as an active (gamified?) distributed model rather than a passive one.
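Purely as a sketch of what an "active" contribution might look like, here is a hypothetical record a gamified volunteer client could submit for RLHF-style preference feedback; all field names and values are made up:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PreferenceLabel:
    """One pairwise comparison submitted by a volunteer labeler (illustrative only)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str      # "a" or "b", as chosen by the volunteer
    annotator_id: str
    submitted_at: str


label = PreferenceLabel(
    prompt="Explain BOINC in one sentence.",
    response_a="BOINC is a platform for volunteer distributed computing.",
    response_b="BOINC is a cryptocurrency.",
    preferred="a",
    annotator_id="volunteer-042",
    submitted_at=datetime.now(timezone.utc).isoformat(),
)
```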

1