Submitted by Business-Lead2679 t3_12618zu in MachineLearning
ustainbolt t1_je7plqi wrote
For a 65B model you are probably going to have to parallelise the model parameters. See this link. As for training, it would be best to use a VM (any provider will work; Lambda and vast.ai are cheap). I would recommend a 4x (or 8x) A100 machine. I'm sure you can find more information about all of this. A minimal sketch of what the sharding can look like is below.
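A minimal sketch of sharding the parameters across GPUs, assuming the Hugging Face transformers + accelerate stack (the checkpoint name is a placeholder, not a recommendation):

```python
# Sketch: shard a 65B causal LM across all visible GPUs.
# Assumes `transformers` and `accelerate` are installed;
# the checkpoint name below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-65b"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp16 halves memory vs fp32
    device_map="auto",          # accelerate splits layers across GPUs
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

`device_map="auto"` only handles inference-style layer placement; for actual fine-tuning you would reach for FSDP, DeepSpeed, or tensor parallelism as discussed further down.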
wrossmorrow t1_je7vy2p wrote
+1 for Lambda Labs
ustainbolt t1_je7xtcw wrote
I love Lambda. More reliable than vast.ai, and WAY cheaper than AWS/GCP/Azure.
Nhabls t1_je9598b wrote
Every time I've logged on to Lambda Labs in the past year, all their instances have been full. Not that available, in my experience.
badabummbadabing t1_je9cdf7 wrote
They just closed their Series B funding, so they should scale up their capacity soon.
itsyourboiirow t1_jecqc1d wrote
This is the only downside I've found. Sometimes it's too darn hard to find an instance.
learn-deeply t1_je9eovt wrote
Tensor parallelism (aka model parallelism) with checkpointing works better than FSDP in my experience (though they can be used in conjunction). FSDP is easier to work with, though; see the sketch below.
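For reference, a minimal FSDP sketch, assuming PyTorch >= 2.0 launched with torchrun on a single node; the model constructor and decoder-layer class are placeholders:

```python
# Sketch: wrap a transformer in FSDP, which shards parameters, gradients,
# and optimizer state across ranks. Launch with:
#   torchrun --nproc_per_node=<num_gpus> train.py
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer  # assumed layer class

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())  # rank == local rank on a single node

model = build_65b_model()  # placeholder: your model constructor

wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},  # shard at decoder-layer granularity
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```

The auto-wrap policy is what makes FSDP shard each decoder layer separately instead of treating the whole model as one flat unit, which is most of the memory win.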