Submitted by Zondartul t3_zrbfcr in MachineLearning

I'm trying to figure out how to go about running something like GPT-J, FLAN-T5, etc, on my PC, without using cloud compute services (because privacy and other reasons). However, GPT-J-6B needs either ~14 GB of VRAM or 4x as much plain RAM.

Upgrading my PC to 48 GB of RAM is possible, and 16 GB and 24 GB graphics cards are available to the general public (though they cost as much as a car), but anything beyond that is in the realm of HPC, datacenter hardware and "GPU accelerators"... I.e. 128 GB GPUs exist out there somewhere, but the distributors don't even list a price, it's just "get a quote" and "contact us"... meaning it's super expensive and you need to be the CEO of a medium-sized company for them to even talk to you?

I'm trying to figure out if it's possible to run the larger models (e.g. 175B GPT-3 equivalents) on consumer hardware, perhaps by doing a very slow emulation using one or several PCs such that their collective RAM (or SSD swap space) matches the VRAM needed for those beasts.

So the question is "will it run super slowly" or "will it fail immediately due to completely incompatible software / being impossible to configure for anything other than real datacenter hardware"?

86

Comments


SpaceCockatoo t1_j12q9me wrote

I too would like to know if this is even theoretically possible

3

caninerosie t1_j12qgv6 wrote

there are a ton of consumer motherboards that support 128GB max RAM. a single 3090 also has 24GB GDDR6X of memory. If you need more than that you can NVLink another 3090 with the added benefit of speeding up training. That’s getting pretty pricey though.

other than that, there’s the M1 Ultra Mac Studio? won’t be as fast as training on a dedicated GPU but you’ll have the memory for it and faster throughput than normal DRAM

edit: for an extremely large model like GPT-3 you would need almost 400 GB of RAM. theoretically you could build multiple machines with NVLinked 3090/4090s, all networked together for distributed training
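As a back-of-envelope check on that figure (parameter memory only; activations and KV cache not counted):

```python
params = 175e9  # GPT-3-sized model
for name, bytes_per in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {params * bytes_per / 1e9:.0f} GB")  # fp32: 700, fp16: 350, int8: 175
```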

2

GoofAckYoorsElf t1_j12t661 wrote

A small car. I just bought a new 3090Ti with 24GB VRAM for as little as 1300€. I don't find that overly expensive.

22

LetterRip t1_j12uqxv wrote

DeepSpeed: you can map weights to the SSD, very slow but possible.
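Roughly, the ZeRO-Inference path looks like this. It's a sketch, not a drop-in script: the config keys follow the DeepSpeed/Hugging Face integration docs, the NVMe path is a placeholder, and it assumes you launch it with the `deepspeed` launcher.

```python
# launch with: deepspeed --num_gpus 1 this_script.py
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig  # transformers.integrations in newer versions

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/mnt/nvme_offload", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

dschf = HfDeepSpeedConfig(ds_config)  # must exist before from_pretrained so weights stream into ZeRO-3
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
ids = tok("Offloading weights to NVMe", return_tensors="pt").to(engine.device)
print(tok.decode(engine.module.generate(**ids, max_new_tokens=20)[0]))
```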

5

Final-Rush759 t1_j12zqjw wrote

Model parallelism. But you need more than one card. Buy an A6000, which has 48 GB of VRAM.

4

recidivistic_shitped t1_j136lsh wrote

GPT-J-6B can load in under 8 GB of VRAM with LLM.int8(). For this same reason, you can also run it in Colab nowadays.

175B.... Really bad idea to offload it to normal RAM. Inference is more limited by FLOPS than memory at that scale. OpenAI's API is cheap enough unless you're scaling to a substantial userbase.
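For the GPT-J-6B case, the LLM.int8() route is exposed through transformers + bitsandbytes; a minimal sketch (flag names as in late-2022 versions, and it also needs accelerate installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",   # needs accelerate
    load_in_8bit=True,   # bitsandbytes LLM.int8(); weights land at roughly 6-7 GB
)

ids = tok("Running a 6B model at home", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0], skip_special_tokens=True))
```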

28

arg_max t1_j136nbo wrote

CPU implementations are going to be very slow. I'd probably try renting an A100 VM, running some experiments, and measuring VRAM and RAM usage. But I'd be surprised if anything below a 24 GB 3090 Ti does the job. The issue is that going bigger than 24 GB means an A6000, which costs as much as four 3090s.
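If you do rent that VM, PyTorch's peak-memory counters give you the number directly; a sketch with a small stand-in model (swap in whatever you're actually sizing up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-neo-1.3B"  # stand-in checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    ids = tok("Measuring memory", return_tensors="pt").to("cuda")
    model.generate(**ids, max_new_tokens=20)

print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```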

18

arg_max t1_j136y5q wrote

Just to give you an idea about "optimal configuration" though, this is way beyond desktop PC levels:
You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512 GB memory.

https://alpa.ai/tutorials/opt_serving.html

9

CKtalon t1_j13dg5b wrote

Just forget about it.

Yes, it's possible to do it on CPU/RAM (Threadripper builds with >256 GB RAM + some assortment of 2x-4x GPUs), but the speed is so slow that it's pointless to work with. DeepSpeed or Hugging Face can spread it out between GPU and CPU, but even so, it will be stupid slow, probably MINUTES per token.

We are at least 5 years away from consumer hardware (say, 4 GPUs in a single machine) being able to run 175B+ models.

20B models are in the realm of consumer hardware (3090/4090) with INT8: slow, but still possible.

73

suflaj t1_j13pqhe wrote

While you can run large models (layer by layer, batch by batch, dimension by dimension, or even element by element), the problem is getting at the weights. No one said you need to transform your input into the output in one go; all that matters is that no single operation makes you go OOM.

Theoretically, there is no network in which a single linear combination would exceed modern memory sizes, but that doesn't mean such a strategy would be fast. At the base level, all you need is three registers (two to hold the operands of the multiply-add and one to keep the running sum) and enough memory to store the network weights.
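A toy illustration of that point (hypothetical per-layer checkpoint files, not any particular library's API): stream one layer in, apply it, free it, and peak memory stays at one layer plus the activations.

```python
import torch

@torch.no_grad()
def streamed_forward(x, layer_files, device="cpu"):
    """Run a deep stack of linear layers one at a time from disk."""
    for path in layer_files:
        state = torch.load(path, map_location=device)  # e.g. {"weight": ..., "bias": ...}
        x = torch.nn.functional.linear(x, state["weight"], state["bias"]).relu()
        del state  # only one layer is ever resident
    return x
```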

6

limapedro t1_j13qfxr wrote

The cheaper option would be to run on 2 RTX 3060s! At around 300 USD each, you could buy two for 600ish! There's also a 16 GB A770 from Intel! To run a very large model you could split the weights into so-called blocks; I was able to test this myself in a simple Keras implementation, but the conversion code is hard to write, although I think I've seen something similar from HuggingFace (see the sketch below).
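The HuggingFace feature that comes to mind is probably accelerate's big-model inference, which shards a checkpoint into per-layer blocks and places them on whatever devices fit; a sketch (model name and memory caps are illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-13b",
    device_map="auto",   # needs accelerate; splits layers across GPUs and CPU RAM
    torch_dtype="auto",
    max_memory={0: "11GiB", 1: "11GiB", "cpu": "48GiB"},  # e.g. two 12 GB RTX 3060s plus system RAM
)
```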

4

caedin8 t1_j147bx3 wrote

Is this just training? What about inference? How does ChatGPT serve millions of people so quickly if it needs such enterprise hardware per request?

22

DavesEmployee t1_j147fki wrote

I think it’s because they’re mostly used for games which almost never take advantage of the technology. You can tell from the designs that they were going to support it but the feature was taken out probably due to price or power concerns

2

head_robotics t1_j14e1fn wrote

Another question could be: what is the smallest language model that would still be useful?
If the largest models can't reasonably be run, what about smaller models that can?
Any chance of getting usable results at a reasonable speed?

3

wywywywy t1_j151o6u wrote

You could run a cut-down version of such models. I managed to run inference on OPT 2.7B, GPT-Neo 2.7B, etc. on my 8 GB GPU.

Now that I've upgraded to a used 3090, I can run OPT 6.7B, GPT-J 6B, etc.
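For scale: fp16 weights for a 2.7B model are about 5.4 GB, so they squeeze into 8 GB with room for activations; a minimal sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-2.7B", torch_dtype=torch.float16  # ~5.4 GB of weights
).to("cuda")

ids = tok("Small GPUs can still", return_tensors="pt").to("cuda")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0], skip_special_tokens=True))
```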

5

gBoostedMachinations t1_j155nsu wrote

It's kind of scary to think how soon the tech will enable randos to make LLMs. Sure, at first expertise will be needed, but as we've seen before, it's only a matter of time before the tools the average Joe needs to train a model are made available.

Jfc shit is getting weird

4

gBoostedMachinations t1_j155zas wrote

Training is what takes so much computation in almost all cases. Once the model itself is trained, only a tiny fraction of the compute is needed. Most trained ML models that ship today can generate predictions on a Raspberry Pi or a cell phone. LLMs still require more hardware for inference, but you'd be surprised how little they need compared to what's needed for training.

8

gBoostedMachinations t1_j16pzea wrote

If there's one thing I've learned about Reddit, it's that you can make the most uncontroversial comment of the year and still get downvoted. I mean, I got banned from r/coronavirus for pointing out that people who recover from covid probably have at least a little tiny bit of immunity to re-infection.

After covid, I’ve learned to completely ignore my comment scores when it comes to feedback on Reddit. The only way to know if one of my comments is valued is to read the replies.

7

CKtalon t1_j16qtog wrote

Training will need at minimum about 10x more resources than what I said (which was for inferencing). And that's just to fit the model and all its optimizer state with batch size 1.
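Rough arithmetic behind that, using the mixed-precision Adam accounting from the ZeRO paper (about 16 bytes of model state per parameter, activations not counted):

```python
params = 175e9
bytes_per_param = 2 + 2 + 4 + 4 + 4  # fp16 weights + fp16 grads + fp32 master copy + Adam momentum + variance
print(f"{params * bytes_per_param / 1e12:.1f} TB of model state")  # ~2.8 TB, vs ~0.35 TB for fp16 inference
```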

2

limapedro t1_j175nby wrote

No, I haven't! Although in theory it should be really good. You could still run deep learning using DirectML, but a native implementation should be really fast because of its XMX cores; they're similar to Tensor Cores.
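For the DirectML route, the torch-directml package is the usual entry point; a tiny sketch (adapter selection assumed to default to the Arc card):

```python
import torch
import torch_directml  # pip install torch-directml

dml = torch_directml.device()            # default DirectML adapter, e.g. an Arc A770
x = torch.randn(4, 1024, device=dml)
w = torch.randn(1024, 1024, device=dml)
print((x @ w).shape)                     # matmul runs on the Intel GPU, no CUDA required
```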

1

wywywywy t1_j18a6g2 wrote

I haven't tried it myself, but Intel has their own distribution of Python, and they also have their own PyTorch extension. They seem to be quite usable, judging by some of the GitHub comments.

1

AltruisticNight8314 t1_j1ohh7u wrote

What hardware would be required to i) train or ii) fine-tune weights (i.e. run a few epochs on my own data) for medium-sized transformers (500M-15B parameters)?

I do research on proteomics and I have a very specific problem where perhaps even fine-tuning the weights of a trained transformer (such as ESM-2) might be great.

Of course, there's always the poor man's alternative of building a supervised model on the embeddings returned by the encoder.
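For that last route, pulling encoder embeddings out of ESM-2 and training a classic supervised model on top is only a few lines (checkpoint name assumed from the HF hub; mean pooling is just one choice):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/esm2_t6_8M_UR50D"  # smallest ESM-2 checkpoint; larger ones follow the same pattern
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy protein sequence
with torch.no_grad():
    out = model(**tok(seq, return_tensors="pt"))

embedding = out.last_hidden_state.mean(dim=1)  # one fixed-size vector per sequence for a downstream classifier
print(embedding.shape)
```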

1