Submitted by Zondartul t3_zrbfcr in MachineLearning

I'm trying to figure out how to go about running something like GPT-J, FLAN-T5, etc, on my PC, without using cloud compute services (because privacy and other reasons). However, GPT-J-6B needs either ~14 GB of VRAM or 4x as much plain RAM.

Upgrading my PC to 48 GB of RAM is possible, and 16 GB and 24 GB graphics cards are available to the general public (though they cost as much as a car), but anything beyond that is in the realm of HPC, datacenter hardware and "GPU accelerators"... I.e. 128 GB GPUs exist out there somewhere, but the distributors don't even list a price, it's just "get a quote" and "contact us"... meaning it's super expensive and you need to be the CEO of a medium-sized company for them to even talk to you?

I'm trying to figure out if it's possible to run the larger models (e.g. 175B GPT-3 equivalents) on consumer hardware, perhaps by doing a very slow emulation using one or several PCs such that their collective RAM (or SSD swap space) matches the VRAM needed for those beasts.

So the question is "will it run super slowly" or "will it fail immediately due to completely incompatible software / being impossible to configure for anything other than real datacenter hardware"?

86

Comments


SpaceCockatoo t1_j12q9me wrote

I too would like to know if this is even theoretically possible

3

caninerosie t1_j12qgv6 wrote

there are a ton of consumer motherboards that support 128GB max RAM. a single 3090 also has 24GB GDDR6X of memory. If you need more than that you can NVLink another 3090 with the added benefit of speeding up training. That’s getting pretty pricey though.

other than that, there’s the M1 Ultra Mac Studio? won’t be as fast as training on a dedicated GPU but you’ll have the memory for it and faster throughput than normal DRAM

edit: for an extremely large model like GPT-3 you would need almost 400 GB of RAM. theoretically you could build multiple machines with NVLinked 3090/4090s, all networked together for distributed training
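As a back-of-envelope check on that figure (parameter memory only; activations and KV cache not counted):

```python
params = 175e9  # GPT-3-sized model
for name, bytes_per in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {params * bytes_per / 1e9:.0f} GB")  # fp32: 700, fp16: 350, int8: 175
```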

2

GoofAckYoorsElf t1_j12t661 wrote

A small car. I just bought a new 3090Ti with 24GB VRAM for as little as 1300€. I don't find that overly expensive.

22

LetterRip t1_j12uqxv wrote

DeepSpeed: you can map weights to the SSD, very slow but possible.
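Roughly, the ZeRO-Inference path looks like this. It's a sketch, not a drop-in script: the config keys follow the DeepSpeed/Hugging Face integration docs, the NVMe path is a placeholder, and it assumes you launch it with the `deepspeed` launcher.

```python
# launch with: deepspeed --num_gpus 1 this_script.py
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig  # transformers.integrations in newer versions

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/mnt/nvme_offload", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

dschf = HfDeepSpeedConfig(ds_config)  # must exist before from_pretrained so weights stream into ZeRO-3
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
ids = tok("Offloading weights to NVMe", return_tensors="pt").to(engine.device)
print(tok.decode(engine.module.generate(**ids, max_new_tokens=20)[0]))
```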

5

Final-Rush759 t1_j12zqjw wrote

Model parallelism. But you need more than one card. Buy an A6000, which has 48 GB of VRAM.

4

recidivistic_shitped t1_j136lsh wrote

GPT-J-6B can load in under 8 GB of VRAM with LLM.int8(). For this same reason, you can also run it in Colab nowadays.

175B.... Really bad idea to offload it to normal RAM. Inference is more limited by FLOPS than memory at that scale. OpenAI's API is cheap enough unless you're scaling to a substantial userbase.
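For the GPT-J-6B case, the LLM.int8() route is exposed through transformers + bitsandbytes; a minimal sketch (flag names as in late-2022 versions, and it also needs accelerate installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",   # needs accelerate
    load_in_8bit=True,   # bitsandbytes LLM.int8(); weights land at roughly 6-7 GB
)

ids = tok("Running a 6B model at home", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0], skip_special_tokens=True))
```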

28

arg_max t1_j136nbo wrote

CPU implementations are going to be very slow. I'd probably try renting an A100 VM, running some experiments, and measuring VRAM and RAM usage. But I'd be surprised if anything below a 24 GB 3090 Ti does the job. The issue is that going bigger than 24 GB means an A6000, which costs as much as four 3090s.
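If you do rent that VM, PyTorch's peak-memory counters give you the number directly; a sketch with a small stand-in model (swap in whatever you're actually sizing up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-neo-1.3B"  # stand-in checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    ids = tok("Measuring memory", return_tensors="pt").to("cuda")
    model.generate(**ids, max_new_tokens=20)

print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```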

18

arg_max t1_j136y5q wrote

Just to give you an idea about "optimal configuration" though, this is way beyond desktop PC levels:
You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512 GB memory.

https://alpa.ai/tutorials/opt_serving.html

9

CKtalon t1_j13dg5b wrote

Just forget about it.

Yes, it's possible to do it on CPU/RAM (Threadripper builds with >256 GB RAM + some assortment of 2x-4x GPUs), but the speed is so slow that it's pointless to work with. DeepSpeed or Hugging Face can spread it out between GPU and CPU, but even so, it will be stupid slow, probably MINUTES per token.

We are at least 5 years away from consumer hardware (say, 4 GPUs in a single machine) being able to run 175B+ models.

20B models are in the realm of consumer hardware (3090/4090) with INT8: slow, but still possible.

73

suflaj t1_j13pqhe wrote

While you can run large models (layer by layer, batch by batch, dimension by dimension, or even element by element), the problem is getting at the weights. No one said you need to transform your input into the output in one go; all that matters is that no single operation makes you go OOM.

Theoretically, there is no network in which a single linear combination would exceed modern memory sizes, but that doesn't mean such a strategy would be fast. At the base level, all you need is three registers (two to hold the operands of the multiply-add and one to keep the running sum) and enough memory to store the network weights.
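A toy illustration of that point (hypothetical per-layer checkpoint files, not any particular library's API): stream one layer in, apply it, free it, and peak memory stays at one layer plus the activations.

```python
import torch

@torch.no_grad()
def streamed_forward(x, layer_files, device="cpu"):
    """Run a deep stack of linear layers one at a time from disk."""
    for path in layer_files:
        state = torch.load(path, map_location=device)  # e.g. {"weight": ..., "bias": ...}
        x = torch.nn.functional.linear(x, state["weight"], state["bias"]).relu()
        del state  # only one layer is ever resident
    return x
```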

6

limapedro t1_j13qfxr wrote

The cheaper option would be to run on 2 RTX 3060s! At around 300 USD each, you could buy two for 600ish! There's also a 16 GB A770 from Intel! To run a very large model you could split the weights into so-called blocks; I was able to test this myself in a simple Keras implementation, but the conversion code is hard to write, although I think I've seen something similar from HuggingFace (see the sketch below).
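The HuggingFace feature that comes to mind is probably accelerate's big-model inference, which shards a checkpoint into per-layer blocks and places them on whatever devices fit; a sketch (model name and memory caps are illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-13b",
    device_map="auto",   # needs accelerate; splits layers across GPUs and CPU RAM
    torch_dtype="auto",
    max_memory={0: "11GiB", 1: "11GiB", "cpu": "48GiB"},  # e.g. two 12 GB RTX 3060s plus system RAM
)
```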

4

caedin8 t1_j147bx3 wrote

Is this just training? What about inference? How does ChatGPT serve millions of people so quickly if it needs such enterprise hardware per request?

22

DavesEmployee t1_j147fki wrote

I think it’s because they’re mostly used for games which almost never take advantage of the technology. You can tell from the designs that they were going to support it but the feature was taken out probably due to price or power concerns

2

head_robotics t1_j14e1fn wrote

Another question could be: what is the smallest language model that would still be useful?
If the largest models can't reasonably be run, what about smaller models that can?
Any chance of getting usable results at a reasonable speed?

3

wywywywy t1_j151o6u wrote

You could run a cut-down version of such models. I managed to run inference on OPT 2.7B, GPT-Neo 2.7B, etc. on my 8 GB GPU.

Now that I've upgraded to a used 3090, I can run OPT 6.7B, GPT-J 6B, etc.
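For scale: fp16 weights for a 2.7B model are about 5.4 GB, so they squeeze into 8 GB with room for activations; a minimal sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-2.7B", torch_dtype=torch.float16  # ~5.4 GB of weights
).to("cuda")

ids = tok("Small GPUs can still", return_tensors="pt").to("cuda")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0], skip_special_tokens=True))
```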

5

gBoostedMachinations t1_j155nsu wrote

It's kind of scary to think how soon the tech will enable randos to make LLMs. Sure, at first expertise will be needed, but as we've seen before, it's only a matter of time before the tools the average Joe needs to train a model are made available.

Jfc shit is getting weird

4

gBoostedMachinations t1_j155zas wrote

Training is what takes so much computation in almost all cases. Once the model itself is trained, only a tiny fraction of the compute is needed. Most trained ML models that ship today can generate predictions on a Raspberry Pi or a cell phone. LLMs still require more hardware for inference, but you'd be surprised how little they need compared to what's needed for training.

8

gBoostedMachinations t1_j16pzea wrote

If there's one thing I've learned about Reddit, it's that you can make the most uncontroversial comment of the year and still get downvoted. I mean, I got banned from r/coronavirus for pointing out that people who recover from covid probably have at least a little tiny bit of immunity to re-infection.

After covid, I’ve learned to completely ignore my comment scores when it comes to feedback on Reddit. The only way to know if one of my comments is valued is to read the replies.

7

CKtalon t1_j16qtog wrote

Training will need at minimum about 10x more resources than what I said (which was for inferencing). And that's just to fit the model and all its optimizer state with batch size 1.
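Rough arithmetic behind that, using the mixed-precision Adam accounting from the ZeRO paper (about 16 bytes of model state per parameter, activations not counted):

```python
params = 175e9
bytes_per_param = 2 + 2 + 4 + 4 + 4  # fp16 weights + fp16 grads + fp32 master copy + Adam momentum + variance
print(f"{params * bytes_per_param / 1e12:.1f} TB of model state")  # ~2.8 TB, vs ~0.35 TB for fp16 inference
```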

2

limapedro t1_j175nby wrote

No, I haven't! Although in theory it should be really good. You could still run deep learning using DirectML, but a native implementation should be really fast because of its XMX cores; they're similar to Tensor Cores.
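For the DirectML route, the torch-directml package is the usual entry point; a tiny sketch (adapter selection assumed to default to the Arc card):

```python
import torch
import torch_directml  # pip install torch-directml

dml = torch_directml.device()            # default DirectML adapter, e.g. an Arc A770
x = torch.randn(4, 1024, device=dml)
w = torch.randn(1024, 1024, device=dml)
print((x @ w).shape)                     # matmul runs on the Intel GPU, no CUDA required
```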

1

wywywywy t1_j18a6g2 wrote

I haven't tried it myself, but Intel has their own distribution of Python, and they also have their own PyTorch extension. They seem to be quite usable, judging by some of the GitHub comments.

1

AltruisticNight8314 t1_j1ohh7u wrote

What hardware would be required to i) train or ii) fine-tune weights (i.e. run a few epochs on my own data) for medium-sized transformers (500M-15B parameters)?

I do research on proteomics and I have a very specific problem where perhaps even fine-tuning the weights of a trained transformer (such as ESM-2) might be great.

Of course, there's always the poor man's alternative of building a supervised model on the embeddings returned by the encoder.
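For that last route, pulling encoder embeddings out of ESM-2 and training a classic supervised model on top is only a few lines (checkpoint name assumed from the HF hub; mean pooling is just one choice):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/esm2_t6_8M_UR50D"  # smallest ESM-2 checkpoint; larger ones follow the same pattern
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy protein sequence
with torch.no_grad():
    out = model(**tok(seq, return_tensors="pt"))

embedding = out.last_hidden_state.mean(dim=1)  # one fixed-size vector per sequence for a downstream classifier
print(embedding.shape)
```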

1