Submitted by head_robotics t3_1172jrs in MachineLearning
I've been looking into open source large language models to run locally on my machine.
Seems GPT-J and GPT-Neo are out of reach for me because of RAM / VRAM requirements.
What models would be doable with this hardware?
CPU: AMD Ryzen 7 3700X 8-Core, 3600 MHz
RAM: 32 GB
GPUs:
- NVIDIA GeForce RTX 2070 8GB VRAM
- NVIDIA Tesla M40 24GB VRAM
catch23 t1_j9b9upb wrote
Could try something like this: https://github.com/Ying1123/FlexGen
This was only released a few hours ago, so you wouldn't have come across it before. It basically uses various offloading strategies when your machine has lots of regular CPU memory. The paper authors were able to fit a 175B-parameter model on a lowly 16GB T4 GPU (on a machine with 200GB of regular memory).
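The core idea is simple to sketch: keep the full set of weights in plentiful CPU RAM and stream one layer at a time through a small GPU-resident buffer. Here's a toy illustration of that strategy (not FlexGen's actual API — the layer sizes and the list standing in for VRAM are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# All layer weights live in (cheap, plentiful) CPU RAM.
cpu_weights = [rng.standard_normal((64, 64)) for _ in range(8)]

GPU_BUDGET = 1  # pretend VRAM can only hold one layer at a time


def run_offloaded(x, layers):
    """Stream layers through a tiny 'GPU' buffer, one at a time."""
    gpu_buffer = []  # stand-in for VRAM
    for w in layers:
        gpu_buffer.append(w)             # "upload" this layer to the GPU
        assert len(gpu_buffer) <= GPU_BUDGET
        x = np.tanh(x @ gpu_buffer[0])   # run the layer
        gpu_buffer.pop()                 # free VRAM before the next layer
    return x


x = rng.standard_normal(64)
out = run_offloaded(x, cpu_weights)
print(out.shape)  # (64,)
```

The trade-off is that every layer's weights cross the CPU-GPU bus on every forward pass, so throughput drops sharply — that's why this approach targets batch/offline use rather than interactive chat.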