Kinexity t1_jbznlup wrote on March 12, 2023 at 11:24 PM

There is a repo for CPU interference written in pure C++: https://github.com/ggerganov/llama.cpp

30B model can run on just over 20GB of RAM and take 1.2sec per token on my i7 8750H. Though actual Windows support has yet to arrive and as of right now the output is garbage for some reason.

Edit: fp16 version works. It's 4 bit quantisation that returns garbage.

[deleted] t1_jc0rr6z wrote on March 13, 2023 at 4:56 AM

[deleted]

light24bulbs t1_jc0s4wr wrote on March 13, 2023 at 5:00 AM

That is slowwwww

Kinexity t1_jc1lwah wrote on March 13, 2023 at 11:37 AM

That is fast. We are literally talking about a high end laptop CPU from 5 years ago running a 30B LLM.

light24bulbs t1_jc2s2oc wrote on March 13, 2023 at 4:57 PM

Oh, definitely, it's an amazing optimization.

But less than a token a second is going to be too slow for a lot of real time applications like human chat.

Still, very cool though

Lajamerr_Mittesdine t1_jc5b99n wrote on March 14, 2023 at 3:30 AM

I imagine 1 token per 0.2 seconds would be fast enough. That'd be equivalent to a 60 WPM typer.

Someone should benchmark it on an AMD 7950X3D or Intel 13900-KS

light24bulbs t1_jc5e0zk wrote on March 14, 2023 at 3:54 AM

yeah theres definitely a threshold in there where its fast enough for human interaction. It's only an order of magnitude off, that's not too bad.

Amazing_Painter_7692 OP t1_jbzbcmi wrote on March 12, 2023 at 9:55 PM

Should work fine with the 7b param model: https://huggingface.co/decapoda-research/llama-7b-hf-int4

remghoost7 t1_jbzmfku wrote on March 12, 2023 at 11:15 PM

Super neat. Thanks for the reply. I'll try that.

Also, do you know if there's a local interface for it....?

I know it's not quite the scope of the post, but it'd be neat to interact with it through a simple python interface (or something like how Gradio is used for A1111's Stable Diffusion) rather than piping it all through Discord.

Amazing_Painter_7692 OP t1_jbzoq05 wrote on March 12, 2023 at 11:33 PM

There's an inference engine class if you want to build out your own API:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/engine.py#L56-L96

And there's a simple text inference script here:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/llama_inference.py

Or in the original repo:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

BUT someone has already made a webUI like the automatic1111 one!

https://github.com/oobabooga/text-generation-webui

Unfortunately it looked really complicated for me to set up with 4-bits weights and I tend to do everything over a Linux terminal. :P

toothpastespiders t1_jc01mr9 wrote on March 13, 2023 at 1:12 AM

> BUT someone has already made a webUI like the automatic1111 one!

There's a subreddit for it over at /r/Oobabooga too that deserves more attention. I've only had a little time to play around with it but it's a pretty sleek system from what I've seen.

> it looked really complicated for me to set up with 4-bits weights

I'd like to say that the warnings make it more intimidating than it really is. I think it was just copying and pasting four or five lines for me onto a terminal. Then again I also couldn't get it to work so I might be doing something wrong. I'm guessing it's just that my weirdo gpu wasn't really accounted for somewhere. I'm going to bang my head against it when I've got time just because it's frustrating having tons of vram to spare and not getting the most out of it.

remghoost7 t1_jc0bymy wrote on March 13, 2023 at 2:34 AM

~~I'm having an issue with the C++ compiler on the last step.~~

~~I've been trying to use python 3.10.9 though, so maybe that's my problem....? My venv is set up correctly as well.~~

~~Not specifically looking for help.~~

Apparently this person posted a guide on it in that subreddit. Will report back if I am successful.

edit - Success! But, using WSL instead of Windows (because that was a freaking headache). WSL worked the first time following the instructions on the GitHub page. Would highly recommend using WSL to install it instead of trying to force Windows to figure it out.

Pathos14489 t1_jc0dame wrote on March 13, 2023 at 2:45 AM

r/Oobabooga isn't accessible for me.

remghoost7 t1_jbzqf5m wrote on March 12, 2023 at 11:46 PM

Most excellent. Thank you so much! I will look into all of these.

Guess I know what I'm doing for the rest of the day. Time to make more coffee! haha.

You are my new favorite person this week.

Also, one final question, if you will. What's so unique about the 4-bit weights and why would you prefer to run it in that manner? Is it just VRAM optimization requirements....? I'm decently versed in Stable Diffusion, but LLMs are fairly new territory for me.

My question seemed to have been answered here, and it is a VRAM limitation. Also, that last link seems to support 4-bit models as well. ~~Doesn't seem too bad to set up.... Though I installed A1111 when it first came out, so I learned through the garbage of that. Lol.~~ I was wrong. Oh so wrong. haha.

Yet again, thank you for your time and have a wonderful rest of your day. <3

[deleted] t1_jbzqsrt wrote on March 12, 2023 at 11:49 PM

[removed]

The_frozen_one t1_jbzqvwc wrote on March 12, 2023 at 11:49 PM

I'm running it using https://github.com/ggerganov/llama.cpp. The 4-bit version of 13b runs ok without GPU acceleration.

remghoost7 t1_jbzro03 wrote on March 12, 2023 at 11:55 PM

Nice!

How's the generation speed...?

The_frozen_one t1_jbzv0gt wrote on March 13, 2023 at 12:21 AM

It takes about 7 seconds to generate a full response using 13B to a prompt with the default (128) number of predicted tokens.

luaks1337 t1_jc24dqa wrote on March 13, 2023 at 2:19 PM

They managed to run the 7B model on a Raspberry PI and a Samsung Galaxy S22 Ultra.

thoughtdrops t1_jcjjq48 wrote on March 17, 2023 at 8:30 AM

>Samsung Galaxy S22 Ultra.

can you link to the samsung galaxy post? that sounds great

th3nan0byt3 t1_jbzw23a wrote on March 13, 2023 at 12:29 AM

only if you turn your pc case upside down

[P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM

remghoost7 t1_jbz96lt wrote on March 12, 2023 at 9:40 PM