Kinexity t1_jbznlup wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
There is a repo for CPU inference written in pure C++: https://github.com/ggerganov/llama.cpp
A 30B model runs in just over 20 GB of RAM at about 1.2 s per token on my i7-8750H. Proper Windows support has yet to arrive, though, and as of right now the output is garbage for some reason.
Edit: the fp16 version works. It's the 4-bit quantisation that returns garbage.
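(For anyone curious what the 4-bit scheme involves: below is a minimal Python sketch of block-wise 4-bit quantization in the spirit of ggml's Q4_0 format. The real implementation packs two 4-bit values per byte and differs in details; the block size, clipping range, and round-trip check here are illustrative assumptions. A subtle bug in the scale/round/clip step is exactly the kind of thing that turns output into garbage. At roughly 4.5 bits per weight, 30B parameters also work out to about 17 GB of weights, consistent with the "just over 20 GB" figure once scales, the KV cache, and activations are added.)

```python
import numpy as np

BLOCK = 32  # weights per block; ggml's Q4_0 also uses 32 (assumption for this sketch)

def quantize_block(w):
    """Map a block of floats to 4-bit ints in [-7, 7] plus one float scale."""
    scale = float(np.abs(w).max()) / 7.0
    if scale == 0.0:
        return 0.0, np.zeros(len(w), dtype=np.int8)
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate floats: one multiply per weight."""
    return scale * q.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=BLOCK).astype(np.float32)
scale, q = quantize_block(w)
err = np.abs(w - dequantize_block(scale, q)).max()
print(f"scale={scale:.5f}, max abs error={err:.5f}")
```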
light24bulbs t1_jc0s4wr wrote
That is slowwwww
Kinexity t1_jc1lwah wrote
That is fast. We're literally talking about a high-end laptop CPU from five years ago running a 30B LLM.
light24bulbs t1_jc2s2oc wrote
Oh, definitely, it's an amazing optimization.
But less than one token per second is going to be too slow for a lot of real-time applications, like human chat.
Still very cool, though.
Lajamerr_Mittesdine t1_jc5b99n wrote
I imagine 1 token per 0.2 seconds would be fast enough. At roughly 0.75 words per token, that's about 225 WPM of output, comfortably faster than a 60 WPM typist.
Someone should benchmark it on an AMD Ryzen 9 7950X3D or an Intel Core i9-13900KS.
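(To put numbers on the WPM comparison, here is the arithmetic as a quick Python check; the ~0.75 words-per-token figure is a common rule of thumb for English text, not a measured value for this model:)

```python
# Rough conversion between generation speed and typing/reading speed.
WORDS_PER_TOKEN = 0.75  # common rule of thumb for English text (assumption)

def tokens_per_sec_to_wpm(tps):
    return tps * WORDS_PER_TOKEN * 60

for label, tps in [("i7-8750H, 30B (1.2 s/token)", 1 / 1.2),
                   ("proposed target (0.2 s/token)", 1 / 0.2)]:
    print(f"{label}: {tokens_per_sec_to_wpm(tps):.0f} WPM")
# ~37 WPM vs ~225 WPM: the 6x speedup comfortably clears a 60 WPM typist.
```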
light24bulbs t1_jc5e0zk wrote
Yeah, there's definitely a threshold in there where it's fast enough for human interaction. And it's less than an order of magnitude off, which isn't too bad.