Kinexity t1_jbznlup wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
There is a repo for CPU inference written in pure C++: https://github.com/ggerganov/llama.cpp
A 30B model runs in just over 20 GB of RAM at about 1.2 s per token on my i7-8750H. Proper Windows support has yet to arrive, though, and as of right now the output is garbage for some reason.
Edit: the fp16 version works. It's the 4-bit quantisation that returns garbage.
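(For anyone curious what the 4-bit scheme involves: below is a minimal Python sketch of block-wise 4-bit quantization in the spirit of ggml's Q4_0 format. The real implementation packs two 4-bit values per byte and differs in details; the block size, clipping range, and round-trip check here are illustrative assumptions. A subtle bug in the scale/round/clip step is exactly the kind of thing that turns output into garbage. At roughly 4.5 bits per weight, 30B parameters also work out to about 17 GB of weights, consistent with the "just over 20 GB" figure once scales, the KV cache, and activations are added.)

```python
import numpy as np

BLOCK = 32  # weights per block; ggml's Q4_0 also uses 32 (assumption for this sketch)

def quantize_block(w):
    """Map a block of floats to 4-bit ints in [-7, 7] plus one float scale."""
    scale = float(np.abs(w).max()) / 7.0
    if scale == 0.0:
        return 0.0, np.zeros(len(w), dtype=np.int8)
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate floats: one multiply per weight."""
    return scale * q.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=BLOCK).astype(np.float32)
scale, q = quantize_block(w)
err = np.abs(w - dequantize_block(scale, q)).max()
print(f"scale={scale:.5f}, max abs error={err:.5f}")
```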
light24bulbs t1_jc0s4wr wrote
That is slowwwww
Kinexity t1_jc1lwah wrote
That is fast. We're literally talking about a high-end laptop CPU from five years ago running a 30B LLM.
light24bulbs t1_jc2s2oc wrote
Oh, definitely, it's an amazing optimization.
But less than one token per second is going to be too slow for a lot of real-time applications, like human chat.
Still very cool, though.
Lajamerr_Mittesdine t1_jc5b99n wrote
I imagine 1 token per 0.2 seconds would be fast enough. At roughly 0.75 words per token, that's about 225 WPM of output, comfortably faster than a 60 WPM typist.
Someone should benchmark it on an AMD Ryzen 9 7950X3D or an Intel Core i9-13900KS.
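(To put numbers on the WPM comparison, here is the arithmetic as a quick Python check; the ~0.75 words-per-token figure is a common rule of thumb for English text, not a measured value for this model:)

```python
# Rough conversion between generation speed and typing/reading speed.
WORDS_PER_TOKEN = 0.75  # common rule of thumb for English text (assumption)

def tokens_per_sec_to_wpm(tps):
    return tps * WORDS_PER_TOKEN * 60

for label, tps in [("i7-8750H, 30B (1.2 s/token)", 1 / 1.2),
                   ("proposed target (0.2 s/token)", 1 / 0.2)]:
    print(f"{label}: {tokens_per_sec_to_wpm(tps):.0f} WPM")
# ~37 WPM vs ~225 WPM: the 6x speedup comfortably clears a 60 WPM typist.
```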
light24bulbs t1_jc5e0zk wrote
Yeah, there's definitely a threshold in there where it's fast enough for human interaction. And it's less than an order of magnitude off, which isn't too bad.