Submitted by Qwillbehr t3_11xpohv in MachineLearning

It's understandable that companies like OpenAI want to charge for access to their projects, given the ongoing cost of training and then running them, and I assume most other projects that need that much compute and have to run in the cloud will do the same.

I was wondering whether there are any projects to run/train some kind of language model / AI chatbot on consumer hardware (like a single GPU). I heard that since Facebook's LLaMA leaked, people have managed to get it running even on hardware like a Raspberry Pi, albeit slowly. I'm not asking for links to the leaked weights, just whether there are any projects aiming to run locally on consumer hardware.

48

Comments


not_particulary t1_jd51f0h wrote

There's a lot coming up. I'm looking into it right now; here's a tutorial I found:

https://medium.com/@martin-thissen/llama-alpaca-chatgpt-on-your-local-computer-tutorial-17adda704c23


Here's something unique, where a smaller LLM outperforms GPT-3.5 on specific tasks. It's multimodal and based on T5, which is much more runnable on consumer hardware.

https://arxiv.org/abs/2302.00923
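
If you want a feel for how runnable T5-class models are on consumer hardware, here's a minimal sketch using Hugging Face transformers; flan-t5-base is just a stand-in here, not the paper's multimodal model.

```python
# Minimal sketch: a small T5-class model on CPU via Hugging Face transformers.
# google/flan-t5-base is a stand-in, not the model from the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")  # ~1GB of weights, fits in ordinary RAM

inputs = tokenizer("Explain why the sky is blue in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```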

28

DB6135 t1_jd68vdi wrote

What are the recommended parameters? I tried the 7B model with default settings but it kept generating repeated garbage output.

9

Qwillbehr OP t1_jd6baxv wrote

I played with it for a few minutes and noticed that the 16B alpaca model gave significantly better responses. From what I can tell, though, the issue seems to be in how dalai prompts alpaca.cpp (it just tells it to complete the sentence with all possible outputs rather than just one of the possible answers). The 16B model fixed most of it for me.
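
For anyone else hitting the repetition issue: it can often be tamed with sampling settings and an instruction-style prompt rather than a bigger model. A rough sketch using the llama-cpp-python bindings (the model path and parameter values are placeholders, not dalai's defaults):

```python
# Sketch: nudging sampling parameters to reduce repetitive output.
# Values here are illustrative starting points, not dalai's defaults.
from llama_cpp import Llama

llm = Llama(model_path="./models/ggml-alpaca-7b-q4.bin")  # placeholder path

out = llm(
    "Below is an instruction. Write a response that completes the request.\n\n"
    "### Instruction:\nName one planet in the solar system.\n\n### Response:\n",
    max_tokens=64,
    temperature=0.7,      # lower = less random
    top_p=0.9,
    repeat_penalty=1.2,   # penalize recently generated tokens
    stop=["### Instruction:"],
)
print(out["choices"][0]["text"])
```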

4

blueSGL t1_jd53pbu wrote

/r/LocalLLaMA

14

QTQRQD t1_jd491r2 wrote

There are a number of efforts like llama.cpp/alpaca.cpp or OpenAssistant, but the problem is that fundamentally these things require a lot of compute, which you really can't get around.

11

KerfuffleV2 t1_jd52brx wrote

> There are a number of efforts like llama.cpp/alpaca.cpp or OpenAssistant, but the problem is that fundamentally these things require a lot of compute, which you really can't get around.

It's honestly less than you'd expect. I have a Ryzen 5 1600, which I bought about 5 years ago for $200 (it's $79 now). I can run LLaMA 7B on the CPU and it generates about 3 tokens/sec. That's close to what ChatGPT can do when it's fairly busy. Of course, LLaMA 7B is no ChatGPT, but still. This system has 32GB RAM (also pretty cheap) and I can run LLaMA 30B as well, although it takes a second or so per token.

So you can't really chat in real time, but you can set it to generate something and come back later.

The 3-bit or 2-bit quantized versions of 65B or larger models would actually fit in memory. Of course, they would be even slower to run, but honestly it's amazing it's possible to run them at all on 5-year-old hardware that wasn't cutting edge even back then.

19

VestPresto t1_jd6iiw1 wrote

Sounds faster and less laborious than googling and scanning a few articles

5

Gatensio t1_jd6rixk wrote

Don't 7B parameters require something like 12-26GB of RAM, depending on precision? How do you run the 30B?

1

KerfuffleV2 t1_jd7rjvf wrote

There are quantized versions at 8-bit and 4-bit. The 4-bit quantized 30B version is 18GB, so it will run on a machine with 32GB of RAM.

The bigger the model, the more tolerant it seems to be of quantization, so even 1-bit quantized models are in the realm of possibility (it would probably have to be something like a 120B+ model to really work).
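
As a back-of-envelope check, the file size is roughly parameters times bits per weight divided by 8; real quantized files run a bit larger because of per-group scale factors, and inference needs extra room for the KV cache and activations:

```python
# Rough estimate: model size in GB ≈ parameters_in_billions * bits_per_weight / 8.
# Real quantized files are somewhat larger (per-group scale factors), and
# inference also needs headroom for the KV cache and activations.
def approx_size_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # billions of params and GB (1e9 bytes) cancel out

for params in (7, 13, 30, 65):
    print(f"{params}B:", {bits: round(approx_size_gb(params, bits), 1) for bits in (16, 8, 4, 3, 2)})
```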

3

ambient_temp_xeno t1_jd7fm8a wrote

I have the 7B 4-bit alpaca.cpp running on my CPU (on virtualized Linux), along with this browser, with 12.3/16GB free. So realistically, to use it without taking over your computer, I'd say 16GB of RAM is needed; 8GB wouldn't cut it. It might apparently fit in 8GB of system RAM, especially if it's running natively on Linux, but I haven't tried it. I tried to load the 13B and couldn't.

2

ambient_temp_xeno t1_jdcpvhv wrote

*Turns out WSL2 uses half your RAM by default. **13B seems to be weirdly not much better, and possibly worse by some accounts, anyway.

1

xtof54 t1_jd467f3 wrote

There are several, either collaborative (look at together.computer, Hivemind, Petals) or on a single machine with no GPU using pipeline parallelism, though that requires reimplementing it for every model; see e.g. slowLLM on GitHub for BLOOM-176B.
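
To illustrate the single-machine pipeline idea (a toy sketch, not slowLLM's actual code): keep every layer in CPU RAM and stream one block at a time through the GPU.

```python
# Toy sketch of single-machine pipelining: weights live in CPU RAM and each
# block is streamed onto the GPU only for its part of the forward pass.
# This shows the general idea behind projects like slowLLM, not their implementation.
import torch
import torch.nn as nn

class OffloadedStack(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)  # all weights stay on the CPU

    @torch.no_grad()
    def forward(self, x, device):
        x = x.to(device)
        for layer in self.layers:
            layer.to(device)   # stream this block's weights in
            x = layer(x)
            layer.to("cpu")    # free the device memory for the next block
        return x.cpu()

device = "cuda" if torch.cuda.is_available() else "cpu"
blocks = [nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(12)]
model = OffloadedStack(blocks)
print(model(torch.randn(16, 4, 512), device).shape)  # (seq_len, batch, d_model)
```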

10

ZestyData t1_jd5299x wrote

It's pretty much all that's been posted here for the past week

6

mxby7e t1_jd5rn62 wrote

https://github.com/oobabooga/text-generation-webui

I've had great results with this interface. It requires a little tweaking to get working with lower specs, but it supports a lot of optimization options, including splitting the model between VRAM and CPU RAM. I've been running LLaMA 7B in 8-bit and limiting it to 8GB of VRAM.
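
The same VRAM/CPU split can also be done directly with transformers + accelerate; a hedged sketch (the model name, memory caps, and prompt are placeholders, and 8-bit loading requires bitsandbytes):

```python
# Sketch: load a causal LM in 8-bit and cap VRAM so the remaining layers
# spill over to CPU RAM. Model name and memory limits are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "decapoda-research/llama-7b-hf"  # placeholder; point this at whatever weights you have
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # let accelerate place the layers
    load_in_8bit=True,                        # requires bitsandbytes
    max_memory={0: "8GiB", "cpu": "24GiB"},   # cap GPU 0 at ~8GB of VRAM
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```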

6

ajt9000 t1_jd5w735 wrote

Speaking of this, do you guys know of ways to run inference and/or train models on graphics cards with insufficient VRAM? I've had some success with breaking a model up into multiple smaller models and then running inference on them as a boosted ensemble, but that's obviously not possible with lots of architectures.

I'm just wondering if you can do that with an unfavorable architecture as long as it's pretrained.

1

sanxiyn t1_jd68827 wrote

You don't need the leaked LLaMA weights. The ChatGLM-6B weights are distributed by the first party.
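
For reference, the ChatGLM-6B model card shows usage roughly like the following; this is hedged since the model relies on its repo's custom remote code, and chat() is a helper defined by that code rather than part of transformers itself.

```python
# Sketch based on the ChatGLM-6B model card: the repo ships its own modeling
# code, so trust_remote_code is required; chat() is a helper defined there.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

response, history = model.chat(tokenizer, "What can I run on a single consumer GPU?", history=[])
print(response)
```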

1

fnbr t1_jd6j7gh wrote

Right now, the tech isn't there to train these models on a single GPU; you'd end up training a language model for about a month to do so. It is slightly more efficient, though.

Lots of people are looking at running locally. In addition to everything people have said, there are a bunch of companies that, from rumours I've heard from employees, will soon be releasing models that can just barely fit on an A100.

1

atheist-projector t1_jd6trwc wrote

I'm considering doing algo trading with something like this. Not sure if I will or not.

1

adventuringraw t1_jddte0k wrote

No one else mentioned this, so I figured I'd add that there's also much more exotic research going into low-power techniques that could match what we're seeing with modern LLMs. One of the most interesting areas to me personally is the recent progress in spiking neural networks, an approach much more inspired by biological intelligence. The idea: instead of continuous activations being passed between layers as dense vectors, you've got spiking neurons sending sparse digital signals. Progress has historically been kind of stalled out since they're so hard to train, but there's been some big movement just this month with SpikeGPT. They basically figured out how to leverage normal deep learning training, and that, along with a few other tricks, got comparable performance to an equivalently sized DNN with 22x reduced power consumption.
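
For intuition about what "spiking" means here, a leaky integrate-and-fire neuron, the basic unit in many SNNs, can be sketched in a few lines. This is just the textbook dynamics, nothing SpikeGPT-specific:

```python
# Toy leaky integrate-and-fire neuron: the membrane potential leaks over time,
# integrates input current, and emits a sparse binary spike when it crosses
# a threshold. Nothing here is SpikeGPT-specific.
import numpy as np

def lif_neuron(input_current, threshold=1.0, decay=0.9):
    v = 0.0
    spikes = []
    for current in input_current:
        v = decay * v + current   # leak, then integrate
        if v >= threshold:
            spikes.append(1)      # fire a spike...
            v = 0.0               # ...and reset the membrane potential
        else:
            spikes.append(0)
    return spikes

rng = np.random.default_rng(0)
print(lif_neuron(rng.uniform(0, 0.4, size=50)))
```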

The real promise of SNNs, though, is that in theory you could develop large-scale specialized 'neuromorphic' hardware to optimally run them, the way GPUs and TPUs serve traditional DNNs. A chip like that could end up being a cornerstone of efficient ML if things work out that way, and who knows? Maybe it'd even open the door to tighter coupling and progress between ML and neuroscience.

There are plenty of other things being researched too, of course; I'm nowhere near knowledgeable enough to give a proper overview, but it's a pretty vast space once you start looking at more exotic research efforts. I'm sure carbon-nanotube or superconductor-based computing breakthroughs would massively change the equation, for example. Twenty years from now, we might find ourselves in a completely new paradigm... that'd be pretty cool.

1