Comments

remghoost7 t1_jbz96lt wrote

> <9 GiB VRAM

So does that mean my 1060 6GB can run it....? haha.

I doubt it, but I'll give it a shot later just in case.

stefanof93 t1_jbzeots wrote

Has anyone evaluated all the quantized versions and compared them against smaller models yet? How many bits can you throw away before you're better off just picking a smaller model?

remghoost7 t1_jbzmfku wrote

Super neat. Thanks for the reply. I'll try that.

Also, do you know if there's a local interface for it....?

I know it's not quite within the scope of the post, but it'd be neat to interact with it through a simple Python interface (or something like how Gradio is used for A1111's Stable Diffusion) rather than piping it all through Discord.
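
Something along the lines of this rough sketch, where generate() is just a stand-in for whatever the actual local inference call ends up being:

```python
# Rough sketch of a local Gradio front-end. generate() is a placeholder for
# whatever inference function the model backend actually exposes.
import gradio as gr

def generate(prompt: str) -> str:
    # Placeholder: call into the local model here and return its completion.
    return "model output for: " + prompt

gr.Interface(
    fn=generate,
    inputs=gr.Textbox(lines=4, label="Prompt"),
    outputs=gr.Textbox(label="Completion"),
    title="Local LLaMA",
).launch()
```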

Amazing_Painter_7692 OP t1_jbzoq05 wrote

There's an inference engine class if you want to build out your own API:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/engine.py#L56-L96
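
Roughly speaking, an engine like that is a thin wrapper around a tokenizer and a model. A stripped-down sketch (not the actual class; the model name, defaults, and method names here are placeholders) looks something like:

```python
# Stripped-down sketch of an inference engine class; not the bot's actual code.
# Model name, defaults, and method names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class InferenceEngine:
    def __init__(self, model_name: str = "decapoda-research/llama-7b-hf"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float16, device_map="auto"
        )

    @torch.inference_mode()
    def generate(self, prompt: str, max_new_tokens: int = 200, temperature: float = 0.7) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
        )
        # Strip the echoed prompt and return only the newly generated text.
        return self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )

# engine = InferenceEngine()
# print(engine.generate("The capital of France is"))
```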

And there's a simple text inference script here:

https://github.com/AmericanPresidentJimmyCarter/yal-discord-bot/blob/main/bot/llama_model/llama_inference.py

Or in the original repo:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

BUT someone has already made a webUI like the automatic1111 one!

https://github.com/oobabooga/text-generation-webui

Unfortunately it looked really complicated for me to set up with 4-bit weights, and I tend to do everything over a Linux terminal. :P

remghoost7 t1_jbzqf5m wrote

Most excellent. Thank you so much! I will look into all of these.

Guess I know what I'm doing for the rest of the day. Time to make more coffee! haha.

You are my new favorite person this week.

Also, one final question, if you will. What's so special about the 4-bit weights, and why would you prefer to run it that way? Is it just about VRAM requirements....? I'm decently versed in Stable Diffusion, but LLMs are fairly new territory for me.

My question seems to have been answered here: it is indeed a VRAM limitation. Also, that last link seems to support 4-bit models as well. Doesn't seem too bad to set up... though I installed A1111 when it first came out, so I learned through the garbage of that. Lol. I was wrong. Oh so wrong. haha.
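
For anyone else wondering, the rough math on weight memory (weights only, ignoring activation and cache overhead) looks something like:

```python
# Back-of-the-envelope weight memory for a 7B-parameter model (weights only;
# activations and the KV cache add more on top of this).
params = 7e9
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{params * bits / 8 / 2**30:.1f} GiB")
# 16-bit: ~13.0 GiB, 8-bit: ~6.5 GiB, 4-bit: ~3.3 GiB
```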

Yet again, thank you for your time and have a wonderful rest of your day. <3

toothpastespiders t1_jc01mr9 wrote

> BUT someone has already made a webUI like the automatic1111 one!

There's a subreddit for it over at /r/Oobabooga too that deserves more attention. I've only had a little time to play around with it but it's a pretty sleek system from what I've seen.

> it looked really complicated for me to set up with 4-bit weights

I'd like to say that the warnings make it look more intimidating than it really is; for me it was just copying and pasting four or five lines into a terminal. Then again, I also couldn't get it to work, so I might be doing something wrong. I'm guessing my weirdo GPU just wasn't accounted for somewhere. I'm going to bang my head against it when I've got time, because it's frustrating having tons of VRAM to spare and not getting the most out of it.

remghoost7 t1_jc0bymy wrote

I'm having an issue with the C++ compiler on the last step.

I've been trying to use Python 3.10.9 though, so maybe that's my problem....? My venv is set up correctly as well.

Not specifically looking for help.

Apparently this person posted a guide on it in that subreddit. Will report back if I am successful.

edit - Success! But using WSL instead of Windows (because that was a freaking headache). WSL worked on the first try following the instructions on the GitHub page. I'd highly recommend using WSL to install it instead of trying to force Windows to figure it out.

APUsilicon t1_jc0zbtj wrote

Oooh, I've been getting trash responses from OPT-6.7B; hopefully this is better.

Raise_Fickle t1_jc1p9x5 wrote

Anyone having any luck finetuning LLaMA in a multi-GPU setup?

MorallyDeplorable t1_jc1umt7 wrote

I'm not actually sure. I've just been chatting with people about it in an unrelated Discord's off-topic channel.

I'd post some of what I've got from it, but I have no idea what I'm doing with it and don't think what I'm getting is a fair representation of what it can actually do.

LetterRip t1_jc4rifv wrote

Depends on the model. Some have difficulty even with full 8-bit quantization; with others you can go to 4-bit relatively easily. There is some research suggesting 3-bit might be the practical limit, with only the occasional model remaining usable at 2-bit.
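
For reference, the 8-bit path through transformers + bitsandbytes looks roughly like this (a sketch; the checkpoint is a placeholder, and 4-bit generally needs separate tooling such as GPTQ):

```python
# Sketch: loading a causal LM with 8-bit weights via bitsandbytes + transformers.
# Requires the bitsandbytes and accelerate packages; the checkpoint is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-6.7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",
    load_in_8bit=True,  # roughly halves weight memory vs fp16; newer versions express this via BitsAndBytesConfig
)

prompt = "Quantization trades accuracy for"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```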
