Amazing_Painter_7692 OP t1_jbzoq05 wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
There's an inference engine class if you want to build out your own API:
And there's a simple text inference script here:
Or in the original repo:
https://github.com/qwopqwop200/GPTQ-for-LLaMa
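As a rough illustration of building your own API around an inference engine class, here's a minimal sketch of an HTTP wrapper. The `Engine` class and its `generate` method are hypothetical stand-ins for whatever the repo's engine exposes, not its actual API:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class Engine:
    """Hypothetical stand-in for the repo's inference engine class."""
    def generate(self, prompt: str, max_tokens: int = 64) -> str:
        # A real engine would run the 4-bit quantized model here;
        # this stub just echoes the prompt so the sketch is runnable.
        return prompt + " ..."

engine = Engine()

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body: {"prompt": "...", "max_tokens": N}
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        text = engine.generate(body.get("prompt", ""),
                               body.get("max_tokens", 64))
        payload = json.dumps({"text": text}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the sketch quiet

# To serve: HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

A Discord bot (or anything else) can then POST prompts to this endpoint instead of importing the model directly.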
BUT someone has already made a webUI like the automatic1111 one!
https://github.com/oobabooga/text-generation-webui
Unfortunately it looked really complicated to set up with 4-bit weights, and I tend to do everything over a Linux terminal. :P
Amazing_Painter_7692 OP t1_jbzbcmi wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
Should work fine with the 7b param model: https://huggingface.co/decapoda-research/llama-7b-hf-int4
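Back-of-the-envelope on why these fit: at 4 bits per parameter the raw weights for 13B come to about 6 GiB and for 7B about 3.3 GiB, which is why 13b fits under the 9 GiB figure in the title (this ignores quantization group metadata, activations, and KV cache, so it's a lower bound, not an exact measurement):

```python
def weight_gib(n_params: float, bits: int) -> float:
    """Approximate weight memory in GiB for n_params at the given bit width."""
    return n_params * bits / 8 / 2**30

print(round(weight_gib(13e9, 4), 2))  # 13B at 4-bit: ~6.05 GiB
print(round(weight_gib(7e9, 4), 2))   # 7B at 4-bit:  ~3.26 GiB
```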
Amazing_Painter_7692 OP t1_jbz7hta wrote
Reply to comment by 3deal in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
It's the HuggingFace transformers module version of the weights from Meta/Facebook Research.
Amazing_Painter_7692 OP t1_jbzov27 wrote
Reply to comment by stefanof93 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
https://github.com/qwopqwop200/GPTQ-for-LLaMa
Performance is quite good.