Amazing_Painter_7692 OP t1_jbzoq05 wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
There's an inference engine class if you want to build out your own API:
And there's a simple text inference script here:
Or in the original repo:
https://github.com/qwopqwop200/GPTQ-for-LLaMa
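As a rough illustration of building your own API around an inference engine class, here's a minimal sketch of an HTTP wrapper. The `Engine` class and its `generate` method are hypothetical stand-ins for whatever the repo's engine exposes, not its actual API:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class Engine:
    """Hypothetical stand-in for the repo's inference engine class."""
    def generate(self, prompt: str, max_tokens: int = 64) -> str:
        # A real engine would run the 4-bit quantized model here;
        # this stub just echoes the prompt so the sketch is runnable.
        return prompt + " ..."

engine = Engine()

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body: {"prompt": "...", "max_tokens": N}
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        text = engine.generate(body.get("prompt", ""),
                               body.get("max_tokens", 64))
        payload = json.dumps({"text": text}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the sketch quiet

# To serve: HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

A Discord bot (or anything else) can then POST prompts to this endpoint instead of importing the model directly.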
BUT someone has already made a webUI like the automatic1111 one!
https://github.com/oobabooga/text-generation-webui
Unfortunately it looked really complicated to set up with 4-bit weights, and I tend to do everything over a Linux terminal. :P
Amazing_Painter_7692 OP t1_jbzbcmi wrote
Reply to comment by remghoost7 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
Should work fine with the 7b param model: https://huggingface.co/decapoda-research/llama-7b-hf-int4
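Back-of-the-envelope on why these fit: at 4 bits per parameter the raw weights for 13B come to about 6 GiB and for 7B about 3.3 GiB, which is why 13b fits under the 9 GiB figure in the title (this ignores quantization group metadata, activations, and KV cache, so it's a lower bound, not an exact measurement):

```python
def weight_gib(n_params: float, bits: int) -> float:
    """Approximate weight memory in GiB for n_params at the given bit width."""
    return n_params * bits / 8 / 2**30

print(round(weight_gib(13e9, 4), 2))  # 13B at 4-bit: ~6.05 GiB
print(round(weight_gib(7e9, 4), 2))   # 7B at 4-bit:  ~3.26 GiB
```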
Amazing_Painter_7692 OP t1_jbz7hta wrote
Reply to comment by 3deal in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
It's the HuggingFace transformers module version of the weights from Meta/Facebook Research.
Amazing_Painter_7692 OP t1_jbzov27 wrote
Reply to comment by stefanof93 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
https://github.com/qwopqwop200/GPTQ-for-LLaMa
Performance is quite good.