machineko t1_jecvhyt wrote
Reply to comment by Evening_Ad6637 in [D] Training a 65b LLaMA model by Business-Lead2679
16GB of RAM is not enough for even the smallest LLaMA model (7B). You can try LoRA with int8 quantization, as in the example listed above. Did you try the Python script I linked above?
machineko t1_je88wj9 wrote
Reply to [D] Training a 65b LLaMA model by Business-Lead2679
I'm working on an open source library focused on resource-efficient fine-tuning methods called xTuring: https://github.com/stochasticai/xturing
Here's how you would perform int8 LoRA fine-tuning in three lines:
python: https://github.com/stochasticai/xturing/blob/main/examples/llama/llama_lora_int8.py
colab notebook: https://colab.research.google.com/drive/1SQUXq1AMZPSLD4mk3A3swUIc6Y2dclme?usp=sharing
Of course, the Colab still only works with smaller models. In the example above, the 7B model required 9GB of VRAM.
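For reference, the three lines boil down to roughly the following (a sketch based on the linked example; the dataset path is a placeholder for your own instruction data):

    from xturing.datasets.instruction_dataset import InstructionDataset
    from xturing.models import BaseModel

    # Point this at your own instruction dataset
    dataset = InstructionDataset("./alpaca_data")

    # LLaMA 7B with int8 weights and LoRA adapters
    model = BaseModel.create("llama_lora_int8")

    # Fine-tune
    model.finetune(dataset=dataset)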
machineko t1_je888dc wrote
Why not use open-source models? Especially since it seems like you are not trying to sell the model commercially, you could easily replace it with an open-source one. Also, for retrieval-augmented generation, smaller models can be very effective.
machineko t1_je86hwt wrote
Reply to comment by ortegaalfredo in [D] llama 7b vs 65b ? by deck4242
What GPUs are you using to run them? Are you using any compression (e.g. quantization)?
machineko t1_je83m8x wrote
Reply to comment by LetGoAndBeReal in [D] The best way to train an LLM on company data by jaxolingo
Unsupervised fine-tuning (or extending the pre-training) with additional data will work. Of course, how to get it to learn new information effectively is a challenge but not impossible.
machineko t1_je70llx wrote
Reply to comment by LetGoAndBeReal in [D] The best way to train an LLM on company data by jaxolingo
Why would you say that fine-tuning is not viable? There are many production use cases of fine-tuning a model using in-house proprietary data.
In fact, if you have the resources, you can both fine-tune an existing model (whether supervised or unsupervised) and use it for retrieval-augmented generation.
machineko t1_je05orp wrote
Reply to comment by rshah4 in [D] FOMO on the rapid pace of LLMs by 00001746
I agree. While these giant centralized models are all over the news, there are ways to make smaller models much more efficient (e.g. the LoRA approach mentioned above). And in the process of working with these techniques, we may discover new methods and architectures.
We are working on an open-source project focused on making fine-tuning of LLMs simple, fast and efficient: https://github.com/stochasticai/xturing.
OP, we still have a ton of things we want to try out to make fine-tuning faster and more compute/memory efficient, if you are interested in contributing.
machineko t1_jdtv8jv wrote
Reply to comment by ephemeralentity in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
If you need help, come find us on our discord channel.
machineko t1_jdqzmyq wrote
Reply to comment by ephemeralentity in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
model.save("path/to/your/weights") saves the fine-tuned weights to that directory.
After that, you can load them with:
model = BaseModel.create("gpt2", "path/to/your/weights")
Can you share the input text you have used? It is possible that GPT-2 is too small and needs custom generation parameters.
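For completeness, a minimal load-and-generate sketch following the xTuring examples (the path and prompt are placeholders):

    from xturing.models import BaseModel

    # Reload the fine-tuned GPT-2 weights saved earlier
    model = BaseModel.create("gpt2", "path/to/your/weights")

    # Generate from a prompt; with a model this small, the prompt and
    # generation parameters matter a lot for output quality
    output = model.generate(texts=["What is the meaning of life?"])
    print(output)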
machineko t1_jdnmg8l wrote
Reply to comment by ephemeralentity in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
Right, 8GB won't be enough for LLaMA 7B. You should try the GPT-2 model instead; that should work with 8GB of VRAM.
machineko t1_jdmm43b wrote
Reply to comment by SWESWESWEh in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
Thanks for the comment. Are you looking to run on M2 or smaller edge devices?
machineko t1_jdmdvst wrote
Reply to comment by light24bulbs in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
We are working on adding that as well. Keep an eye on our repo.
machineko t1_jdjeh6y wrote
We have a similar open-source project focused on personalization of LLMs and efficient fine-tuning: https://github.com/stochasticai/xturing
We actually released code for GPT-J, LLaMA and GPT-2 before these guys, but we are a small team. You can run it on your local machine too.
machineko t1_jbu36nu wrote
Reply to [D] What is the best way to fine tune a LLM with your own data and build a custom text classifier? by pgalgali
How long is your text? If you are working with short sentences, try fine-tuning RoBERTa on your labeled dataset for classification. If you don't have a labeled dataset, you'll need to use zero- or few-shot learning with a larger model. I'd start with a smaller LLM like GPT-J and play with some prompts on a free playground like this (you can select GPT-J) until you find something that works well.
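If you do have labels, here is a minimal fine-tuning sketch with Hugging Face transformers; the IMDB dataset, label count and hyperparameters are placeholders to swap for your own data:

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    dataset = load_dataset("imdb")  # stand-in for your own labeled dataset
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    tokenized = dataset.map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

    args = TrainingArguments(output_dir="roberta-clf",
                             per_device_train_batch_size=16,
                             num_train_epochs=3)
    trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"])
    trainer.train()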
machineko t1_ja4jubd wrote
Reply to [D] Faster Flan-T5 inference by _learn_faster_
Inference acceleration involves model accuracy / latency / cost trade-offs, and also how much $ and time you are willing to spend to speed things up. Is your goal to achieve real-time latency? Can you do it while taking a 2-3% accuracy hit? What compute resource is the model going to run on? Is it in the cloud, and do you have access to any GPUs? For example, certain inference optimization techniques will only run on newer, more expensive GPUs.
For example, for highly scalable and low-latency deployment, you'd probably want to do model compression. And once you have a compressed model, you can optimize inference using TensorRT and/or other compilers/kernel libraries. Happy to share more thoughts, feel free to reply here or DM me with more details.
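As one concrete example of the compression step, here is a minimal sketch of post-training dynamic quantization for Flan-T5 on CPU (model size and prompt are just examples; TensorRT or another compiler would be a separate, additional step):

    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base").eval()

    # Quantize the Linear layers to int8 for faster CPU inference
    quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

    inputs = tokenizer("Translate English to German: How old are you?", return_tensors="pt")
    outputs = quantized.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))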
machineko t1_j9dgblo wrote
You can use LangChain with open-source models like Flan-T5 or GPT-J as well. You just need to deploy them as an API endpoint and point LangChain to it.
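For example, here is a minimal sketch with a local Flan-T5 wrapped as a LangChain LLM (based on the LangChain API as of early 2023; model id and prompt are placeholders). The model runs in-process via a pipeline here; if you deploy it behind an API endpoint instead, you would swap in a small custom LLM wrapper that calls your endpoint:

    from langchain.llms import HuggingFacePipeline
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain

    # Local open-source model, no hosted API required
    llm = HuggingFacePipeline.from_model_id(
        model_id="google/flan-t5-base",
        task="text2text-generation",
        model_kwargs={"max_length": 128},
    )

    prompt = PromptTemplate(input_variables=["question"],
                            template="Answer the question: {question}")
    chain = LLMChain(llm=llm, prompt=prompt)
    print(chain.run(question="What is retrieval-augmented generation?"))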
machineko t1_j8yo6fd wrote
Reply to comment by askingforhelp1111 in [D] Speed up HuggingFace Inference Pipeline by [deleted]
It depends on which models you are using, but for most transformers, running on GPUs can be much more cost-efficient than CPUs when you consider $ per million inferences (or inferences per $).
Are there specific EC2 instances you have to use or can you deploy on any EC2 instance?
machineko t1_j8b0zyv wrote
Reply to [D] Speed up HuggingFace Inference Pipeline by [deleted]
Are you interested in reducing the latency or just cutting down the cost? Can you run the workload on GPUs instead?
For BERT-type models, doing some compression and using inference libraries can easily get you a 5-10x speedup. If you're interested, I'd be happy to share more resources on this.
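As one example of that path, here is a minimal sketch that exports a (Distil)BERT classifier to ONNX and runs it with ONNX Runtime; the model name and opset are just examples, and combining this with quantization or distillation is where the bigger speedups come from:

    import torch
    import onnxruntime as ort
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()
    model.config.return_dict = False  # return tuples so the exporter can trace outputs

    enc = tokenizer("this library is fast", return_tensors="pt")
    torch.onnx.export(
        model,
        (enc["input_ids"], enc["attention_mask"]),
        "model.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                      "attention_mask": {0: "batch", 1: "seq"},
                      "logits": {0: "batch"}},
        opset_version=14,
    )

    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    logits = session.run(["logits"], {"input_ids": enc["input_ids"].numpy(),
                                      "attention_mask": enc["attention_mask"].numpy()})[0]
    print(logits)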
machineko t1_j048ay1 wrote
Reply to comment by gyurisc in [D] Cloud providers for hobby use by gyurisc
You can start out by looking at deploying Stable Diffusion models with accelerated inference (https://github.com/stochasticai/x-stable-diffusion). You normally want to containerize the models when you deploy them; it's easier to maintain and scale. DM me if you have any questions.
machineko t1_iz7x0mh wrote
Reply to [D] Is there an affordable way to host a diffusers Stable Diffusion model publicly on the Internet for "real-time"-inference? (CPU or Serverless GPU?) by OkOkPlayer
How "cheap" does it have to be?
Cheapest would be to deploy it on your own using https://github.com/stochasticai/x-stable-diffusion. Let me know if you need more help with real-time inference.
machineko t1_iyisdnp wrote
Reply to [D] Cloud providers for hobby use by gyurisc
It would be nice if you could try out stochastic.ai and provide suggestions on how to improve it. I'd be happy to explain how to build the ML cloud infrastructure yourself, too.
machineko t1_ixzkdbt wrote
Reply to [D]deploy stable diffusion by Dense_History_1786
AWS Lambda provides serverless but you do not need serverless to make something scalable, if you are referring to scaling from single to multiple GPUs as your workload grows.
The simplest method is to containerize your application and use auto-scaling on GCP. You can also auto-scale it on Kubernetes. Alternatively, you can use services like stochastic.ai, which deploy your model in a container and provide auto-scaling out of the box; you just need to upload your model and deploy.
However, I suggest you "accelerate" your inference first. For example, you can use open-source inference engines (see https://github.com/stochasticai/x-stable-diffusion) to easily speed up your inference 2x or more. That means you can generate 2x more images per $ on public clouds.
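If you want a quick baseline before reaching for a dedicated inference engine, a minimal diffusers sketch with fp16 weights and attention slicing already cuts latency and memory noticeably (the model id is an example; x-stable-diffusion goes further with optimized backends like TensorRT):

    import torch
    from diffusers import StableDiffusionPipeline

    # Half-precision weights roughly halve memory and speed up GPU inference
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Attention slicing trades a little speed for a large VRAM reduction
    pipe.enable_attention_slicing()

    image = pipe("a photo of an astronaut riding a horse on mars").images[0]
    image.save("astronaut.png")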
machineko t1_ixwzqpa wrote
Here's a free playground: https://playground.stochastic.ai
Also, you get $20 in credits (equivalent to 4,000 generations) when you sign up.
machineko t1_isdv3te wrote
Reply to comment by phb07jm in [D] Modern MLOps architecture info sources by lifesthateasy
I agree with this comment. Back when the tools were crappy, it might've been better to build from scratch, but with so many good tools available now (often cheaper and better-performing than what you'd build on your own), you should at least try them, especially if you are interested in running deep learning workloads.
There is MLOps software for:
- low-latency inference
- training large language models
- explainable ML
and more.
machineko t1_jecw2v4 wrote
Reply to comment by darkbluetwilight in [D]Suggestions on keeping Llama index cost down by darkbluetwilight
Cerebras-GPT models are Apache-2.0 licensed, so you should be able to use them for free. I'm not sure what you mean by charges; are you referring to the hosted APIs?
Btw, you should use the ones that are instruction fine-tuned.
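For example, running one locally with transformers takes just a few lines (the 1.3B base checkpoint below is an assumption for illustration; swap in an instruction-tuned variant if you find one you like):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "cerebras/Cerebras-GPT-1.3B"  # base model; pick an instruction-tuned variant if available
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tokenizer("Generative AI is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))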