machineko t1_jecvhyt wrote
Reply to comment by Evening_Ad6637 in [D] Training a 65b LLaMA model by Business-Lead2679
16GB of RAM is not enough for even the smallest LLaMA model (7B). You can try LoRA with int8 quantization, as in the example listed above. Did you try the Python script I linked above?
machineko t1_je88wj9 wrote
Reply to [D] Training a 65b LLaMA model by Business-Lead2679
I'm working on an open source library focused on resource-efficient fine-tuning methods called xTuring: https://github.com/stochasticai/xturing
Here's how you would perform int8 LoRA fine-tuning in three lines:
python: https://github.com/stochasticai/xturing/blob/main/examples/llama/llama_lora_int8.py
colab notebook: https://colab.research.google.com/drive/1SQUXq1AMZPSLD4mk3A3swUIc6Y2dclme?usp=sharing
Of course, the Colab still only works with smaller models. In the example above, the 7B model required 9GB of VRAM.
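For reference, the three lines boil down to roughly the following (a sketch based on the linked example; the dataset path is a placeholder for your own instruction data):

    from xturing.datasets.instruction_dataset import InstructionDataset
    from xturing.models import BaseModel

    # Point this at your own instruction dataset
    dataset = InstructionDataset("./alpaca_data")

    # LLaMA 7B with int8 weights and LoRA adapters
    model = BaseModel.create("llama_lora_int8")

    # Fine-tune
    model.finetune(dataset=dataset)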
machineko t1_je888dc wrote
Why not use open-source models? Especially since it seems like you are not trying to sell the model commercially, you could easily replace it with an open-source one. Also, for retrieval-augmented generation, smaller models can be very effective.
machineko t1_je86hwt wrote
Reply to comment by ortegaalfredo in [D] llama 7b vs 65b ? by deck4242
What GPUs are you using to run them? Are you using any compression (e.g. quantization)?
machineko t1_je83m8x wrote
Reply to comment by LetGoAndBeReal in [D] The best way to train an LLM on company data by jaxolingo
Unsupervised fine-tuning (or extending the pre-training) with additional data will work. Of course, how to get it to learn new information effectively is a challenge but not impossible.
machineko t1_je70llx wrote
Reply to comment by LetGoAndBeReal in [D] The best way to train an LLM on company data by jaxolingo
Why would you say that fine-tuning is not viable? There are many production use cases of fine-tuning a model using in-house proprietary data.
In fact, if you have the resources, you can both fine-tune an existing model (whether supervised or unsupervised) and use it for retrieval-augmented generation.
machineko t1_je05orp wrote
Reply to comment by rshah4 in [D] FOMO on the rapid pace of LLMs by 00001746
I agree. While these giant centralized models are all over the news, there are ways to make smaller models much more efficient (e.g. the LoRA approach mentioned above). And in the process of working with these techniques, we may discover new methods and architectures.
We are working on an open-source project focused on making fine-tuning of LLMs simple, fast and efficient: https://github.com/stochasticai/xturing.
OP, we still have a ton of things we want to try out to make fine-tuning faster and more compute/memory efficient, if you are interested in contributing.
machineko t1_jdtv8jv wrote
Reply to comment by ephemeralentity in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
If you need help, come find us on our discord channel.
machineko t1_jdqzmyq wrote
Reply to comment by ephemeralentity in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
model.save("path/to/your/weights") saves the fine-tuned weights to that directory.
After that, you can load them with:
model = BaseModel.create("gpt2", "path/to/your/weights")
Can you share the input text you have used? It is possible that GPT-2 is too small and needs custom generation parameters.
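For completeness, a minimal load-and-generate sketch following the xTuring examples (the path and prompt are placeholders):

    from xturing.models import BaseModel

    # Reload the fine-tuned GPT-2 weights saved earlier
    model = BaseModel.create("gpt2", "path/to/your/weights")

    # Generate from a prompt; with a model this small, the prompt and
    # generation parameters matter a lot for output quality
    output = model.generate(texts=["What is the meaning of life?"])
    print(output)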
machineko t1_jdnmg8l wrote
Reply to comment by ephemeralentity in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
Right, 8GB won't be enough for LLaMA 7B. You should try the GPT-2 model instead; that should work with 8GB of VRAM.
machineko t1_jdmm43b wrote
Reply to comment by SWESWESWEh in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
Thanks for the comment. Are you looking to run on M2 or smaller edge devices?
machineko t1_jdmdvst wrote
Reply to comment by light24bulbs in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
We are working on adding that as well. Keep an eye on our repo.
machineko t1_jdjeh6y wrote
We have a similar open-source project focused on personalization of LLMs and efficient fine-tuning: https://github.com/stochasticai/xturing
We actually released code for GPT-J, LLaMA and GPT-2 before these guys, but we are a small team. You can run it on your local machine too.
machineko t1_jbu36nu wrote
Reply to [D] What is the best way to fine tune a LLM with your own data and build a custom text classifier? by pgalgali
How long is your text? If you are working with short sentences, try fine-tuning RoBERTa on your labeled dataset for classification. If you don't have a labeled dataset, you'll need to use zero- or few-shot learning with a larger model. I'd start with a smaller LLM like GPT-J and play with some prompts on a free playground like this (you can select GPT-J) until you find something that works well.
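If you do have labels, here is a minimal fine-tuning sketch with Hugging Face transformers; the IMDB dataset, label count and hyperparameters are placeholders to swap for your own data:

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    dataset = load_dataset("imdb")  # stand-in for your own labeled dataset
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    tokenized = dataset.map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

    args = TrainingArguments(output_dir="roberta-clf",
                             per_device_train_batch_size=16,
                             num_train_epochs=3)
    trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"])
    trainer.train()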
machineko t1_ja4jubd wrote
Reply to [D] Faster Flan-T5 inference by _learn_faster_
Inference acceleration involves model accuracy / latency / cost trade-offs, and also how much $ and time you are willing to spend to speed things up. Is your goal to achieve real-time latency? Can you do it while taking a 2-3% accuracy hit? What compute resource is the model going to run on? Is it in the cloud, and do you have access to any GPUs? For example, certain inference optimization techniques will only run on newer, more expensive GPUs.
For example, for highly scalable and low-latency deployment, you'd probably want to do model compression. And once you have a compressed model, you can optimize inference using TensorRT and/or other compilers/kernel libraries. Happy to share more thoughts, feel free to reply here or DM me with more details.
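As one concrete example of the compression step, here is a minimal sketch of post-training dynamic quantization for Flan-T5 on CPU (model size and prompt are just examples; TensorRT or another compiler would be a separate, additional step):

    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base").eval()

    # Quantize the Linear layers to int8 for faster CPU inference
    quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

    inputs = tokenizer("Translate English to German: How old are you?", return_tensors="pt")
    outputs = quantized.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))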
machineko t1_j9dgblo wrote
You can use LangChain with open-source models like Flan-T5 or GPT-J as well. You just need to deploy them as an API endpoint and point LangChain to it.
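For example, here is a minimal sketch with a local Flan-T5 wrapped as a LangChain LLM (based on the LangChain API as of early 2023; model id and prompt are placeholders). The model runs in-process via a pipeline here; if you deploy it behind an API endpoint instead, you would swap in a small custom LLM wrapper that calls your endpoint:

    from langchain.llms import HuggingFacePipeline
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain

    # Local open-source model, no hosted API required
    llm = HuggingFacePipeline.from_model_id(
        model_id="google/flan-t5-base",
        task="text2text-generation",
        model_kwargs={"max_length": 128},
    )

    prompt = PromptTemplate(input_variables=["question"],
                            template="Answer the question: {question}")
    chain = LLMChain(llm=llm, prompt=prompt)
    print(chain.run(question="What is retrieval-augmented generation?"))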
machineko t1_j8yo6fd wrote
Reply to comment by askingforhelp1111 in [D] Speed up HuggingFace Inference Pipeline by [deleted]
It depends on which models you are using, but for most transformers, running on GPUs can be much more cost-efficient than CPUs when you consider $ per million inferences (or inferences per $).
Are there specific EC2 instances you have to use or can you deploy on any EC2 instance?
machineko t1_j8b0zyv wrote
Reply to [D] Speed up HuggingFace Inference Pipeline by [deleted]
Are you interested in reducing the latency or just cutting down the cost? Can you run the workload on GPUs instead?
For BERT-type models, doing some compression and using inference libraries can easily get you a 5-10x speedup. If you're interested, I'd be happy to share more resources on this.
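As one example of that path, here is a minimal sketch that exports a (Distil)BERT classifier to ONNX and runs it with ONNX Runtime; the model name and opset are just examples, and combining this with quantization or distillation is where the bigger speedups come from:

    import torch
    import onnxruntime as ort
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()
    model.config.return_dict = False  # return tuples so the exporter can trace outputs

    enc = tokenizer("this library is fast", return_tensors="pt")
    torch.onnx.export(
        model,
        (enc["input_ids"], enc["attention_mask"]),
        "model.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                      "attention_mask": {0: "batch", 1: "seq"},
                      "logits": {0: "batch"}},
        opset_version=14,
    )

    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    logits = session.run(["logits"], {"input_ids": enc["input_ids"].numpy(),
                                      "attention_mask": enc["attention_mask"].numpy()})[0]
    print(logits)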
machineko t1_j048ay1 wrote
Reply to comment by gyurisc in [D] Cloud providers for hobby use by gyurisc
You can start out by looking at deploying Stable Diffusion models with accelerated inference (https://github.com/stochasticai/x-stable-diffusion). You normally want to containerize the models when you deploy them; it's easier to maintain and scale. DM me if you have any questions.
machineko t1_iz7x0mh wrote
Reply to [D] Is there an affordable way to host a diffusers Stable Diffusion model publicly on the Internet for "real-time"-inference? (CPU or Serverless GPU?) by OkOkPlayer
How "cheap" does it have to be?
Cheapest would be to deploy it on your own using https://github.com/stochasticai/x-stable-diffusion. Let me know if you need more help with real-time inference.
machineko t1_iyisdnp wrote
Reply to [D] Cloud providers for hobby use by gyurisc
It would be nice if you could try out stochastic.ai and provide suggestions on how to improve it. I'd be happy to explain how to build the ML cloud infrastructure yourself, too.
machineko t1_ixzkdbt wrote
Reply to [D]deploy stable diffusion by Dense_History_1786
AWS Lambda provides serverless but you do not need serverless to make something scalable, if you are referring to scaling from single to multiple GPUs as your workload grows.
The simplest method is to containerize your application and use auto-scaling on GCP. You can also auto-scale it on Kubernetes. Alternatively, you can use services like stochastic.ai, which deploy your model in a container and provide auto-scaling out of the box; you just need to upload your model and deploy.
However, I suggest you "accelerate" your inference first. For example, you can use open-source inference engines (see https://github.com/stochasticai/x-stable-diffusion) to easily speed up your inference 2x or more. That means you can generate 2x more images per $ on public clouds.
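If you want a quick baseline before reaching for a dedicated inference engine, a minimal diffusers sketch with fp16 weights and attention slicing already cuts latency and memory noticeably (the model id is an example; x-stable-diffusion goes further with optimized backends like TensorRT):

    import torch
    from diffusers import StableDiffusionPipeline

    # Half-precision weights roughly halve memory and speed up GPU inference
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Attention slicing trades a little speed for a large VRAM reduction
    pipe.enable_attention_slicing()

    image = pipe("a photo of an astronaut riding a horse on mars").images[0]
    image.save("astronaut.png")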
machineko t1_ixwzqpa wrote
Here's a free playground: https://playground.stochastic.ai
Also, you get $20 in credits (equivalent to 4,000 generations) when you sign up.
machineko t1_isdv3te wrote
Reply to comment by phb07jm in [D] Modern MLOps architecture info sources by lifesthateasy
I agree with this comment. Back when the tools were crappy, it might've been better to build from scratch, but with so many good tools available now (often cheaper and better-performing than what you'd build on your own), you should at least try them, especially if you are interested in running deep learning workloads.
There is MLOps software for:
- low-latency inference
- training large language models
- explainable ML
and more.
machineko t1_jecw2v4 wrote
Reply to comment by darkbluetwilight in [D]Suggestions on keeping Llama index cost down by darkbluetwilight
Cerebras-GPT models are Apache-2.0 licensed, so you should be able to use them for free. I'm not sure what you mean by charges; are you referring to the hosted APIs?
Btw, you should use the ones that are instruction fine-tuned.
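For example, running one locally with transformers takes just a few lines (the 1.3B base checkpoint below is an assumption for illustration; swap in an instruction-tuned variant if you find one you like):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "cerebras/Cerebras-GPT-1.3B"  # base model; pick an instruction-tuned variant if available
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tokenizer("Generative AI is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))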