Hi,

Is there any way to run LLaMA (or any other model) in such a way that you only pay per API request?

I wanted to test how the LLaMA model would do in my specific use case, but when I went to HF Inference Endpoints, it said that I would have to pay over 3k USD per month (of course, I do not have that much money to spend on a side project).

I would like to test this model by paying on a per-request basis.

Comments

VelvetyPenus t1_jcr1usl wrote on March 18, 2023 at 10:14 PM

Wait two weeks, it will all be free.

MBle OP t1_jcujt4a wrote on March 19, 2023 at 5:53 PM

What information is this prediction based on?

iKlsR t1_jcw7jms wrote on March 20, 2023 at 12:58 AM

Maybe based on how fast things have been moving recently... Relevant: https://replicate.com/blog/llama-roundup

currentscurrents t1_jcqzjil wrote on March 18, 2023 at 9:57 PM

I haven't heard of anybody running LLaMA as a paid API service. I think doing so might violate the license terms against commercial use.

>(or any other) model

OpenAI has a ChatGPT API that costs pennies per request. Anthropic also recently announced one for their Claude language model, but I have not tried it.

veonua t1_jct419t wrote on March 19, 2023 at 10:44 AM

Creating a monopoly on AI can be extremely risky. Although OpenAI was founded to prevent it, recent actions by the company suggest that they may be contributing to monopolization by reducing prices.

danielbln t1_jctk7sy wrote on March 19, 2023 at 1:39 PM

Pennies per request would be a lot; it's a fraction of a penny per request.
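
To put a rough number on "a fraction of a penny", here is a minimal sketch of the arithmetic. It assumes a flat price per 1,000 tokens; the 0.002 USD figure is used as the illustrative rate for the ChatGPT API and should be checked against current pricing. The token counts and the function name `request_cost_usd` are hypothetical.

```python
# Back-of-the-envelope cost of one API request under per-token pricing.
# ASSUMPTION: a flat rate per 1,000 tokens (prompt and completion billed alike).

def request_cost_usd(prompt_tokens: int, completion_tokens: int,
                     usd_per_1k_tokens: float) -> float:
    """Return the cost in USD of a single request."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * usd_per_1k_tokens

# Hypothetical request: 400 prompt tokens + 100 completion tokens.
cost = request_cost_usd(400, 100, usd_per_1k_tokens=0.002)
print(f"{cost * 100:.2f} cents per request")  # 0.10 cents per request
```

At that rate, even a few thousand test requests of this size would cost a couple of dollars in total.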

Philpax t1_jcrgxbb wrote on March 19, 2023 at 12:09 AM

As the other commenter said, it's unlikely anyone will advertise a service like this, as LLaMA's license terms don't allow for it. In your situation, I'd just rent a cloud GPU server (Lambda Labs, etc.) and test the models you care about. It'll only end up costing a dollar or two if you're quick about it.

NotARedditUser3 t1_jcsc9lp wrote on March 19, 2023 at 4:31 AM

You can get LLaMA running on consumer-grade hardware. I believe there's 4-bit and 8-bit quantization for it, which lets it fit in a normal GPU's VRAM; I saw it floating around here.
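
As a rough illustration of why quantization matters here, the sketch below estimates the memory needed for the weights alone at 16, 8, and 4 bits per weight. The parameter counts are the nominal sizes implied by the model names (7B, 13B, and so on), not exact figures, and the estimate ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name `weight_memory_gib` is hypothetical.

```python
# Approximate memory needed to hold model weights at a given precision.
# ASSUMPTION: nominal parameter counts; overhead beyond the weights is ignored.

def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Return approximate weight storage in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

# Nominal LLaMA sizes, taken from the model names.
MODEL_SIZES = {"7B": 7e9, "13B": 13e9, "30B": 30e9, "65B": 65e9}

for name, n_params in MODEL_SIZES.items():
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gib(n_params, bits):.1f} GiB"
        for bits in (16, 8, 4)
    )
    print(f"{name}: {row}")
```

By this estimate, the 7B model drops from about 13 GiB at 16-bit to a little over 3 GiB at 4-bit, which is within reach of many consumer GPUs.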

veonua t1_jct3plc wrote on March 19, 2023 at 10:39 AM

As far as I know, the Meta license forbids this, since the model is for academic purposes only.

tomd_96 t1_jctddsu wrote on March 19, 2023 at 12:35 PM

You can do this using Replicate: https://github.com/replicate/cog-llama