Submitted by [deleted] t3_10z3qdt in MachineLearning
[deleted]
Sure, I have a few links. All of them have an inference latency of 4-9 seconds.
https://huggingface.co/poom-sci/WangchanBERTa-finetuned-sentiment
https://huggingface.co/ayameRushia/bert-base-indonesian-1.5G-sentiment-analysis-smsa
I call each checkpoint like this:

from transformers import pipeline

# checkpoint is a Hugging Face model id, e.g. one of the links above
nlp = pipeline('sentiment-analysis',
               model=checkpoint,
               tokenizer=checkpoint)
Thank you!
I used the OctoML platform (https://octoml.ai/) to optimize your model and got the average inference latency down to 2.14ms on an AWS T4 GPU. On an Ice Lake CPU I can get the latency down to 27.47ms. I'm assuming shapes of [1, 128] for your inputs "input_ids", "attention_mask", and "token_type_ids", but I want to confirm your actual shapes so we're comparing apples to apples. Do you know what shapes you're using?
My results above are for this model: https://huggingface.co/ayameRushia/bert-base-indonesian-1.5G-sentiment-analysis-smsa

It's pretty easy to use the platform to automatically do the same for your other model too; we can discuss that one later once we figure this one out.
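If you're not sure how to check your input shapes, here's a quick sketch that prints them (I'm assuming the ayameRushia model above and padding to length 128; adjust to however you actually tokenize):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "ayameRushia/bert-base-indonesian-1.5G-sentiment-analysis-smsa")
enc = tok("contoh kalimat", padding="max_length", max_length=128,
          truncation=True, return_tensors="pt")
for name, tensor in enc.items():
    # expect input_ids, attention_mask, token_type_ids, each (1, 128)
    print(name, tuple(tensor.shape))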
Check out Triton Inference Server for model deployment.
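For a sense of what calling it looks like, here's a minimal client sketch. Assumptions on my part: the model is already exported and served by Triton under the hypothetical name "bert_sentiment", takes [1, 128] int64 inputs, and emits an output tensor I'm calling "logits" (the actual name depends on your export):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
inputs = []
for name in ("input_ids", "attention_mask", "token_type_ids"):
    inp = httpclient.InferInput(name, [1, 128], "INT64")
    inp.set_data_from_numpy(np.zeros((1, 128), dtype=np.int64))  # dummy data
    inputs.append(inp)
result = client.infer(model_name="bert_sentiment", inputs=inputs)
print(result.as_numpy("logits"))  # output name depends on your export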
Thanks for the idea!
Are you interested in reducing the latency or just cutting down the cost? Can you run the workload on GPUs instead?
For BERT-type models, some compression plus an optimized inference library can easily get you a 5-10x speedup. If you're interested, I'd be happy to share more resources on this.
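As one concrete example of the kind of compression I mean, dynamic INT8 quantization in plain PyTorch is a quick win on CPU. A minimal sketch, using the Indonesian model linked above:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "ayameRushia/bert-base-indonesian-1.5G-sentiment-analysis-smsa")
# Replace the Linear layers with INT8 dynamic-quantized versions;
# this often speeds up CPU inference noticeably with little accuracy loss
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

Exporting to an inference runtime like ONNX Runtime can usually stack further gains on top of this.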
Many thanks for the reply; I'd love to read your resources on compression and inference.
I'm keen on cutting down costs. I previously ran on GPU via an AWS EC2 instance, but we have to tighten the company's belt this year, and my manager suggested running on CPU. I'd love to hear your suggestions too (if any).
It depends on which models you're using, but for most transformers, running on GPUs can be much more efficient than CPUs once you consider $ per million inferences (or inferences per $).
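To make that concrete, here's a back-of-envelope comparison using the single-stream latencies quoted earlier in this thread. The on-demand instance prices are my assumptions (check current AWS pricing), and batching would change the picture a lot:

# Rough $ per million single-stream inferences
def dollars_per_million(latency_s, price_per_hour):
    hours = latency_s * 1_000_000 / 3600
    return hours * price_per_hour

# 2.14 ms on a T4 (g4dn.xlarge, assumed ~$0.526/hr on demand)
print(dollars_per_million(0.00214, 0.526))   # ~ $0.31 / M inferences
# 27.47 ms on an Ice Lake CPU (c6i.2xlarge, assumed ~$0.34/hr on demand)
print(dollars_per_million(0.02747, 0.34))    # ~ $2.59 / M inferences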
Are there specific EC2 instances you have to use or can you deploy on any EC2 instance?
Can you share the link to that Hugging Face model so I can see how I may help?