Submitted by [deleted] t3_10z3qdt in MachineLearning
coolmlgirl t1_j819nx4 wrote
Can you share the link to that Hugging Face model so I can see how I may help?
askingforhelp1111 t1_j81ggm0 wrote
Sure, here are the links. Each one has an inference latency of roughly 4-9 seconds per request.
https://huggingface.co/poom-sci/WangchanBERTa-finetuned-sentiment
https://huggingface.co/ayameRushia/bert-base-indonesian-1.5G-sentiment-analysis-smsa
I call each checkpoint like this:

from transformers import pipeline

# checkpoint is the Hugging Face model ID, e.g. one of the links above
nlp = pipeline('sentiment-analysis',
               model=checkpoint,
               tokenizer=checkpoint)
Thank you!
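One quick sanity check before optimizing: time repeated calls and average them, excluding warmup, so one-off costs (model download, first-call initialization) don't inflate the 4-9 s figure. A minimal stdlib-only sketch; `run_once` is a hypothetical stand-in you'd replace with the actual `nlp(text)` call:

```python
import time

def avg_latency_ms(fn, warmup=2, iters=10):
    """Average wall-clock latency of fn() in milliseconds.

    Warmup calls are excluded so one-time costs (model load,
    first-call initialization) don't skew the average.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0

# Stand-in for nlp("some text"); swap in the real pipeline call.
def run_once():
    time.sleep(0.001)

print(f"avg latency: {avg_latency_ms(run_once):.2f} ms")
```

If the averaged number is far below 4-9 s, the slowness is mostly startup overhead rather than per-request compute.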
coolmlgirl t1_j8fmfpi wrote
I used the OctoML platform (https://octoml.ai/) to optimize your model and got the average inference latency down to 2.14ms on an AWS T4 GPU. On an Ice Lake CPU I can get it down to 27.47ms. I'm assuming shapes of [1, 128] for your inputs "input_ids", "attention_mask", and "token_type_ids", but I'd like to confirm your actual shapes so we're comparing apples to apples. Do you know what shapes you're using?
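For the OP: the [1, 128] shape comes from the tokenizer's padding/truncation settings -- with `padding='max_length', truncation=True, max_length=128`, every input becomes a [1, 128] tensor for each of `input_ids`, `attention_mask`, and `token_type_ids`. A stdlib-only sketch of that padding logic (the real work is done by the Hugging Face tokenizer; `pad_id=0` and the example token IDs are assumptions for illustration):

```python
def pad_to_max_length(token_ids, max_length=128, pad_id=0):
    """Mimic tokenizer(padding='max_length', truncation=True):
    truncate to max_length, then right-pad with pad_id.
    Returns (input_ids, attention_mask), each of length max_length.
    """
    ids = token_ids[:max_length]
    mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [pad_id] * (max_length - len(ids))
    return ids, mask

ids, mask = pad_to_max_length([101, 2023, 2003, 102])  # 4 real tokens
print(len(ids), len(mask))  # 128 128 -> one sentence gives shape [1, 128]
print(sum(mask))            # 4 non-pad positions
```

If you call the pipeline without fixed padding, shapes vary with sentence length, which matters for comparing optimized latencies.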
coolmlgirl t1_j8fml7y wrote
My results above are for this model: https://huggingface.co/ayameRushia/bert-base-indonesian-1.5G-sentiment-analysis-smsa
It's pretty easy to use the platform to do the same for your other model automatically -- we can discuss that one once we've sorted out this one.