albertzeyer t1_j6a8qvq wrote
Reply to comment by JustOneAvailableName in [D] Why are there no End2End Speech Recognition models using the same Encoder-Decoder learning process as BART (no CTC) ? by KarmaCut132
I mean in the research community, and also all the big players who actually have speech recognition in products, like Google, Apple, Microsoft, Amazon, etc.
Whisper is nice for everyone else. However, as an AED model, it has some disadvantages compared to an RNN-T model. E.g. it does not work well for streaming (getting instant recognition results, usually within 100ms, or 500ms, or max 1 sec). Also, I'm quite sure it has some strange failure cases, as AED models tend to have, like repeating some labels, or skipping to the end of a sequence (or just a chunk) when it gets confused.
JustOneAvailableName t1_j6cfdmr wrote
I worked with Wav2vec a year ago. WER on Dutch was (noticeably) better when fine-tuned than with GCP or Azure, and we didn't use any of our own labeled data. I used CTC mainly because it didn't hurt WER, hugely improved CER, and made inference a lot simpler. Inference cost was also a fraction (less than a cent per hour, assuming the GPU is fully utilized) of the paid services. I kind of assumed back then that others had reached the same conclusions, but they were my own conclusions, so there's plenty I could have done wrong.
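For reference, this is roughly what CTC inference with a fine-tuned wav2vec2 checkpoint looks like via Hugging Face transformers; a minimal sketch, where the checkpoint name and audio file are placeholders rather than the actual setup described above:

```python
# Minimal sketch of wav2vec2 + greedy CTC decoding with Hugging Face transformers.
# Checkpoint name and audio file are illustrative, not taken from the comment above.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-large-xlsr-53-dutch"  # hypothetical choice of Dutch checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name).eval()

waveform, sample_rate = torchaudio.load("example.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, time, vocab)

pred_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding, no LM
print(processor.batch_decode(pred_ids)[0])
```

Greedy decoding keeps inference cheap and simple; a beam search with a language model can be bolted on later if needed.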
Whisper offers this performance level practically out of the box, although with much higher inference costs. Sadly, I haven't had the time yet to fine-tune it, nor have I found the time to optimize inference costs.
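"Out of the box" here really is just a few lines with the openai-whisper package; a sketch, with the model size and file name purely illustrative:

```python
# Sketch of out-of-the-box offline transcription with the openai-whisper package.
# Model size and audio file are illustrative, not from the comment above.
import whisper

model = whisper.load_model("large-v2")                   # bigger models: better WER, higher cost
result = model.transcribe("meeting.wav", language="nl")  # language can also be auto-detected
print(result["text"])
```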
> E.g. it does not work well for streaming (getting instant recognition results, usually within 100ms, or 500ms, or max 1sec)
If you're okay with intermediate results getting improved later, this is doable, although at a multiple of the cost. Offline works like a charm, though. A rough sketch of that idea follows below.
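One way to get that "intermediate results improved later" behaviour is to simply re-run the offline model on a growing audio buffer; a rough sketch under that assumption, with the chunking interval, model size and language purely illustrative:

```python
# Rough sketch of pseudo-streaming: re-transcribe a growing buffer so that
# intermediate results can be revised later. Model and chunk size are illustrative.
import numpy as np
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16_000
buffer = np.zeros(0, dtype=np.float32)

def on_new_audio(chunk: np.ndarray) -> str:
    """Called e.g. every ~1 s with new 16 kHz mono float32 samples."""
    global buffer
    buffer = np.concatenate([buffer, chunk])
    # Re-decoding the whole buffer each time is what makes this a multiple of the cost.
    result = model.transcribe(buffer, language="nl", fp16=False)
    return result["text"]  # earlier words may still change as more context arrives
```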
> Also, I'm quite sure it has some strange failure cases, as AED models tend to have, like repeating some labels, or skipping to the end of a sequence (or just chunk) when it got confused.
True that.
albertzeyer t1_j6ebian wrote
It's a bit strange indeed that the GCP or Azure results are not so great. As said, I actually do research on speech recognition, and Google is probably the biggest player in this field, usually with the very best results.
My explanation is that they don't use their best and biggest models for GCP. Maybe they want to reduce the computational cost as much as possible.
But you also have to be a bit careful about what you compare. Your results might be flawed if your fine-tuning data is close to your validation set (e.g. similar domain, similar acoustic conditions), whereas GCP has very generic models that have to work for all kinds of domains and all kinds of conditions.