
ggerganov OP t1_irwlluz wrote

I was thinking about this too.

Compiling the code is easy. The problem is that you need to load 75 MB of model data (and that's just the "tiny" model). I guess nobody would want to download 75 MB every time they load a page.

Even if we say you are OK with a 75 MB asset, the next problem is that WASM doesn't support SIMD, so the performance would be much worse than native. How much worse? Not sure.
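To give a sense of what SIMD buys, here is a minimal sketch (illustration only, not code from the project) of a scalar dot product next to an ARM NEON version of the kind the native build relies on - a SIMD-less WASM build would effectively be stuck with the scalar loop:

```c
// Illustration only - not from whisper.cpp. Scalar vs NEON dot product.
#include <arm_neon.h>
#include <stddef.h>

// What a SIMD-less build effectively runs: one multiply-add per iteration.
float dot_scalar(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// NEON version: 4 multiply-adds per iteration (assumes n is a multiple of 4).
float dot_neon(const float *x, const float *y, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (size_t i = 0; i < n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(x + i), vld1q_f32(y + i)); // acc += x*y
    }
    return vaddvq_f32(acc); // horizontal sum of the 4 lanes
}
```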

But nevertheless - it might be fun to try and run it in the browser.


ggerganov OP t1_irw8eho wrote

Essentially, it's the mat mul routine that I have re-implemented. It accounts for more than 90% of the computation time.

I tried the built-in BLAS implementation that comes with Apple's Accelerate framework. My F16 mat mul performed better than cblas_sgemm, and the Accelerate framework doesn't provide F16 overloads.
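For context, this is the kind of FP32 baseline call I'm comparing against - a minimal sketch using Accelerate's CBLAS interface (sizes and buffers are placeholders, matrices assumed row-major):

```c
// Sketch of the FP32 baseline: C = A * B via cblas_sgemm from Accelerate.
// Build with: clang -O3 example.c -framework Accelerate
#include <Accelerate/Accelerate.h>

void sgemm_baseline(int M, int N, int K,
                    const float *A, const float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,   // alpha, A, lda
                      B, N,   // B, ldb
                0.0f, C, N);  // beta, C, ldc
}
```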

I didn't want to include external BLAS implementations, because I wanted an inference implementation that doesn't depend on anything and that you can easily build and try.

Also, a major factor was that this entire project is mostly a learning experience, to understand how transformers work at a lower level and to improve my C programming and optimization skills.

One thing I noticed is that the FP32 mat mul from Torch outperforms my F16 mat mul on M1 for big matrices (> 1024x1024). It seems that it uses MKL under the hood. For bigger sizes, it can be up to 3 times faster. It would be interesting to explore how this can be achieved manually.
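If I had to guess, a big part of the large-matrix speedup comes from cache blocking / tiling. Here is a rough sketch of the idea (not an optimized kernel, and not what MKL actually does internally):

```c
// Cache-blocking sketch for C = A * B (FP32, row-major, n x n matrices).
// The point is to reuse tiles of A and B while they are hot in cache.
#include <stddef.h>
#include <string.h>

#define TILE 64  // tile size chosen arbitrarily for illustration

void matmul_blocked(size_t n, const float *A, const float *B, float *C) {
    memset(C, 0, n * n * sizeof(float));
    for (size_t ii = 0; ii < n; ii += TILE)
    for (size_t kk = 0; kk < n; kk += TILE)
    for (size_t jj = 0; jj < n; jj += TILE)
        for (size_t i = ii; i < ii + TILE && i < n; ++i)
        for (size_t k = kk; k < kk + TILE && k < n; ++k) {
            const float a = A[i*n + k];
            for (size_t j = jj; j < jj + TILE && j < n; ++j) {
                C[i*n + j] += a * B[k*n + j];
            }
        }
}
```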


ggerganov OP t1_irv0gki wrote

Here are some benchmarks that other people did (both vs CPU and vs GPU):

- vs OpenVINO + ONNX on CPU - more than 2x faster

https://github.com/openai/whisper/discussions/208#discussioncomment-3827022

- vs PyTorch (CPU: i7 11800H, GPU: RTX 3080 Laptop):

https://github.com/ggerganov/whisper.cpp/issues/2#issuecomment-1257808576

- whisper.cpp on Xeon processor

https://github.com/ggerganov/whisper.cpp/issues/16

Also, my implementation is focused on performance on M1 chips, and it looks like most of the Python frameworks do not support them properly yet, so I cannot make a proper benchmark.

Additionally, my implementation can also run the "large" model on an Android phone (Samsung A52) - it would be interesting to see how this compares with existing implementations:

https://github.com/ggerganov/whisper.cpp/issues/18#issue-1395784900
