
ggerganov OP t1_irwlluz wrote

I was thinking about this too.

Compiling the code is easy. The problem is that you need to load 75 MB of model data (and that's just the "tiny" model). I guess nobody would want to download 75 MB every time they load a page.

Even if we say you are OK with a 75 MB asset, the next problem is that WASM doesn't support SIMD, so the performance would be much worse than native. How much worse? Not sure.
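To give a sense of what SIMD buys, here is a minimal sketch (illustration only, not code from the project) of a scalar dot product next to an ARM NEON version of the kind the native build relies on - a SIMD-less WASM build would effectively be stuck with the scalar loop:

```c
// Illustration only - not from whisper.cpp. Scalar vs NEON dot product.
#include <arm_neon.h>
#include <stddef.h>

// What a SIMD-less build effectively runs: one multiply-add per iteration.
float dot_scalar(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// NEON version: 4 multiply-adds per iteration (assumes n is a multiple of 4).
float dot_neon(const float *x, const float *y, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (size_t i = 0; i < n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(x + i), vld1q_f32(y + i)); // acc += x*y
    }
    return vaddvq_f32(acc); // horizontal sum of the 4 lanes
}
```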

But nevertheless - it might be fun to try and run it in the browser.


ggerganov OP t1_irw8eho wrote

Essentially, it's the mat mul routine that I have re-implemented. It accounts for more than 90% of the computation time.

I tried the built-in BLAS implementation that comes with Apple's Accelerate framework. My F16 mat mul performed better than cblas_sgemm, and the Accelerate framework doesn't provide F16 overloads.
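For context, this is the kind of FP32 baseline call I'm comparing against - a minimal sketch using Accelerate's CBLAS interface (sizes and buffers are placeholders, matrices assumed row-major):

```c
// Sketch of the FP32 baseline: C = A * B via cblas_sgemm from Accelerate.
// Build with: clang -O3 example.c -framework Accelerate
#include <Accelerate/Accelerate.h>

void sgemm_baseline(int M, int N, int K,
                    const float *A, const float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,   // alpha, A, lda
                      B, N,   // B, ldb
                0.0f, C, N);  // beta, C, ldc
}
```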

I didn't want to include external BLAS implementations, because I wanted an inference implementation that doesn't depend on anything and that you can easily build and try.

Also, a major factor was that this entire project is mostly a learning experience, to understand how transformers work at a lower level and to improve my C programming and optimization skills.

One thing I noticed is that the FP32 mat mul from Torch outperforms my F16 mat mul on M1 for big matrices (> 1024x1024). It seems that it uses MKL under the hood. For bigger sizes, it can be up to 3 times faster. It would be interesting to explore how this can be achieved manually.
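If I had to guess, a big part of the large-matrix speedup comes from cache blocking / tiling. Here is a rough sketch of the idea (not an optimized kernel, and not what MKL actually does internally):

```c
// Cache-blocking sketch for C = A * B (FP32, row-major, n x n matrices).
// The point is to reuse tiles of A and B while they are hot in cache.
#include <stddef.h>
#include <string.h>

#define TILE 64  // tile size chosen arbitrarily for illustration

void matmul_blocked(size_t n, const float *A, const float *B, float *C) {
    memset(C, 0, n * n * sizeof(float));
    for (size_t ii = 0; ii < n; ii += TILE)
    for (size_t kk = 0; kk < n; kk += TILE)
    for (size_t jj = 0; jj < n; jj += TILE)
        for (size_t i = ii; i < ii + TILE && i < n; ++i)
        for (size_t k = kk; k < kk + TILE && k < n; ++k) {
            const float a = A[i*n + k];
            for (size_t j = jj; j < jj + TILE && j < n; ++j) {
                C[i*n + j] += a * B[k*n + j];
            }
        }
}
```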


ggerganov OP t1_irv0gki wrote

Here are some benchmarks that other people did (both vs CPU and vs GPU):

- vs OpenVINO + ONNX on CPU - more than 2x faster

https://github.com/openai/whisper/discussions/208#discussioncomment-3827022

- vs PyTorch (CPU: i7 11800H, GPU: RTX 3080 Laptop):

https://github.com/ggerganov/whisper.cpp/issues/2#issuecomment-1257808576

- whisper.cpp on Xeon processor

https://github.com/ggerganov/whisper.cpp/issues/16

Also, my implementation is focused on performance on M1 chips, and it looks like most of the Python frameworks do not support them properly yet, so I cannot make a proper benchmark.

Additionally, my implementation can also run the "large" model on an Android phone (Samsung A52) - it would be interesting to see how this compares with existing implementations:

https://github.com/ggerganov/whisper.cpp/issues/18#issue-1395784900
