Submitted by ramv0001 t3_10vyytq in MachineLearning

I would like to know some of the best practices for converting PyTorch models to embedded C (bare-metal microcontrollers) during (A) the initial phase and (B) deployment.

A. The initial phase is about profiling the model's performance (RAM usage and processing time) on the target hardware.

I understand that TensorFlow Lite might be the best route for initial profiling, but it has restrictions. It would be great if you could share the toolchain you follow. Current toolchain: 1. PyTorch -> 2. ONNX -> 3. Keras -> 4. TensorFlow Lite or 5. TensorFlow Lite Micro
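For reference, step 1 -> 2 of that pipeline is usually a `torch.onnx.export` call. A minimal sketch — the model choice, input shape, and file name here are just placeholders, not anything from the thread:

```python
# Step 1 -> 2: export a PyTorch model to ONNX.
# Model, input shape, and file name are placeholders.
import torch
import torchvision

model = torchvision.models.mobilenet_v2(num_classes=2)
model.eval()

dummy_input = torch.randn(1, 3, 96, 96)  # NCHW, channels-first
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=13,
    input_names=["input"],
    output_names=["output"],
)
```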

B. Deployment means running inference in production on the target hardware. I think hand-coding in C is the best way.

Please ignore optimisation techniques in the workflow for simplicity.

6

Comments


gosnold t1_j7l8que wrote

Look up NNOM

2

mikljohansson t1_j7ojjjm wrote

I have been building a PyTorch > ONNX > TFLite > TFMicro toolchain for a project to get a vision model running on an ESP32-CAM with PlatformIO and the Arduino framework. Perhaps it could be of use as a reference:

https://github.com/mikljohansson/mbot-vision

Some caveats to consider when embarking on this kind of project

  • PyTorch/ONNX uses channels-first memory format, while TensorFlow is channels-last. Converting the model with onnx-tf inserts lots of Transpose ops into the graph, which decreases performance (by 3x for my model) and increases memory usage. I'm using the onnx2tf module instead, which also converts operators to channels-last (see the sketch below)
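Roughly what that conversion looks like with onnx2tf's Python API — paths are placeholders, and the exact options are worth checking against the onnx2tf README:

```python
# Convert the ONNX graph to a channels-last TensorFlow SavedModel.
# onnx2tf transposes weights at conversion time rather than inserting
# runtime Transpose ops the way onnx-tf does.
import onnx2tf

onnx2tf.convert(
    input_onnx_file_path="model.onnx",   # e.g. from torch.onnx.export
    output_folder_path="saved_model",    # placeholder output directory
)
```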

  • You may want to fully quantize the model to int8, since fp16/fp32 is really slow on smaller MCUs, especially those lacking FPUs and vector instructions. And watch out for Quantize/Dequantize ops in the converted graph: they mean some op didn't support quantization and had to be wrapped and executed (slowly) in fp16/fp32 mode (see the snippet below)
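A sketch of full-int8 post-training quantization with the standard TFLite converter. The input shape and calibration data here are placeholders — a real representative dataset should yield actual input samples:

```python
# Full int8 post-training quantization with the TFLite converter.
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibrates activation ranges; random data is a placeholder only.
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 3).astype(np.float32)]  # NHWC

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Fail the conversion instead of silently falling back to float ops:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```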

  • There may be a lot of performance to gain by using hardware-optimized kernels, but it depends on which MCU and which operators your model uses. E.g. for ESP32 there's ESP-NN, which greatly sped up inference times for my project (2x)

https://github.com/espressif/esp-nn https://github.com/espressif/tflite-micro-esp-examples

And for really tiny MCUs there's this library, which could perhaps be useful; it doesn't support many operators, but it did work in my testing for simple networks

https://github.com/sipeed/TinyMaix

  • How to figure out memory needs and performance. It's a bit trickier. I've simply been using the torchinfo module, plus the graph output and graph statistics that onnx2tf displays, to see how many multiply-accumulates the model uses and the approximate parameter and tensor memory usage (see the sketch below). Then I've run an improvement cycle: "train" the model for 1 step, deploy it to the hardware to measure the FPS, then adjust the hyperparameters and model architecture until the FPS is acceptable. Then train it fully to see if that model config can do the job. And then iterate...
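For the static part of that estimate, torchinfo's summary reports parameter counts, activation sizes, and mult-adds. A minimal sketch, with a placeholder model and input shape:

```python
# Quick static estimate of memory footprint and compute cost.
import torchvision
from torchinfo import summary

model = torchvision.models.mobilenet_v2(num_classes=2)  # placeholder
summary(model, input_size=(1, 3, 96, 96))
# The report includes total params, params size, forward/backward pass
# (activation) size, and total mult-adds -- a rough proxy for how much
# work the MCU has to do per inference.
```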
2

ramv0001 OP t1_j7ovy25 wrote

Yes, completely agree on onnx2tf.

Have you tried using emulators instead of actual hardware?

1

mikljohansson t1_j7p0o1o wrote

Nope, haven't used any emulators for this project. The ESP32 hardware I've been using is so cheap and convenient that there's been no need.

2

mikljohansson t1_j7ok819 wrote

What kind of MCU are you targeting? It depends a lot on the capabilities of the MCU: how fast it is, how much memory it has, whether it has a dedicated NPU/TPU, vector instructions, ...

2