mikljohansson t1_jckedf9 wrote

Very interesting work! I've been following this project for a while now

Can I ask a few questions?

  • What's the difference between RWKV-LM and ChatRWKV? E.g. is ChatRWKV mainly RWKV-LM streamlined for inference and ease of use, or are there more differences?

  • Are you planning to fine-tune on the Stanford Alpaca dataset (as was recently done for LLaMA and GPT-J to create instruct versions of them), or on a similar GPT-generated instruction dataset? I'd love to see an instruct-tuned version of RWKV-LM 14B with an 8k+ context len!


mikljohansson t1_j87y870 wrote

Trying to teach my daughter and her cousins a bit about programming and machine learning. We're building a simple robot with an object detection model and Scratch block programming, so they can get it to chase after the objects it recognises. It works fine, but the kids seem to enjoy driving the robot around via remote and looking through its camera more than programming it 😅 There's an image in the repo readme

https://github.com/mikljohansson/mbot-vision


mikljohansson t1_j7ojjjm wrote

I've been building a PyTorch > ONNX > TFLite > TFLite Micro toolchain for a project that runs a vision model on an ESP32-CAM with PlatformIO and the Arduino framework. Perhaps it could be of use as a reference

https://github.com/mikljohansson/mbot-vision
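
The conversion flow looks roughly like this, just a minimal sketch rather than the exact code from the repo (the toy model, input size, opset and file paths are all placeholders):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real detection model, only to illustrate the flow
class TinyVisionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=2, padding=1)
        self.head = nn.Conv2d(8, 1, 1)

    def forward(self, x):
        return torch.sigmoid(self.head(torch.relu(self.conv(x))))

model = TinyVisionModel().eval()

# 1) Export to ONNX in channels-first (NCHW) layout
dummy = torch.randn(1, 3, 96, 96)
torch.onnx.export(model, dummy, "model.onnx", opset_version=13,
                  input_names=["image"], output_names=["mask"])

# 2) Convert to a channels-last (NHWC) TensorFlow SavedModel with onnx2tf,
#    which rewrites the operators instead of wrapping them in Transpose ops.
#    CLI equivalent (if I remember the flags right): onnx2tf -i model.onnx -o saved_model
import onnx2tf
onnx2tf.convert(input_onnx_file_path="model.onnx",
                output_folder_path="saved_model")
```

From the SavedModel you can then produce the quantized .tflite and run it with tflite-micro on the MCU.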

Some caveats to consider when embarking on this kind of project

  • PyTorch/ONNX uses channels-first memory format, while TensorFlow is channels-last. Converting the model with onnx-tf inserts lots of Transpose ops into the graph, which decreased performance (by about 3x for my model) and increased memory usage. I'm using the onnx2tf module instead, which also converts the operators themselves to channels-last

  • You may want to fully quantize the model to int8, since fp16/fp32 is really slow on smaller MCUs, especially those lacking FPUs and vector instructions. Watch out for Quantize/Dequantize ops in the converted graph: they mean some op didn't support quantization and had to be wrapped and executed (slowly) in fp16/fp32 mode. There's a quantization sketch after this list

  • There may be lots of performance to gain from hardware-optimized kernels, but it depends on which MCU and which operators your model uses. E.g. for ESP32 there's ESP-NN, which greatly sped up inference for my project (about 2x)

https://github.com/espressif/esp-nn https://github.com/espressif/tflite-micro-esp-examples

And for really tiny MCUs there's this library, which could perhaps be useful. It doesn't support that many operators, but it did work in my testing for simple networks

https://github.com/sipeed/TinyMaix

  • Figuring out memory needs and performance is a bit trickier. I've simply been using the torchinfo module, plus the graph output and statistics that onnx2tf displays, to see how many multiply-adds the model uses and the approximate parameter and tensor memory usage (see the torchinfo sketch below). Then I've had an improvement cycle: "train" the model for 1 step, deploy it to the hardware to measure the FPS, and adjust the hyperparameters and model architecture until the FPS is acceptable. Then train it fully to see if that model config can do the job. And then iterate...
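
To illustrate the full int8 quantization step, here's a minimal sketch using the standard TFLite converter on the SavedModel that onnx2tf produces. The representative dataset below is just random noise for brevity; you'd want to feed real preprocessed camera frames instead:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # In practice: yield real preprocessed frames so the calibration
    # ranges match what the model will see on the device
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full int8 quantization; conversion fails instead of silently
# falling back to float ops that would be slow on the MCU
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```

Opening the resulting .tflite in Netron is a quick way to spot any leftover Quantize/Dequantize pairs.

And a rough sketch of the torchinfo part of the improvement loop (the stand-in model and input size are just examples):

```python
import torch.nn as nn
from torchinfo import summary

# Any nn.Module works here; a tiny stand-in model for illustration
model = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1),
                      nn.ReLU(),
                      nn.Conv2d(8, 1, 1))

stats = summary(model, input_size=(1, 3, 96, 96), verbose=0)
print("mult-adds:", stats.total_mult_adds)  # rough proxy for inference time
print("params:   ", stats.total_params)     # rough proxy for flash/RAM usage
```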
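These numbers only get you in the right ballpark, which is why the deploy-and-measure step on real hardware is still needed.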