Submitted by bo_peng t3_11iwt1b in MachineLearning
Hi everyone. I have tested RWKV [loss vs token position] on 10,000 documents from the Pile that are longer than 4k tokens (ctx4k+):
RWKV 1B5-4k is mostly flat after ctx1500, but 3B-4k, 7B-4k, and 14B-4k have some slopes, and they are getting better. This debunks the old view that RNNs cannot model long context lengths. These ctx4096 models are available at https://huggingface.co/BlinkDL.
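For anyone who wants to plot this kind of curve, here is a minimal sketch of the averaging step, assuming you already have the per-token loss for every document. This is not the actual evaluation script; the function name `mean_loss_by_position` and the toy numbers are made up for illustration.

```python
def mean_loss_by_position(per_doc_losses, ctx_len):
    """Average per-token loss at each position across documents.

    per_doc_losses[d][t] is the loss (negative log-likelihood) of token t
    in document d. Documents shorter than ctx_len only contribute to the
    positions they cover; positions with no data come back as NaN.
    """
    totals = [0.0] * ctx_len
    counts = [0] * ctx_len
    for losses in per_doc_losses:
        for t, loss in enumerate(losses[:ctx_len]):
            totals[t] += loss
            counts[t] += 1
    return [
        totals[t] / counts[t] if counts[t] else float("nan")
        for t in range(ctx_len)
    ]


# Toy example: two short "documents" with hypothetical losses.
curve = mean_loss_by_position([[4.0, 2.0, 1.0], [2.0, 2.0]], ctx_len=3)
print(curve)
```

A curve that keeps going down at late positions means the model is still getting useful information from tokens far back in the context.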
We can predict that RWKV 100B will be great, and RWKV 1T is probably all you need :)
RWKV is simple. You can read https://arxiv.org/abs/2302.13939 (SpikeGPT) which is inspired by RWKV and has plenty of explanations. The RWKV paper is coming too.
Here is RWKV model+inference+generation (yes, everything) in 150 lines of Python:
https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py
It is a slower version but works 🙂 Hopefully this can make it easier to understand and optimize RWKV. [Only the preprocessing of the context is slower here, because I am using RNN mode to process the context token by token. The faster sequence-mode version is in https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py]
I believe in Open AIs built by communities, and you are welcome to join the RWKV community :) Please feel free to message in the RWKV Discord if you are interested.
RWKV has been mostly a single-developer project for the past 2 years: designing, tuning, coding, optimization, distributed training, data cleaning, managing the community, answering questions... All your help will be much appreciated. https://github.com/BlinkDL/RWKV-LM
It would be great if we could build optimized INT8/INT4 inference for Nvidia/AMD/Intel GPUs, Intel/AMD CPUs, and Android/iOS phones. Because RWKV has RNN mode, it is very hardware-friendly (no need for a KV cache). Let's build a future where everyone can run LLMs.
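To show why RNN mode needs no KV cache, here is a simplified single-channel sketch of a WKV-style recurrence. It is an illustration under assumptions, not the real code: it skips the numerical stabilization and everything else in the model (see RWKV_in_150_lines.py for the actual version), and the names `wkv_step` and `wkv_direct` are hypothetical.

```python
import math


def wkv_step(state, k, v, w, u):
    """One recurrent step for a single channel.

    state: (a, b), the running decayed numerator and denominator.
    k, v:  key and value of the current token.
    w:     positive decay rate; u: bonus applied to the current token.
    Returns (output, new_state). The state is always two floats,
    no matter how many tokens have been processed.
    """
    a, b = state
    cur = math.exp(u + k)
    out = (a + cur * v) / (b + cur)
    decay = math.exp(-w)
    new_state = (decay * a + math.exp(k) * v, decay * b + math.exp(k))
    return out, new_state


def wkv_direct(ks, vs, w, u):
    """Output for the last token, recomputed from the full history."""
    t = len(ks) - 1
    num = sum(math.exp(-(t - 1 - i) * w + ks[i]) * vs[i] for i in range(t))
    den = sum(math.exp(-(t - 1 - i) * w + ks[i]) for i in range(t))
    cur = math.exp(u + ks[t])
    return (num + cur * vs[t]) / (den + cur)


# Toy inputs (made up). Process token by token, keeping only the state.
ks = [0.1, -0.3, 0.5, 0.2]
vs = [1.0, 2.0, -1.0, 0.5]
w, u = 0.5, 0.3
state = (0.0, 0.0)
for k, v in zip(ks, vs):
    out, state = wkv_step(state, k, v, w, u)

# The recurrent result matches the full-history computation.
assert abs(out - wkv_direct(ks, vs, w, u)) < 1e-9
```

The recurrent version only carries `state` forward, so memory per channel stays constant as the context grows, whereas an attention-style computation has to keep every past key and value around.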
Art10001 t1_jb0q49f wrote
If you are RWKV's creator, kudos to you, the work you have done is amazing.
Reminder for everybody: it can run rather quickly on a CPU, meaning it can truly run locally on phones. It is also 100 times faster, and uses 100 times less (V)RAM.