Submitted by bo_peng t3_11iwt1b in MachineLearning

Hi everyone. I have tested RWKV [loss vs token position] on 10000 ctx4k+ documents from the Pile:

https://preview.redd.it/3ld2629h6xla1.png?width=941&format=png&auto=webp&v=enabled&s=e5afbb8d9704595f4d61db4e2c307ee09e4a4e69

RWKV 1B5-4k is mostly flat after ctx1500, but 3B-4k, 7B-4k, and 14B-4k still show downward slopes, and they get better with scale. This debunks the old view that RNNs cannot model long ctxlens. These ctx4096 models are available at https://huggingface.co/BlinkDL.

We can predict that RWKV 100B will be great, and RWKV 1T is probably all you need :)

https://preview.redd.it/e3tbivtx6xla1.png?width=1174&format=png&auto=webp&v=enabled&s=2e4c1f4806c88a5b2820a0458321a0401dcc8087

RWKV is simple. You can read https://arxiv.org/abs/2302.13939 (SpikeGPT) which is inspired by RWKV and has plenty of explanations. The RWKV paper is coming too.

Here is RWKV model+inference+generation (yes, everything) in 150 lines of Python:

https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

It is a slower version but it works 🙂 hopefully this makes it easier to understand and optimize RWKV. [Only the preprocessing of the context is slower here, because I am using RNN mode to process the context token by token. The faster sequential version is in https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py]
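For a rough feel of what RNN-mode generation looks like, here is a minimal sketch; `forward(token, state) -> (logits, state)` is an assumed interface for illustration, not the exact signature used in the 150-line file:

```python
import numpy as np

def sample_greedy(logits):
    # Toy sampling: pick the most likely token (real code uses temperature / top-p).
    return int(np.argmax(logits))

def generate(forward, prompt_tokens, n_new_tokens):
    # RNN mode: feed one token at a time and carry a small fixed-size state between calls.
    state, logits = None, None
    for tok in prompt_tokens:          # the "slow" context preprocessing, token by token
        logits, state = forward(tok, state)
    out = []
    for _ in range(n_new_tokens):      # generation reuses the same single-token step
        tok = sample_greedy(logits)
        out.append(tok)
        logits, state = forward(tok, state)
    return out
```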

I believe in Open AIs built by communities, and you are welcome to join the RWKV community :) Please feel free to message the RWKV Discord if you are interested.

RWKV has been mostly a single-developer project for the past 2 years: designing, tuning, coding, optimization, distributed training, data cleaning, managing the community, answering questions... All your help will be much appreciated. https://github.com/BlinkDL/RWKV-LM

It would be great if we could build optimized INT8/INT4 inference for Nvidia/AMD/Intel GPUs, Intel/AMD CPUs, and Android/iOS phones. Because RWKV has an RNN mode, it is very hardware-friendly (no KV cache needed). Let's build a future where everyone can run LLMs.


Comments


_Arsenie_Boca_ t1_jb0sm2c wrote

I have been following your Reddit posts for a while now, but I still don't think I fully understand it. Have you considered writing a paper? It might help people get the method and might fuel the open-source help you get.


luxsteele t1_jb1b68d wrote

Totally agree.

I have been following this for some time, but I can't fully understand it or explain it to my collaborators.

I work in ML and have quite a bit of experience with transformers, and I still can't fully get it, let alone convince my collaborators that it is worth pursuing.

It is paramount that we have a paper that explains this in more detail if we want the community to take it seriously.

Please do it!


bo_peng OP t1_jb1po7i wrote

Will the 150 lines help? Please read the code first :)

https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

This is ALL you need for RWKV inference.

And you can read https://arxiv.org/abs/2302.13939 (SpikeGPT) which is inspired by RWKV and has plenty of explanations :)


_Arsenie_Boca_ t1_jb1wjfi wrote

It does help, but it certainly doesn't make everything clear. I am confident I could run inference with it, but my interest is more academic than practical.

What is the magic number 5 all about? It seems to appear all over the code without explanation.

Are the time mixing and channel mixing operations novel or were they introduced by a citable work?

How does the parallelization during training work?


Art10001 t1_jb0q49f wrote

If you are RWKV's creator, kudos to you, the work you have done is amazing.

Reminder for everybody: it can run rather quickly on CPU, meaning it can truly run locally on phones. It is also 100 times faster and uses 100 times less (V)RAM.


royalemate357 t1_jb0smq3 wrote

It's awesome work, but I don't think anyone is claiming anywhere near 100x faster speed and lower VRAM, are they?

>RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M
>
>GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M

From this it sounds like roughly a ~2x improvement (don't get me wrong, a 2x improvement at the same performance is great). As for memory, you still have to store all the parameters of the RWKV model just like GPT, and that takes up most of the memory if you're trying to fit models on consumer hardware. Memory is just lower because there's no need for a KV cache.
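A rough back-of-envelope makes the KV-cache point concrete. The layer/width numbers below are assumptions for a ~1.5B model (not official figures), and the "5 state vectors per layer" follows the 150-line implementation:

```python
# Rough memory comparison: transformer KV cache vs RWKV recurrent state (fp16).
# Assumed dimensions for a ~1.5B model; real configs may differ.
n_layer, n_embd, ctx_len, bytes_per = 24, 2048, 4096, 2

kv_cache   = 2 * n_layer * ctx_len * n_embd * bytes_per   # keys + values for every position
rwkv_state = 5 * n_layer * n_embd * bytes_per              # fixed-size state, no ctx_len term

print(f"KV cache  : {kv_cache / 2**20:.0f} MiB")    # ~768 MiB at ctx 4096
print(f"RWKV state: {rwkv_state / 2**20:.2f} MiB")  # ~0.47 MiB
```

Either way, the ~3 GB of fp16 parameters dominate, which is the point above: the win is dropping the ctx_len-proportional term, not the parameters.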


Art10001 t1_jb172wo wrote

It once said 100 times faster and 100 times less (V)RAM here. However, it now says that RWKV-14B can be run with only 3 GB of VRAM, which is still a massive improvement, because a 14B model normally requires about 30 GB of VRAM or thereabouts.


royalemate357 t1_jb1h7wl wrote

Hmm, I very much doubt it could've run 100x faster for the same parameter count, as you are memory-bandwidth bound (both GPT and RWKV have to load the parameters n times to generate n tokens). Also, I'm somewhat skeptical that you only need 3GB for 14B parameters *without offloading the model*, as even 4-bit quantization needs 14B/2 = 7GB. And offloading the model is slow to the point of being unusable, since you need to do CPU<->GPU transfers.


xx14Zackxx t1_jb1zk8v wrote

It depends on the context length. Since attention scales as n^2 and an RNN scales as n, the speedup grows with document length, roughly a factor of n. Now, there are also some slowdowns: I am certain the RNN solution here has to do some tricks that are more complex than a simple RNN. But the longer the context, the larger the speedup relative to a transformer, so 100x on a large document is not necessarily impossible (at least at inference time).

I have a hard time believing the memory claims as well, though. Again, I really wish the author would write a paper about it, because as far as I can see, if he's using standard backpropagation through time to train, the memory requirements should be quite dramatic. But again, I think he's doing something special with his RNN, I just don't know what it is.
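A toy count of per-token work illustrates the scaling argument above; the constants are made up, only the growth rates matter, and real kernels are dominated by memory bandwidth as noted earlier in the thread:

```python
# Toy scaling comparison: generating token t with full attention touches all t
# previous positions, while an RNN step does a constant amount of work per token.
def attention_steps(ctx_len):
    return sum(range(1, ctx_len + 1))   # O(n^2) total over the document

def rnn_steps(ctx_len):
    return ctx_len                       # O(n) total: one fixed-size step per token

for n in (1_000, 10_000, 100_000):
    print(n, attention_steps(n) // rnn_steps(n))   # ratio grows roughly as n/2
```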


Nextil t1_jb1sg1c wrote

I think they mean with offloading/streaming you need 3GB minimum, but it's much slower.


ThirdMover t1_jb0x91p wrote

I think this is really exciting. LLM applications like ChatGPT still seem to mostly pipe the result of model sampling directly out, but with 100-times-faster inference, complex chain-of-thought procedures with multiple differently prompted model instances (well, the same model but different contexts) could be chained and work together to improve their output while still running close to real time.


Spare_Side_5907 t1_jb0rhw1 wrote

Is this similar to Toeplitz Neural Network for Sequence Modeling https://openreview.net/forum?id=IxmWsm4xrua ?


bo_peng OP t1_jb1qws0 wrote

TNN is like convolution, while RWKV can be written as a CNN too (RWKV v1 is a CNN). So there's some similarity, though not much :)


estrafire t1_jc2umln wrote

Any particular reason for moving from CNN to RNN?


I_will_delete_myself t1_jb32fo5 wrote

What's the reason to use this over a transformer? Transformers allow transfer learning and are easier to parallelize. Ah, I saw your Zhihu. What company do you work for?


Philpax t1_jb471z4 wrote

There's information about this in the README, but I'll admit that it's a little too technical and doesn't have a high-level description of the ideas. Looking forward to the paper!


I_will_delete_myself t1_jb532v5 wrote

Intelligence is the ability to distill complex information into a simple explanation that a child can understand.

It makes me skeptical when someone can't explain it beyond performance reasons. Most people just use the cloud anyway, because ML networks, regardless of size, take up a lot of battery.


Philpax t1_jb53nhe wrote

As far as I can tell, the sparse documentation is just because they've been in pure R&D mode. I've played around with it in their Discord server and can confirm it does perform well, but I've struggled to get it working locally.


bo_peng OP t1_jb9bdw3 wrote

Directly from the RWKV-LM GitHub:

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it combines the best of RNN and transformer: great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
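For the curious, here is a minimal numpy sketch of the recurrent form of the time-mixing ("WKV") update, written from memory of the 150-line file, so treat the variable layout as an approximation rather than the reference implementation. `aa`/`bb` are the running numerator/denominator of the decayed weighted average over past tokens, `pp` is a running max exponent kept for numerical stability, and `time_decay` is assumed to already be negative:

```python
import numpy as np

def wkv_step(k, v, aa, bb, pp, time_first, time_decay):
    # Output for the current token: blend the accumulated past (aa/bb) with the
    # current (k, v), with the current token boosted by time_first ("u").
    ww = time_first + k
    qq = np.maximum(pp, ww)
    e1, e2 = np.exp(pp - qq), np.exp(ww - qq)
    wkv = (e1 * aa + e2 * v) / (e1 * bb + e2)

    # State update: decay the past by time_decay ("w", negative), then add the
    # current token; the shared max exponent qq keeps the exponentials stable.
    ww = pp + time_decay
    qq = np.maximum(ww, k)
    e1, e2 = np.exp(ww - qq), np.exp(k - qq)
    return wkv, e1 * aa + e2 * v, e1 * bb + e2, qq
```

Because this is just an exponentially decayed weighted average, the same quantity can be computed for all positions of a training sequence at once, which is what lets it train in parallel like a GPT while inferring like an RNN.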


deep-yearning t1_jb625mr wrote

Attention is all you want, but not all you need
