Submitted by t3_11iwt1b in MachineLearning

Hi everyone. I have measured RWKV [loss vs token position] on 10,000 documents from the Pile that are each at least 4k tokens long:

https://preview.redd.it/3ld2629h6xla1.png?width=941&format=png&auto=webp&v=enabled&s=e5afbb8d9704595f4d61db4e2c307ee09e4a4e69

RWKV 1B5-4k is mostly flat after ctx1500, but 3B-4k and 7B-4k and 14B-4k have some slopes, and they are getting better. This debunks the old view that RNNs cannot model long ctxlens. These ctx4096 models are available at https://huggingface.co/BlinkDL.

We can predict that RWKV 100B will be great, and RWKV 1T is probably all you need :)

https://preview.redd.it/e3tbivtx6xla1.png?width=1174&format=png&auto=webp&v=enabled&s=2e4c1f4806c88a5b2820a0458321a0401dcc8087

RWKV is simple. You can read https://arxiv.org/abs/2302.13939 (SpikeGPT) which is inspired by RWKV and has plenty of explanations. The RWKV paper is coming too.

Here is RWKV model+inference+generation (yes, everything) in 150 lines of Python:

https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

It is a slower version, but it works 🙂 Hopefully this makes RWKV easier to understand and optimize. [Only the preprocessing of the context is slower here, because I am using RNN mode to process the context token by token. The faster sequence-mode version is in https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py]

I believe in Open AIs built by communities, and you are welcome to join the RWKV community :) Please feel free to msg in RWKV Discord if you are interested.

RWKV has been mostly a single-developer project for the past 2 years: designing, tuning, coding, optimization, distributed training, data cleaning, managing the community, answering questions... All your help will be much appreciated. https://github.com/BlinkDL/RWKV-LM

It will be great if we can build optimized INT8/INT4 inference for Nvidia/AMD/Intel GPUs, Intel/AMD CPUs, and Android/iOS phones. Because RWKV has RNN mode, it is very hardware-friendly (no need for kv cache). Let's build a future where everyone can run LLMs.
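To make the "no need for kv cache" point concrete, here is a rough sketch comparing the per-sequence memory of a transformer KV cache with that of a fixed-size RNN state. All dimensions (`n_layers`, `d_model`, `state_vectors_per_layer`) are made-up placeholders for illustration, not the configuration of any released RWKV model, and the KV formula assumes plain multi-head attention.

```python
def kv_cache_bytes(n_layers, d_model, ctx_len, bytes_per_value=2):
    # A transformer stores one key and one value vector per layer per token,
    # so the cache grows linearly with the context length.
    return 2 * n_layers * ctx_len * d_model * bytes_per_value

def rnn_state_bytes(n_layers, d_model, state_vectors_per_layer, bytes_per_value=2):
    # An RNN keeps a fixed number of vectors per layer, independent of ctx_len.
    return n_layers * state_vectors_per_layer * d_model * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
n_layers, d_model = 24, 2048

for ctx_len in (1024, 4096, 16384):
    kv = kv_cache_bytes(n_layers, d_model, ctx_len)
    # state_vectors_per_layer=5 is an assumed placeholder value.
    state = rnn_state_bytes(n_layers, d_model, state_vectors_per_layer=5)
    print(f"ctx {ctx_len}: KV cache {kv / 1e6:.1f} MB, RNN state {state / 1e6:.2f} MB")
```

The cache term grows with every generated token, while the state term stays constant, which is the property that matters on memory-constrained devices.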

63

Comments


t1_jb0q49f wrote

If you are RWKV's creator, kudos to you, the work you have done is amazing.

Reminder for everybody: it can run rather quickly on a CPU, meaning it can truly run locally on phones. It is also 100 times faster and uses 100 times less (V)RAM.

11

t1_jb0sm2c wrote

I have been following your Reddit posts for a while now, but I still don't think I fully understand it. Have you considered writing a paper? It might help people understand the method and might bring in more open-source help.

13

t1_jb0smq3 wrote

It's awesome work, but I don't think anyone is claiming anywhere near 100x faster speed and 100x lower VRAM, are they?

>RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M
>
>GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M

From this it sounds like roughly a 2x improvement (don't get me wrong, a 2x improvement is great for the same performance). As for memory, you have to store all the parameters of the RWKV model just like with GPT, and that takes up most of the memory if you're trying to fit models on consumer hardware. Memory use is lower only because there is no need for a KV cache.
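The ratios implied by the two quoted measurements can be checked with a few lines of arithmetic. The numbers are copied from the quote above; the variable names are only for illustration, and note that the two models do not have exactly the same parameter count.

```python
# Figures copied from the quoted benchmark (A40, tf32).
rwkv_sec_per_token = 0.015   # RWKV-3 1.5B
gpt2_sec_per_token = 0.032   # GPT2-XL 1.3B at ctxlen 1000
rwkv_vram_mb = 7823
gpt2_vram_mb = 9655

# How many times faster / smaller RWKV is in this one measurement.
speedup = gpt2_sec_per_token / rwkv_sec_per_token
vram_ratio = gpt2_vram_mb / rwkv_vram_mb

print(f"speedup: {speedup:.2f}x")        # about 2.13x
print(f"VRAM ratio: {vram_ratio:.2f}x")  # about 1.23x
```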

7

t1_jb0x91p wrote

I think this is really exciting. LLM applications like ChatGPT still seem to mostly pipe the model's sampled output directly to the user. With 100 times faster inference, maybe complex chain-of-thought procedures could chain multiple differently prompted model instances (well, the same model with different contexts) that work together to improve their output while still running close to real time.

3

t1_jb172wo wrote

It once said 100 times faster and 100 times less (V)RAM here. However, it now says that RWKV-14B can be run with only 3 GB of VRAM, which is still a massive improvement, because a 14B model normally requires about 30 GB of VRAM.

3

t1_jb1b68d wrote

Totally agree.

I have been following this for some time, but I can't fully understand it or explain it to my collaborators.

I work in ML and have quite a bit of experience with transformers, and I still can't fully get it, let alone convince some of my collaborators that it is worth pursuing.

It is paramount that we have a paper that explains this in more detail if we want the community to consider this seriously.

Please do it!

8

t1_jb1h7wl wrote

Hmm, I very much doubt it could have run 100x faster for the same parameter count, as you are memory-bandwidth bound (both GPT and RWKV have to load the parameters n times to generate n tokens). I'm also somewhat skeptical that you only need 3GB for 14B parameters *without offloading the model*, as even 4-bit quantization needs 14B/2 = 7GB. And offloading the model is slow to the point of being unusable, as you need to do CPU<->GPU transfers.
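The 14B/2 = 7GB figure follows from counting bits per parameter. A minimal sketch of that arithmetic, covering the weights only (activations, state, and framework overhead are ignored), with `weight_memory_gb` being a helper name made up for this example:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 14e9  # a 14B-parameter model
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(n_params, bits):.0f} GB")
# 16-bit weights: 28 GB
# 8-bit weights: 14 GB
# 4-bit weights: 7 GB
```

So holding all 14B weights on the GPU in 3 GB would need fewer than 2 bits per parameter, which is why offloading is the more likely explanation.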

1

t1_jb1wjfi wrote

It does help, but it certainly doesn't make everything clear. I am confident I could run inference on it, but my interest is academic rather than practical.

What is the magic number 5 all about? It seems to appear all over the code without explanation.

Are the time mixing and channel mixing operations novel or were they introduced by a citable work?

How does the parallelization during training work?

5

t1_jb1zk8v wrote

It depends on the context length. Since attention scales as n^2 and an RNN scales as n, the speedup grows by a factor of n with document length. Now, there are also some slowdowns. I am certain his RNN solution here has to do some tricks that are more complex than a simple RNN. But the longer the context, the greater the speedup relative to a transformer. So 100x on a large doc is not necessarily impossible (at least at inference time).
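The scaling argument can be sketched with a toy cost model. It counts only the sequence-mixing work in abstract units and ignores the per-token weight multiplications that both architectures share, so real-world speedups would be smaller than these ratios suggest. The function names are made up for this example.

```python
def attention_total_cost(n: int) -> int:
    # With causal attention, token t looks at all t tokens so far:
    # 1 + 2 + ... + n units of work in total.
    return n * (n + 1) // 2

def rnn_total_cost(n: int) -> int:
    # An RNN does one fixed-size state update per token.
    return n

for n in (100, 1000, 4096):
    ratio = attention_total_cost(n) / rnn_total_cost(n)
    print(f"n={n}: attention/RNN cost ratio = {ratio:.1f}")
```

The ratio works out to (n + 1) / 2, so it grows linearly with the document length, as the comment argues.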

I have a hard time believing the memory claims as well, though. Again, I really wish the author would write a paper about it, because as far as I can see, if he’s using standard backpropagation through time to train, the memory requirements should be quite dramatic. But again, I think he’s doing something special with his RNN; I just don’t know what it is.

3

t1_jb32fo5 wrote

What’s the reason to use this over a transformer? Transformers allow transfer learning and are easier to parallelize. Ah, I saw your Zhihu. Which company do you work at?

1

t1_jb532v5 wrote

Intelligence is the ability to turn complex information into a simple explanation that a child can understand.

It makes me skeptical when someone doesn’t explain anything besides performance reasons. Most people just use the cloud because ML networks, regardless of size, use a lot of battery.

−1

t1_jb53nhe wrote

As far as I can tell, the sparse documentation is just because they've been in pure R&D mode. I've played around with it in their Discord server and can confirm it does perform well, but I've struggled to get it working locally.

1

OP t1_jb9bdw3 wrote

Directly from the RWKV-LM GitHub:

RWKV is a RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
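One way to see how a model can be both an RNN and trainable in parallel is a toy decayed weighted average. This is a simplified illustration under my own assumptions, not RWKV's actual formula (see the linked RWKV_in_150_lines.py for that); `recurrent_mode`, `parallel_mode`, and the scalar `decay` are names invented for the example.

```python
import math

def recurrent_mode(keys, values, decay):
    """Process tokens one by one, carrying only a fixed-size state (num, den)."""
    num, den, outputs = 0.0, 0.0, []
    for k, v in zip(keys, values):
        w = math.exp(k)
        num = decay * num + w * v   # decayed running weighted sum of values
        den = decay * den + w       # decayed running sum of weights
        outputs.append(num / den)
    return outputs

def parallel_mode(keys, values, decay):
    """Compute each position directly from the full sequence.

    No position depends on another position's output, so all of them
    could be evaluated at the same time during training.
    """
    outputs = []
    for t in range(len(keys)):
        weights = [decay ** (t - i) * math.exp(keys[i]) for i in range(t + 1)]
        num = sum(w * v for w, v in zip(weights, values))
        den = sum(weights)
        outputs.append(num / den)
    return outputs

keys = [0.1, -0.3, 0.7, 0.2]
values = [1.0, 2.0, -1.0, 0.5]
a = recurrent_mode(keys, values, decay=0.9)
b = parallel_mode(keys, values, decay=0.9)
print(all(abs(x - y) < 1e-9 for x, y in zip(a, b)))
```

Both functions produce the same outputs: the first needs only constant memory per step (useful for inference), while the second has no sequential dependency between positions (useful for training).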

1