
Art10001 t1_jb172wo wrote

It once said 100 times faster and 100 times less (V)RAM here. However, it now says that RWKV-14B can be run with only 3 GB of VRAM, which is still a massive improvement, because a 14B model normally requires roughly 30 GB of VRAM.

3

royalemate357 t1_jb1h7wl wrote

hmm, I very much doubt it could have run 100x faster for the same parameter count, as you are memory-bandwidth bound (both GPT and RWKV have to load the parameters n times to generate n tokens). I'm also somewhat skeptical that you only need 3 GB for 14B parameters *without offloading the model*, as even 4-bit quantization needs 14B/2 = 7 GB. And offloading the model is slow to the point of being unusable, since you need to do CPU<->GPU transfers.
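For reference, a quick back-of-the-envelope check of the weight memory alone (weights only; this ignores activations, state/KV caches, and runtime overhead):

```python
# Rough VRAM needed just to hold 14B parameters at different precisions.
params = 14e9

for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB")

# fp16: ~28 GB, int8: ~14 GB, int4: ~7 GB -- so fitting in 3 GB implies
# streaming/offloading part of the model rather than holding it all in VRAM.
```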

1

xx14Zackxx t1_jb1zk8v wrote

It depends on the context length. Since attention scales as O(n^2) and an RNN scales as O(n), the speedup over a document of length n is roughly a factor of n. Now, there are also some slowdowns: I am certain the RNN solution here uses tricks that are more complex than a plain RNN. But the longer the context, the larger the speedup relative to a transformer, so 100x on a long document is not necessarily impossible (at least at inference time).
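As a rough illustration of that scaling argument (hypothetical hidden size and a crude per-layer cost model, not RWKV's actual implementation):

```python
# Crude per-layer cost sketch for a sequence of n tokens:
#   transformer: n * d^2 (projections/FFN)  +  n^2 * d (attention over the cache)
#   rnn-style:   n * d^2 (fixed-size state updates, no n^2 term)
d = 5120  # hidden size in the ballpark of a 14B model (assumption)

def transformer_cost(n, d=d):
    return n * d**2 + n**2 * d

def rnn_cost(n, d=d):
    return n * d**2

for n in (1_024, 16_384, 131_072):
    print(f"n={n:>7}: speedup ~{transformer_cost(n) / rnn_cost(n):.1f}x")

# The ratio works out to 1 + n/d, so it grows linearly with context length
# and only approaches ~100x for very long documents.
```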

I have a hard time believing the memory claims as well, though. Again, I really wish the author would write a paper about it, because as far as I can see, if he's using standard backpropagation through time to train, the memory requirements should be quite dramatic. But again, I think he's doing something special with his RNN; I just don't know what it is.

3

Nextil t1_jb1sg1c wrote

I think they mean that with offloading/streaming you need 3 GB minimum, but it's much slower.

1