bo_peng OP t1_jc2alfm wrote on March 13, 2023 at 3:03 PM

Reply to comment by KerfuffleV2 in [P] ChatRWKV v2 (can run RWKV 14B with 3G VRAM), RWKV pip package, and finetuning to ctx16K by bo_peng

Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)

KerfuffleV2 t1_jc3jith wrote on March 13, 2023 at 7:54 PM

> Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)

Nice, that makes a big difference! (And such a small change too.)

The highest speed I've seen so far is with something like `cuda fp16i8 *15+ -> cuda fp16 *1`, at about ~~1.21 tps~~ 1.17 tps (edit: I was mistaken about the earlier figure). Even `cuda fp16i8 *0+` gets quite acceptable speed (0.85-0.88 tps) and uses around 1.3 GB of VRAM.
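For anyone wanting to compare strategies the same way, here is a minimal sketch of how a tokens-per-second (tps) figure like the ones above could be measured. The `tokens_per_second` helper and `fake_step` are hypothetical names made up for illustration; in a real run `step` would be replaced by whatever call generates one token with the model, and this is not necessarily how the numbers in this comment were obtained.

```python
import time

def tokens_per_second(step, n_tokens):
    """Call step() n_tokens times and return the average tokens per second."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        step()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def fake_step():
    # Hypothetical stand-in for generating one token; sleeps ~10 ms instead.
    time.sleep(0.01)

tps = tokens_per_second(fake_step, 20)
print(f"{tps:.2f} tps")
```

Timing a few dozen tokens after a short warm-up generally gives a steadier number than timing a single token.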

I saw your response on GitHub. Unfortunately, I don't use Discord so hopefully it's okay to reply here.

bo_peng OP t1_jc9gf72 wrote on March 15, 2023 at 6:25 AM

Update ChatRWKV v2 and the pip rwkv package (0.5.0), and set `os.environ["RWKV_CUDA_ON"] = '1'`

for 1.5x speed with fp16i8 (and 10% less VRAM: now 14686 MB for 14B instead of 16462 MB, so you can put more layers on the GPU)
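A rough sketch of that setup, plus a check of the VRAM arithmetic. It assumes the variable needs to be set before the `rwkv` model module is imported (that is my understanding of how the package picks it up, so treat it as an assumption); the import itself is left as a comment so the snippet runs without the package installed.

```python
import os

# Assumption: set this before importing the rwkv model module so the
# package sees it when it loads.
os.environ["RWKV_CUDA_ON"] = "1"

# from rwkv.model import RWKV  # import after the variable is set

# VRAM figures quoted above for the 14B model with fp16i8, in MB.
old_mb, new_mb = 16462, 14686
saved_mb = old_mb - new_mb
saved_pct = 100 * saved_mb / old_mb
print(f"saved {saved_mb} MB ({saved_pct:.1f}%)")
```

The quoted figures work out to a saving of 1776 MB, or roughly 10.8%, consistent with the "10% less VRAM" claim.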

KerfuffleV2 t1_jcadn3g wrote on March 15, 2023 at 1:01 PM

Unfortunately, it doesn't compile for me: https://github.com/BlinkDL/ChatRWKV/issues/38

I'm guessing that even if you implement special support for lower compute versions, it will probably cancel out the speed (and maybe size) benefits.

bo_peng OP t1_jcb05e8 wrote on March 15, 2023 at 3:36 PM

stay tuned :) will fix it

KerfuffleV2 t1_jccb5v1 wrote on March 15, 2023 at 8:26 PM

Sounds good! The 4bit stuff seems pretty exciting too.

By the way, not sure if you saw it, but it looks like PyTorch 2.0 is close to being released: https://www.reddit.com/r/MachineLearning/comments/11s58n4/n_pytorch_20_our_next_generation_release_that_is/

They seem to be claiming you can just drop in `torch.compile()` and see benefits with no code changes.

bo_peng OP t1_jccc46c wrote on March 15, 2023 at 8:32 PM

I am using torch JIT so close ;)