_Arsenie_Boca_ t1_jb1wjfi wrote
Reply to comment by bo_peng in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
It does help, but it certainly doesn't make everything clear. I'm confident I could run inference on it, but my interest is more academic than practical.
What is the magic number 5 all about? It seems to appear all over the code without explanation.
Are the time mixing and channel mixing operations novel or were they introduced by a citable work?
How does the parallelization during training work?
bo_peng OP t1_jb1z3an wrote
5 is the number of hidden state vectors per block (4 for ATT = xx, aa, bb, pp; 1 for FFN = xx).
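A rough sketch of how that state can be laid out for the 150-line inference script (the slot order and the -1e30 initialisation of pp here are illustrative; check the script for the exact indexing):

```python
import numpy as np

n_layer, n_embd = 24, 1024  # example sizes

# 5 state vectors per block: 1 for ChannelMixing (previous x),
# 4 for TimeMixing (previous x, plus the aa/bb/pp WKV accumulators).
state = np.zeros((n_layer * 5, n_embd), dtype=np.float32)
state[4::5] = -1e30  # pp starts near -inf so the first token dominates the running max

def block_state(state, i):
    """Slice out the 5 state vectors of block i (indexing is illustrative)."""
    ffn_xx = state[5*i + 0]   # ChannelMixing: previous token's x
    att_xx = state[5*i + 1]   # TimeMixing: previous token's x
    aa, bb, pp = state[5*i + 2], state[5*i + 3], state[5*i + 4]  # WKV numerator / denominator / max exponent
    return ffn_xx, att_xx, aa, bb, pp
```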
TimeMixing is RWKV.
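A single-token sketch of what TimeMixing does (parameter names like mix_r / Wr / decay / bonus are placeholders, and the numerically stable update is paraphrased from the public code rather than copied from it):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_mixing(x, xx, aa, bb, pp, p):
    """One recurrent RWKV time-mixing step for a single token.
    x: current embedding; xx: previous token's embedding;
    aa/bb/pp: numerator / denominator / max-exponent of the running WKV state;
    p: dict of per-layer parameters (names illustrative)."""
    # Mix current and previous token before the R/K/V projections
    r = sigmoid(p['Wr'] @ (x * p['mix_r'] + xx * (1 - p['mix_r'])))
    k = p['Wk'] @ (x * p['mix_k'] + xx * (1 - p['mix_k']))
    v = p['Wv'] @ (x * p['mix_v'] + xx * (1 - p['mix_v']))

    # WKV: exponentially weighted average of past v, kept numerically stable
    # by tracking the running maximum exponent pp
    q = np.maximum(pp, p['bonus'] + k)
    e1, e2 = np.exp(pp - q), np.exp(p['bonus'] + k - q)
    wkv = (e1 * aa + e2 * v) / (e1 * bb + e2)

    # Decay the running state and fold in the current token
    q = np.maximum(pp + p['decay'], k)   # decay is negative in the real code
    e1, e2 = np.exp(pp + p['decay'] - q), np.exp(k - q)
    aa, bb, pp = e1 * aa + e2 * v, e1 * bb + e2, q

    return p['Wo'] @ (r * wkv), (x, aa, bb, pp)
```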
ChannelMixing is your usual FFN (sqReLU as in the Primer paper) with an extra R-gate (novel; I find it helps).
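ChannelMixing in the same sketch style (again with placeholder parameter names):

```python
import numpy as np

def channel_mixing(x, xx, p):
    """RWKV channel mixing: a squared-ReLU FFN gated by a sigmoid R-gate.
    x: current embedding; xx: previous token's embedding;
    p: dict with mix_k, mix_r and the Wk, Wv, Wr matrices (names illustrative)."""
    k = p['Wk'] @ (x * p['mix_k'] + xx * (1 - p['mix_k']))
    r = 1.0 / (1.0 + np.exp(-(p['Wr'] @ (x * p['mix_r'] + xx * (1 - p['mix_r'])))))
    # Squared ReLU (Primer-style) followed by the R-gate; the new FFN state is just x
    return r * (p['Wv'] @ np.square(np.maximum(k, 0.0))), x
```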
Parallelization follows from this formula: https://github.com/BlinkDL/RWKV-LM/raw/main/RWKV-formula.png.
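The point of that formula is that each wkv_t is a closed-form weighted average of the past k/v, with no dependence on earlier outputs, so every position of a training sequence can be computed at once. A naive O(T^2) reference version, just to show the math (the real training code uses a custom CUDA kernel; here w and u stand in for the per-channel decay and "bonus" parameters):

```python
import numpy as np

def wkv_parallel(k, v, w, u):
    """Reference (not numerically hardened) WKV over a whole sequence.
    k, v: (T, C) keys/values; w: per-channel decay (> 0); u: per-channel bonus."""
    T, C = k.shape
    out = np.empty((T, C))
    for t in range(T):
        # weights e^{-(t-1-i)w + k_i} for i < t, plus e^{u + k_t} for the current token
        past = np.exp(-(t - 1 - np.arange(t))[:, None] * w + k[:t])  # (t, C)
        cur = np.exp(u + k[t])                                       # (C,)
        num = (past * v[:t]).sum(axis=0) + cur * v[t]
        den = past.sum(axis=0) + cur
        out[t] = num / den
    return out
```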