MysteryInc152 t1_jclpjzi wrote
Reply to comment by FallUpJV in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
It's predicting language. As long as the structure allows the model to properly learn to predict language, you're good to go.
turnip_burrito t1_jcoul9i wrote
Yes, exactly. Everyone keeps leaving the architecture's inductive structural priors out of the discussion.
It's not all about data! The model matters too!