Submitted by bo_peng t3_11teywc in MachineLearning

I tried the "Alpaca prompt" on RWKV 14B ctx8192, and to my surprise it works out of the box without any finetuning (RWKV is a 100% RNN trained purely on Pile v1 and nothing else):

https://preview.redd.it/fciatottq7oa1.png?width=1046&format=png&auto=webp&v=enabled&s=891904adbadefb5902b86f67098c852da88dc167
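For context, here is a minimal sketch of an Alpaca-style instruction prompt of the kind being tested above. This is the commonly cited Alpaca template; the exact wording used in the screenshot may differ, and the instruction string is just a placeholder:

```python
# Sketch of an Alpaca-style prompt (assumed template; placeholder instruction).
instruction = "Explain the difference between an RNN and a transformer."

prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""
# The model is then asked to continue the text after "### Response:".
```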

You are welcome to try it in RWKV 14B Gradio (click examples below the panel):

https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

Tips: try "Expert Response" or "Expert Long Response" or "Expert Full Response" too.

https://preview.redd.it/qo71b85vq7oa1.png?width=2516&format=png&auto=webp&v=enabled&s=c4b1717754d03e28b4bba01530672935407e7797

===================

ChatRWKV v2 is now using a CUDA kernel to optimize INT8 inference (23 token/s on 3090): https://github.com/BlinkDL/ChatRWKV

Upgrade to the latest code, run "pip install rwkv --upgrade" to get 0.5.0, and set os.environ["RWKV_CUDA_ON"] = '1' in v2/chat.py to enjoy the speed.
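For anyone scripting this outside v2/chat.py, a minimal sketch along the same lines (assuming the rwkv pip package >= 0.5.0, a working CUDA toolchain, and a locally downloaded checkpoint; the file paths below are placeholders):

```python
import os
# Must be set before importing rwkv so the custom CUDA kernel is built/loaded.
os.environ["RWKV_CUDA_ON"] = "1"
os.environ["RWKV_JIT_ON"] = "1"

from rwkv.model import RWKV
from rwkv.utils import PIPELINE

# Placeholder paths: point these at your downloaded checkpoint and tokenizer file.
model = RWKV(model="/path/to/RWKV-4-Pile-14B-20230313-ctx8192-test1050",
             strategy="cuda fp16i8")  # INT8 weights on GPU, as in the post
pipeline = PIPELINE(model, "/path/to/20B_tokenizer.json")

print(pipeline.generate("\nQ: What is an RNN?\n\nA:", token_count=100))
```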

The inference speed (and VRAM consumption) of RWKV is independent of ctxlen, because it's an RNN (note: currently preprocessing a long prompt takes more VRAM, but that can be optimized because we can process the prompt in chunks).
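A rough sketch of that chunked-prompt idea, assuming the forward(tokens, state) -> (logits, state) interface of the rwkv pip package (chunk size is arbitrary):

```python
def preprocess_in_chunks(model, prompt_tokens, chunk_len=256):
    """Feed a long prompt through the RNN in fixed-size chunks.

    Only the fixed-size recurrent state is carried between chunks, so peak
    VRAM no longer grows with the full prompt length.
    """
    state = None
    logits = None
    for i in range(0, len(prompt_tokens), chunk_len):
        logits, state = model.forward(prompt_tokens[i:i + chunk_len], state)
    return logits, state  # resume generation from this state
```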

Meanwhile I find the latest RWKV-4-Pile-14B-20230313-ctx8192-test1050 model can utilize a long ctx:

https://preview.redd.it/a68dw0hzq7oa1.png?width=398&format=png&auto=webp&v=enabled&s=307e4d7847cb03cab3930b3ea07e9b2f856c9b1c

101

Comments


yehiaserag t1_jcj305q wrote

How does that version compare to "RWKV-4-Pile-14B-20230228-ctx4096-test663"?

6

londons_explorer t1_jcj8p9y wrote

Can we run things like this through github.com/OpenAI/evals?

They have now got a few hundred tests, which is a good way to gauge performance.

9

FallUpJV t1_jcje89y wrote

I don't get this anymore. If it's not the model size nor the transformer architecture, then what is it?

Models were just not trained enough / not on the right data?

11

cipri_tom t1_jcjeehj wrote

This is great! It just needs a name that's as great as the work

RWKV is a tongue twister. How about Ruckus?

7

xEdwin23x t1_jcjfnlj wrote

First, this is not a "small" model, so size DOES matter. It may not be hundreds of billions of parameters, but it's definitely not small imo.

Second, it always has been about data, as in the astronaut-pointing-gun meme.

30

blueSGL t1_jcjga2i wrote

Is it possible to split the model and do inference across multiple lower-VRAM GPUs, or does a single card have to have the minimum 16 GB of VRAM?

5

bo_peng OP t1_jcjuejz wrote

ChatRNN is indeed a great name :)

R W K V are the four major parameters in RWKV (similar to QKV for attention).

I guess you can pronounce it like "Rwakuv" (a bit like "raccoon")
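For intuition about how r/w/k/v might interact, here is a heavily simplified, illustrative sketch of a time-mixing-style recurrence. This is not the exact RWKV formulation (which also includes a per-step bonus term and numerical-stability tricks); it is only meant to show why a fixed-size state makes each step O(1) in context length:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rwkv_like_step(r, w, k, v, state):
    """One token step of a simplified RWKV-style time mix (illustrative only).

    r: receptance, w: per-channel decay, k: key, v: value (all vectors).
    The exponentially weighted sum of past values is kept as a running
    numerator/denominator pair, so the step cost does not grow with ctxlen.
    """
    num, den = state
    num = np.exp(-w) * num + np.exp(k) * v
    den = np.exp(-w) * den + np.exp(k)
    out = sigmoid(r) * (num / den)   # receptance gates the weighted average
    return out, (num, den)

# usage: start state as (zeros, zeros + small epsilon) and thread it token by token
```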

9

yaosio t1_jckchbe wrote

I like its plan to make money. Did it learn from wallstreetbets?

1

mikljohansson t1_jckedf9 wrote

Very interesting work! I've been following this project for a while now

Can I ask a few questions?

  • What's the difference between RWKV-LM and ChatRWKV? E.g. is ChatRWKV mainly RWKV-LM but streamlined for inference and ease of use, or are there more differences?

  • Are you planning to fine tune on the Stanford Alpaca dataset (as was recently done for LLaMA and GPT-J to create instruct versions of them), or a similar GPT-generated instruction dataset? I'd love to see an instruct-tuned version of RWKV-LM 14B with an 8k+ context len!

3

Taenk t1_jckzuxm wrote

Sorry, I am not an expert, just an enthusiast, so this is a stupid question: Where can I see a list of these few hundred tests and is there some page where I can see comparisons between different models?

3

FallUpJV t1_jclpydo wrote

Yes, it's definitely not small. I meant compared to the models people have been paying the most attention to over the last few years, I guess.

The astronaut-pointing-gun meme is a good analogy, almost a scary one. I wonder how much we could improve existing models with simply better data.

2

satireplusplus t1_jcp6bu4 wrote

This model uses a "trick" to efficiently train RNNs at scale, and I still have to take a look to understand how it works. Hopefully the paper is out soon!

Otherwise, size is what matters! Getting there is a combination of factors: the transformer architecture scales well and was the first architecture that allowed these LLMs to be cranked up to enormous sizes, plus enterprise GPU hardware with lots of memory (40G, 80G) and frameworks like PyTorch that make parallelizing training across multiple GPUs easy.

And OP's 14B model might be "small" by today's standards, but it's still gigantic compared to a few years ago. It's ~27GB of FP16 weights.
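(As a rough sanity check on that figure: about 14 × 10⁹ parameters × 2 bytes per FP16 weight ≈ 28 GB, i.e. roughly 26 GiB, consistent with the ~27GB quoted.)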

Having access to 1TB of preprocessed text data that you can download right away without doing your own crawling is also neat (the Pile).

3