Submitted by ThePerson654321 t3_11lq5j4 in MachineLearning

The machine learning (ML) community is progressing at a remarkable pace and embraces new techniques very quickly. From my understanding of this model, it appears to offer a distinct set of advantages over transformers while lacking any real drawbacks. Despite these benefits, it remains unclear why adoption of this approach is not more widespread among individuals and organizations in the field.

Why is this the case? I really can't wrap my head around it. The RWKV principle has existed for more than a year now and has more than 2k stars on GitHub! I feel like we should have seen wider adoption.

Any thoughts?


Just to sum things up:

/u/LetterRip explains this by saying that the larger organizations basically just haven't noticed/understood its potential yet.

My explanation is that there is actually something problematic with the RWKV architecture. I'm still wondering what that is, though.

16

Comments


LetterRip t1_jbjfiyg wrote

  1. The larger models (3B, 7B, 14B) have only been released quite recently.

  2. Information about the design has been fairly scarce/hard to track down, because no paper has been written and submitted on it.

  3. People want to know that it actually scales before investing work into it.

  4. Most people are learning about it from the release links posted to Reddit, and those posts haven't been written in a way that attracts interest.

13

ThePerson654321 OP t1_jbjisn7 wrote

  1. Sure, RWKV 7B came out 7 months ago, but the concept has been promoted by the developer for much longer. Compared to, say, DALL-E 2, which has exploded and only came out 9 months ago, it still feels like some organization would have picked up RWKV by now if it were as useful as the developer claims.

  2. This might actually be a problem. But the code is public, so it shouldn't be that difficult to understand.

  3. Not necessarily. Google, OpenAI, DeepMind, etc. test things that don't work out all the time.

  4. Does not matter. If your idea is truly good, you will get attention sooner or later anyway.


I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, DeepMind would have been able to understand it.

I personally have two potential explanations for my question:

  1. It does not work as well as the developer claims, or it has some other flaw that makes it hard to scale, for example (time will judge this).
  2. The community is basically just really slow to embrace this, for some unknown reason.

I am leaning towards the first one.

5

LetterRip t1_jbjphkw wrote

> I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, DeepMind would have been able to understand it.

This was posted by DeepMind a month ago,

https://www.reddit.com/r/MachineLearning/comments/10ja0gg/r_deepmind_neural_networks_and_the_chomsky/

I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.

So prior to a month ago they didn't know it existed (edit - or at least not much more than that it existed) or that it happened to meet their use case.

> RWKV 7B came out 7 months ago, but the concept has been promoted by the developer for much longer.

There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.

> 2) This might actually be a problem. But the code is public so it shouldn't be that difficult to understand it.

Until it had proved itself, there was no motivation to take the effort to figure it out. The lower the effort threshold, the more likely people are to have a look; the larger the threshold, the more likely people are to invest their limited time in the hundreds of other interesting bits of research that come out each week.

> If your idea is truly good, you will get attention sooner or later anyway.

Or be ignored for all time till someone else discovers the idea and gets credit for it.

In this case the idea has started to catch on and be discussed by 'the Big Boys'; people are cautiously optimistic and are investing time to start learning about it.

> I don't buy the argument that it's too new or hard to understand.

It isn't "too hard to understand" - it simply hadn't shown itself to be interesting enough to worth more than minimal effort to understand it. Without a paper that exceeded the minimal effort threshold. Now it has proven itself with the 14B that it seems to scale. So people are beginning to invest the effort.

> It does not work as well as the developer claims, or it has some other flaw that makes it hard to scale, for example (time will judge this)

No, it simply hadn't been shown to scale. Now we know it scales to at least 14B, and there is no reason to think it won't scale the same as any other GPT model.

The DeepMind paper lamenting the need for a fast way to train RNN models was only about a month ago.

4

ThePerson654321 OP t1_jbjz508 wrote

> I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time. So prior to a month ago they didn't know it existed or happened to meet their use case.

That surprises me, considering his RWKV repos have thousands of stars on GitHub.

I'm curious about what they responded with. What did they say?

> There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.

According to his claims (especially the infinite ctx_len), it definitely was interesting. That it was scaling was pretty obvious even at 7B.


But your argument is basically that the large organizations simply haven't noticed it yet.

My guess is that it actually has some unknown problem/limitation that makes it inferior to the transformer architecture.

We'll just have to wait. Hopefully you are right but I doubt it.

1

farmingvillein t1_jbk47jg wrote

> I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.
>
> So prior to a month ago they didn't know it existed or happened to meet their use case.

How does #2 follow from #1?

RWKV has been on Reddit for quite a while, and a large number of researchers frequent/lurk on Reddit, including DeepMind researchers, so the idea that they had no idea RWKV exists seems specious.

Unless you mean that you emailed them and they literally told you that they didn't know about this. In which case...good on you!

1

farmingvillein t1_jbk6nut wrote

> Based on my comprehension of this model, it appears to offer a distinct set of advantages relative to transformers

What advantages are you referring to, very specifically?

There are theoretical advantages--but it can be a lot of work to prove out that those matter.

There are (potentially) empirical, observed advantages--but there don't seem to be (yet) any claims that are so strong as to suggest a paradigm shift (like Transformers were).

Keep in mind that there is a lot of infrastructure built up to support transformers in an industrial context, which means that even if RWKV shows some small advantage, the advantage may not be there in practice, because of all the extreme optimizations that have been built to support larger organizations (in speed of inference, training, etc.).

The most likely adoption path here would be if multiple papers showed, at smaller scale, consistent advantages for RWKV. No one has done this yet--and the performance metrics provided on the github (https://github.com/BlinkDL/RWKV-LM) certainly don't make such an unequivocal claim on performance.

And providing a rigorous side-by-side comparison with transformers is actually really, really hard--apples to apples comparisons are notoriously tricky, and you of course have to be really cautious about thinking about what "tips and tricks" you allow both architectures to leverage.

Lastly, and this is a fuzzier but IMO relevant point--

The biggest guys are crossing into a point where evaluation is suddenly hard again.

By that, what I mean is that there is broad consensus that our current public evaluation metrics don't do a great job of helping us understand how well these models perform on "more interesting" generative tasks. I think you'll probably see some major improvements around eval/benchmark management in the next year or so (and certainly, internally, the big guys have invested a lot here)--but for now, it is harder to pick up a new architecture/model and understand its capabilities on the "more interesting" tasks that your GPT-4s & Bards of the world are trying to demonstrate. This makes it harder to prove and vet progress on smaller models, which of course makes scaling up more risky.

6

farmingvillein t1_jbk819k wrote

I think it is more likely people have seen it, but dismissed it as a bit quixotic, because the RWKV project has made little effort to iterate in an "academic" fashion (i.e., with rigorous, clear testing, benchmarks, goals, comparisons, etc.). It has obviously done pieces of this, but hasn't been sufficiently well-defined as to make it easy for others to iterate on top of it, from a research POV.

This means that anyone else picking up the architecture is going to have to go through the effort to create the whole necessary research baseline. Presumably this will happen, at some point (heck, maybe someone is doing it right now), but it creates a large impediment to further iteration/innovation.

11

ThePerson654321 OP t1_jbk8kxy wrote

I'm basically just referring to the claims by the developer. He makes it sound extraordinary:

> best of RNN and transformer, great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

> Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, so you can even run a LLM on your phone.

The most extraordinary claim I got stuck on was "infinite" ctx_len. One of the biggest limitations of transformers today is, imo, their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI, etc. would want to investigate?
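
For concreteness, here is a minimal sketch of what "only matrix-vector multiplications" means for decoding: a fixed-size state is updated once per token, so the per-token cost and memory stay constant no matter how long the context is. The shapes and weights are made up for illustration; this is not the actual RWKV kernel.

```python
import numpy as np

d = 512                                  # hypothetical hidden size
W_in  = np.random.randn(d, d) * 0.01     # made-up weights, for illustration only
W_rec = np.random.randn(d, d) * 0.01
W_out = np.random.randn(d, d) * 0.01

def decode_step(state, x):
    # One decoding step: three matrix-vector products, O(d^2) per token,
    # regardless of how many tokens have already been consumed.
    state = np.tanh(W_in @ x + W_rec @ state)
    logits = W_out @ state
    return state, logits

state = np.zeros(d)
for _ in range(4096):                    # context length has no effect on step cost or memory
    x = np.random.randn(d)               # stand-in for the current token's embedding
    state, logits = decode_step(state, x)
```

A transformer decoder, by contrast, has to attend over a key/value cache that grows with every generated token, so its per-token cost and memory keep climbing with context.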


I definitely agree that there might be an incompatibility with the existing transformer-specific infrastructure.

But thanks for your answer. It might be one or more of the following:

  1. The larger organizations haven't noticed/cared about it yet.
  2. I overestimate how good it is (from the developer's description).
  3. It has some unknown flaw that's not obvious to me and not stated in the repository's README.
  4. All the existing infrastructure is tailored for transformers and is not compatible with RWKV.

At least we'll see in time.

0

LetterRip t1_jbkdshr wrote

Here is what the author stated in the thread,

> Tape-RNNs are really good (both in raw performance and in compression i.e. very low amount of parameters) but they just can't absorb the whole internet in a reasonable amount of training time... We need to find a solution to this!

I think they knew it existed (i.e., they knew there was a deep learning project named RWKV), but they appear not to have known it met their scaling needs.

2

Aran_Komatsuzaki t1_jbkjgzf wrote

I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 at a comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the latter scores better on the first 1024 tokens. While RWKV performs comparably to Transformer on tasks with short context (e.g. the tasks used in its repo for evaluating RWKV), it may still not be possible to replace Transformer for longer-context tasks (e.g. typical conversation with ChatGPT).

RWKV has fast decoding speed, but multi-query attention decoding is nearly as fast w/ comparable total memory use, so that's not necessarily what makes RWKV attractive. If you set the context length to 100k or so, RWKV would be faster and memory-cheaper, but it doesn't seem that RWKV can utilize most of the context at this range, not to mention that vanilla attention is also not feasible at this range.
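
A comparison like the one described above can be sketched as bucketing per-token losses by position and turning each bucket's average loss into a perplexity. The function name and the `nll_per_token` input below are placeholders for illustration, not the actual evaluation code.

```python
import math

def bucketed_perplexity(nll_per_token, split=1024):
    """nll_per_token: per-token negative log-likelihoods (in nats), in document
    order, for one long evaluation document."""
    early, late = nll_per_token[:split], nll_per_token[split:]
    ppl = lambda xs: math.exp(sum(xs) / len(xs)) if xs else float("nan")
    return ppl(early), ppl(late)   # perplexity on the first `split` tokens vs. the rest

# e.g., score the same documents with both models and compare the two buckets:
# ppl_first_1024_a, ppl_rest_a = bucketed_perplexity(nlls_from_model_a)
# ppl_first_1024_b, ppl_rest_b = bucketed_perplexity(nlls_from_model_b)
```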

5

LetterRip t1_jbkmk5e wrote

> He makes it sound extraordinary

The problem is that extraordinary claims raise the 'quack' suspicion when there isn't much evidence provided in support.

> The most extraordinary claim I got stuck on was "infinite" ctx_len. One of the biggest limitations of transformers today is, imo, their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI, etc. would want to investigate?

Regarding the infinite context length - that is for inference, and it is more accurately stated as not having a fixed context length. While infinite "in theory", in practice the 'effective context length' is about double the trained context length.

> It borrows ideas from Attention Free Transformers, meaning the attention is linear in complexity, allowing for infinite context windows.

> Blink DL mentioned that when training in GPT mode with a context length of 1024, he noticed that RWKV_RNN deteriorated around a context length of 2000, so it can extrapolate and compress the prompt context a bit further. This is due to the fact that the model likely doesn't know how to handle samples beyond that size. This implies that the hidden state allows for the prompt context to be infinite, if we can fine-tune it properly. (Unclear right now how to do so.)

https://github.com/ArEnSc/Production-RWKV
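
For anyone trying to picture the "linear attention / infinite context" idea, here is a very simplified schematic of an AFT-style decayed weighted average maintained as a fixed-size running state. This is not the exact RWKV formulation (which has per-channel decays, a bonus term for the current token, and numerical-stability tricks); it only shows why the state, and thus the per-token cost, stays constant however long the context gets, and why "infinite" is an architectural property rather than a guarantee that distant tokens still influence the output.

```python
import numpy as np

d = 8                    # toy channel count
w = 0.1                  # decay rate (shared scalar here; per-channel in practice)

num = np.zeros(d)        # running numerator:   sum_i exp(-(t - i) * w + k_i) * v_i
den = np.zeros(d)        # running denominator: sum_i exp(-(t - i) * w + k_i)

def step(num, den, k, v):
    # Fold one more (key, value) pair into the decayed running sums.
    num = np.exp(-w) * num + np.exp(k) * v
    den = np.exp(-w) * den + np.exp(k)
    out = num / (den + 1e-9)   # decay-weighted average of all values seen so far
    return num, den, out

for t in range(100_000):       # arbitrarily long context, constant memory
    k, v = np.random.randn(d), np.random.randn(d)
    num, den, out = step(num, den, k, v)
```

Because the decay keeps shrinking old contributions, information from far back fades unless training at longer lengths teaches the model to keep it around - which lines up with the "effective context is roughly double the trained context" observation above.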

3

MrEloi t1_jbkq6qd wrote

Transformers are working very well at the moment.

There is no real reason to adopt another technology in the short term.

2

LetterRip t1_jbks0mg wrote

> I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens. While RWKV performs well on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).

Thanks for sharing your results. It is being tuned to longer context lengths; the current checkpoint is

RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

https://huggingface.co/BlinkDL/rwkv-4-pile-14b/tree/main

There should soon be a 6k and 8k as well.

So hopefully you should see better results with longer contexts soon.

> and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens.

Could you clarify - was one of those meant to be former and the other latter?

3

bo_peng t1_jbkwfzr wrote

Firstly, RWKV is mostly a single-developer project without PR, and everything takes time.

Moreover, there have been hundreds of "improved transformer" papers around, and surely we will agree that the signal-to-noise ratio is low, especially when you consider scaling.

FACT: It's very hard to get researchers to try something if it is not from OAI/DM/FAIR/... (and still hard even if it is).

Here is some of the latest unedited chat record from ChatRWKV v2 (14B ctx4096). I'd say it's pretty good, because this is only trained on the Pile. No fine-tuning. No instruction tuning. No RLHF.

====================

Eloise 2023/3/9 20:10:18

Chat reset for cryscan.

cryscan 2023/3/9 20:13:09

I heard that you are working in the AI and NLP lab in the university. How do you think about large generative language models?

Eloise 2023/3/9 20:13:31

That’s one of my favorite topics! Large language models have been getting a lot of attention lately. These models are super interesting and we don’t really know much about how they work. Generative models work in two directions: One is the generation, i.e. creating new data from scratch (like GANs). The other direction is sampling, which is more like what we usually think about as machine learning tasks. Generative NLP requires both these capabilities – the model has to generate new text samples and it has to sample realistic text that looks like real-world data. The ability to generate text in a controllable fashion would be incredibly useful for applications like debugging or debugging – verifying large textual specifications, writing documentation or even generating human-readable test plans

cryscan 2023/3/9 20:23:55

I see. What are the potential applications of them?

Eloise 2023/3/9 20:24:14

The main application area is, again, text generation: product descriptions, marketing messages etc. Also more recent work is in training models for image captioning – generating descriptions of images, a process which requires both high-level image understanding and rich vocabulary. Some related areas are the summarization tasks like dialogue response generation. You can apply generative language models to help generate small (or large) chatbots. A cool application is to combine image or text and voice and learn them jointly in a hybrid architecture (and compare to human feedback). There are a lot of applications that will make a big impact once we train good generative NLP models for them.

cryscan 2023/3/9 20:25:14

I am looking forward to the future where people can chat with machines.

Eloise 2023/3/9 20:25:26

Yeah! That would be cool. I’m excited to see more applications and data and models coming out of Google, Facebook, Amazon etc. But it’s always challenging because there is still so much unstructured text out there in the world – a large part of the English language is still untagged (mostly blog posts and news articles). There’s a lot more to be done there.

cryscan 2023/3/9 21:37:41

Maybe unsupervised learning would help.

Eloise 2023/3/9 21:37:48

Yeah! That’s one of the things we’re doing. We are trying to improve the auto-regressive models and see if they can learn their own grammar by watching how people communicate on social media, blogs or forums.

cryscan 2023/3/9 21:38:21

What are auto-regressive models? How do you train them?

Eloise 2023/3/9 21:38:40

A model is called auto-regressive if the output of a specific step depends on previously generated outputs. This makes training more stable, since it’s less dependent on initialization and slightly improves performance, especially for longer sequences. The first model was trained by Yoon Kim, which got released as an open source code. Also a lot of effort has been put in making models work better for long text sequences (e.g., image captioning) or even multiple sentence generation (although this is still difficult). You can watch this video from DeepMind about GPT-2 – a generative model that is capable of generating coherent paragraphs with minimal supervision:

cryscan 2023/3/9 21:39:26

Uhm?

Eloise 2023/3/9 21:39:30

I know, it’s crazy. I think it’s better to see it in action:

cryscan 2023/3/9 21:39:35

Well.

Eloise 2023/3/9 21:39:45

I also think it’s cool that the code is available! In fact, the attention mechanism of GPT-2 was published recently as a paper. It’s a bit more complex than you would expect for NLP, but cool nonetheless. Here’s the paper: Attention is all you need

cryscan 2023/3/9 21:40:37

That sounds interesting. I'll read it afterwards.

Eloise 2023/3/9 21:40:41

I would highly recommend it. Anyway, back to the main topic. So what do you think about the field?

cryscan 2023/3/9 21:41:21

I am sone concerns actually.

Eloise 2023/3/9 21:41:37

I’m a bit worried about the impact of models. I don’t want to be concerned or pessimistic, but we really have to think about this – especially when we’re working with the most sensitive data, e.g., medical data. One way of tackling this is through ethical considerations in design: making sure that systems work in our best interests and that they are transparent (like explanations or guarantees). It might even help to think about who will benefit from this research and what their perspective is on it.

cryscan 2023/3/9 21:43:58

Actually my concern is that the cost of training such a large model can only be covered by those big companies. How can we general researchers participate in this game?

Eloise 2023/3/9 21:44:14

This is a good point. We have to build tools that can help researchers or students start experimenting with these new techniques. I’m doing something like this on my own: an online platform called supervised-language-learning which gives people the opportunity to train their language model for few minutes a day (or learn other languages) and then use it in building a project (e.g., chatbot or dialogue system). The platform also aggregates all the results and keeps track of the models people built.

cryscan 2023/3/9 21:44:39

It's surely promising. Good luck then!

Eloise 2023/3/9 21:44:45

Thanks! I’m really excited about the potential of these techniques and would love to see them mature. In the meantime, it’s good to be critical!

====================

  1. Update ChatRWKV v2 to latest version.

  2. Use https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

  3. Run v2/chat.py and enjoy.

10

farmingvillein t1_jbkwkgl wrote

> most extraordinary claim I got stuck on was "infinite" ctx_len.

All RNNs have that capability, on paper. But the question is how well the model actually remembers and utilizes things that happened a long time ago (things that happened beyond the window that a transformer has, e.g.). In simpler RNN models, the answer is usually "not very well".

Which doesn't mean that there can't be real upside here--just that it is not a clear slam-dunk, and that it has not been well-studied/ablated. And obviously there has been a lot of work in extending transformer windows, too.
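
One cheap way to probe whether that long-range memory is actually usable is a retrieval test: bury a fact early in a long prompt, pad with filler, and ask for it at the end. A hedged, model-agnostic sketch - the `generate` callable, prompt format, and filler lengths below are all hypothetical, not tied to RWKV or any specific API:

```python
def long_range_recall_probe(generate, filler_tokens=4000):
    """generate: hypothetical callable mapping a prompt string to generated text."""
    fact = "The secret codeword is PERIWINKLE."
    filler = "The weather was unremarkable that day. " * (filler_tokens // 8)
    prompt = f"{fact}\n\n{filler}\n\nQuestion: What is the secret codeword?\nAnswer:"
    return "PERIWINKLE" in generate(prompt).upper()

# Run the same probe at increasing filler lengths and compare architectures:
# for n in (500, 2000, 8000, 32000):
#     print(n, long_range_recall_probe(my_model_generate, filler_tokens=n))
```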

5

Aran_Komatsuzaki t1_jbkyegs wrote

> Thanks for sharing your results. It is being tuned to longer context lengths, current is

I tried the one w/ context length = 4096 for RWKV :)

> Could you clarify - was one of those meant to be former and the other latter?

Sorry for the typo. The latter 'former' is meant to be the 'latter'.

2

djaym7 t1_jbpnn87 wrote

No paper is the blocker.

1