Submitted by ThePerson654321 t3_11lq5j4 in MachineLearning
ThePerson654321 OP t1_jbjisn7 wrote
Reply to comment by LetterRip in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
1) Sure. RWKV 7B came out 7 months ago, but the concept has been promoted by the developer for much longer. Compared to, say, DALL-E 2, which only came out 9 months ago and has exploded, it still feels like some organization would have picked up RWKV by now if it were as useful as the developer claims.

2) This might actually be a problem. But the code is public, so it shouldn't be that difficult to understand.

3) Not necessarily. Google, OpenAI, DeepMind, etc. test things that don't work out all the time.

4) It does not matter. If your idea is truly good, you will get attention sooner or later anyway.

I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, DeepMind would have been able to understand it.

I personally have two potential explanations for my question:

- It does not work as well as the developer claims, or it has some other flaw (for example, being hard to scale); time will be the judge of this.
- The community is simply slow to embrace it for some unknown reason.

I am leaning towards the first one.
LetterRip t1_jbjphkw wrote
> I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, DeepMind would have been able to understand it.
This was posted by DeepMind about a month ago (see the tape-RNN quote further down in this thread).
I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.
So prior to a month ago they didn't know it existed (edit: or at least not much more than that it existed), or that it happened to meet their use case.
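(For context on why that mattered to them: the appeal of this kind of architecture is that it can be trained over whole sequences in parallel, like a transformer, yet run as a constant-state RNN at inference. The snippet below is only a simplified sketch of that general idea, with a made-up exponential-decay weighting; it is not the actual RWKV code or its exact formulation.)

```python
# A simplified sketch (NOT the actual RWKV formulation) of a linear-attention-
# style layer that can be evaluated two equivalent ways:
#   * recurrently, carrying a small fixed-size state per step (RNN mode), or
#   * over the whole sequence at once, which is what makes GPU training fast.
# The exponential-decay weighting below is made up for illustration.
import numpy as np

def recurrent_mode(K, V, decay):
    """Step-by-step 'RNN mode': a running weighted sum plus its normaliser."""
    T, d = V.shape
    num = np.zeros(d)           # decayed sum of exp(k_i) * v_i over the past
    den = np.zeros(d)           # matching decayed sum of exp(k_i)
    out = np.zeros_like(V)
    for t in range(T):
        w = np.exp(K[t])
        num = decay * num + w * V[t]
        den = decay * den + w
        out[t] = num / (den + 1e-8)
    return out

def parallel_mode(K, V, decay):
    """Whole-sequence 'training mode': same result, computed for every
    timestep at once from the full K/V tensors."""
    T, _ = V.shape
    lag = np.arange(T)[:, None] - np.arange(T)[None, :]          # t - i
    W = np.where(lag >= 0, decay ** np.clip(lag, 0, None), 0.0)  # causal decay matrix (T, T)
    weights = np.exp(K)                                          # (T, d)
    num = np.einsum('ti,id->td', W, weights * V)
    den = np.einsum('ti,id->td', W, weights)
    return num / (den + 1e-8)

rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
print(np.allclose(recurrent_mode(K, V, 0.9), parallel_mode(K, V, 0.9)))  # True
```

Both functions give the same outputs; the whole-sequence form is what makes "absorbing the whole internet" in a reasonable amount of training time plausible, while the step-by-step form is what you run at generation time.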
> RWKV 7B came out 7 months ago but the concept has been promoted by the developer much longer.
There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.
> 2) This might actually be a problem. But the code is public, so it shouldn't be that difficult to understand.
Until it had proved itself, there was no motivation to make the effort to figure it out. The lower the effort threshold, the more likely people are to have a look; the larger the threshold, the more likely they are to invest their limited time in the hundreds of other interesting bits of research that come out each week.
> If your idea is truly good, you will get attention sooner or later anyway.
Or be ignored for all time till someone else discovers the idea and gets credit for it.
In this case the idea has started to catch on and be discussed by 'the Big Boys'; people are cautiously optimistic and are investing time to start learning about it.
> I don't buy the argument that it's too new or hard to understand.
It isn't "too hard to understand" - it simply hadn't shown itself to be interesting enough to be worth more than minimal effort, and without a paper, the effort required exceeded that minimal threshold. Now the 14B model has shown that it seems to scale, so people are beginning to invest the effort.
> It does not work as well as the developer claims, or it has some other flaw (for example, being hard to scale); time will be the judge of this.
No, it simply hadn't been shown to scale. Now we know it scales to at least 14B, and there is no reason to think it won't scale the same as any other GPT model.
The DeepMind paper lamenting the need for a fast way to train RNN models came out only about a month ago.
ThePerson654321 OP t1_jbjz508 wrote
> I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time. So prior to a month ago they didn't know it existed or that it happened to meet their use case.
That surprises me, considering his RWKV repos have thousands of stars on GitHub.
I'm curious about what they responded with. What did they say?
> There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.
According to his claims (especially the infinite context length) it definitely was interesting. That it scales was pretty obvious even at 7B.
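(To spell out why that particular claim is attractive: an RNN-style model carries a fixed-size state during generation, while a vanilla transformer keeps a KV cache that grows with every past token. A back-of-the-envelope sketch, using made-up layer counts and dimensions rather than RWKV's real configuration:)

```python
# Back-of-the-envelope comparison of inference-time memory, assuming an
# RNN-style model keeps two d_model-sized state vectors per layer (made-up
# numbers, not RWKV's real configuration) versus a transformer KV cache.
d_model, n_layers = 4096, 32

# RNN-style generation: fixed-size state, independent of how much has been read.
rnn_state_floats = n_layers * 2 * d_model

# Transformer generation: keys + values for every past token, in every layer.
def kv_cache_floats(seq_len):
    return n_layers * 2 * d_model * seq_len

for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10} tokens   rnn state: {rnn_state_floats:>12,}   kv cache: {kv_cache_floats(n):>18,}")
```

The recurrent state stays the same size no matter how long the history gets, which is where the "infinite ctx len" framing comes from; whether quality actually holds up over very long contexts is a separate question.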
But your argument is basically that no large organization has noticed it yet.
My guess is that it actually has some unknown problem/limitation that makes it inferior to the transformer architecture.
We'll just have to wait. Hopefully you are right but I doubt it.
farmingvillein t1_jbk47jg wrote
> I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.
>
> So prior to a month ago they didn't know it existed or that it happened to meet their use case.
How does #2 follow from #1?
RWKV has been on Reddit for quite a while, and a large number of researchers frequent/lurk on Reddit, including DeepMind researchers, so the idea that they had no idea RWKV exists seems specious.
Unless you mean that you emailed them and they literally told you that they didn't know about this. In which case...good on you!
ThePerson654321 OP t1_jbk6nb4 wrote
Thanks! I also find it very unlikely that nobody from a large organisation (OpenAI, Microsoft, Google Brain, DeepMind, Meta, etc.) would have noticed it.
farmingvillein t1_jbk819k wrote
I think it is more likely people have seen it, but dismissed it as a bit quixotic, because the RWKV project has made little effort to iterate in an "academic" fashion (i.e., with rigorous, clear testing, benchmarks, goals, comparisons, etc.). It has obviously done pieces of this, but hasn't been sufficiently well-defined as to make it easy for others to iterate on top of it, from a research POV.
This means that anyone else picking up the architecture is going to have to go through the effort to create the whole necessary research baseline. Presumably this will happen, at some point (heck, maybe someone is doing it right now), but it creates a large impediment to further iteration/innovation.
LetterRip t1_jbkdshr wrote
Here is what the author stated in the thread:
> Tape-RNNs are really good (both in raw performance and in compression i.e. very low amount of parameters) but they just can't absorb the whole internet in a reasonable amount of training time... We need to find a solution to this!
I think they knew it existed (i.e., they knew there was a deep learning project named RWKV), but they appear not to have known it met their scaling needs.
farmingvillein t1_jbkx0co wrote
I don't understand the relevance here--tape-RNNs != RWKV, unless I misunderstand the RWKV architecture (certainly possible).