Github: https://github.com/facebookresearch/NPM

Abstract:

>Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which can make it difficult to predict rare tokens or phrases. We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. We show that NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval. Zero-shot evaluation on 9 closed-set tasks and 7 open-set tasks demonstrates that NPM outperforms significantly larger parametric models, either with or without a retrieve-and-generate approach. It is particularly better on dealing with rare patterns (word senses or facts), and predicting rare or nearly unseen words (e.g., non-Latin script).

https://preview.redd.it/qf2lqrkku47a1.jpg?width=658&format=pjpg&auto=webp&s=7dc7e76f3075b4b4f0916c2de1e442b19b2c0f49

https://preview.redd.it/gqhlbykku47a1.jpg?width=1241&format=pjpg&auto=webp&s=39f63470d18ea6f4a8ed560b371cc46b939b2c6f

https://preview.redd.it/p7bzdukku47a1.jpg?width=883&format=pjpg&auto=webp&s=6a8eb2b66abcb1581abf7280180c1c0e86201232

https://preview.redd.it/z6niwykku47a1.jpg?width=1112&format=pjpg&auto=webp&s=8337a4802db983df1a4b0b11934c0708888641a4

https://preview.redd.it/s8fdhxkku47a1.jpg?width=1361&format=pjpg&auto=webp&s=28b307df857ef2262d3f8348fd1094ebb793a63d

https://preview.redd.it/94t5fwkku47a1.jpg?width=1362&format=pjpg&auto=webp&s=da8bca8fd08ecaf956658c674f5a32a930cdd3a2

Comments

You must log in or register to comment.

CatalyzeX_code_bot t1_j11aphv wrote on December 20, 2022 at 11:01 PM

#988,367

Found relevant code at https://github.com/facebookresearch/NPM + all code implementations here

To opt out from receiving code links, DM me

Singularian2501 OP t1_j11bgj5 wrote on December 20, 2022 at 11:06 PM

#988,429

Replying to CatalyzeX_code_bot (#988,367)

~~The github link is broken. That was also the reason I didn´t include it in the post. The paper is not from me! Also searched on paperswithcode but they also dont have a working link.~~

Edit the link is working now: https://github.com/facebookresearch/NPM !

Dankmemexplorer t1_j123o1b wrote on December 21, 2022 at 2:37 AM

#990,556

time to train gpt-4 on my mom's laptop

rjromero t1_j12aza8 wrote on December 21, 2022 at 3:36 AM

#991,110

> We use the model architecture and initial weights of RoBERTa large (Liu et al., 2019), consisting of 354M parameters. Training is done for 100,000 steps, using thirty-two 32GB GPUs.

354M parameters? At FP32 that's 1.41gb. It's tiny.

ItsTheUltimateBob t1_j12goke wrote on December 21, 2022 at 4:25 AM

#991,567

Replying to rjromero (#991,110)

That's a puny number of GPUs, too.

Purplekeyboard t1_j12lik7 wrote on December 21, 2022 at 5:11 AM

#991,894

Ok, but how does it compare in the real world to GPT-3?

master3243 t1_j12nmgc wrote on December 21, 2022 at 5:32 AM

#992,030

Replying to Purplekeyboard (#991,894)

There's no way for a paper to just have a table of "real world comparison of GPT-3",

There needs to (for now) be some benchmark created that systematically tests for the things we care about. Which is exactly why I deeply respect researchers dedicated on creating better and more useful benchmarks as their work immensely accelerates the field while they mostly don't get the attention they (IMO) deserve.

Abiacere t1_j12o7f6 wrote on December 21, 2022 at 5:38 AM

#992,072

Replying to Singularian2501 (#988,429)

Has anyone actually found the code?

[deleted] t1_j12qhwp wrote on December 21, 2022 at 6:03 AM

#992,216

[deleted]

farmingvillein t1_j12s4mn wrote on December 21, 2022 at 6:22 AM

#992,327

Replying to Dankmemexplorer (#990,556)

unfortunately still really slow (for now) to run, however:

> the speed of NPM is still on par with the speed of significantly larger parametric models that NPM outperforms

Purplekeyboard t1_j12uk1s wrote on December 21, 2022 at 6:50 AM

#992,485

Replying to master3243 (#992,030)

But what I'm asking is, how do the benchmarks match real world performance? Because I've seen claims that other language models were supposedly close to or equal to GPT-3 in this or that benchmark, but try interacting with them and the difference is striking. It's like the difference between talking to a college grad student and talking to the meth addled homeless guy who shouts at lampposts.

blose1 t1_j12voe0 wrote on December 21, 2022 at 7:04 AM

#992,543

Replying to Purplekeyboard (#991,894)

GPT-3 is like yesterday news, SOTA is chatGPT and it does circles around real world GPT-3 on every possible task.

RealGrande t1_j12zfl6 wrote on December 21, 2022 at 7:52 AM

#992,750

Replying to blose1 (#992,543)

ChatGPT is a fine-tuned version of gpt3 (well, gpt3.5 but pretty much the same barring some improvements)

Maximum t1_j136j97 wrote on December 21, 2022 at 9:29 AM

#993,062

Replying to master3243 (#992,030)

How about BIG-bench?

valdanylchuk t1_j137hla wrote on December 21, 2022 at 9:43 AM

#993,112

Replying to Purplekeyboard (#992,485)

From the paper:

>Extension for generation. It is currently non-trivial to use NPM for generation, since it is the encoder-only model. Future work can explore autoregressive generation as done in Patel et al. (2022) or use NPM for editing (Schick et al., 2022; Gaoet al., 2022).

So, don't expect to talk to it just yet.

ShowerVagina t1_j13gxva wrote on December 21, 2022 at 11:49 AM

#993,618

Replying to blose1 (#992,543)

GPT-3 is still the best for general use. Or for story writing. Novel AI is good, but still not as good as GPT-3.

Dankmemexplorer t1_j13k11f wrote on December 21, 2022 at 12:23 PM

#993,818

Replying to farmingvillein (#992,327)

aint that just the way

vwings t1_j13pguc wrote on December 21, 2022 at 1:16 PM

#994,220

Replying to rjromero (#991,110)

It was expected, right? A retrieval system should be much more efficient than storing phrases in neural net weights as GPT does...

machinelearner77 t1_j1437x9 wrote on December 21, 2022 at 3:03 PM

#995,288

Looks like cool stuff... but if you put a code link in the abstract and publish your paper, it should be a functioning link...

mtocrat t1_j14di0f wrote on December 21, 2022 at 4:11 PM

#996,041

Replying to blose1 (#992,543)

How is that relevant?

blose1 t1_j14q7ul wrote on December 21, 2022 at 5:33 PM

#996,965

Replying to RealGrande (#992,750)

Have you actually tried both on same tasks? I mean it seems like a lot of people here read a paper and some blog and make their conclusion without even using the tool, I've used both on the same tasks, compared on hundreds of real world cases and yes it's fine-tuned GPT3 but with human assisted RL and it's doing circles around GPT-3 in question answering, COT and code generation.

blose1 t1_j14qfir wrote on December 21, 2022 at 5:34 PM

#996,982

Replying to ShowerVagina (#993,618)

Have you compared both yourself on question answering, COT and code generation ?

[deleted] t1_j1513ye wrote on December 21, 2022 at 6:42 PM

#997,825

Replying to Abiacere (#992,072)

[removed]

Fit-Presence-8040 t1_j154l7z wrote on December 21, 2022 at 7:05 PM

#998,139

Replying to machinelearner77 (#995,288)

The github link is working now. Just got public.

[deleted] t1_j15c4qw wrote on December 21, 2022 at 7:54 PM

#998,716

[deleted]

[deleted] t1_j15cxh2 wrote on December 21, 2022 at 8:00 PM

#998,781

Replying to Abiacere (#992,072)

[deleted]

[deleted] t1_j15g25f wrote on December 21, 2022 at 8:20 PM

#999,057

Replying to farmingvillein (#992,327)

[deleted]

yaosio t1_j15h0xa wrote on December 21, 2022 at 8:27 PM

#999,136

Replying to farmingvillein (#992,327)

They also say there's room for improvement but they didn't explore that in this paper. Just think, one day we'll have the power of ~~the sun~~ GPT-3 in the palm of our hand. Could be really soon, could be far away, but it's coming.

oathbreakerkeeper t1_j15vgip wrote on December 21, 2022 at 10:03 PM

#1,000,474

Replying to blose1 (#996,965)

What's COT?

[deleted] t1_j15vrid wrote on December 21, 2022 at 10:05 PM

#1,000,501

Replying to oathbreakerkeeper (#1,000,474)

[removed]

gbfar t1_j16478a wrote on December 21, 2022 at 11:03 PM

#1,001,130

I see lots of potential applications for this. I wonder if we could reliably control text generation by tweaking the reference corpus.

Taenk t1_j16lvo8 wrote on December 22, 2022 at 1:17 AM

#1,002,550

Replying to [deleted] (#998,781)

Anyone got a demo running?

PengsoonThePenguin t1_j16rrtg wrote on December 22, 2022 at 2:03 AM

#1,003,061

Replying to [deleted] (#998,716)

I guess an easy explanation is that the model works solely from retrieval over the corpus. Every prediction has to be explained by the corpus.

ItsTheUltimateBob t1_j16z2v4 wrote on December 22, 2022 at 3:00 AM

#1,003,646

Replying to yaosio (#999,136)

Hopefully, they'll be beyond GPT-3.

yaosio t1_j17p2bx wrote on December 22, 2022 at 7:17 AM

#1,005,371

Replying to Purplekeyboard (#992,485)

There was a thread awhile back about one benchmark being filled with spelling errors, grammar errors, and wrong answers. In many cases there were multiple correct answers but one was picked as the correct answer for no particular reason. Creating a benchmark for the subjective task of "is this text good?" seems to be pretty hard. It's even harder when the people creating the benchmark have a poor grasp of language.

If I were to ask a language model "Describe an apple." There are many correct answers, none more correct than the others. Multiple independent humans would have to go over the answers and make subjective decisions on if the LLM answerded well. This becomes much more difficult with better LLMs because the prompts and answers have to become more complex, which makes reviewing the answers harder and more time consuming.

red75prime t1_j1899a0 wrote on December 22, 2022 at 11:49 AM

#1,006,527

Replying to yaosio (#999,136)

GPT-3: Sure, I can tell you power output of the sun. It would be 3.8 x 1026 W or 3.234 kW. I'm glad to help.

Think_Olive_1000 t1_j1cvpnz wrote on December 23, 2022 at 10:37 AM

#1,020,436

Replying to oathbreakerkeeper (#1,000,474)

Chain of thought

drd13 t1_j1h3gvy wrote on December 24, 2022 at 8:00 AM

#1,035,998

Replying to [deleted] (#998,716)

Similarly to T5 (abd Bert) the model is pre-trained by predicting some randomly masked spans of words. However the way these spans of words are predicted is different.

In T5, masked words are generated one-by-one autoregressively (i.e. use a softmax over vocabulary to generate words one by one). Here a set of candidate possible spans, covering your whole trained corpus is preliminarily created and the model looks at all the candidate spans and chooses the one it thinks is the best (using a contrastive loss).