Submitted by Emergency_Apricot_77 t3_zmd6l8 in MachineLearning

I've been experimenting with language models a lot lately and wondering whether human-generated text (i.e. "natural" text) is really supposed to be maximally likely according to a language model, even after training. For example, has anyone compared the likelihood of human-translated text with the likelihood of machine-translated text under a language model like GPT-3?

Are there any works that do this already? Does this idea even make sense to begin with?

4

Comments


breezedeus t1_j0alnlc wrote

Actually, it's really not like that. If our words came out that way, people would know what you were going to say without even having to say it.

0

dojoteef t1_j0ayqqq wrote

See the graphs in the paper that introduced nucleus sampling, "The Curious Case of Neural Text Degeneration". They visualize how human-authored text has different statistical properties from machine-generated text. The difference mainly reflects a tradeoff between fluency and coherence. Sampling procedures like top-k or nucleus sampling restrict the tokens that can be emitted and thus introduce statistical bias into the generated text, but they produce more fluent text. By contrast, sampling from the full distribution gets closer to the distribution of human-authored text, but often degenerates into incoherence (hence the title of the paper).
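To make the "restrict the tokens that can be emitted" part concrete, here is a minimal sketch of top-k and nucleus (top-p) filtering in plain Python. The `next_token` distribution and the function names are made up for illustration, not taken from the paper or from any library. The point is that any token cut from the tail gets probability zero, which is where the statistical bias comes from.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p, then renormalize."""
    kept, mass = [], 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept.append(tok)
        mass += probs[tok]
        if mass >= p:
            break
    return {t: probs[t] / mass for t in kept}

def sample(probs, rng=random):
    """Draw one token from a (possibly truncated) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution, invented for this example.
next_token = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

print(top_k_filter(next_token, k=2))      # only "the" and "a" survive
print(nucleus_filter(next_token, p=0.9))  # "zebra" is cut from the tail
print(sample(nucleus_filter(next_token, p=0.9)))
```

With these numbers, a human sentence that happens to use "zebra" at this position has zero probability under either truncated sampler, even though the full distribution assigns it 0.05.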

12

farmingvillein t1_j0fh5lg wrote

> If our words came out that way, people would know what you were going to say without even having to say it.

Even if this were true, it would not hold in any sort of general sense, since every person/agent has its own unique (and incompletely observable) context that seeds any output.

1