

PengsoonThePenguin t1_j16rrtg wrote

I guess an easy explanation is that the model works solely by retrieval over the corpus: every prediction has to come from, and be explainable by, the corpus.
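
For intuition, here is a minimal toy sketch of what "prediction = retrieval" means in practice. The encoder and phrase index below are made-up stand-ins for illustration, not the actual model or its API:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a learned encoder; returns a unit-norm vector.
    (A real model would use a trained transformer here.)"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)

# The "vocabulary" is just phrases that literally occur in the corpus,
# indexed once, offline.
corpus_phrases = ["Thomas Edison", "the light bulb", "Menlo Park"]
phrase_index = {p: embed(p) for p in corpus_phrases}

def predict(masked_context: str) -> str:
    """Prediction is nearest-neighbor search, not a softmax over a
    fixed vocabulary: the answer must be a phrase from the corpus."""
    q = embed(masked_context)
    return max(phrase_index, key=lambda p: phrase_index[p] @ q)

print(predict("[MASK] invented the phonograph."))
```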

3

drd13 t1_j1h3gvy wrote

Similarly to T5 (and BERT), the model is pre-trained by predicting randomly masked spans of words. However, the way these spans are predicted is different.

In T5, masked words are generated autoregressively, one token at a time (i.e. a softmax over the vocabulary produces each token). Here, a set of candidate spans covering the whole training corpus is built in advance, and the model scores all the candidate spans and picks the one it thinks is best, trained with a contrastive loss.
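
Roughly, the two objectives look like this. This is a hedged sketch: the shapes, names, and dummy tensors are illustrative assumptions, not code from either paper:

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 32000, 768
span_query = torch.randn(1, hidden)  # encoding of the masked position

# T5 / BERT style: project to a fixed vocabulary and predict the gold
# token with cross-entropy, one token at a time.
vocab_proj = torch.nn.Linear(hidden, vocab_size)
token_logits = vocab_proj(span_query)                 # (1, vocab_size)
t5_loss = F.cross_entropy(token_logits, torch.tensor([1234]))  # gold token id

# Contrastive span selection: score precomputed encodings of candidate
# spans drawn from the corpus, and pull the query toward the gold span
# (an InfoNCE-style loss over candidates instead of vocabulary tokens).
num_candidates = 1024
candidate_spans = torch.randn(num_candidates, hidden)  # built over the corpus
scores = span_query @ candidate_spans.T                # (1, num_candidates)
gold_span = torch.tensor([0])                          # index of the true span
contrastive_loss = F.cross_entropy(scores, gold_span)

# At inference the prediction is simply the best-scoring corpus span:
predicted = scores.argmax(dim=-1)
```

The practical difference is that the softmax over a fixed vocabulary is replaced by a similarity search, so the model's output space is whatever spans actually exist in the corpus.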

2