Emergency_Apricot_77

Emergency_Apricot_77 t1_jah9rb7 wrote

Why go with BLEU, though? OP didn't specifically mention optimizing sequence-level metrics. Can't we still use cross-entropy? Something like the following:

Sample first token, calculate cross-entropy with first token of gold

Sample second token, calculate cross-entropy with second token of gold

Sample third token, calculate cross-entropy with third token of gold

... and so on?


This way we still have a differentiable objective, but with much better alignment between training and inference (as opposed to the current setup of teacher-forced training and sampled inference), which is what I thought the OP was going for.
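A rough sketch of the per-step idea above, in plain Python. Everything here is made up for illustration: `toy_logits` is a hypothetical stand-in for a real decoder, and `VOCAB_SIZE`, `bos`, and `rollout_loss` are my own names, not anything from the OP's setup. The only difference from teacher forcing is which token gets fed back in at the next step.

```python
import math
import random

VOCAB_SIZE = 5  # hypothetical toy vocabulary size


def toy_logits(prev_token, step):
    """Hypothetical stand-in for a decoder: scores over the vocab given the previous token."""
    return [math.sin(prev_token + step + v) for v in range(VOCAB_SIZE)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rollout_loss(gold, bos=0, seed=0, teacher_forcing=False):
    """Mean per-step cross-entropy against a non-empty gold sequence.

    teacher_forcing=True  -> condition each step on the gold token (usual training).
    teacher_forcing=False -> condition each step on the model's own sample.
    """
    rng = random.Random(seed)
    prev = bos
    losses = []
    for step, gold_token in enumerate(gold):
        probs = softmax(toy_logits(prev, step))
        # Cross-entropy of this step's distribution against the gold token.
        losses.append(-math.log(probs[gold_token]))
        if teacher_forcing:
            prev = gold_token
        else:
            # Feed back a sampled token, as would happen at inference time.
            prev = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
    return sum(losses) / len(losses), losses


if __name__ == "__main__":
    print(rollout_loss([1, 2, 3]))
    print(rollout_loss([1, 2, 3], teacher_forcing=True))
```

One caveat with this sketch: each step's cross-entropy is differentiable with respect to that step's logits, but the sampling itself is not, so in a real model gradients would not flow back through the choice of earlier tokens.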
