Viewing a single comment thread. View all comments

AuspiciousApple OP t1_itww8zh wrote

I only skimmed the paper, but I think they said that (at least for some benchmarks) they count exact matches as correct, so yes, generating anything but the answer itself doesn't count?

I tried the example from this dataset that they use in one of their figures, and I seemed to get the correct answer with the XL variant most of the time, but the rationale was nonsense ~80% of the time, even when the answer was correct. E.g. "Plastic containers are a thing of the past", "Plastic containers are way too precious to store food", or "Wood is not sturdy enough to store food".

1

utilop t1_itwxusd wrote

I think that would make sense, and I could see the small models, in particular with CoT, failing to produce a valid answer.

For both MMLU and BBH, they report a worse average score with CoT than the direct prompt.

I would take that as CoT not reliably producing correct explanations, since it does not even lead to better answers.

Could be that the problem is their prompt, few-shot setup, or calibration though?

Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?

2

AuspiciousApple OP t1_itwy8bj wrote

>Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?

That sounds like a good idea, though NLP isn't really my field, so I might not be using the correct sampling parameters, or I might make subtle mistakes in writing the question (e.g. punctuation, line breaks, etc.), so I was hoping someone here would know more.

Even for English-to-German translation, the model often generated obvious nonsense, sometimes even just repeating the English phrase, despite my using the prompt exactly as it appears in the Hugging Face config/paper.

1