utilop t1_itwxusd wrote

I think that would make sense, and I could see the small models (in particular with CoT) failing to produce a valid answer.

For both MMLU and BBH, they report a worse average score with CoT than the direct prompt.

I would take that to mean CoT does not reliably produce correct explanations, since it does not lead to better answers.

Could the problem be their prompt, few-shot setup, or calibration, though?

Maybe, for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?


utilop t1_itwvjtb wrote

I haven't read this paper, so I do not know the details.

However, for SNARKS, it is defined as "Determine which of two sentences is sarcastic".

I must be missing something in that case because the base rate should be 50 %.

In the paper, they seem to state that even with CoT, Flan gets 9.6 % (small), 42.7 % (base), 60.1 % (large), 60.1 % (XL), 55.1 % (XXL).

So if I am interpreting it correctly, it is not doing much better than random chance even for the larger models, and I would not expect good CoT output or significantly better results from testing on a larger model.

Detecting sarcasm might not be the best use of this model?

Not sure how they get so much less than 50 %; perhaps the score counts failures to generate a valid answer as wrong.
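A quick back-of-envelope sketch of that hypothesis (the parse-failure rate here is an assumption, not a number from the paper): if a fraction of CoT generations never yield a parseable choice and are scored as wrong, overall exact-match accuracy on a two-choice task drops below the 50 % base rate.

```python
def expected_accuracy(parse_failure_rate, accuracy_when_parsed=0.5):
    """Expected exact-match accuracy on a two-choice task when
    unparseable generations are scored as incorrect and the rest
    are answered at the given accuracy (chance = 0.5)."""
    parsed_fraction = 1.0 - parse_failure_rate
    return parsed_fraction * accuracy_when_parsed

# Hypothetical: if ~80% of Flan-small's CoT outputs never produce a
# valid answer, chance-level guessing on the remainder gives ~10%
# overall, which is in the ballpark of the reported 9.6% on SNARKS.
print(round(expected_accuracy(0.8), 3))
```

This is only an illustration that sub-50 % scores are consistent with parse failures, not a claim about the paper's actual scoring setup.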
