utilop t1_itwvjtb wrote
Reply to comment by AuspiciousApple in [D] What's the best open source model for GPT3-like text-to-text generation on local hardware? by AuspiciousApple
I haven't read this paper, so I don't know the details.
However, SNARKS is defined as "Determine which of two sentences is sarcastic".
In that case I must be missing something, because the base rate should be 50 %.
In the paper, they seem to state that even with CoT, Flan gets 9.6 % (small), 42.7 % (base), 60.1 % (large), 60.1 % (XL), 55.1 % (XXL).
So if I am interpreting it correctly, it is not doing much better than random chance even for the larger models, and I would not expect good CoT explanations or significantly better results from testing on a larger model.
Detecting sarcasm might not be the best use of this model?
Not sure how they get so much less than 50 % - perhaps the score counts failures to generate a valid answer as incorrect.
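For what it's worth, that kind of scoring would explain sub-50 % numbers on a two-choice task: if any generation that can't be parsed into one of the two options counts as wrong, accuracy can fall well below the guessing rate. A rough sketch of what I mean (the parsing rule here is just my assumption, not necessarily what the paper does):

```python
# Sketch: score a two-choice (A/B) task where unparseable generations
# count as incorrect, so accuracy can drop below the 50 % base rate.
def score(predictions, labels):
    correct = 0
    for pred, label in zip(predictions, labels):
        answer = pred.strip().upper()
        # Keep only generations that clearly name one of the two options;
        # anything else (rambling CoT, empty output) is marked wrong.
        if answer.startswith("(A)") or answer.startswith("A"):
            parsed = "A"
        elif answer.startswith("(B)") or answer.startswith("B"):
            parsed = "B"
        else:
            parsed = None  # failed to produce a valid answer
        correct += int(parsed == label)
    return correct / len(labels)

# e.g. 3 valid-and-correct answers out of 10 examples -> 30 % accuracy,
# even though guessing between the two options would average 50 %.
```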
utilop t1_ironow2 wrote
Reply to comment by Less-Article1309 in [D] Quantum ML promises massive capabilities, while also demanding enormous training compute. Will it ever be feasible to train fully quantum models? by avialex
Why did this get downvoted?
Is there some fundamental limitation implying that we would have to rely on SGD and cannot do the optimization through superposition?
utilop t1_itwxusd wrote
Reply to comment by AuspiciousApple in [D] What's the best open source model for GPT3-like text-to-text generation on local hardware? by AuspiciousApple
I think that would make sense, and I could see the small models, in particular with CoT, failing to produce a valid answer.
For both MMLU and BBH, they report a worse average score with CoT than the direct prompt.
I would take that as a sign that CoT is not reliably producing correct explanations, since it does not lead to better answers.
Could be that the problem is their prompt, few-shot setup, or calibration though?
Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?
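Something like the following would be a quick way to eyeball direct vs CoT prompting on one of those tasks with a public Flan-T5 checkpoint. The checkpoint size, the example question, and the prompt templates are just placeholders I picked, not the paper's exact setup:

```python
# Minimal sketch: compare a direct prompt against a CoT-style prompt
# on a single two-choice example, using a public Flan-T5 checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-large"  # assumption: any Flan-T5 size works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = (
    "Which statement is sarcastic?\n"
    "(A) Working weekends is my favourite hobby.\n"
    "(B) Working weekends is exhausting.\n"
)
prompts = {
    "direct": question + "Answer:",
    "cot": question + "Let's think step by step.",
}

for name, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(name, "->", tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Running both prompts side by side on a handful of examples would at least show whether the CoT outputs are coherent or whether they wander off and never commit to (A) or (B).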