Submitted by AuspiciousApple t3_ydwi6c in MachineLearning

Playing around with GPT-3 was quite fun, but now my credits have expired. I used it primarily just for fun to see what the model knows and understands, and occasionally for brainstorming and writing.

I've tried FLAN-T5 XL, which runs very fast on my 3060 Ti and in theory should be quite good, but the output was quite lacklustre compared to GPT-3. Even for examples from the paper (e.g. the sarcasm detection task), the answer was typically correct, but the chain of thought leading up to it was only convincing about 10% of the time.

I experimented with a few different sampling strategies (temperature, top_p, top_k, beam search, etc.), but the results were still quite disappointing.
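My experiments looked roughly like this (just a sketch; the prompt below is an illustrative sarcasm-style example rather than my exact one, and the settings varied):

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# bf16 tends to be safer than fp16 for T5-family models and fits a 3B model in 8 GB
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xl", torch_dtype=torch.bfloat16
).to("cuda")

prompt = ("Which statement is sarcastic? (a) Wood is a great material for storing food. "
          "(b) Plastic containers are way too precious to store food. Explain step by step.")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# sampling
out = model.generate(input_ids, max_new_tokens=128, do_sample=True,
                     temperature=0.7, top_p=0.9, top_k=40)
# vs. beam search
# out = model.generate(input_ids, max_new_tokens=128, num_beams=4)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```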

Any advice? Would the XXL model or one of Meta's LLMs be much better? Are there magic settings for sampling? Are there any existing notebooks/GUI tools that I can conveniently run locally?

Cheers!

17

Comments


Southern-Trip-1102 t1_itwt1ac wrote

You might want to look into the BLOOM models on huggingface.

5

AuspiciousApple OP t1_itwvawz wrote

Thanks, I'll take a look. Have you played around with them yourself?

1

Southern-Trip-1102 t1_itwyp8o wrote

A bit. As far as I can tell, the 176B one is on par with GPT-3, though I haven't done much testing or comparison. They are also trained on 59 languages, including 13 programming languages, from what I read.

4

AuspiciousApple OP t1_itx00gv wrote

Thanks! Even a qualitative, subjective judgement of rough parity is quite encouraging. I might need DeepSpeed/etc. to get it to run on my 8 GB GPU, but if it's even similar in quality, that's very cool.
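Something along these lines might be enough for one of the smaller BLOOM checkpoints (the 176B model won't fit in 8 GB even quantized); bloom-7b1 in 8-bit is just my guess at a starting point, untested:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# 8-bit loading needs the bitsandbytes package; a 7B model in 8-bit is ~7 GB,
# so it should just about squeeze onto an 8 GB card
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1", device_map="auto", load_in_8bit=True
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```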

1

visarga t1_itwxzgs wrote

My experience is that models that have not had the instruction-tuning treatment don't behave nicely.

1

Southern-Trip-1102 t1_itwyur3 wrote

Could that be because BLOOM was trained on a more varied dataset, as opposed to being focused on English, since it covers multiple natural and programming languages?

2

_Arsenie_Boca_ t1_itwt4ls wrote

I don't think any model you can run on a single commodity GPU will be on par with GPT-3. Perhaps GPT-J, OPT-6.7B/13B, and GPT-NeoX-20B are the best alternatives. Some might need significant engineering (e.g. DeepSpeed) to work with limited VRAM.
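As a rough, untested sketch, fp16 plus accelerate's device_map="auto" will offload whatever doesn't fit into VRAM to CPU RAM (slow, but it runs), e.g. for OPT-6.7B:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ~13 GB of fp16 weights, so on an 8 GB card part of the model ends up in CPU RAM
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("My favourite machine learning paper is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```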

4

deeceeo t1_itxiswn wrote

UL2 is 20b and supposedly on par with GPT-3?

2

_Arsenie_Boca_ t1_ityby3b wrote

True, I forgot about that one. Although getting a 20B model (NeoX-20B or UL2 20B) to run on an RTX GPU is probably a big stretch.

1

AuspiciousApple OP t1_itwvq1f wrote

>I dont think any model you can run on a single commodity gpu will be on par with gpt-3.

That makes sense. I'm not an NLP person, so I don't have a good intuition on how these models scale or what the benchmark numbers actually mean.

In CV, the difference between a small and a large model might be a few percentage points of accuracy on ImageNet, but even small models work reasonably well. FLAN-T5 XL seems to generate nonsense 90% of the time for the prompts I've tried, whereas GPT-3 produces great output most of the time.

Do you have any experience with these open models?

1

_Arsenie_Boca_ t1_ityccjh wrote

I don't think there is a fundamental difference between CV and NLP. However, we expect language models to be much more generalist than any vision model (have you ever seen a vision model that performs well on discriminative and generative tasks across domains without finetuning?). I believe this is where scale is the enabling factor.

1

tim_ohear t1_ity5tvj wrote

I've often found models that were exciting on paper to be very disappointing when I actually tried them, for instance the recent OPT releases.

I've used GPT-J a lot and it's really nice, but it takes 24 GB of GPU RAM in fp16 if you use the full 2048-token context. Eleuther also has the smaller GPT-Neo models (1.3B and 2.7B), which would be a better fit for your GPU.

For me, the fun in these smaller models is how easily you can completely change their "personality" by finetuning on even tiny amounts of text, like a few hundred KB. I've achieved subjectively better-than-GPT-3 results (for my narrow purpose) by finetuning GPT-J on 3 MB of text.

Finetuning requires quite a bit more GPU RAM. DeepSpeed can really help, but if you're working with tiny amounts of data, it will only take a couple of hours on a cloud GPU to get something fun.
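Roughly, that kind of tiny-data finetune looks something like this with the Hugging Face Trainer (GPT-2 small as a stand-in so it fits on a consumer card; the file name and hyperparameters are placeholders, and GPT-J itself would need DeepSpeed or a bigger GPU):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # swap in EleutherAI/gpt-j-6B if you have the VRAM/DeepSpeed setup
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# "my_text.txt" is a placeholder for your few hundred KB of text, one passage per line
dataset = load_dataset("text", data_files={"train": "my_text.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                           per_device_train_batch_size=1, fp16=True),
    train_dataset=dataset,
    # mlm=False gives the standard causal-LM objective (labels = shifted inputs)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```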

4

hapliniste t1_itwncro wrote

I'm interested as well. I just bought a 3090, so I have a bit more room. I think I saw an optimized GPT-J that should run on it, but I haven't tried it so far.

2

AuspiciousApple OP t1_itwoeep wrote

A bit jealous of all that VRAM and all those cores.

The usage example here: https://huggingface.co/google/flan-t5-xl is quite easy to follow. Getting it up and running should take you all of 5 minutes plus the time to download the model. You could probably also run the XXL model.
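For XXL (11B, so roughly 22 GB in fp16), something like 8-bit loading via bitsandbytes would probably leave more headroom even on a 3090; untested sketch:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xxl", device_map="auto", load_in_8bit=True
)

input_ids = tokenizer("translate English to German: How old are you?",
                      return_tensors="pt").input_ids.to("cuda")
out = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```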

2

AuspiciousApple OP t1_itwohgl wrote

Would be curious to hear whether you get reasonable output with the XXL variant or with GPT-J.

1

utilop t1_itwvjtb wrote

I haven't read this paper so do not know the details.

However, for SNARKS, it is defined as "Determine which of two sentences is sarcastic".

I must be missing something in that case because the base rate should be 50 %.

In the paper, they seem to state that even with CoT, Flan gets 9.6 % (small), 42.7 % (base), 60.1 % (large), 60.1 % (XL), 55.1 % (XXL).

So if I am interpreting it correctly, it is not doing much better than random chance even for the larger models, and I would not expect a good CoT nor significantly better results from testing on the larger model.

Detecting sarcasm might not be the best use of this model?

Not sure how they get so much less than 50 % - perhaps it includes failures to generate a valid answer.

2

AuspiciousApple OP t1_itww8zh wrote

I only skimmed the paper, but I think they said that (at least for some benchmarks) they count exact matches as correct, so yes, maybe generating anything but the answer doesn't count?

I tried the example from this dataset that they use in one of their figures, and I seemed to get the correct answer with the XL variant most of the time, but the rationale was nonsense ~80% of the time even when it was correct. E.g. "Plastic containers are a thing of the past" or "Plastic containers are way too precious to store food" or "Wood is not sturdy enough to store food".

1

utilop t1_itwxusd wrote

I think that would make sense, and I could see the small models, in particular with CoT, failing to produce a valid answer.

For both MMLU and BBH, they report a worse average score with CoT than the direct prompt.

I would take that as CoT not reliably producing correct explanations, as it does not encourage good answers.

Could be that the problem is their prompt, few-shot setup, or calibration though?

Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?

2

AuspiciousApple OP t1_itwy8bj wrote

>Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?

That sounds like a good idea, though NLP isn't really my field, so I might not be using the correct sampling parameters or might be making subtle mistakes in writing the prompt (e.g. punctuation, line breaks, etc.), which is why I was hoping someone here would know more.

Even for English-to-German translation, the model often generated obvious nonsense, sometimes just repeating the English phrase, despite my using the prompt exactly as it appears in the Hugging Face config/paper.

1