farmingvillein t1_jdspflq wrote

Reply to comment by enryu42 in [D] GPT4 and coding problems by enryu42

So, don't know whether this actually makes a difference, but I'd review the overall post-conversion text.

E.g.: https://github.com/enryu43/llm_coding/blob/main/atcoder_eval/statements/statement_abc293_b.txt

You'll see that it represents "K" and "N" wrong here (in sample 1, 15 versus 5, and 12 versus 2).

Certainly, as a human, I would find this confusing. Maybe you could get some automated robustness by telling it how you converted the text (as it might automatically adjust its "expectations" on interpreting the numbers). Obviously, the fairer comparison though would just be to fix this.

> as they require coming up with some ideas before writing the code.

The other thing I'd note--

Not sure whether you're using the API directly, but if I play around with these in ChatGPT, I often run into the context window and have to nurse it along to complete the text. I'd make sure that however you're running things, you're giving it enough "space" to iterate (particularly if you use any reflection techniques).
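
For reference, this is roughly what I mean by giving it "space" -- a quick sketch assuming the current `openai` Python client (the 0.27-era `ChatCompletion` interface); the prompt handling and round cap are just placeholders:

```python
import openai  # assumes openai.api_key is already set

def ask_with_room(problem_statement, model="gpt-4", max_tokens=1500, max_rounds=5):
    """Query the model and keep asking it to continue if the reply gets cut off."""
    messages = [{"role": "user", "content": problem_statement}]
    parts = []
    for _ in range(max_rounds):
        resp = openai.ChatCompletion.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
        )
        choice = resp["choices"][0]
        parts.append(choice["message"]["content"])
        if choice["finish_reason"] != "length":
            break  # finished naturally, nothing was truncated
        # Reply hit the token limit: hand its own partial output back and ask it to continue.
        messages.append({"role": "assistant", "content": choice["message"]["content"]})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(parts)
```

Nothing fancy: check `finish_reason`, and if the reply got cut off, hand it back and ask it to keep going.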

6

farmingvillein t1_jdsmdt2 wrote

Reply to comment by nixed9 in [D] GPT4 and coding problems by enryu42

  1. This isn't really an accurate summary of the Reflexion paper. As noted in the other post:

> Eh, I must've misunderstood the paper. It sounded like they were asking GPT4 to create unit tests, execute the code, and then update its answer based on the results of those unit tests.

This version is correct.

  2. However, if I do the above and I throw in a semi-random Beginner problem that failed in OP's original pass-through, it successfully builds the answer.
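
"The above," concretely, is roughly this loop -- a sketch of the manual-prompt version, not the paper's actual code; `extract_code` and `run_tests` are hypothetical helpers you'd have to write yourself (pulling the code block out of the reply, sandboxed execution of the tests):

```python
import openai

def reflexion_style_solve(problem, unit_tests, extract_code, run_tests, max_rounds=3):
    """Ask for code, run the unit tests, and feed failures back until the tests pass.

    extract_code(reply_text) and run_tests(code, unit_tests) are stand-ins: one pulls
    the code block out of the reply, the other executes the tests (ideally sandboxed)
    and returns (passed, failure_output).
    """
    messages = [{"role": "user", "content": f"Solve this problem in Python:\n\n{problem}"}]
    code = None
    for _ in range(max_rounds):
        resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        answer = resp["choices"][0]["message"]["content"]
        code = extract_code(answer)
        passed, failure_output = run_tests(code, unit_tests)
        if passed:
            return code
        # Reflexion-ish step: show the model what actually happened, then ask for a fix.
        messages.append({"role": "assistant", "content": answer})
        messages.append({
            "role": "user",
            "content": f"Your solution failed these tests:\n{failure_output}\n\nFix the code.",
        })
    return code  # best effort after max_rounds
```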

u/enryu42 -- if you care to take things forward, I'd try implementing Reflexion (either with the underlying codebase (https://github.com/noahshinn024/reflexion-human-eval/) or just manual prompt work).

Or, if you can provide a link to the problems in copy-pastable text form (manually coercing the math notation is a little painful), that would greatly accelerate others hopping on the analysis, since you presumably already did this conversion.

The fact that I immediately saw improvement on a randomly-selected (Beginner) problem suggests that there is a bunch of upward room here.

27

farmingvillein t1_jdsfaq5 wrote

Reply to comment by enryu42 in [D] GPT4 and coding problems by enryu42

> Moreover, I doubt any human programmer will have troubles with the "Beginner" problems, regardless of their specialization.

FWIW, I think you overestimate humans. Particularly those who haven't actively been practicing leetcode-style coding. E.g., many of the problems are specified in "competition language", not "human-friendly language" (where "human-friendly", e.g., is something you'd be happy to see in a design doc). (Should that matter to GPT-4? I dunno.)

I do think it is fair though to say that, with some baseline level of practice (which is potentially the relevant comparison point), a lot of people would probably nail the "beginner" tests.

6

farmingvillein t1_jdo16sz wrote

> This 17 page could be a few sentences.

> Tl;DR the authors wrote prompts to tell GPT-4 to fix code given some unit tests and the output of the broken code. It performs better than GPT-4 that doesn't have access to the output of the code execution.

I agree with your overall sentiment--the paper IMO could, at the very least, be substantially re-organized for clarity--but your summary isn't actually accurate, since the paper itself has nothing to do with coding(!).

The coding work is all in their blog post...

...which also suffers from the same issue: a long preamble to scroll through before you get to the core nugget.

10

farmingvillein t1_jdnwda6 wrote

> But apply those same tricks to a big model, and it works even better.

In general, yes, although there are many techniques that help small models that do not help large ones.

That said, agree with your overall point. I think the only reason we won't see model sizes continue to inflate is if 1) there are substantial underlying architecture discoveries (possible!) or 2) we really hit problems with data availability. But synthetic + multi-modal probably gives us a ways to go there.

2

farmingvillein t1_jdnuvnf wrote

> I believe you might have misunderstood the claims in Alpaca. They never stated it is as capable as ChatGPT, they found (and you can confirm this yourself) that it accurately replicates the instruction tuning. That is, for most of the areas in the fine-tuning set, a smaller model will output in the same style of davinci.

This is a misleading summary of the paper.

They instruction-tune and then compare Alpaca against GPT-3.5, and say that Alpaca is about equal on the tasks compared (which, to be clear, is not equivalent to a test of "broad capability").

Yes, you are right that they don't make a statement that it is categorically more capable than ChatGPT, but they do state that their model is approximately as capable as GPT-3.5 (which is of course not a 1:1 with ChatGPT) on the diverse set of tasks tested.

It is very much not just a paper showing that you can make it output in the same "style".

4

farmingvillein t1_jdj9w98 wrote

> these models are very sparse

Hmm, do you have any sources for this assertion?

It isn't entirely unreasonable, but 1) GPU speed-ups for sparsity aren't that high (unless OpenAI is doing something crazy secret/special...possible?), so this isn't actually that big of an upswing (unless we're including MoE?), and 2) OpenAI hasn't released architecture details (beyond the original GPT-3 paper--which did not indicate that the model was "very" sparse).

1

farmingvillein t1_jd47vh9 wrote

OK, insofar as you care about adoption, I'd encourage you to clean up the README to make it much clearer as to what you're doing. Right now, you've got API call examples, but it isn't clear what is actually happening, why this wrapper is helpful/necessary, etc.

I can guess/infer all the above, but you want your README to make it really, really quick and easy for your readers to figure out what is going on.

1

farmingvillein t1_jcsnx0f wrote

"open source".

That license, lol:

> You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.

> You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.

> This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.

What a nightmare.

40

farmingvillein t1_jckm5r2 wrote

Although note that OP does say that his data isn't labeled...and you of course need to label it for RoBERTa. So you're going to need to bootstrap that process via manual labeling or--ideally, if able--via an LLM labeling process.

If you go through the effort to set up an LLM labeling pipeline, you might just find that it is easier to use the LLM as a classifier, instead of fine-tuning yet another model (depending on cost, quality, etc. concerns).
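
For concreteness, the "LLM as labeler/classifier" path is just something like this -- a sketch only, with a made-up label set and prompt; in practice you'd want batching, retries, and some manual spot-checking of the outputs:

```python
import openai

LABELS = ["positive", "negative", "neutral"]  # made-up label set, purely for illustration

def llm_label(text, model="gpt-3.5-turbo"):
    """Use the LLM itself as a zero-shot labeler/classifier."""
    prompt = (
        f"Classify the following text as exactly one of {LABELS}. "
        f"Answer with the label only.\n\nText: {text}"
    )
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the labels as deterministic as possible
    )
    label = resp["choices"][0]["message"]["content"].strip().lower()
    return label if label in LABELS else None  # throw away anything off-schema
```

The same outputs can either serve as the classifier directly, or get dumped to disk as weak labels to fine-tune RoBERTa on.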

1

farmingvillein t1_jccqy2i wrote

> Generally it would be unfair to claim that you beat benchmark results if you train for 8x more epochs than other methods. Benchmarks exist to ensure that methods are on a somewhat level playing field. There's certainly some wiggle room depending on the task, but in this case I don't believe that a lower learning rate and more epochs is novel or interesting enough to warrant a full paper.

Although the LLaMA paper is a bit of a rejoinder here, since training longer is (arguably) their core contribution.

3

farmingvillein t1_jc3fqod wrote

Speculative, but Emad has heavily signaled that they will be releasing an LLM to the public.

People are doing some really cool stuff with LLaMA right now, but it all lives in a bit of a grey area, for the obvious reasons related to licensing (of both the model weights and the underlying GPLv3 code).

If Emad releases a comparable LLM publicly, but with a generally permissive license (which is not a guarantee...), all of this hacker energy will immediately go into a model/platform that is suddenly (in this scenario) widely available, commercially usable (which means more people banging away at it, including with levels of compute that don't make sense for the average individual but are trivial for even a modestly funded AI startup), etc.

Further, SD has done a really good job of building a community around the successive releases, which--done right--means increased engagement (=better tooling) with each release, since authors know that they are not only investing in a model today, but that they are investing in a "platform" for tomorrow. I.e., the (idealized) open source snowball effect.

Additionally, there is a real chance that SD releases something better than LLaMA*, which will of course further accelerate adoption by parties who will then invest dollars to improve it.

This is all extra important, because there has been a lot of cool research coming out about improving models via [insert creative fine-tuning/RL method, often combined with clever use of chain-of-thought/APIs/retrieval systems/etc.]. Right now, these methods are only really leveraged against very small models (which can be fine-tuned, but still aren't that great) or using something like OpenAI as a black box. A community building up around actually powerful models will allow these techniques to get applied "at scale", i.e., into the community. This has the potential to be very impactful.

Lastly, as noted, GPT-4 (even though notionally against ToS) is going to make it (presumably) even easier to create high-quality instruction-tuning data. That is going to get built and moved into public GPT-3-like models very, very quickly--which definitely means much faster tuning cycles, and possibly means higher-quality tuning.

(*=not because "Meta sux", to be clear, but because SD will more happily pull out all the stops--use more data, throw even more model bells & whistles at it, etc.)

24

farmingvillein t1_jc37p3h wrote

> The license is still limited to non-commercial use due to model being fine-tuned LLaMA.

Yeah, but they released the source code to replicate (I'm sure they knew exactly what they were doing--license is even Apache).

If the source code is pretty clean (including training code; I haven't looked closely), presumably this e2e process will be copied and the resulting model (by someone not beholden to the original LLaMA license) released to the public within the next day or so, if not by EOD.

If the code is messy, might take a couple more days.

I'd expect someone to follow the same process using turbo to bootstrap improvement (if they haven't already?), as well. This should be particularly helpful for getting it to be smarter about using the entire context window in a conversation with the user.
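
"Follow the same process" meaning, roughly, a loop like this -- a sketch only; the seed tasks, prompt, and JSON schema are placeholders (the Alpaca repo's self-instruct-style generation code is the real reference):

```python
import json
import openai

def bootstrap_instructions(seed_tasks, n_rounds=100, model="gpt-3.5-turbo"):
    """Generate (instruction, response) pairs with turbo to fine-tune a smaller model on."""
    generated = []
    for _ in range(n_rounds):
        prompt = (
            "Here are some example instructions:\n"
            + "\n".join(f"- {t}" for t in seed_tasks)
            + "\n\nWrite one new, diverse instruction plus a high-quality response. "
            "Return JSON with keys 'instruction' and 'response'."
        )
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # diversity matters more than determinism here
        )
        try:
            generated.append(json.loads(resp["choices"][0]["message"]["content"]))
        except json.JSONDecodeError:
            continue  # skip malformed generations rather than failing the run
    return generated
```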

I'd also expect someone to follow that same process but mix in DAN-style prompting, so that you natively get a chatbot that is "unleashed" (whether or not this is a good idea is a separate discussion, obviously...).

Also, you can expect all of the above to be applied against all the model sizes pretty quickly (33B and 65B might take a little longer for $$$ reasons...but I wouldn't expect much longer).

It'll be extra fun because it will be released without acknowledgment (for licensing reasons) that OpenAI's API was used to bootstrap it.

Even more fun when GPT-4 is released in the next week or so (assuming it isn't pushed back b/c the SVB collapse is making things noisy) and that can be used to bootstrap an even better instruction set (presumably).

tl;dr: things will change, quickly. (And then Emad releases an LLM and all bets are off...)

28