farmingvillein t1_jdspflq wrote

Reply to comment by enryu42 in [D] GPT4 and coding problems by enryu42

So, don't know whether this actually makes a difference, but I'd review the overall post-conversion text.

E.g.: https://github.com/enryu43/llm_coding/blob/main/atcoder_eval/statements/statement_abc293_b.txt

You'll see that it represents "K" and "N" wrong here (in sample 1, 15 versus 5, and 12 versus 2).

Certainly, as a human, I would find this confusing. Maybe you could get some automated robustness by telling it how you converted the text (as it might automatically adjust its "expectations" on interpreting the numbers). Obviously, the fairer comparison though would just be to fix this.

> as they require coming up with some ideas before writing the code.

The other thing I'd note--

Not sure whether you're using the API directly, but if I play around with these in ChatGPT, I often run into the context window and have to nurse it along to complete the text. I'd make sure that however you're running things, you're giving it enough "space" to iterate (particularly if you use any reflection techniques).
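
For reference, this is roughly what I mean by giving it "space" -- a quick sketch assuming the current `openai` Python client (the 0.27-era `ChatCompletion` interface); the prompt handling and round cap are just placeholders:

```python
import openai  # assumes openai.api_key is already set

def ask_with_room(problem_statement, model="gpt-4", max_tokens=1500, max_rounds=5):
    """Query the model and keep asking it to continue if the reply gets cut off."""
    messages = [{"role": "user", "content": problem_statement}]
    parts = []
    for _ in range(max_rounds):
        resp = openai.ChatCompletion.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
        )
        choice = resp["choices"][0]
        parts.append(choice["message"]["content"])
        if choice["finish_reason"] != "length":
            break  # finished naturally, nothing was truncated
        # Reply hit the token limit: hand its own partial output back and ask it to continue.
        messages.append({"role": "assistant", "content": choice["message"]["content"]})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(parts)
```

Nothing fancy: check `finish_reason`, and if the reply got cut off, hand it back and ask it to keep going.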

6

farmingvillein t1_jdsmdt2 wrote

Reply to comment by nixed9 in [D] GPT4 and coding problems by enryu42

  1. This isn't really an accurate summary of the Reflexion paper. As noted in the other post:

> Eh, I must've misunderstood the paper. It sounded like they were asking GPT4 to create unit tests, execute the code, and then update its answer based on the results of those unit tests.

This version is correct.

  2. However, if I do the above and I throw in a semi-random Beginner problem that failed in OP's original pass-through, it successfully builds the answer.
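
"The above," concretely, is roughly this loop -- a sketch of the manual-prompt version, not the paper's actual code; `extract_code` and `run_tests` are hypothetical helpers you'd have to write yourself (pulling the code block out of the reply, sandboxed execution of the tests):

```python
import openai

def reflexion_style_solve(problem, unit_tests, extract_code, run_tests, max_rounds=3):
    """Ask for code, run the unit tests, and feed failures back until the tests pass.

    extract_code(reply_text) and run_tests(code, unit_tests) are stand-ins: one pulls
    the code block out of the reply, the other executes the tests (ideally sandboxed)
    and returns (passed, failure_output).
    """
    messages = [{"role": "user", "content": f"Solve this problem in Python:\n\n{problem}"}]
    code = None
    for _ in range(max_rounds):
        resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        answer = resp["choices"][0]["message"]["content"]
        code = extract_code(answer)
        passed, failure_output = run_tests(code, unit_tests)
        if passed:
            return code
        # Reflexion-ish step: show the model what actually happened, then ask for a fix.
        messages.append({"role": "assistant", "content": answer})
        messages.append({
            "role": "user",
            "content": f"Your solution failed these tests:\n{failure_output}\n\nFix the code.",
        })
    return code  # best effort after max_rounds
```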

u/enryu42 -- if you care to take things forward, I'd try implementing Reflexion (either with the underlying codebase (https://github.com/noahshinn024/reflexion-human-eval/) or just manual prompt work).

Or, if you can provide a link to the problems in copy-pastable text form (manually coercing the math notation is a little painful), that would greatly accelerate others hopping on the analysis, since you presumably already did this conversion.

The fact that I immediately saw improvement on a randomly-selected (Beginner) problem suggests that there is a bunch of upward room here.

27

farmingvillein t1_jdsfaq5 wrote

Reply to comment by enryu42 in [D] GPT4 and coding problems by enryu42

> Moreover, I doubt any human programmer will have troubles with the "Beginner" problems, regardless of their specialization.

FWIW, I think you overestimate humans. Particularly those who haven't actively been practicing leetcode-style coding. E.g., many of the problems are specified in "competition language", not "human-friendly language" (where "human-friendly", e.g., is something you'd be happy to see in a design doc). (Should that matter to GPT-4? I dunno.)

I do think it is fair though to say that, with some baseline level of practice (which is potentially the relevant comparison point), a lot of people would probably nail the "beginner" tests.

6

farmingvillein t1_jdo16sz wrote

> This 17 page could be a few sentences.

> Tl;DR the authors wrote prompts to tell GPT-4 to fix code given some unit tests and the output of the broken code. It performs better than GPT-4 that doesn't have access to the output of the code execution.

I agree with your overall sentiment--the paper IMO could, at the very least, be substantially re-organized for clarity--but your summary isn't actually accurate, since the paper itself has nothing to do with coding(!).

The coding work is all in their blog post...

...which also suffers from the same issue: a long preamble to scroll through before you get to the core nugget.

10

farmingvillein t1_jdnwda6 wrote

> But apply those same tricks to a big model, and it works even better.

In general, yes, although there are many techniques that help small models that do not help large ones.

That said, agree with your overall point. I think the only reason we won't see model sizes continue to inflate is if 1) there are substantial underlying architecture discoveries (possible!) or 2) we really hit problems with data availability. But synthetic + multi-modal probably gives us a ways to go there.

2

farmingvillein t1_jdnuvnf wrote

> I believe you might have misunderstood the claims in Alpaca. They never stated it is as capable as ChatGPT, they found (and you can confirm this yourself) that it accurately replicates the instruction tuning. That is, for most of the areas in the fine-tuning set, a smaller model will output in the same style of davinci.

This is a misleading summary of the paper.

They instruction-tune and then compare Alpaca against GPT-3.5, and say that Alpaca is about equal on the tasks compared (which, to be clear, is not equivalent to a test of "broad capability").

Yes, you are right that they don't make a statement that it is categorically more capable than ChatGPT, but they do state that their model is approximately as capable as GPT-3.5 (which is of course not a 1:1 with ChatGPT) on the diverse set of tasks tested.

It is very much not just a paper showing that you can make it output in the same "style".

4

farmingvillein t1_jdj9w98 wrote

> these models are very sparse

Hmm, do you have any sources for this assertion?

It isn't entirely unreasonable, but 1) GPU speed-ups for sparsity aren't that high (unless OpenAI is doing something crazy secret/special...possible?), so this isn't actually that big of an upswing (unless we're including MoE?), and 2) OpenAI hasn't released architecture details (beyond the original GPT-3 paper--which did not indicate that the model was "very" sparse).

1

farmingvillein t1_jd47vh9 wrote

OK, insofar as you care about adoption, I'd encourage you to clean up the README to make it much clearer as to what you're doing. Right now, you've got API call examples, but it isn't clear what is actually happening, why this wrapper is helpful/necessary, etc.

I can guess/infer all the above, but you want your README to make it really, really quick and easy for your readers to figure out what is going on.

1

farmingvillein t1_jcsnx0f wrote

"open source".

That license, lol:

> You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.

> You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.

> This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.

What a nightmare.

40

farmingvillein t1_jckm5r2 wrote

Although note that OP does say that his data isn't labeled...and you of course need to label it for RoBERTa. So you're going to need to bootstrap that process via manual labeling or--ideally, if able--via an LLM labeling process.

If you go through the effort to set up an LLM labeling pipeline, you might just find that it is easier to use the LLM as a classifier, instead of fine-tuning yet another model (depending on cost, quality, etc. concerns).
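
For concreteness, the "LLM as labeler/classifier" path is just something like this -- a sketch only, with a made-up label set and prompt; in practice you'd want batching, retries, and some manual spot-checking of the outputs:

```python
import openai

LABELS = ["positive", "negative", "neutral"]  # made-up label set, purely for illustration

def llm_label(text, model="gpt-3.5-turbo"):
    """Use the LLM itself as a zero-shot labeler/classifier."""
    prompt = (
        f"Classify the following text as exactly one of {LABELS}. "
        f"Answer with the label only.\n\nText: {text}"
    )
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the labels as deterministic as possible
    )
    label = resp["choices"][0]["message"]["content"].strip().lower()
    return label if label in LABELS else None  # throw away anything off-schema
```

The same outputs can either serve as the classifier directly, or get dumped to disk as weak labels to fine-tune RoBERTa on.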

1

farmingvillein t1_jccqy2i wrote

> Generally it would be unfair to claim that you beat benchmark results if you train for 8x more epochs than other methods. Benchmarks exist to ensure that methods are on a somewhat level playing field. There's certainly some wiggle room depending on the task, but in this case I don't believe that a lower learning rate and more epochs is novel or interesting enough to warrant a full paper.

Although the LLaMA paper is a bit of a rejoinder here, since training longer is (arguably) their core contribution.

3

farmingvillein t1_jc3fqod wrote

Speculative, but Emad has heavily signaled that they will be releasing an LLM to the public.

People are doing some really cool stuff with LLaMA right now, but it all lives in a bit of a grey area, for the obvious reasons related to licensing (of both the model weights and the underlying GPLv3 code).

If Emad releases a comparable LLM publicly, but with a generally permissive license (which is not a guarantee...), all of this hacker energy will immediately go into a model/platform that is suddenly (in this scenario) widely available, commercially usable (which means more people banging away at it, including with levels of compute that don't make sense for the average individual but are trivial for even a modestly funded AI startup), etc.

Further, SD has done a really good job of building a community around the successive releases, which--done right--means increased engagement (=better tooling) with each release, since authors know that they are not only investing in a model today, but that they are investing in a "platform" for tomorrow. I.e., the (idealized) open source snowball effect.

Additionally, there is a real chance that SD releases something better than LLaMA*, which will of course further accelerate adoption by parties who will then invest dollars to improve it.

This is all extra important, because there has been a lot of cool research coming out about improving models via [insert creative fine-tuning/RL method, often combined with clever use of chain-of-thought/APIs/retrieval systems/etc.]. Right now, these methods are only really leveraged against very small models (which can be fine-tuned, but still aren't that great) or using something like OpenAI as a black box. A community building up around actually powerful models will allow these techniques to get applied "at scale", i.e., into the community. This has the potential to be very impactful.

Lastly, as noted, GPT-4 (even though notionally against ToS) is going to make it (presumably) even easier to create high-quality instruction-tuning data. That is going to get built and moved into public GPT-3-like models very, very quickly--which definitely means much faster tuning cycles, and possibly means higher-quality tuning.

(*=not because "Meta sux", to be clear, but because SD will more happily pull out all the stops--use more data, throw even more model bells & whistles at it, etc.)

24

farmingvillein t1_jc37p3h wrote

> The license is still limited to non-commercial use due to model being fine-tuned LLaMA.

Yeah, but they released the source code to replicate (I'm sure they knew exactly what they were doing--license is even Apache).

If the source code is pretty clean (including training code; I haven't looked closely), presumably this e2e process will be copied and the resulting model (by someone not beholden to the original LLaMA license) released to the public within the next day or so, if not by EOD.

If the code is messy, might take a couple more days.

I'd expect someone to follow the same process using turbo to bootstrap improvement (if they haven't already?), as well. This should be particularly helpful for getting it to be smarter about using the entire context window in a conversation with the user.
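
"Follow the same process" meaning, roughly, a loop like this -- a sketch only; the seed tasks, prompt, and JSON schema are placeholders (the Alpaca repo's self-instruct-style generation code is the real reference):

```python
import json
import openai

def bootstrap_instructions(seed_tasks, n_rounds=100, model="gpt-3.5-turbo"):
    """Generate (instruction, response) pairs with turbo to fine-tune a smaller model on."""
    generated = []
    for _ in range(n_rounds):
        prompt = (
            "Here are some example instructions:\n"
            + "\n".join(f"- {t}" for t in seed_tasks)
            + "\n\nWrite one new, diverse instruction plus a high-quality response. "
            "Return JSON with keys 'instruction' and 'response'."
        )
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # diversity matters more than determinism here
        )
        try:
            generated.append(json.loads(resp["choices"][0]["message"]["content"]))
        except json.JSONDecodeError:
            continue  # skip malformed generations rather than failing the run
    return generated
```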

I'd also expect someone to follow that same process but mix in DAN-style prompting, so that you natively get a chatbot that is "unleashed" (whether or not this is a good idea is a separate discussion, obviously...).

Also, you can expect all of the above to be applied against all the model sizes pretty quickly (33B and 65B might take a little longer for $$$ reasons...but I wouldn't expect much longer).

It'll be extra fun because it will be released without acknowledgment (for licensing reasons) that OpenAI's API was used to bootstrap it.

Even more fun when GPT-4 is released in the next week or so (assuming it isn't pushed back b/c the SVB collapse is making things noisy) and that can be used to bootstrap an even better instruction set (presumably).

tl;dr: things will change, quickly. (And then Emad releases an LLM and all bets are off...)

28