farmingvillein t1_jdsmdt2 wrote on March 26, 2023 at 9:38 PM

Reply to comment by nixed9 in [D] GPT4 and coding problems by enryu42

This isn't really an accurate summary of the Reflexion paper. As noted in the other post:

> Eh, I must've misunderstood the paper. It sounded like they were asking GPT4 to create unit tests, execute the code, and then update its answer based on the results of those unit tests.

This version is correct.

However, if I do the above and I throw in a semi-random Beginner problem that failed in OP's original pass-through, it successfully builds the answer.

u/enryu42 -- if you care to take things forward, I'd try implementing Reflexion (either with the underlying codebase (https://github.com/noahshinn024/reflexion-human-eval/) or just manual prompt work.

Or if you can provide a link to the problems in copy-pastable text form (manually coercing the math notation is a little painful), since you presumably already did this, it would greatly accelerate others hopping on analysis.

The fact that I immediately saw improvement on a randomly-selected (Beginner) problem suggests that there is a bunch of upward room here.

enryu42 OP t1_jdsokwz wrote on March 26, 2023 at 9:53 PM

Interesting! Here are the scraped and auto-converted statements (formatting is off sometimes, especially in the sample tests, but understandable). Prefixes are: "abc" for beginner, "arc" for regular, "agc" for "grand".

I do believe that the "Beginner" ones can be improved, but it'll be interesting to see what happens on "Grand" (or even "Regular"), as they require coming up with some ideas before writing the code.

farmingvillein t1_jdspflq wrote on March 26, 2023 at 10:00 PM

So, don't know whether this actually makes a difference, but I'd review the overall post-conversion text.

E.g.: https://github.com/enryu43/llm_coding/blob/main/atcoder_eval/statements/statement_abc293_b.txt

You'll see that it represent "K" and "N" wrong here (in sample 1, 15 versus 5, 12 versus 2).

Certainly, as a human, I would find this confusing. Maybe you could get some automated robustness by telling it how you converted the text (as it might automatically adjust its "expectations" on interpreting the numbers). Obviously, the fairer comparison though would just be to fix this.

> as they require coming up with some ideas before writing the code.

The other thing I'd note--

Not sure whether you're using the API directly, but if I play around with these in ChatGPT, I often run into the context window and have to nurse it along to complete text. I'd make sure that however you're running things, you're giving it enough "space" to iterate (particularly if you use any reflection techniques).

nixed9 t1_jdt1xyp wrote on March 26, 2023 at 11:35 PM

Ok my bad but that’s how I’ve been using the reflexion prompting