
Cool_Abbreviations_9 t1_jdrxcqi wrote

Sorry, newbie to NLP , what is this ?

39

nixed9 t1_jdrxr76 wrote

A Reflexion loop asks the model to react to its own output and critique it before giving you an additional answer.

Edit: (In the paper, this takes the form of a loop that feeds the model's output back into itself to aid its own cognition. It can repeat this loop multiple times.)

You can do a mini-loop by prompting. I've been playing with this all day.

I prompt it like this:

> "For this interaction, we are going to use the following structure.

> User (me): [I will ask a topic or question]

> You will provide an Assistant Hypothetical Response: [Brief or simplified answer to the topic or question]

> Then you will undergo Agent Reflection: [You will provide a Critique of the hypothetical response, highlighting the limitations, inaccuracies, or areas that need improvement or expansion, while providing guidance on how to address these issues in the revised response]

> Then you will provide an Actual Response: [The natural and contextually appropriate answer to the topic or question, as generated by the advanced language model, which incorporates the suggestions and improvements from the agent reflection for a more comprehensive and accurate response. This also can include step-by-step reasoning.]

> Do you understand?"
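
If you'd rather script it than paste that by hand, the same mini-loop is easy to drive through the API. Here's a rough Python sketch (the prompt wording, the number of rounds, and the legacy `openai.ChatCompletion` call are all my own assumptions, not anything from the paper):

```python
import openai  # assumes the legacy openai client and an API key already configured

def chat(messages, model="gpt-4"):
    """Single round-trip to the chat API."""
    response = openai.ChatCompletion.create(model=model, messages=messages)
    return response.choices[0].message.content

def reflexion_mini_loop(question, rounds=1):
    """Draft -> critique -> revised answer, repeated `rounds` times."""
    messages = [{"role": "user",
                 "content": f"Give a brief hypothetical answer to: {question}"}]
    answer = chat(messages)

    for _ in range(rounds):
        # Ask the model to critique its own previous answer...
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user",
             "content": "Critique that answer: point out inaccuracies, limitations, "
                        "and areas that need expansion, with guidance on fixing them."},
        ]
        critique = chat(messages)

        # ...then fold the critique back into a revised answer.
        messages += [
            {"role": "assistant", "content": critique},
            {"role": "user",
             "content": "Now give the actual response, incorporating every point "
                        "from your critique. Include step-by-step reasoning."},
        ]
        answer = chat(messages)

    return answer

print(reflexion_mini_loop("Why is the sky blue?"))
```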

125

Hamoodzstyle t1_jdsbhec wrote

What is the point of the "do you understand?" at the end? Does the model confirming that it understands add some sort of emphasis or something?

34

CobaltAlchemist t1_jdsqr5e wrote

(not OP) I've found that asking it directly if it understands helps to bridge any gaps I miss. It's asked me clarifying questions afterward that I hadn't thought about.

Alternatively, when I assume it understands, it sometimes comes up with some real wild stuff because I wasn't clear.

80

Nowado t1_jdtr40r wrote

I do the same thing I'd do with a human: ask it to repeat and rephrase the instructions. After that I'm sure it got them, and it has multiple phrasings of the instructions available, so it gets less hung up on any exact wording.

12

nixed9 t1_jdsegt9 wrote

No explicit purpose, other than to have it respond with "yes, I am ready."

53

DirtyKinkyInLove t1_jdwlgmp wrote

It also reduces token usage. If the chatbot gives a wordy response, it takes up more space in the context window, and the chatbot will forget its instructions sooner. If this sounds like gibberish, let me know and I'll break it down.
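
If you want to see the difference concretely, you can count the tokens yourself. Quick sketch with the `tiktoken` library (assuming it's installed; the example sentences are just made up):

```python
import tiktoken  # OpenAI's tokenizer package

enc = tiktoken.encoding_for_model("gpt-4")

wordy = ("Yes, I completely understand the structure you have described and I am "
         "ready to follow it for the rest of this conversation whenever you are.")
terse = "Yes, I am ready."

print(len(enc.encode(wordy)))  # ~30 tokens
print(len(enc.encode(terse)))  # ~6 tokens
```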

3

farmingvillein t1_jdsmdt2 wrote

  1. This isn't really an accurate summary of the Reflexion paper. As noted in the other post:

> Eh, I must've misunderstood the paper. It sounded like they were asking GPT4 to create unit tests, execute the code, and then update its answer based on the results of those unit tests.

This version is correct.

  2. However, if I do the above and I throw in a semi-random Beginner problem that failed in OP's original pass-through, it successfully builds the answer.

u/enryu42 -- if you care to take things forward, I'd try implementing Reflexion, either with the underlying codebase (https://github.com/noahshinn024/reflexion-human-eval/) or just with manual prompt work.

Or, if you can provide a link to the problems in copy-pastable text form (manually transcribing the math notation is a little painful), since you presumably already did this, it would greatly accelerate others hopping on the analysis.

The fact that I immediately saw improvement on a randomly-selected (Beginner) problem suggests that there is a bunch of upward room here.
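
For anyone who wants the gist of that generate -> test -> reflect loop before diving into the repo, here's a stripped-down sketch (the prompts, test format, and the `ask_model` helper are my own stand-ins, not the paper's actual harness):

```python
import subprocess
import tempfile

# ask_model(prompt) -> str is a hypothetical stand-in for whatever LLM call you use.

def run_tests(code: str, tests: str):
    """Execute the candidate solution plus its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stdout + proc.stderr

def reflexion_code_loop(task: str, tests: str, max_iters: int = 3) -> str:
    """Generate code, run the unit tests, and feed any failures back as a reflection."""
    code = ask_model(f"Write a Python solution for this problem:\n{task}")
    for _ in range(max_iters):
        passed, output = run_tests(code, tests)
        if passed:
            break
        # The "reflection": show the model its own code plus the failing test output
        # and ask it to reason about the failure before producing a fix.
        code = ask_model(
            f"Your solution:\n{code}\n\nfailed these unit tests:\n{output}\n\n"
            "Reflect on why it failed, then return a corrected solution."
        )
    return code
```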

27

enryu42 OP t1_jdsokwz wrote

Interesting! Here are the scraped and auto-converted statements (formatting is off sometimes, especially in the sample tests, but understandable). Prefixes are: "abc" for beginner, "arc" for regular, "agc" for "grand".

I do believe that the "Beginner" ones can be improved, but it'll be interesting to see what happens on "Grand" (or even "Regular"), as they require coming up with some ideas before writing the code.

7

farmingvillein t1_jdspflq wrote

So, I don't know whether this actually makes a difference, but I'd review the overall post-conversion text.

E.g.: https://github.com/enryu43/llm_coding/blob/main/atcoder_eval/statements/statement_abc293_b.txt

You'll see that it represents "K" and "N" wrong here (in sample 1, 15 versus 5 and 12 versus 2).

Certainly, as a human, I would find this confusing. Maybe you could get some automated robustness by telling it how you converted the text (as it might then adjust its "expectations" when interpreting the numbers). Obviously, though, the fairer comparison would just be to fix this.

> as they require coming up with some ideas before writing the code.

The other thing I'd note--

Not sure whether you're using the API directly, but if I play around with these in ChatGPT, I often run into the context window and have to nurse it along to complete the text. I'd make sure that however you're running things, you're giving it enough "space" to iterate (particularly if you use any reflection techniques).
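
If you are on the API, the main knob is just not starving the completion, e.g. something roughly like this (legacy client; parameter names may differ in your library version, and the prompt is only illustrative):

```python
import openai

messages = [{"role": "user",
             "content": "Solve the problem, critique your solution, then revise it."}]

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=messages,
    max_tokens=2048,  # leave enough room for the critique and the revised answer
)
print(response.choices[0].message.content)
```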

6

nixed9 t1_jdt1xyp wrote

OK, my bad, but that's how I've been using the Reflexion-style prompting.

1

muskoxnotverydirty t1_jds07qr wrote

Eh, I must've misunderstood the paper. It sounded like they were asking GPT4 to create unit tests, execute the code, and then update its answer based on the results of those unit tests.

10

farmingvillein t1_jdsm0hw wrote

No, you didn't misunderstand it--your understanding is correct. OP is giving an answer that is similar to part of the Reflexion paper, but not the entirety.

15

yaosio t1_jdtenqi wrote

What's it called if you have it self-reflect on non-code it's written? For example, have it write a story, then tell it to critique and fix problems in the story. Can the methods from the paper also be used for non-code purposes? It would be interesting to see how much its writing quality can improve using applicable methods.

3

AllAmericanBreakfast t1_jdtynpv wrote

I tried this out, and it only had partial success.

First, just dumping in this prompt, then asking a question, resulted in the AI coming up with a laughably simple failed first response, followed by a critique and improvement. It is as if it recognized that the easiest way to "demonstrate improvement" would be to set the bar low by failing utterly on the first attempt.

Then, I tried breaking it up into stages, asking for a response, getting a response, asking for a critique, getting a critique, asking for an improvement, and getting an improvement.

This worked better.

However, when I then asked for another critique and improvement (again in separate stages), it instead started inventing fake problems to solve. I was asking it to implement a case-insensitive longest common substring function that returns the version of the LCS found in the longer of the two strings.

The second-pass critique was that the original (working) code didn't handle the possibility that "the longer string may not contain the LCS," which is impossible given the way it was originally implemented. It then added some extra code to deal with this "problem."
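
For reference, here's the kind of function I was asking for (this is my own straightforward version, not GPT-4's output):

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """Case-insensitive longest common substring, returned as it appears
    in the longer of the two input strings."""
    longer, shorter = (s1, s2) if len(s1) >= len(s2) else (s2, s1)
    a, b = longer.lower(), shorter.lower()

    best_len, best_end = 0, 0     # length and end index (exclusive) in `longer`
    prev = [0] * (len(b) + 1)     # prev[j]: common suffix length of a[:i-1] and b[:j]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr

    return longer[best_end - best_len:best_end]

print(longest_common_substring("HelloWorld", "yellow"))  # -> "elloW"
```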

3

xjE4644Eyc t1_jdtefph wrote

Thank you for this; what a novel way to approach the problem. I'm going to start using this regularly.

1

LightVelox t1_jdry1xp wrote

This

Basically, it makes GPT-4 re-evaluate what it did wrong and try again until it can do it correctly.

21

E_Snap t1_jdsbvd0 wrote

It’s pretty amazing how many shortcomings of that architecture could be summarized by “It only outputs when directly prompted to output, and won’t read its own output as it’s outputting”. Once these things can continuously take input and output, we’ll probably see quite the rush of advancement.

7

farmingvillein t1_jdsd5ae wrote

> and won’t read its own output as it’s outputting

This is literally what transformer decoders do, unless I've strongly misunderstood your statement.

13

E_Snap t1_jdsht5g wrote

I guess I could have worded it better. What I mean to say is that once they've output something, it's in the record. There's no pausing to think and going through a few different iterations of the sentence, or evaluating whether what they're about to say has faults. They just output directly, instead of reading what they're about to output and vetting it.

17

farmingvillein t1_jdsmsh9 wrote

Gotcha. Yeah, that is presumably where the power of inner monologue / step-by-step / reflection comes from.

Will be cool to see that (presumably) progressively systematized.

13

sdmat t1_jdt85pr wrote

Yes, it's amazing to see something as simple as "Assess the quality of your answer and fix any errors" actually work.

Or, for more subjective output such as poetry: "Rate each line in the preceding poem," then "Rewrite the worst lines."

6

yaosio t1_jdtf57p wrote

The neat part is it doesn't work for less advanced models. The ability to fix its own mistakes is an emergent property of a sufficiently advanced model. Chain of thought prompting doesn't work in less advanced models either.

7

sdmat t1_jdtj3ia wrote

Definitely, I was extremely skeptical of LLMs as a path to AGI but this makes it look possible. Maybe even likely.

4

yaosio t1_jdtvycq wrote

It's really neat how fast this stuff has been going. I remember when OpenAI claimed GPT-2 was too dangerous to release, which is amusing now because the output of GPT-2 is so bad. But when I used a demo that would write news articles from a headline I thought it was absolutely amazing. Then I, and most of the public, forgot about it.

Then GPT-3 comes out, and AI Dungeon used it before OpenAI censored it so hard that AI Dungeon stopped using it. The output was so much better than GPT-2's that I couldn't believe I had liked anything GPT-2 made. I told people this was the real deal, it's perfect and amazing! But it went off the rails very often, and it didn't understand how a story should be told, so it just did whatever.

Then ChatGPT comes out, which we now know is something like a fine-tune of GPT-3.5. You can chat and code with it, and it writes stories. The stories are not well written, but they follow the rules of storytelling and don't go off the rails. And it wasn't fine-tuned on story writing the way AI Dungeon's GPT-3 was.

Then Bing Chat comes out, which turned out to be based on GPT-4. Its story-writing ability is so much better than ChatGPT's. None of that "once upon a time" stuff. The stories still aren't compelling, but they're way better than before.

I'm interested in knowing what GPT-5 is going to bring. What deficiencies will it fix, and what deficiencies will it have? I'd love to see a model that doesn't try to do everything in a single pass. Take coding: even if you use chain-of-thought and self-reflection, GPT-4 will try to write the entire program in one go. Once something is written, it can't go back and change it if it turns out to be a bad idea; it's forced to incorporate it. It would be amazing if a model could predict how difficult a task will be and then break it up into manageable pieces rather than trying to do everything at once.

5

sdmat t1_jdtytyy wrote

> Take coding: even if you use chain-of-thought and self-reflection, GPT-4 will try to write the entire program in one go. Once something is written, it can't go back and change it if it turns out to be a bad idea; it's forced to incorporate it. It would be amazing if a model could predict how difficult a task will be and then break it up into manageable pieces rather than trying to do everything at once.

I've had some success leading it through this in coding with careful prompting: have it give a high-level outline, check its work, implement each part, check its work, then put the thing together. It will even revise the high-level idea if you ask it to and update the corresponding implementation in the context window.
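
In practice it's just a fixed sequence of follow-up prompts, something like this sketch (the prompt wording and the `send_chat_message` helper are mine, nothing canonical):

```python
# Stages of the outline -> review -> implement -> check -> assemble scheme described
# above. send_chat_message(prompt) -> str is a hypothetical wrapper around your chat
# session, so each stage sees the previous replies in context.
stages = [
    "Give a high-level outline of the program: modules, functions, and data flow.",
    "Review that outline. What is missing, redundant, or likely to cause problems?",
    "Revise the outline based on your review.",
    "Implement the first component from the revised outline.",
    "Check that implementation against the outline and fix any issues.",
    # ...repeat the implement/check pair for each remaining component...
    "Assemble the components into the full program and review it end to end.",
]

for prompt in stages:
    print(send_chat_message(prompt))
```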

But it definitely can't do so natively. Intuitively it seems unlikely that we can get similar results to GPT4+human with GPT4+GPT4 regardless of how clever the prompting scheme is. But the emergent capabilities seen already are highly surprising, so who knows.

Really looking forward to trying these schemes with a 32K context window.

Add code execution to check results and browsing to get library usage right, and it seems all the pieces are there for an incredible level of capability, even if it still needs human input in some areas.

5

COMPEWTER_adminisp t1_jdtugix wrote

> Once these things can continuously take input and output, we’ll probably see quite the rush of advancement.

Interesting!

1

ghostfaceschiller t1_jds0zez wrote

Basically, it's just giving the model the ability to observe the results of its previous action and decide whether it wants to try something different based on the feedback.

2