Abstract:

>Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.

https://preview.redd.it/66zehsdps6fa1.jpg?width=811&format=pjpg&auto=webp&v=enabled&s=96db4cb832def624ad10f7383cde56c1444dcbcc

https://preview.redd.it/is4pzwdps6fa1.jpg?width=1638&format=pjpg&auto=webp&v=enabled&s=5e6c3137b982c91c658b58d286e5036a46a7d55d

https://preview.redd.it/szkbb0eps6fa1.jpg?width=711&format=pjpg&auto=webp&v=enabled&s=6eacbd0cdfc8ecc2c21ad1a46d87d8f367d9bbb5

https://preview.redd.it/6lk1wzdps6fa1.jpg?width=1468&format=pjpg&auto=webp&v=enabled&s=5a37d08a5677d927c1b017d711558a6d859e8f3c

https://preview.redd.it/8h7p8vdps6fa1.jpg?width=1177&format=pjpg&auto=webp&v=enabled&s=3e9926040e6af04ec8945fcfe81e51b5c94d5913

Comments

farmingvillein t1_j6iwb5v wrote on January 30, 2023 at 5:34 PM

I like the big idea, and it is almost certainly indicative of one of the key tools to improve automated programming.

That said, I wish they had avoided the urge to build an intermediate programming language. This is likely unnecessary and is the type of semi-convoluted solution that you only come up with in an academic research lab (or out of true, deep product need--but I think that is highly unlikely to be the case).

My guess is that the same basic result in the paper could have been shown by using Python or Rust or similar as the root language, with a little work (time that you could have obtained by swapping out effort spent on the Harry Potter language development).

They do note:

> We generate 16 Python implementations per high-level plan on 100 randomly sampled problems and find that the performance drops to 6%.

But it isn't well-discussed (unless I skimmed too quickly) as to why a separate language is truly needed. They discuss advantages of Parsel, but there doesn't appear to be a deep ablation on why it is really necessary or where its supposed performance benefits come from, or how those could be enforced in other languages.

There is a bunch of discussion in the appendix, but IMO none of it is very convincing. E.g., Parsel enforces certain conventions around testing and validation...great, let's do that in Python or Rust or similar. Or--leveraging the value of LLMs--through a more natural language interface.

Yes, there is benefit to bridging these gaps in a "universal" manner...but, as per https://xkcd.com/927/, a new programming language is rarely the right solution.

ezelikman t1_j6lx0vm wrote on January 31, 2023 at 6:39 AM

Hi, author here!

There are a few ways to interpret this question.

The first is, "why generate a bunch of composable small functions - why not generate complete Python/Lean/etc. implementations directly from the high-level sketch?" If you generate 10 complete implementations, you have 10 programs. If you generate 10 implementations of four subfunctions, you have 10,000 programs. By decomposing problems combinatorially, you call the language model less. You can see the benefits in Fig. 6 and our direct compilation ablation. There's also the context window: a hundred 500-token functions from Parsel is a 50,000-token program. You won't get that with Codex alone.
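
The arithmetic in this paragraph can be checked with a short Python sketch. It is only an illustration of the counting argument, not code from Parsel: the variable names are made up here, and it assumes one LLM call per sampled implementation.

```python
from itertools import product

# Numbers taken from the comment; the variable names are hypothetical.
num_subfunctions = 4
samples_per_subfunction = 10

# Assumption: one LLM call per sampled implementation.
llm_calls = num_subfunctions * samples_per_subfunction

# Each program picks one sampled implementation for every subfunction.
choices = [range(samples_per_subfunction)] * num_subfunctions
num_programs = sum(1 for _ in product(*choices))

# Context-window point: a hundred 500-token functions.
total_tokens = 100 * 500

print(llm_calls, num_programs, total_tokens)
```

Under these assumptions, 40 sampled functions cover 10,000 candidate programs, whereas 10 complete implementations cover only 10.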

Another interpretation is, "why do you need to expose an intermediate language when you can use a more abstract intermediate representation?" You suggest "leveraging the value of LLMs--through a more natural language interface." That's the goal. Parsel is intentionally basically indented natural language w/ unit tests. There's minimal extra syntax for efficiency and generality - ideally, people who've never used Python can understand and write Parsel. The "expert" details here aren't syntax: most people are unfamiliar with the nuances of writing natural language that automatically compiles to code, like the value of comprehensive unit tests.

Another is, "why design a new language instead of writing this as, e.g., a Python library?" My response is we did this too. Internally, Parsel is in Python, and a "Function" class already exists - you can find it on GitHub. Still, you need a process to generate implementations and select one satisfying the constraints, which we call the compiler.
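
As a rough illustration of "generate implementations and select one satisfying the constraints", here is a minimal Python sketch. It is not Parsel's actual compiler or its "Function" class: `select_implementation` is a hypothetical helper, and the three hand-written functions stand in for LLM samples.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

TestCase = Tuple[tuple, Any]  # (positional arguments, expected result)

def select_implementation(
    candidates: Sequence[Callable[..., Any]],
    tests: Sequence[TestCase],
) -> Optional[Callable[..., Any]]:
    """Return the first candidate that passes every test, or None."""
    for candidate in candidates:
        try:
            if all(candidate(*args) == expected for args, expected in tests):
                return candidate
        except Exception:
            # A crashing sample is simply discarded.
            continue
    return None

# Hypothetical stand-ins for sampled implementations of "add two numbers".
def buggy_add(a, b):
    return a - b

def crashing_add(a, b):
    raise ValueError("bad sample")

def correct_add(a, b):
    return a + b

tests = [((1, 2), 3), ((0, 0), 0)]
chosen = select_implementation([buggy_add, crashing_add, correct_add], tests)
print(chosen.__name__ if chosen else None)
```

The tests act as the constraints: a candidate is kept only if it passes all of them, and candidates that raise are skipped.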

Hope this answers your question!

farmingvillein t1_j6nxa0i wrote on January 31, 2023 at 5:49 PM

> If you generate 10 complete implementations, you have 10 programs. If you generate 10 implementations of four subfunctions, you have 10,000 programs. By decomposing problems combinatorially, you call the language model less

Yup, agreed--this was my positive reference to "the big idea". Decomposition is almost certainly very key to any path forward in scaling up automated program generation in complexity, and the paper is a good example of that.

> Parsel is intentionally basically indented natural language w/ unit tests. There's minimal extra syntax for efficiency and generality.

I question whether the extra formal syntax is needed at all. My guess is that, were this properly ablated, it would not be. LLMs are--in my personal experience, and this is obviously borne out thematically--quite flexible about different ways of representing, say, unit-test inputs and outputs. Permitting users to specify these in a more arbitrary manner--whether in natural language, pseudocode, or extant programming languages--seems highly likely to work equally well, with some light coercion (i.e., training/prompting). Further, natural language allows test cases to be specified in a more general way ("unit tests: each day returns the next day in the week, Sunday=>Monday, ..., Saturday=>Sunday") that LLMs are well-suited to work with. Given LLMs' ability to pick up on context and apply it, there is also a good chance that freer-form descriptions of test cases would drive improved performance.
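
To make the example concrete, here is a sketch of the explicit test cases that the natural-language description ("each day returns the next day in the week") would expand to. The function name `next_day` and the list `DAYS` are hypothetical, and the expansion is written by hand here rather than produced by an LLM.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

# The one-line description expands to seven concrete (input, expected) pairs,
# wrapping around from Saturday back to Sunday.
expected_pairs = list(zip(DAYS, DAYS[1:] + DAYS[:1]))

def next_day(day: str) -> str:
    """Hypothetical candidate implementation to check against the pairs."""
    return DAYS[(DAYS.index(day) + 1) % len(DAYS)]

all_pass = all(next_day(day) == expected for day, expected in expected_pairs)
print(all_pass)
```

The one-sentence description stands in for seven explicit cases, which is the kind of generality the comment attributes to natural-language test specifications.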

If you want to call that further research--"it was easier to demonstrate the value of hierarchical decomposition with a DSL"--that's fine and understood, but I would call it out as a(n understandable) limitation of the paper and an opportunity for future research.

[deleted] t1_j6j9yun wrote on January 30, 2023 at 6:58 PM

[deleted]

farmingvillein t1_j6jdazy wrote on January 30, 2023 at 7:19 PM

This is, at best, a distinction without a difference.

The authors literally describe it as a "language".

It gets "compiled".

It generates a "Parsel program".

It holds a distinct learning curve such that a user can be an "expert".

The point here is that it is a unique specification that needs to be separately learned--it asks the user to learn, in essence, a domain-specific language. Or, if you prefer, a domain-specific specification; the point stands either way.

theunixman t1_j6jff5n wrote on January 30, 2023 at 7:33 PM

We have to learn APIs all the time, and basically they're all DSLs that just don't admit they are, so they're even harder.

farmingvillein t1_j6jgv48 wrote on January 30, 2023 at 7:41 PM

And this isn't a good thing, it is a necessary thing--we do it because someone bundled some logic together and you need to interact with it.

None of this addresses whether or why something like Parsel is necessary as an intermediate step. The authors do very little to justify the necessity of an intermediate representation; there is no meaningful analysis of why it apparently performs better, nor an ablation analysis to try to close the gaps.

The key benefits--like enforced test cases--could, hypothetically, very easily be enforced in something like Python, or many other languages.
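
One hypothetical way to enforce test cases in plain Python is a decorator that refuses to define a function unless its attached cases pass. This is a sketch of the idea in the comment, not something from the paper; `with_tests` is an invented name.

```python
from typing import Any, Callable, Tuple

def with_tests(*cases: Tuple[tuple, Any]) -> Callable:
    """Hypothetical decorator: run (args, expected) cases at definition time."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        if not cases:
            # Enforce the convention that every function ships with tests.
            raise ValueError(f"{func.__name__} has no test cases")
        for args, expected in cases:
            actual = func(*args)
            if actual != expected:
                raise AssertionError(
                    f"{func.__name__}{args} returned {actual!r}, "
                    f"expected {expected!r}"
                )
        return func
    return decorator

@with_tests(((2, 3), 5), ((0, 0), 0))
def add(a, b):
    return a + b

print(add(2, 3))
```

A function whose cases fail, or that declares no cases at all, raises at definition time, so the testing convention is enforced without a separate language.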

And given the massive volumes of training data we have for these other languages, there are a lot of good reasons to think that we should be able to see equal or better behavior than with a wholly manufactured pseudocode (effectively) language.

The paper would have been much more convincing and interesting if, e.g., they started with something like python and progressively added the restrictions that apparently helped Parsel provide higher quality results.

abcdchop t1_j6m17n8 wrote on January 31, 2023 at 7:32 AM

wait bro the key benefit is the hierarchical description -- the "language" is just a format for explaining the hierarchical description of the problem in natural language, I think that the improvements you're suggesting pretty much describe the paper itself

farmingvillein t1_j6n4hqy wrote on January 31, 2023 at 2:47 PM

> wait bro the key benefit is the hierarchical description

agreed

> I think that the improvements you're suggesting pretty much describe the paper itself

Allow users to work in actual unstructured language, or an extant programming language, and I'd agree.

theunixman t1_j6jhf69 wrote on January 30, 2023 at 7:45 PM

Right, turning it into an actual DSL would be much better, and then you'd have better semantics for the library. But honestly I'm bored talking about aesthetics already, peace.

[deleted] t1_j6jnws8 wrote on January 30, 2023 at 8:25 PM

[deleted]

[deleted] t1_j6iw1ql wrote on January 30, 2023 at 5:33 PM

[deleted]