navillusr t1_j4rexbm wrote
Reply to comment by mrconter1 in [R] The Unconquerable Benchmark: A Machine Learning Challenge for Achieving AGI-Like Capabilities by mrconter1
This is wrong: WoB/MiniWoB++ has a 160×210 px observation. Also, some OSes (Chrome OS) are almost entirely web-based, so this distinction is minimal.
navillusr t1_j4qumlu wrote
Reply to [R] The Unconquerable Benchmark: A Machine Learning Challenge for Achieving AGI-Like Capabilities by mrconter1
- If you list instructions step by step, the model doesn’t need to reason to solve the problem; this tests only a very basic form of intelligence.
- Adept.ai can already solve more complex challenges than this (though still nowhere near AGI); they use an LLM-driven chatbot to automate simple tasks in common programs.
- There’s already a benchmark that tests tasks like this: MiniWoB++.
navillusr t1_j47wc3c wrote
Reply to comment by throwaway2676 in [D] What's your opinion on "neurocompositional computing"? (Microsoft paper from April 2022) by currentscurrents
It’s definitely a hard problem. The challenge isn’t a pipeline problem of “solve this reasoning task” where you can just take the English task -> convert to code -> run code -> convert to English answer. We could probably do that with some degree of accuracy in some contexts.
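Concretely, that pipeline would look something like this (a minimal sketch; `ask_llm` is a hypothetical stand-in for whatever model you’d call, not a real API):

```python
import io
import contextlib

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical: replace with a real model call")

def solve_via_code(task: str) -> str:
    # English task -> code
    code = ask_llm("Write Python that solves this task and prints the answer:\n" + task)
    # run the generated code, capturing its output (a real system would sandbox this)
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code)
    # result -> English answer
    return ask_llm("Restate this result in plain English: " + buf.getvalue())
```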
The hard part is having the agent solve reasoning tasks as they appear, without prompt engineering and without being told that it’s a reasoning task. In essence, it should combine reasoning and planning seamlessly with the generative side of intelligence, not just piece them together when you tell it to outsource the task to a reasoning engine (assuming it could even do that accurately).
For example, ask ChatGPT to play rock-paper-scissors where it must choose the option that beats the option that beats the option you pick (i.e., if I pick rock, it should pick scissors, because scissors beats paper, which beats rock). It can’t plan that far ahead:
> Let’s play a modified version of Rock Paper Scissors, but to win, you have to pick the option that beats the option that beats the option that I pick.
> Sure, I'd be happy to play a modified version of Rock Paper Scissors with you. Please go ahead and make your selection, and I'll pick the option that beats the option that beats it.
> Rock
> In that case, I will pick paper.
Since this game requires two steps of thinking and goes against the statistically likely answer in this scenario, it fails. As you described, you could maybe write code that identifies a rock-paper-scissors game, generates and runs code, then answers in English, but many real-world tasks require more than one step of planning, and the agent needs to seamlessly identify and work through them. (For the record, it also outputs incorrect Python code for this game when prompted; the correct logic, shown below, is only a few lines.)
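For reference, the correct logic is just two table lookups, e.g. a minimal Python sketch:

```python
# "Pick the option that beats the option that beats my pick" is two lookups:
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # x -> what beats x

def winning_move(opponent_pick: str) -> str:
    return BEATS[BEATS[opponent_pick]]

print(winning_move("rock"))  # scissors: scissors beats paper, which beats rock
```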
I don’t do research in this specific area, so again I could be off base here, but I think that’s why it’s harder than you’re imagining.
Fwiw, there was a recent paper (the method was called Mind’s Eye) where they used an LLM to generate physics simulator code to answer physics questions, similar to what you described.
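That paper grounds questions in a physics simulator; as a toy illustration of the general idea (not their actual pipeline, and the numbers here are made up):

```python
# Toy illustration: answer "which ball lands first?" by simulating instead of guessing.
import math

def fall_time(height_m: float, g: float = 9.81) -> float:
    return math.sqrt(2 * height_m / g)  # free fall, air resistance ignored

drops = {"ball A": 5.0, "ball B": 10.0}  # drop heights in meters (hypothetical)
first = min(drops, key=lambda name: fall_time(drops[name]))
print(first, "lands first")  # ball A: shorter fall from a lower height
```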
navillusr t1_j43zaqk wrote
Reply to [D] What's your opinion on "neurocompositional computing"? (Microsoft paper from April 2022) by currentscurrents
I think this is a very common belief. Symbolic systems can do many things that neural networks struggle with, and they do so very sample-efficiently. But they’ve failed to scale with more data as well as neural networks for most tasks, and they’re harder to train. If we could magically combine the reasoning ability of symbolic systems with the pattern recognition and generalization of neural networks, we would be getting very close to AGI, imo. That said, I don’t know much about recent research in symbolic reasoning, so my knowledge might be outdated.
navillusr t1_j4rhitt wrote
Reply to comment by mrconter1 in [R] The Unconquerable Benchmark: A Machine Learning Challenge for Achieving AGI-Like Capabilities by mrconter1
The distinctions you’re drawing, pixels vs. Selenium output and browser vs. OS, are far less significant than the complexity of the tasks (step-by-step instructions vs. entire processes). What they’ve achieved is strictly harder for humans than what you are testing. We can argue about whether perception or planning is harder for current technology (computer vision is far more developed than AI planning right now), but I think you need to reconsider the formulation of your tasks. It seems like they are designed to be easy enough for modern methods to solve.
On another note, most interesting tasks can’t be completed with just an (x, y) mouse-location output. Why did you decide to restrict the benchmark to such a limited set of tasks?
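For contrast, even a minimally richer action space would look something like this (a hypothetical sketch; the names are illustrative, not from the benchmark):

```python
# Hypothetical sketch: a click-only action space vs. a slightly richer one.
from dataclasses import dataclass
from typing import Union

@dataclass
class Click:       # the only action an (x, y) output supports
    x: int
    y: int

@dataclass
class TypeText:    # needed for search boxes, forms, logins, ...
    text: str

@dataclass
class PressKey:    # needed for shortcuts and navigation
    key: str

Action = Union[Click, TypeText, PressKey]
```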