Abstract:

>Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
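The pipeline described in the abstract (simulate first, then put the result into the prompt) can be sketched roughly as follows. This is not the paper's code: it swaps MuJoCo for a hand-written vacuum free-fall integrator, and `Ball`, `fall_time`, `simulation_hint`, and `build_prompt` are hypothetical names chosen for illustration. The hint is only as good as the simulator's assumptions (no air drag here).

```python
from dataclasses import dataclass

G = 9.81  # m/s^2, gravitational acceleration near Earth's surface


@dataclass
class Ball:
    name: str
    mass_kg: float
    height_m: float


def fall_time(ball: Ball, dt: float = 1e-4) -> float:
    """Integrate free fall in a vacuum (no drag) until the ball reaches the ground."""
    y, v, t = ball.height_m, 0.0, 0.0
    while y > 0.0:
        v += G * dt  # semi-implicit Euler: update velocity first
        y -= v * dt  # then position
        t += dt
    return t


def simulation_hint(a: Ball, b: Ball, tol: float = 1e-3) -> str:
    """Turn the simulated outcome into one sentence of text for the prompt."""
    ta, tb = fall_time(a), fall_time(b)
    if abs(ta - tb) < tol:
        return f"Simulation: {a.name} and {b.name} land at the same time."
    first = a if ta < tb else b
    return f"Simulation: {first.name} lands first."


def build_prompt(question: str, hint: str) -> str:
    """Prepend the simulation result to the question (assumed prompt layout)."""
    return f"{hint}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    x = Ball("X", mass_kg=0.2, height_m=10.0)
    y = Ball("Y", mass_kg=0.1, height_m=10.0)
    print(build_prompt("Which baseball will fall to the ground faster?",
                       simulation_hint(x, y)))
```

Mass never enters `fall_time`, so this simulator will always report that two balls dropped from the same height land together.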

https://preview.redd.it/ie7jdqhwmnt91.jpg?width=1092&format=pjpg&auto=webp&s=ebff5cab2c805549e85fb2eccfdadd0644d95d9f

https://preview.redd.it/3wrxbnhwmnt91.jpg?width=1180&format=pjpg&auto=webp&s=09aa2773a853ab564cfbb11811a18d21165a06e4

https://preview.redd.it/7frgfxhwmnt91.jpg?width=991&format=pjpg&auto=webp&s=0bfcb01b5707d6e1892fc3100b960a5f9c203707

https://preview.redd.it/k6mm4rhwmnt91.jpg?width=1191&format=pjpg&auto=webp&s=00c452f58e79b6ea003883826e50653f907221d5

Comments


londons_explorer t1_is9c1xs wrote on October 14, 2022 at 6:36 AM

This seems to basically be injecting a tiny amount of rule-based decision making into a language model...

The physics model is so limited that it can only work in a very tiny number of cases, and the results might as well be hardcoded prompts to inject.

visarga t1_is9mzus wrote on October 14, 2022 at 9:13 AM

We need a learned physics model, there's so much video to train on, it's one of the most neglected modalities.

Icy-Pause-574 t1_isb5gai wrote on October 14, 2022 at 4:53 PM

But we are using a physics engine / game engine to create the virtual world, right?

I think this paper shows some potential of using the parallel world to help understand the real world, which is amazing.

I want to call it the milestone of mixed-reality LMs.

[deleted] t1_is9855m wrote on October 14, 2022 at 5:48 AM

[deleted]

visarga t1_is9mk63 wrote on October 14, 2022 at 9:06 AM

Not just simulation, LLMs can also benefit from other toys: search, code execution/REPL, sub-requests, calling external APIs.

NextAGI t1_is9kymu wrote on October 14, 2022 at 8:42 AM

This is not so surprising to me if you have read the paper showing that language models can even transfer reasoning ability to language after being trained on something like code. https://arxiv.org/abs/2201.11473

thunderdome t1_is9cigc wrote on October 14, 2022 at 6:42 AM

Really interesting. I wish they included more details on the text-to-code models. They share the table showing that increasing the size increases the accuracy, but they apparently never train it higher than 1.5B parameters? It would be interesting to know how much of the remaining error on their benchmark is due to error in the text-to-code generation vs the foundational models. Or even just a standalone accuracy measure.

Regardless, super cool to see the number 100 popping up in benchmarks like this.

yazriel0 t1_isbv4hp wrote on October 14, 2022 at 7:44 PM

Why aren't we doing this for the code domain? Generate programs, try to run them, and auto-correct the model?

This can probably be iterated with far more samples than a physical simulator
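A minimal sketch of that generate-and-run loop, assuming the candidate programs come from some model (here they are hard-coded strings standing in for samples). `passes` and `first_passing` are hypothetical helper names, and `exec` on untrusted model output would need a real sandbox in practice.

```python
from typing import Optional


def passes(source: str, tests: list, func_name: str = "solve") -> bool:
    """Run candidate source and check it against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: trusted input; sandbox this for real use
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def first_passing(candidates: list, tests: list) -> Optional[str]:
    """Return the first candidate that passes every test, or None."""
    for source in candidates:
        if passes(source, tests):
            return source
    return None


# Stand-ins for model samples: one wrong, one that does not parse, one correct.
candidates = [
    "def solve(a, b): return a - b",
    "def solve(a, b) return a + b",
    "def solve(a, b): return a + b",
]
tests = [((1, 2), 3), ((0, 0), 0)]

if __name__ == "__main__":
    print(first_passing(candidates, tests))
```

The pass/fail signal from `passes` is what could be fed back to the model, either as a filter over samples or as a training signal.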

rePAN6517 t1_is9rrfh wrote on October 14, 2022 at 10:20 AM

Maybe we could use the new version of Codex to program a human simulator and let LLMs use the human simulator to help answer questions about anything related to people.

FirstOrderCat t1_is99yzw wrote on October 14, 2022 at 6:10 AM

Is the benchmark available somewhere?

Co0k1eGal3xy t1_is9yf8p wrote on October 14, 2022 at 11:36 AM

>Two baseballs X and Y are released from rest at the same height.
>
>X is heavier than Y.
>
>Which baseball will fall to the ground faster?

Isn't Mind's Eye the ONLY wrong answer?

Acceleration due to gravity is constant, but the opposing force from air resistance is roughly proportional to the air displaced and does not change with mass, so the same drag force slows the heavier ball less.

I mean, this is the whole point of Apollo 15's test on the moon.

All of them have wrong explanations, but Mind's Eye is the only one that incorrectly claims they will fall at the same rate under normal real-world conditions.

Proof: Brian Cox visits the world's biggest vacuum | Human Universe - BBC
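A quick numerical check of the air-resistance argument, as a sketch under stated assumptions: quadratic drag, a drag coefficient of 0.4, air density of 1.2 kg/m^3, and a 7.4 cm ball diameter are all illustrative values, not figures from the paper or its benchmark. The function name `fall_time_with_drag` is hypothetical.

```python
import math

G = 9.81          # m/s^2
RHO_AIR = 1.2     # kg/m^3, assumed air density
DRAG_COEFF = 0.4  # assumed drag coefficient for a baseball-like sphere
DIAMETER = 0.074  # m, assumed baseball diameter (same for both balls)


def fall_time_with_drag(mass_kg: float, height_m: float, dt: float = 1e-4) -> float:
    """Time to fall from rest through height_m with quadratic air drag."""
    area = math.pi * (DIAMETER / 2) ** 2
    k = 0.5 * RHO_AIR * DRAG_COEFF * area  # drag force = k * v^2, independent of mass
    y, v, t = height_m, 0.0, 0.0
    while y > 0.0:
        a = G - (k / mass_kg) * v * v  # drag deceleration shrinks as mass grows
        v += a * dt
        y -= v * dt
        t += dt
    return t


if __name__ == "__main__":
    for mass in (0.145, 0.290):  # a regular ball and one twice as heavy
        print(f"{mass:.3f} kg: {fall_time_with_drag(mass, 10.0):.4f} s")
```

Under these assumptions the heavier ball lands first, but for a 10 m drop the gap is on the order of milliseconds, which is why both sides of this argument can feel right.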

Even_Tangerine_800 t1_is9zy95 wrote on October 14, 2022 at 11:51 AM

This is what I got: GPT-3 Answer.

Apparently, the model arrives at the wrong answer without mentioning air resistance. I have tried many times and the results are consistent.

Considering that the free-fall rules should be encoded in some textbooks (which should have been included in the pre-training datasets), these results are even more striking to me.

Co0k1eGal3xy t1_isa0dvo wrote on October 14, 2022 at 11:56 AM

The heavier baseball falling to the ground faster is the correct answer. Maybe you misread my post?

It is a shame none of them mention air resistance.

Even_Tangerine_800 t1_isa18wf wrote on October 14, 2022 at 12:04 PM

Are the questions as simple as a = F/m = mg / m = g?

Anyways. If humans put effort into optimizing a tool for accurate simulation, we can treat it more like an alignment problem rather than pure scientific judgment.

You can update the knowledge in the physics engine if you want.
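For reference, the a = F/m = mg/m = g identity above only holds when gravity is the sole force. A sketch of the same step with quadratic drag added, where the drag constant $k$ (with $\rho$ the air density, $C_d$ the drag coefficient, and $A$ the cross-sectional area) is an assumed model and not something the paper specifies:

```latex
\begin{align}
  m a &= m g - k v^2, \qquad k = \tfrac{1}{2}\,\rho\, C_d\, A \\
  a &= g - \frac{k}{m}\, v^2 \\
  v_t &= \sqrt{\frac{m g}{k}} \quad \text{(terminal velocity, where } a = 0\text{)}
\end{align}
```

With identical size and shape, $k$ is the same for both balls, so the $k/m$ term is smaller for the heavier one and mass no longer cancels.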

Co0k1eGal3xy t1_isa1gbn wrote on October 14, 2022 at 12:06 PM

Oh I agree 100%. This paper is fantastic! (and it's an easy fix)

I definitely want to see further research in this, but the comparison they show here is probably not the comparison they wanted to show haha.

Lajamerr_Mittesdine t1_isa34ib wrote on October 14, 2022 at 12:21 PM

All the answers are incomplete because they don't provide the assumptions necessary to arrive at a complete solution.

A more complete answer would look like this.

>Assuming only gravitational forces, the lighter and heavier baseballs would both fall at the same rate and reach the surface at approximately the same time. However, this can be affected by additional forces that may be present, such as an atmosphere providing resistance that depends on the surface area, density, and total mass of each object.

Though even that is an incomplete answer.

Co0k1eGal3xy t1_isa44ws wrote on October 14, 2022 at 12:30 PM

>because they don't provide the assumptions necessary to arrive at a complete solution.

I agree, but when atmosphere is not mentioned, the default should be updated to STP (0°C temperature and 101.325 kPa pressure) in future.

eigenlaplace t1_isa6z73 wrote on October 14, 2022 at 12:54 PM

It’s a simple question, no mention of air anywhere… The correct answer is they fall at the same rate.

Co0k1eGal3xy t1_isa7e7b wrote on October 14, 2022 at 12:57 PM

I live on the planet earth where most places have air. It is assumed that there is air if it is not mentioned otherwise.

eigenlaplace t1_isa7xzy wrote on October 14, 2022 at 1:01 PM

I live on planet Question where most places have no air. Where is your god now?

Co0k1eGal3xy t1_isa8g7i wrote on October 14, 2022 at 1:05 PM

>current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning.

That is my whole point. This paper is trying to avoid "planet Question" and make language models work in the real world instead.

I'm not interested in arguing over this. The paper is good, it just needs a minor correction in a future revision.

AskMoreQuestionsOk t1_isb66lb wrote on October 14, 2022 at 4:58 PM

Actually, I think you make a good point. If you think about understanding conversations, stories, and problems like this, you need a model of what you are talking about before you can even begin to predict the next state accurately. We make an incredible number of assumptions from our own experience when we build those internal models. How do we know whether air friction is important to this problem?

master3243 t1_isc3kp7 wrote on October 14, 2022 at 8:41 PM

They fall at the same rate

https://www.wired.com/2013/10/do-heavier-objects-really-fall-faster/

Co0k1eGal3xy t1_isc5m6k wrote on October 14, 2022 at 8:55 PM

>But what about the basketball and the bowling ball? Shouldn't they have different accelerations? Technically, yes.
>
>[...]
>
>it turns out that there are many situations where a heavier object does indeed hit the ground before a lighter object (because of air resistance).

Your link says the heavy baseball and the light baseball would fall at different rates.

master3243 t1_iscb9mu wrote on October 14, 2022 at 9:34 PM

My link also says that heavier objects can fall slower than light objects. For example, the styrofoam board was heavier than the small ball, yet it fell slower.

In the absence of more detail such as the dynamics of the shapes and the inclusion of air drag or not, it is fair to say that the most correct answer to the "which" question is "both". I would only count the "heavy first" answer as correct IF it included the discussion on air drag, otherwise the correct answer is "both". But that's my opinion and not objectively the only way to interpret this.

Especially given a model that has so many physics articles/material included in its dataset, it's a pretty big fail that it can't answer this properly.

Co0k1eGal3xy t1_iscd0no wrote on October 14, 2022 at 9:46 PM

>In the absence of more detail such as the dynamics of the shapes

Baseballs have a standard diameter and shape.

It's theoretically possible that the heavier baseball has a "furry" surface or something like that, but it's such an unlikely case I didn't consider it when reading the paper.

>it's a pretty big fail that it can't answer this properly.

I emailed the authors and they said "there could be some pre conditions we have not presented in the screenshot" and that they would address it when they released a dataset.

Sounds like it's all sorted out now. No harm done.

master3243 t1_isch3m9 wrote on October 14, 2022 at 10:15 PM

Great
