Github: https://github.com/amazon-science/mm-cot

Twitter: https://paperswithcode.com/top-social

Abstract:

>Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies are mostly isolated in the language modality with LLMs, where LLMs are hard to deploy. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16% (75.17%->91.68%) on the ScienceQA benchmark and even surpasses human performance.

https://preview.redd.it/g9eo0f94k1ga1.jpg?width=1331&format=pjpg&auto=webp&v=enabled&s=a51e29ed523b624dd70d97841c8b0a5442915c80

https://preview.redd.it/fgboci94k1ga1.jpg?width=1323&format=pjpg&auto=webp&v=enabled&s=1a3a2fe1a47d4ca04f992b2cf72832f024166711

https://preview.redd.it/2ojfym94k1ga1.jpg?width=1660&format=pjpg&auto=webp&v=enabled&s=e7431fb8532d6331374f1b00adc40248de94f381

https://preview.redd.it/k7huem94k1ga1.jpg?width=1326&format=pjpg&auto=webp&v=enabled&s=2bcbe91afcdf815171b4c0fd7f8e48f63a8bbb4c

https://preview.redd.it/05m8rf94k1ga1.jpg?width=658&format=pjpg&auto=webp&v=enabled&s=a8384d649e2140b27dc87525c1546403cd3409f7

Comments

You must log in or register to comment.

throwaway2676 t1_j74iilz wrote on February 4, 2023 at 12:30 AM

#1,717,360

Imo, chain-of-thought and program-of-thought reasoning will be the next major generation of progress for LLMs. Probably another year or two and we will be able to eliminate those goofy instances where the models confidently produce nonsense (well, mostly anyway).

AiChip t1_j74ku5a wrote on February 4, 2023 at 12:48 AM

#1,717,470

Wow! This is huge! 1B parameters model beating 175 B parameters model…

Lengador t1_j74ro7q wrote on February 4, 2023 at 1:42 AM

#1,717,876

Replying to AiChip (#1,717,470)

That's the number in the headline, but if you look at the tables you can see their 223M parameter model beats the 175B parameter model significantly as well. That's 0.1% the size! Absolutely insane.

zbyte64 t1_j74y5o9 wrote on February 4, 2023 at 2:33 AM

#1,718,223

What kind of hardware do I need to train this?

Parzival_007 t1_j75c9u0 wrote on February 4, 2023 at 4:33 AM

#1,718,938

This is big. Thanks for sharing this !

[deleted] t1_j75xi18 wrote on February 4, 2023 at 8:53 AM

#1,719,874

[removed]

PedroGonnet t1_j760d03 wrote on February 4, 2023 at 9:35 AM

#1,719,980

fewer than 1B params 😶

ThirdMover t1_j760ojx wrote on February 4, 2023 at 9:39 AM

#1,719,993

Replying to throwaway2676 (#1,717,360)

I think it's going to be interesting if we manage to teach a model to actually have a notion of "factual" and "counterfactual" - right now every prompt is treated as equally valid, GPT3 doesn't have an "opinion" as to what is actually really true. I am not sure that is even possible with text (maybe with some sort of special marker token?) but multimodality might lead the way there.

ThirdMover t1_j760u5i wrote on February 4, 2023 at 9:42 AM

#1,720,002

Replying to PedroGonnet (#1,719,980)

Well, if you are at a billion the difference between continuous and discrete quantities becomes kind of hair splitting anyway....

HunteronX t1_j761xqh wrote on February 4, 2023 at 9:58 AM

#1,720,044

The economics is getting there for these models to be big news...
The key features of this work seem to be:

A multimodal embedding representation obtained by individual modality encoders (patch-level for images, token level for text), combined via attention.
Generate rationales first, then infer answers from them, due to accuracy reduction on answers.
(Not an expert: but is the greater % of hallucinated rationales in baseline case - no vision features - due to large 'context' needed for both rationale + answer, without those features?)

Seems that multimodal representations (language + n=? other modalities) may be important for introducing a loose physical grounding to avoid hallucinating plausible ideas/suggestions + efficient representation of the remaining ideas.

[deleted] t1_j766r6p wrote on February 4, 2023 at 11:09 AM

#1,720,218

Replying to ThirdMover (#1,719,993)

[deleted]

dancingnightly t1_j76t0gh wrote on February 4, 2023 at 2:56 PM

#1,721,447

Replying to zbyte64 (#1,718,223)

In theory training T5 alongiside the image embedding models they use (primarily DETR?) shouldn't take much more than a 3090 or Collab Pro GPU. You could train T5s on even consumer high end GPUs in 2020, for example, but the DETR image model probably needs to be ran for each image at the same time which might take up quite a bit of GPU together. The `main.py` script looks like a nice and fairly short typical training script you'd be able to quickly run if you download their repo, pull the scienceQA dataset and send the training args to see if it crashes.

__lawless t1_j76vq7h wrote on February 4, 2023 at 3:16 PM

#1,721,603

Just finished reading. Although imho not a very fair comparison with GPT it still is super impressive

yaosio t1_j76vwr2 wrote on February 4, 2023 at 3:17 PM

#1,721,613

Replying to ThirdMover (#1,719,993)

I think it's likely the ability to determine what is true and what isn't will come from a capability of the model rather than it being told what is and isn't true. It's not possible to mark text as true or not true as this assumes whomever is mafking these things is the sole authority on the truth and never makes mistakes.

At a certain level of capability the AI will be able to use all of its knowledge to determine what is and isn't true. For example, if you know enough about physics and the Earth, you'll know that the sky is blue without seeing it. For something that can't be confirmed or denied, such as, "Bob puts his shoes on before his pants." The AI could determine the likelihood of such an action based on what it knows about Bob, pants, and shoes.

If it's trained on lies it could determine they are lies because the data is not consistent. If I train you that every number plus another number is a number, but 2+2 is special and equals chair, you could determine I'm lying because it's not consistent with all the data as a whole.

Truth has a consistency to it that lies don't have, and a model can learn that.

jaqws t1_j76wfb1 wrote on February 4, 2023 at 3:21 PM

#1,721,638

Replying to __lawless (#1,721,603)

Why do you say it isn't a fair comparison?

__lawless t1_j76wlwb wrote on February 4, 2023 at 3:23 PM

#1,721,651

Replying to zbyte64 (#1,718,223)

They did it on 4 V100 with 32GB RAM

__lawless t1_j76xpgk wrote on February 4, 2023 at 3:31 PM

#1,721,720

Replying to jaqws (#1,721,638)

Just 2 points a) They fine tuned this model to death. Where as GPT3.5 has a handful of examples to fine tune b) This is a multi modal model which consumes the image directly. Where as GPT can only consume text, so they fed it caption of the image

jaqws t1_j76zkhu wrote on February 4, 2023 at 3:44 PM

#1,721,826

Replying to __lawless (#1,721,720)

Ah, yeah I would agree that's not a fair comparison. Thanks for sharing.

Lopsided-Factor-780 t1_j7743wv wrote on February 4, 2023 at 4:15 PM

#1,722,029

Question from a noob:
When they say H_Fuse is fed into the decoder model, such that Y = Decoder(H_Fuse), how is it fed in? Is it fed in like the encoder output in an encoder-decoder transformer with cross-attention? Or something else?

Also, if there is a separate encoder and decoder component, are they trained together or separately?

ThirdMover t1_j77bf6z wrote on February 4, 2023 at 5:04 PM

#1,722,385

Replying to yaosio (#1,721,613)

> I think it's likely the ability to determine what is true and what isn't will come from a capability of the model rather than it being told what is and isn't true. It's not possible to mark text as true or not true as this assumes whomever is mafking these things is the sole authority on the truth and never makes mistakes.

I think there is a bit of a misunderstanding here. The issue isn't that GPT3 has wrong opinions about stuff. The issue is that it doesn't have any opinions about what is real or isn't whatsoever. Of course any future AI will operate on limited and flawed information and thus have opinions that are not perfectly true. But before we can even get to that point a model needs to even have the idea of "real" and "not real" as fundamental categories. For GPT3 everything is just text, Harry Potter is as real as Obama. Maybe I am wrong and inference can actually get you there through pure consistency checks, as you say. But we will have to see about that.

[deleted] t1_j77e3ku wrote on February 4, 2023 at 5:22 PM

#1,722,520

Replying to AiChip (#1,717,470)

[deleted]

ipoppo t1_j77l1hr wrote on February 4, 2023 at 6:08 PM

#1,722,911

Replying to ThirdMover (#1,719,993)

Taking from Judea Pearl's book, capability of coming up with useful counterfactuals and causalities will likely built upon foundation of having good assumption about "world model(s)"

HeyLittleTrain t1_j77w36w wrote on February 4, 2023 at 7:23 PM

#1,723,578

Replying to Lengador (#1,717,876)

At what size could I run a model on a decent gaming PC?

i2mi t1_j786bu0 wrote on February 4, 2023 at 8:35 PM

#1,724,049

Replying to HeyLittleTrain (#1,723,578)

Around 2M Edit: the number I gave is completely delusional. Sorry

PedroGonnet t1_j7882zx wrote on February 4, 2023 at 8:48 PM

#1,724,117

Replying to ThirdMover (#1,720,002)

Countable does not mean that you have to count them, only that you could, if you wanted to.

ThirdMover t1_j7899qa wrote on February 4, 2023 at 8:56 PM

#1,724,158

Replying to PedroGonnet (#1,724,117)

You could also count water molecules.

PedroGonnet t1_j78ae8y wrote on February 4, 2023 at 9:04 PM

#1,724,216

Replying to ThirdMover (#1,724,158)

That would be many molecules for little water.

emotionalfool123 t1_j78bjj2 wrote on February 4, 2023 at 9:13 PM

#1,724,270

Replying to HeyLittleTrain (#1,723,578)

Stable diffusion is around 866M params which can be run on 12gb 3080

Lengador t1_j78ovy2 wrote on February 4, 2023 at 10:51 PM

#1,724,916

Replying to HeyLittleTrain (#1,723,578)

You can (just) run a 1B parameter model on a good gaming rig.

astonzhang t1_j79i4jj wrote on February 5, 2023 at 2:44 AM

#1,726,316

Hi, I am an author of the paper. Opinions below are my own.

After we arXiv-ed our "Automatic Chain of Though Prompting in Large Language Models" paper in Oct 2022 (here's a TLDR, ICLR'23), we were asking ourselves:

"If AGI (artificial general intelligence) is the goal, what kind of chain of thought (CoT) research do we need next? Is relying on a text-only generalist model that can perform text-only multitasks the final answer?"

"How can we connect the dots between NLP and CV communities so more researchers can contribute?"

"Since not everyone can afford playing with large models, how can we deal with input in more general form (text and images) *without* relying on larger models so a larger research community can contribute?"

One day I was teaching my kid how to solve arithmetic reasoning problems (not from the MultiArith dataset...). My kid told me that it's much easier to understand reasoning problems with the help from figure illustrations.

"Oh, can we leverage vision input to improve chain of thought reasoning?"

"The current generalist models like GPT-3.5 (text-davinci-002/003) only offer a blackbox API (at a cost) for transforming text input into text output. Why not just fine-tune a smaller model where we have full control of all its layers (whitebox) to fuse inputs in a more general form?"

Fortunately, Pan Lu et al. released the ScienceQA benchmark, just in time. This is a great contribution to the community and we benefited from it by testing our idea early on this benchmark (see acknowledgement in our GitHub repo). Showing the promise of fine-tuning a smaller model with task-specific datasets (rather than feeding in-context learning demos to a larger generalist LLM) is exactly what we wanted in this study (you may feel more motivated after reading the T-Few paper).

If you feel motivated to try parameter-efficient fine-tuning (PEFT) ideas from the aforementioned T-Few paper to improve Multimodal-CoT, you may also wish to check out our recent PEFT design space paper at ICLR'23 (here's a TLDR).

Dr_Love2-14 t1_j7aqm6x wrote on February 5, 2023 at 11:27 AM

#1,728,044

Replying to ThirdMover (#1,719,993)

During model training, I imagine the model would benefit from some form of "self-reflection" at recurrent intervals, similar to human sleep. For a crude workflow, one could design the model to recall through auto-prompting onto a context window everything its learned that is relevant to the newly exposed training data, and then the model makes a rationale decision (following a constant pre-encoded prompt) to restate the information and classify it as factual or non-factual, and then this self-generated text is backpropagated to the model.

(Disclaimer: I follow ML research as a layman)

HeyLittleTrain t1_j7avkil wrote on February 5, 2023 at 12:32 PM

#1,728,235

Replying to i2mi (#1,724,049)

Your answer seems substantially different than the others.

42gauge t1_j7e9mb2 wrote on February 6, 2023 at 3:44 AM

#1,733,356

Replying to yaosio (#1,721,613)

> If I train you that every number plus another number is a number, but 2+2 is special and equals chair, you could determine I'm lying because it's not consistent with all the data as a whole.

If I train you that every animal isn't conscious, but humans are special and conscious, you could "determine" I'm lying because it's not consistent with all the data as a whole.

42gauge t1_j7e9twt wrote on February 6, 2023 at 3:46 AM

#1,733,365

Replying to astonzhang (#1,726,316)

> I was teaching my kid how to solve arithmetic reasoning problems (not from the MultiArith dataset...

lol ..

42gauge t1_j7ea8pz wrote on February 6, 2023 at 3:49 AM

#1,733,383

Replying to PedroGonnet (#1,724,216)

And this would be many parameters for little... model

Balance- t1_j8copxz wrote on February 13, 2023 at 9:15 AM

#1,795,258

Replying to __lawless (#1,721,651)

Damn, imagine what happens when you throw a A100 or H100 datacenter against it for a few months

lwl t1_j8hoxpg wrote on February 14, 2023 at 11:42 AM

#1,806,210

Replying to astonzhang (#1,726,316)

Super interesting work, thank you for sharing! If you are still active on reddit - we noticed that the pdf is no longer available on arxiv, are you able to say why that is?

astonzhang t1_j8kcydh wrote on February 14, 2023 at 11:02 PM

#1,812,116

Replying to lwl (#1,806,210)

Can you check it again?

lwl t1_j8m2h7b wrote on February 15, 2023 at 8:33 AM

#1,815,739

Replying to astonzhang (#1,812,116)

Ah great, thanks!!

Alarming_Turnover578 t1_j8poufw wrote on February 16, 2023 at 1:30 AM

#1,825,254

Replying to 42gauge (#1,733,356)

According to Cambridge Declaration on Consciousness that would be correct. Unique property of Homo Sapiens mind is sapience not consciousness or sentience.

42gauge t1_j8pzroz wrote on February 16, 2023 at 2:52 AM

#1,826,065

Replying to Alarming_Turnover578 (#1,825,254)

Fine, just mentally replace both instances of "conscious" with "sapient"

mycall t1_j8sjg02 wrote on February 16, 2023 at 5:27 PM

#1,832,279

Replying to throwaway2676 (#1,717,360)

> chain-of-thought and program-of-thought reasoning

Isn't that what InstructGPT does?

zisyfos t1_j9j7zsk wrote on February 22, 2023 at 10:40 AM

#1,916,290

Replying to astonzhang (#1,726,316)

Really interesting! What are the minimum requirements to run this?

IluvBsissa t1_j9j9ml9 wrote on February 22, 2023 at 11:01 AM

#1,916,437

Replying to astonzhang (#1,726,316)

Dr. Zhang, thank you so much. Please can you tell us more about your model's performance ? How would it do on standard MMLU ? Can it be improved by increasing parameters count ? The paper didn't mention if the human testers were average human or experts ?

7734128 t1_j9j9r06 wrote on February 22, 2023 at 11:03 AM

#1,916,451

Replying to emotionalfool123 (#1,724,270)

And on my 8 GB GTX 1080.

chinguetti t1_j9joqfu wrote on February 22, 2023 at 1:34 PM

#1,918,124

Replying to astonzhang (#1,726,316)

Will make a good story when you accept your Nobel prize. Well done.

[deleted] t1_j9jsjad wrote on February 22, 2023 at 2:03 PM

#1,918,552

[removed]

NapkinsOnMyAnkle t1_j9jtolb wrote on February 22, 2023 at 2:12 PM

#1,918,706

Replying to i2mi (#1,724,049)

I've trained 100m CNNs on my laptop 3070 6gb. So...

kermunnist t1_j9kpp3s wrote on February 22, 2023 at 6:10 PM

#1,922,502

Replying to __lawless (#1,721,720)

I wonder how flamingo would compare

[deleted] t1_j9lopbv wrote on February 22, 2023 at 9:43 PM

#1,926,963

Replying to Alarming_Turnover578 (#1,825,254)

[deleted]

ihopeshelovedme t1_j9nhhgs wrote on February 23, 2023 at 5:53 AM

#1,934,135

Replying to chinguetti (#1,918,124)

You think the r/singularity will be kind enough to grant him a Nobel price?

astonzhang t1_j9scuwn wrote on February 24, 2023 at 5:15 AM

#1,957,300

Replying to zisyfos (#1,916,290)

We ran experiments on 4 NVIDIA Tesla V100 32G GPUs

astonzhang t1_j9sd3mw wrote on February 24, 2023 at 5:17 AM

#1,957,341

Replying to IluvBsissa (#1,916,437)

The human performance was taken from the paper from Lu et al.

JClub t1_jabyh73 wrote on February 28, 2023 at 9:30 AM

#2,096,982

Replying to astonzhang (#1,726,316)

GPT was never trained with image data, why is this a fair comparison? The UnifiedQA model is from 2022, so it doesn't seem fair either. Why don't we have some comparisons with other SOTA multimodal models? Such as OFA or UniT

JClub t1_jabyhe8 wrote on February 28, 2023 at 9:30 AM

#2,096,985

GPT was never trained with image data, why is this a fair comparison? The UnifiedQA model is from 2020, so it doesn't seem fair either. Why don't we have some comparisons with other SOTA multimodal models? Such as OFA or UniT

JClub t1_jabyi76 wrote on February 28, 2023 at 9:30 AM

#2,096,992

Replying to AiChip (#1,717,470)