HunteronX t1_j761xqh wrote
Reply to [R] Multimodal Chain-of-Thought Reasoning in Language Models - Amazon Web Services Zhuosheng Zhang et al - Outperforms GPT-3.5 by 16% (75%->91%) and surpasses human performance on ScienceQA while having less than 1B params! by Singularian2501
The economics is getting there for these models to be big news...
The key features of this work seem to be:
A multimodal embedding representation obtained from separate per-modality encoders (patch-level for images, token-level for text), then combined via attention (rough sketch just after this list).
Generate rationales first, then infer answers from them in a second pass - folding rationale and answer into a single stage reduced answer accuracy (a sketch of the two-stage call is at the end of this comment).
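Roughly how I read the fusion step - my own PyTorch sketch from the paper's description, not their released code, and the module/parameter names are mine:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of the attention-based fusion as I read the paper (assumed names,
    not the released code): text tokens attend over image patch features, then a
    learned gate mixes the attended vision signal back into the text sequence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_text, h_vision):
        # h_text:   (batch, n_tokens,  d_model) from the text encoder (e.g. T5)
        # h_vision: (batch, n_patches, d_model) patch-level image features
        h_attn, _ = self.attn(query=h_text, key=h_vision, value=h_vision)    # cross-attention
        lam = torch.sigmoid(self.gate(torch.cat([h_text, h_attn], dim=-1)))  # learned gate
        return (1 - lam) * h_text + lam * h_attn  # fused sequence fed to the decoder
```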
(Not an expert, but is the greater share of hallucinated rationales in the baseline case - no vision features - down to the large 'context' the model must carry for both rationale and answer without those features?)
It seems that multimodal representations (language + n=? other modalities) may matter both for introducing a loose physical grounding that keeps the model from hallucinating merely plausible ideas/suggestions, and for representing the remaining ideas efficiently.
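And the two-stage rationale-then-answer flow as I understand it - a hypothetical interface where `model.generate(text, image)` just stands in for the fused encoder-decoder above; the real repo's API will differ:

```python
def two_stage_inference(model, question: str, context: str, options: str, image):
    """Two-pass inference: generate a rationale first, then condition the
    answer on it. All names here are illustrative, not from the paper's code."""
    # Stage 1: generate a rationale from the full multimodal input.
    stage1_input = f"{question}\n{context}\n{options}"
    rationale = model.generate(stage1_input, image)

    # Stage 2: append the rationale and infer the answer from it.
    stage2_input = f"{stage1_input}\nRationale: {rationale}"
    answer = model.generate(stage2_input, image)
    return rationale, answer
```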