HunteronX

t1_j761xqh wrote

The economics are getting to the point where these models will be big news...
The key features of this work seem to be:

  1. A multimodal embedding representation, obtained from separate per-modality encoders (patch-level for images, token-level for text) and combined via attention.

  2. A two-stage pipeline: generate rationales first, then infer answers from them, since producing both in a single stage reduced answer accuracy.
    (Not an expert, but is the higher % of hallucinated rationales in the baseline case - no vision features - due to the large 'context' the model must carry for both rationale and answer without those features?)
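The attention-based fusion in point 1 can be sketched roughly as follows. This is a minimal illustrative toy, not the paper's actual architecture: text tokens attend over image patches, and a simple sigmoid gate (with no learned weights here, just an assumed stand-in) mixes the attended visual features back into the text representation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(text_tokens, image_patches):
    """Cross-attention fusion sketch: each text token queries every
    image patch, then a gate mixes text and attended-image features.
    (Illustrative only; a real model learns projection/gate weights.)"""
    d = text_tokens.shape[-1]
    scores = text_tokens @ image_patches.T / np.sqrt(d)   # (T, P)
    attended = softmax(scores, axis=-1) @ image_patches   # (T, d)
    # hypothetical parameter-free gate, just to show the mixing step
    gate = 1 / (1 + np.exp(-(text_tokens + attended).mean(axis=-1, keepdims=True)))
    return gate * text_tokens + (1 - gate) * attended     # (T, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(6, 16))    # 6 text tokens, dim 16
image = rng.normal(size=(9, 16))   # 9 image patches, dim 16
fused = fuse(text, image)
print(fused.shape)  # (6, 16)
```

The fused sequence keeps the text length, so it can feed the rationale-generation stage of point 2 in place of plain text embeddings.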

It seems that multimodal representations (language + n=? other modalities) may be important for introducing a loose physical grounding - avoiding hallucinated-but-plausible ideas/suggestions while still representing the remaining ideas efficiently.
