RahnuLe t1_j8cz9sc wrote
Reply to This is Revolutionary?! Amazon's 738 million(!!!) parameter model outperforms humans on science, vision, language and many more tasks. by Ok_Criticism_1414
These passages really stuck out to me.
>We speculate that such a phenomenon of hallucination is due to a lack of necessary vision contexts for performing effective Multimodal-CoT. To inject vision information, a simple way is to transform the paired image into a caption (Lu et al., 2022a) and then append the caption to the input of both stages. However, as shown in Table 3, using captions only yields marginal performance gains (↑0.59%). Then, we explore an advanced technique by incorporating vision features into the language model. Concretely, we feed the paired image to the DETR model (Carion et al., 2020) to extract vision features. Then we fuse the vision features with the encoded language representations before feeding them to the decoder (more details will be presented in Section 4). Interestingly, with vision features, the RougeL score of the rationale generation is boosted to 96.97% (QCM→R), which correspondingly contributes to a better answer accuracy of 84.91% (QCMR→A). With those effective rationales, the phenomenon of hallucination is mitigated — 62.5% of the hallucination mistakes in Section 3.2 have been corrected (Figure 3(b)), as in the example shown in Figure 2 (right part). The analysis so far compellingly shows that vision features are indeed beneficial for generating effective rationales and contributing to accurate answer inference.
>
>[...]
>
>Compared with existing UnifiedQA and GPT-3.5 methods that leverage image captions in the context to provide vision semantics, the results indicate that using image features is more effective.
It's clear that there's still a long way to go in terms of representing the data in such a way that these networks can fully process it without factual errors, and this paper is a strong demonstration of the gains that are possible when you address this aspect specifically. Very promising stuff, and frankly, it's also kind of terrifying. Human obsolescence is frighteningly closer than I had imagined...
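
If it helps to picture that fusion step, here's a rough PyTorch sketch of the idea as I read it: project the DETR features into the language model's hidden size, let the text encoder's outputs attend over them, and gate the result back in before decoding. The module name, dimensions, and the exact gated-fusion formulation below are my own guesses for illustration, not the paper's actual code.

```python
# Rough sketch of "fuse the vision features with the encoded language
# representations before feeding to the decoder". All names/dims are assumed.
import torch
import torch.nn as nn

class VisionTextFusion(nn.Module):
    def __init__(self, d_model=768, d_vision=256, n_heads=8):
        super().__init__()
        # Project DETR object-query features into the LM's hidden size.
        self.vision_proj = nn.Linear(d_vision, d_model)
        # Let text tokens attend over the projected vision features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Per-token gate deciding how much vision signal to mix in.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_hidden, vision_feats):
        # text_hidden:  (batch, seq_len, d_model)      -- text encoder outputs
        # vision_feats: (batch, num_queries, d_vision) -- DETR features
        v = self.vision_proj(vision_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=v, value=v)
        lam = torch.sigmoid(self.gate(torch.cat([text_hidden, attended], dim=-1)))
        # Convex combination of language and vision-attended representations;
        # this fused tensor is what the decoder would consume instead of the
        # text-only encoder outputs.
        return (1 - lam) * text_hidden + lam * attended

# Usage (shapes made up): fused = VisionTextFusion()(encoder_out, detr_feats)
```

The point being: the vision signal gets injected at the representation level rather than squeezed through a lossy caption, which is presumably why it helps so much with the hallucinated rationales.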