YonatanBitton OP t1_iua6jz7 wrote on October 29, 2022 at 7:28 PM

Reply to comment by Nir_Kap in [R] WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models by YonatanBitton

Thank you :) Random chance with 10-12 candidates is pretty low, 17%-24%, so the fine-tuned model's performance of 55% is well above chance. However, we still see that humans perform much better. A possible explanation for this gap is that the dataset is challenging: it contains complex social and cultural cues that challenge current models, which were not trained on similar tasks. We explored this direction in the last section (Table 6), where there are easier classes like "visually salient" (more similar to the model's pre-training task) with performance of 67%, and more difficult ones (different from pre-training) like "visually non-salient" with 36%.