Submitted by YonatanBitton t3_yeppof in MachineLearning

Our paper "WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models" was accepted to the NeurIPS 2022 Datasets and Benchmarks track.

Paper: http://arxiv.org/abs/2207.12576
Website: http://winogavil.github.io
Huggingface: https://huggingface.co/datasets/nlphuji/winogavil
Colab: https://colab.research.google.com/drive/19qcPovniLj2PiLlP75oFgsK-uhTr6SSi
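If you want to poke at the data programmatically, here is a minimal sketch for loading it from the Hugging Face Hub. The split name and field names ("cue", "candidates") are assumptions; see the dataset card and the Colab for the exact schema.

```python
# Minimal sketch: loading WinoGAViL from the Hugging Face Hub.
# The split name and the field names ("cue", "candidates") are assumptions;
# check the dataset card at nlphuji/winogavil for the exact schema.
from datasets import load_dataset

winogavil = load_dataset("nlphuji/winogavil", split="test")
example = winogavil[0]
print(example["cue"])         # the textual cue, e.g. "werewolf"
print(example["candidates"])  # the candidate images for that cue
```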

Which images best fit the cue "werewolf"? Did you know that V&L AI models only get ~50% on our challenging WinoGAViL association task, while humans get 90%?


Introducing WinoGAViL, an online game you can play now against AI! WinoGAViL is a dynamic benchmark for evaluating V&L models. Inspired by the popular card game Codenames, a spymaster gives a textual cue associated with several of the visual candidates, and another player tries to identify them.
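To give a feeling for how a vision-and-language model can be queried on an instance, here is a minimal zero-shot sketch with off-the-shelf CLIP from Hugging Face transformers. The image paths and the top-k selection rule are illustrative, not our exact evaluation protocol.

```python
# Minimal sketch of a zero-shot CLIP "solver": score each candidate image
# against the cue and keep the top-k. Illustrative only; the image paths
# are placeholders and this is not the paper's exact evaluation protocol.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

cue = "werewolf"
image_paths = ["wolf.jpg", "moon.jpg", "cat.jpg", "guitar.jpg"]  # placeholders
images = [Image.open(p) for p in image_paths]

inputs = processor(text=[cue], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per image

k = 2  # number of images the solver is asked to select for this instance
selected = [image_paths[i] for i in scores.topk(k).indices.tolist()]
print(selected)
```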

We analyze the skills required to solve the WinoGAViL dataset, observing challenging visual and non-visual patterns, such as attribution, general knowledge, word sense making, humor, analogy, visual similarity, abstraction, and more.

We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models: the best model (ViLT) achieves a score of 52%, succeeding mostly when the cue is visually salient.
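The agreement metric is the Jaccard index between the set of images a solver selects and the set the spymaster associated with the cue. A tiny self-contained sketch (file names are made up for illustration):

```python
def jaccard_index(predicted, gold):
    """Jaccard index: |intersection| / |union| of the two selections."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

# The solver picks two of the three associated images plus one wrong image.
print(jaccard_index({"wolf.jpg", "moon.jpg", "guitar.jpg"},
                    {"wolf.jpg", "moon.jpg", "fangs.jpg"}))  # -> 0.5
```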

Our game is live; you are welcome to play it now (should be fun 😊). Explore (https://winogavil.github.io/explore) random samples created by humans in our benchmark, or try to create new associations that challenge the AI (https://winogavil.github.io/challenge-the-ai).



Comments


shahaff32 t1_itzzb72 wrote

It looks interesting, but associations can have many aspects and may lead to misunderstandings. How do you deal with that?


YonatanBitton OP t1_iu02pl6 wrote

This is a great point, thank you. The interpretation of commonsense tasks varies from person to person, and commonsense reasoning involves some ambiguity. WinoGAViL, however, only keeps instances that were solved well by three human solvers (over 80% Jaccard index). To validate the dataset, we recruited additional players (who did not take part in the data-generation task) and verified that the instances were solved with high human accuracy (90%).


shahaff32 t1_iu0mobx wrote

Thank you for your answer, we will look into it :)


Nir_Kap t1_iu5xfm5 wrote

Very interesting; one of the best works I've seen in a while. I have a question: how do you explain the low performance of the fine-tuned models?


YonatanBitton OP t1_iua6jz7 wrote

Thank you :) Random chance with 10-12 candidates is pretty low - 17%-24% - so a fine-tuned model performance of 55% is well above random chance. However, we still see that humans perform much better. A possible explanation for this gap is that the dataset is challenging, containing complex social and cultural cues that challenge current models, which were not trained on similar tasks. We explored this direction in the last section (Table 6): there are easier classes like "visually salient" (more similar to the models' pre-training task), with performance of 67%, and more difficult ones (different from the pre-training), like "visually non-salient", with 36%.
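To make the random-chance baseline concrete, here is a small simulation sketch estimating the expected Jaccard index of a uniformly random guess that picks k of n candidates. The specific k and n values below are illustrative; the exact figures depend on the per-instance setup in the dataset.

```python
import random

def expected_random_jaccard(n_candidates, k_associated, trials=100_000):
    """Monte-Carlo estimate of the Jaccard index of a uniformly random guess
    that selects k_associated images out of n_candidates, scored against a
    gold set of the same size."""
    candidates = list(range(n_candidates))
    gold = set(candidates[:k_associated])
    total = 0.0
    for _ in range(trials):
        guess = set(random.sample(candidates, k_associated))
        total += len(guess & gold) / len(guess | gold)
    return total / trials

# Illustrative settings: 10 or 12 candidates, a few associated images per cue.
for n in (10, 12):
    print(n, round(expected_random_jaccard(n, 3), 3))
```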
