Questions about the performance of pretrained multi-modal models.

Ziwei Qin

Aug 31, 2021, 1:58:36 AM

to Visual Commonsense Reasoning

Hi,

I found that the commonly used multi-modal pre-trained models, such as VL-BERT, UNITER, and ERNIE-ViL, process questions and answers the same way you proposed in the paper: by substituting the person tags with unisex names. Besides, no information about the alignment between the unisex names and the image regions is given to the models. Does this mean that the models have to infer which region each name in the text refers to? That seems impossible, since the text contains few referring expressions that could be used. Is it possible that these models have learned from potential bias in VCR?

Thanks,

Ziwei

Rowan Zellers

Sep 10, 2021, 2:02:01 PM

to Visual Commonsense Reasoning

hi Ziwei,

Sorry for the delay :) My hunch is that since all of these models are finetuned, they can learn that alignment via finetuning (e.g. "Casey = the first 'person' box"). It's possible that they don't need to do that, though: i.e., if there's only one person in the image, the link between it and a tag like "Person1" is unambiguous. That's just my hypothesis; I would love to see an empirical confirmation of it, though :)
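
To make the hypothesis concrete, here is a minimal sketch of how a tag-to-name substitution could work. It assumes each question is a list of tokens in which a person tag appears as a list of box indices, and that the name is picked by box index. The name list and the `replace_person_tags` helper are hypothetical and only for illustration; the actual preprocessing in any given model may differ.

```python
# Hypothetical name list; the real list used by a given model may differ.
UNISEX_NAMES = ["Casey", "Riley", "Jessie", "Jackie", "Avery"]


def replace_person_tags(tokens, object_labels):
    """Replace box-index tags with names (for people) or class labels (otherwise).

    tokens: list whose items are either strings or lists of box indices.
    object_labels: class label of each detected box, e.g. ["person", "dog"].
    """
    out = []
    for tok in tokens:
        if isinstance(tok, list):
            words = []
            for idx in tok:
                if object_labels[idx] == "person":
                    # The name depends only on the box index, so the same
                    # box always receives the same name.
                    words.append(UNISEX_NAMES[idx % len(UNISEX_NAMES)])
                else:
                    words.append(object_labels[idx])
            out.append(" and ".join(words))
        else:
            out.append(tok)
    return out


question = ["Why", "is", [0], "pointing", "at", [1], "?"]
labels = ["person", "person", "dog"]
print(replace_person_tags(question, labels))
```

Under this assumption the mapping from name to box is fixed (box 0 is always "Casey"), which is the kind of regularity a model could pick up during finetuning without being given explicit alignment information.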