Hi,
I found that commonly used multi-modal pre-trained models, such as VL-BERT, UNITER, and ERNIE-ViL, process questions and answers the same way you propose in the paper: they substitute the person tags with unisex names. In addition, no alignment between the unisex names and the image regions is given to these models. Does this mean the models have to infer which region each name in the text refers to? That seems impossible, since the text contains few referring expressions that could be used for this. Is it possible that these models have instead learned from potential biases in VCR?
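
To make the concern concrete, here is a minimal sketch of the substitution as I understand it. The token format, the name list, and the function name `replace_person_tags` are my own assumptions for illustration, not code taken from any of these models. Note that after the substitution the region index is no longer present in the text.

```python
# Hypothetical name list and token format, for illustration only.
UNISEX_NAMES = ["Casey", "Riley", "Jessie", "Jackie", "Avery"]


def replace_person_tags(tokens, objects):
    """Replace region-index tags with plain words; the region index is dropped."""
    out = []
    for tok in tokens:
        if isinstance(tok, list):  # a tag such as [0] or [0, 2]
            words = []
            for idx in tok:
                if objects[idx] == "person":
                    # Assumed rule: pick a name by region index, wrapping around.
                    words.append(UNISEX_NAMES[idx % len(UNISEX_NAMES)])
                else:
                    # Non-person regions keep their category label.
                    words.append(objects[idx])
            out.append(" and ".join(words))
        else:
            out.append(tok)
    return out


question = ["Why", "is", [0], "looking", "at", [1], "?"]
objects = ["person", "person", "dog"]
print(" ".join(replace_person_tags(question, objects)))
```

In this sketch, nothing in the resulting text says that the first name corresponds to region 0, which is the missing alignment I am asking about.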
Thanks,
Ziwei