1) If you refer to the paper, we propose a self-supervised training strategy that does not require the "context_label" you see in the test file. The core objective during training is to improve the matching score between an image and its associated captions, i.e., to correctly ground the objects in the image with the captions. This grounding ultimately helps achieve the end goal of detecting out-of-context image use. This is explained in detail in Sections 4.2 and 4.3 of the main paper. For a quick walkthrough of the exact training and test pipelines, refer to the video posted here.
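As a rough illustration of the idea (not the paper's actual implementation; the function names, the max-over-objects scoring, and the max-margin form are all assumptions made here for clarity), a self-supervised image-caption matching objective can be sketched like this: score a caption against the detected object embeddings of an image, and train so that the matched caption outscores a randomly sampled caption, with no context label needed.

```python
import numpy as np

def grounding_score(object_embs, caption_emb):
    """Score how well a caption grounds in an image: max cosine
    similarity between the caption embedding and any detected
    object embedding. (Illustrative choice, not the paper's exact scorer.)"""
    obj = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    cap = caption_emb / np.linalg.norm(caption_emb)
    return float(np.max(obj @ cap))

def max_margin_loss(object_embs, matched_cap, random_cap, margin=0.2):
    """Self-supervised objective: the caption that actually accompanies
    the image should outscore a randomly sampled caption by a margin.
    No "context_label" is required, only (image, caption) pairs."""
    s_pos = grounding_score(object_embs, matched_cap)
    s_neg = grounding_score(object_embs, random_cap)
    return max(0.0, margin - s_pos + s_neg)

# Toy usage with random embeddings standing in for detector/text features.
rng = np.random.default_rng(0)
objects = rng.normal(size=(5, 8))                      # 5 detected objects, dim 8
matched = objects[0] + 0.01 * rng.normal(size=8)       # caption close to one object
unrelated = rng.normal(size=8)                         # randomly sampled caption
loss = max_margin_loss(objects, matched, unrelated)
```

Minimizing such a loss pushes the model to align image regions with their true captions, which is the grounding signal the detection stage then builds on.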
2) I assume you are referring to out-of-context image-caption pairs as "fake". We don't use the label "fake" because, in principle, the news caption could be true on its own; associating it with an unrelated image would still make the pair out-of-context. Regarding your comment "Some of the fake news have related picture and captions.": this is exactly what out-of-context use of images looks like. The news caption appears to be about the image it is linked with in the dataset, but in reality the image is evidence of a completely different event.
A README describing the model inputs and outputs will be added to this GitHub repo soon for your reference.
Btw, we will also be releasing source code for the original paper at this GitHub
repo in the coming week, in case you are interested.
MMSys2021 Cheapfakes Challenge Team