<<Apologies for cross-posting>>
Dear Community,
We are excited to announce the Dishcovery Mission II Challenge, part of the 3rd MetaFood Workshop at CVPR 2026.
The goal of this challenge is to develop a Vision-Language Model (VLM) that can accurately understand food images and match them to the correct textual descriptions. This is a demanding test of fine-grained visual recognition and multimodal alignment in one of the most visually diverse domains.
Challenge Highlights
- ~400,000 food image–caption pairs
- Realistic multimodal noise and fine-grained dish ambiguity
- Focus on efficient and scalable VLM architectures
- Global leaderboard visibility
Whether you're building the next CLIP, scaling SigLIP, or crafting your own multimodal beast, this is your chance to stress-test and showcase your models.
Top solutions will be invited to present at the MetaFood Workshop, which brings together researchers working on multimodal AI, food computing, and real-world visual understanding.
---------------------------------------------------------------
3rd MetaFood Workshop
Held in conjunction with CVPR 2026
June 3-4, Denver, Colorado, USA
----------------------------------------------------------------
Best regards,
Dishcovery Challenge Organizing Committee