Dear VOTSr organizers,
Could you please clarify whether the four referring texts provided for each video sequence are alternative descriptions of the same target, or whether they should be treated as independent queries that may refer to different targets or regions?
We noticed that some expressions can be interpreted in multiple ways, so this clarification would be very helpful for our inference design.
Thank you!