Criteria for evaluating system submission quality

Frenchy's fried chicken

Aug 14, 2023, 6:09:50 PM
to lm-kbc2023
Hi Organizers,

Just wanted to check on the following points:

1. How would you check whether data leakage has taken place in a submission?
2. How would you check whether a system achieved a high score by using Wikipedia infobox content (which already contains the answers for most of the relations) or Wikidata to generate the object texts?
3. Since LLM size plays a huge role in the quality of results, if two systems are otherwise identical but one uses a 70-billion-parameter variant of a model and the other a 7-billion-parameter variant of the same model, the 70B variant will obviously produce better results. This would mean that a participant with more compute resources naturally wins the challenge. To address this, will you also give weight to the system architecture and approach, and not just to the leaderboard score, when deciding the rankings?

Thanks

Simon Razniewski

Aug 15, 2023, 3:47:45 AM
to lm-kbc2023
Dear Frenchy's fried chicken,

1. How would you check whether data leakage has taken place in a submission?
2. How would you check whether a system achieved a high score by using Wikipedia infobox content (which already contains the answers for most of the relations) or Wikidata to generate the object texts?

Short answer: We will check the system reports and the code.
Long answer: In the era of LLMs, leakage is not an easily defined concept. In the old days of ML this was different: leakage mostly referred to overlap between train and test samples. Such simple cases could still happen if someone fine-tuned on Wikidata etc., and this is not allowed. With LLMs, however, there is the general pre-training step, which makes the issue quite ill-defined. In pre-training, all state-of-the-art LLMs use large web document collections, naturally including Wikipedia. So they will have seen relevant content, and most likely infoboxes too. We do not penalize this, for two reasons: (i) by extension, any content on the web would be problematic, so one would either have to define automated evaluation data that appears nowhere on the web (which is impossible) or employ post-hoc human annotation of system results (which, however, means that systems can be evaluated reasonably only a single time); (ii) the focus of the challenge is not to predict topics that have never been reported anywhere, but to see to which degree LLMs distill web content so as to enable KBC in the long tail, i.e., for topics that likely appear here and there in web documents, but not at high frequency.
Summary: We will (a) check on a case-by-case basis, (b) put trust in the authors' honest self-reporting of potentially critical issues, and (c) accept that leakage is not a binary yes/no concept, but that there may be gray cases (which do not invalidate the work and may make for interesting discussion points).
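
As an illustration of the simple train/test overlap case mentioned above, here is a minimal sketch (in Python) of the kind of check we have in mind. The file paths and the field names (SubjectEntity, Relation) are assumptions for illustration only; they would need to be adapted to whatever data a system was actually fine-tuned on.

# Minimal sketch (illustrative only): flag overlap between a system's
# fine-tuning data and the evaluation set. File paths and field names
# are assumptions, not a prescribed format.
import json

def load_pairs(path):
    """Return the set of (subject, relation) pairs from a JSON-lines file."""
    pairs = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            pairs.add((row["SubjectEntity"], row["Relation"]))
    return pairs

fine_tune_pairs = load_pairs("finetune_data.jsonl")  # data the system was fine-tuned on
test_pairs = load_pairs("test.jsonl")                # challenge evaluation data

overlap = fine_tune_pairs & test_pairs
print(f"{len(overlap)} of {len(test_pairs)} test pairs also appear in the fine-tuning data")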

3. Since LLM size plays a huge role in the quality of results, if two systems are otherwise identical but one uses a 70-billion-parameter variant of a model and the other a 7-billion-parameter variant of the same model, the 70B variant will obviously produce better results. This would mean that a participant with more compute resources naturally wins the challenge. To address this, will you also give weight to the system architecture and approach, and not just to the leaderboard score, when deciding the rankings?

To have a venue for participants who cannot or do not wish to take part in an overly resource-intensive competition, we have introduced Track 1, where only models with up to 1B parameters are allowed. The 1B-parameter track and the open track will have separate final leaderboards and winners (which we will announce shortly). If there are especially innovative non-winning solutions within the tracks, we will highlight those as well.
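
As a side note on the 1B limit: a minimal sketch of how one can count the parameters of a Hugging Face checkpoint to check Track 1 eligibility. The model name below (bert-large-cased) is only an example, not a recommendation.

# Minimal sketch: count parameters of a checkpoint and compare against the
# 1B-parameter limit of Track 1. Model name is an example only.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-large-cased")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters ->",
      "eligible for Track 1" if n_params <= 1e9 else "open track only")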

Cheers,
Simon