Discrepancy in metrics between local and leaderboard

Thanh-Dan Bui

Jun 21, 2026, 11:07:45 AM
to akbc2026-shared-task
Hi,

I am seeing a discrepancy in evaluation results. When I run evaluate.py locally, my All Relations Macro F1 score is 0.559, but when I submit the results to the leaderboard, it drops to 0.5509. Comparing the two tables more closely, I noticed clear differences in the metrics of a few specific relations:

1. hasCapacity: Locally, my Macro Precision only reaches 0.250, but on the leaderboard, it jumps to 0.3400.

2. companyTradesAtStockExchange: The metrics online are significantly lower than local; for example, Macro F1 decreases from 0.713 (local) to 0.6926 (leaderboard).

3. hasArea: All metrics (Precision, Recall, F1) locally are consistently higher than the leaderboard by about 0.02.

In contrast, the personHasCityOfDeath relation has exactly the same results in both places. I am not sure about the root cause: is it a slight difference in the leaderboard dataset, or a mismatch in how the two sides handle empty predictions when computing the Macro Average? I look forward to your feedback.
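To make the second hypothesis concrete, here is a minimal sketch of how the convention for empty predictions alone can shift Macro Precision. This is not the actual evaluate.py or leaderboard code: the function names, the `empty_pred_value` parameter, and the toy data are all hypothetical and only illustrate the effect.

```python
from typing import List, Set


def precision(pred: Set[str], gold: Set[str], empty_pred_value: float) -> float:
    """Precision for a single subject.

    empty_pred_value is a hypothetical knob: the score assigned when the
    system predicts nothing. Evaluators differ on this convention.
    """
    if not pred:
        return empty_pred_value
    return len(pred & gold) / len(pred)


def macro_precision(
    preds: List[List[str]], golds: List[List[str]], empty_pred_value: float
) -> float:
    """Unweighted mean of per-subject precision."""
    scores = [
        precision(set(p), set(g), empty_pred_value) for p, g in zip(preds, golds)
    ]
    return sum(scores) / len(scores)


# Toy data (made up): four subjects, two of them with empty predictions.
preds = [["a"], [], ["b", "c"], []]
golds = [["a"], ["x"], ["b"], []]

# Same predictions, two conventions for empty predictions.
strict = macro_precision(preds, golds, empty_pred_value=0.0)
lenient = macro_precision(preds, golds, empty_pred_value=1.0)
print(f"empty -> 0.0: {strict:.3f}")
print(f"empty -> 1.0: {lenient:.3f}")
```

If the two pipelines differed in this way, relations with many empty predictions would show a gap while relations with few or none would match exactly, which would be consistent with personHasCityOfDeath agreeing in both places.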

Thank you!

leaderboard.png
local.png

BH Zhang

Jun 22, 2026, 2:21:31 AM
to akbc2026-shared-task
Hi Thanh-Dan,

Thanks for raising this. We have checked the leaderboard, and the problem has been resolved. It should now produce the same results as the local evaluation pipeline. Please note that old submissions may need to be rerun manually. 

Please let us know if you find any further issues.

Best,
The AKBC 2026 Shared Task Organizers
