Discrepancy in metrics between local and leaderboard

Thanh-Dan Bui

Jun 21, 2026, 11:07:45 AM

to akbc2026-shared-task

Hi,

I am seeing a discrepancy between my local evaluation results and the leaderboard. When I run evaluate.py locally, my All Relations Macro F1 score is 0.559, but after submitting the results to the leaderboard, this metric drops to 0.5509. Comparing the two tables more closely, I noticed clear differences in a few per-relation metrics:

1. hasCapacity: Locally, my Macro Precision only reaches 0.250, but on the leaderboard, it jumps to 0.3400.

2. companyTradesAtStockExchange: The leaderboard metrics are noticeably lower than the local ones; for example, Macro F1 decreases from 0.713 (local) to 0.6926 (leaderboard).

3. hasArea: Locally, all metrics (Precision, Recall, F1) are about 0.02 higher than on the leaderboard.

In contrast, the personHasCityOfDeath relation has exactly the same results in both places. I am still not sure about the root cause: is it a slight difference in the leaderboard dataset, or a mismatch in how the two sides handle empty predictions in the Macro Average calculation code? I look forward to your feedback.
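To illustrate the second hypothesis, here is a minimal sketch of how the treatment of empty predictions alone can shift a macro-averaged precision. It is not taken from evaluate.py or from the leaderboard code: the function names, the `empty_pred_score` parameter, and the toy rows are all hypothetical and only show the mechanism.

```python
# Hypothetical sketch: none of these names come from evaluate.py.
# It shows how one convention choice (the score assigned to an empty
# prediction list) changes a macro-averaged precision on identical data.

def precision(pred, gold, empty_pred_score):
    """Precision for one subject; empty predictions get a fixed score."""
    pred, gold = set(pred), set(gold)
    if not pred:
        return empty_pred_score
    return len(pred & gold) / len(pred)


def macro_precision(rows, empty_pred_score):
    """Unweighted mean of the per-subject precision values."""
    scores = [precision(p, g, empty_pred_score) for p, g in rows]
    return sum(scores) / len(scores)


# Toy (prediction, gold) pairs, invented for illustration only.
rows = [
    (["NYSE"], ["NYSE"]),             # fully correct
    (["NASDAQ", "LSE"], ["NASDAQ"]),  # one of two predictions correct
    ([], ["TSE"]),                    # empty prediction, non-empty gold
    ([], []),                         # empty prediction, empty gold
]

lenient = macro_precision(rows, empty_pred_score=1.0)
strict = macro_precision(rows, empty_pred_score=0.0)

print(f"empty prediction scored as 1.0: {lenient:.3f}")
print(f"empty prediction scored as 0.0: {strict:.3f}")
```

If the two pipelines differed in this way, relations with many empty predictions would show the largest gaps, while a relation with no empty predictions would score identically in both places.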

Thank you!

leaderboard.png

local.png

BH Zhang

Jun 22, 2026, 2:21:31 AM

to akbc2026-shared-task

Hi Thanh-Dan,

Thanks for raising this. We have checked the leaderboard and resolved the problem. It should now produce the same results as the local evaluation pipeline. Please note that old submissions may need to be rerun manually.

Please let us know if you find any further issues.

Best,
The AKBC 2026 Shared Task Organizers
