Hi,
I am seeing a discrepancy between my local and leaderboard evaluation results. When I run evaluate.py locally, my All Relations Macro F1 score is 0.559, but after submitting the results to the leaderboard, it drops to 0.5509. Comparing the two result tables more closely, I noticed clear differences for a few specific relations:
1. hasCapacity: Locally, my Macro Precision only reaches 0.250, but on the leaderboard, it jumps to 0.3400.
2. companyTradesAtStockExchange: The leaderboard metrics are lower than the local ones; for example, Macro F1 decreases from 0.713 (local) to 0.6926 (leaderboard).
3. hasArea: All metrics (Precision, Recall, F1) are consistently about 0.02 higher locally than on the leaderboard.
In contrast, the personHasCityOfDeath relation has exactly the same results in both places. I am not sure about the root cause: is it a slight difference in the leaderboard dataset, or a mismatch in how the local script and the leaderboard handle empty predictions when computing the macro average? I look forward to your feedback.
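To make the second hypothesis concrete, here is a minimal sketch of how the treatment of empty predictions alone could shift a macro-averaged precision. This is not the actual logic of evaluate.py or of the leaderboard; the function `macro_precision`, the parameter `empty_pred_precision`, and the toy data are all hypothetical and only illustrate the kind of mismatch I have in mind.

```python
from typing import List, Set, Tuple


def macro_precision(
    rows: List[Tuple[Set[str], Set[str]]],
    empty_pred_precision: float,
) -> float:
    """Average per-subject precision over (predicted, gold) pairs.

    empty_pred_precision is the (assumed) score assigned to a subject
    when the prediction set is empty, which is where two
    implementations could plausibly disagree.
    """
    scores = []
    for pred, gold in rows:
        if not pred:
            # Hypothetical convention: precision is undefined here,
            # so the evaluator has to pick a value.
            scores.append(empty_pred_precision)
        else:
            scores.append(len(pred & gold) / len(pred))
    return sum(scores) / len(scores)


# Toy data (invented for illustration): two of four subjects have no prediction.
rows = [
    ({"a"}, {"a"}),
    (set(), {"b"}),
    ({"c", "d"}, {"c"}),
    (set(), {"e"}),
]

lenient = macro_precision(rows, empty_pred_precision=1.0)
strict = macro_precision(rows, empty_pred_precision=0.0)
print(f"empty prediction counted as 1.0: {lenient:.3f}")
print(f"empty prediction counted as 0.0: {strict:.3f}")
```

With identical predictions, the two conventions give different macro precision, so a relation with many empty predictions would move between the two evaluations while a relation with few or none would stay the same.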
Thank you!