Here are the results of the Eval Trial phase.
The ratings were determined from the annotation page, and we also added a tie for every pair of systems that produced the same output (for the same input). For example, this occurred frequently across different submissions from the same participant.
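The tie rule above can be sketched as follows. This is a minimal illustration, not the organizers' actual code: the function name `tie_pairs`, the input layout, and the use of exact string equality to decide that two outputs are "the same" are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def tie_pairs(outputs):
    """Return one (input_id, system_a, system_b) tie per pair of systems
    that produced the same output for the same input.

    `outputs` is a hypothetical dict mapping (system, input_id) -> output text.
    "Same output" is assumed here to mean exact string equality.
    """
    # Group systems by (input, output text).
    by_output = defaultdict(list)
    for (system, input_id), text in outputs.items():
        by_output[(input_id, text)].append(system)

    # Every pair inside a group gets one tie.
    ties = []
    for (input_id, _), systems in by_output.items():
        for a, b in combinations(sorted(systems), 2):
            ties.append((input_id, a, b))
    return sorted(ties)
```

With three systems where `a` and `b` agree on input 1 and `a` and `c` agree on input 2, the sketch yields two ties, one per agreeing pair.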
The results were computed as follows. For each participant/system, we took the submission with the highest rating and ignored the rest. We then determined the rank based on how many other participants are significantly better, as we have been doing on the Leaderboard web page. For example, a system ranked 6th means that there are five other systems whose ratings are significantly better (i.e., this system's 95% confidence interval lies entirely below each of the other five systems' confidence intervals). Note that this can cause the rating order not to correspond directly with the rank order, though in these cases the difference in ratings should be slight.
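The ranking rule can be illustrated with a short sketch. It is one reading of the description above, not the Leaderboard's actual implementation: the function name `rank_systems` is hypothetical, and "significantly better" is assumed to mean that the other system's lower confidence bound is strictly above this system's upper bound. The example data are the Subtask A Spanish figures from this post.

```python
def rank_systems(intervals):
    """Rank = 1 + number of other systems that are significantly better.

    `intervals` maps system name -> (low, high) bounds of its 95% CI.
    Assumption: another system is significantly better when its lower
    bound is strictly greater than this system's upper bound.
    """
    ranks = {}
    for name, (_, high) in intervals.items():
        better = sum(
            1
            for other, (other_low, _) in intervals.items()
            if other != name and other_low > high
        )
        ranks[name] = 1 + better
    return ranks


# Confidence intervals from the Subtask A Spanish table.
spanish = {
    "baseline": (1238, 1466),
    "zhangxulong": (1131, 1369),
    "yxmmmm": (957, 1191),
    "marynj1995": (859, 1101),
    "jpsnitbihta": (641, 931),
    "j10official": (-1445, 455),
}

print(rank_systems(spanish))
# {'baseline': 1, 'zhangxulong': 1, 'yxmmmm': 2, 'marynj1995': 3,
#  'jpsnitbihta': 4, 'j10official': 6}
```

Under this reading, two systems with overlapping intervals can share a rank, and rank values can be skipped, which matches the tables in this post.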
Subtask A in English:
Rank  System          Rating (95% CI)
1     insa_abbas      1282 [1228, 1356]
2     baseline        1096 [1054, 1154]
2     eshitaguptaa    1086 [1040, 1140]
2     lmfaoooo        1081 [1051, 1122]
2     dangnt          1067 [1037, 1093]
2     tanlocn         1063 [1031, 1087]
2     wangkongqiang   1053 [1009, 1103]
2     zhangxulong     1040 [1013, 1068]
4     jjuliar         1021 [ 991, 1050]
2     bebra           1020 [ 973, 1063]
2     begumyivli      1020 [ 958, 1068]
3     walisa_alam     1016 [ 979, 1053]
7     yxmmmm           981 [ 942, 1026]
7     luttt            981 [ 932, 1020]
7     dharika          976 [ 927, 1027]
9     jpsnitbihta      976 [ 936, 1009]
7     junyao_chen      974 [ 925, 1024]
9     marynj1995       972 [ 938, 1008]
8     j10official      960 [ 910, 1009]
7     jychen630        959 [ 917, 1015]
13    zlatamoore       908 [ 859,  952]
14    andrey300902     907 [ 855,  940]
13    amazonka         903 [ 855,  942]
19    hemeshkumar_31   877 [ 828,  921]
21    liquid_horse     869 [ 841,  901]
Subtask A in Spanish:
Rank  System          Rating (95% CI)
1     baseline        1292 [ 1238, 1466]
1     zhangxulong     1169 [ 1131, 1369]
2     yxmmmm          1000 [  957, 1191]
3     marynj1995       912 [  859, 1101]
4     jpsnitbihta      744 [  641,  931]
6     j10official      340 [-1445,  455]
Subtask A in Chinese:
Rank  System          Rating (95% CI)
1     deepgpt         1250 [1138, 1449]
1     baseline        1193 [1044, 1415]
1     zhangxulong     1178 [1054, 1322]
1     hugang          1148 [1031, 1322]
1     wangkongqiang   1111 [ 997, 1243]
1     junyao_chen      980 [ 856, 1140]
2     yxmmmm           956 [ 852, 1071]
2     jpsnitbihta      909 [ 739, 1061]
6     j10official      781 [ 568,  906]
6     shenwutao        700 [ 543,  865]
9     marynj1995       604 [ 378,  731]
Subtask B1:
Rank  System          Rating (95% CI)
1     baseline        1287 [1218, 1398]
1     eshitaguptaa    1210 [1140, 1326]
1     jpsnitbihta     1128 [1023, 1283]
2     yxmmmm          1111 [1044, 1204]
2     wangkongqiang   1093 [1037, 1177]
2     warda_yousaf    1015 [ 922, 1152]
7     j10official      565 [ 310,  674]
7     poojashree_81    498 [ 299,  561]
Subtask B2:
Rank  System          Rating (95% CI)
1     baseline        1277 [ 1172, 1558]
1     eshitaguptaa    1109 [ 1002, 1369]
1     wangkongqiang   1063 [ 1018, 1337]
1     yxmmmm           995 [  902, 1260]
2     jpsnitbihta      892 [  783, 1108]
6     j10official      540 [-1462,  688]
We also attach the results per submission.