Results of the Eval Trial phase

semeval-2026-task-1-humor-gen

Dec 30, 2025, 10:56:47 AM
to semeval-2026-task-1-humor-gen
Here are the results of the Eval Trial phase.

The ratings were determined from the annotation page, and we also added a tie for every pair of systems that produced the same output (for the same input). For example, this occurred frequently across different submissions from the same participant.

The results were computed as follows. For each participant/system, we took the submission with the highest rating and ignored the rest. We determined the rank based on how many other participants are significantly better, as we have been doing on the Leaderboard web page. For example, a system ranked 6th means that there are five other systems whose ratings are significantly better (i.e., this system's 95% confidence interval lies entirely below each of the other five systems' confidence intervals). Note that this can cause the rating order not to correspond directly with the rank order, though in these cases the difference in ratings should be slight.
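
The ranking rule described above can be illustrated with a minimal Python sketch. It is not the organizers' actual code: the function name `rank_by_ci`, the input layout, and the use of a strict comparison between interval bounds are assumptions made for illustration. The example data are the Subtask A Spanish figures from this post.

```python
def rank_by_ci(systems):
    """Rank systems by counting how many others are significantly better.

    `systems` maps a system name to (rating, ci_low, ci_high).
    Assumption: system X is "significantly better" than system Y when
    X's lower CI bound is strictly above Y's upper CI bound.
    """
    ranks = {}
    for name, (_, _, ci_high) in systems.items():
        significantly_better = sum(
            1
            for other, (_, other_low, _) in systems.items()
            if other != name and other_low > ci_high
        )
        # Rank is one more than the number of significantly better systems.
        ranks[name] = 1 + significantly_better
    return ranks


# Subtask A in Spanish: (rating, CI lower bound, CI upper bound)
spanish = {
    "baseline": (1292, 1238, 1466),
    "zhangxulong": (1169, 1131, 1369),
    "yxmmmm": (1000, 957, 1191),
    "marynj1995": (912, 859, 1101),
    "jpsnitbihta": (744, 641, 931),
    "j10official": (340, -1445, 455),
}

spanish_ranks = rank_by_ci(spanish)
for name, rank in spanish_ranks.items():
    print(f"{rank:>4} {name}")
```

On the Spanish figures this gives ranks 1, 1, 2, 3, 4, 6, matching the table below. Because the rank depends only on interval bounds, a system with a lower rating but a wider interval can receive a better rank than a system with a slightly higher rating.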

Subtask A in English:

Rank         System   Rating (95% CI)
   1     insa_abbas 1282 [1228, 1356]
   2       baseline 1096 [1054, 1154]
   2   eshitaguptaa 1086 [1040, 1140]
   2       lmfaoooo 1081 [1051, 1122]
   2         dangnt 1067 [1037, 1093]
   2        tanlocn 1063 [1031, 1087]
   2  wangkongqiang 1053 [1009, 1103]
   2    zhangxulong 1040 [1013, 1068]
   4        jjuliar 1021 [ 991, 1050]
   2          bebra 1020 [ 973, 1063]
   2     begumyivli 1020 [ 958, 1068]
   3    walisa_alam 1016 [ 979, 1053]
   7         yxmmmm  981 [ 942, 1026]
   7          luttt  981 [ 932, 1020]
   7        dharika  976 [ 927, 1027]
   9    jpsnitbihta  976 [ 936, 1009]
   7    junyao_chen  974 [ 925, 1024]
   9     marynj1995  972 [ 938, 1008]
   8    j10official  960 [ 910, 1009]
   7      jychen630  959 [ 917, 1015]
  13     zlatamoore  908 [ 859,  952]
  14   andrey300902  907 [ 855,  940]
  13       amazonka  903 [ 855,  942]
  19 hemeshkumar_31  877 [ 828,  921]
  21   liquid_horse  869 [ 841,  901]


Subtask A in Spanish:

Rank      System    Rating (95% CI)
   1    baseline 1292 [ 1238, 1466]
   1 zhangxulong 1169 [ 1131, 1369]
   2      yxmmmm 1000 [  957, 1191]
   3  marynj1995  912 [  859, 1101]
   4 jpsnitbihta  744 [  641,  931]
   6 j10official  340 [-1445,  455]


Subtask A in Chinese:

Rank        System   Rating (95% CI)
   1       deepgpt 1250 [1138, 1449]
   1      baseline 1193 [1044, 1415]
   1   zhangxulong 1178 [1054, 1322]
   1        hugang 1148 [1031, 1322]
   1 wangkongqiang 1111 [ 997, 1243]
   1   junyao_chen  980 [ 856, 1140]
   2        yxmmmm  956 [ 852, 1071]
   2   jpsnitbihta  909 [ 739, 1061]
   6   j10official  781 [ 568,  906]
   6     shenwutao  700 [ 543,  865]
   9    marynj1995  604 [ 378,  731]


Subtask B1:

Rank        System   Rating (95% CI)
   1      baseline 1287 [1218, 1398]
   1  eshitaguptaa 1210 [1140, 1326]
   1   jpsnitbihta 1128 [1023, 1283]
   2        yxmmmm 1111 [1044, 1204]
   2 wangkongqiang 1093 [1037, 1177]
   2  warda_yousaf 1015 [ 922, 1152]
   7   j10official  565 [ 310,  674]
   7 poojashree_81  498 [ 299,  561]


Subtask B2:

Rank        System    Rating (95% CI)
   1      baseline 1277 [ 1172, 1558]
   1  eshitaguptaa 1109 [ 1002, 1369]
   1 wangkongqiang 1063 [ 1018, 1337]
   1        yxmmmm  995 [  902, 1260]
   2   jpsnitbihta  892 [  783, 1108]
   6   j10official  540 [-1462,  688]


We also attach the results per submission.
submission-results-a-zh.txt
submission-results-b1.txt
submission-results-a-en.txt
submission-results-a-es.txt
submission-results-b2.txt