Also, participants in the shared evaluation task might be interested
in the Z-MERT software developed by Omar Zaidan (one of the students
in my lab at Hopkins):
http://www.cs.jhu.edu/~ozaidan/zmert/
It performs minimum error rate training for SMT systems, and has a
modular design that allows you to easily use your own automatic
evaluation metric as the objective function. Omar even put together a
YouTube tutorial on how to integrate your own metric:
http://www.youtube.com/watch?v=Yr56pD8bTUc&fmt=18
I'm looking forward to seeing everyone in Athens.
All the best,
Chris Callison-Burch
I define consistency over every pair of sentences that was assessed in
the manual evaluation. A metric is consistent if the human
judgment was A > B and the metric score was A > B, or if the human
judgment was A < B and the metric score was A < B. I ignore
situations where the human judgment was A = B, and I was careful about
reversing the polarity of scores for error metrics like TER.
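To make the procedure concrete, here is a minimal sketch (not the official scoring code) of how that pairwise consistency could be computed. The data layout, function name, and arguments are hypothetical: human judgments are assumed to arrive as (A, B, relation) triples and metric scores as a simple lookup, with a flag to flip polarity for error metrics like TER where lower is better.

    def consistency(human_pairs, metric_scores, error_metric=False):
        """human_pairs: list of (item_a, item_b, relation) triples,
        where relation is '>', '<', or '=' for A compared to B.
        metric_scores: dict mapping an item to its metric score.
        error_metric: True for metrics like TER, where lower is better."""
        sign = -1.0 if error_metric else 1.0  # reverse polarity for error metrics
        consistent = 0
        total = 0
        for a, b, relation in human_pairs:
            if relation == '=':
                continue  # ties in the human judgment are ignored
            score_a = sign * metric_scores[a]
            score_b = sign * metric_scores[b]
            total += 1
            if (relation == '>' and score_a > score_b) or \
               (relation == '<' and score_a < score_b):
                consistent += 1
        return consistent / total if total else 0.0

The returned fraction is what would be reported as the percentage consistent with the human judgments, under these assumptions about the input format.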
Tables 10 and 11 show the percentage of the time that each metric was
consistent with the human judgments, for all of the language pairs.
Since we're only assessing whether A > B or B > A, the random-choice
baseline is 50%. The tables show that the task is very difficult,
because many of the metrics underperform the chance baseline.