Differences when generating M2 for same text

Yoav Katz

Feb 25, 2019, 9:18:58 AM

to BEA 2019 Shared Task: Grammatical Error Correction

Hi,

As part of our data preparation, we regenerated the gold (corrected) data of the wi+locness dataset in txt format. We then ran m2 generation on these txt files.
We received m2 files that share only 85% of their annotations with the original m2 files.
Unless we are missing something, it seems the assessment process is sensitive to the m2 format, and there could be a 15% gap between a system's actual performance and its assessment.

Below are some details on the results and process we performed. Can you please review this and let us know if we did anything wrong in the process?

Thanks.

Yoav Katz
IBM Research

Our findings:

There were 170 cases (2% of the annotations) where the original file contains UNK annotations that seem superfluous, as they do not change anything in the text. For example:

A.dev.gold.bea19.m2

S It is true that going abroad can open new point of views about your own learning process .
A 7 8|||UNK|||open|||REQUIRED|||-NONE-|||0

B.dev.gold.bea19.m2

S For example , it is not suitable for children due to official Language . there were also a huge number of commercials advertisements which make the readers bored .
A 11 12|||UNK|||official|||REQUIRED|||-NONE-|||0

S Moreover , I think before any kind of sports that we should do some exercises that is because we need to relax our bodies first . If we do not do it that we will get dangerous while we are doing any sports .
A 35 37|||UNK|||get dangerous|||REQUIRED|||-NONE-|||0
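
For anyone who wants to reproduce the count, here is a minimal sketch of how such cases could be detected. It assumes the usual M2 layout seen in the examples above ('S' line, then 'A' lines with fields separated by `|||`); the function names are made up for illustration and are not part of errant.

```python
def parse_edit(line):
    """Split an M2 'A' line into (start, end, error_type, correction)."""
    fields = line[2:].split("|||")
    start, end = map(int, fields[0].split())
    return start, end, fields[1], fields[2]


def noop_unk_edits(sentence_line, edit_lines):
    """Return the UNK edits whose correction equals the original token span."""
    tokens = sentence_line[2:].split()
    found = []
    for line in edit_lines:
        start, end, etype, cor = parse_edit(line)
        if etype == "UNK" and " ".join(tokens[start:end]) == cor:
            found.append((start, end, cor))
    return found


# First example from this post.
sent = ("S It is true that going abroad can open new point of views "
        "about your own learning process .")
edits = ["A 7 8|||UNK|||open|||REQUIRED|||-NONE-|||0"]
print(noop_unk_edits(sent, edits))
```

Running this over every sentence block in a file and summing the lengths of the returned lists would give a count comparable to the one quoted above, under the stated assumptions.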

There are also many cases where the same text is broken differently.

A.dev.gold.bea19.m2

S I did not have to go to NewZealand but believe me it is very beautiful place .
A 1 6|||R:OTHER|||have not been|||REQUIRED|||-NONE-|||0
A 7 8|||R:ORTH|||New Zealand|||REQUIRED|||-NONE-|||0
A 9 9|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 11 11|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 13 13|||M:DET|||a|||REQUIRED|||-NONE-|||0

new.m2 (regenerated)

S I did not have to go to NewZealand but believe me it is very beautiful place .
A 1 2|||U:VERB:TENSE||||||REQUIRED|||-NONE-|||0
A 2 4|||R:WO|||have not|||REQUIRED|||-NONE-|||0
A 4 6|||R:VERB|||been|||REQUIRED|||-NONE-|||0
A 7 8|||R:ORTH|||New Zealand|||REQUIRED|||-NONE-|||0
A 9 9|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 11 11|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0
A 13 13|||M:DET|||a|||REQUIRED|||-NONE-|||0

(In some examples we saw the opposite, where the original m2 broke a correction into multiple annotations while the new m2 kept it as a single correction.)
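
A rough sketch of how the share of common annotations between two m2 files could be measured. This is not how compare_m2 computes its scores; it simply matches edits on sentence index, token span and correction string, ignores the error type, and assumes both files list the same sentences in the same order. `read_edits` and `shared_fraction` are illustrative names.

```python
def read_edits(m2_text):
    """Collect (sentence_index, start, end, correction) for every 'A' line."""
    edits = set()
    sent_id = -1
    for line in m2_text.splitlines():
        if line.startswith("S "):
            sent_id += 1
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            edits.add((sent_id, start, end, fields[2]))
    return edits


def shared_fraction(ref_m2, new_m2):
    """Fraction of reference edits that also appear in the regenerated file."""
    ref, new = read_edits(ref_m2), read_edits(new_m2)
    return len(ref & new) / len(ref) if ref else 1.0


SENT = "S I did not have to go to NewZealand but believe me it is very beautiful place .\n"
TAIL = (
    "A 7 8|||R:ORTH|||New Zealand|||REQUIRED|||-NONE-|||0\n"
    "A 9 9|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0\n"
    "A 11 11|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0\n"
    "A 13 13|||M:DET|||a|||REQUIRED|||-NONE-|||0\n"
)
ORIGINAL = SENT + "A 1 6|||R:OTHER|||have not been|||REQUIRED|||-NONE-|||0\n" + TAIL
REGENERATED = (
    SENT
    + "A 1 2|||U:VERB:TENSE||||||REQUIRED|||-NONE-|||0\n"
    + "A 2 4|||R:WO|||have not|||REQUIRED|||-NONE-|||0\n"
    + "A 4 6|||R:VERB|||been|||REQUIRED|||-NONE-|||0\n"
    + TAIL
)
print(shared_fraction(ORIGINAL, REGENERATED))
```

On the NewZealand sentence above, four of the five original edits survive regeneration; only the single `1 6` edit is replaced by three smaller ones.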

The process we used:

1. Cloned a copy of errant (https://github.com/chrisjbryant/errant.git)
2. Took the original A.dev.gold.bea19.m2 file and extracted all the sentences in it (lines beginning with an 'S'), creating orig.txt
3. Created the corrected sentences by applying the changes described in the original m2, creating fixed.txt
4. Ran

parallel_to_m2 -orig orig.txt -cor fixed.txt -out new.m2

5. Ran

compare_m2 -hyp new.m2  -ref dev.gold.bea19.m2
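
The step that applies the m2 changes to build fixed.txt can be sketched roughly as below. This is an illustration, not the script we actually ran. It assumes the edits for a sentence are listed in ascending, non-overlapping span order, that edits with a negative start offset are "no edit" markers to be skipped, and that only annotator 0 is wanted; `apply_edits` is a made-up name.

```python
def apply_edits(sentence_line, edit_lines, annotator="0"):
    """Apply one annotator's M2 edits to an 'S' line and return the corrected text."""
    tokens = sentence_line[2:].split()
    offset = 0  # how far token indices have shifted after earlier edits
    for line in edit_lines:
        fields = line.rstrip()[2:].split("|||")
        start, end = map(int, fields[0].split())
        if fields[-1] != annotator or start < 0:
            continue  # other annotator, or assumed "no edit" marker
        cor_tokens = fields[2].split()  # empty list for a deletion
        tokens[start + offset:end + offset] = cor_tokens
        offset += len(cor_tokens) - (end - start)
    return " ".join(tokens)


sent = "S I did not have to go to NewZealand but believe me it is very beautiful place ."
original_edits = [
    "A 1 6|||R:OTHER|||have not been|||REQUIRED|||-NONE-|||0",
    "A 7 8|||R:ORTH|||New Zealand|||REQUIRED|||-NONE-|||0",
    "A 9 9|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0",
    "A 11 11|||M:PUNCT|||,|||REQUIRED|||-NONE-|||0",
    "A 13 13|||M:DET|||a|||REQUIRED|||-NONE-|||0",
]
print(apply_edits(sent, original_edits))
```

Applying the regenerated edits for the same sentence yields the same corrected string, which is why the two m2 files describe identical text even though their annotations differ.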

BEA 2019 Shared Task Organisers

Feb 25, 2019, 2:50:46 PM

to BEA 2019 Shared Task: Grammatical Error Correction

Hey,

You're right that there is a little bit of sensitivity here.

Regarding all the UNK annotations, UNK stands for Unknown, which means annotators identified errors in the text but were unable to correct them. Consequently, the "correction" is just a repetition of the original string to make it clear what the erroneous uncorrected word is. This is also why these errors disappear if you run parallel_to_m2.py on them.

These errors are kept mainly for the purposes of error detection; a system should be rewarded if it identifies the token as an error, even if there is no known correction. All UNK errors will be excluded from the evaluation on error correction.
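
To make the distinction concrete, here is a small sketch of how one could build separate reference sets for detection and correction from the 'A' lines of a sentence. It is only an illustration of the idea described above, not the official scorer: detection keeps every flagged span including UNK, while correction drops UNK edits. The function name and the choice to match detection on spans alone are assumptions.

```python
def detection_and_correction_sets(edit_lines):
    """Split M2 'A' lines into a detection set (spans) and a correction set."""
    detection, correction = set(), set()
    for line in edit_lines:
        fields = line[2:].split("|||")
        start, end = map(int, fields[0].split())
        if start < 0:
            continue  # assumed "no edit" marker
        detection.add((start, end))  # UNK spans still count as detected errors
        if fields[1] != "UNK":
            correction.add((start, end, fields[2]))
    return detection, correction


# Edits pooled from different sentences in this thread, only to exercise the function.
edits = [
    "A 35 37|||UNK|||get dangerous|||REQUIRED|||-NONE-|||0",
    "A 7 8|||R:ORTH|||New Zealand|||REQUIRED|||-NONE-|||0",
]
det, cor = detection_and_correction_sets(edits)
print(sorted(det))
print(sorted(cor))
```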

As for cases where the edits are broken differently, this largely depends on which version/model of spacy you use. Since the automatic annotation relies on POS tags and other information automatically obtained from spacy, different models will produce slightly different results. We used spacy 1.9.0 with the en_core_web_sm-1.2.0 model to generate the official data, and will be using the same setup to evaluate all system output.
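
If you want to check whether your local setup matches, something like the following could work. It is a sketch that assumes Python 3.8+ (for `importlib.metadata`) and assumes the model is installed as a pip package under the name `en_core_web_sm`; whether spacy 1.9.0 installs on your interpreter is a separate question. `check_pins` is a made-up helper.

```python
from importlib import metadata

# Versions quoted in this post.
PINS = {"spacy": "1.9.0", "en_core_web_sm": "1.2.0"}


def check_pins(pins):
    """Map each package to 'ok', 'missing', or the mismatching installed version."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == wanted else installed
    return report


print(check_pins(PINS))
```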

You can still use a different version to develop your system, however, and you may get slightly different numbers when you evaluate it yourself, but the official output will all be ranked using the same aforementioned spacy version/model.

Hopefully that all makes sense!

Chris

BEA 2019 Shared Task Organisers

Feb 25, 2019, 4:56:25 PM

to BEA 2019 Shared Task: Grammatical Error Correction

Reading your post again more closely, I realised you were actually asking about gold vs auto references. Although the difference between spacy versions/models will still have an effect on the scores, it won't be quite so dramatic. Hopefully the following is more informative:

Since we had gold standard annotations for most datasets, we were generally able to release gold standard M2 files for the shared task. The exception is Lang-8, because no gold standard annotations exist and so the edits had to be extracted automatically. This does mean, however, that it is also possible to generate automatic annotations from parallel sentences for all the other datasets too, as you did in your experiment.

When it comes to system output, however, automatic annotations are the only choice, as it is too expensive to annotate everything manually. In terms of evaluation, this means we can compare the automatic hypothesis edits against either the gold reference edits or the automatic reference edits. We did an experiment along those lines in the original ERRANT paper (section 4.1) and found no statistically significant difference between the results.
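
As a rough illustration of what comparing hypothesis edits against reference edits means, here is a sketch that scores two edit sets by exact match. The default beta of 0.5 and the handling of empty sets are assumptions, and this is not the compare_m2 code itself. The example treats the regenerated edits for the NewZealand sentence as the hypothesis and the original edits as the reference.

```python
def prf(hyp_edits, ref_edits, beta=0.5):
    """Precision, recall and F-beta over exact-match edit sets."""
    tp = len(hyp_edits & ref_edits)
    fp = len(hyp_edits - ref_edits)
    fn = len(ref_edits - hyp_edits)
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 1.0
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, f


# Edits as (start, end, correction) tuples for a single sentence.
shared = {(7, 8, "New Zealand"), (9, 9, ","), (11, 11, ","), (13, 13, "a")}
ref = shared | {(1, 6, "have not been")}
hyp = shared | {(1, 2, ""), (2, 4, "have not"), (4, 6, "been")}

p, r, f = prf(hyp, ref)
print(f"P={p:.3f} R={r:.3f} F0.5={f:.3f}")
```

Here the differently split edit costs one false negative and three false positives even though the corrected text is identical, which is the kind of sensitivity raised at the start of the thread.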
