Tokenization issues

Abhijeet Awasthi

Mar 20, 2019, 10:02:55 AM

to BEA 2019 Shared Task: Grammatical Error Correction

Consider the following line in the training dataset:

Firstly , travelling by bus or in other public transport,(as underground ) , help us to decrease the emissions of gas , which causes pollution and global warming .

It seems that spacy is unable to tokenize this line well.

If a system outputs "transport , ( as underground )" instead of "transport,(as underground )", will it be penalized for producing the correct tokenization?

BEA 2019 Shared Task Organisers

Mar 20, 2019, 10:22:56 AM

to BEA 2019 Shared Task: Grammatical Error Correction

Yes. That's one of the reasons why we specified which version of spacy and which model we used for tokenisation. If you use a different tokeniser, you're likely to get slightly different results and be penalised. This should be very rare, however.

It's also worth noting that this is not really a spacy problem, but rather something that will always happen if you use two different tokenisers to process the same data.
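
To illustrate the point about two tokenisers disagreeing, here is a minimal sketch using only the Python standard library. Neither function is spacy or the tokeniser used for the shared task; `whitespace_tokenise` and `punct_tokenise` are hypothetical stand-ins chosen just to show the mismatch on the fragment quoted above.

```python
import re

# Fragment of the training line quoted earlier in the thread.
text = "travelling by bus or in other public transport,(as underground ) , help us"

def whitespace_tokenise(s):
    # Hypothetical tokeniser 1: trust the existing spaces.
    return s.split()

def punct_tokenise(s):
    # Hypothetical tokeniser 2: every punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", s)

tokens_a = whitespace_tokenise(text)
tokens_b = punct_tokenise(text)

print(len(tokens_a), tokens_a[7])     # one fused token at position 7
print(len(tokens_b), tokens_b[7:11])  # the same span split into four tokens
```

The two outputs differ only around the fused token, which is why a token-based scorer sees a difference there even though the words are unchanged.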

Abhijeet Awasthi

Mar 20, 2019, 12:09:39 PM

to bea2...@googlegroups.com

Such cases do not seem to be rare, at least in the training data.

A GEC system might see a bad token as a spelling mistake and try to rectify it. Further, a character level transduction model might try to correct spacing errors within a bad token. So the problem persists even if one uses spacy, and the system is penalized for making a correct edit.
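
As a rough illustration of how such a fix surfaces as an edit, the sketch below diffs the source tokens against a hypothetical system output that changes nothing except the spacing inside the bad token. It uses Python's `difflib` purely for demonstration; this is an assumption for the example and not the alignment or scoring procedure used in the shared task.

```python
from difflib import SequenceMatcher

# Source tokens as they appear in the training data (split on spaces).
source = "public transport,(as underground ) , help us".split()
# Hypothetical system output: same wording, but the bad token is split up.
hypothesis = "public transport , ( as underground ) , help us".split()

matcher = SequenceMatcher(a=source, b=hypothesis, autojunk=False)

# Keep only the spans where the two token sequences differ.
edits = [
    (tag, source[i1:i2], hypothesis[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
print(edits)
```

The diff reports a single replacement of the fused token by four tokens. If the reference annotation contains no edit at that position, a token-level scorer would presumably count this as a false positive.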

BEA 2019 Shared Task Organisers

Mar 21, 2019, 11:06:05 AM

to BEA 2019 Shared Task: Grammatical Error Correction

I guess the bottom line is that yes, there are some tokenisation issues in the data, but everyone will be evaluated under the same conditions so the effect should already be taken into account.
