Is it allowed to normalize use of apostrophe?


julia

Feb 25, 2019, 8:47:56 AM
to BEA 2019 Shared Task: Grammatical Error Correction
The given datasets contain words written with different apostrophe symbols and different spacing, such as It ` s | It ’ s | It ‘s or n`t | n’t | n ‘t | n ' t. From a grammatical point of view, these variants mean the same thing. However, a model treats them as different. Is it allowed to normalize these entries to a single notation, e.g. It ' s?

Will you also consider normalizing such use in your test data?

BEA 2019 Shared Task Organisers

Feb 25, 2019, 2:12:59 PM
to BEA 2019 Shared Task: Grammatical Error Correction
We considered this when compiling the corpora, but ultimately decided not to standardise anything.

In a real use case scenario, such as on Write&Improve, users can input whatever apostrophe/text style they want so the model should be prepared to handle this. 

I realise this makes things messier, but you can perhaps normalise this for your model and then convert the apostrophe styles back later?
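
A minimal sketch of the normalise-then-convert-back idea, assuming whitespace-tokenised text. The set of apostrophe-like characters and the helper names `normalize` and `restore` are illustrative assumptions, not part of the shared task tooling. Tokens the model leaves unchanged get their original apostrophe style back; tokens the model changes are kept as the model produced them.

```python
import difflib

# Assumed set of apostrophe-like characters; extend it for your own data.
APOSTROPHE_LIKE = "`\u2018\u2019\u00b4"
_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def normalize(text: str) -> str:
    """Map every apostrophe-like character to a plain ASCII apostrophe."""
    return text.translate(_TABLE)


def restore(original: str, model_output: str) -> str:
    """Put the original apostrophe style back on tokens the model did not change."""
    orig_tokens = original.split()
    # The replacement is one character for one character, so the token
    # boundaries of the normalised text line up with the original tokens.
    norm_tokens = normalize(original).split()
    out_tokens = model_output.split()

    matcher = difflib.SequenceMatcher(a=norm_tokens, b=out_tokens, autojunk=False)
    restored = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged span: reuse the original tokens, original style included.
            restored.extend(orig_tokens[i1:i2])
        else:
            # Changed, inserted or deleted span: keep the model's tokens.
            restored.extend(out_tokens[j1:j2])
    return " ".join(restored)


if __name__ == "__main__":
    source = "It ` s a nice day , is n\u2019t it ?"
    model_input = normalize(source)
    # Stand-in for a real model: pretend it replaced one word.
    model_output = model_input.replace("nice", "lovely")
    print(restore(source, model_output))
```

Aligning on tokens rather than characters means the restore step still works when the model inserts or deletes words elsewhere in the sentence.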

bog...@webspellchecker.net

Feb 26, 2019, 5:04:04 AM
to BEA 2019 Shared Task: Grammatical Error Correction
Thank you for your response. Let me describe my doubts.

Below you can see the samples from the dataset:

[Image 1: Screen Shot 2019-02-25 at 10.45.55 PM.png]

[Image 2: Screen Shot 2019-02-25 at 10.53.27 PM.png]

As you can see, there is no single convention for the prediction format. Of course, in a production system we should handle these cases in a user-oriented way. As I understand it, however, the results will be evaluated automatically: the Evaluation section of the task description says “Systems will be evaluated using the ERRANT scorer.” This scorer works with span-based and token-based matches. So if my system predicts "did n't" instead of "didn ` t" (second screenshot), will that count as both a span error and a token error?


Thank you in advance,


On Monday, February 25, 2019 at 21:12:59 UTC+2, BEA 2019 Shared Task Organisers wrote:

BEA 2019 Shared Task Organisers

Feb 28, 2019, 9:59:38 AM
to BEA 2019 Shared Task: Grammatical Error Correction
Hi Bogdan,

I just wanted to let you know we've contacted the annotators to ask if there is any convention/guideline when annotating apostrophes. You're right that it'd be unfair if your system lost points simply because of different apostrophe styles. 

In the meantime, the following information may be useful:
75% of all apostrophe-like characters are: ' 
There are only roughly 30 edits in the entire training+dev set where an apostrophe changes style. 
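
For anyone who wants to check the distribution of apostrophe-like characters on their own copy of the data, here is a rough sketch. The character set and the function names are assumptions for illustration; this is not the script used to produce the figures above.

```python
from collections import Counter

# Assumed set of apostrophe-like characters, including the plain ASCII one.
APOSTROPHE_LIKE = "'`\u2018\u2019\u00b4"


def apostrophe_counts(lines):
    """Count each apostrophe-like character over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if ch in APOSTROPHE_LIKE)
    return counts


def share(counts, char):
    """Fraction of all counted characters that are `char` (0.0 if none were counted)."""
    total = sum(counts.values())
    return counts[char] / total if total else 0.0


if __name__ == "__main__":
    # Toy input; in practice, pass the lines of a corpus file instead.
    sample = ["It ' s fine", "did n't", "It ` s", "n\u2019t"]
    counts = apostrophe_counts(sample)
    for char, n in counts.most_common():
        print(repr(char), n, f"{share(counts, char):.0%}")
```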

We'll let you know more when we hear back from the annotators, and if the problem is large enough to warrant an update to the data.

Chris

writeto...@gmail.com

Mar 9, 2019, 11:06:03 AM
to BEA 2019 Shared Task: Grammatical Error Correction
The style of quotation marks also differs across the corpora, e.g. {" ,  ' ' , ``}.
Is it possible to normalize the datasets for these?

BEA 2019 Shared Task Organisers

Mar 9, 2019, 12:16:07 PM
to BEA 2019 Shared Task: Grammatical Error Correction
We considered that too, but as we're not the original authors of NUCLE or Lang-8, we didn't want to mess with already well-established corpora. You can probably normalise them yourself if you think it'll matter. 
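
One possible way to normalise quotation marks yourself is sketched below, assuming whitespace-tokenised sentences in which a double quote appears as its own token. The list of variants and the name `normalize_quotes` are assumptions; check them against the corpora you actually use.

```python
# Assumed token-level variants of a double quotation mark:
# PTB-style `` and '', plus curly and low double quotes.
DOUBLE_QUOTE_TOKENS = {"``", "''", "\u201c", "\u201d", "\u201e"}


def normalize_quotes(tokenized_sentence: str) -> str:
    """Replace every double-quote variant token with a plain ASCII double quote."""
    tokens = tokenized_sentence.split()
    return " ".join('"' if tok in DOUBLE_QUOTE_TOKENS else tok for tok in tokens)


if __name__ == "__main__":
    print(normalize_quotes("He said `` hello '' ."))
```

Matching whole tokens rather than characters leaves contractions such as n't untouched.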

The most important thing for us was to make sure the W&I+LOCNESS train/dev/test data was normalised. 