About the NUCLE dataset

14484...@qq.com

Feb 18, 2019, 12:25:26 AM

to BEA 2019 Shared Task: Grammatical Error Correction

In the NUCLE Release 3.3 dataset I requested, the file nucle.train.gold.bea19.m2 contains only about 50,000 sentences. Is that correct?

BEA 2019 Shared Task Organisers

Feb 18, 2019, 8:53:46 AM

to BEA 2019 Shared Task: Grammatical Error Correction

That is correct.

Here are the statistics for all the datasets allowed in the Restricted Track:

FCE: 33,237 sentences (train/dev/test)

Lang-8: 1,037,561 sentences

NUCLE: 57,151 sentences

W&I+LOCNESS: 43,129 sentences (train/dev/test)

Although Lang-8 is the biggest, it is also the noisiest of these datasets. All the others were professionally annotated with GEC in mind and should be of higher quality.
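
For anyone who wants to check the sentence count of an M2 file themselves, here is a minimal sketch. It assumes the usual M2 layout, in which each source sentence is a line starting with "S " and each annotation is a line starting with "A ". The function names and the sample lines are made up for illustration and are not taken from NUCLE.

```python
def count_m2_sentences(lines):
    """Count source sentences, i.e. lines that start with 'S '."""
    return sum(1 for line in lines if line.startswith("S "))


def count_m2_file(path):
    """Count source sentences in an M2 file on disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return count_m2_sentences(handle)


# Illustrative sample in M2 style; not real NUCLE data.
sample = [
    "S This are a sentence .",
    "A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0",
    "",
    "S Another sentence .",
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0",
    "",
]

sample_count = count_m2_sentences(sample)
print(sample_count)
```

Calling count_m2_file("nucle.train.gold.bea19.m2") on your local copy should give a figure you can compare against the NUCLE number listed above.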
