In the dev dataset, some lines are not complete sentences?


14484...@qq.com

Mar 22, 2019, 10:14:14 PM
to BEA 2019 Shared Task: Grammatical Error Correction
I found some lines in the dev dataset and the ABC train dataset that are not complete sentences,
such as:

line 669   I really cra
line 670    zy about playing it . 

in the ABCN.dev.gold.bea19.m2 file (lines 669 and 670)

line 52   The movie is about Lucy 's family , who move into a farmhouse that is mysterious and
line 53   scary .
line 54   Also reflects a comparison of Lucy 's family , and characters
line 55   from a story their mom often reads .
line 56   The story actually takes place in a camp and a farmhouse , mysteriously
line 57   used in other times .

in the A.dev.gold.bea19.m2 file

I want to ask: will lines like these also appear in the test file?

I also want to ask about the difference between "-" and "--".


BEA 2019 Shared Task Organisers

Mar 23, 2019, 10:16:30 AM
to BEA 2019 Shared Task: Grammatical Error Correction
So when we preprocess the data, we split it on newlines on the assumption that a newline ends a sentence/paragraph.

You're right that the text is a bit noisy, however, so sometimes there are newlines in unexpected places. I checked the raw data in the JSON file and verified that the issues you found were caused by misplaced newlines. Also remember that since the data was tokenised automatically, it's possible the sentence tokeniser occasionally went wrong too.
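
To illustrate how a misplaced newline produces the fragments reported above, here is a minimal sketch. It is not the actual preprocessing script; the function name `split_on_newlines` is made up for this example, and the input string is just the ABCN example with the stray newline put back in.

```python
def split_on_newlines(raw_text):
    """Treat every newline as a sentence/paragraph boundary.

    Hypothetical helper for illustration only, not the real pipeline.
    """
    return [line.strip() for line in raw_text.split("\n") if line.strip()]


# A stray newline in the middle of a word splits one sentence into two "sentences".
raw = "I really cra\nzy about playing it ."
segments = split_on_newlines(raw)
print(segments)  # ['I really cra', 'zy about playing it .']
```

Any split that trusts newlines will behave this way, so the fragments come from the raw text rather than from the M2 conversion.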

So the bottom line is yes, these kinds of "sentences" could appear in the test data too.
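
If you want to find such lines yourself, a rough heuristic sketch follows. It assumes only that source sentences in an M2 file are the lines starting with `S `. The helper names and the two fragment rules (lowercase first token, no sentence-final punctuation) are assumptions made for this example and will produce false positives and misses; the last sample sentence is invented for contrast.

```python
# Tokens that plausibly end a complete sentence (an assumption, not a standard).
TERMINAL = (".", "!", "?", '"', "'")


def read_m2_sentences(lines):
    """Yield (line_number, sentence) for every 'S ' line of an M2 file."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("S "):
            yield number, line[2:].strip()


def looks_like_fragment(sentence):
    """Heuristic: flag sentences that start lowercase or lack final punctuation."""
    tokens = sentence.split()
    if not tokens:
        return True
    starts_lower = tokens[0][0].islower()
    no_terminal = not tokens[-1].endswith(TERMINAL)
    return starts_lower or no_terminal


# Sample 'S' lines only; annotation ('A') lines are omitted here.
sample = [
    "S I really cra",
    "",
    "S zy about playing it .",
    "",
    "S The story actually takes place in a camp and a farmhouse , mysteriously",
    "",
    "S used in other times .",
    "",
    "S It is a good movie .",  # invented complete sentence for contrast
]

flagged = [(n, s) for n, s in read_m2_sentences(sample) if looks_like_fragment(s)]
for number, sentence in flagged:
    print(number, sentence)
```

To run it on a real file, pass `open(path, encoding="utf-8")` instead of `sample`. Whether to merge, drop, or keep the flagged lines is up to you; since the same noise may be present in the test data, keeping them is probably the safer choice.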

I'm not sure what you mean about - and --? I think there is no difference?

Chris