In the dev dataset, some lines are not complete sentences?

14484...@qq.com

Mar 22, 2019, 10:14:14 PM

to BEA 2019 Shared Task: Grammatical Error Correction

I found some lines in the dev dataset and the ABC train dataset that are not complete sentences, such as:

line 669 I really cra

line 670 zy about playing it .

in the ABCN.dev.gold.bea19.m2 file (lines 669 and 670), and

line 52 The movie is about Lucy 's family , who move into a farmhouse that is mysterious and

line 53 scary .

line 54 Also reflects a comparison of Lucy 's family , and characters

line 55 from a story their mom often reads .

line 56 The story actually takes place in a camp and a farmhouse , mysteriously

line 57 used in other times .

in the A.dev.gold.bea19.m2 file (lines 52 to 57).

Will lines like these also appear in the test file?

I would also like to ask what the difference is between "-" and "--".

BEA 2019 Shared Task Organisers

Mar 23, 2019, 10:16:30 AM

to BEA 2019 Shared Task: Grammatical Error Correction

When we preprocess the data, we split it on newlines, on the assumption that a newline ends a sentence or paragraph.

You're right that the text is a bit noisy, however, so sometimes there are newlines in unexpected places. I checked the raw data in the JSON file and verified that the issues you found were caused by misplaced newlines. Also remember that since the data was tokenised automatically, the sentence tokeniser may occasionally have gone wrong too.
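To illustrate the effect, here is a minimal sketch of newline-based splitting. It is not the organisers' actual preprocessing code; the function name `split_on_newlines` and the raw string are assumptions made up for this example, loosely based on the fragment quoted above.

```python
def split_on_newlines(raw_text):
    """Treat every newline as a sentence/paragraph boundary (hypothetical helper)."""
    return [line.strip() for line in raw_text.split("\n") if line.strip()]


# Illustrative raw text with a newline misplaced in the middle of a word.
raw = "I really cra\nzy about playing it .\nI like football ."

units = split_on_newlines(raw)
for unit in units:
    print(repr(unit))
```

Because the split trusts every newline, the stray one inside "crazy" yields two separate "sentences", which matches the kind of fragment reported in the question.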

So the bottom line is yes, these kinds of "sentences" could appear in the test data too.
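If you want to spot such fragments in your own copy of the data, a rough heuristic is to flag M2 sentence lines (those starting with "S ") that do not end in terminal punctuation. The sketch below is only an assumption about what might be useful, not an official tool; `find_fragments`, the `TERMINAL` tuple, and the sample annotation lines are all made up for illustration.

```python
# Tokens treated as a plausible sentence ending (an assumed, incomplete list).
TERMINAL = (".", "!", "?", '"', "'")


def find_fragments(m2_lines):
    """Return (line_number, sentence) for S-lines lacking terminal punctuation."""
    flagged = []
    for number, line in enumerate(m2_lines, start=1):
        if line.startswith("S "):
            sentence = line[2:].strip()
            if sentence and not sentence.endswith(TERMINAL):
                flagged.append((number, sentence))
    return flagged


# Made-up sample in M2 layout; the "A" lines are placeholders, not real annotations.
sample = [
    "S I really cra",
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0",
    "",
    "S zy about playing it .",
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0",
]

print(find_fragments(sample))
```

Note that this only catches the first half of a broken sentence: the second half ("zy about playing it .") ends with a full stop and passes the check, so flagged lines are best inspected together with the line that follows them.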

I'm not sure what you mean about "-" and "--". I think there is no difference.

Chris
