Hi Mishal, all,
The answers-students dataset is not tokenized. Since the gold standard
and the evaluation dataset depend on token indices (where a token is a
whitespace-separated string), we suggest that participants do the
following for the answers-students dataset:
- remove all punctuation that is not tokenized, i.e. punctuation
attached to a word rather than separated from it by whitespace.
- don't do anything that changes the indices of tokens (where token
= whitespace-separated string)
Two examples follow:
A battery should connect to a bulb in a closed path.
bulbs a, b, and c are on a path with the battery
- in the first example, the punctuation at the end of the sentence
would be removed (there are 158 such sentences in the test dataset)
- in the second example, the commas of "a," and "b," would be
removed (73 sentences affected)
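For illustration only, the preprocessing could be sketched in Python as below. This is not an official script from the organizers: the function name strip_attached_punctuation is hypothetical, and it assumes that "punctuation which is not tokenized" means ASCII punctuation stuck to the start or end of a whitespace-separated token. Tokens made up only of punctuation are left untouched so that no token index shifts.

```python
import string


def strip_attached_punctuation(sentence):
    """Remove punctuation attached to tokens without changing token indices.

    Assumption (not confirmed by the organizers): only leading/trailing
    ASCII punctuation is stripped; punctuation inside a token is kept.
    """
    cleaned = []
    for token in sentence.split():
        stripped = token.strip(string.punctuation)
        # A token that is only punctuation is kept as-is, because dropping
        # it would shift the indices of all following tokens.
        cleaned.append(stripped if stripped else token)
    return " ".join(cleaned)


examples = [
    "A battery should connect to a bulb in a closed path.",
    "bulbs a, b, and c are on a path with the battery",
]
for sentence in examples:
    print(strip_attached_punctuation(sentence))
```

On the two examples above, the final period of the first sentence and the commas of "a," and "b," in the second are dropped, while the number of tokens stays the same.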
Sorry about this. There was an oversight/misunderstanding among the
organizers regarding this dataset. We hope this solution is
acceptable for participants.
best
eneko
On 01/25/2016 10:06 AM, Mishal Kazmi wrote: