Inconsistent Format in Train and Test

yuchenlin

Jul 10, 2017, 2:09:21 AM

to Workshop on Noisy User-generated Text (WNUT)

Hi Leon,

I just found an inconsistency in the tokenization format between the training data and the test data.

For example, in the training data, "@" and "#" are not split from the tokens they are attached to, while in the test data each "@" and "#" is treated as a separate token.
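To make the difference concrete, here is a minimal Python sketch that converts a token list between the two conventions. The function names `split_sigils` and `merge_sigils` are hypothetical and are not part of any WNUT tooling. The sketch works on tokens only and does not decide how entity labels should be assigned to a detached "@" or "#".

```python
# Hypothetical helpers for illustration; not part of the WNUT release.
SIGILS = ("@", "#")


def split_sigils(tokens):
    """Test-data style: detach a leading @ or # into its own token."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok[0] in SIGILS:
            out.extend([tok[0], tok[1:]])
        else:
            out.append(tok)
    return out


def merge_sigils(tokens):
    """Training-data style: reattach a standalone @ or # to the next token."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] in SIGILS and i + 1 < len(tokens):
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


train_style = ["@user", "loves", "#nlp"]
test_style = split_sigils(train_style)
round_trip = merge_sigils(test_style)
```

Note that merging is only a heuristic: a standalone "@" used to mean "at" would also be glued to the following token, so splitting is the safer direction if one format has to be converted to the other.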

I believe the dataset would be more useful for future work if this inconsistency were eliminated or documented publicly.

Thanks and regards,

Bill

Leon Derczynski

Jul 14, 2017, 5:16:01 PM

to Workshop on Noisy User-generated Text (WNUT)

Hi Bill,

Yeah, thanks, this is true. The training data follows a different standard. Getting this consistent has been hard; noisy text processing has these issues! We didn't create general training data for this task, so this older style ended up being used there. A unified tokenization scheme would be helpful for future work.

Out of interest, do you have a preference one way or the other? What seems most appropriate, or most defensible from linguistic principles? I'm a little torn on this. On the one hand, a style-agnostic tagger wouldn't give any special meaning to @ or # and would continue to split them off. On the other hand, they do change the meaning of the word they precede, but is that enough to mean that they should be part of the same token?

Best,

Leon
