Inconsistent Format in Train and Test

yuchenlin

Jul 10, 2017, 2:09:21 AM
to Workshop on Noisy User-generated Text (WNUT)
Hi Leon, 

    I just found an inconsistency between the formats of the training data and the test data.
    For example, in the training data, "@" and "#" are not split from the tokens they precede, while in the test data each "@" and "#" is treated as a separate token.
    I believe the dataset would be better for future work if this inconsistency were eliminated or at least documented publicly.
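
    In case it is useful to others, here is a minimal sketch of how the training tokens could be brought in line with the test-style tokenization. It only handles plain token lists; the function names are hypothetical, and how the entity tags should be carried over to the new "@" / "#" tokens is left open, since that depends on the annotation guidelines.

```python
def split_marker(token):
    """Split a leading '@' or '#' off a token, e.g. '@user' -> ['@', 'user'].

    Tokens that consist only of the marker, or that do not start
    with a marker, are returned unchanged.
    """
    if len(token) > 1 and token[0] in "@#":
        return [token[0], token[1:]]
    return [token]


def to_split_style(tokens):
    """Convert a token list so that '@' and '#' become separate tokens."""
    result = []
    for token in tokens:
        result.extend(split_marker(token))
    return result


if __name__ == "__main__":
    print(to_split_style(["@user", "loves", "#nlp"]))
```

    Only the first leading marker is split off, so a token such as "##tag" becomes "#" followed by "#tag"; whether that is the desired behavior is an assumption on my side.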

Thanks and regards,
Bill 

Leon Derczynski

Jul 14, 2017, 5:16:01 PM
to Workshop on Noisy User-generated Text (WNUT)
Hi Bill,

Yeah, thanks, this is true. The training data follows a different standard. Getting this consistent has been hard - noisy text processing has these issues! We didn't create general training data for this task, so this older style ended up being used there. A unified tokenization scheme would be helpful for future work.

Out of interest - do you have a preference one way or the other? What seems most appropriate, or defensible from linguistic principles? I'm a little torn on this; on the one hand, a style-agnostic tagger wouldn't give any special meaning to @ or # and would continue to split them off. On the other hand, they do change the meaning of the word they precede - but is that enough to mean that they should be part of the same token?

Best,
Leon