If I understand Joseph Turian's intentions correctly, you are welcome
to clean the data set if that gives you better precision, and the
cleaning will be considered part of your solution.
--
- Alexandre
Cleaning the dataset is OK. But if the vocabulary is expected to be
used for evaluating the results, it should be common to all
participants, and hence clean. At the very least, let's remove tokens
shorter than 3 characters, for instance, or consisting only of
decimal digits.
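
To make the proposed filter concrete, here is a minimal sketch in
Python. The function names (keep_token, clean_vocabulary) and the
min_length parameter are made up for illustration and are not part of
any shared tooling. It also assumes "decimal numbers" means tokens made
only of the ASCII digits 0-9, so something like "3.14" would be kept.

```python
import re

# Assumption: "only decimal numbers" means the token is made of ASCII digits only.
_DIGITS_ONLY = re.compile(r"[0-9]+")


def keep_token(token, min_length=3):
    """Return True if the token survives the proposed cleaning rule."""
    # Drop tokens shorter than min_length characters.
    if len(token) < min_length:
        return False
    # Drop tokens consisting only of digits.
    if _DIGITS_ONLY.fullmatch(token):
        return False
    return True


def clean_vocabulary(tokens, min_length=3):
    """Filter a list of tokens, preserving the original order."""
    return [t for t in tokens if keep_token(t, min_length)]
```

For example, clean_vocabulary(["of", "the", "2009", "word2vec"]) keeps
only "the" and "word2vec". The exact threshold and the treatment of
mixed tokens would still need to be agreed on by everyone.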
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel