Why not clean the vocabulary?

Christian

Nov 6, 2010, 10:08:47 PM

to MetaOptimize Challenge [discuss]

The vocabulary file contains a lot of 'garbage'. Why don't you clean
it of stop words, arbitrary symbols, etc.? This would reduce the load
and might even improve precision, because most of those tokens are noise
rather than information. It is a win-win.

Alexandre Passos

Nov 6, 2010, 10:09:57 PM

to metaoptimize-ch...@googlegroups.com

If I understood Joseph Turian's intentions correctly, if you can get
better precision by cleaning the data set, you're welcome to try it, and
it will be considered part of your solution.

--
- Alexandre

Olivier Grisel

Nov 6, 2010, 10:25:22 PM

to metaoptimize-ch...@googlegroups.com

2010/11/7 Alexandre Passos <alexan...@gmail.com>:

Cleaning the dataset is OK. But if the vocabulary is expected to be
used for the evaluation of the results, it should be common to all the
participants, so the shared copy itself should be clean. At least let's
remove tokens shorter than 3 symbols, for instance, or tokens comprising
only decimal digits.
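
A minimal sketch of that filter, assuming the vocabulary is available as a
plain list of token strings (the thread does not say how the file is laid
out). The function name `clean_vocabulary` and the `min_length` parameter are
made up for illustration:

```python
def clean_vocabulary(tokens, min_length=3):
    """Drop tokens shorter than min_length or made up only of decimal digits.

    Assumption: `tokens` is an iterable of strings, one per vocabulary entry.
    """
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) < min_length:
            continue  # too short, e.g. "of", "a", "!!"
        if token.isdecimal():
            continue  # only decimal digits, e.g. "2010"
        kept.append(token)
    return kept


vocab = ["the", "of", "2010", "a", "svm", "12", "kernel", "!!"]
print(clean_vocabulary(vocab))  # ['the', 'svm', 'kernel']
```

Note that a stop word such as "the" survives this rule; removing stop words
would need a separate word list.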

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
