Why not clean the vocabulary?

Christian

Nov 6, 2010, 10:08:47 PM
to MetaOptimize Challenge [discuss]
The vocabulary file contains a lot of 'garbage'. Why don't you clean
it of stop words, arbitrary symbols, etc.? This would reduce the load
and might even improve precision, because most of that is noise
rather than information. It is a win-win.
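
For illustration, the kind of cleaning suggested here could look roughly like the sketch below. It is only a sketch under assumptions: the vocabulary is taken to be a plain list of tokens already loaded in memory, and STOP_WORDS and is_noise are hypothetical names with a deliberately tiny stop-word list, not anything from the challenge's own code.

```python
# Tiny illustrative stop-word list (assumption); a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def is_noise(token: str) -> bool:
    """Return True for stop words and for tokens made only of symbols."""
    # Case-insensitive stop-word check.
    if token.lower() in STOP_WORDS:
        return True
    # Tokens without a single letter or digit, e.g. "--" or "***".
    if not any(ch.isalnum() for ch in token):
        return True
    return False


sample = ["The", "--", "classifier", "of", "***", "kernel"]
filtered = [t for t in sample if not is_noise(t)]
print(filtered)
```

Whether dropping these tokens actually improves precision would still have to be measured on the task itself.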

Alexandre Passos

Nov 6, 2010, 10:09:57 PM
to metaoptimize-ch...@googlegroups.com

If I understood Joseph Turian's intentions correctly: if you can get
better precision by cleaning the data set, you're welcome to try it,
and it will be considered part of your solution.


--
 - Alexandre

Olivier Grisel

Nov 6, 2010, 10:25:22 PM
to metaoptimize-ch...@googlegroups.com
2010/11/7 Alexandre Passos <alexan...@gmail.com>:

Cleaning the dataset is OK. But if the vocabulary is expected to be
used for the evaluation of the results, it should be common to all the
participants, and hence it should be clean. At the very least, let's
remove tokens shorter than 3 characters, for instance, or tokens
consisting only of decimal numbers.
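
The two filters proposed above could be sketched as follows. This is a minimal illustration, not the challenge's tooling: keep_token and clean_vocabulary are hypothetical names, the vocabulary is assumed to be a list of strings, and "only decimal numbers" is read here as "every character is a decimal digit" (so a token like "3.14" would be kept).

```python
def keep_token(token: str, min_length: int = 3) -> bool:
    """Return True if the token passes both proposed filters."""
    # Filter 1: drop tokens shorter than min_length characters.
    if len(token) < min_length:
        return False
    # Filter 2: drop tokens whose characters are all decimal digits.
    if token.isdecimal():
        return False
    return True


def clean_vocabulary(tokens):
    """Keep only the tokens that pass keep_token, preserving order."""
    return [t for t in tokens if keep_token(t)]


vocabulary = ["the", "of", "2010", "a", "learning", "42", "x1", "svm"]
cleaned = clean_vocabulary(vocabulary)
print(cleaned)
```

Note that a length threshold of 3 still keeps short stop words such as "the", so this would complement a stop-word list rather than replace one.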

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
