What are the documents intended to represent?

Rob Speer

Nov 14, 2010, 3:07:48 PM

to metaoptimize-ch...@googlegroups.com

I am left with some questions about this challenge. The biggest is:
what are these documents intended to represent? If I understand
correctly that the file of largely two-word phrases represents the
documents, these don't bear much of a resemblance to real documents
you'd use in topic modeling. Many of them don't even bear much of a
resemblance to human language, for that matter. Yes, corpora are
noisy, but this looks like the result of an effort to deliberately
collect the noise and hide the signal.

I know I get to define the similarity measure, but is it supposed to
be a measure of similarity as it occurs in this set of documents
(which I believe people will have a very difficult time evaluating),
or is it supposed to be word similarity in general with the documents
as an optional training set?

Or is this still sample data, with the real documents that the
algorithm will run on still to come? The blog post does not make clear
which data is sample data and which is the real thing.

If I believe there is a term with no meaningful semantics (for
example, the snippets of ASCII art all over the place), or a term
where the most reasonable similarity measure would be unconnected to
natural language (for example, numbers and dates), do I have the
option of leaving these out of my results, to focus on the terms where
it is possible to return good results?
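
To make the last question concrete, here is a minimal sketch of what
leaving such terms out could look like. Everything in it is an
assumption for illustration: the function names, the regular
expression, and the 0.5 letter-ratio threshold are hypothetical and
are not defined anywhere in the challenge.

```python
import re

# Hypothetical heuristic: a term built only from digits and the
# punctuation commonly found in numbers, times, and dates.
NUMERIC_RE = re.compile(r"^[\d.,:/\-]+$")


def is_numeric_like(term):
    """True for terms such as '2010', '3.14', or '11/14/10'."""
    return bool(NUMERIC_RE.match(term)) and any(ch.isdigit() for ch in term)


def is_symbol_noise(term, min_alpha_ratio=0.5):
    """True when fewer than min_alpha_ratio of the characters are letters.

    This is a rough stand-in for spotting ASCII-art fragments; the
    threshold is an arbitrary assumption.
    """
    if not term:
        return True
    letters = sum(ch.isalpha() for ch in term)
    return letters / len(term) < min_alpha_ratio


def split_terms(terms):
    """Split terms into (kept, skipped) lists, preserving input order."""
    kept, skipped = [], []
    for term in terms:
        if is_numeric_like(term) or is_symbol_noise(term):
            skipped.append(term)
        else:
            kept.append(term)
    return kept, skipped


if __name__ == "__main__":
    kept, skipped = split_terms(["topic", "2010", "11/14/10", "-=-=-", "model"])
    print("kept:", kept)
    print("skipped:", skipped)
```

Returning the skipped terms alongside the kept ones keeps the choice
visible, so the omitted terms can still be reported if the challenge
requires a result for every term.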