I know I get to define the similarity measure, but is it supposed to
be a measure of similarity as it occurs in this set of documents
(which I believe people will have a very difficult time evaluating),
or is it supposed to be word similarity in general with the documents
as an optional training set?

Or is this still sample data, with the real documents to run the
algorithm on still to come? The blog post is fairly unclear about
what is sample data and what is the real thing.

If I believe a term has no meaningful semantics (for example, the
snippets of ASCII art all over the place), or that the most reasonable
similarity measure for a term would be unconnected to natural language
(for example, numbers and dates), do I have the option of leaving
these terms out of my results, to focus on the terms where it is
possible to return good results?

-- Rob