Might there be other important things to compare besides the body text, though, such as author, publisher,
date, etc.? For example, the body text of an article on the first Iraq war might look similar to an article on the
second (same location, two Bushes, etc.), but knowing the date would really set them apart.
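
As a rough sketch of that point: the snippet below blends a body-text score with a date-proximity score, so two articles with near-identical text but very different dates no longer look like a perfect match. Everything here is an assumption made for illustration: the `body` and `date` keys, the Jaccard word overlap, the one-year half-life and the 0.6 weight are not taken from any existing API.

```python
from datetime import date
import re


def tokens(text):
    """Lowercase word tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def body_similarity(a, b):
    """Jaccard overlap of the word sets of two body texts (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def date_similarity(d1, d2, half_life_days=365.0):
    """1.0 for the same day, halving for every half_life_days of gap."""
    gap = abs((d1 - d2).days)
    return 0.5 ** (gap / half_life_days)


def combined_similarity(doc_a, doc_b, body_weight=0.6):
    """Weighted blend of body-text and date similarity (hypothetical weights)."""
    body = body_similarity(doc_a["body"], doc_b["body"])
    when = date_similarity(doc_a["date"], doc_b["date"])
    return body_weight * body + (1.0 - body_weight) * when


# Hypothetical documents: identical wording, twelve years apart.
first_war = {"body": "US forces led by President Bush strike Iraq",
             "date": date(1991, 1, 17)}
second_war = {"body": "US forces led by President Bush strike Iraq",
              "date": date(2003, 3, 20)}
same_week = {"body": "US forces led by President Bush strike Iraq",
             "date": date(1991, 1, 20)}

print(combined_similarity(first_war, second_war))
print(combined_similarity(first_war, same_week))
```

With identical body text, the pair published in the same week scores close to 1.0, while the pair twelve years apart falls back to roughly the body weight alone. Author and publisher could be folded in the same way as extra weighted terms.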
Also, I am not sure how useful an API like the above would be in real time. Suppose I have collected a set of
thousands of documents; ideally I'd want something where I can just say "give me back other docs that look like this".
We use Apache Solr, which has a "more like this" feature, but so far we have not got it working well.
http://wiki.apache.org/solr/MoreLikeThis
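
For reference, a minimal sketch of what such a "give me docs like this one" request might look like against Solr's MoreLikeThis component. The `mlt`, `mlt.fl`, `mlt.mintf`, `mlt.mindf` and `mlt.count` parameters are the ones described on the wiki page above; the base URL, the `id` key and the `body`/`title` field names are assumptions about a particular setup, and the threshold values are just starting points to tune.

```python
from urllib.parse import urlencode


def more_like_this_url(doc_id,
                       base_url="http://localhost:8983/solr/select",  # assumed host and handler
                       fields=("body", "title"),                      # hypothetical field names
                       count=10):
    """Build a Solr query URL asking for documents similar to doc_id."""
    params = {
        "q": f"id:{doc_id}",          # the seed document (assumes a unique 'id' field)
        "mlt": "true",                # switch on the MoreLikeThis component
        "mlt.fl": ",".join(fields),   # fields whose terms are compared
        "mlt.mintf": 1,               # minimum term frequency within the seed document
        "mlt.mindf": 1,               # minimum number of documents a term must appear in
        "mlt.count": count,           # how many similar documents to return
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


print(more_like_this_url("doc42"))
```

Lowering `mlt.mintf` and `mlt.mindf` can help on small or short-document collections, where stricter thresholds may filter out most candidate terms; whether that is what is going wrong in a given installation is only a guess.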
2. And how do people feel about narrowing it down to journalistic
needs (say, published news artifacts) at this nascent stage, rather
than trying to be The Uber meta?
[1] Metagoofil - http://www.edge-security.com/metagoofil.php (it is
generally used for some nefarious purposes)
[2] http://schema.org/
--
Cheers,
T