document comparison


Matthew Terenzio

Oct 4, 2011, 9:08:46 AM
to meta-met...@googlegroups.com
So, not to get ahead of ourselves, but at Hacktoberfest I asked whether document comparison would be out of scope for meta-meta.

I think Laurian and Dan were there when I asked, and I recall that the lack of storage was posed as one challenge to doing something like that.

It took a while to process that and maybe I misunderstood.

But couldn't one API endpoint be something like /compare?doc1=URL&doc2=URL,

which would extract entities from both documents, apply a comparison algorithm, and return some type of similarity value?
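
A minimal sketch of what I'm imagining, assuming Flask and a hypothetical extract_entities() helper (NLTK, OpenCalais, or whatever we end up using):

    # Rough sketch only: assumes Flask and a placeholder extract_entities()
    # helper that turns raw HTML into a set of entity strings.
    import urllib2
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def extract_entities(html):
        """Placeholder: plug a real entity extractor in here."""
        raise NotImplementedError

    @app.route("/compare")
    def compare():
        doc1 = urllib2.urlopen(request.args["doc1"]).read()
        doc2 = urllib2.urlopen(request.args["doc2"]).read()
        e1, e2 = extract_entities(doc1), extract_entities(doc2)
        union = e1 | e2
        # Jaccard overlap of the two entity sets as a crude similarity value
        score = len(e1 & e2) / float(len(union)) if union else 0.0
        return jsonify(score=score, shared=sorted(e1 & e2))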

Just brainstorming.

Matthew Haynes

Oct 4, 2011, 9:38:15 AM
to meta-met...@googlegroups.com
Hiya,

I am also very interested in this subject, and would love it if anybody could share links to
existing software / APIs / articles!

Might there be other important things to compare than just the body text, though, such as author, publisher,
date, etc.? For example, the body text of an article on the first Iraq war might look similar to an article on the
second (same location, two Bushes, etc.), but knowing the date would really set them apart.

Also, I am not sure how useful an API like the above would be in real time. Suppose I have collected a set of
thousands of documents; ideally I'd want something where I can just say "give me back other docs that look like this".
We use Apache Solr, which has a "more like this" feature, but so far we have not got it working well.

  http://wiki.apache.org/solr/MoreLikeThis

I'm just beginning my adventures in machine learning; the Stanford ML class is open if anybody else is
interested ...

  http://www.ml-class.com/
 
I have also found this book a great introduction to the subject ...

  http://shop.oreilly.com/product/9780596529321.do
 
Cheers,

Matt

Raynor Vliegendhart

Oct 4, 2011, 9:45:26 AM
to meta-met...@googlegroups.com
If we are going to focus on comparison of documents, I think we should make it easily composable. For example, we should first have something like:

/extract_tfidf?doc=URL

Then we could build a generic /compare API function that consumes another API function like /extract_tfidf and gives you back a comparison function that is specialised to your needs.

Some hypothetical usage:

/compare?by=/extract_tfidf&doc1=URL&doc2=URL
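
Purely hypothetical, but the plumbing could be as thin as this (the endpoint names and the {term: weight} JSON shape are just assumptions for the sketch):

    # Hypothetical sketch of a higher-order /compare: it delegates feature
    # extraction to whatever endpoint ?by= names (e.g. /extract_tfidf), then
    # compares the two returned {term: weight} vectors with cosine similarity.
    import json, math, urllib, urllib2

    API_ROOT = "http://localhost:5000"   # assumption: the extractors live on the same host

    def fetch_vector(extractor, doc_url):
        qs = urllib.urlencode({"doc": doc_url})
        return json.load(urllib2.urlopen(API_ROOT + extractor + "?" + qs))

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def compare(by, doc1, doc2):
        return cosine(fetch_vector(by, doc1), fetch_vector(by, doc2))

    # compare("/extract_tfidf", "http://example.org/a", "http://example.org/b")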



Or maybe I'm just crazy and we should avoid higher-order APIs. :p


-Raynor

Matthew Terenzio

Oct 4, 2011, 9:56:11 AM
to meta-met...@googlegroups.com
On Tue, Oct 4, 2011 at 9:38 AM, Matthew Haynes <ma...@matthaynes.net> wrote:

Might there be other important things to compare than just the body text, though, such as author, publisher,
date, etc.? For example, the body text of an article on the first Iraq war might look similar to an article on the
second (same location, two Bushes, etc.), but knowing the date would really set them apart.

Yes, if available. Entity extraction should not be confined to the body and headline but should draw on any available metadata, which is why I'm big on things like rNews. But we aren't there yet, so we should do our best at getting that info in other ways.
 

Also, I am not sure how useful an API like the above would be in real time. Suppose I have collected a set of
thousands of documents; ideally I'd want something where I can just say "give me back other docs that look like this".
We use Apache Solr, which has a "more like this" feature, but so far we have not got it working well.

Yes, that would certainly be a better use case for my needs as well. Heard about Solr during the week (maybe from you) and am looking into it as well.
 


Matthew Terenzio

Oct 4, 2011, 10:00:11 AM
to meta-met...@googlegroups.com
Seems fine to me. Is there a downside to it?

Dan Schultz

Oct 4, 2011, 10:02:40 AM
to meta-met...@googlegroups.com
Currently the API specs outline something like this to handle tasks which would require multiple inputs:

Each media form has a GET call --- which is the workhorse where you say what tasks you want done and provide optional results from past tasks --- but it also has a POST call --- which allows you to upload a file to the server and specify a time to live.  The server responds to the POST with a Media Item ID (miid), which can then be used as an input for future GET tasks.

For instance, you could upload 15 documents by POSTing them and get 15 IDs back, and then we could have a "get_similar_documents" task which takes in a list of miids that you want to compare against.
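
From the client side that flow might look roughly like this; the base URL, parameter names and response fields are my guesses rather than the actual spec:

    # Client-side sketch of the planned POST-then-GET flow.  The base URL,
    # parameter names and response fields are guesses, not the real spec.
    import requests

    API = "http://api.example.org/text"   # hypothetical base URL for the text media form

    # POST each document, keeping the Media Item IDs the server hands back
    miids = []
    for path in ["doc1.txt", "doc2.txt", "doc3.txt"]:
        resp = requests.post(API, files={"file": open(path, "rb")},
                             data={"ttl": 3600})          # time to live, in seconds
        miids.append(resp.json()["miid"])

    # Later, run a multi-input task against the stored items
    similar = requests.get(API, params={"task": "get_similar_documents",
                                        "miids": ",".join(miids)})
    print similar.json()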

Look at the image POST and the video "identify_faces" on the wiki to see another example of this kind of planned interaction.

All of that being said, I think the reason we wanted to table this convo until after we get the simpler "single document" tasks out of the way wasn't the issue of storage; it was that it adds a lot of complexity (Raynor just brought up a great example: ideally we would use the metadata surrounding the other documents to help with the similarity).

I'd say that the best thing to do would be to take a stab at it yourself and see what comes out!  We can at the very least learn from the attempt.

 - Dan

Matthew Haynes

Oct 4, 2011, 10:08:48 AM
to meta-met...@googlegroups.com
Loving the idea of higher-order APIs; need to get recursion in there too :)

More seriously, is the power of the "IDF" part of the equation diluted by a smaller data set?

For example, if I compare just two BBC News articles, both will likely match the entity "BBC", so in terms of IDF this entity will be as important to the matching algorithm as any other that appears in both docs, for example "Libya". On the other hand, if I compare a single BBC News article against a whole archive, then the entity "BBC" will be much more frequent than something like "Libya" and as such have a lower IDF score?
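
A quick back-of-envelope illustration of what I mean, using the plain idf = log(N / df) definition (all numbers made up):

    # Back-of-envelope illustration of the dilution question, using the
    # plain idf = log(N / df) definition.  All the numbers are made up.
    from math import log

    def idf(total_docs, docs_containing_term):
        return log(float(total_docs) / docs_containing_term)

    # Comparing just two BBC articles: "bbc" and "libya" both appear in both
    # docs, so both terms get the same (zero) idf weight.
    print idf(2, 2), idf(2, 2)      # 0.0 0.0

    # Against a whole archive, "bbc" is everywhere but "libya" is rarer,
    # so "libya" carries far more weight.
    print idf(10000, 9000)          # ~0.11  ("bbc")
    print idf(10000, 200)           # ~3.91  ("libya")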

Just thinking, I may well be wrong :)

Cheers,

Matt   

Raynor Vliegendhart

Oct 4, 2011, 11:14:15 AM
to meta-met...@googlegroups.com
Ah right... tfidf might not be a suitable example when applied to a single document. What was I thinking? :p
 
Also, your BBC-Libya example makes total sense.
 
-Raynor
 
PS: And yes, we need to get recursion in it. Let's make a Turing-complete API! :p


Tathagata Dasgupta

Oct 4, 2011, 12:42:43 PM
to meta-met...@googlegroups.com
Quick questions:
1. The metadata extraction seems like a multilayered task. Do we, or
anybody deploying the service, have the infrastructural capability to
do it in a cost-effective manner? Say face recognition in videos -
will that scale? It is easy to think about document similarity via
tf-idf based cosine similarity, entity similarity, or whatever - but
google/bing etc. search algorithms have many more secret sauces to do
the job infinitely better. Metagoofil [1] is an example of pulling
Level 0 metadata from anything that is publicly available on a
website via Google. There might be an "openness" thorn about it - but
it's too difficult not to get seduced by their infrastructure and
make a RESTful API that leeches on giants.

2. And what do people feel about narrowing it down to journalistic
needs (say, published news artifacts) at this nascent stage, compared
to being The Uber Meta?

[1] metagoofil - http://www.edge-security.com/metagoofil.php (it is
generally used for some nefarious purposes)
[2] http://schema.org/

--
Cheers,
T

Matthew Terenzio

Oct 4, 2011, 12:59:42 PM
to meta-met...@googlegroups.com
I have no idea about the first part. Certainly not about face recognition.

My only thought about the other half is that schema.org (microdata), rNews (RDFa) and others (Microformats) can be distilled to a common format (maybe RDF) and then queried using more general tools, no?

But this is about news, and if others see ways to gain performance or reduce code by reducing the scope to news-related ontologies, then I vote for it.

MT

Matthew Terenzio

Oct 4, 2011, 1:22:03 PM
to meta-met...@googlegroups.com
BTW, for those that might not have seen it, here is an RDFa distiller script that gets fed an article page URL that has been marked up with rNews. I don't have the whole spec covered in my markup, but you can see it distills tags, dateCreated, headline, body, etc. You can add licensing, organization, author, etc.

http://followth.is/cgi-bin/RDFa.py?uri=http://www.thehour.com/story/511761/oyster-shell-park-1-4-million-plan-would-create-grading-creating-paths-lighting-and-landscaping

For my purposes, I'd be looking for this markup or something similar and then falling back on NLP.
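
As a rough illustration (not tested against that exact distiller), pulling the distilled rNews properties into Python could look something like this; I'm assuming the distiller returns RDF/XML, and the rNews namespace URI below is my guess from the spec, so double-check it:

    # Rough illustration only: assumes the distiller returns RDF/XML and that
    # rdflib is installed.  The rNews namespace URI is a guess - check the spec.
    import urllib, urllib2
    from rdflib import Graph, Namespace

    RNEWS = Namespace("http://iptc.org/std/rNews/2011-10-07#")
    DISTILLER = "http://followth.is/cgi-bin/RDFa.py?uri="

    def rnews_properties(article_url):
        rdf_xml = urllib2.urlopen(DISTILLER + urllib.quote(article_url, safe="")).read()
        g = Graph()
        g.parse(data=rdf_xml, format="xml")
        props = {}
        for s, p, o in g:
            if unicode(p).startswith(unicode(RNEWS)):
                props[unicode(p).replace(unicode(RNEWS), "")] = unicode(o)
        return props   # e.g. {"headline": ..., "dateCreated": ..., ...}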

Also note that rNews has aligned itself to be compatible with schema.org:

http://www.iptc.org/site/Home/Media_Releases/schema.org_adopts_IPTC%27s_rNews_for_news_markup


Dan Schultz

Oct 4, 2011, 1:23:10 PM
to meta-met...@googlegroups.com
What exactly did you have in mind?  Any specific tasks you are thinking of when you suggest narrowing down to specific formats?  You bring up a good question of "how specific can we get?"  And I would actually bounce that back: do you think it is possible to design an elegant, modular solution that could be as specific as needed?

We have "text" "image" "video" "audio" but I think there is room for a system that makes it easy for forkers and contributors to add new types.  for instance "news_text" or hell, even "bbc_news_text" which has tasks that are truly unique to the domain.  The types that we feel are universally useful can be integrated back into the main project.  I don't think we should open up the floodgates yet until we build up our core first.

The most important thing to keep in mind here is that, for the health of the project, it is our job to provide a framework that lets people add to the system without cluttering it.  Now that the jetlag is gone and my inbox is almost clean, I'll put my brain cycles to work on this question (how do we add more media types), the concept of multi-document tasks, the concept of "pipeable" tasks, and the issues of asynchronous and processor-intensive tasks.

 - Dan

Raynor Vliegendhart

Oct 4, 2011, 1:34:23 PM
to meta-met...@googlegroups.com
Responding to the first question: there probably is no way of doing it cost-effectively. However, to prevent a service being drowned in requests, we could introduce artificial limits like N requests per T time units. I mean, even free services like the Yahoo BOSS API have limits (one of my colleagues didn't make any progress when he was hitting Yahoo with like 20 threads :p).
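
Just to illustrate the "N requests per T time units" idea, a toy sliding-window limiter could be as simple as this (nothing production-grade):

    # Toy sliding-window limiter illustrating "N requests per T time units".
    import time
    from collections import deque

    class RateLimiter(object):
        def __init__(self, max_requests, per_seconds):
            self.max_requests = max_requests
            self.per_seconds = per_seconds
            self.calls = deque()

        def allow(self):
            now = time.time()
            # Drop timestamps that have fallen out of the current window
            while self.calls and now - self.calls[0] > self.per_seconds:
                self.calls.popleft()
            if len(self.calls) < self.max_requests:
                self.calls.append(now)
                return True
            return False

    limiter = RateLimiter(max_requests=100, per_seconds=60)   # e.g. 100 requests per minute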
 
 
-Raynor

Laurian

Oct 4, 2011, 3:55:49 PM
to meta-met...@googlegroups.com
I'll try to synthesise some ideas on this later this week (read: weekend) as I'm drowning in work.

One thing I would love is that, no matter what the input is (text, URL, video, etc.), we could generate a lexical signature, or a semantic one if possible (though that term is trademarked now), like:


Imagine that you get such a signature for a document (computed against a big corpus; think of a fixed one for now, like a Wikipedia snapshot, though we may have custom ones or collections later). Then you can compare two or more documents using only the signatures (a signature is in essence the vector that represents the document in that corpus' space), and you can do that comparison locally, not necessarily via a remote API call.
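
A sketch of that shape, assuming scikit-learn (the reference_corpus list just stands in for the Wikipedia-snapshot-style collection):

    # Sketch of the signature idea: fit a vectoriser on a fixed reference
    # corpus once, then a document's signature is its weight vector in that
    # corpus' space and comparisons are purely local.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    reference_corpus = ["... one reference document per string ...",
                        "... e.g. articles from a Wikipedia snapshot ..."]
    vectoriser = TfidfVectorizer().fit(reference_corpus)

    def signature(text):
        return vectoriser.transform([text])     # sparse vector in the corpus' space

    def compare(sig_a, sig_b):
        return cosine_similarity(sig_a, sig_b)[0, 0]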

BTW, it might be interesting to explore http://www.elasticsearch.org as an option for text storage, indexing and search

Laurian

Matthew Terenzio

Oct 5, 2011, 7:56:00 AM
to meta-met...@googlegroups.com
Elasticsearch gets a 10 on ease of setup. I have a cold and was able to get it installed and running in a few minutes while sitting up in bed. Easy to insert JSON and query it back. Today I'm going to try to populate it with some RDF/JSON http://docs.api.talis.com/platform-api/output-types/rdf-json and see if I can build some meaningful queries.
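
"Insert JSON and query it back" looked roughly like this for me; the index and field names are just examples, assuming a default install on localhost:9200:

    # Roughly what indexing a document and querying it back looks like
    # against a default Elasticsearch install.  Index, type and field
    # names are just examples.
    import json, requests

    ES = "http://localhost:9200/articles/article"

    doc = {"headline": "Oyster Shell Park 1.4 million plan",
           "dateCreated": "2011-10-04",
           "tags": ["norwalk", "parks"]}

    # Index the document under id 1
    requests.put(ES + "/1", data=json.dumps(doc))

    # Query it back with a simple query-string search
    hits = requests.get(ES + "/_search", params={"q": "headline:park"}).json()
    print hits["hits"]["total"]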

Documentation is good, though I wasn't sure what they meant by:

Think of the Query DSL as an AST of queries.

Data Set Language??  I guess AST is abstract syntax tree, but that doesn't help much.

I'm not an expert with NoSQL solutions, but it seems like this could also be an alternative to some of them for many use cases.

Matthew Haynes

Oct 5, 2011, 8:12:42 AM
to meta-met...@googlegroups.com
Domain Specific Language I think ...

http://en.wikipedia.org/wiki/Domain-specific_language

I have not used Elasticsearch, only Solr, but they are both built on top of Apache Lucene. With Solr at least, I have found the docs so bad in places that it is usually easier to read up on Lucene instead to get an idea of how things work!

We use Solr in addition to SQL here, as opposed to as a replacement. In our project we have millions of entities that are all mapped together in a relational SQL database and linked in some places to XML stores. I guess you could call this our "data warehouse"; we then populate Solr documents from all this data and let the front-end app query that. We have found that keeping all the data available in the SQL / XML stores makes things much, much easier to manage. For example, re-indexing the whole Solr DB is very quick and easy because we have the data stored in a sensible and accessible structure.
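
The indexing step is basically a job shaped like this (table and field names invented; assumes pysolr and a running Solr core):

    # Very rough shape of the "populate Solr from the warehouse" job.
    # Table and field names are invented; assumes pysolr and a running core.
    import sqlite3            # stand-in for whichever SQL database you use
    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/articles")
    db = sqlite3.connect("warehouse.db")

    rows = db.execute("SELECT id, headline, body, published FROM articles")
    solr.add([{"id": r[0], "headline": r[1], "body": r[2], "published": r[3]}
              for r in rows])
    # Re-indexing the whole thing is just re-running this job against the warehouse.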

Cheers,

Matt

Laurian

Oct 5, 2011, 8:29:29 AM
to meta-met...@googlegroups.com
BTW, I usually prefer to use Lucene directly for two reasons:
as both can directly work with a Lucene index.

Laurian