Speeding up ESA similarity

11 views
Skip to first unread message

Alain Désilets

unread,
Jan 26, 2018, 8:36:26 AM1/26/18
to DKPro Similarity Users
I am using DKPro Similarity in the context of a document near-duplicate search application.

I have been able to implement a class for evaluating the similarity between 2 documents, but it takes about 3 secs per pair comparison. Is that normal? My documents typically are a couple of pages long.

If this is normal, I wonder if there are ways in which comparison could be sped up and would be willing to contribute any changes I make to the library.

One idea I have is to implement a similarity method that takes two vector representations as inputs instead of two lists of lemmatized tokens.

In my application, if I have N documents, then that means I have N*(N -1) / 2 pairs of document for which I need to compute similarity. Since the only public method for measuring similarity takes the lemmatized tokens as inputs instead of their vector representations, that means I end up computing 2 * N * (N - 1) / 2 = N*(N-1) vector representations.

If I could feed the vector representations directly, I could compute those once and then feed pairs of vectors to the similarity method, resulting in only N vector computation.

Does that sound like a reasonable thing to do? If so, I am willing to implement it and contribute it back to this community.

Thx.

Alain Desilets

Torsten Zesch

unread,
Jan 26, 2018, 9:06:10 AM1/26/18
to Alain Désilets, DKPro Similarity Users
Dear Alain,

the execution time sound reasonable, as your analysis of what is going on internally is correct.

Note that there are already two different versions of ESA Indexes in DKPro similarity. One where the vector is constructed from the inverted index on query time (saves space) and one where we already store the vector (saves time).
Here is a relevant issue with some code examples:

You could use the VectorIndex representation to store whole document vectors.
We haven't implemented this specific version so far, as ESA indexes are generic, i.e. you can compute the similarity for any given document pair.
Once you have precomputed document vectors, you can only measure similarity between the precomputed documents.


Another note: you asked specifically about ESA indexes, but I have personally found that using smaller vectors (e.g. embeddings) gives similar results with much less computations.

-Torsten

--
You received this message because you are subscribed to the Google Groups "DKPro Similarity Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dkpro-similarity-users+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alain Désilets

unread,
Jan 26, 2018, 9:36:54 AM1/26/18
to DKPro Similarity Users


On Friday, 26 January 2018 09:06:10 UTC-5, Torsten Zesch wrote:
Dear Alain,

Note that there are already two different versions of ESA Indexes in DKPro similarity. One where the vector is constructed from the inverted index on query time (saves space) and one where we already store the vector (saves time).
Here is a relevant issue with some code examples:

In the above, when you say "vector" you mean the vector for a given word, or the vector for a given document? I am assuming "word". I am using a VectorIndexReader, so I am assuming that I am doing the second approach (already store the vector). Is that correct?

You could use the VectorIndex representation to store whole document vectors.
We haven't implemented this specific version so far, as ESA indexes are generic, i.e. you can compute the similarity for any given document pair.
Once you have precomputed document vectors, you can only measure similarity between the precomputed documents.

If you were presented with two new documents A and B, couldn't you just compute the vectors for those two and feed them to the new similarity() method (and possibly cache the vectors somewhere in case one or both of those docs come up again in the near future)? 

Another note: you asked specifically about ESA indexes, but I have personally found that using smaller vectors (e.g. embeddings) gives similar results with much less computations.

Yes, we are looking at embeddings as well. But our intuition is that if you for the purpose of evaluating similarity of documents in a specialized domain, you need you won't get good performance by using embeddings trained on a generic corpus like WP. You need to train the embeddings on a specialized corpus. And we want to avoid that as much as possible. So, we are also toying with ESA because we think it may work even if the vectors use a generic corpus like WP.

That said, my first experiment with ESA similarity indicates that it actually performs worse than tfidf (not sure yet, it might be a bug in our code). If that turns out to be true, then there is no point in me trying to speed it up.

Thx for the help Torsten.

Alain

Torsten Zesch

unread,
Jan 26, 2018, 11:04:55 AM1/26/18
to Alain Désilets, DKPro Similarity Users
Note that there are already two different versions of ESA Indexes in DKPro similarity. One where the vector is constructed from the inverted index on query time (saves space) and one where we already store the vector (saves time).
Here is a relevant issue with some code examples:

In the above, when you say "vector" you mean the vector for a given word, or the vector for a given document? I am assuming "word". I am using a VectorIndexReader, so I am assuming that I am doing the second approach (already store the vector). Is that correct?

Yes, correct.
 

You could use the VectorIndex representation to store whole document vectors.
We haven't implemented this specific version so far, as ESA indexes are generic, i.e. you can compute the similarity for any given document pair.
Once you have precomputed document vectors, you can only measure similarity between the precomputed documents.

If you were presented with two new documents A and B, couldn't you just compute the vectors for those two and feed them to the new similarity() method (and possibly cache the vectors somewhere in case one or both of those docs come up again in the near future)?

Yes, but I assume you are fetching the word vector for each (content) word in the document. Depending on the size of the documents, the 3s per comparison are probably due to reading a lot of vectors from disk.
You can improve a bit using caching. There already is an implementation in the form of CachingVectorReader


We are currently working on improving the support for standard embedding formats, so some additions are to be expected in the next months along those lines.

-Torsten

Alain Désilets

unread,
Jan 26, 2018, 11:49:18 AM1/26/18
to DKPro Similarity Users
If you were presented with two new documents A and B, couldn't you just compute the vectors for those two and feed them to the new similarity() method (and possibly cache the vectors somewhere in case one or both of those docs come up again in the near future)?

Yes, but I assume you are fetching the word vector for each (content) word in the document. Depending on the size of the documents, the 3s per comparison are probably due to reading a lot of vectors from disk.
You can improve a bit using caching. There already is an implementation in the form of CachingVectorReader

What does CachingVectorReader do? Does it cache the vector of individual words into memory, or does it cache the vector of a list of tokens to disk? If it's the latter, then I think my app would greatly benefit from it.

Alain

Torsten Zesch

unread,
Jan 26, 2018, 11:51:21 AM1/26/18
to Alain Désilets, DKPro Similarity Users
It‘s the first one. Word vectors to memory.
--
You received this message because you are subscribed to the Google Groups "DKPro Similarity Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dkpro-similarity-...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages