--
You received this message because you are subscribed to the Google Groups "DKPro Similarity Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dkpro-similarity-users+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Dear Alain,
Note that there are already two different versions of ESA indexes in DKPro Similarity: one where the vector is constructed from the inverted index at query time (saves space), and one where we already store the vector (saves time). Here is a relevant issue with some code examples:
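As an aside, the query-time variant can be sketched roughly like this (a hypothetical illustration, not the code from the linked issue and not DKPro Similarity's actual API): the index maps each word to a posting list of (concept, weight) pairs, and the word's sparse concept vector is assembled on demand, trading query time for disk space.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "build the vector from the inverted index at
// query time" variant. All class and method names are illustrative.
public class InvertedIndexSketch {

    // One posting: the concept (e.g. a Wikipedia article) a word occurs in,
    // and the association strength (e.g. a TF-IDF weight).
    public static class Posting {
        final int conceptId;
        final double weight;
        public Posting(int conceptId, double weight) {
            this.conceptId = conceptId;
            this.weight = weight;
        }
    }

    private final Map<String, List<Posting>> invertedIndex = new HashMap<>();

    public void addPosting(String word, int conceptId, double weight) {
        invertedIndex.computeIfAbsent(word, w -> new ArrayList<>())
                     .add(new Posting(conceptId, weight));
    }

    // Assemble the sparse concept vector for a word at query time:
    // nothing but the posting lists is stored, so this pays the
    // construction cost on every lookup.
    public Map<Integer, Double> vectorFor(String word) {
        Map<Integer, Double> vector = new HashMap<>();
        for (Posting p : invertedIndex.getOrDefault(word, List.of())) {
            vector.merge(p.conceptId, p.weight, Double::sum);
        }
        return vector;
    }
}
```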
You could use the VectorIndex representation to store whole document vectors. We haven't implemented this specific version so far, as ESA indexes are generic, i.e. you can compute the similarity for any given document pair. Once you have precomputed document vectors, you can only measure similarity between the precomputed documents.
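To make the "precomputed document vector" idea concrete, here is a minimal hypothetical sketch (again not DKPro Similarity's actual API): a document vector is the sum of the concept vectors of its words, and document similarity is the cosine of two such vectors. Storing the results of documentVector() is precisely the step that ties you to a fixed document set.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: document vector = sum of its words' concept vectors,
// document similarity = cosine of the two document vectors.
public class EsaSketch {

    // word -> concept vector (in real ESA this comes from the index)
    private final Map<String, double[]> wordVectors = new HashMap<>();

    public void addWordVector(String word, double[] vector) {
        wordVectors.put(word, vector);
    }

    // Build a document vector by summing the vectors of its words.
    public double[] documentVector(String[] words, int dimensions) {
        double[] doc = new double[dimensions];
        for (String word : words) {
            double[] v = wordVectors.get(word);
            if (v == null) continue; // skip out-of-vocabulary words
            for (int i = 0; i < dimensions; i++) doc[i] += v[i];
        }
        return doc;
    }

    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```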
Another note: you asked specifically about ESA indexes, but I have personally found that using smaller vectors (e.g. embeddings) gives similar results with much less computation.
> Note that there are already two different versions of ESA indexes in DKPro Similarity: one where the vector is constructed from the inverted index at query time (saves space), and one where we already store the vector (saves time). Here is a relevant issue with some code examples:

In the above, when you say "vector", do you mean the vector for a given word or the vector for a given document? I am assuming "word". I am using a VectorIndexReader, so I am assuming that I am doing the second approach (already store the vector). Is that correct?
> You could use the VectorIndex representation to store whole document vectors. We haven't implemented this specific version so far, as ESA indexes are generic, i.e. you can compute the similarity for any given document pair. Once you have precomputed document vectors, you can only measure similarity between the precomputed documents.

If you were presented with two new documents A and B, couldn't you just compute the vectors for those two and feed them to the new similarity() method (and possibly cache the vectors somewhere in case one or both of those documents come up again in the near future)?
> If you were presented with two new documents A and B, couldn't you just compute the vectors for those two and feed them to the new similarity() method (and possibly cache the vectors somewhere in case one or both of those documents come up again in the near future)?

Yes, but I assume you are fetching the word vector for each (content) word in the document. Depending on the size of the documents, the 3 seconds per comparison are probably due to reading a lot of vectors from disk. You can improve things a bit using caching; there is already an implementation in the form of CachingVectorReader.
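The effect a caching reader exploits can be sketched like this (a hypothetical stand-in, not the actual CachingVectorReader implementation): put a small in-memory LRU cache in front of the slow disk-backed word-to-vector lookup, so repeated words hit the disk only once until they are evicted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a caching vector reader: an LRU cache in front of
// a slow, disk-backed word -> vector lookup. Not DKPro's actual class.
public class CachingVectorSource {

    private final Function<String, double[]> diskReader; // the slow lookup
    private final Map<String, double[]> cache;

    public CachingVectorSource(Function<String, double[]> diskReader, final int capacity) {
        this.diskReader = diskReader;
        // An access-ordered LinkedHashMap is the simplest LRU cache in the JDK:
        // removeEldestEntry evicts the least recently used entry on overflow.
        this.cache = new LinkedHashMap<String, double[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, double[]> eldest) {
                return size() > capacity;
            }
        };
    }

    public double[] getVector(String word) {
        double[] v = cache.get(word);      // hit: no disk access
        if (v == null) {
            v = diskReader.apply(word);    // miss: go to disk
            cache.put(word, v);            // may evict the least recently used entry
        }
        return v;
    }
}
```

With documents of a few hundred content words, a cache like this turns most of the per-comparison disk reads into memory lookups, which is where the bulk of the 3 seconds is presumably going.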