Similarity between a (large) document corpus and a (small) document corpus

Stefano Zacchiroli

Oct 25, 2021, 10:19:16 AM
to Gensim
Heya, I'm new to gensim, so first of all let me thank you all for this amazing tool!

I'm dealing with the following scenario.
I have a large-ish corpus of ~6M text documents, and I know/expect that most of them will be very similar to a single document from a small-ish golden set of ~100 documents.
I'm trying to use the various similarity classes to "partition" the input into those 100 documents + outliers.

What is not clear to me is what would be the most appropriate approach among the following ones:
1) build a similarity index for the entire (large + small) corpus and then query it with all the documents from the large corpus, looking for which document, among the small set, is the closest
2) build a similarity index only using the small corpus, and query it with all the documents from the large corpus
3) INSERT HERE MIDDLE GROUND SOLUTION that I'm not seeing.

The problem with (1) is that building the index is quite expensive.
The problem with (2) is that I'll be querying an index with a lot of terms that it has never seen (although that's maybe not a problem?).
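For concreteness, here's roughly what I have in mind for (2) (just a sketch; `golden_docs` and `big_corpus_docs` stand in for my own tokenized corpora, i.e. lists of token lists):

from gensim import corpora, models, similarities

# vocabulary and TF-IDF model built from the ~100 golden documents only
dictionary = corpora.Dictionary(golden_docs)
golden_bow = [dictionary.doc2bow(doc) for doc in golden_docs]
tfidf = models.TfidfModel(golden_bow)

# similarity index over just the 100 golden docs
index = similarities.SparseMatrixSimilarity(tfidf[golden_bow], num_features=len(dictionary))

for doc in big_corpus_docs:
    bow = dictionary.doc2bow(doc)   # terms unknown to the golden dictionary are silently dropped
    sims = index[tfidf[bow]]        # cosine similarities against each of the 100 golden docs
    best = sims.argmax()            # candidate "bucket" for this document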

Any suggestion?

Thanks a lot for your help!

Gordon Mohr

Oct 26, 2021, 6:23:28 PM
to Gensim
How big are the individual documents, in word-count?

Because there are many possible ways to assess "document similarity" (for any particular purpose), there are many potential ways to do this, with no one necessarily being "the best" – as results will often vary based on the quality of your data, your own preprocessing/enrichment steps, & your specific problem domain & goals. So you'll probably want to try a bunch of things & compare them against each other, starting very simple, tinkering to see what can demonstrably improve on the simple approach. 

For example, you might evaluate your results by gradually hand-labeling more and more of the 'desired' associations – which of the 6M docs should go into which of the 100 'prototypical' doc buckets. For each approach, check how many such associations the approach "gets right". Potentially, expand the 'ground truth' set by adding labels for more of the 'tough cases'. In many cases, 'tough cases' may be detectable by how strongly the 'top' label is asserted (either overall, or specifically versus the 2nd-best, etc).
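As a tiny (hypothetical) illustration of that check, with `ground_truth` and `predicted` as dicts mapping doc ids to bucket labels:

# `ground_truth`: hand-assigned {doc_id: bucket}; `predicted`: {doc_id: bucket} from whatever
# approach is being evaluated (names are illustrative, not from any particular library)
labelled = ground_truth.keys() & predicted.keys()
accuracy = sum(predicted[d] == ground_truth[d] for d in labelled) / len(labelled)
print(f"{accuracy:.1%} of {len(labelled)} hand-labelled docs assigned to the right bucket")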

Once you have a large enough set of 'ground truth' labels - at least, a few for each bucket - you could even just skip forward, not using any distinct 'similarity' measure, to trying any number of `scikit-learn` text-classification techniques to train a classifier which puts each candidate doc into one of the 100 bins. This doesn't even technically require looking at the 'prototype' docs at all, though you might want to use them as seeds for the training set.

Indeed, training a classifier on *just* the 100 'prototype' docs, then starting to review its results on the other 6M, taking items it got right, or the corrected results you create during review, as new 'gold standard' training/testing data, is a reasonable way to get started. 

Another super-simple baseline to try based more on Gensim similarity tools could be:
* grab an off-the-shelf set of word-vectors, such as the GoogleNews-2013 set (which depending on your problem domain, might still be pretty good, or hopelessly dominated by older & less-relevant word-senses)
* summarize every document as the average-vector of each of its word-vectors - this is crude but fast
* for each of the 6M docs, find the top-N most-similar vectors (by cosine-similarity) in the 100 - this is again crude but fast

This is likely to already get a lot of the bucket-assignment right; looking at the cases where it doesn't, or where it lacks confidence (no one reference doc sticks out as far-more-similar than others) will give hints as to what sorts of extra distinctions you might want to add.
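As a rough sketch of that baseline (gensim 4.x API; the `prototype_docs` / `candidate_docs` iterators of (doc_id, tokens) pairs are placeholders for your own corpus handling):

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def doc_vector(tokens):
    # crude-but-fast summary: average the vectors of all in-vocabulary tokens
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else None

# hold the 100 reference-doc vectors in their own little KeyedVectors...
protos = KeyedVectors(vector_size=wv.vector_size)
for doc_id, tokens in prototype_docs:
    vec = doc_vector(tokens)
    if vec is not None:
        protos.add_vector(doc_id, vec)

# ...then, for each candidate doc, ask which reference doc is closest by cosine-similarity
for doc_id, tokens in candidate_docs:
    vec = doc_vector(tokens)
    if vec is not None:
        best_proto, score = protos.most_similar([vec], topn=1)[0]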

After either or both of these approaches have set some initial baseline parameters for what's possible & easy, you'd gradually try more things, comparing any improvement they achieve to see which are worth the effort. For example:

* other feature-enhancements, like different preprocessing/tokenizations (such as turning known or common phrases into single-tokens)
* other weightings of the words in the average
* other ways to vectorize the text, like `Doc2Vec` or off-the-shelf deep neural models
* other ways to compute text-to-text similarity, like "Word Mover's Distance" (which, being quite a bit more expensive, might be impractical on longer texts or the full 6M set of docs; see the snippet after this list)
* using the vector-based text representations (either the wv-average or other vectors) as the input to formal classifiers, so that you're no longer just using "single closest reference doc" as the classification method (and thus, via training examples, the classifier learns when classes may be more exotically-shaped regions of the full feature space than just nearest-spheres around the reference docs)
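For instance, the Word Mover's Distance item boils down to something like this (assuming the `wv` word-vectors and tokenized docs from the earlier sketch; WMD needs an extra optional dependency and is far slower, so it's probably only viable on a subset):

# one candidate doc vs. all 100 prototypes; smallest distance = closest prototype
distances = [wv.wmdistance(candidate_tokens, proto_tokens)
             for proto_tokens in prototype_token_lists]
best_bucket = min(range(len(distances)), key=distances.__getitem__)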

- Gordon

Stefano Zacchiroli

Oct 28, 2021, 5:21:15 AM
to Gensim
Hi Gordon, thanks a lot for your suggestions, which I found to be super-useful!
Answers/comments to specific points are in-line below:

On Wednesday, October 27, 2021 at 12:23:28 AM UTC+2 Gordon Mohr wrote:
How big are the individual documents, in word-count?

Here are the corpus stats, *before* any filtering (so even stuff like stopwords is still included here):

         line_count    word_count          size
count  4.988074e+06  4.988074e+06  6.008466e+06
mean   1.138753e+02  7.463865e+02  9.133901e+03
std    4.299756e+03  1.190882e+04  2.238914e+05
min    1.000000e+00  0.000000e+00  0.000000e+00
25%    2.000000e+01  1.680000e+02  1.065000e+03
50%    2.100000e+01  1.690000e+02  1.080000e+03
75%    2.600000e+01  1.840000e+02  2.054000e+03
max    6.373094e+06  7.374871e+06  1.048576e+08

I suspect that in gensim land, outliers aside, these would qualify as fairly small documents.
 
Because there are many possible ways to assess "document similarity" (for any particular purpose), there are many potential ways to do this, with no one necessarily being "the best" – as results will often vary based on the quality of your data, your own preprocessing/enrichment steps, & your specific problem domain & goals. So you'll probably want to try a bunch of things & compare them against each other, starting very simple, tinkering to see what can demonstrably improve on the simple approach. 

For example, you might evaluate your results by gradually hand-labeling more and more of the 'desired' associations – which of the 6M docs should go into which of the 100 'prototypical' doc buckets. For each approach, check how many such associations the approach "gets right". Potentially, expand the 'ground truth' set by adding labels for more of the 'tough cases'. In many cases, 'tough cases' may be detectable by how strongly the 'top' label is asserted (either overall, or specifically versus the 2nd-best, etc).

Yeah, I'm indeed in the process of exploring various options by trial & error, as you suggest. Thus far I had excluded supervised approaches (based on ground truth labeling), but based on your suggestion I'll probably reconsider and give that a try too. (FWIW, my other message on the list about using DBSCAN on gensim corpora is my ongoing attempt at unsupervised classification of this corpus.)
 
Indeed, training a classifier on *just* the 100 'prototype' docs, then starting to review its results on the other 6M, taking items it got right, or the corrected results you create during review,  as new 'gold standard' training/testing data, is a reasonable way to get started. 

Just to be sure I'm reading this correctly: when you say "training a classifier" here, do you mean using one of the various `similarities.docsim` classes, or something else?
The former is what I had in mind with my original point (2).
If you instead mean actual training (e.g., in the supervised machine learning sense), I can indeed use my golden set as (trivial) ground truth, but it will only have 1 sample per label, so I'm skeptical it will work well when exposed to the diversity of the full corpus.

Another super-simple baseline to try based more on Gensim similarity tools could be:
* grab an off-the-shelf set of word-vectors, such as the GoogleNews-2013 set (which depending on your problem domain, might still be pretty good, or hopelessly dominated by older & less-relevant word-senses)
* summarize every document as the average-vector of each of its word-vectors - this is crude but fast
* for each of the 6M docs, find the top-N most-similar vectors (by cosine-similarity) in the 100 - this is again crude but fast

Interesting, thanks for the idea, I'll look into this. About the last point, is that an operation that is already vectorized/streamed by gensim somehow or should it be implemented as a loop on all documents?
 
After either or both of these approaches have set some initial baseline parameters for what's possible & easy, you'd gradually try more things, comparing any improvement they achieve to see which are worth the effort. For example:

Thanks for these precious leads too!

Cheers

Gordon Mohr

Oct 28, 2021, 3:09:04 PM
to Gensim
On Thursday, October 28, 2021 at 2:21:15 AM UTC-7 za...@upsilon.cc wrote:
On Wednesday, October 27, 2021 at 12:23:28 AM UTC+2 Gordon Mohr wrote:
How big are the individual documents, in word-count?

Here are the corpus stats, *before* any filtering (so even stuff like stopwords is still included here):

         line_count    word_count          size
count  4.988074e+06  4.988074e+06  6.008466e+06
mean   1.138753e+02  7.463865e+02  9.133901e+03
std    4.299756e+03  1.190882e+04  2.238914e+05
min    1.000000e+00  0.000000e+00  0.000000e+00
25%    2.000000e+01  1.680000e+02  1.065000e+03
50%    2.100000e+01  1.690000e+02  1.080000e+03
75%    2.600000e+01  1.840000e+02  2.054000e+03
max    6.373094e+06  7.374871e+06  1.048576e+08

I suspect that in gensim land, outliers aside, these would qualify as fairly small documents.

I'm not sure how to turn that report into a count of docs, and words per document. (In typical input files, docs are often 1 to a line... so that line_count looks like you have almost 5M docs. But then that word_count suggests the exact same number of words, which would leave 1 word per document, which wouldn't be usefully trainable content.)  

Note that these algorithms don't necessarily need stopwords removed; the downsampling from `sample` already skips many of the most-common words. 

Because there are many possible ways to assess "document similarity" (for any particular purpose), there are many potential ways to do this, with no one necessarily being "the best" – as results will often vary based on the quality of your data, your own preprocessing/enrichment steps, & your specific problem domain & goals. So you'll probably want to try a bunch of things & compare them against each other, starting very simple, tinkering to see what can demonstrably improve on the simple approach. 

For example, you might evaluate your results by gradually hand-labeling more and more of the 'desired' associations – which of the 6M docs should go into which of the 100 'prototypical' doc buckets. For each approach, check how many such associations the approach "gets right". Potentially, expand the 'ground truth' set by adding labels for more of the 'tough cases'. In many cases, 'tough cases' may be detectable by how strongly the 'top' label is asserted (either overall, or specifically versus the 2nd-best, etc).

Yeah, I'm indeed in the process of exploring various options by trial & error, as you suggest. Thus far I had excluded supervised approaches (based on ground truth labeling), but based on your suggestion I'll probably reconsider and give that a try too. (FWIW, my other message on the list about using DBSCAN on gensim corpora is my ongoing attempt at unsupervised classification of this corpus.)

I do think that you'll need some 'ground truth' data just to be able to evaluate your own incremental improvements... and once you have enough labeled data, all sorts of text-classification methods become options. 
 
Indeed, training a classifier on *just* the 100 'prototype' docs, then starting to review its results on the other 6M, taking items it got right, or the corrected results you create during review,  as new 'gold standard' training/testing data, is a reasonable way to get started. 

Just to be sure I'm reading this correctly: when you say "training a classifier" here, do you mean using one of the various `similarities.docsim` classes, or something else?
The former is what I had in mind with my original point (2).
If you instead mean actual training (e.g., in the supervised machine learning sense), I can indeed use my golden set as (trivial) ground truth, but it will only have 1 sample per label, so I'm skeptical it will work well when exposed to the diversity of the full corpus.

I mean classifiers like those in `scikit-learn` – no Gensim necessarily involved at all – like in the tutorial at <https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html>, or among the range of options listed in the overview diagram at <https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html>.

You may be skeptical about how well training on just 1 positive example per class might do... but it might do very well! You don't know until you try & evaluate... which is pretty easy code-wise, using off-the-shelf tools. The hardest part is the evaluation... but the very act of double-checking its crude initial guesses, & remembering the ones you confirm or correct, expands your training/testing data. You'll ultimately want at least a handful, and perhaps dozens, of labeled examples per class - and then you can use most of them, not just your original 100 seed docs, for future training, and a smaller subset for testing. 
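A minimal sketch along the lines of the linked tutorial; `proto_texts`/`proto_labels` (the 100 seed docs as raw strings plus their bucket labels) and `candidate_texts` are placeholders for your own data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(proto_texts, proto_labels)            # start with just 1 example per class

pred = clf.predict(candidate_texts)           # review a sample; keep the ones you confirm or correct
proba = clf.predict_proba(candidate_texts)    # rows with a low max-probability are the 'tough cases'
# ...then fold the confirmed/corrected labels back into proto_texts/proto_labels and re-fit.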

Another super-simple baseline to try based more on Gensim similarity tools could be:
* grab an off-the-shelf set of word-vectors, such as the GoogleNews-2013 set (which depending on your problem domain, might still be pretty good, or hopelessly dominated by older & less-relevant word-senses)
* summarize every document as the average-vector of each of its word-vectors - this is crude but fast
* for each of the 6M docs, find the top-N most-similar vectors (by cosine-similarity) in the 100 - this is again crude but fast

Interesting, thanks for the idea, I'll look into this. About the last point, is that an operation that is already vectorized/streamed by gensim somehow or should it be implemented as a loop on all documents?

There's not yet a batch `most_similar_to_each()` function in Gensim, which might help a bit. But the source of `most_similar()` already does a bulk comparison of its single target against all candidates that's pretty efficient (using vectorized & often multicore BLAS operations). You could copy its logic, or load your 100 candidate vectors into a `KeyedVectors` instance to use `most_similar()` directly. And doing a `most_similar()` 6M times against a mere 100 targets may not take very long. 
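For illustration, the "copy its logic" route is essentially just this (a sketch; `proto_vecs`, a 100 x dim array, and the batches of candidate vectors are assumed to come from the vector-averaging step):

import numpy as np

def unit_rows(m):
    # L2-normalize each row, so a plain dot product equals cosine-similarity
    return m / np.linalg.norm(m, axis=1, keepdims=True)

protos_n = unit_rows(proto_vecs)
for batch in candidate_vector_batches():      # hypothetical generator of (N x dim) arrays
    sims = unit_rows(batch) @ protos_n.T      # (N x 100) cosine similarities in one BLAS call
    best = sims.argmax(axis=1)                # closest reference doc per candidate
    top2 = np.sort(sims, axis=1)[:, -2:]      # top-2 scores, e.g. to flag low-margin 'tough cases'
    margin = top2[:, 1] - top2[:, 0]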

- Gordon

Stefano Zacchiroli

Oct 29, 2021, 7:06:12 AM
to Gensim
On Thursday, October 28, 2021 at 9:09:04 PM UTC+2 Gordon Mohr wrote:
Here are the corpus stats, *before* any filtering (so even stuff like stopwords is still included here):

         line_count    word_count          size
count  4.988074e+06  4.988074e+06  6.008466e+06
mean   1.138753e+02  7.463865e+02  9.133901e+03
std    4.299756e+03  1.190882e+04  2.238914e+05
min    1.000000e+00  0.000000e+00  0.000000e+00
25%    2.000000e+01  1.680000e+02  1.065000e+03
50%    2.100000e+01  1.690000e+02  1.080000e+03
75%    2.600000e+01  1.840000e+02  2.054000e+03
max    6.373094e+06  7.374871e+06  1.048576e+08

I'm not sure how to turn that report into a count of docs, and words per document. (In typical input files, docs are often 1 to a line... so that line_count looks like you have almost 5M docs. But then that word_count suggests the exact same number of words, which would leave 1 word per document, which wouldn't be usefully trainable content.) 

Right, I should have clarified this. My documents are regular text files, one document per file, and the counts above are in wc (the Unix CLI tool) style.
I tokenize them before processing with gensim, returning one list of "words" per document, but the counts above were taken *before* that step.
I guess for the usual metrics used in gensim land you should just ignore the 1st and 3rd columns, and only look at the word_count column.
I have ~5M documents that can be parsed as text (and ~1M more that cannot, for which only the size in bytes is meaningful; those are irrelevant for our conversation as they're going to be ignored).
The average size in words (before any stop word filtering) is ~750 words, the median size ~170 (which is also about the size of most documents, given the 25th and 75th percentiles are very similar).
There are also some crazy outliers (7M words!) which I should probably just filter out.
 
Note that these algorithms don't necessarily need stopwords removed; the downsampling from `sample` already skips many of the most-common words. 

Interesting, thanks.
 
I mean classifiers like those in `scikit-learn` – no Gensim necessarily involved at all – like in the tutorial at <https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html>, or among the range of options listed in the overview diagram at <https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html>.

I see, thanks, I'll give this a try as well.
 
* for each of the 6M docs, find the top-N most-similar vectors (by cosine-similarity) in the 100 - this is again crude but fast

Something else came to mind about this point which I didn't ask before.
You mean cosine similarity as implemented by one of the Similarity classes on any "simple" vectorized representation of the corpus, right?
For instance, either straight bag-of-words out of the box, or its TFIDF equivalent.

(I'm asking because I'm also trying LSI on the corpus, with the idea of measuring similarities in LSI space, but: (1) it's taking forever to build the model and (2) I'm not even sure it's appropriate for my corpus, as I expect the relevant similarity distance to be much more "syntactic" than "semantic".)
 
Interesting, thanks for the idea, I'll look into this. About the last point, is that an operation that is already vectorized/streamed by gensim somehow or should it be implemented as a loop on all documents?

There's not yet a batch `most_similar_to_each()` function in Gensim, which might help a bit. But the source of `most_similar()` already does a bulk comparison of its single target against all candidates that's pretty efficient (using vectorized & often multicore BLAS operations). You could copy its logic, or load your 100 candidate vectors into a `KeyedVectors` instance to use `most_similar()` directly. And doing a `most_similar()` 6M times against a mere 100 targets may not take very long. 

The only `most_similar()` function I've found in the API is in the Word2Vec module, which I'm not using.
I'll look at its logic anyway, but please shout if there's another `most_similar()` function that I'm missing.

(BTW documentation search at https://radimrehurek.com/gensim/ doesn't work for me, it appears to hang forever on "searching...", without returning any result.)

Thanks again for your awesome help!
Cheers