How big are the individual documents, in word-count?
Because there are many possible ways to assess "document similarity" (for any particular purpose), there are many potential approaches, with no one necessarily being "the best" – as results will often vary based on the quality of your data, your own preprocessing/enrichment steps, & your specific problem domain & goals. So you'll probably want to try a bunch of things & compare them against each other, starting very simple, tinkering to see what can demonstrably improve on the simple approach.

For example, you might evaluate your results by gradually hand-labeling more and more of the 'desired' associations – which of the 6M docs should go into which of the 100 'prototypical' doc buckets. For each approach, check how many such associations the approach "gets right". Potentially, expand the 'ground truth' set by adding labels for more of the 'tough cases'. In many cases, 'tough cases' may be detectable by how strongly the 'top' label is asserted (either overall, or specifically versus the 2nd-best, etc).
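A minimal untested sketch of that kind of scoring – assuming you keep the hand labels in a plain dict of doc-id → prototype-id, & that each approach exposes a function returning a ranked list of (prototype-id, score) pairs per doc (all the names here are just placeholders):

```python
# Hypothetical hand-labeled 'gold' associations: doc-id -> prototype-id
gold = {'doc123': 'proto7', 'doc456': 'proto42'}

def evaluate(predict_ranked, gold, margin_threshold=0.05):
    """predict_ranked(doc_id) -> list of (prototype_id, score), best first (at least 2 entries)."""
    hits, tough_cases = 0, []
    for doc_id, true_proto in gold.items():
        ranked = predict_ranked(doc_id)
        (top_proto, top_score), (_, second_score) = ranked[0], ranked[1]
        if top_proto == true_proto:
            hits += 1  # this approach "got it right"
        if top_score - second_score < margin_threshold:
            tough_cases.append(doc_id)  # weakly-asserted top label: candidate for more hand-review
    return hits / len(gold), tough_cases
```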
Indeed, training a classifier on *just* the 100 'prototype' docs, then starting to review its results on the other 6M, taking items it got right, or the corrected results you create during review, as new 'gold standard' training/testing data, is a reasonable way to get started.
Another super-simple baseline to try based more on Gensim similarity tools could be:

* grab an off-the-shelf set of word-vectors, such as the GoogleNews-2013 set (which depending on your problem domain, might still be pretty good, or hopelessly dominated by older & less-relevant word-senses)
* summarize every document as the average-vector of each of its word-vectors - this is crude but fast
* for each of the 6M docs, find the top-N most-similar vectors (by cosine-similarity) in the 100 - this is again crude but fast
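A rough untested sketch of that baseline, assuming Gensim 4.x & the usual GoogleNews binary file name – adapt paths, tokenization, & the `prototype_docs` placeholder (your 100 token-lists) to your own pipeline:

```python
import numpy as np
from gensim.models import KeyedVectors

# Load pretrained word-vectors; the file name here is just the standard GoogleNews download.
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def doc_vector(tokens):
    """Crude-but-fast doc summary: the average of the vectors of its in-vocabulary words."""
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        return np.zeros(wv.vector_size, dtype=np.float32)
    return np.mean(vecs, axis=0)

# Represent the 100 prototype docs as unit-length vectors, stacked into one matrix.
proto_vecs = np.vstack([doc_vector(tokens) for tokens in prototype_docs])  # prototype_docs: your 100 token-lists
proto_vecs /= np.linalg.norm(proto_vecs, axis=1, keepdims=True)

def top_n_prototypes(tokens, n=3):
    """Cosine-similarity of one doc against all 100 prototypes; return the best n (index, score) pairs."""
    v = doc_vector(tokens)
    v /= (np.linalg.norm(v) or 1.0)
    sims = proto_vecs @ v
    best = np.argsort(-sims)[:n]
    return [(int(i), float(sims[i])) for i in best]
```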
After either or both of these approaches have set some initial baseline parameters for what's possible & easy, you'd gradually try more things, comparing any improvement they achieve to see which are worth the effort. For example:
On Wednesday, October 27, 2021 at 12:23:28 AM UTC+2 Gordon Mohr wrote:

How big are the individual documents, in word-count?

Here are the corpus stats, *before* any filtering (so even stuff like stopwords is still included here):

       line_count    word_count          size
count  4.988074e+06  4.988074e+06  6.008466e+06
mean   1.138753e+02  7.463865e+02  9.133901e+03
std    4.299756e+03  1.190882e+04  2.238914e+05
min    1.000000e+00  0.000000e+00  0.000000e+00
25%    2.000000e+01  1.680000e+02  1.065000e+03
50%    2.100000e+01  1.690000e+02  1.080000e+03
75%    2.600000e+01  1.840000e+02  2.054000e+03
max    6.373094e+06  7.374871e+06  1.048576e+08

I suspect that in gensim land, outliers aside, these would qualify as fairly small documents.
Because there are many possible ways to assess "document similarity" (for any particular purpose), there are many potential approaches, with no one necessarily being "the best" – as results will often vary based on the quality of your data, your own preprocessing/enrichment steps, & your specific problem domain & goals. So you'll probably want to try a bunch of things & compare them against each other, starting very simple, tinkering to see what can demonstrably improve on the simple approach.

For example, you might evaluate your results by gradually hand-labeling more and more of the 'desired' associations – which of the 6M docs should go into which of the 100 'prototypical' doc buckets. For each approach, check how many such associations the approach "gets right". Potentially, expand the 'ground truth' set by adding labels for more of the 'tough cases'. In many cases, 'tough cases' may be detectable by how strongly the 'top' label is asserted (either overall, or specifically versus the 2nd-best, etc).

Yeah, I'm indeed in the process of exploring various options by trial & error, as you suggest. Thus far I had excluded supervised approaches (based on ground-truth labeling), but based on your suggestion I'll probably reconsider and give that a try too. (FWIW, my other message on the list about using DBSCAN on gensim corpora is my ongoing attempt at unsupervised classification of this corpus.)
Indeed, training a classifier on *just* the 100 'prototype' docs, then starting to review its results on the other 6M, taking items it got right, or the corrected results you create during review, as new 'gold standard' training/testing data, is a reasonable way to get started.

Just to be sure I'm reading this correctly: when you say "training a classifier" here, do you mean using one of the various `similarities.docsim` classes, or something else? The former is what I had in mind with my original point (2).

If you instead mean actual training (e.g., in the supervised machine-learning sense), I can indeed use my golden set as (trivial) ground truth, but it will only have 1 sample per label, so I'm skeptical it will work well when exposed to the diversity of the full corpus.
Another super-simple baseline to try based more on Gensim similarity tools could be:

* grab an off-the-shelf set of word-vectors, such as the GoogleNews-2013 set (which depending on your problem domain, might still be pretty good, or hopelessly dominated by older & less-relevant word-senses)
* summarize every document as the average-vector of each of its word-vectors - this is crude but fast
* for each of the 6M docs, find the top-N most-similar vectors (by cosine-similarity) in the 100 - this is again crude but fast

Interesting, thanks for the idea, I'll look into this. About the last point, is that an operation that is already vectorized/streamed by gensim somehow, or should it be implemented as a loop over all documents?
Here are the corpus stats, *before* any filtering (so even stuff like stopwords is still included here):

       line_count    word_count          size
count  4.988074e+06  4.988074e+06  6.008466e+06
mean   1.138753e+02  7.463865e+02  9.133901e+03
std    4.299756e+03  1.190882e+04  2.238914e+05
min    1.000000e+00  0.000000e+00  0.000000e+00
25%    2.000000e+01  1.680000e+02  1.065000e+03
50%    2.100000e+01  1.690000e+02  1.080000e+03
75%    2.600000e+01  1.840000e+02  2.054000e+03
max    6.373094e+06  7.374871e+06  1.048576e+08
I'm not sure how to turn that report into a count of docs, and words per document. (In typical input files, docs are often 1 to a line... so that line_count looks like you have almost 5M docs. But then that word_count suggests the exact same number of words, which would leave 1 word per document, which wouldn't be usefully trainable content.)
Note that these algorithms don't necessarily need stopwords removed; the downsampling from `sample` already skips many of the most-common words.
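(If you do end up training your own Word2Vec/Doc2Vec models rather than only using pretrained vectors, that downsampling is controlled by the `sample` parameter – a toy illustration, with a made-up corpus:)

```python
from gensim.models import Word2Vec

# Toy corpus just for illustration; in practice stream your real tokenized docs.
corpus = [
    ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'],
    ['the', 'dog', 'barks', 'at', 'the', 'fox'],
]

# `sample` controls how aggressively very-frequent words (like 'the') are randomly
# downsampled during training, so explicit stopword removal is often unnecessary.
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, sample=1e-4, epochs=5)
```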
I mean classifiers like those in `scikit-learn` – no Gensim necessarily involved at all – like in the tutorial at <https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html>, or among the range of options listed in the overview diagram at <https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html>.
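For example, a rough untested sketch along the lines of that tutorial (the texts & labels below are just placeholders for your 100 prototype docs & their categories):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Placeholder data: the 100 prototype docs (raw text) & their category labels.
prototype_texts = ['text of prototype doc 1 ...', 'text of prototype doc 2 ...']
prototype_labels = ['category_a', 'category_b']

# Bag-of-words TF-IDF features feeding a simple linear classifier, as in the sklearn text tutorial.
clf = make_pipeline(TfidfVectorizer(), SGDClassifier())
clf.fit(prototype_texts, prototype_labels)

# Predict a label for each of the other 6M docs (stream in batches in practice), then
# hand-review a sample of the predictions to grow the 'gold standard' training/testing set.
predicted = clf.predict(['some unlabeled doc text ...'])
```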
* for each of the 6M docs, find the top-N most-similar vectors (by cosine-similarity) in the 100 - this is again crude but fast
Interesting, thanks for the idea, I'll look into this. About the last point, is that an operation that is already vectorized/streamed by gensim somehow, or should it be implemented as a loop over all documents?

There's not yet a batch `most_similar_to_each()`-style function in Gensim, which might help a bit here. But the source of `most_similar()` already does a bulk comparison of its single target against all candidates that's pretty efficient (using vectorized & often multicore BLAS operations). You could copy its logic, or load your 100 candidate vectors into a `KeyedVectors` instance to use `most_similar()` directly. And doing a `most_similar()` 6M times against a mere 100 targets may not take very long.
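A rough untested sketch of that `KeyedVectors` option, assuming Gensim 4.x – the prototype ids & vectors below are random placeholders; in practice they'd be your 100 doc-vectors (e.g. from the averaging baseline earlier):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder prototype vectors & string ids standing in for the 100 'prototype' doc-vectors.
proto_ids = [f'proto_{i}' for i in range(100)]
proto_vecs = np.random.rand(100, 300).astype(np.float32)

# Load the 100 candidate vectors into their own small KeyedVectors instance.
proto_kv = KeyedVectors(vector_size=proto_vecs.shape[1])
proto_kv.add_vectors(proto_ids, proto_vecs)

# For each of the 6M docs, ask for the top-N most-cosine-similar prototypes;
# most_similar() accepts a raw vector & compares it against all 100 keys in one vectorized pass.
doc_vec = np.random.rand(300).astype(np.float32)  # stand-in for one document's vector
top3 = proto_kv.most_similar(positive=[doc_vec], topn=3)  # list of (prototype_id, cosine_sim), best first
```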