Instances of semantic similarity with Doc2Vec?

gabor...@maximilianeum.de

Aug 7, 2022, 6:21:45 PM
to Gensim
Dear All,

I have trained a Word2Vec model on a reasonably large corpus (800,000 tokens); I identified terms that are semantically closely related, and now I would like to find instances of these semantic similarities in the original texts of the corpus. Say, 'queen' and 'crown' are close to each other, and I would like to find texts that demonstrate this semantic similarity.

I used Doc2Vec in the following way, but the result does not make sense.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(lems)]
documentModel = Doc2Vec(window=5, vector_size=50)
documentModel.build_vocab(documents)
documentModel.train(documents, total_examples=len(documents), epochs=model.epochs)
termsIndex = [model.wv.key_to_index['queen'], model.wv.key_to_index['crown']]
similar_docs = documentModel.docvecs.most_similar(termsIndex)

I am wondering if the document tag can be some kind of unique identifier for each document. My assumption is that the output of most_similar is then the unique identifiers (and the similarity values) of the documents most relevant to the two terms (queen and crown). Is this right?

I also tried this method, but with no meaningful result:

import numpy as np
from numpy.linalg import norm

new_vector = documentModel.infer_vector(['queen', 'crown'])

results = []
# Iterate through all document vectors
for doc in documentModel.dv.vectors:
    # Calculate the cosine similarity
    result = np.dot(doc, new_vector) / (norm(doc) * norm(new_vector))
    results.append(result)

# Find the index of the document with the highest value
index = np.argmax(np.array(results))

# Take this index and check the original text corresponding to it (not meaningful)
print(originalTexts[index])

Many thanks,

Gabor

Gordon Mohr

Aug 11, 2022, 4:34:41 PM
to Gensim
On Sunday, August 7, 2022 at 3:21:45 PM UTC-7 gabor...@maximilianeum.de wrote:
Dear All,

I have trained a Word2Vec model on a reasonably large corpus (800,000 tokens); I identified terms that are semantically closely related, and now I would like to find instances of these semantic similarities in the original texts of the corpus. Say, 'queen' and 'crown' are close to each other, and I would like to find texts that demonstrate this semantic similarity.

Note that whether 800k raw tokens is sufficient for quality results will also depend on:

* how many unique documents those tokens are spread over

* how many unique words survive the default `min_count=5` cutoff (or whatever other value you might choose)

* other parameter choices, especially `vector_size`, `epochs`, & `window`, which will affect the amount of training done, the size of the model compared to the training data, & the relative influence accorded to word-to-word relations versus doc(tag)-to-word patterns
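
For instance, after `build_vocab()` you can sanity-check those numbers directly on the model (a minimal sketch, reusing the `documents` and `documentModel` from your code below):

    # after documentModel.build_vocab(documents):
    print(len(documents))                      # how many unique documents
    print(len(documentModel.wv.key_to_index))  # unique words that survived min_count
    print(documentModel.corpus_total_words)    # total raw tokens scanned from the corpus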

I used Doc2Vec in the following way, but the result does not make sense.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(lems)]
documentModel = Doc2Vec(window=5, vector_size=50)
documentModel.build_vocab(documents)
documentModel.train(documents, total_examples=len(documents), epochs=model.epochs)
termsIndex = [model.wv.key_to_index['queen'], model.wv.key_to_index['crown']]
similar_docs = documentModel.docvecs.most_similar(termsIndex)

I am wondering if the document tag can be some kind of unique identifier for each document. My assumption is that the output of most_similar is then the unique identifiers (and the similarity values) of the documents most relevant to the two terms (queen and crown). Is this right?

Yes, in the original & most-common application of the `Doc2Vec` algorithm, the `tags` for each document will only be a single `tag`, which serves as a unique lookup-ID for that document's doc-vector. And using plain incrementing integers as the document IDs is an acceptable and efficient choice.
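
For instance, with plain integer tags, the tag doubles as the lookup key for the trained doc-vector (a minimal sketch; tag `7` is arbitrary, & `.dv` is just the newer name for `.docvecs`):

    vector_for_doc_7 = documentModel.dv[7]                  # trained doc-vector for the doc tagged 7
    similar_to_doc_7 = documentModel.dv.most_similar([7])   # other docs nearest to doc 7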

What you've tried here looks like you intended to average the word-vectors together into a single target vector, then return the doc-vectors that are closest to that vector. 

In general, with sufficient data/training/parameters, and if there are some meaningful/learnable differences in the corpus between documents discussing 'queen' and 'crown' versus other topics, I would expect that intent, *if properly executed*, to return the document IDs of documents more-relevant to those words than other documents. 

BUT, you've made an error here: rather than supplying the words' *vectors* for the doc-vector lookup, you're supplying their raw integer indexes, which `most_similar()` will likely interpret as doc-tags, i.e. as doc-vector lookups.

First, you seem to be getting word-vectors from some *other* model, named `model`, instead of the `documentModel` you've just trained. Vectors from models trained separately will generally *not* be in a compatible coordinate-space (without specific extra forcing steps). So if you want to use word-vectors for `'queen'` and `'crown'` that have relevant positions compared to the `documentModel.docvecs` doc-vectors, they have to come specifically from `documentModel.wv`, not some other `model`. 

The `model.wv` vectors (for words) and the `model.docvecs` vectors (for doc-tags) are separate structures/namespaces, even when they are co-trained into the same vector space (as in the default `dm=1` mode that you haven't overridden). 

So you'll likely get more-sensible results replacing your last two lines with:
    target_vectors = [documentModel.wv['queen'], documentModel.wv['crown']]
    similar_docs = documentModel.docvecs.most_similar(positive=target_vectors)

Also: to the extent your data may be a little thin, and you've already adjusted for that by using a relatively-small `vector_size=50`, you may also want to consider more `epochs` of training, to squeeze as much as possible out of the limited data. By not specifying `epochs`, you're inheriting the `Word2Vec` default of `epochs=5`, which really needs plentiful word-data to work well. Published `Doc2Vec` work often uses at least 10 to 20 epochs; with a smaller amount of data, you may want to try even higher values.
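
For example, a minimal sketch (the `epochs=20` here is just an illustrative starting value to tune, not a recommendation):

    documentModel = Doc2Vec(window=5, vector_size=50, epochs=20)
    documentModel.build_vocab(documents)
    documentModel.train(documents,
                        total_examples=documentModel.corpus_count,
                        epochs=documentModel.epochs)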
 
I also tried this method, but with no meaningful result:

import numpy as np
from numpy.linalg import norm

new_vector = documentModel.infer_vector(['queen', 'crown'])

results = []
# Iterate through all document vectors
for doc in documentModel.dv.vectors:
    # Calculate the cosine similarity
    result = np.dot(doc, new_vector) / (norm(doc) * norm(new_vector))
    results.append(result)

# Find the index of the document with the highest value
index = np.argmax(np.array(results))

# Take this index and check the original text corresponding to it (not meaningful)
print(originalTexts[index])


Again, this approach is generally reasonable, but may require tuning and tempered expectations. 

Inferring a doc-vector from a new synthetic document – `['queen', 'crown']` – is a plausible approach for choosing a target-vector for a later `.most_similar()` operation. However:

* A mere two-word document isn't particularly 'natural', and thus, depending on lots of other model-quality considerations, might be more likely to generate a peculiar doc-vector, given its limited number of influences, than a more typically-sized many-word example document would.

* Without a specified `epochs`, `infer_vector()` reuses the model's `epochs`, which at only the default `epochs=5` (noted above as small during training) is also quite-small for inference. And, especially small for inference from a tiny (2-word) document. It means the inferred-vector is only nudged, from its initial random state, a mere *10* times: in each epoch, nudged a little to be better at predicting `'queen'`, then a little better at predicting `'crown'`. Imagine, instead, a model that trains/infers with `epochs=20`, and a supplied document of 50 words. *That* inferred doc-vector will have been nudge-improved 20 epochs * 50 words = 1000 times. So in addition to the general advice above that you're likely to benefit from more `epochs` during training, any time you're inferring new *tiny* documents, you might want to use even more `epochs`.
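
For example, a one-line sketch (the `epochs=50` is just an illustrative larger value for a tiny document, to be tuned):

    new_vector = documentModel.infer_vector(['queen', 'crown'], epochs=50)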
 
Further, you could use the existing `most_similar()` method to do essentially what you're attempting as a 1-liner:

    sims_to_inferred = documentModel.docvecs.most_similar(positive=[documentModel.infer_vector(['queen', 'crown'])])

Finally, I notice that your final display of possible-match texts uses `originalTexts`, but the training at top began with a corpus in a variable named `lems`. Keep in mind:

* Any docs that are later supplied for inference should go through the exact same preprocessing (tokenization/etc.) as the training docs did: the model only 'knows' the tokens it saw during training, and both training & inference ignore any unknown tokens. (See the sketch after this list.)

* Lemmatization doesn't necessarily help these algorithms, especially once you have enough data that all the variant forms of a word (which would otherwise lemmatize to a shared token) can each be learned with their own shades of meaning. I suppose it might help if it manages to coalesce several close forms of a word, none of which would individually survive the `min_count=5` cut, into a shared root token that does. But don't assume that it helps: try it both ways. It might be extra preprocessing complexity that pushes end-evaluations in the wrong direction!
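
On the first point, here's a minimal sketch of keeping preprocessing consistent; the `preprocess()` helper is hypothetical, standing in for whatever tokenization/lemmatization pipeline originally produced `lems`:

    def preprocess(text):
        # hypothetical stand-in for the pipeline that produced `lems` at training time
        return text.lower().split()

    # training & inference both go through the same function
    documents = [TaggedDocument(preprocess(t), [i]) for i, t in enumerate(originalTexts)]
    new_vector = documentModel.infer_vector(preprocess("the queen wore a crown"))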

- Gordon

Many thanks,

Gabor

gabor...@maximilianeum.de

Aug 16, 2022, 12:13:31 PM
to Gensim

Dear Gordon,

Many thanks for this detailed answer; I appreciate your time, and your answer is indeed very useful. I have learned a lot from it.

Cheers,

Gabor