Word2Vec TaggedDocument tags


Mauro Pezzati

Jun 13, 2021, 2:00:20 PM
to Gensim
Hello, I would like to understand what the best way is to use the TaggedDocument tags.
If I have a situation like this:
TaggedDocument(words=["Football","fun",...], tags=["Sport","Ball","Team"])
TaggedDocument(words=["Basketball","fun",...], tags=["Sport","Ball","Team"])
or
TaggedDocument(words=["Football","fun",...], tags=[0])
TaggedDocument(words=["Basketball","fun",...], tags=[1])
What would be the difference in the results of a most_similar query?


Gordon Mohr

Jun 13, 2021, 3:09:57 PM
to Gensim
Note that the `TaggedDocument`-shaped training data – where each item has both a list of `words` and a list of `tags` – is only required for `Doc2Vec`, not `Word2Vec`.

The original 'Paragraph Vectors' papers (that defined the algorithm used by `Doc2Vec`)  just gave each document a unique ID. Each such unique ID received one trained vector - feeding into later search/ranking steps. 

Allowing more than one vector to be trained per doc, and for such named vectors to potentially be repeated across many docs, was a simple & natural extension to the algorithm. But, there's not been much (if any) published work about the ins-and-outs of that style of use. So, it's best considered an advanced option, sometimes worth a try with experimental evaluation if you think it fits a certain need, but without any real 'right' or 'wrong' way to use it, or firm expectations about what value it might deliver.

In all cases, though, the key is that the `tags` you supply are the *names* of the vectors learned by (and stored within) the model. If you need the model to answer with a list of specific documents in response to a `.most_similar()` search against its set-of-doc-vectors (the `.dv` attribute), then you should ensure each training-document has a unique ID tag. If you begin to mix in other kinds of repeated labels as `tags`, each such tag will also receive a trained vector - as if trained on a virtual, composite document with all the text of all documents that share that tag. And, such tags will then also populate any `.most_similar()` results.
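
Concretely, a rough untested sketch (with a tiny made-up corpus) of mixing a unique per-document ID tag with extra shared label tags might look like:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tiny made-up corpus: each doc gets one unique ID tag plus optional shared label tags
corpus = [
    TaggedDocument(words=["football", "fun", "team"], tags=["doc_0", "Sport"]),
    TaggedDocument(words=["basketball", "fun", "ball"], tags=["doc_1", "Sport"]),
    TaggedDocument(words=["hiking", "fun", "mountain"], tags=["doc_2", "Outdoors"]),
]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# every distinct tag ('doc_0', 'Sport', ...) now names one vector in model.dv,
# and any of them can appear in most_similar() results
print(model.dv.most_similar("doc_0", topn=3))
```

Note how the repeated 'Sport' tag gets a single vector trained, in effect, on the text of both docs that carry it.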

For example, if you have 100,000 documents, but they each are assigned to one of 25 known categories, you conceivably could just feed them all to the model with only their category as their `tag`. But in a sense, you've then just trained the model on only 25 unique texts, and it will have only learned 25 doc-vectors. You can then only get those 25 vectors back as `.most_similar()` results, and also may have prematurely collapsed a lot of the useful variety in the original 100,000 documents down to just 25 summary points. So you'd be unlikely to want to use broad tags that way. 

I'd suggest: start with unique per-document IDs. Once you have a sense of how well that's working for your task, experiment with the option of adding other kinds of tags - evaluating at each step to be sure it's doing something sensible/useful. 

- Gordon

Artanox

Jun 14, 2021, 1:29:19 PM
to Gensim
I've done a similar thing with doc2vec and I'm having trouble evaluating the model's accuracy/recall.
If I feed my documents like:
TaggedDocument(words=["Football","fun",...], tags=["Sport","Ball","Team"])
TaggedDocument(words=["Basketball","fun",...], tags=["Sport","Ball","Team"])
Then when calling most_similar for a document like TaggedDocument(words=["Hiking","fun",...], tags=["Walking","Mountain"])
I will get k responses like
(Tag1, similarity)
(Tag2, similarity)
(Tag3, similarity)
...

If I evaluate the model on tag hits, then I can calculate the accuracy/recall at k using the most_similar tag results:
Accuracy/recall at 1 -> Actual [Walking, Mountain] Predicted[Tag1]
Accuracy/recall at 2 -> Actual [Walking, Mountain] Predicted[Tag1, Tag2]
Accuracy/recall at 3 -> Actual [Walking, Mountain] Predicted[Tag1, Tag2, Tag3]
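
Roughly, what I mean by recall at k is something like this sketch (tag names made up):

```python
# rough sketch of the tag-hit recall at k described above
def recall_at_k(actual_tags, predicted_tags, k):
    hits = set(actual_tags) & set(predicted_tags[:k])
    return len(hits) / len(set(actual_tags))

print(recall_at_k(["Walking", "Mountain"], ["Tag1", "Walking", "Tag3"], 2))  # 0.5
```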

If I train my model with documents like:
TaggedDocument(words=["Football","fun",...], tags=[0]) with associated tags [Tag1, Tag2]
TaggedDocument(words=["Basketball","fun",...], tags=[1]) with associated tags [Tag1, Tag5, Tag8, Tag10]
and I want to evaluate the model based on tag hits for a document like TaggedDocument(words=["Hiking","fun",...], tags=[0]) with associated tags [Tag1],
the most_similar result would be
(id1, similarity)
(id2, similarity)
...
each one with its own set of tags, so the accuracy/recall depends on the number of tags of those documents.
I tried to calculate accuracy/recall with a confusion matrix: if one of the actual tags is present in the predicted list of tags, it counts as a hit
Actual [Tag1] Predicted [Tag1, Tag5, Tag8, Tag10] 

But if the tag is not present, then the accuracy/recall will go down by a lot, because you will evaluate all the pairs as misses
Actual [Tag1] Predicted [Tag2, Tag5, Tag8, Tag10] -> [Tag1,Tag2] miss -> [Tag1,Tag5] miss ...

Sorry for the bad explanation, I hope it's understandable. How can I evaluate such a model using IDs as tags?

Gordon Mohr

Jun 14, 2021, 9:14:40 PM
to Gensim
It sounds like, given a text, you want to predict one or more labels, after using other text examples, each with a known set of one or more labels, as training. 

Though it's intuitively tempting to try to do this via a `most_similar()` operation, especially when starting out, that's actually a pretty clumsy way to do what you want. 

The term-of-art in machine learning for what you want to do is "multi-label classification". Even before trying `Doc2Vec`, I'd highly suggest working through some online examples using scikit-learn to do, first, some simple 'binary classification' of texts. (That's: every text belongs in one or the other class – like 'spam' or 'not-spam', or 'on-topic' or 'not-on-topic', etc.) Don't even try using `Doc2Vec` at first - just simpler 'bag of words' models, like those created by the scikit-learn `CountVectorizer` or `TfidfTransformer` classes.
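
For instance, a tiny sketch of that kind of bag-of-words binary classifier (with made-up toy texts, not your real data) can be as short as:

```python
# minimal scikit-learn bag-of-words binary text classification (toy data)
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great food and friendly staff", "terrible wait and rude service",
         "loved the atmosphere", "awful experience, never again"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["the staff was great"]))
```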

Then, work through some 'multiclass classification' of texts - where every text gets exactly one class. Then, finally, a multi-label problem. 

Doing those will help you better think about the steps of your task, like converting the text into feature-vectors and then training any one of many possible algorithms to learn to make predictions. Only after doing that in other ways might you consider adding `Doc2Vec` as a way to get feature vectors from text - either as the main way, or as an addition to other techniques. On some tasks it'll help, but on others, its condensation of the whole text into a single small 'dense' vector might discard key aspects (like, say, a single word that's *always* a reliable indicator some label should be applied) more so than other approaches.
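
If you do get to that point, a hedged sketch (again with made-up toy texts; a real corpus needs far more data than this) of slotting `Doc2Vec` in as just the feature-extraction step might be:

```python
# Doc2Vec purely as a feature-extractor feeding a separate scikit-learn classifier
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

train_texts = ["great burgers and fries", "dry cleaning done fast",
               "tasty pizza and pasta", "shirts pressed perfectly"]
train_labels = ["restaurant", "dry_cleaner", "restaurant", "dry_cleaner"]

tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(train_texts)]
d2v = Doc2Vec(vector_size=50, min_count=1, epochs=40)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)

# re-infer vectors for the training texts, then hand them to any classifier
X_train = [d2v.infer_vector(t.split()) for t in train_texts]
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print(clf.predict([d2v.infer_vector("cheap pasta nearby".split())]))
```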

(In particular, the intuition to "find a ranked list of similar vectors & use those to pick the labels" is also part of the foundational classifier algorithm "K-Nearest-Neighbors", which might be a top-performer for you, but can also be a bit slower and memory-hungry than many other algorithms. So it's a good exercise to formulate your task so that you can try it, from a standard library, and then other classifiers as well. If you were trying to extend what you've tried to be more like KNN, you'd essentially take your new text of unknown labels, find its 5 nearest *exact document* neighbors, then look at *their* known labels - and use some heuristic to choose which of those candidate labels should be imputed for your new text. But really: you probably want to grab an off-the-shelf implementation, which is likely to be well-tested, offer tunable options, and fit well into other evaluation techniques.)
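
A rough sketch of that off-the-shelf multi-label KNN route, with random placeholder vectors standing in for whatever document features you end up choosing, and made-up label sets:

```python
# multi-label classification with scikit-learn's KNeighborsClassifier
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.random((20, 8))  # placeholder feature vectors, one per training doc
train_label_sets = [["Sport", "Ball"] if i % 2 else ["Walking", "Mountain"] for i in range(20)]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_label_sets)  # indicator matrix of the label sets

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, Y_train)
print(mlb.inverse_transform(knn.predict(rng.random((1, 8)))))  # predicted label set(s)
```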

- Gordon

Artanox

Jun 15, 2021, 7:09:19 AM
to Gensim
Thanks for the detailed response Gordon. I'm new to the field; I'm doing this type of work for my Computer Science internship, with the goal of building a classifier. We are working with the Yelp! dataset, and given a user we want to infer the categories he may be interested in, based on his reviews.
Actually, I had good luck with FastText, with the test corpus scoring very high in accuracy.
Right now I'm trying to score how the Doc2Vec model is doing compared to FastText.
I wanted to ask what Doc2Vec is suitable for, because I'm interested in using these models after the internship - what is Doc2Vec usually used for?
Thanks for your time, I appreciate the help; it's a very complex field when starting out.

Gordon Mohr

Jun 15, 2021, 1:39:43 PM
to Gensim
As FastText has an explicit `-supervised` classification mode, which directly learns (list-of-words)->(labels) relationships, I can see how it may have been both easy to use & high-performing. 
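
For reference, a minimal sketch of that supervised mode via the `fasttext` Python package (the file name and labels here are hypothetical; each line of the training file is the `__label__`-prefixed labels followed by the text):

```python
# FastText supervised mode; a line of reviews.train might look like:
#   __label__Sport __label__Ball great game and friendly crowd
import fasttext

model = fasttext.train_supervised(input="reviews.train", epoch=10)
print(model.predict("fun hike up the mountain trail", k=3))  # top-3 labels + probabilities
```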

But, with more general text-to-vector techniques (like `Doc2Vec` & many others, including all the `scikit-learn` transformers), you'll want to be explicit about separate preprocessing/feature-extraction and classification steps, and explicit in choosing/tuning/evaluating alternate classification algorithms – as per my prior message's suggestion of working through some `scikit-learn` pipelines of increasing complexity. (And: not necessarily on your exact data/problem - just stepping through simplified online examples, like say the "Working With Text Data" tutorial <https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html> or other intro tutorials/presentations, will massively expand your understanding & choice of techniques.)

The people I recall raving the most about `Doc2Vec` have been using it for fuzzy search. Especially as an adjunct to traditional full-text/keyword search, it's often good at revealing that docs are similar even when the exact expected words aren't present, especially in titles/abstracts/paragraphs. In that way, it provides much of the same value other sorts of tools for keyword-expansion/synonym-search/result-diversity can also provide, but perhaps in fewer steps or catching a few more fuzzy-matches without lots of manual assistance. In this use, the fact that it can (in most modes) plot both docs and individual words into the same coordinate space can also be interesting - as per the observations in the "Document Embedding With Paragraph Vectors" paper <https://arxiv.org/abs/1507.07998>.
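
In that use, the core loop is just inferring a vector for the new query text and searching the stored doc-vectors – roughly like this sketch, assuming a `Doc2Vec` model already trained with unique per-document ID tags:

```python
# fuzzy search against a trained Doc2Vec `model` (assumed to already exist)
query_tokens = "cheap dry cleaning with fast turnaround".split()  # made-up query
query_vector = model.infer_vector(query_tokens)
print(model.dv.most_similar([query_vector], topn=10))  # nearest doc-ID tags
```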

I've seen less praise for `Doc2Vec` features for thorny classification problems - I recall seeing it doing about as well, or sometimes a little better, than very-simple document summaries like a plain bag-of-words or average-of-all-word-vectors. I'd expect it to do better on matters of broad-topicality – "is this review about a dry cleaner?" – than potentially subtle sentiments – "were customers happy with the turnaround times?" Proper interpretation of the latter often relies on grammar, sarcasm, & relative comparisons to which a pure, mostly-order-invariant word-frequencies algorithm is oblivious.

- Gordon