Help me to understand the `Tag` in TaggedDocument

Johnny Lu

Aug 19, 2021, 11:55:41 AM8/19/21
to Gensim
Hi: 

Recently I am reading the tutorial of Doc2Vec, one example from Gensim: 
The `tag` is actually the plot's genre, i.e., the label

And the tutorial in website:
the `tag` is actually just a sequence of integers

Please help me understand how to use the `tag` properly, especially in a text classification task. 

Thanks
//JL

Gordon Mohr

Aug 19, 2021, 2:30:00 PM8/19/21
to Gensim
The original published 'Paragraph Vector' work (specifying the algorithm that `Doc2Vec` uses) typically assigned each document a unique ID. Thus each document received its own vector – which could then be passed to downstream classifiers. That's essentially the standard/classic way to apply `Doc2Vec` to a set of documents. 

But, it's an easy & natural extension to allow more than one full-doc vector floating over all of one text's words – with such vectors potentially also repeating, & recombining, in different texts throughout the corpus. This offers some interesting possibilities, such as using such doc-vectors to model various known characteristics of training documents – but it's not necessarily any better than the classic approach.

And, in some cases, it could be worse, prematurely collapsing the inherent & useful complexity of the learned space.

For example, imagine you have 100,000 documents, which each fall into one of 7 categories. If you label them with 100,000 unique IDs, every doc gets a different vector in the high-dimensional space, and any downstream classifier could potentially learn that various odd-shaped/discontiguous/"lumpy" regions of the space fit certain categories. But if you've only labeled all the docs with one of 7 tags, the model is essentially only learning 7 total doc-vectors, as if you'd concatenated all docs of the same category into one synthetic mega-doc. You might then wind up with an overfit model – a 100d 'dense' embedding is overkill if there are only 7 docs! – or wind up summarizing a category that's really an exotically-shaped region as a single point. 

For that reason, I tend to be suspicious of labeling a large set of documents with *only* a small set of known-labels, as that demo notebook shows. But, it could be worth trying – especially if, at the end, rather than giving every document just its category's vector, you re-infer a document-unique vector for each. 

You could also consider giving each document both a unique ID tag, and a known-category tag. Then you get a handy summary vector for each category – perhaps oversimplified to a single point, but still of interest for some comparisons – and still try to model the other interesting variances per-document. 

But none of these is indisputably 'best', so you may want to try several variants on your data & goals. The "one ID per doc" approach can be considered the classic/baseline approach, worth doing first. 

- Gordon

santosh.b...@gmail.com

Aug 20, 2021, 5:34:10 AM8/20/21
to Gensim
Great explanation, Gordon! Thank you. 

I was tagging multiple docs (in my case tweets) with the same Twitter handle. From your explanation, it seems to me that I should be tagging each document with the unique tweet id and leaving any cosine similarity computations (e.g., between two Twitter users) to a downstream application that does suitable aggregations.

Just to clarify, would the doc vectors be different if I label each document/tweet with:
a) only tweet id; 
b) tweet id; and tweet user handle. 
Are there downsides of the former/latter approach?

Thanks!
sbs

Gordon Mohr

Aug 20, 2021, 2:15:57 PM8/20/21
to Gensim
I can't be sure what "should" be done for your data/goals - I was just outlining some of the variations to try & potential tradeoffs. It might be better with just one ID tag per doc, or with multiple tags - only trying & evaluating each will tell for sure. 

Choosing different things as the `tags` during training will definitely change the results, as it means a different amount of model internal state, & different goals/relative-weights during training. 

For example, again imagine 100,000 docs, each with a unique ID, plus 7 categories. If you just gave each doc a tag of its unique ID, the model during training has 100,000 doc-tags to learn, and tries in its loops to make those 100,000 vectors ever-better at predicting their respective doc's words. 

If instead you gave every doc two tags, its ID & its category, then the model during training is now trying to adjust 100,007 vectors. While 100,000 of those vectors are trained from one matching doc's words, the last 7 are each influenced by the thousands of docs that all share that same category-tag. And essentially, those 7 tags are getting as much 'training attention/updating' as the 100,000, which could be a good or bad thing in the interleaved tug-of-war training process. In some modes, like plain PV-DBOW (`dm=0`), having 2 tags per doc will mean about twice as much training time. (It's as if each doc repeats twice, once with one tag, once with the other.) 

- Gordon

santosh.b...@gmail.com

Aug 20, 2021, 2:25:45 PM8/20/21
to Gensim
Thanks for the explanation, Gordon. This is very helpful!

Kind regards
sbs

santosh.b...@gmail.com

Aug 23, 2021, 5:30:40 AM8/23/21
to Gensim
For tweets, I tried doc2vec with:
a) only tweet handle/user as a tag for every tweet;
b) tweet id and tweet handle/user as two tags per tweet;

I am surprised to see the drastically different results when using cosine similarities. This is quite unexpected and confusing. Not sure how to test which of the two word embeddings should be used for subsequent downstream analysis.

sbs


Gordon Mohr

Aug 24, 2021, 4:52:02 PM8/24/21
to Gensim
Indeed, as mentioned, the choice of how to group/tag/repeat the text, with respect to what parts of the model are being adjusted (per-tag doc-vectors), can have a big effect – along with all the other preprocessing & metaparameter choices.

But, I'm curious: what code ran, & what outputs compared, that led to an assessment of "drastically different results"? 

And, in the drastic differences, does either appear better, either to your ad hoc probes, or any quantitative scoring you can apply?

Ultimately, to choose between different alternative preprocessing steps (such as how `tags` are assigned) or values to use for other tunable metaparameters, you'll need some repeatable way to score either the `Doc2Vec` output itself, or the overall system that makes use of the `Doc2Vec` outputs. 

- Gordon

santosh.b...@gmail.com

Aug 26, 2021, 11:45:50 AM8/26/21
to Gensim
I am not sure how to quantitatively evaluate which is the best model. I found some contradictory results when I use downstream approaches like the word embeddings association test (WEAT). Since it may not make sense to share that here, I am sharing below the basic results based on word vectors obtained from gensim.doc2vec (w2v = model.wv).

The thing that I am unable to understand is why word vectors yield drastically different results by changing what constitutes tag id(s) for a document. From a cursory glance, Twitter user handle as the only tag ID yields results that align with my expectation. But I am not sure how to quantitatively ascertain that downstream statistical analysis (WEAT) using the corresponding doc vectors are the most correct.

Tag= [Twitter User Handle]
w2v.most_similar(['passion'])
[('desire', 0.4037601947784424), ('dream', 0.3962423801422119), ('enthusiasm', 0.36689141392707825), ('discipline', 0.35415637493133545), ('optimism', 0.35155346989631653), ('happiness', 0.3514486253261566), ('attitude', 0.34471186995506287), ('ambition', 0.3406122326850891), ('dedication', 0.33606553077697754), ('courage', 0.32806462049484253)]

w2v.most_similar(['money'])
[('monie', 0.42207765579223633), ('cash', 0.42051470279693604), ('dollar', 0.41079556941986084), ('purse', 0.39792945981025696), ('paycheck', 0.3733024597167969), ('bet', 0.34665751457214355), ('mil', 0.3465754985809326), ('penny', 0.34436628222465515), ('buck', 0.338899701833725), ('winning', 0.33551734685897827)]

w2v.most_similar(['family'])
[('fam', 0.6488237977027893), ('familia', 0.441882461309433), ('familyfriend', 0.4074658751487732), ('brother', 0.406892329454422), ('crew', 0.39718306064605713), ('wife', 0.3907517194747925), ('friend', 0.38892146944999695), ('dad', 0.3854900896549225), ('parent', 0.3845958411693573), ('father', 0.38032975792884827)]

w2v.most_similar(positive= ['she', 'king'], negative= ['he'], topn= 1)
[('queen', 0.5003858804702759)] 

w2v.most_similar(positive= ['she', 'boy'], negative= ['he'], topn= 1)
[('girl', 0.599285900592804)] 

w2v.most_similar(positive= ['she', 'father'], negative= ['he'], topn= 1)
[('mother', 0.5868934392929077)]
 
w2v.most_similar(positive= ['she', 'handsome'], negative= ['he'], topn= 1)
[('gorgeous', 0.4948742687702179)] 

Tag= [Tweet ID]
w2v.most_similar(['passion'])
[('success', 0.7988051772117615), ('knowledge', 0.7854339480400085), ('dedication', 0.7460174560546875), ('mindset', 0.7398138046264648), ('friendship', 0.7276818156242371), ('wisdom', 0.7246858477592468), ('motivation', 0.7208741307258606), ('faith', 0.719303548336029), ('effort', 0.7175871729850769), ('sacrifice', 0.712775468826294)]

w2v.most_similar(['money'])
[('pay', 0.6928053498268127), ('risk', 0.6863917112350464), ('dollar', 0.6666136980056763), ('purse', 0.6618797779083252), ('cause', 0.6591783165931702), ('paycheck', 0.6582706570625305), ('tax', 0.6543160080909729), ('attention', 0.6445984840393066), ('taxis', 0.6415027976036072), ('cash', 0.6400747895240784)]

w2v.most_similar(['family'])
[('friend', 0.8546327352523804), ('bless', 0.8086105585098267), ('support', 0.7748953104019165), ('prayer', 0.7736652493476868), ('father', 0.7578744888305664), ('birthday', 0.755126416683197), ('fam', 0.7372110486030579), ('thank', 0.7319799065589905), ('sister', 0.7269537448883057), ('supporter', 0.7247592210769653)]

w2v.most_similar(positive= ['she', 'king'], negative= ['he'], topn= 1)
[('queen', 0.7499951720237732)]

w2v.most_similar(positive= ['she', 'boy'], negative= ['he'], topn= 1)
[('sis', 0.8260940909385681)]

w2v.most_similar(positive= ['she', 'father'], negative= ['he'], topn= 1)
[('mother', 0.8420456647872925)]

w2v.most_similar(positive= ['she', 'handsome'], negative= ['he'], topn= 1)
[('gorgeous', 0.7570988535881042)]

Tag= [Twitter User Handle, Tweet ID]
w2v.most_similar(['passion'])
[('dream', 0.5422909259796143), ('greatness', 0.5121507048606873), ('sacrifice', 0.48439452052116394), ('success', 0.4816773533821106), ('journey', 0.4728812575340271), ('position', 0.471219003200531), ('career', 0.46483755111694336), ('hard', 0.45550456643104553), ('ethic', 0.453106164932251), ('courage', 0.45210179686546326)]

w2v.most_similar(['money'])
[('it', 0.5695531964302063), ('pay', 0.5601360201835632), ('he', 0.5574212670326233), ('they', 0.5525235533714294), ('do', 0.5460174083709717), ('to', 0.5378518104553223), ('ppl', 0.5377398729324341), ('retweet', 0.5336585640907288), ('you', 0.5308172106742859), ('support', 0.527155339717865)]

w2v.most_similar(['family'])
[('fam', 0.6382679343223572), ('teammate', 0.6258252263069153), ('support', 0.5977908372879028), ('brother', 0.5847072005271912), ('to', 0.5798584222793579), ('mom', 0.5780133605003357), ('help', 0.5741506218910217), ('journey', 0.572892427444458), ('manager', 0.569219172000885), ('parent', 0.563672661781311)]

w2v.most_similar(positive= ['she', 'king'], negative= ['he'], topn= 1)
[('megumi', 0.4182150363922119)]

w2v.most_similar(positive= ['she', 'boy'], negative= ['he'], topn= 1)
[('girl', 0.6591248512268066)]

w2v.most_similar(positive= ['she', 'father'], negative= ['he'], topn= 1)
[('sister', 0.6040921807289124)]
 
w2v.most_similar(positive= ['she', 'handsome'], negative= ['he'], topn= 1)
[('sexy', 0.511592447757721)]

Gordon Mohr

Aug 26, 2021, 3:31:22 PM8/26/21
to Gensim
Those results don't look drastically different to me. Each seems to be giving plausibly reasonable results, with a lot of overlap. Any marginal differences in exactly how good those results are for a specific purpose is something that'd depend entirely on your specific (& unshown) downstream goals/steps. 

Without seeing your other training code & parameters, it's hard to speculate about what else might be influencing the results. Remember: in some modes, supplying 2 tags means twice as much training occurs. Also, given all the randomness in the algorithm, there's plenty of 'jitter' from run-to-run even with the exact same parameters - especially in cases where the data may be thin or model over-sized. 

- Gordon

santosh.b...@gmail.com

Aug 27, 2021, 11:41:14 AM8/27/21
to Gensim
Many thanks for further clarification, Gordon. 

A follow-up question: What exactly do you mean by a model being over-sized? Does it mean specifying, for example, 300 dimensions when we should be using a smaller dimensionality of, say, 25 or 50? Is there a way to quantify whether the model is over-sized? 

Thanks,
sbs

Gordon Mohr

Aug 30, 2021, 3:41:49 PM8/30/21
to Gensim
Imagine taking some tiny corpus – such as the 4-word, 4-unique-sentence one, maybe 100 bytes worth of data – but then seeking to train 100-dimensional word-vectors from it. At that point, the model (4x100 dimensional word-vectors, plus internal weights) is far, far *larger* than the training data. 

Such a large model will be able to become quite good at its internal training goal – predicting words from neighboring words – just by approximating (as closely as its internal structure allows) a pure 'memorization' of training examples. What's more, with so many free parameters, there are many, many alternate model configurations that will all be equally good at that training task. Two words that in a sense "should" have the same representation (because they have the exact same usage contexts) could in fact wind up with arbitrarily different word-vectors - that may work arbitrarily well for the tiny training task, but aren't of much use elsewhere.

In fact, in such a case, the oversized model has really just added arbitrary noise (from the algorithm's randomized initialization & training) atop the raw info in the tiny corpus. If you were really trying to do the training task – predicting neighboring words – and were willing to devote more state to the problem than the training data itself, you could just... make a lookup table of exactly the co-occurrences, and always make the "perfect" prediction (for the limited training set) from that table.
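That memorization point can be made literal in a few lines of plain Python (toy sentences invented here): a lookup table of observed co-occurrences already gives "perfect" neighbor prediction on the training set, with zero generalization.

```python
# With more state than data, "perfect" training-set prediction is just
# a memorized co-occurrence table - no compression, no generalization.
from collections import defaultdict

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]

# Record every (word -> neighbors-within-window) pair seen in training.
neighbors = defaultdict(set)
window = 1
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                neighbors[w].add(sent[j])

# Perfect recall on the training data, useless on anything unseen:
print(sorted(neighbors["the"]))  # ['cat', 'dog']
print(sorted(neighbors["cat"]))  # ['sat', 'the']
```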

Instead, truly useful training tends to have a strong aspect of 'compression' or 'summarization': converting a lot of training data into a much-smaller model that reflects the essential, generalizable patterns. 

- Gordon