Document topic classification using doc2vec

je...@onlaw.dk

Aug 15, 2018, 10:18:35 AM
to Gensim
Dear all 

At the moment I am trying to add topics to my database of documents. I have a predefined set of topics but no labelled data. Since I am lazy by nature, I have been experimenting with doc2vec to avoid labelling my docs to create a training set.

I have trained the doc2vec model on my corpus (dm=0). My idea is that it might be possible to obtain relevant topics for all documents in an unsupervised fashion in the following way (a rough code sketch follows the list):

1) infer_vector on a topic word or a relevant sentence (or both), e.g. infer the low-dimensional representation v_t of the topic "machine learning"
2) calculate the cosine similarity, i.e. \cos\theta = \frac{v_t \cdot v_{doc}}{\|v_t\|\|v_{doc}\|}, where v_{doc} is the learned vector representation of a given doc
3) if \cos\theta > topic_limit, then "attach" the topic to the doc; topic_limit \in [0,1]
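In code, it would be something like this (just a sketch; the model path, threshold value, and doc count are placeholders, and I am on the gensim 3.x API where the trained doc-vectors live under model.docvecs):

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_dbow.model")  # placeholder path to the trained dm=0 model

TOPIC_LIMIT = 0.4    # threshold in [0, 1]; I have no principled value yet
NUM_DOCS = 100000    # placeholder: number of docs in the trained corpus

# 1) infer a vector for the topic phrase (tokenized like the training data)
v_t = model.infer_vector(["machine", "learning"])

# 2) + 3) most_similar returns cosine similarities sorted descending, so
# rank all training docs against v_t and cut off at the threshold
for doc_tag, sim in model.docvecs.most_similar([v_t], topn=NUM_DOCS):
    if sim < TOPIC_LIMIT:
        break
    print(doc_tag, "-> attach topic 'machine learning' (cos = %.3f)" % sim)
```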

My initial tries seem to work to some extent, but are far from perfect. Do any of you know of similar approaches, or even better, do you have any tricks? 

Best Jens

Evan Goodwin

Aug 15, 2018, 8:12:18 PM
to gen...@googlegroups.com
Have you looked at this approach? I guess the trick is picking the number of clusters for the latent topics that are in your corpus. It is similar to LDA. 

https://towardsdatascience.com/automatic-topic-clustering-using-doc2vec-e1cea88449c
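The gist of the clustering in that article is roughly this (a sketch only, assuming a trained gensim 3.x Doc2Vec model in `model`; the cluster count is the part you have to guess):

```python
import numpy as np
from sklearn.cluster import KMeans

NUM_CLUSTERS = 20  # the tricky metaparameter: how many latent topics?

doc_vectors = model.docvecs.vectors_docs  # gensim 3.x: one row per training doc
km = KMeans(n_clusters=NUM_CLUSTERS, random_state=0).fit(doc_vectors)

# km.labels_[i] is the cluster of the i-th training doc; skim a few docs
# per cluster to judge whether the clusters look like coherent topics
print(np.bincount(km.labels_))
```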

I don't know if inferring the vector for the words 'machine learning' would really represent the concept of machine learning in your corpus. 

If you could just search the web for 10-20 articles that are most purely about a given topic (a 'machine learning tutorial', for instance, or an 'intro to machine learning' article), you could infer vectors for those 10-20 articles and average them to discover the 'center' point in your corpus for that topic. 

If you compute the cosine distance between this point and the article vectors in your corpus, you should be able to discover articles that discuss that topic. I am not sure how they would be ranked, i.e. whether articles most purely about the topic would rank at the top and articles covering many topics (including the given one) would rank lower down. There would also be normalization issues. You would have to experiment to find out how it works. 
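As a sketch of what I mean (reference_texts stands in for the 10-20 tokenized articles from the web search, and `model` for your trained Doc2Vec):

```python
import numpy as np

# reference_texts: hypothetical list of 10-20 pre-tokenized articles,
# each purely about the target topic
ref_vecs = [model.infer_vector(tokens) for tokens in reference_texts]
topic_center = np.mean(ref_vecs, axis=0)

# rank corpus docs by cosine similarity to the 'center' point
for doc_tag, sim in model.docvecs.most_similar([topic_center], topn=20):
    print(doc_tag, sim)
```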

Gordon Mohr

Aug 15, 2018, 8:55:05 PM
to Gensim
Ultimately, you'll have to start adding some human-judged, gold-standard topics to existing articles – to have a training set, or be able to evaluate any other ad hoc methods you devise. 

Using either the inferred vector for ['machine', 'learning'], or (if you have compatible co-trained word-vectors) the average of wv('machine') & wv('learning'), as a sort of bootstrapped starting-point for the topic 'machine learning' might be better than nothing. But, such tiny-phrase vectors are likely to be idiosyncratic, perhaps not matching human senses of the "centroid" for that topic. You'd probably want the anchor for the topic to move from its initial point, or grow to include multiple points, as you actually confirm/reject, with human expertise, that certain docs truly fit under that topic. 
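(Either bootstrapping is only a line or two; note that in dm=0 mode the word-vectors are only meaningfully trained if the model was built with dbow_words=1:)

```python
import numpy as np

# option A: inferred doc-vector for the tiny phrase
v_topic = model.infer_vector(["machine", "learning"])

# option B: average of co-trained word-vectors (requires dbow_words=1
# when training in dm=0 mode, otherwise these vectors stay untrained)
v_topic_wv = np.mean([model.wv["machine"], model.wv["learning"]], axis=0)
```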

To the extent that you have predetermined topics, is the sole description of those topics their short text names? Or, do you have (or could you find/create) groups of docs that are definitely representative of those topics? EG's suggestion of a web search for those topic-names would help use the search engine's encoded understanding of the topic to flesh out the training set. For your domain, and if the topics come from something already established, perhaps there are even better sets of representative docs for each topic. (You could mix these 'reference docs' into your corpus during training, or just infer doc-vecs for them post-training.) Even something like the text of the Wikipedia article 'Machine Learning' – as a whole or by section – might, when fed to your model, give a better point/points than the tiny phrase 'Machine Learning' itself. 

You may want to review a followup paper about the Paragraph Vector algorithm (aka gensim's Doc2Vec) that applied it to both Wikipedia articles and Arxiv papers:

"Document Embedding with Paragraph Vector"

They used the existing (human) categories of these corpuses to auto-score the quality of doc-vectors under different metaparameters, based on how often pairs of docs declared to be in the same topic had doc-vecs closer to each other than to some 3rd document picked randomly. 
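That kind of scoring is easy to approximate for your own data; a rough sketch (topic_to_tags is a hypothetical mapping from each known topic to the doc-tags confirmed to be in it):

```python
import random

def triplet_score(model, topic_to_tags, trials=1000, seed=0):
    """Fraction of random triplets where two docs sharing a topic have
    doc-vecs closer to each other than to a randomly picked third doc."""
    rng = random.Random(seed)
    all_tags = [t for tags in topic_to_tags.values() for t in tags]
    usable = [tags for tags in topic_to_tags.values() if len(tags) >= 2]
    wins = 0
    for _ in range(trials):
        a, b = rng.sample(rng.choice(usable), 2)   # same-topic pair
        c = rng.choice(all_tags)                   # random third doc
        if model.docvecs.similarity(a, b) > model.docvecs.similarity(a, c):
            wins += 1
    return wins / trials
```

Higher is better; you could compare models trained with different metaparameters by this one number.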

Perhaps there's other structure in your docs – subheads, footnotes, capitalized-phrases – that could serve as a sort of extra hint of docs that "oughtta be" close to each other, and thus help metaoptimize the Doc2Vec model for topical purposes. 

And if you can generate a seed set of "known points and their strongly-associated topics", perhaps from ref docs borrowed/hand-selected from elsewhere, a K-Nearest-Neighbors report of predicted topics for unlabeled docs would work pretty well. And where it doesn't, each time you hand-correct the labeled-topics for a doc/doc-vector, it'd improve for all others. That is, use an iterative process: once you've got at least one 'seed point' for every topic, check all your docs' distances from all seed-points. Manually review the document with the "most confused" topics - giving it definitive labels. Then repeat. (And, for all hand-labeled docs, constantly re-score whether they'd be properly classified, by their nearest-known-neighbor, if their own labels were ignored.)
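A sketch of that loop's KNN core (seed_vectors/seed_labels stand for the doc-vectors and confirmed topics of your hand-labeled seed set, unlabeled_vectors for everything else):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# fit only on the trusted seed points
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(seed_vectors, seed_labels)

predicted = knn.predict(unlabeled_vectors)

# 'most confused' doc = lowest confidence in its best topic; hand-label
# that one next, add it to the seed set, refit, and repeat
confidence = knn.predict_proba(unlabeled_vectors).max(axis=1)
next_to_review = np.argmin(confidence)
```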

- Gordon

je...@onlaw.dk

Aug 16, 2018, 2:22:58 AM
to Gensim
Thanks EG and Gordon

Good ideas. I will try to mix them with my own and report my findings here if I get anywhere :-) I guess that, given the uncertainty of my/your approaches, it might be less work (and give better results) if I simply annotate a training set and use a classifier on the paragraph vectors? 
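I.e. something as plain as this (hypothetical names; train_vectors/train_topics would be the doc-vectors and labels of whatever I annotate, keeping it to one topic per doc for simplicity):

```python
from sklearn.linear_model import LogisticRegression

# train_vectors: doc2vec vectors of the hand-annotated docs
# train_topics:  my manually assigned topic labels
clf = LogisticRegression(max_iter=1000).fit(train_vectors, train_topics)

predicted_topics = clf.predict(unlabeled_vectors)
```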

I found this piece, which resembles my ramblings a bit: https://arxiv.org/pdf/1612.05340.pdf

Best Regards 

Jens

Gordon Mohr

Aug 16, 2018, 3:08:05 PM
to Gensim
Until/unless you annotate some texts with your 'ground truth', it's hard to see how you'd even evaluate whether various experimental strategies are working. 

And once you do have some trusted labels, and perhaps even grow them over time as you manually review hard cases, then you can try other more standard classifier strategies, for predicting/auto-labeling other docs. 

- Gordon

je...@onlaw.dk

Aug 16, 2018, 3:29:45 PM
to Gensim
OK, thanks. Found this approach as well :-)