Clustering documents using Doc2vec & [H]Dbscan


Andres Moreira

Jan 9, 2018, 8:47:16 PM
to gensim
Hello everyone,

I'm working on a project in which I need to cluster documents (mainly news articles and press releases) based on their content. Initially, we started with tf-idf plus TruncatedSVD, and it was performing "fine" – about 75-78% accuracy against a manually curated set of news clusters. A few weeks ago, we started exploring doc2vec to compare and see if we could obtain an improvement. However, I haven't been able to get near the quality I was getting with the current method (tf-idf + TruncatedSVD), and I'd like to get some advice on things I may be doing wrong.

Some information about the project data,
* corpus: 790,000 articles in Spanish
* tokenized by removing punctuation and tildes (accents), converting numbers to a symbol (_NUMBER_), and similar actions
* average doc size: 300-500 tokens
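As a rough sketch (illustrative only, not our exact production code – the helper name and exact rules are just for the example), the preprocessing is along these lines:

```python
import re
import unicodedata

def tokenize(text):
    # Strip accents ("tildes"): decompose, then drop combining marks
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    text = text.lower()
    # Replace any run of digits with the placeholder symbol
    text = re.sub(r'\d+', ' _NUMBER_ ', text)
    # Keep word-like tokens, dropping punctuation
    return re.findall(r'[a-zA-Z_]+', text)

tokenize('El Gobierno anunció 3 medidas económicas.')
# → ['el', 'gobierno', 'anuncio', '_NUMBER_', 'medidas', 'economicas']
```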

Code I'm using for training,

from random import shuffle
from gensim.models.doc2vec import Doc2Vec

alpha = 0.025
min_alpha = 0.001
shuffle(train_articles)
passes = 20

# cv = Doc2Vec.load('fully_trained_model_andres.doc2vec')

# w1, clusters: 2, 4 elems (2, 2): m=1, size=200, window=15, negative=10, hs=0, workers=-1, alpha=alpha, min_alpha=min_alpha, dm_concat=0, dm_mean=1, min_count=2
# w2, clusters: 2, 6 elems (2, 4): dm=1, size=150, window=15, negative=5, hs=0, workers=-1, alpha=alpha, min_alpha=min_alpha, dm_concat=0, dm_mean=1, min_count=2

cv = Doc2Vec(dm=1, size=150, window=15, negative=5, hs=0, workers=2, alpha=alpha, sample=1e-05,
             min_alpha=min_alpha, dm_concat=1, dm_mean=0, min_count=10, iter=passes)

cv.build_vocab(train_articles)
cv.train(train_articles, total_examples=len(train_articles), epochs=cv.iter)


Things I have tested,
* size: best was 150; others that didn't perform well: 100, 200, 250, 300, 400
* window: 15 (10, 20)
* negative: 5 (10, 15, 20)
* alpha: 0.025 (0.05, 0.01, 0.1)
* min_alpha: 0.001 (0.0001)
* and switched the other parameters: dm_concat, dm_mean, min_count, etc.

I'm using this code to *infer_vectors* for the new articles and then identify the clusters among them:
import numpy as np
from sklearn.metrics import pairwise_distances

golden_set_vectors = [cv.infer_vector(x.words, alpha=0.05, min_alpha=0.001, steps=50) for x in new_articles]
X = np.asarray(golden_set_vectors)
distances = pairwise_distances(X, metric='cosine').astype('float64')

from hdbscan import HDBSCAN

clusterer = HDBSCAN(algorithm='best', approx_min_span_tree=True, 
                    gen_min_span_tree=True, leaf_size=40, metric="precomputed",
                    min_cluster_size=3, min_samples=None, p=None, core_dist_n_jobs=-1)

clusterer.fit(distances)
set(clusterer.labels_)

The accuracy I'm obtaining, compared against the manually annotated set of clustered news, is around 35-40%.

So, my question is: does anyone have experience with this task and can provide any tips or feedback? Is there anything I should be considering that I'm not doing at the moment?

Thanks, and any help is very much appreciated!

Best,


Ivan Menshikh

Jan 10, 2018, 3:51:27 AM
to gensim
Hello Andres,

Your corpus has a good size (enough for Doc2Vec) and, as far as I can see, you're doing everything the right way, but sometimes "new" algorithms work worse than "old" techniques.

Possible improvements:
- A different clustering algorithm
- Another type of document embedding, such as LDA or Sent2Vec (coming soon; currently available as a PR)

Also, I hope that Gordon can suggest more.

Andres Moreira

Jan 10, 2018, 9:14:34 AM
to gen...@googlegroups.com
Hello Ivan,

Thanks for the feedback. Yes, indeed. I'm exploring options right now, and I thought Doc2Vec was worth a try, but it seems the vectors being inferred aren't good enough. I'm using cosine similarity as the distance metric, but for two clearly distinct documents – one about sports and one about politics – the distance between them is too small.

I'm using [H]DBscan as I don't know the number of clusters I'll have beforehand. I know that they will have at least 2-3 items each, and the topics are differentiated: Politics, Sports, Economics, Entertainment, etc.

I was checking lda2vec. I didn't know about Sent2Vec, however. It may be worth trying when it's ready.

Anyway, any other tips/advice from anyone would be highly appreciated.

Thanks!


Gordon Mohr

Jan 10, 2018, 12:23:48 PM
to gensim
Within the realm of Doc2Vec, your corpus size of 790,000 looks good, as well as documents of 300-500 tokens. 

I wouldn't bother with the non-default `dm_concat=1` mode – it results in much-bigger, much-slower-to-train models whose utility remains unproven. (Even the claims in the original 'Paragraph Vector' paper haven't fully checked out.) Similarly, tinkering with `dm_mean` or the training `alpha`/`min_alpha` is rarely a win.

I would be sure to try the faster, simpler non-default `dm=0` (PV-DBOW) mode – especially if the main interest is full-document modeling (without any need for separate word-vectors), it's often a quick top performer, good as a baseline if nothing else. (As a smaller model, inference also tends to work better, with relatively fewer `steps` iterations.)

Sometimes more-aggressive `sample` (smaller) or `min_count` (larger) values make a noticeable difference – but likely not of the magnitude of an accuracy-score jump from high-30s% to high-70s%.

I'm guessing your accuracy-score was based on something like: "a human said these pairs of docs should be in the same cluster, are they?"

Is perhaps the Doc2Vec process using a very different number of input-dimensions to the clustering than the prior TruncatedSVD process? Or is the clustering detecting more clusters? (Perhaps: the 1st process only creates a "Politics" supercluster, but the 2nd "Politics-Domestic" & "Politics-International" & etc – which could make previous "same pairs" land in different clusters, but not necessarily in a bad way. Either forcing the clustering to output the same number of clusters, or changing the scoring to be something like "these human-selected pairs are closer to each other than to other random docs/cluster-centroids", might improve the comparison.)
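For instance, a "same pairs" score along those lines could be sketched as (hypothetical helper, just to make the scoring idea concrete):

```python
from itertools import combinations

def same_pair_recall(gold_clusters, predicted_label):
    # Of all doc-pairs a human put in the same gold cluster, what
    # fraction land in the same predicted cluster?
    same = total = 0
    for cluster in gold_clusters:
        for a, b in combinations(cluster, 2):
            total += 1
            if predicted_label[a] == predicted_label[b]:
                same += 1
    return same / total if total else 0.0

gold = [['a', 'b', 'c'], ['d', 'e']]
pred = {'a': 0, 'b': 0, 'c': 1, 'd': 2, 'e': 2}
same_pair_recall(gold, pred)  # pairs (a,b) & (d,e) match out of 4 → 0.5
```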

- Gordon

Gordon Mohr

Jan 10, 2018, 12:31:44 PM
to gensim
Also just noticed: clustering is only done on the (presumably much smaller) subset of `new_articles`. Is this subset prepared/sized the same for both clustering-steps?

And, be sure the tokenization to `words` of these new articles matches that of the training set – sometimes people do that pre-inference tokenization wrong (such as by passing a string rather than list-of-tokens), and the resulting unrecognized/unintentional tokens in inference can contribute to nearly-random results. 
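A quick way to see that failure-mode (plain-Python illustration, not gensim-specific): iterating over a string yields characters, so `infer_vector` would see one-letter "words":

```python
doc = "mercado de valores"

# Passing the raw string: each *character* becomes a "token"
list(doc)[:5]   # → ['m', 'e', 'r', 'c', 'a']

# Passing a list-of-tokens (what infer_vector expects)
doc.split()     # → ['mercado', 'de', 'valores']
```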

Also, the inference parameters `steps` and starting `alpha` are worth exploring further for incremental gains – more steps or an alpha-same-as-training 0.025 may help, and optimal values here change with model type/size. (For example, a larger model, like the default `dm=1`, can benefit from more `steps` than a smaller/simpler `dm=0`.)

- Gordon

Andres Moreira

Jan 11, 2018, 9:03:24 AM
to gen...@googlegroups.com
Hey Gordon,

Thanks for the great answer. Let me answer the earlier questions first, and then describe the latest things I did.

* the *golden_set* is a set of articles that a human annotated as covering the same news story and belonging to the same cluster (e.g. nytimes: "First Mars landing happens", washington-post: "Humans land on Mars for first time", guardian: "It's happening, first Mars landing", etc). It contains around 1,000 annotated news articles we're using for testing, in around 180 clusters.
* the clusters generated by [H]dbscan using doc2vec are actually around 140, but the assignment is 60-70% incorrect. It's not creating better sub-clusters; it's just grouping totally unrelated news together.
* I'm using the same pre-processing on both sets (training & new_articles). 

Yesterday, per your suggestion, I tested dm=0 (PV-DBOW), changing size, sample & min_count. The results didn't change at all; in some cases they were actually worse. Then I tested dm=1, changing size, sample, min_count, and the results were *a bit better*, but overall the number was still around 40%.

I believe the PV-DM model is slightly better; however, it may require much more tuning. I'm still getting the best results with TF-IDF + TruncatedSVD.

Something I didn't try that may be worth it is using only the article title to create the vectors. Also, based on your experience, is there any guidance on vector size, window & negative samples for this type of task? I read the original paper and it suggests a size of 300 and a window of 7-8 for some tasks, but that didn't work for me at all.

Have a good day, and thanks for the help!




Gordon Mohr

Jan 11, 2018, 3:22:26 PM
to gensim
It's still unclear to me:

Are the 'golden-set' articles full 300+ token articles, or just the short phrases/titles you provide as examples?

If 1000 articles are broken into 180 clusters, each cluster has an average of 5.6 items, with some much more or less, correct? (Are there single-item clusters in the golden-set? Are there 100-item clusters?)

How are the golden-set clusters turned into an accuracy score for a model - is it checking if pairs-of-docs are in the same-cluster, or some other way of iterating over the test items?

How many articles are in the `new_articles` used for learning the clusters? And, are they representative/randomly-subsetted from the full 790K?

How many dimensions come out of the TruncatedSVD process, and how does that compare to your tested Doc2Vec dimensionalities?

How many clusters come out of the TruncatedSVD process, and how does that compare to the number of clusters learned from your tested Doc2Vec dimensionalities?

If the 'golden-set' that you are optimizing-towards implies exactly 180 categories, perhaps it'd be good to tune the clustering algorithm to give 180 clusters, no matter the earlier vectorization steps?

When you suggest the clusters from (texts)->(Doc2Vec)->HDBSCAN include "totally unrelated news together", that seems odd, because usually the results of Doc2Vec at least deliver the quality that nearest-doc-vectors are (to human eyes) recognizably-similar in topic. 

It'd be worth doing a deep-dive on certain article-pairs. For example, pick an anchor article A in the golden-set. Rank all other articles by closeness to this article (in both the Doc2Vec and TruncatedSVD spaces). At what ranks do the N other articles, that humans put in the same category as A, appear? To the extent any nearest-neighbors weren't in the same golden-set category, do they still appear related by other human-perceptible factors? (Are they a closely-related category, like the "Politics-Domestic" vs "Politics-International" conjecture I made earlier?)
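That anchor-and-rank check could be sketched like this (hypothetical helper over an array of doc-vectors, using cosine similarity):

```python
import numpy as np

def ranks_of_neighbors(vectors, anchor_idx, target_idxs):
    # Rank all docs by cosine similarity to the anchor (the anchor
    # itself is rank 0), then report where the target docs land.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v[anchor_idx]
    order = np.argsort(-sims)  # most-similar first
    rank = {int(doc): r for r, doc in enumerate(order)}
    return [rank[t] for t in target_idxs]

vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ranks_of_neighbors(vecs, 0, [1])  # doc 1 is doc 0's nearest neighbor → [1]
```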

Given that both processes are, at a really high-level, "tally-of-terms -> lower-dimensionality -> same-clustering-algorithm", I'd expect them, when similarly tuned, to perform in a broadly-similar way. (I'd not especially expect Doc2Vec to give much better or worse results, though it's worth trying in case it fits the data/goals well.) So, the big drop-off in your evaluation is still surprising, and suggestive to me there may be some extra inadvertent bottleneck (in dimensions/clusters/training-data) in your Doc2Vec process compared to the other. 

Regarding other questions:

* Published work tends to use doc-vector sizes from 100-1000 dimensions – but the optimal level depends on the dataset & application.

* In Word2Vec, it's been observed that larger `window` values tend to emphasize topical-similarity in resulting vectors, and smaller `window` values emphasize functional interchangeability. The same probably applies in Doc2Vec. (Though note that in pure `dm=0`, the `window` parameter is irrelevant, because each doc-vector is simply trained to predict each doc-word in turn – a sort of full-document window. If you go to PV-DBOW plus word-training, `dm=0, dbow_words=1`, then `window` is again relevant.)

* Larger datasets tend to do fine with smaller `negative` and `window` values – as low as 1 or 2 in giant datasets.

- Gordon

Andres Moreira

Jan 19, 2018, 9:35:51 PM
to gen...@googlegroups.com
Hello Gordon,

I just wanted to say that I've been busier than usual, and I haven't been able to answer you or keep up my testing. As soon as I return to it, I'll let you know.

Answering some of your questions,

* the golden set has 1,000 articles like the others (300-500 words)
* in general, cluster size varies between 3-5, and there is one cluster, bigger than the others (around 150 items), with news articles not related to any other cluster
* to check accuracy, I iterate over the golden-set clusters; for each cluster I take the IDs and check in the clustered output whether they belong to the same cluster or not. Then I check whether the clusters they were assigned to are sub-clusters, or other clusters with elements from other, distant clusters.
* currently, the news_articles and the training set (790K articles) only share about 10% of articles
* TruncatedSVD: 300 dimensions

To answer the other questions I need to do some runs, as the other process is currently running and hasn't been tested on this data as-is.

I'll follow up.

Thanks, Gordon!


Ivan Menshikh

Jan 22, 2018, 3:17:59 AM
to gensim
BTW, try LDA, because this model naturally splits your corpus by topics. Probably this will work better than Doc2Vec here.