It's still unclear to me:
Are the 'golden-set' articles full articles of 300+ tokens, or just the short phrases/titles you provide as examples?
If 1000 articles are broken into 180 clusters, each cluster has an average of 5.6 items, with some having many more or fewer, correct? (Are there single-item clusters in the golden-set? Are there 100-item clusters?)
How are the golden-set clusters turned into an accuracy score for a model – is it checking whether pairs of docs land in the same cluster, or some other way of iterating over the test items?
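(One common pair-based scoring, in case it matches what you're doing – this is just an illustrative sketch, not a claim about your actual evaluation: count, over all document pairs, how often the gold labeling and the predicted clustering agree on "same cluster vs. different cluster" – i.e. an unadjusted Rand index.)

```python
from itertools import combinations

def pairwise_rand_index(gold, pred):
    """Fraction of doc pairs on which two labelings agree: both place
    the pair in the same cluster, or both place it in different ones."""
    agree = total = 0
    for a, b in combinations(gold.keys(), 2):
        total += 1
        same_gold = gold[a] == gold[b]
        same_pred = pred[a] == pred[b]
        if same_gold == same_pred:
            agree += 1
    return agree / total

# Toy example: 4 docs, 2 gold categories, one doc mis-clustered.
gold = {"d1": "politics", "d2": "politics", "d3": "sports", "d4": "sports"}
pred = {"d1": 0, "d2": 0, "d3": 1, "d4": 0}
print(pairwise_rand_index(gold, pred))  # 0.5 (3 of 6 pairs agree)
```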
How many articles are in the `new_articles` used for learning the clusters? And, are they representative/randomly-subsetted from the full 790K?
How many dimensions come out of the TruncatedSVD process, and how does that compare to your tested Doc2Vec dimensionalities?
How many clusters come out of the TruncatedSVD process, and how does that compare to the number of clusters learned from your tested Doc2Vec dimensionalities?
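(A quick way to answer the dimensionality question on the SVD side, for comparison against the Doc2Vec `vector_size` values you tried – the corpus and `n_components` here are stand-ins, not your actual settings:)

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative mini-corpus standing in for your `new_articles`.
docs = ["domestic politics vote", "foreign politics treaty",
        "football match result", "tennis open final"]
X = TfidfVectorizer().fit_transform(docs)

# n_components is the dimensionality fed onward to clustering –
# this is the number to compare against Doc2Vec's vector_size.
svd = TruncatedSVD(n_components=3)  # must be < number of features
reduced = svd.fit_transform(X)
print(reduced.shape)  # (n_docs, n_components)
```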
If the 'golden-set' that you are optimizing-towards implies exactly 180 categories, perhaps it'd be good to tune the clustering algorithm to give 180 clusters, no matter the earlier vectorization steps?
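(For example, swapping HDBSCAN for a clusterer that accepts a fixed cluster count – a hedged sketch with toy vectors; in your case the input would be the reduced doc-vectors and `n_clusters` would be 180:)

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy stand-in for your (n_docs, n_dims) reduced doc-vectors.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# Unlike HDBSCAN, this lets you force the cluster count to match
# the golden-set's implied 180 categories (here, 2 for the toy data).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # first two docs share a label; last two share the other
```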
When you suggest the clusters from (texts)->(Doc2Vec)->HDBSCAN include "totally unrelated news together", that seems odd, because usually the results of Doc2Vec at least deliver the quality that nearest-doc-vectors are (to human eyes) recognizably-similar in topic.
It'd be worth doing a deep-dive on certain article-pairs. For example: pick an anchor article A in the golden-set. Rank all other articles by closeness to this article (in both the Doc2Vec and TruncatedSVD spaces). At what ranks do the N other articles, that humans put in the same category as A, appear? To the extent any nearest-neighbors weren't in the same golden-set category, do they still appear related by other human-perceptible factors? (Are they a closely-related category, like the "Politics-Domestic" vs "Politics-International" conjecture I made earlier?)
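(That rank-probe can be done with a few lines of cosine-similarity math – a minimal sketch with made-up toy vectors and categories; you'd run it once on the Doc2Vec vectors and once on the TruncatedSVD vectors:)

```python
import numpy as np

def ranks_of_same_category(anchor_idx, vectors, labels):
    """Rank all other docs by cosine similarity to the anchor; return
    the ranks (1 = nearest) at which same-category docs appear."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v[anchor_idx]
    order = [i for i in np.argsort(-sims) if i != anchor_idx]
    return [rank + 1 for rank, i in enumerate(order)
            if labels[i] == labels[anchor_idx]]

# Toy doc-vectors and human category labels, for illustration only.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]])
cats = ["pol", "pol", "sport", "pol"]
print(ranks_of_same_category(0, vecs, cats))  # [1, 2]
```

If same-category docs cluster at low ranks in the Doc2Vec space but HDBSCAN still scatters them, the bottleneck is in the clustering step rather than the vectors.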
Given that both processes are, at a really high-level, "tally-of-terms -> lower-dimensionality -> same-clustering-algorithm", I'd expect them, when similarly tuned, to perform in a broadly-similar way. (I'd not especially expect Doc2Vec to give much better or worse results, though it's worth trying in case it fits the data/goals well.) So, the big drop-off in your evaluation is still surprising, and suggestive to me there may be some extra inadvertent bottleneck (in dimensions/clusters/training-data) in your Doc2Vec process compared to the other.
Regarding other questions:
* Published work tends to use doc-vector sizes from 100-1000 dimensions – but the optimal level depends on the dataset & application.
* In Word2Vec, it's been observed that larger `window` values tend to emphasize topical-similarity in resulting vectors, and smaller `window` values emphasize functional interchangeability. The same probably applies in Doc2Vec. (Though note that in pure `dm=0`, the `window` parameter is irrelevant, because each doc-vector is simply trained to predict each doc-word in turn – a sort of full-document window. If you go to PV-DBOW plus word-training, `dm=0, dbow_words=1`, then `window` is again relevant.)
* Larger datasets tend to do fine with smaller `negative` and `window` values – as low as 1 or 2 in giant datasets.
- Gordon