Doc2Vec parameters for Wikipedia.


jose ipc

Jun 1, 2016, 9:20:58 AM
to gensim
Hello everyone.

I'm working on my thesis, using word2vec and doc2vec for topic detection in the TC-STAR corpus. I have already experimented with word2vec, obtaining an accuracy of 64% after first training on Spanish Wikipedia articles. Now I want to repeat the experiment with doc2vec, but I am confused by its parameters. Should I use PV-DM or PV-DBOW, hierarchical softmax or negative sampling, concatenation or sum?

The script that I'm using to train the doc2vec model on the Spanish wiki corpus is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys

from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from random import shuffle

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    # read one article per line, tagging each with its line number
    train_labeled_sentences = []
    with open(inp, "r") as fd:
        for i, line in enumerate(fd):
            train_labeled_sentences.append(LabeledSentence(line.split(), tags=[str(i)]))
            if i % 10000 == 0:
                logger.info("Read %i articles" % i)

    model = Doc2Vec(size=400, window=8, min_count=3, workers=8,
                    dm=1, hs=0, dbow_words=0, dm_concat=1)
    model.build_vocab(train_labeled_sentences)

    for epoch in range(10):
        shuffle(train_labeled_sentences)
        model.train(train_labeled_sentences)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay

    model.save("wiki.recommend.mikolov.doc2vec")


In this script, PV-DM with concatenation is used, as Mikolov recommends in [Distributed Representations of Sentences and Documents]. Is it possible to obtain good results with this configuration on the wiki corpus?

Thanks for your answers.

Gordon Mohr

Jun 1, 2016, 5:00:05 PM
to gensim
There's no set of options best for all corpora and purposes – you have to experiment with your own data and goals.

You may be interested in the paper "Document Embeddings with Paragraph Vectors" (http://arxiv.org/abs/1507.07998), which trains PV-DBOW vectors along with word-vectors on Wikipedia, and gets interesting results. (In gensim `Doc2Vec`, this corresponds to the two non-default options `dm=0, dbow_words=1`.) Unfortunately, as with some other papers, they don't seem to completely specify their choice of options. (For example, I can't find them ever saying what 'window' size they're using.)

A few observations on your existing code/choices:

* DM-with-concatenation results in the largest, slowest models and I haven't yet found a demo dataset/problem where it gives the best results (as is implied in the Mikolov/Le paper). So, I'd try it last, if you've got lots of free time and RAM. 

* The 'window' is the maximum number of context words used on each side of the 'target' word, so a value of 8 actually uses up to 16 words. It's not automatically the case that larger is better; I've seen datasets where `window=2` resulted in the best analogies-scores.

* It's tough to learn much from words that appear only 3 times; the paper above says they used a cutoff that resulted in a 915,000-word vocabulary, which for the Wikipedia dumps I've worked with corresponds to a `min_count` closer to 40 or 50.

* If specifying `hs=0`, it's good to be explicit about the number of negative samples used (though your code is OK in the latest gensim versions, where a default of `negative=5` applies). As with 'window', though, sometimes even fewer negative samples are sufficient or even best-performing (on larger training sets).

* Because the default 'min_alpha' is 0.001, and the default 'iter' (controlling the class's own multiple passes per `train()`) is 5, your first epoch will actually be 5 passes over the data, *and* descend the effective alpha to 0.001. Then, on the next epoch, you'll do another 5 passes, but now at the fixed new alpha/min_alpha value. You probably don't want this behavior. You can just let the class do the iterations and alpha-management – set 'iter' to 10 – OR, if you want to manage it manually, set 'iter' to 1 (so each `train()` does one pass) and set 'min_alpha' to be equal to 'alpha' (or whatever you want it to be at the end of the next epoch), so it doesn't zig-zag across your iterations.

- Gordon

Kamal Garg

Apr 25, 2018, 2:29:41 AM
to gensim
Hi Gordon,
I used doc2vec PV-DBOW in two ways: 
1) Doc2Vec(dm=0, dbow_words=1, size=200, window=8, min_count=20, iter=5, workers=cores),
2) Doc2Vec(dm=0, dbow_words=1, size=200, window=5, min_count=12, iter=8, workers=cores),

to train on my Wikipedia corpus (14 GB).

The model trained successfully, and it gave relevant results for many phrases, but I ran into problems when I tried 'artificial intelligence'.
With the first model, I got the following suggestions:
1) [('Existential risk from artificial general intelligence', 0.7284922003746033),
 ('Ethics of artificial intelligence', 0.7267584800720215),
 ("Turing's Wager", 0.7224212884902954),
 ('Oracle (AI)', 0.7094788551330566),
 ('AI aftermath scenarios', 0.703824520111084),
 ('AI control problem', 0.6999846696853638),
 ('Superintelligence: Paths, Dangers, Strategies', 0.691785454750061),
 ('Murray Shanahan', 0.6860222220420837),
 ('Artificial empathy', 0.6842677593231201),
 ('Explainable Artificial Intelligence', 0.682081937789917),
 ('Iyad Rahwan', 0.681956946849823),
 ('Moral Machine', 0.6816681027412415),
 ('Timeline of artificial intelligence', 0.676627516746521),
 ('Susan Schneider (philosopher)', 0.6764435768127441),
 ('From Bacteria to Bach and Back', 0.6752616167068481),
 ('AI-complete', 0.6739200353622437),
 ('David A. McAllester', 0.673627495765686),
 ('Knowledge acquisition', 0.6730433702468872),
 ('OpenAI', 0.6718262434005737),
 ('Open Letter on Artificial Intelligence', 0.6698791980743408)]

With the second model:
2) [('Existential risk from artificial general intelligence', 0.7561817765235901),
 ('History of artificial intelligence', 0.734763503074646),
 ('Ethics of artificial intelligence', 0.7274946570396423),
 ('Oracle (AI)', 0.7165532112121582),
 ("Turing's Wager", 0.7119142413139343),
 ('Artificial general intelligence', 0.7059307098388672),
 ('Deep learning', 0.7024167776107788),
 ('AI takeover', 0.701856791973114),
 ('AI aftermath scenarios', 0.6950700879096985),
 ('Cognitive science', 0.6925462484359741),
 ('Symbolic artificial intelligence', 0.6894776821136475),
 ('AI-complete', 0.6873871088027954),
 ("Hubert Dreyfus's views on artificial intelligence", 0.6849253177642822),
 ('Moral Machine', 0.6835113167762756),
 ('Artificial neural network', 0.6826612949371338),
 ('Mind uploading', 0.6812909841537476),
 ('Cognitive bias mitigation', 0.6788017749786377),
 ('Explainable Artificial Intelligence', 0.6765998601913452),
 ('Bayesian cognitive science', 0.6736477017402649),
 ('Intelligence explosion', 0.671064019203186)]

The second model seems to give better results, but I want to eliminate suggestions like 'Existential risk from artificial general intelligence' and 'History of artificial intelligence'.
Is there a way I can tune the parameters to get better results? Also, should I try PV-DM with averaging to get better phrases, and if so, what window size and min_count should I use?
Any help will be appreciated. Thank you in advance.

Gordon Mohr

Apr 25, 2018, 1:43:43 PM
to gensim
You can always try different values for the meta-parameters to see if they give better results for your purposes. This works best if you create a repeatable, automated evaluation that scores each model (rather than just manually eyeballing results), then use that score to pick from among many parameter combinations.

Published `Doc2Vec` work tends to use 10-20 (or more) training iterations, so your current choices of 5 and 8 are on the low side. 

Note that a larger `window` means relatively more word-to-word training, and thus slower training overall and proportionately less tag-to-word (doc-vector) training. If your main interest is the quality of the article-title (doc-tag) vectors, you *might* find smaller windows but more iterations a useful tradeoff.

If you want to eliminate results like 'Existential risk from artificial general intelligence', you will likely have to devise your own heuristics for eliminating those kinds of articles from your training, or filter those titles from your results. The `Doc2Vec` algorithm looks at text content, and I would expect an article like 'Existential risk from artificial general intelligence' to be an excellent match, by text content, with the article 'Artificial intelligence'.
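For the filtering option, a hypothetical post-filter over `most_similar()`-style (title, score) results – the `filter_results` helper name and its substring heuristic are just illustrations, not part of gensim:

```python
def filter_results(query, results, banned_substrings=()):
    """Drop titles that merely contain the query, plus any explicitly banned ones."""
    q = query.lower()
    kept = []
    for title, score in results:
        t = title.lower()
        if q in t and t != q:   # e.g. 'history of artificial intelligence'
            continue
        if any(b in t for b in banned_substrings):
            continue
        kept.append((title, score))
    return kept

results = [
    ('History of artificial intelligence', 0.735),
    ('Existential risk from artificial general intelligence', 0.756),
    ('Deep learning', 0.702),
]
print(filter_results('artificial intelligence', results,
                     banned_substrings=('existential risk',)))
# keeps only ('Deep learning', 0.702)
```
A substring check alone misses titles like 'Existential risk from artificial general intelligence' (where the query words are not contiguous), which is why the sketch also accepts an explicit ban-list; you'd tune such heuristics to your own data.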

- Gordon

Kamal Garg

May 1, 2018, 8:54:38 AM
to gensim
Thank you for the reply, Gordon. I have worked around the 'artificial intelligence' problem, but I am stuck on another one. For example, I tried 'clay mineral' on the Wikipedia-trained doc2vec model and got no results. But when I tried to find similar documents for 'clay minerals', it worked and showed me results, because Wikipedia has an article on 'clay minerals', not 'clay mineral'. Is there a way, when a user queries a term like 'clay mineral' that is not present, to search for the closest string in the doc2vec dictionary and return its results instead? For example, in this case, a search for 'clay mineral' would fall back to the results for 'clay minerals'.
Thank you for the help in advance.


Gordon Mohr

May 1, 2018, 3:00:21 PM
to gensim
Do you mean that you're looking among the `Doc2Vec` model `docvecs` tags for the exact string 'clay minerals', and getting results because you trained with a lowercased article title 'Clay minerals' – but then not finding anything when you look for the exact string 'clay mineral', because you didn't train any documents with that exact string as a tag? (It'd be clearer if you used precise quoting of the strings & code you're trying – helpful to me, but also a good habit for thinking about the problem, because such absolute precision is required for working code.)

`Doc2Vec` only offers exact tag lookup of doc-vectors - if a string wasn't offered as a trained tag, it's just not present. 

But there are lots of techniques, mostly outside the purview of gensim, that can help in such situations. No single one is best. Many go under the name 'query expansion'. Some might involve extra preprocessing/stemming/lemmatization before training, or be able to leverage word-vectors in other ways. For example:

* if you added auto-complete or live similar-string matching, seeded with the set of known tags, someone typing 'clay mi...' would see good continuations or small-edit-distance variations of what they've typed, and be able to select/self-correct to something that's known.

* similarly, in the case of 'no results' you could run extra code to try things like – (1) listing small-edit-distance variations of what was typed from among the known set; or (2) tokenizing the full string (into ['clay', 'mineral']) and falling back to traditional keyword- or pattern/substring-matching against known tags – and then offer those matches as suggestions, or just compose results from some blend of those transformations, as if they were what was originally queried.

* you might go beyond just lowercasing titles, to more aggressive removal of plurals/suffixes/etc (stemming) or coercing terms into unified forms (lemmatization). Even if you still use the original unique titles as training-tags (or the display-titles), you'd calculate the 'canonical name' of a tag via this extra processing, and also do it to any queries, to force more variations of "the same or similar idea" to collapse to the same lookup keys

* since you're using a mode that also creates compatible word-vectors, breaking an unknown multi-word string (like 'clay mineral') into its words (['clay', 'mineral']), then using each word, or some average of the words, as a search might yield usefully-related tags

These are just a few of the tricks used to improve search/information-retrieval beyond "perfect string matches" – there are many more provided by other IR/text libraries; which of them are practical or worthwhile for you will depend on the specifics of your project/goals.

- Gordon