Gensim Doc2Vec vs Tensorflow


Sachinthaka Abeywardana

Oct 3, 2016, 11:16:44 PM
to gensim
This is a crosspost from here: http://stackoverflow.com/questions/39843584/gensim-doc2vec-vs-tensorflow-doc2vec

I'm trying to compare my implementation of Doc2Vec (via tf) and gensim's implementation. It seems, at least visually, that the gensim ones are performing better.

I ran the following code to train the gensim model, and the one below that for the tensorflow model. My questions are as follows:

1. Is my tf implementation of Doc2Vec correct? Basically, is it supposed to be concatenating the word vectors and the document vector to predict the middle word in a certain context?
2. Does the `window=5` parameter in gensim mean that I am using two words on either side to predict the middle one? Or is it 5 on either side? The thing is, there are quite a few documents that are shorter than length 10.
3. Any insights as to why gensim is performing better? Is my model any different to how they implement it?
4. Considering that this is effectively a matrix factorisation problem, why is the TF model even getting an answer? There are infinite solutions, since it's a rank-deficient problem. <- This last question is simply a bonus.

Gensim:

    model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
    model.build_vocab(corpus)
    epochs = 100
    for i in range(epochs):
        model.train(corpus)


TF:

    batch_size = 512
    embedding_size = 100   # Dimension of the embedding vector.
    num_sampled = 10       # Number of negative examples to sample.

    graph = tf.Graph()

    with graph.as_default(), tf.device('/cpu:0'):
        # Input data.
        train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
        train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size/context_window])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size/context_window, 1])

        # The variables
        word_embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        doc_embeddings = tf.Variable(tf.random_uniform([len_docs, embedding_size], -1.0, 1.0))
        softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window + 1) * embedding_size],
                                                          stddev=1.0 / np.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

        ###########################
        # Model.
        ###########################
        # Look up embeddings for inputs and stack words side by side
        embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
                                 shape=[int(batch_size/context_window), -1])
        embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
        embed = tf.concat(1, [embed_words, embed_docs])

        # Compute the softmax loss, using a sample of the negative labels each time.
        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                                                         train_labels, num_sampled, vocabulary_size))

        # Optimizer.
        optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)



Gordon Mohr

Oct 4, 2016, 5:21:38 AM
to gensim
I don't yet know the idioms and ins & outs of TF enough to evaluate your TF code for correctness, or comment on why any particular implementation could be slower than gensim. 

Generally speaking, the gensim cython codepaths do the raw calculations at the heart of the algorithm via native libraries that should be as efficient as any other language/library implementation. Performance differences with another implementation (as with gensim Word2Vec versus the original word2vec.c or versus the fastText word2vec) will most likely be due to differences in corpus IO/prep or the effective amount of multithreading achieved (which can be a special challenge for Python due to its Global Interpreter Lock).

Note that the `dm_concat` mode creates the largest, slowest-to-train models and is also the least-tested gensim code. I'm not sure of any tasks on which it definitively shows the benefits claimed for it in the Paragraph Vectors paper. (Maybe, much larger/longer-trained corpuses?) So while your description seems correct ("concatenating the word vectors and the document vector to predict the middle word in a certain context"), for any task or reimplementation, I would recommend starting with the 'DBOW' mode (`dm=0`). It's the simplest and fastest mode – and often does best on evaluations! Then, try the slightly-more-complex DM/averaging mode (`dm=1`). Only after succeeding in understanding/evaluating those, would I then consider tackling the DM-with-concatenation mode.
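For concreteness, the three modes map onto `Doc2Vec` constructor flags roughly as follows. This is only a sketch, reusing the parameters from the original post; check the docs for your gensim version:

```python
# PV-DBOW: simplest and fastest, often best on evaluations (ignores window)
model = Doc2Vec(dm=0, size=100, negative=10, hs=0, min_count=2, workers=cores)

# PV-DM, averaging the context word vectors together with the doc vector
model = Doc2Vec(dm=1, dm_mean=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)

# PV-DM with concatenation: biggest, slowest, least-tested
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
```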

`window=5` means 5 words on both sides of the target (to-be-predicted) word – so 10 words in total. 
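Concretely, the context selection can be sketched in plain Python (an illustrative helper, not gensim's actual code):

```python
def context_for(words, i, window=5):
    # Up to `window` words on each side of the target at index i,
    # truncated at the edges of the text (short docs just use fewer).
    return words[max(0, i - window):i] + words[i + 1:i + window + 1]

words = ["w%d" % j for j in range(12)]
print(len(context_for(words, 6)))   # 10: five words on each side
print(len(context_for(words, 1)))   # 6: only one word available on the left
```

So documents shorter than 10 tokens are still usable; targets near a text's edges simply get smaller contexts.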

- Gordon

Sachinthaka Abeywardana

Oct 4, 2016, 7:50:43 AM
to gensim
Hi Gordon,

Thanks for that. I should have made it clear that speed wasn't really a concern. It's just that gensim's nearest document vectors (most similar in terms of cosine distance) seemed to make sense. Given the number of tweets I have (~30k), both gensim and TF run in under 15 mins (gensim still being fastest).

Thanks for clarifying the window parameter.

Lastly, could you comment on whether the way I've trained the gensim model using the for loop makes sense?

Thank you,
Sachin

Amar Budhiraja

Oct 4, 2016, 10:54:03 AM
to gensim
Hi Sachinthaka,
I have used Doc2Vec extensively. Yes, your way of training is correct: instead of handing the entire thing to the constructor, you are doing the steps on your own.

Hope it helps,
Amar

Gordon Mohr

Oct 4, 2016, 2:53:53 PM
to gensim
My comments about `dm_concat` mode apply ten-times-over on such a tiny dataset (30k examples) of tiny texts (tweets). Your model is likely far larger than your dataset – not a good setup for generalizable learning. 

There are actually two problems with the way you're explicitly calling `train()`. First, each time `train()` is called, it will cycle over the provided corpus `iter` number of times, where `iter` was an instance-initialization parameter with a default value of 5. So you're iterating over the data 500 times, which is probably not your intent. Second, each call to `train()` causes the effective `alpha` learning rate to start at the value of the `alpha` initialization parameter and linearly-descend to `min_alpha`. So your effective learning rate see-saws from max-to-min 100 times, when proper training would have one long smooth descent. (Recent versions of gensim should be logging a warning when it detects this error.)
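The see-saw can be simulated in plain Python; this just mimics the linear decay, it isn't gensim internals:

```python
def alpha_schedule(alpha, min_alpha, passes):
    # Linear decay from `alpha` toward `min_alpha` over `passes` passes.
    step = (alpha - min_alpha) / passes
    return [alpha - step * i for i in range(passes)]

# 100 explicit train() calls, each doing iter=5 internal passes:
# the rate restarts at 0.025 a hundred times.
seesaw = alpha_schedule(0.025, 0.0001, 5) * 100
# One long training run decaying smoothly over all 500 passes.
smooth = alpha_schedule(0.025, 0.0001, 500)
print(seesaw[4], seesaw[5])   # ~0.0051, then back up to 0.025
```

The simple fix is either to pass the corpus and an `iter` count to the constructor and let a single run handle the decay, or to manage `alpha`/`min_alpha` yourself across an explicit loop.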

Having glanced a bit more at your code, I suspect it's an error to be using any functions with 'softmax' in their name, if what you really want to implement is negative-sampling. See for example the discussion of negative-sampling in the TensorFlow Word2Vec tutorial (https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html#scaling-up-with-noise-contrastive-training).
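To see what negative sampling actually optimizes, independent of any TF API, here is a tiny numpy sketch of the word2vec-style objective (all names illustrative): each positive (context, target) pair has its score pushed toward 1 while k randomly-drawn "noise" words are pushed toward 0.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, k = 50, 8, 5                     # vocab size, embedding dim, negatives per positive
in_vecs = rng.normal(0, 0.1, (vocab, dim))   # "input" (context/doc) embeddings
out_vecs = rng.normal(0, 0.1, (vocab, dim))  # "output" (target word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(ctx, target, lr=0.1):
    """One SGD step of negative sampling for a single (context, target) pair."""
    v = in_vecs[ctx]
    grad_v = np.zeros(dim)
    loss = 0.0
    pairs = [(target, 1.0)] + [(n, 0.0) for n in rng.integers(0, vocab, size=k)]
    for w, label in pairs:
        p = sigmoid(v @ out_vecs[w])
        loss += -np.log((p if label else 1.0 - p) + 1e-12)
        g = p - label                        # gradient of the logistic loss
        grad_v += g * out_vecs[w]
        out_vecs[w] -= lr * g * v
    in_vecs[ctx] -= lr * grad_v
    return loss

first = sgns_step(3, 7)
for _ in range(200):
    last = sgns_step(3, 7)   # repeated updates drive the loss down
```

In the TF graph above, the analogous change would be swapping the 'softmax'-named loss for a noise-contrastive one (e.g. `tf.nn.nce_loss` in that era's API), as the linked tutorial describes.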

- Gordon

Radim Řehůřek

Oct 7, 2016, 8:15:46 AM
to gensim
Changing the `iter` to default to 5 was such a bad move... it trips everyone up. I know it would trip me up!

Did we do it just to match the C word2vec defaults, or was there something else? Can we go back to iter=1 default?

Best,
Radim

Andrey Kutuzov

Oct 7, 2016, 8:40:30 AM
to gen...@googlegroups.com
On 10/07/2016 02:15 PM, Radim Řehůřek wrote:
> Changing the `iter` to default to 5 was such a bad move... it trips
> everyone up. I know it would trip me up!
> Did we do it just to match the C word2vec defaults
Yes, and actually you requested it :)
https://github.com/RaRe-Technologies/gensim/pull/538#issuecomment-158861275

--
Solve et coagula!
Andrey

Radim Řehůřek

Oct 14, 2016, 3:20:12 AM
to gensim
Oops, must have been a weak moment :)

Maybe what we need is some fatter warning (docs, runtime)? 

This default is really counter-intuitive. It also doesn't match what the rest of gensim does, or its philosophy (stream input by default & leave input repetition explicitly to users).

Best,
Radim

Andrey Kutuzov

Oct 14, 2016, 8:03:11 AM
to gen...@googlegroups.com
I would vote for a fat warning at runtime, yes.

Gordon Mohr

Oct 14, 2016, 6:50:30 PM
to gensim
There is a warning generated if you call `train()` again with an effective `alpha` higher than it's previously reached – often indicative of a mistake. We could add other warnings.

Some of the confusion arises from the dual paths offered by the model – "there's more than one way to do it!" – where you might trigger all training by supplying a corpus, or do vocabulary-building/training explicitly later. In the 1-liner all-in-initialization approach, the case for matching the word2vec.c expectations (and common need for multiple iterations) is strongest. On the other hand, if calling `train()` yourself, you might not expect a remembered-parameter-from-earlier-initialization to have such an effect, and some example code from when the default was `iter=1` has led people astray. 

Perhaps `train()` should require an explicit (non-default) `passes` parameter. In the 1-liner/initialization-trains case, it'd be called with the 'iter' value. For anyone calling it themselves, they'd have to make an explicit choice of 'passes'. And after the change, any old code without the parameter would (by design) break, forcing a change to explicit specification. 

Perhaps even `alpha` and `min_alpha` should be explicit required parameters to `train()`, to ensure those calling it directly aren't see-sawing the values each call. Calling 'train()' directly is kind of an advanced approach, so requiring this level of choice may be appropriate. 
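Under that proposal, a direct call would have to spell everything out; something like the following (hypothetical signature and parameter names, not the API at the time of writing):

```python
# Hypothetical: nothing remembered from initialization, every knob explicit.
model.train(corpus, passes=20, alpha=0.025, min_alpha=0.0001)
```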

- Gordon

Radim Řehůřek

Oct 15, 2016, 7:50:34 AM
to gensim

> Perhaps `train()` should require an explicit (non-default) `passes` parameter. In the 1-liner/initialization-trains case, it'd be called with the 'iter' value. For anyone calling it themselves, they'd have to make an explicit choice of 'passes'. And after the change, any old code without the parameter would (by design) break, forcing a change to explicit specification. 
>
> Perhaps even `alpha` and `min_alpha` should be explicit required parameters to `train()`, to ensure those calling it directly aren't see-sawing the values each call. Calling 'train()' directly is kind of an advanced approach, so requiring this level of choice may be appropriate. 

I like these suggestions. Better than a warning.

Lev, can you add this as an "easy wishlist" item to github? It will need a clear description in the changelog, because of the backward-compatibility breakage.

Cheers,
Radim