Doc2Vec inference stage


miel.shayne

unread,
Dec 5, 2014, 4:25:13 PM12/5/14
to gen...@googlegroups.com
In the last paragraph on page 3 of the paper describing the algorithm behind gensim's Doc2Vec implementation (http://arxiv.org/pdf/1405.4053v2.pdf), they talk about the "inference stage", where they add a column to D and gradient descend on it while holding W, U and b constant. They do this to get paragraph vectors for as-yet unseen paragraphs. How would we do this with the gensim implementation? Should I do something like this:

m = gensim.models.Doc2Vec(training_sentences)
m.train_words = False
m.train(test_sentences)

where "test_sentences" have labels that are not yet in m's vocabulary? If I do that, will the training labels also get updated (in which case I would need a fresh copy of m for each test task)?

Thanks!

miel.shayne

unread,
Dec 15, 2014, 9:21:53 AM12/15/14
to gen...@googlegroups.com
Updating this message for anyone else who is interested in doing this. It does appear to be possible to do an inference step after the initial training, but there are some caveats. Also, I have only tested this with the distributed memory and hierarchical sampling configuration. I will eventually test to see what would need to be done to make it work in the other paths, but it is lower on my priority list right now. The steps I used are:

1) Determine how many new labels are being inferred (Note that the new sentences should only have new labels. If there are labels that were seen during the initial training, they will also be updated with this algorithm).
2) Add the new labels to model.vocab. For hierarchical sampling, the only field in Vocab that really matters is "index". Set this to len(model.vocab) + i for each new label. The other fields can get default values, although to make sure the labels don't get skipped, "code" must be at least length 1 (I use [0]) and "sample_probability" should be 1.0.
3) Add len(new_labels) new rows to model.syn0, using random values as is done in Word2Vec.reset_weights().
4) Set model.train_words to False and model.train_labels to True
5) Train the model on the new sentences
6) The rows that you added to model.syn0 will have the inferred vectors for the new labels. You can slice them out of model.syn0 if you wish to leave the model unchanged (don't forget to also remove the new labels from model.vocab)
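The steps above can be sketched with plain numpy. This is only a standalone illustration of the array bookkeeping in steps 2, 3 and 6 (the shapes, the seeded random rows, and slicing the inferred vectors back out), not the actual gensim calls; the sizes and label names are made up:

```python
import numpy as np

layer1_size = 100
syn0 = np.random.rand(50, layer1_size).astype(np.float32)  # trained word+label rows

new_labels = ['SENT_NEW_0', 'SENT_NEW_1']  # step 1: labels not seen in training

before = syn0.shape[0]
for label in new_labels:
    # step 3: append one small seeded-random row per new label,
    # mimicking what Word2Vec.reset_weights() does for each vocab entry
    np.random.seed(hash(label) % (2 ** 32))
    row = (np.random.rand(layer1_size) - 0.5) / layer1_size
    syn0 = np.vstack([syn0, row.astype(np.float32)])

# ... step 5 (training on the new sentences) would update only these rows ...

# step 6: slice the inferred vectors out, leaving the original rows unchanged
inferred = syn0[before:]
syn0 = syn0[:before]
```

With train_words set to False, only the appended rows would move during step 5, which is why slicing them off afterwards restores the original model.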

I have done a few small tests to convince myself that the resulting vectors are appropriate, but would love some feedback from Radim or anyone who is familiar with the gensim doc2vec implementation. I'm also happy to put together a pull request once I've verified that this works for negative sampling and cbow.

Thanks again for this awesome package!

Radim Řehůřek

unread,
Dec 15, 2014, 10:17:53 AM12/15/14
to gen...@googlegroups.com, Tim Emerick
Hello Miel,

good timing -- I just published Tim's description of doc2vec (Tim is the person who implemented doc2vec in gensim):


Huge thanks to Tim!

It may answer some of your questions, but regarding vector inference on new sentences, the current API (basically the steps you describe, or maybe keeping one "fake label" at all times, and resetting and re-using it during inference) is way too cumbersome. I'm not happy with it.

Your offer of a pull request is very welcome -- please do! And if you can separate the inference on new documents completely read-only (no reason to do that add/remove dance, model object should stay unchanged the whole time), that would be absolutely perfect :)

By the way, I'm also thinking of trying the simpler "sentence2vec" method that Tomas Mikolov suggested on the word2vec mailing list, https://groups.google.com/forum/#!topic/word2vec-toolkit/wTx3E5D0n9s .

That won't require adding new vectors to syn0 or new vocab entries (which are useless anyway, for most scenarios). Much more memory friendly.

Although it does do a slightly different thing (vectors for words no longer trained together with vectors for labels... labels only trained with existing words), so accuracy needs to be checked and compared to Le's doc2vec.

Best,
Radim

miel.shayne

unread,
Dec 15, 2014, 11:39:15 AM12/15/14
to gen...@googlegroups.com, tim.e...@ccri.com
Thanks (to you and Tim) for the tutorial!

I'll try to put a pull request together when I have a little time. The reasoning behind my hack-ish solution above was that I wanted to do what I could without diving into the Cython code. I'm pretty sure that making the model read-only when inferring new labels will require new Cython code, which I'm happy to do, but it might take longer since I have less experience there. If we do make it read-only, we should think about how to handle multiple epochs during the infer step for manipulating alpha. Solution #1 in the tutorial would work fine, but solution #2 would not (unless we could pass pre-existing vectors into the infer method, which seems questionable).

Tim, I'm curious about the motivation behind treating the document vectors as if they were new words (adding them to syn0, vocab, the huffman tree, etc.). Was it simply for the convenience of being able to reuse some of the gensim word2vec implementation, or is there deeper reasoning behind it? It feels like separating them would fix some of these issues, and would also allow for having different sizes for the word and document vectors, as well as opening the door for actually concatenating the vectors (instead of summing/averaging them) as is done in Le's paper. I don't know if that is more trouble than it's worth though.

Is there a particular branch that I should be working from for changes to the Doc2Vec API? Also, are there any tests? I couldn't find any on Github.

Thanks so much!
Shayne

Timothy Emerick

unread,
Dec 15, 2014, 2:03:36 PM12/15/14
to gen...@googlegroups.com, tim.e...@ccri.com
miel.shayne: Sorry the currently existing API is so cumbersome for this. There was some discussion about what you're talking about in the initial PR for doc2vec, here:

My reason for treating them as words was initially twofold:

1) The biggest of which was convenience, as you mentioned, and 
2) I wanted to see what would happen if I blurred the difference between labels and words a bit by doing things like permitting overlap between the two.

The second reason was motivated by my desire to play around with first applying something like topic modeling to the collection of documents, then feed (some appropriate mutation of) the output as additional labels for doc2vec to see what would happen. It's probably a stupid idea, and I've been too busy to try it so I'm not sure if it'll produce anything of value. 

Based on my knowledge of doc2vec's performance and its main use cases, I completely agree that several things regarding sentence-embedding performance should be improved here.

I hope that helps!
Tim

Ivy Junior

unread,
Jan 9, 2015, 7:56:11 AM1/9/15
to gen...@googlegroups.com
Has anybody made any new progress on the "inference stage"?

On Saturday, December 6, 2014 at 5:25:13 AM UTC+8, miel.shayne wrote:

Ivy Junior

unread,
Jan 12, 2015, 10:35:11 PM1/12/15
to gen...@googlegroups.com
Thanks to miel.shayne for the suggestions on the inference stage!
I believe these steps can lead me to predict a new sentence's vector; however, I got a nonsensical result, and I can't convince myself that it is generating new sentence vectors correctly.
If you have any time, would you please help me out of this confusion?

The predict.py code is as follows:

import math
import sys
import traceback

import gensim
import numpy
from numpy import random, uint32
from gensim import utils

def MyLabeledLineSentence(train_sen_len, model, test_file):
    sens = []
    inner_id = 0
    try:
        for line in open(test_file, 'r'):
            line = line.strip()
            item_no = train_sen_len + inner_id
            label = 'SENT_' + str(item_no)
            # Minimal Vocab entry so the new label isn't skipped during training.
            newvocab = gensim.models.doc2vec.Vocab()
            newvocab.index = item_no
            newvocab.sample_probability = 1.0
            newvocab.code = []
            for i in range(0, int(math.log(item_no, 2) + 1)):
                newvocab.code.append(1)
            model.vocab[label] = newvocab
            # Add a new row to syn0 and re-seed it randomly, as reset_weights() does.
            model.syn0 = numpy.vstack((model.syn0, model.syn0[0]))
            model.index2word.append(label)
            random.seed(uint32(model.hashfxn(model.index2word[item_no] + str(model.seed))))
            model.syn0[item_no] = (random.rand(model.layer1_size) - 0.5) / model.layer1_size
            sens.append(gensim.models.doc2vec.LabeledSentence(utils.to_unicode(line).split(), ['SENT_%s' % item_no]))
            inner_id += 1
    except:
        print "test_file:", test_file, " Load error!"
        print traceback.format_exc()
        sys.exit(-1)
    return sens

model = gensim.models.doc2vec.Doc2Vec.load(model_file)
train_sen_len = len(model.vocab)
sentences = MyLabeledLineSentence(train_sen_len, model, test_file)
model.train_labels = True
model.train_words = False
model.train(sentences)
wfile = open(out_file, 'w')
for sen in sentences:
    label = sen.labels[0]
    similar_array = model.most_similar(label)
    wfile.write("Input test sentence:%s\n" % (' '.join(sen.words).encode('utf-8')))
    for sim in similar_array:
        wfile.write("\t\t%20s\t%.6f\n" % (sim[0].encode('utf-8'), sim[1]))
    wfile.write("\n")
wfile.close()
wfile=open(out_file, 'w')
for sen in sentences:
    label = sen.labels[0]
    similar_array = model.most_similar(label)
    wfile.write("Input test sentence:%s\n" % (' '.join(sen.words).encode('utf-8')))
    for sim in similar_array:
        wfile.write("\t\t%20s\t%.6f\n" % (sim[0].encode('utf-8'), sim[1]))
    wfile.write("\n")
wfile.close() 


On Monday, December 15, 2014 at 10:21:53 PM UTC+8, miel.shayne wrote:

miel.shayne

unread,
Jan 12, 2015, 11:31:59 PM1/12/15
to gen...@googlegroups.com
Hi Ivy,

I'm glad my post was helpful. Sorry I haven't had a chance to get any of this added to a pull request into gensim.

Your code looks good to me, except that you probably need to train the new sentences more times. If I recall correctly, when you call model.train(sentences), each sentence is put through the stochastic gradient descent exactly once. Unless you have extremely long sentences, you'll want to call train() multiple times (randomizing the order of your new sentences each time), so that the network can spend enough time actually training the sentence labels. This is also true of the initial training phase, not just the inference phase.
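In code, that multi-pass loop might look like the following sketch (the model.train call is commented out, since this only illustrates the shuffling schedule; the sentence strings are stand-ins for LabeledSentence objects):

```python
import random

sentences = ['s0', 's1', 's2', 's3']   # stand-ins for LabeledSentence objects
epochs = 20

schedule = []
for _ in range(epochs):
    epoch = list(sentences)
    random.shuffle(epoch)              # fresh random order each pass
    schedule.append(epoch)
    # model.train(epoch)               # in this era's API, one SGD pass per call

# every sentence is seen exactly once per epoch, in varying positions
passes_per_sentence = sum(ep.count('s0') for ep in schedule)
```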

Let me know if that helps!
Shayne

Ivy Junior

unread,
Jan 23, 2015, 6:09:44 AM1/23/15
to gen...@googlegroups.com

Hi Shayne,
Thanks for your warm response, and sorry for my delay; I've been swamped recently.
Today, as you suggested, I tried multiple passes for training and predicting: the following code calls train() 20 times in the training and testing periods. However, contrary to my expectation, the inference results did not improve.
Then I checked the trained model's effect on some words, and found the trained model itself was quite poor.
So, would you give some guidance on parameter settings?
Is gensim.models.doc2vec.Doc2Vec(alpha=0.05, min_alpha=0.05, hs=1) OK?

Looking forward to your reply!
Ivy

------------------------------------------Train.py-------------------------------------------------------------------
#coding=gbk
import gensim
import sys

if len(sys.argv) < 4:
    print 'python train.py input iter_times model'
    sys.exit(-1)

input=sys.argv[1]
iter_times=int(sys.argv[2])
model_file=sys.argv[3]

sentences=gensim.models.doc2vec.LabeledLineSentence(input)
model = gensim.models.doc2vec.Doc2Vec(alpha=0.05, min_alpha=0.05, hs=1)
model.build_vocab(sentences)
model.train_words=True
model.train_labels=True
for i in range(iter_times):
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    model.train(sentences)
model.save(model_file)

------------------------------------------------------------predict.py---------------------------------------------------------------
#------------------main------------------------ 
if len(sys.argv) < 4:
    print 'python predict.py model testfile outfile'
    print 'Pre-assumption: less test sen'
    sys.exit(-1)

test_file=sys.argv[2]
model_file=sys.argv[1]
out_file = sys.argv[3]

model = gensim.models.doc2vec.Doc2Vec.load(model_file)
train_sen_len=len(model.vocab)
sentences=MyLabeledLineSentence(train_sen_len, model, test_file)
model.train_labels=True
model.train_words=False
model.alpha = 0.025
for i in range(0, 10):
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    model.train(sentences)
wfile=open(out_file, 'w')
for sen in sentences:
    label = sen.labels[0]
    similar_array = model.most_similar(label)
    wfile.write("Input test sentence:%s\n" % (' '.join(sen.words).encode('utf-8')))
    for sim in similar_array:
        wfile.write("\t\t%20s\t%.6f\n" % (sim[0].encode('utf-8'), sim[1]))
    wfile.write("\n")
wfile.close()

On Tuesday, January 13, 2015 at 12:31:59 PM UTC+8, miel.shayne wrote:

miel.shayne

unread,
Jan 27, 2015, 4:46:55 PM1/27/15
to gen...@googlegroups.com
Ivy,

How big is your data? I'll describe what worked for me and what my data looked like, and maybe you can extrapolate from there. Each of my "sentences" was a full document, approximately 10-20 actual sentences in length, and I trained on about 500 documents. I ran 200 epochs, but rather than looping 200 times and adjusting the alpha, I concatenated 200 lists of the documents, where each list had been randomly shuffled. That way gensim's method of slowly lowering alpha during training worked for me. At prediction time I did the same thing, with 200 epochs on the new data.
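Concretely, the concatenate-shuffled-copies trick looks like this (a sketch with stand-in documents; the commented-out model.train call is where gensim would decay alpha smoothly over the whole stream):

```python
import random

docs = ['d%d' % i for i in range(500)]   # stand-ins for the labeled documents
epochs = 200

# One long stream of 200 independently shuffled copies: a single model.train()
# call then sees every document 200 times, while gensim decays alpha smoothly
# from its starting value down to min_alpha over the whole stream.
stream = []
for _ in range(epochs):
    shuffled = list(docs)
    random.shuffle(shuffled)
    stream.extend(shuffled)

# model.train(stream)   # single pass over the concatenated stream
```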

I should say too that the results I got were believable but not astounding. I chalk that up to having a fairly small data set to experiment on. I have plans to try initializing the model with a bigger corpus first, but there are some other things taking my attention.

I hope that info helps.

chenyan xiong

unread,
Mar 5, 2015, 8:10:16 PM3/5/15
to gen...@googlegroups.com
Hi, everyone
Has there been any update in doing inference for new doc?

Thanks!
Chenyan

Radim Řehůřek

unread,
Mar 7, 2015, 4:39:31 AM3/7/15
to gen...@googlegroups.com
Hello Chenyan,

yes, Gordon is working on it. You can check his progress in his fork:

(work in progress, as yet unmerged).

Radim

Gordon Mohr

unread,
Mar 11, 2015, 4:59:17 AM3/11/15
to gen...@googlegroups.com
In the fork, you can now ask the (trained) Doc2Vec model to infer_vector() for a new document (list of tokens):


There's not yet a cython/BLAS implementation, so inference is relatively slow for bulk use. The default steps/alpha/min_alpha are wild guesses that seem enough for the inferred vectors to have some similarity to those trained up by earlier bulk passes... but many more passes or a higher starting alpha may make sense as well.  
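For anyone curious what the inference step is doing under the hood, here is a toy numpy sketch of the idea (not the actual gensim implementation): gradient-descend a fresh doc vector against frozen trained weights, shown PV-DBOW-style with a full softmax for brevity. All sizes, parameters, and the random "trained" weights are made up:

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, dim = 20, 8

# Frozen parameters from "bulk" training (random here, just to exercise the loop).
syn1 = rng.randn(vocab_size, dim) * 0.1    # output weights, held constant

def infer_vector(doc_word_ids, steps=50, alpha=0.1):
    """Gradient-descend a fresh doc vector against frozen output weights
    (PV-DBOW-style: the doc vector alone predicts each word in the doc)."""
    d = (rng.rand(dim) - 0.5) / dim        # small random start, as in reset_weights()
    for _ in range(steps):
        for w in doc_word_ids:
            scores = syn1.dot(d)
            p = np.exp(scores - scores.max())
            p /= p.sum()                   # softmax over the vocabulary
            grad = syn1.T.dot(p) - syn1[w] # d(-log p[w]) / d(doc vector)
            d -= alpha * grad              # update ONLY the doc vector
    return d

def doc_loss(d, doc_word_ids):
    scores = syn1.dot(d)
    logp = scores - scores.max() - np.log(np.exp(scores - scores.max()).sum())
    return -sum(logp[w] for w in doc_word_ids)

doc = [3, 7, 7, 11]
vec = infer_vector(doc)
```

More steps or a higher alpha simply continue this descent, which is why those knobs matter so much for how close the inferred vector lands to a bulk-trained one.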

I've also added a variant of the "PV-DM" (dm=1) mode that concatenates the document vector ("lbl"-keyed) and surrounding-window word vectors, rather than summing/averaging them.  (This mode is enabled with initialization arguments "dm=1, dm_concat=1".) This is to try to match the experimental setup of the Le/Mikolov paragraph-vectors paper, with regard to the Stanford Sentiment Treebank (rotten tomatoes movie review) dataset. 

The dm_concat mode needs a lot of memory, and also currently lacks a fast cython/BLAS implementation. I'm also not yet sure of its correctness: the resulting doc vectors are not as good as the other methods in predicting positive/negative reviews, and all models' vectors are still very far from the paper's reported prediction results. An IPython Notebook trying to reproduce one of the paper's results is at:


When the dm_concat mode is optimized I'll get a better idea whether more-training-passes may close the error-rate gap with the paper. Any other ideas for things to try, in the code or parameters, are welcome. 

At some point I also plan to disentangle the bulk-training document-ids from the word-vocabulary – so the RAM required for bulk-training is only a function of the count of unique words, rather than the count of all documents. 

- Gordon

Parkway

unread,
Mar 31, 2015, 9:45:04 AM3/31/15
to gen...@googlegroups.com
@gordon
Hi! What is the difference between the sum/average method and the concatenation method? My misguided understanding was that concatenation generated doc2vec vectors with a dimension equal to the number of words in the vocab, but I don't believe that is right. Thanks.

Gordon Mohr

unread,
Apr 5, 2015, 4:12:16 AM4/5/15
to gen...@googlegroups.com
Let's assume both the word and document vectors have n dimensions, and you're using a window of k words before and k words after. 

With the sum/average method, you prepare the input layer L1 to be the sum/average of all the context vectors: the 2k word vectors and the 1 document vector – still just n dimensions.

With concatenation, the input layer L1 is the concatenation of all the context vectors: the 2k word vectors and the 1 document vector – so now (2k+1)*n dimensions instead of n. (This makes the model much bigger, and is retaining positional information that's lost with sum/average.) But it's still being constructed from the same n-dimensional vectors-per-word... so the trained vectors per word and per document still have n dimensions. 

With sum/average, the doc vectors must have the same dimensionality as the word vectors, to sum/average together. 'Related' doc and word vectors might be meaningfully 'near' each other... because the doc vectors are mixed in the same way, in the same place. 

With concatenation, the doc and word vectors could theoretically be of different dimensionality. (This isn't implemented yet.) And, since the doc and word vectors never substitute for each other in the same part of the input layer – the doc vector is always in the doc position – nearness between them shouldn't mean anything. (They're essentially from different spaces, even if of the same dimensionality.)
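The dimensionality difference is easy to see in a small numpy sketch (all sizes made up):

```python
import numpy as np

n, k = 100, 4                             # vector size; window words on each side
word_vecs = np.random.rand(2 * k, n)      # the 2k context word vectors
doc_vec = np.random.rand(n)               # the 1 document vector

# Sum/average: all 2k+1 vectors collapse into a single n-dimensional input layer.
l1_avg = (word_vecs.sum(axis=0) + doc_vec) / (2 * k + 1)

# Concatenation: the same vectors laid side by side, keeping positional information,
# so the input layer grows to (2k+1)*n while each stored vector stays n-dimensional.
l1_concat = np.concatenate([doc_vec] + list(word_vecs))
```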

- Gordon

Soumyajit Ganguly

unread,
Apr 5, 2015, 7:38:25 AM4/5/15
to gen...@googlegroups.com
I was trying to use Doc2Vec on the 20 Newsgroups dataset, and I created the model successfully. But when I do this:
print model.infer_vector(['breakfast', 'cereal', 'dinner', 'lunch'])
I get the error:
 File "C:\Python27\lib\site-packages\gensim-0.10.3-py2.7-win-amd64.egg\gensim\models\word2vec.py", line 543, in seeded_vector
    return (random.rand(self.vector_size) - 0.5) / self.vector_size
AttributeError: 'Doc2Vec' object has no attribute 'vector_size'

Parkway

unread,
Apr 5, 2015, 3:30:04 PM4/5/15
to gen...@googlegroups.com
Gordon: Thank-you for the very clear explanation. 

So, if concatenation was implemented with word and doc vectors of different dimensions then the raw output vectors (model.syn0) will be of dimension (2k+1)n?

Gordon Mohr

unread,
Apr 5, 2015, 5:49:57 PM4/5/15
to gen...@googlegroups.com
Are you sure you pulled in all the changes in my work-in-progress branch, and initialized your Doc2Vec instance before training in the usual way (which also calls the Word2Vec initialization method)? vector_size should be set for all new models here:


Alternatively, I haven't yet tested save/reload of models; if you're doing that maybe the expected vector_size was lost across such a save/load cycle (or was never set, in a model originally built earlier)? 

- Gordon

Gordon Mohr

unread,
Apr 5, 2015, 6:23:31 PM4/5/15
to gen...@googlegroups.com
While in some sense the `syn0` vector values are the "output" resulting from the entire training process (the reusable vectors we're building), during training they are used to form the "input layer" of the model.

And in fact `syn0` in the current implementation holds both all the vocabulary word vectors and all the available-for-bulk-training doc vectors. So even with a fixed dimensionality `vector_size` for both, with `doc_count` docs and `word_count` words, `syn0` has shape `(word_count + doc_count) x vector_size`. Looking up a single word or doc consults the same big `syn0`, and gives back a single `vector_size`-dimensional vector for either.
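In other words, the current layout is roughly (a sketch of the indexing scheme, not the real code; sizes made up):

```python
import numpy as np

word_count, doc_count, vector_size = 1000, 300, 50

# One combined array: word rows first, then doc rows.
syn0 = np.zeros((word_count + doc_count, vector_size), dtype=np.float32)

def word_vec(word_index):
    # words occupy rows [0, word_count)
    return syn0[word_index]

def doc_vec(doc_index):
    # docs occupy rows [word_count, word_count + doc_count)
    return syn0[word_count + doc_index]
```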

Now let's consider the hypothetical case where doc vectors were allowed to have a different dimensionality. Then they'd probably not be mixed into the same in-memory `syn0` structure at all. `syn0` would likely just hold word-vectors-in-training like in the plain Word2Vec case, and be of dimensionality `word_count x word_vector_size`. Some other structure would hold the doc-vectors-in-training, of dimensionality `doc_count x doc_vector_size`. 

Vectors-in-training from both of these sources would be stitched-together to form the input L1 for all training steps. That is, there might be a bit more copying back-and-forth compared to the case where L1 can be built completely from indexing into a single `syn0` array. But this separation would also allow for doc vectors to come from a different persistence-backed structure, which would also help allow training on `doc_count`s much larger than fit in memory, and perhaps eliminate some of the current duplication between the train_* and infer_* methods.

On my to-do list, enabling training on much-larger `doc_count`s is higher priority than allowing `doc_vector_size` and `word_vector_size` to vary. (While I can imagine reasons it might be useful to have such varied-dimensionality, the Paragraph Vectors paper's results I'd like to match use same-sized vectors.) But since both goals require pulling together L1 from different word- or doc- vector sources, the two capabilities might just naturally fit together in the same patch.

- Gordon

Parkway

unread,
Apr 7, 2015, 7:29:32 AM4/7/15
to gen...@googlegroups.com
Yes, I've been caught out by the current dual nature of "syn0". It makes more sense to keep the in-training word vectors on their own, with a separate data object for the in-training doc vectors. The to-do list looks great!

Michaël BENESTY

unread,
Apr 12, 2015, 6:19:12 PM4/12/15
to gen...@googlegroups.com
@Gordon can you tell me if doc inference makes sense in this use case:

I build a doc2vec model for many documents (each doc is a real one, made of more than 50 sentences) in a classical way with gensim.
Then I want to perform a kind of "smart" search over these documents.
For that purpose I want to look for documents similar to a description I wrote.

Do you think your vector inference function would work for so few words, compared to a real document?
Is there a better approach? I don't think summing the vector of each word would work, since document vectors are not built that way.

Do you have an idea of how I may perform this task?

Kind regards,
Michael

Gordon Mohr

unread,
Apr 12, 2015, 8:20:50 PM4/12/15
to gen...@googlegroups.com
While I've been tinkering with and improving the code to learn, I don't yet have enough experience on real data to know.  

If I understand correctly, you have longer documents, but then when someone composes a shorter description, you want to calculate the vector for the shorter description (via post-bulk-training inference with the trained-up model), then use closeness to these query vectors to suggest candidate longer documents from the earlier bulk indexing. I think you'd have to try it – it might work!

One thing to note is that while the technique clearly can be used on documents of any size, the paper calls them "paragraph vectors" and calculates them for handfuls-of-sentences (rather than say dozens-of- or hundreds-of- sentences). This might not be a concern in practice, and even if it is, you might be able to adjust by calculating vectors for subsections of your documents... then suggest any document whose subsection is close to the query. 

Also, other topic-modelling techniques may be more appropriate, and may have more prior work about the effectiveness of using smaller-docs to query for larger-docs. People with more familiarity with the other algorithms in the gensim toolbox, and in this domain generally, may be able to offer better suggestions for your use-case.

An interesting experiment for evaluating any system which vectorizes documents might be: taking a body of academic literature, do the vectors for the paper abstracts, alone, map closely to the vectors for the rest-of-the-paper-sans-abstract? And, if they don't, perhaps because abstracts use a much different (indeed more 'abstract') vocabulary, is there an adjustment that can make them fit more closely?

- Gordon

Michaël BENESTY

unread,
Apr 13, 2015, 7:11:31 AM4/13/15
to gen...@googlegroups.com
Thanks Gordon for this precise answer.

Your understanding of the use case is perfect.
I will try tonight and post result here.

My plan B (in case vector inference doesn't do the job) is to use LDA to get a distribution of topics for each document and find which topics relate to the words of the request, then guess the best documents. But given the very good results I get in doc similarity with word2vec, I would prefer to leverage the computed vectors and thereby capture the general meaning of the sentence.

From my understanding of your code (https://github.com/gojomo/gensim/blob/develop/gensim/models/doc2vec.py), I should use something like model.infer_vector(myNewDocument) and the function will return a vector.
It seems that the model is not updated after the computation of the vector for the new document. How am I supposed to use the new document vector in gensim to compute doc similarity? Should I do it myself?

Thank you a lot for the tips you may provide.

Kind regards,
Michaël

Michaël BENESTY

unread,
Apr 13, 2015, 6:06:46 PM4/13/15
to gen...@googlegroups.com
OK, I installed your version of gensim and tried the infer_vector function.
For more than 6-7 words, the results are very good. I tried it on a 100 MB dataset of more than 12,000 documents.
For sentences of 2-3 words, the results are OK.

When do you plan to make a pull request to the gensim project?

Gordon Mohr

unread,
Apr 13, 2015, 9:31:34 PM4/13/15
to gen...@googlegroups.com
I'm glad it's shown good results! (As I presume you figured out, the most_similar() method can accept a vector, as well as a vocabulary-key... but some people might be exporting all the vectors elsewhere, and using some other system, perhaps even nearest-neighbors indexing, to calculate nearest-matches.)
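For anyone exporting the vectors elsewhere, the nearest-match lookup is just a cosine-similarity ranking. Here is a standalone numpy sketch of that (not gensim's most_similar; the data is random):

```python
import numpy as np

def most_similar(query_vec, doc_vecs, topn=3):
    """Rank stored doc vectors by cosine similarity to an (inferred) query vector."""
    normed = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = normed.dot(q)                  # cosine similarity to every stored doc
    best = np.argsort(-sims)[:topn]       # indexes of the top matches, descending
    return [(int(i), float(sims[i])) for i in best]

rng = np.random.RandomState(1)
doc_vecs = rng.randn(10, 16)
hits = most_similar(doc_vecs[4], doc_vecs)   # a doc queried against itself ranks first
```

An approximate-nearest-neighbors index does the same ranking without the full matrix product, which matters once the document count gets large.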

Before an integration to the main project, I'd like to have cythonized (BLAS-accelerated) versions of the infer_ methods, and a bit more testing... probably in a week or two, unless someone else wants to get to it sooner. (Other changes including factoring out the doc vectors from the vocabulary structure would come some time later.)

Whether (and how) later incremental updates or inferencing could update the model is another issue. It's certainly possible to keep training, but you'd likely not want the same tight-loop alpha/steps on a single document (as done in the inference case) to have a disproportionate effect on the model (dragging it away from being representative of the earlier, bulk examples). I have a hunch you'd rely on the inferred vectors for a while, but when you've got enough new documents, repeat the bulk training for a balanced model where all documents have had equal influence. 

(There's also the interesting question, for your use case, of whether historical/example query-documents should be part of the bulk-training, too. And if you have any positive examples – "these queries should return these documents" – it might be possible to interleave those expectations into the training, as well.)

- Gordon

Michaël BENESTY

unread,
Apr 14, 2015, 7:07:11 AM4/14/15
to gen...@googlegroups.com
Yep, I found out that I should use most_similar. In the past I used the Annoy library (approximate nearest neighbors) with the C version of word2vec; it gives good results.
There is something I am not sure I understand.
Is there a (big) difference between the vector I would get if the document is included in the dataset before the model is built and the one I infer with your function?

I understand the method is slightly different, because the word/context vectors change for each document in the dataset when I generate the model. But for only one document, it is supposed to be very similar, right?

Another question: when I build the doc2vec model, and with your inference function, I have several sentences in each paragraph. Right now I remove punctuation and so on to normalize the text, so one document is made of many sentences made of many tokens. When doc2vec or your function analyzes this block of tokens, for the last word of a specific sentence in the middle of my document, the context window (let's say 5 tokens) will take in tokens of the next sentence, which probably describe another idea. These words count in the construction of the word vector, but they should not; it is a kind of noise in the model. How am I supposed to manage that case?

Regarding your suggestion, I have read a recent paper which goes further. They say that in a log session, a user will make several requests with different words which are probably all connected, because the user is searching for something specific. So they take all the requests together instead of just one, and build a kind of sentence made of many keywords. They say it works pretty well. The paper is from people working at Yahoo; it should be easy to find on Google Scholar. I think the idea is good, and I will try it later. It may have a good impact on the quality of the model (add new training examples to the dataset only if they make sense and are made of related keywords).

Kind regards,
Michael

Eric Lind

unread,
Apr 14, 2015, 4:58:56 PM4/14/15
to gen...@googlegroups.com
@Gordon:  Have you had a chance to look at this yet: https://groups.google.com/forum/#!msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ ?
Mikolov apparently released some modified Word2Vec C code that others have reported gives results similar to those reported in the paper. Maybe something in those changes will give a hint of what has to be done to replicate the state-of-the-art results with gensim?

It also looks like the key to replicating the paper's results may be concatenating the DM vectors with the DBOW vectors when classifying, is that something you've had a chance to try?

Eric

Gordon Mohr

unread,
Apr 14, 2015, 7:06:50 PM4/14/15
to gen...@googlegroups.com
On Tuesday, April 14, 2015 at 4:07:11 AM UTC-7, Michaël BENESTY wrote:
Yep, I found out that I should use most_similar. In the past I used the Annoy library (approximate nearest neighbors) with the C version of word2vec; it gives good results.
There is something I am not sure I understand.
Is there a (big) difference between the vector I would get if the document is included in the dataset before the model is built and the one I infer with your function?

I understand the method is slightly different, because the word/context vectors change for each document in the dataset when I generate the model. But for only one document, it is supposed to be very similar, right?

My sense is that the inferred vector should be similar to one that emerged from bulk training... but *how* similar will depend on many things, including the number of training passes, how much the document itself dominates the training set (because the set is small or the document is very representative), the number of inference steps, etc. 

Consider the degenerate corner case of only using a single document during multi-pass "bulk" training. (This would result in a pathologically specialized-for-that-one-document model.) Using that model, I'd expect a followup inference of the same document to become very close to the same vector. (Perhaps, with arbitrarily many training-passes and then inference-steps, arbitrarily close?)

But with real generalized models from larger datasets, the (bulk) doc-vector being improved each time the doc comes up in (shuffled) training passes, and the later inferred doc-vector being improved through tight inference on a frozen model, are each facing more-different optimization environments. (For example, maybe you've done 50 training passes over your 12K docs, of which the doc-in-question was only 1-of-12K. So the one doc provides 50/600000 of the influence, over many thousands of adjusted model values. But then you do 200 inference passes with that single doc in the otherwise-frozen model. It's providing 200/200 of the influence, but only against a smaller number of unfrozen model values. The two targets are related enough that the resulting vectors will be similar, but not so identical that any number of steps would necessarily settle on arbitrarily-close vectors.)

In my attempts to simulate the paragraph-vectors paper's S3.1 experiment (Stanford Sentiment Treebank), the bulk-trained doc vectors generally do better at the sentiment-prediction task than later-inferred vectors of the same documents, at least for infer_steps <= train_passes. However, since the inference is still slow/unoptimized, I haven't yet tried a full variety of inference gradient-descent parameters (such as extremely many steps). I suspect that whether the bulk-trained vector or the later-inferred vector is "better" for a particular task could vary, depending on the specific end-application, other particulars of the data/dataset-size, or the training parameters.
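
One way to put a number on "how similar", assuming you already have both vectors in hand, is plain cosine similarity – the same metric most_similar() ranks by. A minimal sketch:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (the most_similar() metric)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage, comparing a bulk-trained vector to a re-inferred one:
# bulk_vec = model.docvecs['doc_42']
# new_vec = model.infer_vector(tokens_of_doc_42)
# print(cosine(bulk_vec, new_vec))
```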

Other question: with your inference function, and when I build the doc2vec model, I have several sentences in each paragraph. Right now, I remove punctuation and so on to normalize the text, so one document is made of many sentences made of many tokens. When doc2vec or your function analyzes this block of tokens, for the last word of a specific sentence in the middle of my document, the context window (let's say 5 tokens) will take tokens of the next sentence, which is probably describing another idea. These words will count in the construction of the word vector but they should not. It is a kind of noise in the model. How am I supposed to manage that case?

Note that the experiments described in the paper turned the punctuation into tokens; their results were either assisted by that, or simply robust to such sentence-overlap. Window-overlap might not even be 'noise', to the extent you want to create one summary vector for the whole document.  (Why wouldn't overlapping fragments of each sentence be relevant to the training-goal, of predicting word occurrences, or the ultimate goal, of modeling the full doc topic(s)?) 

Still, if such sentence-to-sentence topic shifts are a concern with your data, it could be interesting to calculate vectors per-sentence, or per-paragraph, and see if that gives more satisfying query-results.

Regarding your suggestion, I have read a recent paper which goes further. They say that, in a log session, the user will make several requests with different words, and they are probably all connected because the user is searching for something specific. So they take all the requests together instead of just one, and build a kind of sentence made of many keywords. They say it works pretty well. The paper is from people working at Yahoo; it should be easy to find on Google Scholar. I think the idea is good, and I will try it later. It may have a good impact on the quality of the model (add new training examples to the dataset only if they make sense and are made of keywords that are all related).

Interesting, and also relevant to my project! When you say recent, was this specifically in the context of word/phrase vector models? (I vaguely recall similar results, around user-sessions and personalized-search, from the big search engines longer ago... but would really appreciate pointers to any ideas around composing a running 'intent vector' for a user, from a time series of user queries and satisfaction-signals.)

- Gordon

Gordon Mohr

unread,
Apr 14, 2015, 7:22:56 PM4/14/15
to gen...@googlegroups.com
I've seen that – he's reproducing the paper's second experiment (with longer-review IMDB data), which I haven't yet tried with gensim but will soon.  

I have tried the DM+DBOW vector-concatenation on the first experiment (with shorter-phrase Rotten Tomato data) – you can see it in the IPython notebook in my fork. It doesn't get close to the reported best results, with any configuration I've tried so far. Nor does concatenation of the two vectors seem to create more than a tiny improvement over the better of the vectors, alone. There are still some variants left to try (especially concatenated-context-windows in DM with many training-passes), and there might be an error in my analysis... but for now, the first experiment recipe (section 3.1 of the PV paper) seems highly resistant to reproduction. 

- Gordon

Parkway

unread,
Apr 15, 2015, 3:27:38 AM4/15/15
to gen...@googlegroups.com
@Gordon: A comment by the other author Quoc Le at https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/CJLWzmr0LaUJ may help.

Michaël BENESTY

unread,
Apr 15, 2015, 4:01:12 AM4/15/15
to gen...@googlegroups.com
@Gordon the document I am referring to is this one:

It may not be exactly what you were thinking of, but it may give some ideas.

za...@okcupid.com

unread,
Jun 8, 2015, 2:55:23 PM6/8/15
to gen...@googlegroups.com
Just read through all of this – are your forks' infer methods cythonized yet, or should I stick to using the OP's method for large bulk inference and training?

- Zach

Gordon Mohr

unread,
Jun 10, 2015, 8:07:21 PM6/10/15
to gen...@googlegroups.com
Yes!  The code is also now ready for wider review (and will apply cleanly against the main gensim 'develop' branch) via PR#356 on Github. 

Please give it a try and let us know how well it works for your needs. The current infer_vector gradient-descent defaults – just 5 steps starting from an initial alpha of 0.1 – are arbitrary but have worked surprisingly well in limited IMDB sentiment-prediction scenarios.

- Gordon

za...@okcupid.com

unread,
Jun 17, 2015, 10:31:20 AM6/17/15
to gen...@googlegroups.com
Alright, so I tried running this and evaluating the effectiveness of the paragraph vectors on my task by cross-validating a Logistic Regression classifier at every epoch of training, and I see a steady increase levelling off around epoch 200 (which may be more due to the Logistic Regression's limitations). However, when I go to infer these vectors again (200 iterations, though I also tried 10 to no avail), then evaluate Logistic Regression both on the test set and cross-validation of the training set, I get an accuracy near random (50% for two-class classification). All my baselines (averaging word embeddings both trained and pre-trained by GloVe on the common crawl corpus, and also bag of words) get around 65% accuracy (which isn't great either).

Essentially I'm wondering if you think I should try either different parameter settings or you think there might be a bug. I noticed your comment about the alternating PV-DBOW and Skipgram training, I might try that although I'm not sure if it will help much at inference time.

Also, I tried loading the previously trained model (using Doc2Vec.load) and noticed that (I'm pretty sure) it didn't load the document vectors again, as the accuracy from Logistic Regression went back down to where it started (around 50%).

Gordon Mohr

unread,
Jun 17, 2015, 6:19:57 PM6/17/15
to gen...@googlegroups.com
Are you using the latest code from the pending-pull-request branch (bigdocvec_pr)? There was a bug for a day or two last week (fixed in f59e483) that broke inference which would show exactly that symptom. 

I just double-checked inferred vectors on the IMDB/logistic-regression task, before and after a load-save cycle, and the inferred-vector logit predictiveness in each case approaches the same performance as the bulk-trained vectors.

I do have the vague impression inferred-vectors best match the bulk-vectors in smaller models (fewer dimensions, etc) – perhaps in larger models inference can descend to regions that are as good on the training task but still bad on the true task? 

FYI, on my sentiment experiments based on the paper's datasets, logit predictiveness seems to plateau after just 10 to at most a few dozen epochs. In the paper that talks about mixed DBOW & skip-gram training, and mixing word & document vectors for analogical-like topic-space navigation [1], they report using 10 epochs, training over 4M+ Wikipedia articles for a topical-similarity goal.

- Gordon

Zachary Jablons

unread,
Jun 17, 2015, 6:45:08 PM6/17/15
to gensim
I pulled the PR (yesterday), not your repo. Should I switch to your repo instead?

I'm not testing on a large amount of data (mostly due to time constraints), maybe around 1M pieces of short text. I'm using 300 dimensions, which seems about standard, mostly because I'm also experimenting with initialization from the GloVe pretrained Common Crawl vectors.

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/EFy1f0QwkKI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gordon Mohr

unread,
Jun 17, 2015, 7:26:24 PM6/17/15
to gen...@googlegroups.com
The PR is 1:1 backed by the bigdocvec_pr branch, so either is identical. You definitely have the relevant fix if, in doc2vec.py, Doc2Vec.clear_sims() calls Word2Vec.clear_sims() (rather than calling reset_weights(), which was the bug). If you're doing anything else to modify/reset any syn0/syn1 weights before inference (maybe init_sims(replace=True)?), that could also ruin inference, as it is essentially continued training (under heavy constraints).

1M docs is a lot more than the movie-review corpuses, and 300 dimensions isn't extra-large, so you sound good on those fronts. Might the version of the doc presented to infer_vector() not be prepared in the same way – different tokenization, normalization, etc? (The infer_vector() method expects a list of tokens, rather than a TaggedDocument or String.)

On the 100K docs IMDB dataset, with DBOW training, even after just a few passes, inference is usually good enough that the following works (similar to one of the model_sanity() checks in test_doc2vec.py): 

  # all_docs has all documents as TaggedDocument or similar
  # doc_id is key to one document of interest
  tagged_doc = all_docs[doc_id] 
  inferred_vec = d2v_model.infer_vector(tagged_doc.words)
  similars = d2v_model.docvecs.most_similar([inferred_vec])
  assert doc_id in [match[0] for match in similars]  # that is, doc_id usually among closest bulk-trained vectors to new inferred vector

- Gordon

za...@okcupid.com

unread,
Jun 18, 2015, 3:50:25 PM6/18/15
to gen...@googlegroups.com
Same generator (and thus tokenization etc) is used for both training and inference, so that's not the problem. It looks like the code I have is correct wrt the bug you mentioned, so that's also not the issue.

I tested out the sanity check and found that if I load the model using Doc2Vec.load, every sanity check fails. However if I train the model from random and then do the sanity check, it works (most of the time). In both cases however, testing Logistic Regression afterwards gives results near random, while the average baseline still does better.

Is there anything else I should check? For these checks I trained and inferred 10 epochs.

Gordon Mohr

unread,
Jun 18, 2015, 4:46:20 PM6/18/15
to gen...@googlegroups.com
So to be clear about what you're reporting with Doc2Vec.load:

- you start a fresh d2v_model, train it 10 epochs
- then, on that d2v_model, the  'sanity check' (looked-up bulk-trained vectors for doc N tend to be in the 'most_similar' set of a freshly-inferred vector for same doc N) usually passes
- you save that model: d2v_model.save('filename')
- you load a 2nd model instance from that save: model2 = Doc2Vec.load('filename')
- on that reloaded model, the same 'sanity checks' reliably fail?

If so, clearly a key bit of state is getting clobbered or otherwise failing to make the save-load roundtrip. 

Comparing things like `len(d2v_model.syn1) == len(model2.syn1)` and `all(d2v_model.docvecs.doctag_syn0 == model2.docvecs.doctag_syn0)` might pinpoint the deviation. (Perhaps the generic pickle-saving of the docvecs is failing? In which case it's not really inference that's broken after the re-load, but the 'precalculated' vector being used as a yardstick.)
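
Those spot-checks could be wrapped in a small helper (the names `d2v_model`/`model2` come from this discussion, not a real API):

```python
import numpy as np

def same_weights(a, b):
    """True if a weight array survived a save/load roundtrip unchanged."""
    a, b = np.asarray(a), np.asarray(b)
    return a.shape == b.shape and bool(np.allclose(a, b))

# Hypothetical usage after model2 = Doc2Vec.load('filename'):
# assert same_weights(d2v_model.syn1, model2.syn1)
# assert same_weights(d2v_model.docvecs.doctag_syn0, model2.docvecs.doctag_syn0)
```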

Regarding the regression results, it's harder to speculate. *If* inference approaches similar vectors as the bulk training, *then* the logit-predictor's results on inferred-vectors should approach (but perhaps not match) what the same predictor achieves on bulk-trained vectors. (Whether that will be any good is a separate issue; by 'average baseline' do you mean something like sum-of-word-vectors as a candidate doc-vector?) But when you're persistently getting ~50%, it almost sounds like some prep/comparison/rounding step is broken... 

- Gordon

Zachary Jablons

unread,
Jun 18, 2015, 6:55:22 PM6/18/15
to gensim
If I reuse the same tags for training after loading the model will it overwrite the existing ones? That would explain this issue, although I can't think of an immediate workaround (besides changing the labels). I verified that the docvecs.doctag_syn0 remain the same between saved and loaded models, so that shouldn't be an issue.

The average baseline is in fact the average over the word vectors as a document vector. I've looked over my code and verified that the label order should stay the same; however, I elide documents (and corresponding class labels) that have no words in the model's vocabulary. For some reason, between doing this for the inference and doing this for the average baseline, a (slightly) different number of documents is elided, which seems to suggest that maybe the vocabulary is modified by inference somehow? I also noticed that, when getting the vocabulary items for inference, it checks against sample probability, which doesn't make sense (since the sampling really only should happen during training). Do you think that might affect it?

Gordon Mohr

unread,
Jun 18, 2015, 8:11:08 PM6/18/15
to gen...@googlegroups.com
Re: "If I reuse the same tags for training after loading the model will it overwrite the existing ones?"

I'm not sure what this means. To continue training will continue to change the corresponding model vectors. You won't be able to train any word/tag vectors for any word/tags that weren't available at the first build_vocab() scan over the data – only words/tags that are recognized as known contribute to training examples. If your method of assigning IDs is based on order-of-presentation, maybe the same docs are being represented with shifted IDs? (That'd tend to scramble things...) 

No tags are provided during inference, and inference shouldn't change anything about any model weights.

If you can share some (pseudo-) code showing your steps it might generate more ideas. What training-mode(s) are you using?

Even though frequent-word subsampling (via the 'sample' parameter) definitely improves word-vector-quality on large datasets, it only seemed to hurt the doc-vector power on IMDB sentiment-prediction task, so I left it out of my trial runs. I believe that theoretically, the inference-descent should match the training-descent as closely as possible – and that's a part of the reason they use the exact same parameterized training methods. So if frequent words were downsampled during bulk-training, a similar probabilistic downsampling should happen during inference. 

Might a glitch in elision of bad documents cause the regressors and target values to be misaligned at some evaluation?

FYI, I've seen a suggestion that 'add' is better than 'average' for composing a text vector from word vectors, but haven't done either myself. 

- Gordon

Zachary Jablons

unread,
Jun 19, 2015, 3:32:14 PM6/19/15
to gensim
Ok, sorry about the confusion with loading the model giving bad results. It turns out every time I ran my script I was reshuffling the dataset and then giving it tags (in order), which was causing the results to be wonky. That's now fixed, and I'm getting the expected results.

After some debugging, I found the issue was entirely on my end, due to some ambiguity about how generators work. I'm now getting reasonable results for the Paragraph Vectors.

I think summing the vectors as opposed to averaging only makes sense if you're comparing them somehow with the cosine distance, as it wouldn't be sensitive to their scale, whereas models like Logistic Regression would still be influenced by the scale of the vectors.
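
A toy numpy illustration of that point – sum and average of the same word vectors point in the same direction (so cosine comparisons can't tell them apart), but differ in magnitude, which a linear model's weights do see:

```python
import numpy as np

words = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy word vectors
summed, averaged = words.sum(axis=0), words.mean(axis=0)

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
q = np.array([1.0, 2.0])  # any query vector

print(np.isclose(cos(summed, q), cos(averaged, q)))  # True: same direction
print(np.linalg.norm(summed) / np.linalg.norm(averaged))  # 3.0: scale differs
```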

Thanks for all the help!

Annu Sachan

unread,
Jun 23, 2015, 2:49:56 AM6/23/15
to gen...@googlegroups.com
@gordon
I'm new to gensim. I'm using the infer_vector method to get vectors for fresh documents from a model I trained earlier with Doc2Vec.train(sentences), then feeding those vectors as features to a Random Forest classifier with 1000 estimators (using scikit-learn), and unfortunately I'm getting very low accuracy, around 25%. Where might I have gone wrong?
For training I have around 7.5K documents, each of average size 4.2 kB, and for testing around 2.5K documents whose vectors I get using infer_vector. Is too little data the reason for the low accuracy?
I downloaded the bigrefactor branch.

Have you uploaded the bigrefactor_pr one?

Gordon Mohr

unread,
Jun 23, 2015, 8:41:14 PM6/23/15
to gen...@googlegroups.com
The best branch to pull from is the one referenced in the PR356 (https://github.com/piskvorky/gensim/pull/356), 'bigdocvec_pr'. (The other one is older and not getting any new work.)

The 'sentiment treebank' (rotten tomatoes) sentiment corpus is about 12K examples and ~28MB of text, and in my opinion the doc-vector sentiment-prediction results there are very weak. The IMDB sentiment corpus is 100K examples and ~131MB of text, and gets much much better results. 

So 10k docs and (10k * 4.2kB =) ~42MB of text is likely on the small side for any useful results. But also, if you're doing a binary classification, an "always guess the more common class" baseline should get at least 50% right – so maybe there are other things wrong with your setup?

Also, although I'm pretty confident the inference is working, it adds extra choices and variables to the results – so I would leave it out of initial tests, and then evaluate it separately, in a later step. You'd train your doc2vec model on all 10k raw text examples, then your classifier on the 7.5k doc vecs and target values, then test the classifier on 2.5k doc vecs (leftover from bulk training). 

(Only if/when that's giving meaningful results, then tinker with inference, to potentially handle an unbounded number of new examples.)

- Gordon

Annu Sachan

unread,
Jun 24, 2015, 8:40:47 AM6/24/15
to gen...@googlegroups.com
@Gordon
1. I forked the mentioned repo, downloaded the bigdocvec_pr branch, and installed it on my system.
2. I now have a larger data set: 7.5k documents, each around 40 kB, equivalent to 212.5 MB (the number of tokens per document is larger in this case).
3. I then trained on the complete 212.5 MB of documents using Doc2Vec.train, 5 epochs.
4. Then I ran the same classifier, dividing the vectors into 75% train and 25% test; the accuracy is 52.734% (with a 90-10 split it is 53.045%). There are 39 classes into which the data is to be classified.

The accuracy is quite low even here; it is only slightly greater than what I obtained with the 7.5k docs of 4.2 kB each, where I had hoped accuracy would increase as the data increased. Now I get the feeling that accuracy will be lower still when I try the infer_vector approach on fresh documents.

Gordon Mohr

unread,
Jun 24, 2015, 5:15:00 PM6/24/15
to gen...@googlegroups.com
I haven't yet done any experiments using Doc2Vec for multiclass classification, though I've seen mention of projects using it productively. For your data, maybe those are good results – do those numbers beat other methods? 

- Gordon

Annu Sachan

unread,
Jun 25, 2015, 3:50:21 AM6/25/15
to gen...@googlegroups.com
Yeah, with a Naive Bayes classifier the accuracy for the same task is around 78%.
If possible, can you give me links to where you have seen those mentions?

Gordon Mohr

unread,
Jun 25, 2015, 5:38:35 PM6/25/15
to gen...@googlegroups.com
The kinds of mentions I'm thinking of:

https://www.youtube.com/watch?v=vkfXBGnDplQ – mostly word2vec-focused, talks a bit about doc2vec or mixture with other techniques like LDA

https://www.youtube.com/watch?v=7BgpaZltW8s – mentions doc2vec combined w/ other technique for news-story classification

https://www.youtube.com/watch?v=7gTjYwiaJiU – very high-level, suggests classification success on low numbers of examples (but few specifics)

What features did you feed to Naive Bayes? Doc2vec is a way to get new, maybe-useful continuous features from blocks of text. What if you feed the doc2vec features to NB?

- Gordon

Annu Sachan

unread,
Jun 29, 2015, 6:00:01 AM6/29/15
to gen...@googlegroups.com
Hey, thank you for the links, Gordon, and sorry for the late reply. I'll go through them; hopefully they will help a lot.
For the Naive Bayes we simply used the term frequencies (tf).

Christopher Moody

unread,
Jun 30, 2015, 4:06:57 PM6/30/15
to gen...@googlegroups.com
Hey Annu, Gordon,
Thanks for the link to my talk! :) I've written a small bit of code which does a hacky version of inference-stage Doc2Vec on pre-trained word vectors:

We've seen some success with it. Let me know how it works if you use it!

chris

Annu Sachan

unread,
Jul 1, 2015, 7:24:12 AM7/1/15
to gen...@googlegroups.com
@chris
Thank you for providing another way out. I'll give it a try when I am done with my current engagements and let you know.

Gordon Mohr

unread,
Jul 1, 2015, 5:03:01 PM7/1/15
to gen...@googlegroups.com
Thanks for the pointer (and the talk)! From a quick browse of the code, some thoughts:

- it's interesting in how it adds new docs in blocks (so they become permanently part of the model, like Rutu M's patch for adding batches of new words), and offers a report/evaluation hook each training cycle

- as a subclass it won't work with the upcoming gensim changes, as it relies on the old mix of words/docs into the same arrays/dicts

- the inference-training doesn't actually freeze the prior-learned word/hidden weights during followup batch work, but rather lets them vary, then restores them from a checkpoint at the very end. I suspect this would mean, among other things, that larger batches-of-new-docs will get vectors less 'tightly matched' to the original batch, simply because there's more uncorrected drift during the inference steps.

The new `infer_vector()` method will likely be a preferable way to deduce new vectors... but the ability to remember newly inferred-vectors inside the model may be a worthwhile feature to pull up to the base class(es), depending on whether most users continue to rely on the model for (full-scan) similarity queries, or if they instead export the vectors to other systems. 

- Gordon

James Schneider

unread,
Jul 2, 2015, 4:57:41 PM7/2/15
to gen...@googlegroups.com
I have worked around this problem by using doc2vec to bulk-train on both testing and training data. I then recover the vectors for the unknown labels and use those features as my testing data, and the rest as training. I'm using doc2vec for multiclass classification, with on the order of ~200 classes over ~50GB of structured text data. I don't know if this is the correct usage; if not, please feel free to correct me, as I welcome any suggestions.

Currently the results are promising: using doc2vec to create the features and feeding them to logistic regression produces 79.2% accuracy under 10-fold cross-validation. Meanwhile, traditional tf-idf, the hashing trick, and feature selection/normalization and scaling with LSI only reach 63.9% accuracy with the C4.5 algorithm. I'm very impressed with Doc2Vec after doing basic MLP/HMM models in my graduate curriculum four years ago.

Parkway

unread,
Jul 3, 2015, 5:35:37 AM7/3/15
to gen...@googlegroups.com
@gogo This thread is along the same lines I mentioned a week or so back, i.e. generating word vectors (à la the Google News vectors) and at a later time using them for doc2vec vector generation. I haven't tested Chris' code yet, but it sounds like that is what it does. I can see this way of working becoming a common need.

Gordon Mohr

unread,
Jul 3, 2015, 6:06:26 AM7/3/15
to gen...@googlegroups.com
It does not do that. 

Devendra Singh Sachan

unread,
Jul 9, 2015, 2:19:06 AM7/9/15
to gen...@googlegroups.com
Hi Annu,

My experience with text categorisation tasks is that, with medium-sized labelled data, feature vectors of normalized tf / tf-idf followed by training with Logistic Regression or SVM usually give some of the best performance – usually sufficient for daily tasks, and very fast to train as well. The performance of Logistic Regression / SVM based models can be improved by adding extra features from Naive Bayes / LDA / LSI based approaches. Document vectors can give good results, but their parameters – like subsampling of frequent words, window size, and negative samples – have to be tuned for the task.

In the case of the IMDB movie reviews task, the tf-idf based methods gave around 92% accuracy while document vectors gave around 88%, but document vectors are one of the best approaches for clustering documents.
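
For reference, the tf-idf + Logistic Regression baseline described above is only a few lines with scikit-learn (the toy corpus below is purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled corpus standing in for real review text.
texts = ["great fun film", "dull bad film", "great cast fun", "bad dull plot"]
labels = [1, 0, 1, 0]

# Normalized tf-idf features feeding a Logistic Regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["fun great movie"]))  # leans positive on this toy data
```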

Thanks,
Devendra

Dirk Brand

unread,
Aug 6, 2015, 5:49:23 AM8/6/15
to gensim
I am also busy with a text classification task (binary), with a fairly large data set of texts. I now understand that it is better to train the Doc2Vec model on the entire corpus before worrying about splitting into training and testing sets, which makes sense. I am just wondering what the best practice would be: training on multi-sentence texts where all the punctuation is removed (i.e. contexts will flow across sentences), or on each sentence separately but with the same paragraph_id for each sentence? If you reuse a paragraph_id, will it continue training that one id with the new contexts? I hope so, because that would be great!

Also, I understand the difference between the concatenation vs sum/average step in the training, but I want to know if it will make a significant difference to my results (intuitively) if I sum or concatenate?

Gordon Mohr

unread,
Aug 6, 2015, 2:26:25 PM8/6/15
to gensim
You *can* reuse a paragraph ID ('tag'), and indeed splitting the text that would normally be presented once with a single paragraph ID is one way to work around an implementation limit – 10000 word tokens per example – inside the optimized methods, if some of your documents are longer.

As you note, the main difference is then that surrounding-word windows, in composing individual NN inputs, will never cross those split-boundaries. I don't have a good idea whether this helps or hurts the resulting vectors – that would need investigation, and it might vary for different datasets and training parameters. (In pure PV-DBOW – dm=0, dbow_words=0, where word vectors aren't trained at all – it couldn't make any difference, since the 'window' is irrelevant in that mode.)

Note that the Doc2Vec practices I've seen elsewhere (and in gensim's demos) often preserves punctuation marks, as if they were word-tokens. But, I don't know of a controlled test of that approach versus the alternative. 

In the DM modes, concatenation means a much larger (and slower-to-train) model, because the input-layer is no longer just `vector_size` dimensions, but rather `(tags_per_example + 2 * window) * vector_size` dimensions. And, in my experiments with the IMDB data, DM/concat vectors' logistic-regression sentiment predictiveness is noticeably worse than other methods. But given that the original Paragraph Vectors paper seemed to recommend concatenation, there may be other tasks/datasets on which it excels. So: it's different enough from the other methods that results are likely to vary, but only testing can reveal which way and whether the extra memory/time may be worth it. 

- Gordon

Annu Sachan

unread,
Aug 20, 2015, 10:14:58 AM8/20/15
to gensim
Hi Devendra

Thank you for the suggestions. I will try to implement them, hoping for better accuracy. I came across the concept of document vectors and found them interesting, so I just played with them – and the game still goes on.

Lukáš Svoboda

unread,
Oct 20, 2015, 6:27:41 AM10/20/15
to gensim
Hi Radim,

Is Gordon's work already included in the current gensim version 0.12.2? So far I have checked that the function infer_vector is already there, but I am not sure about the rest of his work. Is there any other way to infer vectors for new, previously unseen sentences? What about the "sentence2vec" method that Tomas Mikolov suggested on the word2vec mailing list (https://groups.google.com/forum/#!topic/word2vec-toolkit/wTx3E5D0n9s), which you have mentioned?

Or do you have a current tutorial for the doc2vec implementation?


Lukas

On Saturday, March 7, 2015 at 10:39:31 AM UTC+1, Radim Řehůřek wrote:
Hello Chenyan,

yes, Gordon is working on it. You can check his progress in his fork:

(work in progress, as yet unmerged).

Radim


On Friday, March 6, 2015 at 2:10:16 AM UTC+1, chenyan xiong wrote:
Hi, everyone
Has there been any update in doing inference for new doc?

Thanks!
Chenyan


Gordon Mohr

unread,
Oct 20, 2015, 12:13:55 PM10/20/15
to gensim
Everything's in the official gensim release! 

What Mikolov calls "sentence vectors" in that post is what the Le/Mikolov paper calls "Paragraph Vector" and gensim implements as the class Doc2Vec. The `infer_vector()` method is the only way offered to create a vector on new text. 

You can see a demo of how it's used, and perform part of the experiments from the 'Paragraph Vector' paper, with the supplied IPython notebook bundled inside gensim at https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

- Gordon

Azad Soni

unread,
May 17, 2016, 8:14:34 AM5/17/16
to gensim
Hello Jordan,

Thanks for suggesting this. Can you tell me how 'infer_vector' works internally? How does it predict the probability of unknown words?

Thanks,
Azad

Gordon Mohr

unread,
May 17, 2016, 3:23:05 PM5/17/16
to gensim
The best reference for understanding the internal workings of `infer_vector()` is the code itself, viewable at:


Note that it just re-uses the training routines, but with added constraints that mean most of the model doesn't change. Only the new candidate-vector-in-training varies.

Previously-unknown words are just elided/ignored. (This is the same as for words in the original dataset that were too infrequent to make the `min_count` cutoff: the algorithm just pretends they aren't there.)

- Gordon