Doc2Vec slow for tweets


Harsha Manjunath

Mar 24, 2017, 2:15:55 PM
to gensim

I have noticed that Doc2Vec trains slowly when each TaggedDocument corresponds to a single tweet/sentence.

Doc2Vec is using all the CPU cores, but the utilization on each core doesn't exceed 15% (30,000 to 50,000 words/sec).

My corpus is 4.3 million tweets and 12,458 hashtags, where each tokenized tweet becomes a TaggedDocument and gets a single hashtag as its document tag.
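
A minimal sketch of that construction (the `tweets` structure and names are illustrative, assuming each tweet's tokens and its hashtag are already available):

from gensim.models.doc2vec import TaggedDocument

# tweets: iterable of (token_list, hashtag) pairs -- illustrative only
docs = [TaggedDocument(words=tokens, tags=[hashtag])
        for tokens, hashtag in tweets]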


I followed the tutorial here [0] 

models = [Doc2Vec(dm=1, dm_mean=0, size=400,
                  hs=0, workers=8, window=10,
                  negative=5,
                  sample=1e-5,
                  min_count=20)]



When I concatenate all the tweets for a single hashtag into one TaggedDocument, I get the expected speedup across all cores (500,000 words/sec), but at the cost of a poor Doc2Vec model.


Gordon Mohr

Mar 24, 2017, 6:50:58 PM
to gensim
Are you using a recent gensim version? (0.12.4 in January 2016 got an optimization that helps a lot with tiny documents.)

Is there a chance that it's really your IO or preprocessing/tokenization that's the bottleneck? (Can you do any expensive work in a 1st pass, then only create TaggedDocuments from a simple file of already space-delimited tokens? Or pre-create all TaggedDocument instances into an in-memory list?)
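
For example, a minimal sketch of the pre-tokenized-file approach (the file name and line format are assumptions, not something from this thread):

from gensim.models.doc2vec import TaggedDocument

# Assumed line format: "<hashtag>\t<space-delimited tokens>"
def read_tagged_docs(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            tag, text = line.rstrip('\n').split('\t', 1)
            yield TaggedDocument(words=text.split(), tags=[tag])

# Materialize once so repeated passes pay no further IO/parsing cost:
all_docs = list(read_tagged_docs('tweets_tokenized.tsv'))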

- Gordon

Harsha Manjunath

Mar 24, 2017, 7:53:24 PM
to gensim
Thank you Gordon for replying.


I am using gensim 1.0.1.

  • I do all preprocessing/tokenization of tweets in a pandas DataFrame.
  • I construct a list of TaggedDocuments from the DataFrame, then delete the original DataFrame to free up memory.
  • I shuffle the list of TaggedDocuments in each epoch when training the Doc2Vec model, like the Doc2Vec IMDB tutorial.


models = [Doc2Vec(dm=1, dm_mean=0, size=400,
                  hs=0, workers=2, window=10,
                  negative=5,
                  sample=1e-5,
                  min_count=20)]

doc_list_twt = shuffle_(df, "tweet_tkns")   # builds a list of TaggedDocuments from the DataFrame of tokenized tweets

models[0].build_vocab(doc_list_twt)
print(str(models[0]))
del df   # free the DataFrame memory

Gordon Mohr

Mar 24, 2017, 8:21:06 PM
to gensim
Is `doc_list_twt` below an actual list of TaggedDocuments, or some iterable object that generates them each time they're needed? (That is, what does `shuffle_()` do?)

If doing the looping like in the IMDB example, you'll want to create your model with `iter=1` – or else you'll be doing 5 passes per `train()` rather than 1. (The example notebook hasn't been updated for current defaults.) 

It's also not strictly necessary to re-shuffle the items each epoch – so you could just pick your desired number of iterations (e.g. `iter=20`), leave the default `alpha` management in place, and call `train()` just once. 
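
A minimal sketch of that simpler path (parameter values are only placeholders):

model = Doc2Vec(dm=1, size=400, window=10, hs=0, negative=5,
                sample=1e-5, min_count=20, workers=4, iter=20)
model.build_vocab(doc_list_twt)
model.train(doc_list_twt)   # one call; runs all `iter` passes with the built-in alpha decay
                            # (newer gensim versions require explicit total_examples/epochs here)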

- Gordon

Harsha Manjunath

Mar 24, 2017, 9:03:30 PM
to gensim
Thank you Gordon for the quick reply.

doc_list_twt is an actual in-memory list of TaggedDocuments, NOT an iterable object. shuffle_ returns a list of TaggedDocuments built from the DataFrame.

Since I fix the learning rate to a constant within each epoch (i.e. I set alpha and min_alpha to the same value), the default value of iter=5 shouldn't matter, should it?



models = [Doc2Vec(dm=1, dm_mean=0, size=400,
                  hs=0, workers=4, window=10,
                  negative=5,
                  sample=1e-5,
                  min_count=20, iter=1)]

doc_list_twt = shuffle_(df, "tweet_tkns")

models[0].build_vocab(doc_list_twt)
print(str(models[0]))

del df

import time
from random import shuffle

def train_gensim(i, model):
    alpha, min_alpha, passes = (0.025, 0.0001, 40)
    alpha_delta = (alpha - min_alpha) / passes

    for epoch in range(passes):
        start = time.time()
        shuffle(doc_list_twt)                            # re-shuffle the documents each epoch
        print("Random shuffle time %s" % (time.time() - start))

        model.alpha, model.min_alpha = alpha, alpha      # fix the learning rate for this epoch, no decay within train()
        print('Now training epoch {0}, alpha = {1}'.format(epoch, model.alpha))
        model.train(doc_list_twt)

        alpha -= alpha_delta                             # decrease the learning rate between epochs
        print("Epoch TIME %s" % (time.time() - start))

    return model                                         # return after all passes, not inside the loop

train_gensim(0, models[0]).save("modelfile.doc2vec")

Harsha Manjunath

Mar 24, 2017, 9:06:54 PM
to gensim
Is it because there are 4 million tweets/TaggedDocuments (each short in length) and only about 12,500 doctags?

Is this causing so much inter-process communication that there is no effective parallel speedup across the CPU cores?

Gordon Mohr

Mar 24, 2017, 9:12:54 PM
to gensim
If the default `iter=5` is left in place, every call to `train()` will result in 5 passes over the supplied data. So your loop, as shown, will make (40 loops times 5 passes per loop =) 200 passes over the data. 

I highly recommend just setting `iter=40` and calling `train()` once. Enabling logging at the 'info' level will give you as much or better progress/speed information than printing the per-loop timings.
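
For instance, the usual way to turn that logging on (a sketch, assuming nothing else has already configured logging):

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)
# gensim then logs progress, effective words/sec, and warnings during build_vocab()/train()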

If `doc_list_twt` is already a list of ready-for-training TaggedDocuments, I'm not sure why things seem slow. Do note that due to implementation limits in the optimized code, text examples with more than 10,000 tokens are truncated to 10,000 tokens, with the extras discarded. (There should be a warning logged when this happens.) So the apparent rate on a test where all texts are combined into one document will not be representative of the actual training rate.
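
A quick way to check whether any of your texts hit that limit (a sketch, assuming `doc_list_twt` holds TaggedDocuments):

too_long = sum(1 for doc in doc_list_twt if len(doc.words) > 10000)
print("documents over the 10,000-token limit:", too_long)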

- Gordon

Gordon Mohr

Mar 24, 2017, 9:21:12 PM
to gensim
That shouldn't be a factor; the threads can read/write from the same raw in-training arrays. 

If for some reason you're not using the optimized cython routines, there should be a logged warning about the "Slow version" being used. 

Is a lot of virtual memory in use? If the model exceeds the size of RAM, the random-access into virtual memory during training will make things very slow. 
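
One quick way to check that (a sketch using psutil, which is an assumption here, not something from the thread):

import psutil

proc = psutil.Process()
print("process RSS: %.1f GB" % (proc.memory_info().rss / 1024 ** 3))
print("available RAM: %.1f GB" % (psutil.virtual_memory().available / 1024 ** 3))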

- Gordon