Doc2Vec training speed

Dave Challis

Jul 29, 2015, 5:35:19 AM
to gen...@googlegroups.com
Just wondering if anyone could share some of the training speeds
they've been seeing for doc2vec?

It seems fairly slow in my environment, so I just wanted to check
whether there might be something wrong with my setup, or whether this
is the sort of performance I should be expecting.

Running on an Ubuntu host with 16 cores and ~64GB RAM (with numpy
compiled against OpenBLAS), I'm running Doc2Vec with:

'iter': 10,
'size': 1000,
'alpha': 0.025,
'window': 8,
'min_count': 5,
'max_vocab_size': 2e8,
'sample': 0,
'seed': 1,
'min_alpha': 1e-4,
'dm': 1,
'hs': 1,
'negative': 0,
'dbow_words': 0,
'dm_mean': 0,
'dm_concat': 0,
'dm_tag_count': 1,
'workers': 1
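For reference, here's a minimal sketch of how those settings would be passed to gensim, assuming the 2015-era API where these keys are keyword arguments to the `Doc2Vec` constructor (newer gensim releases renamed some, e.g. `size` -> `vector_size`, `iter` -> `epochs`); `corpus` is a hypothetical iterable of `(doc_id, tokens)` pairs, so the model call itself is left commented out:

```python
# The parameter dict from above, as it would be passed to gensim's
# Doc2Vec as keyword arguments (2015-era key names).
params = {
    'iter': 10, 'size': 1000, 'alpha': 0.025, 'window': 8,
    'min_count': 5, 'max_vocab_size': int(2e8), 'sample': 0,
    'seed': 1, 'min_alpha': 1e-4, 'dm': 1, 'hs': 1, 'negative': 0,
    'dbow_words': 0, 'dm_mean': 0, 'dm_concat': 0, 'dm_tag_count': 1,
    'workers': 1,
}

# from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# docs = [TaggedDocument(words=tokens, tags=[doc_id])
#         for doc_id, tokens in corpus]  # 'corpus' is hypothetical
# model = Doc2Vec(documents=docs, **params)
```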

Gensim's logs show training going at ~16500 words/s, e.g.:
INFO:gensim.models.word2vec:PROGRESS: at 52.00% examples, 16536 words/s

Increasing the above to 'workers=8' results in only a tiny speedup, to
~18500 words/s, e.g.:
INFO:gensim.models.word2vec:PROGRESS: at 0.47% examples, 18310 words/s

So I was wondering, do these speeds seem about normal? Also, why
doesn't increasing the number of workers result in much performance
gain? (or is the words/s metric per worker rather than overall?)

Thanks,
Dave

Christopher S. Corley

Jul 29, 2015, 10:08:51 AM
to gensim
Hi,

I've seen some very good training times using Doc2Vec, especially compared to LDA.

Here are some training times from some software systems I'm using as corpora, to give you an idea.


ArgoUML v0.22
  LDA: 0m 58.207s
  Doc2Vec: 0m 02.070s

ArgoUML v0.24
  LDA: 1m 05.507s
  Doc2Vec: 0m 02.267s

ArgoUML v0.26.2
  LDA: 1m 21.176s
  Doc2Vec: 0m 02.736s

JabRef v2.6
  LDA: 0m 29.504s
  Doc2Vec: 0m 01.280s

jEdit v4.3
  LDA: 0m 36.701s
  Doc2Vec: 0m 01.519s

muCommander v0.8.5
  LDA: 0m 42.897s
  Doc2Vec: 0m 01.696s
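For a rough sense of the gap, the timings above can be parsed and compared with a few lines of Python (the `'Xm YY.YYYs'` time format and corpus names are taken directly from the list; the speedup factors of roughly 23-30x are computed, not measured):

```python
def to_seconds(t):
    """Parse an 'Xm YY.YYYs' timing string, e.g. '0m 58.207s' -> 58.207."""
    minutes, seconds = t.split()
    return int(minutes.rstrip('m')) * 60 + float(seconds.rstrip('s'))

# (LDA time, Doc2Vec time) per corpus, copied from the list above.
timings = {
    'ArgoUML v0.22':      ('0m 58.207s', '0m 02.070s'),
    'ArgoUML v0.24':      ('1m 05.507s', '0m 02.267s'),
    'ArgoUML v0.26.2':    ('1m 21.176s', '0m 02.736s'),
    'JabRef v2.6':        ('0m 29.504s', '0m 01.280s'),
    'jEdit v4.3':         ('0m 36.701s', '0m 01.519s'),
    'muCommander v0.8.5': ('0m 42.897s', '0m 01.696s'),
}

for name, (lda, d2v) in timings.items():
    speedup = to_seconds(lda) / to_seconds(d2v)
    print(f'{name}: Doc2Vec ~{speedup:.0f}x faster than LDA')
```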

Chris.




Gordon Mohr

Jul 29, 2015, 3:28:34 PM
to gensim, dave.c...@aistemos.com
Seems slow. I haven't run much at 1000d, but the 100d training in the demo notebook (under docs/notebooks) runs at many hundreds of thousands of words per second with 8 workers on an MBP laptop.

Are you sure the C-extensions compiled properly? (You should have seen a warning when training started if they didn't. If so, re-install after making sure the machine has support packages like 'build-essential' and 'python-dev', and watch for any other revealing errors during installation.)

What version of gensim are you using? (If 0.11.1 or earlier, there was an issue that would definitely slow the multi-worker case. I think 'conda install' will still get an older gensim version; you should prefer the latest, 0.12.1, though see another thread for another potential C-extension-interfering problem that arises with the just-released scipy 0.16.0.)

If those aren't factors, might the source of your data be a bottleneck? (Maybe a slow disk or network source, or a computationally-involved process for turning the raw data into the expected TaggedDocuments?)
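The first two checks above can be scripted. In gensim, `gensim.models.word2vec.FAST_VERSION` is set to a value >= 0 when the optimized compiled routines loaded, and stays -1 when the slow pure-Python path is in use; the helper below just interprets that flag (the actual gensim imports are commented out so the sketch stands alone):

```python
def c_extension_ok(fast_version):
    # gensim sets FAST_VERSION to -1 when its compiled training
    # routines could not be loaded, and to >= 0 otherwise.
    return fast_version >= 0

# Actual check against an installed gensim (commented out here):
# import gensim
# from gensim.models.word2vec import FAST_VERSION
# print('gensim version:', gensim.__version__)
# if not c_extension_ok(FAST_VERSION):
#     print('WARNING: C extension not loaded; training will fall back'
#           ' to the much slower pure-Python code path')
```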

- Gordon

Dave Challis

Jul 30, 2015, 4:39:37 AM
to gensim
Hi Gordon,
Yup, after looking around some more it did seem pretty slow. On
further investigation, it turned out that another module had scipy
>=0.16.x as a requirement, causing my environment to upgrade to that,
which in turn caused gensim's C-extension not to be used (the warning
got lost somehow, probably due to some custom log settings I'm using).
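As a guard against that warning getting swallowed again, one option is to configure logging before importing gensim, so its import-time warning about missing C-extensions isn't dropped by a later, stricter configuration; a minimal sketch:

```python
import logging

# Configure logging *before* importing gensim, so its import-time
# warning about missing C-extensions is actually emitted somewhere.
logging.basicConfig(
    format='%(levelname)s:%(name)s:%(message)s',
    level=logging.INFO,
)

# Even if custom handlers are installed elsewhere, keep gensim's
# loggers audible at INFO and above.
logging.getLogger('gensim').setLevel(logging.INFO)

# import gensim  # any extension warning would now be visible
```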

Dave