Running word2vec with multiple workers


Tom Kenter

Sep 16, 2014, 11:56:47 AM
to gen...@googlegroups.com
I am very sorry if this is a very n00b question, but I can't seem to get the multi-threading to work when training a word2vec model.

I set the number of workers to 30 (on a 32 core machine), however, if I look at what the machine is doing (in htop), I see only a single core being busy.

Here are my system credentials:

$ python
Python 2.7.3 |CUSTOM| (default, Apr 11 2012, 17:52:16) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cython
>>> import gensim
>>> cython.__version__
'0.20.1'
>>> gensim.__version__
'0.10.1'

I did read http://radimrehurek.com/2013/10/parallelizing-word2vec-in-python/ but from what I read there it should "just work" if I use the latest version.

Is there anything else that needs to be installed..?!?

Many thanks!

Tom 

Radim Řehůřek

Sep 16, 2014, 12:57:33 PM
to gen...@googlegroups.com
Hello Tom, can you do `print gensim.models.word2vec.FAST_VERSION` and post the result?

-rr

Tom Kenter

Sep 16, 2014, 4:39:18 PM
to gen...@googlegroups.com
Hi Radim,

Thanks for the quick response!

This is what it says:

>>> print gensim.models.word2vec.FAST_VERSION
1

Tom



Tom Kenter

Sep 26, 2014, 6:55:47 AM
to gen...@googlegroups.com
Dear all,

Would there be any updates on this?

I updated my gensim version:

>>> import gensim
>>> gensim.__version__
'0.10.2'
>>> print gensim.models.word2vec.FAST_VERSION
1

but I am still not seeing multiple processes when I am training models. 

Thanks!

Tom

Radim Řehůřek

Sep 29, 2014, 1:23:58 PM
to gen...@googlegroups.com
Hello Tom,


On Friday, September 26, 2014 12:55:47 PM UTC+2, Tom Kenter wrote:
Dear all,

Would there be any updates on this?

I updated my gensim version:

>>> import gensim
>>> gensim.__version__
'0.10.2'
>>> print gensim.models.word2vec.FAST_VERSION
1

but I am still not seeing multiple processes when I am training models. 

but it *is* a single process, using multiple threads. But you say only one core is used, so that's bad either way :)

I have no idea what could be the problem. Can you post your log at DEBUG level? Next step after that, we could meet on skype & share screen & debug interactively, because I don't see what the problem could be, so it's hard to suggest stuff over the mailing list :)

Cheers,
Radim

Tom Kenter

Oct 8, 2014, 6:04:56 AM
to gen...@googlegroups.com
Hi, I am sorry, that took a while…

OK, the update is: I actually do see multiple processes, but I also only see one core being busy.

So, I am looking at htop, and I can see that there are 30 python subprocesses spawned from the 'mother' process.
However, none of the cores (but one) shows any signs of activity.

Tom

Radim Řehůřek

Oct 8, 2014, 7:15:00 AM
to gen...@googlegroups.com
Hi Tom,
I find that hard to believe. Gensim's word2vec doesn't do anything with processes. More likely, htop is showing you threads.

In any case, conclusions from my previous email apply :)

-rr





Tom Kenter

Oct 8, 2014, 7:50:55 AM
to gen...@googlegroups.com
I am sorry, you are right. They are threads.
But, regardless, they should be distributed over the cores, right?

Tom

Tom Kenter

Oct 8, 2014, 7:51:51 AM
to gen...@googlegroups.com
But, indeed, if you want to debug with shared screens or whatever, that would be great!

Tom

tobigue

Nov 28, 2014, 12:44:13 PM
to gen...@googlegroups.com
Hi, have you found a solution to this? I seem to have the same problem with gensim 0.10.3.

Tobias Günther

Nov 28, 2014, 12:59:25 PM
to gen...@googlegroups.com

Vivek Kulkarni

Jan 27, 2015, 2:49:35 PM
to gen...@googlegroups.com
I still face the issue even after setting the cpu affinity using the workaround mentioned: os.system("taskset -p 0xfffff %d" % os.getpid()).
I am using gensim 0.10.3. Any pointers would be appreciated.
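For what it's worth, the effective affinity can also be checked from inside the training process with the stdlib alone. A hedged sketch (`os.sched_getaffinity` is Linux-only, Python 3.3+; the `taskset` call above does the same thing from the shell):

```python
import os

# Which CPUs is this process actually allowed to run on?
# On Linux this reflects any taskset/cgroup restrictions in effect.
if hasattr(os, "sched_getaffinity"):
    cpus = sorted(os.sched_getaffinity(0))
else:
    cpus = list(range(os.cpu_count() or 1))  # fallback for other platforms

print("usable CPUs:", cpus)
```

If the printed set is smaller than the machine's core count, something (a container limit, a parent process, an earlier `taskset`) has restricted the process before gensim ever starts its worker threads.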

Thanks.
Vivek.

za...@okcupid.com

Jun 10, 2015, 11:39:33 AM
to gen...@googlegroups.com
I'm still having this issue after using this workaround as well, although I am using an IPython notebook -- do you think this would affect it?

za...@okcupid.com

Jun 10, 2015, 11:50:02 AM
to gen...@googlegroups.com
I looked into it a bit more; I don't think the problem is CPU affinity, as the affinity package shows that I have the correct mask.

Gordon Mohr

Jun 10, 2015, 8:26:54 PM
to gen...@googlegroups.com
Note that the initial 'build_vocab' pass over the data (to obtain vocabulary information and doc ids/count in the Doc2Vec case) is single-threaded, and can take a significant amount of time. Typically I notice that phase has finished when `top` (or similar) shows the gensim python process CPU utilization surging past 100%. 

If you're in training, have set workers>1, and assuredly *don't* see more than one core active, that's quite strange, and I don't know what could cause it. 

However, if the real issue is that multiple cores are active, but you're not seeing the expected *throughput benefit* of multiple workers, just last week a fix removed an apparent source of serious cross-thread interference in the cythonized training routines. See:


Before this fix, on an 8-core Ubuntu server, I was seeing lower throughput with 4-8 workers than with 1, on an all-in-memory dataset. After, workers generally seem to help at least up to the number of cores (though it's still the first few workers beyond 1 that have the biggest impact). On a 4-core OSX machine, workers up to 4 were always helping at least a little... but after the fix, worker counts from 1 to 4 are all noticeably faster.

- Gordon

Sasha Migdal

Jun 16, 2015, 5:48:15 PM
to gen...@googlegroups.com
Tom, I have the same issue on my 32-core machine with 1 TB of memory (so memory is not the issue). I checked that I have FAST_VERSION, and still only 1 or 2 cores are active in htop. There is one python process, with CPU load at most 200%, which means 2 cores. Apparently, Python parallelization with threads on Linux machines does not really work; see
https://www.quantstart.com/articles/Parallelising-Python-with-Threading-and-Multiprocessing. Either we have to use multiple processes with shared memory, or we have to use OMP parallelization directly in C or C++.

Gordon Mohr

Jun 16, 2015, 7:23:39 PM
to gen...@googlegroups.com
I often see all 8 cores active on my Ubuntu machine. (Always nice to see "798%"+ CPU utilization on an 8-core machine!) The optimized cython routines specifically release the Python GIL, and use C code & Fortran-originated array-ops, to allow true multi-threaded progress on the core training math – so the limits discussed at that link aren't really applicable to this case. 
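The distinction can be seen with nothing but the stdlib: Python threads do overlap whenever the C code they call releases the GIL, as `time.sleep` does here (and as gensim's cythonized training math does). A small illustration:

```python
import threading
import time

def napper():
    time.sleep(0.2)  # the C-level wait releases the GIL

threads = [threading.Thread(target=napper) for _ in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Four 0.2s waits overlap instead of serializing, so this finishes in
# roughly 0.2s rather than 0.8s. A pure-Python bytecode loop would not
# overlap this way, which is the limitation the linked article describes.
print("elapsed: %.2fs" % elapsed)
```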

So, there must be some other factor in play. 

I've seen mentions elsewhere (including in non-Python contexts) that some BLAS libraries may be compiled or configured to only use 2 cores, so that may be the issue. But even after swapping your BLAS libraries, you might have to take extra steps to be sure scipy is using the right ones. Also, if you haven't tried a conda-installed environment, it *might* do a better job of finding the best options, since that's their business focus. 

I'd really love to be able to confirm good performance on 16-32 core machines, so for anyone still struggling with this it'd be useful to collect...

- exact OS, python, numpy, scipy versions
- exact BLAS versions, as reported by `np.__config__.show()` in the affected environment and `sudo update-alternatives --display libblas.so`
- exact gensim FAST_VERSION, as reported by `gensim.models.word2vec.FAST_VERSION`

...just to see if there's a pattern for where it's working and not. On my Ubuntu 14.04 machine, where I see all cores active during 8-worker training:

==============

ubuntu@ubuntu14:~$ source activate gensim_cenv  # conda environment
discarding /home/ubuntu/miniconda/bin from PATH
prepending /home/ubuntu/miniconda/envs/gensim_cenv/bin to PATH
(gensim_cenv)ubuntu@ubuntu14:~$ python --version
Python 2.7.10 :: Continuum Analytics, Inc.
(gensim_cenv)ubuntu@ubuntu14:~$ conda list | egrep "numpy|scipy"
numpy                     1.9.2                    py27_0
scipy                     0.15.1               np19py27_0
(gensim_cenv)ubuntu@ubuntu14:~/enwiki$ python -c "import numpy as np; np.__config__.show()"
lapack_opt_info:
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas']
    library_dirs = ['/home/ubuntu/miniconda/envs/gensim_cenv/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = f77
openblas_lapack_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_blas_info:
    libraries = ['f77blas', 'cblas', 'atlas']
    library_dirs = ['/home/ubuntu/miniconda/envs/gensim_cenv/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = c
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['f77blas', 'cblas', 'atlas']
    library_dirs = ['/home/ubuntu/miniconda/envs/gensim_cenv/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = c
atlas_info:
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas']
    library_dirs = ['/home/ubuntu/miniconda/envs/gensim_cenv/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = f77
atlas_3_10_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
(gensim_cenv)ubuntu@ubuntu14:~$ sudo update-alternatives --display libblas.so
libblas.so - auto mode
  link currently points to /usr/lib/atlas-base/atlas/libblas.so
/usr/lib/atlas-base/atlas/libblas.so - priority 35
  slave libblas.a: /usr/lib/atlas-base/atlas/libblas.a
/usr/lib/libblas/libblas.so - priority 10
  slave libblas.a: /usr/lib/libblas/libblas.a
Current 'best' version is '/usr/lib/atlas-base/atlas/libblas.so'.
(gensim_cenv)ubuntu@ubuntu14:~$ python -c "import gensim; print(gensim.models.doc2vec.FAST_VERSION)"
1
(gensim_cenv)ubuntu@ubuntu14:~$ cat /proc/version
Linux version 3.16.0-38-generic (buildd@allspice) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015

==============
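All of that can also be gathered from inside the interpreter that actually runs the training, which avoids accidentally inspecting a different environment. A hedged sketch (the gensim import is guarded on purpose, since the point is to report whatever this interpreter sees):

```python
import platform
import sys

import numpy as np

print(sys.version)
print("numpy", np.__version__)
np.__config__.show()          # which BLAS/LAPACK numpy was built against
print(platform.platform())

try:
    import gensim
    from gensim.models import word2vec
    print("gensim", gensim.__version__, "FAST_VERSION", word2vec.FAST_VERSION)
except Exception as exc:      # gensim missing, or its extensions failed to load
    print("gensim not usable here:", exc)
```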


- Gordon

Alexander Migdal

Jun 17, 2015, 1:14:44 AM
to gen...@googlegroups.com
Thanks Gordon, this is encouraging... I will check all the settings and report.

Parkway

Jun 17, 2015, 5:49:50 AM
to gen...@googlegroups.com
@gordon
I have access to and have used doc2vec on a 32-core server with 64gb ram running centos6. I'm sure that ATLAS BLAS is enabled by default but will check. Can also enable mkl. But, how do you check how many cores (and workers) are being used? Also, what is the guideline for setting the number of workers per core or for total number of cores?


Gordon Mohr

Jun 17, 2015, 3:48:06 PM
to gen...@googlegroups.com
I use the console command 'top' to monitor CPU activity. If in 'top's main process list, any process shows as using more than 100% CPU, it is engaging multiple cores via threads. If you hit '1', the header summary area toggles to show each core's utilization independently, rather than one average number for all. If you hit 'H', the process list shows lines for each thread – so you'll see utilization for each worker thread. 

The training process is CPU-intensive and fairly parallelizable, so a worker count equal to the number of cores is a reasonable starting guess. For my 4- and 8-core tests, neither more nor less than the number of cores has been obviously better (at least since the fix mentioned a few messages up). 

Even when all cores are usable, you will still see a period where only one core is active (while `build_vocab()` does its initial scan and initializations), but then when training begins, you should see many worker threads and cores busy in 'top'. 

- Gordon

Parkway

Jun 18, 2015, 4:31:49 AM
to gen...@googlegroups.com
My configuration is almost identical to yours except it is a 16-core machine (with 64gb ram) that runs centos6.

The latest gensim (download) from github is installed and workers=16.

1/ Using 'top', only 5 of the 16 cores were active. Of the 5 cores, one reached >50% usage, 2 stayed in the mid-teens and the remaining 2 hovered at around 4%.

2/ MKL
Unfortunately, Anaconda MKL license has expired so couldn't test.

3/ Without ATLAS BLAS
Uninstalled ATLAS and tested without a BLAS library.
The pattern was almost the same with ATLAS installed ie. 5 active cores, with 1 >50% usage, 2 in the mid-teens and 2 < 5%.

Well, that explains why the performance on this particular machine was less than expected! Though, not sure what is going on.

Gordon Mohr

Jun 18, 2015, 3:52:47 PM
to gen...@googlegroups.com
How exactly do you install gensim? (You'd want to be sure that the current 'develop' branch, at least, is what you have, and also exactly what has had its compiled .so files installed into the operative python paths. I've occasionally received stale files from github caches, or separately failed to realize when a cloned working directory hasn't been fully installed into the environment.) 

If BLAS was truly removed and unreachable by the numpy/scipy installation, training would be noticeably slower – and there might even be a crash. (The code that handles the no-BLAS case has a comment, "# actually, the BLAS is so messed up we'll probably have segfaulted above and never even reach here".) If a training cycle completed in the same amount of time before and after you removed BLAS, your change may not have had the intended effect. 

To be sure of having info about the exact same python environment that's doing the training, adding lines like the following just before your Word2Vec/Doc2Vec instantiation would collect the most reliable info:

  print(sys.version)
  print(gensim.models.doc2vec.FAST_VERSION)
  np.__config__.show()

No expiring-license software should be necessary; I started with a plain Ubuntu and 'miniconda', and added public ubuntu, conda, and pip packages.

Still, there could be something unique to your machine capping the resource use by a single process. (Is it a virtual machine?)

- Gordon

Parkway

Jun 18, 2015, 5:56:11 PM
to gen...@googlegroups.com
The gensim-develop zip is downloaded. 
gensim installed with "python setup.py install"

The environment info is:

Anaconda 2.1.0
Python 2.7.10
numpy 1.9.2
scipy 0.15.1

lapack_opt_info:
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas']
    library_dirs = ['/home/co-vadh1/anaconda/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = f77

python -c "import gensim; print(gensim.models.doc2vec.FAST_VERSION)"
1

It is not a virtual machine and have full access to all 16-cores.

Tom Kenter

Jun 24, 2015, 4:56:58 AM
to gen...@googlegroups.com
Hi all,

Here is an update… I just did some testing on my 32-core system, and setting the number of workers to 30 actually does cause 30 cores to be busy.
However, they don't seem to be very busy… :-/

I attached a screenshot of what htop shows. The screenshot is pretty typical as to what I observe throughout the entire training phase. As in: 30 cores are active but none of them seems to be doing much.

As to the settings, I am not sure what to look for here. This is what I get:

$ python -c "import gensim; print(gensim.models.doc2vec.FAST_VERSION)"
1

$ cat /proc/version
Linux version 2.6.32-504.3.3.el6.x86_64 (mock...@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Wed Dec 17 01:55:02 UTC 2014

$ python -c "import numpy as np; np.__config__.show()" 
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include']
blas_opt_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/include']
openblas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/include']
lapack_opt_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include']
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
   
gensim_word2vec_screenshot.tiff

Parkway

Jun 24, 2015, 5:38:06 AM
to gen...@googlegroups.com
That is pretty much my experience on a 16-core centos6 machine, but with only ~5 weakly active cores. Still trying to figure out where the real problem is - the ATLAS BLAS library, gensim, or something else. @Gordon has got all cores working at full capacity on an 8-core ubuntu machine. I'm using Anaconda, but it looks like you're not.

It would be good to hear if folks have tested with linux using MKL or OpenBLAS to see if it is an ATLAS problem.

Gordon Mohr

Jun 24, 2015, 7:19:09 AM
to gen...@googlegroups.com
Hmm. Thanks for the extra info. Some thoughts:

Are you certain that the code from (at least) the post-June-7 gensim 'develop' branch is what's running? (The C-extensions 'word2vec_inner.so'/'doc2vec_inner.so' need to be freshly built and installed to the right paths for your environment. A fresh pip source install from the right github branch,  or 'python ./setup.py install' after refreshing a prior clone of the right branch, will usually do the trick.)

Where's the training corpus coming from? (Is it a slow IO source, or going through heavy preprocessing, such that feeding the data to the workers could be the bottleneck?)

Have you tried lower worker counts, and seen at what worker-count throughput peaks? 

- Gordon

Tom Kenter

Jun 24, 2015, 10:47:59 AM
to gen...@googlegroups.com
OK, my bad: I wasn't using the latest develop branch.

I am sorry if I am not getting things… so just to be sure:
- I run python setup.py install in the gensim directory. Right?

Because if I do that, I get:

$ python -c "import gensim; print gensim.__version__" 
0.11.1-1

However:

$ python -c "import gensim; print gensim.models.doc2vec.FAST_VERSION" 
-1

Whereas that should have said 1, right?
 
I do have a recent numpy:

$ python -c "import numpy as np; print np.__version__" 
1.9.2

Gordon Mohr

Jun 24, 2015, 4:53:26 PM
to gen...@googlegroups.com
Yes, getting gensim.models.doc2vec.FAST_VERSION to be 0 or 1 should be top priority; it makes for an 80-100X speedup.

A clone of https://github.com/piskvorky/gensim.git should get the default 'develop' branch, which will have the fix that's improved multicore utilization for me by a lot. I'd usually do a github install with a command like:


That install should log some complaints if everything isn't right. In particular, to build the C extensions a C compiler and python-related headers are necessary. If all goes well, there will be freshly-compiled word2vec_inner.so/doc2vec_inner.so files in the same directory as is shown by: 

python -c "from gensim.models import word2vec; print(word2vec.__file__)"

If not, installing your distribution's 'python-dev' (or possibly 'python-devel') package might bring in the necessary tools to attempt a re-install. 
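A quick scripted version of that check (hedged: the compiled extension filenames vary by platform and Python version, e.g. `.so` on Linux/OSX vs `.pyd` on Windows):

```python
import glob
import os

compiled = []
try:
    from gensim.models import word2vec
    moddir = os.path.dirname(word2vec.__file__)
    # Cython builds typically land next to the pure-python module.
    for pattern in ("*_inner*.so", "*_inner*.pyd"):
        compiled.extend(glob.glob(os.path.join(moddir, pattern)))
    print(compiled if compiled else "no compiled *_inner extensions found")
except Exception as exc:
    print("gensim not importable:", exc)
```

An empty result alongside FAST_VERSION == -1 would point at a failed or skipped extension build rather than anything BLAS-related.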


The numpy version is certainly adequate, but the extra info from "np.__config__.show()" could also confirm that a BLAS library is active. 


- Gordon

Parkway

Aug 27, 2015, 3:02:41 AM
to gensim
I'm still not seeing the multi-threading work. Running doc2vec on a corpus of 2m documents is literally taking days (each epoch ~7 hours). Details are:

16-core centos6
64Gb ram

python 2.7.10 (Anaconda)
numpy 1.9.2 with ATLAS 3.8.4
scipy 0.16

gensim 0.12.1
workers = 8

> from gensim.models import word2vec
> word2vec.FAST_VERSION
1

"top" display during doc2vec processing:

$ top
Cpu0  :  1.1%us,  2.9%sy,  0.0%ni, 95.7%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu1  :  0.8%us,  3.7%sy,  0.0%ni, 95.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.2%us,  7.1%sy,  0.0%ni, 92.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.1%us,  1.0%sy,  0.0%ni, 98.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  3.1%sy,  0.0%ni, 96.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.5%us,  5.0%sy,  0.0%ni, 94.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.2%us,  4.9%sy,  0.0%ni, 94.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.1%us,  0.0%sy,  0.0%ni, 99.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Gordon Mohr

Aug 27, 2015, 6:05:41 AM
to gensim
That duration seems like it could be performance without the Cython extensions. Are there any "C extension not loaded" messages logged during training? Are you sure you're reading the FAST_VERSION from the exact same python interpreter/environment that's running your training script? 

Another hint that all your config info may not have come from the same interpreter/env: scipy-0.16 has a change that will prevent the cython extensions in gensim 0.12.1 from loading/working. So I don't think it's possible to get FAST_VERSION other than -1 from gensim-0.12.1 with scipy-0.16 – you have to either roll back to scipy-0.15.1, or use the post-0.12.1 'develop' branch from github. (Rolling back to scipy-0.15.1 is probably easier unless you'll be editing the gensim source.)

Also, even the pure-python or single-threaded training would usually show more than a few percent utilization of a single core. So that 'top' readout doesn't even look like slow training... unless something else very atypical/pathological is going on. (But even reading data from some extremely slow IO source would likely put a core's 'wa[it]' percentage higher than those readouts.)

- Gordon

Parkway

Aug 27, 2015, 11:13:31 AM
to gensim
Rolled back to scipy 0.15.1 and all is ok now:

Cpu0  : 86.5%us, 11.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.0%si,  0.0%st
Cpu1  : 94.4%us,  5.3%sy,  0.0%ni,  0.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 96.4%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 94.7%us,  5.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 94.1%us,  5.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 91.4%us,  8.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 97.0%us,  3.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 98.0%us,  2.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 51.2%us, 44.6%sy,  0.0%ni,  4.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 96.4%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 94.7%us,  5.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 92.1%us,  7.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 : 87.5%us, 12.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 : 92.4%us,  7.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 95.4%us,  4.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 96.4%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Thanks!

Gordon Mohr

Aug 27, 2015, 5:49:53 PM
to gensim
Glad it's working and thanks for sharing the updated 'top' readout – it's very nice to see such utilization across 16 cores! (I've only been able to test up to 8...)

- Gordon

Parkway

Aug 29, 2015, 8:14:10 AM
to gensim
No probs. Fyi, it took doc2vec 720 secs (12 mins) per epoch for ~2m documents on a 16-core centos6 server with 64Gb ram.

Matt Patterson

Sep 21, 2015, 5:56:44 PM
to gensim
I'm running into this same problem on a Macbook Pro with 8 cores. No matter what I've tried, I can't seem to get my gensim doc2vec training script to utilize more than a single core. I've tried everything I can find online. I've verified that it (appears to be) using OpenBLAS with numpy. Any other ideas? Other thoughts on how I could debug this? 

Gordon Mohr

Sep 21, 2015, 9:16:02 PM
to gensim
Key things to check:

(1) Are you using the latest gensim?

(2) Is gensim.models.word2vec.FAST_VERSION less than 2? (If 2, the slow and mostly non-parallelizable pure python code is being used, and you should also be seeing a logged warning when training starts.)

Note that there's an initial phase – the vocabulary survey – that's still single-threaded. So for a while, you'd only see a single core involved. However, once the bulk training begins, you should see much higher utilization. I use an MBP with 4 true cores (pseudo-8 with hyperthreading), and see them all active during runs with `workers=8`... so if you don't, whatever is the issue should be solvable. 

What's the indication you're seeing that all cores are *not* being used?

- Gordon

Radim Řehůřek

Sep 21, 2015, 11:37:11 PM
to gensim
(small correction: pure Python FAST_VERSION would be -1)

-rr

Matt Patterson

Sep 22, 2015, 10:39:58 AM
to gensim
As mentioned earlier in this thread, I pip installed the latest gensim directly from github. I've also tried the version pip installs as the latest. Using scipy 0.15.1 and numpy 1.9.2. Mine is an MBP just like yours, with 4 real cores (8 w/ hyperthreading).

During model.train() I'll initially see it reach 100% or even a bit above, but after a while it falls all the way down to 25% CPU utilization; with htop it's pretty clear that the first core is the only one doing any real work. The laptop just sits there quietly, barely chugging along at a max of about 30,000 words/sec. In contrast, running the C version of word2vec reliably spikes CPU usage to around 700%+.

Just in case it was a problem with my script, I threw the python script on an EC2 ubuntu instance with 16 cores and 30 gigs of RAM; it hit 700,000 words/sec and finished my entire run (which would have taken my MBP 12 hours) in about 30 minutes.

In [3]: gensim.models.word2vec.FAST_VERSION
Out[3]: 0

In [5]: numpy.show_config()
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-DAPPLE_ACCELERATE_SGEMV_PATCH']
    define_macros = [('NO_ATLAS_INFO', 3)]
openblas_lapack_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-DAPPLE_ACCELERATE_SGEMV_PATCH', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

Gordon Mohr

Sep 22, 2015, 7:23:55 PM
to gensim
Hmm. Given those symptoms (a slowdown after running for a while), maybe it's the OSX power-saving 'App Nap' feature kicking in.

Some ways to disable that are described in a thread at: 


Let me know if that helps,

- Gordon

Radim Řehůřek

Sep 23, 2015, 4:12:38 AM
to gensim
Yes, it must be something external / OS related, like Gordon says.

I have a MBP too, so it's one of the best tested HW configurations for gensim :)

FAST_VERSION=0 is fine.

-rr