deep learning in gensim


Radim Řehůřek

Sep 17, 2013, 7:18:29 AM
to gen...@googlegroups.com
Hello all,

I'm playing with deep learning (neural network) algorithms at the moment, considering adding some practical ones to gensim.

Do you have any suggestions? Which directions/tasks should gensim take with respect to deep learning?

Comments and feedback welcome :)

Radim

niefpaarschoenen

Sep 19, 2013, 7:31:28 AM
to gen...@googlegroups.com
Hello Radim,

On Tuesday, September 17, 2013 at 1:18:29 PM UTC+2, Radim Řehůřek wrote:

Do you have any suggestions? Which directions/tasks should gensim take with respect to deep learning?

Comments and feedback welcome :)

Awesome stuff! I will most certainly make use of this new model in the near future!

Joris

bluishgreen

Sep 20, 2013, 4:47:26 PM
to gen...@googlegroups.com
That's great! I just started looking into deep learning, and I think it's the right time to focus gensim on automatic feature-learning models.

Radim Řehůřek

Sep 21, 2013, 4:49:22 PM
to gen...@googlegroups.com
I optimized the word2vec code in gensim. The speed-up is about 19x, but you'll need Cython to see it (`pip install cython`), plus a fast BLAS (but you should have that anyway!).
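
To check that the fast path actually kicked in, something like this should work (this assumes the FAST_VERSION flag the word2vec module sets when the compiled extension imports; treat it as a sketch and check your gensim version):

from gensim.models import word2vec

# -1 means the plain NumPy fallback; >= 0 means the compiled Cython routines are active.
print(word2vec.FAST_VERSION)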


Best,
Radim

robwahl

Sep 21, 2013, 6:30:14 PM
to gen...@googlegroups.com
My question is whether word2vec's handling of metaphors would work for nonlinear relations (ones that aren't an inheritable has-a or is-a relationship, for example). Symmetry can hold in one direction only: as Pontus Stenetarp points out, we say "the painting resembles the person", not the other way around. Another concern I have is the level of certainty: how do we determine whether the corpus is right for the context? Wouldn't it be better to find the difference between sentences and how they are interpreted? I'm also wondering if such inferences would slow down our semantic engine. word2vec was neat to think about.
 
http://openreview.net/document/7b076554-87ba-4e1e-b7cc-2ac107ce8e4d#7b076554-87ba-4e1e-b7cc-2ac107ce8e4d
On Tuesday, September 17, 2013 4:18:29 AM UTC-7, Radim Řehůřek wrote:
> Hello all,

Radim Řehůřek

Sep 22, 2013, 11:23:16 AM
to Pedro Cardoso, gen...@googlegroups.com
Hi Pedro,

it's already there -- search for "load_word2vec_format" at
http://radimrehurek.com/deep-learning-with-word2vec-and-gensim/
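
A minimal sketch ('vectors.bin' is a placeholder for whatever the C tool produced with -binary 1):

from gensim.models import word2vec

# Load vectors trained by the original C word2vec tool.
model = word2vec.Word2Vec.load_word2vec_format('vectors.bin', binary=True)
print(model.most_similar('king'))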

By the way, I may have found a way for another 4x training speedup...
it looks good, but I have to investigate more before merging it :)

Radim



On Sun, Sep 22, 2013 at 5:12 PM, Pedro Cardoso <karu...@gmail.com> wrote:
> Hi Radim,
>
> Great job.
>
> I wonder, have you considered to add a constructor based on a bin file
> returned by word2vec ? In case I want to train the models with the c tool,
> but use it afterwards in Python ?
>
> Pedro

Radim Řehůřek

Sep 22, 2013, 1:27:42 PM
to gen...@googlegroups.com, Pedro Cardoso
Confirmed; word2vec in gensim is now 4x faster.

Incidentally, it's also 3.5x faster than the original C word2vec from Google, if you have a fast BLAS installed.

I have updated the blog post with new timings.

-rr

Radim Řehůřek

Sep 22, 2013, 3:09:11 PM
to Pedro Cardoso, gen...@googlegroups.com
Thanks Pedro.

The speed-up came from realizing that Cython's memoryviews were the
bottleneck (=> getting rid of them). BLAS had already been included in
the previous, 19x speed-up :)

I posted a patch with BLAS to the word2vec-toolkit google group, but
it's trickier to do it right in C (different BLAS libraries may have
slightly different signatures/idiosyncrasies). This trickiness is solved
"for free" when we use SciPy's BLAS in gensim, thank god.

Radim


On Sun, Sep 22, 2013 at 8:45 PM, Pedro Cardoso <karu...@gmail.com> wrote:
> Considering that word2vec does not use BLAS (AFAIK) I would not be
> surprised. Still, great job.
>
>
> 2013/9/22 Radim Řehůřek <m...@radimrehurek.com>

niefpaarschoenen

Sep 23, 2013, 9:19:19 AM
to gen...@googlegroups.com, Pedro Cardoso
Hello Radim,

On Saturday, September 21, 2013 at 10:49:22 PM UTC+2, Radim Řehůřek wrote:
I optimized the word2vec code in gensim. The speed-up is about 19x, but you'll need Cython to see it (`pip install cython`), plus a fast BLAS (but you should have that anyway!).


Best,
Radim

I wanted to try this for myself on a small corpus, but got the following timings:

<me@mycomp:~> /usr/bin/time -f '%E' ./word2vec -train corpus -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 1 -binary 1
0:28.30

<me@mycomp:~> /usr/bin/time -f '%E' train_gensim.py corpus
15:45.59

train_gensim.py basically consists of the following lines:

from gensim.models import word2vec

sentences = word2vec.Text8Corpus(f_corpus)
model = word2vec.Word2Vec(sentences, size=200, window=5, min_count=5)
model.save_word2vec_format(f_out, binary=True)

So I did some searching, found out I was missing Cython and was running old versions of NumPy and SciPy, updated everything, and now the processing time went up to 1h!?

Cython (v0.19.1) and BLAS (SciPy v0.13.0b1) are installed, and the script seems to be using the faster Cython code (no exception thrown in word2vec.py).

Perhaps I should also note that the log files indicate a delay after the first block of sentences, but this might be normal considering the increasing word history:

2013-09-23 13:38:46,815 : INFO : PROGRESS: at sentence #0, 0.00% words, alpha 0.025000, 0 words per second
2013-09-23 13:38:47,834 : INFO : PROGRESS: at sentence #53, 3.96% words, alpha 0.024011, 49393 words per second
2013-09-23 13:38:50,979 : INFO : PROGRESS: at sentence #57, 4.25% words, alpha 0.023937, 13046 words per second
2013-09-23 13:38:53,562 : INFO : PROGRESS: at sentence #58, 4.32% words, alpha 0.023919, 8179 words per second                                                
2013-09-23 13:38:56,474 : INFO : PROGRESS: at sentence #59, 4.40% words, alpha 0.023901, 5816 words per second       
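
For completeness, these PROGRESS lines come from Python's standard logging; a minimal setup producing this exact format is:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)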

Joris

Radim Řehůřek

Sep 23, 2013, 11:47:06 AM
to gen...@googlegroups.com, Pedro Cardoso

I wanted to try this for myself on a small corpus, but got the following timings:

<me@mycomp:~> /usr/bin/time -f '%E' ./word2vec -train corpus -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 1 -binary 1
0:28.30

<me@mycomp:~> /usr/bin/time -f '%E' train_gensim.py corpus
15:45.59

train_gensim.py basically consists of the following lines:

from gensim.models import word2vec

sentences = word2vec.Text8Corpus(f_corpus)
model = word2vec.Word2Vec(sentences, size=200, window=5, min_count=5)
model.save_word2vec_format(f_out, binary=True)

So I did some searching, found out I was missing Cython and was running old versions of NumPy and SciPy, updated everything, and now the processing time went up to 1h!?

Cython (v0.19.1) and BLAS (SciPy v0.13.0b1) are installed, and the script seems to be using the faster Cython code (no exception thrown in word2vec.py).

Perhaps I should also note that the log files indicate a delay after the first block of sentences, but this might be normal considering the increasing word history:

2013-09-23 13:38:46,815 : INFO : PROGRESS: at sentence #0, 0.00% words, alpha 0.025000, 0 words per second
2013-09-23 13:38:47,834 : INFO : PROGRESS: at sentence #53, 3.96% words, alpha 0.024011, 49393 words per second
2013-09-23 13:38:50,979 : INFO : PROGRESS: at sentence #57, 4.25% words, alpha 0.023937, 13046 words per second
2013-09-23 13:38:53,562 : INFO : PROGRESS: at sentence #58, 4.32% words, alpha 0.023919, 8179 words per second                                                
2013-09-23 13:38:56,474 : INFO : PROGRESS: at sentence #59, 4.40% words, alpha 0.023901, 5816 words per second       



The drop after sentence #53 is not normal. For reference, I'm getting around 89k words/s on the text8 corpus with size=200 (if that's the corpus you're using). There is no increase in "word history": it's always the same training window, 1-5 words back and 1-5 words ahead.
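
To make that concrete, here is a rough sketch of the dynamic-window trick both the C tool and gensim use (variable names are illustrative, not gensim's actual code):

import random

window = 5
b = random.randint(0, window - 1)  # drawn anew for each target word
span = window - b                  # effective context: 1..5 words each side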

The slowdown is where I would start looking for problems... I can't tell what's wrong from the log, though. Does the slow NumPy version exhibit the same drop, at the same spot? Any unusual OS activity? What BLAS do you use? (`import scipy; scipy.show_config()`)

-rr


 


niefpaarschoenen

Sep 23, 2013, 7:39:21 PM
to gen...@googlegroups.com


On Monday, September 23, 2013 at 5:47:06 PM UTC+2, Radim Řehůřek wrote:
The drop after sentence #53 is not normal. For reference, I'm getting around 89k words/s on the text8 corpus with size=200 (if that's the corpus you're using).

I was using a different (even smaller) corpus, but text8 also drops quickly:
2013-09-24 01:28:59,142 : INFO : PROGRESS: at sentence #0, 0.00% words, alpha 0.025000, 0 words per second
2013-09-24 01:29:00,433 : INFO : PROGRESS: at sentence #40, 0.24% words, alpha 0.024941, 30863 words per second
2013-09-24 01:29:02,983 : INFO : PROGRESS: at sentence #41, 0.24% words, alpha 0.024939, 10575 words per second
2013-09-24 01:29:06,022 : INFO : PROGRESS: at sentence #42, 0.25% words, alpha 0.024938, 6049 words per second
2013-09-24 01:29:09,692 : INFO : PROGRESS: at sentence #43, 0.25% words, alpha 0.024936, 4039 words per second
2013-09-24 01:29:13,289 : INFO : PROGRESS: at sentence #44, 0.26% words, alpha 0.024935, 3082 words per second
2013-09-24 01:29:16,897 : INFO : PROGRESS: at sentence #45, 0.27% words, alpha 0.024933, 2511 words per second
 
Does the slow NumPy version exhibit the same drop, at the same spot?

No, on text8 it is stable at around 1150 words/s.
 
Any unusual OS activity?

Not sure what the best way is to measure this.
 
What BLAS do you use? (`import scipy; scipy.show_config()`)

umfpack_info:
  NOT AVAILABLE
atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = f77
    include_dirs = ['/usr/include']
blas_opt_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = c
    include_dirs = ['/usr/include']
atlas_blas_threads_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = c
    include_dirs = ['/usr/include']
lapack_opt_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = f77
    include_dirs = ['/usr/include']
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

Radim Řehůřek

Sep 24, 2013, 4:14:27 AM
to gen...@googlegroups.com
That ATLAS is more than 4 years old... but that shouldn't be a problem (just suboptimal).

Joris, can you come on Skype so we can debug this? One sentence every 3 seconds is not normal, especially since it runs normally for the first few dozen sentences.

-rr

Radim Řehůřek

Sep 26, 2013, 10:42:50 AM
to gen...@googlegroups.com
Looks like Joris and I resolved the "slowdown" issue; the `develop` branch has been updated with some extra checks.
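
If you want to try it before the next release, installing straight from the develop branch should work (assuming you install from the GitHub repository):

pip install git+https://github.com/piskvorky/gensim.git@develop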

Please report any other problems you may come across.

-rr

Radim Řehůřek

Oct 4, 2013, 12:41:45 PM
to gen...@googlegroups.com
Hello all,

I just updated the word2vec implementation in gensim with parallel (multi-threaded) processing.

The write-up and additional info is on my blog again: http://radimrehurek.com/2013/10/parallelizing-word2vec-in-python/
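
The parallelism is controlled by the new `workers` parameter. A minimal sketch (the text8 path is a placeholder):

from gensim.models import word2vec

sentences = word2vec.Text8Corpus('text8')
# workers = number of training threads; the compiled Cython extension is needed for this to scale.
model = word2vec.Word2Vec(sentences, size=200, window=5, min_count=5, workers=4)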

Bug reports / success reports / ideas are welcome as usual.

Radim