Problems learning word vectors with Doc2Vec and dm=0 (PV-DBOW)


Jeradf

Jun 28, 2015, 6:20:37 PM
to gen...@googlegroups.com
I'm trying to simultaneously learn word vectors and document vectors via the skip-gram PV-DBOW model. It seems the train_words parameter, if set to True, should do exactly this, but the word vectors I get back seem essentially random.

The original authors have a follow-up NIPS paper where they use this approach and show interesting results, like training on Wikipedia and then finding the Japanese equivalent of Lady Gaga, or training on a corpus of arXiv papers and finding the Bayesian version of a deep learning paper.

Does gensim not support this functionality? 


Gordon Mohr

Jun 28, 2015, 6:34:22 PM
to gen...@googlegroups.com
That parameter doesn't work right in the last official release, but it's fixed (and renamed as 'dbow_words') in the pending work for the next release. 

Please check out the branch referenced in this pull request to try simultaneous DBOW-doc/SG-word learning:


Comments from any review/testing wanted! 

- Gordon

Jeradf

Jun 29, 2015, 7:21:43 PM
to gen...@googlegroups.com

Gordon, thanks for all your work on gensim; it looks terrific.

I tried using your code w/ these parameters:

model = doc2vec.Doc2Vec(dm=0, size=500, min_count=45, window=8, dbow_words=1, sample=1e-5, workers=8)

It ran for about 30 minutes, was reporting around 45% completion, and then Python crashed. I haven't experienced this before.

Any ideas what went wrong? Here's what seems like the relevant section of the crash report:

Crashed Thread:  0  Dispatch queue: com.apple.main-thread

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00007f84cee00000

VM Regions Near 0x7f84cee00000:
    MALLOC_TINY            00007f84ce000000-00007f84cee00000 [ 14.0M] rw-/rwx SM=PRV  
--> 
    MALLOC_SMALL           00007f84cf000000-00007f84cf800000 [ 8192K] rw-/rwx SM=PRV  

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   mtrand.so                     0x000000010ceffe8d rk_double + 29
1   mtrand.so                     0x000000010ceb7185 __pyx_f_6mtrand_cont0_array + 1541
2   mtrand.so                     0x000000010ceb7d26 __pyx_pw_6mtrand_11RandomState_17random_sample + 262
3   org.python.python             0x000000010c2837e6 PyEval_EvalFrameEx + 14392
4   org.python.python             0x000000010c27fd7a PyEval_EvalCodeEx + 1409
5   org.python.python             0x000000010c28659d fast_function + 117
6   org.python.python             0x000000010c283400 PyEval_EvalFrameEx + 13394
7   org.python.python             0x000000010c21b67a gen_send_ex + 193
8   itertools.so                   0x000000010c9215c6 islice_next + 115
9   org.python.python             0x000000010c22742a listextend + 297
10  org.python.python             0x000000010c228684 list_init + 97
11  org.python.python             0x000000010c248b5b type_call + 182
12  org.python.python             0x000000010c2060ea PyObject_Call + 99
13  org.python.python             0x000000010c282bd2 PyEval_EvalFrameEx + 11300
14  org.python.python             0x000000010c21b67a gen_send_ex + 193
15  org.python.python             0x000000010c217b91 enum_next + 32
16  org.python.python             0x000000010c280525 PyEval_EvalFrameEx + 1399
17  org.python.python             0x000000010c27fd7a PyEval_EvalCodeEx + 1409
18  org.python.python             0x000000010c28659d fast_function + 117
19  org.python.python             0x000000010c283400 PyEval_EvalFrameEx + 13394
20  org.python.python             0x000000010c27fd7a PyEval_EvalCodeEx + 1409
21  org.python.python             0x000000010c2841b6 PyEval_EvalFrameEx + 16904
22  org.python.python             0x000000010c27fd7a PyEval_EvalCodeEx + 1409
23  org.python.python             0x000000010c27f7f3 PyEval_EvalCode + 54
24  org.python.python             0x000000010c29f8a2 run_mod + 53
25  org.python.python             0x000000010c29f6be PyRun_InteractiveOneFlags + 353
26  org.python.python             0x000000010c29f1cd PyRun_InteractiveLoopFlags + 192
27  org.python.python             0x000000010c29f077 PyRun_AnyFileExFlags + 60
28  org.python.python             0x000000010c2b0c5b Py_Main + 3051
29  libdyld.dylib                 0x00007fff8f1285fd start + 1

Gordon Mohr

Jun 29, 2015, 8:19:18 PM
to gen...@googlegroups.com
Hm, that's concerning, because memory corruption from the updated Cython code would be the most likely culprit. OTOH, it looks like it happened inside a NumPy random_sample routine that's supposed to be thread-safe, and shouldn't necessarily have its state near the arrays being updated with the sometimes-error-prone pointer arithmetic.

It looks like you're on an 8-core Mac Pro? In the interest of tracking down what's going wrong, collecting more info by triggering the crash repeatedly would be helpful. If you set "ulimit -c unlimited" before launching the Python process, you should get a core file in /cores/ after any future crash, and inspecting that core with 'lldb -c [core_path]' would give a bit more info about the location and state at the moment of the crash. (Key questions: is it always in the same routine? Does it happen reliably and quickly, or only occasionally? There'd be more specific things to check once you had a few example cores in hand.)
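The core-collection steps above, as a rough shell sketch (the core filename pattern under /cores/ varies by system and process ID):

```shell
# Enable core dumps for the current shell session before launching Python;
# any later segfault should then leave a core file under /cores/ (on OS X).
ulimit -c unlimited
ulimit -c   # confirm the new limit; prints "unlimited"

# After a crash, inspect the core with lldb, e.g.:
#   lldb -c /cores/core.<pid>   (actual filename varies)
# then use `bt` inside lldb to see the crashing thread's backtrace.
```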

You could also try running without the 'sample' parameter – that should skip the particular random_sample routine that appears in your crash stack. (That routine may not be the real culprit, but if not using sampling happened to avoid the crashes, that'd be interesting data.)

To have a full picture of your system, it'd also be good to know:

- When you say "your code", do you mean the bigdocvec_pr branch, or gensim 'develop' (where the PR was just merged yesterday)?
- Python version
- NumPy version

Also, while it's rarely the cause, if the crashes seem random under heavy load, then at some point the RAM itself can be suspect. (I haven't done that on a Mac in a long time, but it looks like there's a built-in on-boot facility, as described at http://www.macissues.com/2014/03/21/how-to-run-and-interpret-apples-hardware-tests-on-your-mac/ .) But I'd seek a few examples of mysterious crashes before seriously considering this, unless you've just upgraded the RAM.

- Gordon

Jeradf

Jun 30, 2015, 3:50:46 PM
to gen...@googlegroups.com
I tried it again without passing in the 'sample' parameter and it didn't crash. I'll be trying various model parameters over the coming weeks and I'll report back if I get a crash again. 

To answer your questions:
- I'm using your bigdocvec_pr fork
- Python 2.7.9
- NumPy 1.9.2

Indeed, it could have been a random crash: I did recently upgrade my RAM, but I've trained 50+ Doc2Vec models using the official gensim master without any problems.


On an unrelated note, in the master version of gensim, I can do a most_similar query using documents and words together as in: 
model.most_similar(positive=['Doc_1', 'word_1'], negative=['word_2'])


Does your implementation have a method for doing this or will I need to implement my own?

Gordon Mohr

Jun 30, 2015, 6:18:37 PM
to gen...@googlegroups.com
Yes, please keep me posted of any patterns observed, especially any crash stacks or reproducible formulas. 

You should probably prefer the main gensim 'develop' branch now that this work has merged (and continues to improve there). For example, anyone using hierarchical softmax out of habit, because it often gets off to a faster start, may want to give negative sampling with 5-10 examples another try after some recent size/speed improvements.
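For anyone making that switch, the relevant knobs look something like this (parameter names assumed from gensim's Word2Vec-style API; verify against your installed version):

```python
# Hypothetical parameter sketch: disable hierarchical softmax (hs=0) and
# use negative sampling with 5 noise words (negative=5) instead, alongside
# the PV-DBOW + skip-gram word-training settings discussed in this thread.
params = dict(dm=0, dbow_words=1, hs=0, negative=5)
# model = doc2vec.Doc2Vec(documents, **params)   # documents: your corpus
print(params)
```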

Now that the docvecs are in a separate array, there's no `most_similar()` that takes mixed IDs/tags and offers mixed top-N results. (The main `model.most_similar()` just does words, the `model.docvecs.most_similar()` just does docs.) But, each can take a raw vector inside the positive/negative arrays, so you can perform mixed operations that way. 
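A rough NumPy sketch of that raw-vector workaround, mimicking the usual most_similar arithmetic (mean of positives minus mean of negatives, ranked by cosine similarity). The toy 2-d vectors stand in for real doc/word vectors you'd pull out of the model:

```python
import numpy as np

def most_similar_mixed(positive, negative, candidates, topn=3):
    """Rank candidate vectors by cosine similarity to a query built as
    mean(unit positives) - mean(unit negatives), word2vec-style."""
    def unit(v):
        return v / np.linalg.norm(v)
    query = np.mean([unit(v) for v in positive], axis=0)
    if negative:
        query = query - np.mean([unit(v) for v in negative], axis=0)
    query = unit(query)
    sims = {name: float(np.dot(query, unit(vec)))
            for name, vec in candidates.items()}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# Toy stand-ins; with a real model these would come from lookups like
# model.docvecs['Doc_1'] and model['word_1'] (API names assumed).
toy = {
    "Doc_1":  np.array([0.9, 0.1]),
    "word_1": np.array([0.8, 0.2]),
    "word_2": np.array([0.1, 0.9]),
}
print(most_similar_mixed([toy["Doc_1"], toy["word_1"]], [toy["word_2"]], toy))
```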

Right now there's a bunch of cut-and-paste duplication in the most_similar/similarity/etc. methods. Ideally that'd be refactored into some shared facility usable in both places. There's also a bunch of potential speedups offered in gensim PRs #340 and #350, which may just need a bit of polish/testing to be integrated.

- Gordon

Gordon Mohr

Jul 6, 2015, 8:37:18 PM
to gen...@googlegroups.com
FYI, a likely cause of your segfaults has been fixed, in the 0.12.0 release. (Specifically: when both using string doctags, and repeating some doctags before all were discovered, some indexes could reach beyond the legitimate end of the docvecs array – which would risk corruption and/or segfault crashes. It wasn't related to the 'sample' parameter.) 

So please try your training task again, and let us know if you have any recurrences. 

- Gordon