Need to make changes in train_batch_sg in cython implementation


trideep rath

Feb 16, 2017, 4:39:22 AM
to gensim
Hi,
I need to modify train_batch_sg in the Cython implementation so that it supports arbitrary contexts. Can anyone guide me on how to compile the changes and rebuild gensim afterwards?
I would like to use the new functions like this:

from gensim.models.word2vec_inner import train_batch_sg_arbitrary, train_batch_cbow_arbitrary
...
...
tally += train_batch_sg_arbitrary(self, sentences, alpha, work)


I have already made the corresponding changes in the pure-Python function, but on a dataset of 2 billion words it would take forever. Hence I need to make the changes in the Cython file and build it. I need help with how to build word2vec_inner.pyx after changing it, and how to use the result from gensim/models/word2vec.py.

Thanks in Advance. Help would be highly appreciated.

Thanks & Regards,
Trideep Rath

Gordon Mohr

Feb 16, 2017, 1:27:10 PM
to gensim
Roughly the required steps are:

(1) Ensure your Python environment is using your working-copy of gensim for the python & the `_inner` compiled code (typically `.so` shared-libraries) – this might involve invoking setup.py from inside your project directory, or doing a 'pip' install using a local path

(2) When your changes to the `.pyx` files seem ready, use `cython` to compile them to `.c` code. (You *might* need to do this from the root of the project, eg: `cython gensim/models/word2vec_inner.pyx`)

(3) Use the command `python ./setup.py build_ext --inplace`, from the root of your gensim directory, to compile the `.c` to shared-libraries. (Depending on how well you did step (1), you might also need to do something like `python ./setup.py install` to also install the shared-libraries elsewhere.)

(4) Run your tests, confirming especially that your changed code (and not some older or elsewhere-installed version) is being run from where you expect it. Debug & repeat (2)-(3) as necessary. 
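For step (4), one way to confirm which copy of the code is actually loaded is to inspect each module's `__file__`. The following is only a minimal sketch: the helper names `describe_module` and `report_gensim_build` are made up for illustration, and it assumes Python 3 and that gensim is importable in the current environment.

```python
import importlib
from importlib.machinery import EXTENSION_SUFFIXES


def describe_module(name):
    """Import `name` and report where it was loaded from.

    Returns (path, is_extension): the file backing the module, and whether
    that file is a compiled extension (.so / .pyd) rather than Python source.
    """
    module = importlib.import_module(name)
    # Built-in modules have no __file__; treat that as an empty path.
    path = getattr(module, "__file__", None) or ""
    is_extension = path.endswith(tuple(EXTENSION_SUFFIXES))
    return path, is_extension


def report_gensim_build():
    """Print where the word2vec modules come from (hypothetical helper)."""
    for name in ("gensim.models.word2vec", "gensim.models.word2vec_inner"):
        try:
            path, is_extension = describe_module(name)
        except ImportError as err:
            # If the compiled module is missing, gensim falls back to the
            # slow pure-Python training code.
            print("{}: not importable ({})".format(name, err))
        else:
            kind = "compiled extension" if is_extension else "Python source"
            print("{}: {} at {}".format(name, kind, path))
```

Calling `report_gensim_build()` after each rebuild should show both paths inside your working copy. If `word2vec_inner` is reported as not importable, or its path points into some other site-packages directory, the build from steps (2)-(3) is not the one being used.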

Beware that within the Cython & `nogil` sections, small bugs may cause memory-access process crashes without any useful stack traces, or subtle corruption (that isn't noticed, or doesn't cause a process-crash, until arbitrarily later). Normal Python logging/printing won't usually be available – see the Cython docs for alternatives. You may need to use `gdb` or similar to get a better view of error-conditions. 
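As a lightweight complement to `gdb`, the Python 3 standard library's `faulthandler` module can at least dump the Python-level traceback when the process receives a fatal signal such as a segmentation fault. This is a sketch of an extra aid rather than something the steps above require, and the wrapper name `enable_crash_tracebacks` is made up for illustration.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump Python tracebacks for all threads on fatal signals (e.g. SIGSEGV).

    `stream` must be a real file with a file descriptor and must stay open
    for as long as the handler is enabled; it defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

Call `enable_crash_tracebacks()` once at the top of the test script, before training starts. The dump only shows which Python call was active when the crash happened, not the C-level frames inside the compiled code, so it narrows the search but does not replace a debugger.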

Hope this helps. Good luck!

- Gordon

Andrey Kutuzov

Feb 22, 2017, 1:18:19 PM
to gen...@googlegroups.com
Hi Trideep,

In fact, we've already implemented arbitrary contexts in the Gensim
Continuous Skipgram training function (cythonized).
We have not published it online yet, as the accompanying paper is under
review now. However, if you are interested in looking at the code, you
can contact me privately.
> --
> You received this message because you are subscribed to the Google
> Groups "gensim" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to gensim+un...@googlegroups.com
> <mailto:gensim+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Solve et coagula!
Andrey

trideep rath

Feb 22, 2017, 4:04:49 PM
to gen...@googlegroups.com
Hi Andrey,
Thanks for your reply. I am referring to the paper "Dependency-Based Word Embeddings" by Omer Levy, which I believe is already published.
I would be thankful if you could send me the cythonized file.

Thanks and Regards
Trideep Rath


