Hello friends,
I've written a simple implementation of word2vec in CUDA and Python, along with some code that compares my implementation to gensim's.
When running the demo, my code trains a bit faster than gensim, but the vectors it produces are not quite as good, as measured by gensim's word analogy task (KeyedVectors.evaluate_word_analogies()). Training is about 75% faster (i.e. it takes roughly 1/1.75 ≈ 57% as long as gensim), but the word analogy score is around 44-45% vs. gensim's 48-49%. These figures are from a few repeated training runs on the same partial Wikipedia dump; see demo.py in the above repo for details.
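For readers who haven't used that benchmark: it scores questions like "man is to king as woman is to ?" by vector arithmetic, roughly by finding the nearest neighbor of vec(king) - vec(man) + vec(woman). Here's a minimal pure-Python sketch of that scoring rule (the "3CosAdd" method) on toy vectors; this is an illustration of the idea, not gensim's actual code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?': find the word whose vector is
    most cosine-similar to v_b - v_a + v_c, excluding the query words."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    best, best_sim = None, -2.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = cosine(target, vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy 2-d vectors chosen by hand so the classic analogy works out.
toy = {
    "man":   [1.0, 0.0],
    "woman": [0.0, 1.0],
    "king":  [1.0, 0.5],
    "queen": [0.1, 1.4],
}
print(solve_analogy(toy, "man", "king", "woman"))  # prints "queen"
```

The benchmark score is simply the fraction of such questions answered correctly over a standard question file.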
I've implemented the most important features of the algorithm in myw2v, and I'm using the same options for it as for gensim. I haven't compared against gensim at the source-code level, though; I'm sure the implementations differ in places. And even if they were identical, the GPU runs many more threads concurrently than gensim does on the CPU, which by itself is likely to cause differences in the results.
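For readers unfamiliar with the core of the algorithm, here's a minimal pure-Python sketch of one skip-gram negative-sampling update, the step that both implementations run millions of times. This is a generic illustration under my own simplifications (uniform negative sampling, fixed learning rate), not code from myw2v or gensim:

```python
import math
import random

def sgns_step(in_vecs, out_vecs, center, context, vocab, lr=0.025, k=5, rng=random):
    """One skip-gram negative-sampling (SGNS) update.

    in_vecs / out_vecs: dicts mapping word -> list[float] (input and
    output embeddings). Pulls the (center, context) pair together and
    pushes k randomly drawn negative words away, via logistic loss.
    """
    v = in_vecs[center]
    grad_v = [0.0] * len(v)
    # The true context word has label 1; the k negatives have label 0.
    samples = [(context, 1.0)] + [(rng.choice(vocab), 0.0) for _ in range(k)]
    for word, label in samples:
        u = out_vecs[word]
        score = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(v, u))))
        g = lr * (label - score)  # gradient scale from the logistic loss
        for i in range(len(v)):
            grad_v[i] += g * u[i]  # accumulate gradient w.r.t. the input vector
            u[i] += g * v[i]       # update the output vector in place
    for i in range(len(v)):
        v[i] += grad_v[i]          # apply the input-vector update last
```

Real implementations draw negatives from the unigram distribution raised to the 0.75 power rather than uniformly, and run many of these updates in parallel without locks, so concurrent threads can overwrite each other's work; that race is one plausible source of the quality differences mentioned above.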
It's interesting that word2vec, as an algorithm, doesn't seem to benefit massively from running on a GPU. While running 75% faster is nice, things like convolutional neural networks often run, say, five times faster on a GPU than on a CPU. Perhaps word2vec is simple enough that it can't make full use of the GPU's capacity; or perhaps my implementation is suboptimal. Possibly both. Myw2v doesn't tax my GPU much at all, even with the relatively large Wikipedia dump the demo uses.
Hope y'all find it interesting,