Using gensim in Azure ML Studio

bok...@gmail.com

Jul 17, 2018, 11:29:36 AM

to gensim

Hello everyone!

I am using gensim for a deep learning task on sentences, and for that I use the Doc2Vec model.

Our workflow includes training on data in Azure ML Studio. I now want to integrate the model, which I have already written locally, into Azure ML Studio.

The Python script works fine on my local machine. However, on Azure ML Studio (Anaconda 4.0, Python 3.5.1) I get several errors that do not make sense to me.

For example, here is the code that creates the model:

model = Doc2Vec(dm = 1, dm_concat = 1, vector_size = 50, window = 5, min_count = 2, workers = 4, sample = 1e-5)

In Azure ML Studio I get the following error message:

Caught exception while executing function: Traceback (most recent call last):
  File "C:\pyhome\lib\site-packages\gensim\models\doc2vec.py", line 585, in __init__
    null_word=dm_concat, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'vector_size'

It works when I change vector_size to size, but size is deprecated and will be removed in gensim 4.0.

There are also a few other errors, for example:

Caught exception while executing function: Traceback (most recent call last):
    model.train(docs, total_examples = docs_count, epochs = 1)
TypeError: train() got an unexpected keyword argument 'epochs'

If I remove the epochs argument, the training works. On the local machine, however, the code won't train without it, because train() expects the epochs argument.

Lastly, when I try to get the document vectors,

docs = model.docvecs.vectors_docs

it triggers another AttributeError:

AttributeError: 'DocvecsArray' object has no attribute 'vectors_docs'

Could you give me some advice on how to fix these problems? Maybe it's an import problem?

Gordon Mohr

Jul 17, 2018, 2:05:07 PM

to gensim

If your gensim classes don't have the parameters/properties `vector_size` and `epochs` and `vectors_docs`, but you see those in docs/examples, your gensim version is older than the version featured in the docs/examples you're looking at. Update your gensim to a recent version that matches the docs/tutorials you're looking at.
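One way to surface such a mismatch early is to compare the installed version against the one the script was written for and fail with a readable message. The following is only a minimal sketch: `version_tuple` and `require_version` are hypothetical helper names, the minimum version is whatever you developed against locally, and the string formatting avoids f-strings so it also runs on Python 3.5.

```python
import re


def version_tuple(version):
    """Turn '3.5.0' into (3, 5, 0); trailing suffixes such as 'rc1' are ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % (version,))
    return tuple(int(part) for part in match.group(0).split("."))


def require_version(installed, minimum, package="gensim"):
    """Raise a RuntimeError if `installed` is older than `minimum`."""
    have = version_tuple(installed)
    need = version_tuple(minimum)
    # Pad with zeros so that '3.5' and '3.5.0' compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    if have < need:
        raise RuntimeError(
            "%s %s is installed, but this script expects at least %s"
            % (package, installed, minimum)
        )


# Usage (commented out so the sketch runs without gensim installed):
# import gensim
# require_version(gensim.__version__, "3.5.0")  # placeholder minimum
```

Placed at the top of a script, a guard like this turns a confusing "unexpected keyword argument" into a message that names the real cause.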

Separate observations about your model parameters:

* `dm_concat` is a mode that creates giant, slow models whose benefit was claimed in the original 'Paragraph Vector' paper, but I've not seen cases where it's worth the extra resources. I'd recommend against its use except for advanced users with giant datasets and lots of time to experiment/evaluate.

* typical vector sizes in published work range from 100-1000; a number as low as 50 would probably only be appropriate for very-small datasets

* low `min_count` values make models larger and slower to train - and sometimes hurt overall quality as well. Don't assume "keeping more of the corpus always helps"

* smaller (more-aggressive) `sample` values tend to make more sense with larger corpuses; so it's a bit odd to see `sample=1e-05` (perhaps appropriate for a large-corpus) alongside `vector_size=50` (suggestive of a small-corpus)

- Gordon

bok...@gmail.com

Jul 18, 2018, 3:16:33 AM

to gensim

Thank you very much for your answer.

I am using a recent version of gensim (3.5.0). Or at least that's the version I downloaded (gensim-3.5.0-cp35-cp35m-win32.whl, from https://pypi.org/project/gensim/#files). I unpacked the wheel file and put its contents into a .zip file so that Azure ML Studio can import the gensim package (and its dependencies). Is there a way to find out which version is currently being imported?

Thank you also for the other observations. I chose a vector size of 50 just for testing purposes; otherwise I am using vector sizes in the range you described.

Radim Řehůřek

Jul 18, 2018, 6:33:03 AM

to gensim

You can always get the current Gensim version with: import gensim; print(gensim.__version__)
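The version string alone does not say where the package was loaded from, which matters when several copies may be present. A minimal sketch that reports both is shown here; `describe_module` is a hypothetical helper name, and `json` is used only so the example runs anywhere (substitute `gensim` in the environment you want to inspect).

```python
import importlib


def describe_module(name):
    """Import `name` and return (version, file) of the copy that was actually loaded."""
    module = importlib.import_module(name)
    version = getattr(module, "__version__", "unknown")
    # __file__ is missing for built-in modules and None for namespace packages.
    location = getattr(module, "__file__", None) or "unknown"
    return version, location


# Usage: replace "json" with "gensim" to inspect the gensim that gets imported.
version, location = describe_module("json")
print("version:", version)
print("loaded from:", location)
```

If the reported path does not point into the directory where the new package was placed, the interpreter is picking up a different copy.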

HTH,

Radim

bok...@gmail.com

Jul 19, 2018, 3:05:55 AM

to gensim

For some strange reason, gensim.__version__ gives me 0.12.4.

If I unpack the wheel file, the __init__.py under the gensim directory says it's version 3.5.0

Gordon Mohr

Jul 19, 2018, 2:47:21 PM

to gensim

This indicates you're not installing the wheel properly into the environment that you're actually executing.

I'm unfamiliar with the vagaries of Azure ML Studio; but getting a package in the right place for an environment usually involves using a `pip` (or often in Anaconda, `conda`) installation command while the right Python interpreter is active, not just dropping or expanding a ZIP somewhere.
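To see which interpreter is running and which directories on its search path contain a copy of a package, something like the following rough sketch may help. `find_copies` is a hypothetical helper; it only looks at plain directories on `sys.path`, so it ignores ZIP entries, import hooks, and modules that are already loaded.

```python
import os
import sys


def find_copies(package, search_path=None):
    """List directories on the search path that contain `package`, in import order."""
    if search_path is None:
        search_path = sys.path
    hits = []
    for entry in search_path:
        base = entry or os.getcwd()  # an empty entry means the current directory
        as_package = os.path.isdir(os.path.join(base, package))
        as_module = os.path.isfile(os.path.join(base, package + ".py"))
        if as_package or as_module:
            hits.append(base)
    return hits


# Usage: replace "json" with "gensim" in the environment you want to inspect.
print("interpreter:", sys.executable)
for directory in find_copies("json"):
    print("json found in:", directory)
```

When more than one directory is listed, the first one is normally the copy that a plain import picks up, so an older bundled package earlier on the path would shadow a newer one added later.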

- Gordon

bok...@gmail.com

Jul 23, 2018, 10:28:24 AM

to gensim

I import the gensim package into the Python Script module in Azure ML Studio based on this information:

Neither of them uses pip or conda to install packages. As far as I know, the Python Script module has this limitation: you can't install a package, you can only use the scripts available in the directory.

Is version 0.12.4 a fallback version? For some reason it does pick up a specific version of gensim.

Gordon Mohr

Jul 23, 2018, 12:53:26 PM

to gensim

No, 0.12.4 is just an old version - from January 2016, replaced by 0.13.0 in June 2016. Some aspect of that environment must have bundled it and never updated it. I suspect your attempts to install something are having no effect, and you'd see the same gensim version even before you try to install gensim. Maybe it was bundled with a (relatively) ancient version of Anaconda? Maybe you'll have to explicitly uninstall the older gensim before a newer one can be installed via the process you've found? You're in vendor-specific, atypical-environment territory here; it'll likely require Microsoft/Azure ML Studio-specific expertise to resolve.

- Gordon

bok...@gmail.com

Jul 25, 2018, 11:30:00 AM

to gensim

Thank you very much for your answer. You are right. An old version of gensim is bundled with Anaconda.

I tried to delete the old modules with

del sys.modules["modulename"]

I couldn't get a working script out of it, though. I guess I need to ask the team behind Azure ML Studio at this point.

Again, thanks for your input!
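For reference, the cleanup attempted in the last post can be sketched in general Python terms as follows. This is an illustration under assumptions, not a confirmed fix for Azure ML Studio: `import_from_path` is a hypothetical helper, `directory` stands for wherever the unpacked package ends up, and the approach cannot unload compiled extension modules or fix references that other already-imported modules hold to the old copy.

```python
import importlib
import sys


def import_from_path(package, directory):
    """Import `package` from `directory`, dropping any copy that is already loaded.

    Assumption: `directory` contains the unpacked package folder.
    """
    # sys.modules is a dict, so entries are removed with [] rather than ().
    # Submodules are cached under dotted names and must be dropped as well.
    stale = [name for name in sys.modules
             if name == package or name.startswith(package + ".")]
    for name in stale:
        del sys.modules[name]

    # Put `directory` first so it is searched before the bundled site-packages.
    if directory in sys.path:
        sys.path.remove(directory)
    sys.path.insert(0, directory)

    importlib.invalidate_caches()
    return importlib.import_module(package)


# Usage (hypothetical path, commented out so the sketch runs anywhere):
# gensim = import_from_path("gensim", r".\Script Bundle")
# print(gensim.__version__)
```

This would need to run before anything else in the script imports the package, and whether the hosted environment allows it is something only the vendor can confirm.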
