new to group and facing one issue

Viral .Dave

Aug 18, 2022, 1:15:01 PM

to Gensim

Hi,

I am new to data science, and especially to using Gensim.

I am trying to load a model from Azure Blob Storage, but I am running into errors.

I tried two different versions of the code. The first:

import gensim

import os

from azure.storage.blob import BlobServiceClient

from smart_open import open

azure_storage_connection_string = "DefaultEndpointsProtocol=xxxxxxx"

client = BlobServiceClient.from_connection_string(azure_storage_connection_string)

file_prefix="azure://landing/TechnologyCluster/VectorCreation/embeddings/"

fin = open(file_prefix+"word2vec.Tobacco.fasttext.model", transport_params=dict(client=client))

clustering.embedding = gensim.models.Word2Vec.load(open(fin))

Error is

TypeError: don't know how to handle uri <_io.TextIOWrapper name='azure://landing/TechnologyCluster/VectorCreation/embeddings/word2vec.Tobacco.fasttext.model' encoding='UTF-8'>

The other version of the code (not a major difference) is:

import gensim

import os

from azure.storage.blob import BlobServiceClient

from smart_open import open

azure_storage_connection_string = "xxxxxxx"

client = BlobServiceClient.from_connection_string(azure_storage_connection_string)

file_prefix="azure://landing/TechnologyCluster/VectorCreation/embeddings/"

clustering.embedding = gensim.models.Word2Vec.load(open(file_prefix+"word2vec.Tobacco.fasttext.model",transport_params=dict(client=client)))

Error is

AttributeError: '_io.TextIOWrapper' object has no attribute 'endswith'

Gordon Mohr

Aug 18, 2022, 2:39:40 PM

to Gensim

Welcome!

The native `.load()` functionality of Gensim's `Word2Vec` requires a *file-path* (for which in some cases a remote service URI might be acceptable), not an already-open file-stream. So your attempts using `fin`, or a path that's already been `open()`ed, give the `Word2Vec` class something it's not expecting & can't work with.

Theoretically, it should be possible to use an `azure://` URI, as the underlying `smart_open` package supports Azure. But, when using authenticated access, it looks to me like `smart_open` requires that pre-initialized `client`, and I see no way in `Word2Vec` to pass that in. (Potentially, Gensim could be extended to allow extra arguments to be passed to `smart_open`, or `smart_open` could find an already-authenticated client via some other mechanism... but neither of those capabilities is yet on any roadmap.)

Your best bet will be either:

* downloading the `.model` file – and any associated `.npy` files that might be alongside it, for any largish-model – to a local volume, then using a normal `.load()` with a local file-path; or

* moving entirely to Python pickling to get the model in one large file, and skip using Gensim's `.save()`/`.load()` at all
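
For the first option, a minimal sketch might look like the following. The helper names `download_model_files` and `load_word2vec` are made up for illustration, and `container_client` is assumed to be an `azure.storage.blob.ContainerClient` (for example, `client.get_container_client("landing")`). It relies on Gensim naming any separately-saved arrays with the main model filename as a prefix, so one prefix listing should pick up the `.model` file and its `.npy` companions.

```python
import os


def download_model_files(container_client, blob_prefix, model_name, local_dir):
    """Copy the model blob and any sibling files (e.g. *.npy) into local_dir.

    container_client is assumed to be an azure.storage.blob.ContainerClient.
    Returns the local path of the main model file.
    """
    os.makedirs(local_dir, exist_ok=True)
    # Subsidiary arrays are saved as "<model file>.<attribute>.npy", so a
    # single prefix query finds the main file and everything alongside it.
    for blob in container_client.list_blobs(name_starts_with=blob_prefix + model_name):
        local_path = os.path.join(local_dir, os.path.basename(blob.name))
        with open(local_path, "wb") as out:
            # Stream the blob to disk rather than holding it all in memory.
            container_client.download_blob(blob.name).readinto(out)
    return os.path.join(local_dir, model_name)


def load_word2vec(local_model_path):
    """Load with Gensim's native .load(), which wants a path, not a stream."""
    # Imported lazily so the download helper has no Gensim dependency.
    from gensim.models import Word2Vec
    return Word2Vec.load(local_model_path)
```

With your names, that would be roughly `path = download_model_files(client.get_container_client("landing"), "TechnologyCluster/VectorCreation/embeddings/", "word2vec.Tobacco.fasttext.model", "/tmp/w2v")` followed by `load_word2vec(path)`. Note that your script does `from smart_open import open`, which shadows the builtin `open()` used here; `smart_open`'s version should also handle plain local paths, but keeping the helper in its own module avoids any surprise.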

Regarding that second option: a bunch of things in Gensim's native `.save()`/`.load()` exist to work around older limits, & some inefficiencies, in standard Python pickling. But recent Python pickling capabilities have ditched some old file-size limits, and if your project can tolerate a single-large-pickle-file coming/going to Azure, this option could work for you. A few tips if you go that way:

* when saving the file, supply a giant value for the `.save()` optional parameter `sep_limit`, so that *no* subsidiary arrays, no matter how large, get saved separately. e.g.: `sep_limit=sys.maxsize`

* when pickling, be sure you're using at least PICKLE_PROTOCOL=4 for large-object support. If using a Python past 3.8, and a recent Gensim version, the utility function `gensim.utils.pickle()` will use PICKLE_PROTOCOL=4, but will only write to paths that `smart_open` supports. (You might be able to supply an Azure write-stream to Python's own `pickle.dump()` function.)

* Gensim's support code for handling the occasional model-internals changes between versions hooks off of the Gensim custom `.save()`/`.load()`, so a pure pickle-based solution will only be sure to work when pickle-dumping, and pickle-loading, from the exact same Gensim version. If you do find yourself needing to migrate an older Gensim model to a newer Gensim version, do a local native Gensim `.save()` from the old version, and a local native Gensim `.load()` from the new version, before dumping a pure-pickle full model for storage elsewhere (that can then unpickle into the new version).
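
As a rough sketch of the pure-pickle route, the two helpers below (hypothetical names `dump_model` and `load_model`) just wrap Python's own `pickle` with an explicit protocol 4 and work against any binary stream. I haven't tested this against Azure myself, so treat the stream-to-Azure part in the note afterwards as an assumption to verify.

```python
import pickle


def dump_model(model, stream):
    """Pickle the whole model into one binary stream as a single object."""
    # Protocol 4 adds support for very large objects (over 4 GiB).
    pickle.dump(model, stream, protocol=4)


def load_model(stream):
    """Unpickle a model from a binary stream.

    Only dependable when the installed Gensim version matches the one that
    produced the pickle, and only safe for pickles you created yourself.
    """
    return pickle.load(stream)
```

Something like `with smart_open.open(uri, "wb", transport_params=dict(client=client)) as fout: dump_model(model, fout)` would then write to an `azure://` URI, with the matching `"rb"` open and `load_model(fin)` to read it back, assuming `smart_open`'s Azure streams behave like ordinary binary file objects here.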

- Gordon

Viral .Dave

Aug 18, 2022, 5:58:20 PM

to Gensim

Hi Gordon,

Thanks for such a detailed answer, and for the quick turnaround as well. I really appreciate it.

Yes, I will go for downloading to the local filesystem (Spark) and giving an explicit path. I had thought about that, but wanted to get other people's views first.

Thanks again
