Newbie question - error while using the load_word2vec_format function


Yonatan Shalita

Jun 16, 2022, 11:39:25 AM
to Gensim
I am trying to set up a word vector model using the NLPL repository (http://vectors.nlpl.eu/repository/).

my code:
import gensim
import zipfile
nlpl_zip="C:/Users/PC/Documents/CS/semesterD/nlp/project/47.zip"
with zipfile.ZipFile(nlpl_zip, "r") as archive:
    stream = archive.open("model.bin")
    word_vectors = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=True,unicode_errors='replace')

this is the error I get:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [5], in <cell line: 2>()
2 with zipfile.ZipFile(nlpl_zip, "r") as archive:
3 stream = archive.open("model.bin")
----> 4 word_vectors = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=True, 5 unicode_errors='replace')

File ~\anaconda3\envs\hebnlp\lib\site-packages\gensim\models\keyedvectors.py:1723, in KeyedVectors.load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, no_header)
1676 @classmethod
1677 def load_word2vec_format(
1678 cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', 1679 limit=None, datatype=REAL, no_header=False
1680 ):
1681 """Load KeyedVectors from a file produced by the original C word2vec-tool format. 1682
1683 Warnings (...) 1721 1722 """ -> 1723 return _load_word2vec_format( 1724 cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors, 1725 limit=limit, datatype=datatype, no_header=no_header, 1726 )

File ~\anaconda3\envs\hebnlp\lib\site-packages\gensim\models\keyedvectors.py:2052, in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype, no_header, binary_chunk_size) 2049 counts[word] = int(count) 2051 logger.info("loading projection weights from %s", fname) -> 2052 with utils.open(fname, 'rb') as fin: 2053 if no_header: 2054 # deduce both vocab_size & vector_size from 1st pass over file 2055 if binary:

File ~\anaconda3\envs\hebnlp\lib\site-packages\smart_open\smart_open_lib.py:224, in open(uri, mode, buffering, encoding, errors, newline, closefd, opener, compression, transport_params) 221 except ValueError as ve: 222 raise NotImplementedError(ve.args[0]) --> 224 binary = _open_binary_stream(uri, binary_mode, transport_params) 225 decompressed = so_compression.compression_wrapper(binary, binary_mode, compression) 227 if 'b' not in mode or explicit_encoding is not None:

File ~\anaconda3\envs\hebnlp\lib\site-packages\smart_open\smart_open_lib.py:396, in _open_binary_stream(uri, mode, transport_params) 393 return fobj 395 if not isinstance(uri, str): --> 396 raise TypeError("don't know how to handle uri %s" % repr(uri)) 398 scheme = _sniff_scheme(uri) 399 submodule = transport.get_transport(scheme)

TypeError: don't know how to handle uri <zipfile.ZipExtFile name='model.bin' mode='r' compress_type=deflate>

Does anyone know the reason for the error?


Gordon Mohr

Jun 16, 2022, 2:53:39 PM
to Gensim
Your message's quoted material has lost significant whitespace (runs of spaces converted to a single space) in the 'error' paste, but not in your code. As even the significant whitespace in the traceback is helpful for understanding Python, that makes it harder to see what's going on. Further, half but not all of the traceback seems to have lost significant *newlines*, which also hides key details. In the future, please try to preserve all such meaningful spacing/newlines - if sending styled text, ensure your preview looks like the original sources, or ideally, just send pure plain text that preserves all spacing.

Still, I can see the heart of the problem: your `stream` variable, assigned to what the `ZipFile` `.open()` returned, is a `zipfile.ZipExtFile` object. The `.load_word2vec_format()` method needs, for its `fname` parameter per the documentation of intended use/operation, a "file path to the saved word2vec-format file". A `ZipExtFile` isn't that, so it fails.

You should pull the specific resource that's inside the ZIP file out into its own true file, and then supply a normal filesystem file-path to that directly. 
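
A minimal sketch of that suggestion, assuming the archive member is named "model.bin" as in the original post. The helper names `extract_member` and `load_vectors_from_zip` are made up for illustration and are not part of Gensim; the `load_word2vec_format` arguments are the ones from the original code, with a plain file path in place of the stream.

```python
import zipfile


def extract_member(zip_path, member, dest_dir):
    """Extract one member of a ZIP archive and return the path of the extracted file."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        # ZipFile.extract() writes the member under dest_dir and returns the path it created.
        return archive.extract(member, path=str(dest_dir))


def load_vectors_from_zip(zip_path, member="model.bin", dest_dir="."):
    """Hypothetical helper: unpack the model, then give Gensim an ordinary file path."""
    # Imported here so the extraction helper above also works without Gensim installed.
    from gensim.models import KeyedVectors

    model_path = extract_member(zip_path, member, dest_dir)
    return KeyedVectors.load_word2vec_format(
        model_path, binary=True, unicode_errors="replace"
    )
```

With the archive from the original post, the call would be `word_vectors = load_vectors_from_zip(nlpl_zip)`. The extracted "model.bin" stays on disk afterwards, so later runs can pass its path to `load_word2vec_format` directly and skip the ZIP step.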

(Plausibly, the generic utility code that Gensim is using, from the affiliated `smart_open` project, should be able to handle that file-like `ZipExtFile`. But, looking over the code, I only see `smart_open` supporting file-like streams via detection of raw local file-handles – ints – not via detecting anything with the right 'read' methods. So `ZipExtFile` objects aren't yet supported, and it'd require a new decision to expand support, & contributed code, to add that feature.)
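
To make the distinction concrete, here is a small standard-library check, offered only as an illustration. It builds a throwaway in-memory ZIP (the member name "model.bin" mirrors the original post, the contents are placeholder bytes) and shows that the object returned by `ZipFile.open()` is file-like but is neither a `str` path nor an `int` file handle, which is what the `isinstance(uri, str)` test visible in the traceback rejects.

```python
import io
import zipfile

# Build a tiny in-memory ZIP so the check runs without the real 47.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("model.bin", b"placeholder bytes")

with zipfile.ZipFile(buffer, "r") as archive:
    with archive.open("model.bin") as stream:
        kind = type(stream).__name__
        is_path = isinstance(stream, str)        # the check shown in the traceback
        is_descriptor = isinstance(stream, int)  # a raw local file handle
        has_read = hasattr(stream, "read")       # file-like in the duck-typing sense

print(kind, is_path, is_descriptor, has_read)
```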

- Gordon

Andrey Kutuzov

Jun 16, 2022, 3:18:25 PM
to gen...@googlegroups.com
Hi,

I've just tested this very code with this very "47.zip" model on Linux,
and it works flawlessly.

Gensim 3.8.3, smart_open 5.2.1.

Yonatan, what are the versions of the libraries you are using?
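
One possible way to collect that information, sketched with the standard library only. `installed_version` is a hypothetical helper name, and the distribution names "gensim" and "smart_open" are assumed to match how the packages were installed; a package that cannot be found is reported as None.

```python
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is not found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The OS and Python version matter too when comparing a Linux run with a Windows run.
print(platform.platform(), platform.python_version())
for name in ("gensim", "smart_open"):
    print(name, installed_version(name))
```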

--
Solve et coagula!
Andrey

Gordon Mohr

Jun 16, 2022, 4:09:06 PM
to Gensim
On Thursday, June 16, 2022 at 12:18:25 PM UTC-7 akutu...@gmail.com wrote:
> Hi,
>
> I've just tested this very code with this very "47.zip" model on Linux,
> and it works flawlessly.
>
> Gensim 3.8.3, smart_open 5.2.1.
>
> Yonatan, what are the versions of the libraries you are using?

That's surprising & interesting! At 1st I thought, given the older Gensim/smart_open versions you're using, it might be a Gensim-4.0/smart_open-6.0 regression. But a quick glance (but not run) through the source suggests Gensim/smart_open are doing essentially the same things. 

So maybe a Windows thing? Or, if you're using those older versions because you're on Python 2.x, a Python 2-to-3 thing?

There is an interesting 'note' in the Python 2.7 `zipfile` docs (2nd of three notes there) about different behaviors based on how the `ZipFile` was opened:

https://docs.python.org/2.7/library/zipfile.html#zipfile.ZipFile.open

While it doesn't directly apply to this issue, it hints there may be some more finicky implementation tradeoffs behind the scenes that *might* be related, if the `ZipExtFile` particulars also vary between platforms/OS-filesystems/etc. 

- Gordon


Andrey Kutuzov

Jun 16, 2022, 4:49:58 PM
to gen...@googlegroups.com
We are using Gensim 3.8.3, because it does all we need.

My (admittedly limited) experience with Gensim 4.* was troubled with
various incompatibilities, crashes (because of API changes, of course),
etc. As far as I remember, most pain came from fastText models.

We did not have enough time to re-factor all our existing code, so we
just stayed on the last version before 4.*, which is 3.8.3. It still
runs on Python 3, of course, so the `zipfile` differences between Python
2 and Python 3 are not relevant here.

I agree that the topic starter's issue might be Windows-specific.

Yonatan Shalita

Jun 17, 2022, 6:19:28 AM
to gen...@googlegroups.com
Thanks for all the help!
Andrey, I am using smart_open 5.2.1 and Gensim 4.1.2.
What do you reckon the "Windows-specific" problem might be?


Andrey Kutuzov

Jun 17, 2022, 7:55:00 AM
to gen...@googlegroups.com
I mean Gensim is not much tested on Windows, so lots of weird problems
might arise. Can you try on Linux?
