noob question: gensim.downloader.load keeps getting stuck


Roy Becker

Apr 3, 2024, 4:18:26 PM
to Gensim
I installed gensim and, using code copied from the downloader API documentation, I'm trying to download a pre-trained model such as "glove-twitter-25" or "glove-wiki-gigaword-100".

However, running the download method keeps getting stuck.

At first it reports progress, up to "0.4% 0.5/128.1MB downloaded". Then it prints nothing for another 20 minutes or so, and finally it fails with the following error:

  File "C:\Users\Roy\AppData\Local\Programs\Python\Python311\Lib\site-packages\gensim\downloader.py", line 496, in load
    _download(name)
  File "C:\Users\Roy\AppData\Local\Programs\Python\Python311\Lib\site-packages\gensim\downloader.py", line 396, in _download
    urllib.urlretrieve(url_data, dst_path, reporthook=_progress)
  File "C:\Users\Roy\AppData\Local\Programs\Python\Python311\Lib\urllib\request.py", line 280, in urlretrieve
    raise ContentTooShortError(
urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 2044669 out of 134300434 bytes>

In case this matters, I'm using Python IDLE 3.11.4.

I searched for existing reports of this problem and found only one that seems related: https://github.com/piskvorky/gensim-data/issues/38
I tried the workaround described there, but the download still got stuck.

Any ideas?

Thanks,
Roy.

Gordon Mohr

Apr 10, 2024, 3:26:34 PM
to Gensim
Gensim is just using a basic `urllib` HTTP request here. If you consistently get this same failure – a broken connection mid-download, possibly even always at the same place – it's likely something idiosyncratic about your network, and its path to the download source. (This could be anything from security policies on firewall-like hosts detecting & aborting your action to faulty network hardware or network misconfigurations.)

You could try another machine, or another network.

But also, for a variety of reasons from the value of Pythonic explicitness to robustness against potential software supply-chain attacks, I don't think the `gensim.downloader` module should exist, and I personally recommend against its use. My full argument has been archived in issue <https://github.com/piskvorky/gensim/issues/2283>. 

I suggest instead using a standard web browser to go to the original publishers of these datasets – like the GloVe website <https://nlp.stanford.edu/projects/glove/> – and downloading directly from there, to an explicit local path you've consciously chosen. (Web browsers can sometimes also resume interrupted downloads.)

Or, if you need to automate the download on a graphical-browserless machine (terminal or notebook host), first discover the source URL via a web browser, then use your own line or two of Python `urllib` or `requests` code, or other scriptable tools like command-line `wget` or `curl`, to do the download to your desired location (and unzip/untar/etc as necessary).
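
As a rough sketch of the "line or two of Python" approach, the following streams a URL to a path of your choosing using only the standard library, then compares the byte count with the server's `Content-Length` header so that a truncated transfer fails loudly. The function name `download` and the URL and filename in the usage comment are placeholders for illustration, not anything provided by Gensim or the dataset publishers.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest, timeout=60):
    """Stream `url` to `dest` and verify the size against Content-Length."""
    dest = Path(dest)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        # Content-Length may be absent (e.g. chunked responses); then we skip the check.
        expected = response.headers.get("Content-Length")
        shutil.copyfileobj(response, out)
    actual = dest.stat().st_size
    if expected is not None and actual != int(expected):
        raise IOError(f"incomplete download: got {actual} of {expected} bytes")
    return dest


# Usage (placeholder URL; substitute the one you found in your browser):
# download("https://example.org/path/to/vectors.zip", "vectors.zip")
```

If the archive is a zip file, the standard library's `zipfile` module (or a command-line `unzip`) can extract it afterwards.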

Then, a line or two of Gensim code (like `vecs = KeyedVectors.load_word2vec_format(filename)`) can load those file formats, without obscuring which code is being run, where data has landed in what format, or which kind of model object (class) is being provided. (In some cases, these alternate approaches can also report failures in more detail.)

- Gordon