Loading fasttext model from S3


Alexey Shkarupin

Jul 22, 2019, 10:50:53 AM
to Gensim
Hello,

I'm having difficulty loading a fastText .bin model from an S3 bucket using gensim 3.8.0:
from gensim.models.fasttext import load_facebook_model
load_facebook_model("s3://somebucket/model.bin")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alsh/venvs/gensim/lib/python3.6/site-packages/gensim/models/fasttext.py", line 1250, in load_facebook_model
    return _load_fasttext_format(path, encoding=encoding, full_model=True)
  File "/home/alsh/venvs/gensim/lib/python3.6/site-packages/gensim/models/fasttext.py", line 1330, in _load_fasttext_format
    m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
  File "/home/alsh/venvs/gensim/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py", line 321, in load
    raw_vocab, vocab_size, nwords = _load_vocab(fin, new_format, encoding=encoding)
  File "/home/alsh/venvs/gensim/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py", line 177, in _load_vocab
    logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)
AttributeError: 'SeekableBufferedInputBase' object has no attribute 'name'


After commenting out the logging call, a different error appears:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alsh/venvs/gensim/lib/python3.6/site-packages/gensim/models/fasttext.py", line 1250, in load_facebook_model
    return _load_fasttext_format(path, encoding=encoding, full_model=True)
  File "/home/alsh/venvs/gensim/lib/python3.6/site-packages/gensim/models/fasttext.py", line 1330, in _load_fasttext_format
    m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
  File "/home/alsh/venvs/gensim/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py", line 324, in load
    vectors_ngrams = _load_matrix(fin, new_format=new_format)
  File "/home/alsh/venvs/gensim/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py", line 259, in _load_matrix
    matrix = np.fromfile(fin, _FLOAT_DTYPE, count)
io.UnsupportedOperation: fileno


Is this a bug? Should loading a fastText model from S3 be supported?


Radim Řehůřek

Jul 23, 2019, 5:47:04 AM
to Gensim
Hi Alexey,

If the loading uses NumPy, then it won't work on S3. NumPy only supports local files.

I think it could work in principle after some changes, though; I'm not sure why NumPy is needed. We use smart_open internally, which does support reading from S3 transparently.

Can you open this as a ticket on GitHub? https://github.com/RaRe-Technologies/gensim
If you include your use case and motivation, and if we consider the change both desirable and doable, we may modify the loading logic to support S3.

Cheers,
Radim

Megan Rogers

Mar 7, 2024, 4:04:42 AM
to Gensim
Hi there,

I am also having this issue when trying to load a model from S3, with the same error.

Has this been resolved, or is there an alternative solution?

Gordon Mohr

Apr 1, 2024, 2:59:41 PM
to Gensim
I believe the docs are wrong to suggest that smart_open's S3 support is enough for this operation to succeed. When the pseudofile is passed to the numpy routines for bulk array reading – as used for ranges of the Facebook-format full-FastText model – *those* routines assume they can get the sort of `fileno` that's only available for other kinds of files. (They may be trying to do some extra random-access or memory-mapping optimizations.) 
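
To illustrate the `fileno` point, here is a minimal standard-library sketch. It uses `io.BytesIO` as a stand-in (an assumption for illustration) for the non-local stream that smart_open returns for S3, and `has_os_file_descriptor` is a hypothetical helper name, not part of Gensim or NumPy. A real on-disk file exposes an OS-level descriptor; an in-memory stream raises the same kind of `io.UnsupportedOperation` seen in the traceback.

```python
import io
import tempfile


def has_os_file_descriptor(fileobj):
    """Return True if fileobj is backed by a real OS-level file descriptor."""
    try:
        fileobj.fileno()
    except (AttributeError, io.UnsupportedOperation):
        return False
    return True


# A real local file has a descriptor that low-level readers can use.
with tempfile.TemporaryFile() as local_file:
    local_result = has_os_file_descriptor(local_file)

# An in-memory stream (stand-in for a remote pseudofile) does not.
stream_result = has_os_file_descriptor(io.BytesIO(b"\x00" * 16))

print(local_result, stream_result)
```

A check like this could be used to decide between a fast local-file path and a slower generic-stream path, at the cost of the extra branching mentioned below.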

A fix is likely possible, but it might come at extra complexity or less efficiency in the usual case.

OTOH, I see that the NumPy `.fromfile()` docs warn: "Do not rely on the combination of tofile and fromfile for data storage, as the binary files generated are not platform independent. In particular, no byte-order or data-type information is saved." If our reads haven't hit endian problems, I guess our users have only been reading these files on systems with the endianness matching the way these models were written.
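
As a rough sketch of what a stream-friendly, byte-order-explicit read could look like: the function below reads raw bytes from any binary file-like object and decodes them with `np.frombuffer`. The name `read_float32_matrix` is hypothetical, and little-endian float32 storage is an assumption made for this example, not something confirmed here about the model format. This is not Gensim's actual loading code.

```python
import io

import numpy as np

FLOAT_SIZE = 4  # bytes per float32 value


def read_float32_matrix(fin, rows, cols):
    """Read rows*cols little-endian float32 values from any binary stream."""
    nbytes = rows * cols * FLOAT_SIZE
    data = fin.read(nbytes)
    if len(data) != nbytes:
        raise EOFError("expected %d bytes, got %d" % (nbytes, len(data)))
    # '<f4' pins the byte order to little-endian regardless of the host CPU.
    return np.frombuffer(data, dtype="<f4").reshape(rows, cols)


# Round-trip check using an in-memory stream instead of a real model file.
original = np.arange(6, dtype="<f4").reshape(2, 3)
stream = io.BytesIO(original.tobytes())
matrix = read_float32_matrix(stream, 2, 3)
```

Because it only calls `.read()`, this approach needs no `fileno`. The trade-off is that the whole byte range is held in memory before decoding, and the array returned by `np.frombuffer` over a `bytes` object is read-only, so a `.copy()` would be needed if the matrix must be modified.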

And, while the comments in Gensim's `_fasttext_bin.py`, around the `_load_matrix` function involved in the error, imply that `np.fromfile()` is efficient enough to prefer, the comments/code also show a different code path followed when the model file is gzipped (and thus also unable to provide a raw local file descriptor).

So a workaround that might work (I've not tried it, and it might have other problems) would be to gzip the FastText file at S3, ensure it is named with a trailing `.gz` so it is recognized as gzipped, and see if the resulting different code path avoids the error without other annoyances.
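
For the gzip step, a small standard-library sketch follows. `gzip_file` is a hypothetical helper name, and the demo runs on a throwaway placeholder file rather than a real model. Uploading the result to S3 and confirming that the loader accepts it are left out and remain untested.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path):
    """Write a gzipped copy of src_path next to it, with a trailing '.gz'."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as fin, gzip.open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst_path


# Demo on a small throwaway file standing in for a real model.bin.
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "model.bin")
with open(demo_path, "wb") as f:
    f.write(b"placeholder bytes, not a real model")

gz_path = gzip_file(demo_path)
with gzip.open(gz_path, "rb") as f:
    roundtrip = f.read()
```

The resulting `model.bin.gz` would then be uploaded in place of (or alongside) the original, and the S3 path passed to the loader would need to end in `.gz`.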

Alternatively: download the full model locally, then read the local copy. 
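
A minimal sketch of the download-then-load route, under some assumptions: a smart_open version that provides `smart_open.open` with S3 support installed, and AWS credentials already configured. The helper names `copy_stream_to_local_file` and `load_facebook_model_via_local_copy` are hypothetical; only `load_facebook_model` comes from the original post.

```python
import io
import os
import shutil
import tempfile


def copy_stream_to_local_file(stream, suffix=".bin"):
    """Copy a binary stream to a new temporary file and return its path."""
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as fout:
        shutil.copyfileobj(stream, fout)
    return local_path


def load_facebook_model_via_local_copy(uri):
    """Fetch uri to a local temp file, then load it with Gensim."""
    # Imported lazily so the helper above works without these packages.
    from gensim.models.fasttext import load_facebook_model
    from smart_open import open as smart_open_file

    with smart_open_file(uri, "rb") as remote:
        local_path = copy_stream_to_local_file(remote)
    try:
        return load_facebook_model(local_path)
    finally:
        os.remove(local_path)


# Self-check of the copy helper with an in-memory stream.
demo_path = copy_stream_to_local_file(io.BytesIO(b"abc"))
with open(demo_path, "rb") as f:
    demo_bytes = f.read()
os.remove(demo_path)
```

With those assumptions in place, usage would look like `model = load_facebook_model_via_local_copy("s3://somebucket/model.bin")`. The temporary file needs as much free disk space as the model itself.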

- Gordon