loading data from gensim API


Joao

Nov 13, 2018, 5:19:20 AM
to Gensim
Hello all,

I've started using Gensim's API and the following is taking an inordinate amount of time to load: `api.load("word2vec-google-news-300")`.
I was wondering whether this dataset can be saved locally after loading. If so, how do you load it from your local computer?

Best,
Joao

Mueller, Mark-Christoph

Nov 13, 2018, 5:36:41 AM
to gen...@googlegroups.com

Hi Joao,


I had the same problem, not just with the Google embeddings but with any set of word embeddings. We came up with a Python-based solution that uses lazy loading of individual word vectors, which speeds things up considerably, apart from some other advantages:


https://github.com/nlpAThits/WOMBAT


You need to import your resource first, which might still take some time, but things are much faster from then on.

You also need to have the resource in plain text format. There are scripts out there that do that conversion for you (I can also provide one).

Converting the resource to plain text prior to import also allows you to do some filtering of the vocabulary: the GoogleNews embeddings contain a *huge* number of phrase entries that are not really meaningful and will never be used anyway, unless you have a tokenizer that is aware of these phrases.
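For example, one rough way to do both at once (convert to plain text and drop the phrase entries) using gensim itself; the filenames here are just placeholders, and it relies on GoogleNews joining phrases with "_":

    from gensim.models import KeyedVectors

    # read the binary GoogleNews file
    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)

    # keep only single-token entries (GoogleNews joins phrases with "_")
    words = [w for w in kv.index2word if "_" not in w]

    # write them back out in the plain-text word2vec format:
    # a "count dim" header line, then one "word v1 v2 ..." line per entry
    with open("googlenews-words-only.txt", "w", encoding="utf-8") as out:
        out.write("%d %d\n" % (len(words), kv.vector_size))
        for w in words:
            out.write(w + " " + " ".join("%g" % x for x in kv[w]) + "\n")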


Best,

Christoph


Mark-Christoph Müller

Research Associate

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg
Germany

_________________________________________________
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger


Gordon Mohr

Nov 13, 2018, 1:21:06 PM
to Gensim
The `api.load()` function always downloads the data to a default location of its own choosing. The `downloader` module's `load()` function has an optional `return_path` argument that, if True, simply downloads the dataset and returns the path to where it was saved. See the docs for details.
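For example, a minimal sketch of that usage with the standard `gensim.downloader` module:

    import gensim.downloader as api

    # download the dataset (or reuse an already-downloaded copy) and
    # return the local path it was saved to, instead of loading it
    path = api.load("word2vec-google-news-300", return_path=True)
    print(path)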


It's also common for people to look up somewhere that the dataset is mirrored and download it using a web browser. Then it can be loaded with the `KeyedVectors.load_word2vec_format(filepath)` method.
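A rough sketch of that, assuming you've downloaded the usual GoogleNews binary file (the exact filename depends on the mirror you used):

    from gensim.models import KeyedVectors

    # binary=True because the GoogleNews vectors are distributed in the
    # binary word2vec format
    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)
    print(kv.most_similar("apple", topn=5))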

Note that it's many gigabytes in size, both on disk and when loaded into RAM, and once you start doing `most_similar()` lookups it nearly doubles in memory again, because all vectors get unit-normalized. So on low-memory machines (say 4GB-8GB), using that full set commonly triggers memory swapping, which makes operations very, very slow (since every `most_similar()` scans the whole dataset).

If you *only* need `most_similar()` operations, and are OK working with only the unit-normalized vectors, you can call `model.init_sims(replace=True)` after it's loaded. That will discard the original raw vectors, keeping only the unit-normalized vectors, saving about half the memory. 
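For instance, continuing the sketch above (gensim 3.x API):

    # replace the raw vectors with their unit-normalized versions in place,
    # keeping only what most_similar() needs and roughly halving memory use
    kv.init_sims(replace=True)
    print(kv.most_similar("computer", topn=3))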

The set contains millions of words, but most of the value is in the most-frequent words, and those are listed first, so you can also use the optional `limit` parameter on `load_word2vec_format()`. For example, loading with `limit=500000` loads only the first (most-frequently-seen) 500K words, saving about 5/6ths of the memory.
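For example (again a sketch; the filename is whatever your local copy is called):

    # load only the first 500K (most frequent) entries to save memory
    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True, limit=500000)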

- Gordon 