Streaming the data instead of loading it all at once: possible with Gensim?


Marco Ippolito

Sep 19, 2014, 6:16:53 AM
to gen...@googlegroups.com
Hi all,
Yesterday, just to load GoogleNews-vectors-negative300.bin.gz into a Word2Vec model, I had to create an 8 GB swap file, which is not good for performance.

Is it possible in gensim to process the data as a stream? I mean, instead of loading all the data into memory at once, loading it (in my case the big GoogleNews-vectors-negative300.bin.gz) part by part?
 
Looking forward to your helpful hints and feedback.
Kind regards.
Marco

Radim Řehůřek

Sep 19, 2014, 12:29:04 PM
to gen...@googlegroups.com
Hello Marco,

is this the same post as in the other thread? I'm getting confused :)

Re. "streaming" -- what you have is an already trained model. Streaming is a concept that relates to model training, not to the final product (the model).

The model is essentially a matrix in memory. This matrix can also be backed by a file on disk, using the mmap parameter when loading: `model = Word2Vec.load(save_path, mmap='r')`.


This will tell the OS to use the stored model file on disk directly as "swap", in read-only mode, so the OS doesn't need any extra swap space. But of course, operations backed by such an "mmap swap file" will still be slow if your model doesn't fit fully in your RAM.

If you don't mind slow performance, mmap is a good option for systems with little memory.
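
To illustrate what read-only memory mapping looks like in practice, here is a minimal sketch using plain NumPy rather than gensim itself. It assumes (as far as I know) that gensim's `mmap='r'` option relies on the same NumPy mechanism for its stored arrays. The matrix size, the file name `vectors.npy` and the random data are hypothetical stand-ins for a real trained model.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a trained embedding matrix (vocabulary x dimensions).
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 300)).astype(np.float32)

# Store the matrix on disk in NumPy's .npy format.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, vectors)

# mmap_mode="r" maps the file read-only instead of reading it all into RAM;
# the OS pages parts of the file in only when they are accessed.
mapped = np.load(path, mmap_mode="r")

# Ordinary array operations still work on the mapped matrix,
# for example a dot product of every row against one query vector.
query = mapped[0]
scores = mapped @ query
best = int(np.argmax(scores))
```

The mapped array behaves like a normal array for reads, but every access to a page that is not yet in RAM goes to disk, which is where the slowdown mentioned above comes from.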

HTH,
Radim

Marco Ippolito

Sep 19, 2014, 12:41:09 PM
to gen...@googlegroups.com
Hi Radim,
sorry for posting the message twice (from now on only one
post, promised), and thanks for kindly taking the time to
explain.

I've been reading and testing the online example at
http://python-blosc.blosc.org/tutorial.html
which actually produces interesting compression ratios.
Do you think it could be used in conjunction with gensim to reduce
memory usage, or are there drawbacks?

Marco

Radim Řehůřek

Sep 19, 2014, 1:48:40 PM
to gen...@googlegroups.com
That's a great question!

Blosc is a super exciting project, I talked to Valentin this summer in Berlin.

AFAIR the main Blosc use-case is serializing/deserializing arrays. It doesn't support arbitrary operations (such as dot product) on the compressed matrices, which is what word2vec does.

In any case, don't expect much compression for the particular type of matrices that come out of topic models. These are already fairly compressed on their own (unsupervised learning/clustering is compression in disguise!).
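
A rough way to see both points is the sketch below. It uses zlib from the Python standard library as a stand-in compressor, not Blosc, and random Gaussian floats as a hypothetical stand-in for a dense model matrix, so the numbers it produces are only indicative. It compares the compression ratio of dense float data with that of highly repetitive data, and shows that the compressed bytes have to be decompressed before any arithmetic can be done on them.

```python
import random
import struct
import zlib

random.seed(0)
n = 50_000

# Hypothetical stand-in for dense model weights: Gaussian noise stored as float32.
dense = struct.pack(f"{n}f", *(random.gauss(0.0, 1.0) for _ in range(n)))

# Highly repetitive data of the same size, for contrast.
repetitive = struct.pack(f"{n}f", *([0.5] * n))


def ratio(raw: bytes) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(raw, 9)) / len(raw)


dense_ratio = ratio(dense)
repetitive_ratio = ratio(repetitive)

# The compressed form is an opaque byte string: to compute anything
# (such as a dot product) the values must first be decompressed again.
restored = struct.unpack(f"{n}f", zlib.decompress(zlib.compress(dense)))
```

Comparing `dense_ratio` with `repetitive_ratio` gives a feel for how little a general-purpose compressor gains on dense floating-point values compared with redundant data.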

HTH,
Radim