Loading the data in streaming instead of loading it all at once: possible with Gensim?
Marco Ippolito
Sep 19, 2014, 6:16:53 AM
to gen...@googlegroups.com
Hi all, yesterday, just to load GoogleNews-vectors-negative300.bin.gz into a Word2Vec model, I had to create an 8 GB swap file, which is not good for performance.
Is it possible in gensim to parse the data in a streaming fashion? I mean,
instead of loading all the data into memory, loading the data (in my case the big GoogleNews-vectors-negative300.bin.gz) part by part, as a stream?
Looking forward to your helpful hints and feedback. Kind regards. Marco
Radim Řehůřek
Sep 19, 2014, 12:29:04 PM
to gen...@googlegroups.com
Hello Marco,
is this the same post as in the other thread? I'm getting confused :)
Re. "streaming" -- what you have is an already trained model. Streaming is a concept that relates to model training (feeding the training corpus in piece by piece), not to the final product, which is the model itself.
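To illustrate what "streaming" means on the training side, here is a minimal sketch of a corpus that is read from disk one line at a time instead of being held in memory. The class name `LineCorpus` and the demo file are hypothetical, and the tokenization (lowercase, split on whitespace) is only an assumption for the example.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical helper: yields one tokenized sentence per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demo file standing in for a corpus too large to fit in RAM
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Streaming applies to training\n\nThe trained model is a matrix\n")
    corpus_path = tmp.name

sentences = LineCorpus(corpus_path)
first_pass = list(sentences)   # materialized here only to inspect the demo
second_pass = list(sentences)  # a second pass re-reads the file from disk
os.remove(corpus_path)
```

An iterable of this kind is what gensim's `Word2Vec` accepts as its `sentences` argument during training. It does not help with an already trained model such as the GoogleNews vectors.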
The model is essentially a matrix in memory. This matrix can also be backed by a file on disk, using the mmap parameter to load: `model = Word2Vec.load(save_path, mmap='r')`.
This tells the OS to use the stored model file on disk directly as "swap", in read-only mode, so the OS doesn't need any extra swap space. But of course, operations backed by such an "mmap swap file" will still be slow if your model doesn't fit fully in your RAM.
If you don't mind slow performance, mmap is a good option for systems with little memory.
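The mechanism behind the mmap option can be sketched with Python's standard `mmap` module alone. This is not gensim's implementation, just an illustration of a read-only, file-backed matrix; the toy sizes, the raw float32 file layout and the helper `read_row` are assumptions made up for the example.

```python
import mmap
import os
import struct
import tempfile

ROWS, DIM = 4, 3      # toy sizes; a real word-vector matrix is far larger
ROW_BYTES = DIM * 4   # each row holds DIM float32 values

# Write a small float32 matrix to disk, row by row
fd, matrix_path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as out:
    for r in range(ROWS):
        values = [float(r * DIM + c) for c in range(DIM)]
        out.write(struct.pack(f"<{DIM}f", *values))


def read_row(mapped, row):
    """Decode one row; only the pages holding that row must be resident in RAM."""
    start = row * ROW_BYTES
    return struct.unpack(f"<{DIM}f", mapped[start:start + ROW_BYTES])


with open(matrix_path, "rb") as f:
    # ACCESS_READ maps the file read-only, similar in spirit to mmap='r'
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        row2 = read_row(mapped, 2)
        mapped_size = len(mapped)

os.remove(matrix_path)
```

The OS pages data in from the file on demand and can drop those pages again under memory pressure, which is why no extra swap space is needed. Access patterns that touch the whole matrix still end up reading the whole file.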
HTH,
Radim
Marco Ippolito
Sep 19, 2014, 12:41:09 PM
to gen...@googlegroups.com
Hi Radim,
sorry for posting the message twice (from now on only one post, promised), and thanks for being so kind as to explain.
I've been reading and testing the online example at
http://python-blosc.blosc.org/tutorial.html which actually produces interesting compression ratios.
Do you think it could be used in conjunction with gensim to reduce the
memory used, or are there contraindications?
to gen...@googlegroups.com
That's a great question!
Blosc is a super exciting project, I talked to Valentin this summer in Berlin.
AFAIR the main Blosc use-case is serializing/deserializing arrays. It doesn't support arbitrary operations (such as dot product) on the compressed matrices, which is what word2vec does.
In any case, don't expect much compression for the particular type of matrices that come out of topic models. These are already fairly compressed on their own (unsupervised learning/clustering is compression in disguise!).
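A rough way to get a feel for both points is to compress a buffer of dense float32 values and compare it with a highly repetitive one. The sketch below uses the standard library's `zlib` as a stand-in for Blosc (a different compressor, so the exact numbers will not match), and random Gaussian values as a stand-in for learned vectors; both are assumptions, not measurements on a real model.

```python
import random
import struct
import zlib

random.seed(0)
N = 20000  # number of float32 values in each toy buffer

# Stand-in for a dense learned matrix: random Gaussian float32 values
dense = struct.pack(f"<{N}f", *[random.gauss(0.0, 1.0) for _ in range(N)])
# Contrast case: the same value repeated, which compresses very well
repetitive = struct.pack(f"<{N}f", *([0.5] * N))


def ratio(raw):
    """Compressed size divided by original size (smaller means better compression)."""
    return len(zlib.compress(raw, 9)) / len(raw)


dense_ratio = ratio(dense)
repetitive_ratio = ratio(repetitive)

# The compressed bytes cannot be used in a dot product directly:
# they have to be decompressed back to the full-size buffer first.
restored = zlib.decompress(zlib.compress(dense))
```

Comparing `dense_ratio` with `repetitive_ratio` shows how little a general-purpose compressor gains on dense floating-point data, and `restored` is as large as the original, so memory use during computation is not reduced.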