gensim does not serialize corpora with 200,000 documents


svakulenko

Jan 28, 2014, 5:05:41 AM
to gen...@googlegroups.com
Hi,

I have a collection of more than 200,000 docs (small, no more than 10 lines
each). I have tried saving the corpora, it worked for 20,000 but not 10
times more. Is it a bug?

Sveta

Radim Řehůřek

Jan 28, 2014, 3:14:36 PM
to gen...@googlegroups.com, svitlana....@uni.li
Hello Sveta,

probably not, you're just doing it wrong.

HTH :) (you didn't provide much to work with)
Radim


 


Roger Leitzke

Jan 28, 2014, 4:22:14 PM
to gen...@googlegroups.com, svitlana....@uni.li
Hi Sveta,

It's probably not a bug: I usually work with collections containing more than 500,000 documents and I have never had such a problem. :-)


---
Roger



Svitlana Vakulenko

Jan 28, 2014, 7:45:54 PM
to gen...@googlegroups.com, svitlana....@uni.li
Thank you, guys, for the quick responses. I also had the impression that it should work. What could be the reason?
I tried MmCorpus and BleiCorpus serialization, but ended up with either an empty mm file or the number of docs reduced to 7,000, even though I didn't apply any filters.

Here is the script I'm using (it imports data from MongoDB and does simple filtering for English-language texts).



from pymongo import Connection

import nltk

import logging, sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

from nltk.stem.wordnet import WordNetLemmatizer
import os
from gensim import corpora, models, similarities
from nltk.corpus import stopwords

ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS
 
def is_english(text):
    # crude language check: the text counts as English if it contains more
    # English stopwords than non-English ones
    text = text.lower()
    words = set(nltk.wordpunct_tokenize(text))
    intersect = len(words & ENGLISH_STOPWORDS)
    return intersect > len(words & NON_ENGLISH_STOPWORDS) and intersect > 0


lmtzr = WordNetLemmatizer()

def tokenize(document):
    # lemmatize, drop English stopwords, keep alphabetic tokens longer than 3 characters
    return [lmtzr.lemmatize(word) for word in document.lower().split()
            if word not in ENGLISH_STOPWORDS and word.isalpha() and len(word) > 3]

class MyCorpus(corpora.TextCorpus):
    def get_texts(self):
        count = 0
        for app in self.input:
            if count > 200000:
                break
            descr = app[u'results'][0][u'description']
            if is_english(descr):
                count += 1
                yield tokenize(descr)

connection = Connection('localhost', 27017)

db = connection.test

myCorpus = MyCorpus(db.iapps.find())

myCorpus.dictionary.filter_extremes(no_below=50, no_above=0.5)
myCorpus.dictionary.compactify()

myCorpus.dictionary.save('/home/vendi/Desktop/iapps/Analysis/pub2.dict') # store the dictionary, for future reference
corpora.BleiCorpus.serialize('/home/vendi/Desktop/iapps/Analysis/pub2.mm', myCorpus)

Sveta

Radim Řehůřek

Jan 29, 2014, 5:00:41 AM
to gen...@googlegroups.com, svitlana....@uni.li
The code looks fine at a glance.

I'd look into the part where `db.iapps.find()` plugs into the corpus iterator. Maybe try printing each `app` title/id and see which ones make it into the yield statement and which don't. That should give you some idea of what's going on during the iteration.
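
For instance, a rough sketch of that kind of instrumentation (assuming each MongoDB document has an `_id` field; substitute whatever identifier your collection actually uses):

class MyCorpus(corpora.TextCorpus):
    def get_texts(self):
        count = 0
        for app in self.input:
            # log every document pulled from the cursor
            logging.info("fetched %s", app.get(u'_id'))
            descr = app[u'results'][0][u'description']
            if is_english(descr):
                count += 1
                # log only the documents that actually get yielded
                logging.info("yielded #%d: %s", count, app.get(u'_id'))
                yield tokenize(descr)

Comparing the two counts after a full pass should show whether documents are dropped by the language filter or never come out of the cursor at all.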

Best,
Radim

--
Radim Řehůřek, Ph.D.
consulting @ machine learning, natural language processing, big data

Roger Leitzke

Jan 29, 2014, 6:24:43 AM
to gen...@googlegroups.com

I would check the cursor timeout when querying MongoDB. If the result of your query is really big, the cursor returned by find() will be cancelled by the server after it sits idle for a while.

--
Roger

Roger Leitzke

Jan 29, 2014, 7:06:29 AM
to gen...@googlegroups.com

Try removing the timeout from your query:

db.iapps.find(timeout=False)
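
(Note: the keyword was later renamed in PyMongo; on PyMongo 3.x and newer the equivalent call would be db.iapps.find(no_cursor_timeout=True).)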

--
Roger

Svitlana Vakulenko

Feb 1, 2014, 12:47:54 PM
to gen...@googlegroups.com, svitlana....@uni.li
The problem is indeed in saving the file (I have already tried three different methods: MmCorpus, BleiCorpus, and UCI).
(1) For 20,000 docs I am able to save the matrix with all the docs; (2) for 100,000 docs the matrix dimension is reduced to 7,546 documents, even without filtering terms; (3) and already for 122,000 docs the matrix is saved empty.

Could it be an integer overflow? Help, please!

Here is the output from the terminal for the (2) case:

INFO:gensim.corpora.dictionary:built Dictionary(215338 unique tokens) from 100001 documents (total 8661017 corpus positions)
INFO:gensim.corpora.dictionary:keeping 8957 tokens which were in no less than 50 and no more than 50000 (=50.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(8957 unique tokens)
INFO:gensim.utils:saving Dictionary object to /home/iapps/app1.dict
INFO:gensim.corpora.ucicorpus:no word id mapping provided; initializing from corpus
INFO:gensim.corpora.ucicorpus:saving vocabulary of 8957 words to /home/iapps/app1.mm.vocab
INFO:gensim.corpora.ucicorpus:storing corpus in UCI Bag-of-Words format: /home/iapps/app1.mm
INFO:gensim.corpora.ucicorpus:PROGRESS: saving document #0
INFO:gensim.corpora.ucicorpus:saved 7546x8957 matrix, density=0.561% (378869/67589522)
INFO:gensim.corpora.indexedcorpus:saving UciCorpus index to /home/iapps/app1.mm.index

Stephanus van Schalkwyk

Feb 5, 2014, 3:25:25 PM
to gen...@googlegroups.com, svitlana....@uni.li
Sveta
Are you running Windows? Please post as much info as you have.
We have been running into downcast issues where int on x64 Windows was actually int32.
Another post has also commented on a memory size (available RAM) issue.
Regards
Steph

Svitlana Vakulenko

Feb 7, 2014, 5:51:45 AM
to gen...@googlegroups.com, svitlana....@uni.li
Thank you for the replies! I finally managed to save the matrix. I don't know what exactly helped: I moved the MongoDB find() into the class body and cut out the "if language" branching part.

Svitlana

Radim Řehůřek

Feb 7, 2014, 8:42:46 AM
to gen...@googlegroups.com, svitlana....@uni.li
Hello Sveta,

If moving find() helped like I suggested, then it's likely that your original find() returned a once-only (non-repeatable) stream of documents.

Once that stream was exhausted while building the dictionary, trying to iterate over it again (to store the documents in a file) returned nothing more, and the file came out empty.

With find() moved into the iterator, the stream gets "reset" for each iteration. But the iterations may return a different set of documents if find() returns different results on each call (the DB has changed, etc.).
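
A minimal sketch of that pattern, reusing the collection and field names from the script posted earlier (and Roger's timeout suggestion):

class MyCorpus(corpora.TextCorpus):
    def get_texts(self):
        # self.input is the collection itself, not a pre-built cursor,
        # so every pass over the corpus opens a fresh, repeatable query
        for app in self.input.find(timeout=False):
            descr = app[u'results'][0][u'description']
            yield tokenize(descr)

myCorpus = MyCorpus(db.iapps)  # pass the collection; find() is called inside get_texts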

Best,
Radim
--
Radim Řehůřek, Ph.D.
consulting @ machine learning, natural language processing, big data
 