Creating a corpus of clean Wikipedia text is very slow


Michael Haus

May 28, 2014, 2:19:08 AM
to gen...@googlegroups.com
Hi,

I preprocessed the current Wikipedia XML dump file and saved the plain text of the articles in a MySQL database. Now I want to produce an LDA topic model from the Wikipedia plain text with gensim's online LDA. But it is far too slow; the runtime looks like this:

2014-05-23 18:17:16,770 : INFO : Start time: 18:17:16.770867
2014-05-23 18:17:33,349 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-23 18:55:35,508 : INFO : adding document #10000 to Dictionary(545397 unique tokens: ...)
2014-05-23 20:51:16,739 : INFO : adding document #20000 to Dictionary(869058 unique tokens: ...)
2014-05-23 23:33:33,985 : INFO : adding document #30000 to Dictionary(1122503 unique tokens: ...)
2014-05-24 03:03:10,464 : INFO : adding document #40000 to Dictionary(1330521 unique tokens: ...)
2014-05-24 07:15:17,000 : INFO : adding document #50000 to Dictionary(1466995 unique tokens: ...)
2014-05-24 11:55:47,585 : INFO : adding document #60000 to Dictionary(1538627 unique tokens: ...)
2014-05-24 16:53:34,972 : INFO : adding document #70000 to Dictionary(1585108 unique tokens: ...)

My code is below. I connect to the database and query the total number of entries; the table has about 4.5 million rows. Then I select one article at a time, preprocess the text (remove special characters, remove stopwords, stemming) and add it to the gensim dictionary. In addition, I store the tokens in a second dictionary so I can retrieve them quickly later, when iterating over the corpus to build the LDA model. The database is about 13 GB and I have 32 GB RAM, so it should be possible to hold it in memory. How can I speed up the processing?

Best regards,
Michael

import MySQLdb
import gensim
from collections import OrderedDict

tokens_to_id = OrderedDict()

def iter_database(database, max_row):
    # fetch one article at a time via LIMIT <offset>,1 and yield its tokens
    index = 0
    for i in range(0, max_row):
        query = "SELECT content FROM dump20140520 LIMIT " + str(i) + ",1"
        database.execute(query)
        content = database.fetchone()
        tokens = preprocess_document(content)
        tokens_to_id[index] = tokens  # cache the tokens for the later corpus pass
        index += 1
        yield tokens

class WikiCorpusDatabase(gensim.corpora.WikiCorpus):

    def __init__(self, host, user, password, database):
        self.db = MySQLdb.connect(host=host, user=user, passwd=password, db=database)
        self.cursor = self.db.cursor()
        self.db.set_character_set('utf8')
        self.cursor.execute('SET NAMES utf8;')
        self.cursor.execute('SET CHARACTER SET utf8;')
        self.cursor.execute('SET character_set_connection=utf8;')

        query = "SELECT COUNT(id) FROM dump20140520"
        self.cursor.execute(query)
        max_row = int(self.cursor.fetchone()[0])
        self.length = max_row

        # build the dictionary by streaming every article out of the database
        self.dictionary = gensim.corpora.Dictionary(iter_database(self.cursor, max_row))
        self.dictionary.filter_extremes(no_below=2, keep_n=100000)
        self.dictionary.compactify()
        self.db.close()

    def __len__(self):
        return self.length

    def __iter__(self):
        for key in tokens_to_id:
            yield self.dictionary.doc2bow(tokens_to_id[key])
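
(preprocess_document is not shown above. For reference, a minimal sketch of what such a helper might look like, assuming NLTK stopwords and a Porter stemmer on top of gensim's simple_preprocess; the actual implementation in this post may differ:)

import re

import gensim
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_document(content):
    # cursor.fetchone() returns a 1-tuple, so unwrap it first
    text = content[0] if isinstance(content, tuple) else content
    text = re.sub(r'[^a-z0-9 ]', ' ', text.lower())   # remove special characters
    tokens = gensim.utils.simple_preprocess(text)     # tokenize, drop very short tokens
    return [STEMMER.stem(token) for token in tokens if token not in STOPWORDS]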


Radim Řehůřek

May 28, 2014, 3:30:48 AM
to gen...@googlegroups.com
Hello Michael,

a couple of things:

1. you're caching all documents in RAM, in the `tokens_to_id` dict (should be `id_to_tokens` I guess?). Not very scalable.

2. you're running one full SQL query per document. IIRC MySQL doesn't handle OFFSET queries efficiently, hence the gradual slowdown. Play with MySQL indexes to make OFFSET more efficient, or maybe try another cursor type (I think there's a streamed one in MySQLdb).

3. your `index` and `i` variables seem identical.

4. `filter_extremes()` already calls `compactify()` internally.

In other words, I suspect the slowdown already comes from the way you access your data in 1. & 2.
Try simply iterating over the `iter_database()`, without doing anything at all with the result (no gensim). Let that be your performance baseline :)
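
For illustration, a minimal baseline along those lines might look like this. It is only a sketch: it assumes MySQLdb's streaming SSCursor and placeholder connection parameters, and it streams all rows in a single query instead of issuing one LIMIT/OFFSET query per document:

import time

import MySQLdb
import MySQLdb.cursors

# placeholder connection parameters
db = MySQLdb.connect(host='localhost', user='user', passwd='password', db='wiki',
                     cursorclass=MySQLdb.cursors.SSCursor)  # server-side streaming cursor
cursor = db.cursor()
cursor.execute("SELECT content FROM dump20140520")  # one query for everything

start = time.time()
for docno, (content,) in enumerate(cursor):
    # no preprocessing, no gensim -- raw iteration speed only
    if docno % 10000 == 0:
        print("document #%i, %.1f s elapsed" % (docno, time.time() - start))

cursor.close()
db.close()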

HTH,
Radim

--
Radim Řehůřek, Ph.D.
consultant @ machine learning, natural language processing, data mining
skype "radimrehurek"

Michael Haus

May 31, 2014, 2:51:16 AM
to gen...@googlegroups.com
Hi Radim,

thanks for your comments. I changed a few things: I now write the token-id mapping to disk and use a special streaming cursor (SSCursor) for MySQL. In addition, I let it run in parallel in 7 processes. But after 7 hours and 15 minutes, while adding document #2550000, the program goes idle. There is no error message or anything else; it simply does nothing. The database has 4,558,048 entries. MySQL still shows the connection to the database, but it is in sleep mode.

I changed some MySQL parameters: connect_timeout=86400, wait_timeout=86400, interactive_timeout=86400. However, this did not fix the problem: as before, the program goes idle and the database connection is in sleep mode after document #2550000.

What could be the cause? (A deadlock, or the database connection?)

Thanks in advance.

Here is the code:

import marshal
import multiprocessing

import MySQLdb
import MySQLdb.cursors
import gensim

class WikiCorpus(gensim.corpora.textcorpus.TextCorpus):

    def __init__(self, file_name, host, user, password, database):
        self.file_name = file_name
        self.host = host
        self.user = user
        self.password = password
        self.database = database
        self.processes = max(1, multiprocessing.cpu_count() - 1)

        # server-side streaming cursor, so results are not buffered on the client
        self.db = MySQLdb.connect(host=host,
                                  user=user,
                                  passwd=password,
                                  db=database,
                                  cursorclass=MySQLdb.cursors.SSCursor)

        self.cursor = self.db.cursor()
        self.db.set_character_set('utf8')
        self.cursor.execute('SET NAMES utf8;')
        self.cursor.execute('SET CHARACTER SET utf8;')
        self.cursor.execute('SET character_set_connection=utf8;')

        self.length = 4558048
        self.dictionary = gensim.corpora.Dictionary(self.get_texts())

    def get_texts(self):
        out = open(self.file_name, 'wb')

        pool = multiprocessing.Pool(self.processes)
        query = "SELECT id, content FROM dump20140520 WHERE LENGTH(content) > 500 AND content NOT LIKE 'list of%'"
        self.cursor.execute(query)
        texts = ((row[0], row[1]) for row in self.cursor)

        # preprocess the articles in parallel, dump the tokens to disk and yield them
        for group in gensim.utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for index, tokens in pool.imap(preprocess_document, group):
                if len(tokens) > ARTICLE_MIN_WORDS:
                    marshal.dump([index] + tokens, out)
                    yield tokens

        pool.terminate()
        out.close()
        self.cursor.close()

Radim Řehůřek

May 31, 2014, 5:25:29 AM
to gen...@googlegroups.com
I don't know. Multiprocessing does forking, which can potentially mess things up (shared resources, cursors...), so that's one strong lead.

Like I said, it's probably best to get your data iteration right first, in isolation, without any other logic. Just read the docs from DB, a single process, no preprocessing, no classes, just a simple loop. If that works, start adding more functionality...

Best,
Radim

Valerio Maggio

Jun 1, 2014, 3:15:06 AM
to gen...@googlegroups.com
Hi Michael, 
     please find below my two cents on this.

On Sat, May 31, 2014 at 11:25 AM, Radim Řehůřek wrote:
I don't know. Multiprocessing does forking, which can potentially mess things up (shared resources, cursors...), so that's one strong lead.

I agree with Radim. This could be an issue.

First of all, what version of Python are you using?

Did you notice that Python 3.4 introduces some [improvements](https://docs.python.org/3.4/whatsnew/3.4.html#whatsnew-multiprocessing-no-fork) to the multiprocessing module that make it possible to [avoid using fork](https://docs.python.org/3.4/library/multiprocessing.html#multiprocessing.get_context) on Unix?

Btw, we already [tested](https://groups.google.com/d/msg/gensim/_ozOcJoeiMs/fZrxmyKk5DYJ) `gensim` on Python 3.4 and it worked like a charm !-)

That said....
 

Like I said, it's probably best to get your data iteration right first, in isolation, without any other logic. Just read the docs from DB, a single process, no preprocessing, no classes, just a simple loop. If that works, start adding more functionality...

I totally agree with Radim: you have many variables to evaluate in your "performance experiment" and the only thing you can do to find the bottleneck is to add one variable at a time, incrementally.

Looking at the description of the issue you're experiencing, **it looks like classic memory swapping/thrashing to me**.

IMHO this is likely caused by the fact that multiprocessing forks processes, so the data structures in memory get copied and replicated into every child process... and you're building a dictionary in memory (i.e., `self.dictionary`) whose size keeps growing with every iteration.

In this scenario, it is possible that the big data structures in memory keep constantly swapping places with each other, so no actual computation happens.

This is definitely one of the things to take into account when optimizing your code (see point 3 below).

Then, some possible improvement hints:

(1) Cursor connection:

Looking at your code, namely 


[...]
self.cursor.execute(query)
texts = ((row[0], row[1]) for row in self.cursor)
[...]

it seems that after creating the `texts` iterable, you no longer need the cursor object or the connection.
Thus, why not move the `self.cursor.close()` and `self.db.close()` calls to right after the `texts` assignment?
This would keep the child processes from also sharing the DB connection and cursor.

What do you think?

(2) The `multiprocessing.Pool.map` (and `imap`) functions take a `chunksize` parameter: https://docs.python.org/3.4/library/functions.html#map

*Maybe* you could avoid one *for* loop by dropping the `gensim.utils.chunkize` call and passing the `chunksize` parameter directly to `imap`.
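
The inner part of `get_texts` might then look roughly like this (a sketch only, keeping the cursor, pool, `preprocess_document` and `ARTICLE_MIN_WORDS` from the code above; note that, unlike `chunkize(..., maxsize=1)`, `imap` may read ahead through the input without back-pressure):

texts = ((row[0], row[1]) for row in self.cursor)

# let imap batch the rows itself instead of pre-grouping them with chunkize
for index, tokens in pool.imap(preprocess_document, texts,
                               chunksize=10 * self.processes):
    if len(tokens) > ARTICLE_MIN_WORDS:
        marshal.dump([index] + tokens, out)
        yield tokens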

(3) I would separate the dictionary construction from the dump process.
I mean, to avoid shared-memory issues, I would first create the dump file (i.e., run `self.get_texts`). Afterwards, I would instantiate the `gensim.corpora.Dictionary` from the dump file in one shot.
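
A sketch of that second phase, assuming the `[index] + tokens` marshal records written by `get_texts` above and a hypothetical dump file name:

import marshal

import gensim

def iter_dump(file_name):
    # stream the token lists back out of the marshal dump written by get_texts()
    with open(file_name, 'rb') as f:
        while True:
            try:
                record = marshal.load(f)  # [index, token1, token2, ...]
            except EOFError:
                return
            yield record[1:]              # drop the document index, keep the tokens

# build the dictionary in one shot, in a single process, after the dump is complete
dictionary = gensim.corpora.Dictionary(iter_dump('wiki_tokens.dump'))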

So far, these are all the optimizations I can think about.

I'll keep on posting if I'll have something new to propose :-)

HTH.

Best,
Valerio

Michael Haus

Jun 2, 2014, 2:16:23 AM
to gen...@googlegroups.com
Hi Valerio,

regarding (1): it is a streaming cursor, so the statement "for row in self.cursor" goes back to the database each time to fetch the data. Therefore, the connection cannot be closed early.

Thanks for your recommendations. In the end, though, it stopped every time at the same document count, #2550000. I have now removed the multiprocessing and use a single loop that iterates over the database cursor, saving the tokens to disk in batches of 10,000 and repeating until all 4.5 million entries are processed. Afterwards, I create the dictionary from the dump file via the "add_documents" method. The preprocessing has reached 1.5 million entries so far, so I will see today whether it stops again at document #2550000.

Thanks,
Michael

Valerio Maggio

Jun 2, 2014, 2:37:41 AM
to gen...@googlegroups.com
On Mon, Jun 2, 2014 at 8:16 AM, Michael Haus wrote:
Hi Valerio,

regarding (1): it is a streaming cursor, so the statement "for row in self.cursor" goes back to the database each time to fetch the data. Therefore, the connection cannot be closed early.

Ok, I think I see what you mean: as far as I understand,  `texts` is a generator, thus you lazily fetch (and generate) data from the database at each iteration, right?

 

Thanks for your recommendations. In the end, though, it stopped every time at the same document count, #2550000. I have now removed the multiprocessing and use a single loop that iterates over the database cursor, saving the tokens to disk in batches of 10,000 and repeating until all 4.5 million entries are processed. Afterwards, I create the dictionary from the dump file via the "add_documents" method. The preprocessing has reached 1.5 million entries so far, so I will see today whether it stops again at document #2550000.

Ok, let's see what happens.

Looking forward to hearing from you.

Best,
Valerio

Michael Haus

Jun 6, 2014, 2:11:42 AM
to gen...@googlegroups.com
Hi Valerio,

yes, the texts variable is a generator that executes the query and fetches the data from the database.

The database cursor used to retrieve the data also stops, at document #2950000, even without any multiprocessing, with only a single loop iterating over the result set from the database. I think there is some limit.

In any case, I now use the Wikipedia XML dump to create the LDA model.

I used this script for the gensim part: https://github.com/piskvorky/sim-shootout/blob/master/prepare_shootout.py and this one for a clean text representation: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor.

Best regards,
Michael

Radim Řehůřek

Jun 6, 2014, 4:33:23 AM
to gen...@googlegroups.com
Hello Michael,

so you gave up on mysql?

I'm still curious why it would consistently stop at ~3 million documents. Does anyone have an idea? Any mysql experts here?

Maybe worth asking at a mysqldb forum -- this doesn't seem related to gensim as such.

Best,
Radim

Valerio Maggio

Jun 6, 2014, 4:34:24 AM
to gen...@googlegroups.com
On Fri, Jun 6, 2014 at 8:11 AM, Michael Haus <michae...@tum.de> wrote:
Hi Valerio,

yes, the texts variable is a generator that executes the query and fetches the data from the database.

The database cursor used to retrieve the data also stops, at document #2950000, even without any multiprocessing, with only a single loop iterating over the result set from the database. I think there is some limit.

Great, this sheds some light on the real issue... as Radim already supposed, this seems to be an issue not related to gensim... maybe a DB cursor fetching limit (maybe tunable?)
Dunno..
 

In any case, I now use the Wikipedia XML dump to create the LDA model.

I used this script for the gensim part: https://github.com/piskvorky/sim-shootout/blob/master/prepare_shootout.py and this one for a clean text representation: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor.

Great!
Please keep us posted on your results, especially considering the multiprocessing part.. I'm quite curious to know how it works/performs on relatively big data :)


Best,
Valerio

Valerio Maggio

Jun 6, 2014, 4:38:35 AM
to gen...@googlegroups.com
On Fri, Jun 6, 2014 at 10:33 AM, Radim Řehůřek wrote:
Hello Michael,

so you gave up on mysql?

It seems so... :)

Btw, did you ever consider switching to PostgreSQL? I don't really want to start a flame war on this topic, but it is by far more efficient and more robust than MySQL:

Just the very first result from google: http://www.teknico.net/devel/myvspg/index.en.html
Another one: http://www.teknico.net/devel/myvspg/index.en.html (a bit outdated, but still valuable)
 
m2c, 
V.

Michael Haus

Jun 6, 2014, 8:25:47 AM
to gen...@googlegroups.com
Hi,

yes, I gave up on MySQL for this. I mean, I already have the clean Wikipedia data in the MySQL database, along with some other properties, in order to build a location-based LDA topic model.

With gensim and the script mentioned above, the whole computation with seven parallel processes now takes 12 hours to compute an LDA model with 100 topics over a tf-idf corpus of 100,000 features.
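
(For reference, a rough gensim sketch of such a pipeline is below. It is not the prepare_shootout.py script itself: tokenized_articles() stands for any generator over the cleaned, tokenized articles; only the 100,000 features, 100 topics and 7 workers come from this post, and the use of LdaMulticore, the file names and the remaining parameter values are assumptions.)

import gensim

# tokenized_articles() is a hypothetical generator yielding one token list
# per cleaned Wikipedia article; file names are placeholders
dictionary = gensim.corpora.Dictionary(tokenized_articles())
dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000)  # keep 100,000 features

bow_stream = (dictionary.doc2bow(tokens) for tokens in tokenized_articles())
gensim.corpora.MmCorpus.serialize('wiki_bow.mm', bow_stream)

bow_corpus = gensim.corpora.MmCorpus('wiki_bow.mm')
tfidf = gensim.models.TfidfModel(bow_corpus)
gensim.corpora.MmCorpus.serialize('wiki_tfidf.mm', tfidf[bow_corpus])

tfidf_corpus = gensim.corpora.MmCorpus('wiki_tfidf.mm')
lda = gensim.models.LdaMulticore(tfidf_corpus, id2word=dictionary,
                                 num_topics=100, workers=7,
                                 chunksize=10000, passes=1)
lda.save('wiki_lda_100topics.model')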

Best regards,
Michael

Valerio Maggio

Jun 6, 2014, 1:52:10 PM
to gen...@googlegroups.com



On Fri, Jun 6, 2014 at 2:25 PM, Michael Haus wrote:
Hi,

Hi Michael, thanks for your posting.
 
With gensim and the script mentioned above, the whole computation with seven parallel processes now takes 12 hours to compute an LDA model with 100 topics over a tf-idf corpus of 100,000 features.

I see. 
As far as I can remember, the computation previously took 4 or 5 hours less. However, this time you're finally crunching the whole dataset without any weird limitations. Is that correct?

Best,
Valerio

Michael Haus

Jun 12, 2014, 6:13:28 AM
to gen...@googlegroups.com
Hi,

no, the computation time was never better than 12 hours. Now the whole Wikipedia XML dump file gets processed every time and the LDA model is generated. No limitations.

Best regards,
Michael