Hi,
I preprocessed the current Wikipedia XML dump file and saved the plain text of the articles in a MySQL database. Now I want to build an LDA topic model from that plain text with gensim's online LDA. However, the processing is far too slow: after almost 23 hours the dictionary has only reached document #70,000:
2014-05-23 18:17:16,770 : INFO : Start time: 18:17:16.770867
2014-05-23 18:17:33,349 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-23 18:55:35,508 : INFO : adding document #10000 to Dictionary(545397 unique tokens: ...)
2014-05-23 20:51:16,739 : INFO : adding document #20000 to Dictionary(869058 unique tokens: ...)
2014-05-23 23:33:33,985 : INFO : adding document #30000 to Dictionary(1122503 unique tokens: ...)
2014-05-24 03:03:10,464 : INFO : adding document #40000 to Dictionary(1330521 unique tokens: ...)
2014-05-24 07:15:17,000 : INFO : adding document #50000 to Dictionary(1466995 unique tokens: ...)
2014-05-24 11:55:47,585 : INFO : adding document #60000 to Dictionary(1538627 unique tokens: ...)
2014-05-24 16:53:34,972 : INFO : adding document #70000 to Dictionary(1585108 unique tokens: ...)
My code looks like this: I connect to the database and query the total number of entries (the table has 45 million rows). Then, for each row, I select one article, preprocess its text (remove special characters, remove stop words, stem) and add the tokens to the gensim dictionary. In addition, I store the tokens in a second dictionary so that I can retrieve them quickly later, when iterating over the corpus to create the LDA model. The database is about 13 GB and I have 32 GB of RAM, so it should be possible to hold the whole thing in memory. How can I speed up the processing?
Best regards,
Michael
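
For reference, preprocess_document does roughly the following (a simplified sketch; the stopword list and the Porter stemmer shown here stand in for what I actually use):

import re
from gensim.parsing.porter import PorterStemmer

STOPWORDS = frozenset(['the', 'a', 'an', 'and', 'of', 'in', 'to'])  # placeholder list
stemmer = PorterStemmer()

def preprocess_document(text):
    # remove special characters, lowercase, drop stop words, stem
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return [stemmer.stem(token) for token in text.lower().split()
            if token not in STOPWORDS and len(token) > 1]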
from collections import OrderedDict

import gensim
import MySQLdb

# document index -> token list, kept so the corpus can be iterated
# again later without querying the database a second time
tokens_to_id = OrderedDict()


def iter_database(cursor, max_row):
    # fetch the articles one by one and yield their preprocessed tokens
    for index in range(max_row):
        query = "SELECT content FROM dump20140520 LIMIT " + str(index) + ",1"
        cursor.execute(query)
        content = cursor.fetchone()[0]  # fetchone() returns a 1-tuple
        tokens = preprocess_document(content)
        tokens_to_id[index] = tokens
        yield tokens


class WikiCorpusDatabase(gensim.corpora.WikiCorpus):
    def __init__(self, host, user, password, database):
        # WikiCorpus.__init__ is intentionally not called, since it
        # expects an XML dump file; the articles come from MySQL instead
        self.db = MySQLdb.connect(host=host, user=user, passwd=password, db=database)
        self.cursor = self.db.cursor()
        self.db.set_character_set('utf8')
        self.cursor.execute('SET NAMES utf8;')
        self.cursor.execute('SET CHARACTER SET utf8;')
        self.cursor.execute('SET character_set_connection=utf8;')

        # total number of articles in the table
        self.cursor.execute("SELECT COUNT(id) FROM dump20140520")
        self.length = int(self.cursor.fetchone()[0])

        # build the vocabulary by streaming over every article once
        self.dictionary = gensim.corpora.Dictionary(iter_database(self.cursor, self.length))
        self.dictionary.filter_extremes(no_below=2, keep_n=100000)
        self.dictionary.compactify()
        self.db.close()

    def __len__(self):
        return self.length

    def __iter__(self):
        # replay the cached token lists as bag-of-words vectors
        for key in tokens_to_id:
            yield self.dictionary.doc2bow(tokens_to_id[key])
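
Once the corpus is built, I train the online LDA roughly like this (the connection details, topic count, and chunk size are just the values I picked):

wiki = WikiCorpusDatabase('localhost', 'user', 'password', 'wikidb')
lda = gensim.models.LdaModel(corpus=wiki, id2word=wiki.dictionary,
                             num_topics=100, update_every=1, chunksize=10000)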