Creating a Corpus, Dictionary and running LDA etc.


Ahmet

Dec 3, 2012, 4:21:28 PM
to gen...@googlegroups.com
Hello,

I am very new to gensim, in fact I started using it only yesterday. First of all, I must congratulate you for this great library. Now here are my questions.

Below is my code:

import os
import gensim
from nltk.stem.porter import PorterStemmer  # assuming NLTK's Porter stemmer

# root_dir and out_dir are assumed to be defined elsewhere
stemmer = PorterStemmer()

def split_line(text):
    words = text.split()
    out = []
    for word in words:
        word = stemmer.stem(word.decode("ascii", "ignore"))
        out.append(word)
    return out

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        count = 1
        for filename in self.input:
            print count
            count = count + 1
            yield split_line(open(root_dir + filename).read())

if __name__ == '__main__':
    data_files = [x[2] for x in os.walk(root_dir)]
    myCorpus = MyCorpus(data_files[0])

    gensim.corpora.MmCorpus.serialize(out_dir + 'flickrCorpusNew.mm', myCorpus)
    myCorpus.dictionary.filter_extremes()

    print myCorpus.dictionary

    myCorpus.dictionary.save(out_dir + "flickrDictNew.dict")
    myCorpus.dictionary.save_as_text(out_dir + "flickrDictNewText.txt")

    lda = gensim.models.ldamodel.LdaModel(corpus=myCorpus, id2word=myCorpus.dictionary,
                                          num_topics=50, update_every=1,
                                          chunksize=10000, passes=1)

    lda.show_topics(20)

    lda.save(out_dir + "lda_model")


My scenario is the following. I have 4000+ Flickr groups as my documents. Each group, which is in a file, contains tags that belong to the photos that the particular group has. I am trying to cluster groups into 50 topics.

1- Since there is a lot of noise in the tags, I am using Porter stemming to collapse similar tags into the same token. I will also try to remove tags that contain numbers, etc. Do you recommend any other pre-processing?

2- I use the variable count in get_texts() to see how close the code is to finishing, and I have realized that get_texts() is called 4-5 separate times. This makes the code take an immense amount of time to run for 4000+ documents. I tried to load the saved corpus and dictionary and run LDA from there, but then I get an index error: the indices don't match. Is there an easier way to save and load these correctly, so I don't have to recreate the corpus and the dictionary every time?

3- The LDA model prints out nothing when I run this. What might be the problem? Am I calling it right?

thanks

Karsten

Dec 4, 2012, 2:47:52 PM
to gen...@googlegroups.com
Hi,

I am not quite sure, but I think TextCorpus has a save_corpus function which you should use instead.

For stemming I found utils.lemmatize (which uses the pattern library) useful, but make sure you use the fast algorithm.

Karsten

Dec 5, 2012, 4:39:31 AM
to gen...@googlegroups.com
PS: I just saw that you don't convert the corpus to a bag-of-words representation and TF-IDF space. Unless that's done in MyCorpus, you have to do that first.
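For readers unfamiliar with these transformations, here is a pure-Python sketch of what the bag-of-words and tf-idf steps do conceptually. This is only an illustration of the idea, not gensim's actual implementation (gensim's TfidfModel applies its own normalization):

```python
import math
from collections import Counter

# toy corpus of tokenized documents
docs = [["cat", "dog", "cat"], ["dog", "fish"], ["cat", "fish", "fish"]]

# build a token -> id map (roughly what gensim's Dictionary does)
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

# bag-of-words: sparse (token_id, count) pairs per document (roughly doc2bow)
bow = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

# document frequency per token id, then one common idf weighting variant
df = Counter(i for doc in bow for i, _ in doc)
n_docs = len(docs)
tfidf = [[(i, c * math.log(n_docs / df[i], 2)) for i, c in doc] for doc in bow]

print(bow[0])   # → [(0, 2), (1, 1)]  i.e. 2x "cat", 1x "dog"
```

With gensim you would get the same shape of output from `dictionary.doc2bow(doc)` and `models.TfidfModel(corpus)`.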

Ahmet

Dec 5, 2012, 5:55:13 AM
to gen...@googlegroups.com
Hi, thank you for your answer. Is this what I am supposed to do?

tfidf = gensim.models.TfidfModel(myCorpus)
corpus_tfidf = tfidf[myCorpus]

lda = gensim.models.ldamodel.LdaModel(corpus=corpus_tfidf, id2word=myCorpus.dictionary,
                                      num_topics=20, update_every=1,
                                      chunksize=10000, passes=1)

Radim Řehůřek

Dec 5, 2012, 4:19:53 PM
to gensim
Hello Ahmet,
it depends on the texts -- you're splitting at whitespace, which may not be appropriate for some languages (Chinese, Japanese, Arabic?). Also, you seem to ignore non-ASCII characters, which is not a good idea for languages with accents (Slavic etc.).
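A tokenizer along those lines -- folding accented characters to their base letters instead of dropping non-ASCII, and skipping digit-containing tags -- might look like this sketch. The function name and cleanup rules are only illustrative, not part of gensim:

```python
import unicodedata

def tokenize(text):
    """Lowercase, fold accents to base letters, drop tokens with digits."""
    tokens = []
    for raw in text.lower().split():
        # decompose accented characters, then strip the combining marks
        folded = unicodedata.normalize("NFKD", raw)
        folded = "".join(c for c in folded if not unicodedata.combining(c))
        # skip noisy tags containing digits, e.g. "img1234"
        if any(ch.isdigit() for ch in folded):
            continue
        if folded:
            tokens.append(folded)
    return tokens

print(tokenize("Čeština photo1234 café"))   # → ['cestina', 'cafe']
```

This keeps "café" as "cafe" rather than mangling it to "caf" the way `decode("ascii", "ignore")` would; whether accent folding is acceptable depends on your language and data.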


> 2- I have the variable count, in get_texts to see how close the code is to
> finish. And I have realized that get_texts() is called 4-5 different times.
> This makes this code take immense amount of time to run for 4000+
> documents.

4000 documents shouldn't take an "immense amount of time". It's hard
to tell from the copy&paste, but it looks like the counter gets incremented
by 1 for each file? So you should see counter=4000+?

> I tried to load the saved corpus and dictionary and run LDA from
> there, but then I get index error, indices dont match. Is there an easier
> way to save and load this the correct way, so I don't have to recreate the
> corpus and the dictionary everytime?

Sure, just use `Dictionary.save/load` and `MmCorpus.serialize`, pretty
much how you're doing it. Except you modify the dictionary with
`filter_extremes` after saving the corpus, which is probably the cause
of this id mismatch.
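That id mismatch can be illustrated with a toy token map (a pure-Python sketch of the effect, not gensim code):

```python
# token -> id map as it was when the corpus was serialized
token2id = {"cat": 0, "rarejunk": 1, "dog": 2}

# a serialized document: (token_id, count) pairs written with the map above
serialized_doc = [(0, 3), (2, 1)]          # 3x "cat", 1x "dog"

# filter_extremes drops rare tokens AND compacts the remaining ids
filtered = {tok: new_id for new_id, tok in
            enumerate(t for t in sorted(token2id, key=token2id.get)
                      if t != "rarejunk")}
# filtered == {"cat": 0, "dog": 1}  -- "dog" moved from id 2 to id 1

id2token = {v: k for k, v in filtered.items()}
# decoding the old pairs with the new map: id 2 no longer exists
print([id2token.get(i) for i, _ in serialized_doc])   # → ['cat', None]
```

The fix is simply to call `filter_extremes` on the dictionary *before* serializing the corpus, so the saved ids and the saved dictionary agree.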


> 3- The lda model I got prints out nothing when I run this. What might be
> the problem, am I calling it right?

Error on input, error on preprocessing, error elsewhere... gensim has
pretty advanced logging, so if you add `import logging;
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.DEBUG)`, you'll see a lot more info. If you can't figure
out what the problem is, post the (link to) your log so we may have a look :)

Best,
Radim

Karsten

Dec 6, 2012, 7:37:30 AM
to gen...@googlegroups.com
Hi,

here is code I use:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Some implementation details are taken from wikicorpus.py

from gensim import utils, corpora, models
import logging
import multiprocessing
import os
import sys

# Wiki is first scanned for all distinct word types (~7M). The types that appear
# in more than 10% of articles (supposedly stop words) are removed and 
# from the rest, the DEFAULT_DICT_SIZE most frequent types are kept. 
DEFAULT_DICT_SIZE = 50000

# Words that appear fewer than NO_BELOW times are dropped
NO_BELOW = 20

# Number of topics to create for the LDA model
NUM_TOPICS = 500

def process_file_path(file_path):
    with open(file_path, "r") as file:
        # the last character is a trailing \n
        article_name = file.readline()[:-1]

        # the remaining lines are the document
        doc = " ".join(file.readlines())
        
        lemmatized_doc = utils.lemmatize(doc)
        
        return (article_name, lemmatized_doc)

class CleanCorpus(corpora.TextCorpus):
    '''
    Loads all documents in a directory from the file system. Each file in the
    directory is regarded as one document and should be a text file.

    The first line is the article name.

    Lemmatizes all words, removes stop words and tokenizes each document.
    '''

    def __init__(self, fname, no_below=NO_BELOW, keep_words=DEFAULT_DICT_SIZE,
                 dictionary=None):
        '''
        See gensim.corpora.textcorpus for details.

        :param fname: The path to scan for documents.
        '''
        
        self.fname = fname
        self.article_names = []
        if keep_words is None:
            keep_words = DEFAULT_DICT_SIZE
        if no_below is None:
            no_below = NO_BELOW
              
        self.file_paths = [os.path.join(self.fname, name) for name in os.listdir(self.fname) 
                            if os.path.isfile(os.path.join(self.fname, name))]
        
        self.processes = 2
        
        #each file is considered an article
        self.total_articles = len(self.file_paths)
            
        if dictionary is None:
            self.dictionary = corpora.Dictionary(self.get_texts())
            self.dictionary.filter_extremes(no_below=no_below, no_above=0.1, 
                                            keep_n=keep_words)
        else:
            self.dictionary = dictionary
            
    def get_texts(self):
        '''
        Files are processed in parallel.

        See wikicorpus.py by Radim Rehurek
        '''
        logger = logging.getLogger("feature_extractor")
        
        logger.info("Scanning %d files." % self.total_articles)
        
        articles_processed = 0

        pool = multiprocessing.Pool(self.processes)
        
        for group in utils.chunkize_serial(self.file_paths,
                                           chunksize=10*self.processes):
            for article_name, tokens in pool.imap(process_file_path, group):
                articles_processed += 1
                try:
                    name = article_name.strip("\n").decode("UTF-8")
                except UnicodeDecodeError as e:
                    logger.error("Could not decode %s: %s" % (article_name, e))
                    exit(1) 
                self.article_names.append(name)
                yield tokens
        
        pool.terminate()
        
        logger.info("Processed %d articles." % articles_processed)
            
if __name__ == "__main__":
    from optparse import OptionParser
        
    p = OptionParser()
    p.add_option('-p', '--path', action="store", dest='doc_path',
                     help="specify path of wiki documents")
    p.add_option('-o', '--output-prefix', action="store", dest='prefix',
                     help="specify path prefix where everything should be saved")
    (options, args) = p.parse_args()
    
    logger = logging.getLogger("feature_extractor")
    
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    
    corpus = CleanCorpus(options.doc_path)
    
    #save dictionary: word <-> token id map
    corpus.dictionary.save(options.prefix + "_wordids.dict")
    
    #del corpus
    
    '''Bag-of-Words'''
    
    #init corpus reader and word -> id map
    id2token = corpora.Dictionary.load(options.prefix + "_wordids.dict")
    new_corpus = CleanCorpus(options.doc_path, dictionary = id2token)
    
    #create and save bow-representation of corpus
    corpora.MmCorpus.serialize(options.prefix + '_bow_corpus.mm', new_corpus,
                             progress_cnt=10000)
    
    #del new_corpus
    
    #init corpus reader
    mm_bow = corpora.MmCorpus(options.prefix + '_bow_corpus.mm')
    
    '''TFIDF Model creation'''
    
    #build tfidf model
    tfidf = models.TfidfModel(mm_bow, id2word=id2token, normalize=True)
    
    #save tfidf model
    tfidf.save(options.prefix + '_tfidf.model')
    
    #save corpus as tfidf vectors in matrix market format
    corpora.MmCorpus.serialize(options.prefix + '_tfidf_corpus.mm', tfidf[mm_bow], 
                               progress_cnt=10000)

    
    #init tfidf-corpus reader
    mm_tfidf = corpora.MmCorpus(options.prefix + '_tfidf_corpus.mm')
    
    '''LDA Model creation'''
    
    #build lda model
    lda = models.LdaModel(corpus=mm_tfidf, id2word=id2token, 
                          num_topics=NUM_TOPICS, update_every=1, 
                          chunksize=10000, passes=2) 
    
    #save trained model
    lda.save(options.prefix + '_lda.model')
   
    
    logger.info("finished transforming")

Ahmet

Dec 6, 2012, 9:55:45 AM
to gen...@googlegroups.com
Thank you very much for your help. I had forgotten to check whether an existing dictionary was passed in the constructor of my corpus, and I had also forgotten to enable logging to print out results. However, there is still a problem (I think) with my results.


2012-12-06 16:45:51,493 : INFO : topic #0: 0.001*simonrichards + 0.001*pasteup + 0.001*usaf + 0.001*lowermainland + 0.001*birding + 0.001*sculptuur + 0.001*birdphotography + 0.001*lockheed + 0.001*mrfahrenheit + 0.001*mfh
2012-12-06 16:45:51,521 : INFO : topic #1: 0.001*trkiye + 0.001*minimal + 0.001*singapore + 0.001*oop + 0.001*toronto + 0.001*coth + 0.001*minimalism + 0.001*sacramento + 0.001*classiccars + 0.001*istanbul
2012-12-06 16:45:51,554 : INFO : topic #2: 0.001*wdw + 0.001*waltdisneyworld + 0.001*kitsch + 0.001*disneyworld + 0.001*weybridge + 0.001*disney + 0.001*magickingdom + 0.001*queenstown + 0.001*bergen + 0.001*postcrossing
2012-12-06 16:45:51,585 : INFO : topic #3: 0.002*topv + 0.001*nederland + 0.001*topf + 0.001*motorsport + 0.001*uncool + 0.001*deleteme + 0.001*zd + 0.001*dominiquerobert + 0.001*bikini + 0.001*denhaag
2012-12-06 16:45:51,616 : INFO : topic #4: 0.001*feminism + 0.001*cowes + 0.001*isleofwight + 0.000*wight + 0.000*funnysign + 0.000*iow + 0.000*ibm + 0.000*badsign + 0.000*colombia + 0.000*solent
2012-12-06 16:45:51,647 : INFO : topic #5: 0.001*mcshots + 0.001*alberta + 0.001*facepainting + 0.001*makeup + 0.001*summicron + 0.001*flviobrando + 0.001*srie + 0.001*noiretblanc + 0.001*pssaros + 0.001*sries
2012-12-06 16:45:51,681 : INFO : topic #6: 0.001*roae + 0.001*viltrakis + 0.001*cincinnati + 0.001*makro + 0.001*naturephotos + 0.001*puppy + 0.001*corgi + 0.001*lightpainting + 0.001*dmcfz + 0.001*fz
2012-12-06 16:45:51,713 : INFO : topic #7: 0.001*apx + 0.000*wail + 0.000*whelen + 0.000*bullhorn + 0.000*neworleans + 0.000*yelp + 0.000*airhorn + 0.000*callout + 0.000*mmsummicron + 0.000*kunstart
2012-12-06 16:45:51,743 : INFO : topic #8: 0.001*espaa + 0.001*fimo + 0.001*handmade + 0.001*emergency + 0.001*on + 0.001*eisenbahnen + 0.001*ambulance + 0.001*chemaconcelln + 0.001*polymerclay + 0.001*nubes
2012-12-06 16:45:51,774 : INFO : topic #9: 0.001*kenzan + 0.001*nahrungsmittel + 0.001*enviro + 0.001*mittagessen + 0.001*leyland + 0.001*plaxton + 0.001*nahrung + 0.001*buses + 0.000*psv + 0.000*stagecoach
2012-12-06 16:45:51,804 : INFO : topic #10: 0.001*tokyo + 0.001*nj + 0.001*snap + 0.001*zeiss + 0.001*lca + 0.001*streetart + 0.001*planar + 0.001*documentary + 0.001*hawaii + 0.001*zuiko
2012-12-06 16:45:51,834 : INFO : topic #11: 0.001*ireallylike + 0.000*mlb + 0.000*collectible + 0.000*designervinyl + 0.000*shakers + 0.000*vinyltoys + 0.000*designertoy + 0.000*collectibles + 0.000*vinyltoy + 0.000*hasbro
2012-12-06 16:45:51,864 : INFO : topic #12: 0.001*dca + 0.001*bunny + 0.001*boeing + 0.001*railways + 0.001*class + 0.001*canine + 0.001*diesel + 0.001*spotting + 0.001*locomotive + 0.001*trains
2012-12-06 16:45:51,896 : INFO : topic #13: 0.001*etsy + 0.001*handmade + 0.001*harveybarrison + 0.001*ooak + 0.001*tauck + 0.001*polymer + 0.001*fireengine + 0.001*firetruck + 0.001*videogames + 0.001*cosplayer
2012-12-06 16:45:51,926 : INFO : topic #14: 0.001*holga + 0.001*toycamera + 0.001*georgia + 0.001*tibet + 0.001*bangkok + 0.001*beijing + 0.001*mediumformat + 0.001*earthasia + 0.001*quality + 0.001*thailand
2012-12-06 16:45:51,956 : INFO : topic #15: 0.001*eastvan + 0.001*varanasi + 0.001*myanmar + 0.001*nude + 0.001*romania + 0.001*commercialdrive + 0.001*sicilya + 0.001*sziclia + 0.001*vancouver + 0.001*pedalare
2012-12-06 16:45:51,988 : INFO : topic #16: 0.002*tabby + 0.001*kitty + 0.001*kitten + 0.001*feline + 0.001*pets + 0.001*gato + 0.001*cats + 0.001*katze + 0.001*gatos + 0.001*kittens
2012-12-06 16:45:52,019 : INFO : topic #17: 0.001*arturii + 0.001*illustration + 0.001*iceland + 0.001*drawing + 0.001*sland + 0.001*flickrblick + 0.001*southcarolina + 0.001*interetsing + 0.001*hpexif + 0.001*charleston
2012-12-06 16:45:52,050 : INFO : topic #18: 0.001*zinzins + 0.001*pentaxist + 0.001*pentaxian + 0.001*topqualityimage + 0.001*paintshopprox + 0.001*ultimate + 0.001*sdm + 0.001*ashotadayorso + 0.001*paintshop + 0.001*corel
2012-12-06 16:45:52,082 : INFO : topic #19: 0.001*urbanart + 0.001*gig + 0.001*lastfmevent + 0.001*watercolor + 0.001*watercolour + 0.001*wheatpaste + 0.001*streetart + 0.001*suomi + 0.001*bjd + 0.001*scrawl


As you can see, the salience value for each word in each topic is extremely low. And when I query the model with the documents I created the corpus from, most of the documents choose topic #13 as the most likely topic. What could possibly be the reason for this? Not enough documents (I have about 4,600), very noisy data (user-entered tags aren't the most reliable piece of text in the world), or not enough frequent words (tags)?

Thank you for the help 

Karsten

Dec 6, 2012, 5:20:04 PM
to gen...@googlegroups.com
Hi,

I'd try more passes; it may not have converged. Fewer topics (around 20) might also help.

Radim Řehůřek

Dec 6, 2012, 7:03:03 PM
to gensim
Karsten is correct, a single pass over 4,000 documents is not enough.
For such a small corpus, you should be able to use >100 passes easily.
The optimal number will of course depend on your corpus structure,
number of requested topics etc.

Hth,
Radim

Karel Antonio Verdecia Ortiz

Dec 7, 2012, 8:26:10 AM
to gen...@googlegroups.com
Hi,

Is there an attribute that I can use to know whether the model converged?
What would be a good number of passes for 400 documents?



Karsten

Dec 7, 2012, 3:38:21 PM
to gen...@googlegroups.com
There is a warning in the logs if it may not converge.
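For reference, gensim reports through the standard `logging` module, so such warnings can also be captured programmatically rather than just read off the console. The logger name below is an assumption for illustration, and the warning text is a stand-in, not gensim's actual message:

```python
import logging

# route log records to the console with a timestamped format
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

# hypothetical logger name; any library logger can be inspected this way
logger = logging.getLogger("gensim.models.ldamodel")

# collect records in a list instead of only printing them
records = []
class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record)

logger.addHandler(ListHandler())
logger.warning("likelihood did not converge")   # stand-in for the real warning

print(records[0].levelname)   # → WARNING
```

After a training run, scanning `records` for WARNING-level messages is a simple way to flag non-convergence automatically.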

Karel Antonio Verdecia Ortiz

Dec 8, 2012, 8:58:50 AM
to gen...@googlegroups.com
thanks


Ahmet

Dec 9, 2012, 5:14:09 PM
to gen...@googlegroups.com
Thanks for the help, everyone.