Creating a Corpus, Dictionary and running LDA etc.


Ahmet

Dec 3, 2012, 4:21:28 PM
to gen...@googlegroups.com
Hello,

I am very new to gensim, in fact I started using it only yesterday. First of all, I must congratulate you for this great library. Now here are my questions.

Below is my code:

import os
import gensim
from nltk.stem.porter import PorterStemmer  # assuming NLTK's Porter stemmer

# root_dir and out_dir are assumed to be defined elsewhere
stemmer = PorterStemmer()

def split_line(text):
    words = text.split()
    out = []
    for word in words:
        word = stemmer.stem(word.decode("ascii", "ignore"))
        out.append(word)
    return out

class MyCorpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        count = 1
        for filename in self.input:
            print count
            count = count + 1
            yield split_line(open(root_dir + filename).read())

if __name__ == '__main__':
    data_files = [x[2] for x in os.walk(root_dir)]
    myCorpus = MyCorpus(data_files[0])

    gensim.corpora.MmCorpus.serialize(out_dir + 'flickrCorpusNew.mm', myCorpus)
    myCorpus.dictionary.filter_extremes()

    print myCorpus.dictionary

    myCorpus.dictionary.save(out_dir + "flickrDictNew.dict")
    myCorpus.dictionary.save_as_text(out_dir + "flickrDictNewText.txt")

    lda = gensim.models.ldamodel.LdaModel(corpus=myCorpus, id2word=myCorpus.dictionary,
                                          num_topics=50, update_every=1,
                                          chunksize=10000, passes=1)

    lda.show_topics(20)

    lda.save(out_dir + "lda_model")


My scenario is the following. I have 4000+ Flickr groups as my documents. Each group, which is in a file, contains tags that belong to the photos that the particular group has. I am trying to cluster groups into 50 topics.

1- Since there is a lot of noise in the tags, I am using Porter stemming to collapse similar tags into the same token. I will also try to remove tags that contain numbers, etc. Do you recommend any other pre-processing?

2- I use the variable count in get_texts() to see how close the code is to finishing, and I have realized that get_texts() is called 4-5 separate times. This makes the code take an immense amount of time to run for 4000+ documents. I tried to load the saved corpus and dictionary and run LDA from there, but then I get an index error: the indices don't match. Is there an easier way to save and load these correctly, so I don't have to recreate the corpus and the dictionary every time?

3- The LDA model prints out nothing when I run this. What might be the problem? Am I calling it right?

thanks

Karsten

Dec 4, 2012, 2:47:52 PM
to gen...@googlegroups.com
Hi,

I am not quite sure, but I think TextCorpus has a save_corpus function which you should use instead.

For stemming I found utils.lemmatize (which uses the pattern library) useful, but make sure you use the fast algorithm.

Karsten

Dec 5, 2012, 4:39:31 AM
to gen...@googlegroups.com
PS: I just saw that you don't convert the corpus to a bag-of-words representation and TF-IDF space. Unless that's done in MyCorpus, you have to do that first.
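For readers unfamiliar with these transformations, here is a pure-Python sketch of what the bag-of-words and tf-idf steps do conceptually. This is only an illustration of the idea, not gensim's actual implementation (gensim's TfidfModel applies its own normalization):

```python
import math
from collections import Counter

# toy corpus of tokenized documents
docs = [["cat", "dog", "cat"], ["dog", "fish"], ["cat", "fish", "fish"]]

# build a token -> id map (roughly what gensim's Dictionary does)
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

# bag-of-words: sparse (token_id, count) pairs per document (roughly doc2bow)
bow = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

# document frequency per token id, then one common idf weighting variant
df = Counter(i for doc in bow for i, _ in doc)
n_docs = len(docs)
tfidf = [[(i, c * math.log(n_docs / df[i], 2)) for i, c in doc] for doc in bow]

print(bow[0])   # → [(0, 2), (1, 1)]  i.e. 2x "cat", 1x "dog"
```

With gensim you would get the same shape of output from `dictionary.doc2bow(doc)` and `models.TfidfModel(corpus)`.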

Ahmet

Dec 5, 2012, 5:55:13 AM
to gen...@googlegroups.com
Hi, thank you for your answer. Is this what I am supposed to do?

tfidf = gensim.models.TfidfModel(myCorpus)
corpus_tfidf = tfidf[myCorpus]

lda = gensim.models.ldamodel.LdaModel(corpus=corpus_tfidf, id2word=myCorpus.dictionary,
                                      num_topics=20, update_every=1,
                                      chunksize=10000, passes=1)

Radim Řehůřek

Dec 5, 2012, 4:19:53 PM
to gensim
Hello Ahmet,
it depends on the texts -- you're splitting at whitespace, which may not be appropriate for some languages (Chinese, Japanese, Arabic?). Also, you seem to ignore non-ASCII characters, which is not a good idea for languages with accents (Slavic etc.).
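A tokenizer along those lines -- folding accented characters to their base letters instead of dropping non-ASCII, and skipping digit-containing tags -- might look like this sketch. The function name and cleanup rules are only illustrative, not part of gensim:

```python
import unicodedata

def tokenize(text):
    """Lowercase, fold accents to base letters, drop tokens with digits."""
    tokens = []
    for raw in text.lower().split():
        # decompose accented characters, then strip the combining marks
        folded = unicodedata.normalize("NFKD", raw)
        folded = "".join(c for c in folded if not unicodedata.combining(c))
        # skip noisy tags containing digits, e.g. "img1234"
        if any(ch.isdigit() for ch in folded):
            continue
        if folded:
            tokens.append(folded)
    return tokens

print(tokenize("Čeština photo1234 café"))   # → ['cestina', 'cafe']
```

This keeps "café" as "cafe" rather than mangling it to "caf" the way `decode("ascii", "ignore")` would; whether accent folding is acceptable depends on your language and data.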


> 2- I have the variable count, in get_texts to see how close the code is to
> finish. And I have realized that get_texts() is called 4-5 different times.
> This makes this code take immense amount of time to run for 4000+
> documents.

4000 documents shouldn't take an "immense amount of time". It's hard
to tell from the copy&paste, but it looks like the counter gets incremented
by 1 for each file? So you should see counter=4000+?

> I tried to load the saved corpus and dictionary and run LDA from
> there, but then I get index error, indices dont match. Is there an easier
> way to save and load this the correct way, so I don't have to recreate the
> corpus and the dictionary everytime?

Sure, just use `Dictionary.save/load` and `MmCorpus.serialize`, pretty
much how you're doing it. Except you modify the dictionary with
`filter_extremes` after saving the corpus, which is probably the cause
of this id mismatch.
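That id mismatch can be illustrated with a toy token map (a pure-Python sketch of the effect, not gensim code):

```python
# token -> id map as it was when the corpus was serialized
token2id = {"cat": 0, "rarejunk": 1, "dog": 2}

# a serialized document: (token_id, count) pairs written with the map above
serialized_doc = [(0, 3), (2, 1)]          # 3x "cat", 1x "dog"

# filter_extremes drops rare tokens AND compacts the remaining ids
filtered = {tok: new_id for new_id, tok in
            enumerate(t for t in sorted(token2id, key=token2id.get)
                      if t != "rarejunk")}
# filtered == {"cat": 0, "dog": 1}  -- "dog" moved from id 2 to id 1

id2token = {v: k for k, v in filtered.items()}
# decoding the old pairs with the new map: id 2 no longer exists
print([id2token.get(i) for i, _ in serialized_doc])   # → ['cat', None]
```

The fix is simply to call `filter_extremes` on the dictionary *before* serializing the corpus, so the saved ids and the saved dictionary agree.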


> 3- The lda model I got prints out nothing when I run this. What might be
> the problem, am I calling it right?

Error on input, error on preprocessing, error elsewhere... gensim has
pretty advanced logging, so if you add `import logging;
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.DEBUG)`, you'll see a lot more info. If you can't figure
out what the problem is, post the (link to) your log so we may have a look :)

Best,
Radim

Karsten

Dec 6, 2012, 7:37:30 AM
to gen...@googlegroups.com
Hi,

here is code I use:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Some implementation details are taken from wikicorpus.py

from gensim import utils, corpora, models
import logging
import multiprocessing
import os
import sys

# Wiki is first scanned for all distinct word types (~7M). The types that appear
# in more than 10% of articles (supposedly stop words) are removed and 
# from the rest, the DEFAULT_DICT_SIZE most frequent types are kept. 
DEFAULT_DICT_SIZE = 50000

# Words that appear fewer than NO_BELOW times are dropped
NO_BELOW = 20

# Number of topics to create for the LDA model
NUM_TOPICS = 500

def process_file_path(file_path):
    with open(file_path, "r") as file:
        # the last character is a trailing \n
        article_name = file.readline()[:-1]

        # the remaining lines are the document
        doc = " ".join(file.readlines())
        
        lemmatized_doc = utils.lemmatize(doc)
        
        return (article_name, lemmatized_doc)

class CleanCorpus(corpora.TextCorpus):
    '''
    Loads all documents in a directory from the file system. Each file in the
    directory is regarded as one document and should be a text file.

    The first line is the article name.

    Lemmatizes all words, removes stop words and tokenizes each document.
    '''

    def __init__(self, fname, no_below=NO_BELOW, keep_words=DEFAULT_DICT_SIZE,
                 dictionary=None):
        '''
        See gensim.corpora.textcorpus for details.

        :param fname: The path to scan for documents.
        '''
        
        self.fname = fname
        self.article_names = []
        if keep_words is None:
            keep_words = DEFAULT_DICT_SIZE
        if no_below is None:
            no_below = NO_BELOW
              
        self.file_paths = [os.path.join(self.fname, name) for name in os.listdir(self.fname) 
                            if os.path.isfile(os.path.join(self.fname, name))]
        
        self.processes = 2
        
        #each file is considered an article
        self.total_articles = len(self.file_paths)
            
        if dictionary is None:
            self.dictionary = corpora.Dictionary(self.get_texts())
            self.dictionary.filter_extremes(no_below=no_below, no_above=0.1, 
                                            keep_n=keep_words)
        else:
            self.dictionary = dictionary
            
    def get_texts(self):
        '''
        Files are processed in parallel.

        See wikicorpus.py by Radim Rehurek
        '''
        logger = logging.getLogger("feature_extractor")
        
        logger.info("Scanning %d files." % self.total_articles)
        
        articles_processed = 0

        pool = multiprocessing.Pool(self.processes)
        
        for group in utils.chunkize_serial(self.file_paths,
                                           chunksize=10*self.processes):
            for article_name, tokens in pool.imap(process_file_path, group):
                articles_processed += 1
                try:
                    name = article_name.strip("\n").decode("UTF-8")
                except UnicodeDecodeError as e:
                    logger.error("Could not decode %s: %s" % (article_name, e))
                    exit(1) 
                self.article_names.append(name)
                yield tokens
        
        pool.terminate()
        
        logger.info("Processed %d articles." % articles_processed)
            
if __name__ == "__main__":
    from optparse import OptionParser
        
    p = OptionParser()
    p.add_option('-p', '--path', action="store", dest='doc_path',
                     help="specify path of wiki documents")
    p.add_option('-o', '--output-prefix', action="store", dest='prefix',
                     help="specify path prefix where everything should be saved")
    (options, args) = p.parse_args()
    
    logger = logging.getLogger("feature_extractor")
    
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    
    corpus = CleanCorpus(options.doc_path)
    
    #save dictionary: word <-> token id map
    corpus.dictionary.save(options.prefix + "_wordids.dict")
    
    #del corpus
    
    '''Bag-of-Words'''
    
    #init corpus reader and word -> id map
    id2token = corpora.Dictionary.load(options.prefix + "_wordids.dict")
    new_corpus = CleanCorpus(options.doc_path, dictionary = id2token)
    
    #create and save bow-representation of corpus
    corpora.MmCorpus.serialize(options.prefix + '_bow_corpus.mm', new_corpus,
                             progress_cnt=10000)
    
    #del new_corpus
    
    #init corpus reader
    mm_bow = corpora.MmCorpus(options.prefix + '_bow_corpus.mm')
    
    '''TFIDF Model creation'''
    
    #build tfidf model
    tfidf = models.TfidfModel(mm_bow, id2word=id2token, normalize=True)
    
    #save tfidf model
    tfidf.save(options.prefix + '_tfidf.model')
    
    #save corpus as tfidf vectors in matrix market format
    corpora.MmCorpus.serialize(options.prefix + '_tfidf_corpus.mm', tfidf[mm_bow], 
                               progress_cnt=10000)

    
    #init tfidf-corpus reader
    mm_tfidf = corpora.MmCorpus(options.prefix + '_tfidf_corpus.mm')
    
    '''LDA Model creation'''
    
    #build lda model
    lda = models.LdaModel(corpus=mm_tfidf, id2word=id2token, 
                          num_topics=NUM_TOPICS, update_every=1, 
                          chunksize=10000, passes=2) 
    
    #save trained model
    lda.save(options.prefix + '_lda.model')
   
    
    logger.info("finished transforming")

Ahmet

Dec 6, 2012, 9:55:45 AM
to gen...@googlegroups.com
Thank you very much for your help. I had forgotten to check whether an existing dictionary was passed in the constructor of my corpus, and I had also forgotten to enable logging to print out results. However, there is still a problem (I think) with my results.


2012-12-06 16:45:51,493 : INFO : topic #0: 0.001*simonrichards + 0.001*pasteup + 0.001*usaf + 0.001*lowermainland + 0.001*birding + 0.001*sculptuur + 0.001*birdphotography + 0.001*lockheed + 0.001*mrfahrenheit + 0.001*mfh
2012-12-06 16:45:51,521 : INFO : topic #1: 0.001*trkiye + 0.001*minimal + 0.001*singapore + 0.001*oop + 0.001*toronto + 0.001*coth + 0.001*minimalism + 0.001*sacramento + 0.001*classiccars + 0.001*istanbul
2012-12-06 16:45:51,554 : INFO : topic #2: 0.001*wdw + 0.001*waltdisneyworld + 0.001*kitsch + 0.001*disneyworld + 0.001*weybridge + 0.001*disney + 0.001*magickingdom + 0.001*queenstown + 0.001*bergen + 0.001*postcrossing
2012-12-06 16:45:51,585 : INFO : topic #3: 0.002*topv + 0.001*nederland + 0.001*topf + 0.001*motorsport + 0.001*uncool + 0.001*deleteme + 0.001*zd + 0.001*dominiquerobert + 0.001*bikini + 0.001*denhaag
2012-12-06 16:45:51,616 : INFO : topic #4: 0.001*feminism + 0.001*cowes + 0.001*isleofwight + 0.000*wight + 0.000*funnysign + 0.000*iow + 0.000*ibm + 0.000*badsign + 0.000*colombia + 0.000*solent
2012-12-06 16:45:51,647 : INFO : topic #5: 0.001*mcshots + 0.001*alberta + 0.001*facepainting + 0.001*makeup + 0.001*summicron + 0.001*flviobrando + 0.001*srie + 0.001*noiretblanc + 0.001*pssaros + 0.001*sries
2012-12-06 16:45:51,681 : INFO : topic #6: 0.001*roae + 0.001*viltrakis + 0.001*cincinnati + 0.001*makro + 0.001*naturephotos + 0.001*puppy + 0.001*corgi + 0.001*lightpainting + 0.001*dmcfz + 0.001*fz
2012-12-06 16:45:51,713 : INFO : topic #7: 0.001*apx + 0.000*wail + 0.000*whelen + 0.000*bullhorn + 0.000*neworleans + 0.000*yelp + 0.000*airhorn + 0.000*callout + 0.000*mmsummicron + 0.000*kunstart
2012-12-06 16:45:51,743 : INFO : topic #8: 0.001*espaa + 0.001*fimo + 0.001*handmade + 0.001*emergency + 0.001*on + 0.001*eisenbahnen + 0.001*ambulance + 0.001*chemaconcelln + 0.001*polymerclay + 0.001*nubes
2012-12-06 16:45:51,774 : INFO : topic #9: 0.001*kenzan + 0.001*nahrungsmittel + 0.001*enviro + 0.001*mittagessen + 0.001*leyland + 0.001*plaxton + 0.001*nahrung + 0.001*buses + 0.000*psv + 0.000*stagecoach
2012-12-06 16:45:51,804 : INFO : topic #10: 0.001*tokyo + 0.001*nj + 0.001*snap + 0.001*zeiss + 0.001*lca + 0.001*streetart + 0.001*planar + 0.001*documentary + 0.001*hawaii + 0.001*zuiko
2012-12-06 16:45:51,834 : INFO : topic #11: 0.001*ireallylike + 0.000*mlb + 0.000*collectible + 0.000*designervinyl + 0.000*shakers + 0.000*vinyltoys + 0.000*designertoy + 0.000*collectibles + 0.000*vinyltoy + 0.000*hasbro
2012-12-06 16:45:51,864 : INFO : topic #12: 0.001*dca + 0.001*bunny + 0.001*boeing + 0.001*railways + 0.001*class + 0.001*canine + 0.001*diesel + 0.001*spotting + 0.001*locomotive + 0.001*trains
2012-12-06 16:45:51,896 : INFO : topic #13: 0.001*etsy + 0.001*handmade + 0.001*harveybarrison + 0.001*ooak + 0.001*tauck + 0.001*polymer + 0.001*fireengine + 0.001*firetruck + 0.001*videogames + 0.001*cosplayer
2012-12-06 16:45:51,926 : INFO : topic #14: 0.001*holga + 0.001*toycamera + 0.001*georgia + 0.001*tibet + 0.001*bangkok + 0.001*beijing + 0.001*mediumformat + 0.001*earthasia + 0.001*quality + 0.001*thailand
2012-12-06 16:45:51,956 : INFO : topic #15: 0.001*eastvan + 0.001*varanasi + 0.001*myanmar + 0.001*nude + 0.001*romania + 0.001*commercialdrive + 0.001*sicilya + 0.001*sziclia + 0.001*vancouver + 0.001*pedalare
2012-12-06 16:45:51,988 : INFO : topic #16: 0.002*tabby + 0.001*kitty + 0.001*kitten + 0.001*feline + 0.001*pets + 0.001*gato + 0.001*cats + 0.001*katze + 0.001*gatos + 0.001*kittens
2012-12-06 16:45:52,019 : INFO : topic #17: 0.001*arturii + 0.001*illustration + 0.001*iceland + 0.001*drawing + 0.001*sland + 0.001*flickrblick + 0.001*southcarolina + 0.001*interetsing + 0.001*hpexif + 0.001*charleston
2012-12-06 16:45:52,050 : INFO : topic #18: 0.001*zinzins + 0.001*pentaxist + 0.001*pentaxian + 0.001*topqualityimage + 0.001*paintshopprox + 0.001*ultimate + 0.001*sdm + 0.001*ashotadayorso + 0.001*paintshop + 0.001*corel
2012-12-06 16:45:52,082 : INFO : topic #19: 0.001*urbanart + 0.001*gig + 0.001*lastfmevent + 0.001*watercolor + 0.001*watercolour + 0.001*wheatpaste + 0.001*streetart + 0.001*suomi + 0.001*bjd + 0.001*scrawl


As you can see, the salience value for each word in each topic is extremely low. And when I query the model with the documents I created the corpus from, most of the documents choose topic #13 as the most likely topic. What could possibly be the reason for this? Not enough documents (I have about 4,600), very noisy data (user-entered tags aren't the most reliable piece of text in the world), or not enough frequent words (tags)?

Thank you for the help 

Karsten

Dec 6, 2012, 5:20:04 PM
to gen...@googlegroups.com
Hi,

I'd try more passes; it may not have converged. Fewer topics (around 20) might also help.

Radim Řehůřek

Dec 6, 2012, 7:03:03 PM
to gensim
Karsten is correct, a single pass over 4,000 documents is not enough.
For such a small corpus, you should be able to use >100 passes easily.
The optimal number will of course depend on your corpus structure,
number of requested topics etc.

Hth,
Radim

Karel Antonio Verdecia Ortiz

Dec 7, 2012, 8:26:10 AM
to gen...@googlegroups.com
Hi,

Is there an attribute that I can use to know whether the model converged?
What would be a good number of passes for 400 documents?



Karsten

Dec 7, 2012, 3:38:21 PM
to gen...@googlegroups.com
There is a warning in the logs if it may not converge.
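For reference, gensim reports through the standard `logging` module, so such warnings can also be captured programmatically rather than just read off the console. The logger name below is an assumption for illustration, and the warning text is a stand-in, not gensim's actual message:

```python
import logging

# route log records to the console with a timestamped format
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

# hypothetical logger name; any library logger can be inspected this way
logger = logging.getLogger("gensim.models.ldamodel")

# collect records in a list instead of only printing them
records = []
class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record)

logger.addHandler(ListHandler())
logger.warning("likelihood did not converge")   # stand-in for the real warning

print(records[0].levelname)   # → WARNING
```

After a training run, scanning `records` for WARNING-level messages is a simple way to flag non-convergence automatically.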

Karel Antonio Verdecia Ortiz

Dec 8, 2012, 8:58:50 AM
to gen...@googlegroups.com
thanks


Ahmet

Dec 9, 2012, 5:14:09 PM
to gen...@googlegroups.com
Thanks for the help, everyone.