I am trying to visualize LDA topics in Python using pyLDAvis, but I can't seem to get it right. My model has a vocabulary of 150K words and was trained on about 16 million tokens.
I am running it outside of an IPython notebook, and this is the code I wrote:
model_filename = "150k_LdaModel_topics_"+ topics +"_passes_"+passes +".model"
dictionary = gensim.corpora.Dictionary.load('LDADictSpecialRemoved150k.dict')
corpus = gensim.corpora.MmCorpus('LDACorpusSpecialRemoved150k.mm')
ldamodel = gensim.models.ldamodel.LdaModel.load(model_filename)
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.save_html(vis, "topic_viz_"+topics+"_passes_"+passes+".html")
After 2-3 hours of running this code on a high-speed server with more than 30 GB of RAM, I get the following error. Can someone help me see where I am going wrong?
Traceback (most recent call last):
  File "create_vis.py", line 36, in <module>
    vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
  File "/local/lib/python2.7/site-packages/pyLDAvis/gensim.py", line 110, in prepare
    return vis_prepare(**opts)
  File "/local/lib/python2.7/site-packages/pyLDAvis/_prepare.py", line 398, in prepare
    token_table = _token_table(topic_info, term_topic_freq, vocab, term_frequency)
  File "/local/lib/python2.7/site-packages/pyLDAvis/_prepare.py", line 267, in _token_table
    term_ix.sort()
  File "/local/lib/python2.7/site-packages/pandas/indexes/base.py", line 1703, in sort
    raise TypeError("cannot sort an Index object in-place, use "
TypeError: cannot sort an Index object in-place, use sort_values instead
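The last frame of the traceback is inside pandas, not gensim: the installed pandas rejects the in-place `Index.sort()` call and points to `sort_values` instead. A minimal sketch of that behaviour, assuming a pandas version recent enough to raise this error (the toy index values are made up for illustration):

```python
import pandas as pd

# A small Index standing in for the `term_ix` object named in the traceback.
term_ix = pd.Index([3, 1, 2])

try:
    term_ix.sort()  # the same in-place call that appears in the traceback
    in_place_error = None
except TypeError as exc:
    in_place_error = str(exc)

# sort_values() returns a new, sorted Index instead of sorting in place.
sorted_ix = term_ix.sort_values()

print(in_place_error)
print(list(sorted_ix))
```

If this reproduces on the server, the trimmed dictionary is probably not the cause. The pyLDAvis release in use calls a pandas method that the installed pandas does not allow, so aligning the two package versions is a reasonable first thing to try.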
I think this is an issue with the gensim dictionary and corpus. I am trimming the dictionary from 2M entries to 150K, and I suspect that step is somehow causing the error.
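One way to test that suspicion is to check whether the corpus still contains term ids that the trimmed dictionary does not cover. The sketch below uses toy data, and the helper name `out_of_vocab_ids` is hypothetical; it assumes documents are lists of `(term_id, count)` pairs and that valid ids run from 0 to the vocabulary size minus one.

```python
def out_of_vocab_ids(bow_corpus, vocab_size):
    """Return the sorted term ids in the corpus that fall outside 0..vocab_size-1."""
    bad = set()
    for doc in bow_corpus:
        for term_id, _count in doc:
            if not 0 <= term_id < vocab_size:
                bad.add(term_id)
    return sorted(bad)


# Toy data: a 3-word vocabulary and two bag-of-words documents.
# The id 5 in the second document has no entry in the vocabulary.
toy_corpus = [[(0, 2), (2, 1)], [(1, 1), (5, 3)]]
stale_ids = out_of_vocab_ids(toy_corpus, vocab_size=3)
print(stale_ids)
```

With the real objects this would presumably be called as `out_of_vocab_ids(corpus, len(dictionary))`. An empty result suggests the corpus and dictionary agree, although a full pass over a corpus of this size will take a while.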
Can someone help?
Thanks,
Amar
import copy
import gensim
from gensim.models import VocabTransform
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
DEFAULT_DICT_SIZE = 100000
# filter the dictionary
old_dict = gensim.corpora.Dictionary.load('data.new_old/wiki_dict.dict')
new_dict = copy.deepcopy(old_dict)
new_dict.filter_extremes(no_below=20, no_above=0.1, keep_n=DEFAULT_DICT_SIZE)
new_dict.save('data.new_old/filtered.dict')

# transform the corpus
corpus = gensim.corpora.MmCorpus('data.new_old/wiki_corpus.mm')
old2new = {old_dict.token2id[token]:new_id for new_id, token in new_dict.iteritems()}
vt = VocabTransform(old2new)
gensim.corpora.MmCorpus.serialize('data.new_old/filtered_corpus.mm', vt[corpus], id2word=new_dict, progress_cnt=10000)

# create lda model from filtered data
bow_corpus = gensim.corpora.MmCorpus('data.new_old/filtered_corpus.mm')
dictionary = gensim.corpora.Dictionary.load('data.new_old/filtered.dict')
lda = gensim.models.ldamodel.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=100, update_every=1, chunksize=10000, passes=1)
lda.save('data.new_old/lda_filtered.model')
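The transform step hinges on the `old2new` mapping, which sends each surviving token's old id to its id in the filtered dictionary. A pure-Python sketch of the idea with toy data follows; the helper names `build_old2new` and `remap_bow` are hypothetical and only approximate what `VocabTransform` does with each document.

```python
def build_old2new(old_token2id, new_id2token):
    """Map each surviving token's old id to its new id."""
    return {old_token2id[token]: new_id for new_id, token in new_id2token.items()}


def remap_bow(bow, old2new):
    """Rewrite one bag-of-words document, dropping terms that were filtered out."""
    return sorted((old2new[term_id], count) for term_id, count in bow if term_id in old2new)


# Toy vocabularies (made up): "the" was filtered out of the new dictionary.
old_token2id = {"the": 0, "topic": 1, "model": 2}
new_id2token = {0: "model", 1: "topic"}

old2new = build_old2new(old_token2id, new_id2token)
new_doc = remap_bow([(0, 7), (1, 2), (2, 1)], old2new)
print(old2new)
print(new_doc)
```

The point of the sketch is that the corpus has to be rewritten whenever the dictionary is trimmed. Otherwise the ids stored in the `.mm` file keep referring to the old, larger vocabulary.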