pyLDAvis index 1098 is out of bounds for axis 1 with size 707


Haritz

Sep 10, 2016, 7:29:47 AM
to gensim
Hello, I want to build two models that share a dictionary, so I have two corpora and one dictionary for both. I'll show how I did it.

First, retrieve documents

setDocs1 = []
allDocuments = []
for file_name in os.listdir("/home/vagrant/shared/Test/1"):
    file = codecs.open("/home/vagrant/shared/Test/1/" + file_name, "r", "utf-8")
    aux = file.read()
    setDocs1.append(aux)
    allDocuments.append(aux)

setDocs2 = []
for file_name in os.listdir("/home/vagrant/shared/Test/2"):
    file = codecs.open("/home/vagrant/shared/Test/2/" + file_name, "r", "utf-8")
    aux = file.read()
    setDocs2.append(aux)
    allDocuments.append(aux)

Build dictionary and corpora

texts1 = []
texts2 = []
all_texts = []
tokenizer = RegexpTokenizer(r'\w+')
stoplist_tw = ['amp','get','got','hey','hmm','hoo','hop','iep','let','ooo','par',
               'pdt','pln','pst','wha','yep','yer','aest','didn','nzdt','via',
               'one','com','new','like','great','make','top','awesome','best',
               'good','wow','yes','say','yay','would','thanks','thank','going',
               'new','use','should','could','best','really','see','want','nice',
               'while','know']

# iterate over words (doc.split()), not characters, when collecting short tokens
unigrams = [w for doc in allDocuments for w in doc.split() if len(w) == 1]
bigrams  = [w for doc in allDocuments for w in doc.split() if len(w) == 2]

en_stop = set(nltk.corpus.stopwords.words("english") + stoplist_tw
              + unigrams + bigrams)
p_stemmer = PorterStemmer()
# loop through document list
for i in setDocs1:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts1.append(stemmed_tokens)
    all_texts.append(stemmed_tokens)

for i in setDocs2:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts2.append(stemmed_tokens)
    all_texts.append(stemmed_tokens)

# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(all_texts)
# convert tokenized documents into a document-term matrix
corpus1 = [dictionary.doc2bow(text) for text in texts1]
corpus2 = [dictionary.doc2bow(text) for text in texts2]
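To make the shared-dictionary setup concrete: a dictionary built over all_texts can contain ids that never occur in corpus1. Here is a toy sketch of that situation in plain Python (no gensim; all names and tokens are made up for illustration):

```python
from collections import Counter

# toy stand-in for a dictionary built over ALL documents
all_tokens = ["apple", "banana", "cherry", "date"]
token2id = {tok: i for i, tok in enumerate(all_tokens)}

def doc2bow(tokens):
    # mimic gensim's doc2bow: (token id, count) pairs sorted by id
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

# corpus1 only ever uses the first two tokens
corpus1 = [doc2bow(["apple", "banana", "apple"])]

ids_in_corpus1 = {i for doc in corpus1 for i, _ in doc}
print(len(token2id))        # 4: ids in the shared dictionary
print(max(ids_in_corpus1))  # 1: highest id actually present in corpus1
```

So any code that assumes every dictionary id appears in the corpus it was handed can break when the dictionary is shared across corpora.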


Now, I create two LDA models. One for corpus1 and another one for corpus2.

lda_model_1 = gensim.models.ldamodel.LdaModel(corpus1, num_topics=3, id2word = dictionary, passes=10, alpha=0.001)

I can see that it's working by doing this:

for i in xrange(3):
    print i
    for tup in lda_model_1.get_topic_terms(i):
        print dictionary[tup[0]] + ' ' + str(tup[1])

The result in my case is what I expected.

But now, when I execute the following, an exception arises:

data1 = pyLDAvis.gensim.prepare(lda_model_1, corpus1, dictionary)
pyLDAvis.display(data1)

Exception:

IndexErrorTraceback (most recent call last)
<ipython-input-17-209fc1d6a743> in <module>()
----> 1 data1 =  pyLDAvis.gensim.prepare(lda_model_1, corpus1, dictionary)
      2 pyLDAvis.display(data1)

/usr/local/lib/python2.7/dist-packages/pyLDAvis/gensim.pyc in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
     95     See `pyLDAvis.prepare` for **kwargs.
     96     """
---> 97     opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
     98     return vis_prepare(**opts)

/usr/local/lib/python2.7/dist-packages/pyLDAvis/gensim.pyc in _extract_data(topic_model, corpus, dictionary, doc_topic_dists)
     26    beta = 0.01
     27    fnames_argsort = np.asarray(list(dictionary.token2id.values()), dtype=np.int_)
---> 28    term_freqs = corpus_csc.sum(axis=1).A.ravel()[fnames_argsort]
     29    term_freqs[term_freqs == 0] = beta
     30    doc_lengths = corpus_csc.sum(axis=0).A.ravel()

IndexError: index 1098 is out of bounds for axis 1 with size 707
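For what it's worth, the failing line indexes a per-term array built from the corpus using ids taken from the dictionary. A minimal NumPy sketch of that size mismatch, using the sizes from the traceback (707 term ids seen in the corpus, dictionary ids running up to 1098) — the exact numbers here are only illustrative:

```python
import numpy as np

term_totals = np.zeros(707)        # per-term totals for the 707 ids in the corpus
dictionary_ids = np.arange(1099)   # ids from the shared dictionary, up to 1098

err = None
try:
    term_totals[dictionary_ids]    # same kind of fancy indexing as in the traceback
except IndexError as e:
    err = e
print(err)
```

Indexing a 707-element array with an id beyond 706 raises the same kind of IndexError as in the traceback.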

However, the 2nd model works fine. Does somebody know what I'm doing wrong? I don't understand why one model works but the other doesn't.

lda_model_2 = gensim.models.ldamodel.LdaModel(corpus2, num_topics=3, id2word=dictionary, passes=10, alpha=0.001)
data2 = pyLDAvis.gensim.prepare(lda_model_2, corpus2, dictionary)
pyLDAvis.display(data2)
# it works well

Thank you.

Christopher S. Corley

Sep 10, 2016, 8:58:33 AM
to gensim
Hi,

The exception is raised in pyLDAvis code; it might be worthwhile raising the issue with that project if you have not already. This dictionary sharing may be breaking some assumptions they have :-)

The code around gensim corpus, dictionary, and model building looks fine at first glance.  The "unigram" and "bigram" approach to building a stopword list is certainly interesting.  Could pyLDAvis be having an issue on documents that are empty as a result of this filtering?
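A quick way to check that hypothesis is to look for token lists that came out empty after the stopword removal and stemming. texts1 below is a made-up stand-in for the poster's token lists:

```python
# hypothetical token lists after stopword removal and stemming;
# the middle document lost all of its tokens to the filter
texts1 = [["econom", "polici"], [], ["market"]]

empty_ids = [i for i, doc in enumerate(texts1) if len(doc) == 0]
print(empty_ids)  # [1]
```

If that list is non-empty, the corresponding bag-of-words documents are empty too, which some downstream tools handle poorly.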

Chris.

--
You received this message because you are subscribed to the Google Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Haritz

Sep 10, 2016, 10:42:16 AM
to gensim
Hi,

I've read here and on Stack Overflow that some people had a similar problem because of the dictionary, which is why I am asking here. I will ask the pyLDAvis developers too, as you proposed.

There are no empty documents in my case, so the filtering shouldn't be the problem.

Thank you for your help :)

Lev Konstantinovskiy

Sep 16, 2016, 1:03:12 PM
to gensim
Let's wait for pyLDAVis to reply in https://github.com/bmabey/pyLDAvis/issues/72

Kenneth Orton

Oct 6, 2016, 3:12:52 AM
to gensim
I had the same index error and solved it by saving the pyLDAvis output as HTML instead of trying to display it in memory.
Here is the code I used:
vis_data = gensimvis.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis_data, 'data/lda_75_lem_5_pass.html')

