Gensim LDA model vs scikit-learn LDA

Adam Reese

Nov 17, 2017, 10:25:23 AM
to gensim
Maybe someone can help me out here with syntax. I'm trying to compare the results of gensim's LDA to sklearn's implementation, but I cannot figure out how I need to feed in the data.


Gensim works just fine:

from gensim import corpora, models

# longer_docs: list of documents, each already split into tokens
id2word_l = corpora.Dictionary(longer_docs)
mm_l = [id2word_l.doc2bow(text) for text in longer_docs]

lda_l = models.ldamulticore.LdaMulticore(corpus=mm_l,
                                         id2word=id2word_l,
                                         num_topics=40,
                                         minimum_probability=0,
                                         chunksize=10000,
                                         passes=20,
                                         workers=24)

This is where I must be doing something wrong:

from gensim.models import TfidfModel
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_topics=40, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

tfidf = TfidfModel(mm_l)
corpus_tfidf = tfidf[mm_l]

lda.fit(corpus_tfidf)

This gives me the following error: ValueError: Expected 2D array, got 1D array instead:

I'm not sure how to transform the list into the 2D array it wants. Does anyone have experience doing this comparison?

Adam Reese

Nov 17, 2017, 11:26:06 AM
to gensim
Figured it out (I believe).

My data was being fed in as an array already broken out into tokens, but sklearn's CountVectorizer seems to want a string for each document. Once I combined the tokens back into a string, I was able to get it to work.

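from sklearn.feature_extraction.text import CountVectorizer

# docs: one string per document (tokens joined back together)
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(docs)

lda.partial_fit(tf)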

Live and learn, I guess.
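
The re-joining step described above is not shown in the post. A minimal sketch of what it might look like, using made-up toy data (`tokenized_docs` is a hypothetical stand-in for the pre-tokenized documents):

```python
# Toy stand-in for documents that are already split into tokens
# (hypothetical data, not from the original post).
tokenized_docs = [
    ["human", "machine", "interface"],
    ["graph", "of", "trees"],
]

# Join each token list back into one whitespace-separated string,
# which is the kind of input CountVectorizer takes by default.
docs = [" ".join(tokens) for tokens in tokenized_docs]

print(docs)
```

Keep in mind that CountVectorizer, with its default settings, lowercases the text and drops single-character tokens, so its vocabulary may not match the gensim Dictionary exactly.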

Ivan Menshikh

Nov 20, 2017, 4:26:52 AM
to gensim
Hi Adam,

We have a different input format from sklearn (though we offer a similar API too), and I think you correctly found the problem on the sklearn side here.
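
To illustrate the format difference: a gensim corpus stores each document as a sparse list of (token_id, weight) pairs, while sklearn estimators expect a 2D document-by-term matrix. A rough pure-Python sketch of that expansion, assuming toy data and a hypothetical helper named `bow_to_rows` (not part of either library):

```python
def bow_to_rows(corpus, num_terms):
    """Expand sparse (token_id, weight) documents into dense rows.

    corpus: iterable of documents, each a list of (token_id, weight) pairs.
    num_terms: vocabulary size, i.e. the number of columns per row.
    """
    rows = []
    for doc in corpus:
        row = [0.0] * num_terms          # one column per vocabulary term
        for token_id, weight in doc:
            row[token_id] = float(weight)
        rows.append(row)
    return rows


# Toy corpus in the same layout that doc2bow produces (hypothetical data).
toy_corpus = [[(0, 1), (2, 2)], [(1, 1)]]
matrix = bow_to_rows(toy_corpus, num_terms=3)

print(matrix)
```

A dense list of lists like this would be wasteful for a real corpus. gensim.matutils.corpus2csc builds a SciPy sparse matrix instead, with documents as columns, so it would need to be transposed before being passed to sklearn.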