Gensim LDA model vs scikit-learn LDA

Adam Reese

Nov 17, 2017, 10:25:23 AM

to gensim

Maybe someone can help me out here with the syntax. I'm trying to compare the results of gensim's LDA to sklearn's implementation, but I cannot figure out how I need to feed in the data.

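Gensim works just fine:

from gensim import corpora, models

# longer_docs is a list of documents, each already split into tokens
id2word_l = corpora.Dictionary(longer_docs)
# bag-of-words corpus: one list of (token_id, count) pairs per document
mm_l = [id2word_l.doc2bow(text) for text in longer_docs]
lda_l = models.ldamulticore.LdaMulticore(corpus=mm_l,
                                         id2word=id2word_l,
                                         num_topics=40,
                                         minimum_probability=0,
                                         chunksize=10000,
                                         passes=20,
                                         workers=24)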

This is where I must be doing something wrong:

from gensim.models import TfidfModel
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_topics=40, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
tfidf = TfidfModel(mm_l)
corpus_tfidf = tfidf[mm_l]
lda.fit(corpus_tfidf)

This gives me the following error:

ValueError: Expected 2D array, got 1D array instead:

I'm not sure how to transform the list into the 2D array it wants. Does anyone have experience doing this comparison?

Adam Reese

Nov 17, 2017, 11:26:06 AM

to gensim

Figured it out (I believe).

My data is being fed in as an array already broken out by token, but sklearn's CountVectorizer seems to want a single string for each document. Once I combined each document back into a string, I was able to get it to work.

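from sklearn.feature_extraction.text import CountVectorizer

# docs holds one string per document (the tokens joined back together)
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(docs)
lda.partial_fit(tf)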

Live and learn, I guess.
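For anyone landing here with the same error, the fix described above can be sketched end to end as follows. This is a minimal sketch under assumptions, not the exact code from this thread: the toy `longer_docs` data and the small `n_components=2` are made up for illustration. Note that later scikit-learn releases renamed `n_topics` to `n_components`, and that CountVectorizer's default tokenizer lowercases text and drops single-character tokens, so its vocabulary may not match the gensim Dictionary exactly.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the tokenized corpus (hypothetical data).
longer_docs = [
    ["topic", "models", "find", "themes", "in", "text"],
    ["gensim", "trains", "topic", "models", "on", "bags", "of", "words"],
    ["sklearn", "expects", "a", "document", "term", "matrix"],
    ["count", "vectorizer", "builds", "the", "document", "term", "matrix"],
]

# CountVectorizer tokenizes raw strings itself, so rejoin each token list.
docs = [" ".join(tokens) for tokens in longer_docs]

tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

# n_components replaces the older n_topics argument used earlier in the thread.
lda = LatentDirichletAllocation(n_components=2, max_iter=5,
                                learning_method="online",
                                learning_offset=50.0,
                                random_state=0)
doc_topics = lda.fit_transform(tf)  # one row of topic weights per document

print(tf.shape, doc_topics.shape)
```

The key point is that the matrix passed to `fit` is two-dimensional, with one row per document and one column per vocabulary term, which is what the ValueError was asking for.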

Ivan Menshikh

Nov 20, 2017, 4:26:52 AM

to gensim

Hi Adam,

Gensim uses a different input format than sklearn (although we also provide a similar, sklearn-style API). I think you correctly identified the problem on the sklearn side here.
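To make the format difference concrete: a gensim corpus is a sequence of documents, each a variable-length list of `(token_id, weight)` pairs, while sklearn estimators expect a rectangular matrix with one row per document and one column per term. The sketch below shows the conversion in plain Python on made-up data; `bow_to_dense` is a hypothetical helper written for illustration, not a gensim function. For real corpora, gensim's `gensim.matutils.corpus2csc` builds a sparse matrix instead, but it returns terms by documents, so it would need transposing before being passed to sklearn.

```python
def bow_to_dense(corpus, num_terms):
    """Expand gensim-style bag-of-words documents into a rectangular matrix.

    corpus: iterable of documents, each a list of (token_id, weight) pairs.
    num_terms: vocabulary size, i.e. the number of columns to produce.
    """
    matrix = []
    for doc in corpus:
        row = [0.0] * num_terms          # terms absent from the document stay 0
        for term_id, weight in doc:
            row[term_id] = float(weight)
        matrix.append(row)
    return matrix


# Hypothetical three-document corpus over a four-term vocabulary.
corpus = [
    [(0, 1), (2, 3)],
    [(1, 2)],
    [(0, 1), (1, 1), (3, 4)],
]

dense = bow_to_dense(corpus, num_terms=4)
for row in dense:
    print(row)
```

A dense list of lists like this is only practical for small vocabularies; with tens of thousands of terms the sparse route is the sensible one.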
