# Split into train and test: a random sample that preserves document order
train_size = int(round(len(bow_corpus) * 0.8))
train_index = sorted(random.sample(range(len(bow_corpus)), train_size))
test_index = sorted(set(range(len(bow_corpus))) - set(train_index))
train_corpus = [bow_corpus[i] for i in train_index]
test_corpus = [bow_corpus[j] for j in test_index]
# Total token count in the held-out documents, used to normalise the bound
number_of_words = sum(cnt for document in test_corpus for _, cnt in document)
# Candidate numbers of topics: 5, 10, ..., 150
parameter_list = range(5, 151, 5)
for parameter_value in parameter_list:
    print("starting pass for parameter_value = %d" % parameter_value)
    # NB: this trains on bow_corpus (train + test); the follow-up further down corrects it to train_corpus
    model = models.LdaMulticore(corpus=bow_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10)
    # bound() returns the variational lower bound on the log likelihood of the whole test corpus, not a per-word value
    perplex = model.bound(test_corpus)
    print("Total Perplexity: %s" % perplex)
    grid[parameter_value].append(perplex)
    # Per-word perplexity: 2 ** (-bound / number of held-out words)
    per_word_perplex = np.exp2(-perplex / number_of_words)
    print("Per-word Perplexity: %s" % per_word_perplex)
    grid[parameter_value].append(per_word_perplex)
    model.save(data_path + 'ldaMulticore_i10_T' + str(parameter_value) + '_training_corpus.lda')
    print("")
for numtopics in parameter_list:
    print("%s \t %s" % (numtopics, grid[numtopics]))
df = pandas.DataFrame(grid)
ax = plt.figure(figsize=(7, 4), dpi=300).add_subplot(111)
df.iloc[1].transpose().plot(ax=ax, color="#254F09")
plt.xlim(parameter_list[0], parameter_list[-1])
plt.ylabel('Perplexity')
plt.xlabel('topics')
plt.title('')
plt.savefig('gensim_multicore_i10_topic_perplexity.pdf', format='pdf', bbox_inches='tight', pad_inches=0.1)
plt.show()
df.to_pickle(data_path + 'gensim_multicore_i10_topic_perplexity.df')
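In case it helps anyone reproduce this, the order-preserving split at the top of the script can be checked in isolation. This is only a minimal sketch in Python 3: `ordered_split` and `toy_corpus` are names made up for illustration, and the toy documents simply mimic the list of (token_id, count) pairs that the script expects in bow_corpus.

```python
import random


def ordered_split(n_items, train_fraction=0.8, seed=None):
    """Return (train_index, test_index): both sorted, disjoint, covering range(n_items)."""
    rng = random.Random(seed)
    train_size = int(round(n_items * train_fraction))
    # Sample positions at random, then sort so the original document order is kept
    train_index = sorted(rng.sample(range(n_items), train_size))
    train_set = set(train_index)
    test_index = [i for i in range(n_items) if i not in train_set]
    return train_index, test_index


# Toy stand-in for bow_corpus (hypothetical data, not from the real corpus)
toy_corpus = [[(0, 1), (1, 2)], [(1, 1)], [(2, 3)], [(0, 2), (2, 1)], [(3, 1)]]

train_index, test_index = ordered_split(len(toy_corpus), seed=0)
train_corpus = [toy_corpus[i] for i in train_index]
test_corpus = [toy_corpus[i] for i in test_index]

# Same token count as in the script: sum of counts over the held-out documents
number_of_words = sum(cnt for document in test_corpus for _, cnt in document)
print(train_index, test_index, number_of_words)
```

Passing a seed makes the split repeatable, which matters if perplexities from separate runs are going to be compared.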
There is a dip at around 130 topics, but it isn't very large, so it seems
like it could be noise. Does the change of gradient at around 35-40 topics
suggest that is the best number of topics? Does anyone have any examples
of this type of graph showing what to expect when it works?
Is there a better approach?
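One way the noise question might be tested is to repeat the run several times per topic count (with different random splits) and compare the size of the dip to the spread between runs. The following is a rough sketch of that idea only: `summarize_runs` and `dip_exceeds_noise` are hypothetical helpers, the two-standard-deviation threshold is an arbitrary assumption, and the numbers are invented purely to exercise the code.

```python
import statistics


def summarize_runs(runs):
    """runs maps num_topics -> list of per-word perplexities from repeated runs."""
    return {k: (statistics.mean(v), statistics.stdev(v)) for k, v in runs.items()}


def dip_exceeds_noise(summary, candidate, neighbours, n_std=2.0):
    """True if the candidate's mean is lower than every neighbour's mean
    by more than n_std times the larger of the two standard deviations."""
    c_mean, c_std = summary[candidate]
    for k in neighbours:
        k_mean, k_std = summary[k]
        noise = n_std * max(c_std, k_std)
        if k_mean - c_mean <= noise:
            return False
    return True


# Made-up values for illustration; these are NOT results from this thread
illustrative_runs = {
    120: [470.0, 465.0, 475.0],
    130: [462.0, 468.0, 458.0],
    140: [469.0, 474.0, 466.0],
}

summary = summarize_runs(illustrative_runs)
for k in sorted(summary):
    print(k, "mean %.1f std %.1f" % summary[k])
print("dip at 130 larger than run-to-run noise:", dip_exceeds_noise(summary, 130, [120, 140]))
```

If the dip does not clear the run-to-run spread, it is probably not a reliable signal for choosing the number of topics.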
I started with this model because it had the shortest run time for a
reasonable perplexity in my previous testing
(https://groups.google.com/forum/#!msg/gensim/yJan7QlKr4I/0XmdtR_78MoJ).
When I didn't see the dip I expected, I tried extending the number of
topics higher, checking every 10th topic between 155 and 300.
The results are strange: the perplexity jumps enormously between 225 and 235 topics and continues to go up:
215 [-10569430.048500545, 584.90060291802604]
225 [-10684866.564780824, 627.05178242805653]
235 [-31870705.178804845, 220681813.39332914]
245 [-37573025.065190136, 6864939909.9477577]
Is this expected? I repeated this from just before it jumped up and this was the result:
So it seems like it is quite unstable at these large perplexity values.
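One thing that may help when reading these numbers: the per-word figure is 2 raised to the negative bound divided by the held-out word count, so a bound that becomes roughly three times more negative moves the exponent from about 9 to about 28. The sketch below only re-derives this from the four rows quoted above. The word count is inferred from those rows rather than stated anywhere in this post, and the helper names are made up for illustration.

```python
import math

# (num_topics, bound, per-word perplexity) exactly as quoted above
reported = [
    (215, -10569430.048500545, 584.90060291802604),
    (225, -10684866.564780824, 627.05178242805653),
    (235, -31870705.178804845, 220681813.39332914),
    (245, -37573025.065190136, 6864939909.9477577),
]


def implied_word_count(bound, per_word_perplexity):
    """Invert perplexity = 2 ** (-bound / n_words) to recover n_words."""
    return -bound / math.log2(per_word_perplexity)


def per_word_perplexity(bound, n_words):
    """Same formula as the script: 2 ** (-bound / n_words)."""
    return 2.0 ** (-bound / n_words)


# Every row should imply the same held-out word count if the rows share one test split
counts = [implied_word_count(b, p) for _, b, p in reported]
n_words = counts[0]

for num_topics, bound, _ in reported:
    exponent = -bound / n_words  # per-word bound in bits, sign flipped
    print(num_topics, round(exponent, 2), per_word_perplexity(bound, n_words))
```

Because the bound sits in an exponent, even modest run-to-run differences in the bound at this scale turn into very large differences in per-word perplexity, so comparing the per-word bound (the exponent) may be easier than comparing the perplexity itself.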
I tried using alpha='asymmetric' to see if that gave a different result but it was very similar.
model = models.LdaMulticore(corpus=bow_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10, alpha='asymmetric')
model = models.LdaMulticore(corpus=bow_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10)
should be
model = models.LdaMulticore(corpus=train_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10)
model = models.LdaMulticore(corpus=train_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10)
mean 462.3, std 6.5, min 451.5, max 467.7
model = models.LdaMulticore(corpus=train_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10, alpha='asymmetric')
Happy to run any tests that might help work out why the perplexity is increasing as number of topics increases.
regards,
Brenda
--
You received this message because you are subscribed to the Google Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.