2014-04-10 11:25:08,301 : INFO : topic #1 (0.050): 0.142*scene + 0.063*film + 0.032*time + 0.031*prison + 0.029*school + 0.021*part + 0.021*prisoner + 0.021*type + 0.021*death + 0.021*deal
2014-04-10 11:25:08,301 : INFO : topic diff=0.134907, rho=0.316228
2014-04-10 11:25:08,313 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)
2014-04-10 11:25:08,401 : INFO : creating matrix for 111 documents and 20 features
/home/test/software/gensim_dev/gensim/gensim/similarities/docsim.py:561: RuntimeWarning: divide by zero encountered in log
result = -numpy.dot(numpy.log(self.index), query.T).T # return #queries x #index
[(0, nan), (1, nan), (2, nan), (3, nan), (4, nan), (5, nan), (6, nan), (7, nan), (8, nan), (9, nan)]

# smaller dataset
corpus = MyCorpus('/home/test/unsup_pre_small') # create a dictionary
dictionary = corpus.dictionary
corpora.MmCorpus.serialize('testing_corpus.mm', corpus)
corpus = corpora.MmCorpus('testing_corpus.mm')
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, update_every=1, chunksize=10000, passes=10, alpha=None)
doc = read_texts()#some document
bow = dictionary.doc2bow(utils.simple_preprocess(doc))
vec_lda = model[bow]
#index = similarities.MatrixSimilarity(model[corpus], similarity_type=utils.SimilariyType.COSINE)
index = similarities.MatrixSimilarity(model[corpus], similarity_type=utils.SimilariyType.Negative_KL)
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims[:10])

Hi, I was testing KL distance on a small corpus. I applied this patch to the gensim 0.9 branch, and I'm getting some nan values in the result in the Negative_KL case.
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(P || Q) for discrete distributions.

    Parameters
    ----------
    p, q : array-like, dtype=float, shape=n
        Discrete probability distributions.

    Thanks to this gist: https://gist.github.com/larsmans/3104581
    """
    p = np.asarray(p, dtype=np.float)
    q = np.asarray(q, dtype=np.float)
    sum_pq = np.sum(np.where(p != 0, p * np.log(p / q), 0))
    sum_qp = np.sum(np.where(q != 0, q * np.log(q / p), 0))
    return (sum_pq + sum_qp) / 2  # symmetric
#first doc
doc = read_texts(doc1)
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
vec_1 = model[bow]  # here 'model' is the MALLET LDA model
vec_p = np.array(vec_1[0])
#second doc
doc = read_texts(doc2)
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
vec_2 = model[bow]
vec_q = np.array(vec_2[0])
KL = kl(vec_p[:, 1], vec_q[:, 1])
print KL

sims = [(document, my_sim_fnc(document, query)) for document in index]
On Thu, Apr 10, 2014 at 10:21 AM, suvir <hitl...@gmail.com> wrote:
> No editing options in old posts in google groups, so adding here.
>
> I already have a basic KL_func that works OK for two documents, but I couldn't
> extend it to work against the whole corpus,
> i.e. given the corpus and a query, return similar documents.
> def kl(p, q):
>     """Kullback-Leibler divergence D(P || Q) for discrete distributions.
>
>     Parameters
>     ----------
>     p, q : array-like, dtype=float, shape=n
>         Discrete probability distributions.
>
>     Thanks to this gist: https://gist.github.com/larsmans/3104581
>     """
>     p = np.asarray(p, dtype=np.float)
>     q = np.asarray(q, dtype=np.float)
>     sum_pq = np.sum(np.where(p != 0, p * np.log(p / q), 0))
>     sum_qp = np.sum(np.where(q != 0, q * np.log(q / p), 0))
>     return (sum_pq + sum_qp) / 2  # symmetric
>
This isn't KL divergence anymore when you make it symmetric, right?
Suvir: here `index` is simply an iterable of documents = a corpus. For example, for Hellinger distance, you'd do:

hellinger = lambda vec1, vec2: numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
sims = [(docno, hellinger(matutils.sparse2full(query, num_topics), matutils.sparse2full(doc, num_topics))) for docno, doc in enumerate(index_corpus)]

Now that's a naive "pairwise" implementation; optimizations are another matter.

HTH,
Radim
len(corpus.dictionary.token2id.keys())

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-191-9d387738e53b> in <module>()
----> 1 execfile('hellinger_test.py')
/home/test/code/similarity/hellinger_test.py in <module>()
57
58 hellinger = lambda vec1, vec2: numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
---> 59 sims = [(docno, hellinger(matutils.sparse2full(vec_lda, lda.num_topics), matutils.sparse2full(doc, lda.num_topics))) for docno, doc in enumerate(corpus)]
60 #sims_hl = sorted(enumerate(sims_hl), key=lambda item: item[1])
61 sims = sorted(sims, key=lambda x: x[1])
/home/test/software/gensim_dev/gensim/gensim/matutils.pyc in sparse2full(doc, length)
191 doc = dict(doc)
192 # overwrite some of the zeroes with explicit values
--> 193 result[list(doc)] = list(itervalues(doc))
194 return result
195
IndexError: index 2052 is out of bounds for size 200

Thanks Radim, that worked! For a small corpus (100 docs) it works, but for a bigger corpus (50k), num_topics is not enough; I guess it can be replaced by the number of items in the dictionary. On the bigger corpus, the following works:

sims = [(docno, hellinger(matutils.sparse2full(vec_lda[0], 30000), matutils.sparse2full(doc, 30000))) for docno, doc in enumerate(corpus)]
# vec_lda[0] as lda_mallet returns the result with an extra [ ]

where 30000 is the number of dictionary items, which I had to hard-code.
I also tried Clint's test file for KL distance. It works, but when extending it to the bigger corpus, the values become inf. Log:
Regards,
Suvir
> # vec_lda[0] as lda_mallet returns the result with an extra [ ]
this number has to match the #features of your `query` = #features of your `corpus`. Don't just put whatever number there :)

I.e., for LDA models, with `query = lda_model[bow]` and `corpus = lda_model[bow_corpus]`, this is `lda_model.num_topics`.

For BOW/TF-IDF models, with `query = tfidf_model[bow]` and `corpus = tfidf_model[bow_corpus]`, it is `len(dictionary)`.

In each case, the dimensionality of your query must match the corpus you're comparing it against... otherwise it's a conceptual mismatch. You can't directly compare an LDA vector to a BOW vector. Maybe this is the cause of your issues? Can you check that both `query` and `corpus` were produced by the same pipeline?
import logging
import os
import numpy
from gensim import corpora, models, utils, matutils

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Query read function
def read_texts():
    with open('/home/sample/sample.txt', 'r') as f:
        return f.read()
# Reading the corpus
def iter_documents(reuters_dir):
    """Iterate over Reuters documents, yielding one document at a time."""
    for fname in os.listdir(reuters_dir):
        # read each document as one big string
        document = open(os.path.join(reuters_dir, fname)).read()
        # parse document into a list of utf8 tokens
        yield utils.simple_preprocess(document)
class ReutersCorpus(object):
    def __init__(self, reuters_dir):
        self.reuters_dir = reuters_dir
        self.dictionary = corpora.Dictionary(iter_documents(reuters_dir))
        # remove stopwords etc.; the default minimum is no_below=5
        self.dictionary.filter_extremes(no_below=2, keep_n=30000)

    def __iter__(self):
        for tokens in iter_documents(self.reuters_dir):
            yield self.dictionary.doc2bow(tokens)
# set up the streamed corpus
corpus = ReutersCorpus('/home/test/unsup_pre/')  # corpus of 1k documents
# train 200 LDA topics using MALLET
mallet_path = '/home/test/software/mallet/bin/mallet'
lda_model = models.LdaMallet(mallet_path, corpus, num_topics=200, id2word=corpus.dictionary)
query_text = read_texts()
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(query_text))
vec_lda = lda_model[bow]
hellinger = lambda vec1, vec2: numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
#for LDA models, with `query = lda_model[bow]` and `corpus = lda_model[bow_corpus]`, this is `lda_model.num_topics`.
sims = [(docno, hellinger(matutils.sparse2full(vec_lda, lda_model.num_topics), matutils.sparse2full(lda_model[doc], lda_model.num_topics))) for docno, doc in enumerate(corpus)]
sims = sorted(sims, key=lambda x: x[1])
print sims[:10]
[(936, 0.0),
(325, 0.66297567852570571),
(485, 0.66387883058569441),
(188, 0.66500773755211373),
(823, 0.66597174894280486),
(796, 0.67776506777695011),
(618, 0.67842412143836928),
(881, 0.68023703272641944),
(661, 0.68083020321217136),
(867, 0.68430156575065204)]
With logging enabled, I can actually see the slow speed of the pairwise document comparison and the need for optimization here. If I manage to optimize it, I will update here. Is there any doc/link I should look at for this kind of optimization work?
In case I wasn't clear: if your corpus is small (fits in RAM) and static, you can precompute

index = numpy.sqrt(corpus2dense(lda[corpus], lda.num_topics).T)

and then for queries:

q = numpy.sqrt(sparse2full(lda[query], lda.num_topics))
sims = numpy.sqrt(0.5 * numpy.sum((q - index)**2, axis=1))

(not tested, but that's the idea)

-rr
index = numpy.sqrt(matutils.corpus2dense(lda_model[corpus], lda_model.num_topics).T)
q = numpy.sqrt(matutils.sparse2full(vec_lda, lda_model.num_topics))
sims_opt = numpy.sqrt(0.5 * numpy.sum((q - index)**2, axis=1))
sims_opt = sorted(sims_opt)
print sims_opt[:10]
[0.076513745,
0.66584373,
0.66920334,
0.67540497,
0.67766148,
0.67878348,
0.68208182,
0.68379384,
0.68432146,
0.68683475]

index = numpy.sqrt(matutils.corpus2dense((lda_model[doc] for docno, doc in enumerate(corpus)), lda_model.num_topics).T)
q = numpy.sqrt(matutils.sparse2full(vec_lda, lda_model.num_topics))
sims_opt = numpy.sqrt(0.5 * numpy.sum((q - index)**2, axis=1))
sims_opt = sorted(sims_opt)
print sims_opt[:10]
[0.0,
0.66941172,
0.66976386,
0.67768055,
0.67840725,
0.68287355,
0.68313229,
0.6832425,
0.68588322,
0.68718135]

ok, so this works in a fast way, but the result is not correct:

for docno, doc in enumerate(corpus):
    some_temp = lda_model[doc]      # <-- slow but correct

vs

some_temp = lda_model[corpus]       # <-- fast but not correct
If MALLET returns something else when inferring a whole corpus vs. inferring one document at a time from that same corpus, then that's a bug. I'm not sure why that happens, and I won't have time this week to investigate. Can you open an issue on github, so I don't forget?

Of course, if you yourself get a chance to find out why the MALLET wrapper (or MALLET itself?) does that, that would be perfect. The wrapper is fairly trivial: https://github.com/piskvorky/gensim/blob/develop/gensim/models/ldamallet.py#L170
Cheers,
Radim
In [433]: some_temp = lda_model[corpus]
2014-04-15 15:04:49,104 : INFO : serializing temporary corpus to /tmp/3cbda2_corpus.txt
2014-04-15 15:04:51,456 : INFO : converting temporary corpus to MALLET format with /home/test/software/mallet/bin/mallet import-file --keep-sequence --remove-stopwords --token-regex '\S+' --input /tmp/3cbda2_corpus.txt --output /tmp/3cbda2_corpus.mallet.infer --use-pipe-from /tmp/3cbda2_corpus.mallet
Rewriting extended pipe from /tmp/3cbda2_corpus.mallet
Instance ID = b97159eb87b2eb32:366d52c0:14564908883:-7ff9
2014-04-15 15:04:52,578 : INFO : inferring topics with MALLET LDA '/home/test/software/mallet/bin/mallet infer-topics --input /tmp/3cbda2_corpus.mallet.infer --inferencer /tmp/3cbda2_inferencer.mallet --output-doc-topics /tmp/3cbda2_doctopics.txt.infer --num-iterations 100'
In [434]: %paste
for docno, doc in enumerate(corpus):
some_temp = lda_model[doc]
## -- End pasted text --
2014-04-15 15:05:28,265 : INFO : serializing temporary corpus to /tmp/3cbda2_corpus.txt
2014-04-15 15:05:28,266 : INFO : converting temporary corpus to MALLET format with /home/test/software/mallet/bin/mallet import-file --keep-sequence --remove-stopwords --token-regex '\S+' --input /tmp/3cbda2_corpus.txt --output /tmp/3cbda2_corpus.mallet.infer --use-pipe-from /tmp/3cbda2_corpus.mallet
Rewriting extended pipe from /tmp/3cbda2_corpus.mallet
Instance ID = b97159eb87b2eb32:366d52c0:14564908883:-7ff9
2014-04-15 15:05:29,172 : INFO : inferring topics with MALLET LDA '/home/test/software/mallet/bin/mallet infer-topics --input /tmp/3cbda2_corpus.mallet.infer --inferencer /tmp/3cbda2_inferencer.mallet --output-doc-topics /tmp/3cbda2_doctopics.txt.infer --num-iterations 100'
2014-04-15 15:05:29,897 : INFO : serializing temporary corpus to /tmp/3cbda2_corpus.txt
2014-04-15 15:05:29,898 : INFO : converting temporary corpus to MALLET format with /home/test/software/mallet/bin/mallet import-file --keep-sequence --remove-stopwords --token-regex '\S+' --input /tmp/3cbda2_corpus.txt --output /tmp/3cbda2_corpus.mallet.infer --use-pipe-from /tmp/3cbda2_corpus.mallet
Rewriting extended pipe from /tmp/3cbda2_corpus.mallet
Instance ID = b97159eb87b2eb32:366d52c0:14564908883:-7ff9
2014-04-15 15:05:30,687 : INFO : inferring topics with MALLET LDA '/home/test/software/mallet/bin/mallet infer-topics --input /tmp/3cbda2_corpus.mallet.infer --inferencer /tmp/3cbda2_inferencer.mallet --output-doc-topics /tmp/3cbda2_doctopics.txt.infer --num-iterations 100'
----------------------
goes on like this for the rest of the 1k docs in the corpus.
Ctrl-C: KeyboardInterrupt
> Can you open an issue on github, so I don't forget?
lda_model = models.ldamodel.LdaModel(corpus=corpus, id2word=corpus.dictionary, num_topics=200, update_every=1, chunksize=10000, passes=100, alpha='auto')  # tried between 20 and 100 passes for the 1k-doc corpus
In [90]: sims[:10]
Out[90]:
[(936, 5.0610560625263156e-06),  # this should have been 0, but such a small value is OK too
(591, 0.49706860375251177),
(68, 0.64225190296834078),
(575, 0.65321187258382452),
(883, 0.76119358837155626),
(921, 0.79946005153274213),
(847, 0.80047784035037362),
(974, 0.80508864690650472),
(199, 0.8072571326632092),
(97, 0.81616722148694387)]