Document x document similarity matrix with BM25 in gensim

Joshua Eykens

Jan 29, 2021, 6:03:20 AM

to gen...@googlegroups.com

Hi there,

For a text clustering project I'd like to use the BM25 ranking function to get a document x document similarity matrix.

Although BM25 was designed for relevance ranking, many studies have shown that it also works well for document clustering.

I've been playing around with the functions available in gensim a bit, but I'm not sure I'm doing things correctly.

from gensim.summarization.bm25 import BM25, get_bm25_weights

# first step: create corpus

# df['tok_text'] is a column holding a list of preprocessed tokens for each document

corpus = df['tok_text'].values.tolist()

# the corpus is a list of lists now

# now, for the doc x doc similarity matrix, I call get_bm25_weights on the corpus. This takes quite a while but seems to work. To be honest, I'm not sure this is correct.

X_sims = get_bm25_weights(corpus)

# to speed things up, I'm now trying the following

bm25 = BM25(corpus)

X_sims = [bm25.get_scores([term]) for term in set.union(*[set(s) for s in corpus])]

# but this somehow yields a list of lists that is far too small.
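
To make the goal concrete, here is a minimal, self-contained sketch of the kind of matrix I have in mind: every document is used in turn as the query and scored against every document in the corpus. It assumes the standard Okapi BM25 formula with a smoothed, non-negative IDF. The helper name bm25_doc_matrix and the parameter values k1=1.5 and b=0.75 are illustrative assumptions, not gensim's API or necessarily its defaults, and the sketch assumes a non-empty corpus of non-empty token lists.

```python
import math
from collections import Counter


def bm25_doc_matrix(corpus, k1=1.5, b=0.75):
    """Return an n x n list of lists of BM25 scores.

    Row i holds the score of every document when document i is the query.
    Hypothetical helper (not part of gensim); k1 and b are assumed values.
    """
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    term_freqs = [Counter(doc) for doc in corpus]

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())

    # Smoothed IDF; the "1 +" inside the log keeps every value non-negative.
    idf = {
        term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for term, df in doc_freq.items()
    }

    matrix = []
    for query in corpus:
        row = []
        for tf, doc in zip(term_freqs, corpus):
            # Length normalisation term for the document being scored.
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            score = 0.0
            # Repeated query tokens contribute once per occurrence.
            for term in query:
                f = tf[term]  # Counter returns 0 for unseen terms
                if f:
                    score += idf[term] * f * (k1 + 1) / (f + norm)
            row.append(score)
        matrix.append(row)
    return matrix


# Tiny toy corpus (made up for illustration) in the same list-of-lists shape.
toy_corpus = [["a", "b"], ["a", "c"], ["d", "e"]]
toy_sims = bm25_doc_matrix(toy_corpus)
```

Note that a matrix built this way is not symmetric in general, because the score depends on which document plays the role of the query, so for clustering it may need to be symmetrised first, for example by averaging it with its transpose.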

Any ideas? How would you go about getting a BM25 doc x doc similarity matrix?

Best regards and many thanks in advance,

Joshua Eykens