Hi there,
For a text clustering project, I'd like to use the BM25 ranking function to build a document x document similarity matrix.
Although BM25 was designed for relevance ranking, several studies have reported that it also works well for document clustering.
I've been playing around with the functions available in gensim a bit, but I'm not sure I'm doing things correctly.
from gensim.summarization.bm25 import BM25, get_bm25_weights
# first step: create corpus
# df['tok_text'] is a column holding a list of preprocessed tokens for each document
corpus = df['tok_text'].values.tolist()
# corpus is now a list of token lists, one per document
# Now, for the doc x doc similarity matrix, I call get_bm25_weights on the corpus. This takes quite a while but seems to work, although I'm not sure it is the correct approach.
X_sims = get_bm25_weights(corpus)
# To speed things up, I'm now trying the following:
bm25 = BM25(corpus)
X_sims = [bm25.get_scores([term]) for term in set.union(*[set(s) for s in corpus])]
# but this somehow yields a list of lists which is way too small.
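To make clear what shape I'm after, here is a minimal sketch in plain Python (no gensim) of the matrix I expect: entry [i][j] is the BM25 score of document j when document i is used as the query. The helper name bm25_matrix, the parameter values k1=1.5 and b=0.75, and the non-negative IDF variant are my own assumptions for illustration, not gensim's API, so the numbers will not necessarily match get_bm25_weights.

```python
import math
from collections import Counter


def bm25_matrix(corpus, k1=1.5, b=0.75):
    """Return an n x n list of lists; entry [i][j] scores doc j against query doc i.

    Hypothetical helper for illustration only. k1 and b are assumed defaults.
    """
    n = len(corpus)
    doc_lens = [len(doc) for doc in corpus]
    avgdl = sum(doc_lens) / n  # average document length (corpus must be non-empty)
    term_freqs = [Counter(doc) for doc in corpus]

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())

    # Assumed IDF variant with "+1" inside the log, so weights are never negative.
    idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in doc_freq.items()}

    sims = [[0.0] * n for _ in range(n)]
    for i, query in enumerate(corpus):
        for j, tf in enumerate(term_freqs):
            # Length normalisation term for the scored document j.
            norm = k1 * (1 - b + b * doc_lens[j] / avgdl)
            score = 0.0
            for term in query:  # repeated query tokens contribute repeatedly
                f = tf.get(term, 0)
                if f:
                    score += idf[term] * f * (k1 + 1) / (f + norm)
            sims[i][j] = score
    return sims


# Toy corpus: a list of token lists, same layout as my real corpus.
toy_corpus = [["cat", "sat", "mat"], ["cat", "sat", "hat"], ["dog", "ran", "far"]]
toy_sims = bm25_matrix(toy_corpus)
```

Note that the result has one row per document, not one row per vocabulary term, and that it is not symmetric in general, because the length normalisation only applies to the document being scored.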
Any ideas? How would you go about getting a BM25 doc x doc similarity matrix?
Best regards and many thanks in advance,