BM25 of gensim with the classifier from sklearn

haru...@gmail.com

Feb 12, 2019, 9:32:55 AM

to Gensim

How can I use gensim's BM25 with an sklearn classifier such as Naive Bayes or LinearSVC for text classification in Python?

I am new to this field, so please guide me through the following issues. Let me know if you have any experience with this.

Tutorial code for BM25 is:

from gensim.summarization.bm25 import get_bm25_weights
corpus = [
     ["black", "cat", "white", "cat"],
     ["cat", "outer", "space"],
     ["wag", "dog"]
]
result = get_bm25_weights(corpus, n_jobs=-1)

The output of the above code is in this format:

[[1.1237959024144617, 0.1824377227735681, 0], [0.11770175662810844, 1.1128701089187656, 0], [0, 0, 1.201942644155272]]

Implementation so far:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold

from gensim import corpora
from gensim.summarization.bm25 import get_bm25_weights

short_pos = open("pos.txt","r").read()
short_neg = open("neg.txt","r").read()
documents = []

# split at each sentence
for r in short_pos.split('\n'):
    r= r.rstrip()
    documents.append(r)
for r in short_neg.split('\n'):
    r= r.rstrip()
    documents.append(r)

#Stratified 10-cross fold validation with SVM
labels = np.zeros(200);
labels[0:100]=1;
labels[100:200]=0;

kf = StratifiedKFold(n_splits=10)

totalsvm = 0                            # Accuracy measure on 200 text
totalMatSvm = np.zeros((2,2)); # Confusion matrix on 200 text

# Point 1 => Converting into token of words and the computation of BM25 ranking
texts = [[word for word in document.lower().split()]
          for document in documents]
corpus = corpora.Dictionary(texts)
result = get_bm25_weights(corpus, n_jobs=-1)    # Point 1-1

# Error at Point 1-1 => TypeError: object of type 'int' has no len()
print result
# End of Point 1

# Point 2
X_train = [texts[i] for i in train_index]
X_test = [texts[i] for i in test_index]
y_train, y_test = labels[train_index], labels[test_index]
# Point 2

# Point 3 => Implementation with TfidfVectorizer of sklearn with LinearSVC of sklearn
for train_index, test_index in kf.split(corpus,labels):
    X_train = [corpus[i] for i in train_index]
    X_test = [corpus[i] for i in test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    vectorizer = TfidfVectorizer()
    train_corpus_tf_idf = vectorizer.fit_transform(X_train)
    test_corpus_tf_idf = vectorizer.transform(X_test)

    model1 = LinearSVC()
    model1.fit(train_corpus_tf_idf,y_train)
    result1 = model1.predict(test_corpus_tf_idf)

    totalMatSvm = totalMatSvm + confusion_matrix(y_test, result1)
    totalsvm = totalsvm+sum(y_test==result1)

print totalMatSvm, totalsvm/200.0

# End of Point 3

How do I:

1. fix the error at Point 1-1?

2. assign the BM25 ranking at Point 2?

haru...@gmail.com

Feb 12, 2019, 10:15:23 AM

to Gensim

The error at Point 1-1 is fixed with:

result = get_bm25_weights(corpus[1], n_jobs=-1)

The output of result looks like this:

[[1.6094379124341003, 0, 0, 0, 0, 0, 0, 0], [0, 1.6094379124341003, 0, 0, 0, 0, 0, 0], [0, 0, 1.6094379124341003, 0, 0, 0, 0, 0], [0, 0, 0, 0.9555114450274363, 0, 0, 0, 0.9555114450274363], [0, 0, 0, 0, 1.6094379124341003, 0, 0, 0], [0, 0, 0, 0, 0, 1.6094379124341003, 0, 0], [0, 0, 0, 0, 0, 0, 1.6094379124341003, 0], [0, 0, 0, 0.9555114450274363, 0, 0, 0, 0.9555114450274363]]

Now, how do I feed these values to the classifier as its training and testing data, instead of the TF-IDF values from sklearn, at Point 2 and Point 3?

Александр Менщиков

Feb 12, 2019, 9:21:04 PM

to Gensim

Hi,

How to
1. fix error at Point 1-1

Why do you need a Dictionary? BM25 takes a list of documents, where each document is a plain list of words. `gensim.corpora.Dictionary` seems to be a simple map between ids and words, i.e.

from gensim import corpora

d = corpora.Dictionary([['black', 'cat', 'white', 'cat']])
d[1]  # gives us 'cat', for example
list(d)  # gives us the token ids, [0, 1, 2]

Also note that `result = get_bm25_weights(corpus[1], n_jobs=-1)` will not do what you want, because `corpus[1]` is just one word. Each character of that word (a text of length 1) is consumed as a separate document, which is why you get one row per character. So, I think Point 1 should look like:

texts = [document.lower().split() for document in documents]
result = get_bm25_weights(texts, n_jobs=-1)
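
To illustrate why a single word produces one row per character, here is a small plain-Python sketch. It does not call gensim; it only mimics how a function that expects a list of token lists would walk over its input. The word `movie` and the two sample sentences are hypothetical stand-ins for your data.

```python
# Hypothetical stand-in for corpus[1]: a single token, not a list of documents.
single_word = "movie"

# Iterating over a string yields its characters, so each character
# ends up being treated as its own one-token "document".
as_documents = [list(doc) for doc in single_word]
print(as_documents)  # [['m'], ['o'], ['v'], ['i'], ['e']]

# The intended input: one list of tokens per document.
documents = ["A great movie", "Not a great plot"]
texts = [document.lower().split() for document in documents]
print(texts)  # [['a', 'great', 'movie'], ['not', 'a', 'great', 'plot']]
```

With the tokenized `texts`, the number of rows in the result matches the number of documents rather than the number of characters in one word.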

I'm not sure about your second question, so maybe I will answer you a bit later.
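
In the meantime, one possible direction for the second question, as a rough sketch only: `get_bm25_weights` returns a document-by-document score matrix, which does not have the documents × terms shape that a TF-IDF matrix has. An alternative is to compute a BM25 weight per vocabulary term for each document. The class `BM25Features`, its `fit`/`transform` methods, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant below are all illustrative assumptions, not part of gensim or sklearn.

```python
import math
from collections import Counter


class BM25Features:
    """Hypothetical helper: BM25 term weights as per-document feature rows."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b

    def fit(self, token_lists):
        n_docs = len(token_lists)
        # Document frequency of every term seen in the training documents.
        doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
        self.vocab = {term: i for i, term in enumerate(sorted(doc_freq))}
        # Assumed IDF variant, smoothed so that it never goes negative.
        self.idf = {t: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
                    for t, df in doc_freq.items()}
        self.avg_len = sum(len(tokens) for tokens in token_lists) / n_docs
        return self

    def transform(self, token_lists):
        rows = []
        for tokens in token_lists:
            row = [0.0] * len(self.vocab)
            # Length normalization relative to the average training document.
            norm = self.k1 * (1 - self.b + self.b * len(tokens) / self.avg_len)
            for term, tf in Counter(tokens).items():
                if term in self.vocab:  # terms unseen in training are ignored
                    weight = self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
                    row[self.vocab[term]] = weight
            rows.append(row)
        return rows


# Toy data reusing the tutorial tokens; the train/test split is made up.
train_texts = [["black", "cat", "white", "cat"], ["cat", "outer", "space"]]
test_texts = [["wag", "dog", "cat"]]

bm25 = BM25Features().fit(train_texts)
X_train = bm25.transform(train_texts)
X_test = bm25.transform(test_texts)
print(len(X_train), len(X_train[0]), len(X_test[0]))
```

Inside the loop at Point 3, rows built this way could take the place of the TF-IDF matrices, e.g. `model1.fit(X_train, y_train)` followed by `model1.predict(X_test)`, with `fit` called on the tokenized training fold only so that test documents do not influence the IDF statistics.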
