Document Pairwise Similarity using existing LSA/LSI space

Schlomo Goldstein

Oct 5, 2021, 8:58:56 AM
to Gensim
Hi everyone! I am working on a project using LSA/LSI and I would like to know how to efficiently compute pairwise (cosine) similarities.

I trained LSA/LSI on a big corpus, and I now want to compare 120 other documents (not in that corpus) to each other pairwise, using the LSA space built from the big corpus.

So far I can't make it work. Any help would be much appreciated!

Radim Řehůřek

Oct 7, 2021, 3:31:58 PM
to Gensim
Hi Schlomo,

What part of the tutorial did you get stuck on?

Conceptually, you would:

1. train your LSI model (big corpus)
2. build a MatrixSimilarity object on the small 120-document corpus (transformed to LSI space using the model from 1)
3. run `for sims in my_index: …` to compute the all-against-all pairwise similarities between the 120 documents.
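As a rough illustration of what step 3 produces: once the documents are dense vectors in LSI space, the all-against-all cosine similarities are dot products between length-normalized vectors. Below is a minimal NumPy sketch of that idea. The vectors are made up, and `doc_vectors` and `pairwise_cosine` are hypothetical names for illustration, not part of the Gensim API.

```python
import numpy as np


def pairwise_cosine(doc_vectors):
    """Return the n x n matrix of cosine similarities between row vectors."""
    vectors = np.asarray(doc_vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = vectors / norms
    return unit @ unit.T


# Made-up 3-dimensional "LSI" vectors for four documents (hypothetical data).
doc_vectors = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first document
    [0.0, 3.0, 0.0],  # orthogonal to the first two documents
    [1.0, 1.0, 0.0],
]
sims = pairwise_cosine(doc_vectors)
```

Row i of `sims` holds the similarities of document i against every document, which is the kind of vector each loop iteration in step 3 hands you.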

Hope that helps!

Schlomo Goldstein

Oct 11, 2021, 1:35:22 PM
to Gensim
Hi Radim,

Thank you very much for answering. I am stuck on step 2, since I am not quite sure how MatrixSimilarity works and which parameters I should give it.
I would like to create a matrix consisting of the pairwise similarities of all 120 documents (each is a row in my csv file that I pre-process first).

I have the following (simplified) code:

import pandas as pd
import gensim
from gensim.parsing.preprocessing import preprocess_documents
import os
from gensim import models
import pprint
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

df = pd.read_csv('osf_app_data.csv',encoding="ISO-8859-1")

text_corpus = df['sorted_text'].values #reading a column from csv file
processed_corpus = preprocess_documents(text_corpus)
dictionary = gensim.corpora.Dictionary(processed_corpus)

bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
tfidf = gensim.models.TfidfModel(bow_corpus, smartirs='npu')
corpus_tfidf = tfidf[bow_corpus]
lsi = gensim.models.LsiModel(corpus_tfidf, num_topics=500)
index = gensim.similarities.MatrixSimilarity(lsi[corpus_tfidf])

dataframe = pd.read_csv('120novel_extracts.csv') # I want each row to be considered a separate document
extracts = dataframe['text'].values

processed_extracts = preprocess_documents(extracts)
dict2 = gensim.corpora.Dictionary(processed_extracts)
bow_extracts = [dict2.doc2bow(text) for text in processed_extracts]

#index2 = gensim.similarities.MatrixSimilarity(lsi[bow_extracts])#using big corpus lsi space
#index2 = Similarity(lsi[bow_extracts], bow_extracts, num_features= len(dict2))

for similarity in index2:

I get various errors for the "index2" step including:
AttributeError: 'TransformedCorpus' object has no attribute 'endswith' (when running  index2 = Similarity(lsi[bow_extracts], bow_extracts, num_features= len(dict2)))

I'm not sure I understand the Similarity object. Could you please let me know where I went wrong/explain how Similarity works in this case? Maybe I'm doing other things incorrectly too? I'm still very much a beginner.

Thank you very much in advance for taking the time to help!

Radim Řehůřek

Oct 11, 2021, 4:41:16 PM
to Gensim
Hi Schlomo,

I cannot comment on the pandas stuff (I always found that lib more confusing than helpful), but you want just one dictionary, not two:

1) Train your dictionary & TFIDF & LSI from your big corpus
2) Transform your small corpus to LSI space using the dictionary & models from 1)
3) Build MatrixSimilarity from the transformed corpus from 2)

So in code, something like this:

# Train models from big corpus
corpus_big = …
dictionary = gensim.corpora.Dictionary(corpus_big)
bow_big = [dictionary.doc2bow(text) for text in corpus_big]
tfidf = gensim.models.TfidfModel(bow_big, smartirs='npu')
tfidf_big = tfidf[bow_big]
lsi = gensim.models.LsiModel(tfidf_big, num_topics=500)

# Calculate pairwise similarities from another, small corpus
corpus_small = …
bow_small = [dictionary.doc2bow(text) for text in corpus_small]
index = gensim.similarities.MatrixSimilarity(lsi[tfidf[bow_small]], num_features=lsi.num_topics)
for similarity in index:
    …  # `similarity` holds one document's similarities to all documents in the index

Plus enable logging and keep an eye on the logs. If you're getting AttributeErrors it probably means you're passing strings where a list of tokens is expected, or vice versa. Check the tutorials for the correct input structure.

Hope that helps,

Schlomo Goldstein

Oct 12, 2021, 4:49:20 PM
to Gensim
Thank you very much for explaining, Radim! The pairwise similarity computations are working very well now.

I also realised that I was getting other errors due to encoding issues. For example, strings like "others\x92" appear, and the stray \x92 might influence the model, so I will remove these characters. Is there a way to do this in gensim? Otherwise I could try something like this to remove them (s being a Python 3 str):

s.encode('ascii', errors='ignore').decode('ascii')

Thank you very much for your help!

Radim Řehůřek

Oct 13, 2021, 2:24:57 PM
to Gensim
Yeah, text decoding/encoding happens outside of Gensim, that's up to you. But bad preprocessing or tokenization can definitely mess up your pipeline, so be careful there!
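A stray \x92 is typically what shows up when text saved as Windows-1252 (where byte 0x92 is the curly apostrophe) is read as ISO-8859-1. Assuming that is what happened with this CSV (a guess, since the file itself isn't shown), the plain-Python sketch below contrasts the two decodings and the ASCII-stripping fallback. The sample bytes are made up.

```python
# Bytes as they might appear in a Windows-1252 encoded file (hypothetical sample).
raw = b"others\x92 views"

# Decoding as ISO-8859-1 maps byte 0x92 to an invisible C1 control character.
as_latin1 = raw.decode("iso-8859-1")

# Decoding as Windows-1252 maps the same byte to a right single quotation mark.
as_cp1252 = raw.decode("cp1252")

# Fallback: drop every non-ASCII character from an already-decoded string.
ascii_only = as_latin1.encode("ascii", errors="ignore").decode("ascii")
```

If the guess holds, passing `encoding="cp1252"` to `pd.read_csv` instead of `"ISO-8859-1"` should keep the apostrophes intact at load time. The ASCII fallback is simpler but also throws away legitimate accented characters.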

I always recommend checking the log – Gensim will try to log samples of what it sees at specific points (e.g. when training a dictionary). That way you can at least eyeball the values, to make sure they make sense and you're not pushing nonsense through your pipeline.
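Besides reading the log, a quick programmatic check can flag obviously broken tokens before they reach the dictionary. The sketch below is plain Python, not a Gensim feature, and `suspicious_tokens` is a hypothetical helper name. It lists tokens that contain control characters such as the \x92 mentioned earlier in the thread.

```python
import unicodedata


def suspicious_tokens(tokenized_docs):
    """Collect tokens that contain control characters (Unicode category 'Cc')."""
    found = set()
    for tokens in tokenized_docs:
        for token in tokens:
            if any(unicodedata.category(ch) == "Cc" for ch in token):
                found.add(token)
    return sorted(found)


# Hypothetical tokenized documents, one of them carrying a stray \x92.
docs = [["others\x92", "view"], ["novel", "extract"]]
bad = suspicious_tokens(docs)
```

An empty result doesn't prove the preprocessing is right, but a non-empty one is a cheap early warning.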

