Document Pairwise Similarity using existing LSA/LSI space

Schlomo Goldstein

Oct 5, 2021, 8:58:56 AM
to Gensim
Hi everyone! I am working on a project using LSA/LSI and I would like to know how to efficiently compute pairwise (cosine) similarities.

I have a big corpus that I trained LSA/LSI on, and I want to compare 120 other documents (not in the corpus) to each other pairwise, using the LSA space I created from the big corpus.

I tried following the tutorial, but I can't make it work. Any help would be much appreciated!

Radim Řehůřek

Oct 7, 2021, 3:31:58 PM
to Gensim
Hi Schlomo,

what part of the tutorial did you get stuck on?

Conceptually, you would:

1. train your LSI model (big corpus)
2. build a MatrixSimilarity object on the small 120 document corpus (transformed to LSI space using the model from 1)
3. run `for sims in my_index: …` to compute the all-against-all pairwise similarities between the 120 documents.
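
To make step 3 concrete, here is a dependency-free sketch of what an all-against-all cosine comparison computes: each vector is scaled to unit length, and every pair is compared with a dot product. The helper names (`unit`, `pairwise_cosine`) and the three toy vectors are made up for illustration; this is not Gensim's own code, only the same arithmetic in plain Python.

```python
import math


def unit(vec):
    """Scale a dense vector to unit L2 length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def pairwise_cosine(vectors):
    """Return the full N x N matrix of cosine similarities as nested lists."""
    units = [unit(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, v)) for v in units] for u in units]


# Hypothetical: three documents already projected into a 2-topic space.
docs_lsi = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sims = pairwise_cosine(docs_lsi)
for row in sims:
    print([round(s, 3) for s in row])
```

Each printed row plays the role of one `sims` value in the loop from step 3: the similarities of one document against all the others, itself included.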

Hope that helps!
Radim

Schlomo Goldstein

Oct 11, 2021, 1:35:22 PM
to Gensim
Hi Radim,

Thank you very much for answering. I am stuck on step 2, since I am not quite sure how MatrixSimilarity works or which parameters I should give it.
I would like to create a matrix consisting of the pairwise similarities of all 120 documents (each is a row in my csv file that I pre-process first).

I have the following (simplified) code:

import pandas as pd
import gensim
from gensim.parsing.preprocessing import preprocess_documents
import os
from gensim import models
import pprint
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

df = pd.read_csv('osf_app_data.csv',encoding="ISO-8859-1")

text_corpus = df['sorted_text'].values #reading a column from csv file
processed_corpus = preprocess_documents(text_corpus)
dictionary = gensim.corpora.Dictionary(processed_corpus)

bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
tfidf = gensim.models.TfidfModel(bow_corpus, smartirs='npu')
corpus_tfidf = tfidf[bow_corpus]
lsi = gensim.models.LsiModel(corpus_tfidf, num_topics=500)
index = gensim.similarities.MatrixSimilarity(lsi[corpus_tfidf])

dataframe = pd.read_csv('120novel_extracts.csv') # I want each row to be considered a separate document
extracts = dataframe['text'].values

processed_extracts = preprocess_documents(extracts)
dict2 = gensim.corpora.Dictionary(processed_extracts)
bow_extracts = [dict2.doc2bow(text) for text in processed_extracts]

#index2 = gensim.similarities.MatrixSimilarity(lsi[bow_extracts])#using big corpus lsi space
#index2 = Similarity(lsi[bow_extracts], bow_extracts, num_features= len(dict2))

for similarity in index2:
    print(similarity)

--------------------------------------------
I get various errors for the "index2" step including:
AttributeError: 'TransformedCorpus' object has no attribute 'endswith' (when running  index2 = Similarity(lsi[bow_extracts], bow_extracts, num_features= len(dict2)))

I'm not sure I understand the Similarity object. Could you please let me know where I went wrong/explain how Similarity works in this case? Maybe I'm doing other things incorrectly too? I'm still very much a beginner.

Thank you very much in advance for taking the time to help!

Radim Řehůřek

Oct 11, 2021, 4:41:16 PM
to Gensim
Hi Schlomo,

I cannot comment on the pandas stuff (I always found that lib more confusing than helpful), but you want just one dictionary, not two:

1) Train your dictionary & TFIDF & LSI from your big corpus
2) Transform your small corpus to LSI space using the dictionary & models from 1)
3) Build MatrixSimilarity from the transformed corpus from 2)

So in code, something like this:

import gensim

# Train the dictionary and models on the big corpus
corpus_big = …  # tokenized documents: one list of tokens per document
dictionary = gensim.corpora.Dictionary(corpus_big)
bow_big = [dictionary.doc2bow(text) for text in corpus_big]
tfidf = gensim.models.TfidfModel(bow_big, smartirs='npu')
tfidf_big = tfidf[bow_big]
lsi = gensim.models.LsiModel(tfidf_big, num_topics=500)

# Calculate pairwise similarities for another, small corpus,
# reusing the dictionary and models trained above
corpus_small = …  # tokenized documents, same format as corpus_big
bow_small = [dictionary.doc2bow(text) for text in corpus_small]
index = gensim.similarities.MatrixSimilarity(lsi[tfidf[bow_small]], num_features=lsi.num_topics)
for similarity in index:  # one array of similarities per document
    print(similarity)

Plus enable logging and keep an eye on the logs. If you're getting AttributeErrors it probably means you're passing strings where a list of tokens is expected, or vice versa. Check the tutorials for the correct input structure.

Hope that helps,
Radim




Schlomo Goldstein

Oct 12, 2021, 4:49:20 PM
to Gensim
Thank you very much for explaining, Radim! The pairwise similarity computations are working very well now.

I also realised that I was getting other errors due to encoding issues. For example strings like "others\x92" appear, where \x92 might influence the model, so I will remove these. Is there a way to do this in gensim? Otherwise I could try something like this to remove them:

s.encode('ascii', errors='ignore').decode('ascii')

Thank you very much for your help!

Radim Řehůřek

Oct 13, 2021, 2:24:57 PM
to Gensim
Yeah, text decoding/encoding happens outside of Gensim, that's up to you. But bad preprocessing or tokenization can definitely mess up your pipeline, so be careful there!
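
Since decoding is left to the caller, here is a small standard-library sketch of two ways to deal with a stray `\x92`. It rests on an assumption that cannot be confirmed from this thread: that the CSV was written as Windows-1252, where byte 0x92 is a right single quotation mark, while ISO-8859-1 maps the same byte to an invisible control character. The sample bytes are made up.

```python
# Hypothetical bytes as they might sit in the CSV file.
raw = b"others\x92 work"

# ISO-8859-1 maps byte 0x92 to the C1 control character U+0092.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 maps the same byte to U+2019, a right single quotation mark.
as_cp1252 = raw.decode("cp1252")

# Dropping everything outside ASCII from a Python 3 str (str -> bytes -> str).
ascii_only = as_latin1.encode("ascii", errors="ignore").decode("ascii")

print(repr(as_latin1), repr(as_cp1252), repr(ascii_only))
```

If the assumption holds, reading the file with `encoding="cp1252"` instead of `"ISO-8859-1"` would keep the apostrophe as a real character, and the tokenizer could then treat it as ordinary punctuation. Stripping to ASCII also works, but it discards any legitimate accented letters along the way.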

I always recommend checking the log – Gensim will try to log samples of what it sees at specific points (e.g. when training a dictionary). That way you can at least eyeball the values, to make sure they make sense and you're not pushing nonsense through your pipeline.

Best,
Radim



