Weird path issues when trying to use a saved soft cosine similarity object in a python project

Sugandh

Apr 21, 2021, 6:30:07 PM
to Gensim
Hi, 

I am facing a weird problem whenever I try to use a saved soft cosine similarity object in a Python project. The project is structured as follows (data, fasttext, and word2vec are directories):

❯ tree
.
├── data
│   ├── corpus.mm
│   ├── corpus.mm.index
│   ├── dict.sav
│   ├── fasttext
│   │   ├── fasttext.sav
│   │   ├── fasttext.sav.trainables.vectors_ngrams_lockf.npy
│   │   ├── fasttext.sav.wv.vectors_ngrams.npy
│   │   └── ft_docsim_index.sav
│   ├── tfidf.sav
│   ├── word2vec
│   │   ├── w2v_docsim_index.sav
│   │   └── word2vec.sav
│   └── zipped_data.sav.npy
└── test.py

The contents of test.py are below. It returns a list of dictionaries, each mapping a document number (the key) to its similarity score (the value):
from operator import itemgetter
import os.path
import os
import gensim
from gensim import corpora, utils
from gensim.models import TfidfModel
from gensim.similarities import SoftCosineSimilarity
import numpy as np

data_path = os.path.abspath(os.path.dirname(__file__))
print(data_path)
data_list = np.load(os.path.join(data_path, "data/zipped_data.sav.npy"))

def text_processing(document):
    # lowercase and tokenize
    texts = utils.simple_preprocess(document)
    return texts


def softcossim(query):
    # Compute Soft Cosine Measure between the query and the documents.
    dictionary = corpora.Dictionary.load_from_text(os.path.join(data_path, "data/dict.sav"))
    corpus = corpora.MmCorpus(os.path.join(data_path, "data/corpus.mm"))
    tfidf = TfidfModel.load(os.path.join(data_path, "data/tfidf.sav"))
    # docsim_index = SoftCosineSimilarity.load(os.path.join(data_path, "../data/fasttext/ft_docsim_index.sav")
    docsim_index = SoftCosineSimilarity.load(os.path.join(data_path, "data/word2vec/w2v_docsim_index.sav"))
    query = tfidf[dictionary.doc2bow(query.lower().split())]
    similarities = docsim_index[query]
    return similarities

def rev_results(x):
    return sorted(x, key=itemgetter(0), reverse=True)

def gen_search_results(query):
    cos_sim = softcossim(query)
    unsorted_sim = []
    search_res = []
    if query is None or query == "":
        return []
    elif len(np.atleast_1d(cos_sim)) == 1:
        return []
    elif len(cos_sim) > 0:
        for i in range(len(cos_sim)):
            unsorted_sim.append((cos_sim[i], i))
        sorted_sim = rev_results(unsorted_sim)
        for i in range(len(sorted_sim)):
            if sorted_sim[i][0] > .2:
                search_res.append({data_list[sorted_sim[i][1]][0]: str(sorted_sim[i][0])})
        return search_res
    else:
        return []

r = gen_search_results("textile dyeing")
print(r)


The issue is that if I run 'python test.py' from the root of the project, I get an error about a missing 'corpus.mm'. The full traceback is below:
Traceback (most recent call last):
  File "test.py", line 55, in <module>
    r = gen_search_results("textile dyeing")
  File "test.py", line 37, in gen_search_results
    cos_sim = softcossim(query)
  File "test.py", line 28, in softcossim
    similarities = docsim_index[query]
  File "/home/sug/.pyenv/versions/3.7.3/envs/subs/lib/python3.7/site-packages/gensim/interfaces.py", line 340, in __getitem__
    result = self.get_similarities(query)
  File "/home/sug/.pyenv/versions/3.7.3/envs/subs/lib/python3.7/site-packages/gensim/similarities/docsim.py", line 981, in get_similarities
    result = self.similarity_matrix.inner_product(query, self.corpus, normalized=True)
  File "/home/sug/.pyenv/versions/3.7.3/envs/subs/lib/python3.7/site-packages/gensim/similarities/termsim.py", line 345, in inner_product
    Y = corpus2csc(Y, num_terms=self.matrix.shape[0], dtype=dtype)[word_indices, :].todense()
  File "/home/sug/.pyenv/versions/3.7.3/envs/subs/lib/python3.7/site-packages/gensim/matutils.py", line 140, in corpus2csc
    for docno, doc in enumerate(corpus):
  File "/home/sug/.pyenv/versions/3.7.3/envs/subs/lib/python3.7/site-packages/gensim/corpora/mmcorpus.py", line 83, in __iter__
    for doc_id, doc in super(MmCorpus, self).__iter__():
  File "gensim/corpora/_mmreader.pyx", line 127, in __iter__
  File "/home/sug/.pyenv/versions/3.7.3/envs/subs/lib/python3.7/site-packages/gensim/utils.py", line 140, in file_or_filename
    return open(input, 'rb')
  File "/home/sug/.pyenv/versions/3.7.3/envs/subs/lib/python3.7/site-packages/smart_open/smart_open_lib.py", line 187, in open
    newline=newline,
  File "/home/sug/.pyenv/versions/3.7.3/envs/subs/lib/python3.7/site-packages/smart_open/smart_open_lib.py", line 287, in _shortcut_open
    return _builtin_open(local_path, mode, buffering=buffering, **open_kwargs)
FileNotFoundError: [Errno 2] No such file or directory: 'corpus.mm'

However, if I cd to the 'data' directory and then execute 'python ../test.py', everything works perfectly. The issue also does not occur if I copy corpus.mm and corpus.mm.index to the root of the project, where test.py is located.

How can I solve this so that I can run the program from any directory without having to copy the saved corpus files there?

Please let me know if you need any other information.

Thanks,
Sugandh

Radim Řehůřek

Apr 22, 2021, 2:50:30 AM
to Gensim
Hi Sugandh,

try using absolute filesystem paths.

If you use relative paths, the paths are relative to your current working directory.

HTH,
Radim
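
To illustrate the point about relative paths, here is a minimal sketch using only the standard library. The temporary directory layout is a made-up stand-in for the project in this thread, not the real files. It shows the same relative name 'corpus.mm' resolving to different locations depending on the current working directory:

```python
import os
import tempfile

# Hypothetical layout: a temporary project root with a data/ subdirectory.
root = os.path.realpath(tempfile.mkdtemp())
data_dir = os.path.join(root, "data")
os.makedirs(data_dir)
with open(os.path.join(data_dir, "corpus.mm"), "w") as handle:
    handle.write("placeholder\n")

original_cwd = os.getcwd()
try:
    # From the project root, the relative name does not point at the file.
    os.chdir(root)
    found_from_root = os.path.exists("corpus.mm")
    resolved_from_root = os.path.abspath("corpus.mm")

    # From inside data/, the very same relative name does.
    os.chdir(data_dir)
    found_from_data = os.path.exists("corpus.mm")
    resolved_from_data = os.path.abspath("corpus.mm")
finally:
    os.chdir(original_cwd)

print(resolved_from_root, found_from_root)
print(resolved_from_data, found_from_data)
```

This matches the behaviour described in the first message: the script works when started from the data directory and fails when started from the project root.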

Sugandh

Apr 22, 2021, 7:16:33 AM
to Gensim
Hey Radim,

Thanks for replying! 

Using absolute paths did not help either; it still complains about the missing corpus.mm file. Is there something else that I can try to solve the problem?

Thanks, 
Sugandh

Gordon Mohr

Apr 22, 2021, 12:56:30 PM
to Gensim
If using the absolute path still generated a `FileNotFoundError`, then it is likely you were using the wrong absolute path. Try:

* cd to the directory where `corpus.mm` is and execute `pwd` to get the full absolute path
* in your code, don't use anything indirect via `os.path` (like `join` or `abspath`, or even string concatenation); use only the complete literal string, starting with a `/`, when specifying where to load/save things

That should either work, or make it clear where the discrepancy is. 

If you then need to add back relative-path capabilities and still have problems, confirm that only full absolute paths (starting with `/`) are ever passed to load/save methods; do all the fussy relative-path work before that. To debug, print each path's intended final absolute version just before any load/save is attempted. It should then be clear whenever the code is inadvertently using a different path than the one you need.

- Gordon
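
The debugging advice above could be sketched as a small helper. The name `checked_path` is hypothetical (it is not part of gensim or the original script); it prints the final absolute path and fails early if nothing exists there:

```python
import os


def checked_path(path):
    """Return the absolute form of `path`, printing it and failing early if it is missing."""
    absolute = os.path.abspath(path)
    print("about to open:", absolute)
    if not os.path.exists(absolute):
        # Raise here, with the full path, rather than deep inside a library call.
        raise FileNotFoundError(absolute)
    return absolute
```

Every load call in test.py could then be wrapped, for example `corpora.MmCorpus(checked_path("data/corpus.mm"))`. Note that this only checks the paths the script itself passes in; a filename stored inside a previously saved object would not go through the helper.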

Sugandh

Apr 26, 2021, 9:30:42 AM
to Gensim
Hey Gordon,

I used absolute paths, specified as literal strings, but I am still getting the same error. The code with absolute paths is as follows:
from operator import itemgetter
import os.path
import os
import gensim
from gensim import corpora, utils
from gensim.models import TfidfModel
from gensim.similarities import SoftCosineSimilarity
import numpy as np

#data_path = os.path.abspath(os.path.dirname(__file__))
#print(data_path)
data_list = np.load("/home/sug/PycharmProjects/test_nlp/data/zipped_data.sav.npy")


def text_processing(document):
    # lowercase and tokenize
    texts = utils.simple_preprocess(document)
    return texts


def softcossim(query):
    # Compute Soft Cosine Measure between the query and the documents.
    dictionary = corpora.Dictionary.load_from_text("/home/sug/PycharmProjects/test_nlp/data/dict.sav")
    corpus = corpora.MmCorpus("/home/sug/PycharmProjects/test_nlp/data/corpus.mm")
    tfidf = TfidfModel.load("/home/sug/PycharmProjects/test_nlp/data/tfidf.sav")  

    # docsim_index = SoftCosineSimilarity.load(os.path.join(data_path, "../data/fasttext/ft_docsim_index.sav")
    docsim_index = SoftCosineSimilarity.load("/home/sug/PycharmProjects/test_nlp/data/word2vec/w2v_docsim_index.sav")



To ensure that I am indeed using the right paths, I am also posting the output of 'pwd' below:
pwd.png

As you can see, the paths seem to be correct and the error that I get is as follows:
Traceback (most recent call last):
  File "test.py", line 55, in <module>
    r = gen_search_results("textile dyeing")
  File "test.py", line 37, in gen_search_results
    cos_sim = softcossim(query)
  File "test.py", line 28, in softcossim
    similarities = docsim_index[query]
  File "/home/sug/.pyenv/versions/subs/lib/python3.7/site-packages/gensim/interfaces.py", line 340, in __getitem__
    result = self.get_similarities(query)
  File "/home/sug/.pyenv/versions/subs/lib/python3.7/site-packages/gensim/similarities/docsim.py", line 981, in get_similarities
    result = self.similarity_matrix.inner_product(query, self.corpus, normalized=True)
  File "/home/sug/.pyenv/versions/subs/lib/python3.7/site-packages/gensim/similarities/termsim.py", line 345, in inner_product
    Y = corpus2csc(Y, num_terms=self.matrix.shape[0], dtype=dtype)[word_indices, :].todense()
  File "/home/sug/.pyenv/versions/subs/lib/python3.7/site-packages/gensim/matutils.py", line 140, in corpus2csc
    for docno, doc in enumerate(corpus):
  File "/home/sug/.pyenv/versions/subs/lib/python3.7/site-packages/gensim/corpora/mmcorpus.py", line 83, in __iter__
    for doc_id, doc in super(MmCorpus, self).__iter__():
  File "gensim/corpora/_mmreader.pyx", line 127, in __iter__
  File "/home/sug/.pyenv/versions/subs/lib/python3.7/site-packages/gensim/utils.py", line 140, in file_or_filename
    return open(input, 'rb')
  File "/home/sug/.pyenv/versions/subs/lib/python3.7/site-packages/smart_open/smart_open_lib.py", line 187, in open
    newline=newline,
  File "/home/sug/.pyenv/versions/subs/lib/python3.7/site-packages/smart_open/smart_open_lib.py", line 287, in _shortcut_open
    return _builtin_open(local_path, mode, buffering=buffering, **open_kwargs)
FileNotFoundError: [Errno 2] No such file or directory: 'corpus.mm'

Thanks,
Sugandh

Vít Novotný

Apr 28, 2021, 11:27:56 AM
to Gensim
Dear Sugandh,

it seems that the corpus.mm filename is hardcoded in your w2v_docsim_index.sav file:

    docsim_index = SoftCosineSimilarity.load("/home/sug/PycharmProjects/test_nlp/data/word2vec/w2v_docsim_index.sav")

In the past, you constructed and saved the SoftCosineSimilarity index with an MmCorpus('corpus.mm') object, and that corpus, including its relative filename, is now stored in the w2v_docsim_index.sav file. Your changes to the corpus path have no effect, because the newly loaded corpus is never used. The only place it appears is the following line, whose result is not passed to anything:

     corpus = corpora.MmCorpus("/home/sug/PycharmProjects/test_nlp/data/corpus.mm")

You will need to construct a new SoftCosineSimilarity index with the correct corpus. Generally, it is best to save only the term similarity matrix (SparseTermSimilarityMatrix), not the SoftCosineSimilarity index.
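
The effect described here can be reproduced with a small standard-library sketch. `LazyCorpus` is a hypothetical stand-in for a file-backed corpus, not gensim's MmCorpus, and the temporary directories are invented for the example. The saved state holds only the filename, so the file is looked up again, relative to the current working directory, each time the restored object is iterated:

```python
import os
import pickle
import tempfile


class LazyCorpus:
    """Hypothetical file-backed corpus: stores a filename and reads it on iteration."""

    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as handle:
            for line in handle:
                yield line.strip()


root = os.path.realpath(tempfile.mkdtemp())
data_dir = os.path.join(root, "data")
os.makedirs(data_dir)
with open(os.path.join(data_dir, "corpus.mm"), "w") as handle:
    handle.write("doc one\ndoc two\n")

original_cwd = os.getcwd()
try:
    # "Training time": created from inside data/ with a relative filename.
    os.chdir(data_dir)
    corpus = LazyCorpus("corpus.mm")
    saved = pickle.dumps(vars(corpus))  # the saved state is just {'fname': 'corpus.mm'}

    # "Query time" from the project root: the relative name no longer resolves.
    os.chdir(root)
    restored = LazyCorpus(**pickle.loads(saved))
    try:
        list(restored)
        failed_from_root = False
    except FileNotFoundError:
        failed_from_root = True

    # From inside data/ the same restored object works again.
    os.chdir(data_dir)
    docs_from_data = list(restored)
finally:
    os.chdir(original_cwd)

print(restored.fname, failed_from_root, docs_from_data)
```

Under these assumptions, passing an absolute path when the object is first created would make the saved state independent of the working directory.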
On Monday, April 26, 2021 at 15:30:42 UTC+2, Sugandh wrote:

Radim Řehůřek

Apr 28, 2021, 2:52:04 PM
to Gensim
Thanks Vitek – but why does SoftCosineSimilarity store any corpus at all? That doesn't sound right. That's not how any classes in Gensim work (or should work).

-rr

Vít Novotný

Apr 28, 2021, 3:25:36 PM
to Gensim
SoftCosineSimilarity works directly with sparse bag-of-words vectors, i.e. lists of (int, float) pairs. Therefore, no actual indexing takes place, and the object simply keeps the corpus it was given.

A simple fix would be to read the corpus into a list inside the SoftCosineSimilarity constructor.
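
As a rough illustration of that idea, here is a sketch with a hypothetical `EagerIndex` class. It is not the gensim implementation and not necessarily what any actual patch does. The constructor copies the corpus into a list, so the object no longer depends on the source file after construction:

```python
class EagerIndex:
    """Hypothetical index that copies its corpus into memory at construction time."""

    def __init__(self, corpus):
        # list() consumes the iterable once; the source file is not needed afterwards.
        self.corpus = list(corpus)

    def __len__(self):
        return len(self.corpus)


def stream_documents():
    """Stand-in for a file-backed corpus yielding sparse (term_id, weight) vectors."""
    yield [(0, 1.0), (2, 0.5)]
    yield [(1, 2.0)]


index = EagerIndex(stream_documents())
print(len(index), index.corpus[1])
```

The trade-off is memory: the whole corpus has to fit in RAM, which a streamed corpus avoids.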

If we were to be more thorough and also get some speed benefit, we could index the corpus in a CSC sparse matrix in the SoftCosineSimilarity constructor, similarly to what SparseMatrixSimilarity does, and patch SparseTermSimilarityMatrix.inner_product to cope with CSC sparse matrices. However, this would 1) duplicate existing functionality of SparseMatrixSimilarity, and 2) break the existing contract of the SoftCosineSimilarity constructor (we would now need additional parameters such as num_features and num_terms).

On Wednesday, April 28, 2021 at 20:52:04 UTC+2, Radim Řehůřek wrote:

Vít Novotný

Apr 28, 2021, 4:01:55 PM
to Gensim
I propose a patch in PR #3128.

On Wednesday, April 28, 2021 at 21:25:36 UTC+2, Vít Novotný wrote:

Sugandh

Apr 28, 2021, 11:51:13 PM
to Gensim
Thanks @Vit and @Radim! For now I will just move the corpus.mm file to the root of the project.

Best, 
Sugandh

Sugandh

Apr 28, 2021, 11:57:54 PM
to Gensim
Hey Vit, 

I have a question: if the corpus filename is hardcoded in the w2v_docsim_index.sav file, then why does the program work when I simply move the same corpus (the one in the data dir) to the root of the project, where my program resides? It seems that the SoftCosineSimilarity index looks for the corpus.mm file in the directory from which the program is called?

Thanks,
Sugandh
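
One reading consistent with the traceback: the stored name 'corpus.mm' is a relative path, so it is resolved against whatever the current working directory is when the index is queried. As a stopgap until the index is rebuilt, the working directory could be switched temporarily around the query. The sketch below assumes a hypothetical helper named `working_directory` and a throwaway temporary directory standing in for data/:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory, restoring it afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


# Stand-in for the data/ directory holding corpus.mm.
data_dir = os.path.realpath(tempfile.mkdtemp())
with open(os.path.join(data_dir, "corpus.mm"), "w") as handle:
    handle.write("placeholder\n")

before = os.getcwd()
with working_directory(data_dir):
    visible_inside = os.path.exists("corpus.mm")
cwd_restored = os.getcwd() == before
print(visible_inside, cwd_restored)
```

In test.py, the `docsim_index[query]` lookup would go inside the `with` block. Since `os.chdir` affects the whole process, this is unsuitable for multi-threaded code; rebuilding the index with an absolute corpus path is the cleaner fix.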

On Wednesday, April 28, 2021 at 5:27:56 PM UTC+2 Vít Novotný wrote: