How to assign custom string ids to the document when finding similarities between documents

Azeem Haider

Jul 18, 2019, 6:03:44 AM

to Gensim

I'm doing topic modelling in Gensim, and I can successfully get the document_id and similarity_percentage for each document.

Here is what I'm trying:

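from gensim import corpora, models, similarities

documents = ["Say to other what you feel",
             "Speak truth from your heart and tell people",
             "what this book say and tell about lying"]

# placeholder: remove common words and tokenize each document (a list of token lists)
texts = ...

# map each token to an integer id and build bag-of-words vectors
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]

# index the documents in LSI space
index = similarities.MatrixSimilarity(lsi[corpus])

# convert the query to LSI space and compare it against every indexed document
doc = "Always tell people what in your heart"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]

sims = index[vec_lsi]
print(list(enumerate(sims)))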

Output

[(0, 0.74419993), (1, 0.99159265), (2, 0.35600105)]

In each pair, the first value is the index number in the documents array and the second value is the similarity percentage.

I want a result like the one below instead:

[("myid_123", 0.74419993), ("abc_1", 0.99159265), ("id_3", 0.35600105)]

In each pair, the first value is the string id of the document in the documents array and the second value is the similarity percentage.

I tried something like this, but it does not work:

documents = {"myid_123": "Say to other what you feel",
             "abc_1": "Speak truth from your heart and tell people",
             "id_3": "what this book say and tell about lying"}

How can I specify my own ids for the documents? Is this possible in Gensim? If so, how? Do you have an example?

Felix Forge

Jul 22, 2019, 12:45:17 PM

to Gensim

I assume you want to assign custom tags to each document.

One way is gensim.models.doc2vec.TaggedDocument:

sentence = TaggedDocument(words=['some', 'words'], tags=['myid_123'])

However, you will have to apply it to a list of documents. I have to do that too, so I can only post a solution in a few days.

Azeem Haider

Jul 22, 2019, 10:28:20 PM

to Gensim

Thanks for the reply, Felix Forge.

If you look carefully, I'm using the LSI model, not Doc2Vec.

How can I achieve this with LSI?

Here is some explanation of why I need this:

Every document has other things attached to it (for example likes, comments, data, etc.) which are saved in a database. That's why I want to attach a custom id to every document, so that later on I can find the records related to that document.

Radim Řehůřek

Jul 23, 2019, 5:41:56 AM

to Gensim

Hi Azeem,

Gensim identifies documents by their position in the provided corpus. So, document #0, document #1, document #2 etc.

If your documents have other ids, e.g. document #0 = "my_custom_label" or "55e066a5", you'll have to keep track of this mapping yourself. Gensim is agnostic to such schemes and doesn't care about the labels.
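One possible way to keep that mapping, as a rough sketch rather than anything built into Gensim: store the custom ids in a list whose order matches the corpus, then pair each score with the id at the same position. The helper name label_similarities is hypothetical, and the sims list is only a stand-in for the result of index[vec_lsi], using the scores posted earlier in this thread.

```python
# Hypothetical helper (not part of Gensim): pair each similarity score
# with the custom id stored at the same position.
def label_similarities(doc_ids, sims):
    if len(doc_ids) != len(sims):
        raise ValueError("doc_ids and sims must have the same length")
    return [(doc_id, float(score)) for doc_id, score in zip(doc_ids, sims)]


# Documents keyed by custom id, as in the question.
documents = {
    "myid_123": "Say to other what you feel",
    "abc_1": "Speak truth from your heart and tell people",
    "id_3": "what this book say and tell about lying",
}

# Fix the order once. The corpus must be built from raw_texts in exactly
# this order, so that position i in the corpus corresponds to doc_ids[i].
doc_ids = list(documents.keys())
raw_texts = [documents[doc_id] for doc_id in doc_ids]

# Stand-in for `sims = index[vec_lsi]`; scores copied from the question.
sims = [0.74419993, 0.99159265, 0.35600105]

labelled = label_similarities(doc_ids, sims)

# Optionally rank the matches, best first.
ranked = sorted(labelled, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The ids in the result can then be used as keys for the database lookup. The only requirement is that the id list and the corpus are never reordered independently of each other.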

HTH,

Radim

Azeem Haider

Jul 23, 2019, 10:53:32 AM

to Gensim

Hey Radim, I think you are the creator of Gensim. It's unbelievable to get a reply from the creator himself.

Thanks for your precious time!!!
