LSI incremental learning


Martin

Mar 2, 2022, 10:58:53 AM
to Gensim
Hello, this seems to be a common issue, but I still couldn't get it to work after reading through this forum (a tutorial would be super useful!).

I've built a working LSI model that returns the right scores at inference time. However, I run into problems when adding documents to it: either my kernel shuts down or the inference function returns scores of 0 for every document.

Code:

import pandas as pd
from gensim import models
from gensim.similarities import Similarity

def preprocess(input_data):
    corpus_bow, id2word, _ = get_corpus(input_data)
    tfidf = models.TfidfModel(corpus_bow)
    vector_tfidf = tfidf[corpus_bow]
    save_dictionary(id2word)

    return corpus_bow, vector_tfidf, tfidf

def add_document(docs):

    lsi = load_model()
    id2w = load_dictionary()

    processed_docs = transform_sample(docs)

    id2w.add_documents(processed_docs)
    save_dictionary(id2w)

    vector_bow = [id2w.doc2bow(text) for text in processed_docs]

    tfidf = models.TfidfModel(vector_bow)
    vector_tfidf = tfidf[vector_bow]

    lsi.add_documents(vector_tfidf)
    save_model(lsi)


def infer(sample, tfidf_model, tfidf_corpus, titles):

    lsi = load_model()
    id2w = load_dictionary()

    index = Similarity(output_prefix=None, corpus=lsi[tfidf_corpus], num_features=lsi.num_terms)

    transformed_sample = transform_sample(sample)
    [vector_bow] = [id2w.doc2bow(text) for text in transformed_sample]
    vector_tfidf = tfidf_model[vector_bow]
    vector_lsi = lsi[vector_tfidf]

    sims = index[vector_lsi]
    sims = sorted(enumerate(sims), key=lambda x: -x[-1])

    for doc_pos, doc_score in sims[:15]:
        print(titles[doc_pos], doc_score)



add_document(new_doc['content'])
n_corpus, _, n_tfidf = preprocess(new_data['content'])

# create a test query for inference

n_d = {'content':['jet fluid']}
n_test = pd.DataFrame(data=n_d)
n_test = n_test['content']

# feed the infer function the test query, the new tfidf model created from the
# updated corpus (including the new documents), and the new corpus itself for
# building a new index

infer(n_test, tfidf_model=n_tfidf, tfidf_corpus=n_corpus, titles=titles)
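
One possible cause of the all-zero scores, offered as a guess rather than a confirmed diagnosis: `add_document` fits a fresh `TfidfModel` on only the new documents. Assuming Gensim's default weighting is idf = log2(N / df), a batch containing a single document gives every term an idf of log2(1 / 1) = 0, so the TF-IDF vector handed to `lsi.add_documents` would be empty. The pure-Python sketch below mimics that formula without Gensim. `fit_idf`, `tfidf_vector`, and the toy corpus are hypothetical names and data used only for illustration, and vector normalization is left out.

```python
import math
from collections import Counter


def fit_idf(corpus):
    """Return {term: log2(N / df)} for a list of tokenized documents."""
    n_docs = len(corpus)
    # Count each term once per document it appears in (document frequency).
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log2(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Weight raw term counts by idf and drop terms whose weight is zero."""
    counts = Counter(doc)
    weights = {term: count * idf.get(term, 0.0) for term, count in counts.items()}
    return {term: w for term, w in weights.items() if abs(w) > 1e-12}


# Toy data (hypothetical): an existing corpus and one new document.
full_corpus = [
    ["jet", "fluid"],
    ["jet", "engine"],
    ["hydraulic", "fluid"],
    ["engine", "oil"],
]
new_doc = ["jet", "fluid", "pump"]

# idf fitted on the new document alone: every term has df == N == 1.
idf_new_only = fit_idf([new_doc])
# idf fitted on the whole collection, including the new document.
idf_full = fit_idf(full_corpus + [new_doc])

vec_new_only = tfidf_vector(new_doc, idf_new_only)
vec_full = tfidf_vector(new_doc, idf_full)

print(vec_new_only)  # {}  (all weights are zero, so nothing survives)
print(vec_full)
```

If this is what happens here, transforming new documents with the TF-IDF model that was fitted on the original corpus, instead of refitting on the new batch, would avoid the zero weights.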

-----
Any help would be much appreciated!
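
A second thing that may be worth checking, again as an assumption rather than a diagnosis: the LSI model was trained with a fixed number of terms, while `id2w.add_documents` in `add_document` can assign new ids to unseen tokens, so the new vectors may contain ids outside the range the model knows. Gensim's `Dictionary.doc2bow` skips unknown tokens by default, so one option is to leave the dictionary unchanged when updating the model. The sketch below is a pure-Python stand-in for that behavior. `doc2bow_fixed` and `token2id` are hypothetical names, not Gensim API.

```python
from collections import Counter


def doc2bow_fixed(tokens, token2id):
    """Bag-of-words over a fixed vocabulary; tokens without an id are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Vocabulary the model was (hypothetically) trained on.
token2id = {"jet": 0, "fluid": 1, "engine": 2}
num_terms = len(token2id)

# "pump" is not in the vocabulary, so it is dropped instead of getting id 3.
bow = doc2bow_fixed(["jet", "fluid", "pump", "jet"], token2id)
max_id = max(term_id for term_id, _ in bow)

print(bow)                 # [(0, 2), (1, 1)]
print(max_id < num_terms)  # True
```

The trade-off is that genuinely new vocabulary is ignored until the model is retrained from scratch with an updated dictionary.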

