Query regarding usage of LatentSemanticAnalysis


Ritesh Sangwan

Aug 29, 2016, 3:25:24 PM
to s-spac...@googlegroups.com
I have to implement a service which compares two documents and returns a weighted score of how similar they are.

My use case is as follows.

I have a list of documents and each document is uniquely identified by an id.
These documents are added incrementally: suppose I start with 10,000 documents, then have to add 1,000 more the next day, and so on.

Now I need to run some queries on these documents to find the most relevant documents for that query.

Each query will have multiple parts, and each part is weighted. For example:

The quick fox -- weight 2
Black fox --- weight 4
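
To make the weighting concrete, here is a minimal sketch of how I imagine the per-part scores being combined. It assumes the document and each query part have already been mapped to vectors in the same semantic space (that step is not shown), and it combines the per-part cosine similarities as a weighted average. The class and method names are hypothetical and are not part of S-Space.

```java
public class WeightedQuery {

    /** Cosine similarity of two equal-length vectors; returns 0 if either vector has zero norm. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /**
     * Weighted average of the similarity between one document vector and each
     * query-part vector. weights[i] is the weight of parts[i].
     */
    public static double score(double[] doc, double[][] parts, double[] weights) {
        double total = 0, weightSum = 0;
        for (int i = 0; i < parts.length; i++) {
            total += weights[i] * cosine(doc, parts[i]);
            weightSum += weights[i];
        }
        return weightSum == 0 ? 0 : total / weightSum;
    }
}
```

With weights 2 and 4 as in the example above, a document that matches only the first part perfectly would score 2 / (2 + 4), so the higher-weighted part dominates the ranking.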

After some reading, my understanding is that I can use the Latent Semantic Analysis (LSA) algorithm to achieve this.


These are the implementation steps I have in mind:

  1. Create a Semantic Space of all the documents using LatentSemanticAnalysis.java
  2. For every query string, find the similarity score, as shown in LatentSemanticAnalysisTest.java
However, I have run into the following issues:
  1. How can I index the documents by a unique id? I need to run queries on the semantic space and get back the ids of the documents that match best.
  2. I need the LSA algorithm to be incremental. Suppose I start with 10,000 documents and process them, so that LSA computes an SVD of the term-document matrix. If I want to index 1,000 more documents the next day, how can I do that?
  3. If the server crashes, do I have to re-index all the documents, or can I load the results from disk? If I can, how?
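
To illustrate what I am after in issues 1 and 3, here is a rough sketch of an id-keyed index that ranks documents by cosine similarity and can be written to and read back from disk with standard Java serialization. It is independent of S-Space: the `DocIndex` class and its methods are hypothetical, and it assumes each document's vector in the reduced space is already available. How to obtain such a vector for a newly added document without recomputing the SVD (issue 2) is exactly the part I do not know how to do, so `add` here only stores a vector that is passed in.

```java
import java.io.*;
import java.util.*;

public class DocIndex implements Serializable {
    private static final long serialVersionUID = 1L;

    // Document id -> document vector in the reduced semantic space.
    private final Map<String, double[]> vectors = new HashMap<>();

    /** Stores (or replaces) the vector for the given document id. */
    public void add(String id, double[] vector) {
        vectors.put(id, vector.clone());
    }

    /** Returns up to k document ids, ordered from most to least similar to the query vector. */
    public List<String> topK(double[] query, int k) {
        List<String> ids = new ArrayList<>(vectors.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> -cosine(query, vectors.get(id))));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    /** Cosine similarity; returns 0 if either vector has zero norm. */
    private static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Writes the whole index to a file so it can survive a restart. */
    public void save(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    /** Reads an index previously written by save(). */
    public static DocIndex load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (DocIndex) in.readObject();
        }
    }
}
```

A brute-force scan like this is linear in the number of documents per query, which should be acceptable for tens of thousands of documents but is only meant to show the shape of the interface I need.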

I have also posted a Stack Overflow question about this (link here).

Best
Ritesh
