Using Semantic Vectors to override similarity in Lucene


Hadeel Maryoosh

unread,
Jul 13, 2016, 4:44:31 PM7/13/16
to Semantic Vectors
Hello,


I have thousands of documents that I'm indexing and searching with Lucene, but I want to replace the similarity metric that Lucene uses (by default, TF-IDF with cosine similarity) with LSA, which I'm guessing is what Semantic Vectors uses.
I'm quite new to Lucene, Java, and Semantic Vectors, so I have many questions:

1. Can Semantic Vectors be used to override the similarity metric that Lucene is using?
2. How can I start doing that? I could override the similarity with:


   Similarity similarity = new DefaultSimilarity() {

       @Override
       public float lengthNorm(FieldInvertState state) {
           return 1.0f;
       }

       @Override
       public float coord(int overlap, int maxOverlap) {
           return 1.0f;
       }

       @Override
       public float idf(long docFreq, long numDocs) {
           return 1.0f;
       }

       @Override
       public float queryNorm(float sumOfSquaredWeights) {
           return 1.0f;
       }

       @Override
       public float tf(float freq) {
           return freq == 0f ? 0f : 1f;
       }
   };  // closing brace and semicolon for the anonymous subclass

But I'm really lost on how to replace this with LSA as the similarity Lucene uses for indexing and searching. I appreciate the help.


Dominic Widdows

unread,
Jul 13, 2016, 4:58:14 PM7/13/16
to semanti...@googlegroups.com
Hello there,

I don't really know how the Lucene Similarity system works. I took a brief look at https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/search/similarities/Similarity.html, but nothing there screams out to me as "this function takes a pair of vectors, a pair of lists of words, or a query-document pair, and returns a score".

The easiest way to get some kind of scoring / ranking working using Semantic Vectors is something like the following:
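A hedged sketch of that command-line workflow, based on the project's installation wiki linked later in this thread (paths are placeholders; it assumes the semanticvectors jar and its dependencies are on the classpath, and the exact flags may vary by version):

```shell
# Build term and document vector stores from an existing Lucene index.
java pitt.search.semanticvectors.BuildIndex -luceneindexpath /path/to/lucene/index

# Search the resulting term vectors for nearest neighbors of a query term.
java pitt.search.semanticvectors.Search myqueryterm

# Or search the document vectors instead, to rank documents against the query.
java pitt.search.semanticvectors.Search \
    -queryvectorfile termvectors.bin -searchvectorfile docvectors.bin myqueryterm
```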

This gives a ranked list of the top-scoring results. For something that compares two inputs directly, see https://github.com/semanticvectors/semanticvectors/blob/master/src/main/java/pitt/search/semanticvectors/CompareTerms.java

Best wishes,
Dominic

--
You received this message because you are subscribed to the Google Groups "Semantic Vectors" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semanticvecto...@googlegroups.com.
To post to this group, send email to semanti...@googlegroups.com.
Visit this group at https://groups.google.com/group/semanticvectors.
For more options, visit https://groups.google.com/d/optout.


Hadeel Maryoosh

unread,
Jul 14, 2016, 2:20:28 PM7/14/16
to Semantic Vectors
Thanks for the reply, Dominic. What I would like is to maintain Lucene's method for indexing and searching. In other words, Lucene measures similarity (when scoring) only over the terms that the query and the documents have in common, which keeps the implementation fast. The first link you gave me doesn't apply to a Lucene project, right?


What I would like, for example: this is a basic Lucene project that does some indexing and searching on a small data set, and I would like to use Semantic Vectors in it for the indexing and searching procedures. I would like help with that.
http://www.tutorialspoint.com/lucene/lucene_first_application.htm

I would like examples of how to use Semantic Vectors with the project in that link. Thanks,

Dominic Widdows

unread,
Jul 14, 2016, 2:31:27 PM7/14/16
to semanti...@googlegroups.com
Hi Hadeel,

The application at http://www.tutorialspoint.com/lucene/lucene_first_application.htm has a lot of pages, and it appears only to go into detail on the indexing process. For the searching process there are several Lucene classes, such as IndexSearcher, that are quite complex and specific. Reproducing all these functions in SV would probably not be a good idea, because in many cases they'd be irrelevant.

If you can come up with a more specific design, I might be able to help with advice. It will need to pose questions like "At this point I have a query term, a term vector store, and a document vector store ... what's the best way to get related documents?" An end-to-end replacement for Lucene in a tutorial like the one above is much too big and vague a request for me to promise to help with.

Best wishes,
Dominic


 

Dominic Widdows

unread,
Jul 15, 2016, 2:03:53 PM7/15/16
to semanti...@googlegroups.com
Hi Hadeel,

What I'd suggest is to start with the question "What do I want the user to provide?" and "What do I want to show up and where?".

If the user is just typing in text and hitting return, then the input is just a string. If you want the result to show up as a list of high-scoring documents or document titles, then the output is a list of document IDs (presuming there is a process for taking those document IDs and producing suitable thumbnail summaries to show to the user). If you want the summaries to show a relation to the query (e.g., to highlight segments containing the query terms), then the summary process is a bit more involved.

Nothing so far mentions a Lucene interface explicitly, and that's a good thing, because at least the first two processes I mentioned (getting a string from the user, and getting a ranked list of high-scoring documents) will be much simpler if you don't try to use Lucene for this at all: passing in a query and getting a list of responses is perfectly easy if you just use the SV package in a standalone fashion. (This is at query time; at indexing time you'd still use Lucene.)

So first I'd say figure out what your end goal is, and then figure out if implementing an IndexSearcher and IndexWriter help at all.

Best wishes,
Dominic

On Thu, Jul 14, 2016 at 1:36 PM, Hadeel Maryoosh <hadeel....@gmail.com> wrote:
Thanks, Dominic, for replying. Brainstorming ideas will be extremely helpful.
Originally, I would like to improve an existing application of ours that mainly uses Lucene for indexing and searching. Lucene uses TF-IDF with cosine as the similarity metric for scoring, but I found that might not work well because each of our documents is short (by the way, we have about a million documents), and in my experience the LSA approach works much better and returns better results.

Because I'm new to Lucene and Java, I implemented the first Lucene application from the link I mentioned earlier (just the single page) for this data set; it just does simple indexing and searching in Lucene. I didn't go through the other pages for now. My next step is to replace the similarity Lucene is using (TF-IDF) with LSA; the details are here: https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/search/package-summary.html#changingSimilarity

The official Lucene website says you can change the similarity when you know your data calls for it, and that's what I need! I want Lucene to keep doing what it does for indexing and searching, but instead of TF-IDF as explained in the link above, I want to use LSA (semantic vectors), and I found this package. Do you think the Semantic Vectors package will be helpful? I found this link for building the models from the command line: https://github.com/semanticvectors/semanticvectors/wiki/InstallationInstructions#to-build-and-search-a-model

But would I be able to do the regular IndexWriter / IndexSearcher work (Lucene's indexing and searching procedures) with the Semantic Vectors package? If not, can I integrate the SV package somehow?

Hadeel Maryoosh

unread,
Jul 15, 2016, 3:16:10 PM7/15/16
to Semantic Vectors
The two processes you mentioned are right.

"What do I want the user to provide?": the input is just a string.
"What do I want to show up and where?": a list of the best-matching documents.

My problem with the current Lucene setup is that the list of best-matching documents is not always good. Now I want to use the SV package instead of just Lucene. What I found are command lines that can be used with SV, not actual code that I can integrate into the example project I gave earlier. I would appreciate help with that.

Dominic Widdows

unread,
Jul 15, 2016, 3:24:01 PM7/15/16
to semanti...@googlegroups.com
Good, that should be much easier.

Hopefully this section should help:
https://github.com/semanticvectors/semanticvectors/wiki/DocumentSearch#programmatic--api-driven-search

Basically you need to:
i. Open query and document vector stores.
ii. Instantiate a VectorSearcher and call getNearestNeighbors appropriately.

Depending on your workflow, user interaction, and performance requirements, you might want to do i. just once in an initialization phase, and ii. every time a user types in a query. For best performance you'll want to read the vector stores from disk into a VectorStoreRAM. But again, that depends on size and performance requirements.
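A hedged sketch of steps i. and ii. in code (class and method names are taken from the SV codebase and the wiki page above; exact signatures may vary across versions, and the file names and query term are placeholders):

```java
import java.util.LinkedList;
import pitt.search.semanticvectors.*;

public class SvSearchSketch {
  public static void main(String[] args) throws Exception {
    FlagConfig flagConfig = FlagConfig.getFlagConfig(args);

    // i. Open the query (term) and search (document) vector stores,
    //    reading them from disk into RAM for fast repeated lookups.
    VectorStoreRAM termVectors = new VectorStoreRAM(flagConfig);
    termVectors.initFromFile("termvectors.bin");
    VectorStoreRAM docVectors = new VectorStoreRAM(flagConfig);
    docVectors.initFromFile("docvectors.bin");

    // ii. Instantiate a VectorSearcher and call getNearestNeighbors:
    //     query vectors come from the term store, results from the doc store.
    VectorSearcher searcher = new VectorSearcher.VectorSearcherCosine(
        termVectors, docVectors, null /* luceneUtils */, flagConfig,
        new String[] {"myqueryterm"});
    LinkedList<SearchResult> results = searcher.getNearestNeighbors(10);
    for (SearchResult result : results) {
      System.out.println(result.getScore() + " : "
          + result.getObjectVector().getObject());
    }
  }
}
```

Step i. would run once at initialization; step ii. runs per user query.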

Best wishes,
Dominic

Hadeel Maryoosh

unread,
Jul 15, 2016, 3:56:41 PM7/15/16
to Semantic Vectors
Thanks, I will look into that and try it. By the way, do you think it will keep performance fast with around a million documents (enough scalability)? Right now, using Lucene, it's fairly fast.

Hadeel Maryoosh

unread,
Jul 15, 2016, 4:26:07 PM7/15/16
to Semantic Vectors
By the way, the links here https://github.com/semanticvectors/semanticvectors/wiki/ExampleClients are not working. Is there any way you can tell me where to find those instead? Thanks

Dominic Widdows

unread,
Jul 15, 2016, 5:17:25 PM7/15/16
to semanti...@googlegroups.com
Thanks for that, it looks like the broken links were a casualty of the googlecode to github migration. Fixed now.

On the scale question. Let's see: 1M docs, say a default 200 dimensions per doc, a 4-byte float per dimension; that's around 800MB of memory, which should be fine nowadays on a single machine, provided you're not doing too much else at the same time. In terms of CPU, it's still less than 1B flops per search, and given that teraflops are now standard, you should be more than fine there.
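That back-of-envelope arithmetic can be checked directly (numbers as in the message: 1M documents, 200 dimensions, 4-byte floats):

```java
public class ScaleEstimate {
  // Memory in (decimal) megabytes for a store of `docs` vectors
  // of `dims` 4-byte floats each.
  static long memoryMegabytes(long docs, long dims) {
    return docs * dims * 4 / 1_000_000;
  }

  // Rough flop count for one cosine search: a multiply and an add
  // per dimension, for every document vector in the store.
  static long flopsPerSearch(long docs, long dims) {
    return docs * dims * 2;
  }

  public static void main(String[] args) {
    System.out.println("Memory: ~" + memoryMegabytes(1_000_000, 200) + " MB");
    System.out.println("Flops per search: ~" + flopsPerSearch(1_000_000, 200));
  }
}
```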

So I'm guessing that scaling even on a single machine shouldn't be too big a problem.

Best wishes,
Dominic 

Hadeel Maryoosh

unread,
Jul 15, 2016, 6:05:35 PM7/15/16
to Semantic Vectors
Thanks. 

So in this example https://github.com/semanticvectors/semanticvectors/blob/master/exampleclient/src/main/java/pitt/search/examples/ExampleVectorSearcherClient.java 

Regarding the input data, I only noticed it reading the file "src/test/resources/termvectors.bin". Is that the term-vector matrix? Maybe I can get it somehow by using a Lucene class or something in my code, and then feed it into the rest of the example code. Please correct me if I'm wrong. Also, I couldn't find the query terms to search for. I noticed the line "Enter a query term:", but I couldn't understand the rest and how the query search is done here. I appreciate the help.

Dominic Widdows

unread,
Jul 15, 2016, 7:14:52 PM7/15/16
to semanti...@googlegroups.com
Yes, the termvectors.bin file is a vector store, i.e., a long list of vectors all with the same dimension, so you can think of it as a matrix if you like.

You can get termvectors.bin and docvectors.bin files from BuildIndex or BuildPositionalIndex.

The rest is a Scanner object that reads command-line inputs, plus a call for getting nearby vectors. You'll probably want to open a document vector store and search in that if you want to use query terms to find related documents.
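To make the mechanics concrete, here is a self-contained toy sketch (not the SV implementation) of what a cosine nearest-neighbor search over an in-memory vector store does; the store, its keys, and the query vector are all made up for illustration:

```java
import java.util.*;

public class CosineNeighborsDemo {
  // Cosine similarity between two equal-length vectors.
  static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Rank stored vectors by cosine similarity to the query; return the top k ids.
  static List<String> nearestNeighbors(Map<String, float[]> store,
                                       float[] query, int k) {
    List<String> ids = new ArrayList<>(store.keySet());
    ids.sort((x, y) -> Double.compare(cosine(store.get(y), query),
                                      cosine(store.get(x), query)));
    return ids.subList(0, Math.min(k, ids.size()));
  }

  public static void main(String[] args) {
    // A toy "document vector store": three 3-dimensional vectors.
    Map<String, float[]> docVectors = new HashMap<>();
    docVectors.put("doc1", new float[] {1f, 0f, 0f});
    docVectors.put("doc2", new float[] {0.9f, 0.1f, 0f});
    docVectors.put("doc3", new float[] {0f, 0f, 1f});

    float[] queryVector = {1f, 0f, 0f};  // e.g., a query term's vector
    System.out.println(nearestNeighbors(docVectors, queryVector, 2));
    // doc1 and doc2 rank ahead of doc3
  }
}
```

The SV vector stores play the role of the map here, with real reduced-dimension vectors instead of these toy ones.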

Best wishes,
Dominic

Hadeel Maryoosh

unread,
Jul 18, 2016, 11:51:08 AM7/18/16
to Semantic Vectors
So speaking of theory, what are those lines doing? Are they searching the term vectors using LSA (SVD)?
Also, I still didn't get where to input the required query term.

Hadeel Maryoosh

unread,
Jul 18, 2016, 12:01:55 PM7/18/16
to Semantic Vectors
And when using SVD, how do we determine the number of dimensions to keep?

Hadeel Maryoosh

unread,
Jul 18, 2016, 12:05:52 PM7/18/16
to Semantic Vectors
Basically I would like to run something similar to this package: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

That is, applying LSA (SVD) to the term vectors using the randomized algorithm, which should be faster: http://stackoverflow.com/questions/36812129/why-scikit-learn-truncatedsvd-uses-randomized-algorithm-as-default

Dominic Widdows

unread,
Jul 18, 2016, 7:24:47 PM7/18/16
to semanti...@googlegroups.com
Hi Hadeel,

These lines:
  VectorSearcher searcher = new VectorSearcher.VectorSearcherCosine(
      searchVectorStore, searchVectorStore, luceneUtils, defaultFlagConfig,
      new String[] {queryTerm});
  LinkedList<SearchResult> results = searcher.getNearestNeighbors(10);
are the ones doing the actual search. They're searching whatever vectors are in the searchVectorStore, these might have come from LSA, random projection, or even your own text file written in a text editor (as in this example in tests).

Number of dimensions - 200 is typically fine, but results tend to vary across the literature for different datasets and different purposes, so I would advise experimenting and seeing if some look better than others.

For your indexing phase for LSA, please see https://github.com/semanticvectors/semanticvectors/wiki/LatentSemanticAnalysis and compare with the BuildIndex command which uses random projection. The vectors produced have the same format / syntax either way, so for searching your implementation won't care which technique was used (your results may differ, of course).
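For reference, a sketch of the two indexing commands being compared, based on the wiki pages linked above (the index path is a placeholder, and flags may vary by version):

```shell
# Random projection indexing (the default BuildIndex route):
java pitt.search.semanticvectors.BuildIndex -luceneindexpath /path/to/lucene/index

# LSA (SVD-based) indexing; the output vector stores share the same format:
java pitt.search.semanticvectors.LSA -luceneindexpath /path/to/lucene/index

# Either way, -dimension controls the number of reduced dimensions (default 200):
java pitt.search.semanticvectors.LSA -dimension 200 -luceneindexpath /path/to/lucene/index
```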

Small protocol suggestion - please collect questions together into batches before sending them to the main SV mailing list to avoid other members getting more than they need. If you have several small questions please just write to me directly.

Best wishes,
Dominic


On Mon, Jul 18, 2016 at 3:07 PM, Hadeel Maryoosh <hadeel....@gmail.com> wrote:
So I got that the user enters the input query from the command line. Thanks for that. However, I'm still waiting for answers to the rest of the questions. Thanks again.


On Monday, July 18, 2016 at 11:53:16 AM UTC-6, Hadeel Maryoosh wrote:
Sorry, but one more question: is the term vector store here the same as the Lucene index? I already have the index in a directory, so if it is the same, that would save me some effort, I think.