New to semanticvectors


elshai...@gmail.com

Oct 16, 2014, 3:38:28 PM
to semanti...@googlegroups.com
Hi,

I generated a Lucene index for my corpus of 584 documents. I want to use SemanticVectors in my Java project to:
- build document vectors
- generate a document-document similarity matrix
- use the generated matrix to cluster the documents
I need to do this from Java code. Are there any examples or sample code for this?

Any help will be appreciated.
Thanks
Shaimaa

Dominic Widdows

Oct 16, 2014, 4:15:21 PM
to semanti...@googlegroups.com
Hi Shaimaa,

Please see the following:
- Recent (ongoing) thread on what it may take to output a vector store as a matrix: https://groups.google.com/forum/#!topic/semanticvectors/r8tBfOmpqFs

Best wishes,
Dominic


elshai...@gmail.com

Oct 17, 2014, 4:52:41 PM
to semanti...@googlegroups.com
Thanks Dominic,
I followed the instructions in https://code.google.com/p/semanticvectors/wiki/DocumentSearch. I guess I should implement the part under "Programmatic / API-driven Search", but I don't know how to initialize the FlagConfig class so that it sees my Lucene index directory and can create the document vectors and the term vectors.

Or, if I need to use the BuildIndex class to create the document vectors and the term vectors, the only way I know to do this is through:
 java pitt.search.semanticvectors.BuildIndex -luceneindexpath
but I can't embed this in my Java code.

I apologize if my questions are trivial, but I'm kind of confused.

regards
Shaimaa

Dominic

Oct 17, 2014, 5:24:35 PM
to semanti...@googlegroups.com
Hi there,

Initializing a FlagConfig programmatically would be the same for building an index as it would for searching,
i.e., FlagConfig config = FlagConfig.getFlagConfig( ... appropriate command-line string arguments ... );
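
A minimal sketch of assembling those string arguments in code, for illustration only. The flags -luceneindexpath and -contentsfields are the ones discussed in this thread; the index path, the field names, and the comma-separated form of the field list are assumptions. BuildIndexArgs is a hypothetical helper, not part of SemanticVectors, and the library calls are left as comments because they need SemanticVectors on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of SemanticVectors) that assembles flag arguments. */
final class BuildIndexArgs {

    /** Builds the argument array for a Lucene index path and optional content fields. */
    static String[] forIndex(String luceneIndexPath, List<String> contentsFields) {
        List<String> args = new ArrayList<>();
        args.add("-luceneindexpath");
        args.add(luceneIndexPath);
        if (!contentsFields.isEmpty()) {
            args.add("-contentsfields");
            // Assumption: several field names are passed as one comma-separated value.
            args.add(String.join(",", contentsFields));
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // The path and field names below are placeholders.
        String[] args = forIndex("c:\\test_dataset\\index1\\", Arrays.asList("title", "body"));
        System.out.println(String.join(" ", args));
        // With SemanticVectors available, the same array would be handed on, e.g.:
        //   FlagConfig config = FlagConfig.getFlagConfig(args);
        //   BuildIndex.main(args);
    }
}
```

Keeping the arguments in one array means the same values can be reused for building and for searching.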

If you need to call BuildIndex from within another java class, try using the BuildIndex.main method directly (see https://code.google.com/p/semanticvectors/source/browse/trunk/src/main/java/pitt/search/semanticvectors/BuildIndex.java#73).

Some of the project test code might be useful to crib from, especially the "buildSearchGetRank" helper methods such as https://code.google.com/p/semanticvectors/source/browse/trunk/src/test/java/pitt/search/semanticvectors/integrationtests/RegressionTests.java#78

But - do you really need to do all this "in Java", in the sense of running everything within the same JVM instance without starting more than one process? If you want to experiment with clustering documents, by far the easiest way is to run https://code.google.com/p/semanticvectors/source/browse/trunk/src/main/java/pitt/search/semanticvectors/ClusterVectorStore.java

Best wishes,
Dominic

elshai...@gmail.com

Oct 20, 2014, 3:29:58 PM
to semanti...@googlegroups.com
Hi Dominic
The only command-line argument I added is -luceneindexpath (the path to my Lucene index), but I think the BuildIndex class expects another argument that specifies the field names to pass to the FlagConfig class. I got a java.lang.NullPointerException in the method that creates the term vectors, because it looks for a field named "contents", which I don't have. My index has different field names (multiple fields per document).
Thanks
Shaimaa

Dominic Widdows

Oct 20, 2014, 3:33:41 PM
to semanti...@googlegroups.com

elshai...@gmail.com

Oct 22, 2014, 2:48:41 PM
to semanti...@googlegroups.com
Hi Dominic
I tried to pass -contentsfields to the BuildIndex class but it didn't work; it went into an infinite loop. So I had to set it manually in BuildIndex.main with flagConfig.setContentsfields(contentsfields). When it started to generate the document vectors and term vectors, the class DocVector.java threw a null pointer exception in the while loop at while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS). I'm not sure if this is a bug, but I had to check whether docsEnum is null:
  if (docsEnum != null) {
    int docID;
    while ((docID = docsEnum.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
      .....
    }
  }
Now I have generated docvectors.bin and termvectors.bin.
When I tried to use the class CompareTerms to find the relatedness between corpus terms, it threw a runtime exception: "Exception in thread "main" java.lang.RuntimeException: C:\workspace\semanticvectors-4.0\termvectors"
I need to compute a term relatedness matrix and a document relatedness matrix to cluster terms and documents. The classes ClusterResults and ClusterVectorStore don't do that.

Thanks for the continuous help
Shaimaa 

Dominic Widdows

Oct 22, 2014, 3:12:09 PM
to semanti...@googlegroups.com
Hi Shaimaa,

Please could you reply with the complete console output of the places where you've seen exceptions or infinite loops?

You're quite right that the clustering classes don't produce matrices. If you want a representation of a matrix, the question is what format you need.

Best wishes,
Dominic

elshai...@gmail.com

Oct 22, 2014, 8:59:55 PM
to semanti...@googlegroups.com
Dominic

This is the console output from running CompareTerms with the two terms Sulphur and Waste. It seems it can't find the term vectors and the document vectors.
 
Outputting similarity of 'Sulphur' with 'Waste':
Exception in thread "main" java.lang.RuntimeException: C:\workspace\semanticvectors-4.0\termvectors
at pitt.search.semanticvectors.VectorStoreReaderLucene$1.initialValue(VectorStoreReaderLucene.java:102)
at pitt.search.semanticvectors.VectorStoreReaderLucene$1.initialValue(VectorStoreReaderLucene.java:1)
at java.lang.ThreadLocal.setInitialValue(Unknown Source)
at java.lang.ThreadLocal.get(Unknown Source)
at pitt.search.semanticvectors.VectorStoreReaderLucene.readHeadersFromIndexInput(VectorStoreReaderLucene.java:129)
at pitt.search.semanticvectors.VectorStoreReaderLucene.<init>(VectorStoreReaderLucene.java:106)
at pitt.search.semanticvectors.CompareTerms.RunCompareTerms(CompareTerms.java:107)
at pitt.search.semanticvectors.CompareTerms.main(CompareTerms.java:89)
Caused by: java.nio.file.NoSuchFileException: C:\workspace\semanticvectors-4.0\termvectors
at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
at java.nio.channels.FileChannel.open(Unknown Source)
at java.nio.channels.FileChannel.open(Unknown Source)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
at pitt.search.semanticvectors.VectorStoreReaderLucene$1.initialValue(VectorStoreReaderLucene.java:100)
... 7 more
What I need exactly is to be able to read the document vectors and the term vectors so I can use them to cluster documents and terms. My question is: does SemanticVectors have classes for clustering documents and terms, and for visualizing the clusters, based on the binary vectors produced earlier with BuildIndex?
If yes, please guide me to the usage of these classes.

thanks
Shaimaa

Dominic Widdows

Oct 22, 2014, 11:58:55 PM
to semanti...@googlegroups.com
Hi there,

I'm puzzled by the line "Caused by: java.nio.file.NoSuchFileException: C:\workspace\semanticvectors-4.0\termvectors". Unless directed otherwise, the program should be looking for a file with a .bin or .txt extension. Please could you share the command you typed in as part of the transcript?

To your other question, I think that clustering should work fine with binary vectors (because it just uses pairwise comparisons), but 2d visualizations will probably only work with real vectors.
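
To illustrate why pairwise comparison is all that clustering needs, here is a sketch of one possible similarity for binary vectors using java.util.BitSet. It is a generic normalized Hamming overlap and not necessarily the exact measure SemanticVectors computes; the class and method names are hypothetical.

```java
import java.util.BitSet;

/** Hypothetical illustration of a pairwise similarity for binary vectors. */
final class BinaryOverlap {

    /**
     * Returns 1 - 2 * hamming / dimension: 1.0 for identical vectors,
     * 0.0 when half of the bits differ, -1.0 for complementary vectors.
     * Assumes no bit at or beyond index 'dimension' is set.
     */
    static double similarity(BitSet a, BitSet b, int dimension) {
        BitSet diff = (BitSet) a.clone(); // copy so the inputs stay unchanged
        diff.xor(b);                      // bits that differ between a and b
        int hamming = diff.cardinality();
        return 1.0 - 2.0 * hamming / dimension;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(8);
        a.set(0);
        a.set(3);
        BitSet b = new BitSet(8);
        b.set(0);
        b.set(5);
        System.out.println(similarity(a, b, 8));
    }
}
```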

Best wishes,
Dominic

elshai...@gmail.com

Oct 23, 2014, 12:58:01 PM
to semanti...@googlegroups.com
The only line in my main method was:
  CompareTerms.main(new String[]{"Sulphur","Waste"});

I need to find a way to compute a term-term relatedness matrix and a document-document relatedness matrix to use for clustering terms and documents.

regards
Shaimaa

Dominic Widdows

Oct 23, 2014, 1:37:19 PM
to semanti...@googlegroups.com
Please could you try running CompareTerms from the command line?

It's still not clear what you want when you say "relatedness matrix". A MATLAB file on disk? If what you want is the ability to ask "here are two documents, give me their similarity score", you're probably best off using the existing VectorStoreRAM and computing the similarities on the fly. It's typically very fast and much more space-efficient.
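
As a rough illustration of on-the-fly similarity versus a stored matrix, the sketch below works on plain double[] arrays rather than on VectorStoreRAM, so it makes no claim about the SemanticVectors API. The class and method names are hypothetical, and cosine similarity is assumed as the measure for real-valued vectors.

```java
/** Hypothetical illustration: cosine similarity on plain arrays. */
final class CosineSimilarity {

    /** Cosine of the angle between two real-valued vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for zero vectors
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Full symmetric matrix, only needed if another tool requires it on disk. */
    static double[][] pairwise(double[][] vectors) {
        int n = vectors.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors[i], vectors[j]);
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        double[][] docs = {{1, 0, 0}, {0, 1, 0}, {1, 1, 0}};
        System.out.println(cosine(docs[0], docs[2]));   // one pair, computed on demand
        System.out.println(pairwise(docs).length);      // number of rows in the full matrix
    }
}
```

For the 584 documents mentioned at the start of the thread, the full matrix would have 584 * 584 = 341,056 entries, so either approach is feasible at that size.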

Best wishes,
Dominic

elshai...@gmail.com

Oct 24, 2014, 6:29:21 PM
to semanti...@googlegroups.com
Hi
Let me explain what I need to do in steps:

I have a Lucene index with multiple fields. I used BuildIndex to create a semantic index, which generated docvectors.bin and termvectors.bin.
1- I have a list of queries, and for each query I need to call the Search class from my Java classes, so this is what I wrote:

String[] searchargs = new String[]{"-queryvectorfile", "docvectors.bin", "-searchvectorfile", "termvectors.bin",  "-luceneindexpath", "c:\\test_dataset\\index1\\", "Sulphur"};
Search.main(searchargs);

This is what I got:
Opening query vector store from file: docvectors.bin
Setting dimension of target config to: 1000
Opening search vector store from file: termvectors.bin
Searching term vectors, searchtype SUM
Didn't find vector for 'sulphur'
No vector for 'sulphur'
No search output.

and this happens with every query.

2- I need to find a way to cluster the documents in the Lucene index based on their similarity.

3- I need to cluster the terms based on their similarity.

I tried, but I think clustering with the SemanticVectors API is not possible. So my guess is that if I can extract the document-document similarity matrix and the term-term similarity matrix, I can find a different method to do the clustering.

thanks
Shaimaa 

Trevor Cohen

Oct 26, 2014, 1:02:57 PM
to semanti...@googlegroups.com
Hi Shaimaa,
The -queryvectorfile sets the file in which Semantic Vectors looks for vectors for the terms in your query. You've set this to "docvectors.bin", so it isn't surprising that no vector for the term "sulphur" is found. 

Could you try -queryvectorfile termvectors.bin -searchvectorfile docvectors.bin instead?
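
Applied to the Java call from the earlier message, the swap could be sketched as follows. The flags and file names are the ones used in this thread; SearchArgs is a hypothetical helper, and the Search.main call is left as a comment because it needs SemanticVectors on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that puts the two vector files in the roles described above. */
final class SearchArgs {

    /** Query terms are looked up in the term vectors; results come from the document vectors. */
    static String[] termQuery(String termVectorFile, String docVectorFile,
                              String luceneIndexPath, String... queryTerms) {
        List<String> args = new ArrayList<>(Arrays.asList(
            "-queryvectorfile", termVectorFile,
            "-searchvectorfile", docVectorFile,
            "-luceneindexpath", luceneIndexPath));
        args.addAll(Arrays.asList(queryTerms));
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        String[] args = termQuery("termvectors.bin", "docvectors.bin",
                                  "c:\\test_dataset\\index1\\", "Sulphur");
        System.out.println(String.join(" ", args));
        // With SemanticVectors available: Search.main(args);
    }
}
```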
Regards,
Trevor

elshai...@gmail.com

Oct 27, 2014, 12:06:15 PM
to semanti...@googlegroups.com
Thanks Trevor, the search worked.
Do you think it is possible to cluster the corpus documents and the terms that are indexed with Lucene?

regards
Shaimaa  

Dominic

Oct 27, 2014, 12:11:34 PM
to semanti...@googlegroups.com
Hi Shaimaa,

Have you tried the clustering tool described at  https://code.google.com/p/semanticvectors/wiki/ClusteringAndVisualization yet? It's just a simple k-means so there may be plenty of better options.
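
For readers unfamiliar with k-means, the idea can be sketched in plain Java as below. This is a generic textbook version with squared Euclidean distance and a naive initialization (the first k vectors become the starting centroids); it is not the SemanticVectors implementation, and all names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical textbook k-means on plain arrays. */
final class SimpleKMeans {

    /** Returns, for each input vector, the index of the cluster it was assigned to. */
    static int[] cluster(double[][] vectors, int k, int maxIterations) {
        int n = vectors.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and the number of vectors");
        }
        int dim = vectors[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = vectors[c].clone(); // naive initialization
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each vector to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(vectors[i], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += vectors[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}};
        System.out.println(Arrays.toString(cluster(points, 2, 20)));
    }
}
```

The result depends on the initialization, which is one reason other clustering tools may give better groupings.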

For output to different tools, you should consider using the '-indexfileformat text' option. See https://code.google.com/p/semanticvectors/wiki/VectorStoreFormats. This will give you the list of vectors as a num_items * num_dimensions matrix, which is what some clustering libraries take as input (e.g., http://www.mathworks.com/help/stats/kmeans.html).
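
A sketch of turning such a text file into a num_items * num_dimensions array in Java. The line layout assumed here (an optional first header line starting with "-", then one "name|v1|v2|..." line per vector) is an assumption to check against the VectorStoreFormats page; the class name and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for a pipe-separated text vector store. */
final class TextVectorMatrix {
    final List<String> names = new ArrayList<>();
    final List<double[]> rows = new ArrayList<>();

    /** Parses lines of the assumed form "name|v1|v2|...", skipping a leading "-..." header. */
    static TextVectorMatrix parse(List<String> lines) {
        TextVectorMatrix result = new TextVectorMatrix();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.isEmpty() || (i == 0 && line.startsWith("-"))) {
                continue; // blank line, or the assumed header on the first line
            }
            String[] parts = line.split("\\|");
            double[] row = new double[parts.length - 1];
            for (int j = 1; j < parts.length; j++) {
                row[j - 1] = Double.parseDouble(parts[j]);
            }
            result.names.add(parts[0]);
            result.rows.add(row);
        }
        return result;
    }

    /** Rows are items, columns are dimensions. */
    double[][] toMatrix() {
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        // In practice the lines could come from java.nio.file.Files.readAllLines(path).
        List<String> lines = Arrays.asList(
            "-vectortype REAL -dimension 3",
            "sulphur|0.5|0.25|-1.0",
            "waste|1.0|2.0|3.0");
        TextVectorMatrix m = parse(lines);
        System.out.println(m.names + " " + m.toMatrix().length + " x " + m.toMatrix()[0].length);
    }
}
```

The names list keeps the row order, so a cluster index computed for row i can be mapped back to the corresponding term or document.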

Best wishes,
Dominic