New to semanticvectors

elshai...@gmail.com

Oct 16, 2014, 3:38:28 PM

to semanti...@googlegroups.com

Hi,

I generated a Lucene index for my corpus of 584 documents. I want to use SemanticVectors in my Java project to:

- build document vectors

- generate a document-document similarity matrix

- use the generated matrix to cluster documents

I need to do this in Java code. Are there any examples or sample code for this?

Any help will be appreciated.

Thanks

Shaimaa

Dominic Widdows

Oct 16, 2014, 4:15:21 PM

to semanti...@googlegroups.com

Hi Shaimaa,

Please see the following:

- Installation / building vectors: https://code.google.com/p/semanticvectors/wiki/InstallationInstructions.

- Working with document vectors: https://code.google.com/p/semanticvectors/wiki/DocumentSearch

- Clustering: https://code.google.com/p/semanticvectors/wiki/ClusteringAndVisualization

- Recent thread (ongoing) on what it may take to output a vector store as a matrix: https://groups.google.com/forum/#!topic/semanticvectors/r8tBfOmpqFs

Best wishes,

Dominic

elshai...@gmail.com

Oct 17, 2014, 4:52:41 PM

to semanti...@googlegroups.com

Thanks Dominic,

I followed the instructions in https://code.google.com/p/semanticvectors/wiki/DocumentSearch. I guess I should implement the part in

Programmatic / API-driven Search

but I don't know how to initialize the FlagConfig class so that it sees my Lucene index directory and creates the document vectors and the term vectors.

Or, if I need to use the BuildIndex class to create the document vectors and the term vectors, the only way I know to do this is through:

java pitt.search.semanticvectors.BuildIndex -luceneindexpath

but I can't embed this in my Java code.

I apologize if my questions are trivial, but I'm kind of confused.

regards

Shaimaa

Dominic

Oct 17, 2014, 5:24:35 PM

to semanti...@googlegroups.com

Hi there,

Initializing a FlagConfig programmatically would be the same for building an index as it would for searching,

i.e., FlagConfig config = FlagConfig.getFlagConfig( ... appropriate command-line string arguments ... );

If you need to call BuildIndex from within another java class, try using the BuildIndex.main method directly (see https://code.google.com/p/semanticvectors/source/browse/trunk/src/main/java/pitt/search/semanticvectors/BuildIndex.java#73).

Some of the project test code might be useful to crib from, especially the "buildSearchGetRank" helper methods such as https://code.google.com/p/semanticvectors/source/browse/trunk/src/test/java/pitt/search/semanticvectors/integrationtests/RegressionTests.java#78
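As a rough illustration of the in-JVM route described above, here is a minimal sketch. It assumes the SemanticVectors jar is on the classpath at run time, and it uses reflection only so that the snippet compiles without the jar; with the jar present, calling BuildIndex.main(args) directly does the same thing. The class name InProcessTool, the helper runTool, and the index path are made up for this example.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

final class InProcessTool {
    /**
     * Invokes the static main(String[]) of the named class inside this JVM.
     * Returns false if the class is not on the classpath.
     */
    static boolean runTool(String className, String... args) {
        try {
            Class<?> tool = Class.forName(className);
            Method main = tool.getMethod("main", String[].class);
            // Cast to Object so the array is passed as one argument, not spread.
            main.invoke(null, (Object) args);
            return true;
        } catch (ClassNotFoundException e) {
            return false; // jar missing from the classpath
        } catch (InvocationTargetException e) {
            // The tool itself threw; surface its exception as the cause.
            throw new IllegalStateException(className + " failed", e.getCause());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot invoke " + className + ".main", e);
        }
    }

    public static void main(String[] ignored) {
        // Hypothetical index location; replace with your own.
        boolean ran = runTool("pitt.search.semanticvectors.BuildIndex",
                "-luceneindexpath", "path/to/lucene/index");
        System.out.println(ran ? "BuildIndex finished" : "BuildIndex class not found");
    }
}
```

The same helper would work for other command-line entry points in the package, since each is just a class with a static main method.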

But do you really need to do this all "in Java", in the sense of running everything within the same JVM instance without starting more than one process? If you want to experiment with clustering documents, by far the easiest way is to run https://code.google.com/p/semanticvectors/source/browse/trunk/src/main/java/pitt/search/semanticvectors/ClusterVectorStore.java as outlined in https://code.google.com/p/semanticvectors/wiki/ClusteringAndVisualization.

Best wishes,

Dominic

elshai...@gmail.com

Oct 20, 2014, 3:29:58 PM

to semanti...@googlegroups.com

Hi Dominic

The only command-line argument I added is -luceneindexpath (the path to my Lucene index), but I think the BuildIndex class is looking for another argument that specifies the field names to pass to the FlagConfig class. I got a java.lang.NullPointerException in the method that creates term vectors, because it's looking for a field named "contents", which I don't have. My index uses different field names (multiple fields per document).

Thanks

Shaimaa

Dominic Widdows

Oct 20, 2014, 3:33:41 PM

to semanti...@googlegroups.com

Hi Shaimaa,

You should probably use -contentsfields, see http://semanticvectors.googlecode.com/svn/javadoc/latest-stable/pitt/search/semanticvectors/FlagConfig.html#contentsfields()

Hope that helps.
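A small sketch of how the arguments might be assembled for an index with several fields. It assumes that -contentsfields takes a comma-separated list of Lucene field names, which is worth checking against the FlagConfig javadoc linked above. The field names "title" and "abstract", the index path, and the class BuildIndexArgs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BuildIndexArgs {
    /** Assembles BuildIndex-style arguments for an index with custom field names. */
    static String[] forIndex(String luceneIndexPath, String... contentFields) {
        List<String> args = new ArrayList<>();
        if (contentFields.length > 0) {
            args.add("-contentsfields");
            // Assumption: multiple fields are given as one comma-separated value.
            args.add(String.join(",", contentFields));
        }
        args.add("-luceneindexpath");
        args.add(luceneIndexPath);
        return args.toArray(new String[0]);
    }

    public static void main(String[] ignored) {
        String[] args = forIndex("path/to/lucene/index", "title", "abstract");
        System.out.println(Arrays.toString(args));
        // With the SemanticVectors jar on the classpath, this array could be
        // handed to FlagConfig.getFlagConfig(args) or BuildIndex.main(args).
    }
}
```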

Dominic

elshai...@gmail.com

Oct 22, 2014, 2:48:41 PM

to semanti...@googlegroups.com

Hi Dominic

I tried to pass -contentsfields to the BuildIndex class, but it didn't work; it went into an infinite loop. So I had to set it manually in BuildIndex.main with flagConfig.setContentsfields(contentsfields). When it started to generate the doc vectors and term vectors, the class DocVector.java threw a NullPointerException in the while loop at while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS). I'm not sure if this is a bug, but I had to check whether docsEnum is null:

    if (docsEnum != null)
    {
        int docID;
        while ((docID = docsEnum.nextDoc()) != DocsEnum.NO_MORE_DOCS)
        {.....}
    }

Now I have generated the docvectors.bin and termvectors.bin files.

When I tried to use the class CompareTerms to find the relatedness between corpus terms, it threw a runtime exception: Exception in thread "main" java.lang.RuntimeException: C:\workspace\semanticvectors-4.0\termvectors

I need to find a term relatedness matrix and a document relatedness matrix to cluster terms and documents. The classes ClusterResults and ClusterVectorStore don't do that.

Thanks for the continuous help

Shaimaa

Dominic Widdows

Oct 22, 2014, 3:12:09 PM

to semanti...@googlegroups.com

Hi Shaimaa,

Please could you reply with the complete console output of the places where you've seen exceptions or infinite loops?

You're quite right that the clustering classes don't produce matrices. If you want a representation of a matrix, the question is: what format do you need?

Best wishes,

Dominic

elshai...@gmail.com

Oct 22, 2014, 8:59:55 PM

to semanti...@googlegroups.com

Dominic

This is the console output from running CompareTerms with two terms, Sulphur and Waste. It seems it can't see the termvectors and the docvectors.

Outputting similarity of 'Sulphur' with 'Waste':
Exception in thread "main" java.lang.RuntimeException: C:\workspace\semanticvectors-4.0\termvectors
    at pitt.search.semanticvectors.VectorStoreReaderLucene$1.initialValue(VectorStoreReaderLucene.java:102)
    at pitt.search.semanticvectors.VectorStoreReaderLucene$1.initialValue(VectorStoreReaderLucene.java:1)
    at java.lang.ThreadLocal.setInitialValue(Unknown Source)
    at java.lang.ThreadLocal.get(Unknown Source)
    at pitt.search.semanticvectors.VectorStoreReaderLucene.readHeadersFromIndexInput(VectorStoreReaderLucene.java:129)
    at pitt.search.semanticvectors.VectorStoreReaderLucene.<init>(VectorStoreReaderLucene.java:106)
    at pitt.search.semanticvectors.CompareTerms.RunCompareTerms(CompareTerms.java:107)
    at pitt.search.semanticvectors.CompareTerms.main(CompareTerms.java:89)
Caused by: java.nio.file.NoSuchFileException: C:\workspace\semanticvectors-4.0\termvectors
    at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
    at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
    at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
    at java.nio.channels.FileChannel.open(Unknown Source)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
    at pitt.search.semanticvectors.VectorStoreReaderLucene$1.initialValue(VectorStoreReaderLucene.java:100)
    ... 7 more

What I need exactly is to be able to read the document vectors and the term vectors so I can use them for clustering documents and terms. My question is: does SemanticVectors have classes for clustering documents and terms and visualizing the clusters, based on the binary vectors produced earlier with BuildIndex?

If yes, please guide me to the usage of these classes.

thanks

Shaimaa

Dominic Widdows

Oct 22, 2014, 11:58:55 PM

to semanti...@googlegroups.com

Hi there,

I'm puzzled by the line "Caused by: java.nio.file.NoSuchFileException: C:\workspace\semanticvectors-4.0\termvectors". Unless directed otherwise, the program should be looking for a file with a .bin or .txt extension. Please could you share the command you typed in as part of the transcript?

To your other question, I think that clustering should work fine with binary vectors (because it just uses pairwise comparisons), but 2d visualizations will probably only work with real vectors.

Best wishes,

Dominic

elshai...@gmail.com

Oct 23, 2014, 12:58:01 PM

to semanti...@googlegroups.com

The only line in my main method was:

CompareTerms.main(new String[]{"Sulphur","Waste"});

I need to find a way to compute a term-term relatedness matrix and a document-document relatedness matrix to use for clustering terms and documents.

regards

Shaimaa

Dominic Widdows

Oct 23, 2014, 1:37:19 PM

to semanti...@googlegroups.com

Please could you try running CompareTerms from the command line?

It's still not clear what you want when you say "relatedness matrix". A MATLAB file on disk? If what you want is the ability to ask "here are two documents, give me their similarity score", you're probably best off using the existing VectorStoreRAM and computing the similarities on the fly. It's typically very fast and much more space-efficient.
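To make the on-the-fly idea concrete, here is a generic sketch of pairwise cosine similarity over real-valued vectors. It deliberately uses plain float arrays as a stand-in for whatever a vector store returns, so the class PairwiseSimilarity and its methods are illustrative and not part of the SemanticVectors API. It also shows how a full matrix could be filled in if one is really needed; for the 584 documents mentioned earlier that would be a 584 x 584 array.

```java
import java.util.Arrays;

final class PairwiseSimilarity {
    /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / Math.sqrt(normA * normB);
    }

    /** Fills a symmetric N x N matrix, computing each pair only once. */
    static double[][] matrix(float[][] vectors) {
        int n = vectors.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors[i], vectors[j]);
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] ignored) {
        // Toy vectors standing in for document vectors.
        float[][] docs = { {1f, 0f, 1f}, {2f, 0f, 2f}, {0f, 1f, 0f} };
        for (double[] row : matrix(docs)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

This covers real-valued vectors only; other vector types would need their own similarity measure.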

Best wishes,

Dominic

elshai...@gmail.com

Oct 24, 2014, 6:29:21 PM

to semanti...@googlegroups.com

Hi

Let me explain what I need to do in steps:

I have a Lucene index with multiple fields. I used BuildIndex to create a semantic index, which generated docvectors.bin and termvectors.bin.

1- I have a list of queries, and for each query I need to call the Search class from my Java classes, so this is what I wrote:

String[] searchargs = new String[]{"-queryvectorfile", "docvectors.bin", "-searchvectorfile", "termvectors.bin", "-luceneindexpath", "c:\\test_dataset\\index1\\", "Sulphur"};
Search.main(searchargs);

This is what I got:

Opening query vector store from file: docvectors.bin
Setting dimension of target config to: 1000
Opening search vector store from file: termvectors.bin
Searching term vectors, searchtype SUM
Didn't find vector for 'sulphur'
No vector for 'sulphur'
No search output.

and this happens with every query.

2- I need to find a way to cluster the documents in the Lucene index based on their similarity.

3- I need to cluster the terms based on their similarity.

I tried, but I think clustering with the SemanticVectors API is not possible, so I guess that if I can extract the document-document similarity matrix and the term-term similarity matrix, I can find a different method to do the clustering.

thanks

Shaimaa

Trevor Cohen

Oct 26, 2014, 1:02:57 PM

to semanti...@googlegroups.com

Hi Shaimaa,

The -queryvectorfile flag sets the file in which Semantic Vectors looks for vectors for the terms in your query. You've set this to "docvectors.bin", so it isn't surprising that no vector for the term "sulphur" is found.

Could you try -queryvectorfile termvectors.bin -searchvectorfile docvectors.bin instead?
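In Java, the swapped flags could be assembled roughly as follows. This is only a sketch: the class SearchArgs and its method are invented for illustration, and the final Search.main call is left as a comment because it needs the SemanticVectors jar on the classpath.

```java
import java.util.Arrays;

final class SearchArgs {
    /**
     * Query terms are looked up in the term store and matched against the
     * document store, so the results are documents.
     */
    static String[] termToDocumentSearch(String luceneIndexPath, String... queryTerms) {
        String[] flags = {
            "-queryvectorfile", "termvectors.bin",
            "-searchvectorfile", "docvectors.bin",
            "-luceneindexpath", luceneIndexPath
        };
        // Flags first, then the query terms, as in the original call.
        String[] args = Arrays.copyOf(flags, flags.length + queryTerms.length);
        System.arraycopy(queryTerms, 0, args, flags.length, queryTerms.length);
        return args;
    }

    public static void main(String[] ignored) {
        String[] args = termToDocumentSearch("c:\\test_dataset\\index1\\", "Sulphur");
        System.out.println(Arrays.toString(args));
        // Search.main(args);  // uncomment when the SemanticVectors jar is available
    }
}
```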

Regards,

Trevor

elshai...@gmail.com

Oct 27, 2014, 12:06:15 PM

to semanti...@googlegroups.com

Thanks Trevor, The Search worked.

Do you think it is possible to cluster the corpus documents and the terms that are indexed with Lucene?

regards

Shaimaa

Dominic

Oct 27, 2014, 12:11:34 PM

to semanti...@googlegroups.com

Hi Shaimaa,

Have you tried the clustering tool described at https://code.google.com/p/semanticvectors/wiki/ClusteringAndVisualization yet? It's just a simple k-means so there may be plenty of better options.
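For readers unfamiliar with k-means, the following is a generic sketch of the algorithm, not the code used by the SemanticVectors clustering tool. Squared Euclidean distance, seeding the centroids with the first k points, and the class name SimpleKMeans are all choices made here for brevity.

```java
import java.util.Arrays;

final class SimpleKMeans {
    /** Returns a cluster index in 0..k-1 for each row of points. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        if (k < 1 || k > n || maxIterations < 1) {
            throw new IllegalArgumentException("need 1 <= k <= n and maxIterations >= 1");
        }
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // naive seeding: first k points
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) break; // converged
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster keeps its old centroid
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] ignored) {
        double[][] points = { {0, 0}, {10, 10}, {0, 1}, {10, 11}, {1, 0} };
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

Seeding with the first k points keeps the sketch deterministic, but real uses usually pick better starting centroids.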

For output to different tools, you should consider using the '-indexfileformat text' option. See https://code.google.com/p/semanticvectors/wiki/VectorStoreFormats. This will give you the list of vectors as a num_items * num_dimensions matrix, which is what some clustering libraries take as input (e.g., http://www.mathworks.com/help/stats/kmeans.html).
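As a sketch of how such a text file might be turned into a matrix in Java, the parser below assumes a layout of one optional header line starting with "-" followed by one line per item of the form name|v1|v2|...; that layout is an assumption and should be checked against the VectorStoreFormats page above. The class TextVectorMatrix and the sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TextVectorMatrix {
    final List<String> names = new ArrayList<>();
    final List<double[]> rows = new ArrayList<>();

    /** Parses lines of the assumed form "name|v1|v2|..." into names and rows. */
    static TextVectorMatrix parse(List<String> lines) {
        TextVectorMatrix matrix = new TextVectorMatrix();
        for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
            String line = lines.get(lineNo);
            if (line.isEmpty()) continue;
            // Assumption: only the first line can be a flag-style header.
            if (lineNo == 0 && line.startsWith("-")) continue;
            String[] parts = line.split("\\|");
            double[] row = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                row[i - 1] = Double.parseDouble(parts[i]);
            }
            matrix.names.add(parts[0]);
            matrix.rows.add(row);
        }
        return matrix;
    }

    public static void main(String[] ignored) {
        // Made-up sample content in the assumed layout.
        List<String> sample = Arrays.asList(
                "-vectortype REAL -dimension 3",
                "doc1|0.5|-1.0|0.25",
                "doc2|0.0|2.0|1.5");
        TextVectorMatrix m = parse(sample);
        for (int i = 0; i < m.names.size(); i++) {
            System.out.println(m.names.get(i) + " " + Arrays.toString(m.rows.get(i)));
        }
    }
}
```

In practice the lines would come from the exported file, for example via java.nio.file.Files.readAllLines, and the rows could then be passed to whichever clustering routine is preferred.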