Document Searching


TopicModeler

Feb 13, 2018, 1:27:18 PM
to Semantic Vectors
I need to compare a set of terms with a set of documents. I created a Lucene index of the documents and have both a termvectors.bin and a docvectors.bin file. Under the title Document Search, the documentation describes comparing two documents, but I have yet to get this to work. It says to use java pitt.search.semanticvectors.CompareTerms -queryvectorfile docvectors.bin ./path/to/Doc1 ./path/to/Doc2. When I try this, I get an error telling me it can't find a vector for either of the files. I am specifying the full path to each file, including the file name. Perhaps this is incorrect syntax. If so, I'm not sure what ./path/to/Doc1 really is: is it the folder in which a document is stored, or is it the file itself? Either way I try it, I get the same result: no vector found. Any ideas?

Thank you

Dominic

Feb 13, 2018, 1:31:34 PM
to Semantic Vectors
Hi there,

Sometimes document pathnames don't get matched because of the default case normalization at query time, which can be disabled using the -matchcase flag.

If that doesn't work, feel free to write back with a full cut-n-paste transcript of the command-line output.

Best wishes,
Dominic

Trevor Cohen

Feb 13, 2018, 2:29:18 PM
to semanti...@googlegroups.com
Perhaps you could search the document vector store using a term as a cue to see how the document vectors have been named?

i.e. java pitt.search.semanticvectors.Search -queryvectorfile termvectors.bin -searchvectorfile docvectors.bin semantics

What *should* have happened is that the document file name (including, perhaps, the path to it) ended up as the document vector name.

Another thing that might help would be to use the -matchcase flag so that SV doesn't convert your search terms to lowercase (if you have uppercase path or document names, that is).

-Trevor


--
You received this message because you are subscribed to the Google Groups "Semantic Vectors" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semanticvectors+unsubscribe@googlegroups.com.
To post to this group, send email to semanticvectors@googlegroups.com.
Visit this group at https://groups.google.com/group/semanticvectors.
For more options, visit https://groups.google.com/d/optout.

TopicModeler

Feb 15, 2018, 8:51:11 AM
to Semantic Vectors
Here is my setup: I built an index for a corpus of 57 text files, which was placed in the positional_index directory, using the following command:

java pitt.search.lucene.IndexFilePositions C:\pathtocorpus

The path to the corpus is the folder with the 57 text files, which range in size from 11 to 120 KB. These text files are papers written about various marketing topics. The command proceeds normally and completes in about 15 seconds. Next, I run the BuildIndex command as follows:

java pitt.search.semanticvectors.BuildIndex -luceneindexpath positional_index

This command runs quickly and finds over 23,000 terms in the 57 documents. It writes the termvectors.bin file with dimension = 200 and the docvectors.bin file with the same dimension. I can then compare terms individually, such as "brand" and "management." (That query returns 0.419424..., which I presume is the cosine similarity.) While helpful, I would prefer to compare a list of terms all at once, and many of the terms go together as topical phrases, such as "brand management" or "advertising theory." So I created a file with this list of terms in it, one marketing phrase per line, and would now like to run this entire file against the corpus. This is where I'm having trouble. On your Document Search page, you have the topic heading "Using Documents as Queries." This is the path I'm trying to duplicate, with no success. Here is the command I am using to try to get this to work:

java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase Y:\Java\SemanticVectors\MSI

where Y:\Java\SemanticVectors\MSI is the directory containing the text file with the terms I would like to search the corpus with. I have also tried adding the name of the file to the end of the command and placing quotation marks around the path, but neither variant works. Here is the result:

Y:\Java\SemanticVectors>java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase Y:\Java\SemanticVectors\MSI\
Opening query vector store from file: docvectors.bin
Opening search vector store from file: termvectors.bin
Searching term vectors, searchtype SUM
Didn't find vector for 'Y:\Java\SemanticVectors\MSI\'
No vector for 'Y:\Java\SemanticVectors\MSI\'
No search output.

Am I doing this correctly? What can I add or change to this command that will get this to work?
Thanks in advance.

Trevor Cohen

Feb 15, 2018, 11:26:00 AM
to semanti...@googlegroups.com
I think I see the problem here. The "-queryvectorfile docvectors.bin" bit is intended to search using documents that already have document vectors, e.g. those generated during your initial BuildIndex process.

If I'm understanding correctly, the phrases you've constructed are not documents (text files) from the original corpus.
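To illustrate the distinction, here is a toy sketch in Python rather than the Semantic Vectors code itself. It treats each vector store as a dictionary keyed by name and, following the "searchtype SUM" line in the log above, assumes a multi-word query is built by adding up the vectors of its words. The vectors, the document key, and the function name build_query_vector are all made up for illustration.

```python
def build_query_vector(query_terms, vector_store):
    """Sum the vectors of the query terms that exist as keys in the store.

    Returns (summed_vector_or_None, list_of_terms_without_a_vector).
    """
    total = None
    missing = []
    for term in query_terms:
        vec = vector_store.get(term)
        if vec is None:
            missing.append(term)  # no key with this exact name in the store
            continue
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
    return total, missing


# Toy two-dimensional stores; real ones here have dimension 200.
term_vectors = {"brand": [1.0, 0.0], "management": [0.5, 0.5]}
doc_vectors = {"C:\\pathtocorpus\\paper01.txt": [0.2, 0.8]}  # hypothetical key

# A phrase is not a key in the document store, so nothing is found.
print(build_query_vector(["brand management"], doc_vectors))

# Its individual words are keys in the term store, so a query vector can be built.
print(build_query_vector(["brand", "management"], term_vectors))
```

The first call finds nothing because the document store is keyed by document names, which is the same situation as the "Didn't find vector" message above.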

A better fit for this might be the SearchBatch class, e.g.:

java pitt.search.semanticvectors.SearchBatch -queryvectorfile termvectors.bin yourFileNameAndPath

-Trevor

 


Dominic Widdows

Feb 15, 2018, 11:29:09 AM
to semanti...@googlegroups.com
Trevor's right, the system won't think that phrases are documents in this sense.

If you still want to use preexisting documents as queries, the query
definitely needs the full path to an indexed document, including the
filename. I'd also expect the paths to be relative to the directory
where you ran the indexing command, but I'm not sure on Windows.

To echo Trevor's suggestion, try something like "java
pitt.search.semanticvectors.Search -queryvectorfile termvectors.bin
-searchvectorfile docvectors.bin semantics", and the output should
tell you some of the pathnames that the document vector store is using
as keys.

Also you could try "$ java
pitt.search.semanticvectors.VectorStoreTranslater -lucenetotext
docvectors.bin docvectors.txt", this won't give search results but it
will transform your document vectors into a plainer text form (I would
hesitate to call this "human readable", but it's documented at
https://github.com/semanticvectors/semanticvectors/wiki/VectorStoreFormats).
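Once docvectors.txt exists, a few lines of Python can list the keys so you can see exactly how the documents were named. This is only a sketch under an assumption about the text layout: one vector per line in the form name|v1|v2|..., possibly preceded by a header line of flags starting with "-". Please check the VectorStoreFormats page above before relying on it. The function name list_vector_keys is my own.

```python
def list_vector_keys(text_store_path, limit=10):
    """Return up to `limit` object names from a text-format vector store.

    Assumed layout (not verified against the library): an optional first
    line of flags starting with "-", then one `name|v1|v2|...` line per vector.
    """
    keys = []
    with open(text_store_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            line = line.rstrip("\r\n")
            if line_number == 0 and line.startswith("-"):
                continue  # skip the assumed header line
            if not line:
                continue  # skip blank lines
            # Everything before the first pipe is taken to be the key.
            keys.append(line.split("|", 1)[0])
            if len(keys) >= limit:
                break
    return keys
```

Calling list_vector_keys("docvectors.txt") should then show the strings to pass as document queries, assuming the layout above holds.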

Best wishes,
Dominic

TopicModeler

Feb 15, 2018, 2:56:39 PM
to Semantic Vectors
Thanks for the suggestion, but when I run SearchBatch, I get the following:

Y:\Java\SemanticVectors>java pitt.search.semanticvectors.SearchBatch -queryvectorfile termvectors.bin MSI\MSI4_year.txt
Opening query vector store from file: termvectors.bin
Exception in thread "main" java.lang.NullPointerException
        at pitt.search.semanticvectors.SearchBatch.runSearch(SearchBatch.java:320)
        at pitt.search.semanticvectors.SearchBatch.main(SearchBatch.java:475)

I'm wondering if CompareTermsBatch might be better here.  I don't quite understand what the format of the files should be for that command.  It seems I should use my list of phrases text file and delimit the terms with a pipe symbol. If I do that, what am I comparing it with, the docvectors file or the termvectors file?  The difference between these two files and how they are used is pretty confusing.

Trevor Cohen

Feb 15, 2018, 3:07:32 PM
to semanti...@googlegroups.com
CompareTermsBatch will compare the vector for the pre-pipe term to the vector for the post-pipe term and return the cosine between them.
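For reference, the cosine here is the dot product of the two vectors divided by the product of their lengths. A minimal sketch in plain Python with made-up toy vectors; it shows the formula only and is not the Semantic Vectors implementation:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length real vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the pre-pipe and post-pipe terms.
print(cosine([1.0, 1.0], [1.0, 0.0]))
```

A value near 1 means the two vectors point in almost the same direction, and a value near 0 means they are close to orthogonal.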

Would you be willing to share an excerpt of your text file?




TopicModeler

Feb 16, 2018, 9:55:32 AM
to Semantic Vectors
Here's the text file:

Advertising Theory | Brand Management | Corporate Communications | Corporate Strategy Effectiveness | Distribution | Marketing Analysis Techniques | Marketing Management | Marketing Metrics | Marketing Mix | Marketing Organization | Marketing Planning | Marketing Productivity | Marketing Resource Allocation | Marketing Style | Performance Measures | Price | Product | Research Methods | Sales Promotion | Salesforce Management | Analysis Techniques | Business Buyer Behavior | Channels of Distribution | Competitive Analysis | Consumer Behavior | Consumer Demand | Global Marketing | International Marketing | Marketing Environment | Research Measurement | Services Marketing |

TopicModeler

Feb 16, 2018, 10:15:04 AM
to Semantic Vectors
I think I should tell you that the purpose of my study is to compare the phrases in the file above with the corpus of files that I have indexed using semantic vectors.  We would like a measure of the frequency, similarity, relevance, (pick one) of these phrases within that text corpus.  If we find, for example, that the cosine similarity between advertising theory and some words in the corpus is high, then we can infer that the subject of advertising theory is adequately covered in the corpus.

Trevor Cohen

Feb 16, 2018, 10:49:12 AM
to semanti...@googlegroups.com
I'd suggest placing each phrase on its own line, removing the pipes, and running SearchBatch again without using the "-matchcase" flag (as Lucene will have converted terms to lower case when indexing).
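If it helps, the reformatting can be scripted. The sketch below splits the pipe-delimited text into one phrase per line; lowercasing is my own addition, on the assumption that it is harmless given that the indexed terms are lower case. The function name phrases_to_lines is made up for this example.

```python
def phrases_to_lines(pipe_text):
    """Split pipe-delimited phrases into a list of trimmed, lowercased phrases."""
    phrases = [part.strip().lower() for part in pipe_text.split("|")]
    return [phrase for phrase in phrases if phrase]  # drop empty trailing pieces


# A short excerpt of the phrase list posted above.
sample = "Advertising Theory | Brand Management | Corporate Communications |"
print("\n".join(phrases_to_lines(sample)))
```

Writing the joined result to a new text file gives a query file with one phrase per line, which can then be passed to SearchBatch as suggested.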

However, I'd be surprised if you get good results with a corpus of this size. A better idea might be to derive a set of term vectors from a larger corpus (e.g. Wikipedia), and then use these to generate representations of your documents.
