Comparing 2 documents explicitly!

Deswick

May 23, 2011, 8:59:29 AM
to Semantic Vectors
Hi,

I am relatively new to Java. I am currently working on a project where
I have to match 2 documents explicitly and find the degree of semantic
similarity between them. I believe Semantic Vectors would be the best
bet for my task.

I am having trouble matching 2 documents explicitly. I tried a couple
of things before posting this and did my best to search and go
through related posts in the group, but I am still unable to solve
this problem. I am facing a deadline.

Suppose I have 2 documents, file1.txt (contains a wiki article about
eggs) and file2.txt (contains a wiki article about chickens), and I have
to compare how semantically related they are based on the terms these
2 documents contain.

I thought CompareTerms would serve the purpose, but it seems that
CompareTerms only takes ‘terms’ as arguments, not the names
of documents. If I pass terms appearing in those 2
documents, it successfully shows the similarity between the
terms, but when I pass the document names, it just says ‘No
vector for file1’, ‘No vector for file2’.

I also went through http://code.google.com/p/semanticvectors/wiki/DocumentSearch,
but this also doesn't seem to solve my problem, since I have to
compare 2 documents explicitly.

I would really appreciate it if someone could explain clearly how I
can go about comparing the 2 documents explicitly. Thanks a lot.

Regards,
Deswick

widdows

May 23, 2011, 9:13:15 AM
to Semantic Vectors
Hi Deswick,

Are you giving the full relative pathnames for your documents?
CompareTerms on two documents works for me, e.g.,
$ java pitt.search.semanticvectors.CompareTerms -queryvectorfile docvectors.bin bible_chapters/John/Chapter_1 bible_chapters/John/Chapter_3
...
INFO: Outputting similarity of "bible_chapters/John/Chapter_1" with
"bible_chapters/John/Chapter_3" ...
0.96769243

Are you giving the full relative pathnames that Lucene indexed? These
are the keys in your document vectors file (usually docvectors.bin).
To check, try searching from documents to terms, e.g.,
$ java pitt.search.semanticvectors.Search -searchvectorfile docvectors.bin john
...
INFO: Search output follows ...
0.4998236:bible_chapters/John/Chapter_1
0.49023014:bible_chapters/Luke/Chapter_7
0.48979598:bible_chapters/Mark/Chapter_1
0.48899928:bible_chapters/Matthew/Chapter_11
0.48565915:bible_chapters/Matthew/Chapter_3

The part after the colon is the key you're looking for. Matching is by
default case-sensitive (documentation of our defaults and how to
override them with regard to case-sensitivity is poor).

Best wishes,
Dominic




Deswick

May 23, 2011, 10:45:54 AM
to Semantic Vectors
Hi Dominic,

That was a really quick reply. Many thanks.
Yes, I am using the same relative path with which Lucene indexed the files.

I am using Windows and Eclipse. I will outline the directory structure
and the commands I used, in case it helps to figure out where I went
wrong:

Docs directory (files to index): C:\workspace\SV\src\docsDir
This directory has both files (file1.txt and file2.txt).
Index directory: C:\workspace\SV\src\indexDir

I indexed the 2 files above with Lucene using
C:\workspace\SV\src\luceneInAction\Indexer.java.
I ran Indexer.java in Eclipse with the arguments:
/workspace/SV/src/indexDir /workspace/SV/src/docsDir
It printed the output:
Indexing C:\workspace\SV\src\docsDir\file1.txt
Indexing C:\workspace\SV\src\docsDir\file2.txt
Indexing 2 files took 281 milliseconds

In the index directory (C:\workspace\SV\src\indexDir), it generated
several Lucene index files.

Then I ran BuildIndex.java (in C:\workspace\SV\src\pitt\search\semanticvectors)
in Eclipse with the argument: /workspace/SV/src/indexDir

INFO: Seedlength = 10
Dimension = 200
Minimum frequency = 0
Maximum frequency = 2147483647
Number non-alphabet characters = 0
Contents fields are: [contents]
23.05.2011 16:18:10 pitt.search.semanticvectors.BuildIndex main
INFO: Creating elemental document vectors ...
23.05.2011 16:18:10 pitt.search.semanticvectors.TermVectorsFromLucene
createTemVectorsFromLuceneImpl
INFO: Populating basic sparse doc vector store, number of vectors: 2
…..
…..
…..
It generated termvectors.bin and docvectors.bin in “C:\workspace\SV”.

Then I ran CompareTerms.java with the arguments:
-queryvectorfile docvectors.bin \workspace\SV\src\docsDir\file1.txt
\workspace\SV\src\docsDir\file2.txt

It generated this output:
23.05.2011 16:21:13 pitt.search.semanticvectors.CompareTerms main
INFO: Opened query vector store from file: docvectors.bin
23.05.2011 16:21:13 pitt.search.semanticvectors.CompareTerms main
INFO: Couldn't open Lucene index at
23.05.2011 16:21:13 pitt.search.semanticvectors.CompareTerms main
INFO: No Lucene index for query term weighting, so all query terms
will have same weight.
23.05.2011 16:21:13
pitt.search.semanticvectors.VectorStoreReaderLucene getVector
INFO: Didn't find vector for '\workspace\SV\src\docsDir\file1.txt'
23.05.2011 16:21:13 pitt.search.semanticvectors.CompoundVectorBuilder
getAdditiveQueryVector
WARNUNG: No vector for \workspace\SV\src\docsDir\file1.txt
23.05.2011 16:21:13
pitt.search.semanticvectors.VectorStoreReaderLucene getVector
INFO: Didn't find vector for '\workspace\SV\src\docsDir\file2.txt'
23.05.2011 16:21:13 pitt.search.semanticvectors.CompoundVectorBuilder
getAdditiveQueryVector
WARNUNG: No vector for \workspace\SV\src\docsDir\file2.txt
23.05.2011 16:21:13 pitt.search.semanticvectors.CompareTerms main
INFO: Outputting similarity of "\workspace\SV\src\docsDir\file1.txt"
with "\workspace\SV\src\docsDir\file2.txt" ...
0.0

Strangely, the output says it couldn’t open the Lucene index, and at
the same time it doesn’t find vectors for file1.txt and file2.txt. When I
copied the Lucene index files next to termvectors.bin and docvectors.bin
(C:\workspace\SV) and ran the program again, it no longer gave the
“Couldn't open Lucene index” message, but SV still could
not locate vectors for either file.

I tried the other path options below, but nothing worked, and the message
was the same: it couldn’t find a vector for file1(.txt) or
file2(.txt):

-queryvectorfile docvectors.bin C:\workspace\SV\src\docsDir\file1.txt
C:\workspace\SV\src\docsDir\file2.txt

or…
C:\workspace\SV\src\docsDir\file1 C:\workspace\SV\src\docsDir\file2

or…
C:\workspace\SV\src\indexDir\file1 C:\workspace\SV\src\indexDir\file2

or…
C:\workspace\SV\src\indexDir\file1 C:\workspace\SV\src\indexDir\file2

or…
\workspace\SV\src\indexDir\file1 \workspace\SV\src\indexDir\file2

and a few more......

But comparing "terms" always works, for example with the arguments:
food chicken

23.05.2011 16:31:29 pitt.search.semanticvectors.CompareTerms main
INFO: Opened query vector store from file: termvectors.bin
23.05.2011 16:31:30 pitt.search.semanticvectors.CompareTerms main
INFO: Outputting similarity of "food" with "chicken" ...
0.9892034

I also tried to retrieve documents using a term, like:
Search.java -searchvectorfile docvectors.bin chicken

Here it displayed the degree of similarity, but didn’t list any doc
name:

23.05.2011 16:34:04 pitt.search.semanticvectors.Search RunSearch
INFO: Opening query vector store from file: termvectors.bin
23.05.2011 16:34:04 pitt.search.semanticvectors.Search RunSearch
INFO: Opening search vector store from file: docvectors.bin
23.05.2011 16:34:04 pitt.search.semanticvectors.Search RunSearch
INFO: Searching term vectors, searchtype SUM ...

23.05.2011 16:34:04 pitt.search.semanticvectors.Search main
INFO: Search output follows ...
0.25289524:


I have tried to post each and every step I went through so that
it can be figured out whether I am making a mistake in the path anywhere.
I would really appreciate your valuable help.

Best Regards,
Deswick

Deswick

May 24, 2011, 10:37:35 AM
to Semantic Vectors
Thanks Dominic. The problem is solved now with the help of another
member of the group who was also working on document matching using
SV.

Best Regards,
Deswick

Dominic Widdows

May 24, 2011, 10:50:41 AM
to semanti...@googlegroups.com
Delighted to hear it, and thanks so much for the thorough description
of your problem.

If you have time, please post any solutions to the list, or even
better as comments to
http://code.google.com/p/semanticvectors/wiki/DocumentSearch. Or if
you want to edit the Wiki directly I can give you access.

Best wishes,
Dominic


Deswick

May 24, 2011, 12:42:04 PM
to Semantic Vectors
Sure, I will explain so that someone else who is new to SV
does not face the same problem.

Since I was using Lucene for the first time, I followed the “Lucene in
Action 2E” book examples step by step to create the initial index of my
data files, after downloading the source code from the book website.
Although I was using the latest versions of SV (2.2) and Lucene (3.1.0),
the Indexer.java I was using from the “Lucene in Action”
book lacked the line doc.add(new Field ("path", PATH OF
FILE, .... ));. This line is necessary: without it, documents are indexed
without path information, so Semantic Vectors is able to find
terms but not documents.

Better still, always use the latest source files provided by Lucene to index
the data files, because they already include the missing line above.
After indexing the data files with Lucene, it is a good idea to run the search demo
provided by Lucene to find out the exact path under which each document has
been indexed. For example, SearchFiles.java
(\lucene-3.1.0-src\lucene-3.1.0\contrib\demo\src\java\org\apache\lucene\demo),
provided by Lucene, prints the names of the matching documents with EXACT path
information after taking query terms as arguments. (For me,
it was earlier giving the error “No path for this document”,
although it would list the number of documents found.)
Once we have the exact path information (from SearchFiles.java, for
example), we can use the same path in CompareTerms.java to
compare the documents, and everything should work fine.

I have also posted all the basic steps as comments to DocumentSearch
for someone who is new to SV. Thanks a lot.

Best,
Deswick

PS: In my earlier post, I mentioned that I copied the Lucene index
files from the index directory and put them next to termvectors.bin
and docvectors.bin (so that I would not get the message “couldn’t
open lucene index”) while comparing 2 documents. That step is not
needed; if you do it, SV outputs “NaN” instead of a similarity score.



Paul M

Jun 4, 2011, 1:00:56 PM
to Semantic Vectors

Hi Deswick and Dom,

Many thanks for the great postings. I have been experimenting with
comparing two documents with SV (running Java from the XP command
line). I too get the "Couldn't open Lucene index at" message
with the CompareTerms example below, even though the index
subdirectory is directly below my current directory.

Should the (Lucene) index directory be accessible when using
CompareTerms? Apologies that I wasn't able to figure this out by myself.

Regards,

Paul

%%%%%%%%%%%%%%%%%%%%%%%% SV CompareTerms example %%%%%%%%%%%%%%%%%%%%%%%%%%%

java pitt.search.semanticvectors.CompareTerms -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin Documents\file1.txt Documents\file2.txt

Jun 4, 2011 12:28:13 PM pitt.search.semanticvectors.CompareTerms main
INFO: Opened query vector store from file: docvectors.bin
Jun 4, 2011 12:28:13 PM pitt.search.semanticvectors.CompareTerms main
INFO: Couldn't open Lucene index at
Jun 4, 2011 12:28:13 PM pitt.search.semanticvectors.CompareTerms main
INFO: No Lucene index for query term weighting, so all query terms
will have same weight.
Jun 4, 2011 12:28:13 PM pitt.search.semanticvectors.CompareTerms main
INFO: Outputting similarity of "Documents\test1.txt" with "Documents\test2.txt"

0.8331268

widdows

Jun 7, 2011, 2:36:09 PM
to Semantic Vectors
Hi Paul,

You need a -luceneindexpath flag. I had to dig in the code to remind
myself of this. It is very poorly documented, for which I apologize.

Let me know if this works for you.

Best wishes,
Dominic

Paul M

Jun 9, 2011, 6:58:45 AM
to Semantic Vectors

Hi Dom,

Many thanks for the reply to the group. I had seen the
-luceneindexpath flag, but I get NaN as output (see below) if I
point it at the index folder. (Deswick commented that he saw something
similar when he tried to use the contents of his index folder.)

If it helps, I am using lucene-3.0.3 and SV version 2.0, as these are
listed as working OK on the Compatibility page.

Many thanks for any suggestions or ideas.

Kind regards,

Paul

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Example output showing NaN

C:\apache-lucene-3.0.3>java pitt.search.semanticvectors.CompareTerms -luceneindexpath index -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin Documents\file1.txt Documents\file2.txt

Jun 9, 2011 6:47:39 AM pitt.search.semanticvectors.CompareTerms main
INFO: Opened query vector store from file: docvectors.bin
Jun 9, 2011 6:47:39 AM pitt.search.semanticvectors.CompareTerms main
INFO: Outputting similarity of "Documents\file1.txt" with "Documents\file2.txt" ...
NaN

widdows

Jun 10, 2011, 1:51:47 PM
to Semantic Vectors
Hi Paul,

I think I see what's going on. Using the Lucene index for weighting
isn't working for documents because the actual names of the documents
(e.g., Documents\file1.txt) don't occur in the Lucene index at all,
so their weight stays at 0.
This may be regarded as a bug, since we end up discarding the
query terms altogether. To change this we'd need some sort of
baseline, e.g., make all weights start at 1 rather than 0. I just made
this change in my local copy and it works fine.

Also, for CompareTerms, I don't think it makes sense to have a
separate queryvectorstore and searchvectorstore, because there isn't
really any searching going on.

Best wishes,
Dominic
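
The effect Dominic describes can be reproduced with a small self-contained sketch. This is not the Semantic Vectors code: the class, the method names, and the example vectors are all hypothetical, and it simply assumes that a query vector is a weighted sum and that similarity is a cosine. With a weight of 0 the query vector collapses to all zeros, and the cosine becomes 0/0, which is NaN in floating point. With a baseline weight of 1 the same comparison gives a finite score.

```java
// Hypothetical illustration, not the Semantic Vectors implementation.
public class WeightedQueryVector {

    /** Sums the given vectors, each scaled by its weight. */
    public static float[] weightedSum(float[][] vectors, float[] weights) {
        float[] sum = new float[vectors[0].length];
        for (int i = 0; i < vectors.length; i++) {
            for (int j = 0; j < sum.length; j++) {
                sum[j] += weights[i] * vectors[i][j];
            }
        }
        return sum;
    }

    /** Cosine similarity; evaluates to NaN if either vector is all zeros (0/0). */
    public static float cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt(normA * normB));
    }

    public static void main(String[] args) {
        float[][] doc1 = {{1f, 2f, 0f}}; // made-up vector for the first document
        float[][] doc2 = {{2f, 1f, 1f}}; // made-up vector for the second document

        // Weight 0: the document name was not found in the index used for weighting.
        float[] zeroWeight = {0f};
        System.out.println(cosine(weightedSum(doc1, zeroWeight),
                                  weightedSum(doc2, zeroWeight)));

        // Baseline weight 1: the vectors survive and the score is finite.
        float[] baseline = {1f};
        System.out.println(cosine(weightedSum(doc1, baseline),
                                  weightedSum(doc2, baseline)));
    }
}
```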

Paul Morris

Jun 11, 2011, 5:30:59 PM
to semanti...@googlegroups.com
Hi Dominic,

Many thanks for investigating the use of the Lucene index information in CompareTerms, and the explanation below is incredibly helpful.

Would you feel comfortable releasing any of your recent code changes or revising the instructions on how best to do a document to document comparison using SV?

This is all very interesting, much thanks again, 

Paul



widdows

Jun 13, 2011, 10:49:53 AM
to Semantic Vectors
Hi Paul,

I added a description to the Wiki page at
http://code.google.com/p/semanticvectors/wiki/DocumentSearch. Many
thanks to Deswick for writing such clear step-by-step instructions
there as well.

I'll try to test and release a version 2.4 with these and a couple of
other features. If there's enough to be useful, we should get it out
there.

Thanks for your encouragement!
Best wishes,
Dominic

Paul Morris

Jun 14, 2011, 11:01:02 AM
to semanti...@googlegroups.com
Hi Dom,

Much thanks for your updates and suggestions to the Group. The extra information about the use of the Lucene index file is very helpful.

Can I confirm that I should be getting identical results (which I do) when I use SV's CompareTerms using Semanticvectors-2.2 and Semanticvectors-2.4? 

I was also wondering if CompareTerms is the correct approach when building a large document-document distance matrix, as the script comparing two documents at a time takes many hours to run when dealing with 500+ individual files.

Apologies if I missed something here,

Paul


widdows

Jun 14, 2011, 11:16:02 AM
to Semantic Vectors
Hi Paul,

If you're doing a lot of queries with CompareTerms, I'm guessing that
you're suffering quite a bit from disk I/O for every pairwise query.

One solution to this is to use CompareTermsBatch instead, using the
flag "-vectorstorelocation ram" as well as the others to tell the
program to read all the vectors into memory before you start.
See http://semanticvectors.googlecode.com/svn/javadoc/latest-stable/pitt/search/semanticvectors/CompareTermsBatch.html

You'll also need to prepare a file with the pairs of terms you want to
compare, and pass these into stdin. (This could relatively easily be
reconfigured to be a file argument, I guess.)

This is something of an "expert feature", i.e., there's a certain
amount to figure out and prepare, but you will be trading a
(hopefully) reasonable amount of your time for a big reduction in
machine time.

Best wishes,
Dominic
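
For preparing the file of pairs, a minimal Java sketch along these lines may help. It writes every unordered pair of document keys, one pair per line. The class name, the example keys, and especially the "|" separator are assumptions made for illustration; check the CompareTermsBatch Javadoc linked above for the input format it actually expects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Semantic Vectors distribution.
public class PairListWriter {

    /** Builds one line per unordered pair of keys, joined by the given separator. */
    public static List<String> allPairs(List<String> keys, String separator) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            for (int j = i + 1; j < keys.size(); j++) {
                pairs.add(keys.get(i) + separator + keys.get(j));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Example keys only; use the exact keys stored in your docvectors.bin.
        List<String> keys = Arrays.asList(
            "Documents\\file1.txt", "Documents\\file2.txt", "Documents\\file3.txt");
        // The "|" separator is an assumption about the expected input format.
        for (String pair : allPairs(keys, "|")) {
            System.out.println(pair);
        }
    }
}
```

Redirecting the output to a file gives something that can be fed to CompareTermsBatch on stdin together with "-vectorstorelocation ram". For 500 documents this produces 124,750 pairs, handled in a single run, so the vectors are read from disk once rather than once per pair.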