comparing two terms


Ahmet Arslan

Aug 31, 2013, 2:21:47 PM
to semanti...@googlegroups.com
Hi all,

I downloaded semanticvectors-4.0. I built an index using pitt.search.semanticvectors.BuildIndex.main (with default values) from an index created by Solr 4.4.0.
It created termvectors.bin and docvectors.bin successfully.

I want to obtain term2term similarities. I use pitt.search.semanticvectors.CompareTerms.main

For some term pairs I get a negative similarity. For example, measureOverlap("santral", "projeleri") = -0.184441952582591
Is this expected? By the way, these words co-occur in my collection 4 times.

For non-found terms measureOverlap returns Double.NaN. I can understand that.

I am trying to aggregate overlap values for a sentence. Currently I use addition, for example:
support("term1 term2 term3") = measureOverlap("term1", "term2") + measureOverlap("term2", "term3");

Do you think BuildIndex and CompareTerms are suitable for my task? Any suggestions?

Does "-vectortype [real, complex or binary]" matter in my case?

I am also not sure how to handle negative and NaN values when aggregating.


I found a typo in pitt.search.semanticvectors.BuildIndex.main. The usage message says "-filternumbers [true or false]", but the flag is actually "-filteroutnumbers".

This flag doesn't seem to work, though. I built the index using "-filteroutnumbers", "false", but it still says no vector is found for numbers. I even changed its default value in the source code, with no luck.

Thanks,
Ahmet

Dominic Widdows

Sep 2, 2013, 11:27:38 AM
to semanti...@googlegroups.com
Hi Ahmet,

Sorry for not getting back to you sooner. I just got back from vacation and my kids' first day of school is tomorrow, so things are hectic. For now I'll give short answers to some of your questions.

Negative similarities are expected since we start with more or less random vectors. If you pick two high-dimensional vectors at random, their similarity is likely to be small, but just as likely to be negative as positive. However, -0.18 is a surprisingly large value. I would try experimenting with different numbers of dimensions to see if this is just an accident of one configuration.
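
A quick way to see this for yourself is a small simulation. The sketch below is not SemanticVectors code: it assumes dense Gaussian random vectors (the class and method names are made up for illustration) and measures how often the cosine of two random vectors is negative and how large it typically is at different dimensions. Trained term vectors are not purely random, so treat it only as a baseline for what "small" means.

```java
import java.util.Random;

class RandomVectorSimilarity {

    /** Cosine similarity between two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** A vector of independent standard Gaussian coordinates (an assumption of this sketch). */
    static double[] randomVector(int dim, Random rng) {
        double[] v = new double[dim];
        for (int i = 0; i < dim; i++) {
            v[i] = rng.nextGaussian();
        }
        return v;
    }

    /** Mean of |cosine| over `trials` independent random pairs. */
    static double meanAbsCosine(int dim, int trials, long seed) {
        Random rng = new Random(seed);
        double sum = 0.0;
        for (int t = 0; t < trials; t++) {
            sum += Math.abs(cosine(randomVector(dim, rng), randomVector(dim, rng)));
        }
        return sum / trials;
    }

    /** Fraction of random pairs whose cosine is negative. */
    static double fractionNegative(int dim, int trials, long seed) {
        Random rng = new Random(seed);
        int negative = 0;
        for (int t = 0; t < trials; t++) {
            if (cosine(randomVector(dim, rng), randomVector(dim, rng)) < 0.0) {
                negative++;
            }
        }
        return (double) negative / trials;
    }

    public static void main(String[] args) {
        // Compare a few dimensions: the typical magnitude should shrink as dim grows.
        for (int dim : new int[] {20, 200, 2000}) {
            System.out.printf("dim=%d  mean|cos|=%.4f  negative=%.2f%n",
                dim, meanAbsCosine(dim, 1000, 42L), fractionNegative(dim, 1000, 42L));
        }
    }
}
```

Roughly half the pairs should come out negative at every dimension, while the typical magnitude shrinks as the dimension grows, which is why trying a different number of dimensions is a useful check.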

Your proposed measure looks like a way of measuring sentence consistency, i.e., internal similarity. Is this what you intend? (You're not trying to compare two sentences word-by-word, just trying to find a number saying how self-similar the words in the sentence are, correct?) If so your measure looks reasonable, though it should be used carefully and perhaps there are potential improvements. In particular, if B is a stopword in the sentence A B C, sim(A, B) and sim(B, C) will both be NaN, so to get a nontrivial answer you should apply the same filtering to your sentence as is applied in the corpus. Also, the most "consistent" sentence possible would just be A A A A ..., which would be a trivial statement. Is this desirable?
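
To make the filtering point concrete, here is a minimal sketch of the adjacent-pair sum with out-of-vocabulary terms dropped first. It does not call the SemanticVectors API: the `vocabulary` set and the `overlap` function are hypothetical stand-ins for "terms that have a vector" and for whatever returns the term-term similarity (for example a wrapper around measureOverlap).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

class SentenceSupport {

    /**
     * Sums the similarity of adjacent term pairs after removing terms that
     * have no vector. Returns NaN when fewer than two terms survive, since
     * there is then no pair to score.
     */
    static double support(String sentence,
                          Set<String> vocabulary,
                          ToDoubleBiFunction<String, String> overlap) {
        List<String> kept = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (vocabulary.contains(token)) {
                kept.add(token);
            }
        }
        if (kept.size() < 2) {
            return Double.NaN;
        }
        double total = 0.0;
        for (int i = 0; i + 1 < kept.size(); i++) {
            total += overlap.applyAsDouble(kept.get(i), kept.get(i + 1));
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy similarity table; real values would come from the term vectors.
        Map<String, Double> toy = Map.of("A|C", 0.3);
        Set<String> vocabulary = Set.of("A", "C");
        // "B" plays the role of a stopword with no vector, so A is paired with C.
        double score = support("A B C", vocabulary,
            (x, y) -> toy.getOrDefault(x + "|" + y, 0.0));
        System.out.println(score);
    }
}
```

Negative pair values are simply added here. Whether to clip them at zero, or to divide by the number of pairs so that long sentences are not favoured, depends on what the score is used for.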

You might be interested in the sentence similarity technique described in section 5 of http://www.puttypeg.net/papers/orthogonality-and-orthography.pdf (see https://code.google.com/p/semanticvectors/wiki/OrthographicVectors but it's only a stub so far).

On the filteroutnumbers issue, you should probably also set maxnonalphabetchars. I'll try to check tomorrow; the relationship between these flags isn't well documented or tested. Sorry about that.

I would strongly recommend experimenting with vectortype real, complex, and binary, because they sometimes give interestingly different models. I understand that most researchers won't see this as directly relevant to their work, though.

Best wishes,
Dominic


Ron King

Oct 23, 2020, 3:54:06 PM
to Semantic Vectors
Hi all! In section 5, table 5 of the paper "Orthogonality and Orthography: Introducing Measured Distance into Semantic Space", sentence similarity is attempted by searching based on a "cue".
How can I do this with SV? I built an index using -elementalmethod orthographic. What do I do now to use the index to search for similar sentences?

Dominic Widdows

Oct 23, 2020, 5:11:00 PM
to semanti...@googlegroups.com
Two main options:

If you have the docid (whatever you used to index the document), then something like:

java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin $this_doc_id

If you just have the text of the doc, you can use this as a query, something like:

java pitt.search.semanticvectors.Search -queryvectorfile termvectors.bin -searchvectorfile docvectors.bin "your document text gets pasted in here"
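
Conceptually, the second option builds a query vector from the term vectors of the words in the text and ranks document vectors by similarity to it. The sketch below illustrates that idea only, assuming the query vector is a plain sum of term vectors and that ranking uses cosine similarity. It is not the library's implementation, and the class, method, and map names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class TextQuerySketch {

    /** Sums the vectors of all query tokens that have a term vector; unknown tokens are skipped. */
    static double[] queryVector(String text, Map<String, double[]> termVectors, int dim) {
        double[] query = new double[dim];
        for (String token : text.trim().split("\\s+")) {
            double[] v = termVectors.get(token);
            if (v == null) {
                continue;
            }
            for (int i = 0; i < dim; i++) {
                query[i] += v[i];
            }
        }
        return query;
    }

    /** Cosine similarity; returns 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Document ids ordered from most to least similar to the query vector. */
    static List<String> rank(double[] query, Map<String, double[]> docVectors) {
        List<String> ids = new ArrayList<>(docVectors.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> -cosine(query, docVectors.get(id))));
        return ids;
    }

    public static void main(String[] args) {
        // Two-dimensional toy vectors standing in for termvectors.bin and docvectors.bin.
        Map<String, double[]> terms = Map.of(
            "cat", new double[] {1.0, 0.0},
            "dog", new double[] {0.0, 1.0});
        Map<String, double[]> docs = Map.of(
            "doc1", new double[] {1.0, 0.1},
            "doc2", new double[] {0.1, 1.0});
        System.out.println(rank(queryVector("cat cat", terms, 2), docs));
    }
}
```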


Hope that helps!
Dominic

Ron King

Oct 23, 2020, 7:56:47 PM
to semanti...@googlegroups.com
I've used pitt.search.lucene.IndexFlatFilePositions to process enwiki-corpus.txt, a file with 2 million sentences, so it's like having 2 million documents.
I then ran pitt.search.semanticvectors.BuildIndex -minfrequency 2 -luceneindexpath positional_index -elementalmethod orthographic

Questions:
Does using the option -elementalmethod orthographic help when searching sentences?
Do I need to use the searchtype option, set to proximity?
In your example, you put the query in quotes, does that cause it to represent a 'sentence'?
What is the SentenceVectors.java class used for? Is it relevant for computing sentence similarity?

Thanks!




--
Youth and Exuberance will never overcome Age And Treachery

Trevor Cohen

Oct 27, 2020, 9:59:44 AM
to semanti...@googlegroups.com
On Fri, Oct 23, 2020 at 4:56 PM Ron King <ronc...@gmail.com> wrote:
Questions:
Does using the option -elementalmethod orthographic help when searching sentences?

No, this affects word vectors only - word vectors at the start of training will be similar if they represent words that are orthographically similar to one another.

Do I need to use the searchtype option, set to proximity?

It's been quite a long time since I looked at this part of the codebase (I'd forgotten it existed), but if memory serves, the idea with the proximity search was to find documents where two terms occur close to one another. The SentenceVectors class encodes the relative position of words into the document vector representation, and the proximity search tries to leverage this encoding by inferring the distance between a pair of words.
 
In your example, you put the query in quotes, does that cause it to represent a 'sentence'?
What is the SentenceVectors.java class used for? Is it relevant for computing sentence similarity?

Can you point me toward the example? I suspect this was just using an entire sentence as a query, i.e. we'd generate a set of sentence vectors using the SentenceVectors.java class, and then use one of these as a cue and see which other sentences are close to it in space.

-Trevor
 

Dominic Widdows

Oct 27, 2020, 1:43:31 PM
to semanti...@googlegroups.com
Thanks again, Trevor.

In your example, you put the query in quotes, does that cause it to represent a 'sentence'?

On this, I don't think the quotes do anything special, sorry for the confusion.

Best wishes,
Dominic
 