getMostSimilar() for multiple terms


David Webb

Jul 21, 2011, 2:04:46 PM
to S-Space Package Users
In our application of the LSA algorithm, there is often more than one
term involved.

Currently, if someone wants to build a list of 20 similar terms to a
set of terms like "Java" and "Oracle", I call getMostSimilar() for
each term and then sort the collection on score and take the top n
results.

It seems like some extra consideration could be taken into account
when computing similarity for more than one term.

Is there anything that looks at the similarity of the 1..n terms
passed in, then finds the most similar terms based on the collection
of terms as a whole?

Let me know if I am making up stuff in my head :)

You guys are a lot smarter than me so I appreciate the guidance here.

David Jurgens

Jul 21, 2011, 5:24:06 PM
to s-spac...@googlegroups.com
You're bringing up good questions :)

There are a few ways to handle what you're talking about.  The most basic way I can think of is to compute the similarity of every word to each of your reference words and then pick the top-k words with the largest similarity sum.  Perhaps a more interesting way would be to ask what your reference words have in common and then find the words that are closest to those commonalities.

We don't have direct support for either, but they shouldn't be too hard to implement.  For the first one, I think you can do something like:


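import java.util.HashSet;
import java.util.Set;
import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.common.Similarity;
import edu.ucla.sspace.util.BoundedSortedMultiMap;
import edu.ucla.sspace.vector.Vector;

int wordsToFind = 10;
// Keeps only the wordsToFind entries with the largest keys (similarity sums)
BoundedSortedMultiMap<Double,String> similaritySumToWords =
    new BoundedSortedMultiMap<Double,String>(wordsToFind);
Set<String> referenceWords; // "Java", "Oracle"; assign before use
Set<Vector> referenceVectors = new HashSet<Vector>();
SemanticSpace sspace; // your loaded semantic space; assign before use
for (String w : referenceWords)
    referenceVectors.add(sspace.getVector(w));

// Score every word in the space by its summed similarity to the references.
// Note that the reference words themselves are scored too, so you may want
// to skip them here.
for (String w : sspace.getWords()) {
    Vector v = sspace.getVector(w);
    double similaritySum = 0d;
    for (Vector ref : referenceVectors)
        similaritySum += Similarity.cosineSimilarity(v, ref);
    similaritySumToWords.put(similaritySum, w);
}
// The values of similaritySumToWords are those K most similar words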


The second option is a bit more complicated since you have to decide how you want to select which features are in common.  Adding or multiplying the reference words' vectors are probably good places to start.
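
As a rough illustration of the "adding" starting point, here is a minimal sketch that sums the reference vectors into a single query vector and ranks the remaining words by cosine similarity to it.  It deliberately avoids the S-Space classes and uses a plain Map of words to double[] arrays as a stand-in for the semantic space; the class name CentroidSimilarity, its helper methods, and the toy vectors are all made up for the example, so treat it as a starting point rather than the package's way of doing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the S-Space package.
final class CentroidSimilarity {

    /** Cosine similarity of two equal-length dense vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Element-wise sum of the reference words' vectors, or null if none are in the space. */
    static double[] sum(Map<String, double[]> space, List<String> refs) {
        double[] total = null;
        for (String w : refs) {
            double[] v = space.get(w);
            if (v == null) continue;              // skip words missing from the space
            if (total == null) total = new double[v.length];
            for (int i = 0; i < v.length; i++) total[i] += v[i];
        }
        return total;
    }

    /** Up to k non-reference words, most similar to the summed vector first. */
    static List<String> mostSimilar(Map<String, double[]> space, List<String> refs, int k) {
        List<String> candidates = new ArrayList<>();
        double[] query = sum(space, refs);
        if (query == null) return candidates;
        for (String w : space.keySet())
            if (!refs.contains(w)) candidates.add(w);
        // Negate the score so that the sort is in descending order of similarity.
        candidates.sort(Comparator.comparingDouble((String w) -> -cosine(space.get(w), query)));
        return candidates.subList(0, Math.min(k, candidates.size()));
    }

    public static void main(String[] args) {
        // Toy three-dimensional vectors, invented purely for illustration.
        Map<String, double[]> space = new LinkedHashMap<>();
        space.put("java", new double[] {1, 0, 0});
        space.put("oracle", new double[] {0, 1, 0});
        space.put("jdbc", new double[] {1, 1, 0});
        space.put("coffee", new double[] {1, 0, 1});
        space.put("banana", new double[] {0, 0, 1});
        System.out.println(mostSimilar(space, Arrays.asList("java", "oracle"), 2));
        // prints [jdbc, coffee]
    }
}
```

Because cosine similarity ignores vector length, summing and averaging the reference vectors give the same ranking.  Multiplying element-wise instead would keep only the dimensions where every reference word has weight, which is closer to an "in common" reading but can zero out the query entirely with sparse vectors.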