SV queries of Names

Clive Cox

Jan 12, 2012, 9:16:17 AM
to Semantic Vectors
Hi,

I'm finding that when my documents contain names in the original text,
e.g. Steve Jobs, and are indexed with Lucene as normal, searches using
SUM for Steve Jobs do not return good results: documents containing
just Jobs or just Steve heavily influence the ranking. What ways round
this are there? I suppose I could get Lucene to index bigrams of
tokens, so that Steve_Jobs is indexed as a single term, or use named
entity recognition to cut the bigrams down to likely names. I have
tried a positional index, but that doesn't seem to work well either.
Any suggestions?

Clive

Dominic

Jan 16, 2012, 12:52:46 PM
to Semantic Vectors
Hi Clive,

You're quite right about the poor behavior of current vector search
techniques with words like "Gates", "Jobs", "Bush", etc. when used as
proper names. In some contexts, it's regarded as a good thing that the
vector sum of two query terms returns documents about one or the
other when both terms aren't present, but that only holds when the
compound is simply compositional, in the sense that the composition
doesn't change the meaning of its parts. So "tough jobs" keeps the
standard common-noun meaning of jobs, and vector sum is quite good;
"Steve Jobs" changes this meaning entirely, and vector sum is lousy.
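
To make the effect concrete, here is a minimal toy sketch in Python. The three-dimensional vectors, the document names, and the rank helper are all made up for illustration; they are not output from Semantic Vectors and not its API. The only assumption is that the term vector for "jobs" is dominated by its common-noun usage.

```python
import math

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy vectors. Axes, loosely: [employment, technology, first names].
term_vectors = {
    "steve": [0.0, 0.1, 1.0],
    "jobs":  [1.0, 0.1, 0.0],  # assumed: mostly seen as a common noun
    "tough": [0.8, 0.0, 0.1],
}
doc_vectors = {
    "apple_keynote":     [0.1, 1.0, 0.3],
    "job_market_report": [1.0, 0.1, 0.0],
    "steve_martin_bio":  [0.0, 0.0, 1.0],
}

def rank(query_terms):
    """Rank documents by cosine similarity to the sum of the query term vectors."""
    query = [0.0, 0.0, 0.0]
    for term in query_terms:
        query = add(query, term_vectors[term])
    return sorted(doc_vectors,
                  key=lambda name: cosine(query, doc_vectors[name]),
                  reverse=True)

print(rank(["tough", "jobs"]))  # compositional: the employment document leads
print(rank(["steve", "jobs"]))  # non-compositional: the Apple document comes last
```

With these toy numbers the sum for "steve jobs" sits between "things about jobs" and "things about people called Steve", so the one document that is actually about Steve Jobs is ranked below both.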

Tim Baldwin, some others, and I did some work on this some years
ago, using the very fact that vector sum is sometimes lousy to
predict when compounds are in fact non-compositional
(http://www.puttypeg.net/papers/mwe-decomposability.pdf). I mean
non-compositional in the sense that the composition changes the
meaning of the parts, and I hope not in the sense that finding good
operators to model this composition is impossible! But it's a big
challenge that we as a community are slowly working towards.

In the meantime, using a statistical collocation recognizer or named
entity recognizer to isolate these terms as a pre-indexing step is
probably your most reliable option. Your less reliable and potentially
much more glorious option is to solve the root problem and tell
everyone how you did it!
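
As a rough sketch of what such a pre-indexing step might look like, the Python below scores adjacent token pairs by pointwise mutual information (PMI) and joins the high-scoring ones into single tokens before the text goes to the indexer. PMI is just one common collocation measure, chosen here for illustration; the function names, the thresholds, and the underscore-joining convention are assumptions, not part of Semantic Vectors or Lucene.

```python
import math
from collections import Counter

def find_collocations(docs, min_count=2, min_pmi=1.0):
    """Return the set of adjacent token pairs whose PMI is at least min_pmi.

    docs is a list of token lists. min_count and min_pmi are illustrative
    thresholds that would need tuning on a real corpus.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    found = set()
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_pair = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        if math.log2(p_pair / (p_a * p_b)) >= min_pmi:
            found.add((a, b))
    return found

def merge_collocations(tokens, collocations):
    """Join each recognized pair into one token, e.g. steve jobs -> steve_jobs."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

docs = [
    "steve jobs founded apple".split(),
    "steve jobs unveiled the iphone".split(),
    "tough jobs are hard to fill".split(),
    "the jobs report showed more jobs".split(),
]
collocations = find_collocations(docs)
print(merge_collocations("steve jobs had tough jobs".split(), collocations))
```

The same merge would have to be applied to queries as well as documents, so that a search for Steve Jobs looks up the single term steve_jobs rather than the sum of its parts.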

Best wishes,
Dominic