Hi Clive,
You're quite right about the poor behavior of current vector search
techniques with words like "Gates", "Jobs", "Bush", etc. when used as
proper names. In some contexts, it's regarded as a good thing that the
vector sum of "Steve" and "Jobs" returns documents about one or the
other when both query terms aren't present, but that's when the
compound is simply compositional, in the sense that the composition
doesn't change the meaning of its parts. So "tough jobs" keeps the
standard common-noun meaning of "jobs", and vector sum works quite
well, whereas "Steve Jobs" changes this meaning entirely and vector
sum is lousy.
Tim Baldwin and I, along with others, did some work on this some years
ago, using the very fact that vector sum is sometimes lousy to
predict when compounds are in fact non-compositional
(http://www.puttypeg.net/papers/mwe-decomposability.pdf). I mean
non-compositional in the sense that the composition changes the
meaning of the parts and, I hope, not in the sense that finding good
operators to model this composition is impossible! But it's a big
challenge that we as a community are slowly working towards.
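
The idea of using a poor vector sum as a warning sign can be sketched
in a few lines of Python. This is only a toy illustration and not the
method from the paper: the three-dimensional vectors are invented for
the example, and the names `vectors` and `compositionality` are
hypothetical. It assumes you already have a vector for the compound
itself (for instance, by indexing it as a single token) and compares
that with the sum of the vectors of its parts.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def compositionality(compound_vec, part_vecs):
    """Similarity between a compound's own vector and the sum of its parts."""
    total = [sum(components) for components in zip(*part_vecs)]
    return cosine(compound_vec, total)

# Made-up vectors; dimensions are roughly (employment, difficulty, technology).
vectors = {
    "tough": [0.1, 0.9, 0.0],
    "steve": [0.1, 0.0, 0.2],
    "jobs": [0.9, 0.1, 0.1],
    "tough_jobs": [0.7, 0.7, 0.05],   # compound indexed as one token
    "steve_jobs": [0.05, 0.0, 0.95],  # compound indexed as one token
}

tough_score = compositionality(
    vectors["tough_jobs"], [vectors["tough"], vectors["jobs"]]
)
steve_score = compositionality(
    vectors["steve_jobs"], [vectors["steve"], vectors["jobs"]]
)

print(f"tough jobs: {tough_score:.2f}")  # high: the sum models the compound well
print(f"steve jobs: {steve_score:.2f}")  # low: the sum misses the compound's meaning
```

A low score flags a compound whose meaning the sum fails to capture,
which is the kind of candidate you would want to treat as a single
unit. Where to put the threshold is an empirical question that this
sketch does not answer.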
In the meantime, using a statistical collocation recognizer or named
entity recognizer to isolate these terms as a pre-indexing step is
probably your most reliable option. Your less reliable and potentially
much more glorious option is to solve the root problem and tell
everyone how you did it!
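
A minimal sketch of the pre-indexing step, assuming a simple
pointwise mutual information (PMI) score over adjacent word pairs as
the collocation recognizer. The function names, thresholds, and toy
corpus are all hypothetical, and a real system would more likely use
an established collocation or named entity tool with far more data.

```python
import math
from collections import Counter

def find_collocations(sentences, min_count=2, min_pmi=1.0):
    """Return adjacent word pairs that co-occur more often than chance."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    collocations = set()
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare for the estimate to be trusted
        p_pair = count / total_bi
        p_w1 = unigrams[w1] / total_uni
        p_w2 = unigrams[w2] / total_uni
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        if pmi >= min_pmi:
            collocations.add((w1, w2))
    return collocations

def merge_collocations(tokens, collocations):
    """Join each recognized pair into one token, e.g. 'steve_jobs'."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy corpus, already lowercased and tokenized.
corpus = [
    "steve jobs founded apple".split(),
    "steve jobs unveiled a phone".split(),
    "the economy added jobs".split(),
    "tough jobs pay well".split(),
    "the phone sold well".split(),
]

collocations = find_collocations(corpus)
indexed = merge_collocations("steve jobs liked tough jobs".split(), collocations)
print(indexed)  # ['steve_jobs', 'liked', 'tough', 'jobs']
```

Feeding the merged tokens to the indexer gives "steve_jobs" its own
vector, separate from the common noun "jobs", while "tough jobs" is
left to ordinary vector sum.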
Best wishes,
Dominic