What I've called 'ProxHash' is a surrogate or proxy for proximity. More
precisely, 'ProxHash' is a mapping from unlimited-dimension
information-topic vector space into a fixed-dimension vector space,
such that the function is "continuous" in the mathematical sense
(nearby input must produce nearby output) and the inverse is
APPROXIMATELY "continuous" in the same sense (nearby output usually
comes from nearby input, not too many "hashing collisions" with typical
data).
Information retrieval is usually based on some abstraction of documents
that in some way represents the content of those documents. (For
example: keywords are weighted in a vector space, a separate dimension
for each distinct keyword; or documents are classified into hierarchial
categories, with major different topics further apart than different
sub-topics within a large topic.) Queries are typically directly
matched against documents in this abstract space (of unbounded
dimension for standard keyword methods). But if documents and queries
are first mapped from abstracted-document (topic) space to ProxHash
space, matching is then easier because (1) the vectors are fixed
dimension hence faster to process, (2) the various components of the
vectors are orthogonal so that processing on them can occur in
parallel, and most of them can be ignored for preliminary screening,
(3) filling of space along any small number of dimensions is roughly
uniform so that radix sorting (sorting first by the high-order bits of
any particular coordinate) works efficiently thereby supporting
fixed-time access into an indexed corpus. ProxHash therefore does for
proximity (nearness) searches what traditional hash tables do for
identity (exact match) searches.
--
Art Pollard <Poll...@Hawaii.edu>
Moderator for Comp.Theory.Info-Retrieval
List Maintainer for the Hyper-Theory (Hypertext Theory) mailing list.