Speaker : Alessandro Tiberi
Date : March 06, 2006
Venue & Time : Aula Alfa, Via Salaria 113. 3:00 PM
Abstract : In information retrieval applications, data is
usually represented by points in a multi-dimensional
space, where each dimension corresponds to a feature
of the data (e.g., term frequency for text data).
In this setting, answering a query amounts to
finding the query point's nearest neighbor(s). The
number of dimensions is usually very large, which
makes it difficult to answer queries efficiently.
In fact, it is often hard to do much better than
the naive algorithm: measure the distance between
the query point and each data point, then select
the K points with the shortest distances.
We propose a family of clustering schemes that is
both very simple and very effective: it saves
orders of magnitude in query processing cost at a
modest compromise in the quality of the retrieved
points. Moreover, we propose a generative model of
data points that allows rigorous theoretical
analysis of the clustering schemes. Experimental
evidence of the effectiveness of our schemes is
presented, on both synthetic data from our model
and a real document corpus.
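
For readers unfamiliar with the baseline, the naive algorithm mentioned in
the abstract can be sketched roughly as follows. This is only an
illustration, not the speaker's code: the function name `knn_brute_force`,
the choice of Euclidean distance, and the toy data are all assumptions.

```python
import heapq
import math


def knn_brute_force(query, points, k):
    """Return the k nearest points to `query` as (distance, index) pairs.

    Assumption: Euclidean distance; the abstract does not name a metric.
    """
    # Measure the distance from the query point to every data point.
    distances = ((math.dist(query, p), i) for i, p in enumerate(points))
    # Keep the k smallest distances, sorted from nearest to farthest.
    return heapq.nsmallest(k, distances)


# Toy data (hypothetical): four points in a 2-dimensional feature space.
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (0.5, 0.0)]
result = knn_brute_force((0.0, 0.0), points, k=2)
print(result)
```

Each query touches every data point, so the cost grows linearly with both
the corpus size and the number of dimensions. That linear cost is what the
clustering schemes in the talk aim to reduce.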
Please feel free to extend this invitation to other interested people.
http://www.dsi.uniroma1.it/smart
--
with best regards from,
_____________________________________________________________________
Vishwas Patil
Dipartimento di Informatica
Universita degli Studi di Roma - La Sapienza
Via Salaria 113, 00198 Roma, Italy
Tel: +39-3341 02 8875 Fax: +39-06 8541 842
http://www.dsi.uniroma1.it/~patil ivis...@gmail.com
_____________________________________________________________________