I have read the literature on fuzzy/approximate search for strings
keys in large databases, but I could not find a clear answer on what
is the best that has been achieved thus far in terms of the cost of
search (regardless of cost of building any index).
I understand this depends on the application/domain/type of data, but,
to keep things simple, supoose the target data is some string (names,
addresses, or any sequence of characters), and assume any distance
function. I am simply asking this: what is the best that has been
achieved? Is there a logarithmic solution (i.e., that can retrieve a
set of "similar" objects in some logarithmic order), or are they all,
in the worst case, linear?
I thank you in advance...
Without a clear definition of exactly what you are trying to
accomplish, your question cannot be answered.
If I have a hash table of strings, with no duplicates allowed, and my
hash table is large enough, then insert, update, and delete are all
O(1) for exact strings.
Are the strings short or long?
What are the rules for similarity?
Are you looking at soundex sort of similarity or Levenstein edit
distance or something else?
Can we have exact duplicate strings stored in our repository?
What is the domain data? Personal names? DNA sequences?
Do you expect to return the entire class in the case of a fuzzy match?
I do not think you can expect an exact answer with a fuzzy question
(even when the topic of the question is fuzzy).
I guess if you specify the problem exactly you will get a better
answer.
Walid,
I am not aware of any single best algorithm, but I don't discount that
there may be one. The ones from both the edit-distance- and token-
based camps I see popping up quite a bit in the literature are Jaro-
Winkler, Levenshtein, Jaccard, Monge-Elkan, Smith-Waterman, and
Fellegi-Sunter.
I've begun doing a fair amount of work on cleaning (both de-duping and
standardizing) name (personal names and corporate/business names)
data, and am finding the combination of Jaccard and Jaro-Winkler
promising.
There are scores of excellent papers out there, but two of my current
favorites are:
"A Comparison of String Distance Metrics for Name-Matching Tasks"
available here: http://tinyurl.com/4b8jao
--and--
"D-Dupe: An Interactive Tool for Entity Resolution in Social Networks"
available here: http://tinyurl.com/3nt3vj
Here are links to some researchers that are leaders in this area:
http://www.cs.umd.edu/~indrajit/HTML/cv.html
http://www.cs.umd.edu/~getoor/
http://www.cs.cmu.edu/~wcohen/
http://www.cecs.csulb.edu/~monge/
http://www.research.att.com/~divesh/papers/index.php
Also, these books may help:
_Data Quality and Record Linkage Techniques_ by Herzog, Sheuren, and
Winkler
_Data Quality_ by Batini and Scannapieco
_Exploratory Data Mining and Data Cleaning_ by Dasu and Johnson
Finally, some software:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
http://secondstring.sourceforge.net/
Regards,
--Jeff
Walid:
This graph (at the first software site I list above) may be what
you're ultimately looking for:
http://www.dcs.shef.ac.uk/~sam/images/string_metrics_comparison.jpg
--Jeff