The idea of presorting the set of addresses sounds attractive, but I
see two problems with it: first, that someone *might* make a typo in
the first character or two, which would land the mistyped address far
from its correct neighbors in sort order and cost you extra initial
steps. More importantly, though, Levenshtein distance is a magnitude
with no direction -- it tells you how far apart two strings are, but
not which way to move -- so you can't do a proper binary search.
(Which is to say, if you start at the middle item and get a distance
of 42, does that mean you look in the top half or the bottom half?)
You might be able to do something like calculating Levenshtein
distance against the (n/2)th element, the (n/4)th element, and the
(3n/4)th element, then picking the two elements with the lowest
distances and recursing between them... but you'll want to reason
through that looking for pathological cases. If that's an actual
algorithm (i.e., it will always get you there), it's probably not
optimal, but it's certainly less pessimal than 3 million compares.
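To make that concrete, here's a minimal Ruby sketch of the
probe-and-recurse idea (the levenshtein helper is a plain
dynamic-programming implementation added for illustration, and all
the names are mine). Treat it as speculative, per the above: it
always narrows to the span between the two closest probes, and
nothing guarantees the best match actually lies in that span.

    # Plain DP Levenshtein distance (helper added for illustration).
    def levenshtein(a, b)
      prev = (0..b.length).to_a
      a.each_char.with_index(1) do |ca, i|
        curr = [i]
        b.each_char.with_index(1) do |cb, j|
          curr << [prev[j] + 1, curr[j - 1] + 1,
                   prev[j - 1] + (ca == cb ? 0 : 1)].min
        end
        prev = curr
      end
      prev.last
    end

    # Probe the quarter points of the sorted range, keep the two
    # probes with the smallest distances, and recurse on the span
    # between them. Falls back to a linear scan once the range is
    # small.
    def probe_search(sorted, query, lo = 0, hi = sorted.length - 1)
      if hi - lo < 8
        return sorted[lo..hi].min_by { |s| levenshtein(s, query) }
      end
      quarter = (hi - lo) / 4
      probes  = [lo + quarter, lo + 2 * quarter, lo + 3 * quarter]
      a, b = probes.min_by(2) { |i| levenshtein(sorted[i], query) }.sort
      probe_search(sorted, query, a, b)
    end

One pathological case to look for: the true nearest neighbor can sit
between the two probes that were *not* the closest pair, at which
point the recursion discards it for good.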
I also wonder if you might be able to recruit your database to help...
at first glance, searches like "levenshtein distance database" or
"database fuzzy matching" look mildly promising.
Thanks for the fun brainteaser! I'll report back if I get any flashes
of insight (and remember them long enough to get back to a keyboard).
-Sam
Smart code in Ruby will beat a dumb algorithm in C, though. I'd say
the real issue is the framing of the problem as a 1-to-n comparison
rather than as a heavily pruned search.
-- Markus
Off-line approximate string matching is the term to search for; the
bibliography on it is important. On-line matching (re-scanning the
whole unindexed data set for every query) is never going to work with
a large dataset, no matter how optimal the implementation.
So if I assume the pseudo-random addresses are all 10 characters long
and composed of the letters a-z and the digits 0-9 with equal
probability, each address contains 8 runs of three characters
(trigrams), and any pair with Levenshtein distance <= 2 will have at
least one trigram in common: a single edit can disturb at most three
of the eight trigrams, so two edits leave at least two untouched.
If we break our 3 million addresses into 36^3 = 46656 buckets based
on the trigrams they contain (each address being listed 8 times), we
only need to search 8 buckets instead of the full data set. Since the
8-fold listing makes each bucket hold about 8 * 3,000,000 / 46656 ~=
514 addresses, a query costs roughly 8 * 514 ~= 4100 comparisons
instead of 3 million, for a speedup factor of about 46656/64 ~= 730.
That would take your 10 minute scan down to under a second.
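Here's a minimal Ruby sketch of that bucketing scheme, under the same
assumptions (10-character addresses, distance threshold of 2); the
levenshtein helper is a plain DP implementation and all the names are
illustrative:

    # Plain DP Levenshtein distance (helper for the final check).
    def levenshtein(a, b)
      prev = (0..b.length).to_a
      a.each_char.with_index(1) do |ca, i|
        curr = [i]
        b.each_char.with_index(1) do |cb, j|
          curr << [prev[j] + 1, curr[j - 1] + 1,
                   prev[j - 1] + (ca == cb ? 0 : 1)].min
        end
        prev = curr
      end
      prev.last
    end

    # Every distinct 3-character run in a string.
    def trigrams(s)
      (0..s.length - 3).map { |i| s[i, 3] }.uniq
    end

    # Index: file each address under every trigram it contains.
    def build_index(addresses)
      index = Hash.new { |h, k| h[k] = [] }
      addresses.each do |addr|
        trigrams(addr).each { |t| index[t] << addr }
      end
      index
    end

    # Query: union the buckets for the query's trigrams, then run the
    # expensive Levenshtein check only on those candidates.
    def near_matches(index, query, max_distance = 2)
      candidates = trigrams(query).flat_map { |t| index[t] }.uniq
      candidates.select { |c| levenshtein(c, query) <= max_distance }
    end

The filter errs in only one direction: any pair within distance 2
shares a trigram (per the pigeonhole argument above), so no true
match is dropped, while the final Levenshtein check weeds out the
false candidates.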
Of course, the actual statistics of your pseudo-random addresses may
well be different (they may be shorter, or longer, mixed case, or...?),
but doubtless a similar scheme can be devised.
-- M