Using Levenshtein distance to find closest-match email addresses

William Hertling

Apr 27, 2011, 12:55:46 PM
to pdxruby
I'm working on a project where:
- we have ~3 million pseudo-random valid email addresses with an
average minimum Levenshtein distance of 5.5.
- approximately 5% of the emails that come in get rejected due to the
sender making a typo in the email address.
- a single typo usually has a Levenshtein distance of < 3 from the
intended email address (two characters transposed, a character
dropped, etc.)

So it seems likely that I could correct for typoed email addresses by
finding the most similar valid address, and if it has a distance under
3, it's probably the right one. The problem is that Levenshtein isn't
particularly fast, so 3 million comparisons take a while. Right now
it's taking about 10 minutes using Ruby 1.9.2 and the Text gem. I need
a solution that's under 15-30 seconds.
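
For reference, the brute-force pass boils down to something like this
(a sketch; valid_addresses stands in for however the list is loaded):

require 'text'  # Text::Levenshtein from the Text gem

# Naive scan: one distance calculation per valid address.
def closest_address(typo, valid_addresses)
  best = valid_addresses.min_by { |addr| Text::Levenshtein.distance(typo, addr) }
  best if best && Text::Levenshtein.distance(typo, best) < 3
end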

(Yes, I realize that the system could be redesigned so as to not use
these pseudo-random email addresses, but instead be based on the
sender's email address, but right now that's beyond my influence.)

I'm trying to puzzle out an optimization in my head. If I presort the
3 million valid email addresses so that neighbors always have a low
Levenshtein distance, can I then compare the incoming invalid email
address against N locations to get the approximate location of similar
email addresses, and then search only a subset of the entire set of
email addresses?

I'm also open to other approaches.

Any suggestions? Thanks, Will

Sam Livingston-Gray

Apr 27, 2011, 1:16:46 PM
to pdx...@googlegroups.com
Interesting problem, and I'd have to think about it for a lot longer
before I felt comfortable saying "just do this."

The idea of presorting the set of addresses sounds attractive, but I
see two problems with it: first, someone *might* make a typo in the
first character or two, which could send your initial probes to the
wrong part of the list. More importantly, though, Levenshtein distance
is a magnitude with no direction, which means you can't do a proper
binary search. (Which is to say, if you start at the middle item and
get a distance of 42, does that mean you look in the top half or the
bottom half?)

You might be able to do something like calculating Levenshtein
distance with the (n/2)th element, the (n/4)th element, and the
(3n/4)th element, then pick the two elements with the lowest numbers,
and recurse... but you'll want to reason through that looking for
pathological cases. (= If that's an actual algorithm (i.e., it will
always get you there), it's probably not optimal, but it's certainly
less pessimal than 3m compares.

I also wonder if you might be able to recruit your database to help...
at first glance, searches like "levenshtein distance database" or
"database fuzzy matching" look mildly promising.

Thanks for the fun brainteaser! I'll report back if I get any flashes
of insight (and remember them long enough to get back to a keyboard).
-Sam


Eric Redmond

Apr 27, 2011, 1:26:47 PM
to pdxruby
Ha! I didn't want to sound like a one-trick pony, so I didn't reply.
But since you brought it up... Have you looked into using a database
solution for this and indexing the choices? The Postgres pg_trgm
package has functions and indexes for doing a trigram comparison which
is actually really fast - a good query can find a best match in a mere
3 million records in a matter of seconds. There's also fuzzystrmatch
which has Levenshtein support, but I can't speak for the indexing
capabilities.
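
Something along these lines, perhaps (just a sketch - the database,
table, and column names are made up, and it assumes pg_trgm is
installed with a trigram index on the email column):

require 'pg'

conn = PG.connect(dbname: 'mailer')   # hypothetical connection
typo = 'jsmiht42@example.com'

res = conn.exec_params(<<-SQL, [typo])
  SELECT email, similarity(email, $1) AS sim
    FROM addresses
   WHERE email % $1        -- trigram similarity filter, can use the index
ORDER BY sim DESC
   LIMIT 1
SQL

best = res.first && res.first['email']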

Eric

Sam Goldstein

Apr 27, 2011, 1:28:12 PM
to pdx...@googlegroups.com
I've worked on something similar and have a few suggestions.

1. I haven't used the text gem, but I found the Levenshtein gem faster than any alternative I looked at.  It's lightweight and all the logic is in C.

2. We found that we could get massive improvements in performance by doing some heuristic filtering before calculating the Levenshtein distances.  Here are some examples:
   * Only compare addresses that are almost the same length (i.e. within 2-3 characters).
   * Perhaps users are unlikely to fat-finger the first letter.  Only compare emails with the same first letter.
   * Perhaps they're unlikely to bungle the TLD.  Only compare emails in the same TLD.

If you come up with a few good, efficient heuristics like this, you may be able to filter your list of ~3 million down to a few hundred or thousand, and get a massive improvement in perceived performance in the process.
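
Roughly (a sketch - the address collection and the exact thresholds are illustrative, and I'm using the text gem here only because it's already in play):

require 'text'

def closest_match(typo, addresses, max_distance = 2)
  tld = typo.split('.').last
  candidates = addresses.select do |addr|
    (addr.length - typo.length).abs <= 2 &&   # roughly the same length
      addr[0] == typo[0] &&                   # same first character
      addr.split('.').last == tld             # same TLD
  end

  best = candidates.min_by { |a| Text::Levenshtein.distance(typo, a) }
  best if best && Text::Levenshtein.distance(typo, best) <= max_distance
end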

~sam goldstein
--

Kevin Scaldeferri

Apr 27, 2011, 1:28:16 PM
to pdx...@googlegroups.com
This is usually done by constructing a Levenshtein automaton.  Cf. http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
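
Not the DFA construction from that post, but a sketch of the core trick: treat a row of the edit-distance table as the automaton's state, feed the candidate string one character at a time, and prune as soon as the whole row exceeds the limit. Combine it with a trie of the valid addresses (so shared prefixes share work) and it's one common way to search millions of strings at LD 1-2:

def within_distance?(target, candidate, max_dist)
  row = (0..target.length).to_a
  candidate.each_char do |ch|
    prev = row
    row = [prev[0] + 1]
    (1..target.length).each do |j|
      cost = target[j - 1] == ch ? 0 : 1
      row << [row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    return false if row.min > max_dist  # no way back under the limit
  end
  row.last <= max_dist
end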

-kevin


Kevin Scaldeferri

Apr 27, 2011, 2:03:52 PM
to pdx...@googlegroups.com
In my experience (fuzzy matching search terms), with a good automaton implementation (in C, of course), you can match against millions of candidate terms at LD 1 in ~1ms.  Given that sort of speed, I question whether applying filtering heuristics in Ruby would actually be beneficial.  Going to LD 2 costs a bit more, maybe even an order of magnitude, but is still pretty fast compared to most things you might do with a data set that large in Ruby.

-kevin

William Hertling

Apr 27, 2011, 2:18:51 PM
to pdxruby
Sam, I can't seem to find a Levenshtein gem that works with 1.9. It
only seems to work with 1.8. Am I wrong?


Sam Goldstein

Apr 28, 2011, 4:17:29 AM
to pdx...@googlegroups.com
These filters probably shouldn't happen in Ruby.  Better to do them in the DB, and not in loops in memory.
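
Something like (a sketch; the connection, table, and column names are hypothetical):

require 'pg'
require 'text'

conn = PG.connect(dbname: 'mailer')
typo = 'jsmiht42@example.com'

# Cheap filters in SQL (indexable with expression indexes if needed)...
rows = conn.exec_params(<<-SQL, [typo, typo.length])
  SELECT email
    FROM addresses
   WHERE abs(length(email) - $2) <= 2
     AND substr(email, 1, 1) = substr($1::text, 1, 1)
SQL

# ...then Levenshtein in Ruby over the much smaller survivor set.
best = rows.map { |r| r['email'] }
           .min_by { |e| Text::Levenshtein.distance(typo, e) }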

~sam

Sam Goldstein

Apr 28, 2011, 4:19:52 AM
to pdx...@googlegroups.com
Possible.  I used the one I referenced in 1.8.7.  No idea if it works in 1.9.

Also, if you can find a working C version of the algorithm, it's pretty easy to attach Ruby bindings with FFI.

~s

robert J.

Apr 28, 2011, 8:03:35 AM
to pdxruby
You might consider alternative metrics. The Jaro-Winkler distance is
great for catching typos in names and short strings.

As for the optimization, obviously you're going to need some kind of
index if you want to do a string comparison against millions of email
addresses in real time. I know Postgres allows for custom indexes,
and alternatively maybe you can use a B-tree index with a transform
on the integer representation of the email addresses. Or, of course,
a mixed strategy.

You can use the "amatch" gem for generating these comparison metrics;
it's a C extension and works in 1.9.2.
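
For example (a sketch - the addresses are made up, and double-check me
on the amatch API):

require 'amatch'

typo = 'jsmiht42@example.com'
candidates = ['jsmith42@example.com', 'jdoe@example.com']  # illustrative

lev = Amatch::Levenshtein.new(typo)    # edit distance, lower is better
closest = candidates.min_by { |c| lev.match(c) }

jw = Amatch::JaroWinkler.new(typo)     # similarity in 0..1, higher is better
most_similar = candidates.max_by { |c| jw.match(c) }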

Finally you may find you get a lot of people receiving email by
accident. I hope you can share what solution you settle on and how
effective it is.

Best
-robert


markus

Apr 28, 2011, 11:37:25 AM
to pdx...@googlegroups.com

> These filters probably shouldn't happen in Ruby. Better to do them in
> the DB, and not in loops in memory.

Smart code in Ruby will beat a dumb algorithm in C, though. I'd say the
real issue is the framing of the problem as a 1-to-n comparison rather
than as a heavily pruned search.

-- Markus


Kevin Scaldeferri

Apr 28, 2011, 12:22:25 PM
to pdx...@googlegroups.com
Given how poorly suited Ruby is for implementing data structures, I'm actually a bit skeptical of that in general.  I shudder to think how much memory the data structures I would consider for this task would require if implemented in pure Ruby.

-kevin
 

Chuck Vose

Apr 28, 2011, 5:05:11 PM
to pdx...@googlegroups.com
http://www.dcc.uchile.cl/~gnavarro/ps/deb01.pdf

Off-line approximate string matching - the bibliography is important.
On-line matching is never going to work with a dataset this large, no
matter how optimal the implementation.

markus

Apr 28, 2011, 6:02:45 PM
to pdx...@googlegroups.com
W --

So if I assume the pseudo-random addresses are all 10 characters long
and composed of the letters a-z and the digits 0-9 with equal
probability, each address contains 8 runs of three characters, and any
pair at LD <= 2 will have at least one run of three characters in
common.

If we break our 3 million addresses into 36^3 = 46656 buckets based on
the runs they contain (each address being listed 8 times, so the
buckets collectively hold 8x the data), we only need to search 8
buckets instead of the full data set, for a speedup of roughly
46656/(8*8) = 729.

That would take your 10 minute scan down to around a second.

Of course, the actual statistics of your pseudo-random addresses may
well be different (they may be shorter, or longer, mixed case, or...?),
but doubtless a similar scheme can be devised.
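
A sketch of that bucketing scheme (assuming fixed-length, single-case
addresses as above; the distance call is the Text gem's, but any
implementation works):

require 'text'

def trigrams(s)
  (0..s.length - 3).map { |i| s[i, 3] }
end

# Build once: each address is listed under every trigram it contains.
def build_index(addresses)
  index = Hash.new { |h, k| h[k] = [] }
  addresses.each { |addr| trigrams(addr).each { |t| index[t] << addr } }
  index
end

# Query: only addresses sharing at least one trigram with the typo can
# be within LD 2 of it, so those buckets are the whole search space.
def closest(index, typo, max_dist = 2)
  candidates = trigrams(typo).flat_map { |t| index[t] }.uniq
  best = candidates.min_by { |a| Text::Levenshtein.distance(typo, a) }
  best if best && Text::Levenshtein.distance(typo, best) <= max_dist
end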

-- M

