Detecting similar strings

Dylan Markow

읽지 않음,

2006. 8. 1. 오전 2:25:5906. 8. 1.

받는사람

Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?

--
Posted via http://www.ruby-forum.com/.

Farrel Lifson

읽지 않음,

2006. 8. 1. 오전 2:42:3706. 8. 1.

받는사람

On 01/08/06, Dylan Markow <dy...@dylanmarkow.com> wrote:
> Is there a way to take two strings, and decide if they are "similar."
> I'm creating a contact system in Rails, and am having a large problem
> with my users punching in duplicate entries with the last names spelled
> slightly different.
>
> Is there a way to check if 2 strings are "identical" up to a certain
> percentage, such as only having 1 or 2 characters different?

Here's a Perl module that does something similar. You might try
porting it over to Ruby.
http://search.cpan.org/~mlehmann/String-Similarity-1.02/Similarity.pm

Farrel

Farrel Lifson

읽지 않음,

2006. 8. 1. 오전 2:46:5506. 8. 1.

받는사람

And here is a nice description of the Levenstein Distance:
http://en.wikipedia.org/wiki/Levenshtein_distance

Farrel

Alexandru E. Ungur

읽지 않음,

2006. 8. 1. 오전 3:22:4906. 8. 1.

받는사람

>>> sender: "Dylan Markow" date: "Tue, Aug 01, 2006 at 03:25:59PM +0900" <<<EOQ

> Is there a way to take two strings, and decide if they are "similar."
> I'm creating a contact system in Rails, and am having a large problem
> with my users punching in duplicate entries with the last names spelled
> slightly different.
>
> Is there a way to check if 2 strings are "identical" up to a certain
> percentage, such as only having 1 or 2 characters different?

Looks like there is a soundex implementation for Ruby:
http://raa.ruby-lang.org/search.rhtml?search=soundex

If by any chance you are using MySQL you could use the soundex function
builtin into it as well.

Good luck,
Alex

ben...@fysh.org

읽지 않음,

2006. 8. 1. 오전 3:36:1606. 8. 1.

받는사람

If you check on Google under "ruby soundex module", you'll find a few
likely hits:
http://tinyurl.com/rch5g

I remember a friend saying that there are superior algorithms to soundex
that are no more complex, so you might want to explore about a little.

Michal Suchanek

읽지 않음,

2006. 8. 1. 오전 4:20:5106. 8. 1.

받는사람

If the names might be from different languages I would rather use
Levenstein than soundex. Levenstein is probably good at describing
typos as "very close" but soundex might be somewhat language specific.

But I haven't tried so I might be wrong.

Thanks

Michal

ben...@fysh.org

읽지 않음,

2006. 8. 1. 오전 4:03:3806. 8. 1.

받는사람

On 01/08/06, Farrel Lifson <farrel...@gmail.com> wrote:
> And here is a nice description of the Levenstein Distance:
> http://en.wikipedia.org/wiki/Levenshtein_distance

You know, there are lots of implementations there, but the Ruby one
seems to be missing :) [There's no reason to restrict it to working on
strings. If you duck, it'll work just as nicely on arrays of what have
you.]

ben...@fysh.org

읽지 않음,

2006. 8. 1. 오전 4:42:4606. 8. 1.

받는사람

Axel:
>>You know, there are lots of implementations [of the Levenstein
>> distance]

> there, but the >Ruby one seems to be missing :)
>

> That's not true: there is one in Hal Fulton's book "the Ruby Way" -
> it's
> available
> online via Safari
> _http://safari.oreilly.com/search_ (http://safari.oreilly.com/search)

It's missing from wikipedia though.

Helge Elvik

읽지 않음,

2006. 8. 1. 오전 6:08:1906. 8. 1.

받는사람

This should point you in the right direction I think:

http://raa.ruby-lang.org/project/levenshtein/
http://raa.ruby-lang.org/project/soundex/
http://raa.ruby-lang.org/project/metaphone/

The levenshtein algorithm basically gives you the "edit-distance"
between two strings. E.g. the minimum amount of
insertions/replacements/deletions to make the strings identical. It
gives you a pretty good indication on how similar the strings are.

Soundex transforms all strings that are similar into the same 4
character code (which looks something like "E246").

Metaphone is preferred over soundex I believe. It also transforms
similar strings into the same character sequence, but doesn't limit
itself to just 4 characters. That means it works a bit better with
longer strings.

There are probably other algorithms around as well, but I've had pretty
good luck with these three.

Regards,
Helge Elvik

-----Original Message-----
From: list-...@example.com [mailto:list-...@example.com] On Behalf
Of Dylan Markow
Sent: 1. august 2006 09:31
To: ruby-talk ML
Subject: Detecting similar strings

Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?

--
Posted via http://www.ruby-forum.com/.

ben...@fysh.org

읽지 않음,

2006. 8. 1. 오전 9:19:3206. 8. 1.

받는사람

>>It's missing from wikipedia though.
>

> Does wikipedia provide the code for the implementations it lists?
> Maybe Hal would be so kind to add it then, if we ask him politely
> to do so :)
>
> Best regards,

Here's my implementation anyway (not at all fast):

def lev(s1, s2)
# A hash to build up the solution matrix in.
l = Hash.new do |h,k|
h = if k.member? 0
# Edge case for the top and left of the solution matrix.
k.max
else
# Otherwise, recursively find this cell based on the rest of
# the solution matrix.
i,j = *k
[ l[[i-1,j]]+1,
l[[i,j-1]]+1,
l[[i-1,j-1]] + (s1[i]==s2[j]?0:1)].min
end
end

# The answer's stored in the bottom right of the solution matrix.
l[[s1.size-1, s2.size-1]]
end

Tom Reilly

읽지 않음,

2006. 8. 1. 오전 9:54:4706. 8. 1.

받는사람

Here is an implementation of levenstein distance. It seems to me that
there is another in ruby gems somewhere but I don't remember where.

# Determine Levenshtein distance of two strings

def Ld(s,t)

#s string #1
#t string #2

n = s.size
m = t.size
a = Array.new

if n != 0 && m != 0

#2 create array
r = Array.new
rz = Array.new

0.upto(m) {|x| r.push(0)}

0.upto(n) {|x|a.push(r.dup)}
a.each_index {|x| a[x][0] = x}
0.upto(m) {|x| a[0][x] = x}

#a.each {|x| p x}

cost = 0
1.upto(n) do |i|
1.upto(m) do |j|
if s[i] == t[j]
cost =0
else
cost = 1
end
a[i][j] = [a[ i- 1][j] +1,a[i][j - 1] + 1,a[i - 1][j -
1] + cost].min
end
end
a[n][m]
#a.each {|x| p x}
else
0
end
end

Gene Tani

읽지 않음,

2006. 8. 1. 오전 9:55:1206. 8. 1.

받는사람

Dylan Markow wrote:
> Is there a way to take two strings, and decide if they are "similar."
> I'm creating a contact system in Rails, and am having a large problem

Interesting your name is almost "Markov" ;-}

Anyway, besides algorithms mentioned here, I have notes on Double
Metaphone, NYSIIS, Phonex. For sequence analysis in general, google
McIlroy-Hunt, Ratcliff/Obershelp:

http://docs.python.org/lib/module-difflib.htmlhttp://docs.python.org/lib/module-difflib.html

Gene Tani

읽지 않음,

2006. 8. 1. 오전 10:05:4506. 8. 1.

받는사람

Gene Tani wrote:
> Dylan Markow wrote:
> > Is there a way to take two strings, and decide if they are "similar."

http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf

Hal Fulton

읽지 않음,

2006. 8. 1. 오후 7:41:2206. 8. 1.

받는사람

Nura...@aol.com wrote:
>>It's missing from wikipedia though.
>
>
> Does wikipedia provide the code for the implementations it lists?
> Maybe Hal would be so kind to add it then, if we ask him politely
> to do so :)

Ping me in a week and I will...

Hal

Andrew Skegg

읽지 않음,

2006. 8. 1. 오후 8:37:5706. 8. 1.

받는사람

A quick search on Google reveals a levenshtein ruby function has already been created.
http://po-ru.com/projects/levenshtein/

To my untrained eye (I am new to Ruby) using the levenshtein distance to calculate what someone might have meant to type mught be expensive. You will need to calculate the distance for all values in the field to compare it with the user input, then pick the small distances and display them as options. As the list of values grows larger, your app will bog down.

Perhaps a better way is to use an auto-complete ajax call on the input field?

Andrew