Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?
--
Posted via http://www.ruby-forum.com/.
Here's a Perl module that does something similar. You might try
porting it over to Ruby.
http://search.cpan.org/~mlehmann/String-Similarity-1.02/Similarity.pm
Farrel
And here is a nice description of the Levenstein Distance:
http://en.wikipedia.org/wiki/Levenshtein_distance
Farrel
If by any chance you are using MySQL you could use the soundex function
builtin into it as well.
Good luck,
Alex
If you check on Google under "ruby soundex module", you'll find a few
likely hits:
http://tinyurl.com/rch5g
I remember a friend saying that there are superior algorithms to soundex
that are no more complex, so you might want to explore about a little.
If the names might be from different languages I would rather use
Levenstein than soundex. Levenstein is probably good at describing
typos as "very close" but soundex might be somewhat language specific.
But I haven't tried so I might be wrong.
Thanks
Michal
You know, there are lots of implementations there, but the Ruby one
seems to be missing :) [There's no reason to restrict it to working on
strings. If you duck, it'll work just as nicely on arrays of what have
you.]
It's missing from wikipedia though.
http://raa.ruby-lang.org/project/levenshtein/
http://raa.ruby-lang.org/project/soundex/
http://raa.ruby-lang.org/project/metaphone/
The levenshtein algorithm basically gives you the "edit-distance"
between two strings. E.g. the minimum amount of
insertions/replacements/deletions to make the strings identical. It
gives you a pretty good indication on how similar the strings are.
Soundex transforms all strings that are similar into the same 4
character code (which looks something like "E246").
Metaphone is preferred over soundex I believe. It also transforms
similar strings into the same character sequence, but doesn't limit
itself to just 4 characters. That means it works a bit better with
longer strings.
There are probably other algorithms around as well, but I've had pretty
good luck with these three.
Regards,
Helge Elvik
-----Original Message-----
From: list-...@example.com [mailto:list-...@example.com] On Behalf
Of Dylan Markow
Sent: 1. august 2006 09:31
To: ruby-talk ML
Subject: Detecting similar strings
Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.
Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?
--
Posted via http://www.ruby-forum.com/.
Here's my implementation anyway (not at all fast):
def lev(s1, s2)
# A hash to build up the solution matrix in.
l = Hash.new do |h,k|
h = if k.member? 0
# Edge case for the top and left of the solution matrix.
k.max
else
# Otherwise, recursively find this cell based on the rest of
# the solution matrix.
i,j = *k
[ l[[i-1,j]]+1,
l[[i,j-1]]+1,
l[[i-1,j-1]] + (s1[i]==s2[j]?0:1)].min
end
end
# The answer's stored in the bottom right of the solution matrix.
l[[s1.size-1, s2.size-1]]
end
# Determine Levenshtein distance of two strings
def Ld(s,t)
#s string #1
#t string #2
n = s.size
m = t.size
a = Array.new
if n != 0 && m != 0
#2 create array
r = Array.new
rz = Array.new
0.upto(m) {|x| r.push(0)}
0.upto(n) {|x|a.push(r.dup)}
a.each_index {|x| a[x][0] = x}
0.upto(m) {|x| a[0][x] = x}
#a.each {|x| p x}
cost = 0
1.upto(n) do |i|
1.upto(m) do |j|
if s[i] == t[j]
cost =0
else
cost = 1
end
a[i][j] = [a[ i- 1][j] +1,a[i][j - 1] + 1,a[i - 1][j -
1] + cost].min
end
end
a[n][m]
#a.each {|x| p x}
else
0
end
end
Interesting your name is almost "Markov" ;-}
Anyway, besides algorithms mentioned here, I have notes on Double
Metaphone, NYSIIS, Phonex. For sequence analysis in general, google
McIlroy-Hunt, Ratcliff/Obershelp:
http://docs.python.org/lib/module-difflib.htmlhttp://docs.python.org/lib/module-difflib.html
http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
Ping me in a week and I will...
Hal
To my untrained eye (I am new to Ruby) using the levenshtein distance to calculate what someone might have meant to type mught be expensive. You will need to calculate the distance for all values in the field to compare it with the user input, then pick the small distances and display them as options. As the list of values grows larger, your app will bog down.
Perhaps a better way is to use an auto-complete ajax call on the input field?
Andrew