How do you compare 2 strings, and determine how much they are "close" to each other? Eg. aqwerty qwertyb are similar to each other, except for first/last char. But, how do I quantify that?
I guess you can say for the above 2 strings that - at max, 6 chars out of 7 are same sequence --> 85% max
But, for qawerty qwerbty max correlation is - 3 chars out of 7 are the same sequence --> 42% max
> How do you compare 2 strings, and determine how much they are "close" to > each other? Eg. > aqwerty > qwertyb > are similar to each other, except for first/last char. But, how do I > quantify that?
This is a classic problem of computer science. Watch this:
William Park wrote: > How do you compare 2 strings, and determine how much they are "close" to > each other? Eg. > aqwerty > qwertyb > are similar to each other, except for first/last char. But, how do I > quantify that?
> I guess you can say for the above 2 strings that > - at max, 6 chars out of 7 are same sequence --> 85% max
> But, for > qawerty > qwerbty > max correlation is > - 3 chars out of 7 are the same sequence --> 42% max
> (Crossposted to 3 of my favourite newsgroup.)
"However you like" is probably the right answer, but one way might be to compare their soundex encoding (http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out percentage difference based on comparing the numeric part.
>> How do you compare 2 strings, and determine how much they are "close" to >> each other? Eg. >> aqwerty >> qwertyb >> are similar to each other, except for first/last char. But, how do I >> quantify that?
>> I guess you can say for the above 2 strings that >> - at max, 6 chars out of 7 are same sequence --> 85% max
>> But, for >> qawerty >> qwerbty >> max correlation is >> - 3 chars out of 7 are the same sequence --> 42% max
>> (Crossposted to 3 of my favourite newsgroup.)
>"However you like" is probably the right answer, but one way might be to >compare their soundex encoding >(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out >percentage difference based on comparing the numeric part.
Fantastic suggestion. Here's a tiny piece of real-life test data:
compare the surnames "Mousaferiadis" and "McPherson".
William Park wrote: > How do you compare 2 strings, and determine how much they are "close" to > each other? Eg. > aqwerty > qwertyb > are similar to each other, except for first/last char. But, how do I > quantify that?
> I guess you can say for the above 2 strings that > - at max, 6 chars out of 7 are same sequence --> 85% max
> But, for > qawerty > qwerbty > max correlation is > - 3 chars out of 7 are the same sequence --> 42% max
<opengeome...@yahoo.ca> wrote: >How do you compare 2 strings, and determine how much they are "close" to >each other? Eg. > aqwerty > qwertyb >are similar to each other, except for first/last char. But, how do I >quantify that?
>I guess you can say for the above 2 strings that > - at max, 6 chars out of 7 are same sequence --> 85% max
>But, for > qawerty > qwerbty >max correlation is > - 3 chars out of 7 are the same sequence --> 42% max
1. Google for such topics as "fuzzy matching", "edit distance", "approximate comparison".
2. Closer to home, look at the thread in comp.lang.python around 2004-11-18 -- search for "Pingel Hyyro" [and yes you do mean "hyyro", not "hydro"!!]
3. Steadfastly ignore any (presumably) well-intentioned profferings of soundex.
The above is broken, not meeting one of the elementary conditions for a distance metric:
distance(a, b) == distance(b, a)
Quoting from its docs: | Note: The definition of the goodness of an approximate match is the | number of steps required to bring the string pattern to a form that is | entirely contained in the string to which it is being matched.
Note: "entirely contained in", rather than "equal to". Now read on:
| The mathing | is not commutative. The pattern that you instantiate the class with will be | matched against the input. For example the word "funky" can be made to | match the word "funnybone" with an edit distance of one. However, using | "funnybone" as a pattern that will be matched to "funky" the distance | will become five. | | Example: | | >>> from Apse import Approx | >>> a = Approx("funky") | >>> a.dist("funnybone") | 1 | >>> a = Approx("funnybone") | >>> a.dist("funky") | 5
>How do you compare 2 strings, and determine how much they are "close" to >each other? Eg. > aqwerty > qwertyb >are similar to each other, except for first/last char. But, how do I >quantify that?
>I guess you can say for the above 2 strings that > - at max, 6 chars out of 7 are same sequence --> 85% max
>But, for > qawerty > qwerbty >max correlation is > - 3 chars out of 7 are the same sequence --> 42% max
John Machin wrote: > On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <mor...@lsupcaemnt.com> > wrote:
>>William Park wrote:
>>>How do you compare 2 strings, and determine how much they are "close" to >>>each other? Eg. >>> aqwerty >>> qwertyb >>>are similar to each other, except for first/last char. But, how do I >>>quantify that?
>>>I guess you can say for the above 2 strings that >>> - at max, 6 chars out of 7 are same sequence --> 85% max
>>>But, for >>> qawerty >>> qwerbty >>>max correlation is >>> - 3 chars out of 7 are the same sequence --> 42% max
>>>(Crossposted to 3 of my favourite newsgroup.)
>>"However you like" is probably the right answer, but one way might be to >>compare their soundex encoding >>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out >>percentage difference based on comparing the numeric part.
> Fantastic suggestion. Here's a tiny piece of real-life test data:
> compare the surnames "Mousaferiadis" and "McPherson".
Fantastic test data set. I know how to pronounce McPherson but I'd never have guessed that Mousaferiadis sounds like it. I suppose non-Celts probably wouldn't be able to guess how Dalziell, Drumnadrochit, Culzean, Ceilidh, or Concobarh are pronounced either.
I assume you were actually being facetious and trying to make the point that names that don't look the same on paper can have the same soundex encoding and that's obviously countered with the fact that soundex is just a cheap and cheerful way to find names that probably sound similair which can vary tremendously based on ethnicity or accent.
It's a reasonable approach to consider given the very loose requirements presented.
>John Machin wrote: >> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <mor...@lsupcaemnt.com> >> wrote:
>>>William Park wrote:
>>>>How do you compare 2 strings, and determine how much they are "close" to >>>>each other? Eg. >>>> aqwerty >>>> qwertyb >>>>are similar to each other, except for first/last char. But, how do I >>>>quantify that?
>>>>I guess you can say for the above 2 strings that >>>> - at max, 6 chars out of 7 are same sequence --> 85% max
>>>>But, for >>>> qawerty >>>> qwerbty >>>>max correlation is >>>> - 3 chars out of 7 are the same sequence --> 42% max
>>>>(Crossposted to 3 of my favourite newsgroup.)
>>>"However you like" is probably the right answer, but one way might be to >>>compare their soundex encoding >>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out >>>percentage difference based on comparing the numeric part.
>> Fantastic suggestion. Here's a tiny piece of real-life test data:
>> compare the surnames "Mousaferiadis" and "McPherson".
>Fantastic test data set. I know how to pronounce McPherson but I'd never >have guessed that Mousaferiadis sounds like it.
If you guessed "moose a ferry ah dis" i.e. phonetically you wouldn't be far wrong. The point is that the two names neither look similar nor sound similar. It is highly unlikely that one would be corrupted into the other during either written or spoken communication. However they get the same soundex code because the soundex method picks out MSFR and MCPR and says in effect that S===C (sometimes) and F==P (sometimes).
>I assume you were actually being facetious > and trying to make the point >that names that don't look the same on paper can have the same soundex >encoding and that's obviously countered with the fact that soundex is >just a cheap and cheerful way to find names that probably sound similair >which can vary tremendously based on ethnicity or accent.
*If* you want phonetic similarity, there are methods that much better than soundex, in the sense of fewer false positives and fewer false negatives. Google for NYSIIS, dolby, metaphone, caverphone.
Cheap? You get what you pay for.
Cheerful? What's the relevance?
Someone who types "Mousaferiadis" into a customer search screen and gets back several lines of McPherson and MacPherson is unlikely to be cheerful -- even before we factor in the speed [soundex divides the universe into a relative small number of buckets].
Someone who's looking for Erin when they should be looking for Aaron (or vice versa) won't get much cheer out of soundex, either.
>It's a reasonable approach to consider given the very loose requirements >presented.
Soundex is *NEVER* a reasonable approach to consider. Phonetic variation is only one consideration. In any case, the OP didn't appear to be concerned with phonetic variations.
John Machin wrote: > On Wed, 18 May 2005 20:03:53 -0500, Ed Morton <mor...@lsupcaemnt.com> > wrote: <snip> >>I assume you were actually being facetious >>and trying to make the point >>that names that don't look the same on paper can have the same soundex >>encoding and that's obviously countered with the fact that soundex is >>just a cheap and cheerful way to find names that probably sound similair >>which can vary tremendously based on ethnicity or accent.
> *If* you want phonetic similarity, there are methods that much better > than soundex, in the sense of fewer false positives and fewer false > negatives. Google for NYSIIS, dolby, metaphone, caverphone.
And I assume I'd find they all have pros and cons too, otherwise you'd be referring to THE best one rather than a selection. It seems a bit pointless to go browsing through the documentation on them when someone who presumably already has can't just state the best one for the job.
> Cheap? You get what you pay for.
> Cheerful? What's the relevance?
"Cheap and cheerful" is a colloquial expression meaning cost-effective.
> Someone who types "Mousaferiadis" into a customer search screen and > gets back several lines of McPherson and MacPherson is unlikely to be > cheerful -- even before we factor in the speed [soundex divides the > universe into a relative small number of buckets].
> Someone who's looking for Erin when they should be looking for Aaron > (or vice versa) won't get much cheer out of soundex, either.
That goes back to accent. In [some parts at least of] the USA Erin sounds very much like Aaron wheras in the UK the 2 are very dissimilar. I assume since you apparently consider them similair that you live in the USA and so would consider soundex as providing a "false negative" by saying they don't match. Perhaps one of the other approaches you suggest would report that they do match but that wouldn't make it clearly a better choice to everyone.
>>It's a reasonable approach to consider given the very loose requirements >>presented.
> Soundex is *NEVER* a reasonable approach to consider. Phonetic > variation is only one consideration. In any case, the OP didn't appear > to be concerned with phonetic variations.
The OP didn't say what the application was at all, but you're right that from his example he does SEEM more interested in character matches than phonetic ones so he'd presumably quickly discard phonetic comparisons if that's really not what he wants.
>> On Wed, 18 May 2005 20:03:53 -0500, Ed Morton <mor...@lsupcaemnt.com> >> wrote: ><snip> >>>I assume you were actually being facetious >>>and trying to make the point >>>that names that don't look the same on paper can have the same soundex >>>encoding and that's obviously countered with the fact that soundex is >>>just a cheap and cheerful way to find names that probably sound similair >>>which can vary tremendously based on ethnicity or accent.
>> *If* you want phonetic similarity, there are methods that much better >> than soundex, in the sense of fewer false positives and fewer false >> negatives. Google for NYSIIS, dolby, metaphone, caverphone.
>And I assume I'd find they all have pros and cons too, otherwise you'd >be referring to THE best one rather than a selection.
*ALL* approximate matching methods have pros and cons -- and all others have fewer than soundex.
> It seems a bit >pointless to go browsing through the documentation on them when someone >who presumably already has can't just state the best one for the job.
They were listed in roughly increasing order of general rough effectiveness. It depends on the job. It depends on the language. None of them would work well with your O Muirchaitaeioughs :-)
>> Cheap? You get what you pay for.
>> Cheerful? What's the relevance?
>"Cheap and cheerful" is a colloquial expression meaning cost-effective.
Grossly misapplied to soundex.
>> Someone who types "Mousaferiadis" into a customer search screen and >> gets back several lines of McPherson and MacPherson is unlikely to be >> cheerful -- even before we factor in the speed [soundex divides the >> universe into a relative small number of buckets].
>> Someone who's looking for Erin when they should be looking for Aaron >> (or vice versa) won't get much cheer out of soundex, either.
>That goes back to accent. In [some parts at least of] the USA Erin >sounds very much like Aaron wheras in the UK the 2 are very dissimilar. >I assume since you apparently consider them similair that you live in >the USA
You assume incorrectly. In any case my whereabouts are of sublime irrelevance. What matters is that some people will as you say think that Aaron and Erin sound similar in the best of circumstances; they are prone to be mistaken one for the other by (say) a tired call centre operative especially when the caller and the callee are from different backgrounds.
> and so would consider soundex as providing a "false negative" by >saying they don't match. Perhaps one of the other approaches you suggest >would report that they do match but that wouldn't make it clearly a >better choice to everyone.
None of the other approaches make the mistake of preserving the first letter -- this alone is almost enough reason for jettisoning soundex.
Could hit a few snags. Quick out-of-the-library compression using standards like zlib will have headers that will dilute the difference on short strings, and on long strings block compression (zlib, bzip2) will not pick up similarities because the similarities will be in different blocks. With blocks of around 100k-1M in these algos by default (IIRC), this could work well for strings between oh say 1k-50k.
But I need to underscore Aahz's posting above:
***Check out difflib, it's in the library.*** Perfect package for what the OP wants AFAICT.
> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <mor...@lsupcaemnt.com> > wrote:
>>William Park wrote:
>>> How do you compare 2 strings, and determine how much they are "close" to >>> each other? Eg. >>> aqwerty >>> qwertyb >>> are similar to each other, except for first/last char. But, how do I >>> quantify that?
>>"However you like" is probably the right answer, but one way might be to >>compare their soundex encoding >>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out >>percentage difference based on comparing the numeric part.
> Fantastic suggestion. Here's a tiny piece of real-life test data:
> compare the surnames "Mousaferiadis" and "McPherson".
I remember a word processor, MultiMate, which used Soundex to do matching for its suggestions for spelling correction. One of my cow-orkers typed the word 'agains' -- it was supposed to be 'against' but 'again' would also have been a sensible suggestion. MultiMate, however, suggested neither of those reasonable words, it did suggest "iguanas" amd "Utahns"...
(I wonder what it does with "Talliafero" and "Tolliver", and with "Featherstone-Howe" and "Fanshaw"...)
The answer to the OP, which someone just pointed out to me on comp.programming, is "edit distance" or "Levenshtein Distance"[1]. A google search on either produces some good descriptions in the first few links, including http://www.merriampark.com/ld.htm which has not only a description of the algorithm but also source code in Java, C++ and Visual Basic (no Awk, thought there are links to pages with implementations in Perl, Python, Delphi and many more)...
[1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein' in Schwaebisch (south German) fashion, but apparently the author of the algorithm was one Vladimir I. Levenshtein and that is how he is credited on the IEEE site. There are also a number of Google hits under the 'stein' spelling, though...
> The above is broken, not meeting one of the elementary conditions for > a distance metric:
> distance(a, b) == distance(b, a)
I agree that this makes the edit distance broken in the context of text strings, but you should be aware that distance is only commutative if there are no constraints on travel. If you've ever driven around cities like Sydney that use lots of one-way streets, you will know that the distance from point A to point B is not necessarily the same as the distance from B back to A.
On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote: > None of the other approaches make the mistake of preserving the first > letter -- this alone is almost enough reason for jettisoning soundex.
Off-topic now, but you've made me curious.
Why is this a bad idea?
How would you handle the case of "barow" and "marow"? (Barrow and marrow, naturally.) Without the first letter, they sound identical. Why is throwing that information away a good thing?
Steven D'Aprano wrote: > On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote:
> > None of the other approaches make the mistake of preserving the first > > letter -- this alone is almost enough reason for jettisoning soundex.
> Off-topic now, but you've made me curious.
> Why is this a bad idea?
Because of situations like, for example, my mother's last name, which originally started with "Y" but got anglicized to a name beginning with "E". Same name, different Soundex codes, and the problem occurs only occurs because of Soundex's preservation of the exact initial letter.
> How would you handle the case of "barow" and "marow"? (Barrow and > marrow, naturally.) Without the first letter, they sound identical. Why is > throwing that information away a good thing?
No one's suggesting throwing away the first letter's information, just removing the special treatment for it. "Barow" becomes 1600 and "Marow" becomes 5600.
On Fri, 20 May 2005 01:47:15 +1000, Steven D'Aprano <st...@REMOVETHIScyber.com.au> wrote:
> On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote:
>> None of the other approaches make the mistake of preserving the first >> letter -- this alone is almost enough reason for jettisoning soundex.
> Off-topic now, but you've made me curious.
> Why is this a bad idea?
Why is the first letter any more important than any other?
> How would you handle the case of "barow" and "marow"? (Barrow and > marrow, naturally.) Without the first letter, they sound identical. Why is > throwing that information away a good thing?
Well, Soundex will quite possibly throw the information away anyway, certainly it regards several letters as the same. But why is the difference between barrow and marrow more important than that between help and held? Or between hatter and hammer?
Regarding 'agains' as similar to 'iguanas' and 'Utahns', but not to 'again' or 'against', is silly...
On Fri, 20 May 2005 01:47:15 +1000, Steven D'Aprano
<st...@REMOVETHIScyber.com.au> wrote: >On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote:
>> None of the other approaches make the mistake of preserving the first >> letter -- this alone is almost enough reason for jettisoning soundex.
>Off-topic now, but you've made me curious.
>Why is this a bad idea?
>How would you handle the case of "barow" and "marow"? (Barrow and >marrow, naturally.) Without the first letter, they sound identical. Why is >throwing that information away a good thing?
Sorry if that was unclear. By "preserving the first letter", I meant that in "standard" soundex, the first letter is not transformed into a digit.
Karen -> K650 Kieran -> K650 (R->6, N->5; vowels->0 and then are squeezed out)
Now compare this: Aaron -> A650 Erin -> E650
Bearing in mind that the usual application of soundex is "all or nothing", the result is Karen == Kieran, but Aaron !== Erin, which is at the very least extremely inconsistent.
A better phonetic-key creator would produce the same result for each of the first pair, and for each of the second pair -- e.g. KARAN and ARAN respectively.
Dennis Lee Bieber wrote: > On Wed, 18 May 2005 20:03:53 -0500, Ed Morton <mor...@lsupcaemnt.com> > declaimed the following in comp.lang.python:
>>Fantastic test data set. I know how to pronounce McPherson but I'd never >>have guessed that Mousaferiadis sounds like it. I suppose non-Celts >>probably wouldn't be able to guess how Dalziell, Drumnadrochit, Culzean, >>Ceilidh, or Concobarh are pronounced either.
> Since "soundex" is initial letter (consonant?) and a code for > the next three syllables (or close to it), really long multi-syllabic > names are effectively truncated...
> Howe'er... When Maire Brennan releases an album as "Moya", > following sister's "Enya" (Eithne, as I seem to recall reading)... I'd > not attempt to pronounce most of the names you supply... "Dalziell" > doesn't look Celtic... "Culzean" almost looks Aztec/Mayan... "Ceilidh" > => kay-lee?
> Okay, I think I can manage bain sidhe and uisge (after too much > of the latter, I'll be seeing the former)
Well, as an Englishman who has spent a good deal of time in Scotland I'd essay the following. If there are any Scots reading they can chastise me with glee for my presumption.
Dalziell: "Da'y yell" Drumnadrochit: "Dru'mnadro'ckit" (but the Scots would insist you use a gutteral for the "ch", I'm not sure how to render that phonetically. It's a bit like the sound made before spitting, or the "G" in Guido in Dutch :-). Culzean: "Ka La'ne" Ceilidh: "Ca'yli" (once had a border collie called Ceilidh). Concobarh: (is this the same as 'Conchobar'?) Co'nnahwar
Chris Croughton wrote: > On Thu, 19 May 2005 06:38:59 +1000, John Machin > <sjmac...@lexicon.net> wrote:
>>On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <mor...@lsupcaemnt.com> >>wrote:
>>>William Park wrote:
>>>>How do you compare 2 strings, and determine how much they are "close" to >>>>each other? Eg. >>>> aqwerty >>>> qwertyb >>>>are similar to each other, except for first/last char. But, how do I >>>>quantify that?
>>>"However you like" is probably the right answer, but one way might be to >>>compare their soundex encoding >>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out >>>percentage difference based on comparing the numeric part.
>>Fantastic suggestion. Here's a tiny piece of real-life test data:
>>compare the surnames "Mousaferiadis" and "McPherson".
> I remember a word processor, MultiMate, which used Soundex to do > matching for its suggestions for spelling correction. One of my > cow-orkers typed the word 'agains' -- it was supposed to be 'against' > but 'again' would also have been a sensible suggestion. MultiMate, > however, suggested neither of those reasonable words, it did suggest > "iguanas" amd "Utahns"...
> (I wonder what it does with "Talliafero" and "Tolliver", and with > "Featherstone-Howe" and "Fanshaw"...)
> The answer to the OP, which someone just pointed out to me on > comp.programming, is "edit distance" or "Levenshtein Distance"[1]. A > google search on either produces some good descriptions in the first few > links, including http://www.merriampark.com/ld.htm which has not only a > description of the algorithm but also source code in Java, C++ and > Visual Basic (no Awk, thought there are links to pages with > implementations in Perl, Python, Delphi and many more)...
> [1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein' > in Schwaebisch (south German) fashion, but apparently the author of the > algorithm was one Vladimir I. Levenshtein and that is how he is credited > on the IEEE site. There are also a number of Google hits under the > 'stein' spelling, though...
> Chris C
And what's the Levenshtein distance between "levenshtein" and "levenstein"? Ironic if it were large ...