
Fuzzy Lookups


BBands

Jan 30, 2006, 10:41:06 AM
to
I have some CDs and have been archiving them on a PC. I wrote a Python
script that scans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example, the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways: jolie, joli, blon, blond,
etc. In addition, the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match, and it turns out that
there is a very good one: the Levenshtein distance, which is the number
of insertions, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the versions of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
    c = {}
    n = len(a); m = len(b)

    for i in range(0,n+1):
        c[i,0] = i
    for j in range(0,m+1):
        c[0,j] = j

    for i in range(1,n+1):
        for j in range(1,m+1):
            x = c[i-1,j]+1
            y = c[i,j-1]+1
            if a[i-1] == b[j-1]:
                z = c[i-1,j-1]
            else:
                z = c[i-1,j-1]+1
            c[i,j] = min(x,y,z)
    return c[n,m]
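
A minimal sketch of the lookup step described above; the find_closest helper and the sample titles are illustrative, not part of the actual archiving script:

def find_closest(target, titles, limit=5):
    # Score each candidate title by its Levenshtein distance to the
    # target and return the 'limit' best (smallest-distance) matches.
    scored = [(distance(target.lower(), t.lower()), t) for t in titles]
    scored.sort()
    return scored[:limit]

titles = ["Jolie Blon", "Joli Blon", "Jole Blond", "Blue Bayou"]
# Closest spellings come back first: distance 1 for "Jolie Blon" and
# "Jole Blond", 2 for "Joli Blon".
print(find_closest("Jolie Blond", titles, 3))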

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

Diez B. Roggisch

Jan 30, 2006, 10:54:30 AM
to
> As mentioned above this works quite well and I am happy with it, but I
> wonder if there is a more Pythonic way of doing this type of lookup?

I did a Levenshtein fuzzy search myself; however, I enhanced my version by
normalizing the distance in the following way:

def relative(a, b):
    """
    Computes a relative distance between two strings. It's in the range
    (0-1] where 1 means total equality.
    @type a: string
    @param a: arg one
    @type b: string
    @param b: arg two
    @rtype: float
    @return: the distance
    """
    d = distance(a,b)
    longer = float(max((len(a), len(b))))
    shorter = float(min((len(a), len(b))))
    r = ((longer - d) / longer) * (shorter / longer)
    return r


distance is a run-of-the-mill Levenshtein distance like yours.

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have an l-dist of 3, but the normalized distances are


0.729591836735

and

0.015306122449

respectively - making you choose the better match :)
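
For the record, that comparison is reproducible in two lines, assuming the distance() function from the original post is in scope (displayed precision may differ):

print(relative("Angelina Jolie", "AngelinaJolei"))   # ~0.7296
print(relative("Angelina Jolie", "Bob"))             # ~0.0153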

regards,

Diez

Fredrik Lundh

Jan 30, 2006, 11:05:34 AM
to pytho...@python.org
Diez B. Roggisch wrote:

> The advantage becomes apparent when you try to e.g. compare
>
> "Angelina Jolie"
>
> with
>
> "AngelinaJolei"
>
> and
>
> "Bob"
>
> Both have a l-dist of 3

>>> distance("Angelina Jolie", "AngelinaJolei")
3
>>> distance("Angelina Jolie", "Bob")
13

what did I miss ?

</F>

Diez B. Roggisch

Jan 30, 2006, 11:30:06 AM
to
Fredrik Lundh wrote:

Hmm. I missed something - the "1" before the "3" in 13 when I looked on my
terminal after running the example. And according to

http://www.reference.com/browse/wiki/Levenshtein_distance

it has the property

"""It is always at least the difference of the sizes of the two strings."""

And I got my implementation from there (or rather from Magnus Lie Hetland,
whose Python version is referenced there).

So you are right, my example is crap.

But I ran into cases where my normalizing made sense - otherwise I wouldn't
have done it :)

I guess it is more along the lines of (coughed up example)

"abcdef"

compared to

"abcefd"

"abcd"

I can only say that I used it to fuzzy-compare people's and hotel names, and
applying the normalization made my results far better.

Sorry to cause confusion.

Diez

BBands

Jan 30, 2006, 12:56:40 PM
to
Diez B. Roggisch wrote:
> I did a levenshtein-fuzzy-search myself, however I enhanced my version by
> normalizing the distance the following way:

Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.

jab

gene tani

Jan 30, 2006, 7:28:19 PM
to

I noticed this guy, who's quite a good Ruby developer, spent some time
on distances:

http://ruby.brian-schroeder.de/editierdistanz/

and also look at Soundex and other algorithms (Double Metaphone, NYSIIS,
Phonex; I have notes to investigate them but haven't looked at them
myself)

ajones

Jan 30, 2006, 8:28:47 PM
to

Here is my stab at it; I didn't fully test it, so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed[1].

def distance(a, b):
    """
    Computes the levenshtein distance between two strings
    """
    m, n = (len(a), a), (len(b), b)
    if m[0] < n[0]:  # ensure that the 'm' tuple holds the longest string
        m, n = n, m
    dist = m[0]  # assume distance = length of longest string (worst case)
    for i in range(0, n[0]):  # reduce the distance for each char match in shorter string
        if m[1][i] == n[1][i]:
            dist = dist - 1
    return dist

[1] I have no idea if this is true or not. I can see arguments for both.

Gregory Piñero

Jan 30, 2006, 11:10:13 PM
to ajones, pytho...@python.org
Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)
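
A self-contained sketch of that call with made-up song titles (AFileName and AllFiles above are whatever your own script supplies):

import difflib

titles = ["Jolie Blon", "Joli Blon", "Jole Blond", "Blue Bayou"]
# Return up to three candidates whose similarity to the query is at
# least 0.6, best match first.
print(difflib.get_close_matches("Jolie Blond", titles, n=3, cutoff=0.6))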

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similar algorithm to your functions.

-Greg



--
Gregory Piñero
Chief Innovation Officer
Blended Technologies
(www.blendedtechnologies.com)

Gregory Piñero

Jan 31, 2006, 10:51:44 AM
to BBands, pytho...@python.org
> Thanks for that, I'll have a look. (So many packages, so little
> time...)

Yes, there's a standard library for everything it seems! Except for a
MySQL api :-(

> > I wrote a script to delete duplicate mp3's by filename a few years
> > back with this. If anyone's interested in seeing it, I'll post a blog
> > entry on it. I'm betting it uses a similar algorithm to your functions.
>

> I would be very interested in seeing that.

Done, see:
http://www.blendedtechnologies.com/removing-duplicate-mp3s-with-python-a-naive-yet-fuzzy-approach/60

If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking to find exact matches regardless of name
- Put each set of duplicates in its own subfolder.

Kent Johnson

Jan 31, 2006, 3:54:42 PM
to
Gregory Piñero wrote:
> Ok, ok, I got it! The Pythonic way is to use an existing library ;-)
>
> import difflib
> CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)
>
> I wrote a script to delete duplicate mp3's by filename a few years
> back with this. If anyone's interested in seeing it, I'll post a blog
> entry on it. I'm betting it uses a similar algorithm to your functions.

A quick trip to difflib.py produces this description of the matching
algorithm:

The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching". The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk). The same idea is then applied
recursively to the pieces of the sequences to the left and to the
right of the matching subsequence.

So no, it doesn't seem to be using Levenshtein distance.
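
A quick, self-contained way to poke at that matcher, with the string pair borrowed from earlier in the thread:

import difflib

a, b = "Angelina Jolie", "AngelinaJolei"
sm = difflib.SequenceMatcher(None, a, b)
# ratio() is 2*M/T, where M is the total size of the matching blocks
# and T is the combined length of the two strings.
print(sm.ratio())
# The contiguous matching blocks found by the recursive matching.
print(sm.get_matching_blocks())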

Kent

Gregory Piñero

Jan 31, 2006, 4:24:46 PM
to Kent Johnson, pytho...@python.org
I wonder which algorithm determines the similarity between two strings better?

On 1/31/06, Kent Johnson <ke...@kentsjohnson.com> wrote:

Steven D'Aprano

Jan 31, 2006, 4:28:17 PM
to
On Tue, 31 Jan 2006 10:51:44 -0500, Gregory Piñero wrote:

> http://www.blendedtechnologies.com/removing-duplicate-mp3s-with-python-a-naive-yet-fuzzy-approach/60
>
> If anyone would be kind enough to improve it I'd love to have these
> features but I'm swamped this week!
>
> - MD5 checking to find exact matches regardless of name
> - Put each set of duplicates in its own subfolder.

This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?


--
Steven.

Paul Rubin

Jan 31, 2006, 4:38:50 PM
to
Steven D'Aprano <st...@REMOVETHIScyber.com.au> writes:
> This isn't a criticism, it is a genuine question. Why do people compare
> local files with MD5 instead of doing a byte-to-byte compare? Is it purely
> a caching thing (once you have the checksum, you don't need to read the
> file again)? Are there any other reasons?

It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates
(i.e. if files 637 and 2945 have the same contents, you want to
discover that). The obvious way is make a list of hashes, and sort
the list.
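
A rough sketch of that idea, using a dict keyed on digest rather than a sorted list; the helper names are made up, and hashlib stands in for the md5 module of the day:

import hashlib

def md5_of(path, blocksize=1 << 20):
    # Hash the file in chunks so it never has to fit in memory at once.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def duplicate_groups(paths):
    # Files with the same digest land in the same bucket; any bucket
    # with more than one path is a set of likely duplicates.
    buckets = {}
    for p in paths:
        buckets.setdefault(md5_of(p), []).append(p)
    return [group for group in buckets.values() if len(group) > 1]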

Tom Anderson

Feb 1, 2006, 5:37:57 AM
to
On Tue, 31 Jan 2006, it was written:

> Steven D'Aprano <st...@REMOVETHIScyber.com.au> writes:
>
>> This isn't a criticism, it is a genuine question. Why do people compare
>> local files with MD5 instead of doing a byte-to-byte compare?

I often wonder that!

>> Is it purely a caching thing (once you have the checksum, you don't
>> need to read the file again)? Are there any other reasons?
>
> It's not just a matter of comparing two files. The idea is you have
> 10,000 local files and you want to find which ones are duplicates (i.e.
> if files 637 and 2945 have the same contents, you want to discover
> that). The obvious way is make a list of hashes, and sort the list.

Obvious, perhaps; prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, you'd be better off just
using the first N bytes of the file, since this would be just as
effective, and cheaper to read. Looking at some random MP3s i have to
hand, they all differ within the first 20 bytes - probably due to the ID3
tags, so this should work for these.

Better yet, note that if two files are identical, they must have the same
length, and that finding the length can be done very cheaply, so a quicker
yet approach is to make a list of lengths, sort that, and look for
duplicates; when you find one, do a byte-by-byte comparison of the files
(probably terminating in the first block) to see if they really are the
same.
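
A sketch of that length-first strategy; the helper names are invented, and the standard library's filecmp.cmp(p1, p2, shallow=False) could replace the byte-by-byte helper:

import os
from itertools import combinations

def same_bytes(p1, p2, blocksize=1 << 16):
    # Byte-by-byte comparison; for non-duplicates this usually returns
    # False within the first block read.
    with open(p1, "rb") as f1, open(p2, "rb") as f2:
        while True:
            b1, b2 = f1.read(blocksize), f2.read(blocksize)
            if b1 != b2:
                return False
            if not b1:  # both files ended at the same point
                return True

def duplicate_pairs(paths):
    # Cheap pass first: group by file size, then confirm the few
    # candidates with a real comparison.
    by_size = {}
    for p in paths:
        by_size.setdefault(os.path.getsize(p), []).append(p)
    return [(p1, p2)
            for group in by_size.values()
            for p1, p2 in combinations(group, 2)
            if same_bytes(p1, p2)]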

By way of example, of the 2690 music files in my iTunes library, i have
twelve pairs of same-sized files [1], and all of these differ within the
first 18 bytes (mostly, within the first 9 bytes). Therefore, i could rule
out duplication with just 22 data blocks read from disk (plus rather more
blocks of directory information and inodes, of course). A hash-based
approach would have had to wade through a touch over 13 GB of data before
it could even get started.

Of course, there are situations where this is the wrong approach - if you
have a collection of serialised sparse matrices, for example, which
consist of identically-sized blocks of zeroes with a scattering of ones
throughout, then lengths and prefixes will be useless, whereas hashes will
work perfectly. However, here, we're looking at MP3s, where lengths and
prefixes will be a win.

tom

[1] The distribution of those is a bit weird: ten pairs consist of two
tracks from The Conet Project's 'Recordings of Shortwave Numbers
Stations', one is a song from that and The Doors' 'Horse Latitudes', and
one is between two Calexico songs ('The Ride (Pt II)' and 'Minas De
Cobre'). Why on earth are eleven of the twelve pairs pairs of songs from
the same artist? Is it really that they're pairs of songs from the same
compressor (those tracks aren't from CD), i wonder?

--
Not all legislation can be eye-catching, and it is important that the
desire to achieve the headlines does not mean that small but useful
measures are crowded out of the legislative programme. -- Select Committee
on Transport

Steven D'Aprano

Feb 1, 2006, 5:59:21 AM
to

Sure. But if you are just comparing two files, is there any reason to
bother with a checksum? (MD5 or other.)

I can't see any, but I thought maybe that's because I'm not thinking
outside the box.


--
Steven.

Paul Rubin

Feb 1, 2006, 7:45:36 PM
to
Tom Anderson <tw...@urchin.earth.li> writes:
> > The obvious way is make a list of hashes, and sort the list.
>
> Obvious, perhaps, prudent, no. To make the list of hashes, you have to
> read all of every single file first, which could take a while. If your
> files are reasonably random at the beginning, ...

The possibility of two different mp3 files having the same id3 tags is
something you might specifically be checking for. Hashing the first
few hundred bytes would be more reliable, since it would include some
actual audio in the hash. But then you're back to a list of hashes.
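
That variant is only a few lines; the helper name and the 512-byte cutoff are arbitrary choices:

import hashlib

def prefix_digest(path, nbytes=512):
    # Hash only the first nbytes of the file, per the suggestion above.
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()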

> Better yet, note that if two files are identical, they must have the
> same length, and that finding the length can be done very cheaply, so
> a quicker yet approach is to make a list of lengths, sort that, and
> look for duplicates; when you find one, do a byte-by-byte comparison
> of the files (probably terminating in the first block) to see if they
> really are the same.

Yes, checking the file lengths first is an obvious heuristic, but if
you find you have a bunch of files with the same length, what do you
do? You're back to a list of hashes.

> By way of example, of the 2690 music files in my iTunes library, i
> have twelve pairs of same-sized files [1], and all of these differ
> within the first 18 bytes (mostly, within the first 9 bytes).

That's a small enough set of matches that you don't need a general
purpose algorithm.

Paul Rubin

Feb 1, 2006, 7:52:34 PM
to
Steven D'Aprano <st...@REMOVETHIScyber.com.au> writes:
> Sure. But if you are just comparing two files, is there any reason to
> bother with a checksum? (MD5 or other.)

No, of course not, except in special situations, like some problem
opening and reading both files simultaneously. E.g.: the files are on
two different DVD-R's, they are too big to fit in ram, and you only
have one DVD drive. If you want to compare byte by byte, you have to
either copy one of the DVD's to your hard disk (if you have the space
available) or else swap DVD's back and forth in the DVD drive reading
and comparing a bufferload at a time. But you can easily read in the
first DVD and compute its hash on the fly, then read and hash the
second DVD and compare the hashes.

If it's a normal situation with two files on HD, just open both files
simultaneously, and use large buffers to keep the amount of seeking
reasonable. That will be faster than big md5 computations, and more
reliable (there are known ways to construct pairs of distinct files
that have the same md5 hash.)
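
For the ordinary two-files-on-disk case, the standard library already wraps that buffered comparison; a minimal sketch with placeholder file names:

import filecmp

# shallow=False forces an actual content comparison instead of just
# comparing the os.stat() signatures (size, modification time, ...).
print(filecmp.cmp("copy1.mp3", "copy2.mp3", shallow=False))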

Erik Max Francis

Feb 1, 2006, 9:31:34 PM
to
Steven D'Aprano wrote:

> This isn't a criticism, it is a genuine question. Why do people compare
> local files with MD5 instead of doing a byte-to-byte compare? Is it purely
> a caching thing (once you have the checksum, you don't need to read the
> file again)? Are there any other reasons?

Because if you store a hash, then you can keep that around even when the
original file is archived, moved elsewhere, or deleted. It's awfully
helpful for building databases of files you've seen before.

--
Erik Max Francis && m...@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
Everyone wants to look good at his own funeral.
-- Louis Wu

Tom Anderson

Feb 2, 2006, 7:10:01 AM
to
On Thu, 1 Feb 2006, it was written:

> Tom Anderson <tw...@urchin.earth.li> writes:
>
>>> The obvious way is make a list of hashes, and sort the list.
>>
>> Obvious, perhaps, prudent, no. To make the list of hashes, you have to
>> read all of every single file first, which could take a while. If your
>> files are reasonably random at the beginning, ...
>
> The possibility of two different mp3 files having the same id3 tags is
> something you might specifically be checking for.

So read from the end of the file, rather than the beginning.

>> Better yet, note that if two files are identical, they must have the
>> same length, and that finding the length can be done very cheaply, so a
>> quicker yet approach is to make a list of lengths, sort that, and look
>> for duplicates; when you find one, do a byte-by-byte comparison of the
>> files (probably terminating in the first block) to see if they really
>> are the same.
>
> Yes, checking the file lengths first is an obvious heuristic, but if you
> find you have a bunch of files with the same length, what do you do?
> You're back to a list of hashes.

Or prefixes or suffixes.

>> By way of example, of the 2690 music files in my iTunes library, i have
>> twelve pairs of same-sized files [1], and all of these differ within
>> the first 18 bytes (mostly, within the first 9 bytes).
>
> That's a small enough set of matches that you don't need a general
> purpose algorithm.

True - and this is *exactly* the situation that the OP was talking about,
so this algorithm is appropriate. Moreover, i believe it is representative of
most situations where you have a bunch of files to compare. Of course,
cases where files are tougher to tell apart do exist, but i think they're
corner cases. Could you suggest a common kind of file with degenerate
lengths, prefixes and suffixes?

The only one that springs to mind is a set of same-sized image files in
some noncompressed format, recording similar images (frames in a movie,
say), where the differences might be buried deep in the pixel data. As it
happens, i have just such a dataset on disk: with the images in TIFF
format, i get differences between subsequent frames after 9 bytes, but i
suspect that's a timestamp or something; if i convert everything to a nice
simple BMP (throwing away 8 bits per sample of precision in the process -
probably turning most of the pixels to 0!), then i find differences about
a megabyte in. If i compare from the tail in, i also have to wade through
about a megabyte before finding a difference. Here, hashes would be ideal.

tom

--
The revolution is here. Get against the wall, sunshine. -- Mike Froggatt

BBands

Feb 3, 2006, 6:48:25 PM
to
Diez B. Roggisch wrote:
> I did a levenshtein-fuzzy-search myself, however I enhanced my version by
> normalizing the distance the following way:
>
> def relative(a, b):
> """
> Computes a relative distance between two strings. Its in the range
> (0-1] where 1 means total equality.
> @type a: string
> @param a: arg one
<snip>

Hello,

I adapted your approach to my needs and thought you might like to see
the result:

def LevenshteinRelative(a, b):
    """
    Returns the Levenshtein distance between two strings
    as a relative quantity in the range 1 to 0 where
    1.0 is a perfect match.
    """
    # Calculate the Levenshtein distance. (returns an integer)
    dist = LevenshteinDistance(a, b)
    # dist is at most the length of the longer string.
    max_dist = float(max(len(a), len(b)))
    # dist is always at least the difference of the sizes of the two strings.
    min_dist = max_dist - float(min(len(a), len(b)))
    try:  # If max_dist and min_dist are equal use simple form.
        relative = 1.0 - (dist - min_dist) / (max_dist - min_dist)
    except ZeroDivisionError:
        relative = 1.0 - dist / max_dist
    return relative
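
Two end-point checks of that rescaling, assuming LevenshteinDistance is the distance() function from the start of the thread:

print(LevenshteinRelative("jolie blon", "jolie blon"))  # identical strings -> 1.0
print(LevenshteinRelative("abc", "xyz"))                # same length, nothing shared -> 0.0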

Thanks,

jab

name

Feb 9, 2006, 3:45:15 PM
to Gregory Piñero, BBands, pytho...@python.org
Gregory Piñero wrote:

> If anyone would be kind enough to improve it I'd love to have these
> features but I'm swamped this week!
>
> - MD5 checking to find exact matches regardless of name
> - Put each set of duplicates in its own subfolder.

Done? http://pyfdupes.sourceforge.net/

Bye,
luca

Gregory Piñero

Feb 9, 2006, 4:05:05 PM
to name, pytho...@python.org
Wow, that looks excellent. I'll definitely try it out. I'm assuming
this is an existing project, i.e. you didn't write it after reading
this thread?

-Greg

name

Feb 9, 2006, 11:50:58 PM
to Gregory Piñero, pytho...@python.org
Gregory Piñero wrote:

> Wow, that looks excellent. I'll definitely try it out. I'm assuming
> this is an existing project, i.e. you didn't write it after reading
> this thread?

Yes, it is an existing project, of course ;)
Right now I have no time to improve it.
I hope that later this summer I will find
the time to make file management much easier on the
GUI side and to split all the different algorithms into
external plugins, so anyone can add their own and compare
it with the others already there. The main goal of my project
was primarily an interest in "similarity search".

bye,
luca
