generic (works for any seq) levenshtein distance

138 views
Skip to first unread message

Laurent PETIT

unread,
Feb 15, 2011, 5:38:23 PM2/15/11
to clo...@googlegroups.com
Hi,

Was playing with levenshtein (argh, where do I place the h -sorry mister levenshtein-), and thougth it could be interesting to share my current result here, to get some feedback.

The following version works with any seq-able (not only Strings), but hardwires function = for equality testing of seq values (rather good default IMHO), but also hardwires the cost of 1 for either element insertion, deletion, or swap.

It is functional, it does not use arrays, nor a MxN matrix, just the required data for computing a given "row" of the "virtual matrix" (e.g. the previous row)

I'm quite sure it can still be improved for better readability and performance without loosing any of the above mentioned characteristics :

https://gist.github.com/828413

Cheers,

--
Laurent

Stuart Sierra

unread,
Feb 15, 2011, 6:45:52 PM2/15/11
to clo...@googlegroups.com
Cool!  That's a very compact implementation.

Could the same technique be adapted to give you the longest common substring?
e.g. (foo "fooba" "baab") => "ba"

Or better yet, the length of the longest common substring and the starting indices of each common substring of that length,
e.g. (foo "baaboobaa" "baa") => {:length 3, :indices #{0 6}}

-Stuart Sierra

Andreas Kostler

unread,
Feb 15, 2011, 7:53:34 PM2/15/11
to clo...@googlegroups.com
Laurent,
I've been studying your implementation for a while now and can't really fully grasp it. 
Can you elaborate a bit on the algorithm?
Cheers
Andreas

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

--
"Programs must be written for people to read, and only incidentally for machines to execute."
- Abelson & Sussman, SICP
-- 
**********************************************************
Andreas Koestler, Software Engineer
Leica Geosystems Pty Ltd
270 Gladstone Road, Dutton Park QLD 4102
Main: +61 7 3891 9772     Direct: +61 7 3117 8808
Fax: +61 7 3891 9336
Email: andreas....@leica-geosystems.com

************www.leica-geosystems.com*************

when it has to be right, Leica Geosystems

Please  consider the environment before printing this email.





Base

unread,
Feb 15, 2011, 8:29:28 PM2/15/11
to Clojure
I'm with Andreas, and would love some help dissecting this a bit
further.

On Feb 15, 6:53 pm, Andreas Kostler <andreas.koestler.le...@gmail.com>
wrote:
> Email: andreas.koest...@leica-geosystems.com

Brenton

unread,
Feb 16, 2011, 12:39:29 AM2/16/11
to Clojure
Laurent,

I have been doing some work on a diff library for Clojure sequences (I
need to get back to it and finish it up).

http://github.com/brentonashworth/clj-diff

The main goal of this library is to compute sequential diffs quickly.
Whenever I see someone doing something similar I like to compare
performance just in case you know something that I don't.

Other algorithms usually perform well on small sequences but then
break down as the sizes grow. For example, I did a quick test of this
algorithm on two 10,000 character strings and your algorithm took 80
seconds while mine computed the edit distance is 120 ms.

While my library is primarily concerned with diffs and edit distance,
I did add a levenshtein-distance function which attempts to compute
this distance from a previously computed minimum edit path. It is not
always accurate because there may be many minimum edit paths with
shorter or longer levenshtein distances. If the algorithm is modified
slightly so that the edit path with the minimum levenshtein distance
is chosen then it would be able to do both.

I can't take credit for the algorithm, I just implemented what I read
in a paper. But I do think this approach will get the job done as
quickly as possible. Of course there is a lot more code to read than
your very impressive ten lines.

Cheers,
Brenton

Tassilo Horn

unread,
Feb 16, 2011, 2:23:57 AM2/16/11
to clo...@googlegroups.com
Laurent PETIT <lauren...@gmail.com> writes:

Hi Laurent,

> Was playing with levenshtein (argh, where do I place the h -sorry
> mister levenshtein-), and thougth it could be interesting to share my
> current result here, to get some feedback.

Looks really concise. Nice!

> The following version works with any seq-able (not only Strings), but
> hardwires function = for equality testing of seq values (rather good
> default IMHO), but also hardwires the cost of 1 for either element
> insertion, deletion, or swap.

Isn't "swap" (aka replace) usually considered a deletion followed by an
insertion, and thus with costs 2?

Bye,
Tassilo

Laurent PETIT

unread,
Feb 16, 2011, 3:13:15 AM2/16/11
to clo...@googlegroups.com, Tassilo Horn
Hi Tassilo

2011/2/16 Tassilo Horn <tas...@member.fsf.org>
Hmm, not in the wikipedia algorithm, unless I've misread it.
But indeed, it would be cool to be able to make these cost functions configurable (not very hard to change the algorithm to do so).

Laurent PETIT

unread,
Feb 16, 2011, 5:03:50 AM2/16/11
to clo...@googlegroups.com, Base
Andreas, Base, sure, I'll try (I wrote it yesterday by night, let's see if I can remember what I wrote :-) ):

So I've studied the algorithm presented here in wikipedia:
http://en.wikipedia.org/wiki/Levenshtein_distance

I've also taken note of the suggested possible improvement on space :
"We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time"

(here  : http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_improvements)

So the algorithm of the clojure implementation is decomposed in two:

function new-row:
this function computes a new "row" (stricto sensu a row in the wikipedia page examples) given arguments:
  * prev-row: sequence of distances for the previous row
  * row-elem: the value of the element of the seq represented vertically in the wikipedia examples grid, for the current row for which we are computing.
  * t: the sequence whose values are heads of columns in the wikipedia examples grid
At its core, this function then just computes the new row by "walking" in parallel the previous row. It uses "reduce" and not map because for each "cell", I need the value just computed previously. reduce gives me access via its accumulator to the row (implemented as a vector) we're computing (via the peek call), while map's interface does not allow me such freedom.

function levenshtein:
It walks the sequence "s" (the rows in the grid), computing a new row each time from the previous row. Then it's just a matter of getting the last value of the last raw to get back the levenshtein distance.
For the same reason as above, I'm not using map but reduce to get access to the previously computed row.

Hope this is clearer,

ps: I can feel fogus resisting the urge to suggest me to rewrite my gist using marginalia ... and he may not be wrong at all!

--
Laurent

2011/2/16 Base <basse...@gmail.com>

Laurent PETIT

unread,
Feb 16, 2011, 5:12:27 AM2/16/11
to clo...@googlegroups.com, Brenton
Hi Brenton!

2011/2/16 Brenton <bash...@gmail.com>

Laurent,

I have been doing some work on a diff library for Clojure sequences (I
need to get back to it and finish it up).

http://github.com/brentonashworth/clj-diff

The main goal of this library is to compute sequential diffs quickly.

While mine goals was really just to first see how I could express the algorithm using higher order functions and clojure datastructures / sequence abstraction, even internally.
 
Whenever I see someone doing something similar I like to compare
performance just in case you know something that I don't.

Fair enough ;)
 
Other algorithms usually perform well on small sequences but then
break down as the sizes grow. For example, I did a quick test of this
algorithm on two 10,000 character strings and your algorithm took 80
seconds while mine computed the edit distance is 120 ms.

It could be interesting to plot some more intermediate results between 10 character strings comparisons and 10,000, so that we can see if this is just a problem of a constant factor, or a divergence in big O complexity.

Please note that my algorithm is very close in implementation (and thus big-O complexity, unless I've made a mistake) to the one described on the wikipedia page.
It's not an "approximate" levenshtein distance, it is the levenshtein distance.

Do you know what the performance of your version could become if you really implemented the computation of the levenshtein distance by really computing and comparing all edit paths ?

(Not saying that strictly sticking to levenshtein distance/algorithm is a rational goal, just trying to compare apples with apples)
 
While my library is primarily concerned with diffs and edit distance,
I did add a levenshtein-distance function which attempts to compute
this distance from a previously computed minimum edit path. It is not
always accurate because there may be many minimum edit paths with
shorter or longer levenshtein distances. If the algorithm is modified
slightly so that the edit path with the minimum levenshtein distance
is chosen then it would be able to do both.

I can't take credit for the algorithm, I just implemented what I read
in a paper. But I do think this approach will get the job done as
quickly as possible. Of course there is a lot more code to read than
your very impressive ten lines.

You're welcome,

Cheers,

--
Laurent

Brenton

unread,
Feb 16, 2011, 11:45:29 AM2/16/11
to Clojure
Laurent,

When I have some time, I will take another look at my code to see if I
can get an accurate levenshtein distance without adding significant
complexity. I am optimistic.

I'll let you know what I find.

Thanks again for sharing your code.
Brenton

On Feb 16, 2:12 am, Laurent PETIT <laurent.pe...@gmail.com> wrote:
> Hi Brenton!
>
> 2011/2/16 Brenton <bashw...@gmail.com>
Reply all
Reply to author
Forward
0 new messages