Hi Laurent,
> Was playing with levenshtein (argh, where do I place the h -sorry
> mister levenshtein-), and thought it could be interesting to share my
> current result here, to get some feedback.
Looks really concise. Nice!
> The following version works with any seq-able (not only Strings), but
> hardwires function = for equality testing of seq values (rather good
> default IMHO), but also hardwires the cost of 1 for either element
> insertion, deletion, or swap.
Isn't "swap" (aka replace) usually considered a deletion followed by an
insertion, and thus given a cost of 2?
Bye,
Tassilo
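To make the cost question above concrete, here is a minimal sketch of a
row-by-row Levenshtein distance with the replacement cost as a parameter.
It is not Laurent's code; the function name `levenshtein` and the
`sub-cost` argument are assumptions made for illustration. Insertions and
deletions are fixed at cost 1, and `=` is used for element equality, as
in the version described in the quoted message.

```clojure
(defn levenshtein
  "Edit distance between two seq-ables a and b.
  Insertions and deletions cost 1; a replacement costs sub-cost
  (default 1). Illustrative sketch, not tuned for large inputs."
  ([a b] (levenshtein a b 1))
  ([a b sub-cost]
   (let [b (vec b)
         n (count b)]
     (peek
      (reduce
       (fn [prev x]
         ;; prev holds the distances from the already consumed prefix
         ;; of a to every prefix of b; build the next row from it.
         (reduce
          (fn [row j]
            (conj row
                  (min (inc (peek row))          ; insert (nth b (dec j))
                       (inc (nth prev j))        ; delete x
                       (+ (nth prev (dec j))     ; keep or replace
                          (if (= x (nth b (dec j))) 0 sub-cost)))))
          [(inc (first prev))]
          (range 1 (inc n))))
       (vec (range (inc n)))                     ; row for the empty prefix of a
       a)))))

(levenshtein "kitten" "sitting")    ; => 3
(levenshtein "kitten" "sitting" 2)  ; => 5
(levenshtein [1 2 3] [1 3])         ; => 1
```

With a replacement cost of 2, a replacement is never cheaper than a
deletion followed by an insertion, so the result is the plain
insert/delete edit distance.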
Laurent,
I have been doing some work on a diff library for Clojure sequences (I
need to get back to it and finish it up).
http://github.com/brentonashworth/clj-diff
The main goal of this library is to compute sequential diffs quickly.
Whenever I see someone doing something similar I like to compare
performance just in case you know something that I don't.
Other algorithms usually perform well on small sequences but then
break down as the sizes grow. For example, I did a quick test of this
algorithm on two 10,000-character strings and your algorithm took 80
seconds while mine computed the edit distance in 120 ms.
While my library is primarily concerned with diffs and edit distance,
I did add a levenshtein-distance function which attempts to compute
this distance from a previously computed minimum edit path. It is not
always accurate because there may be many minimum edit paths, each
implying a shorter or longer Levenshtein distance. If the algorithm
were modified slightly so that the edit path with the minimum
Levenshtein distance is chosen, then it would be able to do both.
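A small sketch of why the chosen path matters. The edit-script format
below (a vector of [op item] pairs with ops :keep, :delete and :insert)
and the name `script->levenshtein` are hypothetical, invented for this
example; this is not clj-diff's actual output format or API. The idea is
that within each run of edits between kept items, a deletion and an
insertion can be paired into one replacement, so a run costs
max(deletions, insertions).

```clojure
(defn script->levenshtein
  "Estimates a Levenshtein distance from an insert/delete edit script.
  Each run of edits between kept items costs max(deletions, insertions),
  because every delete/insert pair in a run counts as one replacement."
  [script]
  (->> script
       (partition-by #(= :keep (first %)))   ; split into keep / edit runs
       (remove #(= :keep (ffirst %)))        ; drop the keep runs
       (map (fn [run]
              (let [dels (count (filter #(= :delete (first %)) run))
                    ins  (- (count run) dels)]
                (max dels ins))))
       (reduce + 0)))

;; Two minimum edit paths from "xa" to "aa", both with edit distance 2:
(def path-a [[:delete \x] [:insert \a] [:keep \a]])
(def path-b [[:delete \x] [:keep \a] [:insert \a]])

(script->levenshtein path-a) ; => 1
(script->levenshtein path-b) ; => 2
```

Both scripts are minimal in terms of insertions and deletions, but only
the first one yields the true Levenshtein distance of 1, which is why
the result depends on which minimum edit path is picked.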
I can't take credit for the algorithm; I just implemented what I read
in a paper. But I do think this approach will get the job done as
quickly as possible. Of course there is a lot more code to read than
your very impressive ten lines.