Levenshtein maximum distance is greater than length of both strings (JavaScript)

224 views
Skip to first unread message

Anders Lyman

unread,
May 22, 2013, 4:08:51 PM5/22/13
to diff-mat...@googlegroups.com

var dmp = new diff_match_patch();
var diffs = dmp.diff_main("709113544", "1555855732");
var distance = dmp.diff_levenshtein(diffs);
console.log(distance); // 14

I'm pretty sure this is a bug, since the API docs say: "the maximum distance is the length of the longer string".
I've logged a new issue, but I'm wondering if I've simply misunderstood something?

Thoughts?

Neil Fraser

unread,
May 23, 2013, 2:06:34 AM5/23/13
to Diff Match Patch
On 22 May 2013 13:08, Anders Lyman <ander...@gmail.com> wrote:
> I'm pretty sure this is a bug, since the API docs say: "the maximum distance
> is the length of the longer string".
> I've logged a new issue, but I'm wondering if I've simply misunderstood
> something?

It is a bug, but in the API docs. That Levenshtien distance is
correct. Neat, I didn't realize that a Levenshtien distance could be
longer than the longer string.

I'll take a closer look at this tomorrow, and if it checks out I'll
change the docs. Thanks!

--
Neil Fraser
http://neil.fraser.name

Anders Lyman

unread,
May 23, 2013, 9:01:54 AM5/23/13
to diff-mat...@googlegroups.com

Ha! That's awesome. I've had intermittent bugs surface due to this, but this is the first time I've been able to nail it down.
I need to generate a percentage of change between two strings, so my assumption that the maximum distance is the length of the longer string is clearly what caused the aforementioned bugs...

What would be the maximum distance? Is there any easy/consistent way to determine that?

> --
> You received this message because you are subscribed to a topic in the Google Groups "Diff Match Patch" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/diff-match-patch/RXxj5-fxMKs/unsubscribe?hl=en.
> To unsubscribe from this group and all its topics, send an email to diff-match-pat...@googlegroups.com.
> To post to this group, send email to diff-mat...@googlegroups.com.
> Visit this group at http://groups.google.com/group/diff-match-patch?hl=en.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Anders Lyman

unread,
May 23, 2013, 10:39:14 AM5/23/13
to diff-mat...@googlegroups.com
This seems really bizarre to me - every definition I've found of the levenshtein algorithm indicates "It is at most the length of the longer string."
For example: 
  1. http://en.wikipedia.org/wiki/Levenshtein_distance
  2. http://stackoverflow.com/questions/6087281/similarity-score-levenshtein/6087809#6087809
  3. http://stackoverflow.com/questions/2781252/probability-algorithm-finding-probable-correct-item-in-a-list-e-g-john-john/2782593#2782593 (3rd answer down)
It does mention an edge case: "If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance."
That doesn't apply here however, as the strings have different lengths.

Christian LeMoussel

unread,
Oct 1, 2013, 6:46:16 AM10/1/13
to diff-mat...@googlegroups.com
I have the same probleme with this C# code :

            DiffMatchPatch.diff_match_patch dmpDiff = new DiffMatchPatch.diff_match_patch();
            List<DiffMatchPatch.Diff> lDiffs = dmpDiff.diff_main("considéré", "apprécié");
            int iIndex = dmpDiff.diff_levenshtein(lDiffs);

                        Console.WriteLine(iIndex); // 11 ??????

According to the docs on the API home page  this is not possible. The length of the longer string is equal to 9.
The minimum distance is 0 which means equality, the maximum distance is the length of the longer string.

Reply all
Reply to author
Forward
0 new messages