DMP enhancement

37 views
Skip to first unread message

Ivan Kaschenko

unread,
Sep 16, 2009, 4:28:32 AM9/16/09
to Diff Match Patch
Hi Guys,
I'd like to suggest a little improvement of the DMP API, with
corresponding algorithm change.
I need to track not only "insert" or "delete" diff types, but also a
"modification" ones. I'm going to use the API mainly for computing
changed/inserted/deleted lines in a text files. In order to be able to
identify such type of diffs absolutely accurately, I suggest
introducing a "line number" attribute in the "Diff" class. This would
allow accurate tracking of "modification" operations between text1 &
text2. For example, if there're a INSERT and DELETE diffs, and both
belong to the same line number, then I would treat this as
"modification" diff type.

Neil Fraser

unread,
Sep 16, 2009, 12:21:25 PM9/16/09
to Diff Match Patch
On Sep 16, 10:28 am, Ivan Kaschenko <ikasche...@gmail.com> wrote:
> I need to track not only "insert" or "delete" diff types, but also a
> "modification" ones. I'm going to use the API mainly for computing
> changed/inserted/deleted lines in a text files. In order to be able to
> identify such type of diffs absolutely accurately, I suggest
> introducing a "line number" attribute in the "Diff" class. This would
> allow accurate tracking of "modification" operations between text1 &
> text2. For example, if there're a INSERT and DELETE diffs, and both
> belong to the same line number, then I would treat this as
> "modification" diff type.

The diff algorithm operates on a letter-by-letter basis, not line-by-
line.

If you want to use DMP in line mode, use the diff_linesToChars and
diff_charsToLines functions. diff_linesToChars will pack your text
into a format where each line is encoded as one character. You can
then run your diff with the two packed strings. diff_charsToLines can
then be used to unpack that diff into something you can read.

A change is any DELETE which is immediately followed by an INSERT.

Ivan Kaschenko

unread,
Sep 17, 2009, 1:04:11 PM9/17/09
to Diff Match Patch
Thanks a lot!

I found that "diff_linesToChars" + "diff_main" returns the information
which, after minimal processing, can be treated as total number of
inserted, modified and deletes lines between two text fragments. It's
not exactly what I needed but I'm fine with that!

For those who may be interested in the solution, here's the code
snippet:

diff_match_patch dmp = new diff_match_patch();
diff_match_patch.LinesToCharsResult tmp = dmp.diff_linesToChars
(oldText, newText);
LinkedList<diff_match_patch.Diff> diff = dmp.diff_main(tmp.chars1,
tmp.chars2);

int totalInserted = 0;
int totalModified = 0;
int totalDeleted = 0;
int inDelSection = 0;
int inInsSection = 0;

for (diff_match_patch.Diff cur : diff) {
int linesCount = cur.text.length();
if (cur.operation == diff_match_patch.Operation.EQUAL) {
inDelSection = 0;
inInsSection = 0;
} else if (cur.operation == diff_match_patch.Operation.DELETE) {
inDelSection += linesCount;
totalDeleted += linesCount;
} else if (cur.operation == diff_match_patch.Operation.INSERT) {
inInsSection += linesCount;
if (inDelSection > 0) {
totalDeleted -= Math.min(inInsSection, inDelSection);
totalModified += Math.min(inInsSection, inDelSection);
totalInserted += (Math.max(inInsSection, inDelSection) - Math.min
(inInsSection, inDelSection));
} else {
totalInserted += linesCount;
}
}
}

// Here total* variables are calculation result.

Neil Fraser

unread,
Sep 18, 2009, 4:18:13 AM9/18/09
to Diff Match Patch
On Sep 17, 7:04 pm, Ivan Kaschenko <ikasche...@gmail.com> wrote:
> LinkedList<diff_match_patch.Diff> diff = dmp.diff_main(tmp.chars1,
> tmp.chars2);

In this case you can improve performance very slightly be adding a 3rd
argument of 'false' to diff_main. That will prevent the algorithm
from attempting to do a line-by-line speed up, which in this case
won't work since the lines have already been factored out.
Reply all
Reply to author
Forward
0 new messages