Possible bug: chars array contains weird characters when comparing huge text files (about 1.5 million lines), overflowing the char type limit of 65535.

peteri...@gmail.com

Sep 18, 2014, 6:09:00 AM
to diff-mat...@googlegroups.com
Hi there,
First I converted the C# solution to VB.NET, which is my primary language, using an online converter. That went smoothly and the converted version works very well.
But when I tried to use diff_match_patch on two text files, each about 55 MB and about 1.5 million lines, I got an exception.
I looked into the code, and since the code block didn't make any sense to me I looked at the corresponding C# code block. That didn't help.
 
It's all about:
 
Java edition: chars.append(String.valueOf((char) (lineArray.size() - 1)));
C# edition: chars.Append(((char)(lineArray.Count - 1))); (line ~731 in the DiffMatchPatch.cs)
converted into VB.NET edition: chars.Append(ChrW(lineArray.Count - 1))
 
The line above fills the chars array with very weird characters. I don't get why someone would pick a character by using the .Count property of a string value. It makes no sense whatsoever.
In VB.NET I get an exception because lineArray.Count goes past the char type limit of 65535 and therefore cannot be converted into a char! (Again: why would someone convert the length of a string into a character?)
 
So my question is: what is the chars-array used for if it only contains very weird characters?
 
//peter

Peter Ingemann Hansen

Sep 18, 2014, 9:24:50 AM
to diff-mat...@googlegroups.com
Oh. My. God!
 
lineArray is filled with lines as a while loop walks through one of the files. Every line gets a unique identifier: the char whose code is that line's index in lineArray.
This means the identifier is bounded by the char type, whose maximum value is 65535. So once a line's index goes above 65535, no unique char can be assigned and the program throws an exception.
I bet this is just to minimize RAM usage.
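
To make the limit concrete, here is a minimal Java sketch of the line-to-char idea described above. It is a simplified illustration, not the library's actual code: the class name LineMungeSketch and the helper collides are made up for this example, and reserving index 0 with an empty string is an assumption about how the table is set up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: replace every line by one char whose code is the
// line's index in lineArray, so a line-level diff can run on short strings.
class LineMungeSketch {
    static String linesToChars(String text, List<String> lineArray,
                               Map<String, Integer> lineHash) {
        StringBuilder chars = new StringBuilder();
        int start = 0;
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end == -1) {
                end = text.length() - 1; // last line has no trailing newline
            }
            String line = text.substring(start, end + 1);
            start = end + 1;
            Integer id = lineHash.get(line);
            if (id == null) {
                lineArray.add(line);
                id = lineArray.size() - 1;
                lineHash.put(line, id);
            }
            // A Java char holds 16 bits, so this cast silently wraps once
            // id exceeds 65535 (other platforms may throw instead).
            chars.append((char) id.intValue());
        }
        return chars.toString();
    }

    // True when two different line indexes end up as the same char.
    static boolean collides(int a, int b) {
        return a != b && (char) a == (char) b;
    }

    public static void main(String[] args) {
        List<String> lineArray = new ArrayList<>();
        lineArray.add(""); // assumption: index 0 is reserved as a placeholder
        Map<String, Integer> lineHash = new HashMap<>();
        String encoded = linesToChars("alpha\nbeta\nalpha\n", lineArray, lineHash);
        System.out.println(encoded.length());    // prints 3
        System.out.println(collides(1, 65537));  // prints true
    }
}
```

In this sketch the cast wraps around, so indexes 1 and 65537 would map to the same char and two different lines would be treated as equal. A conversion that checks its range would fail with an exception at that point instead, which matches what I see in VB.NET.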
 
But it is a horrible way of adding a limit to a program.
 
 
Please fix this bug/limitation. As it stands, the library is useless for files this large.
 
// peter

Patrick Burrows

Oct 2, 2014, 4:57:37 PM
to diff-mat...@googlegroups.com
Looking at the code, it seems to me that the call to diff_lineMode (which ultimately calls the quick munge used to diff the strings, and which has the limit you are running into) is simply a performance optimization for large strings (which you obviously have).

It seems the easiest way to avoid it would be to pass false for "checklines" in the call to diff_main, for example diff_main(text1, text2, false).

There is no quick or easy fix for the munge function. A new way of doing that munge would need to be worked out and tested.
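
One direction such a rework could take, sketched in Java purely as an illustration: give each distinct line an int ID instead of a char, so the table can grow well past 65535 entries. This is a different technique from what the library does today, the class name IntLineIds is hypothetical, and the diff itself would then have to operate on int arrays rather than strings, which is the part that would need to be designed and tested.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: map each distinct line to an int ID rather than a char.
class IntLineIds {
    final List<String> lines = new ArrayList<>();      // ID -> line text
    final Map<String, Integer> ids = new HashMap<>();  // line text -> ID

    // Encode a list of lines as an array of int IDs. Calling this for both
    // files on the same instance gives equal lines equal IDs.
    int[] encode(List<String> textLines) {
        int[] out = new int[textLines.size()];
        for (int i = 0; i < out.length; i++) {
            String line = textLines.get(i);
            Integer id = ids.get(line);
            if (id == null) {
                id = lines.size();
                lines.add(line);
                ids.put(line, id);
            }
            out[i] = id;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> big = new ArrayList<>();
        for (int i = 0; i < 70000; i++) {
            big.add("line " + i);
        }
        IntLineIds table = new IntLineIds();
        int[] encoded = table.encode(big);
        // IDs above 65535 stay distinct and still map back to their lines.
        System.out.println(encoded[69999]);                  // prints 69999
        System.out.println(table.lines.get(encoded[69999])); // prints line 69999
    }
}
```

The trade-off is memory: an int per line takes twice the space of a char, and the existing string-based diff routines could not be reused as they are.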