Compare algorithm problems

41 views
Skip to first unread message

Michael Charnoky

unread,
Sep 19, 2018, 10:41:59 AM9/19/18
to Mergely
Hi, I've been evaluating Mergely for use in a custom CMS. It looks very impressive and seems much nicer than any other diff tools that I've tried!

However, I'm seeing some unexpected results with the comparison algorithm. I boiled it down to a fairly simple test case which can be seen here: http://www.mergely.com/PiK2NNBq/. Here are some of the oddities:

Issue 1: In line 1, why is the word "paragraph" marked in red on the left and blue on the right? It did not change, only the word "here" was added. 

Issue 2: In line 3, there were initially two sentences and a third was inserted in between. If you remove lines 4 and 5, the comparison for line 3 is correct. However, once a blank line is added (line 4), the comparison for line 3 changes: the entire paragraph is marked in red on the left and blue on the right! Another oddity: If I remove the word "here" on line 1 on the right side, line 3 comparison results change back to what I expect! Very strange...

Note that the "ignore white space" option does not seem to affect the results. I tried the same example using jsdiff (http://incaseofstairs.com/jsdiff/) and it produces the correct results... yet it is nowhere near as pretty as Mergely! Any thoughts on these issues? They seem to happen quite often when comparing large blocks of text with multiple paragraphs. Thank you!

Mike

Jamie Peabody

unread,
Sep 19, 2018, 11:54:05 AM9/19/18
to no...@noky.net, Mergely
Issue 1: Mergely optimizes by line differences as opposed to character-by-character differences.  It is much faster.  This means that it first identifies differences between lines.  Then, it uses a similar approach to find differences in between words, split by spaces - this is for highlighting text changes within the line.  In this case, it finds that "paragraph." was deleted, and that "paragraph is here." was inserted.  Another demonstration of this is to just diff "First paragraph." and "First paragraph" - you will see that "paragraph." was deleted, not the "." character.

Issue 2: It's been a long time since I worked on the algorithm, so it is difficult to explain why this is the case, but it is working exactly as expected.  I believe the algorithm groups some changes where it is better to show 2 changes, rather than 3.  GNU diff does a similar thing.

--
You received this message because you are subscribed to the Google Groups "Mergely" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mergely+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Charnoky

unread,
Sep 19, 2018, 3:25:57 PM9/19/18
to Mergely
Thanks for your quick reply Jamie! I'm still confused by the behavior, it seems odd to me... At the end of the day, I need to empower editors to compare revisions of articles in a CMS. I'd really love to be able to use Mergely, but it often indicates that large blocks of text have changed when in fact only small chunks of text in strategic places have changed. I'm just trying to understand why this happens and how it could possibly be improved. I'd be willing to get dirty in the code if I had some pointers!

Issue 1: Why not split on punctuation as well as spaces? Seems like that would make the diff reporting less "greedy".

Issue 2: In playing around with my example, I removed the blank lines (extra carriage returns) from both sides, and it works like I would expect! See here: http://www.mergely.com/GP6xZf6E/. Does this make sense? I suppose a workaround would be for me to strip out double CR characters in the text before feeding it to Mergely... but then the UI display would not show paragraph breaks clearly.

Still, that workaround doesn't work in all cases. Take the "Declaration of Independence" example. Before line 12, the diff seems to be working well. Starting with line 12, Mergely basically indicates that the remainder of the documents are completely different, which is not the case. I tried removing the extra CRs, but that did not help. If you break up the Declaration before/after documents into smaller chunks, Mergely does a better job at spotting the differences. If you feed those Declaration docs into jsdiff, it seems to do much better at the detection, but it is slower and the UI component leaves something to be desired...

To be clear: what you have built is really quite amazing! I'm just hoping to be able to bring some improvements to the table. Thanks for your feedback and consideration.

Jamie Peabody

unread,
Sep 19, 2018, 4:02:40 PM9/19/18
to no...@noky.net, Mergely
Well, if you really want to wreck your head, you can get your head around the myers diff algorithm (the paper is here) and have a look at the code.

Issue 1: it could be made to diff characters - I originally chose to use words for performance, and it's "good enough".  I made the algorithm stand-alone in myers-diff, and had every intention of replacing the algorithm in Mergely with myers-diff module but haven't really had the time.  Using that module would enable character diffs, which I think is what you are after.

In the Declaration of Independence (DoI), notice at line 12 used to read "He has refused", but it was merged into line 10, but the algorithm doesn't "know" that.  It knows 10 changed, and it knows 12 changed.  Put the DoI into http://incaseofstairs.com/jsdiff/ and see how long it take to diff (click Lines first, clear the text, insert DoI, and then click Chars).  Mine took several minutes.  Best performance would be to find lines that change, and then optionally do character differences within those lines.

Removing double CR is a hack, and all you really get is one big "change", i.e. the algorithm identifies "adds", "removes", and "changes".  I wouldn't recommend it, but if you think it works in your context that you need, then go for it.

Michael Charnoky

unread,
Sep 20, 2018, 9:46:23 AM9/20/18
to Mergely
Thanks Jamie, I'll see what I can do!
Reply all
Reply to author
Forward
0 new messages