[PAN'10] Diff algorithm tailored specifiably for Wikipedia articles (v. detection)

dmtr

unread,

May 3, 2010, 5:04:10 PM5/3/10

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

In my research I've found that there is a need for a diff algorithm
designed specifically to diff consecutive revisions of a single
Wikipedia article for v. detection. More specifically algorithm/
algorithm implementation with the following properties:
* Fast. (Order of 5*10^3 diffs / second on a regular desktop for a
regular article; entire en. Wikipedia history/day);
* Accuracy and completeness of the generated diff can be sacrificed
for speed.
* Differences that it produces are human-readable;
* Ignores / detects copy-edits. (see http://en.wikipedia.org/wiki/Copy_editing);
* Keeps tokens order (trigrams should be left mostly unchanged);
* Detects and returns approximate edit position in the article;
* Detects insertions and removal of duplicate tokens;
* Size of the generated differences could be limited.

I think I was able to address these requirements with a very simple
algorithm which comes down to adding/removing revisions texts to/from
an ordered dictionary (map) in two stages - first for lines, next for
tokens. I'm going to illustrate it on the following example.

Revision 1:
HEAD.
The quick /**/ brown fox jumps over the lazy dog.
TAIL.

Revision 2:
HEAD.
The quick brown fox FUN FUN FUN jumps over the /**/ lazy dog.
TAIL.

Step 1. Split into lines, remove matching head and tail (& store
approximate edit position). Result:
Lines from R1: The quick /**/ brown fox jumps over the lazy dog.
Lines from R2: The quick brown fox FUN FUN FUN jumps over the /**/
lazy dog.

Step 2: Create empty ordered dictionary D1 mapping the text into the
number of occurrences.

Step 3. Add lines from R1 into the D1. Result (D1):
The quick /**/ brown fox jumps over the lazy dog. ---> 1

Step 4. Add lines from R2 into the D1 with the opposite sign. Result
(D1):
The quick /**/ brown fox jumps over the lazy dog. ---> 1
The quick brown fox FUN FUN FUN jumps over the /**/ lazy dog. ---
> -1

Step 5. Take all keys (lines) with the positive values from D1, split
into tokens, add to the list T1. Result (T1):
[The, quick, /**/, brown, fox, jumps, over, the, lazy, dog.]

Step 6. Take all keys (lines) with the negative values from D1, split
into tokens, add to the list T2. Result (T2):
[The, quick, brown, fox, FUN, FUN, FUN, jumps, over, the, /**/,
lazy, dog.]

Step 7. Create empty ordered dictionary D2 mapping tokens into the
number of occurrences.

Step 8. Add tokens from T2 into the D2. Result (D2):
[The : 1, quick : 1, brown : 1, fox : 1, FUN : 3, jumps : 1, over :
1, the : 1, /**/ : 1, lazy : 1, dog. : 1]

Step 9. Add tokens from T1 into the D2 with the opposite sign. Result
(D2):
[The : 0, quick : 0, brown : 0, fox : 0, FUN : 3, jumps : 0, over :
0, the : 0, /**/ : 0, lazy : 0, dog. : 0]

Step 10. Take all keys with nonzero values from D2, that's your diff
result:
+FUN

As you can see, this algorithm is fairly simple, the only tricky part
is the use of ordered dictionary and doing the diff in two stages -
first for lines and next for tokens. Advantages of doing the diff in
two stages are 1) speed 2) resulting tokens order remains more or less
the same.

I've also found that this algorithm resists/detects copy-edits well;
resulting diffs are remarkably human readable (e.g. better or same as
ndiff); for typical Wikipedia revisions it is several orders of
magnitude faster than basic Ratcliff-Obershelp algorithm/gestalt
approach. With this algorithm it is also fairly easy to limit the size
of the produced diff, detect repeating tokens and get an approximate
position of the edit.

An LGPL implementation of the core algorithm in python is available
at: http://code.google.com/p/wrdese/source/browse/trunk/b/ddiff.py .
Feel free to use it (or this algorithm) in this workshop or your
research. If you need to refer to it in your work - please use a
reference to this forum post.

With Regards, Dmitry Chichkov
Software Developer, SC Software Inc.

--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series

Martin Potthast

unread,

May 4, 2010, 3:36:18 AM5/4/10

to pan-works...@googlegroups.com

Thank you, Dmitry!

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

dmtr

unread,

May 5, 2010, 6:22:35 PM5/5/10

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Here's a few real-life diffs generated with this algorithm (using
simplified slpit('\n') : split('\, |\. |\s+') tokenization, 5 lines,
50 tokens limit).

-- Dmitry

>> R3163 (-1, 0) by 193.62.111.10(-1): /* In antiquity */ <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=280880991
-rockets#In +Rockets}} -antiquity}}
Old: 532 lines. New: 532 lines.
Added: 1 lines, 1 words
Removed: 1 lines, 2 words
Diff position: lo = 18, ahi = 19, bhi = 19

>> R3164 (3164, 0) by 193.62.111.10(-1): /* In antiquity */ <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=280881533
-Rockets}} +rockets}}
Old: 532 lines. New: 532 lines.
Added: 1 lines, 1 words
Removed: 1 lines, 1 words
Diff position: lo = 18, ahi = 19, bhi = 19

>> R3200 (-5, 0) by 207.197.98.85(-3): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=285260133
+ethan
Old: 531 lines. New: 531 lines.
Added: 1 lines, 1 words
Removed: 1 lines, 0 words
Diff position: lo = 5, ahi = 6, bhi = 6

>> R3201 (3195, 0) by 207.197.98.85(-3): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=285260277
-ethan
Old: 531 lines. New: 531 lines.
Added: 1 lines, 0 words
Removed: 1 lines, 1 words
Diff position: lo = 5, ahi = 6, bhi = 6

>> R3205 (3195, 0) by 207.197.98.84(-3): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=285262915
-hi
Old: 531 lines. New: 531 lines.
Added: 1 lines, 0 words
Removed: 1 lines, 1 words
Diff position: lo = 4, ahi = 5, bhi = 5

>> R3212 (-4, 0) by 132.23.192.16(-75): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=285459168
-cantaloupes +calories +(heat +energy)
Old: 528 lines. New: 528 lines.
Added: 1 lines, 3 words
Removed: 1 lines, 1 words
Diff position: lo = 11, ahi = 12, bhi = 12

>> R3284 (-3, 0) by 71.56.128.89(-1): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=294110068
-hi + -my -is -Kamilla -Pilipchuk
Old: 538 lines. New: 538 lines.
Added: 1 lines, 1 words
Removed: 1 lines, 5 words
Diff position: lo = 5, ahi = 6, bhi = 6

>> R3303 (3303, 0) by 216.86.113.139(2739): io <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=297117673
+[[io:Fuzeo]]
Old: 537 lines. New: 538 lines.
Added: 1 lines, 1 words
Removed: 0 lines, 0 words
Diff position: lo = 510, ahi = 510, bhi = 511

>> R3436 (-1, 0) by Gunjan verma81(133): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=299705276
+[[hi: +à °à¥à à¥à ]]
Old: 546 lines. New: 547 lines.
Added: 1 lines, 2 words
Removed: 0 lines, 0 words
Diff position: lo = 515, ahi = 515, bhi = 516

>> R3623 (-1, 0) by 64.39.150.26(-83): /* Spread of rocket technology
*/ <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=318266199
+the
Old: 593 lines. New: 593 lines.
Added: 1 lines, 1 words
Removed: 1 lines, 0 words
Diff position: lo = 32, ahi = 33, bhi = 33

>> R3626 (-5, 0) by 216.157.209.62(-162): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=318296229
-'''rocket +'''no +my +nutsrocket
Old: 594 lines. New: 594 lines.
Added: 1 lines, 3 words
Removed: 1 lines, 1 words
Diff position: lo = 6, ahi = 7, bhi = 7

>> R3627 (3625, 0) by 216.157.209.62(-162): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=318296604
-'''no +'''rocket -my -nutsrocket
Old: 594 lines. New: 594 lines.
Added: 1 lines, 1 words
Removed: 1 lines, 3 words
Diff position: lo = 6, ahi = 7, bhi = 7

>> R3645 (-1, 0) by 80.41.110.25(-1): None <<<
Diff: http://en.wikipedia.org/w/index.php?diff=prev&oldid=320377836
+In +===Metal-cylinder +1792 +Rocket +Artillery=== ++++rockets + ++++
+were ++used +++++the ++by ++[[Tipu +++Sultan]] -Ruler +++++of +
+developed +[[Hyder +++++in ++Ali]] +++++and +++against ++his +son
+rulers ++forces +The +++++British +an +technology +19th -century. -
[[William ----Congreve -(inventor)|William -Congreve]] -Comptroller --
Royal -Arsenal ++++century -Woolwich -London +++[[Mysore]] +a -major +
+this -figure +period ++much -field +more -From +advanced -1801 ++
+than +what -set --on +had -vigorous +seen -research +chiefly +because
+development -program +++++use --Arsenal's +++iron -laboratory.<ref
+tubes -name="congreve">{{harvnb|Stephen|1887}} +++for -[http://
www.archive.org/stream/dictionarynatio36stepgoog#page/n21/mode/1up
+holding -p -9]</ref> +propellant; -prepared -new -propellant -mixture
-motor -strong -tube -conical -nose -This -early -weighed -about -32 -
pounds -(14.5 -kilograms) -demonstration -solid -fuel -1805 -
effectively -Napoleonic
Old: 594 lines. New: 599 lines.
Added: 5 lines, 50 words
Removed: 2 lines, 50 words
Diff position: lo = 42, ahi = 47, bhi = 52

On May 4, 12:36 am, Martin Potthast <martin.potth...@uni-weimar.de>
wrote: