The attached code simply prints the results of the comparison with the
respective tags and substrings. No junk function is used.
I get the same results on Python 2.5.4, 2.6.5, and 3.1.1 on Windows XP SP3.
Thanks in advance for any hints,
Regards,
vbr
#############################################################
#! Python
# -*- coding: utf-8 -*-
import difflib
# txt_a - extra character A at index 196
txt_a = ("Chapman: *I* don't know - Mr Wentworth just told me to come "
         "in here and say that there was trouble at the mill, that's all - I "
         "didn't expect a kind of Spanish Inquisition.[jarring chord] Ximinez: "
         "ANobody expects the Spanish Inquisition! Our chief weapon is "
         "surprise...surprise and fear...fear and surprise.... Our two weapons "
         "are fear and surprise...and ruthless efficiency.... Our *three* "
         "weapons are fear, surprise, and ruthless efficiency...and an almost "
         "fanatical devotion to the Pope.... Our *four*...no... *Amongst* our "
         "weapons.... Amongst our weaponry...are such elements as fear, "
         "surprise.... I'll come in again.")
# txt_b - extra character B at index 525
txt_b = ("Chapman: *I* don't know - Mr Wentworth just told me to come "
         "in here and say that there was trouble at the mill, that's all - I "
         "didn't expect a kind of Spanish Inquisition.[jarring chord] Ximinez: "
         "Nobody expects the Spanish Inquisition! Our chief weapon is "
         "surprise...surprise and fear...fear and surprise.... Our two weapons "
         "are fear and surprise...and ruthless efficiency.... Our *three* "
         "weapons are fear, surprise, and ruthless efficiency...and an almost "
         "fanatical devotion to the Pope.... Our *four*...no... *Amongst* our "
         "Bweapons.... Amongst our weaponry...are such elements as fear, "
         "surprise.... I'll come in again.")
seq_match = difflib.SequenceMatcher(None, txt_a, txt_b)
print("\n".join("%7s a[%d:%d] (%s) b[%d:%d] (%s)"
                % (tag, i1, i2, txt_a[i1:i2], j1, j2, txt_b[j1:j2])
                for tag, i1, i2, j1, j2 in seq_match.get_opcodes()))
...
Instead of just reporting the insertion and deletion of these single
characters ... the
SequenceMatcher deletes a large part of the string
between the differences and then inserts almost the same text after it.
...
Just for the record, although it seemed unlikely to me at first, it
turns out that this may have the same cause as several difflib
items in the issue tracker regarding unexpected output for long
sequences with relatively highly repetitive items, e.g.
http://bugs.python.org/issue2986
http://bugs.python.org/issue1711800
http://bugs.python.org/issue4622
http://bugs.python.org/issue1528074
In my case, disabling the "popular" heuristics as mentioned in
http://bugs.python.org/issue1528074#msg29269
i.e. modifying the difflib source (around line 314 for py.2.5.4) to
    if 0:  # disable popular heuristics
        if n >= 200 and len(indices) * 100 > n:
            populardict[elt] = 1
            del indices[:]
seems to work perfectly.
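To see why a character-wise comparison trips this heuristic, one can ask SequenceMatcher which elements it treated as popular. This is only a minimal sketch, and it assumes Python 3.2 or later, where the bpopular attribute and the autojunk argument exist (neither is available in the versions mentioned above). The helper popular_chars and the sample text are made up for illustration.

```python
import difflib


def popular_chars(text, autojunk=True):
    """Return the characters of `text` that SequenceMatcher treats as popular.

    Hypothetical helper; relies on the `bpopular` attribute (Python 3.2+).
    """
    # Only the second sequence is indexed, so the first one can stay empty.
    matcher = difflib.SequenceMatcher(None, "", text, autojunk=autojunk)
    return set(matcher.bpopular)


# 40 characters repeated 6 times: 240 characters, above the 200 threshold.
sample = "Nobody expects the Spanish Inquisition! " * 6

print(sorted(popular_chars(sample)))                  # heuristic on
print(sorted(popular_chars(sample, autojunk=False)))  # heuristic off
print(sorted(popular_chars(sample[:199])))            # below the threshold
```

Ordinary prose has only a few dozen distinct characters, so in a text of several hundred characters most of them pass the 1% mark and very little is left for the matcher to anchor on.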
Anyway, I would appreciate comments on whether this is an appropriate
solution for the given task, i.e. the character-wise comparison of
strings; or are there some drawbacks to be aware of? Wouldn't
some kind of control over the "popular" heuristics be useful in the
exposed interface of difflib?
Or is this just an inappropriate tool for character-wise string
comparison, as is suggested e.g. in
http://bugs.python.org/issue1528074#msg29273, although it seems to
work just right for the most part?
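On the question of an exposed control: later Python releases (2.7.1 and 3.2 onward) accept an autojunk argument in the SequenceMatcher constructor, which switches the popular heuristic off without patching the difflib source. The following is only a minimal sketch and assumes such a version. The helper changed_opcodes is hypothetical, and the two strings are rebuilt from the text of the script above by inserting the extra characters at the stated indices.

```python
import difflib

# Same text as in the script above, without the extra characters.
base = (
    "Chapman: *I* don't know - Mr Wentworth just told me to come "
    "in here and say that there was trouble at the mill, that's all - I "
    "didn't expect a kind of Spanish Inquisition.[jarring chord] Ximinez: "
    "Nobody expects the Spanish Inquisition! Our chief weapon is "
    "surprise...surprise and fear...fear and surprise.... Our two weapons "
    "are fear and surprise...and ruthless efficiency.... Our *three* "
    "weapons are fear, surprise, and ruthless efficiency...and an almost "
    "fanatical devotion to the Pope.... Our *four*...no... *Amongst* our "
    "weapons.... Amongst our weaponry...are such elements as fear, "
    "surprise.... I'll come in again."
)
txt_a = base[:196] + "A" + base[196:]  # extra character A at index 196
txt_b = base[:525] + "B" + base[525:]  # extra character B at index 525


def changed_opcodes(a, b, autojunk):
    """Return the opcodes other than "equal" for a compared against b."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=autojunk)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# With the heuristic switched off, only the real differences are listed.
for tag, i1, i2, j1, j2 in changed_opcodes(txt_a, txt_b, autojunk=False):
    print("%7s a[%d:%d] (%s) b[%d:%d] (%s)"
          % (tag, i1, i2, txt_a[i1:i2], j1, j2, txt_b[j1:j2]))
```

With autojunk=False this should report just the single-character delete and insert. The price is speed, since every character of the second string is kept in the index, which matters for much longer inputs.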
regards,
vbr