UTF-8 corruption bug with diff -y

Sjur Nørstebø Moshagen

Nov 8, 2018, 9:07:11 AM
to bug-gn...@gnu.org
Hello

Using diff -y on text files with long lines risks corrupting UTF-8 encoded text when the default width of 130 columns is used, if a multibyte character happens to fall on the border of that limit. The diff command truncates the resulting output line in the middle of the byte sequence, producing malformed UTF-8 text.
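
The effect can be illustrated outside of diff with a minimal Python sketch. This is not diff's actual code, and it assumes (unverified) that the cut happens at a byte offset rather than at a character boundary. The helper names truncate_bytes, truncate_chars and is_valid_utf8 are hypothetical and exist only for this illustration.

```python
# Illustration only: NOT how diff is implemented. It assumes the cut is
# made at a byte offset, which is what the observed output suggests.

def truncate_bytes(text: str, limit: int) -> bytes:
    """Cut the UTF-8 encoding of text after `limit` bytes,
    ignoring character boundaries."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text: str, limit: int) -> bytes:
    """Cut to at most `limit` bytes, but drop a trailing incomplete
    multibyte sequence so the result stays valid UTF-8."""
    data = text.encode("utf-8")[:limit]
    # errors="ignore" discards the partial sequence left by the byte cut.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    line = '"iešguđet"'  # 'š' and 'đ' each take two bytes in UTF-8
    cut = truncate_bytes(line, 4)  # the limit falls inside 'š'
    print(cut, is_valid_utf8(cut))
    safe = truncate_chars(line, 4)
    print(safe, is_valid_utf8(safe))
```

With a limit of 4 bytes the cut lands between the two bytes of 'š', so the byte-based result no longer decodes as UTF-8, while the boundary-aware variant simply ends one character earlier.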

To reproduce:

diff -y Input-text-1.txt Input-text-2.txt

The bug can be circumvented by setting the column width to an arbitrarily high number, as long as it is higher than the longest diff line produced:

diff -y -W 200 Input-text-1.txt Input-text-2.txt

The files Input-text-1.txt and Input-text-2.txt (UTF-8 encoded) are attached. The text (excluding --------) is also reproduced below in case the attachments are removed during e-mail transfer.

Regards,
Sjur Moshagen


Input-text-1.txt:
--------
"<ja>"
"ja" CC
"<iešguđet>"
"iešguhtet" Pron Indef Acc
"iešguhtet" Pron Indef Attr
"iešguhtet" Pron Indef Gen
"<lágan>"
"lága" N Sem/Dummytag Ess
"lága" N Sem/Dummytag Sg Loc South Err/Orth
"lágan" A Sem/Hum Attr
"lágan" A Sem/Hum Sg Acc Err/Orth-nom-acc
"lágan" A Sem/Hum Sg Gen Err/Orth-nom-gen
"lágan" A Sem/Hum Sg Nom
"láhka" N Sem/Rule Sg Loc South Err/Orth
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------

Input-text-2.txt:
--------
"<ja>"
"ja" CC
"<iešguđet lágan>"
"iešguđetlágan" A Sem/Dummytag Attr Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Acc Err/Orthacc Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Gen Err/Orthgen Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Nom Err/SpaceCmpágan
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------


Paul Eggert

Nov 12, 2018, 3:07:55 PM
to Sjur Nørstebø Moshagen, bug-gn...@gnu.org
Thanks, could you please send that email to bug-di...@gnu.org? That's the
place for diffutils bugs these days.