
grepping for invalid UTF-8 characters


Adam Funk

Jun 30, 2016, 7:15:08 AM
Hi,

I had a problem yesterday with some text files (provided by someone
else) mostly in UTF-8 but with occasional encoding issues: e.g., I
opened one in emacs & found red \227 (Latin-1 for em-dash) but also
valid "£" (GBP sign in UTF-8). I wanted to find the files & line
numbers with invalid characters in UTF-8, but the closest thing I
could find on the WWW was this:

http://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix

grep --color='auto' -P -n "[\x80-\xFF]" *.txt

But that highlighted valid UTF-8 non-ASCII characters like "£". Is
there a way to grep specifically for invalid UTF-8?

Thanks,
Adam



--
It is probable that television drama of high caliber and produced by
first-rate artists will materially raise the level of dramatic taste
of the nation. --- David Sarnoff, CEO of RCA, 1939; in Stoll 1995

Geoff Clare

Jun 30, 2016, 9:11:06 AM
Adam Funk wrote:

> I had a problem yesterday with some text files (provided by someone
> else) mostly in UTF-8 but with occasional encoding issues: e.g., I
> opened one in emacs & found red \227 (Latin-1 for em-dash) but also
> valid "£" (GBP sign in UTF-8). I wanted to find the files & line
> numbers with invalid characters in UTF-8, but the closest thing I
> could find on the WWW was this:
>
> http://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix
>
> grep --color='auto' -P -n "[\x80-\xFF]" *.txt
>
> But that highlighted valid UTF-8 non-ASCII characters like "£". Is
> there a way to grep specifically for invalid UTF-8?

I don't have a grep solution, but you could use iconv to strip out the
invalid characters and then diff the output with the original file to
see where they were:

iconv -cs -f utf-8 -t utf-8 file | LC_ALL=C diff - file > file.diff

Then display or edit file.diff using a utility that can cope with the
invalid characters (such as emacs, which you said you used above).

For multiple files:

for f in *.txt
do
    iconv -cs -f utf-8 -t utf-8 -- "$f" | LC_ALL=C diff - "$f" > "$f".diff
done
find . -name '*.txt.diff' -size 0 -exec rm {} +

(The find removes the empty diff output files, i.e. the ones where there
were no invalid characters.)
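
If your diff is GNU diff, the same idea can report the offending lines
and line numbers directly.  A sketch (untested; the --*-line-format
options are GNU extensions): suppress unchanged lines and the cleaned-up
versions, and print each differing original line prefixed with the file
name and line number.

for f in *.txt
do
    iconv -cs -f utf-8 -t utf-8 -- "$f" |
        LC_ALL=C diff --unchanged-line-format='' \
                      --old-line-format='' \
                      --new-line-format="$f:%dn: %L" - "$f"
done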

--
Geoff Clare <net...@gclare.org.uk>

Kaz Kylheku

Jun 30, 2016, 11:38:04 AM
On 2016-06-30, Adam Funk <a24...@ducksburg.com> wrote:
> Hi,
>
> I had a problem yesterday with some text files (provided by someone
> else) mostly in UTF-8 but with occasional encoding issues: e.g., I
> opened one in emacs & found red \227 (Latin-1 for em-dash) but also
> valid "£" (GBP sign in UTF-8). I wanted to find the files & line
> numbers with invalid characters in UTF-8, but the closest thing I
> could find on the WWW was this:
>
> http://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix
>
> grep --color='auto' -P -n "[\x80-\xFF]" *.txt
>
> But that highlighted valid UTF-8 non-ASCII characters like "£". Is
> there a way to grep specifically for invalid UTF-8?

Kaz's txr language: print lines of standard input with a line
number, if they contain an invalid character:

$ txr -e '(mapdo (do if (match-regex @1 #/[\xDC01-\xDCFF]/)
                     (put-line `@2: @1`))
                 (get-lines) (range 1))' < yourfile

Invalid UTF-8 bytes are mapped by TXR's decoder into the range U+DC01 to
U+DCFF, which that code can look for.

That includes the bytes of overlong forms.

The algorithm is basically: at any input position, the decoder tries to
extract a valid UTF-8 character, which must not be an overlong form. If
it cannot do that, it returns to that position, maps the first input
byte there into the U+DCxx range, and then tries decoding again at the
next byte position.

Valid UTF-8 depicting U+DCxx code points is also treated as invalid
bytes. This scheme allows encoding and decoding to be transparent:
arbitrary byte strings can pass through. Consequently, the output of the
above one-liner reproduces the original lines, byte for byte.
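
A hypothetical demo of that behavior (the sample bytes are made up):
printf emits one line containing a raw 0x97 byte and one line of valid
UTF-8; only the first should be reported, reproduced byte for byte.

$ printf 'stray \227 byte\nvalid \302\243 sign\n' |
  txr -e '(mapdo (do if (match-regex @1 #/[\xDC01-\xDCFF]/)
                     (put-line `@2: @1`))
                 (get-lines) (range 1))'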

Ben Bacarisse

Jun 30, 2016, 5:57:36 PM
Adam Funk <a24...@ducksburg.com> writes:

> Hi,
>
> I had a problem yesterday with some text files (provided by someone
> else) mostly in UTF-8 but with occasional encoding issues: e.g., I
> opened one in emacs & found red \227 (Latin-1 for em-dash) but also
> valid "£" (GBP sign in UTF-8). I wanted to find the files & line
> numbers with invalid characters in UTF-8, but the closest thing I
> could find on the WWW was this:
>
> http://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix
>
> grep --color='auto' -P -n "[\x80-\xFF]" *.txt
>
> But that highlighted valid UTF-8 non-ASCII characters like "£". Is
> there a way to grep specifically for invalid UTF-8?

That would be complicated because what constitutes an invalid encoding
is not simple.  When I was having trouble with UTF-8 files I ended up
writing

http://www.bsb.me.uk/software/utf-8-dump/

which lets you specify a format for, amongst other things, invalid
encodings. You would then be able to grep for something in that format.

--
Ben.

Thomas 'PointedEars' Lahn

Jul 5, 2016, 8:14:03 AM
Adam Funk wrote:

> http://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix
>
> grep --color='auto' -P -n "[\x80-\xFF]" *.txt

Should have been at least

grep --color='auto' -P -n '[\x80-\xFF]' *.txt

More portable:

find . \( -name . -o -prune \) -name '*.txt' \
  -exec grep --color='auto' -ne '['$'\x80''-'$'\xFF'']' '{}' +

> But that highlighted valid UTF-8 non-ASCII characters like "£".

Because 0x80 to 0xBF inclusive are continuation bytes. (There are no "UTF-8
characters".)

> Is there a way to grep specifically for invalid UTF-8?

Yes, but it depends on how you define “invalid UTF-8”:

<https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences>
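
For the common definition, and assuming GNU grep with the current locale
set to UTF-8, one untested sketch is to let the regex engine do the
validation: in a UTF-8 locale '.' matches only complete, well-formed
characters, so inverting a whole-line match flags every line containing
anything else.

grep -naxv '.*' *.txt

Here -x anchors the pattern to the whole line, -v inverts the match,
-a stops grep from treating the files as binary, and -n prints the line
numbers.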

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

Adam Funk

Aug 1, 2016, 10:15:09 AM
On 2016-07-05, Thomas 'PointedEars' Lahn wrote:

> Adam Funk wrote:
>
>> http://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix
>>
>> grep --color='auto' -P -n "[\x80-\xFF]" *.txt
>
> Should have been at least
>
> grep --color='auto' -P -n '[\x80-\xFF]' *.txt
>
> More portable:
>
> find . \( -name . -o -prune \) -name '*.txt' \
> -exec grep --color='auto' -ne '['$'\x80''-'$'\xFF'']' '{}' +
>
>> But that highlighted valid UTF-8 non-ASCII characters like "£".
>
> Because 0x80 to 0xBF inclusive are continuation bytes. (There are no "UTF-8
> characters".)

Of course --- I should've called this "grepping for invalid UTF-8
sequences"!


>> Is there a way to grep specifically for invalid UTF-8?
>
> Yes, but it depends on how you define “invalid UTF-8”:
>
><https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences>
>

--
Dear Ann [Landers]: if there's an enormous rash of necrophilia that
happens in the next year because of this song, please let me know.
99.9% of the rest of us know it's a funny song! --- Alice Cooper