On 2016-06-30, Adam Funk <a24...@ducksburg.com> wrote:
> Hi,
>
> I had a problem yesterday with some text files (provided by someone
> else) mostly in UTF-8 but with occasional encoding issues: e.g., I
> opened one in emacs & found red \227 (Latin-1 for em-dash) but also
> valid "£" (GBP sign in UTF-8). I wanted to find the files & line
> numbers with invalid characters in UTF-8, but the closest thing I
> could find on the WWW was this:
>
>
> http://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix
>
> grep --color='auto' -P -n "[\x80-\xFF]" *.txt
>
> But that highlighted valid UTF-8 non-ASCII characters like "£". Is
> there a way to grep specifically for invalid UTF-8?
Kaz's TXR language: print lines of standard input with line numbers
if they contain an invalid character:

  $ txr -e '(mapdo (do if (match-regex @1 #/[\xDC01-\xDCFF]/)
                         (put-line `@2: @1`))
                   (get-lines) (range 1))' < yourfile
Invalid UTF-8 bytes are mapped by TXR's decoder into the range U+DC01 to
U+DCFF, which is what that code looks for. That includes the bytes of
overlong forms (e.g. the two bytes C0 AF, an overlong encoding of "/").
The algorithm is basically this: at any input position, the decoder
tries to extract a valid UTF-8 character, which must not be an
overlong form. If it cannot do that, it returns to that position,
maps the first input byte at that position into the U+DCxx range,
and then tries decoding again at the next byte position.
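A rough sketch of the same idea in Python, which ships a decoder hook
in this family: the 'surrogateescape' error handler smuggles each
undecodable byte through as U+DC00 plus the byte value, i.e. into
U+DC80..U+DCFF (a narrower slice of the range TXR uses). This is not a
reimplementation of TXR's decoder, just the same trick; the filename
argument is illustrative:

  import sys

  # Report lines of a file that contain bytes which do not decode
  # as UTF-8. Python's decoder also rejects overlong forms and
  # UTF-8-encoded surrogates, so those bytes get mapped as well.
  with open(sys.argv[1], 'rb') as f:
      for num, raw in enumerate(f, 1):
          line = raw.decode('utf-8', errors='surrogateescape')
          if any('\udc80' <= ch <= '\udcff' for ch in line):
              # Write the original bytes back out, prefixed with
              # the line number, so the output stays byte-faithful.
              sys.stdout.buffer.write(f'{num}: '.encode() + raw)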
Valid UTF-8 sequences depicting U+DCxx code points are also treated as
invalid bytes. This scheme allows encoding and decoding to be
transparent; an arbitrary byte string can pass through. Consequently,
the output of the above one-liner reproduces the original lines, byte
for byte.
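Python's surrogateescape handler has the same transparency property,
which a tiny round trip illustrates (the byte values are just an
example I made up):

  # 0xE9 is not valid UTF-8 here, but the data still round-trips
  # exactly, because the bad byte rides along as U+DCE9.
  data = b'caf\xe9 \xc2\xa3'   # stray Latin-1 e-acute, then a valid UTF-8 "£"
  text = data.decode('utf-8', 'surrogateescape')
  assert text.encode('utf-8', 'surrogateescape') == data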