Baudais Jean-Yves wrote:
> On 14/03/2017 at 00:11, Thomas 'PointedEars' Lahn wrote:
>> grep(1) is a utility designed to search text, so it ignores non-text
>> files, i.e. binaries, by default. Whether content is considered text
>> depends on whether it can be fully decoded to text characters. Whether
>> that is possible depends on the used decoder, which depends on the
>> locale, specifically the value of the LC_CTYPE environment variable (see
>> locale(1) and locale(7)). [...]
>
> OK, I think I understand. My LANG variable is fr_FR.UTF-8 and my files
> are ISO-8859-1, so when grep outputs an accented character I get the message.
Yes; according to locale(7), the value of the LANG variable serves as the
last fallback for determining the default locale: it is used only if neither
LC_ALL nor the LC_* variable for the relevant category (here LC_CTYPE) is
set to a non-empty value.
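To make that lookup order concrete, here is a minimal Python sketch. The function name effective_ctype is made up for illustration, and the final fallback to "C" is my assumption about what happens when nothing is set; the real resolution is done by the C library, not by code like this.

```python
import os


def effective_ctype(env=None):
    """Return the locale name that would govern character decoding.

    Precedence as described in locale(7): LC_ALL, then LC_CTYPE,
    then LANG. Empty values are treated like unset ones.
    """
    if env is None:
        env = os.environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # skips both missing and empty variables
            return value
    return "C"  # assumption: nothing set means the "C" locale


# LANG alone decides when nothing more specific is set ...
print(effective_ctype({"LANG": "fr_FR.UTF-8"}))
# ... but LC_ALL overrides everything else.
print(effective_ctype({"LANG": "fr_FR.UTF-8", "LC_ALL": "C"}))
```

With only LANG set, the first call yields fr_FR.UTF-8, which matches your situation.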
It is also possible that your files are neither ISO/IEC 8859-1-encoded nor
ISO-8859-1-encoded, but Windows-1252-encoded (CRLF instead of LF for newline
is not part of that encoding as such, but is indicative of it). file(1) may
not be able to tell the difference, but a trial conversion with recode(1) or
iconv(1) should help.
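If you want to check this yourself, one rough heuristic is to look for octets in the range 0x80 to 0x9F, where the two encodings differ. The Python sketch below is only that, a heuristic: the function name is hypothetical, and a file without such octets reads the same under both encodings anyway.

```python
def guess_single_byte_encoding(data: bytes) -> str:
    """Heuristically choose between ISO-8859-1 and Windows-1252.

    Octets 0x80..0x9F are C1 control characters in ISO-8859-1 and are
    rare in real text. Windows-1252 assigns printable characters to most
    of them (0x80 is the euro sign, 0x93 and 0x94 are curly quotes).
    """
    if any(0x80 <= octet <= 0x9F for octet in data):
        return "cp1252"
    # Outside 0x80..0x9F both encodings map every octet identically.
    return "latin-1"


plain = b"caf\xe9"           # "café" with a single-octet e-acute
quoted = b"\x93caf\xe9\x94"  # the same word in Windows curly quotes

print(guess_single_byte_encoding(plain))
print(guess_single_byte_encoding(quoted))
print(quoted.decode(guess_single_byte_encoding(quoted)))
```

Once the source encoding is settled, decoding with it and re-encoding as UTF-8 is the same kind of conversion that iconv(1) performs.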
> Of course the files are not binary ones but the output is not compliant
> with fr_FR.UTF-8.
Almost correct; it is the _input_ (file) that is not, then. There are octet
sequences in ISO/IEC 8859-1 and ISO-8859-1, especially those for accented
Latin letters like U+00E9 LATIN SMALL LETTER E WITH ACUTE (« é »), that are
not valid UTF-8 sequences, which apparently leads grep(1) to assume that the
file is binary.
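You can reproduce the decoding failure outside of grep(1). The Python sketch below only illustrates that a lone 0xE9 octet is rejected by a strict UTF-8 decoder; it does not claim to mirror how grep(1) itself classifies files, and the helper name is_valid_utf8 is mine.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the octets decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


word = "caf\u00e9"                  # "café"
latin1_bytes = word.encode("latin-1")  # b'caf\xe9'
utf8_bytes = word.encode("utf-8")      # b'caf\xc3\xa9'

print(latin1_bytes, is_valid_utf8(latin1_bytes))  # not valid UTF-8
print(utf8_bytes, is_valid_utf8(utf8_bytes))      # valid UTF-8
```

Pure ASCII content passes either way, which is why only the lines with accented characters trigger the message.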
The code points of those characters lie in a range where the leading bits of
the corresponding ISO-8859-1 octets (e.g., 0xE9, which is 1110 1001 in
binary) collide with the bit patterns that UTF-8 reserves for lead bytes,
which indicate the number of UTF-8 code units in the sequence (e.g., ^1110
for 3 code units). Therefore, UTF-8 must use two octets, one lead byte and
one continuation byte, to encode the character (e.g., 0xC3 0xA9 which is
1100 0011… in binary), whereas ISO(-)8859-1 requires only one (0xE9). IOW,
UTF-8 and ISO-8859-1 are incompatible encodings above U+007F (you need a
*program* to convert text that contains those characters from one encoding
to the other).
<https://en.wikipedia.org/wiki/UTF-8>
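The bit arithmetic behind the two-octet form can be worked through by hand. The following Python sketch handles only the two-octet case (U+0080 to U+07FF); the function name is made up for illustration, and in practice you would simply call str.encode("utf-8").

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 octets."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only U+0080..U+07FF take exactly two octets")
    # Lead byte: marker bits 110, followed by the top 5 payload bits.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: marker bits 10, followed by the low 6 payload bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


encoded = encode_two_byte_utf8(0xE9)  # U+00E9, e with acute
print(encoded.hex(" "))                          # c3 a9
print(" ".join(f"{b:08b}" for b in encoded))     # 11000011 10101001
print(encoded == "\u00e9".encode("utf-8"))       # True
```

The single ISO-8859-1 octet 0xE9 thus becomes the pair 0xC3 0xA9, and a decoder that meets 0xE9 on its own is left waiting for continuation bytes that never arrive.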
> The "-a" option solves my problem (while awaiting full
> UTF-8 files and environment use...)
Super.
> Thank you so much,
You're welcome.