grep output message: "binary file"

Baudais Jean-Yves

Mar 13, 2017, 11:12:39 AM

Hello,

I have many TeX files, and when I run

$> grep -F "\chapter" *.tex

the command works very well with some files, but with some others I get
the message (in French)

$> "Fichier binaire file.tex correspondant"

(the English message is "Binary file file.tex matches") and grep does not
show the matching lines! I get the same message if I run

$> cat file.tex | grep -F "\chapter"

I don't understand, because none of the files is binary. They are all
human-readable, as expected, and I can compile them with LaTeX. None of the
file mode bits is "x" (only "rw"), and there is no difference between the
files (LaTeX document, ISO-8859 text) with

$> file -i *.tex

Thanks for your help (and sorry for my bad English),

Jean-Yves B.

Thomas 'PointedEars' Lahn

Mar 13, 2017, 7:02:38 PM

Baudais Jean-Yves wrote:

> I have many TeX file and when I do
>
> $> grep -F "\chapter" *.tex

(I suggest that you modify the PS1 environment variable to create a less
ambiguous command prompt. “>” is the shell redirection special character,
and it can be used to create an empty file named “my_empty_file” if you
write the command “> my_empty_file”. I find the “>” in the prompt
confusing; this is not WinDOS.)

> the command works very well with some files, but with some other I have
> the message (in french)
>
> $> "Fichier binaire file.tex correspondant"
>
> ("corresponding binary file") and grep don't read the file! I have the
> same message if I do
>
> $> cat file.tex | grep -F "\chapter"

UUOC (useless use of cat(1)).

grep -F '\chapter' file.tex

> I don't understand because all files are not binary at all. They are all
> human readable as espected. I can compile them with LaTeX. All the file
> mode bits are not "x" (only "rw") and there is no file difference (LaTeX
> document, ISO-8859 text) with
>
> $> file -i *.tex

(This is probably a FAQ.)

grep(1) is a utility designed to search text, so by default it does not print
matching lines from non-text files, i.e. binaries; it only reports that such a
file matches. Whether content is considered text depends on whether it can be
fully decoded to text characters. Whether that is
possible depends on the used decoder, which depends on the locale,
specifically the value of the LC_CTYPE environment variable (see locale(1)
and locale(7)). You can either set LC_CTYPE temporarily to a fitting value
if you have the corresponding locale installed –

LC_CTYPE=fr_FR.ISO-8859-1 grep …

– or with GNU grep(1) you can use the “-a” (or “--text” or
“--binary-files=text”) option to force grep(1) to process the file as if it
were text. RTFM.
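
To illustrate the second option, here is a minimal sketch, assuming GNU grep
and an iconv(1) that knows ISO-8859-1. The file sample.tex and its content are
made up for this example; they are not the OP's data.

```shell
# Hypothetical sample file: one line of ISO-8859-1 text.
# \351 is the octal escape for the octet 0xE9 ("é" in ISO-8859-1).
workdir=$(mktemp -d)
printf '\\chapter{R\351sum\351}\n' > "$workdir/sample.tex"

# -a lets grep print the matching line even if the file is considered binary.
# The line is still ISO-8859-1-encoded, so convert it for a UTF-8 terminal.
matched=$(grep -a -F '\chapter' "$workdir/sample.tex" | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$matched"
```

Piping through iconv(1) is only needed if the terminal expects UTF-8;
otherwise the raw output of "grep -a" can be used directly.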

> Thanks for your help (and sorry for my bad english),

De rien :)

(IMHO, your English is not flawless, but good.)

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

Thomas 'PointedEars' Lahn

Mar 13, 2017, 7:11:21 PM

Baudais Jean-Yves wrote:

> I have many TeX file and when I do
>
> $> grep -F "\chapter" *.tex

(I suggest that you modify the PS1 environment variable to create a less
ambiguous command prompt. “>” is the shell redirection special character,
and it can be used to create an empty file named “my_empty_file” if you
write the command “> my_empty_file”. I find the “>” in the prompt
confusing; this is not WinDOS.)

> the command works very well with some files, but with some other I have
> the message (in french)
>
> $> "Fichier binaire file.tex correspondant"

(However, this suggests that “$>” is not your command prompt, but some sort
of uncommon quoting prefix that you added. Please do not do it like that.
Always post input and output, particularly commands and messages, verbatim;
at least post unambiguously.)

> ("corresponding binary file") and grep don't read the file! I have the
> same message if I do
>
> $> cat file.tex | grep -F "\chapter"

UUOC (useless use of cat(1)).

grep -F '\chapter' file.tex

> I don't understand because all files are not binary at all. They are all
> human readable as espected. I can compile them with LaTeX. All the file
> mode bits are not "x" (only "rw") and there is no file difference (LaTeX
> document, ISO-8859 text) with
>
> $> file -i *.tex

(This is probably a FAQ.)

grep(1) is a utility designed to search text, so by default it does not print
matching lines from non-text files, i.e. binaries; it only reports that such a
file matches. Whether content is considered text depends on whether it can be
fully decoded to text characters. Whether that is
possible depends on the used decoder, which depends on the locale,
specifically the value of the LC_CTYPE environment variable (see locale(1)
and locale(7)). You can either set LC_CTYPE temporarily to a fitting value
if you have the corresponding locale installed –

LC_CTYPE=fr_FR.ISO-8859-1 grep …

– or with GNU grep(1) you can use the “-a” (or “--text” or
“--binary-files=text”) option to force grep(1) to process the file as if it
were text. RTFM.

> Thanks for your help (and sorry for my bad english),

Baudais Jean-Yves

Mar 14, 2017, 5:29:50 AM

Le 14/03/2017 à 00:11, Thomas 'PointedEars' Lahn a écrit :
> (However, this suggests that “$>” is not your command prompt, but some sort
> of uncommon quoting prefix that you added. Please do not do it like that.
> Always post input and output, particularly commands and messages, verbatim;
> at least post unambiguously.)

You are right, my PS1 is not "$>" (but PS2 is ">" by default); thank
you for your remark and advice :-)

> (This is probably a FAQ.)

Maybe, but my web search turned up nothing; maybe I didn't spend enough
time on it :-(

> grep(1) is a utility designed to search text, so it ignores non-text files,
> i.e. binaries, by default. Whether content is considered text depends on
> whether it can be fully decoded to text characters. Whether that is
> possible depends on the used decoder, which depends on the locale,
> specifically the value of the LC_CTYPE environment variable (see locale(1)
> and locale(7)). [...]

OK, I think I understand. My LANG variable is fr_FR.UTF-8 and my files
are ISO-8859-1, so when grep outputs an accented character I get the message.
Of course the files are not binary, but the output is not valid for
fr_FR.UTF-8. The "-a" option solves my problem (while awaiting a full
switch to UTF-8 files and environment...)

Thank you so much,

Jean-Yves

Thomas 'PointedEars' Lahn

Mar 14, 2017, 5:27:38 PM

Baudais Jean-Yves wrote:

> Le 14/03/2017 à 00:11, Thomas 'PointedEars' Lahn a écrit :
>> grep(1) is a utility designed to search text, so it ignores non-text
>> files, i.e. binaries, by default. Whether content is considered text
>> depends on whether it can be fully decoded to text characters. Whether
>> that is possible depends on the used decoder, which depends on the
>> locale, specifically the value of the LC_CTYPE environment variable (see
>> locale(1) and locale(7)). [...]
>
> Ok, I think I understand. My LANG variable is fr_FR.UTF-8 and my files
> are ISO-8859-1 so when grep gives accented character I have the message.

Yes; according to locale(7), the value of the LANG variable serves as the
last fallback for determining the locale when neither LC_ALL nor the LC_*
variable of the relevant category is set to a non-empty value.
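
That lookup order can be sketched in the shell as follows. This is a
simplification under stated assumptions: the function name effective_ctype is
made up, only the LC_CTYPE category is considered, and it reports what the
variables request, not whether that locale is actually installed.

```shell
# Print the locale name requested for character classification (LC_CTYPE),
# following the precedence described in locale(7):
# LC_ALL first, then LC_CTYPE, then LANG; "POSIX" if none is set and non-empty.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-POSIX}}}"
}

effective_ctype
```

The output of locale(1) shows the same information for all categories.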

It is also possible that your files are neither ISO/IEC 8859-1-encoded nor
ISO-8859-1-encoded, but Windows-1252-encoded (although not part of the
latter encoding as such, CRLF instead of LF for newline is indicative of
that). file(1) may not be able to determine the difference, but recode(1)
or iconv(1) should.
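
As a rough first check, here is a minimal sketch, assuming an iconv(1) that
exits non-zero on invalid input (as the glibc one does). The function names
and sample files are made up for illustration. Note that it only rules UTF-8
in or out and looks for CR octets; it does not tell ISO-8859-1 from
Windows-1252.

```shell
# Succeeds (exit status 0) if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Succeeds if the file contains at least one CR octet (a hint at CRLF newlines).
has_cr() {
    [ "$(( $(tr -cd '\r' < "$1" | wc -c) ))" -gt 0 ]
}

# Hypothetical sample files for illustration.
samples=$(mktemp -d)
printf 'caf\351\n' > "$samples/latin1.txt"        # ISO-8859-1, LF newline
printf 'caf\303\251\r\n' > "$samples/utf8.txt"    # UTF-8, CRLF newline

for f in "$samples/latin1.txt" "$samples/utf8.txt"; do
    if is_valid_utf8 "$f"; then enc='valid UTF-8'; else enc='not valid UTF-8'; fi
    if has_cr "$f"; then nl='contains CR'; else nl='no CR'; fi
    printf '%s: %s, %s\n' "${f##*/}" "$enc" "$nl"
done
```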

> Of course the files are not binary ones but the output is not compliant
> with fr_FR.UTF-8.

Almost correct; the _input_ (file) is not, then. There are octet sequences
in ISO/IEC 8859-1 and ISO-8859-1, especially those for the accented basic
Latin characters like U+00E9 LATIN SMALL LETTER E WITH ACUTE (« é »), that
are not valid UTF-8 sequences, which apparently leads grep(1) to the
assumption that it is a binary.

The code points of those characters lie in a range (U+0080 and above) where
the single ISO-8859-1 octet starts with a bit pattern that UTF-8 reserves for
the bytes of multi-octet sequences. For example, 0xE9 (1110 1001 in binary)
has the lead-byte prefix ^1110, which announces a sequence of 3 code units;
followed by an ASCII letter instead of continuation bytes, it is invalid.
In UTF-8, two octets, one leading byte and one continuation byte, must be
used to encode the character (e.g., 0xC3 0xA9 which is 1100 0011… in
binary), whereas ISO(-)8859-1 requires only one (0xE9). IOW, UTF-8 and
ISO-8859-1 are incompatible encodings above U+007F (you need a *program* to
convert text that contains those characters from one to the other).

<https://en.wikipedia.org/wiki/UTF-8>
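
The two encodings of « é » can be compared octet by octet with a small
sketch like this one, assuming od(1) and an iconv(1) that knows ISO-8859-1
(the variable names are made up):

```shell
# The single octet 0xE9 (octal 351) is "é" in ISO-8859-1 ...
latin1_hex=$(printf '\351' | od -An -tx1 | tr -d ' \n')

# ... and converting it to UTF-8 yields two octets.
utf8_hex=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

printf 'ISO-8859-1: %s\nUTF-8:      %s\n' "$latin1_hex" "$utf8_hex"
```

This should print e9 for ISO-8859-1 and c3a9 for UTF-8. The same iconv(1)
call, given a file name as argument and with its output redirected to a new
file, is one way to convert a whole document.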

> The "-a" option solve my problem (while awaiting full
> UFT8 files and environment use...)

Super.

> Thank you so much,

De rien.