Testing for binary file

Brian Seitz

Mar 4, 2002, 5:04:44 PM
Is there a standard way to test a file for being binary or
text? Something to the effect of -B in Perl. I would be using either
Allegro 5.01 or CMUCL (whatever version is in Debian unstable).

Thanks,

Brian

Erik Naggum

Mar 4, 2002, 8:10:48 PM
* Brian Seitz <bse...@stsci.edu>

| Is there a standard way to test a file for being binary or
| text? Something to the effect of -B in Perl. I would be using either
| Allegro 5.01 or CMUCL (whatever version is in Debian unstable).

I have no idea what -B does in Perl, but a text file is generally
understood to be a file that lacks any other control characters than the
horizontal (CR, HT, SP) and vertical (LF, VT) format effectors. If you
have a decently encoded character set, that means very few characters in
the ranges #x00-#x1f and #x7f-#x9f.
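
As a rough illustration of the control-character test described above, here is a minimal sketch in Python. It assumes a single-byte encoding in the ISO 8859-1 style (in UTF-8, continuation bytes fall into #x80-#x9f and would be miscounted). The names `is_control`, `control_fraction` and `looks_like_text`, and the 1% threshold, are hypothetical choices for this example, not anything from the thread.

```python
# Format effectors named in the post that sit in the control ranges:
# HT, LF, VT, CR. (SP is #x20, outside the control ranges anyway.)
ALLOWED = {0x09, 0x0A, 0x0B, 0x0D}


def is_control(byte: int) -> bool:
    """True for bytes in #x00-#x1f or #x7f-#x9f that are not allowed format effectors."""
    in_control_range = byte <= 0x1F or 0x7F <= byte <= 0x9F
    return in_control_range and byte not in ALLOWED


def control_fraction(data: bytes) -> float:
    """Fraction of bytes that are disallowed control characters (0.0 for empty input)."""
    if not data:
        return 0.0
    return sum(is_control(b) for b in data) / len(data)


def looks_like_text(data: bytes, threshold: float = 0.01) -> bool:
    """Guess 'text' when very few control characters appear.

    The threshold is an illustrative assumption, not a figure from the post.
    """
    return control_fraction(data) <= threshold
```

For example, `looks_like_text(open(path, "rb").read(4096))` would apply the guess to the first 4096 bytes of a file; how much to read is likewise a free choice.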

If you have some IBM-based crud page or any one of the usual Microsoft
disasters, there is no way to tell for real, except you would probably
find periodic line breaks with CRLF in text files.

A common and very simple negative test for a text file is if the last
character in the file is not a line feed.
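
The negative test above can be sketched in a few lines of Python. The function name `fails_newline_test` is hypothetical, and treating an empty file as "no verdict" is a choice made for this example only.

```python
import os


def fails_newline_test(path: str) -> bool:
    """True if the file is non-empty and its last byte is not a line feed.

    A True result suggests the file is not text; a False result proves nothing,
    since a binary file may happen to end in a line feed.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # empty file: the test says nothing either way
        f.seek(-1, os.SEEK_END)  # position on the final byte
        return f.read(1) != b"\n"
```

Because it only reads one byte, this check is cheap enough to run before any more expensive inspection of the file contents.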

///
--
In a fight against something, the fight has value, victory has none.
In a fight for something, the fight is a loss, victory merely relief.

Larry Clapp

Mar 5, 2002, 11:20:17 PM

I dunno about a *standard* way, but you could always rewrite Perl's -B
operator. From perlfunc(1):

The "-T" and "-B" switches work as follows. The first block or so of the
file is examined for odd characters such as strange control codes or
characters with the high bit set. If too many strange characters (>30%)
are found, it's a "-B" file, otherwise it's a "-T" file. Also, any file
containing null in the first block is considered a binary file.

(-T, for you non-Perl-ers, tests for text files.)
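
To make the quoted description concrete, here is a rough sketch of that heuristic in Python. It is not Perl's actual implementation: the 512-byte block size, the set of control codes treated as ordinary, the handling of empty input, and the names `is_odd`, `block_is_binary` and `file_is_binary` are all assumptions made for illustration.

```python
BLOCK_SIZE = 512  # assumed size for "the first block or so"

# Control codes treated as ordinary in this sketch (an assumption):
# BS, HT, LF, VT, FF, CR, ESC.
ORDINARY_CONTROLS = {0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x1B}


def is_odd(byte: int) -> bool:
    """True for high-bit bytes and for control codes outside ORDINARY_CONTROLS."""
    if byte >= 0x80:
        return True
    if byte < 0x20 or byte == 0x7F:
        return byte not in ORDINARY_CONTROLS
    return False


def block_is_binary(block: bytes) -> bool:
    """Apply the two rules from the quoted text to one block of bytes."""
    if not block:
        return False  # empty input: this sketch calls it text
    if 0 in block:
        return True  # any null byte means binary
    odd = sum(is_odd(b) for b in block)
    return odd / len(block) > 0.30  # more than 30% odd characters


def file_is_binary(path: str) -> bool:
    """Read only the first block of the file and classify it."""
    with open(path, "rb") as f:
        return block_is_binary(f.read(BLOCK_SIZE))
```

The same structure (read one block of octets, reject on null, count odd octets) should carry over directly to a Lisp stream opened with an 8-bit element type.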

-- Larry

Marco Antoniotti

Mar 6, 2002, 11:23:00 AM

Larry Clapp <la...@theclapp.org> writes:

Sorry. Binary files are defined by having 42% of "strange" characters
in the first 4242 sextets (4+2 bits).

Perl got this wrong. :)

Cheers

--
Marco Antoniotti ========================================================
NYU Courant Bioinformatics Group tel. +1 - 212 - 998 3488
719 Broadway 12th Floor fax +1 - 212 - 995 4122
New York, NY 10003, USA http://bioinformatics.cat.nyu.edu
"Hello New York! We'll do what we can!"
Bill Murray in `Ghostbusters'.

Tim Bradshaw

Mar 6, 2002, 11:36:00 AM
* Marco Antoniotti wrote:

> Sorry. Binary files are defined by having 42% of "strange" characters
> in the first 4242 sextets (4+2 bits).

> Perl got this wrong. :)

Rubbish. Files are binary if more than 17 of the first 23 5-bit bytes
are not legal BAUDOT, with the exception that, if they spell
'EWIGE BLUMENKRAFT FNORD', the file is considered binary anyway.

--tim
