Test if a file is binary

Rob Young

Oct 30, 2012, 5:02:22 PM
to golan...@googlegroups.com
Does anyone know of a library, or something similar to Perl's -B flag, for guessing whether a file is binary or not in Go?

Thanks,
Rob

Dustin

Oct 30, 2012, 5:03:27 PM
to golan...@googlegroups.com

On Tuesday, October 30, 2012 2:02:22 PM UTC-7, Rob Young wrote:
Does anyone know of a library, or something similar to Perl's -B flag, for guessing whether a file is binary or not in Go?

  What does "binary" mean to you? 

Larry Clapp

Oct 30, 2012, 5:09:45 PM
to golan...@googlegroups.com
Well, for reference, here's what it means to Perl:

The "-T" and "-B" switches work as follows.  The first block or so of the file is examined for odd characters such as strange control codes or characters with the high bit set.  If too many strange characters (>30%) are found, it's a "-B" file; otherwise it's a "-T" file.  Also, any file containing null in the first block is considered a binary file.  If "-T" or "-B" is used on a filehandle, the current IO buffer is examined rather than the first block.  Both "-T" and "-B" return true on a null file, or a file at EOF when testing a filehandle.  Because you have to read a file to do the "-T" test, on most occasions you want to use a "-f" against the file first, as in "next unless -f $file && -T $file".
 
-- Larry
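
A minimal Go sketch of the rule quoted above, for illustration only. The function name looksBinaryPerlStyle is hypothetical, and which control codes count as "odd" is an assumption here (everything below 0x20 except tab, LF, CR and FF, plus DEL); this is not Perl's actual implementation.

```go
package main

import "fmt"

// looksBinaryPerlStyle approximates the quoted -B rule on one block of bytes.
// It is a hypothetical helper, not Perl's actual implementation.
func looksBinaryPerlStyle(block []byte) bool {
	if len(block) == 0 {
		return true // the quoted text says -B (and -T) are true for an empty file
	}
	odd := 0
	for _, b := range block {
		switch {
		case b == 0:
			return true // any NUL in the first block means binary
		case b >= 0x80, b == 0x7F:
			odd++ // high bit set, or DEL
		case b < 0x20 && b != '\t' && b != '\n' && b != '\r' && b != '\f':
			odd++ // control code other than common whitespace (assumption)
		}
	}
	// Binary when more than 30% of the bytes are odd.
	return odd*10 > len(block)*3
}

func main() {
	fmt.Println(looksBinaryPerlStyle([]byte("plain text\n")))
	fmt.Println(looksBinaryPerlStyle([]byte{0x7F, 'E', 'L', 'F', 0x02, 0x01, 0x01, 0x00}))
}
```

The caller decides how large "the first block or so" is; the sketch only classifies whatever bytes it is given.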

Rob Pike

Oct 30, 2012, 5:15:22 PM
to Larry Clapp, golan...@googlegroups.com
"Strange" is not well-defined.

I suggest trying to convert the first "block or so" into UTF-8. If
that causes errors, it's probably binary.

-rob
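
One way this suggestion might look in Go, as a sketch rather than a definitive implementation. The names firstBlock, looksLikeUTF8 and the 8192-byte blockSize are assumptions made for the example. A rune that is merely cut off by the end of the block is tolerated, so a read boundary in the middle of a multi-byte character does not count as an error.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"unicode/utf8"
)

const blockSize = 8192 // assumed size of "the first block or so"

// firstBlock reads at most n bytes from the start of the named file.
func firstBlock(path string, n int) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, n)
	m, err := io.ReadFull(f, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, err
	}
	return buf[:m], nil
}

// looksLikeUTF8 reports whether block decodes as UTF-8, allowing one
// incomplete rune at the very end of the block.
func looksLikeUTF8(block []byte) bool {
	for len(block) > 0 {
		r, size := utf8.DecodeRune(block)
		if r == utf8.RuneError && size == 1 {
			// Invalid byte, unless the remainder is a truncated valid prefix.
			return !utf8.FullRune(block)
		}
		block = block[size:]
	}
	return true
}

func main() {
	for _, path := range os.Args[1:] {
		block, err := firstBlock(path, blockSize)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: utf8=%v\n", path, looksLikeUTF8(block))
	}
}
```

Note that a NUL byte is valid UTF-8, so this check alone does not flag files that are mostly zero bytes.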

Devon H. O'Dell

Oct 30, 2012, 5:21:52 PM
to Rob Pike, Larry Clapp, golan...@googlegroups.com
You may also be interested in checking out libmagic (or making a Go
implementation that is able to parse the compiled magic file from
/usr/share/misc/magic). I would say that this is probably more useful
than "guessing" based on a block.

--dho

Dustin

Oct 30, 2012, 5:22:48 PM
to golan...@googlegroups.com

On Tuesday, October 30, 2012 2:09:45 PM UTC-7, Larry Clapp wrote:
Well, for reference, here's what it means to Perl:

The "-T" and "-B" switches work as follows.  The first block or so of the file is examined for odd characters such as strange control codes or characters with the high bit set.  If too many strange characters (>30%) are found, it's a "-B" file; otherwise it's a "-T" file.  Also, any file containing null in the first block is considered a binary file.  If "-T" or "-B" is used on a filehandle, the current IO buffer is examined rather than the first block.  Both "-T" and "-B" return true on a null file, or a file at EOF when testing a filehandle.  Because you have to read a file to do the "-T" test, on most occasions you want to use a "-f" against the file first, as in "next unless -f $file && -T $file".

  Is it a null character, or is this a UTF-16 file?  Is it a strange character, or did you see a part of a UTF-8 multi-byte character sequence?  Is it a text file, or did you just not read far enough into it to find that it's some special format that requires special processing?

  Generally someone asking this question is trying to do something they're leaving out of the question, and that's why I ask.  net/http has DetectContentType, which is sometimes useful for determining a MIME type for content, but there's a lot it doesn't know.  I know there are "magic" ports out there that may be more exhaustive.

Rob Young

Oct 30, 2012, 5:43:01 PM
to golan...@googlegroups.com
For the level of detail I need, I think Rob's solution of trying to parse the first block as UTF-8 and then detecting errors should be good enough. It's for ruling out files when searching over pretty bog-standard text files, so it doesn't need to be exhaustive.
Thanks.

Dustin

Oct 30, 2012, 6:50:08 PM
to golan...@googlegroups.com

On Tuesday, October 30, 2012 2:43:01 PM UTC-7, Rob Young wrote:
For the level of detail I need, I think Rob's solution of trying to parse the first block as UTF-8 and then detecting errors should be good enough. It's for ruling out files when searching over pretty bog-standard text files, so it doesn't need to be exhaustive.

  Sounds like you could get by well with net/http.DetectContentType, which does basically that but will also sort out UTF-8 or UTF-16 (big- or little-endian) text-looking files.
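
A short sketch of how that could be used as a text/binary test. http.DetectContentType is the real standard-library call; the helper name looksBinaryByContentType and the rule "anything whose detected type does not start with text/ is binary" are assumptions for the example, and that rule will misclassify textual formats that sniff as application/* (PDF or PostScript, for instance).

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// looksBinaryByContentType treats a block as binary unless the sniffed
// MIME type is one of the text/* types. The text/ prefix rule is an assumption.
func looksBinaryByContentType(block []byte) bool {
	ct := http.DetectContentType(block)
	return !strings.HasPrefix(ct, "text/")
}

func main() {
	samples := [][]byte{
		[]byte("hello, world\n"),
		[]byte("\x89PNG\r\n\x1a\n"),
		{0xFF, 0xFE, 'h', 0, 'i', 0}, // UTF-16LE text with a byte-order mark
	}
	for _, s := range samples {
		fmt.Printf("%q -> %s (binary=%v)\n", s, http.DetectContentType(s), looksBinaryByContentType(s))
	}
}
```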

Greg Ward

Oct 31, 2012, 3:59:59 PM
to Larry Clapp, golan...@googlegroups.com
On 30 October 2012, Larry Clapp said:
>
> Well, for reference, here's what it means to Perl:
>
> The "-T" and "-B" switches work as follows. The first block or so of the
> file is examined for odd characters such as strange control codes or
> characters with the high bit set. If too many strange characters (>30%)
> are found, it's a "-B" file; otherwise it's a "-T" file. Also, any file
> containing null in the first block is considered a binary file. If "-T" or
> "-B" is used on a filehandle, the current IO buffer is examined rather than
> the first block. Both "-T" and "-B" return true on a null file, or a file
> at EOF when testing a filehandle. Because you have to read a file to do
> the "-T" test, on most occasions you want to use a "-f" against the file
> first, as in "next unless -f $file && -T $file".

There's more than one way to do it. For example, Mercurial's
definition of "binary" (for deciding how to display a diff) is
"contains at least one NUL byte (\0)". This has the advantage of being
dead simple to explain and implement, but it requires having the whole
file in memory. But since Mercurial always does this anyways, no big
deal. (After all, who would put a file into source control that
doesn't fit in memory -- that would be madness!)

(Yes, that last sentence was *irony*.)

IOW: make your own rule and implement it. I can't imagine Go adding a
fuzzy, might-work-most-of-the-time heuristic like this to the library.

Greg
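
The "contains at least one NUL byte" rule described above can be sketched in Go without holding the whole file in memory. This is an illustration of the rule as stated in the post, not Mercurial's code; the names hasNUL and containsNUL and the 32 KiB chunk size are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// hasNUL reports whether data contains at least one NUL byte.
func hasNUL(data []byte) bool {
	return bytes.IndexByte(data, 0) >= 0
}

// containsNUL applies hasNUL chunk by chunk, so the input is streamed
// rather than loaded whole. It stops at the first NUL it finds.
func containsNUL(r io.Reader) (bool, error) {
	buf := make([]byte, 32*1024) // chunk size is an arbitrary choice
	for {
		n, err := r.Read(buf)
		if hasNUL(buf[:n]) {
			return true, nil
		}
		if err == io.EOF {
			return false, nil
		}
		if err != nil {
			return false, err
		}
	}
}

func main() {
	for _, path := range os.Args[1:] {
		f, err := os.Open(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		binary, err := containsNUL(f)
		f.Close()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: binary=%v\n", path, binary)
	}
}
```

A text file still has to be read to the end before it can be declared NUL-free, so the saving is in memory, not in I/O.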

Carl Menezes

Jun 4, 2014, 8:40:36 PM
to golan...@googlegroups.com
Just for the future: I did try your suggestion, and checking the first buffer of bytes in the file using DetectContentType works like a charm. No idea how well it performs, but I have been using it for a couple of days and have yet to get a false positive.
From my tests, looking for non-UTF-8 characters (even in the first buffer) doesn't work well, especially on Windows machines, because UTF-16 files fall into this trap.
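
To illustrate the UTF-16 trap: the byte-order mark FF FE (or FE FF) is not valid UTF-8, so a plain UTF-8 validity check rejects such files. A hedged sketch of one workaround, checking for a UTF-16 byte-order mark first; the helper names are hypothetical, and UTF-16 files without a byte-order mark are not handled here.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// hasUTF16BOM reports whether block starts with a UTF-16 byte-order mark.
func hasUTF16BOM(block []byte) bool {
	return bytes.HasPrefix(block, []byte{0xFF, 0xFE}) ||
		bytes.HasPrefix(block, []byte{0xFE, 0xFF})
}

// looksLikeText accepts BOM-marked UTF-16, or valid UTF-8 with no NUL bytes.
func looksLikeText(block []byte) bool {
	if hasUTF16BOM(block) {
		return true
	}
	return utf8.Valid(block) && bytes.IndexByte(block, 0) < 0
}

func main() {
	utf16le := []byte{0xFF, 0xFE, 'h', 0, 'i', 0}
	fmt.Println(utf8.Valid(utf16le), looksLikeText(utf16le))
}
```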

Robert Knight

Jun 5, 2014, 9:43:05 AM
to golan...@googlegroups.com
There is a relatively simple algorithm in this Internet-Draft: http://tools.ietf.org/html/draft-abarth-mime-sniff-06#page-8 and a simple isBinaryData() function in https://github.com/adobe/webkit/blob/master/Source/WebCore/platform/network/MIMESniffing.cpp. The latter works by classifying each byte as a binary indicator or not. If the content contains a binary-indicator byte, it is considered binary.
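
A Go sketch of the per-byte classification idea, named after the WebKit function but not a port of it. The byte ranges used below (0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F) are what the MIME sniffing specification calls binary data bytes, as far as I know; treat them as an assumption and check them against the linked documents.

```go
package main

import "fmt"

// isBinaryIndicator reports whether b is a control byte that is not expected
// in text. Tab, LF, FF, CR and ESC are deliberately not in the set.
func isBinaryIndicator(b byte) bool {
	switch {
	case b <= 0x08, b == 0x0B, 0x0E <= b && b <= 0x1A, 0x1C <= b && b <= 0x1F:
		return true
	}
	return false
}

// isBinaryData reports whether any byte in block is a binary indicator.
func isBinaryData(block []byte) bool {
	for _, b := range block {
		if isBinaryIndicator(b) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isBinaryData([]byte("tabs\tand\r\nnewlines are fine")))
	fmt.Println(isBinaryData([]byte{'a', 0x00, 'b'}))
}
```

Bytes with the high bit set are not indicators under this rule, so non-ASCII text in any 8-bit encoding passes.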

Carl Menezes

Jun 5, 2014, 6:43:11 PM
to golan...@googlegroups.com
The latest version of that draft is http://tools.ietf.org/html/draft-ietf-websec-mime-sniff

From the net/http.DetectContentType source:
"// DetectContentType implements the algorithm described
// at http://mimesniff.spec.whatwg.org/ to determine the
// Content-Type of the given data. It considers at most the ..."

It turns out that both sources reference the same paper published by Adam Barth in 2009 with regard to content sniffing.
It also turns out that the version history of http://mimesniff.spec.whatwg.org/ started with Adam Barth making the initial commits until mid 2012, after which others started contributing.
Given that Adam Barth was working for Google when he published the IETF draft in Jan 2011, and that the mimesniff spec implemented by the Go standard library was last updated on 17 Jan 2014, I would infer that implementing the algorithm myself would be reinventing a square wheel :)

Thanks for the link though. I did learn something after all.