How do I identify a file type?

David Storrs

Jun 29, 2018, 12:24:16 PM
to Racket Users
I'd like to be able to answer the question "Is this file gzipped?" without depending on file extensions, much like the Unix CLI utility 'file'.  I've looked through the compression docs (https://docs.racket-lang.org/file/gzip.html and also zip and tar) but have been unable to find anything that seems appropriate.  There's 'file-creator-and-type', which might work but is Mac-specific.  There are various things with 'file-type' in the name, but they are all inappropriate (graphics items and so on).

Can someone point me to the right part of the FM?

Neil Van Dyke

Jun 29, 2018, 12:54:09 PM
to David Storrs, Racket Users
I'm not aware of one, but here are three ways to get lots of Unix `file`
program-like functionality:

* Quick: Make Racket FFI to the `magic` C library.  The code might be
quick to write, but you get problems of making sure the library is
available, and risks of pulling more potentially-imperfect C code into
your Racket process (where it could cause very hard-to-debug problems).

* Also quick: Carefully call the `file` program (if available on the
host system) as a host OS process from Racket, and parse the stdout
output, and catch any stderr, to translate into return values and
exceptions.

* Solid: Make a Racket package that parses the `magic` file into Racket
syntax objects, and uses `syntax-parse` to convert those to Racket code,
all at syntax expansion time.  This might also be a demo of Racket's
language support, to get a safe and modern execution of some important
legacy code in an obscure language (or some story like that).
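
As a rough illustration of the second option, here is a minimal sketch of calling the host's `file` program from Racket. The helper name `file-mime-type` is hypothetical, and the sketch assumes a `file` that accepts the `--brief` and `--mime-type` flags (as the common Unix implementation does). It returns #f rather than raising when `file` is missing, the path does not exist, or the call fails.

```racket
#lang racket/base
(require racket/port racket/string racket/system)

;; Hypothetical helper (not part of any Racket library): ask the host's
;; `file` program for a MIME type.  Returns a string, or #f if `file` is
;; not installed, the path does not exist, or the subprocess fails.
(define (file-mime-type path)
  (define exe (find-executable-path "file"))
  (cond
    [(not (and exe (file-exists? path))) #f]
    [else
     (define err (open-output-string)) ; collect stderr instead of printing it
     (define ok? #f)
     (define out
       (with-output-to-string
         (λ ()
           (parameterize ([current-error-port err])
             ;; "--" keeps a path that starts with "-" from being read as a flag
             (set! ok? (system* exe "--brief" "--mime-type" "--" path))))))
     (define answer (string-trim out))
     (and ok? (not (string=? answer "")) answer)]))
```

The existence check comes first because `file` typically reports "cannot open" on stdout rather than failing, so the exit status alone is not a reliable error signal.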

David Storrs

Jun 29, 2018, 1:01:17 PM
to Neil Van Dyke, Racket Users
I'm actually working on a library right now that will examine the file for signatures as listed here: https://en.wikipedia.org/wiki/List_of_file_signatures and let you get various bits of metadata.  I'll post it to the package server as soon as it's ready, although it will be very minimal to start.

Also, maybe I'm thick, but why would it need to be syntax objects?  I had been intending to have it simply work off either a file path or a port and return relevant data, e.g., (file-type "/tmp/foo") -> 'gzip
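
A minimal sketch of what such an interface might look like, assuming only the gzip signature is known. This is an illustration of the path-or-port idea, not the library's actual code; the name `file-type` is taken from the example above and everything else is an assumption.

```racket
#lang racket/base

;; The two-byte gzip signature, 1F 8B.
(define gzip-magic (bytes #x1F #x8B))

;; Hypothetical `file-type`: accepts a path or an input port and returns
;; 'gzip or #f.  Only gzip is recognized in this sketch.
(define (file-type src)
  (cond
    [(input-port? src)
     ;; peek-bytes leaves the data in the port, so the caller can still
     ;; read the file from the beginning afterwards.  It returns eof for
     ;; an empty source, which simply fails the comparison.
     (and (equal? (peek-bytes 2 0 src) gzip-magic) 'gzip)]
    [(path-string? src)
     (call-with-input-file src file-type)]
    [else
     (raise-argument-error 'file-type "(or/c path-string? input-port?)" src)]))
```

Peeking rather than reading matters mostly for the port case: a caller who hands over a port probably wants to decompress from it next.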

Neil Van Dyke

Jun 29, 2018, 1:07:26 PM
to David Storrs, Racket Users
David Storrs wrote on 06/29/2018 01:01 PM:
>
> * Solid: Make a Racket package that parses the `magic` file into
> Racket syntax objects, and uses `syntax-parse` to convert those to
> Racket code, all at syntax expansion time.  This might also be a
> demo of Racket's language support, to get a safe and modern
> execution of some important legacy code in an obscure language (or
> some story like that).
>
[...]
> why would it need to be syntax objects?

Sorry, I should've explained my understanding.

Historically, IIRC, at least one version of the `file` program has its
logic specified not in C code, but in a file called `magic`, or
something like that.

So, perhaps one could "expand" the language of that file into Racket
code that is then compiled.

And often one good way to do that in Racket is to parse the input
language into something like an abstract syntax tree of Racket syntax
objects, and then use `syntax-parse` and/or other programmatic syntax
transformation to turn those syntax objects into syntax objects for
`racket/base` code that implements the language for that syntax.
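
To make the idea concrete, here is a toy sketch of expanding a rule list into `racket/base` code at syntax-expansion time. The rule notation `[offset bytes result]` and the names `define-detector`, `matches-at?`, and `sniff` are invented for this example; this is not the real magic(5) syntax, which is far richer.

```racket
#lang racket/base
(require (for-syntax racket/base syntax/parse))

;; Runtime helper: does `magic` appear at `offset` in the port?
;; peek-bytes returns eof when the port is too short, so the test fails.
(define (matches-at? in offset magic)
  (equal? (peek-bytes (bytes-length magic) offset in) magic))

;; Hypothetical mini-language: each rule is [offset bytes result-symbol].
;; The macro turns the rules into an ordinary `cond` at expansion time.
(define-syntax (define-detector stx)
  (syntax-parse stx
    [(_ name:id [offset:exact-nonnegative-integer magic:expr result:id] ...)
     #'(define (name in)
         (cond
           [(matches-at? in offset magic) 'result] ...
           [else #f]))]))

;; Two rules, compiled into a function named `sniff`.
(define-detector sniff
  [0 #"\x1F\x8B" gzip]
  [0 #"PK\x03\x04" zip])
```

A real version would first parse the text of the `magic` file into syntax objects like these rules; the macro step after that is the same shape.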

David Storrs

Jun 29, 2018, 1:13:52 PM
to Neil Van Dyke, Racket Users
Ah, I see.  It seems like relying on the 'magic' file represents a weakness and a potential portability issue.  My plan had been to have a directory named 'definitions' and fill it with a set of tiny files, each of which exports a hash describing one file format.  The main.rkt would read all files in this directory and use their data to interpret whatever file you throw at it.  This would allow new file types to be added simply by adding a new file to the 'definitions' directory and everything would Just Work from there.  Does this seem like a sensible way to do it?
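
A minimal sketch of the data-driven design, with the definitions kept in one module for brevity instead of one file each. The hash keys ('name, 'offset, 'signature) and the function names are assumptions made for this example.

```racket
#lang racket/base

;; One hash per file format.  In the directory-based plan, each of these
;; would live in its own small file under "definitions".
(define gzip-def
  (hash 'name 'gzip 'offset 0 'signature (bytes #x1F #x8B)))
(define zip-def
  (hash 'name 'zip 'offset 0 'signature (bytes #x50 #x4B #x03 #x04)))

(define definitions (list gzip-def zip-def))

;; Does the port contain this definition's signature at its offset?
(define (definition-matches? def in)
  (define sig (hash-ref def 'signature))
  (equal? (peek-bytes (bytes-length sig) (hash-ref def 'offset) in) sig))

;; Return the name of the first matching definition, or #f if none match.
(define (identify in [defs definitions])
  (for/first ([def (in-list defs)]
              #:when (definition-matches? def in))
    (hash-ref def 'name)))
```

Adding a format then means adding one more hash to the list; the matching code does not change.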

Eric Griffis

Jun 29, 2018, 1:45:47 PM
to David Storrs, Neil Van Dyke, Racket Users
The (modern) magic file format is rich and well documented. Just `man 5 magic` or google it. If you care about Windows users, this approach is not so portable.

Checking a gzip file signature (first two bytes are 1F 8B) with base Racket is pretty easy:

  (equal? (with-input-from-file "some.gz" (λ () (read-bytes 2))) #"\x1F\x8B")

Eric

--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Neil Van Dyke

Jun 29, 2018, 2:50:56 PM
to David Storrs, Racket Users
You could bundle the "magic" file with your Racket package (in a
license-compliant way), alongside the ".rkt" files.  All of which files
then are essentially source files that all end up contributing to
whatever the Racket compiler spits out (e.g., Racket virtual machine
bytecode).

Regarding portability, the solution of compiling the "magic" file to
Racket code seems very portable and solid.  Binary format files, like
".gz" files, are usually byte-identical regardless of which
machine/platform is looking at them.  (An exception to this is that,
occasionally, a file format might use big-endian byte ordering on one
platform, and little-endian on another, but that's unusual today, and
even then, the file would likely indicate which byte ordering it's
using.)  For text-format files, like source code files, almost all
nowadays are in ASCII, 7-bit ASCII plus some 8th-bit characters in some
extended encoding (e.g., IBM PC extended graphics characters embedded in
string literals of some programming language file), or some Unicode
encoding (especially source code files written by Racket people, for
whom even sprinkling their code with the cool word "lambda" isn't hip
enough :).

The one portability thing I can think of that you can double-check is
that you don't have any file fingerprinting rules that assume Unix text
file newline conventions (ASCII linefeed character 10) rather than also
permit newline to be a linefeed followed by carriage-return (10, 13).

(There's also old mainframe files, which might not even use ASCII for
text, but I wouldn't know what useful to do with such files in a package
such as this.  Mainframe work probably involves a bespoke solution, and
a consultant's briefcase full of US hundred dollar bills.)

David Storrs

Jun 29, 2018, 3:55:20 PM
to Eric Griffis, Racket Users
On Fri, Jun 29, 2018 at 1:45 PM, Eric Griffis <ded...@gmail.com> wrote:
> The (modern) magic file format is rich and well documented. Just `man 5 magic` or google it. If you care about Windows users, this approach is not so portable.

Yep, that's the problem.

Eh, maybe I'll start off just worrying about people on reasonable platforms and add Windows later.  Something to think about, anyway.

> Checking a gzip file signature (first two bytes are 1F 8B) with base Racket is pretty easy:
>
>   (equal? (with-input-from-file "some.gz" (λ () (read-bytes 2))) #"\x1F\x8B")

Yep, I had this already.  Regardless, I appreciate the pointer.



George Neuner

Jun 29, 2018, 4:24:49 PM
to David Storrs, Racket Users

On 6/29/2018 12:24 PM, David Storrs wrote:
> I'd like to be able to answer the question "Is this file gzipped?" without depending on file extensions, much like the Unix CLI utility 'file'.  I've looked through the compression docs (https://docs.racket-lang.org/file/gzip.html and also zip and tar) but have been unable to find anything that seems appropriate.  There's 'file-creator-and-type', which might work but is Mac-specific.  There are various things with 'file-type' in the name, but they are all inappropriate (graphics items and so on).
>
> Can someone point me to the right part of the FM?

gzip and zip files begin with a magic number.  You need to read the first few bytes and identify which format you're dealing with.  [Of course, without actually trying to decode the file, you may misidentify a file that coincidentally has the same first bytes.]

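gzip starts with the bytes 1F 8B (0x1f8b)
zip starts with the bytes 50 4B 03 04 ("PK\x03\x04"), which is 0x04034b50 when read as a little-endian 32-bit integer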

gzip -  https://tools.ietf.org/html/rfc1952
zip   - https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
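
The two magic numbers are conventionally written in different byte orders, which is easy to trip over. The sketch below uses `integer-bytes->integer` from `racket/base` to show how the on-disk bytes relate to the hexadecimal values; the variable names are only for illustration.

```racket
#lang racket/base

;; The first four bytes of a zip local file header, in file order ("PK\3\4").
(define zip-bytes (bytes #x50 #x4B #x03 #x04))

;; Interpreted as an unsigned little-endian integer, they give the value
;; quoted in the zip specification.
(define zip-as-le (integer-bytes->integer zip-bytes #f #f))

;; Interpreted as big-endian, the same bytes give a different number.
(define zip-as-be (integer-bytes->integer zip-bytes #f #t))

;; gzip's 0x1f8b, by contrast, is just the two bytes in file order.
(define gzip-as-be (integer-bytes->integer (bytes #x1F #x8B) #f #t))

(printf "zip  LE: #x~a\n" (number->string zip-as-le 16))
(printf "zip  BE: #x~a\n" (number->string zip-as-be 16))
(printf "gzip BE: #x~a\n" (number->string gzip-as-be 16))
```

Comparing raw byte strings, as in the earlier one-liner, sidesteps the byte-order question entirely.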


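Tar is a pit ... there is no magic number at the start of the file.  POSIX "ustar" archives do carry the string "ustar" at offset 257 of the header block, but older tar formats have nothing, so in general you have to try to decode it and see if you succeed.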
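
For the common case, a partial check is possible: POSIX "ustar" archives (and GNU tar's variant) store the string "ustar" at byte offset 257 of the first 512-byte header block. Pre-POSIX archives have no such marker, so a #f from this hypothetical helper does not prove the file is not a tar archive.

```racket
#lang racket/base

;; Hypothetical helper: #t if the port has "ustar" at offset 257.
;; peek-bytes returns eof for inputs shorter than 262 bytes, so short
;; files simply fail the comparison.
(define (looks-like-ustar? in)
  (equal? (peek-bytes 5 257 in) #"ustar"))

;; Build a fake 512-byte header block to exercise the check.
(define fake-header (make-bytes 512 0))
(bytes-copy! fake-header 257 #"ustar")

(looks-like-ustar? (open-input-bytes fake-header))        ; marker present
(looks-like-ustar? (open-input-bytes (make-bytes 512 0))) ; no marker
```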


George

David Storrs

Jun 29, 2018, 4:45:09 PM
to George Neuner, Racket Users


On Fri, Jun 29, 2018 at 4:24 PM, George Neuner <gneu...@comcast.net> wrote:
> Tar is a pit

I see what you did there.... :P

Hendrik Boom

Jun 30, 2018, 9:52:12 AM
to Racket Users
On Fri, Jun 29, 2018 at 02:50:52PM -0400, Neil Van Dyke wrote:
>
> The one portability thing I can think of that you can double-check is that
> you don't have any file fingerprinting rules that assume Unix text file
> newline conventions (ASCII linefeed character 10) rather than also permit
> newline to be a linefeed followed by carriage-return (10, 13).

The standard in the old days was carriage-return followed by linefeed.
To give the typewriter carriage enough time to get back to the
beginning of the line. So you might want to check for that, too.

>
> (There's also old mainframe files, which might not even use ASCII for text,
> but I wouldn't know what useful to do with such files in a package such as
> this.  Mainframe work probably involves a bespoke solution, and a
> consultant's briefcase full of US hundred dollar bills.)

Or find some very old free software enthusiasts who still remember the
old days.

-- hendrik

Neil Van Dyke

Jun 30, 2018, 10:47:12 AM
to Racket Users
Hendrik Boom wrote on 06/30/2018 09:52 AM:
> The standard in the old days was carriage-return followed by linefeed.
>
> To give the typewriter carriage enough time to get back to the
> beginning of the line.  So you might want to check for that, too.

Thank you for the correction.  I think one wants to handle text file
newlines that are any of:
* CR-LF (MS-DOS and lots of earlier systems, as well as inter-system
gateways/networking, which is why HTTP does it),
* LF-only (Unix), and
* CR-only (I'm pretty sure I've seen it before in some old text files,
now that I think of it, perhaps originating as printer dumps, which
happened; and you might as well support it, so long as you don't
mishandle old printer dumps that are doing CR-only instead for
underscore/strikethrough/doublestrike effects without backspace, and
without implied LF).[1]

To correct my earlier email: I don't recall ever seeing LF-CR as a text
file newline sequence.
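
A minimal sketch of handling all three conventions in a fingerprinting rule. The function names are hypothetical. The regexp tries CR-LF before a bare CR or LF, so a CR-LF pair counts as one line break rather than two.

```racket
#lang racket/base

;; Split a byte string into lines, accepting CR-LF, LF-only, or CR-only.
;; Alternatives are tried in order, so "\r\n" wins over a bare "\r".
(define (split-lines bstr)
  (regexp-split #rx#"\r\n|\r|\n" bstr))

;; Hypothetical helper: tally how often each convention appears.
(define (newline-counts bstr)
  (define crlf (length (regexp-match-positions* #rx#"\r\n" bstr)))
  (define cr   (length (regexp-match-positions* #rx#"\r" bstr)))
  (define lf   (length (regexp-match-positions* #rx#"\n" bstr)))
  ;; Every CR-LF pair contributes one CR and one LF to the raw totals,
  ;; so subtract the pairs to get the bare counts.
  (hash 'crlf crlf 'cr (- cr crlf) 'lf (- lf crlf)))

(split-lines #"a\r\nb\nc\rd")
(newline-counts #"a\r\nb\nc\rd")
```

The caveat about old printer dumps still applies: a bare CR used for overstriking would be counted here as a line break.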

(I've become certain that my email program has a terrible data
corruption bug, which adds errors to whatever I type.)

[1]
http://www.neilvandyke.org/racket/csv-reading/#%28part._.Reader_.Specs%29

Hendrik Boom

Jun 30, 2018, 1:49:04 PM
to Racket Users
On Sat, Jun 30, 2018 at 10:47:08AM -0400, Neil Van Dyke wrote:
> Hendrik Boom wrote on 06/30/2018 09:52 AM:
> > The standard in the old days was carriage-return followed by linefeed.
> >
> > To give the typewriter carriage enough time to get back to the
> > beginning of the line.  So you might want to check for that, too.
>
> Thank you for the correction.  I think one wants to handle text file
> newlines that are any of:
> * CR-LF (MS-DOS and lots of earlier systems, as well as inter-system
> gateways/networking, which is why HTTP does it),
> * LF-only (Unix), and
> * CR-only (I'm pretty sure I've seen it before in some old text files, now
> that I think of it, perhaps originating as printer dumps, which happened;
> and you might as well support it, so long as you don't mishandle old printer
> dumps that are doing CR-only instead for
> underscore/strikethrough/doublestrike effects without backspace, and without
> implied LF).[1]

In those same old days when you had to give the teletype time to return
its carriage, it was permissible for an input routine to accept a bare
CR to indicate a newline, as long as it synthesized the standard CR-LF
pair. This was intended as a convenience to typists, not a file format.

-- hendrik


Jon Zeppieri

Jun 30, 2018, 4:41:10 PM
to Hendrik Boom, Racket Users


> On Jun 30, 2018, at 1:48 PM, Hendrik Boom <hen...@topoi.pooq.com> wrote:
>
>> On Sat, Jun 30, 2018 at 10:47:08AM -0400, Neil Van Dyke wrote:
>> * CR-only (I'm pretty sure I've seen it before in some old text files, now
>> that I think of it, perhaps originating as printer dumps, which happened;
>> and you might as well support it, so long as you don't mishandle old printer
>> dumps that are doing CR-only instead for
>> underscore/strikethrough/doublestrike effects without backspace, and without
>> implied LF).[1]
>
> In those same old days when you had to give the teletype time to return
> its carriage, it was permissible for an input routine to accept a bare
> CR to indicate a newline, as long as it synthesized the standard CR-LF
> pair. This was intended as a convenience to typists, not a file format.
>

Even so, plenty of systems have used a bare carriage return as a newline, including the old MacOS (pre-OS X).