How to know if a file is a text file

Luca Fabbri

Nov 14, 2009, 11:02:29 AM

to pytho...@python.org

Hi all.

I'm looking for a way to load a generic file from the
system and determine whether it is plain text.
The mimetypes module has some nice functions, but, for example, it
doesn't work for files without an extension.

Any suggestions?

--
-- luca

Philip Semanchuk

Nov 14, 2009, 12:51:30 PM

to Python-list (General)

On Nov 14, 2009, at 11:02 AM, Luca Fabbri wrote:

> Hi all.
>
> I'm looking for a way to load a generic file from the
> system and determine whether it is plain text.
> The mimetypes module has some nice functions, but, for example, it
> doesn't work for files without an extension.

Hi Luca,
You have to define what you mean by "text" file. It might seem
obvious, but it's not.

Do you mean just ASCII text? Or will you accept Unicode too? Unicode
text can be more difficult to detect because you have to guess the
file's encoding (unless it has a BOM; most don't).
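
As a rough illustration of the BOM case, the sketch below checks the first bytes of a file against the BOM constants in the standard `codecs` module. It assumes Python 3; the function name `encoding_from_bom` is made up for this example, and a file without a BOM simply yields `None`, so the encoding still has to be guessed some other way.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(head):
    """Return a codec name if `head` (bytes) starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

Passing the first four bytes of the file is enough, since no BOM is longer than that.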

And do you need to verify that every single byte in the file is
"text"? What if the file is 1 GB? Do you still want to examine every
single byte?

If you give us your own (specific!) definition of what "text" means,
or perhaps a description of the problem you're trying to solve, then
maybe we can help you better.

Cheers
Philip

Nobody

Nov 15, 2009, 7:06:48 AM

You could use the "file" command. It's normally installed by default on
Unix systems, but you can get a Windows version from:

http://gnuwin32.sourceforge.net/packages/file.htm
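
For anyone wanting to call "file" from Python, here is a minimal sketch using `subprocess`. It assumes Python 3.7 or later and a "file" build that accepts the `--brief` and `--mime-type` options (very old versions may not); the helper names are invented for this example. When "file" is not on the PATH the helpers return `None` rather than guessing.

```python
import shutil
import subprocess


def mime_type_from_file_command(path):
    """Ask the external `file` utility for a MIME type; None if unavailable."""
    exe = shutil.which("file")
    if exe is None:
        return None  # `file` is not installed or not on PATH
    result = subprocess.run(
        [exe, "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def looks_like_text(path):
    """True/False based on a text/* MIME type, or None if `file` gave no answer."""
    mime = mime_type_from_file_command(path)
    if mime is None:
        return None
    return mime.startswith("text/")
```

Note that some textual formats may be reported under a non-`text/` type, so the `startswith("text/")` test is a simplification.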

Chris Rebert

Nov 15, 2009, 7:34:10 AM

to Nobody, pytho...@python.org

On Sun, Nov 15, 2009 at 4:06 AM, Nobody <nob...@nowhere.com> wrote:
> On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:
>
>> I'm looking for a way to load a generic file from the
>> system and determine whether it is plain text.
>> The mimetypes module has some nice functions, but, for example, it
>> doesn't work for files without an extension.
>>
>> Any suggestions?
>
> You could use the "file" command. It's normally installed by default on
> Unix systems, but you can get a Windows version from:

FWIW, IIRC the heuristic `file` uses to check whether a file is text
or not is whether it contains any null bytes; if it does, it
classifies it as binary (i.e. not text).
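
The NUL-byte heuristic as recalled here is easy to sketch in Python 3. This is only an illustration of that rule of thumb, not a claim about what `file` actually does; the function name and the 8192-byte sample size are arbitrary choices for the example. Reading a bounded sample also avoids scanning a very large file in full.

```python
def contains_nul(path, sample_size=8192):
    """Return True if the first `sample_size` bytes of the file include a NUL byte."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(sample_size)
```

A `True` result suggests binary data under this heuristic, though UTF-16 text would be misclassified, since it normally contains many NUL bytes.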

Cheers,
Chris
--
http://blog.rebertia.com

Luca

Nov 15, 2009, 7:49:54 AM

to Philip Semanchuk, Python-list (General)

On Sat, Nov 14, 2009 at 6:51 PM, Philip Semanchuk <phi...@semanchuk.com> wrote:
> Hi Luca,
> You have to define what you mean by "text" file. It might seem obvious, but
> it's not.
>
> Do you mean just ASCII text? Or will you accept Unicode too? Unicode text
> can be more difficult to detect because you have to guess the file's
> encoding (unless it has a BOM; most don't).
>
> And do you need to verify that every single byte in the file is "text"? What
> if the file is 1 GB? Do you still want to examine every single byte?
>
> If you give us your own (specific!) definition of what "text" means, or
> perhaps a description of the problem you're trying to solve, then maybe we
> can help you better.
>

Thanks all.

I was fairly sure this would not be a simple task. Searching only
for ASCII-encoded text is not enough for me (my native language
falls outside that encoding :-)
Checking every single byte could be a good solution...

I can start with the mimetypes module and, if the file has no
extension, check it byte by byte, much as the "file" command does.
Better: I can use the "file" command if it is available.
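
A minimal sketch of that plan in Python 3, under some loud assumptions: the name `is_text_file` is made up, only a bounded sample is read instead of the whole file, and "text" in the fallback means "no NUL bytes and valid UTF-8", which is just one possible definition. Files in other 8-bit encodings (Latin-1, for example) would need a different check.

```python
import codecs
import mimetypes


def is_text_file(path, sample_size=8192):
    """Guess whether `path` is text: extension first, then a byte check."""
    mime, _encoding = mimetypes.guess_type(path)
    if mime is not None:
        # The extension is known, so trust the MIME type it maps to.
        return mime.startswith("text/")

    # No usable extension: inspect the beginning of the file instead.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character that is
    # cut off at the end of the sample, but still rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

When the extension is recognised the file is never opened, so the answer in that branch depends entirely on the file name.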

Again: thanks all!

--
-- luca

Nobody

Nov 15, 2009, 1:50:55 PM

On Sun, 15 Nov 2009 04:34:10 -0800, Chris Rebert wrote:

>>> I'm looking for a way to load a generic file from the
>>> system and determine whether it is plain text.
>>> The mimetypes module has some nice functions, but, for example, it
>>> doesn't work for files without an extension.
>>>
>>> Any suggestions?
>>
>> You could use the "file" command. It's normally installed by default on
>> Unix systems, but you can get a Windows version from:
>
> FWIW, IIRC the heuristic `file` uses to check whether a file is text
> or not is whether it contains any null bytes; if it does, it
> classifies it as binary (i.e. not text).

"file" provides more granularity than that, recognising many specific
formats, both text and binary.

First, it uses "magic number" checks, checking for known signature bytes
(e.g. "#!" or "JFIF") at the beginning of the file. If those checks fail
it checks for common text encodings. If those also fail, it reports "data".

Also, UTF-16-encoded text is recognised as text, even though it may
contain a high proportion of NUL bytes.
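
To make that order of checks concrete, here is a toy Python 3 sketch of the same three stages: signatures first, then text encodings, then "data". It is not how "file" is implemented; the signature table is a tiny hand-picked sample, and the labels and the name `classify` are invented for this example.

```python
import codecs

# A few well-known signatures; the real "file" uses a far larger database.
_SIGNATURES = [
    (b"#!", "script text"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
]


def _decodes_as(sample, encoding):
    """True if `sample` is valid in `encoding`, ignoring a truncated tail."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def classify(sample):
    """Classify the leading bytes of a file in three stages."""
    # Stage 1: signature bytes at the very beginning.
    for magic, label in _SIGNATURES:
        if sample.startswith(magic):
            return label
    # Stage 2: common text encodings. UTF-16 is checked before the NUL
    # test because UTF-16 text legitimately contains NUL bytes.
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if _decodes_as(sample, "utf-16"):
            return "UTF-16 text"
    if b"\x00" not in sample and _decodes_as(sample, "utf-8"):
        if all(byte < 0x80 for byte in sample):
            return "ASCII text"
        return "UTF-8 text"
    # Stage 3: nothing matched.
    return "data"
```

The function takes bytes rather than a path, so the caller decides how much of the file to read.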

Nobody

Nov 15, 2009, 1:56:01 PM

On Sun, 15 Nov 2009 13:49:54 +0100, Luca wrote:

> I was fairly sure this would not be a simple task. Searching only
> for ASCII-encoded text is not enough for me (my native language
> falls outside that encoding :-)
> Checking every single byte could be a good solution...
>
> I can start with the mimetypes module and, if the file has no
> extension, check it byte by byte, much as the "file" command does.
> Better: I can use the "file" command if it is available.

Another possible solution:

Universal Encoding Detector
Character encoding auto-detection in Python 2 and 3

http://chardet.feedparser.org/
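
A short usage sketch, assuming the third-party `chardet` package is installed and running on Python 3. The wrapper name `guess_encoding` is made up for this example; `chardet.detect()` takes bytes and returns a dictionary that includes `encoding` and `confidence` keys. The result is a statistical guess, so a low confidence value should be treated with suspicion.

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_encoding(sample):
    """Return (encoding, confidence) as guessed by chardet, or None.

    None means chardet is not installed or it could not make a guess.
    """
    if chardet is None:
        return None
    result = chardet.detect(sample)
    if result["encoding"] is None:
        return None
    return result["encoding"], result["confidence"]
```

For a large file it is usually enough to pass the first few kilobytes rather than the whole content.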