How to know if a file is a text file
Newsgroups: comp.lang.python
From: Luca Fabbri <l...@keul.it>
Date: Sat, 14 Nov 2009 17:02:29 +0100
Local: Sat, Nov 14 2009 11:02 am
Subject: How to know if a file is a text file
Hi all.

I'm looking for a way to load an arbitrary file from the
system and determine whether it is plain text.
The mimetypes module has some nice functions, but it does not
work for files without an extension, for example.

Any suggestions?
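
To illustrate the limitation mentioned above, here is a minimal sketch. The standard library's mimetypes.guess_type() looks only at the file name and never opens the file, so a name without a known extension gives it nothing to work with. The file names and the helper looks_like_text_by_name are made up for this example.

```python
import mimetypes

# guess_type() inspects only the name; the file does not even have to exist.
print(mimetypes.guess_type("notes.txt"))  # a type is guessed from ".txt"
print(mimetypes.guess_type("README"))     # no extension, nothing to guess from


def looks_like_text_by_name(filename):
    """Return True/False when the extension is known, None when it is not."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return None  # unknown: the caller has to inspect the contents instead
    return mime_type.startswith("text/")
```

A None result is the case this thread is about: the decision has to be made from the bytes in the file.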

--
-- luca


Newsgroups: comp.lang.python
From: Philip Semanchuk <phi...@semanchuk.com>
Date: Sat, 14 Nov 2009 12:51:30 -0500
Local: Sat, Nov 14 2009 12:51 pm
Subject: Re: How to know if a file is a text file

On Nov 14, 2009, at 11:02 AM, Luca Fabbri wrote:

> Hi all.

> I'm looking for a way to load an arbitrary file from the
> system and determine whether it is plain text.
> The mimetypes module has some nice functions, but it does not
> work for files without an extension, for example.

Hi Luca,
You have to define what you mean by a "text" file. It might seem
obvious, but it's not.

Do you mean just ASCII text? Or will you accept Unicode too? Unicode  
text can be more difficult to detect because you have to guess the  
file's encoding (unless it has a BOM; most don't).
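
As a rough sketch of the BOM case, the check below compares the first bytes of a file against the BOM constants from the standard codecs module. The helper name encoding_from_bom and the returned codec names are choices made for this example; when there is no BOM the function returns None and the encoding still has to be guessed some other way.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(head):
    """Return a codec name if `head` (the first bytes of a file) starts with a BOM."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM: the encoding is still unknown
```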

And do you need to verify that every single byte in the file is  
"text"? What if the file is 1GB, do you still want to examine every  
single byte?
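
One possible way to handle large files, sketched under the assumption that "text" means "decodes cleanly in a given encoding": read the file in chunks through an incremental decoder, stop at the first invalid sequence, and optionally give up after max_bytes. The function name decodes_as and its parameters are invented for this example. Note that decodability alone is a weak test, since a file made entirely of bytes below 0x80, including control characters, still decodes as UTF-8.

```python
import codecs


def decodes_as(path, encoding="utf-8", chunk_size=64 * 1024, max_bytes=None):
    """Return True if the file (or its first max_bytes) decodes as `encoding`.

    The file is read chunk by chunk, so memory use stays bounded even for
    a 1GB file, and reading stops at the first invalid byte sequence.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    seen = 0
    try:
        with open(path, "rb") as f:
            while max_bytes is None or seen < max_bytes:
                size = chunk_size if max_bytes is None else min(chunk_size, max_bytes - seen)
                chunk = f.read(size)
                if not chunk:
                    # Real end of file: a dangling partial character is an error.
                    decoder.decode(b"", final=True)
                    break
                seen += len(chunk)
                # The decoder buffers a character split across two chunks.
                decoder.decode(chunk)
    except UnicodeDecodeError:
        return False
    return True
```

When the max_bytes limit is reached, a character cut in half at the limit is deliberately not treated as an error, because the rest of it may simply lie beyond the sampled region.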

If you give us your own (specific!) definition of what "text" means,  
or perhaps a description of the problem you're trying to solve, then  
maybe we can help you better.

Cheers
Philip


Newsgroups: comp.lang.python
From: Nobody <nob...@nowhere.com>
Date: Sun, 15 Nov 2009 12:06:48 +0000
Local: Sun, Nov 15 2009 7:06 am
Subject: Re: How to know if a file is a text file

On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:
> I'm looking for a way to load an arbitrary file from the
> system and determine whether it is plain text.
> The mimetypes module has some nice functions, but it does not
> work for files without an extension, for example.

> Any suggestions?

You could use the "file" command. It's normally installed by default on
Unix systems, but you can get a Windows version from:

        http://gnuwin32.sourceforge.net/packages/file.htm
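
A minimal sketch of calling the command from Python, assuming a `file` executable on the PATH that supports the --brief and --mime-type options (reasonably recent versions do; very old ones may not). The helper names are invented for this example, and both return None when `file` cannot be used, so the caller can fall back to another method.

```python
import shutil
import subprocess


def file_mime_type(path):
    """Ask the external `file` utility for a MIME type; None if unavailable."""
    exe = shutil.which("file")
    if exe is None:
        return None  # `file` is not installed or not on the PATH
    result = subprocess.run(
        [exe, "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def is_text_according_to_file(path):
    """True/False from the reported MIME type, or None if `file` gave no answer."""
    mime = file_mime_type(path)
    return None if mime is None else mime.startswith("text/")
```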


Newsgroups: comp.lang.python
From: Chris Rebert <c...@rebertia.com>
Date: Sun, 15 Nov 2009 04:34:10 -0800
Local: Sun, Nov 15 2009 7:34 am
Subject: Re: How to know if a file is a text file

On Sun, Nov 15, 2009 at 4:06 AM, Nobody <nob...@nowhere.com> wrote:
> On Sat, 14 Nov 2009 17:02:29 +0100, Luca Fabbri wrote:

>> I'm looking for a way to load an arbitrary file from the
>> system and determine whether it is plain text.
>> The mimetypes module has some nice functions, but it does not
>> work for files without an extension, for example.

>> Any suggestions?

> You could use the "file" command. It's normally installed by default on
> Unix systems, but you can get a Windows version from:

FWIW, IIRC the heuristic `file` uses to check whether a file is text
or not is whether it contains any null bytes; if it does, it
classifies it as binary (i.e. not text).
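
The rule as described here (not necessarily what `file` actually does) is easy to sketch in Python. The function name has_nul_byte and the 8192-byte sample size are arbitrary choices for this example.

```python
def has_nul_byte(path, sample_size=8192):
    """Return True if a NUL byte appears in the first sample_size bytes."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(sample_size)
```

Treating has_nul_byte(path) as "binary" is a cheap first filter, but it would misclassify UTF-16 text, which contains NUL bytes for every ASCII character.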

Cheers,
Chris
--
http://blog.rebertia.com


Newsgroups: comp.lang.python
From: Luca <luca...@gmail.com>
Date: Sun, 15 Nov 2009 13:49:54 +0100
Local: Sun, Nov 15 2009 7:49 am
Subject: Re: How to know if a file is a text file

Thanks all.

I was quite sure that this was not a very simple task. Restricting the
check to ASCII is not enough for me (my native language uses characters
outside that encoding :-)
Checking every single byte can be a good solution...

I can start with the mimetypes module and, if the file has no
extension, check the bytes one by one, roughly as the "file" command does.
Better: I can use the "file" command if it is available.

Again: thanks all!

--
-- luca


Newsgroups: comp.lang.python
From: Nobody <nob...@nowhere.com>
Date: Sun, 15 Nov 2009 18:50:55 +0000
Local: Sun, Nov 15 2009 1:50 pm
Subject: Re: How to know if a file is a text file

On Sun, 15 Nov 2009 04:34:10 -0800, Chris Rebert wrote:
>>> I'm looking for a way to load an arbitrary file from the
>>> system and determine whether it is plain text.
>>> The mimetypes module has some nice functions, but it does not
>>> work for files without an extension, for example.

>>> Any suggestions?

>> You could use the "file" command. It's normally installed by default on
>> Unix systems, but you can get a Windows version from:

> FWIW, IIRC the heuristic `file` uses to check whether a file is text
> or not is whether it contains any null bytes; if it does, it
> classifies it as binary (i.e. not text).

"file" provides more granularity than that, recognising many specific
formats, both text and binary.

First, it uses "magic number" checks, checking for known signature bytes
(e.g. "#!" or "JFIF") at the beginning of the file. If those checks fail
it checks for common text encodings. If those also fail, it reports "data".

Also, UTF-16-encoded text is recognised as text, even though it may
contain a high proportion of NUL bytes.
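
The order of checks described above can be imitated in a few lines of Python. This is only a toy sketch of the idea, not the real algorithm: the signature table is a tiny hand-picked sample, the result strings are invented for this example, and the real "file" command consults a large magic database and knows many more encodings.

```python
import codecs

# (offset, signature bytes, description); a tiny illustrative table.
_SIGNATURES = [
    (0, b"#!", "script text"),
    (6, b"JFIF", "JPEG image data"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
]


def classify(data):
    """Classify the leading bytes of a file, roughly in the order described."""
    if not data:
        return "empty"
    # Step 1: "magic number" checks against known signatures.
    for offset, signature, description in _SIGNATURES:
        if data[offset:offset + len(signature)] == signature:
            return description
    # Step 2: common text encodings. UTF-16 is tried first because it
    # legitimately contains NUL bytes.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return "UTF-16 text"
        except UnicodeDecodeError:
            pass
    elif b"\x00" not in data:
        for encoding in ("ascii", "utf-8"):
            try:
                data.decode(encoding)
                return encoding.upper() + " text"
            except UnicodeDecodeError:
                continue
    # Step 3: nothing matched.
    return "data"
```

In practice `data` would be the first few kilobytes read from the file in binary mode; a sample cut in the middle of a multi-byte character would need extra care that this sketch leaves out.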


Newsgroups: comp.lang.python
From: Nobody <nob...@nowhere.com>
Date: Sun, 15 Nov 2009 18:56:01 +0000
Local: Sun, Nov 15 2009 1:56 pm
Subject: Re: How to know if a file is a text file

On Sun, 15 Nov 2009 13:49:54 +0100, Luca wrote:
> I was quite sure that this was not a very simple task. Restricting the
> check to ASCII is not enough for me (my native language uses characters
> outside that encoding :-)
> Checking every single byte can be a good solution...

> I can start with the mimetypes module and, if the file has no
> extension, check the bytes one by one, roughly as the "file" command does.
> Better: I can use the "file" command if it is available.

Another possible solution:

        Universal Encoding Detector
        Character encoding auto-detection in Python 2 and 3

        http://chardet.feedparser.org/
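
A minimal usage sketch, assuming the package is installed under the name chardet (it is third-party, not part of the standard library). The wrapper guess_encoding is invented for this example and returns None when the import fails.

```python
try:
    import chardet  # third-party package, assumed to be installed separately
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's guess for `data` (bytes), or None if chardet is missing."""
    if chardet is None:
        return None
    # detect() returns a dict that includes "encoding" and "confidence" keys.
    return chardet.detect(data)


sample = "Il file è di testo, perché no?".encode("utf-8")
print(guess_encoding(sample))
```

The result is a statistical guess, so the confidence value is worth checking, and short samples in particular can be misidentified.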

