I'm looking for a way to load an arbitrary file from the
system and determine whether it is plain text.
The mimetypes module has some nice functions, but, for example, it
does not work for files without an extension.
Any suggestions?
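
To illustrate the limitation, here is a minimal sketch using the
standard library's mimetypes.guess_type(), which looks only at the
file name. The helper name and the example file names are made up
for illustration.

```python
import mimetypes


def guess_from_name(name):
    """Return the MIME type guessed from the file name alone, or None."""
    # guess_type() never opens the file; it only inspects the extension.
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type


if __name__ == "__main__":
    # Hypothetical names: one with an extension, two without.
    for name in ("notes.txt", "README", "Makefile"):
        print(name, "->", guess_from_name(name))
```

Names without an extension give None, so the content itself has to
be inspected in those cases.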
--
-- luca
> Hi all.
>
> I'm looking for a way to load an arbitrary file from the
> system and determine whether it is plain text.
> The mimetypes module has some nice functions, but, for example, it
> does not work for files without an extension.
Hi Luca,
You have to define what you mean by a "text" file. It might seem
obvious, but it's not.
Do you mean just ASCII text? Or will you accept Unicode too? Unicode
text can be more difficult to detect because you have to guess the
file's encoding (unless it has a BOM; most don't).
And do you need to verify that every single byte in the file is
"text"? What if the file is 1 GB? Do you still want to examine every
single byte?
If you give us your own (specific!) definition of what "text" means,
or perhaps a description of the problem you're trying to solve, then
maybe we can help you better.
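
As one possible answer to those questions, here is a rough sketch
that reads only a bounded sample, looks for a BOM, and otherwise
tries ASCII and then UTF-8. The function names, the 64 KiB sample
size and the choice of encodings are assumptions made for
illustration, not a definition of "text".

```python
import codecs

# BOMs to look for, longest first so that the UTF-32 LE BOM is not
# mistaken for the UTF-16 LE BOM (which is its prefix).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def read_sample(path, size=64 * 1024):
    """Read at most `size` bytes, so a 1 GB file costs the same as a small one."""
    with open(path, "rb") as handle:
        return handle.read(size)


def _decodes(sample, encoding):
    """True if `sample` decodes; a multi-byte character cut off at the end is tolerated."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def classify_sample(sample):
    """Return a codec name suggested by the sample, or None if nothing fits.

    Note: the "ascii" answer only means every byte is below 0x80; control
    characters are not rejected here, which a stricter definition might do.
    """
    for bom, codec in _BOMS:
        if sample.startswith(bom):
            return codec  # trust the BOM without decoding the rest
    if _decodes(sample, "ascii"):
        return "ascii"
    if _decodes(sample, "utf-8"):
        return "utf-8"
    return None
```

Something like classify_sample(read_sample(path)) would then give a
first guess; whether that guess is good enough depends on the
definition you settle on.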
Cheers
Philip
You could use the "file" command. It's normally installed by default on
Unix systems, but you can get a Windows version from:
FWIW, IIRC the heuristic `file` uses to check whether a file is text
or not is whether it contains any null bytes; if it does, it
classifies it as binary (i.e. not text).
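
A NUL-byte check of that kind could be sketched as follows. This is
only the heuristic as described above, not the actual `file` source;
the function names and the 8192-byte sample size are made up.

```python
def sample_has_null(sample):
    """True if the byte string contains at least one NUL byte."""
    return b"\x00" in sample


def file_has_null(path, sample_size=8192):
    """Apply the NUL-byte check to the first `sample_size` bytes of a file."""
    with open(path, "rb") as handle:
        return sample_has_null(handle.read(sample_size))
```

Under this heuristic a file counts as binary when file_has_null()
returns True. Note that UTF-16 text usually contains NUL bytes, so
it would be classified as binary too.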
Cheers,
Chris
--
http://blog.rebertia.com
Thanks all.
I was quite sure that this would not be a simple task. Right now,
checking only for ASCII is not enough for me (my native language
falls outside that encoding :-)
Checking every single byte could be a good solution...
I can start with the mimetypes module and, if the file has no
extension, check the bytes one by one, much as the "file" command does.
Better: I can use the "file" command if it is available.
Again: thanks all!
--
-- luca
>>> I'm looking for a way to load an arbitrary file from the
>>> system and determine whether it is plain text.
>>> The mimetypes module has some nice functions, but, for example, it
>>> does not work for files without an extension.
>>>
>>> Any suggestion?
>>
>> You could use the "file" command. It's normally installed by default on
>> Unix systems, but you can get a Windows version from:
>
> FWIW, IIRC the heuristic `file` uses to check whether a file is text
> or not is whether it contains any null bytes; if it does, it
> classifies it as binary (i.e. not text).
"file" provides more granularity than that, recognising many specific
formats, both text and binary.
First, it uses "magic number" checks, looking for known signature bytes
(e.g. "#!" or "JFIF") at or near the beginning of the file. If those
checks fail, it checks for common text encodings. If those also fail, it
reports "data".
Also, UTF-16-encoded text is recognised as text, even though it may
contain a high proportion of NUL bytes.
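
The staged approach described above can be sketched in a few lines.
This is a toy illustration, not how `file` is actually implemented:
the signature table is tiny, and the labels, the set of "text" bytes
and the function names are all made up.

```python
import codecs

# (offset, signature, label). "JFIF" sits at offset 6 of a JPEG/JFIF file.
SIGNATURES = (
    (0, b"#!", "script text"),
    (6, b"JFIF", "JPEG image (JFIF)"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
)

# Printable ASCII plus a few common whitespace/control bytes.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x1B}


def _decodes(sample, encoding):
    """True if `sample` decodes; a character cut off at the end is tolerated."""
    try:
        codecs.getincrementaldecoder(encoding)().decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def describe(sample):
    """Classify the leading bytes of a file in three stages."""
    if not sample:
        return "empty"
    # Stage 1: magic numbers.
    for offset, signature, label in SIGNATURES:
        if sample.startswith(signature, offset):
            return label
    # Stage 2: common text encodings. UTF-16 is accepted on the strength
    # of its BOM even though it is typically full of NUL bytes.
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if _decodes(sample, "utf-16"):
            return "UTF-16 text"
    if all(byte in TEXT_BYTES for byte in sample):
        return "ASCII text"
    low_bytes_ok = all(byte >= 0x80 or byte in TEXT_BYTES for byte in sample)
    if low_bytes_ok and _decodes(sample, "utf-8"):
        return "UTF-8 text"
    # Stage 3: nothing matched.
    return "data"
```

For example, describe(b"#!/bin/sh\n") returns "script text", while a
sample of arbitrary bytes such as bytes(range(256)) ends up as "data".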
> I was quite sure that this would not be a simple task. Right now,
> checking only for ASCII is not enough for me (my native language
> falls outside that encoding :-)
> Checking every single byte could be a good solution...
>
> I can start with the mimetypes module and, if the file has no
> extension, check the bytes one by one, much as the "file" command does.
> Better: I can use the "file" command if it is available.
Another possible solution:
Universal Encoding Detector
Character encoding auto-detection in Python 2 and 3
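
A minimal sketch of how such a detector might be used, assuming the
library meant here is the third-party chardet package and that it is
installed. The helper guess_encoding and the 0.5 confidence
threshold are made up for illustration.

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_encoding(data, min_confidence=0.5):
    """Return chardet's best guess for `data` (bytes), or None.

    None means chardet is missing, found no encoding, or was not
    confident enough according to the arbitrary threshold.
    """
    if chardet is None:
        return None
    result = chardet.detect(data)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return None
```

A None result would then be a hint that the file is probably not
text, or at least not in an encoding the detector recognises.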