if (-T $filename) { print "file contains 'text' characters\n"; }
if (-B $filename) { print "file contains 'binary' characters\n"; }
Is there already a Python analog to these? I'm happy to write them on
my own if no such constructs currently exist, but before I start, I'd
like to make sure that I'm not "re-inventing the wheel".
By the way, here's what the perl docs say about these constructs. I'm
looking for something similar in Python:
... The -T and -B switches work as follows. The first block or so
... of the file is examined for odd characters such as strange control
... codes or characters with the high bit set. If too many strange
... characters (>30%) are found, it's a -B file; otherwise it's a -T
... file. Also, any file containing null in the first block is
... considered a binary file. [ ... ]
Thanks in advance for any suggestions.
--
Lloyd Zusman
l...@asfast.com
God bless you.
That's a butt ugly heuristic that will lead to lots of false positives
if your text happens to be UTF-16 encoded, or is non-English text encoded as UTF-8.
Christian
Pray tell, what are the circumstances that lead you to use such a
heuristic rather than a more definitive method?
Cheers,
Chris
--
http://blog.rebertia.com
He did say it was from Perl, the home of butt-ugly.
--
Cheers,
Simon B.
Assuming you're on a Unix-like system or can install Cygwin, the
standard response is to use the "file" command. It's *much* more
sophisticated.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/
"At Resolver we've found it useful to short-circuit any doubt and just
refer to comments in code as 'lies'. :-)"
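For anyone who wants to call the "file" command from Python, here is a minimal sketch. It assumes a "file" binary is on the PATH and that it understands the --brief and --mime-encoding options (recent versions do, but check yours). The helper name file_mime_encoding is made up for this example.

```python
import shutil
import subprocess


def file_mime_encoding(path):
    """Ask file(1) for the encoding of `path`; return None if that fails.

    Hypothetical helper: assumes the installed `file` supports
    --brief and --mime-encoding.
    """
    exe = shutil.which("file")
    if exe is None:
        return None  # no `file` command on this system
    result = subprocess.run(
        [exe, "--brief", "--mime-encoding", "--", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    # Output is a single word such as an encoding name.
    return result.stdout.strip()
```

On systems where this works, the returned string is typically an encoding name for text files and "binary" for everything else, so comparing the result against "binary" gives a rough analog of Perl's -B.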
While I agree with the others who have responded along the lines
of "that's a hinky heuristic", it's not too hard to write an analog:
import string

def is_text(fname,
        chars=set(string.printable),
        threshold=0.3,
        portion=1024, # read a kilobyte to find out
        mode='rb',
        ):
    assert portion is None or portion > 0
    assert 0 < threshold < 1
    f = file(fname, mode)
    if portion is None:
        content = iter(f)
    else:
        content = iter(f.read(int(portion)))
    f.close()
    total = valid = 0
    for c in content:
        if c in chars:
            valid += 1
        total += 1
    return (float(valid)/total) > threshold

def is_bin(*args, **kwargs):
    return not is_text(*args, **kwargs)

for fname in (
        '/usr/bin/abiword',
        '/home/tkc/.bashrc',
        ):
    print fname, is_text(fname)
It should allow you to tweak the charset to consider "text",
defaulting to string.printable, but adjust the "text" chars and
the file-reading-mode accordingly if you're using unicode text
(perhaps inverting the logic to make it a "binary chars" set).
You can also change the threshold from 0.3 (30%) to whatever you
need, and test the entire file or a subset of it (this defaults
to just reading the first K of the file, but if you pass None for
the portion, it will read the whole thing, even if it's a TB file).
-tkc
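The function above is written for Python 2 (file(), the print statement, one-character strings from a binary read). A possible Python 3 variation is sketched below. It is not a line-for-line port: it follows the quoted Perl description more literally, so a null byte in the first block, or more than 30% odd bytes, means "binary". The name looks_like_text and the decision to treat an empty file as text are assumptions made for this example.

```python
import string

# Bytes treated as "text": the ASCII codes of string.printable.
# This is an assumption; widen the set for other encodings.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def looks_like_text(fname, text_bytes=TEXT_BYTES, threshold=0.3, portion=1024):
    """Return True if the first `portion` bytes of `fname` look like text."""
    with open(fname, "rb") as f:
        block = f.read(portion)
    if not block:
        return True  # empty file: nothing odd found (a choice, not a rule)
    if 0 in block:
        return False  # any null byte in the first block means binary
    # Iterating over bytes yields ints in Python 3.
    odd = sum(1 for b in block if b not in text_bytes)
    return odd / len(block) <= threshold
```

With the default set, UTF-16 text and most non-English UTF-8 text will still be reported as binary, which is the false-positive problem raised earlier in the thread.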
And a hell of a lot of false negatives if the file is binary.
The way I've always seen it, a file is binary if it contains a single
binary character *anywhere* in the file.
--
Steven
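The stricter definition in the post above (one binary character anywhere makes the file binary) could be sketched in Python 3 as follows. What counts as a "binary character" is an assumption here: any byte outside the ASCII codes of string.printable. The name has_binary_byte is made up for the example.

```python
import string

# Assumption: a byte is "binary" if it is not an ASCII printable or
# whitespace character. Non-ASCII UTF-8 text fails this test.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def has_binary_byte(fname, text_bytes=TEXT_BYTES, chunk_size=64 * 1024):
    """Return True as soon as any byte outside `text_bytes` is found."""
    with open(fname, "rb") as f:
        # Read fixed-size chunks until read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            if any(b not in text_bytes for b in chunk):
                return True
    return False
```

This reads the whole file in the worst case, but it stops at the first offending byte and never holds more than one chunk in memory.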
> In article <mailman.2434.1265983...@python.org>,
> Lloyd Zusman <l...@asfast.com> wrote:
> >if (-T $filename) { print "file contains 'text' characters\n"; }
> >if (-B $filename) { print "file contains 'binary' characters\n"; }
>
> Assuming you're on a Unix-like system or can install Cygwin, the
> standard response is to use the "file" command. It's *much* more
> sophisticated.
Indeed, the ‘file’ command is an expected (though not standard) part of
most Unix systems, and its use frees us from the lies of filenames about
their contents.
The sophistication comes from an extensive database of heuristics —
filesystem attributes, “magic” content signatures, and parsing — that
are together known as the “magic database”. This database is maintained
along with the ‘file’ program, and made accessible through a C library
from the same code base called ‘magic’.
So, you don't actually need to use the ‘file’ command to access this
sophistication. Just about every program on a GNU system that wants to
display file types, such as the graphical file manager, will query the
‘magic’ library directly to get the file type.
The ‘file’ code base has for a while now also included a Python
interface to this library, importable as the module name ‘magic’.
Unfortunately it isn't registered at PyPI as far as I can tell. (There
are several projects using the name “magic” that implement something
similar, but they are nowhere near as sophisticated.)
On a Debian GNU/Linux system, you can install the ‘python-magic’ package
to get this installed. Otherwise, you can build it from the ‘file’ code
base <URL:http://www.darwinsys.com/file/>.
--
\ “I don't accept the currently fashionable assertion that any |
`\ view is automatically as worthy of respect as any equal and |
_o__) opposite view.” —Douglas Adams |
Ben Finney
>     if portion is None:
>         content = iter(f)
iter(f) will iterate over lines in the file, which doesn't fit with the
rest of the algorithm. Creating an iterator that iterates over
fixed-size file chunks (in this case of length 1) is where the
two-argument form of iter comes in handy:
content = iter(lambda: f.read(1), '')
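In Python 3, a file opened in binary mode returns bytes, so the sentinel has to be b'' rather than ''. A small sketch of the same idea, using an in-memory io.BytesIO as a stand-in for a real file (the helper name iter_bytes is made up for this example):

```python
import io
from functools import partial


def iter_bytes(f, size=1):
    """Yield successive `size`-byte chunks from binary file object `f`."""
    # iter(callable, sentinel) calls f.read(size) until it returns b"".
    return iter(partial(f.read, size), b"")


demo = io.BytesIO(b"a\x00b")
print(list(iter_bytes(demo)))  # [b'a', b'\x00', b'b']
```

Each item is a bytes object of length `size` (possibly shorter for the last one), not an int, so membership tests need a set of one-byte bytes objects rather than a set of integers.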