Filtering out non-readable characters

MKoool

Jul 15, 2005, 8:33:39 PM
I have a file with binary and ascii characters in it. I massage the
data and convert it to a more readable format, however it still comes
up with some binary characters mixed in. I'd like to write something
to just replace all non-printable characters with '' (I want to delete
non-printable characters).

I am having trouble figuring out an easy python way to do this... is
the easiest way to just write some regular expression that does
something like replace [^\p] with ''?

Or is it better to go through every character and do ord(character),
check the ascii values?

What's the easiest way to do something like this?

thanks
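
Both ideas in the question can be sketched directly. What follows is a minimal sketch for a current Python 3 interpreter (not the Python 2 used elsewhere in this thread), and it assumes that "printable" means the characters in string.printable. The function names are hypothetical. Python's re module has no \p class, so the pattern spells out the allowed characters instead.

```python
import re
import string

# Assumption: "printable" means membership in string.printable, i.e. the
# ASCII range 0x20-0x7e plus tab, newline, CR, vertical tab and form feed.
PRINTABLE = frozenset(string.printable)
NON_PRINTABLE_RE = re.compile(r'[^\x20-\x7e\t\n\r\x0b\x0c]')

def strip_with_regex(text):
    """Delete non-printable characters with one regex substitution."""
    return NON_PRINTABLE_RE.sub('', text)

def strip_with_loop(text):
    """Delete non-printable characters by testing each one in turn."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

sample = 'abc\x00\x01def\x7f\n'
print(repr(strip_with_regex(sample)))
print(repr(strip_with_loop(sample)))
```

Both functions should give the same result; which one reads better is mostly a matter of taste.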

Bengt Richter

Jul 15, 2005, 9:13:05 PM

>>> import string
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>> identity = ''.join([chr(i) for i in xrange(256)])
>>> unprintable = ''.join([c for c in identity if c not in string.printable])
>>>
>>> def remove_unprintable(s):
...     return s.translate(identity, unprintable)
...
>>> set(remove_unprintable(identity)) - set(string.printable)
set([])
>>> set(remove_unprintable(identity))
set(['\x0c', ' ', '$', '(', ',', '0', '4', '8', '<', '@', 'D', 'H', 'L', 'P', 'T', 'X', '\\', '`',
'd', 'h', 'l', 'p', 't', 'x', '|', '\x0b', '#', "'", '+', '/', '3', '7', ';', '?', 'C', 'G',
'K', 'O', 'S', 'W', '[', '_', 'c', 'g', 'k', 'o', 's', 'w', '{', '\n', '"', '&', '*', '.', '2',
'6', ':', '>', 'B', 'F', 'J', 'N', 'R', 'V', 'Z', '^', 'b', 'f', 'j', 'n', 'r', 'v', 'z', '~',
'\t', '\r', '!', '%', ')', '-', '1', '5', '9', '=', 'A', 'E', 'I', 'M', 'Q', 'U', 'Y', ']', 'a',
'e', 'i', 'm', 'q', 'u', 'y', '}'])
>>> sorted(set(remove_unprintable(identity))) == sorted(set(string.printable))
True
>>> sorted((remove_unprintable(identity))) == sorted((string.printable))
True

After that, to get clean file text, something like

cleantext = remove_unprintable(file('unclean.txt').read())

should do it. Or you should be able to iterate by lines something like (untested)

for uncleanline in file('unclean.txt'):
    cleanline = remove_unprintable(uncleanline)
    # ... do whatever with clean line

If there is something in string.printable that you don't want included, just use your own
string of printables. BTW,

>>> help(str.translate)
Help on method_descriptor:

translate(...)
    S.translate(table [,deletechars]) -> string

    Return a copy of the string S, where all characters occurring
    in the optional argument deletechars are removed, and the
    remaining characters have been mapped through the given
    translation table, which must be a string of length 256.

Regards,
Bengt Richter
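
For readers on Python 3, where str.translate no longer takes a 256-character table, the same recipe can be approximated on bytes. This is only a sketch under that assumption: the data is treated as raw bytes, and clean_file and its path argument are hypothetical names, not something from the post above.

```python
import string

# All 256 byte values: the Python 3 counterpart of `identity` above.
identity = bytes(range(256))
printable = string.printable.encode('ascii')
# Iterating over bytes yields ints, and `int in bytes` is a membership test.
unprintable = bytes(b for b in identity if b not in printable)

def remove_unprintable(data):
    """Return `data` (a bytes object) with non-printable bytes deleted."""
    return data.translate(identity, unprintable)

def clean_file(path):
    """Read `path` as raw bytes and return only its printable bytes."""
    with open(path, 'rb') as handle:
        return remove_unprintable(handle.read())

print(remove_unprintable(b'abc\x00\xffdef\n'))
```

Opening the file in binary mode avoids any decoding step, which seems the safer choice when the input is known to contain arbitrary binary data.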

Raymond Hettinger

Jul 16, 2005, 6:11:34 AM
Wow, that was the most thorough answer to a comp.lang.python question
since the Martellibot got busy in the search business.

Steven D'Aprano

Jul 16, 2005, 11:48:10 AM
On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:

> Bengt Richter wrote:
>> >>> identity = ''.join([chr(i) for i in xrange(256)])
>> >>> unprintable = ''.join([c for c in identity if c not in string.printable])
>
> And note that with Python 2.4, in each case the above square brackets
> are unnecessary (though harmless), because of the arrival of "generator
> expressions" in the language.

But to use generator expressions, wouldn't you need an extra pair of round
brackets?

eg identity = ''.join( ( chr(i) for i in xrange(256) ) )

with the extra spaces added for clarity.

That is, the brackets after join make the function call, and the nested
brackets make the generator. That, at least, is my understanding.

--
Steven
who is still using Python 2.3, and probably will be for quite some time


Peter Hansen

Jul 16, 2005, 4:42:58 PM
Steven D'Aprano wrote:
> On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:
>>Bengt Richter wrote:
>>
>>> >>> identity = ''.join([chr(i) for i in xrange(256)])
>>
>>And note that with Python 2.4, in each case the above square brackets
>>are unnecessary (though harmless), because of the arrival of "generator
>>expressions" in the language.
>
> But to use generator expressions, wouldn't you need an extra pair of round
> brackets?
>
> eg identity = ''.join( ( chr(i) for i in xrange(256) ) )

Come on, Steven. Don't tell us you didn't have access to a Python
interpreter to check before you posted:

c:\>python
Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
>>> ''.join(chr(c) for c in range(65, 91))
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

-Peter

Bengt Richter

Jul 16, 2005, 6:18:53 PM
On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen <pe...@engcorp.com> wrote:

>Bengt Richter wrote:
>> >>> identity = ''.join([chr(i) for i in xrange(256)])
>> >>> unprintable = ''.join([c for c in identity if c not in string.printable])
>
>And note that with Python 2.4, in each case the above square brackets
>are unnecessary (though harmless), because of the arrival of "generator
>expressions" in the language. (Bengt knows this already, of course, but
>his brain is probably resisting the reprogramming. :-) )
>
Thanks for the nudge. Actually, I know about generator expressions, but
at some point I must have misinterpreted some bug in my code to mean
that join in particular didn't like generator expression arguments,
and wanted lists. Actually it seems to like anything at all that can
be iterated to produce a sequence of strings. So I'm glad to find that
join is fine after all, and to get that misap[com?:-)]prehension
out of my mind ;-)

Regards,
Bengt Richter

George Sakkis

Jul 16, 2005, 6:55:42 PM
"Bengt Richter" <bo...@oz.net> wrote:

> >>> identity = ''.join([chr(i) for i in xrange(256)])
> >>> unprintable = ''.join([c for c in identity if c not in string.printable])

Or equivalently:

>>> identity = string.maketrans('','')
>>> unprintable = identity.translate(identity, string.printable)

George
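
In Python 3 the string.maketrans function is gone, but the same idiom appears to carry over to bytes. A rough sketch, assuming byte data and the string.printable definition of "printable":

```python
import string

# With empty arguments nothing is remapped, so the table is the identity.
identity = bytes.maketrans(b'', b'')
print(identity == bytes(range(256)))

# Delete the printable bytes from the identity table; what is left is
# exactly the set of non-printable bytes.
unprintable = identity.translate(identity, string.printable.encode('ascii'))
print(len(identity), len(unprintable))
```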


Peter Hansen

Jul 16, 2005, 7:01:50 PM
George Sakkis wrote:
> "Bengt Richter" <bo...@oz.net> wrote:
>> >>> identity = ''.join([chr(i) for i in xrange(256)])
>
> Or equivalently:
> >>> identity = string.maketrans('','')

Wow! That's handy, not to mention undocumented. (At least in the
string module docs.) Where did you learn that, George?

-Peter

Jp Calderone

Jul 16, 2005, 7:07:26 PM
to pytho...@python.org

Peter Hansen

Jul 16, 2005, 8:36:01 PM
Jp Calderone wrote:
> On Sat, 16 Jul 2005 19:01:50 -0400, Peter Hansen <pe...@engcorp.com> wrote:
>> George Sakkis wrote:
>>> >>> identity = string.maketrans('','')
>>
>> Wow! That's handy, not to mention undocumented. (At least in the
>> string module docs.) Where did you learn that, George?
>>
> http://python.org/doc/lib/node109.html

Perhaps I was unclear. I thought it would be obvious that I knew where
to find the docs for maketrans(), but that the particular behaviour
shown (i.e. arguments of '' having that effect) was undocumented in that
page.

-Peter

George Sakkis

Jul 16, 2005, 8:48:20 PM
"Peter Hansen" <pe...@engcorp.com> wrote:

Actually I first read about this in the Cookbook; there are two or three recipes related to
string.translate. As for string.maketrans, it doesn't do anything special for empty string
arguments:

maketrans( from, to)

Return a translation table suitable for passing to translate() or regex.compile(), that will map
each character in from into the character at the same position in to; from and to must have the same
length.

So if from and to are empty, maketrans will map zero characters, hence the identity. It's not the
only way to get the identity translation table by the way:
>>> string.maketrans('', '') == string.maketrans('a', 'a') == string.maketrans('hello', 'hello')
True

George
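
The point that maketrans only remaps the characters it is given, and leaves everything else alone, can be checked with a small sketch. It uses Python 3's bytes.maketrans as a stand-in for the Python 2 string.maketrans discussed here, and the sample data is made up for illustration.

```python
# Map a->x, b->y, c->z; every other byte value should map to itself.
table = bytes.maketrans(b'abc', b'xyz')

# Collect the byte values whose entry differs from the identity mapping.
changed = [chr(i) for i in range(256) if table[i] != i]
print(len(table), changed)

print(b'aabbcc-dd'.translate(table))
```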


Peter Hansen

Jul 16, 2005, 10:58:20 PM
George Sakkis wrote:

> "Peter Hansen" <pe...@engcorp.com> wrote:
>>>> Where did you learn that, George?
>
> Actually I first read about this in the Cookbook; there are two or three
> recipes related to string.translate. As for string.maketrans, it
> doesn't do anything special for empty string arguments: ...

I guess so. I was going to offer to suggest a new paragraph on that
usage for the docs, but as you and Jp both seem to think the behaviour
is obvious, I conclude "it's just me" so I suppose I shouldn't bother.

-Peter

George Sakkis

Jul 16, 2005, 11:28:02 PM
"Peter Hansen" <pe...@engcorp.com> wrote:

It's only obvious in the sense that _after_ you see this idiom, you can go back to the docs and
realize it's not doing something special; OTOH if you haven't seen it, it's not at all the obvious
solution to "how do I get the first 256 characters". So IMO it should be mentioned, given that
string.translate often operates on the identity table. I think a single sentence is adequate for the
reference docs.

George


Steven D'Aprano

Jul 17, 2005, 1:08:12 AM
On Sat, 16 Jul 2005 16:42:58 -0400, Peter Hansen wrote:

> Steven D'Aprano wrote:
>> On Sat, 16 Jul 2005 10:25:29 -0400, Peter Hansen wrote:
>>>Bengt Richter wrote:
>>>
>>>> >>> identity = ''.join([chr(i) for i in xrange(256)])
>>>
>>>And note that with Python 2.4, in each case the above square brackets
>>>are unnecessary (though harmless), because of the arrival of "generator
>>>expressions" in the language.
>>
>> But to use generator expressions, wouldn't you need an extra pair of round
>> brackets?
>>
>> eg identity = ''.join( ( chr(i) for i in xrange(256) ) )
>
> Come on, Steven. Don't tell us you didn't have access to a Python
> interpreter to check before you posted:

Er, as I wrote in my post:

"Steven
who is still using Python 2.3, and probably will be for quite some time"

So, no, I didn't have access to a Python interpreter running version 2.4.

I take it then that generator expressions work quite differently
than list comprehensions? The equivalent "implied delimiters" for a list
comprehension would be something like this:

>>> L = [1, 2, 3]
>>> L[ i for i in range(2) ]
  File "<stdin>", line 1
    L[ i for i in range(2) ]
           ^
SyntaxError: invalid syntax

which is a very different result from:

>>> L[ [i for i in range(2)] ]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: list indices must be integers

In other words, a list comprehension must have the [ ] delimiters to be
recognised as a list comprehension, EVEN IF the square brackets are there
from some other element. But a generator expression doesn't care where the
round brackets come from, so long as they are there: they can be part of
the function call.

I hope that makes sense to you.


--
Steven

Steven D'Aprano

Jul 17, 2005, 1:19:52 AM

I can't answer for George, but I also noticed that behaviour. I discovered
it by trial and error. I thought, oh what a nuisance that the arguments
for maketrans had to include all 256 characters, then I wondered what
error you would get if you left some out, and discovered that you didn't
get an error at all.

That actually disappointed me at the time, because I was looking for
behaviour where the missing characters weren't filled in, but I've come to
appreciate it since.


--
Steven


Steven D'Aprano

Jul 17, 2005, 1:23:58 AM
Replying to myself... this is getting to be a habit.

On Sun, 17 Jul 2005 15:08:12 +1000, Steven D'Aprano wrote:

> I hope that makes sense to you.

That wasn't meant as a snide little dig at Peter, and I'm sorry if anyone
reads it that way. I found myself struggling to explain simply the
different behaviour between list comps and generator expressions, and
couldn't be sure I was explaining myself as clearly as I wanted. It might
have been better if I had left off the "to you".

--
Steven

Peter Hansen

Jul 17, 2005, 6:49:50 AM
Steven D'Aprano wrote:
> On Sat, 16 Jul 2005 16:42:58 -0400, Peter Hansen wrote:
>>Come on, Steven. Don't tell us you didn't have access to a Python
>>interpreter to check before you posted:
>
> Er, as I wrote in my post:
>
> "Steven
> who is still using Python 2.3, and probably will be for quite some time"

Sorry, missed that! I don't generally notice signatures much, partly
because Thunderbird is smart enough to "grey them out" (the main text is
displayed as black, quoted material in blue, and signatures in a light
gray.)

I don't have a firm answer (though I suspect the language reference
does) about when "dedicated" parentheses are required around a generator
expression. I just know that, so far, they just work when I want them
to. Like most of Python. :-)

-Peter
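
As far as I can tell, the rule is that a generator expression may borrow the call's parentheses only when it is the sole argument; with any other argument present it needs its own pair. The sketch below checks this on a current Python 3 interpreter. The is_valid helper is a hypothetical name introduced just for the demonstration.

```python
# Sole argument: the call's parentheses double as the generator's.
letters = ''.join(chr(c) for c in range(65, 91))

# Explicit parentheses are always allowed.
same = ''.join((chr(c) for c in range(65, 91)))

# With a second argument the generator needs its own parentheses.
total = sum((n * n for n in range(4)), 10)

def is_valid(source):
    """Return True if `source` compiles as a single expression."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

print(letters == same, total)
print(is_valid("sum(n * n for n in range(4))"))
print(is_valid("sum(n * n for n in range(4), 10)"))
```

The last call should report that the unparenthesized form with a second argument does not compile.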

Steven Bethard

Jul 17, 2005, 5:42:08 PM
Bengt Richter wrote:
> Thanks for the nudge. Actually, I know about generator expressions, but
> at some point I must have misinterpreted some bug in my code to mean
> that join in particular didn't like generator expression arguments,
> and wanted lists.

I suspect this is bug 905389 [1]:

>>> def gen():
...     yield 1
...     raise TypeError('from gen()')
...
>>> ''.join([x for x in gen()])
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "<interactive input>", line 3, in gen
TypeError: from gen()
>>> ''.join(x for x in gen())
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
TypeError: sequence expected, generator found

I run into this every month or so, and have to remind myself that it
means that my generator is raising a TypeError, not that join doesn't
accept generator expressions...

STeVe

[1] http://www.python.org/sf/905389
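
On the Python 3 interpreters I am aware of, this masking no longer seems to happen: the TypeError raised inside the generator reaches the caller with its own message. A small sketch to check that on your own interpreter; join_error is a hypothetical helper, and gen() yields a string here so that the only TypeError in play is the deliberate one.

```python
def gen():
    yield 'a'
    raise TypeError('from gen()')

def join_error(make_iterable):
    """Return the TypeError message raised while joining, or None."""
    try:
        ''.join(make_iterable())
    except TypeError as exc:
        return str(exc)
    return None

print(join_error(lambda: [x for x in gen()]))
print(join_error(lambda: (x for x in gen())))
```

If both lines print the generator's own message, the interpreter is not rewriting the error.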

Raymond Hettinger

Jul 17, 2005, 9:32:28 PM
[George Sakkis]

> It's only obvious in the sense that _after_ you see this idiom, you can go back to the docs and
> realize it's not doing something special; OTOH if you haven't seen it, it's not at all the obvious
> solution to "how do I get the first 256 characters". So IMO it should be mentioned, given that
> string.translate often operates on the identity table. I think a single sentence is adequate for the
> reference docs.

For Py2.5, I've accepted a feature request to allow string.translate's
first argument to be None and then run as if an identity string had
been provided.


Raymond Hettinger
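
The None form described here survives in Python 3 as bytes.translate(None, delete). A minimal sketch of how the original problem looks with it, assuming byte data and string.printable as the definition of "printable":

```python
import string

printable = string.printable.encode('ascii')

# With None as the table no mapping is applied; bytes are only deleted.
# Deleting the printable bytes from 0..255 leaves the non-printable ones.
unprintable = bytes(range(256)).translate(None, printable)

def remove_unprintable(data):
    """Return `data` (bytes) without its non-printable bytes."""
    return data.translate(None, unprintable)

print(remove_unprintable(b'\x00ok\x80\n'))
```

No identity table has to be built or passed around, which is the convenience the feature request was after.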

Michael Ströder

Jul 18, 2005, 9:22:18 AM
Peter Hansen wrote:
> >>> ''.join(chr(c) for c in range(65, 91))
> 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

Wouldn't this be a candidate for making the Python language stricter?

Do you remember old Python versions treating l.append(n1,n2) the same
way as l.append((n1,n2))? I'm glad this is forbidden now.

Ciao, Michael.

Peter Hansen

Jul 18, 2005, 8:19:29 PM
Michael Ströder wrote:
> Peter Hansen wrote:
>
>> >>> ''.join(chr(c) for c in range(65, 91))
>> 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>
> Wouldn't this be a candidate for making the Python language stricter?

Why would that be true? I believe str.join() takes any iterable, and a
generator (as returned by a generator expression) certainly qualifies.

-Peter

Robert Kern

Jul 19, 2005, 12:51:16 AM
to pytho...@python.org

That wasn't a syntax issue; it was an API issue. list.append() allowed
multiple arguments and interpreted them as if they were a single tuple.
That was confusing and unnecessary.

Allowing generator expressions to forgo extra parentheses where they
aren't required is something different, and in my opinion, a good thing.

--
Robert Kern
rk...@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
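
The distinction drawn above can be seen on a current interpreter: list.append is strict about taking exactly one argument, so a tuple has to be passed explicitly. This is a small sketch with a hypothetical helper name, not a reconstruction of the old behaviour.

```python
def append_error(*args):
    """Return the TypeError message from list.append(*args), or None."""
    items = []
    try:
        items.append(*args)
    except TypeError as exc:
        return str(exc)
    return None

print(append_error((1, 2)))  # one argument, a tuple: accepted
print(append_error(1, 2))    # two arguments: rejected with a TypeError
```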

Ross

Jul 19, 2005, 4:28:31 AM
On 15 Jul 2005 17:33:39 -0700, "MKoool" <mo...@terabolic.com> wrote:

Easiest way is to open the file with EdXor (freeware editor), select all,
Format > Wipe Non-Ascii.

Ok it's not python, but it's the easiest.

Steven D'Aprano

Jul 19, 2005, 10:03:36 AM

1 Open Internet Explorer
2 Go to Google
3 Search for EdXor
4 Browser locks up
5 Force quit with ctrl-alt-del
6 Run anti-virus program
7 Download new virus definitions
8 Remove viruses
9 Run anti-spyware program
10 Download new definitions
11 Remove spyware
12 Open Internet Explorer
13 Download Firefox
14 Install Firefox
15 Open Firefox
16 Go to Google
17 Search for EdXor
18 Download application
19 Run installer
20 Reboot
21 Run EdXor
22 Open file
23 Select all
24 Select Format>Wipe Non-ASCII
25 Select Save
26 Quit EdXor

Hmmm. Perhaps not *quite* the easiest way :-)

--
Steven.

Peter Hansen

Jul 16, 2005, 10:25:29 AM
Bengt Richter wrote:
> >>> identity = ''.join([chr(i) for i in xrange(256)])
> >>> unprintable = ''.join([c for c in identity if c not in string.printable])

And note that with Python 2.4, in each case the above square brackets
are unnecessary (though harmless), because of the arrival of "generator
expressions" in the language. (Bengt knows this already, of course, but
his brain is probably resisting the reprogramming. :-) )

-Peter

Bengt Richter

Jul 24, 2005, 1:03:31 AM

My news service has been timing out on postings, but I had a couple that
made reference to that ;-) Maybe this post will get through.

Regards,
Bengt Richter

Bengt Richter

Jul 24, 2005, 1:03:56 AM

I would suggest changing
"""
maketrans(from, to)

Return a translation table suitable for passing to translate() or regex.compile(),
that will map each character in from into the character at the same position in to;
from and to must have the same length.
"""

to something that would make the idiom more easily inferrable, e.g.,

"""
maketrans(from, to)

Return a translation table suitable for passing to translate() or regex.compile(),
that will map each character in from into the character at the same position in to,
while leaving all characters other than those in from unchanged;
from and to must have the same length.
"""

Meanwhile, if my python feature request #1193128 on sourceforge gets implemented,
we'll be able to write s.translate(None, badchars) instead of having to build
an identity table to pass as the first argument. Maybe 2.5? (Not being pushy ;-)

Regards,
Bengt Richter

Bengt Richter

Jul 24, 2005, 1:04:09 AM

That must have been it, thanks.

Regards,
Bengt Richter
