
Regex to remove non printable characters


Larry

Dec 21, 2007, 9:54:33 PM
Hi peeps,

I'd like to remove all characters with ascii values > 127 from a
string...that's to say i'd like to remove non printable chars...

is the following fine?

my $input =~ s/[^ -~]+//g;

thanks ever so much!

Glenn Jackman

Dec 21, 2007, 10:19:19 PM
At 2007-12-21 09:54PM, "Larry" wrote:
> Hi peeps,
>
> I'd like to remove all characters with ascii values > 127 from a
> string...that's to say i'd like to remove non printable chars...

You might want:
$string =~ s/\P{IsPrint}//g;

See perldoc perlre

--
Glenn Jackman
"You can only be young once. But you can always be immature." -- Dave Barry

Jürgen Exner

Dec 21, 2007, 11:04:08 PM
On Sat, 22 Dec 2007 03:54:33 +0100, Larry <dontme...@got.it> wrote:
> I'd like to remove all characters with ascii values > 127 from a

ASCII is a 7 bit encoding system where sometimes the eighth bit is used as
a parity bit. There are no ASCII characters > 127, therefore your request
doesn't make sense.

>string...that's to say i'd like to remove non printable chars...

In case you are not talking about ASCII but about e.g. Windows-1252 or
ISO-Latin-x or any of the dozen other code pages that share the lower 128
characters with ASCII then please be advised that the vast majority of
those characters > 127 _ARE_ printable, at least in your typical commonly
used code pages.

The non-printable characters can be found in the lower part from 0x00 to
0x1F (plus DEL at 0x7F), no matter if ASCII or Windows-1252 or ISO-Latin-x
or many, many others.

Therefore your request makes even less sense. Maybe you want to clarify
first what you are talking about?

>is the following fine?
>
>my $input =~ s/[^ -~]+//g;

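Apart from the stray my(), that removes everything outside the range from
space (0x20) to tilde (0x7E), i.e. all control characters (including tab,
LF and CR) and everything above 0x7E. Whether that is what you want depends
on whether you mean non-printable or non-ASCII.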

jue

John W. Krahn

Dec 21, 2007, 11:18:19 PM
Larry wrote:
>
> I'd like to remove all characters with ascii values > 127 from a
> string

$input =~ s/[^[:ascii:]]+//g;


>...that's to say i'd like to remove non printable chars...

$input =~ s/[^[:print:]]+//g;


> is the following fine?
>
> my $input =~ s/[^ -~]+//g;

my() creates a new variable with no contents so there is nothing for the
substitution operator to remove.

$ perl -wle'my $input =~ s/[^ -~]+//g;'
Use of uninitialized value in substitution (s///) at -e line 1.
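
A minimal sketch of the fix, assuming the string already lives in a variable: declare and initialise first, then apply the substitution. The sample data and the name $clean are made up for illustration.

```perl
use strict;
use warnings;

# Sample data, purely illustrative: e-acute, a tab and a newline mixed in.
my $input = "caf\x{E9}\tbar\n";

# Copy into a new variable, then strip everything outside 0x20-0x7E
# from the copy. The original $input stays untouched.
(my $clean = $input) =~ s/[^ -~]+//g;

print "$clean\n";   # prints "cafbar"
```

The parentheses around the declaration matter: they make the substitution act on the freshly assigned copy rather than on an empty new variable.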

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Larry

Dec 21, 2007, 11:53:18 PM
In article <fe0bj.9527$wy2.5863@edtnps90>,

"John W. Krahn" <du...@example.com> wrote:

> $input =~ s/[^[:ascii:]]+//g;
>
>
> >...that's to say i'd like to remove non printable chars...
>
> $input =~ s/[^[:print:]]+//g;

is this fine?

$input =~ tr/\x80-\xFF//d;

Dr.Ruud

Dec 22, 2007, 5:00:16 AM
Larry schreef:
> John W. Krahn:

> [remove non printable chars]


> is this fine?
> $input =~ tr/\x80-\xFF//d;

No. How about chr(0x00)..chr(0x1F)?
And characters > "\x{FF}"?
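
A quick sketch of what those two questions point at, using a made-up sample string: tr/\x80-\xFF//d deletes only code points 0x80 to 0xFF, so control characters below 0x20 and anything above 0xFF stay in place.

```perl
use strict;
use warnings;

# Sample string: NUL, "a", e-acute (0xE9), euro sign (U+20AC), BEL, "b".
my $input = "\x00a\x{E9}\x{20AC}\x07b";

# Delete code points 0x80-0xFF from a copy.
(my $out = $input) =~ tr/\x80-\xFF//d;

# Report the code points that survived.
print join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $out)), "\n";
# prints "U+0000 U+0061 U+20AC U+0007 U+0062"
```

Only the e-acute is gone; NUL, BEL and the euro sign all survive.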

--
Affijn, Ruud

"Gewoon is een tijger."

Jürgen Exner

Dec 22, 2007, 7:47:01 AM

Depends what you are looking for (you still didn't clarify).
It will remove non-ASCII characters in the typical 8-bit encodings.
It will _NOT_ remove non-printable characters.

Maybe you should make up your mind and let us know _which_ of these two
you are actually trying to do.

jue

Petr Vileta

Dec 22, 2007, 9:26:56 AM
Maybe this will do it:

$input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x80-\xFF]//g;

--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)

Please reply to <petr AT practisoft DOT cz>

John W. Krahn

Dec 23, 2007, 3:23:07 AM
On Sat, 22 Dec 2007 05:53:18 +0100
Larry <dontme...@got.it> wrote:

Your subject line says you want a regex. The tr/// operator doesn't use regular expressions.

Jürgen Exner

Dec 23, 2007, 12:45:07 PM
"John W. Krahn" <kra...@telus.net> wrote:

>Larry <dontme...@got.it> wrote:
>> is this fine?
>>
>> $input =~ tr/\x80-\xFF//d;
>
>Your subject line says you want a regex. The tr/// operator doesn't use regular expressions.

Good point. However, if you are splitting hairs, then let's be accurate:
Regular expressions match a string but they never remove anything as
requested by the OP. Therefore taking the OP's question literally is
nonsensical in the first place.

And he still didn't tell us if he wanted to remove non-ASCII or
non-printable, two very different categories which have no relationship with
each other whatsoever.

jue

Larry

Dec 23, 2007, 8:29:45 PM
In article <p57tm3dqffe88o8gc...@4ax.com>,
Jürgen Exner <jurg...@hotmail.com> wrote:

> And he still didn't tell us if he wanted to remove non-ASCII or
> non-printable, two very different categories which have no relationship with
> each other whatsoever.

I have yet to understand the differences...in the meanwhile I think I'll
settle for the following:

tr/\x80-\xFF//d;

thanks

Jürgen Exner

Dec 23, 2007, 10:52:30 PM
Larry <dontme...@got.it> wrote:

>In article <p57tm3dqffe88o8gc...@4ax.com>,
> Jürgen Exner <jurg...@hotmail.com> wrote:
>
>> And he still didn't tell us if he wanted to remove non-ASCII or
>> non-printable, two very different categories which have no relationship with
>> each other whatsoever.
>
>I have yet to understand the differences..

Well, there is no commonality at all. It's two totally different things,
like colour and texture. A specific object can be green and smooth or green
and rough or blue and rough or blue and smooth or whatever combination you
can imagine.

Non-printable characters are characters that don't have a glyph assigned to
them and therefore cannot be printed. Another word for them is control
characters and they include e.g. line feed, carriage return, delete,
backspace, end-of-transmission, header start, etc., etc.
In ASCII and most other modern code pages the non-printable characters are
in the range 0x00 to 0x1F and 0x7F.


Non-ASCII characters on the other hand are characters that are not included
in the 7-bit ASCII encoding at all like e.g. symbols, graphics, and what
some people refer to as 'extended' characters like German umlauts, French
and Spanish accented characters, Scandinavian extended characters, but also
Greek, Cyrillic, Arabic, Chinese, ... characters. Basically anything you can
imagine that is not typically used in the English language or that's not on
a US typewriter.
That's not surprising because as the name suggests ASCII is an _AMERICAN_
Standard Code for Information Interchange and Lyndon B. Johnson surely
didn't care about the rest of the world when he mandated its use back in
1968.

For e.g. ISO-Latin-1 those non-ASCII characters would be
     0    1 2 3 4 5 6 7 8 9 A B C D   E F
Ax   NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
Bx   °    ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½   ¾ ¿
Cx   À    Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í   Î Ï
Dx   Ð    Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý   Þ ß
Ex   à    á â ã ä å æ ç è é ê ë ì í   î ï
Fx   ð    ñ ò ó ô õ ö ÷ ø ù ú û ü ý   þ ÿ

However almost all(*2) non-ASCII characters do have a glyph and obviously
they can be printed very well(*1), just see the list above.
Or do you really think I would just omit the second letter of my first name
'Jürgen' when printing it?

*1: You could argue whether NBSP and in particular SHY are printable or
not because they have an additional semantic on top of their (blank and
dash, respectively) glyphs.
*2: There are exceptions in the code pages for more exotic languages
(Arabic, Thai, Tamil, ...), where some characters may not have a glyph
assigned but instead they alter the appearance and/or the meaning of
preceding or following characters.
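
The independence of the two categories can be sketched with Perl's POSIX classes. This is only an illustration: classify() is a hypothetical helper, the four sample characters are arbitrary, and the /u modifier (Perl 5.14 or later) is an assumption made here so that code points above 0x7F are judged by Unicode rules.

```perl
use strict;
use warnings;

# Classify a single character along both axes independently.
# classify() is a hypothetical helper, not something from the thread.
sub classify {
    my ($ch) = @_;
    my $ascii = $ch =~ /\A[[:ascii:]]\z/u ? 'ASCII'     : 'non-ASCII';
    my $print = $ch =~ /\A[[:print:]]\z/u ? 'printable' : 'non-printable';
    return "$ascii, $print";
}

# "A", line feed, e-acute (0xE9) and the C1 control NEL (0x85).
for my $ch ("A", "\n", "\x{E9}", "\x{85}") {
    printf "U+%04X: %s\n", ord($ch), classify($ch);
}
# prints:
# U+0041: ASCII, printable
# U+000A: ASCII, non-printable
# U+00E9: non-ASCII, printable
# U+0085: non-ASCII, non-printable
```

All four combinations occur, which is the colour and texture point in code form.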

jue

Larry

Dec 24, 2007, 1:55:39 AM
In article <ib8um3h9dgpeg3q6i...@4ax.com>,
Jürgen Exner <jurg...@hotmail.com> wrote:

> Well, there is no commonality at all. It's two totally different things,
> like colour and texture. A specific object can be green and smooth or green
> and rough or blue and rough or blue and smooth or whatever combination you
> can imagine.

ok...to me those are ascii printable chars:

#!/usr/bin/perl

use strict;
use warnings;

for my $k (33 .. 126)
{
    print "$k => " . chr($k) . "\n";
}

plus chr(10) and chr(13)

Jürgen Exner

Dec 24, 2007, 2:19:17 AM
Larry <dontme...@got.it> wrote:
>ok...to me those are ascii printable chars:
>
>#!/usr/bin/perl
>
>use strict;
>use warnings;
>
>for my $k (33 .. 126)
>{
> print "$k => " . chr($k) . "\n";
>}

Agreed, those characters are the intersection of the set of printable
characters and the set of ASCII characters, except that commonly the space
character 0x20 is considered a printable character, too. It just has a blank
glyph.

>plus chr(10) and chr(13)

This however conflicts with customary understanding. From "perldoc perlre"
on POSIX character classes:

print
Any alphanumeric or punctuation (special) character or space.

While on the other hand

cntrl
Any control character. Usually characters that don't produce output
as such but instead control the terminal somehow: for example
newline and backspace are control characters. All characters with
ord() less than 32 are most often classified as control characters
(assuming ASCII, the ISO Latin character sets, and Unicode).

It appears LF and CR are control characters, not printable characters. After
all why should LF be a printable character but its cousin FF not?
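
A small sketch that asks Perl directly how it classifies these characters. posix_classes() is a hypothetical helper, and the choice of sample code points (LF, FF, CR, space, "A") is mine.

```perl
use strict;
use warnings;

# Return the names of the POSIX classes (out of cntrl, print, graph)
# that the character with the given code point belongs to.
# posix_classes() is a hypothetical helper for illustration only.
sub posix_classes {
    my ($code) = @_;
    my $ch = chr $code;
    my @classes;
    push @classes, 'cntrl' if $ch =~ /\A[[:cntrl:]]\z/;
    push @classes, 'print' if $ch =~ /\A[[:print:]]\z/;
    push @classes, 'graph' if $ch =~ /\A[[:graph:]]\z/;
    return join ', ', @classes;
}

printf "0x%02X: %s\n", $_, posix_classes($_) for 0x0A, 0x0C, 0x0D, 0x20, 0x41;
# prints:
# 0x0A: cntrl
# 0x0C: cntrl
# 0x0D: cntrl
# 0x20: print
# 0x41: print, graph
```

LF, FF and CR land in the same class, and the space is printable without being a graphic character.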

jue

Larry

Dec 24, 2007, 5:56:23 AM
In article <admum3105kdr92500...@4ax.com>,
Jürgen Exner <jurg...@hotmail.com> wrote:

> Agreed, those characters are the intersection of the set of printable
> characters and the set of ASCII characters, except that commonly the space
> character 0x20 is considered a printable character, too. It just has a blank
> glyph.

by the way, I'd like to get rid of 0x00 also! The thing is that I'm
coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
0x10 and those ranging from 0x21 to 0x7E

is that doable?

thanks

Larry

Dec 24, 2007, 6:00:33 AM
In article <dontmewithme-6700...@news.tin.it>,
Larry <dontme...@got.it> wrote:

> 0x20 0x13
> 0x10 and those ranging from 0x21 to 0x7E

I'm hopeless at hex values...let's say:

chr(10)
chr(13)
chr(32) to chr(126)

thanks
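
One way to express that keep-list, as a sketch only: decimal 10 and 13 are 0x0A and 0x0D in hex (not 0x10 and 0x13), and 32 to 126 is 0x20 to 0x7E. keep_basic() is a hypothetical name, and tr/// with /c and /d is my choice here; a negated character class with s/// would do the same job.

```perl
use strict;
use warnings;

# Keep only LF (0x0A), CR (0x0D) and the printable ASCII range 0x20-0x7E.
# keep_basic() is a hypothetical name, not something from the thread.
sub keep_basic {
    my ($s) = @_;
    # /c complements the search list, /d deletes every matching character,
    # so everything NOT in the list is removed.
    $s =~ tr/\x0A\x0D\x20-\x7E//cd;
    return $s;
}

# NUL, BEL, i-diaeresis (0xEF) and DEL (0x7F) are dropped; CR and LF stay.
print keep_basic("ok\x00\x07 line 1\r\nna\x{EF}ve\x7F\n");
```

Because the list is complemented, code points above 0xFF are removed as well.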

Larry

Dec 24, 2007, 6:11:53 AM
In article <dontmewithme-4475...@news.tin.it>,
Larry <dontme...@got.it> wrote:

> I'm hopeless at hex values...let's say:
>
> chr(10)
> chr(13)
> chr(32) to chr(126)
>
> thanks

well, for the moment I'll go along with keeping those ranging from 0x20
to 0x7E ... so that I don't have to chomp and all...
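
For what it is worth, the range 0x20 to 0x7E is exactly what the character class in the first post spells as " -~" (space to tilde). A small sketch that checks this over the code points 0 to 255; the variable names are made up.

```perl
use strict;
use warnings;

# Code points where the two spellings of the negated class disagree.
my @differ = grep {
    my $ch = chr $_;
    ($ch =~ /[^ -~]/ ? 1 : 0) != ($ch =~ /[^\x20-\x7E]/ ? 1 : 0)
} 0 .. 255;

# Code points that s/[^ -~]+//g would leave in place.
my @kept = grep { chr($_) !~ /[^ -~]/ } 0 .. 255;

printf "differences: %d, kept: %d (0x%02X to 0x%02X)\n",
    scalar @differ, scalar @kept, $kept[0], $kept[-1];
# prints "differences: 0, kept: 95 (0x20 to 0x7E)"
```

So s/[^ -~]+//g, applied to an existing variable, already keeps just that range; note that it also drops LF and CR.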

Jürgen Exner

Dec 24, 2007, 2:04:35 PM
Larry <dontme...@got.it> wrote:

> Jürgen Exner <jurg...@hotmail.com> wrote:
> The thing is that I'm
>coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
>0x10 and those ranging from 0x21 to 0x7E

Thank you for calling me a person with a bad char.

*PLONK*

jue

Jürgen Exner

Dec 24, 2007, 2:16:36 PM
Larry <dontme...@got.it> wrote:

What a concept!
I am giving up.

jue

Larry

Dec 24, 2007, 2:33:22 PM
In article <tb10n3tb1f61d5v99...@4ax.com>,
Jürgen Exner <jurg...@hotmail.com> wrote:

> What a concept!
> I am giving up.

please don't! it's xmas time after all...

i need this to get values (commands) from CGI->param and need to get rid
of those chars

Larry

Dec 24, 2007, 7:39:26 PM
In article <fkj72b$1n26$1...@ns.felk.cvut.cz>,
"Petr Vileta" <sto...@practisoft.cz> wrote:

> $input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x80-\xFF]//g;

thank you so much ... btw, what is chr (127) ??

I think I'll make it this way:

$input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]//g;

thanks
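
For reference, chr(127) is DEL, a control character, so extending the range to start at \x7F is consistent with dropping the other controls. Below is a sketch of how the final substitution might be wrapped in the sub mentioned earlier; the name strip_bad_chars() and the handling of undefined input are assumptions, not something stated in the thread.

```perl
use strict;
use warnings;

# Remove control characters (except LF 0x0A and CR 0x0D), DEL (0x7F)
# and everything from 0x80 to 0xFF.
# strip_bad_chars() is a hypothetical name; the undef check is an assumption.
sub strip_bad_chars {
    my ($input) = @_;
    return '' unless defined $input;    # e.g. a parameter that was not sent
    $input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]//g;
    return $input;
}

my $clean = strip_bad_chars("run\x7F\x00 now\r\n\x{E9}");
print length($clean), "\n";    # prints 9
```

Code points above 0xFF are not covered by this class and would pass through unchanged.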

comp.llang.perl.moderated

Dec 25, 2007, 12:47:19 AM
On Dec 24, 11:04 am, Jürgen Exner <jurge...@hotmail.com> wrote:
> Larry <dontmewit...@got.it> wrote:

> > Jürgen Exner <jurge...@hotmail.com> wrote:
> > The thing is that I'm
> >coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
> >0x10 and those ranging from 0x21 to 0x7E
>
> Thank you for calling me a person with a bad char.
>
> *PLONK*
>
Wow, I thought for sure you'd finish with a
smiley after that wonderful flash of wit....
Of course, maybe you were sitting in a bad
"char" :)

--
Charles DeRykus
