Regexp to remove spaces

Sftriman

Dec 19, 2009, 10:13:58 PM

to begi...@perl.org

I use this series of regexp all over the place to clean up lines of
text:

$x=~s/^\s+//g;
$x=~s/\s+$//g;
$x=~s/\s+/ /g;

in that order. Note that the final one replaces \s+ with a single space.

Basically, it (1) removes all leading whitespace, (2) removes all
trailing whitespace, and (3) replaces each run of whitespace with a
single space [which, at this point, can only occur in the interior of
the string].

Is there a handy way to do this in one regexp? And a fast one? I've been
using Devel::NYTProf to study code timing and see that some regexps,
especially mine, can be CPU-intensive.

Thanks!
David
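
For reference, the three steps described above can be wrapped in a small
sub. This is only a minimal sketch: the name clean_ws and the sample
string are made up for illustration, and the /g flag is left off the two
anchored substitutions because an anchored pattern without /m can match
at most once.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): returns a cleaned copy.
sub clean_ws {
    my ($x) = @_;
    $x =~ s/^\s+//;     # (1) strip leading whitespace
    $x =~ s/\s+$//;     # (2) strip trailing whitespace
    $x =~ s/\s+/ /g;    # (3) collapse each remaining run to one space
    return $x;
}

print '[', clean_ws("  \tfoo   bar\t baz \n"), "]\n";   # prints [foo bar baz]
```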

Erez Schatz

Dec 20, 2009, 8:27:58 AM

to sftriman, begi...@perl.org

2009/12/20 sftriman <dal...@gmail.com>:

> I use this series of regexp all over the place to clean up lines of
> text:
>
> $x=~s/^\s+//g;
> $x=~s/\s+$//g;
> $x=~s/\s+/ /g;
>

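You can probably combine the first two into $x=~s/^(\s+)|(\s+)$//g;
(you would still need the third one to collapse the interior whitespace).

But I don't think it will use any less CPU than the three-regex version,
the nature of Perl's regex engine being what it is.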

--
Erez

"The government forgets that George Orwell's 1984 was a warning, and
not a blueprint"
http://www.nonviolent-conflict.org/ -- http://www.whyweprotest.org/

Shawn H Corey

Dec 20, 2009, 8:54:08 AM

to sftriman, begi...@perl.org

tr/// is generally faster than s///

$text =~ tr{\t}{ };
$text =~ tr{\n}{ };
$text =~ tr{\r}{ };
$text =~ tr{\f}{ };
$text =~ tr{ }{ }s;

--
Just my 0.00000002 million dollars worth,
Shawn

Programming is as much about organization and communication
as it is about coding.

I like Perl; it's the only language where you can bless your
thingy.

John W. Krahn

Dec 20, 2009, 11:33:54 AM

to Perl Beginners

Shawn H Corey wrote:
> sftriman wrote:
>> I use this series of regexp all over the place to clean up lines of
>> text:
>>
>> $x=~s/^\s+//g;
>> $x=~s/\s+$//g;
>> $x=~s/\s+/ /g;
>>
>> in that order, and note the final one replace \s+ with a single space.
>>
>> Basically, it's (1) remove all leading space, (2) remove all trailing
>> space,
>> and (3) replace all multi-space with a single space [which, at this
>> point,
>> should only occur on interior characters].
>>
>> Is there a handy way to do this in one regexp? And, fast? I've been
>> using Devel::NYTProf to study code timing and see that some regexp,
>> especially mine, can be CPU expensive/intensive.
>

> tr/// is generally faster than s///
>
> $text =~ tr{\t}{ };
> $text =~ tr{\n}{ };
> $text =~ tr{\r}{ };
> $text =~ tr{\f}{ };
> $text =~ tr{ }{ }s;

That can be reduced to:

$text =~ tr/ \t\n\r\f/ /s;

But that still doesn't remove leading and trailing whitespace so add two
more lines:

$text =~ tr/ \t\n\r\f/ /s;
$text =~ s/\A //;
$text =~ s/ \z//;

John
--
The programmer is fighting against the two most
destructive forces in the universe: entropy and
human stupidity. -- Damian Conway
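
One caveat worth checking with the tr/// version: the list " \t\n\r\f"
is narrower than \s, which also matches Unicode whitespace such as
U+2003 (EM SPACE). The sketch below compares the two approaches on a
plain string and on one containing EM SPACE characters. The helper names
squash_tr and squash_re and both sample strings are invented for
illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the tr/// version from this thread.
sub squash_tr {
    my ($text) = @_;
    $text =~ tr/ \t\n\r\f/ /s;
    $text =~ s/\A //;
    $text =~ s/ \z//;
    return $text;
}

# Hypothetical wrapper around the \s-based version, for comparison.
sub squash_re {
    my ($text) = @_;
    s/\s+\z//, s/\A\s+//, s/\s+/ /g for $text;
    return $text;
}

my $plain = " \tfoo \n bar\r\n";
my $wide  = "foo\x{2003}\x{2003}bar";    # two EM SPACE characters

# ASCII-only whitespace: both approaches agree.
print squash_tr($plain) eq squash_re($plain) ? "same\n" : "differ\n";

# EM SPACE is matched by \s but is not in the tr/// list.
print squash_tr($wide) eq squash_re($wide) ? "same\n" : "differ\n";
```

If the input is known to contain only ASCII whitespace, the difference
does not matter and the tr/// version is a fair replacement.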

Shawn H Corey

Dec 20, 2009, 12:45:45 PM

to John W. Krahn, Perl Beginners

John W. Krahn wrote:
> That can be reduced to:
>
> $text =~ tr/ \t\n\r\f/ /s;
>
> But that still doesn't remove leading and trailing whitespace so add two
> more lines:
>
> $text =~ tr/ \t\n\r\f/ /s;
> $text =~ s/\A //;
> $text =~ s/ \z//;

That was left as an exercise to the reader. Come now, you don't expect
the bestest of code early Sunday morning...before I finish my first cup
of coffee? If so, I must say that your optimism is only overshadowed by
your hope. :)

Robert Wohlfarth

Dec 20, 2009, 2:02:38 PM

to begginers perl.org

On Sat, Dec 19, 2009 at 9:13 PM, sftriman <dal...@gmail.com> wrote:

> I use this series of regexp all over the place to clean up lines of
> text:
>
> $x=~s/^\s+//g;
> $x=~s/\s+$//g;
> $x=~s/\s+/ /g;
>
> in that order, and note the final one replace \s+ with a single space.
>
> Basically, it's (1) remove all leading space, (2) remove all trailing
> space,
> and (3) replace all multi-space with a single space [which, at this
> point,
> should only occur on interior characters].
>

Take a look at the String::Util module. The "crunch" function, for example,
removes leading and trailing whitespace and collapses multiple spaces.

--
Robert Wohlfarth

Dr.Ruud

Dec 20, 2009, 9:36:38 AM

to begi...@perl.org

sftriman wrote:
> I use this series of regexp all over the place to clean up lines of
> text:
>
> $x=~s/^\s+//g;
> $x=~s/\s+$//g;
> $x=~s/\s+/ /g;
>
> in that order, and note the final one replace \s+ with a single space.

The g-modifier on the first 2 is bogus
(unless you would add an m-modifier).

I currently tend to write it like this:

s/\s+\z//, s/\A\s+//, s/\s+/ /g, for $x;

So first remove trailing whitespace (that leaves less to shift left in
the next step). Then remove leading whitespace. Then normalize.

For a multi-line buffer you can do it like this:

perl -wle '

my $x = <<"EOT";
123 456 \t
abc def
\t\t\t\t \t\t\t\t
*** *** *** \t
EOT

s/^\s+//mg, s/\s+$//mg, s/[^\S\n]+/ /g for $x;

$x =~ s/\n/>\n/g;
print $x, "<";
'

123 456>
abc def>
*** *** ***<

--
Ruud
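
To see why /g only matters on an anchored pattern once /m is added,
here is a small sketch. Without /m, ^ matches only at the very start of
the string, so there is at most one match for /g to find. The buffer
contents and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $buf = "  one\n  two\n  three\n";

(my $no_m   = $buf) =~ s/^\s+//g;     # ^ anchors at the string start only
(my $with_m = $buf) =~ s/^\s+//mg;    # ^ also anchors after each newline

# tr/ // with an empty replacement list just counts the spaces.
print "without /m: ", ($no_m   =~ tr/ //), " spaces left\n";   # 4
print "with /m:    ", ($with_m =~ tr/ //), " spaces left\n";   # 0
```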

Dr.Ruud

Dec 20, 2009, 9:46:13 AM

to begi...@perl.org

Shawn H Corey wrote:

> $text =~ tr{\t}{ };
> $text =~ tr{\n}{ };
> $text =~ tr{\r}{ };
> $text =~ tr{\f}{ };
> $text =~ tr{ }{ }s;

That can be written as:

tr/\t\n\r\f/ /, tr/ / /s for $text;

But it doesn't remove leading or trailing whitespace; it only squashes
each down to a single space.

--
Ruud

Albert Q

Dec 21, 2009, 5:11:10 AM

to Dr.Ruud, begi...@perl.org

2009/12/20 Dr.Ruud <rvtol+...@isolution.nl>

> sftriman wrote:
>
>> I use this series of regexp all over the place to clean up lines of
>> text:
>>
>> $x=~s/^\s+//g;
>> $x=~s/\s+$//g;
>> $x=~s/\s+/ /g;
>>
>> in that order, and note the final one replace \s+ with a single space.
>>
>
> The g-modifier on the first 2 is bogus
> (unless you would add an m-modifier).
>
> I currently tend to write it like this:
>
> s/\s+\z//, s/\A\s+//, s/\s+/ /g, for $x;
>
> So first remove tail spaces (less to lshift next).
> Then remove head spaces. Then normalize.
>
>
> For a multi-line buffer you can do it like this:
>
> perl -wle '
>
> my $x = <<"EOT";
> 123 456 \t
> abc def
> \t\t\t\t \t\t\t\t
> *** *** *** \t
> EOT
>
> s/^\s+//mg, s/\s+$//mg, s/[^\S\n]+/ /g for $x;

I know what it does, but I haven't seen this form of *for* before. Where can
I find the description of this syntax in perldoc?

Thanks.

> $x =~ s/\n/>\n/g;
> print $x, "<";
> '
>
> 123 456>
> abc def>
> *** *** ***<
>
> --
> Ruud
>
>

--
missing the days we spend together

Jim Gibson

Dec 21, 2009, 11:06:38 AM

to begi...@perl.org

At 6:11 PM +0800 12/21/09, Albert Q wrote:
>2009/12/20 Dr.Ruud <rvtol+...@isolution.nl>

>
>
> > For a multi-line buffer you can do it like this:
>>
>> perl -wle '
>>
>> my $x = <<"EOT";
>> 123 456 \t
>> abc def
>> \t\t\t\t \t\t\t\t
>> *** *** *** \t
>> EOT
>>
>> s/^\s+//mg, s/\s+$//mg, s/[^\S\n]+/ /g for $x;
>
>
>I know what it does, but I haven't seen this form of *for* before. Where can
>I find the description of this syntax in perldoc?

That is a question about "Perl syntax", so look in "perldoc perlsyn".
Search for the section on "Statement Modifiers", and realize that
"for" and "foreach" are synonyms.

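As a small sketch of that statement-modifier form: "EXPR for LIST"
aliases $_ to each element of the list in turn, so a bare s/// inside
EXPR edits the element itself. The two snippets below should be
equivalent; the variable names and sample string are made up.

```perl
use strict;
use warnings;

my $x = "  foo   bar  ";

# Long form: an ordinary foreach block, with $_ aliased to $x.
foreach ($x) {
    s/\s+\z//;
    s/\A\s+//;
    s/\s+/ /g;
}

# Short form: the same three operations as one statement with a
# trailing "for" modifier.
my $y = "  foo   bar  ";
s/\s+\z//, s/\A\s+//, s/\s+/ /g for $y;

print "[$x] [$y]\n";    # prints [foo bar] [foo bar]
```
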
Sftriman

Dec 22, 2009, 3:29:36 PM

to begi...@perl.org

Thanks to everyone for their input!

So I've tried out many of the methods, first making sure that each
works as I intended. That is, I'm not concerned with multi-line text,
just single-line data. That said, I have noted that I should use \A and
\z in general over ^ and $.

I wrote a 176-byte string for testing, and ran each method 1,000,000
times to time the speed. The winner is: three statements, using tr for
intra-string whitespace and two anchored s/// for the ends. I found I
could make this even faster by passing a reference to the variable
rather than passing in the variable as a local input parameter,
modifying it, then returning it. (In all cases, my goal is to write a
sub for general use anywhere I want it, so I wrote each possibility as
a sub. There ARE cases where I need to compare the original string with
the "cleaned" string, but I can deal with that as need be with local
variables.)

1ST PLACE - THE WINNER: 5.0s average on 5 runs

# Limitation: the caller must pass a reference
sub fixsp5 {
    ${$_[0]}=~tr/ \t\n\r\f/ /s;
    ${$_[0]}=~s/\A //;
    ${$_[0]}=~s/ \z//;
}

2nd PLACE - same as above, but with local variables - 6.0s average on 5 runs

sub fixsp4 {
    my ($x)=@_;
    $x=~tr/ \t\n\r\f/ /s;
    $x=~s/\A //;
    $x=~s/ \z//;
    return $x;
}

[ QUESTION - any difference using my $x=shift; ??? ]

3rd PLACE - 3-way tie: my method, whether as variable in, change in
place, or reference - 17.0s average

sub fixsp0 {
    my ($x)=@_;
    $x=~s/^\s+//;
    $x=~s/\s+$//;
    $x=~s/\s+/ /g;
    return $x;
}

# Limitation: the caller must pass a reference
sub fixsp1 {
    ${$_[0]}=~s/^\s+//;
    ${$_[0]}=~s/\s+$//;
    ${$_[0]}=~s/\s+/ /g;
}

# Limitation: changes the caller's variable in place
sub fixsp2 {
    $_[0]=~s/^\s+//;
    $_[0]=~s/\s+$//;
    $_[0]=~s/\s+/ /g;
}

4TH PLACE - 20.0s average on 5 runs (did not try change in place or as
reference)

sub fixsp6 {
    my ($x)=@_;

    s/\s+\z//, s/\A\s+//, s/\s+/ /g, for $x;

    return $x;
}

5TH PLACE - DEAD LAST! (or DFL in some parlance) - 62.0s average on 3 runs

sub fixsp3 {
    my ($x)=@_;
    $x=~s/^(\s+)|(\s+)$//g;
    $x=~s/\s+/ /g;
    return $x;
}

Any and all comments welcome.

David
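
For anyone wanting to repeat this kind of comparison, the core Benchmark
module can run the candidates side by side. What follows is only a
sketch under stated assumptions: the sample string is invented (the
176-byte test string was not posted), only two of the subs are included,
and the numbers will vary by machine, load, and perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented sample input; NOT the 176-byte string used in the post.
my $sample = "  \t some   text with\t\tassorted   whitespace \n" x 4;

sub fixsp4 {    # tr/// plus two anchored substitutions, returns a copy
    my ($x) = @_;
    $x =~ tr/ \t\n\r\f/ /s;
    $x =~ s/\A //;
    $x =~ s/ \z//;
    return $x;
}

sub fixsp0 {    # three substitutions, returns a copy
    my ($x) = @_;
    $x =~ s/^\s+//;
    $x =~ s/\s+$//;
    $x =~ s/\s+/ /g;
    return $x;
}

# Sanity check before timing: both must agree on the sample.
die "results differ" unless fixsp4($sample) eq fixsp0($sample);

# A negative count means "run each for at least that many CPU seconds".
cmpthese(-1, {
    tr_based => sub { my $r = fixsp4($sample) },
    s_based  => sub { my $r = fixsp0($sample) },
});
```

cmpthese prints a table of rates with the relative differences, which
makes it easier to compare runs made under different CPU load.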

Dr.Ruud

Dec 23, 2009, 5:31:26 AM

to begi...@perl.org

sftriman wrote:

> 1ST PLACE - THE WINNER: 5.0s average on 5 runs
>
> # Limitation - pointer
> sub fixsp5 {
> ${$_[0]}=~tr/ \t\n\r\f/ /s;
> ${$_[0]}=~s/\A //;
> ${$_[0]}=~s/ \z//;
> }

Just decide to change in-place, based on the defined-ness of wantarray.

sub trim {
    no warnings 'uninitialized';

    if ( defined wantarray ) {
        # need to return scalar / list
        my @values= @_;
        s#^\s+##s, s#\s+$##s foreach @values;
        return wantarray ? @values : $values[0];
    }

    # need to change in-place
    s#^\s+##s, s#\s+$##s foreach @_;
    return;
} #trim

--
Ruud
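
How the wantarray switch plays out at the call site, as a sketch: the
sub below is the same idea written with \A and \z, and the sample
strings are made up. In void context wantarray returns undef, so the
arguments (which are aliases in @_) are edited in place; in scalar or
list context trimmed copies come back and the originals are untouched.

```perl
use strict;
use warnings;

sub trim {
    no warnings 'uninitialized';
    if ( defined wantarray ) {
        my @values = @_;                 # work on copies
        s/\A\s+//, s/\s+\z// foreach @values;
        return wantarray ? @values : $values[0];
    }
    s/\A\s+//, s/\s+\z// foreach @_;     # void context: edit the caller's variables
    return;
}

my $orig = "  keep me  ";
my $copy = trim($orig);            # scalar context: $orig is untouched
print "[$orig] [$copy]\n";         # prints [  keep me  ] [keep me]

my @list = trim("  a ", " b  ");   # list context: trimmed copies
print "[@list]\n";                 # prints [a b]

trim($orig);                       # void context: $orig is modified in place
print "[$orig]\n";                 # prints [keep me]
```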

Sftriman

Dec 28, 2009, 1:01:53 AM

to begi...@perl.org

Hi there,

You're missing the tr to squash space down, but I see what you're doing.
I never need to "trim" an array at this point, but if I did...

So I think it can boil down to:

sub fixsp7 {
    s#\A\s+##, s#\s+\z##, tr/ \t\n\r\f/ /s foreach @_;
    return;
}

This is consistent with my other 6 test cases. I run it against several
test strings, including some with line breaks, to make sure the results
are always the same. Note I am using \A and \z and not ^ and $. Still,
I think this has the flavor of what you intended.

Result: 5 trial runs over the same data set, 1,000,000 iterations each,
average time 16.30s. All things considered, this puts it in a 4-way tie
for 3rd place with the other methods. IF the times above still stand...

And in fact, they don't. Why? CPU usage is high on my box right now.
So I re-ran the other methods that had been in the 6.0s range as a
baseline, and they are now coming in at 25s! So maybe this one is the
fastest! I'll have to do more testing.

To be fair, I had to rewrite the former "winner" as:

sub fixsp1a {
    ${$_[0]}=~s/\A\s+//;
    ${$_[0]}=~s/\s+\z//;
    ${$_[0]}=~s/\s+/ /g;
}

using \A and \z.

I wonder how expensive that foreach is. Knowing that there is exactly
one argument, is there a faster way for this to run, not using foreach?
Even so, this may not be the fastest trim method: in place, no
reference, one line, with the foreach @_ as written.

David
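
On the foreach question: with exactly one argument, the loop can be
dropped by working on $_[0] directly, since the elements of @_ are
aliases to the caller's variables. A sketch only; the name fixsp8 is
hypothetical, and it has not been timed, so whether it is actually
faster would need measuring.

```perl
use strict;
use warnings;

# Hypothetical single-argument variant (name invented for this sketch):
# modifies the caller's variable in place, no loop and no reference.
sub fixsp8 {
    $_[0] =~ tr/ \t\n\r\f/ /s;    # squash whitespace runs to one space
    $_[0] =~ s/\A //;             # at most one leading space remains
    $_[0] =~ s/ \z//;             # at most one trailing space remains
    return;
}

my $line = "  foo \t bar\n";
fixsp8($line);
print "[$line]\n";    # prints [foo bar]
```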

Shawn H Corey

Dec 28, 2009, 8:20:52 AM

to sftriman, begi...@perl.org

sftriman wrote:
> So I think it can boil down to:
>
> sub fixsp7 {
> s#\A\s+##, s#\s+\z##, tr/ \t\n\r\f/ /s foreach @_;
> return;
> }

sub fixsp7 {
    tr/ \t\n\r\f/ /s, s#\A\s##, s#\s\z## foreach @_;
    return;
}

Placing the tr/// first reduces the number of characters scanned for
s#\s\z## which makes things slightly faster.

Dr.Ruud

Dec 28, 2009, 2:31:16 PM

to begi...@perl.org

sftriman wrote:
> Dr.Ruud:

>> sub trim { ...
>> } #trim

>
> You're missing the tr to squash space down

To trim() is to remove from head and tail only.
Just use it as an example to build a "trim_and_normalize()".

> So I think it can boil down to:
>
> sub fixsp7 {
> s#\A\s+##, s#\s+\z##, tr/ \t\n\r\f/ /s foreach @_;
> return;
> }

Best remove from the end before removing from the start.

--
Ruud