I use this series of regexps all over the place to clean up lines of text:
$x=~s/^\s+//g;
$x=~s/\s+$//g;
$x=~s/\s+/ /g;
in that order; note that the final one replaces \s+ with a single space.
Basically, it (1) removes all leading whitespace, (2) removes all trailing
whitespace, and (3) replaces each run of whitespace with a single space
[which, at this point, should only occur between interior characters].
Is there a handy way to do this in one regexp, and a fast one? I've been
using Devel::NYTProf to study code timing and see that some regexps,
especially mine, can be CPU-intensive.
Thanks!
David
You can probably use $x=~s/^(\s+)|(\s+)$//g;
But I don't think it will use any less CPU than the 3 regex option,
the nature of Perl's regex engine being what it is.
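To see what the combined pattern covers, here is a small check on a made-up sample string (the string and variable names are assumptions for illustration, not from the thread). The alternation trims both ends in one substitution, but an interior run of whitespace is left alone because neither anchored branch can match there, so the third step still needs its own pass.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $x = "  foo   bar  ";

# One substitution, two anchored alternatives: a leading OR a trailing run.
$x =~ s/^(\s+)|(\s+)$//g;
my $trimmed_only = $x;          # ends are trimmed, interior run untouched
print "[$trimmed_only]\n";

# The interior run still needs its own substitution.
$x =~ s/\s+/ /g;
print "[$x]\n";
```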
--
Erez
tr/// is generally faster than s///
$text =~ tr{\t}{ };
$text =~ tr{\n}{ };
$text =~ tr{\r}{ };
$text =~ tr{\f}{ };
$text =~ tr{ }{ }s;
--
Just my 0.00000002 million dollars worth,
Shawn
That can be reduced to:
$text =~ tr/ \t\n\r\f/ /s;
But that still doesn't remove leading and trailing whitespace so add two
more lines:
$text =~ tr/ \t\n\r\f/ /s;
$text =~ s/\A //;
$text =~ s/ \z//;
John
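As a quick check that the tr-based version above gives the same result as the original three regexps, here is a minimal sketch on a made-up input (the sample string and variable names are assumptions, not from the thread). The two only agree for the whitespace characters listed in the tr/// set; \s can match more than those, so this is worth checking against real data.

```perl
use strict;
use warnings;

# Hypothetical sample line with tabs, newlines and doubled spaces.
my $original = " \t some   text\twith \n odd  spacing \n";

# The three-regexp cleanup from the start of the thread (without /g on
# the anchored ones).
(my $by_regex = $original) =~ s/^\s+//;
$by_regex =~ s/\s+$//;
$by_regex =~ s/\s+/ /g;

# tr/// maps every listed whitespace character to a space and, with /s,
# squeezes each run of resulting spaces to one. At most one space is then
# left at each end for the two anchored substitutions to remove.
(my $by_tr = $original) =~ tr/ \t\n\r\f/ /s;
$by_tr =~ s/\A //;
$by_tr =~ s/ \z//;

print "[$by_regex]\n[$by_tr]\n";
print $by_regex eq $by_tr ? "same result\n" : "results differ\n";
```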
That was left as an exercise to the reader. Come now, you don't expect
the bestest of code early Sunday morning...before I finish my first cup
of coffee? If so, I must say that your optimism is only overshadowed by
your hope. :)
> I use this series of regexp all over the place to clean up lines of
> text:
>
> $x=~s/^\s+//g;
> $x=~s/\s+$//g;
> $x=~s/\s+/ /g;
>
> in that order, and note the final one replace \s+ with a single space.
>
> Basically, it's (1) remove all leading space, (2) remove all trailing
> space,
> and (3) replace all multi-space with a single space [which, at this
> point,
> should only occur on interior characters].
>
Take a look at the String::Util module. The "crunch" function, for example,
also removes leading/trailing/multiple spaces.
--
Robert Wohlfarth
The g-modifier on the first two is bogus
(unless you also add an m-modifier).
I currently tend to write it like this:
s/\s+\z//, s/\A\s+//, s/\s+/ /g, for $x;
So first remove trailing whitespace (less to shift left in the next step).
Then remove leading whitespace. Then normalize.
For a multi-line buffer you can do it like this:
perl -wle '
my $x = <<"EOT";
123 456 \t
abc def
\t\t\t\t \t\t\t\t
*** *** *** \t
EOT
s/^\s+//mg, s/\s+$//mg, s/[^\S\n]+/ /g for $x;
$x =~ s/\n/>\n/g;
print $x, "<";
'
123 456>
abc def>
*** *** ***<
--
Ruud
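Two details from the post above can be checked directly; the following is only a sketch on made-up strings (the buffer contents and variable names are assumptions). It shows that an anchored substitution without /m can succeed at most once, so /g adds nothing, and that $ may match just before a trailing newline while \z matches only at the true end of the string.

```perl
use strict;
use warnings;

my $buffer = "  one\n  two\n";   # hypothetical two-line buffer

# Without /m, ^ can only match at the very start of the string, so the
# substitution can succeed once at most and /g has nothing more to do.
(my $no_m = $buffer) =~ s/^\s+//g;

# With /m, ^ also matches after each embedded newline, so /g matters.
(my $with_m = $buffer) =~ s/^\s+//mg;

# $ may match just before a trailing newline; \z only at the true end.
my $line           = "text\n";
my $dollar_matches = $line =~ /text$/  ? 1 : 0;
my $z_matches      = $line =~ /text\z/ ? 1 : 0;

print "no /m:   ", join('|', split /\n/, $no_m),   "\n";
print "with /m: ", join('|', split /\n/, $with_m), "\n";
print "\$ matches: $dollar_matches, \\z matches: $z_matches\n";
```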
> $text =~ tr{\t}{ };
> $text =~ tr{\n}{ };
> $text =~ tr{\r}{ };
> $text =~ tr{\f}{ };
> $text =~ tr{ }{ }s;
That can be written as:
tr/\t\n\r\f/ /, tr/ / /s for $text;
But it doesn't remove all leading or all trailing whitespace: a single
space can remain at each end.
--
Ruud
> sftriman wrote:
>
>> $x=~s/^\s+//g;
>> $x=~s/\s+$//g;
>> $x=~s/\s+/ /g;
>
> For a multi-line buffer you can do it like this:
>
> s/^\s+//mg, s/\s+$//mg, s/[^\S\n]+/ /g for $x;
I know what it does, but I haven't seen this form of *for* before. Where can
I find the description of this syntax in perldoc?
Thanks.
That is a question about "Perl syntax", so look in "perldoc perlsyn".
Search for the section on "Statement Modifiers", and realize that
"for" and "foreach" are synonyms: EXPR for LIST runs EXPR once per
element of LIST, with $_ aliased to that element.
So I've tried out many of the methods, first making sure that each one
works as I intended. I'm not concerned with multi-line text, just
single-line data. That said, I have noted that I should use \A and \z in
general over ^ and $.
I wrote a 176-byte string for testing and ran each method 1,000,000 times
to time the speed. The winner is: three operations, using tr for the
intra-string spaces and two anchored regexps for the ends. I found I could
make this even faster by passing a reference (pointer) to the variable
instead of passing the variable in as a local input parameter, modifying
it, then returning it. (In all cases, my goal is to write a sub for general
use anywhere I want it, so I wrote each possibility as a sub. There ARE
cases where I need to compare the original string with the "cleaned"
string, but I can deal with that as need be with local variables.)
1ST PLACE - THE WINNER: 5.0s average on 5 runs
# Limitation - pointer
sub fixsp5 {
    ${$_[0]}=~tr/ \t\n\r\f/ /s;
    ${$_[0]}=~s/\A //;
    ${$_[0]}=~s/ \z//;
}
2ND PLACE - same as above, but with local variables - 6.0s average on 5 runs
sub fixsp4 {
    my ($x)=@_;
    $x=~tr/ \t\n\r\f/ /s;
    $x=~s/\A //;
    $x=~s/ \z//;
    return $x;
}
[ QUESTION - any difference using my $x=shift; ??? ]
3RD PLACE - 3-way tie, my method, either as variable in, change in
place, or pointer - 17.0s average
sub fixsp0 {
    my ($x)=@_;
    $x=~s/^\s+//;
    $x=~s/\s+$//;
    $x=~s/\s+/ /g;
    return $x;
}
# Limitation: pointer
sub fixsp1 {
    ${$_[0]}=~s/^\s+//;
    ${$_[0]}=~s/\s+$//;
    ${$_[0]}=~s/\s+/ /g;
}
# Limitation: change in place
sub fixsp2 {
    $_[0]=~s/^\s+//;
    $_[0]=~s/\s+$//;
    $_[0]=~s/\s+/ /g;
}
4TH PLACE - 20.0s average on 5 runs (did not try change in place or as
pointer)
sub fixsp6 {
    my ($x)=@_;
    s/\s+\z//, s/\A\s+//, s/\s+/ /g, for $x;
    return $x;
}
5TH PLACE - DEAD LAST! (or DFL in some parlance) - 62.0s average on 3 runs
sub fixsp3 {
    my ($x)=@_;
    $x=~s/^(\s+)|(\s+)$//g;
    $x=~s/\s+/ /g;
    return $x;
}
Any and all comments welcome.
David
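For anyone wanting to repeat this kind of comparison, here is a minimal sketch using the core Benchmark module. The sample string, the sub names clean_regex and clean_tr, and the iteration count are all assumptions made for illustration; the 176-byte test string from the post above was not included in the thread, so the numbers this prints are not comparable to the timings listed there.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test line; the thread's 176-byte string was not posted.
my $sample = "  \t lorem   ipsum \t dolor  sit\tamet  \n" x 4;

# Copy in, clean with three regexps, return the copy.
sub clean_regex {
    my ($x) = @_;
    $x =~ s/\A\s+//;
    $x =~ s/\s+\z//;
    $x =~ s/\s+/ /g;
    return $x;
}

# Copy in, clean with tr/// plus two anchored regexps, return the copy.
sub clean_tr {
    my ($x) = @_;
    $x =~ tr/ \t\n\r\f/ /s;
    $x =~ s/\A //;
    $x =~ s/ \z//;
    return $x;
}

# Sanity check before timing: both variants must agree on the sample.
die "variants disagree" unless clean_regex($sample) eq clean_tr($sample);

# Run each variant a fixed number of times and print a comparison table.
cmpthese(200_000, {
    three_regex => sub { my $r = clean_regex($sample) },
    tr_plus_two => sub { my $r = clean_tr($sample) },
});
```

Results depend heavily on the machine and on whatever else it is doing at the time, so it is worth running the comparison several times before drawing conclusions.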
> 1ST PLACE - THE WINNER: 5.0s average on 5 runs
>
> # Limitation - pointer
> sub fixsp5 {
> ${$_[0]}=~tr/ \t\n\r\f/ /s;
> ${$_[0]}=~s/\A //;
> ${$_[0]}=~s/ \z//;
> }
Just decide whether to change in place based on the defined-ness of
wantarray: it is undefined in void context, so modify the arguments there
and return trimmed copies otherwise.
sub trim {
    no warnings 'uninitialized';
    if ( defined wantarray ) {
        # need to return scalar / list
        my @values= @_;
        s#^\s+##s, s#\s+$##s foreach @values;
        return wantarray ? @values : $values[0];
    }
    # need to change in-place
    s#^\s+##s, s#\s+$##s foreach @_;
    return;
} #trim
--
Ruud
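The three calling contexts that the trim() above distinguishes can be exercised like this. This is a sketch only: the sub is restated so the example runs on its own, written here with \A and \z rather than ^ and $, and the input strings and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

sub trim {
    no warnings 'uninitialized';
    if ( defined wantarray ) {
        # Scalar or list context: work on copies and return them.
        my @values = @_;
        s/\A\s+//, s/\s+\z// for @values;
        return wantarray ? @values : $values[0];
    }
    # Void context: @_ aliases the caller's variables, so edit in place.
    s/\A\s+//, s/\s+\z// for @_;
    return;
}

my $raw = "  padded text \n";              # hypothetical input

my $copy = trim($raw);                     # scalar context
my $raw_after_scalar_call = $raw;          # still has its whitespace

my @pair = trim("  left", "right  ");      # list context

trim($raw);                                # void context: $raw is changed

print "[$copy] [@pair] [$raw]\n";
```

Calling it in void context on a string literal can fail, since a constant cannot be modified in place.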
Hi there,
You're missing the tr to squash space down, but I see what you're
doing.
I never need to "trim" an array at this point, but if I did...
So I think it can boil down to:
sub fixsp7 {
    s#\A\s+##, s#\s+\z##, tr/ \t\n\r\f/ /s foreach @_;
    return;
}
This keeps it consistent with my other 6 test cases. I ran it against
several test strings, including some with line breaks, to make sure the
results are always the same. Note I am using \A and \z and not ^ and $.
Still, I think this has the flavor of what you intended.
Result: 5 trial runs over the same data set, 1,000,000 times each, average
time was 16.30s. All things considered, this puts it in a 4-way tie
for 3rd place with the other methods. IF the times above still
stand...
And in fact, they don't. Why? CPU usage is high on my box right now.
So I re-ran the methods that had baselined in the 6.0s range, and they are
now coming in at 25s! So maybe this one is the fastest! I'll have to do
more testing.
To be fair, I had to rewrite the former "winner" as:
sub fixsp1a {
    ${$_[0]}=~s/\A\s+//;
    ${$_[0]}=~s/\s+\z//;
    ${$_[0]}=~s/\s+/ /g;
}
using \A and \z.
I wonder how expensive that foreach is. Knowing that there is exactly
one argument, is there a faster way for this to run, not using foreach?
Even so, this may not be the fastest trim method - in place, no pointer,
one line, with the foreach @_ as written.
David
sub fixsp7 {
    tr/ \t\n\r\f/ /s, s#\A\s##, s#\s\z## foreach @_;
    return;
}
Placing the tr/// first reduces the number of characters scanned for
s#\s\z## which makes things slightly faster.
>> sub trim { ...
>> } #trim
>
> You're missing the tr to squash space down
To trim() is to remove from head and tail only.
Just use it as an example to build a "trim_and_normalize()".
> So I think it can boil down to:
>
> sub fixsp7 {
> s#\A\s+##, s#\s+\z##, tr/ \t\n\r\f/ /s foreach @_;
> return;
> }
Best remove from the end before removing from the start.
--
Ruud
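Pulling the suggestions in this thread together, a "trim_and_normalize()" could look roughly like the sketch below. It is one possible arrangement, not code posted by anyone in the thread: the helper name _squash_in_place and the sample strings are hypothetical. It squeezes with tr/// first, removes from the end before the start, and uses wantarray to choose between returning copies and editing in place.

```perl
use strict;
use warnings;

# Hypothetical helper: normalizes every argument in place.
sub _squash_in_place {
    for (@_) {                      # @_ aliases the caller's variables
        next unless defined;
        tr/ \t\n\r\f/ /s;           # squeeze whitespace runs to one space
        s/ \z//;                    # at most one space left at the end
        s/\A //;                    # and at most one at the start
    }
    return;
}

sub trim_and_normalize {
    if ( defined wantarray ) {
        # Scalar or list context: leave the arguments alone, return copies.
        my @copies = @_;
        _squash_in_place(@copies);
        return wantarray ? @copies : $copies[0];
    }
    # Void context: change the caller's variables.
    _squash_in_place(@_);
    return;
}

my $line    = "  some \t text\n";   # hypothetical input
my $cleaned = trim_and_normalize($line);
my $line_before_void_call = $line;
trim_and_normalize($line);
print "[$cleaned] [$line]\n";
```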