Questions on scalar references in general, substr and regular expression engine in particular

s...@netherlands.co

May 15, 2008, 3:40:06 PM

ActiveState Perl guidelines state it's better to pass by reference for best performance.

Does this apply in the case of SCALARS?

It would seem that dereferencing a SCALAR reference would create a temporary copy of the original,
and I think this is the case.

Perl seems to know array and hash references pretty well and, internally, dereferencing them does
not seem to incur overhead of making a temporary copy of the original array or hash, the element is
directly accessed, as if the reference were a pointer.

That's fine, I have no problem at all with this. I am only interested in SCALAR.

I was curious about the function 'substr'. The first parameter is EXPRESSION. It seems to be an
EXPRESSION evaluator. On its face, it will not take a reference, but it will take a dereferenced SCALAR.
I wonder whether, based on the EXPRESSION, it knows it is dereferencing a SCALAR and does not make a temporary copy.

In the case of the regular expression engine, I wonder the same thing, although with this there may be other properties
that would cause the discrepancies shown below.

For example 'pos()=' and '= pos()' might incur much overhead, and that may explain it as it could produce unknown temporaries
in the engine.

If I run this code segment 100 times, the substr is averaging 10 times faster than the regex. I can't explain it.

Thanks!

# substr
return substr($$refscalar, 20, 30);

# regex
$savpos = pos($$refscalar);
pos($$refscalar) = 20;
while ($$refscalar =~ /(.{10})/gs) {
    pos($$refscalar) = $savpos;
    return $1;
}
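
One way to measure the two approaches side by side is the core Benchmark module. This is only a sketch under assumptions: the 300,000-byte test string and the sub names via_substr and via_regex are made up for illustration, and both subs extract 10 characters at offset 20 so that they are comparable. Timings will vary with the perl version and build. Perl's documentation notes that patterns with capturing parentheses carry a cost for keeping $1 available, which may be part of the gap, but treat that as a hypothesis to measure rather than a confirmed explanation.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: a 300,000-byte string reached through a reference.
my $text      = join '', map { chr(97 + $_ % 26) } 0 .. 299_999;
my $refscalar = \$text;

# Take 10 characters starting at offset 20 with substr.
sub via_substr {
    my ($ref) = @_;
    return substr($$ref, 20, 10);
}

# Same extraction with the regex engine, saving and restoring pos().
sub via_regex {
    my ($ref) = @_;
    my $savpos = pos($$ref);
    pos($$ref) = 20;              # the /g match starts from this offset
    my $found;
    if ($$ref =~ /(.{10})/gs) {
        $found = $1;
    }
    pos($$ref) = $savpos;         # restore the caller's match position
    return $found;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    substr => sub { via_substr($refscalar) },
    regex  => sub { via_regex($refscalar) },
});
```

Both subs return the same 10 characters, so any difference in the table comes from how the extraction is done and not from what is extracted.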

s...@netherlands.co

May 15, 2008, 3:54:34 PM

I'm sorry, this should be ^10

bugbear

May 16, 2008, 5:11:46 AM

s...@netherlands.co wrote:
> ActiveState perl guidelines state its better to pass by reference for best performance.
>
> Does this apply in the case of SCALARS?

No.

BugBear

Ben Morrow

May 16, 2008, 7:41:58 AM

Quoth bugbear <bugbear@trim_papermule.co.uk_trim>:

To expand a little: the items in @_ are aliases, not copies, so if you
manipulate them directly you will not be taking a copy. However, as soon
as you do something like

my ($x, $y, $z) = @_;

you've copied all your arguments, and if one of those was a huge great
string you've just done a big memcpy.

Of course, manipulating @_ directly is obscure, and you run the risk of
modifying your arguments unexpectedly. Under normal circumstances (when
you're not expecting to deal with truly enormous strings) taking a copy
is the right thing to do. When it isn't, if you are going to modify the
arguments passed it is probably clearer to make the caller pass a
reference explicitly.
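
A small sketch of the difference between using @_ directly and copying it into lexicals. The sub names and the test string are hypothetical, chosen only for illustration. How expensive the copy really is depends on the perl version and the string size.

```perl
use strict;
use warnings;

# Works on $_[0] directly: @_ aliases the caller's variable,
# so the string is not copied into a new lexical.
sub first_ten_aliased {
    return substr($_[0], 0, 10);
}

# Copies the argument into a lexical first; for a huge string this
# is the copy described above.
sub first_ten_copied {
    my ($text) = @_;
    return substr($text, 0, 10);
}

# Assigning to $_[0] changes the caller's variable: the risk of
# manipulating @_ directly.
sub shout_in_place {
    $_[0] = uc $_[0];
    return;
}

my $big = 'abcdefghij' x 30_000;   # hypothetical 300,000-byte string
print first_ten_aliased($big), "\n";
print first_ten_copied($big), "\n";

my $word = 'hello';
shout_in_place($word);
print "$word\n";   # the caller's $word was modified through the alias
```

Both accessor subs return the same value. The difference is only in whether the argument is duplicated on the way in, and in whether the sub can change the caller's data by accident.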

Ben

--
Razors pain you / Rivers are damp
Acids stain you / And drugs cause cramp. [Dorothy Parker]
Guns aren't lawful / Nooses give
Gas smells awful / You might as well live. b...@morrow.me.uk

Uri Guttman

May 16, 2008, 12:32:39 PM

>>>>> "b" == bugbear <bugbear@trim_papermule.co.uk_trim> writes:

b> s...@netherlands.co wrote:
>> ActiveState perl guidelines state its better to pass by reference for best performance.
>> Does this apply in the case of SCALARS?

b> No.

i beg to differ. when passing around large scalars (e.g. long pieces of
text), passing by ref is better. typically args are copied from @_ and
that will be slower with large scalars. so when i work with large
scalars i tend to pass them by ref. you do have to make sure about
modifications not affecting the original or not caring about that.
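
a rough sketch of passing a large scalar by reference, assuming nothing beyond core perl. the sub names count_lines and append_footer and the sample text are made up for illustration. only the reference is copied out of @_, and any change made through it is visible to the caller.

```perl
use strict;
use warnings;

# Takes a reference to a (possibly large) string. Copying the reference
# out of @_ is cheap no matter how long the string is.
sub count_lines {
    my ($text_ref) = @_;
    return ($$text_ref =~ tr/\n//);   # tr counts newlines without changing the string
}

# Modifies the caller's string through the reference, so the caller
# sees the change.
sub append_footer {
    my ($text_ref) = @_;
    $$text_ref .= "-- end --\n";
    return;
}

my $document = "line one\nline two\n";   # stand-in for a long piece of text
print count_lines(\$document), "\n";
append_footer(\$document);
print $document;
```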

but i also don't like giving good advice to sln as he is a troll who
doesn't listen. his xml parser is insane.

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

s...@netherlands.co

May 17, 2008, 3:31:23 PM

On Thu, 15 May 2008 12:40:06 -0700, s...@netherlands.co wrote:

this should be ^10

>
>// regex
>$savpos = pos($$refscalar);
>pos($$refscalar) = 20;
>while ($$refscalar =~ /(.{10})/gs) {
> pos($$refscalar) = $savpos;
> return $1;
>}
>
>

I have some heartening information about substr that confirms my suspicions.
The pos() result is disappointing and still unexplained.

I tested substr with scalar text of 300k bytes. At least 7,000 calls were made, using
a dereferenced scalar as above. On its face this represents 2.1 GIGABYTES of data
(300,000 bytes x 7,000 calls) if every call copied the string.
Check that please.

The series was something like this:

# Loop 7,000 times
for (1 .. 7_000) {
    my $t = '';
    $t = substr($$refscalar, 20, 10);
    $t = substr($$refscalar, 20, 10);
    $t = substr($$refscalar, 20, 10);
    # THIS:
    $t = substr($$refscalar, 20, 10);
    # THAT:
    $t = substr($$refscalar, 20, 10);
}

For the entire 2.1 gigs of data, the difference between THIS and THAT is only 0.03 seconds!

It's apparent that the function substr() does NOT create a temporary (i.e. a memcpy)
on the C side, but instead uses the resulting scalar's pointer to access
the data directly!
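
A short sketch that checks the arithmetic and pokes at the claim from the Perl side. It assumes a hypothetical 300,000-byte buffer. It does not inspect perl's C internals. It only shows that $$refscalar names the original string, since assigning through substr changes that string, and that the value handed back is a new 10-character scalar.

```perl
use strict;
use warnings;

my $text      = 'x' x 300_000;   # hypothetical 300,000-byte buffer
my $refscalar = \$text;

# If each of 7,000 calls copied the whole string, this many bytes would move.
my $bytes_if_copied = length($$refscalar) * 7_000;
print "$bytes_if_copied\n";

# Assigning through substr on the dereferenced scalar changes the original
# string, so $$refscalar refers to the original and not to a temporary.
substr($$refscalar, 20, 10) = '0123456789';
print substr($text, 18, 14), "\n";

# The value returned by substr is a new 10-character scalar; only those
# 10 characters end up in $t.
my $t = substr($$refscalar, 20, 10);
print length($t), "\n";
```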

If you know whether this is true, or how it works, please let me know.

Thanks!