Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Iterating hashes

23 views
Skip to first unread message

Dave Saville

unread,
May 14, 2013, 9:37:47 AM5/14/13
to
If I have

foreach (sort keys %hash)

Then I know that there is a posible performance hit as the keys are
all extracted and then sorted. But is this because of the sort or are
they always all extracted first? If the latter is true then how do you
iterate over a hash without taking the hit?

I am having some problems with a script that has two hashes upon which
I am trying to do inner and outer joins amongst other things. The
hashes are roughly the same size and have over 60,000 keys the
majority of the keys have a length of approx 70 characters. The hash
values are a three element array: A mixed case copy of the key and two
integers.

--
Regards
Dave Saville

Peter Makholm

unread,
May 14, 2013, 9:51:13 AM5/14/13
to
"Dave Saville" <da...@invalid.invalid> writes:

> Then I know that there is a posible performance hit as the keys are
> all extracted and then sorted. But is this because of the sort or are
> they always all extracted first? If the latter is true then how do you
> iterate over a hash without taking the hit?

You kan use the each() function in a while loop:

while (($key, $value) = each %hash) {
print $key, "\n";
}

See the relevant part of the perlfunc manual page or
http://perldoc.perl.org/functions/each.html

//Makholm

Rainer Weikusat

unread,
May 14, 2013, 10:49:46 AM5/14/13
to
Peter Makholm <pe...@makholm.net> writes:
> "Dave Saville" <da...@invalid.invalid> writes:
>
>> Then I know that there is a posible performance hit as the keys are
>> all extracted and then sorted. But is this because of the sort or are
>> they always all extracted first? If the latter is true then how do you
>> iterate over a hash without taking the hit?
>
> You kan use the each() function in a while loop:
>
> while (($key, $value) = each %hash) {
> print $key, "\n";
> }

There's actually a variety of options and depending on the number of
pairs in the hash, they perform differently ('values' came out first
everwhere for the tests I made).

----------------------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];

print("\n===\n$_[0] keys\n===\n");

cmpthese(-4,
{
keys => sub {
my $v;

$v = $h{$_} for keys(%h);
},

sorted_keys => sub {
my $v;

$v = $h{$_} for sort(keys(%h));
},

values => sub {
1 for values(%h);
},

scalar_each => sub {
my ($v, $k);

$v = $h{$k} while $k = each(%h);
},

list_each => sub {
my ($v, $k);

1 while ($k, $v) = each(%h);
}});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;

Dave Saville

unread,
May 14, 2013, 11:30:32 AM5/14/13
to
===
100 keys
===
Rate sorted_keys scalar_each list_each keys
values
sorted_keys 5715/s -- -28% -36% -46%
-85%
scalar_each 7956/s 39% -- -10% -25%
-80%
list_each 8863/s 55% 11% -- -16%
-77%
keys 10602/s 85% 33% 20% --
-73%
values 39161/s 585% 392% 342% 269%
--

Care to explain the numbers please?
--
Regards
Dave Saville

Ben Morrow

unread,
May 14, 2013, 12:40:31 PM5/14/13
to

Quoth Rainer Weikusat <rwei...@mssgmbh.com>:
> Peter Makholm <pe...@makholm.net> writes:
> > "Dave Saville" <da...@invalid.invalid> writes:
> >
> >> Then I know that there is a posible performance hit as the keys are
> >> all extracted and then sorted. But is this because of the sort or are
> >> they always all extracted first? If the latter is true then how do you
> >> iterate over a hash without taking the hit?
> >
> > You kan use the each() function in a while loop:
> >
> > while (($key, $value) = each %hash) {
> > print $key, "\n";
> > }
>
> There's actually a variety of options and depending on the number of
> pairs in the hash, they perform differently ('values' came out first
> everwhere for the tests I made).
>
> ----------------------------
> use Benchmark qw(cmpthese);
>
> sub traversal_bench
> {
> my %h = map { $_, 1; } 0 .. $_[0];
>
> print("\n===\n$_[0] keys\n===\n");
>
>
> sorted_keys => sub {
> my $v;
>
> $v = $h{$_} for sort(keys(%h));
> },
>
> values => sub {
> 1 for values(%h);
> },

That's hardly a fair comparison. In fact, 'values' coming out faster is
a red herring as well: it's only happening because the values are all 1
which is much faster to copy than a string.

With a fairer test like

use Benchmark qw/cmpthese/;

my %h = map +("$_", "$_"), 1..60_000;
my $x;

cmpthese -5, {
keys => sub { $x = $_ for keys %h },
sort => sub { $x = $_ for sort keys %h },
values => sub { $x = $_ for values %h },
keach => sub { 1 while $x = each %h },
veach => sub { 1 while $x = (each %h)[1] },
};

I consistently get

Rate sort veach keys values keach
sort 9.11/s -- -61% -72% -76% -81%
veach 23.6/s 159% -- -27% -37% -51%
keys 32.5/s 256% 37% -- -13% -33%
values 37.2/s 309% 58% 15% -- -23%
keach 48.4/s 431% 105% 49% 30% --

which is pretty much what I would expect: 'sort' is very expensive,
'keach' is cheap, and 'veach' is unavoidably more expensive than either
'keys' or 'values' because it has to do more work. I'm not sure why
'values' is cheaper than 'keys', but I suspect it has something to do
with the fact that hash keys are shared.

Attempting to compensate for 'veach's unfair disadvantage like this

use Benchmark qw/cmpthese/;

use constant C => "60000";
my %h = map +("$_", "$_"), 1..60_000;
my ($k, $v);

cmpthese -5, {
keys => sub { $k = 1; ($k, $v) = ($_, C) for keys %h },
values => sub { $k = 1; ($k, $v) = (C, $_) for values %h },
keach => sub { $k = 1; ($k, $v) = (scalar each %h, C) while $k },
veach => sub { $k = 1; 1 while ($k, $v) = each %h },
};

gives

Rate keys values veach keach
keys 18.4/s -- -11% -16% -49%
values 20.7/s 12% -- -5% -43%
veach 21.8/s 18% 6% -- -39%
keach 35.9/s 95% 74% 65% --


which puts 'each' fastest again; I'm not sure why 'keach' is faster than
'veach', but I suspect that comparison still isn't entirely fair.

Ben

Ben Morrow

unread,
May 14, 2013, 12:46:07 PM5/14/13
to

Quoth "Dave Saville" <da...@invalid.invalid>:
Have you considered using SQLite? DBD::SQLite has the option of creating
an in-memory database, which may well be faster and will certainly be
easier than trying to write joins in Perl. (I don't know if it will be
faster because DBI assumes it's talking to a remote database server, so
it tends to be quite heavy. A straight SQLite binding to Perl would be
what you want, but I don't think that exists.)

Ben

Rainer Weikusat

unread,
May 14, 2013, 1:27:43 PM5/14/13
to
Why would this be 'a fair test' when the keys are copied for no
particular reason while no attempt is made to determine the values
except for the 'values' and 'each in list context' cases? Iterating
over the keys of a hash while not looking at the values associated
with those keys at all seems to be a rather bizarre idea of a use
case.

Willem

unread,
May 14, 2013, 1:58:38 PM5/14/13
to
Rainer Weikusat wrote:
) Why would this be 'a fair test' when the keys are copied for no
) particular reason while no attempt is made to determine the values
) except for the 'values' and 'each in list context' cases? Iterating
) over the keys of a hash while not looking at the values associated
) with those keys at all seems to be a rather bizarre idea of a use
) case.

Perl doesn't have a 'set' type, and typically a hash is used for that, and
that is a perfectly legitimate use case for using only the keys of a hash.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Dr.Ruud

unread,
May 14, 2013, 2:35:03 PM5/14/13
to
On 14/05/2013 18:40, Ben Morrow wrote:

> I'm not sure why
> 'values' is cheaper than 'keys', but I suspect it has something to do
> with the fact that hash keys are shared.

perldoc -f keys:

The returned values are copies of the original keys in the hash, so
modifying them will not affect the original hash.


perldoc -f values:

Note that the values are not copied, which means modifying them will
modify the contents of the hash

--
Ruud

Rainer Weikusat

unread,
May 14, 2013, 2:41:51 PM5/14/13
to
Willem <wil...@turtle.stack.nl> writes:
> Rainer Weikusat wrote:
> ) Why would this be 'a fair test' when the keys are copied for no
> ) particular reason while no attempt is made to determine the values
> ) except for the 'values' and 'each in list context' cases? Iterating
> ) over the keys of a hash while not looking at the values associated
> ) with those keys at all seems to be a rather bizarre idea of a use
> ) case.
>
> Perl doesn't have a 'set' type, and typically a hash is used for that, and
> that is a perfectly legitimate use case for using only the keys of a
> hash.

While you're of course right on that, I regard this is 'rather
bizarre' use case for a hash.

Rainer Weikusat

unread,
May 14, 2013, 3:26:01 PM5/14/13
to
"Dave Saville" <da...@invalid.invalid> writes:
> On Tue, 14 May 2013 14:49:46 UTC, Rainer Weikusat
> <rwei...@mssgmbh.com> wrote:

[...]

>> use Benchmark qw(cmpthese);

[...]

>> cmpthese(-4,

[...]

> ===
> 100 keys
> ===
> Rate sorted_keys scalar_each list_each keys
> values
> sorted_keys 5715/s -- -28% -36% -46%
> -85%
> scalar_each 7956/s 39% -- -10% -25%
> -80%
> list_each 8863/s 55% 11% -- -16%
> -77%
> keys 10602/s 85% 33% 20% --
> -73%
> values 39161/s 585% 392% 342% 269%
> --
>
> Care to explain the numbers please?

The first column is the 'speed' in terms of 'executions per second',
the other columns show the 'speed differences' of the code of the
current row relative to the others, expressed as percentage of the
speed of the 'column algorithm'. Eg, for 'keys', the sorted_keys
column says '85%'. This is calculated as

int((10602 - 5715) / 5715 * 100)

Ben Morrow

unread,
May 14, 2013, 4:04:47 PM5/14/13
to

Quoth Rainer Weikusat <rwei...@mssgmbh.com>:
> Ben Morrow <b...@morrow.me.uk> writes:
> > Quoth Rainer Weikusat <rwei...@mssgmbh.com>:
> >>
Reading just the keys of a hash is somewhat common, for instance if the
has is being used to implement uniq or if the values are not relevant at
the moment. Dave's use case (doing a join over two hashes) starts by
determining a list of the keys present in both hashes, which means first
getting a list of the keys in one hash.

Iterating over just the values, on the other hand, is much less common;
the only use I can remember having made of it is when dealing with a
data structure like

{
foo => { name => "foo", ... },
bar => { name => "bar", ... },
}

where the key is already duplicated inside the value.

In any case, I thought the purpose here was to benchmark 'for (keys)'
and 'while (each)' as methods of iterating over a hash. The difference
between them is so small that a single extra assignment, or an
assignment that copies a string rather than a number, will completely
swamp the difference you are trying to measure, at least until you get
to the point where the extra memory allocated by keys causes the process
to start swapping.

Ben

Rainer Weikusat

unread,
May 14, 2013, 5:29:07 PM5/14/13
to
Ben Morrow <b...@morrow.me.uk> writes:

[...]

> In any case, I thought the purpose here was to benchmark 'for (keys)'
> and 'while (each)' as methods of iterating over a hash. The difference
> between them is so small that a single extra assignment, or an
> assignment that copies a string rather than a number, will completely
> swamp the difference you are trying to measure, at least until you get
> to the point where the extra memory allocated by keys causes the process
> to start swapping.

Testing just this, the result is (as I already knew from a past
thread) that keys is (on the computer where I tested this)
significantly faster than each for hashes with up to 10,000 keys.

----------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
my %h = map { $_, 1; } 0 .. $_[0];


print("\n===\n$_[0] keys\n===\n");

cmpthese(-5,
{
keys => sub {
1 for keys(%h);
},

each => sub {
my $k;

1 while $k = each(%h);
}});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;
-----------------

I'm somewhat uncertain what "it is possible to keep adding unrelated
code to each example until that dominates the execution time so
overwhelmingly that this difference can't be measured anymore" is
supposed to communicate in this context ...

Ben Morrow

unread,
May 15, 2013, 6:32:37 AM5/15/13
to

Quoth Rainer Weikusat <rwei...@mssgmbh.com>:
> Ben Morrow <b...@morrow.me.uk> writes:
>
> [...]
>
> > In any case, I thought the purpose here was to benchmark 'for (keys)'
> > and 'while (each)' as methods of iterating over a hash. The difference
> > between them is so small that a single extra assignment, or an
> > assignment that copies a string rather than a number, will completely
> > swamp the difference you are trying to measure, at least until you get
> > to the point where the extra memory allocated by keys causes the process
> > to start swapping.
>
> Testing just this, the result is (as I already knew from a past
> thread) that keys is (on the computer where I tested this)
> significantly faster than each for hashes with up to 10,000 keys.
>
> ----------------
> use Benchmark qw(cmpthese);
>
> sub traversal_bench
> {
> my %h = map { $_, 1; } 0 .. $_[0];
>
>
> print("\n===\n$_[0] keys\n===\n");
>
> cmpthese(-5,
> {
> keys => sub {
> 1 for keys(%h);
> },
>
> each => sub {
> my $k;
>
> 1 while $k = each(%h);

Did you actually read what I wrote? That 'each' benchmark does an extra
assignment, so of *course* it's slower.

Ben

George Mpouras

unread,
May 15, 2013, 7:03:54 AM5/15/13
to
Στις 14/5/2013 16:37, ο/η Dave Saville έγραψε:
> I am trying to do inner and outer joins amongst other things. The

if you explain what to you mean with "inner and outer joins" I could
provide a small piece of code.

Rainer Weikusat

unread,
May 15, 2013, 7:51:38 AM5/15/13
to
Ben Morrow <b...@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rwei...@mssgmbh.com>:
>> Ben Morrow <b...@morrow.me.uk> writes:

[...]

>> Testing just this, the result is (as I already knew from a past
>> thread) that keys is (on the computer where I tested this)
>> significantly faster than each for hashes with up to 10,000 keys.
>>
>> ----------------
>> use Benchmark qw(cmpthese);
>>
>> sub traversal_bench
>> {
>> my %h = map { $_, 1; } 0 .. $_[0];
>>
>>
>> print("\n===\n$_[0] keys\n===\n");
>>
>> cmpthese(-5,
>> {
>> keys => sub {
>> 1 for keys(%h);
>> },
>>
>> each => sub {
>> my $k;
>>
>> 1 while $k = each(%h);
>
> Did you actually read what I wrote? That 'each' benchmark does an extra
> assignment, so of *course* it's slower.

The 'each' benchmark does not do 'an extra assignment': In order to
use the key (similar to the earlier one which purposely included an
operation extracting the value in every subroutine iterating over the
keys) in a loop, it has to be stored somewhere (aka 'assigned to
something') which is part of the cost of using each, whereas (term?)
keys can be used with for like any other 'term' (term?) returning a
list. Because of this

1. If iterating over the keys and (usually) examining the values is
desired, keys is the sensible choice in almost all cases and each
becomes the sensible choice once the hashes become 'large'.

2. If iterating over the values is sufficient, values is the way to
go, except possibly for extremly large hashes (> 1,000,000 keys for
this example).

3. If a predictable 'iteration order' is desired/ required, there's
not much which can be done except sorting the keylist beforehand (But
when comparing hashes, this can be avoided by iterating over the keys
of one hash, looking for each key in the other hash and deleteing it
if it was found. If the loop didn't terminate early because of a
mismatch, whatever is left in the second after it has was stuff not
contained in the first).

Dave Saville

unread,
May 15, 2013, 8:13:59 AM5/15/13
to
On Wed, 15 May 2013 11:03:54 UTC, George Mpouras
<nospam.gravital...@hotmail.noads.com> wrote:

> 14/5/2013 16:37, / Dave Saville :
> > I am trying to do inner and outer joins amongst other things. The
>
> if you explain what to you mean with "inner and outer joins" I could
> provide a small piece of code.
>

Well my SQl is a bit rusty :-)

But the code is easy enough.

I have three hashes S, C & H holding 60K+ keys of 70 characters and
differing only slightly in that much of the keys are the same. Think
of a list of fully qualified filenames on your hard drive.

I need (at least)

In S in H not in C
in C in H not in S
in S not in H
in C not in H
in S in C

With that number of keys I was trying to avoid any unnecessary
extraction of keys to arrays like you can't avoid when you use "sort
keys".

Ben: I think a DB would be overkill - and don't forget I would not be
able to install any modules anyway. As you can see above my
requirements are not really joins in the traditional sense. I thought
saying joins would be close enough when the real problem is
efficiently iterating over multpile hashes.

Thanks all for your thoughts.
--
Regards
Dave Saville

Ben Morrow

unread,
May 15, 2013, 8:44:39 AM5/15/13
to

Quoth "Dave Saville" <da...@invalid.invalid>:
>
> I have three hashes S, C & H holding 60K+ keys of 70 characters and
> differing only slightly in that much of the keys are the same. Think
> of a list of fully qualified filenames on your hard drive.
>
> I need (at least)
>
> In S in H not in C
> in C in H not in S
> in S not in H
> in C not in H
> in S in C

You don't need to sort to get these, at which point whether you use
'keys' or 'each' is not terribly important.

for (keys %S) {
if (exists $C{$_}) {
# in S in C
}
else {
if (exists $H{$_}) {
# in S in H not in C
}
}
# and so on...
}

By the looks of it you will have to iterate S once and C once; this
should not be a performance problem. If you're using an itty-bitty
machine (I don't know what OS/2 runs on these days) you may find

while (my $k = each %S) {

is faster; the break-even point is well above 60_000 on my machine, but
might not be on yours.

> Ben: I think a DB would be overkill

SQLite isn't really a 'DB' in the traditional sense, it's more of a
lightweight SQL library.

> - and don't forget I would not be able to install any modules anyway.

Grrrmph. There's *almost* no point using Perl at all if you can't use
CPAN...

> As you can see above my
> requirements are not really joins in the traditional sense. I thought
> saying joins would be close enough when the real problem is
> efficiently iterating over multpile hashes.

I would exactly describe them as joins, and the advantage of using an
SQL system is that it has data structures designed to make joins easy.

Ben

0 new messages