
Get an arbitrary hash key, quickly.


xho...@gmail.com

Jan 24, 2008, 12:25:07 PM
I need a work queue, but I don't really care about the order (LIFO, FIFO,
random) in which things come out of it. Either is pretty simple and
efficient with a Perl array, and either would suffice. But I want the
queue to not hold duplicate entries. I could use an array as a stack or
queue, with a parallel hash to be checked to prevent duplicates from being
entered. But why keep parallel data structures? Just use the hash as the
queue:

while (keys %hash) {
    my $to_do = each %hash;
    delete $hash{$to_do};
    ## do stuff with $to_do, which might make new entries in %hash
};

Well, except that this gets really slow on large hashes.

I suspect the each operator does a linear scan of all the buckets until it
finds an occupied one. With a hash that used to be big but now isn't (and
so has lots of empty buckets), this behaves poorly, essentially being N^2
to empty a large hash.

If I just use scalar %hash instead of scalar keys %hash, then the slowdown
doesn't occur, because "each" scans the buckets from where it left off last
time, instead of from the beginning each time. (At first I thought it was
scalar keys %hash that was being slow and was going to post about *that*,
but then I realized that keys' resetting of the iterator was causing "each"
to be slow.) But you shouldn't change a hash at the same time you iterate
over it, because things might get missed. Presumably, anything missed will
be found on the next time through the iterator, so this might work without
the slowdown:


while (%hash) {                 # does not reset iterator
    my $to_do = each %hash;
    next unless defined $to_do; # not a real hash key,
                                # signals end of iterator
    ## do stuff with $to_do, which might make new entries in %hash
};
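
For anyone who wants to see the difference for themselves, here is a rough,
self-contained timing sketch comparing the two loops above (the sizes are
arbitrary, nothing new is added so both versions just drain the hash, and
whether the quadratic behaviour shows up will depend on your perl):

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

for my $n (10_000, 20_000, 40_000) {
    my %hash = map { $_ => 1 } 1 .. $n;

    my $t0 = time;
    while (keys %hash) {                # resets the iterator every pass
        my $to_do = each %hash;
        delete $hash{$to_do};
    }
    printf "reset     n=%6d: %.2fs\n", $n, time - $t0;

    %hash = map { $_ => 1 } 1 .. $n;
    $t0 = time;
    while (%hash) {                     # does not reset the iterator
        my $to_do = each %hash;
        next unless defined $to_do;
        delete $hash{$to_do};
    }
    printf "no reset  n=%6d: %.2fs\n", $n, time - $t0;
}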

If my speculation on the internals is right, then this would still
perform poorly if the hash first grows very large, then shrinks to
be only a few elements, but those last few elements keep triggering
the addition of just one more element each time. In my case, that
shouldn't be a problem.

Any comments on this code? Is there some less quirky way to do this
efficiently? Is there a simple (as simple as perl's internals can get)
way to fix "each" so that it doesn't have this degenerate case?

Thanks,

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

nolo contendere

Jan 24, 2008, 12:43:32 PM
On Jan 24, 12:25 pm, xhos...@gmail.com wrote:
> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it.  Either is pretty simple and
> efficient with a Perl array, and either would suffice.  But I want the
> queue to not hold duplicate entries.  I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered.  But why keep parallel data structures?  Just use the hash as the
> queue:
>
> while (key %hash) {
>   my $to_do=each %hash;
>   delete $hash{$to_do};
>   ## do stuff with $to_do, which might make new entries in %hash
>
> };
>

if "do stuff" might make entries in %hash, how do you guarantee you
won't be operating on a duplicate entry, since you delete
$hash{$to_do}?

xho...@gmail.com

Jan 24, 2008, 1:07:41 PM
nolo contendere <simon...@fmr.com> wrote:

> On Jan 24, 12:25 pm, xhos...@gmail.com wrote:
> > I need a work queue, but I don't really care about the order (LIFO,
> > FIFO, random) in which things come out of it.  Either is pretty
> > simple and efficient with a Perl array, and either would suffice.
> > But I want the queue to not hold duplicate entries.  I could use
> > an array as a stack or queue, with a parallel hash to be checked to
> > prevent duplicates from being entered.  But why keep parallel data
> > structures?  Just use the hash as the queue:
> >
> > while (keys %hash) {
> >     my $to_do = each %hash;
> >     delete $hash{$to_do};
> >     ## do stuff with $to_do, which might make new entries in %hash
> > };
> >
>
> if "do stuff" might make entries in %hash, how do you guarantee you
> won't be operating on a duplicate entry, since you delete
> $hash{$to_do}?

I want to guarantee that duplicate entries won't be in the queue *at the
same time*. Whether they can be re-entered at a later point is up to
the details of "do stuff", which vary from project to project and which
I am intentionally avoiding.

Uri Guttman

Jan 24, 2008, 1:25:10 PM
>>>>> "x" == xhoster <xho...@gmail.com> writes:

x> I need a work queue, but I don't really care about the order (LIFO, FIFO,
x> random) in which things come out of it. Either is pretty simple and
x> efficient with a Perl array, and either would suffice. But I want the
x> queue to not hold duplicate entries. I could use an array as a stack or
x> queue, with a parallel hash to be checked to prevent duplicates from being
x> entered. But why keep parallel data structures? Just use the hash as the
x> queue:

why not use the hash/array pair? and how do you generate dup work
request keys anyway? why not use a ref to the work thing as the key as
that will be unique. then an array will be fine.

x> while (key %hash) {
x> my $to_do=each %hash;

why not do this instead?

while( my $to_do = each %hash ) {

    # do work
    delete $hash{$to_do};
    keys %hash;     # reset iterator
}

the docs say IIRC that keys in a void context will reset the
iterator. keys in a while condition is in a scalar/boolean context and
may just return the number of keys.

and as another poster says, adding hash elements in a loop may change the
iterator order and cause unknown results. you need to do a clean reset
of the hash iterator so the each will work correctly.

x> If I just use scalar %hash instead of scalar keys %hash, then the slow down
x> doesn't occur because "each" scans the buckets from where it left off last
x> time, instead of from the beginning each time. (At first I thought it was
x> scalar keys %hash that was being slow and was going to post about *that*,
x> but then I realized that keys' reseting of the iterator was causing "each"
x> to be slow). But you shouldn't change a hash at the same time you iterate
x> over it because things might get missed. Presumably, anything missed will
x> be found on the next time through the iterator, so this might work without
x> the slowdown:

and you may not see all the elements of the hash as the iterator could
skip over new entries.

the best answer is what i said above: do a proper iterator reset after
you process the current work (actually just after the each() call should
work fine too, and that would be clearer).

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

xho...@gmail.com

Jan 24, 2008, 1:59:24 PM
Uri Guttman <u...@stemsystems.com> wrote:
> >>>>> "x" == xhoster <xho...@gmail.com> writes:
>
> x> I need a work queue, but I don't really care about the order (LIFO,
> x> FIFO, random) in which things come out of it.  Either is pretty
> x> simple and efficient with a Perl array, and either would suffice.
> x> But I want the queue to not hold duplicate entries.  I could use an
> x> array as a stack or queue, with a parallel hash to be checked to
> x> prevent duplicates from being entered.  But why keep parallel data
> x> structures?  Just use the hash as the queue:
>
> why not use the hash/array pair?

Parallel data structures are ugly and easy to screw up because they always
need to be kept in sync. Also, they need to use more memory, which in some
cases may be important. But other than that, there is no reason not to.
That is what I often do, and it works, I just want something neater.


> and how do you generate dup work
> request keys anyway? why not use a ref to the work thing as the key as
> that will be unique. then an array will be fine.

I don't understand what you are saying. What stops an array from
containing many duplicate values, be they references or anything else?

>
> x> while (key %hash) {
> x> my $to_do=each %hash;
>
> why not do this instead?
>
> while( my $todo = each %hash ) {
>
> #do work
> delete $hash{$to_do};
> keys %hash ; # reset iterator

Effectively the same thing, just re-worded. Works, but has the
same slowness issue, presumably for the same reason.

>
> the docs say IIRC that keys in a void context will reset the
> iterator. keys in a while condition is in a scalar/boolean context and
> may just return the number of keys.

Nah, it resets in all contexts. In a void context, that is all it does.
(i.e. it doesn't build the return list and then throw it away, or
something like that.)
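
A quick way to see the reset (a throwaway snippet; the hash contents don't
matter):

my %h = (a => 1, b => 2, c => 3);
my $first = each %h;    # advances the iterator
keys %h;                # void context: resets the iterator, builds no list
my $again = each %h;    # iteration starts over from the beginning
print "iterator was reset\n" if $first eq $again;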

Mark Clements

Jan 24, 2008, 2:29:48 PM
xho...@gmail.com wrote:
> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (key %hash) {
> my $to_do=each %hash;
> delete $hash{$to_do};
> ## do stuff with $to_do, which might make new entries in %hash
> };
>
> Well, except that this gets really slow on large hashes.
>

Have you looked at one of the Heap modules, eg Heap::Simple?

Mark

nolo contendere

Jan 24, 2008, 2:33:50 PM
On Jan 24, 1:59 pm, xhos...@gmail.com wrote:
> Uri Guttman <u...@stemsystems.com> wrote:
> > >>>>> "x" == xhoster  <xhos...@gmail.com> writes:
>
>
>
>
> >   x> while (key %hash) {
> >   x>   my $to_do=each %hash;
>
> > why not do this instead?
>
> > while( my $todo = each %hash ) {
>
> >         #do work
> >         delete $hash{$to_do};
> >         keys %hash ;    # reset iterator
>
> Effectively the same thing, just re-worded.  Works, but has the
> same slowness issue, presumably for the same reason.
>

So if you remove the keys %hash in the above scenario, would the
condition to 'while' evaluate to true after the iterator reached the
end, if keys were added? Will the iterator "know" that it needs to loop
around again? It's unclear from perldoc -f each:

When the hash is entirely read, a null array is
returned in list context (which when assigned
produces a false (0) value), and "undef" in scalar
context. The next call to "each" after that will
start iterating again. There is a single iterator
for each hash, shared by all "each", "keys", and
"values" function calls in the program; it can be
reset by reading all the elements from the hash, or
by evaluating "keys HASH" or "values HASH". If you
add or delete elements of a hash while you're
iterating over it, you may get entries skipped or
duplicated, so don't. Exception: It is always safe
to delete the item most recently returned by
"each()", which means that the following code will
work:

    while (($key, $value) = each %hash) {
        print $key, "\n";
        delete $hash{$key};    # This is safe
    }

It says undef will be returned if the hash is entirely read. But since
the hash is not empty, is it considered "entirely read"? Also, it says
it's always safe to delete the most recently returned item, but sounds
like it's UNsafe to add elements while iterating.

Based on this, Uri's way (which is essentially the same as your
original slow way, if a tad bit cleaner) seems the correct way.

A. Sinan Unur

Jan 24, 2008, 2:43:52 PM
xho...@gmail.com wrote in news:20080124122509.078$o...@newsreader.com:

> I need a work queue, but I don't really care about the order (LIFO,
> FIFO, random) in which things come out of it. Either is pretty simple
> and efficient with a Perl array, and either would suffice. But I want
> the queue to not hold duplicate entries. I could use an array as a
> stack or queue, with a parallel hash to be checked to prevent
> duplicates from being entered. But why keep parallel data structures?
> Just use the hash as the queue:
>
> while (key %hash) {
> my $to_do=each %hash;
> delete $hash{$to_do};
> ## do stuff with $to_do, which might make new entries in %hash
> };
>
> Well, except that this gets really slow on large hashes.

It is possible I am misunderstanding what you are doing, but I would
have written:

#!/usr/bin/perl

use strict;
use warnings;

my %queue;
$queue{ $_ } = undef for 1 .. 1_000_000;

while ( my @keys = keys %queue ) {
    for my $task ( @keys ) {
        delete $queue{ $task };
        #
        # Do something
        #
        # Add another task with a small probability
        #
        if ( 0.1 > rand ) {
            $queue{ 1 + int(rand 1_000_000) } = undef;
        }
    }
}


C:\Home\asu1\Src\test\hash_queue> timethis hq

TimeThis : Command Line : hq
TimeThis : Start Time : Thu Jan 24 14:42:36 2008
TimeThis : End Time : Thu Jan 24 14:42:46 2008
TimeThis : Elapsed Time : 00:00:09.312

Sinan

--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
clpmisc guidelines: <URL:http://www.augustmail.com/~tadmc/clpmisc.shtml>

nolo contendere

Jan 24, 2008, 3:24:59 PM
On Jan 24, 2:43 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:

That was my initial thought as well (well, something like that
anyway), and I think that was Xho's initial thought. But I think he's
trying to devise a solution that doesn't require both %queue and
@keys.

xho...@gmail.com

Jan 24, 2008, 3:43:33 PM
nolo contendere <simon...@fmr.com> wrote:

> On Jan 24, 1:59 pm, xhos...@gmail.com wrote:
> > Uri Guttman <u...@stemsystems.com> wrote:
> > > >>>>> "x" == xhoster <xhos...@gmail.com> writes:
> >
> > > x> while (keys %hash) {
> > > x>     my $to_do = each %hash;
> >
> > > why not do this instead?
> >
> > > while( my $to_do = each %hash ) {
> >
> > >     # do work
> > >     delete $hash{$to_do};
> > >     keys %hash;     # reset iterator
> >
> > Effectively the same thing, just re-worded. Works, but has the
> > same slowness issue, presumably for the same reason.
> >
>
> So if you remove the keys %hash in the above scenario, would the
> condition to 'while' evaluate to true after the iterator reached the
> end, if keys were added?

No. Once the iterator reaches the end, it will return undef and the loop
will exit. So as far as the loop is concerned, there is no "after"
the iterator reaches the end, because the loop is done.

> will the iterator "know" that it needs to loop
> around again?

Only if you happen to invoke it once more after it has already indicated
that it was done. But the loop, without the reset, doesn't invoke it
again.


> it's unclear from the perldoc -f each:
>
> When the hash is entirely read, a null array is
> returned in list context (which when assigned
> produces a false (0) value), and "undef" in scalar
> context. The next call to "each" after that will
> start iterating again. There is a single iterator
> for each hash, shared by all "each", "keys", and
> "values" function calls in the program; it can be
> reset by reading all the elements from the hash, or
> by evaluating "keys HASH" or "values HASH". If you
> add or delete elements of a hash while you're
> iterating over it, you may get entries skipped or
> duplicated, so don't. Exception: It is always safe
> to delete the item most recently returned by
> "each()", which means that the following code will
> work:
>

>     while (($key, $value) = each %hash) {
>         print $key, "\n";
>         delete $hash{$key};    # This is safe
>     }
>
> It says undef will be returned if the hash is entirely read. But since
> the hash is not empty, is it considered "entirely read"?

Of course. Otherwise, if you did an each without a delete (which
is the most common way to use each), then it would be an infinite loop.

> Also, it says
> it's always safe to delete the most recently returned item, but sounds
> like it's UNsafe to add elements while iterating.

If you violate their warning about adding things to the hash while
iterating, then it may be deemed fully read even though some elements have
not been iterated over during this current round of iteration. So if you
violate that warning, then when each returns empty you have to use some
other way to make sure the hash itself is empty.

I'm pretty sure it is safe in the sense that it will not segfault or
anything like that, it is only unsafe towards your assumptions that
everything has been seen.
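
In other words, something along these lines (just a sketch; it is the same
idea as the loop in my original post, with the "is it really empty?" check
made explicit, and "do stuff" still must not delete other entries):

while (%hash) {        # cheap emptiness check, does not reset the iterator
    while (defined(my $to_do = each %hash)) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might make new entries in %hash
    }
    # each() is exhausted (and the iterator is reset); the outer condition
    # re-checks whether anything was added or skipped and, if so, starts
    # another round of iteration
}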


> based on this, Uri's way (which is essentially the same as your
> original slow way, if a tad bit cleaner), seems the correct way.

I don't see that. Both methods always reset the iterator between the time
additions are made and the time each is called again. (Well, Uri's doesn't
reset the iterator before the very first each, which presumably comes
after some insert which was done before the loop was reached, but surely
that is not a problem.)

nolo contendere

Jan 24, 2008, 3:54:41 PM
On Jan 24, 3:43 pm, xhos...@gmail.com wrote:


True, the flaw in this algorithm is that it always resets the
iterator. Personally, I like Sinan's implementation.

A. Sinan Unur

Jan 24, 2008, 4:00:34 PM
nolo contendere <simon...@fmr.com> wrote in
news:36b5756e-5bfd-4d44...@t1g2000pra.googlegroups.com:

> On Jan 24, 2:43 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
>> xhos...@gmail.com wrote in news:20080124122509.078$o...@newsreader.com:
>>
>> > I need a work queue, but I don't really care about the order (LIFO,
>> > FIFO, random) in which things come out of it.  Either is pretty
>> > simple
>

...

>> > Well, except that this gets really slow on large hashes.
>>
>> It is possible I am misunderstanding what you are doing, but I would
>> have written:

Obviously, I was.

...

> That was my initial thought as well (well, something like that
> anyway), and I think that was Xho's initial thought. But I think he's
> trying to devise a solution that doesn't require both %queue and
> @keys.

Thank you for the clarification.

Ben Morrow

Jan 24, 2008, 7:05:49 PM

Quoth xho...@gmail.com:

> I need a work queue, but I don't really care about the order (LIFO, FIFO,
> random) in which things come out of it. Either is pretty simple and
> efficient with a Perl array, and either would suffice. But I want the
> queue to not hold duplicate entries. I could use an array as a stack or
> queue, with a parallel hash to be checked to prevent duplicates from being
> entered. But why keep parallel data structures? Just use the hash as the
> queue:
>
> while (key %hash) {
> my $to_do=each %hash;
> delete $hash{$to_do};
> ## do stuff with $to_do, which might make new entries in %hash
> };
>
> Well, except that this gets really slow on large hashes.

I would have thought the obvious method would be

my (@queue, %done);

while (my $to_do = shift @queue) {
    exists $done{$to_do} and next;
    $done{$to_do} = 1;

    process_stuff($to_do);
}

Obviously, if you wanted to deliberately requeue something you would
need to know that, and delete the hash entry. I realize you're trying to
avoid parallel data structures, but %done is pretty much
self-maintaining, so I don't know if that applies here.

> I suspect the each operator does a linear scan of all the buckets until it
> finds an occupied one. With a hash that used to be big but now isn't (and
> so has lots of empty buckets), this behaves poorly, essentially being N^2
> to empty a large hash.

Presumably you could fix this with

%hash = %hash;

if that would be helpful. Of course, you'll copy everything.

> If I just use scalar %hash instead of scalar keys %hash, then the slow down
> doesn't occur because "each" scans the buckets from where it left off last
> time, instead of from the beginning each time. (At first I thought it was
> scalar keys %hash that was being slow and was going to post about *that*,
> but then I realized that keys' reseting of the iterator was causing "each"
> to be slow). But you shouldn't change a hash at the same time you iterate
> over it because things might get missed. Presumably, anything missed will
> be found on the next time through the iterator, so this might work without
> the slowdown:

Presumably also things might get found twice, depending on how the
buckets get reorganized. Since you're always deleting the current
element before calling each again that shouldn't be a problem here.

> Any comments on this code? Is there some less quirky way to do this
> efficiently? Is there a simple (as simple as perl's internals can get)
> way to fix "each" so that it doesn't have this degenerate case?

Presumably switching to something like BerkeleyDB::BTREE would also fix
this problem, since btrees are designed to be iterated over. I don't
know how much grief the tie overhead would cause in this case;
presumably an implementation could be made that used U/u magic directly
rather than tie magic, and so avoided the method calls.
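
For what it's worth, a minimal sketch of the tie (this uses the older
DB_File interface rather than the BerkeleyDB module, 'workqueue.db' is a
made-up file name, and the batch-of-keys loop is just to sidestep any
questions about deleting while a tied iterator is live):

#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# a btree keeps its keys ordered, so finding the "next" key never
# degrades into a scan of empty hash buckets
tie my %queue, 'DB_File', 'workqueue.db', O_CREAT | O_RDWR, 0666, $DB_BTREE
    or die "cannot tie workqueue.db: $!";

$queue{$_} = 1 for 1 .. 1000;        # seed some work

while (my @batch = keys %queue) {
    for my $to_do (@batch) {
        delete $queue{$to_do};
        ## do stuff with $to_do, which might add new entries to %queue
    }
}

untie %queue;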

Ben

nolo contendere

Jan 25, 2008, 9:37:10 AM
On Jan 24, 7:05 pm, Ben Morrow <b...@morrow.me.uk> wrote:

> Presumably you could fix this with

...


> Presumably also things might get found twice, depending on how the

...


>
> Presumably switching to something like BerkeleyDB::BTREE would also fix

...


> presumably an implementation could be made that used U/u magic directly
> rather than tie magic, and so avoided the method calls.
>

You're very presumptuous ;-).

xho...@gmail.com

Jan 25, 2008, 12:16:38 PM

Now we've changed from an extra parallel structure to an extra caching
structure, which of course works but has the same inelegance. (Grepping
back through years of code, I see that I've actually used this method many
years ago, too, but from the comments it looks like I didn't understand why
the original was slow, so I regarded it as trial and error magic.) It does
require that "do something" can only add and not delete things from %queue,
which is fine with my original specification but lacks maximal flexibility.

There are many ways to work around this degradation in performance, but it
seems like all of them impose other limitations or inelegances. Oh well,
if these things were too easy I guess anyone could do it and I'd be out of
a job :)

xho...@gmail.com

Jan 25, 2008, 12:32:23 PM
Ben Morrow <b...@morrow.me.uk> wrote:
> Quoth xho...@gmail.com:
> > I need a work queue, but I don't really care about the order (LIFO,
> > FIFO, random) in which things come out of it. Either is pretty simple
> > and efficient with a Perl array, and either would suffice. But I want
> > the queue to not hold duplicate entries. I could use an array as a
> > stack or queue, with a parallel hash to be checked to prevent
> > duplicates from being entered. But why keep parallel data structures?
> > Just use the hash as the queue:
> >
> > while (key %hash) {
> > my $to_do=each %hash;
> > delete $hash{$to_do};
> > ## do stuff with $to_do, which might make new entries in %hash
> > };
> >
> > Well, except that this gets really slow on large hashes.
>
> I would have thought the obvious method would be
>
> my (@queue, %done);
>
> while (my $to_do = shift @queue) {
> exists $done{$to_do} and next;
> $done{$to_do} = 1;
>
> process_stuff($to_do);
> }

The problem here is that, under some uses, @queue can accumulate a fatal
number of replicates. While %done prevents them from being processed
more than once, they are still stored more than once. If each of 10_000
work-items builds up on the queue 10_000 times, that becomes a storage
problem. So you need to use %done to guard against adding replicates into
@queue to defend memory usage.

But replicates can still build up to unacceptable levels before the first
instance of one ever gets $done{}. So then you need another hash,
%queued_but_not_done, to reject duplicate entries at the earliest stage.
And then you say "This is a mess, let's just use one hash", then you
discover that "each" performs poorly on a shrinking hash, and then you post
a message here. :)
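
Spelled out, that intermediate version looks something like this (just a
sketch with made-up names, and exactly the kind of bookkeeping the
single-hash version is trying to avoid):

my (@queue, %in_queue);

# reject duplicates at enqueue time, so @queue never holds more than one
# copy of any outstanding work item
sub enqueue {
    my ($item) = @_;
    push @queue, $item unless $in_queue{$item}++;
}

enqueue($_) for 1 .. 10;    # seed some work

while (@queue) {
    my $to_do = shift @queue;
    delete $in_queue{$to_do};    # allows $to_do to be queued again later
    ## do stuff with $to_do, which might call enqueue() for new work
}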

nolo contendere

Jan 25, 2008, 1:02:45 PM
On Jan 25, 12:32 pm, xhos...@gmail.com wrote:
...

>
> But replicates can still build up to unacceptable levels before the first
> instance of one ever gets $done{}.  So then you need another hash,
> %queued_but_not_done, to reject duplicate entries at the earliest stage.

From this I infer that you are engaging in parallel processing. Is
this correct?

comp.llang.perl.moderated

Jan 25, 2008, 1:39:01 PM

Even with tie slowness, a Tie::IxHash solution would at least be more
straightforward since insertion order is maintained... I think.

tie my %hash, 'Tie::IxHash';
my $idx = 0;

while ( my @queue = keys %hash ) {
    my $to_do = $queue[$idx++];
    delete $hash{ $to_do };

    # do stuff with $to_do, which might make
    # new entries in %hash
}

--
Charles DeRykus

comp.llang.perl.moderated

Jan 25, 2008, 1:57:15 PM
On Jan 25, 10:39 am, "comp.llang.perl.moderated" <c...@blv-sam-01.ca.boeing.com> wrote:
> On Jan 24, 9:25 am, xhos...@gmail.com wrote:
>
>
> ...

>
> Even with tie slowness, a Tie::IxHash solution would at least be more
> straightforward since insertion order is maintained... I think.
>
> tie my %hash, 'Tie::IxHash';
> my $idx = 0;
>
> while ( my @queue = keys %hash ) {
> my $to_do = $queue[$idx++];
> delete $hash{ $to_do };
> # do stuff with $to_do, which might make
> # new entries in %hash
>
> }

Well, it seemed like a good idea a minute ago.
But, you might need to tweak $idx for instance...
Maybe other problems too...

while ( my @queue = keys %hash ) {
    $idx = $#queue if $idx > $#queue;

Ben Morrow

Jan 25, 2008, 5:35:33 PM

Quoth "comp.llang.perl.moderated" <c...@blv-sam-01.ca.boeing.com>:

> On Jan 24, 9:25 am, xhos...@gmail.com wrote:
>
> Even with tie slowness, a Tie::IxHash solution would at least be more
> straightforward since insertion order is maintained... I think.

I have to admit I've never seen the point of Tie::IxHash. It's just the
same as the parallel array/hash Xho'd already rejected, but slower...

Ben

xho...@gmail.com

Jan 26, 2008, 8:05:03 PM
nolo contendere <simon...@fmr.com> wrote:
> On Jan 25, 12:32 pm, xhos...@gmail.com wrote:
> ...
>
> > But replicates can still build up to unacceptable levels before the
> > first instance of one ever gets $done{}.  So then you need another
> > hash, %queued_but_not_done, to reject duplicate entries at the earliest
> > stage.
>
> From this I infer that you are engaging in parallel processing. Is
> this correct?

I have run into these constructs in parallel processing, but that was not
the immediate case here.  I was computing a single-linkage hierarchical
clustering tree from a distance matrix that is way too large for the COTS
software I have access to.  I've also run into it in certain traversals
of extremely large graphs.

nolo contendere

Jan 28, 2008, 10:01:34 AM
On Jan 26, 8:05 pm, xhos...@gmail.com wrote:

> nolo contendere <simon.c...@fmr.com> wrote:
> > On Jan 25, 12:32 pm, xhos...@gmail.com wrote:
> > ...
>
> > > But replicates can still build up to unacceptable levels before the
> > > first instance of one ever gets $done{}.  So then you need another
> > > hash, %queued_but_not_done, to reject duplicate entries at the earliest
> > > stage.
>
> > From this I infer that you are engaging in parallel processing. Is
> > this correct?
>
> I have run into these constructs in parallel processing, but that was not
> the immediate case here.  I was computing a single-linkage hierarchical
> clustering tree from a distance matrix that is way too large for the COTS
> software I have access to handle.  I've also ran into in certain traversals
> of extremely large graphs.


yeah, well...that was my second guess.

comp.llang.perl.moderated

Jan 28, 2008, 5:52:25 PM
On Jan 24, 9:25 am, xhos...@gmail.com wrote:

Do you need 'each' since values don't seem
to be retrieved...Some simple benchmarks
suggest just looping over the keys would
be quite a bit faster if that's the case:

QUEUE: {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        # make new entries in %hash
    }
    redo QUEUE if keys %hash;
}

--
Charles DeRykus

nolo contendere

Jan 28, 2008, 6:40:51 PM
On Jan 28, 5:52 pm, "comp.llang.perl.moderated" <c...@blv-

perldoc perlsyn:
    ...
    If any part of LIST is an array, "foreach" will get very confused if
    you add or remove elements within the loop body, for example with
    "splice".  So don't do that.

Xho taught me that one :-).

comp.llang.perl.moderated

Jan 28, 2008, 7:15:39 PM
On Jan 28, 3:40 pm, nolo contendere <simon.c...@gmail.com> wrote:
> On Jan 28, 5:52 pm, "comp.llang.perl.moderated" <c...@blv-
>
>
>
...

> > > Any comments on this code? Is there some less quirky way to do this
> > > efficiently? Is there a simple (as simple as perl's internals can get)
> > > way to fix "each" so that it doesn't have this degenerate case?
>
> > Do you need 'each' since values don't seem
> > to be retrieved...Some simple benchmarks
> > suggest just looping over the keys would
> > be quite a bit faster if that's the case:
>
> > QUEUE: {
> > foreach my $to_do (keys %hash) {
> > delete $hash{$to_do};
> > ## do stuff with $to_do, which might
> > # make new entries in %hash
> > }
> > redo QUEUE if keys %hash;
>
> > }
>
> perldoc perlsyn:
> ...
> If any part of LIST is an array, "foreach" will get very
> confused if
> you add or remove elements within the loop body, for example
> with
> "splice". So don't do that.
>

Yes, that's true. Even deletes can be problematic
unless iterating with 'each', I think. But the
original code Xho demo'ed was adding elements
during the loop as well... so I'm not sure how
that could ever work...

--
Charles DeRykus

nolo contendere

Jan 28, 2008, 7:26:06 PM
On Jan 28, 7:15 pm, "comp.llang.perl.moderated" <c...@blv-

Those were 'while' loops, which don't suffer from the same issue.

comp.llang.perl.moderated

Jan 28, 2008, 7:54:47 PM

No, I believe it's not just a 'while vs. foreach'
issue but rather ensuring that the iterator's reset
before the next each call.

--
Charles DeRykus

Ben Morrow

Jan 28, 2008, 8:05:59 PM

Quoth nolo contendere <simon...@gmail.com>:
>        If any part of LIST is an array, "foreach" will get very
                                ^^^^^^^^^^^
It isn't. keys returns a list. What *is* true in this case is that if
any entries you haven't got to yet are deleted from the hash, they will
still be in for's list and will be returned anyway; since that isn't the
case here, it doesn't matter.

However, I would have thought that as the number of keys gets larger,
this get slower, since it has to build a complete list of the keys each
time through QUEUE. Let's see...

#!/usr/bin/perl -l

use strict;
use warnings;

use Benchmark qw/cmpthese/;

my %h =
    map { ($_, 1) }
    map { join '', map { chr rand 256 } 1..10 }
    1..100_000;

cmpthese -5, {
    keys  => sub {
        for my $x (keys %h) { 1; }
    },
    leach => sub {
        while (my ($k, $v) = each %h) { 1; }
    },
    seach => sub {
        while (my $k = each %h) { 1; }
    },
};

__END__

          Rate leach  keys seach
leach   3.76/s    --  -28%  -30%
keys    5.21/s   38%    --   -3%
seach   5.37/s   43%    3%    --

Nope, I'm wrong even on large hashes, but scalar each is faster again,
though even that seems to depend on your malloc: FreeBSD's perl is built
with perl's malloc by default, and a version built with the native
malloc has keys > seach > leach. Hmmm. :)

Ben

nolo contendere

Jan 29, 2008, 11:04:53 AM
On Jan 28, 8:05 pm, Ben Morrow <b...@morrow.me.uk> wrote:
> Quoth nolo contendere <simon.c...@gmail.com>:

>
> > On Jan 28, 5:52 pm, "comp.llang.perl.moderated" <c...@blv-
> > sam-01.ca.boeing.com> wrote:
>
> > > Do you need 'each' since values don't seem
> > > to be retrieved...Some simple benchmarks
> > > suggest just looping over the keys would
> > > be quite a bit faster if that's the case:
>
> > > QUEUE: {
> > >   foreach my $to_do (keys %hash) {
> > >      delete $hash{$to_do};
> > >      ## do stuff with $to_do, which might
> > >         # make new entries in %hash
> > >   }
> > >   redo QUEUE if keys %hash;
>
> > > }
>
> > perldoc perlsyn:
> >        ...
> >        If any part of LIST is an array, "foreach" will get very
>
>                              ^^^^^^^^^^^
> It isn't. keys returns a list. What *is* true in this case is that if
> any entries you haven't got to yet are deleted from the hash, they will

hmm, I thought it was unsafe to delete any entries other than the one
just accessed.

> still be in for's list and will be returned anyway; since that isn't the
> case here, it doesn't matter.
>

right. my understanding is that for (and foreach) take a snapshot of
the list, and iterate through that snapshot.

> However, I would have thought that as the number of keys gets larger,
> this get slower, since it has to build a complete list of the keys each
> time through QUEUE. Let's see...
>

wouldn't every subsequent list be smaller for the most part? unless
"do stuff" generated more keys than were deleted on average.

Ben Morrow

Jan 29, 2008, 11:38:12 AM

Quoth nolo contendere <simon...@fmr.com>:

> On Jan 28, 8:05 pm, Ben Morrow <b...@morrow.me.uk> wrote:
> >
> > It isn't. keys returns a list. What *is* true in this case is that if
> > any entries you haven't got to yet are deleted from the hash, they will
>
> hmm, I thought it was unsafe to delete any entries other than the one
> just accessed.

No, this is a misconception (though one that is somewhat supported by
the docs). See http://blog.plover.com/prog/perl/undefined.html#3 .
Deleting the current entry is actually the potentially-unsafe case, but
perl special-cases it for you so it works correctly (the entry has to be
deleted lazily, after the iterator has moved on to the next entry).

In any case, none of this applies to

for (keys %h) { ... }

keys is in list context, so the complete list of keys is generated
before the loop even starts iterating.

> > still be in for's list and will be returned anyway; since that isn't the
> > case here, it doesn't matter.
>
> right. my understanding is that for (and foreach) take a snapshot of
> the list, and iterate through that snapshot.

Not quite; keys returns a list which is a snapshot of the current set of
keys. What you do with that list afterwards is irrelevant.

> > However, I would have thought that as the number of keys gets larger,
> > this get slower, since it has to build a complete list of the keys each
> > time through QUEUE. Let's see...
>
> wouldn't every subsequent list be smaller for the most part? unless
> "do stuff" generated more keys than were deleted on average.

No, that wasn't what I meant. I meant 'as the initial set of keys
becomes large, perhaps building a complete list of that set becomes
inefficient'. It seems it doesn't.

Ben

xho...@gmail.com

Jan 29, 2008, 11:50:50 AM
nolo contendere <simon...@gmail.com> wrote:
> >
> > QUEUE: {
> >   foreach my $to_do (keys %hash) {
> >      delete $hash{$to_do};
> >      ## do stuff with $to_do, which might
> >         # make new entries in %hash
> >   }
> >   redo QUEUE if keys %hash;
> >
> > }
> >
>
> perldoc perlsyn:
> ...
> If any part of LIST is an array, "foreach" will get very confused if
> you add or remove elements within the loop body, for example with
> "splice". So don't do that.

But in this case, no part of the LIST is an array. keys %hash is not
an array.

xho...@gmail.com

Jan 29, 2008, 12:00:48 PM
"comp.llang.perl.moderated" <c...@blv-sam-01.ca.boeing.com> wrote:
> >
> > while (%hash) { # does not reset iterator
> > my $to_do=each %hash;
> > next unless defined $to_do; # not a real hash key,
> > # signals end of iterator
> > ## do stuff with $to_do, which might make new entries in %hash
> >
> > };
> >
> > If my speculation on the internals is right, then this would still
> > perform poorly if the hash first grows very large, then shrinks to
> > be only a few elements, but those last few elements keep triggering
> > the addition of just one more element each time. In my case, that
> > shouldn't be a problem.
> >
> > Any comments on this code? Is there some less quirky way to do this
> > efficiently? Is there a simple (as simple as perl's internals can get)
> > way to fix "each" so that it doesn't have this degenerate case?
>
> Do you need 'each' since values don't seem
> to be retrieved...Some simple benchmarks
> suggest just looping over the keys would
> be quite a bit faster if that's the case:
>
> QUEUE: {
> foreach my $to_do (keys %hash) {
> delete $hash{$to_do};
> ## do stuff with $to_do, which might
> # make new entries in %hash
> }
> redo QUEUE if keys %hash;
> }

I like it. It does have a 2nd caching structure, but that structure is
implicit and entirely managed by perl as part of the foreach loop. I might
change the outer loop syntax somewhat, as I prefer to avoid labels
whenever possible.

while (keys %hash) {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        # make new entries in %hash
    }
}

(Causes one extra scalar "keys %hash", but that shouldn't be a problem)

The "do stuff" can only add new entries, not remove entries (other than
$to_do itself) without causing problems. Deleting wasn't part of the
original specification, and that limitation is not a problem in this
particular case.

xho...@gmail.com

Jan 29, 2008, 12:20:23 PM
nolo contendere <simon...@fmr.com> wrote:

> On Jan 28, 8:05 pm, Ben Morrow <b...@morrow.me.uk> wrote:
> > Quoth nolo contendere <simon.c...@gmail.com>:
> >
> > > On Jan 28, 5:52 pm, "comp.llang.perl.moderated"
> > > <c...@blv-sam-01.ca.boeing.com> wrote:
> >
> > > > Do you need 'each' since values don't seem
> > > > to be retrieved...Some simple benchmarks
> > > > suggest just looping over the keys would
> > > > be quite a bit faster if that's the case:
> >
> > > > QUEUE: {
> > > >   foreach my $to_do (keys %hash) {
> > > >      delete $hash{$to_do};
> > > >      ## do stuff with $to_do, which might
> > > >         # make new entries in %hash
> > > >   }
> > > >   redo QUEUE if keys %hash;
> > > > }
> >
> > > perldoc perlsyn:
> > >        ...
> > >        If any part of LIST is an array, "foreach" will get very
> >
> >                              ^^^^^^^^^^^

> > It isn't. keys returns a list. What *is* true in this case is that if
> > any entries you haven't got to yet are deleted from the hash, they will
>
> hmm, I thought it was unsafe to delete any entries other than the one
> just accessed.

That is with "each %hash", not with "keys %hash".

>
> > still be in for's list and will be returned anyway; since that isn't
> > the case here, it doesn't matter.
> >
>
> right. my understanding is that for (and foreach) take a snapshot of
> the list, and iterate through that snapshot.

Except when the LIST of the foreach has an array in it, then it doesn't
take a snapshot--it does something else (presumably as an optimization),
the details of which are not documented but the effects of which are
(weirdness when changing the array).

xho...@gmail.com

Jan 29, 2008, 12:31:56 PM

I'm sure you already know this, but just to be clear for others: I think the
potential slowness you are hypothetically referring to is just a constant
multiplier of slowness (and a rather small one at that), right? The
slowness that I originally posted about was of a different
sort--pathological slowness that seems to be around O(N**2) when it really
"should" be O(N).

Once I take care of the O(N**2)->O(N), I'm not so interested in
microoptimizing the rest, except as a theoretical curiosity.

comp.llang.perl.moderated

Jan 29, 2008, 12:55:16 PM

Memory might be the only issue then. And it's
really nice to have Ben explain how it really
all works :)

--
Charles

Ben Morrow

Jan 29, 2008, 12:32:02 PM

Quoth xho...@gmail.com:

>
> while (keys %hash) {
> foreach my $to_do (keys %hash) {
> delete $hash{$to_do};
> ## do stuff with $to_do, which might
> # make new entries in %hash
> }
> }
> (Causes one extra scalar "keys %hash", but that shouldn't be a problem)
>
> The "do stuff" can only add new entries, not remove entries (other than
> $to_do itself) without causing problems. Deleting wasn't part of the
> original specification, and that limitation is not a problem in this
> particular case.

Fixing that is just a matter of

delete $hash{$to_do} or next;

isn't it? Of course, if you delete lots of entries you're going to waste
time spinning round that loop.
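
That is (a sketch dropped into the earlier loop; it assumes the queued
values are true, e.g. 1, since delete returns the value it removed):

while (keys %hash) {
    foreach my $to_do (keys %hash) {
        # skip entries that "do stuff" already removed since keys() ran
        delete $hash{$to_do} or next;
        ## do stuff with $to_do, which might add or remove entries in %hash
    }
}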

Ben

nolo contendere

Jan 30, 2008, 10:31:09 AM
On Jan 29, 11:38 am, Ben Morrow <b...@morrow.me.uk> wrote:
> Quoth nolo contendere <simon.c...@fmr.com>:

>
> > On Jan 28, 8:05 pm, Ben Morrow <b...@morrow.me.uk> wrote:
>
> > > It isn't. keys returns a list. What *is* true in this case is that if
> > > any entries you haven't got to yet are deleted from the hash, they will
>
> > hmm, I thought it was unsafe to delete any entries other than the one
> > just accessed.
>
> No, this is a misconception (though one that is somewhat supported by
> the docs). See http://blog.plover.com/prog/perl/undefined.html#3 .
> Deleting the current entry is actually the potentially-unsafe case, but
> perl special-cases it for you so it works correctly (the entry has to be
> deleted lazily, after the iterator has moved on to the next entry).
>

cool!

> In any case, none of this applies to
>
>     for (keys %h) { ... }
>
> keys is in list context, so the complete list of keys is generated
> before the loop even starts iterating.
>
> > > still be in for's list and will be returned anyway; since that isn't the
> > > case here, it doesn't matter.
>
> > right. my understanding is that for (and foreach) take a snapshot of
> > the list, and iterate through that snapshot.
>
> Not quite; keys returns a list which is a snapshot of the current set of
> keys. What you do with that list afterwards is irrelevant.
>

right, that was my understanding.

> > > However, I would have thought that as the number of keys gets larger,
> > > this get slower, since it has to build a complete list of the keys each
> > > time through QUEUE. Let's see...
>
> > wouldn't every subsequent list be smaller for the most part? unless
> > "do stuff" generated more keys than were deleted on average.
>
> No, that wasn't what I meant. I meant 'as the initial set of keys
> becomes large, perhaps building a complete list of that set becomes
> inefficient'. It seems it doesn't.
>

oh.

nolo contendere

Jan 30, 2008, 10:31:32 AM
On Jan 29, 11:50 am, xhos...@gmail.com wrote:

> nolo contendere <simon.c...@gmail.com> wrote:
>
> > > QUEUE: {
> > >   foreach my $to_do (keys %hash) {
> > >      delete $hash{$to_do};
> > >      ## do stuff with $to_do, which might
> > >         # make new entries in %hash
> > >   }
> > >   redo QUEUE if keys %hash;
>
> > > }
>
> > perldoc perlsyn:
> >        ...
> >        If any part of LIST is an array, "foreach" will get very
> >        confused if you add or remove elements within the loop body,
> >        for example with "splice".   So don't do that.
>
> But in this case, no part of the LIST is an array.  keys %hash is not
> an array.
>

oops! newbie mistake, sorry!

nolo contendere

Jan 30, 2008, 10:33:00 AM
On Jan 29, 12:20 pm, xhos...@gmail.com wrote:

hmm, could you, or anyone with >= your knowledge about this subject
expound?

xho...@gmail.com

Jan 30, 2008, 10:48:46 AM
nolo contendere <simon...@fmr.com> wrote:

> On Jan 29, 12:20 pm, xhos...@gmail.com wrote:
> >
> > > right. my understanding is that for (and foreach) take a snapshot of
> > > the list, and iterate through that snapshot.
> >
> > Except when the LIST of the foreach has an array in it, then it doesn't
> > take a snapshot--it does something else (presumably as an
> > optimization), the details of which are not documented but the effects
> > of which are (weirdness when changing the array).
> >
>
> hmm, could you, or anyone with >= your knowledge about this subject
> expound?

I think the link to plover that Ben Morrow posted already did that.
It isn't actually weird, I just treat it as weird because it isn't documented
and so could in theory do something different in other versions.

Well, OK, it is weird, because the docs say "If any part of LIST is an array",
but it seems that, to get the behavior shown in the plover link, the
entirety of the LIST has to be an array.  If you do something like:

foreach ("foo",@x) {

Then it appears, at least on my version, that it does in fact take a
snapshot of @x (more properly, a snapshot of the addresses to the elements
in @x) and any splices made into it during the loop have no effect on the
iteration.

nolo contendere

Jan 30, 2008, 10:57:21 AM
On Jan 30, 10:48 am, xhos...@gmail.com wrote:

> nolo contendere <simon.c...@fmr.com> wrote:
> > On Jan 29, 12:20 pm, xhos...@gmail.com wrote:
>
> > > > right. my understanding is that for (and foreach) take a snapshot of
> > > > the list, and iterate through that snapshot.
>
> > > Except when the LIST of the foreach has an array in it, then it doesn't
> > > take a snapshot--it does something else (presumably as an
> > > optimization), the details of which are not documented but the effects
> > > of which are (weirdness when changing the array).
>
> > hmm, could you, or anyone with >= your knowledge about this subject
> > expound?
>
> I think the link to plover that Ben Morrow posted already did that.
> It isn't actually weird, I just treat as weird because it isn't documented
> and so could in theory do something different in other versions.
>
> Well, OK, it is weird, because the docs "If any part of LIST is an array"
> but it seems that, to get the behavior shown in the plover link, the
> entirety of the LIST has to be an array.  if you do something like:
>
> foreach ("foo",@x) {
>
> Then it appears, at least on my version, that it does in fact take a
> snapshot of @x (more properly, a snapshot of the addresses to the elements
> in @x) and any splices made into it during the loop have no effect on the
> iteration.
>

thanks Xho, I'll check out the plover site more thoroughly.

kj

Feb 23, 2008, 6:09:29 PM
Xho, why do you like this

In <20080129120050.119$y...@newsreader.com> xho...@gmail.com writes:

>while (keys %hash) {
> foreach my $to_do (keys %hash) {
> delete $hash{$to_do};
> ## do stuff with $to_do, which might
> # make new entries in %hash
> }
>}

better than your original

>while (%hash) { # does not reset iterator
> my $to_do=each %hash;
> next unless defined $to_do; # not a real hash key,
> # signals end of iterator
> ## do stuff with $to_do, which might make new entries in %hash
>};

?

kynn

--
NOTE: In my address everything before the first period is backwards;
and the last period, and everything after it, should be discarded.

xho...@gmail.com

Feb 23, 2008, 9:06:59 PM
kj <so...@987jk.com.invalid> wrote:
> Xho, why do you like this
>
> In <20080129120050.119$y...@newsreader.com> xho...@gmail.com writes:
>
> >while (keys %hash) {
> > foreach my $to_do (keys %hash) {
> > delete $hash{$to_do};
> > ## do stuff with $to_do, which might
> > # make new entries in %hash
> > }
> >}
>
> better than your original
>
> >while (%hash) { # does not reset iterator
> > my $to_do=each %hash;
> > next unless defined $to_do; # not a real hash key,
> > # signals end of iterator
> > ## do stuff with $to_do, which might make new entries in %hash
> >};
>
> ?

I'm not keen on using "scalar %hash", as it is more esoteric than
"scalar keys %hash".  And using the next to cause the "each" to reboot
when it delivers an undefined value is more subtle than I care for.

On the other hand, at this long remove, I am kind of looking kindly on my
original code. It looks abnormal, which advertises the fact that it
is abnormal and should be tinkered with with caution.

kj

Feb 24, 2008, 3:37:40 PM
In <20080223210702.865$D...@newsreader.com> xho...@gmail.com writes:

>I'm not keen on using "scalar %hash", as it is more esoteric than
>"scalar keys %hash". And using the next to cause they "each" to reboot
>when it delivers an undefined value is more subtle than I care for.

Still, using foreach in the inner loop is an unnecessary use of memory.
What's wrong with

while ( keys %hash ) {
    while ( my $to_do = each %hash ) {
        delete $hash{ $to_do };
        # etc.
    }
}

?

Or perhaps this expresses the intentions more clearly:

{
    while ( my $to_do = each %hash ) {
        delete $hash{ $to_do };
        # etc.
    }

    redo if keys %hash;
}
