while (keys %hash) {
    my $to_do = each %hash;
    delete $hash{$to_do};
    ## do stuff with $to_do, which might make new entries in %hash
}
Well, except that this gets really slow on large hashes.
I suspect the each operator does a linear scan of the buckets until it
finds an occupied one. With a hash that used to be big but now isn't (and
so has lots of empty buckets), this behaves poorly, making it essentially
O(N**2) to empty a large hash.
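For anyone who wants to poke at this, here is a self-contained sketch of the drain pattern (the size is arbitrary and far smaller than the hashes that actually hurt):

```perl
# Drain a hash while resetting the iterator on every pass.
# "keys" in the boolean condition resets the iterator, so each()
# rescans from the first bucket on every trip through the loop.
use strict;
use warnings;
use Time::HiRes qw(time);

my %hash = map { $_ => 1 } 1 .. 20_000;

my $t0 = time;
while (keys %hash) {
    my $to_do = each %hash;
    delete $hash{$to_do};
    ## do stuff with $to_do here
}
printf "drained in %.3fs\n", time - $t0;
```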
If I just use scalar %hash instead of scalar keys %hash, then the slowdown
doesn't occur, because "each" scans the buckets from where it left off last
time instead of from the beginning each time. (At first I thought it was
scalar keys %hash that was being slow, and was going to post about *that*,
but then I realized that keys' resetting of the iterator was causing "each"
to be slow.) But you shouldn't change a hash at the same time you iterate
over it because things might get missed. Presumably, anything missed will
be found on the next time through the iterator, so this might work without
the slowdown:
while (%hash) {                  # does not reset iterator
    my $to_do = each %hash;
    next unless defined $to_do;  # not a real hash key;
                                 # signals end of iterator
    ## do stuff with $to_do, which might make new entries in %hash
}
If my speculation on the internals is right, then this would still
perform poorly if the hash first grows very large, then shrinks to
be only a few elements, but those last few elements keep triggering
the addition of just one more element each time. In my case, that
shouldn't be a problem.
Any comments on this code? Is there some less quirky way to do this
efficiently? Is there a simple (as simple as perl's internals can get)
way to fix "each" so that it doesn't have this degenerate case?
Thanks,
Xho
--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
If "do stuff" might make entries in %hash, how do you guarantee you
won't be operating on a duplicate entry, since you delete
$hash{$to_do}?
I want to guarantee that duplicate entries won't be in the queue *at the
same time*. Whether they can be re-entered at a later point is up to
the details of "do stuff", which vary from project to project and which
I am intentionally avoiding.
x> I need a work queue, but I don't really care about the order (LIFO, FIFO,
x> random) in which things come out of it. Either is pretty simple and
x> efficient with a Perl array, and either would suffice. But I want the
x> queue to not hold duplicate entries. I could use an array as a stack or
x> queue, with a parallel hash to be checked to prevent duplicates from being
x> entered. But why keep parallel data structures? Just use the hash as the
x> queue:
why not use the hash/array pair? and how do you generate dup work
request keys anyway? why not use a ref to the work thing as the key as
that will be unique. then an array will be fine.
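A sketch of that pairing for comparison (the names @queue, %in_queue, and enqueue() are mine, not from the thread): the array carries the order, and the hash only rejects duplicate enqueues.

```perl
use strict;
use warnings;

my (@queue, %in_queue);
my @processed;

# Push an item only if it is not already queued.
sub enqueue {
    my ($item) = @_;
    push @queue, $item unless $in_queue{$item}++;
}

enqueue($_) for qw(a b a c b);   # the duplicate a and b are rejected
while (defined(my $to_do = shift @queue)) {
    delete $in_queue{$to_do};    # allow later re-queueing of $to_do
    push @processed, $to_do;     # "do stuff" goes here
}
# @processed is now qw(a b c), each item exactly once
```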
x> while (keys %hash) {
x> my $to_do=each %hash;
why not do this instead?
while ( my $to_do = each %hash ) {
    # do work
    delete $hash{$to_do};
    keys %hash;   # reset iterator
}
the docs say IIRC that keys in a void context will reset the
iterator. keys in a while condition is in a scalar/boolean context and
may just return the number of keys.
and as another poster says, adding hash elements in a loop may change the
iterator order and cause unknown results. you need to do a clean reset
of the hash iterator so the each will work correctly.
x> If I just use scalar %hash instead of scalar keys %hash, then the slow down
x> doesn't occur because "each" scans the buckets from where it left off last
x> time, instead of from the beginning each time. (At first I thought it was
x> scalar keys %hash that was being slow and was going to post about *that*,
x> but then I realized that keys' reseting of the iterator was causing "each"
x> to be slow). But you shouldn't change a hash at the same time you iterate
x> over it because things might get missed. Presumably, anything missed will
x> be found on the next time through the iterator, so this might work without
x> the slowdown:
and you may not see all the elements of the hash as the iterator could
skip over new entries.
the best answer is what i said above, do a proper iterator reset after
you process the current work (actually just after the each() call should
work fine too and that would be clearer).
uri
--
Uri Guttman ------ u...@stemsystems.com -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
Parallel data structures are ugly and easy to screw up because they always
need to be kept in sync. Also, they need to use more memory, which in some
cases may be important. But other than that, there is no reason not to.
That is what I often do, and it works, I just want something neater.
> and how do you generate dup work
> request keys anyway? why not use a ref to the work thing as the key as
> that will be unique. then an array will be fine.
I don't understand what you are saying. What stops an array from
containing many duplicate values, be they references or anything else?
>
> x> while (keys %hash) {
> x> my $to_do=each %hash;
>
> why not do this instead?
>
> while ( my $to_do = each %hash ) {
>
> #do work
> delete $hash{$to_do};
> keys %hash ; # reset iterator
Effectively the same thing, just re-worded. Works, but has the
same slowness issue, presumably for the same reason.
>
> the docs say IIRC that keys in a void context will reset the
> iterator. keys in a while condition is in a scalar/boolean context and
> may just return the number of keys.
Nah, it resets in all contexts. In a void context, that is all it does.
(I.e., it doesn't build the return list and then throw it away, or
anything like that.)
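A tiny demonstration of that (the hash contents are arbitrary):

```perl
use strict;
use warnings;

my %h = (a => 1, b => 2, c => 3);

my $first = each %h;    # start iterating
each %h;                # advance the iterator
keys %h;                # void context: only resets the iterator
my $again = each %h;    # back to the first key again
# $first and $again hold the same key
```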
Have you looked at one of the Heap modules, eg Heap::Simple?
Mark
So if you remove the keys %hash in the above scenario, would the
condition to 'while' evaluate to true after the iterator reached the
end, if keys were added? Will the iterator "know" that it needs to loop
around again? It's unclear from perldoc -f each:
When the hash is entirely read, a null array is
returned in list context (which when assigned
produces a false (0) value), and "undef" in scalar
context. The next call to "each" after that will
start iterating again. There is a single iterator
for each hash, shared by all "each", "keys", and
"values" function calls in the program; it can be
reset by reading all the elements from the hash, or
by evaluating "keys HASH" or "values HASH". If you
add or delete elements of a hash while you're
iterating over it, you may get entries skipped or
duplicated, so don't. Exception: It is always safe
to delete the item most recently returned by
"each()", which means that the following code will
work:
while (($key, $value) = each %hash) {
print $key, "\n";
delete $hash{$key}; # This is safe
}
It says undef will be returned if the hash is entirely read. But since
the hash is not empty, is it considered "entirely read"? Also, it says
it's always safe to delete the most recently returned item, but it sounds
like it's UNsafe to add elements while iterating.
based on this, Uri's way (which is essentially the same as your
original slow way, if a tad bit cleaner), seems the correct way.
> I need a work queue, but I don't really care about the order (LIFO,
> FIFO, random) in which things come out of it. Either is pretty simple
> and efficient with a Perl array, and either would suffice. But I want
> the queue to not hold duplicate entries. I could use an array as a
> stack or queue, with a parallel hash to be checked to prevent
> duplicates from being entered. But why keep parallel data structures?
> Just use the hash as the queue:
>
> while (key %hash) {
> my $to_do=each %hash;
> delete $hash{$to_do};
> ## do stuff with $to_do, which might make new entries in %hash
> };
>
> Well, except that this gets really slow on large hashes.
It is possible I am misunderstanding what you are doing, but I would
have written:
#!/usr/bin/perl

use strict;
use warnings;

my %queue;
$queue{ $_ } = undef for 1 .. 1_000_000;

while ( my @keys = keys %queue ) {
    for my $task ( @keys ) {
        delete $queue{ $task };
        #
        # Do something
        #
        # Add another task with a small probability
        #
        if ( 0.1 > rand ) {
            $queue{ 1 + int(rand 1_000_000) } = undef;
        }
    }
}
C:\Home\asu1\Src\test\hash_queue> timethis hq
TimeThis : Command Line : hq
TimeThis : Start Time : Thu Jan 24 14:42:36 2008
TimeThis : End Time : Thu Jan 24 14:42:46 2008
TimeThis : Elapsed Time : 00:00:09.312
Sinan
--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
clpmisc guidelines: <URL:http://www.augustmail.com/~tadmc/clpmisc.shtml>
That was my initial thought as well (well, something like that
anyway), and I think that was Xho's initial thought. But I think he's
trying to devise a solution that doesn't require both %queue and
@keys.
No. Once the iterator reached the end, it will return undef and the loop
will exit. So as far as the loop is concerned, there is no "after"
the iterator reaches the end, because the loop is done.
> will the iterator "know" that it needs to loop
> around again?
Only if you happen to invoke it once more after it already indicated
the it was done. But the loop, without the reset, doesn't invoke it
again.
> it's unclear from the perldoc -f each:
>
> When the hash is entirely read, a null array is
> returned in list context (which when assigned
> produces a false (0) value), and "undef" in scalar
> context. The next call to "each" after that will
> start iterating again. There is a single iterator
> for each hash, shared by all "each", "keys", and
> "values" function calls in the program; it can be
> reset by reading all the elements from the hash, or
> by evaluating "keys HASH" or "values HASH". If you
> add or delete elements of a hash while you're
> iterating over it, you may get entries skipped or
> duplicated, so don't. Exception: It is always safe
> to delete the item most recently returned by
> "each()", which means that the following code will
> work:
>
> while (($key, $value) = each %hash) {
> print $key, "\n";
> delete $hash{$key}; # This is safe
> }
>
> It says undef will be returned if the hash is entirely read. But since
> the hash is not empty, is it considered "entirely read"?
Of course. Otherwise, if you did an each without a delete (which
is the most common way to use each), then it would be an infinite loop.
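The distinction is easy to see with a one-key hash (a small sketch):

```perl
use strict;
use warnings;

my %h = (a => 1);

my $k1 = each %h;   # 'a'
my $k2 = each %h;   # undef: the iterator is exhausted,
                    # even though the hash is not empty
my $k3 = each %h;   # 'a' again: the next call restarts
```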
> Also, it says
> it's always safe to delete the most recently returned item, but sounds
> like it's UNsafe to add elements while iterating.
If you violate their warning about adding things to the hash while
iterating, then it may be deemed fully read even though some elements have
not been iterated over during this current round of iteration. So if you
violate that warning, then when each returns empty you have to use some
other way to make sure the hash itself is empty.
I'm pretty sure it is safe in the sense that it will not segfault or
anything like that, it is only unsafe towards your assumptions that
everything has been seen.
> based on this, Uri's way (which is essentially the same as your
> original slow way, if a tad bit cleaner), seems the correct way.
I don't see that. Both methods always reset the iterator between the time
additions are made and the time each is called again. (Well, Uri's doesn't
reset the iterator before the very first each, which presumably comes
after some insert which was done before the loop was reached, but surely
that is not a problem.)
True, the flaw in this algorithm is that it always resets the
iterator. Personally, I like Sinan's implementation.
> On Jan 24, 2:43 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
>> xhos...@gmail.com wrote innews:20080124122509.078$o...@newsreader.com:
>>
>> > I need a work queue, but I don't really care about the order (LIFO,
>> > FIFO, random) in which things come out of it. Either is pretty
>> > simple
>
...
>> > Well, except that this gets really slow on large hashes.
>>
>> It is possible I am misunderstanding what you are doing, but I would
>> have written:
Obviously, I was.
...
> That was my initial thought as well (well, something like that
> anyway), and I think that was Xho's initial thought. But I think he's
> trying to devise a solution that doesn't require both %queue and
> @keys.
Thank you for the clarification.
I would have thought the obvious method would be
my (@queue, %done);

while (my $to_do = shift @queue) {
    exists $done{$to_do} and next;
    $done{$to_do} = 1;
    process_stuff($to_do);
}
Obviously, if you wanted to deliberately requeue something you would
need to know that, and delete the hash entry. I realize you're trying to
avoid parallel data structures, but %done is pretty much
self-maintaining, so I don't know if that applies here.
> I suspect the each operator does a linear scan of all the buckets until it
> finds an occupied one. With a hash that used to be big but now isn't (and
> so has lots of empty buckets), this behaves poorly, essentially being N^2
> to empty a large hash.
Presumably you could fix this with
%hash = %hash;
if that would be helpful. Of course, you'll copy everything.
> If I just use scalar %hash instead of scalar keys %hash, then the slow down
> doesn't occur because "each" scans the buckets from where it left off last
> time, instead of from the beginning each time. (At first I thought it was
> scalar keys %hash that was being slow and was going to post about *that*,
> but then I realized that keys' reseting of the iterator was causing "each"
> to be slow). But you shouldn't change a hash at the same time you iterate
> over it because things might get missed. Presumably, anything missed will
> be found on the next time through the iterator, so this might work without
> the slowdown:
Presumably also things might get found twice, depending on how the
buckets get reorganized. Since you're always deleting the current
element before calling each again that shouldn't be a problem here.
> Any comments on this code? Is there some less quirky way to do this
> efficiently? Is there a simple (as simple as perl's internals can get)
> way to fix "each" so that it doesn't have this degenerate case?
Presumably switching to something like BerkeleyDB::BTREE would also fix
this problem, since btrees are designed to be iterated over. I don't
know how much grief the tie overhead would cause in this case;
presumably an implementation could be made that used U/u magic directly
rather than tie magic, and so avoided the method calls.
Ben
> Presumably you could fix this with
...
> Presumably also things might get found twice, depending on how the
...
>
> Presumably switching to something like BerkeleyDB::BTREE would also fix
...
> presumably an implementation could be made that used U/u magic directly
> rather than tie magic, and so avoided the method calls.
>
You're very presumptuous ;-).
Now we've changed from an extra parallel structure to an extra caching
structure, which of course works but has the same inelegance. (Grepping
back through years of code, I see that I've actually used this method many
years ago, too, but from the comments it looks like I didn't understand why
the original was slow, so I regarded it as trial and error magic.) It does
require that "do something" can only add and not delete things from %queue,
which is fine with my original specification but lacks maximal flexibility.
There are many ways to work around this degradation in performance, but it
seems like all of them impose other limitations or inelegances. Oh well.
if these things were too easy I guess anyone could do it and I'd be out of
a job :)
The problem here is that, under some uses, @queue can accumulate a fatal
number of replicates. While %done prevents them from being processed
more than once, they are still stored more than once. If each of 10_000
work-items builds up on the queue 10_000 times, that becomes a storage
problem. So you need to use %done to guard against adding replicates into
@queue, to defend memory usage.
But replicates can still build up to unacceptable levels before the first
instance of one ever gets $done{}. So then you need another hash,
%queued_but_not_done, to reject duplicate entries at the earliest stage.
And then you say "This is a mess, let's just use one hash", then you
discover that "each" performs poorly on a shrinking hash, and then you post
a message here. :)
From this I infer that you are engaging in parallel processing. Is
this correct?
Even with tie slowness, a Tie::IxHash solution would at least be more
straightforward since insertion order is maintained... I think.
tie my %hash, 'Tie::IxHash';

my $idx = 0;
while ( my @queue = keys %hash ) {
    my $to_do = $queue[$idx++];
    delete $hash{ $to_do };
    # do stuff with $to_do, which might make
    # new entries in %hash
}
--
Charles DeRykus
Well, it seemed like a good idea a minute ago.
But, you might need to tweak $idx for instance...
Maybe other problems too...
while ( my @queue = keys %hash ) {
    $idx = $#queue if $idx > $#queue;
I have to admit I've never seen the point of Tie::IxHash. It's just the
same as the parallel array/hash Xho'd already rejected, but slower...
Ben
I have run into these constructs in parallel processing, but that was not
the immediate case here. I was computing a single-linkage hierarchical
clustering tree from a distance matrix that is way too large for the COTS
software I have access to. I've also run into it in certain traversals
of extremely large graphs.
yeah, well...that was my second guess.
Do you need 'each', since values don't seem to be retrieved? Some simple
benchmarks suggest just looping over the keys would be quite a bit faster
if that's the case:

QUEUE: {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        ## make new entries in %hash
    }
    redo QUEUE if keys %hash;
}
--
Charles DeRykus
perldoc perlsyn:
...
    If any part of LIST is an array, "foreach" will get very
    confused if you add or remove elements within the loop body,
    for example with "splice". So don't do that.
Xho taught me that one :-).
Yes, that's true. Even deletes can be problematic unless iterating
with 'each', I think. But the original code Xho demo'ed was adding
elements during the loop as well... so I'm not sure how that could
ever work.
--
Charles DeRykus
Those were 'while' loops, which don't suffer from the same issue.
No, I believe it's not just a 'while vs. foreach' issue, but rather
one of ensuring that the iterator is reset before the next each call.
--
Charles DeRykus
However, I would have thought that as the number of keys gets larger,
this get slower, since it has to build a complete list of the keys each
time through QUEUE. Let's see...
#!/usr/bin/perl -l

use strict;
use warnings;

use Benchmark qw/cmpthese/;

my %h =
    map { ($_, 1) }
    map { join '', map { chr rand 256 } 1..10 }
    1..100_000;

cmpthese -5, {
    keys  => sub {
        for my $x (keys %h) { 1; }
    },
    leach => sub {
        while (my ($k, $v) = each %h) { 1; }
    },
    seach => sub {
        while (my $k = each %h) { 1; }
    },
};

__END__
__END__
         Rate leach  keys seach
leach  3.76/s    --  -28%  -30%
keys   5.21/s   38%    --   -3%
seach  5.37/s   43%    3%    --
Nope, I'm wrong even on large hashes, but scalar each is faster again,
though even that seems to depend on your malloc: FreeBSD's perl is built
with perl's malloc by default, and a version built with the native
malloc has keys > seach > leach. Hmmm. :)
Ben
hmm, I thought it was unsafe to delete any entries other than the one
just accessed.
> still be in for's list and will be returned anyway; since that isn't the
> case here, it doesn't matter.
>
right. my understanding is that for (and foreach) take a snapshot of
the list, and iterate through that snapshot.
> However, I would have thought that as the number of keys gets larger,
> this get slower, since it has to build a complete list of the keys each
> time through QUEUE. Let's see...
>
wouldn't every subsequent list be smaller for the most part? unless
"do stuff" generated more keys than were deleted on average.
No, this is a misconception (though one that is somewhat supported by
the docs). See http://blog.plover.com/prog/perl/undefined.html#3 .
Deleting the current entry is actually the potentially-unsafe case, but
perl special-cases it for you so it works correctly (the entry has to be
deleted lazily, after the iterator has moved on to the next entry).
In any case, none of this applies to
for (keys %h) { ... }
keys is in list context, so the complete list of keys is generated
before the loop even starts iterating.
> > still be in for's list and will be returned anyway; since that isn't the
> > case here, it doesn't matter.
>
> right. my understanding is that for (and foreach) take a snapshot of
> the list, and iterate through that snapshot.
Not quite; keys returns a list which is a snapshot of the current set of
keys. What you do with that list afterwards is irrelevant.
> > However, I would have thought that as the number of keys gets larger,
> > this get slower, since it has to build a complete list of the keys each
> > time through QUEUE. Let's see...
>
> wouldn't every subsequent list be smaller for the most part? unless
> "do stuff" generated more keys than were deleted on average.
No, that wasn't what I meant. I meant 'as the initial set of keys
becomes large, perhaps building a complete list of that set becomes
inefficient'. It seems it doesn't.
Ben
But in this case, no part of the LIST is an array. keys %hash is not
an array.
I like it. It does have a 2nd caching structure, but that structure is
implicit and entirely managed by perl as part of the foreach loop. I might
change the outer loop syntax somewhat, as I prefer to avoid labels
whenever possible.
while (keys %hash) {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do};
        ## do stuff with $to_do, which might
        ## make new entries in %hash
    }
}
(Causes one extra scalar "keys %hash", but that shouldn't be a problem)
The "do stuff" can only add new entries, not remove entries (other than
$to_do itself) without causing problems. Deleting wasn't part of the
original specification, and that limitation is not a problem in this
particular case.
That is with "each %hash", not with "keys %hash".
>
> > still be in for's list and will be returned anyway; since that isn't
> > the case here, it doesn't matter.
> >
>
> right. my understanding is that for (and foreach) take a snapshot of
> the list, and iterate through that snapshot.
Except when the LIST of the foreach has an array in it, then it doesn't
take a snapshot--it does something else (presumably as an optimization),
the details of which are not documented but the effects of which are
(weirdness when changing the array).
I'm sure you already know this, but just to be clear for others: I think
the potential slowness you are hypothetically referring to is just a
constant multiplier of slowness (and a rather small one at that), right?
The slowness that I originally posted about was of a different
sort--pathological slowness that seems to be around O(N**2) when it really
"should" be O(N).
Once I take care of the O(N**2) -> O(N) issue, I'm not so interested in
micro-optimizing the rest, except as a theoretical curiosity.
Memory might be the only issue then. And it's
really nice to have Ben explain how it really
all works :)
--
Charles
Fixing that is just a matter of
delete $hash{$to_do} or next;
isn't it? Of course, if you delete lots of entries you're going to waste
time spinning round that loop.
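In context, Ben's guard reads like this (a sketch; one caveat of my own: delete returns the deleted *value*, so this only works if the queued values are true, e.g. 1 rather than undef):

```perl
use strict;
use warnings;

my %hash = map { $_ => 1 } qw(a b c);   # values must be true
my @done;

while (keys %hash) {
    foreach my $to_do (keys %hash) {
        delete $hash{$to_do} or next;   # skip entries already deleted
        push @done, $to_do;
        ## do stuff with $to_do, which might add
        ## or delete entries in %hash
    }
}
# every key is processed exactly once
```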
Ben
cool!
> In any case, none of this applies to
>
> for (keys %h) { ... }
>
> keys is in list context, so the complete list of keys is generated
> before the loop even starts iterating.
>
> > > still be in for's list and will be returned anyway; since that isn't the
> > > case here, it doesn't matter.
>
> > right. my understanding is that for (and foreach) take a snapshot of
> > the list, and iterate through that snapshot.
>
> Not quite; keys returns a list which is a snapshot of the current set of
> keys. What you do with that list afterwards is irrelevant.
>
right, that was my understanding.
> > > However, I would have thought that as the number of keys gets larger,
> > > this get slower, since it has to build a complete list of the keys each
> > > time through QUEUE. Let's see...
>
> > wouldn't every subsequent list be smaller for the most part? unless
> > "do stuff" generated more keys than were deleted on average.
>
> No, that wasn't what I meant. I meant 'as the initial set of keys
> becomes large, perhaps building a complete list of that set becomes
> inefficient'. It seems it doesn't.
>
oh.
oops! newbie mistake, sorry!
hmm, could you, or anyone with >= your knowledge about this subject
expound?
I think the link to plover that Ben Morrow posted already did that.
It isn't actually weird; I just treat it as weird because it isn't
documented and so could in theory do something different in other versions.
Well, OK, it is weird, because the docs say "If any part of LIST is an
array", but it seems that, to get the behavior shown in the plover link,
the entirety of the LIST has to be an array. If you do something like:
foreach ("foo",@x) {
Then it appears, at least on my version, that it does in fact take a
snapshot of @x (more properly, a snapshot of the addresses to the elements
in @x) and any splices made into it during the loop have no effect on the
iteration.
thanks Xho, I'll check out the plover site more thoroughly.
In <20080129120050.119$y...@newsreader.com> xho...@gmail.com writes:
>while (keys %hash) {
> foreach my $to_do (keys %hash) {
> delete $hash{$to_do};
> ## do stuff with $to_do, which might
> # make new entries in %hash
> }
>}
better than your original
>while (%hash) { # does not reset iterator
> my $to_do=each %hash;
> next unless defined $to_do; # not a real hash key,
> # signals end of iterator
> ## do stuff with $to_do, which might make new entries in %hash
>};
?
kynn
--
NOTE: In my address everything before the first period is backwards;
and the last period, and everything after it, should be discarded.
I'm not keen on using "scalar %hash", as it is more esoteric than
"scalar keys %hash". And using next to cause the "each" to reboot
when it delivers an undefined value is more subtle than I care for.
On the other hand, at this long remove, I am kind of looking kindly on my
original code. It looks abnormal, which advertises the fact that it
is abnormal and should be tinkered with with caution.
>I'm not keen on using "scalar %hash", as it is more esoteric than
>"scalar keys %hash". And using the next to cause they "each" to reboot
>when it delivers an undefined value is more subtle than I care for.
Still, using foreach in the inner loop is an unnecessary use of memory.
What's wrong with

while ( keys %hash ) {
    while ( my $to_do = each %hash ) {
        delete $hash{ $to_do };
        # etc.
    }
}
?
Or perhaps this expresses the intentions more clearly:

{
    while ( my $to_do = each %hash ) {
        delete $hash{ $to_do };
        # etc.
    }
    redo if keys %hash;
}