I'm boggled.
I have a function that takes a keyer, which keys a table (an iterable). I
filter based on these keys, then groupby based on the filtered keys and
a keyfunc. Then, to make the resulting generator behave a little nicer
(no requirement for the user to unpack the keys), I strip the keys in a
generator expression that unpacks them and generates the (k, g) pairs I
want ("regrouped"). I then append each series generator to the growing
"serieses" list ("serieses" being the plural of series, if your
vocabulary isn't that big).
Here's the function:
from itertools import izip, imap, ifilter, groupby

def serialize(table, keyer=_keyer,
              selector=_selector,
              keyfunc=_keyfunc,
              series_keyfunc=_series_keyfunc):
    keyed = izip(imap(keyer, table), table)
    filtered = ifilter(selector, keyed)
    serialized = groupby(filtered, series_keyfunc)
    serieses = []
    for s_name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
        serieses.append((s_name, regrouped))
    for s in serieses:
        yield s
I defined a little debugging function called iterprint:
def iterprint(thing):
    if isinstance(thing, str):
        print thing
    elif hasattr(thing, 'items'):
        print thing.items()
    else:
        try:
            for x in thing:
                iterprint(x)
        except TypeError:
            print thing
The gist is that iterprint will print any generator down to its
non-iterable components--it works fine for my purpose here, but I
included the code for the curious.
When I apply iterprint in the following manner (the only change is the
added iterprint line), everything looks fine: the "regrouped" generators
in "serieses" generate what they are supposed to, and I can see that
even the last item in "serieses" iterprints just fine.
def serialize(table, keyer=_keyer,
              selector=_selector,
              keyfunc=_keyfunc,
              series_keyfunc=_series_keyfunc):
    keyed = izip(imap(keyer, table), table)
    filtered = ifilter(selector, keyed)
    serialized = groupby(filtered, series_keyfunc)
    serieses = []
    for s_name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
        serieses.append((s_name, regrouped))
        iterprint(serieses)
    for s in serieses:
        yield s
Now, here's the rub. When I apply iterprint in the following manner, it
looks like my generator ("regrouped") gets consumed (note the only
change is a two-space dedent of the iterprint call--the printing is now
outside the loop):
def serialize(table, keyer=_keyer,
              selector=_selector,
              keyfunc=_keyfunc,
              series_keyfunc=_series_keyfunc):
    keyed = izip(imap(keyer, table), table)
    filtered = ifilter(selector, keyed)
    serialized = groupby(filtered, series_keyfunc)
    serieses = []
    for s_name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
        serieses.append((s_name, regrouped))
    iterprint(serieses)
    for s in serieses:
        yield s
Now, what is consuming my "regrouped" generator when going from inside
the loop to outside?
Thanks in advance for any clue.
py> print version
2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)]
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
Of course this mutates the thing that is being printed. Try using
itertools.tee to fork a copy of the iterator and print from that.
I didn't look at the rest of your code enough to spot any errors
but take note of the warnings in the groupby documentation about
pitfalls with using the results some number of times other than
exactly once.
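For reference, a minimal sketch of the tee approach (modern Python syntax; `peek_print` is my own name, not from the thread):

```python
from itertools import tee


def peek_print(iterable):
    # Fork the iterator with tee; print one copy and return the other
    # still unconsumed, so inspection does not eat the caller's data.
    to_print, to_return = tee(iterable)
    print(list(to_print))
    return to_return


nums = peek_print(iter([1, 2, 3]))  # prints [1, 2, 3]
print(list(nums))                   # the returned copy still yields everything
```

Note that tee forks a plain iterator; it does not rescue the group objects that come out of groupby, which are invalidated when the parent iterator advances (as the rest of the thread works out).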
Thank you for your answer, but I am aware of this caveat. Something is
consuming my generator *before* I iterprint it. Please give it another
look if you would be so kind.
I can see I didn't explain it so well. This one must be a bug if my code
looks good to you. Here is a summary:
- If I iterprint inside the loop, iterprint looks correct.
- If I iterprint outside the loop, my generator gets consumed and I am
only left with the last item, so my iterprint prints only one item
outside the loop.
Conclusion: something consumes my generator going from inside the loop
to outside.
Please note that I am not talking about the yielded values, or the
for-loop that creates them. I left them there to show my intent with the
function. The iterprint function is there to show that the generator
gets consumed just moving from inside the loop to outside.
I know this one is easy to dismiss as my consuming the generator with
the iterprint, as this would be a common mistake.
James
I'll see if I can look at it some more later; I'm in the middle of
something else right now. All I can say at the moment is that I've
encountered problems like this in my own code many times, and it's
always been a matter of having to carefully keep track of how the
nested iterators coming out of groupby are being consumed. I doubt
there is a library bug. Using groupby for things like this is
powerful, but unfortunately bug-prone because of how these mutable
iterators work. I suggest making some sample sequences and stepping
through with a debugger seeing just how the iterators advance.
I didn't spot any obvious errors, but I didn't look closely enough
to say that the code looked good or bad.
> Conclusion: something consumes my generator going from inside the loop
> to outside.
I'm not so sure of this; the thing is, you're building these internal
grouper objects that don't expect to be consumed out of order, etc.
Really, I'd try making a test iterator that prints something every
time you advance it, then step through your function with a debugger.
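Here is a sketch of such a tracing iterator (modern Python syntax; `traced` is my own name, not from the thread). Wrapping the source lets you watch exactly when groupby pulls from it:

```python
from itertools import groupby


def traced(iterable, label="src"):
    # Yield items unchanged, printing each one as it is pulled, so you
    # can watch when groupby (or anything else) advances the source.
    for item in iterable:
        print(label, "->", item)
        yield item


# Pulling the next (key, group) pair silently drains whatever is left
# of the previous group -- visible in the trace output.
for key, group in groupby(traced(range(6)), key=lambda x: x // 3):
    print("group", key, list(group))
```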
Thank you for your suggestion. I replied twice to your first post before
you made your suggestion to step through with a debugger, so it looks
like I ignored it.
Thanks again.
groupby() is "all you can eat", but "no doggy bag".
> def serialize(table, keyer=_keyer,
>               selector=_selector,
>               keyfunc=_keyfunc,
>               series_keyfunc=_series_keyfunc):
>     keyed = izip(imap(keyer, table), table)
>     filtered = ifilter(selector, keyed)
>     serialized = groupby(filtered, series_keyfunc)
>     serieses = []
>     for s_name, series in serialized:
>         grouped = groupby(series, keyfunc)
>         regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
>         serieses.append((s_name, regrouped))
You are trying to store a group for later consumption here.
>     for s in serieses:
>         yield s
That doesn't work:
>>> groups = [g for k, g in groupby(range(10), lambda x: x//3)]
>>> for g in groups:
... print list(g)
...
[]
[]
[]
[9]
You cannot work around that, because what invalidates a group is the
call to groups.next():
>>> groups = groupby(range(10), lambda x: x//3)
>>> g = groups.next()[1]
>>> g.next()
0
>>> groups.next()
(1, <itertools._grouper object at 0x2b3bd1f300f0>)
>>> g.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
Perhaps Python should throw an out-of-band exception for an invalid group
instead of yielding bogus data.
Peter
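A minimal workaround sketch (modern Python syntax, my own example data): materialize each group into a list while it is still the current group, trading laziness for the ability to store groups.

```python
from itertools import groupby

# Unsafe: storing the lazy group objects for later. By the time they
# are read, the groupby iterator has moved past them and they are stale.
stale = [g for k, g in groupby(range(10), lambda x: x // 3)]
print([list(g) for g in stale])   # the early groups come back empty

# Safe: copy each group into a list while it is still the current group.
fresh = [(k, list(g)) for k, g in groupby(range(10), lambda x: x // 3)]
print(fresh)
```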
Good catch; the solution is to turn that loop into a generator,
but then it has to be consumed very carefully. This stuff may
press the limits of what one can do with Python iterators
while staying sane.
Thank you for your clear explanation--a satisfying conclusion to nine
hours of head scratching.
James
Brilliant suggestion. Worked like a charm. Here is the final product:
def dekeyed(serialized, keyfunc):
    for name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
        yield (name, regrouped)

def serialize(table, keyer=_keyer,
              selector=_selector,
              keyfunc=_keyfunc,
              series_keyfunc=_series_keyfunc):
    keyed = izip(imap(keyer, table), table)
    filtered = ifilter(selector, keyed)
    serialized = groupby(filtered, series_keyfunc)
    return dekeyed(serialized, keyfunc)
Thank you!
James
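For anyone trying this pattern on modern Python, here is the same code with izip/imap/ifilter replaced by the builtins, plus hypothetical sample data and keyers (the _keyer etc. defaults were never shown in the thread, so the column-picking lambdas below are my invention):

```python
from itertools import groupby


def dekeyed(serialized, keyfunc):
    # Regroup each series by keyfunc and strip the keys from the rows,
    # yielding lazily so every group is read while it is still current.
    for name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
        yield (name, regrouped)


def serialize(table, keyer, selector, keyfunc, series_keyfunc):
    keyed = zip(map(keyer, table), table)      # (key, row) pairs
    filtered = filter(selector, keyed)
    serialized = groupby(filtered, series_keyfunc)
    return dekeyed(serialized, keyfunc)


# Hypothetical rows: (series, group, value), pre-sorted as groupby requires.
table = [("a", 1, 10), ("a", 1, 11), ("a", 2, 12), ("b", 1, 13)]
result = serialize(
    table,
    keyer=lambda row: row[:2],           # key each row on (series, group)
    selector=lambda kr: True,            # keep every (key, row) pair
    keyfunc=lambda kr: kr[0][1],         # group within a series
    series_keyfunc=lambda kr: kr[0][0],  # split the table into series
)
for name, series in result:
    for group_key, rows in series:
        print(name, group_key, list(rows))
```

The nested loop consumes each inner group fully before asking for the next one, which is exactly the "exactly once, in order" discipline groupby demands.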
Cool, glad it worked out. When writing this type of code I like to
use doctest to spell out some solid examples of what each function is
supposed to do, as part of the function. It's the only way I can
remember the shapes of the sequences going in and out, and the
automatic testing doesn't hurt either. Even with that though, at
least for me, Python starts feeling really scary when the iterators
get this complicated. I start wishing for a static type system,
re-usable iterators, etc.
regardses
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Nasty hobbitses... We hates them!
Where is this coming from? Please see posts by Otten and Rubin for
proper human conduct.
It's getting so you can't make a post on this list without getting
needled by this irritating minority who never know when to quit. If you
have nothing to add to the thread, you might want to practice humor on
your own time--and you need practice because needling is not funny, just
irritating.
This is an interesting topic. I agree with you; I too was scared in a
similar situation. The language features let you do some things
in a simple way, but if you pile up too many of them, you end up
losing track of what you are doing.
The D language has static typing, and its classes allow a standard
opApply method that supports lazy iteration; these are re-usable
iterators (though to scan two iterators in parallel you need a big
trick; it's a matter of stack). They require more syntax, which gets in
the way, so in the end I am better able to write recursive generators in
Python, because its less cluttered syntax lets my brain manage the
extra algorithmic complexity that kind of convoluted code requires.
The Haskell language is often used by very intelligent programmers; it
too relies on lazy computations and iteration, but it has the
advantage that its iterators behave better, and during the generation
of items you can, when you want, refer to and use the items already
generated. Those things make lazy Python code very different from lazy
Haskell code.
Bye,
bearophile