
In need of a binge-and-purge idiom


Magnus Lie Hetland

Mar 23, 2003, 1:20:57 PM
Maybe not the best name, but it somehow describes what's going on...
So...

I've noticed that I use the following in several contexts:

chunk = []
for element in iterable:
    if isSeparator(element) and chunk:
        doSomething(chunk)
        chunk = []
if chunk:
    doSomething(chunk)
    chunk = []

If the iterable above is a file, isSeparator(element) is simply
defined as not element.strip(), and doSomething(chunk) is
yield(''.join(chunk)), then you have a paragraph splitter. I've been using
the same approach for slightly more complicated parsing recently.
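
For concreteness, the paragraph splitter described here might be sketched as follows in present-day Python. The name `paragraphs` and the sample data are assumptions for illustration, and the sketch appends non-separator lines to the chunk, a step the pattern needs in order to collect anything.

```python
def paragraphs(lines):
    """Yield each run of non-blank lines joined into one string."""
    chunk = []
    for line in lines:
        if not line.strip():        # a blank line acts as the separator
            if chunk:
                yield ''.join(chunk)
                chunk = []
        else:
            chunk.append(line)
    if chunk:                       # the duplicated check after the loop
        yield ''.join(chunk)


text = ["first\n", "para\n", "\n", "\n", "second\n"]
print(list(paragraphs(text)))       # prints ['first\npara\n', 'second\n']
```

The final `if chunk:` is the duplication the post is asking how to avoid.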

However, the extra check at the end (i.e. the duplication) is a bit
ugly. A solution would be:

...
for element in iterable + separator:
...

but that isn't possible, of course. (It could be possible with some
fiddling with itertools etc., I guess.)

If it were possible to check whether the iterator extracted from the
iterable was at an end, that could help too -- but I see no elegant
way of doing it.

I can't really see any good way of using the while/break idiom either,
without resorting to explicit iterator pumping and a Boolean flag
(which isn't really all that elegant...):

it = iter(iterable)
chunk = []
done = False
while not done:
    try:
        element = it.next()
    except StopIteration:
        done = True
        element = SomeSeparator()
    if isSeparator(element) and chunk:
        doSomething(chunk)
        chunk = []

This seems far too wordy and clunky.

An alternative is:

it = iter(iterable)
chunk = []
while True:
    try:
        try:
            element = it.next()
        except StopIteration:
            element = SomeSeparator()
            break
    finally:
        if isSeparator(element) and chunk:
            doSomething(chunk)
            chunk = []

But this stuff is really just as bad as (or even quite a bit worse than)
the version with duplication.

I just thought I'd hear if someone can think of a more elegant way of
handling this sort of thing?

--
Magnus Lie Hetland "Nothing shocks me. I'm a scientist."
http://hetland.org -- Indiana Jones

Tim Peters

Mar 23, 2003, 2:19:53 PM
[Magnus Lie Hetland]

> Maybe not the best name, but it somehow describes what's going on...
> So...
>
> I've noticed that I use the following in several contexts:
>
> chunk = []
> for element in iterable:
>     if isSeparator(element) and chunk:
>         doSomething(chunk)
>         chunk = []
> if chunk:
>     doSomething(chunk)
>     chunk = []

Since chunk is initialized to an empty list, the if clause in the loop can
never evaluate to true, so this is equivalent to

    chunk = []
    for element in iterable:
        isSeparator(element)

All variations of the code later in the msg suffer the same problem. As a
result, I've got no idea what you intend the code to do. Does calling
isSeparator(element), or <shudder> the process of iterating over iterable,
mutate chunk as a side effect? If so, "yuck" comes to mind.

If the code made sense <wink>, something like

def terminated_iterator(iterable, a_separator):
    for element in iterable:
        yield element
    yield a_separator

would produce the original sequence, then tack a_separator on to the end.
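
A minimal sketch of how such a terminated iterator removes the check after the loop, written for present-day Python. The generator is restated so the block is self-contained; the consumer `chunks` and the sample data are hypothetical, and the loop body assumes the intended pattern of appending non-separators to the chunk.

```python
def terminated_iterator(iterable, a_separator):
    # Yield every element, then one trailing separator.
    for element in iterable:
        yield element
    yield a_separator


def chunks(iterable, is_separator, a_separator):
    """Hypothetical consumer: every flush happens inside the loop."""
    chunk = []
    for element in terminated_iterator(iterable, a_separator):
        if is_separator(element):
            if chunk:
                yield chunk
                chunk = []
        else:
            chunk.append(element)


print(list(chunks([1, 2, 0, 3], lambda x: x == 0, 0)))  # prints [[1, 2], [3]]
```

The cost is that the caller has to supply some value for which `is_separator` is true.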

> ...


> However, the extra check at the end (i.e. the duplication) is a bit
> ugly. A solution would be:
>
> ...
> for element in iterable + separator:
> ...
>
> but that isn't possible, of course. (It could be possible with some
> fiddling with itertools etc., I guess.)

WRT the preceding,

    for element in terminated_iterator(iterable, separator):

gets that effect. More generally,

def concat(*seqs):
    "Generate all the elements of all the argument iterables."
    for seq in seqs:
        for x in seq:
            yield x

and then, e.g.,

    for element in concat(iterable, [separator]):

> ...


> An alternative is:
>
> it = iter(iterable)
> chunk = []
> while True:
>     try:
>         try:
>             element = it.next()
>         except StopIteration:
>             element = SomeSeparator()
>             break
>     finally:
>         if isSeparator(element) and chunk:
>             doSomething(chunk)
>             chunk = []
>
> But this stuff is really just as bad (or even quite a bit worse) than
> the version with duplication.

Indeed, stick to sane alternatives.


Magnus Lie Hetland

Mar 23, 2003, 6:51:29 PM
In article <mailman.1048447384...@python.org>, Tim Peters wrote:
>[Magnus Lie Hetland]
>> Maybe not the best name, but it somehow describes what's going on...
>> So...
>>
>> I've noticed that I use the following in several contexts:

A little fix...

>> chunk = []
>> for element in iterable:
>>     if isSeparator(element):
           if chunk:
>>             doSomething(chunk)
>>             chunk = []
       else:
           chunk.append(element)
>> if chunk:
>>     doSomething(chunk)
>>     chunk = []

The original was written in a hurry :]

>Since chunk is initialized to an empty list, the if clause in the loop can
>never evaluate to true, so this is equivalent to
>
>    chunk = []
>    for element in iterable:
>        isSeparator(element)

Yup.

>All variations of the code later in the msg suffer the same problem. As a
>result, I've got no idea what you intend the code to do. Does calling
>isSeparator(element), or <shudder> the process of iterating over iterable,
>mutate chunk as a side effect? If so, "yuck" comes to mind.

No, sorry -- I just forgot parts of the code :)

>If the code made sense <wink>, something like
>
>def terminated_iterator(iterable, a_separator):
>    for element in iterable:
>        yield element
>    yield a_separator
>
>would produce the original sequence, then tack a_separator on to the end.

Yes, that's what I've done before (e.g. in an example in my book).
Maybe that is the best way of doing it.

[snip]


>WRT the preceding,
>
>    for element in terminated_iterator(iterable, separator):
>
>gets that effect.

Indeed.

> More generally,
>
>def concat(*seqs):
>    "Generate all the elements of all the argument iterables."
>    for seq in seqs:
>        for x in seq:
>            yield x
>
>and then, e.g.,
>
>    for element in concat(iterable, [separator]):

Yes. I posted something similar to that when discussing itertools
previously. I guess I was (now) mainly looking for some basic use of
control structures that I had overlooked.

Anyway, thanks for the input.

Jeremy Fincher

Mar 24, 2003, 2:04:19 AM
m...@furu.idi.ntnu.no (Magnus Lie Hetland) wrote in message news:<slrnb7ruo...@furu.idi.ntnu.no>...

> I've noticed that I use the following in several contexts:
>
> chunk = []
> for element in iterable:
>     if isSeparator(element) and chunk:
>         doSomething(chunk)
>         chunk = []
> if chunk:
>     doSomething(chunk)
>     chunk = []
>
> If the iterable above is a file, isSeparator(element) is simply
> defined as not element.strip() and doSomething(chunk) is
> yield(''.join(chunk)) you have a paragraph splitter. I've been using
> the same approach for slightly more complicated parsing recently.

Maybe something like this can work?

def itersplit(iterable, isSeparator):
    acc = []
    for element in iterable:
        if isSeparator(element):
            yield acc
            acc = []
        else:
            acc.append(element)
    yield acc


Then your paragraph splitter might look like this:

def paragraphSplitter(file):
    for L in itersplit(file, lambda s: not s.split()):
        yield ''.join(L)

Jeremy

Alex Martelli

Mar 24, 2003, 6:57:44 AM
Magnus Lie Hetland wrote:

Ah, I recognize the outline of our joint contribution to the
printed Cookbook (recipe 4.8...).

> I've noticed that I use the following in several contexts:

[fixing as per followups]


> chunk = []
> for element in iterable:
>     if isSeparator(element) and chunk:
>         doSomething(chunk)
>         chunk = []
      else: chunk.append(element)
> if chunk:
>     doSomething(chunk)
>     chunk = []

First refactoring that comes to mind is:

def maydosomething(chunk):
    if chunk:
        doSomething(chunk)
        chunk[:] = []

chunk = []
for element in iterable:
    if isSeparator(element): maydosomething(chunk)
    else: chunk.append(element)
maydosomething(chunk)

but this wouldn't work for the specific use case you require:

> If the iterable above is a file, isSeparator(element) is simply
> defined as not element.strip() and doSomething(chunk) is
> yield(''.join(chunk)) you have a paragraph splitter. I've been using

i.e., factoring out a *yield* to maydosomething would NOT work.
So I'll focus on the specific case of yield in the following,
assuming a "munge" function such as
    def munge(chunk): return ''.join(chunk)
is also passed as an argument.
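
For readers on later Pythons: since Python 3.3, `yield from` lets a generator delegate to a helper generator, so a flush step that yields can be factored out after all. This was not available when the thread was written. A minimal sketch, where `flush` and `split_chunks` are hypothetical names:

```python
def flush(chunk, munge):
    """Yield the munged chunk if it is non-empty, then clear it in place."""
    if chunk:
        yield munge(chunk)
        chunk[:] = []       # in-place clear, so the caller's list is emptied


def split_chunks(iterable, is_separator, munge=''.join):
    chunk = []
    for element in iterable:
        if is_separator(element):
            yield from flush(chunk, munge)
        else:
            chunk.append(element)
    yield from flush(chunk, munge)


lines = ["a\n", "b\n", "\n", "c\n"]
print(list(split_chunks(lines, lambda s: not s.strip())))  # prints ['a\nb\n', 'c\n']
```

The duplication shrinks to a repeated `yield from flush(...)` line, which is the same trade-off as the `maydosomething` refactoring.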


> for element in iterable + separator:
> ...
>
> but that isn't possible, of course. (It could be possible with some
> fiddling with itertools etc., I guess.)

Indeed, there ain't much "fiddling" needed at all -- you just
DO need to know SOME acceptable separator, however:

import itertools

def chunkitup(iterable, isSeparator, aSeparator, munge=''.join):
    # a sanity check never hurts...
    assert isSeparator(aSeparator)

    chunk = []
    for element in itertools.chain(iterable, [aSeparator]):
        if isSeparator(element):
            yield munge(chunk)
            chunk = []
        else: chunk.append(element)
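
A short usage sketch of `chunkitup`, restated so the block runs on its own; the sample lines and the `is_blank` helper are assumptions for illustration.

```python
import itertools

def chunkitup(iterable, isSeparator, aSeparator, munge=''.join):
    assert isSeparator(aSeparator)
    chunk = []
    for element in itertools.chain(iterable, [aSeparator]):
        if isSeparator(element):
            yield munge(chunk)
            chunk = []
        else:
            chunk.append(element)

def is_blank(s):
    return not s.strip()

print(list(chunkitup(["one\n", "two\n", "\n", "three\n"], is_blank, "\n")))
# prints ['one\ntwo\n', 'three\n']

print(list(chunkitup(["a\n", "\n", "\n", "b\n"], is_blank, "\n")))
# prints ['a\n', '', 'b\n']
```

The second call shows that back-to-back separators produce an empty result; guarding the `yield` with `if chunk:` would suppress it if that is not wanted.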

> If it were possible to check whether the iterator extracted from the
> iterable was at an end, that could help too -- but I see no elegant
> way of doing it.

Elegance is in the eye of the beholder, but...:

class iter_with_lookahead:
    def __init__(self, iterable):
        self.it = iter(iterable)
        self.done = False
        self.step()
    def __iter__(self):
        return self
    def step(self):
        try:
            self.lookahead = self.it.next()
        except StopIteration:
            self.done = True
    def next(self):
        if self.done: raise StopIteration
        result = self.lookahead
        self.step()
        return result

...I've had occasion to use variants of this in order to be able
to peek ahead, check if an iterator was done, or in small further
variants to give an iterator one level of "pushback", etc, etc.
So, if you have a wrapper such as this one around somewhere, you
might choose to reuse it (though it probably wouldn't be worth
developing for the sole purpose of this use!-):

def chunkitup1(iterable, isSeparator, munge=''.join):
    chunk = []
    it = iter_with_lookahead(iterable)
    for element in it:
        issep = isSeparator(element)
        if not issep:
            chunk.append(element)
        if issep or it.done:
            yield munge(chunk)
            chunk = []
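
The class above uses the Python 2 iterator protocol (`it.next()` and a `next` method). A sketch of the same idea for Python 3, where the protocol method is `__next__` and the built-in `next()` advances an iterator; the CamelCase class name and the sample data are assumptions.

```python
class IterWithLookahead:
    """Iterator wrapper whose .done flag turns true one step early."""
    def __init__(self, iterable):
        self.it = iter(iterable)
        self.done = False
        self.step()
    def __iter__(self):
        return self
    def step(self):
        try:
            self.lookahead = next(self.it)
        except StopIteration:
            self.done = True
    def __next__(self):
        if self.done:
            raise StopIteration
        result = self.lookahead
        self.step()
        return result


def chunkitup1(iterable, is_separator, munge=''.join):
    chunk = []
    it = IterWithLookahead(iterable)
    for element in it:
        issep = is_separator(element)
        if not issep:
            chunk.append(element)
        if issep or it.done:        # flush on a separator or the last element
            yield munge(chunk)
            chunk = []


print(list(chunkitup1(["a\n", "\n", "b\n"], lambda s: not s.strip())))
# prints ['a\n', 'b\n']
```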



> I can't really see any good way of using the while/break idiom either,

Well, you COULD use a different wrapper class to obtain code such as:

def chunkitup2(iterable, isSeparator, munge=''.join):
    wit = wild_thing(iterable, isSeparator)
    while wit:
        if wit.isSeparator() and wit.hasChunk():
            yield munge(wit.getChunk())

but the wrapper wouldn't be all that nice under the covers AND it
would in practice have to embody a bit too much of the control
logic and bury it in a non-obvious place -- so I wouldn't pursue
this tack, myself.


Alex

Magnus Lie Hetland

Mar 24, 2003, 7:22:42 PM
In article <698f09f8.03032...@posting.google.com>, Jeremy
Fincher wrote:
>Maybe something like this can work?
[snip]

>
>def itersplit(iterable, isSeparator):
>    acc = []
>    for element in iterable:
>        if isSeparator(element):
>            yield acc
>            acc = []
>        else:
>            acc.append(element)
>    yield acc

You should add "if acc" before you yield acc -- I don't want an empty
acc (that only means several separators in a row -- which amounts to a
single separator in my case). And, with that statement in place, you'd
get the same duplication as before, as far as I can see. What is new
about this (except putting it inside a generator)?

Thanks for the input, though.

Magnus Lie Hetland

Mar 24, 2003, 7:36:20 PM
In article <YoCfa.960$i26....@news2.tin.it>, Alex Martelli wrote:
>Magnus Lie Hetland wrote:
>
>Ah, I recognize the outline of our joint contribution to the
>printed Cookbook (recipe 4.8...).

:)

I use this sort of thing in the "Instant Markup" chapter of my book as
well. But I haven't found a nice solution, except artificially tucking
an extra separator onto the iterator. Maybe that's nice enough,
though...

>> I've noticed that I use the following in several contexts:

> [fixing as per followups]

Good :)

[snip]


>First refactoring that comes to mind is:

Yes. I thought of refactoring as a solution too. It does reduce the
duplication to a duplicated function call -- which may be good enough.
I just sort of hoped I could avoid it altogether :]

>but this wouldn't work for the specific use case you require:
>
>> If the iterable above is a file, isSeparator(element) is simply
>> defined as not element.strip() and doSomething(chunk) is
>> yield(''.join(chunk)) you have a paragraph splitter. I've been using
>
>i.e., factoring out a *yield* to maydosomething would NOT work.

Now -- the yield isn't important. I could very well update a list
instead. (Although it would be nice if the yield-thing were possible
too...)

>So I'll focus on the specific case of yield in the following,
>assuming a "munge" function such as
>def munge(chunk): return ''.join(chunk)
>is also passed as an argument.

OK.

[snip]


>Indeed, there ain't much "fiddling" needed at all

But some, though ;)

> -- you just
>DO need to know SOME acceptable separator, however:

Hm. Yeah -- that's sort of the thing I don't really like, I guess.
(It's a pretty vague feeling, though ;)

>import itertools


>
> for element in itertools.chain(iterable, [aSeparator]):

Yup. This was discussed separately in another thread, where I
suggested some related tools to itertools (and got the above as a
suggested alternative).

This is, as I mentioned, more or less what I did in the "Instant
Markup" thing. I simply used something along the lines of

def lines(file):
    for line in file:
        yield line
    yield '\n'

and then iterated over that when producing paragraphs.

[snip]


>Elegance is in the eye of the beholder, but...:

Indeed. My notion of elegance has been known to be a bit superficial
at times ;)

[snip]

I'm actually using something very similar in another context (where I
need to know whether two thingies of the same kind are next to each
other :)

>...I've had occasion to use variants of this in order to be able
>to peek ahead, check if an iterator was done, or in small further
>variants to give an iterator one level of "pushback", etc, etc.

Indeed. Perhaps some simple, general version of this (maybe even like
the one you described above -- although perhaps with a slightly more
pithy name?) might be a candidate for itertools? Lookahead can be
useful in many cases (such as when writing a parser, for instance :)

>So, if you have a wrapper such as this one around somewhere, you
>might choose to reuse it (though it probably wouldn't be worth
>developing for the sole purpose of this use!-):

Indeed...

[snip]
> if issep or it.done:

Yes -- this is exactly what I'm missing, I suppose.

>> I can't really see any good way of using the while/break idiom
>> either,
>
>Well, you COULD use a different wrapper class to obtain code such as:

Yeah... If I was to write a wrapper class in the first place, I could
do pretty much anything, I suppose :]

>but the wrapper wouldn't be all that nice under the covers AND it
>would in practice have to embody a bit too much of the control
>logic and bury it in a non-obvious place -- so I wouldn't pursue
>this tack, myself.

Agreed.

I guess what it all boils down to is that it would be nice to know
whether one is in the last iteration of a for loop. Sadly, I see no
way of doing that in the general case.
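
One way to approximate "am I in the last iteration?" for an arbitrary iterable is a small generator that stays one element ahead and pairs each element with a flag. This is a sketch only; `with_last_flag` and `split_chunks` are hypothetical names, written for present-day Python.

```python
def with_last_flag(iterable):
    """Yield (element, is_last) pairs using one element of lookahead."""
    it = iter(iterable)
    try:
        previous = next(it)
    except StopIteration:
        return                      # empty input: nothing to yield
    for current in it:
        yield previous, False
        previous = current
    yield previous, True


def split_chunks(iterable, is_separator):
    chunk = []
    for element, is_last in with_last_flag(iterable):
        if not is_separator(element):
            chunk.append(element)
        if (is_separator(element) or is_last) and chunk:
            yield chunk
            chunk = []


print(list(split_chunks("ab--cd-e", lambda c: c == "-")))
# prints [['a', 'b'], ['c', 'd'], ['e']]
```

The flush now appears once, at the price of buffering one element inside the helper.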

>Alex

Jeremy Fincher

Mar 25, 2003, 2:26:54 AM
m...@furu.idi.ntnu.no (Magnus Lie Hetland) wrote in message news:<slrnb7v8a...@furu.idi.ntnu.no>...

> You should add "if acc" before you yield acc -- I don't want an empty
> acc (that only means several separators in a row -- which amounts to a
> single separator in my case).

That makes sense. To be truly general, that should be a named
argument with a default to not return empty values.
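
A sketch of what that named argument could look like; the parameter name `keep_empty` and the sample data are assumptions, not something settled in the thread.

```python
def itersplit(iterable, is_separator, keep_empty=False):
    """Split iterable into lists at separators; drop empty lists by default."""
    acc = []
    for element in iterable:
        if is_separator(element):
            if acc or keep_empty:
                yield acc
            acc = []
        else:
            acc.append(element)
    if acc or keep_empty:
        yield acc


data = [1, 0, 0, 2, 3, 0]
print(list(itersplit(data, lambda x: x == 0)))
# prints [[1], [2, 3]]
print(list(itersplit(data, lambda x: x == 0, keep_empty=True)))
# prints [[1], [], [2, 3], []]
```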


> And, with that statement in place, you'd
> get the same duplication as before, as far as I can see. What is new
> about this (except putting it inside a generator)?

Simply that once writing it, the ugliness is contained in that one
function, and all your code that needs the behavior you describe can
be written much more beautifully :)

Jeremy

Magnus Lie Hetland

Mar 25, 2003, 9:07:08 AM
In article <698f09f8.03032...@posting.google.com>, Jeremy
Fincher wrote:
>m...@furu.idi.ntnu.no (Magnus Lie Hetland) wrote in message news:<slrnb7v8a...@furu.idi.ntnu.no>...
>> You should add "if acc" before you yield acc -- I don't want an empty
>> acc (that only means several separators in a row -- which amounts to a
>> single separator in my case).
>
>That makes sense. To be truly general, that should be a named
>argument with a default to not return empty values.
>

Maybe. I wasn't really looking for a general function/generator, but
for an idiom (i.e. a way of solving this with basic tools). An
unrealistic wish, perhaps :)

>
>> And, with that statement in place, you'd
>> get the same duplication as before, as far as I can see. What is new
>> about this (except putting it inside a generator)?
>
>Simply that once writing it, the ugliness is contained in that one
>function, and all your code that needs the behavior you describe can
>be written much more beautifully :)

Indeed.

>Jeremy

Manuel Garcia

Mar 25, 2003, 2:54:04 PM
On Sun, 23 Mar 2003 14:19:53 -0500, "Tim Peters"
<tim...@email.msn.com> wrote:

>If the code made sense <wink>, something like
>
>def terminated_iterator(iterable, a_separator):
>    for element in iterable:
>        yield element
>    yield a_separator
>
>would produce the original sequence, then tack a_separator on to the end.

Isn't it a general rule that terminators are easier to work with than
separators? I remember some programming guru saying this (Jon
Bentley?) I think it was Pascal's use of separators between
statements that convinced Dennis Ritchie to use terminators instead.

When I have to deal with separators, I always tack an extra one on the
end, using a trick like the one above, or a simple append or
concatenation. This is usually good to make a boolean or repeated
code vanish. For string processing, I usually throw an extra one on
the front too, for good luck.
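
As an illustration of the tack-one-on-the-end trick for strings, here is a hand-rolled field splitter. It is a sketch only, with `fields` a hypothetical name; in practice `str.split` already does this job.

```python
def fields(text, sep=","):
    """Split text on sep by appending one extra sep as a terminator."""
    result = []
    start = 0
    terminated = text + sep         # every field now ends with sep
    for i, ch in enumerate(terminated):
        if ch == sep:
            result.append(terminated[start:i])
            start = i + 1
    return result


print(fields("a,b,,c"))             # prints ['a', 'b', '', 'c']
```

Because the last field is terminated like all the others, no extra append is needed after the loop.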

Separator is also harder to spell than terminator. ;-)

Manuel

Daniel Timothy Bentley

Mar 25, 2003, 5:54:20 PM
On Tue, 25 Mar 2003, Manuel Garcia wrote:

> Isn't it a general rule that terminators are easier to work with than
> separators? I remember some programming guru saying this (Jon
> Bentley?) I think it was Pascal's use of separators between

Doubtful. He can't remember such a thing.
ObOldPersonMemoryDisclaimer

-Dan

>
> Manuel
>
>

Manuel M Garcia

Mar 26, 2003, 1:14:51 AM
On Tue, 25 Mar 2003 14:54:20 -0800, Daniel Timothy Bentley
<dben...@stanford.edu> wrote:

>> Isn't it a general rule that terminators are easier to work with than
>> separators? I remember some programming guru saying this (Jon
>> Bentley?) I think it was Pascal's use of separators between
>
>Doubtful. He can't remember such a thing.
>ObOldPersonMemoryDisclaimer

I guess it could be considered a special case of adding a sentinel at
a boundary (5.1 in 'Writing Efficient Programs'). But that section is
definitely not talking specifically about terminators vs. separators.
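
For readers unfamiliar with the sentinel idea, a generic sketch (not taken from the book) is a linear search that plants the target at the end of the data so the loop needs no bounds check. The function name is hypothetical.

```python
def find_with_sentinel(items, target):
    """Return the index of target in items, or -1, using a sentinel."""
    n = len(items)
    padded = list(items) + [target]     # the sentinel guarantees a match
    i = 0
    while padded[i] != target:          # no "i < n" test needed in the loop
        i += 1
    return i if i < n else -1


print(find_with_sentinel([3, 5, 7], 5))   # prints 1
print(find_with_sentinel([3, 5, 7], 9))   # prints -1
```

Appending a separator to an iterable plays the same role: the boundary case is turned into an ordinary one.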

Maybe because I have done so much 'multi-dimensional' database
programming, I am acutely aware how separators make for lousy data
structures.

Manuel
