
Best practice for operations on streams of text


James

May 7, 2009, 3:59:50 PM
Hello all,
I'm working on some NLP code - what I'm doing is passing a large
number of tokens through a number of filtering / processing steps.

The filters take a token as input, and may or may not yield a token as
a result. For example, I might have filters which lowercase the
input, filter out boring words, and filter out duplicates, chained
together.

I originally had code like this:
    for t0 in token_stream:
        for t1 in lowercase_token(t0):
            for t2 in remove_boring(t1):
                for t3 in remove_dupes(t2):
                    yield t3

Apart from being ugly as sin, I only get one token out as
StopIteration is raised before the whole token stream is consumed.

Any suggestions on an elegant way to chain together a bunch of
generators, with processing steps in between?

Thanks,
James

J Kenneth King

May 7, 2009, 4:06:43 PM
James <rent.lu...@gmail.com> writes:

Co-routines, my friends. Google will help you greatly in discovering
this processing wonder.
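
For a taste, here is a minimal sketch of the send()-based coroutine
style (the priming decorator and the toy filters below are my own
illustration, not James's code):

    def coroutine(func):
        # Prime the coroutine so it is ready to receive values via send().
        def start(*args, **kwargs):
            cr = func(*args, **kwargs)
            next(cr)
            return cr
        return start

    @coroutine
    def lowercase(target):
        # Receive a token with (yield) and push the result downstream.
        while True:
            token = (yield)
            target.send(token.lower())

    @coroutine
    def printer():
        while True:
            token = (yield)
            print(token)

    pipeline = lowercase(printer())
    for token in ["Hello", "WORLD"]:
        pipeline.send(token)   # prints: hello, world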

Gary Herron

May 7, 2009, 4:23:57 PM
to James, pytho...@python.org

David Beazley has a very interesting talk on using generators for
building and linking together individual stream filters. It's very cool
and surprisingly eye-opening.

See "Generator Tricks for Systems Programmers" at
http://www.dabeaz.com/generators/

Gary Herron


MRAB

May 7, 2009, 5:07:42 PM
to pytho...@python.org
What you should be doing is letting the filters accept an iterator and
yield values on demand:

    def lowercase_token(stream):
        for t in stream:
            yield t.lower()

    def remove_boring(stream):
        for t in stream:
            if t not in boring:
                yield t

    def remove_dupes(stream):
        seen = set()
        for t in stream:
            if t not in seen:
                yield t
                seen.add(t)

    def compound_filter(token_stream):
        stream = lowercase_token(token_stream)
        stream = remove_boring(stream)
        stream = remove_dupes(stream)
        for t in stream:
            yield t
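
For example, you could drive it like this (the `boring` set is assumed
to be defined somewhere in scope; the words here are just placeholders):

    boring = set(["the", "a", "of"])   # assumed to exist at module level

    tokens = "The cat sat on the mat near the cat".split()
    for t in compound_filter(tokens):
        print(t)
    # prints: cat sat on mat near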

Terry Reedy

May 7, 2009, 6:32:25 PM
to pytho...@python.org
MRAB wrote:

> James wrote:
>> Hello all,
>> I'm working on some NLP code - what I'm doing is passing a large
>> number of tokens through a number of filtering / processing steps.
>>
>> The filters take a token as input, and may or may not yield a token as
>> a result. For example, I might have filters which lowercase the
>> input, filter out boring words, and filter out duplicates, chained
>> together.
>>
>> I originally had code like this:
>> for t0 in token_stream:
>>     for t1 in lowercase_token(t0):
>>         for t2 in remove_boring(t1):
>>             for t3 in remove_dupes(t2):
>>                 yield t3

For that to work at all, the three functions would have to turn each
token into an iterable of 0 or 1 tokens. Hence the inner 'loops' would
execute 0 or 1 times. Better to return a token or None, and replace the
three inner 'loops' with three conditional statements (ugly too), or,
less efficiently (due to lack of short-circuiting):

    t = remove_dupes(remove_boring(lowercase_token(t0)))
    if t is not None:
        yield t
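
Spelled out, the short-circuiting version would look something like
this (a sketch, assuming each per-token function returns either a token
or None):

    for t0 in token_stream:
        t1 = lowercase_token(t0)
        if t1 is None:
            continue
        t2 = remove_boring(t1)
        if t2 is None:
            continue
        t3 = remove_dupes(t2)
        if t3 is not None:
            yield t3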

>> Apart from being ugly as sin, I only get one token out as
>> StopIteration is raised before the whole token stream is consumed.

That puzzles me. Your actual code must be slightly different from the
above and what I imagine the functions to be. But never mind, because

>> Any suggestions on an elegant way to chain together a bunch of
>> generators, with processing steps in between?

MRAB's suggestion is the way to go. You automatically get
short-circuiting because each generator only gets what is passed on.
And resuming a generator is much faster than re-calling a function.

> What you should be doing is letting the filters accept an iterator and
> yield values on demand:
>
> def lowercase_token(stream):
>     for t in stream:
>         yield t.lower()
>
> def remove_boring(stream):
>     for t in stream:
>         if t not in boring:
>             yield t
>
> def remove_dupes(stream):
>     seen = set()
>     for t in stream:
>         if t not in seen:
>             yield t
>             seen.add(t)
>
> def compound_filter(token_stream):
>     stream = lowercase_token(token_stream)
>     stream = remove_boring(stream)
>     stream = remove_dupes(stream)
>     for t in stream:
>         yield t

I also recommend the Beazley reference Herron gave.

tjr

Beni Cherniavsky

May 17, 2009, 6:59:00 AM
On May 8, 12:07 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> def compound_filter(token_stream):
>      stream = lowercase_token(token_stream)
>      stream = remove_boring(stream)
>      stream = remove_dupes(stream)
>      for t in stream:
>          yield t

The last loop is superfluous. You can just do::

    def compound_filter(token_stream):
        stream = lowercase_token(token_stream)
        stream = remove_boring(stream)
        stream = remove_dupes(stream)
        return stream

which is simpler and slightly more efficient. This works because from
the caller's perspective, a generator is just a function that returns
an iterator. It doesn't matter whether it implements the iterator
itself by containing ``yield`` statements, or shamelessly passes on an
iterator implemented elsewhere.
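
A quick way to see the equivalence (a toy example):

    def doubled(stream):
        for x in stream:
            yield x * 2

    def with_yield(stream):      # generator: re-yields each item itself
        for x in doubled(stream):
            yield x

    def with_return(stream):     # plain function: returns the iterator
        return doubled(stream)

    assert list(with_yield([1, 2])) == list(with_return([1, 2])) == [2, 4]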
