
A gnarly little python loop


Roy Smith

Nov 10, 2012, 5:58:14 PM
I'm trying to pull down tweets with one of the many twitter APIs. The
particular one I'm using (python-twitter), has a call:

data = api.GetSearch(term="foo", page=page)

The way it works, you start with page=1. It returns a list of tweets.
If the list is empty, there are no more tweets. If the list is not
empty, you can try to get more tweets by asking for page=2, page=3, etc.
I've got:

page = 1
while 1:
    r = api.GetSearch(term="foo", page=page)
    if not r:
        break
    for tweet in r:
        process(tweet)
    page += 1

It works, but it seems excessively fidgety. Is there some cleaner way
to refactor this?

Ian Kelly

Nov 10, 2012, 6:17:08 PM
to Python
I'd do something like this:

import itertools

def get_tweets(term):
    for page in itertools.count(1):
        r = api.GetSearch(term, page)
        if not r:
            break
        for tweet in r:
            yield tweet

for tweet in get_tweets("foo"):
    process(tweet)
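
For anyone without access to the real API, here is a self-contained Python 3 sketch of the same generator that can be run as-is. `FakeApi` and its canned pages are hypothetical stand-ins for the python-twitter `api` object, not part of that library, and passing `api` in as a parameter is my own choice for testability. The only thing assumed about the real call is the paging contract described in the original post: an empty list means there are no more pages.

```python
import itertools


class FakeApi:
    """Hypothetical stand-in for the python-twitter api object."""

    def __init__(self, pages):
        self._pages = pages

    def GetSearch(self, term, page):
        # Pages are 1-based; past the last page an empty list comes back.
        if 1 <= page <= len(self._pages):
            return self._pages[page - 1]
        return []


def get_tweets(api, term):
    """Yield tweets one at a time, hiding the paging from the caller."""
    for page in itertools.count(1):
        r = api.GetSearch(term, page)
        if not r:
            return
        yield from r


api = FakeApi([["a", "b", "c"], ["d", "e"]])
tweets = list(get_tweets(api, "foo"))
print(tweets)
```

The caller sees a flat stream of tweets and never touches a page number, which is the whole point of the refactoring.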

Steven D'Aprano

Nov 10, 2012, 7:23:07 PM
On Sat, 10 Nov 2012 17:58:14 -0500, Roy Smith wrote:

> The way it works, you start with page=1. It returns a list of tweets.
> If the list is empty, there are no more tweets. If the list is not
> empty, you can try to get more tweets by asking for page=2, page=3, etc.
> I've got:
>
> page = 1
> while 1:
>     r = api.GetSearch(term="foo", page=page)
>     if not r:
>         break
>     for tweet in r:
>         process(tweet)
>     page += 1
>
> It works, but it seems excessively fidgety. Is there some cleaner way
> to refactor this?


Seems clean enough to me. It does exactly what you need: loop until there
are no more tweets, process each tweet.

If you're allergic to nested loops, move the inner for-loop into a
function. Also you could get rid of the "if not r: break".

page = 1
r = ["placeholder"]
while r:
    r = api.GetSearch(term="foo", page=page)
    process_all(r)  # does nothing if r is empty
    page += 1


Another way would be to use a for loop for the outer loop.

for page in xrange(1, sys.maxint):
    r = api.GetSearch(term="foo", page=page)
    if not r: break
    process_all(r)



--
Steven

Steve Howell

Nov 10, 2012, 10:03:42 PM
I think your code is perfectly readable and clean, but you can flatten
it like so:

def get_tweets(term, get_page):
    page_nums = itertools.count(1)
    pages = itertools.imap(api.getSearch, page_nums)
    valid_pages = itertools.takewhile(bool, pages)
    tweets = itertools.chain.from_iterable(valid_pages)
    return tweets

Stefan Behnel

Nov 11, 2012, 2:56:10 AM
to pytho...@python.org
Steve Howell, 11.11.2012 04:03:
I'd prefer the original code ten times over this inaccessible beast.

Stefan


Cameron Simpson

Nov 11, 2012, 3:48:36 AM
to pytho...@python.org
Me too.
--
Cameron Simpson <c...@zip.com.au>

In an insane society, the sane man must appear insane.
- Keith A. Schauer <ke...@balrog.dseg.ti.com>

Paul Rubin

Nov 11, 2012, 4:09:37 AM
Cameron Simpson <c...@zip.com.au> writes:
> | I'd prefer the original code ten times over this inaccessible beast.
> Me too.

Me, I like the itertools version better. There's one chunk of data
that goes through a succession of transforms each of which
is very straightforward.

Peter Otten

Nov 11, 2012, 4:54:33 AM
to pytho...@python.org
[Steve Howell]
> def get_tweets(term, get_page):
>     page_nums = itertools.count(1)
>     pages = itertools.imap(api.getSearch, page_nums)
>     valid_pages = itertools.takewhile(bool, pages)
>     tweets = itertools.chain.from_iterable(valid_pages)
>     return tweets


But did you spot the bug(s)?
My itertools-based version would look like this

import itertools

def get_tweets(term):
    pages = (api.GetSearch(term, pageno)
             for pageno in itertools.count(1))
    for page in itertools.takewhile(bool, pages):
        yield from page

but I can understand that it's not everybody's cup of tea.

Steve Howell

Nov 11, 2012, 12:16:06 PM
to pytho...@python.org
On Sunday, November 11, 2012 1:54:46 AM UTC-8, Peter Otten wrote:
> Paul Rubin wrote:
>
> > Cameron Simpson <c...@zip.com.au> writes:
> >> | I'd prefer the original code ten times over this inaccessible beast.
> >> Me too.
> >
> > Me, I like the itertools version better. There's one chunk of data
> > that goes through a succession of transforms each of which
> > is very straightforward.
>
> [Steve Howell]
> > def get_tweets(term, get_page):
> >     page_nums = itertools.count(1)
> >     pages = itertools.imap(api.getSearch, page_nums)
> >     valid_pages = itertools.takewhile(bool, pages)
> >     tweets = itertools.chain.from_iterable(valid_pages)
> >     return tweets
>
> But did you spot the bug(s)?

My first version was sketching out the technique, and I don't have handy access to the API.

Here is an improved version:

import itertools

def get_tweets(term):
    def get_page(page):
        return getSearch(term, page)
    page_nums = itertools.count(1)
    pages = itertools.imap(get_page, page_nums)
    valid_pages = itertools.takewhile(bool, pages)
    tweets = itertools.chain.from_iterable(valid_pages)
    return tweets

for tweet in get_tweets("foo"):
    process(tweet)

This is what I used to test it:


def getSearch(term = "foo", page = 1):
    # simulate api for testing
    if page < 5:
        return [
            'page %d, tweet A for term %s' % (page, term),
            'page %d, tweet B for term %s' % (page, term),
        ]
    else:
        return None

def process(tweet):
    print tweet
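
The code in this thread is Python 2. A rough Python 3 rendering, offered as a sketch rather than a tested port against the real API: `itertools.imap` no longer exists because the built-in `map` is already lazy, and `print` is a function. The `getSearch` stub is the same simulated API as in the post, returning `None` past page 4; `takewhile(bool, ...)` stops on `None` just as it would on an empty list, since both are falsy.

```python
import itertools


def getSearch(term="foo", page=1):
    # Simulated API for testing, as in the post above.
    if page < 5:
        return [
            'page %d, tweet A for term %s' % (page, term),
            'page %d, tweet B for term %s' % (page, term),
        ]
    return None


def get_tweets(term):
    def get_page(page):
        return getSearch(term, page)
    page_nums = itertools.count(1)
    pages = map(get_page, page_nums)               # lazy in Python 3
    valid_pages = itertools.takewhile(bool, pages)
    return itertools.chain.from_iterable(valid_pages)


results = list(get_tweets("foo"))
for tweet in results:
    print(tweet)
```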

Steve Howell

Nov 11, 2012, 12:29:15 PM
Thanks, Paul.

Even though I supplied the "inaccessible" itertools version, I can
understand why folks find it inaccessible. As I said to the OP, there
was nothing wrong with the original imperative approach; I was simply
providing an alternative.

It took me a while to appreciate itertools, but the metaphor that
resonates with me is a Unix pipeline. It's just a metaphor, so folks
shouldn't be too literal, but the idea here is this:

page_nums -> pages -> valid_pages -> tweets

The transforms are this:

page_nums -> pages: call API via imap
pages -> valid_pages: take while true
valid_pages -> tweets: use chain.from_iterable to flatten results

Here's the code again for context:

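def get_tweets(term):
    def get_page(page):
        return getSearch(term, page)
    page_nums = itertools.count(1)
    pages = itertools.imap(get_page, page_nums)
    valid_pages = itertools.takewhile(bool, pages)
    tweets = itertools.chain.from_iterable(valid_pages)
    return tweets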
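
One property the pipeline metaphor implies but does not spell out is laziness: no page is fetched until a tweet is actually pulled from the far end. The Python 3 sketch below makes that visible by logging every fetch. The `get_page` function and its two pages of data are made up for illustration and do not touch the real API.

```python
import itertools

requested = []  # records which page numbers were actually fetched


def get_page(page):
    # Hypothetical stand-in for the API: two pages of data, then empty.
    requested.append(page)
    data = {1: ["a", "b"], 2: ["c"]}
    return data.get(page, [])


page_nums = itertools.count(1)
pages = map(get_page, page_nums)
valid_pages = itertools.takewhile(bool, pages)
tweets = itertools.chain.from_iterable(valid_pages)

before = list(requested)       # building the pipeline fetched nothing
first = next(tweets)           # pulling one tweet fetches page 1 only
after_first = list(requested)
rest = list(tweets)            # draining fetches the remaining pages
```

After draining, `requested` holds one page number more than there are pages of data: the empty page has to be fetched once so that `takewhile` can see it and stop.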

Peter Otten

Nov 11, 2012, 1:34:06 PM
to pytho...@python.org
Actually you supplied the "accessible" itertools version. For reference,
here's the inaccessible version:

class api:
    """Twitter search API mock-up"""
    pages = [
        ["a", "b", "c"],
        ["d", "e"],
    ]
    @staticmethod
    def GetSearch(term, page):
        assert term == "foo"
        assert page >= 1
        if page > len(api.pages):
            return []
        return api.pages[page-1]

from collections import deque
from functools import partial
from itertools import chain, count, imap, takewhile

def process(tweet):
    print tweet

term = "foo"

deque(
    imap(
        process,
        chain.from_iterable(
            takewhile(bool, imap(partial(api.GetSearch, term), count(1))))),
    maxlen=0)

;)

Steve Howell

Nov 11, 2012, 2:16:06 PM
I know Peter's version is tongue in cheek, but I do think that it has
a certain expressive power, and it highlights three mind-expanding
Python modules.

Here's a re-flattened take on Peter's version ("Flat is better than
nested." -- PEP 20):

term = "foo"
search = partial(api.GetSearch, term)
nums = count(1)
paged_tweets = imap(search, nums)
paged_tweets = takewhile(bool, paged_tweets)
tweets = chain.from_iterable(paged_tweets)
processed_tweets = imap(process, tweets)
deque(processed_tweets, maxlen=0)

The use of deque to exhaust an iterator is slightly overboard IMHO,
but all the other lines of code can be fairly easily understood once
you read the docs.

partial: http://docs.python.org/2/library/functools.html
count, imap, takewhile, chain.from_iterable:
http://docs.python.org/2/library/itertools.html
deque: http://docs.python.org/2/library/collections.html
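
For readers puzzled by the `deque(..., maxlen=0)` line: a deque with `maxlen=0` pulls every item from the iterator and stores none of them, so it serves purely to run a lazy pipeline for its side effects. The same idiom appears as the `consume` recipe in the itertools documentation. A small Python 3 sketch, with a `process` function that just records what it sees (my own stand-in, not the thread's):

```python
from collections import deque

seen = []


def process(tweet):
    # Stand-in side effect: remember each tweet that was processed.
    seen.append(tweet)


processed = map(process, ["a", "b", "c"])  # lazy: nothing has run yet
nothing_yet = list(seen)

drained = deque(processed, maxlen=0)       # pulls every item, stores none
```

A plain `for tweet in tweets: process(tweet)` does the same job and is arguably easier to read, which is the point made above about the deque being slightly overboard.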

Roy Smith

Nov 11, 2012, 2:23:46 PM
In article <mailman.3562.1352658...@python.org>,
Peter Otten <__pet...@web.de> wrote:

> deque(
>     imap(
>         process,
>         chain.from_iterable(
>             takewhile(bool, imap(partial(api.GetSearch, term), count(1))))),
>     maxlen=0)
>
> ;)

If I wanted STL, I would still be writing C++ :-)

Cameron Simpson

Nov 11, 2012, 7:43:56 PM
to Steve Howell, pytho...@python.org
On 11Nov2012 11:16, Steve Howell <show...@yahoo.com> wrote:
| On Nov 11, 10:34 am, Peter Otten <__pete...@web.de> wrote:
| > Steve Howell wrote:
| > > On Nov 11, 1:09 am, Paul Rubin <no.em...@nospam.invalid> wrote:
| > >> Cameron Simpson <c...@zip.com.au> writes:
| > >> > | I'd prefer the original code ten times over this inaccessible beast.
| > >> > Me too.
| >
| > >> Me, I like the itertools version better.  There's one chunk of data
| > >> that goes through a succession of transforms each of which
| > >> is very straightforward.
| >
| > > Thanks, Paul.
| >
| > > Even though I supplied the "inaccessible" itertools version, I can
| > > understand why folks find it inaccessible.  As I said to the OP, there
| > > was nothing wrong with the original imperative approach; I was simply
| > > providing an alternative.
| >
| > > It took me a while to appreciate itertools, but the metaphor that
| > > resonates with me is a Unix pipeline.
[...]
| > Actually you supplied the "accessible" itertools version. For reference,
| > here's the inaccessible version:
[...]
| I know Peter's version is tongue in cheek, but I do think that it has
| a certain expressive power, and it highlights three mind-expanding
| Python modules.
| Here's a re-flattened take on Peter's version ("Flat is better than
| nested." -- PEP 20):
[...]

Ok, who's going to quiz the OP on his/her uptake of these techniques...
--
Cameron Simpson <c...@zip.com.au>

It's hard to make a man understand something when his livelihood depends
on him not understanding it. - Upton Sinclair

Steve Howell

Nov 11, 2012, 8:38:23 PM
On Nov 11, 4:44 pm, Cameron Simpson <c...@zip.com.au> wrote:
Cameron, with all due respect, I think you're missing the point.

Roy posted this code:

page = 1
while 1:
    r = api.GetSearch(term="foo", page=page)
    if not r:
        break
    for tweet in r:
        process(tweet)
    page += 1

In his own words, he described the loop as "gnarly" and the overall
code as "fidgety."

One way to eliminate the "while", the "if", and the "break" statements
is to use higher level constructs that are shipped with all modern
versions of Python, and which are well documented and well tested (and
fast, I might add):

search = partial(api.GetSearch, "foo")
paged_tweets = imap(search, count(1))
paged_tweets = takewhile(bool, paged_tweets)
tweets = chain.from_iterable(paged_tweets)
for tweet in tweets:
    process(tweet)

The moral of the story is that you can avoid brittle loops by relying
on a well-tested library to work at a higher level of abstraction.

For this particular use case, the imperative version is fine, but for
more complex use cases, the loops are only gonna get more gnarly and
fidgety.



rusi

Nov 12, 2012, 2:09:31 AM
This is a classic problem -- structure clash of parallel loops -- and
Steve Howell has given the classic solution using the fact that
generators in python simulate/implement lazy lists.
As David Beazley http://www.dabeaz.com/coroutines/ explains,
coroutines are more general than generators and you can use those if
you prefer.

The classic problem used to be stated like this:
There is an input in cards of 80 columns.
It needs to be copied onto printer of 132 columns.

The structure clash arises because after reading 80 chars a new card
has to be read; after printing 132 chars a linefeed has to be given.

To pythonize the problem, let's replace the 80,132 by 3,4, ie take the
char-square
abc
def
ghi

and produce
abcd
efgh
i

The important difference (explained nicely by Beazley) is that in
generators the for-loop pulls the generators, in coroutines, the
'generator' pushes the consuming coroutines.


---------------
from __future__ import print_function
s = ["abc", "def", "ghi"]

# Coroutine-infrastructure from pep 342
def consumer(func):
    def wrapper(*args,**kw):
        gen = func(*args, **kw)
        gen.next()
        return gen
    return wrapper

@consumer
def endStage():
    while True:
        for i in range(0,4):
            print((yield), sep='', end='')
        print("\n", sep='', end='')


def genStage(s, target):
    for line in s:
        for i in range(0,3):
            target.send(line[i])


if __name__ == '__main__':
    genStage(s, endStage())
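
For comparison, here is the same 3-to-4 re-chunking written in the pull style, where the consumer drives the loop instead of being sent characters. This is a Python 3 sketch of my own, not code from Beazley's material; `rechunk` is a name invented for the example.

```python
from itertools import chain, islice


def rechunk(lines, width):
    """Yield strings of `width` characters drawn from the concatenated lines."""
    chars = chain.from_iterable(lines)  # one flat stream of characters
    while True:
        chunk = "".join(islice(chars, width))
        if not chunk:
            return
        yield chunk


rows = list(rechunk(["abc", "def", "ghi"], 4))
for row in rows:
    print(row)
```

The structure clash disappears because `chain.from_iterable` hides the input line boundaries and `islice` imposes the output ones, so neither loop has to know about the other's record size.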






rusi

Nov 12, 2012, 10:21:49 AM
On Nov 12, 12:09 pm, rusi <rustompm...@gmail.com> wrote:
> This is a classic problem -- structure clash of parallel loops
<rest snipped>

Sorry wrong solution :D

The fidgetiness is entirely due to Python not allowing C-style loops
like these:
>> while ((c = getchar()) != EOF) { ... }


Putting it into coroutine form, it becomes something like the
following [untested, since I don't have the API]. Clearly the
fidgetiness is there as before, and now with extra coroutine plumbing.

def genStage(term, target):
    page = 1
    while 1:
        r = api.GetSearch(term=term, page=page)
        if not r: break
        for tweet in r: target.send(tweet)
        page += 1


@consumer
def endStage():
    while True: process((yield))

if __name__ == '__main__':
    genStage("foo", endStage())
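
On the C-style loop point: Python 3.8 and later have assignment expressions (PEP 572), which allow something close to the C idiom. The sketch below assumes Python 3.8+; `get_search` and `PAGES` are hypothetical stand-ins for `api.GetSearch` and its results, used only so the loop can run.

```python
import itertools

PAGES = [["a", "b", "c"], ["d", "e"]]  # made-up data


def get_search(term, page):
    # Hypothetical stand-in for api.GetSearch: empty list past the last page.
    return PAGES[page - 1] if page <= len(PAGES) else []


processed = []
page_nums = itertools.count(1)
# Fetch and test in one expression; the loop ends on the first empty page.
while r := get_search("foo", next(page_nums)):
    for tweet in r:
        processed.append(tweet)
```

The `if not r: break` and the `page += 1` both disappear, though whether this reads better than the generator versions is a matter of taste.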

Peter Otten

Nov 12, 2012, 10:49:23 AM
to pytho...@python.org
rusi wrote:

> The fidgetiness is entirely due to python not allowing C-style loops
> like these:
> >>> while ((c = getchar()) != EOF) { ... }

for c in iter(getchar, EOF):
    ...
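
The two-argument form `iter(callable, sentinel)` calls the callable repeatedly and stops as soon as it returns a value equal to the sentinel. A runnable Python 3 sketch, in which `getchar` is simulated by reading one character at a time from an in-memory stream and `EOF` is the empty string that `read` returns at end of input (both are assumptions made for the example):

```python
import io
from functools import partial

stream = io.StringIO("abc")
getchar = partial(stream.read, 1)  # returns '' once the input is exhausted
EOF = ""

chars = [c for c in iter(getchar, EOF)]
print(chars)
```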

> Clearly the fidgetiness is there as before and now with extra coroutine
> plumbing

Hmm, very funny...


Steve Howell

Nov 12, 2012, 11:09:16 AM
On Nov 12, 7:21 am, rusi <rustompm...@gmail.com> wrote:
> On Nov 12, 12:09 pm, rusi <rustompm...@gmail.com> wrote:
> > This is a classic problem -- structure clash of parallel loops
>
> <rest snipped>
>
> Sorry wrong solution :D
>
> The fidgetiness is entirely due to python not allowing C-style loops
> like these:
>
> >> while ((c = getchar()) != EOF) { ... }
> [...]

There are actually three fidgety things going on:

1. The API is 1-based instead of 0-based.
2. You don't know the number of pages in advance.
3. You want to process tweets, not pages of tweets.

Here's yet another take on the problem:

# wrap fidgety 1-based api
def search(i):
    return api.GetSearch("foo", i+1)

paged_tweets = (search(i) for i in count())

# handle sentinel
paged_tweets = iter(paged_tweets.next, [])

# flatten pages

rusi

Nov 12, 2012, 11:14:04 PM
[Steve Howell]
Nice on the whole -- thanks
Could not the 1-based-ness be dealt with by using count(1)?
ie use
paged_tweets = (api.GetSearch("foo", i) for i in count(1))
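
Putting the `count(1)` suggestion together with the sentinel handling gives something like the Python 3 sketch below. In Python 3 a generator has no `.next` method, so `partial(next, paged_tweets)` supplies the zero-argument callable that `iter` needs. The final flattening step with `chain.from_iterable` is my guess at how the pipeline would finish, and `get_search` with its `PAGES` data is a hypothetical stand-in for `api.GetSearch`.

```python
from functools import partial
from itertools import chain, count

PAGES = {1: ["a", "b"], 2: ["c"]}  # made-up data


def get_search(term, page):
    # Hypothetical stand-in for api.GetSearch: empty list when out of pages.
    return PAGES.get(page, [])


paged_tweets = (get_search("foo", i) for i in count(1))

# handle sentinel: stop at the first page that equals []
paged_tweets = iter(partial(next, paged_tweets), [])

# flatten pages (assumed final step)
tweets = list(chain.from_iterable(paged_tweets))
```

Note that the sentinel form stops only on a value equal to `[]`, whereas `takewhile(bool, ...)` stops on any falsy page, including `None`.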

[Peter]
> >>> while ((c = getchar()) != EOF) { ... }

for c in iter(getchar, EOF):
    ...

Thanks. Learnt something