new cursor-based pagination not multithread-friendly


alan_b

Sep 18, 2009, 4:09:39 AM
to Twitter Development Talk
When retrieving a large followers list from the API, what I did was
estimate the number of pages I need (total / 5000) from the follower
count in the user's profile, and then send concurrent API requests to
improve the speed.

Now, with the new cursor-based pagination, this becomes impossible,
because I don't know the next_cursor until I finish downloading a
whole page. (Page-based pagination still works, but I guess it will be
obsoleted someday?) So I think page-based pagination should be
preserved (and improved) rather than made obsolete.
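
To make the contrast concrete, here is a minimal sketch of the two access patterns. It is not Twitter client code: `fetch_page`, `fetch_cursor_page`, `start_cursor`, and the use of `None` as the "no more pages" marker are hypothetical stand-ins for whatever the real API calls return.

```python
import math
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 5000  # page size mentioned in the thread


def pages_needed(follower_count, page_size=PAGE_SIZE):
    """Estimate the page count from the follower count (total / 5000, rounded up)."""
    return math.ceil(follower_count / page_size)


def fetch_by_page(fetch_page, follower_count, workers=4):
    """Page-based: every page number is known up front, so requests can overlap.

    fetch_page(page_number) is a hypothetical callable returning a list of IDs.
    """
    page_numbers = range(1, pages_needed(follower_count) + 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch_page, page_numbers))  # results keep page order
    return [follower_id for page in pages for follower_id in page]


def fetch_by_cursor(fetch_cursor_page, start_cursor):
    """Cursor-based: each request needs the cursor returned by the previous one.

    fetch_cursor_page(cursor) is a hypothetical callable returning
    (ids, next_cursor); next_cursor is assumed to be None after the last page.
    """
    ids = []
    cursor = start_cursor
    while cursor is not None:
        page, cursor = fetch_cursor_page(cursor)  # inherently sequential
        ids.extend(page)
    return ids
```

The loop in `fetch_by_cursor` cannot be parallelized for a single account, which is the limitation described above.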

jmathai

Sep 18, 2009, 4:14:05 AM
to Twitter Development Talk
I haven't switched to the cursor method yet but I access the follower
list like you described and I have no issues with performance since
the calls aren't serialized. If the reliability could be fixed and
the page parameter preserved, then I'd vote for that.

John Kalucki

Sep 18, 2009, 12:47:58 PM
to Twitter Development Talk
The page based approach does not scale with large sets. We can no
longer support this kind of API without throwing a painful number of
503s.

Working with row-counts forces the data store to recount rows in an
O(n^2) manner. Cursors avoid this issue by allowing practically
constant time access to the next block. The cost becomes
O(n/block_size) which, yes, is O(n), but a graceful one given n < 10^7
and a block_size of 5000. The cursor approach provides a more complete
and consistent result set.
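
As a rough illustration of the scaling argument, the sketch below counts rows touched under a simplified cost model. The model is an assumption, not a description of Twitter's data store: it supposes that a page-based request must scan from the first row up to the end of its page, while a cursor request resumes exactly where the previous one stopped. All function names are made up for this example.

```python
def requests_needed(n, block_size):
    """Number of requests to read n rows in blocks (ceiling division)."""
    return -(-n // block_size)


def offset_rows_scanned(n, block_size):
    """Rows touched if every page request re-scans from row 0 (assumed model)."""
    total = 0
    for offset in range(0, n, block_size):
        total += min(offset + block_size, n)  # skip `offset` rows, then read the page
    return total


def cursor_rows_scanned(n, block_size):
    """Rows touched if each request resumes where the last one ended (assumed model)."""
    return n


n, block_size = 10**7, 5000  # upper bound and block size quoted in the thread
print(requests_needed(n, block_size))      # 2000
print(offset_rows_scanned(n, block_size))  # 10005000000
print(cursor_rows_scanned(n, block_size))  # 10000000
```

Under this model the page-based total grows roughly as n^2 / (2 * block_size), while the cursor total grows linearly in n, spread over n / block_size requests.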

Proportionally, very few users require multiple page fetches with a
page size of 5,000.

Also, scraping the social graph repeatedly at high speed could often
be considered a low-value, borderline abusive use of the social graph
API.

-John Kalucki
http://twitter.com/jkalucki
Services, Twitter Inc.





David W.

Sep 18, 2009, 4:46:16 PM
to Twitter Development Talk
Hi Alan,

I originally thought this was a show-stopper too, but it can be worked
around by simply processing multiple accounts using those threads
rather than multiple pages of a single account.

Something like this:

Have a producer that emits the account IDs requiring update onto a
queue, which is then consumed by your thread pool. Each thread writes
its 'page' to an intermediary scratch area associated with the
account, then either emits another work item onto the queue with the
next cursor ID or, if the next ID is null, initiates a 3rd process
that completes the task on a per-account basis once all pages have
been gathered. Repeat until the queue is empty.
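
One possible shape for that scheme, as a minimal sketch using only the Python standard library. The names `crawl`, `fetch_cursor_page`, `start_cursor`, and `on_complete` are hypothetical, `None` is assumed to mark the last page, and account IDs are assumed to be unique. There is no retry or rate-limit handling here.

```python
import queue
import threading


def crawl(account_ids, fetch_cursor_page, start_cursor, on_complete, workers=4):
    """Fetch follower pages for many accounts in parallel, one page per work item.

    fetch_cursor_page(account_id, cursor) -> (ids, next_cursor) is hypothetical;
    on_complete(account_id, ids) plays the role of the per-account final step.
    Returns a dict of account_id -> exception for accounts that failed.
    """
    account_ids = list(account_ids)
    work = queue.Queue()
    scratch = {account_id: [] for account_id in account_ids}  # per-account scratch area
    errors = {}

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work
                work.task_done()
                return
            account_id, cursor = item
            try:
                ids, next_cursor = fetch_cursor_page(account_id, cursor)
                scratch[account_id].extend(ids)
                if next_cursor is None:
                    on_complete(account_id, scratch[account_id])
                else:
                    # Only one item per account is ever in flight, so pages
                    # for a single account are still fetched in order.
                    work.put((account_id, next_cursor))
            except Exception as exc:
                errors[account_id] = exc  # give up on this account in this sketch
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for account_id in account_ids:  # the producer
        work.put((account_id, start_cursor))
    work.join()  # returns once every queued item, including follow-ups, is done
    for _ in threads:
        work.put(None)
    for thread in threads:
        thread.join()
    return errors
```

Each follow-up item is queued before the current one is marked done, so `work.join()` does not return while any account still has pages outstanding.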

If you don't have multiple accounts to process, then I guess that
doesn't work. Note in the old scheme, your threads would have been
causing localized load spikes for Twitter anyway.


David

Kevin Mesiab

Sep 18, 2009, 6:58:59 PM
to twitter-deve...@googlegroups.com
We can deal w/ rate limiting, just give us some semblance of accuracy
or the calls are pointless.