Long polling across multiple tornados / machines


Romy

Jan 11, 2011, 2:25:26 AM1/11/11
to Tornado Web Server
Ben (and friends),

I was curious what FriendFeed used for long polling (what kind of IPC,
data structures, etc) since I know you guys kept a connection open for
every user, and ran multiple instances per box. Also, not sure if
you're allowed to share but would love to know (ballpark) how many
open connections a single box could handle, so I can decide whether
doing long-polling off a single VPS, for every visitor, is a good or
bad idea.

I need to push updates to users in a pub/sub manner. I was thinking of
something similar to the chat client demo (ie storing callbacks in a
growing list, emptying on new msg, etc) but with some form of IPC
across Tornado instances to sync. I suppose eventually it'd turn into
RPC once the setup expands beyond a single machine. Is this a good
approach, or am I missing anything?
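(For reference, the chat-demo pattern described above boils down to something like this pure-Python sketch; class and method names are illustrative, not the demo's, and in Tornado each callback would finish a pending request.)

```python
class MessageBuffer:
    def __init__(self):
        self.waiters = []   # callbacks for clients waiting on new messages
        self.cache = []     # recent messages, for clients that reconnect late

    def wait_for_messages(self, callback, cursor=None):
        # If the client has already missed messages, reply immediately.
        if cursor is not None:
            missed = [m for m in self.cache if m["id"] > cursor]
            if missed:
                callback(missed)
                return
        self.waiters.append(callback)

    def new_message(self, message):
        # A new message empties the waiter list, answering every long poll.
        self.cache.append(message)
        waiters, self.waiters = self.waiters, []
        for cb in waiters:
            cb([message])
```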

Cheers,

R

Ben Darnell

Jan 11, 2011, 4:38:57 PM1/11/11
to python-...@googlegroups.com
FriendFeed's system was pretty simple: a single database table (containing (timestamp, entry id, feed id) tuples) served as a kind of global message queue. All the servers polled this table frequently and would check whether any of the new/updated entries matched any of that server's pending long-poll operations. Fine-grained pubsub complicates things greatly and should probably be avoided as long as possible.

It takes some tuning to maximize the number of connections allowed by the OS - default configurations often limit you to 512 or 1024 connections per process.  This applies at multiple levels, including e.g. nginx/haproxy, any NAT devices, and possibly the virtualization layer.  I don't have notes handy on what exactly needs to be tweaked.  Once you've raised the various limits, idle connections are pretty cheap on the tornado side, so you can have a lot of them per process.
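(One of those per-process limits can at least be inspected and raised from Python itself; this sketch bumps the file-descriptor soft limit, which caps open sockets, up to the hard limit. Raising the hard limit itself, and the limits in nginx/haproxy or NAT devices, has to happen outside the process.)

```python
import resource

# Each open connection consumes a file descriptor, so RLIMIT_NOFILE caps
# how many sockets one Tornado process can hold. An unprivileged process
# may raise its soft limit up to (but not beyond) the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```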

-Ben

Romy

Jan 12, 2011, 2:30:16 PM1/12/11
to Tornado Web Server
Thank you Ben.

When you say fine-grained pubsub, I'm assuming you mean notifying upon
actual updates, rather than polling?

And with the polling model you guys had, that table never became a
bottleneck?

Thanks for the numbers,

R

Gary B

Jan 12, 2011, 3:15:47 PM1/12/11
to python-...@googlegroups.com
The table is polled once per application instance, not once per client. 

The application instances periodically poll for new items since the last timestamp seen by the instance.  This query is efficient because the timestamp is the primary key for the table.
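(Gary's "efficient because the timestamp is the primary key" point is easy to see with a toy schema in sqlite3 — illustrative only, the thread doesn't give the real schema: the since-timestamp query becomes an index range search rather than a full-table scan.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE updates ("
    " ts INTEGER, entry_id TEXT, feed_id TEXT,"
    " PRIMARY KEY (ts, entry_id))"
)
# Ask the planner how it would run the per-instance polling query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM updates WHERE ts > ?", (42,)
).fetchall()
# SQLite reports a SEARCH using the primary-key index (ts > ?), i.e. a
# range scan starting at the instance's cursor, not a scan of the table.
```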

Ben Darnell

Jan 12, 2011, 3:18:58 PM1/12/11
to python-...@googlegroups.com
On Wed, Jan 12, 2011 at 11:30 AM, Romy <romy.m...@gmail.com> wrote:
Thank you Ben.

When you say fine-grained pubsub, I'm assuming you mean notifying upon
actual updates, rather than polling?

No, I meant that making the pubsub fine-grained (i.e. server X needs updates for users A B and C, server Y needs users D E F, etc) is too complicated.  Just let every server see every update for as long as your scale allows.  

-Ben

Joe Bowman

Jan 18, 2011, 10:05:43 AM1/18/11
to python-...@googlegroups.com
I'm doing something similar with MongoDB for the realtime streaming search for unscatter.com. It works basically like this:

Client makes a search {query}.
The RequestHandler queries MongoDB to see if there's a topic for that query.
If there is no topic:
  Creates a topic, which is basically query + last access time.
  Returns the base page to deploy the long-polling JavaScript.
If there is a topic:
  Uses the topic ObjectId to query a queue collection for the most recent messages.
  Deploys the messages and the JavaScript to start long polling for more messages.

I have a separate process, also written in Tornado, that polls for active topics.
Active topics are determined by last access time.
It checks the Twitter/Facebook nexturl fields in the topic and polls those sites for new messages.
If there are new messages, it processes them into the queue and updates the topic nexturl fields.

The long poll is basically a GET request for new messages, with a parameter that is the ObjectId of the last message the client received (if any). This kicks off polling in the RequestHandler, which polls MongoDB for new messages and returns them when it gets them, starting the next polling session.
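(The cursor part of that long poll can be sketched independently of MongoDB; here a plain list stands in for the capped collection and integers for ObjectIds. With pymongo the equivalent query would be a `find` with `{"_id": {"$gt": last_id}}`.)

```python
def messages_since(queue, last_id=None):
    """Return messages newer than the client's cursor, or all of them if
    the client has no cursor yet. ObjectIds are generated in roughly
    increasing order, which is what makes them usable as a cursor."""
    if last_id is None:
        return list(queue)
    return [m for m in queue if m["_id"] > last_id]
```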

I used MongoDB because of its speed and because it provides built-in garbage collection: the topics and queue collections are capped, so they'll never exceed a certain size. Eventually I'm going to plug in the asyncmongo library bit.ly released, and if I ever start getting real amounts of traffic that should help as well.

It's not really live yet; I'm working on adding more features, where I'll be processing the results to dig out links and such. The link works if you want to see it in action, though - http://www.unscatter.com/search/?q=python&f=realtime&s=date&t=python

There's no navigation to that filter yet; you have to add f=realtime to the search URL manually at this time.