Retwis - a little Twitter clone written in PHP + Redis


Salvatore Sanfilippo

Mar 10, 2009, 10:13:24 AM
to redi...@googlegroups.com
Hello all,

I wanted to have a hello world application for Redis: something so
simple that even the average PHP programmer can understand it without
problems, but that is still a real application of Redis in a plausible
environment. The result is a Twitter clone called Retwis, which you can
see at work here: http://retwis.antirez.com/home.php

I'll release the source code together with an article explaining the
design and the code in a very simple way. My guess is that
key-value stores have so far failed a bit on the marketing side of the game;
I would like to avoid this error with Redis :)

Any hint appreciated!
antirez

--
Salvatore 'antirez' Sanfilippo
http://antirez.com

Organizations which design systems are constrained to produce designs
which are copies of the communication structures of these
organizations.

Conway's Law

Carlos Pero

Mar 16, 2009, 9:55:45 AM
to Redis DB
Hi Salvatore,

I've studied this example, and have a question about scalability for
you. (As an engineer by nature, I like to understand the obstacles in
best case and worst case scenarios...)

You talk about being able to write to the database, say, 50,000
transactions per second. I have no idea how Twitter actually does it,
but the biggest names on Twitter have around 250,000 followers. Of
course with your small dev hardware this would take 5 full seconds to
process. How would this scale for real in production? I read your
initial writing on master-slave replication, so presumably you can
point your GETs to the slaves, but the SETs still go to the one
master? Do you just get as big a box as possible to be the master?

-Carlos

Salvatore Sanfilippo

Mar 16, 2009, 10:08:22 AM
to redi...@googlegroups.com
On Mon, Mar 16, 2009 at 2:55 PM, Carlos Pero <carlo...@gmail.com> wrote:
>
> Hi Salvatore,
>
> I've studied this example, and have a question about scalability for
> you.  (As an engineer by nature, I like to understand the obstacles in
> best case and worst case scenarios...)

Hello Carlos!
A good approach of course!

> You talk about being able to write to the database, say, 50,000
> transactions per second.  I have no idea how Twitter actually does it,
> but the biggest names on Twitter have around 250,000 followers. Of

Ok, first I'll try to say what The Right Thing is, IMHO, if you *are* Twitter.
If you are Twitter, you probably MUST hack your tools to make them specific
to your problems. That is, code a special operation that is able to LPUSH
by pattern matching, exactly like SORT does. Something like this (we may
even support this if we discover it's a common pattern):

SORT uid:1000:followers BY nokey LPUSH uid:*:posts 99

You have to know how Redis's SORT works to understand this, but basically
99 is the ID of the message, and it would be pushed to all the keys obtained
by substituting '*' in uid:*:posts with the elements of the Set.

This will of course be *much* faster than issuing separate queries. So for
this kind of special problem, coding a special Redis command is the best
thing one could possibly do.
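
To see why, here is roughly what the same fan-out costs if done from the
client today (just a sketch, assuming a PHP client object $r with smembers()
and lpush() methods; the method names are illustrative, not an actual API):

    <?php
    // One SMEMBERS plus one LPUSH (and one network round trip) per follower.
    // The hypothetical SORT ... LPUSH above would do all of this inside the
    // server, in a single command.
    $followers = $r->smembers('uid:1000:followers');
    foreach ($followers as $fid) {
        $r->lpush('uid:' . $fid . ':posts', 99);   // 99 = the new post ID
    }
    ?>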

Now let's assume we can't do it for some reason:

> course with your small dev hardware this would take 5 full seconds to
> process.  How would this scale for real in production?  I read your
> initial writing on master-slave replication, so presumably you can
> point your GETs to the slaves, but the SETs still go to the one
> master?  Do you just get as big a box as possible to be the master?

In this specific example my solution would be to distribute the work among N
clients. That is, if a user has more than N followers, when he posts a new
message I'll set a key somewhere to say "hey, this is a big user and it
posted this huge message; please, my N-client cluster, process this
message by adding it to the queue of every follower".

Note that you can scale this both in terms of Redis servers and "pusher clients".
The PUSH operations will be distributed, since the followers are distributed
across N servers. The clients will do the same: using one of the locking
primitives like RENAME or MOVE, every client takes charge of one big-push
operation; once it is done, it grabs the next one, and so on. You have multiple
workers, so the work is distributed, and adding workers decreases latency,
specifically the time needed for followers to receive the message after the
very-big-user posted.
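
For instance, a sketch of the claiming step in PHP (assuming the web app left
one key per pending job, and a client object $r whose rename() returns false
when the source key no longer exists; key names and the client API are made
up for illustration):

    <?php
    // A worker claims a pending big-push job by RENAMEing its key to a
    // private name. RENAME is atomic, so only one worker wins each job.
    $pending = 'bigpush:pending:99';             // left there by the web app
    $claimed = 'bigpush:claimed:' . getmypid();  // worker-private name
    if ($r->rename($pending, $claimed)) {
        $job = $r->get($claimed);                // e.g. "1000:99" => uid:postid
        // ... fan the post out to the followers, then DEL the claimed key
    }
    ?>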

I hope that even with all the limits of my English I was clear enough :)

Thanks for asking!
Salvatore

>
> -Carlos

Chris Lamb

Mar 16, 2009, 10:10:47 AM
to redi...@googlegroups.com
Carlos Pero wrote:

> with your small dev hardware this would take 5 full seconds to
> process. How would this scale for real in production?

Side-stepping your question somewhat, is 5 seconds actually bad? If you
did the SETs from a message queue, I'm pretty sure it would be fine;
people with 250,000 followers aren't Tweeting every minute.

You can keep the illusion up by doing a single synchronous SET into the
Tweeter's own feed too.


Regards,

--
Chris Lamb, UK ch...@chris-lamb.co.uk
GPG: 0x634F9A20


Salvatore Sanfilippo

Mar 16, 2009, 10:16:24 AM
to redi...@googlegroups.com
On Mon, Mar 16, 2009 at 3:10 PM, Chris Lamb <ch...@chris-lamb.co.uk> wrote:
> Carlos Pero wrote:
>
>> with your small dev hardware this would take 5 full seconds to
>> process.  How would this scale for real in production?
>
> Side-stepping your question somewhat, is 5 seconds actually bad? If you
> did the SETs from a message queue, I'm pretty sure it would be fine;
> people with 250,000 followers aren't Tweeting every minute.
>
> You can keep the illusion up by doing a single synchronous SET into the
> Tweeter's own feed too.

Exactly, I agree. The central idea here is that if the operation is huge
you have to create a queue that "workers" will operate against, instead
of trying to perform it synchronously during web page generation like Retwis
is doing.
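
To illustrate the idea (just a sketch, not how Retwis is actually written):
the page that handles a new post could do only the cheap synchronous work and
enqueue the heavy fan-out for the workers. The key names and the $r client
API here are assumptions:

    <?php
    // Post handler sketch: the poster's own list is updated synchronously,
    // so the author sees the post immediately; the 250,000 LPUSHes to the
    // followers' lists are deferred to the workers via a queue key.
    // $uid and $status are assumed to come from the session / form.
    $postid = $r->incr('global:nextPostId');
    $r->set('post:' . $postid, $status);
    $r->lpush('uid:' . $uid . ':posts', $postid);
    $r->lpush('global:largeUsersUpdateQueue', $uid . ':' . $postid);
    ?>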

Regards,
Salvatore

Carlos Pero

Mar 16, 2009, 10:21:59 AM
to Redis DB
I need to digest Salvatore's answer above, but wanted to reply to you
quickly.

On Mar 16, 9:10 am, Chris Lamb <ch...@chris-lamb.co.uk> wrote:
> Carlos Pero wrote:
> > with your small dev hardware this would take 5 full seconds to
> > process.  How would this scale for real in production?
>
> Side-stepping your question somewhat, is 5 seconds actually bad? If you
> did the SETs from a message queue, I'm pretty sure it would be fine;
> people with 250,000 followers aren't Tweeting every minute.

A 5 second latency would be perfectly acceptable. My concern is that
while the master database is busy processing that single post for
those 5 seconds, everyone else is locked out of updating the
database. Heaven forbid a different user with 250,000 followers posts
at roughly the same time.

Salvatore Sanfilippo

Mar 16, 2009, 10:25:19 AM
to redi...@googlegroups.com
On Mon, Mar 16, 2009 at 3:21 PM, Carlos Pero <carlo...@gmail.com> wrote:

> A 5 second latency would be perfectly acceptable.  My concern is that
> while the master database is busy processing that single post for
> those 5 seconds, everyone else is locked out of updating the
> database.  Heaven forbid a different user with 250,000 followers posts
> at roughly the same time.

That's the point, Carlos:

there is no single DB being locked.

A worker selects a random slave and uses SMEMBERS uid:1000:followers to
get all the followers.
The worker then performs an LPUSH of the new post ID against each of those
users, and their lists live in different DBs.

So a worker is busy for those 5 seconds, but it's a batch worker, so that's ok.
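
A sketch of that fan-out in PHP, assuming the worker already claimed a job
($uid, $postid) as in the earlier RENAME sketch, a pick_random_slave() helper,
a $masters list of master connections, and a server_for() helper that returns
the master owning a given key (all of these are illustrative, not part of
Retwis):

    <?php
    // Read the follower Set from a read-only slave, then push the post ID
    // onto each follower's list, on whatever master shard owns that key.
    $slave = pick_random_slave();
    $followers = $slave->smembers('uid:' . $uid . ':followers');
    foreach ($followers as $fid) {
        $key = 'uid:' . $fid . ':posts';
        $shard = server_for($key, $masters);   // master that owns this key
        $shard->lpush($key, $postid);
    }
    ?>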

Ciao,
Salvatore

Carlos Pero

Mar 16, 2009, 10:34:26 AM
to Redis DB
> > You can keep the illusion up by doing a single synchronous SET into the
> > Tweeter's own feed too.
>
> Exactly, I agree. The central idea here is that if the operation is huge
> you have to create a queue that "workers" will operate against, instead
> of trying to perform it synchronously during web page generation like Retwis
> is doing.

Oh OK, I think I'm getting it now. If the post is huge, don't try to
do it all at once in the main Redis database, and instead create a new
Redis database to act as a queue and have separate workers operate on
that queue to do all 250,000 notifications.

I presume the read/write benchmarks stay the same regardless of how many DBs
you have in your one Redis server, correct? Meaning for optimal
performance the queue must be on different hardware?


Salvatore Sanfilippo

Mar 16, 2009, 1:15:44 PM
to redi...@googlegroups.com
On Mon, Mar 16, 2009 at 3:34 PM, Carlos Pero <carlo...@gmail.com> wrote:

>
> Oh OK, I think I'm getting it now.  If the post is huge, don't try to
> do it all at once in the main Redis database, and instead create a new
> Redis database to act as a queue and have separate workers operate on
> that queue to do all 250,000 notifications.

Exactly, but there is no main database. You have N Redis servers, and
using some kind of partitioning (vanilla key hashing, consistent
hashing, user-id range dispatch, ...) different keys will be stored on
different Redis servers. The more users you need to handle, the more
servers you'll use to scale.
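
The server_for() helper used in the earlier worker sketch could be as simple
as vanilla modulo hashing (a sketch; note that adding a server remaps most
keys, which is exactly what consistent hashing avoids):

    <?php
    // Map a key to one of N Redis masters with plain modulo hashing.
    function server_for($key, $masters) {
        return $masters[abs(crc32($key)) % count($masters)];
    }

    $masters = array($redisA, $redisB, $redisC);     // illustrative connections
    $conn = server_for('uid:1000:posts', $masters);  // always the same server
    ?>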

There are important keys that it's maybe wise to store on dedicated
servers, like global:timeline or things like that, probably even the
global:largeUsersUpdateQueue we are talking about.

> I presume the read/write benchmarks stay the same regardless of how many DBs

If a server can do 50k read/write operations per second and you have N
servers, with this scheme you'll get more or less N*50k queries per
second. This is why operating with this data model can lead to
horizontally scalable applications: there is no need for all the data to
be available at a single node.

> have in your one Redis server, correct?  Meaning for optimal
> performance the queue must be on different hardware?

The queue is not the most stressed part. If many users with many
followers update frequently, what you actually need is just more
servers, since the PUSH operations against uid:*:posts are partitioned
among all those servers. You don't really need a single big, very
powerful server for this kind of application. What you need is N
commodity hardware servers, and a hashing scheme smart enough that if
a node dies, the same data is also stored somewhere else.

This is a viable idea: you have N master servers and you use
partitioning across them. Also, every one of these N servers has M
slaves that are used only to serve read queries, and to remain
highly available if one of the N masters dies (just turn one of its
slaves into a master). But there are many other ways
to deal with this kind of problem.
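
A sketch of how reads and writes could be routed under that layout (the shard
table and helpers are hypothetical, not something Redis or Retwis provides):

    <?php
    // One master plus M read-only slaves per shard: writes go to the master
    // that owns the key, reads go to a random slave of the same shard.
    $shards = array(
        array('master' => $m0, 'slaves' => array($s0a, $s0b)),
        array('master' => $m1, 'slaves' => array($s1a, $s1b)),
    );

    function shard_for($key, $shards) {
        return $shards[abs(crc32($key)) % count($shards)];
    }

    $shard = shard_for('uid:1000:followers', $shards);
    $shard['master']->sadd('uid:1000:followers', 5000);         // write path
    $reader = $shard['slaves'][array_rand($shard['slaves'])];   // read path
    $followers = $reader->smembers('uid:1000:followers');
    ?>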
