Durability and consistency in Redis Cluster

sqm1998

Oct 24, 2011, 4:28:22 AM
to Redis DB
What is the planned persistence model in Redis Cluster, and what is
its effect on durability and consistency? In the current
(non-cluster) model, a Redis master would typically (as recommended
by the docs) fsync the AOF every second. However, this creates the
potential for up to 1 second of writes to be lost if the master
fails while the slave has a replication lag of more than 1 second
(so the commands from the last second were never replicated to the
slave).

According to http://redis.io/topics/cluster-spec - "Cluster nodes are
also able to auto-discover other nodes, detect non working nodes, and
performing slave nodes election to master when needed."

So, within Redis Cluster, if a master fails and a slave is
auto-elected, and the slave had a replication lag of more than 1
second while the master was configured to fsync the AOF every
second, could we have a situation where, to a Redis client (e.g.
phpredis), a counter that was say 100 at 2:00 PM would actually 'go
backwards' to say 98 at 2:01 PM? That is, if the writes (INCR
commands increasing 98 to 100) were made on the master at 2:00 PM,
but the master then failed and Redis Cluster auto-failed over to the
slave before the commands had been successfully replicated?
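The scenario in question can be sketched as a toy simulation (plain
Python; `master` and `replica` are illustrative stand-ins for real
Redis nodes, not actual Redis code):

```python
# Toy model of asynchronous replication: the master applies INCR
# immediately, while the replica only holds the state that was
# shipped to it before the crash.
master = {"counter": 98}
replica = dict(master)          # replica's last replicated state

for _ in range(2):              # two INCRs land during the lag window
    master["counter"] += 1      # the client now observes 100

# The master crashes before shipping those commands, and the cluster
# promotes the lagging replica: the counter 'goes backwards'.
promoted = replica
print(master["counter"], promoted["counter"])   # 100 98
```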

Or, does Redis Cluster intend to synchronously replicate all writes
to the slaves, so that a write on the master is not considered
'committed' until the slaves (all, or at least one?) have
acknowledged that the command was successfully logged?

The use case that generated this question is whether Redis would be
appropriate to use as an auto-increment ticket server to generate
primary keys, a la the Twitter Snowflake system. Of course, to make
this work, the ticket server could never 'go backwards' in the
sequence, even in the case of node failures.

Josiah Carlson

Oct 29, 2011, 3:44:26 AM
to redi...@googlegroups.com
On Mon, Oct 24, 2011 at 1:28 AM, sqm1998 <sqm...@gmail.com> wrote:
> What is the planned persistence model in Redis Cluster, and what is
> its effect on durability and consistency? In the current
> (non-cluster) model, a Redis master would typically (as recommended
> by the docs) fsync the AOF every second. However, this creates the
> potential for up to 1 second of writes to be lost if the master
> fails while the slave has a replication lag of more than 1 second
> (so the commands from the last second were never replicated to the
> slave).
>
> According to http://redis.io/topics/cluster-spec - "Cluster nodes are
> also able to auto-discover other nodes, detect non working nodes, and
> performing slave nodes election to master when needed."
>
> So, within Redis Cluster, if a master fails and a slave is
> auto-elected, and the slave had a replication lag of more than 1
> second while the master was configured to fsync the AOF every
> second, could we have a situation where, to a Redis client (e.g.
> phpredis), a counter that was say 100 at 2:00 PM would actually 'go
> backwards' to say 98 at 2:01 PM? That is, if the writes (INCR
> commands increasing 98 to 100) were made on the master at 2:00 PM,
> but the master then failed and Redis Cluster auto-failed over to the
> slave before the commands had been successfully replicated?

Yes. But this is the case with every multi-master database around,
except for the ones that require writes to replicate to more than
one node before returning. Some have even designed things like
"vector clocks" to pick a winner during reconnects after writes went
to both "masters".

> Or, does Redis Cluster intend to synchronously replicate all writes
> to the slaves, so that a write on the master is not considered
> 'committed' until the slaves (all, or at least one?) have
> acknowledged that the command was successfully logged?

There was talk of offering this, but I haven't read the cluster doc
yet, so I can't say whether it will be offered.

> The use case that generated this question is whether Redis would be
> appropriate to use as an auto-increment ticket server to generate
> primary keys, a la the Twitter Snowflake system. Of course, to make
> this work, the ticket server could never 'go backwards' in the
> sequence, even in the case of node failures.

High-speed distributed auto-incrementing ids are the wrong thing to
use. There is all of that nasty synchronization stuff that needs to
happen, which slows you down. Also, Twitter peaked at somewhere
around 10k tweets/second a couple months ago. That's not all that fast.

If you have the space, use UUIDs (128 bits, packed). They are
designed to generate unique keys. Some variants are better at
providing that guarantee, but all are pretty good, and none require
synchronization.
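For instance, Python's standard library generates these with no
coordination at all (a sketch; any language's UUID library works the
same way):

```python
import uuid

# uuid4 draws 128 random bits; .bytes is the packed 16-byte form
# mentioned above, .hex the 32-character hexadecimal text form.
key = uuid.uuid4()
print(len(key.bytes), len(key.hex))   # 16 32
```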

If you only have 64 bits, and can deal with once-per-second (or so)
sync times, you can do the following:
[32 bits for the seconds since Jan 1, 1970][10 bit id][22 bits of a counter]

Each process that is generating ids gets an id that is placed in the
10 bit id field. Every time it generates a 64 bit id, it increments
the counter. Before it generates any 64 bit ids in a given second (by
incrementing the counter and concatenating the fields), it refreshes
its lock on its id. That lets each process generate some 4 million
unique 64 bit ids per second. There are some tricks for lock
timeouts, etc., but aside from a convenient lock and allocation of
the 10 bit id, you don't need Redis for this.
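The layout above packs into a single integer with plain bit shifts
(a sketch; `make_id` and `worker_id` are illustrative names, and the
counter must be reset as each new second begins):

```python
import time

WORKER_BITS = 10    # the 10 bit process id field
COUNTER_BITS = 22   # the 22 bit per-second counter

def make_id(worker_id, counter, seconds=None):
    """Pack [32 bits of seconds][10 bit id][22 bit counter] into one int."""
    assert 0 <= worker_id < (1 << WORKER_BITS)
    assert 0 <= counter < (1 << COUNTER_BITS)
    if seconds is None:
        seconds = int(time.time())   # seconds since Jan 1, 1970
    return ((seconds << (WORKER_BITS + COUNTER_BITS))
            | (worker_id << COUNTER_BITS)
            | counter)

# Worker 5 generating its 3rd id during one particular second:
uid = make_id(5, 3, seconds=1319900000)
print(uid >> 32, (uid >> COUNTER_BITS) & 0x3FF, uid & 0x3FFFFF)   # 1319900000 5 3
```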

Regards,
- Josiah

sqm1998

Oct 30, 2011, 7:33:50 PM
to Redis DB
I was wondering if the following would work for a high-speed auto-
increment mechanism (while preserving the 4 bytes of an integer,
instead of 8 bytes for a timestamp-based id or 16 bytes for a UUID):

1. Create a durable counter database (e.g. MySQL InnoDB) that stores
the current counter and can increment it atomically.
2. Each 'generator' atomically increments the counter in #1 by N
(e.g. 1,000) and stores the returned value, minus (N - 1), into a
local cache (e.g. APC). For example, if the current counter in #1 is
19000, a generator would atomically increment it to 20000 (this
operation is slow because it must be ACID, but it only occurs once
per 1,000 ids) and set its local cache to 19001 (20000 - 999).
3. When it is time to generate a new id, the generator does an
atomic increment by 1 on its local cache, which returns A. If A is a
multiple of N (e.g. 1,000), the 'allocated' range has been fully
used and the generator repeats #2 to get a new range of N ids.
Otherwise, A is a valid allocated id.
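A minimal single-process sketch of those steps (the durable store is
faked with a module-level variable here; `allocate_block` and
`Generator` are illustrative names, and in practice step 2 would be
one ACID increment against the counter database):

```python
N = 1000  # block size

# Stands in for the durable ACID counter database of step 1.
durable_counter = 19000

def allocate_block():
    """Step 2: atomically advance the durable counter by N; return the new high end."""
    global durable_counter
    durable_counter += N
    return durable_counter

class Generator:
    """Step 3: hand out ids from a locally cached block of N ids."""
    def __init__(self):
        self.high = allocate_block()    # e.g. 20000
        self.next = self.high - N + 1   # e.g. 19001

    def next_id(self):
        if self.next > self.high:       # block exhausted: fetch a fresh one
            self.high = allocate_block()
            self.next = self.high - N + 1
        nid = self.next
        self.next += 1
        return nid

g = Generator()
print(g.next_id(), g.next_id())   # 19001 19002
```

Note that this sketch tracks an explicit high-water mark rather than
testing whether A is a multiple of N; as described, the
multiple-of-N test would skip the first and last id of each block,
which is exactly the kind of endpoint detail that needs care.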

Do you think the above will work?

Josiah Carlson

Oct 30, 2011, 8:00:54 PM
to redi...@googlegroups.com
You'd have to be careful with your choice of endpoints in ids, but
that would work pretty well.

Regards,
- Josiah

Joran Greef

Oct 31, 2011, 3:52:41 AM
to redi...@googlegroups.com
UUIDs are the way to go, especially if you ever plan on writing apps
that need to work offline, disconnected from the master. Use an
entropy source to feed a SHA-1 hash, then decode the digest from
hexadecimal to base62 for better storage efficiency. Some of the
newer browsers already provide window.crypto.getRandomValues as an
entropy source (you can also use a counter + timestamp + a few random
values + session id + random mouse movements, etc.).

Under this scheme, your ids would look like "Ff2qY2EH5Zw8JuRy3N8C5U30".
That's slightly more than 140 bits. Provided your entropy source is
sufficiently random, you would have close to zero chance of a
collision, and you are free to upgrade your entropy source at any
time. The ASCII (case-sensitive) form works well in URLs without
escaping, and it's much shorter than the same id in hexadecimal.
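One way to sketch this scheme (the entropy mix below is illustrative;
SHA-1 yields 160 bits, and keeping 24 base62 characters retains
roughly 142 of them):

```python
import hashlib
import os
import time

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62(n):
    """Encode a non-negative integer using the 62-character alphabet."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def make_id(length=24):
    """SHA-1 over an entropy mix (OS randomness + timestamp), base62-encoded."""
    entropy = os.urandom(16) + str(time.time()).encode()
    digest = hashlib.sha1(entropy).digest()
    return base62(int.from_bytes(digest, "big"))[:length]

print(make_id())   # an id shaped like "Ff2qY2EH5Zw8JuRy3N8C5U30"
```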

sqm1998

Oct 31, 2011, 8:46:23 AM
to Redis DB
Our current plan for use cases that require offline support is that
new data is given a 'temporary' UUID-based id, and then upon
synchronization with the data center we reassign the UUID to a
permanent integer-based id when the data is stored into the db.

Would this approach have a problem?

Joran Greef

Oct 31, 2011, 8:59:32 AM
to redi...@googlegroups.com
Why have you planned to do that?

There's an interesting story with the Virgin space ships, where Burt Rutan designed the launch vehicle and payload vehicle to have the same cockpit, so pilots would only need to train for one system, and so they could standardize on parts etc.

A good goal for offline is to make things as consistent between the client and the server as possible. For example, you would want to use the same language on both, so you can share the same model definitions; you would want the database interface to be the same, so a simple replication link can replicate the client database to the server database. An offline web app is a distributed system: multiple nodes, most of which are mobile and occasionally disconnected, master/master from the get-go. Move towards symmetrical nodes as much as possible (e.g. Dynamo) and away from multi-tier asymmetrical systems. UUIDs are a step in the right direction.

Josiah Carlson

Oct 31, 2011, 1:46:58 PM
to redi...@googlegroups.com
For others that were confused, Joran's post is a reply in the
"Durability and consistency in Redis Cluster" thread, which I've
re-added to the subject line.

While I agree with your sentiment (just use UUIDs for reliability,
consistency, etc.), the practical matter is that most DBMSs available
in the marketplace (really all) do a poor job of handling non-integer
primary keys, and arguably a fairly poor job with non-numeric indexes
generally (Oracle had some awful bugs around this the last time I
used it). The exceptions are some NoSQL DBs that use UUIDs internally
(like MongoDB, Riak, maybe CouchDB, etc.), seemingly for their
ability to generate arbitrary ids with limited/zero sharing of
information between nodes.

I've had pretty good luck with PostgreSQL and MySQL for string indexes
(just use hex and deal with the larger space, it's just easier), but
I've not personally tried them for primary keys.

Regards,
- Josiah

Joran Greef

Nov 1, 2011, 12:54:22 PM
to redi...@googlegroups.com
Thanks Josiah, sorry about that!