Redis critiques, let's take the good part.

Salvatore Sanfilippo

Dec 6, 2013, 8:52:41 AM
to Redis DB
Hello dear Redis community,

today Pierre Chapuis started a discussion on Twitter about Redis
bashing, stimulated by this thread on Twitter from Rick Branson:

https://twitter.com/rbranson/status/408853897495592960

It is not the first time that Rick Branson, who works at Instagram,
has openly criticized Redis; I guess he does not like the Redis
design and/or implementation.
However, according to Pierre, this is not limited to Rick:
there are other engineers in the SF area who believe that Redis
sucks, and Pierre also reported hearing similar stories in Paris.

Of course every open source project of a given size is a target of
critiques, especially a project like Redis that is very opinionated about
how programs should be written, with a search for simple design and
implementation that is sometimes perceived as sub-optimal.
However, what can we learn from these critiques, and what do you
think is not working well in Redis? I really encourage you to share
your view.

As a starting point I'll use Rick's tweet: "BGSAVE. the sentinel wtf.
memory cliffs. impossible to track what's in it. heap fragmentation.
LRU impl sux. etc et".
He also writes: "you can't even really dump the whole keyspace because
KEYS "*" causes it to shit it's"

This is a good starting point, and I'll use the rest of this email to
see what happened in the different areas of Redis criticized by Rick.

1) BGSAVE

I'm not sure what is wrong with BGSAVE; probably Rick had bad
experiences with EC2 instances, where the fork time can create latency
spikes?

2) The Sentinel WTF.

Here probably the reference is the following:
http://aphyr.com/posts/283-call-me-maybe-redis

Aphyr analyzed Redis Sentinel from the point of view of a consistent
system, consistent as in CAP "strong consistency". During partitions in
Aphyr's tests, Sentinel was not able to keep the promises of a CP
system.
I replied with a blog post trying to clarify that Redis Sentinel is
not designed to provide strong consistency in the face of partitions,
but only to provide some degree of availability when the master
instance fails.

However the implementation of Sentinel, even as a system for promoting a
slave when the master fails, was not optimal, so there was work to
reimplement it from scratch. Finally the new Sentinel is available in
Redis 2.8.x
and is much simpler to understand and predict. This is surely an
improvement. The new implementation is able to version changes in the
configuration that are eventually propagated to all the other
Sentinels, requires a majority to perform the failover, and so forth.

However if you understand even the basics of distributed programming
you know a few things, like how a system with asynchronous replication
is not capable of guaranteeing consistency.
Even if Sentinel was not designed for this, is Redis improving from
this point of view? Probably yes. For example, the unstable branch now
has support for a new command called WAIT that implements a form of
synchronous replication.

Using WAIT and the new Sentinel, it is possible to have a setup that
is quite partition resistant. For example, if you have three computers,
A, B, C, and run a Sentinel instance and a Redis instance on every
computer, only the majority partition will be able to perform the
failover, and the minority partition will stop accepting writes if you
use "WAIT 1", that is, if you wait for the propagation of the write to at
least one replica. The new Sentinel also automatically elects the slave
that has the most updated version of the data.
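
To make the pattern concrete, here is a minimal sketch, assuming a Python
client like redis-py and a build that already ships the WAIT command (in
the unstable branch at the time of this thread); the host name is made up:

    import redis

    # Hypothetical master address; in the three-node setup above this
    # would be whatever node Sentinel currently reports as the master.
    r = redis.StrictRedis(host='node-a.example', port=6379)

    r.set('balance:42', 100)

    # WAIT <numreplicas> <timeout_ms> blocks until at least <numreplicas>
    # replicas acknowledged all previous writes, and returns how many did.
    acked = r.execute_command('WAIT', 1, 100)
    if acked < 1:
        # The write reached no replica within 100 ms: treat it as unsafe.
        raise RuntimeError('write not acknowledged by any replica')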

Redis Cluster is another step forward towards Redis HA and automatic
sharding, we'll see how it works in practice. However I believe that
Sentinel is improving and Redis is providing more tools to fine-tune
consistency guarantees.

3) Impossible to track what is in it.

Lack of SCAN was a problem indeed; now it is solved. Even before, using
RANDOMKEY it was somewhat possible to inspect data sets, but SCAN is
surely a much better way to do this.
The same argument goes for KEYS *.
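
A minimal sketch of the difference, assuming the redis-py client (key
pattern and count are arbitrary):

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    # KEYS * builds the whole reply in one blocking call; SCAN walks the
    # keyspace incrementally with a cursor, so the server keeps serving
    # other clients between calls.
    cursor = 0
    while True:
        cursor, keys = r.scan(cursor=cursor, match='*', count=1000)
        for key in keys:
            print(key)
        if cursor == 0:
            break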

4) LRU implementation sucks.

The LRU implementation in Redis 2.4 had issues, and under mass-expire
there were latency spikes.
The LRU in 2.6 is much smoother; however it contained issues, signaled
by Pavlo Baron, where the algorithm was not able to guarantee that
expired keys were always under a given threshold.
Newer versions of 2.6, and 2.8 of course, both fix this issue.

I'm not aware of issues with the current LRU algorithm.

I have the feeling that Rick's opinion is a bit biased by the fact that
he was exposed to older versions of Redis; however his criticisms were
in part actually applicable to older versions of Redis.
This shows that there is something good about these critiques. For
instance, Rick always said that replication sucked because of the lack of
partial resynchronization. I'm sorry he is no longer able to say this.
As a consolation prize we'll send him a t-shirt if the budget permits.
But this again shows that critiques tend to be focused where
deficiencies *are*, so hiding Redis behind a needle is not a good idea
IMHO. We need to improve the system to make it better, as long as it is
still a useful system for many users.

So, what are the critiques that you hear frequently about Redis? What
are your own critiques? When does Redis suck?

Let's tear Redis apart, something good will happen.

Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

We suspect that trading off implementation flexibility for
understandability makes sense for most system designs.
— Diego Ongaro and John Ousterhout (from Raft paper)

Pierre Chapuis

Dec 6, 2013, 10:22:02 AM
to redi...@googlegroups.com
Others:

Quentin Adam, CEO of Clever Cloud (a PaaS) has a presentation that says Redis is not fit to store sessions: http://www.slideshare.net/quentinadam/dotscale2013-how-to-scale/15 (he advises Membase)

Tony Arcieri (Square, ex-LivingSocial) is a "frequent offender":

https://twitter.com/bascule/status/277163514412548096
https://twitter.com/bascule/status/335538863869136896
https://twitter.com/bascule/status/371108333979054081
https://twitter.com/bascule/status/390919938862379008

Then there's the Disqus guys, who migrated to Cassandra,
the Superfeedr guys who migrated to Riak...

Instagram moved to Cassandra as well, here's more on
it by Branson to see where he comes from:
http://www.planetcassandra.org/blog/post/cassandra-summit-2013-instagrams-shift-to-cassandra-from-redis-by-rick-branson

This presentation about scaling Instagram with a small
team (by Mike Krieger) is very interesting as well:
http://qconsf.com/system/files/presentation-slides/How%20a%20Small%20Team%20Scales%20Instagram.pdf
He says he would go with Redis again, but there are
some points about scaling up Redis starting at slide 56.

My personal experience, to be clear, is that Redis is an
awesome tool when you know how it works and how to
use it, especially for a small team (like Krieger basically).

I have worked for a company with a very reduced technical
team for the last 3.5 years. We make technology for mobile
applications which we sell to large companies (retail, TV,
cinema, press...) mostly white-labelled. I have written most
of our server side software, and I have also been responsible
for operations. We have used and still use Redis *a lot*, and
some of the things we have done would just not have been
possible with such a reduced team in so little time without it.

So when I read someone saying he would ban Redis from
his architecture if he ever makes a startup, I think: "good
thing he doesn't." :)

Thank you Antirez for this awesome tool.

Alexander Gladysh

Dec 6, 2013, 10:25:14 AM
to redi...@googlegroups.com
On Fri, Dec 6, 2013 at 7:22 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:

> My personal experience, to be clear, is that Redis is an
> awesome tool when you know how it works and how to
> use it, especially for a small team (like Krieger basically).

Indeed! Until you have bumped into all the hidden obstacles, the experience
is rather horrible. When Redis blows up in production, it usually
costs developers a few gray hairs :-)

However, after you know what not to do, Redis is all awesomeness.

My 2c,
Alexander.

Pierre Chapuis

Dec 6, 2013, 10:33:31 AM
to redi...@googlegroups.com
On Friday, December 6, 2013 at 16:25:14 UTC+1, Alexander Gladysh wrote:
On Fri, Dec 6, 2013 at 7:22 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:

Indeed! Until you have bumped into all the hidden obstacles, the experience
is rather horrible. When Redis blows up in production, it usually
costs developers a few gray hairs :-)

I would say that of every tool. You can always outgrow them or use them poorly.

I had a terrible experience with MySQL. A (VC funded) startup around
here had issues with CouchDB, moved to Riak with Basho support,
had issues again, moved to HBase which they still use (I think). That does
not make any of those tools bad. You just have to invest some time
into learning what those tools can and cannot do, which one to use for
which use case, and how to use them correctly.

--
Pierre Chapuis

Alexander Gladysh

Dec 6, 2013, 10:34:38 AM
to redi...@googlegroups.com
On Fri, Dec 6, 2013 at 7:33 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:
> On Friday, December 6, 2013 at 16:25:14 UTC+1, Alexander Gladysh wrote:
>>
>> On Fri, Dec 6, 2013 at 7:22 PM, Pierre Chapuis
>> <catwell...@catwell.info> wrote:
>>
>> Indeed! Until you have bumped into all the hidden obstacles, the experience
>> is rather horrible. When Redis blows up in production, it usually
>> costs developers a few gray hairs :-)
>
>
> I would say that of every tool. You can all outgrow them or use them poorly.
>
> I had a terrible experience with MySQL. A (VC funded) startup around
> here had issues with CouchDB, moved to Riak with Basho support,
> had issues again, moved to HBase which they still use (I think). That does
> not make any of those tools bad. You just have to invest some time
> into learning what those tools can and cannot do, which one to use for
> which use case, and how to use them correctly.

I agree :-)

If the learning curve is flat, it usually means that the tool is too
casual to be useful.

Alexander.

Pierre Chapuis

Dec 6, 2013, 10:41:21 AM
to redi...@googlegroups.com
Also: I am not saying I have never experienced scaling issues
with Redis! I have. You always will when you build a system from
scratch that ends up serving millions of users. So there are
bottlenecks I hit, models I had to reconsider, and even things I had
to move off Redis.

But none of that made me go "OMG this tool is terrible and nobody
should use it, ever!!1". And I still think going with Redis in the first
place was a very good idea.

On a side note: one of the things it *did* make me decide not
to use is intermediate layers between my application and Redis
that abstract your models. When you hit a bottleneck, you want
to know exactly what you have stored in Redis, how and why.

So things like https://github.com/soveran/ohm are really cool
for prototyping and things that are not intended to scale, but
if you decide to use them for a product with traction you'd better
understand exactly what they do or just write your own abstraction
layer that suits your business logic.

Salvatore Sanfilippo

Dec 6, 2013, 10:47:11 AM
to Redis DB
On Fri, Dec 6, 2013 at 4:22 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:
> Others:
>
> Quentin Adam, CEO of Clever Cloud (a PaaS) has a presentation that says
> Redis is not fit to store sessions:
> http://www.slideshare.net/quentinadam/dotscale2013-how-to-scale/15 (he
> advises Membase)

To be super-honest, I don't quite understand the presentation: what do
"multiple writes" / "pseudo atomic" mean? I'm not sure.
MULTI/EXEC and Lua scripts both retain their semantics on the slave,
which will process the transaction all-or-nothing.
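
As an illustration of that all-or-nothing behavior, a small sketch with
the redis-py client assumed (key names are made up); the queued commands
are sent as one MULTI/EXEC transaction and replicated as a unit:

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    pipe = r.pipeline(transaction=True)   # wraps queued commands in MULTI ... EXEC
    pipe.hset('session:123', 'user', 'alice')
    pipe.expire('session:123', 3600)
    pipe.execute()                        # both commands apply together, or not at all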

About HA, with the new Sentinel and Cluster we have something to say in
the present and in the future.
I'm not sure what Membase's properties are; their page seems like marketing,
and I don't know a single person that uses it, to be honest.

> Tony Arcieri (Square, ex-LivingSocial) is a "frequent offender":
>
> https://twitter.com/bascule/status/277163514412548096

Latency complaints, 2.2.x, no information given, but Redis can be
operated with excellent latency characteristics if you know what you
are doing.
Honestly I believe that from the point of view of average latency, and
the ability to provide consistent latency, Redis is one of the better
DBs available out there.
If you run it on EC2 with EBS, on instances that can't fork, with an fsync
policy the disk can't cope with, it is a sysop fail, not a problem with
the system IMHO.

> https://twitter.com/bascule/status/335538863869136896

FUD

> https://twitter.com/bascule/status/371108333979054081

FUD

> https://twitter.com/bascule/status/390919938862379008

The 101 of distributed systems is that non-synchronous replication can
drop acknowledged writes.
Every single-instance on-disk DB not configured to fsync to disk at
every write can drop acknowledged writes.

So this is totally obvious for most DBs deployed currently.

What does not drop acknowledged writes as long as the majority is up?
CP systems with strong consistency, like Zookeeper.

It's worth mentioning that WAIT, announced yesterday, can do a lot from
this point of view.

> Then there's the Disqus guys, who migrated to Cassandra,

I've no idea why Disqus migrated to Cassandra; probably it was just a
much better pick for them?
Migrating to a different system does not necessarily imply a problem with
Redis, so this is not a criticism we can use in a positive way to act,
unless the Disqus guys write us why they migrated and what Redis
deficiencies they found.

> the Superfeedr guys who migrated to Riak...

Same story here.

> Instagram moved to Cassandra as well, here's more on
> it by Branson to see where he comes from:
> http://www.planetcassandra.org/blog/post/cassandra-summit-2013-instagrams-shift-to-cassandra-from-redis-by-rick-branson

And again...

> This presentation about scaling Instagram with a small
> team (by Mike Krieger) is very interesting as well:
> http://qconsf.com/system/files/presentation-slides/How%20a%20Small%20Team%20Scales%20Instagram.pdf
> He says he would go with Redis again, but there are
> some points about scaling up Redis starting at slide 56.

This is interesting indeed, and it sounds like problems that we can solve
with Redis Cluster.
Let's face it, partitioning client side is complex. Redis Cluster
provides a lot of help for big players with many instances, since
operations will be much simpler once you can reshard live.

I find the above pointers interesting, but how to act based on this?
IMHO the current route of providing a simple HA system like Sentinel,
trying to make it robust, and at the same time providing a more
complex system like Redis Cluster for "bigger needs", is the best
direction the Redis project can be headed in.

The "moved away from Redis" stories don't tell us much. What I believe
is that sometimes, when you are small, you tend to do things with an
in-memory data store that don't really scale cost-wise, since the IOPS
per instance could be handled by a disk-oriented system, so moving away
could be a natural consequence, and this is fine. At the start maybe using
Redis helped a lot by serving many queries with few machines, during the
boom, with relatively few users in the order of maybe 1 million, but with
the hype around the service creating big pressure from the point of view
of load.

What do you think we can do to improve Redis based on the above stories?

Cheers!

Pierre Chapuis

Dec 6, 2013, 10:48:05 AM
to redi...@googlegroups.com
On Friday, December 6, 2013 at 16:34:38 UTC+1, Alexander Gladysh wrote:

If the learning curve is flat, it usually means that the tool is too
casual to be useful.

This.

Also, maybe I avoided some of the issues others encountered in
production because:

  1) I have an MSc in distributed systems (helps sometimes :p)

  2) I had forked Redis and implemented custom commands
     before I actually deployed it so I understood the code base.

Also, I had read the documentation and not skipped the
parts about algorithmic complexity of the commands,
persistence trade-offs... :)

I guess that if you let a novice developer use Redis in his
application it may be easier for him to shoot himself in the
foot.

But... if you think about it, those things are also true of a
relational database: if you don't understand what you do
you will write dangerous code, and if you decide to use an
ORM and scale you'd better understand it.

Salvatore Sanfilippo

Dec 6, 2013, 10:52:07 AM
to Redis DB
On Fri, Dec 6, 2013 at 4:33 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:

> I had a terrible experience with MySQL. A (VC funded) startup around
> here had issues with CouchDB, moved to Riak with Basho support,

About the "moves to Riak", this is also a component. People seek for
help with Redis and there was nothing: me busy, Pivotal yet not
providing support (now they do finally!).
If Basho engineers say hi, we'll fix your issues, this is surely an
incentive (yet in this case people moved).

Unfortunately I'm really not qualified to say if there is big value or
not into Riak for the use case it is designed about as I hear a mix of
horrible and great things, and I never deployed it seriously.
But I'm happy that people try other solutions: in the end what is no
longer useful MUST DIE in technology.

If Redis dies in 6 months, this is great news: it means that
technology evolved enough that with other systems you can do the same
in some simpler way.
However, as long as I see traction in the project as I'm seeing it right
now, and there is a company like Pivotal supporting the effort,
I'll continue to improve it.

Shane McEwan

Dec 6, 2013, 11:05:32 AM
to redi...@googlegroups.com
On 06/12/13 15:52, Salvatore Sanfilippo wrote:
> Unfortunately I'm really not qualified to say if there is big value or
> not into Riak for the use case it is designed about as I hear a mix of
> horrible and great things, and I never deployed it seriously.
> But I'm happy that people try other solutions: in the end what is no
> longer useful MUST DIE in technology.

For what it's worth, we run both Riak and Redis. They each solve
different problems for us. You use whichever tool solves your problem.
There's no point complaining that your screwdriver is no good at
hammering nails!

Shane.

Pierre Chapuis

Dec 6, 2013, 11:08:19 AM
to redi...@googlegroups.com
On Friday, December 6, 2013 at 16:47:11 UTC+1, Salvatore Sanfilippo wrote:
On Fri, Dec 6, 2013 at 4:22 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:
> Others:
>
> Quentin Adam, CEO of Clever Cloud (a PaaS) has a presentation that says
> Redis is not fit to store sessions:
> http://www.slideshare.net/quentinadam/dotscale2013-how-to-scale/15 (he
> advises Membase)

To be super-honest, I don't quite understand the presentation: what do
"multiple writes" / "pseudo atomic" mean? I'm not sure.

Afaik he is saying the system is single master and you cannot
have two writes executing concurrently, so write throughput / latency
is limited by a single node.

> Then there's the Disqus guys, who migrated to Cassandra,

I've no idea why Disqus migrated to Cassandra; probably it was just a
much better pick for them?
Migrating to a different system does not necessarily imply a problem with
Redis, so this is not a criticism we can use in a positive way to act,
unless the Disqus guys write us why they migrated and what Redis
deficiencies they found.

They mention it here:
http://planetcassandra.org/blog/post/disqus-discusses-migration-from-redis-to-cassandra-for-horizontal-scalability
 
But they don't say much about their reasons, basically "it didn't
scale" :(

> This presentation about scaling Instagram with a small
> team (by Mike Krieger) is very interesting as well:
> http://qconsf.com/system/files/presentation-slides/How%20a%20Small%20Team%20Scales%20Instagram.pdf
> He says he would go with Redis again, but there are
> some points about scaling up Redis starting at slide 56.

This is interesting indeed, and sounds like problems that we can solve
with Redis Cluster. [...]

He also mentions the allocator as their reason to use Memcache
instead of Redis. I wonder if a lot of this criticism does not come
from people who don't use jemalloc.
 
Let's face it, partitioning client side is complex. Redis Cluster
provides a lot of help for big players with many instances since
operations will be much simpler once you can reshard live.

I can't comment much on that, I don't see a reason to use Redis
Cluster for now. Most of my data is trivial to shard in the application.
Maybe that would help with migrations / re-sharding but this is not
*so* terrible if you don't let your shards grow really huge.

We suspect that trading off implementation flexibility for
understandability makes sense for most system designs.
       — Diego Ongaro and John Ousterhout (from Raft paper)

:)

Jonathan Leibiusky

Dec 6, 2013, 11:09:30 AM
to redi...@googlegroups.com

One of the big challenges we had with Redis at MercadoLibre was the size of the dataset. The fact that it needs to fit in memory was a big issue for us.
We used to have, on a regular basis, 500GB DBs or even more.
Not sure if this is a common case for other Redis users anyway.


Alexander Gladysh

Dec 6, 2013, 11:12:46 AM
to redi...@googlegroups.com
On Fri, Dec 6, 2013 at 8:09 PM, Jonathan Leibiusky <iona...@gmail.com> wrote:
> One of the big challenges we had with redis in mercadolibre was size of
> dataset. The fact that it needs to fit in memory was a big issue for us.
> We used to have, on a common basis, 500gb DBs or even more.
> Not sure if this is a common case for other redis users anyway.

Seems to be kind of a screwdriver vs. nails problem, no? Why use Redis
for a task that it is explicitly not designed for?

(Not trying to offend you, this is an honest question; relevant, I
think, since we're talking about why Redis is perceived as deficient
by some users...)

Alexander.

Salvatore Sanfilippo

Dec 6, 2013, 11:15:41 AM
to Redis DB
On Fri, Dec 6, 2013 at 5:05 PM, Shane McEwan <sh...@mcewan.id.au> wrote:
> For what it's worth, we run both Riak and Redis. They each solve different
> problems for us. You use whichever tool solves your problem. There's no
> point complaining that your screwdriver is no good at hammering nails!

Totally makes sense indeed. The systems are very different.

Just a question: supposing Redis Cluster were available and stable, is
there some problem at the intersection between Redis and Riak, which you
ended up solving with Riak, that could be addressed with Redis Cluster?
Or was it a matter of other metrics like the consistency model and the like?

Jonathan Leibiusky

Dec 6, 2013, 11:18:23 AM
to redi...@googlegroups.com

It's not that we planned it. Developers started using it for something they thought would stay small, but it grew. And it grew a lot. We ended up using Redis to cache a small chunk of the data, with MySQL or Oracle as the backend data store.

Salvatore Sanfilippo

Dec 6, 2013, 11:18:50 AM
to Redis DB
On Fri, Dec 6, 2013 at 5:12 PM, Alexander Gladysh <agla...@gmail.com> wrote:
> On Fri, Dec 6, 2013 at 8:09 PM, Jonathan Leibiusky <iona...@gmail.com> wrote:
>> One of the big challenges we had with redis in mercadolibre was size of
>> dataset. The fact that it needs to fit in memory was a big issue for us.
>> We used to have, on a common basis, 500gb DBs or even more.
>> Not sure if this is a common case for other redis users anyway.
>
> Seems to be kind of screwdriver vs. nails problem, no? Why use Redis
> for the task that it is explicitly not designed for?

This is entirely possible but depends a lot on the use case. If the IOPS
per object are in a range where you pay less for RAM than for the number
of nodes you would need to spin up with an on-disk solution, then switching
becomes hard even when you realize you are using a lot of RAM. Also it
depends on where you run: on premise 500GB is not huge, on EC2 it is.

Alexander Gladysh

Dec 6, 2013, 11:28:02 AM
to redi...@googlegroups.com
On Fri, Dec 6, 2013 at 8:18 PM, Jonathan Leibiusky <iona...@gmail.com> wrote:
> On Dec 6, 2013 1:13 PM, "Alexander Gladysh" <agla...@gmail.com> wrote:
>> On Fri, Dec 6, 2013 at 8:09 PM, Jonathan Leibiusky <iona...@gmail.com>
>> wrote:
>> > One of the big challenges we had with redis in mercadolibre was size of
>> > dataset. The fact that it needs to fit in memory was a big issue for us.
>> > We used to have, on a common basis, 500gb DBs or even more.
>> > Not sure if this is a common case for other redis users anyway.
>>
>> Seems to be kind of screwdriver vs. nails problem, no? Why use Redis
>> for the task that it is explicitly not designed for?
>>
> It's not that we planned it. Developers started using it for something they
> thought would stay small, but it grew. And it grew a lot.

Ah, I see. We had that happen (on a much smaller scale). But, despite
Redis blowing up in our faces several times, we were eventually able
to get away with optimizing data sizes (and adding a few ad-hoc
cluster nodes).

> We ended up using
> Redis to cache a small chunk of the data, with MySQL or Oracle as the
> backend data store.

This is exactly what I would do now, after having had that experience.
Redis can be a primary data store, but you have to think very carefully
before using it as such.

I had a different point of view before, and it was the source of some
pain for us. You live and learn :-)

My 2c,
Alexander.

Alexander Gladysh

Dec 6, 2013, 11:29:28 AM
to redi...@googlegroups.com
On Fri, Dec 6, 2013 at 8:18 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> On Fri, Dec 6, 2013 at 5:12 PM, Alexander Gladysh <agla...@gmail.com> wrote:
>> On Fri, Dec 6, 2013 at 8:09 PM, Jonathan Leibiusky <iona...@gmail.com> wrote:
>>> One of the big challenges we had with redis in mercadolibre was size of
>>> dataset. The fact that it needs to fit in memory was a big issue for us.
>>> We used to have, on a common basis, 500gb DBs or even more.
>>> Not sure if this is a common case for other redis users anyway.
>>
>> Seems to be kind of screwdriver vs. nails problem, no? Why use Redis
>> for the task that it is explicitly not designed for?
>
> This is entirely possible but depends a lot on use case. If IOPS for
> object are in a range that you pay less for RAM compared to how many
> nodes you need to spin with an on-disk solution, then switching
> becomes hard even when you realize you are using a lot of RAM. Also it
> depends on where you run. On premise 500GB is not huge, on EC2 it is.

Of course. But you have to know Redis well to be able to get away with
this, and even to be able to make a weighted and sane decision on the
matter.

Alexander.

Salvatore Sanfilippo

Dec 6, 2013, 11:31:09 AM
to Redis DB
On Fri, Dec 6, 2013 at 5:08 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:

> Afaik he is saying the system is single master and you cannot
> have two writes executing concurrently, so write throughput / latency
> is limited by a single node.

Unless you use sharding. Otherwise, any system that accepts, at the same
time, in two different nodes, a write for the same object is
eventually consistent.

> But they don't say much about their reasons, basically "it didn't
> scale" :(

From what I can tell, Redis *cannot* really scale on EC2 for
applications requiring a large data set, just because of the cost of
spinning up enough instances.
Imagine the 4TB Twitter Redis cluster on EC2. It is totally possible even
for small companies on premise.

> He also mentions the allocator as their reason to use Memcache
> instead of Redis. I wonder if a lot of this criticism does not come
> from people who don't use jemalloc.

That's pre-jemalloc IMHO.

>> Let's face it, partitioning client side is complex. Redis Cluster
>> provides a lot of help for big players with many instances since
>> operations will be much simpler once you can reshard live.
>
>
> I can't comment much on that, I don't see a reason to use Redis
> Cluster for now. Most of my data is trivial to shard in the application.
> Maybe that would help with migrations / re-sharding but this is not
> *so* terrible if you don't let your shards grow really huge.

I'm quite sure that as soon as we provide a solid Sentinel and a Redis
Cluster that works, we'll see a lot of new users...



>
>> We suspect that trading off implementation flexibility for
>> understandability makes sense for most system designs.
>> — Diego Ongaro and John Ousterhout (from Raft paper)
>
>
> :)



--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

Felix Gallo

Dec 6, 2013, 11:32:22 AM
to redi...@googlegroups.com
I think there are three types of criticism.

The first type comes from a surge in popularity of high-A-style systems and, owing to the sexiness of those concepts and relative newness, a corresponding surge in dilettantes who try to eagerly apply knowledge gleaned from Aphyr's (great) Jepsen posts against all use cases, find Redis wanting, and try to be the first to tweet out the hipster sneering.  I won't name names but there's a dude who posted that you should replace redis with zookeeper.  I literally cried with laughter.

The second type is serious high-A folk like Aphyr, who do correctly point out that Redis cluster was not designed "properly."  It turns out that distributed systems are incredibly complicated and doing things the most simple and direct way, as Salvatore seems to aim to do, frequently misses some complex edge cases.  This type of criticism is more important, because here traditionally Redis has claimed it has a story when it really didn't.  I have concerns that Salvatore working alone will not get to a satisfactory story here owing to the complexities, and sometimes wonder if maybe external solutions (e.g. the system that uses zookeeper as a control plane) would not be better, not go for 100% availability, and for focus to be placed on the third area of criticism.

The third type is the most important, in my opinion: it's the people who fundamentally misunderstand Redis.  You see it all the time on this list: people who think Redis is mysql, or who ask why the server seems to have exploded when they put 100G of data in an m1.small, or why expiry is not instant, or why a transaction isn't rollable back.  The problem here is that Redis is very much a database construction set, with Unix-style semantics.  By itself it gives you just enough rope to hang you with.  By itself without care and feeding and diligence, Redis will detonate over time in the face of junior- and mid- level developers.  People will create clashing schemas across applications.  People will issue KEYS * in production.  People will DEL a 4 million long list and wonder why it doesn't return immediately (<-- this was me).  Heck, I'd been using Redis hard for a year before I learned the stupid SORT join trick from Josiah.  Many of these warts and complexities around usage and operation of a single instance could be smoothed over (KEYS *, ARE YOU SURE (Y/N) in redis-cli), and as far as making The World happy, that's probably the biggest bang for the buck.
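
As a concrete illustration of that last point, a hedged sketch (redis-py
assumed, key name made up) of how a huge list can be deleted incrementally
instead of with one blocking DEL:

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    def delete_big_list(key, chunk=1000):
        # Each LTRIM removes only `chunk` elements from the tail, so no single
        # command blocks the server the way DEL on millions of elements does.
        while r.llen(key) > 0:
            r.ltrim(key, 0, -chunk - 1)
        r.delete(key)  # a no-op by now, kept for clarity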

Personally, I've just finished deploying a major application component for an online game for which you have seen many billboards no matter where you are in the world.  Over 2 million users use the component every day, and we put and get tens-to-hundreds-of-thousands of data items per second.  We don't use in-redis clustering, and we don't use sentinel, but I sleep at night fine because my dev and ops teams understand the product and know how it fails.

F.




Shane McEwan

Dec 6, 2013, 11:42:25 AM
to redi...@googlegroups.com
On 06/12/13 16:15, Salvatore Sanfilippo wrote:
> Just a question, supposing Redis Cluster were available and stable, is
> some problem at the intersection between Redis and Riak that you ended
> solving with Riak more disputable with Redis Cluster? Or it was a
> matter of other metrics like consistency model and alike?

I haven't looked at Redis Cluster yet so I can't say for sure. The main
reason for choosing Riak was scalability and redundancy. We know there are
some huge Riak clusters out there and we plan to be one of them
eventually. Our dataset is larger than can easily (cheaply!) fit into
memory, so we use Riak with LevelDB to store our data, while anything we
want quick and easy access to we store in Redis.

Shane.

Yiftach Shoolman

Dec 6, 2013, 12:03:03 PM
to redi...@googlegroups.com
From the point of view of a Redis provider who "lives" off these OSS issues, I can only say that I know a handful of companies that can actually manage any OSS DB themselves at a large scale in production. I'm quite sure that most of these transitions to Riak/Cassandra were backed by the Basho and Datastax guys. The fact that Redis is much more popular than those DBs (only 2nd to Mongo in real NoSQL deployments) actually means that someone built a solid product here.
From the commercial side, there are now a few companies with enough cash in the bank to support and give services around Redis; I'm sure this will only strengthen its position.

Another point to mention is cloud deployment - I can only guess that most Redis deployments today are on AWS, and managing any large distributed deployment over this environment is a great challenge, especially with in-memory databases. This is because instances fail frequently, data-centers fail, network partitions happen too often, there are noisy neighbors all over, storage is ephemeral, and SAN/EBS storage is not tuned for sequential writes. I can only say that due to the competition from the other cloud vendors (SoftLayer, GCE and Azure), AWS infrastructure is constantly improving. For instance, last year there were 4 zone (data-center) failure events in the AWS us-east region; this year, zero. The new AWS C3 instances are now based on HVM and most of the BGSAVE fork time issues have been solved.





--

Yiftach Shoolman
+972-54-7634621

Josiah Carlson

Dec 6, 2013, 1:29:16 PM
to redi...@googlegroups.com
Heck, I'd been using Redis hard for a year before I learned the stupid SORT join trick from Josiah.

Not stupid, just crazy :)
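
For readers who have not seen it, a hedged sketch of that trick (redis-py
assumed, key names made up): SORT with BY and GET patterns fetches related
values while ordering a list of ids, giving a join-like result in one command.

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    # Friend ids in a list, per-user data in hashes.
    r.rpush('friends:1', 2, 3)
    r.hmset('user:2', {'name': 'bob', 'age': 30})
    r.hmset('user:3', {'name': 'eve', 'age': 25})

    # Order the ids by each user's age and return id + name together.
    rows = r.sort('friends:1', by='user:*->age', get=['#', 'user:*->name'])
    # rows is a flat list like [b'3', b'eve', b'2', b'bob']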


My criticisms are primarily from the point of view of someone who knows enough about Redis to be dangerous, who has spent the last 11+ years studying, designing, and building data structures, but who doesn't have a lot of time to work on Redis itself. All of the runtime-related issues have already been covered.

Long story short: every one of the existing data structures in Redis can be improved substantially. All of them can have their memory use reduced, and most of them can have their performance improved. I would argue that the ziplist encoding should be removed in favor of structures that are concise enough to make the optimization unnecessary for structures with more than 5 or 10 items. If the intset encoding is to be kept, I would also argue that it should be modified to apply to all sets of integers (not just small ones), and its performance characteristics updated if it happens that the implementation changes to improve large intset performance.
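
For readers unfamiliar with the encodings being discussed, a small sketch
(redis-py assumed; thresholds and key names are made up) of how the ziplist
conversion can be observed on a current server:

    import redis

    r = redis.StrictRedis(host='localhost', port=6379)

    # Below these thresholds a hash is stored in the compact ziplist encoding.
    r.config_set('hash-max-ziplist-entries', 128)
    r.config_set('hash-max-ziplist-value', 64)

    r.delete('user:1')
    r.hset('user:1', 'name', 'alice')
    print(r.object('encoding', 'user:1'))   # b'ziplist' while the hash is small

    for i in range(1000):
        r.hset('user:1', 'field:%d' % i, i)
    print(r.object('encoding', 'user:1'))   # b'hashtable' once the threshold is crossed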

I might also argue that something like Redis-nds should be included in core, but that it should *not* involve the development of a new storage engine, unless that storage engine is super simple (I wrote a bitcask w/index creation on shutdown in Go a few weeks ago in a week, and it is the best on-disk key/value storage engine I've ever used). I don't know whether explicitly paging data in and out makes sense, or whether it should be automatic, as I can make passionate arguments on both sides.

All of that said, Redis does work very well for every use case that I find reasonable, even if there are some rough edges.
 - Josiah

Salvatore Sanfilippo

Dec 6, 2013, 1:36:13 PM
to Redis DB
On Fri, Dec 6, 2013 at 5:32 PM, Felix Gallo <felix...@gmail.com> wrote:
> I think there's three types of criticism.

Hello Felix, I like classifications ;-)

> I won't
> name names but there's a dude who posted that you should replace redis with
> zookeeper. I literally cried with laughter.

Skipping that... as I recognized this type and it is not worth analyzing :-)

> The second type is serious high-A folk like Aphyr, who do correctly point
> out that Redis cluster was not designed "properly." It turns out that
> distributed systems are incredibly complicated and doing things the most
> simple and direct way, as Salvatore seems to aim to do, frequently misses
> some complex edge cases. This type of criticism is more important, because
> here traditionally Redis has claimed it has a story when it really didn't.
> I have concerns that Salvatore working alone will not get to a satisfactory
> story here owing to the complexities, and sometimes wonder if maybe external
> solutions (e.g. the system that uses zookeeper as a control plane) would not
> be better, not go for 100% availability, and for focus to be placed on the
> third area of criticism.

Here there is a "mea culpa" to do, the first Sentinel and the first
version of Redis Cluster were designed before I seriously learned the
theoretical basis of distributed systems. This is why I used the past
months to read and learn about distributed systems.

I believe the new design of Redis Cluster is focused on real trade
offs and will hold well in practice. It may not be bug free or some
minor changes may be needed but IMHO there are not huge mistakes.

Aphyr did a great thing analyzing systems in practice, they hold the
expectations? However I think that distributed systems are not super
hard, like kernel programming is not super hard, like C system
programming is not super hard. Everything new or that you don't do in
a daily basis seems super hard, but it is actually different concepts
that are definitely things everybody here in this list can master.

So Redis Sentinel as a distributed system was not consistent? Wow:
asynchronous replication is used, so there is no way for the master
partitioned away to stop receiving writes, and there is no merge operation
afterwards, but "whoever is the master rewrites the history". Also, the
first Sentinel was much simpler to take apart from a theoretical
perspective: the system would not converge after the partition heals, and
it was simple to prove. It is also possible to trivially prove that the
ODOWN state, for the kind of algorithm used, does not guarantee liveness
(but this is practically not important for *how* it is used now).

It is important to learn, but there is no distributed-systems cathedral
that is impossible to climb. At most, more learning is needed, and the
implementation has to be adapted to the best one can provide in a given
moment, given the understanding, the practical limits (a single coder) and
so forth.

However my take on that is that the Redis project responded in a
positive way to theoretical criticisms. I never believed it was
interesting, for the kind of uses Redis was designed for, to improve our
consistency story a lot. I changed my mind, and we got things
like WAIT. This is a huge change: WAIT means that if you run three
nodes A, B, C, where every node contains a Sentinel instance and a
Redis instance, and you "WAIT 1" after every operation to reach the
majority of slaves, you get a consistent system.

> The third type is the most important, in my opinion: it's the people who
> fundamentally misunderstand Redis. You see it all the time on this list:
> people who think Redis is mysql, or who ask why the server seems to have
> exploded when they put 100G of data in an m1.small, or why expiry is not
> instant, or why a transaction isn't rollable back. The problem here is that
> Redis is very much a database construction set, with Unix-style semantics.
> By itself it gives you just enough rope to hang you with. By itself without
> care and feeding and diligence, Redis will detonate over time in the face of
> junior- and mid- level developers. People will create clashing schemas
> across applications. People will issue KEYS * in production. People will
> DEL a 4 million long list and wonder why it doesn't return immediately (<--
> this was me). Heck, I'd been using Redis hard for a year before I learned
> the stupid SORT join trick from Josiah. Many of these warts and
> complexities around usage and operation of a single instance could be
> smoothed over (KEYS *, ARE YOU SURE (Y/N) in redis-cli), and as far as
> making The World happy, that's probably the biggest bang for the buck.

Totally agree. What is disturbing is that in most environments where
you could expect "A class" developers, sometimes the system was misused
like that.

> Personally, I've just finished deploying a major application component for
> an online game for which you have seen many billboards no matter where you
> are in the world. Over 2 million users use the component every day, and we
> put and get tens-to-hundreds-of-thousands of data items per second. We
> don't use in-redis clustering, and we don't use sentinel, but I sleep at
> night fine because my dev and ops teams understand the product and know how
> it fails.

Totally reasonable... thanks for sharing.

John Watson

Dec 6, 2013, 2:34:28 PM
to redi...@googlegroups.com
We outgrew Redis in one specific use case, for the exact tradeoff Salvatore
has already conceded as a possible deficiency.

Some info about that in this slide deck:
http://www.slideshare.net/gjcourt/cassandra-sf-meetup20130731

Besides that, Redis is still a critical piece of our infrastructure and
has not been much of a pain point. We "cluster" by running many instances per
machine (and in some "clusters", some semblance of HA via a spider web of
SLAVEOFs between them). We also built a Python library for handling the
clustering client side using various routing methods: https://pypi.python.org/pypi/nydus
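
This is not Nydus's actual API, just a minimal sketch (Python, redis-py
assumed, addresses made up) of the client-side routing idea: hash the key
and pick one of several instances, which is exactly the manual resharding
burden Redis Cluster aims to remove.

    import binascii
    import redis

    nodes = [
        redis.StrictRedis(host='10.0.0.1', port=6379),
        redis.StrictRedis(host='10.0.0.2', port=6379),
        redis.StrictRedis(host='10.0.0.3', port=6379),
    ]

    def node_for(key):
        # CRC32-based routing; adding shards means migrating keys by hand.
        return nodes[binascii.crc32(key.encode()) % len(nodes)]

    node_for('user:42').set('user:42', 'payload')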

Of course Nydus has some obvious drawbacks, and so we're watching the work
Salvatore has been putting into Sentinel/Cluster very closely.

Aphyr Null

Dec 6, 2013, 3:07:37 PM
to redi...@googlegroups.com
> WAIT means that if you run three nodes A, B, C where every node contains a Sentinel instance and a Redis instance, and you "WAIT 1" after every operation to reach the majority of slaves, you get a consistent system.

While I am enthusiastic about the Redis project's improvements with respect to safety, this is not correct.

Salvatore Sanfilippo

Dec 6, 2013, 4:14:37 PM
to Redis DB
On Fri, Dec 6, 2013 at 9:07 PM, Aphyr Null <aphyr...@gmail.com> wrote:
> While I am enthusiastic about the Redis project's improvements with respect
> to safety, this is not correct.

It is not correct if you take it as "strong consistency", because there
are definitely failure modes; basically it is not as if synchronous
replication + failover turned the system into Paxos or Raft. For
example, if the master returns writable when the failover has already
started, we are no longer sure to pick the slave with the best
replication offset. However this is definitely "more consistent" than
in the past, and probably it is possible to achieve strong consistency
if you have a way to stop writes during the replication process.

I understand this is not the "C" consistency of "CAP", but before: the
partition with the clients and the (old) master partitioned away would
receive writes that get lost.
After: under certain system models the system is consistent, for example
if you assume that crashed instances never start again. It is not
realistic as a system model, but it means that in practice you have
better real-world behavior, and in theory you have a system that is
moving towards a better consistency model.

Regards,
Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

Matt Palmer

Dec 6, 2013, 5:04:11 PM
to redi...@googlegroups.com
On Fri, Dec 06, 2013 at 11:09:30AM -0500, Jonathan Leibiusky wrote:
> One of the big challenges we had with redis in mercadolibre was size of
> dataset. The fact that it needs to fit in memory was a big issue for us.
> We used to have, on a common basis, 500gb DBs or even more.
> Not sure if this is a common case for other redis users anyway.

Common enough that I sat down and hacked together NDS to satisfy it. As you
said in your other message, though, it isn't that anyone usually *plans* to
store 500GB of data from the start and chooses Redis anyway, but rather that
you start small, and then things get out of hand... the situation isn't
helped when the developers aren't aware enough of what's going on "inside
the box" to realise that they can't just throw data at Redis
indefinitely -- but then, I (ops) didn't exactly give them the full
visibility required to know how big those Rediises were getting...

- Matt

--
Ruby's the only language I've ever used that feels like it was designed by a
programmer, and not by a hardware engineer (Java, C, C++), an academic
theorist (Lisp, Haskell, OCaml), or an editor of PC World (Python).
-- William Morgan

Matt Palmer

Dec 6, 2013, 5:06:32 PM
to redi...@googlegroups.com
On Fri, Dec 06, 2013 at 07:22:02AM -0800, Pierre Chapuis wrote:
> So when I read someone saying he would ban Redis from
> his architecture if he ever makes a startup, I think: "good
> thing he doesn't." :)

I, on the other hand, just sincerely hope that whatever startup he makes is
competing with mine, because if he refuses to use the right tool for the job
(if Redis turns out to be the right tool for a specific use case), then I'll
gladly use that tool as a competitive advantage, and I need every advantage
I can get.

- Matt

Salvatore Sanfilippo

Dec 6, 2013, 5:16:21 PM
to Redis DB
On Fri, Dec 6, 2013 at 11:06 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
> I, on the other hand, just sincerely hope that whatever startup he makes is
> competing with mine, because if he refuses to use the right tool for the job
> (if Redis turns out to be the right tool for a specific use case), then I'll
> gladly use that tool as a competitive advantage, and I need every advantage
> I can get.

This is a fundamental point.

If you consider systems from a purely theoretical point of view, everybody
should use Zookeeper.
It is like trying to win every war with a precision rifle: it is
the most accurate weapon, however it does not work against a tank.

People use Redis because it solves problems, because of a data model
that fits a given problem, and so forth, not because it offers the
best consistency guarantees.
This is our point of view as programmers: we try to do our best to
implement systems in a great way.

There are other guys, like the authors of the Raft algorithm, who try
to do "A grade" work in the field of applicable distributed systems.
Those people provide us with the theoretical foundation to improve the
systems we are designing; however, it is the sensibility of the
programmer to pick the trade-offs, the API, and so forth.

Companies using the right tools will survive and will solve user
problems. When a tool, like Redis, stops solving problems, it
gets obsoleted and after a few years marginalized.
This is not a linear process, because fashion is also a big player in
tech. Especially in the field of DBs lately there is too much money
for the environment to be sane; people don't just argue from a
technical point of view, and there is a bit too much rage IMHO. But my
optimism tells me that eventually the technology is the most important
thing.

Salvatore Sanfilippo

Dec 6, 2013, 5:37:04 PM
to Redis DB
On Fri, Dec 6, 2013 at 8:34 PM, John Watson <jo...@disqus.com> wrote:
> We outgrew Redis in 1 specific use case. For the exact tradeoff Salvatore
> has already ceded as a possible deficiency.
>
> Some info about that in this slide deck:
> http://www.slideshare.net/gjcourt/cassandra-sf-meetup20130731
>
> Besides that, Redis is still a critical piece of our infrastructure and
> has not been much of a pain point. We "cluster" by running many instances
> per
> machine (and in some "clusters", some semblance of HA by a spider web of
> SLAVE OFs between them.) We also built a Python library for handling the
> clustering
> client side using various routing methods:
> https://pypi.python.org/pypi/nydus

Hello John, thank you a lot for your feedback.
I seriously believe in using multiple DB systems to get the job done,
maybe because my point of view is biased by Redis not being very
general purpose, but I believe there is definitely value in being open
to using the right technologies for the right jobs. Of course it is a
hard concept to generalize: good engineers will understand, with great
sensibility, when something new is needed, while less experienced ones
sometimes make the error of throwing many technologies together when
they are not exactly needed, including Redis...

Thanks for the link to Nydus, I was not aware of this project. I'm
adding it here in the tools section -> http://redis.io/clients

> Of course Nydus has some obvious drawbacks and so we're watching the work
> Salvatore has been putting in to Sentinel/Cluster very closely.

Thanks, those are the priorities of the Redis project currently!

Salvatore



--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

Alberto Gimeno Brieba

Dec 6, 2013, 5:46:34 PM
to redi...@googlegroups.com
Hi,

I use redis a lot (3 big projects already) and I love it. And I know many people that love it too.

For me the two biggest problems with redis used to be:

- distributing it over several nodes. The problem is being solved with redis-cluster. And synchronous replication is a great feature.

- the dataset size needs to fit into memory. Of course I totally understand that Redis is an in-memory DB, and this is the main reason it is so fast. However I would appreciate something like NDS being officially supported. There were some attempts to address this problem in the past (vm and diskstore) but in the end they were removed.

I think that having something like NDS officially supported would make Redis a great option for many more use cases. Many times 90% of the "hot data" in your DB fits in an inexpensive server, but the rest of the data is too big, and it would be too expensive (unaffordable) to have enough RAM for it. So in the end you choose another DB for the entire dataset.

My 2 cents.

Salvatore Sanfilippo

Dec 6, 2013, 5:54:49 PM
to Redis DB
On Fri, Dec 6, 2013 at 6:03 PM, Yiftach Shoolman
<yiftach....@gmail.com> wrote:

> From the commercial side, there are now a few companies with enough cash in
> the bank to support and give services around Redis; I'm sure this will only
> strengthen its position.

This is a very important point. With Redis you were alone until
recently, that's not good.

> Another point to mention is the cloud deployment - I can only guess that
> most of the Redis deployments today are on AWS, and managing any large
> distributed deployment over this environment is a great challenge and
> especially with in-memory databases. This is because: instances fail
> frequently, data-centers fail, network partition happens too often, noisy
> neighbor all over, the concept of ephemeral storage, SAN/EBS storage which
> is not tuned for sequential writes, etc - I can only say that due to the
> competition from the other cloud vendors, SoftLayer, GCE and Azure, AWS
> infrastructure is constantly improving. For instance - last year there were
> 4 zone (data-center) failure events in the AWS us-east region; this year -
> zero. The new AWS C3 instances are now based on HVM and most of the BGSAVE
> fork time issues have been solved

Absolutely. In some ways EC2 is good for distributed systems: it is
problematic enough that it is much simpler to sense pain points and
see partitions that in practice are very rare in other deployments.
This is somewhat "training for failure", which is good. But seriously,
sometimes the problems are *just* a result of EC2, and if you don't
know how to fine-tune for this environment you are likely to see latency
and other issues that in other conditions are very hard to see at all.

I'm super happy about the C3 instances, but... what about EBS? It remains
a problem I guess when AOF is enabled and the disk can't cope with the
fsync policy...

Thanks,
Salvatore

Salvatore Sanfilippo

Dec 6, 2013, 6:00:01 PM
to Redis DB
On Fri, Dec 6, 2013 at 11:46 PM, Alberto Gimeno Brieba
<gime...@gmail.com> wrote:
> I think that having something like NDS officially supported would make redis
> a great option for many more usage cases. Many times the 90% of the "hot
> data" in your db fits in an inexpensive server, but the rest of the data is
> too big and would be too expensive (unaffordable) to have enough RAM for it.
> So in the end you choose other db for the entire dataset.

I completely understand this, but IMHO to make Redis on disk right we need:

1) An optional threaded model. You may use it to dispatch maybe only
slow queries and on-disk queries. Threads are not a good fit for Redis
in memory, I think. Similarly, I believe that threads are the key to a
good on-disk implementation.
2) Representing every data structure on disk in a native way. Mostly a
btree of btrees or alike, but there is definitely some work ahead to
understand what to use or what to implement.

Currently it is an effort that is basically impossible to undertake. If I
am able to keep developing what we have now that the complexity has risen
(the core, cluster, sentinel, the community...), that is already a good
result.
So for now the choice is to stay focused on the in-memory paradigm,
even if I understand this makes Redis less useful for certain use
cases, since there are other DBs solving at least in part the "Redis
on disk" case, but there are few systems doing the Redis job well
IMHO.

Thanks!

Dvir Volk

Dec 6, 2013, 6:00:47 PM
to redi...@googlegroups.com
Just my two cents. 
I've been using Redis in production for almost 3 years, and I've had many difficulties but many, many wins with it.
I think the biggest mistake I made was, being excited about it in the beginning, using it for too many things, some of which it didn't fit.
I'm very happy with redis as:
1. a geo resolving database.
2. complex cache (where just putting a blob in something like memcache is not enough)
3. distributed event bus
4. semantic entity store.

Where I wasn't happy with it was:
1. storing data that had complex relations in it.
2. storing data that constantly needed migration from other DBs (that's not Redis' fault though)
3. storing data that needed high persistence rates on EC2, although the forking problem was solved in recent generation machines.
4. having a mission critical DB that needed super fast failover. Sentinel in its original form was simply not good enough for what we needed.
5. needing cross DC replication. This has been solved in 2.8, but I needed it before.

So we've been moving some things we used to do with redis to other databases, but I still love this tool and would definitely use it for new projects.






On Fri, Dec 6, 2013 at 3:52 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
Hello dear Redis community,

today Pierre Chapuis started a discussion on Twitter about Redis
bashing, stimulated by this thread on Twitter from Rick Branson:

https://twitter.com/rbranson/status/408853897495592960

It is not the first time that Rick Branson, that works at Instagram,
openly criticizes Redis, because I guess he does not like the Redis
design and / or implementation.
However according to Pierre, this is not something limited to Rick,
but there are other engineers in the SF area that believe that Redis
sucks, and Pierre also reported to hear similar stories in Paris.

Of course every open source project of a given size is target if
critiques, especially a project like Redis is very opinionated on how
programs should be written, with the search for simple design and
implementation that sometimes are felt as sub-optimal.
However, what we can learn from this critiques, and what is that you
think is not working well in Redis? I really encourage you to share
your view.

As a starting point I'll use Rick tweet: "BGSAVE. the sentinel wtf.
memory cliffs. impossible to track what's in it. heap fragmentation.
LRU impl sux. etc et".
He also writes: "you can't even really dump the whole keyspace because
KEYS "*" causes it to shit it's"

This is a good starting point, and I'll use the rest of this email to
see what happened in the different areas of Redis criticized by Rick.

1) BGSAVE

I'm not sure what is wrong with BGSAVE, probably Rick had bad
experiences with EC2 instances where the fork time can create latency
spikes?

2) The Sentinel WTF.

Here probably the reference is the following:
http://aphyr.com/posts/283-call-me-maybe-redis

Aphyr analyzed Redis Sentinel from the point of view of a consistent
system, consistent as in CAP "strong consistency". During partition in
Aphyr tests Sentinel was not able to handle the promises of a CP
system.
I replied with a blog post trying to clarify that Redis Sentinel is
not designed to provide strong consistency in the face of partitions,
but only to provide some degree of availability when the master
instance fails.

However the implementation of Sentinel, even as a system promoting a
slave when the master fails, was not optimal, so there was work to
reimplement it from scratch. Finally the new Sentinel is available in
Redis 2.8.x
and is much more simple to understand and predict. This is surely an
improvement. The new implementation is able to version changes in the
configuration that are eventually propagated to all the other
Sentinels, requires majority to perform the failover, and so forth.

However if you understand even the basics of distributed programming
you know a few things, like how a system with asynchronous replication
is not capable to guarantee consistency.
Even if Sentinel was not designed for this, is Redis improving from
this point of view? Probably yes. For example now the unstable branch
has support for a new command called WAIT that implements a form of
synchronous replication.

Using WAIT and the new sentinel, it is possible to have a setup that
is quite partition resistant. For example if you have three computers,
A, B, C, and run a Sentinel instance and a Redis instance in every
computer, only the majority partition will be able to perform the
failover, and the minority partition will stop accepting writes if you
use "WAIT 1", that is, if you wait the propagation of the write to at
least one replica. The new Sentinel also elects the slave that has the
most updated version of data automatically.

Redis Cluster is another step forward towards Redis HA and automatic
sharding, we'll see how it works in practice. However I believe that
Sentinel is improving and Redis is providing more tools to fine-tune
consistency guarantees.

3) Impossible to track what is in it.

Lack of SCAN was a problem indeed; now it is solved. Even before, using
RANDOMKEY it was somewhat possible to inspect data sets, but SCAN is
surely a much better way to do this.
The same argument goes for KEYS *.

4) LRU implementation sucks.

The LRU implementation in Redis 2.4 had issues, and under mass-expire
there were latency spikes.
The LRU in 2.6 is much smoother; however it contained issues, signaled
by Pavlo Baron, where the algorithm was not able to guarantee expired
keys were always under a given threshold.
Newer versions of 2.6, and 2.8 of course, both fix this issue.

I'm not aware of remaining issues with the LRU algorithm.

I have the feeling that Rick's opinion is a bit biased by the fact that
he was exposed to older versions of Redis; however, his criticisms were
in part actually applicable to older versions of Redis.
This shows that there is something good about these critiques. For
instance Rick always said that replication sucked because of the lack
of partial resynchronization. I'm sorry he is no longer able to say this.
As a consolation prize we'll send him a t-shirt, if the budget permits.
But this again shows that critiques tend to be focused where
deficiencies *are*, so hiding Redis behind a finger is not a good idea
IMHO. We need to improve the system to make it better, as long as it is
still a useful system for many users.

So, what are the critiques that you hear frequently about Redis? What
are your own critiques? When does Redis suck?

Let's tear Redis apart, something good will happen.

Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

We suspect that trading off implementation flexibility for
understandability makes sense for most system designs.
       — Diego Ongaro and John Ousterhout (from Raft paper)




--
Dvir Volk
Chief Architect, Everything.me

Salvatore Sanfilippo

unread,
Dec 6, 2013, 6:02:47 PM12/6/13
to Redis DB
On Fri, Dec 6, 2013 at 7:29 PM, Josiah Carlson <josiah....@gmail.com> wrote:
> Long story short: every one of the existing data structures in Redis can be
> improved substantially. All of them can have their memory use reduced, and
> most of them can have their performance improved. I would argue that the
> ziplist encoding should be removed in favor of structures that are concise
> enough to make the optimization unnecessary for structures with more than 5
> or 10 items. If the intset encoding is to be kept, I would also argue that
> it should be modified to apply to all sets of integers (not just small
> ones), and its performance characteristics updated if it happens that the
> implementation changes to improve large intset performance.

Hello Josiah, thanks for your contribution. I agree with you, it is exactly
another case of "this is the simplest way to avoid work given that it
is good enough".
This would deserve a person allocated solely to it, who is able to
make steady progress and merge code when it is mature / tested enough
to avoid disasters, since it is a very sensitive area.

Cheers,

Salvatore Sanfilippo

unread,
Dec 6, 2013, 6:07:36 PM12/6/13
to Redis DB
Thanks Dvir, this is a very balanced message.

Certain use cases in your non-happy list will probably never be a good
fit for Redis, including complex relations.
The good news is that I see other entries about issues that are
finally getting solved.

Just to collect a data point about the fast failover: what were the
consistency requirements and the actual failover times? A few seconds,
milliseconds, or what?
The new Sentinel is faster at failing over, but could be made a lot faster
(on the order of 200 milliseconds instead of the 2-3 seconds it takes
now).

Salvatore

Dvir Volk

unread,
Dec 6, 2013, 6:20:47 PM12/6/13
to redi...@googlegroups.com
A few seconds were fine. If you remember our lengthy discussion about it (and the rejected 1000 line-long pull request :) ) from about a year ago, the problem we had was how to do this stuff dynamically without changing config files, and without having sentinel state conflicting with Chef state.

I ended up protecting the code itself from having a lost master, so the app won't fail while we do a longer failover process; and also moving the write-intensive, mission-critical stuff away from redis, to cassandra. As long as I treat redis as an (almost) read-only and potentially volatile data store (although we never suffered any major data loss with it) - all is fine.

Salvatore Sanfilippo

unread,
Dec 6, 2013, 6:26:08 PM12/6/13
to Redis DB
Ok, thanks for the additional context. A few seconds is already in
line with the new implementation; however, now that you say that, it
is really easy to bring the failover timeout delay under a few hundred
milliseconds. About the PR, I'm sorry, but I didn't have enough focus /
context at the time to really understand if it was a good thing or
not... I really tried to take a slower evolution path where I was able
to understand more during the process. Thanks for the PR anyway and
for the chats ;-)

Salvatore

Pierre Chapuis

unread,
Dec 6, 2013, 6:42:00 PM12/6/13
to redi...@googlegroups.com

On Friday, December 6, 2013 at 4:22:02 PM UTC+1, Pierre Chapuis wrote:

Tony Arcieri (Square, ex-LivingSocial) is a "frequent offender":

OK, it looks like I have an apology to make. I wanted to say that Tony had often criticised Redis. Instead I used an English expression which I clearly did not understand well. That was a really, really stupid thing to do.

Moreover, even though I do not share his point of view on Redis, I think Tony is a very good engineer I respect a lot. In particular, he wrote Celluloid, which you probably know about if you are interested in distributed systems and/or Ruby. That makes me even more ashamed to have written such a terrible thing.

Aphyr Null

unread,
Dec 6, 2013, 6:44:53 PM12/6/13
to redi...@googlegroups.com
> probably it is possible to achieve strong consistency 
> if you have a way to stop writes during the replication process.

A formal model and proof would go a long way towards convincing me. I strongly suspect that in the absence of transactional rollbacks, one cannot currently use WAIT to guarantee both linearizability and liveness in the presence of one or more node failures--not without careful control of the election process, anyway.

Howard Chu

unread,
Dec 6, 2013, 6:44:56 PM12/6/13
to redi...@googlegroups.com


On Friday, December 6, 2013 3:00:01 PM UTC-8, Salvatore Sanfilippo wrote:
On Fri, Dec 6, 2013 at 11:46 PM, Alberto Gimeno Brieba
<gime...@gmail.com> wrote:
> I think that having something like NDS officially supported would make redis
> a great option for many more usage cases. Many times the 90% of the "hot
> data" in your db fits in an inexpensive server, but the rest of the data is
> too big and would be too expensive (unaffordable) to have enough RAM for it.
> So in the end you choose other db for the entire dataset.

I completely understand this, but IMHO to make Redis on disk right we need:

1) an optional threaded model. You may use it to dispatch maybe only
slow queries and on-disk queries. Threads are not a good fit for Redis
in memory I think. Similarly I believe that threads are the key for a
good on disk implementation.
2) Representing every data structure on disk in a native way. Mostly a
btree of btrees or alike, but definitely some work ahead to understand
what to use or what to implement.

LMDB, which NDS uses, already supports btree of btrees.

The major flaw in any in-memory DB design, as I see it, is the notion that there is a difference between in-memory data and on-disk data. It inherently leads to waste of CPU + memory due to redundant caches and associated management code.

Kelly Sommers

unread,
Dec 6, 2013, 7:21:08 PM12/6/13
to redi...@googlegroups.com


On Friday, December 6, 2013 4:14:37 PM UTC-5, Salvatore Sanfilippo wrote:
On Fri, Dec 6, 2013 at 9:07 PM, Aphyr Null <aphyr...@gmail.com> wrote:
> While I am enthusiastic about the Redis project's improvements with respect
> to safety, this is not correct.

It is not correct if you take it as "strong consistency" because there
are definitely failure modes, basically it is not like if synchronous
replication + failover turned the system into Paxos or Raft. For
example if the master returns writable when the failover already
started we are no longer sure to pick the slave with the best
replication offset. However this is definitely "more consistent" then
in the past, and probably it is possible to achieve strong consistency
if you have a way to stop writes during the replication process.

Descriptions like this indicate the trade-offs aren't understood, explicitly chosen and designed or accounted for. What is Redis trying to be? Is Redis trying to be a CP or AP system? Pick one and design it as such. From my perspective, with masters and slaves, Redis is trying to be a CP system but it's not achieving the goals. If it's trying to be an AP system, it isn't achieving those goals either. 

Broken CP systems are the worst kinds of AP systems. They aren't as consistent as they intend to be, nor as available and eventually consistent as they ought to be.

Now for a little tough love. I share the #2 type criticism concern Felix mentioned. Respect for the complexity of the problems that production distributed systems face seems to be a root problem here. This is a common theme I see repeating, even today. I don't think one can claim that "distributed systems are not super hard" while their distributed system has issues. Some people devote their entire career to this domain and you don't just learn it in a couple months.

I post this because like many, I want to see Redis improve and I want to see users I work with that use it and everyone else have a better experience. I think the distributed systems community is very welcoming and that Redis could benefit from some design discussions and peer review in these areas.

Josiah Carlson

unread,
Dec 6, 2013, 7:31:33 PM12/6/13
to redi...@googlegroups.com
I thought your use of "frequent offender" with respect to Tony's complaints against Redis was right on :P

Whether or not he has built a lot of good stuff, Salvatore pointed out that his complaints were either FUD or missing the point of what Redis offers. Right tool for the right job and all that.

I wouldn't take it back, and I don't think that any reasonable person should have a problem with what you said.
 - Josiah



Josiah Carlson

unread,
Dec 6, 2013, 7:39:59 PM12/6/13
to redi...@googlegroups.com
On Fri, Dec 6, 2013 at 3:02 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
On Fri, Dec 6, 2013 at 7:29 PM, Josiah Carlson <josiah....@gmail.com> wrote:
> Long story short: every one of the existing data structures in Redis can be
> improved substantially. All of them can have their memory use reduced, and
> most of them can have their performance improved. I would argue that the
> ziplist encoding should be removed in favor of structures that are concise
> enough to make the optimization unnecessary for structures with more than 5
> or 10 items. If the intset encoding is to be kept, I would also argue that
> it should be modified to apply to all sets of integers (not just small
> ones), and its performance characteristics updated if it happens that the
> implementation changes to improve large intset performance.

Hello Josiah, thanks for your contrib. I agree with you, it is exactly
another case of "this is the simplest way to avoid work given that it
is good enough".
This would deserve a person allocated to this solely that is able to
do steady progresses and merge code when it is mature / tested enough
to avoid disasters, since it is a very sensible area.

Having someone on this as their job is exactly what it needs. It's a pity Pivotal missed the boat back in July.

 - Josiah

Alberto Gimeno

unread,
Dec 6, 2013, 8:00:12 PM12/6/13
to redi...@googlegroups.com
Hi,

On Sat, Dec 7, 2013, at 12:00 AM, Salvatore Sanfilippo wrote:
> On Fri, Dec 6, 2013 at 11:46 PM, Alberto Gimeno Brieba
> <gime...@gmail.com> wrote:
> > I think that having something like NDS officially supported would make redis
> > a great option for many more usage cases. Many times the 90% of the "hot
> > data" in your db fits in an inexpensive server, but the rest of the data is
> > too big and would be too expensive (unaffordable) to have enough RAM for it.
> > So in the end you choose other db for the entire dataset.
>
> I completely understand this, but IMHO to make Redis on disk right we
> need:
>
> 1) an optional threaded model. You may use it to dispatch maybe only
> slow queries and on-disk queries. Threads are not a good fit for Redis
> in memory I think. Similarly I believe that threads are the key for a
> good on disk implementation.
> 2) Representing every data structure on disk in a native way. Mostly a
> btree of btrees or alike, but definitely some work ahead to understand
> what to use or what to implement.

What about using an already working disk key-value store like leveldb,
rocksdb (http://rocksdb.org), lmdb (like nds does
https://github.com/mpalmer/redis/tree/nds-2.6/deps/liblmdb ), etc.?

Matt Palmer

unread,
Dec 6, 2013, 8:43:23 PM12/6/13
to redi...@googlegroups.com
On Sat, Dec 07, 2013 at 12:00:01AM +0100, Salvatore Sanfilippo wrote:
> On Fri, Dec 6, 2013 at 11:46 PM, Alberto Gimeno Brieba
> <gime...@gmail.com> wrote:
> > I think that having something like NDS officially supported would make redis
> > a great option for many more usage cases. Many times the 90% of the "hot
> > data" in your db fits in an inexpensive server, but the rest of the data is
> > too big and would be too expensive (unaffordable) to have enough RAM for it.
> > So in the end you choose other db for the entire dataset.
>
> I completely understand this, but IMHO to make Redis on disk right we need:
>
> 1) an optional threaded model. You may use it to dispatch maybe only
> slow queries and on-disk queries. Threads are not a good fit for Redis
> in memory I think. Similarly I believe that threads are the key for a
> good on disk implementation.

Well, in theory we've got Posix AIO and O_NONBLOCK, but... hahahaha. No.

I've pondered using bio to handle reads from disk, which would mostly just
involve adding the ability for bio to notify the event loop that a
particular key was now in memory (and thus running all those commands
blocked on that key), but I'm keeping that in reserve for a rainy and boring
weekend...
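
To make the idea above a bit more concrete, here is a minimal sketch of that
kind of notification: a background thread finishes a disk read and pokes the
event loop through a pipe that the loop is already polling. The names are made
up; this is not the actual bio interface:

    /* Sketch of the notification described above: a background thread loads a
     * key "from disk" and then wakes the event loop by writing to a pipe the
     * loop is already poll()ing. Names are made up; this is not the bio API.
     * Compile with -pthread. */
    #include <pthread.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    struct load_job {
        const char *key;   /* key the blocked commands are waiting for */
        int notify_fd;     /* write end of the event-loop wakeup pipe  */
    };

    static void *disk_loader(void *arg) {
        struct load_job *job = arg;
        sleep(1);                          /* pretend to read the key from disk */
        if (write(job->notify_fd, &job, sizeof(job)) != sizeof(job))
            perror("write");               /* notify: job pointer goes down the pipe */
        return NULL;
    }

    int main(void) {
        int pipefd[2];
        if (pipe(pipefd) == -1) { perror("pipe"); return 1; }

        struct load_job job = { "rarely:used:key", pipefd[1] };
        pthread_t tid;
        pthread_create(&tid, NULL, disk_loader, &job);

        /* The "event loop": in real life the pipe would be polled together
         * with all the client sockets, so fast commands are served meanwhile. */
        struct pollfd pfd = { .fd = pipefd[0], .events = POLLIN };
        while (poll(&pfd, 1, 200) == 0)
            printf("event loop: serving in-memory commands...\n");

        struct load_job *done;
        if (read(pipefd[0], &done, sizeof(done)) == sizeof(done))
            printf("key '%s' is now in memory, run the blocked commands\n", done->key);

        pthread_join(tid, NULL);
        return 0;
    }
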

For now, I recommend enabling nds-keycache, keeping an eye on your cache hit
rate to make sure your maxmemory is high enough, and living with the
occasional latency spike when you have to go to disk to read in a
rarely-used key. Hell, if you're running Redis in EC2, you're used to huge
latency spikes, right? </me ducks>

> 2) Representing every data structure on disk in a native way. Mostly a
> btree of btrees or alike, but definitely some work ahead to understand
> what to use or what to implement.

I actually don't think this is a huge blocker. The time involved in
deserialising a value from the packed RDB format is, I believe, a small part
of the total time involved in getting a key from disk to memory -- compared
to how long you spend waiting for the disk to barf up something useful,
almost any CPU-oriented operation is lightning fast. True, I haven't
benchmarked this, and if someone does wave a profiler at NDS and it shows
that the amount of time spent in rdbLoadObject is a significant percentage
of the time spent in getNDS, I'll gladly change my opinion. Until then,
I'll worry more about reducing the impact of disk operations on request
latency.

> So for now the choice is to stay focused in the in-memory paradigm

For you, perhaps. I'm having quite a bit of fun over here shuffling data on
and off disk inside of Redis. <grin> It's the beauty of OSS -- you can
focus on what you think is more important / interesting, and so can everyone
else.

And thanks, by the way, for providing such a high-quality, easy-to-hack-on
codebase to use as a starting point for my adventures.

- Matt

--
"After years of studying math and encountering surprising and
counterintuitive results, I came to accept that math is always reasonable,
but my intuition of what is reasonable is not always reasonable."
-- Steve VanDevender, ASR

Josh Berkus

unread,
Dec 6, 2013, 9:00:44 PM12/6/13
to redi...@googlegroups.com
On 12/06/2013 05:43 PM, Matt Palmer wrote:
> I actually don't think this is a huge blocker. The time involved in
> deserialising a value from the packed RDB format is, I believe, a small part
> of the total time involved in getting a key from disk to memory -- compared
> to how long you spend waiting for the disk to barf up something useful,
> almost any CPU-oriented operation is lightning fast. True, I haven't
> benchmarked this, and if someone does wave a profiler at NDS and it shows
> that the amount of time spent in rdbLoadObject is a significant percentage
> of the time spent in getNDS, I'll gladly change my opinion. Until then,
> I'll worry more about reducing the impact of disk operations on request
> latency.

Actually, you'd be surprised how much time you can spend in
serialization operations. It's nothing compared with reading from EBS,
of course, but some people have faster disks than that; SSDs are quite
affordable these days, and even Amazon has dedicated IOPS. Not that
your prioritization is wrong; it's still better to spend your time where
you are spending it.

BTW, once we go over to disk-backed Redis, we're pretty much certain to
need a better append-only log. The general approach for on-disk
databases is to write first to the AOL (or WAL), and then have a
background process shuffle data to the searchable representation of the
database on disk; it turns out that writing to an AOL is vastly faster
than writing to more elaborately structured data, even (nay, especially)
on SSD.

Of course, right now we don't *have* background processes ...

Anyway, as an Old Database Geek, I'll speak for the Postgres community
and say that we're around if you need advice on how to manage disk-based
access. We have more than a little experience in this regard ;-)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Matt Palmer

unread,
Dec 6, 2013, 9:04:29 PM12/6/13
to redi...@googlegroups.com
On Sat, Dec 07, 2013 at 02:00:12AM +0100, Alberto Gimeno wrote:
> On Sat, Dec 7, 2013, at 12:00 AM, Salvatore Sanfilippo wrote:
> > I completely understand this, but IMHO to make Redis on disk right we
> > need:

[...]

> > 2) Representing every data structure on disk in a native way. Mostly a
> > btree of btrees or alike, but definitely some work ahead to understand
> > what to use or what to implement.
>
> What about using an already working disk key-value store like leveldb,
> rocksdb (http://rocksdb.org), lmdb (like nds does
> https://github.com/mpalmer/redis/tree/nds-2.6/deps/liblmdb ), etc.?

I think the issue that Salvatore is talking about there is that with all
those examples you've given, they all treat the values associated with keys
as opaque blobs. Redis, on the other hand, provides its value squarely in
the realm of "I know what these values are, and I have the commands
necessary to allow you to manipulate them".

For NDS, I've gotten around that by only allowing disk/memory granularity at
the key level -- if you want any part of a key, the entire key gets loaded
into memory, and then Redis works on it entirely as normal. This is
hideously inefficient for very large values (hence the "Naive" in "Naive
Disk Store") and performance for almost all types of values would be greatly
improved if granularity increased, but what's there now works well enough
for a great variety of workloads. (/me gives the sigmonster a cookie)

- Matt

--
> There really is no substitute for brute force.
Indeed - I must admit to being a disciple of blessed Saint Makita myself.
-- Robert Sneddon and Tanuki, in the Monastery

Rodrigo Ribeiro

unread,
Dec 7, 2013, 12:05:30 AM12/7/13
to redi...@googlegroups.com
This is a great post Antirez, redis will only improve from this kind of feedback.

Well, we use redis extensively at JusBrasil and my biggest complaint is how expensive it can be to keep a large dataset highly available.
One of our uses is processing user feeds. For this we have a cluster of 90 redis instances (distributed across 15 servers); 2/3 of those instances are slaves, used for reads and by our failover mechanism (similar to sentinel).
The problem is that we need to use 2x more memory even if we decide not to read from slaves.

Redis could have an option to run as a "cold slave" that only receives changes from the master and appends them to disk (RDB+AOF or something similar to the NDS fork), keeping minimal memory usage while in this state.
Then, when sentinel elects it as the new master, it would load everything into memory and come back to normal execution.
This would represent a huge memory reduction for our cluster; just an idea though.

I also think the core development could be closer to the community's work.
I understand that it is important to keep redis simple, but I see a few forks with good contributions (e.g. NDS, Sentinel automatic discovery/registration), yet not much movement to merge them into the core.
-- 
Rodrigo Ribeiro


Matt Palmer

unread,
Dec 7, 2013, 3:26:34 AM12/7/13
to redi...@googlegroups.com
On Fri, Dec 06, 2013 at 06:00:44PM -0800, Josh Berkus wrote:
> On 12/06/2013 05:43 PM, Matt Palmer wrote:
> > I actually don't think this is a huge blocker. The time involved in
> > deserialising a value from the packed RDB format is, I believe, a small part
> > of the total time involved in getting a key from disk to memory -- compared
> > to how long you spend waiting for the disk to barf up something useful,
> > almost any CPU-oriented operation is lightning fast. True, I haven't
> > benchmarked this, and if someone does wave a profiler at NDS and it shows
> > that the amount of time spent in rdbLoadObject is a significant percentage
> > of the time spent in getNDS, I'll gladly change my opinion. Until then,
> > I'll worry more about reducing the impact of disk operations on request
> > latency.
>
> Actually, you'd be surprised how much time you can spend in
> serailization operations. It's nothing compared with reading from EBS,
> of course, but some people have faster disks than that; SSDs are quite
> affordable these days, and even Amazon has dedicated IOPS.

While I've come to the conclusion that PIOPS are snakeoil, SSDs are quite
nice -- but they're not magic. They're still not as fast as RAM or CPU.

> BTW, once we go over to disk-backed Redis, we're pretty much certain to
> need a better append-only log. The general approach for on-disk
> databases is to write first to the AOL (or WAL), and then have a
> background process shuffle data to the searchable representation of the
> database on disk; it turns out that writing to an AOL is vastly faster
> than writing to more elaborately structured data, even (nay, especially)
> on SSD.

Oh, definitely. In the case of NDS, writing to disk doesn't impact
performance, because that's done from memory to disk in a forked background
process, but that naturally sucks because the data isn't properly durable
(the use case I was addressing meant I can suffer the loss of the last few
writes).

For a proper disk-backed Redis, I'd be switching to something like AOF
fragments to store the log, and the background process would rewrite the AOF
fragments into the disk cache; on startup, this would also be done before we
start serving data.

> Anyway, as an Old Database Geek, I'll speak for the Postgres community
> and say that we're around if you need advice on how to manage disk-based
> access. We have more than a little experience in this regard ;-)

Yeah, I can imagine...

- Matt

--
The hypothalamus is one of the most important parts of the brain, involved
in many kinds of motivation, among other functions. The hypothalamus
controls the "Four F's": 1. fighting; 2. fleeing; 3. feeding; and 4. mating.
-- Psychology professor in neuropsychology intro course

Matt Palmer

unread,
Dec 7, 2013, 3:31:20 AM12/7/13
to redi...@googlegroups.com
On Fri, Dec 06, 2013 at 09:05:30PM -0800, Rodrigo Ribeiro wrote:
> Redis could have a option to run as a "cold-slave", that only receive
> changes from master and append to disc(RDB+AOF or something similar to NDS
> fork), keeping minimal memory usage while in this state.

You could definitely do this with NDS right now -- set a low nds-watermark
and a huge maxmemory on the slaves, and then as part of the promotion
process, set nds-watermark to 0 (turns it off) and trigger a preload.
Performance will suck a little while the preloading gets everything into
memory, but after that it'll feel just like normal Redis, except you'll get
the benefits of NDS persistence (quick restarts, frequent but tiny disk
flushes, etc).

- Matt


--
Judging by this particular thread, many people in this group spent their
school years taking illogical, pointless orders from morons and having their
will to live systematically crushed. And people say school doesn't prepare
kids for the real world. -- Rayner, in the Monastery

Robert Allen

unread,
Dec 6, 2013, 9:07:56 PM12/6/13
to redi...@googlegroups.com
Firstly, I would like to say thank you to all contributors for your time, efforts and contributions to this outstanding project. 

We have utilised redis for three and a half years with only one notable incident, an incident I attribute solely to a failed HA implementation not related to redis itself. Charles Eames is quoted as saying, "design depends largely on constraints." This holds true for redis and all other systems components. As consumers, we have the noble responsibility to ensure we know, define and learn the constraints of all components we deploy or develop. Our deployments of redis have grown massively in the three-plus years of constant use, necessitating that these deployments be configured and tuned with workloads divided according to what they are responsible for. We do not mix persisting data, transient cache keys or sessions; we do not utilise Sentinel for HA yet (I would like to, but I am giving it more time).

In summary, I am convinced that, at this time, there are no other viable products that would fit our environment and constraints as well as redis has and will continue to for the foreseeable future. 



Salvatore Sanfilippo

unread,
Dec 7, 2013, 3:52:12 AM12/7/13
to Redis DB
On Sat, Dec 7, 2013 at 12:44 AM, Aphyr Null <aphyr...@gmail.com> wrote:
> A formal model and proof would go a long way towards convincing me. I
> strongly suspect that in the absence of transactional rollbacks, one cannot
> currently use WAIT to guarantee both linearizability and liveness in the
> presence of one or more node failures--not without careful control of the
> election process, anyway.

A formal model would be required indeed, but surprisingly I think that
transactional rollbacks are totally irrelevant.
Take Raft for example: when the system replies to you that your new entry
was not accepted because it could not be replicated to the majority of
the nodes, it actually means: "I don't know if the entry will ever be
replicated, I can't guarantee it either way". I'm not sure about Paxos,
it is possible that it has the same semantics, but anyway, whatever the
ability to internally roll back or not an operation that did not
reach the majority, you are always facing the problem of the
*client* not receiving the acknowledgement after the write was
accepted.

So I believe all the care must go into the failover process.

Salvatore Sanfilippo

unread,
Dec 7, 2013, 5:21:55 AM12/7/13
to Redis DB
On Sat, Dec 7, 2013 at 1:21 AM, Kelly Sommers <kell.s...@gmail.com> wrote:

> Descriptions like this indicate the trade-offs aren't understood, explicitly
> chosen and designed or accounted for. What is Redis trying to be? Is Redis
> trying to be a CP or AP system? Pick one and design it as such. From my
> perspective, with masters and slaves, Redis is trying to be a CP system but
> it's not achieving the goals. If it's trying to be an AP system, it isn't
> achieving those goals either.

I believe there is a place for "relaxed" CP systems. In Redis by
default the replication is asynchronous, and most people will use it
this way. At the same time, because of the data model and the size of
the aggregate values single keys can hold, I don't want
application-assisted merges, semantically. Relaxed CP systems can trade
part of their consistency properties for performance and simple
semantics; I'm not sure why this is not acceptable.

This is not new, either: when a sysop relaxes the fsync policy of a
database to "flush every 2 seconds" he is making the same conceptual
tradeoff, for good reasons.

> Now for a little tough love. I share the #2 type criticism concern Felix
> mentioned. Respect for the complexity of the problems that production
> distributed systems face seems to be a root problem here. This is a common
> theme I see repeating, even today. I don't think one can claim that
> "distributed systems are not super hard" while their distributed system has
> issues. Some people devote their entire career to this domain and you don't
> just learn it in a couple months.

What I mean is that everything is hard, and everything is approachable
at the same time.
Designing 3D engines for video games is super hard. Writing a device
driver is super hard. Implementing reliable system programs is hard.
However distributed systems are everywhere, including where they were
not supposed to be; better to develop, as a community, skills in this
area at large instead of being intimidated by it.
I spent decades learning how to write proper C code that does not
crash easily, but that does not mean less experienced people should be
scared of doing systems programming in C.

For sure a few months of exposure will not enable you to produce
work like Raft or Paxos, but the basics can be used to try to
design practical systems that can be improved over time.
While we are having this conversation, maybe half the AP systems are
running with wall-clock last-write-wins for *practical* reasons, so
it is not like distributed systems are an exact science when applied:
there are theoretical tradeoffs, then implementation tradeoffs, then
application-semantics tradeoffs.

> I post this because like many, I want to see Redis improve and I want to see
> users I work with that use it and everyone else have a better experience. I
> think the distributed systems community is very welcoming and that Redis
> could benefit from some design discussions and peer review in these areas.

My process has always been like that: publish a description of the
system, make people aware of it, and finally start to write the
implementation.
I think this is an open process that allows for contributions.

If you or other interested parties are willing to comment on the Redis
Cluster design, I'll be super excited about that, seriously.
Of course if you tell me: hey, no no no, let's make this a true CP
system regardless of the fact that this means mandatory synchronous
replication to the majority of nodes and fsyncing every operation to
disk, I can't say this will be helpful. I mean, theory should not win
over the intended goals of the system. However suggestions could
surely help. Redis Cluster is currently a simple system with a simple
implementation; it is something we can change if needed. I'm open to
suggestions...

Salvatore


>>
>>
>> I understand this not the "C" consistency of "CAP" but, before: the
>> partition with clients and the (old) master partitioned away would
>> receive writes that gets lost.
>> after: under certain system models the system is consistent, like if
>> you assume that crashed instances never start again. It is not
>> realistic as a system model, but it means that in practice you have a
>> better real-world behavior, and in theory you have a system that is
>> going towards a better consistency model.
>>
>> Regards,
>> Salvatore
>>
>> --
>> Salvatore 'antirez' Sanfilippo
>> open source developer - GoPivotal
>> http://invece.org
>>
>> We suspect that trading off implementation flexibility for
>> understandability makes sense for most system designs.
>> — Diego Ongaro and John Ousterhout (from Raft paper)
>



Javier Guerra Giraldez

unread,
Dec 7, 2013, 9:51:40 AM12/7/13
to redi...@googlegroups.com
On Fri, Dec 6, 2013 at 6:44 PM, Howard Chu <highl...@gmail.com> wrote:
> LMDB, which NDS uses, already supports btree of btrees.


for very limited values of 'support'. after playing with it for a while,
i think it's the best option for on-disk key-value libraries; but the
limited key size makes it not enough out of the box for a redis-like
on-disk db.

a possibility i tried (just some PoC code) is what i called a "hash
tree": like a hash table, but instead of an array indexed by key
hashes, use the LMDB tree keyed by key hashes. IOW: take the user's
key, hash it, and use that as (part of) the key to the LMDB tree.
since the keys aren't as limited as the indexes of a hash table,
collisions are _extremely_ rare, and easily solved by storing the full
key with the value. I tried with 128-bit SipHash and wasn't able to
get a single collision with any dataset i could get my hands on
(murmur3 did get a few).

then, thinking on the 'btree of btrees' comment of Salvatore, i tried
a two-key schema: the LMDB key would be (<parent key hash>,<item key
hash>), and since the underlying tree preserves ordering, all the
'siblings' are consecutive and can be retrieved efficiently.

... and i stopped there, because i got distracted by some other shiny
things... , ahem, i mean, paying jobs.
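
As a rough sketch of that two-key schema (using FNV-1a only as a stand-in for
the 128-bit SipHash mentioned above), the composite key could be built like
this; the 16-byte buffer is what would be handed to LMDB as the key:

    /* Sketch of the "two-key schema": a 16-byte key made of
     * (parent key hash, item key hash), stored big-endian so that all items
     * of the same parent sort next to each other under LMDB's default
     * byte-wise key comparison. FNV-1a stands in for SipHash here. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint64_t fnv1a64(const char *s) {
        uint64_t h = 1469598103934665603ULL;
        while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
        return h;
    }

    static void put_be64(unsigned char *p, uint64_t v) {
        for (int i = 7; i >= 0; i--) { p[i] = v & 0xff; v >>= 8; }
    }

    /* Builds the 16-byte composite key; this buffer is what would become the
     * MDB_val key passed to the LMDB put/get calls. */
    static void make_key(unsigned char out[16], const char *parent, const char *item) {
        put_be64(out,     fnv1a64(parent));
        put_be64(out + 8, fnv1a64(item));
    }

    int main(void) {
        unsigned char k1[16], k2[16], k3[16];
        make_key(k1, "myhash", "field:a");
        make_key(k2, "myhash", "field:b");
        make_key(k3, "otherhash", "field:a");

        /* Siblings share the first 8 bytes, so a range scan starting at
         * (hash("myhash"), 0) visits them consecutively. */
        printf("same parent prefix: %d\n", memcmp(k1, k2, 8) == 0);
        printf("different parent prefix: %d\n", memcmp(k1, k3, 8) != 0);
        return 0;
    }
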


--
Javier

Quentin Adam

unread,
Dec 7, 2013, 11:59:20 AM12/7/13
to redi...@googlegroups.com
Hi

To be clear: I think it's important to choose the right technology for your usage. People often use technology they like or know... And redis is moving fast, but some people don't use it for the right use case. Indeed, I'm not very comfortable with redis for sessions in a lot of use cases... like long-expiration sessions: couchbase is nice there because it uses the hard drive.

But Redis can be great and move fast :-)

Best regards


On Friday, December 6, 2013 at 4:22:02 PM UTC+1, Pierre Chapuis wrote:
Others:

Quentin Adam, CEO of Clever Cloud (a PaaS) has a presentation that says Redis is not fit to store sessions: http://www.slideshare.net/quentinadam/dotscale2013-how-to-scale/15 (he advises Membase)


Tony Arcieri (Square, ex-LivingSocial) is a "frequent offender":

https://twitter.com/bascule/status/277163514412548096
https://twitter.com/bascule/status/335538863869136896
https://twitter.com/bascule/status/371108333979054081
https://twitter.com/bascule/status/390919938862379008

Then there's the Disqus guys, who migrated to Cassandra,
the Superfeedr guys who migrated to Riak...

Instagram moved to Cassandra as well, here's more on
it by Branson to see where he comes from:
http://www.planetcassandra.org/blog/post/cassandra-summit-2013-instagrams-shift-to-cassandra-from-redis-by-rick-branson

This presentation about scaling Instagram with a small
team (by Mike Krieger) is very interesting as well:
http://qconsf.com/system/files/presentation-slides/How%20a%20Small%20Team%20Scales%20Instagram.pdf
He says he would go with Redis again, but there are
some points about scaling up Redis starting at slide 56.

My personal experience, to be clear, is that Redis is an
awesome tool when you know how it works and how to
use it, especially for a small team (like Krieger basically).

I have worked for a company with a very reduced technical
team for the last 3.5 years. We make technology for mobile
applications which we sell to large companies (retail, TV,
cinema, press...) mostly white-labelled. I have written most
of our server side software, and I have also been responsible
for operations. We have used and still use Redis *a lot*, and
some of the things we have done would just not have been
possible with such a reduced team in so little time without it.

So when I read someone saying he would ban Redis from
his architecture if he ever makes a startup, I think: "good
thing he doesn't." :)

Thank you Antirez for this awesome tool.

Alberto Gimeno

unread,
Dec 7, 2013, 1:05:50 PM12/7/13
to redi...@googlegroups.com
Hi,

Just an idea about the disk-based storage. Many people say that redis is
like a toolset, and I agree. What about making this toolset more flexible
so it supports hybrid approaches?

For example redis could have an interface to load keys from a secondary
source. Every time lookupKey() does not find the key in the current redis
database, a plugin implementing a simple interface could be asked to look
it up in the other source. And probably an option to move a key from redis
to the other datasource. This way you can integrate redis with a
disk-based key-value store or many other things without changing
anything at the application level, Lua scripts, etc. And redis remains
as clean as possible, with minimal dependencies, but with the ability to
interoperate with other storage engines or datasources.

I don't know how it could be implemented. Maybe at compile time, by
implementing a simple interface; maybe through a TCP connection with a
minimal protocol...
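
A minimal sketch of what such a compile-time hook could look like is below.
None of these names exist in Redis; the fallback store is only a stand-in for
LevelDB/RocksDB/LMDB or anything else:

    /* Hypothetical sketch of the idea above: a pluggable "secondary source"
     * consulted when a key is not found in memory. None of these names exist
     * in Redis; this is just an illustration. */
    #include <stdio.h>
    #include <string.h>

    struct fallback_store {
        /* Returns the value for key, or NULL if the secondary source
         * doesn't have it either. */
        const char *(*get)(const char *key);
    };

    /* A toy "disk" backend standing in for LevelDB/RocksDB/LMDB/etc. */
    static const char *disk_get(const char *key) {
        if (strcmp(key, "cold:key") == 0) return "value-from-disk";
        return NULL;
    }
    static struct fallback_store disk_store = { disk_get };

    /* Toy in-memory table standing in for the redis keyspace dict. */
    static const char *memory_get(const char *key) {
        if (strcmp(key, "hot:key") == 0) return "value-from-memory";
        return NULL;
    }

    /* What a lookup with a registered fallback store could do. */
    static const char *lookup_key(const char *key, struct fallback_store *fb) {
        const char *v = memory_get(key);
        if (v) return v;
        return fb ? fb->get(key) : NULL;   /* fall back to the plugin */
    }

    int main(void) {
        printf("%s\n", lookup_key("hot:key",  &disk_store)); /* value-from-memory */
        printf("%s\n", lookup_key("cold:key", &disk_store)); /* value-from-disk   */
        return 0;
    }
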

--
Alberto Gimeno
http://backbeam.io/
http://twitter.com/gimenete

Kelly Sommers

unread,
Dec 7, 2013, 1:35:49 PM12/7/13
to redi...@googlegroups.com


On Saturday, December 7, 2013 5:21:55 AM UTC-5, Salvatore Sanfilippo wrote:
On Sat, Dec 7, 2013 at 1:21 AM, Kelly Sommers <kell.s...@gmail.com> wrote:

> Descriptions like this indicate the trade-offs aren't understood, explicitly
> chosen and designed or accounted for. What is Redis trying to be? Is Redis
> trying to be a CP or AP system? Pick one and design it as such. From my
> perspective, with masters and slaves, Redis is trying to be a CP system but
> it's not achieving the goals. If it's trying to be an AP system, it isn't
> achieving those goals either.

I believe there is a place for "relaxed" CP systems. In Redis by
default the replication is asynchronous, and most people will use it
this way. At the same time because of the data model, and the size of
aggregate values single keys can hold, I don't want application
assisted merges, semantically. Relaxed CP systems can trade part of
consistency properties for performance and simple semantics, I'm not
sure why this is not acceptable.

"Relaxed CP" is a new one I've never heard. You're either consistent or you are not. There's no such thing as "I'm a little less pregnant". Please let's not start making stuff up. Once you relax consistency, you're no longer a CP system.

This denial reminds me of RavenDB's insistence that they are an ACID database while admitting they have a broken isolation model. Nobody believes this incorrect representation of ACID outside their own community. StackOverFlow is full of problems the users are experiencing due to partially provided guarantees. Denial only continues to hurt the users because the problems aren't being addressed. If Redis is going to get a good story for replication and distribution, it's not going to get there by denying the current design flaws. 

What's the difference between Cassandra's ConsistencyLevel.ALL and Redis WAIT? Not a lot. Cassandra is an AP system. CL.ALL in Cassie gives a "best effort" for consistency, but in my experience this confuses and misleads users, because some people think it means they can opt out of AP and become CP: that if you use CL.ALL you will be consistent. I have to explain why this is false on a daily basis. No joke. I wish CL.ALL didn't exist. There are use cases for it, but they are few and you need to understand the nuances to be effective.

Similar to ACID properties, if you partially provide properties it means the user _still_ has to consider in their application that the property doesn't exist, because sometimes it doesn't. In your fsync example, if fsync is relaxed and there are no replicas, you cannot consider the database durable, just like you can't consider Redis a CP system. It can't be counted on to deliver its guarantees. This is why I say these systems are hard for users to reason about. Systems that partially offer guarantees require in-depth knowledge of the nuances to properly use the tool. Systems that make their trade-offs explicitly in the design are easier to reason about because the behavior is more obvious and _predictable_.

Redis is trying to cherry pick the best of both worlds, a master/slave system with WAIT but without the proper failure semantics to make it a true reliable CP system and with an asynchronous replication that undermines all of that. On the flip side, the asynchronous replication could do with a lot more supporting features if it is to support an AP system. What we are left with is a system that isn't good at either. It looks like a system where someone can't make up their mind what to build.

Databases provide or trade-off guarantees so that applications have a set of expectations on what they can consider correct. When correct state is confusing and difficult to predict, it makes it very difficult for applications to compensate. I don't recommend any of Redis clustering or replication to people I work with because people find it hard to reason about in production systems - for good reason. This will continue until I see that the design is influenced by explicit trade-offs made and that I am confident users can use it properly.

This conversation is about how to fix that, which I would love to see! It starts with a simple, but hard question to answer. What do you want a Redis cluster to be?

Matt Palmer

unread,
Dec 7, 2013, 4:16:25 PM12/7/13
to redi...@googlegroups.com
On Sat, Dec 07, 2013 at 07:05:50PM +0100, Alberto Gimeno wrote:
> For example redis could have an interface to load keys from a secondary
> source. Everytime lookupKey() does not find the key in the current redis
> database, let plug something using a simple interface to be able to
> lookup in other source. And probably an option to move a key from redis
> to the other datasource. This way you can integrate redis with a
> disk-based key-value store or many other things without changing
> anything at the application level, lua scripts, etc. And redis remains
> as clean as possible, with minimal dependencies, but with the ability to
> interoperate with other storage engines or datasources.

Having done exactly that, I can say that there are a *lot* of places you've
got to hook into Redis to make that a possibility, and the semantics
involved will take some effort to genericise. There's also a fair amount of
logic involved to minimise the impact of this work on the fast path of
serving in-memory requests quickly (which I haven't even solved completely
yet, so I'm not sure how deep that particular rabbit hole will go yet).
I would presume that anything that cuts into Redis' fundamental mission of
being a very fast and consistently responsive data structure store wouldn't
be well regarded as a core feature.

- Matt

--
When the revolution comes, they won't be able to FIND the wall.
-- Brian Kantor, in the Monastery

Salvatore Sanfilippo

unread,
Dec 7, 2013, 5:12:45 PM12/7/13
to Redis DB
On Sat, Dec 7, 2013 at 7:35 PM, Kelly Sommers <kell.s...@gmail.com> wrote:
>
>
> On Saturday, December 7, 2013 5:21:55 AM UTC-5, Salvatore Sanfilippo wrote:
>>
>> On Sat, Dec 7, 2013 at 1:21 AM, Kelly Sommers <kell.s...@gmail.com> wrote:
>>
>> > Descriptions like this indicate the trade-offs aren't understood,
>> > explicitly
>> > chosen and designed or accounted for. What is Redis trying to be? Is
>> > Redis
>> > trying to be a CP or AP system? Pick one and design it as such. From my
>> > perspective, with masters and slaves, Redis is trying to be a CP system
>> > but
>> > it's not achieving the goals. If it's trying to be an AP system, it
>> > isn't
>> > achieving those goals either.
>>
>> I believe there is a place for "relaxed" CP systems. In Redis by
>> default the replication is asynchronous, and most people will use it
>> this way. At the same time because of the data model, and the size of
>> aggregate values single keys can hold, I don't want application
>> assisted merges, semantically. Relaxed CP systems can trade part of
>> consistency properties for performance and simple semantics, I'm not
>> sure why this is not acceptable.
>
>
> "Relaxed CP" is a new one I've never heard. You're either consistent or you
> are not. There's no such thing as "I'm a little less pregnant". Please let's
> not start making stuff up. Once you relax consistency, you're no longer a CP
> system.

Yes, when I talk about "Relaxed CP" I mean not true CP systems, with no
strong consistency, but a tradeoff that is an approximation of CP.
The "C" consistency of CAP requires two fundamental things to happen:

1) Replicate to the majority before acknowledging the write.
2) Sync every operation to disk in the replicas before acknowledging
the node proposing the new state.

Waiting for the slowest fsync among the first N/2+1 replicas is not
going to work well for Redis.
However I believe it is wrong to think that only strong consistency
is worthwhile.

For example, a system may employ asynchronous replication with
asynchronous acks: if an operation committed more than 1 second ago is
still not acknowledged, the system stops accepting writes.
This is a form of non-strong consistency that trades strong
consistency for latency. Are people crazy to use such a system? I
don't believe so, because a system is the sum of the performance
characteristics it has while running without partitions, plus the
consistency and availability characteristics it has when a partition
happens. You can't evaluate a system by considering only what happens
during the worst-case scenario.
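
A toy sketch of that rule (stop accepting writes if the oldest unacknowledged
write is older than a fixed window) could look like this; the names and the
timestamp-based bookkeeping are invented for illustration, and a real system
would track replication offsets rather than a single timestamp:

    /* Toy sketch: asynchronous replication with asynchronous acks, where the
     * master refuses new writes if the oldest unacknowledged write is older
     * than a fixed window. Everything here is illustrative. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define MAX_UNACKED_AGE 1.0            /* seconds */

    static time_t oldest_unacked = 0;      /* 0 means "everything acked so far" */

    static bool can_accept_write(time_t now) {
        if (oldest_unacked == 0) return true;
        return difftime(now, oldest_unacked) <= MAX_UNACKED_AGE;
    }

    static void on_write_sent_to_replica(time_t now) {
        if (oldest_unacked == 0) oldest_unacked = now;   /* start the clock */
    }

    static void on_replica_acked_everything(void) {
        oldest_unacked = 0;                /* replicas caught up, reset */
    }

    int main(void) {
        time_t t0 = time(NULL);
        on_write_sent_to_replica(t0);
        printf("t+0s, write accepted? %d\n", can_accept_write(t0));          /* 1 */
        printf("t+2s, write accepted? %d\n", can_accept_write(t0 + 2));      /* 0 */
        on_replica_acked_everything();
        printf("after ack, write accepted? %d\n", can_accept_write(t0 + 2)); /* 1 */
        return 0;
    }
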

So an application where losing some writes, bounded to a given maximum
window, is acceptable will be ok with a system that performs very well
during normal operations, because it is "business acceptable" to trade
consistency for this gain in performance.

> What's the difference between Cassandra's ConsistencyLevel.ALL and Redis
> WAIT? Not a lot. Cassandra is an AP system. CL.ALL in Cassie gives a "best
> effort" for consistency but in my experience this confuses and misleads
> users because some people think this means they can opt out of AP and become
> CP if you use CL.ALL and you will be consistent. I have to explain why this
> is false on a daily basis. No joke. I wish CL.ALL didn't exist. There are
> use cases for it, but these are few and you need to understand the nuances
> to be effective.

WAIT is a tool that you can use in general, exactly like CL.ALL; the
problem, as with every tool, is that you have to understand its exact
semantics.

Example: WAIT can be used in order to run a Redis stand-alone instance
with replicas in "CA" mode. Writes only succeed if you can replicate
to all the replicas (no availability at all under partitions).
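
For example, with a client library such as hiredis, a write-then-WAIT pattern
could look roughly like the sketch below; the key name, the replica count of 2
and the 50 ms timeout are arbitrary examples, and it assumes a master with two
attached replicas and a Redis version that has WAIT:

    /* Rough sketch with the hiredis client: treat a write as fully replicated
     * only if WAIT reports that both replicas acknowledged it. */
    #include <hiredis/hiredis.h>
    #include <stdio.h>

    int main(void) {
        redisContext *c = redisConnect("127.0.0.1", 6379);
        if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

        redisReply *r = redisCommand(c, "SET balance:42 100");
        freeReplyObject(r);

        /* WAIT <numreplicas> <timeout-ms>: returns how many replicas
         * acknowledged the writes issued so far on this connection. */
        r = redisCommand(c, "WAIT 2 50");
        if (r && r->type == REDIS_REPLY_INTEGER && r->integer >= 2)
            printf("write reached both replicas\n");
        else
            printf("write not fully replicated: treat it as a failure\n");

        freeReplyObject(r);
        redisFree(c);
        return 0;
    }
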

WAIT can also be used, by improving the failover procedure, in order
to have a strongly consistent system (no writes to the old master from
the point the failure detection is positive to the end of the
failover, when the configuration is updated; or alternatively,
disconnect the majority of slaves you can reach during the failure
detection, so that every write will fail during this time).

WAIT also narrows the real-world "holes" that you face if the failure
detection is not designed to be safe.

For people it is important how systems behave in practice. /dev/null
does not offer the same consistency as asynchronous replication, for
example. Similarly users can say: I'm ok with a system that has
excellent latency and IOPS while everything is fine but is not able to
feature strong consistency; however, when shit happens, given that it
can't guarantee strong consistency, what degree of consistency will it
offer? What is the contract with the user?
I find the idea that there is either "strong consistency" or nothing
incorrect; the AP systems you cite are a perfect example of that.
Wall-clock last-write-wins is a model, but there are better, more
costly models, and so forth.

From the point of view of weak CP systems you can see this in terms of
the kind of partition you have to create for inconsistencies to
appear.
There are systems where, of all the possible partitions and failures,
only a small subset will create inconsistencies; there are other
systems that are affected by a larger subset.

To reply with a counter-example that is as pointless as your
pregnancy example: car safety is not just a matter of cars where I
can't be killed in an accident versus cars where I die.
Different cars have different safety levels, and if you want to
drive faster, you are more exposed.

> Similar to ACID properties, if you partially provide properties it means the
> user has to _still_ consider in their application that the property doesn't
> exist, because sometimes it doesn't. In you're fsync example, if fsync is

Yes, but there are applications where data loss is totally acceptable
if it is an exception that happens with a given probability and with
a bounded amount of writes lost. There are instead applications where
data loss is unacceptable, so one loss or 10 losses is the same, and
there you need a CP system.

> This conversation is about how to fix that, which I would love to see! It
> starts with a simple, but hard question to answer. What do you want a Redis
> cluster to be?

That's pretty simple: Redis Cluster can't be CP because the
performance would be unacceptable for the way most people use Redis.
However Redis could be optionally CP for some operations, and I believe
that WAIT is a start in that direction: not enough, more work is needed
in the leader switch to make the process safe. But that's optional, so
let's reason assuming no synchronous replication by default.

Redis also can't accept a distribution model where there is a need
to merge values, since values can easily be sorted sets of two billion
elements, or the like. Timestamping each element with a logical clock
is crazy, for instance, and the time needed to analyze and merge such
big values can be seriously long, with semantics that are not trivial
to predict in real use cases.

So: no synchronous replication, no merge. What guarantees should it be
able to provide? The best consistency that can be achieved while
surviving certain partitions.
By certain partitions I mean: the majority of masters, with at least
a slave for every hash slot, should be able to continue operations.

My feeling is that under the above assumptions the best model is a CP
model with a bounded maximum window in which writes can be lost. The
tradeoffs of Redis Cluster also mean that clients in the majority
partition and clients in the minority partition get different guarantees.

Salvatore

Dvir Volk

unread,
Dec 7, 2013, 6:21:17 PM12/7/13
to redi...@googlegroups.com
I don't understand what the big deal about strong consistency in redis is. People use redis for one main reason: it's super fast and you can hack your data model. As someone said before, if I wanted strong consistency I'd use ZooKeeper. Actually I'm using it, but only where 10 writes per second are acceptable.

Matt Palmer

unread,
Dec 7, 2013, 9:23:02 PM12/7/13
to redi...@googlegroups.com
On Sun, Dec 08, 2013 at 01:21:17AM +0200, Dvir Volk wrote:
> I don't understand what the big deal about strong consistency in redis is.

People don't understand tradeoffs. Short of some sort of magical
faster-than-light transmission and instantaneous-calculation-and-storage
technology, you don't get to have everything you want. That doesn't stop
people from wanting it, nor does it make them cast a cynical and critical
eye over every technology they consider to ensure it meets their needs.

That isn't helped by developers (understandably) talking up their features
and not being quite so loud about the limitations, but I know from
experience that even 72pt font in <blink> tags saying "this software doesn't
do X" won't stop people from sending you an e-mail complaining about how the
software you made freely available doesn't do X. Oooh... kinda like this
(from "14 Ways to Tick off a Writer",
http://blog.pshares.org/index.php/14-ways-to-tick-off-a-writer/):

Read ten pages of the author’s book. Realize that it’s absolutely not
for you: you thought it was a zombie story, and it’s actually historical
fiction about Alexander Graham Bell. Go on Goodreads anyway, and give it
one star for not being a zombie story.

- Matt


--
I was punching a text message into my phone yesterday and thought, "they need
to make a phone that you can just talk into."
-- Major Thomb

Kelly Sommers

unread,
Dec 8, 2013, 4:57:52 AM12/8/13
to redi...@googlegroups.com
A couple of points about #2. Firstly, there are many ways to optimize disk usage when processing transactions. Doing an fsync per operation is a naive approach that won't be very successful. Even before SSDs, if you study many databases, you'll find many optimizations in use; one example (but not the only one) is coalescing transactions. Most good databases do more transactions than the ~120 IOPS a rotational disk can offer. The key to the durability guarantee is not to acknowledge a transaction until it's written. However, if there are tens of thousands (or more) of concurrent transactions, you can commit them in a single fsync and acknowledge them all. There are also papers on how to write high-performance WALs (write-ahead logs). I won't go into extensive detail here, but studying current systems and state-of-the-art research can be helpful. You do not have to flush disk buffers per database operation; that is overkill. You can make whatever optimizations you want (lots of papers and prior-art implementations cover different approaches) as long as the transaction response does not lie and the system state is correct.

Secondly, #2 is not true. Nothing about a CP system requires disks. You can have an in-memory-only system that is a CP system. If a node has to re-sync with another for some purpose, it must not be capable of becoming the master (someone who is up to date should be the master) or of responding to read requests. This is a CAP trade-off being made. Even D (durability) in ACID isn't restricted to fsyncing to disks. I suggest reading Jim Gray's papers on the topic. Writing to a disk is a form of data replication just like writing to another node is a form of data replication. Disks die just like nodes do. Disks can write out of order and corrupt data too.
There's no such thing as "CA mode". I recommend reading this wonderful post by Henry Robinson from Cloudera (http://henryr.github.io/cap-faq/), more specifically item #10 related to "CA". I highly recommend reading the whole thing.

 

WAIT can also be used, together with an improved failover procedure, to
obtain a strongly consistent system (no writes to the old master from
the moment failure detection is positive until the end of the failover,
when the configuration is updated; or alternatively, disconnect the
majority of slaves you can reach during failure detection, so that
every write fails during this time).

WAIT also reduces the real-world "holes" that you face if the failure
detection is not designed to be safe.

What matters to people is how systems behave in practice. /dev/null
does not offer the same consistency as asynchronous replication, for example.
Similarly a user can say: I'm ok with a system that has excellent
latency and IOPS when everything is fine but cannot offer strong
consistency; however, when shit happens, given that it can't
guarantee strong consistency, what degree of consistency will it
offer? What is the contract with the user?
I find the idea that there is either "strong consistency" or nothing
incorrect; the AP systems you cite are a perfect example of that.
Wall-clock last-write-wins is one model, but there are better, more
costly models, and so forth.

Because you're trying to pretend to be a CP system (while not being one) with things like WAIT, you will have a horde of users not understanding what a failed WAIT that writes to 1 node but not 2 nodes means. The ones who do understand what this means (after some pain in production) will learn that this operation doesn't work as expected and will have to consider WAIT as having AP-like semantics. Similar to CL.ALL.

It's really important that a transaction tells the truth about what happened and that the expectations are intuitive to the users. If that is not the case then users will struggle to reason about the system, and their application code will have incorrect expectations and potentially lack compensations. This can all lead to applications causing incorrect state.
 

From the point of view of weak CP systems, you can see this in terms of
the kind of partition you have to create for inconsistencies to appear.
In some systems only a small subset of all the possible partitions and
failures will create inconsistencies; other systems are affected by a
larger subset.

To reply with a counter-example that is as pointless as your
pregnancy example: car safety is not just a matter of cars I can't be
killed in versus cars where I die.
Different cars have different safety levels, and if you want to
go faster, you are more exposed.

> Similar to ACID properties, if you partially provide properties it means the
> user has to _still_ consider in their application that the property doesn't
> exist, because sometimes it doesn't. In you're fsync example, if fsync is

Yes, but there are applications where data loss is totally acceptable,
if it is an exception that happens with a given probability and with
known bounds on the amount of writes lost.
There are other applications where data loss is unacceptable, so one
loss or 10 losses are the same, and there you need a CP system.


100% agree that for some applications it's acceptable to provide the highest performance with the risk of losing data. However Redis doesn't currently present itself as a predictable CP nor a predictable AP system. It needs to be a predictable _something_. I am not suggesting that you make Redis a CP system. There are many varying designs, not only the ones you or I mentioned so far. Regardless of available choices, you need to explicitly decide on what type of system you are building and acknowledge that in the design.
 
> This conversation is about how to fix that, which I would love to see! It
> starts with a simple, but hard question to answer. What do you want a Redis
> cluster to be?

That's pretty simple: Redis Cluster can't be CP because the
performance would be unacceptable for the way most people use Redis. However
Redis could be optionally CP for some operations, and I believe that
WAIT is a start in that direction: not enough by itself, more work is needed in
the leader switch to make the process safe. But that's optional, so
let's reason with: no synchronous replication by default.

I don't think WAIT is designed correctly, especially in the failure scenarios.
 

Redis also can't accept a distribution model that requires merging
values, since a value can easily be a sorted set with two billion
elements, or the like. Timestamping every element with a logical clock
is crazy, for instance, and the time needed to analyze and merge such
big values can be seriously long, with semantics that are not trivial
to predict in real use cases.

So: no synchronous replication, no merge. What guarantees should it be
able to provide? The best consistency that can be achieved while
surviving certain partitions.
By certain partitions I mean: the majority of masters, with at least
a slave for every hash slot, should be able to continue operations.

My feeling is that under the above assumptions the best model is a CP
model with a bounded maximum window in which writes can be lost. The
tradeoffs of Redis Cluster also mean that clients in the majority
partition and clients in the minority partition get different guarantees.
 

The same theme exists for the persistence problem that is also discussed in this thread. It's blurry what kind of database Redis is trying to become in the future. Does Redis want to cater to the needs of consistency, availability and/or durability? What problems does it want to solve moving forward? I hear suggestions of people trying to hack storage engines into 50 different places in Redis because it's not designed (and wasn't intended) to be a disk-based system. Hacking these things together isn't the right approach. If it's going to be a durable disk-based system it should be designed as one _properly_.

Both of these problems, whether you choose to support them or trade them off for other benefits, require a holistic approach with explicit trade-offs and decisions accounted for in the design and implementation. Database engineers face a ton of trade-off decisions that ultimately determine what kind of system the database presents itself as and what it's good at.

Salvatore Sanfilippo

unread,
Dec 8, 2013, 5:46:16 AM12/8/13
to Redis DB
Kelly sent me this via private email by mistake, but it was intended
to be public, so here is my reply:

On Sun, Dec 8, 2013 at 9:04 AM, Kelly Sommers <kell.s...@gmail.com> wrote:

> Couple points about #2. Firstly, there are many ways to optimize disk usage
[snip]
> commit them in a single fsync and acknowledging them all. There are also

Redis already does this when fsync = always.

> Secondly, #2 is not true. Nothing about a CP system requires disks. You can
> have in-memory only system that is a CP system. If the node has to re-sync
> with another for some purpose, it must not be capable of becoming the master
> (someone who is up to date should be the master) or responding to read
> requests. This is a CAP trade-off being made. Even D (durability) in ACID
> isn't restricted to fsyncing to disks. I suggest reading Jim Gray's papers
> on the topic. Writing to a disk is a form of data replication just like
> writing to another node is a form of data replication.

Let's forget for a moment that the latency of the acks alone is already
too much: for CP systems without disks, what kind of system model are
we assuming?

There are three processes: A, B, C. Process A replicates to B, receives
the acknowledgement, and replies OK to the client since the majority was
reached.
Process A fails; at the same time process B reboots. Process B becomes
available again after the reboot, and there is a majority (B and C) that
can continue, however the write is lost.

There is no way to reach strong consistency in a system model where
RAM is volatile and processes can restart, without using external
storage that guarantees that certain state is durable.

>> WAIT is a tool that you can use in general, exactly like CL.ALL, the
>> problem is, with every tool, that you have to understand the exact
>> semantics.
>>
>> Example: WAIT can be used in order to run a Redis stand-alone instance
>> with replicas in "CA" mode. Writes only succeed if you can replicate
>> to all the replicas (no availability at all under partitions).
>
>
> There's no such thing as "CA mode". I recommend reading this wonderful post
> by Henry Robinson from Cloudera. More specifically item #10 related to "CA".
> I highly recommend reading the whole thing.
>
> http://henryr.github.io/cap-faq/

"CA" mode is often a way to refer to systems that are not partition
tolerant but consistent.
When we talk of "CA" we are actually outside of the whole point of the
CAP theorem, so if you prefer we can call it just consistent systems
that are totally unable to handle partitions.

> Because you're trying to pretend to be a CP system (but not one) with things
> like WAIT, you will have a horde of users not understanding what a failed
> WAIT that writes to 1 node but not 2 nodes means. The ones who do understand
> what this means (after some pain in production) will learn that this
> operation doesn't work as expected and will have to consider WAIT having AP
> like semantics. Similar to CL.ALL.
>
> It's really important that a transaction tell the truth of what happened and
> that the expectations are intuitive to the users. If that is not the case
> then users will struggle reasoning about the system and their application
> code will have incorrect expectations and potentially lacking compensations.
> This can all lead to applications causing incorrect state.

Raft faces the user with the exact same tradeoff: when Raft replies
that it failed to replicate to the majority, it really means that the
result is undetermined.
The entry may or may not be applied. This is hard to avoid
without making the algorithm more complex, and somewhat
pointless to avoid, since there is always the case where the system is
not able to send the ACK to the client before a partition, so you
have to re-apply operations anyway or, if partitioned away as a
client, live with the undetermined state of your operation.

I think that Raft's semantics are good enough for most use cases: if the
reply is positive we guarantee the operation will be retained; if the
reply is negative you can't count on it.
If the write is idempotent you usually retry it, but there is always
the case of a client that is partitioned away at this point and
will live with the undetermined state for an unbounded time before the
partition heals.
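A minimal sketch of that retry-on-undetermined pattern for an
idempotent write, again with redis-py; the quorum size, timeout, retry
count and key name are assumptions made only for the example.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def quorum_set(key, value, quorum=2, timeout_ms=100, retries=3):
        # A positive result means the write reached the quorum; a negative
        # result is undetermined (the write may or may not survive a failover),
        # so for an idempotent write we simply retry a few times.
        for _ in range(retries):
            r.set(key, value)
            acked = r.execute_command("WAIT", quorum, timeout_ms)
            if acked >= quorum:
                return True
        return False  # still undetermined: the caller must live with the ambiguity

    if not quorum_set("user:42:plan", "premium"):
        print("outcome undetermined, e.g. we may be partitioned away")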

> 100% agree that for some applications it's acceptable to provide the highest
> performance with the risk of losing data. However Redis doesn't currently
> present itself as a predictable CP nor a predictable AP system. It needs to
> be a predictable _something_. I am not suggesting that you make Redis a CP
> system. There are many varying designs, not only the ones you or I mentioned
> so far. Regardless of available choices, you need to explicitly decide on
> what type of system you are building and acknowledge that in the design.

I think that this is actually quite clear in the design even if not
stated in CAP terms.
The consistency, since it is not strong, can be considered eventual,
because when the partition heals there actually is agreement about the
state.
However this agreement only selects a single timeline among all the
possible ones, which means that it is possible to lose data, like in
last-write-wins AP systems.
However some care in the orchestration of the distributed system tries
to reduce the window in which data can be lost to a minimum.

That is the default. With WAIT, so far, we only shrink that window: out
of all the partitions and failures the system can face, fewer of them
lead to lost writes.
With future work on the failover process of the cluster it will
probably be possible to also ensure strong consistency.

> I don't think WAIT is designed correctly, especially in the failure
> scenarios.

Then so is Raft, if you think so. I don't believe this to be a valid point.

> That being said, the success of Redis has come from the in-memory
> performance that Redis provides. I think it's logical to continue on that
> path. Do I think Redis could be a high performing disk-based system with the
> right engineering? Yes. Will it be slower than a purely in-memory based
> solution? Of course. But like the CAP trade-offs, you can't have the _best
> of both_ worlds.

Currently I have no interest in making Redis on-disk, as I have said multiple times.
Software is not just engineering, but also culture. I don't like an
on-disk Redis as a system; I want to push the performance side to an
extreme, not a compromise, so memory only.
Maybe in the future, with a pluggable storage engine...

> I suggest doing more research because there are a lot more options than the
> ones you are suggesting so far in this thread. Database architecture and
> distributed systems are both deep topics. Sometimes I wish they weren't so
> that it is easier to cover but I guess that's what makes them interesting :)

I'll surely do my research, but I'm not a Right Thing person. What I
mean is that I'll try to provide what I can with the best of my
capabilities now, making clear what the tradeoffs are.
This is how it has always worked with Redis. As my vision improves, I
try to transfer it into the code.

I'm open here to improvements to the Redis Cluster design (or
whatever else) that allow it to retain the same goals with improved
consistency.
Lacking suggestions, I'll try to do my best to improve it over time, as
long as people use it and I enjoy working on it, at least.

Cheers,
Salvatore

Dvir Volk

unread,
Dec 8, 2013, 6:47:15 AM12/8/13
to redi...@googlegroups.com



On Sun, Dec 8, 2013 at 12:46 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
Kelly sent me this via private email by mistake, but it was intended
to be public, so here is my reply:

On Sun, Dec 8, 2013 at 9:04 AM, Kelly Sommers <kell.s...@gmail.com> wrote:

> Couple points about #2.  Firstly, there are many ways to optimize disk usage
[snip]
> commit them in a single fsync and acknowledging them all. There are also

Redis already does this when fsync = always.

How does that work in a single-threaded model? Do you mean an entire transaction request?

Pierre Chapuis

unread,
Dec 8, 2013, 7:24:46 AM12/8/13
to redi...@googlegroups.com
Just a few points on this whole CA / CP issue.

First, about this example:


There are three processes: A, B, C. Process A replicates to B, receives
the acknowledgement, and replies OK to the client since the majority was
reached.

Process A fails; at the same time process B reboots. Process B becomes
available again after the reboot, and there is a majority (B and C) that
can continue, however the write is lost.

There are at least two ways to solve that. The first one is to assume
fail-stop: process B cannot reboot by itself. And if both A and B die,
then it's the "uh oh I have lost over half of my cluster" issue that
will always exist anyway. Even with disks, if two machines out of
three are obliterated by a nuclear strike, you can lose data...

Another solution is to use a write quorum of 3 (every replica must
receive all writes). I think that is actually what the original Dynamo
paper was doing.

Also, on CAP:

There is no useful (*) distributed CA system. CA means partitions
cannot happen, which means a single node system. But then can we
really say it is highly available?

(*) /dev/null is CAP, hence the "useful" qualifier.

It *is* possible to make a non-integral trade off between C and A,
but then you stop calling them Consistency and Availability and say
Harvest and Yield instead:
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

I don't think those ideas are really useful for a datastore though
(but they are for a search engine for instance).

As for Redis Cluster... Kelly is completely right in that the problem
is to define what Redis is. If I had to name *one* property of Redis
that makes people use it, it would be performance (low latency (*),
high throughput), for both reads and writes. This is basically the
reason why it is an in-memory system.

(*) If you use it correctly, i.e. with mostly O(1) operations,
Tony Arcieri would say :)

I don't think it will be possible to keep these properties with a
CP system. Inter-node network latencies will be deadly. So I
just don't think it makes sense to try to make Redis cluster CP.

Dvir Volk

unread,
Dec 8, 2013, 7:47:54 AM12/8/13
to redi...@googlegroups.com

As for Redis Cluster... Kelly is completely right in that the problem
is to define what Redis is. If I had to name *one* property of Redis
that makes people use it, it would be performance (low latency (*),
high throughput), for both reads and writes. This is basically the
reason why it is an in-memory system.


Adding to that, a bit off topic in this sense, but on topic in the broader sense of "what redis is":
We also need to keep in mind that redis cluster removes a few (to me) useful aggregate commands like ZINTERSTORE, SINTER, SUNION, etc.
If we want them in a cluster, an access node has to be added, adding latency and consistency problems of its own, but that's another story.
So right out of the gate redis cluster will not be for everyone. Having said that, I don't have any statistical data about the actual use cases of redis out in the world; I suspect most people won't care about them.

But it seems the whole cluster focus keeps redis in a bit of a split-brain state (pun intended :) ), where it is two (somewhat) different beasts depending on whether you're using it as a classic master/slave system (doing your own sharding if needed) or using it in cluster mode.

I'm not saying redis should take one way or the other, though, just pointing it out. I don't think I'll be using redis cluster any time soon.

Salvatore Sanfilippo

unread,
Dec 8, 2013, 8:25:52 AM12/8/13
to Redis DB
On Sun, Dec 8, 2013 at 1:24 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:

> There are at least two ways to solve that. The first one is to assume
> fail-stop: process B cannot reboot by itself. And if both A and B die,
> then it's the "uh oh I have lost over half of my cluster" issue that
> will always exist anyway. Even with disks, if two machines out of
> three are obliterated by a nuclear strike, you can lose data...

I'm assuming crash-recovery, which is a lot closer to reality.

In the crash-recovery model, without persistent state, even when the
majority is available again, and even if the write was acknowledged,
the write is lost, so you can't achieve strong consistency.

> Another solution is to use a write quorum of 3 (every replica must
> receive all writes). I think that is actually what the original Dynamo
> paper was doing.

The number of replicas has nothing to do with that, because you can
assume mass-reboot in the crash-recovery model.
Otherwise you have to make your system model more forgiving and assume
that only a minority of processes fail and restart.

Btw Redis Cluster assumes crash-recovery in the failover procedure
right now, so the new state is synced to disk before replying to the
other nodes, for everything related to the hash slot configuration,
epochs, and so forth.

> There is no useful (*) distributed CA system. CA means partitions
> cannot happen, which means a single node system. But then can we
> really say it is highly available?

This is why I use "CA" to mean systems that provide no availability
during partitions, but that are able to stop working instead of providing
wrong results when partitions happen.
It is just a way to name things; if "CA" is not the best name, we can
use another one, but I find it helpful to say "CA" as a common name for
those kinds of systems.

> As for Redis Cluster... Kelly is completely right in that the problem
> is to define what Redis is. If I had to name *one* property of Redis
> that makes people use it, it would be performance (low latency (*),
> high throughput), for both reads and writes. This is basically the
> reason why it is an in-memory system.
>
> (*) If you use it correctly, i.e. with mostly O(1) operations,
> Tony Arcieri would say :)

Honestly the O(1) thing is a misconception. Redis provides excellent
performance and latency for:

1) O(1) operations.
2) Logarithmic operations (most basic sorted set operations, including ZRANK).
3) O(N) seek + O(M) work (for example LTRIM), every time you can make
sure to keep M small. Example: the capped collection pattern.
4) O(log N) seek + O(M) work (for example removing ranges of elements
from a sorted set).

The Redis single-threaded model of course results in big latency for O(N)
operations with a large "N".
The most important cases where this was a big problem were KEYS * or
SMEMBERS and similar commands. Now there is a solution with SCAN & co.

The slow O(N) operations are conceived for two use cases:
1) When called against collections that are small enough that the
latency is acceptable, considering the fact that constant times are
very small.
2) When called in a context where Redis is used as a computational
node in an asynchronous way.
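As an illustration of case 3 above, here is a minimal sketch of the
capped collection pattern with redis-py; the key name and cap size are
arbitrary choices made only for the example.

    import redis

    r = redis.Redis()
    CAP = 1000  # keep only the most recent 1000 entries

    def log_event(event):
        # LPUSH is O(1); LTRIM then drops everything past the cap, and since
        # at most one extra element is added per call, M stays small.
        pipe = r.pipeline()
        pipe.lpush("events:recent", event)
        pipe.ltrim("events:recent", 0, CAP - 1)
        pipe.execute()

    log_event("user 42 logged in")
    print(r.lrange("events:recent", 0, 9))  # the ten most recent events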

> I don't think it will be possible to keep these properties with a
> CP system. Inter-node network latencies will be deadly. So I
> just don't think it makes sense to try to make Redis cluster CP.

I agree: as already stated, the default operations can't be CP. However
I would be enthusiastic about an optional CP mode based on WAIT that
is able to do its work without affecting the other clients, and I
think this is possible to achieve in accordance with the other goals
exposed.

Salvatore

Salvatore Sanfilippo

unread,
Dec 8, 2013, 8:31:54 AM12/8/13
to Redis DB
On Sun, Dec 8, 2013 at 12:47 PM, Dvir Volk <dvi...@gmail.com> wrote:
> how does that work in a single threaded model? you mean an entire
> transaction request?

This is extremely easy to provide in an event-driven programming model.

The Redis event loop is designed to guarantee that if a client was
processed for the "readable" event, it is not processed for the
"writable" event in the same loop iteration.
So this is what happens:

1) Fsync is set to always.
2) In every event loop cycle, multiple clients try to write to the database.
3) When we write, instead of calling fsync, we just set a flag.
4) At the end of the event loop cycle we are sure that no reply has been
delivered yet to the clients that performed a write, because no writable
event was fired for any of the clients we processed a write for.
5) Redis ae.c has an event loop hook called "before sleep" that is
invoked before re-entering the event loop for the next cycle.
6) If we find that we need to fsync, we do it there. This way we group all
the clients that wrote in a given event loop cycle into a single
fsync.

No added latency, and under load a huge decrease in the number of
fsyncs performed.
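A schematic sketch of this group-fsync idea in Python; this is not the
actual ae.c code, and the file name, the client objects and their
send() method are simplified assumptions for illustration only.

    import os

    class ToyEventLoop:
        # Toy single-threaded loop: every write performed during one cycle
        # shares a single fsync, and no reply is released before that fsync.

        def __init__(self, aof_path="appendonly.aof"):
            self.aof = open(aof_path, "ab")
            self.fsync_needed = False
            self.pending_replies = []  # (client, reply) pairs held back

        def handle_write_command(self, client, payload):
            self.aof.write(payload)    # append to the AOF buffer
            self.fsync_needed = True   # just set a flag, do not fsync here
            self.pending_replies.append((client, b"+OK\r\n"))

        def before_sleep(self):
            # Invoked once per cycle, before waiting for new events:
            # one fsync covers every write processed in this cycle.
            if self.fsync_needed:
                self.aof.flush()
                os.fsync(self.aof.fileno())
                self.fsync_needed = False
            for client, reply in self.pending_replies:
                client.send(reply)     # replies are released only now
            self.pending_replies.clear()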

Redis Cluster uses the same trick in order to rewrite the nodes.conf
file a single time before replying to the other nodes.

Pierre Chapuis

unread,
Dec 8, 2013, 9:03:10 AM12/8/13
to redi...@googlegroups.com
On Sunday, December 8, 2013 2:25:52 PM UTC+1, Salvatore Sanfilippo wrote:
> There is no useful (*) distributed CA system. CA means partitions
> cannot happen, which means a single node system. But then can we
> really say it is highly available?

This is why I use "CA" to mean systems that provide no availability
during partitions, but that are able to stop working instead of providing
wrong results when partitions happen.
It is just a way to name things; if "CA" is not the best name, we can
use another one, but I find it helpful to say "CA" as a common name for
those kinds of systems.

This looks like the definition of CP to me. It will prefer to stop
working instead of compromising consistency for the sake of
availability.
 

Yes, completely true. O(1) is an over-simplification. The point is that
it is low-latency if you take care not to run a command that could
take forever. But basic operations are faster than e.g. a SQL
database (if nobody else is blocking the server).
 
> I don't think it will be possible to keep these properties with a
> CP system. Inter-node network latencies will be deadly. So I
> just don't think it makes sense to try to make Redis cluster CP.

I agree, as already stated the default operations can't be CP, however
I would be enthusiast to have an optional CP mode based on WAIT that
is able to make its work without affecting the other clients, and I
think this is possible to achieve in accordance with the other goals
exposed.

 Why not... I think Brewer said explicitly in a paper (which I cannot find
right now) that CAP was to be understood for some data at some point
in time. So you can have a system that is CP for some data and CA for
other data, and you can have a system that switches between CA and
CP modes. But I think that this "CP" mode should be seen like a bonus,
and not hinder the "natural" distributed mode for Redis which is AP.

Pierre Chapuis

unread,
Dec 8, 2013, 9:04:24 AM12/8/13
to redi...@googlegroups.com
On Sunday, December 8, 2013 3:03:10 PM UTC+1, Pierre Chapuis wrote:
 Why not... I think Brewer said explicitly in a paper (which I cannot find
right now) that CAP was to be understood for some data at some point
in time. So you can have a system that is CP for some data and CA for
other data, and you can have a system that switches between CA and
CP modes. But I think that this "CP" mode should be seen like a bonus,
and not hinder the "natural" distributed mode for Redis which is AP.

Of course you should read AP instead of CA in that paragraph...

Kelly Sommers

unread,
Dec 8, 2013, 12:47:14 PM12/8/13
to redi...@googlegroups.com


On Sunday, December 8, 2013 9:03:10 AM UTC-5, Pierre Chapuis wrote:
On Sunday, December 8, 2013 2:25:52 PM UTC+1, Salvatore Sanfilippo wrote:
> There is no useful (*) distributed CA system. CA means partitions
> cannot happen, which means a single node system. But then can we
> really say it is highly available?

This is why I use "CA" to mean systems that provide no availability
during partitions, but that are able to stop working instead of providing
wrong results when partitions happen.
It is just a way to name things; if "CA" is not the best name, we can
use another one, but I find it helpful to say "CA" as a common name for
those kinds of systems.

This looks like the definition of CP to me. It will prefer to stop
working instead of compromising consistency for the sake of
availability.

Bingo :) One thing I realized when I started building 1,000+ node production clusters is that "partitions" happen all the time, and many times they have nothing to do with networking equipment. The experience I gained with these systems absolutely opened my eyes to what the papers, and others who build even bigger systems than I do, have been saying for a long time. Once it hits you smack in the face it's hard not to learn it.
 
 
> As for Redis Cluster... Kelly is completely right in that the problem
> is to define what Redis is. If I had to name *one* property of Redis
> that makes people use it, it would be performance (low latency (*),
> high throughput), for both reads and writes. This is basically the
> reason why it is an in-memory system. 

Agreed. To do so you're going to want a Redis cluster to scale in a predictable fashion. This means that when I add N nodes I have a general idea of how much capacity that is adding to my cluster. Creating a system where all nodes talk to each other on every single write (or coalesced batch) isn't going to scale well. This directs us towards bounding the number of nodes included in a transaction and partitioning the data so that the cost is somewhat constant.

I've seen some people trying to scale Redis too far, and while I would love a better story for larger scale, I think creating a system that does well with high performance and small node sizes and behaves predictably is a better goal for the short term. With that experience in the project, the knowledge for taking it to the next scale increases. It's important this gets communicated properly so that people don't misuse it, though.
I'm going to guess you meant "AP" and not "CA" since you correctly identified that above :)

I agree that CP for some data and AP for other data can definitely be advantageous if the user is able to understand the nuances and the requirements from their data properly. This sets clear expectations of what the system does with a piece of data even though both may behave differently.

However! Switching between CP and AP for the same data means you are basically an AP system. From the perspective of the actors and observers of the system, they can't trust the system to ever be correct, so they must assume that AP mode happens anyway. A CP system means that actors and observers have a set of guarantees. If that can be traded off, then the application must account for this trade-off. Even more problematic, if this toggle is done with a command like WAIT, a misbehaving application can cause incorrect state for well-behaved applications. We must consider the serializability implications when CP can be circumvented.

Again, I'm not promoting CP systems here, I'm just trying to clarify the implications of what these suggestions mean because I don't think they are thought out. My 1,000 node clusters are AP systems, I try to make the trade-offs where they make sense :)
 

Pierre Chapuis

unread,
Dec 8, 2013, 1:46:33 PM12/8/13
to redi...@googlegroups.com
On Sunday, December 8, 2013 6:47:14 PM UTC+1, Kelly Sommers wrote:

However! Switching between CP and AP for the same data means you are basically an AP system. From the perspective of the actors and observers of the system, they can't trust the system to ever be correct so they must consider that AP mode happens anyways. A CP system means that actors and observers have a set of guarantees. If that can be traded-off then the application must account for this trade-off. Even more problematic, if this toggle is done with a command like WAIT, a misbehaving application can cause incorrect state to well behaved applications. We must consider the serializability implications when CP can be circumvented.

I am not sure how misbehaving applications should be taken into account.
After all a misbehaving application could also decrement your counter
when it shouldn't, canceling your increment somehow...

But I agree that we should not make it extremely hard for users (i.e.
application developers) to understand the guarantees provided by the
system. I admit I do not understand them well myself.

Antirez cites Raft as an example, but Raft is all about leader election.
In Redis Cluster the guarantees that Raft offers are apparently not
there, and the WAIT command cannot provide them anyway.

For instance, imagine you have 5 replicas (A to E). You would think
that by using WAIT 3 you would be safe. But if you perform three
operations (1 to 3) on three different clients, acknowledged
respectively by nodes (A, B, C), (C, D, E) and (B, D, E), then
 you *do not* have the certainty that any node has seen all three
operations:

A -> 1
B -> 1, 3
C -> 1, 2
D -> 2, 3
E -> 2, 3

If the master fails then, how do you pick the new master?

It looks to me as if only WAIT N, where N is the *total* number
of replicas (here 5), could offer real guarantees in the event of
master failure. And even then, I am not sure it would be enough.

I may be wrong though, because I don't understand the Cluster
replication algorithm. Maybe if Antirez could publish an explanation
of how it works and the assumptions it makes (comparable to the
Raft paper and associated lecture slides) it would answer a lot of
the questions people have. But I can understand this would be a
*lot* of work...

Pierre Chapuis

unread,
Dec 8, 2013, 1:58:12 PM12/8/13
to redi...@googlegroups.com
On Sunday, December 8, 2013 7:46:33 PM UTC+1, Pierre Chapuis wrote:
I may be wrong though, because I don't understand the Cluster
replication algorithm. Maybe if Antirez could publish an explanation
of how it works and the assumptions it makes (comparable to the
Raft paper and associated lecture slides) it would answer a lot of
the questions people have. But I can understand this would be a
*lot* of work...

TBH there is http://redis.io/topics/cluster-spec which I should read
more attentively :) It is not as clear as what exists for Raft, but it
could contain the answer to my question, which is: what exactly does WAIT
guarantee? Is there any way a write followed by a successful WAIT 3
in a cluster of 5 could not be acknowledged by the next master?

Mark Papadakis

unread,
Dec 8, 2013, 3:49:55 PM12/8/13
to redi...@googlegroups.com
Group commit does wonders for write throughput. It's somewhat non-trivial to get right in a multi-threaded environment. The MariaDB devs have a nice writeup that describes how it works in their implementation and what kind of performance and efficiency it provides. You may want to google for that.

Salvatore Sanfilippo

unread,
Dec 8, 2013, 5:15:35 PM12/8/13
to Redis DB
On Sun, Dec 8, 2013 at 7:46 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:
> Antirez cites Raft as an example, but Raft is all about leader election.
> In Redis Cluster the guarantees that Raft offers are apparently not
> there, and the WAIT command cannot provide them anyway.

I cited Raft as an example of CP system with false negatives. It
ensures that a positive reply means the entry will be applied to the
state machine, but it does not offer guarantees of the opposite when a
negative reply is provided to the client. This totally makes sense
IMHO for a number of reasons in the case of Raft.

For Redis Cluster + WAIT to be consistent, you have to improve the
failover process in at least two ways. You may already know that in
Redis Cluster the failover is performed by the slaves.
It requires two basic things: the slave that wins the election should
only fail over if it can get an agreement from N/2 other slaves (so,
with itself included, there is a majority), and the acknowledgement,
from the point of view of the slaves, means "I'll stop acknowledging
writes to the master (but I'll continue to process the replication
stream) until a new version of the configuration is available for the
slots I'm replicating".

This means that no write using WAIT with the required N can be
acknowledged during the failover process, and that we are guaranteed to
select the next master from the majority of slaves, so we are sure that
at least one slave must have the last write if we select the one with
the greatest replication offset. The exact same thing could be
implemented in Redis Sentinel as well.

I don't think something like that will be available in the first
version of Redis Cluster, so WAIT in the short term will only have the
effect of improving the consistency guarantees provided by Redis
Cluster, but without providing strong consistency.

When we mix the Redis Cluster partitioning scheme with CP, what happens
is that every set of replicas serving a given hash slot is actually a CP
system per se, so this is the set of nodes where you need a majority.
To avoid that during normal operations, Redis Cluster allows you to have
just one replica and still perform the failover: the trick is to use
the full set of available masters as the majority that versions new
configurations.

Long story short, if the design directly targeted only a CP
system, the failover could be handled *just* in terms of a given set of
replicas, instead of involving the whole cluster.
It is as if you took a CP system based on Raft or Paxos or whatever,
ran N systems like that, and partitioned your keys into ranges across
the N systems.

Cheers,
Salvatore

Kelly Sommers

unread,
Dec 8, 2013, 6:30:52 PM12/8/13
to redi...@googlegroups.com


On Sunday, December 8, 2013 5:15:35 PM UTC-5, Salvatore Sanfilippo wrote:
On Sun, Dec 8, 2013 at 7:46 PM, Pierre Chapuis
<catwell...@catwell.info> wrote:
> Antirez cites Raft as an example, but Raft is all about leader election.
> In Redis Cluster the guarantees that Raft offers are apparently not
> there, and the WAIT command cannot provide them anyway.

I cited Raft as an example of CP system with false negatives. It
ensures that a positive reply means the entry will be applied to the
state machine, but it does not offer guarantees of the opposite when a
negative reply is provided to the client. This totally makes sense
IMHO for a number of reasons in the case of Raft.

What you've discovered here isn't specific to Raft or Redis. Almost all transaction approaches suffer from this problem. It can be described simply as: The client is part of the distributed system. 

This is a point that gets lost often. As one example, a two-phase commit transaction has the same problem. What happens if the TCP socket between the transaction coordinator and the client disconnects while the server was sending the success acknowledgement? As far as the client is concerned the transaction failed, but it could have succeeded. This problem exists because the client isn't considered part of the transaction scope but it definitely can be and there are transaction models where the client is included. As you can imagine, this comes at a cost. 

Salvatore Sanfilippo

unread,
Dec 9, 2013, 3:25:31 AM12/9/13
to Redis DB
On Mon, Dec 9, 2013 at 12:30 AM, Kelly Sommers <kell.s...@gmail.com> wrote:
> What you've discovered here isn't specific to Raft or Redis. Almost all
> transaction approaches suffer from this problem. It can be described simply
> as: The client is part of the distributed system.

Yes this is obvious, but from what you said about WAIT I believed it
was useful to make it clear:

"> Because you're trying to pretend to be a CP system (but not one) with things
> like WAIT, you will have a horde of users not understanding what a failed
> WAIT that writes to 1 node but not 2 nodes means."

So assuming the failover is safe, the semantics of 1 node vs 2 are
handled as with other systems: retrying most of the time, or dealing
with the indetermination.
Note that I assumed that your sentence above had nothing to do with the
failover properties, because with a broken failover the WAIT semantics
are not CP *even* if it returns N-1, with N being the total number of
replicas.

Btw the client being part of the distributed system is a different
problem in the case of Raft; this is why I mentioned Raft.
Raft can give you a *false negative* even if there is no partition
between the client and the leader; that's the point.

However the rationale for allowing this behavior is that, since this
partition between the client and the leader can happen anyway, you have
to handle it one way or the other.
The same applies to WAIT.

javier ramirez

unread,
Dec 9, 2013, 6:23:41 AM12/9/13
to redi...@googlegroups.com
On 07/12/13 01:00, Alberto Gimeno wrote:
What about using an already working disk key-value store like leveldb, rocksdb (http://rocksdb.org), lmdb (like nds does https://github.com/mpalmer/redis/tree/nds-2.6/deps/liblmdb ), etc.?

FWIW, I attended a talk by basho this past week and they were talking about the upcoming features of riak. One of the new features is data types similar to those in redis (lists, hashes, sets...) but running on riak, so with replication and persistence baked in. This piqued my curiosity, so I went to talk to the basho people after the talk to see what can be done and how it was implemented.

The relevant part for the discussion here is that if you want to use the type system you need to choose the riak LevelDB store, so it would seem possible to implement types on top of something derived from leveldb. The thing is, on riak you don't have both paradigms at the same time: either you are using the memory store, or you are using the type system, which uses LevelDB.

After talking to their engineers, I decided the right tool for us right now is still redis. I like redis very much the way it is, and in my opinion the minimalistic approach is one of redis' killer features. I for one prefer to see the future of redis as the best in-memory data store (which I think it is right now) rather than trying to cover several areas and not being the best in any of them.

Cheers,

j

Salvatore Sanfilippo

unread,
Dec 9, 2013, 6:27:37 AM12/9/13
to Redis DB
Hello Javier,

I believe that storing data structures in a btree where the first
"level" of the btree is the key is easy to achieve if you have small
data structures. I think AP stores like Riak are adding support for
some data structures, but those are not conceived to be used the way
you use them in Redis, that is, with millions of elements in a single
data structure.

Redis-on-disk with values bound to a given size is easy to accomplish.
The "diskstore" branch was a good approximation: with more work on that
we would be there. The problem is that my feelings about a "capped
Redis" on disk are not great.

So what you could do is use a file for every b-tree. This works as
long as you don't have many millions of keys; otherwise you have to
check whether the filesystem is designed to cope with that.

Salvatore



Dvir Volk

unread,
Dec 9, 2013, 7:49:49 AM12/9/13
to redi...@googlegroups.com
Just as a reference, CQL on top of Cassandra adds container columns using wide rows.

Meaning, if you have a dictionary field, what's really written is a column with the name "mydict:key" and the value, and in the same way they have lists and sets. This allows atomic persistence operations on a small part of the value.

So this can be done, at a price. But of course Cassandra's model is different, and I don't know the details of how it touches the filesystem in this manner. Plus I'm not sure how it would scale to millions of keys with millions of sub-keys.

Aphyr Null

unread,
Dec 9, 2013, 12:40:01 PM12/9/13
to redi...@googlegroups.com
> Example: WAIT can be used in order to run a Redis stand-alone instance
> with replicas in "CA" mode. Writes only succeed if you can replicate
> to all the replicas (no availability at all under partitions).

Please note that WAIT provides, in the context of the CAP theorem, exactly zero of consistency, availability, and partition tolerance. Labeling it CA or "relaxed CP" is misleading at best and dangerous at worst.

> Because you're trying to pretend to be a CP system (but not one) with things like WAIT, you will
> have a horde of users not understanding what a failed WAIT that writes to 1 node but not 2 nodes

> means. The ones who do understand what this means (after some pain in production) will learn
> that this operation doesn't work as expected and will have to consider WAIT having AP like
> semantics. Similar to CL.ALL.

Precisely. WAIT is *not* a consensus algorithm and it *can not* provide serializable semantics without implementing some kind of coherent transactional rollback.

> "CA" mode is often a way to refer to systems that are not partition
> tolerant but consistent.

I have yet to encounter any system labeled "CA" which actually provided CA. This should not be surprising because CA has been shown to be impossible in real-world networks. Please read http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf.

> Raft faces the user with the same exact tradeoff, when Raft replies
> that it failed to replicate to the majority, it really means that the
> result is undetermined.

You have not implemented or described RAFT's semantics in Redis, and failing to understand how Redis WAIT differs from RAFT, VR, multipaxos, etc is a dangerous mistake. Consensus protocols are subtle and extremely difficult to design correctly. Please consider writing a formal model and showing verification by a model checker, if not a proof.

> I think that Raft semantics is good enough for most use cases

Please don't claim these are equivalent designs. In particular, the RAFT inductive consistency constraint is not present in the current or proposed WAIT/failover design. Without a similar constraint you will not be able to provide linearizability.

> I'll surely do my research, but I'm not a Right Thing person. What I
> mean is that I'll try to provide what I can provide with my best of my
> capabilities now, making clear what are the tradeoffs.

Please consider choosing a proven consistency model and implementing it, instead of rolling your own. Alternatively, consider documenting that Redis can easily lose your data. I see an awful lot of people treating it as a system of record rather than a cache.

Salvatore Sanfilippo

unread,
Dec 9, 2013, 3:06:11 PM12/9/13
to Redis DB
On Mon, Dec 9, 2013 at 6:40 PM, Aphyr Null <aphyr...@gmail.com> wrote:
>> Example: WAIT can be used in order to run a Redis stand-alone instance
>> with replicas in "CA" mode. Writes only succeed if you can replicate
>> to all the replicas (no availability at all under partitions).
>
> Please note that WAIT provides, in the context of the CAP theorem, exactly zero of consistency, availability, and partition tolerance. Labeling it CA or "relaxed CP" is misleading at best and dangerous at worst.

I used "CA" and "relaxed CP" in totally different contexts. Btw if we
don't like those names, we can just talk about concepts.

In the above sentence what I mean is that WAIT is a tool that you can
use to get real guarantees in real contexts, like the one described
above, that is, a master + N salves setup.
Operations acknowledged by WAIT N make the user aware that the write
is accepted by all the replicas. When master fails, if there is manual
failover in which the master is taken down, and a random replica
restarted as master, it guarantees you the obvious property that all
the writes for which you received a positive acknowledge, are retained
by the system.

>> Because you're trying to pretend to be a CP system (but not one) with things like WAIT, you will
>> have a horde of users not understanding what a failed WAIT that writes to 1 node but not 2 nodes
>> means. The ones who do understand what this means (after some pain in production) will learn
>> that this operation doesn't work as expected and will have to consider WAIT having AP like
>> semantics. Similar to CL.ALL.
>
> Precisely. WAIT is *not* a consensus algorithm and it *can not* provide serializable semantics without implementing some kind of coherent transactional rollback.

Nobody claimed that, but WAIT can be used as one of the building
blocks to build a system featuring strong consistency (a good start
could be a failover process that does not accept writes during the
failover and is guaranteed to elect the slave with the highest
replication offset).

What I claim here is that the point is not transactional rollbacks, see later.

>> Raft faces the user with the same exact tradeoff, when Raft replies
>> that it failed to replicate to the majority, it really means that the
>> result is undetermined.
>
> You have not implemented or described RAFT's semantics in Redis, and failing to understand how Redis WAIT differs from RAFT, VR, multipaxos, etc is a dangerous mistake. Consensus protocols are subtle and extremely difficult to design correctly. Please consider writing a formal model and showing verification by a model checker, if not a proof.

The above sentence was about a specific issue: false negatives.
Apparently you also agree that without transactional rollbacks you
can't build a CP system.
How is WAIT returning a non-majority different from a Raft false
negative from the point of view of transactional rollbacks? If you get
a positive reply, the write is accepted;
if you get a negative reply, you don't know and can retry.

>> I think that Raft semantics is good enough for most use cases
>
> Please don't claim these are equivalent designs. In particular, the RAFT inductive consistency constraint is not present in the current or proposed WAIT/failover design. Without a similar constraint you will not be able to provide linearizability.

This was in the above context, false negatives.

Salvatore

Yiftach Shoolman

unread,
Dec 9, 2013, 4:20:41 PM12/9/13
to redi...@googlegroups.com
In 2002, when the first CAP theorem paper was published, 40msec and even 500msec of database latency was acceptable to 99% of the apps on earth.

I'm talking on a daily basis with companies who have decided to migrate from DynamoDB (the holy grail of AP systems) to Redis because with 10-20msec average latency (40msec at the 95th percentile) their application just cannot work! - and this when it runs on the strongest dedicated EC2 instances with ultra-fast SSDs (not available to the public).

Redis should neither be built to serve a 1,000-node cluster scenario nor to comply with all the corners of a CP system.

IMO "relaxed CP" is when the probability of reaching those corners is smaller than the probability of a major infrastructure failure.








--

Yiftach Shoolman
+972-54-7634621

Matt Palmer

unread,
Dec 9, 2013, 4:46:07 PM12/9/13
to redi...@googlegroups.com
On Mon, Dec 09, 2013 at 11:23:41AM +0000, javier ramirez wrote:
> On 07/12/13 01:00, Alberto Gimeno wrote:
> >What about using an already working disk key-value store like
> >leveldb, rocksdb (http://rocksdb.org), lmdb (like nds does
> >https://github.com/mpalmer/redis/tree/nds-2.6/deps/liblmdb ),
> >etc.?
>
> FWIW, I attended a talk by basho the past week and they were talking
> about the upcoming features of riak. One of the new features are
> data types in a similar way to redis (lists, hashes, sets...) but
> running on riak, so with replication and persistence baked in. This
> piqued my curiosity, so I went to talk to the basho people after the
> talk, to see what can be done and how it was implemented.

Good to see others are seeing the value in a data structures server. I can
definitely see the value in being able to operate on more complicated data
structures inside the Riak paradigm, although it's going to start getting
awfully tricky if you still want to use a conflict resolution algorithm more
complicated than LWW.

Redis still has a place, though, in the high-throughput / low latency arena.
We just trialled using Riak to store some metrics data for M/R querying, and
try as we might, we couldn't get it to keep up with the write rate for even
*one* of the incoming data streams, let alone the full set we wanted to
point at it.

- Matt

--
School never taught ME anything at all, except that there are even
more morons out there than I would have dreamed, and many of them like
to beat up people smaller than they are.
-- SeaWasp in RASFW

Kelly Sommers

unread,
Dec 9, 2013, 4:53:37 PM12/9/13
to redi...@googlegroups.com, mpa...@hezmatt.org


On Monday, December 9, 2013 4:46:07 PM UTC-5, Matt Palmer wrote:
On Mon, Dec 09, 2013 at 11:23:41AM +0000, javier ramirez wrote:
> On 07/12/13 01:00, Alberto Gimeno wrote:
> >What about using an already working disk key-value store like
> >leveldb, rocksdb (http://rocksdb.org), lmdb (like nds does
> >https://github.com/mpalmer/redis/tree/nds-2.6/deps/liblmdb ),
> >etc.?
>
> FWIW, I attended a talk by basho the past week and they were talking
> about the upcoming features of riak. One of the new features are
> data types in a similar way to redis (lists, hashes, sets...) but
> running on riak, so with replication and persistence baked in. This
> piqued my curiosity, so I went to talk to the basho people after the
> talk, to see what can be done and how it was implemented.

Good to see others are seeing the value in a data structures server.  I can
definitely see the value in being able to operate on more complicated data
structures inside the Riak paradigm, although it's going to start getting
awfully tricky if you still want to use a conflict resolution algorithm more
complicated than LWW.

This is why I keep saying the holistic design around the trade-offs made from top to bottom is important. Some of these data structures are possible in Riak because of the CRDT research and Riak's implementation of this research.

See "A comprehensive study of Convergent and Commutative Replicated Data Types" (Shapiro, Preguiça, Baquero, and Zawirski, INRIA research report, 2011).
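
To make the CRDT point concrete, here is a minimal sketch in Python (class and method names are invented for illustration) of a grow-only counter, one of the simplest data types from that paper: each replica increments only its own slot, and merging two copies is an element-wise max, so concurrently updated replicas always converge without coordination.

    class GCounter:
        # Grow-only counter CRDT: one slot per replica id.
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.slots = {}

        def incr(self, n=1):
            # A replica only ever increments its own slot.
            self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

        def value(self):
            return sum(self.slots.values())

        def merge(self, other):
            # Element-wise max: commutative, associative and idempotent,
            # so replicas can exchange state in any order and converge.
            for rid, count in other.slots.items():
                self.slots[rid] = max(self.slots.get(rid, 0), count)

Operations like a Redis list push have no such predictable merge function, which is why they need a single writer to stay meaningful.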

Aphyr Null

unread,
Dec 9, 2013, 4:54:47 PM12/9/13
to redi...@googlegroups.com
> When master fails, if there is manual 
> failover in which the master is taken down, and a random replica 
> restarted as master, it guarantees you the obvious property that all 
> the writes for which you received a positive acknowledge, are retained 
> by the system.

I'm not sure how to state this any clearer. The problem is not false negatives. The problem is a lack of an inductive constraint on leader election+log replication. If you really want to insist on claiming WAIT prevents false positive acks, I'd be happy to attempt an existence proof of this problem in the next installation of Jepsen.

Salvatore Sanfilippo

unread,
Dec 9, 2013, 6:02:41 PM12/9/13
to Redis DB
On Mon, Dec 9, 2013 at 10:54 PM, Aphyr Null <aphyr...@gmail.com> wrote:
> If you really want to insist on claiming WAIT prevents false positive acks,
> I'd be happy to attempt an existence proof of this problem in the next
> installation of Jepsen.

WAIT alone can't be evaluated without the failover properties.

If you want to evaluate WAIT with a good failover procedure, you can
do the failover manually.

1) Install 5 nodes: a master plus 4 slaves, all normal instances (Redis unstable).
2) Write to the master with WAIT 2.
3) Consider acknowledged every write for which WAIT returns 2 or more
(i.e. the write was accepted by a majority of nodes).
4) If the master is down, do a manual failover: stop the master
completely, issue INFO on all the slaves, check which has the most
recent replication offset, and turn it into a master.

Note 1 about 4: As long as N/2+1 nodes are available (so the master
and one more slave can fail) we are guaranteed to retain all the
writes for which WAIT returned 2 or more on the master.
Note 2 about 4: If a node acknowledged a given write, because of how
WAIT works, it has also acknowledged all the previous writes.

I have not analyzed the above system in depth, but I can't find a
trivial failure mode.
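
A minimal sketch of steps 2-4 in Python with redis-py; the addresses are hypothetical, the 'slave_repl_offset' field name and the WAIT arguments are assumptions, and this is meant to illustrate the procedure rather than be a tested recipe.

    import redis

    SLAVES = [('10.0.0.2', 6379), ('10.0.0.3', 6379),
              ('10.0.0.4', 6379), ('10.0.0.5', 6379)]   # hypothetical addresses

    def acked_write(master, key, value):
        # Steps 2-3: a write counts only if WAIT reports at least 2 replicas,
        # i.e. the write reached a majority of the 5 nodes (master + 2 slaves).
        master.set(key, value)
        acks = master.execute_command('WAIT', 2, 1000)
        return acks >= 2

    def manual_failover():
        # Step 4: the old master is assumed to be completely stopped already.
        best, best_offset = None, -1
        for host, port in SLAVES:
            try:
                r = redis.Redis(host=host, port=port)
                offset = int(r.info('replication').get('slave_repl_offset', -1))
            except redis.ConnectionError:
                continue
            if offset > best_offset:
                best, best_offset = r, offset
        best.slaveof()   # SLAVEOF NO ONE: promote the most up-to-date slave
        return best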

Redis Cluster and Redis Sentinel are currently both unable to provide
the same guarantees as the manual failover procedure described above,
for different reasons.
However, a key idea for implementing this is that instead of making
sure the old master does not come back available, which is very hard,
it is possible to tell N/2 nodes to stop acknowledging writes before
the next master switch. This way, even if the old master comes back,
it will not be able to reach a majority and none of its writes will
be acknowledged.
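
Redis has no command that tells a replica to stop acknowledging writes, so the idea above cannot be expressed literally; the sketch below (redis-py assumed, connection handling omitted) approximates it with a real operation, detaching at least N/2 replicas from the old master before the switch so that a returning old master can never again collect a majority of acknowledgements through WAIT.

    def stop_acks_from_majority(replica_conns, n_needed):
        # Approximation of "tell N/2 nodes to stop acknowledging writes":
        # detach them from the old master with SLAVEOF NO ONE, so they no
        # longer replicate from it or acknowledge its writes.
        detached = 0
        for r in replica_conns:
            if detached >= n_needed:
                break
            try:
                r.slaveof()            # SLAVEOF NO ONE
                detached += 1
            except redis.ConnectionError:
                pass
        return detached >= n_needed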

Aphyr Null

unread,
Dec 9, 2013, 6:29:01 PM12/9/13
to redi...@googlegroups.com
> stop the master completely, issue INFO on all the slaves, check which has the most 
> recent replication offset, and turn it into a master.

0.) This presupposes the existence of strong coordination for the failover process itself.
1.) This precludes any recovery from a failed or isolated primary node. All nodes must halt until the primary is reachable by whatever system is coordinating failover.
2.) Even if strong coordination about the order of shutdown and takeover were possible, this system is not linearizable. Can you guess why?

Salvatore Sanfilippo

unread,
Dec 9, 2013, 7:12:03 PM12/9/13
to Redis DB
On Tue, Dec 10, 2013 at 12:29 AM, Aphyr Null <aphyr...@gmail.com> wrote:
>> stop the master completely, issue INFO on all the slaves, check which
>> has the most
>> recent replication offset, and turn it into a master.
>
> 0.) This presupposes the existence of strong coordination for the failover
> process itself.

Yes, the idea is that this strong coordinator could be elected among
the slave nodes by voting for a node only if its vote request carries
a replication offset greater than yours (and a majority would be
needed to get elected).
If you win the election and a given entry was replicated to the
majority, then you should have the entry.
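
A tiny sketch of that voting rule (helper names invented for illustration): a slave grants its vote only to a candidate that is at least as up to date as itself, and only once per election epoch, and a candidate needs a majority of such votes to become the coordinator.

    def grant_vote(my_repl_offset, candidate_repl_offset, already_voted_this_epoch):
        # Vote at most once per epoch, and only for a candidate whose
        # replication offset is at least as recent as our own.
        return (not already_voted_this_epoch) and candidate_repl_offset >= my_repl_offset

    def wins_election(votes_granted, cluster_size):
        # A strict majority of the cluster is required to be elected.
        return votes_granted >= cluster_size // 2 + 1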

> 1.) This precludes any recovery from a failed or isolated primary node. All
> nodes must halt until the primary is reachable by whatever system is
> coordinating failover.

Not sure why: a majority of slaves agreed not to ack writes, so the
primary can become reachable again, but clients writing to it will
only get unacknowledged writes.
Slaves will not ack again until one of them wins the election and gets promoted.

> 2.) Even if strong coordination about the order of shutdown and takeover
> were possible, this system is not linearizable. Can you guess why?

No, sorry, I have not analyzed the system very thoroughly, but I
can't find easy-to-spot reasons why it is not linearizable, assuming
we are talking about a single hash slot and not the system as a whole.
I'm interested in understanding why.

Salvatore




Salvatore Sanfilippo

unread,
Dec 10, 2013, 3:53:02 AM12/10/13
to Redis DB
On Tue, Dec 10, 2013 at 1:12 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
>> 2.) Even if strong coordination about the order of shutdown and takeover
>> were possible, this system is not linearizable. Can you guess why?
>
> No, sorry, I have not analyzed the system very thoroughly, but I
> can't find easy-to-spot reasons why it is not linearizable, assuming
> we are talking about a single hash slot and not the system as a whole.
> I'm interested in understanding why.

About non-linearizability: perhaps it does not apply to the case where
a strong coordinator exists, but in the general case one issue is that
we can't just read, because a stale master could reply with stale
data, breaking linearizability. There is a trick to force the read to
be acknowledged that could work:

MULTI
INCR somecounter
GET data
EXEC
WAIT <n>

I'll look into this more carefully, because a safer master switch
could be one of the applicable improvements in Redis Cluster, at least
when there are more than two replicas per hash slot.
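
For concreteness, a minimal redis-py version of that trick, assuming the WAIT command is available on the server (key names are just placeholders): the INCR turns the read into a write, and the result is trusted only if WAIT reports the transaction reached enough replicas.

    import redis

    r = redis.Redis(host='localhost', port=6379)   # placeholder connection

    def acknowledged_read(key, needed_acks=1, timeout_ms=1000):
        pipe = r.pipeline(transaction=True)   # MULTI ... EXEC
        pipe.incr('somecounter')              # piggyback a write on the read
        pipe.get(key)
        _, value = pipe.execute()
        acks = r.execute_command('WAIT', needed_acks, timeout_ms)
        if acks < needed_acks:
            raise RuntimeError('read was not acknowledged by enough replicas')
        return value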

Marc Gravell

unread,
Dec 10, 2013, 2:22:22 PM12/10/13
to redi...@googlegroups.com

I came all the way back to the top post, because I kinda feel that the thread has gone astray on the CAP things. Which isn't to diminish those things in any way: just - it is now going around in circles.

The reality is that while CAP is interesting, that isn't the only feature a product needs, and comments along the lines of "what does redis want to be when it grows up?" are pretty condescending IMO.

If I had to give a list of things that cause me pain in redis, I would say:

- "keys" et al: which is now a solved problem with "scan"
- how to perform maintenance on a 24x7 master node without having a brief "blip" - this is something I very much hope redis-cluster makes much friendlier
- scalability of large domains - again: redis-cluster
- replication quirks on unreliable connections: I'm hopeful "psync" makes this happier

So actually, most of the things that *are actual real problems for me* : already in hand.

The transaction model takes a little getting used to - but when you get into the "assert, try, redo from start if fail" mindset it is a breeze (and client libraries can do things to help here) - so I don't count this as a plus or a minus - just a "difference". Of course, when this isn't practical, Lua allows the problem to be approached procedurally instead.
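
For readers new to that mindset, a minimal redis-py sketch of the pattern (key name is a placeholder): WATCH is the assert, MULTI/EXEC is the try, and a WatchError sends you back to the start.

    import redis

    r = redis.Redis()   # placeholder connection

    def reserve_item(key='stock:item42'):
        with r.pipeline() as pipe:
            while True:
                try:
                    pipe.watch(key)                    # assert: observe the current value
                    stock = int(pipe.get(key) or 0)
                    if stock <= 0:
                        pipe.unwatch()
                        return False                   # nothing left to reserve
                    pipe.multi()                       # try: queue the decrement
                    pipe.set(key, stock - 1)
                    pipe.execute()                     # raises WatchError if key changed
                    return True
                except redis.WatchError:
                    continue                           # redo from start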

For the good: redis is a kickass product with insanely fast performance and rock solid reliability even when under sustained and aggressive load. The features are versatile allowing complex models to be built from easy to understand primitives.

We love you :p

Marc

On 6 Dec 2013 13:53, "Salvatore Sanfilippo" <ant...@gmail.com> wrote:
Hello dear Redis community,

today Pierre Chapuis started a discussion on Twitter about Redis
bashing, stimulated by this thread on Twitter from Rick Branson:

https://twitter.com/rbranson/status/408853897495592960

It is not the first time that Rick Branson, that works at Instagram,
openly criticizes Redis, because I guess he does not like the Redis
design and / or implementation.
However according to Pierre, this is not something limited to Rick,
but there are other engineers in the SF area that believe that Redis
sucks, and Pierre also reported to hear similar stories in Paris.

Of course every open source project of a given size is target if
critiques, especially a project like Redis is very opinionated on how
programs should be written, with the search for simple design and
implementation that sometimes are felt as sub-optimal.
However, what we can learn from this critiques, and what is that you
think is not working well in Redis? I really encourage you to share
your view.

As a starting point I'll use Rick tweet: "BGSAVE. the sentinel wtf.
memory cliffs. impossible to track what's in it. heap fragmentation.
LRU impl sux. etc et".
He also writes: "you can't even really dump the whole keyspace because
KEYS "*" causes it to shit it's"

This is a good starting point, and I'll use the rest of this email to
see what happened in the different areas of Redis criticized by Rick.

1) BGSAVE

I'm not sure what is wrong with BGSAVE, probably Rick had bad
experiences with EC2 instances where the fork time can create latency
spikes?

2) The Sentinel WTF.

Here probably the reference is the following:
http://aphyr.com/posts/283-call-me-maybe-redis

Aphyr analyzed Redis Sentinel from the point of view of a consistent
system, consistent as in CAP "strong consistency". During partition in
Aphyr tests Sentinel was not able to handle the promises of a CP
system.
I replied with a blog post trying to clarify that Redis Sentinel is
not designed to provide strong consistency in the face of partitions,
but only to provide some degree of availability when the master
instance fails.

However the implementation of Sentinel, even as a system promoting a
slave when the master fails, was not optimal, so there was work to
reimplement it from scratch. Finally the new Sentinel is available in
Redis 2.8.x
and is much more simple to understand and predict. This is surely an
improvement. The new implementation is able to version changes in the
configuration that are eventually propagated to all the other
Sentinels, requires majority to perform the failover, and so forth.

However if you understand even the basics of distributed programming
you know a few things, like how a system with asynchronous replication
is not capable to guarantee consistency.
Even if Sentinel was not designed for this, is Redis improving from
this point of view? Probably yes. For example now the unstable branch
has support for a new command called WAIT that implements a form of
synchronous replication.

Using WAIT and the new sentinel, it is possible to have a setup that
is quite partition resistant. For example if you have three computers,
A, B, C, and run a Sentinel instance and a Redis instance in every
computer, only the majority partition will be able to perform the
failover, and the minority partition will stop accepting writes if you
use "WAIT 1", that is, if you wait the propagation of the write to at
least one replica. The new Sentinel also elects the slave that has the
most updated version of data automatically.

Redis Cluster is another step forward towards Redis HA and automatic
sharding, we'll see how it works in practice. However I believe that
Sentinel is improving and Redis is providing more tools to fine-tune
consistency guarantees.

3) Impossible to track what is in it.

Lack of SCAN was a problem indeed, now it is solved. Even before using
RANDOMKEY it was somewhat possible to inspect data sets, but SCAN is
surely a much better way to do this.
The same argument goes for KEYS *.

4) LRU implementation sucks.

The LRU implementation in Redis 2.4 had issues, and under mass-expire
there were latency spikes.
The LRU in 2.6 is much smoother; however, it contained issues signaled
by Pavlo Baron where the algorithm was not able to guarantee that
expired keys were always kept under a given threshold.
Newer versions of 2.6, and 2.8 of course, both fix this issue.

I'm not aware of issues with the LRU algorithm.

I have the feeling that Rick's opinion is a bit biased by the fact that
he was exposed to older versions of Redis; however, his criticisms were
in part actually applicable to older versions of Redis.
This shows that there is something good about these critiques. For
instance Rick always said that replication sucked because of the lack
of partial resynchronization. I'm sorry he is no longer able to say this.
As a consolation prize we'll send him a t-shirt if the budget permits.
But this again shows that critiques tend to be focused where
deficiencies *are*, so hiding Redis behind a needle is not a good idea
IMHO. We need to improve the system to make it better, as long as it is
still a useful system for many users.

So, what are the critiques that you hear frequently about Redis? What
are your own critiques? When does Redis suck?

Let's tear Redis apart, something good will happen.

Salvatore


--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

We suspect that trading off implementation flexibility for
understandability makes sense for most system designs.
       — Diego Ongaro and John Ousterhout (from Raft paper)

Pieter Noordhuis

unread,
Dec 10, 2013, 2:33:28 PM12/10/13
to redi...@googlegroups.com
WAIT in isolation doesn’t give any guarantees, it seems, only
information about the state of the slave links. A concept like WAIT
only becomes useful when action is taken on its failure. Right now, if
a call to WAIT returns with an undesirable result, it is up to the
user to figure out what to do next. In my opinion, there is an
opportunity for Redis to do the right thing instead and provide bounds
towards how many writes can be lost.

Looking at the set of operations that Redis currently supports, we
find the following:
- For one single key, there must be only one process taking writes for
it, or the linearizability property is violated. As soon as there is
more than one process taking writes, there is no way these processes
can converge, because of the ordering requirement on operations (think
of a list push; there exists no merge function with a predictable
result).
- Following this observation, selecting which process is going to take
writes for a single key needs a majority vote. If a majority vote
cannot be achieved, the system must halt. If it doesn’t, we again end
up with the possibility of multiple processes taking writes and
absence of linearizability.
- AP is out of the question for Redis, in its current form.

It looks like a failed WAIT needs to be followed by the process taking
writes to stop taking writes (halting). It means that writes are no
longer replicated to a majority of slaves. This in turn means that the
system can partition in a way where the process taking writes and its
slaves are separated from a majority of slaves. This majority can then
elect a new master and start taking writes, violating linearizability
since it is not allowed to have more than one process taking writes
for a single key.

The fact that Redis uses asynchronous replication means that it can’t
be a pure CP system, which is what people in this thread have been
arguing for/against. I think it can be a CP system with bounds on
write loss (can this still be called a CP system?). The bound is
defined by the time the process taking writes continues to take writes
without a majority acknowledgement. This is only possible when the
process taking writes halts. Otherwise, there are very few guarantees
that can be made towards retention of writes.
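
A minimal sketch of that bound in Python with redis-py (the node counts and WAIT arguments are assumptions): the writer halts as soon as a write fails to reach a majority, so the window of unacknowledged writes that can be lost stays bounded.

    import redis

    REPLICA_QUORUM = 2        # 5 nodes total: master + 4 replicas, majority = master + 2
    WAIT_TIMEOUT_MS = 1000

    def write_or_halt(master, key, value):
        master.set(key, value)
        acks = master.execute_command('WAIT', REPLICA_QUORUM, WAIT_TIMEOUT_MS)
        if acks < REPLICA_QUORUM:
            # Lost the majority: stop taking writes entirely so the amount of
            # data that can be lost is bounded by what was in flight just now.
            raise SystemExit('majority lost, halting writer')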

These statements reflect my understanding of the domain, please tell
me if/where I’m wrong.

Cheers,
Pieter

On Tue, Dec 10, 2013 at 12:53 AM, Salvatore Sanfilippo

Kelly Sommers

unread,
Dec 10, 2013, 3:45:56 PM12/10/13
to redi...@googlegroups.com


On Tuesday, December 10, 2013 2:22:22 PM UTC-5, Marc Gravell wrote:

I came all the way back to the top post, because I kinda feel that the thread has gone astray on the CAP things. Which isn't to diminish those things in any way: just - it is now going around in circles.

The reality is that while CAP is interesting, that isn't the only feature a product needs, and comments along the lines of "what does redis want to be when it grows up?" are pretty condescending IMO.

I definitely didn't mean that comment as condescending, and if it came across that way to anyone I sincerely apologize. I'm sorry. The intent was to say that deciding what type of system Redis wants to be is important, because it reduces the conflicting functionality that makes the system more unstable than it needs to be (which addresses some of your concerns).

Your list below contains a lot of items I hear from many of the customers I work with; they are definitely a common theme, at least from my perspective. I'm not suggesting that's where the focus should be, though. But if those items are to be improved, the concerns I pointed out and Aphyr elaborated on are part of that solution.

I don't know why there's pushback on the topics I raised, because they are involved in 3 of the 4 bullet items you want fixed.

Redis Cluster currently has features that were broken in the last implementation and are potentially broken even worse in the "fixed" implementation. Some features conflict with and undermine each other, which keeps them from working properly.

If you don't want blips under maintenance in a larger, scalable cluster, without replication quirks, all of this requires a sound distributed system implementation. I can't stress enough how picking trade-offs that undermine each other negates most of their benefits and causes problems. The good news is this can be improved :)

Salvatore Sanfilippo

unread,
Dec 10, 2013, 6:10:44 PM12/10/13
to Redis DB
Hello, I wrote a better description of the "toy" distributed system I
proposed as a starting model.
It is a toy since it uses a powerful entity that is actually
impractical, but the idea is that it is possible to remove it with
careful changes.

https://gist.github.com/antirez/7901666

This was mostly an exercise for me; still, analyzing this trivial
model I found a trivial improvement that I can make to the Redis
replication process and that results in a better degree of data
safety. I'm opening an issue right now.

Salvatore

Howard Chu

unread,
Dec 10, 2013, 8:34:14 PM12/10/13
to redi...@googlegroups.com
While you're on the subject of replication, I suggest you read RFC 4533 (LDAP Content Sync Replication) to get some ideas. Currently your replication protocol's resync after a node disconnect/reconnect is far too expensive. Inventing new replication protocols is a loser's game, especially when a lot of dedicated/determined people have already done the hard work. Learn from the mistakes and lessons of the work that has come before you. (And for an even better method, read up on OpenLDAP's enhancement of RFC 4533, Delta-Sync replication.)

Dvir Volk

unread,
Dec 11, 2013, 3:52:26 AM12/11/13
to redi...@googlegroups.com
On Wed, Dec 11, 2013 at 3:34 AM, Howard Chu <highl...@gmail.com> wrote:
While you're on the subject of replication, I suggest you read RFC 4533 (LDAP Content Sync Replication) to get some ideas. Currently your replication protocol's resync after a node disconnect/reconnect is far too expensive.

That's not how replication works in Redis anymore. Redis 2.8 does not do a full resync on reconnect.


Howard Chu

unread,
Dec 12, 2013, 12:56:35 AM12/12/13
to redi...@googlegroups.com

Even so - in reality, there is no difference between "single-master with failover" and "multi-master", but the Redis protocol doesn't maintain enough state to track multiple masters. That is why it so easily loses data in a failover.

Wayne Brantley

unread,
Dec 12, 2013, 1:32:29 AM12/12/13
to redi...@googlegroups.com
First, I think this is a very open discussion and very helpful.

Here are my thoughts:

1)  Has to be 100% memory backed (database is held in memory during operation).  

People constantly suggest a disk-backed Redis and you always say no.  There is even a project that implements it and you still say no.  Heck, start with that design and bolt it on (optionally).  I really think you need to reconsider.  You could have a 'memory only' mode as well as a disk-backed mode.  There would be trade-offs, but there are always trade-offs - the difference is that it would be up to me (not you) to decide on those.

2)  HA support is weak. 

Heck, you guys know this - this entire thread has turned into that.  The system needs to scale horizontally too, though.  I want HA to ensure my Redis is up and running.  If there is a failure, it should fail over, and when something comes back online the capacity should be added back in.  This should all be easy to set up and make work - just dead simple.  The how-to should read something like this:  http://www.couchbase.com/couchbase-server/scalability

3)  Publish/Subscribe model cannot be backed by a list.  

It would be nice if, when publishing to a channel, that channel could be backed by a list, so subscribers could get the messages they missed while offline.  Additionally, a fan-out type of publish/subscribe feature would be more than welcome.  (Note this would mean I could subscribe to Redis Keyspace Notifications and not worry that I missed some because my client was not subscribed!)

4)  Publish/Subscribe cannot guarantee message was processed.  

Sort of related to #3, but if a message is consumed by a subscriber, I should be able to require an acknowledgement of the message before it is removed from the list in #3.
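
Items 3 and 4 can be approximated today with plain lists instead of pub/sub; a minimal redis-py sketch (key names invented) using BRPOPLPUSH so a message stays on a per-consumer processing list until it is explicitly acknowledged:

    import redis

    r = redis.Redis()                # placeholder connection
    QUEUE = 'events'                 # producers LPUSH here instead of PUBLISH
    PROCESSING = 'events:working'    # unacknowledged messages wait here

    def publish(message):
        r.lpush(QUEUE, message)

    def consume_one(timeout=5):
        # Atomically move one message to the processing list; consumers that
        # were offline simply find their backlog still sitting in QUEUE.
        return r.brpoplpush(QUEUE, PROCESSING, timeout=timeout)

    def ack(message):
        # Acknowledge: remove the message from the processing list
        # (argument order follows redis-py 3.x: name, count, value).
        r.lrem(PROCESSING, 1, message)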

5)  >>I also think the core development could be closer with the community work.  I understand that is important to keep redis simple, but I see few forks that have good contributions(eg: NDS, Sentinel automatic discovery/registration), yet not much movement to merge in the core.

I agree with the prior poster on this.  As an example, there are 215 pull requests!  That is an open source dream - all those pull requests ready to go.  They are not all so complex that you need to study them for years, nor such sweeping changes that you would disagree with their premise.  Some are simple spelling fixes, etc.  It does not look like a healthy open source project.  People want to help, change, and add features - let us/them!
   
Great product and nice to see some movement and improvements!

Thanks for your time and for listening/considering.