| Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 05:52 | Hello dear Redis community,
today Pierre Chapuis started a discussion on Twitter about Redis bashing, prompted by this thread from Rick Branson: https://twitter.com/rbranson/status/408853897495592960

It is not the first time that Rick Branson, who works at Instagram, openly criticizes Redis; I guess he does not like the Redis design and / or implementation. However, according to Pierre this is not something limited to Rick: there are other engineers in the SF area who believe that Redis sucks, and Pierre also reported hearing similar stories in Paris. Of course every open source project of a given size is a target of critiques, especially a project like Redis, which is very opinionated about how programs should be written, favoring simple designs and implementations that are sometimes perceived as sub-optimal. However, what can we learn from these critiques, and what do you think is not working well in Redis? I really encourage you to share your view.

As a starting point I'll use Rick's tweet: "BGSAVE. the sentinel wtf. memory cliffs. impossible to track what's in it. heap fragmentation. LRU impl sux. etc et". He also writes: "you can't even really dump the whole keyspace because KEYS "*" causes it to shit it's". This is a good starting point, and I'll use the rest of this email to look at what happened in the different areas of Redis criticized by Rick.

1) BGSAVE. I'm not sure what is wrong with BGSAVE; probably Rick had bad experiences with EC2 instances, where the fork time can create latency spikes?

2) The Sentinel WTF. Here the reference is probably the following: http://aphyr.com/posts/283-call-me-maybe-redis

Aphyr analyzed Redis Sentinel from the point of view of a consistent system, consistent as in CAP "strong consistency". During partitions in Aphyr's tests Sentinel was not able to keep the promises of a CP system. I replied with a blog post trying to clarify that Redis Sentinel is not designed to provide strong consistency in the face of partitions, but only to provide some degree of availability when the master instance fails. However the implementation of Sentinel, even as a system that promotes a slave when the master fails, was not optimal, so there was work to reimplement it from scratch. The new Sentinel is finally available in Redis 2.8.x and is much simpler to understand and predict. This is surely an improvement. The new implementation is able to version changes in the configuration that are eventually propagated to all the other Sentinels, requires a majority to perform the failover, and so forth.

However if you understand even the basics of distributed programming you know a few things, like how a system with asynchronous replication is not able to guarantee consistency. Even if Sentinel was not designed for this, is Redis improving from this point of view? Probably yes. For example the unstable branch now has support for a new command called WAIT that implements a form of synchronous replication. Using WAIT and the new Sentinel, it is possible to have a setup that is quite partition resistant. For example if you have three computers, A, B and C, and run a Sentinel instance and a Redis instance on every computer, only the majority partition will be able to perform the failover, and the minority partition will stop accepting writes if you use "WAIT 1", that is, if you wait for the propagation of every write to at least one replica. The new Sentinel also automatically elects the slave that has the most updated version of the data.
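To make the pattern concrete, here is a minimal sketch of the client side in Python with redis-py (the host name and key are hypothetical, and remember that WAIT only exists in the unstable branch right now):

```python
import redis

# Assume node A currently holds the master role in the three-node
# A / B / C setup described above (host name is hypothetical).
r = redis.StrictRedis(host='node-a', port=6379)

r.set('counter', 1)

# WAIT <numreplicas> <timeout-ms>: block until at least one replica
# has acknowledged the write, or until 100 ms have elapsed; the
# return value is the number of replicas actually reached.
acked = r.execute_command('WAIT', 1, 100)
if acked < 1:
    # The write is on the master only: in the minority partition this
    # is the point where the application stops accepting writes.
    raise RuntimeError('write not propagated to any replica')
```

In the majority partition WAIT returns quickly; in the minority partition it times out, which is exactly the behavior that makes the setup refuse unsafe writes.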
Redis Cluster is another step forward towards Redis HA and automatic sharding; we'll see how it works in practice. However I believe that Sentinel is improving, and Redis is providing more tools to fine-tune consistency guarantees.

3) Impossible to track what is in it. The lack of SCAN was a problem indeed; now it is solved. Even before, using RANDOMKEY it was somewhat possible to inspect data sets, but SCAN is surely a much better way to do this. The same argument goes for KEYS *.

4) The LRU implementation sucks. The LRU implementation in Redis 2.4 had issues, and under mass-expire there were latency spikes. The LRU in 2.6 is much smoother; however it contained issues signaled by Pavlo Baron, where the algorithm was not able to guarantee that expired keys were always under a given threshold. Newer versions of 2.6, and 2.8 of course, both fix this issue. I'm not aware of remaining issues with the LRU algorithm.

I have the feeling that Rick's opinion is a bit biased by the fact that he was exposed to older versions of Redis; however his criticisms were in part actually applicable to older versions of Redis. This shows that there is something good about these critiques. For instance Rick always said that replication sucked because of the lack of partial resynchronization. I'm sorry he is no longer able to say this. As a consolation prize we'll send him a t-shirt if the budget permits. But this again shows that critiques tend to be focused where deficiencies *are*, so burying our heads in the sand is not a good idea IMHO. We need to improve the system to make it better, as long as it is still a useful system for many users. So, what are the critiques that you hear frequently about Redis? What are your own critiques? Where does Redis suck? Let's tear Redis apart, something good will happen. Salvatore -- Salvatore 'antirez' Sanfilippo open source developer - GoPivotal http://invece.org We suspect that trading off implementation flexibility for understandability makes sense for most system designs. — Diego Ongaro and John Ousterhout (from Raft paper) |
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 06/12/13 07:22 | Others: Quentin Adam, CEO of Clever Cloud (a PaaS) has a presentation that says Redis is not fit to store sessions: http://www.slideshare.net/quentinadam/dotscale2013-how-to-scale/15 (he advises Membase) Tony Arcieri (Square, ex-LivingSocial) is a "frequent offender": https://twitter.com/bascule/status/277163514412548096 https://twitter.com/bascule/status/335538863869136896 https://twitter.com/bascule/status/371108333979054081 https://twitter.com/bascule/status/390919938862379008 Then there's the Disqus guys, who migrated to Cassandra, the Superfeedr guys who migrated to Riak... Instagram moved to Cassandra as well, here's more on it by Branson to see where he comes from: http://www.planetcassandra.org/blog/post/cassandra-summit-2013-instagrams-shift-to-cassandra-from-redis-by-rick-branson This presentation about scaling Instagram with a small team (by Mike Krieger) is very interesting as well: http://qconsf.com/system/files/presentation-slides/How%20a%20Small%20Team%20Scales%20Instagram.pdf He says he would go with Redis again, but there are some points about scaling up Redis starting at slide 56. My personal experience, to be clear, is that Redis is an awesome tool when you know how it works and how to use it, especially for a small team (like Krieger basically). I have worked for a company with a very reduced technical team for the last 3.5 years. We make technology for mobile applications which we sell to large companies (retail, TV, cinema, press...) mostly white-labelled. I have written most of our server side software, and I have also been responsible for operations. We have used and still use Redis *a lot*, and some of the things we have done would just not have been possible with such a reduced team in so little time without it. So when I read someone saying he would ban Redis from his architecture if he ever makes a startup, I think: "good thing he doesn't." :) Thank you Antirez for this awesome tool. |
| Re: Redis critiques, let's take the good part. | Alexander Gladysh | 06/12/13 07:25 | On Fri, Dec 6, 2013 at 7:22 PM, Pierre Chapuis wrote: Indeed! Until you have bumped into all the hidden obstacles, the experience is rather horrible. When Redis blows up in production, it usually costs developers a few gray hairs :-) However, after you know what not to do, Redis is all awesomeness. My 2c, Alexander. |
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 06/12/13 07:33 | On Friday, December 6, 2013 4:25:14 PM UTC+1, Alexander Gladysh wrote: I would say that of every tool. You can always outgrow them or use them poorly. I had a terrible experience with MySQL. A (VC funded) startup around here had issues with CouchDB, moved to Riak with Basho support, had issues again, moved to HBase, which they still use (I think). That does not make any of those tools bad. You just have to invest some time into learning what those tools can and cannot do, which one to use for which use case, and how to use them correctly. -- Pierre Chapuis |
| Re: Redis critiques, let's take the good part. | Alexander Gladysh | 06/12/13 07:34 | On Fri, Dec 6, 2013 at 7:33 PM, Pierre Chapuis <catwell...@catwell.info> wrote: I agree :-) If the learning curve is flat, it usually means that the tool is too casual to be useful. Alexander. |
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 06/12/13 07:41 | Also: I am not saying I have never experienced scaling issues with Redis! I have. You always will when you build a system from scratch that ends up serving millions of users. So there are bottlenecks I hit, models I had to reconsider, and even things I had to move off Redis. But none of that made me go "OMG this tool is terrible and nobody should use it, ever!!1". And I still think going with Redis in the first place was a very good idea. On a side note: one of the things it *did* make me decide not to use is intermediate layers between my application and Redis that abstract your models. When you hit a bottleneck, you want to know exactly what you have stored in Redis, how, and why. So things like https://github.com/soveran/ohm are really cool for prototyping and for things that are not intended to scale, but if you decide to use them for a product with traction you'd better understand exactly what they do, or just write your own abstraction layer that suits your business logic. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 07:47 | On Fri, Dec 6, 2013 at 4:22 PM, Pierre Chapuis wrote: I don't quite understand the presentation, to be super-honest: what do "multiple writes" / "pseudo atomic" mean? I'm not sure. MULTI/EXEC and Lua scripts both retain their semantics on the slave, which will process the transaction all-or-nothing. About HA, with the new Sentinel and Cluster we have something to say in the present and in the future. Not sure what Membase's properties are; their page seems like marketing, and I don't know a single person that uses it, to be honest. Latency complaints, on 2.2.x, with no information given; but Redis can be operated with excellent latency characteristics if you know what you are doing. Honestly I believe that, from the point of view of average latency and the ability to provide consistent latency, Redis is one of the better DBs available out there. If you run it on EC2 with EBS, on instances that can't fork, with an fsync that can't cope, it is a sysop fail, not a problem with the system IMHO.

> https://twitter.com/bascule/status/335538863869136896

FUD

> https://twitter.com/bascule/status/371108333979054081

FUD

> https://twitter.com/bascule/status/390919938862379008

101 of distributed systems is that non-synchronous replication can drop acknowledged writes. Every single-instance on-disk DB not configured to fsync at every write can drop acknowledged writes. So this is totally obvious for most DBs deployed currently. What does not drop acknowledged writes as long as the majority is up? CP systems with strong consistency like Zookeeper. It's worth mentioning that WAIT, announced yesterday, can do a lot from this point of view.

I've no idea why Disqus migrated to Cassandra; probably it was just a much better pick for them? Migrating to a different system does not necessarily imply a problem with Redis, so this is not a criticism we can use in a positive way to act, unless the Disqus guys write us why they migrated and what Redis deficiencies they found. Same story here. And again...

This is interesting indeed, and sounds like problems that we can solve with Redis Cluster. Let's face it, partitioning client side is complex. Redis Cluster provides a lot of help for big players with many instances, since operations will be much simpler once you can reshard live.

I find the above pointers interesting, but how do we act based on this? IMHO the current route of providing a simple HA system like Sentinel, trying to make it robust, and at the same time providing a more complex system like Redis Cluster for "bigger needs", is the best direction the Redis project can take. The "moved away from Redis" stories don't tell us much. What I believe is that sometimes, when you are small, you tend to do things with an in-memory data store that don't really scale cost-wise, since the IOPS per instance could be handled with a disk-oriented system, so moving away could be a natural consequence, and this is fine. At the start maybe using Redis helped a lot by serving many queries with few machines, during the boom, with relatively few users (in the order of maybe 1 million) but big pressure on load because of the hype around the service. What do you think we can do to improve Redis based on the above stories? Cheers! |
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 06/12/13 07:48 | On Friday, December 6, 2013 4:34:38 PM UTC+1, Alexander Gladysh wrote: This. Also, maybe I avoided some of the issues others encountered in production because: 1) I have an MSc in distributed systems (it helps sometimes :p) 2) I had forked Redis and implemented custom commands before I actually deployed it, so I understood the code base. Also, I had read the documentation and not skipped the parts about the algorithmic complexity of the commands, persistence trade-offs... :) I guess that if you let a novice developer use Redis in his application it may be easier for him to shoot himself in the foot. But... if you think about it, those things are also true of a relational database: if you don't understand what you do you will write dangerous code, and if you decide to use an ORM and scale you'd better understand it. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 07:52 | On Fri, Dec 6, 2013 at 4:33 PM, Pierre Chapuis wrote: About the "moves to Riak", this is also a component. People seek help with Redis and there was nothing: me busy, Pivotal not yet providing support (now they finally do!). If Basho engineers say "hi, we'll fix your issues", this is surely an incentive (yet in this case people moved). Unfortunately I'm really not qualified to say whether there is big value in Riak for the use case it is designed for, as I hear a mix of horrible and great things, and I never deployed it seriously. But I'm happy that people try other solutions: in the end, what is no longer useful MUST DIE in technology. If Redis dies in 6 months, that is great news: it means that technology evolved enough that with other systems you can do the same in some simpler way. However, as long as I see traction as I'm seeing it right now in the project, and there is a company like Pivotal supporting the effort, I'll continue to improve it. |
| Re: Redis critiques, let's take the good part. | Shane McEwan | 06/12/13 08:05 | On 06/12/13 15:52, Salvatore Sanfilippo wrote: For what it's worth, we run both Riak and Redis. They each solve different problems for us. You use whichever tool solves your problem. There's no point complaining that your screwdriver is no good at hammering nails! Shane. |
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 06/12/13 08:08 | On Friday, December 6, 2013 4:47:11 PM UTC+1, Salvatore Sanfilippo wrote:

> I don't quite understand the presentation, to be super-honest: what do "multiple writes" / "pseudo atomic" mean? I'm not sure.

Afaik he is saying the system is single master and you cannot have two writes executing concurrently, so write throughput / latency is limited by a single node.

> Then there's the Disqus guys, who migrated to Cassandra,

They mention it here: http://planetcassandra.org/blog/post/disqus-discusses-migration-from-redis-to-cassandra-for-horizontal-scalability But they don't say much about their reasons, basically "it didn't scale" :(

> This presentation about scaling Instagram with a small [...]
> This is interesting indeed, and sounds like problems that we can solve with Redis Cluster. [...]

He also mentions the allocator as their reason to use Memcache instead of Redis. I wonder if a lot of this criticism does not come from people who don't use jemalloc.

> Let's face it, partitioning client side is complex. Redis Cluster [...]

I can't comment much on that, I don't see a reason to use Redis Cluster for now. Most of my data is trivial to shard in the application. Maybe that would help with migrations / re-sharding, but this is not *so* terrible if you don't let your shards grow really huge.

:) |
| Re: Redis critiques, let's take the good part. | Jonathan L. | 06/12/13 08:09 | One of the big challenges we had with redis in MercadoLibre was the size of the dataset. The fact that it needs to fit in memory was a big issue for us. We used to have, on a common basis, 500GB DBs or even more. Not sure if this is a common case for other redis users anyway. -- |
| Re: Redis critiques, let's take the good part. | Alexander Gladysh | 06/12/13 08:12 | On Fri, Dec 6, 2013 at 8:09 PM, Jonathan Leibiusky <iona...@gmail.com> wrote: Seems to be kind of a screwdriver vs. nails problem, no? Why use Redis for a task that it is explicitly not designed for? (Not trying to offend you, this is an honest question — relevant, I think, since we're talking about why Redis is perceived as deficient by some users...) Alexander. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 08:15 | On Fri, Dec 6, 2013 at 5:05 PM, Shane McEwan <sh...@mcewan.id.au> wrote: Totally makes sense indeed. The systems are very different. Just a question: supposing Redis Cluster were available and stable, is there some problem at the intersection between Redis and Riak that you ended up solving with Riak, but that Redis Cluster could have addressed? Or was it a matter of other metrics, like the consistency model and the like? |
| Re: Redis critiques, let's take the good part. | Jonathan L. | 06/12/13 08:18 | It's not that we planned it. Developers started using it for something they thought would stay small, but it grew. And it grew a lot. We ended up using redis to cache a small chunk of the data, with MySQL or Oracle as the backend data store. -- |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 08:18 | On Fri, Dec 6, 2013 at 5:12 PM, Alexander Gladysh <agla...@gmail.com> wrote: This is entirely possible, but it depends a lot on the use case. If the IOPS per object are in a range where you pay less for RAM than for the number of nodes you would need to spin up with an on-disk solution, then switching becomes hard even when you realize you are using a lot of RAM. Also, it depends on where you run. On premise 500GB is not huge; on EC2 it is. |
| Re: Redis critiques, let's take the good part. | Alexander Gladysh | 06/12/13 08:28 | On Fri, Dec 6, 2013 at 8:18 PM, Jonathan Leibiusky <iona...@gmail.com> wrote: > It's not that we planned it. Developers started using it for something they [...] Ah, I see. We had that happen (on a much smaller scale). But, despite Redis blowing up in our faces several times, we were eventually able to get away with optimizing data sizes (and adding a few ad-hoc cluster nodes). This is exactly what I would do now — after I had that experience. Redis can be a primary data storage, but you have to think very well before using it as such. I had a different point of view before, and it was the source of some pain for us. You live and learn :-) My 2c, Alexander. |
| Re: Redis critiques, let's take the good part. | Alexander Gladysh | 06/12/13 08:29 | On Fri, Dec 6, 2013 at 8:18 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote: Of course. But you have to know Redis well to be able to get away with this — and even to be able to make a weighted and sane decision on that matter. Alexander. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 08:31 | On Fri, Dec 6, 2013 at 5:08 PM, Pierre Chapuis wrote: Unless you use sharding. Otherwise, any system that accepts, at the same time, in two different nodes, a write for the same object, is eventually consistent. From what I can tell, Redis *cannot* really scale on EC2 for applications requiring a large data set, just because of the cost of spinning up enough instances. Imagine the 4TB Twitter Redis cluster on EC2. Totally possible even for small companies on premise. That's pre-jemalloc IMHO. I'm quite sure that as soon as we provide a solid Sentinel and a Redis Cluster that works, we'll see a lot of new users...
|
| Re: Redis critiques, let's take the good part. | Felix Gallo | 06/12/13 08:32 | I think there are three types of criticism. The first type comes from a surge in popularity of high-A-style systems and, owing to the sexiness of those concepts and their relative newness, a corresponding surge in dilettantes who try to eagerly apply knowledge gleaned from Aphyr's (great) Jepsen posts against all use cases, find Redis wanting, and try to be the first to tweet out the hipster sneering. I won't name names, but there's a dude who posted that you should replace redis with zookeeper. I literally cried with laughter.
The second type is serious high-A folk like Aphyr, who do correctly point out that Redis cluster was not designed "properly." It turns out that distributed systems are incredibly complicated, and doing things the most simple and direct way, as Salvatore seems to aim to do, frequently misses some complex edge cases. This type of criticism is more important, because here traditionally Redis has claimed it has a story when it really didn't. I have concerns that Salvatore working alone will not get to a satisfactory story here owing to the complexities, and I sometimes wonder whether it would not be better to use external solutions (e.g. the system that uses zookeeper as a control plane), not go for 100% availability, and place the focus on the third area of criticism.
The third type is the most important, in my opinion: it's the people who fundamentally misunderstand Redis. You see it all the time on this list: people who think Redis is mysql, or who ask why the server seems to have exploded when they put 100G of data in an m1.small, or why expiry is not instant, or why a transaction isn't rollable back. The problem here is that Redis is very much a database construction set, with Unix-style semantics. By itself it gives you just enough rope to hang yourself with. By itself, without care and feeding and diligence, Redis will detonate over time in the face of junior- and mid-level developers. People will create clashing schemas across applications. People will issue KEYS * in production. People will DEL a 4 million long list and wonder why it doesn't return immediately (<-- this was me). Heck, I'd been using Redis hard for a year before I learned the stupid SORT join trick from Josiah. Many of these warts and complexities around usage and operation of a single instance could be smoothed over (KEYS *, ARE YOU SURE (Y/N) in redis-cli), and as far as making The World happy, that's probably the biggest bang for the buck.
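Both of those footguns have cheap workarounds once you know them. A minimal redis-py sketch (assuming Redis 2.8 for SCAN; key names are hypothetical):

```python
import redis

r = redis.StrictRedis()

# Instead of KEYS *, which blocks the server while it walks the whole
# keyspace, SCAN (2.8+) iterates it in small cursor-driven steps.
for key in r.scan_iter(match='*', count=500):
    pass  # inspect keys here without freezing the instance

# Instead of DEL on a multi-million-element list, which blocks until
# the whole structure is freed, trim it away in batches so each call
# stays cheap; an empty list deletes its own key.
def delete_large_list(key, batch=1000):
    while r.llen(key) > 0:
        r.ltrim(key, batch, -1)  # drop the first `batch` elements
```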
Personally, I've just finished deploying a major application component for an online game for which you have seen many billboards no matter where you are in the world. Over 2 million users use the component every day, and we put and get tens-to-hundreds-of-thousands of data items per second. We don't use in-redis clustering, and we don't use sentinel, but I sleep at night fine because my dev and ops teams understand the product and know how it fails.
F.
|
| Re: Redis critiques, let's take the good part. | Shane McEwan | 06/12/13 08:42 | On 06/12/13 16:15, Salvatore Sanfilippo wrote: I haven't looked at Redis Cluster yet, so I can't say for sure. The main reason for choosing Riak was scalability and redundancy. We know there are some huge Riak clusters out there and we plan to be one of them eventually. Our dataset is larger than can easily (cheaply!) fit into memory, so we use Riak with LevelDB to store our data, while anything we want quick and easy access to we store in Redis. Shane. |
| Re: Redis critiques, let's take the good part. | Yiftach | 06/12/13 09:03 | From the point of view of a Redis provider who "lives" off these OSS issues, I can only say that I know a handful of companies that can actually manage any OSS DB themselves at a large scale in production. I'm quite sure that most of these transitions to Riak/Cassandra were backed by the Basho and Datastax guys. The fact that Redis is much more popular than those DBs (only 2nd to Mongo in real NoSQL deployments) actually means that someone built a solid product here.
From the commercial side, there are now a few companies with enough cash in the bank for supporting and giving services around Redis; I'm sure this will only strengthen its position. Another point to mention is cloud deployment - I can only guess that most of the Redis deployments today are on AWS, and managing any large distributed deployment over this environment is a great challenge, especially with in-memory databases. This is because: instances fail frequently, data-centers fail, network partitions happen too often, noisy neighbors all over, the concept of ephemeral storage, SAN/EBS storage which is not tuned for sequential writes, etc. I can only say that due to the competition from the other cloud vendors, SoftLayer, GCE and Azure, AWS infrastructure is constantly improving. For instance - last year there were 4 zone (data-center) failure events in the AWS us-east region; this year - zero. The new AWS C3 instances are now based on HVM, and most of the BGSAVE fork time issues have been solved.
-- Yiftach Shoolman +972-54-7634621 |
| Re: Redis critiques, let's take the good part. | Josiah Carlson | 06/12/13 10:29 | > Heck, I'd been using Redis hard for a year before I learned the stupid SORT join trick from Josiah. Not stupid, just crazy :) My criticisms are primarily from the point of view of someone who knows enough about Redis to be dangerous, who has spent the last 11+ years studying, designing, and building data structures, but who doesn't have a lot of time to work on Redis itself. All of the runtime-related issues have already been covered.
Long story short: every one of the existing data structures in Redis can be improved substantially. All of them can have their memory use reduced, and most of them can have their performance improved. I would argue that the ziplist encoding should be removed in favor of structures that are concise enough to make the optimization unnecessary for structures with more than 5 or 10 items. If the intset encoding is to be kept, I would also argue that it should be modified to apply to all sets of integers (not just small ones), and its performance characteristics updated if it happens that the implementation changes to improve large intset performance.
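For context, these are the knobs that currently decide when the compact encodings are dropped; a redis-py sketch (threshold values are illustrative, check your redis.conf for the shipped defaults; key name hypothetical):

```python
import redis

r = redis.StrictRedis()

# Below these thresholds Redis keeps the compact ziplist / intset
# encodings; past them it silently converts to the general-purpose
# structures.
r.config_set('hash-max-ziplist-entries', 128)
r.config_set('hash-max-ziplist-value', 64)
r.config_set('set-max-intset-entries', 512)

r.sadd('small:ints', 1, 2, 3)
print(r.object('encoding', 'small:ints'))  # 'intset' while it stays small
```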
I might also argue that something like Redis-nds should be included in core, but that it should *not* involve the development of a new storage engine, unless that storage engine is super simple (I wrote a bitcask w/index creation on shutdown in Go a few weeks ago in a week, and it is the best on-disk key/value storage engine I've ever used). I don't know whether explicitly paging data in and out makes sense, or whether it should be automatic, as I can make passionate arguments on both sides.
All of that said, Redis does work very well for every use case that I find reasonable, even if there are some rough edges. - Josiah |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 10:36 | On Fri, Dec 6, 2013 at 5:32 PM, Felix Gallo <felix...@gmail.com> wrote: Hello Felix, I like classifications ;-) Skipping the first type, as I recognized it and it is not worth analyzing :-) Here there is a "mea culpa" to do: the first Sentinel and the first version of Redis Cluster were designed before I seriously learned the theoretical basis of distributed systems. This is why I used the past months to read and learn about distributed systems. I believe the new design of Redis Cluster is focused on real trade-offs and will hold well in practice. It may not be bug free, or some minor changes may be needed, but IMHO there are no huge mistakes. Aphyr did a great thing analyzing systems in practice: do they hold up to their promises? However I think that distributed systems are not super hard, like kernel programming is not super hard, like C system programming is not super hard. Everything new, or that you don't do on a daily basis, seems super hard, but it is actually just different concepts that everybody here on this list can definitely master. So Redis Sentinel as a distributed system was not consistent? Wow: asynchronous replication was used, so there was no way for the master partitioned away to stop receiving writes, and no merge operation afterward, just "whoever is the master rewrites the history". Also, the first Sentinel was much simpler to take apart from a theoretical perspective: the system would not converge after the partition heals, and it was simple to prove. It is also possible to trivially prove that the ODOWN state for the kind of algorithm used does not guarantee liveness (but this is practically not important for *how* it is used now). It is important to learn, but there is no distributed-systems cathedral that is impossible to climb. At most, more learning is needed, and the implementation must be adapted to the best one can provide at a given moment, given understanding, practical limits (a single coder) and so forth. However my take on that is that the Redis project responded in a positive way to theoretical criticisms. I never believed it was interesting, for the kind of uses Redis was designed for, to improve our consistency story a lot. I changed my mind, and we got things like WAIT. This is a huge change: WAIT means that if you run three nodes A, B, C, where every node contains a Sentinel instance and a Redis instance, and you "WAIT 1" after every operation to reach the majority of slaves, you get a consistent system. Totally agree; what is disturbing is that in most environments where you could expect "A class" developers, sometimes the system was misused like that. Totally reasonable... thanks for sharing. |
| Re: Redis critiques, let's take the good part. | John Watson | 06/12/13 11:34 | We outgrew Redis in 1 specific use case. For the exact tradeoff Salvatore has already ceded as a possible deficiency. Some info about that in this slide deck: http://www.slideshare.net/gjcourt/cassandra-sf-meetup20130731 Besides that, Redis is still a critical piece of our infrastructure and has not been much of a pain point. We "cluster" by running many instances per machine (and in some "clusters", some semblance of HA by a spider web of SLAVE OFs between them.) We also built a Python library for handling the clustering client side using various routing methods: https://pypi.python.org/pypi/nydus Of course Nydus has some obvious drawbacks, and so we're watching the work Salvatore has been putting into Sentinel/Cluster very closely. |
| Re: Redis critiques, let's take the good part. | Aphyr Null | 06/12/13 12:07 | > WAIT means that if you run three nodes A, B, C where every node contains a Sentinel instance and a Redis instance, and you "WAIT 1" after every operation to reach the majority of slaves, you get a consistent system. While I am enthusiastic about the Redis project's improvements with respect to safety, this is not correct. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 13:14 | On Fri, Dec 6, 2013 at 9:07 PM, Aphyr Null <aphyr...@gmail.com> wrote: It is not correct if you take it as "strong consistency", because there are definitely failure modes; basically it is not as if synchronous replication + failover turned the system into Paxos or Raft. For example, if the master becomes writable again when the failover has already started, we are no longer sure to pick the slave with the best replication offset. However this is definitely "more consistent" than in the past, and probably it is possible to achieve strong consistency if you have a way to stop writes during the replication process. I understand this is not the "C" consistency of "CAP", but: before, the partition with the clients and the (old) master partitioned away would receive writes that get lost; after, under certain system models the system is consistent, e.g. if you assume that crashed instances never start again. That is not realistic as a system model, but it means that in practice you have better real-world behavior, and in theory you have a system that is going towards a better consistency model. Regards, Salvatore -- Salvatore 'antirez' Sanfilippo open source developer - GoPivotal http://invece.org We suspect that trading off implementation flexibility for understandability makes sense for most system designs. — Diego Ongaro and John Ousterhout (from Raft paper) |
| Re: Redis critiques, let's take the good part. | Matt Palmer | 06/12/13 14:04 | On Fri, Dec 06, 2013 at 11:09:30AM -0500, Jonathan Leibiusky wrote:
> One of the big challenges we had with redis in mercadolibre was size of > dataset. The fact that it needs to fit in memory was a big issue for us. > We used to have, on a common basis, 500gb DBs or even more. > Not sure if this is a common case for other redis users anyway. Common enough that I sat down and hacked together NDS to satisfy it. As you said in your other message, though, it isn't that anyone usually *plans* to store 500GB of data from the start and chooses Redis anyway, but rather that you start small, and then things get out of hand... the situation isn't helped when the developers aren't aware enough of what's going on "inside the box" to realise that they can't just throw data at Redis indefinitely -- but then, I (ops) didn't exactly give them the full visibility required to know how big those Redises were getting... - Matt -- Ruby's the only language I've ever used that feels like it was designed by a programmer, and not by a hardware engineer (Java, C, C++), an academic theorist (Lisp, Haskell, OCaml), or an editor of PC World (Python). -- William Morgan |
| Re: Redis critiques, let's take the good part. | Matt Palmer | 06/12/13 14:06 | On Fri, Dec 06, 2013 at 07:22:02AM -0800, Pierre Chapuis wrote:
> So when I read someone saying he would ban Redis from > his architecture if he ever makes a startup, I think: "good > thing he doesn't." :) I, on the other hand, just sincerely hope that whatever startup he makes is competing with mine, because if he refuses to use the right tool for the job (if Redis turns out to be the right tool for a specific use case), then I'll gladly use that tool as a competitive advantage, and I need every advantage I can get. - Matt |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 14:16 | On Fri, Dec 6, 2013 at 11:06 PM, Matt Palmer <mpa...@hezmatt.org> wrote: This is a fundamental point. If you consider systems from a theoretical point of view, everybody should use Zookeeper. It is like trying to win every war with a precision rifle: it is the most accurate weapon, however it does not work against a tank. People use Redis because it solves problems, because of the data model that fits a given problem, and so forth, not because it offers the best consistency guarantees. This is the point of view of us programmers: we try to do our best to implement systems in a great way. There are other guys, like the authors of the Raft algorithm, who try to do "A grade" work in the field of applicable distributed systems. Those people provide us with the theoretical foundation to improve the systems we are designing; however it is the sensibility of the programmer to pick the trade-offs, the API, and so forth. Companies using the right tools will survive and will solve user problems. When a tool, like Redis, starts to solve no problems, it gets obsoleted and after a few years marginalized. This is not a linear process, because fashion is also a big player in tech. Especially in the field of DBs lately there is too much money for the environment to be sane; people don't just argue from a technical point of view, there is a bit too much rage IMHO. But my optimism tells me that eventually the technology is the most important thing. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 14:37 | On Fri, Dec 6, 2013 at 8:34 PM, John Watson <jo...@disqus.com> wrote:
> We outgrew Redis in 1 specific use case. For the exact tradeoff Salvatore > has already ceded as a possible deficiency. > > Some info about that in this slide deck: > http://www.slideshare.net/gjcourt/cassandra-sf-meetup20130731 > > Besides that, Redis is still a critical piece of our infrastructure and > has not been much of a pain point. We "cluster" by running many instances per > machine (and in some "clusters", some semblance of HA by a spider web of > SLAVE OFs between them.) We also built a Python library for handling the clustering > client side using various routing methods: > https://pypi.python.org/pypi/nydus

Hello John, thank you a lot for your feedback. I seriously believe in using multiple DB systems to get the job done, maybe because my point of view is biased by Redis not being very general purpose, but I believe there is definitely value in being open to using the right technologies for the right jobs. Of course it is a hard concept to generalize: good engineers will understand with great sensibility when something new is needed, and less experienced ones sometimes make the error of throwing many technologies together when they are not exactly needed, including Redis... Thanks for the link to Nydus, I was not aware of this project. I'm adding it here in the tools section -> http://redis.io/clients

> Of course Nydus has some obvious drawbacks and so we're watching the work > Salvatore has been putting into Sentinel/Cluster very closely.

Thanks, those are the priorities of the Redis project currently! Salvatore
|
| Re: Redis critiques, let's take the good part. | Alberto Gimeno Brieba | 06/12/13 14:46 | Hi, I use redis a lot (3 big projects already) and I love it. And I know many people that love it too. For me the two biggest problems with redis used to be: - distributing it over several nodes. The problem is being solved with redis-cluster. And synchronous replication is a great feature. - the dataset size needs to fit into memory. Of course I totally understand that redis is an in-memory db, and that is the main reason redis is so fast. However I would appreciate something like NDS officially supported. There were some attempts to address this problem in the past (vm and diskstore) but in the end they were removed. I think that having something like NDS officially supported would make redis a great option for many more use cases. Many times 90% of the "hot data" in your db fits in an inexpensive server, but the rest of the data is too big, and it would be too expensive (unaffordable) to have enough RAM for it. So in the end you choose another db for the entire dataset. My 2 cents. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 14:54 | On Fri, Dec 6, 2013 at 6:03 PM, Yiftach Shoolman
<yiftach....@gmail.com> wrote: > From the commercial side, there are now a few companies with enough cash in > the bank for supporting and giving services around Redis; I'm sure this will only > strengthen its position.

This is a very important point. With Redis you were alone until recently; that's not good.

> Another point to mention is cloud deployment - I can only guess that > most of the Redis deployments today are on AWS, and managing any large > distributed deployment over this environment is a great challenge, > especially with in-memory databases. This is because: instances fail > frequently, data-centers fail, network partitions happen too often, noisy > neighbors all over, the concept of ephemeral storage, SAN/EBS storage which > is not tuned for sequential writes, etc. I can only say that due to the > competition from the other cloud vendors, SoftLayer, GCE and Azure, AWS > infrastructure is constantly improving. For instance - last year there were > 4 zone (data-center) failure events in the AWS us-east region; this year - > zero. The new AWS C3 instances are now based on HVM, and most of the BGSAVE > fork time issues have been solved.

Absolutely. In some way EC2 is good for distributed systems: it is problematic enough that it is much simpler to sense pain points and see partitions that in practice are very rare in other deployments. This is somewhat "training for failure", and that's good. But seriously, sometimes the problems are *just* a result of EC2, and if you don't know how to fine-tune for this environment you are likely to see latency and other issues that in other conditions are very hard to see at all. I'm super happy about C3 instances, but... what about EBS? It remains a problem, I guess, when AOF is enabled and the disk can't cope with the fsync policy... Thanks, Salvatore |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 15:00 | On Fri, Dec 6, 2013 at 11:46 PM, Alberto Gimeno Brieba wrote: I completely understand this, but IMHO to make Redis on disk right we need: 1) An optional threaded model. You might use it to dispatch maybe only slow queries and on-disk queries. Threads are not a good fit for Redis in memory, I think. Similarly, I believe that threads are the key for a good on-disk implementation. 2) Representing every data structure on disk in a native way. Mostly a btree of btrees or the like, but there is definitely some work ahead to understand what to use or what to implement. Currently it is an effort that is basically impossible to undertake. If I am able to continue developing what we have, now that the complexity has risen, that is already a good result: the core, cluster, sentinel, the community... So for now the choice is to stay focused on the in-memory paradigm, even if I understand this makes Redis less useful for certain use cases, since there are other DBs solving at least in part the "Redis on disk" case, but there are few systems doing the Redis work well IMHO. Thanks! |
| Re: Redis critiques, let's take the good part. | dvirsky | 06/12/13 15:00 | Just my two cents. I've been using redis in production for almost 3 years, and I've had many difficulties but many many wins with it. I think the biggest mistake I made was, being excited about it in the beginning, to use it for too many things, some of which it didn't fit.
I'm very happy with redis as: 1. a geo resolving database. 2. complex cache (where just putting a blob in something like memcache is not enough) 3. distributed event bus 4. semantic entity store.
Where I wasn't happy with it was: 1. storing data that had complex relations in it. 2. storing data that needed migration from other dbs constantly (that's not redis' fault though)
3. storing data that needed high persistence rates on EC2, although the forking problem was solved in recent generation machines. 4. having a mission critical DB that needed super fast failover. Sentinel in its original form was simply not good enough for what we needed.
5. needing cross DC replication. This has been solved in 2.8, but I needed it before. So we've been moving some things we used to do with redis to other databases, but I still love this tool and would definitely use it for new projects.
Dvir Volk
Chief Architect, Everything.me
|
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 15:02 | On Fri, Dec 6, 2013 at 7:29 PM, Josiah Carlson <josiah....@gmail.com> wrote:
> Long story short: every one of the existing data structures in Redis can be > improved substantially. All of them can have their memory use reduced, and > most of them can have their performance improved. I would argue that the > ziplist encoding should be removed in favor of structures that are concise > enough to make the optimization unnecessary for structures with more than 5 > or 10 items. If the intset encoding is to be kept, I would also argue that > it should be modified to apply to all sets of integers (not just small > ones), and its performance characteristics updated if it happens that the > implementation changes to improve large intset performance.

Hello Josiah, thanks for your contribution. I agree with you; it is exactly another case of "this is the simplest way to avoid work, given that it is good enough". This would deserve a person allocated solely to it, able to make steady progress and merge code when it is mature / tested enough to avoid disasters, since it is a very sensitive area. Cheers, |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 15:07 | Thanks Dvir, this is a very balanced message.
Certain use cases in your non-happy list probably will never be good for Redis, including complex relations. The good side is that I see other entries about issues that are finally getting solved. Just to collect a data point, about the fast failover: what were the consistency requirements and the actual failover times? A few seconds, or milliseconds, or what? The new Sentinel is faster at failing over, but could be made a lot faster (in the order of 200 milliseconds instead of the 2-3 seconds it takes now). Salvatore |
| Re: Redis critiques, let's take the good part. | dvirsky | 06/12/13 15:20 | A few seconds were fine. If you remember our lengthy discussion about it (and the rejected 1000 line-long pull request :) ) from about a year ago, the problem we had was how to do this stuff dynamically without changing config files, and without having sentinel state conflicting with Chef state. I ended up protecting the code itself from having a lost master, so the app won't fail while we do a longer failover process; And also moving the write intensive, mission critical stuff, away from redis, to cassandra. As long as I treat redis as read (almost) only, and a potentially volatile (Although we never suffered any major data loss with it) data store - all is fine.
|
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 06/12/13 15:26 | Ok, thanks for the additional context, a few seconds is already in
line with the new implementation; however, now that you say it, it is really easy to bring the failover timeout delay under a few hundred milliseconds. About the PR, I'm sorry, but I did not have enough focus / context at the time to really understand if it was a good thing or not... I really tried to take a slower evolution path where I was able to understand more during the process. Thanks for the PR anyway, and for the chats ;-) Salvatore |
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 06/12/13 15:42 | Le vendredi 6 décembre 2013 16:22:02 UTC+1, Pierre Chapuis a écrit : OK, it looks like I have an apology to make. I wanted to say that Tony had often criticised Redis. Instead I used an English expression which I clearly did not understand well. That was a really, really stupid thing to do. Moreover, even though I do not share his point of view on Redis, I think Tony is a very good engineer I respect a lot. In particular, he wrote Celluloid, which you probably know about if you are interested in distributed systems and/or Ruby. That makes me even more ashamed to have written such a terrible thing. |
| Re: Redis critiques, let's take the good part. | Aphyr Null | 06/12/13 15:44 |
A formal model and proof would go a long way towards convincing me. I strongly suspect that in the absence of transactional rollbacks, one cannot currently use WAIT to guarantee both linearizability and liveness in the presence of one or more node failures--not without careful control of the election process, anyway. |
| Re: Redis critiques, let's take the good part. | Howard Chu | 06/12/13 15:44 |
LMDB, which NDS uses, already supports btree of btrees. The major flaw in any in-memory DB design, as I see it, is the notion that there is a difference between in-memory data and on-disk data. It inherently leads to waste of CPU + memory due to redundant caches and associated management code.
|
| Re: Redis critiques, let's take the good part. | Kelly Sommers | 06/12/13 16:21 |
Descriptions like this indicate the trade-offs aren't understood, explicitly chosen and designed for, or accounted for. What is Redis trying to be? Is Redis trying to be a CP or AP system? Pick one and design it as such. From my perspective, with masters and slaves, Redis is trying to be a CP system, but it's not achieving the goals. If it's trying to be an AP system, it isn't achieving those goals either. Broken CP systems are the worst kinds of AP systems. They aren't as consistent as they intend to be, nor as available and eventually consistent as they ought to be. Now for a little tough love. I share the #2 type criticism concern Felix mentioned. Respect for the complexity of the problems that production distributed systems face (or rather the lack of it) seems to be a root problem here. This is a common theme I see repeating, even today. I don't think one can claim that "distributed systems are not super hard" while their distributed system has issues. Some people devote their entire career to this domain, and you don't just learn it in a couple of months. I post this because, like many, I want to see Redis improve, and I want to see the users I work with that use it, and everyone else, have a better experience. I think the distributed systems community is very welcoming and that Redis could benefit from some design discussions and peer review in these areas.
|
| Re: Redis critiques, let's take the good part. | Josiah Carlson | 06/12/13 16:31 | I thought your use of "frequent offender" with respect to Tony's complaints against Redis was right on :P
Whether or not he has built a lot of good stuff, Salvatore pointed out that his complaints were either FUD or missing the point of what Redis offers. Right tool for the right job and all that.
I wouldn't take it back, and I don't think that any reasonable person should have a problem with what you said. - Josiah
|
| Re: Redis critiques, let's take the good part. | Josiah Carlson | 06/12/13 16:39 | On Fri, Dec 6, 2013 at 3:02 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote: Having someone on this as their job is exactly what it needs. It's a pity Pivotal missed the boat back in July. - Josiah |
| Re: Redis critiques, let's take the good part. | Alberto Gimeno | 06/12/13 17:00 | Hi,
What about using an already working disk key-value store like leveldb, rocksdb (http://rocksdb.org), lmdb (like nds does https://github.com/mpalmer/redis/tree/nds-2.6/deps/liblmdb ), etc.? |
| Re: Redis critiques, let's take the good part. | Matt Palmer | 06/12/13 17:43 | On Sat, Dec 07, 2013 at 12:00:01AM +0100, Salvatore Sanfilippo wrote: Well, in theory we've got Posix AIO and O_NONBLOCK, but... hahahaha. No. I've pondered using bio to handle reads from disk, which would mostly just involve adding the ability for bio to notify the event loop that a particular key was now in memory (and thus running all those commands blocked on that key), but I'm keeping that in reserve for a rainy and boring weekend... For now, I recommend enabling nds-keycache, keeping an eye on your cache hit rate to make sure your maxmemory is high enough, and living with the occasional latency spike when you have to go to disk to read in a rarely-used key. Hell, if you're running Redis in EC2, you're used to huge latency spikes, right? </me ducks> I actually don't think this is a huge blocker. The time involved in deserialising a value from the packed RDB format is, I believe, a small part of the total time involved in getting a key from disk to memory -- compared to how long you spend waiting for the disk to barf up something useful, almost any CPU-oriented operation is lightning fast. True, I haven't benchmarked this, and if someone does wave a profiler at NDS and it shows that the amount of time spent in rdbLoadObject is a significant percentage of the time spent in getNDS, I'll gladly change my opinion. Until then, I'll worry more about reducing the impact of disk operations on request latency. For you, perhaps. I'm having quite a bit of fun over here shuffling data on and off disk inside of Redis. <grin> It's the beauty of OSS -- you can focus on what you think is more important / interesting, and so can everyone else. And thanks, by the way, for providing such a high-quality, easy-to-hack-on codebase to use as a starting point for my adventures. - Matt -- "After years of studying math and encountering surprising and counterintuitive results, I came to accept that math is always reasonable, but my intuition of what is reasonable is not always reasonable." -- Steve VanDevender, ASR |
| Re: Redis critiques, let's take the good part. | jberkus | 06/12/13 18:00 | On 12/06/2013 05:43 PM, Matt Palmer wrote: Actually, you'd be surprised how much time you can spend in serialization operations. It's nothing compared with reading from EBS, of course, but some people have faster disks than that; SSDs are quite affordable these days, and even Amazon has dedicated IOPS. Not that your prioritization is wrong; it's still better to spend your time where you are spending it. BTW, once we go over to disk-backed Redis, we're pretty much certain to need a better append-only log. The general approach for on-disk databases is to write first to the AOL (or WAL), and then have a background process shuffle data to the searchable representation of the database on disk; it turns out that writing to an AOL is vastly faster than writing to more elaborately structured data, even (nay, especially) on SSD. Of course, right now we don't *have* background processes ... Anyway, as an Old Database Geek, I'll speak for the Postgres community and say that we're around if you need advice on how to manage disk-based access. We have more than a little experience in this regard ;-) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com |
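The append-first pattern jberkus describes can be sketched like this (plain Python; the dict stands in for the structured on-disk store, and the whole thing is illustrative, not from any Redis branch):

```python
import json
import os

class WalStore:
    def __init__(self, path):
        self.path = path
        self.log = open(path, "a")     # sequential append-only log (AOL/WAL)
        self.store = {}                # stand-in for the structured store

    def write(self, key, value):
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())    # durable before the write is acked
        return "OK"

    def background_apply(self):
        # the "background process" pass: replay the log into the searchable
        # representation (real code would checkpoint and truncate the log)
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                self.store[rec["k"]] = rec["v"]
```

The point of the split is that the acknowledged write only ever touches a sequential append, which is the cheapest durable operation a disk offers; restructuring happens later, off the commit path.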
| Re: Redis critiques, let's take the good part. | Matt Palmer | 06/12/13 18:04 | On Sat, Dec 07, 2013 at 02:00:12AM +0100, Alberto Gimeno wrote: > > I completely understand this, but IMHO to make Redis on disk right we[...] I think the issue that Salvatore is talking about there is that with all those examples you've given, they all treat the values associated with keys as opaque blobs. Redis, on the other hand, provides its value squarely in the realm of "I know what these values are, and I have the commands necessary to allow you to manipulate them". For NDS, I've gotten around that by only allowing disk/memory granularity at the key level -- if you want any part of a key, the entire key gets loaded into memory, and then Redis works on it entirely as normal. This is hideously inefficient for very large values (hence the "Naive" in "Naive Disk Store") and performance for almost all types of values would be greatly improved if granularity increased, but what's there now works well enough for a great variety of workloads. (/me gives the sigmonster a cookie) - Matt -- > There really is no substitute for brute force. Indeed - I must admit to being a disciple of blessed Saint Makita myself. -- Robert Sneddon and Tanuki, in the Monastery |
| Re: Redis critiques, let's take the good part. | Rodrigo Ribeiro | 06/12/13 21:05 | This is a great post Antirez, redis will only improve from this kind of feedback. Well, we use redis extensively at JusBrasil and my biggest complaint is how expensive it can be to keep a large dataset highly available. One of our uses is processing user feeds. For this we have a cluster of 90 redis instances (distributed across 15 servers); 2/3 of those instances are slaves, used to read from and by our failover mechanism (similar to sentinel). The problem is that we need to use 2x more memory even if we decide not to read from slaves. Redis could have an option to run as a "cold slave", which only receives changes from the master and appends to disk (RDB+AOF or something similar to the NDS fork), keeping minimal memory usage while in this state. Then when sentinel elects it as the new master, it would load everything into memory and come back to normal execution. This would represent a huge memory reduction for our cluster; just an idea though. I also think the core development could be closer to the community's work. I understand that it is important to keep redis simple, but I see a few forks that have good contributions (eg: NDS, Sentinel automatic discovery/registration), yet not much movement to merge them into the core. |
| Re: Redis critiques, let's take the good part. | Matt Palmer | 07/12/13 00:26 | On Fri, Dec 06, 2013 at 06:00:44PM -0800, Josh Berkus wrote:While I've come to the conclusion that PIOPS are snakeoil, SSDs are quite nice -- but they're not magic. They're still not as fast as RAM or CPU. Oh, definitely. In the case of NDS, writing to disk doesn't impact performance, because that's done from memory to disk in a forked background process, but that naturally sucks because the data isn't properly durable (the use case I was addressing meant I can suffer the loss of the last few writes). For a proper disk-backed Redis, I'd be switching to something like AOF fragments to store the log, and the background process would rewrite the AOF fragments into the disk cache; on startup, this would also be done before we start serving data. Yeah, I can imagine... - Matt -- The hypothalamus is one of the most important parts of the brain, involved in many kinds of motivation, among other functions. The hypothalamus controls the "Four F's": 1. fighting; 2. fleeing; 3. feeding; and 4. mating. -- Psychology professor in neuropsychology intro course |
| Re: Redis critiques, let's take the good part. | Matt Palmer | 07/12/13 00:31 | On Fri, Dec 06, 2013 at 09:05:30PM -0800, Rodrigo Ribeiro wrote: You could definitely do this with NDS right now -- set a low nds-watermark and a huge maxmemory on the slaves, and then as part of the promotion process, set nds-watermark to 0 (turns it off) and trigger a preload. Performance will suck a little while the preloading gets everything into memory, but after that it'll feel just like normal Redis, except you'll get the benefits of NDS persistence (quick restarts, frequent but tiny disk flushes, etc). - Matt -- Judging by this particular thread, many people in this group spent their school years taking illogical, pointless orders from morons and having their will to live systematically crushed. And people say school doesn't prepare kids for the real world. -- Rayner, in the Monastery |
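For readers following along, the promotion recipe above might look roughly like this with redis-py, assuming an NDS-patched server where nds-watermark is settable via CONFIG SET. The NDS PRELOAD command here is hypothetical -- check the NDS fork's documentation for the real preload trigger:

```python
import redis

def promote_cold_slave(host):
    r = redis.StrictRedis(host=host, port=6379)
    r.config_set("nds-watermark", "0")   # stop evicting keys to disk
    r.execute_command("NDS", "PRELOAD")  # hypothetical: pull keys into RAM
    r.slaveof()                          # SLAVEOF NO ONE: become the master

promote_cold_slave("slave-1")
```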
| Re: Redis critiques, let's take the good part. | Robert Allen | 06/12/13 18:07 | Firstly, I would like to say thank you to all contributors for your time, efforts and contributions to this outstanding project. We have utilised redis for three and a half years with only one notable incident; an incident I attribute solely to a failed HA implementation not related to redis itself. Charles Eames is quoted as saying, "design depends largely on constraints." This holds true with redis and all other systems components. As consumers, we have the noble responsibility to ensure we know, define and learn the constraints of all components we deploy or develop. Our deployments of redis have grown massively over three+ years of constant use, necessitating that they be configured and tuned with workloads divided according to what they are responsible for. We do not mix persisting data, transient cache keys or sessions; we do not utilise Sentinel for HA yet (I would like to, but I am giving it more time).
In summary, I am convinced that, at this time, there are no other viable products that would fit our environment and constraints as well as redis has and will continue to for the foreseeable future.
|
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 07/12/13 00:52 | On Sat, Dec 7, 2013 at 12:44 AM, Aphyr Null <aphyr...@gmail.com> wrote: A formal model would be required indeed, but surprisingly I think that transactional rollbacks are totally irrelevant. Take Raft for example: when the system replies that your new entry was not accepted because it could not be replicated to the majority of the nodes, it actually means: "I don't know if the entry will ever be replicated; I can't guarantee it." I'm not sure about Paxos; it is possible that it has the same semantics. But anyway, regardless of the ability to internally roll back an operation that did not reach the majority, you are always facing the problem of the *client* not receiving the acknowledgement after the write was accepted. So I believe all the care must go into the failover process. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 07/12/13 02:21 | On Sat, Dec 7, 2013 at 1:21 AM, Kelly Sommers <kell.s...@gmail.com> wrote: I believe there is a place for "relaxed" CP systems. In Redis by default the replication is asynchronous, and most people will use it this way. At the same time, because of the data model, and the size of the aggregate values single keys can hold, I don't want application-assisted merges, semantically. Relaxed CP systems can trade part of the consistency properties for performance and simple semantics; I'm not sure why this is not acceptable. It is not new, either: when a sysop relaxes the fsync policy of a database to "flush every 2 seconds" he is making the same conceptual tradeoff, but for good reasons. What I mean is that everything is hard, and everything is approachable at the same time. Designing 3D engines for video games is super hard. Writing a device driver is super hard. Implementing reliable system programs is hard. However distributed systems are everywhere, including places where they were not supposed to be; better to develop skills about them at large, as a community, instead of being intimidated. I spent decades learning how to write proper C code that does not crash easily, but I don't pretend that less experienced people should be scared away from doing system programming in C. For sure a few months of exposure will not make you able to produce work like Raft or Paxos, but the basics can be used in order to try to design practical systems that can be improved over time. While we are having this conversation, maybe half the AP systems out there are running with last-write-wins wall-clock resolution for *practical* reasons, so it is not like distributed systems are an exact science when applied: there are theoretical tradeoffs, then implementation tradeoffs, then application semantics tradeoffs. My process has always been like this: publish a description of the system, make people aware of it, and finally start to write the implementation. I think this is an open process that allows for contributions. If you or other interested parties are willing to comment on the Redis Cluster design, I'll be super excited about that, seriously. Of course if you tell me: hey no no no, let's make this a true CP system regardless of the fact that this means mandatory synchronous replication to the majority of nodes, and fsyncing every operation to disk, I can't say this will be helpful. I mean, theory should not win over the intended goals of the system. However suggestions could surely help. Redis Cluster is a simple system currently, with a simple implementation; it is something we can change if needed. I'm open to suggestions... Salvatore
|
| Re: Redis critiques, let's take the good part. | Javier Guerra | 07/12/13 06:51 | On Fri, Dec 6, 2013 at 6:44 PM, Howard Chu <highl...@gmail.com> wrote: for very limited values of 'support'. After playing with it for a while, I think it's the best option for on-disk key-value libraries; but the limited key size makes it not enough out of the box for a redis-like on-disk db. A possibility I tried (just some PoC code) is what I called a "hash tree": like a hash table, but instead of an array indexed by key hashes, use the LMDB tree keyed by key hashes. IOW: take the user's key, hash it, and use that as (part of) the key to the LMDB tree. Since the keys aren't as limited as the indexes of a hash table array, collisions are _extremely_ rare, and easily solved by storing the full key with the value. I tried with 128-bit SipHash and wasn't able to get a single collision with any dataset I could get my hands on (murmur3 did get a few). Then, thinking of the 'btree of btrees' comment of Salvatore, I tried a two-key schema: the LMDB key would be (<parent key hash>,<item key hash>), and since the underlying tree preserves ordering, all the 'siblings' are consecutive and can be retrieved efficiently. ... and I stopped there, because I got distracted by some other shiny things..., ahem, I mean, paying jobs. -- Javier |
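Javier's hash-tree layout can be sketched with the Python lmdb binding. This is a rough PoC reconstruction, not his original code: blake2b stands in for 128-bit SipHash, and for simplicity it assumes user keys contain no NUL byte:

```python
import hashlib
import lmdb

env = lmdb.open("hashtree.db")

def h(key: bytes) -> bytes:
    # 16-byte digest standing in for 128-bit SipHash
    return hashlib.blake2b(key, digest_size=16).digest()

def put(parent: bytes, key: bytes, value: bytes):
    with env.begin(write=True) as txn:
        # (parent-hash, item-hash): siblings sort together in the btree
        txn.put(h(parent) + h(key), key + b"\x00" + value)

def get(parent: bytes, key: bytes):
    with env.begin() as txn:
        raw = txn.get(h(parent) + h(key))
        if raw is None:
            return None
        stored_key, _, value = raw.partition(b"\x00")
        assert stored_key == key   # resolve the (rare) hash collision
        return value
```

Because the parent hash prefixes the LMDB key, a cursor positioned with set_range(h(parent)) should iterate exactly the siblings of that parent, which is the "consecutive siblings" property Javier mentions.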
| Re: Redis critiques, let's take the good part. | Quentin Adam | 07/12/13 08:59 | Hi. To be clear: I think it's important to consider the right technology for your usage. People often use technology they like or know... And redis is moving fast, but some people don't use it for the right use case. Indeed, I'm not very comfortable with redis for sessions in a lot of use cases... like long-expiration sessions: Couchbase usage is cool there because it uses the hard drive. But Redis can be great and move fast :-) Best regards
|
| Re: Redis critiques, let's take the good part. | Alberto Gimeno | 07/12/13 10:05 | Hi,
Just an idea about the disk-based storage. Many people say that redis is like a toolset, and I agree. What about making this toolset more flexible to support hybrid approaches? For example, redis could have an interface to load keys from a secondary source. Every time lookupKey() does not find the key in the current redis database, it could use a simple pluggable interface to look it up in another source. And probably an option to move a key from redis to the other datasource. This way you could integrate redis with a disk-based key-value store or many other things without changing anything at the application level, lua scripts, etc. And redis remains as clean as possible, with minimal dependencies, but with the ability to interoperate with other storage engines or datasources. I don't know how it could be implemented: maybe at compile time by implementing a simple interface, maybe through a TCP connection with a minimal protocol... -- Alberto Gimeno http://backbeam.io/ http://twitter.com/gimenete |
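A rough sketch of what such a pluggable interface could look like (hypothetical names throughout; nothing like this exists in Redis itself):

```python
from typing import Optional, Protocol

class SecondarySource(Protocol):
    """The minimal plugin interface Alberto describes."""
    def lookup(self, key: bytes) -> Optional[bytes]: ...
    def store(self, key: bytes, value: bytes) -> None: ...

class HybridStore:
    def __init__(self, secondary: SecondarySource):
        self.mem = {}              # the normal in-memory keyspace
        self.secondary = secondary

    def lookup_key(self, key: bytes) -> Optional[bytes]:
        value = self.mem.get(key)
        if value is None:                      # miss: ask the plugin
            value = self.secondary.lookup(key)
            if value is not None:
                self.mem[key] = value          # promote into memory
        return value

    def demote(self, key: bytes) -> None:
        # "move a key from redis to the other datasource"
        if key in self.mem:
            self.secondary.store(key, self.mem.pop(key))
```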
| Re: Redis critiques, let's take the good part. | Kelly Sommers | 07/12/13 10:35 |
"Relaxed CP" is a new one I've never heard. You're either consistent or you are not. There's no such thing as "I'm a little less pregnant". Please let's not start making stuff up. Once you relax consistency, you're no longer a CP system. This denial reminds me of RavenDB's insistence that they are an ACID database while admitting they have a broken isolation model. Nobody believes this incorrect representation of ACID outside their own community. StackOverFlow is full of problems the users are experiencing due to partially provided guarantees. Denial only continues to hurt the users because the problems aren't being addressed. If Redis is going to get a good story for replication and distribution, it's not going to get there by denying the current design flaws. What's the difference between Cassandra's ConsistencyLevel.ALL and Redis WAIT? Not a lot. Cassandra is an AP system. CL.ALL in Cassie gives a "best effort" for consistency but in my experience this confuses and misleads users because some people think this means they can opt out of AP and become CP if you use CL.ALL and you will be consistent. I have to explain why this is false on a daily basis. No joke. I wish CL.ALL didn't exist. There are use cases for it, but these are few and you need to understand the nuances to be effective. Similar to ACID properties, if you partially provide properties it means the user has to _still_ consider in their application that the property doesn't exist, because sometimes it doesn't. In you're fsync example, if fsync is relaxed and there are no replicas, you cannot consider the database durable, just like you can't consider Redis a CP system. It can't be counted on for guarantees to be delivered. This is why I say these systems are hard for users to reason about. Systems that partially offer guarantees require in-depth knowledge of the nuances to properly use the tool. Systems that explicitly make the trade-offs in the designs are easier to reason about because it is more obvious and _predictable_. Redis is trying to cherry pick the best of both worlds, a master/slave system with WAIT but without the proper failure semantics to make it a true reliable CP system and with an asynchronous replication that undermines all of that. On the flip side, the asynchronous replication could do with a lot more supporting features if it is to support an AP system. What we are left with is a system that isn't good at either. It looks like a system where someone can't make up their mind what to build. Databases provide or trade-off guarantees so that applications have a set of expectations on what they can consider correct. When correct state is confusing and difficult to predict, it makes it very difficult for applications to compensate. I don't recommend any of Redis clustering or replication to people I work with because people find it hard to reason about in production systems - for good reason. This will continue until I see that the design is influenced by explicit trade-offs made and that I am confident users can use it properly. This conversation is about how to fix that, which I would love to see! It starts with a simple, but hard question to answer. What do you want a Redis cluster to be? |
| Re: Redis critiques, let's take the good part. | Matt Palmer | 07/12/13 13:16 | On Sat, Dec 07, 2013 at 07:05:50PM +0100, Alberto Gimeno wrote: Having done exactly that, I can say that there are a *lot* of places you've got to hook into Redis to make that a possibility, and the semantics involved will take some effort to genericise. There's also a fair amount of logic involved in minimising the impact of this work on the fast path of serving in-memory requests quickly (which I haven't completely solved yet, so I'm not sure how deep that particular rabbit hole will go). I would presume that anything that cuts into Redis' fundamental mission of being a very fast and consistently responsive data structure store wouldn't be well regarded as a core feature. - Matt -- When the revolution comes, they won't be able to FIND the wall. -- Brian Kantor, in the Monastery |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 07/12/13 14:12 | On Sat, Dec 7, 2013 at 7:35 PM, Kelly Sommers <kell.s...@gmail.com> wrote: Yes, when I talk about "Relaxed CP" I mean: not true CP systems, no strong consistency, but a tradeoff that is an approximation of CP. "C" Consistency in CAP requires two fundamental things to happen: 1) Replicate to the majority before acknowledging the write. 2) Sync every operation on disk in the replicas before acknowledging the node proposing a new state. Waiting for the slowest fsync among the first N/2+1 replicas is not going to work well for Redis. However I believe it is wrong to think that only strong consistency is worthwhile. For example a system may employ asynchronous replication with asynchronous acks: if an operation committed more than 1 second ago is still not acknowledged, the system stops accepting writes. This is a form of non-strong consistency that trades strong consistency for latency. Are people crazy to use such a system? I don't believe so, because a system is the sum of the performance characteristics it has while running without partitions, plus the consistency and availability characteristics it has when a partition happens. You can't evaluate a system considering only what happens in the worst-case scenario. So an application where losing writes bounded by a given maximum window is "business acceptable" will be ok with a system that performs very well during normal operations, trading consistency for this gain in performance. WAIT is a tool that you can use in general, exactly like CL.ALL; the problem, as with every tool, is that you have to understand the exact semantics. Example: WAIT can be used in order to run a Redis stand-alone instance with replicas in "CA" mode. Writes only succeed if you can replicate to all the replicas (no availability at all under partitions). WAIT can also be used, by improving the failover procedure, in order to have a strongly consistent system (no writes to the old master from the point the failure detection is positive to the end of the failover when the configuration is updated; or alternatively, disconnect the majority of slaves you can reach during the failure detection, so that every write will fail during this time). WAIT also reduces the real-world "holes" that you face if the failure detection is not designed to be safe. For people it is important how systems behave in practice. /dev/null does not have the same consistency as asynchronous replication, for example. Similarly users can say: I'm ok with a system that has excellent latency and IOPS when everything is fine but does not feature strong consistency; however, when shit happens, given that it can't guarantee strong consistency, what degree of consistency will it offer? What is the contract with the user? I find the idea that there is either "strong consistency" or nothing incorrect; the AP systems you cite are a perfect example of that. Wall-clock last-write-wins is a model, but there are better, more costly models, and so forth. From the point of view of weak CP systems you can see this in terms of the kind of partition you have to create for inconsistencies to arise. There are systems where, of all the possible partitions and failures, only a small subset will create inconsistencies; there are other systems that are affected by a larger subset.
To reply with a counter-example that is as pointless as your pregnancy example: car safety is not binary, with cars in which I can't be killed in an accident and cars in which I die. Different cars have different safety levels, and if you want to drive faster, you are more exposed. Yes, there are applications where data loss is totally acceptable if it is an exception that happens with a given probability and with given results in terms of the amount of writes lost. There are instead applications where data loss is unacceptable, so losing one write or 10 writes is the same, and there you need a CP system. That's pretty simple: Redis Cluster can't be CP because the performance would be unacceptable for the way most people use Redis. However Redis could be optionally CP for some operations, and I believe that WAIT is a start in that direction: not enough, more work is needed in the leader switch to make the process safe. But that's optional, so let's reason from: no synchronous replication by default. Redis also can't accept a distribution model where values need to be merged, since values are easily sorted sets of two billion elements, or the like. Timestamping each element with a logical clock is crazy, for instance, and the time needed to analyze and merge such big values can be seriously long, with semantics that are not trivial to predict in real use cases. So: no synchronous replication, no merges. What are the guarantees it should be able to provide? The best consistency that can be achieved while surviving certain partitions. By certain partitions I mean: the majority of masters, with at least a slave for every hash slot, should be able to continue operations. My feeling is that under the above assumptions the best model is a CP model with a bounded maximum window for losing writes. The tradeoff of Redis Cluster also changes the guarantees that clients in the majority partition and clients in the minority partition have. Salvatore |
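For concreteness, here is what the WAIT pattern discussed above looks like from a client, using redis-py against a server that implements WAIT (2.8 unstable at the time of this thread); the recovery action on timeout is application-specific:

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379)

r.set("balance:42", 100)
# Block until at least 1 replica acknowledged the write, 100ms timeout;
# the return value is the number of replicas that actually acked.
acked = r.execute_command("WAIT", 1, 100)
if acked < 1:
    # Undetermined outcome: the write may or may not ever reach a replica.
    print("WAIT timed out: retry, or treat the write as possibly lost")
```

Note that, exactly as discussed in the thread, a low return value is a false-negative-style result: it does not mean the write was rolled back, only that its fate is unknown.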
| Re: Redis critiques, let's take the good part. | dvirsky | 07/12/13 15:21 | I don't understand what the big deal about strong consistency in redis is. People use redis for one main reason: it's super fast and you can hack your data model. As someone said before, if I wanted strong consistency I'd use ZooKeeper. Actually I'm using it, but in places where 10 writes per second are acceptable.
|
| Re: Redis critiques, let's take the good part. | Matt Palmer | 07/12/13 18:23 | On Sun, Dec 08, 2013 at 01:21:17AM +0200, Dvir Volk wrote: People don't understand tradeoffs. Short of some sort of magical faster-than-light, instantaneous-calculation-and-storage technology, you don't get to have everything you want; but that doesn't stop people from wanting it, nor does it make them cast a cynical and critical eye over every technology they consider to ensure it meets their needs. That isn't helped by developers (understandably) talking up their features and not being quite so loud about the limitations, but I know from experience that even 72pt font in <blink> tags saying "this software doesn't do X" won't stop people from sending you an e-mail complaining about how the software you made freely available doesn't do X. Oooh... kinda like this (from "14 Ways to Tick off a Writer" http://blog.pshares.org/index.php/14-ways-to-tick-off-a-writer/): Read ten pages of the author's book. Realize that it's absolutely not for you: you thought it was a zombie story, and it's actually historical fiction about Alexander Graham Bell. Go on Goodreads anyway, and give it one star for not being a zombie story. - Matt -- I was punching a text message into my phone yesterday and thought, "they need to make a phone that you can just talk into." -- Major Thomb |
| Re: Redis critiques, let's take the good part. | Kelly Sommers | 08/12/13 01:57 | A couple of points about #2. Firstly, there are many ways to optimize disk usage when processing transactions. Doing an fsync per operation is a naive approach that won't be very successful. Even before SSDs, if you study many databases, there are many optimizations used. One (but not limited to) example is coalescing transactions. Most good databases do more transactions than the 120-ish IOPS a rotational disk can offer. The key to the durability guarantee is not to acknowledge a transaction until it's written. However, if there are tens of thousands (or more) of concurrent transactions, you can commit them in a single fsync and acknowledge them all. There are also papers on how to write high-performance WALs (write-ahead logs). I'm not going into extensive detail here, but studying current systems and the state-of-the-art research can be helpful. You do not have to flush disk buffers per database operation; that is overkill. You can make whatever optimizations you want (lots of papers and prior-art implementations covering different approaches) so long as the transaction response does not lie and the system state is correct. Secondly, #2 is not true. Nothing about a CP system requires disks. You can have an in-memory-only system that is a CP system. If a node has to re-sync with another for some purpose, it must not be capable of becoming the master (someone who is up to date should be the master) or of responding to read requests. This is a CAP trade-off being made. Even D (durability) in ACID isn't restricted to fsyncing to disks. I suggest reading Jim Gray's papers on the topic. Writing to a disk is a form of data replication, just like writing to another node is a form of data replication. Disks die just like nodes do. Disks can write out of order and corrupt data too. There's no such thing as "CA mode". I recommend reading this wonderful post by Henry Robinson from Cloudera, more specifically item #10 related to "CA". I highly recommend reading the whole thing.
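A minimal sketch of the group-commit idea Kelly describes (illustrative Python, not taken from any real database): concurrent writers queue their records and block; a single flusher appends everything queued, does one fsync, and only then releases all of them, so no acknowledgement lies about durability while fsyncs are amortized across many transactions.

```python
import os
import threading
import time

class GroupCommitLog:
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.lock = threading.Lock()
        self.batch = []                    # (record, done_event) pairs
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, record: bytes):
        done = threading.Event()
        with self.lock:
            self.batch.append((record, done))
        done.wait()                        # ack only after the group fsync

    def _flusher(self):
        while True:
            with self.lock:
                batch, self.batch = self.batch, []
            if not batch:
                time.sleep(0.001)          # naive pacing; real WALs use condvars
                continue
            os.write(self.fd, b"".join(rec for rec, _ in batch))
            os.fsync(self.fd)              # one fsync covers the whole group
            for _, done in batch:
                done.set()
```

With 120 fsyncs/second and a few hundred transactions per group, this easily clears the raw IOPS ceiling of a rotational disk without weakening the durability contract.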
Because you're trying to pretend to be a CP system (without being one) with things like WAIT, you will have a horde of users not understanding what a failed WAIT that writes to 1 node but not 2 nodes means. The ones who do understand what it means (after some pain in production) will learn that this operation doesn't work as expected and will have to treat WAIT as having AP-like semantics, similar to CL.ALL. It's really important that a transaction tell the truth about what happened and that the expectations are intuitive to the users. If that is not the case, users will struggle to reason about the system, and their application code will have incorrect expectations and potentially lack the necessary compensations. This can all lead to applications producing incorrect state.
100% agree that for some applications it's acceptable to provide the highest performance with the risk of losing data. However Redis doesn't currently present itself as either a predictable CP or a predictable AP system. It needs to be a predictable _something_. I am not suggesting that you make Redis a CP system. There are many varying designs, not only the ones you or I have mentioned so far. Regardless of the available choices, you need to explicitly decide what type of system you are building and acknowledge that in the design.
I don't think WAIT is designed correctly, especially in the failure scenarios.
The same theme exists for the persistence problem that is also discussed in this thread. It's blurry what kind of database Redis is trying to become in the future. Does Redis want to cater to the needs of consistency, or availability, and/or durability? What problems does it want to solve moving forward? I hear suggestions of people trying to hack storage engines into 50 different places in Redis because it's not designed (and wasn't intended) to be a disk-based system. Hacking these things together isn't the right approach. If it's going to be a durable disk-based system, it should be designed as one _properly_. Both of these problems, whether you choose to support them or trade them off for other benefits, require a holistic approach with explicit trade-offs and decisions accounted for in the design and implementation. Database engineers are faced with a ton of trade-off decisions that ultimately decide what kind of system the database presents itself as and what it's good at. |
| Fwd: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 08/12/13 02:46 | Kelly sent me this via private email by mistake, but it was intended
to be public, so here is my reply: [snip] > commit them in a single fsync and acknowledging them all. There are also... Redis already does this when fsync = always. Let's forget for a moment that the latency of acks alone is already too much: for CP systems without disks, what kind of system model are we assuming? There are three processes A, B, C. Process A replicates to B, receives the acknowledgement, and replies ok to the client since the majority was reached. Process A fails, and at the same time process B reboots. Process B becomes available again after the reboot; there is a majority, B and C, that can continue; however, the write is lost. There is no way to reach strong consistency in a system model where RAM is volatile and processes can restart, without using external storage that guarantees that certain state is durable. "CA" mode is often a way to refer to systems that are not partition tolerant but consistent. When we talk of "CA" we are actually outside the whole scope of the CAP theorem, so if you prefer we can just call them consistent systems that are totally unable to handle partitions. Raft faces the user with the same exact tradeoff: when Raft replies that it failed to replicate to the majority, it really means that the result is undetermined. The entry may be applied or not. This is hard to avoid without making the algorithm more complex, and somewhat pointless to avoid, since you always have the case where the system is not able to send the ACK to the client before a partition; so you have to re-apply operations anyway, or, if partitioned away as a client, live with the undetermined state of your operation. I think that Raft's semantics are good enough for most use cases: if the reply is positive, we guarantee the operation will be retained; if the reply is negative, you can't count on it. If the write is idempotent you usually retry it, but there is always the case of the client that is partitioned away at this point, and will live with the undetermined state for an unbounded time before the partition heals. I think that this is actually quite clear in the design, even if not stated in CAP terms. The consistency, since it is not strong, can be considered eventual, because when the partition heals there actually is agreement about the state. However this agreement only selects a single timeline among all the possible ones, so it means that it is possible to lose data, like in last-write-wins AP systems. However, some care in the distributed system orchestration tries to reduce the window for losing data to a minimum. That is the default. With WAIT we just shrink that window further; there are fewer failure modes, out of all the partitions and failures the system can face. With future work on the failover process of the cluster it will be possible, probably, to also ensure strong consistency. So Raft faces the same issue, if you think about it; I don't believe this to be a valid point. > That being said, the success of Redis has come from the in-memory > performance that Redis provides. I think it's logical to continue on that > path. Do I think Redis could be a high performing disk-based system with the > right engineering? Yes. Will it be slower than a purely in-memory based > solution? Of course. But like the CAP trade-offs, you can't have the _best > of both_ worlds. Currently I've no interest in making Redis on-disk, as I said multiple times. Software is not just engineering, but also culture.
I don't like Redis-on-disk as a system; I want to make the performance side an extreme, not a compromise, so memory only. Maybe in the future with a pluggable storage engine... > I suggest doing more research because there are a lot more options than the > ones you are suggesting so far in this thread. Database architecture and > distributed systems are both deep topics. Sometimes I wish they weren't so > that it is easier to cover but I guess that's what makes them interesting :) I'll surely do my research, but I'm not a Right Thing person. What I mean is that I'll try to provide what I can, to the best of my capabilities, right now, making clear what the tradeoffs are. This is how it has always worked with Redis. As my vision improves, I try to transfer it into the code. I'm open to improvements to the Redis Cluster design (or whatever else) that allow retaining the same goals with improved consistency. Lacking suggestions, I'll try to do my best to improve it over time, as long as people use it and I enjoy working on it, at least. Cheers, Salvatore |
| Re: Redis critiques, let's take the good part. | dvirsky | 08/12/13 03:47 |
How does that work in a single-threaded model? You mean an entire transaction request? |
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 08/12/13 04:24 | Just a few points on this whole CA / CP issue. First, about this example: "There are three processes A, B, C. Process A replicates to B, receives [...]" There are at least two ways to solve that. The first one is to assume fail-stop: process B cannot reboot by itself. And if both A and B die, then it's the "uh oh I have lost over half of my cluster" issue that will always exist anyway. Even with disks, if two machines out of three are obliterated by a nuclear strike, you can lose data... Another solution is to use a write quorum of 3 (every replica must receive all writes). I think that is actually what the original Dynamo paper was doing. Also, on CAP: There is no useful (*) distributed CA system. CA means partitions cannot happen, which means a single-node system. But then can we really say it is highly available? (*) /dev/null is CAP, hence the "useful" qualifier. It *is* possible to make a non-integral trade-off between C and A, but then you stop calling them Consistency and Availability and say Harvest and Yield instead: http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf I don't think those ideas are really useful for a datastore though (but they are for a search engine, for instance). As for Redis Cluster... Kelly is completely right in that the problem is to define what Redis is. If I had to name *one* property of Redis that makes people use it, it would be performance (low latency (*), high throughput), for both reads and writes. This is basically the reason why it is an in-memory system. (*) If you use it correctly, i.e. with mostly O(1) operations, Tony Arcieri would say :) I don't think it will be possible to keep these properties with a CP system. Inter-node network latencies will be deadly. So I just don't think it makes sense to try to make Redis Cluster CP. |
| Re: Redis critiques, let's take the good part. | dvirsky | 08/12/13 04:47 |
Adding to that, a bit off topic in this sense, but on topic in the broader sense of "what redis is":
We need to also keep in mind that redis cluster removes a few useful (to me) aggregate commands like ZINTERSTORE, SINTER, SUNION, etc. If we want them in a cluster, an access node has to be added, adding latency and consistency problems of its own, but that's another story.
So right out of the gate redis cluster will not be for everyone. Having said that, I don't have any statistical data about the actual use cases of redis out in the world; I suspect most people won't care about them.
But it seems the whole cluster focus keeps redis in a bit of a split-brain state (pun intended :)), where it is two (somewhat) different beasts: one if you're using it as a classic master/slave system and doing your own sharding if needed, another if you're using it in cluster mode.
I'm not saying redis should take one way or the other, though, just pointing it out. I don't think I'll be using redis cluster any time soon.
|
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 08/12/13 05:25 | On Sun, Dec 8, 2013 at 1:24 PM, Pierre Chapuis wrote: I'm assuming crash-recovery, which is a lot more similar to reality. In the crash-recovery model, without persistent state, even when the majority is available again, and even if the write was acknowledged, the write is lost, so you can't achieve strong consistency. The number of replicas has nothing to do with that, because you can assume a mass reboot in the crash-recovery model. Otherwise you have to make your system model more forgiving and assume that only a minority of processes fail and restart. Btw Redis Cluster assumes fail-recovery in the failover procedure right now, so the new state is synched on disk before replying to other nodes, for everything related to the hash slots configuration, epochs, and so forth. This is why I use "CA" to refer to systems that provide no availability during partitions but are able to stop working instead of providing wrong results when partitions happen. It is just a way to name things; if "CA" is not the best name, we can call them something else, but I find it helpful to have a common name for those kinds of systems. Honestly the O(1) thing is a misconception. Redis provides excellent performance and latency for: 1) O(1) operations. 2) Logarithmic operations (most basic sorted set operations, including ZRANK). 3) O(N) seek + O(M) work (for example LTRIM), every time you can make sure to keep M small. Example: capped collections implementation. 4) O(log(N)) seek + O(M) work (for example removing ranges of elements from a sorted set). The Redis single-threaded model results in big latency for O(N) operations with large "N", of course. The most important cases where this was a big problem were KEYS * or SMEMBERS & similar stuff. Now there is a solution with SCAN & co. The slow O(N) operations are conceived for two use cases: 1) When called against collections that are small enough that the latency is acceptable, considering that the constant times are very small. 2) When called in a context where Redis is used as a computational node in an asynchronous way. I agree, as already stated, that the default operations can't be CP; however I would be enthusiastic about an optional CP mode based on WAIT that is able to do its work without affecting the other clients, and I think this is achievable in accordance with the other goals exposed. Salvatore |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 08/12/13 05:31 | On Sun, Dec 8, 2013 at 12:47 PM, Dvir Volk <dvi...@gmail.com> wrote: This is extremely easy to provide in an event-driven programming model. The Redis event loop is designed to guarantee that if a client was processed for the "readable" event, it is not processed for the "writable" event in the same loop. So this is what happens: 1) Fsync is set to always. 2) Multiple clients, in every event loop cycle, try to write to the database. 3) When we write, instead of calling fsync, we just set a flag. 4) At the end of the event loop cycle we are sure that no reply was yet delivered to clients that performed a write, because no writable event was fired for any of the clients we processed a write for. 5) Redis ae.c has an event loop function called "before sleep" that is invoked before re-entering the event loop for the next cycle. 6) If we find that we need to fsync, we do it there. So we grouped all the clients trying to write in a given event loop cycle into a single fsync. No added latency, and under load a huge decrease in the number of fsyncs performed. Redis Cluster uses the same trick in order to rewrite the nodes.conf file a single time before replying to the other nodes. Salvatore |
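The six steps above, condensed into a toy single-threaded loop (illustrative Python, not Redis' actual ae.c code; client.send() is a stand-in for queueing the writable event):

```python
import os

class MiniLoop:
    """Toy model of the grouped-fsync trick: writers only set a flag and
    defer their replies; one fsync in before_sleep covers the whole cycle."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.need_fsync = False
        self.pending_replies = []   # clients waiting for their ack

    def handle_write(self, client, payload):
        os.write(self.fd, payload)  # append to the AOF-like log
        self.need_fsync = True      # step 3: set a flag instead of fsyncing
        self.pending_replies.append((client, b"+OK\r\n"))

    def before_sleep(self):
        # steps 5-6: one fsync for every write of this cycle, then replies
        if self.need_fsync:
            os.fsync(self.fd)
            self.need_fsync = False
        for client, reply in self.pending_replies:
            client.send(reply)      # stand-in for firing the writable event
        self.pending_replies.clear()
```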
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 08/12/13 06:03 | On Sunday, December 8, 2013 2:25:52 PM UTC+1, Salvatore Sanfilippo wrote: > There is no useful (*) distributed CA system. CA means partitions [...] This looks like the definition of CP to me. It will prefer to stop working instead of compromising consistency for the sake of availability. Yes, completely true. O(1) is an over-simplification. The point is that it is low-latency if you take care not to run a command that could take forever. But basic operations are faster than e.g. a SQL database (if nobody else is blocking the server). > I don't think it will be possible to keep these properties with a [...] Why not... I think Brewer said explicitly in a paper (which I cannot find right now) that CAP is to be understood for some data at some point in time. So you can have a system that is CP for some data and CA for other data, and you can have a system that switches between CA and CP modes. But I think that this "CP" mode should be seen as a bonus, and should not hinder the "natural" distributed mode for Redis, which is AP. |
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 08/12/13 06:04 | On Sunday, December 8, 2013 3:03:10 PM UTC+1, Pierre Chapuis wrote: Of course you should read AP instead of CA in that paragraph... |
| Re: Redis critiques, let's take the good part. | Kelly Sommers | 08/12/13 09:47 |
Bingo :) One thing I realized when I started building 1,000+ node production clusters is that "partitions" happen all the time, and many times they have nothing to do with networking equipment. The experience I gained with these systems absolutely opened my eyes to what the papers, and others who build even bigger systems than I do, have been saying for a long time. Once it hits you smack in the face it's hard not to learn it.
Agreed. To do so, you're going to want a Redis cluster to scale in a predictable fashion. This means that when I add N nodes I can have a general idea how much capacity that is adding to my cluster. Creating a system where all nodes talk to each other on every single write (or coalesced batch) isn't going to scale well. This directs us towards bounding the number of nodes included in a transaction and partitioning the data so that the cost is somewhat constant. I've seen some people trying to scale Redis too far, and while I would love a better story for larger scale, I think creating a system that does well with high performance and small node sizes, and that behaves predictably, is a better goal for the short term. With that experience in the project, the knowledge for taking it to the next scale grows. It's important this gets communicated properly so that people don't misuse it, though. I'm going to guess you meant "AP" and not "CA", since you correctly identified that above :) I agree that CP for some data and AP for other data can definitely be advantageous if the user is able to understand the nuances and the requirements of their data properly. This sets clear expectations of what the system does with a piece of data, even though the two modes behave differently. However! Switching between CP and AP for the same data means you are basically an AP system. From the perspective of the actors and observers of the system, they can't trust the system to ever be correct, so they must assume that AP mode happens anyway. A CP system means that actors and observers have a set of guarantees. If those can be traded off, then the application must account for the trade-off. Even more problematic, if this toggle is done with a command like WAIT, a misbehaving application can cause incorrect state for well-behaved applications. We must consider the serializability implications when CP can be circumvented. Again, I'm not promoting CP systems here; I'm just trying to clarify the implications of these suggestions, because I don't think they are thought out. My 1,000-node clusters are AP systems; I try to make the trade-offs where they make sense :)
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 08/12/13 10:46 | On Sunday, December 8, 2013 6:47:14 PM UTC+1, Kelly Sommers wrote: I am not sure how misbehaving applications should be taken into account. After all, a misbehaving application could also decrement your counter when it shouldn't, canceling your increment somehow... But I agree that we should not make it extremely hard for users (i.e. application developers) to understand the guarantees provided by the system. I admit I do not understand them well myself. Antirez cites Raft as an example, but Raft is all about leader election. In Redis Cluster the guarantees that Raft offers are apparently not there, and the WAIT command cannot provide them anyway. For instance, imagine you have 5 replicas (A to E). You would think that by using WAIT 3 you would be safe. But if you perform three operations (1 to 3) on three different clients, acknowledged respectively by nodes (A, B, C), (C, D, E) and (B, D, E), then you *do not* have the certainty that any node has seen all three operations:
A -> 1
B -> 1, 3
C -> 1, 2
D -> 2, 3
E -> 2, 3
If the master fails then, how do you pick the new master? It looks to me as if only WAIT N, where N is the *total* number of replicas (here 5), could offer real guarantees in the event of master failure. And even then, I am not sure it would be enough. I may be wrong though, because I don't understand the Cluster replication algorithm. Maybe if Antirez could publish an explanation of how it works and the assumptions it makes (comparable to the Raft paper and associated lecture slides) it would answer a lot of the questions people have. But I can understand this would be a *lot* of work... |
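Pierre's scenario is easy to check mechanically in a few lines of Python: three writes, each acknowledged by a 3-of-5 quorum, yet no single replica has seen all three.

```python
# Which nodes acknowledged each operation (from Pierre's example).
acks = {1: {"A", "B", "C"}, 2: {"C", "D", "E"}, 3: {"B", "D", "E"}}

# Invert: which operations each node has seen.
seen = {node: {op for op, nodes in acks.items() if node in nodes}
        for node in "ABCDE"}
print(seen)   # A: {1}, B: {1, 3}, C: {1, 2}, D: {2, 3}, E: {2, 3}

# No node saw all three operations, despite every WAIT 3 succeeding.
assert not any(ops == {1, 2, 3} for ops in seen.values())
```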
| Re: Redis critiques, let's take the good part. | Pierre Chapuis | 08/12/13 10:58 | On Sunday, December 8, 2013 7:46:33 PM UTC+1, Pierre Chapuis wrote: TBH there is http://redis.io/topics/cluster-spec which I should read more attentively :) It is not as clear as what exists for Raft, but it could contain the answer to my question, which is: what exactly does WAIT guarantee? Is there any way a write followed by a successful WAIT 3 in a cluster of 5 could not be acknowledged by the next master? |
| Re: Redis critiques, let's take the good part. | Mark Papadakis | 08/12/13 12:49 | Group commit does wonders for write-op throughput. It's somewhat non-trivial to get right in a multi-threaded environment. The MariaDB devs have a nice writeup that describes how it works in their implementation and what kind of performance and efficiency it provides. You may want to Google for that. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 08/12/13 14:15 | On Sun, Dec 8, 2013 at 7:46 PM, Pierre Chapuis wrote: I cited Raft as an example of a CP system with false negatives. It ensures that a positive reply means the entry will be applied to the state machine, but it does not offer guarantees of the opposite when a negative reply is provided to the client. This totally makes sense IMHO, for a number of reasons, in the case of Raft. For Redis Cluster + WAIT to be consistent, you have to improve the failover process in at least two ways. You may already know that the failover is performed by slaves in Redis Cluster. It requires two basic things: the slave that wins the election should only failover if it can get an agreement from N/2 other slaves (so, itself included, there is a majority), and the acknowledgement, from the point of view of the slaves, means: "I'll stop acknowledging writes to the master (but I'll continue to process the replication stream) until a new version of the configuration is available for the slots I'm replicating." This means that no writes with WAIT set to N can be accepted during the failover process, and that we are guaranteed to select the next master from the majority of slaves, so we are sure at least one slave must have the last write, if we select the one with the greatest replication offset. The exact same thing could be implemented in Redis Sentinel as well. I don't think something like that will be available in the first version of Redis Cluster, so in the short term WAIT will only have the effect of improving the consistency guarantees provided by Redis Cluster, without providing strong consistency. When we mix the Redis Cluster partitioning scheme with CP, what happens is that every set of replicas serving a given hash slot is actually a CP system per se, so this is the set of nodes where you need a majority. To avoid that during normal operations, Redis Cluster allows having just one replica and still performing the failover; the trick is to use, as the majority that versions new configurations, the full set of masters available. Long story short, if the design directly targeted only a CP system, the failover could be made *just* in terms of a given set of replicas, instead of involving the whole cluster. It is as if you took a CP system based on Raft or Paxos or whatever, ran N systems like that, and partitioned your keys in ranges across the N systems. Cheers, |
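The election rule Salvatore describes, reduced to its core (toy Python; real Redis Cluster does much more, including the majority agreement and the write freeze):

```python
def pick_new_master(reachable_slaves):
    """Among the slaves reachable by the majority, the one with the
    greatest replication offset wins, so the last acknowledged write
    is guaranteed to be on the winner."""
    return max(reachable_slaves, key=lambda s: s[1])

# (name, replication offset) pairs for the slaves the majority can see
print(pick_new_master([("s1", 1040), ("s2", 1190), ("s3", 1150)]))
# -> ('s2', 1190)
```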
| Re: Redis critiques, let's take the good part. | Kelly Sommers | 08/12/13 15:30 |
What you've discovered here isn't specific to Raft or Redis. Almost all transaction approaches suffer from this problem. It can be described simply as: the client is part of the distributed system. This is a point that often gets lost. As one example, a two-phase commit transaction has the same problem. What happens if the TCP socket between the transaction coordinator and the client disconnects while the server is sending the success acknowledgement? As far as the client is concerned the transaction failed, but it may have succeeded. This problem exists because the client isn't considered part of the transaction scope, but it definitely can be, and there are transaction models where the client is included. As you can imagine, this comes at a cost. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 09/12/13 00:25 | On Mon, Dec 9, 2013 at 12:30 AM, Kelly Sommers <kell.s...@gmail.com> wrote: Yes, this is obvious, but from what you said about WAIT I believed it was useful to make it clear: so, assuming the failover is safe, the semantics of 1 node vs 2 is handled as with other systems: retrying most of the time, or dealing with the indeterminacy. Note that I did not assume that your sentence above had something to do with the failover properties, because with a broken failover the WAIT semantics are not CP *even* if it returns N-1, with N being the total number of replicas. Btw, the client being part of the distributed system is a different problem in the case of Raft; this is why I mentioned Raft. Raft can provide you with a *false negative* even if there is no partition between the client and the leader; that's the point. However the rationale for allowing this behavior is that, since this partition between the client and the leader can happen anyway, you have to handle it one way or the other. The same applies to WAIT. |
| Re: Redis critiques, let's take the good part. | javier ramirez | 09/12/13 03:23 | On 07/12/13 01:00, Alberto Gimeno wrote:
What about using an already working disk key-value store like leveldb, rocksdb (http://rocksdb.org), lmdb (like nds does https://github.com/mpalmer/redis/tree/nds-2.6/deps/liblmdb ), etc.? FWIW, I attended a talk by Basho last week and they were talking about the upcoming features of riak. One of the new features is data types similar to redis (lists, hashes, sets...) but running on riak, so with replication and persistence baked in. This piqued my curiosity, so I went to talk to the Basho people after the talk, to see what can be done and how it was implemented. The relevant part for the discussion here is that if you want to use the type system you need to choose the riak LevelDB store, so it would seem possible to implement types on something derived from leveldb. The thing is, on riak you don't have both paradigms at the same time: either you are using the memory store, or you are using the type system, which uses LevelDB. After talking to their engineers, I decided the right tool for us right now is still redis. I like redis very much the way it is, and in my opinion the minimalistic approach is one of redis' killer features. I for one prefer to see the future of redis as the best in-memory data store (which I think it is right now) rather than trying to cover several areas and not being the best in all of them. Cheers, j |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 09/12/13 03:27 | Hello Javier,
I believe that storing data structures on a btree, where the first "level" of the btree is the key, is easy to achieve if you have small data structures. I think AP stores like Riak are adding support for some data structures, but those are not conceived to be used like you use them in Redis, that is, with millions of elements in a single data structure. Redis-on-disk with values bound to a given size is easy to accomplish. The "diskstore" branch was a good approximation: working more on that, we would be there. The problem is that my feelings about a "capped Redis" on disk are not great. So what you could do is use a file for every b-tree. This works as long as you don't have many millions of keys; otherwise you have to understand whether the filesystem is designed to cope with that. Salvatore |
| Re: Redis critiques, let's take the good part. | dvirsky | 09/12/13 04:49 | Just as a reference, CQL on top of Cassandra adds container columns using wide rows, meaning that if you have a dictionary field, what's really written is a column with the name "mydict:key" and the value. They have lists and sets the same way. This allows atomic persistence operations on a small part of the value.
So this can be done, at a price. But of course Cassandra's model is different, and I don't know the details of how it touches the filesystem in this manner. Plus I'm not sure how it will scale to millions of keys with millions of sub-keys.
|
| Re: Redis critiques, let's take the good part. | Aphyr Null | 09/12/13 09:40 | > Example: WAIT can be used in order to run a Redis stand-alone instance Please note that WAIT provides, in the context of the CAP theorem, exactly zero of consistency, availability, and partition tolerance. Labeling it CA or "relaxed CP" is misleading at best and dangerous at worst. > Because you're trying to pretend to be a CP system (but not one) with things like WAIT, you will > means. The ones who do understand what this means (after some pain in production) will learn Precisely. WAIT is *not* a consensus algorithm and it *can not* provide serializable semantics without implementing some kind of coherent transactional rollback. > "CA" mode is often a way to refer to systems that are not partition I have yet to encounter any system labeled "CA" which actually provided CA. This should not be surprising because CA has been shown to be impossible in real-world networks. Please read http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf. > Raft faces the user with the same exact tradeoff, when Raft replies You have not implemented or described RAFT's semantics in Redis, and failing to understand how Redis WAIT differs from RAFT, VR, multipaxos, etc is a dangerous mistake. Consensus protocols are subtle and extremely difficult to design correctly. Please consider writing a formal model and showing verification by a model checker, if not a proof. > I think that Raft semantics is good enough for most use cases Please don't claim these are equivalent designs. In particular, the RAFT inductive consistency constraint is not present in the current or proposed WAIT/failover design. Without a similar constraint you will not be able to provide linearizability. > I'll surely do my research, but I'm not a Right Thing person. What I Please consider choosing a proven consistency model and implementing it, instead of rolling your own. Alternatively, consider documenting that Redis can easily lose your data. I see an awful lot of people treating it as a system of record rather than a cache. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 09/12/13 12:06 | On Mon, Dec 9, 2013 at 6:40 PM, Aphyr Null <aphyr...@gmail.com> wrote:

I used "CA" and "relaxed CP" in totally different contexts. Btw, if we don't like those names, we can just talk about concepts. In the above sentence, what I mean is that WAIT is a tool you can use to get real guarantees in real contexts, like the one described above, that is, a master + N slaves setup. Operations acknowledged by WAIT N make the user aware that the write was accepted by all the replicas. When the master fails, if there is a manual failover in which the master is taken down and a random replica is restarted as master, this guarantees you the obvious property that all the writes for which you received a positive acknowledgement are retained by the system.

Nobody claimed that, but WAIT can be used as one of the building blocks to mount a system featuring strong consistency (a good start could be a failover process that does not accept writes during the failover, and that is guaranteed to elect the slave with the highest replication offset). What I claim here is that the point is not transactional rollbacks, see later.

The above sentence was about a specific issue: false negatives. Apparently you also agree that without transactional rollbacks you can't mount a CP system. How is WAIT returning a non-majority count different from a Raft false negative, from the point of view of transactional rollbacks? If you get a positive reply, the write is accepted; if you get a negative reply, you don't know, and can retry. This was in the above context: false negatives. Salvatore |
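For readers following along, a minimal sketch of the write-then-WAIT pattern discussed above, assuming a server built from the unstable branch (where WAIT lives at the time of this thread) and the redis-py client; execute_command is used because the client may not expose WAIT natively, and the hostname and key are placeholders.

    import redis

    r = redis.StrictRedis(host="master.example.com", port=6379)
    r.set("balance:42", "100")
    # Block until at least 1 replica acknowledges, or 100 ms pass.
    acked = r.execute_command("WAIT", 1, 100)
    if acked >= 1:
        print("write reached the master and at least one replica")
    else:
        print("no replica ack: the write may live on the master only")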
| Re: Redis critiques, let's take the good part. | Yiftach | 09/12/13 13:20 | In 2002, when the first CAP theorem paper was published, 40 msec and even 500 msec of database latency was acceptable to 99% of the apps on earth. I'm talking on a daily basis with companies who have decided to migrate from DynamoDB (the holy grail of AP systems) to Redis, because with 10-20 msec average latency (40 msec at the 95th percentile) their application just cannot work! And this is when it runs on the strongest dedicated EC2 instances with ultra-fast SSD (not available to the public).
Redis should neither be built to serve the 1000-node cluster scenario nor to comply with all the corners of a CP system. IMO, "relaxed CP" is when the probability of reaching these corners is smaller than the probability of a major infrastructure failure.
|
| Re: Redis critiques, let's take the good part. | Matt Palmer | 09/12/13 13:46 | On Mon, Dec 09, 2013 at 11:23:41AM +0000, javier ramirez wrote:

Good to see others are seeing the value in a data structures server. I can definitely see the value in being able to operate on more complicated data structures inside the Riak paradigm, although it's going to start getting awfully tricky if you still want to use a conflict resolution algorithm more complicated than LWW (last-write-wins). Redis still has a place, though, in the high-throughput / low-latency arena. We just trialled using Riak to store some metrics data for M/R querying, and try as we might, we couldn't get it to keep up with the write rate for even *one* of the incoming data streams, let alone the full set we wanted to point at it. - Matt -- School never taught ME anything at all, except that there are even more morons out there than I would have dreamed, and many of them like to beat up people smaller than they are. -- SeaWasp in RASFW |
| Re: Redis critiques, let's take the good part. | Kelly Sommers | 09/12/13 13:53 |
This is why I keep saying that the holistic design around the trade-offs, from top to bottom, is important. Some of these data structures are possible in Riak because of the CRDT research and Riak's implementation of it; see the paper "A comprehensive study of Convergent and Commutative Replicated Data Types".
|
| Re: Redis critiques, let's take the good part. | Aphyr Null | 09/12/13 13:54 |
> When master fails, if there is manual

I'm not sure how to state this any clearer. The problem is not false negatives. The problem is the lack of an inductive constraint on leader election plus log replication. If you really want to insist on claiming WAIT prevents false positive acks, I'd be happy to attempt an existence proof of this problem in the next installment of Jepsen. |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 09/12/13 15:02 | On Mon, Dec 9, 2013 at 10:54 PM, Aphyr Null <aphyr...@gmail.com> wrote:

WAIT alone can't be evaluated without the failover properties. If you want to evaluate WAIT with a good failover procedure, you can do the failover manually:

1) Install 5 nodes, master + 4 slaves, of normal instances (Redis unstable).
2) Write to the master with WAIT 2.
3) Consider acknowledged every write for which WAIT returns 2 or more (accepted by the majority of nodes).
4) If the master is down, do a manual failover: stop the master completely, issue INFO on all the slaves, check which one has the most recent replication offset, and turn it into a master. (A sketch of this step follows this message.)

Note 1 about 4: As long as N/2+1 nodes are available (so the master and one more slave can fail), we are guaranteed to retain all the writes performed on the master for which WAIT returned 2 or more.

Note 2 about 4: If a node acknowledged a given write, because of how WAIT works, it also acknowledged having received all the previous writes.

I have not analyzed the above system in depth, but I can't find a trivial failure mode. Redis Cluster and Redis Sentinel are currently both unable to provide the same guarantees as the manual failover procedure described above, for different reasons. However, a key idea to implement this could be that, instead of making sure the master does not come back available (which is very hard), it is possible to tell N/2 nodes to stop acknowledging writes before the next master switch. That way, if the master comes back available, it will not be able to reach the majority, and none of its writes will be acknowledged. |
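A sketch of step 4 with redis-py, under the assumption that the dead master is already fully stopped and that all the slaves are reachable; the hostnames are placeholders.

    import redis

    slaves = [redis.StrictRedis(host=h, port=6379)
              for h in ("slave1", "slave2", "slave3", "slave4")]

    # INFO replication on a slave exposes slave_repl_offset: the highest
    # offset identifies the slave with the most recent data.
    best = max(slaves,
               key=lambda s: s.info("replication")["slave_repl_offset"])

    # SLAVEOF NO ONE: promote it to master (redis-py sends this when
    # slaveof() is called with no arguments). The remaining slaves must
    # then be repointed to the new master.
    best.slaveof()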
| Re: Redis critiques, let's take the good part. | Aphyr Null | 09/12/13 15:29 |
0.) This presupposes the existence of strong coordination for the failover process itself.
1.) This precludes any recovery from a failed or isolated primary node. All nodes must halt until the primary is reachable by whatever system is coordinating failover.
2.) Even if strong coordination about the order of shutdown and takeover were possible, this system is not linearizable. Can you guess why? |
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 09/12/13 16:12 | On Tue, Dec 10, 2013 at 12:29 AM, Aphyr Null <aphyr...@gmail.com> wrote:

Yes, the idea is that this strong coordinator could be elected among the slave nodes, by voting for a node only if its vote request carries a replication offset greater than your own (and a majority would be needed to get elected). If you win the election and a given entry was replicated to the majority, then you should have the entry.

Not sure why: a majority of slaves agreed not to ack writes, so the primary can come back reachable, but clients writing to it will only see unacknowledged writes. Slaves will not ack again until a slave wins the election and gets promoted.

No, sorry, I have not analyzed the system very well, but I can't find trivial-to-spot reasons why it is not linearizable, assuming we talk about a single hash slot and not the system as a whole. I'm interested to understand why. Salvatore
|
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 10/12/13 00:53 | On Tue, Dec 10, 2013 at 1:12 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

About non-linearizability, perhaps it does not apply to the case where a strong coordinator exists, but in the general case one issue is that we can't just read, because a stale master could reply with stale data, breaking linearizability. There is a trick to force the read to be acknowledged that could work:

    MULTI
    INCR somecounter
    GET data
    EXEC
    WAIT <n>

I'll check this better, because a safer master switch could be one of the applicable things in Redis Cluster, at least when there are more than two replicas per hash slot. |
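The same trick sketched with redis-py, assuming WAIT is available; the INCR bundles a write with the read, so a stale master that has lost its majority cannot return the required number of acks. Key names and the majority count are illustrative.

    import redis

    r = redis.StrictRedis(host="master.example.com")
    p = r.pipeline(transaction=True)   # MULTI ... EXEC
    p.incr("somecounter")              # dummy write attached to the read
    p.get("data")
    _, value = p.execute()
    # A majority ack implies the master was not stale when it served the read.
    if r.execute_command("WAIT", 2, 100) >= 2:
        print("acknowledged read:", value)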
| Re: Redis critiques, let's take the good part. | Marc Gravell | 10/12/13 11:22 | I came all the way back to the top post, because I kinda feel that the thread has gone astray on the CAP things. Which isn't to diminish those things in any way: just - it is now going around in circles. The reality is that while CAP is interesting, it isn't the only feature a product needs, and comments along the lines of "what does redis want to be when it grows up?" are pretty condescending IMO.

If I had to give a list of things that cause me pain in redis, I would say:

- "keys" et al: which is now a solved problem with "scan" (sketched after this message)

So actually, most of the things that *are actual real problems for me*: already in hand. The transaction model takes a little getting used to, but once you get into the "assert, try, redo from start if fail" mindset it is a breeze (and client libraries can do things to help here), so I don't count this as a plus or a minus, just a "difference". Of course, when this isn't practical, Lua allows the problem to be approached procedurally instead.

For the good: redis is a kickass product with insanely fast performance and rock-solid reliability, even when under sustained and aggressive load. The features are versatile, allowing complex models to be built from easy-to-understand primitives. We love you :p Marc On 6 Dec 2013 13:53, "Salvatore Sanfilippo" <ant...@gmail.com> wrote:
Hello dear Redis community, |
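For reference, the SCAN pattern Marc alludes to, sketched with redis-py (the key pattern is illustrative): SCAN walks the keyspace incrementally with a cursor, so the server does a bounded amount of work per call instead of blocking the way KEYS * does.

    import redis

    r = redis.StrictRedis()
    # scan_iter drives the SCAN cursor under the hood; COUNT is a hint
    # for how much work each round trip should do.
    for key in r.scan_iter(match="user:*", count=500):
        print(key)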
| Re: Redis critiques, let's take the good part. | Pieter Noordhuis | 10/12/13 11:33 | WAIT in isolation doesn't give any guarantees, it seems, only information about the state of the slave links. A concept like WAIT only becomes useful when action is taken on its failure. Right now, if a call to WAIT returns an undesirable result, it is up to the user to figure out what to do next. In my opinion, there is an opportunity for Redis to do the right thing instead and provide bounds on how many writes can be lost. Looking at the set of operations that Redis currently supports, we find the following:

- For one single key, there must be only one process taking writes for it, or the linearizability property is violated. As soon as there is more than one process taking writes, there is no way these processes can converge, because of the ordering requirement on operations (think of a list push; there exists no merge function with a predictable result).

- Following this observation, selecting which process is going to take writes for a single key needs a majority vote. If a majority vote cannot be achieved, the system must halt. If it doesn't, we again end up with the possibility of multiple processes taking writes and the absence of linearizability.

- AP is out of the question for Redis, in its current form.

It looks like a failed WAIT needs to be followed by the process taking writes halting, i.e. no longer accepting writes (a sketch of this follows below). A failed WAIT means that writes are no longer replicated to a majority of slaves. This in turn means that the system can partition in a way where the process taking writes and its slaves are separated from a majority of slaves. This majority can then elect a new master and start taking writes, violating linearizability, since it is not allowed to have more than one process taking writes for a single key.

The fact that Redis uses asynchronous replication means that it can't be a pure CP system, which is what people in this thread have been arguing for/against. I think it can be a CP system with bounds on write loss (can this still be called a CP system?). The bound is defined by the time the process taking writes continues to take writes without a majority acknowledgement. This is only possible when the process taking writes halts. Otherwise, there are very few guarantees that can be made towards retention of writes. These statements reflect my understanding of the domain; please tell me if/where I'm wrong. Cheers, Pieter On Tue, Dec 10, 2013 at 12:53 AM, Salvatore Sanfilippo |
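A minimal sketch of that halting behavior, enforced at the application layer since Redis itself did not enforce it; the majority and timeout values are illustrative, and the hostname is a placeholder.

    import redis

    r = redis.StrictRedis(host="master.example.com")
    MAJORITY, TIMEOUT_MS = 2, 100    # e.g. 5 nodes: master + 4 slaves
    halted = False

    def replicated_write(key, value):
        global halted
        if halted:
            raise RuntimeError("halted: majority of replicas unreachable")
        r.set(key, value)
        if r.execute_command("WAIT", MAJORITY, TIMEOUT_MS) < MAJORITY:
            # Stop taking writes: this is what bounds the write loss.
            halted = True
            raise RuntimeError("write not acknowledged by a majority")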
| Re: Redis critiques, let's take the good part. | Kelly Sommers | 10/12/13 12:45 |
I definitely didn't mean that comment as condescending, and if it came across that way to anyone, I sincerely apologize. I'm sorry. The intent was to say that deciding what type of system Redis wants to be is important in the decision making, to reduce the opposing functionality that makes the system more unstable than it needs to be (which addresses some of your concerns). Your list below contains a lot of items I hear from many of the customers I work with; they are definitely a common theme, at least from my perspective. I'm not suggesting that's what the focus should be, though. If those are to be improved, the concerns I pointed out, and Aphyr elaborated on, are part of the solution. I don't know why there's pushback on the topics I raised, because they are involved in 3 of the 4 bullet items you want fixed. Redis Cluster currently has features that were broken in the last implementation, and potentially broken even worse in the "fixed" implementation. Some features conflict with and undermine each other, preventing them from working properly. If you don't want blips under maintenance in a larger scalable cluster, without replication quirks, this all requires a sound distributed-system implementation. I can't stress enough how picking trade-offs that undermine each other negates most of their benefits and causes problems. The good news is this can be improved :)
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 10/12/13 15:10 | Hello, I wrote a better description of the "toy" distributed system I
proposed as a starting model. It is a toy, since it uses a powerful entity that is actually impractical, but the idea is that it is possible to remove it with careful changes. https://gist.github.com/antirez/7901666 This was mostly an exercise for me; still, analyzing this trivial model I found a trivial improvement I can make to the Redis replication process that results in a better degree of data safety. I'm opening an issue right now. Salvatore |
| Re: Redis critiques, let's take the good part. | Howard Chu | 10/12/13 17:34 | While you're on the subject of replication, I suggest you read RFC 4533 (LDAP Content Sync Replication) to get some ideas. Currently your replication protocol's resync after a node disconnect/reconnect is far too expensive. Inventing new replication protocols is a loser's game, especially when a lot of dedicated/determined people have already done the hard work. Learn from the mistakes and lessons of the work that has come before you. (And for an even better method, read up on OpenLDAP's enhancement of RFC 4533, Delta-Sync replication.) |
| Re: Redis critiques, let's take the good part. | dvirsky | 11/12/13 00:52 | On Wed, Dec 11, 2013 at 3:34 AM, Howard Chu <highl...@gmail.com> wrote: That's not how replication works in Redis anymore: Redis 2.8 does not do a full resync on reconnect.
|
| Re: Redis critiques, let's take the good part. | Howard Chu | 11/12/13 21:56 | Even so - in reality, there is no difference between "single-master with failover" and "multimaster", but the Redis protocol doesn't maintain enough state to track multiple masters. Which is why it so easily loses data in a failover condition. |
| Re: Redis critiques, let's take the good part. | Wayne Brantley | 11/12/13 22:32 | First, I think this is a very open discussion and very helpful. Here are my thoughts:

1) It has to be 100% memory-backed (the database is held in memory during operation). People constantly suggest a disk-backed Redis and you always say no. There is even a project that implements it, and you still say no. Heck, start with that design and bolt it on (optionally). I really think you need to reconsider. You could have a 'memory only' mode as well as a disk-backed mode. There would be trade-offs, but there are always trade-offs; the difference is it would be up to me (not you) to decide on them.

2) HA support is weak. Heck, you guys know this; this entire thread has turned into that. The system needs to scale horizontally too, though. I want HA to ensure my Redis is up and running. If there is a failure, it should fail over, and when something comes back online, add the capacity back in. This should all be easy to set up and make work, just dead simple. The how-to should read something like this: http://www.couchbase.com/couchbase-server/scalability

3) The publish/subscribe model cannot be backed by a list. It would be nice if, when publishing to a channel, that channel could be backed by a list, so subscribers can get messages sent while they were offline. Additionally, a fan-out type of publish/subscribe feature would be more than welcome. (Note this would mean I could subscribe to Redis keyspace notifications and not worry that I missed some because my client was not subscribed!) (A sketch of an application-level approximation follows this message.)

4) Publish/subscribe cannot guarantee a message was processed. Sort of related to #3, but if a message is consumed by a subscriber, I should be able to require an acknowledgement of the message before it is removed from the list in #3.

5) >> I also think the core development could be closer with the community work. I understand that is important to keep redis simple, but I see few forks that have good contributions (eg: NDS, Sentinel automatic discovery/registration), yet not much movement to merge in the core.

I agree with the prior poster on this. As an example, there are 215 pull requests! That is an open source dream: all those pull requests ready to go. They are not all so complex that you need to study them for years, nor so sweeping that you disagree with their premise. Some are simple spelling mistakes, etc. It does not look like a healthy open source project. People want to help, change, and add features: let us/them!

Great product and nice to see some movement and improvements! Thanks for your time and for listening/considering. |
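Point 3 can be approximated at the application level today; a hedged sketch with made-up key names, where every publish also goes to a capped backing list so a reconnecting subscriber can replay what it missed. Note the small window between replay and subscribe, which a real design would close with sequence numbers.

    import redis

    r = redis.StrictRedis()

    def handle(message):
        print("got:", message)

    def publish_durable(channel, message, backlog=1000):
        r.lpush("backlog:" + channel, message)         # keep history
        r.ltrim("backlog:" + channel, 0, backlog - 1)  # cap its size
        r.publish(channel, message)                    # live notification

    def subscribe_with_catchup(channel):
        # Replay oldest-first (LPUSH stores newest at index 0).
        for old in reversed(r.lrange("backlog:" + channel, 0, -1)):
            handle(old)
        ps = r.pubsub()
        ps.subscribe(channel)
        for msg in ps.listen():
            if msg["type"] == "message":
                handle(msg["data"])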
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 12/12/13 02:53 | Hello,
some update about a possible improvement in the Redis Cluster design that apparently does not conflict with any of the other main constraints: http://antirez.com/news/68 It is far from a complete proposal, but the main idea is covered in the blog post. As usual, the evil is in the details, so this requires careful study before being applied. Salvatore On Wed, Dec 11, 2013 at 12:10 AM, Salvatore Sanfilippo |
| Re: Redis critiques, let's take the good part. | Matt Stancliff | 12/12/13 14:21 | Let's just say Redis is opinionated. "Mass storage" hasn't really been a primary goal of Redis, in the same way many Redis features aren't prioritized by HDFS. The document model of Couch allows it to have very fine-grained replication; Redis doesn't quite work that way.

I don't think Redis HA is as weak as people believe, though. You can always construct very specific edge cases where systems will catastrophically and irrevocably fail. People are imagining Redis under those scenarios instead of some real-world workloads (many companies love to have just a replicated memcache so their primary DB doesn't get a thundering herd when a single cache server dies due to hardware or network errors). "Add the capacity back in" assumes we're running under a cluster scenario where additional nodes add additional capacity. This isn't quite relevant to non-alpha Redis yet. Redis Sentinel can handle some finer nuances of HA management right now, though.

Every time I think I want to use Redis PubSub, what I really end up needing is a list with consumers popping from it. Live chat? PubSub. Work queue system? List. Live debug status being broadcast to listeners? PubSub. History of error log broadcasts? List. It's easy to say "PubSub should persist until all my consumers have read all the messages!" But how does Redis know when you're done reading everything (or whether all consumers are online? how long do messages persist waiting "to be read")? Redis PubSub is for live notifications of things, not dead queuing notifications of things. We have to build systems using the behaviors of our infrastructure. If Redis doesn't make you happy computationally, there are other caching/queuing/persistent-notification systems out there that would love to store and forward your data. Redis isn't *quite* One Process To Rule Them All (yet).

Using Redis now, we can emulate that behavior by deleting the element from the list when it's processed. You may have to set up multiple lists: one for "To Be Processed", then move items to "Processing", then delete from Processing when each is done, otherwise re-add it to "To Be Processed" (a sketch of this idiom follows this message). Becoming a full message broker isn't necessarily a goal here. Check out work queue systems built on top of Redis for more examples.

There are only so many hours in the day. One pull request can easily consume hours (2 to 40) of review and integration and style changes and testing. Trying to merge every one of those pull requests would take over 100 days of work. Sometimes pull requests are somebody's weekend project they think is useful. Sometimes it's a company who added new commands to Redis and wants to share. Sometimes it's somebody who rewrites 50% of Redis and wants to see things done in a different way. Every issue requires some communication back-and-forth and consideration from multiple points of view. It takes a while. (Plus, people always want new features and current bug fixes and live support from the same person, too.) Make sense? -Matt |
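A sketch of that multi-list idiom with redis-py; the queue names and the process function are illustrative. RPOPLPUSH makes the move atomic, so a crashed worker leaves its job recoverable in the processing list rather than lost.

    import redis

    r = redis.StrictRedis()

    def process(job):
        print("processing", job)

    def work_once():
        # Atomically move one job from pending to processing.
        job = r.rpoplpush("jobs:pending", "jobs:processing")
        if job is None:
            return
        try:
            process(job)
            r.lrem("jobs:processing", 1, job)   # ack: remove for good
        except Exception:
            r.lrem("jobs:processing", 1, job)   # give it back:
            r.lpush("jobs:pending", job)        # redo from start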
| Re: Redis critiques, let's take the good part. | Josiah Carlson | 13/12/13 13:41 | Matt said everything I wanted to say, along with several things that I wouldn't have mentioned. <3 - Josiah
|
| Re: Redis critiques, let's take the good part. | TPoise | 13/12/13 19:43 | My unsolicited thoughts: 1) I'd estimate 90% or more of the real people who use Redis are simple GET/SET users with an occasional LPUSH. Basically memcache with persistence. They can count the number of slaves on one hand. Most of these people use the system with no problem, and you never hear from them on this mailing list, because Redis simply works flawlessly in this scenario. 2) There is a small minority of users who tend to be the most vocal. These are the same people with fsync=always but using EBS or some other high-latency cloud disk provider, trying to do a million SETs a second and wondering why Redis can't keep up. 3) A disk-persisted Redis sounds like it would make a lot of people happy, but what would make a Redis disk system any different from, say, LevelDB, RocksDB, or LMDB for that matter? Performance is Redis' main selling point. Take away performance and Redis is just another key-value store, and there are already plenty of those out in the world. 4) I personally wish antirez would re-evaluate his stance towards an official Windows port. I understand the goals of keeping a clean code base. But Windows users are numerous, and they aren't afraid to spend money on licenses and support. An official Windows port sold for $100-200 per node would be chump change to those spending thousands already for the OS. SQL Server is what, $7k per CPU core nowadays? If anything, use the money from Windows users to subsidize more developers for the main Unix branch. Heck, even ServiceStack is about to start charging money just to have access to the C# client for Redis! |
| Re: Redis critiques, let's take the good part. | Matt Palmer | 14/12/13 12:25 | On Fri, Dec 13, 2013 at 07:43:58PM -0800, TPoise wrote:

The difference is the network-accessible, atomic data-structure manipulation. LevelDB, RocksDB, LMDB, and KyotoCabinet are all local-only, and while the likes of KyotoTycoon and memcachedb are network-accessible, they still only provide opaque values to work with. Also, disk-persisted doesn't *have* to mean dog-slow and poor-performing. Hell, Redis is *already* disk-persisted; it just has a very innovative way of doing it that doesn't cause everything to grind to an almighty halt.

Sounds like a business opportunity you could explore. Presumably, Salvatore doesn't need help in making sufficient money to support himself and the continuing development of Redis (and if he does, I would encourage him to e-mail me privately). One of the benefits of open source software is that people other than the original author of a piece of software can adapt it and commercialise it in a niche they consider capable of supporting a commercial enterprise. I have no interest in Redis for Windows, but I do have a great interest in more commerce surrounding truly open source software. - Matt -- Java/XML are the hammer and the Internet is the thumb. -- rone, in a place that does not exist |
| Re: Redis critiques, let's take the good part. | jberkus | 14/12/13 13:17 | On 12/07/2013 12:26 AM, Matt Palmer wrote:
> On Fri, Dec 06, 2013 at 06:00:44PM -0800, Josh Berkus wrote:
>> Actually, you'd be surprised how much time you can spend in
>> serialization operations. It's nothing compared with reading from EBS,
>> of course, but some people have faster disks than that; SSDs are quite
>> affordable these days, and even Amazon has dedicated IOPS.
>
> While I've come to the conclusion that PIOPS are snakeoil, SSDs are quite
> nice -- but they're not magic. They're still not as fast as RAM or CPU.

Really? I've got a few clients using PIOPS for Postgres with some fairly good results. Doesn't affect Redis one way or the other, of course. Anyway, it's not that IO gets as fast as RAM (although that could be coming with persistent RAM). It's that it gets fast enough that serialization time is a significant portion of write or read time.

> Oh, definitely. In the case of NDS, writing to disk doesn't impact
> performance, because that's done from memory to disk in a forked background
> process, but that naturally sucks because the data isn't properly durable
> (the use case I was addressing meant I can suffer the loss of the last few
> writes).

Yeah. That's often acceptable if you meet two conditions: a) the user has a way of estimating how much they lost, and b) the DB comes back up without manual hackery. Even Postgres has the option "synchronous_commit = off", which sacrifices a measurable amount of durability in trade for better performance on systems with high IO latency (like AWS). There are two separate goals here:

- automated crash recovery, and
- synchronous durability of writes

These two goals are not the same thing, and there are ways to achieve either one without the other. Just something to keep in mind. (The corresponding Redis knobs are sketched after this message.)

> For a proper disk-backed Redis, I'd be switching to something like AOF
> fragments to store the log, and the background process would rewrite the AOF
> fragments into the disk cache; on startup, this would also be done before we
> start serving data.

Yeah, that's the tried-and-true approach. Don't underestimate the difficulty of getting this right, though. --Josh -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com |
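For concreteness, the two goals Josh separates map onto existing Redis AOF settings; a sketch via redis-py's CONFIG SET (these are standard redis.conf options, not new ones).

    import redis

    r = redis.StrictRedis()
    # Automated crash recovery: the append-only file is replayed on restart.
    r.config_set("appendonly", "yes")
    # Synchronous durability: fsync every write ("always"), or accept a
    # roughly one-second loss window for speed ("everysec"), much like the
    # Postgres synchronous_commit = off trade-off mentioned above.
    r.config_set("appendfsync", "everysec")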
| Re: Redis critiques, let's take the good part. | Salvatore Sanfilippo | 14/12/13 13:31 | On Sat, Dec 14, 2013 at 10:17 PM, Josh Berkus <jo...@agliodbs.com> wrote:

Hello Josh, I could say that the above is what made me discontinue the "diskstore" thing. It was based on the idea that you serialize/deserialize to the file, while Redis is based on the idea that a set composed of 5 elements, or 50 million, is the same. The two concepts don't play well together :-) So in my opinion, Redis on disk "done right" requires the following introductions:

1) An on-disk representation of the data structures; the minimum requirement is a btree-of-btrees, or at least use cases for which one-key one-file is fine.
2) A multi-threaded model to serve requests.
3) A different model for replication, since the current one is highly dependent on the in-memory tricks.

Once you do this you get a store with the same data model as Redis, but actually it is no longer quite Redis. Still a worthwhile project, I believe. Cheers, Salvatore

To "attack a straw man" is to create the illusion of having refuted a proposition by replacing it with a superficially similar yet unequivalent proposition (the "straw man"), and to refute it — Wikipedia (Straw man page) |
| Re: Redis critiques, let's take the good part. | Arnaud Granal | 16/12/13 04:58 | On Sat, Dec 14, 2013 at 11:31 PM, Salvatore Sanfilippo wrote:
> [...]
> 2) Multi threading model to serve requests.

Hi,

I might be conservative in my opinion about this, but Redis is a big part of my life now too (and I am happy with the bride so far :o)).

- Base Redis features, including replication, should use as little disk as possible. Redis is not multithreaded, so if you have a lot of queries, you need to run one instance per core (or more). In reality, you may end up with xx (xxx?) Redis instances on one machine. However, because replication uses the disk, it's impossible to replicate 2 or 3 instances from machine A to machine B without freezes on machine A. This makes things such as Redis Sentinel or Redis Cluster unusable.

That said, I consider redis-nds a very appreciable effort, but I don't think it should be integrated the way it currently is. Arnaud.