A few months after VM started to work, my feelings about it were no
longer very good. I stated on the blog, and privately on IRC,
especially talking with Pieter, that VM was not the way to go for the
future of Redis, and that the new path we were taking, using less
memory, together with cluster, was a much better approach.
However, there are a number of different models for dealing with
datasets bigger than RAM in Redis. Just to cite a few:
1) virtual memory, where we swap values to disk as needed (the Redis
Virtual Memory way);
2) storing data on disk in a complex form, so that operations can be
implemented directly on the on-disk representation, and using the OS
cache as a cache layer for the working set (let's call it the MongoDB
way);
3) storing data on disk, but not for direct manipulation, and using
memory as a cache of active objects, flushing writes to disk when
these objects change.
It is now clear that VM is not the right set of tradeoffs: it was
designed to be pretty fast, but the price to pay for everything else
was too big: slow restarts, slow saving, and in turn slow replication,
very complex code, and so forth.
If you want pure speed with Redis, in memory is the way to go. So, as
a reaction to the email sent by Tim about his unhappiness with VM, I
used a few vacation days to start implementing a new model, which is
what was listed above as number "3".
The new set of tradeoffs is very different. The result is called
diskstore, and this is how it works, in a few easy-to-digest points.
- In diskstore, key-value pairs are stored on disk.
- Memory works as a cache for live objects. Operations are only
performed on in-memory keys, so data on disk does not need to be
stored in complex forms.
- The cache-max-memory limit is strict. Redis will never use more RAM,
even if we have 2 MB of max memory and 1 billion keys. This works
since now we don't need to keep all keys in memory.
- Data is flushed to disk asynchronously. If a key is marked as dirty,
an IO operation is scheduled for this key.
- You can control the delay between modifications of a key and disk
writes, so that if a key is modified many times in a short period, it
will be written to disk only once.
- Setting the delay to 0 means: sync it as fast as possible.
- All I/O is performed by a single dedicated thread that is
long-running and not spawned on demand. The thread is awakened with a
condition variable (see the sketch after this list).
- The system is much simpler and saner than the VM implementation, as
there is no need to "undo" operations on race conditions.
- Zero start-up time... as objects are loaded on demand.
- There is negative caching. If a key is not on disk we remember it
(if there is memory to do so), so we avoid accessing the disk again
and again for keys that are not there.
- The system is very fast if we mostly access our working set, and
this working set happens to fit in memory. Otherwise the system is
much slower (I/O bound).
- The system does not support BGSAVE currently, but it will, and, what
is cool, with minimal overhead and memory usage in the saving child,
as data on disk is already written using the same serialization
format as .rdb files. So our child will just copy files to obtain the
.rdb. In the meantime the objects in cache are not flushed, so the
system may use more memory, but since it's not based on copy-on-write,
it will use very, very little additional memory.
- Persistence is *PER KEY*: this means there is no point-in-time persistence.
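To make the single-IO-thread point concrete, here is a minimal sketch
of the idea in C. It only reflects what the points above describe; the
names (io_jobs, schedule_io_job) are made up for illustration and this
is not the actual dscache.c code:

    #include <pthread.h>

    static pthread_mutex_t io_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t io_cond = PTHREAD_COND_INITIALIZER;
    static int io_jobs = 0;            /* pending disk operations */

    /* The single long-running IO thread: it sleeps on the condition
     * variable until the main thread signals that work was queued. */
    void *io_thread(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&io_mutex);
            while (io_jobs == 0)
                pthread_cond_wait(&io_cond, &io_mutex);
            io_jobs--;
            pthread_mutex_unlock(&io_mutex);
            /* ... perform one disk read or write here ... */
        }
        return NULL;
    }

    /* Called from the main thread when a key gets dirty. */
    void schedule_io_job(void) {
        pthread_mutex_lock(&io_mutex);
        io_jobs++;
        pthread_cond_signal(&io_cond);
        pthread_mutex_unlock(&io_mutex);
    }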
I think that the above points may give you an idea about how it works.
But let me stress the per-key persistence point a bit.
LPUSH a 0
LPUSH b 1
LPUSH a 2
So after these commands we may have two scheduled IO operations
pending: one for "a" and one for "b".
Now imagine "a" is saved, and then the server goes down, Redis is
brutally killed, or the like. The database will contain a consistent
version of "a" and "b", but the version of "b" will be old, without
the "1" pushed.
Also, currently MULTI/EXEC is not transactional, but this will be
fixed: at least inside a MULTI/EXEC there will be a guarantee that
either all values or none will be synched to disk (this will be
obtained using a journal file for transactions).
Some more details. The system is composed of two layers:
diskstore.c -- implements a trivial on-disk key-value store
dscache.c -- implements the more complex caching layer
diskstore.c is currently a filesystem-based KV store. It can be
replaced with a B-TREE or something like that in the future if needed.
However, even if the current implementation has a big overhead, it's
pretty cool to have data as files, with very little chance of losing
data or corruption (rename is used for writes).
But if this does not scale well enough, we'll drop it and replace it
with something better.
The current implementation is similar to bigdis. 256 directories
containing 256 directories each are used, for a total of 65536 dirs.
Every key is put inside the dir addressed by SHA1(key) translated to
hex; for instance key "foo" is at:
/0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
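To make the addressing scheme concrete, here is a small sketch of how
such a path can be derived. This is not the actual diskstore.c code,
just a reconstruction of the scheme described above, assuming OpenSSL
for the SHA1:

    #include <stdio.h>
    #include <string.h>
    #include <openssl/sha.h>

    /* The first two bytes of the hex SHA1 digest select the two
     * directory levels; the full digest is the file name. */
    void key_to_path(const char *key, char *path, size_t pathlen) {
        unsigned char digest[SHA_DIGEST_LENGTH];
        char hex[SHA_DIGEST_LENGTH*2+1];
        int j;

        SHA1((const unsigned char*)key, strlen(key), digest);
        for (j = 0; j < SHA_DIGEST_LENGTH; j++)
            sprintf(hex+j*2, "%02x", digest[j]);
        snprintf(path, pathlen, "/%.2s/%.2s/%s", hex, hex+2, hex);
    }

    int main(void) {
        char path[64];
        key_to_path("foo", path, sizeof(path));
        /* prints /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 */
        printf("%s\n", path);
        return 0;
    }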
The cool thing is, diskstore.c exports a trivial interface to Redis,
so it's very simple to replace it with something else without touching
the internals too much.
Stability: the system is obviously in alpha stage; however it works
pretty well, without obvious crashes. But be warned: it will crash
with an assert if you try to BGSAVE.
To try it, download the "unstable" branch and edit redis.conf to
enable diskstore. Play with it. Enjoy a Redis instance that starts in
no time even when it's full of data :)
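For reference, the directives involved should look something like the
following (cache-max-memory and cache-flush-delay are the names used
in this thread; the enabling directive and exact syntax may differ, so
check the sample redis.conf shipped with the unstable branch):

    diskstore-enabled yes
    cache-max-memory 100mb
    cache-flush-delay 10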
Feedback is really, really appreciated here. I want to know what you
think, your impressions of the design and tradeoffs, and how it feels
when you experiment with it. If you want to see the inner workings,
set the log level to "debug".
The goal is to ship 2.4 ASAP with VM replaced by a good implementation
of diskstore.
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotle
Hello Andy, thanks for your feedback and your questions, they are
very interesting.
> What happens if I have a producer (or perhaps many producers) of data that
> add dirty data to Redis faster than the Diskstore IO thread can write it?
> Specifically, what if the dirty data approaches the max memory limits? Will
> the memory footprint grow temporarily until those values are persisted, then
> free itself?
Redis will start evicting keys that are not dirty from the cache, but
when it eventually runs completely out of memory relative to the max
memory limit, it will start blocking, waiting for disk I/O to complete
so that memory can be reclaimed. Currently this is not done
incrementally enough and will be improved, but there are no other
solutions.
This basically means that Redis will survive peaks without problems,
but if writes are sustainedly faster than IO, eventually there is
nothing to do but wait for IO to complete :)
But I think it's a good idea that the new semantics use
cache-max-memory as a *hard* limit.
The only reason Redis will use more memory than that is to process a
command that involves values larger than the max memory setting itself
(for instance, you set maxmemory to 1mb and then do an LPUSH against a
list that in memory is 2mb; but this is not a problem, I think).
> What about if I have a large data structure? One with thousands of items in
> a list or zset. If this key is written to frequently (multiple times a
> second), will the IO thread get backed up by having to write such a large
> structure over and over again?
It is mainly up to you, as there is a setting called cache-flush-delay.
If you set it to, for instance, 10, then the first time you write
against the big zset it will get queued for disk IO, with the time for
completion set to now+10 seconds. The following writes will not add
new entries to the queue, so after 10 seconds the key will be
transferred to disk, and then, if there is a new operation against
this key, it will be queued again. The effect is that the key will be
written to disk every 10 seconds if there is a continuous stream of
writes against it.
So it's a clear trade-off between durability and speed.
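A minimal sketch of this coalescing logic, assuming the queue is a
simple list (the names here are made up for illustration; the real
dscache.c may be organized differently):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    struct ioentry {
        char key[64];
        time_t flush_at;          /* when this key should hit the disk */
        struct ioentry *next;
    };

    static struct ioentry *io_queue;
    static int cache_flush_delay = 10;   /* seconds */

    /* Called on every write: a key already in the queue is not queued
     * again, so a hot key is written at most once per delay window. */
    void mark_dirty(const char *key) {
        struct ioentry *e;
        for (e = io_queue; e; e = e->next)
            if (strcmp(e->key, key) == 0) return;  /* already scheduled */
        e = malloc(sizeof(*e));
        snprintf(e->key, sizeof(e->key), "%s", key);
        e->flush_at = time(NULL) + cache_flush_delay;
        e->next = io_queue;
        io_queue = e;
    }

    /* Called periodically by the IO thread: flush expired entries. */
    void flush_expired(void) {
        struct ioentry **p = &io_queue;
        while (*p) {
            if ((*p)->flush_at <= time(NULL)) {
                struct ioentry *e = *p;
                printf("flushing %s to disk\n", e->key);
                *p = e->next;
                free(e);
            } else {
                p = &(*p)->next;
            }
        }
    }

    int main(void) {
        cache_flush_delay = 0;    /* flush immediately for the demo */
        mark_dirty("myzset");
        mark_dirty("myzset");     /* coalesced: queued only once */
        flush_expired();          /* prints "flushing myzset to disk" once */
        return 0;
    }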
I want to point out that for things like leaderboards and big sorted
sets, lists, and so on, the way to go is to use Redis with the default
in-memory back end. I guess that in many conditions it will make a lot
of sense to run two Redis instances: one for bulk data, and one for
fast things like the leaderboard and other things that must be very
fast.
Cheers,
Salvatore
Thank you for your comments, Chris! Replying below.
> * Once redis reaches the cache RAM limit, the machine starts thrashing
> pretty bad, and ends up spending a good deal of time waiting for disk
> i/o and generally slowing redis to a halt. The SETs per second at
> that point quickly drops of from about 25-26k to 10k, and then to
> nothing as it thrashes back and forth.
Yes, this is expected behavior, except for the fact that currently,
when we reach the condition where there is no memory to grow the write
queue further, the whole current queue is processed in a blocking way.
Instead I need to make this incremental. It's very easy; the current
implementation was just a two-minute effort to make it work without
surpassing the memory limit. But of course, whatever we do, when there
are too many writes we'll end up waiting for I/O.
> * I haven't spent as much time testing this out, but increasing cache-
> flush-delay seems to decrease the "hit" that occurs once we reach the
> RAM limit, but that success is short lived :)
Increasing cache-flush-delay and cache-max-memory helps in two different ways:
1) increasing cache-flush-delay helps a lot if you write again and
again against the same keys;
2) increasing cache-max-memory helps to handle a longer-lasting peak.
But if the write-heavy workload continues for too long, in the end
performance will get lower and lower, until it matches the performance
of the I/O subsystem.
> I will freely admit that my test is a little unfair at best, and
> unrealistic at worst -- obviously in a production scenario I wouldn't
> be using redis with a disk backing store simply to constantly issue
> SET requests. However, I've been checking out various options to do
> more or less exactly what the future redis v2.4 aims to do, and "what
> happens when you run out of RAM" is a question that I need to answer,
> and the answer, in most cases, seems to be "noooooo!!!!!!!" :). Not
> entirely surprising given the stress I'm putting on it, but I need to
> know what happens in worst case scenarios.
Completely agree with your vision, the worst case scenario must be very clear.
And of course we can improve the I/O performance itself.
For instance, serializing/deserializing hashes, lists, and sets is
slow, but now we have specially encoded data types that are... guess
what? A linear array of data! We can serialize them to disk just by
copying.
Many optimizations are possible.
> That being said, while having something that magically experienced
> zero performance degradation when writing to disk instead of RAM would
> be just lovely, I'm not gonna hold my breath :). I am looking for
> consistent and predictable performance characteristics, however. At
> the moment, redis effectively locks up if it's under heavy load, has
> reached the cache size limit, and the db is continuing to grow.
Yes, now it consumes the whole queue of 200 writes every time... not
exactly a good idea.
Will fix it soon.
> I'll be continuing some of my testing in the near future to include
> some scenarios that are a bit more realistic, mixing GET/SET/etc.
> requests in certain ratios and so forth, using things that are at
> least closer to "real world" data (for us, anyway).
Great
> Depending on how far I get, I might offer up a branch with a few
> tweaks of my own, but that depends largely on how much time I can
> devote to it in the near future :).
Thanks
> To close, I'll just say that for being as new and early stage as the
> branch formerly known as 'diskstore' is, it's looking pretty good.
> Thanks for all the hard work!
Thanks! Very useful feedback.
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
> - Is support for loading RDB files planned (when diskstore is enabled)?
> If so, there will be a start-up delay as the RDB is re-written to the
> diskstore format, correct?
The idea is to provide a tool to convert an .rdb file into diskstore format.
The contrary will be handled natively by Redis, just by calling BGSAVE.
> - Are there any ideas to change the process for starting new slaves, to
> address some of the existing shortcomings with replication, or do we think
> that the improved speed and lower memory requirements associated with RDB
> generation will negate the most serious concerns?
I think that BGSAVE will be a very efficient operation with diskstore,
so it will not create the problems it created when VM was active.
We'll see how the actual implementation works, but I'm very positive
about this.
> Overall, this looks very exciting, as it opens up even more use cases for
> Redis.
Thanks!
Salvatore
Dvir, right, this is very interesting: 100 or 256 or... 1000?
This was my reasoning: I want it to work well with 1 billion keys. So:
1000000000/(256*256) = 15258
15k files per dir is more or less near the limit for good performance.
On the other side, creating 65536 directories, or scanning 65536
directories for BGSAVE, seems acceptable, and works pretty fast in
practice.
> It didn't scale well with millions of objects, and made the engine become
> very slow over time. I assume the difference between 100^2 and 256^2 will
> make it scale better though. Also, for me most reads were from disk as
> caches were very small back then, so I'm guessing it's less of an issue if
> most of the reads are from memory.
An important variable here is the filesystem in use, for sure.
Btw, there is no reason not to try a B-TREE, given that the interface
is pretty abstract. We should give it a try at least to get a feeling
for what other solutions can provide us.
> Back then I eventually wrote a b-tree based storage engine that worked much
> better with millions of varying size, growing blobs (inverted indexes
> mainly).
Cool. If you want to hack a patch it can be an interesting experiment.
If we select the B-TREE approach in the end, we'll need something able
to auto-compact itself.
> Another question is - can diskstore (optionally of course) totally replace
> RDBs? besides the fact that backups will be slower, why not allowing users
> to use diskstore as the only store in redis?
It's already this way: you can *optionally* call BGSAVE when diskstore
is active (or rather, you will be able to... it's still not
implemented), but there is no need for it. BGSAVE will also be called
for replication if our diskstore instance is a master, but our
diskstore BGSAVE should work very well.
I'll try implementing it tomorrow, so we'll know very soon :)
Cheers,
Salvatore
Pardon my ignorance, but I am curious whether this is similar to the
way MongoDB works (it uses memory-mapped files and fsyncs at regular
intervals).
If not, what would be the disadvantages if Redis adopted such a scheme?
Regards,
BG
--
Baishampayan Ghose
b.ghose at gmail.com
Thanks Josiah!
As an aside, if Redis stores the value pointed by each key in its own
file, what are the implications of using filesystems which limit the
total number of files?
Great, please, if you have the chance to do any testing, drop us an
email with what you find :)
> One question here: "the contrary" means converting diskstore into rdb format? If
> so, why do we still need it after the adoption of diskstore?
> What is the way to do BACKUP for diskstore ? Just copy / rsync the entire
> directory tree I hope?
.rdb and AOF will still be the persistence model for in-memory operation.
Even in diskstore mode .rdb is good, as it is a compact, consistent,
point-in-time dump, and we need it anyway in order to support
replication with diskstore. It's as if .rdb were the lingua franca
among the different Redis underlying implementations: for replication
we transfer the first bulk of data as .rdb regardless of the back end
used by a given instance.
Cheers,
Salvatore
> As an aside, if Redis stores the value pointed by each key in its own
> file, what are the implications of using filesystems which limit the
> total number of files?
Most filesystems have a limit on the max number of files per directory;
this is why we use 65536 directories, so there is no such problem.
Mmap is faster and would give us a single-file database, but the
problem is, it's very hard to avoid corruption with this scheme. For
instance in MongoDB, if I'm correct, in order to have a decent amount
of durability you need to run a replica, otherwise something bad can
happen.
This is because mmapped files will flush things to disk without any
ordering, and at hardware page level. So it's very hard to guess what
will be written to disk and what not in case of a crash.
Even if we switch to something different from our filesystem-based
approach, we need something more consistent than this. It's fine for
Mongo, as it's the only persistence offered and must be both fast and
persistent, and the MongoDB guys tried to find some tradeoff. But in
our case, for speed we have the in-memory back end, and when diskstore
is enabled we have the object cache, so my feeling is that our on-disk
KV implementation should be able to provide some more durability.
Not at all :)
> So as a
> reaction to the email sent by Tim about his unhappiness with VM I used
> a few vacation days to start implementing a new model, which is what
> was listed above as number "3".
>
> The new set of tradeoffs are very different. The result is called
> diskstore
excellent news! also, "diskstore" sounds better than "durable vm" :)
i will try to give it a spin later today.
tim
Thanks for the explanation, Salvatore. Redis rocks (we use it in
production) and I am very optimistic about its future :)
Supporting multiple back ends surely makes sense. On the other hand, I
want to have a clean build system with a clean default, so this is
what we can do.
By default we'll have something like:
diskstore-plugin native
Or...
diskstore-plugin /path/to/your/plugin
This plugin will just be popen()-ed by Redis, so Redis will talk to
the plugin via standard input/output, and will issue commands like
GET / SET / DEL with a simple binary protocol. The plugin will reply
by writing to standard output, again with a simple binary protocol.
Given that disk I/O is involved anyway, the overhead of the
stdin/stdout chat is not a big issue, and this way people can create
plugins for diskstore in any conceivable language and architecture. At
the same time Redis will ship with a default that we consider sane for
most users.
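Just to give an idea of the shape such a plugin could take, here is a
hypothetical skeleton in C. The wire protocol is not defined yet, so
the one-byte command plus 32-bit length-prefixed blobs used below is
purely an assumption for the sake of the example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    /* Read a 32-bit big-endian length followed by that many bytes. */
    static char *read_blob(FILE *in, uint32_t *len) {
        uint32_t nlen;
        char *buf;
        if (fread(&nlen, 4, 1, in) != 1) return NULL;
        *len = ntohl(nlen);
        buf = malloc(*len + 1);
        if (fread(buf, 1, *len, in) != *len) { free(buf); return NULL; }
        buf[*len] = '\0';
        return buf;
    }

    int main(void) {
        int cmd;
        /* Redis popen()s this program and speaks over stdin/stdout. */
        while ((cmd = fgetc(stdin)) != EOF) {
            uint32_t klen;
            char *key = read_blob(stdin, &klen);
            if (key == NULL) break;
            switch (cmd) {
            case 'G': /* GET: reply with a length-prefixed value */ break;
            case 'S': /* SET: read the value blob and store it */ break;
            case 'D': /* DEL: remove the key */ break;
            }
            free(key);
            fflush(stdout);    /* Redis is waiting on the pipe */
        }
        return 0;
    }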
Makes sense?
On Wed, Jan 5, 2011 at 10:36 AM, Dvir Volk <dvi...@gmail.com> wrote:
> just a thought, Salvatore,
> have you considered using berkeley db (sleepycat), or even mysql?
Or the InnoDB plugin directly, given that it seems to have good performance:
http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html
But given that the diskstore will be able to use different back-ends
via plugins, I assume that if someone really wants this, they will
appear.
Bye,
--
Pedro Melo
http://www.simplicidade.org/
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org
> So the question is how much performance are you losing by doing this
> externally via a pipe?
I think there is not even a need to measure it, since we already
receive commands via a socket, so the redis-plugin link will for sure
be on the same scale (much faster, actually). And probably most of
these plugins will end up talking to some other networking layer like
MySQL or Riak or others.
For sure a dynamic lib is more efficient, but everything considered,
the flexibility of using the pipe is probably a bigger advantage.
Cheers,
Salvatore
The reason I'm not so happy with the .so thing is that I never saw a
software with a plugin system where a great deal of the plugins were
not in a sad state. And in the case of a database this is a serious
issue... Imagine these C-written plugins around on github, at
different levels of completion... it does not sound too good to me :)
With the pipe we can have much stabler plugins, isolated from Redis,
so that if the pipe is closed Redis will just report "Hey, your plugin
crashed, use something better". It's an important gain...
We'll always be in time to fix this later. If a plugin becomes
dominant, we can perhaps implement it as a built-in option.
i see two main scenarios:
a) multiple redis instances on the same machine, sharing the same
diskstore. this should work today, although i have not yet tested it.
b) multiple redis instances on different machines, sharing the same
(remote or clustered) diskstore backend -- NFS, cluster filesystem,
riak, cassandra etc. this should work as soon as salvatore has
implemented the pluggable backend part.
now that diskstore has removed "total size of dataset" as a
bottleneck, both a) and b) address "throughput" as the next likely
bottleneck. in addition, b) makes diskstore viable if the working
dataset is too large for a single machine.
if a) turns out to be useful (disk i/o might become the bottleneck
here), it could even be made into a diskstore configuration option,
telling redis to fork n processes on consecutive ports.
tim