Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Redis diskstore
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 1 - 25 of 52 - Collapse all  -  Translate all to Translated (View all originals)   Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Salvatore Sanfilippo  
View profile  
 More options Jan 4 2011, 2:57 pm
From: Salvatore Sanfilippo <anti...@gmail.com>
Date: Tue, 4 Jan 2011 20:57:53 +0100
Local: Tues, Jan 4 2011 2:57 pm
Subject: Redis diskstore
Hi all,

a few months after VM started to work my feeling about it started to
be not very good. I stated in the blog, and privately on IRC,
especially talking with Pieter, that VM was not the way to go for the
future of Redis, and that the new path we were taking about using less
memory was a much better approach. Together with cluster.

However there are a number of different models for dealing with
datasets bigger than RAM for Redis. Just to cite a few:

1) virtual memory, where we swap values on disk as needed (The Redis
Virtual Memory way)
2) storing data on disk, in a complex form so that operations can be
implemented directly in the on-disk representation, and using the OS
cache as a cache layer for the working set (let's call it the Mongo DB
way)
3) storing data on disk, but not for direct manipulation, and use
memory as a cache of objects that are active, flushing writes on disks
when this objects change.

It is now clear that VM is not the right set of tradeoffs, it was
designed to be pretty fast but on the other hand there was a too big
price to pay for all the rest: slow restarts, slow saving, and in turn
slow replication, very complex code, and so forth.

If you want pure speed with Redis, in memory is the way to go. So as a
reaction to the email sent by Tim about his unhappiness with VM I used
a few vacation days to start implementing a new model, that is was was
listed above as number "3".

The new set of tradeoffs are very different. The result is called
diskstore, and this is how it works, in a few easy to digest points.

- In diskstore key-value paris are stored on disk.
- Memory works as a cache for live objects. Operations are only
performed on in memory keys, so data on disk does not need to be
stored in complex forms.
- The cache-max-memory limit is strict. Redis will never use more RAM,
even if we have 2 MB of max memory and 1 billion of keys. This works
since now we don't need to take keys in memory.
- Data is flushed on disk asynchronously. If a key is marked as dirty,
and IO operation is scheduled for this key.
- You can control the delay between modifications of keys and disk
writes, so that if a key is modified a lot of time in small time, it
will written only one time on disk.
- Setting the delay to 0 means, sync it as fast as possible.
- All I/O is performed by a single dedicated thread, that is
long-running and not spawned on demand. The thread is awaked with a
conditional variable.
- The system is much simpler and sane than VM implementation, as there
is no need to "undo" operations on race conditions.
- Zero start-up time... as objects are loaded on demand.
- There is negative caching. If a key is not on disk we remember it
(if there is memory to do so). So we avoid accessing the disk again
and again for keys that are not there.
- The system is very fast if we access mostly our working set, and
this working set happens to fit memory. Otherwise the system is much
slower (I/O bound).
- The system does not support BGSAVE currently, but will support this,
and what is cool, with minimal overhead and used memory in the saving
child, as data on disk is already written using the same serialization
format as .rdb files. So our child will just copy files to obtain the
.rdb. In the mean time the objects in cache are not flushed, so the
system may use more memory, but it's not about copy-on-write, so it
will use very very little additional memory.
- Persistence is *PER KEY* this means, there is no point in time persistence.

I think that the above points may give you an idea about how it works.
But let me stress the per-key persistence point a bit.

LPUSH a 0
LPUSH b 1
LPUHS a 2

So after this commands we may have two IO scheduled operations
pending. One for "a" and one for "b".
Now imagine "a" is saved, and then the server goes down, Redis is
brutally killed, or alike. The database will contain a consistent
version of "a" and "b", but the version of "b" will be old, without
the "1" pushed.

Also currently MULTI/EXEC is not transactional, but this will be
fixed, at least inside a multi/exec there will be guarantee that
either all values or nothing will be synched to disk (this will be
obtained using a journal file for transactions).

Some more details. The system is composed of two layers:

diskstore.c -- implements a trivial on disk key-value store
dscache.c -- implements the more complex caching layer

diskstore.c is currently a filesystem based KV store. This can be
replaced with a B-TREE or something like that in the future if this
will be needed. However even if the current implementation has a big
overhead, it's pretty cool to have data as files, with very little
chances of loosing data and corruption (rename is used for writes).
But well if this does not scale well enough we'll drop it and replace
it with something better.

The current implementations is similar to bigdis. 256 directories
containing 256 directories each are used, for a total of 65536 dir.
Every key is put inside the dirs addressed by SHA1(key) translated in
hex, for instance key "foo" is at:

   /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33

The cool thing is, diskstore.c exports a trivial interface to Redis,
so it's very simple to replace with something else without touching
too much internals.

Stability: the system is obviously in alpha stage, however it works
pretty well, without obvious crashes. But warning, it will crash with
an assert if you try to BGSAVE.

To try it download the "unstable" branch, edit redis.conf to enable
diskstore. Play with it. Enjoy a redis instance that starts in no time
even when it's full of data :)

Feedbacks are really really appreciated here. I want to know what you
think, what are your impressions on the design, tradeoffs, and so
forth, how it feels when you experiment with it. If you want to see
the inner workings set log level to "debug".

The goal is to ship 2.4 ASAP with VM replaced with a good
implementation of diskstore.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andy McCurdy  
View profile  
 More options Jan 4 2011, 3:49 pm
From: Andy McCurdy <sed...@gmail.com>
Date: Tue, 4 Jan 2011 12:49:52 -0800
Local: Tues, Jan 4 2011 3:49 pm
Subject: Re: Redis diskstore

Hey Salvatore,

I've definitely had issues with VM, and am excited to see how diskstore
works out. In theory, the tradeoffs seem much better. Couple questions.

What happens if I have a producer (or perhaps many producers) of data that
add dirty data to Redis faster than the Diskstore IO thread can write it?
Specifically, what if the dirty data approaches the max memory limits? Will
the memory footprint grow temporarily until those values are persisted, then
free itself?

What about if I have a large data structure? One with thousands of items in
a list or zset. If this key is written to frequently (multiple times a
second), will the IO thread get backed up by having to write such a large
structure over and over again?

-andy

On Tue, Jan 4, 2011 at 11:57 AM, Salvatore Sanfilippo <anti...@gmail.com>wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
andrychris  
View profile  
 More options Jan 4 2011, 3:50 pm
From: andrychris <ch...@noneofyo.biz>
Date: Tue, 4 Jan 2011 12:50:26 -0800 (PST)
Local: Tues, Jan 4 2011 3:50 pm
Subject: Re: Redis diskstore
To share some quick, high level thoughts on my initial tests and what
I've seen so far...

I've been starting off with a basic test that essentially spams simple
SET requests to a redis instance with small (1-30 bytes or so) keys
and values as fast as it can go.  I've been using a 2-cpu (for a total
of 16 cores) Xeon X5570 with 24GB of RAM and, at the moment, a solid
state drive that I'm using for long term storage.

* Once redis reaches the cache RAM limit, the machine starts thrashing
pretty bad, and ends up spending a good deal of time waiting for disk
i/o and generally slowing redis to a halt.  The SETs per second at
that point quickly drops of from about 25-26k to 10k, and then to
nothing as it thrashes back and forth.
* I haven't spent as much time testing this out, but increasing cache-
flush-delay seems to decrease the "hit" that occurs once we reach the
RAM limit, but that success is short lived :)

I will freely admit that my test is a little unfair at best, and
unrealistic at worst -- obviously in a production scenario I wouldn't
be using redis with a disk backing store simply to constantly issue
SET requests.  However, I've been checking out various options to do
more or less exactly what the future redis v2.4 aims to do, and "what
happens when you run out of RAM" is a question that I need to answer,
and the answer, in most cases, seems to be "noooooo!!!!!!!" :).  Not
entirely surprising given the stress I'm putting on it, but I need to
know what happens in worst case scenarios.

That being said, while having something that magically experienced
zero performance degradation when writing to disk instead of RAM would
be just lovely, I'm not gonna hold my breath :).  I am looking for
consistent and predictable performance characteristics, however.  At
the moment, redis effectively locks up if it's under heavy load, has
reached the cache size limit, and the db is continuing to grow.

I'll be continuing some of my testing in the near future to include
some scenarios that are a bit more realistic, mixing GET/SET/etc.
requests in certain ratios and so forth, using things that are at
least closer to "real world" data (for us, anyway).

Depending on how far I get, I might offer up a branch with a few
tweaks of my own, but that depends largely on how much time I can
devote to it in the near future :).

To close, I'll just say that for being as new and early stage as the
branch formerly known as 'diskstore' is, it's looking pretty good.
Thanks for all the hard work!

-Chris

On Jan 4, 2:57 pm, Salvatore Sanfilippo <anti...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Michael Russo  
View profile  
 More options Jan 4 2011, 4:24 pm
From: Michael Russo <mjru...@gmail.com>
Date: Tue, 4 Jan 2011 16:24:04 -0500
Local: Tues, Jan 4 2011 4:24 pm
Subject: Re: Redis diskstore

A few questions:

- Is support for loading RDB files planned (when diskstore is enabled)?  If
so, there will be a start-up delay as the RDB is re-written to the diskstore
format, correct?

- Are there any ideas to change the process for starting new slaves, to
address some of the existing shortcomings with replication, or do we think
that the improved speed and lower memory requirements associated with RDB
generation will negate the most serious concerns?

Overall, this looks very exciting, as it opens up even more use cases for
Redis.

Thanks,
Michael

On Tue, Jan 4, 2011 at 2:57 PM, Salvatore Sanfilippo <anti...@gmail.com>wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Salvatore Sanfilippo  
View profile  
 More options Jan 4 2011, 4:29 pm
From: Salvatore Sanfilippo <anti...@gmail.com>
Date: Tue, 4 Jan 2011 22:29:54 +0100
Local: Tues, Jan 4 2011 4:29 pm
Subject: Re: Redis diskstore

On Tue, Jan 4, 2011 at 9:49 PM, Andy McCurdy <sed...@gmail.com> wrote:
> Hey Salvatore,
> I've definitely had issues with VM, and am excited to see how diskstore
> works out. In theory, the tradeoffs seem much better. Couple questions.

Hello Andy, thanks for your feedbacks and your questions, they are
very interesting.

> What happens if I have a producer (or perhaps many producers) of data that
> add dirty data to Redis faster than the Diskstore IO thread can write it?
> Specifically, what if the dirty data approaches the max memory limits? Will
> the memory footprint grow temporarily until those values are persisted, then
> free itself?

Redis will start evicting keys that are not dirty from the cache, but
when it eventually runs completely out of memory compared to the max
memory limit, it will start blocking waiting for disk I/O to complete
so that memory can be reclaimed. Currently this is done not
incrementally enough and will be improved, but there are no other
solutions.

This basically means that redis will survive without problems to
peaks, but if the writes are sustainedly faster than IO eventually
there is to wait for IO to compelte :)

But I think it's a good idea that the new semantics is to use
cache-max-memory as an *hard* limit.
The only reason it Redis will use more memory than that is to process
a command that involves using values that are larger than the max
memory setting itself (for instance you set maxmemory to 1mb and then
do an lpush against a list that in memory is 2mb, but this is not a
problem I think).

> What about if I have a large data structure? One with thousands of items in
> a list or zset. If this key is written to frequently (multiple times a
> second), will the IO thread get backed up by having to write such a large
> structure over and over again?

It is mainly up to you, as there is a setting called cache-flush-delay.
If you set it to, for instance, 10, then the first time you write
against the big zset it will get queued for disk IO, with time for
completion set to now+10 seconds. The next writes will not add new
writes to the queue, so after 10 seconds the key will be transfered on
disk, and then if there is a new operation against this key, it will
be queued again. The effect is that the key will be written on disk
every 10 seconds if there is a continuos stream of writes against it.
So it's a clear trade-off between durability and speed.

I want to point out that for things like leader boards and big sorted
sets, lists, ...., the way to go is to use Redis with the default
in-memory back end. I guess that in many conditions it will make a lot
of sense to run two Redis instances: one for bulk data, and one for
fast things like the leader board and other things that must be very
fast.

Cheers,
Salvatore

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Dvir Volk  
View profile  
 More options Jan 4 2011, 4:36 pm
From: Dvir Volk <dvir...@gmail.com>
Date: Tue, 4 Jan 2011 23:36:32 +0200
Local: Tues, Jan 4 2011 4:36 pm
Subject: Re: Redis diskstore

Salvatore, this sounds great.
I told you on twitter about my b-tree implementation, and it's funny that I
wrote it after having started my project with almost exactly the same
diskstore pattern (although it was 100 sub directories per directory, which
is a big difference).
It didn't scale well with millions of objects, and made the engine become
very slow over time. I assume the difference between 100^2 and 256^2 will
make it scale better though. Also, for me most reads were from disk as
caches were very small back then, so I'm guessing it's less of an issue if
most of the reads are from memory.
Back then I eventually wrote a b-tree based storage engine that worked much
better with millions of varying size, growing blobs (inverted indexes
mainly).

Another question is - can diskstore (optionally of course) totally replace
RDBs? besides the fact that backups will be slower, why not allowing users
to use diskstore as the only store in redis?

On Tue, Jan 4, 2011 at 9:57 PM, Salvatore Sanfilippo <anti...@gmail.com>wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Salvatore Sanfilippo  
View profile  
 More options Jan 4 2011, 5:06 pm
From: Salvatore Sanfilippo <anti...@gmail.com>
Date: Tue, 4 Jan 2011 23:06:30 +0100
Local: Tues, Jan 4 2011 5:06 pm
Subject: Re: Redis diskstore

On Tue, Jan 4, 2011 at 9:50 PM, andrychris <ch...@noneofyo.biz> wrote:

Thank you for your comments Chris! Replying below.

> * Once redis reaches the cache RAM limit, the machine starts thrashing
> pretty bad, and ends up spending a good deal of time waiting for disk
> i/o and generally slowing redis to a halt.  The SETs per second at
> that point quickly drops of from about 25-26k to 10k, and then to
> nothing as it thrashes back and forth.

Yes, this is expected behavior but for the fact that currently when we
reach the condition that there is no memory to create a longer write
queue the whole current queue is processed in a blocking way. Instead
I need to make this incremental. It's very easy, just the current
implementation was a two minutes effort to make it working without
surpassing the memory limit. But whatever we can do of course when
there are too many writes we'll start waiting for I/O.

> * I haven't spent as much time testing this out, but increasing cache-
> flush-delay seems to decrease the "hit" that occurs once we reach the
> RAM limit, but that success is short lived :)

Increasing cache-flush-delay and cache-max-memory helps in two different ways:

1) increasing cache-flush-delay helps a lot if you write again and again again
2) increasing cache-max-memory helps to handle well a peak that will
last for more time.

But if the write-heavy workload continues for too long in the end
performances will start to get lower and lower, until they'll match
the performance of the I/O subsystem.

> I will freely admit that my test is a little unfair at best, and
> unrealistic at worst -- obviously in a production scenario I wouldn't
> be using redis with a disk backing store simply to constantly issue
> SET requests.  However, I've been checking out various options to do
> more or less exactly what the future redis v2.4 aims to do, and "what
> happens when you run out of RAM" is a question that I need to answer,
> and the answer, in most cases, seems to be "noooooo!!!!!!!" :).  Not
> entirely surprising given the stress I'm putting on it, but I need to
> know what happens in worst case scenarios.

Completely agree with your vision, the worst case scenario must be very clear.
And of course we can improve the I/O performance itself.

For instance, serializing / deserializing hashes / lists / sets is
slow, but now that we have specially encoded data type that are...
guess what? A linear array of data! We can serialize it on disk just
by copying.

Many optimizations are possible.

> That being said, while having something that magically experienced
> zero performance degradation when writing to disk instead of RAM would
> be just lovely, I'm not gonna hold my breath :).  I am looking for
> consistent and predictable performance characteristics, however.  At
> the moment, redis effectively locks up if it's under heavy load, has
> reached the cache size limit, and the db is continuing to grow.

Yes now it consumes the whole queue of 200 writes, every time... not
exactly a good idea.
Will fix it soon.

> I'll be continuing some of my testing in the near future to include
> some scenarios that are a bit more realistic, mixing GET/SET/etc.
> requests in certain ratios and so forth, using things that are at
> least closer to "real world" data (for us, anyway).

Great

> Depending on how far I get, I might offer up a branch with a few
> tweaks of my own, but that depends largely on how much time I can
> devote to it in the near future :).

Thanks

> To close, I'll just say that for being as new and early stage as the
> branch formerly known as 'diskstore' is, it's looking pretty good.
> Thanks for all the hard work!

Thanks! Very useful feedbacks.

Cheers,
Salvatore

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Salvatore Sanfilippo  
View profile  
 More options Jan 4 2011, 5:08 pm
From: Salvatore Sanfilippo <anti...@gmail.com>
Date: Tue, 4 Jan 2011 23:08:32 +0100
Local: Tues, Jan 4 2011 5:08 pm
Subject: Re: Redis diskstore

On Tue, Jan 4, 2011 at 10:24 PM, Michael Russo <mjru...@gmail.com> wrote:
> A few questions:
> - Is support for loading RDB files planned (when diskstore is enabled)?  If
> so, there will be a start-up delay as the RDB is re-written to the diskstore
> format, correct?

The idea is to provide a tool to convert an .rdb file into diskstore format.
The contrary will be handled natively by redis, just calling BGSAVE.

> - Are there any ideas to change the process for starting new slaves, to
> address some of the existing shortcomings with replication, or do we think
> that the improved speed and lower memory requirements associated with RDB
> generation will negate the most serious concerns?

I think that BGSAVE will be a very efficient operation with diskstore,
so will not create the problems that created when VM was active. We'll
see how the actual implementation works, but I'm very positive about
this.

> Overall, this looks very exciting, as it opens up even more use cases for
> Redis.

Thanks!
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Salvatore Sanfilippo  
View profile  
 More options Jan 4 2011, 5:15 pm
From: Salvatore Sanfilippo <anti...@gmail.com>
Date: Tue, 4 Jan 2011 23:15:17 +0100
Local: Tues, Jan 4 2011 5:15 pm
Subject: Re: Redis diskstore

On Tue, Jan 4, 2011 at 10:36 PM, Dvir Volk <dvir...@gmail.com> wrote:
> Salvatore, this sounds great.
> I told you on twitter about my b-tree implementation, and it's funny that I
> wrote it after having started my project with almost exactly the same
> diskstore pattern (although it was 100 sub directories per directory, which
> is a big difference).

Dvir right this is very interesting, 100 or 256 or ... 1000?

This was my reasoning. I want it to work well with 1 billion keys. So:

1000000000/(256*256) = 15258

15k files per dir are more or less near to the limit to get good performances.
On the other side to create 65536 directory, or to scan 65536
directory for BGSAVE, seems acceptable, and works pretty fast in
practice.

> It didn't scale well with millions of objects, and made the engine become
> very slow over time. I assume the difference between 100^2 and 256^2 will
> make it scale better though. Also, for me most reads were from disk as
> caches were very small back then, so I'm guessing it's less of an issue if
> most of the reads are from memory.

An important variable here is the filesystem in use for sure.
Btw there is no reason to don't try with a B-TREE, given the interface
is pretty abstract. We should give it a try at least to get the
feeling about what other solutions can provide us.

> Back then I eventually wrote a b-tree based storage engine that worked much
> better with millions of varying size, growing blobs (inverted indexes
> mainly).

Cool. If you want to hack a patch it can be an interesting experiment.
In the end we'll need something able to auto compact itself if we'll
select the B-TREE approach in the end.

> Another question is - can diskstore (optionally of course) totally replace
> RDBs? besides the fact that backups will be slower, why not allowing users
> to use diskstore as the only store in redis?

It's already this way. you can *optionally* call BGSAVE when diskstore
is active (or better you will ... still not implemented), but there is
no need for it. BGSAVE will also be called for replication if our
diskstore instance is a master, but our diskstore-BGSAVE should work
very well.

I'll try implementing it tomorrow so'll know very soon :)

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Xiangrong Fang  
View profile  
 More options Jan 4 2011, 6:17 pm
From: Xiangrong Fang <xrf...@gmail.com>
Date: Wed, 5 Jan 2011 07:17:33 +0800
Local: Tues, Jan 4 2011 6:17 pm
Subject: Re: Redis diskstore

This is really cool.  Most dataset in practice has a % of items as "hot".
 This approach makes redis scale UP well.  For example, we plan to use a
Dell SSD box to replace regular hard disk in production, this will
definitely help throughput I think :-)

> - Is support for loading RDB files planned (when diskstore is enabled)?  >
> If so, there will be a start-up delay as the RDB is re-written to the

> diskstore format, correct?

> The idea is to provide a tool to convert an .rdb file into diskstore
> format.
> The contrary will be handled natively by redis, just calling BGSAVE.

On question here "the contrary" means convert diskstore into rdb format? If
so, why we still need it after the adoption of diskstore?

What is the way to do BACKUP for diskstore ? Just copy / rsync the entire
directory tree I hope?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Baishampayan Ghose  
View profile  
 More options Jan 5 2011, 1:43 am
From: Baishampayan Ghose <b.gh...@gmail.com>
Date: Wed, 5 Jan 2011 12:13:09 +0530
Local: Wed, Jan 5 2011 1:43 am
Subject: Re: Redis diskstore

> If you want pure speed with Redis, in memory is the way to go. So as a
> reaction to the email sent by Tim about his unhappiness with VM I used
> a few vacation days to start implementing a new model, that is was was
> listed above as number "3".

Pardon my ignorance, but I am curious as to if this is similar to the
way MongoDB works (it used memory-mapped files and fsyncs at regular
intervals).

If not, what would be the disadvantages if Redis adopted such a scheme?

Regards,
BG

--
Baishampayan Ghose
b.ghose at gmail.com


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Josiah Carlson  
View profile  
 More options Jan 5 2011, 2:57 am
From: Josiah Carlson <josiah.carl...@gmail.com>
Date: Tue, 4 Jan 2011 23:57:36 -0800
Local: Wed, Jan 5 2011 2:57 am
Subject: Re: Redis diskstore

On Tue, Jan 4, 2011 at 10:43 PM, Baishampayan Ghose <b.gh...@gmail.com>wrote:

> > If you want pure speed with Redis, in memory is the way to go. So as a
> > reaction to the email sent by Tim about his unhappiness with VM I used
> > a few vacation days to start implementing a new model, that is was was
> > listed above as number "3".

> Pardon my ignorance, but I am curious as to if this is similar to the
> way MongoDB works (it used memory-mapped files and fsyncs at regular
> intervals).

It is not the same. MongoDB uses large files that hold both data and indexes
that are memory-mapped, and thus are limited to have databases at most the
architecture limit of your platform. That is, only up to about 3GB on a 32
bit machine. The method described by Salvatore stores the value pointed to
by each base key in it's own file on disk, reading and (re-)writing them as
necessary.

If not, what would be the disadvantages if Redis adopted such a scheme?

Given the current description of Redis diskstore, you could specify a 3 gig
cache limit, but store an unlimited amount of data on disk. As long as your
working-set tended below 3 gig, you would generally have very good
performance. If a MongoDB-based scheme was used, you would be limited to 3GB
of total data size, AND your on-disk representation must be the same as your
in-memory representation (the on-disk representation for Redis structures
tends to be smaller than their in-memory representation).

Regards,
 - Josiah


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Josiah Carlson  
View profile  
 More options Jan 5 2011, 3:46 am
From: Josiah Carlson <josiah.carl...@gmail.com>
Date: Wed, 5 Jan 2011 00:46:27 -0800
Local: Wed, Jan 5 2011 3:46 am
Subject: Re: Redis diskstore

Salvatore,

Thank you for taking the time and effort to design and build this
system. There is one concern that have about this particular implementation,
and a potential solution.

Depending on the load, some people will see a drastic reduction in IO
performance when writing many different keys and files, even more so than
just the random IO would suggest. Why? Filesystem metadata. Every time a
file is written to, it's size, modification date, etc., all need to be
updated. On a modern transactional filesystem (which we should all be using
by now), this is done by first allocating the data extent, writing the data
to disk, writing to a journal about how the metadata will change, then
writing the actual metadata to the inode(s). This is necessary for all
writes, and we have to deal with it.  However, as the number of files in an
individual path increase, the efficiency of the inode that describes that
path decreases, and the amount of time taken to update that particular file
increases.  Probably not a big deal for some with a million or so keys (only
16 files on average in each directory, with an expectation of some having up
to 32).

However, what I'm taking from this design description is that the desire is
for Redis to be usable in systems where main memory is significantly smaller
than problem sizes, so looking at a hundred million keys is not
unreasonable, which pushes this to 16k files in each path on average,
peaking to 32k expected.

Here's an alternate design which is similar, but which should scale out
better due to a 1) reduced number of files, 2) fixed number of inode
entries, 3) potential higher write throughput by aggregating more data, and
4) a method that lends itself to round-robin flushing/rewriting (you'll see
why this is important later):
* Data path has 256 pre-sharded hex paths (like the first-level of the
described disk-store method).
* Instead of up to 256 sub-directories of the higher-level 256 directories,
we instead have 256 AOF files, which represent the data read/written to the
64k shards.
* Handle all writes/caching exactly the same as using the described
diskstore, only now there are chunks of data that are read/written together
when flushed.
* A 64k entry table can be used to keep track of the last time an AOF chunk
was reprocessed, whether there is any dirty data, etc.
* Every X seconds, a small table can be flushed to disk to describe the
cache state, last modified state, etc., of shards in the system. This would
allow for Redis to pre-cache before requests start coming in.

Also, because the AOFs are pre-sharded 64k ways, re-writing each of them
individually to be compact should be fast.

The major drawback to this system is that if you miss on a key, you have to
read out the entire chunk, which could have large unrelated data contained
therein. For example, I want to the integer stored in key "foo", but there
are a half-dozen 5 million entry zsets that in the same shard, and that
shard isn't in cache... ouch. *

It's just some ideas. Feel free to discard it, consider it, whatever.

Regards,
 - Josiah

* This particular issue can be worked-around by having "automatic" and
"user-defined" shard ids. If a key does not contain the "{...}" string, it
is automatically sharded into the 64k shards via the 2 byte prefix of the
sha1 hash. If it contains a "{....}" string, and the "...." is exactly 4 hex
digits, then that is the shard-id (if it's not exactly 4 hex digits, then
just that portion is the key to be passed into sha1). To handle the "but I
don't want my integer value to collide with my giant zsets" issue, the "run
a second Redis with a different data path" answer should suffice.

On Tue, Jan 4, 2011 at 11:57 AM, Salvatore Sanfilippo <anti...@gmail.com>wrote:

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Baishampayan Ghose  
View profile  
 More options Jan 5 2011, 3:55 am
From: Baishampayan Ghose <b.gh...@gmail.com>
Date: Wed, 5 Jan 2011 14:25:25 +0530
Local: Wed, Jan 5 2011 3:55 am
Subject: Re: Redis diskstore

>> Pardon my ignorance, but I am curious as to if this is similar to the
>> way MongoDB works (it used memory-mapped files and fsyncs at regular
>> intervals).

> It is not the same. MongoDB uses large files that hold both data and indexes
> that are memory-mapped, and thus are limited to have databases at most the
> architecture limit of your platform. That is, only up to about 3GB on a 32
> bit machine. The method described by Salvatore stores the value pointed to
> by each base key in it's own file on disk, reading and (re-)writing them as
> necessary.

Thanks Josiah!

Regards,
BG

--
Baishampayan Ghose
b.ghose at gmail.com


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Baishampayan Ghose  
View profile  
 More options Jan 5 2011, 3:59 am
From: Baishampayan Ghose <b.gh...@gmail.com>
Date: Wed, 5 Jan 2011 14:29:13 +0530
Local: Wed, Jan 5 2011 3:59 am
Subject: Re: Redis diskstore

> It is not the same. MongoDB uses large files that hold both data and indexes
> that are memory-mapped, and thus are limited to have databases at most the
> architecture limit of your platform. That is, only up to about 3GB on a 32
> bit machine. The method described by Salvatore stores the value pointed to
> by each base key in it's own file on disk, reading and (re-)writing them as
> necessary.

As an aside, if Redis stores the value pointed by each key in its own
file, what are the implications of using filesystems which limit the
total number of files?

Regards,
BG

--
Baishampayan Ghose
b.ghose at gmail.com


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Salvatore Sanfilippo  
View profile  
 More options Jan 5 2011, 4:06 am
From: Salvatore Sanfilippo <anti...@gmail.com>
Date: Wed, 5 Jan 2011 10:06:22 +0100
Local: Wed, Jan 5 2011 4:06 am
Subject: Re: Redis diskstore

On Wed, Jan 5, 2011 at 12:17 AM, Xiangrong Fang <xrf...@gmail.com> wrote:
> This is really cool.  Most dataset in practice has a % of items as "hot".
>  This approach makes redis scale UP well.  For example, we plan to use a
> Dell SSD box to replace regular hard disk in production, this will
> definitely help throughput I think :-)

Great, please if you have the chance of doing any testing, drop us an
email with what you find :)

> On question here "the contrary" means convert diskstore into rdb format? If
> so, why we still need it after the adoption of diskstore?
> What is the way to do BACKUP for diskstore ? Just copy / rsync the entire
> directory tree I hope?

.rdb and AOF will still be the persistence model for in-memory operations.
Even in diskstore mode .rdb is good as it will be a compact consistent
point-in-time dump, and we need it anyway in order to support
replication with diskstore. It's like if .rdb is the lingua franka
among different Redis underlying implementations, so for replication
we transfer the first bulk data as .rdb regardless of the back end
used by a given instance.

Cheers,
Salvatore

> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To post to this group, send email to redis-db@googlegroups.com.
> To unsubscribe from this group, send email to
> redis-db+unsubscribe@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/redis-db?hl=en.

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Josiah Carlson  
View profile  
 More options Jan 5 2011, 4:07 am
From: Josiah Carlson <josiah.carl...@gmail.com>
Date: Wed, 5 Jan 2011 01:07:58 -0800
Local: Wed, Jan 5 2011 4:07 am
Subject: Re: Redis diskstore

On Wed, Jan 5, 2011 at 12:59 AM, Baishampayan Ghose <b.gh...@gmail.com>wrote:

> > It is not the same. MongoDB uses large files that hold both data and
> indexes
> > that are memory-mapped, and thus are limited to have databases at most
> the
> > architecture limit of your platform. That is, only up to about 3GB on a
> 32
> > bit machine. The method described by Salvatore stores the value pointed
> to
> > by each base key in it's own file on disk, reading and (re-)writing them
> as
> > necessary.

> As an aside, if Redis stores the value pointed by each key in its own
> file, what are the implications of using filesystems which limit the
> total number of files?

The implication is that you can only store so many keys on disk. You aren't
using fat-12, are you? ;)

 - Josiah


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Salvatore Sanfilippo  
View profile  
 More options Jan 5 2011, 4:10 am
From: Salvatore Sanfilippo <anti...@gmail.com>
Date: Wed, 5 Jan 2011 10:10:20 +0100
Local: Wed, Jan 5 2011 4:10 am
Subject: Re: Redis diskstore

On Wed, Jan 5, 2011 at 9:59 AM, Baishampayan Ghose <b.gh...@gmail.com> wrote:
> As an aside, if Redis stores the value pointed by each key in its own
> file, what are the implications of using filesystems which limit the
> total number of files?

Most filesystems have limit in the max number of files per directory,
this is why we use 65536 directories, so there is no such a problem.

Mmap is faster and will provide use a single file database, but the
problem is, it's very hard to avoid corruption with this schema. For
instance in MongoDB if I'm correct in order to have a decent amount of
durability you need to run a replica, otherwise something bad can
happen.

This is since mmaped files will flush things on disk without any
ordering, and at hardware page level. So it's very hard to guess what
will be written and what not on disk in case of a crash.

Even if we switch to something different than our filesystem-based
approach we need something more consistent than this. It's fine for
mongo as it's the unique persistence offered and must be both fast and
persistence, and mongodb guys tried to find some tradeoff. But in our
case for speed we have the in memory back end, and when diskstore is
enabled we have the object cache, so my feeling is that our on-disk KV
implementation should be able to provide some more durability.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Baishampayan Ghose  
View profile  
 More options Jan 5 2011, 4:10 am
From: Baishampayan Ghose <b.gh...@gmail.com>
Date: Wed, 5 Jan 2011 14:40:54 +0530
Local: Wed, Jan 5 2011 4:10 am
Subject: Re: Redis diskstore

>> As an aside, if Redis stores the value pointed by each key in its own
>> file, what are the implications of using filesystems which limit the
>> total number of files?

> The implication is that you can only store so many keys on disk. You aren't
> using fat-12, are you? ;)

Not at all :)

Regards,
BG

--
Baishampayan Ghose
b.ghose at gmail.com


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tim Lossen  
View profile  
 More options Jan 5 2011, 4:28 am
From: Tim Lossen <t...@lossen.de>
Date: Wed, 5 Jan 2011 10:28:55 +0100
Local: Wed, Jan 5 2011 4:28 am
Subject: Re: Redis diskstore
On 2011-01-04, at 20:57 , Salvatore Sanfilippo wrote:

> So as a
> reaction to the email sent by Tim about his unhappiness with VM I used
> a few vacation days to start implementing a new model, that is was was
> listed above as number "3".

> The new set of tradeoffs are very different. The result is called
> diskstore

excellent news! also, "diskstore" sounds better than "durable vm" :)
i will try to give it a spin later today.

tim

--
http://tim.lossen.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Baishampayan Ghose  
View profile  
 More options Jan 5 2011, 4:32 am
From: Baishampayan Ghose <b.gh...@gmail.com>
Date: Wed, 5 Jan 2011 15:02:10 +0530
Local: Wed, Jan 5 2011 4:32 am
Subject: Re: Redis diskstore

Thanks for the explanation, Salvatore. Redis rocks (we use it in
production) and I am very optimistic about its future :)

Regards,
BG

--
Baishampayan Ghose
b.ghose at gmail.com


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Dvir Volk  
View profile  
 More options Jan 5 2011, 5:36 am
From: Dvir Volk <dvir...@gmail.com>
Date: Wed, 5 Jan 2011 12:36:18 +0200
Local: Wed, Jan 5 2011 5:36 am
Subject: Re: Redis diskstore

just a thought, Salvatore,
have you considered using berkeley db (sleepycat), or even mysql?
if the API is so simple, it's worth a shot giving poeple a couple of options
for the "back engine" or something.
I will certainly play with it over the weekend and try to use my storage
engine as an alternative, but I think it's worth also exploring BDB and even
MYSQL (although I'm sure the network overhead will make it much slower).
the reason I didn't use BDB back then was its licensing.

On Wed, Jan 5, 2011 at 12:15 AM, Salvatore Sanfilippo <anti...@gmail.com>wrote:

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Salvatore Sanfilippo  
View profile  
 More options Jan 5 2011, 5:51 am
From: Salvatore Sanfilippo <anti...@gmail.com>
Date: Wed, 5 Jan 2011 11:51:11 +0100
Local: Wed, Jan 5 2011 5:51 am
Subject: Re: Redis diskstore

On Wed, Jan 5, 2011 at 11:36 AM, Dvir Volk <dvir...@gmail.com> wrote:
> just a thought, Salvatore,
> have you considered using berkeley db (sleepycat), or even mysql?
> if the API is so simple, it's worth a shot giving poeple a couple of options
> for the "back engine" or something.
> I will certainly play with it over the weekend and try to use my storage
> engine as an alternative, but I think it's worth also exploring BDB and even
> MYSQL (although I'm sure the network overhead will make it much slower).
> the reason I didn't use BDB back then was its licensing.

Supporting multiple back ends surely makes sense. On the other hand I
want to have a clean build system with a clean default, so this is
what we can do.

For default we'll have something like:

diskstore-plugin native

Or...

diskstore-plugin /path/to/your/plugin

This plugin will just be popen()-ed by Redis, so Redis will talk to
this plugin via standard input/output, and will issue commands like
GET / SET / and DEL, with a simple binary protocol. The plugin will
reply writing to standard output, again with a simple binary protocol.

Given that anyway disk I/O is involved the overhead of the stdin/out
chat is not a big issue, and this way people can create plugins for
diskstore with any conceivable language and architecture. At the same
time Redis will ship what a default that we consider sane for most
users.

Makes sense?

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tim Lossen  
View profile  
 More options Jan 5 2011, 5:53 am
From: Tim Lossen <t...@lossen.de>
Date: Wed, 5 Jan 2011 11:53:36 +0100
Local: Wed, Jan 5 2011 5:53 am
Subject: Re: Redis diskstore
i think there are even more interesting options -- how about
something clustered, like riak for example? this would take care
of storing multiple copies, thus guarding against disk failures,
and making failover much easier.

On 2011-01-05, at 11:36 , Dvir Volk wrote:

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Pedro Melo  
View profile  
 More options Jan 5 2011, 6:14 am
From: Pedro Melo <m...@simplicidade.org>
Date: Wed, 5 Jan 2011 11:14:14 +0000
Local: Wed, Jan 5 2011 6:14 am
Subject: Re: Redis diskstore
Hi,

On Wed, Jan 5, 2011 at 10:36 AM, Dvir Volk <dvir...@gmail.com> wrote:
> just a thought, Salvatore,
> have you considered using berkeley db (sleepycat), or even mysql?

Or the InnoDB plugin directly, given that it seems to have good performance:

http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-s...

But given that the diskstore will be able to use different back-ends
via plugins, I assume that if someone really wants this, they will
appear.

Bye,
--
Pedro Melo
http://www.simplicidade.org/
xmpp:m...@simplicidade.org
mailto:m...@simplicidade.org


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Messages 1 - 25 of 52   Newer >
« Back to Discussions « Newer topic     Older topic »