Redis in the place of Tyrant


Gopalakrishnan Subramani

May 31, 2011, 1:51:32 PM
to Redis DB
I have a Tyrant database with 2 million records, and the db is expected
to grow to 5 million records.

The key is a 24-byte string and the value is approximately 1 KB.
Since I already use Redis heavily for my site's master data, I am
thinking of moving the Tyrant data to Redis.

I do not want Redis to eat up all the memory of my machine. I am aware
of the vm feature, so I can restrict Redis to 1 GB of memory and have
the rest stored in the file system. I am looking for a solution where
newly added records are retrieved from memory and older records are
fetched from the file system.

Can Redis do this nicely? Or should I stick with Tyrant?

--

Gopalakrishnan Subramani

Salvatore Sanfilippo

May 31, 2011, 1:57:02 PM
to redi...@googlegroups.com
On Tue, May 31, 2011 at 7:51 PM, Gopalakrishnan Subramani
<gopalakrishn...@gmail.com> wrote:
> I do not want redis to eat up all the memory of my machine. I am aware
> of vm feature so I can restrict redis to 1 GB size in memory and rest
> to be stored in file system. I am looking for solution like newly
> added records shall be retrieved from memory and older records to be
> fetched from file system.
>
> Can redis do this nicely? Should I restrict to tyrant?

Hello,

5 million keys of 1k each is not a lot of memory with Redis. I suggest
doing this in memory.
Don't use VM, it was a legitimate try but is IMHO a failure.
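
To put a rough number on "not a lot of memory", here is a back-of-the-envelope sketch in Python. The 24-byte key and roughly 1 KB value come from the original question; the 100 bytes of per-key overhead is an assumed placeholder, not a measured Redis figure.

```python
# Rough payload estimate for the dataset described in the question.
KEY_BYTES = 24            # key size from the original post
VALUE_BYTES = 1024        # "approximately 1 KB" per value
PER_KEY_OVERHEAD = 100    # ASSUMPTION: illustrative bookkeeping cost per key


def estimated_bytes(num_keys, overhead=PER_KEY_OVERHEAD):
    """Return an approximate in-memory size for num_keys records."""
    return num_keys * (KEY_BYTES + VALUE_BYTES + overhead)


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 2**30


for label, keys in (("today", 2_000_000), ("projected", 5_000_000)):
    print(f"{label}: {to_gib(estimated_bytes(keys)):.2f} GiB")
```

Under these assumptions the projected 5 million records come to a bit over 5 GiB, so the real question is whether the machine can spare that much RAM.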

To reply to Tim briefly at the same time, as I still have not found the
time to reply to his email properly: disk store is also currently low
priority. I trust Redis-on-disk less every day. It is not
impossible that we'll do other work in this field, but as long as
the features remain so limited, and SSD performance is basically trashed
by the OS API in the specific case where you want to use it as a safe
random access memory, I continue to think of Redis on disk as a bad deal.


Gopalakrishnan: either go with Redis in memory or keep Tyrant IMHO :)

Salvatore


--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Gopalakrishnan Subramani

Jun 1, 2011, 1:10:19 AM
to Redis DB
Salvatore,

Thanks for the reply. I will stick with Tyrant.

--

Gopalakrishnan Subramani

On May 31, 10:57 pm, Salvatore Sanfilippo <anti...@gmail.com> wrote:
> On Tue, May 31, 2011 at 7:51 PM, Gopalakrishnan Subramani
>

Dvir Volk

Jun 1, 2011, 2:06:08 AM
to redi...@googlegroups.com
To me, the more interesting case for disk store is very large databases with low write activity, regardless of whether they fit entirely in RAM or not.
Even though it's not the most resource-consuming thing (especially now that we found the problem with Ubuntu and Amazon), dumping a few gigs of data every N minutes because of, say, less than 5% change
seems like a waste to me.
Diskstore can provide the best of both worlds, eliminating the need for both AOF (in some cases) and dumps.


--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.


teleo

Jun 1, 2011, 2:42:20 AM
to Redis DB
Hi Salvatore,

Can you please elaborate on SSD performance in conjunction with
Redis?

- Te

On May 31, 8:57 pm, Salvatore Sanfilippo <anti...@gmail.com> wrote:
> On Tue, May 31, 2011 at 7:51 PM, Gopalakrishnan Subramani
>

Tim Lossen

Jun 1, 2011, 3:59:53 AM
to redi...@googlegroups.com
well, "low-priority" is ok, as long as you don't drop it altogether.

i was mainly trying to make the point that the existing diskstore
implementation is fully usable -- provided that the use case is
compatible, i.e. more data than fits in memory, small percentage
of hot keys at any given time, and values which are not too small.

tim

On 2011-05-31, at 19:57 , Salvatore Sanfilippo wrote:
> To reply to Tim briefly at the same time as I still did not found the
> time to reply to his email properly: also disk store is currently low
> priority. I trust Redis-on-disk every day less. But it is not
> impossible that we'll do other work in this field, but as long as
> features remain so low, and SSD performances basically trashed by the
> OS API in the specific case you want to use it as a safe random access
> memory, I continue to think as Redis on disk as a bad deal.

--
http://tim.lossen.de

Xiangrong Fang

Jun 1, 2011, 4:20:46 AM
to redi...@googlegroups.com
Could you please tell me what counts as values that are "not too small"? What's wrong with diskstore for small data? Although swapping kills performance, is there any better solution than swapping if the data is larger than memory?

Shannon

2011/6/1 Tim Lossen <t...@lossen.de>

Tim Lossen

Jun 1, 2011, 5:16:04 AM
to redi...@googlegroups.com
diskstore creates one file per value, which seems like a lot of
overhead for small values -- but it might still work fine.

the other requirement, small percentage of hot keys at any given
time, is far more important i think.

tim

--
http://tim.lossen.de

Xiangrong Fang

Jun 1, 2011, 5:22:10 AM
to redi...@googlegroups.com
Sounds like some overhead there. Is it possible to use one big sparse file to store small values? Would this perform better than small per-value files?

Also, I do agree that a small percentage of hot keys is more important. I would say that we have to use disk store despite the possible performance degradation, because this is the only way if memory's not enough.

Dvir Volk

Jun 1, 2011, 5:28:32 AM
to redi...@googlegroups.com
Oh, I thought it was already using Salvatore's btree implementation.
I'm also sure it shouldn't be much trouble to put BDB or even Riak in as the engine for that.
Saving a lot of files is a huge overhead in very large databases.

Salvatore Sanfilippo

Jun 1, 2011, 5:29:33 AM
to redi...@googlegroups.com
On Wed, Jun 1, 2011 at 11:22 AM, Xiangrong Fang <xrf...@gmail.com> wrote:
> Also I do agree that small percentage of hot keys are more important. I
> would say that we have to use disk store despite of possible performance
> degradation, because this is the only way if memory's not enough.

For Redis on disk to work well, you need:

1) Very biased data access.
2) Mostly reads.
3) Dataset consisting of key->value data where values are small.
4) A dataset that is big enough to really pose memory/cost problems,
given the ever-growing RAM you find in an entry-level server.

The intersection of 1+2+3+4 is small, and fits exactly the case where
Redis for metadata, or as a cache, plus another datastore designed to
work on disk is the right pick. So why should we not focus, instead,
on doing what we already do (the in-memory but persistent data
structure server) better? It would already be a huge success to
enhance what we already have.

Xiangrong Fang

Jun 1, 2011, 9:28:37 AM
to redi...@googlegroups.com
Hi Salvatore,

Thank you for your explanation. But I got somewhat confused about this:

<quote>

So why should not we focus, instead, into doing what we already do (the in-memory but persistent data structure server) better?
</quote>

In fact what I want is the "persistent" part. Could you please briefly explain what will happen, without diskstore, when the dataset exceeds system memory (supposing either rdb or aof persistence is in action)? Is the VM idea totally bad, or is it workable, just with some performance penalty?

Shannon

2011/6/1 Salvatore Sanfilippo <ant...@gmail.com>

--

Salvatore Sanfilippo

Jun 1, 2011, 11:36:39 AM
to redi...@googlegroups.com
On Wed, Jun 1, 2011 at 3:28 PM, Xiangrong Fang <xrf...@gmail.com> wrote:
> In fact what I want is the "persistent" part. Could you please explain
> briefly that, without diskstore, what will happen when dataset exceeds
> system memory (suppose either rdb or aof persistence in action)?   Is the VM
> idea totally bad, or it is workable, just with some performance penalty?

Hello, first of all, let's put things in context. In the "what
happens" scenario we are talking about 99% of happy Redis users.
Diskstore + VM together probably don't reach 1% of the user base.

So what happens to the vast majority of Redis instances out there when
they run out of memory?
The system will start to perform poorly, will be slower, and so
forth, as the OS starts to swap Redis pages to disk.
If your system is configured with too little (or zero) swap space,
the OOM killer will likely kill the Redis instance.
It is also possible that you misconfigured your system and the
overcommit policy is set the wrong way (but Redis warns about this in
the logs), so fork() will start failing when there is not enough
memory, causing the DB saving (or the AOF log rewrite) to fail.
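
As a small illustration of the overcommit point, the following Python sketch reads the Linux overcommit policy and reports whether it is the value Redis asks for in its startup warning (`vm.overcommit_memory = 1`). The helper names are made up for this example, and the file only exists on Linux, so the read is guarded.

```python
from pathlib import Path

# Linux exposes the overcommit policy here; the file does not exist elsewhere.
OVERCOMMIT_PATH = Path("/proc/sys/vm/overcommit_memory")

POLICIES = {
    0: "heuristic overcommit (large fork() calls may be refused)",
    1: "always overcommit (the setting Redis asks for)",
    2: "strict accounting (no overcommit)",
}


def describe_overcommit(raw):
    """Map the raw file contents (for example the string '1') to a description."""
    mode = int(raw.strip())
    return POLICIES.get(mode, f"unknown mode {mode}")


def fork_friendly(raw):
    """True when the policy is the one Redis recommends for background saves."""
    return int(raw.strip()) == 1


if OVERCOMMIT_PATH.exists():
    current = OVERCOMMIT_PATH.read_text()
    print(describe_overcommit(current), "| fork friendly:", fork_friendly(current))
```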

But I can't understand why you force a connection between persistence
and having or not having a disk-backed data storage.
Those are two completely different things. If Redis crashes because it
ran out of memory, or if it crashes because you had a power failure in
your data center, there is no difference: you had AOF enabled? You
likely have all your data minus the latest few seconds.
You had .rdb enabled? You have your .rdb files saved as configured.

Nothing magical or different than usual.

Xiangrong Fang

Jun 1, 2011, 11:50:43 AM
to redi...@googlegroups.com
Hi Salvatore,

You mean, there is no connection between persistence and disk backed storage?  I try to understand this as:

1) if dataset fits in physical memory, nothing strange happens, persistence only ensures that no data is lost during a reboot or a power failure, or redis server failure etc.  It has nothing to do with data reading/writing which all happens in memory.

2) if dataset exceeds memory, the OS swap mechanism kicks in and redis will NOT try to swap out "cold" data (as long as vm is not in use, and vm is now a discouraged technology).

Is that correct?

Thanks,
Shannon

2011/6/1 Salvatore Sanfilippo <ant...@gmail.com>
On Wed, Jun 1, 2011 at 3:28 PM, Xiangrong Fang <xrf...@gmail.com> wrote:

Paolo Negri

Jun 1, 2011, 12:53:29 PM
to Redis DB
Hi Salvatore

I would like to point out that in its current state Redis is very
problematic to persist in write-heavy (60/40 read/write split)
scenarios: BGSAVE has a massive memory overhead (not far from 100%)
and Virtual Memory is incompatible with replication or BGSAVE because
of IO factors.
So the value of diskstore isn't only to allow storing datasets larger
than RAM, but also to provide a viable persistency solution for
datasets that change at a rate where taking a consistent snapshot of
the database is hard or impossible altogether.

If you take a look at the mailing list threads from late December [1],
[2], [3] you'll notice that the development of diskstore was sparked
by two needs: to be a viable alternative to VM, allowing datasets larger
than RAM capacity, and also to offer a different model of persistency
compatible with write-heavy loads.

So while you state that persistency is a Redis feature you want to
provide, the descoping of diskstore (or a similar alternative) from
the roadmap gives Redis weak and problematic persistency support in
scenarios where Redis is instead meant to shine.
Personally I hope that, if not diskstore, then a different solution will
be offered soon in order to be able to have durable backups that don't
have the high cost of a consistent snapshot like SAVE / BGSAVE.

I would also like to give my +1 for the inclusion of diskstore in the
2.4 branch and release. I think that, as it is now, it is a much more
sensible alternative to the VM implementation.

[1] https://groups.google.com/group/redis-db/browse_thread/thread/9267ab64bdb4665e/d99a97afbab3cbd0
[2] https://groups.google.com/group/redis-db/browse_thread/thread/d444bc786689bde9/d7f3b611be46714b
[3] https://groups.google.com/group/redis-db/browse_thread/thread/c6b41aafc537fafb/2af72542d40ea2c3

Thanks,

Paolo

Salvatore Sanfilippo

Jun 1, 2011, 1:42:43 PM
to redi...@googlegroups.com
On Wed, Jun 1, 2011 at 6:53 PM, Paolo Negri <paolo...@wooga.net> wrote:
> Hi Salvatore
>
> I would like to point out that at the current state Redis is very
> problematic to persist in write heavy (60/40 read/write split)
> scenarios, BGSAVE has a massive memory overhead (not far from 100%)
> and Virtual Memory is incompatible with replication or BGSAVE because
> of IO factors.
> So the value of diskstore isn't only to allow to store datasets larger
> than RAM but also is to provide a viable persistency solution for
> datasets that change at a rate where taking a consistent snapshot of
> the database is hard or impossible altogether.

Hello Paolo,

I think the reality is more or less *exactly* the contrary.
One of the problems of diskstore is that you need to have a time limit
after you change a key to flush it to disk (otherwise it is not a
store at all).
So even if the same key is changing continuously, as in a high
performance INCR business, you end up having the write queue always
populated (imagine having many counters). And what to do once you
can't write fast enough? The only thing you can do is block
clients, or violate the contract with the user that a modified key
will be transferred to disk within at most N seconds.

That's point one. Point two is: there is no such thing as a
problem with BGSAVE saving to disk under high write load.
The *peak* requirement is 2x memory, but in Redis 2.2 (make sure to
use the absolute latest 2.2 patch level) this was improved
significantly, and with 2.4 this is much better as the persistence
itself is much faster. One of the problems with 2.2 was the dict
iterator triggering copy-on-write of pages even without any write
against the key.

So in the worst case you need 2x memory, but this worst case is now
really unlikely to happen compared to the past. When it does happen,
it is a requirement to keep in mind.

And interestingly, this has nothing to do with diskstore. The
requirement above is a direct consequence of the fact that we create
*point in time* snapshots of the dataset.

> If you take a look at the mailing lists threads in late december [1],
> [2], [3] you'll notice that the development of diskstore was sparked
> by two needs, be a viable alternative to VM allowing datasets larger
> than RAM capacity but also to offer a different model of persistency
> compatible with write heavy loads.

You are wrong, diskstore's only other advantage is *fast restarts*. And
it is also more or less a fake thing, as the startup will not actually
provide the same performance as normal running time, since every key
will produce a disk access.
I can assure you that to make writes faster you don't start flushing
things to disk ;)

> So while you state that persistency is a redis feature you want to
> provide, the descoping of diskstore (or a similar alternative) from
> the roadmap gives redis a weak and problematic persistency support in
> scenarios where instead redis is meant to be shining.

Again, not agreed. The fact that the peak memory can be 2x the RAM
does not mean we have a persistence problem. It is just that if you
have a lot of writes, and you want point-in-time persistence, the price
to pay is obviously up to 2x memory. But I suggest trying this with
Redis 2.4 in a simulation, it is much better than it used to be.

It is however possible to reason about having an alternative
persistence not guaranteeing point-in-time snapshots, but guaranteeing
instead very low memory usage while saving. In the persistence arena
there is still a lot for us to experiment with, but this has nothing to
do with diskstore in my opinion.

Valentino Volonghi

Jun 1, 2011, 3:07:11 PM
to redi...@googlegroups.com
On Jun 1, 2011, at 10:42 AM, Salvatore Sanfilippo wrote:

> Hello Paolo,
>
> I think the reality is more or less *exactly* the contrary.
> One of the problems of diskstore is that you need to have a time limit
> after you change a key to flush it into disk (otherwise it is not a
> store at all).
> So even if the same key is changing continuously like into an high
> performance INCR business, you end having the write queue always
> populated (imagine having many counters). And what to do once you
> can't write fast enough? The only thing you can do is blocking
> clients, or violate the contract with the user that a modified key
> will be transfered in at max tot seconds.

Redis basic caching behavior wouldn't change, you'd just write to persistent storage after X changes happened, but instead of writing out the entire memory content you write out only the values that changed. This can't be slower or worse than writing out GBs of data each time.

You don't need to keep a timer on each key, you simply need to write to disk only the keys that have actually changed instead of the entire snapshot.

> So in the worst case you need 2x memory, but this worst case is now
> really unlikely to happen compared to the past. When this happens,
> this is a requirement to take in mind.

Worst case might be 2x, but even if it were 1.5x, that is still roughly a third of the memory that you can't use for storage but have to reserve for backups.

> And interesting: this has nothing to do with diskstore. The
> requirement above is a direct consequences of the fact that we create
> *point in time* snapshots of the dataset.

Since diskstore is more key-oriented, it should be possible to make the requirement above scale with the size of the changed key-value pairs rather than with the entire dataset.

>> If you take a look at the mailing lists threads in late december [1],
>> [2], [3] you'll notice that the development of diskstore was sparked
>> by two needs, be a viable alternative to VM allowing datasets larger
>> than RAM capacity but also to offer a different model of persistency
>> compatible with write heavy loads.
>
> You are wrong, diskstore only other advantage is *fast restarts*. And
> it is also more or less a fake thing as actually the startup will not
> provide the same performances as normal running time, as every key
> will produce a disk access.
> I can assure you that to make write faster you don't start flushing
> things on disk ;)

Why is its only advantage fast restarts? Its biggest advantage is better memory efficiency. I think very few care about fast restarts, and even then fast restarts matter mostly for big datasets, which will be crippled first by the inefficient use of memory.

>> So while you state that persistency is a redis feature you want to
>> provide, the descoping of diskstore (or a similar alternative) from
>> the roadmap gives redis a weak and problematic persistency support in
>> scenarios where instead redis is meant to be shining.
>
> Again, not agreed. The fact that the peak memory can be 2x the RAM
> does not mean we have a persistence problem. Just if you have a lot of
> writes, and you want point-in-time persistence, the price to pay is
> obviously up to 2x memory. But I suggest trying this into Redis 2.4 in
> a simulation, it is much better than it used to be.

But you can't store stuff that is bigger than RAM even though 60% of the time you use 10% of what you are storing in redis.

> It is however possible to reason about having an alternative
> persistence not guaranteeing point-in-time snapshots, but guaranteeing
> instead very low memory usage while saving. In the persistence arena
> there is a lot to experiment for us still, but this has nothing to do
> with diskstore in my opinion.


RDBMSes have been doing point-in-time snapshots and guaranteeing persistency for a very long time.

--
Valentino Volonghi aka Dialtone
Now Running MacOSX 10.6
http://www.adroll.com/

Paolo Negri

Jun 1, 2011, 3:36:49 PM
to Redis DB


On Jun 1, 7:42 pm, Salvatore Sanfilippo <anti...@gmail.com> wrote:
It is true that there's no problem if enough margin is allocated, but the
problem is how big this margin has to be in order to safely operate a
Redis instance that can be guaranteed to be persistent under write-heavy
load.

Let's take an example: I have a 16GB machine, so I'll be willing to
have at most 8GB of data at any point in time, since the maximum BGSAVE
overhead is 100%.

What happens when the data set reaches 8GB is rather worrying. If
BGSAVE starts failing because it is running out of RAM, this has 2
effects: first, the persistent snapshot will stop being updated, and
second, every time BGSAVE tries to trigger there's a risk of the OOM
killer waking up and killing the Redis instance. This means data loss
+ server downtime: disastrous.

So let's say that I want to operate my system with a 10% safety margin
in order to accommodate some data set growth before reaching the
dangerous 8GB threshold, this means that 7.3GB becomes the amount of
data I'll be actually willing to store on my 16GB server.

But it's even worse than that because this means that at 7.3GB I'll
actually want to upgrade to a bigger machine in order to not erode my
10% safety margin.
7.3GB is then the end of life data set size for my 16GB server.

So, in the case of linear data growth, I'll be using my 16GB machine for a
data set with an average size of 5.5GB (assuming that I moved to the
16GB machine, upgrading from an 8GB one, at a data set size of 3.7GB).
This is close to a 200% average overhead.
And if I'm running a master-slave setup in order to have a
hot replacement for the master instance, then an average of 5.5GB of
data costs 2 servers with 16GB of RAM each.
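
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the reasoning in this message: the 2x peak factor and the 10% margin are the figures used above, the function name is made up, and the unrounded results differ slightly from the rounded numbers in the text (about 3.6GB rather than 3.7GB for the smaller machine).

```python
BGSAVE_PEAK_FACTOR = 2.0   # worst case from the thread: peak memory = 2x dataset
SAFETY_MARGIN = 0.10       # the 10% headroom used in the example


def end_of_life_gb(ram_gb):
    """Largest dataset (GB) to keep on a machine before upgrading."""
    return ram_gb / BGSAVE_PEAK_FACTOR / (1 + SAFETY_MARGIN)


start = end_of_life_gb(8)     # size at which data moved off the 8GB machine
end = end_of_life_gb(16)      # size at which the 16GB machine is outgrown
average = (start + end) / 2   # linear growth: mean of the two endpoints
overhead = (16 - average) / average

print(f"start={start:.2f}GB end={end:.2f}GB average={average:.2f}GB "
      f"overhead={overhead:.0%}")
```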

As of today disk store is the only working solution that would
alleviate the problem.

Paolo

Salvatore Sanfilippo

Jun 1, 2011, 4:13:16 PM
to redi...@googlegroups.com
On Wed, Jun 1, 2011 at 9:07 PM, Valentino Volonghi <dial...@gmail.com> wrote:

> Redis basic caching behavior wouldn't change, you'd just write to persistent storage after X changes happened, but instead of writing out the entire memory content you write out only the values that changed. This can't be slower or worse than writing out GBs of data each time.
>
> You don't need to keep a timer on each key, you simply need to write to disk only the keys that have actually changed instead of the entire snapshot.

This is not how it works. Just an example: you set a time of 1000
seconds between saves, and in these 1000 seconds you touch 50% of the
dataset. Then you need to have half of the dataset in memory, as
modified keys can't be discarded to free memory. So you can no longer
guarantee diskstore-max-memory. It is more complex than that btw, but
the example is enough to show that things are more interesting than
what you may think at first.
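
A toy model in Python makes the point concrete. This is not diskstore code: the `WriteBackCache` class and its `max_keys` limit are hypothetical stand-ins for a cache that may evict clean keys but must keep modified ones until the next flush.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy model: clean keys may be evicted, dirty keys may not."""

    def __init__(self, max_keys):
        self.max_keys = max_keys      # stand-in for a memory limit
        self.data = OrderedDict()     # key -> value, oldest first
        self.dirty = set()            # keys modified since the last flush

    def write(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        self.dirty.add(key)
        self._evict_clean()

    def _evict_clean(self):
        # Drop the oldest clean keys until we are back under the limit.
        for key in list(self.data):
            if len(self.data) <= self.max_keys:
                break
            if key not in self.dirty:
                del self.data[key]    # pretend the value is safe on disk

    def flush(self):
        """Pretend to write all dirty keys to disk; return how many."""
        flushed = len(self.dirty)
        self.dirty.clear()
        self._evict_clean()
        return flushed


cache = WriteBackCache(max_keys=100)
for i in range(500):                  # touch 500 keys between two flushes
    cache.write(f"key:{i}", i)
print(len(cache.data))                # well above max_keys until the flush
```

Until `flush()` runs, every one of the 500 modified keys has to stay resident, so the nominal limit of 100 cannot be honored without either flushing earlier or blocking writers.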

> Worst case might be 2x but even if it was 1.5x it's still 30% of the memory that you can't use for storage but have to reserve for backups.

If you want an in-memory data store, you can't escape the 2x rule. I
can provide you a mathematical proof of that.

You have two storage media, one is RAM, one is disk. You want to
transfer a point-in-time snapshot from RAM to disk.
Once you start dumping the content of the first medium into the second
one, new writes may arrive in the first medium. In order to guarantee
the point-in-time semantics you need to accumulate these changes in
some way.

So, Law Of Redis #1: an in memory database takes, to produce an
on-disk point-in-time snapshot of the dataset, an amount of additional
memory proportional to the changes received in the dataset while the
snapshot is being performed.

You can implement that in different ways but the rule does not change.
In our case we use copy-on-write, so it is particularly severe as
every modified byte will copy a whole page. So for instance in the
worst case just 5000 modified keys will COW 19 MB of pages.
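
The 19 MB figure follows from simple arithmetic, sketched here in Python. The 4 KiB page size is an assumption (it is the usual default on x86 Linux), as is the worst case of one separate page per modified key.

```python
PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB pages, the common default on x86 Linux


def cow_bytes(modified_keys, pages_per_key=1):
    """Worst-case bytes copied if every modified key dirties its own page(s)."""
    return modified_keys * pages_per_key * PAGE_SIZE


copied = cow_bytes(5000)
print(f"{copied} bytes = {copied / 2**20:.1f} MiB")
```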

However the alternative is to duplicate values when needed, and Redis
values can be very large lists for instance, so COW is the best thing
we can do. Now that we optimized it, in most conditions this is a
non-issue for many users. And for users with a trillion changes per
second, 2x RAM is the requirement.

> Being diskstore more key-oriented it should be possible to have the requirement above be done at the # key-value changes size level rather than at the entire dataset level.

I think that people want simple behavior for persistence, without
weights to assign to keys.
But this is not the point of what I was saying. What I was saying in
the above sentence is that diskstore is unrelated to that: you can
just have, without diskstore, a saving thread that saves the keys one
after the other without guaranteeing any point-in-time feature, and
such a system will use constant additional memory to save.
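
A minimal Python sketch of such a key-by-key saver shows what is given up. Everything here is hypothetical: the `after_each_key` hook simulates client writes arriving during the save, and for simplicity the sketch copies the key names up front (the values are never duplicated).

```python
import io
import json


def save_key_by_key(store, out, after_each_key=None):
    """Write the current value of each key, one key at a time.

    after_each_key is a hypothetical hook standing in for client writes
    that arrive while the save is still running.
    """
    for key in sorted(store):        # key names are copied up front, values are not
        if key in store:             # the key may have been deleted meanwhile
            out.write(json.dumps({"key": key, "value": store[key]}) + "\n")
        if after_each_key is not None:
            after_each_key(key)


store = {"a": 1, "b": 1}


def client_write(saved_key):
    # Right after "a" is saved, a client updates both keys together.
    if saved_key == "a":
        store["a"] = 2
        store["b"] = 2


out = io.StringIO()
save_key_by_key(store, out, after_each_key=client_write)
saved = [json.loads(line) for line in out.getvalue().splitlines()]
print(saved)
```

The file ends up with the old value of `a` next to the new value of `b`, a combination that never existed in memory. That is the price of dropping the point-in-time guarantee in exchange for low memory use while saving.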

So we are mixing arguments IMHO. Diskstore was addressing just two things:

1) datasets larger than RAM.
2) fast restarts, as there is no load time since memory is just a cache.

We'll see later in this email how actually it sucks at "1" and at "2" anyway.

> Why is its only advantage fast restart? Its biggest advantage is better memory efficiency, I think very few care about fast restarts and even in that case fast restarts happen mostly with big datasets that will be crippled first by the non efficient use of memory.

I agree that fast restarts are not that much of a killer feature, but I
was just mentioning one of the advantages.
The other is that you use less memory to store the same amount of
data. For that to work well you need:

1) That you have a very biased access pattern. If access is evenly
distributed the system will behave like an on-disk system, and this is
not the goal of Redis.
2) That you have few writes, otherwise you can either choose to
have a hard memory bound and slow down writes, or you can try to
handle peaks by going higher with memory usage for a while. But at the
end of the day you need to start slowing down clients to save your
long queue of keys. Remember: you have a fixed memory limit in this
setup.
3) Also values need to be small. Basically this means turning Redis
into a plain key-value store. You can't have a big sorted set as a
value, as serializing and unserializing that is hard. Otherwise you
need to go much further and implement all the Redis types on disk
directly, efficiently, and without fragmentation. This means creating
a completely different project: could be nice but is not Redis.

So I'm pretty shocked to hear that diskstore is the solution to a
persistence problem.
If we can have a different persistence engine that optionally drops
point-in-time in favor of some other kind of persistence mechanism, why
not? This would be cool. But fixing that with a disk-based storage
engine does not make sense if we want a fast in-memory store that can
do writes as fast as reads, can have a key holding a 40 million
entry sorted set without even noticing the load, and so forth.


> But you can't store stuff that is bigger than RAM even though 60% of the time you use 10% of what you are storing in redis.

The secret is being able to accept compromises in systems.

When there is such a big bias, it is at the application level that you
need to get smarter. Use Redis as a cache, but even write against your
cache and transfer your values when needed (I do this with good
results for instance). But expose all these tradeoffs to the
application.

It is crystal clear that the Redis data model is not compatible with
on-disk storage, at least with the limitations imposed by the OS API and
disk controllers, where you can't have a decent guarantee of consistency
without resorting to things like fsync() or to journals. So even SSDs
are out of the question: you can't model complex data structures with
pointers on disk and expect it to rock.

> RDMSes have been doing point-in-time snapshots and guaranteeing persistency for a very long time.

The 2x memory problem only exists when you want point-in-time on an
external medium.
When both your "live" data and your "persistence" data are the same
thing, having point-in-time is a no-brainer and has been done by RDBMSs
forever.

Ciao,
Salvatore


Salvatore Sanfilippo

Jun 1, 2011, 4:18:58 PM
to redi...@googlegroups.com
On Wed, Jun 1, 2011 at 9:36 PM, Paolo Negri <paolo...@wooga.net> wrote:
> But it's even worse than that because this means that at 7.3GB I'll
> actually want to upgrade to a bigger machine in order to not erode my
> 10% safety margin.
> 7.3GB is then the end of life data set size for my 16GB server.

That is all up to two factors:

1) time it takes to persist on disk.
2) number of changes per unit of time.

Additional memory is (1) * (2), that is, the persist time multiplied by
the rate of change.
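
As a rough sketch of that product in Python: the function below multiplies the persist time by the rate at which pages are dirtied, and caps the result at one full extra copy of the dataset (the 2x worst case). The 4 KiB page size and the parameter names are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB pages


def extra_memory_bytes(persist_seconds, pages_dirtied_per_second, dataset_bytes):
    """Estimate copy-on-write memory needed while a snapshot is written."""
    copied = persist_seconds * pages_dirtied_per_second * PAGE_SIZE
    return min(copied, dataset_bytes)   # can never exceed one full extra copy


dataset = 8 * 2**30                      # an 8 GiB dataset
light = extra_memory_bytes(60, 1_000, dataset)
heavy = extra_memory_bytes(600, 10_000_000, dataset)
print(f"light load: {light / 2**20:.0f} MiB, heavy load: {heavy / 2**30:.0f} GiB")
```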

When this is not acceptable there are two possible solutions we
should start developing.

SOLUTION A: a persistence method that does not have point-in-time
guarantees. Just a thread that saves one key after the other.

SOLUTION B: using the append only file, but writing it in segments, and
writing tools that can compact these pieces by tracing DELs and other
commands. So you don't need a background rewrite process.
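
To give an idea of what compacting segments could look like, here is a hypothetical Python sketch. It is not the Redis AOF format or any existing tool: each segment is modeled as a list of `("SET", key, value)` and `("DEL", key)` tuples, and only those two commands are handled.

```python
def compact(segments):
    """Replay closed segments oldest-first and keep only the final state."""
    state = {}
    for segment in segments:
        for op, key, *args in segment:
            if op == "SET":
                state[key] = args[0]
            elif op == "DEL":
                state.pop(key, None)
            else:
                raise ValueError(f"unsupported command: {op}")
    # One SET per surviving key replaces all the traced history.
    return [("SET", key, value) for key, value in state.items()]


segments = [
    [("SET", "a", "1"), ("SET", "b", "2")],   # older segment
    [("DEL", "a"), ("SET", "b", "3")],        # newer segment
]
print(compact(segments))
```

Because only closed segments are read, a tool like this could run outside the server process while new writes keep going to the current segment.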

I like that we are talking about improving that part of Redis, and
there are the tools to do that, and nice challenges as well :)
But diskstore is not the solution for this problem... it is instead
the start of many other issues.

Valentino Volonghi

Jun 1, 2011, 5:50:55 PM
to redi...@googlegroups.com
On Jun 1, 2011, at 1:13 PM, Salvatore Sanfilippo wrote:

> On Wed, Jun 1, 2011 at 9:07 PM, Valentino Volonghi <dial...@gmail.com> wrote:
>
>> Redis basic caching behavior wouldn't change, you'd just write to persistent storage after X changes happened, but instead of writing out the entire memory content you write out only the values that changed. This can't be slower or worse than writing out GBs of data each time.
>>
>> You don't need to keep a timer on each key, you simply need to write to disk only the keys that have actually changed instead of the entire snapshot.
>
> This is not how it works, just an example: you set a time of 1000
> seconds between saves, in this 1000 seconds you touch 50% of the
> dataset. Then you need to have half of the dataset in memory, as
> modified keys can't be discarded to free memory. So you can no longer
> guarantee diskstore-max-memory. It is more complex than that btw but
> the example is enough to show that things are more interesting than
> what you may think at first.

It is only slightly more complex. If you also have a max-memory requirement, then you can set a memory threshold that triggers a flush to disk, the same way you currently have multiple settings for the Redis snapshot depending on how many changes there are or how much time has passed.

If diskstore-max-memory is a hard limit, then it will have to be part of the flushing decision in addition to the timer setting. Likewise for all other hard memory limits that a user sets.
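
A sketch of such a combined flush decision in Python, under stated assumptions: the `(seconds, changes)` pairs mimic the style of the `save <seconds> <changes>` lines in redis.conf, while `should_flush` and `max_dirty_bytes` are hypothetical names for this example, not diskstore settings.

```python
# (seconds, changes) pairs in the style of redis.conf "save" lines.
SAVE_RULES = [(900, 1), (300, 10), (60, 10000)]


def should_flush(seconds_since_flush, dirty_keys, dirty_bytes,
                 max_dirty_bytes, rules=SAVE_RULES):
    """Flush when the memory threshold is hit or any time/changes rule matches."""
    if dirty_bytes >= max_dirty_bytes:
        return True                      # hard memory limit wins over timers
    return any(seconds_since_flush >= secs and dirty_keys >= changes
               for secs, changes in rules)


one_mib = 2**20
print(should_flush(10, 5, 2 * one_mib, max_dirty_bytes=one_mib))   # memory rule
print(should_flush(301, 10, 100, max_dirty_bytes=one_mib))         # time rule
```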

>> Worst case might be 2x but even if it was 1.5x it's still 30% of the memory that you can't use for storage but have to reserve for backups.
>
> If you want an in-memory data store, you can't escape the 2x rule. I
> can provide you a mathematical proof of that.

The 2x rule is only valid when you save the entire dataset instead of just what has changed, in which case it would only apply in a very extreme case.

>> Being diskstore more key-oriented it should be possible to have the requirement above be done at the # key-value changes size level rather than at the entire dataset level.
>
> I think that people want simple behavior for persistance, without
> weights to assign to keys.

Nobody assigns weights to keys any more than what is currently being done in volatile-lru. Most accessed stays in memory, least accessed doesn't.
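
To make "most accessed stays in memory" concrete, here is a minimal exact-LRU sketch. It is a toy under stated assumptions: `LRUMemoryTier` is a made-up name, a plain dict stands in for the disk, and writes go straight through to it for simplicity. Redis's own volatile-lru is an approximation and only considers keys with an expire set, so this is not how Redis implements it.

```python
from collections import OrderedDict


class LRUMemoryTier:
    """Hypothetical sketch: keep the most recently used keys in memory."""

    def __init__(self, capacity, backing):
        self.capacity = capacity    # max number of keys kept in memory
        self.backing = backing      # dict standing in for the on-disk store
        self.hot = OrderedDict()    # least recently used key comes first

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)   # mark as recently used
            return self.hot[key]
        value = self.backing[key]       # cache miss: fetch from "disk"
        self._admit(key, value)
        return value

    def set(self, key, value):
        self.backing[key] = value       # write-through, for simplicity
        self._admit(key, value)

    def _admit(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.capacity:
            self.hot.popitem(last=False)    # evict the least recently used
```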

>> Why is its only advantage fast restart? Its biggest advantage is better memory efficiency, I think very few care about fast restarts and even in that case fast restarts happen mostly with big datasets that will be crippled first by the non efficient use of memory.
>
> I agree that fast restarts are not that killer feature, but I just was
> mentioning one of the advantages.
> The other is that you use less memory to store the same amount of
> data. For that to work well you need:

These are the same 2 advantages that I mentioned, just to be clear, so we are on the same page.

> 1) That you have a very biased access pattern. If access is evenly
> distributed the system will behave like an on-disk system, and this is
> not the goal of Redis.

I can't understand this.

Why is it impossible to implement a memory store that uses a background process to save changed keys?

Redis has no power to decide on the frequency of access of each key. If you access every key evenly then you'll need more memory, so this point is moot. But the vast majority of use cases access a few keys most of the time, and in that case the proposed solution would work fine. People who enable diskstore do it knowing that this is the use case. All Redis users are already keeping everything in memory, so you'd be providing a new feature to better handle more data without requiring more machines.

> 2) That you have little writes, otherwise you can either select to
> have an hard memory bound and slow down writes, or you can try to
> handle peaks going higher with memory usage for a while. But at the
> end of the day you need to start slowing down clients to save your
> long queue of keys. Remember: you have a fixed memory limit in this
> setup.

Writes to memory get aggregated in memory and are then saved only once to disk. If you can save the entire dataset every X minutes today, there is just no way you wouldn't be able to save just a few keys. Guaranteeing fixed memory limits down to the single byte is really hard, but I can't see it as a problem.

> 3) Also values need to be small. Basically this means to turn Redis
> into a plain key-value store. You can't have a big sorted set as a
> value, as to serialize-unserialize that is hard. Otherwise you need to
> go much more forward and implement all the Redis types on disk
> directly, efficiently, and without fragmentation. This means to create
> a completely different project: could be nice but is not Redis.

I can see that this is an issue, but the rest you mentioned is definitely not.

> So I'm pretty shocked to hear that diskstore is the solution to a
> persistence problem.

It's a memory efficiency problem from my point of view: not only can you not manage datasets bigger than RAM, you actually can't manage datasets bigger than 50% of your RAM.

> If we can have a different persistence engine that drops point-in-time
> optionally in favor of some other kind of persistence mechanism, why
> not? This would be cool. But fixing that with a disk-based storage
> engine does not make sense if we want a fast in-memory store that can
> do writes as fast as reads, can have a key holding a 40 millions
> entries sorted set without even noticing the load, and so forth.

I fail to see how aggregated memory writes are connected to disk writes any more than in the current snapshotting solution. It's equivalent to the problem of what happens when you can't finish saving a snapshot before the next snapshot is due.

>> But you can't store stuff that is bigger than RAM even though 60% of the time you use 10% of what you are storing in redis.
>
> The secret is being able to accept compromises in systems.

There are good compromises and bad compromises.

> When there is such a big bias, it is at application level that you
> need to get smarter. Use Redis as a cache, but even write against your
> cache and transfer your values when needed (I do this with good
> results for instance). But expose all this tradeoffs to the
> application.
>
> It is crystal clear that Redis data model is not compatible with on
> disk storage, at least with the limitations imposed by the OS API and
> disk controllers, where you can't have decent guarantee of consistency
> without resorting to things like fsync() or to journals. So even SSD
> are out of questions, you can't model complex data structures with
> pointers on disk and expect it to rock.

Redis is the perfect write-through cache for more complex data structures. I can see all your points regarding the complexity of implementing a disk format that allows for fast updates in place (what most other stores do is use append-only files and compaction, similar to what MVCC does), but I can't see any other point about consistency, durability, tradeoffs or memory usage. Redis, rather than the client, is in the best position to know how to manage its keys, and this is a pretty widespread use case.

>> RDMSes have been doing point-in-time snapshots and guaranteeing persistency for a very long time.
>
> The 2x memory problem only exists when you want point-in-time on an
> external media.
> When both your "live" data and your "persistence" data are the same
> thing to have point-in-time is a non brainer and done by RDMSs
> forever.


Most RDBMSes also write the changes to a buffer and flush it when needed; it's a basic configuration parameter for all of them. There is nothing inherently wrong or slow in RDBMSes. It's the number of features you use that makes it a problem: if you want full ACID guarantees then you are gonna be slow, relax D a bit and it will be much faster.

http://www.scribd.com/doc/31669670/PostgreSQL-and-NoSQL

Valentino Volonghi

Jun 1, 2011, 5:57:46 PM6/1/11
to redi...@googlegroups.com

On Jun 1, 2011, at 1:18 PM, Salvatore Sanfilippo wrote:

> I like that we are talking about improving that part of Redis, and
> there are the tools to do that, and nice challenges as well :)
> But diskstore is not the solution for this problem... it is instead
> the start of many other issues.


diskstore is just the name of a system that behaves like a write-back cache instead of a VM.

Xiangrong Fang

Jun 1, 2011, 9:24:39 PM6/1/11
to redi...@googlegroups.com
Hi Salvatore,

Could you please explain (probably using an example) what "point-in-time guarantees" are? I don't understand why you say a dataset larger than memory is NOT (a part of) the persistence problem.

If one can guarantee that data won't exceed physical memory, things are much easier, e.g. you can set up a master-slave pair where writes are *only* directed to the master and reads are distributed evenly. The master does not bgsave and only the slave does the save. This is not as strong as RDBMS transaction-level safety, but even in a crash so severe that both master and slave go down, only a portion of the data will be lost, even for a very busy site.
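
The routing half of that setup can be sketched on the client side as below. This is only an illustration: `ReadWriteRouter` and `FakeNode` are hypothetical names, the fake nodes stand in for real connections, and which nodes serve reads is an assumption left to the caller. The persistence half (the master not saving, the slave saving) is server configuration and is not shown.

```python
import itertools


class ReadWriteRouter:
    """Hypothetical sketch: writes go to the master, reads rotate over readers."""

    def __init__(self, master, readers):
        self.master = master
        self._readers = itertools.cycle(readers)   # round-robin over readers

    def set(self, key, value):
        return self.master.set(key, value)

    def get(self, key):
        return next(self._readers).get(key)


class FakeNode:
    """In-process stand-in for a server connection, used only for the demo."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.reads = 0

    def set(self, key, value):
        self.store[key] = value
        return True

    def get(self, key):
        self.reads += 1
        return self.store.get(key)
```

Any client object exposing `get` and `set` methods could be passed in place of `FakeNode`. Because replication is asynchronous, a read from a slave may briefly return an older value than the one just written to the master.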

As Redis is primarily used for web 2.0 applications, or anything unlike a financial institution, I think this level of persistence is more than enough. The only problem I feel is datasets larger than memory.

Shannon

2011/6/2 Salvatore Sanfilippo <ant...@gmail.com>

Didier Spezia

Jun 2, 2011, 3:02:32 AM6/2/11
to Redis DB

Gentlemen,

I don't wish to jump into this interesting diskstore discussion,
but I would like to comment on this 2x memory thing due to the
COW.

On a typical Posix system, the absolute worst case for bgsave
is not x2, but x3. You tend to forget the filesystem cache
in your calculations.

Let's imagine a Redis instance containing a large number
of 4 KB strings of incompressible data. The size of the dump
file is therefore close to Redis memory consumption.
In the worst case, writing a large file without triggering
swapping mayhem requires as much memory as the size of the
file (especially on Linux with oldish kernels).

So we have:
+ memory of the Redis instance
+ COW duplicated pages
+ filesystem cache for the dump file

The theoretical worst case is therefore x3 memory. Now this
is the theory. In practice hopefully, both COW and
filesystem cache memory overhead are not that bad.

One way to alleviate filesystem cache pressure would be
to play with incremental fsync'ing and posix_fadvise
in rdbWriteRaw.
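
A minimal sketch of that technique, written in Python for illustration (it is not the Redis C code): the dump is written in chunks, and every `sync_every` bytes the file is fsync'ed and the kernel is advised that the already-written range can be dropped from the page cache. The function names and the 32 MB default are assumptions; `os.posix_fadvise` is only available on some Unix platforms, hence the guard.

```python
import os


def _sync_and_drop(fd, start, end):
    """fsync the file, then hint that pages in [start, end) can leave the cache."""
    os.fsync(fd)
    if end > start and hasattr(os, "posix_fadvise"):
        try:
            os.posix_fadvise(fd, start, end - start, os.POSIX_FADV_DONTNEED)
        except OSError:
            pass  # the advice is best-effort; ignore filesystems that reject it
    return end


def write_dump_incrementally(path, chunks, sync_every=32 * 1024 * 1024):
    """Write byte chunks to path, syncing and dropping cached pages as we go."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0   # total bytes written so far
    synced = 0    # offset up to which data has been fsync'ed
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:                     # os.write may write partially
                n = os.write(fd, view)
                view = view[n:]
                written += n
            if written - synced >= sync_every:
                synced = _sync_and_drop(fd, synced, written)
        _sync_and_drop(fd, synced, written)  # final flush of the tail
    finally:
        os.close(fd)
    return written
```

Syncing before advising matters: dirty pages cannot be dropped, so the advice is only useful once the data has reached the disk.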

Regards,
Didier.


On Jun 1, 10:18 pm, Salvatore Sanfilippo <anti...@gmail.com> wrote:

Lu Wenlong

Jun 2, 2011, 12:07:44 PM6/2/11
to redi...@googlegroups.com
hello,

I am new to Redis, and actually new to computer programming too. The following is a rough proposal for combining diskstore and rdb/aof to address two drawbacks of an in-memory store: weak persistence and limited space. OF COURSE, Redis has already done most of this, and I am just offering a modest proposal here. I will be very glad if it is even a little bit helpful. If the proposal is bad or totally wrong, I am sorry for wasting your time reading it. (Actually, I do not fully understand some of the concepts/mechanisms mentioned below.)


*Design*
1. Treat diskstore and rdb/aof as two on-disk copies of the in-memory dataset. Both hold the extra cold dataset plus the hot dataset that fully fits into memory;
--use more disk and CPU to compensate for the disadvantages of memory, that is, weak persistence and limited space (less data per unit cost);

--when implementing the replacement algorithm for the diskstore cache file, many things should be taken into account, such as access frequency (read/write), creation time, size of values, etc.

--memory is the primary storage, diskstore the secondary one, and rdb/aof is used only for persistence



2. diskstore is used only as the read channel, just pumping plain key-values into memory, AND diskstore should be designed specifically for that. This happens in only two cases: reading and modifying keys that are not in memory. When creating or deleting keys, it is not necessary to load key-values from disk into memory.

--this eliminates the direct write channel between memory and diskstore because of its bad performance



3. rdb/aof is the write channel for dumping modified key-values to disk, AND it is already well designed for exactly this requirement. It is used whenever keys are created, deleted, or modified.

--add more info to the current rdb/aof to support the conversion in step 4.



4. Convert rdb/aof to diskstore in a timely manner once persisted. How to do that? Can it be very fast? Does it need even more memory or CPU? Please keep in mind that this happens in only one direction.



*Process flow*
1. WHEN the dataset fits into memory, everything happens as usual: reading/writing the dataset from memory, and dumping the dirty data to disk via rdb/aof. BUT there is one more thing to take care of, i.e., converting rdb/aof into plain key-value pairs and constructing the diskstore in the background.

2. WHEN the dataset exceeds memory, we should evict key-value pairs using the replacement algorithm. One thing to keep in mind is that the evicted pairs should already be in rdb, diskstore, or both.

   2.1 IF the keys being read/written are in memory (cache hit), everything behaves exactly the same as when the dataset fits into memory. Some key-value pairs may need to be evicted at the same time.
   2.2 IF the keys you want to manipulate are not in memory (cache miss), there are two cases.

      2.2.1 For creating/deleting key-value pairs, there is no need to pump data from diskstore into memory; just do it in memory and pass the necessary info to rdb, which will change the data in diskstore indirectly.
      2.2.2 For reading/modifying key-value pairs, pump the data from diskstore into memory, do the related manipulations, and evict extra pairs if necessary; modifications are persisted into rdb/aof and then transferred into the diskstore.
  


RGS


Wenlong






--
Without ideals you won't go far; without facing reality you won't even get through today.
 


Gaurav

Jun 2, 2011, 6:32:25 PM6/2/11
to Redis DB
This is a great discussion. This is hurting us quite a bit and making
us rethink the redis route altogether.

Here are our requirements:
- Blazingly fast reads (we do more than 200M reads per day, about 25M
writes).
- Disk backed storage to minimize data loss. We don't care about
transactional consistency and can tolerate a few tens of minutes of
data loss
- replication. single master, multiple slaves.
- slaves are globally distributed so if a slave loses connection then
we can tolerate serving stale data for a short period of time till the
slave catches up with the master.
- easy future scalability. we envision adding several slaves as we
grow. We also have done some application level sharding so we can grow
memory consumption by moving the shards to larger memory machines or
in the worst case, resharding.

Redis is a great piece of software and it fits the bill extremely well
and we were very happy we used it. This was the case till we got into
swapping hell.

To say that the overhead is just 2X is not fair. Memory is very
expensive. As an example a 34 GB double extra large memory instance on
EC2 costs twice as much as a 16 GB extra large memory instance
(http://aws.amazon.com/ec2/pricing/). And as Didier pointed out in his
comment above, it is not 2X but as much as 3X memory overhead. Since we
are not Google, Twitter or Facebook, but an up and coming startup, we
would love to keep costs under control ;)

This memory overhead is only required, in our case, when the slave
loses connectivity, or for launching new slaves, or rewriting
appendonly files.

I think we can do with a better design for this, using the appendonly
file. Here are my thoughts on this, and I will be willing to
contribute if it makes sense:
- The appendonly file is written to disk as usual
- The rdb file is written by maintaining pointers on the appendonly
file
- when a slave connects you stop writing to the rdb file and replicate
the existing rdb file immediately to the new slave.
- once the slave replication is complete you restart writing the rdb
- once the slave finishes loading the rdb the master slave replication
works as it does now
- bgrewriteaof can be done by splitting the existing file and using
the rdb

- you could even execute the save command using the appendonly file.
This will take longer but won't be taxing on memory.

Let me know what you think.

Thanks,
Gaurav.

Jak Sprats

Jun 3, 2011, 12:59:34 AM6/3/11
to Redis DB
Hi Gaurav,

we have been discussing this 2X (3X whatever) explosion for a long
time (more than a year). There have been several suggestions (yours is
a good one). I believe the last decision on this was that this problem
was lower priority than the cluster so it was on hold.

Are there any other pure inMemoryDBs that have to deal w/ this problem
(e.g. timesTen), because they may have an enterprise quality solution
to this memory explosion from COW problem.

I have the feeling, if someone did the research and found the best
solution already done (this problem is not new), it would end up
getting done.

It is a hard problem.

- Jak

Didier Spezia

Jun 3, 2011, 3:38:50 AM6/3/11
to Redis DB

Please note the x2 or x3 (with the filesystem cache) memory overhead
is a *theoretical* worst case. In practice, you do not update 100% of
the pages in 10 secs, and the dump file is generally significantly
smaller than the objects in memory. There is no need to provision
memory for x2 or x3 ...

You can also segment your Redis instances so that only one of them
is bgsaving at a given point in time. For instance on a 48 GB
machine, one could have 8 instances each of them managing 5 GB
of RAM, be safe, and benefit from multiple cores.
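
The arithmetic behind that sizing can be written down as a small helper. This is a back-of-the-envelope sketch using the theoretical worst case discussed in this thread (a saving instance needs up to one extra copy for COW pages, and optionally one more for the filesystem cache); the function name and parameters are hypothetical.

```python
def worst_case_peak_gb(instances, gb_per_instance, concurrent_saves,
                       fs_cache=False):
    """Theoretical peak memory when some instances bgsave at the same time."""
    base = instances * gb_per_instance
    # Each saving instance may duplicate all its pages (COW), and in the
    # extreme case the dump file may also sit entirely in the page cache.
    extra_per_save = gb_per_instance * (2 if fs_cache else 1)
    return base + concurrent_saves * extra_per_save


# 8 instances of 5 GB on a 48 GB machine:
one_save = worst_case_peak_gb(8, 5, concurrent_saves=1)        # 45 GB
two_saves = worst_case_peak_gb(8, 5, concurrent_saves=2)       # 50 GB
one_save_with_cache = worst_case_peak_gb(8, 5, 1, fs_cache=True)  # 50 GB
```

With a single bgsave at a time the COW worst case is 45 GB, which fits in 48 GB. Two simultaneous saves, or one save with the full filesystem-cache worst case, reach 50 GB in theory, although real workloads rarely get close to these bounds.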

Regards,
Didier.

Gaurav

Jun 3, 2011, 8:10:50 PM6/3/11
to Redis DB
Didier, Jak,

Thanks very much for your replies. My comments below:

On Jun 3, 12:38 am, Didier Spezia <didier...@gmail.com> wrote:
> Please note the x2 or x3 (with the filesystem cache) memory overhead
> is a *theoretical* worst case. In practice, you do not update 100% of
> the pages in 10 secs, and the dump file is generally significantly
> smaller than the objects in memory. There is no need to provision
> memory for x2 or x3 ...

Even though I have save on when the slave requests for SYNC the master
spawns a process for creating a new save file. I can't confirm that
but I am sure that is what I noticed. We still need to provision for
2X memory.

>
> You can also segment your Redis instances so that only one of them
> is bgsaving at a given point in time. For instance on a 48 GB
> machine, one could have 8 instances each of them managing 5 GB
> of RAM, be safe, and benefit from multiple cores.

We already do that. This too is risky since if more than 1 shard needs
to create the save file for syncing with slaves the whole machine goes
into swap hell. Also, you have to be careful when you have deletes due
to expiring keys or otherwise. The used_memory_rss is what matters.

Jak,

> we have been discussing this 2X (3X whatever) explosion for a long
> time (more than a year). There have been several suggestions (yours is
> a good one). I believe the last decision on this was that this problem
> was lower priority than the cluster so it was on hold.

Seems like the clustering release will need to have this fixed too. My
feeling is that with auto-sharding, etc you will need to fix this
problem before the clustered redis release.

>
> Are there any other pure inMemoryDBs that have to deal w/ this problem
> (e.g. timesTen), because they may have an enterprise quality solution
> to this memory explosion from COW problem.
>

I know traditional databases maintain redo logs (like the appendonly
file) for each replicating slave. I don't know if any other memory dbs
have the kind of features that redis has.

> I have the feeling, if someone did the research and found the best
> solution already done (this problem is not new), it would end up
> getting done.
>
> It is a hard problem.

Keep up the terrific work. Redis is amazing.

Thanks,
Gaurav.

Gaurav

Jun 3, 2011, 8:12:59 PM6/3/11
to Redis DB
I meant, "I can't confirm that but I think that is what I noticed"

Jak Sprats

Jun 4, 2011, 3:47:04 AM6/4/11
to Redis DB
Hi Gaurav,

I would also like to see this fixed before the cluster comes out, but
I have not come up w/ an acceptable solution either, so I understand
the prioritization.

The best I came up w/ was something pretty complicated, log
reconstruction in a separate process. It sorted appendonly file
snippets (i.e. since last log reconstruction) and then merged
operations on the same key, so a [INCR X,INCR X,INCR X] can be
combined to a X+=3 (which could also be applied to a previous rdb
snapshot).
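
The merging step can be sketched as follows. This is a simplified illustration, not the design described above in full: it only folds INCR/DECR/INCRBY/DECRBY, it assumes commands are tuples of strings, and it treats any argument of another command that matches a pending key as touching that key. The function name is hypothetical.

```python
def compact_increments(commands):
    """Fold runs of counter updates on the same key into one INCRBY each."""
    pending = {}   # key -> accumulated delta not yet written to the output
    out = []

    def flush(key):
        # Emit even a zero delta: INCR then DECR still creates the key.
        out.append(("INCRBY", key, str(pending.pop(key))))

    for cmd in commands:
        name = cmd[0].upper()
        if name in ("INCR", "DECR"):
            step = 1 if name == "INCR" else -1
            pending[cmd[1]] = pending.get(cmd[1], 0) + step
        elif name in ("INCRBY", "DECRBY"):
            amount = int(cmd[2])
            step = amount if name == "INCRBY" else -amount
            pending[cmd[1]] = pending.get(cmd[1], 0) + step
        else:
            # Any other command acts as a barrier for the keys it mentions.
            for arg in cmd[1:]:
                if arg in pending:
                    flush(arg)
            out.append(tuple(cmd))

    for key in list(pending):
        flush(key)
    return out
```

For example, `[("INCR", "X"), ("INCR", "X"), ("INCR", "X")]` compacts to `[("INCRBY", "X", "3")]`. Error cases such as overflow or a non-integer value are not reproduced by the merged command.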

The problem w/ this was set intersections/unions, they can be between
N sets, so you need the N sets in memory to perform the reconstruction
or alternately on those operations you can dump the resulting set
(which can be huge) - both options are expensive. As an example the
command "SINTERSTORE X Y" seems simple to persist to disk, but it may
intersect two 1-million-object sets in memory and there is no
quick-enough way to consistently persist this to disk. If you just did
"SINTERSTORE X Y" on huge SETS as quick as you could, you can NOT
persist it to disk, the RAM operations are too quick.

So in theory, this is an unsolvable problem, but in practice it isn't.
Probably dumping the result of a set-intersection up to a certain size
(256KB???) makes sense, and if it is too big, then dump the command
and pull the values into memory from the .rdb, which could be
optimised to find values quickly (e.g. create hash table of keys
first, and then create objects they point to).
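
The size-based choice in that paragraph can be sketched like this. It is purely illustrative: the function name, the tuple shapes of the log entries, and measuring size as the sum of member lengths are assumptions; only the 256KB figure comes from the text, where it is itself tentative.

```python
def log_entry_for_sinterstore(dest, source_keys, result_members,
                              max_result_bytes=256 * 1024):
    """Log the computed set if it is small, otherwise log the command itself."""
    size = sum(len(member) for member in result_members)
    if size <= max_result_bytes:
        # Small result: persisting the members needs no source sets on replay.
        return ("RESULT", dest, sorted(result_members))
    # Large result: persist only the command; replay must load the sources.
    return ("COMMAND", "SINTERSTORE", dest, *source_keys)
```

The trade-off is where the cost lands: small results cost disk space at write time, large ones cost memory and time when the log is reconstructed.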

Anyways, it is a real hard problem, what I suggested only kind of
works, and it needs lots of brains and different opinions on it. I
looked into how RDBMSes do this on large UPDATEs, and have not found a
silver bullet there either.

- Russ

Salvatore Sanfilippo

Jun 4, 2011, 3:52:01 AM6/4/11
to redi...@googlegroups.com
On Sat, Jun 4, 2011 at 2:10 AM, Gaurav <gaura...@gmail.com> wrote:
> Even though I have save on when the slave requests for SYNC the master
> spawns a process for creating a new save file. I can't confirm that
> but I am sure that is what I noticed. We still need to provision for
> 2X memory.

Please upgrade to 2.2.8, there was an important change lately that
reduced this problem a lot.

Before that fix Redis used a lot of memory when saving sets,
hashes, or sorted sets that were not specially encoded even if there
were *no writes* going on. This new fix finally prevents this issue.

Cheers,

Gaurav

Jun 6, 2011, 2:50:19 AM6/6/11
to Redis DB
Russ,

Thanks for your insight. I am not familiar with all of the Redis
datatypes. I mostly use hashes, sets (simple add/delete), and simple
key values.

Keep up the terrific work.

> Please upgrade to 2.2.8, there was an important change lately that
> reduced this problem a lot.
>

Salvatore, I will schedule upgrading to 2.2.8 at some point. Any idea
when the jemalloc branch will be merged into main? I will schedule
upgrading to both jemalloc and 2.2.X at the same time.

Thanks,
Gaurav.