Redis out of memory issues, how can I avoid them?

Theo

Sep 28, 2010, 1:03:40 PM
to Redis DB, dan...@burtcorp.com
I've been experimenting with using Redis as a backend for an
application that needs to calculate the cardinality of huge sets of
user IDs (essentially counting the number of unique UIDs over an
arbitrary date range). Redis seems perfectly suited for this in all
but one respect: my dataset will not fit in memory. When running small
scale tests it all works great. Loading the data is pretty fast, and
making the queries is unbelievably quick.

Before going into details, my setup is Redis 2.0.2 on an EC2 large
instance running Ubuntu with 7.5 GB RAM.

The problem I have is that when I run tests on real data the Redis
process gets killed by the OS because memory runs out. I've seen lines
like this in the syslog:

Sep 26 20:12:42 ip-10-226-202-140 kernel: [30978.436808] Out of
memory: kill process 2437 (redis-server) score 49437 or a child
Sep 26 20:12:42 ip-10-226-202-140 kernel: [30978.436830] Killed
process 2437 (redis-server)

The Redis log file doesn't say anything about this, but it prints
this from time to time:

[19983] 27 Sep 19:31:39 # WARNING: vm-max-memory limit exceeded by more than 10% but unable to swap more objects out!

I knew that memory was going to be an issue, so I turned on virtual
memory in Redis:

vm-enabled yes

I also want it to swap out everything it can, so I set

vm-max-memory 0

(which makes the error message above puzzling -- of course it's going
to be 10% more than 0...)

It has a swapfile of 320 GB:

vm-page-size 32
vm-pages 10000000000

This should use roughly 1.2 GB of RAM, which I think is OK since I
have 7.5 GB, so the keys should have room enough.
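
If I've read the VM docs right the page table costs one bit per swap
page, so the arithmetic (assuming that's correct) works out to:

32 bytes/page * 10,000,000,000 pages = 320 GB of swap file
1 bit/page * 10,000,000,000 pages = ~1.2 GB of RAM for the page table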

The last time I checked I had loaded roughly 3 million keys into the
database. I don't know how many there were when it was killed,
because I haven't been able to start it up again: it's been running
at 100% CPU for 12 hours now and still gives no response to commands
in redis-cli (which makes me think that the database has been
corrupted -- it doesn't matter, it was only a test).

What am I doing wrong? What is the correct setup if I want to use
Redis as a database for huge sets without having to worry about
memory usage? I understand that my dataset can't grow forever, but it
feels like a couple of million keys should fit in 7.5 GB of RAM.
Speed isn't so important; I'm more than willing to trade it for being
able to keep as much data as possible on the same machine.

yours
Theo


Josiah Carlson

Sep 29, 2010, 2:36:07 PM
to redi...@googlegroups.com, dan...@burtcorp.com
The easiest thing you can do is to get a bigger Amazon instance
(nothing will work quite as well as that). I personally don't have
much experience with Redis' VM system, as our use cases require low
latency, and any swapping is a bad thing for us.

That said, if you're not able to provision a larger Redis instance,
it may be easier to keep your uniques on a daily basis in Redis,
blast them out to a relational database indexed by uid, and then
perform a 'SELECT COUNT(DISTINCT uid) FROM uniques' (assuming that
the table only contains the data relevant to your time range). That
could be trimmed down to hourly, or whatever resolution is
reasonable, then calculated and cached offline. Sadly, it's not as
clean as a Redis-only solution, but most relational databases would
handle that query pretty well (it's a scan of the index, which
doesn't require building a hash/btree just for counting uniqueness).

Regards,
- Josiah


Jak Sprats

Sep 30, 2010, 12:14:48 AM
to Redis DB
Are you using ZSETs, or just STRINGs?

How big are the keys you are using, and how big are the values?

VM doesn't help much for small values.

You may also be able to normalise some of your keys into a hash
table, which saves a lot of memory.

Give us some more information; this is a somewhat common problem, and
there are lots of tricks.

Theo

Sep 30, 2010, 3:25:17 AM
to Redis DB
Hi, thanks for your answers.

The keys look like this:

APIKEYAPIKEY:20100817:194c183c03e3ffe59e5b93853781ae49:site_domain:example.com:exposure_reach

where the first component is the customer API key (12 chars), the
second is the date (10 chars), the third is a string identifying an
ad (32 chars), the fourth is the dimension (we have a couple, most
have low cardinality, site_domain is about the only one which is
unbounded), the fifth is the dimension value and the sixth is the
metric. The key above points to a set of strings representing the
user IDs of all people who have seen a specific ad on the site
example.com. The user IDs are 12-char strings.

When loading data, the only operation I do is this:

SADD APIKEYAPIKEY:20100817:194c183c03e3ffe59e5b93853781ae49:site_domain:example.com:exposure_reach 18HT5HM8QM6C

over and over again, for every combination of API key, date, ad ID,
dimension and dimension value, as well as for a number of metrics
(for example the number of people who have seen the ad and interacted
with it, the number of people who have seen the ad and followed the
click-through link, etc.)

In the end what I want to do is a) get the cardinality of each
combination of keys (i.e. the number of unique user IDs per day), and
b) get the cardinality of the union of each combination of keys
except for the date (i.e. the total number of unique users over all
days). Something like this (pseudocode):

a) daily_exposure_reach = SCARD APIKEYAPIKEY:20100817:194c183c03e3ffe59e5b93853781ae49:site_domain:example.com:exposure_reach

b) KEYS APIKEYAPIKEY:*:194c183c03e3ffe59e5b93853781ae49:site_domain:example.com:exposure_reach
   total_exposure_reach = SUNIONSTORE tmp (all keys from the last command)
   DEL tmp

(SUNIONSTORE returns the cardinality of the union, which is what I'm
interested in, then I just delete the stored set.)
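
Or, roughly the same two queries sketched in Python with redis-py
(tmp is just a scratch key, and I've added a guard for the case where
KEYS finds nothing):

import redis

r = redis.Redis(host='localhost', port=6379)

key = ('APIKEYAPIKEY:20100817:194c183c03e3ffe59e5b93853781ae49'
       ':site_domain:example.com:exposure_reach')

# a) per-day cardinality
daily_exposure_reach = r.scard(key)

# b) cardinality of the union over all days for the same ad/dimension/metric
pattern = ('APIKEYAPIKEY:*:194c183c03e3ffe59e5b93853781ae49'
           ':site_domain:example.com:exposure_reach')
keys = r.keys(pattern)
total_exposure_reach = 0
if keys:
    total_exposure_reach = r.sunionstore('tmp', keys)  # returns the union's size
    r.delete('tmp')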

Just to be clear: this last part is not what is causing my memory
problems, because I'm not doing it yet. The only thing I'm doing on
the server is loading data. However, it wouldn't surprise me if
there's a better way of organizing the data I'm loading that would
use less memory while still giving me the results I want.

Since posting my first question I've run a few more tests. I
discovered why I couldn't get Redis to start up after it was killed:
the append-only log was on, which was apparently a bad idea, and
after waiting over 12 hours it still wasn't done, so I killed it. The
really strange thing that I can't figure out is this: I wrote a small
script that prints out the memory used by Redis (redis-cli info |
grep used_memory_human) and ran it once a minute, and watched top
while the load script chugged away loading data into Redis. The
memory usage reported by Redis never went over 4 GB, and htop showed
between 50 and 60% in the MEM column (which seems right, as there is
7.6 GB in total), with no other process using more than fractions of
a percent. Still, the total memory usage grew and grew until memory
was almost full, then fell several GB, then grew quickly up to almost
max, fell, grew, fell, and finally the OS killed Redis.

Now, from this it should be totally plain that I'm no sysadmin and
don't know what I'm talking about, but that's sort of why I'm asking
you guys. Is there a way to either reorganize my data to use less
memory, or configure Redis or my server so that it doesn't die? Since
Redis never reports that it uses more than 4 GB I feel there should
be room enough.

---

Josiah, thanks for the suggestions; that could probably work as an
alternative solution. I've also thought about rolling the data up
into buckets for each month, so that I still get all the unique IDs,
just in bigger buckets. The overlap between adjacent days is big, so
that should probably save a lot.

---

again, thanks for your replies and suggestions
/Theo

Jak Sprats

Sep 30, 2010, 5:00:45 PM
to Redis DB

One tip: turn off VM. It can double the memory overhead on some Redis
versions, and since your values are 12 bytes and VM only helps with
large values, it's a bad fit.

Second tip: if you are uploading data in bulk, turn off all saving in
the config file, and once the load is done turn saving back on via
the CONFIG command
(http://code.google.com/p/redis/wiki/ConfigCommand).
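
Something like this from redis-cli, assuming your build lets you set
"save" through CONFIG SET (otherwise edit redis.conf and restart):

redis-cli CONFIG SET save ""              # disable background saves before the load
# ... run the bulk load ...
redis-cli CONFIG SET save "900 1 300 10"  # put back whatever save lines you normally use
redis-cli BGSAVE                          # take one snapshot now that loading is done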

Your use case is pretty complicated. One thing that may be happening
is that the SETs you have usually do not have many entries, so you
are allocating a hash table for 1 or 2 entries, which uses up lots of
memory. It may actually use LESS memory to denormalise your
key-values so that the key is "key:value" (e.g.
20100817:194c183c03e3ffe59e5b93853781ae49:site_domain:example.com:exposure_reach:18HT5HM8QM6C).
You could then use "KEYS
20100817:194c183c03e3ffe59e5b93853781ae49:site_domain:example.com:exposure_reach:*"
to get your cardinality.

One more thing: did you compile with 32 bits? Because that will die
at 4 GB for known reasons.

The only other explanation for the discrepancy between reported
memory and actual system memory would be memory fragmentation, but
the use case you are describing should not cause much fragmentation.
What version of Redis are you using? 2.2 is not production ready, but
is tighter with memory (so I've heard).

Theo

Oct 1, 2010, 3:16:53 AM
to Redis DB
> and since your values are 12 bytes and VM only helps with large values, it's a bad fit.

But the values are _sets_ of strings that are 12 bytes; isn't the
whole set swapped to disk as one, or is each set member swapped
individually? In the latter case I can understand why VM wouldn't
help.

I'll try a few of your suggestions and see what happens. Adding the
user ID to the key would probably not work though, since I want to
get the cardinality of the union of multiple sets.

Currently I'm using 2.0.2 compiled for 64 bit, but I'll try
2.2-alpha2.

T#

Jak Sprats

Oct 1, 2010, 4:48:20 AM
to Redis DB
>> isn't the whole set swapped to disk

Yeah, you're right, my mistake.

Here are your use cases:

a) daily_exposure_reach = SCARD APIKEYAPIKEY:DATE:X:Y:Z
b) KEYS APIKEYAPIKEY:*:X:Y:Z
   tot_reach = SUNIONSTORE tmp (all keys from the last command)
   DEL tmp

If you instead use "SET APIKEYAPIKEY:DATE:X:Y:Z:USER_ID 1"

you can get a) from:
KEYS APIKEYAPIKEY:DATE:X:Y:Z:* -> count the lines

and you can get b) from:
KEYS APIKEYAPIKEY:*:X:Y:Z:* -> count the lines
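
For example, from the shell (keeping in mind that KEYS walks the
whole keyspace, so this is only really OK as an offline job, and the
exact output format of redis-cli may vary):

redis-cli KEYS 'APIKEYAPIKEY:DATE:X:Y:Z:*' | wc -l    # a) uniques for one day
redis-cli KEYS 'APIKEYAPIKEY:*:X:Y:Z:*' | wc -l       # b) uniques across all days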

Maybe I am missing something.

This is actually a problem that is solved pretty well by a single
relational table:

CREATE TABLE whatever (apikey varchar(32), date int, x text, y text,
z text, user_id varchar(12));
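
The counts would then be something like this (assuming an index
covering (apikey, x, y, z, date, user_id) so the scans stay cheap):

-- a) uniques for one day
SELECT COUNT(DISTINCT user_id) FROM whatever
 WHERE apikey = ? AND x = ? AND y = ? AND z = ? AND date = 20100817;

-- b) uniques over a whole date range
SELECT COUNT(DISTINCT user_id) FROM whatever
 WHERE apikey = ? AND x = ? AND y = ? AND z = ?
   AND date BETWEEN 20100801 AND 20100831;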

Is MySQL too slow? Are you doing these requests on the frontend or
for data warehousing?

Theo

Oct 1, 2010, 7:27:21 AM
to Redis DB
> you can get b) from:
> KEYS APIKEYAPIKEY:*:X:Y:Z:* -> count the lines

No, that's not quite right: you're forgetting that a union operation
filters out duplicates. If day one saw the IDs A, B, C and day two
saw A, D, E then I would end up with the keys (I'm simplifying the
names a bit just to make it easier to follow):

K:1:X:A
K:1:X:B
K:1:X:C
K:2:X:A
K:2:X:D
K:2:X:E

To get the union of the IDs I would have to pipe that through
cut -d: -f4 | sort | uniq, which is the job Redis does for me if I
use proper sets, and I think it's quicker too. I could do the
cut/sort/uniq, but then I would probably be better off using the
filesystem to start with. The same goes for using a relational
database: I could do that too, but I want to use Redis because it's
such a great fit for the problem (minus the issue of having to keep
the dataset in memory).
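
(Just to make that concrete, with the simplified key names above the
pipeline would be roughly the following, give or take how redis-cli
formats its output:

redis-cli KEYS 'K:*:X:*' | cut -d: -f4 | sort | uniq | wc -l

which prints the number of distinct IDs across all days.)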

The purpose of the experiment is to see if I can replace some really
inefficient Hadoop jobs, so it's more of a data warehousing solution
than a frontend one. However, if it works out there's no reason our
frontend couldn't make its queries directly against this database
(but I'll probably add one more layer of pre-computing; once the data
is in there it won't have to be recalculated very often).

Speed is not really an issue. The Hadoop jobs run in around six hours
at the moment, and I think that even a solution that appended the IDs
to files on disk and then ran sort/uniq would be faster (or just a
rewrite of the Hadoop jobs, I don't think they are very good). I
don't need to fit the whole dataset into one server, and there's
plenty of opportunity for sharding (queries are only over date
ranges, so I can shard on API key, ad ID, or dimension, or a
combination), but I'm not sure I should use this approach if I'm
bound by RAM; it would be a little too expensive to scale. It's a
pity, because Redis is so awesome to work with.

I'm running an experiment with Redis 2.2-alpha2 now, with VM and
saving turned off. It uses way too much memory and will probably die
within the hour (220 K keys using 5 GB of RAM). I'll try turning VM
on again to see what happens, but after that I'm out of ideas.

T#

Josiah Carlson

Oct 1, 2010, 1:07:53 PM
to redi...@googlegroups.com
Please forgive me if I'm misreading your posts, but if you have user
IDs and are putting them into sets keyed by (ad,location,date)
3-tuples (keys which seem to be roughly 74 bytes long), and all you
really want to know is "how many unique users do we have today (or on
any other day in the last X days)?" and "how many unique users do we
have over the last X days?", then depending on the number of
ads/locations (where the ads are placed), you may very well have a
lot of sets.

Are these sets necessary? Do you need to aggregate these results by
ad? Do you need to aggregate these results by location? Do you need
to aggregate results by (ad,location)? Because of the way you are
storing your data, I would imagine the answer for all of these is yes.
If that is the case, and you are trying to explore all of the things
that you can do right now with what you have, I would bet that you can
aggregate over everything for 30 days, aggregate over ads for 30 days,
and aggregate over location within your existing memory limits.

The aggregation keys for all/ad/location aggregation are obvious.

In terms of being able to count uniques on a daily/monthly basis for
(ad,location) pairs, that's something that could/should be sharded
based on $/gig for memory (High Memory Extra Large instances at Amazon
are the current winner here, and come with 17 gigs), and/or
pre-sharded out to logfiles by (ad,location,day) for fast low-memory
processing (Syslog-ng and/or Flume are good on the logging side of
things, with hooks for automatically counting the uniques for the
largest output files first), and/or stuffed into a database (covering
indexes would work great here).

As another idea to try out, if you are okay with having a secondary
lookup database, hashing your ~74-byte keys down to even 8 bytes may
help reduce memory use depending on your number of keys, but as your
system grows, that (ad,location) breakdown is going to grow very
quickly.
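
For example, something along these lines in Python (just a sketch;
the hash length and the 'key_lookup' name are arbitrary, and you'd
want to think about collision risk at your key volumes):

import hashlib
import redis

r = redis.Redis(host='localhost', port=6379)

def short_key(long_key):
    # An 8-byte digest prefix stands in for the ~74-byte readable key.
    return hashlib.sha1(long_key.encode('utf-8')).digest()[:8]

long_key = ('APIKEYAPIKEY:20100817:194c183c03e3ffe59e5b93853781ae49'
            ':site_domain:example.com:exposure_reach')
skey = short_key(long_key)

r.hset('key_lookup', skey, long_key)  # secondary lookup: short key -> readable key
r.sadd(skey, '18HT5HM8QM6C')          # the set itself lives under the short key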


Regards,
- Josiah


Jak Sprats

Oct 1, 2010, 5:19:48 PM
to Redis DB

Theo wrote:
>> K:1:X:A
>> K:1:X:B
>> K:1:X:C
>> K:2:X:A
>> K:2:X:D
>> K:2:X:E
>> to get the union of the IDs I would have to pipe that through
>> cut -d: -f4 | sort | uniq, which is the job Redis does for me if I
>> use proper sets.

If you just stored a separate counter for this (e.g. K2:A, K2:B, ...)
and did a separate "SET K2:X 1" per ad view, you could get the union
of keys by doing a "KEYS K2:*".

But I think you want to use SETs because they are the proper data
structure for this problem; I get that I am suggesting awkward
workarounds.

The Ruby client (are you using Ruby?) http://github.com/ezmobius/redis-rb/
has some sharding options. You have 5 variables: apikey, date, ad,
d_card, and d_val. But you are not aggregating by apikey, ad, or
d_card, so you can boil this down to (v1, date, d_val), where v1 is
concat(apikey, ad, d_card). You can shard on v1 using crc32 hashing
in the Ruby client, and then you are bounded by several machines'
worth of RAM. With the sharding approach the final SUNION should
still work, because you are always doing unions within an (apikey,
ad, d_card) tuple {is that correct?} -- if you are doing a UNION
across all keys, then this doesn't work.
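
In Python terms the idea is roughly this (just a sketch to show the
routing; the Ruby client does the hashing for you, and the node
addresses here are made up):

import zlib
import redis

nodes = [redis.Redis(host=h, port=6379)
         for h in ('10.0.0.1', '10.0.0.2', '10.0.0.3')]

def node_for(apikey, ad, d_card):
    # Everything you never union across goes into the shard key, so
    # all dates/values for one (apikey, ad, d_card) share a node.
    v1 = apikey + ad + d_card
    return nodes[zlib.crc32(v1.encode('utf-8')) % len(nodes)]

node = node_for('APIKEYAPIKEY', '194c183c03e3ffe59e5b93853781ae49',
                'site_domain')
node.sadd('APIKEYAPIKEY:20100817:194c183c03e3ffe59e5b93853781ae49'
          ':site_domain:example.com:exposure_reach', '18HT5HM8QM6C')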

Another thing: definitely comment out all "save" lines in your
redis.conf, and set "appendonly" to no. Redis does disk persistence
by copy-on-write forking, so if you load up 2 GB of data real quick,
the first disk backup of that data can use an additional 2 GB of
memory. Do you need disk persistence? Can you lose data? I have had
cases where I needed disk persistence and needed to use over 50% of
my machine's RAM, so I put delays in while loading the data and then
ran a BGSAVE (http://code.google.com/p/redis/wiki/BgsaveCommand)
explicitly.
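
In redis.conf terms the bulk-load setup would look something like
this (those should be the stock save lines from the default config;
put them back, or run BGSAVE by hand, once the load is done):

# save 900 1
# save 300 10
# save 60 10000
appendonly no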