Efficiently storing 100 million items in Redis. What is the best way to do this?


karthik

Mar 31, 2014, 11:10:13 AM
to redi...@googlegroups.com
I have a file containing around 100 million strings, each no more than 20 characters long.
I need to check whether a string that the user enters exists among these strings.

I am using Amazon ElastiCache to run Redis.
The available node types are:

STANDARD

  • Small Cache Node (cache.m1.small): 1.3 GB memory, 1 ECU (1 virtual core with 1 ECU), 64-bit platform, Moderate I/O Capacity
  • Medium Cache Node (cache.m1.medium): 3.35 GB memory, 2 ECUs (1 virtual core with 2 ECUs), 64-bit platform, Moderate I/O Capacity
  • Large Cache Node (cache.m1.large): 7.1 GB memory, 4 ECUs (2 virtual cores with 2 ECUs each), 64-bit platform, High I/O Capacity
  • Extra Large Cache Node (cache.m1.xlarge): 14.6 GB memory, 8 ECUs (4 virtual cores with 2 ECUs each), 64-bit platform, High I/O Capacity
The 100 million strings occupy around 1.66 GB on disk.
I tried storing the strings as key-value pairs, for example: happy => true, sad => true, asdadf => true, and so on. But this quickly maxes out the memory after only about a million pairs are inserted on the Small Cache Node.

I would have to use the Extra Large Cache Node, but it is pretty expensive.

So I tried storing all the data as a single set on the Medium Cache Node. Then I can query SISMEMBER dictionary "happy".
This consumes around 1 GB of memory for 10 million strings. Extrapolating, I will need about 10 GB for 100 million strings, which again means the Extra Large Cache Node.
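
For reference, this is roughly the loading and lookup code I am using with redis-py (strings.txt stands in for my input file):

import redis

conn = redis.Redis()

# Load every string into one big set, pipelining to cut round trips.
with open('strings.txt') as f:
    pipe = conn.pipeline(transaction=False)
    for i, line in enumerate(f, 1):
        pipe.sadd('dictionary', line.strip())
        if i % 10000 == 0:
            pipe.execute()
    pipe.execute()

# Membership check for a user-entered string.
conn.sismember('dictionary', 'happy')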

Is there any way I could use a smaller instance? What would you do?

TPoise

Apr 6, 2014, 7:10:56 PM
to redi...@googlegroups.com
You can try hashes, which Instagram used at one point to store 300 million keys in about 5 GB of RAM.

Still, even if it's optimized as well as their implementation, you're still going to run past the 1.3 GB RAM limit of the Small Cache Node. Meaning: get a bigger instance.
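
The idea is to spread the members across many small hashes so each one stays in Redis's compact ziplist encoding instead of a full hash table. A rough sketch, not their actual code (the bucket count and key prefix are placeholders to tune, and hash-max-ziplist-entries has to be raised to at least your largest bucket size):

import binascii
import redis

conn = redis.Redis()

NUM_BUCKETS = 100000  # ~1,000 members per bucket at 100 million strings

def bucket(member):
    # CRC32 spreads members evenly across the small hashes.
    return 'dict:%d' % (binascii.crc32(member) % NUM_BUCKETS)

def add(member):
    conn.hset(bucket(member), member, 1)

def contains(member):
    return conn.hexists(bucket(member), member)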

Josiah Carlson

Apr 7, 2014, 3:07:37 PM
to redi...@googlegroups.com
If you are trying to determine whether you have seen a value or not, the best answer is to use sharded sets.

That will get you there with approximately 2 bytes of overhead for each string, which would be about 1.86 gigs given your on-disk size, or call it about 2 gigs. Most of the code for doing it can be found in my book:

To calculate the shard key:

import binascii

def shard_key(base, key, total_elements, shard_size):
    # Numeric keys shard by value range, so each shard covers a
    # contiguous block of shard_size integers.
    if isinstance(key, (int, long)) or key.isdigit():
        shard_id = int(str(key), 10) // shard_size
    else:
        # String keys shard by CRC32; doubling the shard count keeps
        # the expected shard size at or below shard_size.
        shards = 2 * total_elements // shard_size
        shard_id = binascii.crc32(key) % shards
    return "%s:%s" % (base, shard_id)

To add to a sharded SET:

def shard_sadd(conn, base, member, total_elements, shard_size):
    # The 'x' prefix forces even digit-only members down the string
    # (CRC32) branch of shard_key; the member itself is stored as-is.
    shard = shard_key(base,
        'x' + str(member), total_elements, shard_size)
    return conn.sadd(shard, member)

To check shard containment:

def shard_sismember(conn, base, member, total_elements, shard_size):
    # Use the same 'x' prefix so the lookup hits the same shard
    # that shard_sadd wrote to.
    shard = shard_key(base,
        'x' + str(member), total_elements, shard_size)
    return conn.sismember(shard, member)
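
Putting it together for your numbers (this part is not from the book; 'dictionary', strings.txt, and the shard size of 512 are placeholders you would tune for your data):

import redis

conn = redis.Redis()

TOTAL_ELEMENTS = 100000000  # expected number of strings
SHARD_SIZE = 512            # members per shard

# Load the file into the sharded set.
with open('strings.txt') as f:
    for line in f:
        shard_sadd(conn, 'dictionary', line.strip(),
                   TOTAL_ELEMENTS, SHARD_SIZE)

# Check a user-entered string.
shard_sismember(conn, 'dictionary', 'happy',
                TOTAL_ELEMENTS, SHARD_SIZE)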


If you have any questions, please feel free to ask :)
 - Josiah



