Efficiently storing 100 million items in Redis. What is the best way to do this?


karthik

Mar 31, 2014, 11:10:13 AM
to redi...@googlegroups.com
I have a file containing around 100 million strings, each no more than 20 characters long.
I need to check whether a string that the user enters exists among these strings.

I am using Amazon ElastiCache to run Redis.
The available node types are:

STANDARD

  • Small Cache Node (cache.m1.small): 1.3 GB memory, 1 ECU (1 virtual core with 1 ECU), 64-bit platform, Moderate I/O Capacity
  • Medium Cache Node (cache.m1.medium): 3.35 GB memory, 2 ECUs (1 virtual core with 2 ECUs), 64-bit platform, Moderate I/O Capacity
  • Large Cache Node (cache.m1.large): 7.1 GB memory, 4 ECUs (2 virtual cores with 2 ECUs each), 64-bit platform, High I/O Capacity
  • Extra Large Cache Node (cache.m1.xlarge): 14.6 GB memory, 8 ECUs (4 virtual cores with 2 ECUs each), 64-bit platform, High I/O Capacity
The 100 million strings occupy around 1.66 GB on disk.
I tried storing the strings as key-value pairs, for example: happy => true, sad => true, asdadf => true, and so on. But this quickly maxes out the memory after only about a million pairs are inserted on the Small Cache Node.

I would have to use the Extra Large Cache Node, but it is pretty expensive.

So I tried storing all the data as a single set on the Medium Cache Node. Then I can query SISMEMBER dictionary "happy".
This consumes around 1 GB of memory for 10 million strings. Extrapolating, I will need about 10 GB for 100 million strings, which again means the Extra Large Cache Node.
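
For reference, this is roughly the loading and lookup code I am using with redis-py (strings.txt stands in for my input file):

import redis

conn = redis.Redis()

# Load every string into one big set, pipelining to cut round trips.
with open('strings.txt') as f:
    pipe = conn.pipeline(transaction=False)
    for i, line in enumerate(f, 1):
        pipe.sadd('dictionary', line.strip())
        if i % 10000 == 0:
            pipe.execute()
    pipe.execute()

# Membership check for a user-entered string.
conn.sismember('dictionary', 'happy')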

Is there any way I could use a smaller instance? What would you do?

TPoise

Apr 6, 2014, 7:10:56 PM
to redi...@googlegroups.com
You can try hashes, which Instagram used at one point to store 300 million keys in about 5 GB of RAM.

Still, even if it's optimized as well as their implementation, you're still going to run past the 1.3 GB RAM limit of the Small Cache Node. Meaning: get a bigger instance.
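
The idea is to spread the members across many small hashes so each one stays in Redis's compact ziplist encoding instead of a full hash table. A rough sketch, not their actual code (the bucket count and key prefix are placeholders to tune, and hash-max-ziplist-entries has to be raised to at least your largest bucket size):

import binascii
import redis

conn = redis.Redis()

NUM_BUCKETS = 100000  # ~1,000 members per bucket at 100 million strings

def bucket(member):
    # CRC32 spreads members evenly across the small hashes.
    return 'dict:%d' % (binascii.crc32(member) % NUM_BUCKETS)

def add(member):
    conn.hset(bucket(member), member, 1)

def contains(member):
    return conn.hexists(bucket(member), member)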

Josiah Carlson

Apr 7, 2014, 3:07:37 PM
to redi...@googlegroups.com
If you are trying to determine whether you have seen a value or not, the best answer is to use sharded sets.

That will get you there with approximately 2 bytes of overhead for each string, which would be about 1.86 gigs given your on-disk size, or call it about 2 gigs. Most of the code for doing it can be found in my book:

To calculate the shard key:

import binascii

def shard_key(base, key, total_elements, shard_size):
    # Numeric keys shard by value range, so each shard covers a
    # contiguous block of shard_size integers.
    if isinstance(key, (int, long)) or key.isdigit():
        shard_id = int(str(key), 10) // shard_size
    else:
        # String keys shard by CRC32; doubling the shard count keeps
        # the expected shard size at or below shard_size.
        shards = 2 * total_elements // shard_size
        shard_id = binascii.crc32(key) % shards
    return "%s:%s" % (base, shard_id)

To add to a sharded SET:

def shard_sadd(conn, base, member, total_elements, shard_size):
    # The 'x' prefix forces even digit-only members down the string
    # (CRC32) branch of shard_key; the member itself is stored as-is.
    shard = shard_key(base,
        'x' + str(member), total_elements, shard_size)
    return conn.sadd(shard, member)

To check shard containment:

def shard_sismember(conn, base, member, total_elements, shard_size):
    # Use the same 'x' prefix so the lookup hits the same shard
    # that shard_sadd wrote to.
    shard = shard_key(base,
        'x' + str(member), total_elements, shard_size)
    return conn.sismember(shard, member)
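
Putting it together for your numbers (this part is not from the book; 'dictionary', strings.txt, and the shard size of 512 are placeholders you would tune for your data):

import redis

conn = redis.Redis()

TOTAL_ELEMENTS = 100000000  # expected number of strings
SHARD_SIZE = 512            # members per shard

# Load the file into the sharded set.
with open('strings.txt') as f:
    for line in f:
        shard_sadd(conn, 'dictionary', line.strip(),
                   TOTAL_ELEMENTS, SHARD_SIZE)

# Check a user-entered string.
shard_sismember(conn, 'dictionary', 'happy',
                TOTAL_ELEMENTS, SHARD_SIZE)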


If you have any questions, please feel free to ask :)
 - Josiah



