If all you need is to determine whether you have already seen a value, the best answer is to use sharded sets.
That will get you there with approximately 2 bytes of overhead for each string, which would be about 1.86 gigs given your on-disk size, or call it about 2 gigs. Most of the code for doing it can be found in my book:
To calculate the shard key:
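import binascii

def shard_key(base, key, total_elements, shard_size):
    # Numeric keys map straight to a shard by integer division.
    if isinstance(key, int) or key.isdigit():
        shard_id = int(str(key), 10) // shard_size
    else:
        # Twice as many shards as strictly needed, so each ends up about half full.
        shards = 2 * total_elements // shard_size
        # crc32 needs bytes on Python 3, so encode the string key first.
        shard_id = binascii.crc32(key.encode('utf-8')) % shards
    return "%s:%s" % (base, shard_id)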
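To make the non-numeric branch concrete, here is a small sketch of the arithmetic it performs. The values of `total_elements` and `shard_size` and the sample key are assumptions picked for illustration, not figures from your data set.

```python
import binascii

total_elements = 1000000  # assumed expected number of members
shard_size = 512          # assumed target size per shard

# Same formula as the else-branch of shard_key.
shards = 2 * total_elements // shard_size
shard_id = binascii.crc32(b"xhello") % shards

# With twice the minimum number of shards, the average shard is half full.
average_fill = total_elements / shards

print(shards)                  # 3906
print(round(average_fill))     # 256
print(0 <= shard_id < shards)  # True
```

Because the shard count depends on both parameters, changing either one later sends existing members to different shards.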
To add to a sharded SET:
def shard_sadd(conn, base, member, total_elements, shard_size):
    # The 'x' prefix keeps shard_key on its CRC-based (non-numeric) branch.
    shard = shard_key(base,
        'x'+str(member), total_elements, shard_size)
    return conn.sadd(shard, member)
To check shard containment:
def shard_sismember(conn, base, member, total_elements, shard_size):
    # Use the same total_elements and shard_size as shard_sadd, or the lookup hits a different shard.
    shard = shard_key(base,
        'x'+str(member), total_elements, shard_size)
    return conn.sismember(shard, member)
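If you want to try the flow without a Redis server, the following is a minimal sketch. `FakeConn` is a hypothetical in-memory stand-in that mimics only `sadd` and `sismember`, the three functions are restated in Python 3 form so the block runs on its own, and the `TOTAL` and `SHARD_SIZE` values are assumed.

```python
import binascii


class FakeConn:
    """Hypothetical in-memory stand-in for a Redis client (sadd/sismember only)."""

    def __init__(self):
        self.sets = {}

    def sadd(self, key, member):
        members = self.sets.setdefault(key, set())
        added = member not in members
        members.add(member)
        return int(added)  # 1 if newly added, 0 if already present

    def sismember(self, key, member):
        return member in self.sets.get(key, set())


def shard_key(base, key, total_elements, shard_size):
    if isinstance(key, int) or key.isdigit():
        shard_id = int(str(key), 10) // shard_size
    else:
        shards = 2 * total_elements // shard_size
        shard_id = binascii.crc32(key.encode("utf-8")) % shards
    return "%s:%s" % (base, shard_id)


def shard_sadd(conn, base, member, total_elements, shard_size):
    shard = shard_key(base, "x" + str(member), total_elements, shard_size)
    return conn.sadd(shard, member)


def shard_sismember(conn, base, member, total_elements, shard_size):
    shard = shard_key(base, "x" + str(member), total_elements, shard_size)
    return conn.sismember(shard, member)


TOTAL, SHARD_SIZE = 1000000, 512  # assumed values; pass the same pair to every call
conn = FakeConn()
first = shard_sadd(conn, "seen", "hello", TOTAL, SHARD_SIZE)
second = shard_sadd(conn, "seen", "hello", TOTAL, SHARD_SIZE)
found = shard_sismember(conn, "seen", "hello", TOTAL, SHARD_SIZE)
missing = shard_sismember(conn, "seen", "world", TOTAL, SHARD_SIZE)
print(first, second, found, missing)  # 1 0 True False
```

With a real client you would pass your Redis connection object in place of `FakeConn()`; the rest of the calls stay the same.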
If you have any questions, please feel free to ask :)
- Josiah