Suggestions for storing and pulling large "blobs" of data


Nathan Palmer

Sep 27, 2013, 5:03:15 PM
to redi...@googlegroups.com
Today most of our actual data is stored as strings, each holding a protobuf blob. We have hundreds of sets that we intersect to find the right subset of blobs to pull down, at which point we pull everything and do client-side processing for the rest. This was done on purpose since it gave us the performance we needed. However...

I'd like to look into doing some server-side processing via Lua scripts, though, and Redis cannot make sense of these opaque blobs. To give you an idea of what is in each of them: about 50 pieces of dimensional data married with 250 pieces of statistical data, so roughly 300 pieces of data against each key (two keys, actually). The application at the moment needs all 300 pieces for the processing to occur. In the past I attempted to store this in a hash instead, but the performance difference between pulling every hash field (I believe I used HGETALL) and pulling the blob from the string was enormous (much, much slower with HGETALL).

So, to the group: are there any better ways to do this?

Diving into a bit more detail on the structure that is set up today, we have:

  * ~1 million string keys full of dimensional data (each is a protobuf blob)
  * ~3 million string keys full of measure data (each is a protobuf blob)
  * After intersection we're left with at most around 700,000 keys that need to be processed

The performance difference seems to come down to this. Pulling the same object data stored each way:

# Executes in 5.4 seconds
redis-cli -r 100000 get keyname

# Executes in 39 seconds
redis-cli -r 100000 hgetall keyname__hash

# Executes in 22 seconds
redis-cli -r 100000 hmget keyname__hash [explicit field names]

I'm pulling down the same data each time; in every test, storing a blob that Redis cannot understand is the quickest.

Thanks,

Nathan Palmer


Josiah Carlson

Sep 28, 2013, 10:30:17 PM
to redi...@googlegroups.com
I'm guessing that you are interested in doing some processing on the Redis side because you believe that there might be a way to improve overall performance. And you are interested in improving overall performance for the sake of reducing total processing time and/or network bandwidth. Are these reasonable guesses?

If you are interested in processing data inside Redis, my first recommendation would be to store your data in hashes, then benchmark how quickly you can process your hashes inside Redis with Lua. That should give you a ballpark estimate of a more or less optimal way of handling it inside Lua. If you only need a subset of your data for processing, try both HMGET and HGETALL, just to see which might be better inside Redis (which may be different from sending the data back to a client).
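
As a rough sketch of what I mean (reusing your keyname__hash key and your -r style of testing; the 'stat1'/'stat2' field names are placeholders):

# HMGET a couple of fields and combine them inside Lua
redis-cli -r 100000 EVAL "local v = redis.call('HMGET', KEYS[1], 'stat1', 'stat2') return tostring(tonumber(v[1]) + tonumber(v[2]))" 1 keyname__hash

# HGETALL and build a key/value table inside Lua
redis-cli -r 100000 EVAL "local flat = redis.call('HGETALL', KEYS[1]) local t = {} for i = 1, #flat, 2 do t[flat[i]] = flat[i + 1] end return t['stat1']" 1 keyname__hash

For a real test you'd want SCRIPT LOAD plus EVALSHA so the script body isn't re-sent on every call.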

After you've tested that: Redis scripting supports encoding and decoding two kinds of data blobs, MessagePack and JSON. MessagePack is faster, and is fairly similar to Redis' built-in small list/set/hash/zset encoding known as a "ziplist" (that one is transparent, you don't really need to worry about it). If you need enough of your data, you might be able to fetch your MessagePack blob and decode it faster than Redis can return an HMGET result to Lua (especially if you need your data in a key/value table in Lua).
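
For instance, assuming one of your blobs were re-encoded as MessagePack under the same key (the 'some_stat' field is just a placeholder), the decode inside Lua looks roughly like:

# Fetch a MessagePack blob and pull one value out of it server-side
redis-cli EVAL "local t = cmsgpack.unpack(redis.call('GET', KEYS[1])) return tostring(t['some_stat'])" 1 keyname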

I can't guarantee that either hashes or MessagePack will be faster than what you are doing. In fact, unless one of the following applies:
1. you are sharding your data across multiple Redis servers
2. you are replacing Lua with LuaJIT in Redis (apparently close to a drop-in replacement)
3. you are using a slow language to process your data now
4. you are only using a single core to process your data now
5. your network bandwidth and/or latency is crap (depending on how you request your data now)

... I'd suspect that fetching the data and processing it outside Redis has a good chance of actually being faster.

 - Josiah




Didier Spezia

Sep 30, 2013, 3:41:44 AM
to redi...@googlegroups.com, em...@nathanpalmer.com
Just one remark:

>> # Executes in 5.4 seconds
>> redis-cli -r 100000 get keyname

I guess "redis-cli -r" is a poor benchmark, because it does not use
pipelining at all. If you have 700K items to process, you will
likely have to use pipelining as much as possible.
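
For example, something along these lines should give a more
realistic throughput figure, assuming your redis-benchmark
build supports -P (pipelining) and ad-hoc commands:

# Sketch: 100000 requests, 16 commands per round trip
redis-benchmark -n 100000 -P 16 get keyname
redis-benchmark -n 100000 -P 16 hgetall keyname__hash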

When benchmarking the retrieval of a single object, packet size
can skew your results. It all depends on the size of your objects:
if it is below or above 1500 bytes (the Ethernet MTU), you may
see huge variations.

Please note the protocol buffers data representation is very
compact, so protobuf-encoded strings likely result in
less data on the wire (and in Redis memory).

You can use strace on redis-cli to evaluate the difference
in size between a protobuf-encoded string and a hash
object retrieved with HGETALL.

> strace -e read,write ./redis-cli ping
...
write(3, "*1\r\n$4\r\nping\r\n", 14)    = 14
read(3, "+PONG\r\n", 16384)             = 7
...

Here the query weighs 14 bytes, and the reply 7 bytes.

Regards,
Didier.

Nathan Palmer

Oct 4, 2013, 11:33:01 AM
to Didier Spezia, redi...@googlegroups.com
@didier - I'll have to check it out with strace.

@josiah - Thanks for the good feedback. My own tests all seem to indicate that I can't get any better performance inside of Redis. However, there are some operations where I don't need all of the data, only a single datapoint (a sum or distinct operation); those are where I'm hoping a server-side script will help. I may look into MessagePack as a format instead of protobuf, since client-side performance seems very similar.
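
For what it's worth, the shape of script I have in mind is roughly this (a sketch only; the result-set key, the __hash suffix and the 'stat1' field are placeholders, and looping over ~700K keys in a single EVAL would block the server for the duration):

# Sum one hash field across the members of an intersection result
redis-cli EVAL "local total = 0 for _, k in ipairs(redis.call('SMEMBERS', KEYS[1])) do local v = redis.call('HGET', k .. '__hash', 'stat1') if v then total = total + tonumber(v) end end return tostring(total)" 1 result_set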

Thanks.

Nathan