Data compression for Redis & Smaz

Andy

Feb 23, 2012, 2:10:30 AM
to Redis DB
Hi,

I'm going to run my application on a small server so I'd like to
reduce memory consumption as much as possible. I looked into data
compression but it doesn't look like Redis uses it. But I came upon
Smaz:

http://news.ycombinator.com/item?id=540048

1) antirez mentioned that Smaz was intended to be used within Redis.
Why didn't that happen? One comment in hackernews mentioned security
concerns when using Smaz on user-generated inputs. Is that a valid
concern?

2) Would you recommend using Smaz in my application code to compress
user-submitted data before putting it into Redis? How many short
strings (each fewer than 100 bytes) per second can Smaz compress on a
quadcore CPU?

Thanks.

Didier Spezia

Feb 23, 2012, 12:39:42 PM
to redi...@googlegroups.com

Hi,

Redis includes the LZF light data compressor,
which is probably more generic than smaz.

AFAIK, it is only used to compress strings at
dump time (RDB).

Regards,
Didier.

Andy

Feb 23, 2012, 1:09:47 PM
to Redis DB
Thanks Didier.

But if it is only used to compress strings at dump time, it wouldn't
help with reducing memory consumption, right?

Any other options for reducing memory?

Josiah Carlson

Feb 23, 2012, 1:30:52 PM
to redi...@googlegroups.com

Like you said, you could compress your data in advance. Also,
depending on your actual problem, you may be able to store your data
in different ways that can reduce memory use overall, regardless of
data compression. Generally, people have had pretty good luck telling
this list the data that they are planning on storing and data access
patterns, and getting a better solution than they would have otherwise
come up with.

In terms of using smaz directly, it may be useful, but you will want
to test it out, as well as sanitize your data. Alternatively, if you
pre-train an arithmetic or Huffman coder on a representative sample of
your input, you can use it as a sort of fixed codebook, which should
be able to beat smaz in just about every situation.
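
For instance, zlib's preset-dictionary feature can act as a crude
fixed codebook (a minimal sketch in Python 3.3+, not exactly the
arithmetic/Huffman approach above, since zlib is LZ77 plus Huffman;
the ZDICT sample below is invented, train on your real data):

import zlib

# Hypothetical training sample: in practice, concatenate a few hundred
# representative user-submitted strings.
ZDICT = b"user: logged in from session http://www. the and of to email"

def compress(data):
    c = zlib.compressobj(level=9, zdict=ZDICT)
    return c.compress(data) + c.flush()

def decompress(blob):
    d = zlib.decompressobj(zdict=ZDICT)
    return d.decompress(blob) + d.flush()

Keep in mind that the zlib header, dictionary id, and checksum add
roughly 10 bytes per value, which is significant for sub-100-byte
strings.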

Regards,
- Josiah

Javier Guerra Giraldez

Feb 23, 2012, 1:42:56 PM
to redi...@googlegroups.com
On Thu, Feb 23, 2012 at 1:09 PM, Andy <selfor...@gmail.com> wrote:
> But if it is only used to compress strings at dump time, it wouldn't
> help with reducing memory consumption, right?

It would be great if it were accessible from Lua.

--
Javier

Didier Spezia

Feb 23, 2012, 1:54:33 PM
to redi...@googlegroups.com

Well, Redis already implements a bunch of optimizations
regarding memory consumption: ziplist, zipmap, intset, etc.

For string/blob values, compressing the data can be done
on the client side. That would be my preferred way to do it.

One problem with light compression algorithms is that they are
not always symmetric in terms of performance: compression
is generally slower than decompression. If Redis could
compress/uncompress serialized data structures such as
ziplist/zipmap/intset, it would make write operations much
slower than read ones.

One interesting compression algorithm is QuickLZ:
compression and decompression performance is almost the
same. I am a big fan of this library.

Regards,
Didier.

Dvir Volk

Feb 23, 2012, 1:55:51 PM
to redi...@googlegroups.com
At least in Python, and I guess in other dynamic languages too, adding a proxy class that transparently does the compression on the client side would be extremely easy, and it would save network bandwidth compared to doing it on the server with Lua.
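
Something like this, for example (a minimal sketch assuming redis-py;
the 1-byte flag and 64-byte threshold are arbitrary choices):

import zlib

import redis

class CompressingRedis(object):
    # Wraps a redis-py client and compresses values transparently.
    # A one-byte flag marks whether the stored payload is compressed.
    def __init__(self, client, min_size=64):
        self.client = client
        self.min_size = min_size  # below this, zlib overhead isn't worth it

    def set(self, key, value):
        if len(value) >= self.min_size:
            value = b'\x01' + zlib.compress(value)
        else:
            value = b'\x00' + value
        return self.client.set(key, value)

    def get(self, key):
        value = self.client.get(key)
        if value is None:
            return None
        if value[:1] == b'\x01':
            return zlib.decompress(value[1:])
        return value[1:]

r = CompressingRedis(redis.StrictRedis())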




--
Dvir Volk
System Architect, The Everything Project (formerly DoAT)

Javier Guerra Giraldez

Feb 23, 2012, 4:40:03 PM
to redi...@googlegroups.com
On Thu, Feb 23, 2012 at 1:55 PM, Dvir Volk <dv...@doat.com> wrote:
> At least in Python, and I guess in other dynamic languages too, adding a
> proxy class that transparently does the compression on the client side
> would be extremely easy, and it would save network bandwidth compared to
> doing it on the server with Lua.

Totally right; but what if you want to do some processing on the
content? After all, that's why there are JSON (and soon MessagePack)
libraries. The best would be to send compressed MessagePack structures
and still be able to access them from server scripts.

--
Javier

Salvatore Sanfilippo

Feb 24, 2012, 3:35:17 AM
to redi...@googlegroups.com
On Thu, Feb 23, 2012 at 8:10 AM, Andy <selfor...@gmail.com> wrote:

> Hi,

Hi Andy, sorry for the delay in the reply.

> I'm going to run my application on a small server so I'd like to
> reduce memory consumption as much as possible. I looked into data
> compression but it doesn't look like Redis uses it. But I came upon
> Smaz:
>
> http://news.ycombinator.com/item?id=540048
>
> 1) antirez mentioned that Smaz was intended to be used within Redis.
> Why didn't that happen? One comment in hackernews mentioned security
> concerns when using Smaz on user-generated inputs. Is that a valid
> concern?

This is still in my plans: I have a TODO entry in my private todo
list, and I purchased a data compression book and will probably get
more. And finally... Redis is already mostly ready internally to
support this.

Why wasn't this done before? Because it is one of those experimental
projects that needs some love.

There are no security concerns. This is how it works (exactly like
integer-encoded values):

SET foo bar

After the SET, the function tryObjectEncoding() is called in order to
check if the string is representable in some other way (currently it
only checks if the string looks like an integer that can be exactly
converted back into a string later: no spaces, no overflow, and so
forth).
This function calls Smaz against the string if it can't be encoded as
an integer. If the result is shorter than the native string, then
object->encoding is set to REDIS_ENCODING_SMAZ and the string is
stored as a smaz string.

When reading the string, Redis always uses getDecodedObject() to
create the original object from the encoded one.
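
In rough Python pseudocode, the whole flow is something like this
(the smaz module stands in for a hypothetical binding; the real code
is C, in Redis's object.c):

import smaz  # hypothetical binding exposing compress()/decompress()

# Hypothetical encoding tags, mirroring the raw / INT / SMAZ encodings.
ENC_RAW, ENC_INT, ENC_SMAZ = 0, 1, 2

def try_object_encoding(s):
    # Integer encoding: the string must round-trip exactly
    # (no leading spaces, no overflow, and so forth).
    try:
        if str(int(s)) == s:
            return ENC_INT, int(s)
    except ValueError:
        pass
    # Otherwise try smaz, keeping it only if it is actually shorter.
    c = smaz.compress(s)
    if len(c) < len(s):
        return ENC_SMAZ, c
    return ENC_RAW, s

def get_decoded_object(encoding, payload):
    if encoding == ENC_INT:
        return str(payload)
    if encoding == ENC_SMAZ:
        return smaz.decompress(payload)
    return payload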

So we basically have most of the core ready, but it is smaz itself
that is not ready: it needs to be tuned, the code must be reviewed
for stability (I think there are a few bugs), and the encoder can be
improved.

> 2) Would you recommend using Smaz in my application code to compress
> user-submitted data before putting it into Redis? How many short
> strings (each fewer than 100 bytes) per second can Smaz compress on a
> quadcore CPU?

It's a brilliant idea, but I remember a bug report in smaz that I
never fixed. I'll try to check if I can find the issue and at least
fix the lib.

Smaz should in theory be pretty fast. What I mean is, if it is not
already very fast it should be improved, because the kind of
compression used is simple to turn into a very fast computation. But
I have not tried measuring the speed... I'll write a short benchmark
if there is not one already in the distribution.

Cheers,
Salvatore

> Thanks.



--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Salvatore Sanfilippo

Feb 24, 2012, 4:14:54 AM
to redi...@googlegroups.com
Hi again,

I just tested smaz against the three old bug reports I got; actually,
I think it was an issue with the Python binding.
I also improved the fuzzing stresser of the implementation: no bugs.

I think it's pretty safe, but I'm pretty sure it is possible to make
smaz compress better than that. A few things to try are:

1) Some form of entropy coding. Though I think this is not going to
give extraordinary results because of the frequency distribution.
2) An optional optimizer on the output. It is possible to improve the
compression just with this.
3) I think a better version of smaz could reserve an opcode for back
references in case of obvious repetition in the text. It is strange
that currently it is not able to compress something like
"abababababababababab" IMHO.

Cheers,
Salvatore

Salvatore Sanfilippo

Feb 24, 2012, 4:21:08 AM
to redi...@googlegroups.com
About the speed: the current implementation is not very fast, I just
checked it.

It can compress at 10 MB per second on my MBA 11" (i7 processor);
the same test with memcpy() easily does 10 times better, going at
100 MB/sec in the same benchmark.
I'm not sure how this compares with other compression algorithms.
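
A measurement like this can be reproduced with a few lines (a sketch,
assuming a Python smaz binding exposing compress(); for numbers
comparable to the C implementation you would benchmark in C, since
interpreter overhead dominates on short strings):

import time

import smaz  # hypothetical binding, as discussed above

DATA = "this is a short user-submitted string of fewer than 100 bytes"
strings = [DATA] * 100000

start = time.time()
for s in strings:
    smaz.compress(s)
elapsed = time.time() - start

mb = len(DATA) * len(strings) / 1e6
print("%.1f MB in %.2f s -> %.1f MB/s" % (mb, elapsed, mb / elapsed))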

Cheers,
Salvatore


Hampus Wessman

Feb 24, 2012, 4:34:21 AM
to redi...@googlegroups.com
It might be interesting (for someone) to compare this with Google's
Snappy library (http://code.google.com/p/snappy/). I don't know much
about it, but Google apparently uses it for LevelDB and Bigtable and it
seems to be extremely fast. I don't think it compresses the data very
much on the other hand.

Cheers,
Hampus

Salvatore Sanfilippo

Feb 24, 2012, 4:43:26 AM
to redi...@googlegroups.com
On Fri, Feb 24, 2012 at 10:34 AM, Hampus Wessman
<hampus....@gmail.com> wrote:
> It might be interesting (for someone) to compare this with Google's
> Snappy library (http://code.google.com/p/snappy/). I don't know much
> about it, but Google apparently uses it for LevelDB and Bigtable and it
> seems to be extremely fast. I don't think it compresses the data very
> much on the other hand.

Hello Hampus, thanks for the pointer. From the README it does not
appear that this library can compress small strings; if this is true,
the comparison may not be apples to apples. Btw, IMHO it is possible
to make Smaz much faster, but... given some free time ;)

Salvatore

Salvatore Sanfilippo

Feb 24, 2012, 4:56:12 AM
to redi...@googlegroups.com
OK, I investigated a little more with the profiler:

The ratio between compression and decompression speed is 26 ;) So the
decompressor is pretty fast, but the compressor does silly things
because of the way the input table is formed: it will always try to
find the longest occurrence (and this is part of the algorithm of
course), and there is no way for it to understand that if something
did not match, then moving one byte forward will still not match, or
something like that. Currently it's a vanilla "for every offset, with
step 1, find the longest match".

This can surely be improved a lot if there is some way to create a
context at startup that builds some form of tree, or some other
structure for efficient lookup of matches in the table.

Cheers,
Salvatore


Hampus Wessman

Feb 24, 2012, 5:06:51 AM
to redi...@googlegroups.com
I think you're right about snappy and short strings. Google needs to
compress longer strings, so they're different use cases, really.

Interesting. It should definitely be possible to improve the
compression speed further. I don't have time right now either,
otherwise I would have given it a try :)

Cheers,
Hampus

Josiah Carlson

Feb 24, 2012, 12:04:54 PM
to redi...@googlegroups.com
There are two very simple ways to improve encoding performance. Either
use a Trie (for this situation, either a 16-ary or 256-ary trie would
be simplest), or use a pre-sorted array of structs.

If you use a trie, you never actually have to include your strings:
they are implicit in the structure. Also, you could encode your entire
table in a single array, using the values of the pointers as offsets
into the remainder of the array. Using a 256-ary trie, it would take
21,760 16-bit entries, or just over 40 KB. Using a 16-ary trie it
would take 3,260 8-bit entries. Pre-computing the data and shipping
the array as static data is probably the right answer.

Your array of structs can also use a 256-entry lookup table to find
where the first entry starting with a given character lies (and use
the next table slot to know where the last one lies). This will reduce
your search space down to 26 items when you are starting with a space
(and fewer for all other initial characters).
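
A sketch of that lookup-table idea (Python for brevity; CODEBOOK is a
tiny stand-in for smaz's real table of common fragments):

# Sort the codebook once so entries sharing a first character are
# contiguous, then record the first/last position per leading character.
CODEBOOK = sorted([" ", " and", " the", "a", "e", "ing ", "the ", "tion"])

first = {}
for i, entry in enumerate(CODEBOOK):
    rng = first.setdefault(entry[0], [i, i + 1])
    rng[1] = i + 1

def longest_match(text, pos):
    # Return the longest codebook entry matching text at pos, or None.
    rng = first.get(text[pos])
    if rng is None:
        return None
    best = None
    for i in range(rng[0], rng[1]):
        e = CODEBOOK[i]
        if text.startswith(e, pos) and (best is None or len(e) > len(best)):
            best = e
    return best

print(longest_match("see the cat", 3))  # -> " the"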

Regards,
- Josiah

David Yu

Feb 24, 2012, 12:14:43 PM
to redi...@googlegroups.com
On Fri, Feb 24, 2012 at 5:34 PM, Hampus Wessman <hampus....@gmail.com> wrote:
> It might be interesting (for someone) to compare this with Google's
> Snappy library (http://code.google.com/p/snappy/). I don't know much
> about it, but Google apparently uses it for LevelDB and Bigtable and it
> seems to be extremely fast. I don't think it compresses the data very
> much on the other hand.



--
When the cat is away, the mouse is alone.
- David Yu