Like you said, you could compress your data in advance. Also,
depending on your actual problem, you may be able to store your data
in different ways that can reduce memory use overall, regardless of
data compression. Generally, people have had pretty good luck
describing to this list the data they plan to store and their access
patterns, and getting a better solution than they would otherwise have
come up with.
In terms of using smaz directly, it may be useful, but you will want
to test it out, as well as sanitize your data. Alternatively, if you
pre-train an arithmetic or huffman coder on a representative sample of
your input, you can use them as a sort of fixed codebook, which should
be able to beat smaz in just about every situation.
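To make the fixed-codebook idea concrete, here is a rough sketch in C.
The table contents below are made up; in practice you would generate the
whole table offline from byte frequencies measured on a representative
sample of your data, and ship a matching decoder built from the same
table.

/* Sketch of a fixed (pre-trained) Huffman-style codebook.
 * The entries below are hypothetical: you would fill the whole table
 * offline from a representative sample of your strings. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t code;   /* right-aligned bit pattern */
    uint8_t  bits;   /* number of bits in the code */
} hcode;

/* One entry per possible byte value, computed offline. */
static const hcode codebook[256] = {
    ['e'] = { 0x0, 3 },   /* frequent bytes get short codes... */
    ['t'] = { 0x1, 3 },
    [' '] = { 0x2, 3 },
    /* ...rare bytes get long codes (remaining entries omitted) */
};

/* Append the code for each input byte to the output bit stream.
 * Returns the number of bytes written, or -1 on error. */
int codebook_encode(const unsigned char *in, size_t inlen,
                    unsigned char *out, size_t outlen)
{
    uint64_t acc = 0;     /* bit accumulator */
    int nbits = 0;        /* bits currently pending in the accumulator */
    size_t o = 0;

    for (size_t i = 0; i < inlen; i++) {
        hcode c = codebook[in[i]];
        if (c.bits == 0) return -1;        /* byte not in the codebook */
        acc = (acc << c.bits) | c.code;
        nbits += c.bits;
        while (nbits >= 8) {               /* emit whole bytes */
            if (o == outlen) return -1;
            out[o++] = (unsigned char)(acc >> (nbits - 8));
            nbits -= 8;
        }
    }
    if (nbits > 0) {                       /* flush, padding with zeros */
        if (o == outlen) return -1;
        out[o++] = (unsigned char)(acc << (8 - nbits));
    }
    return (int)o;
}

The point is that no model or header is stored per value: the codebook
lives in your application, so even very short strings get the benefit.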
Regards,
- Josiah
It would be great if it were accessible from Lua.
--
Javier
Totally right; but what if you want to do some processing on the
content? After all, that's why there are JSON (and soon MessagePack)
libraries. The best would be to send compressed MessagePack structures
and still be able to access them from server scripts.
--
Javier
> Hi,
Hi Andy, sorry for the delay in the reply.
> I'm going to run my application on a small server so I'd like to
> reduce memory consumption as much as possible. I looked into data
> compression but it doesn't look like Redis uses it. But I came upon
> Smaz:
>
> http://news.ycombinator.com/item?id=540048
>
> 1) antirez mentioned that Smaz was intended to be used within Redis.
> Why didn't that happen? One comment on Hacker News mentioned security
> concerns when using Smaz on user-generated inputs. Is that a valid
> concern?
This is still in my plans: I have a TODO entry in my private todo list,
and I purchased a data compression book and will probably get more.
And finally... Redis is already mostly ready internally to support this.
Why was this not done before? Because it is one of those experimental
projects that needs some love.
There are no security concerns; this is how it works (exactly like
integer-encoded values):
SET foo bar
After the SET, the function tryObjectEncoding() is called in order to
check if the string is representable in some other way (currently it
only checks if the string looks like an integer that can be exactly
converted back into a string later: no spaces, no overflow, and so
forth).
This function calls Smaz against the string if it can't be encoded as
an integer. If the result is shorter than the native string, then the
object->encoding field is set to REDIS_ENCODING_SMAZ and the string is
stored as a Smaz string.
When reading the string Redis always uses getDecodedObject() to create
the original object from the encoded one.
So we basically have most of the core ready, but it is Smaz itself that
is not ready: it needs to be tuned, the code must be reviewed
for stability (I think there are a few bugs), and the encoder can be
improved.
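To make the flow above concrete, here is a very rough sketch in C.
tryObjectEncoding(), getDecodedObject() and REDIS_ENCODING_SMAZ are the
names used above, and smaz_compress()/smaz_decompress() come from
smaz.h; the struct, the constants and the rest of the code are
simplified for illustration and are not the actual Redis source.

#include <stdlib.h>
#include <string.h>
#include "smaz.h"

#define ENCODING_RAW  0
#define ENCODING_INT  1
#define ENCODING_SMAZ 2            /* REDIS_ENCODING_SMAZ in Redis */

typedef struct {                   /* stand-in for the Redis object */
    int encoding;
    size_t len;
    char *ptr;                     /* NUL-terminated string data */
} obj;

/* Called after SET: first try the integer encoding, then Smaz. */
void try_object_encoding(obj *o) {
    char *end;
    strtoll(o->ptr, &end, 10);
    if (end != o->ptr && *end == '\0') {   /* (real check is stricter) */
        o->encoding = ENCODING_INT;
        return;
    }
    char buf[4096];                        /* assume the value fits */
    int clen = smaz_compress(o->ptr, (int)o->len, buf, (int)sizeof(buf));
    if (clen > 0 && (size_t)clen < o->len) {   /* keep it only if shorter */
        char *c = malloc(clen);
        memcpy(c, buf, clen);
        free(o->ptr);
        o->ptr = c;
        o->len = (size_t)clen;
        o->encoding = ENCODING_SMAZ;
    }
}

/* Read path: rebuild the original string, the role getDecodedObject()
 * plays in Redis. Returns the decoded length, or -1 on error. */
int object_decode(const obj *o, char *out, int outlen) {
    if (o->encoding == ENCODING_SMAZ)
        return smaz_decompress(o->ptr, (int)o->len, out, outlen);
    if ((size_t)outlen < o->len) return -1;
    memcpy(out, o->ptr, o->len);
    return (int)o->len;
}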
> 2) Would you recommend using Smaz in my application code to compress
> user-submitted data before putting it into Redis? How many short
> strings (each fewer than 100 bytes) per second can Smaz compress on a
> quadcore CPU?
It's a brilliant idea, but I remember a bug report in Smaz that I never
fixed; I'll try to check if I can find the issue and fix the lib at
least.
In theory Smaz should be pretty fast; what I mean is, if it is not
already very fast it should be improved, because the kind of compression
used is simple to turn into a very fast computation. But
I did not try measuring the speed... I'll write a short benchmark if
there is not one already in the distribution.
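For what it's worth, a trivial throughput test could look like the
sketch below; it only assumes the smaz_compress() prototype from smaz.h,
and the sample string and iteration count are arbitrary.

/* Quick-and-dirty throughput test for smaz_compress() on short strings.
 * Compile with something like: cc -O2 bench.c smaz.c -o bench */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "smaz.h"

int main(void) {
    char s[] = "this is a short user-submitted string, under 100 bytes";
    int slen = (int)strlen(s);
    char out[256];
    long iterations = 1000000;

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        smaz_compress(s, slen, out, (int)sizeof(out));
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    double mb = (double)slen * iterations / (1024.0 * 1024.0);
    printf("%.2f MB compressed in %.2f s (%.2f MB/s, %.0f strings/s)\n",
           mb, secs, mb / secs, iterations / secs);
    return 0;
}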
Cheers,
Salvatore
> Thanks.
>
--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele
I just tested Smaz against the three old bug reports I got; actually it
was an issue with the Python binding, I think.
I also improved the fuzzing stresser of the implementation: no bugs.
I think it's pretty safe, but I'm pretty sure it is possible to make
Smaz compress better than it currently does. A few things to try are:
1) Some form of entropy coding, even if I think this is not going to
give extraordinary results because of the frequency distribution.
2) An optional optimizer pass on the output. It is possible to improve
the compression just with this.
3) I think a better version of Smaz could reserve an opcode for back
references in case of obvious redundancy in the text. It is strange that
currently it is not able to compress something like
"abababababababababab" IMHO (a toy sketch of the idea follows below).
Cheers,
Salvatore
It can compress at 10 MB per second on my MBA 11" (i7 processor).
The same test with memcpy() easily does 10 times better, going at
100 MB/sec in the same benchmark.
I'm not sure how this compares with other compression algorithms.
Cheers,
Salvatore
Hello Hampus, thanks for the pointer. From the README it does not
appear that this library can compress small strings; if this is true
the comparison may not be apples to apples. Btw, IMHO it is possible to
make Smaz much faster, but... given some free time ;)
Salvatore
The ratio between compression and decompression speed is 26 ;) So the
decompressor is pretty fast, but the compressor does silly things
because of the way the input table is formed: it will always try to
find the longest occurrence (and this is part of the algorithm of
course), and there is no way for it to understand that if something
did not match, then moving one byte forward will still not match, or
something like that. Currently it's a vanilla "for every offset, with
step 1, find the longest match".
This can surely be improved a lot if there is some way to create a
context at startup that results in some form of tree, or
some other structure for efficient lookup of matches in the table.
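For example (just a sketch, the table and its contents below are a toy
stand-in, not the real Smaz codebook), the startup context could simply
bucket the entries by their first byte, so that only entries that can
actually match at the current offset are tested:

/* Sketch of a startup "context" for faster matching: group the codebook
 * entries by first byte, so the compressor only scans one bucket. */
#include <string.h>
#include <stddef.h>

static const char *cb_entries[] = { "the", "th", "he", " the ", "of", "and" };
#define CB_COUNT ((int)(sizeof(cb_entries) / sizeof(cb_entries[0])))

static int bucket_start[257];           /* entry range per first byte */
static int sorted_idx[CB_COUNT];        /* entry indexes grouped by first byte */

void build_context(void) {
    int counts[256] = {0};
    for (int i = 0; i < CB_COUNT; i++)
        counts[(unsigned char)cb_entries[i][0]]++;
    bucket_start[0] = 0;
    for (int b = 0; b < 256; b++)
        bucket_start[b + 1] = bucket_start[b] + counts[b];
    int fill[256];
    memcpy(fill, bucket_start, sizeof(fill));
    for (int i = 0; i < CB_COUNT; i++)
        sorted_idx[fill[(unsigned char)cb_entries[i][0]]++] = i;
}

/* Longest codebook entry matching at 'p': only the bucket for *p is
 * scanned instead of the whole table. Returns the match length (0 if
 * none); on a match, the entry index is stored in *entry. */
int longest_match(const char *p, size_t remaining, int *entry) {
    unsigned char b = (unsigned char)*p;
    int best = 0;
    for (int k = bucket_start[b]; k < bucket_start[b + 1]; k++) {
        const char *e = cb_entries[sorted_idx[k]];
        size_t len = strlen(e);
        if (len <= remaining && (int)len > best && memcmp(p, e, len) == 0) {
            best = (int)len;
            *entry = sorted_idx[k];
        }
    }
    return best;
}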
Cheers,
Salvatore
Interesting. It should definitely be possible to improve the compression
speed further. I don't have time either right now, otherwise I would
have given it a try :)
Cheers,
Hampus
If you use a trie, you never actually have to include your strings; they
are implicit in your data. Also, you could encode your entire table in
a single array, using the values of the pointers as offsets into the
remainder of the array. Using a 256-ary trie, it would take 21760 16-bit
entries, or just over 40k. Using a 16-ary trie it would take 3260
8-bit entries. Pre-computing the data and shipping the array as static
data is probably the right answer.
Your array of structs can also use a 256-entry lookup table to know
where the first of the entries starting with a given character lies
(and use the next table entry to know where the last one lies). This
will reduce your search space down to 26 items when you are starting
with a space (and fewer with all other initial characters).
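As a rough sketch of the flattened-trie lookup (the layout below is
hypothetical, not an existing Smaz structure): each node is 256 16-bit
slots, the low bit of a slot marks "this prefix is a complete codebook
entry" and the remaining bits are the offset of the child node, with 0
meaning no child. The array itself would be generated offline and
shipped as static data.

/* Longest codebook match via a trie flattened into one static array. */
#include <stdint.h>
#include <stddef.h>

extern const uint16_t smaz_trie[];   /* pre-computed offline, node 0 = root */

/* Returns the length of the longest codebook entry that is a prefix
 * of 'in', or 0 if none matches. */
int trie_longest_match(const unsigned char *in, size_t inlen)
{
    size_t node = 0;      /* offset of the current node (root) */
    int best = 0;

    for (size_t i = 0; i < inlen; i++) {
        uint16_t slot = smaz_trie[node + in[i]];
        if (slot == 0) break;             /* no entry continues this way */
        if (slot & 1) best = (int)i + 1;  /* this prefix is a complete entry */
        node = (size_t)(slot >> 1);       /* follow the child node */
        if (node == 0) break;             /* terminal entry with no child */
    }
    return best;
}

Each input position then costs one table probe per matched byte instead
of a scan over the whole codebook.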
Regards,
- Josiah
It might be interesting (for someone) to compare this with Google's
Snappy library (http://code.google.com/p/snappy/). I don't know much
about it, but Google apparently uses it for LevelDB and Bigtable, and it
seems to be extremely fast. On the other hand, I don't think it
compresses the data very much.