Hello,
Redis 2.8.9 is out and is the strangest beast in our release
history... a patch release that only adds significant new features, and
not a single bug fix.
The lack of bug fixes is actually good news: no significant issue was
found in many weeks, which means that the 2.8 branch is starting to be
pretty stable.
The new features introduced in 2.8.9 are:
1) A new data structure, the HyperLogLog, partially documented here:
http://antirez.com/news/75 (read more about it in the next sections of
this email).
2) Three new commands for the sorted set data structure in order to
support lexicographical range queries in Redis: ZRANGEBYLEX,
ZLEXCOUNT, ZREMRANGEBYLEX.
The rest of this email is split into sections to make for an easier read.
Why a 2.8 patch release with new features?
===
After months of working on Redis Cluster and Sentinel, people using
Redis as a pure in-memory data structure server, with a strong interest
in the capabilities of Redis from that point of view, and less interest
in the storage, HA, and sharding aspects of the system, were starting to
feel a bit left out. The Redis project really was catalyzed by cluster,
cluster, cluster, for a long time. The idea, as announced here before,
was to use the 3.2 release as a "back to data structures" theme for
new Redis developments, but that was *a lot* of time to wait. So I
decided to act ASAP and provide two self-contained but important new
tools to Redis users. Both of the new features are unlikely to
have any stability impact on Redis if you don't use them, so it was
possible to ship them ASAP in a stable release.
Also, before being put into a stable release, those changes were
tested by me and other users for a significant amount of time, and they
are working very reliably.
Now it is time to focus again on Redis Cluster and Sentinel for the
coming months, but I hope Redis 2.8.9 sends a signal to the Redis
community that we are not going to turn into a different thing: Redis
Cluster will be a feature, like Redis persistence or replication, but
the main goal of the Redis project remains to provide a core of
easy-to-exploit in-memory data structures.
HyperLogLog
===
The blog post linked above hopefully does a decent job of describing
what the new data structure is; however, the recent developments that
went into 2.8.9 are not documented there. Namely, the support merged
into 2.8.9 includes a sparse representation of HyperLogLogs. This
means that HLLs with low cardinalities don't need to use 12k of
memory, but a lot less.
Let's be pragmatic: HLLs observing N elements use M bytes of memory,
as in the following table:
N=10, M=47
N=100, M=290
N=500, M=1046
N=1000, M=1894
N=4000, M=12304
As you can see, at N=4000 the HyperLogLog is switched to the 12k-using
dense representation. But for lower cardinalities, the amount of
memory saved is big: if you have many HLLs counting different things
with a long-tail distribution, this really makes the difference between
being able to apply Redis or not.
Most of the complexity in the hyperloglog.c file is due to the sparse
representation, but I believe it was worth it. The encoding I used is
"proprietary", in the sense that it was invented specifically for
Redis' requirements of speed and in-place updates, plus the attempt to
be usage-pattern agnostic, that is, the representation is optimized
to work when PFCOUNT is called as often as PFADD. The moment it
switches from the sparse representation to the dense one can be
configured. By default we switch when the sparse representation uses
3000 bytes (1/4 of the dense representation's memory usage), as this is
the point where we are again on par with the order of magnitude of
memory used per element observed, and the point where the CPU cost of
supporting the sparse representation becomes significant.
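For reference, the switch point is controlled by a redis.conf
directive, shown here with its default value:

```
hll-sparse-max-bytes 3000
```

Setting it to 0 effectively forces the dense representation from the start.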
The commands PFADD, PFCOUNT, and PFMERGE are documented at
redis.io.
The dense and sparse representations used are documented in the
hyperloglog.c file comments.
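To make the counting idea concrete, here is a minimal pure-Python
sketch of a *dense* HyperLogLog. P=14 (16384 registers) matches Redis,
but the hash function and all names are mine, purely illustrative; the
real hyperloglog.c adds further corrections and the sparse encoding on
top of this idea:

```python
import hashlib
import math

P = 14                              # 2^14 = 16384 registers, as in Redis
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard bias-correction constant

def _hash64(item):
    # 64-bit hash of the item (Redis uses its own 64-bit hash function)
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], 'big')

class HLL:
    def __init__(self):
        self.reg = [0] * M          # the dense register array

    def add(self, item):
        h = _hash64(item)
        idx = h & (M - 1)           # low P bits select the register
        rest = h >> P               # remaining bits feed the rank
        rank = 1                    # 1-based position of the first set bit
        while rest & 1 == 0 and rank <= 64 - P:
            rank += 1
            rest >>= 1
        if rank > self.reg[idx]:
            self.reg[idx] = rank

    def count(self):
        # Raw estimate: bias-corrected harmonic mean of the registers.
        s = sum(2.0 ** -r for r in self.reg)
        est = ALPHA * M * M / s
        # Small-range correction: fall back to linear counting.
        zeros = self.reg.count(0)
        if est <= 2.5 * M and zeros:
            est = M * math.log(M / zeros)
        return int(est)
```

Adding 10000 distinct elements to this sketch yields an estimate close
to 10000 while storing only register maxima, never the elements
themselves.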
Sorted sets lexicographic range queries
===
Sorted sets were, since the start, about range queries: being an
ordered collection of elements, ranges are what matters most, and
getting the top N elements or a by-score interval are the common
operations. However, even if sorted set elements were already ordered
lexicographically when inserted with the same score, there was no easy
way to query by lexicographical range. It was possible to more or
less do it by exploiting the score plus a few Lua scripts, but this is
not a very high-performance business, nor is it as simple as it should
be.
Lexicographical range queries turn Redis into many things, for example
into a high-performance autocompletion server: I wrote a demo about
this use case:
http://autocomplete.redis.io/
You can do a lot more stuff with it as explained already in this
mailing list, like building secondary indexes, graphs, and so forth.
The new use cases are supported by three new commands: ZRANGEBYLEX,
ZLEXCOUNT, ZREMRANGEBYLEX. The commands are documented at
redis.io.
No internal change was performed to the sorted set implementation, so
the memory requirements, RDB file format, and everything else is
untouched.
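As a quick illustration, a lexicographic range query looks something
like this from redis-cli (all elements share score 0, so they are
ordered lexicographically; `[` means inclusive, `(` exclusive, and
`-`/`+` are the infinite limits):

```
redis> ZADD myindex 0 "apple" 0 "apricot" 0 "banana" 0 "cherry"
(integer) 4
redis> ZRANGEBYLEX myindex [a (b
1) "apple"
2) "apricot"
redis> ZLEXCOUNT myindex - +
(integer) 4
```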
It is interesting how, using different sorted set features, you can
build an incremental autocompletion engine with a few lines of code.
For example:
1) Every time you see a query, you use ZINCRBY in a sorted set to
increment its popularity.
2) Every time a given entry reaches a given popularity, you insert it
into the sorted set you use to autocomplete.
This is the vanilla schema, but of course a lot of refinements are
possible, like only keeping the top N entries by frequency, or having
multiple sorted sets with different "ranks" that you can use in order
to sort the autocomplete list you provide to the user, and so forth.
Also in many real-world setups, you want to trim the first sorted set
from time to time to just retain the top elements.
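To make the schema above concrete, here is a tiny in-memory Python
sketch: a dict stands in for the ZINCRBY popularity sorted set, a
sorted list for the score-0 sorted set you would query with
ZRANGEBYLEX, and the threshold and names are made up for the example:

```python
import bisect

PROMOTE_AT = 3   # popularity threshold (arbitrary, for the example)
counts = {}      # query -> popularity (the ZINCRBY sorted set)
index = []       # lexicographically sorted promoted queries

def saw_query(q):
    # Step 1: ZINCRBY-like popularity bump.
    counts[q] = counts.get(q, 0) + 1
    # Step 2: once popular enough, promote into the autocomplete index.
    if counts[q] == PROMOTE_AT:
        bisect.insort(index, q)

def complete(prefix, limit=5):
    # Same spirit as ZRANGEBYLEX index [prefix (prefix+\xff, with a LIMIT.
    i = bisect.bisect_left(index, prefix)
    out = []
    while i < len(index) and index[i].startswith(prefix) and len(out) < limit:
        out.append(index[i])
        i += 1
    return out
```

With Redis itself, the same two steps map directly onto ZINCRBY on the
counter set and ZADD plus ZRANGEBYLEX on the autocomplete set.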
It's the kind of stuff where it makes sense to build libraries
abstracting Redis away.
Ok, that's all for 2.8.9. I hope you'll enjoy this new Redis release
in the coming days. If you find issues, please report them ASAP via
GitHub or by replying here.
Regards,
Salvatore
--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org
To "attack a straw man" is to create the illusion of having refuted a
proposition by replacing it with a superficially similar yet
unequivalent proposition (the "straw man"), and to refute it
— Wikipedia (Straw man page)