Hello list!
It's a few weeks that I'm designing and implementing the new Stream data structure for Redis.
I'm very excited because this is the first general purpose data structure added to Redis after many years. Yes, in the past we added things like bit operations, hyperloglogs, lex ranges in sorted sets and with modules there are a multitude of other things possible. But Streams (which are actually "logs" conceptually), are the first new big data structure in Redis, like hashes, sets, lists and so forth.
The implementation of the proposal is proceeding in a now public branch here:
So far we have the following commands implemented:
XADD key field1 value1 ... fieldN valueN
XLEN key
XRANGE key <START id or time in milliseconds> <END id or time in milliseconds> [COUNT count]
XREAD key [BLOCK timeout] [COUNT count] STREAMS key1 start-ID1 key2 start-ID2 key3 start-ID3 ...
RDB storage is supported, AOF is still not supported.
So basically what we have is the data structure itself implemented, as an hybrid between a radix tree and the new "Listpack" serialization format. Range queries are fully implemented, and blocking on streams for new data is also supported.
However I reached a design dilemma with XADD, and I think that raising the issue publicly may improve the design, picking more brains to reach a better result.
This is how XADD now works:
The entry ID is generated and returned.
The XADD command looks simple and natural like the other basic Redis commands to add things, like SADD or RPUSH or alike.
However there is a problem. In order to replicate the command to AOF and slaves, the ID must be passed as already generated, because it is not generated purely on what the stream already contained, there is also the local time of the server involved (not always, the IDs are guaranteed to be always increasing, so if the last ID in the stream has a timestamp part of the ID which is greater than the current time, it is used instead of the local milliseconds time, and just the sequence part is implemented, which is the part after the dot -- All the details are in the specification).
P.S. in case you wonder why unique IDs are generated with a timestamp part, this gives you, with XRANGE, not just IDs range queries, but also time range queries.
So, this problem could be easily bypassed just adding a different XADD form to use for AOF / replication. We do this with EXPIRE for example, which is translated to EXPIRE-AT.
So if we get from the client:
XADD mystream name Ada surname Lovelace => 1505126157183.0
We replicate something like:
XADD-WITH-ID mystream 1505126157183.0 name Ada surname Lovelace
And this is what will be sent to AOF / slaves. Consistency will be saved and everybody is happy.
However in most applications, streams should also be capped collections, since to append to an infinitely long stream is not what you always want indeed.
So you should be able to specify, when adding, how many maximum items, or the maximum age, entries can have: everything older, or after a certain number of elements will be evicted automatically. Something like:
XADD-AND-EVICT mystream BYLEN 10000 name Ada surname Lovelace
Or
XADD-AND-EVICT mystream BYAGE 5000 name Ada surname Lovelace
For now, I'll even ignore that to replicate this we need another variant where I can also specify the ID. Also note that BYAGE will be translated to an absolute fixed millisecond.
Now here is the dilemma. The above not just gets confusing soon, but also if you think at it, most of the people may never use the vanilla XADD in many applications, since to have a capped stream could be vital for *most* applications. Remember that Redis is in memory. Even if the memory efficiency of streams is amazing (thanks to the representation used) and is going to improve even more with IDs / fields compression, still you want to say "no more than 100 million entries" or alike I guess.
So, maybe it's worth to get out of the mentality of "vanilla XADD should be simple" and give directly a command which is capable of handling all the exceptions?
It would be like that:
XADD key <ID or *> [MAXLEN <len> | MAXAGE <ms-age> | MAXTIME <ms-unixtime> | NO CAP] filed1 value1 ... fieldN valueN
Note that the * ID means: generate it yourself, as the current XADD implementation does by default. Why NO CAP means not capped. This also means that the user has to explicitly write that it is using an unbound stream.
So the initial example in this email will turn into:
XADD mystream * NO CAP name Ada surname Lovelace => 1505126157183.0
Which is more complex, but with a single command we'll accomodate everything including replication. The command will be simply replicated as:
XADD mystream 1505126157183.0 NO CAP name Ada surname Lovelace
Note that the obvious solution of adding just optional specifiers for the ID or eviction policy is not simple because we don't know the field-value number of pairs, nor we want that certain field names are invalid.
So far, that's my reasoning, any hint? Thanks!
--
Salvatore 'antirez' Sanfilippo
open source developer - Redis Labs
https://redislabs.com"If a system is to have conceptual integrity, someone must control the concepts."
— Fred Brooks, "The Mythical Man-Month", 1975.