Hi,
There are two algorithms proposed in the paper HLL-TC and HLL-TC+. HLL-TC has the same error rate as HLL but provides 20% savings memory whereas HLL-TC+ reduces the error and also saves 45% in memory compared to vanilla HLL.
A couple of questions here,
HLL-TC seems to be very close to HLL with regards to estimation, so it does seem like it can support unions (this is also validated in the paper) whereas HLL-TC+ seems to be using an interesting MLE calculation each time the base register changes, I can't see how this would allow unions.
The second question is how duplicates are handled. It looks to me duplicates will not affect the result if they show up before base register is increased for an offset, but what if a value v1 is duplicated like this v1 -> something else -> base register updated and offsets adjusted -> v1 shows up again. Does this matter?
Regards,
Nat
P.S. Thanks Matt for pointing this paper to me