There are a few ways to manage data retention, and one we have pushed in the past is supporting a TTL, so that datapoints are stored with the internal HBase insert time and expire according to this TTL.
When an operator implements a TTL-based data eviction policy, a situation can occur where, if a series receives no new datapoints during the TTL period, the series no longer holds any datapoints but still exists.
I've called this the Dead Series pattern, and we've thought of different ways to address it.
The first option would be to propagate the TTL from the token into the metadata.
The directory would then be TTL-aware and could run a routine to garbage collect dead series that carry a TTL in their metadata structure: process a find on a selector and compare the LastActivity (LA) field against the TTL.
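To make the comparison concrete, here is a minimal sketch of that garbage-collection pass. The `SeriesMeta` class and field names are hypothetical, not actual platform classes; the point is only the LA-versus-TTL test.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a TTL-aware directory GC pass (hypothetical names).
public class DeadSeriesGC {
    // Hypothetical per-series metadata: a TTL stored alongside the
    // last-activity (LA) timestamp, both in milliseconds.
    static class SeriesMeta {
        final String selector;
        final long ttlMs;          // TTL carried in the metadata structure
        final long lastActivityMs; // LA: timestamp of the last datapoint

        SeriesMeta(String selector, long ttlMs, long lastActivityMs) {
            this.selector = selector;
            this.ttlMs = ttlMs;
            this.lastActivityMs = lastActivityMs;
        }
    }

    // A series is "dead" when no datapoint arrived within its TTL
    // window (now - LA > TTL): every datapoint it held has expired.
    static List<SeriesMeta> findDeadSeries(List<SeriesMeta> all, long nowMs) {
        List<SeriesMeta> dead = new ArrayList<>();
        for (SeriesMeta m : all) {
            if (nowMs - m.lastActivityMs > m.ttlMs) {
                dead.add(m);
            }
        }
        return dead;
    }
}
```

The routine would run periodically over the result of a find on a selector, then delete whatever it flags.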
The second option would be to add a specific Egress call.
Alongside /update, /meta, /delete, /fetch and /find, we could add for example a /clean?ttl= endpoint, so that the TTL is not baked into the metadata structure but passed as a parameter to a dedicated method. This way we can still implement the cleaning process inside the directory itself, which would: scan the series like a FIND, compare the LastActivity with the given TTL, and delete the series directly. The problem here is that this would require querying every directory. I therefore propose that this routine could be enabled on a special directory specialized for this job, which would push a delete message on the Kafka metadata topic so that all directories consume it.
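The flow of that specialized directory could look like the sketch below. The class and message format are hypothetical, and a plain `Deque` stands in for the Kafka metadata topic; the TTL arrives as a request parameter rather than living in the metadata.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the /clean?ttl= handler on a specialized directory
// (hypothetical names). The TTL is a call parameter, not a metadata
// field; deletions are broadcast so every directory can consume them.
public class CleanEndpoint {
    // selector -> LastActivity timestamp (ms), as a FIND scan would yield
    private final Map<String, Long> lastActivity = new LinkedHashMap<>();
    // stand-in for the Kafka metadata topic consumed by all directories
    final Deque<String> metadataTopic = new ArrayDeque<>();

    void register(String selector, long lastActivityMs) {
        lastActivity.put(selector, lastActivityMs);
    }

    // Handle /clean?ttl=<ms>: delete every series whose LastActivity is
    // older than the TTL window and publish a delete message for each.
    int clean(long ttlMs, long nowMs) {
        int deleted = 0;
        Iterator<Map.Entry<String, Long>> it = lastActivity.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > ttlMs) {
                metadataTopic.add("DELETE " + e.getKey()); // broadcast to directories
                it.remove();                               // local delete
                deleted++;
            }
        }
        return deleted;
    }
}
```

Because the TTL is just a parameter, a user could call /clean with one week while the operator-side default remains the platform TTL.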
I feel that the second proposal is more efficient and less intrusive than the first. The first requires modifying the token and metadata structures and offers less flexibility, whereas the second lets a user trigger a clean with an arbitrary TTL value (one week, for example), while on the operator side it could be the TTL defined on the platform.
Also, since we rely on the TTL implemented by the LSM in HBase, the series are decorrelated from the datapoints: the TTL applies to the datapoints only. This mechanism is a proposal to help customers manage the entire dataset by covering the metadata part as well.
What do you think?