We have a use case where events are sent from mobile devices, often in batches. We want to use the timestamp of when the event was created; however, events often arrive well past the window period of the realtime node's plumber, so by the time an event arrives it's game over for that event. We could extend the window period, but that is still a fixed interval within which we would risk losing events, and we would prefer not to carry the memory burden that a long window would incur.
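For reference, the relevant part of our realtime spec currently looks something like the sketch below (field names follow the 0.6.x-era realtime spec as I understand it; values are illustrative). With a serverTime rejection policy, anything timestamped more than windowPeriod behind the server clock is dropped on arrival:

  "plumber": {
    "type": "realtime",
    "segmentGranularity": "hour",
    "windowPeriod": "PT10m",
    "basePersistDirectory": "/tmp/realtime/basePersist",
    "rejectionPolicy": { "type": "serverTime" }
  }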
So we need some kind of process by which late-arriving events (not a majority of our traffic, but a significant part of it) can be persisted to historical nodes so the data shows up in query results -- ideally as close to "real time" as possible; i.e., once-a-day batch jobs to incorporate the stragglers probably won't meet the business requirements.
I found some comments in previous threads that seem to imply there are hooks or recommended methods for this:
"Most of the segments in a Druid cluster are immutable "historical" segments. To update data that already exists in a historical segment, you have to build a new segment for that interval of data with an increased version identifier. Druid uses MVCC to surface data from segments with higher version identifiers and obsolete outdated data from segments with older version identifiers. This process isn't really well documented but we will try in the near future to add more information on the wiki."
This sounds like a reasonable solution, but it also sounds potentially expensive -- something to run only as often as one would run a batch job.
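If I'm reading the MVCC description correctly, a rebuilt segment for the same dataSource and interval simply carries a newer version string (the task's creation time, I believe), and once it's loaded the older segment for that interval is dropped. Roughly (identifiers are illustrative):

  events_2014-01-15T00:00:00.000Z_2014-01-16T00:00:00.000Z_2014-01-15T01:00:00.000Z   <- from realtime handoff
  events_2014-01-15T00:00:00.000Z_2014-01-16T00:00:00.000Z_2014-01-16T06:00:00.000Z   <- batch rebuild; higher version shadows the above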
"Our realtime ingestion always has an accompanying batch process. The batch process and realtime processes deal with different data ranges and do not clobber each other. For example, over the course of a day, our realtime nodes collect data and periodically hand that data off to the rest of the Druid cluster. At the end of the day, our batch process runs and builds a daily segment for the previous day's data. This segment enters the cluster and obsoletes segments that were built by realtime ingestion for the previous day. At this point, the realtime node is ingesting and working with data that is more recent than the previous day."
Again, very similar and probably exactly what we need; I would love to learn more about the implementation details if it's a good fit.
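To make the question concrete: I'm picturing a daily HadoopDruidIndexer config along these lines (a sketch only; the paths, columns, and aggregators are made up, and the exact field names may differ across versions), covering the previous day's interval so the resulting segment obsoletes the realtime-built ones:

  {
    "dataSource": "events",
    "timestampColumn": "timestamp",
    "timestampFormat": "iso",
    "dataSpec": {
      "format": "json",
      "dimensions": ["device_id", "os", "app_version"]
    },
    "granularitySpec": {
      "type": "uniform",
      "gran": "DAY",
      "intervals": ["2014-01-15/2014-01-16"]
    },
    "pathSpec": {
      "type": "static",
      "paths": "hdfs://namenode/events/2014-01-15/*.json"
    },
    "rollupSpec": {
      "aggs": [{ "type": "count", "name": "count" }],
      "rollupGranularity": "minute"
    },
    "workingPath": "/tmp/druid/working",
    "segmentOutputPath": "hdfs://namenode/druid/segments"
  }

Is that roughly the intended mechanism, and is it reasonable to run it more often than daily (say hourly) so the stragglers show up sooner?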
Any guidance or suggestions are greatly appreciated.
- Mark