I read that someone ran just 2 historical nodes and 2 realtime nodes, but about 20(!) middle workers solely for reindexing the last historical segment — i.e. for working around the CAP-theorem trade-off in the lambda architecture. Is that really the only way to update data with new metrics? What if new metrics arrive constantly at arbitrary intervals, so I need to update arbitrary already-archived segments? In that case, to keep users happy with realtime stats, I would have to regenerate multiple segments at the same time, permanently. That seems irrational.
Constantly re-indexing raw data is certainly not an option. But I have an idea that needs testing. Say I have two types of events: clicks — a realtime flow that must be logged as it arrives — and conversions, which are delayed and touch multiple historical segments of clicks (maybe 100-200 of them every minute). What if I index click events with a segment granularity of 1 hour, while conversions are indexed with a granularity of 1 minute? The 1-minute granularity is required for displaying near-realtime stats. I would then get pairs of segments that can (I hope) easily be merged together after a short delay. But even this seems too complicated.
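To make the idea concrete, here is a rough sketch of how the two granularity specs might look in the respective ingestion specs (the field names follow Druid's uniform `granularitySpec`; the datasource names `clicks` and `conversions` are just my example, and the surrounding spec is elided):

```json
{
  "dataSchema": {
    "dataSource": "clicks",
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  }
}
```

```json
{
  "dataSchema": {
    "dataSource": "conversions",
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "MINUTE",
      "queryGranularity": "MINUTE"
    }
  }
}
```

The hourly click segments would stay untouched, and only the small minute-sized conversion segments would be created continuously and later merged into (or queried alongside) the corresponding hour — assuming such a merge is practical.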
I don't think the problem of updating data is fully covered anywhere. I would be glad to hear possible solutions from Druid developers or experienced users.