You have the right idea. Let me describe how we do it, which I think
is simpler than what you're thinking.
Consider the simple case of counting. In that case, the realtime layer
can essentially ignore the batch layer and update databases ad
infinitum. For example, our realtime layer does batched atomic
increments into Cassandra for clicks over time.
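To make that concrete, here's a toy sketch in Python of what those
increments amount to. The dicts stand in for a Cassandra counter column
family, and the hour-bucket granularity and names are just for
illustration, not our actual schema:

    from collections import defaultdict
    from datetime import datetime, timezone

    # Stand-in for a Cassandra counter column family keyed by (url, hour bucket).
    cassandra_counters = defaultdict(int)
    # Increments buffered locally and applied in one batch.
    pending = defaultdict(int)

    def record_click(url, when=None):
        # Bucket clicks by hour; the granularity here is illustrative.
        when = when or datetime.now(timezone.utc)
        pending[(url, when.strftime("%Y-%m-%dT%H"))] += 1

    def flush():
        # In production this is a batch of atomic counter increments sent to
        # Cassandra; here it just folds the buffer into the dict.
        for key, n in pending.items():
            cassandra_counters[key] += n
        pending.clear()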
The batch layer performs the same computations (perhaps with
additional analysis only possible in batch), and emits its results
into ElephantDB. It also emits a "data up to date since this time"
value that is used by the *application* layer, not the realtime layer.
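Very roughly, a batch run then boils down to something like the sketch
below. Again the names (elephantdb_counts, batch_metadata) are
illustrative stand-ins; the point is that the run recomputes everything
from scratch and publishes a cutoff alongside the results:

    from collections import Counter

    # Stand-ins for what a batch run emits: ElephantDB key/value data plus a
    # small metadata record the application layer reads.
    elephantdb_counts = {}
    batch_metadata = {}

    def run_batch(master_dataset, data_up_to):
        # Recompute every count from scratch over the immutable master dataset --
        # the same computation the realtime layer does incrementally.
        # `master_dataset` is an iterable of (url, hour_bucket) click records.
        elephantdb_counts.clear()
        elephantdb_counts.update(Counter(master_dataset))
        # Publish "my results cover data up to here" for the application layer.
        batch_metadata["data_up_to"] = data_up_to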
Our application, when resolving a query, knows the cutoff time up to
which it should get results from EDB and beyond which it should get
results from Cassandra (because the batch layer told it). So it splits
up the query appropriately between those two databases.
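Here's a toy version of that resolution step. The URLs, buckets, and
the choice to put the cutoff bucket itself on the batch side are made
up for illustration:

    # Toy data: batch results (ElephantDB) and realtime counters (Cassandra).
    batch_counts = {
        ("example.com/a", "2011-06-01T10"): 500,
        ("example.com/a", "2011-06-01T11"): 450,
    }
    realtime_counts = {
        ("example.com/a", "2011-06-01T11"): 30,   # overlaps batch; ignored below
        ("example.com/a", "2011-06-01T12"): 75,
    }
    # Published by the batch layer: "results cover data up through this bucket".
    data_up_to = "2011-06-01T11"

    def clicks(url, hour_buckets):
        # Application layer: buckets at or before the cutoff come from the
        # batch results, buckets after it from the realtime counters.
        total = 0
        for bucket in hour_buckets:
            source = batch_counts if bucket <= data_up_to else realtime_counts
            total += source.get((url, bucket), 0)
        return total

    print(clicks("example.com/a",
                 ["2011-06-01T10", "2011-06-01T11", "2011-06-01T12"]))
    # 500 + 450 + 75 = 1025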
Now, the next step is to flush data you no longer need from Cassandra.
You can do this by running two Cassandra clusters and rotating between
them as the batch layer performs its updates. A rotate will clear and
reset one of the clusters.
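One way to do that rotation, in sketch form (I'm filling in some
details here, in particular writing realtime increments to both
clusters so that wiping one never loses data):

    # Two Cassandra clusters, stood in by dicts. Increments go to both;
    # queries are served from one, and a rotate flips which one serves.
    clusters = [dict(), dict()]
    serving = 0

    def record_click(url, hour_bucket):
        # Realtime layer: increment the counter in both clusters.
        for c in clusters:
            key = (url, hour_bucket)
            c[key] = c.get(key, 0) + 1

    def rotate():
        # Called when a batch run finishes. Everything in the currently-serving
        # cluster is now either covered by the new batch results or also present
        # in the other cluster, so it's safe to wipe. Serve from the other
        # cluster and clear this one for the next cycle.
        global serving
        old = serving
        serving = 1 - serving
        clusters[old].clear()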
Time tends to be the recurring dimension along which the results of the
batch and realtime layers get merged, which I guess isn't surprising.
You can extend this pattern to involve more coordination between batch
and realtime layers for more complex kinds of queries.
The reason this architecture with seemingly many moving parts works is
that everything in the realtime layer is transient. The
batch workflow, the authoritative source of data in this system, is
actually rather simple.
-Nathan