To be clear, Druid treats all data the same way when doing batch
ingestion. It's only the real-time ingestion piece that can ignore
data, but that can always be fixed back up in batch. And we have plans
to (and most definitely will) adjust things so that real-time appends
can also be done regardless of timestamp.
Basically, whatever other system you dealt with, you most likely had
some tables partitioned by time, and if some time period in the past
changed, you went and re-processed that chunk of time: truncated the
partition and replaced it with a new one. That's the same model that
batch ingestion takes; Druid just takes care of the truncation and
replacement for you in a seamless manner that doesn't impact users.
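To make the truncate-and-replace model concrete, here is a minimal sketch. The partition dictionary and reprocess() function are stand-ins for illustration, not Druid APIs; the point is just that the whole time chunk is rebuilt from raw input and swapped in atomically.

```python
# Hypothetical sketch of the truncate-and-replace model (not Druid code).
# A "partition" here is just the summary for one time interval.

def reprocess(raw_rows):
    # Toy aggregation: sum of values per key.
    summary = {}
    for key, value in raw_rows:
        summary[key] = summary.get(key, 0) + value
    return summary

def restate_interval(partitions, interval, raw_rows):
    """Rebuild one time chunk from raw input and swap it in wholesale."""
    new_partition = reprocess(raw_rows)
    # Replace: whatever was previously stored for this interval is
    # dropped entirely, never patched incrementally.
    partitions[interval] = new_partition
```

Restating the same interval a second time with corrected raw rows simply overwrites the previous summary, which is exactly the semantics described above.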
In "data replay" cases, it's actually very difficult to get away from
a restatement model in batch, especially if your "replay" (or
"restatement") might delete some data. If you've already aggregated
some values and you need to remove some of the initial input rows for
those aggregations, it's significantly easier to just re-aggregate
than it is to actually try to apply deltas (at least, everyone I know
that has built a system trying to apply deltas has eventually realized
that it was a fool's errand and restructured it to just restate the
data).
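A toy illustration of why deltas break down (my own example, not anything from Druid): with a non-invertible aggregate like max, deleting an input row cannot be expressed as a delta against the stored summary, because the summary no longer remembers the other inputs. Re-aggregating the corrected raw data is trivial.

```python
# Toy example: max is not invertible, so "apply a delta" cannot
# undo a deleted input row, but restating from raw data can.

def aggregate_max(rows):
    return max(rows)

rows = [3, 9, 4]
summary = aggregate_max(rows)  # summary is 9

# Suppose the row with value 9 must be deleted. No delta applied to
# the summary alone can recover the right answer (4) -- the summary
# has forgotten the other inputs. Restating from raw rows works:
corrected = [r for r in rows if r != 9]
summary = aggregate_max(corrected)  # summary is now 4
```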
All of that said, the recommendation is that however you are getting
your data in in real time, you should also be teeing it off to some
other warehouse that holds the raw data. There are a number of
reasons for this architecture, including:
1) What Fangjin said: no real-time setup with presently available
technology that I am aware of can provide 100% data quality guarantees
on streaming data. Batch processing can provide much stronger
guarantees right now.
2) You don't want to be locked in to any data processing system (even
Druid). Having raw data sitting around gives you the freedom and
flexibility to change your mind about infrastructure later.
3) Druid is best when it is summarizing your data. In general,
storing raw data in Druid is usually more expensive than people want
to pay. If you are storing summaries, you are losing some fidelity in
what is available; having the raw data available in some warehouse
somewhere gives you the flexibility to go back and ask whatever
arbitrary question you want.
4) Druid is not currently a data warehouse replacement; it is a tool
in the open source data warehouse arsenal that can provide fast
queries against summaries of data and has the ability to provide the
operational characteristics required to actually power a user-facing
application.
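The tee-off architecture above can be sketched in a few lines. The event shape, send_to_druid callable, and raw_store are all hypothetical stand-ins for whatever stream and warehouse you actually use; the point is only that every event takes both paths.

```python
# Hypothetical sketch of teeing a real-time stream (not Druid code).
# Each incoming event goes both to the real-time pipeline and to an
# append-only raw store that later batch restatement can read from.

def tee(events, send_to_druid, raw_store):
    for event in events:
        raw_store.append(event)   # durable raw copy, kept for restatement
        send_to_druid(event)      # best-effort real-time path
    return raw_store
```

In a real deployment this is usually handled by the messaging layer itself (e.g. two consumers on the same stream) rather than in application code, but the data-flow shape is the same.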
While I am biased, I actually think that recommending people set up an
architecture where they maintain data in a batch-processable form is
setting them up for success in the long run rather than adding
unnecessary complexity. At least, given the current state of the art.
--Eric
https://groups.google.com/d/msgid/druid-development/CAKyF60Jvy1hDUwuyiDR9rZaREkHRjGtVcST%2Bth9Wa9pcytrsmg%40mail.gmail.com.