Roll Up with Kafka Indexing Service for delayed events


Pravesh Gupta

Jun 21, 2017, 9:27:10 AM
to Druid User
Hi,
I have a use case where I want to roll up duplicated events (same timestamp and same dimension values) with a longMax strategy, but the duplicates are separated by more than the taskDuration. So, basically, these events would end up in different segments.
But if my desired rollup strategy is longMax, then no matter what I query for those metrics, I am bound to get an incorrect result. Let's say the aggregation on the metric at query time is longSum.
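
To make the scenario concrete, here is a minimal Python sketch. It is not Druid code: the row layout, the helper names (`rollup_max`, `query_sum`, `query_max`), and the sample values are assumptions made up for illustration. It models two duplicate events that either land in the same task or in two different segments, rolls each segment up with a max, and then aggregates across segments at query time.

```python
def rollup_max(events):
    """Roll up events sharing (timestamp, dims), keeping the max metric value."""
    rolled = {}
    for ts, dims, value in events:
        key = (ts, dims)
        rolled[key] = max(rolled.get(key, value), value)
    return rolled

# Two duplicate events: same timestamp and same dimension values (hypothetical data).
early = ("2017-06-21T09:00:00Z", ("page=home",), 7)
late = ("2017-06-21T09:00:00Z", ("page=home",), 5)

# Case 1: both events are seen by the same task, so they roll up into one row.
perfect = [rollup_max([early, late])]

# Case 2: the late event arrives after the first segment was published,
# so each event ends up as its own row in a different segment.
partial = [rollup_max([early]), rollup_max([late])]

def query_sum(segments):
    """Sum the metric over every stored row, like a sum aggregation at query time."""
    return sum(v for seg in segments for v in seg.values())

def query_max(segments):
    """Take the max of the metric over every stored row."""
    return max(v for seg in segments for v in seg.values())

perfect_sum = query_sum(perfect)
partial_sum = query_sum(partial)
partial_max = query_max(partial)
```

In this toy model the sum over the two segments is 12, whereas the sum over the single rolled-up row is 7. A max at query time still yields 7 for this one key, so the discrepancy in the sketch comes from summing rows that were never rolled up together.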

Is this a known issue or limitation of the Kafka Indexing Service?
If it is, shouldn't it be mentioned in the Druid documentation?

Please help me here.

Thanks in advance,
Pravesh Gupta

Gian Merlino

Jun 21, 2017, 1:29:59 PM
to druid...@googlegroups.com
Hey Pravesh,

Yes, in general, rollup is not "guaranteed" with realtime ingestion. It will happen if the events are in the same Kafka partition and arrive close together in time, but it's not guaranteed if they arrive far apart in time. A note on http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html would be useful. Perhaps you'd like to raise a patch adding a note like that to "On the Subject of Segments" in https://github.com/druid-io/druid/blob/master/docs/content/development/extensions-core/kafka-ingestion.md?
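
As a rough illustration of that condition, the following Python sketch models ingestion under strong simplifying assumptions: one task per Kafka partition, a new task every `TASK_DURATION_S` seconds of arrival time, and rollup only among events handled by the same task. The constant, the function name `ingest`, and the event tuples are hypothetical, and real task and segment allocation in Druid is more involved than this.

```python
from collections import defaultdict

TASK_DURATION_S = 3600  # assumed taskDuration of one hour, for illustration only

def ingest(events):
    """Group events by (partition, arrival window) and roll up with max inside each group.

    events: iterable of (partition, arrival_s, timestamp, dims, value) tuples.
    Returns {task_id: {(timestamp, dims): rolled_up_value}}.
    """
    tasks = defaultdict(dict)
    for partition, arrival_s, ts, dims, value in events:
        task_id = (partition, arrival_s // TASK_DURATION_S)
        key = (ts, dims)
        prev = tasks[task_id].get(key)
        tasks[task_id][key] = value if prev is None else max(prev, value)
    return tasks

events = [
    (0, 10, "t0", ("a",), 3),    # partition 0, first task
    (0, 20, "t0", ("a",), 9),    # same partition, close in time: rolled up with the row above
    (0, 4000, "t0", ("a",), 4),  # same partition, but arrives after the task has rolled over
    (1, 15, "t0", ("a",), 6),    # different partition: handled by a different task
]

tasks = ingest(events)
total_rows = sum(len(rows) for rows in tasks.values())
```

Four events with identical timestamp and dimensions end up as three stored rows in this model: only the two that share a partition and arrive within the same task window are combined.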

It's possible to achieve full rollup after initial ingestion by running reindexing tasks, as the doc mentions.
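
Conceptually, a reindexing pass reads the rows of all segments for an interval and applies the rollup function again across them. The Python sketch below shows only that idea. It is not a Druid task spec or API, and the `reindex` helper and the sample segments are assumptions made up for illustration.

```python
def reindex(segments, combine=max):
    """Merge rows from several segments, combining values that share a rollup key."""
    merged = {}
    for segment in segments:
        for key, value in segment.items():
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged

# Two segments that each hold a row for the same (timestamp, dims) key.
seg_a = {("t0", ("a",)): 9, ("t1", ("a",)): 2}
seg_b = {("t0", ("a",)): 4}

compacted = reindex([seg_a, seg_b])
```

After the merge, the key `("t0", ("a",))` is stored once with the value 9, so a later sum over the stored rows no longer counts the duplicate twice.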

Gian

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+unsubscribe@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/f5d1d67f-328b-4ca0-ab68-ab858bb9ae8f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pravesh Gupta

Jun 22, 2017, 9:23:01 AM
to Druid User
Thanks, Gian, for the confirmation.



Pravesh Gupta

Jun 22, 2017, 9:25:51 AM
to Druid User
After giving it some thought, it's not even possible to handle this kind of scenario, because we cannot roll up two events if one is in Middle Manager memory and the other is in a segment in deep storage. I am thinking of it as a trade-off between windowless and windowed Druid ingestion.

