does druid support pre-aggregation of data as part of ingestion?


Nicolae Marasoiu

Oct 2, 2015, 8:15:30 AM
to Druid User
Hi,

Ad-hoc queries are great, but we also need to serve predefined queries in milliseconds, and for those, pre-aggregations are king (e.g., group-bys on certain dimension combinations computed in advance).

Does Druid also support predefined queries by means of pre-aggregation during ingestion (i.e., realtime pre-aggregation on realtime nodes and batch pre-aggregation during a batch import)?

I am thinking of a workflow where we would use an HTTP endpoint to which we send requests such as "create a new pre-aggregate/index with this JSON definition" or "delete this pre-aggregate".

If this is possible, is it also possible to create some pre-aggregations from the input stream and then drop the granular events (the input stream itself)?

This would allow us to skip Spark Streaming and just use one system, one set of metric definitions/implementations, and so on.

Please advise,
Nicu

Fangjin Yang

Oct 2, 2015, 4:34:55 PM
to Druid User
Hi, please see inline.


On Friday, October 2, 2015 at 8:15:30 AM UTC-4, Nicolae Marasoiu wrote:
Hi,

Ad-hoc queries are great, but we also need to serve predefined queries in milliseconds, and for those, pre-aggregations are king (e.g., group-bys on certain dimension combinations computed in advance).

Does Druid also support predefined queries by means of pre-aggregation during ingestion (i.e., realtime pre-aggregation on realtime nodes and batch pre-aggregation during a batch import)?

Yes, this is configured in the ingestion spec. Ingested data can be rolled up to predefined granularities (minute, hour, day, etc.) or a custom granularity.
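For example, a minimal sketch of the relevant pieces of the dataSchema in an ingestion spec (the metric field `revenue` is hypothetical; rollup happens to `queryGranularity` buckets using the aggregators in `metricsSpec`):

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "intervals": ["2015-10-01/2015-10-02"]
  },
  "metricsSpec": [
    {"type": "count", "name": "count"},
    {"type": "doubleSum", "name": "revenue_sum", "fieldName": "revenue"}
  ]
}
```

With this, all rows falling in the same hour with identical dimension values are collapsed into one pre-aggregated row at ingestion time.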

I am thinking of a workflow where we would use an HTTP endpoint to which we send requests such as "create a new pre-aggregate/index with this JSON definition" or "delete this pre-aggregate".

If this is possible, is it also possible to create some pre-aggregations from the input stream and then drop the granular events (the input stream itself)?

This is also possible. Oftentimes people will pre-aggregate data before loading it into Druid.

Saksham Garg

Oct 12, 2015, 11:55:48 AM
to Druid User
Hi,
You mentioned that ingested data can be rolled up on the basis of some custom granularity. Can you give an example spec for that?

In my system, people generally run topN queries on some dimension. It would be great if this could be pre-aggregated for at least some dimensions (5-10 out of around 80 dimensions).

Gian Merlino

Oct 14, 2015, 4:01:14 AM
to Druid User
Hey Saksham,

All of the ingestion specs (batch, realtime) can be given a "queryGranularity" to do rollup at ingestion time. By default this is "NONE" but you can also set it to "MINUTE" or "HOUR" or any query granularity object supported by Druid.
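For a custom granularity, `queryGranularity` can also be a granularity object rather than a simple string. A sketch (the period and timezone values here are just examples):

```json
{
  "queryGranularity": {
    "type": "period",
    "period": "PT15M",
    "timeZone": "UTC"
  }
}
```

The `period` value is an ISO 8601 period, so you can express buckets like 15 minutes (`PT15M`) or a week (`P1W`) that the simple string granularities don't cover.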


Nicolae Marasoiu

Nov 3, 2015, 8:47:54 AM
to Druid User
Hi,

How can I index different pre-aggregations of the same data? In the same data source or another?

Let's say I import data into datasource "hourly_dimensions_metrics", which is a primary aggregation of the actual events, from which all the other aggregations are derived.
(My understanding is that I can run a Hadoop index task that scans the pre-joined event logs directly and outputs "hourly_dimensions_metrics", or have a Spark/Hadoop map-reduce job pre-compute it.)

Now I want to precompute a few aggregations on top of "hourly_dimensions_metrics", such as daily rollups, or rollups with fewer dimensions.
How do I do this? With more index tasks, using "hourly_dimensions_metrics" as both source and destination?
If I need to use different datasources (which I intuitively expect, since each datasource has a fixed schema), then queries will no longer be agnostic of the indexing; that is, each query will have to name the datasource (the index name). Is this correct?

Thanks,
Nicu

Fangjin Yang

Nov 4, 2015, 7:31:11 PM
to Druid User
Inline.


On Tuesday, November 3, 2015 at 5:47:54 AM UTC-8, Nicolae Marasoiu wrote:
Hi,

How can I index different pre-aggregations of the same data? In the same data source or another?

You can reindex data for different time intervals.

Let's say I import data into datasource "hourly_dimensions_metrics", which is a primary aggregation of the actual events, from which all the other aggregations are derived.
(My understanding is that I can run a Hadoop index task that scans the pre-joined event logs directly and outputs "hourly_dimensions_metrics", or have a Spark/Hadoop map-reduce job pre-compute it.)

Now I want to precompute a few aggregations on top of "hourly_dimensions_metrics", such as daily rollups, or rollups with fewer dimensions.
How do I do this? With more index tasks, using "hourly_dimensions_metrics" as both source and destination?

Druid uses MVCC and every segment has a version. When you create new segments with modified data for the same interval, they replace old segments that were there before.
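A rough sketch of how such a re-aggregation can be expressed as an index task reading from an existing datasource via the `ingestSegment` firehose. The datasource names and intervals are taken from the thread, the output datasource `daily_dimensions_metrics` is hypothetical, and the spec is trimmed for brevity (a real spec also needs a parser/dimensions section, and the exact layout depends on your Druid version):

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "daily_dimensions_metrics",
      "metricsSpec": [
        {"type": "longSum", "name": "count", "fieldName": "count"}
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MONTH",
        "queryGranularity": "DAY",
        "intervals": ["2015-10-01/2015-11-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "ingestSegment",
        "dataSource": "hourly_dimensions_metrics",
        "interval": "2015-10-01/2015-11-01"
      }
    }
  }
}
```

Note two things: if the output `dataSource` is instead the same as the input, the newly created segments replace the old ones for that interval per the MVCC behavior described above; and when re-aggregating an already-rolled-up `count` metric, the aggregator must be a `longSum` over the existing `count` column, not a fresh `count`.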

anshu...@deber.co

Jul 1, 2016, 11:39:56 AM
to Druid User
If Druid is already aggregating the data at the configured granularities, what are the advantages of providing pre-aggregated data?

