Pre-aggregated data from Druid


anshu...@deber.co

Jul 1, 2016, 11:34:04 AM
to Druid User
Druid supports ingesting both raw and pre-aggregated data (aggregated on dimensions). What are the advantages and disadvantages of providing pre-aggregated data? Also, how does Druid aggregate raw data?

Slim Bouguerra

Jul 1, 2016, 12:25:32 PM
to druid...@googlegroups.com
Ingesting pre-aggregated data into Druid makes ingestion "faster", since most of the work is done before ingestion.
 


Fangjin

Jul 1, 2016, 12:39:26 PM
to Druid User
By rolling up data, or pre-aggregating it, you reduce your storage requirements if your data rolls up well, and you save on the hardware costs of running a Druid cluster. In practice, rolling up data can reduce your storage requirements by an average of 40x. The tradeoff is that, although you won't lose fidelity in your metrics, you will lose information about the exact time an event occurred (due to timestamp truncation).
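To make the rollup idea concrete, here is a minimal Python sketch. It is not Druid's implementation, only a simulation: it assumes hourly truncation, and the event fields (`page`, `country`) and the `rollup` helper are made up for illustration. Events whose truncated timestamp and dimension values match collapse into one row with a summed count.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions):
    """Simulate rollup: truncate each timestamp to the hour, then count
    events that share the same truncated timestamp and dimension values."""
    totals = defaultdict(int)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncation to the hour: the exact event time is lost here.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        key = (bucket.isoformat(), tuple(event[d] for d in dimensions))
        totals[key] += 1
    return [
        {"timestamp": ts, **dict(zip(dimensions, dims)), "count": n}
        for (ts, dims), n in sorted(totals.items())
    ]


# 100 hypothetical raw events in the same hour with identical dimensions.
events = [
    {
        "timestamp": f"2016-07-01T11:{i % 60:02d}:{i // 60:02d}",
        "page": "home",
        "country": "US",
    }
    for i in range(100)
]

rows = rollup(events, ["page", "country"])
for row in rows:
    print(row)
```

The 100 input events come back as a single row whose count is 100: the metric is exact, but the individual event times within the hour are no longer recoverable.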

anshu...@deber.co

Jul 1, 2016, 2:28:12 PM
to Druid User
Thanks for the quick response, Fangjin and Slim. I am adding raw data to Druid using Spark Streaming and Tranquility. Now I have two more questions.

1. My real-time streaming job runs at an interval of 2 minutes, but I need a time granularity of an hour (in Druid). Since my streaming interval is 2 minutes, I can pre-aggregate data over 2 minutes only. I have to ingest data immediately because I also need to support up-to-the-present queries. I can run reindexing to get pre-aggregated data for past hours. Is there any other way to achieve the same?

2. While querying Druid using PlyQL, I am getting rows with already aggregated data. For example, if I add 100 events with the same dimensions and only a count as my metric, then querying "select * from datasource" returns 1 row with a count of 100. So is Druid itself aggregating some data before ingestion?

Fangjin Yang

Jul 1, 2016, 3:26:32 PM
to Druid User
Inline.


On Friday, July 1, 2016 at 11:28:12 AM UTC-7, anshu...@deber.co wrote:
Thanks for the quick response, Fangjin and Slim. I am adding raw data to Druid using Spark Streaming and Tranquility. Now I have two more questions.

1. My real-time streaming job runs at an interval of 2 minutes, but I need a time granularity of an hour (in Druid). Since my streaming interval is 2 minutes, I can pre-aggregate data over 2 minutes only. I have to ingest data immediately because I also need to support up-to-the-present queries. I can run reindexing to get pre-aggregated data for past hours. Is there any other way to achieve the same?

Druid does rollup/pre-aggregation for you. Set the queryGranularity to configure this. You don't need to do it in Spark.
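
As a rough sketch of what that setting looks like, the snippet below builds a `granularitySpec` fragment and prints it as JSON. Only `queryGranularity` is named in this thread; the `uniform` type and the hourly `segmentGranularity` are assumed values for illustration, and where the fragment goes (for example, in a Tranquility configuration) depends on your setup.

```python
import json

# Assumed example values: the thread only says to set queryGranularity.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "HOUR",  # assumption: one segment per hour
    "queryGranularity": "HOUR",    # timestamps are truncated to the hour at ingestion
}

print(json.dumps({"granularitySpec": granularity_spec}, indent=2))
```

With an hourly `queryGranularity`, events sent every 2 minutes that share dimension values should be folded into the same hourly row as they arrive, so no pre-aggregation step is needed in Spark.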
 
2. While querying Druid using PlyQL, I am getting rows with already aggregated data. For example, if I add 100 events with the same dimensions and only a count as my metric, then querying "select * from datasource" returns 1 row with a count of 100. So is Druid itself aggregating some data before ingestion?
