Query Performance issues while evaluating Druid


Abhishek Jain

Nov 11, 2013, 1:05:19 AM
to druid-de...@googlegroups.com
Hi

I am seeing excessively high latency when querying a dataset of roughly 20G in size, with 284 million rows in total.

Here is the schema I am using for ingestion (about 31 dimensions and 4 aggregations, day-wise segment granularity, with data populated for a month):


{
  "dataSource": "test_data",
  "timestampColumn": "timestamp",
  "timestampFormat": "iso",
  "dataSpec": {
    "format": "json",
    "dimensions": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31"]
  },  
  "granularitySpec": {
    "type":"uniform",
    "intervals":["2013-06-01T00:00:00Z/P30D"],
    "gran":"day"
  },  
  "pathSpec": { "type": "static",
                "paths": "/Users/abhishek.jain/Documents/test/inputFiles/*" },
  "rollupSpec": { "aggs":[ {"type": "count", "name": "event_count"},
                           {"type": "longSum", "name": "agg_d1", "fieldName": "d1"},
                           {"type": "longSum", "name": "agg_d2", "fieldName": "d2"},
                           {"type": "doubleSum", "name": "agg_d3", "fieldName": "d3"}
                         ],  
                  "rollupGranularity": "minute"},
  "workingPath": "/tmp/working_path",
  "segmentOutputPath": "/tmp/segments",
  "leaveIntermediate": "false",

   "partitionsSpec": {
          "targetPartitionSize": 1000000
   },  
  "updaterJobSpec": {

    "type":"db",
    "connectURI":"jdbc:mysql://localhost:3306/druid",
    "user":"druid",
    "password":"druid",
    "segmentTable":"prod_segments"
  }
}

In some of the previous posts, I saw you suggest using a timeseries query without any longSum aggregations in order to better utilise caching on the broker node. Here is a comparison of how latency changed with those factors, and how it is still unusually high.

Note: It is a 3-node setup, with compute nodes alone running on 2 of the nodes and the broker node, master node, ZooKeeper and MySQL all running on the third. All 3 nodes are in the same AWS region (for network latency considerations). Each node's configuration is: dual core 1.2 GHz processor, 7.5G of memory and 840G of disk space.

Broker Node Process launched with the following configuration:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker com.metamx.druid.http.BrokerMain

Master Node Process launched with the following configuration:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/master com.metamx.druid.http.MasterMain

Compute Node Process launched with the following configuration

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/compute com.metamx.druid.http.ComputeMain

The "druid.server.maxsize" property is set to 1 TB and the currently loaded segments are roughly 20 GB in size which is 2% of the server capacity.

When I use a groupBy query with longSum aggregations like the one below:

{

     "queryType": "groupBy",
     "dataSource": "test_data",
      "granularity": "all",
      "dimensions": [],
      "aggregations": [{"type": "longSum", "fieldName": "event_count", "name": "events"},
                              {"type": "count", "name": "rows"}],
      "intervals": ["2013-06-01T00:00/2013-07-01T00"]
}

Latencies described below:

Using broker node, 1st attempt: 37 sec
Using broker node, 2nd attempt (to see whether caching improves it): 37 sec. There appears to be no caching for groupBy queries, as previous posts also suggest.
Using compute node: 37 sec


When I change the query to a timeseries query with the same aggregations:

{

     "queryType": "timeseries",
     "dataSource": "test_data",
      "granularity": "all",
      "aggregations": [{"type": "longSum", "fieldName": "event_count", "name": "events"},
                              {"type": "count", "name": "rows"}],
      "intervals": ["2013-06-01T00:00/2013-07-01T00"]
}

Using broker node, 1st attempt: 15.38 sec
Using broker node, 2nd attempt: 15.36 sec
Using compute node: 15.42 sec


When I further change the query to drop the longSum aggregation (as suggested in one of the previous posts):

{

     "queryType": "timeseries",
     "dataSource": "test_data",
      "granularity": "all",
      "aggregations": [
                              {"type": "count", "name": "rows"}],
      "intervals": ["2013-06-01T00:00/2013-07-01T00"]
}

Using broker node, 1st attempt: 7.34 sec
Using broker node, 2nd attempt: 7.4 sec
Using compute node: 7.32 sec


As the observations above show, latency does not improve on the second attempt at the broker node for any query type. Also, these queries were issued from the same machine that hosts the broker node. Even the best-case latency of 7.5 sec seems far too high, as we are evaluating Druid for a realtime interactive reporting engine.

An important point to note here is that each query spans multiple segments, since segments are created per day and further split by the partitioning size (each day's segment is split into roughly 32-33 partitions). I suspect this is related to the cost of loading multiple segments into memory (paging), as outlined in one of the previous posts.

I think the examples and analysis posted above are closely related to this post (it may be helpful for digging further): https://groups.google.com/forum/#!topic/druid-development/JgJ305rl7JM

Could you please suggest other ways to improve these latencies, or are these the expected query times? I will gladly provide any further detail you may require. Quick help would be greatly appreciated.

Fangjin Yang

Nov 11, 2013, 7:38:06 PM
to druid-de...@googlegroups.com
Hi Abhishek,

It appears you are using the default configs and have not adjusted the settings that should be tuned to your underlying hardware to improve performance.

There are some docs about this under http://druid.io/docs/latest/Configuration.html.

Druid Processing Module

This module contains query processing functionality.

  • druid.processing.buffer.sizeBytes: This specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and Realtime nodes will use a scratch buffer of this size to do all of their intermediate computations off-heap. Larger values allow for more aggregations in a single pass over the data, while smaller values can require more passes depending on the query being executed. Default: 1073741824 (1GB)
  • druid.processing.formatString: Realtime and historical nodes use this format string to name their processing threads. Default: processing-%s
  • druid.processing.numThreads: The number of processing threads available for parallel processing of segments. Our rule of thumb is num_cores - 1; this means that even under heavy load there will still be one core available to do background tasks like talking with ZK and pulling down segments. Default: 1
Depending on what nodes you have running, I'd recommend increasing the JVM heap size such that it can accommodate a 1G intermediate buffer with some overhead space. Definitely change the numThreads you are running with. You will likely need to tune your broker too.
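As an illustration, a compute-node runtime.properties tuned to the rule of thumb above might look like the sketch below. The numbers assume a hypothetical 4-core, 16GB machine and are not a recommendation for any specific cluster:

```properties
# Processing threads: rule of thumb is num_cores - 1 (here: 4 cores -> 3 threads)
druid.processing.numThreads=3
# Off-heap intermediate buffer per processing thread (here ~512MB each)
druid.processing.buffer.sizeBytes=536870912
# Format string used to name the processing threads
druid.processing.formatString=processing-%s
```

The heap (-Xmx) then needs to leave room for these off-heap buffers plus OS page cache for memory-mapped segments.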

Abhishek Jain

Nov 12, 2013, 4:35:43 AM
to druid-de...@googlegroups.com
Hi Fangjin

Thanks for pointing me to the tunable configurations for query performance. They did indeed improve query times, but even with what I understand to be optimal settings, latency is still on the higher side, and I suspect this will not scale to larger amounts of data.

Just to give you a rough idea: a month of our data currently occupies around 20G of segments, with around 300 million records ingested across around 32 dimensions and 3-4 aggregations.

Just to reiterate, each compute node has the following configuration: dual core 1.2 GHz processor, 7.5G of memory and 840G of disk space.

We are using the following relevant properties in compute node config:

druid.processing.buffer.sizeBytes=2000000000           //which is roughly 2G

druid.processing.numThreads=

Launching the compute node with -Xmx6G

Also, I would like to bring to your notice that the parameter "druid.processing.buffer.sizeBytes" is read as an integer in the Druid compute node code, which causes a NumberFormatException when a value > 2000000000 is specified. I suspect this is a bug and the value should be read into a larger data type.

After tuning all of the above parameters, here is a look at the query times:

Query

{

     "queryType": "timeseries",
     "dataSource": "test_data",
      "granularity": "all",
      "aggregations": [
                              {"type": "count", "name": "rows"}],
      "intervals": ["2013-06-01T00:00/2013-07-01T00"]
}


Time taken - 3.5 - 4 sec

and it is about the same for groupBy queries with a single dimension and a single filter.

Also, query time seems to scale linearly with the amount of data queried: it drops to 1.5-2 sec for a 15-day range.

Analysing the above, it seems that if more data points arrive per month in the future, or if queries span more than a month, query latencies will degrade further.

We expect around 8x more data than this to be queried in the future, and we intend to achieve sub-second latencies for it.

Could you please suggest what other changes we should make to scale Druid to our requirements? It would also be great to know whether these are the expected query times, and what data loads you run on your own Druid cluster with tolerable query latencies.

Eric Tschetter

Nov 13, 2013, 12:35:25 AM
to druid-de...@googlegroups.com
Abhishek,


Thanks for helping in locating the tuneable configurations to improve query performance. It indeed resulted in improved query times but even after specifying the optimal configurations (as far as I understand) , the query latency is still on the higher side and I am suspicious of this being not able to scale for larger amounts of data.

Just to give you a rough idea, a month's data of ours is currently occupying around 20G of segment size with around 300 million records been ingested with around 32 dimensions and 3-4 aggregations.

Just to re-iterate , each compute node is of the following configuration : dual core 1.2 Ghz processor, 7.5G of memory and 840G of disk space

These are fairly small, weak machines.  You will do better on better hardware.
 

We are using the following relevant properties in compute node config:

druid.processing.buffer.sizeBytes=2000000000           //which is roughly 2G

druid.processing.numThreads=

Launching the compute node with -Xmx6G


So, what this means is that you have 6GB of memory max assigned to the JVM, you then have 2 processing threads that each utilize a 2GB buffer of memory for doing their processing.  You then have 20GB of segments replicated twice on two machines, meaning you have 20GB of segments on each machine.

So, you are running with 6 + 4 + 20 GB of memory for the process to use, but you have only 7.5 GB of memory on the node, so you are over-subscribed by roughly 4x. Given this setup, there is a high probability that you will start swapping, and swapping on and off disk slows things down.
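The memory budget above works out as a quick back-of-the-envelope check (a sketch using the numbers from this thread):

```python
# Back-of-the-envelope memory budget for one compute node, using this thread's numbers.
heap_gb = 6        # -Xmx6G
num_threads = 2    # processing threads
buffer_gb = 2      # druid.processing.buffer.sizeBytes per thread (~2G)
segments_gb = 20   # segments memory-mapped on this node

demand_gb = heap_gb + num_threads * buffer_gb + segments_gb
ram_gb = 7.5

print(demand_gb)            # 30
print(demand_gb / ram_gb)   # 4.0 -> heavily over-subscribed, expect paging
```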

If you are just benchmarking count queries, try setting your sizeBytes to more like 64MB and your compute node heap down to 1GB, see how that does for you.
 

Also, I would like to bring to your notice that the parameter "druid.processing.buffer.sizeBytes" is being read as an integer in Druid Compute Node code which causes a NumberFormatException when a value > 2000000000 is specified. I guess this is a bug where this should have been read in a bigger sized data type.

Nope, this is not a bug.  In Java you can only create off-heap buffers of 2GB (without doing your own magical tricks with Unsafe).  These buffers are the amount of "scratch" space a given processing thread has to store intermediate results, so a workload that requires more than 2GB is actually fairly uncommon right now.
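The ceiling comes from Java's 32-bit signed int (direct ByteBuffer capacities are ints), so the largest valid value is 2^31 - 1. A quick check of the numbers in question:

```python
# Java direct ByteBuffer capacities are 32-bit signed ints: max is 2**31 - 1.
INT_MAX = 2**31 - 1
print(INT_MAX)                     # 2147483647
print(2_000_000_000 <= INT_MAX)    # True: the value used in this thread parses fine
print(3_000_000_000 <= INT_MAX)    # False: such a value fails to parse as an int
```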
 

After tuning all of the above parameters, here is a look at the query times:

Query

{

     "queryType": "timeseries",
     "dataSource": "test_data",
      "granularity": "all",
      "aggregations": [
                              {"type": "count", "name": "rows"}],
      "intervals": ["2013-06-01T00:00/2013-07-01T00"]
}


Time taken - 3.5 - 4 sec

That means you are achieving 3.5 sec * 4 cores = 14 core-seconds. 300M rows / 14 core-seconds = ~21M rows per second per core scan speed. That's generally good.
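The scan-speed arithmetic, written out (a sketch using the figures from this thread):

```python
# Scan-speed estimate: wall-clock time x total cores = core-seconds consumed.
query_seconds = 3.5
cores = 4                  # 2 compute nodes x 2 cores each
rows = 300_000_000

core_seconds = query_seconds * cores            # 14.0
rows_per_sec_per_core = rows / core_seconds     # ~21.4M rows/sec/core
print(round(rows_per_sec_per_core / 1e6, 1))    # 21.4
```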
 


Also, the query time seems to decrease linearly with the amount of data being queried for. It becomes 1.5-2 sec for data range for 15 days.

That is expected.  Each query is scanning *all* relevant data.  Adding filters reduces the set of relevant data and speeds up queries.  Reducing the time range reduces relevant data and speeds up queries.
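For example, a selector filter restricts the scan to matching rows. A sketch of the shape such a query takes (the dimension name "d4" and value "mobile" here are hypothetical, purely to illustrate):

```json
{
    "queryType": "timeseries",
    "dataSource": "test_data",
    "granularity": "all",
    "filter": { "type": "selector", "dimension": "d4", "value": "mobile" },
    "aggregations": [ {"type": "count", "name": "rows"} ],
    "intervals": ["2013-06-01T00:00/2013-07-01T00"]
}
```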
 

Analysing the above, it seems that if we have more data points coming in a month in future or queries elapsing more than a month's time, the query latencies are going to degrade further.

We are looking forward to around 8X more data as compared to this being queried in the future. We intend to achieve sub-second latencies for the same.

That is not a problem.  If you want to achieve sub-second latencies, though, you will have to make different hardware decisions.
 

Can you please suggest what other changes we should be making in our system in order to scale Druid to our requirements. It will be great if you can let us know if these are the expected query times and what kinds of data loads are you guys having on the Druid cluster with tolerable query latencies.

I don't remember all of the numbers, but the last numbers I had for the Metamarkets cluster was that it was serving 20~30TB of segments on ~40 machines with 95th percentile query latencies under 1 second.

--Eric
 
 


Rakesh

Nov 13, 2013, 3:18:21 AM
to druid-de...@googlegroups.com
Eric, 

Abhishek and I are part of the same team - thanks for your input! A couple of follow-up queries:
  • The current stack is on Cassandra, and the reason to try out Druid is the interactive drill-down (slice-and-dice) queries. Looking at the projected numbers, we need to plan to scale it to support ~20TB of data. From your response, I'm a bit unclear about the overall cluster memory requirements: does Druid expect all the data to be in memory? If not, how can one estimate the total memory requirement (irrespective of the number of compute nodes, and assuming RF=1)?
  • We are on EC2 - any recommendation on the class of node? Based on http://druid.io/docs/0.6.10/Cluster-setup.html , it is recommended that compute nodes have as many cores as possible and good memory. c1.xlarge (7GB RAM / 8 cores / 1.7TB disk) seems like a good fit (cores/$). Since the max cores in any EC2 instance is 8, the other possible candidate is m3.2xlarge (30GB RAM / 8 cores / EBS).
  • From one of your earlier points, the overall node memory requirement = heap size + ((number of cores - 1) * buffer.sizeBytes). Using this, is this the correct way to look at the configs:
    • If c1.xlarge (7GB RAM), then 4GB heap + (7 x 300MB) = 6.1 GB
    • If m3.2xlarge (30GB RAM), then 8GB heap + (7 x 2GB) = 22 GB
  • As mentioned in the previous thread, druid.server.maxsize is set to 1 TB - does tuning this property depend only on total disk space, or also on the node's memory?
  • "Metamarkets cluster: 20~30TB of segments on ~40 machines with 95th percentile query latencies under 1 second" - so that's roughly 750GB/node. Could you share the machine specs, JVM params and relevant Druid tuning configs?
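To sanity-check the two per-node estimates above (a sketch of the formula from this thread; the (cores - 1) multiplier and the example heap/buffer numbers are assumptions from this discussion, not official guidance):

```python
# Per-node memory estimate: heap + (num_cores - 1) * per-thread buffer.
# Segment mmap headroom comes on top of this and is not included here.
def node_memory_gb(heap_gb, cores, buffer_gb):
    return heap_gb + (cores - 1) * buffer_gb

# c1.xlarge-style: 4GB heap, 8 cores, ~300MB buffers
print(round(node_memory_gb(4, 8, 0.3), 1))   # 6.1
# m3.2xlarge-style: 8GB heap, 8 cores, 2GB buffers
print(node_memory_gb(8, 8, 2))               # 22
```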
We are excited about Druid and have already run multiple POCs confirming that it solves our feature requirements. The only blocker to going ahead with Druid is understanding the scale implications: sub-second interactive queries over 20TB of data (groupBy queries, 1-2 dimensions, 2-4 filters, with occasional orderBy+limit).

Thanks in advance! 

Abhishek Jain

Nov 15, 2013, 6:31:29 AM
to druid-de...@googlegroups.com
Hi Eric

It would be great if you could help us answer the queries posted by Rakesh. Looking forward to your reply.

Fangjin Yang

Nov 15, 2013, 11:31:17 PM
to druid-de...@googlegroups.com
Hi Rakesh, hectic week causing some delays in emails :P. See inline.

On Wednesday, November 13, 2013 12:18:21 AM UTC-8, Rakesh wrote:
Eric, 

Abhishek and I are part of the same team - thanks for your input! Couple of follow up queries:
  • The current stack is on Cassandra and the reason to try out Druid is for the interactive drill-down queries ( slice-dice ). Looking at the projected numbers, we need to plan to scale it to support ~20TB of data. From your response, I'm a bit unclear around the overall cluster memory requirements - Does Druid expects all the data to be in memory? If not, how can one estimate the total memory requirement ( irrespective of # compute nodes and assuming RF=1 )?
Druid by default uses a memory-mapped model. It previously ran entirely in memory, but we decided memory mapping was more cost effective. If your nodes have sufficient memory to hold all the data you care about, then all data is in memory under this model; otherwise, some data may spill to disk. More info here:
 
  • We are on EC2 - any recommendation of the class of node? Based on http://druid.io/docs/0.6.10/Cluster-setup.html , it is recommended that compute node have as many cores as possible and good memory.  c1.xlarge ( 7GB RAM / 8 cores / 1.7TB disk ) seems like a good fit ( cores/$ ). Since max cores in any EC2 instance is 8, the other possible candidate is m3.2xlarge ( 30GB RAM / 8 cores / EBS ) 
SSDs can really help with the paging cost when data needs to be paged into memory. 7GB of ram is a bit small given the space required for OS operations, the intermediate buffer and memory required to host segments.
 
  • From one of your earlier points, the overall node memory requirement = HEAP size + (number of cores - 1 * buffer.sizeBytes ). Using this, is this a good correct way to look at the configs:
    • If c1.xlarge ( 7GB RAM ), then 4GB heap + ( 7 x 300MB ) = 6.1 GB
    • If m3.2xlarge ( 30GB RAM ), then 8GB heap + ( 7 X 2GB ) = 22 GB 
There is also a memory requirement to actually load segments into memory such that they can be scanned.
 
  • As mentioned in the previous thread, druid.server.maxsize is set as 1 TB - does the property tuning just depends upon the total disk space OR also on the node's memory? 
It depends on the memory to disk ratio that you want. Spilling too much data onto disk will introduce a ton of paging cost for queries that span multiple segments.
 
  • "Metamarkets cluster: 20~30TB of segments on ~40 machines with 95th percentile query latencies under 1 second" -  So looks like ~750GB/node. Could you share the machine specs, JVM params and relevant druid tuning configs? 
I have to ask if we are allowed to publicly disclose this information. Let me get back to you on that.
 
We are excited about Druid and already ran multiple POCs to confirm that Druid indeed solves our feature requirements. The only blocker for us to go ahead with Druid is to understand the scale implication - sub second interactive queries over 20TB of data ( groupBy queries, 1-2 dimensions, 2-4 filters with occasional orderBy+limit  )

Sorry for the delay! Let me know if this helped or if I can help further.