Middle Manager Configuration

Torche Guillaume

Jan 28, 2015, 1:33:26 PM1/28/15
to druid-de...@googlegroups.com
Hi all,

I'm experiencing some issues with my real-time pipeline (Kafka + Storm + Druid).

For some reason, and only periodically, I get the following log lines from my Storm workers:

    com.twitter.finagle.GlobalRequestTimeoutException: exceeded 1.minutes+30.seconds to druid:firehose:rtb_impressions_deals-00-0007-0000 while waiting for a response for the request, including retries (if applicable)
        at com.twitter.finagle.NoStacktrace(Unknown Source) ~[na:na]
It seems that my Storm workers cannot send data into Druid, causing the Storm topology to lag. I am using r3.2xlarge instances (8 CPUs, 60 GB RAM, 160 GB SSD) for my middle managers, and all of them have a capacity of 12 workers.
The problem occurred at 20:50 yesterday, which is when my Druid workers have to build new in-memory indexes for some of my datasources (warming period set to 15 minutes and 10 minutes for two datasources, each producing 10 real-time tasks). These real-time tasks are the most "intense" tasks I have, and the request timeout exceptions are only thrown for them.

So I am wondering if my middle managers have too many workers (12 workers for 8 CPUs). Ten minutes before each hour I put extra pressure on them because of the creation of the new in-memory indexes, and I think they cannot do that and process my real-time tasks at the same time. Moreover, nobody is querying my real-time data at the moment, so I will also have to configure my indexing service so that it can handle real-time queries when the time comes. The relevant capacity setting is sketched below.
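
For reference, the worker capacity is set in the middle manager's runtime.properties. A minimal sketch of my current setup (only the capacity-related lines; everything else omitted):

    # middleManager runtime.properties (sketch; only the capacity line shown)
    # 12 task slots on an 8-CPU r3.2xlarge
    druid.worker.capacity=12

    # Per-peon settings can be passed through with the
    # druid.indexer.fork.property. prefix, e.g. the number of query
    # processing threads each peon gets (value illustrative):
    # druid.indexer.fork.property.druid.processing.numThreads=2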

I have attached some Ganglia graphs from one of my Middle manager which could be useful to make a decision.

Do you guys have any advice/ideas on how I should configure my middle managers? Do you think my analysis of the situation is correct, or am I missing something here?

Thanks a lot for your advice and comments!

Guillaume 
http://ganglia.gumgum.com/graph.php?g=load_report&z=large&c=druid&h=ip-10-225-167-70.ec2.internal&m=&r=day&s=descending&hc=4&st=1422469263
graph_load_proc.png
graph_cpu_utilisation.png
graph_memory_utilisation.png

Torche Guillaume

Jan 28, 2015, 1:45:32 PM1/28/15
to druid-de...@googlegroups.com
I would also like to point out that in the Druid documentation on configuring a production cluster, you choose an r3.8xlarge with 32 CPUs and only 10 workers for the middle manager.

I would like to understand why you made this choice. It seems to me that this configuration underuses the machine's capacity, right?

Guillaume

Gian Merlino

Jan 28, 2015, 1:58:59 PM1/28/15
to druid-de...@googlegroups.com
The example config is a conservative approach that should offer high ingestion and query performance on a wide variety of datasets. It allocates 3 CPUs to each peon: 1 for ingestion and 2 for querying (the peons are multi-threaded). It's also generous with memory, with the goal of providing enough page cache for a potentially large working set. This is important because ingestion and querying both make heavy use of the filesystem, so disks can become a bottleneck if you are light on page cache.
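
In middle manager terms, that allocation corresponds to something like the following runtime.properties lines (a sketch of the intent, not the exact example file):

    # r3.8xlarge: 32 CPUs, 10 peons, ~3 CPUs per peon
    druid.worker.capacity=10

    # Handed through to each peon: 2 threads for query processing.
    # Ingestion runs on its own thread, so each peon uses ~3 CPUs,
    # leaving a couple of CPUs for the middle manager process and the OS.
    druid.indexer.fork.property.druid.processing.numThreads=2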

You could potentially optimize a config for a specific dataset and get good performance out of less generous hardware. I think a good place to start would be to measure how much data you are actually writing per unit of time, and then choose disks and memory appropriately. After that, you can choose how many "extra" CPUs to include based on your expected query load.
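
As a purely illustrative back-of-the-envelope calculation (all numbers hypothetical): a real-time task ingesting 1 MB/s into one-hour segments accumulates roughly 1 MB/s × 3,600 s ≈ 3.6 GB per task, so 10 such tasks on one machine imply a working set on the order of 36 GB that you would want to keep in page cache, plus enough disk throughput to persist and merge those indexes.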

Torche Guillaume

Jan 28, 2015, 11:54:14 PM1/28/15
to druid-de...@googlegroups.com

Thanks for your reply Gian, I'm going to study the different things you mentioned and try to come up with the best indexing service configuration.

I'll let you know what I choose and how it works in production!

Torche Guillaume

Jan 29, 2015, 8:58:51 PM1/29/15
to druid-de...@googlegroups.com
Hi all,

Here is what I did for my indexing configuration:

I basically moved from r3.2xlarge instances to c3.4xlarge, because I don't actually need that much RAM for my working set, but I do need more computational power than the previous instances offered.
I am running each middle manager with a capacity of 8 peons and two threads per peon, which should allocate one CPU for ingestion and one CPU for querying per task. I also tuned other parameters which I believe are appropriate for my working set; the capacity settings are sketched below.
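
Concretely, that maps onto something like these middle manager runtime.properties lines (a sketch based on the numbers above; the other tuning parameters are omitted):

    # c3.4xlarge: 16 CPUs, 8 peons, ~2 CPUs per peon
    druid.worker.capacity=8

    # Handed through to each peon: 1 query processing thread;
    # ingestion runs on its own thread, so each peon uses ~2 CPUs.
    druid.indexer.fork.property.druid.processing.numThreads=1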

My real-time pipeline seems to run fine. I only need 4 middle managers instead of the 6 I needed with the previous instance type, and I get no request timeouts from Storm. Druid seems to ingest my real-time data fast enough.

Thanks again for your help.

Guillaume
 
--
Guillaume Torche
Computer engineering student - Université Technologique de Compiègne (UTC)
0033 6 60 48 51 23

Gian Merlino

Jan 30, 2015, 1:26:14 AM1/30/15
to druid-de...@googlegroups.com
Cool, good to know you found the right config!