I'm setting up a new Druid cluster, but before that I want to make sure I understand all the variables involved in the queries we will run. While querying a test database (which already holds more than 150M events) through Pivot, we sometimes get timeout errors, mostly on topN queries, like this one:
Historical logs
2016-08-11T17:22:08,901 INFO [topN_eventData_[2016-08-07T15:00:00.000Z/2016-08-07T16:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:08.901Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/segmentAndCache/time","value":15529,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-07T15:00:00.000Z/2016-08-07T16:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-07T15:00:00.000Z_2016-08-07T16:00:00.000Z_2016-08-08T15:53:02.665Z","threshold":"1000","type":"topN","version":"0.9.1.1"}]
2016-08-11T17:22:08,901 INFO [topN_eventData_[2016-08-07T15:00:00.000Z/2016-08-07T16:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:08.901Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/wait/time","value":24364,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-07T15:00:00.000Z/2016-08-07T16:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-07T15:00:00.000Z_2016-08-07T16:00:00.000Z_2016-08-08T15:53:02.665Z","threshold":"1000","type":"topN","version":"0.9.1.1"}]
2016-08-11T17:22:09,010 INFO [qtp806588571-19[topN_eventData_aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]] io.druid.query.ChainedExecutionQueryRunner - Query timeout, cancelling pending results for query id [aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]
2016-08-11T17:22:09,010 INFO [qtp806588571-19[topN_eventData_aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]] io.druid.server.QueryResource - Query timeout [aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]
2016-08-11T17:22:09,010 INFO [topN_eventData_[2016-08-06T04:00:00.000Z/2016-08-06T05:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.010Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/segment/time","value":5687,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-06T04:00:00.000Z/2016-08-06T05:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-06T04:00:00.000Z_2016-08-06T05:00:00.000Z_2016-08-06T16:51:25.061Z","status":"failed","threshold":"1000","type":"topN","version":"0.9.1.1"}]
2016-08-11T17:22:09,010 INFO [topN_eventData_[2016-08-06T14:00:00.000Z/2016-08-06T15:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.010Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/segment/time","value":9232,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-06T14:00:00.000Z/2016-08-06T15:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-06T14:00:00.000Z_2016-08-06T15:00:00.000Z_2016-08-06T16:51:25.061Z","status":"failed","threshold":"1000","type":"topN","version":"0.9.1.1"}]
2016-08-11T17:22:09,010 INFO [topN_eventData_[2016-08-07T16:00:00.000Z/2016-08-07T17:00:00.000Z]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.010Z","service":"druid/historical","host":"some.host.com:8083","metric":"query/segment/time","value":15772,"dataSource":"eventData","dimension":"url","duration":"PT3600S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-07T16:00:00.000Z/2016-08-07T17:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","segment":"eventData_2016-08-07T16:00:00.000Z_2016-08-07T17:00:00.000Z_2016-08-08T15:53:02.665Z","status":"failed","threshold":"1000","type":"topN","version":"0.9.1.1"}]
Broker logs
2016-08-11T17:22:09,011 INFO [HttpClient-Netty-Worker-21] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.011Z","service":"druid/broker","host":"some.host.com:8082","metric":"query/node/bytes","value":26,"dataSource":"eventData","dimension":"url","duration":"PT388800S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-06T00:00:00.000Z/2016-08-06T01:00:00.000Z","2016-08-06T02:00:00.000Z/2016-08-10T13:00:00.000Z"],"numComplexMetrics":"2","numMetrics":"3","server":"some.host.com:8083","threshold":"1000","type":"topN","version":"0.9.1.1"}]
2016-08-11T17:22:09,011 INFO [qtp783141366-52[topN_eventData_aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]] io.druid.server.QueryResource - Query timeout [aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]
2016-08-11T17:22:09,011 INFO [qtp783141366-52[topN_eventData_aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b]] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-08-11T17:22:09.011Z","service":"druid/broker","host":"some.host.com:8082","metric":"query/time","value":40009,"context":"{\"queryId\":\"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b\",\"timeout\":40000}","dataSource":"eventData","duration":"PT604800S","hasFilters":"true","id":"aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b","interval":["2016-08-03T13:00:00.000Z/2016-08-10T13:00:00.000Z"],"remoteAddress":"127.0.0.1","success":"false","type":"topN","version":"0.9.1.1"}]
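From the broker log I can see the query ran with context {"timeout":40000} and died at 40009 ms, so it looks like it's hitting the per-query timeout rather than crashing. When I issue topN queries directly (outside Pivot) I can raise that limit via the query context; a rough sketch of what I'm sending (the dimension, metric, and aggregator here are placeholders, not my exact query):

```json
{
  "queryType": "topN",
  "dataSource": "eventData",
  "dimension": "url",
  "threshold": 1000,
  "metric": "count",
  "granularity": "all",
  "intervals": ["2016-08-03T13:00:00.000Z/2016-08-10T13:00:00.000Z"],
  "aggregations": [{ "type": "count", "name": "count" }],
  "context": { "timeout": 120000 }
}
```

Raising the timeout obviously just hides the problem, which is why I'm trying to understand where the time is actually going.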
Please note that this is using the quickstart test configuration with the following changes:
- in historical/runtime.properties
# HTTP server threads
druid.server.http.numThreads=11
# Processing threads and buffers
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=11
- in broker/runtime.properties
# HTTP server threads
druid.server.http.numThreads=11
# Processing threads and buffers
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=11
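One thing I did double-check about these settings: the processing buffers alone come to numThreads × sizeBytes of direct memory per process, and with the historical and broker both on this one machine (I believe the broker allocates its buffers too, for result merging) that is a large slice of the 32GB. Just the arithmetic on the values above, nothing Druid-specific:

```python
# Plain arithmetic on the config values above -- nothing Druid-specific.
buffer_size = 1073741824   # druid.processing.buffer.sizeBytes (1 GiB)
num_threads = 11           # druid.processing.numThreads

per_process = buffer_size * num_threads  # off-heap processing buffers per process
both = per_process * 2                   # historical + broker on the same box

print(per_process / 2**30)  # 11.0 GiB per process
print(both / 2**30)         # 22.0 GiB total, out of 32GB RAM (before JVM heaps)
```

So if both processes actually fill their buffers, there is not much headroom left for heaps and page cache, which may itself be part of the problem.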
Segments are between 50MB and 400MB each, at hourly granularity (only one partition per segment; target partition size of 5GB at index time). Local storage is used as deep storage.
Druid version: 0.9.1.1. Java 1.8.
Server specs: 1 server, 12 cores, 32GB RAM, 2x 256GB Samsung SSD (100k IOPS).
I'm having trouble finding out where the bottleneck is. According to a monitoring tool on the server, all cores are being used in parallel for every query. However, I can't find more details about what Druid is doing when the query fails, or about the failure itself (i.e. why the query takes so long to run).
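In case it helps: the only per-segment detail I've found so far is in the metric events themselves, so I've been pulling the query/segment/time values out of the historical log to see which segments are slow for a given query id. A rough sketch of the script I'm using (the log path in the usage comment is a placeholder):

```python
import json
import re

# Extract the JSON payload from LoggingEmitter lines like the ones above and
# report per-segment scan times for one query id.
EVENT_RE = re.compile(r"Event \[(\{.*\})\]")

def segment_times(lines, query_id):
    """Yield (segment, millis, status) for query/segment/time events of one query."""
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        event = json.loads(m.group(1))
        if event.get("metric") == "query/segment/time" and event.get("id") == query_id:
            yield event["segment"], event["value"], event.get("status", "ok")

# Usage (path is hypothetical):
# with open("log/historical.log") as f:
#     rows = sorted(segment_times(f, "aacc0fe6-8cbc-40b2-b9d0-8517d4de9e8b"),
#                   key=lambda t: -t[1])
#     for seg, ms, status in rows:
#         print(ms, "ms", status, seg)
```

That at least tells me which hours are slowest, but not why, which is what I'm stuck on.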
Any help would be appreciated. Thanks a lot!