Actual load-based segment rebalance and Incremental broker query results

Roman Leventov

Dec 15, 2016, 6:53:03 PM
to druid-de...@googlegroups.com
We see significant variance (up to 3x between outliers) in the number of queries sent to historical nodes, and consequently in the amount of computation they have to do, in query wait times, and in total query latency.

It seems to me that the only way to resolve this is to start accounting for the actual load that historical nodes experience and to do some fuzzy rebalancing (most loaded -> least loaded) based on that. (1)
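
To make (1) a bit more concrete, here is a minimal sketch, assuming each historical node reports a single load number (for example, queries served over some window). The names LoadRebalanceSketch, Move, proposeMove and maxRatio are hypothetical and not part of Druid; how a segment would actually be picked and moved is left out.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; none of these names exist in Druid. */
final class LoadRebalanceSketch {
    /** A proposed segment move from the most loaded node to the least loaded one. */
    static final class Move {
        final String from;
        final String to;

        Move(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Proposes one move when the most loaded node carries more than
     * maxRatio times the load of the least loaded node; otherwise nothing.
     */
    static Optional<Move> proposeMove(Map<String, Double> loadByNode, double maxRatio) {
        String busiest = null;
        String idlest = null;
        for (Map.Entry<String, Double> e : loadByNode.entrySet()) {
            if (busiest == null || e.getValue() > loadByNode.get(busiest)) {
                busiest = e.getKey();
            }
            if (idlest == null || e.getValue() < loadByNode.get(idlest)) {
                idlest = e.getKey();
            }
        }
        // No nodes, a single node, or all loads equal: nothing to do.
        if (busiest == null || busiest.equals(idlest)) {
            return Optional.empty();
        }
        double high = loadByNode.get(busiest);
        double low = loadByNode.get(idlest);
        // Stay put while the imbalance is within the tolerated ratio.
        if (high <= low * maxRatio) {
            return Optional.empty();
        }
        return Optional.of(new Move(busiest, idlest));
    }
}
```

The ratio threshold is what makes the rebalance "fuzzy": small differences in load do not trigger any movement.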

This variance affects total query latency because the broker waits until all historical/indexing nodes respond before returning results to the client. So query latency is never less than the latency of the slowest historical/indexing node.

A possible way to mitigate this problem is to make the broker return incremental results via a keep-alive response, sending results to the client as they arrive from historical/indexing nodes. (2)

Also, if some nodes fail to respond, this would allow the broker to return at least some results to the client, which is often better than nothing.
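
A rough sketch of the idea in (2), assuming each per-node request can be modelled as a Callable and that partial results can be handed to the client independently. IncrementalResultsSketch and streamAsCompleted are hypothetical names; this is not how the broker is implemented, and real results would still need merging.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Hypothetical sketch; not part of Druid. */
final class IncrementalResultsSketch {
    /**
     * Runs every per-node call and hands each result to the sink as soon as
     * it completes, in completion order. Calls that fail are skipped.
     * Returns the number of partial results delivered.
     */
    static <T> int streamAsCompleted(List<Callable<T>> nodeCalls, Consumer<? super T> sink) {
        if (nodeCalls.isEmpty()) {
            return 0;
        }
        ExecutorService pool = Executors.newFixedThreadPool(nodeCalls.size());
        try {
            CompletionService<T> done = new ExecutorCompletionService<>(pool);
            for (Callable<T> call : nodeCalls) {
                done.submit(call);
            }
            int delivered = 0;
            for (int i = 0; i < nodeCalls.size(); i++) {
                try {
                    // take() blocks until the next call finishes, whichever it is.
                    sink.accept(done.take().get());
                    delivered++;
                } catch (ExecutionException e) {
                    // A failed node is skipped, so the client still gets the rest.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return delivered;
        } finally {
            pool.shutdown();
        }
    }
}
```

The total time is still bounded by the slowest node; what changes is that the client sees the fast nodes' results early and gets a partial answer when a node fails.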

IntervalChunkingQueryRunner doesn't seem to be an equivalent. Actually, I don't see what its purpose is other than emitting query/intervalChunk/time, which allows some monitoring; I don't see how it helps to complete queries faster.

Have those questions and solutions been discussed before? Do you have any thoughts about (1) and (2)?

Please correct me if I'm wrong about something.

Gian Merlino

Dec 15, 2016, 7:31:39 PM
to druid-de...@googlegroups.com
Do you use connectionCount load balancing? We have seen some weird behavior recently that I think would most likely be explained by a bug in the connection tracking, which would end up biasing queries against certain historical nodes.

For (1) I think that makes sense and is within the bounds of what a distributed database should be able to do. We just need to make sure we get the feedback loop right so it doesn't cause stability problems.
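
One common way to keep such a feedback loop calm is to act on a smoothed load signal rather than on raw samples. The following is only a sketch under that assumption: SmoothedLoadSketch and alpha are hypothetical names, and an exponentially weighted moving average is one option among several, not something Druid is known to use here.

```java
/** Hypothetical sketch; not part of Druid. */
final class SmoothedLoadSketch {
    private final double alpha;
    private double value;
    private boolean seeded;

    /** alpha in (0, 1]: smaller values react more slowly to new samples. */
    SmoothedLoadSketch(double alpha) {
        if (!(alpha > 0.0 && alpha <= 1.0)) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /** Folds one raw load sample into the running average and returns it. */
    double update(double sample) {
        if (!seeded) {
            // The first sample seeds the average directly.
            value = sample;
            seeded = true;
        } else {
            value = alpha * sample + (1.0 - alpha) * value;
        }
        return value;
    }
}
```

With a small alpha, a short spike on one node barely moves its smoothed load, so it is less likely to trigger a segment move that would have to be undone a moment later.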

For (2) I'm not sure how this helps? The total query time should still be limited by slower historical nodes. Also, queries that can stream results back already do (timeseries and some groupBy v2, depending on sorting).

Gian

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/CAB5L%3DwdB1b1Px-oNwurP7s%2BaFQn69oV2EFMQkHSRWSasD2dx4Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Roman Leventov

Dec 15, 2016, 9:02:59 PM
to druid-de...@googlegroups.com
On Thu, Dec 15, 2016 at 6:31 PM, Gian Merlino <gi...@imply.io> wrote:
Do you use connectionCount load balancing?
 
What is that? I didn't find it in the Druid source or docs.

For (2) I'm not sure how this helps? The total query time should still be limited by slower historical nodes. Also, queries that can stream results back already do (timeseries and some groupBy v2, depending on sorting).

Thanks, so it seems that this is pretty much already implemented. It may only be worth adding an async=true parameter to QueryResource.doPost().

Gian Merlino

Dec 15, 2016, 9:52:38 PM
to druid-de...@googlegroups.com
I meant druid.broker.balancer.type=connectionCount on the broker (http://druid.io/docs/latest/configuration/broker.html)
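
As a rough illustration only, and assuming the balancer simply prefers the server with the fewest open connections (check the broker docs for the actual behavior), the selection can be pictured like this. ConnectionCountSketch and pick are hypothetical names, not Druid's implementation.

```java
import java.util.Map;

/** Hypothetical sketch; not Druid's implementation. */
final class ConnectionCountSketch {
    /** Returns the server with the fewest open connections, or null if there are none. */
    static String pick(Map<String, Integer> openConnections) {
        String best = null;
        for (Map.Entry<String, Integer> e : openConnections.entrySet()) {
            if (best == null || e.getValue() < openConnections.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Under this assumption, a server whose tracked count is stuck too high would rarely be picked, which is the kind of bias a connection tracking bug could produce.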

Gian