consider doing a 0.12.1 release

Prashant Deva

Apr 8, 2018, 8:45:22 PM
to Druid Development
The current 0.12.0 release has some major issues:

  1. Coordinator loses leadership
    https://github.com/druid-io/druid/issues/5561

  2. Newly introduced Quantiles sketch is broken
    https://github.com/druid-io/druid/issues/5575

  3. Coordinator+overlord web console broken
    https://github.com/druid-io/druid/issues/5559

Issue 1 is especially important. Without a coordinator, Druid stops functioning.
With bug 5561, it is impossible to run Druid for long periods, since the coordinator eventually loses leadership and the whole process needs to be restarted for it to come back.

Why not wait till 0.13.0?

A lot of companies like to update one version at a time and may not want to jump directly to 0.13.0.
Those companies will be in for a bad surprise, since bug 5561 essentially renders the cluster useless in production.
Also, the quantiles sketch being a new feature and already broken does not look good either.

0.12.0 is a 'release', not an RC, which marks it as good for production, but the bugs listed above prevent it from being used as such.
I highly recommend a 0.12.1 release, which would mark the right version to upgrade to from 0.11.0.

Gian Merlino

Apr 9, 2018, 1:35:35 PM
to druid-de...@googlegroups.com, d...@druid.apache.org
I think this conversation is worth having. I have cross posted this to d...@druid.apache.org and will reply there. Since we're trying to migrate the dev list, please cross post any dev messages there, or even only post to that list.

Gian

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/165b491e-3ec2-4744-a228-d1270c9d283a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Prashant Deva

Apr 9, 2018, 3:02:02 PM
to d...@druid.apache.org, Druid Development
May I also add this to the list Gian proposed:

- Load rules should honor partial overlap #5595

Prashant

On Mon, Apr 9, 2018 at 11:57 AM, Gian Merlino <gi...@apache.org> wrote:
My feeling is that #3 and #2 are borderline, but #1 definitely warrants a
new release. Personally I have seen it occur at least a half dozen times,
and I had been thinking about proposing a Druid 0.12.1 release, so I'm glad
you brought it up.

If we do 0.12.1, it would be another non-ASF release (we haven't got the
ASF process set up yet, and are not likely to have it set up in time) so we
should notify the incubator folks about it.

I would also consider including these fixes in 0.12.1:

- DoublesSketchModule: Fix serde for DoublesSketchMergeAggregatorFactory. (#5587)
- ArrayAggregation: Use long to avoid overflow (#5544)
- Respect forceHashAggregation in queryContext (#5533)
- Fix indexTask to respect forceExtendableShardSpecs (#5509)
- Add overlord unsecured paths to coordinator when using combined service (#5579)
- Fix SQLMetadataSegmentManager to allow successive start and stop (#5554)
- Fix supervisor tombstone auth handling (#5504)
- Authorize supervisor history instead of current active supervisors for supervisor history API (#5501)
- Fix round robining in router. (#5500)
- SegmentMetadataQuery: Fix default interval handling. (#5489)
- Log exceptions thrown before persist() for indexing tasks (#5374)
- More memory limiting for HttpPostEmitter (#5300)
- pass configuration from context into JobConf for determining DatasourceInputFormat splits (#5408)
- Lookups: Inherit "injective" from registered lookups, improve docs. (#5316)
- SQL: Throttle metadata refreshes when they fail. (#5328)


Nishant Bangarwa

Apr 9, 2018, 3:06:15 PM
to druid-de...@googlegroups.com, d...@druid.apache.org
+1 on doing a 0.12.1 release. 
Kerberos security also has some issues - https://github.com/druid-io/druid/pull/5596
I propose we also get this in 0.12.1.

Marcin Kuthan

Apr 11, 2018, 3:37:05 AM
to Druid Development
This issue is the most problematic one for our cluster's stability; +1 for a 0.12.1 release.

I also found another scenario in which the coordinator is not able to recover after the connection to ZooKeeper is lost:

WARN org.apache.zookeeper.ClientCnxn Session 0x761da07ba56edff for server x.x.x.x:2181, unexpected error, closing socket conn...
INFO org.apache.curator.framework.state.ConnectionStateManager State change: SUSPENDED
INFO io.druid.server.coordinator.DruidCoordinator I am no longer the leader...
INFO io.druid.curator.discovery.CuratorServiceAnnouncer Unannouncing service[DruidNode{serviceName='druid/coordinator', host='druidcoordinator...
INFO io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher Ignored event type [CONNECTION_SUSPENDED] for nodeType [overlord] watcher.
INFO io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher Ignored event type [CONNECTION_SUSPENDED] for nodeType [broker] watcher.
INFO io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher Ignored event type [CONNECTION_SUSPENDED] for nodeType [historical] watcher.
INFO io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher Ignored event type [CONNECTION_SUSPENDED] for nodeType [peon] watcher.
INFO org.apache.zookeeper.ClientCnxn Opening socket connection to server x.x.x.x:2181. Will not attempt to authenticate using...
INFO org.apache.zookeeper.ClientCnxn Socket connection established to x.x.x.x:2181, initiating session
INFO org.apache.zookeeper.ClientCnxn Session establishment complete on server x.x.x.x:2181, sessionid = 0x761da07ba56edff, ne...
INFO org.apache.curator.framework.state.ConnectionStateManager State change: RECONNECTED
INFO io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher Ignored event type [CONNECTION_RECONNECTED] for nodeType [broker] watcher.
INFO io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher Ignored event type [CONNECTION_RECONNECTED] for nodeType [overlord] watcher.
INFO io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher Ignored event type [CONNECTION_RECONNECTED] for nodeType [historical] watcher.
INFO io.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeTypeWatcher Ignored event type [CONNECTION_RECONNECTED] for nodeType [peon] watcher.
INFO io.druid.server.coordinator.DruidCoordinator I am the leader of the coordinators, all must bow!
INFO io.druid.server.coordinator.DruidCoordinator Starting coordination in [PT30S]
INFO io.druid.curator.discovery.CuratorServiceAnnouncer Announcing service[DruidNode{serviceName='druid/coordinator', host='druidcoordinator.n...
INFO io.druid.metadata.SQLMetadataRuleManager Polled and found rules for 29 datasource(s)
INFO io.druid.server.coordinator.DruidCoordinator Done making indexing service helpers [[io.druid.server.coordinator.helper.DruidCoordinatorSegmentInf...
INFO io.druid.server.lookup.cache.LookupCoordinatorManager Not updating lookups because no data exists
ERROR io.druid.server.coordinator.DruidCoordinator InventoryManagers not started[[false, true]]
INFO io.druid.server.coordinator.DruidCoordinator I am no longer the leader...
INFO io.druid.curator.discovery.CuratorServiceAnnouncer Unannouncing service[DruidNode{serviceName='druid/coordinator', host='druidcoordinator...
ERROR io.druid.server.coordinator.DruidCoordinator Caught exception, ignoring so that schedule keeps going.: {class=io.druid.server.coordinator.DruidCo...
INFO io.druid.server.coordinator.DruidCoordinator I am no longer the leader...
ERROR io.druid.server.coordinator.DruidCoordinator InventoryManagers not started[[false, true]]

And so on every 30 seconds.
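
A cycle like the one in the log above can be spotted mechanically. The following is only a minimal sketch in Python, assuming the coordinator log is available as plain text lines in the format shown. The function name `count_leadership_flaps` and the sample lines are illustrative, not part of Druid; the two message substrings are copied from the DruidCoordinator lines in the log.

```python
from typing import Iterable

# Substrings copied from the DruidCoordinator messages in the log above.
GAINED = "I am the leader of the coordinators"
LOST = "I am no longer the leader"


def count_leadership_flaps(lines: Iterable[str]) -> int:
    """Count gained-then-lost leadership cycles in coordinator log lines."""
    flaps = 0
    is_leader = False
    for line in lines:
        if GAINED in line:
            is_leader = True
        elif LOST in line and is_leader:
            # Only a loss that follows a gain counts as a flap;
            # repeated "no longer the leader" lines are ignored.
            flaps += 1
            is_leader = False
    return flaps


# Illustrative sample, shortened from the log above.
sample = [
    "INFO io.druid.server.coordinator.DruidCoordinator I am no longer the leader...",
    "INFO io.druid.server.coordinator.DruidCoordinator I am the leader of the coordinators, all must bow!",
    "ERROR io.druid.server.coordinator.DruidCoordinator InventoryManagers not started[[false, true]]",
    "INFO io.druid.server.coordinator.DruidCoordinator I am no longer the leader...",
    "INFO io.druid.server.coordinator.DruidCoordinator I am no longer the leader...",
]
print(count_leadership_flaps(sample))  # prints 1
```

This only counts transitions within a single log stream and says nothing about the cause; lines from several coordinators would need to be separated first.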


Gian Merlino

Apr 11, 2018, 12:12:20 PM
to druid-de...@googlegroups.com, d...@druid.apache.org
Hi Marcin,

Do you have a non-cut-off version of this log line? It seems like this log message would have the cause in it.

  ERROR io.druid.server.coordinator.DruidCoordinator Caught exception, ignoring so that schedule keeps going.: {class=io.druid.server.coordinator.DruidCo...

Btw, since we are currently trying to migrate our mailing lists, please also include d...@druid.apache.org in Druid dev threads (I have added it to this one).

Gian

Marcin Kuthan

Apr 11, 2018, 3:47:30 PM
to Druid Development
Hi Gian

Yep, the root cause was an IllegalMonitorStateException, so it should be fixed by https://github.com/druid-io/druid/pull/5554.
The scenario was as follows (the log presented before was a mix from both coordinators, sorry for that):

1. Active coordinator was disconnected from ZK (connection reset or session timeout).
2. Active coordinator was suspended (almost immediately).
3. Second coordinator was elected as master (2 seconds later).
4. On the second coordinator there was an error "InventoryManagers not started[[false, true]]" (30 seconds later; I don't understand this error with my limited Druid knowledge).
5. 20 ms later there was an IllegalMonitorStateException on the second coordinator, and from then on the election process was stopped until I restarted the coordinators manually. There was a 20 ms gap between the errors, but I'm not fully sure which exception was the root cause of the failure.

Tomorrow I'm going to update the cluster using the 0.12.0 branch with all backported bug fixes.

Marcin


Marcin Kuthan

Apr 13, 2018, 3:32:21 AM
to Druid Development
Cluster deployed from 0.12 branch with all fixes up to 10th April. No issues so far.

Indrek Juhkam

Apr 16, 2018, 4:12:30 AM
to Druid Development
Our cluster just failed because of this bug. 0.12.1 would be nice.

Gian Merlino

Apr 17, 2018, 1:40:20 PM
to druid-de...@googlegroups.com
Hi to people following the druid-development version of this thread,

Please follow the thread on d...@druid.apache.org. You can sign up by emailing dev-su...@druid.apache.org. We'll be sunsetting this list soon as we continue migrating to ASF infra.

Marcin: it's good to hear that the fixes helped!
Indrek: I am sure we will be doing a 0.12.1.
