Kill tasks timeout, loop, and choke metadata.

180 views
Skip to first unread message

Avinoam Ben Nachum

unread,
Oct 6, 2021, 4:48:57 AM10/6/21
to Druid User

Hello,
I have been trying to send kill tasks to a dataSource but it gets stuck on the metadata while quering for the unused segments in the given interval.

I have sent kill tasks to all my other dataSources in the same way, and had no issues.
I am assuming its due to the task's query reaching its timeout (which is 15min), and then the log of the task shows a timeout error and launchs another query (exect same one). this leads to queries pileing up till the database gets stuck.

I have tried adding "timeout" to the task's context but nothing changed.

  • kill task:
{
        "type": "kill",
        "dataSource": "bananas",
        "interval" : "2020-08-10T00:00:00.000Z/2020-08-17T00:00:00.000Z",
        "context" : { "timeout" : "0" }
}

  • the query that gets stunk in the metadata:
SELECT payload FROM druid_segments WHERE dataSource = 'bananas' and start >= '2020-08-10T00:00:00.000Z' and start <= '2020-08-17T00:00:00.000Z' and `end` <= '2020-08-17T00:00:00.000Z' and used = false;

  • the task log
2021-10-06T07:55:35,085 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2021-10-06T07:55:35,085 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp/druid_s3
2021-10-06T07:55:35,085 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
2021-10-06T07:55:35,085 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
2021-10-06T07:55:35,085 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
2021-10-06T07:55:35,085 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.version=4.15.0-50-generic
2021-10-06T07:55:35,085 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:user.name=ubuntu
2021-10-06T07:55:35,085 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:user.home=/home/ubuntu
2021-10-06T07:55:35,086 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/usr/local/apache-druid-0.21.1
2021-10-06T07:55:35,087 INFO [main] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper.com sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@42c0a16e
2021-10-06T07:55:35,110 INFO [main] org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema
2021-10-06T07:55:35,124 INFO [main-SendThread(zookeeper.com:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper.com. Will not attempt to authenticate using SASL (unknown error)
2021-10-06T07:55:35,136 INFO [main-SendThread(zookeeper.com:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to zookeeper.com, initiating session
2021-10-06T07:55:35,147 INFO [main-SendThread(zookeeper.com:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server zookeeper.com, sessionid = 0x170a545cf12aebf, negotiated timeout = 30000
2021-10-06T07:55:35,162 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
2021-10-06T07:55:35,217 WARN [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_RECONNECTED] for node watcher of role[coordinator].
2021-10-06T07:55:35,217 WARN [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Ignored event type[CONNECTION_RECONNECTED] for node watcher of role[overlord].
2021-10-06T07:55:35,256 INFO [NodeRoleWatcher[OVERLORD]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://overlord:8083] of role[overlord] detected.
2021-10-06T07:55:35,256 INFO [NodeRoleWatcher[COORDINATOR]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://coordinator:8085] of role[coordinator] detected.
2021-10-06T07:55:35,258 INFO [NodeRoleWatcher[OVERLORD]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node watcher of role[overlord] is now initialized.
2021-10-06T07:55:35,258 INFO [NodeRoleWatcher[COORDINATOR]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node watcher of role[coordinator] is now initialized.
2021-10-06T07:55:35,413 WARN [main] org.apache.druid.query.lookup.LookupReferencesManager - No lookups found for tier [__default], response [org.apache.druid.java.util.http.client.response.StringFullResponseHolder@47fcefb3]
2021-10-06T07:55:35,414 INFO [main] org.apache.druid.query.lookup.LookupReferencesManager - Coordinator is unavailable. Loading saved snapshot instead
2021-10-06T07:55:35,450 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Running with task: {
"type" : "kill",
"id" : "kill_bananas_hkobfecj_2020-08-10T00:00:00.000Z_2020-08-17T00:00:00.000Z_2021-10-06T07:55:28.342Z",
"dataSource" : "bananas",
"interval" : "2020-08-10T00:00:00.000Z/2020-08-17T00:00:00.000Z",
"context" : {
"forceTimeChunkLock" : true,
"timeout" : "0"
},
"groupId" : "kill_bananas_hkobfecj_2020-08-10T00:00:00.000Z_2020-08-17T00:00:00.000Z_2021-10-06T07:55:28.342Z",
"resource" : {
"availabilityGroup" : "kill_bananas_hkobfecj_2020-08-10T00:00:00.000Z_2020-08-17T00:00:00.000Z_2021-10-06T07:55:28.342Z",
"requiredCapacity" : 1
}
}
2021-10-06T07:55:35,451 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Attempting to lock file[var/druid/task/kill_bananas_hkobfecj_2020-08-10T00:00:00.000Z_2020-08-17T00:00:00.000Z_2021-10-06T07:55:28.342Z/lock].
2021-10-06T07:55:35,453 INFO [main] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Acquired lock file[var/druid/task/kill_bananas_hkobfecj_2020-08-10T00:00:00.000Z_2020-08-17T00:00:00.000Z_2021-10-06T07:55:28.342Z/lock] in 2ms.
2021-10-06T07:55:35,571 INFO [main] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Using storage location strategy: [LeastBytesUsedStorageLocationSelectorStrategy]
2021-10-06T07:55:35,575 INFO [task-runner-0-priority-0] org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Running task: kill_bananas_hkobfecj_2020-08-10T00:00:00.000Z_2020-08-17T00:00:00.000Z_2021-10-06T07:55:28.342Z
2021-10-06T07:55:35,582 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [SERVER]
2021-10-06T07:55:35,595 INFO [main] org.eclipse.jetty.server.Server - jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10
2021-10-06T07:55:35,670 INFO [main] org.eclipse.jetty.server.session - DefaultSessionIdManager workerName=node0
2021-10-06T07:55:35,670 INFO [main] org.eclipse.jetty.server.session - No SessionScavenger set, using defaults
2021-10-06T07:55:35,672 INFO [main] org.eclipse.jetty.server.session - node0 Scavenging every 600000ms
2021-10-06T07:55:35,814 INFO [main] com.sun.jersey.server.impl.application.WebApplicationImpl - Initiating Jersey application, version 'Jersey: 1.19.3 10/24/2016 03:43 PM'
2021-10-06T07:55:36,549 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@1be8c122{/,null,AVAILABLE}
2021-10-06T07:55:36,584 INFO [main] org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@6e73974{HTTP/1.1, (http/1.1)}{0.0.0.0:8100}
2021-10-06T07:55:36,585 INFO [main] org.eclipse.jetty.server.Server - Started @8065ms
2021-10-06T07:55:36,586 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [ANNOUNCEMENTS]
2021-10-06T07:55:36,586 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Successfully started lifecycle [module]
2021-10-06T08:10:35,683 WARN [HttpClient-Netty-Worker-22] org.apache.druid.java.util.http.client.pool.ResourcePool - Resource at key[http://overlord:8083] was returned multiple times?
2021-10-06T08:10:35,689 WARN [task-runner-0-priority-0] org.apache.druid.discovery.DruidLeaderClient - Request[http://overlord:8083/druid/indexer/v1/action] failed.
org.jboss.netty.handler.timeout.ReadTimeoutException: null
at org.jboss.netty.handler.timeout.ReadTimeoutHandler.<clinit>(ReadTimeoutHandler.java:84) ~[netty-3.10.6.Final.jar:?]
at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:172) ~[druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.http.client.AbstractHttpClient.go(AbstractHttpClient.java:33) ~[druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:143) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:127) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.query.lookup.LookupReferencesManager.fetchLookupsForTier(LookupReferencesManager.java:576) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.query.lookup.LookupReferencesManager.tryGetLookupListFromCoordinator(LookupReferencesManager.java:429) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.query.lookup.LookupReferencesManager.lambda$getLookupListFromCoordinator$4(LookupReferencesManager.java:407) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:87) ~[druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:115) ~[druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:105) ~[druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.query.lookup.LookupReferencesManager.getLookupListFromCoordinator(LookupReferencesManager.java:397) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.query.lookup.LookupReferencesManager.getLookupsList(LookupReferencesManager.java:374) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.query.lookup.LookupReferencesManager.loadAllLookupsAndInitStateRef(LookupReferencesManager.java:357) ~[druid-server-0.21.1.jar:0.21.1]
at org.apache.druid.query.lookup.LookupReferencesManager.start(LookupReferencesManager.java:157) ~[druid-server-0.21.1.jar:0.21.1]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
at org.apache.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler.start(Lifecycle.java:446) ~[druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.java.util.common.lifecycle.Lifecycle.start(Lifecycle.java:341) ~[druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.guice.LifecycleModule$2.start(LifecycleModule.java:143) ~[druid-core-0.21.1.jar:0.21.1]
at org.apache.druid.cli.GuiceRunnable.initLifecycle(GuiceRunnable.java:115) ~[druid-services-0.21.1.jar:0.21.1]
at org.apache.druid.cli.CliPeon.run(CliPeon.java:298) ~[druid-services-0.21.1.jar:0.21.1]
at org.apache.druid.cli.Main.main(Main.java:113) ~[druid-services-0.21.1.jar:0.21.1]


If anyone has a clue or an idea on how I can overcome this...
Kind regards and many thanks!
Avinoam

Ben Krug

unread,
Oct 6, 2021, 5:58:49 PM10/6/21
to druid...@googlegroups.com
I've seen some cases where optimizing druid_segments can help.  Eg, if you're using MySQL: OPTIMIZE TABLE druid_segments;
(Or first you can try ANALYZE, which is quicker, and if it doesn't help, OPTIMIZE, which is not blocking unless your MySQL is pretty old, but can take a long time.)

Otherwise, maybe you'd have to start killing segments individually to clean it up, then try again.  (Unless someone else knows a better way.)

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/f3392f2a-fd54-4af3-bb3d-109f4acc91can%40googlegroups.com.

Avinoam Ben Nachum

unread,
Oct 19, 2021, 8:43:49 AM10/19/21
to Druid User
Thank you for your response!
I have tried what you suggested but it did not solve the matter.
what solved it was indexing the table:

```
create index druid_segments_index_for_unused_segments on druid_segments (used,dataSource,start)
```

The query that took longer than 30 minutes, now takes lass than 30 seconds.
Maybe this should be a default setting in druid's metadata?..

I hope this sulotion will help others who might stumble upon the same issue in the future.
Thank you again for your assitance!

Avinoam
Reply all
Reply to author
Forward
0 new messages