Tasks not distributing to middlemanager, or possibly failing in middlemanager


Avi Deitcher

Jan 31, 2017, 8:27:36 PM
to Druid User
To be fair, I am unsure if that is the problem.

I have the druid components running in containers on a shared network segment. They all can communicate with each other and with zookeeper using their names (broker resolves to the broker IP, zookeeper to zookeeper, overlord to overlord, etc.).

For all nodes, I have druid.logLevel=debug

I am trying the simplest test possible: the curl from the quickstart guide.
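
For reference, the submission is roughly this (overlord host name from my container setup):

curl -X POST -H 'Content-Type: application/json' \
  -d @quickstart/wikiticker-index.json \
  http://overlord:8090/druid/indexer/v1/task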

When I submit it, I see the following overlord logs:

overlord_1         | 2017-01-31T23:08:25,022 WARN [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.

Here is the output from zookeeper and coordinator:

coordinator_1      | 2017-01-31T23:18:02,306 WARN [DatabaseSegmentManager-Exec--0] io.druid.metadata.SQLMetadataSegmentManager - No segments found in the database!
zookeeper_1        | 2017-01-31 23:18:06,203 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.32.16:51482
zookeeper_1        | 2017-01-31 23:18:06,212 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /192.168.32.16:51482
zookeeper_1        | 2017-01-31 23:18:06,231 [myid:] - INFO  [SyncThread:0:ZooKeeperServer@673] - Established session 0x159f63188ab0017 with negotiated timeout 30000 for client /192.168.32.16:51482
zookeeper_1        | 2017-01-31 23:18:07,917 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x159f63188ab0017
zookeeper_1        | 2017-01-31 23:18:07,930 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /192.168.32.16:51482 which had sessionid 0x159f63188ab0017

192.168.32.16 resolves to the middlemanager.

Nothing logs to stdout from the middlemanager. However, if I watch it, I see a new java process pop up and disappear quickly.

The task does appear in the overlord console under "Running Tasks", eventually moving to "Complete Tasks" with a status of "FAILED". The payload shown is correct, but clicking the "Log" link gives:

No log was found for this task. The task may not exist, or it may not have begun running yet.

I am stumped. How do I figure out which stage is doing what, why it is failing, and where?

Avi Deitcher

Jan 31, 2017, 8:36:56 PM
to Druid User
One more piece of info. I tried creating the task again while logged into the middlemanager, watching /tmp/druid/persistent/task/restore.json
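
Concretely, something like this from inside the middlemanager container:

watch -n1 cat /tmp/druid/persistent/task/restore.json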

Normally, the contents of that file are:

{"runningTasks":[]}


However, for a few seconds after submission (about as long as the second java process lasts), they are:

{"runningTasks":["index_hadoop_wikiticker_2017-01-31T23:34:48.189Z"]}


So it looks like the task is getting submitted to the middlemanager, which is creating a worker for it, but:

  1. Why does it fail?
  2. How do I find out why it failed? (See the sketch just below.)
  3. Why is neither the broker nor the middlemanager reporting these failures to the stdout logs?
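
For reference, the overlord does expose per-task status and log endpoints over HTTP (host and port from my setup; TASK_ID is the id shown in the console), though the log endpoint presumably returns the same "No log was found" message as the console link:

curl http://overlord:8090/druid/indexer/v1/task/TASK_ID/status
curl http://overlord:8090/druid/indexer/v1/task/TASK_ID/log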

Slim Bouguerra

Feb 1, 2017, 1:53:13 PM
to druid...@googlegroups.com
Can you get the task logs?
Where the task logs end up depends on the value of druid.indexer.logs.type,
and the location can be set via druid.indexer.logs.directory.
Here is the doc about those properties.
Once you find the task logs, you will find why it is failing.

-- 

B-Slim
_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______


Avi Deitcher

Feb 1, 2017, 2:34:33 PM
to Druid User
Ah, I don't see that defined anywhere. There is a version for S3 commented out in the basic config; that would explain a lot.

I will set

druid.indexer.logs.type=file
druid.indexer.logs.directory=var/druid/indexing-logs

per the example. Is there a logs.type option for stdout?

Avi Deitcher

Feb 1, 2017, 2:40:32 PM
to Druid User
At least I have data now. :-)

2017-02-01T19:37:23,807 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29
2017-02-01T19:37:25,480 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_wikiticker_2017-02-01T19:37:16.536Z, type=index_hadoop, dataSource=wikiticker}]
com.metamx.common.ISE: Hadoop dependency [/hadoop-dependencies/hadoop-client/2.3.0] didn't exist!?
at io.druid.initialization.Initialization.getHadoopDependencyFilesToLoad(Initialization.java:249) ~[druid-server-0.9.2.jar:0.9.2]
at io.druid.indexing.common.task.HadoopTask.buildClassLoader(HadoopTask.java:147) ~[druid-indexing-service-0.9.2.jar:0.9.2]
at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:166) ~[druid-indexing-service-0.9.2.jar:0.9.2]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.2.jar:0.9.2]
at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.2.jar:0.9.2]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_111-internal]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111-internal]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111-internal]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111-internal]


And in case you were going to ask:

bash-4.3# ls /opt/druid/hadoop-dependencies/hadoop-client/2.3.0/
activation-1.1.jar                           hadoop-annotations-2.3.0.jar                 jackson-mapper-asl-1.8.8.jar
avro-1.7.4.jar                               hadoop-auth-2.3.0.jar                        jaxb-api-2.2.2.jar
commons-beanutils-1.7.0.jar                  hadoop-client-2.3.0.jar                      jersey-core-1.9.jar
commons-beanutils-core-1.8.0.jar             hadoop-common-2.3.0.jar                      jetty-util-6.1.26.jar
commons-cli-1.2.jar                          hadoop-hdfs-2.3.0.jar                        jsr305-1.3.9.jar
commons-codec-1.4.jar                        hadoop-mapreduce-client-app-2.3.0.jar        log4j-1.2.17.jar
commons-collections-3.2.1.jar                hadoop-mapreduce-client-common-2.3.0.jar     paranamer-2.3.jar
commons-compress-1.4.1.jar                   hadoop-mapreduce-client-core-2.3.0.jar       protobuf-java-2.5.0.jar
commons-configuration-1.6.jar                hadoop-mapreduce-client-jobclient-2.3.0.jar  servlet-api-2.5.jar
commons-digester-1.8.jar                     hadoop-mapreduce-client-shuffle-2.3.0.jar    slf4j-api-1.7.5.jar
commons-httpclient-3.1.jar                   hadoop-yarn-api-2.3.0.jar                    slf4j-log4j12-1.7.5.jar
commons-io-2.4.jar                           hadoop-yarn-client-2.3.0.jar                 snappy-java-1.0.4.1.jar
commons-lang-2.6.jar                         hadoop-yarn-common-2.3.0.jar                 stax-api-1.0-2.jar
commons-logging-1.1.3.jar                    hadoop-yarn-server-common-2.3.0.jar          xmlenc-0.52.jar
commons-math3-3.1.1.jar                      httpclient-4.2.5.jar                         xz-1.0.jar
commons-net-3.1.jar                          httpcore-4.2.5.jar                           zookeeper-3.4.5.jar
guava-11.0.2.jar                             jackson-core-asl-1.8.8.jar


Is it looking in the wrong place?

Slim Bouguerra

Feb 1, 2017, 3:47:02 PM
to druid...@googlegroups.com
Can you make sure to set druid.extensions.hadoopDependenciesDir?
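
Something like this in common.runtime.properties; judging from your ls output the jars live under /opt/druid/hadoop-dependencies, so adjust if yours differ:

druid.extensions.hadoopDependenciesDir=/opt/druid/hadoop-dependencies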


-- 

B-Slim
_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______

Avi Deitcher

Feb 1, 2017, 4:16:12 PM
to Druid User
Definitely getting more interesting:


bash-4.3# cat /var/druid/indexing-logs/index_hadoop_wikiticker_2017-02-01T21\:16\:19.621Z.log
2017-02-01T21:16:25,205 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/druid-0.9.2/lib/log4j-slf4j-impl-2.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/druid/hadoop-dependencies/hadoop-client/2.3.0/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-02-01 21:16:27,880 task-runner-0-priority-0 WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JndiLookup to org.apache.logging.log4j.core.lookup.StrLookup
at java.lang.Class.cast(Class.java:3369)
at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)
..
..

2017-02-01 21:16:27,883 task-runner-0-priority-0 WARN JMX runtime input lookup class is not available because this JRE does not support JMX. JMX lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JmxRuntimeInputArgumentsLookup to org.apache.logging.log4j.core.lookup.StrLookup
at java.lang.Class.cast(Class.java:3369)
at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)
..
..

2017-02-01 21:16:27,895 task-runner-0-priority-0 WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JndiLookup to org.apache.logging.log4j.core.lookup.StrLookup
at java.lang.Class.cast(Class.java:3369)
at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)
..
..
2017-02-01 21:16:27,897 task-runner-0-priority-0 WARN JMX runtime input lookup class is not available because this JRE does not support JMX. JMX lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JmxRuntimeInputArgumentsLookup to org.apache.logging.log4j.core.lookup.StrLookup
at java.lang.Class.cast(Class.java:3369)
at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)
..
..
2017-02-01 21:16:27,921 task-runner-0-priority-0 WARN JNDI lookup class is not available because this JRE does not support JNDI. JNDI string lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JndiLookup to org.apache.logging.log4j.core.lookup.StrLookup
at java.lang.Class.cast(Class.java:3369)
at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)
..
..
2017-02-01 21:16:27,922 task-runner-0-priority-0 WARN JMX runtime input lookup class is not available because this JRE does not support JMX. JMX lookups will not be available, continuing configuration. java.lang.ClassCastException: Cannot cast org.apache.logging.log4j.core.lookup.JmxRuntimeInputArgumentsLookup to org.apache.logging.log4j.core.lookup.StrLookup
at java.lang.Class.cast(Class.java:3369)
at org.apache.logging.log4j.util.LoaderUtil.newCheckedInstanceOf(LoaderUtil.java:167)
..
..
2017-02-01T21:16:31,920 WARN [task-runner-0-priority-0] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-02-01T21:16:32,439 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-02-01T21:16:32,444 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-02-01T21:16:32,467 WARN [task-runner-0-priority-0] org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/quickstart/wikiticker-2015-09-12-sampled.json
2017-02-01T21:16:32,471 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_wikiticker_2017-02-01T21:16:19.621Z, type=index_hadoop, dataSource=wikiticker}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.2.jar:0.9.2]
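
The operative error now looks like the input path. For reference, this is the part of the quickstart wikiticker spec that controls it (as shipped; presumably the file has to exist at this path inside the middlemanager container):

"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "static",
    "paths" : "quickstart/wikiticker-2015-09-12-sampled.json"
  }
}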



Slim Bouguerra

Feb 1, 2017, 4:27:52 PM
to druid...@googlegroups.com
Getting there!
Try tip 2 as indicated on the Druid page.

-- 

B-Slim
_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______

Avi Deitcher

Feb 1, 2017, 11:14:35 PM
to Druid User
Will do, thank you.

Of course, all of this was just the basic test from the quickstart page... just to see if the basics work. I still cannot figure out my Kafka stream, which was why I did this in the first place.

It looks like the message runs through Kafka and is picked up by Tranquility, but then gets lost again. The middlemanager does appear to ask something of ZooKeeper; I am unsure what.

Combined logs:


tranquility_1      | 2017-02-01 22:32:51,324 [KafkaConsumer-0] INFO  c.m.t.kafka.writer.WriterController - Creating EventWriter for topic [csmessage] using dataSource [cs-conversation]
tranquility_1      | 2017-02-01 22:32:51,919 [KafkaConsumer-0] INFO  o.a.c.f.imps.CuratorFrameworkImpl - Starting
tranquility_1      | 2017-02-01 22:32:51,937 [KafkaConsumer-0] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@3655f270
tranquility_1      | 2017-02-01 22:32:51,951 [KafkaConsumer-0-SendThread(analyticslauncher_zookeeper_1.analyticslauncher_default:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server analyticslauncher_zookeeper_1.analyticslauncher_default/192.168.32.9:2181. Will not attempt to authenticate using SASL (unknown error)
tranquility_1      | 2017-02-01 22:32:51,955 [KafkaConsumer-0-SendThread(analyticslauncher_zookeeper_1.analyticslauncher_default:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established to analyticslauncher_zookeeper_1.analyticslauncher_default/192.168.32.9:2181, initiating session
zookeeper_1        | 2017-02-01 22:32:51,968 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.32.15:36872
zookeeper_1        | 2017-02-01 22:32:52,000 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /192.168.32.15:36872
tranquility_1      | 2017-02-01 22:32:52,094 [KafkaConsumer-0-SendThread(analyticslauncher_zookeeper_1.analyticslauncher_default:2181)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server analyticslauncher_zookeeper_1.analyticslauncher_default/192.168.32.9:2181, sessionid = 0x159fbca1496000a, negotiated timeout = 40000
zookeeper_1        | 2017-02-01 22:32:52,095 [myid:] - INFO  [SyncThread:0:ZooKeeperServer@673] - Established session 0x159fbca1496000a with negotiated timeout 40000 for client /192.168.32.15:36872
tranquility_1      | 2017-02-01 22:32:52,118 [KafkaConsumer-0-EventThread] INFO  o.a.c.f.state.ConnectionStateManager - State change: CONNECTED
tranquility_1      | 2017-02-01 22:32:52,668 [KafkaConsumer-0] INFO  c.m.t.finagle.FinagleRegistry - Adding resolver for scheme[disco].
coordinator_1      | 2017-02-01T22:32:54,472 WARN [DatabaseSegmentManager-Exec--0] io.druid.metadata.SQLMetadataSegmentManager - No segments found in the database!
tranquility_1      | 2017-02-01 22:32:55,506 [KafkaConsumer-0] INFO  o.h.validator.internal.util.Version - HV000001: Hibernate Validator 5.1.3.Final
tranquility_1      | 2017-02-01 22:32:56,001 [KafkaConsumer-0] INFO  io.druid.guice.JsonConfigurator - Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, directory='extensions', hadoopDependenciesDir='hadoop-dependencies', hadoopContainerDruidClasspath='null', loadList=null}]
tranquility_1      | 2017-02-01 22:32:56,469 [KafkaConsumer-0] INFO  c.metamx.emitter.core.LoggingEmitter - Start: started [true]
tranquility_1      | 2017-02-01 22:32:57,488 [KafkaConsumer-0] INFO  c.m.t.finagle.FinagleRegistry - Adding resolver for scheme[druidTask!druid:overlord].
zookeeper_1        | 2017-02-01 22:32:57,696 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000a type:create cxid:0x2 zxid:0x12a txntype:-1 reqpath:n/a Error Path:/tranquility/beams/druid:overlord/cs-conversation Error:KeeperErrorCode = NoNode for /tranquility/beams/druid:overlord/cs-conversation
zookeeper_1        | 2017-02-01 22:32:57,985 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000a type:create cxid:0xc zxid:0x130 txntype:-1 reqpath:n/a Error Path:/tranquility/beams/druid:overlord/cs-conversation/mutex/locks Error:KeeperErrorCode = NoNode for /tranquility/beams/druid:overlord/cs-conversation/mutex/locks
tranquility_1      | 2017-02-01 22:32:58,003 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] WARN  org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.
zookeeper_1        | 2017-02-01 22:32:58,083 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000a type:create cxid:0x17 zxid:0x134 txntype:-1 reqpath:n/a Error Path:/tranquility/beams/druid:overlord/cs-conversation/mutex/leases Error:KeeperErrorCode = NoNode for /tranquility/beams/druid:overlord/cs-conversation/mutex/leases
tranquility_1      | 2017-02-01 22:32:58,265 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.tranquility.beam.ClusteredBeam - Creating new merged beam for identifier[druid:overlord/cs-conversation] timestamp[2017-02-01T22:00:00.000Z] (target = 1, actual = 0)
tranquility_1      | 2017-02-01 22:32:58,512 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  com.metamx.common.scala.control$ - Creating druid indexing task (service = druid:overlord): {
tranquility_1      |   "type" : "index_realtime",
tranquility_1      |   "id" : "index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0",
tranquility_1      |   "resource" : {
tranquility_1      |     "availabilityGroup" : "cs-conversation-2017-02-01T22:00:00.000Z-0000",
tranquility_1      |     "requiredCapacity" : 1
tranquility_1      |   },
tranquility_1      |   "spec" : {
tranquility_1      |     "dataSchema" : {
tranquility_1      |       "dataSource" : "cs-conversation",
tranquility_1      |       "parser" : {
tranquility_1      |         "type" : "map",
tranquility_1      |         "parseSpec" : {
tranquility_1      |           "format" : "json",
tranquility_1      |           "timestampSpec" : {
tranquility_1      |             "column" : "createTstamp",
tranquility_1      |             "format" : "millis",
tranquility_1      |             "missingValue" : null
tranquility_1      |           },
tranquility_1      |           "dimensionsSpec" : {
tranquility_1      |             "dimensionExclusions" : [ "count", "createTstamp" ],
tranquility_1      |             "spatialDimensions" : [ ]
tranquility_1      |           }
tranquility_1      |         }
tranquility_1      |       },
tranquility_1      |       "metricsSpec" : [ {
tranquility_1      |         "type" : "count",
tranquility_1      |         "name" : "count"
tranquility_1      |       } ],
tranquility_1      |       "granularitySpec" : {
tranquility_1      |         "type" : "uniform",
tranquility_1      |         "segmentGranularity" : "HOUR",
tranquility_1      |         "queryGranularity" : {
tranquility_1      |           "type" : "none"
tranquility_1      |         }
tranquility_1      |       }
tranquility_1      |     },
tranquility_1      |     "ioConfig" : {
tranquility_1      |       "type" : "realtime",
tranquility_1      |       "plumber" : null,
tranquility_1      |       "firehose" : {
tranquility_1      |         "type" : "clipped",
tranquility_1      |         "interval" : "2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z",
tranquility_1      |         "delegate" : {
tranquility_1      |           "type" : "timed",
tranquility_1      |           "shutoffTime" : "2017-02-01T23:15:00.000Z",
tranquility_1      |           "delegate" : {
tranquility_1      |             "type" : "receiver",
tranquility_1      |             "serviceName" : "firehose:druid:overlord:cs-conversation-022-0000-0000",
tranquility_1      |             "bufferSize" : 100000
tranquility_1      |           }
tranquility_1      |         }
tranquility_1      |       }
tranquility_1      |     },
tranquility_1      |     "tuningConfig" : {
tranquility_1      |       "shardSpec" : {
tranquility_1      |         "type" : "linear",
tranquility_1      |         "partitionNum" : 0
tranquility_1      |       },
tranquility_1      |       "rejectionPolicy" : {
tranquility_1      |         "type" : "none"
tranquility_1      |       },
tranquility_1      |       "buildV9Directly" : false,
tranquility_1      |       "maxPendingPersists" : 0,
tranquility_1      |       "intermediatePersistPeriod" : "PT10M",
tranquility_1      |       "windowPeriod" : "PT10M",
tranquility_1      |       "type" : "realtime",
tranquility_1      |       "maxRowsInMemory" : "100000"
tranquility_1      |     }
tranquility_1      |   }
tranquility_1      | }
tranquility_1      | Feb 01, 2017 10:32:58 PM com.twitter.finagle.Init$$anonfun$1 apply$mcV$sp
tranquility_1      | INFO: Finagle version 6.31.0 (rev=50d3bb0eea5ad3ed332111d707184c80fed6a506) built at 20151203-164135
tranquility_1      | 2017-02-01 22:32:59,285 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.c.s.net.finagle.DiscoResolver - Updating instances for service[druid:overlord] to Set(ServiceInstance{name='druid:overlord', id='6048e6c9-e931-4b12-87c4-e21f92bb482b', address='overlord', port=8090, sslPort=null, payload=null, registrationTimeUTC=1485988149060, serviceType=DYNAMIC, uriSpec=null})
tranquility_1      | 2017-02-01 22:32:59,438 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.t.finagle.FinagleRegistry - Created client for service: disco!druid:overlord
overlord_1         | 2017-02-01T22:33:03,712 WARN [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selections strategy set. Using default.
tranquility_1      | 2017-02-01 22:33:03,959 [finagle/netty3-1] INFO  com.metamx.common.scala.control$ - Created druid indexing task with id: index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0 (service = druid:overlord)
tranquility_1      | 2017-02-01 22:33:04,005 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.t.finagle.DruidTaskResolver - Started monitoring task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#1).
tranquility_1      | 2017-02-01 22:33:04,007 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.t.finagle.DruidTaskResolver - Starting poll thread for indexService[com.metamx.tranquility.druid.IndexService@18de8800], pollPeriod[PT20S].
tranquility_1      | 2017-02-01 22:33:04,016 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.t.finagle.FinagleRegistry - Created client for service: druidTask!druid:overlord!index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0
tranquility_1      | 2017-02-01 22:33:04,084 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.tranquility.beam.ClusteredBeam - Created beam: {"interval":"2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z","partition":0,"tasks":[{"id":"index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0","firehoseId":"cs-conversation-022-0000-0000"}],"timestamp":"2017-02-01T22:00:00.000Z"}
tranquility_1      | 2017-02-01 22:33:04,090 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.metamx.tranquility.druid.DruidBeam - Closing Druid beam for datasource[cs-conversation] interval[2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z] (tasks = index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0)
tranquility_1      | 2017-02-01 22:33:04,093 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.t.finagle.FinagleRegistry - Closing client for service: druidTask!druid:overlord!index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0
tranquility_1      | 2017-02-01 22:33:04,110 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.t.finagle.DruidTaskResolver - Stopped monitoring task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#1).
tranquility_1      | 2017-02-01 22:33:04,128 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.tranquility.beam.ClusteredBeam - Writing new beam data to[/tranquility/beams/druid:overlord/cs-conversation/data]: {"latestTime":"2017-02-01T22:00:00.000Z","latestCloseTime":"1970-01-01T00:00:00.000Z","beams":{"2017-02-01T22:00:00.000Z":[{"interval":"2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z","partition":0,"tasks":[{"id":"index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0","firehoseId":"cs-conversation-022-0000-0000"}],"timestamp":"2017-02-01T22:00:00.000Z"}]}}
tranquility_1      | 2017-02-01 22:33:04,204 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.tranquility.beam.ClusteredBeam - Adding beams for identifier[druid:overlord/cs-conversation] timestamp[2017-02-01T22:00:00.000Z]: List(Map(interval -> 2017-02-01T22:00:00.000Z/2017-02-01T23:00:00.000Z, partition -> 0, tasks -> ArraySeq(Map(id -> index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0, firehoseId -> cs-conversation-022-0000-0000)), timestamp -> 2017-02-01T22:00:00.000Z))
tranquility_1      | 2017-02-01 22:33:04,210 [DruidTaskResolver[com.metamx.tranquility.druid.IndexService@18de8800]] INFO  c.m.t.finagle.DruidTaskResolver - Updating location for task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#1) to[Neg].
tranquility_1      | 2017-02-01 22:33:04,265 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.t.finagle.DruidTaskResolver - Started monitoring task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#2).
tranquility_1      | 2017-02-01 22:33:04,285 [DruidTaskResolver[com.metamx.tranquility.druid.IndexService@18de8800]] INFO  c.m.t.finagle.DruidTaskResolver - Updating location for task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#2) to[Neg].
tranquility_1      | 2017-02-01 22:33:04,285 [ClusteredBeam-ZkFuturePool-f606b3f0-a681-4765-ad63-27360b5bf720] INFO  c.m.t.finagle.FinagleRegistry - Created client for service: druidTask!druid:overlord!index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0
tranquility_1      | Feb 01, 2017 10:33:04 PM com.twitter.finagle.loadbalancer.LoadBalancerFactory$StackModule$$anonfun$5 apply
tranquility_1      | INFO: druidTask!druid:overlord!index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0: name resolution is negative (local dtab: Dtab())
zookeeper_1        | 2017-02-01 22:33:14,253 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.32.14:53186
zookeeper_1        | 2017-02-01 22:33:14,259 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /192.168.32.14:53186
zookeeper_1        | 2017-02-01 22:33:14,320 [myid:] - INFO  [SyncThread:0:ZooKeeperServer@673] - Established session 0x159fbca1496000b with negotiated timeout 30000 for client /192.168.32.14:53186
zookeeper_1        | 2017-02-01 22:33:15,025 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000b type:create cxid:0x13 zxid:0x13f txntype:-1 reqpath:n/a Error Path:/druid/discovery/firehose:druid:overlord:cs-conversation-022-0000-0000 Error:KeeperErrorCode = NoNode for /druid/discovery/firehose:druid:overlord:cs-conversation-022-0000-0000
zookeeper_1        | 2017-02-01 22:33:15,087 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x159fbca1496000b type:create cxid:0x19 zxid:0x142 txntype:-1 reqpath:n/a Error Path:/druid/discovery/cs-conversation-022-0000-0000 Error:KeeperErrorCode = NoNode for /druid/discovery/cs-conversation-022-0000-0000
tranquility_1      | 2017-02-01 22:33:25,441 [DruidTaskResolver[com.metamx.tranquility.druid.IndexService@18de8800]] INFO  c.m.t.finagle.DruidTaskResolver - Updating location for task[index_realtime_cs-conversation_2017-02-01T22:00:00.000Z_0_0] (#2) to[Bound(Set(middlemanager/192.168.32.14:8100),Map())].
tranquility_1      | 2017-02-01 22:33:28,760 [KafkaConsumer-CommitThread] INFO  c.m.tranquility.kafka.KafkaConsumer - Flushed {csmessage={receivedCount=1, sentCount=1, droppedCount=0, unparseableCount=0}} pending messages in 29720ms and committed offsets in 13ms.
broker_1           | 370.475: [GC (Allocation Failure) 370.479: [ParNew: 222360K->17825K(230400K), 2.8099877 secs] 222360K->24416K(1022976K), 2.8152290 secs] [Times: user=0.31 sys=0.55, real=2.81 secs]
broker_1           | 373.292: [GC (CMS Initial Mark) [1 CMS-initial-mark: 6590K(792576K)] 25544K(1022976K), 0.0099279 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
broker_1           | 373.302: [CMS-concurrent-mark-start]
broker_1           | 373.961: [CMS-concurrent-mark: 0.623/0.659 secs] [Times: user=0.08 sys=0.11, real=0.66 secs]
broker_1           | 373.963: [CMS-concurrent-preclean-start]
broker_1           | 373.972: [CMS-concurrent-preclean: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
broker_1           | 373.972: [CMS-concurrent-abortable-preclean-start]
broker_1           |  CMS: abort preclean due to time 379.085: [CMS-concurrent-abortable-preclean: 1.874/5.112 secs] [Times: user=1.89 sys=0.04, real=5.11 secs]
broker_1           | 379.085: [GC (CMS Final Remark) [YG occupancy: 24487 K (230400 K)]379.085: [Rescan (parallel) , 0.0144542 secs]379.100: [weak refs processing, 0.0001810 secs]379.100: [class unloading, 0.8384265 secs]379.939: [scrub symbol table, 0.2407246 secs]380.180: [scrub string table, 0.0010639 secs][1 CMS-remark: 6590K(792576K)] 31077K(1022976K), 1.0976879 secs] [Times: user=0.11 sys=0.12, real=1.10 secs]
broker_1           | 380.188: [CMS-concurrent-sweep-start]
broker_1           | 380.215: [CMS-concurrent-sweep: 0.027/0.027 secs] [Times: user=0.01 sys=0.00, real=0.03 secs]
broker_1           | 380.215: [CMS-concurrent-reset-start]
broker_1           | 380.226: [CMS-concurrent-reset: 0.010/0.010 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]

FYI, the 192.168.32.14 that zookeeper reports accepting a connection from is the middlemanager.

I am unsure who talks to zk and when, so it is hard for me to debug.

Also, every node - broker, overlord, middlemanager, coordinator, historical - has the following in /opt/druid/conf/druid/_common/common.runtime.properties

druid.emitter=logging
druid.emitter.logging.logLevel=debug


Avi Deitcher

Feb 2, 2017, 8:32:45 AM
to Druid User
You know what would have made this process far easier? A page documenting the flow. E.g.:

When a task is received:
  1. It is received by the node XX. The log message confirming receipt is at /some/path with a filename of zzzz
  2. That node passes the task to a middlemanager by finding one in zookeeper, looking at path YY for properties ZZ. The communications with zookeeper generate 3 messages in path ././..
  3. The XX communicates directly with the middlemanager over protocol ___ to pass it the message. The XX records the following message at /././
  4. The middlemanager receives the job and indicates successful receipt by recording message .... at /./..
  5. The middlemanager creates an indexer peon by doing AAAA and records a message at /./..
You get the idea. For this entire process I have been flying mostly blind, so I keep having to turn back here. I would love to have a flow of "this is what is expected when streaming a message in, and here is how each stage indicates what it is doing."

Then, I could follow that through and see where it is falling down.... and save you lots of work too! :-)

Avi Deitcher

Feb 6, 2017, 5:16:26 AM
to Druid User
Slowly getting there. It turns out that I had a missing directory, /tmp/druid/persistent/task/. Once that existed, the middlemanager creates the task and it seems to be running.
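
For the record, the fix for that part was as simple as creating the directory inside the middlemanager container:

mkdir -p /tmp/druid/persistent/task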

However, now it is stuck in RUNNING mode; it shows that both on the coordinator and in the logs. All of /var/druid/indexing-logs, /var/druid/segments/, etc. are empty. The logs for the task itself are in /tmp/druid/persistent/task/TASK_ID, and are pretty short:

bash-4.3# cat /tmp/druid/persistent/task/index_realtime_cs-conversation_2017-02-06T10\:00\:00.000Z_0_0/log
2017-02-06T10:08:03,976 WARN [main] org.apache.curator.retry.ExponentialBackoffRetry - maxRetries too large (30). Pinning to 29
2017-02-06T10:08:06,411 WARN [task-runner-0-priority-0] org.apache.curator.utils.ZKPaths - The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

That is it, nothing more. Zookeeper logs for the connection from the middlemanager node show:

zookeeper_1        | 2017-02-06 10:08:05,474 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.32.11:46002
zookeeper_1        | 2017-02-06 10:08:05,482 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@928] - Client attempting to establish new session at /192.168.32.11:46002
zookeeper_1        | 2017-02-06 10:08:05,498 [myid:] - INFO  [SyncThread:0:ZooKeeperServer@673] - Established session 0x15a12b275c6000e with negotiated timeout 30000 for client /192.168.32.11:46002
zookeeper_1        | 2017-02-06 10:08:06,383 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x15a12b275c6000e type:create cxid:0x13 zxid:0x15f txntype:-1 reqpath:n/a Error Path:/druid/discovery/firehose:druid:overlord:cs-conversation-010-0000-0000 Error:KeeperErrorCode = NoNode for /druid/discovery/firehose:druid:overlord:cs-conversation-010-0000-0000
zookeeper_1        | 2017-02-06 10:08:06,435 [myid:] - INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@649] - Got user-level KeeperException when processing sessionid:0x15a12b275c6000e type:create cxid:0x19 zxid:0x162 txntype:-1 reqpath:n/a Error Path:/druid/discovery/cs-conversation-010-0000-0000 Error:KeeperErrorCode = NoNode for /druid/discovery/cs-conversation-010-0000-0000


Looking in zk itself:

[zk: localhost:2181(CONNECTED) 13] ls /druid/discovery/cs-conversation-010-0000-0000
[65361c2d-6aa0-4a85-8534-cf8f17e1654d]
[zk: localhost:2181(CONNECTED) 14] ls /druid/discovery/cs-conversation-010-0000-0000/65361c2d-6aa0-4a85-8534-cf8f17e1654d
[]
[zk: localhost:2181(CONNECTED) 15] get /druid/discovery/cs-conversation-010-0000-0000/65361c2d-6aa0-4a85-8534-cf8f17e1654d
{"name":"cs-conversation-010-0000-0000","id":"65361c2d-6aa0-4a85-8534-cf8f17e1654d","address":"middlemanager","port":8101,"sslPort":null,"payload":null,"registrationTimeUTC":1486375686434,"serviceType":"DYNAMIC","uriSpec":null}
cZxid = 0x164
ctime = Mon Feb 06 10:08:06 GMT 2017
mZxid = 0x164
mtime = Mon Feb 06 10:08:06 GMT 2017
pZxid = 0x164
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x15a12b275c6000e
dataLength = 227
numChildren = 0
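
For completeness, polling the overlord's status endpoint just confirms RUNNING (task id as above; host name from my setup):

curl http://overlord:8090/druid/indexer/v1/task/index_realtime_cs-conversation_2017-02-06T10:00:00.000Z_0_0/status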

And... at a dead end again.

Avi Deitcher

Feb 6, 2017, 5:30:38 AM
to Druid User
And eventually one task does index? It took an extraordinarily long time, though: the duration below is in milliseconds, so roughly 59 minutes. (Although, given the segmentGranularity of HOUR plus the PT10M windowPeriod in the spec above, a realtime task presumably stays alive about that long by design.)

{"task":"index_realtime_cs-conversation_2017-02-06T09:00:00.000Z_0_0","status":{"id":"index_realtime_cs-conversation_2017-02-06T09:00:00.000Z_0_0","status":"SUCCESS","duration":3549070}}


