Batch ingestion with multiple subtasks is failing on random nodes

Satheesh Neela

Sep 22, 2021, 11:45:03 AM
to Druid User
We have a 25-node cluster: one master (which also runs ZooKeeper), one broker, and the rest data nodes. One index_parallel job runs with 10 subtasks, each subtask ingesting 3 files from HDFS. This was working fine, but recently it has started failing on random nodes with the errors below. Any idea where to look in more detail? When we run the same job with a single subtask, it completes fine. (A sketch of how we pull subtask status from the Overlord follows the log excerpt.)


 java.lang.RuntimeException: java.lang.InterruptedException
 at org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskStatus(HttpIndexingServiceClient.java:279) ~[druid-server-0.21.0.jar:0.21.0]
 at org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor.lambda$start$0(TaskMonitor.java:128) ~[druid-indexing-service-0.21.0.jar:0.21.0]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_291]
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_291]
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_291]
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_291]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
 Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_291]
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_291]
 at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) ~[guava-16.0.1.jar:?]
 at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
 at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:143) ~[druid-server-0.21.0.jar:0.21.0]
 at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:127) ~[druid-server-0.21.0.jar:0.21.0]
 at org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskStatus(HttpIndexingServiceClient.java:264) ~[druid-server-0.21.0.jar:0.21.0]
 ... 8 more
 2021-09-20T22:32:11,200 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_pfkageka_2021-09-20T22:31:28.197Z]
 2021-09-20T22:32:11,205 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_ohdacibj_2021-09-20T22:30:02.426Z]
 2021-09-20T22:32:11,212 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_iiengljh_2021-09-20T22:30:02.501Z]
 2021-09-20T22:32:11,224 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_gganjbho_2021-09-20T22:31:29.196Z]
 2021-09-20T22:32:11,237 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor
 2021-09-20T22:32:11,240 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
 "id" : "index_parallel_US_SCORECARD_DEV.DAILY_SALES_DTL_inppkpbh_2021-09-20T22:29:55.597Z",
 "status" : "FAILED",
 "duration" : 129390,
 "errorMsg" : null,
 "location" : {
 "host" : null,
 "port" : -1,
 "tlsPort" : -1
 }
 }
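In case it helps, this is roughly how we pull the status and report for one of the failed subtasks from the Overlord API. This is only a sketch: the Overlord host below is a placeholder for our setup (default port 8090), and the task ID is one of the subtasks from the log above.

# Rough sketch (not exactly what we run): pull the status and report for a
# failed subtask from the Overlord API. Host, port, and task ID are placeholders.
import json
import urllib.request
from urllib.parse import quote

OVERLORD = "http://overlord-host:8090"  # placeholder; adjust to your Overlord
TASK_ID = "single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_pfkageka_2021-09-20T22:31:28.197Z"

def get(path):
    with urllib.request.urlopen(OVERLORD + path) as resp:
        return json.load(resp)

# Status as tracked by the Overlord (statusCode, errorMsg, location).
status = get("/druid/indexer/v1/task/" + quote(TASK_ID, safe="") + "/status")
print(status["status"]["statusCode"], status["status"].get("errorMsg"))

# The task report often carries more detail than the errorMsg alone.
report = get("/druid/indexer/v1/task/" + quote(TASK_ID, safe="") + "/reports")
print(json.dumps(report, indent=2))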

Satish N

Sep 24, 2021, 10:24:47 AM
to Druid User
Any inputs on where to look? We are still seeing the failures.
Here is the tuningConfig portion of the payload. Whenever we set "maxNumConcurrentSubTasks" to 10, the job fails on random nodes with the error above.

    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "appendableIndexSpec": {
        "type": "onheap"
      },
      "maxRowsInMemory": 300000,
      "maxBytesInMemory": 0,
      "maxTotalRows": null,
      "numShards": null,
      "splitHintSpec": null,
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": null
      },
      "indexSpec": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "maxPendingPersists": 0,
      "forceGuaranteedRollup": false,
      "reportParseExceptions": false,
      "pushTimeout": 0,
      "segmentWriteOutMediumFactory": null,
      "maxNumConcurrentSubTasks": 10,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5,
      "maxNumSegmentsToMerge": 100,
      "totalNumMergeTasks": 10,
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0,
      "maxColumnsToMerge": -1,
      "buildV9Directly": true,
      "partitionDimensions": []

Whenever "maxNumConcurrentSubTasks" is 10, the job fails on random nodes: the subtask runs on one host, but the error message refers to a different host, as in the status below (the subtask ran on lxappdrudprds06, while the Java exception refers to lxappdrudprds14). The job runs fine when "maxNumConcurrentSubTasks" is 1. (A rough hostname-resolution check is sketched after the status snippet.)

"id": "single_phase_sub_task_US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS_dbghchje_2021-09-23T12:45:14.571Z",
 "groupId": "index_parallel_US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS_ifpmncde_2021-09-23T12:44:05.186Z",
 "type": "single_phase_sub_task",
 "createdTime": "2021-09-23T12:45:14.574Z",
 "queueInsertionTime": "1970-01-01T00:00:00.000Z",
 "statusCode": "FAILED",
 "status": "FAILED",
 "runnerStatusCode": "WAITING",
 "duration": 44674,
 "location": {
 "host": "lxappdrudprds06",
 "port": 8101,
 "tlsPort": -1
 },
 "dataSource": "US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS",
 "errorMsg": "java.net.UnknownHostException: lxappdrudprds14"
 }
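Since the errorMsg is a java.net.UnknownHostException for a peer data node, one quick thing to verify is that every worker hostname resolves from every node. A rough sketch of such a check is below; the host list is only a guess based on our naming and 25-node count, so adjust it to the real inventory and run it on each node.

# Rough check that peer data-node hostnames resolve from this node.
# The host list is a guess based on our naming; adjust to the real inventory.
import socket

HOSTS = ["lxappdrudprds%02d" % i for i in range(1, 26)]  # assumed 25 nodes

for host in HOSTS:
    try:
        print("%-20s -> %s" % (host, socket.gethostbyname(host)))
    except socket.gaierror as err:
        # Same failure mode as java.net.UnknownHostException in the task log.
        print("%-20s -> FAILED to resolve: %s" % (host, err))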

Didip Kerabat

Sep 30, 2021, 7:29:26 PM
to Druid User
What is your indexer.runner.type? remote or httpRemote?

If remote, I would first check the ZooKeeper logs to see if there are obvious errors. It's possible that your ZK does not have enough resources.
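For a quick read on ZK load, the four-letter "mntr" command reports connection counts, outstanding requests, and latency. A minimal sketch is below; host and port are whatever your ZK node uses, and on ZooKeeper 3.5+ the four-letter commands may need to be whitelisted via 4lw.commands.whitelist.

# Minimal sketch: ask ZooKeeper's "mntr" four-letter command for basic
# load numbers. Adjust host/port; 3.5+ may require 4lw.commands.whitelist=mntr.
import socket

ZK_HOST, ZK_PORT = "localhost", 2181  # placeholder; point at your ZK node

with socket.create_connection((ZK_HOST, ZK_PORT), timeout=5) as sock:
    sock.sendall(b"mntr")
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

stats = dict(line.split("\t", 1)
             for line in b"".join(chunks).decode().splitlines()
             if "\t" in line)
for key in ("zk_num_alive_connections", "zk_outstanding_requests",
            "zk_avg_latency", "zk_max_latency"):
    print(key, stats.get(key))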

skneeli

Oct 1, 2021, 11:13:55 AM
to druid...@googlegroups.com
It's remote.

Satish




Satish N

Oct 1, 2021, 12:37:32 PM
to Druid User
I see some closed-socket-connection messages. I found some references saying they are normal, but others suggest they mean ZooKeeper is heavily loaded. An excerpt is below, and a rough per-minute count of these messages is sketched after it.
2021-10-01T15:56:46,347 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /172.29.45.120:33358
2021-10-01T15:56:46,349 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /172.29.45.120:33358
2021-10-01T15:56:46,350 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServer - Established session 0x100530b55900361 with negotiated timeout 30000 for client /172.29.45.120:33358
2021-10-01T15:57:55,366 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /172.29.44.99:39036
2021-10-01T15:57:55,368 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /172.29.44.99:39036
2021-10-01T15:57:55,369 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServer - Established session 0x100530b55900362 with negotiated timeout 30000 for client /172.29.44.99:39036
2021-10-01T15:58:39,552 INFO [ProcessThread(sid:0 cport:2181):] org.apache.zookeeper.server.PrepRequestProcessor - Processed session termination for sessionid: 0x100530b5590035e
2021-10-01T15:58:39,552 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /172.29.45.127:38768 which had sessionid 0x100530b5590035e
2021-10-01T15:59:01,630 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /172.29.45.127:39330
2021-10-01T15:59:01,632 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /172.29.45.127:39330
2021-10-01T15:59:01,633 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServer - Established session 0x100530b55900363 with negotiated timeout 30000 for client /172.29.45.127:39330
2021-10-01T15:59:41,789 INFO [PurgeTask] org.apache.zookeeper.server.DatadirCleanupManager - Purge task started.
2021-10-01T15:59:41,790 INFO [PurgeTask] org.apache.zookeeper.server.DatadirCleanupManager - Purge task completed.
2021-10-01T16:14:09,525 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /172.29.45.126:40702
2021-10-01T16:14:09,527 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /172.29.45.126:40702
2021-10-01T16:14:09,528 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServer - Established session 0x100530b55900364 with negotiated timeout 30000 for client /172.29.45.126:40702
2021-10-01T16:16:02,044 INFO [ProcessThread(sid:0 cport:2181):] org.apache.zookeeper.server.PrepRequestProcessor - Processed session termination for sessionid: 0x100530b55900363
2021-10-01T16:16:02,045 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /172.29.45.127:39330 which had sessionid 0x100530b55900363
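For what it's worth, here is a rough sketch of counting session establish/close messages per minute from the ZK server log, to see whether sessions are actually churning; the log path is a placeholder for our setup.

# Rough sketch: count ZooKeeper session establish/close messages per minute
# from the server log, to gauge session churn. Log path is a placeholder.
from collections import Counter

opened, closed = Counter(), Counter()
with open("/var/log/zookeeper/zookeeper.log") as log:  # placeholder path
    for line in log:
        minute = line[:16]  # e.g. "2021-10-01T15:56"
        if "Established session" in line:
            opened[minute] += 1
        elif "Closed socket connection" in line:
            closed[minute] += 1

for minute in sorted(set(opened) | set(closed)):
    print(minute, "established:", opened[minute], "closed:", closed[minute])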

