Batch ingestion with multiple subtasks is failing on random nodes

Satheesh Neela

Sep 22, 2021, 11:45:03 AM
to Druid User
We have a 25-node cluster: one master (which also runs ZooKeeper), one broker, and the rest data nodes. One index_parallel job runs with 10 subtasks, each subtask ingesting 3 files from HDFS. This was working fine, but recently it has started failing on random nodes with the errors below. Any idea where to look in more detail? When we run the same job with a single subtask, it completes fine. (A sketch of how we pull subtask status from the Overlord follows the log excerpt.)


 java.lang.RuntimeException: java.lang.InterruptedException
 at org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskStatus(HttpIndexingServiceClient.java:279) ~[druid-server-0.21.0.jar:0.21.0]
 at org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor.lambda$start$0(TaskMonitor.java:128) ~[druid-indexing-service-0.21.0.jar:0.21.0]
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_291]
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_291]
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_291]
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_291]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
 Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_291]
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_291]
 at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) ~[guava-16.0.1.jar:?]
 at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
 at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:143) ~[druid-server-0.21.0.jar:0.21.0]
 at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:127) ~[druid-server-0.21.0.jar:0.21.0]
 at org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskStatus(HttpIndexingServiceClient.java:264) ~[druid-server-0.21.0.jar:0.21.0]
 ... 8 more
 2021-09-20T22:32:11,200 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_pfkageka_2021-09-20T22:31:28.197Z]
 2021-09-20T22:32:11,205 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_ohdacibj_2021-09-20T22:30:02.426Z]
 2021-09-20T22:32:11,212 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_iiengljh_2021-09-20T22:30:02.501Z]
 2021-09-20T22:32:11,224 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to cancel subtask[single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_gganjbho_2021-09-20T22:31:29.196Z]
 2021-09-20T22:32:11,237 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor
 2021-09-20T22:32:11,240 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
 "id" : "index_parallel_US_SCORECARD_DEV.DAILY_SALES_DTL_inppkpbh_2021-09-20T22:29:55.597Z",
 "status" : "FAILED",
 "duration" : 129390,
 "errorMsg" : null,
 "location" : {
 "host" : null,
 "port" : -1,
 "tlsPort" : -1
 }
 }
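In case it helps, this is roughly how we pull the status and report for one of the failed subtasks from the Overlord API. This is only a sketch: the Overlord host below is a placeholder for our setup (default port 8090), and the task ID is one of the subtasks from the log above.

# Rough sketch (not exactly what we run): pull the status and report for a
# failed subtask from the Overlord API. Host, port, and task ID are placeholders.
import json
import urllib.request
from urllib.parse import quote

OVERLORD = "http://overlord-host:8090"  # placeholder; adjust to your Overlord
TASK_ID = "single_phase_sub_task_US_SCORECARD_DEV.DAILY_SALES_DTL_pfkageka_2021-09-20T22:31:28.197Z"

def get(path):
    with urllib.request.urlopen(OVERLORD + path) as resp:
        return json.load(resp)

# Status as tracked by the Overlord (statusCode, errorMsg, location).
status = get("/druid/indexer/v1/task/" + quote(TASK_ID, safe="") + "/status")
print(status["status"]["statusCode"], status["status"].get("errorMsg"))

# The task report often carries more detail than the errorMsg alone.
report = get("/druid/indexer/v1/task/" + quote(TASK_ID, safe="") + "/reports")
print(json.dumps(report, indent=2))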

Satish N

Sep 24, 2021, 10:24:47 AM
to Druid User
Any inputs on where to look? We are still seeing the failures.
Here is the tuningConfig portion of the payload. Whenever we set "maxNumConcurrentSubTasks" to 10, the job fails on random nodes with the error above.

    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000,
      "appendableIndexSpec": {
        "type": "onheap"
      },
      "maxRowsInMemory": 300000,
      "maxBytesInMemory": 0,
      "maxTotalRows": null,
      "numShards": null,
      "splitHintSpec": null,
      "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 5000000,
        "maxTotalRows": null
      },
      "indexSpec": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs",
        "segmentLoader": null
      },
      "maxPendingPersists": 0,
      "forceGuaranteedRollup": false,
      "reportParseExceptions": false,
      "pushTimeout": 0,
      "segmentWriteOutMediumFactory": null,
      "maxNumConcurrentSubTasks": 10,
      "maxRetry": 3,
      "taskStatusCheckPeriodMs": 1000,
      "chatHandlerTimeout": "PT10S",
      "chatHandlerNumRetries": 5,
      "maxNumSegmentsToMerge": 100,
      "totalNumMergeTasks": 10,
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0,
      "maxColumnsToMerge": -1,
      "buildV9Directly": true,
      "partitionDimensions": []

Whenever "maxNumConcurrentSubTasks" is 10, the job fails on random nodes: the subtask runs on one host, but the error message refers to a different host, as in the status below (the subtask ran on lxappdrudprds06, while the Java exception refers to lxappdrudprds14). The job runs fine when "maxNumConcurrentSubTasks" is 1. (A rough hostname-resolution check is sketched after the status snippet.)

"id": "single_phase_sub_task_US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS_dbghchje_2021-09-23T12:45:14.571Z",
 "groupId": "index_parallel_US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS_ifpmncde_2021-09-23T12:44:05.186Z",
 "type": "single_phase_sub_task",
 "createdTime": "2021-09-23T12:45:14.574Z",
 "queueInsertionTime": "1970-01-01T00:00:00.000Z",
 "statusCode": "FAILED",
 "status": "FAILED",
 "runnerStatusCode": "WAITING",
 "duration": 44674,
 "location": {
 "host": "lxappdrudprds06",
 "port": 8101,
 "tlsPort": -1
 },
 "dataSource": "US_SCORECARD_DEV.PRICE_ADJ_DAILY_METRICS",
 "errorMsg": "java.net.UnknownHostException: lxappdrudprds14"
 }
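Since the errorMsg is a java.net.UnknownHostException for a peer data node, one quick thing to verify is that every worker hostname resolves from every node. A rough sketch of such a check is below; the host list is only a guess based on our naming and 25-node count, so adjust it to the real inventory and run it on each node.

# Rough check that peer data-node hostnames resolve from this node.
# The host list is a guess based on our naming; adjust to the real inventory.
import socket

HOSTS = ["lxappdrudprds%02d" % i for i in range(1, 26)]  # assumed 25 nodes

for host in HOSTS:
    try:
        print("%-20s -> %s" % (host, socket.gethostbyname(host)))
    except socket.gaierror as err:
        # Same failure mode as java.net.UnknownHostException in the task log.
        print("%-20s -> FAILED to resolve: %s" % (host, err))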

Didip Kerabat

Sep 30, 2021, 7:29:26 PM
to Druid User
What is your indexer.runner.type? remote or httpRemote?

If remote, I would first check the ZooKeeper logs to see if there are obvious errors. It's possible that your ZK does not have enough resources.
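For a quick read on ZK load, the four-letter "mntr" command reports connection counts, outstanding requests, and latency. A minimal sketch is below; host and port are whatever your ZK node uses, and on ZooKeeper 3.5+ the four-letter commands may need to be whitelisted via 4lw.commands.whitelist.

# Minimal sketch: ask ZooKeeper's "mntr" four-letter command for basic
# load numbers. Adjust host/port; 3.5+ may require 4lw.commands.whitelist=mntr.
import socket

ZK_HOST, ZK_PORT = "localhost", 2181  # placeholder; point at your ZK node

with socket.create_connection((ZK_HOST, ZK_PORT), timeout=5) as sock:
    sock.sendall(b"mntr")
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

stats = dict(line.split("\t", 1)
             for line in b"".join(chunks).decode().splitlines()
             if "\t" in line)
for key in ("zk_num_alive_connections", "zk_outstanding_requests",
            "zk_avg_latency", "zk_max_latency"):
    print(key, stats.get(key))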

skneeli

Oct 1, 2021, 11:13:55 AM
to druid...@googlegroups.com
It's remote.

Satish




Satish N

Oct 1, 2021, 12:37:32 PM
to Druid User
I see some closed-socket-connection messages. I found some references saying they are normal, but others suggest they mean ZooKeeper is heavily loaded. An excerpt is below, and a rough per-minute count of these messages is sketched after it.
2021-10-01T15:56:46,347 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /172.29.45.120:33358
2021-10-01T15:56:46,349 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /172.29.45.120:33358
2021-10-01T15:56:46,350 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServer - Established session 0x100530b55900361 with negotiated timeout 30000 for client /172.29.45.120:33358
2021-10-01T15:57:55,366 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /172.29.44.99:39036
2021-10-01T15:57:55,368 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /172.29.44.99:39036
2021-10-01T15:57:55,369 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServer - Established session 0x100530b55900362 with negotiated timeout 30000 for client /172.29.44.99:39036
2021-10-01T15:58:39,552 INFO [ProcessThread(sid:0 cport:2181):] org.apache.zookeeper.server.PrepRequestProcessor - Processed session termination for sessionid: 0x100530b5590035e
2021-10-01T15:58:39,552 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /172.29.45.127:38768 which had sessionid 0x100530b5590035e
2021-10-01T15:59:01,630 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /172.29.45.127:39330
2021-10-01T15:59:01,632 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /172.29.45.127:39330
2021-10-01T15:59:01,633 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServer - Established session 0x100530b55900363 with negotiated timeout 30000 for client /172.29.45.127:39330
2021-10-01T15:59:41,789 INFO [PurgeTask] org.apache.zookeeper.server.DatadirCleanupManager - Purge task started.
2021-10-01T15:59:41,790 INFO [PurgeTask] org.apache.zookeeper.server.DatadirCleanupManager - Purge task completed.
2021-10-01T16:14:09,525 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /172.29.45.126:40702
2021-10-01T16:14:09,527 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.ZooKeeperServer - Client attempting to establish new session at /172.29.45.126:40702
2021-10-01T16:14:09,528 INFO [SyncThread:0] org.apache.zookeeper.server.ZooKeeperServer - Established session 0x100530b55900364 with negotiated timeout 30000 for client /172.29.45.126:40702
2021-10-01T16:16:02,044 INFO [ProcessThread(sid:0 cport:2181):] org.apache.zookeeper.server.PrepRequestProcessor - Processed session termination for sessionid: 0x100530b55900363
2021-10-01T16:16:02,045 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /172.29.45.127:39330 which had sessionid 0x100530b55900363
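For what it's worth, here is a rough sketch of counting session establish/close messages per minute from the ZK server log, to see whether sessions are actually churning; the log path is a placeholder for our setup.

# Rough sketch: count ZooKeeper session establish/close messages per minute
# from the server log, to gauge session churn. Log path is a placeholder.
from collections import Counter

opened, closed = Counter(), Counter()
with open("/var/log/zookeeper/zookeeper.log") as log:  # placeholder path
    for line in log:
        minute = line[:16]  # e.g. "2021-10-01T15:56"
        if "Established session" in line:
            opened[minute] += 1
        elif "Closed socket connection" in line:
            closed[minute] += 1

for minute in sorted(set(opened) | set(closed)):
    print(minute, "established:", opened[minute], "closed:", closed[minute])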

