Real-time index tasks failing with MappableContainerException

Torche Guillaume

Sep 30, 2015, 7:38:32 PM
to Druid User, jona...@gumgum.com
Hi all!

I am running into a production issue with some of my real-time index tasks. Since 11 am this morning, some of them have been failing and I can't figure out why.

I can see the following exception in some of my task logs:

SEVERE: The exception contained within MappableContainerException could not be mapped to a response, re-throwing to the HTTP container
com.fasterxml.jackson.databind.JsonMappingException: Early EOF (through reference chain: java.util.ArrayList[209])


I am not sure this is the root cause of the problem, but I cannot see anything else that looks like an error in my task logs. Also, on the overlord UI the status of the task is "failed", but I cannot see that status in my task logs. Usually they end with something like:

Task completed with status: {
  "id" : "index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm",
  "status" : "SUCCESS",
  "duration" : 8692022
}

Another interesting thing is that some tasks report a failed status to my overlord even though their task logs say the task completed with a SUCCESS status.

It looks to me like the tasks effectively failed, as I am missing data for some hours today.

Would you have any idea why this is happening? I have attached my task logs if you want to take a look: one with a SUCCESS status in the task logs (but a failed status sent to the overlord), and one with no status at all and a MappableContainerException.

Thanks for your help!


index_task_logs_success_status.txt
index_task_logs_with_no_status.txt

Torche Guillaume

Sep 30, 2015, 10:14:17 PM
to Druid User, jona...@gumgum.com
This part of the logs actually looks really interesting to us:

task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]
java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.
	at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:119) [druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.TaskToolbox.pushSegments(TaskToolbox.java:202) [druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.task.RealtimeIndexTask$TaskActionSegmentPublisher.publishSegment(RealtimeIndexTask.java:351) [druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:452) [druid-server-0.7.0.jar:0.7.0]
	at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40) [druid-common-0.7.0.jar:0.7.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
2015-09-30T14:25:00,796 ERROR [rtb_user_sync_statistics-2015-09-30T12:00:00.000-07:00-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[rtb_user_sync_statistics]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions., interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.
	at io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:119) ~[druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.TaskToolbox.pushSegments(TaskToolbox.java:202) ~[druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.indexing.common.task.RealtimeIndexTask$TaskActionSegmentPublisher.publishSegment(RealtimeIndexTask.java:351) ~[druid-indexing-service-0.7.0.jar:0.7.0]
	at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:452) [druid-server-0.7.0.jar:0.7.0]
	at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40) [druid-common-0.7.0.jar:0.7.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
	at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
2015-09-30T14:25:00,800 INFO [rtb_user_sync_statistics-2015-09-30T12:00:00.000-07:00-persist-n-merge] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2015-09-30T14:25:00.797-07:00","service":"druid/prod/worker","host":"ec2-54-146-229-235.compute-1.amazonaws.com:8083","severity":"component-failure","description":"Failed to persist merged index[rtb_user_sync_statistics]","data":{"class":"io.druid.segment.realtime.plumber.RealtimePlumber","exceptionType":"java.io.IOException","exceptionMessage":"Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.","exceptionStackTrace":"java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions.\n\tat io.druid.indexing.common.actions.RemoteTaskActionClient.submit(RemoteTaskActionClient.java:119)\n\tat io.druid.indexing.common.TaskToolbox.pushSegments(TaskToolbox.java:202)\n\tat io.druid.indexing.common.task.RealtimeIndexTask$TaskActionSegmentPublisher.publishSegment(RealtimeIndexTask.java:351)\n\tat io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:452)\n\tat io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:40)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat java.lang.Thread.run(Thread.java:745)\n","interval":"2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00"}}]

It looks like the task cannot push its segment to S3 even after multiple retries. Could you confirm that's the issue?
Also, how does Druid push segments to S3? Is it an API call?
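
Reading the stack trace again, the 500 seems to come from the HTTP call the task makes to the overlord (RemoteTaskActionClient.submit) while publishing the segment, rather than from the S3 upload itself, so I may be wrong about the S3 push. To rule out the S3 side anyway, here is a quick, untested sketch (assuming boto3 and the bucket/key layout from our segment loadSpec; adjust as needed) to check whether the merged index.zip actually landed in the bucket:

# Untested sketch: verify the merged segment's index.zip is in S3, using the
# bucket/key from the segment loadSpec in our logs. Requires boto3 and AWS
# credentials with read access to the bucket (both assumptions on my side).
import boto3
from botocore.exceptions import ClientError

BUCKET = "gumgum-druid"
KEY = ("prod-realtime/rtb_user_sync_statistics/"
       "2015-09-30T12:00:00.000-07:00_2015-09-30T13:00:00.000-07:00/"
       "2015-09-30T12:00:09.192-07:00/0/index.zip")

s3 = boto3.client("s3")
try:
    meta = s3.head_object(Bucket=BUCKET, Key=KEY)
    print("segment zip exists, size:", meta["ContentLength"])
except ClientError as err:
    print("segment zip missing or unreadable:", err)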

We are also wondering why the worker marks the task as successful even though the segment is not pushed to S3. The overlord also seems to get a SUCCESS status from the worker:

com.metamx.common.ISE: Segments not covered by locks for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T14:25:00,812 INFO [qtp557493895-49] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]: LockReleaseAction{interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
2015-09-30T14:25:00,812 ERROR [qtp557493895-49] io.druid.indexing.overlord.TaskLockbox - Lock release without acquire: {class=io.druid.indexing.overlord.TaskLockbox, task=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
2015-09-30T14:25:00,812 INFO [qtp557493895-49] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2015-09-30T14:25:00.812-07:00","service":"druid/prod/indexer","host":"ec2-54-91-252-223.compute-1.amazonaws.com:8080","severity":"component-failure","description":"Lock release without acquire","data":{"class":"io.druid.indexing.overlord.TaskLockbox","task":"index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm","interval":"2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00"}}]
2015-09-30T14:25:01,299 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] wrote SUCCESS status for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T14:25:01,299 WARN [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T14:25:01,299 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] completed task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] with status[SUCCESS]
2015-09-30T14:25:27,557 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] on worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080]
2015-09-30T14:25:27,559 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] went bye bye.

But the task is marked as failed on the overlord UI. How is that possible?
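
To cross-check what the overlord itself has recorded (rather than what the worker logged), I am thinking of querying its task status endpoint directly. A rough, untested sketch, assuming the /druid/indexer/v1/task/{id}/status endpoint is available on our 0.7.0 overlord:

# Rough, untested sketch: ask the overlord for the status it has stored for
# this task, to compare with the SUCCESS the worker wrote.
# Assumes the /druid/indexer/v1/task/{id}/status endpoint exists on 0.7.0.
import json
from urllib.request import urlopen

OVERLORD = "http://ec2-54-91-252-223.compute-1.amazonaws.com:8080"
TASK_ID = ("index_realtime_rtb_user_sync_statistics_"
           "2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm")

url = "%s/druid/indexer/v1/task/%s/status" % (OVERLORD, TASK_ID)
with urlopen(url) as resp:
    print(json.dumps(json.load(resp), indent=2))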

I have attached the overlord logs related to this task in case it helps!
overlord_logs.txt

Fangjin Yang

Oct 2, 2015, 4:25:20 PM
to Druid User, jona...@gumgum.com
Hi Guillaume,

This log line: 

"2015-09-30T14:25:00,796 java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[ec2-54-91-252-223.compute-1.amazonaws.com:8080] logs for exceptions."

indicates which overlord's logs to check. There should be an exception in there around 2015-09-30T14:25:00,796.
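
Something along these lines (a rough, untested sketch; the log path below is hypothetical, so point it at wherever your overlord actually writes its log) would pull the lines around that timestamp:

# Rough sketch: print overlord log lines whose timestamp falls in a small
# window around the failing publish. The path below is hypothetical; stack
# trace continuation lines without timestamps are skipped by this quick cut.
import re

LOG_PATH = "/var/log/druid/overlord.log"           # hypothetical path
WINDOW = ("2015-09-30T14:24", "2015-09-30T14:27")  # minutes around the 500

ts = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")
with open(LOG_PATH) as f:
    for line in f:
        m = ts.match(line)
        if m and WINDOW[0] <= m.group(1) <= WINDOW[1]:
            print(line, end="")

Whatever exception shows up there should explain the 500 the task saw.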

Torche Guillaume

Oct 2, 2015, 5:14:02 PM
to Druid User, jona...@gumgum.com
Hi Fangjin,

Thanks for your answer. If you look at the overlord logs I posted, it looks like the overlord tries to release the task lock without having acquired it. It's the only error I see on the overlord around 2015-09-30T14:25:00 for this task:

2015-09-30T14:25:00,812 ERROR [qtp557493895-49] io.druid.indexing.overlord.TaskLockbox - Lock release without acquire: {class=io.druid.indexing.overlord.TaskLockbox, task=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
2015-09-30T14:25:00,812 INFO [qtp557493895-49] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"alerts","timestamp":"2015-09-30T14:25:00.812-07:00","service":"druid/prod/indexer","host":"ec2-54-91-252-223.compute-1.amazonaws.com:8080","severity":"component-failure","description":"Lock release without acquire","data":{"class":"io.druid.indexing.overlord.TaskLockbox","task":"index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm","interval":"2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00"}}] 

Then it looks like the worker reports a SUCCESS status for the task, but the overlord doesn't know about this task:

2015-09-30T14:25:01,299 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] wrote SUCCESS status for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T14:25:01,299 WARN [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] announced a status for a task I didn't know about, adding to runningTasks: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T14:25:01,299 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080] completed task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] with status[SUCCESS]
2015-09-30T14:25:27,557 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Cleaning up task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] on worker[ec2-54-146-229-235.compute-1.amazonaws.com:8080]
2015-09-30T14:25:27,559 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] went bye bye.


Looking at the overlord logs, I can see it gets a FAILED status for the task around 2015-09-30T12:05:

2015-09-30T12:00:09,192 INFO [qtp557493895-39] io.druid.indexing.overlord.TaskLockbox - Added task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] to TaskLock[index_realtime_rtb_user_sync_statistics]
2015-09-30T12:00:09,192 INFO [qtp557493895-39] io.druid.indexing.overlord.MetadataTaskStorage - Adding lock on interval[2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00] version[2015-09-30T12:00:09.192-07:00] for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T12:00:09,333 INFO [qtp557493895-46] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]: LockAcquireAction{interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00}
2015-09-30T12:00:09,381 INFO [qtp557493895-46] io.druid.indexing.overlord.TaskLockbox - Task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] already present in TaskLock[index_realtime_rtb_user_sync_statistics]
2015-09-30T12:05:06,581 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - [ec2-54-146-229-235.compute-1.amazonaws.com:8080]: Found [index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] running
2015-09-30T12:05:06,679 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Failing task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]
2015-09-30T12:05:06,680 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Received FAILED status for task: index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T12:05:06,682 INFO [PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Can't shutdown! No worker running task index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm
2015-09-30T12:05:06,683 INFO [PathChildrenCache-0] io.druid.indexing.overlord.MetadataTaskStorage - Updating task index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm to status: TaskStatus{id=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, status=FAILED, duration=-1}
2015-09-30T12:05:06,691 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskLockbox - Removing task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm] from TaskLock[index_realtime_rtb_user_sync_statistics]
2015-09-30T12:05:06,700 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Task done: RealtimeIndexTask{id=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, type=index_realtime, dataSource=rtb_user_sync_statistics}
2015-09-30T12:05:06,700 INFO [PathChildrenCache-0] io.druid.indexing.overlord.TaskQueue - Task FAILED: RealtimeIndexTask{id=index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm, type=index_realtime, dataSource=rtb_user_sync_statistics} (-1 run duration) 
2015-09-30T13:10:00,584 INFO [qtp557493895-63] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_realtime_rtb_user_sync_statistics_2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm]: SegmentInsertAction{segments=[DataSegment{size=2224, shardSpec=LinearShardSpec{partitionNum=0}, metrics=[received, received_valid_id], dimensions=[bidder_id], version='2015-09-30T12:00:09.192-07:00', loadSpec={type=s3_zip, bucket=gumgum-druid, key=prod-realtime/rtb_user_sync_statistics/2015-09-30T12:00:00.000-07:00_2015-09-30T13:00:00.000-07:00/2015-09-30T12:00:09.192-07:00/0/index.zip}, interval=2015-09-30T12:00:00.000-07:00/2015-09-30T13:00:00.000-07:00, dataSource='rtb_user_sync_statistics', binaryVersion='9'}]} 


I guess that's why the overlord UI says this task failed. But I still don't get why the overlord got a FAILED status for the task around 2015-09-30T12:05 even though the task kept running and, in the end, reported a SUCCESS status, which the overlord did not acknowledge because it no longer knew about the task.
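
I am guessing the FAILED row written by MetadataTaskStorage above is what the overlord UI ends up showing. To see exactly what the metadata store has for this task, something like this rough, untested sketch might help (assuming MySQL metadata storage and the default druid_tasks table name; the host and credentials below are placeholders for our setup):

# Rough, untested sketch: look at the row the overlord stored for this task.
# Assumes MySQL metadata storage and the default "druid_tasks" table name;
# host, user and password are placeholders.
import pymysql

TASK_ID = ("index_realtime_rtb_user_sync_statistics_"
           "2015-09-30T12:00:00.000-07:00_0_0_pdbdpbpm")

conn = pymysql.connect(host="metadata-db-host", user="druid",
                       password="...", db="druid")
with conn.cursor() as cur:
    cur.execute("SELECT * FROM druid_tasks WHERE id = %s", (TASK_ID,))
    for row in cur.fetchall():
        print(row)
conn.close()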

What caused the first failure? I have attached my task logs if you want to take a look. I can only see a reconnection to ZooKeeper around 2015-09-30T12:05, but I don't think that caused any issue; plus, it happened slightly after the overlord got the FAILED status for the task.

Do you have any idea what happened here?

Thanks for your help!
index_task_logs_success_status.txt

Fangjin Yang

Oct 6, 2015, 10:59:10 AM
to Druid User, jona...@gumgum.com
Can you post the full overlord logs? There is likely an explicit exception I'm not seeing in the logs you posted. It occurs before the lock release warnings.

Torche Guillaume

Oct 6, 2015, 9:13:08 PM
to Druid User, jona...@gumgum.com
I have attached the full overlord logs. Thanks for looking into this Fangjin!
xab.tar.gz

Fangjin

Oct 11, 2015, 4:13:08 PM
to Druid User, jona...@gumgum.com
These are not the full overlord logs; this is only a snippet of them.


Roman Baumgaertner

Mar 2, 2016, 1:17:05 PM
to Druid User, jona...@gumgum.com
Jona,

Did you ever figure out the cause of

com.fasterxml.jackson.databind.JsonMappingException: Early EOF (through reference chain: java.util.ArrayList[209])?


I am seeing the same issue on my server and trying to understand what could be causing it.

Thx,
--Roman




Fangjin Yang

Mar 9, 2016, 8:01:50 PM
to Druid User, jona...@gumgum.com
Can you post the full log?

Fangjin Yang

Mar 9, 2016, 8:02:03 PM
to Druid User, jona...@gumgum.com
Er, full stack trace.