Kafka indexing fails with com.metamx.common.ISE: Transaction failure publishing segments, aborting

Giri Tata

Oct 4, 2016, 11:16:46 AM

to Druid User

I am using the default Derby database for metadata. All of these issues started when I tried to reload some historical data with the batch indexer for better compaction. Just to make sure, I went in and deleted everything from the following tables:

DRUID_PENDINGSEGMENTS
DRUID_TASKS
DRUID_TASKLOGS
DRUID_SUPERVISORS

and resubmitted the supervisor. Still no luck!

Relevant logs (Overlord):

2016-10-04T14:47:34,392 INFO [qtp757779849-196] io.druid.metadata.IndexerSQLMetadataStorageCoordinator - Not updating metadata, existing state is not the expected start state.
2016-10-04T14:47:34,393 INFO [qtp757779849-196] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2016-10-04T14:47:34.393Z","service":"druid/overlord","host":"gb-slo-svb-0187.dunnhumby.co.uk:8090","metric":"segment/txn/failure","value":1,"dataSource":"tuk_real","taskType":"index_kafka"}]
2016-10-04T14:47:34,413 INFO [qtp757779849-197] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_kafka_tuk_real_f67568d067497e8_ocooldcn]: SegmentListUsedAction{dataSource='tuk_real', intervals=[2016-10-03T00:00:00.000Z/2016-10-06T00:00:00.000Z]}
2016-10-04T14:47:36,376 INFO [Curator-PathChildrenCache-0] io.druid.indexing.overlord.RemoteTaskRunner - Worker[gb-slo-svb-0187.dunnhumby.co.uk:8091] wrote FAILED status for task [index_kafka_tuk_real_f67568d067497e8_ocooldcn] on [TaskLocation{host='gb-slo-svb-0187.dunnhumby.co.uk', port=8100}]

2016-10-04T14:47:34,474 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[KafkaIndexTask{id=index_kafka_tuk_real_f67568d067497e8_ocooldcn, type=index_kafka, dataSource=tuk_real}]
com.metamx.common.ISE: Transaction failure publishing segments, aborting
	at io.druid.indexing.kafka.KafkaIndexTask.run(KafkaIndexTask.java:506) ~[?:?]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_65]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
2016-10-04T14:47:34,480 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_kafka_tuk_real_f67568d067497e8_ocooldcn] status changed to [FAILED].
2016-10-04T14:47:34,483 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_kafka_tuk_real_f67568d067497e8_ocooldcn",
  "status" : "FAILED",
  "duration" : 3885581
}

David Lim

Oct 4, 2016, 1:06:56 PM

to Druid User

Hey Giri,

Try removing the druid_dataSource table as well and see if that helps.
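
A note on why stale rows could matter here. The Overlord message "Not updating metadata, existing state is not the expected start state" suggests that a publish only goes through when the offsets already stored for the datasource match the offsets the task started from. The Python sketch below is a toy model of that kind of check, not Druid's code: `ToyMetadataStore`, its fields, and the rule that an empty store accepts any start state are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Offsets = Dict[int, int]  # Kafka partition -> offset


@dataclass
class ToyMetadataStore:
    """Hypothetical stand-in for the stored commit state of one datasource."""
    committed: Optional[Offsets] = None
    segments: List[str] = field(default_factory=list)

    def publish(self, segment: str, expected_start: Offsets, end: Offsets) -> bool:
        # Assumption: an empty store accepts any start state; otherwise the
        # stored offsets must equal the offsets the task started reading from.
        if self.committed is not None and self.committed != expected_start:
            return False  # existing state is not the expected start state
        # Segment and new offsets are recorded together, or not at all.
        self.segments.append(segment)
        self.committed = dict(end)
        return True


store = ToyMetadataStore(committed={0: 500})         # left over from an earlier run
ok_stale = store.publish("seg-a", {0: 0}, {0: 100})  # task started from offset 0
store.committed = None                               # stale state cleared
ok_clean = store.publish("seg-a", {0: 0}, {0: 100})
print(ok_stale, ok_clean)  # False True
```

In this model a leftover entry from an earlier run rejects the publish, and the same publish succeeds once that entry is gone, which would be consistent with the suggestion to clear druid_dataSource as well.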

Qiyun Liu

Dec 1, 2016, 5:39:38 AM

to Druid User

Hi David,

We encountered the same exception, and it seems to have resulted in data loss.

<1>
Issue https://github.com/druid-io/druid/issues/3600 says this might be a race condition in Druid. Is there any plan to fix it? Until a code fix is available, is there a workaround?

<2>
By the way, my understanding is that if we set the replicas of a task group to 2, two identical tasks run in sync on different MiddleManagers. They consume the same Kafka data from the same offsets at the same time and generate the same segment, but once one task finishes publishing the segment, the other abandons its copy. Conversely, if one task fails to publish the segment because of the issue above, the other still completes the publishing. Is my understanding of replicas right?
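
The model described in <2> can be written as a small simulation. This is only a sketch of the behaviour as the poster understands it, in Python, and nothing in this thread confirms it. `run_task_group`, the segment id, and the strictly sequential ordering of replicas are hypothetical simplifications.

```python
from typing import Dict, List, Set


def run_task_group(replica_publish_ok: List[bool]) -> Dict[str, object]:
    """Toy model: every replica reads the same offsets and builds the same segment.

    replica_publish_ok[i] says whether replica i's own publish attempt would
    succeed if it were the first one to try.
    """
    segment_id = "tuk_real_seg_0"  # hypothetical id, identical across replicas
    published: Set[str] = set()
    outcomes: List[str] = []
    for ok in replica_publish_ok:
        if segment_id in published:
            outcomes.append("abandoned")  # another replica already published it
        elif ok:
            published.add(segment_id)
            outcomes.append("published")
        else:
            outcomes.append("failed")
    return {"outcomes": outcomes, "segment_available": segment_id in published}


both_ok = run_task_group([True, True])
first_fails = run_task_group([False, True])
both_fail = run_task_group([False, False])
print(both_ok["outcomes"], first_fails["outcomes"], both_fail["outcomes"])
```

Under this model the segment is lost only when every replica fails to publish, which is the case the question about data loss is really asking about.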

Thanks in advance!

On Wednesday, October 5, 2016 at 1:06:56 AM UTC+8, David Lim wrote:

Gian Merlino

Dec 1, 2016, 10:30:08 AM

to druid...@googlegroups.com

Hey Qiyun,

I raised a PR to fix #3600. You could try patching your local copy to see if that helps: https://github.com/druid-io/druid/pull/3728

Gian

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+unsubscribe@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/cc71f0f5-0100-4cae-a19f-3d5f901733dc%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

David Lim

Dec 1, 2016, 12:01:29 PM

to Druid User

Hey Qiyun,

Just to make sure you don't miss it, Gian responded to your questions in the linked PR.
