Failed realtime task finished with SUCCESS state


zdenek tison

Mar 2, 2015, 5:08:03 AM
to druid-de...@googlegroups.com
Hi,

We are observing odd behaviour with realtime tasks: segments are missing, yet the tasks finished with status SUCCESS.
In the logs we found a "No space left on device" exception, but the final status is still SUCCESS. Is this expected behaviour?

Interesting parts from log file:

2015-02-28 02:51:45,747 INFO [ssp-auction-2015-02-27T00:00:00.000Z-persist-n-merge] io.druid.segment.IndexMerger - outDir[/data/druid/baseTaskDir/index_realtime_ssp-auction_2015-02-27T00:00:00.000Z_1_0_dmpongod/work/persist/ssp-auction/2015-02-27T00:00:00.000Z_2015-02-28T00:00:00.000Z/merged/v8-tmp] walked 500,000/29,000,000 rows in 16,455 millis.
2015-02-28 02:51:53,908 ERROR [ssp-auction-2015-02-27T00:00:00.000Z-persist-n-merge] io.druid.segment.realtime.plumber.RealtimePlumber - Failed to persist merged index[ssp-auction]: {class=io.druid.segment.realtime.plumber.RealtimePlumber, exceptionType=class java.io.IOException, exceptionMessage=No space left on device, interval=2015-02-27T00:00:00.000Z/2015-02-28T00:00:00.000Z}
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:315)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at com.google.common.io.CountingOutputStream.write(CountingOutputStream.java:53)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at io.druid.segment.data.VSizeIndexedWriter.write(VSizeIndexedWriter.java:77)
        at io.druid.segment.IndexMerger.makeIndexFiles(IndexMerger.java:652)
        at io.druid.segment.IndexMerger.merge(IndexMerger.java:307)
        at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:169)
        at io.druid.segment.IndexMerger.mergeQueryableIndex(IndexMerger.java:162)
        at io.druid.segment.realtime.plumber.RealtimePlumber$4.doRun(RealtimePlumber.java:348)
        at io.druid.common.guava.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

......

2015-02-28 02:51:59,505 INFO [task-runner-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_realtime_ssp-auction_2015-02-27T00:00:00.000Z_1_0_dmpongod",
  "status" : "SUCCESS",
  "duration" : 53328147
}


A follow-up question: which disk ran out of space? Is it the one from the log file (outDir[/data/druid/baseTask......)?
We ask because we are not sure that disk could actually be full:

Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       7.8G  3.8G  3.6G  52% /
tmpfs           3.6G     0  3.6G   0% /dev/shm
/dev/vdb1       727G   56G  671G   8% /data
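One way to answer this programmatically is to ask the JVM which filesystem a given directory resides on and how much space it has left. A minimal sketch (the class name and the default path argument are ours, not part of Druid; pass the task's actual outDir on the command line):

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskCheck {
    public static void main(String[] args) throws IOException {
        // Hypothetical default; substitute e.g. /data/druid/baseTaskDir.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/");
        // Resolves the mount point / filesystem backing this path.
        FileStore store = Files.getFileStore(dir);
        System.out.printf("%s is on [%s], usable: %,d bytes of %,d%n",
                dir, store, store.getUsableSpace(), store.getTotalSpace());
    }
}
```

Running it against the directory from the log line tells you which of the mounts in the `df` output above is actually being written to, without guessing from the path prefix.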

Thanks

index_realtime_ssp-auction_2015-02-27T00:00:00.000Z_1_0_dmpongod.zip

zdenek tison

Mar 2, 2015, 5:26:36 AM
to druid-de...@googlegroups.com
To be precise, we have also set:

druid.indexer.task.baseTaskDir=/data/druid/baseTaskDir
druid.indexer.task.baseDir=/data/druid/baseDir
druid.indexer.task.hadoopWorkingPath=/data/druid/hadoopWorkingPath
druid.fork.property.druid.indexer.task.baseTaskDir=/data/druid/baseTaskDir
druid.fork.property.druid.indexer.task.baseDir=/data/druid/baseDir
druid.fork.property.druid.indexer.task.hadoopWorkingPath=/data/druid/hadoopWorkingPath 
 

Nishant Bangarwa

Mar 2, 2015, 11:13:50 AM
to druid-de...@googlegroups.com
Hi Zdenek,
It looks like an issue with the way exceptions are handled while shutting down the plumber.
I feel the correct behaviour when an IOException occurs while persisting the segment should, in all cases, be to wait and retry until the handoff succeeds, to prevent any data loss.
Can you create a GitHub issue for this?
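The retry behaviour suggested here could look roughly like the sketch below. This is not Druid's actual plumber API; `PersistAction`, `persistWithRetry`, and the backoff policy are hypothetical names standing in for the real persist-and-merge step:

```java
import java.io.IOException;

public class PersistRetry {
    // Hypothetical stand-in for the plumber's persist-and-merge step.
    interface PersistAction { void run() throws IOException; }

    // Retry on IOException instead of swallowing it; rethrow after
    // maxAttempts so the task can finish FAILED rather than a false SUCCESS.
    static void persistWithRetry(PersistAction action, int maxAttempts)
            throws IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                action.run();
                return; // persisted successfully
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // exhausted retries: propagate the failure
                }
                Thread.sleep(1000L * attempt); // simple linear backoff
            }
        }
    }
}
```

The key point is that the exception must propagate out of the persist path when retries are exhausted, so the task status reflects the failure instead of reporting SUCCESS with missing segments.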

The location that is full is /data/druid/baseTaskDir.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/cec3c9cb-33f2-4e06-b91a-fee505887d3e%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.




zdenek tison

Mar 3, 2015, 8:53:02 AM
to druid-de...@googlegroups.com