Hi guys,
Here's a stubborn problem; I hope you'll be able to help me!
Context:
- Druid version: 0.9.2
- Deep storage: Google Cloud Storage (using gcs-connector-latest-hadoop2.jar)
- YARN cluster: Google Dataproc with 2 permanent workers and 50 preemptible workers
- Batch data ingestion: Hadoop/YARN
99% of the time everything works like a charm, no problem. But once in a while a task fails with a 'FileNotFoundException'. Here's part of a MiddleManager log for this kind of failed task:
--
2017-11-09T08:32:19,439 INFO [task-runner-0-priority-0] io.druid.indexer.hadoop.DatasourceInputFormat - Number of splits [119]
2017-11-09T08:32:20,094 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - number of splits:139
2017-11-09T08:32:20,416 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1510072522962_0990
2017-11-09T08:32:20,636 INFO [task-runner-0-priority-0] org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2017-11-09T08:32:20,950 INFO [task-runner-0-priority-0] org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1510072522962_0990
2017-11-09T08:32:20,986 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - The url to track the job: http://druid-v2-m:8088/proxy/application_1510072522962_0990/
2017-11-09T08:32:20,986 INFO [task-runner-0-priority-0] io.druid.indexer.DetermineHashedPartitionsJob - Job track_cart_v1-determine_partitions_hashed-Optional.of(["2017-11-08T19:00:00/2017-11-08T21:00:00"]) submitted, status available at: http://druid-v2-m:8088/proxy/application_1510072522962_0990/
2017-11-09T08:32:20,988 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Running job: job_1510072522962_0990
2017-11-09T08:32:34,538 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Job job_1510072522962_0990 running in uber mode : false
2017-11-09T08:32:34,541 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 0% reduce 0%
2017-11-09T08:32:52,860 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 1% reduce 0%
2017-11-09T08:32:55,884 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 2% reduce 0%
2017-11-09T08:32:57,893 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 4% reduce 0%
2017-11-09T08:32:58,899 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 6% reduce 0%
2017-11-09T08:32:59,903 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 7% reduce 0%
2017-11-09T08:33:00,912 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 9% reduce 0%
2017-11-09T08:33:03,927 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 10% reduce 0%
2017-11-09T08:33:04,934 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 11% reduce 0%
2017-11-09T08:33:05,939 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - map 13% reduce 0%
2017-11-09T08:33:05,942 INFO [task-runner-0-priority-0] org.apache.hadoop.mapreduce.Job - Task Id : attempt_1510072522962_0990_m_000016_0, Status : FAILED
Error: java.lang.RuntimeException: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000038/tmp/1510216383786-0/index.drd (No such file or directory)
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at io.druid.indexer.hadoop.DatasourceRecordReader$1.apply(DatasourceRecordReader.java:103)
at io.druid.indexer.hadoop.DatasourceRecordReader$1.apply(DatasourceRecordReader.java:76)
at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:582)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.metamx.common.guava.BaseSequence.makeYielder(BaseSequence.java:105)
at com.metamx.common.guava.BaseSequence.toYielder(BaseSequence.java:82)
at com.metamx.common.guava.ConcatSequence.toYielder(ConcatSequence.java:58)
at io.druid.segment.realtime.firehose.IngestSegmentFirehose.<init>(IngestSegmentFirehose.java:171)
at io.druid.indexer.hadoop.DatasourceRecordReader.initialize(DatasourceRecordReader.java:114)
at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
Caused by: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000038/tmp/1510216383786-0/index.drd (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at io.druid.segment.SegmentUtils.getVersionFromDir(SegmentUtils.java:43)
at io.druid.segment.IndexIO.loadIndex(IndexIO.java:217)
at io.druid.indexer.hadoop.DatasourceRecordReader$1.apply(DatasourceRecordReader.java:93)
... 18 more
If I look at the corresponding aggregated YARN log, here's the failed task's error:
---
2017-11-09T08:33:28,939 INFO [main] io.druid.indexer.hadoop.DatasourceRecordReader - Getting storage path for segment [track_cart_v1_2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z_2017-11-09T02:31:08.417Z]
2017-11-09T08:33:28,942 INFO [main] io.druid.indexer.hadoop.DatasourceRecordReader - Fetch segment files from [gs://melissa-druid-v2/indexedData/track_cart_v1/2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z/2017-11-09T02:31:08.417Z/0/index.zip]
2017-11-09T08:33:28,963 INFO [main] io.druid.indexer.hadoop.DatasourceRecordReader - Locally storing fetched segment at [/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/tmp/1510216408962-0]
2017-11-09T08:33:29,060 INFO [main] com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase - GHFS version: 1.6.0-hadoop2
2017-11-09T08:33:31,680 INFO [main] io.druid.indexer.hadoop.DatasourceRecordReader - finished fetching segment files
Heap
PSYoungGen total 124416K, used 53193K [0x00000000e6600000, 0x00000000f1480000, 0x0000000100000000)
eden space 116224K, 38% used [0x00000000e6600000,0x00000000e91f2518,0x00000000ed780000)
from space 8192K, 99% used [0x00000000ee480000,0x00000000eec7ff40,0x00000000eec80000)
to space 13312K, 0% used [0x00000000ed780000,0x00000000ed780000,0x00000000ee480000)
ParOldGen total 202240K, used 124923K [0x00000000b3200000, 0x00000000bf780000, 0x00000000e6600000)
object space 202240K, 61% used [0x00000000b3200000,0x00000000babfeca0,0x00000000bf780000)
Metaspace used 50038K, capacity 50404K, committed 50992K, reserved 1093632K
class space used 5992K, capacity 6105K, committed 6192K, reserved 1048576K
2017-11-09 08:33:12,263 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2017-11-09 08:33:12,628 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2017-11-09 08:33:12,628 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2017-11-09 08:33:12,662 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2017-11-09 08:33:12,662 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1510072522962_0990, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@561b6512)
2017-11-09 08:33:12,932 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2017-11-09 08:33:13,446 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990
2017-11-09 08:33:13,578 INFO [main] com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
---
2017-11-09 08:33:14,970 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2017-11-09 08:33:14,991 INFO [main] org.apache.hadoop.mapreduce.v2.util.MRApps: Creating job classloader
2017-11-09 08:33:15,013 INFO [main] org.apache.hadoop.util.ApplicationClassLoader: classpath: [file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-mapreduce-client-jobclient-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-mapreduce-client-core-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/guava-16.0.1.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/commons-beanutils-1.7.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/druid-hdfs-storage-0.9.2.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-yarn-server-common-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/avro-1.7.4.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/servlet-api-2.5.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/druid-services-0.9.2-selfcontained.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/commons-collections-3.2.1.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/mysql-metadata-storage-0.9.2.jar, 
file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/snappy-java-1.0.4.1.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/commons-net-3.1.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/commons-math3-3.6.1.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/paranamer-2.3.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/commons-digester-1.8.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/stax-api-1.0-2.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-client-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-yarn-common-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/commons-beanutils-core-1.8.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-yarn-api-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-mapreduce-client-app-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/mysql-connector-java-5.1.38.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/xz-1.0.jar, 
file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-auth-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/jaxb-api-2.2.2.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-yarn-client-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/commons-compress-1.4.1.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-mapreduce-client-common-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/xmlenc-0.52.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/gcs-connector-latest-hadoop2.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/commons-configuration-1.6.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-hdfs-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-common-2.3.0.jar, file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/hadoop-mapreduce-client-shuffle-2.3.0.jar]
2017-11-09 08:33:15,014 INFO [main] org.apache.hadoop.util.ApplicationClassLoader: system classes: [-javax.validation., java., javax., org.apache.commons.logging., org.apache.log4j., org.apache.hadoop.]
2017-11-09 08:33:15,014 INFO [main] org.apache.hadoop.mapreduce.v2.util.MRApps: Setting classloader org.apache.hadoop.util.ApplicationClassLoader@68b32e3e on the configuration and as the thread context classloader
2017-11-09 08:33:16,887 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 2
2017-11-09 08:33:16,888 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2017-11-09 08:33:17,036 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2017-11-09 08:33:20,048 INFO [main] org.hibernate.validator.internal.util.Version: HV000001: Hibernate Validator 5.1.3.Final
2017-11-09 08:33:26,372 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: io.druid.indexer.hadoop.DatasourceInputSplit@270a620
2017-11-09 08:33:26,616 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2017-11-09 08:33:26,616 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 100
2017-11-09 08:33:26,616 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 83886080
2017-11-09 08:33:26,616 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 104857600
2017-11-09 08:33:26,616 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396; length = 6553600
2017-11-09 08:33:26,632 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2017-11-09 08:33:31,718 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2017-11-09 08:33:31,843 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/tmp/1510216408962-0/index.drd (No such file or directory)
--
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at io.druid.indexer.hadoop.DatasourceRecordReader$1.apply(DatasourceRecordReader.java:103)
at io.druid.indexer.hadoop.DatasourceRecordReader$1.apply(DatasourceRecordReader.java:76)
at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:582)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.metamx.common.guava.BaseSequence.makeYielder(BaseSequence.java:105)
at com.metamx.common.guava.BaseSequence.toYielder(BaseSequence.java:82)
at com.metamx.common.guava.ConcatSequence.toYielder(ConcatSequence.java:58)
at io.druid.segment.realtime.firehose.IngestSegmentFirehose.<init>(IngestSegmentFirehose.java:171)
at io.druid.indexer.hadoop.DatasourceRecordReader.initialize(DatasourceRecordReader.java:114)
at org.apache.hadoop.mapreduce.lib.input.DelegatingRecordReader.initialize(DelegatingRecordReader.java:84)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
So the problem comes from loading the segment located at gs://melissa-druid-v2/indexedData/track_cart_v1/2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z/2017-11-09T02:31:08.417Z/0/index.zip
If I unzip this index.zip, I can see that the 'version.bin' file is missing:
ls indexedData%2Ftrack_cart_v1%2F2017-11-08T19-00-00.000Z_2017-11-08T20-00-00.000Z%2F2017-11-09T02-31-08.417Z%2F0%2Findex
00000.smoosh meta.smoosh
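As a quick way to check other index.zip files for the same problem, here is a small sketch (assuming the zip has first been copied locally, e.g. with gsutil cp; the required-entry list is my assumption of what a healthy segment of ours contains):

```python
import zipfile

# Entries a healthy Druid index.zip contains, based on our working segments
# (my assumption): the smoosh metadata file, the binary version marker, and
# at least one data smoosh file such as 00000.smoosh.
REQUIRED = ("meta.smoosh", "version.bin")

def missing_entries(zip_path):
    """Return the required entries absent from a locally downloaded index.zip."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
    missing = [name for name in REQUIRED if name not in names]
    # There must also be at least one data smoosh file besides meta.smoosh.
    if not any(n.endswith(".smoosh") and n != "meta.smoosh" for n in names):
        missing.append("<data .smoosh file>")
    return missing
```

For the bad segment above, this would report `['version.bin']`.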
Then, if I look at the task launched at 2017-11-09T02:31:08.417Z that generated this segment version, I can't see any error in its logs (MiddleManager, Overlord, or YARN).
Moreover, here's the part of the YARN log where the index.zip is created:
--
2017-11-09T02:43:01,127 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[dim_timetoaction_hour.drd]
2017-11-09T02:43:01,137 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Dimension[timetoaction_hour] is single value, converting...
2017-11-09T02:43:01,210 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[dim_timetoaction_month.drd]
2017-11-09T02:43:01,216 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Dimension[timetoaction_month] is single value, converting...
2017-11-09T02:43:01,322 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[dim_timetoaction_week.drd]
2017-11-09T02:43:01,339 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Dimension[timetoaction_week] is single value, converting...
2017-11-09T02:43:01,461 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[dim_version.drd]
2017-11-09T02:43:01,476 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Dimension[version] is single value, converting...
2017-11-09T02:43:01,476 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Dimension[version] has null rows.
2017-11-09T02:43:01,612 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[index.drd]
2017-11-09T02:43:01,612 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[inverted.drd]
2017-11-09T02:43:01,612 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[met_agg_count_LITTLE_ENDIAN.drd]
2017-11-09T02:43:01,625 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[met_agg_distinct_idfv_LITTLE_ENDIAN.drd]
2017-11-09T02:43:01,634 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[met_agg_quantity_LITTLE_ENDIAN.drd]
2017-11-09T02:43:01,637 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[met_agg_total_price_LITTLE_ENDIAN.drd]
2017-11-09T02:43:01,638 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[metadata.drd]
2017-11-09T02:43:01,638 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[spatial.drd]
2017-11-09T02:43:01,638 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Processing file[time_LITTLE_ENDIAN.drd]
2017-11-09T02:43:01,647 INFO [main] io.druid.segment.IndexIO$DefaultIndexIOHandler - Skipped files[[index.drd, inverted.drd, metadata.drd, spatial.drd]]
2017-11-09T02:43:01,736 INFO [main] com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase - GHFS version: 1.6.0-hadoop2
2017-11-09T02:43:02,986 INFO [main] io.druid.indexer.JobHelper - Creating new ZipEntry[meta.smoosh]
2017-11-09T02:43:02,995 INFO [main] io.druid.indexer.JobHelper - Creating new ZipEntry[00000.smoosh]
2017-11-09T02:43:05,342 INFO [main] io.druid.indexer.JobHelper - Creating new ZipEntry[version.bin]
2017-11-09T02:43:05,662 INFO [main] io.druid.indexer.JobHelper - Zipped 15,581,505 bytes to [gs://melissa-druid-v2/indexedData/track_cart_v1/2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z/2017-11-09T02:31:08.417Z/0/index.zip.1]
2017-11-09T02:43:05,778 INFO [main] io.druid.indexer.JobHelper - Attempting rename from [gs://melissa-druid-v2/indexedData/track_cart_v1/2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z/2017-11-09T02:31:08.417Z/0/index.zip.1] to [gs://melissa-druid-v2/indexedData/track_cart_v1/2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z/2017-11-09T02:31:08.417Z/0/index.zip]
2017-11-09T02:43:07,021 INFO [main] io.druid.indexer.IndexGeneratorJob - Writing descriptor to path[hdfs://druid-v2-m/tmp/druid-indexing/track_cart_v1/2017-11-09T023108.417Z_fde79096874c454980e4f4aad6bf3bb1/segmentDescriptorInfo/track_cart_v1_2017-11-08T190000.000Z_2017-11-08T200000.000Z_2017-11-09T023108.417Z.json]
Heap
PSYoungGen total 704000K, used 358919K [0x000000078cc00000, 0x00000007bf680000, 0x00000007c0000000)
eden space 586240K, 60% used [0x000000078cc00000,0x00000007a261d6d8,0x00000007b0880000)
from space 117760K, 3% used [0x00000007b8380000,0x00000007b87e4680,0x00000007bf680000)
to space 121856K, 0% used [0x00000007b0880000,0x00000007b0880000,0x00000007b7f80000)
ParOldGen total 656384K, used 428644K [0x0000000726400000, 0x000000074e500000, 0x000000078cc00000)
object space 656384K, 65% used [0x0000000726400000,0x00000007406993c0,0x000000074e500000)
Metaspace used 53777K, capacity 54204K, committed 54704K, reserved 1097728K
class space used 6406K, capacity 6524K, committed 6576K, reserved 1048576K
2017-11-09 02:42:22,881 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2017-11-09 02:42:22,947 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2017-11-09 02:42:22,947 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system started
2017-11-09 02:42:22,957 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2017-11-09 02:42:22,957 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1510072522962_0852, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@561b6512)
2017-11-09 02:42:23,065 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
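For reference, the sequence visible in this log (zip the segment files to index.zip.1, then rename to index.zip) boils down to roughly the following pattern. This is my own minimal local sketch of the flow, not Druid's actual code:

```python
import os
import zipfile

def zip_and_rename(segment_dir, final_path):
    """Zip the segment files to <final_path>.1, then rename into place.

    Local sketch of the create-then-rename flow seen in the JobHelper log.
    Note that on GCS the rename step is implemented by the connector as a
    copy followed by a delete, not an atomic move as on a local filesystem.
    """
    tmp_path = final_path + ".1"
    with zipfile.ZipFile(tmp_path, "w") as zf:
        # e.g. meta.smoosh, 00000.smoosh, version.bin
        for name in sorted(os.listdir(segment_dir)):
            zf.write(os.path.join(segment_dir, name), arcname=name)
    os.rename(tmp_path, final_path)  # index.zip only becomes visible here
```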
According to this log, the three files 'meta.smoosh', '00000.smoosh' and 'version.bin' seem to have been created and zipped correctly, which in reality is not the case.
Also, I noticed that sometimes it's not the 'version.bin' file that is missing but 'meta.smoosh'. In that case the exception looks like this:
Exception running child : java.lang.RuntimeException: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1510072522962_0990/container_1510072522962_0990_01_000050/tmp/1510216408962-0/meta.smoosh
I can't provide those logs anymore because they have been deleted, but when I checked the corresponding index.zip, it indeed didn't contain the 'meta.smoosh' file, only the other two ('00000.smoosh' and 'version.bin').
Unfortunately, I can't reproduce the problem on purpose; it seems to happen at random. The only way I can detect that a version of an index.zip is "corrupted" is when another task uses it to reindex some data. The only workaround I have for the moment is to manually delete this version of the segment in the metadata database and roll back to the previous version, which is really dirty and time-consuming.
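Until the root cause is found, the manual rollback can at least be scripted. Here is a sketch of the metadata update that corresponds to what I do by hand: mark the corrupted version as unused so the Coordinator serves the previous one. The table name druid_segments is the default (it depends on druid.metadata.storage.tables.base), and the statements should obviously be double-checked before running them against a production metadata store:

```python
# Sketch only: builds the parametrized UPDATE statements for the Druid
# metadata store (MySQL in my setup). Execute them with a MySQL client.
def rollback_statements(datasource, bad_version, good_version):
    """Mark bad_version unused and make sure good_version is used again."""
    mark_bad = (
        "UPDATE druid_segments SET used = 0 "
        "WHERE dataSource = %s AND version = %s",
        (datasource, bad_version),
    )
    mark_good = (
        "UPDATE druid_segments SET used = 1 "
        "WHERE dataSource = %s AND version = %s",
        (datasource, good_version),
    )
    return [mark_bad, mark_good]

# e.g. rollback_statements("track_cart_v1",
#                          "2017-11-09T02:31:08.417Z",
#                          "2017-11-08T23:31:07.126Z")
```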
Here's the task spec that I submit to Druid:
{
  "spec": {
    "dataSchema": {
      "dataSource": "track_cart_v1",
      "parser": {
        "parseSpec": {
          "dimensionsSpec": {
            "dimensionExclusions": [],
            "dimensions": ["some_dim"],
            "spatialDimensions": []
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "posix"
          },
          "format": "json"
        },
        "type": "string"
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "agg_count"
        },
        {
          "type": "hyperUnique",
          "name": "agg_distinct_idfv",
          "fieldName": "idfv"
        },
        {
          "type": "doubleSum",
          "name": "agg_quantity",
          "fieldName": "quantity"
        },
        {
          "type": "javascript",
          "name": "agg_total_price",
          "fieldNames": ["quantity", "unitPrice"],
          "fnAggregate": "function(current, a, b) { return current + a*b; }",
          "fnReset": "function() { return 0; }",
          "fnCombine": "function(partialA, partialB) { return partialA + partialB; }"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": {
          "type": "duration",
          "duration": 900000,
          "origin": "1970-01-01T00:00:00.000Z"
        },
        "rollup": true,
        "intervals": ["2017-11-08T19:00:00/2017-11-08T21:00:00"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "multi",
        "children": [
          {
            "inputPath": "gs://data.melissa.com/logs/cart/",
            "pathFormat": "'y'=yyyy/'m'=MM/'d'=dd/'h'=HH",
            "dataGranularity": "hour",
            "type": "granularity",
            "filePattern": ".*/y=2017/m=11/d=08/h=20/.*.json.*"
          },
          {
            "type": "dataSource",
            "ingestionSpec": {
              "segments": [
                {
                  "dataSource": "track_cart_v1",
                  "interval": "2017-11-08T19:00:00.000Z/2017-11-08T20:00:00.000Z",
                  "version": "2017-11-08T23:31:07.126Z",
                  "loadSpec": {
                    "type": "hdfs",
                    "path": "gs://melissa-druid-v2/indexedData/track_cart_v1/2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z/2017-11-09T02:31:08.417Z/0/index.zip"
                  },
                  "dimensions": "some_dims",
                  "shardSpec": { "type": "none" },
                  "binaryVersion": 9,
                  "size": 13021291,
                  "identifier": "track_cart_v1_2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z_2017-11-08T23:31:07.126Z"
                }
              ],
              "intervals": ["2017-11-08T19:00:00/2017-11-08T21:00:00"],
              "dataSource": "track_cart_v1"
            },
            "segments": [
              {
                "segment": {
                  "dataSource": "track_cart_v1",
                  "interval": "2017-11-08T19:00:00.000Z/2017-11-08T20:00:00.000Z",
                  "version": "2017-11-08T23:31:07.126Z",
                  "loadSpec": {
                    "type": "hdfs",
                    "path": "gs://melissa-druid-v2/indexedData/track_cart_v1/2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z/2017-11-09T02:31:08.417Z/0/index.zip"
                  },
                  "dimensions": "some_dims",
                  "shardSpec": { "type": "none" },
                  "binaryVersion": 9,
                  "size": 13021291,
                  "identifier": "track_cart_v1_2017-11-08T19:00:00.000Z_2017-11-08T20:00:00.000Z_2017-11-08T23:31:07.126Z"
                },
                "interval": "2017-10-31T07:00:00.000Z/2017-10-31T08:00:00.000Z"
              }
            ]
          }
        ]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "avro": false,
      "ignoreInvalidRows": true,
      "maxRowsInMemory": 500000,
      "useCombiner": true,
      "partitionsSpec": {
        "type": "dimensions",
        "partitionDimension": "a_dim",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true",
        "mapreduce.job.classloader": "true",
        "mapreduce.job.classloader.system.classes": "-javax.validation.,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.",
        "yarn.nodemanager.resource.memory-mb": 6144,
        "yarn.scheduler.minimum-allocation-mb": 1536,
        "yarn.scheduler.maximum-allocation-mb": 6144,
        "mapreduce.map.memory.mb": 1536,
        "mapreduce.reduce.memory.mb": 3072,
        "mapreduce.map.java.opts": "-server -Xmx1229m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "mapreduce.reduce.java.opts": "-server -Xmx2459m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "yarn.app.mapreduce.am.resource.mb": 3072,
        "yarn.app.mapreduce.am.command-opts": "-server -Xmx2459m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
      }
    }
  }
}
Here are my Druid configs:
Common
# Zookeeper
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181
#Extensions
#druid.extensions.directory=/usr/local/druid/extensions
druid.extensions.hadoopDependenciesDir=/usr/local/druid/hadoop-dependencies
druid.storage.type=hdfs
druid.storage.storageDirectory=gs://melissa-druid-v2/indexedData
druid.cache.type=local
druid.cache.sizeInBytes=1000000000
druid.cache.logEvictionCount=1000
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.historical.cache.unCacheable=[]
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.broker.cache.unCacheable=[]
# Metadata storage
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://someIp
druid.metadata.storage.connector.user=someUser
druid.metadata.storage.connector.password=somePassword
# Indexing service discovery
druid.selectors.indexing.serviceName=druid/overlord
# Metrics logging (disabled for examples)
druid.emitter=noop
druid.processing.buffer.sizeBytes=500000000
druid.announcer.type=batch
Broker
druid.host=##HOSTNAME##:8080
druid.port=8080
druid.service=druid/broker
# We enable using the local query cache here
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
# Bump these up only for faster nested groupBy
druid.processing.buffer.sizeBytes=536870912
druid.processing.numThreads=4
druid.selectors.indexing.serviceName=druid/overlord
user.timezone=UTC
file.encoding=UTF-8
druid.storage.type=hdfs
Coordinator
druid.host=##HOSTNAME##:8080
druid.port=8080
druid.service=druid/coordinator
druid.coordinator.startDelay=PT70s
druid.selectors.indexing.serviceName=druid/overlord
druid.extensions.loadList=[\"druid-hdfs-storage\",\"mysql-metadata-storage\"]
druid.extensions.directory=/usr/local/druid/extensions
Historical
druid.host=##HOSTNAME##:8080
druid.port=8080
druid.service=druid/historical
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=4
druid.server.http.numThreads=12
druid.segmentCache.locations=[{"path":"/mnt/data/segmentCache","maxSize":1000000000000}]
druid.server.maxSize=1000000000000
druid.segmentCache.numLoadingThreads=4
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.sizeInBytes=5000000000
druid.cache.initialSize=1000000
druid.selectors.indexing.serviceName=druid/overlord
druid.extensions.loadList=[\"druid-hdfs-storage\"]
druid.extensions.directory=/usr/local/druid/extensions
MiddleManager
druid.host=##HOSTNAME##:8080
druid.port=8080
druid.service=druid/middlemanager
# Store task logs in deep storage
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=gs://melissa-druid-v2/mmLogs
# Resources for peons
druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:PermSize=512m -XX:MaxPermSize=512m
druid.indexer.task.baseTaskDir=/tmp/persistent/tasks
#Hadoop specific
druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing
# Peon properties
#druid.indexer.fork.property.druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=536870912
druid.indexer.fork.property.druid.processing.numThreads=2
druid.indexer.fork.property.druid.segmentCache.locations=[{"path": "/var/app/druid/segmentCache", "maxSize": 0}]
druid.indexer.fork.property.druid.server.http.numThreads=10
druid.indexer.fork.property.druid.storage.type=hdfs
druid.indexer.fork.property.druid.storage.storageDirectory=gs://melissa-druid-v2/indexedData
druid.indexer.fork.property.druid.metadata.storage.type=mysql
druid.indexer.fork.property.druid.metadata.storage.connector.connectURI=jdbc:mysql://someIp
druid.indexer.fork.property.druid.metadata.storage.connector.user=someUser
druid.indexer.fork.property.druid.metadata.storage.connector.password=somePassword
druid.indexer.fork.property.druid.announcer.type=batch
druid.indexer.fork.property.druid.indexer.logs.type=hdfs
druid.indexer.fork.property.druid.indexer.logs.directory=gs://melissa-druid-v2/mmLogs
druid.indexer.fork.property.druid.selectors.indexing.serviceName=druid/overlord
druid.worker.capacity=4
druid.worker.ip=##HOSTNAME##
druid.worker.version=1
user.timezone=UTC
file.encoding=UTF-8
druid.extensions.loadList=[\"druid-hdfs-storage\"]
druid.extensions.directory=/usr/local/druid/extensions
druid.selectors.indexing.serviceName=druid/overlord
Overlord
druid.host=##HOSTNAME##:8080
druid.port=8080
druid.service=druid/overlord
# Only required if you are autoscaling middle managers
druid.indexer.autoscale.doAutoscale=false
druid.indexer.autoscale.strategy=gce
druid.indexer.autoscale.workerIdleTimeout=PT9M
druid.indexer.autoscale.maxScalingDuration=PT5M
druid.indexer.autoscale.provisionPeriod=PT1M
druid.indexer.autoscale.terminatePeriod=PT1M
druid.indexer.autoscale.numEventsToTrack=10
druid.indexer.autoscale.pendingTaskTimeout=PT30S
druid.indexer.autoscale.workerVersion=1
druid.indexer.autoscale.workerPort=8080
# Upload all task logs to deep storage
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=gs://melissa-druid-v2/overlordLogs
# Run in remote mode
druid.indexer.runner.type=remote
druid.indexer.runner.minWorkerVersion=1
# Store all task state in the metadata storage
druid.indexer.storage.type=metadata
druid.indexer.processing.sizeBytes=100000000
druid.extensions.loadList=[\"druid-hdfs-storage\"]
druid.extensions.directory=/usr/local/druid/extensions
Thanks for your help and your time! If you need any other information about the configuration, the architecture, or the logs, just ask.
Have a good day,
Jean.