Pushing Segments to Google Storage from Hadoop Batch Indexer

th...@spotify.com

Jan 19, 2018, 1:46:53 PM
to Druid User
Hello!
I'm successfully indexing data using Google Dataproc. However, the job fails when the indexer tries to push the newly created segments to Google Storage. I've tried for a while to get this to work, but maybe someone here can provide more insight?

I'm compiling a fatjar from source; I think the source code snapshot is somewhere between 0.10.1 and 0.11.0.

The full stack trace is below.

Thanks in advance!
Peter

Peters-MacBook-Pro:~/repos/druid (master *% u-66)$ ./index_it.sh
Job [5bee1048-2ff8-49e2-82d5-d3eedb7c01be] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/tmp/5bee1048-2ff8-49e2-82d5-d3eedb7c01be/druid_build-assembly-0.1-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/01/19 18:14:54 INFO util.Version: HV000001: Hibernate Validator 5.1.3.Final
18/01/19 18:14:54 INFO guice.JsonConfigurator: Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=[], defaultVersion='0.1-SNAPSHOT', localRepository='/root/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-client/2.3.0/hadoop-client-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-common/2.3.0/hadoop-common-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.8/jackson-core-asl-1.8.8.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.8/jackson-mapper-asl-1.8.8.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/thoughtworks/paranamer/paranamer/2.3/paranamer-2.3.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-auth/2.3.0/hadoop-auth-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/httpcomponents/httpcore/4.2.5/httpcore-4.2.5.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5/zookeeper-3.4.5.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.3.0/hadoop-hdfs-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.3.0/hadoop-mapreduce-client-app-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.3.0/hadoop-mapreduce-client-common-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.3.0/hadoop-yarn-client-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.3.0/hadoop-yarn-server-common-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.3.0/hadoop-mapreduce-client-shuffle-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.3.0/hadoop-yarn-api-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.3.0/hadoop-mapreduce-client-core-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.3.0/hadoop-yarn-common-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/javax/servlet/servlet-api/2.5/servlet-api-2.5.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.3.0/hadoop-mapreduce-client-jobclient-2.3.0.jar]
18/01/19 18:14:55 INFO initialization.Initialization: Added URL[file:/root/.m2/repository/org/apache/hadoop/hadoop-annotations/2.3.0/hadoop-annotations-2.3.0.jar]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/tmp/5bee1048-2ff8-49e2-82d5-d3eedb7c01be/druid_build-assembly-0.1-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/01/19 18:14:55 INFO util.Version: HV000001: Hibernate Validator 5.1.3.Final
18/01/19 18:14:56 INFO guice.JsonConfigurator: Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=[], defaultVersion='0.1-SNAPSHOT', localRepository='/root/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.metadata.storage.mysql.MySQLMetadataStorageModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.storage.s3.S3StorageDruidModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.firehose.s3.S3FirehoseDruidModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.query.aggregation.histogram.ApproximateHistogramDruidModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.storage.hdfs.HdfsStorageDruidModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO guice.JsonConfigurator: Loaded class[class io.druid.guice.ExtensionsConfig] from props[druid.extensions.] as [ExtensionsConfig{searchCurrentClassloader=true, coordinates=[], defaultVersion='0.1-SNAPSHOT', localRepository='/root/.m2/repository', remoteRepositories=[https://repo1.maven.org/maven2/, https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local]}]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.metadata.storage.mysql.MySQLMetadataStorageModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.storage.s3.S3StorageDruidModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.firehose.s3.S3FirehoseDruidModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.query.aggregation.histogram.ApproximateHistogramDruidModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:56 INFO initialization.Initialization: Adding local extension module[io.druid.storage.hdfs.HdfsStorageDruidModule] for class[io.druid.initialization.DruidModule]
18/01/19 18:14:57 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.metrics.DruidMonitorSchedulerConfig] from props[druid.monitoring.] as [io.druid.server.metrics.DruidMonitorSchedulerConfig@47b1f99f]
18/01/19 18:14:57 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.metrics.MonitorsConfig] from props[druid.monitoring.] as [MonitorsConfig{monitors=[]}]
18/01/19 18:14:57 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
18/01/19 18:14:58 INFO config.ConfigurationObjectFactory: Using method itself for [druid.computation.buffer.size, ${base_path}.buffer.sizeBytes] on [io.druid.query.DruidProcessingConfig#intermediateComputeSizeBytes()]
18/01/19 18:14:58 INFO config.ConfigurationObjectFactory: Using method itself for [${base_path}.numThreads] on [io.druid.query.DruidProcessingConfig#getNumThreads()]
18/01/19 18:14:58 INFO config.ConfigurationObjectFactory: Using method itself for [${base_path}.columnCache.sizeBytes] on [io.druid.query.DruidProcessingConfig#columnCacheSizeBytes()]
18/01/19 18:14:58 INFO config.ConfigurationObjectFactory: Assigning default value [processing-%s] for [${base_path}.formatString] on [com.metamx.common.concurrent.ExecutorServiceConfig#getFormatString()]
18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[interface io.druid.segment.data.BitmapSerdeFactory] from props[druid.processing.bitmap.] as [ConciseBitmapSerdeFactory{}]
Jan 19, 2018 6:14:58 PM com.google.inject.servlet.GuiceFilter setPipeline
WARNING: Multiple Servlet injectors detected. This is a warning indicating that you have more than one GuiceFilter running in your web application. If this is deliberate, you may safely ignore this message. If this is NOT deliberate however, your application may not work as expected.
18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.metrics.DruidMonitorSchedulerConfig] from props[druid.monitoring.] as [io.druid.server.metrics.DruidMonitorSchedulerConfig@72f294f]
18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.metrics.MonitorsConfig] from props[druid.monitoring.] as [MonitorsConfig{monitors=[]}]
18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[class io.druid.server.DruidNode] from props[druid.] as [DruidNode{serviceName='druid/internal-hadoop-indexer', host='druid-indexer-1dot1-m.c.ad-veritas.internal', port=0}]
18/01/19 18:14:58 INFO guice.JsonConfigurator: Loaded class[class io.druid.metadata.MetadataStorageTablesConfig] from props[druid.metadata.storage.tables.] as [io.druid.metadata.MetadataStorageTablesConfig@452ccb54]
18/01/19 18:14:58 INFO mysql.MySQLConnector: Configured MySQL as metadata storage
18/01/19 18:14:58 INFO indexer.HadoopDruidIndexerConfig: Running with config:
{
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikiticker",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "time",
            "format" : "auto",
            "missingValue" : null
          },
          "dimensionsSpec" : {
            "dimensions" : [ "channel", "cityName", "comment", "countryIsoCode", "countryName", "isAnonymous", "isMinor", "isNew", "isRobot", "isUnpatrolled", "metroCode", "namespace", "page", "regionIsoCode", "regionName", "user" ],
            "dimensionExclusions" : [ "deleted", "added", "delta", "time" ],
            "spatialDimensions" : [ ]
          }
        }
      },
      "metricsSpec" : [ {
        "type" : "count",
        "name" : "count"
      }, {
        "type" : "longSum",
        "name" : "added",
        "fieldName" : "added"
      }, {
        "type" : "longSum",
        "name" : "deleted",
        "fieldName" : "deleted"
      }, {
        "type" : "longSum",
        "name" : "delta",
        "fieldName" : "delta"
      }, {
        "type" : "hyperUnique",
        "name" : "user_unique",
        "fieldName" : "user"
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : {
          "type" : "none"
        },
        "intervals" : [ "2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.000Z" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "gs://forpeter/wikiticker-2015-09-12-sampled.json.gz"
      },
      "metadataUpdateSpec" : {
        "type" : "mysql",
        "connectURI" : "jdbc:mysql://35.187.161.88:3306/druid",
        "user" : "druid",
        "password" : {
          "type" : "default",
          "password" : "diurd"
        },
        "segmentTable" : "druid_segments"
      },
      "segmentOutputPath" : "gs://forpeter/"
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "workingPath" : "/tmp/druid-indexing",
      "version" : "2018-01-19T18:14:58.573Z",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000,
        "maxPartitionSize" : 7500000,
        "assumeGrouped" : false,
        "numShards" : -1
      },
      "shardSpecs" : { },
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : null,
        "metricCompression" : null
      },
      "leaveIntermediate" : false,
      "cleanupOnFailure" : true,
      "overwriteFiles" : false,
      "ignoreInvalidRows" : false,
      "jobProperties" : {
        "hadoop.mapreduce.job.user.classpath.first" : "true"
      },
      "combineText" : false,
      "persistInHeap" : false,
      "ingestOffheap" : false,
      "bufferSize" : 134217728,
      "aggregationBufferRatio" : 0.5,
      "useCombiner" : false,
      "rowFlushBoundary" : 80000
    }
  }
}
18/01/19 18:14:58 INFO path.StaticPathSpec: Adding paths[gs://forpeter/wikiticker-2015-09-12-sampled.json.gz]
18/01/19 18:14:59 INFO indexer.JobHelper: Uploading jar to path[/tmp/druid-indexing/classpath/druid_build-assembly-0.1-SNAPSHOT.jar]
18/01/19 18:15:00 INFO path.StaticPathSpec: Adding paths[gs://forpeter/wikiticker-2015-09-12-sampled.json.gz]
18/01/19 18:15:00 INFO client.RMProxy: Connecting to ResourceManager at druid-indexer-1dot1-m/10.132.0.40:8032
18/01/19 18:15:00 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/01/19 18:15:00 WARN mapreduce.JobResourceUploader: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
18/01/19 18:15:00 INFO input.FileInputFormat: Total input paths to process : 1
18/01/19 18:15:00 INFO mapreduce.JobSubmitter: number of splits:1
18/01/19 18:15:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515600795023_0009
18/01/19 18:15:01 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
18/01/19 18:15:01 INFO impl.YarnClientImpl: Submitted application application_1515600795023_0009
18/01/19 18:15:02 INFO mapreduce.Job: The url to track the job: http://druid-indexer-1dot1-m:8088/proxy/application_1515600795023_0009/
18/01/19 18:15:02 INFO indexer.DetermineHashedPartitionsJob: Job wikiticker-determine_partitions_hashed-Optional.of([2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.000Z]) submitted, status available at: http://druid-indexer-1dot1-m:8088/proxy/application_1515600795023_0009/
18/01/19 18:15:02 INFO mapreduce.Job: Running job: job_1515600795023_0009
18/01/19 18:15:09 INFO mapreduce.Job: Job job_1515600795023_0009 running in uber mode : false
18/01/19 18:15:09 INFO mapreduce.Job:  map 0% reduce 0%
18/01/19 18:15:20 INFO mapreduce.Job:  map 100% reduce 0%
18/01/19 18:15:30 INFO mapreduce.Job:  map 100% reduce 100%
18/01/19 18:15:30 INFO mapreduce.Job: Job job_1515600795023_0009 completed successfully
18/01/19 18:15:30 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=1053
FILE: Number of bytes written=449763
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
GS: Number of bytes read=2366222
GS: Number of bytes written=0
GS: Number of read operations=0
GS: Number of large read operations=0
GS: Number of write operations=0
HDFS: Number of bytes read=298
HDFS: Number of bytes written=99
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=26145
Total time spent by all reduces in occupied slots (ms)=37026
Total time spent by all map tasks (ms)=8715
Total time spent by all reduce tasks (ms)=6171
Total vcore-milliseconds taken by all map tasks=8715
Total vcore-milliseconds taken by all reduce tasks=12342
Total megabyte-milliseconds taken by all map tasks=26772480
Total megabyte-milliseconds taken by all reduce tasks=37914624
Map-Reduce Framework
Map input records=39244
Map output records=1
Map output bytes=1043
Map output materialized bytes=1053
Input split bytes=298
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=1053
Reduce input records=1
Reduce output records=0
Spilled Records=2
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=437
CPU time spent (ms)=25090
Physical memory (bytes) snapshot=1078218752
Virtual memory (bytes) snapshot=11605217280
Total committed heap usage (bytes)=1108869120
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=94
18/01/19 18:15:30 INFO indexer.DetermineHashedPartitionsJob: Job completed, loading up partitions for intervals[Optional.of([2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.000Z])].
18/01/19 18:15:30 INFO indexer.DetermineHashedPartitionsJob: Found approximately [40,337] rows in data.
18/01/19 18:15:30 INFO indexer.DetermineHashedPartitionsJob: Creating [1] shards
18/01/19 18:15:30 INFO indexer.DetermineHashedPartitionsJob: DetermineHashedPartitionsJob took 31286 millis
18/01/19 18:15:30 INFO indexer.JobHelper: Deleting path[/tmp/druid-indexing/wikiticker/2018-01-19T181458.573Z]
18/01/19 18:15:30 INFO path.StaticPathSpec: Adding paths[gs://forpeter/wikiticker-2015-09-12-sampled.json.gz]
18/01/19 18:15:30 INFO path.StaticPathSpec: Adding paths[gs://forpeter/wikiticker-2015-09-12-sampled.json.gz]
18/01/19 18:15:30 INFO client.RMProxy: Connecting to ResourceManager at druid-indexer-1dot1-m/10.132.0.40:8032
18/01/19 18:15:30 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/01/19 18:15:30 WARN mapreduce.JobResourceUploader: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
18/01/19 18:15:30 INFO input.FileInputFormat: Total input paths to process : 1
18/01/19 18:15:30 INFO mapreduce.JobSubmitter: number of splits:1
18/01/19 18:15:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515600795023_0010
18/01/19 18:15:30 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
18/01/19 18:15:30 INFO impl.YarnClientImpl: Submitted application application_1515600795023_0010
18/01/19 18:15:30 INFO mapreduce.Job: The url to track the job: http://druid-indexer-1dot1-m:8088/proxy/application_1515600795023_0010/
18/01/19 18:15:30 INFO indexer.IndexGeneratorJob: Job wikiticker-index-generator-Optional.of([2015-09-12T00:00:00.000Z/2015-09-13T00:00:00.000Z]) submitted, status available at http://druid-indexer-1dot1-m:8088/proxy/application_1515600795023_0010/
18/01/19 18:15:30 INFO mapreduce.Job: Running job: job_1515600795023_0010
18/01/19 18:15:42 INFO mapreduce.Job: Job job_1515600795023_0010 running in uber mode : false
18/01/19 18:15:42 INFO mapreduce.Job:  map 0% reduce 0%
18/01/19 18:15:56 INFO mapreduce.Job:  map 100% reduce 0%
18/01/19 18:16:07 INFO mapreduce.Job:  map 100% reduce 100%
18/01/19 18:16:11 INFO mapreduce.Job: Task Id : attempt_1515600795023_0010_r_000000_0, Status : FAILED
Error: com.metamx.common.IAE: Unknown file system scheme [gs]
at io.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:274)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:621)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:462)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

18/01/19 18:16:12 INFO mapreduce.Job:  map 100% reduce 0%
18/01/19 18:16:23 INFO mapreduce.Job:  map 100% reduce 100%
18/01/19 18:16:26 INFO mapreduce.Job: Task Id : attempt_1515600795023_0010_r_000000_1, Status : FAILED
Error: com.metamx.common.IAE: Unknown file system scheme [gs]
at io.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:274)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:621)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:462)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

18/01/19 18:16:27 INFO mapreduce.Job:  map 100% reduce 0%
18/01/19 18:16:38 INFO mapreduce.Job:  map 100% reduce 100%
18/01/19 18:16:41 INFO mapreduce.Job: Task Id : attempt_1515600795023_0010_r_000000_2, Status : FAILED
Error: com.metamx.common.IAE: Unknown file system scheme [gs]
at io.druid.indexer.JobHelper.serializeOutIndex(JobHelper.java:274)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:621)
at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.reduce(IndexGeneratorJob.java:462)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

18/01/19 18:16:42 INFO mapreduce.Job:  map 100% reduce 0%
18/01/19 18:16:53 INFO mapreduce.Job:  map 100% reduce 100%
18/01/19 18:16:57 INFO mapreduce.Job: Job job_1515600795023_0010 failed with state FAILED due to: Task failed task_1515600795023_0010_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1

18/01/19 18:16:57 INFO mapreduce.Job: Counters: 42
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=22635343
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
GS: Number of bytes read=2366222
GS: Number of bytes written=0
GS: Number of read operations=0
GS: Number of large read operations=0
GS: Number of write operations=0
HDFS: Number of bytes read=281
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed reduce tasks=4
Launched map tasks=1
Launched reduce tasks=4
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30225
Total time spent by all reduces in occupied slots (ms)=305802
Total time spent by all map tasks (ms)=10075
Total time spent by all reduce tasks (ms)=50967
Total vcore-milliseconds taken by all map tasks=10075
Total vcore-milliseconds taken by all reduce tasks=101934
Total megabyte-milliseconds taken by all map tasks=30950400
Total megabyte-milliseconds taken by all reduce tasks=313141248
Map-Reduce Framework
Map input records=39244
Map output records=39244
Map output bytes=22253954
Map output materialized bytes=22410936
Input split bytes=281
Combine input records=0
Spilled Records=39244
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=281
CPU time spent (ms)=18460
Physical memory (bytes) snapshot=720211968
Virtual memory (bytes) snapshot=4434993152
Total committed heap usage (bytes)=671088640
File Input Format Counters
Bytes Read=0
18/01/19 18:16:57 INFO indexer.JobHelper: Deleting path[/tmp/druid-indexing/wikiticker/2018-01-19T181458.573Z]
18/01/19 18:16:57 ERROR cli.CliHadoopIndexer: failure!!!!
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:120)
at io.druid.cli.Main.main(Main.java:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)
Caused by: com.metamx.common.ISE: Job[class io.druid.indexer.LegacyIndexGeneratorJob] failed!
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:202)
at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:96)
at io.druid.indexer.JobHelper.runJobs(JobHelper.java:182)
at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:132)
at io.druid.cli.Main.main(Main.java:91)
... 13 more
ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [5bee1048-2ff8-49e2-82d5-d3eedb7c01be] entered state [ERROR] while waiting for [DONE].

Nishant Bangarwa

Jan 19, 2018, 2:08:45 PM
to druid...@googlegroups.com
Hi, 
Have you added the GCS connector jar to the classpath, as recommended here: http://druid.io/docs/latest/development/extensions-core/hdfs.html
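
For example, on Dataproc the connector jar can be added to the job classpath at submit time with --jars. This is only a sketch (cluster name, bucket paths, and jar names below are placeholders; the connector location is the publicly documented gs://hadoop-lib one):

gcloud dataproc jobs submit hadoop \
  --cluster your-cluster \
  --jar gs://your-bucket/druid-fatjar.jar \
  --jars gs://hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar \
  -- io.druid.cli.Main index hadoop gs://your-bucket/index-spec.json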

th...@spotify.com

Jan 19, 2018, 2:37:02 PM
to Druid User
Hello,

Yes, I'm reading data from Google Storage, and the index JSON specification is also stored in GS. However, the job fails when it tries to push the segments to Google Storage.

My hunch is that the batch Hadoop indexer is a separate component and needs to be configured for Google Storage individually.

Here's the command I use to launch the job in Dataproc:
gcloud dataproc jobs submit hadoop --quiet --cluster $cluster --project=ad-veritas --jar gs://forpeter/druid_build-assembly-0.1-SNAPSHOT.jar --region=global -- io.druid.cli.Main index hadoop gs://forpeter/wikiticker-index.json

Here's the index spec:
{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "gs://forpeter/wikiticker-2015-09-12-sampled.json.gz"
      },
      "metadataUpdateSpec" : {
          "type":"mysql",
          "connectURI" : "jdbc:mysql://35.187.161.88:3306/druid",
          "password" : "diurd",
          "segmentTable" : "druid_segments",
          "user" : "druid"
      },
      "segmentOutputPath" : "gs://forpeter/"
    },
    "dataSchema" : {
      "dataSource" : "wikiticker",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "channel",
              "cityName",
              "comment",
              "countryIsoCode",
              "countryName",
              "isAnonymous",
              "isMinor",
              "isNew",
              "isRobot",
              "isUnpatrolled",
              "metroCode",
              "namespace",
              "page",
              "regionIsoCode",
              "regionName",
              "user"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "added",
          "type" : "longSum",
          "fieldName" : "added"
        },
        {
          "name" : "deleted",
          "type" : "longSum",
          "fieldName" : "deleted"
        },
        {
          "name" : "delta",
          "type" : "longSum",
          "fieldName" : "delta"
        },
        {
          "name" : "user_unique",
          "type" : "hyperUnique",
          "fieldName" : "user"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
"workingPath" : "/tmp/druid-indexing",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "jobProperties" : {
            "hadoop.mapreduce.job.user.classpath.first": "true"
            }
    }
  }
}

-pt   

Erik Dubbelboer

Jan 19, 2018, 10:27:11 PM
to Druid User
What does your Druid config look like? Are you loading the druid-google-extensions extension and setting druid.storage.type=google, druid.google.bucket=your-bucket-name, and druid.google.prefix=yourprefix correctly?
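
For reference, those settings would look roughly like this in the indexer's common.runtime.properties (a minimal sketch; the bucket and prefix values are placeholders, and the extension-loading line assumes the loadList mechanism used by recent builds):

druid.extensions.loadList=["druid-google-extensions"]
druid.storage.type=google
druid.google.bucket=your-bucket-name
druid.google.prefix=druid/segments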