Quartz scheduler to run Gobblin in map-reduce mode via standalone script not working


RG

Apr 22, 2016, 1:56:04 PM
to gobblin-users
Hi,

I am able to move Avro data from Kafka to S3 using the map-reduce launch script, but I want to run the job every 5 minutes. I see that it's not possible to use the default Quartz scheduler via the map-reduce launch script, but we can use the standalone script to launch the map-reduce job with the scheduler.

I modified my standalone properties and job file accordingly, but I am getting:

 java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Also, it's not reporting metrics via the standalone script even though I have enabled them.

Can we not use the s3a file system via the standalone script? How do I fix it? Also, what do I have to do differently to report metrics via the standalone script?

Error logs:

2016-04-22 17:24:34 UTC INFO  [JobScheduler STARTING] gobblin.scheduler.JobScheduler  146 - Starting the job scheduler
2016-04-22 17:24:34 UTC INFO  [JobScheduler STARTING] org.quartz.core.QuartzScheduler  575 - Scheduler LocalJobScheduler_$_NON_CLUSTERED started.
2016-04-22 17:24:34 UTC INFO  [JobScheduler STARTING] gobblin.scheduler.JobScheduler  365 - Scheduling locally configured jobs
2016-04-22 17:24:34 UTC INFO  [JobScheduler STARTING] gobblin.scheduler.JobScheduler  378 - Loaded 1 job configuration
2016-04-22 17:24:34 UTC INFO  [MetricsReportingService STARTING] gobblin.metrics.GobblinMetrics  481 - Not reporting metrics to JMX
2016-04-22 17:24:34 UTC INFO  [MetricsReportingService STARTING] gobblin.metrics.GobblinMetrics  430 - Not reporting metrics to log files
2016-04-22 17:24:34 UTC INFO  [MetricsReportingService STARTING] gobblin.metrics.GobblinMetrics  492 - Not reporting metrics to Kafka
2016-04-22 17:25:00 UTC WARN  [LocalJobScheduler_Worker-1] org.apache.hadoop.util.NativeCodeLoader  62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-04-22 17:25:00 UTC INFO  [LocalJobScheduler_Worker-1] org.quartz.core.JobRunShell  207 - Job JwGobblinKafkaImport5.JwGobblinKafkaImport5 threw a JobExecutionException:
org.quartz.JobExecutionException: gobblin.runtime.JobException: Failed to run job JwGobblinKafkaImport5 [See nested exception: gobblin.runtime.JobException: Failed to run job JwGobblinKafkaImport5]
at gobblin.scheduler.JobScheduler$GobblinJob.execute(JobScheduler.java:514)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: gobblin.runtime.JobException: Failed to run job JwGobblinKafkaImport5
at gobblin.scheduler.JobScheduler.runJob(JobScheduler.java:294)
at gobblin.scheduler.JobScheduler$GobblinJob.execute(JobScheduler.java:512)
... 2 more
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at gobblin.runtime.JobContext.<init>(JobContext.java:125)
at gobblin.runtime.AbstractJobLauncher.<init>(AbstractJobLauncher.java:126)
at gobblin.runtime.local.LocalJobLauncher.<init>(LocalJobLauncher.java:67)
at gobblin.runtime.JobLauncherFactory.newJobLauncher(JobLauncherFactory.java:63)
at gobblin.scheduler.JobScheduler.runJob(JobScheduler.java:292)
... 3 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
... 14 more


Below are my configuration files:

1. The .pull job file:

job.name=GobblinKafkaImport5
job.group=GobblinKafkaImport5
job.description=Gobblin quick start job for Kafka

job.lock.enabled=true
mr.job.root.dir=${env:GOBBLIN_WORK_DIR}/working
task.data.root.dir=${env:GOBBLIN_WORK_DIR}/working/task-data
job.schedule=0 0/5 * * * ?
job.runonce=False
launcher.type=MAPREDUCE

kafka.brokers=kafka-01.com:9092

source.class=com.extractor.KafkaSource2
extract.namespace=gobblin.extract.kafka
topic.whitelist=ping-avro
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=AVRO

data.publisher.final.dir=s3a://test.com/gobblin-mapr6/
bootstrap.with.offset=latest

metrics.reporting.file.enabled=true
metrics.enabled=true
metrics.reporting.file.suffix=txt

writer.partitioner.class=gobblin.writer.partitioner.TimeBasedAvroWriterPartitioner
writer.partition.granularity=day
data.publisher.type=gobblin.publisher.TimePartitionedDataPublisher
writer.partition.timezone=UTC
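For context, `job.schedule` takes a Quartz cron expression, whose first field is seconds; the value used above fires at second 0 of every fifth minute:

```properties
# Quartz cron fields: seconds minutes hours day-of-month month day-of-week
# "0 0/5 * * * ?" = at second 0, every 5 minutes, any hour/day
# ("?" = no specific day-of-week, required when day-of-month is "*")
job.schedule=0 0/5 * * * ?
```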

2. gobblin-standalone.properties:

# Thread pool settings for the task executor
taskexecutor.threadpool.size=2
taskretry.threadpool.coresize=1
taskretry.threadpool.maxsize=2

# File system URIs
fs.uri=hdfs://{host}:8020
writer.fs.uri=${fs.uri}
state.store.fs.uri=s3a://test.com/gobblin-mapr6/


# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging
writer.output.dir=${env:GOBBLIN_WORK_DIR}/task-output

# Data publisher related configuration properties
data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=s3a://rohit-dev.ltvytics.com/gobblin-mapr6/
data.publisher.replace.final.dir=false

# Directory where job configuration files are stored
jobconf.dir=${env:GOBBLIN_JOB_CONFIG_DIR}


# Directory where commit sequences are stored
gobblin.runtime.commit.sequence.store.dir=${env:GOBBLIN_WORK_DIR}/commit-sequence-store

# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${env:GOBBLIN_WORK_DIR}/err

# Directory where job locks are stored
job.lock.dir=${env:GOBBLIN_WORK_DIR}/locks

# Directory where metrics log files are stored
metrics.log.dir=${env:GOBBLIN_WORK_DIR}/metrics

# Interval of task state reporting in milliseconds
task.status.reportintervalinms=5000

# MapReduce properties
mr.job.root.dir=${env:GOBBLIN_WORK_DIR}/working
task.data.root.dir=${env:GOBBLIN_WORK_DIR}/working/task-data

# s3 bucket configuration

data.publisher.fs.uri=s3a://test.com/gobblin-mapr6/
fs.s3a.access.key={key}
fs.s3a.secret.key={key}
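The `ClassNotFoundException` in the logs above means the S3A implementation class is not on the standalone launcher's classpath. A quick sanity check (a sketch; `GOBBLIN_HOME` is an assumed variable pointing at your Gobblin install, not something from this thread):

```shell
# Both hadoop-aws and a matching aws-java-sdk JAR must be in lib/
# for s3a:// URIs to resolve in the standalone launcher.
ls "${GOBBLIN_HOME:-.}/lib" | grep -Ei 'hadoop-aws|aws-java-sdk' \
  || echo "S3A jars missing from lib/"
```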



Sahil Takiar

Apr 26, 2016, 5:23:08 PM
to RG, gobblin-users
Make sure the hadoop-aws-2.6.0.jar and aws-java-sdk-1.7.4.jar JARs are under the "lib/" folder when you launch your script. If they aren't, you can download them from Maven Central and put them there.
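For reference, a sketch of fetching both JARs from Maven Central into the launcher's lib/ directory (the URLs follow Maven Central's standard repository layout; pick the hadoop-aws version that matches your Hadoop build):

```shell
# Run from the Gobblin distribution root; lib/ is assumed to be the
# directory the launch scripts put on the classpath.
cd lib/
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
```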

I believe the problem with metrics reporting is a known issue and there should be a fix soon, but could you please open a GitHub issue describing the problems you are seeing with metrics reporting?

--Sahil

To view this discussion on the web visit https://groups.google.com/d/msgid/gobblin-users/ca67d134-6928-46b9-aba6-e2b49bc7b2a2%40googlegroups.com.
