kafka.schema.registry.url


Zhenhua Cao

May 14, 2016, 10:30:29 AM
to gobblin-users
I am ingesting data from Kafka to HDFS, and the job failed at the step that gets the source Avro schema:
kafka.schema.registry.url=#schema registry URI

I added two classes for the Kafka Avro source, as below:

public class KafkaAvroSource extends KafkaSource<Schema, GenericRecord> {

  private static final Logger LOG = LoggerFactory.getLogger(KafkaAvroSource.class);

  private Optional<KafkaAvroSchemaRegistry> schemaRegistry = Optional.absent();

  @Override
  public Extractor<Schema, GenericRecord> getExtractor(WorkUnitState state) throws IOException {
    return new CustomKafkaAvroExtractor(state);
  }

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    if (!this.schemaRegistry.isPresent()) {
      this.schemaRegistry = Optional.of(new KafkaAvroSchemaRegistry(state.getProperties()));
    }
    return super.getWorkunits(state);
  }

  /**
   * A {@link KafkaTopic} is qualified if its schema exists in the schema registry.
   */
  @Override
  protected boolean isTopicQualified(KafkaTopic topic) {
    Preconditions.checkState(this.schemaRegistry.isPresent(),
        "Schema registry not found. Unable to verify topic schema");

    try {
      this.schemaRegistry.get().getLatestSchemaByTopic(topic.getName());
      return true;
    } catch (SchemaRegistryException e) {
      LOG.error(String.format("Cannot find latest schema for topic %s. This topic will be skipped", topic.getName()));
      return false;
    }
  }
}

public class CustomKafkaAvroExtractor extends KafkaAvroExtractor {

  private static final Logger LOG = LoggerFactory.getLogger(CustomKafkaAvroExtractor.class);

  public CustomKafkaAvroExtractor(WorkUnitState state) {
    super(state);
  }

  @Override
  protected Schema getRecordSchema(byte[] payload) {
    // Parses a full Avro schema from the beginning of each message payload.
    Closer closer = Closer.create();
    Schema schema = null;
    InputStream inputStream = null;
    try {
      inputStream = closer.register(new DataInputStream(new ByteArrayInputStream(payload)));
      schema = new Schema.Parser().parse(inputStream);
    } catch (Exception ex) {
      LOG.error(String.format("Failed to get record schema at CustomKafkaAvroExtractor %s", inputStream));
    }
    return schema;
  }

  @Override
  protected Decoder getDecoder(byte[] payload) {
    // Wraps the raw payload bytes in an Avro binary decoder.
    Closer closer = Closer.create();
    InputStream inputStream = null;
    Decoder decoder = null;
    try {
      inputStream = closer.register(new DataInputStream(new ByteArrayInputStream(payload)));
      LOG.info(String.format("CustomKafkaAvroExtractor getDecoder inputStream: %s", inputStream));
      decoder = DecoderFactory.get().binaryDecoder(inputStream, null);
    } catch (Exception ex) {
      LOG.error(String.format("Failed to get decoder at CustomKafkaAvroExtractor %s", inputStream));
    }
    return decoder;
  }
}
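
For reference, many Kafka-plus-Avro setups do not embed the full schema text in every message; instead each payload begins with a small schema-ID header that the extractor resolves against the registry. A minimal sketch of that convention, assuming a magic byte followed by a fixed-length 16-byte schema ID (readSchemaFromHeader and lookupSchemaById are hypothetical helpers, not Gobblin APIs):

// Hypothetical helper illustrating a schema-ID-header payload layout.
private Schema readSchemaFromHeader(byte[] payload) throws IOException {
  DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
  byte magic = in.readByte();        // assumed marker byte for "a schema ID follows"
  if (magic != 0x0) {
    throw new IOException("Unexpected magic byte: " + magic);
  }
  byte[] schemaId = new byte[16];    // assumed fixed-length (e.g. MD5) schema ID
  in.readFully(schemaId);
  return lookupSchemaById(schemaId); // hypothetical lookup against the registry
}

If the producer really writes the full schema text into each payload, the parse-from-payload version above works; the producer and the extractor just have to agree on the payload layout.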


The job config file is as follows:

job.name=GobblinKafkaQuickStart
job.group=GobblinKafka
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false


kafka.brokers=localhost:9092
topic.whitelist=test
bootstrap.with.offset=earliest


source.class=gobblin.source.extractor.extract.kafka.KafkaAvroSource
source.schema={"namespace":"example.avro", "type":"record", "name":"User", "fields":[{"name":"name", "type":"string"}, {"name":"favorite_number",  "type":"int"}, {"name":"favorite_color", "type":"string"}]}
extract.namespace=gobblin.extract.kafka


writer.builder.class=gobblin.writer.SimpleDataWriterBuilder

data.publisher.final.dir=/gobblintest/job-output

mr.job.max.mappers=1
mr.job.root.dir=/gobblin-kafka/working


metrics.reporting.file.enabled=true
metrics.reporting.file.suffix=txt


task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data


Any advice would be greatly appreciated, thanks!

Zhenhua Cao

May 14, 2016, 10:33:28 AM
to gobblin-users
I updated the job config as below; this is a test config.

job.name=GobblinKafkaQuickStart
job.group=GobblinKafka
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.brokers=localhost:9092
topic.whitelist=test
source.class=gobblin.source.extractor.extract.kafka.KafkaAvroSource
extract.namespace=gobblin.extract.kafka
source.schema={"namespace":"example.avro", "type":"record", "name":"User", "fields":[{"name":"name", "type":"string"}, {"name":"favorite_number",  "type":"int"}, {"name":"favorite_color", "type":"string"}]}


writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=AVRO

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=1

metrics.reporting.file.enabled=true
metrics.log.dir=/gobblin-kafka/metrics
metrics.reporting.file.suffix=txt

bootstrap.with.offset=earliest

fs.uri=hdfs://localhost:9090
writer.fs.uri=hdfs://localhost:9090
state.store.fs.uri=hdfs://localhost:9090

mr.job.root.dir=/gobblin-kafka/working
state.store.dir=/gobblin-kafka/state-store
task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
data.publisher.final.dir=/gobblintest/job-output

On Saturday, May 14, 2016 at 10:30:29 PM UTC+8, Zhenhua Cao wrote:

Chavdar Botev

May 14, 2016, 3:25:37 PM
to Zhenhua Cao, gobblin-users
What is the exception that you get?

Zhenhua Cao

May 14, 2016, 8:12:34 PM
to gobblin-users
2016-05-14 06:47:12 PDT INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@4e34c76f[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-05-14 06:47:12 PDT ERROR [main] gobblin.runtime.SourceDecorator  59 - Failed to get work units for job job_GobblinKafkaQuickStart_1463233631178
java.lang.IllegalArgumentException: Property kafka.schema.registry.url not provided.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:93)
        at gobblin.metrics.kafka.KafkaAvroSchemaRegistry.<init>(KafkaAvroSchemaRegistry.java:65)
        at gobblin.source.extractor.extract.kafka.KafkaAvroSource.getWorkunits(KafkaAvroSource.java:54)
        at gobblin.runtime.SourceDecorator.getWorkunits(SourceDecorator.java:52)
        at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:241)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.launchJob(CliMRJobLauncher.java:87)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:64)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.main(CliMRJobLauncher.java:110)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
2016-05-14 06:47:12 PDT ERROR [main] gobblin.runtime.AbstractJobLauncher  321 - Failed to launch and run job job_GobblinKafkaQuickStart_1463233631178: gobblin.runtime.JobException: Failed to get work units for job job_GobblinKafkaQuickStart_1463233631178
gobblin.runtime.JobException: Failed to get work units for job job_GobblinKafkaQuickStart_1463233631178
        at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:249)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.launchJob(CliMRJobLauncher.java:87)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:64)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at gobblin.runtime.mapreduce.CliMRJobLauncher.main(CliMRJobLauncher.java:110)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
2016-05-14 06:47:13 PDT INFO  [main] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@38528f18[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
2016-05-14 06:47:13 PDT INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@38528f18[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-05-14 06:47:13 PDT INFO  [main] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6fa4c25c[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
2016-05-14 06:47:13 PDT INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6fa4c25c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-05-14 06:47:13 PDT INFO  [main] gobblin.runtime.app.ServiceBasedAppLauncher  162 - Shutting down the application
2016-05-14 06:47:13 PDT INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@334385bc
2016-05-14 06:47:13 PDT INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@334385bc
2016-05-14 06:47:13 PDT DEBUG [MetricsReportingService STOPPING] org.apache.hadoop.hdfs.DFSOutputStream  1813 - DFSClient writeChunk allocating new packet seqno=0, src=/gobblin-kafka/metrics/app_CliMRJob-3ada5c29-9607-44f9-b5ea-91c5c105b8ce_1463233630846/app_CliMRJob-3ada5c29-9607-44f9-b5ea-91c5c105b8ce_1463233630846.txt.metrics.log, packetSize=65016, chunksPerPacket=126, bytesCurBlock=0
2016-05-14 06:47:13 PDT DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client$Connection$3  1032 - IPC Client (216406687) connection to master/172.16.68.129:9000 from richard sending #12
2016-05-14 06:47:13 PDT DEBUG [IPC Client (216406687) connection to master/172.16.68.129:9000 from richard] org.apache.hadoop.ipc.Client$Connection  1089 - IPC Client (216406687) connection to master/172.16.68.129:9000 from richard got value #12
2016-05-14 06:47:13 PDT DEBUG [main] org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker  250 - Call: getFileInfo took 2ms
2016-05-14 06:47:13 PDT DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client$Connection$3  1032 - IPC Client (216406687) connection to master/172.16.68.129:9000 from richard sending #13
2016-05-14 06:47:13 PDT DEBUG [IPC Client (216406687) connection to master/172.16.68.129:9000 from richard] org.apache.hadoop.ipc.Client$Connection  1089 - IPC Client (216406687) connection to master/172.16.68.129:9000 from richard got value #13
2016-05-14 06:47:13 PDT DEBUG [main] org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker  250 - Call: delete took 2ms
2016-05-14 06:47:13 PDT INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  464 - Deleted working directory /gobblin-kafka/working/GobblinKafkaQuickStart
2016-05-14 06:47:13 PDT WARN  [Thread-7] gobblin.metrics.reporter.ContextAwareReporter  116 - Reporter OutputStreamReporter has already been stopped.
2016-05-14 06:47:13 PDT DEBUG [Thread-7] org.apache.hadoop.hdfs.DFSOutputStream  1732 - Queued packet 0
2016-05-14 06:47:13 PDT DEBUG [Thread-7] org.apache.hadoop.hdfs.DFSOutputStream  1732 - Queued packet 1


On Saturday, May 14, 2016 at 10:30:29 PM UTC+8, Zhenhua Cao wrote:
> I am ingesting data from Kafka to HDFS ...
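
The stack trace points at the cause: KafkaAvroSchemaRegistry's constructor requires kafka.schema.registry.url, and the test config above does not set it. A minimal sketch of the missing property (the URL is a placeholder standing in for a real schema-registry endpoint):

kafka.schema.registry.url=http://<registry-host>:<registry-port>/schema-repo/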

Zhenhua Cao

May 14, 2016, 10:49:53 PM
to gobblin-users
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] gobblin.metrics.kafka.KafkaAvroSchemaRegistry  108 - Fetching from URL : http://master:9000/schema-repo//latest_with_type=test
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.params.DefaultHttpParams  151 - Set parameter http.useragent = Jakarta Commons-HttpClient/3.1
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.params.DefaultHttpParams  151 - Set parameter http.protocol.version = HTTP/1.1
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.params.DefaultHttpParams  151 - Set parameter http.connection-manager.class = class org.apache.commons.httpclient.SimpleHttpConnectionManager
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.params.DefaultHttpParams  151 - Set parameter http.protocol.cookie-policy = default
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.params.DefaultHttpParams  151 - Set parameter http.protocol.element-charset = US-ASCII
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.params.DefaultHttpParams  151 - Set parameter http.protocol.content-charset = ISO-8859-1
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.params.DefaultHttpParams  151 - Set parameter http.method.retry-handler = org.apache.commons.httpclient.DefaultHttpMethodRetryHandler@6e58d51c
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.params.DefaultHttpParams  151 - Set parameter http.dateparser.patterns = [EEE, dd MMM yyyy HH:mm:ss zzz, EEEE, dd-MMM-yy HH:mm:ss zzz, EEE MMM d HH:mm:ss yyyy, EEE, dd-MMM-yyyy HH:mm:ss z, EEE, dd-MMM-yyyy HH-mm-ss z, EEE, dd MMM yy HH:mm:ss z, EEE dd-MMM-yyyy HH:mm:ss z, EEE dd MMM yyyy HH:mm:ss z, EEE dd-MMM-yyyy HH-mm-ss z, EEE dd-MMM-yy HH:mm:ss z, EEE dd MMM yy HH:mm:ss z, EEE,dd-MMM-yy HH:mm:ss z, EEE,dd-MMM-yyyy HH:mm:ss z, EEE, dd-MM-yyyy HH:mm:ss z]
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  72 - Java version: 1.7.0_45
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  73 - Java vendor: Oracle Corporation
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  75 - Operating system name: Linux
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  76 - Operating system architecture: amd64
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  77 - Operating system version: 4.2.0-16-generic
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - SUN 1.7: SUN (DSA key/parameter generation; DSA signing; SHA-1, MD5 digests; SecureRandom; X.509 certificates; JKS keystore; PKIX CertPathValidator; PKIX CertPathBuilder; LDAP, Collection CertStores, JavaPolicy Policy; JavaLoginConfig Configuration)
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - SunRsaSign 1.7: Sun RSA signature provider
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - SunEC 1.7: Sun Elliptic Curve provider (EC, ECDSA, ECDH)
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - SunJSSE 1.7: Sun JSSE provider(PKCS12, SunX509 key/trust factories, SSLv3, TLSv1)
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - SunJCE 1.7: SunJCE Provider (implements RSA, DES, Triple DES, AES, Blowfish, ARCFOUR, RC2, PBE, Diffie-Hellman, HMAC)
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - SunJGSS 1.7: Sun (Kerberos v5, SPNEGO)
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - SunSASL 1.7: Sun SASL provider(implements client mechanisms for: DIGEST-MD5, GSSAPI, EXTERNAL, PLAIN, CRAM-MD5, NTLM; server mechanisms for: DIGEST-MD5, GSSAPI, CRAM-MD5, NTLM)
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - XMLDSig 1.0: XMLDSig (DOM XMLSignatureFactory; DOM KeyInfoFactory)
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpClient  82 - SunPCSC 1.7: Sun PC/SC provider
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpConnection  692 - Open connection to master:9000
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - >> "GET /schema-repo//latest_with_type=test HTTP/1.1[\r][\n]"
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpMethodBase  1352 - Adding Host request header
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - >> "User-Agent: Jakarta Commons-HttpClient/3.1[\r][\n]"
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - >> "Host: master:9000[\r][\n]"
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - >> "[\r][\n]"
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - << "HTTP/1.1 404 Not Found[\r][\n]"
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - << "HTTP/1.1 404 Not Found[\r][\n]"
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - << "Content-type: text/plain[\r][\n]"
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - << "[\r][\n]"
2016-05-14 19:37:15 PDT INFO  [pool-14-thread-1] org.apache.commons.httpclient.HttpMethodBase  1869 - Response content length is not known
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpMethodBase  967 - Force-close connection: true
2016-05-14 19:37:15 PDT WARN  [pool-14-thread-1] org.apache.commons.httpclient.HttpMethodBase  682 - Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpMethodBase  685 - Buffering response body
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.Wire  70 - << "It looks like you are making an HTTP request to a Hadoop IPC port. This is not the correct port for the web interface on this daemon.[\r][\n]"
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpMethodBase  984 - Should force-close connection.
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpConnection  1178 - Releasing connection back to connection manager.
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] org.apache.commons.httpclient.HttpMethodBase  2342 - Default charset used: ISO-8859-1
2016-05-14 19:37:15 PDT ERROR [pool-14-thread-1] gobblin.source.extractor.extract.kafka.KafkaAvroSource  71 - Cannot find latest schema for topic test. This topic will be skipped
2016-05-14 19:37:15 PDT INFO  [pool-14-thread-1] kafka.utils.Logging$class  68 - Reconnect due to socket error: java.nio.channels.ClosedChannelException
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] kafka.utils.Logging$class  52 - Disconnecting from master:9092
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] kafka.utils.Logging$class  52 - Disconnecting from master:9092
2016-05-14 19:37:15 PDT DEBUG [pool-14-thread-1] kafka.utils.Logging$class  52 - Created socket with SO_TIMEOUT = 30000 (requested 30000), SO_RCVBUF = 212992 (requested 1048576), SO_SNDBUF = 1313280 (requested -1), connectTimeoutMs = 30000.
2016-05-14 19:37:15 PDT INFO  [pool-14-thread-1] gobblin.source.extractor.extract.kafka.KafkaSource  443 - Created workunit for partition test:0: lowWatermark=25, highWatermark=51, range=26
2016-05-14 19:37:15 PDT INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@678e882f[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-05-14 19:37:15 PDT INFO  [main] gobblin.source.extractor.extract.kafka.KafkaSource  149 - Created workunits for 1 topics in 0 seconds
2016-05-14 19:37:15 PDT INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaAvgRecordTimeBasedWorkUnitSizeEstimator  135 - Estimated avg time to pull a record for topic test is 3791.500000 milliseconds
2016-05-14 19:37:15 PDT INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaAvgRecordTimeBasedWorkUnitSizeEstimator  145 - For all topics not pulled in the previous run, estimated avg time to pull a record is 3791.4999999999986 milliseconds
2016-05-14 19:37:15 PDT INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker  219 - Created MultiWorkUnit for partitions [test:0]
2016-05-14 19:37:15 PDT INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker  293 - MultiWorkUnit 0: estimated load=0.003010, partitions=[[test:0]]
2016-05-14 19:37:15 PDT INFO  [main] gobblin.source.extractor.extract.kafka.workunit.packer.KafkaWorkUnitPacker  279 - Min load of multiWorkUnit = 0.003010; Max load of multiWorkUnit = 0.003010; Diff = 0.000000%
2016-05-14 19:37:15 PDT DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client$Connection$3  1032 - IPC Client (474146442) connection to master/172.16.68.129:9000 from richard sending #12
2016-05-14 19:37:15 PDT DEBUG [IPC Client (474146442) connection to master/172.16.68.129:9000 from richard] org.apache.hadoop.ipc.Client$Connection  1089 - IPC Client (474146442) connection to master/172.16.68.129:9000 from richard got value #12
2016-05-14 19:37:15 PDT DEBUG [main] org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker  250 - Call: getFileInfo took 2ms
2016-05-14 19:37:15 PDT DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client$Connection$3  1032 - IPC Client (474146442) connection to master/172.16.68.129:9000 from richard sending #13
2016-05-14 19:37:15 PDT DEBUG [IPC Client (474146442) connection to master/172.16.68.129:9000 from richard] org.apache.hadoop.ipc.Client$Connection  1089 - IPC Client (474146442) connection to master/172.16.68.129:9000 from richard got value #13

My custom classes are as posted in my earlier message, and my job config is as follows:

job.name=GobblinKafkaQuickStart
job.group=GobblinKafka
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.schema.registry.url=http://master:9000/schema-repo/
kafka.brokers=master:9092
topic.whitelist=test
source.class=gobblin.source.extractor.extract.kafka.KafkaAvroSource
extract.namespace=gobblin.extract.kafka
source.schema={"namespace":"example.avro", "type":"record", "name":"User", "fields":[{"name":"name", "type":"string"}, {"name":"favorite_number",  "type":"int"}, {"name":"favorite_color", "type":"string"}]}


writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=AVRO

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=1

metrics.reporting.file.enabled=true
metrics.log.dir=/gobblin-kafka/metrics
metrics.reporting.file.suffix=txt

bootstrap.with.offset=earliest

fs.uri=hdfs://master:9000
writer.fs.uri=${fs.uri}
state.store.fs.uri=${fs.uri}

mr.job.root.dir=/gobblin-kafka/working
state.store.dir=/gobblin-kafka/state-store
task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
data.publisher.final.dir=/gobblintest/job-output

I think the error is because my KafkaAvroSource is not right. Can you show me details on how to use KafkaAvroSource and how to modify my custom KafkaAvroSource class?

Zhenhua Cao

May 15, 2016, 4:03:43 AM
to gobblin-users


On Sunday, May 15, 2016 at 10:49:53 AM UTC+8, Zhenhua Cao wrote:
        I want to ETL Kafka data into Hadoop in the data format below:

      Obj avro.schema {"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":"int"},{"name":"favorite_color","type":"string"}]}
      [binary Avro record data: "Alyssa yellow", "Ben red", "Charlie blue" repeated]

Sahil Takiar

May 17, 2016, 12:41:17 PM
to Zhenhua Cao, gobblin-users
If you just want to copy your data byte-for-byte, you can use KafkaSimpleSource instead; KafkaAvroSource is mainly useful for Avro data that needs to be partitioned by some column. For testing purposes, KafkaSimpleSource is probably your best bet.

KafkaSimpleSource also doesn't require a schema registry, so you don't need to worry about that either.
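
A minimal sketch of a KafkaSimpleSource job config along those lines (broker and topic are placeholders; defaults cover the rest):

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka
kafka.brokers=<broker-host>:9092
topic.whitelist=<topic>
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
data.publisher.type=gobblin.publisher.BaseDataPublisher

Note there is no kafka.schema.registry.url and no source.schema.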


Zhenhua Cao

May 18, 2016, 9:42:19 AM
to gobblin-users, caozhen...@gmail.com
I want to get the file format below; how do I configure the job properties file?

Obj avro.schema {"type":"record","name":"TUPLE_0","fields":[{"name":"column1","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column2","type":["null","string"],"doc":"autogenerated from Pig Field Schema"},{"name":"column3","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column4","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column5","type":["null","string"],"doc":"autogenerated from Pig Field Schema"},{"name":"column6","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column7","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column8","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column9","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column10","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column11","type":["null","string"],"doc":"autogenerated from Pig Field Schema"},{"name":"column12","type":["null","string"],"doc":"autogenerated from Pig Field Schema"},{"name":"column13","type":["null","string"],"doc":"autogenerated from Pig Field Schema"},{"name":"count","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"weeksSinceEpochSunday","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"daysSinceEpoch","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column17","type":["null","int"],"doc":"autogenerated from Pig Field Schema"},{"name":"column18","type":["null","int"],"doc":"autogenerated from Pig Field Schema"}]}
[binary Avro record data follows the header]
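
The "Obj" marker and the avro.schema entry at the top of that dump are the header of an Avro Object Container File, which the standard Avro DataFileWriter produces. A minimal sketch of writing such a file (the schema JSON, record, and output path are placeholders):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

Schema schema = new Schema.Parser().parse(schemaJson);
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
fileWriter.create(schema, new File("part.avro")); // writes the "Obj" magic and avro.schema metadata
fileWriter.append(record);                        // binary-encoded records follow the header
fileWriter.close();

Getting this layout out of a Gobblin job is therefore a writer-side concern: the records have to pass through an Avro container writer rather than being copied byte-for-byte.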

My current job.pull is as below.
job.name=GobblinKafkaQuickStart
job.group=GobblinKafka
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.brokers=localhost:9092
topic.whitelist=eventOLAP
source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=avro

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=1

metrics.reporting.file.enabled=true
metrics.log.dir=/gobblin-kafka/metrics
metrics.reporting.file.suffix=txt

bootstrap.with.offset=earliest

fs.uri=hdfs://localhost:9090
writer.fs.uri=hdfs://localhost:9090
state.store.fs.uri=hdfs://localhost:9090

mr.job.root.dir=/gobblin-kafka/working
state.store.dir=/gobblin-kafka/state-store
task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
data.publisher.final.dir=/gobblintest/job-output/eventolap


On Wednesday, May 18, 2016 at 12:41:17 AM UTC+8, Sahil Takiar wrote:

Sahil Takiar

May 18, 2016, 1:47:41 PM
to Zhenhua Cao, gobblin-users
What's wrong with your current configuration file? It should work. Are you hitting an exception? If you are, can you send the stack trace?

Zhenhua Cao

May 18, 2016, 10:55:42 PM
to gobblin-users, caozhen...@gmail.com

I want to write the bytes of the Avro objects to the Avro files, just as below:


Obj avro.schema {same TUPLE_0 schema as in my previous message}
[binary Avro record data]

But the Avro file content I actually get is as below:

{"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}[2016-05-18 18:54:22,394] INFO Closing socket connection to /127.0.0.1. (kafka.network.Processor){"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}{"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}{"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}[2016-05-18 18:54:22,394] INFO Closing socket connection to /127.0.0.1. (kafka.network.Processor){"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}{"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}{"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}[2016-05-18 18:54:22,394] INFO Closing socket connection to /127.0.0.1. (kafka.network.Processor){"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}{"city":"yiggVwsCvy", "country":"BLVlDVqCFk", "create_time":1463544194, "event_count":379, "event_key":"MdtvIBrvNp", "event_name":"jsubFvQPzK","group_id":290,"open_id":1, "plat_id":"eUTRlQcXqH", "province":"GJUMcENXmA", "sex":42}

The config file content is as below:
job.name=GobblinKafkaQuickStart
job.group=GobblinKafka
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.brokers=localhost:9092
topic.whitelist=eventOLAP

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka

#writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
#writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=avro

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=4

metrics.reporting.file.enabled=true
metrics.log.dir=/gobblin-kafka/metrics
metrics.reporting.file.suffix=txt

bootstrap.with.offset=earliest

fs.uri=hdfs://master:9000
writer.fs.uri=hdfs://master:9000
state.store.fs.uri=hdfs://master:9000

mr.job.root.dir=/gobblin-kafka/working
state.store.dir=/gobblin-kafka/state-store
task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
data.publisher.final.dir=/gobblintest/job-output/eventolap


The log content is as below:

2016-05-17 21:09:38 PDT INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  197 - Launching Hadoop MR job Gobblin-GobblinKafkaQuickStart
2016-05-17 21:09:38 PDT INFO  [main] org.apache.hadoop.yarn.client.RMProxy  98 - Connecting to ResourceManager at master/172.16.68.129:8032
2016-05-17 21:09:41 PDT INFO  [main] org.apache.hadoop.mapreduce.lib.input.FileInputFormat  283 - Total input paths to process : 1
2016-05-17 21:09:41 PDT INFO  [main] org.apache.hadoop.mapreduce.JobSubmitter  198 - number of splits:1
2016-05-17 21:09:41 PDT INFO  [main] org.apache.hadoop.mapreduce.JobSubmitter  287 - Submitting tokens for job: job_1463540068725_0003
2016-05-17 21:09:42 PDT INFO  [main] org.apache.hadoop.yarn.client.api.impl.YarnClientImpl  273 - Submitted application application_1463540068725_0003
2016-05-17 21:09:42 PDT INFO  [main] org.apache.hadoop.mapreduce.Job  1294 - The url to track the job: http://master:8088/proxy/application_1463540068725_0003/
2016-05-17 21:09:42 PDT INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  207 - Waiting for Hadoop MR job job_1463540068725_0003 to complete
2016-05-17 21:09:42 PDT INFO  [main] org.apache.hadoop.mapreduce.Job  1339 - Running job: job_1463540068725_0003
2016-05-17 21:10:06 PDT INFO  [main] org.apache.hadoop.mapreduce.Job  1360 - Job job_1463540068725_0003 running in uber mode : false
2016-05-17 21:10:06 PDT INFO  [main] org.apache.hadoop.mapreduce.Job  1367 -  map 0% reduce 0%
2016-05-17 21:10:35 PDT INFO  [main] org.apache.hadoop.mapreduce.Job  1367 -  map 100% reduce 0%
2016-05-17 21:10:38 PDT WARN  [TaskStateCollectorService RUNNING] gobblin.runtime.TaskStateCollectorService  119 - Output task state path /gobblin-kafka/working/GobblinKafkaQuickStart/output/job_GobblinKafkaQuickStart_1463544573362 does not exist
2016-05-17 21:11:33 PDT INFO  [main] org.apache.hadoop.mapreduce.Job  1378 - Job job_1463540068725_0003 completed successfully
2016-05-17 21:11:33 PDT INFO  [main] org.apache.hadoop.mapreduce.Job  1385 - Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=131176
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1948
                HDFS: Number of bytes written=108346
                HDFS: Number of read operations=44
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=21
        Job Counters
                Launched map tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=83068
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=83068
                Total vcore-seconds taken by all map tasks=83068
                Total megabyte-seconds taken by all map tasks=85061632
        Map-Reduce Framework
                Map input records=1
                Map output records=0
                Input split bytes=182
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=256
                CPU time spent (ms)=3330
                Physical memory (bytes) snapshot=210235392
                Virtual memory (bytes) snapshot=719265792
                Total committed heap usage (bytes)=88080384
        File Input Format Counters
                Bytes Read=105
        File Output Format Counters
                Bytes Written=0
2016-05-17 21:11:33 PDT INFO  [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService  98 - Stopping the TaskStateCollectorService
2016-05-17 21:11:33 PDT INFO  [ParallelRunner] org.apache.hadoop.io.compress.CodecPool  181 - Got brand-new decompressor [.deflate]
2016-05-17 21:11:33 PDT INFO  [ParallelRunner] org.apache.hadoop.io.compress.CodecPool  181 - Got brand-new decompressor [.deflate]
2016-05-17 21:11:33 PDT INFO  [ParallelRunner] org.apache.hadoop.io.compress.CodecPool  181 - Got brand-new decompressor [.deflate]
2016-05-17 21:11:33 PDT INFO  [ParallelRunner] org.apache.hadoop.io.compress.CodecPool  181 - Got brand-new decompressor [.deflate]
2016-05-17 21:11:33 PDT INFO  [ParallelRunner] org.apache.hadoop.io.compress.CodecPool  181 - Got brand-new decompressor [.deflate]
2016-05-17 21:11:33 PDT INFO  [TaskStateCollectorService STOPPING] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@4ed39f12[Shutting down, pool size = 3, active threads = 1, queued tasks = 0, completed tasks = 2]
2016-05-17 21:11:33 PDT INFO  [TaskStateCollectorService STOPPING] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@4ed39f12[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 3]
2016-05-17 21:11:33 PDT INFO  [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService  147 - Collected task state of 3 completed tasks
2016-05-17 21:11:33 PDT INFO  [TaskStateCollectorService STOPPING] gobblin.runtime.JobContext  324 - 3 more tasks of job job_GobblinKafkaQuickStart_1463544573362 have completed
2016-05-17 21:11:33 PDT INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  464 - Deleted working directory /gobblin-kafka/working/GobblinKafkaQuickStart
2016-05-17 21:11:33 PDT INFO  [main] gobblin.runtime.JobContext  397 - Committing dataset  of job job_GobblinKafkaQuickStart_1463544573362 with commit policy COMMIT_ON_FULL_SUCCESS and state SUCCESSFUL
2016-05-17 21:11:33 PDT INFO  [main] gobblin.publisher.BaseDataPublisher  310 - Moving hdfs://master:9000/jobs/kafkaetl/gobblin/gobblin-kafka/task-data/job_GobblinKafkaQuickStart_1463544573362/task-output/eventOLAP/part.task_GobblinKafkaQuickStart_1463544573362_0.avro to /gobblintest/job-output/eventolap/eventOLAP/part.task_GobblinKafkaQuickStart_1463544573362_0.avro
2016-05-17 21:11:33 PDT WARN  [main] gobblin.publisher.BaseDataPublisher  199 - Branch 0 of WorkUnit task_GobblinKafkaQuickStart_1463544573362_1 produced no data
2016-05-17 21:11:33 PDT WARN  [main] gobblin.publisher.BaseDataPublisher  199 - Branch 0 of WorkUnit task_GobblinKafkaQuickStart_1463544573362_2 produced no data
2016-05-17 21:11:33 PDT INFO  [main] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@5b324d9a[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
2016-05-17 21:11:33 PDT INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@5b324d9a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-05-17 21:11:33 PDT INFO  [main] gobblin.runtime.JobContext  567 - Persisting dataset state for dataset
2016-05-17 21:11:33 PDT INFO  [main] gobblin.runtime.FsDatasetStateStore  219 - Persisting job_GobblinKafkaQuickStart_1463544573362.jst to the job state store
2016-05-17 21:11:33 PDT INFO  [main] org.apache.hadoop.io.compress.CodecPool  153 - Got brand-new compressor [.deflate]
2016-05-17 21:11:34 PDT INFO  [main] gobblin.util.JobLauncherUtils  213 - Cleaning up staging directory /jobs/kafkaetl/gobblin/gobblin-kafka/task-data/job_GobblinKafkaQuickStart_1463544573362/task-staging/eventOLAP
2016-05-17 21:11:34 PDT INFO  [main] gobblin.util.JobLauncherUtils  219 - Cleaning up output directory /jobs/kafkaetl/gobblin/gobblin-kafka/task-data/job_GobblinKafkaQuickStart_1463544573362/task-output/eventOLAP
2016-05-17 21:11:34 PDT INFO  [main] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@393d706b[Shutting down, pool size = 2, active threads = 0, queued tasks = 0, completed tasks = 2]
2016-05-17 21:11:34 PDT INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@393d706b[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 2]
2016-05-17 21:11:34 PDT INFO  [main] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@2d17ad86[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-05-17 21:11:34 PDT INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@2d17ad86[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-05-17 21:11:34 PDT INFO  [main] gobblin.runtime.app.ServiceBasedAppLauncher  165 - Shutting down the application
2016-05-17 21:11:34 PDT INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@3eb5e480
2016-05-17 21:11:34 PDT INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@3eb5e480
2016-05-17 21:11:34 PDT WARN  [Thread-7] gobblin.metrics.reporter.ContextAwareReporter  116 - Reporter OutputStreamReporter has already been stopped.


On Thursday, May 19, 2016 at 1:47:41 AM UTC+8, Sahil Takiar wrote:

Sahil Takiar

May 24, 2016, 5:31:49 PM
to Zhenhua Cao, gobblin-users
How are you writing data to your Kafka cluster? Are you writing it in Avro format?
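
For context, "writing in Avro format" on the producer side usually means binary-encoding each record before handing the bytes to the Kafka producer; if the producer sends JSON strings instead, a byte-for-byte copy lands JSON text in the output file, as above. A minimal encoding sketch (schema and record are placeholders; the actual send is elided because it depends on the producer client in use):

import java.io.ByteArrayOutputStream;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(record, encoder); // Avro binary encoding, the format DecoderFactory.binaryDecoder reads back
encoder.flush();
byte[] payload = out.toByteArray(); // hand this to the Kafka producer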

...

[Message clipped]  
