AvroTypeException: The datum is not an example of the schema - while schema is from schema-registry

howar...@miopartners.com

Feb 26, 2016, 7:18:55 PM
to Confluent Platform
Hi,

We are working on connecting Storm, Kafka and Confluent together. We are using Apache Storm with the Python wrapper "Pyleus" and the Confluent Platform (which includes Kafka).

We set up a Confluent JDBC connector to monitor a table in our DB. When a row in the DB changes, a message is published to Kafka in Avro format.

All of the above is working.

Once we receive the Avro message in a Storm bolt, we want to parse it using the Avro schema obtained from Confluent's Schema Registry.

We were able to obtain the correct schema for our message (as JSON) by using the schema registry URL:

The question is:
When we try to use this schema to parse our message, we always get an error complaining that our message is "not an example of the schema":

    AvroTypeException: The datum *Avro message in byte code* is not an example of the schema


The python-avro module we are using is from https://github.com/linkedin/python-avro-json-serializer
We tested this module with a hand-made Avro file and it works.

Any help is appreciated.
Howard

Ewen Cheslack-Postava

Mar 1, 2016, 11:21:57 AM
to Confluent Platform
Howard,

It sounds like you probably aren't parsing the header that we include, since the URL you used for the schema was "latest" instead of a specific version. You can see exactly how the data is written here: https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/AbstractKafkaAvroSerializer.java#L52

It includes a magic byte (currently 0, but this could change in the future to allow for format changes), then the schema ID as an integer, and then the serialized Avro data.
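
To make that layout concrete, here is a minimal Python 3 sketch of framing and unframing a message. It assumes the integer is 4 bytes, big-endian, as Java's ByteBuffer writes it; the names `frame` and `unframe` are made up for this example and are not part of any Confluent library.

```python
import struct

MAGIC_BYTE = 0
# 1-byte magic followed by a 4-byte big-endian integer, no padding: 5 bytes total.
HEADER = struct.Struct(">bI")


def frame(schema_id, avro_payload):
    """Prepend the header to already-serialized Avro bytes."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message):
    """Split a message into (schema_id, avro_payload), checking the magic byte."""
    if len(message) < HEADER.size:
        raise ValueError("message is shorter than the 5-byte header")
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError("unexpected magic byte: %d" % magic)
    return schema_id, message[HEADER.size:]
```

The payload returned by `unframe` is what should be handed to an Avro binary decoder, together with the schema registered under the returned ID.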

-Ewen

--
You received this message because you are subscribed to the Google Groups "Confluent Platform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to confluent-platf...@googlegroups.com.
To post to this group, send email to confluent...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/confluent-platform/86f86c0d-3fea-43a1-afe5-260874519569%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Howard Xie

Mar 1, 2016, 12:20:31 PM
to confluent...@googlegroups.com

Hi Ewen:

Thanks for your email.

According to this thread (I think you wrote it as well):

https://groups.google.com/forum/#!topic/confluent-platform/A7B6uSnJa5k

there are 4 bytes of extra information in Avro messages sent from Confluent, indicating the schema ID. So before passing my Avro message to the “serializer.to_json()” method, I sliced off the first 4 bytes. However, it didn’t work either and gave the same error. I tried slicing off the first 5 bytes as well, but no luck.

The reason I’m hesitant to fetch the schema from the schema registry every time a message comes in is performance.

I have a few questions: Do you know of an example of parsing an Avro message sent from Confluent using Python? How many bytes does the header containing the schema ID take (I tried 1, 2, 3, 4, 5)? Is there a documentation page? And do we have to use the schema ID every time a message comes in, or is there a way to get a schema once and then use it for all subsequent messages?

I’ve also posted my question here: http://stackoverflow.com/questions/35712601/avro-io-avrotypeexception-the-datum-avro-data-is-not-an-example-of-the-schema

If you’d like to see my code, I’ve put it there.

Thanks a lot!

Howard


Ewen Cheslack-Postava

Mar 1, 2016, 11:35:15 PM
to Confluent Platform
On Tue, Mar 1, 2016 at 9:20 AM, Howard Xie <Howar...@miopartners.com> wrote:

There are 4 bytes of extra information in Avro messages sent from Confluent, indicating the schema ID.


The 4 bytes referred to there are for the integer ID *after* the magic byte, so in total it should be 5.
 

So before I pass my avro message to “serializer.to_json()” method, I sliced off the first 4 bytes.


I didn't realize before which library you were using. I think you have the serialize and deserialize steps mixed up. python-avro-json-serializer takes Python objects representing a value and serializes them to Avro's JSON format using an Avro schema. You want to deserialize from Avro's binary serialized form into Python objects. Avro has a Python library that you probably want to use directly for this purpose: https://avro.apache.org/docs/1.8.0/gettingstartedpython.html (If you then wanted to get a JSON string containing that data you could use python-avro-json-serializer).
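
As a rough sketch of that deserialization step in Python 3, assuming the Apache `avro` package is installed and that the writer's schema has already been fetched as a JSON string: the function name `decode_confluent_message` is hypothetical, and the schema-parsing function is spelled `parse` in the current package but `Parse` in the older avro-python3 package, so the code looks for either.

```python
import io
import struct


def decode_confluent_message(message, writer_schema_json):
    """Return (schema_id, record) for one Confluent-framed Avro message."""
    # Imported here so this module still loads where the avro package is missing.
    import avro.io
    import avro.schema

    # 1-byte magic + 4-byte big-endian schema ID, then the Avro binary body.
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError("unexpected magic byte: %d" % magic)

    # parse() in the current avro package, Parse() in the older avro-python3.
    parse = getattr(avro.schema, "parse", None) or avro.schema.Parse
    writer_schema = parse(writer_schema_json)

    decoder = avro.io.BinaryDecoder(io.BytesIO(message[5:]))
    record = avro.io.DatumReader(writer_schema).read(decoder)
    return schema_id, record
```

The returned `schema_id` should match the ID that `writer_schema_json` was looked up under; if it does not, the wrong schema is being used to decode.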
 

However, it didn’t work either and gave the same error. I tried slicing off the first 5 bytes as well, but no luck.

The reason I’m hesitant to fetch the schema from the schema registry every time a message comes in is performance.


You should simply cache these. They aren't large and generally you'll only be working with a couple of schemas. The IDs used are immutable in the schema registry, so they can be cached forever. Applications should only need to look them up once.
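
A minimal Python 3 sketch of such a cache, using only the standard library. The class `SchemaCache` and the helper `fetch_schema_from_registry` are made-up names for illustration; the helper assumes the registry's `GET /schemas/ids/{id}` endpoint, which returns a JSON object with a `schema` field, and the registry address is whatever your deployment uses.

```python
import json
import urllib.request


def fetch_schema_from_registry(base_url, schema_id):
    """Fetch the schema JSON string registered under schema_id."""
    url = "%s/schemas/ids/%d" % (base_url.rstrip("/"), schema_id)
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))["schema"]


class SchemaCache:
    """Remember schemas by ID so each one is fetched at most once."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: schema_id -> schema JSON string
        self._schemas = {}

    def get(self, schema_id):
        if schema_id not in self._schemas:
            self._schemas[schema_id] = self._fetch(schema_id)
        return self._schemas[schema_id]
```

Usage would look something like `cache = SchemaCache(lambda i: fetch_schema_from_registry("http://localhost:8081", i))`, where the address is an assumption. Because IDs never change meaning, the cache needs no expiry.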
 

I have a few questions: Do you know of an example of parsing an Avro message sent from Confluent using Python? How many bytes does the header containing the schema ID take (I tried 1, 2, 3, 4, 5)?


Instead of just chopping off those bytes, I'd try parsing them and making sure you get the expected values. Does the first byte match the expected magic byte of 0? Do the next four, parsed as an integer, give the correct schema ID? If those match, then the rest of the deserialization process is the likely culprit.
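
That check could be sketched in Python 3 roughly as follows. The function name `check_header` is hypothetical, and the expected schema ID is whatever the registry reports for your subject.

```python
import struct


def check_header(message, expected_schema_id, expected_magic=0):
    """Return a list of problems found in the 5-byte header (empty if none)."""
    if len(message) < 5:
        return ["message is only %d bytes; the header alone needs 5" % len(message)]

    problems = []
    magic = message[0]                                # first byte
    (schema_id,) = struct.unpack(">I", message[1:5])  # next four, big-endian
    if magic != expected_magic:
        problems.append("magic byte is %d, expected %d" % (magic, expected_magic))
    if schema_id != expected_schema_id:
        problems.append("schema ID is %d, expected %d" % (schema_id, expected_schema_id))
    return problems
```

An empty result means the header looks right, which points the investigation at the decoding step instead.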
 

Is there a documentation page? And do we have to use the schema ID every time a message comes in, or is there a way to get a schema once and then use it for all subsequent messages?


With Avro you *must* have the schema originally used to write the data in order to decode the binary serialized value. You can also project to another schema if they are compatible (e.g. if you want your application to always see the data using the latest schema), but the original schema used to write the data is still required to properly decode the bytes.
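
To illustrate why, here is a toy Python 3 decoder, hand-rolled for this example and not the Avro library. It handles only `long` and `string` fields and assumes Avro's binary encoding of those types (zigzag variable-length integers, length-prefixed UTF-8 strings). The bytes carry no field names or types, so the same payload decodes to different values under a different schema.

```python
import io


def read_long(buf):
    """Read one Avro long: a variable-length, zigzag-encoded integer."""
    raw = 0
    shift = 0
    while True:
        byte = buf.read(1)
        if not byte:
            raise EOFError("ran out of bytes while reading a long")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:   # high bit clear: last byte of this integer
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)   # undo zigzag


def read_string(buf):
    """Read one Avro string: a long length, then that many UTF-8 bytes."""
    length = read_long(buf)
    data = buf.read(length)
    if len(data) != length:
        raise EOFError("ran out of bytes while reading a string")
    return data.decode("utf-8")


READERS = {"long": read_long, "string": read_string}


def decode_record(fields, payload):
    """Decode payload as a record whose fields are (name, type) pairs, in order."""
    buf = io.BytesIO(payload)
    return {name: READERS[type_name](buf) for name, type_name in fields}


payload = b"\x0e\x04ab"   # written as {"id": 7, "name": "ab"}
right = decode_record([("id", "long"), ("name", "string")], payload)
wrong = decode_record([("a", "long"), ("b", "long")], payload)
```

With the writer's field list the payload decodes to the original record; with the other field list it decodes without error to unrelated numbers and silently leaves bytes unread.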

Unfortunately, the Avro docs don't seem to include good API docs, only a simple example, but a bit of googling turns up examples like http://stackoverflow.com/a/25130722 that should be easy to adapt to your needs.

-Ewen

Ben Davison

Mar 9, 2016, 6:43:21 AM
to confluent...@googlegroups.com
Hi Howard,

Did you ever manage to fix this? We have just come up against the same issue.

Thanks,

Ben