To use or not use the Schema Registry


Roland Hochmuth

Feb 6, 2017, 2:52:14 PM
to Confluent Platform
Hi, this might be heresy on the Confluent Platform group, but as part of a solution I am working on, I was considering using Avro as well as the Schema Registry. Avro seems like an excellent fit, and I like the idea of a Schema Registry, but I am hesitant about deploying it. Deploying a new service, even a relatively simple one, adds overhead and complexity. Note that the target is to use Kafka for messaging.

As an alternative to using the Schema Registry, I worked out a simple process for managing Avro schemas with git and created a small library that can load Avro schemas given one or more directory/repo paths. The API of this library is conceptually similar to the Schema Registry, which I used as a model. The Avro schema files are stored in sub-directories organized by namespace. The schemas are stored in files of the form schema-name.version.avsc, where schema-name is the name of the schema and matches the "name" attribute of a "record" type, and version is an integer version number. Schemas are therefore versioned, with the filename encoding the version: each new version of a schema gets a new version number in its filename. Compatibility is checked prior to committing a new version of the schema.
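The filename convention above can be sketched as a small parser; the function names and the glob-based directory scan here are illustrative only, not the actual library's API:

```python
import re
from pathlib import Path

# Filenames take the form <schema-name>.<version>.avsc, e.g. "Alarm.2.avsc".
FILENAME_RE = re.compile(r"^(?P<name>.+)\.(?P<version>\d+)\.avsc$")

def parse_schema_filename(filename):
    """Split a versioned schema filename into (schema_name, version)."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"not a versioned schema file: {filename}")
    return m.group("name"), int(m.group("version"))

def latest_versions(directory):
    """Return {schema_name: highest_version} for one namespace directory."""
    latest = {}
    for path in Path(directory).glob("*.avsc"):
        name, version = parse_schema_filename(path.name)
        if version > latest.get(name, 0):
            latest[name] = version
    return latest
```

With this layout, registering a new schema version is just adding `Alarm.3.avsc` next to `Alarm.2.avsc` and committing.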

When the FileSchemaRegistry is initialized, it reads all the schemas/files. The library is opinionated about subject names in that there is a direct 1-to-1 mapping between schema name and subject name: the subject name is created by joining the namespace and the name of the schema. All versions of the schema are stored for a subject. One difference from the Schema Registry API that might be relevant is that there is no globally unique integer ID for each unique schema version. You can combine (namespace, name, version) and use that as the globally unique ID, but since only a file system backs the schema store, there isn't an integer ID that is incremented for each newly registered schema.
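A minimal in-memory sketch of that subject mapping; the class and method names here are illustrative, not the library's actual API:

```python
class FileSchemaRegistry:
    """Sketch of a file-backed registry: subjects map 1-to-1 to schemas,
    keyed by '<namespace>.<name>', with every version kept in memory."""

    def __init__(self):
        self._subjects = {}  # subject -> {version: schema_json}

    @staticmethod
    def subject_name(namespace, name):
        # The subject is the namespace joined with the record name.
        return f"{namespace}.{name}"

    def register(self, namespace, name, version, schema_json):
        subject = self.subject_name(namespace, name)
        self._subjects.setdefault(subject, {})[version] = schema_json

    def get(self, subject, version=None):
        versions = self._subjects[subject]
        if version is None:  # default to the latest registered version
            version = max(versions)
        return versions[version]
```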

The planned usage of Kafka is subject-per-topic, which in this case implies schema-per-topic. As a consequence of the lack of a globally unique ID, and of using a subject/schema per topic in Kafka, I wasn't planning on supplying the one-byte magic value and 4-byte integer ID encoded as a header with each Avro message that is sent. Something like this could be done if I managed another file to map (schema, version) -> ID, but that would add more complexity and be fragile. Therefore, the schema ID is not sent with the message.
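For reference, the framing the Confluent serializers use is a magic byte of 0 followed by the schema ID as a 4-byte big-endian integer, then the Avro-encoded payload. A small sketch of framing and unframing (function names are illustrative):

```python
import struct

MAGIC_BYTE = 0  # Confluent wire format: 1 magic byte + 4-byte big-endian ID

def frame_message(schema_id, avro_payload):
    """Prefix an Avro-encoded payload with the 5-byte header."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe_message(message):
    """Split a framed message back into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte")
    return schema_id, message[5:]
```

Skipping this header means consumers must resolve the schema from the topic name alone, which works only as long as the subject-per-topic convention holds.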

So now that I've implemented this, I'm wondering whether it is a valid approach. I've searched for blogs/articles on similar approaches that don't use the Schema Registry, but I haven't found any, which makes me feel a little cautious.

I would be interested in hearing from folks more knowledgeable and experienced on this subject than me: is this "lightweight" file-based schema registry destined to fail in ways I haven't realized yet?

Regards --Roland

Roland Hochmuth

Feb 7, 2017, 3:54:40 PM
to Confluent Platform
I just realized something related to my previous post that I thought was worth mentioning. Because I'm using a git repo to store all the schemas, I can use the datetime at which each schema was first added/committed and then derive a unique integer schema ID from the sorted order, similar to the Confluent Schema Registry. I think the order in which new schemas are registered in the Confluent Schema Registry is how its schema IDs are created, but I haven't tested that yet.
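One way to sketch that ID assignment, assuming the first-commit timestamps have already been collected per schema file (for example via `git log --diff-filter=A --format=%at -- <path>`), and assuming history is never rewritten, since a rebase would renumber everything:

```python
def assign_schema_ids(first_commit_times):
    """Given {schema_path: unix_timestamp_of_first_commit}, assign
    integer IDs in commit order.  Ties on the timestamp are broken by
    path so the assignment is deterministic.  (Illustrative sketch:
    the timestamps are assumed to come from the git history.)"""
    ordered = sorted(first_commit_times.items(), key=lambda kv: (kv[1], kv[0]))
    return {path: i + 1 for i, (path, _) in enumerate(ordered)}
```

Note the IDs are only stable as long as the repo's history is append-only; rewriting or squashing commits would silently shift them.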

So, I'm in the process of converting my code to supply the one-byte magic key and 4-byte schema ID in the serialized Kafka messages, to match the Confluent wire format.

I guess I'm going through a lot of trouble to avoid deploying a Schema Registry, but if I decide to switch from this "lightweight" registry to the Confluent one, I think I'll be in reasonable shape to do so at a later date. Currently, I have a class called GitSchemaRepo, which could theoretically be replaced with a ConfluentSchemaRepo class in the future.

Still hope that I'm not destined for a fatal flaw though.

--Roland

Gwen Shapira

Feb 8, 2017, 12:57:10 AM
to confluent...@googlegroups.com
It almost sounds like you are building a schema registry, except you use Git as the storage mechanism instead of Kafka, with the benefit that someone else is running Git for you.

I am curious whether you are leveraging the fact that Git is your back end in other ways. How do you use your schema registry in your development workflow?

I know that teams that use Git for development often have their Avro schemas in their Git repos anyway (since the Maven integration is convenient) and often do code reviews for the schemas; they like the idea of running compatibility checks with the schema registry as part of their development workflow. Are you doing anything like that?

Gwen



--
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog

Andrew Otto

Feb 8, 2017, 9:07:07 AM
to confluent...@googlegroups.com
In general I like this idea. Plain ol' git means you don't have to run and maintain a separate service, and the schemas are all there to view in a text editor/IDE. At WMF, teams and volunteers can be pretty disjoint and sometimes don't have access to internal services. Everybody can use git :)

We do a similar thing (probably not as well), although we use JSONSchema much more than Avro.



Roland Hochmuth

Feb 8, 2017, 4:44:31 PM
to Confluent Platform
Hi Gwen, that is exactly what I'm trying to do. In our company we have GitHub Enterprise deployed, which is well managed and backed up, at least that's what they tell us :-)

We haven't done any development with Avro or the schema registry yet. We use Kafka in the products we develop and have a lot of experience and knowledge in running it. We are in the process of adding an event/message bus, probably based on Kafka. Currently we just use JSON for messages, but this seemed like a good time to investigate the possibility of using Avro or other protocols. I believe Avro is an excellent choice, but then I needed to resolve how schemas are managed.

We haven't used git for managing Avro schemas, and we haven't done code reviews on schemas. I was thinking about running a compatibility check on schemas as a gated job in GitHub prior to commit, but since we don't have that tooling in place yet, a manual check would be required. So those are excellent suggestions, and I think we'll end up doing something similar.
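Even without the full Avro schema-resolution rules, a gated job could start by enforcing the most common backward-compatibility rule: any field added to a record schema must carry a default, so new readers can still decode old data. A deliberately tiny sketch (the function name and its scope are illustrative only, not a complete compatibility checker):

```python
import json

def added_fields_without_defaults(old_schema_json, new_schema_json):
    """Return the names of fields that the new record schema adds
    without a default value.  A non-empty result means the change is
    not backward compatible under this (simplified) rule."""
    old_fields = {f["name"] for f in json.loads(old_schema_json)["fields"]}
    new_fields = json.loads(new_schema_json)["fields"]
    return [f["name"] for f in new_fields
            if f["name"] not in old_fields and "default" not in f]
```

A CI job could diff each changed `.avsc` file against its highest committed version and fail the build when this returns any names.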

Thanks for mentioning the Maven integration. I didn't realize there was one, and it could be useful, although we are in a mixed environment with Go, Python, and Java in use.

Regards --Roland

Roland Hochmuth

Feb 8, 2017, 4:54:19 PM
to Confluent Platform
Thanks Andrew. It's nice to know that you've implemented something conceptually similar already; more validation makes me feel more confident in this approach. The other reasons you list above are also good to know.

I think I like your file/schema naming convention better than what I came up with, so I might switch. Thanks again.

Regards --Roland