Hi, this might be heresy on the Confluent Platform group, but as part of a solution I am working on, I was considering using Avro along with the Schema Registry. Avro seems like an excellent fit and I like the idea of a Schema Registry, but I'm hesitant about deploying it. Deploying a new service, even a relatively simple one, adds overhead and complexity. Note that the target is to use Kafka for messaging.
As an alternative to using the Schema Registry, I worked out a simple process for managing Avro schemas with git and created a small library that loads Avro schemas given one or more directory/repo paths. The API of this library is conceptually similar to the Schema Registry's, which I used as a model. The Avro schema files are stored in sub-directories organized by namespace, in files of the form schema-name.version.avsc, where schema-name matches the "name" attribute of a "record" type and version is an integer version number. Schemas are therefore versioned, with the filename encoding the version: each new version of a schema gets a new version number in the filename. Compatibility is checked before a new version of a schema is committed.
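To make the layout concrete, here is a minimal sketch (not the actual library; the function and regex names are illustrative) of how the namespace directories and schema-name.version.avsc filenames described above could be scanned:

```python
import re
from pathlib import Path

# Files follow the convention <schema-name>.<version>.avsc, stored under
# sub-directories that spell out the namespace (e.g. com/example/User.1.avsc).
FILENAME_RE = re.compile(r"^(?P<name>[A-Za-z_]\w*)\.(?P<version>\d+)\.avsc$")

def scan_schema_dir(root):
    """Yield (namespace, name, version, path) for every schema file under root."""
    root = Path(root)
    for path in sorted(root.rglob("*.avsc")):
        m = FILENAME_RE.match(path.name)
        if not m:
            continue  # ignore files that don't follow the naming convention
        # The namespace is derived from the sub-directory path, dot-joined.
        namespace = ".".join(path.parent.relative_to(root).parts)
        yield namespace, m.group("name"), int(m.group("version")), path
```

With this convention, adding a new version is just committing a new file next to the old ones, so git history doubles as the registry's audit log.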
When the FileSchemaRegistry is initialized, it reads all the schemas/files. The library is opinionated about subject names in that there is a direct 1-to-1 mapping between schema name and subject name: the subject name is created by joining the namespace and name of the schema. All versions of a schema are stored for a subject. One difference from the Schema Registry API that might be relevant is that there is no globally unique integer ID for each schema version. You could combine (namespace, name, version) and use that as a globally unique ID, but since only a file system backs the schema store, there is no integer ID that is incremented for each newly registered schema.
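Conceptually, the API shape might look something like the sketch below. The class name is borrowed from the post, but the methods and internals are purely illustrative assumptions, not the actual library:

```python
class FileSchemaRegistry:
    """Illustrative sketch: maps subject -> {version: schema_text},
    where subject = "<namespace>.<name>" (the 1-to-1 mapping above)."""

    def __init__(self):
        self._subjects = {}

    def register(self, namespace, name, version, schema_text):
        # Subject is derived from the schema, never chosen independently.
        subject = f"{namespace}.{name}"
        self._subjects.setdefault(subject, {})[version] = schema_text

    def versions(self, subject):
        """All known versions of a subject, ascending."""
        return sorted(self._subjects[subject])

    def get_schema(self, subject, version):
        return self._subjects[subject][version]

    def get_latest(self, subject):
        """(version, schema_text) for the highest registered version."""
        by_version = self._subjects[subject]
        latest = max(by_version)
        return latest, by_version[latest]
```

Note there is no `get_by_id(int)` lookup anywhere, which is exactly the divergence from the Schema Registry API mentioned above.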
The planned usage of Kafka is subject-per-topic, which in this case implies schema-per-topic. Given the lack of a globally unique ID and the use of a subject/schema per topic, I wasn't planning on prepending the one-byte magic value and four-byte integer schema ID to each Avro message that is sent. Something like that could be done if I maintained another file mapping (schema, version) -> ID, but this would add more complexity and be fragile. Therefore, the schema ID is not sent with the message.
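For contrast, this is the framing the Confluent serializers apply and that the file-based approach would omit (a sketch of the wire format only; under schema-per-topic the consumer would instead resolve the writer schema from the topic name):

```python
import struct

MAGIC_BYTE = 0  # Confluent wire-format marker

def frame_with_id(schema_id, avro_payload):
    """Confluent-style framing: 1-byte magic value, 4-byte big-endian
    schema ID, then the Avro-encoded bytes."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message):
    """Split a framed message back into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Confluent-framed message")
    return schema_id, message[5:]
```

Dropping these five bytes is what ties the design to schema-per-topic: without an embedded ID, the topic is the only clue a consumer has for picking the writer schema.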
So now that I've implemented this, I'm wondering whether it is a valid approach. I've searched for blogs/articles on similar approaches that don't use the Schema Registry, but I haven't found any, which makes me a little cautious.
I would be interested in hearing from folks more knowledgeable and experienced on this subject than me: is this "lightweight", file-based schema registry destined to fail in ways I haven't realized yet?
Regards --Roland