Schema Registry corruption


Paul Pearcy

Dec 1, 2015, 1:31:48 AM
to Confluent Platform
I am running a build of the Schema Registry from master; the latest commit is 6cd6c30bbbc28a65a77b32af42dab8f057694f7d, from August 5th.

In my QA environment, with 3 nodes all in the same DC, my Schema Registry has somehow gotten into an inconsistent state and is stuck that way.

I have two subjects that return the same schema id via the /subjects/<subject>/versions/latest API, which is very wrong. When I request that schema id via the /schemas/ids/<id> API, it returns the schema for only one of the subjects.

https://www.dropbox.com/s/txwta0thoanrumj/Screenshot%202015-12-01%2001.22.39.png?dl=0

https://www.dropbox.com/s/0mddnjhruy5oo2s/Screenshot%202015-12-01%2001.23.06.png?dl=0

https://www.dropbox.com/s/wjhr56ec0ssmw43/Screenshot%202015-12-01%2001.23.23.png?dl=0
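For reference, this is roughly the check I am doing against the REST API (a quick Python sketch with requests; the registry URL and the two subject names are placeholders for my real ones):

import requests

# Placeholders -- substitute the real registry URL and the two affected subjects.
REGISTRY = "http://localhost:8081"
SUBJECTS = ["SubjectA-value", "SubjectB-value"]

for subject in SUBJECTS:
    # Latest registered version for the subject; the response includes the global schema id.
    latest = requests.get(REGISTRY + "/subjects/" + subject + "/versions/latest").json()
    schema_id = latest["id"]

    # Look the same id up globally. With the corruption, both subjects report the
    # same id here, but this endpoint can only ever return one schema.
    by_id = requests.get(REGISTRY + "/schemas/ids/" + str(schema_id)).json()

    print(subject, "-> id", schema_id)
    print("  latest schema:", latest["schema"])
    print("  schema by id :", by_id["schema"])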


Issues with the problematic node first surfaced with some errors like these:

[2015-11-19 08:15:47,114] INFO Wait to catch up until the offset of the last message at 98 (io.confluent.kafka.schemaregistry.storage.KafkaStore:221)

[2015-11-19 11:43:14,657] ERROR [ConsumerFetcherThread-schema-registry-1442611704002-6fe37829-0-3], Current offset 100 for partition [_schemas,0] out of range; reset offset to 0 (kafka.consumer.ConsumerFetcherThread:97)

[2015-11-19 11:43:14,657] ERROR [ConsumerFetcherThread-schema-registry-1442611704002-6fe37829-0-3], Current offset 100 for partition [_schemas,0] out of range; reset offset to 0 (kafka.consumer.ConsumerFetcherThread:97)


I then started getting other errors, e.g.:

[2015-11-19 12:45:58,007] INFO 172.31.26.108 - - [19/Nov/2015:12:45:57 -0500] "GET /schemas/ids/208 HTTP/1.1" 404 49  17 (io.confluent.rest-utils.requests:77)

[2015-11-19 15:54:46,094] INFO Wait to catch up until the offset of the last message at 99 (io.confluent.kafka.schemaregistry.storage.KafkaStore:221)

[2015-11-19 21:34:46,305] INFO 54.164.41.97 - - [19/Nov/2015:21:34:45 -0500] "POST /subjects/SessionEvent-value/versions HTTP/1.1" 500 61  502 (io.confluent.rest-utils.requests:77)


I restarted the problematic node and things appeared to heal, but the underlying data is now inconsistent, as described above.


Currently, none of the events encoded with the corrupted subject can be decoded.


Please let me know if there are any details I can provide that can help debug the issue. 


Also, let me know if there is a recommended way to fix this. For the currently corrupted messages, the only option I know of is to create a custom decoder to hack around the corruption.
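To make that concrete, the hack I have in mind is roughly this (a Python sketch with fastavro; the bad id, the stand-in schema fields, and the lookup callback are all hypothetical -- the only firm assumption is the Confluent wire format of a 0 magic byte, a 4-byte big-endian schema id, then the Avro payload):

import io
import struct

from fastavro import parse_schema, schemaless_reader

# Hypothetical stand-ins: the cross-wired schema id and the schema the affected
# subject should actually be decoded with (the fields here are placeholders).
BAD_ID = 221
CORRECT_SCHEMA = parse_schema({
    "type": "record",
    "name": "SessionEvent",
    "fields": [{"name": "sessionId", "type": "string"}],
})

def decode(message_bytes, lookup_schema_by_id):
    # Confluent framing: 1 magic byte (0), 4-byte big-endian schema id, Avro payload.
    magic, schema_id = struct.unpack(">bI", message_bytes[:5])
    if magic != 0:
        raise ValueError("not a Confluent-framed Avro message")

    if schema_id == BAD_ID:
        # Hack around the corrupted mapping instead of trusting the registry.
        writer_schema = CORRECT_SCHEMA
    else:
        # Normal path: fetch/cache the schema from the registry by id.
        writer_schema = lookup_schema_by_id(schema_id)

    return schemaless_reader(io.BytesIO(message_bytes[5:]), writer_schema)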


Thanks,

Paul

Paul Pearcy

Dec 1, 2015, 1:52:22 AM
to Confluent Platform
Also worth noting: I just dumped my _schemas topic and see schema id creation occurring as expected, but then it goes 207 -> 221 -> 208 and continues, and 221 is then re-used.
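For anyone who wants to run the same check, this is roughly how I dumped it (a sketch with kafka-python; the broker address is a placeholder, and I am assuming the registry's usual layout of JSON values carrying subject, version, and id fields):

import json
from kafka import KafkaConsumer

# Placeholder broker address.
consumer = KafkaConsumer(
    "_schemas",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop once the topic has been drained
)

for msg in consumer:
    if msg.value is None:
        continue  # tombstone / compaction marker
    record = json.loads(msg.value.decode("utf-8"))
    # Schema records carry the globally assigned id; this is where the
    # 207 -> 221 -> 208 sequence, and the later re-use of 221, shows up.
    if "id" in record:
        print(msg.offset, record.get("subject"), record.get("version"), record["id"])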

Paul Pearcy

Dec 1, 2015, 10:40:48 AM
to Confluent Platform
Minor correction: I am actually running the official 1.0 Schema Registry release for the service.

I needed a release from master to work around this bug: https://github.com/confluentinc/schema-registry/pull/167

Paul Pearcy

Dec 1, 2015, 3:42:06 PM
to Confluent Platform
I think I understand more about what happened. It looks like my 3-node ZooKeeper cluster got into a split state where it was handing out counter IDs incorrectly. I am still digging into why the split-off, single-node ZooKeeper was able to give out counters; it is likely a config setting I am missing that would require a minimum quorum.
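One way to check each node's view of the ensemble is the ZooKeeper four-letter 'srvr' command, something like this sketch (hostnames/ports are placeholders): a healthy 3-node ensemble should show exactly one leader and two followers, while any standalone node, or a second leader, means it has split.

import socket

# Placeholder hostnames for the three ensemble members.
ZK_NODES = [("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]

def zk_mode(host, port):
    # Send the four-letter 'srvr' command and pull out the 'Mode:' line
    # (leader, follower, or standalone).
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(b"srvr")
        data = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    finally:
        sock.close()
    for line in data.decode("utf-8", "replace").splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"

for host, port in ZK_NODES:
    try:
        print(host, zk_mode(host, port))
    except (socket.error, socket.timeout) as exc:
        print(host, "unreachable:", exc)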

Unfortunately, a clean non-destructive recovery seems impossible. 