Ungraceful Shutdown in Kubernetes causes Corrupt Database Exception


Paul Connolly

Apr 16, 2020, 1:36:06 PM
to Event Store
We recently scaled our Event Store cluster down from 5 nodes to 0 as a test.

Node 0 failed to come back up, logging the following:

[00001,01,15:16:01.561] "WRITER CHECKPOINT:" 12747737373 (0x2F7D3091D)
[00001,01,15:16:01.567] "CHASER CHECKPOINT:" 12795330217 (0x2FAA93EA9)
[00001,01,15:16:01.567] "EPOCH CHECKPOINT:" 12745752545 (0x2F7B4BFE1)
[00001,01,15:16:01.567] "TRUNCATE CHECKPOINT:" -1 (0xFFFFFFFFFFFFFFFF)
[00001,01,15:16:01.749] MessageHierarchy initialization took 00:00:00.1641696.
[00001,01,15:16:01.782] Unhandled exception while starting application: EXCEPTION OCCURRED Corrupt database detected.
[00001,01,15:16:01.801] "Corrupt database detected. Checkpoint 'chaser' has greater value than writer checkpoint."


While we were able to recover from the other 4 nodes, is there anything we can do to prevent or repair this?
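
For context on the error: the writer checkpoint records how far the transaction log has been physically written, and the chaser follows behind it re-reading those records, so the chaser should never be ahead of the writer. One rough way to see what a failed node thinks its positions are is to read the checkpoint files directly. A minimal sketch in Python, assuming the usual on-disk layout where writer.chk and chaser.chk each hold a single little-endian 64-bit offset (the /var/lib/eventstore path is illustrative):

    import struct
    from pathlib import Path

    # Illustrative data directory; point this at the node's volume mount.
    DB_DIR = Path("/var/lib/eventstore")

    def read_checkpoint(name: str) -> int:
        # Assumption: each .chk file stores one little-endian signed 64-bit offset.
        data = (DB_DIR / name).read_bytes()
        return struct.unpack("<q", data[:8])[0]

    writer = read_checkpoint("writer.chk")
    chaser = read_checkpoint("chaser.chk")
    print(f"writer={writer} chaser={chaser} chaser-writer={chaser - writer}")
    if chaser > writer:
        print("Invariant broken: chaser is ahead of the writer (the startup error above).")

In principle the invariant could be restored by copying writer.chk over chaser.chk so the chaser replays from the writer's position, but since the root cause may be that the writer checkpoint itself failed to flush, rebuilding the node from a healthy cluster member (as was done here) is the safer repair.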

Greg Young

Apr 16, 2020, 1:43:38 PM
to event...@googlegroups.com
It sounds like something odd has happened... How did you take the nodes down? I would usually say it smells like caching...

But ...

The values are quite far apart! The chaser checkpoint is almost 50 MB ahead of the writer checkpoint... could the files have been copied over at some point?

12747737373 - 12795330217 = -47592844?!



--
Studying for the Turing test

Paul Connolly

Apr 16, 2020, 1:47:15 PM
to Event Store
Basically, the nodes were scaled down in GKE (Google Kubernetes Engine) from 5 to 0. This took the nodes down in the order 4, 3, 2, 1, and then 0, and node 0 was the one that came back up corrupt.
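
On the prevention side, the usual culprit with stateful workloads on Kubernetes is the pod being SIGKILLed before the process has flushed and exited: by default the kubelet waits only 30 seconds after SIGTERM before killing the container. A minimal sketch of the relevant StatefulSet fields, with illustrative names and image tag, that gives each node more time to shut down cleanly:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: eventstore                  # illustrative name
    spec:
      serviceName: eventstore
      replicas: 5
      selector:
        matchLabels:
          app: eventstore
      template:
        metadata:
          labels:
            app: eventstore
        spec:
          # Default is 30s; after this the kubelet sends SIGKILL, which can
          # leave the checkpoint files out of step with the transaction log.
          terminationGracePeriodSeconds: 120
          containers:
            - name: eventstore
              image: eventstore/eventstore:release-5.0.8   # illustrative tag

A StatefulSet scale-down (kubectl scale statefulset eventstore --replicas=0) terminates the pods one at a time in reverse ordinal order, which matches the 4, 3, 2, 1, 0 sequence above; each pod still gets only the configured grace period before it is killed.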



Greg Young

Apr 16, 2020, 1:49:24 PM
to event...@googlegroups.com
Are any backups/restores/file copies/etc. being done?


Paul Connolly

Apr 16, 2020, 2:11:03 PM
to Event Store
Nothing has been touched on those nodes since day one. It's a dev environment, so we don't run any kind of backups or even have access to the disks.