Journal refactorings and snapshotting

Martin Krasser

Apr 6, 2013, 3:53:55 AM

to eventsou...@googlegroups.com

Hackers,

just wanted to let you know that I started some journal refactorings on branch wip-8-snapshots. I mainly moved ReplayDone replies to SWRS (commit) and AWRS (commit). All tests are passing (both it and test). If I have still broken something (especially in the two MongoDB journals and the DynamoDB journal), please let me know.

The main work on this branch, however, will be snapshotting support. I'll make a proposal for snapshotting based on LevelDB and HBase and follow up on this thread as soon as I have something to show. The current idea is to store snapshots in a backend-specific way, i.e. with LevelDB on the local filesystem (or directly in LevelDB), with HBase on HDFS, etc.

I found the alternative of using an independent storage backend for snapshots less convenient for the user, both in terms of additional configuration complexity and in terms of different QoS for event storage and snapshot storage. Backend-specific snapshot storage, on the other hand, is a bit more effort, as it needs to be implemented differently for each journal.

Thoughts?

Cheers,
Martin

-- 
Martin Krasser

blog:    http://krasserm.blogspot.com
code:    http://github.com/krasserm
twitter: http://twitter.com/mrt1nz

scott clasen

Apr 6, 2013, 5:27:02 PM

to Martin Krasser, eventsou...@googlegroups.com

Looking forward to getting snapshotting implemented! Agree with going for backend-specific snapshotting.

ddevore

Apr 7, 2013, 1:49:39 AM

to eventsou...@googlegroups.com, Martin Krasser

Agreed on all counts. We'll need some kind of generic traits, like AWRS and SWRS are to journaling, but keep the snapshots specific to the journal implementation.
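A minimal sketch of that split between a generic contract and a backend-specific implementation, written in Python for brevity. All names here (`SnapshotStore`, `FileSnapshotStore`, the file naming scheme) are hypothetical and not part of the library. The serialization functions are passed in, so the same strategy could be reused across stores.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod


class SnapshotStore(ABC):
    """Generic contract (hypothetical); each journal backend provides its own store."""

    @abstractmethod
    def save(self, processor_id, sequence_nr, state):
        ...

    @abstractmethod
    def load_latest(self, processor_id):
        ...


class FileSnapshotStore(SnapshotStore):
    """Keeps snapshots on the local filesystem, e.g. next to a local journal."""

    def __init__(self, directory, serialize=pickle.dumps, deserialize=pickle.loads):
        self.directory = directory
        self.serialize = serialize      # pluggable serialization strategy
        self.deserialize = deserialize

    def _prefix(self, processor_id):
        return f"snapshot-{processor_id}-"

    def save(self, processor_id, sequence_nr, state):
        # Zero-padded sequence numbers make lexicographic order equal numeric order.
        name = f"{self._prefix(processor_id)}{sequence_nr:020d}"
        path = os.path.join(self.directory, name)
        tmp = path + ".tmp"
        # Write to a temporary file first, then rename, so a crash in the
        # middle of a write never leaves a half-written snapshot behind.
        with open(tmp, "wb") as f:
            f.write(self.serialize(state))
        os.replace(tmp, path)

    def load_latest(self, processor_id):
        prefix = self._prefix(processor_id)
        names = sorted(
            n for n in os.listdir(self.directory)
            if n.startswith(prefix) and not n.endswith(".tmp")
        )
        if not names:
            return None  # no snapshot yet: the caller replays from scratch
        latest = names[-1]
        with open(os.path.join(self.directory, latest), "rb") as f:
            return int(latest[len(prefix):]), self.deserialize(f.read())


with tempfile.TemporaryDirectory() as directory:
    store = FileSnapshotStore(directory)
    store.save(1, 10, {"count": 10})
    store.save(1, 25, {"count": 25})
    print(store.load_latest(1))  # (25, {'count': 25})
```

An HBase-backed store would implement the same two methods against HDFS, while the processor-facing contract stays unchanged.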

Very exciting. Can't wait to see what you come up w/.

ddevore

Apr 9, 2013, 3:06:34 PM

to eventsou...@googlegroups.com, kras...@googlemail.com

The Snapshots proposal looks good.

scott clasen

Apr 9, 2013, 5:40:41 PM

to ddevore, eventsou...@googlegroups.com, Martin Krasser

Yes looks great!

Couple of questions related to snapshots in general, not really to the proposal...

Though I do think journal implementations should handle the details of storage, do you think some standardization/utils around serialization formats, etc, make sense?

In practice, how will this work if you are attempting to snapshot a processor with lots of state (say > 1GB)?

Should the snapshotter be a different actor / process than the journal?

ahjohannessen

Apr 9, 2013, 6:44:45 PM

to eventsou...@googlegroups.com

I wonder how to handle a situation like this:

- processor emits message on a default channel.
- message is not confirmed right away.
- processor does a successful snapshot.
- system crashes.

If processor replay restores from last snapshot, then we lost a message?

Use reliable channels behind snapshotted processors?

Or is there some smarter way that could take this into account, e.g. snapshot - 1 or similar, if the system did not shut down properly? I guess this one is probably not a walk in the park.
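The scenario above can be reproduced with a small simulation. This is a simplified model, not the library's behaviour: it assumes the processor emits one output message per journaled event and that, during replay, the channel drops outputs that were already confirmed before the crash. The function names are hypothetical.

```python
def redelivered_on_recovery(journal, confirmed, snapshot_seq_nr=0):
    """Sequence numbers whose output message is emitted again during recovery.

    journal         -- sequence numbers of all journaled events
    confirmed       -- sequence numbers whose output was confirmed before the crash
    snapshot_seq_nr -- replay starts after this number (0 means replay from scratch)
    """
    replayed = [n for n in journal if n > snapshot_seq_nr]
    # Assumption: the channel filters out outputs that were already confirmed.
    return [n for n in replayed if n not in confirmed]


def lost_on_recovery(journal, confirmed, snapshot_seq_nr=0):
    """Unconfirmed outputs that are never emitted again, i.e. lost messages."""
    redelivered = redelivered_on_recovery(journal, confirmed, snapshot_seq_nr)
    return [n for n in journal if n not in confirmed and n not in redelivered]


journal = [1, 2, 3, 4, 5]
confirmed = {1, 2, 4}  # the output for event 3 was still unconfirmed at crash time

# Recovery from a snapshot taken after event 4: event 3 is not replayed.
print(lost_on_recovery(journal, confirmed, snapshot_seq_nr=4))  # [3]

# Recovery from scratch: every unconfirmed output is emitted again.
print(lost_on_recovery(journal, confirmed))         # []
print(redelivered_on_recovery(journal, confirmed))  # [3, 5]
```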

Martin Krasser

Apr 10, 2013, 1:01:30 AM

to eventsou...@googlegroups.com

Hi Scott,

On 09.04.13 23:40, scott clasen wrote:

> Yes looks great!
>
> Couple of questions related to snapshots in general, not really to the proposal...
>
> Though I do think journal implementations should handle the details of storage, do you think some standardization/utils around serialization formats, etc, make sense?

Absolutely, I just didn't cover that yet. Journals should be configurable with serialization strategies (that can be re-used across journals).

> In practice, how will this work if you are attempting to snapshot a processor with lots of state (say > 1GB)?
>
> Should the snapshotter be a different actor / process than the journal?

Yes.

Thanks for your valuable feedback. I added your comments to ticket #8.

Cheers,
Martin

Martin Krasser

Apr 10, 2013, 1:20:31 AM

to eventsou...@googlegroups.com

Hi Alex,

On 10.04.13 00:44, ahjohannessen wrote:

> I wonder how to handle a situation like this:
>
> - processor emits message on a default channel.
> - message is not confirmed right away.
> - processor does a successful snapshot.
> - system crashes.
>
> If processor replay restores from last snapshot, then we lost a message?

Correct, this can happen. Very good catch.

> Use reliable channels behind snapshotted processors?

This would mitigate the risk but not completely avoid it. For example:

- processor emits message to reliable channel
- processor does a successful snapshot
- system crashes before reliable channel stores message together with ack

This is very unlikely but it can happen.

> Or is there some smarter way that could take this into account, e.g. snapshot - 1 or similar, if the system did not shut down properly? I guess this one is probably not a walk in the park.

- A practical solution is to do a replay from snapshots that are older than a certain limit (see also this post). For example, if you only recover from snapshots older than 1 hour and you expect all receipt confirmations to occur within 1 hour, you should be on the safe side.
- To completely avoid the situation you mentioned, for default and reliable channels, you could still do a replay from scratch.

I added "Several snapshots per processor and selection criteria on ReplayParams" (= recovery from older snapshots) to ticket #8. Would this support your needs?
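Recovery from older snapshots could be sketched as a selection function over snapshot metadata. This is an illustration under assumptions: `SnapshotMetadata` and `select_snapshot` are hypothetical names, and the actual selection criteria on ReplayParams may look different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SnapshotMetadata:
    # Hypothetical metadata kept for each stored snapshot.
    processor_id: int
    sequence_nr: int
    timestamp: datetime


def select_snapshot(snapshots, now, min_age):
    """Return the most recent snapshot that is at least min_age old.

    Returns None if no snapshot qualifies, in which case the caller
    falls back to a replay from scratch.
    """
    eligible = [s for s in snapshots if now - s.timestamp >= min_age]
    return max(eligible, key=lambda s: s.sequence_nr, default=None)


now = datetime(2013, 4, 10, 12, 0)
snapshots = [
    SnapshotMetadata(1, 100, datetime(2013, 4, 10, 9, 0)),
    SnapshotMetadata(1, 200, datetime(2013, 4, 10, 10, 30)),
    SnapshotMetadata(1, 300, datetime(2013, 4, 10, 11, 45)),  # too recent
]

# With a 1 hour limit the snapshot at sequence number 200 is chosen, so all
# events after 200 are replayed and their unconfirmed outputs are emitted again.
print(select_snapshot(snapshots, now, timedelta(hours=1)).sequence_nr)  # 200
```

The limit only protects against lost messages if all confirmations really arrive within it, which is the expectation stated above.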

Thanks for your valuable feedback.

Martin Krasser

Apr 10, 2013, 1:59:12 AM

to eventsou...@googlegroups.com

On 09.04.13 21:06, ddevore wrote:

> The Snapshots proposal looks good.

Glad that you like it, thanks for reviewing.

Martin Krasser

Apr 10, 2013, 7:50:16 AM

to eventsou...@googlegroups.com

Here's an example of what is now supported on branch wip-8-snapshots.
