EC2 spot

Nathan ToddStone

Oct 20, 2016, 1:35:19 PM
to Onyx
ZooKeeper aside, is there a sane way to run a cluster partially or fully on spot instances?

Perhaps by taking advantage of the two-minute spot termination warning, though not relying on it. Or even persisting EBS volumes from one spot instance to the next across a termination event?
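
For catching the warning, I have in mind something like this sketch: poll the instance metadata service for the termination notice and start draining when it appears. The metadata path and poll interval here are my assumptions:

    import time
    import urllib.request
    import urllib.error

    # EC2 exposes the spot termination notice via instance metadata.
    # The path returns 404 until a termination is scheduled, then
    # returns the termination time (roughly two minutes out).
    URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

    def wait_for_termination_notice(poll_seconds=5):
        while True:
            try:
                with urllib.request.urlopen(URL, timeout=2) as resp:
                    return resp.read().decode()  # e.g. "2016-10-20T20:35:00Z"
            except urllib.error.HTTPError as err:
                if err.code != 404:
                    raise
            time.sleep(poll_seconds)

    # wait_for_termination_notice(); then drain work, snapshot EBS, etc.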

Alternatively, and more simply: just let the instances die, autoscale back up with spot instances from a different zone or of a different type, and Onyx figures it out?

This would be for less critical work, i.e. where reduced cost matters more than low/consistent latency and the other nondeterministic nonsense brought by the spot market gods.
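
By autoscaling back up I picture something like a spot fleet spread across zones and instance types. A rough boto3 sketch, where the AMI, role ARN, price, and instance types are all placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # Hypothetical fleet: same AMI, two instance types in two zones,
    # so a price spike or termination in one pool doesn't take out
    # the whole cluster at once.
    ec2.request_spot_fleet(
        SpotFleetRequestConfig={
            "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",
            "TargetCapacity": 4,
            "AllocationStrategy": "diversified",
            "SpotPrice": "0.10",
            "LaunchSpecifications": [
                {"ImageId": "ami-12345678", "InstanceType": "m4.large",
                 "Placement": {"AvailabilityZone": "us-east-1a"}},
                {"ImageId": "ami-12345678", "InstanceType": "c4.large",
                 "Placement": {"AvailabilityZone": "us-east-1b"}},
            ],
        }
    )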

Mike Drogalis

Oct 20, 2016, 1:39:10 PM
to Nathan ToddStone, Onyx
On Thu, Oct 20, 2016 at 10:35 AM, Nathan ToddStone <m...@nathants.com> wrote:
> ZooKeeper aside, is there a sane way to run a cluster partially or fully on spot instances?
>
> Perhaps by taking advantage of the two-minute spot termination warning, though not relying on it. Or even persisting EBS volumes from one spot instance to the next across a termination event?
>
> Alternatively, and more simply: just let the instances die, autoscale back up with spot instances from a different zone or of a different type, and Onyx figures it out?

^-- This is how it works. No changes need to be made to run Onyx on spot instances. An Onyx peer can be terminated, and the cluster will continue to make progress. It can recover even if all nodes are terminated, once any node comes back online. The only place state is kept is ZooKeeper, and you're safe there as long as a majority of its nodes are up. If ZooKeeper loses its majority, work halts.
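
To put a number on "majority" (plain arithmetic, nothing Onyx-specific):

    def quorum(ensemble_size: int) -> int:
        # ZooKeeper makes progress while a strict majority of the
        # ensemble is reachable: an ensemble of n nodes tolerates
        # n - (n // 2 + 1) failures.
        return ensemble_size // 2 + 1

    assert quorum(3) == 2  # a 3-node ensemble tolerates 1 failure
    assert quorum(5) == 3  # a 5-node ensemble tolerates 2 failures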
 

Nathan ToddStone

Oct 20, 2016, 3:00:34 PM
to Onyx, m...@nathants.com
First of all: bravo, sir. Bravo. Not much else to say. If I could tip my hat in a text editor, I would. In fact, I am right now.

I've probably missed this in the docs or code, but what about local state? RocksDB? Other stuff?

Is it just abandoned and recomputed from the top, from scratch? Does magic happen? Is this documented somewhere that I've missed?

Mike Drogalis

Oct 20, 2016, 3:11:04 PM
to Nathan ToddStone, Onyx
I should have been a little more careful with how I answered that. In addition to ZooKeeper, a majority of BookKeeper nodes must remain alive, too, if you're using windowing. It's quite similar to ZK in its quorum semantics, except it's designed for high write throughput.

Onyx records window contents as a state machine of incremental updates, which are captured by BookKeeper. BookKeeper replicates the window contents across its servers so that they're fault tolerant. Onyx retains local state in RocksDB to ensure that duplicates are thrown away, but that state is rebuilt from checkpoints at boot time, so it can safely be thrown away and recomputed across a shutdown and restart. RocksDB is pretty much invisible to the developer.
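
The pattern is roughly this, as a sketch with hypothetical names (not our actual internals):

    # Hypothetical sketch of the duplicate-filtering pattern described
    # above, not Onyx's actual internals. Local state is disposable
    # because it can always be rebuilt from the durable checkpoint.
    class DedupeFilter:
        def __init__(self):
            self.seen = set()  # stands in for the local RocksDB instance

        def rebuild(self, checkpointed_ids):
            # On boot, replay the durable checkpoint (BookKeeper, in
            # Onyx's case) to repopulate state lost with the instance.
            self.seen = set(checkpointed_ids)

        def admit(self, segment_id):
            # True for first-seen segments, False for duplicates.
            if segment_id in self.seen:
                return False
            self.seen.add(segment_id)
            return True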

This section of the user guide discusses how we record state updates. It should answer most questions, but I'm also happy to expand on it if needed.


Nathan ToddStone

Oct 20, 2016, 4:22:49 PM
to Onyx, m...@nathants.com
Thanks for the clarification. Re-reading that section, searching for "fail" and jumping to each instance, does verify that everything you told me is there in the docs. This part especially, fault tolerance, is almost exactly what I was looking for.

What I personally would love to see in the guide is a top-level section on failure: some high-level, story-style overviews that outline how the system handles various failure scenarios under various processing loads. What happens in a netsplit, what happens when half your nodes get spot-terminated, what happens when batch/stream/agg/etc., what happens when X. Not even getting into the specifics, since those are all pretty well covered, but just hitting the high notes in few words. The interested reader could then dig deeper and understand, from the docs that already exist, all the details of how those stories play out on the ground.

Thanks again! Top-notch stuff. Keep it up...

Mike Drogalis

Oct 20, 2016, 4:29:26 PM
to Nathan ToddStone, Onyx
Thanks. We're overhauling the streaming engine for 0.10.0, and we expect to have a document like the one you describe, since there will be a lot of new things.
