H2 clustering trusts the server list passed in the URL and doesn't verify it.


Ryan

Mar 16, 2012, 11:28:29 PM
to h2-da...@googlegroups.com
Hi,

I've been playing around with H2's clustering for a couple of days to get an understanding of how it works, and I've noticed something that doesn't quite seem right to me. I wrote a very simple example to show what I mean.

http://pastebin.com/BizWgf3y

I can explain why I think that type of behavior is dangerous, but I thought I'd check whether it's intended first. It's also possible I'm overlooking something or misunderstanding the intended use of H2's clustering.
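
In case the pastebin link stops working: the example is roughly along the following lines. This is a simplified sketch rather than the exact paste, and the ports, directory names, credentials, and class name are just placeholders taken from the cluster example in the documentation.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.h2.tools.CreateCluster;
    import org.h2.tools.Server;

    public class ClusterRestartSketch {
        // The client URL simply lists both nodes.
        private static final String URL =
                "jdbc:h2:tcp://localhost:9101,localhost:9102/~/test";

        public static void main(String[] args) throws Exception {
            // Two TCP servers standing in for the two cluster nodes.
            Server node1 = Server.createTcpServer(
                    "-tcpPort", "9101", "-baseDir", "server1").start();
            Server node2 = Server.createTcpServer(
                    "-tcpPort", "9102", "-baseDir", "server2").start();

            // Initialize the cluster the way the documentation describes.
            CreateCluster.main(
                    "-urlSource", "jdbc:h2:tcp://localhost:9101/~/test",
                    "-urlTarget", "jdbc:h2:tcp://localhost:9102/~/test",
                    "-user", "sa",
                    "-serverList", "localhost:9101,localhost:9102");

            try (Connection conn = DriverManager.getConnection(URL, "sa", "");
                    Statement stat = conn.createStatement()) {
                stat.execute("CREATE TABLE IF NOT EXISTS TEST(ID INT)");
                stat.execute("INSERT INTO TEST VALUES(1)");
            }

            // Simulate node 2 dropping out of the cluster for a while.
            node2.stop();
            try (Connection conn = DriverManager.getConnection(URL, "sa", "");
                    Statement stat = conn.createStatement()) {
                // The session notices the dead node and keeps working against
                // node 1 only, so node 2 misses this row.
                stat.execute("INSERT INTO TEST VALUES(2)");
            }

            // Bring node 2 back *without* re-running CreateCluster.
            node2 = Server.createTcpServer(
                    "-tcpPort", "9102", "-baseDir", "server2").start();

            // A fresh connection trusts the server list in the URL again and
            // talks to both nodes, even though node 2 is now out of sync.
            try (Connection conn = DriverManager.getConnection(URL, "sa", "");
                    Statement stat = conn.createStatement()) {
                stat.execute("INSERT INTO TEST VALUES(3)");
            }

            node1.stop();
            node2.stop();
        }
    }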

Thomas Mueller

Mar 20, 2012, 2:46:34 AM
to h2-da...@googlegroups.com
Hi,

The main problem the current H2 clustering mechanism is trying to solve is high availability. That means you start two database servers, and even if one fails, the other can still be used. The idea is not that you manually start and stop one server and then the other; you let both servers run.
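
For reference, the documented way to set this up looks roughly like this (same ports and directory names as the example in the clustering documentation; this is just a sketch, and the class name is made up):

    import org.h2.tools.CreateCluster;
    import org.h2.tools.Server;

    public class IntendedClusterSetup {
        public static void main(String[] args) throws Exception {
            // One TCP server per copy of the database
            // (in production these are two different machines).
            Server node1 = Server.createTcpServer(
                    "-tcpPort", "9101", "-baseDir", "server1").start();
            Server node2 = Server.createTcpServer(
                    "-tcpPort", "9102", "-baseDir", "server2").start();

            // Copy the database and store the server list in both copies.
            CreateCluster.main(
                    "-urlSource", "jdbc:h2:tcp://localhost:9101/~/test",
                    "-urlTarget", "jdbc:h2:tcp://localhost:9102/~/test",
                    "-user", "sa",
                    "-serverList", "localhost:9101,localhost:9102");

            // Clients connect with both nodes in the URL:
            //   jdbc:h2:tcp://localhost:9101,localhost:9102/~/test
            // Both servers stay running; if one fails, clients continue
            // on the other.
            System.out.println("Cluster nodes started; leave both running.");
        }
    }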

I know there are many use cases that the H2 cluster mechanism doesn't solve at all, like "scalable writes" or "synchronizing changes between cluster nodes". The H2 cluster mechanism is very limited. I don't currently plan to add new features; it would probably make more sense to write a new kind of cluster mechanism that solves many more problems than the current one.

Regards,
Thomas

Ryan

Mar 21, 2012, 4:23:15 AM
to h2-da...@googlegroups.com
Thanks for the reply. Starting, stopping, and restarting H2 in my example was the only way I could think of to (roughly) simulate having one node in the cluster disconnect temporarily. That could be the result of a network disruption, or something as simple as someone rebooting one of the machines in the cluster without understanding the implications. If that happens, the node that went offline shouldn't be allowed to rejoin the cluster automatically, because it may have missed transactions while it was offline. I'm also not sure what would happen if one of the nodes failed a commit due to a constraint violation in that situation.

I'm assuming the expectation is for the node to be re-added manually using the CreateCluster tool. If that's the case, I think it would be prudent to make eviction from the cluster permanent until an administrator intervenes: once SessionRemote sets CLUSTER='', an administrator should be forced to run the CreateCluster tool to resynchronize the failed node.
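
To make that recovery path concrete, this is roughly what I'd expect the administrator to do by hand: check what the surviving node thinks the cluster looks like, and then re-run CreateCluster for the node that was dropped. A sketch, using the same placeholder ports and credentials as the documentation example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.h2.tools.CreateCluster;

    public class ClusterResyncSketch {
        public static void main(String[] args) throws Exception {
            // Ask the surviving node which servers it still considers part
            // of the cluster. After a failover this is typically ''.
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:h2:tcp://localhost:9101/~/test", "sa", "");
                    Statement stat = conn.createStatement();
                    ResultSet rs = stat.executeQuery(
                        "SELECT VALUE FROM INFORMATION_SCHEMA.SETTINGS WHERE NAME = 'CLUSTER'")) {
                if (rs.next()) {
                    System.out.println("Current cluster list: " + rs.getString(1));
                }
            }

            // Re-add the failed node by copying the database again and
            // re-installing the full server list. The documentation
            // recommends removing the stale copy on the failed node first.
            CreateCluster.main(
                    "-urlSource", "jdbc:h2:tcp://localhost:9101/~/test",
                    "-urlTarget", "jdbc:h2:tcp://localhost:9102/~/test",
                    "-user", "sa",
                    "-serverList", "localhost:9101,localhost:9102");
        }
    }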

If you end up implementing a new clustering mechanism, I would really like to see some type of fail-fast mode in addition to high availability. To clarify what I mean: if even one node in the cluster can't be reached, the whole cluster is taken offline. I think that would be useful for some types of small-business applications where budgetary constraints increase the likelihood of hardware or network failure (networks that are poorly designed by local admins and end up being prone to partitioning). In most of those cases redundant hardware is minimal and the applications are fairly low capacity. For those situations, providing data consistency at the expense of availability is easy to justify, because inconsistent or lost data ends up being more detrimental than taking the application offline temporarily while the root cause of the failure is identified and fixed.
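
In the meantime, the closest approximation I can think of is an application-level check at startup: read the CLUSTER setting and refuse to run if it doesn't list every node the application expects. A rough sketch (the expected list, URL, and credentials are placeholders I picked for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class FailFastCheckSketch {
        // The full set of nodes the application expects (placeholder values).
        private static final String EXPECTED = "localhost:9101,localhost:9102";
        private static final String URL =
                "jdbc:h2:tcp://localhost:9101,localhost:9102/~/test";

        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(URL, "sa", "");
                    Statement stat = conn.createStatement();
                    ResultSet rs = stat.executeQuery(
                        "SELECT VALUE FROM INFORMATION_SCHEMA.SETTINGS WHERE NAME = 'CLUSTER'")) {
                String current = rs.next() ? rs.getString(1) : "";
                if (!EXPECTED.equals(current)) {
                    // A node has been dropped (or the list never matched):
                    // stop instead of silently running on a partial cluster.
                    throw new IllegalStateException("Cluster is degraded: expected ["
                            + EXPECTED + "] but got [" + current + "]");
                }
                // Safe to proceed with normal work here.
            }
        }
    }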

Let me know if you'd like me to clarify anything. Hopefully I haven't misinterpreted what SessionRemote is doing.

Ryan
