What you write cannot happen with Galera. n3 will notice that it is
disconnected from the main partition and it will stop accepting
queries until it is able to reconnect and get back in sync.
If you are talking microseconds, then the issue is different. You want
to familiarize yourself with the wsrep_causal_reads variable
(http://www.codership.com/wiki/doku.php?id=mysql_options_0.8)
henrik
> --
> You received this message because you are subscribed to the Google Groups
> "codership" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/codership-team/-/nSHkSMVQZwcJ.
> To post to this group, send email to codersh...@googlegroups.com.
> To unsubscribe from this group, send email to
> codership-tea...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/codership-team?hl=en.
--
henri...@avoinelama.fi
+358-40-8211286 skype: henrik.ingo irc: hingo
www.openlife.cc
My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559
Yes. Basically all SELECTs will wait for their pending slave queue to
be applied before they read. There is no additional network
communication with other nodes, just a delay before the read can be
"securely" executed.
Note that if read-causality is important to you (for most applications
I would say that an inconsistency of at most some milliseconds is not
an issue) then another way around this is to direct all writes and
reads to a single node.
> Exactly how long does it take for n3 to notice it has lost the cluster? My
> cursory reading would indicate it would time all other nodes out
> after inactive_timeout, which is bad since that's 15 seconds (longer than the
> main cluster takes to time n3 out). Is that incorrect?
Note that you can configure these values to suit your own needs. The
defaults are fairly lax, so making these timeouts tighter is certainly
an option.
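As a rough sketch of what "tighter" could look like: the helper below only builds a value for wsrep_provider_options. The function name is hypothetical, the option names evs.suspect_timeout and evs.inactive_timeout and the ISO 8601 duration format are my understanding of the Galera provider options, and the check that the suspect timeout stays below the inactive timeout is an assumption. Verify all of it against the documentation for your version.

```python
def provider_options(suspect_s: int, inactive_s: int) -> str:
    """Build a wsrep_provider_options value with custom EVS timeouts.

    Hypothetical helper. The ordering check below is an assumption,
    not a documented rule quoted from this thread.
    """
    if not 0 < suspect_s < inactive_s:
        raise ValueError("expected 0 < suspect_s < inactive_s")
    # Galera durations are written as ISO 8601 periods, e.g. PT5S = 5 seconds.
    return (f"evs.suspect_timeout=PT{suspect_s}S; "
            f"evs.inactive_timeout=PT{inactive_s}S")


# Example: tighten the 5 s / 15 s values discussed above to 2 s / 6 s.
print(provider_options(2, 6))
```

The resulting string would go into wsrep_provider_options in my.cnf. Very tight values can cause spurious evictions on a flaky network, so treat the numbers as placeholders.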
henrik
Exactly how long does it take for n3 to notice it has lost the cluster? My cursory reading would indicate it would time all other nodes out after inactive_timeout, which is bad since that's 15 seconds (longer than the main cluster takes to time n3 out). Is that incorrect?
- Teemu
No, no. When n3 loses its connection, it immediately cannot commit any new
transactions. The commit is synchronous and hard-wired into the group
communication, so if you can't talk to the primary component, you
cannot commit a single transaction (with the default setting of
pc.ignore_sb). As I understand it, the suspect_timeout is more like
the time after which a node gives up trying.
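To illustrate the "primary component" idea, here is a minimal sketch that assumes every node carries equal weight and that a component is primary when it holds a strict majority. The function names are hypothetical, and real Galera quorum handling has more to it (weights, graceful leaves), so read this as the steady-state rule only, not as the detection timing.

```python
def is_primary(reachable: int, cluster_size: int) -> bool:
    """True if a component of `reachable` nodes out of `cluster_size`
    holds a strict majority (assumes equal node weights)."""
    return 2 * reachable > cluster_size


def can_commit(reachable: int, cluster_size: int) -> bool:
    """A node may commit only while it belongs to the primary component."""
    return is_primary(reachable, cluster_size)


# Three-node cluster, n3 cut off from n1 and n2:
print(can_commit(1, 3))  # n3 alone
print(can_commit(2, 3))  # n1 + n2
```

With n3 isolated, its side has one node out of three and cannot commit, while the n1/n2 side keeps the majority and carries on.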
> I think galera looks very cool, it's very close to what we want and WAY
> ahead of 'nosql' projects I'm looking at like mongodb. However, it looks
> like the 'causal reads' were added on as an afterthought, and it still
> doesn't look like that is guaranteed cluster wide. If galera can provide a
> guarantee of 'if the cluster returned success to any write transaction, then
> any read initiated after that time will see it, regardless of what node it's
> using', that would work for us. However, it appears to me that guarantee
> will not hold during fail over conditions.
You are lumping 2 separate things together.
Transactions are committed to the cluster, not just a single node. The
sequence of committed transactions is well defined across the cluster:
no transaction is ever committed on just one node; it is always
committed to the primary component.
However, transactions are not synchronously *applied* to the InnoDB
tablespace. So they exist on all nodes at commit time, and a
certification algorithm guarantees that they are able to apply, i.e.
they do not conflict with any other to-be-applied transactions, but
they are not yet visible if you read from the InnoDB table. So the
causal reads feature is there to bridge this small delay. If you want
to ensure that you really read the results that were committed (via
any node) at the start of your current transaction, then Galera gives
you this guarantee by looking at the queue of
committed-but-not-yet-applied transactions, waiting for it to be
applied, and then executing the read.
Most applications are fine with that level of inconsistency, at least
for most reads. But you can have causal reads if you need them (and
it's a completely legitimate request of course).
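The difference between a plain read and a causal read can be sketched with a toy, single-threaded model. The class and method names are hypothetical and this is not Galera's implementation: in a real node a separate applier thread drains the queue and the read blocks, whereas here the read drains the queue itself to keep the example runnable.

```python
from collections import deque


class Node:
    """Toy node: write sets are committed cluster-wide (queued here)
    before they are applied to local storage."""

    def __init__(self):
        self.data = {}
        self.applied_seqno = 0
        self.queue = deque()  # (seqno, key, value): certified, not yet applied

    def receive(self, seqno, key, value):
        self.queue.append((seqno, key, value))

    def apply_one(self):
        seqno, key, value = self.queue.popleft()
        self.data[key] = value
        self.applied_seqno = seqno

    def read(self, key, causal=False):
        if causal:
            # Remember what was committed when the read started,
            # then wait (here: apply) until the node has caught up.
            target = self.queue[-1][0] if self.queue else self.applied_seqno
            while self.applied_seqno < target:
                self.apply_one()
        return self.data.get(key)


node = Node()
node.receive(1, "balance", 100)
print(node.read("balance"))               # not applied yet: stale read
print(node.read("balance", causal=True))  # waits for the queue first
```

The plain read returns whatever is already applied locally; the causal read sees the committed value because it waits for the pending queue first.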
Note that performance-wise there seems to be a faster implementation
available once Galera moves to support MySQL 5.6. The technique is
described in this blog post using global transaction IDs from
MySQL 5.6.
Same concept could be used for Galera. The benefit would be that
instead of waiting to apply the queue that exists at the start of your
current transaction, you would know the transaction id of your last
commit (for this application thread) and could start executing the
next read earlier.
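A sketch of that idea, again as a toy model with hypothetical names and not an existing Galera or MySQL API: the application thread remembers the sequence number of its own last commit, and the replica applies only up to that point before serving the read.

```python
from collections import deque


class Replica:
    """Toy replica that can wait for one specific seqno instead of
    draining its whole pending queue."""

    def __init__(self):
        self.data = {}
        self.applied_seqno = 0
        self.pending = deque()  # (seqno, key, value), in commit order

    def receive(self, seqno, key, value):
        self.pending.append((seqno, key, value))

    def apply_until(self, seqno):
        # Apply only as far as the caller needs.
        while self.pending and self.applied_seqno < seqno:
            s, k, v = self.pending.popleft()
            self.data[k] = v
            self.applied_seqno = s

    def read_after(self, key, last_commit_seqno):
        self.apply_until(last_commit_seqno)
        return self.data.get(key)


replica = Replica()
replica.receive(1, "x", "mine")     # this session's own commit
replica.receive(2, "y", "other")    # later commits from other sessions
replica.receive(3, "z", "other")
print(replica.read_after("x", last_commit_seqno=1))
print(len(replica.pending))         # later write sets are still queued
```

The read returns as soon as seqno 1 is applied; the two later write sets stay in the queue, which is where the saving over "wait for everything" would come from.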
> And no, putting logic into the load balancer to 'only use one node' strikes
> me as equally risky. How do you know all active sessions of a load balancer
> get moved/terminated as a unit? You know in real life there could be
> connections to all nodes.
It depends on the application. In many applications it is the case
that not all transactions need be consistent globally across the whole
application. For instance, if you post something to my facebook wall
now, I don't really care if I can read it now or 5 seconds from now.
Unless we sit next to each other with 2 laptops, I couldn't tell the
difference anyway. OTOH, many applications need causality for
transactions within the same session, so if it happens within the same
TCP/IP connection you will typically be connected to the same node.
However, for web applications this isn't true, each HTTP request of
course is independent (unless you embed some cookie in the
application, which is a very common approach to solve this btw).
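One way the cookie approach could look, as a minimal sketch: hash the session ID carried in the cookie to pick a node, so every request of a session lands on the same one. The node names and the function are hypothetical, and the mapping changes whenever the node list changes, so a real setup would need to handle failover separately.

```python
import hashlib

NODES = ["n1", "n2", "n3"]  # hypothetical node names


def node_for_session(session_id: str, nodes=NODES) -> str:
    """Map a session ID (e.g. taken from a cookie) to a fixed node."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Two requests carrying the same cookie go to the same node.
first = node_for_session("session-abc123")
second = node_for_session("session-abc123")
print(first == second)
```

This gives "read your own writes" within a session without turning on causal reads, at the cost of touching the application or the load balancer.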
The choice between "read from the same node" and using the causal
reads feature is mostly a performance vs convenience tradeoff. The
causal reads allow you to get what you want without touching your
application. (No, I haven't tested what the performance penalty
actually is, if there is much at all.)
Anyway, I don't know if this is even really what you ask for. If you
are only concerned about failovers, then it is a non-issue and galera
really does what you want. If you want it for all transactions, then
galera also does what you want if you turn on causal reads.
henrik
I agree that n3 cannot commit write transactions - but that's not my
concern. My concern is about reads on n3. That's what I'm talking
about, avoiding stale reads in all cases, even during the time n3
becomes partitioned from the cluster and the rest of the cluster
processes another write transaction. If both sides of the partition
time each other out in (approximately) 5 seconds, that strikes me as
racy. Do you understand?
I think you should re-read my question. We are a transactional
processing app and would like consistent (up-to-date) reads all the
time, regardless of session or node, and including when failovers
happen *due to asymmetric network partitions*.
--
Karl Pickett
I suppose this kind of "global" read causality is a reasonable
requirement for instance in something like financial trading. In such
a case other tricks like "read your own writes" (from the same node,
in the case of galera) are not enough, but you truly want everyone to
see the exact same snapshot in time.
henrik
Well, the thing is that the meaning of "everyone" and "the same" here
is not as obvious as you might think.
If reading and writing clients are causally dependent - i.e. the
reading client somehow knows that the writing client has updated the
value, then yes, the reading client can be concerned about reading stale
data. But in this case:
- reading and writing clients communicate independently of replication,
so there is some sort of client "cluster" parallel to server cluster
(and so "everyone" means the members of this client cluster)
- it is not clear why the reading client can't read from the same node
as the writing client
- it is not clear why the reading client needs to get this data from
the database server instead of directly from the writing client
If reading and writing clients are causally independent - i.e. the
reading client has no communication with the writing client, then the
whole concept of "stale" data or "same snapshot in time" is moot - there
is no way to tell if the reading client attempted to read data before or
after they were written. Nor is there any way to tell who "everyone"
is.
My guess is that there is an attempt to use MySQL/Galera cluster as a
messaging device between clients. That's what Galera is for, but
MySQL/Galera cluster really is not.
Regards,
Alex
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011