Consistent Reads - Are they possible?

karlito

Apr 16, 2012, 5:43:18 PM
to codersh...@googlegroups.com
Your documentation says a lot about consistency but one thing concerns me.   Imagine the following scenario:

3 nodes in cluster (n1, n2, n3)
Application A writes a transaction to n1 and n2 successfully, however the node n3 is unreachable from n1 and n2 (but is still up).  A should still get a commit ok, correct?  How long would it wait?
Application B does a query on n3 however it doesn't see App A's transaction because n3 was partitioned from the other nodes.  Thus a stale read is given by the cluster.

Can that happen with galera?  That seems like it could be solved with:
a. reads always query other nodes (bad for performance)
b. making application A wait for some period of time (the failover time), after which node 3 has hopefully marked itself as dead
c. ... 

Regards,

Karl Pickett

Laurent MINOST

Apr 17, 2012, 5:55:34 AM
to codersh...@googlegroups.com
Hi karlito,

I think the solution to this problem lies in how you dispatch connections from the frontend to the Galera nodes (application level, load balancer level, etc.)?

In my opinion, you should avoid dispatching any queries (whatever they are: read/write, SELECT/INSERT/UPDATE ...) to a node whose status is anything other than Synced.

In your situation, if I understand properly how Galera works, nodes 1 and 2 should still be up and running with a Synced status, since they see each other in the cluster. Node 3, however, is in a split-brain situation: it does not see any other node, so its status should be something other than Synced.
I don't know how you want to manage query dispatching, but in my current test environment I dispatch queries ONLY to nodes whose current status I know to be Synced; in any other case the node is treated as down/unavailable by my load-balancing system, and therefore by my frontend servers. I chose this behaviour to ensure data integrity and to avoid writing where I shouldn't, or serving data that is not up to date (from a node that is not synced and so may contain data that has not been updated yet).
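
A minimal sketch of the "dispatch only to Synced nodes" rule described above, in Python. It assumes the load balancer has already collected each node's `SHOW STATUS LIKE 'wsrep_%'` output into a dictionary; the function names and the sample values are illustrative, not taken from any real load balancer. Note that the status is only a snapshot taken at polling time, so this narrows the window for stale reads rather than closing it.

```python
def is_routable(status):
    """Return True only if the node reports itself as a synced member
    of the primary component."""
    return (
        status.get("wsrep_ready") == "ON"
        and status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
    )


def routable_nodes(cluster_status):
    """Names of the nodes that may receive queries, in sorted order."""
    return sorted(
        name for name, status in cluster_status.items() if is_routable(status)
    )


# Illustrative snapshot: n3 has lost contact with n1 and n2.
sample = {
    "n1": {"wsrep_ready": "ON", "wsrep_cluster_status": "Primary",
           "wsrep_local_state_comment": "Synced"},
    "n2": {"wsrep_ready": "ON", "wsrep_cluster_status": "Primary",
           "wsrep_local_state_comment": "Synced"},
    "n3": {"wsrep_ready": "OFF", "wsrep_cluster_status": "non-Primary",
           "wsrep_local_state_comment": "Initialized"},
}

print(routable_nodes(sample))  # ['n1', 'n2']
```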

The only problem with this behaviour is that in a 3-node cluster, when one node crashes or ends up in split brain, 2/3 of your cluster (the recovering node and, presumably, its donor) will be unavailable for as long as the node takes to recover via IST or SST. Handling all traffic from a single node during that time could be a problem, and I have not yet been able to estimate the load, because my test environment does not see the same traffic as production.

Hope it helps ?

BR,

Laurent

karlito

Apr 17, 2012, 11:26:10 AM
to codersh...@googlegroups.com
No, that doesn't help.  I want to know exactly how long it takes n3 to 'unsync' itself, i.e., whether there is a race condition for stale reads or not.  In my example, assume A and B are running within microseconds of each other.  How does B prevent a stale read microseconds after a transaction commit on n1 and n2, during a network partition?

Google megastore prevents this by invalidating caches at *Every single datacenter* on write, at the expense of some uptime.    Of course we don't use app engine so megastore is not an option for us.

Henrik Ingo

Apr 17, 2012, 11:43:14 AM
to karlito, codersh...@googlegroups.com
Karlito

What you write cannot happen with Galera. n3 will notice that it is
disconnected from the main partition and it will stop accepting
queries until it is able to reconnect and get back in sync.

If you are talking microseconds, then the issue is different. You want
to familiarize yourself with wsrep_causal_reads variable
(http://www.codership.com/wiki/doku.php?id=mysql_options_0.8)

henrik

> --
> You received this message because you are subscribed to the Google Groups
> "codership" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/codership-team/-/nSHkSMVQZwcJ.
> To post to this group, send email to codersh...@googlegroups.com.
> To unsubscribe from this group, send email to
> codership-tea...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/codership-team?hl=en.

--
henri...@avoinelama.fi
+358-40-8211286 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559

karlito

Apr 17, 2012, 11:44:16 AM
to codersh...@googlegroups.com
I reread the following:


and I am still concerned.  I see that App A will wait for probably 5 seconds (the consensus timeout - all nodes see n3 as unreachable), then the commit will succeed.  HOWEVER, n3 won't time the others out until 15 seconds (inactive timeout), right?  So are there 10 seconds in which n3 could serve stale reads before realizing it's not part of the cluster?

Please correct me if I'm wrong.  Regards,

Karl

karlito

Apr 17, 2012, 11:50:38 AM
to codersh...@googlegroups.com, karlito, henri...@avoinelama.fi
Yes, I noticed we will have to turn the causal reads setting on.   Does the causal reads setting add extra network overhead on reads, or does it just wait for the local threads to finish?   I thought it was the latter, but the docs aren't clear ("Results in larger read latencies.")

Exactly how long does it take for n3 to notice it has lost the cluster?  My cursory reading would indicate it would time all other nodes out after inactive_timeout, which is bad since that's 15 seconds (longer than the main cluster takes to time n3 out).  Is that incorrect?


Henrik Ingo

Apr 17, 2012, 12:01:33 PM
to karlito, codersh...@googlegroups.com
On Tue, Apr 17, 2012 at 6:50 PM, karlito <karl.p...@gmail.com> wrote:
> Yes, I noticed we will have to turn the causal reads setting on.   Does the
> causal reads setting add extra network overhead on reads, or does it just
> wait for the local threads to finish?   I thought it was the latter, but the
> docs aren't clear ("Results in larger read latencies.")

Yes, the latter. Basically all SELECTs will wait for their pending slave queue to
be applied before they read. There is no additional network
connectivity to other nodes, just a delay before the read can be
"securely" executed.
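
For illustration, here is a hedged sketch of turning the setting on around a single read from application code. It assumes `wsrep_causal_reads` can be set per session in the version in use (check the documentation for your release) and that `cursor` is any Python DB-API 2.0 cursor; the helper name `causal_select` is made up for this example.

```python
def causal_select(cursor, query, params=()):
    """Run one SELECT with causal reads enabled for this session only.

    Assumption: the server accepts SET SESSION for wsrep_causal_reads.
    """
    cursor.execute("SET SESSION wsrep_causal_reads = ON")
    try:
        # The node delays this read until its pending queue is applied.
        cursor.execute(query, params)
        return cursor.fetchall()
    finally:
        # Switch it back off so later reads on this connection stay cheap.
        cursor.execute("SET SESSION wsrep_causal_reads = OFF")
```

This only pays the extra latency on the reads that need it. If the session already had the setting on, the final statement would turn it off, so a real helper would want to remember and restore the previous value.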

Note that if read-causality is important to you (for most applications
I would say that an inconsistency of at most some milliseconds is not
an issue) then another way around this is to direct all writes and
reads to a single node.

> Exactly how long does it take for n3 to notice it has lost the cluster?  My
> cursory reading would indicate it would time all other nodes out
> after inactive_timeout, which is bad since that's 15 seconds (longer than the
> main cluster takes to time n3 out).  Is that incorrect?

Note that you can configure these values to suit your own needs. The
defaults are fairly lax, so making these timeouts tighter is certainly
an option.

henrik

Teemu Ollakka

Apr 17, 2012, 12:15:42 PM
to codersh...@googlegroups.com, karlito, henri...@avoinelama.fi

Hi Karl,


On Tuesday, April 17, 2012 6:50:38 PM UTC+3, karlito wrote:

Exactly how long does it take for n3 to notice it has lost the cluster?  My cursory reading would indicate it would time all other nodes out after inactive_timeout, which is bad since that's 15 seconds (longer than the main cluster takes to time n3 out).  Is that incorrect?

If n3 detects that it has lost connectivity to all other nodes in cluster, it will fall to non-primary component after evs.suspect_timeout (which is 5 secs by default). IIRC whole evs.inactive_timeout period is waited only in the case when two node cluster splits in half.

- Teemu

karlito

Apr 17, 2012, 3:57:11 PM
to codersh...@googlegroups.com, karlito, henri...@avoinelama.fi
That strikes me as still racy.  n1 and n2 will time n3 out after 5 seconds, and vice versa?  That still sounds like there's a (milliseconds level at best) race.  Time is a very abstract concept in distributed systems.  
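
The arithmetic behind this worry can be sketched as follows. This is a deliberately simplified model, not a description of how the EVS timers actually work: it assumes each side of the partition drops the other after a fixed timeout, and that the two sides may start counting at slightly different moments (the hypothetical `skew` parameter). The 5 and 15 second values are the defaults quoted earlier in the thread.

```python
def stale_read_window(majority_timeout, minority_timeout, skew=0.0):
    """Seconds during which the isolated node may still serve reads after
    the majority has already dropped it and resumed committing.

    skew: how much later the isolated node starts its timer (assumption).
    """
    return max(0.0, minority_timeout - majority_timeout + skew)


# Original worry: majority drops n3 after 5 s, n3 gives up after 15 s.
print(stale_read_window(5.0, 15.0))             # 10.0

# With both sides at 5 s the nominal window closes ...
print(stale_read_window(5.0, 5.0))              # 0.0

# ... but any difference in when the timers start reopens it.
print(stale_read_window(5.0, 5.0, skew=0.02))   # 0.02
```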

I think galera looks very cool, it's very close to what we want and WAY ahead of 'nosql' projects I'm looking at like mongodb.  However, it looks like the 'causal reads' were added on as an afterthought, and it still doesn't look like that is guaranteed cluster wide.  If galera can provide a guarantee of 'if the cluster returned success to any write transaction, then any read initiated after that time will see it, regardless of what node it's using', that would work for us.  However, it appears to me that guarantee will not hold during fail over conditions.

And no, putting logic into the load balancer to 'only use one node' strikes me as equally risky.  How do you know all active sessions of a load balancer get moved/terminated as a unit?  You know in real life there could be connections to all nodes.

So anyway, if some work could be done to:

a. officially put the guarantee (cluster wide consistent reads even during fail over) in writing on the web site
b. put the galera protocol up as a paper
c. ensure timers will time a partitioned node out (it commits suicide) before the remaining quorum commits and acks write transactions

then galera would be very interesting for us.

Henrik Ingo

Apr 17, 2012, 11:59:43 PM
to karlito, codersh...@googlegroups.com
On Tue, Apr 17, 2012 at 10:57 PM, karlito <karl.p...@gmail.com> wrote:
>> On Tuesday, April 17, 2012 6:50:38 PM UTC+3, karlito wrote:
>> If n3 detects that it has lost connectivity to all other nodes in cluster,
>> it will fall to non-primary component after evs.suspect_timeout (which is 5
>> secs by default). IIRC whole evs.inactive_timeout period is waited only in
>> the case when two node cluster splits in half.
>
>
>
> That strikes me as still racy.  n1 and n2 will time n3 out after 5 seconds,
> and vice versa?  That still sounds like there's a (milliseconds level at
> best) race.  Time is a very abstract concept in distributed systems.

Nono. When n3 loses connection, it immediately cannot commit any new
transactions. The commit is synchronous and hard-wired into the group
communication, so if you can't talk to the primary component, you
cannot commit a single transaction. (with default setting of
pc.ignore_sb) As I understand it, the suspect_timeout is more like
the time after which a node gives up trying.

> I think galera looks very cool, it's very close to what we want and WAY
> ahead of 'nosql' projects I'm looking at like mongodb.  However, it looks
> like the 'causal reads' were added on as an afterthought, and it still
> doesn't look like that is guaranteed cluster wide.  If galera can provide a
> guarantee of 'if the cluster returned success to any write transaction, then
> any read initiated after that time will see it, regardless of what node it's
> using', that would work for us.  However, it appears to me that guarantee
> will not hold during fail over conditions.

You are lumping 2 separate things together.

Transactions are committed to the cluster, not just a single node. The
sequence of committed transactions is well defined across the cluster,
there is no transaction committed only on some node, it's always the
primary component.

However, transactions are not synchronously *applied* to the innodb
table space. So they exist on all nodes at commit time, and a
certification algorithm guarantees that they are able to apply, i.e.
they do not conflict with any other to-be-applied transactions, but
they are not yet visible if you read from the InnoDB table. So the
causal reads feature is there to bridge this small delay. If you want
to ensure that you really read the results that were committed (via
any node) at the start of your current transaction, then galera gives
you this guarantee by looking at the queue of
committed-but-not-yet-applied transactions, wait for it to be applied,
then executes the read.
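
The commit-versus-apply distinction can be illustrated with a toy model. This is only a sketch of the idea, not Galera's implementation: the `Node` class and its methods are invented for the example, and the read drains the queue itself, whereas a real node would wait for its applier threads to catch up.

```python
from collections import deque


class Node:
    """Toy node: committed write sets queue up before they become visible."""

    def __init__(self):
        self.table = {}             # what a plain read can see
        self.apply_queue = deque()  # committed but not yet applied

    def receive(self, key, value):
        """A write set committed cluster-wide arrives; it is not visible yet."""
        self.apply_queue.append((key, value))

    def apply_one(self):
        key, value = self.apply_queue.popleft()
        self.table[key] = value

    def read(self, key, causal=False):
        if causal:
            # Stand-in for "wait until the pending queue has been applied".
            while self.apply_queue:
                self.apply_one()
        return self.table.get(key)


node = Node()
node.table["balance"] = 100
node.receive("balance", 80)                # committed elsewhere, queued here

print(node.read("balance"))                # 100
print(node.read("balance", causal=True))   # 80
```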

Most applications are fine with that level of inconsistency, at least
for most reads. But you can have causal reads if you need them (and
it's a completely legitimate request of course).


**

Note that performance-wise there seems to be a more performant
implementation available once galera moves to support MySQL 5.6. The
technique is described in this blog post using global transaction id's
from MySQL 5.6.

http://blog.ulf-wendel.de/2012/slides-mysql-56-global-transaction-identifier-and-peclmysqlnd_ms-for-session-consistency/

Same concept could be used for Galera. The benefit would be that
instead of waiting to apply the queue that exists at the start of your
current transaction, you would know the transaction id of your last
commit (for this application thread) and could start executing the
next read earlier.
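
A small sketch of why tracking the session's own last commit could shorten the wait. The sequence numbers and parameter names below are hypothetical and only model the counting; nothing here is an existing Galera or MySQL API.

```python
def writesets_to_wait_for(applied_seqno, received_seqno, last_own_commit=None):
    """Number of write sets a read must see applied before it may run.

    applied_seqno:   highest sequence number already applied on this node
    received_seqno:  highest sequence number queued when the read starts
    last_own_commit: sequence number of this session's last commit, if the
                     client tracks it (assumption); None means "not tracked"
    """
    if last_own_commit is None:
        target = received_seqno    # wait for the whole pending queue
    else:
        target = last_own_commit   # wait only for our own last write
    return max(0, target - applied_seqno)


# Node has applied up to 100 and has 50 more write sets queued.
print(writesets_to_wait_for(100, 150))                       # 50
# The session's last commit was 110, so only 10 need to be applied first.
print(writesets_to_wait_for(100, 150, last_own_commit=110))  # 10
# If our write is already applied, the read can start immediately.
print(writesets_to_wait_for(100, 150, last_own_commit=90))   # 0
```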

> And no, putting logic into the load balancer to 'only use one node' strikes
> me as equally risky.  How do you know all active sessions of a load balancer
> get moved/terminated as a unit?  You know in real life there could be
> connections to all nodes.

It depends on the application. In many applications it is the case
that not all transactions need be consistent globally across the whole
application. For instance, if you post something to my facebook wall
now, I don't really care if I can read it now or 5 seconds from now.
Unless we sit next to each other with 2 laptops, I couldn't tell the
difference anyway. Otoh many applications need causality for
transactions within the same session, so if it happens within the same
TCP/IP connection you will typically be connected to the same node.
However, for web applications this isn't true, each HTTP request of
course is independent (unless you embed some cookie in the
application, which is a very common approach to solve this btw).

The choice between "read from the same node" and using the causal
reads feature is mostly a performance vs convenience tradeoff. The
causal reads allow you to get what you want without touching your
application. (No, I haven't tested what the performance penalty
actually is, if there is much at all.)


Anyway, I don't know if this is even really what you ask for. If you
are only concerned about failovers, then it is a non-issue and galera
really does what you want. If you want it for all transactions, then
galera also does what you want if you turn on causal reads.

henrik

Karl Pickett

Apr 18, 2012, 9:25:45 AM
to henri...@avoinelama.fi, codersh...@googlegroups.com
On Tue, Apr 17, 2012 at 10:59 PM, Henrik Ingo <henri...@avoinelama.fi> wrote:
> On Tue, Apr 17, 2012 at 10:57 PM, karlito <karl.p...@gmail.com> wrote:
>>> On Tuesday, April 17, 2012 6:50:38 PM UTC+3, karlito wrote:
>>> If n3 detects that it has lost connectivity to all other nodes in cluster,
>>> it will fall to non-primary component after evs.suspect_timeout (which is 5
>>> secs by default). IIRC whole evs.inactive_timeout period is waited only in
>>> the case when two node cluster splits in half.
>>
>>
>>
>> That strikes me as still racy.  n1 and n2 will time n3 out after 5 seconds,
>> and vice versa?  That still sounds like there's a (milliseconds level at
>> best) race.  Time is a very abstract concept in distributed systems.
>
> Nono. When n3 loses connection, it immediately cannot commit any new
> transactions. The commit is synchronous and hard-wired into the group
> communication, so if you can't talk to the primary component, you
> cannot commit a single transaction. (with default setting of
> pc.ignore_sb)  As I understand it, the suspect_timeout is more like
> the time after which a node gives up trying.

I agree that n3 cannot commit write transactions - but that's not my
concern. My concern is about reads on n3. That's what I'm talking
about, avoiding stale reads in all cases, even during the time n3
becomes partitioned from the cluster and the rest of the cluster
processes another write transaction. If both sides of the partition
time each other out in (approximately) 5 seconds, that strikes me as
racy. Do you understand?

I think you should re-read my question. We are a transactional
processing app and would like consistent (up to date reads) all the
time, regardless of session or node, and including when failovers
happen *due to asymmetric network partitions*.


--
Karl Pickett

Teemu Ollakka

Apr 18, 2012, 11:07:16 AM
to codersh...@googlegroups.com, henri...@avoinelama.fi

Karl,


On Wednesday, April 18, 2012 4:25:45 PM UTC+3, karlito wrote:

I agree that n3 cannot commit write transactions - but that's not my
concern.   My concern is about reads on n3.  That's what I'm talking
about, avoiding stale reads in all cases, even during the time n3
becomes partitioned from the cluster and the rest of the cluster
processes another write transaction.  If both sides of the partition
time each other out in (approximately) 5 seconds, that strikes me as
racy.  Do you understand?


I think you have a point here. Causality/consistency constraints are deduced by inspecting local state only, so it is possible to construct a case where stale data is read from a partitioned node.

The scenario would be the following:
* Start with three nodes n1, n2, n3
* Freeze n3 (kill -STOP mysqld process or similar)
* Make a write to n1 and wait until it commits in new group view formed by n1 and n2
* Signal n3 to continue and make a read from it
Now, if n3 was stopped in a suitable state and the read manages to reach the group communication layer before any of the EVS timers expires, the read will return stale data.
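
The scenario can be replayed in a toy simulation. This is a sketch under simplifying assumptions, not Galera code: the `Node` and `Cluster` classes are invented, the majority drops frozen nodes instantly, and n3's belief that it is still in the primary component stands in for "no EVS timer has expired yet".

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {"x": "old"}
        self.frozen = False
        self.believes_primary = True  # local state only, may be outdated


class Cluster:
    def __init__(self):
        self.nodes = {n: Node(n) for n in ("n1", "n2", "n3")}
        self.view = {"n1", "n2", "n3"}

    def write(self, key, value):
        # Frozen nodes fall out of the view; the rest must still be a majority.
        self.view = {n for n in self.view if not self.nodes[n].frozen}
        if 2 * len(self.view) <= len(self.nodes):
            raise RuntimeError("no quorum")
        for n in self.view:
            self.nodes[n].data[key] = value

    def read(self, name, key):
        node = self.nodes[name]
        # The check consults only the node's own belief about the cluster.
        if not node.believes_primary:
            raise RuntimeError("node is non-primary")
        return node.data[key]


cluster = Cluster()
cluster.nodes["n3"].frozen = True    # kill -STOP
cluster.write("x", "new")            # commits in the view {n1, n2}
cluster.nodes["n3"].frozen = False   # kill -CONT, timers not yet expired

print(cluster.read("n1", "x"))       # new
print(cluster.read("n3", "x"))       # old
```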

To achieve consistent reads in all possible scenarios, I don't think there is a way to avoid generating network traffic for read operations as well. One way would be to generate a change to the EVS protocol state for each read. In practice this would mean generating a message (with at least the AGREED delivery guarantee) for each read operation. This would force the node to communicate with other nodes, so that partitioning would be detected before the read completes.

- Teemu

Alexey Yurchenko

Apr 18, 2012, 3:04:59 PM
to codersh...@googlegroups.com
You're absolutely right.

Henrik Ingo

Apr 19, 2012, 2:10:50 AM
to codersh...@googlegroups.com
Interesting discussion. It's enlightening to hear from someone who has
more strict requirements than myself.

I suppose this kind of "global" read causality is a reasonable
requirement for instance in something like financial trading. In such
a case other tricks like "read your own writes" (from the same node,
in the case of galera) are not enough, but you truly want everyone to
see the exact same snapshot in time.

henrik

Alex Yurchenko

Apr 19, 2012, 6:59:34 AM
to codersh...@googlegroups.com
On 2012-04-19 09:10, Henrik Ingo wrote:
> Interesting discussion. It's enlightening to hear from someone who
> has
> more strict requirements than myself.
>
> I suppose this kind of "global" read causality is a reasonable
> requirement for instance in something like financial trading. In such
> a case other tricks like "read your own writes" (from the same node,
> in the case of galera) are not enough, but you truly want everyone to
> see the exact same snapshot in time.

Well, the thing is that the meaning of "everyone" and "the same" here
is not as obvious as you might think.

If reading and writing clients are causally dependent - i.e. the
reading client somehow knows that the writing client has updated the
value, then yes, the reading client can be concerned about reading stale
data. But in this case:

- reading and writing clients communicate independently of replication,
so there is some sort of client "cluster" parallel to server cluster
(and so "everyone" means the members of this client cluster)
- it is not clear why the reading client can't read from the same node
as the writing client
- it is not clear why the reading client needs to get this data from
the database server instead of directly from the writing client

If reading and writing clients are causally independent - i.e. the
reading client has no communication with the writing client, then the
whole concept of "stale" data or "same snapshot in time" is moot - there
is no way to tell if the reading client attempted to read data before or
after they were written. As there is also no way to tell who is
"everyone".

My guess is that there is an attempt to use MySQL/Galera cluster as a
messaging device between clients. That's what Galera is for, but
MySQL/Galera cluster really is not.

Regards,
Alex

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Alexey Yurchenko

Sep 29, 2012, 4:09:07 PM
to codersh...@googlegroups.com
Just released 2.2rc1 fixes this issue. "Consistent" reads are now "consistent" across network partitions.

Regards,
Alex

Tuure Laurinolli

Oct 1, 2012, 6:27:40 AM
to Teemu Ollakka, codership
If causal consistency is all that is required, there is no need to change Galera internals. In the example above, view 2 (with n1 and n2) is not the same as view 1 (n1, n2, n3). Therefore any read done by client B from the outdated view is invalid, and client B can detect that and retry its read. All this requires is for client A, which does the commit, to communicate to client B the identifier of the view in which the commit was made.

The scenario then becomes:
* Start with three nodes n1, n2, n3, they form view 1
* Freeze n3
* Nodes n1 and n2 form new view 2
* Client A commits to view 2 and gets back view id to which the commit was made (surely there is some way to achieve this?)
* Client A communicates with Client B out-of-band that the data it has written can now be read, and that current view is view 2
* Unfreeze n3
* Client B does a read from n3 and gets back invalid data and view-id of view 1
* Client B notes that its view-id does not match that which was communicated to it and handles the error by e.g. retrying
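
Client B's side of this protocol might look like the sketch below. Everything in it is hypothetical: it assumes a read can return the view id it was served from, and that view ids are integers that only grow, neither of which is confirmed anywhere in this thread. The `read_fn` callable stands in for "issue the read against some node".

```python
def read_with_view_check(read_fn, expected_view, attempts=3):
    """Retry a read until it is served from a view at least as new as the
    one client A reported.

    read_fn: callable returning (value, view_id); hypothetical interface.
    """
    for _ in range(attempts):
        value, view_id = read_fn()
        if view_id >= expected_view:
            return value
        # Served from an outdated view: discard the result and try again.
    raise RuntimeError("no read from view %d or newer" % expected_view)


# First answer comes from the unfrozen n3 (view 1), the retry from view 2.
responses = iter([("stale", 1), ("fresh", 2)])
print(read_with_view_check(lambda: next(responses), expected_view=2))  # fresh
```

In practice the retry would probably pause briefly or go to a different node rather than immediately asking the same one again.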


Alex Yurchenko

Oct 1, 2012, 6:45:46 AM
to codersh...@googlegroups.com
Heh, Tuure, it gets even better than that! Client A could simply pass
its commit ID to client B and then it would have been even simpler -
client B would just wait for this commit to happen (which in 99% of
cases is a done thing) and voilà.

Now if we only could get them to rewrite all those thousands of clients
(and, preferably, mysql protocol - to avoid extra queries)... Any
suggestions how to do that? :D