Message from discussion Consistent Reads - Are they possible?

Date: Wed, 18 Apr 2012 08:25:45 -0500
Message-ID: <CANDBRPt3+0NsKwuO0X89dQ0tnShB3P+xHEDAM-DH5Z_nyx_...@mail.gmail.com>
Subject: Re: [codership-team] Consistent Reads - Are they possible?
From: Karl Pickett <karl.pick...@gmail.com>
To: henrik.i...@avoinelama.fi
Cc: codership-team@googlegroups.com

On Tue, Apr 17, 2012 at 10:59 PM, Henrik Ingo <henrik.i...@avoinelama.fi> wrote:
> On Tue, Apr 17, 2012 at 10:57 PM, karlito <karl.pick...@gmail.com> wrote:
>>> On Tuesday, April 17, 2012 6:50:38 PM UTC+3, karlito wrote:
>>> If n3 detects that it has lost connectivity to all other nodes in cluster,
>>> it will fall to non-primary component after evs.suspect_timeout (which is 5
>>> secs by default). IIRC whole evs.inactive_timeout period is waited only in
>>> the case when two node cluster splits in half.
>>
>>
>>
>> That strikes me as still racy.  n1 and n2 will time n3 out after 5 seconds,
>> and vice versa?  That still sounds like there's a (milliseconds level at
>> best) race.  Time is a very abstract concept in distributed systems.
>
> Nono. When n3 loses connection, it immediately cannot commit any new
> transactions. The commit is synchronous and hard-wired into the group
> communication, so if you can't talk to the primary component, you
> cannot commit a single transaction. (with default setting of
> pc.ignore_sb)  As I understand it, the suspect_timeout is more like
> the time after which a node gives up trying.

I agree that n3 cannot commit write transactions, but that's not my
concern.  My concern is reads on n3: I want to avoid stale reads in all
cases, including the window in which n3 has become partitioned from the
cluster and the rest of the cluster processes another write
transaction.  If both sides of the partition time each other out in
(approximately) 5 seconds, that strikes me as racy.  Do you understand?
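
To make the concern concrete, here is a minimal sketch of the race. It is
a toy model and not Galera code. It assumes that each side of the
partition runs its own local timer, that the majority can accept writes
as soon as it has timed n3 out, and that n3 keeps serving reads until
its own timer fires. Whether Galera really behaves this way is exactly
the open question; the class, the function, and the 20 ms detection skew
are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Side:
    """One side of a network partition, with its own local clock."""
    name: str
    detects_loss_at: float        # seconds: when this side notices the partition
    suspect_timeout: float = 5.0  # mirrors the 5 s default quoted above

    @property
    def gives_up_at(self) -> float:
        # Moment this side stops treating the other side as a cluster member.
        return self.detects_loss_at + self.suspect_timeout


def stale_read_window(majority: Side, minority: Side) -> float:
    """Seconds during which the majority may already accept writes while
    the minority still believes it is primary and serves reads."""
    return max(0.0, minority.gives_up_at - majority.gives_up_at)


majority = Side("n1+n2", detects_loss_at=0.000)
minority = Side("n3", detects_loss_at=0.020)  # hypothetical 20 ms skew

window = stale_read_window(majority, minority)
print(f"possible stale-read window: {window * 1000:.0f} ms")
```

Under these assumptions the window is simply the difference between the
two give-up times, so any skew in when each side detects the loss turns
into a period where n3 could return stale data.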

>
>> I think galera looks very cool, it's very close to what we want and WAY
>> ahead of 'nosql' projects I'm looking at like mongodb.  However, it looks
>> like the 'causal reads' were added on as an afterthought, and it still
>> doesn't look like that is guaranteed cluster wide.  If galera can provide a
>> guarantee of 'if the cluster returned success to any write transaction, then
>> any read initiated after that time will see it, regardless of what node it's
>> using', that would work for us.  However, it appears to me that guarantee
>> will not hold during fail over conditions.
>
> You are lumping 2 separate things together.
>
> Transactions are committed to the cluster, not just a single node. The
> sequence of committed transactions is well defined across the cluster,
> there is no transaction committed only on some node, it's always the
> primary component.
>
> However, transactions are not synchronously *applied* to the innodb
> table space. So they exist on all nodes at commit time, and a
> certification algorithm guarantees that they are able to apply, i.e.
> they do not conflict with any other to-be-applied transactions, but
> they are not yet visible if you read from the InnoDB table. So the
> causal reads feature is there to bridge this small delay. If you want
> to ensure that you really read the results that were committed (via
> any node) at the start of your current transaction, then galera gives
> you this guarantee by looking at the queue of
> committed-but-not-yet-applied transactions, waiting for it to be applied,
> and then executing the read.
>
> Most applications are fine with that level of inconsistency, at least
> for most reads. But you can have causal reads if you need them (and
> it's a completely legitimate request of course).
>
>
> **
>
> Note that performance-wise there seems to be a more performant
> implementation available once galera moves to support MySQL 5.6.  The
> technique is described in this blog post using global transaction id's
> from MySQL 5.6.
>
> http://blog.ulf-wendel.de/2012/slides-mysql-56-global-transaction-identifier-and-peclmysqlnd_ms-for-session-consistency/
>
> Same concept could be used for Galera. The benefit would be that
> instead of waiting to apply the queue that exists at the start of your
> current transaction, you would know the transaction id of your last
> commit (for this application thread) and could start executing the
> next read earlier.
>
>> And no, putting logic into the load balancer to 'only use one node' strikes
>> me as equally risky.  How do you know all active sessions of a load balancer
>> get moved/terminated as a unit?  You know in real life there could be
>> connections to all nodes.
>
> It depends on the application. In many applications it is the case
> that not all transactions need be consistent globally across the whole
> application. For instance, if you post something to my facebook wall
> now, I don't really care if I can read it now or 5 seconds from now.
> Unless we sit next to each other with 2 laptops, I couldn't tell the
> difference anyway. Otoh many applications need causality for
> transactions within the same session, so if it happens within the same
> TCP/IP connection you will typically be connected to the same node.
> However, for web applications this isn't true, each HTTP request of
> course is independent (unless you embed some cookie in the
> application, which is a very common approach to solve this btw).
>
> The choice between "read from the same node" and using the causal
> reads feature is mostly a performance vs convenience tradeoff. The
> causal reads allow you to get what you want without touching your
> application. (No, I haven't tested what the performance penalty
> actually is, if there is much at all.)
>
>
> Anyway, I don't know if this is even really what you ask for. If you
> are only concerned about failovers, then it is a non-issue and galera
> really does what you want. If you want it for all transactions, then
> galera also does what you want if you turn on causal reads.

I think you should re-read my question.  We are a transactional
processing app and would like consistent (up-to-date) reads all the
time, regardless of session or node, and including when failovers
happen *due to asymmetric network partitions*.
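
For reference, the causal-read wait described in the quoted text above
can be sketched as a toy model. This is not Galera's implementation: the
Node class and its methods are hypothetical, and for simplicity the read
drains the queue itself rather than waiting on a separate applier. The
optional wait_for argument illustrates the session-consistency idea from
the linked blog post, where a read waits only for the session's own last
commit.

```python
from collections import deque


class Node:
    """Toy node: transactions arrive committed but are applied later."""

    def __init__(self):
        self.pending = deque()   # committed-but-not-yet-applied (seqno, key, value)
        self.table = {}          # what a plain read can see
        self.applied_seqno = 0
        self.received_seqno = 0

    def receive(self, seqno, key, value):
        # A transaction committed cluster-wide lands in this node's queue.
        self.pending.append((seqno, key, value))
        self.received_seqno = seqno

    def apply_one(self):
        seqno, key, value = self.pending.popleft()
        self.table[key] = value
        self.applied_seqno = seqno

    def plain_read(self, key):
        return self.table.get(key)

    def causal_read(self, key, wait_for=None):
        # Default target: everything queued when the read starts.
        target = self.received_seqno if wait_for is None else wait_for
        while self.applied_seqno < target and self.pending:
            self.apply_one()
        return self.table.get(key)


node = Node()
node.receive(1, "balance", 100)
node.receive(2, "balance", 80)
node.receive(3, "other", 1)

stale = node.plain_read("balance")                 # nothing applied yet
session = node.causal_read("balance", wait_for=2)  # session's last commit was seqno 2
applied_after_session = node.applied_seqno
full = node.causal_read("other")                   # waits for the whole queue
```

The model only covers transactions the node has already received. A
node that is cut off from the cluster never receives the newer writes,
so waiting on its local queue cannot, by itself, rule out a stale read
during a partition.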

>
> henrik
>
>
> --
> henrik.i...@avoinelama.fi
> +358-40-8211286 skype: henrik.ingo irc: hingo
> www.openlife.cc
>
> My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559



-- 
Karl Pickett