wsrep_sst_method skip and IST.

Ilias Bertsimas

Sep 27, 2012, 7:28:34 AM

to codersh...@googlegroups.com

Hello,

I have a Galera cluster with a huge amount of data, where a full SST would be impractical, as it would take 3-4 days plus the time needed to apply the new writesets to catch up.

I have set the cluster's wsrep_sst_method to skip, but it is not clear whether it will skip IST as well.

Can someone confirm how it will react if it needs an IST?

Kind Regards,

Ilias.

Henrik Ingo

Sep 27, 2012, 8:11:02 AM

to Ilias Bertsimas, codersh...@googlegroups.com

On Thu, Sep 27, 2012 at 2:28 PM, Ilias Bertsimas <awar...@gmail.com> wrote:
> I have a galera cluster with a huge amount of data where a full SST would be
> pointless at it will take 3-4 days plus the amount of time needed to apply
> the new writesets to catch up.
> I have set cluster's wsrep_sst_method to skip but it is not clear if it will
> skip IST as well.

Actually, I don't think you are supposed to use the skip method as a
permanent setting. If I understood correctly, Percona developed it to
be used when initially starting the cluster. In this case you could
manually restore the same data to all nodes, so you know they are in
the same state before you start any nodes at all.

On the other hand, if you have a running cluster and some node is disconnected long
enough to need an SST, then you can't leave wsrep_sst_method set to skip,
since the node would then have inconsistent data.

> Can someone confirm how it will react if it needs an IST ?

No. (I have my guess, but that's not what you want, so I'll leave it to
the Codership guys to confirm.)

But referring to what I said above, you should just make sure that
your gcache.size is large enough that SST never needs to happen. And
if a node is disconnected long enough that IST won't work, then you
are back to square one.
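
As a rough illustration of the sizing advice above, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: the function names and the 2 MiB/s write rate are made up for the example, and a real figure would have to come from measuring how many writeset bytes the cluster actually replicates per second. gcache.size itself is set through wsrep_provider_options.

```python
import math


def gcache_size_for_window(hours: float, writeset_bytes_per_sec: float) -> int:
    """Bytes of gcache needed to retain `hours` of writesets at a steady rate."""
    if hours <= 0 or writeset_bytes_per_sec <= 0:
        raise ValueError("hours and write rate must be positive")
    return math.ceil(hours * 3600 * writeset_bytes_per_sec)


def gcache_window_hours(gcache_size_bytes: int, writeset_bytes_per_sec: float) -> float:
    """Hours a node can stay disconnected before IST is no longer possible."""
    if writeset_bytes_per_sec <= 0:
        raise ValueError("write rate must be positive")
    return gcache_size_bytes / writeset_bytes_per_sec / 3600.0


if __name__ == "__main__":
    rate = 2 * 1024**2  # assumed: 2 MiB/s of replicated writesets
    size = gcache_size_for_window(12, rate)
    print(size)               # 90596966400 bytes
    print(size / 1024**3)     # 84.375 GiB
```

The estimate ignores bursts and any per-writeset overhead in the cache, so it is better treated as a lower bound than as a recommended setting.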

henrik
--
henri...@avoinelama.fi
+358-40-8211286 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559

Ilias Bertsimas

Sep 27, 2012, 8:22:30 AM

to codersh...@googlegroups.com, Ilias Bertsimas, henri...@avoinelama.fi

Hello Henrik,

Yes, I know that the purpose of the skip SST method is to set up a cluster manually.

The only reason I use it is that I do not want an SST to happen under any circumstances. An SST happens whenever the node can't do an IST, and in my experience it sometimes happens even when it is not really needed.

An SST is impractical on a 5 TB dataset.

My gcache is big enough to cover at least 12 hours of data changes.

Thanks!

Alex Yurchenko

Sep 27, 2012, 3:00:35 PM

to codersh...@googlegroups.com

Hi,

Henrik is correct on all counts, actually.

wsrep_sst_method=skip is just another way of faking the grastate.dat file.
And you should do it only if you're certain that your nodes ARE
CONSISTENT.

I don't see how you expect your application to work over inconsistent
nodes. Moreover, replication won't work unless the inconsistency is in
the parts never touched.

So you need SST, whether it is "practical" or not.

The other thing is that you can optimize SST to transfer only parts
that were likely to have changed. E.g. if you use file-per-table and you
have a bunch of tables that never change, you can adjust a filter in
rsync SST script to skip those tables.

Regards,
Alex

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Henrik Ingo

Sep 27, 2012, 4:41:27 PM

to Ilias Bertsimas, codersh...@googlegroups.com

Ilias

My point is, rather than leaving wsrep_sst_method=skip, you should
set it to something else so that SST will *fail* and the node is not
allowed to rejoin the cluster. With the skip method, the node will
"succeed" in joining the cluster but will still not have the same
data.

As a quick and dirty solution, I would set wsrep_sst_method=rsync and
then uninstall rsync from the servers, so that SST fails if it is
attempted. A nicer solution, of course, is to create your own SST script (or
ask Codership to) that just returns an error immediately. (Heh, that
would then be wsrep_sst_method=fail :-)

Alex: You didn't answer the actual question: Will IST be used even
when wsrep_sst_method=skip? (I assume yes, but I've been wrong
before...)

henrik

Ilias Bertsimas

Sep 27, 2012, 5:10:17 PM

to codersh...@googlegroups.com, Ilias Bertsimas, henri...@avoinelama.fi

Henrik,

Thank you for your answer. I understand the importance of SST and the need for it to run when the nodes are inconsistent, and that is how I use it (with xtrabackup) on another production cluster with ~120 GB of data.

From examining the SST scripts, it seems that IST is handled there in the BYPASS case, so a good idea would be to modify one of the scripts to work for BYPASS and fail with an error on a full SST, so that I can handle data consistency manually from there.

Alexey, can you please confirm the above?

Kind Regards,

Ilias.

Alex Yurchenko

Sep 28, 2012, 9:54:29 AM

to codersh...@googlegroups.com

On 2012-09-28 00:10, Ilias Bertsimas wrote:
> Henrik,
>
> Thank you for your answer. I understand the importance of SST and the
> need
> for it to run when the nodes are inconsistent and that is the way I
> use
> (xtrabackup) it on another production cluster with ~120GB of DATA.
> From what I am seeing examining the sst scripts it seems IST is done
> there
> on CASE BYPASS so a good idea would be to modify one of the scripts
> to work
> for BYPASS and fail with an error on full SST so I can handle the
> data
> consistency manually from there.
> Alexey can you please confirm the above ?

Yes, you're absolutely correct there. But I immediately have another
question: why do it manually when you can write a script?
Well, in any case, you need to cook a new SST script - whether to fail
or to handle data consistency.
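
The "fail on full SST, allow bypass" idea discussed here could be sketched roughly as below. This models only the decision such a modified donor-side script would make; `donor_decision`, `FullSSTRefused`, and the way the role and bypass flag are passed in are hypothetical names for illustration. How a real SST script receives these values depends on the Galera version, so the stock script remains the reference.

```python
class FullSSTRefused(RuntimeError):
    """Raised when a full state snapshot transfer is requested (hypothetical)."""


def donor_decision(role: str, bypass: bool) -> str:
    """Decide what a modified SST script should do.

    role   -- "donor" or "joiner" (assumed to be known to the script)
    bypass -- True when IST is possible and no data copy is needed
    """
    if role != "donor":
        # Joiner side is left exactly as in the stock script.
        return "unchanged"
    if bypass:
        # IST will follow; the donor only has to signal the bypass.
        return "bypass"
    # Full SST requested: refuse, so the operator can restore data manually.
    raise FullSSTRefused("full SST disabled on this cluster; provision the node manually")


if __name__ == "__main__":
    print(donor_decision("donor", bypass=True))
    try:
        donor_decision("donor", bypass=False)
    except FullSSTRefused as err:
        print("refused:", err)
```

In a real script the refusal would translate to a non-zero exit status, so that the join attempt fails instead of silently succeeding as it does with skip.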

> Kind Regards,
> Ilias.
>
> On Thursday, September 27, 2012 9:41:29 PM UTC+1, Henrik Ingo wrote:
>>
>> Ilias
>>
>> My point is, rather than leaving wsrep_sst_method=skip, you should
>> leave it to something else so that SST will *fail* and the node is
>> not
>> allowed to return to cluster. Now with skip method, the node will
>> "succeed" in joining the cluster but will still not have the same
>> data.
>>
>> As a quick and dirty solution, I would set wsrep_sst_method=rsync
>> and
>> then uninstall rsync from the servers, so then SST will fail if it
>> is
>> tried. A nicer solution of course is to create your own sst script
>> (or
>> ask Codership to) that will just return error immediately. (Heh,
>> that
>> would then be wsrep_sst_method=fail :-)
>>
>> Alex: You didn't answer the actual question: Will IST be used even
>> when wsrep_sst_method=skip? (I assume yes, but I've been wrong
>> before...)

O, tempora! What? You people can't just try and see what happens? OK, I
can tell you, because I tried and saw. No state transfer will happen:
neither SST nor IST. But it is not in the spec, so this behavior should
not be relied on. wsrep_sst_method=skip was invented to assemble _idle_
clusters of nodes that are known to be identical, because some
users were uncomfortable with copying grastate.dat from node to node.
Well, it turns out it is even worse than copying grastate.dat, because it
requires an absolutely idle cluster.

Henrik Ingo

Sep 30, 2012, 9:40:31 AM

to Alex Yurchenko, codersh...@googlegroups.com

On Fri, Sep 28, 2012 at 4:54 PM, Alex Yurchenko
<alexey.y...@codership.com> wrote:
> Yes, you're absolutely correct there. But I immediately have another
> question: why do it manually when you can write a script?

While I'm not the one with the original question, there are cases
where a manual copy over a medium other than Ethernet (e.g. FedEx) is
the highest-bandwidth option available. In my company it has been used
to provision MySQL slaves, for real.

Come to think of it, there could be a wsrep_sst_method=fedex script :-)

> O, tempora! What? You people can't just try and see what happens? Ok, can
> tell you cause I tried and saw. No state transfer will happen - neither SST
> nor IST. But it is not in the spec, so this behavior should not be relied
> on. wsrep_sst_method=skip was invented to assemble _idle_ clusters of the
> nodes which are known to be identical. Because some users were uncomfortable
> with copying grastate.dat from node to node. Well, turns out it is even
> worse than copying grastate.dat, cause it requires an absolutely idle
> cluster.

Now I'm curious. (And this is why I have stopped guessing, I've
learned that Galera is usually more clever than I am...)

I would have guessed that Galera first tries IST and, if that is not
possible, then tries a full SST, which in this case is set to skip, but
that wouldn't then affect the earlier IST attempt. Now it sounds like
the opposite is true? What am I missing?

Alex Yurchenko

Sep 30, 2012, 11:20:22 AM

to henri...@avoinelama.fi, codersh...@googlegroups.com

On 2012-09-30 16:40, Henrik Ingo wrote:
> On Fri, Sep 28, 2012 at 4:54 PM, Alex Yurchenko
> <alexey.y...@codership.com> wrote:
>> Yes, you're absolutely correct there. But I immediately have another
>> question: why do it manually when you can write a script?
>
> While I'm not the one with the original question, there are cases
> where manual copy over a media other than Ethernet (ie Fedex, etc) is
> the highest bandwidth option available. In my company it has been
> used
> to provision MySQL slaves - for real.

Makes perfect sense, indeed. I just got too focused on networking.

> Come to think of it, there could be a wsrep_sst_method=fedex script
> :-)
>
>> O, tempora! What? You people can't just try and see what happens?
>> Ok, can
>> tell you cause I tried and saw. No state transfer will happen -
>> neither SST
>> nor IST. But it is not in the spec, so this behavior should not be
>> relied
>> on. wsrep_sst_method=skip was invented to assemble _idle_ clusters
>> of the
>> nodes which are known to be identical. Because some users were
>> uncomfortable
>> with copying grastate.dat from node to node. Well, turns out it is
>> even
>> worse than copying grastate.dat, cause it requires an absolutely
>> idle
>> cluster.
>
> Now I'm curious. (And this is why I have stopped guessing, I've
> learned that Galera is usually more clever than I am...)
>
> I would have guessed that Galera first tries IST, and if it is not
> possible, then try full SST. ...which in this case is set to skip,
> but
> that wouldn't then affect the earlier IST attempt. Now it sounds like
> the opposite is true? What am I missing?

You're missing a whole lot ;)

To begin with, what do we need for IST on the joiner side? Right, a
fully functional mysqld with storage engines initialized.
So what if IST then turns out not to be possible and we need to do SST? Right, we're
sadly out of luck, unless we use mysqldump.

So in a Galera cluster you always do SST first, then the joiner initializes storage
engines, and then IST. It's just that SST is trivial (that's what the bypass
flag is for) when we can do IST (and vice versa).

So I guess that could explain, but...

The skip method... It actually goes against the above rule and uses a
magic code to not invoke ANY state transfer logic on the donor or
the joiner (it is also used for garbd joining the cluster). I guess
"skip" is a bit of a misnomer there, but that's all we could come up with.
Note that it would equally be a misnomer if we had an actual
wsrep_sst_skip script. What would be its purpose: do IST if we can,
and skip SST if we can't do IST? Just return success? Why bother with
IST then? So, to be semantically consistent, wsrep_sst_method=skip skips
ANY sort of node provisioning. The node is assumed to be up to date and just joins
the cluster.
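
The ordering described in this message can be summarized in a small model. It is only a reading of the explanation above, not Galera's actual code: `provisioning_steps` and the step labels are invented for illustration, and the "trivial IST" after a full SST is an interpretation of the "(and vice versa)" remark.

```python
from typing import List


def provisioning_steps(sst_method: str, ist_possible: bool) -> List[str]:
    """Return the provisioning steps a joining node goes through, in order."""
    if sst_method == "skip":
        # No state transfer logic is invoked on donor or joiner at all;
        # the node is simply assumed to be up to date.
        return []
    if ist_possible:
        sst = "SST (bypass, no data copied)"
        ist = "IST (replay missing writesets)"
    else:
        sst = "SST (full data copy)"
        ist = "IST (trivial)"
    # SST always comes first, because IST needs a running mysqld
    # with storage engines initialized.
    return [sst, "joiner initializes storage engines", ist]


if __name__ == "__main__":
    for method, ist_ok in [("rsync", True), ("rsync", False), ("skip", True)]:
        print(method, ist_ok, provisioning_steps(method, ist_ok))
```

The empty list for skip is the point of the thread: with that setting a node that needs IST gets neither IST nor SST.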

Regards,
Alex

Henrik Ingo

Sep 30, 2012, 12:08:16 PM

to Alex Yurchenko, codersh...@googlegroups.com

Thanks, it makes sense.
