How to handle split-brain situations


Igor Galić

Jan 25, 2016, 6:08:03 AM
to s3ql

Hi folks!

So yesterday I managed to set up S3QL successfully, checked out a
Git repository, and ran a PHP application on top of it.

All of that is even automated using a Puppet module I wrote.

Everything works fine, except that when it fails, it fails horribly.
Even just shutting down the VM (which, in theory, unmounts all file systems)
makes it impossible to mount the file system on another machine:

Backend reports that fs is still mounted elsewhere, aborting.

What can I do in such situations?
If one server dies, or loses network connectivity?

Thank you in advance,

i

Nikolaus Rath

Jan 25, 2016, 1:20:14 PM
to s3...@googlegroups.com
On Jan 25 2016, Igor Galić <i.g...@brainsware.org> wrote:
> Everything works fine, except that when it fails, it fails horribly.
> Even just shutting down the VM (which, in theory, unmounts all file systems)
> makes it impossible to mount the file system on another machine:
>
> Backend reports that fs is still mounted elsewhere, aborting.
>
> What can I do in such situations?

Avoid them in the first place, by ensuring that file systems are
unmounted not just in theory but in practice.
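
In practice that means whatever runs at shutdown has to call
umount.s3ql and wait for it. As a sketch (the mount point is a
placeholder for your own setup):

    # umount.s3ql blocks until the dirty cache and metadata have been
    # uploaded to the backend, so run it and wait for it to return
    umount.s3ql /mnt/s3ql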

If the damage is already done, you can get the file system mountable
again by running fsck.s3ql, but you will most likely lose data.
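
For reference, a minimal sketch (the storage URL is a placeholder, and
I'm assuming credentials are in the default ~/.s3ql/authinfo2 file):

    # run on whichever machine should mount the file system next;
    # this checks and repairs it so that mount.s3ql accepts it again
    fsck.s3ql s3://my-bucket/my-prefix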


Best,
-Nikolaus

(No Cc on replies please, I'm reading the list)
--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«

Igor Galić

Jan 26, 2016, 1:58:30 PM
to s3ql


----- On 25 Jan, 2016, at 19:20, Nikolaus Rath Niko...@rath.org wrote:

> On Jan 25 2016, Igor Galić <i.g...@brainsware.org> wrote:
>> Everything works fine, except that when it fails, it fails horribly.
>> Even just shutting down the VM (which, in theory, unmounts all file systems)
>> makes it impossible to mount the file system on another machine:
>>
>> Backend reports that fs is still mounted elsewhere, aborting.
>>
>> What can I do in such situations?
>
> Avoid them in the first place, by ensuring that file systems are
> unmounted not just in theory but in practice.

Split-brain situations aren't theoretical; completely avoiding them
is sometimes impossible. http://queue.acm.org/detail.cfm?id=2655736
Sometimes you lose a machine because it's literally on fire, or just
because its hard disk(s) break…

> If the damage is already done, you can get the file system mountable
> again by running fsck.s3ql, but you will most likely lose data.

From my understanding, this only works on the original machine.
Am I mistaken?

Also: I don't mind data loss too much, because in such a time frame
it's most likely to be log entries or temporary files.

> Best,
> -Nikolaus
>
> (No Cc on replies please, I'm reading the list)
> --
> GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
> Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
>
> »Time flies like an arrow, fruit flies like a Banana.«
>

--
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.g...@brainsware.org
URL: https://brainsware.org/
GPG: 8716 7A9F 989B ABD5 100F 4008 F266 55D6 2998 1641

Cliff Stanford

Jan 26, 2016, 2:10:08 PM
to s3...@googlegroups.com
On 26/01/16 19:58, 'Igor Galić' via s3ql wrote:

> Split-brain situations aren't theoretical; completely avoiding them
> is sometimes impossible. http://queue.acm.org/detail.cfm?id=2655736
> Sometimes you lose a machine because it's literally on fire, or just
> because its hard disk(s) break…

Have you thought about setting --metadata-upload-interval to something
lower than the default 86,400? Maybe reduce it to 3,600, which means
you won't lose more than an hour's data.
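
For example (storage URL and mount point are placeholders for your
own setup):

    # upload metadata every hour instead of every 24 hours
    mount.s3ql --metadata-upload-interval 3600 s3://my-bucket/my-prefix /mnt/s3ql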

> Even just shutting down the VM (which, in theory, unmounts all file systems)
> makes it impossible to mount the file system on another machine:

Are you sure that the shutdown is (a) calling umount.s3ql and (b)
waiting for it to finish? It can take a very long time to return,
depending on the size of the dirty cache, the size of the metadata and
the speed of the line.
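
One way to get both is to manage the mount as a service. A rough
sketch of a systemd unit (paths and storage URL are placeholders, and
I'm assuming mount.s3ql's --fg option to keep it in the foreground):

    [Unit]
    Description=S3QL file system on /mnt/s3ql
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=simple
    # stay in the foreground so systemd can supervise the mount
    ExecStart=/usr/bin/mount.s3ql --fg s3://my-bucket/my-prefix /mnt/s3ql
    # umount.s3ql waits until the dirty cache has been uploaded
    ExecStop=/usr/bin/umount.s3ql /mnt/s3ql
    # give the flush plenty of time on shutdown
    TimeoutStopSec=3600

    [Install]
    WantedBy=multi-user.target

That way the shutdown sequence has to wait for umount.s3ql instead of
racing past it.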

Regards
Cliff.

--
Cliff Stanford
Office: +44 20 0222 1666 UK Mobile: +44 7973 616 666
Spain: +34 952 587 666
http://www.may.be/

Nikolaus Rath

Jan 26, 2016, 3:49:58 PM
to s3...@googlegroups.com
On Jan 26 2016, 'Igor Galić' via s3ql <s3...@googlegroups.com> wrote:
>> If the damage is already done, you can get the file system mountable
>> again by running fsck.s3ql, but you will most likely lose data.
>
> From my understanding, this only works on the original machine.
> Am I mistaken?

You are mistaken. Where does your understanding come from? (If it's the
S3QL documentation, I'd like to fix it).

Igor Galić

Feb 2, 2016, 5:24:07 PM
to s3ql


On Tuesday, January 26, 2016 at 8:10:08 PM UTC+1, Cliff Stanford wrote:
> On 26/01/16 19:58, 'Igor Galić' via s3ql wrote:
>
>> Split-brain situations aren't theoretical; completely avoiding them
>> is sometimes impossible. http://queue.acm.org/detail.cfm?id=2655736
>> Sometimes you lose a machine because it's literally on fire, or just
>> because its hard disk(s) break…
>
> Have you thought about setting --metadata-upload-interval to something
> lower than the default 86,400? Maybe reduce it to 3,600, which means
> you won't lose more than an hour's data.
>
>> Even just shutting down the VM (which, in theory, unmounts all file systems)
>> makes it impossible to mount the file system on another machine:
>
> Are you sure that the shutdown is (a) calling umount.s3ql and (b)

Probably not. Since it's mounted and not handled as a service (yet),
and we're calling shutdown, it'll just be unmounted using plain umount.

> waiting for it to finish? It can take a very long time to return,

And standard umount (or shutdown) might be a little impatient…

> depending on the size of the dirty cache, the size of the metadata and
> the speed of the line.

But this sounds like a very good idea, and I'll try that out *now*.

Igor Galić

Feb 2, 2016, 5:31:53 PM
to s3ql


On Tuesday, January 26, 2016 at 9:49:58 PM UTC+1, Nikolaus Rath wrote:
> On Jan 26 2016, 'Igor Galić' via s3ql <s3...@googlegroups.com> wrote:
>>> If the damage is already done, you can get the file system mountable
>>> again by running fsck.s3ql, but you will most likely lose data.
>>
>> From my understanding, this only works on the original machine.
>> Am I mistaken?
>
> You are mistaken. Where does your understanding come from? (If it's the
> S3QL documentation, I'd like to fix it).

I'm not sure more documentation will fix that.
Reading is really hard.


"Enter "continue" to use the outdated data anyway:"

I kept reading this as "press "Enter" to continue to use the outdated data anyway:"

(yay dyslexia)
 