disk raid, clone or simple 24-hr copying can have consequential boot issues during unattended hardware failures. Design flaw.


Gary Pope

Jul 25, 2012, 8:31:31 PM
to mlu...@googlegroups.com
A previous MLUG article outlined a FreeBSD boot issue (boot0) that has uncovered a potential flaw in recovery procedures.
After an enlightening talk by Nitin at MLUG last night, about converting a bootable system to a software RAID-1, the group had a side discussion about a potential RISK/flaw in recovery procedures.
 
What I'm about to outline is a rare event, and a double-murphy scenario, but it happened to me recently, and really made me realise the vulnerability of boot sequences.
 
Let me attempt to explain this without complicating it with RAID, and just consider two disks.  Behind the scenes there are other daily backups to tape and other disks involved, so DATA is secure,  but there can be a risk of losing one day's data in this scenario.
 
1.  The preferred BOOT device is say:  sda1
2.  To have a standby bootable, recovery disk onsite,  a second disk is sitting on sdb1
3.  To provide recovery capability in the event of a user error (file deletion),  that second disk permits immediate recovery next day by mount/cp/umount - NICE.
4.  To allow (3) to be available,  (1) is copied to (2)  overnight - what I call a 24-hour delayed copy.
5.  Then the desire to have a bootable standby 'recent' disk comes into the game.  So (2) should be bootable.
6.  (You can have multiple (2) disks, keeping them offsite every other week for instance, and that provides for SYSTEM recovery in the event of fire/theft etc.)
7.  Meantime, plenty of other DATA 'only' backups are being made on LTO tape, or other portable disks, DVDs or memory sticks - whatever you like.
 
So, life continues nicely every day, and in the event of, say, a total power outage where the UPS drains and power finally comes back in the middle of the weekend when the site is unattended,  you'd hope that the system reboots from (1)  (sda).
 
Why?  My assumption is that it would be nice to have the machine running unattended  24/7  whenever possible.    And this assumption is probably the main flaw in the design.
 
What happens if there is some machine malfunction, such as disk (1) on sda1 dying, or for some reason being momentarily not visible, and a reboot occurs?
(In my recent experience, the motherboard controller failed momentarily, leaving the second disk (2) on sdb1 visible and bootable. )
 
The risk is that the remaining, wrong disk (2) is seen as the only surviving disk, and therefore becomes known as sda1 even though it is meant to be sdb1.
If this occurs unattended at a bad time, such as just prior to business opening,  then the site  could unknowingly be running on the wrong disk.
 
If the backup procedure (say a crontab automated script) initiates a backup to a series of external/other devices, such scripts could be unknowingly copying the wrong generation of SYSTEM/DATA to the backup.   You've unknowingly lost 24 hrs of DATA.
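
One way to guard a backup script against this is to stamp the standby disk after each nightly copy and have every backup job check for that stamp on the running root before it copies anything. The following is a minimal sketch only: the marker file name and the function names are hypothetical, and it assumes the nightly copy job mounts the standby disk and can write one extra file to it.

```python
import os

# Hypothetical marker file name; it must exist ONLY on the standby copy.
MARKER = ".standby-copy"


def stamp_standby(mount_point):
    """Call right after the nightly copy, with the standby disk still mounted."""
    path = os.path.join(mount_point, MARKER)
    with open(path, "w") as fh:
        fh.write("24-hr delayed copy; not the primary disk\n")
    return path


def running_on_standby(root="/"):
    """True if the filesystem mounted at `root` carries the standby marker."""
    return os.path.exists(os.path.join(root, MARKER))


def backup_allowed(root="/"):
    """Backup scripts call this first and abort when it returns False."""
    return not running_on_standby(root)
```

If the copy tool mirrors deletions, the marker disappears during the copy and has to be written again afterwards, which is why the stamp belongs at the very end of the nightly job.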
 
 
Now take the RAID-1 catchup scenario, where a similar flaw exists.  In a RAID-1 mirror, when one disk fails, the failed disk is replaced and an automated CATCHUP (rebuild) begins (if a H/W raid).  During that critical time (could be hours for a 1 TB disk), the initial phase of catchup will have copied the boot blocks and probably most of the system to the target disk.  Remember, you've just had a disk failure for reasons unknown.  If the CAUSE of the failure was in any way motherboard/controller/power/environment related, there remains a secondary risk of consequential failure at this point in time.  If catchup is NOT completed, and some incident causes the machine to REBOOT unattended, and at the same time the surviving RAID-1 disk is suddenly offline, there is a chance the system could boot from the half-recovered replacement disk.  The end result is that the server could go live with users unknowingly running on half their data.
 
It appears from MLUG discussions around the table last night that care needs to be taken with features like:
 
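a)   If using GRUB on Ubuntu with software RAID,  then the kernel command line (GRUB_CMDLINE_LINUX in /etc/default/grub)  should carry  bootdegraded=false,  or BOOT_DEGRADED=false should be set in /etc/initramfs-tools/conf.d/mdadm   (ie:  as I understand it, the system will NOT boot unattended from a degraded RAID-1, and waits for an administrator instead)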
b)   Under FreeBSD in a NON-RAID environment,  perhaps boot0cfg settings are needed to only allow booting from a specific device (ie: /dev/sda1)
      **however**  you'd need to equip the system with a second type of disk controller for the 'copy' disks.  ie:  this would not prevent /dev/sdb1 being seen as /dev/sda1 in the example above.
      So:  in my case, I use a 3WARE card that is seen as a /dev/twed0  device, and can never be seen as /dev/sda1 by mistake.
      But I have to make/leave  that /dev/twed0 NON-bootable, otherwise it could be the next best boot device in the event of a head crash on /dev/sda1.
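
For point (a) on a Linux software RAID box, a startup or cron script can also refuse to go live (or at least raise an alarm) while any array is short of members or still rebuilding. This is a rough sketch only, assuming Linux md and the usual /proc/mdstat layout; the function name and the sample text are made up for illustration, and it does not apply to a hardware RAID card or to FreeBSD.

```python
import re

# Illustrative /proc/mdstat contents: md0 is healthy, md1 is rebuilding.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
      976630336 blocks super 1.2 [2/1] [U_]
      [=>...................]  recovery = 12.6% (123/976) finish=127.5min speed=33440K/sec

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return names of md arrays that are missing members or still syncing."""
    bad = set()
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue
        # "[2/1]" means 2 members expected, only 1 currently active.
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and int(counts.group(2)) < int(counts.group(1)):
            bad.add(current)
        elif re.search(r"\b(recovery|resync)\s*=", line):
            bad.add(current)
    return sorted(bad)
```

A real script would read the text from /proc/mdstat and stop the services users depend on when the returned list is not empty.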
 
The moral of the story, as I see it,  is a compromise between the desire to have an auto-boot unmanned 24/7 operation available,  versus the machine stopping in the event of a failure that prevents booting from the desired primary drive.     And, from what I can tell, there still remains a window of opportunity for consequential failure during recovery, if the machines are permitted to auto-boot before full integrity of all disks is restored.
 
I'd appreciate some feedback on this 'flaw' in disk recovery procedures.   Again,  to me,  it only comes about because of an underlying assumption/desire to have a machine capable of unmanned reboot when everything is 100% okay,  but this works against you during times of failure.
 
Gary
 
 
 
 

Gary A. Pope
B.Bus(ACC)
DIRECTOR

Alchester  Business  Systems
m:0408-994799 anytime.
p: 03-97626293 f: 03-97626293
e:
g...@alchester.com.au

Why us?  alchester.com.au/whyus.html
"We take care of everything!"

David Schoen

Jul 25, 2012, 10:49:31 PM
to mlu...@googlegroups.com
You might be interested in this as well:
https://wiki.ubuntu.com/BootDegradedRaid

A few folks are working on getting the Ubuntu install and boot
processes to work better with degraded arrays. Where "better" mainly
means forcing the user to decide if degraded arrays can be booted from
(good because it forces people to actually think about it) and some
thoughts around only assembling OS related degraded arrays (e.g. if
it's a remote machine you want the OS to load so that you can SSH in
to analyse the issue, but you may want /home to remain unwriteable).


Cheers,
Dave
> --
> You received this message because you are subscribed to the Google Groups
> "mlug-au" group.
> To post to this group, send email to mlu...@googlegroups.com.
> To unsubscribe from this group, send email to
> mlug-au+u...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mlug-au?hl=en.

Gary Pope

Jul 25, 2012, 11:34:24 PM
to mlu...@googlegroups.com
David.   That is a good article - thanks.    You, and the article, express well the idea of splitting the O/S from /home (data),  the idea being to allow a degraded raid 'system' to start,  but not the DATA (/home) aspect, so that remote support can be provided to the degraded raid array, but without users getting access to production data.   That is another design aspect of partition layouts to be considered when talking raids, that's for sure.
 
That article well expresses my attempt to highlight the issue of unmanned reboots,  versus the objective of non-stop operation (even during failures).   I'm coming more and more to the conclusion that it is the unmanned reboot issue that is at the core of this dilemma.
 
 


Darren

Jul 26, 2012, 7:03:07 AM
to mlu...@googlegroups.com
 
> Now take the RAID-1 catchup scenario, where a similar flaw exists.  In a RAID-1 mirror, when one disk fails, the failed disk is replaced and an automated CATCHUP (rebuild) begins (if a H/W raid).  During that critical time (could be hours for a 1 TB disk), the initial phase of catchup will have copied the boot blocks and probably most of the system to the target disk.  Remember, you've just had a disk failure for reasons unknown.  If the CAUSE of the failure was in any way motherboard/controller/power/environment related, there remains a secondary risk of consequential failure at this point in time.  If catchup is NOT completed, and some incident causes the machine to REBOOT unattended, and at the same time the surviving RAID-1 disk is suddenly offline, there is a chance the system could boot from the half-recovered replacement disk.  The end result is that the server could go live with users unknowingly running on half their data.

Hi Gaz,

I don't believe the above scenario is possible; the raid system should refuse to run in degraded mode from a half-rebuilt drive (at least not without a lot of fiddling, cursing and pleading). I tried testing this but I need more time and more scotch to complete that analysis! :)

If I'm right and mdadm/zfs/hardware controllers will all refuse to run off a partially-rebuilt array (as determined by the raid superblock) then you're down to one scenario where this problem can occur (assuming a raid1 system):

  1. healthy raid1
  2. power outage
  3. drive A not available on startup, raid1 resumes in degraded mode using drive B
  4. 24 hours of work done
  5. power outage (worst switchboard ever)
  6. drive A now available again but drive B got knocked out
  7. Silent data loss of 24 hours' worth of work

So the scenario becomes more contrived/unlikely but it's still there. Once you move to raid5/6 the problem goes away (although raid5 has its own data integrity issues).
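
The seven steps can be played through with a toy model. This is only a sketch of the reasoning, not of how mdadm, ZFS or any controller is implemented: the `events` field stands in for the per-member update counter kept in the RAID superblock, and the class and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    events: int = 0    # stand-in for the superblock update counter
    data_day: int = 0  # which day's work this member holds


def run_day(visible, day):
    """A day of writes only reaches the members that are visible."""
    for m in visible:
        m.events += 1
        m.data_day = day


def assemble(visible):
    """Pick the freshest visible member and list the detectably stale ones."""
    newest = max(m.events for m in visible)
    stale = [m.name for m in visible if m.events < newest]
    chosen = next(m for m in visible if m.events == newest)
    return chosen, stale


a, b = Member("A"), Member("B")
run_day([a, b], day=1)                  # step 1: healthy raid1
run_day([b], day=2)                     # steps 3-4: A missing, B works alone
both, stale_both = assemble([a, b])     # both visible: A is flagged as stale
only_a, stale_a = assemble([a])         # step 6: only A visible, nothing to compare
```

With both drives visible the counters disagree and drive A is flagged as stale. With only drive A visible there is nothing to compare against, so the array comes up holding day 1 data and reports nothing wrong, which is the silent loss in step 7.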

Cheers,

Darren


Gary Pope

Jul 26, 2012, 8:22:22 AM
to mlu...@googlegroups.com
Good input Darren.    And yes, agreed, the failure scenario I describe is 1 in a 1,000,000+ - but nonetheless, you know that in the computer industry the 'odds' always come around faster than the law of averages suggests.    The average low end server would perhaps not entertain the cost of deploying a raid 5/6 (or I've failed to do the sums lately....)     But the whole concept of double fatalities is what I'm trying to point out here,  and to come up with a fallback plan in case it were to happen (again).    Thanks for the prompt input....   it makes for a worthwhile discussion, and improves understanding of RAID, non-raid, boot and disaster planning - and that was the spirit of putting it all forward.
 
Gaz


Gary Pope

Sep 21, 2012, 1:01:35 AM
to mlu...@googlegroups.com
More input on this subject after 2 months.....   How to force a preferred disk to be the ONLY disk to boot from.
ie:  we're trying to force  SATA disks to be known as a preferred  device name based on the physical port they are cabled to,  and not just allow FreeBSD boot code to allocate a first-in-first-named approach.
 
We've isolated the BOOT issue much further recently.   We're down to one issue now.
a)  say the preferred device to boot from is:   /dev/twed0s1a   (1st SATA disk on 1st controller)    -    port 0.
b)  say you have a 24hr older, but otherwise identical BOOTABLE disk on the next port of the same controller as /dev/twed1s1a   -  on port 1.
c)  Then say,  disk (a)  crashes,  or is not 'seen' or cable falls out, or the controller Port '0'   fails......     leaving only the drive (b).
 
What we're seeing  is that the FreeBSD startup sees (b) as the only disk.  It reports it as a JBOD on port 1 (correct),  but as this is the first 'twe' device seen,  it becomes known as /dev/twed0,  and because /etc/fstab has been set up to mount /dev/twed0s1a  (and ... d,e,f,g),  the boot procedure does exactly that:  it mounts them and goes live!
 
At first I thought the boot0cfg command could mess about with the option 'setdrv' to force the use of twed0....    but the misrepresentation of the (b) disk (on port 1, not 0) as 'twed0' happens during the process of booting up,  by which time it is all too LATE for any adjustments.
 
What we need  is a way for disk (b) on port 1 to always be treated as twed1 (not twed0),  regardless of whether disk (a) is present, missing, broken or up to any other type of mischief.
 
ie:  we need disk (b) to be known as twed1 when referenced at boot time, even though it is an exact copy of twed0 from the day before!
 
There seems no way to FORCE the naming convention in BIOS (and FreeBSD would likely ignore anything in BIOS anyway).   There's no setting for drive name/number conventions in the 3WARE 8006-2LP raid card (and we're now exploring the idea of having TWO such RAID controllers, each with  2 ports, present in the system, just to make it a little harder).....   
 
Once a disk starts to boot,  it is a matter of having something in the startup procedure detect that (b) is on port 1 and not port 0,  and FORCE the device to be defined as twed1 (not twed0) because of that PORT NUMBER recognition.    Then,  when /etc/fstab (on disk (b), attempting bootup) is consulted and the system determines there are no slices available for 'twed0',  the system will pause at the "mountroot>" prompt,  awaiting an administrator to attend the site and determine why the real port 0 disk (a) is broken, missing or not available.
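
To make the difference between the two naming policies concrete, here is a small model of what we see today versus what we want. It is purely illustrative: the function names are invented, it models only the numbering rule, and it says nothing about how FreeBSD or the 3WARE driver could actually be made to wire a unit number to a port.

```python
def first_seen_names(ports_present, prefix="twed"):
    """Number disks in detection order: the behaviour we see today."""
    return {f"{prefix}{i}": port for i, port in enumerate(sorted(ports_present))}


def port_wired_names(ports_present, prefix="twed"):
    """Tie the unit number to the physical port: the behaviour we want."""
    return {f"{prefix}{port}": port for port in sorted(ports_present)}


def boot_outcome(names, fstab_root="twed0"):
    """Either the fstab root device exists and mounts, or the boot stops."""
    if fstab_root in names:
        return f"mounts {fstab_root} from port {names[fstab_root]}"
    return "stops at mountroot>"
```

With only port 1 populated, `boot_outcome(first_seen_names({1}))` gives "mounts twed0 from port 1" (the 24-hr old copy goes live), while `boot_outcome(port_wired_names({1}))` gives "stops at mountroot>", which is the result we are after.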
 
Any ideas on this one aspect of device name allocation?
 
Gaz
 

