Ok, so I ran fsck manually (even with -y), yet it refuses to clear/fix the
problems it reports in the questions posed as fsck runs. What does this all mean?
Thanks,
-Clint
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stabl...@freebsd.org"
Are you running fsck on the filesystem while it's mounted? Are you doing
this in single-user or multi-user mode?
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
Re-adding mailing list to the CC list.
No, I don't think that is the case, assuming the filesystems are UFS2
and are using softupdates. When booting multi-user, fsck is run in the
background, meaning the system is fully up + usable even before the fsck
has started.
Consider using background_fsck="no" in /etc/rc.conf if you prefer the
old behaviour. Otherwise, boot single-user then do the fsck.
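A minimal sketch of that rc.conf change (background_fsck is documented in rc.conf(5)):

```shell
# /etc/rc.conf
# Run fsck in the foreground during boot, before going multi-user,
# instead of checking dirty filesystems in the background.
background_fsck="NO"
```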
You could also consider using clri(8) to clear the inode (190). Do this
in single-user while the filesystem is not mounted. After using clri,
run fsck a couple times.
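The recovery procedure might look like the following sketch; /dev/ad0s1a is a placeholder for the affected filesystem's device, and inode 190 is the one reported earlier in this thread:

```shell
# Boot into single-user mode first, with the filesystem unmounted.

# Clear the problem inode directly (clri operates on the raw filesystem).
clri /dev/ad0s1a 190

# Run fsck a couple of times until it reports the filesystem clean;
# clearing an inode leaves dangling references that fsck must repair.
fsck -y /dev/ad0s1a
fsck -y /dev/ad0s1a
```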
Also, are there any kernel messages about ATA/SCSI disk errors or other
anomalies?
The tool is behaving how it should. Try using the -a flag.
Sunday, September 21, 2008, 3:07:20 PM, you wrote:
> Consider using background_fsck="no" in /etc/rc.conf if you prefer the
> old behaviour. Otherwise, boot single-user then do the fsck.
Actually, what's the advantage of having fsck run in the background if it
isn't capable of fixing things?
Isn't it more dangerous to leave it like that? I.e. the administrator might
not notice the problem; also the filesystem could break even further...
--
Best regards,
Derek mailto:tak...@takeda.tk
I tried to daydream, but my mind kept wandering.
This question should really be directed at a set of different folks,
e.g. actual developers of said stuff (UFS2 and soft updates in
specific), because it's opening up a can of worms.
I believe it has to do with the fact that there is much faith given to
UFS2 soft updates -- the ability to background fsck allows the user to
boot their system and have it up and working (able to log in, etc.) in a
much shorter amount of time[1]. It makes the assumption that "everything
will work just fine", which is faulty.
It also gives the impression of a journalled filesystem, which UFS2 soft
updates are not. gjournal(8) on the other hand, is, and doesn't require
fsck at all[2].
I also think this further adds fuel to the "so why are we enabling soft
updates by default and using UFS2 as a filesystem again?" fire. I'm
sure someone will respond to this with "So use ZFS and shut up". *sigh*
[1]: http://lists.freebsd.org/pipermail/freebsd-questions/2004-December/069114.html
[2]: http://lists.freebsd.org/pipermail/freebsd-questions/2008-April/173501.html
Friday, September 26, 2008, 10:14:13 PM, you wrote:
>> Actually, what's the advantage of having fsck run in the background if it
>> isn't capable of fixing things?
>> Isn't it more dangerous to leave it like that? I.e. the administrator might
>> not notice the problem; also the filesystem could break even further...
> This question should really be directed at a set of different folks,
> e.g. actual developers of said stuff (UFS2 and soft updates in
> specific), because it's opening up a can of worms.
> I believe it has to do with the fact that there is much faith given to
> UFS2 soft updates -- the ability to background fsck allows the user to
> boot their system and have it up and working (able to log in, etc.) in a
> much shorter amount of time[1]. It makes the assumption that "everything
> will work just fine", which is faulty.
As far as I know (at least ideally, when write caching is disabled)
the data should always be consistent, and all fsck is supposed to do
is free unreferenced blocks that were allocated.
Wouldn't it be possible for background fsck to do that while the
filesystem is mounted, and, if some unrepairable error somehow happens
(while in theory it should be impossible), just periodically scream
at the emergency log level?
> It also gives the impression of a journalled filesystem, which UFS2 soft
> updates are not. gjournal(8) on the other hand, is, and doesn't require
> fsck at all[2].
> I also think this further adds fuel to the "so why are we enabling soft
> updates by default and using UFS2 as a filesystem again?" fire. I'm
> sure someone will respond to this with "So use ZFS and shut up". *sigh*
I think the reason for using Soft Updates by default is that it was
a pretty hard thing to implement, and (at least in theory) it is
supposed to be as reliable as journaling.
Also, if I remember correctly, PJD said that gjournal is performing
much better with small files, while softupdates is faster with big
ones.
--
Best regards,
Derek mailto:tak...@takeda.tk
Programmers are tools for converting caffeine into code.
Friday, September 26, 2008, 11:44:17 PM, you wrote:
>> As far as I know (at least ideally, when write caching is disabled)
> Re: write caching: wheelies and burn-outs in empty parking lots
> detected.
> Let's be realistic. We're talking about ATA and SATA hard disks, hooked
> up to on-board controllers -- these are the majority of users. Those
> with ATA/SATA RAID controllers (not on-board RAID either; most/all of
> those do not let you disable drive write caching) *might* have a RAID
> BIOS menu item for disabling said feature.
> FreeBSD atacontrol does not let you toggle such features (although "cap"
> will show you if feature is available and if it's enabled or not).
> Users using SCSI will most definitely have the ability to disable
> said feature (either via SCSI BIOS or via camcontrol). But the majority
> of users are not using SCSI disks, because the majority of users are not
> going to spend hundreds of dollars on a controller followed by hundreds
> of dollars for a small (~74GB) disk.
> Regardless of all of this, end-users should, in no way shape or form,
> be expected to go to great lengths to disable their disk's write cache.
> They will not, I can assure you. Thus, we must assume: write caching
> on a disk will be enabled, period. If a filesystem is engineered with
> that fact ignored, then the filesystem is either 1) worthless, or 2)
> serves a very niche purpose and should not be the default filesystem.
> Do we agree?
Yes, but...
In the link you sent me, someone mentioned that write caching always
creates problems, regardless of the OS or filesystem.
There's more below.
>> the data should always be consistent, and all fsck is supposed to do
>> is free unreferenced blocks that were allocated.
> fsck does a heck of a lot more than that, and there's no guarantee
> that's all fsck is going to do on a UFS2+SU filesystem. I'm under the
> impression it does a lot more than just looking for unref'd blocks.
Yes, fsck does a lot more than that. But the whole point of soft
updates is to reduce fsck's work to deallocating blocks that were
allocated but never referenced.
Anyway, maybe my information is outdated, though the funny thing is that
Soft Updates were mentioned in one of my lectures on Operating Systems.
Apparently the goal of Soft Updates is to always enforce the following
rules, in a very efficient manner, by reordering the writes:
1. Never point to a data structure before initializing it
2. Never reuse a structure before nullifying pointers to it
3. Never reset last pointer to live structure before setting a new one
4. Always mark free-block bitmap entries as used before making the
directory entry point to it
The problem comes with disks which, for performance reasons, cache the
data and then write it back to the disk in a different order.
I think that's the reason why it's recommended to disable write caching.
If a disk reorders the writes, it renders soft updates useless.
But if the writing order is preserved, all data remains consistent;
the only things that might appear are blocks that were marked as used
but that nothing points to yet.
So (in the ideal situation, when nothing interferes) all fsck needs to
do is scan the filesystem and deallocate those blocks.
> The system is already up and the filesystems mounted. If the error in
> question is of such severity that it would impact a user's ability to
> reliably use the filesystem, how do you expect constant screaming on
> the console will help? A user won't know what it means; there is
> already evidence of this happening (re: mysterious ATA DMA errors which
> still cannot be figured out[6]).
> IMHO, a dirty filesystem should not be mounted until it's been fully
> analysed/scanned by fsck. So again, people are putting faith into
> UFS2+SU despite actual evidence proving that it doesn't handle all
> scenarios.
Yes, I think the background fsck should be disabled by default, with a
possibility to enable it if the user is sure that nothing will
interfere with soft updates.
> The problem here is that when it was created, it was sort of an
> "experiment". Now, when someone installs FreeBSD, UFS2 is the default
> filesystem used, and SU are enabled on every filesystem except the root
> fs. Thus, we have now put ourselves into a situation where said
> feature ***must*** be reliable in all cases.
I think in the worst case it is just as reliable as if it weren't
enabled (the only danger is the background fsck).
> You're also forgetting a huge focus of SU -- snapshots[1]. However, there
> are more than enough facts on the table at this point concluding that
> snapshots are causing more problems[7] than previously expected. And
> there's further evidence filesystem snapshots shouldn't even be used in
> this way[8].
There's not much to argue about there.
>> Also, if I remember correctly, PJD said that gjournal is performing
>> much better with small files, while softupdates is faster with big
>> ones.
> Okay, so now we want to talk about benchmarks. The benchmarks you're
> talking about are in two places[2][3].
> The benchmarks pjd@ provided were very basic/simple, which I feel is
> good, because the tests were realistic (common tasks people will do).
> The benchmarks mckusick@ provided for UFS2+SU were based on SCSI
> disks, which is... interesting to say the least.
> Bruce Evans responded with some more data[4].
> I particularly enjoy this quote in his benchmark: "I never found the
> exact cause of the slower readback ...", followed by (plausible)
> speculations as to why that is.
> I'm sorry that I sound like such a hard-ass on this matter, but there is
> a glaring fact that people seem to be overlooking intentionally:
> Filesystems have to be reliable; data integrity is focus #1, and cannot
> be sacrificed. Users and administrators *expect* a filesystem to be
> reliable. No one is going to keep using a filesystem if it has
> disadvantages which can result in data loss or "waste of administrative
> time" (which I believe is what's occurring here).
> Users *will* switch to another operating system that has filesystems
> which were not engineered/invented with these features in mind. Or,
> they can switch to another filesystem assuming the OS offers one which
> performs equally as good/well and is guaranteed to be reliable --
> and that's assuming the user wants to spend the time to reformat and
> reinstall just to get that.
I wasn't trying to argue about that. Perhaps my assumption is wrong,
but I believe that the problems we know about with Soft Updates, in the
worst case, make the system as reliable as it was without using them.
> In the case of "bit rot" (e.g. drive cache going bad silently, bad
> cables, or other forms of low-level data corruption), a filesystem is
> likely not to be able to cope with this (but see below).
> A common rebuttal here would be: "so use UFS2 without soft updates".
> Excellent advice! I might consider it myself! But the problem is that
> we cannot expect users to do that. Why? Because the defaults chosen
> during sysinstall are to use SU for all filesystems except root. If SU
> is not reliable (or is "reliable in most cases" -- same thing if you ask
> me), then it should not be enabled by default. I think we (FreeBSD)
> might have been a bit hasty in deciding to choose that as a default.
> Next: a system locking up (or a kernel panic) should result in a dirty
> filesystem. That filesystem should be *fully recoverable* from that
> kind of error, with no risk of data loss (but see below).
> (There is the obvious case where a file is written to the disk, and the
> disk has not completed writing the data from its internal cache to the
> disk itself (re: write caching); if power is lost, the disk may not have
> finished writing the cache to disk. In this case, the file is going to
> be sparse -- there is absolutely nothing that can be done about this
> with any filesystem, including ZFS (to my knowledge). This situation
> is acceptable; nature of the beast.)
> The filesystem should be fully analysed and any errors repaired (either
> with user interaction or automatically -- I'm sure it depends on the
> kind of error) **before** the filesystem is mounted.
> This is where SU gets in the way. The filesystem is mounted and the
> system is brought up + online 60 seconds before the fsck starts. The
> assumption made is that the errors in question will be fully recoverable
> by an automatic fsck, which as this thread proves, is not always the
> case.
That's why I think background fsck should be disabled by default.
Though I still don't think that soft updates hurt anything (except,
perhaps, performance).
> ZFS is the first filesystem, to my knowledge, which provides 1) a
> reliable filesystem, 2) detection of filesystem problems in real-time or
> during scrubbing, 3) repair of problems in real-time (assuming raidz1 or
> raidz2 are used), and 4) does not need fsck. This makes ZFS powerful.
> "So use ZFS!" A good piece of advice -- however, I've already had
> reports from users that they will not consider ZFS for FreeBSD at this
> time. Why? Because ZFS on FreeBSD can panic the system easily due to
> kmem exhaustion. Proper tuning can alleviate this problem, but users do
> not want to have to "tune" their system to get stability (and I feel
> this is a very legitimate argument).
> Additionally, FreeBSD doesn't offer ZFS as a filesystem during
> installation. PC-BSD does, AFAIK. So on FreeBSD, you have to go
> through a bunch of rigmarole[5] to get it to work (and doing this
> after-the-fact is a real pain in the rear -- believe me, I did it this
> weekend.)
> So until both of these ZFS-oriented issues can be dealt with, some
> users aren't considering it.
> This is the reality of the situation. I don't think what users and
> administrators want is unreasonable; they may be rough demands, but
> that's how things are in this day and age.
> Have I provided enough evidence? :-)
Yes, but as far as I understand it's not as bad as you think :)
I could be wrong though.
I 100% agree on disabling background fsck, but I don't think soft
updates are making the system any less reliable than it would be
without it.
Also, I'll have to play with ZFS some day :)
--
Best regards,
Derek mailto:tak...@takeda.tk
It's a little-known fact that the Y1K problem caused the Dark Ages.
Having been bitten by problems in this area more than once, I now always
disable background fsck. Having it disabled by default has my vote too.
Steinar Haug, Nethelp consulting, sth...@nethelp.no
On Fri, 26 Sep 2008, Jeremy Chadwick wrote:
> Let's be realistic. We're talking about ATA and SATA hard disks, hooked
> up to on-board controllers -- these are the majority of users. Those
> with ATA/SATA RAID controllers (not on-board RAID either; most/all of
> those do not let you disable drive write caching) *might* have a RAID
> BIOS menu item for disabling said feature.
While I would love to deploy every server with SAS, that's not practical
in many cases, especially for light-duty servers that are not being pushed
very hard. I am taking my chances with multiple affordable drives and
gmirror where I cannot throw in a 3Ware card. I imagine that many
non-desktop FreeBSD users are doing the same considering you can fetch a
decent 1U box with plenty of storage for not much more than $1K. I assume
many here are in agreement on this point -- just making it clear that the
bargain crowd is not some weird edge case in the userbase...
> Regardless of all of this, end-users should, in no way shape or form,
> be expected to go to great lengths to disable their disk's write cache.
> They will not, I can assure you. Thus, we must assume: write caching
> on a disk will be enabled, period. If a filesystem is engineered with
> that fact ignored, then the filesystem is either 1) worthless, or 2)
> serves a very niche purpose and should not be the default filesystem.
Arguments about defaults aside, here is my first question. If I've got a
server with multiple SATA drives mirrored with gmirror, is turning on
write-caching a good idea? What kind of performance impact should I
expect? What is the relationship between caching, soft-updates, and
either NCQ or TCQ?
Here's an example of a Seagate, trimmed for brevity:
Protocol                      Serial ATA v1.0
device model                  ST3160811AS

Feature                       Support  Enable  Value    Vendor
write cache                   yes      yes
read ahead                    yes      yes
Native Command Queuing (NCQ)  yes      -       31/0x1F
Tagged Command Queuing (TCQ)  no       no      31/0x1F
TCQ is clearly not supported, NCQ seems to be supported, but I don't know
how to tell if it's actually enabled or not. Write-caching is currently
on.
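For what it's worth, the feature table above comes from atacontrol; on SCSI (CAM) disks the write-cache setting can actually be inspected and toggled. A sketch, with placeholder device names:

```shell
# ATA/SATA: show drive capabilities, including write cache and NCQ
# support (FreeBSD's atacontrol can only report these, not change them).
atacontrol cap ad4

# SCSI/CAM: interactively edit mode page 8 (the caching page); the WCE
# field is the write-cache-enable bit.
camcontrol modepage da0 -m 8 -e
```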
The tradeoff is apparently performance vs. more reliable recovery should
the machine lose power, smoke itself, etc., but all I've seen is anecdotal
evidence of how bad performance gets.
FWIW, this machine in particular had its mainboard go up in smoke last
week. One drive was too far gone for gmirror to rebuild it without doing
a "forget" and "insert". The remaining drive was too screwy for
background fsck, but a manual check in single-user left me with no real
surprises or problems.
> The system is already up and the filesystems mounted. If the error in
> question is of such severity that it would impact a user's ability to
> reliably use the filesystem, how do you expect constant screaming on
> the console will help? A user won't know what it means; there is
> already evidence of this happening (re: mysterious ATA DMA errors which
> still cannot be figured out[6]).
>
> IMHO, a dirty filesystem should not be mounted until it's been fully
> analysed/scanned by fsck. So again, people are putting faith into
> UFS2+SU despite actual evidence proving that it doesn't handle all
> scenarios.
I'll ask, but it seems like the consensus here is that background fsck,
while the default, is best left disabled. The cases where it might make
sense are:
-desktop systems
-servers that have incredibly huge filesystems (and even there being able
to selectively background fsck filesystems might be helpful)
The first example is obvious: people want a fast-booting desktop. The
second is trading long fsck times in single-user for some uncertainty.
> The problem here is that when it was created, it was sort of an
> "experiment". Now, when someone installs FreeBSD, UFS2 is the default
> filesystem used, and SU are enabled on every filesystem except the root
> fs. Thus, we have now put ourselves into a situation where said
> feature ***must*** be reliable in all cases.
>
> You're also forgetting a huge focus of SU -- snapshots[1]. However, there
> are more than enough facts on the table at this point concluding that
> snapshots are causing more problems[7] than previously expected. And
> there's further evidence filesystem snapshots shouldn't even be used in
> this way[8].
...
> Filesystems have to be reliable; data integrity is focus #1, and cannot
> be sacrificed. Users and administrators *expect* a filesystem to be
> reliable. No one is going to keep using a filesystem if it has
> disadvantages which can result in data loss or "waste of administrative
> time" (which I believe is what's occurring here).
The softupdates question seems tied quite closely to the write-caching
question. If write-caching "breaks" SU, that makes things tricky. So
another big question:
If write-caching is enabled, should SU be disabled?
And again, what kind of performance and/or reliability sacrifices are
being made?
I'd love to hear some input from both admins dealing with this stuff in
production and from any developers who are making decisions about the
future direction of all of this.
Thanks,
Charles
> [1]: http://www.usenix.org/publications/library/proceedings/bsdcon02/mckusick/mckusick_html/index.html
> [6]: http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
> [7]: http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
> [8]: http://lists.freebsd.org/pipermail/freebsd-stable/2007-January/032070.html
I'm in full agreement here. As much as I love SCSI (and I sincerely do)
it's (IMHO unjustifiably) overpriced, simply because "it can be". You'd
expect the price of SCSI to decrease over the years, but it hasn't; it's
become part of a niche market, primarily intended for large businesses
with cash to blow. As I said, I love SCSI, the protocol is excellent,
and it's very well-supported all over the place -- and though I have
no personal experience with SAS, it appears to be equally excellent,
yet the price is comparable to SCSI.
Even at my place of work we use SATA disks in our filers. I suppose this
is justified in the sense that a disk failure there will be less painful
than it would be in a single or dual-disk server, so saving money is
legitimate since RAID-5 (or whatever) is in use. But with regards to
our server boxes, either single or dual SATA disks are now being used,
rather than SCSI. I haven't asked our datacenter and engineering folks
why we've switched, but gut feeling says "saving money".
>> Regardless of all of this, end-users should, in no way shape or form,
>> be expected to go to great lengths to disable their disk's write cache.
>> They will not, I can assure you. Thus, we must assume: write caching
>> on a disk will be enabled, period. If a filesystem is engineered with
>> that fact ignored, then the filesystem is either 1) worthless, or 2)
>> serves a very niche purpose and should not be the default filesystem.
>
> Arguments about defaults aside, here is my first question. If I've got
> a server with multiple SATA drives mirrored with gmirror, is turning on
> write-caching a good idea? What kind of performance impact should I
> expect? What is the relationship between caching, soft-updates, and
> either NCQ or TCQ?
>
> Here's an example of a Seagate, trimmed for brevity:
>
> Protocol                      Serial ATA v1.0
> device model                  ST3160811AS
>
> Feature                       Support  Enable  Value    Vendor
> write cache                   yes      yes
> read ahead                    yes      yes
> Native Command Queuing (NCQ)  yes      -       31/0x1F
> Tagged Command Queuing (TCQ)  no       no      31/0x1F
>
> TCQ is clearly not supported, NCQ seems to be supported, but I don't know
> how to tell if it's actually enabled or not. Write-caching is currently
> on.
Actually, no -- FreeBSD ata(4) does not support NCQ. I believe there
are some unofficial patches (or even a PR) floating around which are for
testing, but out of the box, it lacks support. The hyphen you see under
the Enable column is supposed to signify that (I feel it's badly placed;
it should say "notsupp" or "unsupp" or something like that. Hyphen is
too vague).
The NCQ support patches might require AHCI as well, I forget. It's been
a while.
> The tradeoff is apparently performance vs. more reliable recovery should
> the machine lose power, smoke itself, etc., but all I've seen is
> anecdotal evidence of how bad performance gets.
>
> FWIW, this machine in particular had its mainboard go up in smoke last
> week. One drive was too far gone for gmirror to rebuild it without doing
> a "forget" and "insert". The remaining drive was too screwy for
> background fsck, but a manual check in single-user left me with no real
> surprises or problems.
As long as the array rebuilt fine, I believe small quirks are
acceptable. Scenarios where the array *doesn't* rebuild properly when a
new disk is added are of great concern (and in the case of some features
such as Intel MatrixRAID, the FreeBSD bugs are so severe that you are
liable to lose data in such scenarios. MatrixRAID != gmirror, of
course).
This also leads me a little off-topic -- when it comes to disk
replacements, administrators want to be able to do this without taking
the system down. There are problems with this, but it often depends
greatly on hardware and BIOS configuration.
I've successfully done a hot-swap (hardware: SATA hot-swap backplane,
AHCI in use, SATA2 disks), but it required me to issue "atacontrol
detach" first (I am very curious to know what would've happened had I
just yanked the disk). Upon inserting the new disk, one has to be
*very* careful about the order of atacontrol commands given -- there
are cases where "attach" will cause the system to panic or SATA bus to
lock up, but it seems to depend upon what commands were executed
previously (such as "reinit").
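The sequence sketched below is roughly what that hot-swap looked like; channel and device names depend entirely on the hardware, and as noted, the wrong command order can panic the system or lock up the SATA bus:

```shell
# Detach the ATA channel holding the failed disk before pulling it.
atacontrol detach ata2

# ...physically swap the disk, then reattach the channel.
atacontrol attach ata2

# Rebuild the gmirror: drop the stale metadata for the departed disk,
# then add the replacement to the mirror.
gmirror forget gm0
gmirror insert gm0 ad4
```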
Sorry if this is off-topic, but I wanted to mention it.
>> The system is already up and the filesystems mounted. If the error in
>> question is of such severity that it would impact a user's ability to
>> reliably use the filesystem, how do you expect constant screaming on
>> the console will help? A user won't know what it means; there is
>> already evidence of this happening (re: mysterious ATA DMA errors which
>> still cannot be figured out[6]).
>>
>> IMHO, a dirty filesystem should not be mounted until it's been fully
>> analysed/scanned by fsck. So again, people are putting faith into
>> UFS2+SU despite actual evidence proving that it doesn't handle all
>> scenarios.
>
> I'll ask, but it seems like the consensus here is that background fsck,
> while the default, is best left disabled. The cases where it might make
> sense are:
>
> -desktop systems
> -servers that have incredibly huge filesystems (and even there being able
> to selectively background fsck filesystems might be helpful)
>
> The first example is obvious: people want a fast-booting desktop. The
> second is trading long fsck times in single-user for some uncertainty.
The first item I agree with, and I believe the benefits there easily
outweigh the risks/quirks.
The 2nd item I can go either way on; for example, my home BSD box has
4x500GB disks in it (and about 1/3rd is used/filled). If that box
crashes, I *most definitely* want data integrity preserved as best as
possible. Of course, I'm using ZFS + raidz1 there, so maybe I'm arguing
to hear myself talk -- but at one time, I wasn't using ZFS.
I suppose it ultimately depends on what the administrator wants; I don't
think we'll find a default that will please everyone, and I accept
that reality.
>> Filesystems have to be reliable; data integrity is focus #1, and cannot
>> be sacrificed. Users and administrators *expect* a filesystem to be
>> reliable. No one is going to keep using a filesystem if it has
>> disadvantages which can result in data loss or "waste of administrative
>> time" (which I believe is what's occurring here).
>
> The softupdates question seems tied quite closely to the write-caching
> question. If write-caching "breaks" SU, that makes things tricky. So
> another big question:
>
> If write-caching is enabled, should SU be disabled?
This is an excellent question, one I too have been pondering.
If the answer is "yes", then there are two options (pick one):
a) Change the defaults during sysinstall; do NOT enable SU on all
non-root filesystems,
b) Set hw.ata.wc=0 during the installation startup, and upon a
completed FreeBSD installation, set hw.ata.wc=0 in /boot/loader.conf,
since it is a boot-time tunable (the user sure won't know or remember
to do this).
(b) has risks involved, such as the scenario where someone has two or
more disks, and only one disk is dedicated to FreeBSD; hw.ata.wc=0
disables write caching for **all** disks, so they'd possibly see
degraded performance on the non-OS disks once mounted.
If the answer is "no", then I guess we're fine.
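Concretely, the two options sketch out as follows; the tunefs device name is a placeholder, and hw.ata.wc goes in loader.conf because it is read at boot:

```shell
# Option (a): disable soft updates on an existing filesystem.
# Run against an unmounted (or read-only mounted) filesystem.
tunefs -n disable /dev/ad0s1f

# Option (b): disable ATA write caching for all disks at boot.
# Add to /boot/loader.conf:
#   hw.ata.wc="0"
```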
> And again, what kind of performance and/or reliability sacrifices are
> being made?
>
> I'd love to hear some input from both admins dealing with this stuff in
> production and from any developers who are making decisions about the
> future direction of all of this.
As would I. Good questions, Charles! (As usual! :-) )
I can't speak for the Dell, but I can at least say that at least on the
X2100, not even Solaris supports either hot-swapping or the built in
software RAID. When they were first released the advertising said that
they had these, but those claims was quietly removed from the website
some weeks after release. Short answer: give up on hot-swap the X2100.
As for the X2100 M2, that is supposed to support it, and I believe it
works fine for us under Solaris. I'm not sure if I've got any spare
M2's here, if so I'll have a play.
Gavin
It was about a year ago with Asus and Sun Fire X2100. I don't have Asus
servers now (all were returned under warranty claims). Now I am running
one X2100 and about ten X2100 M2. I have one spare X2100 M2, so if
somebody has the exact order of commands used to "hot-swap" the disk, I
can test it in a few days.
Miroslav Lachman
>
> I certainly can't agree with this. I don't think you're measuring the
> performance of the machine --- only measuring the performance of the
> filesystem. When ZFS is active, memory is committed in the kernel to ZFS.
> That memory is then no longer available for paging. Also, there exists data
> within the ARC (I'm always tempted to say the ARC Cache, but that is
> redundant) that is also then in paging memory. You end up having to tune
> the size of the ARC to your workload, which is how UN*X did it up to 1992 or
> so. If you choose the wrong size for the ARC, you end up with very poor
> performance.
>
> In particular, the ARC fights for space with the nvidia binary driver
> (which really does need _real_ memory). To have the driver work at all, I
> have to keep the ARC from growing too large --- which is at odds with
> filesystem performance.
>
I thought about this statement a bit, and I believe it might be good to give
a real world example.
I use spamprobe. It's a dual-word Bayesian spam filter. It runs from
procmail as part of the local mail delivery process on my laptop. It
uses one of the Berkeley DB variants (I forget which at the moment, and
my binary is statically linked so it will run on amd64 as well as i386,
so I can't easily check).
I strongly suspect that the basic operation of spamprobe involves mmap()ing
the file (80 to 100 MB), performing a flurry of lookups (words in the
message) followed by a flurry of updates, followed by closing the file. Now
my current laptop has ZFS for my home directories, and that ZFS is backed by
two 320GB laptop drives (RAID 1). My old laptop has UFS on a single 160GB
laptop drive. The old laptop would deliver 10 to 20 email messages per
second through spamprobe. The new laptop delivers a mail message every 2 to
5 seconds.
In the UFS case, there would be little disk activity. Running spamprobe
would only involve cached items, and the disk activity would be the
occasional write to sync the new data and writes to deposit each email
message.
Now there are some things I don't know. Is the whole file rewritten to sync
the changes? It seems like it is. Whatever the difference between UFS and
ZFS here, it's triggering some pathological behavior.
The needs of NAS are not the needs of a local filesystem. They are
different beasts. If ZFS is optimized for being a NAS store, it does this
well. Maybe UFS can become more ZFS like or vice-versa. ZFS is a
comparatively young filesystem (and UFS is an equally old one).
> >Having been bitten by problems in this area more than once, I now always
> >disable background fsck. Having it disabled by default has my vote too.
> Is there any possibility to selectively disable / enable background fsck
> on specified mount points?
>
> I can imagine system, where root, /usr, /var and /tmp will be checked by
> fsck in foreground, but waiting to foreground fsck on data partitions of
> about 500GB or more (it can take up tens of minutes or "hours") is scary.
> I need server with ssh running up "quickly" after the crash, so I can
> investigate what the problem was and not just sit and wait tens of
> minutes "if" machine gets online again or not... answering phone calls
> of clients in the meantime.
I solve this problem this way: /usr is 300MB or less and is
mounted read-only, so it never has a problem with fsck (it will
always be skipped as clean); write activity on the root fs is minimized by
moving frequently modified objects out of root (/tmp is a symlink or a
separate fs, the ntp drift file lives on /var rather than /etc, etc.), and
background fsck is enabled.
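That layout might look something like this in /etc/fstab (a sketch only; the
device names and partition letters here are made-up examples, not taken from
any real box):

```
# illustrative /etc/fstab fragment -- device names are assumptions
/dev/ad0s1a   /       ufs   rw   1   1
/dev/ad0s1d   /var    ufs   rw   2   2
/dev/ad0s1e   /usr    ufs   ro   2   2
```

A read-only /usr stays clean, so fsck -p passes over it essentially instantly.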
And you can always use /etc/rc.early to force a foreground check of /var
even if fsck runs in background later:
#!/bin/sh
/sbin/fsck -p /var || /sbin/fsck -y /var || exit 1
So, if /var was clean, fsck -p skips it; otherwise it is checked in the
foreground and marked clean if possible. For serious errors, the check runs
again, trying its best to clean /var so the boot can continue. The later
background fsck then skips clean (and possibly already-fixed) filesystems.
Eugene Grosbein
* Turning off write caching, assuming the drive even looks at the bit,
will destroy write performance for any driver which does not support
command queueing. So, for example, scsi typically has command
queueing (as long as the underlying drive firmware actually implements
it properly), 3Ware cards have it (underlying drives, if SATA, may not,
but 3Ware's firmware itself might do the right thing).
The FreeBSD ATA driver does not, not even in AHCI mode. The RAID
code does not as far as I can tell. You don't want to turn this off.
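For reference, the bit being discussed corresponds to the ATA write-cache
loader tunable (name from ata(4)); shown here only so the knob can be
identified, since as just said you do not want to turn it off with the
FreeBSD ATA driver:

```
# /boot/loader.conf -- disables the ATA drives' write cache at boot.
# Do NOT do this with the FreeBSD ATA driver; without command
# queueing, write performance collapses, as described above.
hw.ata.wc="0"
```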
* Filesystems like ZFS and HAMMER make no assumptions on write
ordering to disk for completed write I/O vs future write I/O
and use BIO_FLUSH to enforce ordering on-disk. These filesystems
are able to queue up large numbers of parallel writes in between
each BIO_FLUSH, so the flush operation has only a very small
effect on actual performance.
Numerous Linux filesystems also use the flush command and do
not make assumptions on BIO-completion/future-BIO ordering.
* UFS + softupdates assumes write ordering between completed BIOs
and future BIOs. This doesn't hold true on a modern drive (with
write caching turned on). Unfortunately it is ALSO not really
the cause behind most of the inconsistency reports.
UFS was *never* designed to deal with disk flushing. Softupdates
was never designed with a BIO_FLUSH command in mind. They were
designed for formally ordered I/O (bowrite) which fell out of
favor about a decade ago and has since been removed from most
operating systems.
* Don't get stuck in a rut and blame DMA/Drive/firmware for all the
troubles. It just doesn't happen often enough to even come close
to being responsible for the number of bug reports.
With some work UFS can be modified to do it, but performance will
probably degrade considerably because the only way to do it is to
hold the completed write BIOs (not biodone() them) until something
gets stuck, or enough build up, then issue a BIO_FLUSH and, after
it returns, finish completing the BIOs (call the biodone()) for the
prior write I/Os. This will cause softupdates to work properly.
Softupdates orders I/O's based on BIO completions.
Another option would be to complete the BIOs but do major surgery on
softupdates itself to mark the dependencies as waiting for a flush,
then flush proactively and re-sync.
Unfortunately, this will not solve the whole problem. IF THE DRIVE
DOESN'T LOSE POWER IT WILL FLUSH THE BIOs IT SAID WERE COMPLETED.
In other words, unless you have an actual power failure, the assumptions
softupdates makes will hold. A kernel crash does NOT prevent the actual
drive from flushing the IOs in its cache. The disk can wind up with
unexpected softupdates inconsistencies on reboot anyway. Thus the
source of most of the inconsistency reports will not be fixed by adding
this feature. So more work is needed on top of that.
--
Nearly ALL of the unexpected softupdates inconsistencies you see *ARE*
for the case where the drive DOES in fact get all the BIO data it
returned as completed onto the disk media. This has happened to me
many, many times with UFS. I'm repeating this: Short of an actual
power failure, any I/O's sent to and acknowledged by the drive are
flushed to the media before the drive resets. A FreeBSD crash does
not magically prevent the drive from flushing out its internal queues.
This means that there are bugs in softupdates & the kernel which can
result in unexpected inconsistencies on reboot. Nobody has ever
life-tested softupdates to try to locate and fix the issues. Though I
do occasionally see commits that try to fix various issues, they tend
to be more for live-side non-crash cases than for crash cases.
Some easy areas which can be worked on:
* Don't flush the buffer cache on a crash. Some of you already do this
for other reasons (it makes it more likely that you can get a crash
dump).
The kernel's flushing of the buffer cache is likely a cause of a
good chunk of the inconsistency reports by fsck, because unless
someone worked on the buffer flushing code it likely bypasses
softupdates. I know when working on HAMMER I had to add a bioop
explicitly to allow the kernel flush-buffers-on-crash code to query
whether it was actually ok to flush a dirty buffer or not. Until I
did that, DragonFly was flushing HAMMER buffers on crash which
it had absolutely no business flushing.
* Implement active dependency flushing in softupdates. Instead of
having it just adjust the dependencies for later flushes, softupdates
needs to actively initiate I/O for the dependencies as they are
resolved. To do this will require implementing a flush queue;
you can't just recurse (you will blow out the kernel stack).
If you don't do this then you have to sync about a dozen times,
with short delays between each sync, to ensure that all the
dependencies are flushed. The only time this is done automatically
is during a nominal umount during shutdown.
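As a sketch, the manual work-around just described might look like this
(the iteration count and delay are guesses taken from the text, not tuned
values):

```shell
#!/bin/sh
# Sync repeatedly with short pauses so softupdates can walk its
# dependency chains; "about a dozen" iterations per the text above.
i=0
while [ "$i" -lt 12 ]; do
    sync
    sleep 1
    i=$((i + 1))
done
echo "done"
```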
* Once the above two are fixed start testing within virtual environments
by virtually pulling the plug, and virtually crashing the kernel.
Then fscking to determine if an unexpected softupdates inconsistency
occurred. There are probably numerous cases that remain.
Of course, what you guys decide to do with your background fsck is up
to you, but it seems to be a thorn in the side of FreeBSD from the day
it was introduced, along with snapshots. I very explicitly avoided
porting both the background fsck and softupdates snapshot code to DFly
due to their lack of stability.
The simple fact of the matter is that UFS just does not recover well
on a large disk. Anything over 30-40 million inodes and you risk
not being able to fsck the drive at all, not even in 64-bit mode (you
will run out of swap). You get one inconsistency and the filesystem
is broken forever. Anything over 200GB and your background fsck can
wind up taking hours, seriously degrading the performance of the system
in the process. It can take 6 hours to fsck a full 1TB HD. It can
take over a day to fsck larger setups. Putting in a few sleeps here
and there just makes the run time even longer and perpetuates the pain.
My recommendation? Default UFS back to a synchronous fsck and stop
treating ZFS (your only real alternative) as being so ultra-alpha that
it shouldn't be used. Start recommending it for any filesystem larger
than 200GB. Clean up the various UI issues that can lead to
self-immolation and foot-stomping. Fix the defaults so they don't blow out
kernel malloc areas, etc etc. Fix whatever bugs pop up. UFS is
already unsuitable for 'common' 1TB consumer drives even WITH the
background fsck. ZFS is ALREADY far safer to use than UFS for
large disks, given reasonable constraints on feature selection.
-Matt
OK, but one advantage of ZFS memory consumption is under heavy write
loads, where much of the memory is used to store and reorder writes.
The heavy memory consumption under reading is a shame, but ZFS has to
cache and use more metadata than UFS, so it's a price you pay for the
extra features and benefits.
What I think we need is a way to turn off read caching for everything
except metadata, so the ARC gets used more efficiently. Currently
you can only turn all read-ahead on or off with the provided sysctl
tunables, but it would be easy to implement a metadata-only option. I
found that access speed suffers when metadata is not prefetched.
If you are running an X workstation with 2GB or less memory, then I
agree ZFS is a bad default choice.
For my workstation I would still use ZFS, but I would:
* turn down ARC size,
* turn off read-ahead except for metadata,
* and even turn off ZIL and write cache flushing, which solves the
annoyance of unpredictable delays when flushing buffers. Not a good
choice for a server but perfect for a workstation.
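A sketch of that list as /boot/loader.conf lines (tunable names as I recall
them from the 7.x ZFS port; the values are illustrative assumptions, and
note the current prefetch knob is all-or-nothing rather than metadata-only):

```
# /boot/loader.conf -- illustrative ZFS workstation tuning
vfs.zfs.arc_max="268435456"        # cap the ARC at 256 MB
vfs.zfs.prefetch_disable="1"       # all-or-nothing; no metadata-only mode
vfs.zfs.zil_disable="1"            # workstation only: drop the ZIL
vfs.zfs.cache_flush_disable="1"    # workstation only: skip cache flushes
```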
- Andrew
That'd be handy, but at least on my system the data prefetcher isn't
really called often enough to make a difference either way (assuming
the counts are accurate). Metadata prefetch is a big win, however.
(d...@dan.15) /home/dan> uptime
11:00PM up 5 days, 13:47, 21 users, load averages: 1.52, 1.68, 1.69
(d...@dan.15) /home/dan> sysctl kstat
[..]
kstat.zfs.misc.arcstats.hits: 211130907 (95%)
kstat.zfs.misc.arcstats.misses: 9808431
kstat.zfs.misc.arcstats.demand_data_hits: 116614377 (98%)
kstat.zfs.misc.arcstats.demand_data_misses: 2477943
kstat.zfs.misc.arcstats.demand_metadata_hits: 55805261 (96%)
kstat.zfs.misc.arcstats.demand_metadata_misses: 2310006
kstat.zfs.misc.arcstats.prefetch_data_hits: 79878 (53%)
kstat.zfs.misc.arcstats.prefetch_data_misses: 71741
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 38556033 (88%)
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 4947270
kstat.zfs.misc.arcstats.mru_hits: 23702582 (95%)
kstat.zfs.misc.arcstats.mru_ghost_hits: 1274189
kstat.zfs.misc.arcstats.mfu_hits: 149722171 (98%)
kstat.zfs.misc.arcstats.mfu_ghost_hits: 2944572
[..]
kstat.zfs.misc.arcstats.p: 235221504
kstat.zfs.misc.arcstats.c: 268435456
kstat.zfs.misc.arcstats.c_min: 67108864
kstat.zfs.misc.arcstats.c_max: 268435456
kstat.zfs.misc.arcstats.size: 263926784
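(The annotated percentages are just hits/(hits+misses), truncated; for the
overall counters, a quick check:)

```shell
#!/bin/sh
# Recompute the overall ARC hit percentage from the first two
# counters above (values hard-coded from that listing).
hits=211130907
misses=9808431
awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%d%%\n", 100 * h / (h + m) }'
```

which prints 95%.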
--
Dan Nelson
dne...@allantgroup.com
Matt, I just wanted to take a moment to thank you for your verbose
and thorough outline of the issues as you see them. You're the
first developer (albeit Dragonfly :-) ) I've seen to comment on
these in detail.
I fully agree with each and every item you covered, as well as the items
in your follow-up mail to Andrew (re: mentioning BIOS/software RAID and
hardware RAID at the end).
Going with ZFS as the default filesystem is really something we should
be considering seriously. Oh, and yes, I *completely* agree with your
statement about the Foundation coughing up money to pjd@ for his
efforts. ZFS "saving our asses" is how I put it too. :-)
The topic of BIO_FLUSH is something I got to thinking about last night
at work; the only condition where a disk with write caching enabled
*would not* fully write the data to the platter would in fact be power
loss. All other conditions (specifically soft reset and panic) should
not require explicit flushing.
I wonder why this is being done, especially on shutdown of FreeBSD.
Assuming I understand it correctly, I'm talking about this:
Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining...3 3 3 2 2 0 0 done
All buffers synced.
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
BIO_FLUSH and "Syncing disks, vnodes ..." are two different things,
so I'm not sure of the context but I will describe issues with both.
--
BIO_FLUSH commands the disk firmware to flush out any dirty buffers in
its drive cache. That is, writes that you have *already* issued to
the drive and which returned completion, but which have not actually
made it to the physical media yet. This is different from dirty buffers
still being maintained by the kernel which have not yet been sent to
the drive. (Just repeating this so the definition is clear to all
the readers).
So, yes, you would want to do a BIO_FLUSH before powering down a
machine (halt -p) to ensure that all the dirty data you sent to the
disk actually gets to the platter.
I think you also want to issue it for a soft reset. It would not
affect a SATA drive but it certainly would affect a USB drive powered
from the computer. USB ports will be powered down during a soft
reset. BIO_FLUSH isn't likely to cause problems during a crash, unlike
flushing the buffer cache.
Some people may remember earlier versions of Windows XP often powered
the machine down before the hard drive managed to write all of its data
to the platter. Sometime that would even destroy sectors on the drive.
We know bad things happen if we don't issue the command, so best not to
take chances by making assumptions.
--
The "Syncing disks, vnodes ..." is the kernel flushing out any dirty
data in the buffer cache which has not yet been sent to the disk
driver.
This is more problematic. Filesystems such as HAMMER (and presumably
ZFS) absolutely do NOT want the system to flush dirty buffers unless
they explicitly give permission to do so, because the dirty buffers
might represent data for which the recovery information has not yet
been written out, and thus can corrupt the filesystem on-media if a
crash were to occur right then.
In HAMMER's case I enhanced the bioops a bit to allow HAMMER to veto
write-outs initiated by the system. sync_on_panic is irrelevant,
the buffers will not be synced without HAMMER's permission and it
won't give it.
There is also the very real general case where a traditional filesystem
such as UFS must perform multiple buffer cache ops, dirtying multiple
buffer cache buffers, in order to complete an operation. If a crash
were to occur right in the middle of such a sequence the kernel would
wind up writing dirty buffers related to incomplete operations to the
media, resulting in corruption.
In the case of softupdates one is presented with a conundrum. If you
don't write out the buffer cache during a crash you stand to lose a lot
more than 60 seconds worth of changes due to deep dependency chains.
One 'sync' doesn't do the job and even though it is supposed to get all
the primary data and meta-data onto the disk and just leave the bitmap
updates for background operations it doesn't always seem to do that.
The softupdates code is very fragile.
On the other hand, if you *DO* try to write out the buffer cache during
a crash you have a good chance of deadlocking the system or
double-panicking, resulting in inconsistencies on the media, and you
risk doing a partial write out also resulting in inconsistencies on the
media.
Here is an example: how does the crash code deal with dirty but locked
buffer cache buffers? Say you have a softupdates filesystem and through
the course of operations you dirty a dozen buffers, then a crash occurs
while you are in the middle of ANOTHER softupdates operation which is
holding several buffers already dirtied by previous operations locked.
What happens now if the crash code tries to sync the buffer cache? Will
it sync the previously dirtied buffers that are currently locked? Will
it sync the ones that haven't been locked but skip the ones that are
locked? You lose both ways. There is no way to safely sync ANYTHING,
whether locked or not, without risking unexpected softupdates
inconsistencies on-media. This alone makes background fsck problematic
and risky.
-Matt
Matthew Dillon
<dil...@backplane.com>
Today I was replacing disk in one Sun Fire X2100 M2 so I tried
hot-swapping. It was as you said: atacontrol detach ata3, replace the
HDD, atacontrol attach ata3 and new disk is in the system. I tried it 3
times to be sure that it was not coincidence - no panic was produced ;o)
So in this case, hot-swapping on Sun Fire X2100 M2 with FreeBSD 7.0 i386
works.
Miroslav Lachman
# atacontrol list
ATA channel 0:
Master: no device present
Slave: no device present
ATA channel 1:
Master: no device present
Slave: no device present
ATA channel 2:
Master: ad4 <Hitachi HDP725050GLA360/GM4OA52A> Serial ATA II
Slave: no device present
ATA channel 3:
Master: ad6 <Hitachi HDP725050GLA360/GM4OA52A> Serial ATA II
Slave: no device present
# atacontrol detach ata3
subdisk6: detached
ad6: detached
GEOM_MIRROR: Device gm0: provider ad6 disconnected
# atacontrol list
ATA channel 0:
Master: no device present
Slave: no device present
ATA channel 1:
Master: no device present
Slave: no device present
ATA channel 2:
Master: ad4 <Hitachi HDP725050GLA360/GM4OA52A> Serial ATA II
Slave: no device present
ATA channel 3:
Master: no device present
Slave: no device present
## [old disk was physically removed]
## [new disk was physically inserted]
# atacontrol attach ata3
ata3: [ITHREAD]
ad6: 953869MB <SAMSUNG HD103UJ 1AA01113> at ata3-master SATA300
Master: ad6 <SAMSUNG HD103UJ/1AA01113> Serial ATA II
Slave: no device present
# atacontrol list
ATA channel 0:
Master: no device present
Slave: no device present
ATA channel 1:
Master: no device present
Slave: no device present
ATA channel 2:
Master: ad4 <Hitachi HDP725050GLA360/GM4OA52A> Serial ATA II
Slave: no device present
ATA channel 3:
Master: ad6 <SAMSUNG HD103UJ/1AA01113> Serial ATA II
Slave: no device present
That's excellent news. So it seems possibly the problem I was seeing
was with "reinit" causing some sort of chaos. I'll have to check things
on my testbox here at home to see how I caused the panic last time.
Thanks for providing feedback, as usual! :-)
Unfortunately there is one problem - I see a lot of interrupts after
disk swapping (about 193k of atapci1)
Interrupts
197k total
ohci0 21
ehci0 22
193k atapci1 23
2001 cpu0: time
1 bge1 273
2001 cpu1: time
Full output of systat -vm 2 is attached.
It is shown in top as 50% interrupt (CPU state) and load 1 until I
rebooted the machine (I can provide MRTG graphs). The system was not in
production load, but almost idle. (I will put it in production tomorrow).
After reboot, everything is OK.
Can somebody test hot-swapping with SATA drives and confirm this
behavior? (I can't test it now, because machine is in datacenter)
Miroslav Lachman
Okay, so it looks like the interrupt rate on atapci1 after swapping is
going crazy. What you're showing there looks like heavily modified
vmstat -i output.
> Full output of systat -vm 2 is attached.
>
> It is shown in top as 50% interrupt (CPU state) and load 1 until I
> rebooted the machine (I can provide MRTG graphs). The system was not in
> production load, but almost idle. (I will put it in production tomorrow).
> After reboot, everything is OK.
And this box is running the ATA patch Andrey provided, yes?
> Can somebody test hot-swapping with SATA drives and confirm this
> behavior? (I can't test it now, because machine is in datacenter)
I can test it on my P4SCE box.
I'll check the interrupt rates after each step of the hot-swap to see
if/when the problem starts.
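One way to grab just the counter for comparison between steps (a sketch;
the canned sample line is vmstat -i output of the form quoted later in this
thread, and on a live box you would pipe real `vmstat -i` output in instead):

```shell
#!/bin/sh
# Extract the total interrupt count for atapci1 from a vmstat -i
# style line: "irqN: device total rate".
line="irq18: atapci1 945 23"
echo "$line" | awk '/atapci1/ { print $3 }'
```

which prints 945 for the sample line.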
The output shown was manually cropped from systat -vm; I'll try vmstat -i next
time. ;)
>>Full output of systat -vm 2 is attached.
>>
>>It is shown in top as 50% interrupt (CPU state) and load 1 until I
>>rebooted the machine (I can provide MRTG graphs). The system was not in
>>production load, but almost idle. (I will put it in production tomorrow).
>>After reboot, everything is OK.
>
>
> And this box is running the ATA patch Andrey provided, yes?
It is a clean install of FreeBSD 7.0-RELEASE-p5 amd64 without patches.
>>Can somebody test hot-swapping with SATA drives and confirm this
>>behavior? (I can't test it now, because machine is in datacenter)
>
>
> I can test it on my P4SCE box.
>
> I'll check the interrupt rates after each step of the hot-swap to see
> if/when the problem starts.
I'll check the interrupts next time too and will post results to this
thread.
Miroslav Lachman
As promised, here are notes from my testing:
First thing to note is that the BIOS on my P4SCE had the ICH5 SATA mode
set to "Auto", which was causing PATA emulation to happen on the SATA
controller, e.g. disk #0 == ata0-master, disk #1 == ata0-slave.
I changed the BIOS option from Auto to "SATA Enhanced", and now the
disks show up on their own channels, e.g. disk #0 == ata2-master, disk
#1 == ata3-master.
Here's the applicable data. Note that this kernel ***DOES*** include
Andrey's ATA patch:
FreeBSD testbox.home.lan 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Thu Oct 16 10:56:42 PDT 2008 ro...@testbox.home.lan:/usr/obj/usr/src/sys/TESTBOX i386
atapci1: <Intel ICH5 SATA150 controller> port 0xc000-0xc007,0xc400-0xc403,0xc800-0xc807,0xcc00-0xcc03,0xd000-0xd00f irq 18 at device 31.2 on pci0
atapci1: [ITHREAD]
ata2: <ATA channel 0> on atapci1
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci1
ata3: [ITHREAD]
SATA controller is on IRQ 18.
ad4: 114473MB <Seagate ST3120026AS 3.05> at ata2-master SATA150
ad6: 238474MB <WDC WD2500KS-00MJB0 02.01C03> at ata3-master SATA150
ATA channel 2:
Master: ad4 <ST3120026AS/3.05> Serial ATA v1.0
Slave: no device present
ATA channel 3:
Master: ad6 <WDC WD2500KS-00MJB0/02.01C03> Serial ATA II
Slave: no device present
testbox# df -k
Filesystem 1024-blocks Used Avail Capacity Mounted on
/dev/ad4s1a 507630 230182 236838 49% /
devfs 1 1 0 100% /dev
/dev/ad4s1e 507630 12 467008 0% /tmp
/dev/ad4s1f 108498334 2944826 96873642 3% /usr
/dev/ad4s1d 2008622 32360 1815574 2% /var
/dev/ad6s1d 236511738 4 217590796 0% /hotswap
testbox# vmstat -i
interrupt total rate
irq4: sio0 1398 34
irq6: fdc0 10 0
irq15: ata1 58 1
irq18: atapci1 945 23
irq23: em1 8 0
cpu0: timer 80033 1952
cpu1: timer 79808 1946
Total 162260 3957
testbox# umount /hotswap
testbox# atacontrol detach ata3
subdisk6: detached
ad6: detached
testbox# vmstat -i | grep atapci1
irq18: atapci1 2671 11
At this point I wanted to see what happened if I just reattached without
any physical changes to the SATA bus.
testbox# atacontrol attach ata3
ata3: [ITHREAD]
ad6: 238474MB <WDC WD2500KS-00MJB0 02.01C03> at ata3-master SATA150
Master: ad6 <WDC WD2500KS-00MJB0/02.01C03> Serial ATA II
Slave: no device present
testbox# vmstat -i | grep atapci1
irq18: atapci1 2764 9
testbox# mount /dev/ad6s1d /hotswap
testbox# vmstat -i | grep atapci1
irq18: atapci1 2779 8
Now we're going to try detaching *without* umounting the filesystem,
then reattaching to see what happens. Based on what I've seen and
others have reported in the past, this should panic the kernel.
Supposedly this problem is fixed on CURRENT.
testbox# atacontrol detach ata3
subdisk6: detached
ad6: detached
testbox# atacontrol attach ata3
ata3: [ITHREAD]
ad6: 238474MB <WDC WD2500KS-00MJB0 02.01C03> at ata3-master SATA150
Master: ad6 <WDC WD2500KS-00MJB0/02.01C03> Serial ATA II
Slave: no device present
testbox# df -k
Filesystem 1024-blocks Used Avail Capacity Mounted on
/dev/ad4s1a 507630 230182 236838 49% /
devfs 1 1 0 100% /dev
/dev/ad4s1e 507630 12 467008 0% /tmp
/dev/ad4s1f 108498334 2944826 96873642 3% /usr
/dev/ad4s1d 2008622 32360 1815574 2% /var
/dev/ad6s1d 236511738 4 217590796 0% /hotswap
testbox# ls -l /hotswap
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xc0
fault code = supervisor read, page not present
instruction pointer = 0x20:0xc0503ca7
stack pointer = 0x28:0xe6310a5c
frame pointer = 0x28:0xe6310a5c
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 795 (ls)
[thread pid 795 tid 100043 ]
Stopped at dev2udev+0x11: movl 0xc0(%eax),%eax
db> bt
Tracing pid 795 tid 100043 td 0xc3dcc460
dev2udev(3287166208,3861973668,3228755039,3861973872,3286025312,...) at dev2udev+17
ufs_getattr(3861973664,3861973800,3227504003,3229700640,3861973664,...) at ufs_getattr+222
VOP_GETATTR_APV(3229700640,3861973664,3229768320,3288955040,3861973684,...) at VOP_GETATTR_APV+68
vn_stat(3288955040,3861973908,3286230784,0,3286025312,...) at vn_stat+73
kern_lstat(3286025312,135344488,0,3861974040,3861974064,...) at kern_lstat+147
lstat(3286025312,3861974268,8,3861974328,3861974316,...) at lstat+43
syscall(3861974328) at syscall+814
Xint0x80_syscall() at Xint0x80_syscall+32
--- syscall (190, FreeBSD ELF32, lstat), eip = 1746463051, esp = 3217024524, ebp = 3217024664 ---
Yup, there's the panic. :-)
I rebooted the box from db, brought the system up in single-user, fsck'd
all the disks/filesystems (no anomalies were found), and rebooted the
box once more.
Now we're going to do everything properly: unmount /hotswap, detach,
yank the disk and insert a new Maxtor hard disk, attach, and see what
happens.
testbox# umount /hotswap
testbox# atacontrol detach ata3
subdisk6: detached
ad6: detached
testbox# vmstat -i | grep atapci1
irq18: atapci1 1174 6
I've now removed the disk physically from the machine. Let's check
interrupts again.
testbox# vmstat -i | grep atapci1
irq18: atapci1 1185 4
Now the new Maxtor disk has been inserted. LEDs for the SATA hot-swap
backplane lit up for about 5-6 seconds, then went off. Let's check
interrupts at this point:
testbox# vmstat -i | grep atapci1
irq18: atapci1 1193 3
Now let's attach. Note that there is no filesystem on this disk (it's
completely blank), so there's nothing to mount.
testbox# atacontrol attach ata3
ata3: [ITHREAD]
ad6: 286188MB <Maxtor 6L300S0 BANC1G20> at ata3-master SATA150
Master: ad6 <Maxtor 6L300S0/BANC1G20> Serial ATA v1.0
Slave: no device present
And now we check interrupts:
testbox# vmstat -i | grep atapci1
irq18: atapci1 1258 2
Looks fine to me.
Thank you for your time, testing and reporting detailed results!
I will investigate my case sometime in the future (if time permits).
Miroslav Lachman
I played again with hot-swapping disks in Sun Fire X2100 M2 on FreeBSD
7.0-RELEASE-p5 i386 without ATA patches.
Both disks (ad4 + ad6) are in gmirror. There were high interrupts load
again! I tracked it to the point of pulling out the disk.
The interrupt rate was OK after 'atacontrol detach', but rose after the disk
was removed. When the disk is inserted back (the same disk), interrupts
return to the normal rate without a reboot.
I tried it three times and behavior was always the same.
It can be related to the use of gmirror.
Side note: If the disk was detached by 'atacontrol detach ata2' without
removing from gmirror (without gmirror remove or gmirror deactivate) and
then pulled out + inserted back, it was automagically attached without
need of 'atacontrol attach ata2' and gmirror synchronization was
autostarted.
As I am planning my vacation I will not have time to test newer versions
of FreeBSD (or patches); I will test it later in December.
Miroslav Lachman
Interesting -- yeah, that's the one difference (besides hardware)
between your tests and mine: you're using gmirror, I'm not.
I'll take some time this weekend (or the upcoming weekend) to set up
gmirror on the above test box, and see if I can reproduce what you're
seeing. This will be my first experience with gmirror.
> Side note: If the disk was detached by 'atacontrol detach ata2' without
> removing from gmirror (without gmirror remove or gmirror deactivate) and
> then pulled out + inserted back, it was automagically attached without
> need of 'atacontrol attach ata2' and gmirror synchronization was
> autostarted.
This could mean that gmirror is constantly "polling" the underlying
device layer for certain things, which might explain the high interrupt
rate you're seeing. We should probably involve pjd@ in this discussion.