ad WRITE_DMA timing out frequently

Reid Linnemann

unread,

Feb 18, 2005, 10:03:52 AM2/18/05

to

I've recently brought a machine up from 5.3-STABLE to 6-CURRENT. It
usually just sits in the corner and runs services, but lately I've come
home form work or woken up to find that it is completely unresponsive,
and I have to hard reset the machine. It happens at least once a day,
and it's becoming more and more frequent. When I look at the console, I
always have the same 4 messages before the failure:

ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
kernel: ad0: FAILURE - WRITE_DMA timed out
kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error = 5

It seems to me that a sector on the disk might be dead in the ad0s1d
slice (/var), but I want to be certain before I take further steps that
the behavior I'm experiencing is positively unrelated to the migration
to 6-CURRENT.

I started poking around /var to see if anything was amiss, and I found
that mail messages are being stacked up in /var/spool/clientmqueue, even
though nothing should be using the msp queue (I've redirected periodic
outputs to logfiles). In the last daily run mailed to root in January,
I found records in the submit queue that looked like this:

j0EDINHh049826 2489 Fri Jan 14 07:18 MAILER-DAEMON
(Deferred: Permission denied)

There were nearly 500 of them.

Even after redirecting periodic output to logs and clearing out the
client mail queue, this continues to happen, and I have a hunch that it
may be related to the WRITE_DMA timeouts, as it's the only weird
behavior I can see on /var. If anyone can help me shed some light on
this, I'd appreciate it. I've had 2 IDE drives die in this machine
already, I'm going to be severely depressed if I've killed a third.

-Reid
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-curre...@freebsd.org"

Colin Percival

unread,

Feb 18, 2005, 10:11:36 AM2/18/05

to

Reid Linnemann wrote:
> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
> kernel: ad0: FAILURE - WRITE_DMA timed out
> kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error = 5

Looks like a dying disk.

> It seems to me that a sector on the disk might be dead in the ad0s1d
> slice (/var), but I want to be certain before I take further steps that
> the behavior I'm experiencing is positively unrelated to the migration
> to 6-CURRENT.

Install sysutils/smartmontools from ports and use smartctl. Odds are that
the drive has logged hundreds of errors by now.

Colin Percival

Reid Linnemann

unread,

Feb 18, 2005, 10:25:29 AM2/18/05

to

smartctl doesn't report any logged errors. On a hunch, I also dd'ed a
file large enough to fill /var, hoping that it would crater on writing
to that sector. It didn't. I know that's not a very useful test, but
it seems to hint to me that the disk isn't bad, but the driver is
freaking out from some event.

Reid Linnemann

unread,

Feb 18, 2005, 11:48:56 AM2/18/05

to

On 2/18/2005, "Colin Percival" <cper...@freebsd.org> wrote:

>Reid Linnemann wrote:
>> smartctl doesn't report any logged errors. On a hunch, I also dd'ed a
>> file large enough to fill /var, hoping that it would crater on writing
>> to that sector. It didn't. I know that's not a very useful test, but
>> it seems to hint to me that the disk isn't bad, but the driver is
>> freaking out from some event.
>

>It's quite possible that the driver is at fault, but I'd run a
>"smartctl -t long" test first, just to make sure the drive isn't
>suffering an intermitant fault.
>
>Colin Percival

I ran smartctl -t long on the disk, and sure enough it's healthy:

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining
LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 22444
-

There must be some strange system event happening that is causing the
driver to freak out. I really have a hunch it's related to the msp
queue mail that is being dumped in /var/spool/clientmqueue, because the
that occurance roughly falls in lline with the ad0 failures, and it's
all on the same partition that the error is reported on.

Paul Mather

unread,

Feb 18, 2005, 1:48:20 PM2/18/05

to

On Fri, 18 Feb 2005 09:03:35 -0600 (CST), "Reid Linnemann"
<lr...@cs.okstate.edu> wrote:

> I've recently brought a machine up from 5.3-STABLE to 6-CURRENT. It
> usually just sits in the corner and runs services, but lately I've
> come
> home form work or woken up to find that it is completely unresponsive,
> and I have to hard reset the machine. It happens at least once a day,
> and it's becoming more and more frequent. When I look at the console,
> I
> always have the same 4 messages before the failure:
>

> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
> kernel: ad0: FAILURE - WRITE_DMA timed out
> kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error
> = 5
>

> It seems to me that a sector on the disk might be dead in the ad0s1d
> slice (/var), but I want to be certain before I take further steps
> that
> the behavior I'm experiencing is positively unrelated to the migration
> to 6-CURRENT.
>

> I started poking around /var to see if anything was amiss, and I found
> that mail messages are being stacked up in /var/spool/clientmqueue,
> even
> though nothing should be using the msp queue (I've redirected periodic
> outputs to logfiles). In the last daily run mailed to root in
> January,
> I found records in the submit queue that looked like this:
>
> j0EDINHh049826 2489 Fri Jan 14 07:18 MAILER-DAEMON
> (Deferred: Permission denied)
>
> There were nearly 500 of them.
>
> Even after redirecting periodic output to logs and clearing out the
> client mail queue, this continues to happen, and I have a hunch that
> it
> may be related to the WRITE_DMA timeouts, as it's the only weird
> behavior I can see on /var. If anyone can help me shed some light on
> this, I'd appreciate it. I've had 2 IDE drives die in this machine
> already, I'm going to be severely depressed if I've killed a third.

The "TIMEOUT - WRITE_DMA" issue has been a recurring problem for me
since somewhere in the 5.2.1--5.3 release range. (It's been so long now
that I don't remember whether it first started plaguing me in 5.2.1 or
5.3. I do know for definite I never got this problem in 5.1 and it only
crept in during an "upgrade.")

Like you, this has been happening more frequently with 6-CURRENT for me.
As in your case, I come to find the machine completely unresponsive
(though still pingable) and I have to hard reset the machine. I'm
finding this is now happening roughly every other day on average for the
past week since my last system rebuild (FreeBSD 6.0-CURRENT #0: Fri Feb
11 09:03:49 EST 2005).

In my case, I'm using geom_mirror to mirror two drives. The "TIMEOUT -
WRITE_DMA" involves the geom_mirror metadata sector on one of the two
drives, but not always the same one (sometimes it is ad0, sometimes
ad2). The net result is to cause the drive in question to be removed
from the mirror. Disappointingly, rather than carry on in degraded
fashion, lately the system seems eventually to seize up as you describe.
It doesn't seem to seize up immediately, because I notice an entry
in /var/log/messages after the error but before the required hard reset
reboot:

Feb 18 05:24:38 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679
Feb 18 05:24:43 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679
Feb 18 05:24:43 zappa kernel: ad2: FAILURE - WRITE_DMA timed out
Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Cannot update metadata on disk ad2 (error=5).
Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Device raid1: provider ad2 disconnected.
Feb 18 09:46:35 zappa named[349]: zone ./IN: Transfer started.
Feb 18 09:46:35 zappa named[349]: transfer of './IN' from 128.9.0.107#53: connected using 192.168.1.25#64153
Feb 18 09:46:37 zappa named[349]: zone ./IN: transferred serial 2005021800
Feb 18 09:46:37 zappa named[349]: transfer of './IN' from 128.9.0.107#53: end of transfer
[[forced reboot]]
Feb 18 11:48:46 zappa syslogd: kernel boot file is /boot/kernel/kernel
Feb 18 11:48:46 zappa kernel: Copyright (c) 1992-2005 The FreeBSD Project.
Feb 18 11:48:46 zappa kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Feb 18 11:48:46 zappa kernel: The Regents of the University of California. All rights reserved.
Feb 18 11:48:46 zappa kernel: FreeBSD 6.0-CURRENT #0: Fri Feb 11 09:03:49 EST 2005

I get this problem on 6-CURRENT and also RELENG_5. The RELENG_5 system
has a geom_vinum mirrored setup, and when the "TIMEOUT - WRITE_DMA"
occurs I lose the associated drive and plexes from the configuration.
The problem does not happen as often on the RELENG_5 system, as it does
on HEAD, at least nowadays it doesn't.

I run smartctl on the systems, and none of the drives report any errors,
and the "WORST" values recorded are nowhere near close to their
respective failure thresholds.

In my case, I have one area of commonality. Between the three different
systems on which I've experienced this problem, all use the Intel PIIX4
ATA controller and the same IBM-DJNA-352500/J51OA30K hard drives. So,
I'm wondering if there is something about this particular combination
that gives rise to this annoying problem?

I do use the same IBM-DJNA-352500/J51OA30K hard drives in another system
and have never experienced this (or any other) problem. However, it is
running 4.11-STABLE and has a "VIA 82C686 ATA66 controller", so it's
impossible to tell if it's 4.11-STABLE or the VIA ATA controller
contributing to the stability in that case.

I don't think I have a hardware problem. The same setup ran fine under
earlier 5.x releases. But, somewhere, this issue crept in (I remember
threads on freebsd-current about it), and recently it appears to be
getting worse (at least for me). Also, unfortunately for me,
geom_mirror used to roll with the punches when I lost a drive through
this "TIMEOUT - WRITE_DMA" problem, but now it doesn't. :-(

Cheers,

Paul.
--
e-mail: pa...@gromit.dlib.vt.edu

"Without music to decorate it, time is just a bunch of boring production
deadlines or dates by which bills must be paid."
--- Frank Vincent Zappa

Reid Linnemann

unread,

Feb 18, 2005, 2:10:49 PM2/18/05

to

The disk I am using is an IBM as well:

smartctl -a output yields this info on the device:

Device Model: IBM-DPTA-372050
Serial Number: JMYJM131600
Firmware Version: P76OA30A

My ATAPI controller is a VIA 82C686A as well. I have been running
FreeBSD 4.3 up to 6-CURRENT with this controller without issue until now
too. So I think we can assume that the problem was introduced in 5 and
carried on through 6. I think I recall bumping into this with a Western
Digital 10 gig disk a while back on 5.3-STABLE. I was under school
pressure then and just dropped the drive out completely when I started
getting a hung system and ad0 messages. I'll plug it back in as a slave
this weekend and run smartctl on it.

Philipp Stutz

unread,

Feb 19, 2005, 8:05:05 AM2/19/05

to

Reid Linnemann wrote:
> Colin Percival wrote:

>> Reid Linnemann wrote:
>>> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
>>> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
>>> kernel: ad0: FAILURE - WRITE_DMA timed out
>>> kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error = 5
>>

>> Looks like a dying disk.

A few weeks ago I had the same problem on two different machines but
only when I synced two (mirrored) gvinum stripes/plexes. Think this was
with 5.3 release. Did the sync with the old vinum (and 5.2.1) and I
never again had these messages and the disk shows no defects.

>>> It seems to me that a sector on the disk might be dead in the
>>> ad0s1d slice (/var), but I want to be certain before I take
>>> further steps that the behavior I'm experiencing is positively
>>> unrelated to the migration to 6-CURRENT.
>>

>> Install sysutils/smartmontools from ports and use smartctl. Odds
>> are that the drive has logged hundreds of errors by now.
>

> smartctl doesn't report any logged errors. On a hunch, I also dd'ed a
> file large enough to fill /var, hoping that it would crater on
> writing to that sector. It didn't. I know that's not a very useful
> test, but it seems to hint to me that the disk isn't bad, but the
> driver is freaking out from some event.

same here

philipp stutz

Content Security by MailMarshal

Ian Dowse

unread,

Feb 19, 2005, 10:57:19 AM2/19/05

to

In message <1108752445.1...@zappa.Chelsea-Ct.Org>, Paul Mather writes:
>The "TIMEOUT - WRITE_DMA" issue has been a recurring problem for me
>since somewhere in the 5.2.1--5.3 release range. (It's been so long now
>that I don't remember whether it first started plaguing me in 5.2.1 or
>5.3. I do know for definite I never got this problem in 5.1 and it only
>crept in during an "upgrade.")

On a recent -CURRENT you could try the following patch. It attempts
to clean up the handling of timeouts in the ATA code by using the
new callout_init_mtx() function, and appeared to cure fairly frequent
WRITE_DMA timeout messages for me.

Ian

Index: sys/dev/ata/ata-all.c
===================================================================
RCS file: /dump/FreeBSD-CVS/src/sys/dev/ata/ata-all.c,v
retrieving revision 1.235
diff -u -r1.235 ata-all.c
--- sys/dev/ata/ata-all.c 7 Feb 2005 17:14:42 -0000 1.235
+++ sys/dev/ata/ata-all.c 17 Feb 2005 20:52:27 -0000
@@ -429,6 +429,7 @@
ch->state = ATA_ACTIVE;
else
ch->state = ATA_IDLE;
+ callout_stop(&request->callout);
mtx_unlock(&ch->state_mtx);
ch->locking(ch, ATA_LF_UNLOCK);
ata_finish(request);
Index: sys/dev/ata/ata-queue.c
===================================================================
RCS file: /dump/FreeBSD-CVS/src/sys/dev/ata/ata-queue.c,v
retrieving revision 1.41
diff -u -r1.41 ata-queue.c
--- sys/dev/ata/ata-queue.c 8 Dec 2004 11:16:33 -0000 1.41
+++ sys/dev/ata/ata-queue.c 18 Feb 2005 04:27:18 -0000
@@ -56,7 +56,7 @@

/* mark request as virgin */
request->result = request->status = request->error = 0;
- callout_init(&request->callout, 1);
+ callout_init_mtx(&request->callout, &ch->state_mtx, CALLOUT_RETURNUNLOCKED);

if (!request->callback && !(request->flags & ATA_R_REQUEUE))
sema_init(&request->done, 0, "ATA request done");
@@ -67,6 +67,7 @@
(request->flags & (ATA_R_CONTROL | ATA_R_IMMEDIATE))) {

/* arm timeout */
+ mtx_lock(&ch->state_mtx);
if (!dumping)
callout_reset(&request->callout, request->timeout * hz,
(timeout_t*)ata_timeout, request);
@@ -75,11 +76,13 @@
ch->running = request;
if (ch->hw.begin_transaction(request) == ATA_OP_FINISHED) {
ch->running = NULL;
- callout_drain(&request->callout);
+ callout_stop(&request->callout);
if (!request->callback)
sema_destroy(&request->done);
+ mtx_unlock(&ch->state_mtx);
return;
}
+ mtx_unlock(&ch->state_mtx);
}
else {
/* put request on the locked queue at the specified location */
@@ -193,6 +196,7 @@
ch->running = NULL;
ch->state = ATA_IDLE;
mtx_unlock(&ch->queue_mtx);
+ callout_stop(&request->callout);
mtx_unlock(&ch->state_mtx);
ch->locking(ch, ATA_LF_UNLOCK);
ata_finish(request);
@@ -216,9 +220,6 @@
ata_completed(request, 0);
}
else {
- if (!dumping)
- callout_reset(&request->callout, request->timeout * hz,
- (timeout_t*)ata_timeout, request);
if (request->bio && !(request->flags & ATA_R_TIMEOUT)) {
ATA_DEBUG_RQ(request, "finish bio_taskqueue");
bio_taskqueue(request->bio, (bio_task_t *)ata_completed, request);
@@ -263,9 +264,6 @@
request->result = EIO;
}
else {
- /* untimeout request now we have control back */
- callout_drain(&request->callout);
-
/* if this is a soft ECC error warn about it */
if ((request->status & (ATA_S_CORR | ATA_S_ERROR)) == ATA_S_CORR) {
ata_prtdev(request->device,
@@ -408,9 +406,8 @@
{
struct ata_channel *ch = request->device->channel;

- mtx_lock(&ch->state_mtx);
-
ATA_DEBUG_RQ(request, "timeout");
+ mtx_assert(&ch->state_mtx, MA_OWNED);

/* if interrupt has been seen, shout and just rearm timeout */
if (request->flags & ATA_R_INTR_SEEN) {
@@ -426,6 +423,7 @@
callout_reset(&request->callout, request->timeout * hz,
(timeout_t*)ata_timeout, request);
mtx_unlock(&ch->state_mtx);
+ /* CALLOUT_RETURNUNLOCKED */
return;
}

@@ -454,6 +452,7 @@
mtx_unlock(&ch->state_mtx);
ata_prtdev(request->device, "timeout state=%d unexpected\n", ch->state);
}
+ /* CALLOUT_RETURNUNLOCKED */
}

void
@@ -464,10 +463,11 @@
mtx_lock(&ch->state_mtx);
request = ch->running;
ch->running = NULL;
+ if (request)
+ callout_stop(&request->callout);
mtx_unlock(&ch->state_mtx);

if (request) {
- callout_drain(&request->callout);
ata_prtdev(request->device,
"WARNING - %s requeued due to channel reset",
ata_cmd2str(request));
@@ -501,11 +501,12 @@
mtx_lock(&ch->state_mtx);
request = ch->running;
ch->running = NULL;
+ if (request)
+ callout_stop(&request->callout);
mtx_unlock(&ch->state_mtx);

/* if we have a request "in flight" fail it as well */
if (request && (!device || request->device == device)) {
- callout_drain(&request->callout);
request->result = ENXIO;
if (request->callback)
(request->callback)(request);

Paul Mather

unread,

Feb 20, 2005, 12:01:39 PM2/20/05

to

On Sat, 2005-02-19 at 15:56 +0000, Ian Dowse wrote:

> In message <1108752445.1...@zappa.Chelsea-Ct.Org>, Paul Mather writes:
> >The "TIMEOUT - WRITE_DMA" issue has been a recurring problem for me
> >since somewhere in the 5.2.1--5.3 release range. (It's been so long now
> >that I don't remember whether it first started plaguing me in 5.2.1 or
> >5.3. I do know for definite I never got this problem in 5.1 and it only
> >crept in during an "upgrade.")
>
> On a recent -CURRENT you could try the following patch. It attempts
> to clean up the handling of timeouts in the ATA code by using the
> new callout_init_mtx() function, and appeared to cure fairly frequent
> WRITE_DMA timeout messages for me.

The patch applied cleanly for me, and I've built a new kernel
successfully. Let's hope it cures the problems for me, too.

If this patch is an effective cure, hopefully there is a chance it will
get committed and there'll be an MFC5, too.

Cheers,

Paul.
--
e-mail: pa...@gromit.dlib.vt.edu

"Without music to decorate it, time is just a bunch of boring production
deadlines or dates by which bills must be paid."
--- Frank Vincent Zappa