disk or cable error? (CAM status: ATA Status Error)


ix...@riseup.net

Aug 26, 2018, 11:02:42 AM
to FreeBSD Questions
Hello,

For the past 3 days I have been getting these errors on my FreeBSD 11.2 server,
which has two disks (WDC WD1005FBYZ-01YCBB2) in a ZFS mirror:

(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 90 a4 aa 40 1d 00 00 00 00 00
(ada1:ahcich1:0:0:0): CAM status: ATA Status Error
(ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada1:ahcich1:0:0:0): RES: 41 10 90 a4 aa 00 1d 00 00 00 00
(ada1:ahcich1:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 60 38 7e 06 40 2e 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
(ada0:ahcich0:0:0:0): RES: 41 10 38 7e 06 00 2e 00 00 00 00
(ada0:ahcich0:0:0:0): Retrying command
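
The ACB bytes in these messages can be unpacked to see which LBA and transfer length each failed write was aimed at. The following is a minimal sketch, not an authoritative decoder: it assumes the ACB is printed in the order command, features, LBA low/mid/high, device, LBA low/mid/high (exp), features (exp), sector count, sector count (exp), and that for NCQ commands such as WRITE_FPDMA_QUEUED the sector count is carried in the features fields. The function name decode_fpdma_acb is made up for illustration.

```python
# Assumed ACB byte order (an assumption about how FreeBSD CAM prints it):
#   0 command        1 features       2 lba_low       3 lba_mid
#   4 lba_high       5 device         6 lba_low_exp   7 lba_mid_exp
#   8 lba_high_exp   9 features_exp  10 sector_count 11 sector_count_exp

def decode_fpdma_acb(acb_hex):
    """Decode the ACB hex string of a READ/WRITE FPDMA QUEUED command."""
    b = [int(x, 16) for x in acb_hex.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes")
    # 48-bit LBA: the three "exp" bytes form the high-order half.
    lba = (b[2] | b[3] << 8 | b[4] << 16
           | b[6] << 24 | b[7] << 32 | b[8] << 40)
    # For NCQ commands the transfer length travels in the features fields.
    sectors = b[1] | b[9] << 8
    return {"command": b[0], "lba": lba, "sectors": sectors}

# ACB strings copied from the kernel messages above.
ada1 = decode_fpdma_acb("61 08 90 a4 aa 40 1d 00 00 00 00 00")
ada0 = decode_fpdma_acb("61 60 38 7e 06 40 2e 00 00 00 00 00")
print("ada1:", ada1)
print("ada0:", ada0)
```

If the assumed layout is right, both writes target LBAs well inside a 1 TB disk (roughly 1.95 billion 512-byte sectors), so an IDNF ("ID not found") response is not explained by an out-of-range address.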

Note that the errors affect both disks (ada0 and ada1).

Frequency: about 30 times per day.

zpool status reports no problems.


Since last night smartctl also reports an error:

smartctl -a /dev/ada0 (ada1 followed 6 hours later with the same output)
output:
"
Error 1 occurred
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 00 00 00 00 Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------

"
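
The status and error bytes in both the kernel messages (41 / 10) and the smartctl log (ST 51 / ER 04) are bit fields. Here is a small sketch for turning them into flag names. The bit names follow the classic ATA register definitions; newer revisions of the standard rename some of them, so treat the tables as an approximation. decode_bits is a hypothetical helper, not part of smartctl or FreeBSD.

```python
# Classic ATA status register bits (names vary between spec revisions).
ATA_STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}

# Classic ATA error register bits (names vary between spec revisions).
ATA_ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF",
}

def decode_bits(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & bit]

# Values from the kernel messages: ATA status 41, error 10.
print(decode_bits(0x41, ATA_STATUS_BITS), decode_bits(0x10, ATA_ERROR_BITS))
# Values from the smartctl error log: ST 51, ER 04.
print(decode_bits(0x51, ATA_STATUS_BITS), decode_bits(0x04, ATA_ERROR_BITS))
```

Note that neither error byte has the ICRC bit (0x80) set, which is the flag normally associated with interface CRC errors.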

Do any of these errors hint at whether this is a disk fault affecting both disks at the same time
or a broken cable? (Both disks are connected to the mainboard via the same cable.)


Thanks in advance,
ixbug


_______________________________________________
freebsd-...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questi...@freebsd.org"

Philipp Vlassakakis

Aug 27, 2018, 5:22:33 AM
to ix...@riseup.net, FreeBSD Questions
Hi,

What are the SMART values of the hard disks?
High "Reallocated Sectors Count" or anything else?

I would change the cable and run a SMART-Test on both disks.
Also make sure you have recent backups.

Regards,

ix...@riseup.net

Aug 27, 2018, 5:52:33 AM
to Philipp Vlassakakis, FreeBSD Questions
Philipp Vlassakakis:
> Hi,
>
> What are the SMART values of the hard disks?
> High "Reallocated Sectors Count" or anything else?
>
> I would change the cable and run a SMART-Test on both disks.
> Also make sure you have recent backups.
>

Thanks for your reply. Here are the SMART values:

ada0:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always       -       3808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2858
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always       -       4413844054
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   118   113   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

ada1:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   143   143   021    Pre-fail  Always       -       3833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2858
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always       -       4411825353
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   115   109   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

tech-lists

Aug 27, 2018, 6:23:15 AM
to ix...@riseup.net, FreeBSD Questions
Hello,

On 26/08/2018 16:01, ix...@riseup.net wrote:
> since 3 days I get these errors on my FreeBSD 11.2 server
> which has two (WDC WD1005FBYZ-01YCBB2) disks (ZFS mirror):

Run smartctl -t long /dev/ada0 and smartctl -t long /dev/ada1 as root.
After both tests have completed, run smartctl -x /dev/ada0 and
smartctl -x /dev/ada1 and look for this section:

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

Then paste the output.

--
J.

tech-lists

Aug 27, 2018, 7:49:12 AM
to ix...@riseup.net, FreeBSD Questions
On 26/08/2018 16:01, ix...@riseup.net wrote:
> Do any of these errors provide hints as to whether this is a disk
> error affecting both at the same time or broken cable? (both disks
> are connected to the mainboard via the same cable)

these values:

5 Reallocated_Sector_Ct
196 Reallocated_Event_Count
197 Current_Pending_Sector
198 Offline_Uncorrectable

are all zero for both disks, so if these remain zero after a successful
long test with *no LBA errors*, then that would seem to me at least to
indicate there are no problems with the disks themselves. If, in the face
of those results, this persisted:

> (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 90 a4 aa 40 1d 00 00 00 00 00
> (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
> (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )

then yeah I'd say bad cable/bad interface
--
J.
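
The check described above (attributes 5, 196, 197 and 198 all zero) can be scripted against saved smartctl -A output. A rough sketch, assuming the usual ten-column attribute table with the raw value in the last column; the function names are made up for illustration, and the sample rows are copied from the ada0 values posted earlier in the thread.

```python
# Rows copied from the ada0 attribute table posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

# Attribute IDs that point at the media itself rather than the link.
MEDIA_IDS = (5, 196, 197, 198)

def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for rows with a numeric raw value."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs

def nonzero_media_attributes(attrs):
    """Return the media-related attributes whose raw value is not zero."""
    return {i: attrs[i] for i in MEDIA_IDS if i in attrs and attrs[i][1] != 0}

suspects = nonzero_media_attributes(parse_attributes(SAMPLE))
print("suspect attributes:", suspects)
```

An empty result only says the drive has not logged media problems so far; it does not rule out the drive.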

ix...@riseup.net

Aug 27, 2018, 7:56:29 AM
to tech-lists, FreeBSD Questions

tech-lists:


> On 26/08/2018 16:01, ix...@riseup.net wrote:
>> Do any of these errors provide hints as to whether this is a disk
>> error affecting both at the same time or broken cable? (both disks
>> are connected to the mainboard via the same cable)
>
> these values:
>
> 5 Reallocated_Sector_Ct
> 196 Reallocated_Event_Count
> 197 Current_Pending_Sector
> 198 Offline_Uncorrectable
>
> are all zero for both disks, so if these remain zero after a successful long test with *no LBA errors* then that would seem to me at least to indicate there's no problems with the disks themselves. If, in the face of those results, this persisted:
>
>> (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 90 a4 aa 40 1d 00 00 00 00 00
>>  (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
>>  (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
>
> then yeah I'd say bad cable/bad interface


Thanks for your input.
(The long test is running.)

Dave

Aug 27, 2018, 10:55:31 AM
to freebsd-...@freebsd.org
On Monday 27 August 2018 12:46:36 tech-lists wrote:
> On 26/08/2018 16:01, ix...@riseup.net wrote:
> > Do any of these errors provide hints as to whether this is a disk
> > error affecting both at the same time or broken cable? (both disks
> > are connected to the mainboard via the same cable)
>
> these values:
>
> 5 Reallocated_Sector_Ct
> 196 Reallocated_Event_Count
> 197 Current_Pending_Sector
> 198 Offline_Uncorrectable
>
> are all zero for both disks, so if these remain zero after a successful
> long test with *no LBA errors* then that would seem to me at least to
> indicate there's no problems with the disks themselves. If, in the face
> of those results, this persisted:
>
> > (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 90 a4 aa 40 1d 00 00 00 00 00
> > (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
> > (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 10 (IDNF )
>
> then yeah I'd say bad cable/bad interface
>

IME "199 UDMA_CRC_Error_Count" at zero generally means the
SATA cable is fine. A failing cable that works most of the
time almost always results in an ever-increasing CRC Error
Count. No guarantees, obviously, but it moves the odds in
favour of something else being the problem.

I have seen cheap nasty SATA cables result in no CRC Error
Count but the drive dropping off the system. It can be cured
temporarily by plugging/unplugging the connector a few times
to "scrape clean" the contacts. I would guess it's two
different metals reacting poorly with each other, especially
in a humid environment.

Having said that, even decent SATA cables are cheap and it
can't hurt to try replacing them first if diagnostics can't
point to a specific cause.
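
Since the argument above rests on whether attribute 199 keeps climbing, comparing two readings taken some time apart is more telling than a single one. A minimal sketch, assuming the raw values have already been collected into dictionaries keyed by attribute ID; the helper name and the second reading are invented for illustration only.

```python
def increased_counters(before, after, watch=(199,)):
    """Return {attribute_id: increase} for watched raw counters that grew."""
    grown = {}
    for attr_id in watch:
        if attr_id in before and attr_id in after:
            delta = after[attr_id] - before[attr_id]
            if delta > 0:
                grown[attr_id] = delta
    return grown

# The first reading matches the thread (UDMA_CRC_Error_Count raw value 0).
# The second reading is made up to show what a rising counter looks like.
reading_then = {199: 0}
reading_later = {199: 3}
print(increased_counters(reading_then, reading_later))
```

As noted above, a flat counter shifts the odds away from the cable but does not clear it.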

ix...@riseup.net

Aug 27, 2018, 12:20:24 PM
to freebsd-...@freebsd.org

The SMART long test completed with no errors on both disks:

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 2861 -


Next I'm going to replace the cable.
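
For anyone wanting to check the self-test log from a script, a result line like the one above can be parsed by reading its fixed fields from the right. This is only a sketch: it assumes the line layout shown in this message (entry number, description and status text, remaining percentage, lifetime hours, LBA of first error), and the function names are hypothetical.

```python
import re

# One self-test log entry; the free-text part is captured as a whole
# because description and status are not reliably separable here.
LOG_LINE = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)$")

def parse_selftest_line(line):
    """Parse one self-test log entry, or return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    num, text, remaining, hours, lba = m.groups()
    return {
        "num": int(num),
        "text": text,
        "remaining_pct": int(remaining),
        "lifetime_hours": int(hours),
        "first_error_lba": None if lba == "-" else int(lba),
    }

def passed(entry):
    """True if the test finished cleanly and logged no failing LBA."""
    return ("Completed without error" in entry["text"]
            and entry["first_error_lba"] is None)

# Line copied from the result posted above.
entry = parse_selftest_line(
    "# 1 Extended offline Completed without error 00% 2861 -")
print(entry, passed(entry))
```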

ix...@riseup.net

Aug 31, 2018, 3:39:16 AM
to freebsd-...@freebsd.org
> Do any of these errors provide hints as to whether this is a disk error affecting both at the same time
> or broken cable? (both disks are connected to the mainboard via the same cable)

For the record:

After replacing the cable that connected both disks to the mainboard,
everything is fine again.

Thanks for your input!