On Sun, 3 Nov 2013, Reco wrote:
> On Sun, 3 Nov 2013 17:16:02 +0200 (IST)
> Itay <deb...@itayf.fastmail.fm> wrote:
>
>> On Sun, 3 Nov 2013, Reco wrote:
> [...] Is there anything suspicious in the root mailbox?
The root mailbox has daily messages like this, starting in June 2010
(yes, I know, bad me):
/etc/cron.daily/logrotate:
gzip: stdin: Input/output error
error: failed to compress log /var/log/syslog.1
run-parts: /etc/cron.daily/logrotate exited with return code 1
> And, is there anything unusual in /var/log/kern.log at the time you
> had this error?
Multiple messages like these two:
...
Oct 31 07:59:35 gandalf kernel: [4627180.405646] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Oct 31 07:59:35 gandalf kernel: [4627180.405650] ata3.00: irq_stat 0x40000008
Oct 31 07:59:35 gandalf kernel: [4627180.405653] ata3.00: failed command: READ FPDMA QUEUED
Oct 31 07:59:35 gandalf kernel: [4627180.405659] ata3.00: cmd 60/08:00:cb:05:a9/00:00:05:00:00/40 tag 0 ncq 4096 in
Oct 31 07:59:35 gandalf kernel: [4627180.405661] res 41/40:00:cd:05:a9/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Oct 31 07:59:35 gandalf kernel: [4627180.405664] ata3.00: status: { DRDY ERR }
Oct 31 07:59:35 gandalf kernel: [4627180.405666] ata3.00: error: { UNC }
Oct 31 07:59:35 gandalf kernel: [4627180.407143] ata3.00: configured for UDMA/133
Oct 31 07:59:35 gandalf kernel: [4627180.407153] sd 2:0:0:0: [sda] Unhandled sense code
Oct 31 07:59:35 gandalf kernel: [4627180.407155] sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 31 07:59:35 gandalf kernel: [4627180.407158] sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Oct 31 07:59:35 gandalf kernel: [4627180.407163] Descriptor sense data with sense descriptors (in hex):
Oct 31 07:59:35 gandalf kernel: [4627180.407165] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Oct 31 07:59:35 gandalf kernel: [4627180.407173] 05 a9 05 cd
Oct 31 07:59:35 gandalf kernel: [4627180.407176] sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 31 07:59:35 gandalf kernel: [4627180.407181] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 05 a9 05 cb 00 00 08 00
Oct 31 07:59:35 gandalf kernel: [4627180.407188] end_request: I/O error, dev sda, sector 94963149
Oct 31 07:59:35 gandalf kernel: [4627180.407208] ata3: EH complete
...
Nov 1 07:50:21 gandalf kernel: [4713026.178488] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 1 07:50:21 gandalf kernel: [4713026.178492] ata3.00: irq_stat 0x40000008
Nov 1 07:50:21 gandalf kernel: [4713026.178496] ata3.00: failed command: READ FPDMA QUEUED
Nov 1 07:50:21 gandalf kernel: [4713026.178502] ata3.00: cmd 60/08:00:cb:05:a9/00:00:05:00:00/40 tag 0 ncq 4096 in
Nov 1 07:50:21 gandalf kernel: [4713026.178503] res 41/40:00:cd:05:a9/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Nov 1 07:50:21 gandalf kernel: [4713026.178506] ata3.00: status: { DRDY ERR }
Nov 1 07:50:21 gandalf kernel: [4713026.178509] ata3.00: error: { UNC }
Nov 1 07:50:21 gandalf kernel: [4713026.179984] ata3.00: configured for UDMA/133
Nov 1 07:50:21 gandalf kernel: [4713026.179992] ata3: EH complete
...
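A note on reading the numbers in the Oct 31 entry: the last four bytes of the descriptor sense data (05 a9 05 cd), read as one big-endian hex value, appear to be the same sector that end_request reports. The sketch below does that conversion; the helper name hex_bytes_to_dec is made up for illustration and is not part of any tool.

```shell
# Join hex bytes (most significant first) and print the decimal value.
# hex_bytes_to_dec is a hypothetical helper, not an existing command.
hex_bytes_to_dec() {
    hex=""
    for byte in "$@"; do
        hex="${hex}${byte}"
    done
    echo "$((0x${hex}))"
}

# Trailing bytes of the sense data: matches "sector 94963149".
hex_bytes_to_dec 05 a9 05 cd    # prints 94963149

# LBA field of the Read(10) CDB: first sector of the 8-sector request.
hex_bytes_to_dec 05 a9 05 cb    # prints 94963147
```

So the failing sector is the third one of the 8-sector read that starts at 94963147.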
>>> Does, say, 'md5sum /var/log/syslog' run to completion?
>>
>> Yes. Without warnings/errors.
>>
>>> What about 'cat /var/log/syslog > /dev/null'?
>>
>> Yes. Without warnings/errors.
>
> Ok. What about 'cat /var/log/syslog | gzip -c > /dev/null'?
> And, while we're at that, what about:
>
> cat /var/log/syslog | gzip -c > /var/log/syslog.test.gz
Both commands finished without warnings/errors.
> If the error shows early, can you also post the contents of /tmp/gzip:
>
> strace -fo /tmp/gzip cat /var/log/syslog | gzip -c > /dev/null
Didn't try since there were no errors.
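One thing worth noting: the cron mail names /var/log/syslog.1 as the file gzip failed on, while the tests above read /var/log/syslog. A minimal sketch for reading several files end to end and reporting any that fail follows; check_readable is a hypothetical helper name, and a clean result would not prove the disk is healthy.

```shell
# Read every given file end to end; report any that cannot be read.
# Returns non-zero if at least one file failed.
# check_readable is a hypothetical helper, not an existing command.
check_readable() {
    failed=0
    for f in "$@"; do
        if ! cat -- "$f" > /dev/null 2>&1; then
            echo "read failed: $f"
            failed=1
        fi
    done
    return "$failed"
}

# Example (as root):
#   check_readable /var/log/syslog /var/log/syslog.1
```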
>>> Can you run fsck on the filesystem containing /var/log/syslog?
[snip]
File system was found clean. No errors were reported.
>>> What does smartctl --all show on the partition with this filesystem?
>>
>> I have never used smartctl (I installed it just now, following your question).
>> In my system /var resides on a logical volume.
>> So I am not sure how to proceed.
>
> Find the physical volume corresponding to the /var logical volume.
> Run smartctl --all on the disk that contains that physical volume.
> If you have RAID (be it mdadm or dm-mirror), run smartctl on all
> disks that are part of said RAID.
>
> While we're at it, also run smartctl -t long on said disk, wait for a
> while (smartctl should tell you how long), and run smartctl --all on
> the same disk again.
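The steps above could be sketched roughly as follows. This assumes the lvm2 tools and findmnt are installed, and that lvs accepts the device path findmnt prints; the helper names strip_extent and pvs_for_mount, and the partition name in the example, are made up for illustration.

```shell
# Strip the "(extent)" suffix that lvs prints after each device,
# e.g. "/dev/sda5(0)" -> "/dev/sda5". Hypothetical helper.
strip_extent() {
    echo "$1" | sed 's/([0-9]*)$//'
}

# Print the physical volume(s) backing the filesystem mounted at $1.
# Needs root and lvm2; only defined here, not run. Hypothetical helper.
pvs_for_mount() {
    lv=$(findmnt -n -o SOURCE --target "$1") || return 1
    for dev in $(lvs --noheadings -o devices "$lv" | tr ',' ' '); do
        strip_extent "$dev"
    done
}

# Example (as root), assuming the result is a partition of /dev/sda:
#   pvs_for_mount /var          # e.g. /dev/sda5 (hypothetical)
#   smartctl -t long /dev/sda   # start the extended self-test
#   smartctl --all /dev/sda     # re-read once the test has finished
```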
Output of 'smartctl --all' (after running 'smartctl -t long'):
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Blue Serial ATA
Device Model: WDC WD1600AAJS-00L7A0
Serial Number: WD-WCAV34031063
LU WWN Device Id: 5 0014ee 15756c0f2
Firmware Version: 01.03E01
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Nov 4 10:42:48 2013 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                ( 3000) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  39) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3037) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       10434
  3 Spin_Up_Time            0x0027   135   130   021    Pre-fail  Always       -       4241
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       119
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   060   060   000    Old_age   Always       -       29269
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       117
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       52
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       119
194 Temperature_Celsius     0x0022   100   093   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        90%            29267  94963149
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
==========================================================
End of 'smartctl --all' output.
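For what it's worth, the LBA_of_first_error in the self-test log (94963149) is the same number as the sector in the kernel's end_request message. A small sketch for pulling both numbers out so they can be compared follows; the function names are made up for illustration, and the patterns are assumptions based only on the output shown in this thread.

```shell
# Print the sector number from kernel "I/O error, dev ..., sector N" lines.
# Hypothetical helper; reads log text on stdin.
kernel_bad_sectors() {
    sed -n 's|.*I/O error, dev [a-z]*, sector \([0-9]*\).*|\1|p'
}

# Print the LBA_of_first_error column (last field) of self-tests that
# ended with a read failure. Hypothetical helper; reads stdin.
selftest_bad_lbas() {
    awk '/read failure/ { print $NF }'
}

# Example (as root):
#   kernel_bad_sectors < /var/log/kern.log | sort -u
#   smartctl -l selftest /dev/sda | selftest_bad_lbas
```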
Many thanks for the help and the patience!
Itay