Alt-F 0.1RC5 on DNS320L: Buffer I/O error on dev mtdblock0 (pls help!)


Daniele Strollo

Oct 25, 2016, 10:41:31 AM
to Alt-F

I have problems with Alt-F 0.1RC5 on a DNS320L rev A3,
but I don't know how to solve them. Can someone please help me?


Thanks.
  Daniele

Here are the logs:

> dmesg | grep error | grep -v "uncorrectable ECC error"

Buffer I/O error on dev mtdblock0, logical block 0, async page read
Buffer I/O error on dev mtdblock0, logical block 66, async page read
Buffer I/O error on dev mtdblock0, logical block 66, async page read
blk_update_request: I/O error, dev mtdblock0, sector 248
blk_update_request: I/O error, dev mtdblock0, sector 0
blk_update_request: I/O error, dev mtdblock0, sector 8
blk_update_request: I/O error, dev mtdblock0, sector 16
blk_update_request: I/O error, dev mtdblock0, sector 24
blk_update_request: I/O error, dev mtdblock0, sector 32
blk_update_request: I/O error, dev mtdblock0, sector 40
blk_update_request: I/O error, dev mtdblock0, sector 48
blk_update_request: I/O error, dev mtdblock0, sector 56
blk_update_request: I/O error, dev mtdblock0, sector 64
Buffer I/O error on dev mtdblock0, logical block 0, async page read
Buffer I/O error on dev mtdblock0, logical block 66, async page read
Buffer I/O error on dev mtdblock0, logical block 66, async page read
blk_update_request: I/O error, dev mtdblock0, sector 248
blk_update_request: I/O error, dev mtdblock0, sector 0
blk_update_request: I/O error, dev mtdblock0, sector 8
blk_update_request: I/O error, dev mtdblock0, sector 16
blk_update_request: I/O error, dev mtdblock0, sector 24
blk_update_request: I/O error, dev mtdblock0, sector 32
blk_update_request: I/O error, dev mtdblock0, sector 40
blk_update_request: I/O error, dev mtdblock0, sector 48
blk_update_request: I/O error, dev mtdblock0, sector 56
blk_update_request: I/O error, dev mtdblock0, sector 64
Buffer I/O error on dev mtdblock0, logical block 0, async page read
Buffer I/O error on dev mtdblock0, logical block 66, async page read
Buffer I/O error on dev mtdblock0, logical block 66, async page read

> nandtest /dev/mtd1 -k

ECC corrections: 0
ECC failures   : 0
Bad blocks     : 0
BBT blocks     : 0
004e0000: checking...
Finished pass 1 successfully

> nandtest /dev/mtd5 -k

ECC corrections: 0
ECC failures   : 1
Bad blocks     : 0
BBT blocks     : 0
004e0000: checking...
Finished pass 1 successfully

> nandtest /dev/mtd6 -k

ECC corrections: 5
ECC failures   : 3
Bad blocks     : 0
BBT blocks     : 0
001e0000: checking...
Finished pass 1 successfully

João Cardoso

Oct 25, 2016, 11:32:16 AM
to Alt-F


On Tuesday, 25 October 2016 15:41:31 UTC+1, Daniele Strollo wrote:

I have problems with Alt-F 0.1RC5 on a DNS320L rev A3,
but I don't know how to solve them. Can someone please help me?

A recent topic addresses this and related issues; see https://groups.google.com/d/msg/alt-f/xBULQIdEax8/UMqZv0HsAgAJ


Thanks.
  Daniele

Here are the logs:

> dmesg | grep error | grep -v "uncorrectable ECC error"

Do the errors appear before or after you run the nand* commands?
To show the errors in context, attach (not inline) the full dmesg output after a reboot.
In any case, there is nothing we can do.

mtd0/mtdblock0 is where the box bootloader resides. Errors on it mean that the box will not boot or, if the errors are sporadic at boot time, that it might boot only after several power-on/power-cut attempts.

mtd0/mtdblock0 is not used by Alt-F. The errors can come from the Linux kernel (errors related to mtd0), or from a filesystem such as jffs2 or a filesystem utility such as blkid (errors related to mtdblock0) reading/scanning the flash chip to determine its bad block map or filesystem signature. A 'nanddump' on mtd0 also triggers errors on my system.
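
To see which flash partition each mtdX number refers to, Linux kernels with MTD support list them in /proc/mtd. The following is only a minimal sketch, assuming a POSIX shell (such as the busybox sh on the box); list_mtd is a made-up helper name, not an Alt-F command. It prints each partition with its size converted from the hexadecimal byte count in /proc/mtd to KiB:

```shell
# Print "device size-in-KiB name" for each line of /proc/mtd-format text on stdin.
# list_mtd is a hypothetical helper name used only for this example.
list_mtd() {
    while read -r dev size erasesize name; do
        # Keep only the "mtdN:" lines; this skips the column header.
        case "$dev" in
            mtd[0-9]*:) ;;
            *) continue ;;
        esac
        # Sizes in /proc/mtd are hexadecimal byte counts.
        printf '%s %s KiB %s\n' "${dev%:}" "$(( 0x$size / 1024 ))" "$name"
    done
}

# Run against the live table when it exists (it does on boxes with NAND flash).
if [ -r /proc/mtd ]; then
    list_mtd < /proc/mtd
fi
```

The partition that shows up as mtd0 is the one that blkid and similar tools reach through /dev/mtdblock0.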

Daniele Strollo

Oct 25, 2016, 12:00:21 PM
to al...@googlegroups.com

First of all, thanks for your support.

I have now attached the full dmesg output after a restart.

These errors appear at every reboot.
The system seems to work, since it starts up completely with no problems, but these lines appear at every reboot.

Second question, I don't know if it is related...
I discovered weird behaviour with files shared over Samba.
I had some pictures, and when accessed from the laptop they seem to be corrupted (displayed with strange bands of lines and corrupted colors).

After this I was a bit scared and tried to copy all the files to an external HD via USB, and the pictures were OK, so it was probably a sync problem between the two hard disks?

That's why I tried to check the log messages.

I hope I have explained the problem in an understandable manner :D

Daniele
dmesg.log

João Cardoso

Oct 25, 2016, 1:44:24 PM
to Alt-F


On Tuesday, 25 October 2016 17:00:21 UTC+1, Daniele Strollo wrote:

First of all, thanks for your support.

I have now attached the full dmesg output after a restart.

OK. I hope you have meanwhile read the other topic I referred to, regarding uncorrectable ECC errors.


These errors appear at every reboot.
The system seems to work, since it starts up completely with no problems, but these lines appear at every reboot.

Besides what I said in the other topic regarding NAND flash chip bit-flip errors, read disturb, etc., it is also possible that those mtdblockX errors appear because some programs try to discover what is contained on all block devices (usually disk partitions), to determine whether they hold RAID setups, filesystems, etc.
Those programs consult a file named /proc/partitions, which contains the list of all known block devices in the box. The NAND flash chip partitions also (legitimately) appear in that file, so those programs try to read the flash chip partitions as well as the disk partitions.
On a NAND flash chip with borderline specs, that might trigger those uncorrectable ECC errors.
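
The point about /proc/partitions can be checked with a few lines of shell. This is only a minimal sketch, assuming a POSIX shell; flash_partitions is a hypothetical helper name. It prints the names from /proc/partitions that are MTD block devices, i.e. the flash partitions that a scanning program would try to read along with the disks:

```shell
# Print the mtdblock entries found in /proc/partitions-format text on stdin.
# flash_partitions is a hypothetical helper name used only for this example.
flash_partitions() {
    # Columns are: major minor #blocks name. The header and the blank line
    # after it never match the pattern below, so they are skipped.
    while read -r major minor blocks name; do
        case "$name" in
            mtdblock*) echo "$name" ;;
        esac
    done
}

# Run against the live list when it exists.
if [ -r /proc/partitions ]; then
    flash_partitions < /proc/partitions
fi
```

If mtdblock0 is printed, any tool that walks /proc/partitions will also open /dev/mtdblock0.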
 

Second question, I don't know if it is related...
I discovered weird behaviour with files shared over Samba.
I had some pictures, and when accessed from the laptop they seem to be corrupted (displayed with strange bands of lines and corrupted colors).

You (everybody!) should read the "WARNING: Network data corruption on RC5" top-posted topic and apply the fix.

Daniele Strollo

Oct 26, 2016, 4:13:56 AM
to Alt-F

Besides what I said in the other topic regarding NAND flash chip bit-flip errors, read disturb, etc., it is also possible that those mtdblockX errors appear because some programs try to discover what is contained on all block devices (usually disk partitions), to determine whether they hold RAID setups, filesystems, etc.
Those programs consult a file named /proc/partitions, which contains the list of all known block devices in the box. The NAND flash chip partitions also (legitimately) appear in that file, so those programs try to read the flash chip partitions as well as the disk partitions.
On a NAND flash chip with borderline specs, that might trigger those uncorrectable ECC errors.

Is there a way to check this?

 
 

Second question, I don't know if it is related...
I discovered weird behaviour with files shared over Samba.
I had some pictures, and when accessed from the laptop they seem to be corrupted (displayed with strange bands of lines and corrupted colors).

You (everybody!) should read the "WARNING: Network data corruption on RC5" top-posted topic and apply the fix.

Sorry about that. I didn't realize it was related to my problem, since I assumed the cause was a failing sync between the two disks in RAID1.

I have now applied the patch.
Thanks again a lot for your time and for your patience.

  Daniele

João Cardoso

Oct 26, 2016, 12:24:13 PM
to Alt-F


On Wednesday, 26 October 2016 09:13:56 UTC+1, Daniele Strollo wrote:

Besides what I said in the other topic regarding NAND flash chip bit-flip errors, read disturb, etc., it is also possible that those mtdblockX errors appear because some programs try to discover what is contained on all block devices (usually disk partitions), to determine whether they hold RAID setups, filesystems, etc.
Those programs consult a file named /proc/partitions, which contains the list of all known block devices in the box. The NAND flash chip partitions also (legitimately) appear in that file, so those programs try to read the flash chip partitions as well as the disk partitions.
On a NAND flash chip with borderline specs, that might trigger those uncorrectable ECC errors.

Is there a way to check this?

I see two interpretations of your question:

1-Verifying that the NAND chip has marginal specs: outside an electronics lab, that can only be inferred from the fact that some users report the error while others don't, and that some users have issues in some NAND areas while other users report errors in other NAND areas.
The DNS-325/320-rev-A/320L all have NAND chips, use the same Linux kernel and are handled identically by Alt-F, yet errors have been reported only on the 320L. Notice also that the 320-rev-B board seems to be identical to the 320L-rev-A.
I own a 325 myself, with no errors; its NAND chip maker is Samsung, while for the 320L the maker is Hynix.
So all clues point to a NAND flash chip with marginal specs on the 320L.

2-Verifying my hypothesis that the errors are triggered by some programs scanning all block devices: that can be verified.
Issue the following commands:

logger -t ME1 "XXXX" # adds an entry to syslog, verify that using
logread | tail
blkid # issue one of the commands that scan all block devices by reading /proc/partitions
logread | sed -n '/ME1/,$p' # verify that/if the errors appear after your syslog marker
logger -t ME2 "YYYY" # adds another marker entry to syslog
mdadm --examine --scan --config=partitions # another command that scans all block devices
logread | sed -n '/ME2/,$p' # verify the generated errors after the new marker
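
The marker technique above can be wrapped in a small script. This is only a minimal sketch, assuming a POSIX shell and the busybox logger and logread commands used above; errors_after_marker and probe are hypothetical helper names, the one-second wait is a guess at how long kernel messages take to reach syslog, and the two grep patterns are taken from the kernel messages quoted in this thread:

```shell
# Count the ECC and I/O error lines that follow a marker in syslog text on stdin.
# errors_after_marker is a hypothetical helper name used only for this example.
errors_after_marker() {
    marker=$1
    # Keep everything from the first line containing the marker to the end,
    # then count the lines matching either kernel error message.
    sed -n "/$marker/,\$p" | grep -c -e 'uncorrectable ECC error' -e 'I/O error'
}

# Write a unique marker to syslog, run the given command, then report how
# many error lines were logged after the marker.
probe() {
    tag="probe$$-$(date +%s)"       # PID plus timestamp keeps each marker unique
    logger -t "$tag" "start: $*"
    "$@" >/dev/null 2>&1
    sleep 1                         # assumed to be enough for the kernel messages to arrive
    logread | errors_after_marker "$tag"
}
```

For example, `probe blkid` followed by `probe mdadm --examine --scan --config=partitions` would print one error count per command, which makes it easier to see which scan triggers the messages.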


Additionally, when "settings" are first loaded at bootup or when they are saved, cleared, etc., a full JFFS2 filesystem "check" is performed on mtd5, and that can also trigger the errors (in my case on mtd6!).

Notice that the errors do not always appear the second or subsequent times that you issue the commands.

So, there is nothing that we can do about it. The errors are erratic and seem to be harmless (which is very odd and puzzles me).

Daniele Strollo

Oct 27, 2016, 4:34:29 AM
to Alt-F
Yes, these commands produce the errors you described.
So it's not a real problem I have to worry about?
Thanks again for the support!!!

The logs follow:


Oct 27 10:27:11 raid user.notice ME1: STEP1--
Oct 27 10:27:25 raid user.err kernel: __nand_correct_data: uncorrectable ECC error
Oct 27 10:27:25 raid user.err kernel: blk_update_request: I/O error, dev mtdblock0, sector 248
Oct 27 10:27:25 raid user.err kernel: __nand_correct_data: uncorrectable ECC error
Oct 27 10:27:25 raid user.err kernel: blk_update_request: I/O error, dev mtdblock0, sector 0
Oct 27 10:27:25 raid user.err kernel: __nand_correct_data: uncorrectable ECC error
Oct 27 10:27:25 raid user.err kernel: blk_update_request: I/O error, dev mtdblock0, sector 8
Oct 27 10:27:25 raid user.err kernel: __nand_correct_data: uncorrectable ECC error
Oct 27 10:27:25 raid user.err kernel: blk_update_request: I/O error, dev mtdblock0, sector 16
Oct 27 10:27:25 raid user.err kernel: __nand_correct_data: uncorrectable ECC error
Oct 27 10:27:25 raid user.err kernel: Buffer I/O error on dev mtdblock0, logical block 0, async page read
Oct 27 10:27:25 raid user.err kernel: __nand_correct_data: uncorrectable ECC error
Oct 27 10:27:25 raid user.err kernel: Buffer I/O error on dev mtdblock0, logical block 66, async page read
Oct 27 10:27:25 raid user.err kernel: __nand_correct_data: uncorrectable ECC error
Oct 27 10:27:25 raid user.err kernel: Buffer I/O error on dev mtdblock0, logical block 66, async page read

Oct 27 10:27:46 raid user.notice ME1: STEP2--
Oct 27 10:27:57 raid user.err kernel: __nand_correct_data: uncorrectable ECC error
Oct 27 10:27:57 raid user.warn kernel: blk_update_request: 27 callbacks suppressed
Oct 27 10:27:57 raid user.err kernel: blk_update_request: I/O error, dev mtdblock0, sector 0

João Cardoso

Oct 27, 2016, 3:05:11 PM
to Alt-F


On Thursday, 27 October 2016 09:34:29 UTC+1, Daniele Strollo wrote:
Yes, these commands produce the errors you described.
So it's not a real problem I have to worry about?

It's probably a symptom that something may go wrong in the future, but there is nothing one can do about it.
If you feel safer, you can flash the vendor firmware back; that's all I can say.

Daniele Strollo

Oct 28, 2016, 9:54:58 AM
to al...@googlegroups.com
Ok.
Thanks for all the support.
  Daniele

João Cardoso

Oct 31, 2016, 11:14:30 AM
to Alt-F


On Thursday, 27 October 2016 20:05:11 UTC+1, João Cardoso wrote:


On Thursday, 27 October 2016 09:34:29 UTC+1, Daniele Strollo wrote:
Yes, these commands produce the errors you described.
So it's not a real problem I have to worry about?

It's probably a symptom that something may go wrong in the future, but there is nothing one can do about it.
If you feel safer, you can flash the vendor firmware back; that's all I can say.

I have researched and run more tests on my 320L, and have even reflashed the vendor's firmware. The number of bad blocks on mtd5, where settings are saved, has started increasing, from 0 to 5.

See http://forum.dsmg600.info/viewtopic.php?id=7594 for another user's experience with Alt-F, the vendor's firmware and ECC errors.

As a result of my research on NAND technology, I will freely quote a researcher from one of their lecture slides (URL lost):

If you are archiving your family photos or videos on an SD card, USB stick or SSD, intending to show them to your sons or grandsons when they grow up, forget about it: data retention on NAND is only 10 years.