__nand_correct_data: uncorrectable ECC error


Erik J

Feb 16, 2015, 3:25:27 PM
to al...@googlegroups.com

OK, there is something I still need to mention, plus some remarks.
This concerns a DNS-320L with 256 MB of memory.

- First, I rebooted the box
and immediately saved the system log; see log "1. Mon Feb 16 20:24:15."
There are a few _nand errors there.

Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error
Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock0, sector 0
Feb 16 20:21:19 terra user.info kernel: md: bind<sda2>
Feb 16 20:21:19 terra user.info kernel: Adding 524284k swap on /dev/sdb1.  Priority:1 extents:1 across:524284k
Feb 16 20:21:19 terra user.notice root: Starting sslcert: OK.
Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error
Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock0, sector 0

Alt-F folder is not there, and I see this line

Feb 16 20:21:19 terra user.info kernel: NAND device: Manufacturer ID: 0xad, Chip ID: 0xf1 (Hynix H27U1G8F2BTR-BC), 128MiB, page size: 2048, OOB size: 64

which surprises me, as the box should have 256 MB of memory.


After a few minutes, and after setting the location of the Alt-F folder through the package manager in the web interface, I saved another log; see attached file "2.Mon Feb 16 20:31:14".
And, big surprise, there are many more _nand errors. I compared it to the last log I made before rebooting, and that one had the same number of _nand errors.

What should I do about this? Flash the original firmware back and claim warranty? The D-Link firmware will probably offer no way to prove these errors...
1. Mon Feb 16 20:24:15.log
2.Mon Feb 16 20:31:14.log

João Cardoso

Feb 16, 2015, 4:49:51 PM
to al...@googlegroups.com


On Monday, February 16, 2015 at 8:25:27 PM UTC, Erik J wrote:

Ok, something I still need to mention, and some remarks.
Speaking about a DNS 320L with 256mB of memory.

- First, I rebooted the box.
and made immediately the system log. see log "1. Mon Feb 16 20:24:15."
There are few _nand errors there.

Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error
Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock0, sector 0
Feb 16 20:21:19 terra user.info kernel: md: bind<sda2>
Feb 16 20:21:19 terra user.info kernel: Adding 524284k swap on /dev/sdb1.  Priority:1 extents:1 across:524284k
Feb 16 20:21:19 terra user.notice root: Starting sslcert: OK.
Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error
Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock0, sector 0


Please perform the tests referred to in this post.
 
Alt-F folder is not there, and I see this line

Feb 16 20:21:19 terra user.info kernel: NAND device: Manufacturer ID: 0xad, Chip ID: 0xf1 (Hynix H27U1G8F2BTR-BC), 128MiB, page size: 2048, OOB size: 64

which surprises me, as the box should have 256 Mb of memory.

It has 256 MB of RAM and 128 MB of flash memory.

Erik J

Feb 17, 2015, 1:27:04 AM
to al...@googlegroups.com




On Monday, February 16, 2015, 22:49:51 (UTC+1), João Cardoso wrote:



please perform the tests referred to on this post

See attached file.

Looks quite ok to me in general.

 
output_tests.txt

João Cardoso

Feb 17, 2015, 12:49:31 PM
to al...@googlegroups.com


On Tuesday, February 17, 2015 at 6:27:04 AM UTC, Erik J wrote:





See attached file.

Looks quite ok to me in general. 

Yes, it looks OK

But as the 'dmesg' output is always the same, you can't be sure that no errors occurred during any of the tests. Please repeat the tests and use 'logread' instead; as it records the date and time, you can verify whether further output was generated during each test. You can even add your own comments to the syslog using the command 'logger', e.g. 'logger Hey there!'
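
A minimal sketch of the marker idea, assuming the syslog lines look like the ones quoted in this thread. `errors_since_marker` is a hypothetical helper, not something shipped with Alt-F; it counts the "uncorrectable ECC error" lines that follow the last line containing a marker text of your choice.

```shell
#!/bin/sh
# Hypothetical helper: read syslog text on stdin and count the
# "uncorrectable ECC error" lines that appear after the LAST line
# containing the marker text given as $1.
errors_since_marker() {
    awk -v marker="$1" '
        index($0, marker) { count = 0; next }    # marker seen: restart the count
        /uncorrectable ECC error/ { count++ }    # one kernel error line
        END { print count + 0 }                  # print 0 when nothing matched
    '
}

# Possible usage on the box (the marker text is arbitrary):
#   logger "TEST-START mtd0"
#   ... run one test here ...
#   logread | errors_since_marker "TEST-START mtd0"
```

Because only the lines after the marker are counted, errors logged at boot time do not get mixed up with errors caused by the test itself.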

And mtd0 also needs to be tested, as the errors you see come from it:

Feb 16 20:21:19 terra user.err kernel: __nand_correct_data: uncorrectable ECC error<3>end_request: I/O error, dev mtdblock0, sector 0

But it is better not to use nandtest for it, as the test is destructive. The '-k' option rewrites the original contents, but if something nasty happens your box will be totally bricked. Instead use

[root@DNS-325]# nanddump -f /tmp/mtd0-dump /dev/mtd0
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00100000...

This command only reads data, and any errors should appear in its summary and in the logread output.
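
As a rough sketch of how that summary could be checked without reading it by eye, the helpers below pull the numbers out of nanddump's text output. Both function names are hypothetical, and the line formats are simply copied from the output quoted in this thread; other mtd-utils versions may print something different. The usage comment also assumes that the summary goes to stderr, as the redirections used later in this thread suggest.

```shell
#!/bin/sh
# Hypothetical helpers (not part of mtd-utils): summarize nanddump's
# text output read from stdin.

# Print the number that follows "ECC failed:" in the summary.
ecc_failed_count() {
    sed -n 's/^ECC failed: \([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Count the per-page "uncorrectable bitflip(s)" messages.
bitflip_line_count() {
    grep -c 'uncorrectable bitflip(s) at offset'
}

# Possible usage on the box:
#   nanddump -f /tmp/mtd0-dump /dev/mtd0 2> /tmp/mtd0-dump.log
#   ecc_failed_count < /tmp/mtd0-dump.log
#   bitflip_line_count < /tmp/mtd0-dump.log
```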

Erik J

Feb 17, 2015, 2:46:52 PM
to al...@googlegroups.com


See the attached file. Yes, lots of "ECC: 8 uncorrectable bitflip(s) at offset xxx" errors.

Looks terribly bad... :((

output_tests+nandump.txt

João Cardoso

Feb 17, 2015, 4:10:53 PM
to al...@googlegroups.com
Yes... I don't have any advice for you.

Your options are:
- leave it as is, as you can still boot the box, and hope that no further errors develop
- flash the D-Link firmware back...
- reflash u-boot. But for that you would need a known good u-boot image from an identical box.
 

Erik J

Feb 17, 2015, 4:54:06 PM
to al...@googlegroups.com


OK, so what happens if I
 - leave it as it is? Not really an option, I think; the Alt-F system is deteriorating. After the last reboot there was no ssh again, the Transmission settings were lost, and no Alt-F folder was to be found. What will be next? Is it not possible to move the mtd0 location to an error-free zone?

 - flash the D-Link firmware back? That can be done through the Alt-F interface, if I remember correctly. But does D-Link also use that part of the memory, so that I will have problems again? Then I can try to claim warranty. But memory problems are difficult to prove...

I prefer option 3, although it is the most difficult (I am that way), so a request:

- Can anybody help me with a u-boot image for a DNS-320L rev A3?

João, do you have a link at hand on how to make that image and how to flash it? (Not serial soldering, I hope; that is too difficult.)

Thanks!!

João Cardoso

Feb 18, 2015, 1:07:35 PM
to al...@googlegroups.com
 
See attached  file, yes  lots of ECC: 8 uncorrectable bitflip(s) at offset xxx errors.

looks terribly bad......:((

Yes... I don't have any advise for you.

Your chances are:
-leave it as is, as you can still boot the box, and hope that no further errors develop
-flash D-Link fw back...
-reflash u-boot. But for that you would need a know good u-boot image from an identical box.


ok, so what happens if
 - leave it as it is, not really an option i think, the Alt-f system is deteriorating. After last reboot, no more ssh again, lost settings in transmission, no alt-f folder to be found, what will be next. It is not possible to move the mtd0 location to an error free zone?

No. The bootloader has to reside in a fixed memory location -- that's where the processor starts executing program instructions at power-up.
Also, NAND chips have a special robust area intended specifically for the bootloader.
 


 - flashing D-Link back, which can be done through the Alt-F interface as I remember well, but does D-Link use that part of the memory also, and as a result, will I have problems again?

Probably yes. All firmware, be it D-Link, Alt-F, whatever, needs the bootloader to start. The bootloader is the first program that starts on a computer; then the bootloader starts an operating system. That is what happens on all computers, be it a PC, Mac, the DNS, a toaster...
 
Then I can try to claim warranty. But, memory problems are difficult to proof..

But D-Link has a back door on all its NAS boxes: the funplug script, which allows running ffp and other programs. That's something we all must thank D-Link for. So, if you flash the D-Link firmware back and install ffp, and if mtd-utils is available for ffp, then you have an argument to claim warranty... if you want to go that way.
 

I prefer option 3, but the most difficult, (I am that way), so a request:

Much more difficult, and subject to errors and incompatibilities.
Touching u-boot is outside my comfort zone. Notice that I didn't even recommend running nandtest on it...
 

- Anybody who can help me with a u-boot image for a DNS 320L A3 version??

I can supply mine for a DNS-320L-A1. I have reasons to believe that all rev-Ax boards are identical. But having one for the rev-A3 is safer.
 

João, do you have any link at hand how to make that image and how to flash it?

I have reached the limit of my competence on this subject. I have a good understanding of it, but the gory details can make the difference.

The command is

nanddump -f filename /dev/mtd0 # dumps all mtd0 contents to file named filename

nanddump -l size -f filename /dev/mtd0 # dumps size bytes of mtd0  to file named filename

The difficulty, as you will soon see, is to know the size that one should dump.

On my DNS-325-A1, I used

[root@DNS-325]# nanddump -f dns-325-A1-mtd0-dump.bin /dev/mtd0
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00100000...
 
[root@DNS-325]# ls -l dns-325-A1-mtd0-dump.bin 
-rw-r--r--    1 root     root       1048576 Feb 18 16:25 dns-325-A1-mtd0-dump.bin

Without any NAND error being reported.

But on my DNS-320L-A1, surprise:

[root@dns-320l]# nanddump -f dns-320l-A1-mtd0-dump.bin /dev/mtd0
ECC failed: 1024
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00100000...
ECC: 8 uncorrectable bitflip(s) at offset 0x000a0000
ECC: 8 uncorrectable bitflip(s) at offset 0x000a0800
...
ECC: 8 uncorrectable bitflip(s) at offset 0x000bf000
ECC: 8 uncorrectable bitflip(s) at offset 0x000bf800
 
[root@dns-320l]# l dns-320l-A1-mtd0-dump.bin 
-rw-r--r--    1 root     root       1048576 Feb 18 16:28 dns-320l-A1-mtd0-dump.bin

Similar to yours, but only to the untrained eye.
While your errors occur from the very beginning of mtd0:
ECC: 8 uncorrectable bitflip(s) at offset 0x00000000

my errors start only at address 0x000a0000, which is outside the u-boot code (judging from the u-boot start message, which can only be seen with a serial adapter). So I'm not worried about my box.

From the u-boot start message, it looks like its size is 524272 bytes, and if I dump only that amount I get no errors:

[root@dns-320l]# nanddump -l 524272 -f dns-320l-A1-mtd0-dump.bin /dev/mtd0
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x0007fff0...
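
The offsets quoted above can be cross-checked with plain shell arithmetic. This is only a sketch using the numbers printed in this thread; the function names are made up for illustration, and the page count assumes that every page between the first and last quoted offsets failed, which the elided "..." suggests but does not prove.

```shell
#!/bin/sh
# Values taken from the nanddump output quoted in this thread.
BLOCK_SIZE=131072           # "Block size 131072"
PAGE_SIZE=2048              # "page size 2048"
FIRST_ERR=$((0x000a0000))   # first failing offset on the DNS-320L-A1
LAST_ERR=$((0x000bf800))    # last failing offset on the DNS-320L-A1
UBOOT_LEN=524272            # length passed to "nanddump -l"

# Erase block that contains a given byte offset.
block_of() { echo $(( $1 / BLOCK_SIZE )); }

# Pages from the first to the last failing offset, inclusive
# (assumes the elided lines cover every page in between).
failing_pages() { echo $(( (LAST_ERR - FIRST_ERR) / PAGE_SIZE + 1 )); }

printf 'first error at byte %d, in block %d\n' "$FIRST_ERR" "$(block_of "$FIRST_ERR")"
printf 'failing pages: %d\n' "$(failing_pages)"
printf 'a %d-byte dump ends at 0x%08x\n' "$UBOOT_LEN" "$UBOOT_LEN"
```

The point of the comparison is that the 524272-byte dump stops well before the first failing offset, which is consistent with the error-free result of the shorter dump.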

So, and this is a conjecture, it looks like D-Link flashed only the exact number of bytes needed to hold u-boot, leaving the remaining area of the flash partition uninitialized.
I think that I could confirm this conjecture by dumping and examining the OOB. But this discussion is getting too esoteric for most.

Summarizing and concluding: I think that your box is dying, but I can't recommend doing something that I'm not completely sure of (and I have not yet addressed the u-boot NAND writing procedure). If I had similar issues on my box, I would research them for a few more days and would eventually flash u-boot. Notice that flashing u-boot from within u-boot itself poses no problems for me, but that requires a serial adapter, which you don't have.

Luck!

Erik J

Feb 19, 2015, 1:43:52 AM
to al...@googlegroups.com


On Wednesday, February 18, 2015, 19:07:35 (UTC+1), João Cardoso wrote:
 
See attached  file, yes  lots of ECC: 8 uncorrectable bitflip(s) at offset xxx errors.

looks terribly bad......:((

Yes... I don't have any advise for you.

Your chances are:
-leave it as is, as you can still boot the box, and hope that no further errors develop
-flash D-Link fw back...
-reflash u-boot. But for that you would need a know good u-boot image from an identical box.


ok, so what happens if
 - leave it as it is, not really an option i think, the Alt-f system is deteriorating. After last reboot, no more ssh again, lost settings in transmission, no alt-f folder to be found, what will be next. It is not possible to move the mtd0 location to an error free zone?

No. The bootloader has to reside in a fixed memory location -- that's where the processor starts executing programs instructions at power up.
Also, NAND chips have a special robust area intended specifically for the bootloader.

Ok.
 
 


 - flashing D-Link back, which can be done through the Alt-F interface as I remember well, but does D-Link use that part of the memory also, and as a result, will I have problems again?

Probably yes. All firmware, be it D-Link, Alt-F, whatever, needs the bootloader to start. The bootloader is the first program that starts on a computer; then the bootloader starts an operating system. That is what happens on all computers, be it a PC, Mac, the DNS, a toaster...
 
Then I can try to claim warranty. But, memory problems are difficult to proof..

But D-Link has a back-door on all its NAS, the funplug script that allows running ffp and other programs. That's something that we all must thanks to D-Link. So, if you flash back d-link fw and install ffp and if mtd-utils is available for ffp, then you have an argument to claim warranty... if you want to go that way.
 

It almost seems to be going that way. But I run into one little problem: I have my RAID set up with ext4, and I see on the D-Link website that ext4 is not supported. So I either destroy my RAID or put an old 4 GB disk in... Wow, what a modern NAS I will have :)
 

I prefer option 3, but the most difficult, (I am that way), so a request:

Much more difficult and subject to errors and incompatibilities.
Touching u-boot is out of my zone of comfort. Notice that I didn't even recommended running nandtest on it...

Yes, I got nervous as well, and there is no way to claim warranty after that. And if it is outside your comfort zone, who am I to do it with confidence?
 

- Anybody who can help me with a u-boot image for a DNS 320L A3 version??

I can supply mine for a DNS-320L-A1. I have reasons to believe that all rev-Ax boards are identical. But having one for the rev-A3 is safer.

Probably the differences between the A1-A3 revisions are small changes in component manufacturers, probably cheaper hardware. As neither the firmware nor the D-Link website makes any reference to the different A revisions, I suppose that the rev-Ax boards are identical.
 
 

João, do you have any link at hand how to make that image and how to flash it?

I reached my level of competence on this subject. I have a good understanding on the subject, but the gory details can make the difference.

Good that it is written out in such a clear way. Unless someone steps in who can guide me through with 100% confidence, I will not do it, especially because the box is still functional to a certain degree.
 

Quite out of my range, yes. But I cannot see why it "looks like that its size is 524272 bytes".
 

Summarizing and concluding: I think that your box is dying, but I can't recommend doing something that I'm not completely sure of (and I have not yet addressed the u-boot nand writing procedure). If I had similar issues in my box I would research that for a few more days and I would eventually flash u-boot. Notice that flashing u-boot from within u-boot itself poses me no problems, but that requires a serial adapter, which you don't have.

Luck!

Thanks A LOT!

We will see what happens in the coming weeks.
 

João Cardoso

Feb 21, 2015, 1:40:00 PM
to al...@googlegroups.com


On Thursday, February 19, 2015 at 6:43:52 AM UTC, Erik J wrote:


On Wednesday, February 18, 2015, 19:07:35 (UTC+1), João Cardoso wrote:
 
See attached  file, yes  lots of ECC: 8 uncorrectable bitflip(s) at offset xxx errors.

looks terribly bad......:((

Yes, but what is odd is that the other flash partitions read and test without any issues. Only mtd0, the u-boot partition, shows those errors.
And I doubt that they are really errors, because if they were, the system wouldn't even boot (u-boot, the bootloader, would be corrupted and wouldn't execute correctly).

I have done some further research (I just can't let it go, that's why I can't be happy :-), and found, e.g., this post. I have checked the value for the 320L in the Linux kernel version that Alt-F is using and it is 40, which is big enough. But some other hardware uses 25, 30 or 35 (25 is the general value for the Kirkwood SoC that the box uses). And the value used (40) applies to the whole flash, not only to mtd0, so if it were a timing error it would affect all flash partitions. The information conflicts; another hypothesis is needed.

I also found that there are several ways to use ECC (Error Correction Code) on NAND chips. It comes in several (incompatible) flavours and can be implemented in software or hardware.
I think that is the reason why mtd0 also shows errors on my system. I discovered that the errors appear only in a zone of the flash memory used by u-boot to save variables, and I think that the ECC algorithm used by u-boot is not identical to the one used by mtd-utils and the Linux kernel.
In your case, however, the errors are spread over the whole of mtd0.

Can you please run the following commands and attach the generated mtd0-dump.bin, mtd0-dump.log, and mtd0-dump.hex files? It would also be helpful if other DNS-320L users could execute the commands and post the files as attachments.

nanddump -f mtd0-dump.bin /dev/mtd0 2> mtd0-dump.log
nanddump -ocf mtd0-dump.hex /dev/mtd0

...
 
From the u-boot start message, it looks like that its size is 524272 bytes, and if I only dump that amount I get no errors:

...
 
quite out of my range yes. But I cannot see why it " looks like that its size is 524272 bytes "

An educated guess based on the u-boot boot-displayed message.

Erik J

Feb 22, 2015, 6:34:30 AM
to al...@googlegroups.com


On Saturday, February 21, 2015, 19:40:00 (UTC+1), João Cardoso wrote:


Can you please run the following commands and attach the generated mtd0-dump.bin, mtd0-dump.log, and mtd0-dump.hex files? It could also be helpful if other DNS-320L users could execute the commands and post (attaching) the files.

nanddump -f mtd0-dump.bin /dev/mtd0 2> mtd0-dump.log
nanddump -ocf mtd0-dump.hex /dev/mtd0

...
 
with pleasure!
After running the above commands I tried to shut down the box through the web interface, in order to take a picture of the memory chip; see attachment.
The shutdown procedure did not work, although the temperature was correct (recall the problem mentioned a while ago). Even the front power button did not turn it off. The LEDs kept blinking at a pace of about 3 times a second. I unplugged the cord and took some pictures. Here is the memory... looks cheap. (But what did I expect? Cheap box.)

I ran some tests after the reboot.
I noticed that the board temperature and settings such as the location of the Alt-F folder and Transmission were gone again,
and that the ECC errors seem to be fewer just after a reboot. (For the full log made after the reboot, see attachments.)
[root@terra]# nanddump -f /tmp/mtd0-dump /dev/mtd0
ECC failed: 80


I hope more people will run these tests. Because yes, would the box not be totally dead if there were boot memory problems?


Just before posting this message, I noticed I had forgotten to attach the .bin file, so I ran the tests again, as the old files had disappeared. So you can see some differences in the ECC numbers.

Thanks a lot.
 
mtd0-dump.log
nanddump -ocf.txt
SystemConf-Sun Feb 22 11:49:34 CET 2015.log
DSC00060-1.JPG
nanddump after reboot 22-2 12:03.txt
mtd0-dump-afterreboot.log
mtd0-dumpafterreboot.bin
nanddump -ocf-afterreboot.txt

João Cardoso

Feb 22, 2015, 1:02:54 PM
to al...@googlegroups.com


On Sunday, February 22, 2015 at 11:34:30 AM UTC, Erik J wrote:


 
with pleasure!

Thanks.
Unfortunately you didn't attach the right files; let's try again.
The generated file will be /tmp/mtd0.tgz. Attach only that file; you don't need to reboot or post anything else.

nanddump -f /tmp/mtd0-dump.bin /dev/mtd0 2> /tmp/mtd0-dump.log
nanddump -ocf /tmp/mtd0-dump.hex /dev/mtd0 2> /dev/null
mtdinfo -M /dev/mtd0 > /tmp/mtd0-info.txt
tar -czf /tmp/mtd0.tgz /tmp/mtd0-*
rm /tmp/mtd0-*


But the larger 1 MB file that you posted is OK. I compared it with my own, and guess what: only three bytes (that might matter) out of 1 MB are different. Three bytes can be more than enough, though...
The other difference is the date:

  Mine: U-Boot 1.1.4 (Apr 17 2012 - 19:36:21) Marvell version: 3.6.0 DNS-320L
Yours: U-Boot 1.1.4 (Apr 17 2014 - 17:31:34) Marvell version: 3.6.0 DNS-320L
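
One way such a byte-for-byte comparison could be made is with `cmp -l`, which prints one line per differing byte. The sketch below is not necessarily how the comparison above was done; `count_diff_bytes` is a hypothetical helper and the file names in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print how many byte positions differ between two
# equally sized dump files. "cmp -l" prints one line per differing byte,
# so counting its output lines gives the number of differences.
count_diff_bytes() {
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}

# Possible usage (placeholder file names):
#   count_diff_bytes my-mtd0-dump.bin your-mtd0-dump.bin
#   cmp -l my-mtd0-dump.bin your-mtd0-dump.bin   # offsets and byte values
```

Note that `cmp -l` reports offsets starting at 1 and shows the byte values in octal, so the offsets need adjusting before comparing them with the hexadecimal addresses printed by nanddump.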

After running the above commands I tried to shutdown the box through the webinterface. To make a picture of the memory, see attachment.

Here is some info about it. It says that the chip access time is 45 ns, although the data sheet says it is 25 ns. The Linux kernel is using 40 ns. I trust the data sheet more; it might be a Farnell typo or a second-grade chip.

 
The shutdown procedure did not work, while the temperature was correct (remind a mentioned problem ago). Even the front power button did not turn if off. Leds kept blinking in a pace like 3 times a second. I unplugged the cord and took some pictures. Here is the memory....  looks cheap. (but what did I expect, cheap box)

Yes, cheap box, cheap disk holding frame and cheap thermal design.

But according to your previous 'nandtest' results, there were no issues in the area of the flash chip that holds the kernel and rootfs, so unless there is a RAM issue there is no explanation for those other issues.
In this post I'm addressing *only* the "__nand_correct_data: uncorrectable ECC error" subject, and that comes from the part of the flash chip that holds the box's boot loader (u-boot).

Again, I think that your box is ill or dying. Remove the disks (and keep them safe) and flash the D-Link firmware back...

Erik J

Feb 22, 2015, 1:37:30 PM
to al...@googlegroups.com
thanks for that, rather old no?
 
 
The shutdown procedure did not work, while the temperature was correct (remind a mentioned problem ago). Even the front power button did not turn if off. Leds kept blinking in a pace like 3 times a second. I unplugged the cord and took some pictures. Here is the memory....  looks cheap. (but what did I expect, cheap box)

Yes, cheap box, cheap disk holding frame and cheap thermal design.

But according to your previous 'nandtest' results there was no issues regarding the area in the flash chip that holds the kernel and rootfs, so unless there is a RAM memory issue there is no explanation for that other issues.

Yes, one problem at a time. But by the way, ssh, which stopped working after the last reboot 5 days ago, is working again.
 
In this post I'm addressing *only* the "__nand_correct_data: uncorrectable ECC error" subject, and that comes from the part of the flash chip regarding the box boot loader (u-boot).

Again, I think that your box is ill or dying. Remove (and keep save) the disks and flash the D-Link fw back...

Not what I would prefer, especially because the D-Link firmware will not support my ext4 RAID. :(

But for sure, then you will be rid of me. I will do it soon if this attachment does not shed some hopeful light.

Thanks

e.
 
mtd0.tgz

Erik J

Mar 8, 2015, 10:56:22 AM
to al...@googlegroups.com


Hello João

Back again. I have the D-Link firmware running on my box again. (So slow, and the web interface has to go full screen on my netbook to show it all; it is a drag...)

funplug is installed, but none of the repos has mtd-utils available, so I can't run nanddump and check the state of the mtd0 memory partition. I know this is getting a little off-topic, but do you know of any way to run nanddump? Do I have to compile it? Or can it be run stand-alone? I would just like to know if the errors are still there. Almost certainly yes, but it would also be a way to claim warranty. I hope...

Thanks again, and don't burn out. Take it easy!
 

João Cardoso

Mar 8, 2015, 2:57:26 PM
to al...@googlegroups.com
...
 
Hello João

Back again. I have the d-link firmware running again on my box. (so slow, and web interface has to go full screen on my netbook to show it all, it is a drag...)

funplug is installed, but none of the repos has mtd-utils available, so I can't run nanddump and check the state of the mtd0 memory partition. I know this is getting a little off-topic, but do you know of any way to run nanddump? Do I have to compile it?

If it is not available as an ffp package, yes, you have to compile it...
You might also search for it under 'optware'; there are thousands of packages, but I'm not sure whether it exists for your box.

I have analyzed the last files you posted, and I can confirm that your mtd0 and mine are identical with the exception of 3 bytes.
These three bytes will cause no problems if they lie in an area of the bootloader code or data that is not used in the normal boot sequence. And that seems to be your case, as you can boot without issues.

Warning: technical content below (read: dragons flying below)

The big difference between your mtd0 and mine is the OOB (Out Of Band) area.
There is one 64-byte OOB for every 2048-byte page of data, and the ECC (Error Correction Code) is stored in the OOB. For the same data in mtd0, all of your ECC values in the OOB differ from mine, and that explains the error/warning message: the system computes an ECC for each 2048-byte page it reads and compares it with the ECC stored in the OOB; as they differ, it concludes that an error exists and tries to use the ECC stored in the OOB to correct the data. The ECC can detect and correct only a limited number of bit errors, and that limit is exceeded in your case.
I believe that in your case the stored ECC is wrong: your mtd0 data and mine are identical (with the exception of the 3 bytes mentioned), but our ECC values differ. As I have no errors like yours, my ECC is correct and yours is wrong. The way to correct all the wrong ECC values is to write mtd0, i.e., flash the bootloader to mtd0.

Up to a certain extent you are lucky that the ECC can't correct the errors, because the data read is not corrected and, as it is really OK, it is executed and the boot performs flawlessly.
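
To make the compare-and-correct step described above more concrete, here is a minimal Python sketch. It does not reproduce the ECC actually used on this board: `toy_ecc` and `check_page` are hypothetical names, and the scheme is a deliberately simple single-bit-correcting code chosen only for illustration. It shows how a wrong stored ECC sitting next to perfectly good data can be reported as uncorrectable while the data itself is handed back untouched.

```python
from typing import Tuple


def toy_ecc(data: bytes) -> Tuple[int, int]:
    """Toy ECC (an assumption, not the board's real algorithm).

    Returns (XOR of the positions of all set bits, parity of the set-bit count).
    """
    pos_xor = 0
    parity = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                pos_xor ^= byte_index * 8 + bit
                parity ^= 1
    return pos_xor, parity


def check_page(data: bytes, stored: Tuple[int, int]) -> Tuple[str, bytes]:
    """Compare the ECC computed from `data` with the `stored` one."""
    pos_xor, parity = toy_ecc(data)
    syndrome = pos_xor ^ stored[0]
    parity_diff = parity ^ stored[1]
    if syndrome == 0 and parity_diff == 0:
        return "ok", data
    if parity_diff == 1 and syndrome < len(data) * 8:
        # Looks like a single flipped bit: the syndrome is its position.
        fixed = bytearray(data)
        fixed[syndrome // 8] ^= 1 << (syndrome % 8)
        return "corrected", bytes(fixed)
    # Anything else is beyond what this toy code can repair;
    # the data is returned exactly as it was read.
    return "uncorrectable", data


if __name__ == "__main__":
    page = bytes(range(256)) * 8           # one 2048-byte page of sample data
    good_ecc = toy_ecc(page)
    wrong_ecc = (good_ecc[0] ^ 0x1234, good_ecc[1])  # good data, bogus stored ECC
    print(check_page(page, good_ecc)[0])
    print(check_page(page, wrong_ecc)[0])
```

In the last call the page is intact and only the stored ECC is bogus, which is the situation hypothesised above: the check reports an error it cannot correct, yet the bytes returned are the original, valid ones.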

Notice that the bootloader, u-boot, in mtd0, is not touched by Alt-F or even by the D-Link firmware. When you flash new firmware, other areas of the flash memory are used.
I remember analysing all the DNS-323 firmware released by D-Link and finding no bootloader in any of it. So it's a mystery to me how the ECC got corrupted (if my hypothesis is correct).

I haven't posted the above analysis and hypothesis before because I don't have a solution for you (other than flashing mtd0, which, as I already told you, I'm not comfortable recommending).

Or can it be run stand-alone. Just would like to know if the errors are still there. Almost certain yes, but also a way to claim warranty. I hope....

Thanks again, and don't burn out. Take it easy!

Yeah, "don't worry, be happy" should be my motto. :-)
 
Thanks and luck

Erik J

Mar 9, 2015, 1:11:05 PM
to al...@googlegroups.com


On Sunday, March 8, 2015, 19:57:26 (UTC+1), João Cardoso wrote:

If it is not available as an ffp package, yes, you have to compile it... 

But, thanks to "mijzelf" on the ffp forum, it is already compiled for me. No time now, but soon I will dive into it.
 http://downloads.zyxel.nas-central.org/Users/Mijzelf/FFP-Stick/packages/0.7/arm/testing/

I have analyzed the last files you posted, and I can confirm that your mtd0 and mine are identical with the exception of 3 bytes.
These three bytes will cause no problems if they lie in an area of the bootloader code or data that is not used in the normal boot sequence. That seems to be your case, as you can boot without issues.

Yes, it boots fine. But remember, the _nand errors are few just after boot, and many more appear, until they reach a maximum, after it has been running for a few minutes.
Would it be an infringement on privacy if I, or you, harvest other owners of the DNS 320L from this forum, and ask them to run the test as you "prescribed"?

Really fed up with the D-Link firmware already, just noticed that my entire usb drive was open and public on the ftp server. (luckily it was only on a big intranet, guifi.net)

Warning: technical content below (read: dragons flying below)

Very technical, but you certainly made a few people happy with this explanation. (And just a thought: maybe there is a fault in the nanddump program and it cannot handle this kind of ECC.) As you said before:

I also found that there are several ways to use ECC (Error Correction Code) on NANDs. It comes in several (incompatible) flavours and can be implemented in software or hardware.

Thanks, and I will keep you updated.


João Cardoso

Mar 9, 2015, 2:24:24 PM
to al...@googlegroups.com


On Monday, March 9, 2015 at 5:11:05 PM UTC, Erik J wrote:



Yes, it boots fine. But remember, the _nand errors are few just after boot, and many more appear, until they reach a maximum, after it has been running for a few minutes.

Yes, but by then the bootloader has already done its job, and it is no longer relevant once the boot starts. What must be happening is that the Linux mtd driver is checking the whole flash chip, searching for bad blocks, and the errors appear during that check.
 
Would it be an infringement on privacy if I, or you, harvest other owners of the DNS 320L from this forum, and ask them to run the test as you "prescribed"?

I don't think there is any problem, as there is no user-related data in that flash area, so feel free to ask for users' collaboration.
The only place where user data is stored on the DNS-320/325 is the mtd5 flash partition, which Alt-F and the D-Link firmware use to save "settings".

 

Really fed up with the D-Link firmware already, just noticed that my entire usb drive was open and public on the ftp server. (luckily it was only on a big intranet, guifi.net)

That's the result of doing things "automagically". It is easier for the user, who doesn't have to configure anything, but one never knows what the consequences are.
I try to avoid that kind of automagic, but I'm aware that under Alt-F at least the NFS server exports all filesystem mount points as shares when no user-defined share exists. It's a leftover from the ffp nfs server...
 

Warning: technical content below (read: dragons flying below)

Very technical, but you certainly made a few people happy with this explanation. (And just a thought: maybe there is a fault in the nanddump program and it cannot handle this kind of ECC.) As you said before:

I also found that there are several ways to use ECC (Error Correction Code) on NANDs. It comes in several (incompatible) flavours and can be implemented in software or hardware.

Possible, but not very probable, as it works for me.
The standard says that the ECC "algorithm" is specified in the flash chip itself, and it is retrievable through some specific commands. But users (read board manufacturers) are not obliged to follow standards ;-)

That is another reason why I don't feel comfortable flashing the bootloader: when writing to the flash chip, bad blocks (which naturally develop) are detected, marked as bad, and skipped, up to the point where the whole erase block is marked as bad. While a data block has 2KiB, an erase block has 128KiB. When an erase block is marked as bad, the next erase block is used instead. If the flash "partition" is small, the replacement erase block may end up belonging to the next "partition", ruining the system.
This is not very likely to happen, as the initial portion of a flash chip (where the bootloader typically lies) is more rugged and guaranteed by the chip manufacturer to be free of defects. But nonetheless...
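
The spill-over risk described above can be sketched in a few lines of Python. This models a naive writer that simply skips bad erase blocks and does no partition boundary check; that behaviour is an assumption made for illustration, not a description of any particular flashing tool, and the function names and image sizes are hypothetical. The 128 KiB erase block size and the 8-block partition are the values mtdinfo reports for mtd0 in this thread.

```python
ERASEBLOCK_SIZE = 128 * 1024   # bytes per erase block
MTD0_BLOCKS = 8                # erase blocks in the u-boot partition


def blocks_used(image_blocks, bad_blocks):
    """Erase block indices (counted from the start of the partition) that a
    skip-bad-blocks writer ends up using for an image of `image_blocks` blocks."""
    used = []
    block = 0
    while len(used) < image_blocks:
        if block not in bad_blocks:
            used.append(block)
        block += 1
    return used


def spilled_blocks(used, partition_blocks):
    """Blocks from `used` that lie beyond the end of the partition."""
    return [b for b in used if b >= partition_blocks]


if __name__ == "__main__":
    # Hypothetical image that fills the whole partition, with block 3 gone bad.
    used = blocks_used(image_blocks=8, bad_blocks={3})
    for b in spilled_blocks(used, MTD0_BLOCKS):
        print("block", b, "at offset", hex(b * ERASEBLOCK_SIZE), "is outside the partition")
```

With an image that leaves spare erase blocks in the partition, a single bad block is absorbed without harm; the trouble only starts when the image plus the skipped blocks no longer fit.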

My box has no bad erase blocks in mtd0, but has one in mtd3:

[root@dns-320l]# mtdinfo -M /dev/mtd0
mtd0
Name:                           u-boot
Type:                           nand
Eraseblock size:                131072 bytes, 128.0 KiB
Amount of eraseblocks:          8 (1048576 bytes, 1024.0 KiB)
Minimum input/output unit size: 2048 bytes
Sub-page size:                  512 bytes
OOB size:                       64 bytes
Character device major/minor:   90:0
Bad blocks are allowed:         true
Device is writable:             true
Eraseblock map:
 0: 00000000         1: 00020000         2: 00040000         3: 00060000        
 4: 00080000         5: 000a0000         6: 000c0000         7: 000e0000        

[root@dns-320l]# mtdinfo -M /dev/mtd3
mtd3
Name:                           image
Type:                           nand
Eraseblock size:                131072 bytes, 128.0 KiB
Amount of eraseblocks:          800 (104857600 bytes, 100.0 MiB)
Minimum input/output unit size: 2048 bytes
Sub-page size:                  512 bytes
OOB size:                       64 bytes
Character device major/minor:   90:6
Bad blocks are allowed:         true
Device is writable:             true
Eraseblock map:
   0: 00000000           1: 00020000           2: 00040000           3: 00060000        
   4: 00080000           5: 000a0000           6: 000c0000           7: 000e0000        
   8: 00100000           9: 00120000          10: 00140000          11: 00160000        
  ...
 124: 00f80000         125: 00fa0000         126: 00fc0000         127: 00fe0000        
 128: 01000000    BAD  129: 01020000         130: 01040000         131: 01060000        
 132: 01080000         133: 010a0000         134: 010c0000         135: 010e0000        

Erik J

Mar 12, 2015, 5:13:14 PM
to al...@googlegroups.com


On Monday, March 9, 2015, 19:24:24 (UTC+1), João Cardoso wrote:


On Monday, March 9, 2015 at 5:11:05 PM UTC, Erik J wrote:


 
Would it be an infringement on privacy if I, or you, harvest other owners of the DNS 320L from this forum, and ask them to run the test as you "prescribed"?

I don't think there is any problem, as there is no user-related data in that flash area, so feel free to ask for users' collaboration.

I am waiting for replies...


Possible, but not very probable, as it works for me.
The standard says that the ECC "algorithm" is specified in the flash chip itself, and it is retrievable through some specific commands. But users (read board manufacturers) are not obliged to follow standards ;-)

mmm, not so funny at the moment. :)  but anyway, nothing I (or we)  can do about it.
 


My box has no bad erase blocks in mtd0, but has one in mtd3:

Mine has no bad blocks in mtd0 either, but it does have bad blocks in mtd3 and mtd5 (the config partition; could that be the reason some settings are not saved?)
 
root@terra:/ffp/sbin# mtdinfo -M /dev/mtd5
mtd5
Name:                           config
Type:                           nand
Eraseblock size:                131072 bytes, 128.0 KiB
Amount of eraseblocks:          40 (5242880 bytes, 5.0 MiB)
Minimum input/output unit size: 2048 bytes
Sub-page size:                  512 bytes
OOB size:                       64 bytes
Character device major/minor:   90:10
Bad blocks are allowed:         true
Device is writable:             true
Eraseblock map:
  0: 00000000          1: 00020000          2: 00040000          3: 00060000       
  4: 00080000          5: 000a0000          6: 000c0000          7: 000e0000       
  8: 00100000          9: 00120000         10: 00140000         11: 00160000       
 12: 00180000         13: 001a0000         14: 001c0000         15: 001e0000       
 16: 00200000         17: 00220000         18: 00240000         19: 00260000       
 20: 00280000         21: 002a0000         22: 002c0000         23: 002e0000       
 24: 00300000         25: 00320000         26: 00340000    BAD  27: 00360000       
 28: 00380000         29: 003a0000         30: 003c0000         31: 003e0000       
 32: 00400000         33: 00420000         34: 00440000         35: 00460000       
 36: 00480000         37: 004a0000         38: 004c0000         39: 004e0000 




So, as promised, the output of the same commands
nanddump -f /tmp/mtd0-dump.bin /dev/mtd0 2> /tmp/mtd0-dump.log
nanddump -ocf /tmp/mtd0-dump.hex /dev/mtd0 2> /dev/null
mtdinfo -M /dev/mtd0 > /tmp/mtd0-info.txt
tar -czf /tmp/mtd0.tgz /tmp/mtd0-*
rm /tmp/mtd0-*

is attached, taken while running the D-Link firmware. I cannot read hex, but I am sure there are no differences, as your explanation was very clear and convincing.

So back to Alt-F and let's make a fresh start again...

Thanks a LOT!

 
mtd0.tgz

Erik J

Mar 29, 2015, 7:32:26 AM
to al...@googlegroups.com
Hello again.

Just FYI, and maybe out of curiosity: attached are another set of bin and hex files, this time from a 320L-A2. No NAND problems as far as I can see. The result of writing 10 owners of a 320L-box.

Greetings.

e.
320L-A2.tar.gz

João Cardoso

Mar 29, 2015, 10:24:58 AM
to al...@googlegroups.com


On Sunday, March 29, 2015 at 12:32:26 PM UTC+1, Erik J wrote:
Hello again.

Just FYI, and maybe out of curiosity: attached are another set of bin and hex files, this time from a 320L-A2. No NAND problems as far as I can see.

Yes, it is exactly equal to the one from my 320L_A1.

You can compare it (I use kdiff3 or kompare on linux) with your own box's mtd0-dump.hex and see that the differences are only in the "OOB" areas (where the ECC is), and not in the data itself.
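
For anyone who prefers a script to a visual diff tool, the same comparison can be sketched in Python. This assumes two raw binary dumps in which every 2048-byte page is immediately followed by its 64-byte OOB; whether a given nanddump invocation produces that layout depends on its version and options, so the layout, the function name `count_differences`, and the file names in the comment are all assumptions.

```python
PAGE_SIZE = 2048   # data bytes per page
OOB_SIZE = 64      # out-of-band bytes per page


def count_differences(dump_a, dump_b, page_size=PAGE_SIZE, oob_size=OOB_SIZE):
    """Return (differing data bytes, differing OOB bytes) for two raw dumps
    laid out as page, OOB, page, OOB, ..."""
    if len(dump_a) != len(dump_b):
        raise ValueError("dumps have different lengths")
    stride = page_size + oob_size
    if len(dump_a) % stride != 0:
        raise ValueError("dump length is not a whole number of page+OOB units")
    data_diffs = 0
    oob_diffs = 0
    for offset, (a, b) in enumerate(zip(dump_a, dump_b)):
        if a != b:
            if offset % stride < page_size:
                data_diffs += 1
            else:
                oob_diffs += 1
    return data_diffs, oob_diffs


# Hypothetical usage with two dump files:
# with open("mine.bin", "rb") as f1, open("yours.bin", "rb") as f2:
#     print(count_differences(f1.read(), f2.read()))
```

A result with a data count of zero (or very few bytes) and a large OOB count would match what the visual comparison shows.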
 
The result of writing 10 owners of a 320L-box.

I'm afraid I don't understand. What do you mean?
 

Greetings.

Erik J

Mar 29, 2015, 11:52:03 AM
to al...@googlegroups.com

I mean that I asked 10 owners of a 320L box to run the same tests you proposed to me, so that the differences could be examined. One of them was kind enough to do it and send me the results. Thanks for that! I am still hoping to get results from a 320L-A3, as that is my box, to see whether other 320L-A3 boxes also have NAND memory errors.
 
 

Greetings.

Todd Lowe

Feb 14, 2016, 2:12:15 PM
to Alt-F
I know this is an old thread, but I have just flashed a new 320L-A3 with Alt-F RC4.1 and am seeing ECC errors on mtdblock0 as well.
I haven't had time yet to pull logs or do any tests, but since this is an old thread I thought I'd ask
a) whether there has been any resolution, and
b) whether anyone still wants the dumps and data for comparison that were requested above.

I have an older 320L-A1 which does not show these log messages.

I'll follow up with my findings regardless to help anyone else that stumbles on this in the future.

Todd

erik

Feb 14, 2016, 3:07:50 PM
to al...@googlegroups.com

Hello.

I am the one who started this thread, and I can quickly tell you that the box is still running. Only on weekends, though, so it is not having a hard working life.

None of the problems I had before have returned, who knows why. I don't.
So for now I have given this problem a rest, as I no longer experience the old problems.

Thanks for following up!


Todd Lowe

Mar 3, 2016, 6:19:19 PM
to Alt-F
It seems to be working, and only complains at boot time so I've decided to ignore the error :-)