Hello,
I was running some computationally intensive tests on a board for many hours. In one iteration, the board appeared to hang. This was running from the NAND flash. I wonder if there is some memory setting I can tweak to get rid of the two problems I've seen: a few random timeouts, and one small cluster of memory write failures.
Someone pointed out some board issues uncovered by memtester ("My ci20 board is always freezing with debian and android", https://groups.google.com/forum/#!topic/mips-creator-ci20/TnU4rIPzBuI). After reading that post, I ran several iterations of "memtester 700" as root. Everything was fine until the 14th loop. Somewhere in there, the board stopped responding.
The board was originally sitting on its shipping foam. I added stand-offs so that the board is about 1 cm above the table, just in case there was a heat issue. I was also concerned that the NAND flash might have an issue, so I arranged to still boot from NAND flash but mount the rootFS from an SD card.
Repeating "memtester 700 2" showed no issues. Nothing else was running on the board.
I then used "memtester 800 16" to see if this would produce anything different. Again, nothing else running on the board. Starting in loop 5/16, I saw occasional timeout messages of the form:
[83414.463404] jzmmc jzmmc.0: timeout 1000ms op:25 w sz:32768 state:2 STAT:0x1E000580 DMALEN:0x00000A00 blks:42/64 clk:enable clk_gate: enable
In the 25 hours covered by dmesg, this happened 9 times, with no particular clustering. The value after "sz:" was mostly 4096, sometimes 8192. (An early test log shows 32768.)
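For anyone who wants to tally these timeouts rather than eyeball them, here is a minimal Python sketch. It assumes the timeout lines always have the same shape as the one quoted above; the function name summarize_timeouts and the regular expression are my own and not part of any driver tooling.

```python
import re
from collections import Counter

# Pattern modelled on the single jzmmc timeout line quoted above.
# Assumption: other timeout lines from this driver use the same field layout.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+jzmmc\s+\S+\s+timeout\s+(?P<ms>\d+)ms\s+"
    r"op:(?P<op>\d+)\s+\S+\s+sz:(?P<sz>\d+).*?blks:(?P<done>\d+)/(?P<total>\d+)"
)


def summarize_timeouts(dmesg_text):
    """Return (timestamps, gaps, sizes) for the jzmmc timeout lines found.

    timestamps: kernel timestamps in seconds, in log order
    gaps:       seconds between consecutive timeouts (to judge clustering)
    sizes:      Counter mapping each "sz:" value to how often it appeared
    """
    timestamps = []
    sizes = Counter()
    for line in dmesg_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timestamps.append(float(match.group("ts")))
            sizes[int(match.group("sz"))] += 1
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return timestamps, gaps, sizes


# Example use on a saved log (the file name is hypothetical):
#   timestamps, gaps, sizes = summarize_timeouts(open("dmesg.txt").read())
#   print(len(timestamps), sorted(sizes.items()), gaps)
```

Small gaps bunched together would suggest clustering; roughly even gaps would match what I saw by eye.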
In loop 9/16, I saw this:
--------
8-bit Writes : |FAILURE: 0xd7fe7839 != 0x7ff74ae8 at offset 0x0cda4180.
FAILURE: 0x75bb32bd != 0xfef7fbf4 at offset 0x0cda4184.
FAILURE: 0xfffe1e26 != 0xd7fe7839 at offset 0x0cda4188.
FAILURE: 0xffbe9b90 != 0x75bb32bd at offset 0x0cda418c.
FAILURE: 0x7fbe0544 != 0xfffe1e26 at offset 0x0cda4190.
FAILURE: 0xefbd79f4 != 0xffbe9b90 at offset 0x0cda4194.
16-bit Writes : ok
--------
Errors of that type never recurred, and the system never hung during this test.
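In case it helps anyone look at the failing words, here is a rough Python sketch that parses the FAILURE lines, reports the address span, counts differing bits, and checks whether a value from one line reappears on another. The helper names are made up for illustration, and the 4-byte word size is an assumption based on the 8-hex-digit values. In the excerpt above, the first value on a line shows up again as the second value two lines later (8 bytes further on); I don't know whether that is significant.

```python
import re

# Matches memtester's "FAILURE: 0x... != 0x... at offset 0x..." text.
FAILURE_RE = re.compile(
    r"FAILURE:\s+0x([0-9a-fA-F]+)\s+!=\s+0x([0-9a-fA-F]+)\s+at offset\s+0x([0-9a-fA-F]+)"
)

WORD_BYTES = 4  # assumption: 32-bit words, as the 8-digit hex values suggest


def parse_failures(text):
    """Return a list of (offset, first_value, second_value) in log order."""
    return [(int(off, 16), int(a, 16), int(b, 16))
            for a, b, off in FAILURE_RE.findall(text)]


def differing_bits(a, b):
    """Count the bit positions in which the two compared words differ."""
    return bin(a ^ b).count("1")


def offset_span(failures):
    """Return (lowest offset, highest offset, number of bytes covered)."""
    offsets = [off for off, _, _ in failures]
    return min(offsets), max(offsets), max(offsets) - min(offsets) + WORD_BYTES


def repeated_values(failures):
    """List (offset_of_first_value, offset_of_second_value, value) where the
    first value on one line equals the second value on a different line."""
    first_seen_at = {a: off for off, a, _ in failures}
    return [(first_seen_at[b], off, b)
            for off, _, b in failures
            if b in first_seen_at and first_seen_at[b] != off]
```

Feeding it the six lines above gives a span of 24 bytes starting at offset 0x0cda4180.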
Specifics about the board:
* Rev B (purple) board
* power adapter: 120 VAC -> 5 VDC, going to the 4 mm x 1.7 mm pin
* OS: Debian 7 with some apt-get update and upgrades/installs
* kernel: 3.0.8-12439-gf697891
* room temperature: normal office temperature
What was I doing before memtester? Running a test loop of computational compliance tests, including occasional floating point. No graphics. Test log written to NFS.
Board was booted with:
console=ttyS0,115200 mem=256M@0x0 mem=768M@0x30000000 ubi.mtd=1 root=/dev/mmcblk0p5 rw
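To put the memtester sizes in context, here is a small Python sketch that adds up the mem= regions from a boot line like the one above. The function name mem_regions is hypothetical, and it only handles the size@start form used here with K, M or G suffixes.

```python
import re

# Multipliers for the size suffixes handled by this sketch.
UNIT = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Only the "mem=<size>[KMG]@<start>" form is recognised (an assumption that
# fits the boot line above; other mem= variants are ignored).
MEM_RE = re.compile(r"\bmem=(\d+)([KMG]?)@(0x[0-9a-fA-F]+|\d+)")


def mem_regions(cmdline):
    """Return a list of (start_address, size_in_bytes) for each mem= option."""
    return [(int(start, 0), int(size) * UNIT[unit])
            for size, unit, start in MEM_RE.findall(cmdline)]


cmdline = ("console=ttyS0,115200 mem=256M@0x0 mem=768M@0x30000000 "
           "ubi.mtd=1 root=/dev/mmcblk0p5 rw")
regions = mem_regions(cmdline)
total_mib = sum(size for _, size in regions) // (1 << 20)
print("regions:", [(hex(start), size >> 20) for start, size in regions])
print("total MiB declared:", total_mib)
```

With the two regions above, the declared total is 1024 MiB, so "memtester 800" asks for most of it (assuming memtester's default unit of megabytes).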
--------
Are these kinds of problems addressed in the 3.18 kernel? Is there some other tweak I should apply?
--Rick K