Crc Warm Copy Page CRC mismatch


Kulunu Geeganage

unread,
Jan 29, 2018, 1:46:51 AM1/29/18
to stressapptest-discuss
Dear All,

I ran stressapptest on my custom hardware design for 2 days and got the following final results. (The complete log file is attached.)

Log: Thread 1 found 39 hardware incidents
Log: Thread 2 found 1 hardware incidents
Log: Thread 5 found 3 hardware incidents
Log: Thread 6 found 37 hardware incidents
Stats: Found 80 hardware incidents
Stats: Completed: 165702896.00M in 1001.73s 165416.61MB/s, with 80 hardware incidents, 0 errors
Stats: Memory Copy: 165702896.00M at 165427.95MB/s
Stats: File Copy: 0.00M at 0.00MB/s
Stats: Net Copy: 0.00M at 0.00MB/s
Stats: Data Check: 0.00M at 0.00MB/s
Stats: Invert Data: 0.00M at 0.00MB/s
Stats: Disk: 0.00M at 0.00MB/s

Status: FAIL - test discovered HW problems

In detail, the log file shows the following type of errors:

Log: Seconds remaining: 164140
Log: CrcWarmCopyPage CRC mismatch ffffffff011000001ff0110173ffff009010174024b0090 != ffffffff01ffffffff0110173ffff009010173ffff0090, but no miscompares found. Retrying with fresh data.
Report Error: miscompare : DIMM Unknown : 1 : 8669s
Hardware Error: miscompare on CPU 0(0xF) at 0x751d76d8(0x8df616da:DIMM Unknown): read:0x0000010000020100, reread:0x0000010000020100 expected:0x0000010000000100
Log: Seconds remaining: 164130

Could you please explain this? How can I evaluate stressapptest results technically? How can I understand and determine what is wrong with the memory?

Regards,
Kulunu.
log.log

Nick Sanders

unread,
Jan 29, 2018, 10:20:21 PM1/29/18
to stressappt...@googlegroups.com


On Sun, Jan 28, 2018 at 10:46 PM, Kulunu Geeganage <
Hardware Error: miscompare on CPU 0(0xF) at 0x751d76d8(0x8df616da:DIMM Unknown): read:0x0000010000020100, reread:0x0000010000020100 expected:0x0000010000000100

This means that the stressapptest software did more or less this: 
int64_t* testloc = (int64_t*)0x751d76d8;
*testloc = 0x0000010000000100;
/* wait */
if (*testloc != 0x0000010000000100)
  printf("read: %llx, expected %llx\n", (long long)*testloc, 0x0000010000000100LL);

So stressapptest wrote a value to memory and it was not the same when it read it back. This indicates your DRAM is not working correctly.

Kulunu Geeganage

unread,
Feb 1, 2018, 2:44:53 AM2/1/18
to stressapptest-discuss
Dear Nick,

Many thanks for your explanation.

1) Can't we pinpoint which memory part is not working, or which byte lane is failing?
2) So could this be an issue with a byte-lane trace in the PCB layout, or bad soldering of the memory component?
3) Do you know any method for pinpointing this kind of memory problem?
4) "So stressapptest wrote a value to memory and it was not the same when it read it back." If the memory part can be written, why can't it be read back correctly? Does this indicate a component issue?

Waiting for your quick reply.

Regards.
Kulunu

Nick Sanders

unread,
Feb 1, 2018, 4:23:59 PM2/1/18
to stressappt...@googlegroups.com
On Wed, Jan 31, 2018 at 11:44 PM, Kulunu Geeganage <iku...@gmail.com> wrote:
1) Can't we pin point which memory part is not working ? Or else which byte lane is not working ?
Physical address 0x8df616da is failing in this case. You'd need to look at your memory controller settings and PCB layout to see which component/byte lane this resolves to.
 
2) So will this be an issue of byte lane trace of PCB layout or bad soldering of the memory component ?
Yes, possibly. Or the DRAM could have a faulty cell. With multiple errors you should be able to distinguish an interconnect problem (multiple errors on the same bytelane) vs DRAM cell (multiple errors at the same address). 

If you have ECC, you're very unlikely to get single bit errors from DRAM issues. 
0x0000010000020100 
0x0000010000000100
             ^ this bit has flipped.
So a more likely issue in that case is CPU or cache fault.
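One quick way to see exactly which bit differs is to XOR the two words; a small sketch (Python here purely for illustration, not part of stressapptest):

```python
# A quick sketch: XOR the observed and expected words to see exactly
# which bit positions differ in a miscompare.
read     = 0x0000010000020100
expected = 0x0000010000000100
diff = read ^ expected
flipped = [i for i in range(64) if (diff >> i) & 1]
print(hex(diff), flipped)  # -> 0x20000 [17]
```

A single entry in `flipped` across many errors points at one stuck data line; scattered entries point elsewhere.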

3) Do you know any method of pin pointing this kind of memory problems ?
Yes, as above.
 
4) "So stressapptest wrote a value to memory and it was not the same when it read it back." If the memory part can write why it can't read ? Is this indicate a component issue ?
The read request to the memory resulted in incorrect data. So either:
* The write transaction was corrupted or the read transaction was corrupted (interconnect problems)
* The DRAM corrupted the data internally between the write and the read (DRAM component internal problem)
* The data was corrupted elsewhere (could be CPU, cache, kernel, anything really, though this is unlikely).
 

Kulunu Geeganage

unread,
Feb 5, 2018, 3:14:24 AM2/5/18
to stressapptest-discuss
Dear Nick,

Many thanks for your support.

1) How can I figure out the relevant component/byte lane from physical address 0x8df616da? I have attached the DDR memory schematic and memory calibration data. Could you please explain this to me?

        0x0035004B -> $mmdc_mpwldectrl0
        0x004D0045 -> $mmdc_mpwldectrl1
        0x421A0218 -> $mmdc_mpdgctrl0
        0x02120215 -> $mmdc_mpdgctrl1
        0x3F363337 -> $mmdc_mprddlctl
        0x41404742 -> $mmdc_mpwrdlctl
        0x0047004B -> $mmdc2_mpwldectrl0
        0x0046004B -> $mmdc2_mpwldectrl1
        0x0236024E -> $mmdc2_mpdgctrl0
        0x02270200 -> $mmdc2_mpdgctrl1
        0x3F3C3F43 -> $mmdc2_mprddlctl
        0x47374741 -> $mmdc2_mpwrdlctl

  
      0x00011700 -> $mmdc_mdmisc

2) What did you mean by this: "DRAM could have a faulty cell"?


3) Can't we say exactly which memory part is going wrong? (We are using four 4Gb memory chips.)

4) What did you mean by ECC?

"case is CPU or cache fault" Are you referring to the processor? (I'm using an NXP i.MX6Q processor.) Does that mean a problem on the processor side?

If you have ECC, you're very unlikely to get single bit errors from DRAM issues. 
0x0000010000020100 
0x0000010000000100
             ^ this bit has flipped.
So a more likely issue in that case is CPU or cache fault.


Waiting for your kind response.

Regards,
Kulunu.

Mem calibration data.txt
DDR memory schematic.pdf

Kulunu Geeganage

unread,
Feb 8, 2018, 12:19:54 AM2/8/18
to stressapptest-discuss
Dear Nick,

Did you see my previous message? Could you please kindly reply to my question? I'm new to these memory stress test applications, and I need to understand each piece of debug data to identify problems with the memory modules and memory layout of my custom hardware design.

Regards,
Kulunu.

Kyewon Ha

unread,
Feb 8, 2018, 1:27:51 AM2/8/18
to stressappt...@googlegroups.com, Kulunu Geeganage
hello,
if you know the data mapping & address mapping, you can find the exact failure location in the DDR.
First, 0x8df616da contains the failing address information; if you decode this address, you can see
the /CS, bank address, row address, and column address.
However, if you just want to know which memory device failed, and the system is a consumer system rather than a server,
the DQ mapping information is enough to find the failed memory.
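As an illustration of that address decode, a sketch is below. The field widths are made-up assumptions for the sake of example, not the real i.MX6 MMDC mapping, so check the chipset reference manual for the actual bit assignments:

```python
# Illustrative only: the bit-field widths below are ASSUMPTIONS,
# not the real i.MX6 MMDC address mapping.
addr = 0x8df616da
column = addr & 0x3FF            # assumed 10 column bits
bank   = (addr >> 10) & 0x7      # assumed 3 bank bits
row    = (addr >> 13) & 0x7FFF   # assumed 15 row bits
print(hex(row), bank, hex(column))  # -> 0x6fb0 5 0x2da
```

With the controller's real field layout substituted in, this turns a stressapptest physical address into a row/bank/column you can look up on the DRAM.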

You designed your system with 4 memory devices with x16 I/O.

0x0000 0100 0002 0100
0x0000 0100 0000 0100
               ^ this bit has flipped.

If you wired all pins directly as in your schematic, and your memory controller doesn't use data scramble mapping, the physical data mapping should be as below:

0x0000 0100 0000 0100
    U5   U4   U3   U2

So, in your case, U3 might be the failing location.

However, you need to check the data mapping table of the i.MX6Q; normally, its data sheet should provide the data mapping information.
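That lookup can be sketched in a few lines, assuming straight DQ wiring (U2 = DQ0-15 up through U5 = DQ48-63, matching the layout above) and no data scrambling in the controller:

```python
# Sketch, assuming straight DQ wiring: U2 = DQ0-15, U3 = DQ16-31,
# U4 = DQ32-47, U5 = DQ48-63, and no data scrambling in the controller.
def device_for_bit(bit, names=("U2", "U3", "U4", "U5")):
    return names[bit // 16]   # each x16 device carries 16 data bits

diff = 0x0000010000020100 ^ 0x0000010000000100
bit = diff.bit_length() - 1   # single-bit flip -> bit 17
print(bit, device_for_bit(bit))  # -> 17 U3
```

If the board's actual wiring differs, replace the name table with the real DQ-to-device assignment from the schematic.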

thanks,
kyewon ha 


--

---
You received this message because you are subscribed to the Google Groups "stressapptest-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stressapptest-discuss+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kulunu Geeganage

unread,
Feb 19, 2018, 1:30:58 AM2/19/18
to stressapptest-discuss
Dear Kyewon Ha,

Many thanks for your reply.

When I ran stressapptest, I got the following error after running for 9290 seconds. Could you please explain this? How can I trace back this message and find out whether this is a problem with a memory module or the layout? Any way to pinpoint the issue?

This looks like a kernel panic, and I need to figure out why it is happening and how I should trace back this message.

Log: Seconds remaining: 9300
Log: Seconds remaining: 9290
[ 1539.582819] Unable to handle kernel NULL pointer dereference at virtual address 00000100
[ 1539.590977] pgd = ce8d8000
[ 1539.593708] [00000100] *pgd=5e7e7831, *pte=00000000, *ppte=00000000
[ 1539.600074] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[ 1539.605496] Modules linked in: mxc_v4l2_capture ipu_bg_overlay_sdc ipu_still ipu_prp_enc tw6869 ipu_csi_enc adv7610_video ipu_fg_overlay_sdc v4l2_int_device videobuf2_dma_contig videobuf2_memops galcore(O)
[ 1539.624013] CPU: 0 PID: 440 Comm: stressapptest Tainted: G           O    4.1.15-2.0.0-ga+yocto+gff4e28b #1
[ 1539.633765] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 1539.640305] task: ce58b700 ti: ce882000 task.ti: ce882000
[ 1539.645752] PC is at pick_next_task_fair+0x424/0x5e4
[ 1539.650747] LR is at 0x0
[ 1539.653296] pc : [<8016b1f8>]    lr : [<00000000>]    psr: 20010093
[ 1539.653296] sp : ce883ec0  ip : ce883e98  fp : ce883f2c
[ 1539.664784] r10: d0f05880  r9 : ce58b9c8  r8 : 80e02e00
[ 1539.670019] r7 : 00000000  r6 : 80e025d8  r5 : ce58b748  r4 : 000000d0
[ 1539.676553] r3 : ce58b748  r2 : 001d1aff  r1 : ce58da47  r0 : ce58da47
[ 1539.683089] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[ 1539.690320] Control: 10c5387d  Table: 5e8d804a  DAC: 00000015
[ 1539.696078] Process stressapptest (pid: 440, stack limit = 0xce882210)
[ 1539.702615] Stack: (0xce883ec0 to 0xce884000)
[ 1539.706988] 3ec0: 80e87e54 00000000 ce883ee4 80e02508 d0f058c0 80d9b880 ce58b700 ce883ee8
[ 1539.715176] 3ee0: 80186760 80493dd0 ce58b700 d0f05880 ce883f2c ce883f00 80189900 dc8ba30f
[ 1539.723362] 3f00: d0f05880 80d9b880 d0f05880 ce58b700 00000000 80e02e00 ce58b9c8 00000000
[ 1539.731549] 3f20: ce883f6c ce883f30 8091b3d8 8016ade0 80159198 80161c54 ce58b748 ce58b748
[ 1539.739736] 3f40: d0f05880 ce882000 5016a000 d0f05880 0000009e 80108124 ce882000 00000000
[ 1539.747924] 3f60: ce883f84 ce883f70 8091b8ec 8091b2fc 80d9b880 5016a000 ce883fa4 ce883f88
[ 1539.756109] 3f80: 8015d8f8 8091b8ac 00015b9f 00000000 0126abe8 0000009e 00000000 ce883fa8
[ 1539.764295] 3fa0: 80107f80 8015d884 00015b9f 00000000 00000001 00000001 01268990 00000001
[ 1539.772481] 3fc0: 00015b9f 00000000 0126abe8 0000009e 000000ba 0125d240 00000000 0125d224
[ 1539.780667] 3fe0: 00057060 548fed7c 0003061b 76d1aec6 00010030 00000001 3f9171a7 acd8262c
[ 1539.788849] Backtrace:
[ 1539.791368] [<8016add4>] (pick_next_task_fair) from [<8091b3d8>] (__schedule+0xe8/0x5b0)
[ 1539.799474]  r10:00000000 r9:ce58b9c8 r8:80e02e00 r7:00000000 r6:ce58b700 r5:d0f05880
[ 1539.807403]  r4:80d9b880
[ 1539.809978] [<8091b2f0>] (__schedule) from [<8091b8ec>] (schedule+0x4c/0xa4)
[ 1539.817036]  r10:00000000 r9:ce882000 r8:80108124 r7:0000009e r6:d0f05880 r5:5016a000
[ 1539.824968]  r4:ce882000
[ 1539.827543] [<8091b8a0>] (schedule) from [<8015d8f8>] (sys_sched_yield+0x80/0x88)
[ 1539.835031]  r5:5016a000 r4:80d9b880
[ 1539.838681] [<8015d878>] (sys_sched_yield) from [<80107f80>] (ret_fast_syscall+0x0/0x3c)
[ 1539.846781]  r7:0000009e r6:0126abe8 r5:00000000 r4:00015b9f
[ 1539.852531] Code: ebffe325 e5904054 e3540000 0a000034 (e5945030)
[ 1539.858647] ---[ end trace 10405bbd804e9ba5 ]---

I have attached complete log here with.

I would be thankful if you could kindly reply as soon as possible.

Regards,
Kulunu.
Stressapptest.log

Kulunu Geeganage

unread,
Feb 19, 2018, 5:37:44 AM2/19/18
to stressapptest-discuss
Hi All,
Hi Kyewon Ha & Nick,

Could you please reply for above question ?

Regards,
Kulunu.

Nick Sanders

unread,
Feb 20, 2018, 2:27:02 PM2/20/18
to stressappt...@googlegroups.com
Not sure, but a fairly common occurrence in a system with bad memory is that occasionally pointer data will be corrupted rather than test data, resulting in various crashes. If you see a kernel crash or program crash in addition to stressapptest memory errors, it's likely that your system is crashing due to DRAM failure.

Failures called out by stressapptest are easier to match to a module or the layout, since stressapptest specifies the address, bit, and data patterns that have failed, which can be mapped to a specific trace on the MLB or DRAM module.
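As a sketch of that matching step, the miscompare lines can be parsed and grouped by address and flipped bits (the line format here is taken from the log earlier in this thread):

```python
import re

# Sketch: pull the physical address and read/expected words out of a
# stressapptest miscompare line (format from the log in this thread),
# so repeated failures can be clustered by address and flipped bit.
line = ("Hardware Error: miscompare on CPU 0(0xF) at 0x751d76d8"
        "(0x8df616da:DIMM Unknown): read:0x0000010000020100, "
        "reread:0x0000010000020100 expected:0x0000010000000100")
m = re.search(r"at 0x[0-9a-f]+\(0x([0-9a-f]+).*read:0x([0-9a-f]+)"
              r".*expected:0x([0-9a-f]+)", line)
phys = m.group(1)
read, expected = int(m.group(2), 16), int(m.group(3), 16)
print(phys, hex(read ^ expected))  # -> 8df616da 0x20000
```

Many errors sharing one XOR bit suggest an interconnect/byte-lane problem; many errors sharing one address suggest a bad DRAM cell.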



Kyewon Ha

unread,
Feb 20, 2018, 9:26:09 PM2/20/18
to stressappt...@googlegroups.com
 
How many failed systems do you have?

1. Please check the trace length between the chipset and the memory.
    Data timing margin affected by the wiring can cause failures.
2. Please check the DRAM timing in the firmware.
    Normally, a wrong layout causes failure right away during DRAM initialization.
    However, your issue happens after the test starts; this means there may not be an issue loading the OS into memory.
    If the DRAM timings in the BIOS or firmware are not correct, they can cause failures.
3. Please replace the suspected DRAM with a new one if you can.
    If the memory has a cell core failure in the DRAM, it can cause a crash.







Kulunu Geeganage

unread,
Feb 20, 2018, 11:53:47 PM2/20/18
to stressapptest-discuss
Dear Kyewon Ha,

Many thanks for your kind feedback.

1) "How many failed  systems do you have ?"

We have 50 units of our custom hardware design; out of those 50 units, 30 units are working properly, and the other 20 units get kernel panics like the above.

From those 20 units:

  • I could identify one unit with a faulty memory module using your stressapptest tool.
  • Another 2 units give errors like the following while running stressapptest.
  • The other units pass 7 days of overnight stressapptest but fail while running applications on the OS; the kernel panics are as follows. Can't we conclude that the memory modules and memory layout (traces on the PCB) are okay if a unit passes a 7-day stressapptest? But why does it still get kernel panics? Is there any method to trace back these kernel panics and pinpoint the issue?

Unable to handle kernel paging request at virtual address 7e263df8
pgd = 80004000
[7e263df8] *pgd=00000000
Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM
Modules linked in: adv7610_video mxc_v4l2_capture ipu_bg_overlay_sdc ipu_still ipu_prp_enc ipu_csi_enc tw6869 ipu_fg_overlay_sdc videobuf2_dma_contig v4l2_int_device videobuf2_memops galcore(O)
CPU: 3 PID: 0 Comm: swapper/3 Tainted: G           O    4.1.15-2.0.0-ga+yocto+gff4e28b #1


Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)

task: ce113c00 ti: ce144000 task.ti: ce144000
PC is at 0x7e263df8
LR is at update_blocked_averages+0xdc/0x71c
pc : [<7e263df8>]    lr : [<80158cc0>]    psr: 800f0193
sp : ce145d48  ip : d2a81d91  fp : ce145d6c
r10: d0f2c8c0  r9 : 00000000  r8 : 0000ded8
r7 : 00000000  r6 : 00000009  r5 : 00000001  r4 : d0f2c8c0
r3 : 000002cb  r2 : e7547ac6  r1 : 000002cb  r0 : 00000366
Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
Control: 10c5387d  Table: 5e73004a  DAC: 00000015
Process swapper/3 (pid: 0, stack limit = 0xce144210)
Stack: (0xce145d48 to 0xce146000)
5d40:                   00000000 ffffa228 00000000 d0f2c880 00000000 00000003
5d60: ce145de4 ce145d70 80158cc0 80157bd8 8015f800 600f0113 ce145d78 d0f2cd48
5d80: 00000003 80180854 00989680 00000000 d0f29650 0000000d d0f2c8c0 d0f2c880
5da0: 80180854 80180398 00989680 00000000 d0f2c944 dc8ba30f ce145e0c 80b80880
5dc0: ffffa228 00000000 d0f2c880 00000000 00000003 80c02100 ce145e3c ce145de8
5de0: 8015ed84 80158bf0 ce145e14 d0f2f4c0 80c0250c e9f93c00 00000000 8018d678
5e00: ce145e44 8017f454 00000001 dc8ba30f 80c0250c 80b80880 80c0209c d0f2c880
5e20: 00000003 80c02e44 0000000a ce145e78 ce145e74 ce145e40 8015f068 8015ed40
5e40: ce145e3c 00000000 00000000 ce144000 80c0209c 00000007 000000a0 00000101
5e60: 0000000a ce145e78 ce145ec4 ce145e78 801302a0 8015eff0 ffffffff 7fffffff
5e80: e960f78e 0000000d 80c02100 00000006 ffffa229 00200040 ffffe000 ffffe000
5ea0: ffffe000 00000000 00000001 00000003 00000004 00000000 ce145edc ce145ec8
5ec0: 801306c0 80130100 80b7ec34 ffffe000 ce145f04 ce145ee0 8010eac8 80130638
5ee0: f4a00100 80c02fc8 ce145f28 ce145f5c 00000001 00000004 ce145f24 ce145f08
5f00: 80101484 8010ea38 805de3d8 600f0013 ffffffff ce145f5c ce145fb4 ce145f28
5f20: 8010c400 8010142c 00000000 80cd3f58 dc8ba30f dc8ba30f d0f2bed0 00000001
5f40: e960ea88 0000000d 00000001 00000004 00000000 ce145fb4 ce145f20 ce145f70
5f60: 8018e7f8 805de3d8 600f0013 ffffffff e8c8f818 0000000d e960ea88 0000000d
5f80: 00000000 dc8ba30f 80b7c320 d0f2bed0 80b80880 80b7c300 80c08510 00000001
5fa0: 00000000 00000000 ce145fc4 ce145fb8 805de640 805de2f4 ce145fdc ce145fc8
5fc0: 80165440 805de628 ce144000 80c8e30c ce145ff4 ce145fe0 8010e7ac 8016523c
5fe0: 5e12406a 00000015 00000000 ce145ff8 1010152c 8010e694 e3be3c35 93bcffbf
Backtrace:
[<80157bcc>] (update_cfs_rq_blocked_load) from [<80158cc0>] (update_blocked_averages+0xdc/0x71c)
 r9:00000003 r8:00000000 r7:d0f2c880 r6:00000000 r5:ffffa228 r4:00000000
[<80158be4>] (update_blocked_averages) from [<8015ed84>] (rebalance_domains+0x50/0x2b0)
 r10:80c02100 r9:00000003 r8:00000000 r7:d0f2c880 r6:00000000 r5:ffffa228
 r4:80b80880
[<8015ed34>] (rebalance_domains) from [<8015f068>] (run_rebalance_domains+0x84/0x164)
 r10:ce145e78 r9:0000000a r8:80c02e44 r7:00000003 r6:d0f2c880 r5:80c0209c
 r4:80b80880
[<8015efe4>] (run_rebalance_domains) from [<801302a0>] (__do_softirq+0x1ac/0x2f8)
 r10:ce145e78 r9:0000000a r8:00000101 r7:000000a0 r6:00000007 r5:80c0209c
 r4:ce144000
[<801300f4>] (__do_softirq) from [<801306c0>] (irq_exit+0x94/0x100)
 r10:00000000 r9:00000004 r8:00000003 r7:00000001 r6:00000000 r5:ffffe000
 r4:ffffe000
[<8013062c>] (irq_exit) from [<8010eac8>] (handle_IPI+0x9c/0x28c)
 r5:ffffe000 r4:80b7ec34
[<8010ea2c>] (handle_IPI) from [<80101484>] (gic_handle_irq+0x64/0x6c)
 r9:00000004 r8:00000001 r7:ce145f5c r6:ce145f28 r5:80c02fc8 r4:f4a00100
[<80101420>] (gic_handle_irq) from [<8010c400>] (__irq_svc+0x40/0x74)
Exception stack(0xce145f28 to 0xce145f70)
5f20:                   00000000 80cd3f58 dc8ba30f dc8ba30f d0f2bed0 00000001
5f40: e960ea88 0000000d 00000001 00000004 00000000 ce145fb4 ce145f20 ce145f70
5f60: 8018e7f8 805de3d8 600f0013 ffffffff
 r7:ce145f5c r6:ffffffff r5:600f0013 r4:805de3d8
[<805de2e8>] (cpuidle_enter_state) from [<805de640>] (cpuidle_enter+0x24/0x28)
 r10:00000000 r9:00000000 r8:00000001 r7:80c08510 r6:80b7c300 r5:80b80880
 r4:d0f2bed0
[<805de61c>] (cpuidle_enter) from [<80165440>] (cpu_startup_entry+0x210/0x3e4)
[<80165230>] (cpu_startup_entry) from [<8010e7ac>] (secondary_start_kernel+0x124/0x144)
 r7:80c8e30c r4:ce144000
[<8010e688>] (secondary_start_kernel) from [<1010152c>] (0x1010152c)
 r5:00000015 r4:5e12406a
Code: bad PC value
---[ end trace b986ec0000af24b4 ]---
Kernel panic - not syncing: Fatal exception in interrupt
CPU0: stopping
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D    O    4.1.15-2.0.0-ga+yocto+gff4e28b #1


Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)

Backtrace:
[<8010b5a0>] (dump_backtrace) from [<8010b818>] (show_stack+0x20/0x24)
 r7:00000005 r6:00000000 r5:00000000 r4:80c4bb08
[<8010b7f8>] (show_stack) from [<80797814>] (dump_stack+0x84/0xcc)
[<80797790>] (dump_stack) from [<8010eb10>] (handle_IPI+0xe4/0x28c)
 r5:ffffe000 r4:80b7ec34
[<8010ea2c>] (handle_IPI) from [<80101484>] (gic_handle_irq+0x64/0x6c)
 r9:00000004 r8:00000001 r7:80c01f14 r6:80c01ee0 r5:80c02fc8 r4:f4a00100
[<80101420>] (gic_handle_irq) from [<8010c400>] (__irq_svc+0x40/0x74)
Exception stack(0x80c01ee0 to 0x80c01f28)
1ee0: 00000000 d0f084c0 dc8ba30f dc8ba30f d0f04ed0 00000001 02cf27a1 0000000e
1f00: 00000001 00000004 80b57168 80c01f6c 80c01ed8 80c01f28 8018e7f8 805de3d8
1f20: 600b0013 ffffffff
 r7:80c01f14 r6:ffffffff r5:600b0013 r4:805de3d8
[<805de2e8>] (cpuidle_enter_state) from [<805de640>] (cpuidle_enter+0x24/0x28)
 r10:80b57168 r9:00000000 r8:00000001 r7:80c08510 r6:80b7c300 r5:80b80880
 r4:d0f04ed0
[<805de61c>] (cpuidle_enter) from [<80165440>] (cpu_startup_entry+0x210/0x3e4)
[<80165230>] (cpu_startup_entry) from [<80795580>] (rest_init+0x84/0x9c)
 r7:80c02500 r4:00000002
[<807954fc>] (rest_init) from [<80b00d10>] (start_kernel+0x3a0/0x42c)
 r5:80c8e04c r4:00000000
[<80b00970>] (start_kernel) from [<1000807c>] (0x1000807c)
SMP: failed to stop secondary CPUs
---[ end Kernel panic - not syncing: Fatal exception in interrupt


2) "it may not  have an  issue to load OS to memory.

   if dram timing on BIOS or firmware are not correct, it can make failure."

Yes, out of the above 20 boards, 18 boards load the Linux OS without any conflict, but they fail while running certain applications (big loads). They can run some applications without any conflict, while the other 30 good boards can run any load and any application on the Linux OS. This is the problem I'm having.

What did you mean by this: "If dram timing on BIOS or firmware are not correct, it can make failure"?

3) "please replace  suspected memory dram to new if you can do it.
 if memory has cell core failure in the dram and it can  make crash"

Yes, I did that on one board, and now it is working. Many thanks to stressapptest; I could identify the faulty memory part.

I'm worried about the other boards, which pass the stress test but are still getting kernel panics. Given your experience with many custom hardware designs, I would like to hear your thoughts, and I would like to share my experience of these issues so we can solve the problem together.
Waiting for your kind feedback.


Many thanks,
Regards,
Kulunu.



Nick Sanders

unread,
Feb 21, 2018, 1:27:46 PM2/21/18
to stressappt...@googlegroups.com
stressapptest will find many memory errors, but depending on the exact nature of the error, stressapptest (or any test) may not trigger the failure quickly, or at all. So it's still possible that your kernel failures are triggered by memory or PCB errors. You might also log thermals, since overheating is another failure you may notice only under load.





Kyewon Ha

unread,
Feb 22, 2018, 1:10:05 AM2/22/18
to stressappt...@googlegroups.com
Hi,
I think it is a timing margin issue, but it is still not clear whether it is a DRAM cell timing margin issue or the I/O interface between the chipset and the DRAM.

1. First cause: DRAM cell failure.
    The DRAM may have a defect in a storage cell, which can fail under a stressful pattern.
    Normally, replacing a DRAM that has a defective storage cell fixes the problem.
    However, your failure rate is too high for this, because normally the memory vendor will screen out such parts.

    --> solutions:
         1) Replace the DRAM.
         2) Lower the operating frequency (ex 2400Mbps --> 2133Mbps).
         3) Try to loosen the DRAM cell core timings in the firmware, such as tRP, tRCD, CL, ...
             Try to dump the DRAM timings from the registers if you can, and compare them with the datasheet.
         4) Increase DDR Vdd.

2. Second cause: the I/O interface between the chipset and the DRAM.
    If the I/O interface margin is poor because of a routing or layout issue, it may cause failures depending on the noise pattern (different applications make different noise levels).
    You can check the margin with a timing shmoo test, but you need a test tool on your side.
    Normally this kind of failure has a high failure rate, and it will not be fixed by replacing the DRAM with a new one, because it is close to a design issue.
    --> solutions:
          1) The chipset should have a design option to control this kind of timing, such as a phase shift for tDQS and tDQ.
              ---> This requires modifying the firmware or BIOS on the system.
          2) Lower the operating frequency (ex 2400Mbps --> 2133Mbps).
          3) Increase DDR Vdd.
          4) If it is related to clock timing, you can add capacitance or change the termination resistance on CLK and CLK/.
          5) Check the wiring: keep the DQ lines away from noise sources like power lines, and shield the DQ lines with ground.

I would suggest changing the clock frequency or DDR Vdd first of all, if you can do it.

thanks,
kyewon ha



Kulunu Geeganage

unread,
Feb 22, 2018, 5:03:04 AM2/22/18
to stressapptest-discuss
Dear Kyewon Ha,

Many thanks for your reply.
What did you mean by the following?

1) Down operating frequency ( ex 2400Mbps --> 2133Mbps )


Presently I'm using the u-boot configuration file to configure the DDR frequency. It is 396MHz, and I configure it as follows. I have attached the DDR timing parameter configuration file.

DATA 4 0x020C4018 0x00260324

It shows kernel clock dump as follows.

root@remotecockpit:~# cat /sys/kernel/debug/clk/clk_summary | grep mmdc
                         mmdc_ch1_axi           0            0   396000000          0 0
                         mmdc_ch0_axi           3            3   396000000          0 0

So what did you mean by lowering the operating frequency? Did you mean the above parameters? How can I do it?

2) "Try to loose dram cell core timing on firmware like tRP, tRCD, CL, ..."

What did you mean by the above?

Regards,
Kulunu.
DDR setup.txt