How can I get the exact failing DQ and address for physical memory?


Kyewon Ha

Aug 3, 2015, 7:20:35 PM
to stressapptest-discuss
Hi,
Our company uses stressapptest for memory module testing on our server systems.
Whenever we get a correctable memory error during a test, we get an error log, but
it is very hard to tell which location failed.
Knowing the exact failure information is very important, because it gives us a clue about the root cause,
so we need the failing DQ and the system address for memory testing.
If stressapptest could read the syndrome table register in the CPU, we could get the failing DQ directly; but even if it cannot,
just getting the failing data (one of the 128 DQ bits) together with the system address would be very helpful for finding a clue.
Also, can I get the source code for this test program, so that I can improve the failure reporting?

mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 16 BANK 8
MISC 15226ba86 ADDR 7ccdec4080
TIME 1436595885 Fri Jul 10 23:24:45 2015
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL0_ERR
Transaction: Memory read error
STATUS 8c00004000010090 MCGSTATUS 0
MCGCAP 7000c16 APICID 40 SOCKETID 1


thanks,

Kyewon ha

Nick Sanders

Aug 3, 2015, 8:05:40 PM
to stressappt...@googlegroups.com
You can derive the DIMM / chip / DQ-line failure from the syndrome and address; however, it depends on the BIOS's memory configuration, ECC method, chipset, and board layout, so the code will be specific to your server model.

64-bit syndrome mapping is pretty well documented; for 128-bit and 256-bit you may need to consult your chipset vendor for the mapping algorithm. Unfortunately, I don't believe Intel releases this info publicly, so you'll need to request the appropriate docs.
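As a toy illustration of how a syndrome value names a failing bit (the real chipset ECC is much wider and, as noted above, the mapping is vendor-specific), here is a Hamming(7,4) single-error-correcting code in Python, where the syndrome is literally the 1-based position of the flipped bit:

```python
# Toy Hamming(7,4) SEC code: the syndrome directly identifies the failing
# bit position -- the same principle lets a chipset's syndrome register be
# mapped to a failing DQ line (with a much larger, vendor-specific table).

def encode(data4):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4                    # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                    # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                    # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]  # codeword positions 1..7

def syndrome(cw):
    """Return the 1-based position of a single flipped bit (0 = no error)."""
    s1 = cw[0] ^ cw[2] ^ cw[4] ^ cw[6]
    s2 = cw[1] ^ cw[2] ^ cw[5] ^ cw[6]
    s3 = cw[3] ^ cw[4] ^ cw[5] ^ cw[6]
    return s1 + 2 * s2 + 4 * s3

cw = encode([1, 0, 1, 1])
assert syndrome(cw) == 0   # clean codeword: zero syndrome
cw[4] ^= 1                 # flip the bit at position 5
print(syndrome(cw))        # -> 5: the syndrome names the flipped position
```

A documented 64-bit syndrome table is conceptually the same lookup, just larger, with syndrome values mapped to DQ lines instead of codeword positions.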

You can empirically verify the syndrome code by reworking a switchable short onto a DIMM, and you can verify the chip mapping by heating a specific DRAM chip with a heat gun until thermal/refresh failures occur.

The public stressapptest code can be found here, but unfortunately the MCE and syndrome decode don't have public versions.






Kyewon Ha

Aug 4, 2015, 1:19:07 AM
to stressappt...@googlegroups.com
Thank you for the feedback.
Yes, I'm thinking of implementing DQ and address decoding for the DRAM.
Intel EDS Volume 1 provides this information.

Thanks,
Kyewon Ha

Kyewon Ha

Aug 5, 2015, 12:24:28 PM
to stressappt...@googlegroups.com
Hi Nick,
 
I have one more question.
As a matter of fact, I got some weird results while testing a system with stressapptest,
so some people suspect that stressapptest has a bug on the Intel platform.
One of the results is below. Normally, if a system or module has a problem, it is hard to get this kind of result:
why do failures happen within 10 minutes, and why, once there is no failure in the first hour or two, do we never see a failure at all?
Our system has 1.5 TB of memory. Does this mean we cannot test all 1.5 TB within 24 hours, and that the failing location only gets tested in the first 10 minutes?

Fail    < 10 min    First error
Fail    < 10 min    First error
Fail    < 10 min    First error
Pass    24 hr
Fail    < 10 min    First error
Pass    24 hr
Fail    < 10 min    First error

Nick Sanders

Aug 5, 2015, 2:59:02 PM
to stressappt...@googlegroups.com
stressapptest prints the memory bandwidth it's seeing, so you can work out how long it takes to test all available memory. For example, at 100 GB/s of memory bandwidth it would take ~15 s to touch 1.5 TB of memory once.
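A back-of-the-envelope version of that arithmetic (using the 100 GB/s example figure and this thread's 1.5 TB of installed memory):

```python
# Rough time for one full pass over memory at the observed bandwidth.
# Both numbers are just the example figures from the discussion.
mem_gb = 1.5 * 1000          # 1.5 TB of installed memory, in GB
bandwidth_gb_s = 100.0       # bandwidth reported by stressapptest
seconds_per_pass = mem_gb / bandwidth_gb_s
print(seconds_per_pass)      # -> 15.0 seconds to touch all of memory once
```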

Depending on the type of error, some guesses might be: 
A module might have a bad memory cell at one specific location, so if you allocate 1 TB out of 1.5 TB for testing, you will miss that location about a third of the time, since Linux will allocate different memory on each run. In that case the MCE address would always be the same, and any particular run would either fail quickly or pass forever. You can allocate 1.49 TB (or whatever the maximum possible is) and mostly avoid this problem.
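The coverage argument above can be sketched numerically (assuming a single fixed bad cell and random memory placement on each run):

```python
# Probability that a run testing 1.0 TB out of 1.5 TB never touches one
# fixed bad cell, and how that probability shrinks over independent runs.
tested_tb, total_tb = 1.0, 1.5
p_miss = 1 - tested_tb / total_tb          # ~33% chance a single run misses it
for runs in (1, 3, 5):
    # chance that every one of `runs` independent runs misses the bad cell
    print(runs, f"{p_miss ** runs:.1%}")
```

This is why allocating as close to the full 1.5 TB as possible makes each run far more likely to hit the bad location.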

Another possibility I've seen is a problem in memory training or clock initialization, in which case the system keeps failing until a reboot forces memory to be re-initialized. You can check for this by seeing whether pass/fail is consistent across reboots.

Are you seeing the "bank 8 / DRAM CECC" kind of MCE pasted above every time there is a failure?

Kyewon Ha

Aug 5, 2015, 5:21:45 PM
to stressappt...@googlegroups.com
Oh, thank you so much. I overlooked that Linux allocates different memory on each run. I don't think it is a training or clock-initialization issue, though, because the system always boots successfully
and only leaves MCE logs during the stressapptest run. How can I allocate 1.49 TB? Is there an option for that?
Actually, this is my first time using stressapptest.
I also think that if I can implement a memory decoder to get the exact failing DQ and address, I can see what the problem is.
By the way, our issue is bank 7/8 DRAM CECC. Are there any known issues with that?

Many thanks,
Kyewon Ha

Nick Sanders

Aug 5, 2015, 6:10:18 PM
to stressappt...@googlegroups.com
If you see the same "ADDR 7ccdec4080" on every MCE, that is a very strong indication that one particular cell is bad, without needing to check the DQ.

stressapptest -M 1520435
will allocate 1520435 MB (~1.45 TiB).

If that's too much, the system will OOM or swap; in that case, try a smaller amount. It depends on how much free memory is left after the system boots. You can check MemFree + Cached in /proc/meminfo to see roughly how much allocatable memory there is.
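The MemFree + Cached check can be scripted; a minimal sketch (the sample text stands in for a real /proc/meminfo, and its field values are made up):

```python
# Estimate allocatable memory as MemFree + Cached, then feed the result to
# stressapptest -M. The sample text substitutes for reading /proc/meminfo.
sample = """MemTotal:       1585000000 kB
MemFree:        1540000000 kB
Cached:            8000000 kB
SwapTotal:               0 kB
"""

def meminfo_kb(text):
    """Parse /proc/meminfo-style text into a {field: value-in-kB} dict."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields[key.strip()] = int(rest.split()[0])   # values are always in kB
    return fields

info = meminfo_kb(sample)
allocatable_mb = (info["MemFree"] + info["Cached"]) // 1024
print(f"stressapptest -M {allocatable_mb}")
```

On a live system you would read the real file with `open("/proc/meminfo").read()` and likely subtract some headroom so the run does not OOM.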

Kyewon Ha

Aug 5, 2015, 7:57:24 PM
to stressappt...@googlegroups.com
I really appreciate your advice; it is very helpful.

Many thanks,
Kyewon Ha