Unexpected behavior of custom hardware design based on i.MX6 & MT41K256M16TW-107 IT:P


Kulunu Geeganage

Aug 14, 2018, 3:47:20 AM
to stressapptest-discuss
Dear All,

I'm new to custom hardware design, and I'm about to scale up production of my custom hardware, which works well on a few boards. I need some help deciding whether the prototypes are in a good enough state to scale up.


This hardware is based on the i.MX6Q processor and MT41K256M16TW-107 IT:P memory. It is very similar to the nitrogen6_max development board.


I'm having trouble with my hardware that is really difficult to figure out, because some boards work really well and some do not (of 7 production units, 4 boards function really well, and one board gets segmentation faults and kernel panics while running a Linux application). When I run memory calibration on the bad boards, the results look the same as those of the good boards.



The segmentation fault points to a memory issue. I captured a core dump and got a backtrace with GDB on Linux:


Program terminated with signal SIGSEGV, Segmentation fault.
#0  gcoHARDWARE_QuerySamplerBase (Hardware=0x22193dc, Hardware@entry=0x0, VertexCount=0x7ef95370, VertexCount@entry=0x7ef95368, VertexBase=0x40000,
    FragmentCount=FragmentCount@entry=0x2217814, FragmentBase=0x0) at gc_hal_user_hardware_query.c:6020
6020    gc_hal_user_hardware_query.c: No such file or directory.
[Current thread is 1 (Thread 0x76feb010 (LWP 697))]
(gdb) bt
#0  gcoHARDWARE_QuerySamplerBase (Hardware=0x22193dc, Hardware@entry=0x0, VertexCount=0x7ef95370, VertexCount@entry=0x7ef95368, VertexBase=0x40000,
    FragmentCount=FragmentCount@entry=0x2217814, FragmentBase=0x0) at gc_hal_user_hardware_query.c:6020
#1  0x765d20e8 in gcoHAL_QuerySamplerBase (Hal=<optimized out>, VertexCount=VertexCount@entry=0x7ef95368, VertexBase=<optimized out>, FragmentCount=FragmentCount@entry=0x2217814,
    FragmentBase=0x0) at gc_hal_user_query.c:692
#2  0x681e31ec in gcChipRecompileEvaluateKeyStates (chipCtx=0x0, gc=0x7ef95380) at src/chip/gc_chip_state.c:2115
#3  gcChipValidateRecompileState (gc=0x7ef95380, gc@entry=0x21bd96c, chipCtx=0x0, chipCtx@entry=0x2217814) at src/chip/gc_chip_state.c:2634
#4  0x681c6da8 in __glChipDrawValidateState (gc=0x21bd96c) at src/chip/gc_chip_draw.c:5217
#5  0x68195688 in __glDrawValidateState (gc=0x21bd96c) at src/glcore/gc_es_draw.c:585
#6  __glDrawPrimitive (gc=0x21bd96c, mode=<optimized out>) at src/glcore/gc_es_draw.c:943
#7  0x68171048 in glDrawArrays (mode=4, first=6, count=6) at src/glcore/gc_es_api.c:399
#8  0x76c9ac72 in CEGUI::OpenGL3GeometryBuffer::draw() const () from /usr/lib/libCEGUIOpenGLRenderer-0.so.2
#9  0x76dd1aee in CEGUI::RenderQueue::draw() const () from /usr/lib/libCEGUIBase-0.so.2
#10 0x76e317d8 in CEGUI::RenderingSurface::draw(CEGUI::RenderQueue const&, CEGUI::RenderQueueEventArgs&) () from /usr/lib/libCEGUIBase-0.so.2
#11 0x76e31838 in CEGUI::RenderingSurface::drawContent() () from /usr/lib/libCEGUIBase-0.so.2
#12 0x76e36d30 in CEGUI::GUIContext::drawContent() () from /usr/lib/libCEGUIBase-0.so.2
#13 0x76e31710 in CEGUI::RenderingSurface::draw() () from /usr/lib/libCEGUIBase-0.so.2
#14 0x001bf79c in tengri::gui::cegui::System::Impl::draw (this=0x2374f08) at codebase/src/gui/cegui/system.cpp:107
#15 tengri::gui::cegui::System::draw (this=this@entry=0x2374e74) at codebase/src/gui/cegui/system.cpp:212
#16 0x000b151e in falcon::osd::view::MainWindowBase::Impl::preNativeUpdate (this=0x2374e10) at codebase/src/osd/view/MainWindow.cpp:51
#17 falcon::osd::view::MainWindowBase::preNativeUpdate (this=this@entry=0x209fe30) at codebase/src/osd/view/MainWindow.cpp:91
#18 0x000c4686 in falcon::osd::view::FBMainWindow::update (this=0x209fe00) at codebase/include/falcon/osd/view/FBMainWindow.h:56
#19 falcon::osd::view::App::Impl::execute (this=0x209fdb0) at codebase/src/osd/view/app_view_osd_falcon.cpp:139
#20 falcon::osd::view::App::execute (this=<optimized out>) at codebase/src/osd/view/app_view_osd_falcon.cpp:176
#21 0x000475f6 in falcon::osd::App::execute (this=this@entry=0x7ef95c84) at codebase/src/osd/app_osd_falcon.cpp:75
#22 0x00047598 in main () at codebase/src/main.cpp:5
(gdb) Quit

 



I have attached the NXP tool calibration results for 2 good boards and 1 bad board (the one getting segmentation faults). See the following links:


Board 1

Board 2

Board 3


I ran an overnight stress test with stressapptest. It reported no faults and the test passed.

 



1. Of the 3 boards above, Board 1 and Board 2 work really well, while Board 3 gets kernel panics when running the same application. Can you help me find any clue in the results from these 3 boards?

2. I did a production run of 50 units 6 months ago and only 30 worked properly, but that was with Alliance memory AS4C256M16D3A-12BCN. So could this be a design issue? If it is an issue with the DDR layout or the whole design, why do some boards work really well?

3. Could this be a manufacturing issue? If so, how could it happen within the same production run, where some boards work and some do not?

4. Does stressapptest stress power as well?

 

I don't have much experience with mass production, but I would like to move forward after learning from and correcting these issues.

I would be very grateful for a quick reply.

 

Many thanks and regards,

Kulunu.

Nick Sanders

Aug 14, 2018, 9:07:36 PM
to stressappt...@googlegroups.com
On Tue, Aug 14, 2018 at 12:47 AM Kulunu Geeganage <iku...@gmail.com> wrote:

> Segmentation fault is directing to some memory issues, I back traced and core dumped using linux GDB.

Yes, it could be a memory issue. Unfortunately, memory issues are harder to debug from kernel crashes, and especially from graphics (gfx) crashes.
 

> I did stress test using stressapptest and it was a over night test. But I didn't get any fault and test was passed.

stressapptest is a pretty good memory test but it can't catch every issue. So it's still possible to have memory issues even if stressapptest passes. 

It's worth checking whether your system slows down the memory clock while running stressapptest due to thermal or power issues. This is a common reason for stressapptest to pass a long test while user apps still fail. You can tune --pause_delay and --pause_duration to run occasional memory tests without causing the system to throttle, so you can leave it running overnight.
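As a rough illustration of the tuning described above, the following sketch builds a stressapptest command line with short bursts of load separated by long pauses. The helper name `build_sat_command` and the numeric values are hypothetical; only the flags `-s`, `-M`, `--pause_delay` and `--pause_duration` come from stressapptest itself, and suitable values have to be found by experiment on the actual board.

```python
import shlex


def build_sat_command(seconds, pause_delay, pause_duration,
                      mbytes=None, binary="stressapptest"):
    """Return an argv list for one stressapptest run (hypothetical helper).

    seconds        -> -s               total run time in seconds
    pause_delay    -> --pause_delay    seconds between pauses
    pause_duration -> --pause_duration seconds each pause lasts
    mbytes         -> -M               megabytes of RAM to test (optional)
    """
    if seconds <= 0 or pause_delay <= 0 or pause_duration < 0:
        raise ValueError("times must be positive")
    argv = [binary, "-s", str(seconds)]
    if mbytes is not None:
        argv += ["-M", str(mbytes)]
    argv += ["--pause_delay", str(pause_delay),
             "--pause_duration", str(pause_duration)]
    return argv


# Illustrative values only: an 8 hour run that pauses for 4 minutes
# after every minute of load, so the board has time to cool down.
command = build_sat_command(seconds=8 * 3600, pause_delay=60,
                            pause_duration=240)
print(" ".join(shlex.quote(arg) for arg in command))
```

The printed line can be pasted into a shell on the target, or the list can be passed to `subprocess.run` unchanged.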

> 4. Will your stressapptest stress power as well ?

Yes, it is somewhat stressful to power as well. It doesn't run any GPU load though. 
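One possible way to look for the throttling mentioned above is to log temperature and CPU frequency while the test runs. This is only a sketch under assumptions: it reads the generic Linux sysfs nodes `thermal_zone0/temp` and `cpu0/cpufreq/scaling_cur_freq`, which may not exist or may be numbered differently on a given BSP, and it does not read the DDR clock itself, whose location is platform-specific. A CPU frequency drop under constant load is therefore only an indirect hint.

```python
import time
from pathlib import Path

# Generic Linux sysfs nodes (assumption: present on the target kernel).
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")  # millidegrees C
FREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")  # kHz


def read_int(path):
    """Read an integer from a sysfs file, or return None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def monitor(duration_s, interval_s=5.0,
            temp_path=TEMP_PATH, freq_path=FREQ_PATH):
    """Sample temperature and CPU frequency for duration_s seconds."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append({
            "t": time.time(),
            "temp_mc": read_int(temp_path),
            "freq_khz": read_int(freq_path),
        })
        time.sleep(interval_s)
    return samples


def find_throttling(samples):
    """Return the samples whose frequency is below the highest one seen."""
    freqs = [s["freq_khz"] for s in samples if s["freq_khz"] is not None]
    if not freqs:
        return []
    peak = max(freqs)
    return [s for s in samples
            if s["freq_khz"] is not None and s["freq_khz"] < peak]
```

For example, `samples = monitor(600)` could be started in a second terminal while stressapptest is running, and `find_throttling(samples)` then lists the moments where the frequency dropped, together with the temperature at that time.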

Kulunu Geeganage

Aug 15, 2018, 12:48:32 AM
to stressapptest-discuss
Dear Nick,

Many thanks for your feedback. 

1) Suppose we take 10 production units, and 4 boards work with the user application while 6 boards get segmentation faults and kernel panics. Can't we conclude that the hardware design and layout have no issue (because 4 boards work)? Couldn't this be a manufacturing issue? I ask because, as the hardware designer, I can't pinpoint whether this is a memory byte lane layout (routing) issue. If it is, I can make those changes; if it is a manufacturing issue, I can ask the manufacturer to make changes.

2) "It's worth checking if your system will slow down the memory clock while running stressapptest due to thermal or power issues,"
How can I do this? Could you please describe the procedure?

3) "You can tune --pause_delay and --pause_duration to run occasional memory tests without causing the system to throttle, so you can leave it running overnight."
How can I tune pause_delay and pause_duration to run occasional memory tests? Could you please explain how to do this?

I have attached the stressapptest result logs for 4 boards.

I would be very grateful for a quick reply.

Regards & Thanks,
Kulunu.

Board 1.txt
Board 2.txt
Board 5.txt
Board 9.txt

Yuehui Wu

May 13, 2021, 9:09:31 AM
to stressapptest-discuss



if utils.get_board() == 'link':
    args += memory_channel_args_snb_bdw([
        ['U1', 'U2', 'U3', 'U4'],
        ['U6', 'U5', 'U7', 'U8']])  # yes, U6 is actually before U5

if utils.get_board() == 'samus':
    args += memory_channel_args_snb_bdw([
        ['U11', 'U12'],
        ['U13', 'U14']])
------------------------------------------------------------------------------------------------------------
Hi guys,
Can anybody help me understand the order of the components?
Are U11 and U12 on channel 0 or channel 1?
Is U11 DQ[0-31] and U12 DQ[32-63], or is U11 DQ[32-63] and U12 DQ[0-31]?
 
Thanks in advance!

Nick Sanders

May 13, 2021, 1:45:48 PM
to stressappt...@googlegroups.com
'link' and 'samus' are the Chromebook Pixel 1 and 2, with Ivy Bridge and Broadwell CPUs respectively.
'U11', 'U6' etc. are arbitrary strings, and are the names of the memory chips silkscreened on the motherboard.

Since this config assumes Intel laptop memory controllers, the first list is the chips spanning what's essentially
the virtual SODIMM DQ[0-63] on channel 0, and the second list is the chips spanning DQ[0-63] on channel 1.

Within each list, the chips are listed in the order they map to bits in a uint64, which IIRC is often, but not necessarily,
from low to high DQ, e.g. U11:DQ[0-31], U12:DQ[32-63].


Since this mapping is up to the particular memory controller and config on your system, you'd need to validate it
experimentally or by reading your memory controller spec if you have access to it.
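To make the bit-to-chip idea concrete, here is a minimal sketch that takes one 64-bit miscompare and reports which chips in a channel list cover the flipped bits. It is not the code stressapptest or the autotest config uses; the function name `chips_for_miscompare` is hypothetical, and it assumes the chips split the 64 bits into equal slices ordered from low to high bits, which has to be validated for the actual memory controller.

```python
def chips_for_miscompare(expected, actual, chips, word_bits=64):
    """Return the chip names whose bit slice contains a flipped bit.

    chips is one channel's list, e.g. ['U11', 'U12'], assumed (not
    guaranteed) to be ordered from the low bits to the high bits and to
    split the word into equal-width slices.
    """
    if not chips or word_bits % len(chips) != 0:
        raise ValueError("word_bits must divide evenly among the chips")
    width = word_bits // len(chips)
    # Bits that differ between what was written and what was read back.
    diff = (expected ^ actual) & ((1 << word_bits) - 1)
    hits = []
    for index, name in enumerate(chips):
        lane_mask = ((1 << width) - 1) << (index * width)
        if diff & lane_mask:
            hits.append(name)
    return hits


# Example with the 'samus' channel 0 list from the config above:
# a flip in bit 40 falls in the upper 32-bit slice.
print(chips_for_miscompare(0x0, 1 << 40, ['U11', 'U12']))
```

With the four-chip 'link' list each slice is 16 bits wide under the same assumption, so a flip in bit 16 would be attributed to the second entry.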


  -Nick

--

---
You received this message because you are subscribed to the Google Groups "stressapptest-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stressapptest-di...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stressapptest-discuss/189a80dd-97cb-4b87-b72a-fb0abb0a4433n%40googlegroups.com.

Yuehui Wu

May 13, 2021, 7:17:19 PM
to stressappt...@googlegroups.com
Hi Nick,

Thank you so much! I'll validate it.
I could not find the schematics for 'link' and 'samus'.
I have Chromebook motherboards from another brand, with defective DRAM chips and schematics, that I can use to validate it.


Best Regards,
Yuehui Wu




Nick Sanders

May 13, 2021, 7:38:22 PM
to stressappt...@googlegroups.com
Unfortunately, Chromebook schematics aren't generally publicly available. If you have many units of the same Chromebook, you can generate the mapping by injecting errors with a heat gun, as in the documentation link I sent above. If you're working on an official Chrome OS project, you can contact your technical account manager or SIE and get support on this via official channels.
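A rough sketch of how results from such heat gun experiments could be tallied: for each chip that was heated, collect the (expected, actual) 64-bit values of the miscompares seen during that run and report the range of bits that flipped. The function names and the sample data are hypothetical, and the sketch ignores which channel an error address belongs to, which would have to be tracked separately.

```python
def observed_bit_range(miscompares):
    """Return (lowest, highest) flipped bit over all pairs, or None."""
    mask = 0
    for expected, actual in miscompares:
        mask |= expected ^ actual  # accumulate every bit that ever flipped
    if mask == 0:
        return None
    low = (mask & -mask).bit_length() - 1  # index of lowest set bit
    high = mask.bit_length() - 1           # index of highest set bit
    return low, high


def infer_mapping(observations):
    """Map each heated chip name to the bit range that failed with it."""
    return {chip: observed_bit_range(pairs)
            for chip, pairs in observations.items()}


# Made-up data for illustration: errors seen while heating each chip.
observations = {
    "U11": [(0x0, 0x4), (0xFFFF, 0xFFFF ^ (1 << 20))],
    "U12": [(0x0, 1 << 35), (0x0, 1 << 61)],
}
print(infer_mapping(observations))
```

If the ranges for different chips do not overlap across many runs, that is some evidence for the DQ assignment; overlapping ranges suggest heat reached a neighbouring chip or the mapping is interleaved.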

Yuehui Wu

May 14, 2021, 10:50:02 AM
to stressappt...@googlegroups.com
I work for a Chromebook motherboard repair center.
I've seen and resolved a lot of issues.
Defective DRAM is at the top of the list.
It gives me a headache.

