Hi,
How can we fix a cgroup out-of-memory problem? It seems one of our applications is being killed, and as a result the failure escalates to a board restart.
As per the logs below, the application appears to have been killed because it hit
the memory limit configured for its cgroup (usage 205824kB against a limit of 205824kB).
0004: Apr 04 07:28:04 000400 kernel: loco_sps_evo_ed invoked oom-killer:
gfp_mask=0xd0, order=0, oom_score_adj=0
0004: Apr 04 07:28:04 000400 kernel: loco_sps_evo_ed cpuset=00050001
mems_allowed=0
0004: Apr 04 07:28:04 000400 kernel: CPU: 19 PID: 29944 Comm: loco_sps_evo_ed
Tainted: G O 3.14.39ltsi-WR7.0.0.11_standard #1
0004: Apr 04 07:28:04 000400 kernel: Call Trace:
0004: Apr 04 07:28:04 000400 kernel: [c00000001e96b770] [c00000000501a6fc]
.show_stack+0x168/0x278 (unreliable)
0004: Apr 04 07:28:04 000400 kernel: [c00000001e96b860] [c000000005870460]
.dump_stack+0x9c/0xfc
0004: Apr 04 07:28:04 000400 kernel: [c00000001e96b8e0] [c000000005173a5c]
.dump_header.isra.9+0x9c/0x250
0004: Apr 04 07:28:04 000400 kernel: [c00000001e96b9b0] [c000000005174218]
.oom_kill_process+0x2d8/0x4a0
0004: Apr 04 07:28:04 000400 kernel: [c00000001e96ba80] [c0000000051df4b0]
.mem_cgroup_oom_synchronize+0x644/0x7a8
0004: Apr 04 07:28:04 000400 kernel: [c00000001e96bb90] [c000000005174a04]
.pagefault_out_of_memory+0x1c/0x84
0004: Apr 04 07:28:04 000400 kernel: [c00000001e96bc00] [c0000000058665b4]
.do_page_fault+0x7e8/0x86c
0004: Apr 04 07:28:04 000400 kernel: [c00000001e96be30] [c00000000502f1dc]
storage_fault_common+0x20/0x44
0004: Apr 04 07:28:04 000400 kernel: Task in /00000005/00050001 killed as a
result of limit of /00000005/00050001
0004: Apr 04 07:28:04 000400 kernel: memory: usage 205824kB, limit 205824kB,
failcnt 22
0004: Apr 04 07:28:04 000400 kernel: memory+swap: usage 205824kB, limit
18014398509481983kB, failcnt 0
0004: Apr 04 07:28:04 000400 kernel: kmem: usage 0kB, limit
18014398509481983kB, failcnt 0
0004: Apr 04 07:28:05 000400 kernel: Memory cgroup stats for
/00000005/00050001: cache:268KB rss:205556KB rss_huge:0KB mapped_file:264KB
writeback:0KB swap:0KB inactive_anon:268KB active_anon:205500KB
inactive_file:0KB active_file:0KB unevictable:0KB
0004: Apr 04 07:28:05 000400 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
0004: Apr 04 07:28:05 000400 kernel: [28771] 0 28771 194002 52320 157 0 0 loco-sps-ip-lx-
0004: Apr 04 07:28:05 000400 kernel: Memory cgroup out of memory: Kill process
28771 (loco-sps-ip-lx-) score 989 or sacrifice child
0004: Apr 04 07:28:05 000400 kernel: Killed process 28771 (loco-sps-ip-lx-)
total-vm:776008kB, anon-rss:205268kB, file-rss:4012kB
0004: Apr 04 07:28:05 000400 pghd[28193]: program (0x50001) terminated
0004: Apr 04 07:28:05 000400 pghd[28193]: pgh_cb_pgmterm for pid=0x50001,raw
status=9
0004: Apr 04 07:28:05 000400 pghd[28193]: Program 1 terminated abnormally by
signal for pid=0x50001
0004: Apr 04 07:28:05 000400 pghd[28193]: Program terminated abnormally with
signal number=9 for pid=0x50001
0004: Apr 04 07:28:05 000400 pghd[28193]: pgh_cb_pgmterm 3 0x25e5
0004: Apr 04 07:28:05 000400 TRI_SERVER[28198]: rlog: $ Restart request$
2016-04-04 07:28:05$ Board Manager$ - $ Warm$ -$ -$ Restart of program
identity:CXC1322156%3_P91A132, container handle:CXC1322156%3_P91A132/1 program
handle:loco-sps-ip-lx-lm/1 with escalation set to board restart.
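To double-check our reading of the log, here is a small sketch in Python that pulls the usage, limit, and failcnt figures out of the two counter lines and confirms that the memory counter is at its limit. The regular expression and the helper name parse_counters are our own, written only for this illustration; the two input lines are copied from the log above with the mail line-wrapping undone.

```python
import re

# Two kernel lines copied from the log above (timestamp prefix removed,
# mail line-wrapping undone).
LOG = (
    "memory: usage 205824kB, limit 205824kB, failcnt 22\n"
    "memory+swap: usage 205824kB, limit 18014398509481983kB, failcnt 0\n"
)

# Illustrative pattern for "<counter>: usage NkB, limit NkB, failcnt N".
# This is our own assumption about the line layout, based only on this log.
PATTERN = re.compile(
    r"^(?P<name>[\w+]+): usage (?P<usage>\d+)kB, "
    r"limit (?P<limit>\d+)kB, failcnt (?P<failcnt>\d+)$",
    re.MULTILINE,
)


def parse_counters(text):
    """Return {counter_name: {"usage": kB, "limit": kB, "failcnt": n}}."""
    counters = {}
    for match in PATTERN.finditer(text):
        counters[match.group("name")] = {
            key: int(match.group(key)) for key in ("usage", "limit", "failcnt")
        }
    return counters


counters = parse_counters(LOG)
mem = counters["memory"]

# The cgroup is at its limit when usage has reached the configured limit.
at_limit = mem["usage"] >= mem["limit"]
print("memory counter at limit:", at_limit, "failcnt:", mem["failcnt"])
```

On these two lines the memory counter is at its limit, while the memory+swap limit is effectively unlimited, so the kill looks like it comes from the plain memory limit of /00000005/00050001 and not from a swap limit.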
In this case we are not getting any core dump.
We are concerned about how to investigate such problems without a core dump or a snapshot of the system.
Could you please check and let us know whether there is a way to
configure or modify the kernel so that it generates a dump in this scenario?
--