How to detect jobs killed due to out-of-limit resources usage?

Jonathan Ballet

Jan 4, 2017, 12:55:59 PM1/4/17
to nomad...@googlegroups.com
Hi,

I have a job with a fairly restrictive memory constraint (50MB). Although
everything was fine most of the time, hitting a particular API endpoint
pushed memory usage above this limit and the job got killed by the kernel
OOM killer. Nomad correctly restarted the job, but the problem was a bit
difficult to track down due to the lack of information.

The Nomad client showed the following at the time the job was killed:

[DEBUG] driver.docker: error collecting stats from container 107450959eb38f5502fa08b3e470113f6be8564c9b4b63276401495956c5908b: io: read/write on closed pipe
[DEBUG] plugin: /opt/nomad/0.5.2/nomad: plugin process exited
[INFO] client: task "my-app" for alloc "27f72c98-1521-51e7-5471-6d00dd259292" failed: Wait returned exit code 255, signal 0, and error Docker container exited with non-zero exit code: 255
[INFO] client: Restarting task "my-app" for alloc "27f72c98-1521-51e7-5471-6d00dd259292" in 17.556986477s
[DEBUG] client: updated allocations at index 162211 (pulled 0) (filtered 11)
[DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 11)

And the kernel logs at the same time:

my-app invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
my-app cpuset=107450959eb38f5502fa08b3e470113f6be8564c9b4b63276401495956c5908b mems_allowed=0-1
CPU: 0 PID: 25242 Comm: my-app Tainted: P CIO 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u1
Hardware name: Supermicro X10DAi/X10DAI, BIOS 1.0c 01/29/2015
0000000000000000 ffffffff81512391 ffff8802268b82d0 ffff8805f7ab3400
ffffffff8150ff8d ffff880c5ef01120 ffffffff810a8f67 0000000000000001
ffff880c5ef01100 ffff880225564c80 0000000000000001 0000000000000000
Call Trace:
[<ffffffff81512391>] ? dump_stack+0x5d/0x78
[<ffffffff8150ff8d>] ? dump_header+0x76/0x1e8
[<ffffffff810a8f67>] ? __wake_up_common+0x57/0x90
[<ffffffff8114239d>] ? find_lock_task_mm+0x3d/0x90
[<ffffffff811427dd>] ? oom_kill_process+0x21d/0x370
[<ffffffff8114239d>] ? find_lock_task_mm+0x3d/0x90
[<ffffffff811a288a>] ? mem_cgroup_oom_synchronize+0x52a/0x590
[<ffffffff811a1e10>] ? mem_cgroup_try_charge_mm+0xa0/0xa0
[<ffffffff81142f90>] ? pagefault_out_of_memory+0x10/0x80
[<ffffffff810584f5>] ? __do_page_fault+0x3c5/0x4f0
[<ffffffff81170a63>] ? vma_merge+0x223/0x340
[<ffffffff8109d7c6>] ? set_next_entity+0x56/0x70
[<ffffffff81171b58>] ? do_brk+0x248/0x340
[<ffffffff8151a568>] ? page_fault+0x28/0x30
Task in /docker/107450959eb38f5502fa08b3e470113f6be8564c9b4b63276401495956c5908b killed as a result of limit of /docker/107450959eb38f5502fa08b3e470113f6be8564c9b4b63276401495956c5908b
memory: usage 51200kB, limit 51200kB, failcnt 42
memory+swap: usage 51200kB, limit 51200kB, failcnt 0
kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
Memory cgroup stats for /docker/107450959eb38f5502fa08b3e470113f6be8564c9b4b63276401495956c5908b: cache:84KB rss:51116KB rss_huge:0KB mapped_file:0KB writeback:0KB swap:0KB inactive_anon:8KB active_anon:51164KB inactive_file:28KB active_file:0KB unevictable:0KB
[ pid ]  uid   tgid  total_vm    rss  nr_ptes  swapents  oom_score_adj  name
[25151]    0  25151        50      1        3         0              0  dumb-init
[25164]    0  25164       379      1        5         0              0  sh
[25165]    0  25165      3487   1923       12         0              0  envconsul
[25173]    0  25173     30562  13734       64         0              0  my-app
Memory cgroup out of memory: Kill process 25173 (my-app) score 1077 or sacrifice child
Killed process 25173 (my-app) total-vm:122248kB, anon-rss:48044kB, file-rss:6892kB

And the job's stderr has this:

$ nomad fs -job my-app alloc/logs/my-app.stderr.0
2017/01/04 17:26:05 [ERR] unexpected exit from subprocess (-1)


... And that's pretty much all I have. We have no problem raising this
limit and giving the job more resources, but I fear this will happen again
in the future, and I'm looking for an easy way to get more information if
this kind of problem occurs again.

We could try to reconcile the kernel logs with the Nomad logs in
Elasticsearch (we have centralized logging), but I was wondering if there
is a better way for Nomad to signal this.

Thanks!

Jonathan

Alex Dadgar

Jan 4, 2017, 1:58:22 PM1/4/17
to nomad...@googlegroups.com, Jonathan Ballet
Hey Jonathan,

You should be able to see the OOM kill by looking at `nomad alloc-status <alloc-id>`. The exit code will be 137 if OOM killed.
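
For example, something along these lines could automate that check against
the allocation HTTP API (GET /v1/allocation/<alloc-id>). This is only a
rough sketch: it assumes a local agent on the default address
http://127.0.0.1:4646, and the struct only declares the handful of JSON
fields the check needs.

// oomcheck.go: flag "Terminated" task events whose exit code is 137
// (128 + SIGKILL), which is what an OOM kill reports.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// Minimal view of the /v1/allocation/<id> response; only the fields
// read below are declared.
type allocation struct {
	TaskStates map[string]struct {
		Events []struct {
			Type     string
			ExitCode int
		}
	}
}

func main() {
	allocID := os.Args[1]

	// Assumes a local Nomad agent on the default port.
	resp, err := http.Get("http://127.0.0.1:4646/v1/allocation/" + allocID)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var alloc allocation
	if err := json.NewDecoder(resp.Body).Decode(&alloc); err != nil {
		log.Fatal(err)
	}

	for task, state := range alloc.TaskStates {
		for _, ev := range state.Events {
			if ev.Type == "Terminated" && ev.ExitCode == 137 {
				fmt.Printf("task %q in alloc %s was likely OOM killed (exit 137)\n", task, allocID)
			}
		}
	}
}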

Thanks,
Alex Dadgar

Diptanu Choudhury

Jan 4, 2017, 2:02:26 PM1/4/17
to Alex Dadgar, Nomad, Jonathan Ballet
I usually look into dmesg to see if the kernel has OOM killed any process.
There is also the cgroup notification API, which we could use to tell
precisely whether a container was killed by the kernel for exceeding its
resource limits.

Just looking at a container's exit code isn't enough to know whether the
process was OOM killed.
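
For the cgroup route, here is a minimal sketch of the cgroup v1 memory OOM
notification mechanism: open the container's memory.oom_control, create an
eventfd, register the pair through cgroup.event_control, and block on the
eventfd. The cgroup path below is an assumption for a Docker container on a
typical cgroup v1 mount; substitute the real container ID.

// oomwatch.go: block until the kernel records an OOM event in a cgroup.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical path; adjust to your cgroup mount and container ID.
	cgroup := "/sys/fs/cgroup/memory/docker/<container-id>"

	// File whose OOM events we want to be notified about.
	oomControl, err := os.Open(filepath.Join(cgroup, "memory.oom_control"))
	if err != nil {
		log.Fatal(err)
	}
	defer oomControl.Close()

	// eventfd the kernel signals on each OOM event in this cgroup.
	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(efd)

	// Register "<eventfd fd> <oom_control fd>" with cgroup.event_control.
	ec, err := os.OpenFile(filepath.Join(cgroup, "cgroup.event_control"), os.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := fmt.Fprintf(ec, "%d %d", efd, oomControl.Fd()); err != nil {
		log.Fatal(err)
	}
	ec.Close()

	// Every successful 8-byte read means one OOM event was recorded.
	buf := make([]byte, 8)
	for {
		if _, err := unix.Read(efd, buf); err != nil {
			log.Fatal(err)
		}
		log.Printf("OOM event in %s", cgroup)
	}
}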


--
Thanks,
Diptanu Choudhury