We are experiencing a serious issue with servers rebooting under high load (CPU maxed out, moderate SSD disk usage).
The update engine is turned off (verified as shown below), and there are absolutely no other logs saying why the host rebooted. This is happening on both the current stable and alpha releases.
- All affected hosts are maxed out on CPU; the problem only seems to hit instances with a fully loaded CPU and I/O-intensive applications. These hosts are running Aerospike and Kafka.
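
For reference, this is roughly how I verified that the auto-update machinery can't be triggering the reboots (on CoreOS the relevant units are update-engine and locksmithd):

```sh
# Confirm both the update engine and the reboot manager are disabled/inactive.
systemctl status update-engine.service locksmithd.service
systemctl is-enabled update-engine.service locksmithd.service
```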
The only dmesg entries that could mean anything are about CPU temperature and a page allocation failure related to ext4 writeback (these aren't specifically around the reboot time):
[31554.458433] CPU69: Core temperature above threshold, cpu clock throttled (total events = 1)
[31554.458433] CPU25: Core temperature above threshold, cpu clock throttled (total events = 1)
[31554.458435] CPU74: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458437] CPU31: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458438] CPU30: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458439] CPU75: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458440] CPU25: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458444] mce: [Hardware Error]: Machine check events logged
[120977.571023] kworker/u177:0: page allocation failure: order:2, mode:0x2084020(GFP_ATOMIC|__GFP_COMP)
[120977.582413] CPU: 21 PID: 7248 Comm: kworker/u177:0 Not tainted 4.7.0-coreos-r1 #1
[120977.591159] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
[120977.600851] Workqueue: writeback wb_workfn (flush-259:0)
[120977.607137] 0000000000000286 00000000d8c26c5f ffff88019a6a3388 ffffffff992f21d3
[120977.615858] 0000000000000000 0000000000000002 ffff88019a6a3418 ffffffff9916daa6
[120977.624561] 0208402000000000 0000000000000246 ffff88019a6a33e0 ffffffff990b91f4
[120977.633282] Call Trace:
[120977.636215] [<ffffffff992f21d3>] dump_stack+0x63/0x90
[120977.642275] [<ffffffff9916daa6>] warn_alloc_failed+0x106/0x170
[120977.649239] [<ffffffff990b91f4>] ? __wake_up+0x44/0x50
[120977.655394] [<ffffffff9916e01d>] __alloc_pages_nodemask+0x50d/0xeb0
[120977.662887] [<ffffffff991c62b3>] ? kmem_cache_alloc+0x173/0x1e0
[120977.669961] [<ffffffff993f9bc9>] ? alloc_iova+0x49/0x230
[120977.676392] [<ffffffff991bab6c>] alloc_pages_current+0x8c/0x110
[120977.683460] [<ffffffff9916ec49>] alloc_kmem_pages+0x19/0x90
[120977.690126] [<ffffffff9918bf18>] kmalloc_order+0x18/0x50
[120977.696486] [<ffffffff9918bf74>] kmalloc_order_trace+0x24/0xa0
[120977.703448] [<ffffffff991c6b86>] __kmalloc+0x1c6/0x230
[120977.709602] [<ffffffffc043cfd1>] 0xffffffffc043cfd1
[120977.715460] [<ffffffff992d43f5>] ? blk_mq_get_tag+0x45/0xe0
[120977.722111] [<ffffffff99036699>] ? sched_clock+0x9/0x10
[120977.728373] [<ffffffff992cfa40>] ? __blk_mq_alloc_request+0xe0/0x1f0
[120977.735962] [<ffffffff992d1bd3>] ? blk_mq_map_request+0xb3/0x210
[120977.743125] [<ffffffff992c353f>] ? part_round_stats+0x4f/0x60
[120977.750007] [<ffffffff992d327a>] blk_mq_make_request+0x31a/0x4d0
[120977.757177] [<ffffffff992c6b62>] generic_make_request+0xf2/0x1d0
[120977.764348] [<ffffffff992c6cb6>] submit_bio+0x76/0x170
[120977.770511] [<ffffffff991735a5>] ? __test_set_page_writeback+0x155/0x1d0
[120977.778499] [<ffffffffc059fe5f>] ext4_io_submit+0x2f/0x40 [ext4]
[120977.785677] [<ffffffffc05a002d>] ext4_bio_write_page+0x19d/0x840 [ext4]
[120977.793561] [<ffffffffc0595d3d>] do_journal_get_write_access+0x55d/0x18d0 [ext4]
[120977.802454] [<ffffffffc0595e4b>] do_journal_get_write_access+0x66b/0x18d0 [ext4]
[120977.811242] [<ffffffffc0596bbb>] do_journal_get_write_access+0x13db/0x18d0 [ext4]
[120977.820133] [<ffffffffc059b382>] ext4_mark_inode_dirty+0x632/0xf70 [ext4]
[120977.828219] [<ffffffff99173fae>] do_writepages+0x1e/0x30
[120977.834587] [<ffffffff992145d5>] __writeback_single_inode+0x45/0x320
[120977.842158] [<ffffffff99214de6>] writeback_sb_inodes+0x286/0x580
[120977.849320] [<ffffffff9921516f>] __writeback_inodes_wb+0x8f/0xc0
[120977.867385] [<ffffffff992154f0>] wb_writeback+0x280/0x310
[120977.884538] [<ffffffff99215d09>] wb_workfn+0x2c9/0x3f0
[120977.901502] [<ffffffff9908ef76>] process_one_work+0x156/0x400
[120977.919271] [<ffffffff9908fa6e>] worker_thread+0x4e/0x4a0
[120977.936811] [<ffffffff9908fa20>] ? rescuer_thread+0x380/0x380
[120977.954714] [<ffffffff99094e58>] kthread+0xd8/0xf0
[120977.971909] [<ffffffff99079fc5>] ? do_group_exit+0x45/0xb0
[120977.990272] [<ffffffff9958e2ff>] ret_from_fork+0x1f/0x40
[120978.008743] [<ffffffff99094d80>] ? kthread_park+0x60/0x60
[120978.139031] Mem-Info:
[120978.153356] active_anon:19582653 inactive_anon:332 isolated_anon:1
active_file:74551667 inactive_file:2286516 isolated_file:0
unevictable:0 dirty:588573 writeback:564 unstable:0
slab_reclaimable:1374062 slab_unreclaimable:298859
mapped:47429 shmem:9314 pagetables:54960 bounce:0
free:423499 free_pcp:23799 free_cma:0
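
Since dmesg mentions thermal throttling, a machine check event, and an order-2 atomic allocation failure, I've also been watching the throttle counters and memory fragmentation on these hosts; a rough sketch of what I run (standard x86 sysfs layout assumed):

```sh
# Per-CPU thermal throttle counters reported by the kernel.
grep . /sys/devices/system/cpu/cpu*/thermal_throttle/*_throttle_count

# Machine check events that made it into the kernel log.
journalctl -k | grep -i mce

# Free-page fragmentation per order, relevant to the order:2 GFP_ATOMIC failure above.
cat /proc/buddyinfo
```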
How can I go about finding the reason for the reboots? The only visible message in the logs is that the host is rebooting.
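
For what it's worth, this is how I've been looking for evidence after a crash; the tail of the previous boot's journal shows nothing but the reboot itself (ipmitool isn't shipped on stock CoreOS, so I run it from a toolbox container):

```sh
# List prior boots, then read the end of the previous boot's journal.
journalctl --list-boots
journalctl -b -1 -n 200

# Hardware event log on these DL380 Gen9 hosts, in case iLO recorded a thermal shutdown.
ipmitool sel elist
```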