Frequent rebooting under high load - ProLiant DL380 Gen9


Andy V

Oct 26, 2016, 1:36:31 PM
to CoreOS User
Hello all,

We are experiencing a serious issue with servers rebooting under high load (CPU maxed out, moderate SSD disk usage).

Hardware: ProLiant DL380 Gen9
CPU: Dual Intel(R) Xeon(R) CPU E5-2699 v4
Ram: 384GB
Disk: HGST Ultrastar SN150 1.6TB PCIe SSD NVMe 
Kubernetes 1.4.4

Literally the only log entry I see around the reboot time (journald) is:

-- Reboot --

Update engine is turned off, and there are absolutely no other logs saying why the host rebooted. This is happening on both the current stable and alpha releases.
Out of about 40 online servers:
 - All are maxed on CPU; it only seems to happen to instances with a fully loaded CPU AND I/O-intensive applications. These hosts are running Aerospike and Kafka.

I've seen 2 other posts in this group referencing HP ProLiant with no resolution:
  - June 22 - CoreOS v1010.5.0 Frequent servers reboot on DL380p G8 and DL360 G9
  - Oct 2 - Frequent server reboots


The only dmesg entries which could mean anything are about CPU temperature, plus a page allocation failure related to ext4 writeback (these aren't specifically around the reboot time):

[31554.458433] CPU69: Core temperature above threshold, cpu clock throttled (total events = 1)
[31554.458433] CPU25: Core temperature above threshold, cpu clock throttled (total events = 1)
[31554.458435] CPU74: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458437] CPU31: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458438] CPU30: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458439] CPU75: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458440] CPU25: Package temperature above threshold, cpu clock throttled (total events = 1)
[31554.458444] mce: [Hardware Error]: Machine check events logged


[120977.571023] kworker/u177:0: page allocation failure: order:2, mode:0x2084020(GFP_ATOMIC|__GFP_COMP)
[120977.582413] CPU: 21 PID: 7248 Comm: kworker/u177:0 Not tainted 4.7.0-coreos-r1 #1
[120977.591159] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
[120977.600851] Workqueue: writeback wb_workfn (flush-259:0)
[120977.607137]  0000000000000286 00000000d8c26c5f ffff88019a6a3388 ffffffff992f21d3
[120977.615858]  0000000000000000 0000000000000002 ffff88019a6a3418 ffffffff9916daa6
[120977.624561]  0208402000000000 0000000000000246 ffff88019a6a33e0 ffffffff990b91f4
[120977.633282] Call Trace:
[120977.636215]  [<ffffffff992f21d3>] dump_stack+0x63/0x90
[120977.642275]  [<ffffffff9916daa6>] warn_alloc_failed+0x106/0x170
[120977.649239]  [<ffffffff990b91f4>] ? __wake_up+0x44/0x50
[120977.655394]  [<ffffffff9916e01d>] __alloc_pages_nodemask+0x50d/0xeb0
[120977.662887]  [<ffffffff991c62b3>] ? kmem_cache_alloc+0x173/0x1e0
[120977.669961]  [<ffffffff993f9bc9>] ? alloc_iova+0x49/0x230
[120977.676392]  [<ffffffff991bab6c>] alloc_pages_current+0x8c/0x110
[120977.683460]  [<ffffffff9916ec49>] alloc_kmem_pages+0x19/0x90
[120977.690126]  [<ffffffff9918bf18>] kmalloc_order+0x18/0x50
[120977.696486]  [<ffffffff9918bf74>] kmalloc_order_trace+0x24/0xa0
[120977.703448]  [<ffffffff991c6b86>] __kmalloc+0x1c6/0x230
[120977.709602]  [<ffffffffc043cfd1>] 0xffffffffc043cfd1
[120977.715460]  [<ffffffff992d43f5>] ? blk_mq_get_tag+0x45/0xe0
[120977.722111]  [<ffffffff99036699>] ? sched_clock+0x9/0x10
[120977.728373]  [<ffffffff992cfa40>] ? __blk_mq_alloc_request+0xe0/0x1f0
[120977.735962]  [<ffffffff992d1bd3>] ? blk_mq_map_request+0xb3/0x210
[120977.743125]  [<ffffffff992c353f>] ? part_round_stats+0x4f/0x60
[120977.750007]  [<ffffffff992d327a>] blk_mq_make_request+0x31a/0x4d0
[120977.757177]  [<ffffffff992c6b62>] generic_make_request+0xf2/0x1d0
[120977.764348]  [<ffffffff992c6cb6>] submit_bio+0x76/0x170
[120977.770511]  [<ffffffff991735a5>] ? __test_set_page_writeback+0x155/0x1d0
[120977.778499]  [<ffffffffc059fe5f>] ext4_io_submit+0x2f/0x40 [ext4]
[120977.785677]  [<ffffffffc05a002d>] ext4_bio_write_page+0x19d/0x840 [ext4]
[120977.793561]  [<ffffffffc0595d3d>] do_journal_get_write_access+0x55d/0x18d0 [ext4]
[120977.802454]  [<ffffffffc0595e4b>] do_journal_get_write_access+0x66b/0x18d0 [ext4]
[120977.811242]  [<ffffffffc0596bbb>] do_journal_get_write_access+0x13db/0x18d0 [ext4]
[120977.820133]  [<ffffffffc059b382>] ext4_mark_inode_dirty+0x632/0xf70 [ext4]
[120977.828219]  [<ffffffff99173fae>] do_writepages+0x1e/0x30
[120977.834587]  [<ffffffff992145d5>] __writeback_single_inode+0x45/0x320
[120977.842158]  [<ffffffff99214de6>] writeback_sb_inodes+0x286/0x580
[120977.849320]  [<ffffffff9921516f>] __writeback_inodes_wb+0x8f/0xc0
[120977.867385]  [<ffffffff992154f0>] wb_writeback+0x280/0x310
[120977.884538]  [<ffffffff99215d09>] wb_workfn+0x2c9/0x3f0
[120977.901502]  [<ffffffff9908ef76>] process_one_work+0x156/0x400
[120977.919271]  [<ffffffff9908fa6e>] worker_thread+0x4e/0x4a0
[120977.936811]  [<ffffffff9908fa20>] ? rescuer_thread+0x380/0x380
[120977.954714]  [<ffffffff99094e58>] kthread+0xd8/0xf0
[120977.971909]  [<ffffffff99079fc5>] ? do_group_exit+0x45/0xb0
[120977.990272]  [<ffffffff9958e2ff>] ret_from_fork+0x1f/0x40
[120978.008743]  [<ffffffff99094d80>] ? kthread_park+0x60/0x60
[120978.139031] Mem-Info:
[120978.153356] active_anon:19582653 inactive_anon:332 isolated_anon:1
                 active_file:74551667 inactive_file:2286516 isolated_file:0
                 unevictable:0 dirty:588573 writeback:564 unstable:0
                 slab_reclaimable:1374062 slab_unreclaimable:298859
                 mapped:47429 shmem:9314 pagetables:54960 bounce:0
                 free:423499 free_pcp:23799 free_cma:0


How can I go about finding the reason for the reboots? The only visible message in the logs is that the host rebooted.

Brandon Philips

Oct 28, 2016, 12:29:24 AM
to Andy V, CoreOS User
Hello Andy-

Is there a lot of logging happening on these hosts? If you set the journald Storage= option to volatile, does the problem go away? https://www.freedesktop.org/software/systemd/man/journald.conf.html#Storage=
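For reference, that can be set with a drop-in rather than editing the shipped journald.conf (this assumes the standard systemd drop-in path; restart systemd-journald afterwards for it to take effect):

```ini
# /etc/systemd/journald.conf.d/volatile.conf
# Keep the journal in RAM only (tmpfs under /run/log/journal);
# no journal data is written to disk.
[Journal]
Storage=volatile
```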

Thank You,

Brandon


Andy V

Oct 28, 2016, 1:21:18 AM
to CoreOS User
There isn't a lot of logging, just a few lines every couple of seconds.

The weird thing is that journald has -- Reboot -- as the last message in the logs, with no reason given for the reboot (update-engine and locksmithd are disabled). The page allocation failure doesn't seem to be an issue right now.

Not sure how to trace what is triggering the reboot; I've enabled LogLevel=debug in systemd.
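For anyone following along, the same setting can live in a drop-in instead of /etc/systemd/system.conf itself (assuming the standard system.conf.d override path; it takes effect after `systemctl daemon-reexec` or a reboot):

```ini
# /etc/systemd/system.conf.d/10-debug.conf
# Make PID 1 log at debug level, so shutdown/reboot requests
# reaching systemd show up in the journal.
[Manager]
LogLevel=debug
```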

Markus Berner

Nov 16, 2016, 6:58:44 AM
to CoreOS User
We are seeing similar problems on KVM-virtualized hardware at a cloud provider:

On multiple Xeon E5 processor models, e.g.:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 2493.990
cache size      : 25600 KB

CoreOS 1185.3.0 Stable, but we also had it with earlier Stable releases.

There is absolutely nothing in the logs at the moment the machine dies. Usually it just reboots, but we have also seen it freeze to the point that even the boot console was unresponsive and the server was not pingable.
The occurrences seem to be correlated with high load and high I/O. When we put load on the server, it will crash every 2-3 days.

No idea how to debug this - the IaaS provider does not seem to support pstore.
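A quick way to check whether the kernel and firmware expose any pstore backend at all (EFI variables, ACPI ERST, or ramoops): if one is present, the crash-time console output survives the reboot as files under /sys/fs/pstore. A minimal sketch:

```shell
#!/bin/sh
# Probe for pstore support. Prints crash records if any survived a
# previous crash, or notes that the platform exposes no backend
# (common on IaaS guests).
pstore_status() {
  if [ -d /sys/fs/pstore ] && [ -n "$(ls -A /sys/fs/pstore 2>/dev/null)" ]; then
    echo "pstore records found: $(ls /sys/fs/pstore)"
  elif [ -d /sys/fs/pstore ]; then
    echo "pstore mounted but empty (no crash records yet, or no backend)"
  else
    echo "no pstore filesystem exposed"
  fi
}
pstore_status
```

If no backend exists, ramoops (carving a reserved RAM region via kernel parameters) is sometimes an option, but that depends on the hypervisor preserving RAM contents across a reset.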

Andy V

Nov 16, 2016, 3:55:24 PM
to CoreOS User
For what it's worth, I managed to solve this problem in our case by modifying a few sysctls:

kernel.perf_cpu_time_max_percent=0
kernel.nmi_watchdog=0

This seems to possibly be a rare SMP bug in the kernel introduced in 4.5. Also try modifying /etc/systemd/system.conf (CrashReboot=no), so the host halts instead of rebooting. I caught it by seeing the stack trace dumped over the iLO.
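To make those sysctls persist across reboots, one common approach (the filename here is arbitrary) is a sysctl.d fragment; CrashReboot=no goes under the [Manager] section of /etc/systemd/system.conf:

```ini
# /etc/sysctl.d/90-disable-nmi-watchdog.conf
# Disable perf-based CPU time sampling and the NMI watchdog, which
# appeared to trigger the crashes under full CPU + I/O load in our case.
kernel.perf_cpu_time_max_percent = 0
kernel.nmi_watchdog = 0
```

Apply immediately with `sysctl --system` (or reboot).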
