Slow Docker Container on GCP running Python AI/ML Workloads

已查看 22 次
跳至第一个未读帖子

Ravindra Maurya

未读,
2021年8月11日 10:30:492021/8/11
收件人 gce-discussion

Looking for some advice on how to diagnose the situation below, essentially wrt. high system/kernel space load.

Setup

  • Docker CE on CentOS 7 (no SELinux)
  • Containerized Python based AI/ML workload.
  • Python multi-processing is in use with ((n/2)-2) pool workers
  • n => cpu/vcpu cores (single hyper-thread)

General Issue

  • On Bare metal container from our docker image runs as expected.
  • On GCP container based on same image runs slow, 3-4 times slow.
  • AI Pipelines that originally took 2 days now taking 4/5 days.
  • No difference in our workload image.

Specific Issue

  • In GCP top reports heavy load in system/kernel space (30-50).
  • In bare-metal top reports low load in kernel/system space (1-10), its mostly in user space.
  • Individual Processes on bare metal report at most 100-120 %CPU.
  • While individual processes on GCP report 300-500 %CPU.
  • Load Average on bare metal is like 11.85, 32.87, 19.43
  • While LA on GCP VM is like 43.82, 32.87, 19.43

What we Tried

  • Set secomp and other specific settings on docker run
  • Set full privileged=true on docker run
  • Doubled the vCPU on GCP (originally 32 core set it to 64)
  • Used C2 (Compute Intensive) machine type.

Since I'm having no clue on how to diagnose this top-down thus also tried perf to see what is happening in the kernel/system space, following is the output from both systems.

GCP VM 'perf trace' (please see on stackoverflow for better readability)

Summary of events: python (2551), 6580977 events, 3.9% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ sched_yield 3232286 50396.443 0.002 0.016 68.851 0.69% futex 58139 23715.636 0.001 0.408 21989.789 92.73% brk 1643 312.000 0.002 0.190 19.195 13.42% close 5 35.710 3.630 7.142 18.245 39.07% openat 4 12.764 2.710 3.191 3.767 7.16% write 165 9.681 0.003 0.059 6.496 66.95% ioctl 85 0.605 0.002 0.007 0.028 9.58% lseek 182 0.460 0.002 0.003 0.004 1.70% read 10 0.104 0.003 0.010 0.019 15.84% fstat 4 0.030 0.006 0.007 0.008 6.21% python (2847), 10240661 events, 6.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 2909 269175.495 0.001 92.532 21907.964 13.70% sched_yield 5119771 63387.192 0.002 0.012 98.500 0.60% ython (2848), 10952416 events, 6.4% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 2944 269333.572 0.001 91.486 21908.135 13.70% sched_yield 5475608 62191.735 0.002 0.011 54.373 0.56% python (2849), 10095381 events, 5.9% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 2869 269495.562 0.001 93.934 21907.785 13.69% sched_yield 5046875 63404.111 0.002 0.013 69.027 0.53% python (2850), 10871629 events, 6.4% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3379 269541.688 0.001 79.770 21908.129 13.71% sched_yield 5434655 61881.631 0.002 0.011 92.192 0.66% python (2851), 10471715 events, 6.2% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3267 269665.818 0.001 82.542 21908.306 13.70% sched_yield 5234762 62440.899 0.002 0.012 70.764 0.59% python (2852), 11048544 events, 6.5% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3075 269845.352 0.001 87.755 21908.402 13.69% sched_yield 5523384 61269.561 0.002 0.011 77.207 0.58% python (2853), 10802913 events, 6.4% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3411 269995.415 0.001 79.154 21907.717 13.70% sched_yield 5400121 61586.390 0.000 0.011 63.498 0.63% python (2854), 10580227 events, 6.2% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3237 270029.557 0.001 83.420 21907.737 13.69% sched_yield 5289038 62372.649 0.002 0.012 50.329 0.56% python (2855), 11269016 events, 6.6% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3309 270088.861 0.001 81.623 21908.143 13.69% sched_yield 5633482 61010.474 0.002 0.011 59.756 0.54% python (2856), 11194619 events, 6.6% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3355 270348.851 0.001 80.581 21917.988 13.68% sched_yield 5596230 61122.634 0.002 0.011 52.892 0.60% python (2857), 11236749 events, 6.6% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 2559 273140.199 0.001 106.737 21916.745 13.53% sched_yield 5618107 61171.202 0.002 0.011 74.198 0.59% python (2858), 10328444 events, 6.1% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 2759 270424.625 0.001 98.015 21916.130 13.66% sched_yield 5163708 62490.561 0.002 0.012 95.956 0.65% python (2859), 11507002 events, 6.8% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3065 270481.725 0.001 88.249 21910.849 13.67% sched_yield 5752618 60412.759 0.002 0.011 96.043 0.69% python (2860), 11013249 events, 6.5% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3254 270691.944 0.001 83.187 21912.965 13.67% sched_yield 5505249 60965.211 0.002 0.011 69.026 0.62% python (2861), 11793518 events, 6.9% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 2851 270700.999 0.001 94.949 21911.187 13.66% sched_yield 5896062 60165.571 0.002 0.010 69.030 0.69% python (2862), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (2863), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (2864), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (2865), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00%

Bare Metal 'perf trace' (please see on stackoverflow for better readability)

Summary of events: python (113694), 88758 events, 2.7% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3174 13767.708 0.001 4.338 13745.177 99.84% sched_yield 39353 130.937 0.002 0.003 3.972 3.91% brk 1132 91.005 0.002 0.080 4.163 15.54% write 101 1.928 0.002 0.019 0.673 34.59% lseek 110 0.310 0.001 0.003 0.032 12.49% openat 2 0.234 0.112 0.117 0.122 4.49% ioctl 36 0.162 0.002 0.005 0.010 7.15% read 6 0.033 0.003 0.005 0.008 16.09% fstat 2 0.008 0.004 0.004 0.004 2.01% close 3 0.007 0.002 0.002 0.003 5.55% python (114304), 3210496 events, 97.3% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 3640 97379.595 0.000 26.753 13930.078 20.68% sched_yield 1599811 3995.576 0.001 0.002 15.957 0.91% python (114305), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114306), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114307), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114308), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114309), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114310), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114311), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114312), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114313), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114314), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114315), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114316), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114317), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114318), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114319), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114320), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114321), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114322), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114323), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114324), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114325), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114326), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114327), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114328), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114329), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114330), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114331), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114332), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114333), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114334), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114335), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114336), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114337), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114338), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114339), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114340), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114341), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00% python (114342), 1 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ futex 1 0.000 0.000 0.000 0.000 0.00%

Apparently too many events in GCP, specifically futex and sched_yield, both clearly taking lot of time in the system space.

Any idea on what can be done next?

Thanks

回复全部
回复作者
转发
0 个新帖子