memory problems

375 views
Skip to first unread message

Andreas Bourges

unread,
Sep 19, 2022, 1:34:28 AM9/19/22
to AWX Project
Hi,

...we're experiencing problems running awx jobs due to memory exhaustion, I guess. The VM has decent memory (16G) and our jobs ran well for quite some time... But right now, it's almost a 50:50 chance, that jobs are failing (output just stops - no error in the details of the job output...) - when I look at the system, I can see that there's almost no memory left:

top - 07:26:54 up 2 days, 22:48,  3 users,  load average: 0.00, 0.04, 0.07
Tasks: 289 total,   1 running, 287 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.1 us,  0.2 sy,  0.0 ni, 98.5 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  16002.2 total,    572.8 free,  11127.4 used,   4302.0 buff/cache
MiB Swap:   4096.0 total,   4082.4 free,     13.6 used.   4404.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4452 u032359   20   0 3227024   3.0g  21540 S   0.0  19.5   0:21.45 uwsgi
   4453 u032359   20   0 2733744   2.6g  22308 S   0.0  16.5   0:34.42 uwsgi
   4454 u032359   20   0 1689112   1.6g  17964 S   0.0  10.2   0:08.28 uwsgi
   4450 u032359   20   0 1192644   1.1g  21752 S   0.0   7.1   0:05.32 uwsgi
   4451 u032359   20   0  806172 782648  17308 S   0.0   4.8   0:03.96 uwsgi
    841 root      20   0 1394412 514280  91092 S   8.9   3.1 277:29.50 k3s-server
   4488 u032359   20   0  175364 145524   9556 S   0.0   0.9   1:46.60 awx-manage
   4475 u032359   20   0  678812 141732  17236 S   0.0   0.9   4:41.08 awx-manage
 214943 u032359   20   0  682804 138636  11212 S   0.0   0.8   0:00.65 awx-manage
 214881 u032359   20   0  682872 138616  11212 S   0.0   0.8   0:00.68 awx-manage
   4489 u032359   20   0  167396 137672   9548 S   0.0   0.8   1:50.24 awx-manage
 215195 u032359   20   0  682688 136520   9352 S   0.0   0.8   0:00.39 awx-manage
 215273 u032359   20   0  682428 136324   9348 S   0.0   0.8   0:00.25 awx-manage
   4476 u032359   20   0  155472 134056  17168 S   0.0   0.8   1:54.50 awx-manage
   4490 u032359   20   0  163384 132588   9544 S   0.0   0.8   1:50.01 awx-manage
   4491 u032359   20   0  163064 132340   9512 S   0.0   0.8   1:52.48 awx-manage
   4447 u032359   20   0  210160 118848  17620 S   0.0   0.7   1:55.35 awx-manage
   4486 u032359   20   0  153708 108940   6328 S   0.0   0.7   3:49.42 awx-manage
   4446 u032359   20   0  195720 105356  17296 S   0.0   0.6   1:19.55 daphne
   3409 999       20   0  213128  84176  82252 S   0.0   0.5   0:02.74 postgres
   2656 65532     20   0  799396  72800  55004 S   0.0   0.4   1:32.21 traefik
    485 root      19  -1  145496  69596  68420 S   0.0   0.4   0:04.88 systemd-journal
   1118 root      20   0  768336  61364  30612 S   0.0   0.4  31:43.44 containerd
   3596 u032359   20   0  750832  47664  28832 S   0.0   0.3  19:25.71 metrics-server
   3021 root      20   0  754780  43756  30116 S   0.0   0.3   8:12.12 coredns
   3091 isarnet   20   0  742924  39912  22044 S   0.0   0.2   7:31.85 ansible-operato
   4057 u032359   20   0 2368144  36252  21900 S   1.0   0.2  13:31.51 receptor
    521 root      rt   0  289460  27244   9076 S   0.0   0.2   0:22.01 multipathd
   3927 u032359   20   0   29316  25156   9300 S   0.0   0.2   0:47.68 supervisord
   2668 root      20   0  734716  24912  18076 S   0.0   0.2   1:39.18 local-path-prov
   3981 u032359   20   0   29032  24760   9068 S   0.0   0.2   0:58.72 supervisord


Any hint on how to troubleshoot that further? It seems like the uwsgi process consumes way to much memory. AWX is running on k3s:

awx-operator: 0.23.0
k3s version v1.23.6+k3s1 (418c3fa8)
go version go1.17.5
AWX 21.2.0

Jobs are using dynamic inventory from nautobot.

Thanks,

Andreas  

Andreas Bourges

unread,
Sep 19, 2022, 11:34:17 AM9/19/22
to AWX Project
To clarify this - we observed that when we have a look at the job output, uwsgi starts to use more and more memory. Initially everything looks ok, but as soon as we start looking at the job output, we run into problems :-(

Any hint on how to troubleshoot that?

thanks,

Andreas

AWX Project

unread,
Sep 21, 2022, 2:46:44 PM9/21/22
to AWX Project
How many events are in the job output of the job in question? Also, are the jobs in a finished state ("succeeded" or "error") or running state?

AWX Team

andyb...@googlemail.com

unread,
Sep 22, 2022, 10:57:18 AM9/22/22
to awx-p...@googlegroups.com

Hi,

 

…just discovered, that when I look at the job output of a successful job, memory seems to be almost stable, as expected. But then I have those failed jobs – looking at the output of a failed job bumps the memory from ~200MB to 2.2 GB! Maybe it is related to the job, because the job (which runs once a day) has currently a 50:50 chance of failing – without a proper error message. From the log it looks like it just aborted:

 

PLAY [localhost] ***************************************************************

 

TASK [get golden_config device status] *****************************************

ok: [localhost]

 

TASK [check devices for recent backup] *****************************************

skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesd0011-2812 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesd0012-2811 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesx0011-2809 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesx0012-2810 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesx0021-2811 - 2022-09-19T01:01:00.818174Z - 4 hours old)

skipping: [localhost] => (item=xxxesx0022-2812 - 2022-09-19T01:01:00.818174Z - 4 hours old)

 

[ *** removed ~200 lines showing the same skipping: … lines *** ]

skipping: [localhost] => (item=xxxt172d - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172e - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172f - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172g - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172h - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172i - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt172j - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt174a-1 - 2022-09-19T01:31:00.840784Z - 3 hours old)

skipping: [localhost] => (item=xxxt174b - 2022-09-19T01:31:00.840784Z - 3 hours old)

 

 

and that’s it – the output just stops. Failure reason is “Job terminated due to error”

 

In the logs I can see:

 

awx.main.dispatch job 4848 (error) encountered an error (rc=None), please see task stdout for details.

awx.main.dispatch task 34f3c74d-4394-4c33-8e70-0cbd52c7e385 starting awx.main.tasks.system.handle_work_error(*['34f3c74d-4394-4c33-8e70-0cbd52c7e385'])

awx.main.tasks.system Executing error task id 34f3c74d-4394-4c33-8e70-0cbd52c7e385, subtasks: [{'type': 'job', 'id': 4848}]

 

Just reconfirmed – if I have a look at the output of the successful runs of the job, memory stays stable. As soon as I have a look at a failed job output memory usage increases.

 

Thanks,

 

Andreas

--
You received this message because you are subscribed to a topic in the Google Groups "AWX Project" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/awx-project/BTQ51PblMaI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to awx-project...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/awx-project/9de7d01d-c556-4c1a-9460-1d4ce9ed40dbn%40googlegroups.com.

AWX Project

unread,
Sep 23, 2022, 1:17:25 PM9/23/22
to AWX Project
Hi,

I believe this AWX issue describes the same problem https://github.com/ansible/awx/issues/12644. For now can you please bump that issue thread and summarize your issue there?

AWX Team

AWX Project

unread,
Sep 23, 2022, 3:04:42 PM9/23/22
to AWX Project
I believe this is a real bug and could use some help tracking it down

what does "/api/v2/jobs/id/job_events/children_summary" return for the failed job?

what does that same endpoint return for the same job, but when successful?

Do you get 502 errors when viewing the job detail page in the UI? 
open the browser dev tools > network tab. When you go the job output page in the UI, but don't scroll down. Do any requests give a 502 or take a long time to respond?

e.g.

note the status and time columns

Thanks for any information you can provide

 Screenshot from 2022-09-23 15-03-01.png

andyb...@googlemail.com

unread,
Sep 23, 2022, 3:24:43 PM9/23/22
to awx-p...@googlegroups.com

Hi,

 

thanks for looking into that – will provide the requested information tomorrow. Should I still bump the old issue as told in the other mail (https://github.com/ansible/awx/issues/12644)?

 

Thanks,

 

Andreas

 

Von: AWX Project <awx-p...@googlegroups.com>
Gesendet: Freitag, 23. September 2022 21:05
An: AWX Project <awx-p...@googlegroups.com>
Betreff: Re: [awx-project] Re: memory problems

 

I believe this is a real bug and could use some help tracking it down

 

what does "/api/v2/jobs/id/job_events/children_summary" return for the failed job?

 

what does that same endpoint return for the same job, but when successful?

 

Do you get 502 errors when viewing the job detail page in the UI? 

open the browser dev tools > network tab. When you go the job output page in the UI, but don't scroll down. Do any requests give a 502 or take a long time to respond?

 

e.g.

 

note the status and time columns

 

Thanks for any information you can provide

 

 

image001.png

AWX Project

unread,
Sep 23, 2022, 3:40:39 PM9/23/22
to AWX Project
when you collect data you can probably put it in issue (and here too, or a link to your github comment at least) so everything is in one place. Plus a better chance of other devs taking a look on github

andyb...@googlemail.com

unread,
Sep 24, 2022, 11:57:03 AM9/24/22
to awx-p...@googlegroups.com

OK - I addded the data to the github issue (see https://github.com/ansible/awx/issues/12644#issuecomment-1256998192)

 

My findings were:

 

-> what does "/api/v2/jobs/id/job_events/children_summary" return for the failed job?

result:

 

{

  "children_summary": {

    "2": {

      "rowNumber": 1,

      "numChildren": 282

    },

    "3": {

      "rowNumber": 2,

      "numChildren": 281

    },

    "4": {

      "rowNumber": 3,

      "numChildren": 2

    },

    "7": {

      "rowNumber": 6,

      "numChildren": 277

    }

  },

  "meta_event_nested_uuid": {},

  "event_processing_finished": true,

 "is_tree": true

}

 

-> what does that same endpoint return for the same job, but when successful?

Result:

{"detail":"Nicht gefunden."}
(which means "not found")

 

-> Do you get 502 errors when viewing the job detail page in the UI? open the browser dev tools > network tab. When you go the job output page in the UI, but don't scroll down. Do any requests give a 502 or take a long time to respond?

Result: No - No 502's and a resonable amount of time to respond (5s):

image

 

 

But - if I pick up the scrollbar with my mouse and move it half way down, I can see a lot of requests being sent to the backend and then there is a 502 ... Additionally, I can see the response size is between 1 and 7 MB - seems a bit large for loading the events.

 

Here's a screenshot, after I moved the scrollbar down:

 

image

 

Once more - thanks for looking into this!

 

Best regards,

 

Andreas

 

image001.png
image002.png

Andreas Bourges

unread,
Sep 28, 2022, 11:30:50 AM9/28/22
to AWX Project
Just to let you know - reducing the data received from the inventory and tune container log rotation made our jobs run stable again!

Thanks for your support!

AWX Project

unread,
Sep 28, 2022, 12:03:47 PM9/28/22
to AWX Project
Thanks Andy, can you let us know what container log rotation setting you changed? Was it just "max-size"?

andyb...@googlemail.com

unread,
Sep 28, 2022, 11:58:15 PM9/28/22
to awx-p...@googlegroups.com

Hi,

 

container-log-max-files  and container-log-max-size – modified in /etc/systemd/system/k3s.service:

 

 

ExecStart=/usr/local/bin/k3s \

    server \

       '--kubelet-arg' \

       'container-log-max-files=4' \

       '--kubelet-arg' \

       'container-log-max-size=50Mi' \

 

 

 

Von: AWX Project <awx-p...@googlegroups.com>
Gesendet: Mittwoch, 28. September 2022 18:04
An: AWX Project <awx-p...@googlegroups.com>
Betreff: Re: [awx-project] Re: memory problems

 

Thanks Andy, can you let us know what container log rotation setting you changed? Was it just "max-size"?

--

You received this message because you are subscribed to a topic in the Google Groups "AWX Project" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/awx-project/BTQ51PblMaI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to awx-project...@googlegroups.com.

Reply all
Reply to author
Forward
0 new messages