Hi,
I just discovered that when I look at the output of a successful job, memory stays almost stable, as expected. But then I have those failed jobs: looking at the output of a failed job bumps the memory from ~200 MB to 2.2 GB! It may be related to the job itself, because the job (which runs once a day) currently has a 50:50 chance of failing, without a proper error message. From the log it looks like it just aborted:
PLAY [localhost] ***************************************************************
TASK [get golden_config device status] *****************************************
ok: [localhost]
TASK [check devices for recent backup] *****************************************
skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxx - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxxesd0011-2812 - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxxesd0012-2811 - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxxesx0011-2809 - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxxesx0012-2810 - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxxesx0021-2811 - 2022-09-19T01:01:00.818174Z - 4 hours old)
skipping: [localhost] => (item=xxxesx0022-2812 - 2022-09-19T01:01:00.818174Z - 4 hours old)
[ *** removed ~200 lines showing the same skipping: … lines *** ]
skipping: [localhost] => (item=xxxt172d - 2022-09-19T01:31:00.840784Z - 3 hours old)
skipping: [localhost] => (item=xxxt172e - 2022-09-19T01:31:00.840784Z - 3 hours old)
skipping: [localhost] => (item=xxxt172f - 2022-09-19T01:31:00.840784Z - 3 hours old)
skipping: [localhost] => (item=xxxt172g - 2022-09-19T01:31:00.840784Z - 3 hours old)
skipping: [localhost] => (item=xxxt172h - 2022-09-19T01:31:00.840784Z - 3 hours old)
skipping: [localhost] => (item=xxxt172i - 2022-09-19T01:31:00.840784Z - 3 hours old)
skipping: [localhost] => (item=xxxt172j - 2022-09-19T01:31:00.840784Z - 3 hours old)
skipping: [localhost] => (item=xxxt174a-1 - 2022-09-19T01:31:00.840784Z - 3 hours old)
skipping: [localhost] => (item=xxxt174b - 2022-09-19T01:31:00.840784Z - 3 hours old)
And that's it: the output just stops. The failure reason is "Job terminated due to error".
In the logs I can see:
awx.main.dispatch job 4848 (error) encountered an error (rc=None), please see task stdout for details.
awx.main.dispatch task 34f3c74d-4394-4c33-8e70-0cbd52c7e385 starting awx.main.tasks.system.handle_work_error(*['34f3c74d-4394-4c33-8e70-0cbd52c7e385'])
awx.main.tasks.system Executing error task id 34f3c74d-4394-4c33-8e70-0cbd52c7e385, subtasks: [{'type': 'job', 'id': 4848}]
Just reconfirmed: if I look at the output of the successful runs of the job, memory stays stable. As soon as I look at the output of a failed job, memory usage increases.
Thanks,
Andreas
--
You received this message because you are subscribed to a topic in the Google Groups "AWX Project" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/awx-project/BTQ51PblMaI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to awx-project...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/awx-project/9de7d01d-c556-4c1a-9460-1d4ce9ed40dbn%40googlegroups.com.

Hi,
Thanks for looking into that. I will provide the requested information tomorrow. Should I still bump the old issue, as suggested in the other mail (https://github.com/ansible/awx/issues/12644)?
Thanks,
Andreas
From: AWX Project <awx-p...@googlegroups.com>
Sent: Friday, 23 September 2022 21:05
To: AWX Project <awx-p...@googlegroups.com>
Subject: Re: [awx-project] Re: memory problems
I believe this is a real bug and could use some help tracking it down
what does "/api/v2/jobs/id/job_events/children_summary" return for the failed job?
what does that same endpoint return for the same job, but when successful?
Do you get 502 errors when viewing the job detail page in the UI?
Open the browser dev tools > network tab. When you go to the job output page in the UI, but don't scroll down, do any requests give a 502 or take a long time to respond?
e.g. [screenshot of the network tab not preserved]; note the status and time columns
Thanks for any information you can provide
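
For anyone who would rather collect the answer to the first two questions from a script than from the browser, here is a minimal Python sketch. Only the endpoint path comes from the question above; the host name, the token, and both helper names are placeholders made up for illustration. It assumes the AWX instance accepts an OAuth2 token in an `Authorization: Bearer` header, and it appends a trailing slash on the assumption that the API expects one.

```python
import json
import urllib.request


def children_summary_url(base_url: str, job_id: int) -> str:
    """Build the children_summary URL for one job (path taken from the thread)."""
    # Trailing slash is an assumption; the thread quotes the path without one.
    return f"{base_url.rstrip('/')}/api/v2/jobs/{job_id}/job_events/children_summary/"


def fetch_children_summary(base_url: str, job_id: int, token: str) -> dict:
    """GET the endpoint and decode the JSON body.

    urllib raises urllib.error.HTTPError for 4xx/5xx answers, so a
    "not found" or a 502 shows up as an exception rather than a dict.
    """
    request = urllib.request.Request(
        children_summary_url(base_url, job_id),
        headers={"Authorization": f"Bearer {token}"},  # hypothetical token
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

A call would look like `fetch_children_summary("https://awx.example.org", 4848, "<token>")`, where 4848 is the failed job id from the dispatcher log earlier in the thread and the host is invented.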

OK, I added the data to the GitHub issue (see https://github.com/ansible/awx/issues/12644#issuecomment-1256998192).
My findings were:
-> what does "/api/v2/jobs/id/job_events/children_summary" return for the failed job?
result:
{
  "children_summary": {
    "2": {
      "rowNumber": 1,
      "numChildren": 282
    },
    "3": {
      "rowNumber": 2,
      "numChildren": 281
    },
    "4": {
      "rowNumber": 3,
      "numChildren": 2
    },
    "7": {
      "rowNumber": 6,
      "numChildren": 277
    }
  },
  "meta_event_nested_uuid": {},
  "event_processing_finished": true,
  "is_tree": true
}
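
To make a response like this easier to compare between runs, a small Python sketch can rank the entries by `numChildren`. The JSON is copied from the result above; the function name `rank_parents` is made up, and the code makes no assumption about what the keys mean beyond treating them as labels.

```python
import json

# Response for the failed job, copied from the message above.
RESPONSE = """
{
  "children_summary": {
    "2": {"rowNumber": 1, "numChildren": 282},
    "3": {"rowNumber": 2, "numChildren": 281},
    "4": {"rowNumber": 3, "numChildren": 2},
    "7": {"rowNumber": 6, "numChildren": 277}
  },
  "meta_event_nested_uuid": {},
  "event_processing_finished": true,
  "is_tree": true
}
"""


def rank_parents(payload: dict) -> list:
    """Return (key, rowNumber, numChildren) tuples, largest numChildren first."""
    rows = [
        (key, entry["rowNumber"], entry["numChildren"])
        for key, entry in payload["children_summary"].items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranked = rank_parents(json.loads(RESPONSE))
for key, row_number, num_children in ranked:
    print(f"entry {key}: row {row_number}, {num_children} children")
```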
-> what does that same endpoint return for the same job, but when successful?
Result:
{"detail":"Nicht gefunden."}
(which means "not found")
-> Do you get 502 errors when viewing the job detail page in the UI? Open the browser dev tools > network tab. When you go to the job output page in the UI, but don't scroll down, do any requests give a 502 or take a long time to respond?
Result: No. There were no 502s, and the response time was reasonable (5 s):

But if I grab the scrollbar with my mouse and drag it halfway down, I can see a lot of requests being sent to the backend, and then there is a 502. Additionally, I can see that the response size is between 1 and 7 MB, which seems a bit large for loading the events.
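
Since screenshots are easy to lose, the same observation can be captured as text by exporting the network tab as a HAR file and filtering it. The following is a minimal Python sketch under stated assumptions: the function name and both thresholds (1 MB, 5 s) are arbitrary choices for illustration, and the field names are the ones defined by the HAR 1.2 format (`log.entries[].time`, `request.url`, `response.status`, `response.content.size`).

```python
import json


def flag_requests(har: dict, max_bytes: int = 1_000_000, max_ms: float = 5_000.0) -> list:
    """Return (status, size, ms, url) for entries that are 5xx, large, or slow.

    The thresholds are illustrative defaults, not values from the thread.
    """
    flagged = []
    for entry in har["log"]["entries"]:
        response = entry["response"]
        status = response["status"]
        size = response["content"].get("size", 0)  # body size in bytes
        elapsed_ms = entry["time"]  # total request time in milliseconds
        if status >= 500 or size > max_bytes or elapsed_ms > max_ms:
            flagged.append((status, size, elapsed_ms, entry["request"]["url"]))
    return flagged
```

With a saved export (the file name here is hypothetical), usage would be `flag_requests(json.load(open("awx-job-output.har")))`, which returns only the requests worth attaching to the issue.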
Here's a screenshot, after I moved the scrollbar down:

Once more - thanks for looking into this!
Best regards,
Andreas
Hi,
I changed both container-log-max-files and container-log-max-size, in /etc/systemd/system/k3s.service:
ExecStart=/usr/local/bin/k3s \
server \
'--kubelet-arg' \
'container-log-max-files=4' \
'--kubelet-arg' \
'container-log-max-size=50Mi' \
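
As a rough sanity check of what these two values add up to, here is a small Python sketch that pulls the `--kubelet-arg` settings out of the command line and multiplies them. It assumes the retained log data per container is bounded by roughly files × size; the helper names are made up, and the command string contains only the arguments shown above (the real unit file continues after the last backslash).

```python
import re
import shlex

# ExecStart fragment from the message above, joined onto one line.
# Further arguments in the real unit file are not shown in the thread.
EXEC_START = (
    "/usr/local/bin/k3s server "
    "'--kubelet-arg' 'container-log-max-files=4' "
    "'--kubelet-arg' 'container-log-max-size=50Mi'"
)

UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def kubelet_args(command: str) -> dict:
    """Collect name=value pairs that directly follow a --kubelet-arg flag."""
    tokens = shlex.split(command)
    args = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "--kubelet-arg" and "=" in value:
            name, _, setting = value.partition("=")
            args[name] = setting
    return args


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '50Mi' to bytes (only Ki/Mi/Gi handled here)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)?", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, unit = match.groups()
    return int(number) * UNITS.get(unit, 1)


args = kubelet_args(EXEC_START)
cap = int(args["container-log-max-files"]) * to_bytes(args["container-log-max-size"])
print(f"approximate log cap per container: {cap // 1024**2} MiB")
```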
From: AWX Project <awx-p...@googlegroups.com>
Sent: Wednesday, 28 September 2022 18:04
To: AWX Project <awx-p...@googlegroups.com>
Subject: Re: [awx-project] Re: memory problems
Thanks Andy, can you let us know what container log rotation setting you changed? Was it just "max-size"?