Ansible + Mitogen performance problem


Hans Dampf

May 4, 2023, 4:33:16 AM
to go-cd
Hello,

our setup consists of 10 workers with 15 agents each. We run ansible + mitogen on the agents. Currently, we have a problem with the go-agent + mitogen.

Mitogen itself is a tool to speed up ansible runs by "tunneling" multiple tasks over one SSH connection.

If we use it on the worker directly on the CLI, without the agent, it runs very well:

Basic Ansible: ~ 5min
Ansible + Mitogen: ~ 1.5 min
Ansible + Mitogen + Go-agent (expected): ~2 min
Ansible + Mitogen + Go-agent (currently): ~ 10 -  30 min

Now, if we start ansible with mitogen enabled IN the go-agent, the runtime is significantly longer than the basic run.
Some runs slow down to 10-30 min, which is highly unusual since they should only take 2-5 min. Run directly on the CLI, it is as fast as expected.

Strangely, this was not the case from the beginning. It only started after an incident during which we had to stress all 150 agents at once.

We already reinstalled ansible, mitogen and the go-agent itself, but the degraded performance persists.

I hope somebody can help with how to debug this further, since the last resort would be a complete reinstall of the whole worker nodes.

Regards

Ketan Padegaonkar

May 4, 2023, 4:43:29 AM
to go...@googlegroups.com
It's unclear from your problem description if the entire job is taking 10-30 minutes, or the task is taking 10-30 minutes. You mention that running locally from the agent is quick — it is unclear if you're running your task as `go` user or `root` user. For context, there are other overheads in jobs that include for example — checking out code, cleaning the working directory (if configured to do so). At the end of all tasks, the agent will also upload all artifacts/console logs back to the gocd server.

If I were in your place, I would do the following next steps:

- See if the script can be run in quiet mode. Maybe redirect the output to /dev/null, if possible, and check how long it takes to run just ansible + mitogen. This is to eliminate possible issues or slowness caused by GoCD taking time to "read" the output from your deployment.
- Next, turn on more debug/verbose output in ansible + mitogen to see if there are things that the GoCD agent might be doing that could be affecting your deploy timings. For example, any spurious environment variables that GoCD might be setting, or perhaps some SSH configs that might be affecting the deployment.
- Run the `env` command before your job to dump any environment variables that are applicable for that job. You can then `export` these environment variables from the shell (as the `go` user) and run the script to see if there is any difference.
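A minimal sketch of that env-comparison step (file paths here are just placeholders):

```shell
# 1. Inside a GoCD job task, dump the environment the agent provides:
env | sort > /tmp/env-agent.txt

# 2. In an interactive shell on the same host, as the 'go' user:
env | sort > /tmp/env-cli.txt

# 3. Compare the two dumps; any differing SSH_*, PATH, ANSIBLE_* or GO_*
#    entries are candidates for what changes behaviour under the agent.
diff /tmp/env-cli.txt /tmp/env-agent.txt
```

Variables listed only on the agent side can then be `export`ed in the interactive shell before re-running the playbook, to see whether the slowdown follows the environment.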

- Ketan




Hans Dampf

May 4, 2023, 5:46:51 AM
to go-cd

It's not just one task; it's the whole playbook that is slower.
Locally, yes, as the go user.
This runs with normal performance:
go@host1:~$ ansible-playbook slowplaybook.yaml -i inventory

On the same machine, the same playbook executed by the go-agent is slow.
It ran fast in the past, until the incident with the heavy load on the agents and a big backlog:
100% usage of all 150 agents + 200 jobs in the backlog.
Besides this, there were no changes to the playbook or the settings of the agents (env variables).

Normally we only use about 40-50 agents and have no backlog.

Is there maybe a cache file or lock file created by the agents which does not get deleted on deinstallation?

Ketan Padegaonkar

May 4, 2023, 6:06:13 AM
to go...@googlegroups.com
> Is there maybe a cachefile or lockfile created by the agents which does not get deleted with a deinstallation?

This might help find anything owned by the go user.

$ sudo find / -user go
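If the concern is a stale cache or lock file, that search can be refined to surface the most recently written files owned by `go` (GNU find's `-printf` is assumed to be available):

```shell
# List regular files owned by 'go', newest modification time first, so any
# recently written cache/lock files appear at the top. Errors from
# unreadable paths are hidden with 2>/dev/null.
sudo find / -user go -type f -printf '%T@ %p\n' 2>/dev/null \
  | sort -rn | head -n 20
```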

- Ketan



Hans Dampf

May 5, 2023, 2:06:57 AM
to go-cd
OK, I did more testing and built a new setup from scratch. As expected, the performance was very good.
Then we moved one of the old "broken" worker nodes from the old setup to the new setup and, unexpectedly, its performance was also very good again.

So there seems to be some slowdown on the go-server side or with the communication with the nodes.

Chad Wilson

May 5, 2023, 2:58:23 AM
to go...@googlegroups.com
What is a "workernode" in this context? This isn't GoCD terminology, so it's unclear what it means.

GoCD agents simply fork processes to run your tasks within the 'go' user context of the agent process. IIRC the entire "wrapping" environment from the agent process should be propagated to the tasks, so there could be differences there depending on how you install and launch your agents.

There's not really any magic here, and the server has no role (synchronously) once the agent knows what job needs to be run, and starts cloning/fetching materials and kicking off tasks. You can see what the agent is doing for each job/task in the console log to see where the time is being spent.

If the agents are "static" and the jobs create mutable content locally (e.g. virtualenvs or other such stuff), you also might want to consider whether you should enable "Clean working directory" at the stage level to ensure a clean state before your jobs' tasks run.

Other than that, it seems likely to me that there is some kind of configuration at your host or OS user level (as Ketan hints at) that is affecting mitogen/ansible. Perhaps the way mitogen, ansible or python are installed, something different in the python environment, or some kind of different configuration that is applied when running via the agent vs directly on the node (SSH config? mitogen or ansible config?).

I'd dump both env and tool config from within a GoCD task and compare between "good" and "bad" setups. There is likely something different there in how things are running.
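One way to capture both the binaries and the tool config for that comparison (`ansible-config dump --only-changed` is a real ansible subcommand; the output path and the exact set of things worth dumping are assumptions):

```shell
# Run this once from a GoCD task and once from an interactive 'go' shell
# on a "good" and a "bad" node, then diff the resulting files.
out="/tmp/toolconfig-$$.txt"   # $$ keeps runs from overwriting each other
{
  echo "== binaries =="
  command -v python3 ansible-playbook 2>/dev/null || true
  echo "== versions =="
  ansible --version 2>/dev/null || echo "ansible not on PATH"
  echo "== non-default ansible config =="
  ansible-config dump --only-changed 2>/dev/null || true
} > "$out"
echo "wrote $out"
```

A different interpreter, virtualenv, or ansible.cfg being picked up under the agent is exactly the kind of difference this would expose.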

-Chad


Hans Dampf

May 17, 2023, 8:35:56 PM
to go-cd
So we basically "fixed" our problem. The problem is that mitogen scales awfully with multiple playbooks running at the same time on the same machine. It seems it does not make use of multiple CPU cores: CPU 1 on every server is always running at 100%. We suspect this is because mitogen only uses this one core for its calculations, and then it blocks itself. If you can keep the usage of this one core below 100%, everything seems fine and the acceleration by mitogen is noticeable again.
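The single-core saturation described above can be confirmed without extra tooling by sampling `/proc/stat` per core (a Linux-only sketch; the field offsets assume a modern kernel with 10 counters per line, and `mpstat -P ALL 1` from the sysstat package shows the same thing more readably):

```shell
# Sample per-core counters twice, one second apart, and print how many
# busy jiffies (user + nice + system) each core consumed in that second.
# A core pinned by a single-threaded process shows a far larger delta
# than its siblings.
grep '^cpu[0-9]' /proc/stat > /tmp/stat1
sleep 1
grep '^cpu[0-9]' /proc/stat > /tmp/stat2
paste /tmp/stat1 /tmp/stat2 | awk '{
  busy1 = $2 + $3 + $4      # first sample:  user, nice, system
  busy2 = $13 + $14 + $15   # second sample: same fields, shifted by 11
  printf "%s busy jiffies: %d\n", $1, busy2 - busy1
}'
```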

The fix was to install more servers running go-agents and to reduce the number of go-agents on each server. It's more of a workaround, but I don't expect mitogen will ever get any bigger updates again.
Currently, we have 20 servers with 10 go-agents each. To be honest, I think 10 agents are still too many, and if all 10 are running, mitogen will start slowing down again.

Next, we will try to get the go-agent running in Docker Swarm. We hope this scales better.

Chad Wilson

May 17, 2023, 9:30:14 PM
to go...@googlegroups.com
Thanks for sharing!

For what it's worth, the https://github.com/gocd-contrib/docker-swarm-elastic-agent-plugin has not been released for a long time. It probably works OK with respect to GoCD interfaces (as these have not changed), but may or may not work correctly with the latest Swarm features, and likely has outdated dependent libraries. While I have been merging some PRs for dependencies, the lack of a release is partly due to a lack of perceived interest, and partly because I don't personally have experience with Swarm to sanity-test it, or an understanding of the ecosystem since Docker's spin-offs and such. Worth keeping in mind if you go down this path rather than, say, Kubernetes, plain Docker on a single host, or the cloud provider plugins (which have had releases).

-Chad
