Config Repositories always stuck in fetching status


Komgrit Aneksri

Aug 8, 2023, 12:23:43 AM
to go-cd
Hello GoCD team,

I am facing an issue on the Config Repositories page.

All Config Repositories are permanently stuck in fetching status against GitLab (16.2.3).
Our GoCD is version 23.1.0, running on Kubernetes on ARM64.

In the server log, there is this error message

"jvm 1    | 2023-08-08 04:18:17,151 ERROR [qtp1962126505-38] VariableReplacer:385 - function ${escape:} type 'escape:' not a valid type" 

when I refresh or open the Config Repositories page.
1691468556134.jpg

Please help us fix it.

Best Regards,
Komgrit

Chad Wilson

Aug 8, 2023, 1:16:45 AM
to go...@googlegroups.com
That error message is not relevant to the issue, you can ignore it. Are there other errors or timeouts in the logs?

To refresh config repos, GoCD forks regular git processes to clone and then fetch. You might want to exec into the container and see what these processes are doing (high CPU? stuck somehow?).

There are also a few general suggestions from similar past issues that you might want to check:
  • If this is happening after a restart of the GoCD server, or due to some other change you've made, it's possible GitLab is throttling the requests.
  • Depending on whether you are using https or ssh connections, you may want to use standard git environment variables to debug what's happening (GIT_TRACE=1, GIT_CURL_VERBOSE=1 etc).
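For example, a quick way to use those variables from inside the server container (the pod name and repository URL below are placeholders, not your actual values):

```shell
# Exec into the GoCD server container first, e.g.:
#   kubectl exec -it <gocd-server-pod> -- /bin/bash

# Enable git's own tracing for a manual test of one config repo
# (the URL below is a placeholder):
export GIT_TRACE=1
export GIT_CURL_VERBOSE=1   # only meaningful for https:// remotes
# git ls-remote https://gitlab.example.com/group/config-repo.git HEAD

# Trace output goes to stderr; even a trivial command shows it working:
git version 2>&1 | grep trace
```

The trace output shows each sub-process and (for https) each HTTP request git makes, which helps separate a slow network from a slow GitLab response.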

-Chad


Komgrit Aneksri

Aug 8, 2023, 4:06:17 AM
to go-cd
Hello Chad,

Thank you for your advice.

I tried running git clone with the debug environment variables in the GoCD server pod. The command ran successfully, and there were no abnormal logs.

CPU and memory usage are not high.

We tried restarting the GoCD server pods many times, but it did not help.

I also dug through the logs for timeouts and found this message:
 INFO   | wrapper  | 2023/08/08 07:43:48 | Wrapper Process has not received any CPU time for 22 seconds.  Extending timeouts. 

Best Regards,
Komgrit

Chad Wilson

Aug 8, 2023, 5:11:18 AM
to go...@googlegroups.com
If you are getting logs like that, it sounds like the container is experiencing CPU starvation.
  • What did you change around the time this behaviour started happening?
  • Which container image for the server are you using? (only gocd-server-centos-9 is built for ARM64 - if you are trying to run emulated you will probably have many problems)
  • What are the CPU requests and limits that you have assigned to the gocd server pod? And are you deploying with the standard GoCD helm chart? If the limits are too low, or the requests are low and other processes on the same node are using too much CPU you can still end up with CPU starvation, even if it looks like the pod isn't using much CPU, because it may be throttled.
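One way to check for that kind of throttling from inside the container is the cgroup CPU statistics (a sketch only; the paths differ between cgroup v1 and v2, and nothing here is GoCD-specific):

```shell
# cgroup v1 exposes nr_periods / nr_throttled / throttled_time here;
# cgroup v2 exposes nr_throttled / throttled_usec in cpu.stat instead.
cat /sys/fs/cgroup/cpu/cpu.stat 2>/dev/null \
  || cat /sys/fs/cgroup/cpu.stat 2>/dev/null \
  || echo "no cgroup CPU stats found"
# A steadily increasing nr_throttled means the kernel is capping the pod's CPU,
# even though node-level CPU% can look low.
```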
-Chad

Komgrit Aneksri

Aug 8, 2023, 9:19:07 AM
to go-cd
Hello Chad,

What did you change around the time this behaviour started happening?
- GoCD: no changes
- GitLab: upgraded from 16.2.0 to 16.2.3

Which container image for the server are you using? (only gocd-server-centos-9 is built for ARM64 - if you are trying to run emulated you will probably have many problems)
- I am using the official gocd/gocd-server-centos-9:v23.1.0

What are the CPU requests and limits that you have assigned to the gocd server pod? And are you deploying with the standard GoCD helm chart? If the limits are too low, or the requests are low and other processes on the same node are using too much CPU you can still end up with CPU starvation, even if it looks like the pod isn't using much CPU, because it may be throttled.
- I am using the official helm chart v2.1.6, with no CPU or memory limits; only requests are set:
 resources:
    requests:
      memory: 2048Mi
      cpu: 1000m
PS. The GoCD server is running on a c6g.large instance type, and below is the kubectl top node result.
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%  
ip-xx-xxx-xx-xx.ap-southeast-1.compute.internal   74m          3%     2570Mi          81%        


Best Regards,
Komgrit

Chad Wilson

Aug 8, 2023, 10:24:43 AM
to go...@googlegroups.com
Hmm, given your description and the basic metrics you shared, the behaviour sounds strange.

To step back slightly and confirm the issue, please look inside the container at the process tree (via ps, top etc) and see whether there are a large number of forked git processes. If there are, we want to see what they are doing and focus there. If there are not, the problem may be somewhere else inside GoCD causing the config repo loads to get stuck, and we need to look in a different area.
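A rough way to do that check from inside the container (plain ps, nothing GoCD-specific; run it a few times to see whether the same processes stay stuck):

```shell
# List any forked git processes with PID, elapsed time, and CPU usage.
# The [g] bracket trick stops grep from matching its own process line:
ps -eo pid,etime,pcpu,args | grep '[g]it ' || echo "no git processes running"
```

If the same PIDs persist across runs with near-zero CPU, they are likely blocked on the network or the remote; if they churn constantly, operations may simply be queued.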

If you changed nothing on GoCD or its hardware/host/config, that seems to point to something outside GoCD as the source of the problem, unless it is a problem that started as a side effect of restarting GoCD.

-Chad

Komgrit Aneksri

Aug 8, 2023, 11:53:12 AM
to go-cd
Hello Chad,

I looked with top and ps, but there are no weird or stuck processes.

Pictures attached:

1691509554118.jpg 

1691509751182.jpg

BR,
Komgrit

Sriram Narayanan

Aug 8, 2023, 12:20:00 PM
to go...@googlegroups.com
You have shared that you migrated from one version of GitLab to another.

Could you set up a configuration repo on a different system (file system, GitHub, an older GitLab that worked) and see if GoCD is able to load configuration from that different system?

This could be a test GoCD instance, or the same GoCD instance with an additional configuration repository (with the other configuration repository settings disabled/commented out).

Testing with such a different system might help you identify whether the problem is with the GitLab upgrade.


Separately, do you see any insightful messages in the Gitlab server log?

— Sriram

Chad Wilson

Aug 8, 2023, 12:23:25 PM
to go...@googlegroups.com
That is quite a lot of forked git processes. If there is constantly the same number of forked git processes (over, say, a 1 minute period), that is likely the server at its maximum default "material updates" concurrency. This may mean git operations are queued behind each other and you possibly can't fetch/check for git updates fast enough. The server typically logs something when this is happening - you might want to inspect the logs more closely.

Git operations being queued is possibly also what is happening to your config repositories, which is why you see them constantly "refreshing".

Since all the processes seem to be at low CPU usage, this implies to me that they are probably waiting for the network or your GitLab server (or theoretically the local disk in the container is slow, but less likely a cause). As suggested earlier, I think you are going to need to analyze git speed further, to see what is happening with your network connectivity to the GitLab server, and possibly check the GitLab server metrics itself. If you upgraded or changed something on GitLab, I'd suggest comparing its metrics from before the change/upgrade to afterwards and that type of thing.
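One simple way to analyze that git speed is just to time a single remote listing from inside the container (the URL is a placeholder; substitute one of your actual material repositories):

```shell
# Time a single ls-remote against the GitLab server; a long wall-clock
# time with near-zero CPU points at the network or the remote server:
#   time git ls-remote https://gitlab.example.com/group/repo.git HEAD

# The same technique works against any repository, e.g. a throwaway local one:
git init -q /tmp/timing-demo
time git -C /tmp/timing-demo ls-remote /tmp/timing-demo
```

Comparing the local timing against the remote one gives a rough split between git overhead and network/remote latency.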

-Chad

Komgrit Aneksri

Aug 16, 2023, 1:01:04 AM
to go-cd
Hello Chad,

Thank you for your suggestion. Just for your information:

I fixed this issue last week.

The root cause was related to our EFS throughput for the stored flyweight folder, /home/go, artifacts, ....

The issue was solved by changing the EFS throughput mode from bursting to elastic.

However, I am still investigating why the new GoCD server (v23.1.0) uses higher throughput than the current GoCD server (v22.1.0).

They have the same configuration; the only differences are the GoCD version and that the new GoCD server uses only git over https.

Best Regards,
Komgrit

Chad Wilson

Aug 16, 2023, 1:22:11 AM
to go...@googlegroups.com
Thanks so much for sharing back! This does make sense. I have also experienced issues with EFS related to things like this. I would have suggested checking disk performance if I had realised a GoCD server replacement+upgrade was part of what had changed :-)

While it has some challenges in terms of being AZ specific, generally I have had better experience mounting EBS volumes for use by GoCD (rather than network-based stores such as EFS), although that does limit which AZ your GoCD server can run in (without manual intervention), so it depends on your wider deployment architecture whether that is acceptable.

I can't think of any major reason the GoCD server version change on its own would cause higher throughput usage than your older version GoCD Server. One thing worth thinking about is that, in my recollection, EFS in bursting mode varies speeds a lot based on the size of the storage. If your new server has much lower storage/use of EFS than your old server, then the limits may be different (e.g. if you wiped a lot of artifacts while re-using the same EFS volume, or created a new EFS volume which is a lot smaller). I'd suggest comparing the AWS-side metrics for your EFS throughput between the two to compare what their usage, credits, and limits are, per https://docs.aws.amazon.com/efs/latest/ug/performance.html.

For small EFS volumes, the baseline throughput is pretty terrible (15 MiBps read, 5 MiBps continuously), and GoCD servers tend to be rather write heavy if you have heavy use of artifacts within GoCD itself.

I am not sure if use of https-git has any major implications for disk usage on the git side of things compared to ssh, but I would not have thought it'd majorly change the throughput requirements. If EFS volume size doesn't explain the issues, and you have changed all of the material URLs to https://, perhaps you want to compare other aspects of your material configuration for changes in
  • the # of distinct materials known to GoCD on the Materials tab (old vs new)
  • the # of these materials that are auto-updating (polling, the default) compared to having auto-update disabled (e.g. if you also use Webhooks)
-Chad
