failed: amqp:resource-limit-exceeded local-idle-timeout expired

vishavjeet somwanshi

Apr 9, 2025, 6:43:38 AM
to Skupper
Hi Team, 

We have two OpenShift clusters: site A is hosted on IBM Cloud (ROKS) and site B is hosted on AWS (ROSA). We generate the token at site A and create the link at site B. We have exposed several services from site A to site B and vice versa. At site B, Skupper is installed in several namespaces, each of which exposes services. Because these namespaces are connected to each other through Skupper links, all N services are visible in every namespace at site A and site B where Skupper is configured or linked.

Question 1: Site A, where we generate the token, is our main site. How much traffic can its skupper-router handle? Is there a limit on the number of services that can be exposed?

Question 2: We frequently see the error below in a few of the site B namespaces. When it occurs, the skupper-router breaks connectivity between that site B namespace and the site A namespace.

2025-03-26 08:27:08.006312 +0000 SERVER (error) [C45] Connection to skupper-inter-router-namespacesiteA.siteAcluster.xyzdomain:443 failed: amqp:resource-limit-exceeded local-idle-timeout expired

So, is there a specific limit on this queueing mechanism (AMQP), such that it can handle only a certain number of requests before it raises the error above? If so, can we increase the limit to handle more requests?

Thanks,
Vishavjeet S

Ganesh Murthy

Apr 9, 2025, 7:32:53 AM
to vishavjeet somwanshi, Skupper
Hi Vishavjeet,

On Wed, Apr 9, 2025 at 6:43 AM vishavjeet somwanshi <vishavjeet...@gmail.com> wrote:

Question 1: Site A, where we generate the token, is our main site. How much traffic can its skupper-router handle? Is there a limit on the number of services that can be exposed?
Skupper can handle large amounts of traffic, but it may limit the traffic rate to avoid using too much memory.

Question 2: We frequently see the error below in a few of the site B namespaces. When it occurs, the skupper-router breaks connectivity between that site B namespace and the site A namespace.

2025-03-26 08:27:08.006312 +0000 SERVER (error) [C45] Connection to skupper-inter-router-namespacesiteA.siteAcluster.xyzdomain:443 failed: amqp:resource-limit-exceeded local-idle-timeout expired
This error occurs when the skupper-router at one site does not receive a heartbeat from the skupper-router at the other site for 16 seconds. Because it hears nothing from its peer, the router assumes that the peer is gone and bounces the inter-router connection.
Either there is a loss of connectivity between the skupper-routers due to network issues, or the load on the routers is so high that they are unable to send heartbeats to each other.
What sort of workloads are you running? Is the network between the sites stable?
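
The heartbeat check described above can be sketched as a small idle-timeout monitor. This is only an illustration of the idea, not skupper-router's actual code: the `IdleTimeoutMonitor` class and its method names are hypothetical, and the 16-second value is simply the figure quoted in this thread.

```python
import time


class IdleTimeoutMonitor:
    """Tracks when a peer was last heard from (hypothetical helper)."""

    def __init__(self, timeout_seconds=16.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_heard = clock()

    def on_frame_received(self):
        # Any frame from the peer, including an empty heartbeat, resets the timer.
        self._last_heard = self._clock()

    def is_expired(self):
        # True once the peer has been silent for longer than the timeout.
        return self._clock() - self._last_heard > self.timeout_seconds


# A simulated clock keeps the example instant and deterministic.
now = [0.0]
monitor = IdleTimeoutMonitor(timeout_seconds=16.0, clock=lambda: now[0])

now[0] = 10.0
monitor.on_frame_received()              # heartbeat arrives at t=10
now[0] = 20.0
alive_at_20 = not monitor.is_expired()   # 10 seconds of silence so far
now[0] = 27.0
expired_at_27 = monitor.is_expired()     # 17 seconds of silence
print(alive_at_20, expired_at_27)
```

In this sketch a router that is too busy to send heartbeats looks exactly the same to its peer as a router that has lost network connectivity, which is why both causes are worth checking.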

--
You received this message because you are subscribed to the Google Groups "Skupper" group.
To unsubscribe from this group and stop receiving emails from it, send an email to skupper+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/skupper/29a68cd1-b9b7-4dc7-806b-3c4a8a970addn%40googlegroups.com.

Ganesh Murthy

Apr 9, 2025, 9:33:48 AM
to vishavjeet somwanshi, Skupper


On Wed, Apr 9, 2025 at 9:04 AM vishavjeet somwanshi <vishavjeet...@gmail.com> wrote:
Thanks, Ganesh, for the information!

We have several microservices running at both sites, and they make many API calls. The network at both sites is stable. Is there any way to increase this heartbeat time from 16 seconds to some other value?
At the moment, there is no way to change the heartbeat timeout.

For the error I mentioned earlier in the thread, what proactive action can we take to avoid it? Or what action should we take when it occurs?
This heartbeat timeout issue should ideally not happen, because 16 seconds is a long time.
What are the CPU and memory allocations for skupper-router? Try vertically scaling the routers (adding more memory and CPU) and see if this issue goes away.
Thanks.


Regards,
Vishavjeet N. Somwanshi

Ganesh Murthy

Apr 9, 2025, 11:23:30 AM
to vishavjeet somwanshi, Skupper
Hi Vishavjeet,
Can you please use Reply All when you respond, so that everyone in the Skupper group can see your response?

Looking at the memory utilization graphs, it is evident that skupper-router is consuming large amounts of memory. I wonder if there is a memory leak.
Can you please share the skupper debug tar.gz file? (The instructions are in the "Creating a Skupper debug tar file" section of https://skupper.io/docs/troubleshooting/index.html)
We also recently discovered an issue where the router processes vanflow records slowly, which leads to memory growth: https://github.com/skupperproject/skupper-router/issues/1772
We are working on a fix for this issue.
Thanks.


On Wed, Apr 9, 2025 at 9:51 AM vishavjeet somwanshi <vishavjeet...@gmail.com> wrote:
In our case it happens very frequently in lower environments like development where the usage of microservices is high.

We have not explicitly set any resource quota for skupper-router; it uses the defaults applied when installing through the skupper-site configmap. However, we have noticed high memory utilisation by the skupper-router at site A. The screenshot is below for your reference.

image.png

This is the memory utilisation of skupper-router in a site B namespace where we have high traffic.
image.png

Our Skupper configuration is done through YAML, using the skupper-site configmap. Is there a way to set the resource quota explicitly, or is it okay to simply scale the skupper-router replicas from 1 to 2 (or whatever number is needed) where the load is high?

Regards,
Vishavjeet N. Somwanshi

vishavjeet somwanshi

Apr 14, 2025, 8:26:31 AM
to Ganesh Murthy, Skupper
Hi Ganesh,

Today I noticed that the skupper-router pod was consuming more than 19GB of memory. To address this, I scaled the pods up to 2 replicas. More than 5 hours later, memory utilisation is still above 19GB for the old pod and around 2GB for the newly added pod.

image.png

1. How will traffic be handled/routed when there is more than one skupper-router replica?

2. Is there anything I need to do explicitly to release the memory or balance the traffic flow?

3. You mentioned that you recently "discovered an issue where the router processes the vanflow records slowly which is leading to memory growth" and that your team is working on a fix. Which Skupper version will this fix be for? We are currently running Skupper 1.4.2.

client version                 1.4.2
transport version              quay.io/skupper/skupper-router:2.4.2 (sha256:c0dfccf2c269)
controller version             quay.io/skupper/service-controller:1.4.2 (sha256:06f3ef0047c6)
config-sync version            quay.io/skupper/config-sync:1.4.2 (sha256:2a350062450a)
flow-collector version         quay.io/skupper/flow-collector:1.4.2 (sha256:c3b1e43cce4a)


Regards,
Vishavjeet N. Somwanshi

Ganesh Murthy

Apr 14, 2025, 9:36:04 AM
to vishavjeet somwanshi, Skupper
On Mon, Apr 14, 2025 at 8:26 AM vishavjeet somwanshi <vishavjeet...@gmail.com> wrote:
1. How will traffic be handled/routed when there is more than one skupper-router replica?

Scaling the router horizontally (from 1 skupper-router replica to 2) is not recommended with Skupper 1.x versions. This is because the routers make 2 more TCP connections between the replicas, and the load balancer between these replicas splits the connections and sends them to the wrong destination router, which causes routing errors.
Skupper 2.x can do horizontal and vertical scaling without any issues.

2. Is there anything I need to do explicitly to release the memory or balance the traffic flow?
The memory consumption on the router is high either because of an unknown memory leak or because the traffic is very heavy and the router needs to grow to handle it (thus adding lots of vanflow records). Unfortunately, once the router grows in memory, it assumes that it will see more of that busy traffic pattern and does not release the memory.
In my last email, I asked you to run the skupper debug dump command. Did you get a chance to run it, and can you please share the results?

3. You mentioned that you recently "discovered an issue where the router processes the vanflow records slowly which is leading to memory growth" and that your team is working on a fix. Which Skupper version will this fix be for? We are currently running Skupper 1.4.2.
I have created an issue for the memory growth due to unprocessed vanflow records: https://github.com/skupperproject/skupper-router/issues/1783
We have a fix for this issue and are testing it rigorously. The fix should be available on the main branch, hopefully before the end of the week.
We will then release skupper-router 2.7.6 with this fix.

Thanks.

vishavjeet somwanshi

Apr 14, 2025, 11:06:14 AM
to Ganesh Murthy, Skupper
Hi Ganesh,

On Point 1: Yes, we noticed some bad errors that broke application connectivity from Site A to Site B, so I scaled the pods back down to 1 replica, and a rolling restart worked fine.

On Point 2: The debug logs might expose DNS names or cluster details to everybody. Is there a way I can share the logs with you directly, such as 1:1 over email, a box link, or something else that is authenticated?

On Point 3: We are currently running Skupper 1.4.2. When we plan an upgrade from 1.x to 2.x, is it straightforward, or are there changes to account for? Does it involve downtime, or deleting and recreating the existing Skupper links?

Thanks in advance !!
Regards,
Vishavjeet N. Somwanshi

vishavjeet somwanshi

Apr 14, 2025, 11:11:54 AM
to Ganesh Murthy, Skupper
I forgot to add one more point. You suggested earlier that we scale vertically (in terms of memory and CPU), but no resource quota is set explicitly for the container in the skupper-router deployment configuration. As a result, the container in that namespace uses as much CPU and memory as it wants.

spec:
  restartPolicy: Always
  serviceAccountName: skupper-router
  schedulerName: default-scheduler
  terminationGracePeriodSeconds: 30
  securityContext:
    runAsNonRoot: true
  containers:
    - resources: {}


Regards,
Vishavjeet N. Somwanshi

Ganesh Murthy

Apr 14, 2025, 12:05:17 PM
to vishavjeet somwanshi, Skupper
On Mon, Apr 14, 2025 at 11:06 AM vishavjeet somwanshi <vishavjeet...@gmail.com> wrote:
Hi Ganesh,

On Point 1: Yes, we noticed some bad errors that broke application connectivity from Site A to Site B, so I scaled the pods back down to 1 replica, and a rolling restart worked fine.
Yes. Please stay away from horizontal scaling on Skupper 1.x.

On Point 2: The debug logs might expose DNS names or cluster details to everybody. Is there a way I can share the logs with you directly, such as 1:1 over email, a box link, or something else that is authenticated?
Inside the skupper debug dump, look for a file called skstat-m.txt in a folder called skstat. This file should not contain any company-specific private information.
Please share that file from all router instances.

On Point 3: We are currently running Skupper 1.4.2. When we plan an upgrade from 1.x to 2.x, is it straightforward, or are there changes to account for? Does it involve downtime, or deleting and recreating the existing Skupper links?
It is hopefully straightforward. We are working on a document that will help you through the process. Downtime will be involved.

vishavjeet somwanshi

Apr 15, 2025, 8:25:05 AM
to Ganesh Murthy, Skupper
Hi Ganesh,

I've attached the skstat-m.txt file for your reference from one of our specific namespaces where we noticed the high memory consumption.  

Regards,
Vishavjeet N. Somwanshi

skstat-m.txt

Ganesh Murthy

Apr 15, 2025, 11:39:08 AM
to vishavjeet somwanshi, Skupper
On Tue, Apr 15, 2025 at 8:25 AM vishavjeet somwanshi <vishavjeet...@gmail.com> wrote:
Hi Ganesh,

I've attached the skstat-m.txt file for your reference from one of our specific namespaces where we noticed the high memory consumption.  
Looking at the output of skstat -m, I don't immediately see a memory leak.
It also doesn't look like you are running into https://github.com/skupperproject/skupper-router/issues/1772, because the total number of vanflow records is relatively low.
What could be happening is that your services are slow to consume the data sent over the Skupper network, which could be causing data to pile up inside the skupper-router. Each TCP connection is allowed a TCP window of 3.2MB.
From your skstat -m output, it looks like there are 5520 active connections on the skupper-router, which would explain the router consuming 9GB.
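
As a rough check on the numbers above, the per-connection window and the connection count can be multiplied to get an upper bound on buffered data. This is a back-of-the-envelope sketch that uses only the two figures quoted in this thread. It assumes every window is completely full, which real traffic is unlikely to reach, and the function name is made up for illustration.

```python
# Figures quoted in this thread.
WINDOW_MB = 3.2             # per-connection TCP window
ACTIVE_CONNECTIONS = 5520   # taken from the skstat -m output


def worst_case_buffered_gb(connections, window_mb=WINDOW_MB):
    """Upper bound on buffered data if every window were completely full."""
    return connections * window_mb / 1000  # decimal GB


upper_bound_gb = worst_case_buffered_gb(ACTIVE_CONNECTIONS)
print(f"{upper_bound_gb:.1f} GB")
```

The bound works out to roughly 17.7 GB, so multi-gigabyte memory use is plausible with this many connections even when the windows are only partly full.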

We have implemented a new flow-control feature in the router (https://github.com/skupperproject/skupper-router/commit/017aa4b5524564639b90719dc13cf01784e53664). It pushes back on the sending services when the receiving services are not consuming data fast enough, which limits skupper-router memory growth.
That fix is available in skupper-router 3.3.0, which will be used by the upcoming version of Skupper.
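
The general idea of pushing back on a fast sender can be illustrated with a bounded buffer. This is a generic sketch of backpressure, not the router's implementation (the commit linked above is the authority on that). The function name, the chunk values, and the buffer size of 3 are all made up for the example.

```python
import queue


def send_with_backpressure(buffer, chunks):
    """Try to enqueue chunks; return the ones refused because the buffer is full."""
    refused = []
    for chunk in chunks:
        try:
            buffer.put_nowait(chunk)
        except queue.Full:
            # The sender must hold on to this chunk and retry later,
            # instead of the buffer growing without limit.
            refused.append(chunk)
    return refused


window = queue.Queue(maxsize=3)   # bounded buffer standing in for a flow-control window

# A fast sender offers five chunks; only three fit.
refused = send_with_backpressure(window, ["a", "b", "c", "d", "e"])

# A slow receiver consumes one chunk, which frees one slot.
consumed = window.get_nowait()

# The sender retries; one more chunk fits and the rest is still held back.
still_refused = send_with_backpressure(window, refused)
print(refused, consumed, still_refused, window.qsize())
```

Because the buffer never holds more than its fixed size, memory use stays bounded no matter how slowly the receiver drains it. The cost is that the sender has to wait.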