Microservice is not responsive due to R2DBC connection loss

akila dharmadasa

Sep 22, 2024, 11:32:25 PM
to r2dbc
Hi,

I have several Spring WebFlux microservices connected to the same PostgreSQL database on GCP Cloud SQL (a Google-managed service).

I keep running into an issue where one of the services connected to the database randomly becomes unresponsive.
When an API request is sent to the service in that state, the request hangs for a long time, and nginx finally returns an error once the request reaches the maximum keep-alive time.

I suspect this is an R2DBC connection issue because all service-layer logs up to the point where a database call is made are written to the log file, but nothing is logged after that.
I have even enabled R2DBC debug logs, but no query execution logs appear.
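
For readers who want to compare setups, here is a minimal sketch of what such debug logging can look like in a Spring Boot application.yml. It assumes the r2dbc-postgresql driver and r2dbc-pool are in use; the logger names are assumptions about a typical setup, not necessarily the exact configuration used in these services.

```yaml
# Sketch only: logger names assume r2dbc-postgresql and r2dbc-pool.
logging:
  level:
    io.r2dbc.postgresql.QUERY: DEBUG   # executed SQL statements
    io.r2dbc.postgresql.PARAM: DEBUG   # bound parameter values
    io.r2dbc.pool: DEBUG               # connection pool activity
```

If the pool logger shows a connection being requested but no statement is logged by the QUERY logger, that would be consistent with the request stalling before the query is sent.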

I cannot reproduce or debug this because it happens at random. I even ran a load test, but the service worked fine during it.

This is how the R2DBC connection is configured in all services. I have adjusted the connection pool settings based on suggestions I got from other sources.
r2dbc:
  url: r2dbc:postgresql://**.**.**.**:****/v1
  username: ******
  password: **********
  pool:
    initial-size: 10
    max-size: 20
    max-idle-time: 1m
    max-acquire-time: 1m
    max-validation-time: 1m
    max-create-connection-time: 1m
    max-life-time: 5m
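
The pool settings above set a validation timeout but no explicit validation query. As a hedged variation, not a confirmed fix, the pool could be asked to check each connection against the server when it is acquired. The sketch below assumes the block sits under Spring Boot's `spring` key; `validation-query` and `validation-depth` are not part of the configuration shown above, and `SELECT 1` is only an illustrative query.

```yaml
# Sketch only: validation-query and validation-depth are additions for
# illustration and are not in the original configuration.
spring:
  r2dbc:
    pool:
      validation-query: SELECT 1   # run on acquire to detect dead connections
      validation-depth: remote     # validate against the server, not just locally
      max-validation-time: 5s      # illustrative value; fail validation quickly
```

Whether stale connections are actually the cause of the hang is not established here; this only makes such connections easier to detect.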
I have given the expected logs and the unresponsive logs below for your reference.

Expected:
Screenshot from 2024-09-23 09-00-32.png

INFO Fetching configuration for factoryId: TEST001, appType: WEB,version: v1
is a service layer log just after the API is hit and before a database call and
2024-09-17 04:20:11 [,] - INFO Parsing configurations Map: {FactoryName=TEST001, ..............................................}
2024-09-17 04:20:11 [,] -DEBUG Retrieved configuration configurations: JsonByteArrayInput{{"FactoryName": "TEST001", ...................................................}
are logs from the service layer after a database call.

Faulty log:
This is the log when the same API is hit multiple times while the service is unresponsive.
Screenshot from 2024-09-23 08-37-53.png
As you can see, not even the query is executed.

When I hit the actuator/health endpoint while a service is unresponsive, I get the same result.
When I hit the actuator/restart endpoint, the service breaks entirely (sorry, I do not have a log of this).

Has this happened to anyone?
Can someone explain why this could happen or give any pointers?

Thank you in advance.