Cosmos DB connections dropping


wi...@spanner.org

Mar 15, 2021, 8:03:07 AM
to ReactiveMongo - http://reactivemongo.org
Hello all. 

We have been running happily on MongoDB and then on Atlas for many years, but business reasons now dictate a move to Azure and Cosmos DB, and it is not going well. The basic business of connecting works and we run for a few minutes, but then there seem to be two kinds of problem:

1. Connections drop after 30 minutes of inactivity. I have had some success working around this with a 25-minute maxIdleTimeMS.

2. The Cosmos charging structure is based on 'request units', which are a throughput allowance. Exceed the limit and it returns a 429 try-again-later error. You can tell it to retry on the server side, but then it seems not to return any response at all.

Both of these situations seem to cause ReactiveMongo to report that the replica set has gone away. I get a lot of these:

[warn] r.c.a.MongoDBSystem - [Supervisor-1/pwcca] The entire node set is unreachable, is there a network problem?

and

[warn] r.c.a.MongoDBSystem - [Supervisor-1/pwcca] MongoError['Socket disconnected (Supervisor-1/pwcca)'] (channel #47160db0)

and especially:

[Supervisor-1/pwcca] The node set is authenticated, but the primary is not available

Unfortunately it seems that ReactiveMongo doesn't recover from the lack of service, so we usually find that the application has stalled.

Cosmos is generally a bad fit for our application, which has a very high quiet/busy ratio and makes many very small requests, but you know. Business reasons. We're handling health data so there is a very high certification and encryption requirement.

Has anyone managed to make this work in production?

thanks,

Will


Cédric Chantepie

Mar 15, 2021, 6:41:01 PM
to ReactiveMongo - http://reactivemongo.org
Without versions or a code reproducer, it is hard to give any hint.

wi...@spanner.org

Mar 16, 2021, 5:16:27 AM
to ReactiveMongo - http://reactivemongo.org
Hello Cédric. Thanks for your reply.

This wasn't a bug report; just an informal request for guidance and experience. I do not imagine that anyone wants to set up an Azure account and wait 40 minutes to observe my intermittent failures.

We are using ReactiveMongo v1.0.3 in a very straight Play application whose job is to receive and store small data packets from mobile devices.

I can avoid problems in normal use by provisioning enough capacity and using K8s liveness probes to keep connections alive, but I am concerned about how well the application copes at the limit. When we simulate those conditions, it seems that the application loses its connection to the database and does not reconnect.

If nobody has made this work then yes, I will create a limited Cosmos account and a minimal application to explore the problem.

yours,

Will

wi...@spanner.org

Mar 24, 2021, 8:25:47 AM
to ReactiveMongo - http://reactivemongo.org

For anyone else who ends up in this predicament, I think we have found a configuration that works. There is still some concern about behaviour at the limits of provisioned throughput, but for now we have a reliable connection from ReactiveMongo to Cosmos DB with this configuration:
  • ssl=true
  • sslAllowsInvalidCert=true
  • maxIdleTimeMS=120000
  • rm.keepAlive=true
  • rm.failover=strict
The max idle time has taken some tuning but this seems ok. If it continues to work then I will add some notes to the documentation page.
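
For reference, the options above can be combined into a single connection URI. This is a minimal sketch only: `CosmosUri` is a hypothetical helper (not part of ReactiveMongo), the account, key and database names are placeholders, and the host form and port 10255 are assumptions based on the usual Cosmos DB Mongo endpoint, so check them against your own account's connection string.

```scala
import java.net.URLEncoder

// Hypothetical helper: assembles a connection URI from the options listed above.
object CosmosUri {
  // Option names and values exactly as reported in this thread.
  val options: Seq[(String, String)] = Seq(
    "ssl" -> "true",
    "sslAllowsInvalidCert" -> "true",
    "maxIdleTimeMS" -> "120000", // 2 minutes
    "rm.keepAlive" -> "true",
    "rm.failover" -> "strict"
  )

  // account, key and db are placeholders supplied by the caller.
  // Assumption: host is <account>.mongo.cosmos.azure.com on port 10255.
  def build(account: String, key: String, db: String): String = {
    // Account keys can contain characters such as '/' and '=',
    // so percent-encode the key before placing it in the URI.
    val encodedKey = URLEncoder.encode(key, "UTF-8")
    val query = options.map { case (k, v) => s"${k}=${v}" }.mkString("&")

    s"mongodb://${account}:${encodedKey}@${account}.mongo.cosmos.azure.com:10255/${db}?${query}"
  }
}
```

In a Play application the resulting string would typically go into the `mongodb.uri` setting. The 120000 ms idle time is simply the value that worked here after tuning; it may need adjusting for other workloads.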

yours,

Will

Carlos Saltos

Mar 24, 2021, 9:00:03 AM
to reacti...@googlegroups.com
This is also happening to us with our custom EC2 setup at AWS ... it’s a very nasty bug that bites us when we just want to rotate the MongoDB nodes for maintenance (or even worse, on a node failure).

The initial workaround was to reboot our application nodes manually during MongoDB maintenance (or reboot in a panic when a MongoDB node crashed in the middle of the night).

It is important to note that from the point of view of the MongoDB cluster everything is OK; it’s “just” the inability of the client to recognize changes to the MongoDB nodes. With that bug the client nodes jump to 100% CPU usage all the time with tons of log warnings (the client is still responding, but obviously very slowly, until the point of collapsing).

What happens with Cosmos DB is that, unlike Atlas, it rotates nodes more often (more than is normally needed, actually) ... and thus you get bitten by the rotation bug.

Obviously the “fix” of manually rebooting our application ... is not entirely sustainable for us ... here comes the automatic workaround.

The automatic workaround that we currently use is based on a very simple Scala object cache that rotates the connections every 5 minutes and, in case of failures, swaps the ReactiveMongo connection objects immediately.
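
The actual code was not posted in this thread. A minimal sketch of the idea as described (one cached connection, replaced every 5 minutes or immediately after a reported failure) might look like the following. `RotatingConnection`, `connect`, `close` and `reportFailure` are hypothetical names, and the class is deliberately generic instead of being tied to the ReactiveMongo API.

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.concurrent.{ ExecutionContext, Future }
import scala.concurrent.duration._

// Hypothetical sketch, not ReactiveMongo API: caches one connection and
// replaces it when it is older than `maxAge` or when a failure is reported.
final class RotatingConnection[C](
    connect: () => Future[C],            // opens a new connection
    close: C => Unit,                    // releases a retired connection
    maxAge: FiniteDuration = 5.minutes,  // rotation interval from the post
    now: () => Long = () => System.nanoTime()
)(implicit ec: ExecutionContext) {

  private case class Entry(conn: Future[C], createdAt: Long)

  private def newEntry(): Entry = Entry(connect(), now())

  // The first connection is opened when the holder is created.
  private val current = new AtomicReference[Entry](newEntry())

  /** Returns the cached connection, rotating it first if it has expired. */
  def get(): Future[C] = {
    val entry = current.get()
    if (now() - entry.createdAt >= maxAge.toNanos) swap(entry) else entry.conn
  }

  /** To be called when an operation fails: swaps the connection immediately. */
  def reportFailure(): Future[C] = swap(current.get())

  // Replaces `old` with a fresh connection. If another thread has already
  // replaced it, the extra connection is closed and the winner is returned.
  private def swap(old: Entry): Future[C] = {
    val fresh = newEntry()
    if (current.compareAndSet(old, fresh)) {
      old.conn.foreach(close) // closes the retired connection if it was opened
      fresh.conn
    } else {
      fresh.conn.foreach(close)
      current.get().conn
    }
  }
}
```

With ReactiveMongo, `connect` would presumably wrap the driver's connect call and `close` the connection's close method. Note that this sketch closes the retired connection right away; a production version would probably delay that until in-flight operations have finished. A failed `connect` stays cached until the next rotation unless the caller invokes `reportFailure()`.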

This solution is OK for us (at least it’s letting us sleep well at night and makes MongoDB maintenance a lot easier).

As mentioned before, this is a very nasty bug that has been lingering for years because it’s difficult for the ReactiveMongo authors to reproduce, but one of these days we will all together make it easily reproducible and kill the rotation bug once and for all.

It would also be nice if ReactiveMongo had some sort of client caching out of the box to help with this, but maybe it’s even better to focus on killing the rotation bug at its roots.

Best regards,

Carlos Saltos

--
You received this message because you are subscribed to the Google Groups "ReactiveMongo - http://reactivemongo.org" group.
To unsubscribe from this group and stop receiving emails from it, send an email to reactivemong...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/reactivemongo/2faa5b42-a6ef-402e-831a-a64aec372842n%40googlegroups.com.

Cédric Chantepie

Mar 24, 2021, 11:39:45 AM
to reacti...@googlegroups.com



> Important to note is that from the point of view of the MongoDB cluster everything is OK, it’s “just” the inability of the client to recognize changes on the MongoDB nodes and with that bug the client nodes jump to 100% CPU usage all the time with tons of log warnings (the client is still responding but obviously very slow until the point of collapsing)

Release 1.0.3 has a mechanism to evict unresponsive nodes after some time.


Carlos Saltos

Mar 24, 2021, 11:42:23 AM
to reacti...@googlegroups.com
Really ? ... Great news Cédric !! 👍😎

Updating to a new ReactiveMongo version right now ...

Thank you very much !!

Best regards,

Carlos Saltos


mikeb

May 4, 2021, 10:01:14 AM
to ReactiveMongo - http://reactivemongo.org
Hello, 
I have the same problems connecting to Cosmos DB. The connection drops from time to time.
I use ReactiveMongo 1.0.3.

It doesn't happen immediately but after some time (1 minute?), so here is a partial log: https://github.com/michalbogacz/reactivemongo-playground/blob/master/part.log
(In the log I had to hide IPs, domains and names.)

The code that produces this log is here: https://github.com/michalbogacz/reactivemongo-playground

If you need more details I'm happy to help.

Best Regards,
Michal

Cédric Chantepie

May 4, 2021, 2:00:42 PM
to ReactiveMongo - http://reactivemongo.org
I don't think that's the same issue, so first check the connectivity, as the network channel doesn't seem to really be up in your log.

mikeb

May 4, 2021, 2:24:37 PM
to ReactiveMongo - http://reactivemongo.org
I had this issue on Azure K8s. The project above was created just to reproduce it locally, and the same issue happened ("The entire node set is unreachable, is there a network problem?").

I ran this several times and could reproduce it each time. I don't see any other connection issues.


> network channel doesn't seem to really be up in your log.
Can you elaborate? Why does ReactiveMongo think my connection was broken if "isMaster" gave a response 2 seconds before? What can I debug to find the cause?

mikeb

May 5, 2021, 7:14:06 AM
to ReactiveMongo - http://reactivemongo.org
I tried the official MongoDB driver and this issue does not occur.
The only interesting difference I see in the logs is that with the official driver, the first server is unregistered and only "test-westeurope.mongo.cosmos.azure.com" stays.

Also, I found that this dropped-connection information comes from Netty, so this is a much more complicated problem.

Cédric Chantepie

May 6, 2021, 5:29:54 AM
to ReactiveMongo - http://reactivemongo.org
Meaning that the nodes registered in the replica set are not usable.
From what I see, the Cosmos example is still working: http://reactivemongo.org/releases/1.0/documentation/tutorial/azure-cosmos.html

mikeb

May 6, 2021, 5:48:33 AM
to ReactiveMongo - http://reactivemongo.org
Did you try to run it for a longer time? Like 10 minutes?
It's not that it's not connecting. The problem is that the driver reports that the connection was broken after some time (less than 10 minutes) and reconnects. But in this small reconnect window, all requests fail.
This instability does not happen with the official driver.

Jacek Sokół

May 10, 2021, 3:29:07 PM
to ReactiveMongo - http://reactivemongo.org
It looks like the problem is that the `IdleStateHandler` is used incorrectly: when SSL is enabled, it is not set as the first handler. This is my attempt to fix the issue: https://github.com/ReactiveMongo/ReactiveMongo/pull/1037.

Cédric Chantepie

May 15, 2021, 12:09:15 PM
to ReactiveMongo - http://reactivemongo.org
See 1.0.4

Cédric Chantepie

Apr 15, 2022, 8:58:14 AM
to reacti...@googlegroups.com
First try to use an up-to-date version.

Abhishek Sharma

Sep 9, 2022, 8:13:24 PM
to ReactiveMongo - http://reactivemongo.org

> rm.failover=strict

I see that the documentation (http://reactivemongo.org/releases/1.0/documentation/advanced-topics/failoverstrategy.html) lists the failover setting as "default".

Do we need to change it to strict?

Cédric Chantepie

Sep 10, 2022, 9:24:06 AM
to ReactiveMongo - http://reactivemongo.org
There is no "need" ... you have to test it according to the connectivity/network (and the many things between the app and the DB that are outside the responsibility of the driver) that can impact latency.

Abhishek Sharma

Sep 12, 2022, 3:00:42 PM
to ReactiveMongo - http://reactivemongo.org
Thanks for the info.
Do I need to explicitly add rm.failover={default,strict...}?

Cédric Chantepie

Sep 13, 2022, 6:56:41 AM
to reacti...@googlegroups.com
Once again, you need to try and adjust the settings accordingly.