We have been running happily on MongoDB, and then on Atlas, for many years, but business reasons now dictate a move to Azure and Cosmos DB, and it is
not going well. Connecting works and we run for a few minutes, but then there seem to be two kinds of problem:
1. Connections drop after 30 minutes of inactivity. I have had some success working around this with a 25-minute maxIdleTimeMS.
2. Cosmos's charging structure is based on 'request units', which are a throughput allowance. Exceed the limit and it returns a 429 try-again-later error. You can tell it to retry on the server side, but then it seems not to return any response at all.
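For reference, the idle-timeout workaround in item 1 amounts to something like the sketch below. The object name, host, port and database are placeholders, and ssl=true is an assumption about what the endpoint needs; only maxIdleTimeMS and the 25-minute value come from the description above.

```scala
import scala.concurrent.duration._

object CosmosUri {
  // 25 minutes, kept under the 30-minute idle cut-off described in item 1
  val maxIdle: FiniteDuration = 25.minutes

  // Builds a connection URI; host, port and db are placeholders supplied by the caller.
  // ssl=true is assumed here, adjust to whatever the deployment requires.
  def uri(host: String, port: Int, db: String): String =
    s"mongodb://$host:$port/$db?ssl=true&maxIdleTimeMS=${maxIdle.toMillis}"
}
```

The resulting string carries maxIdleTimeMS=1500000 (25 minutes in milliseconds) and would be passed to the driver as its connection URI.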
Both of these situations seem to cause ReactiveMongo to report that the replica set has gone away. I get a lot of these:
[warn] r.c.a.MongoDBSystem - [Supervisor-1/pwcca] The entire node set is unreachable, is there a network problem?
and
[warn] r.c.a.MongoDBSystem - [Supervisor-1/pwcca] MongoError['Socket disconnected (Supervisor-1/pwcca)'] (channel #47160db0)
[Supervisor-1/pwcca] The node set is authenticated, but the primary is not available
Unfortunately it seems that ReactiveMongo doesn't recover from the lack of service, so we usually find that the application has stalled.
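One possible workaround for the 429 case is to leave server-side retry off and retry in the client with exponential backoff. The following is only a sketch of that idea in plain Scala, not something verified against Cosmos or ReactiveMongo: the Backoff object is hypothetical, and the isThrottled predicate (how to recognise a 429 from whatever exception the driver raises) is left to the caller because I don't know what that exception looks like.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object Backoff {
  // Single daemon timer thread, so pending retries never keep the JVM alive
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "backoff-timer")
      t.setDaemon(true)
      t
    }
  })

  // Runs `op` after `delay` without blocking a thread while waiting
  private def after[T](delay: FiniteDuration)(op: => Future[T])(
      implicit ec: ExecutionContext): Future[T] = {
    val p = Promise[T]()
    val task = new Runnable {
      // Future.unit.flatMap also captures exceptions thrown synchronously by `op`
      def run(): Unit = p.completeWith(Future.unit.flatMap(_ => op))
    }
    scheduler.schedule(task, delay.toMillis, TimeUnit.MILLISECONDS)
    p.future
  }

  // Tries `op` up to `attempts` times in total, doubling the delay after each
  // throttled failure. Failures that are not throttling are passed straight through.
  def retry[T](attempts: Int, delay: FiniteDuration)(isThrottled: Throwable => Boolean)(
      op: => Future[T])(implicit ec: ExecutionContext): Future[T] =
    Future.unit.flatMap(_ => op).recoverWith {
      case e if attempts > 1 && isThrottled(e) =>
        after(delay)(retry(attempts - 1, delay * 2L)(isThrottled)(op))
    }
}
```

A call would look like Backoff.retry(5, 200.millis)(isThrottled)(someQuery()), where someQuery stands for any operation returning a Future. This only addresses the throttling side; it does nothing about the driver deciding the whole node set is unreachable.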
Cosmos is generally a bad fit for our application, which has a very high quiet/busy ratio and makes many very small requests, but you know. Business reasons. We're handling health data, so there are very strict certification and encryption requirements.
Has anyone managed to make this work in production?
Thanks,
Will