anyone seen abnormal reactivemongo recovery when Atlas does some hosted db maint or network changes?


Brad Rust

unread,
Nov 2, 2020, 11:05:29 AM11/2/20
to reacti...@googlegroups.com

There have been several times now over the last 6 months where Atlas has done some maintenance or network configuration changes and my production servers simply do NOT recover properly.

 

In this case, we are running Kubernetes pods with reactivemongo 1.0 against Atlas instances, connecting with a mongodb+srv URI.  The event didn’t seem to take connectivity/services “down”, but because the exception below kept spinning, our systems were very slow and never recovered or stopped logging these exceptions.
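
(For reference, a minimal sketch of this kind of setup with the 1.0 API — the URI below is a placeholder, not our actual configuration:)

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.Future
    import reactivemongo.api.{ AsyncDriver, MongoConnection }

    val driver = AsyncDriver()

    // Placeholder Atlas SRV URI; credentials and cluster name are made up
    val uri = "mongodb+srv://user:password@cluster0.example.mongodb.net/mydb"

    val connection: Future[MongoConnection] = driver.connect(uri)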

 

I have tried setting networkaddress.cache.ttl to something like 10 seconds, thinking maybe it was a cached DNS resolution, however this just didn’t seem to affect anything.
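
(For reference, networkaddress.cache.ttl is a java.security property rather than a system property, so it has to be applied before any name resolution happens — a minimal sketch:)

    import java.security.Security

    // networkaddress.cache.ttl is a security property (not a system property);
    // set it early, before any host name is resolved, e.g. at the top of main()
    Security.setProperty("networkaddress.cache.ttl", "10")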

 

 

The spinning and spewing of this stack trace created a CPU overload on the production servers.

 

2020-10-30 23:59:14,050 [ERROR]  r.c.a.MongoDBSystem - [Supervisor-1/surchx] Fails to connect channel #d2661b82

java.nio.channels.ClosedChannelException: null

        at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)

        at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976)

        at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237)

        at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538)

        at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)

        at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)

        at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

 

 

Cédric Chantepie

unread,
Nov 2, 2020, 1:32:57 PM11/2/20
to ReactiveMongo - http://reactivemongo.org
Such logging is not surprising in case of a network change/issue (channels get closed); that's why netty logging should be disabled by default.
As for recovery, what do you mean? In the Play plugin? With which DB resolution code?

Brad Rust

unread,
Nov 2, 2020, 2:45:01 PM11/2/20
to reacti...@googlegroups.com

Are you suggesting that I should set “r.c.a.MongoDBSystem” to “OFF”, since the log messages were ERROR level logs?

 

I certainly would expect some pool/actor completion based on a network error, and logs related to that.  However, these repeated (many per second) for hours after the event (I still don’t know exactly what the Atlas event was, other than it appeared somewhat severe).

 

I would expect that db calls made by the actors which were active or ready-to-be-active might or would fail.  However, after those actors failed and were replaced by newly created actors, I would expect the calls to succeed once the mongodb+srv URI resolved to a successful connection.

 

In my case, I had GB’s of exception logs until I restarted the pods and everything went back to normal. 

 

I mention actors here, but it’s probably relevant that this INFO message kept occurring too… (perhaps one for each exception; 12-15 per second, every 5 seconds)


"[Supervisor-1/db] Fails to connect channel #7e90b17f"

 

 

Thanks for responding; I'm just looking for some tips on handling Atlas (or mongodb server) outages and having my Play 2.8 app keep on going after the errors/events have gone away.


Cédric Chantepie

unread,
Nov 2, 2020, 3:20:27 PM11/2/20
to ReactiveMongo - http://reactivemongo.org
On Monday, 2 November 2020 at 20:45:01 UTC+1 br...@interpayments.com wrote:

Are you suggesting that I should set “r.c.a.MongoDBSystem” to “OFF”, since the log messages were ERROR level logs?

That should only be enabled when investigating an applicative error at an upper level.

 

I certainly would expect some pool/actor completion based on a network error, and logs related to that.  However, these repeated (many per second) for hours after the event (I still don’t know exactly what the Atlas event was, other than it appeared somewhat severe).

 

I would expect that db calls made by the actors which were active or ready-to-be-active might or would fail.  However, after those actors failed and were replaced by newly created actors, I would expect the calls to succeed once the mongodb+srv URI resolved to a successful connection.

 

Unclear.

In my case, I had GB’s of exception logs until I restarted the pods and everything went back to normal. 

 

I mention actors here, but it’s probably relevant that this INFO message kept occurring too… (perhaps one for each exception; 12-15 per second, every 5 seconds)


"[Supervisor-1/db] Fails to connect channel #7e90b17f"

 

 

Thanks for responding; I'm just looking for some tips on handling Atlas (or mongodb server) outages and having my Play 2.8 app keep on going after the errors/events have gone away.

I still don't see the involved code.

Cédric Chantepie

unread,
Nov 2, 2020, 3:28:36 PM11/2/20
to ReactiveMongo - http://reactivemongo.org
And the stacktrace.

Brad Rust

unread,
Nov 2, 2020, 3:37:41 PM11/2/20
to reacti...@googlegroups.com

This (below, except for the timestamp and the channel #id) is the only stack trace I see in my logs, over and over again, for hours, many per second.

 

The code is normal collection reads and writes.   There is no hint of my code in any output unfortunately.  

 

I am not really sure how to explain this any further.   I have a production system which is using rm 1.0.0 and an Atlas primary/secondary/secondary mongodb backend.   Atlas had some cluster reboot and my reactivemongo “system” seemed to go crazy and never repaired or corrected whatever state it was in.   This has happened to me several times.   I guess I can try to replicate it with some local networking hiccups.

 

What I was really trying to ask is… if a mongodb goes down (and perhaps the hostname(s) map to a different IP address), should the reactivemongo system recover and still satisfy database requests?

 

 

2020-10-30 23:59:14,050 [ERROR]  r.c.a.MongoDBSystem - [Supervisor-1/surchx] Fails to connect channel #d2661b82

java.nio.channels.ClosedChannelException: null

        at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)

        at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976)

        at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237)

        at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538)

        at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)

        at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)

        at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

 

 

 



Cédric Chantepie

unread,
Nov 2, 2020, 4:21:58 PM11/2/20
to ReactiveMongo - http://reactivemongo.org
So you have errors in the log at a low level (netty/network) and no errors at the application level?

Brad Rust

unread,
Nov 2, 2020, 4:48:03 PM11/2/20
to reacti...@googlegroups.com

 

 

Yes, correct, no stack traces (or long-term interruption of our service(s)) except for these repeating exceptions.   I’ll re-include them together since I posted them in different messages…  Of course this is just 4 of the thousands, but I wanted to include the timestamps to show how frequently they are coming.

 

 

Info 2020-10-30 18:00:04.863 MDT "[Supervisor-1/db] Fails to connect channel #4f04841e"

2020-10-31 00:00:04,863 [ERROR] r.c.a.MongoDBSystem - [Supervisor-1/db] Fails to connect channel #4f04841e

Error 2020-10-30 18:00:04.863 MDT java.nio.channels.ClosedChannelException: null at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976) at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237) at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

Info 2020-10-30 18:00:04.863 MDT "[Supervisor-1/db] Fails to connect channel #8206dbce"

2020-10-31 00:00:04,863 [ERROR] r.c.a.MongoDBSystem - [Supervisor-1/db] Fails to connect channel #8206dbce

Error 2020-10-30 18:00:04.864 MDT java.nio.channels.ClosedChannelException: null at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976) at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237) at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

Info 2020-10-30 18:00:04.864 MDT "[Supervisor-1/db] Fails to connect channel #a1fe6459"

Error 2020-10-30 18:00:04.864 MDT 2020-10-31 00:00:04,864 [ERROR] r.c.a.MongoDBSystem - [Supervisor-1/db] Fails to connect channel #a1fe6459

Error 2020-10-30 18:00:04.864 MDT java.nio.channels.ClosedChannelException: null at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976) at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237) at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

Error 2020-10-30 18:00:04.864 MDT 2020-10-31 00:00:04,864 [ERROR] r.c.a.MongoDBSystem - [Supervisor-1/db] Fails to connect channel #8e1cde89

2020-10-30 18:00:04.864 MDT java.nio.channels.ClosedChannelException: null at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976) at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237) at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

Info 2020-10-30 18:00:04.864 MDT "[Supervisor-1/db] Fails to connect channel #8e1cde89"

Cédric Chantepie

unread,
Nov 2, 2020, 5:36:28 PM11/2/20
to reacti...@googlegroups.com
Then it means that the pool is doing its job handling network signals through netty.

You can try to optimize network options, but there is no driver issue as far as I see.


Brad Rust

unread,
Nov 3, 2020, 12:41:16 AM11/3/20
to reacti...@googlegroups.com

 

So when there are network failures like this, there is no other way to have a production system operate properly other than “rebooting the server”?   That just seems really wonky.

 

Is there anything you can share about the internals of the mongodb+srv protocol or bootstrapping that might suggest caching of hostnames?  Or any host caching you can think of that might not be honoring the networkaddress.cache.ttl JVM setting?
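
(As a side note, a rough way to check what the SRV name currently resolves to from inside a pod, using the JDK's built-in JNDI DNS provider — the cluster hostname below is a placeholder:)

    import java.util.Hashtable
    import javax.naming.directory.InitialDirContext

    // Query the SRV record that a mongodb+srv URI expands to, to see which
    // hosts/ports the cluster name currently maps to.
    val env = new Hashtable[String, String]()
    env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory")

    val ctx = new InitialDirContext(env)
    val srv = ctx.getAttributes("_mongodb._tcp.cluster0.example.mongodb.net", Array("SRV"))

    println(srv.get("SRV"))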

 

There must be someone using reactivemongo with Kubernetes with Atlas – and having similar issues.   Anyone?

 

Anyway, thanks for your time.   Unfortunately, if I can’t hack through this somehow it’s probably the end of the road.

 

 


Cédric Chantepie

unread,
Nov 3, 2020, 12:18:27 PM11/3/20
to ReactiveMongo - http://reactivemongo.org
On Tuesday, 3 November 2020 at 06:41:16 UTC+1 br...@interpayments.com wrote:

 So when there are network failures like this, there is no other way to have a production system operate properly other than “rebooting the server”?   That just seems really wonky.

I think it was asked whether there are errors/stacktraces at the application level (not at the low/netty level), and apparently there are none ...
Which seems quite impossible if the application has crashed/stopped. So please share those relevant traces to get help.

Also share the code related to those traces.
There are a lot of ways to prevent an app from being resilient, especially regarding the DB resolution as indicated in the documentation.
 

 Is there anything you can share about the internals of the mongodb+srv protocol or bootstrapping that might suggest caching of hostnames?  Or any host caching you can think of that might not be honoring the networkaddress.cache.ttl JVM setting?

There are no specific settings, and the network layer is implemented with netty (which is battle-tested).

Brad Rust

unread,
Nov 4, 2020, 10:42:16 AM11/4/20
to reacti...@googlegroups.com

Here is what I am seeing as I am trying to replicate this.

 

  • Play 2.8.2 scala
  • Reactivemongo 1.0.0
  • Atlas M10 cluster (primary and two secondaries)
  • “normal” BSONCollection finds, updates, modifies, etc. (meaning, the code base works as expected for all happy-path network conditions; see the sketch after this list).   Additionally (locally, not with Atlas), if I stop my mongodb and then restart it… everything “recovers” just fine.
  • Turned off/removed the linux-shaded component just in case that was adding complexity.
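
For context, the “normal” operations above look roughly like this (a simplified sketch with hypothetical collection and field names, not the actual code base):

    import scala.concurrent.{ ExecutionContext, Future }
    import reactivemongo.api.bson.BSONDocument
    import reactivemongo.api.bson.collection.BSONCollection

    // Happy-path read and write against a lazily resolved collection
    def findOrder(coll: Future[BSONCollection], id: String)(
        implicit ec: ExecutionContext): Future[Option[BSONDocument]] =
      coll.flatMap(_.find(BSONDocument("_id" -> id)).one[BSONDocument])

    def markProcessed(coll: Future[BSONCollection], id: String)(
        implicit ec: ExecutionContext) =
      coll.flatMap(_.update.one(
        q = BSONDocument("_id" -> id),
        u = BSONDocument("$set" -> BSONDocument("processed" -> true))))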

 

How I am trying to replicate (and seem to have replicated *something* that I think is abnormal)

 

  1. Start play app, do all the happy-path reads and writes with success
  2. Atlas pause-cluster
  3. As expected play app logs exceptions about connecting to primary and/or secondary
  4. Atlas resume-cluster
  5. The play app recovers (mostly) and allows my happy-path reads and writes with success
  6. MongoDBSystem continues to log the following …
    1. r.c.a.MongoDBSystem - [Supervisor-6/my-db] Fails to connect channel #<SOME_CHANNEL_ID>
  7. In MongoDBSystem, it goes through the connectAll(nodeSet) code where the updateNode block is called and the netty connect fails, yielding the above (#6) logged exception (MongoDBSystem:1528).   The nodes of the nodeSet have “Disconnected” status entries that *never* get removed.   So while the application is still functional, from this point onward I will *forever* (until server restart) have Disconnected channels in the node.
  8. Looking at this a different way, on MongoDBSystem:1572:updateNode(node, node.connections, Vector.empty), the node.connections Vector has connections which stay in a Disconnected state and never get removed or cleaned up.  If you were to look at the toShortString of these, mine look like this:
    1. Node[testcluster-shard-00-00…...mongodb.net:27017: Unknown (9/9/10 available connections), latency=9223372036854775807ns, authenticated={}]
    2. Node[testcluster-shard-00-01…...mongodb.net:27017: Unknown (8/8/10 available connections), latency=9223372036854775807ns, authenticated={}]
    3. Node[testcluster-shard-00-02…...mongodb.net:27017: Primary (9/9/10 available connections), latency=430148298551800ns, authenticated={.....@admin}]
  9. I guess my assumption is that eventually you would expect all of those Nodes to come back to 10/10/10 as things recover.

 

 

I am not sure which “DB resolution in the document” section you are referring to. I would be happy and willing to try different options, but I simply don’t know where to start.

 

 

 


Carlos Saltos

unread,
Nov 4, 2020, 10:56:59 AM11/4/20
to reacti...@googlegroups.com
This is also happening to us at Talenteca.com ... the application is OK after the MongoDB cluster maintenance, but those logs appear like crazy, tons per second, killing the hard disk and CPU, and then crashing the application.

This has been happening for us since ReactiveMongo version 0.16 ... so every time there is a maintenance we have to restart the application.

The real pain is when it is not a scheduled maintenance but a real database crash outage; then it does not matter if MongoDB as a cluster heals itself, ReactiveMongo will kill the app with tons of repetitive logs and CPU usage.

By the way, in ReactiveMongo version 0.16 this was fixed with a quick patch, but the issue was reintroduced from ReactiveMongo version 0.17 on; after a long discussion on how to properly fix it, there is no real fix yet.

Thanks for ReactiveMongo, it’s a great and important project for us. I hope we can improve this little hiccup soon.

Best regards,

Carlos Saltos


Cédric Chantepie

unread,
Nov 4, 2020, 2:45:39 PM11/4/20
to ReactiveMongo - http://reactivemongo.org
On Wednesday, 4 November 2020 at 16:56:59 UTC+1 csa...@gmail.com wrote:
This is also happening to us at Talenteca.com ... the application is OK after the MongoDB cluster maintenance, but those logs appear like crazy, tons per second, killing the hard disk and CPU, and then crashing the application.

This has been happening for us since ReactiveMongo version 0.16 ... so every time there is a maintenance we have to restart the application.
 
Please do not mix topics.
0.16 is in a totally different league, and is more than 1 year old.
If you can't update to 1.0.0, you need at least to upgrade to 0.18.6+, with 0.20.13 being recommended.

Carlos Saltos

unread,
Nov 4, 2020, 3:32:25 PM11/4/20
to reacti...@googlegroups.com
I’m sorry for that important missing part -> it also happens with 1.0.0 ... it has been happening since 0.16 without a clear, solid solution.

Once again, thank you for ReactiveMongo, and I hope we can find a good solution all together soon.

I will try to provide an easily reproducible guide and repo for this bug in the coming weeks (it’s not easy since it’s a live runtime error during a cluster maintenance intervention) ... but I will try.

Best regards,

Carlos Saltos


Cédric Chantepie

unread,
Nov 5, 2020, 4:51:39 AM11/5/20
to reacti...@googlegroups.com
Thanks for the feedback.


On Wed, Nov 4, 2020 at 16:42, Brad Rust <br...@interpayments.com> wrote:


  1. Start play app, do all the happy-path reads and writes with success
  2. Atlas pause-cluster
  3. As expected play app logs exceptions about connecting to primary and/or secondary
  4. Atlas resume-cluster
  5. The play app recovers (mostly) and allows my happy-path reads and writes with success
  6. MongoDBSystem continues to log the following …
    1. r.c.a.MongoDBSystem - [Supervisor-6/my-db] Fails to connect channel #<SOME_CHANNEL_ID>
  7. In MongoDBSystem, it goes through the connectAll(nodeSet) code where the updateNode block is called and the netty connect fails, yielding the above (#6) logged exception (MongoDBSystem:1528).   The nodes of the nodeSet have “Disconnected” status entries that *never* get removed.   So while the application is still functional, from this point onward I will *forever* (until server restart) have Disconnected channels in the node.
  8. Looking at this a different way, on MongoDBSystem:1572:updateNode(node, node.connections, Vector.empty), the node.connections Vector has connections which stay in a Disconnected state and never get removed or cleaned up.  If you were to look at the toShortString of these, mine look like this:
    1. Node[testcluster-shard-00-00…...mongodb.net:27017: Unknown (9/9/10 available connections), latency=9223372036854775807ns, authenticated={}]
    2. Node[testcluster-shard-00-01…...mongodb.net:27017: Unknown (8/8/10 available connections), latency=9223372036854775807ns, authenticated={}]
    3. Node[testcluster-shard-00-02…...mongodb.net:27017: Primary (9/9/10 available connections), latency=430148298551800ns, authenticated={.....@admin}]
  9. I guess my assumption is that eventually you would expect all of those Nodes to come back to 10/10/10 as things recover.
Are you sure the nodes in Unknown state are up after the cluster resume?
What does a mongo shell say about the RS status?

Will try to reproduce those steps.

You can also try to set up some monitoring on the app side: http://reactivemongo.org/releases/1.0/documentation/advanced-topics/monitoring.html
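
(If the JMX module described on that page is used, the sbt wiring is roughly the following — the artifact coordinates here are assumed, so check the page for the exact ones:)

    // Assumed coordinates for the JMX monitoring module; verify against the linked page
    libraryDependencies += "org.reactivemongo" %% "reactivemongo-jmx" % "1.0.0"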

 

I am not sure which “DB resolution in the document” section you are referring to. I would be happy and willing to try different options, but I simply don’t know where to start.


"""
It’s generally a good practice not to assign the database and collection references to val (even to lazy val), as it’s better to get a fresh reference each time, to automatically recover from any previous issues (e.g. network failure).
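
As a minimal sketch of that pattern (hypothetical names), resolving via def so a fresh reference is obtained on each use:

    import scala.concurrent.{ ExecutionContext, Future }
    import reactivemongo.api.{ DB, MongoConnection }
    import reactivemongo.api.bson.collection.BSONCollection

    // `def`, not `val`: a fresh Future[DB]/collection reference per call lets
    // the pool recover from previous network failures.
    class OrderRepo(connection: MongoConnection)(implicit ec: ExecutionContext) {
      private def db: Future[DB] = connection.database("mydb")

      private def orders: Future[BSONCollection] = db.map(_.collection("orders"))
    }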

Cédric Chantepie

unread,
Nov 5, 2020, 8:10:30 AM11/5/20
to ReactiveMongo - http://reactivemongo.org
I would hypothesize that some nodes are removed from the replica set by the cluster restart, are no longer part of it afterwards, but are still reachable over the network.

Brad Rust

unread,
Nov 5, 2020, 11:50:25 AM11/5/20
to reacti...@googlegroups.com

 

Is there a branch or some replication you need me to try?   I haven’t built the driver from source before, but I am willing to do that to help out.

 

If there is a particular part of the codebase that you want me to try to figure out, I can do that too.   It’s just a bit overwhelming to start looking around.

 

 


Cédric Chantepie

unread,
Nov 5, 2020, 6:07:06 PM11/5/20
to ReactiveMongo - http://reactivemongo.org
On Thursday, 5 November 2020 at 17:50:25 UTC+1 br...@interpayments.com wrote:

 Is there a branch or some replication you need me to try?   I haven’t built the driver from source before, but I am willing to do that to help out.

Have you checked whether the Unknown nodes are usable after the cluster restart?

Have you tried the test scenario with https://github.com/cchantep/RM-SBT-Playground?
If it reproduces the same situation, you can then try the same with 1.0.1-SNAPSHOT (and check for "Node ... has been not queryable for at least ..." in the logs).
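
(Roughly, in the playground's build.sbt — a sketch assuming the snapshot is published to the Sonatype snapshots repository:)

    // Add the snapshots repository so the 1.0.1-SNAPSHOT artifact can resolve
    resolvers += Resolver.sonatypeRepo("snapshots")

    libraryDependencies += "org.reactivemongo" %% "reactivemongo" % "1.0.1-SNAPSHOT"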


 If there is a particular part of the codebase that you want me to try to figure out, I can do that too.

 Make sure there is no DB resolution/Future[DB] assigned to val.

It’s just a bit overwhelming to start looking around

...

Brad Rust

unread,
Nov 9, 2020, 5:11:11 PM11/9/20
to reacti...@googlegroups.com

I can’t seem to run the RM-SBT-Playground on my wsl2 ubuntu shell.  Any thoughts or suggestions?

 

  • Cloned from github
  • ./run.sh
    • Compilation errors… like it didn’t download the RM dependency
  • Used cs to validate reachability of reactive-mongo artifact
    • ~/cs fetch org.reactivemongo::reactivemongo:1.0.0

 

SBT command: sbt

[info] welcome to sbt 1.3.13 (AdoptOpenJDK Java 11.0.8)

[info] loading settings for project global-plugins from sbt-updates.sbt ...

[info] loading global plugins from /home/brust/.sbt/1.0/plugins

[info] loading project definition from /home/brust/src/RM-SBT-Playground/project

[info] loading settings for project rm-sbt-playground from build.sbt ...

[info] set current project to RM-SBT-Playground (in build file:/home/brust/src/RM-SBT-Playground/)

[info] Compiling 1 Scala source to /home/brust/src/RM-SBT-Playground/target/scala-2.12/classes ...

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:10:28: object bson is not a member of package reactivemongo.api

[error]   import reactivemongo.api.bson.BSONDocument

[error]                            ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:21:22: value close is not a member of reactivemongo.api.MongoConnection

[error]     con.foreach(_._1.close()(5.seconds))

[error]                      ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:24:24: value fromStringWithDB is not a member of object reactivemongo.api.MongoConnection

[error]       (MongoConnection.fromStringWithDB(uri).flatMap { dbUri =>

[error]                        ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:73:14: not found: value BSONDocument

[error]         find(BSONDocument.empty).one[BSONDocument].map(_.isDefined), timeout))

[error]              ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:73:38: not found: type BSONDocument

[error]         find(BSONDocument.empty).one[BSONDocument].map(_.isDefined), timeout))

[error]                                      ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:96:48: not found: value BSONDocument

[error]     case Some(db) => db.collection("bar").find(BSONDocument.empty).

[error]                                                ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:97:35: not found: type BSONDocument

[error]         tailable.awaitData.cursor[BSONDocument]().fold({}) { (_, doc) =>

[error]                                   ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:98:29: not found: value BSONDocument

[error]           println(s"doc = ${BSONDocument pretty doc}")

[error]                             ^

[error] 8 errors found

[error] (Compile / compileIncremental) Compilation failed

[error] Total time: 4 s, completed 2020 Nov 9 15:02:39

 

 

 


Cédric Chantepie

unread,
Nov 9, 2020, 6:32:37 PM11/9/20
to reacti...@googlegroups.com
Please do not use wsl2 to test (as said it has network "specificities" ...)


Brad Rust

unread,
Nov 10, 2020, 10:51:04 AM11/10/20
to reacti...@googlegroups.com

I tried with a Debian (stretch) docker image with java 1.8 and sbt 1.4.1 from here (https://github.com/mozilla/docker-sbt)... just in case someone else wants a ready-to-go docker image to bootstrap from.

 

I received exactly the same errors.   I didn’t see it yesterday, but RM_VERSION is 0.17.1 in build.sbt.

 

So, after setting `export RM_VERSION=1.0.0`, everything is bootstrapping and running ok.
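
(For anyone following along, the build.sbt presumably reads that variable from the environment along these lines — a sketch, not the actual file:)

    // Read the driver version from the environment, falling back to the 0.17.1 default
    val rmVersion = sys.env.getOrElse("RM_VERSION", "0.17.1")

    libraryDependencies += "org.reactivemongo" %% "reactivemongo" % rmVersion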

Cédric Chantepie

unread,
Nov 10, 2020, 1:29:57 PM11/10/20
to ReactiveMongo - http://reactivemongo.org
Rather, even avoid Docker. The goal is to restrict possible causes. Containerization or virtualization, on the contrary, adds possible causes of network issues.

Joseph Hui

unread,
Feb 17, 2022, 9:51:16 PM2/17/22
to ReactiveMongo - http://reactivemongo.org
I want to share our experience with this issue. We are using ReactiveMongo 1.0+ and have the same "java.nio.channels.ClosedChannelException: null" problem occasionally. Once we hit this error, it would keep on emitting this error. Meanwhile, the application could still connect to the database, and therefore the problem went unnoticed until the disk was filled up. Finally, we reached this conversation and noticed the following: 

"It’s generally a good practice not to assign the database and collection references to val (even to lazy val), as it’s better to get a fresh reference each time, to automatically recover from any previous issues (e.g. network failure)."

We changed the database and collection references from val to var and the problems did not appear again for more than two months. Hope this could help. Thanks!

Cédric Chantepie

unread,
Feb 18, 2022, 3:03:06 AM2/18/22
to ReactiveMongo - http://reactivemongo.org
On Friday, 18 February 2022 at 03:51:16 UTC+1 josep...@gmail.com wrote:

We changed the database and collection references from val to var and the problems did not appear again for more than two months. Hope this could help. Thanks!

Just to make it clear:
the documentation doesn't suggest using `var` and introducing mutability (rather, use `def`).
 