anyone seen abnormal reactivemongo recovery when Atlas does some hosted db maint or network changes?


Brad Rust

unread,
Nov 2, 2020, 11:05:29 AM11/2/20
to reacti...@googlegroups.com

There have been several times now over the last 6 months where Atlas has done some maintenance or network configuration changes and my production servers simply do NOT recover properly.

 

In this case, we are running Kubernetes pods with reactivemongo 1.0 against Atlas instances, connecting with a mongodb+srv URI.  The event didn’t seem to take connectivity/services “down”, but because the exception below kept spinning, our systems were very slow and never recovered or stopped logging these exceptions.
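
(For reference, a minimal sketch of this kind of setup with the 1.0 API — the URI below is a placeholder, not our actual configuration:)

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.Future
    import reactivemongo.api.{ AsyncDriver, MongoConnection }

    val driver = AsyncDriver()

    // Placeholder Atlas SRV URI; credentials and cluster name are made up
    val uri = "mongodb+srv://user:password@cluster0.example.mongodb.net/mydb"

    val connection: Future[MongoConnection] = driver.connect(uri)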

 

I have tried setting networkaddress.cache.ttl to something like 10 seconds, thinking maybe it was a cached DNS resolution, however this just didn’t seem to affect anything.
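
(For reference, networkaddress.cache.ttl is a java.security property rather than a system property, so it has to be applied before any name resolution happens — a minimal sketch:)

    import java.security.Security

    // networkaddress.cache.ttl is a security property (not a system property);
    // set it early, before any host name is resolved, e.g. at the top of main()
    Security.setProperty("networkaddress.cache.ttl", "10")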

 

 

The spinning and spewing of this stack trace created a CPU overload on the production servers.

 

2020-10-30 23:59:14,050 [ERROR]  r.c.a.MongoDBSystem - [Supervisor-1/surchx] Fails to connect channel #d2661b82

java.nio.channels.ClosedChannelException: null

        at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)

        at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976)

        at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237)

        at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538)

        at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)

        at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)

        at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

 

 

Cédric Chantepie

unread,
Nov 2, 2020, 1:32:57 PM11/2/20
to ReactiveMongo - http://reactivemongo.org
Such logging is not surprising in case of a network change/issue (channels get closed); that's why netty logging should be disabled by default.
As for recovery, what do you mean? In the Play plugin? With which DB resolution code?

Brad Rust

unread,
Nov 2, 2020, 2:45:01 PM11/2/20
to reacti...@googlegroups.com

Are you suggesting that I should set “r.c.a.MongoDBSystem” to “OFF”, since the log messages were ERROR level logs?

 

I certainly would expect some pool/actor completion based on a network error, and logs related to that.  However, these repeated (many per second) for hours after the event (I still don’t know exactly what the Atlas event was, other than it appeared somewhat severe).

 

I would expect that db calls made by the actors which were active or ready-to-be-active might or would fail.  However, after those actors failed and were replaced by newly created actors, I would expect the calls to succeed once the mongodb+srv URI resolved to a successful connection.

 

In my case, I had GB’s of exception logs until I restarted the pods and everything went back to normal. 

 

I mention actors here, but it’s probably relevant that this INFO message kept occurring too… (perhaps one for each exception; 12-15 per second, every 5 seconds)


"[Supervisor-1/db] Fails to connect channel #7e90b17f"

 

 

Thanks for responding; I'm just looking for some tips on handling Atlas (or mongodb server) outages and having my Play 2.8 app keep on going after the errors/events have gone away.


Cédric Chantepie

unread,
Nov 2, 2020, 3:20:27 PM11/2/20
to ReactiveMongo - http://reactivemongo.org
On Monday, 2 November 2020 at 20:45:01 UTC+1 br...@interpayments.com wrote:

Are you suggesting that I should set “r.c.a.MongoDBSystem” to “OFF”, since the log messages were ERROR level logs?

That should only be enabled when investigating an applicative error at an upper level.

 

I certainly would expect some pool/actor completion based on a network error, and logs related to that.  However, these repeated (many per second) for hours after the event (I still don’t know exactly what the Atlas event was, other than it appeared somewhat severe).

 

I would expect that db calls made by the actors which were active or ready-to-be-active might or would fail.  However, after those actors failed and were replaced by newly created actors, I would expect the calls to succeed once the mongodb+srv URI resolved to a successful connection.

 

Unclear.

In my case, I had GB’s of exception logs until I restarted the pods and everything went back to normal. 

 

I mention actors here, but it’s probably relevant that this INFO message kept occurring too… (perhaps one for each exception; 12-15 per second, every 5 seconds)


"[Supervisor-1/db] Fails to connect channel #7e90b17f"

 

 

Thanks for responding; I'm just looking for some tips on handling Atlas (or mongodb server) outages and having my Play 2.8 app keep on going after the errors/events have gone away.

I still don't see the involved code.

Cédric Chantepie

unread,
Nov 2, 2020, 3:28:36 PM11/2/20
to ReactiveMongo - http://reactivemongo.org
And the stacktrace.

Brad Rust

unread,
Nov 2, 2020, 3:37:41 PM11/2/20
to reacti...@googlegroups.com

This (below, except for the timestamp and the channel #id) is the only stack trace I see in my logs, over and over again, for hours, many per second.

 

The code is normal collection reads and writes.   There is no hint of my code in any output unfortunately.  

 

I am not really sure how to explain this any further.   I have a production system which is using rm 1.0.0 and an Atlas primary/secondary/secondary mongodb backend.   Atlas had some cluster reboot and my reactivemongo “system” seemed to go crazy and never repaired or corrected whatever state it was in.   This has happened to me several times.   I guess I can try to replicate it with some local networking hiccups.

 

What I was really trying to ask is… if a mongodb goes down (and perhaps the hostname(s) map to a different IP address), should the reactivemongo system recover and still satisfy database requests?

 

 

2020-10-30 23:59:14,050 [ERROR]  r.c.a.MongoDBSystem - [Supervisor-1/surchx] Fails to connect channel #d2661b82

java.nio.channels.ClosedChannelException: null

        at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)

        at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976)

        at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237)

        at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61)

        at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538)

        at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)

        at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)

        at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

 

 

 



Cédric Chantepie

unread,
Nov 2, 2020, 4:21:58 PM11/2/20
to ReactiveMongo - http://reactivemongo.org
So you have errors in the log at a low level (netty/network) and no errors at the application level?

Brad Rust

unread,
Nov 2, 2020, 4:48:03 PM11/2/20
to reacti...@googlegroups.com

 

 

Yes, correct, no stack traces (or long-term interruption of our service(s)) except for these repeating exceptions.   I’ll re-include them together since I posted them in different messages…  Of course this is just 4 of the thousands, but I wanted to include the timestamps to show how frequently they are coming.

 

 

Info 2020-10-30 18:00:04.863 MDT "[Supervisor-1/db] Fails to connect channel #4f04841e"

2020-10-31 00:00:04,863 [ERROR] r.c.a.MongoDBSystem - [Supervisor-1/db] Fails to connect channel #4f04841e

Error 2020-10-30 18:00:04.863 MDT java.nio.channels.ClosedChannelException: null at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976) at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237) at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

Info 2020-10-30 18:00:04.863 MDT "[Supervisor-1/db] Fails to connect channel #8206dbce"

2020-10-31 00:00:04,863 [ERROR] r.c.a.MongoDBSystem - [Supervisor-1/db] Fails to connect channel #8206dbce

Error 2020-10-30 18:00:04.864 MDT java.nio.channels.ClosedChannelException: null at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976) at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237) at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

Info 2020-10-30 18:00:04.864 MDT "[Supervisor-1/db] Fails to connect channel #a1fe6459"

Error 2020-10-30 18:00:04.864 MDT 2020-10-31 00:00:04,864 [ERROR] r.c.a.MongoDBSystem - [Supervisor-1/db] Fails to connect channel #a1fe6459

Error 2020-10-30 18:00:04.864 MDT java.nio.channels.ClosedChannelException: null at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976) at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237) at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

Error 2020-10-30 18:00:04.864 MDT 2020-10-31 00:00:04,864 [ERROR] r.c.a.MongoDBSystem - [Supervisor-1/db] Fails to connect channel #8e1cde89

2020-10-30 18:00:04.864 MDT java.nio.channels.ClosedChannelException: null at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at reactivemongo.io.netty.channel.AbstractChannel$AbstractUnsafe.ensureOpen(AbstractChannel.java:976) at reactivemongo.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:237) at reactivemongo.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) at reactivemongo.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) at reactivemongo.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at reactivemongo.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at reactivemongo.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

Info 2020-10-30 18:00:04.864 MDT "[Supervisor-1/db] Fails to connect channel #8e1cde89"

Cédric Chantepie

unread,
Nov 2, 2020, 5:36:28 PM11/2/20
to reacti...@googlegroups.com
Then it means that the pool is doing its job handling network signals through netty.

You can try to optimize network options, but there is no driver issue as far as I see.


Brad Rust

unread,
Nov 3, 2020, 12:41:16 AM11/3/20
to reacti...@googlegroups.com

 

So when there are network failures like this, there is no other way to have a production system operate properly other than “rebooting the server”?   That just seems really wonky.

 

Is there anything you can share about the internals of the mongodb+srv protocol or bootstrapping that might suggest caching of hostnames?  Or any host caching you can think of that might not be honoring the networkaddress.cache.ttl JVM setting?
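
(As a side note, a rough way to check what the SRV name currently resolves to from inside a pod, using the JDK's built-in JNDI DNS provider — the cluster hostname below is a placeholder:)

    import java.util.Hashtable
    import javax.naming.directory.InitialDirContext

    // Query the SRV record that a mongodb+srv URI expands to, to see which
    // hosts/ports the cluster name currently maps to.
    val env = new Hashtable[String, String]()
    env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory")

    val ctx = new InitialDirContext(env)
    val srv = ctx.getAttributes("_mongodb._tcp.cluster0.example.mongodb.net", Array("SRV"))

    println(srv.get("SRV"))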

 

There must be someone using reactivemongo with Kubernetes with Atlas – and having similar issues.   Anyone?

 

Anyway, thanks for your time.   Unfortunately, if I can’t hack through this somehow it’s probably the end of the road.

 

 


Cédric Chantepie

unread,
Nov 3, 2020, 12:18:27 PM11/3/20
to ReactiveMongo - http://reactivemongo.org
On Tuesday, 3 November 2020 at 06:41:16 UTC+1 br...@interpayments.com wrote:

 So when there are network failures like this, there is no other way to have a production system operate properly other than “rebooting the server”?   That just seems really wonky.

I think it was asked whether there are errors/stacktraces at the application level (not at the low/netty level), and apparently there are none ...
Which seems quite impossible if the application has crashed/stopped. So please share those relevant traces to get help.

Also share the code related to those traces.
There are a lot of ways to prevent an app from being resilient, especially regarding the DB resolution as indicated in the documentation.
 

 Is there anything you can share about the internals of the mongodb+srv protocol or bootstrapping that might suggest caching of hostnames?  Or any host caching you can think of that might not be honoring the networkaddress.cache.ttl JVM setting?

There are no specific settings, and the network layer is implemented with netty (which is battle-tested).

Brad Rust

unread,
Nov 4, 2020, 10:42:16 AM11/4/20
to reacti...@googlegroups.com

Here is what I am seeing as I am trying to replicate this.

 

  • Play 2.8.2 scala
  • Reactivemongo 1.0.0
  • Atlas M10 cluster (primary and two secondaries)
  • “normal” BSONCollection finds, updates, modifies, etc. (meaning, the code base works as expected for all happy-path network conditions; see the sketch after this list).   Additionally (locally, not with Atlas), if I stop my mongodb and then restart it… everything “recovers” just fine.
  • Turned off/removed the linux-shaded component just in case that was adding complexity.
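
For context, the “normal” operations above look roughly like this (a simplified sketch with hypothetical collection and field names, not the actual code base):

    import scala.concurrent.{ ExecutionContext, Future }
    import reactivemongo.api.bson.BSONDocument
    import reactivemongo.api.bson.collection.BSONCollection

    // Happy-path read and write against a lazily resolved collection
    def findOrder(coll: Future[BSONCollection], id: String)(
        implicit ec: ExecutionContext): Future[Option[BSONDocument]] =
      coll.flatMap(_.find(BSONDocument("_id" -> id)).one[BSONDocument])

    def markProcessed(coll: Future[BSONCollection], id: String)(
        implicit ec: ExecutionContext) =
      coll.flatMap(_.update.one(
        q = BSONDocument("_id" -> id),
        u = BSONDocument("$set" -> BSONDocument("processed" -> true))))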

 

How I am trying to replicate (and seem to have replicated *something* that I think is abnormal)

 

  1. Start play app, do all the happy-path reads and writes with success
  2. Atlas pause-cluster
  3. As expected play app logs exceptions about connecting to primary and/or secondary
  4. Atlas resume-cluster
  5. The play app recovers (mostly) and allows my happy-path reads and writes with success
  6. MongoDBSystem continues to log the following …
    1. r.c.a.MongoDBSystem - [Supervisor-6/my-db] Fails to connect channel #<SOME_CHANNEL_ID>
  7. In MongoDBSystem, it goes through the connectAll(nodeSet) code where the updateNode block is called and the netty connect fails, yielding the above (#6) logged exception (MongoDBSystem:1528).   The nodes of the nodeSet have “Disconnected” status entries that *never* get removed.   So while the application is still functional, from this point onward I will *forever* (until server restart) have Disconnected channels in the node.
  8. Looking at this a different way, on MongoDBSystem:1572:updateNode(node, node.connections, Vector.empty), the node.connections Vector has connections which stay in a Disconnected state and never get removed or cleaned up.  If you were to look at the toShortString of these, mine look like this:
    1. Node[testcluster-shard-00-00…...mongodb.net:27017: Unknown (9/9/10 available connections), latency=9223372036854775807ns, authenticated={}]
    2. Node[testcluster-shard-00-01…...mongodb.net:27017: Unknown (8/8/10 available connections), latency=9223372036854775807ns, authenticated={}]
    3. Node[testcluster-shard-00-02…...mongodb.net:27017: Primary (9/9/10 available connections), latency=430148298551800ns, authenticated={.....@admin}]
  9. I guess my assumption is that eventually you would expect all of those Nodes to come back to 10/10/10 as things recover.

 

 

I am not sure which “DB resolution in the document” section you are referring to. I would be happy and willing to try different options, but I simply don’t know where to start.

 

 

 


Carlos Saltos

unread,
Nov 4, 2020, 10:56:59 AM11/4/20
to reacti...@googlegroups.com
This is also happening to us at Talenteca.com ... the application is OK after the MongoDB cluster maintenance, but those logs appear like crazy, tons per second, killing the hard disk and CPU, and then crashing the application.

This has been happening for us since ReactiveMongo version 0.16 ... so every time there is a maintenance we have to restart the application.

The real pain is when it is not a scheduled maintenance but a real database crash outage; then it does not matter if MongoDB as a cluster heals itself, ReactiveMongo will kill the app with tons of repetitive logs and CPU usage.

By the way, in ReactiveMongo version 0.16 this was fixed with a quick patch, but the issue was reintroduced from ReactiveMongo version 0.17 on; after a long discussion on how to properly fix it, there is no real fix yet.

Thanks for ReactiveMongo, it’s a great and important project for us. I hope we can improve this little hiccup soon.

Best regards,

Carlos Saltos


Cédric Chantepie

unread,
Nov 4, 2020, 2:45:39 PM11/4/20
to ReactiveMongo - http://reactivemongo.org
On Wednesday, 4 November 2020 at 16:56:59 UTC+1 csa...@gmail.com wrote:
This is also happening to us at Talenteca.com ... the application is OK after the MongoDB cluster maintenance, but those logs appear like crazy, tons per second, killing the hard disk and CPU, and then crashing the application.

This has been happening for us since ReactiveMongo version 0.16 ... so every time there is a maintenance we have to restart the application.
 
Please do not mix topics.
0.16 is in a totally different league, and is more than 1 year old.
If you can't update to 1.0.0, you need at least to upgrade to 0.18.6+, with 0.20.13 being recommended.

Carlos Saltos

unread,
Nov 4, 2020, 3:32:25 PM11/4/20
to reacti...@googlegroups.com
I’m sorry for that important missing part -> it also happens with 1.0.0 ... it has been happening since 0.16 without a clear, solid solution.

Once again, thank you for ReactiveMongo, and I hope we can find a good solution all together soon.

I will try to provide an easily reproducible guide and repo for this bug in the coming weeks (it’s not easy since it’s a live runtime error during a cluster maintenance intervention) ... but I will try.

Best regards,

Carlos Saltos


Cédric Chantepie

unread,
Nov 5, 2020, 4:51:39 AM11/5/20
to reacti...@googlegroups.com
Thanks for the feedback.


On Wed, Nov 4, 2020 at 16:42, Brad Rust <br...@interpayments.com> wrote:


  1. Start play app, do all the happy-path reads and writes with success
  2. Atlas pause-cluster
  3. As expected play app logs exceptions about connecting to primary and/or secondary
  4. Atlas resume-cluster
  5. The play app recovers (mostly) and allows my happy-path reads and writes with success
  6. MongoDBSystem continues to log the following …
    1. r.c.a.MongoDBSystem - [Supervisor-6/my-db] Fails to connect channel #<SOME_CHANNEL_ID>
  7. In MongoDBSystem, it goes through the connectAll(nodeSet) code where the updateNode block is called and the netty connect fails, yielding the above (#6) logged exception (MongoDBSystem:1528).   The nodes of the nodeSet have “Disconnected” status entries that *never* get removed.   So while the application is still functional, from this point onward I will *forever* (until server restart) have Disconnected channels in the node.
  8. Looking at this a different way, on MongoDBSystem:1572:updateNode(node, node.connections, Vector.empty), the node.connections Vector has connections which stay in a Disconnected state and never get removed or cleaned up.  If you were to look at the toShortString of these, mine look like this:
    1. Node[testcluster-shard-00-00…...mongodb.net:27017: Unknown (9/9/10 available connections), latency=9223372036854775807ns, authenticated={}]
    2. Node[testcluster-shard-00-01…...mongodb.net:27017: Unknown (8/8/10 available connections), latency=9223372036854775807ns, authenticated={}]
    3. Node[testcluster-shard-00-02…...mongodb.net:27017: Primary (9/9/10 available connections), latency=430148298551800ns, authenticated={.....@admin}]
  9. I guess my assumption is that eventually you would expect all of those Nodes to come back to 10/10/10 as things recover.
Are you sure the nodes in Unknown state are up after the cluster resume?
What does a mongo shell say about the RS status?

Will try to reproduce those steps.

You can also try to set up some monitoring on the app side: http://reactivemongo.org/releases/1.0/documentation/advanced-topics/monitoring.html
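
(If the JMX module described on that page is used, the sbt wiring is roughly the following — the artifact coordinates here are assumed, so check the page for the exact ones:)

    // Assumed coordinates for the JMX monitoring module; verify against the linked page
    libraryDependencies += "org.reactivemongo" %% "reactivemongo-jmx" % "1.0.0"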

 

I am not sure which “DB resolution in the document” section you are referring to. I would be happy and willing to try different options, but I simply don’t know where to start.


"""
It’s generally a good practice not to assign the database and collection references to val (even to lazy val), as it’s better to get a fresh reference each time, to automatically recover from any previous issues (e.g. network failure).
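
As a minimal sketch of that pattern (hypothetical names), resolving via def so a fresh reference is obtained on each use:

    import scala.concurrent.{ ExecutionContext, Future }
    import reactivemongo.api.{ DB, MongoConnection }
    import reactivemongo.api.bson.collection.BSONCollection

    // `def`, not `val`: a fresh Future[DB]/collection reference per call lets
    // the pool recover from previous network failures.
    class OrderRepo(connection: MongoConnection)(implicit ec: ExecutionContext) {
      private def db: Future[DB] = connection.database("mydb")

      private def orders: Future[BSONCollection] = db.map(_.collection("orders"))
    }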

Cédric Chantepie

unread,
Nov 5, 2020, 8:10:30 AM11/5/20
to ReactiveMongo - http://reactivemongo.org
I would hypothesize that some nodes are removed from the replica set by the cluster restart, are no longer part of it afterwards, but are still reachable over the network.

Brad Rust

unread,
Nov 5, 2020, 11:50:25 AM11/5/20
to reacti...@googlegroups.com

 

Is there a branch or some replication you need me to try?   I haven’t built the driver from source before, but I am willing to do that to help out.

 

If there is a particular part of the codebase that you want me to try to figure out, I can do that too.   It’s just a bit overwhelming to start looking around.

 

 


Cédric Chantepie

unread,
Nov 5, 2020, 6:07:06 PM11/5/20
to ReactiveMongo - http://reactivemongo.org
On Thursday, 5 November 2020 at 17:50:25 UTC+1 br...@interpayments.com wrote:

 Is there a branch or some replication you need me to try?   I haven’t built the driver from source before, but I am willing to do that to help out.

Have you checked whether the Unknown nodes are usable after the cluster restart?

Have you tried the test scenario with https://github.com/cchantep/RM-SBT-Playground?
If it reproduces the same situation, you can then try the same with 1.0.1-SNAPSHOT (and check for "Node ... has been not queryable for at least ..." in the logs).
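
(Roughly, in the playground's build.sbt — a sketch assuming the snapshot is published to the Sonatype snapshots repository:)

    // Add the snapshots repository so the 1.0.1-SNAPSHOT artifact can resolve
    resolvers += Resolver.sonatypeRepo("snapshots")

    libraryDependencies += "org.reactivemongo" %% "reactivemongo" % "1.0.1-SNAPSHOT"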


 If there is a particular part of the codebase that you want me to try to figure out, I can do that too.

 Make sure there is no DB resolution/Future[DB] assigned to val.

It’s just a bit overwhelming to start looking around

...

Brad Rust

unread,
Nov 9, 2020, 5:11:11 PM11/9/20
to reacti...@googlegroups.com

I can’t seem to run the RM-SBT-Playground on my wsl2 ubuntu shell.  Any thoughts or suggestions?

 

  • Cloned from github
  • ./run.sh
    • Compilation errors… like it didn’t download the RM dependency
  • Used cs to validate reachability of reactive-mongo artifact
    • ~/cs fetch org.reactivemongo::reactivemongo:1.0.0

 

SBT command: sbt

[info] welcome to sbt 1.3.13 (AdoptOpenJDK Java 11.0.8)

[info] loading settings for project global-plugins from sbt-updates.sbt ...

[info] loading global plugins from /home/brust/.sbt/1.0/plugins

[info] loading project definition from /home/brust/src/RM-SBT-Playground/project

[info] loading settings for project rm-sbt-playground from build.sbt ...

[info] set current project to RM-SBT-Playground (in build file:/home/brust/src/RM-SBT-Playground/)

[info] Compiling 1 Scala source to /home/brust/src/RM-SBT-Playground/target/scala-2.12/classes ...

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:10:28: object bson is not a member of package reactivemongo.api

[error]   import reactivemongo.api.bson.BSONDocument

[error]                            ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:21:22: value close is not a member of reactivemongo.api.MongoConnection

[error]     con.foreach(_._1.close()(5.seconds))

[error]                      ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:24:24: value fromStringWithDB is not a member of object reactivemongo.api.MongoConnection

[error]       (MongoConnection.fromStringWithDB(uri).flatMap { dbUri =>

[error]                        ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:73:14: not found: value BSONDocument

[error]         find(BSONDocument.empty).one[BSONDocument].map(_.isDefined), timeout))

[error]              ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:73:38: not found: type BSONDocument

[error]         find(BSONDocument.empty).one[BSONDocument].map(_.isDefined), timeout))

[error]                                      ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:96:48: not found: value BSONDocument

[error]     case Some(db) => db.collection("bar").find(BSONDocument.empty).

[error]                                                ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:97:35: not found: type BSONDocument

[error]         tailable.awaitData.cursor[BSONDocument]().fold({}) { (_, doc) =>

[error]                                   ^

[error] /home/brust/src/RM-SBT-Playground/src/main/scala/Playground.scala:98:29: not found: value BSONDocument

[error]           println(s"doc = ${BSONDocument pretty doc}")

[error]                             ^

[error] 8 errors found

[error] (Compile / compileIncremental) Compilation failed

[error] Total time: 4 s, completed 2020 Nov 9 15:02:39

 

 

 


Cédric Chantepie

unread,
Nov 9, 2020, 6:32:37 PM11/9/20
to reacti...@googlegroups.com
Please do not use wsl2 to test (as said it has network "specificities" ...)


Brad Rust

unread,
Nov 10, 2020, 10:51:04 AM11/10/20
to reacti...@googlegroups.com

I tried with a Debian (stretch) docker image with java 1.8 and sbt 1.4.1 from here (https://github.com/mozilla/docker-sbt)... just in case someone else wants a ready-to-go docker image to bootstrap from.

 

I received exactly the same errors.   I didn’t see it yesterday, but RM_VERSION is 0.17.1 in build.sbt.

 

So, after setting `export RM_VERSION=1.0.0`, everything is bootstrapping and running ok.
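
(For anyone following along, the build.sbt presumably reads that variable from the environment along these lines — a sketch, not the actual file:)

    // Read the driver version from the environment, falling back to the 0.17.1 default
    val rmVersion = sys.env.getOrElse("RM_VERSION", "0.17.1")

    libraryDependencies += "org.reactivemongo" %% "reactivemongo" % rmVersion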

Cédric Chantepie

unread,
Nov 10, 2020, 1:29:57 PM11/10/20
to ReactiveMongo - http://reactivemongo.org
Rather, even avoid Docker. The goal is to restrict possible causes. Containerization or virtualization, on the contrary, adds possible causes of network issues.

Joseph Hui

unread,
Feb 17, 2022, 9:51:16 PM2/17/22
to ReactiveMongo - http://reactivemongo.org
I want to share our experience with this issue. We are using ReactiveMongo 1.0+ and have the same "java.nio.channels.ClosedChannelException: null" problem occasionally. Once we hit this error, it would keep on emitting this error. Meanwhile, the application could still connect to the database, and therefore the problem went unnoticed until the disk was filled up. Finally, we reached this conversation and noticed the following: 

"It’s generally a good practice not to assign the database and collection references to val (even to lazy val), as it’s better to get a fresh reference each time, to automatically recover from any previous issues (e.g. network failure)."

We changed the database and collection references from val to var and the problems did not appear again for more than two months. Hope this could help. Thanks!

Cédric Chantepie

unread,
Feb 18, 2022, 3:03:06 AM2/18/22
to ReactiveMongo - http://reactivemongo.org
On Friday, 18 February 2022 at 03:51:16 UTC+1 josep...@gmail.com wrote:

We changed the database and collection references from val to var and the problems did not appear again for more than two months. Hope this could help. Thanks!

Just to make it clear:
the documentation doesn't suggest using `var` and introducing mutability (rather, use `def`).
 