
Any production suggestions for handling primary/secondary elections (with Atlas in my case)?


br...@interpayments.com

Nov 21, 2022, 12:42:58 PM
to ReactiveMongo - http://reactivemongo.org
Mongo Atlas has weekly maintenance events (if you set a schedule), which frequently force a secondary -> primary election. We have steady write traffic to our Atlas cluster (which is also multi-regional), and the writes need to complete in under 10 seconds.

During the election, it almost always takes about a minute for ReactiveMongo to recover/restart and allow writes to the new primary. Does anyone have any Atlas or election configuration suggestions to make this quicker or less turbulent?

Most testing is now done with Atlas' "test failover" functionality which supposedly just forces the election.  The testing appears to simulate accurately what happens during their maintenance events.

If relevant, we are on Play 2.8, Scala 2.13, RM 1.1.0-RC6 (same relative response with RM 1.0.10), and Mongo Atlas (closest regions to our cloud provider regions, < 15ms latency). I've tried netty-native TCP configs (keepAlive and tcpNoDelay), but they don't seem to affect anything.

The RM configurations I have tried are lower connectTimeoutMS values and lower heartbeatFrequencyMS values, but it always seems to take a full minute before I can start writing again to the new primary.
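
For reference, both options named above are connection-URI parameters. A minimal sketch of how they might be set follows; the `MongoUri` helper, the host, the database name and the values are all placeholders for illustration, not recommendations. As reported in this thread, changing these values did not shorten the recovery time.

```scala
// Hypothetical helper (not part of ReactiveMongo): assembles a connection URI
// from a host, a database name and a list of URI options.
object MongoUri {
  def build(host: String, db: String, options: Seq[(String, String)]): String = {
    val query = options.map { case (k, v) => s"$k=$v" }.mkString("&")
    val base = s"mongodb+srv://$host/$db"
    if (query.isEmpty) base else s"$base?$query"
  }

  // Placeholder host and database; the option names come from the post above,
  // the values are examples only.
  val example: String = build(
    host = "cluster0.example.mongodb.net",
    db = "mydb",
    options = Seq(
      "connectTimeoutMS" -> "2000",
      "heartbeatFrequencyMS" -> "1000"))
}
```

The resulting string would then be handed to the driver when opening the connection; credentials are deliberately left out of the sketch.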

I am including some of the RM error and warn responses but I assume they are typical for any election.

"level":"ERROR","logger_name":"akka.actor.OneForOneStrategy","thread_name":"reactivemongo-akka.actor.default-dispatcher-8" .... "stack_trace":"java.nio.BufferUnderflowException: null\n\tat java.base/java.nio.Buffer.nextGetIndex(Buffer.java:707)\n\tat java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:684)\n\tat reactivemongo.api.bson.buffer.ReadableBuffer$.readInt$extension(ReadableBuffer.scala:31)
"level":"WARN","logger_name":"reactivemongo.core.actors.MongoDBSystem","thread_name":"reactivemongo-akka.actor.default-dispatcher-15","message":"[Supervisor-1/foo] Restarting the MongoDBSystem: Response(MessageHeader(380, 7889664, 325, 1), Reply(10,0,0,1), ResponseInfo(5bcc0398))","stack_trace":"java.nio.BufferUnderflowException: null\n\tat java.base/java.nio.Buffer.nextGetIndex(Buffer.java:707)\n\tat java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:684)\n\tat reactivemongo.api.bson.buffer.ReadableBuffer$.readInt$extension(ReadableBuffer.scala:31)
"level":"INFO","logger_name":"reactivemongo.core.actors.MongoDBSystem","thread_name":"reactivemongo-akka.actor.default-dispatcher-15","message":"[Supervisor-1/foo] Stopping the MongoDBSystem"}
"level":"WARN","logger_name":"reactivemongo.core.protocol.MongoHandler","thread_name":"nioEventLoopGroup-2-6","message":"[Supervisor-1/foo] Channel is closed under 485236165ns: elections-shard-00-01.mongodb.net/34.12.222.10:27017; Please check network connectivity and the status of the set. (channel [id: 0xf4499761, L:/10.100.129.8:46706 ! R:elections-shard-00-01.mongodb.net/34.12.222.10:27017])"
.
.
.
Hundreds, if not thousands, of "something is broken" warnings and errors follow.



Cédric Chantepie

Nov 27, 2022, 3:08:50 PM
to ReactiveMongo - http://reactivemongo.org
As it stands, RC6 is a release candidate, not a release.

Carlos Saltos

Nov 28, 2022, 11:23:54 AM
to reacti...@googlegroups.com
Yes, we have the same error. As a workaround, we rotate all our backend servers with a rolling update, and that seems to fix it.

As Cédric mentioned, this is a release candidate version, so hopefully this will be fixed in the final release.

Once again, thank you for all the support and for the great product that ReactiveMongo is.

Best regards,

Carlos Saltos
Talenteca Minkana

--
You received this message because you are subscribed to the Google Groups "ReactiveMongo - http://reactivemongo.org" group.
To unsubscribe from this group and stop receiving emails from it, send an email to reactivemong...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/reactivemongo/686f71d9-a0da-4738-8af6-a554adc92e32n%40googlegroups.com.

br...@interpayments.com

Dec 7, 2022, 8:42:31 PM
to ReactiveMongo - http://reactivemongo.org
I get the same one-minute "recovery" from a secondary to primary election with the RM 1.0.10 driver as well.

I have tried connectTimeoutMS and heartbeatFrequencyMS "tuning" (at least changing their values), but it does not seem to affect *anything*.

Are there by chance any hooks or actor messages that I could send to say "refresh the nodeset because I know there is a new primary", at least to force some kind of refresh or detection?

Is there a way to debug the DNS SRV resolver code? I am assuming that the driver gets primary updates from it, but I simply don't know.

I am willing to dig into this, but it's a bit daunting to start.

Thanks,
Brad

Cédric Chantepie

Dec 13, 2022, 5:39:38 AM
to ReactiveMongo - http://reactivemongo.org
I cannot reproduce such an issue with Atlas for now.

br...@interpayments.com

Dec 13, 2022, 8:41:24 PM
to ReactiveMongo - http://reactivemongo.org
I am not really sure what you mean.

Are you saying that you (or other experiential data / driver deployments) have never seen an issue with Atlas elections?  Or are you saying that it's out of scope to reproduce it against Atlas for now?

Cédric Chantepie

Dec 19, 2022, 10:14:07 AM
to ReactiveMongo - http://reactivemongo.org
On Wednesday, 14 December 2022 at 02:41:24 UTC+1 br...@interpayments.com wrote:
> I am not really sure what you mean.
>
> Are you saying that you (or other experiential data / driver deployments) have never seen an issue with Atlas elections?

A temporary nodeset update due to an election can happen, but I don't see a persistent or recurrent issue like this.

Brad Rust

Dec 20, 2022, 11:30:06 AM
to reacti...@googlegroups.com

I have tried four open-source ReactiveMongo projects (with Play or akka-http), all unrelated to my codebase, and they all react the same way. I’ve tried a couple with Java 11 and Java 17.

 

Really posting once more in case the particular logging events shed light on anything Atlas or RM could be doing.

 

  • Testing with Atlas’ “test failover”; an Atlas consultant verified that the election happened as expected, within a second or two.
  • Multi-region cluster of 5 mongo instances
  • RM, Play2.8, Guice DI, Scala 2.13

 

  1. Election happens
  2. Repeat for one minute
    1. BufferUnderflowException is thrown from a Read
    2. Restarting the MongoDBSystem
    3. Stopping the MongoDBSystem
    4. Channel is closed under…  * 5
    5. MongoDBSystem is restarted
    6. Starting the MongoDBSystem
    7. Some of these too:
      i. Got an error, no more attempts to do. Completing with a failure...
      ii. Error during processing of request: 'MongoError['No primary node is available! (Supervisor-1/Connection-1)']'. Completing with 500 Internal Server Error response. To change default exception handling behavior, provide a custom ExceptionHandler. (reactivemongo.core.actors.Exceptions$PrimaryUnavailableException: MongoError['No primary node is available! (Supervisor-1/Connection-1)'])
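
One application-side mitigation for the "No primary node is available!" failures in the sequence above is to retry a failed write with backoff inside the caller's time budget. The following is a minimal sketch using only the Scala standard library; `WriteRetry`, its parameters and the `retryable` predicate are hypothetical names, not ReactiveMongo API. It does not shorten the driver's own recovery; it can only smooth over an unavailability window shorter than the deadline.

```scala
import java.util.concurrent.{ Executors, ThreadFactory, TimeUnit }
import scala.concurrent.{ ExecutionContext, Future, Promise }
import scala.concurrent.duration._
import scala.util.control.NonFatal

// Hypothetical helper, not part of ReactiveMongo.
object WriteRetry {
  // Daemon scheduler so that pending retries never keep the JVM alive.
  private val scheduler = Executors.newSingleThreadScheduledExecutor(
    new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "write-retry")
        t.setDaemon(true)
        t
      }
    })

  // Evaluates `body` after `delay` without blocking a thread while waiting.
  private def after[T](delay: FiniteDuration)(body: => Future[T]): Future[T] = {
    val p = Promise[T]()
    scheduler.schedule(new Runnable {
      def run(): Unit = {
        p.completeWith(try body catch { case NonFatal(e) => Future.failed(e) })
        ()
      }
    }, delay.toMillis, TimeUnit.MILLISECONDS)
    p.future
  }

  // Re-runs `op` while the failure is `retryable`, attempts remain, and the
  // next delay still fits before `deadline`; the delay doubles on each retry.
  def retry[T](
    attempts: Int,
    delay: FiniteDuration,
    deadline: Deadline,
    retryable: Throwable => Boolean)(op: => Future[T])(
    implicit ec: ExecutionContext): Future[T] =
    Future.unit.flatMap(_ => op).recoverWith {
      case NonFatal(e) if retryable(e) && attempts > 1 && deadline.timeLeft > delay =>
        after(delay)(retry(attempts - 1, delay * 2L, deadline, retryable)(op))
    }
}
```

With the 10-second write budget mentioned at the start of the thread, a call might look like `WriteRetry.retry(6, 200.millis, 10.seconds.fromNow, isPrimaryUnavailable)(doWrite())`, where `isPrimaryUnavailable` and `doWrite` are placeholders for the application's own error check and write operation. Retrying is only safe when the write is idempotent.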

 

If there is an open-source or shareable project that recovers quickly from a MongoDB election, would its authors be willing to share it so I can test with it?

 

Thanks in advance.   I realize that a lot of hours have been put into reactivemongo.   I’m just blocked hard on redundancy and recovery.


Cédric Chantepie

Jan 29, 2023, 10:25:00 AM
to ReactiveMongo - http://reactivemongo.org
It's used with Atlas in production without issue.

Cédric Chantepie

Jan 29, 2023, 10:31:54 AM
to ReactiveMongo - http://reactivemongo.org
"""
It’s generally a good practice not to assign the database and collection references to val (even to lazy val), as it’s better to get a fresh reference each time, to automatically recover from any previous issues (e.g. network failure).
""" http://reactivemongo.org/releases/1.0/documentation/tutorial/connect-database.html
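
The difference the quoted advice relies on is evaluation: a `val` is resolved once and reused, while a `def` is resolved again on every call. The sketch below illustrates only that difference, with stand-in types (`Handle`, `FakeConnection`, `Repo` and the host names are hypothetical and are not ReactiveMongo classes).

```scala
import scala.concurrent.{ Await, ExecutionContext, Future }
import scala.concurrent.duration._

object FreshReferences {
  // Stand-in for a database/collection reference; NOT a ReactiveMongo type.
  final case class Handle(primary: String)

  // Stand-in for a connection whose view of the primary can change.
  final class FakeConnection(@volatile var primary: String) {
    def database()(implicit ec: ExecutionContext): Future[Handle] =
      Future(Handle(primary))
  }

  class Repo(conn: FakeConnection)(implicit ec: ExecutionContext) {
    // Resolved once at construction: keeps whatever was seen at that time.
    val cached: Future[Handle] = conn.database()

    // Resolved on every call: reflects the current state of the connection.
    def fresh: Future[Handle] = conn.database()
  }
}
```

If the stand-in connection's primary changes after `Repo` is created, `cached` still yields the old handle while `fresh` yields one for the new primary.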

Brad Rust

Jan 30, 2023, 11:29:20 AM
to reacti...@googlegroups.com

FWIW, I have no val or lazy val associations with database or collections in code.   Our collection code uses the `trait WithCollection[C <: Collection]`

 

It’s just strange to me that the other codebases I have tried (assuming we aren’t *all* doing something wrong) also fail to recover from Atlas primary election sequences. There are a couple of others reporting the same type of behaviour as well.

 

I am willing to try any suggestions that folks might have. I did notice, after re-reading the compatibility notes, that I am running Java 17 (I wasn’t sure if there was a potential issue there), Akka 2.6.19, and Play 2.8.17.

 

Thanks.


Carlos Saltos

Jan 30, 2023, 12:11:05 PM
to reacti...@googlegroups.com
We have also had the same errors on our servers (a normal installation on AWS instances) for the past two years. As a workaround, when we are maintaining the servers, we rotate the clients explicitly.

It would be nice to have a real solution for this, hopefully soon. The reproduction steps you are sharing are very valuable, thank you.

As always, thank you to the ReactiveMongo maintainers; it's a great and very useful driver.

Best regards,

Carlos Saltos

Cédric Chantepie

May 7, 2023, 11:12:04 AM
to ReactiveMongo - http://reactivemongo.org
A possible fix is included in the latest 1.1.0-RC10.

Brad Rust

May 8, 2023, 6:19:51 PM
to reacti...@googlegroups.com

Preliminary testing for me yielded successful results (using Atlas “test resilience” functionality).   Thanks for the notification on this.   I realize it’s still RC, but that is a very welcome fix.

 

Thanks!
