Lock stuck in failover

114 views
Skip to first unread message

Szabo Akos

unread,
May 20, 2016, 4:27:03 AM5/20/16
to ReactiveMongo - http://reactivemongo.org
Hi!

I am facing a problem that occurres during failover in one of my test systems. I use the default failover strategy that implies a delay of 125 millisec to retry. The problem occures in the following code snippet in the Failover2 class:

api.scala (from line 165):
----------------------------------------------------
          connection.actorSystem.scheduler.scheduleOnce(delay) {
            lock.synchronized(lock.notify)
          }

          lock.synchronized {
            lock.wait()
            send(`try`)
          }
----------------------------------------------------

If some way or another the execution does not reach the lock.wait() until the scheduleOnce happens, then the lock gets stuck forever, leading to variuos nondeterministic problems. I verified, that a lock notification happens, so it is likely to be a synchronization issue.

The other problem is (related to this issue), that FailoverStrategy - at least for me - does not get overridden in the way described in the documentation (http://reactivemongo.org/releases/0.11/documentation/advanced-topics/failoverstrategy.html):
val collection = db2.collection("collname", customStrategy)

I set the value for every collection, but without success. I think this should be set via configuration file, not in the code.

Thanks,
Akos

Szabo Akos

unread,
May 20, 2016, 4:32:18 AM5/20/16
to ReactiveMongo - http://reactivemongo.org
One sidenote: the problem occures in a virtual machine environment (based on OpenVz technology).

Cédric Chantepie

unread,
May 20, 2016, 7:12:18 AM5/20/16
to ReactiveMongo - http://reactivemongo.org


On Friday, 20 May 2016 10:27:03 UTC+2, Szabo Akos wrote:



If some way or another the execution does not reach the lock.wait() until the scheduleOnce happens, then the lock gets stuck forever, leading to variuos nondeterministic problems. I verified, that a lock notification happens, so it is likely to be a synchronization issue.

How do you test it?
 

The other problem is (related to this issue), that FailoverStrategy - at least for me - does not get overridden in the way described in the documentation (http://reactivemongo.org/releases/0.11/documentation/advanced-topics/failoverstrategy.html):
val collection = db2.collection("collname", customStrategy)

I set the value for every collection, but without success.

What do you mean without success? How to you test it?
 
I think this should be set via configuration file, not in the code.

A custom failover for all the collection resolution can be passed during DB resolution: connection.database(name, failover) .

Szabo Akos

unread,
May 20, 2016, 7:43:46 AM5/20/16
to ReactiveMongo - http://reactivemongo.org
Hi Cédric!

Thanks for your response. 

I debugged the related code with breakpoints (and many more until I found this). I tested it with restarting mongodb service when calling some mongodb command, thus the failover mechanism had started. Here I could check the initial delay property which was the default instead of my custom value, however I changed failover strategies at collection level. It would be fine if this strategy could be configured independetly and globally. The documentation also contains a way to configure it on database level, but it's not much better in my view, because you have to specify the custom failover strategy everywhere you reference the database.

As for the locking problem I suspect that the virtualization environment causes the trouble, the scheduleOnce call might not wait for the specified duration. But even though it is risky to put a lock without timeout, it can freeze the application quite badly.

Regards,
Akos

Cédric Chantepie

unread,
May 20, 2016, 10:05:51 AM5/20/16
to ReactiveMongo - http://reactivemongo.org


On Friday, 20 May 2016 13:43:46 UTC+2, Szabo Akos wrote:


I debugged the related code with breakpoints (and many more until I found this). I tested it with restarting mongodb service when calling some mongodb command, thus the failover mechanism had started.

What is the code you use to reproduce that?
 
Here I could check the initial delay property which was the default instead of my custom value, however I changed failover strategies at collection level. It would be fine if this strategy could be configured independetly and globally. The documentation also contains a way to configure it on database level, but it's not much better in my view, because you have to specify the custom failover strategy everywhere you reference the database.

Preparing for 0.12, the failover can already be set using connection options, using rm.failover.

Szabo Akos

unread,
May 20, 2016, 11:00:52 AM5/20/16
to ReactiveMongo - http://reactivemongo.org
Hi Cédric!

Actually the error looks to be independent from my code, Everything happens at api.scala (from line 165) at startup. When application starts reactivemongo occasionally goes to failover code (https://groups.google.com/forum/#!topic/reactivemongo/0jxS1cgXjP8), which is considered normal. When this occures this line is in the log:
   2016-05-20 16:40:07,294 [DEBUG] from reactivemongo.api.Failover2 in ForkJoinPool-4-worker-15 - Got an error, retrying... (try #1 is scheduled in 125 ms)

I also use GridFS and this might be the point where failover strategy is not configured.

Meanwhile I tried to run the same code in a Hyper-V virtual machine, and does the same deadlock az in OpenVz. The problem does not occur on my laptop and a native Linux environment.

I'm cloned the Reactivemongo repo to my local machine, and I'm trying to put extra logging to determine when locking/notification occures, but I'm facing a problem. If I clone the 0.11.11 version, and modify the BuildSettings.buildVersion, then the code does not compile. If I don't change the version, then I cannot get my application to use the pacthed library. I make a publishLocal, and use a resolver like this: 
resolvers += Resolver.file("Local repo", file(System.getProperty("user.home") + "/.ivy2/local"))(Resolver.ivyStylePatterns)

Is there an easy way to use the locally published version of reactivemongo?

Thanks in advance,
Akos

Szabo Akos

unread,
May 23, 2016, 4:26:12 AM5/23/16
to ReactiveMongo - http://reactivemongo.org
Hi!

I was investigating the problem more deeply, and the problem seems to be the following: the scheduleOnce call of the Akka scheduler sometimes does fire the callback, so the failover logic gets locked forever. 

I was trying hard to make the problem reproducable for you, but I'm afraid it is not possible. The problem itself seems to be dependent on server performance, only occurres when the performance is low. For example the same server/same code did not show the error yesterday (in Sunday evening), but it is showing now.

I attached some files, maybe they help to find out the cause:
  • /reactivemongo/api.scala: I added some extra logging at line 160-180. My log messages contain "MyDebug" string to assure easy searching in the logfile
  • /application/* : the config files of my application
  • application.log : the logfile of my application (loglevel=TRACE)
Some more details on the error:
  • usually the problem occurres at startup (when the mongodb connection builds up)
  • the "Lock is notified" log message does not appear in this case, showing that the scheduler does not call the notification for the lock
  • the scheduler is working in other cases, sometimes even the locks are notified, but not every time
  • it the problem does not occur during startup, it can be repoduced with intermittently shutting down mongodb
  • it only occurres in slow server environments, never happens for example on my notebook, but usually occurres in slower virtualized environments.
My questions:
  • how is it possible, that akka scheduler callback is not fired? under what circumstances is it possible? I did not find anything in the log that would explain it.
  • do you have an idea how to debug akka scheduler or the cause of the problem?
  • would not be a safer solution to specify timeout for the lock for example this way in the code: lock.wait(delay.toMillis)?
Thanks in advance,
Akos
locking_problem.zip

Cédric Chantepie

unread,
May 23, 2016, 5:39:51 AM5/23/16
to ReactiveMongo - http://reactivemongo.org
You can try with 0.12.0-SNAPSHOT.

Szabo Akos

unread,
May 23, 2016, 8:08:15 AM5/23/16
to ReactiveMongo - http://reactivemongo.org
Thanks for the patch!

I tried the actual 0.12-SNAPSHOT version in multiple environments with the same result. General queries and aggregation framework queries seem to be working fine (even after failover event caused by mongodb restart), but strangely gridfs requests does not work, and the following message is in the log:

2016-05-23 13:49:16,750 [ERROR] from reactivemongo.api.Cursor in ForkJoinPool-4-worker-1 - fails to send request
java.util.concurrent.ExecutionException: Boxed InterruptedException

I also attached a detailed log file.

Regards,
Akos
application.log

Cédric Chantepie

unread,
May 23, 2016, 11:28:33 AM5/23/16
to ReactiveMongo - http://reactivemongo.org
Which gridfs op?

Szabo Akos

unread,
May 23, 2016, 12:15:24 PM5/23/16
to ReactiveMongo - http://reactivemongo.org
For example find operation.

But I also found the cause of this in the log:
java.lang.NoSuchMethodError: play.api.libs.json.JsLookup$.$bslash$extension(Lplay/api/libs/json/JsLookupResult;Ljava/lang/String;)Lplay/api/libs/json/JsLookupResult;
at play.modules.reactivemongo.MongoController$$anon$2$$anonfun$reads$1.apply(MongoController.scala:73)
at play.modules.reactivemongo.MongoController$$anon$2$$anonfun$reads$1.apply(MongoController.scala:72)
at play.api.libs.json.JsResult$class.flatMap(JsResult.scala:99)
at play.api.libs.json.JsSuccess.flatMap(JsResult.scala:9)
at play.modules.reactivemongo.MongoController$$anon$2.reads(MongoController.scala:72)
at reactivemongo.play.json.JSONSerializationPack$.deserialize(package.scala:469)
at reactivemongo.play.json.JSONSerializationPack$.deserialize(package.scala:437)
at reactivemongo.api.SerializationPack$class.readAndDeserialize(serializationpack.scala:28)
at reactivemongo.play.json.JSONSerializationPack$.readAndDeserialize(package.scala:437)
at reactivemongo.api.gridfs.GridFS$Chunk$3$$anonfun$finish$2$$anonfun$apply$3.apply(gridfs.scala:330)
at reactivemongo.api.gridfs.GridFS$Chunk$3$$anonfun$finish$2$$anonfun$apply$3.apply(gridfs.scala:327)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Success.map(Try.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Thanks,
Akos

On Monday, May 23, 2016 at 5:28:33 PM UTC+2, Cédric Chantepie wrote:
Which gridfs op?

Cédric Chantepie

unread,
May 23, 2016, 3:26:42 PM5/23/16
to ReactiveMongo - http://reactivemongo.org
This NoSuchException is not specific to ReactiveMongo, check the Play-JSON in your dependencies, as it should be compatible with the ReactiveMongo-Play-Json module you use.

Szabo Akos

unread,
May 24, 2016, 2:53:09 AM5/24/16
to ReactiveMongo - http://reactivemongo.org
I'm using Play Json 2.5.3, I checked in the classpath. Reactivemongo is referenced in the following way:

libraryDependencies ++= Seq(
...
"org.reactivemongo" %% "play2-reactivemongo" % "0.12.0-SNAPSHOT"
...
)

resolvers += "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

Is this ok?

However I see something strange when compiling the code:

[warn] Unable to reparse org.reactivemongo#play2-reactivemongo_2.11;0.12.0-SNAPSHOT from Sonatype Snapshots, using Mon Mar 14 00:02:48 CET 2016
[warn] Choosing Sonatype Snapshots for org.reactivemongo#play2-reactivemongo_2.11;0.12.0-SNAPSHOT
[info] Resolving org.reactivemongo#reactivemongo_2.11;0.12.0-SNAPSHOT ...
[warn] Unable to reparse org.reactivemongo#reactivemongo_2.11;0.12.0-SNAPSHOT from Sonatype Snapshots, using Mon May 23 11:39:16 CEST 2016
[warn] Choosing Sonatype Snapshots for org.reactivemongo#reactivemongo_2.11;0.12.0-SNAPSHOT
[info] Resolving org.reactivemongo#reactivemongo-bson-macros_2.11;0.12.0-SNAPSHOT ...
[warn] Unable to reparse org.reactivemongo#reactivemongo-bson-macros_2.11;0.12.0-SNAPSHOT from Sonatype Snapshots, using Mon May 23 11:38:43 CEST 2016
[warn] Choosing Sonatype Snapshots for org.reactivemongo#reactivemongo-bson-macros_2.11;0.12.0-SNAPSHOT
[info] Resolving org.reactivemongo#reactivemongo-bson_2.11;0.12.0-SNAPSHOT ...
[warn] Unable to reparse org.reactivemongo#reactivemongo-bson_2.11;0.12.0-SNAPSHOT from Sonatype Snapshots, using Mon May 23 11:38:46 CEST 2016
[warn] Choosing Sonatype Snapshots for org.reactivemongo#reactivemongo-bson_2.11;0.12.0-SNAPSHOT
[info] Resolving org.reactivemongo#reactivemongo-shaded;0.12.0-SNAPSHOT ...
[warn] Unable to reparse org.reactivemongo#reactivemongo-shaded;0.12.0-SNAPSHOT from Sonatype Snapshots, using Mon May 23 11:38:39 CEST 2016
[warn] Choosing Sonatype Snapshots for org.reactivemongo#reactivemongo-shaded;0.12.0-SNAPSHOT
[info] Resolving org.reactivemongo#reactivemongo-play-json_2.11;0.12.0-SNAPSHOT ...
[warn] Unable to reparse org.reactivemongo#reactivemongo-play-json_2.11;0.12.0-SNAPSHOT from Sonatype Snapshots, using Sun Apr 03 20:58:19 CEST 2016
[warn] Choosing Sonatype Snapshots for org.reactivemongo#reactivemongo-play-json_2.11;0.12.0-SNAPSHOT
[info] Resolving jline#jline;2.12.1 ...
[info]  [SUCCESSFUL ] org.reactivemongo#play2-reactivemongo_2.11;0.12.0-SNAPSHOT!play2-reactivemongo_2.11.jar (1961ms)
[info]  [SUCCESSFUL ] org.reactivemongo#reactivemongo_2.11;0.12.0-SNAPSHOT!reactivemongo_2.11.jar (3133ms)
[info]  [SUCCESSFUL ] org.reactivemongo#reactivemongo-play-json_2.11;0.12.0-SNAPSHOT!reactivemongo-play-json_2.11.jar (1033ms)
[info]  [SUCCESSFUL ] org.reactivemongo#reactivemongo-bson-macros_2.11;0.12.0-SNAPSHOT!reactivemongo-bson-macros_2.11.jar (1178ms)
[info]  [SUCCESSFUL ] org.reactivemongo#reactivemongo-shaded;0.12.0-SNAPSHOT!reactivemongo-shaded.jar (2109ms)
[info]  [SUCCESSFUL ] org.reactivemongo#reactivemongo-bson_2.11;0.12.0-SNAPSHOT!reactivemongo-bson_2.11.jar (1155ms)

Aren't the play related snapshots a bit old in the Sonatype snapshot repo?

Thanks,
Akos

Cédric Chantepie

unread,
May 24, 2016, 7:41:01 AM5/24/16
to ReactiveMongo - http://reactivemongo.org
Try using "org.reactivemongo" %% "play2-reactivemongo" % "0.12.0-SNAPSHOT" changing() and/or clean your IVY cache.

Szabo Akos

unread,
May 24, 2016, 8:57:00 AM5/24/16
to ReactiveMongo - http://reactivemongo.org
Thank you for your quick and valueable help, basically everything is working, even in the virtualized environment which was the main problem before.

I studied the changes in the Failover2 class, I think it is a step forward to a robust operation. However the timeout of the tryAcquire is a bit too restrictive, that's why the failover operation fails sometimes with InterruptedException:

if (!mutex.tryAcquire(delay.length, delay.unit)) {
Future.failed[A](new InterruptedException(
s"fails to retry ${`try`}"))

} else send(`try`)

I experienced that the akka scehduler tends to be late by a few milliseconds compared to the specified duration, thus the tryAcquire call sometimes time out.

Cédric Chantepie

unread,
May 24, 2016, 4:30:37 PM5/24/16
to ReactiveMongo - http://reactivemongo.org
You could use a custom Failover if the default one is not appropriate for your env.

Szabo Akos

unread,
May 25, 2016, 2:51:00 AM5/25/16
to ReactiveMongo - http://reactivemongo.org
I'm afraid it is not a question of failover strategy, tha algorithm needs to be refined instead. This is a classical racing condition, because the scheduleOnce call is timed for the same duration as the mutex timeout. That's why it depends on a few millliseconds of Akka scheduling delay that the failover is successful or results in an error (as you can see in the attached). This even happens on my laptop with Intel i7 processor.

Some possible solutions to consider:
- using a safety threshold in the mutex timeout (for example 100 millisec)
- never throwing Exception at this point, always sending a 'try' on timeout
- not using scheduler only relying on a timeout

I'm happy to contribute with a pull request but please indicate what solution is preferred by you.
failover_error.txt

Cédric Chantepie

unread,
May 25, 2016, 4:32:24 AM5/25/16
to ReactiveMongo - http://reactivemongo.org
You can try the latest snapshot.

Szabo Akos

unread,
May 25, 2016, 6:13:44 AM5/25/16
to ReactiveMongo - http://reactivemongo.org
Seems to be working very well, thank you!
Reply all
Reply to author
Forward
0 new messages