Strange behaviour with slow queries

249 views
Skip to first unread message

Patrick Ahorukomeye

unread,
May 15, 2018, 11:09:23 AM5/15/18
to ReactiveMongo - http://reactivemongo.org


Hello,

We are using playFramework (2.6.1) with reactivemongo (0.13.0) and we experience unexpected behaviour.
Basically, we have an application that runs lots queries, fast and slow. When all queries are fast, everything goes well, but when some of them start to become slow (database growing), we start to observe that some
slow queries seem to last abnormally long.
After investigation, we started to look toward akka dispatcher used by reactivemongo. We have tried several configurations and still, we did not manage to find a proper solution.
We have reproduced that behaviour in a simple project (available at https://github.com/sine-patrick/playm.git).
The code in playm basically: 
- creates 24 actors to send fast queries and 6 actors to send slow queries (all number configurable in application.conf)
- all the messages are sent at the same time
- print the time taken to process the request with reactive mongo

Here is what we have:
- lots of fast queries (16) ends within 1200 ms
- all the others end after 40000 ms

[1165 ms] fast-query-12 finished request, got 10 results.
[1159 ms] fast-query-22 finished request, got 10 results.
[1167 ms] fast-query-1 finished request, got 10 results.
[1166 ms] fast-query-7 finished request, got 10 results.
[1166 ms] fast-query-2 finished request, got 10 results.
[1165 ms] fast-query-20 finished request, got 10 results.
[1834 ms] fast-query-4 finished request, got 10 results.
[1170 ms] fast-query-18 finished request, got 10 results.
[1174 ms] fast-query-9 finished request, got 10 results.
[1176 ms] fast-query-14 finished request, got 10 results.
[1175 ms] fast-query-16 finished request, got 10 results.
[1182 ms] fast-query-23 finished request, got 10 results.
[1175 ms] fast-query-17 finished request, got 10 results.
[1843 ms] fast-query-6 finished request, got 10 results.
[1189 ms] fast-query-11 finished request, got 10 results.
[1186 ms] fast-query-15 finished request, got 10 results.

[39842 ms] slow-query-4 finished request, got 10 results.
[40851 ms] slow-query-5 finished request, got 10 results.
[40184 ms] fast-query-24 finished request, got 10 results.
[41023 ms] slow-query-1 finished request, got 10 results.
[41075 ms] fast-query-8 finished request, got 10 results.
[40458 ms] fast-query-21 finished request, got 10 results.
[40507 ms] fast-query-5 finished request, got 10 results.
[40508 ms] fast-query-3 finished request, got 10 results.
[41167 ms] slow-query-6 finished request, got 10 results.
[40529 ms] fast-query-19 finished request, got 10 results.
[40536 ms] fast-query-10 finished request, got 10 results.
[41198 ms] slow-query-3 finished request, got 10 results.
[40594 ms] fast-query-13 finished request, got 10 results.
[40604 ms] slow-query-2 finished request, got 10 results.

We have not a lot of experience with akka (or reactivemongo) and maybe we are doing something wrong.

Any help is appreciated, thank you!

Cédric Chantepie

unread,
Jun 4, 2018, 2:39:51 PM6/4/18
to ReactiveMongo - http://reactivemongo.org
Maybe resources/thread shortage, any seems difficult to me to reason from such tests.

Evgeny Zhoga

unread,
Jun 20, 2018, 4:23:25 PM6/20/18
to ReactiveMongo - http://reactivemongo.org
Hi,

we have similar type of problem (occasional timeouts of generally fast requests). 
I did the investigation on this problem and for me it looks so:
Mongo wire protocol has two types of requests: with response and without. ReactiveMongo uses netty channels to communicate with mongo. Furthermore, channels are used in roundrobin manner without specific filter on current channel load. 
In my tests it looks like if some channel was used to send a query (long running) and, before any response received,   same channel was used for another query (fast), this second query will not be processed by mongodb till response is sent for first query.

So, on application side it looked like (controlled by channel long id, time is approximate):
1. timestamp -> slow query request
2. timestamp + 100ms -> fast query request
3. timestamp + 10,000ms -> slow query response
4. timestamp + 10,100ms -> fast query response

On mongodb side I saw:
1. timestamp + 10,000ms -> slow query was executed, duration 10,000ms
2. timestamp + 10,100ms -> fast query was executed, duration 100ms

I did modification for ReactiveMongo (https://github.com/eugeneatnezasa/ReactiveMongo) with support of additional ReactiveMongo specific option `lockChannels`, false by default. If set to true, then channel will be used exclusively (another query will not be sent till previous query responded), so free channel (without running query) will be used for executing new query. It is not proper fix but could be used to compare results with pure ReactiveMongo (use instructure to publish SNAPSHOT version locally to use in project)

I do not know, is it really a case of mongodb to first respond previous queue before executing next one for same channel? Could it be somehow resolved on mongodb side, or it is supposed behavior and need to be resolved on ReactiveMongo side? 

For testing I used your example project, modified a bit, but think original one will have same results

Cédric Chantepie

unread,
Jun 21, 2018, 1:32:18 PM6/21/18
to ReactiveMongo - http://reactivemongo.org
Will have a look. I'm not fond of locking, which is contrary to pool/Reactive Mongo for me.

Evgeny Zhoga

unread,
Jul 5, 2018, 5:16:27 AM7/5/18
to ReactiveMongo - http://reactivemongo.org
Hi Cedric,

did you have a chance to check/confirm the issue?

Cédric Chantepie

unread,
Jul 7, 2018, 2:38:43 PM7/7/18
to ReactiveMongo - http://reactivemongo.org
I would first suggest to use distinct resources if you want to make sure the actors/processes are isolated (see attached diff).
playm.diff

Evgeny Zhoga

unread,
Jul 9, 2018, 5:47:23 AM7/9/18
to ReactiveMongo - http://reactivemongo.org
Hi Cedric,

my fault, I did not describe test-case I used (Yes, I started with original TS project but changed it a lot), supposed that pointing the problem would be enough.

Attached patch for original project to demonstrate problem and that it is solved with using channels in exclusive mode (all required code is in AnotherMain class, and I added reactivemongo:0.14.0-SNAPSHOT dependency in build.sbt, where this jar is published locally from https://github.com/eugeneatnezasa/ReactiveMongo )

For emulate slow query, I generated data using:

for (var i = 0;i < 10000000;i++) { db.test.insert({j:{k:""+Math.random()}}); }



in mongo cli. (note that values are converted to string).
I use two channels in both use-cases

So, if I use connection string "mongodb://localhost:27017?rm.keepAlive=true&rm.nbChannelsPerNode=2" output is:
[1198] Fast query 1 was finished, size 10
[1198] Fast query 5 was finished, size 10
[1198] Fast query 3 was finished, size 10
[1209] Fast query 9 was finished, size 10
[1214] Fast query 6 was finished, size 10
[4510] slow query was finished, size 0
[4516] Fast query 2 was finished, size 10
[4521] Fast query 8 was finished, size 10
[4521] Fast query 4 was finished, size 10
[4523] Fast query 10 was finished, size 10
[4527] Fast query 7 was finished, size 10
 

 where number in the beginning is ms from start

In case I use connection string ""mongodb://localhost:27017?rm.lockChannels=true&rm.keepAlive=true&rm.nbChannelsPerNode=2"" output is:
[1181] Fast query 6 was finished, size 10
[1181] Fast query 1 was finished, size 10
[1260] Fast query 5 was finished, size 10
[1318] Fast query 7 was finished, size 10
[1529] Fast query 3 was finished, size 10
[1585] Fast query 10 was finished, size 10
[1926] Fast query 4 was finished, size 10
[1985] Fast query 8 was finished, size 10
[2445] Fast query 2 was finished, size 10
[2505] Fast query 9 was finished, size 10
[4235] slow query was finished, size 0


As you can see in first case all fast queries were written into same channel as slow query were blocked till slow query complete. And all fast queries were processed sequentially using one channel (while second was blocked by slow query) successfyully


With best regards,
Evgeny
playm.diff

Cédric Chantepie

unread,
Jul 9, 2018, 11:53:48 AM7/9/18
to ReactiveMongo - http://reactivemongo.org
But if I'm not right, freezing the channel assignation was fix the original issue, which for me could rather be fix by isolated the resources.

Evgeny Zhoga

unread,
Jul 9, 2018, 12:42:03 PM7/9/18
to ReactiveMongo - http://reactivemongo.org
Sorry, I do not understand.
In your patch I see only one thing - MongoService.getInstance(system, appConfig) was moved inside actor initialization block. It changes nothing because getInstance method return singleton instance as I see.
Probably you mean to have individual pool (mongoConnnection with channels under the hood) per actor, so in this example to avoid using same channel for different requests, but from my point of view it is a bit strange approach. From my point of view in a wild it would be strange to create new MongoConnection per every request type. We just share MongoConnection across all activities for UI. 

This project playm just a demo of issue, that ReactiveMongo can send two requests in same netty channel to mongo wire interface, where mongo processes them sequentially (or, my be mongodb writes responses in FIFO manner, I do not know, did not debug from mongodb side, but on ReactiveMongo side it is the same).
My diff from previous post is just another (simplified and stand-alone) example of same issue.

In our live project it behaves like occasional generally fast requests suddenly respond very slow (as I understand because it was sent to mongo using same channel as  previously slow request)


In other words from my point of view it is an issue (or, at least lack of configuration, when you can choose if you want to control channels and do not want to be blocked by slow requests if they presented in the system), that with ReactiveMongo in multithreading environment I can send two requests concurrently, but get results sequentially. What do you think?

Cédric Chantepie

unread,
Jul 13, 2018, 10:55:32 AM7/13/18
to ReactiveMongo - http://reactivemongo.org
Will try to have a deeper look.

Cédric Chantepie

unread,
Aug 3, 2018, 6:14:57 PM8/3/18
to ReactiveMongo - http://reactivemongo.org
Fix merged/to be tested for coming 0.16

João Ferreira

unread,
Aug 13, 2018, 6:08:34 PM8/13/18
to ReactiveMongo - http://reactivemongo.org
Hi Cédric

We were trying this new release in our staging environment and found 2 issues:

* Application didn't boot
By mistake we were using 80 channels in one of the nodes (rm.nbChannelsPerNode=80) and the application didn't "boot" in that node. I.e all our http requests were timing out...
In logs we could only see this message:
[WARN] r.a.MongoConnection - [Supervisor-1/Connection-2] Timeout after 8200 milliseconds while probing the connection monitor: IsPrimaryAvailable#1414732647?
After changing the rm.nbChannelsPerNode to the default value the warning disappeared and the application "booted" successfully...
However this was not a blocking issue because the high rm.nbChannelsPerNode was a mistake from our part. We will try to submit a reproducer in the following days...

* Higher latencies
Although we see that the original issue was fixed, we see higher latencies when we need to do a lot of DB requests. We suspect this happens because when all the channels are occupied, the failover policy kicks in and there is a initial delay before retrying the request. One could say that we need to tweak our retry policy, but we think this needs to be handled in a more fundamental level. As I understand things, previously requests were being queued in netty/channel layer, while now they are queued in akka scheduler, which introduces larger latencies...
A reproduction for this issues is more difficult, maybe setting rm.nbChannelsPerNode=1 and asserting that request are done in one after another without waiting in between...


Cheers
Joao

Cédric Chantepie

unread,
Aug 15, 2018, 2:02:27 PM8/15/18
to ReactiveMongo - http://reactivemongo.org


On Tuesday, 14 August 2018 00:08:34 UTC+2, João Ferreira wrote:
Hi Cédric

We were trying this new release in our staging environment and found 2 issues:

* Application didn't boot
By mistake we were using 80 channels in one of the nodes (rm.nbChannelsPerNode=80) and the application didn't "boot" in that node. I.e all our http requests were timing out...
In logs we could only see this message:
[WARN] r.a.MongoConnection - [Supervisor-1/Connection-2] Timeout after 8200 milliseconds while probing the connection monitor: IsPrimaryAvailable#1414732647?
After changing the rm.nbChannelsPerNode to the default value the warning disappeared and the application "booted" successfully...
However this was not a blocking issue because the high rm.nbChannelsPerNode was a mistake from our part. We will try to submit a reproducer in the following days...

Is there an isolated/minimal reproducer?
 

* Higher latencies
Although we see that the original issue was fixed, we see higher latencies when we need to do a lot of DB requests. We suspect this happens because when all the channels are occupied, the failover policy kicks in and there is a initial delay before retrying the request. One could say that we need to tweak our retry policy, but we think this needs to be handled in a more fundamental level. As I understand things, previously requests were being queued in netty/channel layer, while now they are queued in akka scheduler, which introduces larger latencies...
A reproduction for this issues is more difficult, maybe setting rm.nbChannelsPerNode=1 and asserting that request are done in one after another without waiting in between...

Previously the behaviour was unpredictable in the erroneous cases, so you might have been lucky in such cases.
Now it's now queued in Akka, except if you means the Failover, which is what is triggered in case of channel not found.
So I would indeed suggestion to try with custom failover.
Reply all
Reply to author
Forward
0 new messages