RavenDB 3 Consumes All Available Memory then Shudders to a Halt

Todd Bowles@Onthehouse

Dec 30, 2015, 1:23:04 AM
to RavenDB - 2nd generation document database
TL;DR
I need to understand why RavenDB 3 is eating all of the available memory on an AWS instance and then crashing when RavenDB 2.5 (on exactly the same data set, with the same traffic) does not.

Overview
We have had persistent performance problems with RavenDB. I believe that our problems are caused by us doing something strange or unusual, because I don't believe that these problems are inherent to RavenDB. I'm making this post in an attempt to use the RavenDB community to help me to understand where we might have gone wrong and how we can improve.

Disclaimer: While I have tried to learn as much about RavenDB as I could, there are vast holes in my knowledge. Please keep this in mind.

Usage
We have a web service implemented in .NET (using Nancy) and hosted via IIS in AWS. This service uses RavenDB for its persistence store, also hosted in IIS in AWS.

The purpose of the service is to act as a temporary waystation for data, which primarily lives on computers outside of our control (i.e. the client's servers). Data is synchronized up to the service, downloaded to mobile devices, altered, uploaded back to the service, synchronized back into the persistence store at the client site and finally deleted from the service (because it has completed the round trip).

Architecturally the service looks like the following:

[Image: service architecture diagram]

The number of API instances is variable and they are all stateless. There is only a single RavenDB instance serving all incoming requests. Barring binary data (primarily images), which uses S3, every incoming request will hit Raven in some way.

Statistics
Approximately 100K documents are in the database at any point in time. This fluctuates somewhat, but only really +/- 20%.
Documents range in size from a few KB to around 300KB.
There are approximately 140K requests/hour (roughly 40 requests/second). Of those requests, around 85% are queries of some sort, with the remainder being updates of some sort.
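
As a quick sanity check on the figures above, the per-second rate and the query/update split can be worked out as follows. This is only a back-of-the-envelope sketch in Python; the constant names are illustrative, and both inputs are the approximate numbers quoted above rather than measured values.

```python
# Back-of-the-envelope check of the traffic figures quoted above.
REQUESTS_PER_HOUR = 140_000   # approximate figure from the post
QUERY_FRACTION = 0.85         # approximate share of requests that are queries

requests_per_second = REQUESTS_PER_HOUR / 3600
queries_per_second = requests_per_second * QUERY_FRACTION
updates_per_second = requests_per_second - queries_per_second

print(f"total:   {requests_per_second:.1f} req/s")   # total:   38.9 req/s
print(f"queries: {queries_per_second:.1f} req/s")    # queries: 33.1 req/s
print(f"updates: {updates_per_second:.1f} req/s")    # updates: 5.8 req/s
```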

History
Since releasing the service earlier this year, we've had a number of occurrences of downtime. Not all of the downtime was caused by the database, but it was the root cause in most cases.

The most common issue was high/maxed out CPU usage, which would continue for long enough that the database would be unable to respond to requests in a timely fashion.

Initially the reason for this was under-provisioning, and we went through a number of changes to the environment to deal with this. The current resource allocation for our RavenDB instance is an m4.xlarge (4 vCPUs, 16GB memory) with a data drive set at 2000 provisioned IOPS. Essentially we just threw more power at it each time a problem occurred, targeting whatever looked like the bottleneck at the time (the instance was originally a t2.medium but it burnt all its CPU credits, then it started experiencing issues because the data drive couldn't keep up, then finally it ran out of memory and paged to the system drive, grinding to a halt).

All of the above occurred when using RavenDB 2.5.2951.

Thinking toward the future, we agreed internally to upgrade to RavenDB 3, under the assumption that it would perform better and that it would be easier to get support for, both informally and formally (via a support contract).

As a validation step, we cloned our current production environment twice. The intent was to leave one copy at Raven 2.5 and upgrade the other to 3, then run load tests on both in parallel to contrast and compare performance.

The Current Problem
When performing load tests on top of the cloned database upgraded to RavenDB 3 it consumed all of the available memory on the machine (16GB), started paging to the system drive (which increased the drive latency), crashed, then repeated the whole thing again a few times until it seemed to balance itself out. Even when the load tests were tuned down to exactly 1 simulated user, this still occurred.

RavenDB 2.5 does not exhibit this behavior when assailed with the exact same load tests.

Extra Information
I will update this section to include as much additional information as possible, from configuration to statistics, to anything else requested to help get to the bottom of this.

Our Raven configuration is as follows:
<appSettings>
    <add key="Raven/WorkingDir" value="APPDRIVE:\Raven\" />
    <add key="Raven/DataDir/Legacy" value="~\Database\System"/>
    <add key="Raven/DataDir" value="~\Databases\System"/>
    <add key="Raven/AnonymousAccess" value="Admin"/>
    <add key="Raven/Licensing/AllowAdminAnonymousAccessForCommercialUse" value="true" />
    <add key="Raven/AccessControlAllowOrigin" value="*" />
    <add key="Raven/LicensePath" value="<snip>" />
</appSettings>

Michael Yarichuk

Dec 30, 2015, 2:22:58 AM
to RavenDB - 2nd generation document database
Hi,
Can you post the output of the /admin/stats and /databases/[database name]/stats endpoints?
Also,
* What build do you use?
* How many indexes do you have? Can you post their definitions?
* Do you use custom plugins/custom Lucene analyzers?
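
For anyone following along, one way to capture the output of the two endpoints mentioned above is sketched below in Python. The base URL and database name in the example are hypothetical placeholders, and the sketch assumes the server answers these endpoints with JSON and without authentication (which may not hold for your setup).

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def stats_urls(base_url: str, database: str) -> list:
    """Build the two stats URLs mentioned above for one server and database."""
    base = base_url.rstrip("/")
    return [
        f"{base}/admin/stats",
        f"{base}/databases/{quote(database, safe='')}/stats",
    ]


def fetch_stats(base_url: str, database: str, timeout: float = 30.0) -> dict:
    """Fetch each endpoint and return the parsed JSON keyed by URL."""
    results = {}
    for url in stats_urls(base_url, database):
        with urlopen(url, timeout=timeout) as response:
            results[url] = json.load(response)
    return results


# Example (performs real HTTP requests, so it is left commented out;
# both arguments are placeholders):
# stats = fetch_stats("http://localhost:8080", "MyDatabase")
# print(json.dumps(stats, indent=2))
```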



--
Best regards,

 

Michael Yarichuk

RavenDB Core Team

Tel: 972-4-6227811

Fax:972-153-4-6227811

Email : michael....@hibernatingrhinos.com

 

RavenDB paving the way to "Data Made Simple" http://ravendb.net/  

Oren Eini (Ayende Rahien)

Dec 30, 2015, 2:41:38 AM
to ravendb
In addition to those, when it exhibits high memory usage, please also send the Debug Package Info and take a process dump.

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 

Todd Bowles@Onthehouse

Dec 30, 2015, 9:20:55 PM
to RavenDB - 2nd generation document database
Additional information as requested.

The version of Raven.Database.dll is 3.0.30000.0.
We are not using any custom plugins or analyzers (to my knowledge).
I believe we are using nothing but automatic indexes.

You can download an archive containing a number of different pieces of information from a time today when the memory usage spiked (and then crashed). This includes stats and admin/stats exports, dumps of the process (minidumps via procdump), debug information from the Raven 3 studio and the actual RavenDB log file.

The archive is available at https://s3-ap-southeast-2.amazonaws.com/oth.console.scratch/ravendb3/20151231_ravendb3.7z (because it's too big for Google Groups).

Additionally, today after the IIS process crashed and restarted itself, everything appears to be behaving, even when under load. In the past the memory spike -> crash problem appeared to cycle a number of times before settling down.

Thank you for your help.

Michael Yarichuk

Dec 31, 2015, 4:25:40 AM
to RavenDB - 2nd generation document database

Thanks, I will take a look

Oren Eini (Ayende Rahien)

Dec 31, 2015, 9:15:03 AM
to ravendb
Well, to start with, you have this interesting error:
Line 0, Position 0: Error CS0006 - Metadata file 'E:\Raven\Assemblies\Lucene.Net.dll' could not be found

It looks like this causes the index to fail to load, and then we see a lot of attempts to index, but nothing that would actually consume the indexing, resulting in the high memory usage.

How are you running ravendb?


Todd Bowles@Onthehouse

Jan 3, 2016, 7:25:37 PM
to RavenDB - 2nd generation document database
We host in IIS. The Raven data directory is configured to be on a dedicated drive (a 100GB/2000 IOPS drive). This is the E: drive. RavenDB itself is actually deployed on the System Drive (C:), which is just a stock-standard 30GB volume in AWS (no provisioned IOPS, so 90 baseline, burstable to whatever AWS feels like giving you).

Oren Eini (Ayende Rahien)

Jan 4, 2016, 5:02:07 AM
to ravendb
Can you check whether that file is actually there?

Todd Bowles@Onthehouse

Jan 6, 2016, 4:11:43 AM
to RavenDB - 2nd generation document database
I ran the upgrade process again (cloned RavenDB 2.5 server, upgraded to 3 via our deployment process) and when I looked the file was there, but the same messages were present in the log. At no point during this process was memory a problem (the memory usage seems fairly static at around 2.5GB). The moment I started our load tests, even though I only ran them for a few seconds, the memory started spiking and didn't stop until the process crashed and IIS restarted it.

Perhaps those log file entries are related to the way in which we are deploying the upgrade, and may not be related to the memory?

We have RavenDB 2.5 deployed via Octopus. It has the Raven/DataDir setting set to E:\RavenDatabaseFiles.

We deploy RavenDB 3 also via Octopus. It has the Raven/DataDir setting set to E:\RavenDatabaseFiles, but it also has the Raven/DataDir/Legacy setting set to E:\RavenDatabaseFiles and the Raven/WorkingDir setting set to E:\Raven.

During the deployment, we stop the website using PowerShell (WebAdministration module, Stop-Website) and then wait for it to be stopped. We then do the same for the Application Pool. Both are then deleted. The deployment directory is cleared and the new deployment is copied in. Finally the website is recreated along with the application pool.

Are there any issues with having the same Legacy and Data directories when upgrading from RavenDB 2.5?

The log file has entries stating "After early exit...". Does this mean the process did not terminate cleanly and that it is trying to recover something?

Oren Eini (Ayende Rahien)

Jan 6, 2016, 5:52:32 AM
to ravendb
After early exit means that the I/O rate is slow and it aborted a prefetch operation.

Can you take a proc dump while it is in a high memory situation and send it to us?

Todd Bowles@Onthehouse

Jan 7, 2016, 12:22:15 AM
to RavenDB - 2nd generation document database
There are proc dumps included in the information pack I mentioned above. You can download it from https://s3-ap-southeast-2.amazonaws.com/oth.console.scratch/ravendb3/20151231_ravendb3.7z. These proc dumps are only minidumps though, as a full dump would be 10+ GB. I can do the full dump if required, just let me know.

Oren Eini (Ayende Rahien)

Jan 7, 2016, 5:18:00 AM
to ravendb
Yes, to investigate memory issues we need a full dump.
Please make sure to compress it before uploading, though.

Todd Bowles@Onthehouse

Jan 10, 2016, 9:26:26 PM
to RavenDB - 2nd generation document database
I've sent you a share to the compressed memory dump currently stored in Google Drive. That dump was taken with procdump when the process exceeded 10GB of memory usage.

Let me know if there is anything else I can do to help.

Oren Eini (Ayende Rahien)

Jan 12, 2016, 9:36:14 AM
to ravendb, Michael Yarichuk
Michael will look into this, thanks for the dump

Michael Yarichuk

Jan 12, 2016, 12:17:17 PM
to Oren Eini (Ayende Rahien), ravendb
Looking at the dump currently.

Do you have very large documents? (By large I mean larger than 100 MB in size.)

Michael Yarichuk

Jan 13, 2016, 3:06:15 AM
to Oren Eini (Ayende Rahien), ravendb
Also, the last dump you captured was taken in the middle of a GC, so I cannot look at the heap properly.
Is it possible for you to take several full dumps at approx. 1 minute intervals, so that at least one of the dumps is taken when the GC is idle?
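
A possible way to script this with Sysinternals ProcDump is sketched below in Python. It assumes procdump is on the PATH and relies on its documented -ma (full dump), -n (number of dumps) and -s (seconds between consecutive dumps) switches; the process ID in the example is a hypothetical placeholder, and the helper names are illustrative only.

```python
import subprocess


def procdump_command(target: str, count: int = 5, interval_seconds: int = 60) -> list:
    """Build a ProcDump command line that writes several full dumps in a row.

    -ma : write a full dump (all process memory)
    -n  : number of dumps to write before exiting
    -s  : seconds between consecutive dumps
    """
    return [
        "procdump", "-accepteula", "-ma",
        "-n", str(count),
        "-s", str(interval_seconds),
        target,
    ]


def take_dumps(target: str, count: int = 5, interval_seconds: int = 60) -> None:
    """Run ProcDump and raise if it reports a failure."""
    subprocess.run(procdump_command(target, count, interval_seconds), check=True)


# Example (placeholder PID of the IIS worker process hosting RavenDB):
# take_dumps("1234", count=5, interval_seconds=60)
```

With several IIS worker processes on the same machine, passing the process ID rather than the image name avoids ambiguity about which process is dumped.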

Todd Bowles@Onthehouse

Jan 14, 2016, 12:28:40 AM
to RavenDB - 2nd generation document database, aye...@ayende.com
I was not under the impression that we had particularly large documents. Is there an easy way that I can ask Raven for document sizing statistics, just in case there are some that I'm not immediately aware of?

I should be able to take a few more full dumps to help you diagnose the issue. I'll share them using the same mechanism as before.

Todd

Oren Eini (Ayende Rahien)

Jan 14, 2016, 1:12:21 AM
to Todd Bowles@Onthehouse, RavenDB - 2nd generation document database
Status > Storage details (but note that this can be expensive in I/O).

Todd Bowles@Onthehouse

Jan 14, 2016, 2:33:22 AM
to RavenDB - 2nd generation document database, todd....@onthehouse.com.au
We recently purchased a support contract from Hibernating Rhinos. Would it be easier for me to take this to your support channel, perhaps organizing direct remote access into the server experiencing the issues? Either works for me, it's more about what would be easier for you guys to manage.

Once the issue has been diagnosed I can provide a summary back here so any poor soul experiencing the same issue can get some closure (https://xkcd.com/979/).

Oren Eini (Ayende Rahien)

Jan 14, 2016, 2:58:04 AM
to ravendb, Todd Bowles@Onthehouse
Yes, that would be great, ping sup...@ravendb.net with the support contract id.
One of our engineers will be able to take that on and work with you directly

Todd Bowles@Onthehouse

Feb 17, 2016, 2:48:35 AM
to RavenDB - 2nd generation document database
It's been a while since I posted here (with some long investigation sessions with the guys from Hibernating Rhinos in between), but they found the root cause of the memory issue we were experiencing with RavenDB 3.

I believe that Michael will come and post a better technical explanation than I can, but my understanding of the issue is that it was related to an optimization, the way indexes are initialized, our document profile, the number of documents we have and the fact that we're using a number of auto indexes (instead of static indexes).

For our particular usage, this optimization backfires and causes multiple copies of documents to be placed into memory across different indexes, spiking the memory usage and causing problems.

They are planning on providing a way to disable the problem optimization as an immediate workaround so that I can move forward, and to improve it in the long term so that it is not an issue for anyone else.

I would like to give a big thank you to Michael and Oren for working with me to get to the bottom of the problem, and for their determination in finding a solution.

Michael Yarichuk

Feb 24, 2016, 11:51:41 AM
to RavenDB - 2nd generation document database
Hi all,
For completeness' sake, I will describe what happened in this issue in more detail.

One of the recent optimizations that we introduced is essentially populating a new index with initial data (this happens in the following method: https://github.com/ayende/ravendb/blob/master/Raven.Database/Actions/IndexActions.cs#L611).

What happened is that the combination of document sizes and document count caused the pre-population optimization to take a long time (aided by the fact that there was no limit on the pre-population), all while keeping the fetched documents in memory. That is the reason the memory usage rapidly increased (see the line here: https://github.com/ayende/ravendb/blob/master/Raven.Database/Actions/IndexActions.cs#L683).

The fix here is to introduce limits for the pre-population process, so that it stops once a limit is reached.
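
To illustrate the general idea of such limits (this is not the actual RavenDB code, only a simplified Python sketch), a bounded pre-population loop could look like the following. The function name, the two limits and the sample documents are all hypothetical.

```python
from typing import Iterable, List


def prefetch_with_limits(
    documents: Iterable[bytes],
    max_documents: int,
    max_bytes: int,
) -> List[bytes]:
    """Collect documents for initial indexing, stopping once either limit is hit."""
    fetched: List[bytes] = []
    total_bytes = 0
    for doc in documents:
        # Stop before exceeding the document count or the memory budget.
        if len(fetched) >= max_documents or total_bytes + len(doc) > max_bytes:
            break  # limit reached: the rest is left to regular indexing
        fetched.append(doc)
        total_bytes += len(doc)
    return fetched


# Ten fake 100-byte documents, with a 250-byte budget: only two fit.
sample = [b"x" * 100 for _ in range(10)]
print(len(prefetch_with_limits(sample, max_documents=100, max_bytes=250)))  # 2
```

The point of the byte budget is that the amount of memory held at once is capped regardless of how many documents exist or how large they are.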



 

