I need to understand why RavenDB 3 is eating all of the available memory on an AWS instance and then crashing when RavenDB 2.5 (on exactly the same data set, with the same traffic) does not.
We have had persistent performance problems with RavenDB. I believe that our problems are caused by us doing something strange or unusual, because I don't believe that these problems are inherent to RavenDB. I'm making this post in an attempt to use the RavenDB community to help me to understand where we might have gone wrong and how we can improve.
Disclaimer: While I have tried to learn as much about RavenDB as I could, there are vast holes in my knowledge. Please keep this in mind.
We have a web service implemented in .NET (using Nancy) and hosted via IIS in AWS. This service uses RavenDB for its persistence store, also hosted in IIS in AWS.
The purpose of the service is to act as a temporary waystation for data, which primarily lives on computers outside of our control (i.e. the client's servers). Data is synchronized up to the service, downloaded to mobile devices, altered, uploaded back to the service, synchronized back into the persistence store at the client site and finally deleted from the service (because it has completed the round trip).
Architecturally the service looks like the following:
The number of API instances is variable and they are all stateless. There is only a single RavenDB instance serving all incoming requests. Barring binary data (primarily images), which uses S3, every incoming request will hit Raven in some way.
Approximately 100K documents are in the database at any point in time. This fluctuates somewhat, but only really +/- 20%.
Documents range in size from a few KB to around 300KB.
There are approximately 140K requests/hour (roughly 40 requests/second). Around 85% of those requests are queries of some sort, with the remainder being updates.
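For anyone who wants the load figures above as per-second rates, here is a quick back-of-the-envelope sketch. The function and variable names are only illustrative. The size figure is a worst-case upper bound that assumes every document is at the 300KB maximum, since I have not stated an average document size; the real data set is smaller than this.

```python
# Rough load figures derived from the numbers quoted in this post.
REQUESTS_PER_HOUR = 140_000
QUERY_FRACTION = 0.85          # remaining ~15% are updates
DOCUMENT_COUNT = 100_000
MAX_DOCUMENT_KB = 300          # largest documents; most are smaller


def load_profile(requests_per_hour, query_fraction):
    """Split an hourly request count into per-second query/update rates."""
    per_second = requests_per_hour / 3600
    return {
        "requests_per_second": per_second,
        "queries_per_second": per_second * query_fraction,
        "updates_per_second": per_second * (1 - query_fraction),
    }


profile = load_profile(REQUESTS_PER_HOUR, QUERY_FRACTION)

# Worst case only: assumes every document is at the maximum size (1 GB = 10^6 KB).
upper_bound_gb = DOCUMENT_COUNT * MAX_DOCUMENT_KB / 1_000_000

for name, value in profile.items():
    print(f"{name}: {value:.1f}")
print(f"worst-case data size: {upper_bound_gb:.0f} GB")
```

So the database sees on the order of 33 queries and 6 updates per second, which does not seem like an unusual amount of traffic for a single instance of this size.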
Since releasing the service earlier this year, we've had a number of periods of downtime. Not all of them were caused by the database, but it was the root cause in most cases.
The most common issue was high/maxed out CPU usage, which would continue for long enough that the database was unable to respond to requests in a timely fashion.
Initially the reason for this was under-provisioning, and we went through a number of changes to the environment to deal with it. The current resource allocation for our RavenDB instance is an m4.xlarge (4 vCPUs, 16GB memory) with a data drive set at 2000 provisioned IOPS. Essentially we just threw more power at it each time a problem occurred, targeting whatever looked like the bottleneck at the time: the instance was originally a t2.medium, but it burnt through all of its CPU credits; then it started experiencing issues because the data drive couldn't keep up; finally it ran out of memory and paged to the system drive, grinding to a halt.
All of the above occurred when using RavenDB 2.5.2951.
Thinking toward the future, we agreed internally to upgrade to RavenDB 3, under the assumption that it would perform better and that it would be easier to get support for, both informally and formally (via a support contract).
As a validation step, we cloned our current production environment twice. The intent was to leave one copy at Raven 2.5 and upgrade the other to 3, then run load tests on both in parallel to contrast and compare performance.
The Current Problem
When we ran load tests against the cloned database upgraded to RavenDB 3, it consumed all of the available memory on the machine (16GB), started paging to the system drive (which increased the drive latency) and crashed. It then repeated the whole cycle a few times until it seemed to balance itself out. This still occurred even when the load tests were tuned down to exactly 1 simulated user.
RavenDB 2.5 does not exhibit this behavior when assailed with the exact same load tests.
I will update this section to include as much additional information as possible, from configuration to statistics, to anything else requested to help get to the bottom of this.
Our Raven configuration is as follows:
<add key="Raven/WorkingDir" value="APPDRIVE:\Raven\" />
<add key="Raven/DataDir/Legacy" value="~\Database\System"/>
<add key="Raven/DataDir" value="~\Databases\System"/>
<add key="Raven/AnonymousAccess" value="Admin"/>
<add key="Raven/Licensing/AllowAdminAnonymousAccessForCommercialUse" value="true" />
<add key="Raven/AccessControlAllowOrigin" value="*" />
<add key="Raven/LicensePath" value="<snip>" />