OK, I have some more data. This is gonna be long, sorry, this thing is driving me insane. I am not sure what it means but it's very interesting.
I wrote a little script that will execute on my MVC action to time how long it takes to read back a saved instance in two sessions side-by-side.
Here it is:
public ActionResult Timing()
{
DateTime start;
DateTime save;
using (var session = _store.OpenSession())
{
save = DateTime.UtcNow;
var user = session.Load<User>("Users/1");
user.LastActivityDate = save;
start = DateTime.UtcNow;
session.SaveChanges();
}
bool updated = false;
int iterations = 0;
DateTime end = DateTime.MinValue;
TimeSpan tts = TimeSpan.Zero;
DateTime lmd = DateTime.MinValue;
TimeSpan diff = TimeSpan.Zero;
while (!updated)
{
end = DateTime.UtcNow;
tts = end - start;
using (var session = _store.OpenSession())
{
var user = session.Load<User>("Users/1");
if (user.LastActivityDate == save)
{
lmd = session.Advanced.GetMetadataFor(user)["Last-Modified"].Value<DateTime>();
diff = lmd - end;
updated = true;
}
}
iterations++;
}
var builder = new StringBuilder();
builder.AppendFormat("Started: {0}", start);
builder.AppendLine();
builder.AppendFormat("Ended: {0}", end);
builder.AppendLine();
builder.AppendFormat("Time-to-Save: {0}", tts);
builder.AppendLine();
builder.AppendFormat("Iterations: {0}", iterations);
builder.AppendLine();
builder.AppendFormat("LMD: {0}", lmd);
builder.AppendLine();
builder.AppendFormat("Difference: {0}", diff);
return Content(builder.ToString());
}
I only deployed the script to prod because:
a) that's where I've been seeing the issue and
b) I wanted to get the time differential between my Azure server and RavenHQ server (Amazon).
Time Difference between RavenHQ and Azure: +00:00:27
Good, now I can normalize all my findings, just keep this in mind as you read through--I've adjusted all the Raven-specific timestamps (Last-Modified) according to this number.
The Bad Case
[Trace] is the timestamp from Azure App Insights.
[Chrome] is the timestamp from the Response "Date" header sent by Azure
Raven Last-Modified dates are normalized by subtracting 27 seconds
So here is a sequence of events I gathered from Chrome and Application Insights (logging traces) when I observed the best showcase of the issue:
... I make a change (press a button, issue AJAX) to my user document!
[Trace][18:58:31] (Raven Trace -- PUT [Users/1])
... I try to undo (press another button, issues AJAX), it fails! Maybe I'll catch it in the act now.
... I go to Raven Studio and sure enough, the doc has been modified
Last-Modified for Users/1: [18:58:??] (Raven Studio does not show seconds)
... OK, doc is updated, so now my undo requests should work!
... I go back to my other tab and try to perform an undo request
... they all fail with trace entries like:
[Trace][18:59:03] Loading document [Users/1] from https://...
[Chrome][18:59:03] AJAX Response (FAIL)
[Trace][18:59:35] Loading document [Users/1] from https://...
[Chrome][18:59:35] AJAX Response (FAIL)
... finally, it works!
[Trace][18:59:43] (Raven Trace -- PUT [Users/1])
[Chrome][18:59:45] AJAX Response (SUCCESS)
... so I checked the LM date timestamp (from C#, not Studio)
Last-Modified for Users/1: [18:59:43]
Total time it took for my code to load the newly changed document? 00:01:12!
Holy crap, that definitely isn't right! The document has CLEARLY been updated properly in Raven, so why are my subsequent requests failing?
This is when I decided to write my timing code. I wrote it only to get the time differential but I found something more interesting.
Here's a normal output (reference script above) (LMD is Raven's Last-Modified date on the document):
GET /test-timings
Started: 6/2/2015 7:52:53 PM
Ended: 6/2/2015 7:52:53 PM
Time-to-Save: 00:00:00.0468759
Iterations: 1
LMD: 6/2/2015 7:53:21 PM
Difference: 00:00:27.0657413
GET /test-timings
Started: 6/2/2015 7:59:40 PM
Ended: 6/2/2015 7:59:40 PM
Time-to-Save: 00:00:00.5459673
Iterations: 1
LMD: 6/2/2015 8:00:08 PM
Difference: 00:00:27.0820819
Those are two "normal" sample requests. OK, so we can see on average it's only about 500ms to save a change to a document and the time difference between RavenHQ and Azure is 27 seconds--that's how I could normalize the event sequence above.
However, I decided to see what would happen if I browsed the site in one tab and refresh the timing page. Suddenly it was taking a long time. So long in fact, I refreshed the other tab again to see if the site was frozen and I think it stopped the loop (or it was a coincidence, but I did it again the same way one other time):
Started: 6/2/2015 7:52:55 PM
Ended: 6/2/2015 7:55:55 PM
Time-to-Save: 00:02:59.8295933
Iterations: 9541
LMD: 6/2/2015 7:52:55 PM
Difference: -00:02:32.6229524
Holy crap. What's interesting about this is the Last Modified date on the document! It matches the observation from that sequence above; the save happens immediately... so why the hell does it takes 9500 iterations to load an instance that has been changed?!
I checked my Application Insight logs after this:
So it was definitely writing to trace (all 9500 times). Unfortunately, I cannot see the responses--if they were 304 Not Modified or were returning a stale document.
RE: Ninject
My app uses Ninject, but this IS NOT through the Web API resolver that I changed--this is regular Ninject with MVC controller (_store is a singleton). I've been using this for the entire time I've had Raven so I know it's disposing sessions InRequestScope.
I do not think it's a Ninject issue--the code clearly creates a new session and disposes within a single iteration. Something else is going on!
At this point my conclusions are:
1. Caching. Something is being cached. Somewhere. If only I could see those HTTP requests being sent to Raven.
2. Magic. Gnomes have invaded my codebase and are wrecking havoc.
3. Something else I can't see
So, I'll probably continue my investigation tomorrow and try to get the timing issue to happen again (it was easy to do with my code above) and capture the requests between Azure and Raven, maybe using code in that script to hook into Raven's profiling.