Re: [RavenDB] Code hangs at Session.Store() for object using HiLo


Oren Eini (Ayende Rahien)

unread,
Feb 28, 2013, 2:11:32 AM2/28/13
to ravendb
Is this reproducible?
The error appears to come from trying to get a new hilo range from the server, but I am not sure how you could get into a state where you keep hitting the same issue over and over.


On Wed, Feb 27, 2013 at 8:12 PM, Nathan Ratcliff <nrat...@gmail.com> wrote:
We've encountered an odd situation and I'm hoping someone can tell me what we're doing wrong.

Raven 2261
2 servers, Master -> Slave

The short version is that the call to Store() seems to hang. By "hang" I mean execution in the debugger stops at that point, but the thread is still active according to the Threads window.  No exceptions are received in the calling code.

While waiting for the call to session.Store() to return, the following entries are being logged over and over again on the servers.

Master:
2013-02-27 12:01:17.9690,Raven.Database.Server.HttpServer,Warn,Error on request,"Raven.Abstractions.Exceptions.ConcurrencyException: PUT attempted on document 'Raven/Hilo/managers' using a non current etag
   at Raven.Storage.Esent.StorageActions.DocumentStorageActions.EnsureDocumentEtagMatch(String key, Nullable`1 etag, String method) in c:\Builds\RavenDB-Stable\Raven.Database\Storage\Esent\StorageActions\Util.cs:line 37
   at Raven.Storage.Esent.StorageActions.DocumentStorageActions.AddDocument(String key, Nullable`1 etag, RavenJObject data, RavenJObject metadata) in c:\Builds\RavenDB-Stable\Raven.Database\Storage\Esent\StorageActions\Documents.cs:line 399
   at Raven.Database.DocumentDatabase.<>c__DisplayClass46.<Put>b__3f(IStorageActionsAccessor actions) in c:\Builds\RavenDB-Stable\Raven.Database\DocumentDatabase.cs:line 686
   at Raven.Storage.Esent.TransactionalStorage.ExecuteBatch(Action`1 action) in c:\Builds\RavenDB-Stable\Raven.Database\Storage\Esent\TransactionalStorage.cs:line 558
   at Raven.Storage.Esent.TransactionalStorage.Batch(Action`1 action) in c:\Builds\RavenDB-Stable\Raven.Database\Storage\Esent\TransactionalStorage.cs:line 516
   at Raven.Database.DocumentDatabase.Put(String key, Nullable`1 etag, RavenJObject document, RavenJObject metadata, TransactionInformation transactionInformation) in c:\Builds\RavenDB-Stable\Raven.Database\DocumentDatabase.cs:line 675
   at Raven.Database.Server.Responders.Document.Put(IHttpContext context, String docId) in c:\Builds\RavenDB-Stable\Raven.Database\Server\Responders\Document.cs:line 176
   at Raven.Database.Server.Responders.Document.Respond(IHttpContext context) in c:\Builds\RavenDB-Stable\Raven.Database\Server\Responders\Document.cs:line 50
   at Raven.Database.Server.HttpServer.DispatchRequest(IHttpContext ctx) in c:\Builds\RavenDB-Stable\Raven.Database\Server\HttpServer.cs:line 856
   at Raven.Database.Server.HttpServer.HandleActualRequest(IHttpContext ctx) in c:\Builds\RavenDB-Stable\Raven.Database\Server\HttpServer.cs:line 601"
2013-02-27 12:01:17.9690,Raven.Database.Server.HttpServer,Debug,"Request #3,942,543: PUT     -   245 ms - celljournalist - 409 - /docs/Raven/Hilo/managers",


Slave:
2013-02-27 11:48:25.2816,Raven.Storage.Esent.StorageActions.DocumentStorageActions,Debug,Document with key 'Raven/Hilo/managers' was found,
2013-02-27 11:48:25.2816,Raven.Storage.Esent.StorageActions.DocumentStorageActions,Debug,Document with key 'Raven/ServerPrefixForHilo' was not found,
2013-02-27 11:48:25.2816,Raven.Database.Server.HttpServer,Debug,"Request #73,411: GET     -     0 ms - celljournalist - 304 - /queries/?&id=Raven/Hilo/managers&id=Raven/ServerPrefixForHilo",


This is how we create the document store in StructureMap:

        public RavendbRegistry()
        {
            For<IDocumentStore>().Singleton().Use(CreateDocumentStore);

            For<IDocumentSession>().HybridHttpOrThreadLocalScoped().Use(ctx => ctx.GetInstance<IDocumentStore>().OpenSession());
        }

        private static DocumentStore CreateDocumentStore()
        {
            var store = new DocumentStore
                {
                    ConnectionStringName = "RavenDB",
                    // DefaultDatabase = "CellJournalist",
                    Conventions =
                        {
                            FailoverBehavior = FailoverBehavior.ReadFromAllServers
                        }
                };

            store.Initialize();

            return store;
        }

And this is the code that is attempting to save the new manager record.

        using (var session = DocumentStore.OpenSession())
        {
            var newManager = new Manager
                {
                    MemberId = request.MemberId,
                    UserName = member.UserName,
                    Email = member.Email,
                    FirstName = member.FirstName,
                    LastName = member.LastName,
                    PasswordHash = defaultPassword,
                    Salt = salt,
                    Partitions = userPartition,
                    CurrentPartitionId = partition.Id,
                    Permissions = DefaultPermission
                };

            session.Store(newManager);
            session.SaveChanges();
        }



session.Store(newManager) is where execution stops and all of those log messages start piling up.


We have other document types stored in Raven where we specify the id when we first insert the object; this issue is only happening on the 2 document types where we let Raven assign the Id.  My gut tells me we're doing something wrong, I'm just not sure what.

Any ideas?

--
You received this message because you are subscribed to the Google Groups "ravendb" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Oren Eini (Ayende Rahien)

unread,
Feb 28, 2013, 2:14:20 AM2/28/13
to ravendb

Nathan Ratcliff

unread,
Feb 28, 2013, 10:11:00 AM2/28/13
to rav...@googlegroups.com
I'm not sure how reproducible this is; the execution hang at session.Store() seems to be intermittent.

When I say those entries were being logged over and over, I mean continuously - 1 or more per second.  We end up with log files that are 4 or 5 GB in size as a result.

We had this issue once before, John posted a message about it to the forums.  This particular collection is only 27 documents, so last time we ended up just blowing it all away and starting over.

What would happen if we imported 25 documents with smuggler, with Ids already assigned, and then started saving future documents with HiLo?  When I did a local test, I saw conflicts on the documents - not sure if that could lead to what we're seeing.

Chris Marisic

unread,
Feb 28, 2013, 10:14:41 AM2/28/13
to rav...@googlegroups.com
You just need to modify the hilo document to be consistent with the keys you've already consumed.
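For context, the HiLo document is just a plain JSON document recording the highest key the generator has handed out - the PUT body visible later in this thread's Fiddler trace has exactly this shape (the value here is only an example):

```json
{
  "Max": 2784
}
```

It lives at /docs/Raven/Hilo/managers for the managers collection, so setting Max at or above the largest numeric id already consumed should keep freshly generated ranges from colliding with imported documents.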

Nathan Ratcliff

unread,
Feb 28, 2013, 10:19:55 AM2/28/13
to rav...@googlegroups.com
@Chris - I will do that, thanks!

Nathan Ratcliff

unread,
Mar 7, 2013, 6:49:33 PM3/7/13
to rav...@googlegroups.com
This issue has continued to plague us, and today I decided to drop the NuGet client and use the RavenDB source instead so I could step through the code.  I now know why session.Store() is getting stuck and why it's been painful to reproduce.

Everything works great until the HiLo generator has to call into GetNextRange().  The short of it is that what we've been perceiving as the code hanging in session.Store() is actually the while(true) loop in HiLoKeyGenerator.GetNextRange() never breaking, because a concurrency exception is received every time the Hilo document is PUT back to the server.  This is also why I keep seeing line after line of "PUT attempted on document 'Raven/Hilo/managers' using a non current etag" in my server logs.

Long version:

HiLoKeyGenerator.GetNextRange() successfully gets the Hilo document from the server, successfully calculates the new max value, then calls into:

ServerClient.Put(string key, Guid? etag, RavenJObject document, RavenJObject metadata)

which calls into:

ServerClient.DirectPut(RavenJObject metadata, string key, Guid? etag, RavenJObject document, string operationUrl)

It gets down to this code:

            try
            {
                responseJson = request.ReadResponseJson();
            }
            catch (WebException e)
            {
                var httpWebResponse = e.Response as HttpWebResponse;
                if (httpWebResponse == null ||
                    httpWebResponse.StatusCode != HttpStatusCode.Conflict)
                    throw;
                throw ThrowConcurrencyException(e);
            }


responseJson is null and the caught WebException is "The remote server returned an error: (409) Conflict."

It goes into ThrowConcurrencyException, and the text from the server is:
{
  "Url": "/docs/Raven/Hilo/managers",
  "ActualETag": "00000001-0000-0400-0000-0000000012e5",
  "ExpectedETag": "00000001-0000-0300-0000-000000000f62",
  "Error": "PUT attempted on document 'Raven/Hilo/managers' using a non current etag"
}

Which bubbles up to the catch in ReplicationInformer.TryOperation<T>(Func<string, T> operation, string operationUrl, bool avoidThrowing, out T result) with the message:


PUT attempted on document 'Raven/Hilo/managers' using a non current etag


This goes on forever - or until I reset IIS and the client gets a fresh range with no knowledge of an old one.
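To make that loop concrete, here is a minimal, self-contained sketch (in Python for brevity; all names are mine, not RavenDB's) of a GetNextRange-style retry: if every re-read hands back a stale etag, the optimistic-concurrency PUT conflicts on every iteration and the loop never exits.

```python
class ConcurrencyError(Exception):
    """Stands in for the client's ConcurrencyException (HTTP 409)."""

def get_next_range(read_doc, put_doc, range_size=32, max_attempts=5):
    """HiLo-style loop: read the doc, bump Max, PUT it back with the etag
    we read; on conflict, re-read and retry (the real loop has no cap)."""
    attempts = 0
    while True:
        attempts += 1
        doc, etag = read_doc()
        try:
            put_doc({"Max": doc["Max"] + range_size}, expected_etag=etag)
            return doc["Max"] + range_size, attempts
        except ConcurrencyError:
            if attempts >= max_attempts:  # demo-only cap so this terminates
                raise

current_etag = "etag-current"

def put(doc, expected_etag):
    """Optimistic-concurrency PUT: reject any write carrying a stale etag."""
    if expected_etag != current_etag:
        raise ConcurrencyError("PUT attempted using a non current etag")

def fresh_read():
    return {"Max": 2752}, "etag-current"

def stale_read():
    return {"Max": 2752}, "etag-stale"  # re-reads keep returning the old etag
```

With fresh_read the loop succeeds on the first pass; with stale_read it conflicts on every iteration, which is the hang observed inside session.Store().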

Does that help make any sense of what might be going on here?

As always, I appreciate the insight/help!







On Thursday, February 28, 2013 1:11:32 AM UTC-6, Oren Eini wrote:

Oren Eini (Ayende Rahien)

unread,
Mar 10, 2013, 11:23:21 AM3/10/13
to ravendb
On concurrency exception, it is supposed to get the _current_ value of the item, and then move on.
Any idea why it is getting the old value?

Oren Eini (Ayende Rahien)

unread,
Mar 10, 2013, 11:23:34 AM3/10/13
to ravendb
WAIT!
Are you using aggressive caching, by any chance?

Nathan Ratcliff

unread,
Mar 10, 2013, 11:36:44 AM3/10/13
to rav...@googlegroups.com
I thought about that - our product is using memcache, but I don't see how it would be interfering with the Raven client.  The things we cache, we cache explicitly.

When you say aggressive caching, did you have something specific in mind?  I'd be more than happy to do more digging.





Oren Eini (Ayende Rahien)

unread,
Mar 10, 2013, 11:38:22 AM3/10/13
to ravendb

Nathan Ratcliff

unread,
Mar 10, 2013, 11:58:14 AM3/10/13
to rav...@googlegroups.com
Thanks for the link, I'm still learning about all of Raven's features.

I just went through the code to verify that we are not using aggressive caching.

Oren Eini (Ayende Rahien)

unread,
Mar 11, 2013, 1:00:03 PM3/11/13
to ravendb
Any way for you to do this with Fiddler open? That should tell us something from the traffic to the server.

Nathan Ratcliff

unread,
Mar 11, 2013, 4:31:48 PM3/11/13
to rav...@googlegroups.com
Request:
Accept-Encoding: gzip
Content-Type: application/json; charset=utf-8
Raven-Client-Version: 2.0.0.0
If-None-Match: "cb825dcf-76eb-6a0b-a1b0-22ec0f9495e2"
Host: raven-test-02:8080

Response:
HTTP/1.1 304 Not Modified
Content-Length: 0
Content-Encoding: gzip
Server: Microsoft-HTTPAPI/2.0
Raven-Server-Build: 2261
Temp-Request-Time: 0 ms
Date: Mon, 11 Mar 2013 20:24:09 GMT

Request:
Accept-Encoding: gzip
Content-Type: application/json; charset=utf-8
Raven-Client-Version: 2.0.0.0
Raven-Entity-Name: Raven
Raven-Replication-Version: 5565201
Raven-Replication-Source: e15e2eb2-b30d-4bac-a80f-11e7d2bbac13
X-NewRelic-ID: UgMEUlJACQQAUFdaDg==,UgMEUlJACQQAUFdaDg==,UgMEUlJACQQAUFdaDg==,UgMEUlJACQQAUFdaDg==,UgMEUlJACQQAUFdaDg==,UgMEUlJACQQAUFdaDg==
If-None-Match: 00000001-0000-0400-0000-0000000000a5
Host: raven-test-01:8080
Expect: 100-continue
Content-Length: 22

{
  "Max": 2784
}

Response:
HTTP/1.1 409 Conflict
Location: /docs/Raven/Hilo/managers
Server: Microsoft-HTTPAPI/2.0
Raven-Server-Build: 2261
Date: Mon, 11 Mar 2013 20:24:10 GMT
Content-Length: 246


Those 2 repeat over and over again.

Oren Eini (Ayende Rahien)

unread,
Mar 11, 2013, 4:33:12 PM3/11/13
to ravendb
Can you send the SAZ file? With several iterations of this ?

Nathan Ratcliff

unread,
Mar 11, 2013, 5:16:14 PM3/11/13
to rav...@googlegroups.com
RavenDb.saz

Oren Eini (Ayende Rahien)

unread,
Mar 13, 2013, 5:18:17 AM3/13/13
to ravendb
Okay, this is really interesting.
Mostly because there is NO WAY that this can happen that I can see.

I tried writing tests for that, and I can't.

Can we pair on this on Skype?
HiLoHanging.cs

Nathan Ratcliff

unread,
Mar 13, 2013, 11:09:14 AM3/13/13
to rav...@googlegroups.com
I agree, I've read and stepped through the Hilo generator code many times and what's happening shouldn't be possible. 

What makes it more challenging is that it doesn't happen as predictably as I'd like.  I originally thought it was every time a new range was requested from the server.  That's still where it happens, but it's intermittent.  I got lucky the other day when you asked for the fiddler trace because the problem occurred fairly quickly and I was able to log it.

I also can't replicate the issue on my laptop with a smaller data set, it only seems to happen on our servers with the full 3+ million documents loaded (didn't seem to happen on server pairs with only 8,000 records).  We've seen this occur on multiple sets of servers, so it's not just a bad install or corrupt instance.

I thought maybe something in our caching infrastructure was overriding the cache policy of the web request Raven uses to fetch the hilo doc, but the cache policy right before the server request is BypassCache.  Caching is about the only thing I can think of that would cause this.  The only commonality left is that the servers communicate via Rackspace's ServiceNet; I asked Rackspace if any caching was in play there and they said no.

This has become less important for us - for most of our entities, we supply the id.  For the ones that were using Hilo, we've abandoned Hilo and switched to the Identity generator instead.

That said, I'm happy to do anything I can to help figure out the cause of the issue.  If it was impacting us, it could impact someone else in the future.


  

Oren Eini (Ayende Rahien)

unread,
Mar 13, 2013, 11:17:55 AM3/13/13
to ravendb
The main problem here is that we fail (which is fine) the first time, then request the document again, and we get it back with the same value.
I am adding code to make it NOT send the etag at all in that scenario. That would skip the cache and avoid the entire issue.
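As I understand the proposed change (a toy model in Python; names are mine), the stale re-read comes from HTTP conditional GETs: when the client sends If-None-Match with its cached etag and the responding server still reports that etag, the server answers 304 and the client keeps its cached copy - exactly the 304 visible in the Fiddler trace earlier in this thread. Dropping the etag from the request forces a full 200 response with the current document.

```python
def get(server_doc, server_etag, if_none_match=None):
    """Sketch of HTTP conditional-GET semantics. With a matching
    If-None-Match the server returns 304 and no body, so the client
    keeps using its cached document (and its cached etag)."""
    if if_none_match is not None and if_none_match == server_etag:
        return 304, None                    # client reuses its cache
    return 200, (dict(server_doc), server_etag)  # full, current document
```

A conditional GET whose cached etag still matches stays on the cache (304); a GET without If-None-Match always returns the server's current state (200).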




Nathan Ratcliff

unread,
Mar 13, 2013, 11:20:59 AM3/13/13
to rav...@googlegroups.com
That sounds promising.  If there's anything we can do to help, test, etc, let me know.

Thanks!

Oren Eini (Ayende Rahien)

unread,
Mar 13, 2013, 12:08:03 PM3/13/13
to ravendb
Can you test the next build?

Nathan Ratcliff

unread,
Mar 13, 2013, 12:20:50 PM3/13/13
to rav...@googlegroups.com
Yes

Oren Eini (Ayende Rahien)

unread,
Mar 13, 2013, 12:21:24 PM3/13/13
to ravendb
Should be out in 30 mins

Nathan Ratcliff

unread,
Mar 13, 2013, 3:16:51 PM3/13/13
to rav...@googlegroups.com
Tested with 2302 (client and server), unfortunately the issue is still there.

Nathan Ratcliff

unread,
Mar 13, 2013, 3:24:30 PM3/13/13
to rav...@googlegroups.com
I did notice from the fiddler trace of the test I just ran that it always requests from the slave (02) and tries to update the master (01).

The documents have the same content on both servers, but have different etags.  Are etags supposed to be the same or different on replicated servers?  If it pulls from 2 and tries to send to 1 with 2's etag each time, wouldn't that cause the concurrency exception to repeat over and over again?
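A toy model of exactly that scenario (Python; all names are mine): both servers hold the same document content under different etags, ReadFromAllServers round-robins the reads, but the HiLo update always writes to the master - so any pass whose read was served by the slave presents the slave's etag and is guaranteed a 409.

```python
import itertools

class Server:
    """Minimal stand-in for a RavenDB node doing optimistic concurrency."""
    def __init__(self, name, etag):
        self.name, self.etag = name, etag
        self.doc = {"Max": 2752}

    def get(self):
        return dict(self.doc), self.etag

    def put(self, doc, expected_etag):
        if expected_etag != self.etag:
            raise RuntimeError("409: non current etag")
        self.doc = doc
        self.etag += "'"  # each successful write produces a new etag

master = Server("raven-test-01", "etag-master")
slave = Server("raven-test-02", "etag-slave")   # replica: same doc, different etag

reads = itertools.cycle([master, slave])  # ReadFromAllServers round-robin

conflicts = 0
for _ in range(4):
    doc, etag = next(reads).get()  # read may be served by the slave...
    try:
        master.put({"Max": doc["Max"] + 32}, etag)  # ...but writes hit the master
    except RuntimeError:
        conflicts += 1  # the slave's etag can never match the master's
```

Every read that lands on the slave produces a conflict; if the retrying read keeps landing on the slave - which is what the trace shows - the loop spins forever.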

Oren Eini (Ayende Rahien)

unread,
Mar 14, 2013, 1:00:37 AM3/14/13
to ravendb
Huh! That is interesting, and very wrong.
Are you running with ReadFromAllServers ?

Oren Eini (Ayende Rahien)

unread,
Mar 14, 2013, 1:00:49 AM3/14/13
to ravendb
Different servers have different etags, that could be it!

Nathan Ratcliff

unread,
Mar 14, 2013, 1:08:48 AM3/14/13
to rav...@googlegroups.com
Yes, we have a master -> slave configuration.

This is how we build the document store in StructureMap:

        public RavendbRegistry()
        {
            For<IDocumentStore>().Singleton().Use(CreateDocumentStore);

            For<IDocumentSession>().HybridHttpOrThreadLocalScoped().Use(ctx => ctx.GetInstance<IDocumentStore>().OpenSession());
        }
        }

        private static DocumentStore CreateDocumentStore()
        {
            var store = new DocumentStore
                {
                    ConnectionStringName = "RavenDB",
                    Conventions =
                        {
                            FailoverBehavior = FailoverBehavior.ReadFromAllServers
                        }
                };

            store.Initialize();

            return store;
        }

We only have the 2 servers, so I assume it uses 1, then 2, then 1 again, etc.

Oren Eini (Ayende Rahien)

unread,
Mar 14, 2013, 1:16:00 AM3/14/13
to ravendb
That makes sense! That is totally how it would behave in that scenario, sure!
I don't know how I missed that it was talking to different servers.
Will have a fix in the next build.

Nathan Ratcliff

unread,
Mar 14, 2013, 1:25:31 AM3/14/13
to rav...@googlegroups.com
Great! Thanks for sticking with this and figuring it out!

Nathan Ratcliff

unread,
Mar 15, 2013, 2:32:16 PM3/15/13
to rav...@googlegroups.com
This is fixed in 2303.  It's working great now.  Thanks!