Replication very slow and Not replicating All Documents

Anteneh Maru

unread,

Mar 21, 2014, 1:52:36 PM3/21/14

to rav...@googlegroups.com, Jame...@feisystems.com

I have been using Raven 2.5 (Build 2750) in a production enviroment with a Master Slave replication set Up. The first time we replicate the data from master slave it took 8 hours for about 8 million records. We are trying to set up a second slave and replicate from the same Master Database, But we are running in different kinds of issues.

the first couple of trials it replicated half of the documents (currently we have 12 million documents) and stopped without any error, but when a new Document is added it got replicated to the new slave. After that it starts becoming very slow, it replicates 2 million documents in 24 hours. I have tried changing the configuration setting but didn't help. here is the configurations i am using now.

</appSettings>

I turned on debugging and tried to see if any thing is looged, but nothing comes up in the log files.

The server is a window 2012 with 24 GB Ram and 8 CPU with 2.7GHz.I amn't

We are running out of options and would like to get any suggestions.

Oren Eini (Ayende Rahien)

unread,

Mar 23, 2014, 3:52:35 AM3/23/14

to ravendb, James Tan

Anything in the logs? What about the replication endpoints?

/replication/info

Oren Eini

CEO

Mobile: + 972-52-548-6969

Office: + 972-4-674-7811

Fax: + 972-153-4622-7811

--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Message has been deleted

Anteneh Maru

unread,

Mar 24, 2014, 11:26:47 AM3/24/14

to rav...@googlegroups.com, James Tan

I recreated it in a test site and this is the replication endpoint of the slave server

[{"LastDocumentEtag":"01000000-0000-004D-0000-000000008C73","LastAttachmentEtag":"02000000-0000-004C-0000-000000000005","ServerInstanceId":"d4d2c6ca-4ebb-4c72-a001-fd727cc309b1","Source":"http://svr006/databases/db1","@metadata":{"@id":"Raven/Replication/Sources/http://svr006/databases/db1","Last-Modified":"2014-03-21T21:12:47.5093512Z","Raven-Last-Modified":"2014-03-21T21:12:47.5093512","@etag":"01000000-0000-001E-0000-00000000C4D0","Non-Authoritative-Information":false}}]

and the master

{"Self":"http://svr006/databases/db1","MostRecentDocumentEtag":"01000000-0000-004D-0000-000000008C73","MostRecentAttachmentEtag":"02000000-0000-004C-0000-000000000005","Stats":[{"FailureCountInternal":0,"Url":"http://svr004/databases/db2","LastHeartbeatReceived":null,"LastEtagCheckedForReplication":"01000000-0000-004D-0000-000000008C73","LastReplicatedEtag":"01000000-0000-004D-0000-000000008C73","LastReplicatedLastModified":"2014-03-21T21:12:47.4206207Z","LastSuccessTimestamp":"2014-03-21T21:12:47.5145186Z","LastFailureTimestamp":null,"FailureCount":0,"LastError":null}]}




I have shared the log file I got by turning the debugging on at the following location
https://drive.google.com/file/d/0B2lODLbg8xRlbzdVYXBkYzdSTGM/edit?usp=sharing

Sincerely,
Anteneh

Anteneh Maru
Microsoft Certified Technology Specialist
Software Developer
Cell: 202-609-4697
Email: maru.a...@gmail.com

--
You received this message because you are subscribed to a topic in the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/ravendb/XvkRmFFxGUc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ravendb+u...@googlegroups.com.

Oren Eini (Ayende Rahien)

unread,

Mar 24, 2014, 11:29:00 AM3/24/14

to ravendb, James Tan

How did you recreate it?

Oren Eini

CEO

Mobile: + 972-52-548-6969

Office: + 972-4-674-7811

Fax: + 972-153-4622-7811

Anteneh Maru

unread,

Mar 24, 2014, 11:30:30 AM3/24/14

to rav...@googlegroups.com

restore the production database to the master database and start the replication process to the slave test server.

Sincerely,
Anteneh

Anteneh Maru
Microsoft Certified Technology Specialist
Software Developer
Cell: 202-609-4697
Email: maru.a...@gmail.com

Arkadiusz Paliński

unread,

Mar 25, 2014, 5:57:47 AM3/25/14

to rav...@googlegroups.com

Hi Anteneh,

at what moment did you take the outputs of the replication endpoints which you share with us? Were they taken when the replication got stuck or any other significant moment in the whole replication process? I'm asking because these results suggest that on both servers the highest etag is '01000000-0000-004D-0000-000000008C73' what means that there is just 35.955 documents in databases.

Also the log file you shared comes from a slave of a production system while the outputs of replication endpoints from a test environment, right?

Do you have logs from a primary server too?

Were any new documents put on the primary server while it replicated to the new slave?

Do you use bulk inserts or heavy concurrent writes?

Regards,

Arek

micro

unread,

Mar 30, 2014, 4:42:52 PM3/30/14

to rav...@googlegroups.com

It seems extreme hard to debug this issue in the production but it *has* problem!!

We are trying it one more time to replicate to a new instance with Master-Master.

Here are things:

1. The replication extremely slow, 40K document for 1 hour

2. No error on both side

3. I recalled similar slow replication happened during the new deployment (upgrade 1.0 to 2.5 and then create master-master replication), we had to restart source db and then the speed became normal.

4. Now the problem is that our production is 24x7 but will see if can restart the server.

Here is the question: we are sure there is some bug or issue in the replication process, extremely slow but seems ok after restart source. We cannot provide more info, not error log.

Thanks

micro

unread,

Mar 30, 2014, 4:44:16 PM3/30/14

to rav...@googlegroups.com

There was no bulk insert, but it did have lot insert/update during the process.

On Tuesday, March 25, 2014 5:57:47 AM UTC-4, Arkadiusz Paliński wrote:

micro

unread,

Mar 30, 2014, 5:00:40 PM3/30/14

to rav...@googlegroups.com

I did not test it, but I think it could be related to max outbound connection in ASP.NET application?

the default is 2 connection/target IP.

http://support.microsoft.com/kb/821268

We will try to change the web.config to add this

<configuration>
  <system.net>
    <connectionManagement>
      <add address="*" maxconnection="24"/>
    </connectionManagement>
  </system.net>
</configuration>

Oren Eini (Ayende Rahien)

unread,

Mar 31, 2014, 1:56:50 AM3/31/14

to ravendb

That is unlikely, we only have a single outgoing connection per destination.

Oren Eini

CEO

Mobile: + 972-52-548-6969

Office: + 972-4-674-7811

Fax: + 972-153-4622-7811

Edwin Frey

unread,

Dec 1, 2014, 6:21:49 PM12/1/14

to rav...@googlegroups.com

I know this is a very old thread, but did anything come of this? I'm experiencing a very similar issue where replication to a new source is very slow, restarting the source server speeds it up, but then it slows down again, and when the master says that the slave is in lock step etag wise, it is missing probably half of the documents that should exist (confirmed by trying to hit a couple that I got errors on the site by keying on the master and slave, the documents existed on the master, but not the slave).

Had debug logging on for a different problem and didn't see any errors on in the log.

Any suggestions would be appreciated!

Daniel Häfele

unread,

Dec 1, 2014, 6:29:40 PM12/1/14

to rav...@googlegroups.com

What build are you using?

I think there were some bug-fixes with replication and mis-matching document counts in the last versions.

You should propably upgrade your servers.

Edwin Frey

unread,

Dec 1, 2014, 7:55:23 PM12/1/14

to rav...@googlegroups.com

I'm running build 2879

Federico Lois

unread,

Dec 1, 2014, 9:33:22 PM12/1/14

to rav...@googlegroups.com

Those fixes are in the 29xx servers as far as I remember. It was something regarding the prefetching behavior, which in itself behaves like that. I know because I was the one reporting it, and providing a way to setup an environment to repro it. ;)

Federico
Sent from my Windows Phone

From: Edwin Frey
Sent: ‎01/‎12/‎2014 21:55
To: rav...@googlegroups.com
Subject: Re: [RavenDB] Replication very slow and Not replicating All Documents

[The entire original message is not included.]

Edwin Frey

unread,

Dec 3, 2014, 6:50:32 PM12/3/14

to rav...@googlegroups.com

Thank you for the info Federico.

I took the downtime this morning and upgraded the master to the lastest 2.5 stable build, 2945.

The same thing is happening with replication slowing down. The first 4 hours were replicating at ~ a million documents an hour and I was very hopeful. Now, 10 hours later, it's replicating at 40K documents an hour. Out of 12 million documents i'm only at 6.2 million.

According to the replication info endpoint, it's taking ~1.25 minute to pull the next batch of 3000 documents. This morning it was 8 seconds for 10K documents.

I have some debug logs from the server for the last hour, I'm going to dig through them to see if I can see anything obvious, but I didn't last time, so I'm not hopeful.

Anyone have any additional ideas?

Federico Lois

unread,

Dec 3, 2014, 9:05:03 PM12/3/14

to rav...@googlegroups.com

The size of the document matter, specially in io handicapped environment like Azure where latency is not your friend. We have an unbalanced data store that arises from bulk migration where most the big documents are sitting on the same etag range. Make sure yours is not such a case. If you can I would do a replication from 1 database to another in the same server and with the databases sitting in the same SSD to see if you can repro it.

Sent from my Windows Phone

From: Edwin Frey
Sent: ‎03/‎12/‎2014 20:50

Oren Eini (Ayende Rahien)

unread,

Dec 4, 2014, 1:12:47 AM12/4/14

to ravendb

Also, check if you have a large range of deleted documents, and see if you can purge the tombstones.

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

--

Ed

unread,

Dec 4, 2014, 8:16:19 AM12/4/14

to rav...@googlegroups.com

Oren,

I thought tombstones show up in the total document count, so if I was just looking at raw doc count, not collection count, wouldn't it still be sending those across and I'd see the total doc count continue to go up?

I tried searching this morning but couldn't find any documentation on finding out what the tombstone count is, or how to delete them, can you point me in the right direction please?

--
You received this message because you are subscribed to a topic in the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/ravendb/XvkRmFFxGUc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ravendb+u...@googlegroups.com.

Oren Eini (Ayende Rahien)

unread,

Dec 4, 2014, 9:32:40 AM12/4/14

to ravendb

No, since 2.0 we don't handle tombstones as documents.

Look at the logs, it should tell you what it is doing.

Ed

unread,

Dec 4, 2014, 9:40:08 AM12/4/14

to rav...@googlegroups.com

I was digging through the logs last night and couldn't really see anything obvious. The only thing that stuck out was lines like this:

2014-12-04 02:07:23.8323,Raven.Bundles.Replication.Tasks.ReplicationStrategy,Debug,Will replicate '9cbfd9a123ef23a7e72800892eeb0fa5' to '7193147c-4be1-451d-b3cd-dcb145c13bee',

most of the lines around those are actual document IDs, that's just a guid and we don't use guids as document Ids.

If those are tombstones then that might be my problem, how would I go about purging them?

Oren Eini (Ayende Rahien)

unread,

Dec 4, 2014, 9:44:11 AM12/4/14

to ravendb

POST /admin/replication/purge-tombstones

This log entry comes from here:

https://github.com/ayende/ravendb/blob/2.5/Raven.Database/Bundles/Replication/Tasks/ReplicationStrategy.cs#L54

So you do have a document with that name.

Ed

unread,

Dec 4, 2014, 10:46:26 AM12/4/14

to rav...@googlegroups.com

Thanks for the quick follow-ups Oren, it's much appreciated. I'm slightly nervous about running it without some sanity checks, which may just need some re-assurance.

Is there an endpoint to get the tombstone count so I know how many are about to be deleted?

Is it safe to run that in the middle of the business day, load wise, or should i wait till off hours?

Is it going to mess up in progress replication at all, or should I wait till the replication is done?

Again, thank you!

Oren Eini (Ayende Rahien)

unread,

Dec 4, 2014, 2:51:55 PM12/4/14

to ravendb

We don't expose tombstones externally in 2.5, I'm afraid.

Try running this in off hours, just to be safe.

It shouldn't matter wrt to replication, but it you can disable replication first, to minimize the number of moving parts, that is probably preferable.

Edwin Frey

unread,

Dec 4, 2014, 11:22:40 PM12/4/14

to rav...@googlegroups.com

First, because I'm not sure I said it, the good news that I haven't mentioned yet: upgrading to the current version seems to have fixed the missing document problem, so far, as far as I can tell, it has been replicating everything.

As for the purge tombstone: I tried it on the only slave that I have that is up to date as a safe test...I'm assuming that if I get this error it means there are a lot of tombstones to clean up:

{

"Error": "Microsoft.Isam.Esent.Interop.EsentVersionStoreOutOfMemoryException: Version store out of memory (cleanup already attempted)\r\n at Raven.Storage.Esent.StorageActions.DocumentStorageActions.RemoveAllBefore(String name, Etag etag)\r\n at Raven.Bundles.Replication.Responders.AdminPurgeTombstones.<>c__DisplayClass5.<RespondToAdmin>b__4(IStorageActionsAccessor accessor)\r\n at Raven.Storage.Esent.TransactionalStorage.ExecuteBatch(Action`1 action, EsentTransactionContext transactionContext)\r\n at Raven.Storage.Esent.TransactionalStorage.Batch(Action`1 action)\r\n at Raven.Bundles.Replication.Responders.AdminPurgeTombstones.RespondToAdmin(IHttpContext context)\r\n at Raven.Database.Server.HttpServer.DispatchRequest(IHttpContext ctx)\r\n at Raven.Database.Server.HttpServer.HandleActualRequest(IHttpContext ctx)"

}

I cut down the size and a couple batches completed successfully, going to move through the entire range in batches to the current etag. Guess the only way to test if that fixes the problem is try to replicate to a fresh db and see if it completes much faster.

Oren Eini (Ayende Rahien)

unread,

Dec 5, 2014, 2:09:40 AM12/5/14

to ravendb

Edwin,

Yes, that means that you have a _lot_ of them.

Note that in 3.0 we have an automated task that purge them on a periodical basis.

Edwin Frey

unread,

Dec 5, 2014, 10:37:57 AM12/5/14

to rav...@googlegroups.com

http://X/databases/X/admin/replication/purge-tombstones?docEtag=01000000-0000-0030-0000-000000000000&attachmentEtag=00000000-0000-0000-0000-000000000000

runs for a while and then completes.

http://X/databases/X/admin/replication/purge-tombstones?docEtag=01000000-0000-0040-0000-000000000000&attachmentEtag=00000000-0000-0000-0000-000000000000

throws that error afterwards, but every increment I do like

http://X/databases/X/admin/replication/purge-tombstones?docEtag=01000000-0000-0030-0000-000000A00000&attachmentEtag=00000000-0000-0000-0000-000000000000

completes immediately. Trying 4 again just fails. Is there a way to see or know etag ranges are in use so I can try to find the large spot to purge it?

Thank you for all your help with this Oren.

Oren Eini (Ayende Rahien)

unread,

Dec 5, 2014, 4:37:16 PM12/5/14

to ravendb

I'm afraid that this isn't exposed.

You can raise the Raven/Esent/MaxVerPages to allow us to use more resources for this process.

Reply all

Reply to author

Forward