Multi-tenant, deploying index updates takes hours to clear


Patrick

Apr 22, 2015, 10:52:58 PM
To: rav...@googlegroups.com
RavenDB server & client 2956
130 or so tenant databases, most are "client data" and contain the client-specific business data that our users create.
Each "client" database currently has 28 indexes, most of which are static ones defined by us.
Each "client" database contains anywhere from 50k to 2.5MM documents, the vast majority of which are included in at least one index.
Most of the indexes are simply maps for querying; there are a handful of map-reduce, a couple of multi-maps.

When we deploy code (nearly every week) we want to update any index definitions that have changed for all of the client DBs. The pattern originally looked like this:
foreach (tenant) {
    IndexCreation.CreateIndexes(indexManifestWithTheTypes);
}

Over the last 2 years, as we added clients, we noticed that the process took longer, but most significantly, RavenDB performance degraded severely for a few hours after each deployment (CPU, RAM, and disk I/O were consumed by the indexing process).

At first, we tackled the problem of many indexes getting marked as stale and re-indexed when they hadn't actually changed, so we added a client-side check for equality first:

foreach (tenant) {
    foreach (indexDefinition) {
        if (indexDefinition has changed) {
            IndexCreation.PutIndex(indexDefinition);
        }
    }
}

That helped, but without pausing between tenant databases, performance still degraded too much when even one index changed in all 130 databases, so we augmented the pattern to pause between each index update, to let the indexing process get a "head start":

foreach (tenant) {
    foreach (indexDefinition) {
        if (indexDefinition has changed) {
            IndexCreation.PutIndex(indexDefinition);
            pauseForUpTo60SecondsWhileIndexesAreStale();
        }
    }
}
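The pause step above could be sketched as a generic polling helper. This is a minimal sketch, not the poster's actual code; `is_stale` is a hypothetical caller-supplied callback (e.g. one that checks the database's stale-index count via its statistics), not a RavenDB client API:

```python
import time

def pause_while_indexes_are_stale(is_stale, timeout_secs=60, poll_interval=1.0):
    """Block until is_stale() reports False, or until timeout_secs elapse.

    is_stale: hypothetical callable returning True while any index is stale.
    Returns True if the indexes caught up before the deadline, False otherwise.
    """
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if not is_stale():
            return True   # indexes caught up before the timeout
        time.sleep(poll_interval)
    return False          # gave up after timeout_secs

# Usage: pause_while_indexes_are_stale(lambda: check_stats(tenant), 60)
```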

For the most part, this is smoother. We still occasionally get "Cannot modify indexes while indexing is in progress" even though our deployment mechanism is the only way indexes are created or updated. The upside is that we don't see a massive hit in response time for clients, because we're spreading the indexing work over a much longer period.

However, the problem is that the process now takes HOURS. Because we spend a good bit of time waiting for indexes to stop being stale (so that Raven isn't updating ALL the indexes in ALL the databases at once), as well as retrying when we get the "Cannot modify indexes while indexing is in progress" error, I can start the index update process in the morning and not be done by the end of the day.
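To see why this adds up to hours, here is a rough back-of-envelope using the thread's own numbers; the 2-changed-indexes-per-database figure and the assumption that every wait runs the full 60 seconds are illustrative, not stated facts:

```python
# Worst-case wall-clock time spent purely in the stale-wait pauses.
databases = 130              # tenant databases from the post
changed_indexes_per_db = 2   # illustrative assumption
max_wait_secs = 60           # pauseForUpTo60Seconds... at its ceiling

worst_case_hours = databases * changed_indexes_per_db * max_wait_secs / 3600
print(f"~{worst_case_hours:.1f} hours spent just waiting for staleness to clear")
```

Even before retries on the "Cannot modify indexes" error, the pauses alone account for a large fraction of a working day.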

This is a real bummer when we have a new feature that depends on updated indexes - we have to wait until the indexes are deployed before we can deploy the new bits.

Any ideas as to what we're doing wrong, or is this to be expected?

Additional server info:
6-core Xeon (virtual)
16GB RAM
disks: System, Data x2 (clients spread among 2 disks), Indexes

Patrick

Apr 22, 2015, 10:57:59 PM
To: rav...@googlegroups.com
Oops- typo: smallest clients have about 500k documents, not 50k.

Chris Marisic

Apr 23, 2015, 10:44:11 AM
To: rav...@googlegroups.com

At first, we tackled the problem of many indexes getting marked as stale and re-indexed when they hadn't actually changed, so we added a client-side check for equality first:

This has been a feature of Raven since before it even hit 1.0. Unless a major bug occurred, either your index definitions really are modified, your deployment process is modifying the index definitions, or it's just not happening. When the Raven server is under load, an index can be shown as stale even when its definition is unchanged. Staleness merely acknowledges: "I have received document changes, and I have not yet processed every single one of those changes for this index, so it is 'stale'."

I really can't find anything to complain about here. You have over 3,000 indexes across 100-million-plus documents, and then you update hundreds of indexes all at the same time.

Buy a bigger box. At the very least, go from 16GB of RAM to 128GB or 256GB. The disks should also be SSDs. Honestly, the only thing I see here is that this box is massively undersized for your workload.

Patrick Boudreaux

Apr 23, 2015, 4:40:36 PM
To: rav...@googlegroups.com
Thanks, Chris.

Regarding testing when indexes have actually changed: I know that Raven does that, we've just grown to not trust it, after seeing many indexes suddenly go stale when we push the same definitions without change. For what it's worth, we use the Raven-defined IndexDefinition.Equals() method to test equality.

Also, we've been using RavenDB since before 1.0 :) My product has grown up with it.

My first thought was to allocate more hardware; I didn't want to go there without checking to make sure we weren't doing something sub-optimal. Some of the discussion I've seen around server sizing gave me the impression that a server this size should be able to handle what we're doing. I've got plans to scale horizontally (I'll bounce those off the community in a future thread).

Any other thoughts or examples of server sizing? I'll have to make a clear case to get that significant of a size increase.

--
You received this message because you are subscribed to a topic in the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/ravendb/OvOCUnHPmPE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ravendb+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris Marisic

Apr 23, 2015, 5:37:12 PM
To: rav...@googlegroups.com
It's a server; give it RAM. I don't understand how, in 2015, there even needs to be an argument that a mission-critical production database server should have at least 128GB of RAM.

My last production RavenDB box was a dual octo-core Xeon with 192GB of RAM and SSDs in RAID 0+1, for maybe 20% of your workload, plus a single failover on an AWS VPS for HA. I'm all for "the cloud", but I'm also all for deploying metal for a database.

I want my databases to use every single byte of RAM their hearts could ever possibly desire.

I also see nothing in your usage that says scale horizontally; you've just under-provisioned the hardware. Put another zero on the end of those numbers and then I'd recommend going horizontal. The only other reason I'd recommend it is international data, where you want servers in the US, EU, Asia, etc. due to latency. Otherwise, just stick metal behind a fat pipe and be done with it.

Oren Eini (Ayende Rahien)

Apr 24, 2015, 2:11:37 AM
To: ravendb
Consider what you are asking RavenDB to do.

3,640 indexes across the whole server.
65,000,000 documents to index (the minimum; more likely in the few hundred millions).

That is putting a LOT of work on the server to do.

Note that we already don't do anything for indexes that are recreated with the same definition, so that doesn't matter.

But the likely issue is that you are saturating the machine.
Let us take the case where you changed only 2 indexes.
That gives us 260 changed indexes and, let us say, 75 million documents to index.

Let us assume that each document is 512 bytes in size (most are considerably larger).

That gives us 35 GB that we need to read from the disks.

You also have 6 cores available, so we have to index 260 indexes on 6 cores.
That means we have to do about 44 rounds of 6 indexes at a time.

Note that I'm ignoring all other costs such as write speed, RAM usage, etc.
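This estimate can be reproduced numerically; all figures come from the post itself, and the ceiling division for the number of rounds is the only step made explicit:

```python
import math

tenants = 130               # tenant databases
changed_per_tenant = 2      # indexes changed in the example
docs_to_index = 75_000_000  # documents the changed indexes must process
avg_doc_bytes = 512         # assumed average document size
cores = 6

indexes_changed = tenants * changed_per_tenant            # 260 indexes
data_read_gib = docs_to_index * avg_doc_bytes / 2**30     # ~35.8 GiB read from disk
rounds = math.ceil(indexes_changed / cores)               # 44 rounds of 6 indexes
```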

You are asking RavenDB to do quite a lot. And it just doesn't have the resources for it.

A better alternative would be to split the server into multiple machines, or to make use of the side-by-side indexing that is new in 3.0.




Hibernating Rhinos Ltd
Oren Eini | CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811



Oren Eini (Ayende Rahien)

Apr 24, 2015, 2:13:21 AM
To: ravendb
Patrick,
I think that you'll find that during normal operations, the hardware is sufficient.
It is just when you update index definitions that you push us into the worst mode possible:
all databases are awake, all databases are active, and all require resources.

