Delete / Drop collection performance


Johannes Rudolph

Jun 12, 2019, 5:47:19 AM
to RavenDB - 2nd generation document database
We're continuing our quest for performance with RavenDB 4.2. We have a scenario that requires us to delete a large collection (about 20 GiB, 4.75 million documents) after a migration.
In production settings, this collection may be even larger.

Unfortunately, we see very slow performance when deleting the collection, approximately 2,400 documents/second. It appears that RavenDB is deleting each document individually. Questions:

1) Is there a faster way to delete a whole collection (i.e. a "drop collection" without tombstones etc., analogous to "drop table" in SQL)?
2) 2,400 docs/s seems rather slow for deleting documents matched by the simplest possible query ("from $collection") - is this an expected performance number?

The DB is running on an i7-8700 with SSD storage. I have attached a screenshot of the I/O stats; it appears to me that there are considerable gaps between journal writes and data flushes.
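For scale, the observed rate translates into substantial wall-clock time. A quick back-of-the-envelope check (plain Python, nothing RavenDB-specific) using the numbers above:

```python
# Rough estimate of how long the delete-by-query takes at the observed rate.
doc_count = 4_750_000   # ~4.75 million documents in the collection
rate = 2_400            # observed deletes per second

seconds = doc_count / rate
print(f"~{seconds / 60:.0f} minutes to delete the collection")  # ~33 minutes
```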



Attachments: Screenshot from 2019-06-12 11-37-49.png, Screenshot from 2019-06-12 11-36-19.png

Oren Eini (Ayende Rahien)

Jun 12, 2019, 9:24:05 PM
to ravendb
Collection delete is implemented via deleting each document.
We cannot really do a drop, because:
* We need to generate revisions
* We need to generate tombstones
** This is important for replication, indexing, etc

See the discussion here about how to read the I/O map:

What kind of machine are you using?


--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
To post to this group, send email to rav...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ravendb/401bccef-6227-4a70-8e75-1b34461eae33%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Oren Eini
CEO   /   Hibernating Rhinos LTD
Skype:  ayenderahien
Support:  sup...@ravendb.net

Johannes Rudolph

Jun 13, 2019, 2:25:50 AM
to RavenDB - 2nd generation document database
The delete query has been running on my local machine with 32 GiB RAM.

> The DB is running on an i7-8700 with SSD storage. 

The part about the tombstones makes sense, though I think there are two different semantics involved (in SQL terms):

- "delete from collection", which is what we have right now: deletes each document individually; revisions and indexes are maintained, subscriptions and replication do the right thing, etc.
- "drop collection", which is what I'm looking for: immediately remove the whole collection on all DB nodes (i.e. make it a cluster command?). It's expected that indexes and subscriptions will be invalid afterwards, and the user may need to clean them up. Alternatively, throw if the collection is still used in any index or subscription (all of that info should be in the DatabaseRecord anyway, right?).

Does that make sense? I'm not sure whether it's worth the effort to optimize for this scenario. However, the desire comes from the observed slow performance of the O(n) delete-by-query operation. It wouldn't be much of an issue if the delete ran at 100k docs/s...
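The proposed "throw if the collection is still used" check could be sketched like this (purely illustrative Python; the record shape, key names, and `CollectionInUseError` are assumptions for the sketch, not RavenDB's actual DatabaseRecord API):

```python
class CollectionInUseError(Exception):
    """Raised when a drop is refused because the collection is still referenced."""


def assert_droppable(database_record: dict, collection: str) -> None:
    """Refuse to drop a collection that any index or subscription still uses.

    database_record is a simplified stand-in for a server-side DatabaseRecord:
      {"indexes": {name: {"collections": [...]}},
       "subscriptions": {name: {"collection": ...}}}
    """
    used_by = []
    for name, index in database_record.get("indexes", {}).items():
        if collection in index.get("collections", []):
            used_by.append(f"index '{name}'")
    for name, sub in database_record.get("subscriptions", {}).items():
        if sub.get("collection") == collection:
            used_by.append(f"subscription '{name}'")
    if used_by:
        raise CollectionInUseError(
            f"cannot drop '{collection}': referenced by {', '.join(used_by)}")
```

For example, dropping a collection that an index still targets would raise, while dropping an unreferenced staging collection would pass silently.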


Oren Eini (Ayende Rahien)

Jun 13, 2019, 12:46:42 PM
to ravendb
I would really hesitate to do this. You have to take into account that you may have ETL processes and external/pull replication that would also be impacted.



Ian Cross

Jun 20, 2019, 6:10:53 AM
to RavenDB - 2nd generation document database
Hi Oren,

I tend to agree with Johannes here.

Maybe this is improved in RavenDB 4 (we are not quite there yet), but deleting a complete collection takes a very long time in RavenDB 3.5 when you have millions of records. Sometimes you just know that you want to delete an entire collection. For example, we have this kind of process (maybe there is a better way):
  • We import a full data set containing 5 million products every morning from an external data source into a staging database. The external data source cannot tell us what is new, updated, or deleted.
  • We ETL some Product records from the main database, including a hash of the pertinent product data when serialized.
  • We serialize and hash each product in the new data set.
  • We use RavenDB indexing to tell us which products are new, updated, or deleted by comparing the hashes.
In the above scenario, we have to clear out the products each day and would love for there to be a "drop collection". Maybe there could be a cluster-wide drop collection which removes the collection from all databases, with such collections identified as "droppable" so that other functionality does not operate on them. We use this collection for a very specific purpose and therefore don't need anything beyond the ability to drop it, bulk insert into it, and index it.
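The hash-comparison steps above can be sketched in plain Python (illustrative only; this uses stdlib `json`/`hashlib` rather than the actual import pipeline, and the function names are made up):

```python
import hashlib
import json


def product_hash(product: dict) -> str:
    """Hash a canonically serialized product so equal data yields equal hashes."""
    serialized = json.dumps(product, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def diff_products(old_hashes: dict, new_products: dict) -> dict:
    """Classify products as new, updated, or deleted by comparing hash maps.

    old_hashes:   {product_id: hash} from the previous import (e.g. ETL'd over)
    new_products: {product_id: product_dict} from today's full data set
    """
    new_hashes = {pid: product_hash(p) for pid, p in new_products.items()}
    return {
        "new": [pid for pid in new_hashes if pid not in old_hashes],
        "updated": [pid for pid, h in new_hashes.items()
                    if pid in old_hashes and old_hashes[pid] != h],
        "deleted": [pid for pid in old_hashes if pid not in new_hashes],
    }
```

Canonical serialization (sorted keys, fixed separators) matters here: two documents with the same fields in a different order must produce the same hash, or every product would look "updated" on every import.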

Oren Eini (Ayende Rahien)

Jun 30, 2019, 9:34:31 AM
to ravendb
Is there a reason why you cannot do that in a separate db? You can set up ETL from the other db to compare values to it as well.

In 4.0, deleting is much faster, but the tracking requirements mean that we can't really handle it without going over each item.




Ian Cross

Jul 1, 2019, 5:27:50 AM
to rav...@googlegroups.com

Hi Oren,

Yes, this is already happening in a separate database, and we use ETL to bring over the records we want to compare from the main database.

Are you saying it may be better to "drop" the whole database and ETL the products from the main database every day? I assume, though, that this puts load on the main database every morning to push the 4M products over?

We will try the performance of delete in RavenDB 4.0. Given that we do this once per day, we can kick off the delete well in advance so that it is complete before we start the process the next day.

Cheers,

Ian

