Processing all documents in a collection


Andrej Krivulčík

Sep 6, 2016, 8:02:17
to RavenDB - 2nd generation document database
I need to process around 5 million documents stored in a Voron database. The entire collection needs to be processed, but I keep a processing version identifier in each document so I can identify and filter out documents that were already processed; in case of failure, I don't need to reprocess the same documents twice. The documents are around 2-20 kB in size (average around 7 kB), and after processing, each document will grow by about 10 kB. There is one map-only index and one MapReduce index using the collection. The database file (Raven.voron) is 146 GB in size and the Indexes directory is 58.3 GB. There are other types of documents as well; the total document count is roughly 25 million.

I tried to do this using a simple implementation where one thread loads the documents using streaming and another thread processes each document and saves it (batched). This approach keeps failing for one of two reasons:
  • The server hits the scratch buffers size limit, even after increasing the limit to 18 GB and slowing the processing down.
  • The server forcibly closes the connection to the thread reading the documents.
One of the scenarios was accompanied by excessive RAM usage (it filled 14 GB) and swapping on the RavenDB machine (an Azure VM). I'm not sure which one; I'll verify if this information is important. I couldn't figure out why the server closed the connection. It might be because of the issue with the buffers being filled. Hints and pointers about how to diagnose this issue would be appreciated.

I tried to implement this using Data Subscriptions which seem to fit the use case nicely. However, I have two questions about this approach:
  • Batching - will processing in batches work correctly with document acknowledgment? I have MaxDocCount = 256 set in SubscriptionBatchOptions within SubscriptionConnectionOptions, and the subscription call looks like this: subscription.Buffer(256).Subscribe(ProcessBatch). Will this correctly ensure that no documents are missed when ProcessBatch fails during processing?
  • When using this approach on a subset database containing around 50,000 records, the processing worked correctly (although slowly, as it wasn't batched during the test run). However, after starting it on the production database, the processing seems to take a very long time to start. After starting, the RavenDB VM started reading from disk (around 100 MB/s reported in Resource Monitor), and after 16 hours (it ran overnight), the processing still hadn't started - no documents were processed. The documents in the database were imported and reimported several times, so it's possible that early etags are not assigned to any documents. Still, the startup delay seems excessive, as at that speed, the entire database could be read in ~1500 seconds (146 GB * 1024 (MB/GB) / 100 (MB/s) = 149504 MB / 100 (MB/s) = 1495 s).
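For reference, the batched subscription setup described above would look roughly like this. This is a sketch only: the SubscriptionCriteria, SubscriptionConnectionOptions, SubscriptionBatchOptions, and MaxDocCount names come from this thread, but the `Subscriptions.Create`/`Open` calls, their signatures, and the `MyDocument` type are my assumptions about the 3.5 client API.

```csharp
// Sketch of the batched data subscription setup (RavenDB 3.5 .NET client).
// Create/Open signatures and MyDocument are assumptions; the option names
// are the ones mentioned in this thread.
var subscriptionId = store.Subscriptions.Create(new SubscriptionCriteria());

var subscription = store.Subscriptions.Open<MyDocument>(subscriptionId,
    new SubscriptionConnectionOptions
    {
        BatchOptions = new SubscriptionBatchOptions { MaxDocCount = 256 }
    });

// Buffer the pushed documents into client-side batches of 256 (System.Reactive)
// and hand each batch to the processing routine.
subscription.Buffer(256).Subscribe(ProcessBatch);
```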

Is this the correct approach to solve this problem? What could be the cause of the data not being processed? Can I manually find the earliest etag to process to save some of the initial scanning? That way, I could specify StartEtag when creating the data subscription.

Thanks
Andrej

Oren Eini (Ayende Rahien)

Sep 6, 2016, 9:10:51
to ravendb

Hibernating Rhinos Ltd
Oren Eini | CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811
--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrej Krivulčík

Sep 6, 2016, 10:14:19
to RavenDB - 2nd generation document database
Thanks for the confirmation, Oren. However, can you comment on the delay when starting the processing? I tried to start the processing again (with batching), and the progress is the same: RavenDB keeps reading from disk (125 MB/s at the moment), and the processing hasn't even started (it's been around 4.5 hours now). Or can you at least offer some suggestions about how to investigate what's going on on the server?

Oren Eini (Ayende Rahien)

Sep 6, 2016, 10:16:02
to ravendb
Subscriptions scan through the documents and filter out all those that aren't relevant. In some cases, that is what you're seeing. Once it has caught up, it will be much faster.

Andrej Krivulčík

Sep 7, 2016, 4:22:15
to RavenDB - 2nd generation document database
Either I'm doing something wrong or there is a problem with the data subscriptions.

I started the client program yesterday at ~9:47 UTC. Right now it's ~7:50 UTC the following day (22 hours later) and the processing still hasn't started. The server reports constant reading from the drive - currently 80 MB/s.

I don't understand why it takes so long; at this speed, the entire database could have been read tens of times over by now.

Is there a way to figure out what the server is doing (scanning progress or something similar)?

Server build is #35166 (3.5), client version is 3.5.0-rc-35166.

Oren Eini (Ayende Rahien)

Sep 7, 2016, 4:30:17
to ravendb
Look at the subscription stats - what is it saying?

Andrej Krivulčík

Sep 7, 2016, 5:20:38
to RavenDB - 2nd generation document database

I'm not sure if you mean this: [screenshot]

Oren Eini (Ayende Rahien)

Sep 7, 2016, 5:23:22
to ravendb
Yes, that is it.
What is in the log?

Andrej Krivulčík

Sep 7, 2016, 6:08:04
to RavenDB - 2nd generation document database
I'll be away until the end of this week, so I'll get back to this next week.

Quick clarification: are the logs you need located at /studio/index.html#admin/settings/adminLogs?

Oren Eini (Ayende Rahien)

Sep 7, 2016, 6:10:33
to ravendb
yes

Andrej Krivulčík

Sep 12, 2016, 8:00:37
to RavenDB - 2nd generation document database
I sent the logs to sup...@hibernatingrhinos.com along with Resource Monitor screenshots.

Andrej Krivulčík

Sep 12, 2016, 8:03:05
to RavenDB - 2nd generation document database
Also, I upgraded the server and client libraries to the latest release (build 35169).

Maxim Buryak

Sep 12, 2016, 10:54:13
to rav...@googlegroups.com
Hi,
I haven't seen anything unusual in the logs.
Can you please try the solution proposed here: https://groups.google.com/forum/#!topic/ravendb/wK6fS3K4iaw

It basically says that you should set the StartEtag of the SubscriptionCriteria to the earliest etag of the collection you are trying to query, so the server won't have to read and skip all those irrelevant documents.
Also, if your document IDs are of the form [collectionName]/number, then you could add the KeyStartsWith value to the criteria to make the processing of the gaps between documents faster.
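Spelled out as code, the suggestion above would look something like this. A sketch only: the etag value and key prefix are placeholders, and the exact `Subscriptions.Create` signature is an assumption about the 3.5 client.

```csharp
// Sketch: start the subscription at the collection's earliest etag so the
// server doesn't scan and skip millions of irrelevant documents, and give
// it a key prefix so gaps between matching documents are skipped faster.
// Both values below are placeholders.
var subscriptionId = store.Subscriptions.Create(new SubscriptionCriteria
{
    StartEtag = Etag.Parse("01000000-0000-0042-0000-000000000001"), // placeholder
    KeyStartsWith = "collectionName/" // only if your IDs share this prefix
});
```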



Best Regards,

Hibernating Rhinos Ltd
Maxim Buryak | Core Team Developer | Mobile: +972-54-217-7751
RavenDB, paving the way to "Data Made Simple": http://ravendb.net/




Andrej Krivulčík

Sep 12, 2016, 14:27:47
to RavenDB - 2nd generation document database
When I specify StartEtag, the processing starts pretty much immediately. I randomly selected a document from the collection, subtracted around 4k from its Etag, and used that value as the StartEtag. The document was not one of the last documents, so the processing proceeds nicely.

For this to be useful, I need to find the earliest Etag present in a collection. Any hints about how to find this value?

Documents in this particular collection don't have a common prefix in the id; the id is a combination of two identifiers. Therefore I can't use KeyStartsWith.

However, what I find problematic is that the processing without specifying StartEtag didn't start after almost 6 days of running. I executed the program (created and connected to the subscription) on 2016-09-07 and let it run through my long weekend until today (2016-09-12). The behavior was as described above: disk reads saturated by RavenDB, 100% CPU utilization, and no documents returned. I would find a delay on the order of hours understandable, but when the server is unable to return the first documents after days of constant disk reading, I find this very suspicious. As mentioned earlier, the data was imported, deleted, and changed several times, so the early Etags are probably unassigned, but it seems that the database has problems even iterating through the collection.

If it would be useful to investigate the issue, I can provide the database. Full dump is around 5 GB.

Maxim Buryak

Sep 13, 2016, 3:41:20
to rav...@googlegroups.com

First, please refer to the link I sent in the previous mail; you'll find instructions there for finding the etag. Second, we'll be glad to receive that dump - share it with us through sup...@ravendb.com, along with instructions for reproducing the issue.



Maxim Buryak

Sep 13, 2016, 4:55:53
to rav...@googlegroups.com
Hi,
I think we discovered why the high CPU + high storage usage is happening; you can track the resolution here: http://issues.hibernatingrhinos.com/issue/RavenDB-5272.
Meanwhile, you could work around this by setting SubscriptionCriteria.KeyStartsWith to string.Empty.
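In code, the workaround amounts to the following (a sketch; `startEtag` is a placeholder for the etag value found earlier):

```csharp
// Workaround sketch: an empty, non-null key prefix matches every document
// key while avoiding the slow code path tracked in RavenDB-5272.
var criteria = new SubscriptionCriteria
{
    KeyStartsWith = string.Empty,
    StartEtag = startEtag // the earliest etag of the collection, as before
};
```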



Best Regards,




Andrej Krivulčík

Sep 13, 2016, 4:57:32
to RavenDB - 2nd generation document database
Thanks for dealing with this issue. I'm currently trying to use the StartEtag, but the issue seems to persist (I can't definitively confirm this, as I have index recomputation going on).

Andrej Krivulčík

Sep 13, 2016, 5:23:31
to RavenDB - 2nd generation document database
The workaround of using string.Empty for KeyStartsWith worked for me; the processing starts after roughly 3 minutes of initial delay. I'm also using the StartEtag in this case. After the issue is resolved, I'll try to confirm the fix.