Dumping an ArangoDB collection with 10 million documents


Alireza David

Apr 3, 2016, 1:56:42 AM
to ArangoDB
I have 10 million documents in one collection. Because of the datafiles and indexes of that collection, RAM usage is very high (for 10 million documents ArangoDB uses about 7 GB of RAM).
I want to dump this collection into another collection, but once the query reaches about 5 million documents ArangoDB crashes and does not respond.
What is the default batch size, and do you have any suggestions?

(Sorry about my English grammar)

Wilfried Gösgens

Apr 4, 2016, 9:06:35 AM
to ArangoDB
Hi David,
it would be great if you could share your actual query with us; from the description you gave I can only offer some general implementation insights you may want to consider.

If you use an AQL query like this:
  FOR x IN collectionA INSERT x IN collectionB
it should work in chunks of 1000 documents.

However, if you sort by a non-indexed property, this may not be true:
  FOR x IN collectionA SORT x.NonIndexedProperty INSERT x IN collectionB

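To picture why the first query can run in fixed-size chunks while the SORT variant cannot, here is a small Python sketch. This is an illustration of the principle, not ArangoDB internals; all function names are made up:

```python
# Illustrative only: why FOR...INSERT can stream in chunks, while a SORT
# on a non-indexed property must materialize the whole result first.

def copy_streaming(source, insert, chunk_size=1000):
    """Copy documents chunk by chunk; at most chunk_size docs held at once."""
    chunk = []
    for doc in source:
        chunk.append(doc)
        if len(chunk) == chunk_size:
            insert(chunk)
            chunk = []
    if chunk:
        insert(chunk)

def copy_sorted(source, insert, key, chunk_size=1000):
    """Without an index, sorting forces ALL documents into memory first."""
    materialized = sorted(source, key=key)  # the expensive, RAM-hungry step
    copy_streaming(materialized, insert, chunk_size)

# Example: 2500 documents go over in chunks of 1000, 1000 and 500.
dest = []
copy_streaming(({"n": i} for i in range(2500)), dest.extend)
```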
Another way of doing this that minimizes the stress on your infrastructure is to dump and re-import using arangodump and arangorestore:
https://docs.arangodb.com/HttpBulkImports/Arangodump.html
https://docs.arangodb.com/HttpBulkImports/Arangorestore.html
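A possible invocation could look like the following; endpoint, database, collection name and paths are placeholders you would adapt to your setup:

```shell
# Dump a single collection to a local directory (adjust endpoint/credentials).
arangodump --server.endpoint tcp://127.0.0.1:8529 \
           --server.database _system \
           --collection AdsStatics \
           --output-directory dump

# Restore it, creating the target collection if it does not exist yet.
arangorestore --server.endpoint tcp://127.0.0.1:8529 \
              --server.database _system \
              --collection AdsStatics \
              --create-collection true \
              --input-directory dump
```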

Since indexes are held in memory, you may want to drop some of them temporarily to reduce the resource usage on your machine while you run such an expensive query.

For performance reasons you may also consider creating the indexes on collectionB _after_ loading it with content.

Did you actually see the process die? Or was it just unresponsive for a while? And what was your actual query?

Hope this helps,

Cheers,
Willi

Alireza David

Apr 5, 2016, 1:08:08 AM
to ArangoDB

Thanks for the response.
My actual issue is that I have a large collection that is growing very fast. The server has 16 GB of RAM and ArangoDB alone uses 9 GB of it. I want to query the documents I no longer need, back them up in another collection, and then unload that collection. First of all, is this scenario OK?
I want to automate this mechanism, so I use the HTTP endpoint
/_api/cursor

with a batch size of 100,000, but when I run this ArangoDB freezes and does not respond (the web panel does not open).
This is my fetch query:
FOR ad IN AdsStatics FILTER ad.Add_at < 1459832794879 RETURN ad
And I use the bulk import API to insert those documents:
/_api/import?collection=AdsStatics&type=list
I tried this on my local machine with fewer documents and it works like a charm, but with 10,000,000 documents it does not.

Sorry about my English :)

Wilfried Gösgens

Apr 5, 2016, 5:17:00 AM
to ArangoDB
Hi David,
no need to apologize. We're not native speakers either ;-)


If it is the same ArangoDB instance, why not do the insert in the same query?

Or do I understand correctly that you are trying to insert the documents into another ArangoDB instance, which is why you use HTTP requests to fetch the result?

The problem with the cursor is that the result has to be built in memory at request time. Since you only specify an upper bound on Add_at, there may be many more documents, right?
So you should also specify a LIMIT 1000.

And if you run

db._explain(`
  FOR ad IN AdsStatics FILTER ad.Add_at < 1459832794879
  RETURN ad
`)

does it tell you that an index is used? Otherwise it will do a full collection scan.

Another, better way of doing this is to return only the _key attribute (so the result is smaller), paired with the LIMIT. Then fetch a range of keys, delete those documents, and continue with the next chunk.
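The "_key plus LIMIT" approach can be sketched as a loop like the one below. It runs against an in-memory stand-in for the collection; a real setup would issue the AQL via /_api/cursor, and the function names here are hypothetical:

```python
# Sketch of chunked archive-and-delete using only _key values per round.
# `store` and `archive` are plain dicts standing in for the two collections.

def fetch_old_keys(store, cutoff, limit):
    """Stands in for:
    FOR ad IN AdsStatics FILTER ad.Add_at < @cutoff LIMIT @limit RETURN ad._key
    """
    return [k for k, doc in store.items() if doc["Add_at"] < cutoff][:limit]

def archive_and_delete(store, archive, cutoff, limit=1000):
    while True:
        keys = fetch_old_keys(store, cutoff, limit)  # small result: keys only
        if not keys:
            break
        for k in keys:            # copy the chunk, then remove it from source
            archive[k] = store.pop(k)

# Example: 5000 documents, 3000 of which are older than the cutoff.
store = {str(i): {"Add_at": i} for i in range(5000)}
archive = {}
archive_and_delete(store, archive, cutoff=3000)
```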

Alireza David

Apr 5, 2016, 6:42:30 AM
to ArangoDB
And what about batchSize?
I use the batch size option to limit the result.

Yes, the explain function shows that AdsStatics uses the Add_at index.

Wilfried Gösgens

Apr 5, 2016, 9:00:52 AM
to ArangoDB
Hi David,
the batchSize controls the chunk size, i.e. the number of items you fetch from an already-created cursor in one request.
However, as mentioned before, the whole result has to be prepared in RAM beforehand.

So you need to use the LIMIT statement to effectively limit the number of documents that are handled in one request.
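Putting the two together, the client can walk the collection in LIMIT-bounded pages so the server only ever materializes one page. This is a hedged sketch: `run_query` is a placeholder for however you execute AQL over /_api/cursor, and the query text is illustrative:

```python
# Page through the filter query with LIMIT offset, count so each request
# only builds `count` documents server-side. `run_query(aql) -> list[dict]`
# is assumed to execute the AQL and return the full (small) result.

def limit_pages(run_query, count=1000):
    offset = 0
    while True:
        page = run_query(
            f"FOR ad IN AdsStatics FILTER ad.Add_at < 1459832794879 "
            f"LIMIT {offset}, {count} RETURN ad")
        yield from page
        if len(page) < count:   # a short page means we reached the end
            break
        offset += count
```

Note that if documents are deleted between pages, a fixed-offset walk can skip results; the _key-based chunking from the previous message avoids that.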