arangoimp performance issues / how can we optimize arangoimp?

Akshay Surve

Jun 5, 2019, 5:31:13 AM
to ArangoDB
Hi,

We were evaluating ArangoDB for our use case, and it looked very promising until we hit some blockers. I wanted to share them with the community and see if we could change anything about our approach.

System:
- Running Arangodb 3.4.5 as a Docker instance
- Using rocksdb
- OSX, 16GB RAM

Usecase:
~775K user GUIDs that we wanted to bulk import. Some sample values are shown below.

We stumbled upon 2 blockers:

1. arangoimp wasn't able to process a large JSON file and would get stuck. (The JSON file was identical to the jsonl file shown below, except that it contained a single array of JSON objects on one line.)

$ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX --file "dxids.json" --type json --collection events --progress true --threads 4 --on-duplicate ignore

Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.4.5, database: '_system', username: 'root'
----------------------------------------
database:               _system
collection:             events
create:                 no
create database:        no
source filename:        dxids.json
file type:              json
threads:                4
connect timeout:        5
request timeout:        1200
----------------------------------------
Starting JSON import...

2019-06-05T09:05:40Z [321] ERROR error message(s):
2019-06-05T09:05:40Z [321] ERROR import file is too big. please increase the value of --batch-size (currently 1048576)


We kept getting this error telling us to increase the batch size. As we kept increasing --batch-size, the import would start to process, but it would eventually get stuck at 99% and never finish, even after we left it running for 2-3 hours.
E.g.:

2019-06-05T09:06:36Z [375] INFO processed 34634719 bytes (93%) of input file
2019-06-05T09:06:36Z [375] INFO processed 35748797 bytes (96%) of input file
2019-06-05T09:06:36Z [375] INFO processed 36895642 bytes (99%) of input file
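(For reference, a rerun with a larger batch size looks like the following; --batch-size is the flag named in the error message, and 16777216 bytes is just an illustrative value:)

$ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX --file "dxids.json" --type json --collection events --progress true --threads 4 --on-duplicate ignore --batch-size 16777216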



2. We changed the file to a jsonl representation. This time it at least processes, but it takes close to 50-70 minutes to finish.
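(A conversion along these lines produces one JSON object per line; this is a sketch that assumes jq is available and that the original file holds a single top-level array:)

$ jq -c '.[]' dxids.json > dxids-cleaned.json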

Here are some stats on our data:

$ wc -l dxids-cleaned.json
  775783 dxids-cleaned.json

$ head dxids-cleaned.json
{"_key":"ca7c1b92-962f-482b-8be1-d3888686aee9"}
{"_key":"a54432a0-15c8-46d2-8f67-21c928c385cf"}
{"_key":"c6aa3a49-0d56-4c31-b0f5-32ca88725fff"}
{"_key":"19a207fc-7fcb-4dee-9789-146d5fc7ed0a"}
{"_key":"08e9b852-c4fd-4ff1-83bb-9aaf6e7f837f"}
{"_key":"d6e88e54-cf1f-4566-9ffd-e43aeb3b6767"}
{"_key":"717a99d2-1985-4af1-ab09-4210324c1c83"}
{"_key":"a6377fc2-11bc-4d3c-9c54-ae4f12e7b439"}
{"_key":"a6249b90-a055-4f36-94c7-b16765c8d654"}
{"_key":"2261b38b-a75e-4e6d-b50e-9715a52c6e33"}



Here is how we are initiating the import:

$ docker exec -i f295a1638892 arangoimp --server.password XXXXX --file "dxids-cleaned.json" --type jsonl --collection events --progress true --threads 4 --on-duplicate ignore


Finally, I wanted to understand:

1. Can we tweak our approach?
2. Is the 40-60 minutes it takes to process within the expected range? We bulk ingested the same data into Neo4j and it took a few minutes. I'm simply curious, as we are doing this evaluation for an internal use case.

Best,

Dave Challis

Jun 5, 2019, 6:08:59 AM
to ArangoDB
I had a quick look (since we're also evaluating ArangoDB at the moment).

It looks like your import is slow due to the large number of duplicate IDs in your dataset (there are only 50000 unique IDs among the 775783 lines in the file).
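(A quick way to check this, assuming standard Unix tools:)

$ sort -u dxids-cleaned.json | wc -l   # should print 50000, matching the count above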

Filtering out duplicates before importing would definitely help; otherwise, sorting the input also helps. I ran this locally with 2 threads:

$ sort dxids-cleaned.json > sorted-dxids-cleaned.json
$ time arangoimp --file sorted-dxids-cleaned.json --type jsonl --collection dxid --progress true --server.authentication false --create-collection true --overwrite true --on-duplicate ignore

created:          50000
warnings/errors:  394
updated/replaced: 0
ignored:          725389
real    1m 52.74s
user    0m 0.12s
sys    0m 0.19s
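If you'd rather drop the duplicates entirely instead of relying on --on-duplicate ignore, a pre-pass along these lines (a sketch, assuming GNU sort) shrinks the input to just the unique keys, so the server never has to reject the ~725K duplicate rows:

$ sort -u dxids-cleaned.json > unique-dxids.json
$ arangoimp --file unique-dxids.json --type jsonl --collection dxid --progress true --server.authentication false --create-collection true --on-duplicate ignore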

Akshay Surve

Jun 6, 2019, 3:35:36 AM
to ArangoDB
Hi Dave, 

Thanks for looking into this. You are right - sorting helps get this processed much faster! 

It would be nice to understand the internal trade-offs behind this (or to be pointed to the relevant docs); please let me know if you can share some insights.