Simple query, pretty slow

Andreas Jung

unread,

Jun 14, 2019, 12:45:22 AM6/14/19

to ArangoDB

Hi there,

this query

for doc in import

filter doc._type == 'Image'

return {path: doc._path, key: doc._key}

takes about 45 seconds on decent hardware with an import collection of about 100.000 items with about 21.000 of _type = 'Image'.

There is an index of _type. Using PyArango as client...I really wander why this query is running so slow?!

Running ArangoDB 3.4.3

Profile

Query String:

for doc in import

filter doc._type == 'Image'

return {path: doc._path, key: doc._key}

Execution plan:

Id NodeType Calls Items Runtime [s] Comment

1 SingletonNode 1 1 0.00000 * ROOT

7 IndexNode 21 20617 32.73956 - FOR doc IN import /* hash index scan, projections: `_key`, `_path` */

5 CalculationNode 21 20617 0.04354 - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` } /* simple expression */ /* collections used: doc : import */

6 ReturnNode 21 20617 0.00016 - RETURN #3

Indexes used:

By Type Collection Unique Sparse Selectivity Fields Ranges

7 hash import false false 0.05 % [ `_type` ] (doc.`_type` == "Image")

Optimization rules applied:

Id RuleName

1 move-calculations-up

2 move-filters-up

3 move-calculations-up-2

4 move-filters-up-2

5 use-indexes

6 remove-filter-covered-by-index

7 remove-unnecessary-calculations-2

8 reduce-extraction-to-projection

Query Statistics:

Writes Exec Writes Ign Scan Full Scan Index Filtered Exec Time [s]

0 0 0 20617 0 32.78928

Query Profile:

Query Stage Duration [s]

initializing 0.00001

parsing 0.00010

optimizing ast 0.00001

loading collections 0.00002

instantiating plan 0.00005

optimizing plan 0.00032

executing 32.78841

finalizing 0.00032

James Courtier-Dutton

unread,

Jun 14, 2019, 3:36:17 AM6/14/19

to aran...@googlegroups.com

Hi,

What is the average size of the returned data? It could just be the time it takes to serialise the data being returned

James

--
You received this message because you are subscribed to the Google Groups "ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andreas Jung

unread,

Jun 14, 2019, 3:41:24 AM6/14/19

to ArangoDB

_key is a UUID4

_path is standard filesystem path not longer than 100 chars each

That can not be the problem.

To unsubscribe from this group and stop receiving emails from it, send an email to aran...@googlegroups.com.

James Courtier-Dutton

unread,

Jun 14, 2019, 3:48:46 AM6/14/19

to aran...@googlegroups.com

Hi,

I am not an arango dev, so i don't know why it is so slow. I do know that if you add a 'view' and use the 'search' instead of 'filter', it will be fast.

Jamez

To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/arangodb/e03caccb-2753-47b4-b52a-25f8496a6469%40googlegroups.com.

Wilfried Gösgens

unread,

Jun 14, 2019, 3:50:41 AM6/14/19

to ArangoDB

Hi,

afair you're using rocksdb?

can you try to re-create that index to be on `_type`, `_path`, `_key` for better using of projections?

Please note that you shouldn't use fieldnames starting with `_` since they're defined as system specific fields in arangodb.

Cheers,

Willi

Andreas Jung

unread,

Jun 14, 2019, 3:54:10 AM6/14/19

to ArangoDB

Using RocksDB (default installation).

I create a new collection for every import of the data including the indexes.

Unfortunately I don't have the key names in my hands. They are coming

from a JSON dump of a CMS.

Andreas Jung

unread,

Jun 14, 2019, 5:22:51 AM6/14/19

to ArangoDB

Recreating the indexes after import does not make a difference.

Returning doc._path for 20.000 items takes 50 ms, returning doc._path takes minutes

The _path index is deduplicated, the _type index is not

The only difference in the execution plans is "index only" when "RETURN doc._type". Since both _type and _path

are fully indexed I would assume that the query is executed in both times based on index data.

So ArangoDB will load all 100.000 objects for picking up the value of _path? The overall data is meanwhile 55 GB

(about one third of the data is binary data (files and images base64 encoded).

This is all no big problem for me since we perform such queries once before a migration run and it does matter taking

a migration running for some hours a minutes more or less but I want to understand what is going on here (in particular

this is unexpected behavior).

Query String:

for doc in import

filter doc._type == 'Image'

return doc._type

Execution plan:

Id NodeType Est. Comment

1 SingletonNode 1 * ROOT

7 IndexNode 2214 - FOR doc IN import /* hash index scan, index only, projections: `_type` */

5 CalculationNode 2214 - LET #3 = doc.`_type` /* attribute expression */ /* collections used: doc : import */

6 ReturnNode 2214 - RETURN #3

Indexes used:

By Type Collection Unique Sparse Selectivity Fields Ranges

7 hash import false false 0.05 % [ `_type` ] (doc.`_type` == "Image")

Optimization rules applied:

Id RuleName

1 move-calculations-up

2 move-filters-up

3 move-calculations-up-2

4 move-filters-up-2

5 use-indexes

6 remove-filter-covered-by-index

7 remove-unnecessary-calculations-2

8 reduce-extraction-to-projection

Query String:

for doc in import

filter doc._type == 'Image'

return doc._path

Execution plan:

Id NodeType Est. Comment

1 SingletonNode 1 * ROOT

7 IndexNode 2214 - FOR doc IN import /* hash index scan, projections: `_path` */

5 CalculationNode 2214 - LET #3 = doc.`_path` /* attribute expression */ /* collections used: doc : import */

6 ReturnNode 2214 - RETURN #3

Indexes used:

By Type Collection Unique Sparse Selectivity Fields Ranges

7 hash import false false 0.05 % [ `_type` ] (doc.`_type` == "Image")

Optimization rules applied:

Id RuleName

1 move-calculations-up

2 move-filters-up

3 move-calculations-up-2

4 move-filters-up-2

5 use-indexes

6 remove-filter-covered-by-index

7 remove-unnecessary-calculations-2

8 reduce-extraction-to-projection

Wilfried Gösgens

unread,

Jun 14, 2019, 7:03:59 AM6/14/19

to ArangoDB

Hi,

Can you share a set of sample documents? How well is the distribution on `_type` ? Which samples are there?

Andreas Jung

unread,

Jun 14, 2019, 7:10:44 AM6/14/19

to ArangoDB

All _path value are unique, we have about 20 different values for _type.

I am not sure if I can break down the dataset into something smaller.

The data is in general sensitive and not easy to share or anonymize.

Jan Stücke

unread,

Jun 14, 2019, 9:00:47 AM6/14/19

to aran...@googlegroups.com

Hi,

if I read your situation correctly, then you store the meta data and the binary data in one document, correct?

In that case, ArangoDB might be accessing the whole document which means the whole binary data has to be searched through as well. ArangoDB is not optimized for large blob storage.

If you use cases allows this, you could separate metadata and binary data into two collections. Theryb, you can have fast queries on meta data and only lookup the relevant binary data if necessary. Best solution is to use a dedicated file system for the blob storage.

Hope that helps

To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/arangodb/de155599-d9d8-4b9a-b436-6c1e25a435f9%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--

Jan Stücke

Head of Communications

jan.s...@arangodb.com | +49 (0)221 / 2722999-60

*Help us grow the multi-model vision with your review on Gartner Peer Reviews.

Wilfried Gösgens

unread,

Jun 14, 2019, 9:01:30 AM6/14/19

to ArangoDB

May I get back to my sugestion once more?

Could you instead of the index on `_type` create a combined index ot `_type`, `_path` and `_key` ?

This should copy these fields into the index, so ArangoDB doesn't have to fetch the (big) documents.

I gues fetching and decompressing them is huge.

Another suggestion would be to put the payload (You've got base64 encoded binary data, right?) into a separate collection, parted of the structural information.

Cheers,

Willi

Andreas Jung

unread,

Jun 17, 2019, 10:25:51 AM6/17/19

to ArangoDB

A compound index on _type+_path or _type +_path + _key does not improve things.

The query time is still in the range of 120 to 150 seconds.

Andreas

Andreas Jung

unread,

Jun 17, 2019, 10:29:56 AM6/17/19

to ArangoDB

Profile:

Query String:

for doc in import

filter doc._type == 'Image'

return {path: doc._path, key: doc._key}

Execution plan:

Id NodeType Calls Items Runtime [s] Comment

1 SingletonNode 1 1 0.00000 * ROOT

7 IndexNode 21 20617 84.34365 - FOR doc IN import /* hash index scan, projections: `_key`, `_path` */

5 CalculationNode 21 20617 0.05436 - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` } /* simple expression */ /* collections used: doc : import */

6 ReturnNode 21 20617 0.00017 - RETURN #3

Indexes used:

By Type Collection Unique Sparse Selectivity Fields Ranges

7 hash import false false 0.05 % [ `_type` ] (doc.`_type` == "Image")

Optimization rules applied:

Id RuleName

1 move-calculations-up

2 move-filters-up

3 move-calculations-up-2

4 move-filters-up-2

5 use-indexes

6 remove-filter-covered-by-index

7 remove-unnecessary-calculations-2

8 reduce-extraction-to-projection

Query Statistics:

Writes Exec Writes Ign Scan Full Scan Index Filtered Exec Time [s]

0 0 0 20617 0 84.40501

Query Profile:

Query Stage Duration [s]

initializing 0.00000

parsing 0.00020

optimizing ast 0.00001

loading collections 0.00001

instantiating plan 0.00005

optimizing plan 0.00021

executing 84.40415

finalizing 0.00033

Wilfried Gösgens

unread,

Jun 17, 2019, 12:58:36 PM6/17/19

to ArangoDB

Hi,

can you get some figures of Avg/Min/Max document sizes?

Andreas Jung

unread,

Jun 17, 2019, 1:14:49 PM6/17/19

to ArangoDB

min= 425

max= 103973102

avg= 604706

The avg size is that high because about 1/4 of the 100.000 documents contains binary content.

Andreas Jung

unread,

Jun 17, 2019, 1:50:45 PM6/17/19

to ArangoDB

<1000 bytes: 41.000 items

<10.000 bytes: 21.000 items

< 100.000 bytes: 22.00 items

< 1 MB: 9000 items

< 10 MB: 1000 items

> 10 MB: 20 items

Reply all

Reply to author

Forward