Simple query, pretty slow

117 views
Skip to first unread message

Andreas Jung

unread,
Jun 14, 2019, 12:45:22 AM6/14/19
to ArangoDB
Hi there,

this query 

 for doc in import 
   filter doc._type == 'Image'
   return {path: doc._path, key: doc._key}

takes about 45 seconds on decent hardware with an import collection of about 100.000 items with about 21.000 of _type = 'Image'.
There is an index of _type. Using PyArango as client...I really wander why this query is running so slow?!

Running ArangoDB 3.4.3

Profile

Query String:
 for doc in import 
 filter doc._type == 'Image'
 return {path: doc._path, key: doc._key}

Execution plan:
 Id   NodeType          Calls   Items   Runtime [s]   Comment
  1   SingletonNode         1       1       0.00000   * ROOT
  7   IndexNode            21   20617      32.73956     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
  5   CalculationNode      21   20617       0.04354       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
  6   ReturnNode           21   20617       0.00016       - RETURN #3

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
  7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   remove-unnecessary-calculations-2
  8   reduce-extraction-to-projection

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0        20617          0        32.78928

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00001
 parsing                    0.00010
 optimizing ast             0.00001
 loading collections        0.00002
 instantiating plan         0.00005
 optimizing plan            0.00032
 executing                 32.78841
 finalizing                 0.00032

James Courtier-Dutton

unread,
Jun 14, 2019, 3:36:17 AM6/14/19
to aran...@googlegroups.com
Hi,

What is the average size of the returned data? It could just be the time it takes to serialise the data being returned

James

--
You received this message because you are subscribed to the Google Groups "ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andreas Jung

unread,
Jun 14, 2019, 3:41:24 AM6/14/19
to ArangoDB
_key is a UUID4
_path is standard filesystem path not longer than 100 chars each

That can not be the problem.
To unsubscribe from this group and stop receiving emails from it, send an email to aran...@googlegroups.com.

James Courtier-Dutton

unread,
Jun 14, 2019, 3:48:46 AM6/14/19
to aran...@googlegroups.com
Hi,

I am not an arango dev, so i don't know why it is so slow. I do know that if you add a 'view' and use the 'search' instead of 'filter', it will be fast.

Jamez

To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/arangodb/e03caccb-2753-47b4-b52a-25f8496a6469%40googlegroups.com.

Wilfried Gösgens

unread,
Jun 14, 2019, 3:50:41 AM6/14/19
to ArangoDB
Hi,
afair you're using rocksdb?

can you try to re-create that index to be on `_type`, `_path`, `_key` for better using of projections?

Please note that you shouldn't use fieldnames starting with `_` since they're defined as system specific fields in arangodb.

Cheers,
Willi

Andreas Jung

unread,
Jun 14, 2019, 3:54:10 AM6/14/19
to ArangoDB
Using RocksDB (default installation).

I create a new collection for every import of the data including the indexes.

Unfortunately I don't have the key names in my hands. They are coming
from a JSON dump of a CMS.

Andreas Jung

unread,
Jun 14, 2019, 5:22:51 AM6/14/19
to ArangoDB
Recreating the indexes after import does not make a difference.

Returning doc._path  for 20.000 items takes 50 ms, returning doc._path takes minutes

The _path index is deduplicated, the _type index is not 

The only difference in the execution plans is "index only" when "RETURN doc._type". Since both _type and _path
are fully indexed I would assume that the query is executed in both times based on index data.

So ArangoDB will load all 100.000 objects for picking up the value of _path? The overall data is meanwhile 55 GB
(about one third of the data is binary data (files and images base64 encoded).  

This is all no big problem for me since we perform such queries once before a migration run and it does matter taking
a migration running for some hours a minutes more or less but I want to understand what is going on here (in particular
this is unexpected behavior).


Query String:
 for doc in import 
 filter doc._type == 'Image'
 return doc._type

Execution plan:
 Id   NodeType          Est.   Comment
  1   SingletonNode        1   * ROOT
  7   IndexNode         2214     - FOR doc IN import   /* hash index scan, index only, projections: `_type` */
  5   CalculationNode   2214       - LET #3 = doc.`_type`   /* attribute expression */   /* collections used: doc : import */
  6   ReturnNode        2214       - RETURN #3

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
  7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   remove-unnecessary-calculations-2
  8   reduce-extraction-to-projection
Query String:
 for doc in import 
 filter doc._type == 'Image'
 return doc._path

Execution plan:
 Id   NodeType          Est.   Comment
  1   SingletonNode        1   * ROOT
  7   IndexNode         2214     - FOR doc IN import   /* hash index scan, projections: `_path` */
  5   CalculationNode   2214       - LET #3 = doc.`_path`   /* attribute expression */   /* collections used: doc : import */
  6   ReturnNode        2214       - RETURN #3

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
  7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   remove-unnecessary-calculations-2
  8   reduce-extraction-to-projection

Wilfried Gösgens

unread,
Jun 14, 2019, 7:03:59 AM6/14/19
to ArangoDB

Hi,
Can you share a set of sample documents? How well is the distribution on `_type` ? Which samples are there?

Andreas Jung

unread,
Jun 14, 2019, 7:10:44 AM6/14/19
to ArangoDB
All _path value are unique, we have about 20 different values for _type.
I am not sure if I can break down the dataset into something smaller.
The data is in general sensitive and not easy to share or anonymize.

Jan Stücke

unread,
Jun 14, 2019, 9:00:47 AM6/14/19
to aran...@googlegroups.com
Hi,

if I read your situation correctly, then you store the meta data and the binary data in one document, correct?

In that case, ArangoDB might be accessing the whole document which means the whole binary data has to be searched through as well. ArangoDB is not optimized for large blob storage.
If you use cases allows this, you could separate metadata and binary data into two collections. Theryb, you can have fast queries on meta data and only lookup the relevant binary data if necessary. Best solution is to use a dedicated file system for the blob storage.

Hope that helps

To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/arangodb/de155599-d9d8-4b9a-b436-6c1e25a435f9%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


--

Jan Stücke
Head of Communications

Wilfried Gösgens

unread,
Jun 14, 2019, 9:01:30 AM6/14/19
to ArangoDB
May I get back to my sugestion once more?

Could you instead of the index on `_type` create a combined index ot `_type`, `_path` and `_key` ?
This should copy these fields into the index, so ArangoDB doesn't have to fetch the (big) documents.
I gues fetching and decompressing them is huge.

Another suggestion would be to put the payload (You've got base64 encoded binary data, right?) into a separate collection, parted of the structural information.

Cheers,
Willi

Andreas Jung

unread,
Jun 17, 2019, 10:25:51 AM6/17/19
to ArangoDB
A compound index on _type+_path or _type +_path + _key does not improve things.
The query time is still in the range of 120 to 150 seconds.

Andreas

Andreas Jung

unread,
Jun 17, 2019, 10:29:56 AM6/17/19
to ArangoDB
Profile:

Query String:
 for doc in import
    filter doc._type == 'Image'
    
    return {path: doc._path, key: doc._key}

Execution plan:
 Id   NodeType          Calls   Items   Runtime [s]   Comment
  1   SingletonNode         1       1       0.00000   * ROOT
  7   IndexNode            21   20617      84.34365     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
  5   CalculationNode      21   20617       0.05436       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
  6   ReturnNode           21   20617       0.00017       - RETURN #3

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
  7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   remove-unnecessary-calculations-2
  8   reduce-extraction-to-projection

Query Statistics:
 Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
           0            0           0        20617          0        84.40501

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00020
 optimizing ast             0.00001
 loading collections        0.00001
 instantiating plan         0.00005
 optimizing plan            0.00021
 executing                 84.40415
 finalizing                 0.00033

Wilfried Gösgens

unread,
Jun 17, 2019, 12:58:36 PM6/17/19
to ArangoDB
Hi,

can you get some figures of Avg/Min/Max document sizes?

Andreas Jung

unread,
Jun 17, 2019, 1:14:49 PM6/17/19
to ArangoDB
min= 425
max= 103973102
avg= 604706

The avg size is that high because about 1/4 of the 100.000 documents contains binary content.

Andreas Jung

unread,
Jun 17, 2019, 1:50:45 PM6/17/19
to ArangoDB
<1000 bytes: 41.000 items
<10.000 bytes: 21.000 items
< 100.000 bytes: 22.00  items
< 1 MB: 9000 items
< 10 MB: 1000 items
> 10 MB: 20 items 
Reply all
Reply to author
Forward
0 new messages