Single query vs multiple queries, max aql query length

Roman

unread,

Dec 15, 2015, 2:34:43 AM12/15/15

to ArangoDB

Hi, I'm wondering what's more efficient:

1) Single query - for foo in bar filter foo.bar in [1,2,3] return foo

2) Multiple separate queries - for foo in bar filter foo.bar==1 return foo;for foo in bar filter foo.bar==2 return foo;for foo in bar filter foo.bar==3 return foo

I think 1) because it doesn't have overhead from running multiple commands. But what if number of values I need to search in db is high, for example 10000 or more. Or values are strings or more complex json objects. Is there any limit for length of aql query? Btw. we run all queries from node.js against local database.

Thanks

Roman

Wilfried Gösgens

unread,

Dec 15, 2015, 4:19:29 AM12/15/15

to ArangoDB

Hi,
We recently had a simmilar discussion about IN on Stackoverflow:

http://stackoverflow.com/questions/34200335/arangodb-in-operator-very-slow

TL;DR: definitely go for the IN operator.
Performance wise there will be a fix for this in ArangoDB 2.8

Cheers,
Willi

Jan Steemann

unread,

Dec 15, 2015, 5:01:35 AM12/15/15

to aran...@googlegroups.com

Hi,

if the right-hand side operand of the IN-operator is a hard-coded list or a bind parameter value, then the `IN` will be fast even in 2.7.
For each value in the IN list there will be a single index lookup.

With 2.8 the right-hand side operand of the IN-operator will additionally be pre-sorted if it is a variable that is defined by an arbitrary expression in an outer loop, e.g.

LET values = some expression

FOR foo IN bar

FILTER foo.bar IN values

In this case 2.7 will do a linear search on the IN values in the inner loop, which is slow when the IN list is very big. 2.8 will pre-sort the IN values and do a binary search for this.

Note that this still does not affect hard-coded IN lists of IN lists that are populated using bind parameters. These will be fast anyway.

Regarding performance when a client program is involved:

Let's say there is an IN list with 1,000 values. If you would issue 1,000 individual requests using the equality operator (==) then you would need to issue 1,000 HTTP requests, which will mean high network overhead. If you would put the 1,000 values in a single IN list, then you will have only a single HTTP request, though the one query will take longer than each of the single-value ones, and will produce a much bigger result.

That does not mean you should use an IN list with any number of values. For example, it won't make sense to use an IN list with several million values. Then compiling the one query, building the query result and fetching it from the server will also take very long. As a rule of thumb it may be most sensible to use IN lists but to not let them grow without bounds. I'd recommend starting with IN lists of 1,000 values or so (adjust as needed) and try what the response times and sizes are (also much depends on document sizes).

The length of an AQL query string is not artificially limited by ArangoDB, but it will definitely fail if the query string exceeds 2^31 bytes in length. But before that you're likely to run out of memory anyway, so please don't use *that* long query strings.

Best regards

Jan

--
You received this message because you are subscribed to the Google Groups "ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Roman

unread,

Dec 15, 2015, 6:49:32 AM12/15/15

to ArangoDB

Thanks Jan.

And what about NOT IN operator performance? Does it always use full scan or it can use index/bin search? In 2.7. query using NOT IN are really slow.

Jan Steemann

unread,

Dec 15, 2015, 6:58:23 AM12/15/15

to aran...@googlegroups.com

2.8 can do a pre-sort on the right-hand side operand of a NOT IN as well as an IN operator.

The reason why NOT IN is often slow is that if you apply it in a loop over a collection, the query won't use an index:

FOR doc IN collection

FILTER doc.value NOT IN [ 3, 92, 54982, 32 ]

RETURN doc

If you run `require("org/arangodb/aql/explainer").explain(queryString);` on the above, you will see that it will do a full collection scan.

In general, the NOT IN operator won't be a good candidate for index usage, because what you're asking for is to scan the entire collection and return all documents but a few. Indexes are good when the goal is to return only a very small fraction of the documents, but NOT IN is the opposite in many cases.

Best regards

Jan

Roman

unread,

Dec 15, 2015, 7:30:49 AM12/15/15

to ArangoDB

OK, I thought it would be like this. One more question, I noticed difference between filter with && and multiple filter statements.

This runs in my query for 300s:

filter length(EDGES(eDevIntL2, intL2, "inbound"))==0 && intL2.neiIP not in intL3IpDone

This runs 5s

filter length(EDGES(eDevIntL2, intL2, "inbound"))==0

filter intL2.neiIP not in intL3IpDone

So it seems that first query evaluates second condition even if first is not true

Jan Steemann

unread,

Dec 15, 2015, 8:11:10 AM12/15/15

to aran...@googlegroups.com

For the latter query (the one using two independent filters), the optimizer may have more choices.

For example, if the query looks like this:

for intL2 in bar

for intL3IpDone in foo

/* FILTERS GO HERE ORIGINALLY */

return intL2

then for a single &&-combined filter, the filter must stay where the comment is placed above, i.e.

for intL2 in bar

for intL3IpDone in foo

filter length(EDGES(eDevIntL2, intL2, "inbound"))==0 && intL2.neiIP not in intL3IpDone

return intL2

If instead there are two filter conditions, the optimizer may pull one filter out of the inner loop into the outer loop, e.g.

for intL2 in bar

filter length(EDGES(eDevIntL2, intL2, "inbound"))==0

for intL3IpDone in foo

filter intL2.neiIP not in intL3IpDone

return intL2

I am not sure if it will be exactly like that, because it depends on the other constructs used in the query and not only the filters.

In general, if you have two different queries with vastly differing performance, explaining them via the `explain` command or the web interface may provide some clues about performance differences. For example, explaining will show where filters will be placed in the final query and if/which indexes are going to be used.