Faceted Search Performance


Roman Kuzmik

Sep 14, 2017, 9:53:20 AM
to ArangoDB
We are evaluating ArangoDB performance in the space of facet calculations.
There are a number of other products capable of doing the same, either via a special API or a query language.
We understand there is no special API in ArangoDB to calculate facets explicitly.
But in reality it is not needed: thanks to the comprehensive AQL, it can easily be achieved via a simple query, like:

 FOR a in Asset 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g) }

This query calculates a facet on attribute1 and yields the frequencies in the form of:

[
  {
    "value": "test-attr1-1",
    "count": 2000000
  },
  {
    "value": "test-attr1-2",
    "count": 2000000
  },
  {
    "value": "test-attr1-3",
    "count": 3000000
  }
]


It says that across my entire collection, attribute1 takes three values (test-attr1-1, test-attr1-2 and test-attr1-3), with the related counts provided.
Pretty much we run a DISTINCT query with aggregated counts.

Looks simple and clean, with only one, but really big, issue: performance.

The query above runs for 31 seconds (!) on a test collection with only 8M documents.
We have experimented with different index types and storage engines (with RocksDB and without), and investigated execution plans, to no avail.
The test documents we use are very concise, with only three short attributes.

We would appreciate any input at this point.
Either we are doing something wrong, or ArangoDB is simply not designed to perform well in this particular area.

Btw, the ultimate goal would be to run something like the following in sub-second time:

LET docs = (FOR a IN Asset
  FILTER a.name LIKE 'test-asset-%'
  SORT a.name
  RETURN a)
LET attribute1 = (
  FOR a IN docs
    COLLECT attr = a.attribute1 INTO g
  RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute2 = (
  FOR a IN docs
    COLLECT attr = a.attribute2 INTO g
  RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute3 = (
  FOR a IN docs
    COLLECT attr = a.attribute3 INTO g
  RETURN { value: attr, count: LENGTH(g[*]) }
)
LET attribute4 = (
  FOR a IN docs
    COLLECT attr = a.attribute4 INTO g
  RETURN { value: attr, count: LENGTH(g[*]) }
)
RETURN {
  counts: (RETURN {
    total: LENGTH(docs),
    offset: 2,
    to: 4,
    facets: {
      attribute1: { from: 0, to: 5,    total: LENGTH(attribute1) },
      attribute2: { from: 5, to: 10,   total: LENGTH(attribute2) },
      attribute3: { from: 0, to: 1000, total: LENGTH(attribute3) },
      attribute4: { from: 0, to: 1000, total: LENGTH(attribute4) }
    }
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN { id: a._id, name: a.name }),
  facets: {
    attribute1: (FOR a IN attribute1 SORT a.count LIMIT 0, 5 RETURN a),
    attribute2: (FOR a IN attribute2 SORT a.value LIMIT 5, 10 RETURN a),
    attribute3: (FOR a IN attribute3 LIMIT 0, 1000 RETURN a),
    attribute4: (FOR a IN attribute4 SORT a.count, a.value LIMIT 0, 1000 RETURN a)
  }
}


Thanks!



Jan

Sep 14, 2017, 11:17:25 AM
to ArangoDB
Hi,

one of the things to do for improving the query performance is to get rid of the "INTO" clause, as "INTO" will copy all documents found per group into a new variable "g":


 FOR a in Asset 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g) }

The query without "INTO" would look like this:

 FOR a in Asset 
  COLLECT value = a.attribute1 WITH COUNT INTO length
 RETURN { value, length }

It should be faster than the one with "INTO", but I am not sure how much.
This probably depends on the actual data.

Can you give it a try?
Thanks
Jan

Roman Kuzmik

Sep 14, 2017, 2:38:31 PM
to ArangoDB
Thanks Jan for your reply!

But yes, we have tried the "2.x old school" WITH COUNT approach, as well as the brand-new DISTINCT.
Both yield similarly sluggish results :-/

Jan

Sep 14, 2017, 5:37:39 PM
to ArangoDB
Hi,

I tried it myself on my local laptop, and here are the results:

Original query:

   FOR a IN Asset COLLECT attr = a.attribute1 INTO g RETURN { value: attr, count: length(g) }

This executes in about 35 seconds with the 8M documents. The execution plan is not ideal, because it will sort the entire collection first:

Execution plan:
 Id   NodeType                     Est.   Comment
  1   SingletonNode                   1   * ROOT
  2   EnumerateCollectionNode   8000000     - FOR a IN p   /* full collection scan */
  3   CalculationNode           8000000       - LET #3 = a.`attribute1`   /* attribute expression */   /* collections used: a : p */
  7   SortNode                  8000000       - SORT #3 ASC
  4   CollectNode               6400000       - COLLECT attr = #3 INTO g   /* sorted */
  5   CalculationNode           6400000       - LET #5 = { "value" : attr, "count" : LENGTH(g) }   /* simple expression */
  6   ReturnNode                6400000       - RETURN #5


Adjusted query:

    FOR a IN Asset COLLECT value = a.attribute1 WITH COUNT INTO length RETURN { value, length }

Changing the query as I suggested makes it finish in 6.x seconds. The execution plan is better already:

Execution plan:
 Id   NodeType                     Est.   Comment
  1   SingletonNode                   1   * ROOT
  2   EnumerateCollectionNode   8000000     - FOR a IN p   /* full collection scan */
  3   CalculationNode           8000000       - LET #3 = a.`attribute1`   /* attribute expression */   /* collections used: a : p */
  4   CollectNode               6400000       - COLLECT value = #3 WITH COUNT INTO length   /* hash */
  7   SortNode                  6400000       - SORT value ASC
  5   CalculationNode           6400000       - LET #5 = { "value" : value, "length" : length }   /* simple expression */
  6   ReturnNode                6400000       - RETURN #5

With a sorted (skiplist) index on "attribute1", the execution time goes down to around 5.3 seconds, and the plan changes to:

Execution plan:
 Id   NodeType             Est.   Comment
  1   SingletonNode           1   * ROOT
 10   IndexNode         8000000     - FOR a IN p   /* skiplist index scan */
  3   CalculationNode   8000000       - LET #3 = a.`attribute1`   /* attribute expression */   /* collections used: a : p */
  4   CollectNode       6400000       - COLLECT value = #3 WITH COUNT INTO length   /* sorted */
  7   CalculationNode   6400000       - LET #7 = { "value" : value, "length" : length }   /* simple expression */
  8   ReturnNode        6400000       - RETURN #7

Still not ideal, but already better than the initial 30+ seconds.
That's all that comes to my mind right now that can easily be done for this particular query.
Maybe there is something more that can be optimized here, but this may require changes to the code.
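(For reference, the thread does not show how the skiplist index was created; arangosh, the web UI, or any driver works. Below is a rough sketch using the ArangoDB Java driver, where the driver version (4.x era), database name and credentials are assumptions, not details from this thread:)

import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.model.SkiplistIndexOptions;
import java.util.Arrays;

public class CreateFacetIndex {
    public static void main(String[] args) {
        // connection details and database name are placeholders
        ArangoDB arango = new ArangoDB.Builder().build();
        ArangoDatabase db = arango.db("_system");

        // non-unique skiplist index on the attribute used in the COLLECT
        db.collection("Asset").ensureSkiplistIndex(
                Arrays.asList("attribute1"), new SkiplistIndexOptions());

        arango.shutdown();
    }
}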

Best regards
Jan

Roman Kuzmik

Sep 14, 2017, 11:51:32 PM
to ArangoDB
Btw, FYI: the first query (the LENGTH(g) one) with an index on attribute1 runs almost the same as the second query (the WITH COUNT one).
Here, the 2nd query with an index takes 4.4 seconds.
But it is still just one facet. Usually you need a bunch, like in my "long" query in the very first post.
Let me rewrite it using WITH COUNT and create an index on each facet... we will see how long it takes. Currently, as-is, it takes 117 seconds.

Thanks for looking into this issue.

Jan

Sep 15, 2017, 10:43:21 AM
to ArangoDB
Hi,

I have profiled the execution of a single-facet query with the MMFiles engine this morning and I think we will be able to reduce the execution time quite a bit.
In my local tests, I have seen improvements of around 40% when there is a full collection scan (no indexes) and a collect/count on a single attribute.
Here is the PR with the changes for reference: https://github.com/arangodb/arangodb/pull/3265

However, even with those changes applied (currently they are contained in the PR only) the combined execution time for multiple facet queries will not get below one second.
But at least it would be a step in the right direction, and other queries in MMFiles should also benefit from this.

Best regards
Jan

Roman Kuzmik

Sep 18, 2017, 3:27:04 PM
to ArangoDB
Compiled your changes from feature/mmfiles-hash-lookup-performance.
Indeed, on a single facet we are down to 4 seconds from 6 seconds (in the test case provided above), and no indexes needed.

Hope it will make it to master soon.

Roman Kuzmik

Sep 18, 2017, 4:50:20 PM
to ArangoDB
4 seconds per facet, so adding 3 more takes us to 16 seconds.
Btw, why is that? ArangoDB is doing a full scan anyway; is it doing it 4 times with the query below? Is there any way to make it smarter?

LET docs = (FOR a IN Asset
  RETURN a)
LET attribute1 = (
  FOR a IN docs
    COLLECT attr = a.attribute1 WITH COUNT INTO length
  RETURN { value: attr, count: length }
)
LET attribute2 = (
  FOR a IN docs
    COLLECT attr = a.attribute2 WITH COUNT INTO length
  RETURN { value: attr, count: length }
)
LET attribute3 = (
  FOR a IN docs
    COLLECT attr = a.attribute3 WITH COUNT INTO length
  RETURN { value: attr, count: length }
)
LET attribute4 = (
  FOR a IN docs
    COLLECT attr = a.attribute4 WITH COUNT INTO length
  RETURN { value: attr, count: length }
)

Jan

Sep 19, 2017, 1:43:21 AM
to ArangoDB
Hi,

yes, it will scan the collection 4 times with the query below, once for each subquery.
Short-term, I do not see any way to speed that up with existing AQL.
Long-term, a few ways to handle this would be to parallelize the scanning, or to collect and count the same input by multiple criteria with a single COLLECT clause.
Parallelizing it would still mean 4 scans, but in parallel. That could reduce latency quite a bit. Collecting and counting the same input by multiple criteria looks a bit more attractive, but does not really fit into the current way the AQL pipelines are set up.

Another alternative is to split the query on the client side and run 4 individual AQL queries, collect their results and aggregate them on the client side. That will work, but requires quite a bit more logic in the client/application code.
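To sketch that client-side option concretely, here is a minimal example using the ArangoDB Java driver: the four single-facet WITH COUNT queries are submitted in parallel (CompletableFuture.supplyAsync runs them on the common Fork/Join pool) and merged on the client. The collection and attribute names come from the thread; the database name, credentials and driver version (4.x era) are assumptions.

import com.arangodb.ArangoCursor;
import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class ParallelFacets {
    public static void main(String[] args) {
        // connection details and database name are placeholders
        ArangoDB arango = new ArangoDB.Builder().build();
        ArangoDatabase db = arango.db("_system");

        List<String> facetAttributes =
                Arrays.asList("attribute1", "attribute2", "attribute3", "attribute4");

        // "map" step: one COLLECT ... WITH COUNT query per facet, fired in parallel
        Map<String, CompletableFuture<List<Map>>> pending = new LinkedHashMap<>();
        for (String attr : facetAttributes) {
            pending.put(attr, CompletableFuture.supplyAsync(() -> {
                String aql = "FOR a IN Asset"
                        + " COLLECT value = a.@attr WITH COUNT INTO length"
                        + " RETURN { value, length }";
                ArangoCursor<Map> cursor = db.query(
                        aql, Collections.<String, Object>singletonMap("attr", attr), null, Map.class);
                List<Map> rows = new ArrayList<>();
                while (cursor.hasNext()) {
                    rows.add(cursor.next());
                }
                return rows;
            }));
        }

        // "reduce" step: wait for each subquery and merge into one facet result
        Map<String, List<Map>> facets = new LinkedHashMap<>();
        pending.forEach((attr, future) -> facets.put(attr, future.join()));

        facets.forEach((attr, counts) -> System.out.println(attr + " -> " + counts));
        arango.shutdown();
    }
}

Each task is just the single-facet query discussed earlier in the thread, so any per-facet index still helps; the client only pays for merging four small result lists.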

Best regards
Jan

Roman Kuzmik

Sep 19, 2017, 6:54:44 PM
to ArangoDB
Thanks for the hint!

We have written a small service to calculate facets.
It splits my huge AQL provided above into 5 queries:
  • main - to filter, sort and retrieve matching entities:

    LET docs = (FOR a IN Asset
      FILTER a.name LIKE 'test-asset-%'
      SORT a.name
      RETURN a)
    RETURN {
      counts: (RETURN {
        total: LENGTH(docs),
        offset: 0,
        to: 5
      }),
      items: (FOR a IN docs LIMIT 0, 5 RETURN a)
    }

  • 4 small ones to purely calculate facets:

    LET docs = (FOR a IN Asset
      FILTER a.name LIKE 'test-asset-%'
      RETURN a)
    LET attributeX = (
      FOR a IN docs
        COLLECT attr = a.attributeX WITH COUNT INTO length
      RETURN { value: attr, count: length }
    )
    RETURN {
      counts: (RETURN {
        total: LENGTH(docs),
        offset: 0,
        to: -1,
        facets: {
          attributeX: {
            from: 0,
            to: 1000,
            total: LENGTH(attributeX)
          }
        }
      }),
      facets: {
        attributeX: (FOR a IN attributeX LIMIT 0, 1000 RETURN a)
      }
    }

We run these using Java 8's Fork/Join and basically execute a Map (split into sub-queries) / Reduce (merge results), potentially against an ArangoDB cluster.
We run a custom ArangoDB build from Jan's feature branch, and the results are pretty impressive.
Remember, we started with 140 secs for the full AQL, and now we are down to 11 seconds (with sort/filter + skiplist on name) or 4 seconds (w/o sort/filter, and 4 facets).

I think we are satisfied for now :-) and hope this PR will make it to 'master'.

Also, it would be awesome to see the AQL pipeline advanced in the future to accommodate more analytical types of queries (facets with sub-facets, Elasticsearch-style sub-grouping, etc.).

Jan

Sep 20, 2017, 2:43:23 AM
to ArangoDB
Thanks for the feedback!
Sounds good. I also hope the changes can be integrated into a stable version soon.

Best regards
Jan

Wilfried Gösgens

Sep 20, 2017, 9:52:03 AM
to ArangoDB
Nice!
please mark your own answer as accepted with the green checkmark on the left ;-)

Roman Kuzmik

Sep 20, 2017, 10:00:03 AM
to ArangoDB
done