AQL Ternary Operator Performance & Optimization

173 views
Skip to first unread message

BrandonJS

unread,
Aug 6, 2016, 5:07:52 PM8/6/16
to ArangoDB
Hi,

I'm trying to compute a non-null weighted average in an AQL query.  In the same query I would like to also retrieve some other aggregates such as regular averages and sums.  The collection I'm using has ~300k records that are fairly small.

When I use the ternary operator to compute my weight total for all non-null entries it significantly degrades performance.  A 1 second query now takes 8 seconds.  What is the best way to optimize the following query while still being able to compute a non-null weighted average?  I'm trying not to query multiple times to generate what will be one result set.  As a separate query when using a filter to remove all null credit scores it takes about ~2 seconds.

Here is my query with the ternary operator.  This query is taking ~8 seconds to complete.

 FOR l in loans2
    COLLECT origDate = l.Acquisition.ORIGINATION_DATE
    AGGREGATE 
        sumUPB = SUM(l.Acquisition.ORIGINAL_UPB),
        weightedFico = SUM(l.Acquisition.BORROWER_CREDIT_SCORE * l.Acquisition.ORIGINAL_UPB),
        ficoUPB = SUM( IS_NULL(l.Acquisition.BORROWER_CREDIT_SCORE) ? 0 : l.Acquisition.ORIGINAL_UPB )

    SORT origDate

  RETURN {
    OrigDate: DATE_FORMAT(origDate, "%m-%yyyy"), 
    TotalUPB: sumUPB,
    wavgFICO: weightedFico / ficoUPB}


Now that same query without the ternary operator and using the regular sum of UPB (unpaid principal balance)  takes only ~1 second to complete.

 FOR l in loans2
    COLLECT origDate = l.Acquisition.ORIGINATION_DATE
    AGGREGATE 
        sumUPB = SUM(l.Acquisition.ORIGINAL_UPB),
        weightedFico = SUM(l.Acquisition.BORROWER_CREDIT_SCORE * l.Acquisition.ORIGINAL_UPB),

    SORT origDate
  RETURN {
    OrigDate: DATE_FORMAT(origDate, "%m-%yyyy"), 
    TotalUPB: sumUPB,
    wavgFICO: weightedFico / sumUPB}

Wilfried Gösgens

unread,
Aug 8, 2016, 7:49:47 AM8/8/16
to ArangoDB
Hi,
Using db._explain() can clearly identify the source of the additional resource usage:

  6   CalculationNode              0       - LET #11 = (IS_NULL(l.`Acquisition`.`BORROWER_CREDIT_SCORE`) ? 0 : l.`Acquisition`.`ORIGINAL_UPB`)   /* v8 expression */   /* collections used: l : loans2 */

shelling out to V8 and evaluating parts of the query in V8 is not as effective as native execution.
There is a simple workaround to avoid the teneray expression which always brings a V8 expression into the game:

  MAX([l.Acquisition.BORROWER_CREDIT_SCORE, 0])

if the first expression is undefined, that will be lower in value than 0, thus 0 is used here.

This should in term be a lot faster.

Cheers,
Willi

Simran Brucherseifer

unread,
Aug 8, 2016, 10:10:58 AM8/8/16
to ArangoDB
Another possible "hack" is to convert the credit score to a boolean (null converts to false) and multiply it with the original UPB, which will implicitly cast the boolean to a number (false = 0, true = 1).
If there's no credit score, the original UPB will be multiplied by 0, which will zero it out. Otherwise it will be multiplied by 1, which will not change its value.

ficoUPB = SUM( TO_BOOL( l.Acquisition.BORROWER_CREDIT_SCORE ) * l.Acquisition.ORIGINAL_UPB )


BrandonJS

unread,
Aug 8, 2016, 8:15:41 PM8/8/16
to ArangoDB
Both methods worked and are much faster!  The TO_BOOL function is a bit quicker and I think a bit more concise of the two.  

Using TO_BOOL it takes about 1.9 seconds to execute (compared to the ? operator's 8 seconds).  It only adds about 0.5 seconds to the overall query which I think is more than acceptable.


Replaced:
    ficoUPB = SUM( IS_NULL(l.Acquisition.BORROWER_CREDIT_SCORE) ? 0 : l.Acquisition.ORIGINAL_UPB )

With either of these:
    ficoUPB = SUM( TO_BOOL(l.Acquisition.BORROWER_CREDIT_SCORE) * l.Acquisition.ORIGINAL_UPB)
    ficoUPB
= SUM( MIN([1, MAX([l.Acquisition.BORROWER_CREDIT_SCORE, 0])]) * l.Acquisition.ORIGINAL_UPB )

Had to do a MIN and a MAX to get l.Acquisition.BORROWER_CREDIT_SCORE to be either 0 or 1.

Thanks for the help!

Jan

unread,
Aug 15, 2016, 4:47:22 AM8/15/16
to ArangoDB
Hi all,

the root cause for this issue seems to be that the ternary operator is executed using V8 still. This works but is unintended, as it may execute slower than executing it natively (in C++).
I am currently working on a fix for this, which should make the original query as fast as the alternative workarounds.

Best regards
Jan

Jan

unread,
Aug 15, 2016, 5:28:57 AM8/15/16
to ArangoDB
The fix has been committed in the 3.0 and devel branches.
Best regards
Jan
Reply all
Reply to author
Forward
0 new messages