Querying for the document with a max value

132 views
Skip to first unread message

Christoph Engel

unread,
Oct 25, 2016, 9:16:32 AM10/25/16
to ArangoDB
Hi,

I would like to get the document from my collection which contains the maximum value "val" of all stored documents.
I've created the following statement:

FOR vertex in MyCollection
filter has
(vertex, "val")
COLLECT AGGREGATE
    maxVal
= max(vertex.val) INTO maxVertex
RETURN
{
    maxVal_1: maxVal,
    maxVal_2: maxVertex[0].val
}

As far as I understand it, maxVertex should contain a set of documents for each value of maxVal. Because there is only onevalue of maxVal, there should only be one group containing the document with maxVal in the maxVertex-group.
In this case maxVal_1 and maxVal_2 should have the same value, but both values are different.

What did I understand wrong?

best regards
Christoph

Jan

unread,
Oct 25, 2016, 10:21:28 AM10/25/16
to ArangoDB
Hi,

let's start by simplifying the RETURN part of the query a bit:

    FOR vertex IN MyCollection
      FILTER HAS(vertex, 'val')
      COLLECT AGGREGATE maxVal = max(vertex.val) INTO maxVertex
      RETURN { maxVal_1: maxVal, maxVal_2: maxVertex }

Because of the COLLECT AGGREGATE, the result of the query is one new document, containing the attributes "maxVal_1" and "maxVal_2".
"maxVal_1" will contain the overall maximum value of the "val" attribute. "maxVal_2" in my case will contain all the visited documented that were inspected while processing the COLLECT statement (because of the FILTER that is all the documents that have a "val" attribute).

The documents in "maxVal_2" are produced by a COLLECT, and COLLECT will store all variables from its scope in its out variable prefixed with the variable name. In this case, the collect will store documents from the variable "vertex" in a sub-attribute "vertex" in its out variable "maxVertex".
With my test data, the contents of "maxVertex" looked like this:

    "maxVal2" : [
      {
        "vertex" : {
          "_key" : "5186868",
          "_id" : "MyCollection/5186868",
          "_rev" : "5186868",
          "val" : "baz"
        }
      },
      {
        "vertex" : {
          "_key" : "5186859",
          "_id" : "MyCollection/5186859",
          "_rev" : "5186859",
          "val" : 6
        }
      },
      {
        "vertex" : {
          "_key" : "5186865",
          "_id" : "MyCollection/5186865",
          "_rev" : "5186865",
          "val" : 8
        }
      }
    ]

Note that there will be as many documents in "maxVertex"."vertex" as there are documents that satisfy the filter condition. Please also note that these documents will appear in the "maxVertex" variable in non-predictable order (technically they will appear in the same order as the collection iteration is done).

In your query, you are accessing the 0th element of "maxVertex", which is just a "random" document from the ones iterated over, but not necessarily the one with the maximum value. In my case, it would be this one:

      {
        "vertex" : {
          "_key" : "5186868",
          "_id" : "MyCollection/5186868",
          "_rev" : "5186868",
          "val" : "baz"
        }
      }

Accessing the "val" sub-attribute of the 0th element will produce "null", because the 0th element does not have the "val" attribute itself. It is necessary to access "vertex"."val".
So the query becomes:

    FOR vertex IN MyCollection
      FILTER HAS(vertex, 'val')
      COLLECT AGGREGATE maxVal = max(vertex.val) INTO maxVertex
      RETURN { maxVal_1: maxVal, maxVal_2: maxVertex[0].vertex.val }

Still "maxVal_2" will contain a "val" value from a "random" document and not necessarily the maximum value.

The main differerence is that "maxVal" is an aggregate value: it is constantly updated during the iteration over all the documents, and will always contain just the maximum value of the documents seen.
"maxVertex" however is not an aggregate. It will be build during the COLLECT too, but will contain all the documents visited by the COLLECT. It will not contain a single document but all.


By the way, an easier way to access the maximum value of an attribute in a collection is to create a skiplist index on the attribute and perform a SORT LIMIT 1 on it, e.g.

    db.MyCollection.ensureIndex({ type: "skiplist", fields: [ "val" ] });
    q = "FOR doc IN MyCollection SORT doc.val DESC LIMIT 1 RETURN doc";
    db._query(q).toArray();

This will also be more efficient because it will look at a single document only, whereas the initial query will always have to look at all the documents in the collection.

Best regards
J

Christoph Engel

unread,
Oct 25, 2016, 11:45:48 AM10/25/16
to ArangoDB
Hi Jan,

thanks for the detailed answer.
The solution with the skiplist index works well.

Best regards
Christoph
Reply all
Reply to author
Forward
0 new messages