AQL: Logical or technical Update Issue

111 views
Skip to first unread message

Tobias Jakob

unread,
Oct 20, 2017, 5:05:48 AM10/20/17
to ArangoDB


Hi Everyone,

I am having some problems with a query:

FOR v, e, p IN OUTBOUND 'users/120040' GRAPH 'ratedGraph'
    LET user = p.vertices[0]
    LET movie = p.vertices[1]
    FOR genre in movie.genres
UPDATE { _key: user._key, [genre]: user[genre] + 1 }
        IN users

So I have simple named Graph called "ratedGraph". It contains users and movies as well as their edges between.

I want to create some kind of user profile, that stores how much a user likes a genre.

So I iterate over all connected movies and count the occurring genres. The count values shall be stored in the user document, so it would look like this:

{
    "_key": "120040",
    "_id": "users/120040",
    "_rev": "_Vxi_Fha--v",
    "Action": 1,
    "Adventure": 1,
    "Thriller": 1,
    ...
  }

But there seems to be something wrong with my query. Each time I run it, every genre count value is just incremented by one.

I hope understand my problem and can help me out.

Thank you!

- Tobias


Simran Brucherseifer

unread,
Oct 20, 2017, 9:57:29 AM10/20/17
to ArangoDB
Hi Tobias,

let me shed some light on this:

The UPDATE operation can be used on a single document like
UPDATE "docKey" WITH {...} IN coll

or on multiple like so:
FOR doc IN coll UPDATE doc WITH { ... } IN coll

but important to note is, that each document is only updated once.

Your traversal returns 0..n documents (movies), and you iterate over their genre attributes.
The UPDATE is in the FOR loop body of that iteration, which means it is executed for every element in the genre array - which can be 0..n times per document (movie).
In addition, the same movie documents can be returned by a graph traversal multiple times (depends on your edges).

If the same document is updated multiple times, you may observe ignored updates, which are actually caused by the original documents being read from a cache and not the already modified versions.
It seems like there are situations however, in which multiple updates are successful. Maybe someone of the core can comment on that, what the intended behavior is.

What works in my test scenario is the following query:

FOR u IN users
    LET genreStats
= MERGE(
        FOR m IN OUTBOUND u GRAPH
'ratedGraph' // get all movies a user is linked to
            OPTIONS
{uniqueVertices: 'global', bfs: true} // ignore duplicate movies
            FOR g IN m
.genre
                COLLECT genre
= g WITH COUNT INTO count // group by genre
                RETURN
{[genre]: count} // return one object per genre with key=genre and value=count (merged into single object by MERGE function in 2nd line)
   
)
    FILTER LENGTH
(genreStats) // don't update user documents which are not linked to any movie
    UPDATE u WITH
{genreStats} IN users // write stats into user document as attribute "genreStats"
    RETURN NEW
// optionally return updated user document

This calculates and writes the genre stats for all users. Note that the graph traversal is in the scope of a sub-query to keep the aggregation part per-user.
You could also write the stats directly into the user document like UPDATE MERGE(genreStats, u), but I find it way cleaner to use a nested object:
If you want to remove the stats, you don't need to know all the names of the genres (assuming there are additional attributes in the user documents, which are not genre stats).

A remark about graphs and modifications:
Graph traversals may return vertices from different collections, but UPDATE / REPLACE / REMOVE can only be carried out against one pre-defined collection (UPDATE ... IN collectionName).
It's probably a good idea to add FILTER IS_SAME_COLLECTION("collectionName", vertex) to the traversal.
Otherwise you might accidentally try to modify a document that is in another collection (error/warning), or accidentally change a wrong document that happens to have the same _key like a document from another collection.

Best, Simran

Tobias Jakob

unread,
Oct 23, 2017, 4:44:44 AM10/23/17
to ArangoDB
Hi Simran,

Your answer was already very helpful. Thanks for that!

I have another question regarding the Collect.

I have a rating value stored in each "rated" edge in the ratedGraph.
I am trying to multiply the count by the particular rating weight.

Something like this:

FOR u IN users
    LET genreStats
= MERGE(
        FOR m IN OUTBOUND u GRAPH
'ratedGraph' // get all movies a user is linked to
            OPTIONS
{uniqueVertices: 'global', bfs: true} // ignore duplicate movies
            FOR g IN m
.genre
                LET rating = e.rating

                COLLECT genre = g WITH COUNT INTO count // group by genre

                RETURN
{[genre]: count*rating} // return one object per genre with key=genre and value=count (merged into single object by MERGE function in 2nd line)

   
)
    FILTER LENGTH
(genreStats) // don't update user documents which are not linked to any movie
    UPDATE u WITH
{genreStats} IN users // write stats into user document as attribute "genreStats"
    RETURN NEW
// optionally return updated user document

This does not work because of the grouping. I was trying to include the rating in that grouping but this doesn't make sense.

Do have an idea about that?

Thank you! :)

Best, Tobias




Simran Brucherseifer

unread,
Oct 23, 2017, 9:50:59 AM10/23/17
to ArangoDB
You are right, it does not work that way. There's actually some rework to be done if you want to do more than group and count.

I can show you how to multiply count and rating, but first I need another piece of information from you:
We have multiple movies per user, each connection between them has a movie rating attribute, and every movie can have multiple genres.
Thus, there can be multiple ratings per genre. With what should the genre count be multiplied? An average of the movie ratings? If so, which one? (median, mean, ...)

Tobias Jakob

unread,
Oct 23, 2017, 11:02:56 AM10/23/17
to ArangoDB
Hi,

Right, I want to have some kind of score that represents a users interest in a genre.
So one genreStat should be the average (mean) rating of this genre, rated by this user.

I am trying this for a while, but I am pretty new to aql.

Thanks, Its nice to have you helping me.

Simran Brucherseifer

unread,
Oct 24, 2017, 1:02:02 PM10/24/17
to ArangoDB
COLLECT clears all variables in the current scope. Click on Explain the web interface and inspect the execution plan:



Follow the CollectNode with Id 7 all the way up to the next SingletonNode - it's Id 3 and the ROOT of the current scope.
The variables emitted by the graph traversal (m, e) and the variable for the genre iteration (g) are in the scope, which means you don't have access to them after COLLECT.
The parent scope (which is the top-level scope) contains the iteration over all user documents. Variable u can be still be accessed after COLLECT, because it is defined outside its scope.

If you want to know into which "buckets" values are grouped, use the COLLECT ... INTO syntax.

The COLLECT ... WITH COUNT INTO ... syntax is a shorthand if you want to group and count the number of occurrences (how many items per bucket if you will).
This syntax can not be extended by an INTO clause however. We still need the counts nonetheless, so we need to rework the query a bit.

We could use the standard INTO syntax, but it would keep way too much data which we don't need further down the query. All we actually need is the rating stored as edge attribute.
Thus, we can create a projection like so: COLLECT ... INTO r = e.rating
For every bucket (genre), we will have access to an array with the rating values via variable r.

We had to remove the counting, and need to add it back in a different way now. There are two options. Post-calculation:
COLLECT genre = g INTO r = e.rating
RETURN LENGTH(r) // array length of ratings equals number of items in bucket (what if there's no rating attribute though?)

Aggregation (can be more efficient, although it shouldn't make any difference in your case):
COLLECT genre = g AGGREGATE count = LENGTH(1) INTO r = e.rating

For every item in a bucket, a counter is increased by one (the LENGTH function always returns 1 in conjunction with AGGREGATE, no matter what you pass to it).
AGGREGATE could also be used to find out the minimum and maximum values as well as a few other statistical metrics, but it's not needed in this context.


The full query:

FOR u IN users
    LET genreStats
= MERGE(

        FOR m
, e IN OUTBOUND u GRAPH 'ratedGraph' // get all movies a user is linked to

            OPTIONS
{uniqueVertices: 'global', bfs: true} // ignore duplicate movies
            FOR g IN m
.
genre
                COLLECT genre
= g AGGREGATE count = LENGTH(1) INTO r = e.rating // group by genre
                RETURN
{[genre]: count * AVERAGE(r)} // return one object per genre (merged into single object by MERGE function in 2nd line)



   
)
    FILTER LENGTH
(genreStats) // don't update user documents which are not linked to any movie
    UPDATE u WITH
{genreStats}
IN users
    RETURN NEW


count is multiplied by the AVERAGE (mean) of the ratings per genre. There are also functions like MEDIAN which could be used instead:
Reply all
Reply to author
Forward
0 new messages