Does Arango AQL support what is similar to EXIST in SQL?

506 views
Skip to first unread message

Aymer

unread,
Oct 30, 2015, 11:17:14 AM10/30/15
to ArangoDB
Hi,

The following join:-

FOR d IN col-1
  FOR w IN col-2
     FILTER w.uid == 'uid1'  &&  d.attr1 == w.attr2
       return d

results in duplicate since  there are multiple records in col-2 matches this condition d.attr2 == w.attr1 and this is a situation where EXISTS in SQL help.

Anything idea to avoid duplicates? as a workaround I'm using DISTINCT while returning, but its very slow.

Thanks
Aym
 

Jan Steemann

unread,
Nov 4, 2015, 8:03:22 AM11/4/15
to aran...@googlegroups.com
Hi,

I tried to reproduce the problem using the following setup of data:

db._create("col1");
db._create("col2");
db.col2.ensureIndex({ type: "hash", fields: [ "uid", "attr2" ] });
for (i = 0; i < 500000; ++i) {
  db.col1.insert({ attr1: i % 10000 });
}
for (i = 0; i < 500000; ++i) {
  db.col2.insert({ uid: "uid" + (i % 10000), attr2: i % 10000 });
}

That produces 500,000 documents in collections `col1` and `col2` each, with a few matches between them.

Running the original query produces 2,500 matches, including duplicate ´d`s:

    FOR d IN col1
      FOR w IN col2
        FILTER w.uid == 'uid1' && d.attr1 == w.attr2
        RETURN d

This query took 2.92 seconds to run on my local machine in 2.7.

When adding a duplicate removal via DISTINCT, the query text slightly changes but runs in about the same time (2.94 seconds):

    FOR d IN col1
      FOR w IN col2
        FILTER w.uid == 'uid1' && d.attr1 == w.attr2
        RETURN DISTINCT d

The `EXISTS` keywords does not exist in AQL, but it can be substituted using a subquery and a regular FILTER. Such query will iterate over each `d` only once, and filter out the `d`s for which no match in `w` exists:

    FOR d IN col1
      LET matching = (
        FOR w IN col2
          FILTER w.uid == 'uid1' && d.attr1 == w.attr2
          RETURN 1
        )
        FILTER LENGTH(matching) > 0
        RETURN d

The above query ran in 4.98 seconds.

So yes, there is an equivalent to `EXISTS` but at least with the test data I used it does not provide a benefit over using `DISTINCT`.
In fact, the `DISTINCT` can be quite fast if it can use the "hash" method internally. But it may be different with other input data.

As an aside: I also ran the above queries on the same data set with 2.8 (currently in devel), and the execution times for the first two queries went down to 1.25 seconds (from 2.9x seconds) and down to 3.06 seconds from 4.98 seconds for the `EXISTS` query.

I hope this helps.

Best regards
Jan






--
You received this message because you are subscribed to the Google Groups "ArangoDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arangodb+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

CoDEmanX

unread,
Nov 4, 2015, 6:48:27 PM11/4/15
to ArangoDB
Is that without indexes in all cases? (Just to clarify)

Jan

unread,
Nov 5, 2015, 3:35:27 AM11/5/15
to ArangoDB
My example contained the following index as can be seen in my setup code above:


    db.col2.ensureIndex({ type: "hash", fields: [ "uid", "attr2" ] });

Not sure about the original author's query though.

Aymer

unread,
Nov 12, 2015, 7:45:28 AM11/12/15
to ArangoDB
Thanks Jan,

I'll continue with distinct for now.

Rgds
Aym
Reply all
Reply to author
Forward
0 new messages