Performance of "IN" operator on auto created index

56 views
Skip to first unread message

Trevor Hunter

unread,
Feb 24, 2020, 12:53:41 PM2/24/20
to RavenDB - 2nd generation document database
Hi,

We've noticed a large performance gap with the "IN" operator vs standard "AND/OR combinations" (https://ravendb.net/docs/article-page/4.2/csharp/indexes/querying/filtering#where--in)

We're using RavenDB 4.2.8.

Take the following two queries on a relatively simple object model that has a "CustomerId" (guid) and "Status" (Enum) field to query on:

Running the following on our data set (~11m documents) returns in about 2ms:

from Orders where (CustomerId = "0b3b11fe-612e-31b3-9258-0abs0e2bfbba")
and (Status = "Approved" OR Status == "AutoApproved")


Running the following on the same data set takes about 500ms to run and seems to max out all CPU of the DB server if we run with any load:

from Orders where (CustomerId = "0b3b11fe-612e-31b3-9258-0abs0e2bfbba")
and (Status IN ("Approved","AutoApproved"))

The auto index includes the customer id and the status fields. Both queries return the same results, but seem to be very different from a performance point of view.

We can fairly easily refactor the client code to not use the "IN" operator, but it seems strange there would be such a large performance penalty for this.

Regards,

Trev.




Oren Eini (Ayende Rahien)

unread,
Feb 24, 2020, 1:41:16 PM2/24/20
to ravendb
That shouldn't be the case, and off the top of my head, I can't think why it would be so.
Can you reproduce this? Something you can send us to check?

--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ravendb/fa80f26b-42b5-4d88-af58-ef0483cab02c%40googlegroups.com.


--
Oren Eini
CEO   /   Hibernating Rhinos LTD
Skype:  ayenderahine
Support:  sup...@ravendb.net
  

Trevor Hunter

unread,
Feb 24, 2020, 4:02:33 PM2/24/20
to RavenDB - 2nd generation document database
I just sent a C# test harness to Support - ho

Rough outline, with the following object model and a database of 1 million documents:

    public enum OrderStatus
    {
        Pending = 0,
        Approved = 1,
        AutoApproved = 2
    }


    public class Order
    {
        public string Id {get;set;}
        public string CustomerId {get;set;}
        public OrderStatus Status {get;set;}

    }

The Query 1 (IN) below is about 10x slower than Query 2 (AND + OR) (running locally) when "customers/100" DOESN'T EXIST. When it does exist, it's still slower, but marginally. An auto index was created in response to these queries.

Query 1 (slow):
 var q = session.Query<Order>()
                    .Where(o => o.CustomerId == "customers/100", true)
                    .Where(o=> o.Status.In(new []{OrderStatus.Approved, OrderStatus.AutoApproved}) ).Take(1);

Query 2 (fast):
var q = session.Query<Order>()
                    .Where(o => o.CustomerId == "customers/100", true)
                    .Where(o=> o.Status == OrderStatus.Approved || o.Status == OrderStatus.AutoApproved ).Take(1);


In this particular test harness, the performance difference shows itself especially when the customer ID (customers/100) doesn't exist. When the customer ID does exist, the difference is still present, but less. It's as if short-circuit logic works correctly for the logical operators, but not the IN operator. 












On Monday, February 24, 2020 at 1:41:16 PM UTC-5, Oren Eini wrote:
That shouldn't be the case, and off the top of my head, I can't think why it would be so.
Can you reproduce this? Something you can send us to check?

On Mon, Feb 24, 2020 at 7:53 PM Trevor Hunter <thu...@kobo.com> wrote:
Hi,

We've noticed a large performance gap with the "IN" operator vs standard "AND/OR combinations" (https://ravendb.net/docs/article-page/4.2/csharp/indexes/querying/filtering#where--in)

We're using RavenDB 4.2.8.

Take the following two queries on a relatively simple object model that has a "CustomerId" (guid) and "Status" (Enum) field to query on:

Running the following on our data set (~11m documents) returns in about 2ms:

from Orders where (CustomerId = "0b3b11fe-612e-31b3-9258-0abs0e2bfbba")
and (Status = "Approved" OR Status == "AutoApproved")


Running the following on the same data set takes about 500ms to run and seems to max out all CPU of the DB server if we run with any load:

from Orders where (CustomerId = "0b3b11fe-612e-31b3-9258-0abs0e2bfbba")
and (Status IN ("Approved","AutoApproved"))

The auto index includes the customer id and the status fields. Both queries return the same results, but seem to be very different from a performance point of view.

We can fairly easily refactor the client code to not use the "IN" operator, but it seems strange there would be such a large performance penalty for this.

Regards,

Trev.




--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rav...@googlegroups.com.

Oren Eini (Ayende Rahien)

unread,
Feb 26, 2020, 7:26:55 AM2/26/20
to ravendb
Thanks, we are looking into this

To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ravendb/b144ac34-a544-4831-928c-7e38e6216f07%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages