NavigableIndex for a Partially Ordered Universal Set

73 views
Skip to first unread message

Aman Gupta

unread,
Jan 5, 2014, 10:06:40 AM1/5/14
to cqengine...@googlegroups.com
We have an entity has a concept of "targeting" - defined as list of countries, user categories etc. When a request comes from a specific country, with an user belonging to a specific category etc., I want to find all entities which have targeting covering the "current target" request.

There are two ways we are thinking of doing this:
1) Having an HashIndex on each of the targeting attributes, and creating an AND/OR query.
2) You can consider the set of all requests as a universe. A targeting is a subset of this universe. Given two targetings, we can define isSubsetOf relation. It's not very difficult to encode this set theoretic notion using Java, and we have already done this. The request time query will be then to find all entities such that currentTarget.isSubsetOf(entity.targeting).

In the 2nd approach, what's missing is an index which can work the isSubsetOf query. isSubsetOf() only creates a partial order among all sets, thus the NavigableIndex is not directly usable.

What do you suggest? I am planning to implement multiple approaches and run benchmarks.

Niall

unread,
Jan 6, 2014, 2:01:20 PM1/6/14
to cqengine...@googlegroups.com
Hi Aman,

There may be a few ways to do this. One thing to be aware of is "index selectivity" however, and so if you proceed with option 1, make sure to analyze the distribution of values in your fields first. (Google "index selectivity" for more details.)

Let's take it back to the Cars analogy :)

Say Ford is manufacturer, and within the set of cars manufactured by Ford, there are types of car such as "Estate", "Pickup" etc. You could say that the types manufactured by Ford are a subset of all Ford cars, but also other manufacturers might manufacture cars of the same type. So you do indeed need to model the subset relationship.

If you build a HashIndex on Car.manufacturer and Car.type, then great it will reduce the candidate set somewhat. But the "selectivity" of both those indexes would be fairly poor. Say you had a collection of 100000 cars, of which 50000 (50%) are manufactured by Ford, and 50000 (50%) are manufactured by Honda. Let's say 50% of Ford's fleet are Pickup trucks (25000 cars), but let's say that 50% of Honda's fleet are also pickup trucks (25000 cars).

Given a query "and(equal(Car.manufacturer, "Ford"), equal(Car.type, "Pickup"))", CQEngine will have a choice between both of those indexes, but neither index alone will uniquely identify "Ford Pickup" cars. The index on manufacturer ("Ford") will narrow the candidate set to 50000, and the index on type ("Pickup") will narrow the candidate set to 50000 also (i.e. Ford manufactured 25000 Pickups, and Honda manufactured another 25000). The candidate sets from either of those indexes will be large at 50000 cars, 50% of the collection, requiring a non-trivial amount of filtering. So this is where you might run into problems with option 1.

However often this is where CompoundIndex comes in. If you "addIndex(CompoundIndex.onAttributes(Car.manufacturer, Car.type))" then CQEngine will build an index on the *combinations* of those two attributes. This way it can jump immediately to "Ford Pickup" cars which is much smaller, 25% of the collection, requiring much less filtering.

Ultimately, if you analyze the "selectivity" of field values in your collection, you might find that it's not worth adding indexes on certain fields at all. For example any particular value of Car.type matches a large fraction of the entire collection. So to make the retrieval faster you can use CompoundIndex to build indexes over combinations of fields.

So it sounds like CompoundIndex might allow you to model your isSubsetOf relationship in an indirect way?

An alternative to a CompoundIndex, is to define a virtual attribute which is a combination of the two fields, and then you can build any type of index on top of the virtual attribute. This may allow you to model isSubsetOf relationships directly, and might be closer to your option 2.

A simple way with string fields, is to concatenate them in the virtual attribute such that the attribute returns "Ford Pickup", "Ford Estate" etc even though no such field really exists in the objects. For non-string fields, you can define an attribute which returns "CarType" objects encapsulating two fields manufacturer and type, and implement equals(), hashCode() (and Comparable.compareTo() for NavigableIndex) as normal on that value object. You can then query the collection by supplying a CarType object encapsulating the sought combination of those fields (this is like Hibernate's "query by example"). This approach is for advanced cases. CompoundIndex exists to automate boilerplate like that.

With both CompoundIndex and when building indexes on virtual attributes, you might also wish to control the tradeoff between retrieval time and memory required by the index. Take a look at Quantizers which can transparently group objects into the same buckets inside indexes, trading retrieval speed for reduced overall size of the indexes.

I'm not sure but I hope that helps,
Niall

Aman Gupta

unread,
Jan 11, 2014, 3:26:14 AM1/11/14
to cqengine...@googlegroups.com
Thanks Niall for suggesting this approach.

The current system (self written code - not using cqengine or any other indexing lib) that we have has exactly this concept of compound index, where 3 attributes are grouped. Though, because it increases memory usage quite drastically, we have been doing caching (guava cache), where only a subset of all possible combinations of those 3 attributes are in index at any point, and computed if the query doesn't ask for values in that subset. Post that, we do single attribute index queries and intersections/unions for AND/OR. The primary reason for moving to cqengine was to have query optimization on the set of attributes remaining.

I had not considered using quantization to reduce memory usage - this can potentially provide a performance/memory usage tradeoff.

Runnning benchmarks next week - will keep this thread updated on what worked for us.

Thanks,
Aman

Niall

unread,
Jan 13, 2014, 4:27:14 PM1/13/14
to cqengine...@googlegroups.com
No problem. There might be other ways to do it also. Hope it works out.
Reply all
Reply to author
Forward
0 new messages