Predicate to get only keys


Joan Balagueró

Jul 12, 2021, 10:32:05 AM
to Hazelcast
Hello,

I have the following Predicate:

Predicates.sql("attribute[client_code] = 'CL99' AND attribute[check_in] >= '20210930' AND attribute[hotels] IN ('h5','h1')")

However, I'm only interested in getting the keys, and the predicate also deserializes values on each node.

"The requested predicate is sent to each member in the cluster. Each member looks at its own local entries and filters them according to the predicate. At this stage, key/value pairs of the entries are deserialized and then passed to the predicate."

My values are big, and I don't need them. Is there any way to get only the keys when using 'Predicates.sql()'?

Thanks,

Joan.

Tom OConnell

Jul 13, 2021, 11:28:01 AM
to Hazelcast
Hi -

To get only keys, there's the 'keySet(Predicate)' method. The only values returned over the network will be the keys that match.
However, your predicate from above will cause de-serialization of the objects unless you have indexes that remove that need. For large or complex objects, this may have a significant cost. To create those indexes, it may be best to use 'custom attributes' and extractors. Take a look at https://docs.hazelcast.com/imdg/4.2/query/custom-attributes.html - it's pretty straightforward. You can reference your custom attributes in indexes.
If creating indexes that completely cover the query is not practical (for either speed or space considerations), then you should be very aware of your serialization - because you may be doing many de-serialization operations, either for large maps or frequent queries. In this case, consider Portable (https://docs.hazelcast.com/imdg/4.2/serialization/implementing-portable-serialization.html) serialization - this is optimized for indexing and query operations.

Cheers

Tom

Joan Balagueró

Jul 13, 2021, 4:37:47 PM
to Hazelcast
Hi Tom,

Thanks for your response. So I understand that, with keySet, the predicate executed on each node will only return (deserialize) the keys, but not the values associated with these keys? I thought that keys and values were returned to the calling node, and that the keySet method then collected and returned only the keys. That's why I was using this projection: Projection<Map.Entry<CacheKey, CacheValue>, CacheKey> projectionByKey = Projections.singleAttribute("__key");
But this is no longer necessary if I use 'keySet', is it?

I'm already using custom attributes and indexing them (in fact "attribute[client_code]" is already a custom attribute).

I will take a look at Portable. Currently the IMap value (CacheValue) already implements IdentifiedDataSerializable. So are you saying I should replace IdentifiedDataSerializable with Portable to make queries more efficient? And will this affect put/get performance?

Thanks,

Joan.

Tom OConnell

Jul 14, 2021, 12:20:41 PM
to Hazelcast
Hi -

With an IMap, you can use "keySet()", "values()" or "entrySet()" to get only keys, only values or a collection of Map.Entry objects. I think your projection approach is not needed, given keySet, but I really can't say without looking at it.

Portable and IdentifiedDataSerializable are different implementations, with different goals. Serialization results and efficiency will always depend, somewhat, on your model and your code. It's safe to say, though, that IdentifiedDataSerializable will tend to be slightly more efficient in terms of CPU and memory so that it will be better for put/get/set operations. Portable will generally have a significant benefit for query operations, so if deserialization is required during a query, it will probably be best. 
The choice changes, if you can cover your most important or most frequent queries with indexes. Indexing all 'query columns' will outperform an un-indexed query, even with Portable.
So, these are guidelines - if the shape of the data is such that rigorous indexing is not practical, Portable is well worth looking at. If you have complete - or sufficient - indexing, then IdentifiedDataSerializable will probably be better. In this second case, you may also consider using your favorite serializer - Avro, Kryo, Protobufs, ... As the elements will not be deserialized for indexed queries, or will be for infrequent queries, IdentifiedDataSerializable or any of the others should be fine. All of them will outperform java default serialization by a significant margin.

Cheers

Tom

Joan Balagueró

Jul 15, 2021, 4:26:28 AM
to Hazelcast
Hi Tom,

Thanks! I've been reading about Portable ... My current cache object has an attribute (representing the API response) that occupies 99.99% of the space in the object. And this is the attribute the clients are requesting ... So I don't think Portable is going to help in this scenario (if I'm not wrong ...).

Just a couple of questions to finish (now I have everything working fine, thanks for your advice):

1. When I try to add an index mixing simple values with collections an error is returned:
java.lang.IllegalStateException: Collection/array attributes are not supported by composite indexes: attribute[hotels]
Will this be possible in next versions?

2. Regarding the above point, let's suppose I have this query:  start >= '2021-07-14' AND end <= '2021-07-25' AND hotelCode IN ('h1','h2','h3'). Since I can't have an index on [start, end, hotelCode] (because hotelCode is a collection), I have created 2 indexes: idx_dates [start, end] (sorted) and idx_hotels [hotelCode] (hash).

I don't know if Hazelcast generates a kind of execution plan, but I understand the behaviour is:
1. Index 'idx_dates' is used to execute "start >= '2021-07-14' AND end <= '2021-07-25'", and this generates a keyset named "ks1" on each node.
2. Index 'idx_hotels' is used to execute "hotelCode IN ('h1','h2','h3')", and this generates a keyset named "ks2" on each node.
3. Then the intersection between "ks1" and "ks2" is applied on each node because the two subqueries are joined by an 'AND' operator.
4. The resulting keyset is sent over the network to the calling node.

Is this correct?

Thanks,

Joan.

Tom OConnell

Jul 16, 2021, 1:43:17 PM
to Hazelcast
Hi -

1. - What does your index creation look like? I have a map with a trivial "HotelData" class that contains an array of strings called "hotels" and a timestamp.  In v4.2, I have, in part 'indexConfig.addAttribute("hotels[any]");' and I use that to add a HASH index to a map. I populate the map with some dummy data and 'l.info("keys: {}", foo.keySet(Predicates.sql("hotels[any] = h1")));' acts as expected and provides a correct result.

In response to your other questions, each of the predicates is optimized separately - and this is an interesting point - so multiple indexes are used. My data isn't the same as yours, but I have two index expressions:
sorted - "timestamp" and
hash - "hotels[any]"
This gives me two indexes, for a map "foo" - "foo_sorted_timestamp" and "foo_hash_hotels[any]". The names are only important if you want to dissect the local map stats.

I created some predicates:
    -  Predicate hotelP = Predicates.sql("hotels[any] in (h1, h2)");
    -  Predicate timestampP = Predicates.sql("timestamp >= 1626455173965 and timestamp < 1626455213965");
    -  Predicate andP = Predicates.and(hotelP, timestampP);
The third one is what I think is most similar to what you're looking for: finding some hotels within a range of timestamp data.
I queried each predicate separately and printed out some local index stats in between.
foo.getLocalMapStats().getIndexStats().get("foo_sorted_timestamp").getQueryCount() and
foo.getLocalMapStats().getIndexStats().get("foo_hash_hotels[any]").getQueryCount()

So, the query counts each ended up at '1' after the non-composite queries were run and were both '2' after the composite 'and' predicate was queried, exactly as it should have been. This clearly shows that multiple indexes were used to evaluate the logical 'and' query.

In the composite, you're correct - all query logic is executed on the members, and only the keySet() is returned over the wire.
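
To make the flow confirmed above concrete, here is a minimal plain-Java model of what a single member does. It is not Hazelcast code: the class, its fields and its method names are all hypothetical, and real indexes are considerably more involved. It only shows the shape of the evaluation, where a sorted index answers the range part, a hash index answers the IN part, and the two key sets are intersected before anything leaves the member.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of one member's local indexes (not the Hazelcast implementation).
final class IndexIntersectionSketch {
    // Sorted index: timestamp -> keys; supports range lookups.
    private final NavigableMap<Long, Set<String>> sortedByTimestamp = new TreeMap<>();
    // Hash index: hotel code -> keys; supports equality and IN lookups.
    private final Map<String, Set<String>> hashByHotel = new HashMap<>();

    void put(String key, long timestamp, String... hotels) {
        sortedByTimestamp.computeIfAbsent(timestamp, t -> new HashSet<>()).add(key);
        for (String hotel : hotels) {
            hashByHotel.computeIfAbsent(hotel, h -> new HashSet<>()).add(key);
        }
    }

    // "ks1": keys whose timestamp satisfies from <= timestamp < to.
    Set<String> keysInRange(long from, long to) {
        Set<String> keys = new HashSet<>();
        sortedByTimestamp.subMap(from, true, to, false).values().forEach(keys::addAll);
        return keys;
    }

    // "ks2": keys that list at least one of the given hotel codes.
    Set<String> keysForHotels(String... hotels) {
        Set<String> keys = new HashSet<>();
        for (String hotel : hotels) {
            keys.addAll(hashByHotel.getOrDefault(hotel, Collections.emptySet()));
        }
        return keys;
    }

    // Logical AND: intersect the two key sets; only this result would be sent to the caller.
    Set<String> and(Set<String> ks1, Set<String> ks2) {
        Set<String> result = new TreeSet<>(ks1);
        result.retainAll(ks2);
        return result;
    }

    public static void main(String[] args) {
        IndexIntersectionSketch member = new IndexIntersectionSketch();
        member.put("k1", 100L, "h1", "h3");
        member.put("k2", 150L, "h2");
        member.put("k3", 300L, "h1");
        System.out.println(member.and(member.keysInRange(100L, 200L),
                                      member.keysForHotels("h1", "h2")));
    }
}
```

Note that the values themselves are never touched in this model: both lookups work purely on index entries, which is why fully indexed queries avoid deserialization.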

Efficiency is always a concern, so this was a really good question, I thought.

There's another nuance to look at, as it may be helpful: partition predicates. These allow you to control which member evaluates the query. In the above, there are a couple of data-dependent cases where this could be used. If it had been a single hotel (i.e. Predicates.sql("hotels[any] = h1")), or multiple hotels that used partition-aware keys and were on the same member (because they're on the same partition), we could have used
Predicate awareP = Predicates.partitionPredicate("h1",  andP);
This would have taken the prior composite predicate and allowed us to run that query only on the member where the data resides.  There may well be a non-trivial benefit to reducing the distributed query load in a busy system - both CPU and network.
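
The effect of a partition predicate can be pictured with a small plain-Java model. Everything here is a simplifying assumption for illustration: Hazelcast derives the partition from a hash of the serialized key rather than from hashCode(), and it does not assign partitions to members round-robin. The only real value used is 271, the default partition count. The point is just that a partition key maps to exactly one partition, and therefore to one owning member, whereas a plain predicate fans out to every member.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical routing model (not how Hazelcast computes partitions or ownership).
final class PartitionRoutingSketch {
    static final int PARTITION_COUNT = 271; // Hazelcast's default partition count

    // Assumption: hashCode() stands in for the hash of the serialized key.
    static int partitionOf(Object partitionKey) {
        return Math.floorMod(partitionKey.hashCode(), PARTITION_COUNT);
    }

    // Assumption: partitions are spread over members round-robin.
    static int ownerOf(int partitionId, int memberCount) {
        return partitionId % memberCount;
    }

    // Without a partition key every member evaluates the query;
    // with one, only the owner of that key's partition does.
    static Set<Integer> membersToQuery(int memberCount, Object partitionKeyOrNull) {
        Set<Integer> members = new TreeSet<>();
        if (partitionKeyOrNull == null) {
            for (int m = 0; m < memberCount; m++) {
                members.add(m);
            }
        } else {
            members.add(ownerOf(partitionOf(partitionKeyOrNull), memberCount));
        }
        return members;
    }

    public static void main(String[] args) {
        System.out.println("plain predicate     -> members " + membersToQuery(3, null));
        System.out.println("partition key 'h1'  -> members " + membersToQuery(3, "h1"));
    }
}
```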

I want to mention that with the 5.0 release, streaming Jet and caching IMDG operations are now supported in a single binary. If this were a reservation system, we could combine streaming events (reservations, ...) and enrichment (hotel-code to hotel-name, for example), storage in an IMap, distributed queries, streaming queries, and streaming data out to external systems. This can be done in 3.x and 4.x as well, but by combining the products (which was pretty seamless, anyway).

Cheers

Tom

Joan Balagueró

Jul 18, 2021, 2:06:40 PM
to Hazelcast
Hi Tom,

Great explanation, thanks !! Let me explain my case and why I'm getting the error  java.lang.IllegalStateException: Collection/array attributes are not supported by composite indexes: doc[hotelCodes].

My cache value contains a byte array with the response (representing a json or xml document). The xml/json request is not in the object. So to index these documents my only choice is to create an attribute like: Map<String, Object> attributes, where the key is the name of the json/xml tag (from the request or response document) and the value can be:
1. A simple value (String, Long, Integer, ...) if the xml/json element is simple (i.e. a client code, then  attributes['clientCode'] = 'MyClientCode').
2. A multiple value (String[], Long[], Integer[], ...) if the xml/json element is multiple (i.e. a hotel code, then attributes['hotelCodes'] = ['h1', 'h2', ..., 'hN']).

So the steps are:
1. Create the attribute config:
AttributeConfig attributeConfig = new AttributeConfig();
attributeConfig.setName("doc");
attributeConfig.setExtractorClassName("com.ventusproxy.proxy.cache.data.CacheValueExtractor");
mapConfig.addAttributeConfig(attributeConfig);

2. Create the class "CacheValueExtractor", if the value is simple we just add it to the collector, if it's an array we add all values 1 by 1.
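import com.hazelcast.query.extractor.ValueCollector;
import com.hazelcast.query.extractor.ValueExtractor;

public class CacheValueExtractor implements ValueExtractor<CacheValue, String> {

  // 'argument' is the text inside the brackets, e.g. "hotelCodes" for doc[hotelCodes].
  @Override
  public void extract(CacheValue value, String argument, ValueCollector valueCollector) {
    Object obj = value.getDoc(argument);

    if (obj != null) {
      if (obj instanceof Object[]) {
        // Multi-valued element: add each value individually.
        for (Object o : (Object[]) obj) {
          valueCollector.addObject(o);
        }
      } else {
        // Single-valued element.
        valueCollector.addObject(obj);
      }
    }
  }
}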

3. The CacheValue has the attribute that contains the name/value(s):
private Map<String, Object> attributes;
public Object getDoc(String name) { return this.attributes != null ? this.attributes.get(name) : null; }

Now queries like the one below work perfectly:
doc[start] BETWEEN '20210826' AND '20210828' AND doc[hotelCodes] IN ('h0','h1') AND doc[clientCode] = 'TESTMC'

4. The point is when I want to add an index mixing simple and multiple values like below:
IndexConfig idxConfig = new IndexConfig();
idxConfig.setName("idxClientAndHotels");
idxConfig.setType(IndexType.HASH);
idxConfig.addAttribute("doc[clientCode]");
idxConfig.addAttribute("doc[hotelCodes]");
this.cache.addIndex(idxConfig);

Then the "java.lang.IllegalStateException: Collection/array attributes are not supported by composite indexes: doc[hotelCodes]" error is thrown.

And that's all. Maybe I'm doing something in a wrong way ... I don't know. Please let me know.

Thanks again,

Joan.



Tom OConnell

Jul 19, 2021, 4:37:10 PM
to Hazelcast
Hi -

Not to quibble about your question - and I get that you may be making simplifying statements for clarity, but -
The use of a single index that supports both 'doc[hotelCodes]' and 'doc[clientCode]' - is probably not a good idea, based on your example. I think that's complicating the query plan optimization and not gaining much for us in the code - even if it worked.

The exception message you're seeing makes sense - you can have an index on a collection, but a composite index would be complex and the semantics of that would probably be very unclear (would it index *every* element in the first collection against every element in the second or only corresponding? What if there were no corresponding elements?) Multiple indexes - based on a single collection or multiple scalars will be clear and easily optimized. 
I'd lean toward using more narrowly focussed indexes - a sorted one for the dates and two separate hash indexes - one for hotel code and one for client-code.

Nor would I use a single SQL predicate for the whole where-clause, although you could. I wouldn't expect a really good query plan for that. Multiple simpler SQL predicate objects wrapped in an 'and predicate' would probably work very well. It might look like this -

Predicate hotelP = Predicates.sql("doc[hotelCodes] in (h1, h2)");
Predicate clientP = Predicates.sql("doc[clientCode] = TESTMC");
... more predicates = ...
Predicate andP = Predicates.and(hotelP, clientP, dateP, ...);

Each of the predicates is optimized separately and the hotel predicate will find the appropriate index, as will the others. When you evaluate the outer 'and' predicate, each constituent predicate will be executed optimally. The 'and' predicate will ensure that only objects that match each of the underlying predicates are included. Note that "Predicates.or" is there, too. You can nest these so you can build more complex queries like "(a and b) OR (c and d)".
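
The nesting described above, "(a and b) OR (c and d)", can be sketched in plain Java. This uses java.util.function.Predicate from the standard library purely to show the composition logic; it is not Hazelcast's Predicate type, and the Booking class with its fields is a hypothetical stand-in for the queryable attributes. With Hazelcast the same shape would be built from Predicates.and and Predicates.or, as mentioned above.

```java
import java.util.function.Predicate;

// Plain-Java illustration of nested and/or composition (not the Hazelcast API).
final class NestedPredicateSketch {
    // Hypothetical stand-in for the attributes a query would look at.
    static final class Booking {
        final String clientCode;
        final String hotelCode;
        final int start; // date as yyyymmdd

        Booking(String clientCode, String hotelCode, int start) {
            this.clientCode = clientCode;
            this.hotelCode = hotelCode;
            this.start = start;
        }
    }

    static Predicate<Booking> client(String code) {
        return b -> b.clientCode.equals(code);
    }

    static Predicate<Booking> hotel(String code) {
        return b -> b.hotelCode.equals(code);
    }

    static Predicate<Booking> startsOnOrAfter(int yyyymmdd) {
        return b -> b.start >= yyyymmdd;
    }

    // (client = TESTMC AND hotel = h1) OR (client = OTHER AND start >= 20210826)
    static Predicate<Booking> example() {
        Predicate<Booking> left = client("TESTMC").and(hotel("h1"));
        Predicate<Booking> right = client("OTHER").and(startsOnOrAfter(20210826));
        return left.or(right);
    }

    public static void main(String[] args) {
        System.out.println(example().test(new Booking("TESTMC", "h1", 20210101)));
    }
}
```

Keeping each leaf predicate small mirrors the advice above: every leaf maps to one attribute, so each one can be matched against its own index.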

There are always other options for this, though and I may not have caught your meaning properly. In your CacheValueExtractor, which was certainly properly written, you didn't use the second parameter "argument" (type String). You can pass additional information into the extract method with that. Previously, I was fooling around with a 'substring' extractor and using it to create indexes with attributes like "substring[0,10]" or "substring[5]" - either an index on the first 10 characters or an index on everything following the 4th character of the attribute I was testing. The "argument" parameter is populated with whatever you placed in the square brackets in the index attribute with indexes of that format. You can pass information that directs you on how to extract information from the object. You can use one extractor class for multiple map attributes if needed.  These approaches may help you, as the extraction for each index would be passed different "argument" data.
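
As a rough illustration of how the "argument" string can drive extraction, here is a plain-Java sketch of the parsing a 'substring' extractor might do. The class and method names are hypothetical, and the semantics are an assumption: the argument is read like the parameters of Java's String.substring (zero-based begin, exclusive end), which may differ from the extractor described above. In a real extractor the result would be passed to the collector instead of being returned.

```java
// Hypothetical helper showing argument-driven extraction (not part of Hazelcast).
final class SubstringArgumentSketch {
    // 'argument' is the text inside the brackets of the attribute name,
    // e.g. "0,10" for substring[0,10] or "5" for substring[5].
    static String extract(String value, String argument) {
        String[] parts = argument.split(",");
        int begin = Integer.parseInt(parts[0].trim());
        int end = parts.length > 1 ? Integer.parseInt(parts[1].trim()) : value.length();

        // Clamp both bounds so short values yield a shorter (or empty) result instead of throwing.
        begin = Math.min(Math.max(begin, 0), value.length());
        end = Math.min(Math.max(end, begin), value.length());
        return value.substring(begin, end);
    }

    public static void main(String[] args) {
        System.out.println(extract("HOTEL-12345-XYZ", "0,10"));
        System.out.println(extract("HOTEL-12345-XYZ", "5"));
    }
}
```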

I'd note also, that as you have a map in your object you may index the keys of the embedded map - i.e. an index attribute of "mapData.keySet[any]" - assuming that there's an appropriate getter for "mapData".

Don't forget that as you're testing, you can look at the local map stats  "indexedQueryCount" or look into the per-index stats to verify that the index and query you're testing are being matched by the query optimizer.

Cheers

Tom

sujit kumar

Dec 4, 2021, 2:54:21 AM
to Hazelcast
Hi, I have a use case where I store Key and Value as below, with no custom attributes. I want to implement a predicate to get Keys if value is matched.
KEY              Value
5131512000136    "51315120251"
5131513000101    "51315130497"

My Logic:
Predicate predicate = Predicates.equal("value", itemId);
Set<String> retailIds = itemMappingMap.keySet(predicate);

Exception: 

o.s.c.s.i.web.ExceptionLoggingFilter : Uncaught exception thrown

org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.lang.ClassCastException: [C cannot be cast to java.lang.Comparable

Is there any change needed in my predicate logic?

Thanks 

Sujit
