Computing tokens locally for parameters to issue fewer concurrent queries


Dorian Hoxha

Sep 27, 2016, 6:14:12 PM
to DataStax Python Driver for Apache Cassandra User Mailing List
Hi,

I want to make a query "where x IN (1,2,3,4,5.....)", but I read that it's better to issue several queries from your app than to have the coordinator node do it. So that means I have to do several "where x=y" queries.

But I'm thinking of a way to group the values "1,2,3,4,5" locally by their generated token/token range, so I can issue fewer queries like "where x IN (3,5)" where it's guaranteed that 3 and 5 are both in the same token range/node.

Does something like this already exist? (Searching suggests it doesn't.)
Is there a function I can use to compute the token from a tuple of parameters?

Thank You

Alan Boudreault

Sep 28, 2016, 4:56:38 PM
to python-dr...@lists.datastax.com
Hello Dorian,

Briefly, you would need a large number of keys to see a benefit from this, and it also means more computation on the client. I don't know how many keys you are currently requesting, but a good and simple approach could be:

- Use a prepared statement with a simple PK filter (select * from t where pk = ?), rather than an IN clause.
- Use TokenAwarePolicy, so prepared statements are automatically routed to one of their replicas.
- Send multiple requests concurrently (see the sketch just below).
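
For example, a minimal sketch of that approach (keyspace, table and column names are placeholders based on the query above, not anything from your schema):

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy
from cassandra.concurrent import execute_concurrent_with_args

# Token-aware routing sends each prepared-statement execution straight to a replica.
cluster = Cluster(load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()))
session = cluster.connect('my_keyspace')

select = session.prepare("SELECT * FROM t WHERE pk = ?")

keys = [1, 2, 3, 4, 5]
# Runs the queries concurrently; returns a list of (success, rows-or-exception) pairs.
results = execute_concurrent_with_args(session, select, [(k,) for k in keys])
for success, result in results:
    if success:
        for row in result:
            print(row)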

If you really want to try your solution, you will have to generate all routing keys, find their associated replicas, and split your requests accordingly. Here are some pointers to generate a routing key and find its replica:
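
A rough sketch of that routing-key/replica step (assuming the default Murmur3 partitioner and reusing the prepared statement and keys from the sketch above):

token_map = cluster.metadata.token_map
groups = {}  # frozenset of replica Hosts -> keys owned by that replica set
for k in keys:
    # Binding the prepared statement yields the serialized routing key bytes.
    routing_key = select.bind((k,)).routing_key
    token = token_map.token_class.from_key(routing_key)  # e.g. Murmur3Token
    replicas = token_map.get_replicas('my_keyspace', token)
    groups.setdefault(frozenset(replicas), []).append(k)
# Each group can now be issued as one "SELECT * FROM t WHERE pk IN (...)" query.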


Regards,
Alan


Dorian Hoxha

Sep 29, 2016, 4:37:22 AM
to DataStax Python Driver for Apache Cassandra User Mailing List
@Alan
I will have up to 1000 queries. I was thinking of doing everything you mention, but ALSO reducing the 1000 "where x=y" queries to hopefully fewer "where x IN (...)" queries whose values are guaranteed to go to one token range/shard/node.
I'll check out your links, thanks.

Alan Boudreault

Sep 29, 2016, 9:02:18 AM
to python-dr...@lists.datastax.com
In that case, I'd say it's worth a try. I would also suggest that you randomly split your requests and benchmark to see how it goes (say, 10 requests of 100 keys). I'm interested to hear back from you with your findings.
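
For example, a hedged sketch of that kind of split-and-measure run (it reuses the session and table from the earlier sketch and binds each key chunk to a single IN marker; the chunk size is just the 100 mentioned above):

import time
from cassandra.concurrent import execute_concurrent_with_args

def chunks(seq, size):
    # Naive fixed-size chunking; replica-aware grouping could be plugged in here instead.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

in_select = session.prepare("SELECT * FROM t WHERE pk IN ?")
keys = list(range(1000))  # illustrative key set

start = time.perf_counter()
results = execute_concurrent_with_args(session, in_select, [(chunk,) for chunk in chunks(keys, 100)])
elapsed = time.perf_counter() - start
print("10 IN queries of 100 keys took %.3fs, %d failures" % (elapsed, sum(1 for ok, _ in results if not ok)))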


Max C

Sep 30, 2016, 4:38:56 PM
to python-dr...@lists.datastax.com
I have tried this in a 3-node cluster with RF=3, so in my case getting the correct tokens isn't important. From what I recall, you do see a performance gain by using IN (sorry, I can't remember the exact amount, and I certainly can't tell you what the optimal number of values inside the IN is, if there is one), but then you have a new issue: the values that come back aren't guaranteed to be in the same order as requested.

Ex:  If you have some function “objects = FooClass.load_many_by_id([1, 2, 3, 4, 5])”

if you do this:

from cassandra.concurrent import execute_concurrent

select_stmt = session.prepare("SELECT * FROM foo WHERE id = ?")
execute_concurrent(session, [(select_stmt, (1,)),
                             (select_stmt, (2,)),
                             (select_stmt, (3,)),
                             (select_stmt, (4,)),
                             (select_stmt, (5,))])

^^ results always come back in the same order as requested (1, 2, 3, 4, 5)

vs:

If you optimize it to this:

session.execute("SELECT * FROM foo WHERE id IN (1, 2, 3, 4, 5)")

^^ ... then rows come back in an unpredictable order, and you have to rearrange them back into the requested 1, 2, 3, 4, 5 order if your application requires it.
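
For example, a minimal reordering sketch (ValueSequence is the driver's wrapper for passing an IN list to a simple statement; the id column name just follows the example above):

from cassandra.query import ValueSequence

requested = [1, 2, 3, 4, 5]
rows = session.execute("SELECT * FROM foo WHERE id IN %s", [ValueSequence(requested)])
by_id = {row.id: row for row in rows}                  # rows may arrive in any order
ordered = [by_id[i] for i in requested if i in by_id]  # back to the requested order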

- Max




Dorian Hoxha

Sep 30, 2016, 4:58:28 PM
to python-dr...@lists.datastax.com
@Max C

Yeah, you make fewer network calls.
In my case, I don't care about the order.
You can still reorder on the client though.


Bhuvan Rawal

Sep 30, 2016, 7:30:07 PM
to python-dr...@lists.datastax.com
Hi Dorian,

I believe this is what you need here; I have tested it and it works. You can find the replica nodes for your keys and then group them for further queries:

token_map = session.cluster.metadata.token_map
token = token_map.token_class.from_key('partition_key')
replicas = token_map.get_replicas('keyspace', token)
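
One hedged caveat: from_key() hashes the serialized routing key bytes, not the raw Python value, so for non-text or composite partition keys it is safer to take the routing key from a bound prepared statement, for example:

routing_key = prepared_stmt.bind((some_key,)).routing_key  # prepared_stmt/some_key are placeholders
token = session.cluster.metadata.token_map.token_class.from_key(routing_key)
replicas = session.cluster.metadata.token_map.get_replicas('keyspace', token)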

