This policy provides round-robin queries over the nodes of the local data center. It also includes in the query plans returned a configurable number of hosts in the remote data centers, but those are always tried after the local nodes. In other words, this policy guarantees that no host in a remote data center will be queried unless no host in the local data center can be reached.
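A minimal, self-contained sketch of the ordering described above (illustrative only, not the driver's actual implementation): local hosts are rotated round-robin from query to query, and at most a configurable number of remote hosts are appended as a last resort.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a DC-aware round-robin query plan: all local-DC hosts first,
// rotated once per query, then at most `usedHostsPerRemoteDc` remote
// hosts. Class and method names here are illustrative.
public class DcAwarePlanSketch {
    private int index = 0; // rotation counter, advanced once per query

    public List<String> newQueryPlan(List<String> local, List<String> remote,
                                     int usedHostsPerRemoteDc) {
        List<String> plan = new ArrayList<>();
        int start = index++ % local.size();
        for (int i = 0; i < local.size(); i++) {
            plan.add(local.get((start + i) % local.size()));
        }
        // Remote hosts are only ever tried after every local host.
        for (int i = 0; i < Math.min(usedHostsPerRemoteDc, remote.size()); i++) {
            plan.add(remote.get(i));
        }
        return plan;
    }
}
```

Successive plans start at successive local hosts, but the remote tail stays at the end, which is the "only if no local host can be reached" guarantee in practice.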
- the iterator returned by the newQueryPlan method will first return the LOCAL replicas for the query (based on Statement.getRoutingKey()) if possible, i.e. if the query's getRoutingKey method doesn't return null and if Metadata.getReplicas(java.lang.String, java.nio.ByteBuffer) returns a non-empty set of replicas for that partition key. If no local replica can be found or successfully contacted, the rest of the query plan will fall back to the child policy.
--
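The quoted behavior can be sketched as follows (illustrative names, not the driver's real classes): the replicas for the statement's routing key come first, then the child policy's plan, with duplicates skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of the ordering the Javadoc describes for a token-aware
// wrapper: known replicas first, then the child policy's plan as the
// fallback, preserving order and skipping hosts already listed.
public class TokenAwarePlanSketch {
    public static List<String> newQueryPlan(Set<String> replicas,
                                            List<String> childPlan) {
        Set<String> plan = new LinkedHashSet<>();
        plan.addAll(replicas);   // replicas for the routing key tried first
        plan.addAll(childPlan);  // then fall back to the child policy's order
        return new ArrayList<>(plan);
    }
}
```

If the routing key is null or no replicas are known, the `replicas` set is empty and the plan degenerates to exactly the child policy's plan.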
You received this message because you are subscribed to the Google Groups "DataStax Java Driver for Apache Cassandra User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-us...@lists.datastax.com.
On Jan 25, 2016, at 1:17 PM, Jun Wu <wuxia...@hotmail.com> wrote:

Hi there,

I'm doing some research into the load balancing policies in the Java driver, but I feel confused about some questions:

1. RoundRobin is easy to understand: it doesn't consider the data center and just cycles through nodes in a round-robin fashion. However, for DCAwareRoundRobin, the API doc says:

"This policy provides round-robin queries over the nodes of the local data center. It also includes in the query plans returned a configurable number of hosts in the remote data centers, but those are always tried after the local nodes. In other words, this policy guarantees that no host in a remote data center will be queried unless no host in the local data center can be reached."
I'm very curious about its application. It says the policy won't consider the remote data center unless all local nodes are down. If that's the case, what's the purpose of the remote data center, safety only? For example, say I have 2 data centers with 2 nodes in each ("us-west" and "us-east") and I set the replication factor to "us-west": 1 and "us-east": 1, meaning one copy of the data goes to "us-west" and another copy to "us-east". All the nodes work fine. If I use DCAwareRoundRobin("us-west") to specify the local data center, then according to the explanation above, no data will be written to "us-east" unless both nodes in "us-west" are down. I can't see any practical purpose for this, as I assume the probability of all local nodes being down is very low. Also, since I specified a replication factor of 1 in the remote data center, does using DCAwareRoundRobin("us-west") mean no copy will be sent to the remote data center at all? Am I right on this?
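One point worth separating here: the replication factor controls where the cluster stores copies, while the load-balancing policy only controls which coordinator the driver contacts. With "us-west": 1 and "us-east": 1, a copy still lands in each data center no matter which policy the client uses. A toy placement model (not Cassandra's real token assignment, just an illustration of that independence):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of per-DC replica placement: for each data center, the
// first `rf` nodes walking the ring from the key's token slot hold a
// replica. The client's load-balancing policy plays no part here; it
// only picks the coordinator the driver talks to.
public class PlacementSketch {
    public static List<String> replicas(Map<String, List<String>> ringByDc,
                                        Map<String, Integer> rfByDc,
                                        int tokenSlot) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> dc : ringByDc.entrySet()) {
            List<String> ring = dc.getValue();
            int rf = rfByDc.get(dc.getKey());
            for (int i = 0; i < rf; i++) {
                result.add(ring.get((tokenSlot + i) % ring.size()));
            }
        }
        return result;
    }
}
```

So DCAwareRoundRobin("us-west") only means the coordinator is chosen from us-west; the coordinator then still replicates to us-east per the keyspace's replication settings.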
2. For multiple data centers, take a similar situation: 2 data centers with 3 nodes in each, nodes 1, 2, 3 in data center 1 and nodes 4, 5, 6 in data center 2. The replication factor is 2 in each data center. If I choose node 1 to be the coordinator, then when a write request comes in it knows which nodes in data center 1 to write to, based on the load balancing policy the cluster uses.
However, 2 more copies need to be written in the other data center. My question is: will the data be sent directly from the coordinator to each remote replica, or will the coordinator send the data to one of the nodes in data center 2 (probably called a remote coordinator), which then decides which nodes to forward it to? Because I'm doing experiments on Amazon EC2, the difference between sending 2 copies to the other data center directly and sending 1 copy to one node in data center 2 that then forwards copies to the other replicas matters a lot.
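The two options in the question differ precisely in cross-DC message count, which is simple arithmetic (Cassandra uses the forwarding scheme between data centers: the coordinator sends one copy to a replica in the remote data center, which relays it to the other replicas there over the cheaper in-DC links):

```java
// Back-of-the-envelope comparison of cross-DC traffic per write, given
// `remoteReplicas` copies needed in the remote data center.
public class CrossDcMessages {
    // Coordinator sends each remote replica its own copy.
    public static int direct(int remoteReplicas) {
        return remoteReplicas; // one cross-DC message per replica
    }

    // Coordinator forwards one copy to a single remote node, which
    // relays it to the remaining replicas inside that data center.
    public static int forwarded(int remoteReplicas) {
        return remoteReplicas > 0 ? 1 : 0; // one cross-DC message total
    }
}
```

On EC2, where inter-region bandwidth is the expensive part, that is the difference between 2 cross-region messages and 1 per write in this example.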
3. TokenAwarePolicy(DCAwareRoundRobin) is the default policy and, according to the self-learning course, it performs best of all. However, it seems a little hard to understand, and I'm wondering whether there is a detailed explanation of why it performs best. The following is from the API:
- the iterator returned by the newQueryPlan method will first return the LOCAL replicas for the query (based on Statement.getRoutingKey()) if possible, i.e. if the query's getRoutingKey method doesn't return null and if Metadata.getReplicas(java.lang.String, java.nio.ByteBuffer) returns a non-empty set of replicas for that partition key. If no local replica can be found or successfully contacted, the rest of the query plan will fall back to the child policy.

Does that mean the coordinator will calculate the token of the primary key and, based on that token, send the data directly to the node owning it?
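For reference, this is how the default policy is typically wired up explicitly with the 2.x/3.x Java driver. A configuration sketch only: the contact point and data center name are placeholders, and the builder-style DCAwareRoundRobinPolicy API assumes a recent 2.1+/3.x driver.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// Configuration sketch: TokenAwarePolicy wrapping a DC-aware child
// policy, making the default wrapping visible.
public class DefaultPolicySketch {
    public static Cluster build() {
        return Cluster.builder()
                .addContactPoint("127.0.0.1")          // placeholder contact point
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder()
                                .withLocalDc("us-west") // placeholder local DC
                                .build()))
                .build();
    }
}
```
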
Also, what happens with multiple data centers? With 2 data centers, does each data center have its own token ring? If using TokenAwarePolicy(DCAwareRoundRobin), does that mean the data can't be written to the remote data center, even if a replication factor of 1 and 1 has been specified?
Sorry for the overwhelming paragraphs. Hope to get some hints/answers.

Thanks!
Jun
--
Olivier Michallat
Driver & tools engineer, DataStax
On Jan 27, 2016, at 4:10 PM, Jun Wu <wuxia...@hotmail.com> wrote:

Thank you so much Vishy Kasar!

However, it's still not that clear to me, especially the coordinator on the client and server side. Could I take an example and dig deeper into this?

Let's say we have 2 data centers, DC1 and DC2; the figure can also be found here: https://docs.datastax.com/en/cassandra/1.2/cassandra/images/write_access_multidc_12.png

There are 10 nodes in each data center. We set the replication factor to 3 and 3 in each data center. Assume we have keyspace "student" and table "studentinfo", defined with (int id primary key, text name).

Here we only consider the write path, so we issue write queries from the client, and you can see that node 10 in DC1 has been chosen as coordinator. For example, we have 5 queries/rows to be written to this cluster:

Query 1: insert into student.studentinfo (id, name) values (1, 'Allen');
Query 2: insert into student.studentinfo (id, name) values (2, 'Alex');
Query 3: insert into student.studentinfo (id, name) values (3, 'Brandon');
Query 4: insert into student.studentinfo (id, name) values (4, 'Jess');
Query 5: insert into student.studentinfo (id, name) values (5, 'Ryan');

If we use the RoundRobin policy: for query 1, the query plan will return all the hosts, 1, 2, ... 10 in DC1 and 1, 2, ... 10 in DC2. If we assume node 1 is the first node chosen to write to, then (1, 'Allen') will be written to node 1. (Here, my question is: how is the first node to write to decided?)
For query 2, since node 1 was chosen first, (2, 'Alex') will be written to node 2 in round-robin fashion. For query 3, similarly, (3, 'Brandon') goes to node 3; for query 4, (4, 'Jess') to node 4; and for query 5, (5, 'Ryan') to node 5.

Then, according to our assumption of replication factor 3, 3 replicas should be written; how are the next 2 replicas written?
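The parenthesized question above (how the first node is decided) comes down to a per-policy counter. A minimal sketch, assuming a simple shared counter as in a plain round-robin policy; the real driver's starting value is an implementation detail:

```java
import java.util.List;

// Sketch of round-robin coordinator selection: one counter shared by
// all queries, so each new query plan starts at the next host, and
// consecutive queries land on consecutive hosts.
public class RoundRobinSketch {
    private int counter = 0;

    public String firstHost(List<String> hosts) {
        return hosts.get(counter++ % hosts.size());
    }
}
```

So there is nothing special about node 1 being "first"; it is simply wherever the counter happens to be when the first query arrives, and the counter wraps around after the last host.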
-- Jack Krupansky