replication factor error


Sindhuja Balaji

Oct 25, 2016, 11:44:00 PM
to spark-conn...@lists.datastax.com
I was getting the exception below, so I changed the replication factor for the keyspace and ran nodetool repair on it. The repair shows some errors in system.log. How do we fix those errors, or is the node repaired anyway?

Error 1:

Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_ONE (1 required but only 0 alive)
Error 2, on repair:
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #ee4bc900-9b28-11e6-997b-e5311ab44422 on edw_data_import/tr_otp_topic_obj_assmnt, [(-5243341462767007283,-5181027973334775002], 5580759236169,-1908636647698583176], (-2288645573410771832,-2277883262246275075], (-664127392650676409,-636015452303436947], (3577877081120993756,3579814815688904558]]] Validation failed in

Any help with fixing Error 2?

--
Thanks,
Sindhuja

Jim Hatcher

Oct 26, 2016, 9:50:55 AM
to spark-conn...@lists.datastax.com

Hi Sindhuja,


Let me speak to your first error.  Let's say your table is in a keyspace that is using a replication factor of 3.  That means that when the data was written, Cassandra will have tried to write it to three servers in the cluster.  Now, when you're querying the data, you're using a read consistency level of 1 (I know that because the error message cites "LOCAL_ONE").  To satisfy your query, Cassandra needs a response from just one of the three servers where the data exists to be able to consider this a good read.  In this case, none of the three servers was able to respond.


I don't think the issue you're having is that your data is inconsistent between nodes.  I think the problem you're having is that some (or all) of your Cassandra nodes are down or are being overloaded.


If the replication factor of your keyspace is actually 1, then this error is more likely to happen because there is only one copy of the data.  So, if that's the case, you can help address this error by increasing the replication factor of your keyspace to 2 or 3.
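
If you're not sure what the keyspace is set to right now, you can check it from cqlsh.  A quick sketch (I'm using the keyspace name from your error message, so adjust it if yours differs; the system_schema table only exists on Cassandra 3.x, and older versions keep this in system.schema_keyspaces):

    DESCRIBE KEYSPACE edw_data_import;

    SELECT keyspace_name, replication
      FROM system_schema.keyspaces
     WHERE keyspace_name = 'edw_data_import';

Either one shows you the replication strategy and the replication factor for the keyspace.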


I'm not sure exactly what is going on with your second exception.  It might be helpful if you included more of the error message.


Regarding your first message, can you include the following information:

1) What is the replication factor of your keyspace?

2) Can you show the output of a nodetool status?


Thanks,

Jim





Sindhuja Balaji

Oct 26, 2016, 9:55:44 AM
to spark-conn...@lists.datastax.com
Hi Jim,

1) What is the replication factor of your keyspace?  - Currently the replication factor is 1. Should I change it to 3?

2) Can you show the output of a nodetool status?  - I've attached the log file for your reference.





--
Thanks,
Sindhuja
log.txt

Jim Hatcher

Oct 26, 2016, 10:22:50 AM
to spark-conn...@lists.datastax.com

Sindhuja,


You could try increasing the replication factor to 2.  That means that when you run a query with a read consistency level of 1, the Spark Cassandra connector will have two chances to get the data before throwing that error.  Keep in mind that you'll be doubling the size of the data in your cluster.


I was asking for the output of the nodetool status because I was trying to get an idea of how many servers were in your cluster and whether they were up or down.  For instance, here is what I get when I run nodetool status:

[jhatcher@someserver1 ~]$ nodetool status
Datacenter: DataCenter1
===========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UN  10.0.0.7  486.55 GB  64      ?       a2307045-71f9-4d8f-bb30-147687d95dc0  RAC1
UN  10.0.0.6  772.68 GB  64      ?       e2b1a46a-b73a-465d-8b7a-3f2b8b3ac578  RAC1
UN  10.0.0.1  694.04 GB  64      ?       fb0c4a02-43cb-4631-be47-e5184cf00c86  RAC1
UN  10.0.0.3  633.85 GB  64      ?       e26cb2dc-c452-447e-b0c3-9de0f9b5e335  RAC1
UN  10.0.0.2  741.95 GB  64      ?       48483131-bbb8-49c2-8edd-60983985155d  RAC1
UN  10.0.0.5  663.09 GB  64      ?       d362278e-5f65-4428-beb6-779922f7e7f5  RAC1
UN  10.0.0.4  599.4 GB   64      ?       8f404aad-3a5d-4659-882f-239e331e071a  RAC1

You can see that I have 7 nodes and that they are all "UN" (which means Up and Normal).


Regarding your logs, I see this error:

ERROR [ValidationExecutor:2] 2016-10-25 21:05:38,832 CompactionManager.java:1320 - Cannot start multiple repair sessions over the same sstables


You have some other errors too, but I suspect they're being caused by trying to run two repairs simultaneously.


BTW, a quick way to see all the errors in a log is to do a command like this: cat logfile | grep ERROR


Jim





Sindhuja Balaji

Oct 26, 2016, 10:38:18 AM
to spark-conn...@lists.datastax.com
sindhuja.dhamodaran@cassandra104-01 ~ $ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.20.20.165  2 GB       256          ?       88e0164d-7834-4c77-9725-7df831568298  rack1
UN  10.20.20.166  2 GB       256          ?       715bc107-13c0-4aec-ad00-bc8b16a347d2  rack1
UN  10.20.20.58   2 GB       256          ?       0ced5d7b-1dc1-46ec-91a1-4aab7e36c042  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless.

This is what I see in the status.




--
Thanks,
Sindhuja

Jim Hatcher

Oct 26, 2016, 10:49:57 AM
to spark-conn...@lists.datastax.com

OK, so your nodes are all up (which is good!) and you have three nodes (which means you can at least go to a replication factor of 2).  You could also go to a replication factor of 3, but that would mean that if you lost a node, your cluster would be in trouble.


I found this article on increasing the replication factor:

https://docs.datastax.com/en/cql/3.1/cql/cql_using/update_ks_rf_t.html
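
As a rough sketch of what that change looks like here (assuming SimpleStrategy and the single data center shown in your nodetool status; adjust the keyspace name and the factor if yours differ), you'd run this in cqlsh:

    ALTER KEYSPACE edw_data_import
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

and then, one node at a time:

    nodetool repair edw_data_import

Running the repairs one node at a time should also keep you clear of the "Cannot start multiple repair sessions over the same sstables" error from your log.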

And here is an article on monitoring the progress of a repair:

http://stackoverflow.com/questions/25064717/how-do-i-know-if-nodetool-repair-is-finished


If going to a different replication factor doesn't help, you may have to look at adding more horsepower to your cluster, or throttling your Spark process.

Russell has a good video on tuning the Spark Cassandra connector here:

https://www.youtube.com/watch?v=cKIHRD6kUOc&list=PLm-EPIkBI3YoiA-02vufoEj4CgYvIQgIk&index=105


Jump to about minute 17.


Jim



Sindhuja Balaji

Oct 26, 2016, 10:45:15 PM
to spark-conn...@lists.datastax.com
Thank you, Jim. That really helped me get a good idea of what's going on.

I changed the replication factor to 3 and I am now seeing warning messages. What would be the best practice to resolve them? Do we need to set the value below to something higher?

        .set("spark.cassandra.output.batch.size.rows", "5120")

WARN  [SharedPool-Worker-4] 2016-09-29 10:45:07,294 BatchStatement.java:289 - Batch of prepared statements for [edw_data_import.tr_otp_topic_assmnt_201608] is of size 7618, exceeding specified threshold of 5120 by 2498.

WARN  [SharedPool-Worker-6] 2016-09-29 10:45:07,323 BatchStatement.java:289 - Batch of prepared statements for [edw_data_import.tr_otp_topic_passmnt_tmpl_201608] is of size 11882, exceeding specified threshold of 5120 by 6762.





--
Thanks,
Sindhuja

Jim Hatcher

Oct 27, 2016, 10:01:22 AM
to spark-conn...@lists.datastax.com

Sindhuja,


Here is an article regarding that (with an answer by Russell Spitzer -- who you should always listen to!):

http://stackoverflow.com/questions/27039398/datastax-enterprise-spark-cassandra-batch-size


I think the idea is that you either need to set spark.cassandra.output.batch.size.rows or set spark.cassandra.output.batch.size.bytes.


You might consider not setting spark.cassandra.output.batch.size.rows (which will tell the connector to look at the bytes setting instead) and then setting spark.cassandra.output.batch.size.bytes to some larger value (like 256K maybe?)
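
For instance, a minimal sketch of the SparkConf side (the host IP is just one of the nodes from your nodetool status, and 256K is only an illustrative starting point; exact defaults can vary by connector version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "10.20.20.165")
      // No spark.cassandra.output.batch.size.rows here, so the connector
      // falls back to the bytes-based limit below.
      .set("spark.cassandra.output.batch.size.bytes", "262144")  // 256K

Bigger batches mean fewer round trips but more work per request on the coordinator, so I'd increase the value gradually rather than jumping straight to something huge.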


Also, Russell mentions in his answer that you can adjust this setting in the cassandra.yaml: batch_size_warn_threshold_in_kb
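
If you'd rather raise the threshold on the server side, it's a one-line change in cassandra.yaml on each node (the 5120-byte threshold in your warnings corresponds to this setting's default of 5; 64 below is just an illustrative value), typically followed by a rolling restart:

    # cassandra.yaml
    batch_size_warn_threshold_in_kb: 64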


It's just a warning though.  I don't think it means that your writes are failing.


Jim

