HA cluster test query

Sarath Ambadas

Nov 6, 2019, 12:06:07 PM
to DataStax Node.js Driver for Apache Cassandra Mailing List
Hi
My 3-node Cassandra cluster is running on Kubernetes. To test the HA scenario, I took one of the nodes down (the pod status is shown as Unknown).
I was querying data from different tables (with and without data in them). I see a delay of 15 seconds whenever a different table is queried. If I query the same table again and again, I don't see any delay. Also, if there is no data in the table, I don't see any delay. I was initially testing with driver version 3.5. Later I tested with 4.3 as well.
I am attaching the sample program I tested with. Could you please review it and comment on what is happening, or on whether I need to change any configuration?

Thanks in advance.
Sarath

Sarath Ambadas

Nov 6, 2019, 12:20:28 PM
to nodejs-dr...@lists.datastax.com
file attached

--
To unsubscribe from this group and stop receiving emails from it, send an email to nodejs-driver-u...@lists.datastax.com.
script.zip

Jorge Bay Gondra

Nov 7, 2019, 3:39:15 AM
to nodejs-dr...@lists.datastax.com
Hi,
I would first try to simplify the code: leave the Client options at their defaults (just set contactPoints and localDataCenter), and do the same for the table creation options.
Then, you can use client.getState() to get the current state of the connection pools and use info level logging:

// Inside an async function: wait for the connection before reading the pool state
await client.connect();
console.log(`Connected! Current state: ${client.getState().toString()}`);

For example, logging to console without any additional library:

client.on('log', (level, className, message, furtherInfo) => {
  if (level === 'verbose') return;
  console.log(new Date(), level, className, message);
});


That information should help you understand what is happening under the hood.
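
To go one step beyond toString(), the state object can be turned into a small per-host summary. This is a minimal sketch, assuming the object returned by client.getState() exposes getConnectedHosts(), getOpenConnections(host) and getInFlightQueries(host) as described in the driver's ClientState documentation; summarizeState is a hypothetical helper name, not part of the driver.

```javascript
// Summarize the pool state returned by client.getState().
// Assumption: the state object exposes getConnectedHosts(),
// getOpenConnections(host) and getInFlightQueries(host).
function summarizeState(state) {
  return state.getConnectedHosts().map((host) => ({
    address: host.address,
    openConnections: state.getOpenConnections(host),
    inFlightQueries: state.getInFlightQueries(host),
  }));
}

// Usage (needs a connected client, inside an async function):
//   await client.connect();
//   console.table(summarizeState(client.getState()));
```

With one node down, that host would be expected to be missing from the summary, which makes it easy to compare the pool state before and after a slow query.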
Thanks,
Jorge

Sarath Ambadas

Nov 7, 2019, 1:03:32 PM
to nodejs-dr...@lists.datastax.com
Thanks very much for the reply, Jorge. Initially I used that configuration because it is what my app uses.

When I enabled the logs earlier, I was seeing connection-related errors, and there was a delay of 15 seconds whenever a new table with data was queried. I simplified the configuration settings and tested again. I now see connection timeout errors when the driver tries to connect to the node which is down, for every new table queried. I also see a delay of 10 seconds for the first query of every new table, as shown below:

Thu Nov 07 2019 17:34:03 GMT+0000 (UTC)    keyspace1.mytable1(1)
2019-11-07T17:34:03.889Z 'info' 'Connection' 'Connecting to 172.30.3.163:9042'
2019-11-07T17:34:13.897Z 'warning' 'Connection' 'There was an error when trying to connect to the host 172.30.3.163'
2019-11-07T17:34:13.897Z 'warning' 'HostConnectionPool' 'Connection to 172.30.3.163:9042 could not be created: DriverError: Connection timeout'
2019-11-07T17:34:13.897Z 'warning' 'HostConnectionPool' 'Connection pool to host 172.30.3.163:9042 could not be created'
2019-11-07T17:34:13.906Z 'SARATH   NUMBER OF RECORDS FOUND ::: 1'

I am also attaching the logs generated and the sample script.

My question is why the first query against each table goes to the node which is down, and whether we can avoid the delay.
I thought there would be some background thread/process which checks whether the node is up again and adds it back to the list of hosts so that it can be used again.
In my production scenario I need to query a lot of tables, and this delay is affecting the response times.

Thanks
Sarath

newScriptAndLogs.zip

Jorge Bay Gondra

Nov 8, 2019, 3:35:52 AM
to nodejs-dr...@lists.datastax.com
Hi,
The driver attempts to reconnect to the nodes that are down in the background; that is not affecting your queries.

What appears to be occurring in your case is that you are changing the schema with a node down. With Cassandra, you usually want to change the schema when all your nodes are up. It's unlikely that you want to change the schema definition while part of the cluster is failing.

As Cassandra is a distributed DB, the schema must be uniform across all nodes. The driver has a mechanism to wait for all the nodes to be "in agreement" on the schema.

Here's some info on the java driver docs, the same applies to the Node.js driver: https://docs.datastax.com/en/developer/java-driver/4.2/manual/core/metadata/schema/#schema-agreement

In summary, when testing HA, only use DML queries (insert/update/delete/select) and not DDL.
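
Whether a DDL statement reached schema agreement can be checked on its result. This is a minimal sketch, assuming the driver's ResultSet exposes info.isSchemaInAgreement as described in the 4.x driver docs; executeDdl is a hypothetical helper name, and client is any connected Client instance.

```javascript
// Run one DDL statement and report whether the driver saw schema agreement.
// Assumption: the ResultSet exposes info.isSchemaInAgreement (boolean).
async function executeDdl(client, ddl) {
  const result = await client.execute(ddl);
  const agreed = result.info.isSchemaInAgreement === true;
  if (!agreed) {
    // With a node down, the driver may give up waiting before all nodes agree.
    console.warn(`No schema agreement after: ${ddl}`);
  }
  return agreed;
}

// Usage (inside an async function, with a connected client):
//   const agreed = await executeDdl(client, 'CREATE TABLE ...');
```

How long the driver waits before giving up should be governed by the protocolOptions.maxSchemaAgreementWaitSeconds client option; treat that option name as something to verify against the docs for your driver version.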

Thanks,
Jorge

Sarath Ambadas

Nov 11, 2019, 1:36:39 PM
to DataStax Node.js Driver for Apache Cassandra Mailing List
Using a cluster with one node down and this sample program for testing purposes, I created the keyspace, tables and data and then queried the data.
When I did all of the above steps, I saw the delay while querying a table for the first time. I then commented out the code for keyspace/table creation and ran only the querying part, in which case I did not see the delay or the timeout errors.

In production, when my application starts, the Cassandra cluster has all nodes up; the application creates the tables and everything works fine.
When one of the nodes goes down, I still see the error occurring while querying these tables. I don't think we are changing schemas after the node is down.
I can double-check this step and confirm. But I do see the connection timeout errors and a delay of 10 seconds.
Let me check whether we execute any DDL.

Thanks
Sarath

Jorge Bay Gondra

Nov 12, 2019, 7:33:11 AM
to nodejs-dr...@lists.datastax.com
Depending on how your nodes are going down (how the process is being terminated), it might be surfaced on the client side as a ReadTimeout or WriteTimeout: the node selected as coordinator didn't get a reply from the replica node in time.

You can check out these old blog posts for more context:

You can tackle long latency spikes using speculative executions: https://docs.datastax.com/en/developer/nodejs-driver/4.3/features/speculative-executions/
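
A minimal sketch of what enabling speculative executions could look like, assuming the driver exposes policies.speculativeExecution.ConstantSpeculativeExecutionPolicy(delay, maxSpeculativeExecutions) and the isIdempotent query option as described on the linked page. buildSpeculativeOptions is a hypothetical helper name, the 200 ms delay and the limit of 2 are illustrative values only, and the driver module is passed in as a parameter so the sketch does not hard-code how it is loaded.

```javascript
// Build client options that enable speculative executions.
// `driver` is the module returned by require('cassandra-driver').
function buildSpeculativeOptions(driver, contactPoints, localDataCenter) {
  const { ConstantSpeculativeExecutionPolicy } = driver.policies.speculativeExecution;
  return {
    contactPoints,
    localDataCenter,
    policies: {
      // Illustrative values: start another attempt after 200 ms,
      // with at most 2 speculative attempts per query.
      speculativeExecution: new ConstantSpeculativeExecutionPolicy(200, 2),
    },
  };
}

// Only queries marked as idempotent are candidates for speculative execution.
const readOptions = { isIdempotent: true };

// Usage (inside an async function):
//   const cassandra = require('cassandra-driver');
//   const client = new cassandra.Client(
//     buildSpeculativeOptions(cassandra, ['10.44.0.14'], 'datacenter1'));
//   await client.execute(query, params, readOptions);
```

Speculative executions send additional requests to other nodes, so they trade extra load on the cluster for lower tail latency; the delay is worth tuning against measured latencies rather than copied from this sketch.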



Sarath Ambadas

Nov 14, 2019, 4:24:27 PM
to nodejs-dr...@lists.datastax.com
I checked, and we don't make any schema changes each time. I did another cluster installation, and these are the tests I performed. I created the keyspace with SimpleStrategy and a replication factor of 3.
I am not using any DDL as part of my HA tests.

I took a Cassandra cluster with 3 nodes, each node running on its own VM. I created the keyspace and tables (5 tables) and populated them with data while all the nodes were up and running.
All DDL statements were executed while the cluster was functioning normally.

Test case 1: I ran select queries (reads only) on these tables while the cluster was healthy, and everything worked fine without delay.

I shut down one VM, and one of the Cassandra nodes was marked as down. Then I tested the same select queries:

saraths-mbp:issue2918_3855 sambadas5$ k exec -it reaa2e6a71d-apiconnect-cc-2 -- nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.44.0.14  16.73 MiB  256          100.0%            e71d2478-8444-41fa-82d2-fa4e35198448  rack1
DN  10.36.0.16  11.58 MiB  256          100.0%            2d7de9ae-b5df-42a4-b8ab-5cbe7ccca672  rack1
UN  10.32.0.15  8.66 MiB   256          100.0%            f310bb9a-946f-4eac-a8ff-a4b579b0d6c6  rack1

I started executing the same queries (reads only), and each query takes around 2.5 to 3 seconds to return its output.

My use case needs to query multiple tables for each API call, so the delay adds up.

I went through the links you shared in the previous update. Does using speculative query execution affect overall performance? Is the delay caused by my configuration, or is it some other issue?

Thanks
sarath

Sarath Ambadas

Nov 14, 2019, 6:15:14 PM
to nodejs-dr...@lists.datastax.com
scripts attached
scripts_ddl_dml.zip

Jorge Bay Gondra

Nov 15, 2019, 4:59:20 AM
to nodejs-dr...@lists.datastax.com
Hi,
You can use stackoverflow.com for troubleshooting specific scripts: create a minimal code sample and describe the expected behavior and what you are getting.

Thanks,
Jorge

Sarath Ambadas

Nov 15, 2019, 1:52:30 PM
to nodejs-dr...@lists.datastax.com
Thanks Jorge.

Sarath Ambadas

Nov 18, 2019, 1:56:24 PM
to nodejs-dr...@lists.datastax.com
I see this issue only when the prepare flag is set to true in the query options while querying data with one of the nodes down.
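
One way to reproduce that observation is to time the same query with and without the prepare flag. This is a minimal sketch, not a diagnosis: timeQuery and comparePrepare are hypothetical helper names, and client is assumed to be any object with a promise-based execute(query, params, options), such as a connected driver Client.

```javascript
// Time a single query and return the elapsed time in milliseconds.
async function timeQuery(client, query, params, options) {
  const start = process.hrtime.bigint();
  await client.execute(query, params, options);
  return Number(process.hrtime.bigint() - start) / 1e6;
}

// Run the same query first unprepared, then prepared, and report both timings.
async function comparePrepare(client, query, params = []) {
  const unpreparedMs = await timeQuery(client, query, params, { prepare: false });
  const preparedMs = await timeQuery(client, query, params, { prepare: true });
  return { unpreparedMs, preparedMs };
}

// Usage (inside an async function, with a connected client):
//   console.log(await comparePrepare(client, 'SELECT * FROM keyspace1.mytable1'));
```

Only the first prepared execution of a given query should include the preparation step, so the comparison is most telling on a freshly connected client. The driver also documents a prepareOnAllHosts client option; whether it is related to this delay is not confirmed anywhere in this thread.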