Help troubleshooting Timeouts


Steve Hall

Mar 20, 2018, 7:02:06 PM
to DataStax Java Driver for Apache Cassandra User Mailing List
We have an 18-node Cassandra cluster running on AWS.  The cluster has been running for quite some time.  We now have a situation where the client times out on some queries, and we are looking for ideas on how to track this problem down.

The general situation is as follows:

We have many tables, but two tables are primarily involved in this issue. The application performs an in-application join, per advice we received from DataStax engineers.

A user has interests: essentially a wide table with a partition key of userId and a list of the associated partition keys for the user's interests, which are the partition keys of the interests table.  The interests table is where the metadata for each interest resides.  This design helps us avoid having to update thousands of records in fully materialized views when the metadata for a particular interest changes.

We resolve the metadata for a user's interests by performing a second query.  The second query is the one that is now timing out.
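Roughly, the lookup flow looks like this (a sketch against the 3.x Java driver; the user_interests table and column names here are illustrative rather than our exact schema, and session/userId are assumed to exist):

  import java.util.List;
  import java.util.UUID;
  import com.datastax.driver.core.PreparedStatement;
  import com.datastax.driver.core.Row;

  // Step 1: fetch the list of interest ids for the user (the wide row keyed by user id).
  PreparedStatement userStmt = session.prepare(
      "SELECT interest_ids FROM user_interests WHERE user_id = ?");
  Row userRow = session.execute(userStmt.bind(userId)).one();
  List<UUID> interestIds = userRow.getList("interest_ids", UUID.class);

  // Step 2: resolve the metadata for those ids against the interests table. Today that
  // is the single IN query shown below, and it is the one that times out.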

That query looks like this, with 200 specific interests to fetch (I have edited out many of the ids to keep this succinct):

select * from interests where interest_id in (a1d82d54-136e-4032-8ed6-6a233e13f0c2, 38ffa7fb-747b-465c-ae68-eb4652966ce9, f3c40c79-4e93-43f6-9029-48cb6554588a, 859b798f-cc36-4961-9165-345c6c8e4363, f16f4fcd-2df2-4127-b1e8-83e45bcfebdb, 06411ad8-aff1-429f-855a-3816c87da6e1, 635ca1cb-1708-44fd-9050-a3e2014e2735);

The metadata for any interest is quite small:  13 attributes and fewer than 500 characters total.

I can run this query on one node using the CQL shell and it returns nearly instantly.  It times out from the application, where other nodes may be acting as the coordinator.

What changed, and what we believe to be the root cause: a team used a REST endpoint to update the interest metadata.  The endpoint was written to perform an upsert, and this created tombstones for every interest.  That is when it suddenly got slow.
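For context, one common way a prepared upsert creates tombstones is by binding null for columns that are not supplied; each such write leaves a cell tombstone behind. A hedged sketch of the difference (driver 3.x; the name/description columns are placeholders, not our real schema):

  import java.util.UUID;
  import com.datastax.driver.core.PreparedStatement;

  PreparedStatement upsert = session.prepare(
      "INSERT INTO interests (interest_id, name, description) VALUES (?, ?, ?)");

  // Binding null explicitly writes a tombstone for the description cell:
  session.execute(upsert.bind(interestId, "cycling", null));

  // Leaving the column unbound (protocol v4+) leaves it unset -- no tombstone:
  session.execute(upsert.bind()
      .setUUID("interest_id", interestId)
      .setString("name", "cycling"));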

The interests table is a narrow row: a partition key and 13 fields of data.

We have compacted the tables but we are still getting timeouts.

CfStats for the table look like this:

nodetool cfstats interests.interests
Keyspace: interests
Read Count: 303758334
Read Latency: 0.11072883718476016 ms.
Write Count: 7792
Write Latency: 0.031202130390143735 ms.
Pending Tasks: 0
Table: interests
SSTable count: 2
Space used (live), bytes: 7736156
Space used (total), bytes: 7736156
Off heap memory used (total), bytes: 70920
SSTable Compression Ratio: 0.4049112590236776
Number of keys (estimate): 46336
Memtable cell count: 325
Memtable data size, bytes: 100555
Memtable switch count: 114
Local read count: 303758334
Local read latency: 0.104 ms
Local write count: 7792
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 61357
Bloom filter false ratio: 0.00651
Bloom filter space used, bytes: 58752
Bloom filter off heap memory used, bytes: 58736
Index summary off heap memory used, bytes: 10136
Compression metadata off heap memory used, bytes: 2048
Compacted partition minimum bytes: 259
Compacted partition maximum bytes: 1331
Compacted partition mean bytes: 409
Average live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 0.0


Any ideas on how to approach this problem?


Nate McCall

Mar 20, 2018, 8:00:21 PM
to java-dri...@lists.datastax.com
On Wed, Mar 21, 2018 at 12:02 PM, Steve Hall <stevejh...@gmail.com> wrote:

That query looks like this, with 200 specific interests to fetch (I have edited out many of the id's to keep this succinct):

select * from interests where interest_id in (a1d82d54-136e-4032-8ed6-6a233e13f0c2, 38ffa7fb-747b-465c-ae68-eb4652966ce9, f3c40c79-4e93-43f6-9029-48cb6554588a, 859b798f-cc36-4961-9165-345c6c8e4363, f16f4fcd-2df2-4127-b1e8-83e45bcfebdb, 06411ad8-aff1-429f-855a-3816c87da6e1, 635ca1cb-1708-44fd-9050-a3e2014e2735);

You are putting a lot of strain on the coordinator: with this syntax you cannot take advantage of the driver routing each request to the owning replica.

The output of nodetool proxyhistograms should show this, as will the coordinatorReadLatency statistics for the table in a monitoring system.
Fire off the selects individually and asynchronously within a loop, then roll up the results via Guava's Futures or similar.
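Something along these lines (a sketch against the 3.x driver and Guava; prepare the statement once at startup rather than per call, and throttle the fan-out if 200 concurrent requests turns out to be too much):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.UUID;
  import com.datastax.driver.core.PreparedStatement;
  import com.datastax.driver.core.ResultSet;
  import com.datastax.driver.core.Row;
  import com.datastax.driver.core.Session;
  import com.google.common.util.concurrent.Futures;
  import com.google.common.util.concurrent.ListenableFuture;

  // interestStmt = session.prepare("SELECT * FROM interests WHERE interest_id = ?")
  List<Row> fetchInterests(Session session, PreparedStatement interestStmt,
                           List<UUID> interestIds) throws Exception {
      // One async SELECT per id, so a token-aware policy can route each to an owning replica.
      List<ListenableFuture<ResultSet>> futures = new ArrayList<>();
      for (UUID id : interestIds) {
          futures.add(session.executeAsync(interestStmt.bind(id)));
      }
      // allAsList fails if any query fails; successfulAsList would drop failures instead.
      List<Row> rows = new ArrayList<>();
      for (ResultSet rs : Futures.allAsList(futures).get()) {
          rows.add(rs.one());
      }
      return rows;
  }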

Also, given the min/max/mean partition sizes on this table, turn the compression chunk size for the column family down to something like 4 KB (probably matching the disk block size). If you don't, you are most definitely over-reading, as the default is 64 KB.
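For example, via the driver (assuming the default LZ4 compressor; note the option names differ by version: 'class' + 'chunk_length_in_kb' on 3.x versus 'sstable_compression' + 'chunk_length_kb' on 2.x):

  session.execute(
      "ALTER TABLE interests.interests WITH compression = "
      + "{'class': 'LZ4Compressor', 'chunk_length_in_kb': '4'}");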


--
-----------------
Nate McCall
Wellington, NZ
@zznate

CTO
Apache Cassandra Consulting
http://www.thelastpickle.com

Steve Hall

Mar 21, 2018, 5:17:54 PM
to DataStax Java Driver for Apache Cassandra User Mailing List
Thank you, Nate.  There is a lot of good advice in that response.  You are in fact correct: nodetool proxyhistograms confirms your suggestion, so we have some work to do.  The block size is also a very good suggestion.

Having said that, what we now understand more fully is:
1. Users updated nearly every interest via a REST call.  This call results in an upsert on Cassandra.
2. We had a cache in front of the service.
3. When users did not immediately see the change, they made the call 10 times for every interest.
4. Fortunately the cache expired and they started to see their new data, so they stopped.  Otherwise they would still be creating tombstones.
5. Compact was run.
6. Repair was run.
7. Database performance for these queries is much worse than it had been, and in fact they are now timing out.
8. Developers have increased the driver timeout beyond 2 seconds.  Still a problem.
9. The read is set for LOCAL_QUORUM.  Yes, I too wonder why that makes any sense in light of us having a cache.  (See the sketch after this list for how both the timeout and the consistency level are set on the driver.)
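
For reference, this is roughly how those two client-side settings are configured (driver 3.x API; the contact point and timeout value are placeholders):

  import com.datastax.driver.core.Cluster;
  import com.datastax.driver.core.ConsistencyLevel;
  import com.datastax.driver.core.QueryOptions;
  import com.datastax.driver.core.SocketOptions;

  Cluster cluster = Cluster.builder()
      .addContactPoint("10.0.0.1")                             // placeholder contact point
      .withSocketOptions(new SocketOptions()
          .setReadTimeoutMillis(5000))                         // raised from the original 2s (item 8)
      .withQueryOptions(new QueryOptions()
          .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)) // reads at LOCAL_QUORUM (item 9)
      .build();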

Any thoughts on what we can do right now, while the team tackles the refactoring?

Nate McCall

Mar 21, 2018, 5:57:23 PM
to java-dri...@lists.datastax.com
Given we are more into the realm of tuning, this may be more appropriate for, and get a wider audience on, us...@c.a.o at this point, but for expediency's sake...

Unfortunately, whole-row deletes won't show up in the tombstone counters in tablestats output until 3.11.2, so you can't really see the immediate impact of what you are experiencing from the data model perspective.

It looks like you are using the default compaction strategy. Set that to LCS and you will probably have to take the gloves off on dealing with the tombstones as well, so an alter might look like:

  ALTER TABLE interests.interests WITH
    gc_grace_seconds = 3600 AND
    compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '256',
                  'unchecked_tombstone_compaction': 'true', 'tombstone_compaction_interval': '3600'};

These options are pretty dangerous for consistency's sake long term. They are *only* to get you out of the woods. What we've done:
- switch to LCS with aggressive tombstone pruning
- sstable size is set to 256 MB to try to capture more upserts per level
- SSTables are eligible for tombstone-only compaction an hour after they are created
- gc_grace has been turned down to 1 hour (this is now effectively your recovery window in the case of a node failure; there are caveats, but this is the major takeaway)

These will cause *a lot* of disk IO when implemented. You may want to roll them around the cluster one node at a time via jmxterm invoking setCompactionParametersJson (depending on version). An example of doing this (but with TWCS)  can be found here: http://thelastpickle.com/blog/2017/01/10/twcs-part2.html

Test these on a staging cluster. I'm making a bunch of guesstimates based on previous experience, so YMMV.
