Help troubleshooting Timeouts


Steve Hall

Mar 20, 2018, 7:02:06 PM
to DataStax Java Driver for Apache Cassandra User Mailing List
We have an 18-node Cassandra cluster running on AWS.  The cluster has been running for quite some time.  We now have a situation where the client times out on some queries, and we are looking for ideas on how to track this problem down.

The general situation is as follows:

We have many tables, but two tables are primarily involved in this issue. The application performs an in-application join, per advice we received from DataStax engineers.

A user has interests: essentially a wide table with a partition key of userId and a list of the associated partition keys for the user's interests, which are the partition keys of the interests table.  The interests table is where the metadata for each interest resides.  This design helps us avoid having to update thousands of records in fully materialized views when the metadata for a particular interest changes.

We resolve the metadata for a user's interests by performing a second query.  The second query is the one that is now timing out.
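Roughly, the lookup flow looks like this (a sketch against the 3.x Java driver; the user_interests table and column names here are illustrative rather than our exact schema, and session/userId are assumed to exist):

  import java.util.List;
  import java.util.UUID;
  import com.datastax.driver.core.PreparedStatement;
  import com.datastax.driver.core.Row;

  // Step 1: fetch the list of interest ids for the user (the wide row keyed by user id).
  PreparedStatement userStmt = session.prepare(
      "SELECT interest_ids FROM user_interests WHERE user_id = ?");
  Row userRow = session.execute(userStmt.bind(userId)).one();
  List<UUID> interestIds = userRow.getList("interest_ids", UUID.class);

  // Step 2: resolve the metadata for those ids against the interests table. Today that
  // is the single IN query shown below, and it is the one that times out.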

That query looks like this, with 200 specific interests to fetch (I have edited out many of the ids to keep this succinct):

select * from interests where interest_id in (a1d82d54-136e-4032-8ed6-6a233e13f0c2, 38ffa7fb-747b-465c-ae68-eb4652966ce9, f3c40c79-4e93-43f6-9029-48cb6554588a, 859b798f-cc36-4961-9165-345c6c8e4363, f16f4fcd-2df2-4127-b1e8-83e45bcfebdb, 06411ad8-aff1-429f-855a-3816c87da6e1, 635ca1cb-1708-44fd-9050-a3e2014e2735);

The metadata for any interest is quite small:  13 attributes and fewer than 500 characters total.

I can run this query on one node using the CQL shell and it returns nearly instantly.  It times out from the application, where other nodes may be acting as the coordinator.

What changed, and what we believe to be the root cause: a team used a REST endpoint to update the interest metadata.  The endpoint was written to perform an upsert, and this created tombstones for every interest.  That is when it suddenly got slow.
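For context, one common way a prepared upsert creates tombstones is by binding null for columns that are not supplied; each such write leaves a cell tombstone behind. A hedged sketch of the difference (driver 3.x; the name/description columns are placeholders, not our real schema):

  import java.util.UUID;
  import com.datastax.driver.core.PreparedStatement;

  PreparedStatement upsert = session.prepare(
      "INSERT INTO interests (interest_id, name, description) VALUES (?, ?, ?)");

  // Binding null explicitly writes a tombstone for the description cell:
  session.execute(upsert.bind(interestId, "cycling", null));

  // Leaving the column unbound (protocol v4+) leaves it unset -- no tombstone:
  session.execute(upsert.bind()
      .setUUID("interest_id", interestId)
      .setString("name", "cycling"));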

The interests table is a narrow row: a partition key and 13 fields of data.

We have compacted the tables but we are still getting timeouts.

CfStats for the table look like this:

nodetool cfstats interests.interests
Keyspace: interests
Read Count: 303758334
Read Latency: 0.11072883718476016 ms.
Write Count: 7792
Write Latency: 0.031202130390143735 ms.
Pending Tasks: 0
Table: interests
SSTable count: 2
Space used (live), bytes: 7736156
Space used (total), bytes: 7736156
Off heap memory used (total), bytes: 70920
SSTable Compression Ratio: 0.4049112590236776
Number of keys (estimate): 46336
Memtable cell count: 325
Memtable data size, bytes: 100555
Memtable switch count: 114
Local read count: 303758334
Local read latency: 0.104 ms
Local write count: 7792
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 61357
Bloom filter false ratio: 0.00651
Bloom filter space used, bytes: 58752
Bloom filter off heap memory used, bytes: 58736
Index summary off heap memory used, bytes: 10136
Compression metadata off heap memory used, bytes: 2048
Compacted partition minimum bytes: 259
Compacted partition maximum bytes: 1331
Compacted partition mean bytes: 409
Average live cells per slice (last five minutes): 1.0
Average tombstones per slice (last five minutes): 0.0


Any ideas on how to approach this problem?


Nate McCall

Mar 20, 2018, 8:00:21 PM
to java-dri...@lists.datastax.com
On Wed, Mar 21, 2018 at 12:02 PM, Steve Hall <stevejh...@gmail.com> wrote:

That query looks like this, with 200 specific interests to fetch (I have edited out many of the id's to keep this succinct):

select * from interests where interest_id in (a1d82d54-136e-4032-8ed6-6a233e13f0c2, 38ffa7fb-747b-465c-ae68-eb4652966ce9, f3c40c79-4e93-43f6-9029-48cb6554588a, 859b798f-cc36-4961-9165-345c6c8e4363, f16f4fcd-2df2-4127-b1e8-83e45bcfebdb, 06411ad8-aff1-429f-855a-3816c87da6e1, 635ca1cb-1708-44fd-9050-a3e2014e2735);

You are putting a lot of strain on the coordinator: with this syntax you cannot take advantage of the driver routing each request to the owning replica.

The output of nodetool proxyhistograms should show this, as will the coordinatorReadLatency statistics for the table in a monitoring system.
Fire off the selects individually and asynchronously within a loop, then roll up the results via Guava's Futures or similar.
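Something along these lines (a sketch against the 3.x driver and Guava; prepare the statement once at startup rather than per call, and throttle the fan-out if 200 concurrent requests turns out to be too much):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.UUID;
  import com.datastax.driver.core.PreparedStatement;
  import com.datastax.driver.core.ResultSet;
  import com.datastax.driver.core.Row;
  import com.datastax.driver.core.Session;
  import com.google.common.util.concurrent.Futures;
  import com.google.common.util.concurrent.ListenableFuture;

  // interestStmt = session.prepare("SELECT * FROM interests WHERE interest_id = ?")
  List<Row> fetchInterests(Session session, PreparedStatement interestStmt,
                           List<UUID> interestIds) throws Exception {
      // One async SELECT per id, so a token-aware policy can route each to an owning replica.
      List<ListenableFuture<ResultSet>> futures = new ArrayList<>();
      for (UUID id : interestIds) {
          futures.add(session.executeAsync(interestStmt.bind(id)));
      }
      // allAsList fails if any query fails; successfulAsList would drop failures instead.
      List<Row> rows = new ArrayList<>();
      for (ResultSet rs : Futures.allAsList(futures).get()) {
          rows.add(rs.one());
      }
      return rows;
  }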

Also, given the min/max/mean partition sizes on this table, turn the compression chunk size for the column family down to something like 4 KB (probably matching the disk block size). If you don't, you are most definitely over-reading, as the default is 64 KB.
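For example, via the driver (assuming the default LZ4 compressor; note the option names differ by version: 'class' + 'chunk_length_in_kb' on 3.x versus 'sstable_compression' + 'chunk_length_kb' on 2.x):

  session.execute(
      "ALTER TABLE interests.interests WITH compression = "
      + "{'class': 'LZ4Compressor', 'chunk_length_in_kb': '4'}");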


--
-----------------
Nate McCall
Wellington, NZ
@zznate

CTO
Apache Cassandra Consulting
http://www.thelastpickle.com

Steve Hall

Mar 21, 2018, 5:17:54 PM
to DataStax Java Driver for Apache Cassandra User Mailing List
Thank you, Nate.  There is a lot of good advice in that response.  You are in fact correct: nodetool proxyhistograms confirms your suggestion, so we have some work to do.  The block size is also a very good suggestion.

Having said that, what we now understand more fully is:
1. Users updated nearly every interest via a REST call.  This call results in an upsert on Cassandra.
2. We had a cache in front of the service.
3. When users did not immediately see the change, they made the call 10 times for every interest.
4. Fortunately the cache expired and they started to see their new data, so they stopped.  Otherwise they would still be creating tombstones.
5. Compact was run.
6. Repair was run.
7. Database performance for these queries is much worse than it had been, and in fact they are now timing out.
8. Developers have increased the driver timeout beyond 2 seconds.  Still a problem.
9. The read is set for LOCAL_QUORUM.  Yes, I too wonder why that makes any sense in light of us having a cache.  (See the sketch after this list for how both the timeout and the consistency level are set on the driver.)
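
For reference, this is roughly how those two client-side settings are configured (driver 3.x API; the contact point and timeout value are placeholders):

  import com.datastax.driver.core.Cluster;
  import com.datastax.driver.core.ConsistencyLevel;
  import com.datastax.driver.core.QueryOptions;
  import com.datastax.driver.core.SocketOptions;

  Cluster cluster = Cluster.builder()
      .addContactPoint("10.0.0.1")                             // placeholder contact point
      .withSocketOptions(new SocketOptions()
          .setReadTimeoutMillis(5000))                         // raised from the original 2s (item 8)
      .withQueryOptions(new QueryOptions()
          .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)) // reads at LOCAL_QUORUM (item 9)
      .build();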

Any thoughts on what we can do right now, while the team tackles the refactoring?

Nate McCall

Mar 21, 2018, 5:57:23 PM
to java-dri...@lists.datastax.com
Given we are more into the realm of tuning, this may be more appropriate for, and get a wider audience on, us...@c.a.o at this point, but for expediency's sake...

Unfortunately, whole-row deletes won't show up in the tombstone counters in tablestats output until 3.11.2, so you can't really see the immediate impact of what you are experiencing from the data model perspective.

It looks like you are using the default compaction strategy. Set that to LCS and you will probably have to take the gloves off on dealing with the tombstones as well, so an alter might look like:

  ALTER TABLE interests.interests WITH
    gc_grace_seconds = 3600 AND
    compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '256',
                  'unchecked_tombstone_compaction': 'true', 'tombstone_compaction_interval': '3600'};

These options are pretty dangerous for consistency's sake long term. They are *only* to get you out of the woods. What we've done:
- switch to LCS with aggressive tombstone pruning
- sstable size is set to 256 MB to try to capture more upserts per level
- SSTables are eligible for tombstone-only compaction an hour after they are created
- gc_grace has been turned down to 1 hour (this is now effectively your recovery window in the case of a node failure; there are caveats, but this is the major takeaway)

These will cause *a lot* of disk IO when implemented. You may want to roll them around the cluster one node at a time via jmxterm invoking setCompactionParametersJson (depending on version). An example of doing this (but with TWCS)  can be found here: http://thelastpickle.com/blog/2017/01/10/twcs-part2.html

Test these on a staging cluster. I'm making a bunch of guesstimates based on previous experience, so YMMV.
