Errors with multiple schema changes


Alex Kaiser

Jul 2, 2015, 5:27:27 PM
to java-dri...@lists.datastax.com
Hello,

  We have a Cassandra cluster that we use for QA testing. For each check-in we run a validation, which creates a new keyspace and drops it when the validation finishes. This happens about 10 times an hour. We often see errors in the log of the form: "Asked to rebuild table <> but I don't know keyspace <>". That by itself is not too bad, since we can just hide these messages. However, sometimes the actual query to create the keyspace fails because the driver can't refresh the schema, which causes the whole validation to fail. Is there some other way we can go about this so that these failures don't happen? Additionally, why does the driver need to have a local copy of the schema?

  We have also run into a bug where the client continuously runs the Metadata#rebuildSchema() and Metadata#rebuildTokenMap() methods, causing 100% CPU utilization. This happens even when we aren't creating or dropping any keyspaces.

Thanks,
Alex Kaiser


Olivier Michallat

Jul 5, 2015, 7:54:28 AM
to java-dri...@lists.datastax.com
Hi,

why does the driver need to have a local copy of the schema?

The driver exposes schema metadata as part of its public API (some clients use that information, for example DevCenter).
It's also used internally for token-aware routing of prepared statements. After the statement has been prepared, the driver uses table metadata to find out if all partition key columns are parameterized; if so, the routing key for bound statements created from this prepared statement can be computed automatically. Note that this will become obsolete with Cassandra 2.2 (see JAVA-776).
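
As a rough illustration of the check described above, here is a minimal sketch in plain Java. It does not use the driver: the class `RoutingKeyCheck`, the method `canComputeRoutingKey`, and the column names are all hypothetical, and the driver's real logic works on its own metadata classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the driver's implementation.
final class RoutingKeyCheck {

    // Returns true when every partition key column is bound through a
    // parameter of the prepared statement, so that a routing key could be
    // derived from the bound values alone.
    static boolean canComputeRoutingKey(List<String> partitionKeyColumns,
                                        List<String> parameterizedColumns) {
        Set<String> parameterized = new HashSet<>(parameterizedColumns);
        return parameterized.containsAll(partitionKeyColumns);
    }

    public static void main(String[] args) {
        // Assumed table with a composite partition key (tenant_id, day).
        List<String> partitionKey = Arrays.asList("tenant_id", "day");

        // ... WHERE tenant_id = ? AND day = ?  (both columns parameterized)
        System.out.println(
            canComputeRoutingKey(partitionKey, Arrays.asList("tenant_id", "day")));

        // ... WHERE tenant_id = ? AND day = '2015-07-02'  (day is a literal)
        System.out.println(
            canComputeRoutingKey(partitionKey, Arrays.asList("tenant_id")));
    }
}
```

In the second case the literal value is not visible among the bound parameters, so in this simplified model the routing key cannot be computed automatically.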


"Asked to rebuild table <> but I don't know keyspace <>"

Do you use async queries?
The driver updates its metadata upon successful execution of a DDL query. The fact that it tries to rebuild metadata for the table indicates that the query was successful server-side (so the keyspace did exist). What might happen here is a race between two async queries:
* query 1 creates keyspace K
* query 1 succeeds server-side
* query 2 creates table K.T
* query 2 succeeds server-side
* driver gets response for query 2 first, tries to update metadata for K.T but metadata for K does not exist yet

My recommendation for DDL queries is to use the synchronous APIs (session.execute instead of session.executeAsync), and ensure queries are run from the same application thread if they are related. Also, see our doc about schema agreement, which plays an important role in DDL queries (although I don't think it's the issue here).
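
To make the ordering problem concrete, here is a small self-contained model of it in plain Java. It is only a sketch under simplifying assumptions: `SchemaRaceSketch` and its methods are hypothetical stand-ins for the driver's local metadata, not driver classes, and the responses are replayed by hand rather than over a real connection.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a client-side schema copy: keyspace name -> table names.
final class SchemaRaceSketch {
    private final Map<String, Set<String>> keyspaces = new HashMap<>();
    final List<String> warnings = new ArrayList<>();

    // Called when the response to CREATE KEYSPACE is processed.
    void onKeyspaceCreated(String keyspace) {
        keyspaces.putIfAbsent(keyspace, new HashSet<>());
    }

    // Called when the response to CREATE TABLE is processed.
    void onTableCreated(String keyspace, String table) {
        Set<String> tables = keyspaces.get(keyspace);
        if (tables == null) {
            // The keyspace exists server-side but is not in the local copy yet.
            warnings.add("Asked to rebuild table " + table
                + " but I don't know keyspace " + keyspace);
            return;
        }
        tables.add(table);
    }

    boolean knowsTable(String keyspace, String table) {
        Set<String> tables = keyspaces.get(keyspace);
        return tables != null && tables.contains(table);
    }

    public static void main(String[] args) {
        // Async case: the response for query 2 (table) is handled before query 1 (keyspace).
        SchemaRaceSketch outOfOrder = new SchemaRaceSketch();
        outOfOrder.onTableCreated("K", "T");
        outOfOrder.onKeyspaceCreated("K");
        System.out.println("out of order: " + outOfOrder.warnings);

        // Sync case, same thread: each response is handled before the next query is sent.
        SchemaRaceSketch inOrder = new SchemaRaceSketch();
        inOrder.onKeyspaceCreated("K");
        inOrder.onTableCreated("K", "T");
        System.out.println("in order: " + inOrder.warnings);
    }
}
```

Only the out-of-order run records a warning; running related DDL statements synchronously from one thread corresponds to the second run.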


sometimes the actual query to create the keyspace fails because it can't refresh the schema, which causes the whole validation to fail

What's the error message in that case?


We have also run into a bug where the client is continuously running the Metadata#rebuildSchema() and Metadata#rebuildTokenMap() methods, and causing 100% CPU utilization

If you have a reproducible scenario and/or a thread dump, I would be very interested in that.

One known issue with the driver is that it doesn't debounce rapid notifications from Cassandra (for example, when many schema changes are run by many clients at the same time). JAVA-657 will address that; it's scheduled for our next sprint.
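
For readers unfamiliar with the term, debouncing here means collapsing a burst of notifications into a single refresh. The sketch below shows the general idea in plain Java with explicit timestamps so that it stays deterministic. It is not the driver's code and says nothing about how JAVA-657 is implemented; `RefreshDebouncer`, its methods, and the 1000 ms quiet period are assumptions made for illustration.

```java
// Hypothetical debouncer: a refresh runs only after notifications have been
// quiet for a configurable period, no matter how many arrived in the burst.
final class RefreshDebouncer {
    private final long quietPeriodMillis;
    private long lastNotificationMillis;
    private boolean pending;
    int refreshCount;

    RefreshDebouncer(long quietPeriodMillis) {
        this.quietPeriodMillis = quietPeriodMillis;
    }

    // Record a schema-change notification without refreshing immediately.
    void onNotification(long nowMillis) {
        lastNotificationMillis = nowMillis;
        pending = true;
    }

    // Called periodically; performs at most one refresh per quiet period.
    boolean tick(long nowMillis) {
        if (pending && nowMillis - lastNotificationMillis >= quietPeriodMillis) {
            pending = false;
            refreshCount++; // a real client would rebuild its schema copy here
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RefreshDebouncer debouncer = new RefreshDebouncer(1000);

        // Five notifications arrive 100 ms apart; none of them triggers a refresh.
        for (long t = 0; t < 500; t += 100) {
            debouncer.onNotification(t);
            debouncer.tick(t);
        }

        // One second after the last notification, a single refresh runs.
        debouncer.tick(1400);
        System.out.println(debouncer.refreshCount); // prints 1
    }
}
```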

Another issue that comes to mind is JAVA-702, where rebuildTokenMap's performance could degenerate when a keyspace referenced a non-existing datacenter. That's fixed and coming in 2.0.11 and 2.1.7.


--

Olivier Michallat

Driver & tools engineer, DataStax



Alex Kaiser

Jul 7, 2015, 1:51:11 PM
to java-dri...@lists.datastax.com
- What's the error message in that case?

21:03:29 2015-05-28 21:03:25,592 [Cassandra Java Driver worker-9] ERROR com.datastax.driver.core.ControlConnection - [Control connection] Unexpected error while refreshing schema
21:03:29 java.util.concurrent.ExecutionException: com.datastax.driver.core.OperationTimedOutException: [/192.168.22.25:9042] Operation timed out

So first I get the "Unexpected error while refreshing schema" message, and then I get an "Operation timed out" error. I can't give you the exact steps to reproduce it, though, because it only happens on a few of our many QA runs. I'll try to reproduce it today, but I doubt I will be able to.

Also note that there could be an "Asked to rebuild table <> but I don't know keyspace <>" message somewhere in there, but we started hiding those messages because they were happening so often.

- About async queries:

For DDL queries we do use synchronous queries. However, we have many environments running at once, so a situation similar to the one described above can still occur. This is what I think is happening:

Consider three servers, A, B, and C, all making schema changes:

B adds keyspace K1 (or really does any schema modification event)
This triggers A to refresh schema
While A is looping through the tables of K2
C drops keyspace K2
This causes the error on A because, when it tries to get data on a table in K2, that keyspace no longer exists

Does this sound like a plausible scenario, and if so, how can we avoid it?

- About the continuous update scenario:

I don't have much information on this because we stopped creating environments off our QA Cassandra cluster, but today or tomorrow I can try to create one against that cluster and see whether it gets into this state. It will be hard to produce a reproducible scenario because I can't watch the environment all day to see when it enters the never-ending loop, but if it does get there I can get a thread dump.

Alex Kaiser

Olivier Michallat

Jul 8, 2015, 4:29:58 AM
to java-dri...@lists.datastax.com
ControlConnection - [Control connection] Unexpected error while refreshing schema
21:03:29 java.util.concurrent.ExecutionException: com.datastax.driver.core.OperationTimedOutException: [/192.168.22.25:9042] Operation timed out

Looks like one of the queries used to refresh the schema timed out. Is your cluster under high load at that moment?

Also, is an exception surfaced to the client, i.e. thrown by the session.execute call (you said the actual query to create the keyspace failed)? Error logs like the above, although they indicate a real problem with the cluster, should not affect client queries. Some metadata could become stale, but the driver should keep functioning.

This triggers A to refresh schema
While A is looping through the tables of K2
C drops keyspace K2

This scenario wouldn't be a problem.

The message "Asked to rebuild table K.T but I don't know keyspace K" happens specifically when:
* the driver received a notification that T was created or updated
* it had previously received a notification that K was removed, so it removed K from its local metadata

If you don't use async queries, one scenario where you might receive the notifications out of order is if the table is created/updated by one application instance while the keyspace is being removed by another.

At any rate, the driver schedules a full schema refresh when this message is logged, so things should fix themselves.


--

Olivier Michallat

Driver & tools engineer, DataStax

