Race condition in com.datastax.driver.core.HostConnectionPool#tryResurrectFromTrash

Alexander Ryazanov

May 22, 2017, 2:54:49 AM
to DataStax Java Driver for Apache Cassandra User Mailing List


Hi,

we faced the following problem with the Java driver (version 2.1.9). Sometimes one of the threads goes into an infinite loop, consuming a full CPU core and producing a lot of garbage along the way. Here is a snapshot from the profiler.


Here is the relevant code:

private Connection tryResurrectFromTrash() {
    long highestMaxIdleTime = System.currentTimeMillis();
    Connection chosen = null;

    while (true) {
        for (Connection connection : trash)
            if (connection.maxIdleTime > highestMaxIdleTime && connection.maxAvailableStreams() > minAllowedStreams) {
                chosen = connection;
                highestMaxIdleTime = connection.maxIdleTime;
            }

        if (chosen == null)
            return null;
        else if (chosen.state.compareAndSet(TRASHED, RESURRECTING))
            break;
    }
    logger.trace("Resurrecting {}", chosen);
    trash.remove(chosen);
    return chosen;
}

The race is possible because 'chosen' is set to null only once (I think it should be reset to null at the beginning of each 'while' iteration).
Here is the race scenario:
1) First 'while' iteration: a connection satisfying our conditions is found in the trash, and 'chosen' is assigned.
2) Another thread changes this connection's state away from TRASHED (probably to GONE).
3) Neither (chosen == null) nor the CAS succeeds.
So we enter the next 'while' iteration, but 'chosen' is not reset. If no more connections in the trash satisfy the criteria, the loop never terminates (a sketch illustrating this, and the fix, is below).
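To make the scenario concrete, here is a self-contained toy version of the loop. FakeConnection, its fields and the RESET_CHOSEN flag are made up for the demo and are not driver code, and I dropped the maxAvailableStreams check. With RESET_CHOSEN = false the call never returns once another thread has moved the only trashed connection out of TRASHED; flipping it to true applies the one-line fix and the method returns null as expected.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Toy reproduction of the tryResurrectFromTrash loop; names are invented for the demo.
public class ResurrectRaceDemo {

    enum State { TRASHED, RESURRECTING, GONE }

    static class FakeConnection {
        final AtomicReference<State> state = new AtomicReference<>(State.TRASHED);
        final long maxIdleTime = System.currentTimeMillis() + 60_000;
    }

    static final Queue<FakeConnection> trash = new ConcurrentLinkedQueue<>();

    // false = current driver behaviour, true = the proposed one-line fix.
    static final boolean RESET_CHOSEN = false;

    static FakeConnection tryResurrectFromTrash() {
        long highestMaxIdleTime = System.currentTimeMillis();
        FakeConnection chosen = null;

        while (true) {
            if (RESET_CHOSEN)
                chosen = null; // forget the stale candidate from the previous iteration

            for (FakeConnection connection : trash)
                if (connection.maxIdleTime > highestMaxIdleTime) {
                    chosen = connection;
                    highestMaxIdleTime = connection.maxIdleTime;
                }

            if (chosen == null)
                return null;
            else if (chosen.state.compareAndSet(State.TRASHED, State.RESURRECTING))
                break;
            // CAS failed: without the reset, 'chosen' still points at the same
            // non-TRASHED connection, so this loop spins forever.
        }
        trash.remove(chosen);
        return chosen;
    }

    public static void main(String[] args) {
        FakeConnection c = new FakeConnection();
        trash.add(c);
        c.state.set(State.GONE); // simulate the other thread (e.g. cleanupTrash) winning the race
        System.out.println("resurrected: " + tryResurrectFromTrash());
    }
}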

I can assume that the other thread is executing com.datastax.driver.core.HostConnectionPool#cleanupTrash. Logically, the conditions in these two methods should be mutually exclusive (both rely on time), but they use different timestamps, so a race is possible at some unlucky moment.
Indeed, we see this problem happen only after days (or even weeks) of normal operation.
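For reference, my reading of cleanupTrash in the 2.1.x sources is roughly the following paraphrase (from memory, so details such as the exact close/remove calls may differ); the important part is that it takes its own timestamp and CASes the connection from TRASHED to GONE, which is exactly what makes the CAS in tryResurrectFromTrash fail:

// Paraphrase of my understanding of cleanupTrash, not the actual driver source.
private void cleanupTrash(long now) {
    for (Connection connection : trash) {
        // 'now' is captured by the cleanup task, while tryResurrectFromTrash reads
        // System.currentTimeMillis() on its own, so the two checks can disagree.
        if (connection.maxIdleTime < now && connection.state.compareAndSet(TRASHED, GONE)) {
            logger.trace("Cleaning up {}", connection);
            trash.remove(connection);
            connection.closeAsync(); // or however the pool actually closes it
        }
    }
}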

I believe this code path is not supposed to be exercised with newer driver versions that support protocol V3 (with V3 the number of core connections equals the maximum, so this sophisticated trash/resurrection logic should not kick in). But the API com.datastax.driver.core.PoolingOptions#setCoreConnectionsPerHost and com.datastax.driver.core.PoolingOptions#setMaxConnectionsPerHost is still open, so technically one can force the same problem by changing the defaults.
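For example, a configuration along these lines (the contact point and the numbers are just illustrative) sets core below max for LOCAL hosts and so, if I understand correctly, re-enables the trash/resurrect path:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class PoolConfigExample {
    public static void main(String[] args) {
        // Core < max: idle connections get trashed and may later be resurrected,
        // which is the code path discussed above.
        PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 1)
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 8);

        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // illustrative contact point
                .withPoolingOptions(pooling)
                .build();

        cluster.close();
    }
}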

Those are mostly my assumptions; maybe the truth is much deeper. Anyway, it would be great to hear from the developers.
On our end, we are moving to a newer driver version, which I think should solve the problem. Can you confirm that?

Regards,
Alex
