segfault when closing a connection

Robin Mahony

Apr 30, 2015, 2:33:29 PM
to cpp-dri...@lists.datastax.com
Hi again,

I have run into the segfault below while trying to close my connection to Cassandra via the driver. It appears that I was trying to close a connection that was never successfully created.

1430418665.981 [ERROR] (src/connection.cpp:643:void cass::Connection::notify_error(const string&)): Host 10.96.98.209 had the following error on startup: 'Connection timeout'

Any ideas why this would cause a segfault? Stack trace is below.

Cheers,

Robin

#0 0x00007f5865927b55 in raise () from /lib64/libc.so.6
#1 0x00007f5865929131 in abort () from /lib64/libc.so.6
#2 0x0000000000b4821d in OSL_Debug_Halt () at libs/osl/OSL_Debug.cc:92
#3 0x0000000000813a8f in Signal_FatalHandler (SigNum=<optimized out>) at modules/SignalHandler_Module/SignalHandler_Module.cc:203
#4 <signal handler called>
#5 0x00007f5866ce50fe in cass::ScopedPtr<cass::AsyncQueue<cass::MPMCQueue<cass::SessionEvent> >, cass::DefaultDeleter<cass::AsyncQueue<cass::MPMCQueue<cass::SessionEvent> > > >::operator->() const ()
from /usr/lib64/libcassandra.so.1
#6 0x00007f5866ce4902 in cass::EventThread<cass::SessionEvent>::send_event_async(cass::SessionEvent const&) () from /usr/lib64/libcassandra.so.1
#7 0x00007f5866ce0110 in cass::Session::notify_up_async(cass::Address const&) () from /usr/lib64/libcassandra.so.1
#8 0x00007f5866d31858 in cass::IOWorker::notify_pool_ready(cass::Pool*) () from /usr/lib64/libcassandra.so.1
#9 0x00007f5866d3ac6a in cass::Pool::maybe_notify_ready() () from /usr/lib64/libcassandra.so.1
#10 0x00007f5866d3b3b9 in cass::Pool::on_connection_closed(cass::Connection*) () from /usr/lib64/libcassandra.so.1
#11 0x00007f5866d3e9bc in boost::_mfi::mf1<void, cass::Pool, cass::Connection*>::operator()(cass::Pool*, cass::Connection*) const () from /usr/lib64/libcassandra.so.1
#12 0x00007f5866d3e3aa in void boost::_bi::list2<boost::_bi::value<cass::Pool*>, boost::arg<1> >::operator()<boost::_mfi::mf1<void, cass::Pool, cass::Connection*>, boost::_bi::list1<cass::Connection*&> >(boost::_bi::type<void>, boost::_mfi::mf1<void, cass::Pool, cass::Connection*>&, boost::_bi::list1<cass::Connection*&>&, int) () from /usr/lib64/libcassandra.so.1
#13 0x00007f5866d3dc64 in void boost::_bi::bind_t<void, boost::_mfi::mf1<void, cass::Pool, cass::Connection*>, boost::_bi::list2<boost::_bi::value<cass::Pool*>, boost::arg<1> > >::operator()<cass::Connection*>(cass::Connection*&) () from /usr/lib64/libcassandra.so.1
#14 0x00007f5866d3d231 in boost::detail::function::void_function_obj_invoker1<boost::_bi::bind_t<void, boost::_mfi::mf1<void, cass::Pool, cass::Connection*>, boost::_bi::list2<boost::_bi::value<cass::Pool*>, boost::arg<1> > >, void, cass::Connection*>::invoke(boost::detail::function::function_buffer&, cass::Connection*) () from /usr/lib64/libcassandra.so.1
#15 0x00007f5866d2a571 in boost::function1<void, cass::Connection*>::operator()(cass::Connection*) const () from /usr/lib64/libcassandra.so.1
#16 0x00007f5866d26550 in cass::Connection::on_close(uv_handle_s*) () from /usr/lib64/libcassandra.so.1
#17 0x00007f58638a697f in uv_run () from /usr/lib64/libuv.so.1
#18 0x00007f5866ce2dc6 in cass::LoopThread::on_run_internal(void*) () from /usr/lib64/libcassandra.so.1
#19 0x00007f58638afee0 in ?? () from /usr/lib64/libuv.so.1
#20 0x00007f586922b7b6 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f58659ced6d in clone () from /lib64/libc.so.6
#22 0x0000000000000000 in ?? ()

Michael Penick

Apr 30, 2015, 3:29:30 PM
to cpp-dri...@lists.datastax.com
That looks like the session was deleted before the IO threads terminated. However, this is puzzling because the session isn't deleted until all the IO threads join: https://github.com/datastax/cpp-driver/blob/master/src/session.cpp#L397-L400. Also, the event queue (in cass::EventThread) is definitely created before the IO threads are started. I have some ideas to attempt to reproduce, but any extra information would be very helpful.

Are you able to reproduce this issue regularly? Are you able to reduce it to a small code example (even if it only reproduces sporadically)?

Thanks!
Mike

Robin Mahony

Apr 30, 2015, 3:35:27 PM
to cpp-dri...@lists.datastax.com
So I have only seen this once, and have not been able to reproduce it yet. I will update you if I manage to find a way to reproduce it semi-reliably.

Robin Mahony

Apr 30, 2015, 3:39:15 PM
to cpp-dri...@lists.datastax.com
The only thing I can think of is attempting to create/close/create/close connections rapidly, and perhaps doing this JUST after Cassandra has started (do not wait for it to fully initialize).

Or perhaps force a connection timeout (make the timeout value really small), then close your driver connection after it has timed out but before it has managed to make a successful connection.
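
A minimal sketch of that reproduction idea, assuming the session-centric API (cass_session_new()/cass_session_connect()); the contact point and the 50 ms timeout are placeholders:

#include <cassandra.h>

int main(void) {
  CassCluster* cluster = cass_cluster_new();
  cass_cluster_set_contact_points(cluster, "10.96.98.209");  /* placeholder contact point */
  cass_cluster_set_connect_timeout(cluster, 50);             /* tiny timeout (ms) to force 'Connection timeout' */

  for (int i = 0; i < 1000; ++i) {
    CassSession* session = cass_session_new();
    CassFuture* connect_future = cass_session_connect(session, cluster);

    cass_future_wait(connect_future);  /* most likely resolves with a timeout error */
    cass_future_free(connect_future);

    /* Tear the session down immediately, whether or not the connect succeeded. */
    cass_session_free(session);        /* closes the session internally before freeing */
  }

  cass_cluster_free(cluster);
  return 0;
}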

Michael Penick

Apr 30, 2015, 3:44:52 PM
to cpp-dri...@lists.datastax.com
Those are good ideas. 

Does your application code call cass_session_close() before freeing the session, or does it only call cass_session_free()? Both are okay, as cass_session_free() waits until the session is closed before freeing, but it would help to know for reproducing the issue.

Mike
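
For reference, a minimal sketch of the two teardown orders in question, using the calls named above plus the driver's future calls (cass_future_wait(), cass_future_free()); the helper function names are purely illustrative:

#include <cassandra.h>

/* Variant A: explicit close, wait for it to finish, then free. */
static void shutdown_with_close(CassSession* session) {
  CassFuture* close_future = cass_session_close(session);
  cass_future_wait(close_future);   /* block until the session has fully shut down */
  cass_future_free(close_future);
  cass_session_free(session);
}

/* Variant B: free only; per the note above, cass_session_free()
   waits until the session is closed before freeing. */
static void shutdown_free_only(CassSession* session) {
  cass_session_free(session);
}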

Robin Mahony

Apr 30, 2015, 3:49:01 PM
to cpp-dri...@lists.datastax.com
It does call cass_session_close() if it has previously managed to connect successfully (the cass_session_connect() future returns CASS_OK), and then it calls cass_session_free().

Robin Mahony

May 11, 2015, 5:50:21 PM
to cpp-dri...@lists.datastax.com
Has there been any development on this issue? I am hitting it again during stress testing I am performing.

Michael Penick

May 11, 2015, 7:29:07 PM
to cpp-dri...@lists.datastax.com
I spent a while trying to reproduce with different pauses in both the IO and session threads while rapidly creating and destroying sessions. I haven't been able to reproduce the issue. 

Any extra information you can provide to help reproduce the issue would be appreciated.

Mike

PS Are you rapidly creating and destroying sessions in your application?

Robin Mahony

May 11, 2015, 7:35:13 PM
to cpp-dri...@lists.datastax.com
So it appears to be related to closing sessions after getting 'Connection timeout' errors. We have a 3-site grid, with 4 nodes per site. Perhaps try lowering the connection timeout to something tiny, and use a larger grid?


[ERROR] (src/connection.cpp:643:void cass::Connection::notify_error(const string&)): Host 10.96.98.214 had the following error on startup: 'Connection timeout'

Michael Penick

May 11, 2015, 7:44:34 PM
to cpp-dri...@lists.datastax.com
Is this on Linux or Windows?

Robin Mahony

May 11, 2015, 7:46:25 PM
to cpp-dri...@lists.datastax.com
Linux. SLES 11 SP3.

Robin Mahony

May 12, 2015, 8:53:33 PM
to cpp-dri...@lists.datastax.com
Also, any idea why I am getting so many connection timeouts? It seems that only my local nodes are returning these errors, and I'd expect those connections to be fast...

Robin Mahony

May 12, 2015, 9:23:22 PM
to cpp-dri...@lists.datastax.com
So, an update on this. It appears that when we upgraded our nodes, we weren't properly opening our firewall to allow access to the native port on non-0.0.0.0 nodes (we had it closed previously). So the scenario is actually that you try to close the connection after it has attempted to connect to the cluster while the firewall is preventing cross-node communication.

Not sure how valid this is, though it would still probably be preferable not to core dump. :)

Sorry for the initial confusion.

Michael Penick

May 13, 2015, 12:00:35 PM
to cpp-dri...@lists.datastax.com
Thanks for the updated information. That's helpful. I'll work to reproduce with it.

Ideally, the driver should never segfault under any error conditions. :)

Mike

Michael Penick

May 13, 2015, 3:47:14 PM
to cpp-dri...@lists.datastax.com
Reproduced the issue. I set up a three-node cluster, then blocked one of the nodes using a firewall. I enabled a debug version of malloc(), then ran an example that connects/closes over and over.


It's a backport of a fix for the same issue found on 2.0 (https://github.com/datastax/cpp-driver/commit/8f4b13abeb0fa6608b7561c6b4d3e9fd1a197417). This change will be in the 1.0.2 version of the driver, which will be out officially later this week (or early next week).

Let me know if this actually does fix the issue for you.

Mike

Robin Mahony

May 13, 2015, 3:57:00 PM
to cpp-dri...@lists.datastax.com
Awesome, thanks. When does 2.0.1 come out? Same time as 1.0.2? There is that other change you're adding, and I figure I might as well bump to the latest version of the driver.

Michael Penick

May 13, 2015, 4:00:01 PM
to cpp-dri...@lists.datastax.com
2.0.1 and 1.0.2 are coming out at the same time.

Mike