SIGSEGV in dispatcher.cpp

Nick Burkitt

Dec 2, 2016, 1:36:54 PM
to Dbus-cxx
Hi Robert.

I've been chasing a nasty problem which has led me to Dispatcher::watch_thread_main().

P14894.14898 (Signal 11; Signal: Segmentation violation)
0xb6e1af7c [libstdc++.so.6.0.20] local_Rb_tree_increment(): tree.cc, line 65
0xb6e1af7c [libstdc++.so.6.0.20] _Rb_tree_increment(): tree.cc, line 91
0xb6f10b20 [libdbus-cxx.so.0.0.0] operator++(): stl_tree.h, line 278
0xb6f10b20 [libdbus-cxx.so.0.0.0] watch_thread_main(): dispatcher.cpp, line 348
0xb6f10cfe [libdbus-cxx.so.0.0.0] proxy_watch_thread_main(): dispatcher.cpp, line 600
0xb6f41fbc [libpthread-2.19.so] start_thread()
0xb6f41fbc [libpthread-2.19.so] start_thread()
...

It appears that the iterator in the statement on line 348

for ( fditer = m_enabled_write_fds.begin(); fditer != m_enabled_write_fds.end(); fditer++ )

is being invalidated within the loop. Since the only way that can happen is if the element it references is erased from the set, that points to Dispatcher::on_watch_toggled() as a suspect.

I haven't tried to figure out what the code is doing, or if I may be abusing it in some way that wasn't envisioned when it was designed.
I'd be very interested in hearing your thoughts on the matter before I go any further.

Thanks,

-Nick

Robert Middleton

Dec 2, 2016, 5:38:02 PM
to Nick Burkitt, Dbus-cxx
It's been a long time since I've touched the code, and I didn't write that part of it originally, so I don't know anything off the top of my head.

How fast are you sending data out over the bus though?  I have seen in the past that spamming the bus with information can cause issues, but I don't remember if it was a segfault or if it was just not sending information.

-Robert Middleton


Nick Burkitt

Dec 2, 2016, 6:50:20 PM
to Dbus-cxx
Hi Robert.

Here's more information than you really want. :-)

I have two executables that connect to a session bus. One is a service that streams fairly low-rate electromagnetic sensor data as signals. It also accepts client connections via TCP, for handling some simple status and control commands.
The second executable is an application. It consumes the sensor signals and extends the service's client command set via a dbus method which it registers with the service at run time. That's where the file descriptor passing comes in, so that the application can communicate over the client's TCP connection.

Streaming data via signals doesn't seem to be a problem - I've done that to the point of exhausting CPU bandwidth without software failures. My test setup has two streams, each sending signals of less than 50 bytes at 10 Hz.
Where the problem becomes apparent is when sending commands from the client (an Android app in this case) to the service, which forwards them to the application.
On startup and shutdown, the Android client issues several commands back-to-back, of which five get routed over the dbus to the application. They are no more than 20-odd bytes each.
Occasionally, the service will encounter the situation I described in my first post. I've tweaked the dispatcher code to copy the iterator, increment the original, and then use the copy in the body of the for() loop. That avoids the SIGSEGV, but after a few client startup/shutdown cycles, I start getting "Did not receive a reply" complaints from the service about the commands forwarded to the application. The application seems happy, though. It gets all the commands (with 25 second delays between them) and responds normally.

I imagine that having the element the iterator references erased from the set isn't expected to happen at that point, so simply avoiding the crash just pushes the problem further downstream. At least it's not (as) fatal.
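
For what it's worth, the iterator tweak I made is roughly this (a from-memory sketch rather than the exact patch; the element type and handler are stand-ins):

#include <set>

static void handle_write_fd( int /*fd*/ )
{
    // stand-in for the body of the original for() loop
}

// Advance the loop iterator before using it, so the loop keeps a valid
// iterator even if the element being handled gets erased mid-pass.
// Note this only dodges the increment-after-erase crash; the underlying
// race with on_watch_toggled() is still there.
static void drain_write_fds( std::set<int>& enabled_write_fds )
{
    std::set<int>::iterator fditer = enabled_write_fds.begin();
    while ( fditer != enabled_write_fds.end() )
    {
        std::set<int>::iterator current = fditer; // copy for this pass
        ++fditer;                                 // increment the original first
        handle_write_fd( *current );              // work with the copy
    }
}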

If you have any insights, I'd love to hear them.
Thanks,

-Nick

Robert Middleton

Dec 3, 2016, 10:12:47 AM
to Nick Burkitt, Dbus-cxx
It looks like the problem could be that we are iterating over the m_enabled_write_fds set without holding the m_mutex_write_watches mutex. I agree that Dispatcher::on_watch_toggled is the likely culprit; do you see the debug message in that method right before the segfault?

This code should probably be re-written to use std::mutex and std::thread to make it easier to lock and unlock the mutex and be more C++-ified.
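
Something along these lines is the shape I'd expect the fix to take (just a sketch of the locking idea; the real element type and member names in dispatcher.cpp may differ):

#include <mutex>
#include <set>

std::mutex m_mutex_write_watches;   // guards m_enabled_write_fds
std::set<int> m_enabled_write_fds;  // element type illustrative

void handle_enabled_write_fds()
{
    // Take the same mutex that on_watch_toggled() holds when it erases
    // from the set, so the iterator can't be invalidated underneath us.
    std::lock_guard<std::mutex> guard( m_mutex_write_watches );
    for ( std::set<int>::iterator fditer = m_enabled_write_fds.begin();
          fditer != m_enabled_write_fds.end(); ++fditer )
    {
        // ... handle the writable fd ...
    }
}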

-Robert Middleton

Robert Middleton

Dec 3, 2016, 8:19:19 PM
to Nick Burkitt, Dbus-cxx
I just re-wrote the Dispatcher class to maybe fix this problem.  Quick changelog:
  • use std::thread instead of pthread
  • use std::mutex instead of pthread_mutex
  • use std::condition_variable instead of pthread_condition
  • Removed some unused methods
  • properly use mutexes to access the lists (this should fix the segfault)
  • use poll() instead of select() to check file descriptors
  • made running volatile
If you could test this out and tell me if you still have an issue, that would be great.  It looks good on my side, but I've only run some very basic tests.  On the bright side, this also simplifies the code in several areas, so it should be a little easier to follow.

-Robert Middleton

Nick Burkitt

Dec 7, 2016, 9:40:56 PM
to Dbus-cxx
Hi Robert.

The segmentation fault is gone - I've stressed the code pretty hard without seeing it again. But this timeout issue is still a problem. I've fiddled with the debug code and have at least found where things are coming to a halt. The Application (one of the two executables communicating over the session bus) is blocked waiting on Dispatcher::m_cond_initiate_processing. Apparently, the code decides it has nothing to do. After 25 seconds, it gets kicked off again. Sometimes things are able to continue at that point, but typically not. Could there be a race between the message response and the code that looks for it? Or something?
I have logs from both sides of the conversation if you think it would help to look at them.
Any hints on where to narrow my search would be welcomed.
Thanks,

-Nick

Nick Burkitt

Dec 7, 2016, 9:57:05 PM
to Dbus-cxx
Speaking of logs, here's at least a snippet from the hanging side. I added time stamps to the dbus-cxx messages, and massaged the thread ID to match my logger's format (hex pid:tid):

02:32:01.946087:[3858:385a]:connection.cpp(950): Connection::on_filter_callback:  filter_result: 1
02:32:01.946165:[3858:385a]:object.cpp(310): Object::handle_message: before call message test
02:32:01.946302:[3858:3858]:dispatcher.cpp(525): Dispatcher::on_wakeup_main
02:32:01.946654:[3858:385a]:object.cpp(321): Object::handle_message: message is good (it's a call message) for interface 'com.optimalranging.Android.v1'
02:32:01.946721:[3858:385a]:object.cpp(359): Object::handle_message: before handling lower bound interface is com.optimalranging.Core100.v1
02:32:01.946766:[3858:385a]:object.cpp(360): Object::handle_message: before handling upper bound interface is com.optimalranging.Core100.v1
02:32:01.946807:[3858:385a]:interface.cpp(365): Interface::handle_call_message  interface=com.optimalranging.Core100.v1
02:32:01.946904:[3858:3858]:dispatcher.cpp(525): Dispatcher::on_wakeup_main

Now my code gets called. The command message is received, parsed, and executed. The function that's bound to the dbus-cxx object is PreExecute().

02:32:01.947175<78>[3858:385a]FsClientCommandHandler::PreExecute()[FsClientCommandHandler.cpp:47]
02:32:01.947720:[3858:3858]:dispatcher.cpp(525): Dispatcher::on_wakeup_main
02:32:01.948108<79>[3858:385a]InputOutput::InputOutput()[InputOutput.cpp:21]:"assigning connection ID 13"
02:32:01.948498<80>[3858:385a]SocketInputOutput::SocketInputOutput()[SocketInputOutput.cpp:32]:"tracking socket handle 7"
02:32:01.948879<81>[3858:385a]CommandMessage::CommandMessage()[CommandMessage.cpp:25]
02:32:01.949219<82>[3858:385a]FsMessage::FsMessage()[FsMessage.cpp:53]
02:32:01.949491<83>[3858:385a]FsClientCommandHandler::PreExecute()[FsClientCommandHandler.cpp:55]:"$FSAND,1,2014 (fd: 7[13])"
02:32:01.949742<84>[3858:385a]AndroidCommandHandler::Execute()[AndroidCommandHandler.cpp:295]
02:32:01.950379<85>[3858:385a]AndroidCommandHandler::Parse()[AndroidCommandHandler.cpp:83]
02:32:01.950957<86>[3858:385a]AndroidCommandHandler::ParseOutputUdp()[AndroidCommandHandler.cpp:154]
02:32:01.951236<87>[3858:385a]AndroidCommandHandler::Parse()[AndroidCommandHandler.cpp:135]:"done"
02:32:01.951456<88>[3858:385a]AndroidCommandHandler::ExecuteOutputUdp()[AndroidCommandHandler.cpp:357]
02:32:01.952042<89>[3858:385a]Android::SubscribeUdp()[Android.cpp:129]
02:32:01.952886<90>[3858:385a]AndroidCommandHandler::Execute()[AndroidCommandHandler.cpp:334]:"done"
02:32:01.953154<91>[3858:385a]FsClientCommandHandler::PreExecute()[FsClientCommandHandler.cpp:72]:"done"
02:32:01.953397<92>[3858:385a]FsMessage::~FsMessage()[FsMessage.cpp:59]
02:32:01.953645<93>[3858:385a]CommandMessage::~CommandMessage()[CommandMessage.cpp:31]

Back to dbus-cxx code.

02:32:01.957351:[3858:385a]:dispatcher.cpp(525): Dispatcher::on_wakeup_main
02:32:01.957475:[3858:385a]:object.cpp(381): Object::handle_message: message was handled
02:32:01.957533:[3858:385a]:objectpathhandler.cpp(120): ObjectPathHandler::message_handler_callback: result = 0
02:32:01.957632:[3858:385a]:connection.cpp(950): Connection::on_filter_callback:  filter_result: 1
02:32:01.958399:[3858:385a]:connection.cpp(950): Connection::on_filter_callback:  filter_result: 1
02:32:01.958713:[3858:385a]:connection.cpp(950): Connection::on_filter_callback:  filter_result: 1
02:32:01.959379:[3858:385a]:dispatcher.cpp(536): Dispatcher::on_dispatch_status_changed: 1
02:32:01.959541:[3858:385a]:dispatcher.cpp(228): Dispatch Status: 1   Hints: DISPATCH_DATA_REMAINS=0

Oops. 25 seconds have disappeared. The relevant code is in Dispatcher::dispatch_thread_main(). I added code to measure the time elapsed during the wait() call.

std::unique_lock<std::mutex> lock( m_mutex_initiate_processing );

if ( not m_initiate_processing ){
    std::chrono::system_clock::time_point _start = std::chrono::system_clock::now();
    // If we don't have any work to do we will wait here for someone to wake us up
    m_cond_initiate_processing.wait( lock );
    std::chrono::system_clock::duration _duration = std::chrono::system_clock::now() - _start;
    if ( std::chrono::duration_cast<std::chrono::milliseconds>(_duration) > std::chrono::milliseconds( 1000 ) )
    {
        DBUS_CXX_DEBUG( "wait at line " << __LINE__ << " took " << std::chrono::duration_cast<std::chrono::milliseconds>(_duration).count() << " ms" );
    }
}


02:32:26.937183:[3858:385a]:dispatcher.cpp(208): wait at line 208 took 24977 ms


02:32:26.937319:[3858:385a]:dispatcher.cpp(228): Dispatch Status: 1   Hints: DISPATCH_DATA_REMAINS=0
02:32:26.938312:[3858:3858]:dispatcher.cpp(525): Dispatcher::on_wakeup_main
02:32:26.938421:[3858:3858]:dispatcher.cpp(525): Dispatcher::on_wakeup_main
02:32:26.938487:[3858:3858]:connection.cpp(897): Connection::on_remove_timeout_callback  enabled:1  interval: 25000
02:32:26.938549:[3858:3858]:dispatcher.cpp(511): Dispatcher::on_remove_timeout  enabled:1  interval: 25000
02:32:26.938612:[3858:3858]:dispatcher.cpp(525): Dispatcher::on_wakeup_main
02:32:26.938664:[3858:3858]:dispatcher.cpp(525): Dispatcher::on_wakeup_main

-Nick

Robert Middleton

Dec 8, 2016, 5:37:59 PM
to Nick Burkitt, Dbus-cxx
I'm pretty sure that I have seen this problem before, but I haven't taken enough time to track it down. As far as I can tell, it is a race condition around the condition variable being broadcast. That shouldn't be the case, though, as all of the accesses to the std::condition_variable are properly mutexed.

Does the message actually have a return?  dbus-monitor can be used to see what is actually going across the bus.  I doubt that the message is being dropped but it's possible.

-Robert Middleton

Robert Middleton

Dec 8, 2016, 9:13:00 PM
to Nick Burkitt, Dbus-cxx
Actually... maybe the problem isn't in the dispatcher at all?  Could it be whatever is calling the dispatcher to wake up?  Also, why does Dispatcher::on_wakeup_main get called twice after your code gets executed?  That looks suspicious to me, since it doesn't seem to happen before that point.  

I can take a look at it more this weekend.

-Robert Middleton

Nick Burkitt

Dec 9, 2016, 12:27:53 PM
to Dbus-cxx
The messages from Service to Application have a return type of void. I found a bug yesterday where the Service method bound to the dbus-cxx object that issues dbus calls to the Application was expecting a std::string return type. That was causing an exception that was caught by the Service code, but apparently had no other effects. I've retested since fixing that bug and the above problem remains.
I appreciate your spending the time to look at this - it's the one big risk left in this project.
Thanks,

-Nick

Robert Middleton

Dec 10, 2016, 12:11:47 PM
to Nick Burkitt, Dbus-cxx
Do you have a small example where you can reliably reproduce this issue? 

Also, does the hang happen on the sending side or on the receiving side (the program that is calling the method vs. the program that has the method implementation)? That may help me track down the issue. I just tried to reproduce it on my computer but wasn't able to; my setup is probably different from yours. At a minimum, the method signature that you are using would be useful.

-Robert Middleton

Robert Middleton

Dec 10, 2016, 8:54:15 PM
to Nick Burkitt, Dbus-cxx
Ah... I think I may have just been able to reproduce the issue now.  I wasn't trying it with nearly enough data to reproduce it reliably.

I did get this segfault though (I was able to catch the hanging issue separately in the debugger). I'm not sure if it's related at all, since that signal shouldn't be getting modified at all during execution.

140737323644672 - 20:16:29: Dispatch Status: 1   Hints: DISPATCH_DATA_REMAINS=0
140737323644672 - 20:16:29: Dispatch Status: 1   Hints: DISPATCH_DATA_REMAINS=0
140737315251968 - 20:16:29: Connection::on_remove_timeout_callback  enabled:1  interval: 25000
140737315251968 - 20:16:29: Dispatcher::on_remove_timeout  enabled:1  interval: 25000
140737315251968 - 20:16:29: Dispatcher::on_wakeup_main

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5aea700 (LWP 13411)]
0x00007ffff7b93a50 in emit (
    _A_a1=@0x7ffff5ae9cfc: DBus::DISPATCH_DATA_REMAINS, impl=0x60c970)
    at /usr/include/sigc++-2.0/sigc++/signal.h:1006
1006       for (iterator_type it = slots.begin(); it != slots.end(); ++it)
(gdb) bt
#0  0x00007ffff7b93a50 in emit (
    _A_a1=@0x7ffff5ae9cfc: DBus::DISPATCH_DATA_REMAINS, impl=0x60c970)
    at /usr/include/sigc++-2.0/sigc++/signal.h:1006
#1  emit (this=<optimized out>, 
    _A_a1=@0x7ffff5ae9cfc: DBus::DISPATCH_DATA_REMAINS)
    at /usr/include/sigc++-2.0/sigc++/signal.h:2801
#2  DBus::Connection::on_dispatch_status_callback (connection=<optimized out>, 
    new_status=<optimized out>, data=<optimized out>) at connection.cpp:936
#3  0x00007ffff792853e in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#4  0x00007ffff7928676 in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#5  0x00007ffff7940d2a in dbus_watch_handle ()
   from /lib/x86_64-linux-gnu/libdbus-1.so.3
#6  0x00007ffff7bc4cbe in DBus::Watch::handle (this=this@entry=0x609e50, 
    flags=flags@entry=1) at watch.cpp:87
#7  0x00007ffff7bc4d2b in DBus::Watch::handle_read (this=this@entry=0x609e50, 
    error=error@entry=false, hangup=hangup@entry=false) at watch.cpp:95
#8  0x00007ffff7b9dac3 in DBus::Dispatcher::watch_thread_main (this=0x608c60)
    at dispatcher.cpp:341
#9  0x00007ffff74bf970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff67090a4 in start_thread (arg=0x7ffff5aea700)
    at pthread_create.c:309
#11 0x00007ffff6c2f87d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


If the segfault is related, that at least gives us a point at which to start investigating.  


Note that the above segfault was in the sending application.  I also got this segfault in the receiving application:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff62eb700 (LWP 13572)]
0x00007ffff7b93a59 in empty (this=<optimized out>)
    at /usr/include/sigc++-2.0/sigc++/functors/slot_base.h:290
290    { return (!rep_ || !rep_->call_); }
(gdb) bt
#0  0x00007ffff7b93a59 in empty (this=<optimized out>)
    at /usr/include/sigc++-2.0/sigc++/functors/slot_base.h:290
#1  emit (_A_a1=@0x7ffff62ea5cc: DBus::DISPATCH_COMPLETE, impl=0x612970)
    at /usr/include/sigc++-2.0/sigc++/signal.h:1008
#2  emit (this=<optimized out>, _A_a1=@0x7ffff62ea5cc: DBus::DISPATCH_COMPLETE)
    at /usr/include/sigc++-2.0/sigc++/signal.h:2801
#3  DBus::Connection::on_dispatch_status_callback (connection=<optimized out>, 
    new_status=<optimized out>, data=<optimized out>) at connection.cpp:936
#4  0x00007ffff792853e in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#5  0x00007ffff792944a in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#6  0x00007ffff7b9317a in DBus::Connection::send (this=0x60f1c0, msg=
    std::shared_ptr (count 2, weak 0) 0x7ffff0000920) at connection.cpp:338
#7  0x0000000000406c95 in DBus::Method<void, std::string, sigc::nil, sigc::nil, sigc::nil, sigc::nil, sigc::nil, sigc::nil>::handle_call_message (
    this=0x613570, connection=std::shared_ptr (count 8, weak 1) 0x60f1c0, 
    message=std::shared_ptr (count 3, weak 0) 0x7ffff0000a30)
    at /usr/local/include/dbus-cxx-0.8/dbus-cxx/method.h:1206
#8  0x00007ffff7ba3155 in DBus::Interface::handle_call_message (
    this=<optimized out>, 
    connection=std::shared_ptr (count 8, weak 1) 0x60f1c0, 
    message=std::shared_ptr (count 3, weak 0) 0x7ffff0000a30)
    at interface.cpp:388
#9  0x00007ffff7bb375c in DBus::Object::handle_message (
---Type <return> to continue, or q <return> to quit---
    this=this@entry=0x612f90, 
    connection=std::shared_ptr (count 8, weak 1) 0x60f1c0, 
    message=std::shared_ptr (count 1, weak 0) 0x7ffff0000a10) at object.cpp:367
#10 0x00007ffff7bb64a6 in DBus::ObjectPathHandler::message_handler_callback (
    connection=<optimized out>, message=<optimized out>, user_data=0x612f90)
    at objectpathhandler.cpp:119
#11 0x00007ffff793860f in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#12 0x00007ffff792a194 in dbus_connection_dispatch ()
   from /lib/x86_64-linux-gnu/libdbus-1.so.3
#13 0x00007ffff7b93541 in DBus::Connection::dispatch (this=0x60f1c0)
    at connection.cpp:435
#14 0x00007ffff7b9cc61 in DBus::Dispatcher::dispatch_thread_main (
    this=0x60ec60) at dispatcher.cpp:219
#15 0x00007ffff74bf970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007ffff67090a4 in start_thread (arg=0x7ffff62eb700)
    at pthread_create.c:309
#17 0x00007ffff6c2f87d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111



Also, some good-ish news.  When the call hangs, the return message does make it back to the calling program, which means that this is almost certainly a bug in the calling program.

I think if we take a good look at what's going on with these segfaults we may be able to track down exactly what is going on.  I'm thinking that if the segfaults are related at all to the non-response, the problem is that the sigc++ signal is getting modified from two different threads at once.  My current thinking is that if the signal gets modified after the response comes back, you see the hang.  If it's modified while the response is coming back, the iterator gets invalidated and thus we get our segfault.  In this case, you may not ever see it if only that one thread is killed.  In that case though, the library would die as no other processing would happen in that thread.
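
Boiled down, the pattern I'm worried about is something like this (a contrived sketch, not code from the library; if the suspicion is right, a program like this would eventually crash or hang):

#include <sigc++/sigc++.h>
#include <thread>

// sigc::signal is not thread-safe, so emitting from one thread while
// another thread connects/disconnects slots can invalidate the slot
// iterator inside emit().
static sigc::signal<void, int> dispatch_status_signal;

static void on_status( int ) {}

static void emitter()
{
    for ( int i = 0; i < 100000; i++ )
        dispatch_status_signal.emit( i );      // walks the slot list
}

static void modifier()
{
    for ( int i = 0; i < 100000; i++ )
    {
        sigc::connection c = dispatch_status_signal.connect( sigc::ptr_fun( &on_status ) );
        c.disconnect();                        // mutates the slot list concurrently
    }
}

int main()
{
    std::thread a( emitter );
    std::thread b( modifier );
    a.join();
    b.join();
}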

-Robert Middleton

Nick Burkitt

Dec 10, 2016, 9:51:19 PM
to Dbus-cxx
Hi Robert.

That's all good news! It makes sense that it takes a heavy load to trigger the condition, or it would have shown up before this. I actually didn't intend to do quite so many dbus calls. The code is ported from an RTOS version where the message transport was simple callbacks within a single executable, and I had forgotten that detail of the implementation. But I'd rather find the problems in development than hear about them from the field.
It sounds as if you've got something to look at, which is generally the biggest hurdle to overcome. I'm really glad you're spending the time on this. :-) And it will make the library even better in the end.
Thanks!

-Nick

Robert Middleton

Dec 11, 2016, 12:16:36 PM
to Nick Burkitt, Dbus-cxx
After some investigation, the good news is that putting the emit() that the Connection has seems to fix the segfault (why it was segfaulting in the first place, I don't know...). The bad news is that it still doesn't fix the pausing issue (although it does seem better now).

Looking at the thread IDs that are printed out as part of the debug output, callback methods are /generally/ called from the same thread (but not always).  For example, Connection::on_remove_timeout_callback is generally called from the main thread, but can sometimes be called from the watch thread.  Dispatcher::on_wakeup_main can be called from any thread.  I suspect that because these methods can be called from multiple threads, a method is being called from a thread that it should not be called from.  I haven't looked through all of the DBus methods, but at least when the dispatch status changes, you can't call dispatch from inside that callback (see the documentation here: https://dbus.freedesktop.org/doc/api/html/group__DBusConnection.html#ga55ff88cd22c0672441c7deffbfb68fbf).  My best guess is that this is the case for a few other methods as well, which is why it sometimes hangs depending on what thread the method was called from.

Unfortunately, I think that the best way to actually fix this problem is going to require a bit of work to refactor the Dispatcher class to use only one thread instead of two.  My understanding at least is that you should only have to have one thread anyway; the only reason to create threads in the library is so that processing can still happen if you don't have a main loop of any kind.

-Robert Middleton

Nick Burkitt

Dec 11, 2016, 7:32:24 PM
to Dbus-cxx
Hi Robert.

It's good that you're making progress. It's a little scary to think that the problem may be with the design of the library itself.
Of the callback methods, it looks as though only the "watch" (add, remove, toggled) callbacks do any real processing. As a quick hack, would it be reasonable simply to queue an indication that the callback was called, and defer the processing to watch_main_loop()? 

Something along the lines of:

// "watch" request queue element
struct watch_request
{
enum class type
{
add, remove, toggled,
};
type m_type;
watch::pointer m_watch;
watch_request( const type& type, const watch::pointer& watch ) : m_type( type ), m_watch( watch ) {}
};

// new member of Dispatcher:
std::dequeue< watch_request > m_watch_queue;

// callbacks change along these lines:
bool Dispatcher::on_add_watch( Watch::pointer watch )
{
bool okay = watch and watch->is_vald();
if ( okay )
{
m_watch_queue.push_back( watch_request( watch_request::type::add, watch ) );
}
return okay;
}

// actual processing happens in new methods like this
void Dispatcher::really_add_watch( const Watch::pointer& watch )
{
/* code from the old on_add_watch() method */
}

// Then in dispatch_thread_main(),
void dispatch_thread_main()
{
if (!m_watch_queue.empty())
{
switch (m_watch_queue.front().m_type)
{
case watch_request::type::add:
{
really_add_watch(m_watch_queue.front());
}
break;
case watch_request::type::remove:
{
}
break;
case watch_request::type::toggled:
{
}
}
m_watch_queue.pop();
}
}

It's kind of simplistic, but it could be an easy way to make sure only the watch thread executes watch code.

-Nick

Nick Burkitt

Dec 11, 2016, 7:48:39 PM
to Dbus-cxx
And of course that would be

really_add_watch(m_watch_queue.front().m_watch);

Robert Middleton

Dec 12, 2016, 11:01:04 PM
to Nick Burkitt, Dbus-cxx
In terms of library design, I can see why it was designed this way in the first place.  The reason there are two threads in the library is that it needs to do two things: watch the file descriptors, and, separately, flush events out to the daemon.

This works fine as long as you aren't sending a lot of data.  I'm not sure exactly what the problem is, but if my suspicion about a race condition is correct, the easiest way to solve it is to simply put all of the dbus-related work into one thread instead of figuring out which thread has to handle which event.  Otherwise, the level of mutexing required to make it work properly just seems excessive.

So yes, basically what has to be done is to add watches to a queue and wait for some event to happen on them (similar to your code).  This also makes the main thread loop much simpler, as it can simply wait in poll() indefinitely until something happens.  But then you have the question of how to wake up the thread to tell it to do some action.  Currently, this is handled in the dispatcher thread, which waits on the std::condition_variable.  If we make a socketpair() that we send data across to tell the thread to wake up, that can solve the problem (see an example in the Qt documentation: http://doc.qt.io/qt-4.8/unix-signals.html).  The Qt example is about informing the main loop to wake up, so it's a similar concept (waking up one thread from another).
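
Roughly the mechanism I have in mind (sketch only; error handling omitted, and the names here are illustrative rather than taken from the library):

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

int wakeup_fds[2]; // [0] is watched by the dispatch thread, [1] is written to by anyone

void setup_wakeup_pipe()
{
    socketpair( AF_UNIX, SOCK_STREAM, 0, wakeup_fds );
}

// Called from any thread that wants the dispatch thread to do some work.
void wakeup_dispatch_thread()
{
    char c = 'x';
    write( wakeup_fds[1], &c, 1 );
}

// Dispatch thread: block in poll() on the wakeup fd (plus the watch fds).
void dispatch_loop_iteration()
{
    struct pollfd pfd;
    pfd.fd = wakeup_fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    poll( &pfd, 1, -1 ); // wait forever until someone writes the other end

    char buf[16];
    read( wakeup_fds[0], buf, sizeof( buf ) ); // drain the wakeup bytes
    // ... then do whatever processing was queued up ...
}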

-Robert Middleton

Nick Burkitt

Dec 13, 2016, 1:52:23 PM
to Dbus-cxx
Hi Robert.

This may be a good application for SIGUSR1 (or 2). Add a signal handler for SIGUSRn (that does nothing - it just avoids calling the default signal handler). Then raise SIGUSRn when a watch event is added to the queue. That should interrupt poll(), and you can check the queue for things to do. My signals kung-fu is pretty weak, but I think that's how it works.
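
Something like this is what I'm picturing (a sketch only; error handling omitted, and I haven't actually tried it):

#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <signal.h>

static void sigusr1_noop( int ) { /* exists only so the default disposition doesn't run */ }

void install_wakeup_handler()
{
    struct sigaction sa = {};
    sa.sa_handler = sigusr1_noop;
    // No SA_RESTART: we want poll() to come back early with EINTR.
    sigaction( SIGUSR1, &sa, nullptr );
}

// Any thread: queue the watch request, then poke the watch thread.
void wake_watch_thread( pthread_t watch_thread )
{
    pthread_kill( watch_thread, SIGUSR1 );
}

// Watch thread loop fragment.
void watch_loop_iteration( struct pollfd* fds, nfds_t nfds )
{
    int rc = poll( fds, nfds, -1 );
    if ( rc < 0 && errno == EINTR )
    {
        // Interrupted by SIGUSR1: check the watch-request queue here.
    }
}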

-Nick

Robert Middleton

Dec 17, 2016, 11:50:33 AM
to Nick Burkitt, Dbus-cxx
So the good news is that I think I have fixed the problem by re-writing the dispatcher to run on just one thread.  I've run it for 5+ minutes, which doesn't sound like much except that the lockup would happen generally within 30 seconds before.

It's under a separate branch at the moment, because it does break API/ABI compatibility.  You can find it here: https://sourceforge.net/p/dbus-cxx/code/HEAD/tree/branches/api-cleanup/

The basic idea behind the new implementation is similar to the Qt article that I linked before: whenever the dispatch thread has work to do, a single byte is written to an FD created by socketpair(), as that can be done from multiple threads.  In our main dispatch thread, we poll all of the file descriptors for the watches, plus one of the file descriptors created by socketpair().  This allows us to wait forever in poll(), so we don't have to wake up the thread periodically to check for new data/watches.  I don't want to do anything with signal handling, as a signal can (sometimes) be delivered to a random thread, and given that this is a library we don't want to force a specific signal handler on users who are already using it for some reason.

I have also implemented a better logging mechanism for the library so that you can install a logger callback in your code and have everything logged at one spot.  There is a sample callback method in utility.cpp if you want to see an example of how it works.

If you could try it out and see if it works on your side, that would be great.

-Robert Middleton

Nick Burkitt

Dec 18, 2016, 7:59:58 PM
to Robert Middleton, Nick Burkitt, Dbus-cxx
Hi Robert.

That sounds awesome! I agree that signals are a bit of a minefield, so avoiding them is wise. I'll have to wait until tomorrow to try the fix - remote access is down at the moment, but I'm excited to give it a try. I'll let you know as soon as I've had a chance to test it out.
Thanks!
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Nick Burkitt

Dec 19, 2016, 8:49:51 PM
to Dbus-cxx, nebu...@gmail.com
Hi Robert.

I'm not having much luck with the new code. The Service executable will occasionally hang on startup. When it doesn't, I'm able to run the Application executable, and the two are generally able to talk to each other if the conversation is short. If I send a batch of commands, typically all dbus traffic stops. On rare occasions I can get as far as getting the Application to start sending TCP data, but if I then send the commands to stop, it hangs.
I haven't been able to get the simplelogger to say anything. I'm not sure how a reference in a shared library gets resolved to an address in one of the executables to which it's linked. At any rate, I duplicated my earlier logger hack so I could get something. I've attached the logs as files. I've only included the tail end, where the problem occurs. If you'd like to see more, let me know.
Thanks,

-Nick
...
Service.log
Application.log

Robert Middleton

Dec 20, 2016, 8:43:21 AM
to Nick Burkitt, Dbus-cxx, Nick Burkitt
That's odd.  I'll upload the code that I've been using for testing this issue either tonight or tomorrow; maybe that will help find the problem.

-Robert Middleton

Robert Middleton

Dec 20, 2016, 6:16:15 PM
to Nick Burkitt, Dbus-cxx, Nick Burkitt
Here are the two programs that I've been using to test the issue.  The receive program also shows how you can add a callback to get the log messages from the library.

I thought I just reproduced the issue, but it turns out it was a segfault.  So there may still be a segfault around someplace.

Actually, I was just able to reproduce it.  Let me see if I can catch it again in the debugger.

-Robert Middleton
receive.cpp
send.cpp

Nick Burkitt

Dec 20, 2016, 6:34:26 PM
to Dbus-cxx
Hi Robert.

Thanks - I'll take a look at them shortly. I'll keep my fingers crossed for you and your debugger. :-)

-Nick

Robert Middleton

Dec 20, 2016, 6:35:44 PM
to Nick Burkitt, Dbus-cxx, Nick Burkitt
Okay, a quick look at the code indicates that it is deadlocking.  Thread 1 on the sending side is stuck at dispatcher.cpp:355, while Thread 2 is stuck at dispatcher.cpp:216, which then goes into watch.cpp, which then goes into libdbus, which then goes into pthread_mutex_lock.  Both of these sections of code take a lock on m_mutex_read_watches.

Stack trace of the sending side:

(gdb) thread apply all bt

Thread 2 (Thread 0x7ffff62e1700 (LWP 2204)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ffff67014d4 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffff6701336 in __GI___pthread_mutex_lock (mutex=0x609880) at ../nptl/pthread_mutex_lock.c:114
#3  0x00007ffff791dfc8 in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#4  0x00007ffff791e62a in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#5  0x00007ffff7936d2a in dbus_watch_handle () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#6  0x00007ffff7bc3f5e in DBus::Watch::handle (this=0x6092b0, flags=flags@entry=1) at watch.cpp:87
#7  0x00007ffff7bc3fcb in DBus::Watch::handle_read (this=<optimized out>, error=error@entry=false, hangup=hangup@entry=false)
    at watch.cpp:95
#8  0x00007ffff7b92503 in DBus::Dispatcher::handle_read_and_write_watches (this=this@entry=0x6082c0, fds=fds@entry=0x7ffff62e0eb0)
    at dispatcher.cpp:216
#9  0x00007ffff7b996d1 in DBus::Dispatcher::dispatch_thread_main (this=0x6082c0) at dispatcher.cpp:182
#10 0x00007ffff74b5970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007ffff66ff0a4 in start_thread (arg=0x7ffff62e1700) at pthread_create.c:309
#12 0x00007ffff6c2587d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7ffff7fd3740 (LWP 2200)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ffff67014b9 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffff67012e0 in __GI___pthread_mutex_lock (mutex=0x6082e8) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007ffff7b953ce in __gthread_mutex_lock (__mutex=0x6082e8) at /usr/include/x86_64-linux-gnu/c++/4.9/bits/gthr-default.h:748
#4  lock (this=0x6082e8) at /usr/include/c++/4.9/mutex:135
#5  lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/4.9/mutex:377
#6  DBus::Dispatcher::on_watch_toggled (this=0x6082c0, watch=std::shared_ptr (count 2, weak 0) 0x18f4610) at dispatcher.cpp:355
#7  0x00007ffff7b99f02 in operator() (_A_a1=..., this=<optimized out>) at /usr/include/sigc++-2.0/sigc++/functors/mem_fun.h:1856
#8  operator()<const std::shared_ptr<DBus::Watch>&> (_A_arg1=..., this=<optimized out>)
    at /usr/include/sigc++-2.0/sigc++/adaptors/adaptor_trait.h:89
#9  sigc::internal::slot_call1<sigc::bound_mem_functor1<bool, DBus::Dispatcher, std::shared_ptr<DBus::Watch> >, bool, std::shared_ptr<DBus::Watch> >::call_it (rep=<optimized out>, a_1=...) at /usr/include/sigc++-2.0/sigc++/functors/slot.h:137
#10 0x00007ffff7b9022a in sigc::internal::signal_emit1<bool, std::shared_ptr<DBus::Watch>, sigc::nil>::emit (impl=0x60b940, _A_a1=...)
---Type <return> to continue, or q <return> to quit---
    at /usr/include/sigc++-2.0/sigc++/signal.h:921
#11 0x00007ffff7b8c5a8 in emit (this=<optimized out>, _A_a1=...) at /usr/include/sigc++-2.0/sigc++/signal.h:2801
#12 DBus::Connection::on_watch_toggled_callback (cwatch=<optimized out>, data=0x608620) at connection.cpp:860
#13 0x00007ffff79353bb in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#14 0x00007ffff7934ef2 in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#15 0x00007ffff793587c in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#16 0x00007ffff7935fc2 in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#17 0x00007ffff7934ca8 in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#18 0x00007ffff791e074 in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#19 0x00007ffff791ea4d in ?? () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#20 0x00007ffff791f02a in dbus_connection_send_with_reply_and_block () from /lib/x86_64-linux-gnu/libdbus-1.so.3
#21 0x00007ffff7b898b5 in DBus::Connection::send_with_reply_blocking (this=0x608620, message=
    std::shared_ptr (count 5, weak 0) 0x2d4fbb0, timeout_milliseconds=timeout_milliseconds@entry=-1) at connection.cpp:370
#22 0x00007ffff7bb5744 in DBus::ObjectProxy::call (this=0x60beb0, call_message=std::shared_ptr (count 5, weak 0) 0x2d4fbb0, 
    timeout_milliseconds=-1) at objectproxy.cpp:446
#23 0x00007ffff7ba1b3e in DBus::InterfaceProxy::call (this=<optimized out>, call_message=std::shared_ptr (count 5, weak 0) 0x2d4fbb0, 
    timeout_milliseconds=<optimized out>) at interfaceproxy.cpp:266
#24 0x00007ffff7ba9d6c in DBus::MethodProxyBase::call (this=<optimized out>, call_message=..., timeout_milliseconds=<optimized out>)
    at methodproxybase.cpp:82
#25 0x000000000040392e in DBus::MethodProxy<void, std::string, sigc::nil, sigc::nil, sigc::nil, sigc::nil, sigc::nil, sigc::nil>::operator() (this=0x60c8a0, _val_1="two") at /usr/local/include/dbus-cxx-0.8/dbus-cxx/methodproxy.h:363
#26 0x0000000000402865 in main (argc=1, argv=0x7fffffffe368) at send.cpp:43

-Robert Middleton

Nick Burkitt

Dec 20, 2016, 8:47:11 PM
to Dbus-cxx
Hi Robert.

So we think thread 1 has a lock on mutex 1 and is waiting on mutex 2, while thread 2 has a lock on mutex 2 and is waiting on mutex 1? That's not good.
Do we need to hold those locks as long as we are? In handle_read_and_write_watches() for example, can we release the lock after the return from find()? The idea is really just to protect the integrity of the map while we're searching it. handle_read() would need to be able to survive a deleted watch, of course.
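
Roughly this kind of shape is what I mean (a sketch of the idea only; the types and member names are stand-ins, not the real declarations):

#include <map>
#include <memory>
#include <mutex>

// Stand-in for the real Watch wrapper.
struct Watch { void handle_read( bool /*error*/, bool /*hangup*/ ) {} };

static std::mutex m_mutex_read_watches;                        // guards m_read_watches
static std::map<int, std::shared_ptr<Watch>> m_read_watches;

static void handle_readable_fd( int fd )
{
    std::shared_ptr<Watch> watch;
    {
        // Hold the lock only long enough to find the entry and copy the
        // shared_ptr out; the map's integrity is protected while we search.
        std::lock_guard<std::mutex> guard( m_mutex_read_watches );
        std::map<int, std::shared_ptr<Watch>>::iterator it = m_read_watches.find( fd );
        if ( it != m_read_watches.end() )
            watch = it->second;
    }

    // Lock released: calling back into libdbus from here can no longer
    // deadlock against on_watch_toggled(), which takes the same mutex.
    if ( watch )
        watch->handle_read( false, false );
}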

-Nick
...

Nick Burkitt

Dec 20, 2016, 8:49:26 PM
to Dbus-cxx
Not actually deleted, of course, but erased from the map and no longer active.

Robert Middleton

Dec 20, 2016, 11:07:01 PM
to Nick Burkitt, Dbus-cxx
I'm not actually sure if that would work well at all, as I'm not sure if the watch is in a good state at that point, or what you have to do with it.  Alternatively, it would be possible to queue up the watches as they come into the on_watch_toggled method and still update our data structures in the dispatcher thread.

But I have just committed a fix that should handle this in some regard.  Looking at the DBus API, I realized that the DBusWatchToggledFunction doesn't actually need to return a value (it was returning true or false in dispatcher.cpp).  Thinking about it even more, I realized that we don't even need separate read_watches and write_watches lists in the dispatcher at all - we can figure out whether we have to read or write the FD that the watch provides when we build up our array of pollfd structures.  Really, all that on_watch_toggled has to do is wake up the thread, which will do whatever processing is needed and then re-build the pollset for the poll() call.  This is probably a little less efficient, but given the deadlock that we had before we don't have much of a choice at the moment.
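
In sketch form, the rebuild looks something like this (the WatchInfo fields are made up for illustration; the real Watch class and its accessors differ):

#include <poll.h>
#include <vector>

// Stand-in for the real Watch wrapper.
struct WatchInfo
{
    int  fd;
    bool enabled;
    bool readable;
    bool writable;
};

std::vector<struct pollfd> build_pollset( const std::vector<WatchInfo>& watches, int wakeup_fd )
{
    std::vector<struct pollfd> pollset;

    // Always include the wakeup end of the socketpair, so other threads
    // can interrupt poll() when the watch list changes.
    struct pollfd wake = {};
    wake.fd = wakeup_fd;
    wake.events = POLLIN;
    pollset.push_back( wake );

    for ( const WatchInfo& w : watches )
    {
        if ( !w.enabled )
            continue;
        struct pollfd p = {};
        p.fd = w.fd;
        if ( w.readable ) p.events |= POLLIN;
        if ( w.writable ) p.events |= POLLOUT;
        pollset.push_back( p );
    }
    return pollset;
}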

Unfortunately, this doesn't yet seem to fix the problem. :(  After a re-build however, I only get a segfault, so maybe I just had to re-compile and re-link? 

I should be able to take another look at it this weekend sometime.

(also check the comment at the end of the documentation for dbus_connection_set_watch_functions)

I did manage to catch the segfault twice, though; now it seems to segfault in connection.cpp:937.  The last output from the log always looks something like this:

dbus.Dispatcher - dispatch status changed: 0
dbus.Dispatcher - Dispatch Status: 1   Hints: DISPATCH_DATA_REMAINS=0
dbus.Dispatcher - dispatch status changed: 1
dbus.Dispatcher - Dispatch Status: 1   Hints: DISPATCH_DATA_REMAINS=0
dbus.Dispatcher - Dispatch Status: 1   Hints: DISPATCH_DATA_REMAINS=0
dbus.Dispatcher - Dispatch Status: 1   Hints: DISPATCH_DATA_REMAINS=0

(I don't have the thread IDs in at the moment)

-Robert Middleton

Robert Middleton

Dec 29, 2016, 8:58:33 PM
to Nick Burkitt, Dbus-cxx
I had some time the past few days to take a look at this, and the most common problem that I have seen is that sometimes the sending side will throw an exception saying that the receiving side did not respond to the message, even though it did.  So I'm not sure what the problem is there.  I have only seen another segfault once.

Specifically, the error is getting thrown in connection.cpp:372.  Catching it in gdb, error->what() says "Attempt to extract a component of a value that is not a structure pointer."  However when I catch the exception in my code error->what() says "Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.".  So something weird is going on, but I don't know what.

If you're still seeing problems, I'm not sure what else the cause could be.  It looks like it may be deep within the DBus library, but that seems odd to me.  I would recommend catching the exception when you call a remote method.

One thing that you should try if you haven't yet is to ensure that you have built everything cleanly, as I have seen some weird issues with updating the shared library but not re-linking the program that uses the library.

I'll push out a new release in a few weeks that also includes the file descriptor passing, so that there will be a good baseline.  There have been a few other bugs fixed since the last release (check the SVN history if you're curious).

-Robert Middleton

Nick Burkitt

Dec 30, 2016, 2:59:54 PM
to Dbus-cxx
Hi Robert.

I do catch exceptions when calling dbus methods, and I see the same problem, at least from the application side. I would expect that the exception seen by the application is thrown in response to the first exception, which is likely the real culprit. Line 372 in connection.cpp isn't a throw - what source are you using? Or, what method is that line in?
I doubt that this is a problem with the dbus library itself. It's too well used for something this easy to reproduce to be lurking in the shadows.
Thanks,

-Nick

Robert Middleton

Dec 30, 2016, 5:23:37 PM
to Nick Burkitt, Dbus-cxx
It's line 372 in the api-cleanup branch (should have made that clear): https://sourceforge.net/p/dbus-cxx/code/HEAD/tree/branches/api-cleanup/dbus-cxx/dbus-cxx/connection.cpp#l372

The fact that it happens sometimes though and not all the time is very strange, so I have no idea what could be causing it at this point.

-Robert Middleton
