Surprised about ROS node that is stuck at shutdown if running with RT priority on single-core system

Jonas Sticha

May 20, 2014, 4:48:45 AM
to ros-sig-...@googlegroups.com
Dear all,

While working on my bachelor's thesis (Validating the Real-Time Capabilities of the ROS middleware), I came across surprising behavior in roscpp: under certain conditions a rosnode gets stuck, cannot be preempted, and produces 100% CPU load.
All of the following conditions have to apply:
- single-core CPU or only one CPU core accessible
- the rosnode is running with RT priority
- the roscore process is running with normal priority
- the rosnode has a thread in which ros::spin() is running
If those conditions are met, the rosnode hangs with 100% CPU load as soon as ros::shutdown() is called or as soon as the return statement of the main function is reached.
A small program in C++ that reproduces this behavior under the given conditions is available at the following URL: https://github.com/bmwcarit/stuck_at_shutdown
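
For readers who only want to see what "running with RT priority" amounts to, here is a minimal sketch of that one condition. It is not taken from the stuck_at_shutdown repository; the helper name try_enable_round_robin is hypothetical, and SCHED_RR is assumed as the policy. It uses only the standard Linux sched_setscheduler call and normally needs root (or CAP_SYS_NICE) to succeed.

```cpp
#include <sched.h>   // sched_setscheduler, sched_get_priority_min/max, SCHED_RR
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper (an assumption for illustration, not code from the
// stuck_at_shutdown repository): switches the calling thread to SCHED_RR
// at the given priority. Returns true on success.
bool try_enable_round_robin(int priority) {
    const int lo = sched_get_priority_min(SCHED_RR);
    const int hi = sched_get_priority_max(SCHED_RR);
    if (priority < lo || priority > hi) {
        std::fprintf(stderr, "priority %d is outside [%d, %d]\n", priority, lo, hi);
        return false;
    }

    sched_param param{};
    param.sched_priority = priority;

    // A pid of 0 selects the calling thread. Without sufficient privileges
    // this typically fails with EPERM.
    if (sched_setscheduler(0, SCHED_RR, &param) != 0) {
        std::fprintf(stderr, "sched_setscheduler failed: %s\n", std::strerror(errno));
        return false;
    }
    return true;
}
```

Calling such a helper in the first lines of main, before any ROS initialization, would make threads created later start with the same policy and priority by default.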

If you do not have a single-core platform at hand, you can also use cgroups to bind the roscore and the stuck_at_shutdown process to one CPU core.
This can be done by executing the following commands on your multi-core Linux machine (as root):
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/cpuset
mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset
echo 0 > cpuset.cpu_exclusive
echo 0 > cpuset.mem_exclusive
mkdir foo
cd foo
echo 0 > cpuset.cpus
echo 0 > cpuset.mems
Now we have created a new cgroup called "foo" which is bound to CPU core 0.
To start the roscore and the stuck_at_shutdown binary as members of that cgroup, execute the following commands (still as root):
cgexec -g cpuset:foo roscore &
cgexec -g cpuset:foo ./stuck_at_shutdown
Running htop from a different shell shows that one of the threads created by ROS now uses 100% CPU time and never terminates.

Any thoughts?

Greetings,
Jonas

Brian Gerkey

May 20, 2014, 11:05:12 AM
to ros-sig-...@googlegroups.com

Does giving roscore RT priority avoid the problem? If so, then perhaps the deadlock is caused by the node trying to deregister with the master, but the master never gets a chance to run.

Alternatively, does calling spin from the node's main, without the extra thread, avoid the problem? If so, then perhaps the problem is in thread scheduling, with main waiting for the other thread, but not giving it a chance to run. Do threads automatically inherit their parent's priority?

--
You received this message because you are subscribed to the Google Groups "ros-sig-embedded" group.

Jonas Sticha

May 21, 2014, 7:52:26 AM
to ros-sig-...@googlegroups.com
When roscore is also run with RT priority and round-robin scheduling, the program does not get stuck but aborts with a core dump.
The following error message is printed to the shell:
stuck_at_shutdown: /usr/include/boost/thread/pthread/recursive_mutex.hpp:92: boost::recursive_mutex::~recursive_mutex(): Assertion `!pthread_mutex_destroy(&m)' failed.
Aborted (core dumped)

We also suspect a deadlock caused by the master never getting a chance to answer. However, we find it surprising that the node produces 100% CPU load while waiting for that answer: because the node runs with RT priority and never blocks, the scheduler never gives roscore any CPU time in the first place.
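
To make the distinction concrete, here is a minimal sketch contrasting two ways of waiting for a reply. It is not roscpp's actual shutdown code, and every name in it (Reply, wait_busy, wait_blocking) is hypothetical. The first function polls a flag without ever blocking, so the thread stays runnable for the whole wait; under an RT policy on a single core, that would keep a normal-priority process from ever being scheduled. The second blocks on a condition variable, which lets the kernel run other processes in the meantime.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for "the master has answered"; not a roscpp type.
struct Reply {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;                   // guarded by m, used by wait_blocking
    std::atomic<bool> ready_flag{false};  // lock-free copy, used by wait_busy

    // Marks the reply as arrived and wakes any blocked waiter.
    void set() {
        {
            std::lock_guard<std::mutex> lock(m);
            ready = true;
        }
        ready_flag.store(true);
        cv.notify_all();
    }
};

// Busy wait: polls the flag and the clock in a tight loop. The thread never
// sleeps, so it consumes a full core until the flag is set or time runs out.
bool wait_busy(Reply& r, std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!r.ready_flag.load()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
    }
    return true;
}

// Blocking wait: the thread sleeps inside wait_for until it is notified or
// the timeout expires, leaving the CPU free for other processes.
bool wait_blocking(Reply& r, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(r.m);
    return r.cv.wait_for(lock, timeout, [&r] { return r.ready; });
}
```

Both functions return true once set() has been called and false on timeout; they differ only in whether the waiting thread gives up the CPU while it waits.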

Calling ros::spin() from the main thread and ros::shutdown() from a different one leads to the same behavior.

To answer your question about priority inheritance: according to the Linux manual pages pthread_create(3) and pthread_attr_setinheritsched(3), a new thread inherits priority and scheduling policy from the thread that creates it by default, unless explicitly specified otherwise through the attributes passed to pthread_create.
By inspecting the stuck_at_shutdown program with htop you can also see that all threads which are created by ROS run with the priority and scheduling policy we specify within the first lines of the main function for the main thread.
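
The same observation can be checked programmatically. The following is a minimal sketch, independent of ROS, that compares the scheduling policy and priority of the calling thread with those of a thread created with default attributes. The names SchedInfo, sched_of_calling_thread, and sched_of_new_default_thread are hypothetical; only standard pthread calls are used.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical record of a thread's scheduling settings (illustration only).
struct SchedInfo {
    int policy = -1;
    int priority = -1;
    bool valid = false;
};

// Stores the policy and priority of whichever thread executes this function.
static void* record_own_sched(void* arg) {
    auto* info = static_cast<SchedInfo*>(arg);
    sched_param param{};
    if (pthread_getschedparam(pthread_self(), &info->policy, &param) == 0) {
        info->priority = param.sched_priority;
        info->valid = true;
    }
    return nullptr;
}

// Scheduling settings of the calling thread.
SchedInfo sched_of_calling_thread() {
    SchedInfo info;
    record_own_sched(&info);
    return info;
}

// Scheduling settings of a freshly created thread. Passing nullptr as the
// attribute argument requests default attributes, so no explicit policy or
// priority is set for the new thread.
SchedInfo sched_of_new_default_thread() {
    SchedInfo info;
    pthread_t tid;
    if (pthread_create(&tid, nullptr, record_own_sched, &info) != 0) {
        return info;  // valid stays false if the thread could not be created
    }
    pthread_join(tid, nullptr);
    return info;
}
```

Comparing the two results after switching the main thread to an RT policy should show identical values, which would match what htop displays for the threads created by ROS.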

Regards,
Jonas

Brian Gerkey

May 21, 2014, 10:33:09 PM
to ros-sig-...@googlegroups.com

Perhaps there's a busy loop in the deregistration XML-RPC interaction. I have a distant memory of modifying the event loop of xmlrpcpp to allow asynchronous interactions. I could imagine that capability being used in a busy loop at shutdown.
