A First Step towards Validating the Real-Time Capabilities of ROS


Jonas Sticha

Mar 27, 2014, 6:48:41 AM
to ros-sig-...@googlegroups.com

Hello everyone,

 

I am studying computer science at the Technische Universität München, and I am currently working on my Bachelor's thesis at BMW Car IT under the supervision of Lukas Bulwahn. In my thesis, I analyze the real-time characteristics of ROS. As a first step, I have written a test suite that measures the latencies of the ROS-Timer function under different conditions, for example, for different timeout values and with the node running at real-time or normal process priority. All the tests were run on a PandaBoard with a Linux operating system (3.4.0 kernel with PREEMPT_RT patch).

 

So far, the tests have revealed that as long as the test node runs with normal process priority, the system isn't experiencing heavy CPU load, and the timeout value is at least 1 millisecond, the latencies of the ROS-Timer are very similar to the latencies of the nanosleep system call, especially with regard to the maximum latencies. Under those conditions, the latency values passed to the callback function are also quite accurate (on average, they deviate from the measured latencies by less than 10 microseconds).
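As a rough illustration of what a nanosleep-style baseline measurement can look like, here is a minimal sketch, assuming Linux and the POSIX `clock_nanosleep` call with an absolute deadline on `CLOCK_MONOTONIC`. The helper names `now_ns`, `measure_sleep_latency_ns` and `max_sleep_latency_ns` are hypothetical and are not taken from the test suite; latency here simply means how long after the requested deadline the thread actually woke up.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <time.h>  // clock_gettime, clock_nanosleep (POSIX, Linux)

// Current time on the monotonic clock, in nanoseconds.
static int64_t now_ns() {
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Sleep until (now + timeout_ns) and return how late the wake-up was,
// in nanoseconds. Hypothetical helper for illustration only.
int64_t measure_sleep_latency_ns(int64_t timeout_ns) {
    assert(timeout_ns >= 0);
    const int64_t deadline = now_ns() + timeout_ns;
    timespec target{};
    target.tv_sec = deadline / 1000000000LL;
    target.tv_nsec = deadline % 1000000000LL;
    // TIMER_ABSTIME makes the deadline absolute; retry if a signal interrupts.
    while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target, nullptr) == EINTR) {
    }
    return now_ns() - deadline;
}

// Largest wake-up latency observed over a number of sleeps.
int64_t max_sleep_latency_ns(int iterations, int64_t timeout_ns) {
    int64_t worst = 0;
    for (int i = 0; i < iterations; ++i) {
        worst = std::max(worst, measure_sleep_latency_ns(timeout_ns));
    }
    return worst;
}
```

Calling `max_sleep_latency_ns(1000, 1000000)` would report the worst wake-up latency over 1000 sleeps of 1 millisecond each; the same loop could be run under CPU load and at different priorities for comparison.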

However, as soon as the system experiences a high CPU load, the maximum measured latency values increase significantly, even if the test node is running with real-time priority. We think the reason might be that the timer is actually running in the roscore process. In our opinion, this would explain why running the test node with real-time priority doesn't improve the test results. Also, under high CPU load, the differences between the latency values passed to the callback function and the measured latencies start to increase, with peak differences of over 1000 microseconds. So in a scenario with high CPU load, the latency values passed to the callback function don't seem to be reliable. Furthermore, the tests revealed that, independent of the previously mentioned scenarios, timeout values smaller than approximately 1 millisecond lead to disproportionately high latencies (an average of roughly 980 microseconds).


We plan to continue working on additional test cases where the roscore is also run with real-time priority. Another focus of my thesis is to validate the real-time capabilities of the Publish/Subscribe communication mechanism.

Are you working on similar topics?
Can you comment on the interpretation of the previous test results?

Greetings
Jonas

Brian Gerkey

Mar 27, 2014, 1:24:53 PM
to ros-sig-...@googlegroups.com
hi Jonas,

This sounds like interesting work, and using ROS in real-time systems
is definitely something that comes up often.

I'm not aware of anybody having done the kind of testing that you're
describing, and would be very interested to see the results; maybe you
can share your data and/or paper when they're ready?

One point: the roscpp timer events are being handled locally in the
roscpp node, without any interaction with the master (aka roscore).
So I would look somewhere else, inside roscpp, for the cause of the
high-load jitter that you're describing.

brian.

Paul Mathieu

Mar 27, 2014, 7:51:53 AM
to ros-sig-...@googlegroups.com
Hi Jonas,

You are undertaking a very interesting task. As demand shifts towards an industrialization of ROS, real-time capability is definitely on the table.
As for the Timer you mention, it is in no way running in the roscore process. Think of roscore as a DNS server.
By default, every node keeps track of time by calling clock_gettime(). Unless you are using the /use_sim_time param and publishing the time on the /clock topic, this will also be your case. Also, every node has its own callback queue that, among other things, triggers timer callbacks. Communication with roscore is usually the result of actions such as subscribe, advertise, getParam, etc.

The issue of timing in real-time systems is quite delicate. I am not very familiar with the PREEMPT_RT patch, but the system clock may drift under heavy load and jump when the time of day is adjusted, which is why in real-time systems people usually use a monotonic clock that never jumps and thus provides accurate timings for repetitive tasks.
My guess is that ros::Timer uses ros::Time to specify wake-up times, and everything is served by the associated callback queue. This means that because the system time is potentially being adjusted by the kernel with some load-induced delay, your callbacks may be served with a similar delay.
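
To make the monotonic-clock idea concrete, here is a minimal sketch of a periodic loop that uses absolute deadlines on `std::chrono::steady_clock`, which the C++ standard requires to be monotonic. This is only an illustration of the general technique, not how roscpp implements its timers; the function name `run_periodic` is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Run `cycles` iterations at a fixed period and return, for each wake-up,
// how late it was in microseconds. Hypothetical helper for illustration.
std::vector<long long> run_periodic(int cycles, std::chrono::microseconds period) {
    assert(cycles >= 0);
    using clock = std::chrono::steady_clock;  // monotonic, never adjusted
    std::vector<long long> lateness_us;
    lateness_us.reserve(static_cast<std::size_t>(cycles));

    auto deadline = clock::now();
    for (int i = 0; i < cycles; ++i) {
        // Advance the absolute deadline so that a late wake-up in one cycle
        // does not shift all following cycles.
        deadline += period;
        std::this_thread::sleep_until(deadline);
        const auto late = clock::now() - deadline;
        lateness_us.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(late).count());
    }
    return lateness_us;
}
```

Because the deadlines are derived from a clock that is never set or stepped, time-of-day adjustments do not affect the period; scheduling delays under load can of course still make individual wake-ups late.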

Hope that helps,

Paul



Jonas Sticha

Apr 25, 2014, 9:43:45 AM
to ros-sig-...@googlegroups.com
Hi,

First of all, thank you for your answers. I am glad to hear that there is interest in my research.
Secondly, I found the cause of the high latencies when running the test node with real-time priority. As Brian and Paul already suggested, the reason was not the lower process priority of the roscore process. As it turned out, the priorities of the threads created by the ros::init() call weren't set appropriately. This might have caused some of those threads to starve, leading to extremely high latencies.
The solution was to set the priority and scheduling policy before calling ros::init(). This made the newly created threads inherit the desired priority and scheduling policy from the main process, which resolved the issue.
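
The inheritance effect described above can be sketched without ROS, assuming Linux with glibc, where threads created with default attributes take over the creating thread's scheduling policy. The helper names are hypothetical, and `SCHED_FIFO` is used only as an example policy; switching to it normally requires root privileges or a suitable rtprio limit, so the call may fail.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Try to switch the calling thread to SCHED_FIFO at the given priority.
// Returns false if the call fails (typically due to missing privileges).
bool try_set_fifo(int priority) {
    sched_param param{};
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0;
}

// Scheduling policy of the calling thread.
int current_policy() {
    int policy = 0;
    sched_param param{};
    const int rc = pthread_getschedparam(pthread_self(), &policy, &param);
    assert(rc == 0);
    (void)rc;  // silence unused warning when asserts are disabled
    return policy;
}

// Scheduling policy observed from inside a freshly created thread.
int policy_of_new_thread() {
    int policy = -1;
    std::thread worker([&policy] { policy = current_policy(); });
    worker.join();
    return policy;
}
```

In a node, the equivalent of `try_set_fifo(...)` would run before the library creates its threads, so that `policy_of_new_thread()` would report the same policy as the main thread. Threads that already exist when the policy is changed keep their old settings.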
 
Furthermore, I am happy to announce that I have finally published my test suite on GitHub, along with detailed documentation of the test setup and the first test results. You are welcome to have a look at it, or even run the test suite on another system.
The documentation is at http://bmwcarit.github.io/ros_realtime_tests ; the git repository of the test suite is at https://github.com/bmwcarit/ros_realtime_tests .

Also, in a first attempt to address the surprising handling of timeout values smaller than 1 millisecond (as described in my previous mail), I applied the following patch to the timer_manager header file (original code above the ==== separator, patched code below it):
      {
        // On system time we can simply sleep for the rest of the wait time, since anything else requiring processing will
        // signal the condition variable
        int32_t remaining_time = std::max((int32_t)((sleep_end - current).toSec() * 1000.0f), 1);
        timers_cond_.timed_wait(lock, boost::posix_time::milliseconds(remaining_time));
      }
====
      {
        // On system time we can simply sleep for the rest of the wait time, since anything else requiring processing will
        // signal the condition variable
        long long remaining_time = (sleep_end - current).toNSec();
        timers_cond_.timed_wait(lock, boost::posix_time::microseconds(remaining_time/1000));
      }
<<<< 
In initial tests, this seemed to resolve the issue. Longer tests, however, showed that with the patch applied, high latencies (>10 milliseconds) start to occur, even with timeout values that worked fine before.
Any thoughts on that?
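
For reference, the arithmetic difference between the two snippets can be sketched in isolation. This only reproduces the computation of the wait duration and leaves out Boost and the condition variable; the function names are hypothetical. It shows that the original code truncates to whole milliseconds and never waits less than 1 millisecond, while the patched code has microsecond resolution and no lower bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Wait duration in microseconds as computed by the original snippet:
// convert to milliseconds, truncate to an integer, clamp to at least 1 ms.
long long original_wait_us(double remaining_sec) {
    const int32_t ms =
        std::max<int32_t>(static_cast<int32_t>(remaining_sec * 1000.0f), 1);
    return static_cast<long long>(ms) * 1000;
}

// Wait duration in microseconds as computed by the patched snippet:
// integer division of the remaining nanoseconds, with no lower bound.
long long patched_wait_us(long long remaining_ns) {
    return remaining_ns / 1000;
}
```

With 200 microseconds remaining, `original_wait_us(0.0002)` yields 1000 while `patched_wait_us(200000)` yields 200. Note that the patched value becomes zero or negative once the deadline has passed; what `timed_wait` does with such a duration is not covered by this sketch.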

Greetings,
Jonas

Lukas Bulwahn

Aug 18, 2014, 2:23:09 AM
to ros-sig-...@googlegroups.com
Hi everyone,

In mid-July, Jonas Sticha finished his thesis ''Validating the Real-Time Capabilities of the ROS Communication Middleware''. His thesis is available on our Web site at 'About Us -> Publications', or directly through the URL:

http://www.bmw-carit.com/downloads/publications/ValidatingTheRealTimeCapabilitiesOfTheROSCommunicationMiddleware.pdf

The sources of the test suite for validating the real-time properties are located in our GitHub space:

https://github.com/bmwcarit/ros_realtime_tests

Also, here is a short summary of his results:
For one-shot timers, the probability of latencies above 400 microseconds can be approximated as a value between 1.0E-8 and 1.0E-9, based on our measurements in the described setting (Linux PREEMPT_RT on a PandaBoard).
Given this empirical data, one could say that ROS timers have a hard real-time capability (for the selected operating system configuration and hardware). In our view, a maximum latency of 400 microseconds can be tolerable for many (high-level) controllers. However, it is still quite high compared to the maximum latency of 200 microseconds that can be achieved with a fully preemptive Linux kernel and nanosleep timers on these high-performance embedded boards. So there is still room for improving ROS timers.
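
As a small illustration of how such a probability relates to raw measurements, the following sketch computes the fraction of recorded latencies above a threshold. It is a plain empirical estimate and not necessarily the method used in the thesis; the function name and the sample values in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of latency samples (in microseconds) strictly above a threshold.
// Hypothetical helper for illustration only.
double exceedance_fraction(const std::vector<long long>& latencies_us,
                           long long threshold_us) {
    assert(!latencies_us.empty());
    std::size_t above = 0;
    for (long long value : latencies_us) {
        if (value > threshold_us) {
            ++above;
        }
    }
    return static_cast<double>(above) / static_cast<double>(latencies_us.size());
}
```

For the made-up samples `{100, 200, 450, 390}` and a threshold of 400, the function returns 0.25. Resolving probabilities as small as 1.0E-8 directly in this way would require on the order of 1.0E8 samples or more.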

At the end of June, Jonas also reported that one-shot ROS timers sometimes do not fire. Paul Mathieu and Tully tried to reproduce the problem, but did not detect it within a couple of hundred thousand iterations. In a further personal discussion, Jonas confirmed to me that the problem occurs on the various hardware he tested on, but on desktop machines and laptops it happens very seldom (less than one in a million). To observe the behavior, it is best to choose a very high number of iterations (e.g., 10,000,000) and then let the program run while working on that machine throughout the day; the program does not use much CPU or memory. I have asked some Linux experts, and they assume that timer signals for other processes may be set before the ROS nodes fetch their timer signal, and that Linux does not guarantee that all processes eventually get their timer signals.

I am also planning to continue the evaluation and to work on improvements to the real-time capabilities of ROS in a subsequent bachelor's or master's thesis. In that work, we plan to automate the testing for long-term and continuous measurements, measure the real-time characteristics of nodelet communication, analyze the detected issues, and prototype ROS timers based on nanosleep timers. Considering the time it takes to find a suitable candidate for the thesis, I expect that we will only continue the work at some point next year.

If anyone in the community is interested in continuing work based on our findings, feel free to contact me and we can discuss possible next steps that are valuable to the overall project. For us, this has been a first step motivated by the BoF session on ROS' real-time capabilities at ROSCon 2013 last year. We hope that other participants of that session can benefit from this first step.


Best regards,

Lukas