Dynamic addition, removal and listing of diagnostic aggregators


Michal Staniaszek

Sep 8, 2015, 8:37:06 PM
to ROS Drivers Special Interest Group
I've been implementing diagnostics nodes for a system in which some nodes don't exist when the system starts, but are started later by a separate launch script. In the current implementation of the diagnostic aggregator, the analyzers specified in the yaml passed to the aggregator at startup are the only ones that will ever exist on the system. One could launch another diagnostic aggregator under a different name, but that seems messy.

In any case, this limitation means that when using rqt_robot_monitor to view the aggregated diagnostics, the timeline shows the red error state for diagnostics produced by nodes which don't exist yet, even though all the other diagnostics are in the OK state. Modifying the robot_monitor behaviour didn't seem to make sense. Instead, adding functionality that allows analyzers to be added to the aggregator when a node is launched seemed like a solution that would also be useful in other cases, along with the ability to delete and list analyzers.

I've submitted pull requests to common_msgs and diagnostics implementing these changes, which use some suggestions for a service API posted by Ryohei Ueda in this issue.
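To make this concrete, calling such a service from a script or node might look roughly like the sketch below. I'm assuming the srv proposed in the common_msgs pull request ends up looking something like this (a namespace to load analyzer parameters from, plus a success flag and message); the exact service name, type and fields may well differ.

```python
#!/usr/bin/env python
# Sketch only: the service name, srv type and 'load_namespace' field are
# assumptions, not necessarily what the common_msgs/diagnostics PRs define.
import rospy
from diagnostic_msgs.srv import AddDiagnostics  # assumed service type

rospy.init_node('add_analyzers_example')

# Analyzer parameters are assumed to have already been pushed to this
# namespace, e.g. by a <rosparam> tag in a launch file.
load_ns = '/my_capability/analyzers'

rospy.wait_for_service('/diagnostics_agg/add_diagnostics')
add = rospy.ServiceProxy('/diagnostics_agg/add_diagnostics', AddDiagnostics)
resp = add(load_namespace=load_ns)
if not resp.success:
    rospy.logwarn('Failed to add analyzers: %s', resp.message)
```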

The current aggregator code doesn't allow easy recursion over the analyzer tree, so my implementation has a couple of things which seem a little messy.

I'd appreciate any suggestions on additional functionality or improvements to my proposed implementation.

Jack O'Quin

Sep 8, 2015, 8:49:49 PM
to ros-sig...@googlegroups.com
+1 Thanks for the suggestion. It seems useful to me.
--
 joq

Austin Hendrix

Sep 8, 2015, 11:26:10 PM
to ROS Drivers Special Interest Group
I agree that this would be useful, but I don't think the proposed API is appropriate.

I think an add/delete API on a per-item basis gives the user too much freedom to accidentally remove important diagnostics.

The nominal use-case that I see for this API would be wanting to bring up additional diagnostics as part of a capability, and bring those diagnostics down again when that capability is no longer needed.


Perhaps something based on a model like bond would be more appropriate?


-Austin

Michal Staniaszek

Sep 9, 2015, 1:00:37 AM
to ROS Drivers Special Interest Group
Perhaps something based on a model like bond would be more appropriate?
I'm not aware of this model, and some searching didn't bring up any useful information. I assume that this would involve something like linking the node that requested the diagnostics be brought up to the group that was added in the aggregator?

Since the current implementation requires parameters to be loaded into a namespace so that re-use of the AnalyzerGroup initialisation is possible, I made the assumption that most of the calls to add diagnostics would be done from launch files via a script. One of the nodes from the capability launched would then send a deletion request to the diagnostic aggregator on shutdown. As such, it would require a little bit of reworking to make some kind of link.
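Roughly, the pattern I have in mind looks like this from the capability side (the service names and request field here are placeholders of my own rather than the actual PR API): the launch file pushes the analyzer parameters into the node's namespace, the node asks the aggregator to load them, and a shutdown hook sends the deletion request.

```python
#!/usr/bin/env python
# Sketch of the described launch/shutdown pattern; the service names,
# srv package and 'load_namespace' field are placeholders.
import rospy
from my_diagnostics_msgs.srv import AddDiagnostics, RemoveDiagnostics  # hypothetical

def main():
    rospy.init_node('arm_capability_node')
    # The launch file is assumed to push the analyzer parameters under this
    # node's private namespace with a <rosparam> tag.
    load_ns = rospy.get_name() + '/analyzers'

    rospy.wait_for_service('/diagnostics_agg/add_diagnostics')
    add = rospy.ServiceProxy('/diagnostics_agg/add_diagnostics', AddDiagnostics)
    remove = rospy.ServiceProxy('/diagnostics_agg/remove_diagnostics', RemoveDiagnostics)

    add(load_namespace=load_ns)
    # Ask the aggregator to drop these analyzers again when this node shuts down.
    rospy.on_shutdown(lambda: remove(load_namespace=load_ns))
    rospy.spin()

if __name__ == '__main__':
    main()
```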

Michal Staniaszek

Oct 2, 2015, 5:19:10 AM
to ROS Drivers Special Interest Group
I've added a new pull request here using bond.

Austin Hendrix

Oct 2, 2015, 3:52:02 PM
to ros-sig...@googlegroups.com
Aside from the code comments, you still haven't put forward a set of use
cases or a larger justification for your changes. What is the greater
purpose that you're trying to achieve here?

It's difficult to understand the decisions you've made without the
context of how you plan to use them.

I've been considering a few other ideas for making diagnostic
aggregation more dynamic, but I don't have a strong use case to drive
the design process.

It seems like you have a strong use case in mind here; please share it
so that we can evaluate not only the code quality, but the quality of
the design and the API.

-Austin

On 10/02/2015 02:19 AM, Michal Staniaszek wrote:
> I've added a new pull request here
> <https://github.com/ros/diagnostics/pull/43> using bond.

Michal Staniaszek

Oct 7, 2015, 11:37:57 PM
to ROS Drivers Special Interest Group
I don't have much experience using diagnostics, so I can only comment on the
interaction with the rqt_robot_monitor.

In our system we don't launch all nodes at the same time. As a result, some of
the diagnostics expected by the aggregator (i.e. those listed in the .yaml used
to set it up) are not active all the time.

We would like to use the robot monitor as a simple diagnostics UI which a user,
whether an end-user or a team member during development, can look at to see the
cause of an error. Currently the robot monitor is implemented in such a way that
any missing diagnostics (which will be marked as stale) result in error
colouring on the timeline. This is confusing, as nothing is actually wrong with
the system; some nodes just haven't been launched yet. We thought about
modifying the robot monitor, but that would change its basic assumption that
all diagnostics in the aggregator are supposed to be active.

Another option we considered was to have some way of marking nodes which aren't
currently active, but in the end modifying the aggregator to allow diagnostics
to be added on the fly (from a launch file) made the most sense. In the
simplest case, these changes let users have args in launch files which disable
diagnostics they don't want to use, so that they don't show up in the robot
monitor. That isn't currently possible without maintaining separate .yaml files
containing only the things you wish to analyse, which leads to clutter.

A side effect of the changes is that aggregator parameters can be split up and
placed in the packages of the components the diagnostics come from. The current
diagnostic aggregator setup requires that all information about the aggregation
targets be passed to it on startup, which means the user has to maintain a
single large .yaml describing all the different diagnostics in the system. This
isn't a particularly strong case for the changes, but I think there should at
least be the option to split those parameters up.

Daniel Stonier

Oct 27, 2015, 11:41:16 PM
to ROS Drivers Special Interest Group

It seems like you have a strong use case in mind here; please share it
so that we can evaluate not only the code quality, but the quality of
the design and the API.

I'm probably better oriented to answer that question.

We have test engineers, manufacturing engineers and demonstrators who need to know whether the robot is up and running OK, and who then relay important information to us when it is not. They all use different launch modes on top of the basic bootstrap mode. If we put everything we need checks for into the current diagnostics and let them use the robot monitor tool, there will always be many parts not relevant to the current launch mode showing up as red.

None of these people have the knowledge to work out which red is relevant and which is not (that defeats the purpose of a diagnostic tool anyway). I'd really like to break it down to just identifying green or red flags at a first glance, but as it stands it is unusable for me and for them right now.

So, it's either dynamic diagnostics, or build/use another diagnostics framework. I don't have any great alternatives and don't have the cycles to build a diagnostics framework, so the former is a really strong requirement for me right now.

Cheers,
Daniel.

Mike Purvis

Oct 27, 2015, 11:51:18 PM
to ros-sig...@googlegroups.com
We have an interest in this idea too— we ship a lot of composed robots, both at the hardware and software level, in some cases launching stuff at runtime (for example, using the capabilities server). It would be great for those runtime launches to be able to supply additional diagnostic checks, and then shut them down again when complete.

A service interface similar to how ros_control controllers are stopped and started seems perfectly reasonable.
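For readers who haven't used ros_control: the pattern being referred to is a service on the controller manager that starts and stops controllers by name, and a dynamic diagnostics API could take the same service-based shape. The controller name below is made up; the service and srv type are the existing ros_control ones.

```python
# For reference: how controllers are started/stopped in ros_control. A
# diagnostics API could follow the same service-based shape.
import rospy
from controller_manager_msgs.srv import SwitchController, SwitchControllerRequest

rospy.init_node('controller_switch_example')
rospy.wait_for_service('/controller_manager/switch_controller')
switch = rospy.ServiceProxy('/controller_manager/switch_controller',
                            SwitchController)
# 'arm_controller' is a made-up controller name for illustration.
switch(start_controllers=['arm_controller'],
       stop_controllers=[],
       strictness=SwitchControllerRequest.BEST_EFFORT)
```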

M.




Michal Staniaszek

Oct 28, 2015, 12:07:09 AM
to ROS Drivers Special Interest Group
Our initial implementation was something like how ros_control does it (i.e. via a direct request to a service running in the aggregator), but adding a bond on top of this is probably better. The service request is used to define the bond ID, and only once this bond is broken do the diagnostics go down again. This limits the ability to start the diagnostics up with one node and shut them down with another, but I don't think that that's a bad thing.
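In rough client-side form it would look something like the sketch below; the service field and bond topic names are assumptions about the implementation rather than its final API. The add request carries a bond ID, and the analyzers stay loaded only while that bond is alive.

```python
# Sketch of the bond-based lifecycle; the service name, 'bond_id' field and
# '/diagnostics_agg/bond' topic are assumptions. Only the bondpy calls are
# standard API.
import uuid
import rospy
from bondpy import bondpy
from my_diagnostics_msgs.srv import AddDiagnostics  # hypothetical type

rospy.init_node('capability_loader')

bond_id = str(uuid.uuid4())
rospy.wait_for_service('/diagnostics_agg/add_diagnostics')
add = rospy.ServiceProxy('/diagnostics_agg/add_diagnostics', AddDiagnostics)
add(load_namespace='/my_capability/analyzers', bond_id=bond_id)

# Both sides publish heartbeats on the bond topic; when this process stops
# doing so (clean shutdown or crash), the aggregator's side of the bond
# breaks and it unloads the analyzers again.
bond = bondpy.Bond('/diagnostics_agg/bond', bond_id)
bond.start()
rospy.spin()
```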

Daniel Stonier

Oct 28, 2015, 12:55:41 AM
to ROS Drivers Special Interest Group


On Wednesday, 28 October 2015 13:07:09 UTC+9, Michal Staniaszek wrote:
Our initial implementation was something like how ros_control does it (i.e. via a direct request to a service running in the aggregator), but adding a bond on top of this is probably better. The service request is used to define the bond ID, and only once this bond is broken do the diagnostics go down again. This limits the ability to start the diagnostics up with one node and shut them down with another, but I don't think that that's a bad thing.

Just for reference, Michal's development is happening in https://github.com/ros/diagnostics/pull/43.

Daniel.

G.A. vd. Hoorn - 3ME

Oct 28, 2015, 4:42:13 AM
to ros-sig...@googlegroups.com
On 28-10-2015 5:07, Michal Staniaszek wrote:
> Our initial implementation was something like how ros_control does it (i.e.
> via a direct request to a service running in the aggregator), but adding a
> bond on top of this is probably better. The service request is used to
> define the bond ID, and only once this bond is broken do the diagnostics go
> down again. This limits the ability to start the diagnostics up with one
> node and shut them down with another, but I don't think that that's a bad
> thing.
[..]

Just wondering: how do we then discriminate between a sub-system crash
(bond goes down, diagnostics go away), and a proper shutdown (bond goes
down, diagnostics go away)?


Gijs

Jack O'Quin

Oct 28, 2015, 10:00:16 AM
to ros-sig...@googlegroups.com
+1 to the use case Daniel and Mike described. We have similar needs with differently-configured robots using various launch files.
I have not had time to study the proposed fix, but the problem is definitely worth solving somehow.

Austin Hendrix

Oct 28, 2015, 3:11:02 PM
to ros-sig...@googlegroups.com
Thanks for all the feedback; this fix definitely seems like a good
approach for bringing additional diagnostics up and down along with a
launch file.

To address the specific issue here, re differentiating between a sub-system
crash and a shutdown: it has a similar failure mode to the existing
diagnostics, in that we can't differentiate between a crash of the main
aggregator and a clean shutdown of the main aggregator.

If we do have a crash in the loader node for a subsystem, the analyzers
for that subsystem will be unloaded, and those diagnostics should start
to show up in the "Other" category if the other subsystem nodes are
still running.

If the entire launch file for the subsystem crashes, it should unload the
analyzers and the diagnostics for the subsystem will no longer be published,
which looks the same as if the subsystem had shut down cleanly.

I think we can solve some of this with configuration policy; perhaps by having
each modular subsystem as its own top-level diagnostics analyzer. Then checking
whether a subsystem is running or not is simply a matter of the user looking at
the list of diagnostics and confirming that their subsystem exists.

-Austin

Michal Staniaszek

Nov 2, 2015, 8:28:39 PM
to ROS Drivers Special Interest Group
To add to what Austin said, if an individual node in the launch file whose diagnostics
are being aggregated crashes, the diagnostics associated with it will stop being
published. Depending on the configuration of the analyzers, the status of
the namespace containing diagnostics from that node will be set to error due to
stale messages.

As for differentiating between clean shutdown and crash, as far as I can see
there's no mechanism within the diagnostics to do this, and I'm unsure as to
whether it's the right place to do it anyway. After all, rosgraph and other
tools exist to allow introspection into the system. Perhaps I'm not
understanding the problem correctly.

I think we can solve some of this with configuration policy; perhaps by having
each modular subsystem as its own top-level diagnostics analyzer. Then checking
whether a subsystem is running or not is simply a matter of the user looking at
the list of diagnostics and confirming that their subsystem exists.

At the moment all the new analyzers that are added are being put into the main
analyzer group in the aggregator. While I think it would be good to have dynamic
analyzers separated from the base analyzer, it would make no difference to the
user's ability to check the diagnostics list (via the robot monitor, I suppose).
If we want to add something like a service which can list the currently running
analyzers, then we would need to add something which allows an analyzer to be
distinguished from others. This could be done either by using the namespace from
which analyzer parameters were loaded, or via a unique name. Currently, using
the script to add an analyzer loads parameters from the launched node's
namespace, which is by definition a unique name, so this could be exploited.
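As a sketch (entirely hypothetical; no listing service exists yet), such a call could simply return the load namespaces of the dynamically added analyzer groups, since those are unique by construction. The service name, srv package and response field below are all placeholders.

```python
# Hypothetical listing service: neither the service name nor the response
# field exist; this only illustrates using the load namespace as the unique
# identifier for dynamically added analyzers.
import rospy
from my_diagnostics_msgs.srv import ListDiagnostics  # hypothetical type

rospy.init_node('list_analyzers_example', anonymous=True)
rospy.wait_for_service('/diagnostics_agg/list_diagnostics')
list_srv = rospy.ServiceProxy('/diagnostics_agg/list_diagnostics', ListDiagnostics)
for ns in list_srv().load_namespaces:
    # e.g. '/arm_capability_node/analyzers'
    print(ns)
```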

Marcus Liebhardt

Nov 4, 2015, 3:38:50 AM
to ros-sig...@googlegroups.com
On 3 November 2015 at 02:28, Michal Staniaszek <m.stan...@gmail.com> wrote:

As for differentiating between clean shutdown and crash, as far as I can see
there's no mechanism within the diagnostics to do this, and I'm unsure as to
whether it's the right place to do it anyway. After all, rosgraph and other
tools exist to allow introspection into the system. Perhaps I'm not
understanding the problem correctly.

I would be very interested in finding a solution for distinguishing a node crash vs. clean shutdown - whether that's inside or outside the diagnostics.

Thinking out loud:
AFAIK a bond does not distinguish between an intended breakage (as in releasing) and a breakage due to a node crash. Maybe this would be a good addition to bonds: have a way to release a bond for an intended shutdown, and use the timeout to detect stale/crashed nodes.

Or we introduce our own watchdog for diagnostics and use the bond to detect crashes.


Cheers,
Marcus


Michal Staniaszek

Nov 4, 2015, 7:30:39 PM
to ROS Drivers Special Interest Group
There is already a way to break a bond intentionally, using the breakBond() function, or more cleanly using shutdown(), which is what the add_analyzers script does via rospy.on_shutdown. I don't think the clean shutdown/crash distinction is important for this specific feature, though. It might be a nice addition to the analyzer, but it's not what we're trying to add here.
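For reference, a minimal self-contained sketch of that distinction on the client side (the bond topic name is an assumption; the bondpy calls are the standard ones): an explicit shutdown ends the bond deliberately, while a crash simply stops the heartbeats and the other side's timeout fires.

```python
# Clean shutdown vs. crash with bondpy: shutdown() (or break_bond()) ends the
# bond deliberately; a crash just stops the heartbeats, so the aggregator's
# side breaks on timeout instead. The topic name here is an assumption.
import rospy
from bondpy import bondpy

rospy.init_node('diag_bond_example')
bond = bondpy.Bond('/diagnostics_agg/bond', 'example-bond-id')
bond.start()

# Roughly what the add_analyzers script does: end the bond cleanly when the
# node is shut down. If the process crashes instead, this hook never runs
# and the bond breaks by timeout.
rospy.on_shutdown(bond.shutdown)
rospy.spin()
```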