Problems with current multimaster solutions


Piyush

unread,
May 10, 2012, 12:24:21 PM5/10/12
to ros-s...@googlegroups.com
Hi All,

There seem to be a number of existing multi-master solutions. I would like to understand the current problems with those solutions, and what I can do to help.

I have listed a couple of questions I have in this context. I apologize in advance for my lack of networking knowledge:
1) I believe one of the main problems is that TCP is unsuitable over intermittent connectivity. Is there a proposed methodology to fix this?
2) Is there some methodology to do namespace resolution for different machines?

I would appreciate it if anybody could list out other problems with the current solutions.

Thanks,
Piyush

Jeff Rousseau <jrousseau@aptima.com>

unread,
May 10, 2012, 1:13:42 PM5/10/12
to ros-s...@googlegroups.com
> 1) I believe one of the main problems is that TCP is unsuitable over intermittent connectivity. Is there a
> proposed methodology to fix this?

A working ROS UDP transport implementation exists for C++, which helps get over the "TCP doesn't work well over wireless" issue. I'm also exploring a multicast solution (using OpenPGM atm) which may help with sharing topics like /tf--but it's not ready for primetime (yet).

ROSUDP is only available for C++, so a good next step would be to add UDP transport support to the Python and Java language bindings. If you look in the old email archives there's some discussion about possible changes to how ROSUDP currently works that might be worth looking at (and revisiting with some rigorous testing).

> 2) Is there some methodology to do namespace resolution for different machines?

In the past some MM solutions had basic topic name resolution by appending a unique master/machine name to prevent topic collisions: /some_topic -> /machine_name/some_topic
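As a toy illustration of that prefix-based resolution (the function and the example machine name are made up here, not any existing relay's API):

```python
def remap_foreign(topic, machine_name):
    """Prefix a foreign master's topic with a unique machine name
    so it cannot collide with a local topic of the same name.
    Illustrative sketch only; 'machine_name' is whatever unique
    prefix the relay has been configured with."""
    if not topic.startswith('/'):
        raise ValueError("expected a global topic name: %r" % topic)
    return '/%s%s' % (machine_name, topic)

print(remap_foreign('/some_topic', 'machine_name'))  # /machine_name/some_topic
```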

>
> I would appreciate it if anybody could list out other problems with the current solutions.

Before jumping into problems with current solutions it would be nice to have a canonical list of solutions first. I'm primarily familiar with:

foreign_relay + unreliable_relay
app_manager's "master_sync" (syncs two masters together)
rosjava (which has a nice api where you can create and subscribe to masters programmatically)

Foreign relay is the simplest form of a MM solution as all it really does is use the Master API XMLRPC calls to advertise or subscribe on foreign masters. If you use the unreliable_relay node as well you can get a UDP-backed topic shared between two masters.

Its main shortcoming is that it's just a simple relay (most solutions resort to hard-coded launch files).

The now defunct building manager project enumerates some of the issues involved/proposed with MM:
http://www.ros.org/wiki/Projects/Building%20Manager/multimaster

>
> Thanks,
> Piyush


Armstrong-Crews, Nicholas - 1002 - MITLL

unread,
May 10, 2012, 2:08:30 PM5/10/12
to ros-s...@googlegroups.com
Sorry, I'm experiencing some cognitive dissonance...

> A ROS C++ UDP transport implementation exists and works that helps get over the "TCP doesn't work well over wireless" issue

> Foreign relay is the simplest form of a MM solution as all it really does is use the Master API XMLRPC calls to advertise or subscribe on foreign masters.

> Currently the master backend relies on XMLRPC (TCP+HTTP)

Aren't these statements at odds? If "TCP doesn't work well over wireless," then any multi-master solution using the Master API also won't work well (unless the master backend is re-written to use a non-TCP transport).

That was our experience when we wrote the equivalent of a foreign_relay node. I *believe* the long TCP timeout was one of the major issues: our nodes continued cramming data into the tx queue because they had not yet been notified that the remote subscriber was no longer online. Hard to say... this was years ago, and I wasn't the primary developer.

Thx,
-N

Jeff Rousseau <jrousseau@aptima.com>

unread,
May 10, 2012, 2:33:48 PM5/10/12
to ros-s...@googlegroups.com
Sorry, there's probably some confusion between pub/sub transports and the master XMLRPC API going on. The pub/sub transport code is in charge of all the heavy lifting in that it covers all the data transferred between a publisher and subscriber. The master API is a simple RPC (yes, over TCP) that handles registration of publishers and subscribers. The master doesn't handle traffic itself; it's really just a "match maker" between publishers and subscribers, so using TCP here isn't a big issue. Your primary master should generally be local to your machine to prevent issues with TCP timeouts, etc.

If you use something like zeroconf to detect when foreign masters are available, you can successfully register with them using the XMLRPC master API (and if you don't have the bandwidth to do a simple HTTP request, you probably have bigger issues). The main problem with the XMLRPC API is that it doesn't really know when machines go offline, since it relies on an explicit "unregister" call to remove connections.

> -----Original Message-----
> From: ros-s...@googlegroups.com [mailto:ros-sig-
> m...@googlegroups.com] On Behalf Of Armstrong-Crews, Nicholas - 1002 -
> MITLL
> Sent: Thursday, May 10, 2012 2:09 PM
> To: ros-s...@googlegroups.com
> Subject: RE: [ros-sig-mm] Problems with current multiimaster solutions
>

Jeff Rousseau <jrousseau@aptima.com>

unread,
May 10, 2012, 4:40:08 PM5/10/12
to ros-s...@googlegroups.com
Since connection issues between pubs/subs are outside the realm of what the master knows about, one option would be adding a TTL to master resource (topic/service) registrations--meaning that advertisers would need to poke foreign masters to keep the topic active in the master's topic dictionary.
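A minimal sketch of what such TTL'd registrations could look like (the class name, the 10-second TTL, and the topic names are all invented for illustration):

```python
import time

class TTLRegistry:
    """Sketch of TTL-based resource registration: an entry stays
    active only while the advertiser keeps poking (re-registering)
    before its TTL expires."""
    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self.entries = {}   # topic -> timestamp of last keep-alive

    def register(self, topic, now=None):
        self.entries[topic] = time.time() if now is None else now

    # A keep-alive is just a re-registration under this scheme.
    keep_alive = register

    def active(self, now=None):
        now = time.time() if now is None else now
        return [t for t, ts in self.entries.items() if now - ts < self.ttl]

reg = TTLRegistry(ttl=10.0)
reg.register('/machine_a/odom', now=0.0)
print(reg.active(now=5.0))    # ['/machine_a/odom'] -- still active
print(reg.active(now=15.0))   # [] -- expired, no keep-alive arrived
```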

If we adopt a "public master" approach, then the public masters would be the advertisers themselves instead of the nodes behind the public master (which would reduce the keep-alive traffic, since there are many more resource providers than masters).

Piyush

unread,
May 10, 2012, 5:09:07 PM5/10/12
to ros-s...@googlegroups.com
Thanks for the detailed explanations; they have made the problem a lot clearer to me.

On Thu, May 10, 2012 at 3:40 PM, Jeff Rousseau <jrou...@aptima.com> wrote:
> Since connection issues between pub/subs are out of the realm of what the master knows about, one option would be adding TTL to master resource (topic/service) registrations--meaning that advertisers would need to poke foreign masters to keep the topic in the master's topic dictionary active.
>
> If we adopt a "public master" approach, then the public masters would be the advertisers themselves instead of the nodes behind the public master (which would reduce the keep-alive traffic, since there are many more resource providers than masters).

This makes a lot of sense. Here's my take on this problem:
1) Every so often I see the master reporting zombie nodes that were unable to call unregister before exiting (crashing out). Although this problem would also be fixed by the TTL approach, it is not really clear whether the extra keep-alive traffic is worth the check. I have not seen user reports of this being a major problem.
2) I am personally in favor of a "public master" approach, as it has minimal traffic requirements and should be easy to integrate with auto discovery.

If we go with the public master approach, do you think it is better to have the public master as a full secondary master in sync, or some sort of master_sync relay with an XML-RPC server to accept requests? The latter should be fairly easy to prototype.

Piyush

Jeff Rousseau <jrousseau@aptima.com>

unread,
May 10, 2012, 5:53:44 PM5/10/12
to ros-s...@googlegroups.com

> If we go with the public master approach, do you think it is better to have the public master as a full secondary master in sync? or some sort of a master_sync relay with an XML-RPC server to accept requests. The latter should be fairly easy to prototype.

By "full secondary" vs "relay", do you mean "syncs all resources between private and public master" vs "only a specified list"? If so, I'd say enumerating only what you want externally shared is preferred. Exposing everything over a lossy connection invites users to shoot themselves in the foot by subscribing to topics that could essentially take down the network due to bandwidth constraints. You wouldn't want to expose a high-frequency point cloud topic over 802.11g, would you? If you really needed it, you would just expose a heavily throttled topic to prevent mishaps.

Armstrong-Crews, Nicholas - 1002 - MITLL

unread,
May 11, 2012, 9:19:14 AM5/11/12
to ros-s...@googlegroups.com

> The main problem with the XMLRPC api is that it doesn't really know about when machines go offline, since it relies on an explicit "unregister" call to remove connections.

 

Yes, exactly. I believe we modified this API (masterslave.py) to send some keep-alive traffic, but on a disconnect it took a long time for the keep-alive to time out (python => ROSTCP). In retrospect, we could/should have used plain Python UDP sockets outside of ROS… or, god forbid, done a system call to ping.
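For what it's worth, a liveness check over plain UDP sockets like the one described could be sketched as follows (the loopback addresses, ping/pong payloads, and 0.5 s timeout are all arbitrary illustration, not anything from ROS):

```python
import socket
import threading

# One side answers pings; the other declares the peer offline as soon
# as a ping goes unanswered, instead of waiting out a long TCP timeout.
responder = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
responder.bind(('127.0.0.1', 0))          # OS picks a free port
addr = responder.getsockname()

def serve():
    # Answer every ping with a pong, forever (daemon thread).
    while True:
        data, peer = responder.recvfrom(16)
        if data == b'ping':
            responder.sendto(b'pong', peer)

threading.Thread(target=serve, daemon=True).start()

def is_alive(peer, timeout=0.5):
    """Return True iff `peer` answers a UDP ping within `timeout`."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(b'ping', peer)
        data, _ = s.recvfrom(16)
        return data == b'pong'
    except OSError:   # timed out, or ICMP port-unreachable
        return False
    finally:
        s.close()

print(is_alive(addr))  # True while the responder is running
```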

 

> By "full secondary" vs "relay", do you mean "sync's all resources between private and public master" vs "only a specified list"?

 

We found the “whitelist” approach described to work well from a configuration standpoint. We were also playing with a regex/wildcard-based whitelist. I bet there are existing libraries to read and resolve rules of this form (think apache’s “allow from, deny from” – it’s even based on hierarchical namespaces [domains]).

 

-Nick

Jeff Rousseau <jrousseau@aptima.com>

unread,
May 11, 2012, 11:17:53 AM5/11/12
to ros-s...@googlegroups.com

A wildcard/regex whitelist for resource (topic/service) exposure would probably suffice over an (even apache-simple) acl language.

 

I think an acl language would be more interesting on connections between masters themselves: only allow syncing between masters on certain domains, etc.

 

We really have to decide on how automatic we want master syncing to be and what lies in configuration and what lies in programmatic interfaces. 

 

Here are some open questions I have:

How would a subscription/call to a foreign resource work as far as namespacing?

- Do you need to know the machine name (or whatever unique prefix we use)?
  - When do we need to know it: at configuration time or runtime (launch file param, or query the master programmatically? Both?)

Should running nodes even receive callback events on resource adds/removals?

 

Jeff

Piyush

unread,
May 11, 2012, 12:08:35 PM5/11/12
to ros-s...@googlegroups.com
> By "full secondary" vs "relay", do you mean "sync's all resources between private and public master" vs "only a specified list"? If so I'd say enumerating only what you want externally shared is preferred. Exposing everything over a lossy connection invites users to shoot themselves in the foot by subscribing to topics that could essentially take down the network due to bandwidth constraints. You won’t want to expose a high-frequency point cloud topic over 802.11g would you? If you needed to you would just expose a heavily throttled topic to prevent mishaps.

I should have elaborated more clearly. The building manager docs suggested that the secondary (or public) master be a second instantiation of the regular master, exposing a subset of the available resources as necessary. This public master would be kept in sync with the primary master using something like master_sync. To me it is not clear whether this is the best route to take -- the public master does not need matchmaking skills. Anyhow, it looks like we are proceeding with an implementation discussion anyway, so some of these things will get ironed out.

I have replied to the questions you laid out based on my own use case. Over the next month I should have better answers.

> How would a subscription/call to a foreign resource work as far as namespacing?

I vote for attaching a unique id to the topic name. Something like /odom -> /unique_id/odom
 
> - Do you need to know the machine name (or whatever unique prefix we use)?

At first, we should know the unique id of a foreign machine. We can later figure out how to do this automatically using auto discovery. This is mainly because a number of people with small setups wouldn't necessarily mind assigning the unique id themselves.
 
>   - When do we need to know it: at configuration time or runtime (launch file param, or query the master programmatically? Both?)

Both perhaps? I can't see the added functionality hurting. Having this programmatically should help auto discovery as well.

> Should running nodes even receive callback events on resource adds/removals?

Yes. My application will have a centralized planner. It will need to know which services are available and when they go offline.

Piyush

Armstrong-Crews, Nicholas - 1002 - MITLL

unread,
May 11, 2012, 12:36:52 PM5/11/12
to ros-s...@googlegroups.com

> A wildcard/regex whitelist for resource (topic/service) exposure would probably suffice over an (even apache-simple) acl language.

 

Agreed… but let’s not write one from scratch if there’s a more fully-featured one already available. Also want to mention that I don’t think we’re *actually* talking about access control – perhaps I should have referenced DNS bind routing rules instead.

 

I agree with Piyush on all counts.

 

Seems like we’re trying to build a non-pub/sub overlay network on top of a pub/sub network.

 

What if we considered the unique_id (as in “/unique_id/odom”) to be literally hostname:port or ip:port (as in “/bender:11311/odom”)? Let DNS/IP layers do what they do, and not mix that functionality into ROS. In fact, we could view the regular ROS pub/sub as a special case: “/*/odom”.

 

???

 

-N
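A rough sketch of how such hostname:port namespacing and the "/*/odom" special case could be parsed and matched (the helper names are invented, and fnmatch-style globbing stands in for whatever matching rule would actually be chosen):

```python
import fnmatch

def split_global_name(name):
    """'/bender:11311/odom' -> ('bender:11311', '/odom').
    The master id is whatever sits between the first two slashes."""
    _, master, topic = name.split('/', 2)
    return master, '/' + topic

def matches(pattern, name):
    """Treat '/*/odom' as 'odom on any master' -- the regular
    single-master pub/sub case viewed as a wildcard."""
    return fnmatch.fnmatch(name, pattern)

print(split_global_name('/bender:11311/odom'))   # ('bender:11311', '/odom')
print(matches('/*/odom', '/bender:11311/odom'))  # True
```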

Jeff Rousseau <jrousseau@aptima.com>

unread,
May 11, 2012, 1:37:28 PM5/11/12
to ros-s...@googlegroups.com

> > A wildcard/regex whitelist for resource (topic/service) exposure would probably suffice over an (even apache-simple) acl language.
>
> Agreed… but let’s not write one from scratch if there’s a more fully-featured one already available. Also want to mention that I don’t think we’re *actually* talking about access control – perhaps I should have referenced DNS bind routing rules instead.

 

I understand what you’re aiming at. I merely brought up actual access control because the use of the words ‘allow’ and ‘deny’ from apache triggered the idea of filtering masters from each other.  We can probably table that discussion for later though.

 

As far as a regex/whitelist is concerned, I must still be misunderstanding the particulars because given a string topic list already available from a master query, it’s just a simple regex to filter those that match an expression (python has decent regex support out-of-box).  Again, I’m probably missing something.  Are there other features you can think of that we’d need that an external library would be good for?
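For illustration, the filter being described really is only a few lines of Python (the topic names and patterns below are made up):

```python
import re

def filter_topics(topics, whitelist):
    """Given the topic list returned by a master query, expose only
    the topics matching at least one whitelist expression."""
    patterns = [re.compile(p) for p in whitelist]
    return [t for t in topics if any(p.match(t) for p in patterns)]

topics = ['/odom', '/tf', '/camera/points', '/diagnostics']
print(filter_topics(topics, [r'/odom$', r'/tf$']))  # ['/odom', '/tf']
```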

 

> I agree with Piyush on all counts.

 

Piyush’s comments align with my use-case as well.  As long as there aren’t any lurkers on the list that have good counter arguments, I think we have a pretty good consensus.

 

> Seems like we’re trying to build a non-pub/sub overlay network on top of a pub/sub network.

 

Not entirely sure what you mean here.

 

I’ve done overlays on pub/subs before (out of necessity, not choice); this effort seems to be evolving into more of a “gateway” design that tries to keep the pub/sub metaphor intact (you’ll still be publishing and subscribing, just with only a subgraph).  The master backend never was a pub/sub system itself and really what we’re doing here is adding a special case master instance.

 

> What if we considered the unique_id (as in “/unique_id/odom”) to be literally hostname:port or ip:port (as in “/bender:11311/odom”)? Let DNS/IP layers do what they do, and not mix that functionality into ROS. In fact, we could view the regular ROS pub/sub as a special case: “/*/odom”.

 

+1

This seems reasonable to me.

Armstrong-Crews, Nicholas - 1002 - MITLL

unread,
May 11, 2012, 7:50:43 PM5/11/12
to ros-s...@googlegroups.com

> it’s just a simple regex to filter those that match an expression (python has decent regex support out-of-box)

 

Good point. A regex whitelist would be trivial to do in Python, and would probably provide 90% of what people would get out of a more complicated language.

 

> more of a “gateway” design that tries to keep the pub/sub metaphor intact (you’ll still be publishing and subscribing, just with only a subgraph)

 

Agreed, this is a good way of thinking about it.

 

-N

Daniel Stonier

unread,
May 12, 2012, 2:33:30 AM5/12/12
to ros-s...@googlegroups.com
On 12 May 2012 01:08, Piyush <piy...@gmail.com> wrote:
> By "full secondary" vs "relay", do you mean "sync's all resources between private and public master" vs "only a specified list"? If so I'd say enumerating only what you want externally shared is preferred. Exposing everything over a lossy connection invites users to shoot themselves in the foot by subscribing to topics that could essentially take down the network due to bandwidth constraints. You won’t want to expose a high-frequency point cloud topic over 802.11g would you? If you needed to you would just expose a heavily throttled topic to prevent mishaps.
>
> I should have elaborated clearly. The building manager docs suggested that the secondary (or public) master was a second instantiation of the regular master exposing a subset of the available resources as necessary. This public master would be kept in sync with the primary master using something like master_sync. To me, it is not clear whether this is the best route to take -- as the public master does not need matchmaking skills. Anyhow, it looks like we are proceeding with an implementation discussion anyway, so some of these things will get ironed out.
>
> I have replied on the questions you laid out based on my own use case. Over the next month I should probably have better answers.

I wonder if the building manager docs weren't meant to be taken too literally, or whether they changed their ideas. The building manager / app manager framework doesn't explicitly instantiate a second master, and master_sync just relays the publicly listed topics/services between the local and the foreign master. Please correct me if I'm missing part of the puzzle though - it's been a while since I've looked through the code, but we still run the master_sync and app manager framework regularly.
 
> > How would a subscription/call to a foreign resource work as far as namespacing?
>
> I vote for attaching a unique id to the topic name. Something like /odom -> /unique_id/odom
>
> > - Do you need to know the machine name (or whatever unique prefix we use)?
>
> At first, we should know the unique id of a foreign machine. We can later figure out how to do this automatically using auto discovery. This is mainly because a number of people with small setups wouldn't necessarily mind assigning the unique id themselves.
>
> >   - When do we need to know it: at configuration time or runtime (launch file param, or query the master programmatically? Both?)
>
> Both perhaps? I can't see the added functionality hurting. Having this programmatically should help auto discovery as well.

Yes, both definitely.

> > Should running nodes even receive callback events on resource adds/removals?
>
> Yes. My application will have a centralized planner. It will need to know how many services are available, and when they go offline.


I have a similar situation. I think this brings up a query Jeff made earlier as well: does this information need to be in the master, or should the master just handle the directory of registrations and another node to the side handle whether those are on/offline?
 
Also, bundling the online/offline information into the master is redundant for a lot of use cases that aren't MM.

> Piyush

On Fri, May 11, 2012 at 10:17 AM, Jeff Rousseau <jrou...@aptima.com> wrote:
> A wildcard/regex whitelist for resource (topic/service) exposure would probably suffice over an (even apache-simple) acl language.
>
> I think an acl language would be more interesting on connections between masters themselves: only allow syncing between masters on certain domains, etc.
>
> We really have to decide on how automatic we want master syncing to be and what lies in configuration and what lies in programmatic interfaces.


We ran a few scenarios last year. Initially I was inclined to make syncing automatic, i.e. if a robot found a 'building manager' master, it would automatically connect. But then we had situations with more than one building manager on the same network (quite common in a large lab). So we started whitelisting/blacklisting building manager masters in the robot configurations. That required robot configuration though - it was much nicer to have that configuration in the building manager solution itself, so we reversed the connection and had the building manager invite the robots it required to join. This starts to become fairly programmatic.

We then brought in an android client (an android-based tablet with a human). And that is a different case again - you don't want the android automatically taking control away from the user. So we allowed the android client to make manual connections to the building manager master (via QR coded instructions).

In conclusion, there are a couple of common scenarios here that need to be considered, and it would be good to support all of these use cases.



--
Phone : +82-10-5400-3296 (010-5400-3296)
Home: http://snorriheim.dnsdojo.com/
Embedded Control Libraries: http://snorriheim.dnsdojo.com/redmine/wiki/ecl

Daniel Stonier

unread,
May 12, 2012, 2:57:52 AM5/12/12
to ros-s...@googlegroups.com
On 12 May 2012 02:37, Jeff Rousseau <jrou...@aptima.com> wrote:
> > > A wildcard/regex whitelist for resource (topic/service) exposure would probably suffice over an (even apache-simple) acl language.
> >
> > Agreed… but let’s not write one from scratch if there’s a more fully-featured one already available. Also want to mention that I don’t think we’re *actually* talking about access control – perhaps I should have referenced DNS bind routing rules instead.
>
> I understand what you’re aiming at. I merely brought up actual access control because the use of the words ‘allow’ and ‘deny’ from apache triggered the idea of filtering masters from each other. We can probably table that discussion for later though.
>
> As far as a regex/whitelist is concerned, I must still be misunderstanding the particulars because given a string topic list already available from a master query, it’s just a simple regex to filter those that match an expression (python has decent regex support out-of-box). Again, I’m probably missing something. Are there other features you can think of that we’d need that an external library would be good for?
>
> > I agree with Piyush on all counts.
>
> Piyush’s comments align with my use-case as well. As long as there aren’t any lurkers on the list that have good counter arguments, I think we have a pretty good consensus.
>
> > Seems like we’re trying to build a non-pub/sub overlay network on top of a pub/sub network.
>
> Not entirely sure what you mean here.
>
> I’ve done overlays on pub/subs before (out of necessity, not choice); this effort seems to be evolving into more of a “gateway” design that tries to keep the pub/sub metaphor intact (you’ll still be publishing and subscribing, just with only a subgraph). The master backend never was a pub/sub system itself and really what we’re doing here is adding a special case master instance.


Yes, gateway design is a nice way of describing it.
 

 

> > What if we considered the unique_id (as in “/unique_id/odom”) to be literally hostname:port or ip:port (as in “/bender:11311/odom”)? Let DNS/IP layers do what they do, and not mix that functionality into ROS. In fact, we could view the regular ROS pub/sub as a special case: “/*/odom”.
>
> +1
>
> This seems reasonable to me.


It has to be at least hostname and port, to differentiate the case where you might have more than one master running on a single machine. A negative to this is that it makes introspection by developers and interfacing by users less human-readable.

If you wanted to install into an environment and provide hostname strings to avoid everything being namespaced as 192.168.10.3:11311, 192.168.10.4:11311, ..., you'd have to enter DNS information. Which, outside the lab, will be impossible in many situations and impractical in the rest (you have to go through sysadmins).

I'd like to have some way of making this human-usable. We thought about zeroconf strings for unique id'ing, but that fails when you have robots going in and out of wireless range - zeroconf doesn't keep a centralised bank of names anywhere to guarantee the uniqueness of the id as you come and go. We could use hostname/port combinations, but we'd need an abstraction layer on top to make them easy to use. Or we could build human-readable unique strings into the namespacing mechanism - we tried something like that last year with a node that acted like a name server on the building manager machine. It would take a string suggested by the robot, append a unique numeral suffix, and return that to the robot as the unique identifier for the session. It's not difficult and makes the system much more usable.

Falling back to hostname/port combinations as a default for simple setups of course, would be fine.
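The name-server behaviour Daniel describes (a suggested name plus a unique numeral suffix) is easy to sketch; the class and example names here are invented for illustration, not taken from any existing node:

```python
class NameServer:
    """Hands out session-unique names: if the suggested name is
    taken, append an incrementing numeral suffix until it isn't."""
    def __init__(self):
        self.taken = set()

    def request(self, suggested):
        name, n = suggested, 1
        while name in self.taken:
            n += 1
            name = '%s%d' % (suggested, n)
        self.taken.add(name)
        return name

ns = NameServer()
print(ns.request('turtlebot'))  # turtlebot
print(ns.request('turtlebot'))  # turtlebot2
```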