OLSRd limitation encountered, next steps

Russell Senior

Mar 29, 2014, 4:42:38 AM
to ptp...@googlegroups.com

In the last month or so, we've been seeing instability in our IPv6
OLSRd routing (referred to below as "olsr6d" for brevity). Last
night, in collaboration with the olsr-users mailing list, we managed
to figure out why.

Most deployments of olsrd are over wireless meshes, sometimes in quite
large networks. Because the devices tend to be spatially distributed,
it is rare for any one of them to have very many neighbors. Our
network is much smaller than many of those deployments, but its
topology is different: we have a central server that everything else
connects to, a hub-and-spoke arrangement, so our central server has a
substantial number of neighbors. It turns out that olsr6d is fine
with up to and including 59 neighbors. When we add one more, we
apparently exceed the MTU we have established on the VPN tunnel (1280
bytes), the OLSR packets start getting rejected or dropped, the ipv6
routing table on iris and the nodes collapses, and we end up with no
ipv6 routes on the nodes and only a smattering on iris. If we turn
off olsr6d on one of the nodes, the routes all come back within 30
seconds to a minute.
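
For what it's worth, a back-of-the-envelope calculation lands right on
the 59/60 boundary. The byte counts below are my reading of RFC 3626
plus olsrd's link-quality extension, so treat them as rough
assumptions rather than measurements:

  # size of one LQ HELLO over the tunnel, all neighbors in one link block
  #   48 = IPv6 + UDP headers, 4 = OLSR packet header,
  #   24 = message header (IPv6 originator), 8 = HELLO + link block headers,
  #   20 = per neighbor (16-byte address + 4 bytes of link-quality info)
  for n in 59 60; do
    echo "$n neighbors: $(( 48 + 4 + 24 + 8 + 20 * n )) bytes vs MTU 1280"
  done
  # 59 neighbors: 1264 bytes vs MTU 1280
  # 60 neighbors: 1284 bytes vs MTU 1280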

I will leave out a detailed description of the substantial head
scratching and experimentation that occurred before discovering the
exact shape of the problem, even though I find it endlessly
fascinating. If you want the full story, come to a meeting and I will
explain blow-by-blow.

We will have the same problem on IPv4 eventually, but we aren't seeing
the problem now because IPv4 addresses are smaller (4 versus 16
bytes).

The question now is, what to do about it?

I have a so-far-unanswered query in to the olsrd community about
whether this is fixable in the near term, perhaps by splitting the
OLSR messages into separate datagrams when they exceed the MTU. If a
fix is both forthcoming and compatible with older nodes, then problem
solved.

We could perhaps increase the MTU on the VPN tunnel. I know 1280 is
the minimum guaranteed MTU for ipv6, but it's not clear to me why we
didn't choose something higher, if the tunnel supports it. That would
buy us some time.
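
If we do raise it, I believe it is just a matter of setting tun-mtu in
the OpenVPN configs on both iris and the nodes (both ends need to
agree), then confirming it took. Untested sketch, with tun0 and 1500
as placeholder values:

  # in the openvpn config on both ends:
  #   tun-mtu 1500
  # then, on a node, confirm the interface picked it up:
  ip link show dev tun0 | grep -o 'mtu [0-9]*'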

Another stopgap is to begin moving people with native ipv6 upstreams
off of the VPN tunnel, freeing up some of the 59 available neighbor
slots for those that really need them. This is something we should be
doing anyway, and something we now have a tested example of, but it
requires that we first spin up some dynamic dns for AAAA records.
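
The dynamic dns piece could be as simple as a periodic nsupdate from
each natively-connected node. A hypothetical sketch; the hostname,
key file and address are all placeholders:

  # push the node's current global ipv6 address into a AAAA record
  nsupdate -k /etc/ptp/ddns.key <<'EOF'
  update delete somenode.nodes.personaltelco.net AAAA
  update add somenode.nodes.personaltelco.net 300 AAAA 2001:db8::42
  send
  EOF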

Another solution is to just stop using olsrd in this role, and make
static routing assignments. Because the nodes typically don't
interconnect directly, we could just give all the nodes a big fat
10.11.0.0/16 route (and an equivalent ipv6 route) back towards iris
through the tunnel, and let iris sort things out. Then have iris add
routes towards nodes and their HNAs as the VPN tunnels come up based
on some static data.
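
Roughly, the static version would look like this (untested; tun0 and
the ipv6 prefixes are placeholders, with 2001:db8::/32 standing in for
whatever we actually use):

  # on a node: send everything for the network back through the tunnel
  ip route add 10.11.0.0/16 dev tun0
  ip -6 route add 2001:db8:100::/48 dev tun0
  # on iris: add routes for a node's tunnel address and HNAs as it connects
  ip route add 10.11.42.0/26 via 10.11.0.42 dev tun0
  ip -6 route add 2001:db8:100:42::/64 via 2001:db8:100::42 dev tun0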

One other strategy that occurred to me is to try segmenting nodes onto
different openvpn services, e.g. on different machines or ports. This
introduces some maintenance burden in deciding which nodes point at
which ports, but might (untested) solve the
too-many-neighbors-to-fit-in-the-MTU problem.
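
Concretely, that would mean a second server instance on iris along
these lines (untested sketch; the port, device and pool below are
arbitrary placeholders, and nodes assigned to it would just point
their "remote" at the new port):

  # second instance alongside the existing one (cert/key lines omitted)
  cat > /etc/openvpn/ptp-b.conf <<'EOF'
  port 1195
  proto udp
  dev tun1
  server 172.31.1.0 255.255.255.0
  EOF
  # and on a node assigned to this instance:
  #   remote <iris> 1195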

Btw (I mentioned this at the Monthly, but for those who weren't
there), OpenVPN on the nodes has a problem. When the server (on iris)
restarts, the nodes get pushed a new IP address. However, the OpenVPN
on the node has long since dropped privileges and therefore has
insufficient permission from the OS to manipulate the interface
configuration, and so it just exits and the tunnel goes away.

Luckily, I took mitigating steps on the nodes long ago. There is a
cron job that fires approximately every 7 minutes, looks for processes
named openvpn, and starts a new one if it fails to find any.

https://github.com/personaltelco/ptp-openwrt-files/commit/0741bebc85836c8be8d1f34d9d722a86a125b676
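
The gist of it is just this (paraphrased; the real script is in the
commit above):

  # from cron, every ~7 minutes: restart openvpn if it has gone away
  if ! pidof openvpn >/dev/null; then
      /etc/init.d/openvpn start
  fi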

Another solution, which Jason McArthur had implemented at one time,
was the use of static ipaddr assignments in the /etc/openvpn/ccd
directory. When the VPN server comes back up, the VPN clients should
reconnect okay because no interface reconfiguration is needed. This
has the downside that someone has to maintain the static assignments
manually. Also, the ccd approach doesn't coexist very well with the
automatic address assignments, as the ifconfig-pool will happily hand
out conflicting addresses, borking a sister node's tunnel (as seen
recently).
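
For reference, a ccd entry is just a file named after the client's
certificate common name containing an ifconfig-push line. The name
and addresses here are placeholders, and the two-address form assumes
a point-to-point tun (with a subnet topology the second argument would
be a netmask instead):

  cat > /etc/openvpn/ccd/node-somewhere <<'EOF'
  ifconfig-push 10.11.0.42 10.11.0.41
  EOF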

Another option is to utilize an openvpn option called
--ifconfig-pool-persist, which maintains a "lease" file for nodes as
they connect, trying to ensure (without a guarantee) that they'll get
the same VPN address the next time they connect.
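
That one is a single line in the server config on iris (the paths and
refresh interval here are arbitrary); openvpn then keeps
common-name,address pairs in the file and tries to hand the same
address back on reconnect:

  echo 'ifconfig-pool-persist /etc/openvpn/ipp.txt 600' \
    >> /etc/openvpn/server.conf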

Any change in the OpenVPN configuration will require a restart, which
will kill the nodes' VPN tunnels at least temporarily, may result in
loss of contact, and might even require a site visit or some other
induced reboot to recover if the cron job is unsuccessful.

One other detail: the olsrd stop scripts on some nodes are broken.
The problem is that we have two olsrd daemons (ipv4 and ipv6) that are
started separately (btw, we may be able to avoid running two daemons
by using an NIIT arrangement that tunnels ipv4 through ipv6; google
for details). However, the stop script uses "killall olsrd", which
kills both. See:

https://github.com/personaltelco/ptp-openwrt-files/commit/29256475dcc3fc6d4498dacd369fbf2471134fef

Modern images have fixed this by tracking the PID that was started and
killing it by number, rather than by name.

https://github.com/personaltelco/ptp-openwrt-files/commit/4443911bf4ea3702694db68adf39a6dbedac2443
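
The idea is nothing fancier than remembering the PID at start time and
killing only that PID at stop time. A minimal sketch (not the actual
init script; it assumes olsrd is launched with -nofork so the shell's
$! is the right PID, and the config paths are illustrative):

  # start: one instance per address family, each with its own pidfile
  olsrd -f /etc/olsrd.conf -nofork &
  echo $! > /var/run/olsrd4.pid
  olsrd -f /etc/olsrd6.conf -nofork &
  echo $! > /var/run/olsrd6.pid

  # stop: kill only the ipv6 instance, leaving ipv4 routing alone
  kill $(cat /var/run/olsrd6.pid)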

Some older images still retain the broken logic. However, the modern
start/stop logic relies on command-line options that are not available
in some older olsrd builds, and therefore cannot just be plopped onto
old nodes. Fixing this, with local patches where called for, would be
desirable, but ultimately the old images will (with luck) be paved
over anyway.

Any questions? Thoughts? Preferences?


--
Russell Senior, President
rus...@personaltelco.net