Thoughts on Allocators

Jackie Kay

Oct 6, 2015, 4:26:21 PM
to ROS, ros-sig...@googlegroups.com
I recall some of us from SIG-NG chatted about this at ROSCon, but I
don't remember everyone who was there (some combination of Jack,
Geoff, William, and Mike Purvis, maybe?). So I figure I'll just email
everyone, and anyone can jump in...

I've been thinking about how I'm going to follow up the real-time work
to make the full intra-process pipeline and the service/client
pipeline real-time safe in rclcpp.

In the current implementation, the MemoryStrategy class handles
preallocated pools of memory that need to get instantiated during
RT-runtime. These pools are simply arrays of the specific classes that
must get instantiated, such as the AnyExecutable class.

The advantages of this approach are that it's simple and easy to
engineer, and actually pretty good for performance--by using
preordained knowledge about the execution of the program, we can
reserve exactly as much memory as we need. However, the tradeoff is
that those reserved pools sit idle when they aren't needed, whereas a
smarter allocator could reuse that memory. The bigger disadvantage is
that this approach creates a lot more code to write and maintain.

I am considering moving to the approach of exposing an allocator
interface as a template argument to several classes in rclcpp, and
setting the default to std::allocator [1], then providing at least one
O(1) allocator as an alternative to the default.
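To make the idea concrete, here is a rough sketch (names are illustrative, not the actual rclcpp API) of a class that takes an allocator template argument defaulting to std::allocator and routes all allocation through std::allocator_traits:

```cpp
#include <cassert>
#include <memory>

// Hypothetical sketch of the proposed pattern: a class parameterized on an
// allocator, defaulting to std::allocator. "Pipeline" is a made-up name.
template<typename T, typename Alloc = std::allocator<T>>
class Pipeline {
public:
  explicit Pipeline(const Alloc & alloc = Alloc()) : alloc_(alloc) {}

  T * create() {
    using Traits = std::allocator_traits<Alloc>;
    T * p = Traits::allocate(alloc_, 1);   // O(1) if Alloc is a pool allocator
    Traits::construct(alloc_, p);
    return p;
  }

  void destroy(T * p) {
    using Traits = std::allocator_traits<Alloc>;
    Traits::destroy(alloc_, p);
    Traits::deallocate(alloc_, p, 1);
  }

private:
  Alloc alloc_;
};
```

Swapping in a real-time safe allocator then only changes the template argument, not the class's code.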

It's a more elegant and generalized approach than the MemoryStrategy,
but harder to implement. It would be trivial if we could just package
an existing real-time safe allocator in rclcpp. But it's actually
quite hard to find a good allocator that's appropriate. TLSF [2] is an
allocator specialized for embedded systems. Heap Layers [3] is an
awesome library of composable "allocator primitives". But both are
under GPL licensing, and we want to release ROS 2 under Apache 2.

Also, using allocators that follow the std::allocator interface means
that those allocators could also be passed to the STL structures used
in rclcpp, such as vector and map, which can incur allocation penalties.
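For illustration, any minimal allocator satisfying the std::allocator interface drops straight into a standard container; this hypothetical counting allocator makes those otherwise-hidden container allocations visible:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative minimal allocator (std::allocator_traits fills in the rest).
// A real-time version would draw from a preallocated pool instead of malloc;
// here we just count calls to show where allocation penalties occur.
template<typename T>
struct CountingAllocator {
  using value_type = T;
  CountingAllocator() = default;
  template<typename U> CountingAllocator(const CountingAllocator<U> &) {}

  T * allocate(std::size_t n) {
    ++allocations;   // each call is a potential real-time hazard
    return static_cast<T *>(std::malloc(n * sizeof(T)));
  }
  void deallocate(T * p, std::size_t) { std::free(p); }

  static int allocations;
};
template<typename T> int CountingAllocator<T>::allocations = 0;

template<typename T, typename U>
bool operator==(const CountingAllocator<T> &, const CountingAllocator<U> &) { return true; }
template<typename T, typename U>
bool operator!=(const CountingAllocator<T> &, const CountingAllocator<U> &) { return false; }
```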

What I might do is explore the allocator interface approach and use
these existing allocator implementations for testing, but not include
them anywhere in the ROS 2 codebase to avoid GPL. Then, if I decide
the approach is viable/better than MemoryStrategy, I will pursue an
existing allocator that is LGPL or Apache, or implement my own to be
included with ROS 2.

Let me know if you'd like clarification about anything--there might be
some context missing for people who haven't looked extensively at the
rclcpp codebase, or anyone who isn't familiar with std::allocator and
std::allocator_traits.

Cheers,

Jackie

p.s. The TLSF implementation is in C anyway, so I will have to modify
it slightly to make it work with std::allocator_traits.

1. http://en.cppreference.com/w/cpp/memory/allocator
2. http://www.gii.upv.es/tlsf/main/repo
3. http://heaplayers.org/

Bob Dean

Oct 6, 2015, 6:02:20 PM
to ros-sig...@googlegroups.com, ROS
The std::allocator interface is a solid approach. To be honest, I was wondering why memory strategies used a different API.

Given the OP, do you expect to be allocating memory at runtime? Previous comments and the GitHub notes suggest you were moving to an "allocate during init, not at runtime" approach.

Do you have metrics on how the memory strategies are or are not being used under load, or is this based on theory?



Sent from my iPhone

Jackie Kay

Oct 6, 2015, 6:29:16 PM
to Bob Dean, ros-sig...@googlegroups.com, ROS
update on licensing for anyone who cares: I learned that the
implementation of the TLSF allocator is actually "dual license" GPL
and LGPL 2, and they claim you can choose whichever license suits you
when you link your library against it. I'll proceed with caution since
I really don't understand the nuances of dual-licensing.

Response to Bob:

Indeed, the goal is to avoid allocation at runtime, if the allocator
used is not real-time safe.

A custom allocator could allow calls to "malloc" or "new" to be
real-time safe by using a different strategy. For example, the TLSF
allocator executes "malloc" in constant time, and requires the user to
provide a preallocated pool of memory.

This is in contrast to the default allocator, which does not run in
constant time because it does dynamic heap allocation. So, the current
approach is not to change the allocator, but instead change the code
path so that malloc and new are not called during real-time execution.
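As a sketch of the idea (a fixed-block free list, far simpler than TLSF's two-level scheme, but sharing the key property), an allocator over a user-provided buffer can serve and reclaim blocks in constant time without ever touching the heap:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch, not TLSF: a fixed-block free list over a user-provided
// buffer. allocate/deallocate are O(1) and never call into the system heap,
// which is what makes malloc-style calls real-time safe.
class FixedPool {
public:
  FixedPool(void * buffer, std::size_t block_size, std::size_t block_count) {
    char * base = static_cast<char *>(buffer);
    // Thread every block onto the free list up front.
    for (std::size_t i = 0; i < block_count; ++i) {
      Node * n = reinterpret_cast<Node *>(base + i * block_size);
      n->next = head_;
      head_ = n;
    }
  }

  void * allocate() {            // O(1): pop from the free list
    if (!head_) return nullptr;  // pool exhausted: no fallback to the heap
    Node * n = head_;
    head_ = n->next;
    return n;
  }

  void deallocate(void * p) {    // O(1): push back onto the free list
    Node * n = static_cast<Node *>(p);
    n->next = head_;
    head_ = n;
  }

private:
  struct Node { Node * next; };
  Node * head_ = nullptr;
};
```

The user hands over the buffer once, before the real-time section starts, and every subsequent allocation is bounded-time.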

As for data, I presented latency metrics on the current implementation
at ROSCon. The conclusion was that the current approach is "soft
real-time" safe for publish-subscribe. I can send you my slide deck,
which has all the relevant information, if you're interested.

My feeling is that the two approaches I described won't have a huge
difference in performance, but the tradeoff is more about
API/implementation simplicity versus elegance.

For example, with the current implementation, the user specifies
limits on the number of entities (subscribers, publishers, etc) that
rclcpp creates during runtime, and that determines the exact size of
the pools provided for those entities.

If we were to expose the allocator interface to the user, the API is
more abstract and requires the user to have some knowledge of
allocators. There's also less control over the exact amount of memory
used--for example, if TLSF is used, the user hands the allocator a
chunk of memory at the beginning of the program and hopes for the
best... But, you could argue that anyone who's interested in
controlling memory allocations and real-time performance is an
advanced user and we shouldn't worry about "dumbing down" the API for
them.

Bob Dean

Oct 6, 2015, 6:45:54 PM
to Jackie Kay, ros-sig...@googlegroups.com, ROS
Yes, please forward your slides. I was unable to watch the live stream last weekend.

I basically agree with everything in your response. I have written allocators in the past, after discovering that the Boost memory pools hard-coded their pool-size expansion to 2*currentSize. Boom: segfault for large datasets.

Sent from my iPad

Geoff

Oct 7, 2015, 2:39:56 AM
to ros-sig...@googlegroups.com
I agree quite strongly with using the std::allocator interface. It would
provide greater flexibility to users of rclcpp. It would allow someone
to use an existing allocator directly, if they have one that meets their
use case and want to use it. And it would, as you mentioned, also be
usable with other standard library classes. However good or bad a
standard library implementation is for a particular use case, the
_interfaces_ are in general well-designed (if we ignore std::string) and
work well when used together. When appropriate, the closer ROS class
interfaces are to similar standard library interfaces, the better, in my
opinion.

The "provide a big block of pre-allocated memory" approach is easier
for the average user to understand than writing allocators. I think it
is worth considering providing some way for users to take this
approach for simple, loosely-tuned real-time. Perhaps provide an
allocator that is instantiated with a big block of user-provided
memory, which the user can then pass as the allocator argument? Or a
very thin wrapping API? Or something like that; I'm thinking out loud.
However, I think the write-an-allocator approach should be given
priority, as anything else can be built on top of it later.

Geoff

Jackie Kay

Oct 7, 2015, 10:02:43 PM
to ros-sig...@googlegroups.com
Bob, I've attached my ROSCon slides in this email.

Implementation update, for anyone who cares about the API details:

I prototyped how exposing the allocator argument to Publishers and
Subscriptions would look.

https://github.com/ros2/rclcpp/compare/rt_intra_process

You pass the allocator type as a template argument to the
publisher/subscription. It then uses that allocator to allocate any
incidental messages in the pub/sub pipeline. Feedback about the API is
welcome, there's probably a lot that's going to change before this is
final.
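Independent of the rclcpp specifics, the underlying mechanism is standard C++: once the allocator type is available, incidental message objects can be created with std::allocate_shared, so both the message and the shared_ptr control block come from the supplied allocator. A sketch with a made-up message type:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative message type, not an actual ROS message.
struct StringMsg { std::string data; };

// Any allocator following the std::allocator interface works here;
// std::allocate_shared rebinds it internally for the control block.
template<typename Alloc>
std::shared_ptr<StringMsg> make_message(const Alloc & alloc, const char * text) {
  auto msg = std::allocate_shared<StringMsg>(alloc);  // allocator-backed
  msg->data = text;  // caveat: std::string may still heap-allocate internally
  return msg;
}
```

The std::string caveat in the comment is exactly the Header/frame_id problem Jack raises below in the thread.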

You can get the TLSF allocator wrapped as an ament package from this repo:

https://github.com/ros2/tlsf_cpp

This is how example code would work (adapted from William's intra-process demo):

https://github.com/ros2/examples/blob/allocator_template_example/rclcpp_examples/src/topics/cyclic_pipeline_allocator.cpp

I'd like to get around to profiling the performance difference (if
applicable) before the end of the week.
ROSCon Realtime ROS 2 Presentation.pdf

Jack O'Quin

Oct 7, 2015, 10:14:44 PM
to ros-sig...@googlegroups.com, ROS
+1 This idea is worth exploring. 

My real-time experience was in C, with malloc() always done during initialization. The ROS 2 node life-cycle seems to cover that, and most nodes will probably take advantage of it.

But, C++ makes things more complicated. ROS messages can contain std::string and std::vector components. Passing messages between real-time and non-real-time nodes looks hard, and unfortunately the standard ROS message Header contains a frame_id string, so we will have to deal with it somehow.

My guess is that passing messages between real-time and non-real-time nodes will involve copying data between the regular heap and the mlock()ed real-time heap. That will presumably need to be done on the non-real-time side, which can tolerate page faults and sbrk(2). If the real-time heap is statically allocated and contiguous, detecting that a pointer lies within it is fairly efficient.

Although difficult, this seems worth exploring. 

There are problems with using mlockall(2) on entire address spaces. The real-time threads have to decide how deep a call/return stack they each need and touch every page before calling mlockall(). Otherwise, you could get a first-reference fault when calling a new function. In addition to the stack, code and data must already have been allocated. I am assuming the use of MCL_CURRENT; I don't know how to use MCL_FUTURE for real-time applications. Without mlockall() it is hard to find all the pages that need to be locked individually.

Mixing non-real-time threads in the same address space is even more complex. Locking all the code and literals seems reasonable. Only the real-time stacks need special initialization. The real-time heap can probably be allocated and mlock()ed as a block. I believe all this has been done successfully, but it's not simple.

The result may not be "hard" real-time, but if carefully written it can be "firm" in the sense that some messages may be lost, but deadlines will be met reliably. For most robotics, that is a reasonable goal.
-- 
 joq

Bob Dean

Oct 8, 2015, 9:38:40 AM
to ros-sig...@googlegroups.com
Jackie,

The tlsf paper(s) read much like the papers for tcmalloc and jemalloc. I wonder how different the current implementations really are for the realtime case. This link shows both je and tc performing better than tlsf under short and long run cases:


Since you have a memory jitter test, have tests been done using the link-time alternatives? Does tlsf have a link-time version?
If the link-time mallocs perform well, they may be a simpler API/solution than custom allocators.

Sent from my iPad

Jack O'Quin

Oct 8, 2015, 10:19:00 AM
to ros-sig...@googlegroups.com
As with many real-time performance measurements, I'd expect to pay a penalty in the average performance in return for better guarantees for worst-case performance.

Bob: those benchmarks look like they report average run-time performance. Right?

Are there any worst-case performance data available?  
--
 joq

Bob Dean

Oct 8, 2015, 1:58:20 PM
to ros-sig...@googlegroups.com
The one advantage to being stuck in bed with back issues is the ability to scour the Internet. The graphs linked do show averages; I have not been able to find worst-case performance numbers, just plenty of statements such as "tcmalloc vX is 30% faster for thingY". I found a couple of claims that jemalloc is now O(1), which I could not verify. I think a case could be made that a third-party allocator would "stabilize" its memory usage given the repetitive nature of real-time loops. I just do not know how to prove it without digging deep into the code, or emailing the project maintainers.

It should be possible, however, to wrap tlsf in a malloc/free API so that it could replace the default memory allocator at runtime (at least on Linux). Then the memory pool could be assigned early in main, reducing the need for allocators to be passed around, while providing the desired max bound.
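A sketch of that wrapping, with a trivial bump allocator standing in for TLSF (rt_malloc/rt_free are made-up names; renaming them to malloc/free and interposing them via link order or LD_PRELOAD on Linux would replace the default allocator process-wide):

```cpp
#include <cassert>
#include <cstddef>

// Pool assigned up front, as suggested above: early in main, once.
// A real implementation would hand this buffer to TLSF; the bump
// allocator below is only a stand-in to show the malloc/free-shaped API.
namespace {
constexpr std::size_t POOL_SIZE = 1 << 20;
alignas(std::max_align_t) unsigned char pool[POOL_SIZE];
std::size_t offset = 0;
}

void * rt_malloc(std::size_t size) {
  // Round up to keep every returned pointer suitably aligned.
  constexpr std::size_t A = alignof(std::max_align_t);
  std::size_t aligned = (size + A - 1) & ~(A - 1);
  if (offset + aligned > POOL_SIZE) return nullptr;  // bounded: no sbrk, no page faults
  void * p = pool + offset;
  offset += aligned;
  return p;
}

void rt_free(void *) {
  // Bump allocator: intentionally a no-op. TLSF would actually reclaim here.
}
```

Nothing downstream needs an allocator parameter; everything that calls malloc-shaped functions gets the bounded pool automatically.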

Another point: Jackie mentioned replacing the current arrays used within rclcpp with allocators. There are cases where knowing the amount of memory you require and setting up an array may still be a valid design choice over an allocator. It depends on the use case. Sometimes more internal complexity in exchange for a simpler external interface is worth it.



Sent from my iPhone