[Sbcl-help] Garbage collector and realtime

Josef Wolf

unread,

Mar 10, 2014, 12:54:29 PM3/10/14

to sbcl...@lists.sourceforge.net

Hello folks,

I am considering to rewrite an application I wrote about 10 years ago in C
into CL.

This (single-threaded) application takes as input a DVB-S transponder, takes
it apart, parses the TS/SI/whatever tables, demuxes the services and serves
the services to clients as HTTP streams.

This used to work great: with an 400MHz Pentium box, four such applications
were running, receiving 4 transponder and serve HTTP streams to up to 200
clients. The box was running 24/7 and even managed to survive two soccer
world championships :-). The only reason for downtime were kernel upgrades.

The problem is that the structure of the TS/SI/whatever tables is rather
complicated, so I'd like to take advantage of CL's macros to parse them.
My first experimentations in this direction are quite encouraging, so I'd
like to go further this path.

But I am not entirely sure about the GC behavior and how the GC can be
tuned. After all, the satellite won't stop sending data with 5MBytes per
second just because my application decides to take a nip with the GC. And
the clients won't like it also, every set-top-box will get upset when the
A/V-stream has too much delays.

I don't expect the CL implementation to compete with the performance I
mentioned above. But I definitely need a way to ensure that the GC won't
interrupt my main processing for more than -- say -- 20 milliseconds. This is
the the time to fill 50% of the hardware-ringbuffer of the DVB card.
20ms every second (1% overhead) would be great. But 2 seconds once a day
(0.0023% overhead) would be fatal. You get the idea: I am willing to trade
overhead against responsiveness.

I know I can force GC by calling SB-EXT:GC. And I see there are several
options to tune HOW OFTEN the GC is invoked. But I can't find a way to limit
the DURATION of GC invocations, so that the rest of the garbage is left for
the next GC invocation.

Or maybe there's a way to move the GC into a separate thread, which would then
not interfere with my main working thread?

I know I can reduce the frequency of GC runs by avoid of consing, managing
memory pools, and by allocating large memory chunks statically. But my problem
is _not_ the _frequency_ of GC invocations. My problem is the _duration_ of the
invocations.

I don't think I can completely avoid consing, so GC will need to be run from
time to time.

Any thoughts?

--
Josef Wolf
j...@raven.inka.de

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Sbcl-help mailing list
Sbcl...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-help

Stelian Ionescu

unread,

Mar 10, 2014, 1:44:16 PM3/10/14

to sbcl...@lists.sourceforge.net

On Mon, 2014-03-10 at 17:54 +0100, Josef Wolf wrote:
> Hello folks,
>
> I am considering to rewrite an application I wrote about 10 years ago in C
> into CL.
>
> This (single-threaded) application takes as input a DVB-S transponder, takes
> it apart, parses the TS/SI/whatever tables, demuxes the services and serves
> the services to clients as HTTP streams.

[...]

> I don't think I can completely avoid consing, so GC will need to be run from
> time to time.
>
> Any thoughts?

Make it multi-threaded, and keep in C the part that enqueues the frames
into a ring buffer. If using a lock, you'd have to be careful that the
consumer thread written in CL doesn't hold it while in GC, because that
would block the writer.

--
Stelian Ionescu a.k.a. fe[nl]ix
Quidquid latine dictum sit, altum videtur.

signature.asc

Günther Thomsen

unread,

Mar 10, 2014, 2:24:48 PM3/10/14

to Josef Wolf, sbcl...@lists.sourceforge.net

Not sure, whether CL is the right tool for the job. Allegedly real-time requirements have been met by avoiding allocation on the heap, but I'm afraid then one is using such an awkward subset of CL that one might be better off just using C (or Ada) instead.

> The problem is that the structure of the TS/SI/whatever tables is rather
complicated, so I'd like to take advantage of CL's macros to parse them.

How about writing a DSL (to make it easier chose one which can be READ/EVAL'ed ;-) and translate that into C? That might allow you also to reuse parts of the already existing program.

~ Guenther

Josef Wolf

unread,

Mar 10, 2014, 5:00:23 PM3/10/14

to sbcl...@lists.sourceforge.net

On Mon, Mar 10, 2014 at 06:44:16PM +0100, Stelian Ionescu wrote:
> Make it multi-threaded, and keep in C the part that enqueues the frames
> into a ring buffer. If using a lock, you'd have to be careful that the
> consumer thread written in CL doesn't hold it while in GC, because that
> would block the writer.

OK, that would help on the input side. But the clients need a constant data
stream also., though the requirements are not that hard on the client
side. But the delay from input to delivery to the client should not exceed
0.5sec or something. So I'd still need a way to limit the duration of the GC.

I wonder what the "nursery collection" and "collection generations" are that
the user manual is talking about. Any pointers to a more thorough explanation
of the algorithm?

Stelian Ionescu

unread,

Mar 10, 2014, 5:23:08 PM3/10/14

to Josef Wolf, sbcl...@lists.sourceforge.net

On Mon, 2014-03-10 at 22:00 +0100, Josef Wolf wrote:
> On Mon, Mar 10, 2014 at 06:44:16PM +0100, Stelian Ionescu wrote:
> > Make it multi-threaded, and keep in C the part that enqueues the frames
> > into a ring buffer. If using a lock, you'd have to be careful that the
> > consumer thread written in CL doesn't hold it while in GC, because that
> > would block the writer.
>
> OK, that would help on the input side. But the clients need a constant data
> stream also., though the requirements are not that hard on the client
> side. But the delay from input to delivery to the client should not exceed
> 0.5sec or something. So I'd still need a way to limit the duration of the GC.

0.5s is large enough that you're probably ok without special effort.
I'd go ahead with the implementation and start optimizing only if the
clients actually notice frame delays.

signature.asc

Orivej Desh

unread,

Mar 10, 2014, 5:26:44 PM3/10/14

to sbcl...@lists.sourceforge.net

Run GC every N frames/seconds to make GC time more predictable and
easier to measure. Measure it. Set input ring buffer size large enough
not to lose any input. If GC pauses are too large for output, feed
decoded frames to a separate process (e.g. using ZMQ IPC) with no or
small pauses and serve from there after a delay large enough to cover
any pause.

Josef Wolf

unread,

Mar 10, 2014, 5:24:03 PM3/10/14

to sbcl...@lists.sourceforge.net

On Mon, Mar 10, 2014 at 11:24:48AM -0700, Günther Thomsen wrote:
> > The problem is that the structure of the TS/SI/whatever tables is rather
> > complicated, so I'd like to take advantage of CL's macros to parse them.
> How about writing a DSL (to make it easier chose one which can be
> READ/EVAL'ed ;-) and translate that into C? That might allow you also to
> reuse parts of the already existing program.

Although it's a long time since I wrote this C code, I remember that
processing those tables is a quirky task. I don't think it would take a long
time until the generated code would fulfill Greenspun's tenth rule.

Christopher Stacy

unread,

Mar 10, 2014, 6:23:55 PM3/10/14

to sbcl...@lists.sourceforge.net

Another option to consider, which Lisp programmers sometimes forget,
is to simply turn off garbage collection entirely. If you don’t have enough room,
given the duration of the program, do explicit storage management in various ways.

Josef Wolf

unread,

Mar 10, 2014, 7:05:06 PM3/10/14

to sbcl...@lists.sourceforge.net

On Mon, Mar 10, 2014 at 06:44:16PM +0100, Stelian Ionescu wrote:

> Make it multi-threaded, and keep in C the part that enqueues the frames
> into a ring buffer. If using a lock, you'd have to be careful that the
> consumer thread written in CL doesn't hold it while in GC, because that
> would block the writer.

I wonder how I'd go about this.

Either the Multithreading would be done from C, then I'd need to fiddle with
SBCL internals.

Or the multithreading would be done in CL, then I'd need a way to suppress GC
in the writer thread. After all, the GC _can_ be run in the writer thread,
won't it? I don't see how to force GC to be run in a specific thread. If there
were such a mechanism, I could simply create a task solely for the GC.

Let's assume I'd use multitaksing: when GC is run in one task, would the other
threads block when the need to cons? I guess the GC would use some locking
internally, won't it?

And how would I ensure that GC won't be run while the lock is held? I can't
find any methods to suppress GC. Would it be enough to call GC right before
ackquiring the lock?

Let's assume I have a thread that don't do any consing at all. Can I safely
assume that the GC would never run in thins thread?

Martin Cracauer

unread,

Mar 10, 2014, 7:53:06 PM3/10/14

to Josef Wolf, sbcl...@lists.sourceforge.net

Josef Wolf wrote on Tue, Mar 11, 2014 at 12:05:06AM +0100:
> On Mon, Mar 10, 2014 at 06:44:16PM +0100, Stelian Ionescu wrote:
> > Make it multi-threaded, and keep in C the part that enqueues the frames
> > into a ring buffer. If using a lock, you'd have to be careful that the
> > consumer thread written in CL doesn't hold it while in GC, because that
> > would block the writer.
>
> I wonder how I'd go about this.
>
> Either the Multithreading would be done from C, then I'd need to fiddle with
> SBCL internals.

No, the idea is to have a pure C thread that the Lisp knows nothing
about. That works fine, we use it in our toy at work. It will
continue to run during Lisp GC.

Symbolics did pause-free GC but it was overall very expensive.

Martin
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <crac...@cons.org> http://www.cons.org/cracauer/

Nikodemus Siivola

unread,

Mar 11, 2014, 3:54:02 AM3/11/14

to Martin Cracauer, sbcl-help

Other ways of keeping GC in check:

- Use DYNAMIC-EXTENT a lot -- ie. stack allocate. With appropriate
declarations SBCL can stack allocate pretty much any closure, vector,
cons, or structure-object you care to create. (Still, sometimes
surprisingly simple things can prevent it (mostly due to the compiler
still not being sufficiently smart), which is why code that really
relies on stack allocation happening should have unit tests that
verify that.)

- Use unboxed storage: pack your data into eg. vectors of
unsigned-byte 8 or 32. Then even when you heap-allocate unboxed data
causes *way* less GC pressure. (Less to scavenge.)

- Avoid dirtying old objects with pointers in them. It may be more
efficient re. GC to cons up a whole new structure than to modify a
slot in an old one as there are less dirty old pages to scavenge.
Sure, if you need to change a slot all the time, then it is probably
not feasible.

- ...which brings us to the worst trick of them all. If you have say
10M objects that are going to be subject to intense allocation
intensive algorithm, but aside from object identities everything could
live in unboxed storage? Have a large vector of those objects, and
refer to them by index instead of identity (fetching the real object
when necessary) -- ie. take a loan of storage and indirection to pay
off the GC cost of scavenging lots and lots of stuff.

Cheers,

-- nikodemus

Josef Wolf

unread,

Mar 11, 2014, 4:57:22 AM3/11/14

to sbcl...@lists.sourceforge.net

On Mon, Mar 10, 2014 at 07:53:06PM -0400, Martin Cracauer wrote:
> No, the idea is to have a pure C thread that the Lisp knows nothing
> about. That works fine, we use it in our toy at work. It will
> continue to run during Lisp GC.

A pure C thread? That would be a separate process, then?

I would have gone this route:

Create a thread for CL for reading the hardware (which has a ringbuffer with a
fixed size of 188*1024 bytes). This thread would use CFFI to call a library to
do additional buffering. This thread would use MAKE-ARRAY with DISPLACED-TO to
deliver the data to the worker thread, so no allocations would happen for the
bulk data. This thread can be kept very simple, so it should be not hard to
avoid consing.

My assumption here is that GC would only be triggered when a thread tries to
allocate storage. Since the buffering thread never allocates (after initial
setup), it would never do any GC. Is this assumption correct?

The worker thread, OTOH, would ensure that
1. The buffering thread is emptied before invoking GC. This is needed to avoid
overrun during GC.
2. All data received during GC is pulled immediately after GC. This is needed
to avoid overrun during processing of the data. After all, the buffer is
expected to be filled up after GC.

So the worker would do something like this:

(do ()
(nil)
(pull-all-data-from-transponder)
(gc)
(pull-all-data-from-transponder)
(process-pulled-data))

Is my assumption correct that GC will only run in a thread when it tries to
allocate storage?

What happens when GC is run _really_ often? Say, every 40ms (that is, after
receiving 188*1024 bytes). Will the GC inspect _all_ data items at every run?
Or will it only check the items where some sort of reference counter has been
reduced?

Can anybody give me a pointer to a comprehensible description of the GC

algorithm?

--
Josef Wolf
j...@raven.inka.de

Josef Wolf

unread,

Mar 11, 2014, 5:25:24 AM3/11/14

to sbcl-help

On Tue, Mar 11, 2014 at 09:54:02AM +0200, Nikodemus Siivola wrote:
> Other ways of keeping GC in check:
> - Use DYNAMIC-EXTENT a lot -- ie. stack allocate. With appropriate
> declarations SBCL can stack allocate pretty much any closure, vector,
> cons, or structure-object you care to create. (Still, sometimes
> surprisingly simple things can prevent it (mostly due to the compiler
> still not being sufficiently smart), which is why code that really
> relies on stack allocation happening should have unit tests that
> verify that.)

Hmm, I'd rather use dynamic storage only when it makes sense semantically.

> - Use unboxed storage: pack your data into eg. vectors of
> unsigned-byte 8 or 32. Then even when you heap-allocate unboxed data
> causes *way* less GC pressure. (Less to scavenge.)

Well, I actually use (make-array size :element-type '(unsigned-byte 8)))
and DISPLACED-TO for the bulk data. But I don't want to do too much
optimizations of this type in this early stage of the project.

> - Avoid dirtying old objects with pointers in them. It may be more
> efficient re. GC to cons up a whole new structure than to modify a
> slot in an old one as there are less dirty old pages to scavenge.
> Sure, if you need to change a slot all the time, then it is probably
> not feasible.

Can you elaborate on this? I don't really get what you are trying to say here.

> - ...which brings us to the worst trick of them all. If you have say
> 10M objects that are going to be subject to intense allocation
> intensive algorithm, but aside from object identities everything could
> live in unboxed storage? Have a large vector of those objects, and
> refer to them by index instead of identity (fetching the real object
> when necessary) -- ie. take a loan of storage and indirection to pay
> off the GC cost of scavenging lots and lots of stuff.

If I understand this suggestion correctly, then I have already done this type
of optimization in the C implementation. The storage (once allocated) would
never be freed. Instead the data structures were reused. So after an initial
phase, the storage can be assumed to be static.

The question is, whether the GC (once triggered by some other event) will have
to inspect those structures every time it runs? Or is there some sort of
optimization to process only structures where the reference counter has be
reduced or something like that?

--
Josef Wolf
j...@raven.inka.de

Andreas Franke

unread,

Mar 11, 2014, 11:36:37 AM3/11/14

to Josef Wolf, sbcl-help

On Tue, Mar 11, 2014 at 10:25 AM, Josef Wolf <j...@raven.inka.de> wrote:

If I understand this suggestion correctly, then I have already done this type
of optimization in the C implementation. The storage (once allocated) would
never be freed. Instead the data structures were reused. So after an initial
phase, the storage can be assumed to be static.

The question is, whether the GC (once triggered by some other event) will have
to inspect those structures every time it runs? Or is there some sort of
optimization to process only structures where the reference counter has be
reduced or something like that?

The heap objects are partitioned in generations 0 ... 6

where generation 6 is static, and generation 5 is the highest
generation where objects are moved to. Watch the logs:

(setf (gc-logfile) "/tmp/gc-log.txt")

A useful strategy might be to
1) allocate your "permanent" data structures,

2) perform (gc :full t) to make sure the are moved to generation 5.

3) prune the number of generations looked at by gc, for example:

(in-package :cl-user)
(define-alien-variable gencgc-oldest-gen-to-gc (signed 8))

(setf gencgc-oldest-gen-to-gc 2)

4) and then tune the gc parameters experimentally:
(setf (sb-ext:bytes-consed-between-gcs) ...)

and for 0 <= k < (1- gencgc-oldest-gen-to-gc):

(setf (sb-ext:generation-minimum-age-before-gc k) ...)
(setf (sb-ext:generation-number-of-gcs-before-promotion k) ...)

and make sure gc is called often enough.

Note that during gc all threads are stopped completely.

HTH,
Andreas

Josef Wolf

unread,

Mar 14, 2014, 6:09:47 AM3/14/14

to sbcl-help

Thanks, I'll give it a try!

> Note that during gc all threads are stopped completely.

But that should not apply to threads created from C, since CL should not
know anything about them, right?

Reply all

Reply to author

Forward