Patterns involving "allocating" on one thread and "freeing" on another.


Rajiv Kurian

Dec 23, 2013, 3:22:34 AM
to lock...@googlegroups.com
This seems like a common pattern in multi-threaded applications:

1) A single thread reads off the network, allocating buffers as necessary.
2) It sends these buffers, along with other info, to worker threads that do the real work. A worker thread processes the event entry and then frees the buffer from step (1). Typically a ring buffer or some other kind of queue is used, with an entry that resembles something like this:

struct event_entry {
    char* buffer;
    int length;
    // Other stuff.
};

The problem with this pattern is that memory allocated on one thread is freed on another. Besides the cost of constant allocation and freeing, the allocation and the free happen on different threads, which causes contention in the allocator.

I need to use a pointer to a buffer instead of an inlined array, since the size of the data for each entry is variable and unknown up front. Currently this is what I am planning to do to handle this:

1) Producer maintains a pool of buffers of different sizes. I could also just use a memory allocator like jemalloc which does pooling behind the scenes.
2) Producer picks an appropriately sized buffer from the pool for an incoming request. It's easy to pick a size especially if the protocol is length prefixed.
3) Once there are enough bytes to form a complete entry (based on our protocol) the producer puts a pointer to this buffer on a ring buffer entry and publishes it.
4) The consumer picks an entry off the ring buffer and synchronously processes it. If it needs the buffer entry beyond the point of initial processing it copies it (this is rare for me). It marks the ring buffer entry as processed. 
5) This ties into step (2). Whenever the producer doesn't have a right-sized buffer in its pool, it checks all the ring buffer entries already marked processed by the consumer in this cycle of the ring buffer, by looking at the consumer/worker's sequence number. It reclaims all of those buffers and puts them back in its pool. This logic needs to run at least once per cycle of the ring buffer (it could be triggered early by a shortage of buffers), otherwise we would end up reusing buffers that are still being processed. If after a reclamation it still cannot find a right-sized buffer, it just allocates one and adds it to the pool (this should be rare in steady state).
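As an illustration, the pool-plus-reclamation logic in steps (1)-(5) might look roughly like the following. This is a minimal single-size-class sketch with made-up names (`BufferPool`, `EventEntry`, `reclaim`), not the actual implementation; a real version would bucket buffers by size and read the consumer's sequence cursor with proper synchronization.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical ring buffer entry: buffer pointer plus the slot's sequence
// number, which is compared against the consumer's cursor during reclamation.
struct EventEntry {
    char*    buffer;
    int      length;
    uint64_t sequence;
};

// Producer-owned pool. Only the producer thread ever touches it, so no
// synchronization is needed on the pool itself.
struct BufferPool {
    std::vector<char*> free_buffers;  // single size class for brevity

    char* acquire(size_t size) {
        if (!free_buffers.empty()) {
            char* b = free_buffers.back();
            free_buffers.pop_back();
            return b;
        }
        // Pool exhausted: fall back to the allocator (rare in steady state).
        return static_cast<char*>(std::malloc(size));
    }

    // Step 5: reclaim buffers from ring entries the consumer has already
    // passed (entry sequence <= consumer cursor), clearing each claimed slot
    // so a buffer is never reclaimed twice.
    void reclaim(EventEntry* ring, size_t ring_size, uint64_t consumer_seq) {
        for (size_t i = 0; i < ring_size; ++i) {
            if (ring[i].buffer && ring[i].sequence <= consumer_seq) {
                free_buffers.push_back(ring[i].buffer);
                ring[i].buffer = nullptr;
            }
        }
    }
};
```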

Any comments or obvious faults with my logic? What do you guys use to exchange buffers of dynamic sizes between two threads? I am trying to avoid cross-thread allocation and freeing. In fact I am trying to avoid allocation and freeing in steady state altogether.

Thanks!

Dmitriy V'jukov

Jan 9, 2014, 9:39:43 AM
to lock...@googlegroups.com
Hi!

Your scheme sounds reasonable.


As far as I understand, it's related to your other question about the SPMC
queue. So if you use a queue like this:
http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
have an upper bound on buffer size, and are ready to waste some memory,
then you can embed the buffer directly into each element of the queue.
This queue is actually bi-directional: currently consumers transfer
"empty" elements back to the producer, but there is no reason they cannot
transfer useful payload back instead. So in the extreme case you
can have just:

#define MAX_MSG_SIZE (64<<10)
#define QUEUE_SIZE (1<<10)

struct element {
    char buf[MAX_MSG_SIZE];
    // ... other data
};

queue<QUEUE_SIZE, element> q;

This consumes 64M of memory persistently. But it is as simple and as
fast as you can get.

You will need to "split" the enqueue and dequeue functions into 2 parts:
the first waits until the next element becomes ready for
production/consumption, and the second part "commits" the
production/consumption.
Producer workflow:
- wait till the next element becomes ready for production (it's most
likely already ready, because the queue is FIFO)
- fill in the element (the buffer is already there, right in the element)
- commit production (make the element ready for consumption)
- repeat
Consumer workflow is symmetric:
- wait till the next element becomes ready for consumption
- process the element (don't need to copy the buffer, read it right in place)
- commit consumption (make the element ready for subsequent production)
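The split begin/commit workflow above might be sketched like this for the single-producer/single-consumer case. The names (`SplitRing`, `begin_produce`, etc.) are illustrative, and busy spin-waits stand in for whatever real waiting strategy is used:

```cpp
#include <atomic>
#include <cstddef>

// Bounded SPSC ring with the enqueue/dequeue split into a "wait for slot"
// part and a "commit" part, so the payload lives directly in the slot.
template <size_t N, typename T>
struct SplitRing {
    T slots[N];
    std::atomic<size_t> head{0};  // next slot to produce into
    std::atomic<size_t> tail{0};  // next slot to consume from

    // Part 1 of enqueue: wait until the next slot is free, hand it out
    // in place so the caller can fill the embedded buffer directly.
    T* begin_produce() {
        while (head.load(std::memory_order_relaxed) -
               tail.load(std::memory_order_acquire) >= N) { /* spin */ }
        return &slots[head.load(std::memory_order_relaxed) % N];
    }
    // Part 2 of enqueue: publish the slot to the consumer.
    void commit_produce() { head.fetch_add(1, std::memory_order_release); }

    // Part 1 of dequeue: wait until a slot is ready, read it in place.
    T* begin_consume() {
        while (tail.load(std::memory_order_relaxed) ==
               head.load(std::memory_order_acquire)) { /* spin */ }
        return &slots[tail.load(std::memory_order_relaxed) % N];
    }
    // Part 2 of dequeue: hand the slot back for subsequent production.
    void commit_consume() { tail.fetch_add(1, std::memory_order_release); }
};
```

Note the element is never copied: the producer fills it in place and the consumer reads it in place, exactly as in the workflows above.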

Rajiv Kurian

Jan 9, 2014, 7:34:54 PM
to lock...@googlegroups.com
It is somewhat related but still a common enough pattern in most networked + threaded applications.
> So if you use a queue like this:
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
> and have an upper bound on buffer size and ready to waste some memory,
> then you can embed the buffer directly into each element of the queue.
> This queue is actually bi-directional, currently consumers transfer
> "empty" elements back to producer. But there is no reason they can not
> transfer useful payload back to producer. So in the extreme case you
> can have just:
>
> #define MAX_MSG_SIZE (64<<10)
I think this is a great solution where MAX_MSG_SIZE is reasonable and there is not much variance between requests. I am accepting requests to process images/videos. The variance is massive and the maximum size is way too big (100s of MB). Another problem is that the network buffers get filled in asynchronously. So if client-1 connects and wants to send 1 MB of data, I can get a slot in the ring buffer and start filling in its buffer. In the meantime client-2 could connect and actually send its data faster than the first one, but we have to wait for the first entry to be published before the second. The first client could actually just become unresponsive and need to be killed by a timer or something. In general, a slow client that starts sending a request before other faster clients could stall the pipeline. A pointer to a buffer, while definitely slower than an inline one, avoids these issues.

> #define QUEUE_SIZE (1<<10)
>
> struct element {
>   char buf[MAX_MSG_SIZE];
>   ... other data
> };
>
> queue<QUEUE_SIZE, element> q;
>
> This consumes 64M of memory persistently.  But it is as simple and as
> fast as you can get.
>
> You will need to "split" enqueue and dequeue functions into 2 parts:
> first waits when the next element becomes ready for
> production/consumption, and the second part "commits"
> production/consumption.
> Producer workflow:
> - wait till the next element becomes ready for production (it's most
> likely already ready, because the queue if FIFO)
> - fill in the element (the buffer is already there, right in the element)
> - commit production (make the element ready for consumption)
> - repeat
> Consumer workflow is symmetric:
> - wait till the next element becomes ready for consumption
> - process the element (don't need to copy the buffer, read it right in place)
> - commit consumption (make the element ready for subsequent production)
Right, this is what I am doing, plus the extra logic to put buffers back into the pool when I detect a need for buffers.

Dmitriy V'jukov

Jan 12, 2014, 5:41:05 AM
to lock...@googlegroups.com
In the general case you can just use a sufficiently good multi-threaded
allocator, e.g. tcmalloc.
However, if you want more performance, then what you described makes
perfect sense. Since you have only 1 producer, you can use the
classical lock-free stack as the transfer queue for empty buffers.
The producer can grab the whole queue with a single XCHG(stack, NULL)
operation; this avoids any issues with ABA and safe memory
reclamation.
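A sketch of that idea: a classic Treiber-style stack where workers push returned buffers with a CAS loop, and the single producer drains the entire list with one atomic exchange. Because the producer is the only one popping, and it pops everything at once, there is no per-node CAS on the pop side and hence no ABA window. Names here are illustrative:

```cpp
#include <atomic>

// Node carrying an "empty" buffer back to the producer.
struct Node {
    Node* next;
    char* buffer;
};

struct FreeStack {
    std::atomic<Node*> top{nullptr};

    // Workers: lock-free push of a finished buffer.
    void push(Node* n) {
        Node* old = top.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!top.compare_exchange_weak(
                     old, n,
                     std::memory_order_release,
                     std::memory_order_relaxed));
    }

    // Producer: take the whole list in one XCHG(stack, NULL);
    // the returned chain can then be walked without any further atomics.
    Node* grab_all() {
        return top.exchange(nullptr, std::memory_order_acquire);
    }
};
```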



--
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

Rajiv Kurian

Jan 12, 2014, 3:10:47 PM
to lock...@googlegroups.com
Thank you Dmitry!

Miral Mirality

Jan 12, 2014, 4:54:59 PM
to lock...@googlegroups.com
On Friday, January 10, 2014 1:34:54 PM UTC+13, Rajiv Kurian wrote:
> I think this is a great solution where the MAX_MSG_SIZE is reasonable and there is not much variance between requests. I am accepting requests to process images/videos. The variance is massive and the maximum size is way too big (100s of MB). Another problem is that the network buffers get filled in asynchronously. So if client-1 connects and wants to send 1 MB of data, I can get a slot in a ring buffer and start filling in the buffer. In the meantime another client could connect and start writing data. Now client-2 could connect and actually send it's data faster than the first one, but we have to wait on the first one to publish the second one. The first one could actually just become unresponsive and could need to be killed by a timer or something. In general a slow client that starts sending a request before other faster clients could stall the pipeline. A pointer to the buffer, while definitely slower than an inline one avoids these issues.
 
You can just have separate queues that work cooperatively to resolve this.  For example, create a freelist queue that is preloaded with a large number of small-to-medium preallocated buffers (with extra space for a linked list pointer), and a work queue that is initially empty.  When your network thread starts receiving data, pop a buffer off the freelist and start filling it; once it is full, pop another buffer off the freelist and link it to the first (via a linked-list-style pointer).  Once all the data is received, queue the linked list of buffers (aka, just the pointer to the head buffer) onto your work queue.
 
Your processing thread(s) can then dequeue from the work queue, and when they're done with the buffers just push them back onto the freelist.  No need to worry about sequence ids or the order in which work arrives this way -- although you will still need to deal with timeouts and with running out of buffers -- and of course selecting the correct types of queues depending on how many threads you have on each end.
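The chained-buffer part of this scheme might look roughly like the following. All names (`ChainBuf`, `append`) are hypothetical, and a plain vector stands in for the concurrent freelist queue; only the chaining logic is the point here:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Fixed-size buffer with space for a linked-list pointer, as described above.
// Large messages span several of these, linked through `next`.
struct ChainBuf {
    static const size_t CAP = 4096;  // assumed small-to-medium buffer size
    ChainBuf* next = nullptr;
    size_t    used = 0;
    char      data[CAP];
};

// Append incoming bytes to the chain ending at `tail`, popping fresh buffers
// from the freelist whenever the current one fills up. Returns the new tail.
// (Sketch only: assumes the freelist never runs dry.)
ChainBuf* append(ChainBuf* tail, const char* src, size_t len,
                 std::vector<ChainBuf*>& freelist) {
    while (len > 0) {
        size_t room = ChainBuf::CAP - tail->used;
        if (room == 0) {
            ChainBuf* nb = freelist.back();
            freelist.pop_back();
            nb->next = nullptr;
            nb->used = 0;
            tail->next = nb;  // link the fresh buffer onto the chain
            tail = nb;
            room = ChainBuf::CAP;
        }
        size_t n = len < room ? len : room;
        std::memcpy(tail->data + tail->used, src, n);
        tail->used += n;
        src += n;
        len -= n;
    }
    return tail;
}
```

Once the message is complete, only the head pointer of the chain goes onto the work queue; the processing thread walks the chain and pushes each buffer back onto the freelist when done.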