
ANNOUNCE: C++ library for shared-memory parallel programming


Arch D. Robison

May 13, 2006, 7:16:35 AM

Intel is currently offering a beta version of a C++ library for
shared-memory programming. The library contains templates for common
parallel programming patterns and containers. These pre-tested templates
simplify the writing of correct scalable parallel programs.

For more information, download the beta for free via
http://www.intel.com/software/products/tbb/beta

Arch Robison (lead developer of the library)
Intel Corporation

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Joe Seigh

May 13, 2006, 4:42:48 PM

Arch D. Robison wrote:
> Intel is currently offering a beta version of a C++ library for
> shared-memory programming. The library contains templates for common
> parallel programming patterns and containers. These pre-tested templates
> simplify the writing of correct scalable parallel programs.
>
> For more information, download the beta for free via
> http://www.intel.com/software/products/tbb/beta
>
> Arch Robison (lead developer of the library)
> Intel Corporation
>

Is there any way to find out what the API looks like without
having to register and download the entire thing? Perhaps
somebody who thinks they're going to need it anyway can
summarize here.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

wrybred

May 14, 2006, 7:38:46 PM

Joe Seigh wrote:
> Is there any way to find out what the API looks like without
> having to register and download the entire thing? Perhaps
> somebody who thinks they're going to need it anyway can
> summarize here.

It includes some lock-free containers, capabilities, and some
frameworky stuff for parallel applications. Looks useful for those who
don't want to roll their own--how many people can really write solid
lock-free algorithms?

I'm curious what the eventual asking price and competitors would be.
Mutex + standard container + condition variable is pretty wasteful on
multi-processor/multi-core hardware.
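
To make concrete what that baseline looks like, here is a minimal sketch of
the mutex-plus-container-plus-condition-variable queue (written against the
current C++ standard library for brevity; LockedQueue and its members are
illustrative names, not from any product):

#include <condition_variable>
#include <mutex>
#include <queue>

// One big lock around a standard container: correct, but every producer
// and consumer serializes on the same mutex, and its cache line bounces
// between cores under contention.
template <typename T>
class LockedQueue {
public:
    void push(const T& v) {
        {
            std::lock_guard<std::mutex> g(m_);
            q_.push(v);
        }
        cv_.notify_one();        // wake one waiting consumer
    }
    T pop() {
        std::unique_lock<std::mutex> g(m_);
        while (q_.empty())
            cv_.wait(g);         // sleep until a push arrives
        T v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};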

Malte Clasen

May 14, 2006, 7:36:10 PM

Arch D. Robison wrote:
> Intel is currently offering a beta version of a C++ library for
> shared-memory programming. The library contains templates for common
> parallel programming patterns and containers. These pre-tested templates
> simplify the writing of correct scalable parallel programs.

Has anyone compared it yet to the recently accepted Boost library
shmem?

http://lists.boost.org/boost-announce/2006/02/0082.php
http://lists.boost.org/boost-announce/2006/02/0083.php

Malte

Joe Seigh

May 15, 2006, 2:56:22 PM

wrybred wrote:
> Joe Seigh wrote:
>
>>Is there any way to find out what the API looks like without
>>having to register and download the entire thing? Perhaps
>>somebody who thinks they're going to need it anyway can
>>summarize here.
>
>
> It includes some lock-free containers, capabilities, and some
> frameworky stuff for parallel applications. Looks useful for those who
> don't want to roll their own--how many people can really write solid
> lock-free algorithms?
>
> I'm curious what the eventual asking price and competitors would be.
> Mutex + standard container + condition variable is pretty wasteful on
> multi-processor/multi-core hardware.


It's probably moot how many people can write lock-free algorithms. Except
maybe for lock-free LIFO stacks and Michael and Scott's lock-free queue,
most lock-free algorithms have been or will be patented. So the use of
lock-free techniques will be limited to commercial libraries.

There's another commercial library, by Parallel Scalable Solutions:
http://www.pss-ab.com/
You don't have to register to see the documentation. I haven't
used it myself.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Arch D. Robison

May 15, 2006, 3:00:39 PM

Obligatory disclaimer: The following are my own opinions, not Intel's. I am
the lead developer of the library.

There are programming languages with direct support for parallel
programming, but they can be difficult to integrate into existing
environments. So we looked at various parallel languages, and asked
ourselves "how much of this can we turn into a C++ library?" Though a
library cannot provide the beautiful syntax of a new language, C++ is
nonetheless a powerful language that let us adapt much useful functionality
from the other languages. For example, the library's task scheduler is
adapted from Cilk (http://supertech.csail.mit.edu/cilk/), and the parallel
loops operate on a recursive range concept inspired by STAPL
(http://parasol.tamu.edu/compilers/research/STAPL/). C++ is a powerful
language for this purpose because it combines the efficiency of C with
support for generic programming. Two examples: you can use the parallel
for-all template with your own type of iteration space, and you can use the
parallel reduction template on your own types, not only on built-in types.
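
As a sketch of the reduction example, assuming an interface along the lines
of the shipped Threading Building Blocks (the beta may differ in detail):
the body type below is user-defined, and any type with empty(),
is_divisible(), and a splitting constructor could stand in for
blocked_range as the iteration space. MaxIndex and index_of_max are
illustrative names, not from the library.

#include <stddef.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// User-defined reduction: find the index of the largest element.
// operator() accumulates over one subrange; the splitting constructor
// makes a fresh accumulator for a stolen subrange; join() merges results.
struct MaxIndex {
    const float* a;
    size_t index;
    float value;

    MaxIndex(const float* a_) : a(a_), index(0), value(-1.0e30f) {}
    MaxIndex(MaxIndex& other, tbb::split)
        : a(other.a), index(0), value(-1.0e30f) {}

    void operator()(const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i)
            if (a[i] > value) { value = a[i]; index = i; }
    }
    void join(const MaxIndex& rhs) {
        if (rhs.value > value) { value = rhs.value; index = rhs.index; }
    }
};

size_t index_of_max(const float* a, size_t n) {
    MaxIndex body(a);
    tbb::parallel_reduce(tbb::blocked_range<size_t>(0, n, 1000), body);
    return body.index;
}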

The library is not a general purpose threading library. It targets
threading for speed-up on systems with multiple CPUs or multiple cores.
Threading for speed-up, as anyone who has done it will attest, requires not
only avoiding race conditions but also using resources (e.g. cache, memory
bandwidth, memory space) efficiently and balancing load. Simply unleashing a
thread for every possible piece of work that could be done in parallel will
bog a machine down. The library uses the work-stealing approach from Cilk,
which tends to make efficient use of memory space and cache, avoids
oversubscribing the hardware with an excessive number of threads, and
balances load well across processors.

The containers in the library are independent of the scheduler. They use a
combination of lock-free techniques and fine-grain locking. We tried both
approaches for some containers, and found that in general, fine-grain
locking performs better than lockless, because the former usually uses fewer
atomic operations. Atomic operations are fairly expensive on modern
processors, because of their interaction with deep pipelines and caches.
Furthermore, lock-free algorithms pose subtle memory reclamation problems
for languages without garbage collection. See
http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf for a
discussion. In some contexts, the advantages of lockless algorithms
outweigh their costs. Depending on feedback, we will consider whether to
add purely lockless algorithms in future versions of the library. We
deliberately started small, and want to grow the library based on
experience.
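
To make the fine-grain locking point concrete, here is a hypothetical
striped hash map (illustrative only, not the library's container): each
operation costs one lock acquisition, roughly one atomic read-modify-write,
on a bucket that other threads rarely touch, whereas a lockless design may
retry several compare-and-swap operations under contention.

#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <utility>

template <typename K, typename V, std::size_t NBuckets = 64>
class StripedMap {
    struct Bucket {
        std::mutex lock;                    // guards only this bucket
        std::list<std::pair<K, V> > items;
    };
    Bucket buckets_[NBuckets];

    Bucket& bucket_for(const K& k) {
        return buckets_[std::hash<K>()(k) % NBuckets];
    }

public:
    void insert(const K& k, const V& v) {
        Bucket& b = bucket_for(k);
        std::lock_guard<std::mutex> g(b.lock);   // one atomic RMW
        for (auto& kv : b.items)
            if (kv.first == k) { kv.second = v; return; }
        b.items.push_back(std::make_pair(k, v));
    }

    bool find(const K& k, V& out) {
        Bucket& b = bucket_for(k);
        std::lock_guard<std::mutex> g(b.lock);
        for (const auto& kv : b.items)
            if (kv.first == k) { out = kv.second; return true; }
        return false;
    }
};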

Arch D. Robison
Intel Corporation

Joe Seigh

May 16, 2006, 6:08:28 AM

Arch D. Robison wrote:
> The containers in the library are independent of the scheduler. They use a
> combination of lock-free techniques and fine-grain locking. We tried both
> approaches for some containers, and found that in general, fine-grain
> locking performs better than lockless, because the former usually uses fewer
> atomic operations. Atomic operations are fairly expensive on modern
> processors, because of their interaction with deep pipelines and caches.
> Furthermore, lock-free algorithms pose subtle memory reclamation problems
> for languages without garbage collection. See
> http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf for a
> discussion. In some contexts, the advantages of lockless algorithms
> outweigh their costs. Depending on feedback, we will consider whether to
> add purely lockless algorithms in future versions of the library. We
> deliberately started small, and want to grow the library based on
> experience.
>

You can use things like a version of RCU combined with SMR hazard pointers
to eliminate the requirement for the store/load memory barrier (MFENCE on
x86) in the hazard pointer load. E.g., on my system (a P3, which has no
MFENCE instruction), the following hazard pointer load, with no memory
barriers to stall the pipeline,

static __inline__ void *smrload32(void **hptr, void **src) {
    void *ret;

    __asm__ __volatile__ (
        "mov 0(%2), %%ecx ;\n"   // load source pointer
        "1:\t"
        "mov %%ecx, %0 ;\n"
        "mov %%ecx, 0(%1) ;\n"   // store into hazard pointer
        "mov 0(%2), %%ecx ;\n"   // reload source pointer
        "cmp %%ecx, %0 ;\n"
        "jne 1b ;\n"
        "mov %%ecx, 4(%1) ;\n"   // store into hazard pointer[1]
        : "=&r" (ret)
        : "r" (hptr), "r" (src)
        : "cc", "memory", "ecx"
    );
    return ret;
}

runs about 10 times faster than the version with memory barriers
(8 psec vs. 81 psec on an 866 MHz P3):

static __inline__ void *smrload_sync32(void **hptr, void **src) {
    void *ret;

    __asm__ __volatile__ (
        "mov 0(%2), %%ecx ;\n"          // load source pointer
        "1:\t"
        "mov %%ecx, %0 ;\n"
        "lock; addl $0, 0(%%esp) ;\n"   // release membar
        "mov %%ecx, 0(%1) ;\n"          // store into hazard pointer
        "lock; addl $0, 0(%%esp) ;\n"   // store/load membar
        "mov 0(%2), %%ecx ;\n"          // reload source pointer
        "cmp %%ecx, %0 ;\n"
        "jne 1b ;\n"
        "mov %%ecx, 4(%1) ;\n"          // store into hazard pointer[1]
        : "=&r" (ret)
        : "r" (hptr), "r" (src)
        : "cc", "memory", "ecx"
    );
    return ret;
}


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Joe Seigh

May 16, 2006, 6:12:02 AM

I wrote:

> runs about 10 times faster than the version with memory barriers
> (8 psec vs. 81 psec on an 866 MHz P3)

That should be nsec, not psec, of course.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
