[ANN] macstl 0.1.3, the C++ source library

Glen Low

unread,

Sep 25, 2003, 7:44:40 PM9/25/03

to

Just to let you know, I've updated macstl to 0.1.3, which now works
with Metrowerks Codewarrior 9. The new version also features 65,566
different generated Altivec constants, and all the standard Altivec
operators in a neat object-oriented package. I've listened to my users
and wrote over 100 new pages of reference documentation on the website
and a comprehensive unit test regime.

You can browse the internal references or download the code to look at
some portable, cutting edge tips and techniques. I intend that
valarray should compile successfully on non-OS X systems, so I
appreciate all your comments and testing.

* Expression templates with a twist: curiously recursive inheritance
to share code and minimize object size, creates an open-ended type
system.
* Template template parameters as functors to switch between scalar
and vector code.
* Limited typeof using SFINAE to synchronize Altivec classes to the
inbuilt types and operators.
* Templates that both share code and have different functionality
without up-front policy, partial specialization or virtual inheritance
-- a sort of internalized policy-based design.
* Partial specialization as a expression template pattern matching
technique.
* Valarray calculation using STL iterators to allow algorithms to
scale gracefully.
* Versions of std algorithms further tuned for random access
iterators: indexing rather than incrementing yields aliasing
improvements.

http://www.pixelglow.com/macstl/

Cheers,
Glen Low, Pixelglow Software
www.pixelglow.com

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Andrei Alexandrescu (See Website For Email)

unread,

Sep 26, 2003, 7:01:24 AM9/26/03

to

"Glen Low" <gle...@pixelglow.com> wrote in message
news:9215d7ac.0309...@posting.google.com...

> * Partial specialization as a expression template pattern matching
> technique.

> * Versions of std algorithms further tuned for random access
> iterators: indexing rather than incrementing yields aliasing
> improvements.

Could you detail a little on these features? In what way does indexing give
the compiler a hint that there's no aliasing?

Andrei

Glen Low

unread,

Sep 27, 2003, 10:12:51 AM9/27/03

to

> > * Partial specialization as a expression template pattern matching
> > technique.

> Could you detail a little on these features?

Hi Andrei,

I'll post two replies in case anyone wants to comment separately.

Quite often, a valarray expression could be evaluated more efficiently
it if were simplified.

For example, evaluating v0 * v1 + v2 where v0, v1, v2 are valarrays
naively would invoke v0 [n] * v1 [n] + v2 [n] for each n. Some CPUs
however have a fmadds instruction which does multiply and add in the
same opcode, so fmadds (v0 [n], v1 [n], v2 [n]) would be more
efficient. (For the sake of argument, I'm ignoring for now the
substitution a compiler would make automatically, since I have several
concrete examples later where the compiler needs the help.)

Unfortunately in expression template technique, v0 * v1 + v2's type is
like:

binary_term <binary_term <valarray <T>, valarray <T>,
std::multiplies>, valarray <T>, std::plus>

So we add a partial specialization for

binary_term <binary_term <Expr1, Expr2>, std::multiplies, Expr3,
std::plus>

whose operator[] uses the fmadds instruction instead of calling the
subterms operator[].

In practice, we want to make these substitutions only for particular
types e.g. float and double, and possibly only for expressions that
can be chunked for vector instructions. Thus the partial
specialization happens in a superclass of binary_term that exposes the
T and the chunkability. That is neat because the expression template
still looks like the above, no upfront policies have been chosen, the
subclass actually "chooses" the policy.

In effect, partial specialization is being used as a pattern matcher
for particular expression templates. It is almost akin to regular
expression matching at a metaprogramming level, where the compiler
must be applying a pattern match from the top of the parse tree on
downwards.

Currently, macstl uses this technique for:

1. Changing (x == y).min () and (x == y).max () to use Altivec
predicates which calculate all x == y or any x == y at one go.

I will be using it for the following:

2. x * y + z to use vec_madd etc. the Altivec form of fmadds, which
the compiler does not deduce.
3. x & !y to use vec_andc

etc.

Ideally I was hoping to emulate CSE using templates, so that xy + xy
might be collapsed to 2xy for example, but I cannot deduce the
identities of leaf terms, only their types. Ah well.

Cheers,
Glen Low, Pixelglow Software
www.pixelglow.com

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Glen Low

unread,

Sep 27, 2003, 10:17:00 AM9/27/03

to

> In what way does indexing give
> the compiler a hint that there's no aliasing?

In macstl, an expression template yields an iterator which can step
through the elements of the expression. This iterator can be fairly
complex as a result, e.g. if there are n leaf terms, there are n
subiterators contained within the iterator.

In gcc libstdc++ the random access std::copy is implemented like this:

for (n = last - first; n > 0; --n)
{
*result = *first;
++first; ++result;
}

This is good because the compiler will realize n is fixed outside the
loop and use appropriate instructions and/or unroll the loop. In
PowerPC, they would be mtctr and bdnz, which don't stress out the
branch prediction logic as a result.

However each increment must "write" to an iterator, and this seems to
confuse gcc aliasing logic, especially if first and/or result is a
complex iterator like mine are. The compiler ended up inserting
pessimistic reloads of one or the other iterator.

Furthermore, most CPUs have indexed load instructions e.g. PowerPC's
lvx d, a, b where a is the integer index and b is an address. The
above formulation does not make it explicit to use n as the integer
index and in the face of a complex iterator, the compiler may not make
the deduction.

The following equivalent reformulation does not "write" to an
iterator, and is more succint:

for (i = 0; i != n; ++n)
result [n] = first [n];

This eliminated all redundant loads and stores with -fstrict-aliasing
on. The opcodes produced were more or less:

lvx v0, n, first
stvx v0, n, result

assuming a leaf term for first.

Thus macstl contains these optimized versions in a namespace called
stdext. They could be adopted into portable code fairly easily, since
they make no machine-specific assumptions.

Cheers,
Glen Low, Pixelglow Software
www.pixelglow.com

[ See http://www.gotw.ca/resources/clcm.htm for info about ]