
For Loops and Lambda in "Lies, Damned Lies, and Microbenchmarks"


Damien Kick

Oct 12, 2007, 5:13:57 AM
Having encountered a number of haters of C++ template programming in
general and Boost libraries in particular, I decided to start
experimenting with a few code snippets. A particular target of ire is
the subject of writing for loops by hand versus using algorithms. Those
who are opposed to the use of algorithms and function objects
frequently speak of the "unnecessary complexity and overhead" of using
algorithms and having to write functions or function objects
out-of-line. I had mentioned that Boost's lambda library allowed for
writing in-line function objects but the general thought was that there
would be too much overhead with something like this. So I thought I
would experiment with a few microbenchmarks. Pulled out of the air, for
no very good reason (and I know that this alone often decides whether a
benchmark is even worth considering, but once more unto the breach, as
it were), all variations used the following:

class Foo {
    int bar_;
    double baz_;
public:
    Foo(int const bar, double const baz) : bar_(bar), baz_(baz) { }
    int bar() const { return this->bar_; }
    double baz() const { return this->baz_; }
};

typedef boost::uniform_int<> uniform_int_type;
typedef boost::variate_generator<boost::mt19937&, uniform_int_type>
    generator_type;

class Make_foo {
    generator_type numerator_;
    generator_type denominator_;
public:
    Make_foo(generator_type const& numerator,
             generator_type const& denominator)
        : numerator_(numerator), denominator_(denominator)
    { }
    Foo operator () ()
    { return Foo(numerator_(),
                 (double(numerator_()) / double(denominator_()))); }
};

std::size_t const useless_microbenchmark_data_size = 1000000;

int main()
{
    boost::mt19937 rng;
    uniform_int_type zero_to_ninety_nine(0, 99);
    uniform_int_type one_to_one_hundred(1, 100);
    generator_type numerator(rng, zero_to_ninety_nine);
    generator_type denominator(rng, one_to_one_hundred);
    Make_foo make_foo(numerator, denominator);

    volatile ACE_Time_Value start_time =
        ACE_High_Res_Timer::gettimeofday_hr();

    // ... various for, for_each, etc. loops go here ...

    volatile ACE_Time_Value end_time =
        ACE_High_Res_Timer::gettimeofday_hr();
    volatile ACE_Time_Value duration =
        const_cast<ACE_Time_Value const&>(end_time) -
        const_cast<ACE_Time_Value const&>(start_time);

    std::cout << "duration.sec() = "
              << const_cast<ACE_Time_Value const&>(duration).sec()
              << "\nduration.usec() = "
              << const_cast<ACE_Time_Value const&>(duration).usec()
              << '\n';
    std::cout << "qux.size() = " << qux.size() << '\n';
    std::cout << "value = " << value << '\n';
}

Why Foo? Just so that I'd have an excuse to write a lambda expression
that uses baz but ignores bar. Why make baz a double? <shrug> Because
double is a type frequently used by those with whom I've been
conversing. Why not? Why use the Boost random number library? Just
because. Why use ACE_Time_Value? It was just handy. I printed out the
value to show that the same pseudo-random numbers were summing to the
same value, i.e. to show that the arithmetic worked the same in all
cases. Not an interesting observation, I know.

The "simple" (duff.cc) variation was:

std::vector<Foo> qux;
std::size_t const qux_size = useless_microbenchmark_data_size;
qux.reserve(qux_size);
for (std::size_t i = 0; i < qux_size; ++i) {
    qux.push_back(make_foo());
}
double value = 0.0;
for (std::vector<Foo>::const_iterator pos = qux.begin();
     pos != qux.end(); ++pos)
{
    value += pos->baz();
}

The "for_each" (fudd.cc) variation was:

std::vector<Foo> qux;
std::size_t const qux_size = useless_microbenchmark_data_size;
qux.reserve(qux_size);
std::generate_n(std::back_inserter(qux), qux_size, make_foo);
double value = 0.0;
// bind and _1 are Boost.Lambda's (e.g. via using namespace boost::lambda);
// BLL stores the left operand of += by reference, so value accumulates.
std::for_each(qux.begin(), qux.end(), value += bind(&Foo::baz, _1));
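
For reference, here is roughly what that Boost.Lambda expression
denotes: a function object holding a reference to value (an
illustrative sketch, not BLL's actual generated type).

struct Add_baz_to {
    double& value_;
    explicit Add_baz_to(double& value) : value_(value) { }
    void operator()(Foo const& foo) const { value_ += foo.baz(); }
};

// Equivalent call, with the lifting done by hand:
// std::for_each(qux.begin(), qux.end(), Add_baz_to(value));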

I did this with the following:

Damien-Kicks-Computer:~/tmp dkick$ g++ --version
powerpc-apple-darwin8-g++-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build
5341)
[...]

Without optimization, the for loop performed better. I had expected
this. I was very interested to see that the for_each version seemed to
perform better than the for loop version when optimization was -O3. I
had expected it to be about identical with the for loop version.

Damien-Kicks-Computer:~/tmp dkick$ g++ -I $ACE_ROOT -I /sw/include
duff.cc -o duff -lACE
Damien-Kicks-Computer:~/tmp dkick$ g++ -I $ACE_ROOT -I /sw/include
fudd.cc -o fudd -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 9
duration.usec() = 372856
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 9
duration.usec() = 394740
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 9
duration.usec() = 348901
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 10
duration.usec() = 245389
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 10
duration.usec() = 199316
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 11
duration.usec() = 788393
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O1 -I $ACE_ROOT -I /sw/include
duff.cc -o duff -lACE
Damien-Kicks-Computer:~/tmp dkick$ g++ -O1 -I $ACE_ROOT -I /sw/include
fudd.cc -o fudd -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 865765
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 822372
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 869480
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 823178
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 864894
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 867775
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O2 -I $ACE_ROOT -I /sw/include
duff.cc -o duff -lACE
Damien-Kicks-Computer:~/tmp dkick$ g++ -O2 -I $ACE_ROOT -I /sw/include
fudd.cc -o fudd -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 846027
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 796215
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 792571
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 876464
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 867439
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 843071
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O3 -I $ACE_ROOT -I /sw/include
duff.cc -o duff -lACE
Damien-Kicks-Computer:~/tmp dkick$ g++ -O3 -I $ACE_ROOT -I /sw/include
fudd.cc -o fudd -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 686661
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 674884
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./duff
duration.sec() = 1
duration.usec() = 719897
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 566605
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 596694
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./fudd
duration.sec() = 1
duration.usec() = 598898
qux.size() = 1000000
value = 2.57784e+06

And then it occurred to me to try a variation with "accumulate"
(algorithms.cc):

std::vector<Foo> qux;
std::size_t const qux_size = useless_microbenchmark_data_size;
qux.reserve(qux_size);
std::generate_n(std::back_inserter(qux), qux_size, make_foo);
double const value =
    std::accumulate(qux.begin(), qux.end(), double(0),
                    _1 + bind(&Foo::baz, _2));
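
For comparison, an illustrative hand-lifted equivalent of the lambda
expression _1 + bind(&Foo::baz, _2), just to show what the library is
writing for us (a sketch, not the library's actual generated type):

struct Plus_baz {
    double operator()(double const acc, Foo const& foo) const
    { return acc + foo.baz(); }
};

// double const value =
//     std::accumulate(qux.begin(), qux.end(), double(0), Plus_baz());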

I was interested to see that things seemed to get worse at -O2 but that
with -O3, it seemed to be the best performing variation.

Damien-Kicks-Computer:~/tmp dkick$ g++ -I $ACE_ROOT -I /sw/include
algorithms.cc -o algorithms -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 10
duration.usec() = 241283
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 10
duration.usec() = 236267
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 10
duration.usec() = 255965
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O1 -I $ACE_ROOT -I /sw/include
algorithms.cc -o algorithms -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 940784
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 929024
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 905576
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O2 -I $ACE_ROOT -I /sw/include
algorithms.cc -o algorithms -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 2
duration.usec() = 201833
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 939748
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 2
duration.usec() = 276619
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ g++ -O3 -I $ACE_ROOT -I /sw/include
algorithms.cc -o algorithms -lACE
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 524692
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 587146
qux.size() = 1000000
value = 2.57784e+06
Damien-Kicks-Computer:~/tmp dkick$ ./algorithms
duration.sec() = 1
duration.usec() = 576647
qux.size() = 1000000
value = 2.57784e+06

So what I'm wondering is ... just how flawed is my methodology, code,
etc.? Are these numbers of any use, even within the already flawed
realm of microbenchmarks?

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

skaller

Oct 14, 2007, 12:39:07 PM
On Fri, 12 Oct 2007 03:13:57 -0600, Damien Kick wrote:

> Having encountered a number of haters of C++ template programming in
> general and Boost libraries in particular, I decided to start
> experimenting with a few code snippets. A particular target of ire is
> the subject of writing for loops by hand versus using algorithms. Those
> who are opposed to the use of algorithms and function objects
> frequently speak of the "unnecessary complexity and overhead" of using
> algorithms and having to write functions or function objects
> out-of-line. I had mentioned that Boost's lambda library allowed for
> writing in-line function objects but the general thought was that there
> would be too much overhead with something like this. So I thought I
> would experiment with a few microbenchmarks.

It would be nice if you could summarize your results (it's just a pile
of numbers to me; I have no idea what your tentative conclusion is).

However I think there is a need to go back and examine the issues:
from my viewpoint the performance is secondary because the library
and/or compiler should be able to generate fast stuff from clean source.

The key concept here is "locality". When you can write associated
source close to each other, you have good locality. Structures
which provide good locality include nesting, anonymity, and
combinators (which are all related).

Nesting is easy to understand. For functions, neither C nor
C++ provides nesting, aka lexical scoping. D, Felix, ML, and
even Pascal provide nested functions, that is, functions
which when invoked bind to their enclosing context
dynamically. GCC also provides a C extension giving
lexically scoped functions.

However in C and C++, the bodies of while and for loops,
and indeed any block, *are* lexically scoped, which makes
them superior to the STL algorithms .. not because
of a deficiency in STL, but because of a deficiency in
the core C++ language.

Now, for any given nested function you can always make
a global function with extra arguments, and pass the
arguments. This is called 'lambda lifting' or just 'lifting'.

Having to do lifting by hand is an indication of a low
level programming language not suitable for human application
programming. This is housekeeping that the compiler can and
should do.
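
To make 'lifting' concrete, here is a minimal C++ sketch (the names
are purely illustrative): the context a nested function would have
captured implicitly must instead be passed explicitly and stored.

#include <algorithm>
#include <vector>

class Above_threshold {
    double threshold_;
public:
    explicit Above_threshold(double const threshold)
        : threshold_(threshold) { }
    bool operator()(double const x) const { return x > threshold_; }
};

long count_above(std::vector<double> const& xs, double const threshold)
{
    // The lifted function object is handed its context by hand.
    return std::count_if(xs.begin(), xs.end(), Above_threshold(threshold));
}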

An extreme form of localisation is when some function
is given anonymously. The lack of a name is a vital
indicator that the function is only used in one place.
A localised but named function is used in a restricted
scope; with un-named functions the scope is even more
restricted. Global functions with external linkage
have the worst possible localisation properties
and should be avoided at all costs unless they are
specifically and carefully designed for reuse.

So now we come to combinators. Template meta-programming is
a weak form of combinatorial encoding. Code such as the Lambda
library attempts to remove some of the problems of mandatory
lifting by automating the lifting. The key to the way this
works is to understand that the combination not only declares
but also *defines* a unique entity, and the C++ linker will
ensure a unique copy is retained in a program.

Unfortunately, yet another fault in the C++ core language prevents
Lambda, or any other combinator library in C++, from being more
than an interesting toy and a 'what would have been sweet' if
only the core language were better designed.

In particular, both data structure and functional combinators
must allow for recursion, and both these things can easily
be done with nominal (named) structures and functions in C.
However the generic versions cannot support recursion.

The problem for types is fairly easy to see: I want to write:

typedef list<T> = pair<T, list<T>*>;

but I cannot do so. This is trivial to do in C with a declaration,
but you just cannot do it with combinators.
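
For contrast, a minimal sketch of the nominal form, which does work
because a named template can refer to itself:

template <typename T>
struct List {
    T head;
    List* tail;  // recursion through the name is fine
};

// An alias/combinator has no name of its own to recurse through,
// which is why the typedef spelling above cannot be written.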

Unfortunately yet another design fault in core C++ ensures this problem
is elevated to functions. C++ allows syntax-based generics such that
applicative objects can be used as functions. (These are sometimes
incorrectly called functors.)

The problem is that C++ confuses classes with types, and the recursion
problem with types lifts straight into any generic meta-programming
technique that uses applicative objects: functional recursion
becomes a type recursion, and type recursion cannot be combinatorial.

The problem does not exist for functions as such, nor for
generic (templated) functions .. the problem is that such functions
have no way to bind to a dynamic context, whereas classes provide
non-static members which can be used to store such context
in the object constructor, and retain it until the apply method
(usually operator()()) is called.

So a summary of all this is that C++ fails to provide the
technology actually required for good programming. Locality is
preserved by using old fashioned for/while loops and blocks,
but these cannot be parametrised, recursive, or saved for later
evaluation.

Applicative objects can be parametrised, recursive, and saved,
but they have to be manually *lifted* to obtain these properties,
thereby losing the critical localisation property which is required
not only for human comprehension .. but also machine reasoning.
In particular, optimisation opportunities are lost because the
context of use is too large, and even possibly unknown.

It is a pity I think: STL is one of the best standard libraries
around, but the core language it is designed for doesn't provide
the facilities actually required to use it.

--
John Skaller <skaller at users dot sf dot net>
Try Felix, the successor to C++ http://felix.sf.net

Damien Kick

Oct 15, 2007, 5:53:09 AM
skaller wrote:
> On Fri, 12 Oct 2007 03:13:57 -0600, Damien Kick wrote:
>
>> Having encountered a number of haters of C++ template programming in
>> general and Boost libraries in particular, I decided to start
>> experimenting with a few code snippets. A particular target of ire is
>> the subject of writing for loops by hand versus using algorithms. Those
>> who are opposed to the use of algorithms and function objects
>> frequently speak of the "unnecessary complexity and overhead" of using
>> algorithms and having to write functions or function objects
>> out-of-line. I had mentioned that Boost's lambda library allowed for
>> writing in-line function objects but the general thought was that there
>> would be too much overhead with something like this. So I thought I
>> would experiment with a few microbenchmarks.
>
> It would be nice if you could summarize your results (it's just a pile
> of numbers to me; I have no idea what your tentative conclusion is).

Fair enough. I can see how you could've missed the following in my
previous post, as it was buried in the middle of a bunch of numbers.

<blockquote>

Without optimization, the for loop performed better. I had expected
this. I was very interested to see that the for_each version seemed to
perform better than the for loop version when optimization was -O3. I
had expected it to be about identical with the for loop version.

</blockquote>

And then later I showed numbers with a version based on std::accumulate
performing even better than std::for_each. But to try to make this a
bit clearer, here are average execution times, in microseconds, for
executing each version 10 and 100 times.

UL-USER> (let ((*number-of-exec-results* 10))
(use optimize-3
!'(/Users/dkick/tmp/for-loop-O3)
!'(/Users/dkick/tmp/for-each-lambda-O3)
!'(/Users/dkick/tmp/accumulate-lambda-O3)))
(:O3
(:FOR-LOOP 1702227 :FOR-EACH-LAMBDA 1576749 :ACCUMULATE-LAMBDA
1562228))
UL-USER> (let ((*number-of-exec-results* 100))
(use optimize-3
!'(/Users/dkick/tmp/for-loop-O3)
!'(/Users/dkick/tmp/for-each-lambda-O3)
!'(/Users/dkick/tmp/accumulate-lambda-O3)))
(:O3
(:FOR-LOOP 1677765 :FOR-EACH-LAMBDA 1546606 :ACCUMULATE-LAMBDA
1542550))
UL-USER>

At "-O3", the std::accumulate is the fastest, std::for_each 2nd, and the
for loop is the slowest. I'm still a little surprised that for these
three different ways to write such a simple iteration, that the versions
which use lambda seem to be consistently out performing the "simple" for
loop.

> However I think there is a need to go back and examine the issues:
> from my viewpoint the performance is secondary because the library
> and/or compiler should be able to generate fast stuff from clean source.

Foreigner: How do I get to FooBarKhan from here?
Local: Oh, I wouldn't start from here.

I am still personally quite persuaded by Scott Meyers's item on
preferring algorithms to hand-written loops, for example. However, They
have read it, too, but They are not convinced. They seem to think that
while this all might be nice "in theory", the additional "complexity"
and overhead simply relegate things like the Boost Lambda library to a
silly way to write "simple" loops with lots of complicated template
programming. In my experience, when They start complaining about
something being "too complicated", the easiest way to convince Them that
what They perceive to be complication is worthwhile is to be able to
point out that it is faster. It takes the argument out of the realm of
subjectivity and at least gives some quantitative analysis to go with
the qualitative analysis. Now, admittedly, the difference in run-time
performance between these three is small enough that I can easily
imagine that They will simply wave it away. Not to mention that
microbenchmarks generally tend to be sorta fubar from the word go.

FWIW, here is the code I wrote to run the executables and compute the
averages. I haven't run through the full results of
FOR-LOOPS-AND-LAMBDA yet, though. (Note to mods. If this is too much
!C++, please feel free to just clip the following code.)

(in-package :ul-user)

(defconstant +second-as-microseconds+ (expt 10 6))

(defun exec-result-time-as-microseconds (string)
(register-groups-bind (seconds microseconds)
('(:sequence :single-line-mode-p
:start-anchor
"duration.sec() = "
(:register
(:greedy-repetition 1 nil :digit-class))
#\Newline
"duration.usec() = "
(:register
(:greedy-repetition 1 nil :digit-class)))
string)
(+ (* (parse-integer seconds) +second-as-microseconds+)
(parse-integer microseconds))))

(defvar *number-of-exec-results* (expt 10 4))

(defun average-time-as-microseconds
(exec &optional (number-of-exec-results *number-of-exec-results*))
(use truncate
(/ (loop repeat number-of-exec-results
sum (exec-result-time-as-microseconds (sh exec)))
number-of-exec-results)))

(defun for-loops-and-lambda ()
(use for-loops-and-lambda*
!'(/Users/dkick/tmp/for-loop-no-O)
!'(/Users/dkick/tmp/for-each-lambda-no-O)
!'(/Users/dkick/tmp/accumulate-lambda-no-O)
!'(/Users/dkick/tmp/for-loop-O1)
!'(/Users/dkick/tmp/for-each-lambda-O1)
!'(/Users/dkick/tmp/accumulate-lambda-O1)
!'(/Users/dkick/tmp/for-loop-O2)
!'(/Users/dkick/tmp/for-each-lambda-O2)
!'(/Users/dkick/tmp/accumulate-lambda-O2)
!'(/Users/dkick/tmp/for-loop-O3)
!'(/Users/dkick/tmp/for-each-lambda-O3)
!'(/Users/dkick/tmp/accumulate-lambda-O3)))

(defun for-loops-and-lambda*
(for-loop-no-O for-each-lambda-no-O accumulate-lambda-no-O
for-loop-O1 for-each-lambda-O1 accumulate-lambda-O1
for-loop-O2 for-each-lambda-O2 accumulate-lambda-O2
for-loop-O3 for-each-lambda-O3 accumulate-lambda-O3)
(list (use no-optimization
for-loop-no-O for-each-lambda-no-O accumulate-lambda-no-O)
(use optimize-1
for-loop-O1 for-each-lambda-O1 accumulate-lambda-O1)
(use optimize-2
for-loop-O2 for-each-lambda-O2 accumulate-lambda-O2)
(use optimize-3
for-loop-O3 for-each-lambda-O3 accumulate-lambda-O3)))

(defun no-optimization (for-loop for-each-lambda accumulate-lambda)
(skeleton :no-O for-loop for-each-lambda accumulate-lambda))

(defun optimize-1 (for-loop for-each-lambda accumulate-lambda)
(skeleton :O1 for-loop for-each-lambda accumulate-lambda))

(defun optimize-2 (for-loop for-each-lambda accumulate-lambda)
(skeleton :O2 for-loop for-each-lambda accumulate-lambda))

(defun optimize-3 (for-loop for-each-lambda accumulate-lambda)
(skeleton :O3 for-loop for-each-lambda accumulate-lambda))

(defun skeleton (label for-loop for-each-lambda accumulate-lambda)
(flet ((lemma (exec) (average-time-as-microseconds exec)))
(list label (list :for-loop (lemma for-loop)
:for-each-lambda (lemma for-each-lambda)
:accumulate-lambda (lemma accumulate-lambda)))))

--

skaller

Oct 15, 2007, 4:00:17 PM
On Mon, 15 Oct 2007 03:53:09 -0600, Damien Kick wrote:

> skaller wrote:

> At "-O3", the std::accumulate is the fastest, std::for_each 2nd, and the
> for loop is the slowest. I'm still a little surprised that for these
> three different ways to write such a simple iteration, that the versions
> which use lambda seem to be consistently out performing the "simple" for
> loop.

I'm not quite so surprised. I need to explain that: this generated
C++ code for calculating Ackermann's function:

int FLX_REGPARM ack(int x, int y) {
start_5719:;
    ifnot(x == 0 == 1) goto _5714;
    return y + 1;
_5714:;
    ifnot(y == 0 == 1) goto _5715;
    y = 1;
    x = x - 1;
    goto start_5719;
_5715:;
    y = ack(x, y - 1);
    x = x - 1;
    goto start_5719;
}

is TWICE as fast as this C code:

int Ack(int M, int N) {
    if (M == 0) return N + 1;
    else if (N == 0) return Ack(M - 1, 1);
    else return Ack(M - 1, Ack(M, N - 1));
}

Do you know why? I have no idea! Both compiled with gcc 4.1 with -O3
on amd64.

I also had C++ code generated by Felix for Shootout nbody where
small differences in array indexing in what appeared to be a
floating point performance test made my code FOUR times slower
than C. (It isn't now .. now it is faster!).

I will be looking at loops vs STL algorithms shortly, mainly with
the effect of using OpenMP with gcc 4.2, since libstdc++ has
a number of parallelised STL algorithms in it. I'm curious to
know how effective that is.

>> However I think there is a need to go back and examine the issues: from
>> my viewpoint the performance is secondary because the library and/or
>> compiler should be able to generate fast stuff from clean source.
>
> Foreigner: How do I get to FooBarKhan from here?
> Local: Oh, I wouldn't start from here.

Point taken, but see above: microbenchmarks are notoriously sensitive
to trivial details. It's particularly counter-intuitive to find a
template solution is faster than the plain loop it undoubtedly reduces
to anyhow.

> I am still personally quite persuaded by Scott Meyers's item on
> preferring algorithms to hand-written loops, for example. However,
> They have read it, too, but They are not convinced.

IMHO, it isn't really a matter of preferring one or the other,
but on making a choice based on tradeoffs.

For example if you have an applicative object ready prepared,
OR the calculation is complex and can be reused, the algorithm
appears a better choice, otherwise localisation issues suggest
the loop is a better choice.

> something being "too complicated", the easiest way to convince Them that
> what They perceive to be complication is worthwhile is to be able to
> point out that it is faster.

I agree it certainly helps to point at a performance advantage .. but ..

> Now, admittedly, the difference in run-time
> performance between these three is small enough that I can easily
> imagine that They will simply wave it away.

.. the point is that the 'more complicated' way isn't necessarily
slower. Indeed you show here it can be faster.

> Not to mention that
> microbenchmarks generally tend to be sorta fubar from the word go.

Yes, indeed, but they serve a good purpose here, if only to
show more research and experiment is needed.

And again you should note it is not the STL algorithms that are
the problem. They actually provide 'stock standard' functional
programming stuff like maps (called transform in STL) and folds
(called accumulate in STL). Such combinatorial programming is
generally good, and often faster than hand-written code --
the problem here is simply that C++ doesn't have the core language
capability to service the library.

I'm sure Boost::Lambda helps, but you'd never write a complex
algorithm that way. Compare with Felix

int k = 2;
int sum = accumulate
    (fun (acc:int) (x:int) : int => acc + k * x)
    0
    lst
;

to really understand the difference. That 'fun' thing in the middle
GENERATES a C++ applicative object like:

struct X {
    int k;
    X(int k_) : k(k_) {}
    int operator()(int acc, int x) const {
        return acc + k * x;
    }
};

for you and passes it to the accumulate. Templates cannot
compete with this kind of code generator, where the 'functional'
value is more or less arbitrary. In theory "Lambda" (with recursion
it doesn't have) could do this, but the notation is unreadable.

In effect Felix translates a sane notation, namely the main
language, into the right form (i.e. applicative object here).
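
For reference, a sketch of how such a hand-lifted object is then used
from C++ (sum_scaled is an illustrative name, not something from
Felix):

#include <numeric>
#include <vector>

int sum_scaled(std::vector<int> const& lst, int const k)
{
    // The context k is lifted into the object by its constructor;
    // std::accumulate then folds over the list with it.
    return std::accumulate(lst.begin(), lst.end(), 0, X(k));
}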

FYI, in the 'old days' I was asked to write a book on STL.
After some months of work setting it up, I wanted to show
how to rewrite a triple loop using STL algorithms.

Pages of code and days later I quit the book .. AND C++.
There's no way a triple loop with a 1 line body should be
so complicated. STL algorithms just aren't useful enough in C++
to bother given the real power comes from higher order
functions which in C++ are very hard to construct.

The amount of work is VERY considerable: I can easily
prove that .. take a look at the ~200,000 lines of Ocaml
code in the Felix compiler, which tries to build minimal
and optimal applicative objects to ensure good performance
whilst preserving locality. In general this cannot be done
with C++ templates, no matter how complex you make them,
due to the lack of type recursion**. Felix originally did
generate templates but I soon gave them up in favour of
doing my own instantiation.

** Gabriel Dos Reis showed a way to solve this in general
using open recursion and a specialisation to fixate it.
However the code to generate this solution is much more
complicated even in a high level language like Ocaml,
than instantiating polymorphic entities directly.

For the same reason, hand written specialised C++ code such
as loop bodies have a significant advantage over templates
because templates simply don't have the right combinators
in a convenient form.

The effect is you'd probably use STL for sorting, since sorting
is a complex algorithm, but the simpler accumulate and copy
just aren't worth using from application top level code.
They're useful to use from inside other templates though.


--
John Skaller <skaller at users dot sf dot net>
Try Felix, the successor to C++ http://felix.sf.net

Mathias Gaunard

Oct 16, 2007, 1:14:47 AM
On Oct 15, 10:00 pm, skaller <skal...@users.sourceforge.net> wrote:

> I'm sure Boost::Lambda helps, but you'd never write a complex
> algorithm that way. Compare with Felix
>
> int k = 2;
> int sum = accumulate
> (fun (acc:int) (x:int) : int => acc + k * x)
> 0
> lst
> ;
>
> to really understand the difference. That 'fun' thing in the middle
> GENERATES a C++ applicative object like:
>
> struct X {
>     int k;
>     X(int k_) : k(k_) {}
>     int operator()(int acc, int x) const {
>         return acc + k * x;
>     }
> };

See the C++0x lambdas which do exactly that, but with a C++ syntax.
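
For instance, with the proposed syntax the Felix fold above comes out
roughly as follows (sum_scaled is an illustrative name; the exact
syntax is still being standardised):

#include <numeric>
#include <vector>

int sum_scaled(std::vector<int> const& lst, int k)
{
    // The compiler generates the closure object (capturing k) for you.
    return std::accumulate(lst.begin(), lst.end(), 0,
                           [k](int acc, int x) { return acc + k * x; });
}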

As for boost::lambda, it's not particularly unreadable. The main issue
is that "." cannot be overloaded, so you've got to use bind, which is
very verbose.


--

skaller

Oct 16, 2007, 8:42:53 AM
On Mon, 15 Oct 2007 23:14:47 -0600, Mathias Gaunard wrote:

> On Oct 15, 10:00 pm, skaller <skal...@users.sourceforge.net> wrote:
>
>> I'm sure Boost::Lambda helps, but you'd never write a complex algorithm
>> that way. Compare with Felix
>>
>> int k = 2;
>> int sum = accumulate
>> (fun (acc:int) (x:int) : int => acc + k * x) 0
>> lst
>> ;
>>
>> to really understand the difference. That 'fun' thing in the middle
>> GENERATES a C++ applicative object like:
>>
>> struct X {
>>     int k;
>>     X(int k_) : k(k_) {}
>>     int operator()(int acc, int x) const {
>>         return acc + k * x;
>>     }
>> };
>
> See the C++0x lambdas which do exactly that, but with a C++ syntax.

From what I have seen, they do half of what is needed.
[If you have a link to the paper I'll read it again?]

> As for boost::lambda, it's not particularly unreadable. The main issue
> is that "." cannot be overloaded, so you've got to use bind, which is
> very verbose.

That's the main issue for an implementation that is so primitive
it hardly works at all .. but it isn't the real main issue,
which you won't discover until secondary issues are resolved.

Roughly speaking, tree forms are relatively easy, eg you can
make combinators for simple operators such as

+ - * /

which are really all just applications:

f (x)

although the latter is a bit nasty in a language without tuples
or currying. However when you come to conditionals, it starts
getting much harder: the C syntax

c ? t : f

is a little tricky because the branches must be lazily evaluated.
A more general switch can be done for a fixed number N of arguments
(or slightly better with some template meta-programming hacks),
but the problem is that the combinatorial form no longer really looks
like real code. What you really need is at least an
ML style match and THAT requires algebraic type constructors,
not merely templates which do the same constructions.
It also breaks in C++ at the moment due to the lack of recursion
and garbage collection.

Full combinatorial programming is basically an implementation
of category theory. Most functional programming languages use
the semantically equivalent applicative style instead. This
requires proper lambda abstraction in the core language.
It's also insane without the ML let/in scoping construction,
and very, very hard without pattern matching (see the utterly
disgusting mess called Scheme, which uses caaddr etc. to
unpack a list).

Felix has pattern matching and a light form of let/in,
but is basically block structured like C/C++ and it makes the
code quite a bit harder to write.

Categorical combinators may be better for C++, because composition
is associative which fits well with the reverse polish operator style
used for classes (x.f().g().h() ... reverse polish)

However to really make it work, it HAS to support
a fixpoint operator (recursion) or you can't actually encode
anything very useful.

In the long run it cannot work for the simple reason it is
well known that functional constructions cannot be properly
implemented without a garbage collector. But the big obstacle
is really syntax.. quite a bit of sugar is needed to make
the combinatorial stuff acceptable, and that needs a
real translator: either a code generator like Felix or
a lot more C++ core language support. Apart from garbage
collection Felix doesn't do anything that couldn't be put
in C++ with syntax changes .. since it is basically
just a well-typed C++ macro processor.

BTW: the main difference between applicative and categorical
style is that the former has variables whilst the latter
doesn't. For example the application

int y = f (x, x);
int z = f (y, y);

is using variables x and y, effectively let/in bindings,
to decompose the actual expression:

int z = f (make_pair (f (make_pair (x, x)),
                      f (make_pair (x, x))));

The categorical (point free, variable free) form is

int z = x . make_pair . f . make_pair . f

Although this looks simpler it isn't when the
calculation gets complex and would use a few variables
in different places, because the input of each function
has to contain all the information for the next stage,
even if the function itself doesn't use it. So you need
higher order combinators to construct such functions,
and there are quite a few of them. Contrast to applicative
style where let/in variable bindings and uses of the
variables can do all these categorical manipulations.

So most FPLs use applicative style with variables,
but templates are purely categorical. The new C++0x lambdas
will help in some cases but they are limited in how
they can be used (if they were complete, garbage collection
would be mandatory, which it isn't, so they can't be :)


--
John Skaller <skaller at users dot sf dot net>
Try Felix, the successor to C++ http://felix.sf.net

skaller

Oct 16, 2007, 11:13:42 AM
On Tue, 16 Oct 2007 06:42:53 -0600, skaller wrote:

>> See the C++0x lambdas which do exactly that, but with a C++ syntax.
>
> From what I have seen, they do half of what is needed.
> [If you have a link to the paper I'll read it again?]

OK, I just read it. Seems good, except for one minor problem ..
no named lambdas .. hmm..

Damien Kick

Oct 18, 2007, 3:01:48 PM
skaller wrote:
> On Mon, 15 Oct 2007 03:53:09 -0600, Damien Kick wrote:
>
>> At "-O3", the std::accumulate is the fastest, std::for_each 2nd, and the
>> for loop is the slowest. I'm still a little surprised that for these
>> three different ways to write such a simple iteration, that the versions
>> which use lambda seem to be consistently out performing the "simple" for
>> loop.

The results mentioned above were from using GCC 4.0.1 on Mac OS 10.4
(PPC). Unfortunately, with GCC 4.1 and Fedora Core 6 GNU/Linux (Intel),
I am getting the following.

* (let ((*number-of-exec-results* 10)) (for-loops-and-lambda))

((:NO-O
(:FOR-LOOP 2000477 :FOR-EACH-LAMBDA 2183352 :ACCUMULATE-LAMBDA 2022219))
(:O1 (:FOR-LOOP 876098 :FOR-EACH-LAMBDA 993034 :ACCUMULATE-LAMBDA 899092))
(:O2 (:FOR-LOOP 888597 :FOR-EACH-LAMBDA 901585 :ACCUMULATE-LAMBDA 846097))
(:O3 (:FOR-LOOP 869545 :FOR-EACH-LAMBDA 898671 :ACCUMULATE-LAMBDA 886614)))
* (let ((*number-of-exec-results* 100)) (for-loops-and-lambda))

((:NO-O
(:FOR-LOOP 1991081 :FOR-EACH-LAMBDA 2195028 :ACCUMULATE-LAMBDA 2042294))
(:O1 (:FOR-LOOP 880715 :FOR-EACH-LAMBDA 994338 :ACCUMULATE-LAMBDA 898823))
(:O2 (:FOR-LOOP 895697 :FOR-EACH-LAMBDA 913588 :ACCUMULATE-LAMBDA 856782))
(:O3 (:FOR-LOOP 874672 :FOR-EACH-LAMBDA 899297 :ACCUMULATE-LAMBDA 893823)))
*

Which is a bit worse than I would've expected from the "so fancy I bet
it drinks expensive wine and eats stinky cheese" solutions. <pause> Of
course, accumulate-lambda is a bit faster at -O2. I only just noticed
that. Unfortunately for my attempt at getting my commission from
selling Boost to yet another sucker, They are not using GCC 4.0.1 and
Mac OS 10.4 (PPC). I suppose that I should consider microbenchmarking
versions with "hand crafted" function objects, too. Oh, but I'm also
just noticing that the -O2 version of accumulate-lambda is faster than
the -O3 version of the "hand crafted" for loop. Interesting.

>> something being "too complicated", the easiest way to convince Them that
>> what They perceive to be complication is worthwhile is to be able to
>> point out that it is faster.
>
> I agree it certainly helps to point at a performance advantage .. but ..

> .. the point is that the 'more complicated' way isn't necessarily
> slower. Indeed you show here it can be faster.

Oh, you don't know Them very well, do you? Even if They were using the
warez that consistently gave the "fancy, look at you with your college
education computer science degree" versions the slight performance
advantage, They would probably only pause for a moment to consider the
benefits of faster before Their immune systems started fighting off the
new antibodies. But at least slightly faster would've perhaps given
Them a slight fever or perhaps even the sniffles. Not a full-blown case
of Boostitis but a little <pause> tickle perhaps. Any argument based on
anything subjective will be either thoroughly ignored or dragged down
into the depths of a Turing Tar Pit.

--

Mathias Gaunard

Oct 20, 2007, 10:37:56 AM
On Oct 16, 2:42 pm, skaller <skal...@users.sourceforge.net> wrote:
> However when you come to conditionals, it starts
> getting much harder: the C syntax
>
> c ? t : f
>
> What you really need is at least an
> ML style match and THAT requires algebraic type constructors,
> not merely templates which do the same constructions.

An ML-style match can be done, but using types instead of
constructors, which are basically the same thing.
See Boost.Variant, and possibly that thread [1].

However I don't really see why you would need this to put conditions
in an expression.
Boost.Lambda and co. provide all kinds of statements like this, with
syntax similar to the real thing:
if_(condition)[then_code].else_[else_code]

[1]
http://groups.google.com/group/comp.std.c++/browse_frm/thread/85acdd712da73b58
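
For instance, a minimal sketch using Boost.Lambda's lazy conditional
(clamp_to_zero is an illustrative name; both branches are lambda
expressions, so they are evaluated lazily, per element):

#include <algorithm>
#include <vector>
#include <boost/lambda/lambda.hpp>
#include <boost/lambda/if.hpp>

using namespace boost::lambda;

// Clamp negative elements to zero in place.
void clamp_to_zero(std::vector<int>& v)
{
    // if_then_else_return is BLL's lazy equivalent of ?:
    std::transform(v.begin(), v.end(), v.begin(),
                   if_then_else_return(_1 < 0, 0, _1));
}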


> It also breaks in C++ at the moment due to the lack of recursion
> and garbage collection.

I don't understand what you mean by recursion here.

I also don't see why you would need garbage collection, and why you
cite it here. One key point of C++ programming is that you must think
about who owns the resources, and that is just more useful,
deterministic and robust than using a GC everywhere, which is really
only suited for shared ownership.
Sharing state is a rather bad thing to do, anyway. (Well, it's nice
for performance at least.)

> Full combinatorial programming is basically an implementation
> of category theory. Most functional programming languages use
> the semantically equivalent applicative style instead. This
> requires proper lambda abstraction in the core language.

C++ does have a form of lambdas: functors. However using them is quite
verbose, and reusing other variables from the scope is explicit.
That's why the lambda syntax was introduced: it's syntactic sugar to
generate the functors. It doesn't change their nature, though. They're
just any type which happens to be callable.


> It's also insane without the ML let/in scoping construction,

let/in isn't as nice as blocks for imperative programming, I think.

> In the long run it cannot work for the simple reason it is
> well known that functional constructions cannot be properly
> implemented without a garbage collector.

There is that adage, indeed, that says you need garbage collection for
closures. I personally believe it's not really needed. Having the
closure either own its context or not doesn't seem so problematic. And
sharing ownership, if really needed, can be done as a library, e.g.
shared_ptr.
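
A minimal sketch of that last point (Counter is an illustrative name):
a functor can share ownership of its captured state without any GC,
because the state lives exactly as long as some copy of the functor
does.

#include <boost/shared_ptr.hpp>

struct Counter {
    boost::shared_ptr<int> count_;
    Counter() : count_(new int(0)) { }
    // Copies of Counter share the same count; the int is freed
    // when the last copy goes away.
    int operator()() { return ++*count_; }
};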

By the way, I don't have much theoretical knowledge about category
theory and the other stuff you mentioned, so please excuse my simple
answers.


--
