Chris Vine <chris@cvine--nospam--.freeserve.co.uk> wrote:
> Using the C++ algorithms where relevant such as std::transform (map),
> std::accumulate (fold/reduce), std::adjacent_difference (differentiate),
> std::partial_sum (integrate), std::rotate, std::stable_partition and
> std::partial_sort makes code significantly easier to understand, and it
> generally annoys me when people try to do exactly the same by hand (and
> usually less well and obscurely). People often seem to do this through
> ignorance.
In the case of XORing the elements of two vectors into a third vector,
it's a matter of opinion which way is clearer and easier to read and
understand.
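
For comparison, here is a minimal sketch of the two styles side by side.
The function names are made up for illustration, and both versions
assume the two input vectors have the same size:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper names, for illustration only.

// Algorithm style: std::transform with the std::bit_xor function object.
std::vector<unsigned> xor_with_transform(const std::vector<unsigned>& a,
                                         const std::vector<unsigned>& b)
{
    assert(a.size() == b.size());   // assumption: equal sizes
    std::vector<unsigned> result(a.size());
    std::transform(a.begin(), a.end(), b.begin(), result.begin(),
                   std::bit_xor<unsigned>());
    return result;
}

// Explicit style: the "two-liner" for-loop.
std::vector<unsigned> xor_with_loop(const std::vector<unsigned>& a,
                                    const std::vector<unsigned>& b)
{
    assert(a.size() == b.size());   // assumption: equal sizes
    std::vector<unsigned> result(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        result[i] = a[i] ^ b[i];
    return result;
}
```

Both produce the same result; which one reads better is the matter of
opinion in question.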
One could argue that most programmers are more accustomed to seeing and
understanding explicit for-loops, especially those that are a two-liner
(like in this case), than a more unusual call to a standard library
function (that, at a very quick glance, doesn't even look like a loop
at all, until you start reading the details). Of course this is more a
matter of what the programmer is accustomed to.
The original poster might not have had execution speed in mind when he
asked the question. He did use the word "efficient", but I get the
feeling he meant something other than execution speed, maybe "efficient
in terms of amount of source code" or "efficient in terms of
readability", or something along those lines. However, a "manual"
implementation might allow for some low-level optimizations that are in
no way guaranteed to be done by std::transform.
Assume that this particular task needs to be done as efficiently as the
hardware can possibly do it, squeezing out every possible clock cycle and
achieving as much throughput as is available (while still keeping it fully
standard C++).
One method that even many experienced C++ programmers aren't aware of,
and haven't studied, is SIMD vectorization. Modern compilers can be
surprisingly good at generating vectorized SSE code (or whatever the
equivalent is on the target architecture, like NEON on ARM64), but in
many cases you need to "help" the compiler a bit.
For example, if you are doing a simple operation like this to a very
large number of ints, it can become much faster if you do it to 8 ints
at once in an inner loop. In other words, rather than just writing the
simple single loop, you write an outer loop that steps through the data
8 elements at a time and an inner loop that simply does the operation to
those 8 elements. There's a very high chance that the compiler will
optimize that inner loop into one or two SIMD instructions that do the
operation to all 8 elements at once (assuming 32-bit ints, a 128-bit SSE
register holds 4 of them and a 256-bit AVX2 register holds 8).
If the compiler was not already doing this with the simple loop,
the result may speed up by well over a factor of 2.
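
The blocked loop described above could be sketched roughly like this,
applied to the XOR case. The function name and the block width of 8 are
only illustrative, and nothing here guarantees vectorization: whether
the inner loop actually becomes SIMD instructions depends on the
compiler, the optimization flags and the target, so measure before
relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only. Writes a[i] ^ b[i] into
// result, processing the data in blocks of 8 elements.
void xor_blocked(const std::vector<unsigned>& a,
                 const std::vector<unsigned>& b,
                 std::vector<unsigned>& result)
{
    assert(a.size() == b.size());   // assumption: equal sizes
    result.resize(a.size());

    const unsigned* pa = a.data();
    const unsigned* pb = b.data();
    unsigned* pr = result.data();

    const std::size_t block = 8;    // taken from the text, not tuned
    const std::size_t n = a.size();
    const std::size_t blocked_end = n - n % block;

    std::size_t i = 0;
    for (; i < blocked_end; i += block)
    {
        // Fixed trip count: an easy candidate for the vectorizer.
        for (std::size_t j = 0; j < block; ++j)
            pr[i + j] = pa[i + j] ^ pb[i + j];
    }

    // Leftover elements when the size is not a multiple of the block.
    for (; i < n; ++i)
        pr[i] = pa[i] ^ pb[i];
}
```

The only way to know what the compiler made of it is to look at the
generated assembly, or to time it against the plain loop and against
std::transform with the same flags.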