I have, along with Joel Falcou, been looking a bit at how we can
integrate SIMD into iterator-based algorithms to accelerate them.
Unfortunately, there are a couple of issues.
The basic idea is to provide a range adaptor that adapts of range of
PODs, where N consecutive PODs are stored contiguously, into a range of
vectors (or "packs") of those PODs of size N.
Therefore, instead of:
std::vector<float> tab(20);
float sum = boost::accumulate(tab, 0);
we can write something like:
std::vector< float, simd::allocator<float> > tab(20);
float sum = simd::sum(
boost::accumulate(
simd::packed_range<4>(tab),
simd::zero_
)
);
Which does the same but using SSE instructions, adding four floats at a
time, and typically providing a 4x speed-up.
This can be applied to have the same result for any binary operation
which is commutative and associative (which unfortunately isn't the case
of many operations, but can still be useful).
Here, however, it works because 1) the memory is well aligned thanks to
the use of our custom allocator and 2) the size, 20, is a multiple of
the pack size, 4.
We'd be interested in providing a mechanism where we could also support
ranges of data aligned at an arbitrary address and of an arbitrary size.
The SSE would only be used in the middle, the borders being handled
element per element.
A first solution consists in padding the beginning and the end with zeroes.
Say you have 1 2 3 4 5 6 7
with only 3 being aligned well enough for SSE usage.
so we'd turn that into | 0 0 1 2 | 3 4 5 6 | 7 0 0 0 |.
However, it is very important that the code to deal with borders does
not affect the efficiency of the code in the middle. This is not
possible with the normal iterator/range design.
Our data is clearly cut into three segments: the beginning that is not
aligned, the data that is rightly aligned and sized, and the data at the
end that is not big enough to fill a whole pack.
Therefore, segmented iterators could provide us with a solution to do
this efficiently; but so could other solutions as well.
Wouldn't it be better to be able to use the mechanism so that you could
deal with the borders in an element-per-element way while you could deal
with the middle bit in a vector-per-vector way?
With C++ iterators, you pull data, but if we could use an iteration
paradigm where you push data instead, such as calling a function object
for each element, we could call an overloaded function that would be
able to accept both a single element and a pack.
Such a solution would also provide the speed benefits of segmented
iterators, and all high-level standard algorithms can be expressed with
this iteration paradigm just as well as with iterators.
I have already suggested such an alternative to segmented iterators
(where I had a few misconceptions about segmented iterators, I thought
they broke compatibility, but they don't, they still maintain the legacy
non-segmented interface) there:
<http://thread.gmane.org/gmane.comp.lib.boost.user/61636/focus=61647>,
with an example of how you could use that to iterate over a binary tree
there:
<http://www.boostpro.com/vault/index.php?action=downloadfile&filename=generator_iterator.zip&directory=>
Converting from those things to a pair of iterators is possible through
the use of a coroutine, as demonstrated in the example.
I would like to have feedback on the general idea of providing tools to
use SIMD with iterators and related algorithms, and ideas about whether
there is sufficient interest in the alternative to iterators for me to
develop them.
The intent would be to submit both the alternative to iterators and the
SIMD helpers to Boost.
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
+1!
> and ideas about whether there is sufficient interest in the
> alternative to iterators for me to develop them. The intent would
> be to submit both the alternative to iterators and the SIMD helpers
> to Boost.
Now that your misconceptions about segmented iterators are cleared up,
could you give a brief rundown of your alternative and its advantages?
If all I have to do is read your earlier post and ignore claims of
broken compatibility, just say so, but I don't want to waste time
trying to decipher it if it's substantially wrong.
Thanks!
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
by the way openCL claims to be portable
--
Pavel
P.S.
if you notice a grammar mistake or weird phrasing in my message
please point it out
Portable SSEx/Altivec code is rather trivial to make properly.
Not like I presented that to Boost'con already, so before saying
anythign else, yes it works.
>> and ideas about whether there is sufficient interest in the
>> alternative to iterators for me to develop them. The intent would
>> be to submit both the alternative to iterators and the SIMD helpers
>> to Boost.
>
> Now that your misconceptions about segmented iterators are cleared up,
> could you give a brief rundown of your alternative and its advantages?
> If all I have to do is read your earlier post and ignore claims of
> broken compatibility, just say so, but I don't want to waste time
> trying to decipher it if it's substantially wrong.
Here, I will only be talking of the full generator I described in that
other thread, not the step one.
It's push instead of pull, basically.
- It's a very simple, easy-to-grasp concept, it's basically just
allowing ranges to define their own for_each (I use an output iterator
in my example, but it's not really different from a function object)
- It is very easy to write an iteration primitive with it. See the
difference between binary_tree_iterator and binary_tree_generator. I
don't think a segmented iterator would be that easy to define for that
data structure.
- It is iteration governed by the data, rather than by the program that
wants to use the data. Therefore the iteration can have full knowledge
of the data layout and doesn't need to do useless operations at every
increment, since that information can be statically known from the place
the iteration is at, allowing it to be as efficient as possible. This is
also useful for when you have to wait for data, such as when you want to
iterate data coming from a network, which is typically done with a
reactive paradigm.
- It can be converted into a forward iterator through a "simple"
coroutine application. I still need to benchmark this to see how much of
a cost it is, but I'm no good at benchmarking.
Of course, there is the drawback that it makes the code using the
iteration primitive somewhat more complicated: but so do segmented
iterators.
There are also several other drawbacks, in particular tied to the fact
that it cannot be conceptually much more than a forward iterator.
There is also an extra advantage I hadn't considered that I thought of
when considering SIMD:
- It is possible to push different types of data and have them
statically dispatched through overloading. Heterogeneous structures with
a pull paradigm require a dynamic dispatch.
I was in particular thinking of using this latter advantage so that the
iteration could dispatch its contents to the algorithm
element-by-element at the borders and pack-by-pack for the rest.
Of course I didn't introduce this idea as SIMD my main concern, I just
saw a potential useful application there.
Now, I'm not saying we should embrace push instead of pull; I think both
approaches should be allowed to co-exist, and that we should probably
try to standardize concepts and allow interoperability with other range
and iterator algorithms and adapters.
But this raises several questions: How do we choose between the two? How
do we integrate the approaches?
OpenCL is a possible implementation, albeit we choose to call the
various SIMD instructions ourselves to have more control on the
toolchain and the end result.
OpenCL will however be our main backend for GPU targets which we will be
supporting in the future. Or maybe we will just target it through Clang
and LLVM.
NT2 (also know as the crazy frenchman library), upon which this effort
is based, only supports x86 (SSE, ..., SSE4, AVX) and PowerPC (AltiVec).
ARM (NEON, VFP) is being added.
An effort has been made in its design so that instructions could
register themselves for a particular (type, cardinal) pair, all of which
ranked according to a category to select the best candidate. It heavily
uses meta-programming, including rewritten bits of MPL to augment its
compile-time performance.
It also uses expression templates with Proto to detect certain patterns,
such as fused multiply-add, for which x86 is introducing new
instructions soon.
Therefore it is very tunable and extensible.
The bit I wanted to discuss here, however, is not the implementation,
but rather the interface that the library provides to the programmer.
We aim to provide an interface in modern C++ that integrates well with
the standard library and Boost in order to allow developers to make use
of SIMD in an easy and fairly high-level fashion, potentially using
meta-programming to write an algorithm with parametric register and
cache sizes.
OpenCL is not an interface that satisfies those criteria.
Have you seen the ct thing coming from Intel? I just learned about it last week.
http://software.intel.com/en-us/articles/intel-array-building-blocks/
It looks like a container library for vector processing in C++ with a JIT compilation runtime environment as part of the library. As a guy who appreciates compilers, langauges and libraries there is a lot to like there. This area is evolving very fast.
Regards,
Luke
Assuredly I missed some sort of presentation or a discussion elsewhere, but why is a segmented-iterator/SIMD-oriented-iterator (or ct) superior or worse than a well-tuned valarray? Obviously, valarray has a uniquely sparse interface providing (less than) the barebones needed for data-parallel programming; on top of which it uses a member-function strategy. However, it seems that "fixing" valarray might be a more robust start, rather than immediately striking out on a different strategy.
> Have you seen the ct thing coming from Intel? I just learned about it last week.
>
> http://software.intel.com/en-us/articles/intel-array-building-blocks/
I haven't had a chance to look at it myself, but I've heard this is
based on Rapidmind.
I would say NT2 is quite comparable to that product, albeit, since NT2
is a public research project that doesn't have the full backing of
Intel, it is probably inferior in some aspects, while still having good
original ideas too.
We are however only looking at submitting the SIMD part to Boost, rather
than the whole of NT2 that does also threads, MPI, matrices, vectors,
numerical stuff, etc.
I believe it is important to be able to isolate and decouple components
as much as possible; and I can indeed think of quite of few applications
where I would want to use SIMD but not the whole artillery of different
parallel technologies and making my application depend on a big runtime
that does a lot behind the scenes.
Being able to use it non-intrusively with my own types, or extend it as
needed, is a plus too.
So for now, for Boost, we're just looking at providing nice ways to help
the programmer organize its code around packs of data (also known as
vectors, but for various reasons we prefer to use a different name),
each pack being the size of a SIMD register, and ensuring that code can
then generate efficient SIMD code.
>
> It looks like a container library for vector processing in C++ with a JIT compilation runtime environment as part of the library. As a guy who appreciates compilers, langauges and libraries there is a lot to like there. This area is evolving very fast.
An interesting area to look into indeed.
We have a few compilation-related projects around NT2 (some using Proto,
some using an actual compiler) where we optimize array programming, but
none using a JIT.
NT2 does all its fine tuning (including for size of cache etc.) at
compile-time, so it cannot adapt as well for different CPUs without a
recompilation.
Implementing automatic vectorization in compilers is certainly
worthwhile. Other people are working on this, as this is not really our
area of expertise. We consider our approach complimentary; both
approaches benefit different applications.
To produce good vectorized code, you need to design your code in a way
that is vectorizable. By making the vectorization explicit, it becomes
clear what you need to do to make it work, and what the problems that
prevent vectorization are.
It can require concessions or reorganizing your code, something an
optimizer will not do.
Also, auto-vectorization technology is not at a level where it can
compete with explicit vectorization.
Even with an ideal setup where everything should be vectorizable, there
is a very large gap between using explicit vectorization or just
automatic vectorization, as can be demonstrated for example by this
benchmark:
<http://eigen.tuxfamily.org/index.php?title=Benchmark>
Even a "good" vectorizing C++ compiler is pretty bad, practically worthless, and while research continues apace, the practical implications of the state of the art hasn't really changed for decades. The simple fact is that most of the time the compiler just can't make enough useful assumptions about general C code to vectorize it safely. People write SSE code by hand in assembly (calling intrisics is writing assembly by hand, but more verbose) or accept the scalar code the compiler generates. Fortran is a different story, it is a language designed to make life easy for the compiler. The compiler can hardly fail to vectorize fortran code, but other than calling fortran libraries from C interfaces that doesn't help us much. If you just defined an object model that faithfully duplicates semantics of fortran and let people program that in C++ you could auto-vectorize it in the C++ compiler if it became part of the standard, but that would take years and cost millions of live
s, its much easier to do it with a JIT runtime environment in a library. Since we already have the compiler technology for languages that allow it to work well it seems best to implement such a language as a DSEL that allows us to use that compiler technology from C++ rather than try to implement a vector compiler in a template metaprogram or keep beating our heads against the vectorizing general C code brick wall. JIT runtime is an end run around standardization and is kind of like going through Belgium to invade France. You get to Paris without all the fuss.
Regards,
Luke
Agreed
> Fortran is a different story, it is a language designed to make life easy
> for the compiler. The compiler can hardly fail to vectorize fortran code,
> but other than calling fortran libraries from C interfaces that doesn't
> help us much. If you just defined an object model that faithfully
> duplicates semantics of fortran and let people program that in C++ you
> could auto-vectorize it in the C++ compiler if it became part of the
> standard, but that would take years and cost millions of lives
Agreed
>, its much easier to do it with a JIT runtime environment in a library.
> Since we already have the compiler technology for languages that allow it
> to work well it seems best to implement such a language as a DSEL that
> allows us to use that compiler technology from C++ rather than try to
> implement a vector compiler in a template metaprogram or keep beating our
> heads against the vectorizing general C code brick wall. JIT runtime is
> an end run around standardization and is kind of like going through
> Belgium to invade France. You get to Paris without all the fuss.
I have rouble seeing how JIT and High Performance or JIT and EMbedded
platform can be usable together, hence our static solution. I urge people
to get a look at our boost'con slides. We are not speaking in the air, we
already have working proof of concept and even more than that.
Nice for very linear things. Not so great for sort, binary_search,
reverse, rotate, ...
> - It is very easy to write an iteration primitive with it. See the
> difference between binary_tree_iterator and binary_tree_generator. I
> don't think a segmented iterator would be that easy to define for that
> data structure.
It may not matter that it isn't easy, if you really want generic capability.
> - It is iteration governed by the data, rather than by the program that
> wants to use the data. Therefore the iteration can have full knowledge
> of the data layout and doesn't need to do useless operations at every
> increment, since that information can be statically known from the place
> the iteration is at, allowing it to be as efficient as possible. This is
> also useful for when you have to wait for data, such as when you want to
> iterate data coming from a network, which is typically done with a
> reactive paradigm.
> - It can be converted into a forward iterator through a "simple"
> coroutine application. I still need to benchmark this to see how much of
> a cost it is, but I'm no good at benchmarking.
Reading through
http://github.com/boost-lib/parameter/blob/master/test/efficiency.cpp
might help.
> Of course, there is the drawback that it makes the code using the
> iteration primitive somewhat more complicated: but so do segmented
> iterators. There are also several other drawbacks, in particular
> tied to the fact that it cannot be conceptually much more than a
> forward iterator.
...nor even implement several of the algorithms that work with forward
iterators.
> There is also an extra advantage I hadn't considered that I thought of
> when considering SIMD:
> - It is possible to push different types of data and have them
> statically dispatched through overloading. Heterogeneous structures with
> a pull paradigm require a dynamic dispatch.
I'm sorry, I can't understand what you're driving at here.
> I was in particular thinking of using this latter advantage so that the
> iteration could dispatch its contents to the algorithm
> element-by-element at the borders and pack-by-pack for the rest.
>
> Of course I didn't introduce this idea as SIMD my main concern, I just
> saw a potential useful application there.
>
> Now, I'm not saying we should embrace push instead of pull; I think both
> approaches should be allowed to co-exist, and that we should probably
> try to standardize concepts and allow interoperability with other range
> and iterator algorithms and adapters.
> But this raises several questions: How do we choose between the two? How
> do we integrate the approaches?
Although what you're describing is easier to make efficient for many
(perhaps even the majority of cases), it doesn't sound like it
actually has the same level of generality as segmented iterators do.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
> Since we already have the compiler technology for languages that allow it to work well it seems best to implement such a language as a DSEL that allows us to use that compiler technology from C++ rather than try to implement a vector compiler in a template metaprogram
I don't understand. Implementing a C++ DSEL means implementing a
compiler as a C++ meta-program, since it's a domain-specific language
embedded in the C++ language.
Did you mean it without the E, i.e. a domain-specific language with a
normal compiler infrastructure?
We're doing that too, re-using the same technology.
We want something that can be called from C++ and extends C++. If we wanted a domain-specific langauge there is cuda, openCL and fortran itself (if you care to think of Fortran as a high performance computing domain specific langauge). Implementing a C++ DSEL does mean implementing the front end of the compiler as a meta-program if you want your errors at compile time instead of runtime, but the backend of the compiler can be deferred to runtime by compiling the expressions to expression data objects (some IR) that aren't compiled into executable code until the first time they are called. Having done expression templates and meta programming (including expression templates for vector primitives) I can honestly say that I wouldn't want to implement an optimizing compiler as a template meta-program. It would be a very tedious and painful task with the end result being something that is thoroughly impractical due to long compile times and the impossibility of maintaining it.
It could not compete with the quality and speed of compilers programmed in C.
Regards,
Luke
Really ? try perusing proto to its fullest, it's not that hard to
maintain as soon as oyu have a neat generic package around
and use it to geenrate a proper intermediate representation of your code
fragment.
> It could not compete with the quality and speed of compilers programmed in C.
>
Compile time is an expendable ressources, runtime is not.
i guess everything above is applied to power and arm platforms as well
--
Pavel
P.S.
if you notice a grammar mistake or weird phrasing in my message
please point it out
_______________________________________________
Both are expendable. Compile time goes to developer productivity. If the developer is more productive he will be able to either write more functionality or better optimize the runtime performance of the same functionality with better algorithms/finer tuning. Before dismissing JIT perhaps you should do some benchmark comparisons?
There is a pretty big difference between parsing syntax to an expression template and doing minor operations on the expression and implementing constant propigation and dead code elimination. I know proto can help with the former, but I'm not to keen on using it for the latter. I'm pretty sure we can optimize across expressions at runtime, while we can only optimize within expressions at compile time with template meta-programming. You're cutting off quite a bit of optimization opportunity by limiting the scope of optimizations to the expression level.
Regards,
Luke
While a little awkward to program in, the use of "auto" allows multistatement optimizations. However, unless you want *very* awkward coding, the control flow will always be transparent to a C++ DSEL optimizing library. The hard part of vectorization (and what defeats most compilers) are the complexities inherent in handling control flow; I don't mean simple things like loop unrolling, I'm thinking of loop invariants, loop lifting (commutativity of operations across loops), loop pipelining (distributivity of operations between loops), etc. Even "well behaved" codes for shaders in 3d graphics contexts are very branch-y now-a-days and require sophisticated compilers to allow kernels to fully utilize the hardware.
But how is this different from specifying appropriate compile options for the platforms you intent to support? Even the code generated by a compiler will only work on the processor architectures that the compiler targeted. And if you tell the compiler to target the lowest common denominator, the resulting code will be quite slow.
Regards,
Thomas
Give back to Casear ...
Could you share the results?
Thanks,
Luke
> Could you share the results?
>
More on that, I'm eager to know how dynamically geenrating code (in a
string ? in a file ? as soem bytecode array ?) THEN
running a compiler then executing the resulting binary can beat
statically compiled code dealing with vector register.
Explain this to me and I convert my self.
> But how is this different from specifying appropriate compile
> options for the platforms you intent to support? Even the code
> generated by a compiler will only work on the processor
> architectures that the compiler targeted. And if you tell the
> compiler to target the lowest common denominator, the resulting code will be quite slow.
for example intel compiler for win32 allows to generate two versions
binaries from one source: plain x86 and say sse2 which are switched at
runtime depending on cpuid
so the question still stands
--
Pavel
P.S.
if you notice a grammar mistake or weird phrasing in my message
please point it out
_______________________________________________
an idea immediately arises in my mind: what if it could be possible to
force a particular backend e.g. sse2 and make it propagate down the
call stack? this way you write generic code once and are able to
(pre)compile for (expected) platforms/technologies
then you just switch once depending on some circumstances and call the
appropriate instance (say sse2 forced)
at least this way you write the source once and possibly don't rewrite
it for every circumstance
--
Pavel
P.S.
if you notice a grammar mistake or weird phrasing in my message
please point it out
_______________________________________________
I looked at Intel's array building blocks and tried to understand how
they were doing JIT. More here :
http://nonchalantlytyped.net/blog/2010/09/26/deconstructing-intels-array-building-blocks/
Basically, they use a mix of operator overloading and some cleverly
named macros to make C++ statements "generate" abstract syntax trees
at run time, then JIT it to SSE supported by thread building blocks.
Manjunath
I've been talking about the ct/rapidmind stuff, though my discussion has been about the general idea and not the specifics of what they may have implemented. I don't know exactly what is in there and haven't tried it yet myself.
> More on that, I'm eager to know how dynamically geenrating code (in a
> string ? in a file ? as soem bytecode array ?) THEN
> running a compiler then executing the resulting binary can beat
> statically compiled code dealing with vector register.
>
> Explain this to me and I convert my self.
You can write IR to memory, or load it as part of the process image, run the backend of the compiler on that and dynamically link in the result by replacing some function pointer in a table with the address of the executable code that is generated. Implement it as a try catch around the function call and initialize the function pointer in the table to a null pointer and you get near zero runtime overhead for subsequent function calls. Allow the JIT compiler to inline these functions into eachother and you can reduce the function overhead even further. Alternately, implement it as something that runs at program startup and eliminate the indirection entirely.
As to how it could be better:
If the dynamically generated code makes use of new hardware features not available at the time the original code was compiled....
If the dynamically generated code is better than the statically generated code because it has better algorithms for optimizing vector instructions...
If the compile time is negligable compared the the runtime of the resulting code...
If the dynamically generated code offloads work to the GPU through shared L3 cache...
If the dynamically generated code mutlithreads in addition to vectorizing the code and dynamically schedules it to all available cores...
Need I go on? I shall.
It doesn't make sense for vector intrinsics compiled by C++ compiler to outperform fortran compiled to vector instructions by the fortran compiler. I know perfectly well that it is the same back end compiler, but the difference is that with C++ the compiler can't know which optimizations are safe that the fortran compiler can easily make. As long as the embedded langauge allows the compiler to make the assumptions needed to enable optimizations that the C++ compiler can't then the code it generates will outperform the C++ compiler even if it's the exact same backend compiler being used. There is practically no way you can tell the C++ compiler which assumptions are safe, there just aren't sufficient pragmas and flags for that and even if there were they wouldn't be portable.
In any case, I don't really know much about this stuff, I encourage you to look at what's in there yourself. It could be it's much less good than I imagine it to be based upon my brief reading of the description. You could be a lot smarter than the guys from MIT who did it, who knows?
Regards,
Luke
cool
forgive me for repeating myself, do i understand correctly? will it be
something like the following?
template</*...*/> void task(...);
//somewhere
if (supportsSSE2) task<SSE2>(/*...*/);
else task<General>(/*...*/);
--
Pavel
P.S.
if you notice a grammar mistake or weird phrasing in my message
please point it out
_______________________________________________
I am sure both JIT approach and static metaprogramming approach have
their places. My issue with ct/rapidmind/ArBB is the "unnatural-ness"
it introduces to the programmer. Seems like the programmer has to be
aware of "retained" mode vs. immediate mode when coding his
algorithms. For some of the problems, I believe static metaprogramming
(a la NT2) can be more natural to a C++ programmer.
> If the dynamically generated code makes use of new hardware features not available at the time the original code was compiled....
Can't beat JIT in the above scenario. But the above scenario is
arguably fictitious. HPC people will definitely recompile their code
on newer platforms.
> If the compile time is negligable compared the the runtime of the resulting code...
Vectorizing is no easy task. State of the art algorithms are O(N^2) on
the size of a basic block. So it has non-negligible overhead. Unless
the programmers using JIT are aware of this (use "retained" mode to
pre-compile functions once and call it multiple times), they can
easily blow away the benefits.
Manjunath
I guess blaming the poster was so much more fun than actually requesting
details.
And all in all, I guess the fact than tryign to get innovation on this
in a OS project is useless too ...
You can't have your cake and eat it too. If you leave the low level optimization to the C++ compiler you have to accept that it will punt when it can't be sure whether two pointers might be aliases, for example.
>I guess blaming the poster was so much more fun than actually requesting
>details.
>And all in all, I guess the fact than tryign to get innovation on this
>in a OS project is useless too ...
I wasn't trying to blame you for anything other than dismissing the JIT approach out of hand and my only intention was to get you to spend some time looking into this CT thing which looks quite promising. I'm not sure why you would get frustrated with participating in OS, but it's clear I've upset you, and the wise crack about being smarter than the guys who came up with CT was uncalled for--I apologize.
Regards,
Luke
Manycore with SSE is going mainstream. Desktop applications of the near future will look more and more like HPC applications of today. Desktop users will not recompile their code.
Regards,
Luke
Question to Joel:
Will this compiler automatically do the correct thing with NT2 without any further user intervention?
If I understood correctly, NT2 will know at compile time which SIMD instruction set is available, and generate the appropriate code. Now if the intel compiler compiles the code multiple times for different target architectures, I imagine that NT2 would generate the appropriate code during the different compile runs for the different architectures, so that the resulting executable should work optimal right out of the box. Is this correct?
Regards,
Thomas
NT2 vectorization is triggeed by the activation of compiler option for
SIMD exteniosn (-msse2 etc) so yes.
> If I understood correctly, NT2 will know at compile time which SIMD instruction set is available, and generate the appropriate code.
Yes
> Now if the intel compiler compiles the code multiple times for different target architectures, I imagine that NT2 would generate the appropriate code during the different compile runs for the different architectures,
Yes
> so that the resulting executable should work optimal right out of the box. Is this correct?
>
You lost me here, how do you link all your version ? Dynamically or what ?
At Tue, 12 Oct 2010 04:06:02 -0400, David Abrahams wrote:
>
> At Mon, 11 Oct 2010 22:50:29 +0100,
> Mathias Gaunard wrote:
> >
> > On 11/10/2010 16:37, David Abrahams wrote:
> >
> > >> and ideas about whether there is sufficient interest in the
> > >> alternative to iterators for me to develop them. The intent would
> > >> be to submit both the alternative to iterators and the SIMD helpers
> > >> to Boost.
> > >
> > > Now that your misconceptions about segmented iterators are cleared up,
> > > could you give a brief rundown of your alternative and its advantages?
> > > If all I have to do is read your earlier post and ignore claims of
> > > broken compatibility, just say so, but I don't want to waste time
> > > trying to decipher it if it's substantially wrong.
> >
> > Here, I will only be talking of the full generator I described in that
> > other thread, not the step one.
> >
> > It's push instead of pull, basically.
> >
> > - It's a very simple, easy-to-grasp concept, it's basically just
> > allowing ranges to define their own for_each (I use an output iterator
> > in my example, but it's not really different from a function object)
>
> Nice for very linear things. Not so great for sort, binary_search,
> reverse, rotate, ...
>
> > - It is very easy to write an iteration primitive with it. See the
> > difference between binary_tree_iterator and binary_tree_generator. I
> > don't think a segmented iterator would be that easy to define for that
> > data structure.
>
> It may not matter that it isn't easy, if you really want generic capability.
>
> > - It is iteration governed by the data, rather than by the program that
> > wants to use the data. Therefore the iteration can have full knowledge
> > of the data layout and doesn't need to do useless operations at every
> > increment, since that information can be statically known from the place
> > the iteration is at, allowing it to be as efficient as possible. This is
> > also useful for when you have to wait for data, such as when you want to
> > iterate data coming from a network, which is typically done with a
> > reactive paradigm.
> > - It can be converted into a forward iterator through a "simple"
> > coroutine application. I still need to benchmark this to see how much of
> > a cost it is, but I'm no good at benchmarking.
>
> Reading through
> http://github.com/boost-lib/parameter/blob/master/test/efficiency.cpp
> might help.
>
> > Of course, there is the drawback that it makes the code using the
> > iteration primitive somewhat more complicated: but so do segmented
> > iterators. There are also several other drawbacks, in particular
> > tied to the fact that it cannot be conceptually much more than a
> > forward iterator.
>
> ...nor even implement several of the algorithms that work with forward
> iterators.
>
> > There is also an extra advantage I hadn't considered that I thought of
> > when considering SIMD:
> > - It is possible to push different types of data and have them
> > statically dispatched through overloading. Heterogeneous structures with
> > a pull paradigm require a dynamic dispatch.
>
> I'm sorry, I can't understand what you're driving at here.
>
> > I was in particular thinking of using this latter advantage so that the
> > iteration could dispatch its contents to the algorithm
> > element-by-element at the borders and pack-by-pack for the rest.
> >
> > Of course I didn't introduce this idea as SIMD my main concern, I just
> > saw a potential useful application there.
> >
> > Now, I'm not saying we should embrace push instead of pull; I think both
> > approaches should be allowed to co-exist, and that we should probably
> > try to standardize concepts and allow interoperability with other range
> > and iterator algorithms and adapters.
> > But this raises several questions: How do we choose between the two? How
> > do we integrate the approaches?
>
> Although what you're describing is easier to make efficient for many
> (perhaps even the majority of cases), it doesn't sound like it
> actually has the same level of generality as segmented iterators do.
>
> --
> Dave Abrahams
> BoostPro Computing
> http://www.boostpro.com
>
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
... Says the guy who, quite literally, wrote the book on metaprogramming :)
Sorry.
Is it possible to distill a succinct phrasing of question being posed? I
think I'm having trouble reading it through the reply markers. I think I get
the basic gist, but I'm not sure my understanding is complete enough to
generate an opinion. Maybe it would make a nice article for C++Next.
Andrew
> On 12/10/10 20:32, Manjunath Kudlur wrote:
>>> More on that, I'm eager to know how dynamically geenrating code (in a string
>>> ? in a file ? as soem bytecode array ?) THEN
>>> running a compiler then executing the resulting binary can beat statically
>>> compiled code dealing with vector register.
>>>
>>> Explain this to me and I convert my self.
>>>
>> I looked at Intel's array building blocks and tried to understand how
>> they were doing JIT. More here :
>> http://nonchalantlytyped.net/blog/2010/09/26/deconstructing-intels-array-building-blocks/
>> Basically, they use a mix of operator overloading and some cleverly
>> named macros to make C++ statements "generate" abstract syntax trees
>> at run time, then JIT it to SSE supported by thread building blocks.
>>
> I can't see how different it is from ET and what are the benefit of
> doing static code geenration at runtime...
> I must be really dense but I can't see this winning any race ...
Presuming a sufficiently advanced jit architecture,
- static code generation
- may not have access to custom opcodes or hardware on the user's computer
- may not be able to produce all optimal variants
- cannot change after delivery
- can only guess the actual runtime code paths
- a jit on the user's machine can support
- local benchmarking and profiling
- user-side optimization (including profile hints)
- use an upgraded compiler
- save compiled results
- ...
Basically, think of it as moving the compiler from the developer's box to
the user's executable. Both are "the same compiler", but the user's copy
presumably knows more about the actual input data and hardware available.
Knowing when and how to invoke the JIT is an art (Java uses it too much,
C/C++ too little), but there are certain patterns which are a clear win.
For example, a jit can inline several functions in a user-selected
computation pipeline (examples: DSP blocks or high/medium/low special
effects options); this eliminates branches, temporaries, etc. (hence its
use in current GPU shader languages).
The LLVM project supports turning the full compiler into a JIT (hint:
there is no essential need for on-disk object files). With the Clang
frontend, it should even be possible to compile Boost code into a running
application...
Later,
Daniel
I want to underscore the similarity between shaders in graphics and vector kernels in HPC applications. As we offload to GPU the distincition gets even more fuzzy. When your games load it tells you its compiling the shaders. That's a pretty crummy user experience in and of itself, but we don't quibble because we are happy with the user experience that comes immediately after. In theory you could pre-compile shaders for every variant of graphics hardware out there and ship them with the game and then patch it as new hardware comes out. In practice we are happier to wait for the shaders to compile when we launch the game or get to the next level. It isn't realistic to expect every user to compile the application for their platform, nor it is reasonable to expect every app vendor to provide their source code to the customer to recompile.
> The LLVM project supports turning the full compiler into a JIT (hint:
> there is no essential need for on-disk object files). With the Clang
> frontend, it should even be possible to compile Boost code into a
> running application...
I know a lot of people who are excited about LLVM + Clang and JIT comes up a lot since LLVM is light weight and fast. It would be nice to have an open source (and free) C++ interpreter. We could use LLVM for things like auto-vectorization JIT in a snap. People get it.
Regards,
Luke
> At Mon, 11 Oct 2010 22:50:29 +0100,
> Mathias Gaunard wrote:
> >
> > On 11/10/2010 16:37, David Abrahams wrote:
> >
> > >> and ideas about whether there is sufficient interest in the
> > >> alternative to iterators for me to develop them. The intent would
> > >> be to submit both the alternative to iterators and the SIMD helpers
> > >> to Boost.
> > >
> > > Now that your misconceptions about segmented iterators are cleared up,
> > > could you give a brief rundown of your alternative and its advantages?
> > > If all I have to do is read your earlier post and ignore claims of
> > > broken compatibility, just say so, but I don't want to waste time
> > > trying to decipher it if it's substantially wrong.
> >
> > Here, I will only be talking of the full generator I described in that
> > other thread, not the step one.
> >
> > It's push instead of pull, basically.
> >
> > - It's a very simple, easy-to-grasp concept, it's basically just
> > allowing ranges to define their own for_each (I use an output iterator
> > in my example, but it's not really different from a function object)
>
> Nice for very linear things. Not so great for sort, binary_search,
> reverse, rotate, ...
>
Please please please, if you're going to attempt a better alternative for
iterators, read "The essence of the Iterator pattern" [1] (or [2] for an
updated version) and the important work on idioms[3] that it references. I
wish I had more time to spend on this effort, but needless to say the tough
semantic work has already been done!
Thanks David A. for pointing out there was a semantics discussion going on
here.
David
[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.7480
[2] http://www.comlab.ox.ac.uk/jeremy.gibbons/publications/iterator.pdf
[3] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.22
--
David Sankel
Sankel Software
www.sankelsoftware.com
I don't know about that. This work seems (to me) fundamentally
different in spirit from C++ iterators a là Stepanov and Musser. For
one thing, Gibbons fails to even address random access. Staying close
to the machine model and the ability to make the most generic
algorithms as efficient as hand-coded C (sometimes even assembly!) are
the lifeblood of generic programming, and I don't see how one can
ignore those things in the realm of HPC.
>
> Thanks David A. for pointing out there was a semantics discussion going on
> here.
Welcome. Sorry to be a contrarian :-)
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
The irony is not lost on me. But I fear I've created a monster :(
> Is it possible to distill a succinct phrasing of question being
> posed? I think I'm having trouble reading it through the reply
> markers. I think I get the basic gist, but I'm not sure my
> understanding is complete enough to generate an opinion. Maybe it
> would make a nice article for C++Next.
The question has to do with what kind of abstraction should be used to
expose element access in segmented structures. This short and very
readable 1998 paper explains the problem space and offers one
approach: http://lafstern.org/matt/segmented.pdf
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
Notes on clang/LLVM + C++ + JIT stuff:
////////////////////////////////////////////////////////////////////
// Copyright 2008 Eric Niebler. Distributed under the Boost
// Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#include <iostream>
#include <boost/proto/core.hpp>
#include <boost/proto/context.hpp>
#include <boost/typeof/std/ostream.hpp>
namespace proto = boost::proto;
proto::terminal< std::ostream & >::type cout_ = {std::cout};
template< typename Expr >
void evaluate( Expr const & expr )
{
proto::default_context ctx;
proto::eval(expr, ctx);
}
int main()
{
evaluate( cout_ << "hello" << ',' << " world" );
return 0;
}
//]
wash@Pegasus:~/sandbox$ time clang++ -cc1 -triple x86_64-unknown-linux-gnu
- -emit-llvm-bc -x c++ hello.cpp
real 0m1.958s
user 0m1.676s
sys 0m0.276s
wash@Pegasus:~/sandbox$ lli hello.bc
hello, world
wash@Pegasus:~/sandbox$
There's nothing magic about JITing C++, even JITing non-trivial C++. It's
basically pointless, because the LLVM bitcode/LLVM assembly generated by a C++
program is absolutely, positively not portable. Starting in the earliest stages
of compilation (preprocessing), platform dependencies enter the code (not to
mention all the preprocessor workarounds). Add in alignment jazz. Plus sizes.
Plus implementation details (GNU expands sizeof and assert in the preprocessor,
for example). CXXABI. C++ is not Java, and clang/LLVM are not magic. We can do
cool things with clang/LLVM, C++ and JIT, but at the end of the day, C++ is not
Java.
Other stuff:
I think I saw some comments implying that C compilers were faster or better
than C++ compilers. I know that I saw people claim that C compilers were
faster, better or could optimize better than C++ compilers implemented using
expression templates and metaprogramming techniques. Such claims are
unfounded.
http://clang.llvm.org/performance.html - clang uses limited template
metaprogramming techniques, however it is quickly shaping into a solid,
production-quality open-source C++ compiler (I have compiled a semi-functional
Linux kernel with clang), and it is by far superior to it's C predeccessor GCC
when it comes to optimization.
Boost.Wave, a standards compliant C preprocessor is implemented using
Boost.Spirit.
www.ll.mit.edu/HPEC/agendas/proc02/abstracts/mullin.pdf - really smart MIT
guy's paper on really cool optimizations, done in C++ with C++ TMP techniques,
specifically expressive templates.
- --
Bryce Lelbach aka wash
http://groups.google.com/group/ariel_devel
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iEYEARECAAYFAky1PVkACgkQO/fqqIuE2t6gBwCcDjnYwO8kYzzBu7HOyl6yyAhB
UJsAoMfUKrujDLkBHHEOe7Eyep9AJOJi
=hYSg
-----END PGP SIGNATURE-----
Note: s/expressive/expression/ on my last post. This is becoming a recurring
problem for me :x
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iEYEARECAAYFAky1Pt0ACgkQO/fqqIuE2t4sGQCg4h8eddk9piv3p+EY5UoRwWHK
1r0AnAtnZWB1lTb20BtXwBo8VazFwpzK
=QhW/
Jumping in the middle here so this may not be relevant.... Here is the
beginning of a discussion on spirit-devel about segmented Fusion
algorithms that sought to apply Matt Austern's formulation to Fusion's
heterogeneous sequences:
http://article.gmane.org/gmane.comp.parsers.spirit.devel/2765
This discussion eventually led to Fusion's current (incomplete,
unmaintained) support for segmented sequences. (Although in Fusion, the
goal is not improved runtime performance, but easier-to-implement
segmented sequences and drastically lower compile-time overhead.)
We found no fundamental problems in Austern's formulation of generic
segmentation, or in our adaptation of it to meet Fusion's needs.
--
Eric Niebler
Unlike Joel, I do see the advantages of using a JIT to do
high-performance computing, and the potential benefit of including such
technology into NT2, albeit those benefits would be minimal in the
current market NT2 is targeting.
I think, however, that this is getting way off-topic. We are only
looking at submitting the SIMD abstraction layer to Boost, nothing more,
and certainly not the whole of NT2. There is much more to a numerical
computation library or runtime than just SIMD.
People have expressed interest in using a SIMD abstraction layer to
speed up uBLAS, but it has various implications beyond that, and can
also be used to give a significant speedup to text processing,
compression algorithms, or whatever else you might think of.
I want to repeat what the SIMD library is in case it hasn't been clear:
it only provides an abstraction that allows one to model SIMD registers,
and tools to help in packetizing data into register-sized packets while
taking care of alignment considerations; all of it while ensuring
operations on the packets will directly translate in the expected SIMD
instructions, for many different architectures on all major C++ compilers.
As a Boost library, it intends to do only one job, in a generic and
minimal fashion, and do it well.
NT2.SIMD or Boost.SIMD would take all its code generation decisions at
compile-time according to what the compiler claims to support, like a
regular C++ library. If the user of that library wants to take those
decisions at runtime, it will be up to him to JIT the code (which
shouldn't be too difficult, since we aim at providing strong integration
with clang), use a strategy similar to that of the Intel compiler (and
as already said in this thread, it already integrates with that strategy
very well), or use an environment or runtime built upon the library that
already does it.
> At Tue, 12 Oct 2010 15:40:38 -0500,
> Andrew Sutton wrote:
> >
> > >
> > > I find it interesting, but also a bit sad, that the thread on
> > > metaprogramming “plumbing” is very active, while this fundamental
> > > question of abstraction and generic programming has gone unaddressed.
> > >
> >
> > ... Says the guy who, quite literally, wrote the book on metaprogramming
> :)
> > Sorry.
>
> The irony is not lost on me. But I fear I've created a monster :(
>
>
For the mere mortals among us, who are struggling to keep up with the
metaprogramming astronauts, am I right in thinking that you are lamenting
the emphasis on 'bells and whistles' to support SIMD iterators, while the
more fundamental aspects of iterator segmentation are being ignored?
And in your last remark are you saying you fear that metaprogramming and
the Boost MPL are monsters!?
Thx
- Rob.
I was unclear adn a bit spun yesterday for out-of-ML reason. I have
nothing against JIT, we even JIT openCL atm for GPGPU support. i was a
bit against JIT for the problem at hand and th eproposal we were making.
> As a Boost library, it intends to do only one job, in a generic and
> minimal fashion, and do it well.
Agreed
> Nice for very linear things. Not so great for sort, binary_search,
> reverse, rotate, ...
Yes, since those require either bidirectional or random access iterators.
Ability to stop the iteration could be a useful addition to make the
concept have more applications.
In functional languages, people seem happy to use exceptions for that;
but I assume that certainly wouldn't be the case in C++.
In any case, linear operations or filters represent in my opinion a
large enough spectrum of what people do with iterable ranges of data
that it could be worthwhile to have concepts that allow them to be
efficient, even if it cannot apply to other cases.
> It may not matter that it isn't easy, if you really want generic capability.
But do we want generic capability at the cost of complicated code which
is hard to optimize?
>> Of course, there is the drawback that it makes the code using the
>> iteration primitive somewhat more complicated: but so do segmented
>> iterators. There are also several other drawbacks, in particular
>> tied to the fact that it cannot be conceptually much more than a
>> forward iterator.
>
> ...nor even implement several of the algorithms that work with forward
> iterators.
Do you have some examples?
>
>> There is also an extra advantage I hadn't considered that I thought of
>> when considering SIMD:
>> - It is possible to push different types of data and have them
>> statically dispatched through overloading. Heterogeneous structures with
>> a pull paradigm require a dynamic dispatch.
>
> I'm sorry, I can't understand what you're driving at here.
>
struct my_iteration
{
void operator()(float f)
{
do_something_with_scalar(f);
}
void operator()(packed_view<float, 4> v)
{
do_something_with_vector(v);
}
};
for_each(simd::packed_range(mystuff), my_iteration());
vs
for(packed_view<float, 4> v : simd::packed_range(mystuff))
do_something_with_vector(v);
Using a pull solution for iteration, the former /could/ take care of the
non-aligned cases at the beginning and at the end of the mystuff range
through overloading.
>
> Although what you're describing is easier to make efficient for many
> (perhaps even the majority of cases), it doesn't sound like it
> actually has the same level of generality as segmented iterators do.
Segmented iterators still allow bidirectional support, indeed (I think?).
But I'm note sure it can do random access as well, if you consider
recursively segmented local iterators.
I believe writing a zip_iterator (or similar) would be no easier for
segmented iterators than for generators due to the arbitrary recursion.
> Using a pull solution for iteration, the former /could/ take care of the
> non-aligned cases at the beginning and at the end of the mystuff range
> through overloading.
Sorry, I meant a push solution, not a pull one.
Might be clearer to use higher-order functional, origami or some other
term to describe it.
> Please please please, if you're going to attempt a better alternative for
> iterators, read "The essence of the Iterator pattern" [1] (or [2] for an
> updated version) and the important work on idioms[3] that it references. I
> wish I had more time to spend on this effort, but needless to say the tough
> semantic work has already been done!
>
> Thanks David A. for pointing out there was a semantics discussion going on
> here.
>
> David
>
> [1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.7480
> [2] http://www.comlab.ox.ac.uk/jeremy.gibbons/publications/iterator.pdf
> [3] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.22
>
This is interesting in essence, but I fear it abstracts too much into
lambda land. (to the point where I don't understand it, but that's
another problem ;))
From a glimpse at your paper, the advantages of your approach over a
monadic map (which is what I take to be the most similar thing to a fold
in C++, since C++ has side effects) is that it allows composition, and
doesn't need two symmetric definitions to traverse the data both ways.
Is that correct?
Unfortunately I do no understand how one would use this in C++. The
approach seems a bit similar to continuations, unless I'm completely out
of it?
I have already suggested using coroutines to make the iteration more
flexible, but it comes with a runtime cost (context switching).
I could see this.
[...]
>>> Of course, there is the drawback that it makes the code using the
>>> iteration primitive somewhat more complicated: but so do segmented
>>> iterators. There are also several other drawbacks, in particular
>>> tied to the fact that it cannot be conceptually much more than a
>>> forward iterator.
>>
>> ...nor even implement several of the algorithms that work with forward
>> iterators.
>
> Do you have some examples?
find_if? adjacent_find? adjacent_difference? parallel_for_each? I
suppose (some of) these could be implemented using a for_each-style of
iteration, but would require caching (of the previously visited element,
in the case of adjacent_*) and exceptions (in the case of
find_if/adjacent_find)...which would seem to negate the performance
motivation. Or maybe I'm being too dense...
[...]
>>> - It is possible to push different types of data and have them
>>> statically dispatched through overloading. Heterogeneous structures with
>>> a pull paradigm require a dynamic dispatch.
>>
>> I'm sorry, I can't understand what you're driving at here.
>>
>
> struct my_iteration
> {
> void operator()(float f)
> {
> do_something_with_scalar(f);
> }
>
> void operator()(packed_view<float, 4> v)
> {
> do_something_with_vector(v);
> }
> };
>
> for_each(simd::packed_range(mystuff), my_iteration());
One might describe this as expressing iteration in terms of the visitor
pattern...? I honestly didn't understand what you meant between "push"
and "pull" before this snippet.
[...]
>> Although what you're describing is easier to make efficient for many
>> (perhaps even the majority of cases), it doesn't sound like it
>> actually has the same level of generality as segmented iterators do.
>
> Segmented iterators still allow bidirectional support, indeed (I think?).
> But I'm note sure it can do random access as well, if you consider
> recursively segmented local iterators.
Seems to me segmented iterators and a visitation-based/push-based
iteration address fundamentally different problems. At least, I don't
see how segmented iterators help here. The problem that
visitation-based iteration seems to try to solve is improving iteration
efficiency over join(t)_range-like ranges...
> I believe writing a zip_iterator (or similar) would be no easier for
> segmented iterators than for generators due to the arbitrary recursion.
What do you mean by generators in this context?
- Jeff
Those return an iterator, so that makes it complicated to see what they
would do.
Assuming they would return a pointer to the value, you could do
(ignoring const issues):
template<typename Range, typename P>
struct find_if_f
{
find_if_f(P p_) : p(p_) {}
bool operator()(typename range_value<Range>::type& e)
{
if(p(e))
{
ptr = &e;
return false;
}
return true;
}
P p;
typename range_value<Range>::type* ptr;
};
template<typename Range, typename P>
typename range_value<Range>::type* find_if(Range& range, P p)
{
find_if_f<Range, P> f(p);
if(for_each_until(range, f))// for_each but stops when f returns false
return f.ptr;
return 0;
}
> adjacent_difference?
You need to remember the previous element, but you don't have to copy to
do that, you could just use a reference or pointer to it.
I would think a sufficiently good optimizer should take care of inlining
that.
> parallel_for_each?
Good point.
This requires the ability to subdivise the range, which would be
feasible but inefficient, in particular due to the lack of random access.
I
> suppose (some of) these could be implemented using a for_each-style of
> iteration, but would require caching (of the previously visited element,
> in the case of adjacent_*) and exceptions (in the case of
> find_if/adjacent_find)...which would seem to negate the performance
> motivation. Or maybe I'm being too dense...
I'm not certain those really hamper performance (assuming the use of
another mechanism instead of exceptions to stop the loop).
> One might describe this as expressing iteration in terms of the visitor
> pattern...? I honestly didn't understand what you meant between "push"
> and "pull" before this snippet.
This referenced the initial conversation I had with Dave in another
thread. It's not very practical terminology.
>
> [...]
>>> Although what you're describing is easier to make efficient for many
>>> (perhaps even the majority of cases), it doesn't sound like it
>>> actually has the same level of generality as segmented iterators do.
>>
>> Segmented iterators still allow bidirectional support, indeed (I think?).
>> But I'm note sure it can do random access as well, if you consider
>> recursively segmented local iterators.
>
> Seems to me segmented iterators and a visitation-based/push-based
> iteration address fundamentally different problems. At least, I don't
> see how segmented iterators help here. The problem that visitation-based
> iteration seems to try to solve is improving iteration efficiency over
> join(t)_range-like ranges...
Since those could be different segments, you wouldn't need to check
which segment you are in at each iteration.
Actually, all local iterators for a a segmented iterator must be the
same for all segments, but since a local iterator can be a segmented
iterator itself, it can also have a local iterator which can be
different. By recursion, you can therefore use different iterator types
for each segment, but that's somewhat convoluted.
>
>> I believe writing a zip_iterator (or similar) would be no easier for
>> segmented iterators than for generators due to the arbitrary recursion.
>
> What do you mean by generators in this context?
That's what I called my alternative to segmented iterators in the other
thread, because I made them use an output iterator instead of a function
object.
Any function that reads a range and passes each element to a template
user-supplied object would fit the idea.
I call it push because the data is being pushed onto the user-supplied
object.
That's how I've read the discussion. The previous thread seems to have
trailed off into a space that more to do with particulars and little to do
with the fundamental abstraction.
On the topic of abstraction, I would say that Matt's solution seems to be
the right approach -- from an iterator perspective. He claims that
segmentation concepts is an orthogonal to the standard iterator concepts. I
don't especially like the implementation, requiring an explicit
specialization of segmented_iterator_traits. I see nothing preventing the
segmented iterator class from defining those associated types and operations
and allowing the traits class to use them by default.
The solution seems to fit the GP pattern quite well. Specialize generic
algorithms for orthogonal or refined concepts to get faster code for "free".
I don't really know enough about Fusion to comment on the design, but on the
surface it seems like an equivalent solution.
> And in your last remark are you saying you fear that metaprogramming and
> the Boost MPL are monsters!?
>
My feeling on metaprogramming (and hence the MPL) is that it's a necessary
evil for the style of generic programming going on here: necessary because
it's the duct tape that holds generic libraries together, evil because it
allows us to focus on really low-level details while doing some really
clever programming. In case this may have read differently, I think that
this style of generic programming is actually a good thing.
Andrew
On Wed, 13 Oct 2010, Bryce Lelbach wrote:
> There's nothing magic about JITing C++, even JITing non-trivial C++. It's
> basically pointless, because the LLVM bitcode/LLVM assembly generated by a C++
> program is absolutely, positively not portable.
Agreed on the portability, but I think you missed the point. People are
advocating something like JIT, not to VM bytecode but directly to machine
code, exactly to get non-portable optimizations.
> I think I saw some comments implying that C compilers were faster or better
> than C++ compilers. I know that I saw people claim that C compilers were
> faster, better or could optimize better than C++ compilers implemented using
> expression templates and metaprogramming techniques. Such claims are
> unfounded.
The normal example is Fortran being faster than C/C++ because it doesn't
have to worry about issues such as pointer aliasing.
Later,
Daniel
BOOST_MPL_DEFINE_HAS_MEMBER(is_segmented)
template <typename T, bool Enable = has_is_segmented<T>::value>
struct segmented_iterator_traits {
typedef false_type is_segmented_iterator;
};
template <typename T>
struct segmented_iterator_traits<T, true> {
typedef true_type is_segmented_iterator;
typedef T iterator;
typedef typename T::segment_iterator segment_iterator;
// ...
};
Sebastian
But all those things are not intrinsic to C++. The C++ abstract
machine doesn't specify any of those platform dependencies. It's just
that you need to do some of those optimizations in the front-end for
full compilation to object code, and nobody is interested in C++ code
that's link-incompatible with all the libraries, etc., that have
already been compiled.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
> Since those could be different segments, you wouldn't need to check
> which segment you are in at each iteration.
> Actually, all local iterators for a a segmented iterator must be the
> same for all segments, but since a local iterator can be a segmented
> iterator itself, it can also have a local iterator which can be
> different. By recursion, you can therefore use different iterator types
> for each segment, but that's somewhat convoluted.
That's wrong actually. Since all the local iterators are the same, then
their local iterators must be the same too.
So segmented iterators wouldn't help to only check for alignment once it
has been reached I think. I would have to properly try to implement one,
but it's not particularly practical to do.
Anyway, my belief is that my alternative to segmented iterators is more
interesting.
Actually, all of those can be done with bidirectional iterators
(although the standard over-constrains sort). binary_search can be
done with forward iterators.
> Ability to stop the iteration could be a useful addition to make the
> concept have more applications.
> In functional languages, people seem happy to use exceptions for that;
> but I assume that certainly wouldn't be the case in C++.
Nooo. Waaay too far from the machine model. If we're talking about
HPC, let's figure out how to generate optimal (or near-optimal) code.
> In any case, linear operations or filters represent in my opinion a
> large enough spectrum of what people do with iterable ranges of data
> that it could be worthwhile to have concepts that allow them to be
> efficient, even if it cannot apply to other cases.
I believe that segmented iterators do allow them to be efficient.
> > It may not matter that it isn't easy, if you really want generic capability.
>
> But do we want generic capability at the cost of complicated code
> which is hard to optimize?
I don't believe you have to choose.
> >> Of course, there is the drawback that it makes the code using the
> >> iteration primitive somewhat more complicated: but so do segmented
> >> iterators. There are also several other drawbacks, in particular
> >> tied to the fact that it cannot be conceptually much more than a
> >> forward iterator.
> >
> > ...nor even implement several of the algorithms that work with forward
> > iterators.
>
> Do you have some examples?
binary_search (equal_range, lower_bound, upper_bound), rotate, equal,
transform (2-input version), includes, merge...
These were just the first ones I encountered in the algorithms section
of http://www.sgi.com/tech/stl/stl_index_cat.html
> >> There is also an extra advantage I hadn't considered that I thought of
> >> when considering SIMD:
> >> - It is possible to push different types of data and have them
> >> statically dispatched through overloading. Heterogeneous structures with
> >> a pull paradigm require a dynamic dispatch.
> >
> > I'm sorry, I can't understand what you're driving at here.
> >
>
> struct my_iteration
> {
> void operator()(float f)
> {
> do_something_with_scalar(f);
> }
>
> void operator()(packed_view<float, 4> v)
> {
> do_something_with_vector(v);
> }
> };
>
> for_each(simd::packed_range(mystuff), my_iteration());
>
> vs
>
> for(packed_view<float, 4> v : simd::packed_range(mystuff))
> do_something_with_vector(v);
>
> Using a pull solution for iteration, the former /could/ take care of
> the non-aligned cases at the beginning and at the end of the mystuff
> range through overloading.
Oh, I think I see what you're getting at. Solving that problem for
segmented iterators would be interesting.
> > Although what you're describing is easier to make efficient for many
> > (perhaps even the majority of cases), it doesn't sound like it
> > actually has the same level of generality as segmented iterators do.
>
> Segmented iterators still allow bidirectional support, indeed (I
> think?). But I'm note sure it can do random access as well, if you
> consider recursively segmented local iterators.
Just as with flat containers, some segmented iterators for
hierarchical containers can do random-access and some can't.
std::deque is an example that can. I don't think recursive
segmentation is an obstacle as long as all the sizes are constant at
the leaves.
> I believe writing a zip_iterator (or similar) would be no easier for
> segmented iterators than for generators due to the arbitrary
> recursion.
I'm not sure what you're getting at here, again. Is it relevant?
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
That's the same problem (in essence) that segmented iterators are
solving. Did you read the Austern paper?
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
I think the exercise would be worth your time
> Anyway, my belief is that my alternative to segmented iterators is
> more interesting.
Not to me, until you can begin to address something like std::merge().
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
You're in the ballpark. I'm lamenting the focus on tricky techniques
(a catch-all bucket for metaprogramming and JIT and...) when there's a
more fundamental question of abstraction to deal with. IMO, first you
should get the abstractions right, and only then should you go back
and apply tricky techniques to optimize if/where necessary.
> And in your last remark are you saying you fear that metaprogramming and
> the Boost MPL are monsters!?
No, but I do fear that the Boost community has become so adept at
fancy TMP that we have lost track of our roots in generic programming,
per this thread:
http://news.gmane.org/find-root.php?message_id=%3cm2wrq5zpxq.wl%25dave%40boostpro.com%3e
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
I just worry that the focus on TMP is so intense that, potentially,
design choices in the GP space that could vastly reduce the amount of
TMP needed are missed.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
sse series on x86 provide several ways to catch errors
specifically control word/flags and exceptions/signals (noty aware of
other platforms)
are you going to model such subtle platform specific details? and if
yes -- how?
as a C++ programmer i would prefer C++ exception mechanism
--
Pavel
P.S.
if you notice a grammar mistake or weird phrasing in my message
please point it out
I didn't know binary_search (and similar functions) worked with forward
iterators. I have trouble seeing how they could work effectively at all.
>> But do we want generic capability at the cost of complicated code
>> which is hard to optimize?
>
> I don't believe you have to choose.
Segmented iterators are neither practical to write nor to use.
ou've
> rotate,
Interesting one, because it modifies itself.
I'm not convinced it's not possible to write it.
> equal,
> transform (2-input version), includes, merge...
Indeed, there is a combination problem.
>> I believe writing a zip_iterator (or similar) would be no easier for
>> segmented iterators than for generators due to the arbitrary
>> recursion.
>
> I'm not sure what you're getting at here, again. Is it relevant?
Well, the limitation is how you can combine those "generators" to
iterate several at the same time, be it for merge or otherwise.
I don't see a good way to do this with segmented iterators either.
Well, let me rephrase the problem.
Given the contiguous sequence of numbers 1 2 3 4 5 6 7 8 9 10 11, with
the 3rd element (3) lying on a satisfying alignment boundary to deploy
SIMD instructions of cardinal 4, we want to adapt the range into this:
| 0 0 1 2 | 3 4 5 6 | 7 8 9 10 | 11 0 0 0 |
The concern is that we want the iteration of the middle part | 3 4 5 6 |
7 8 9 10 | (which in practice would contain much more elements than
this), to be as fast as possible, i.e.
- increment is a ptr += 4
- dereference is load ptr into a SIMD register
This is not possible using normal iterators, since I will need to have a
different behaviour (and thus a branch), for at least one of those two
functions, depending on what state the iteration is at.
As a result a few bytes at the beginning and at the end of a long
sequence slow down the whole iteration.
I thought segmented iterators would solve this, since after all I've got
three segments, and you've been repeatedly saying on this list that it
is the solution to many iterator problems, but they don't. Segmented
iterators are only useful if the iteration of all my segments is done
with the exact same logic but at different locations, since they all
must have the same type. They don't allow to circumvent the problem I
exposed with normal iterators at all.
My alternative not only does, but it could also allows not to have to
pad the ranges with zeroes, which is potentially more interesting to
write SIMD-aware algorithms.
>
>> Anyway, my belief is that my alternative to segmented iterators is
>> more interesting.
>
> Not to me, until you can begin to address something like std::merge().
Yet you are convinced about segmented iterator without even knowing if
it can address something like std::merge (well sure it can if you use
the fallback flat iterator, but that's not what I mean).
But in practice it doesn't help much.
As a non-JIT developer you can figure what is the minimum requirements
for your program to run decently. Optimize for that!
If your customer has better hardware - fine, it will run even faster.
If your calculations need SIMD extensions, no JITing in the world can
fix that on my 486.
Bo Persson
> On 13/10/10 16:00, David Abrahams wrote:
>> At Wed, 13 Oct 2010 10:52:36 +0100,
>> Mathias Gaunard wrote:
>>>
>>> On 12/10/10 09:06, David Abrahams wrote:
>>>
>>>> Nice for very linear things. Not so great for sort, binary_search,
>>>> reverse, rotate, ...
>>>
>>> Yes, since those require either bidirectional or random access iterators.
>>
>> Actually, all of those can be done with bidirectional iterators
>> (although the standard over-constrains sort). binary_search can be
>> done with forward iterators.
>
> I didn't know binary_search (and similar functions) worked with forward iterators. I have trouble seeing how they could work effectively at all.
It's O(n) iterator movements, but only O(log n) comparisons and dereferences, which is valuable if comparison is expensive (e.g. long, similar strings).
Sebastian
O(N) steps, O(log N) comparisons. Better than linear search if
comparison is not cheap.
> >> But do we want generic capability at the cost of complicated code
> >> which is hard to optimize?
> >
> > I don't believe you have to choose.
>
> Segmented iterators are neither practical to write nor to use.
> ou've
Why do you say that? I don't think anyone has made a serious effort.
> > rotate,
>
> Interesting one, because it modifies itself.
> I'm not convinced it's not possible to write it.
I'd like to see that one.
> > equal,
> > transform (2-input version), includes, merge...
>
> Indeed, there is a combination problem.
>
> >> I believe writing a zip_iterator (or similar) would be no easier for
> >> segmented iterators than for generators due to the arbitrary
> >> recursion.
> >
> > I'm not sure what you're getting at here, again. Is it relevant?
>
> Well, the limitation is how you can combine those "generators" to
> iterate several at the same time, be it for merge or otherwise.
Ah.
> I don't see a good way to do this with segmented iterators either.
:-) zip_iterator
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
I think you mean
1 2 | 3 4 5 6 | 7 8 9 10 | 11
right? Are you really intending to manufacture zeros for padding?
>
> The concern is that we want the iteration of the middle part | 3 4 5 6
> | 7 8 9 10 | (which in practice would contain much more elements than
> this), to be as fast as possible, i.e.
> - increment is a ptr += 4
> - dereference is load ptr into a SIMD register
>
> This is not possible using normal iterators, since I will need to have
> a different behaviour (and thus a branch), for at least one of those
> two functions, depending on what state the iteration is at.
> As a result a few bytes at the beginning and at the end of a long
> sequence slow down the whole iteration.
Yes, this is exactly the same problem as std::deque has, which
segmented iterators were designed for.
> I thought segmented iterators would solve this, since after all I've
> got three segments, and you've been repeatedly saying on this list
> that it is the solution to many iterator problems, but they
> don't.
Of course they do.
> Segmented iterators are only useful if the iteration of all my
> segments is done with the exact same logic but at different locations,
> since they all must have the same type. They don't allow to circumvent
> the problem I exposed with normal iterators at all.
I think they do. However, you might find that you need some dynamic
dispatching to get it to work. In other words,
if (operating on complete segment)
SIMD algorithm
else
elementwise algorithm
> My alternative not only does, but it could also allows not to have to
> pad the ranges with zeroes, which is potentially more interesting to
> write SIMD-aware algorithms.
You don't need to pad with zeroes for segmented iterators.
> >> Anyway, my belief is that my alternative to segmented iterators is
> >> more interesting.
> >
> > Not to me, until you can begin to address something like std::merge().
>
> Yet you are convinced about segmented iterator without even knowing if
> it can address something like std::merge (well sure it can if you use
> the fallback flat iterator, but that's not what I mean).
I do know that it can address std::merge. It's not necessarily
pretty, though. ;-)
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
Yes. But I'm not convinced yet that iterators exposing the segmentation
of the data structure are the right solution to presenting a flattened
view of that data structure. 1) It doesn't play nice with standard C++
looping constructs, specifically for loops. 2) I have yet to see a
benchmark comparison between segmented iterators and alternatives (other
than a side-note claim in the paper). I'm open to looking at some
specific comparisons if one has already done this, and this is something
I'm interested enough in that I might try to do myself. Seems to me a
more general solution is a "flat_multirange_iterator" templated on the
outer range and the "segmentation depth". I.e., a
flat_multirange_iterator< vector< vector<T> >, 2 > would iterate over
the T's.
To address your claim that "that's the same problem ... that segmented
iterators are solving", what I meant by a join_range or joint_range is,
essentially, a flattened out Boost.Fusion sequence of ranges of
*differing* types (but with compatible value_type's and references).
E.g., the thing that boost::join [1] returns. Segmented iterators can't
deal with such ranges (at least with a strict reading of Austern's
paper), and I think such ranges are closer to what Mathias has in mind
(to be verified......). But perhaps I'm just not seeing your
generalization beyond the immediate scope of the Austern paper?
- Jeff
[1]
http://www.boost.org/doc/libs/1_44_0/libs/range/doc/html/range/reference/utilities/join.html
That doesn't seem any better than what I would imagine the
range_detail::join_iterator does. I.e., I don't see what segmented
iteration buys us here over what I'd imagine Mathias is currently doing
(which, I'd imagine, is some variant on a joint_iterator that returns
aligned packets).
- Jeff
where's that list of algorithms again? sort, transform, merge,
binary_search ...
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
[snip]
>> Segmented iterators are neither practical to write nor to use.
>> ou've
>
> Why do you say that? I don't think anyone has made a serious effort.
I've been watching this thread and I've been thinking of saying my
experience trying to use segmented iterators.
I am working on a project that has a hierarchical structure where an
object can own other objects of the same type. It is a tree-structure.
This hierarchical structure might be acessible hierachically or
not. Conceptually it always is, but it might be accessible only
through input_iterators which would flatinize the hierarchy in a
efficient way. But, some other implementation might be able to access
it hierarchically, which would allow some algorithms, e.g. find, to work
in better-than-a-linear time.
At first I thought that segmented iterators were what I needed. But
then trying to work with it. I found that segmented iterators would
differentiate segment iterators from local iterators in a way that
didn't work in a tree-like structure.
The nodes of the trees had to be moved to be the first in the local
sequence because only local iterators are dereferenced in
segmented-aware algorithms. This meant that segmented iterators
weren't completely overhead-free.
This doesn't mean segmented iterators aren't good, just that the
use-case that I tried didn't seem to fit it.
[snip]
> --
> Dave Abrahams
> BoostPro Computing
> http://www.boostpro.com
Regards,
--
Felipe Magno de Almeida
Ummm...not sure what your point is. Are you saying a join_iterator (the
iterator used in the result of boost::join) has issues with these
algorithms?
- Jeff
Just trying to refocus this thread to what I meant it to be.
We have a SIMD abstraction layer that we would like to eventually submit
to Boost.
For appropriate values of N, simd::pack<T, N> represents a SIMD
register, which allows you to do the same operation on N elements in
parallel with a single CPU instruction. pack<T, N> provides all the same
operators as T, and can also detect a sequence of operations that exists
on the CPU as a single instruction. It falls back to a loop if the
architecture has no SIMD register of size sizeof(T)*N bytes.
simd::pack<T, N> is also a fusion sequence and a range.
The library also provides a series of useful functions, like summing a
pack or reordering its elements.
What I would like to know, is how people think we could integrate this
system into iterators and ranges so that existing algorithms could be
adapted to treat N elements at a time instead of 1, and therefore get a
potential speed gain.
As I said, we currently have an iterator adapter that adapts a range of
properly aligned Ts, that are also contiguous at least in chunks of N
elements, into a range of pack<T, N>s.
Are there other utilities we could provide to help with the usage of SIMD?
I was thinking of supporting adapting non-aligned ranges as well and
padding them with some values, but that is not possible to do
efficiently with the standard iterator model, which led to some
discussion about an alternative iterator model that seems to be going
nowhere.
Any suggestion of features or tools the SIMD could provide to make the
life of the developer easier would be appreciated.
With regards to alignment, we also provide a memory allocator that
aligns correctly, and functions to get the next or previous aligned
address from a particular address.
"seems" being the operative word ;)
I find the argument to investigate a framework for internal iteration /
push iteration / visitation iteration / whatever convincing. I haven't
been convinced yet (despite Dave's best efforts) that the standard,
external iteration model, segmented or otherwise, will absolutely
address your efficiency concerns.
Have you benchmarked the performance difference between the standard
iteration model and push iteration?
- Jeff
I've only had a chance to read the Austern paper "Segmented Iterators and Hierarchical Algorithms" once. I agree that the segmented iterators are rather appealing as an interface to packs-of-pods-datatype. However, I'm not sure the conceptual interface to segmented iterators are rich enough. Luke Simonson has more experience with the construction of an iterator pattern around SIMDs used in a high-performance system than I do. However I seem I to remember that the pattern for SIMD iteration always looked like this:
Loop A?
Loop B*
Loop C?
Where Loops A & C handle the unaligned start/end (if they exist) and Loop B* handles (0 or more of) the aligned packs of the sequence. Handling the edge-cases for A & C were definitely not the same --- there are memory exceptions that can occur at the end-of-range for Loop C that can't (or usually don't) occur for Loop A when doing unaligned loads/stores. I seem to remember that it was insufficient to just mark "not Loop B" but it was necessary to explicitly mark foreach operations as "Loop A" vs. "Loop C". It seems that this division of Loop A/B/C will need to be exposed through the iterator.
My other concern is that while it should be fairly straightforward to provide efficient SIMD instruction sequences for addition, subtraction, multiplication, (refined) division, and related fused-accumulate ops, many of the more "interesting" SIMD algorithms are highly sensitive to swizzling & permutation, prefetching, built-in conversion operations, etc.
A SIMD library exposing just (for instance) the Ring or Field algebraic operations does have its uses; however, I feel that in most "real world" scenarios the unexposable details needed for high-performance coding will mean that the library is only used for toy applications. That is, unless you think you can also define platform-independent forwarding to swizzle/permute/conditional-lane-usage/etc.
Another (particular ugly) use of SIMDs is to allow per-object aligned storage & load of structures into/outof binary array blobs. This would basically look like an allocator for an object that uses the SIMD extensions for loading/storing data-structures as binary blobs into and out of arrays. Have you considered an interface for this sort of usage?
+1
I share this sentiment. My focus on TMP has been so intense too that
I am stepping back and reflecting on what I(/we) have done with Spirit,
Fusion and Phoenix hoping to find an alternate path. My feeling is that
I want to ease out a bit on TMP and focus more on GP.
Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net
All libraries that, IMO, have well-defined and easy-to-grasp concepts
(in the GP sense). I wonder why you feel this way.
--
Eric Niebler
BoostPro Computing
http://www.boostpro.com
It seems to me a segmented iterator could provide a generic way to, for example, vectorize an algorithm on a deque by letting me unpack the pointer and distance to end of the current contiguous address range. I could implement A?B*C? for each contiguous range without needing access to the internals of the dequeue data structure, however, the dequeue doesn't provide a segmented iterator interface, nor does anything else that I'm aware of, and I don't see that changing, so I don't see the point in implemented vector accelerated algorithms with segmented iterator based interfaces to work with all the segmented iteratior data types that don't exist and probably never will. My work with the vectorized A?B*C? using pointer based interface and generic call back functor argument showed that it could match the performance of hand crafted assembly with all of my loop unrolling, prefetching and similar handiwork. There was a lot of special casing for unaligned begin and end of source
and destinatin address ranges. I made great use of it vectorizing some pretty complex general C-style code and got nice speedups for data that was unaligned and address ranges that were often quite small. It was a very productive abstraction and easy to apply. The benefit of abstracting away the complexity of memory access was a big win. Regular iterators didn't perform as well, but I did implement them and use them because they were better than nothing. I didn't look at segmented iterators. I think we might want a different abstraction than segmented iterators to vectorize algorithms that are abstracted away from data structures. Perhaps we should think in terms of arrays instead of scalar elements as the smallest unit. Abstractions and semantics are important exactly because they dictate how you think about a problem. I want simple abstractions and semantics that make the problem at hand easy to think about. So far I haven't heard anyone explain how a segmented
iterator makes it easier to vectorize an algorithm and not harder.
Regards,
Luke
I'm not sure, Eric. It's just a gut feeling at this point. The overall
complexity makes me feel lost in a TMP forest that I want to step
out a bit to see the bigger picture.
Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net
_______________________________________________
I really feel the same way. I want simple things to be simple.
Regards,
Luke
We did our homework I can reassure you. We provide more than just
operators on these pack and we use proto to detect fused operations
seuqence and replace them before evaluation. As for swizzling and permute,
we have various potential interface for this and trying to settle on some.
Conversion are used through a simd::cast<T> operator.
As for prefetching, the current solutino is to provide a generic prefetch
function one can use. In NT2, those prefetch are estimated and inserted in
the array evaluation but here, we deal with a lower level.
> Another (particular ugly) use of SIMDs is to allow per-object aligned
> storage & load of structures into/outof binary array blobs. This would
> basically look like an allocator for an object that uses the SIMD
> extensions for loading/storing data-structures as binary blobs into and
> out of arrays. Have you considered an interface for this sort of usage?
I fail to understand your use case ? We have some SIMD compatible
allocator already but what d you make reference to ?
> We did our homework I can reassure you. We provide more than just
> operators on these pack and we use proto to detect fused operations
> seuqence and replace them before evaluation. As for swizzling and permute,
> we have various potential interface for this and trying to settle on some.
> Conversion are used through a simd::cast<T> operator.
Hi,
I would like to add pointer to your lib in the library under construction page http://svn.boost.org/trac/boost/wiki/LibrariesUnderConstruction. Please could you gice some references where we can download the SIMD library and a link to the documentation?
Best,
Vicente
It's currently part of NT2 and need to be extracted and boostified.
NT2 on-going development (and I emphasize this *on-going*) is on
github.com/jfalcou/nt2.
I am currently slowly refactoring/pushing our old code base into the new
system so stuff are in flux atm.
The main area is include/nt2/sdk/simd
Good; that's not what they're for ;-)
> 1) It doesn't play nice with standard C++ looping constructs,
> specifically for loops.
The whole point of hierarchical algorithms is to avoid flat loops for
segmented data structures.
> 2) I have yet to see a benchmark comparison between segmented
> iterators and alternatives (other than a side-note claim in the
> paper). I'm open to looking at some specific comparisons if one has
> already done this, and this is something I'm interested enough in that
> I might try to do myself. Seems to me a more general solution is a
> "flat_multirange_iterator" templated on the outer range and the
> "segmentation depth". I.e., a flat_multirange_iterator< vector<
> vector<T> >, 2 > would iterate over the T's.
That's gonna be slow. Lots of testing and branching.
"Note that the joined range incurs a performance cost due to the
need to check if the end of a range has been reached internally
during traversal."
> To address your claim that "that's the same problem ... that segmented
> iterators are solving", what I meant by a join_range or joint_range
> is, essentially, a flattened out Boost.Fusion sequence of ranges of
> *differing* types (but with compatible value_type's and
> references). E.g., the thing that boost::join [1] returns.
> Segmented iterators can't deal with such ranges (at least with a
> strict reading of Austern's paper),
I don't see why you say that.
> and I think such ranges are closer to what Mathias has in mind (to
> be verified......). But perhaps I'm just not seeing your
> generalization beyond the immediate scope of the Austern paper?
>
> - Jeff
>
> [1]
> http://www.boost.org/doc/libs/1_44_0/libs/range/doc/html/range/reference/utilities/join.html
Looks like this doc is missing something about the value/reference
type requirements, BTW.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
Yes, they are designed to deal with a segmentation depth that's known
at compile-time.
> The nodes of the trees had to be moved to be the first in the local
> sequence because only local iterators are dereferenced in
> segmented-aware algorithms. This meant that segmented iterators
> weren't completely overhead-free.
Right. Trees have any storage at internal nodes, which is another way
they differ from the kinds of structures segmented iterators work on.
> This doesn't mean segmented iterators aren't good, just that the
> use-case that I tried didn't seem to fit it.
Makes sense.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
As I understand it he's not doing anything like a classic STL
iterator, at all.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
> I've only had a chance to read the Austern paper "Segmented
> Iterators and Hierarchical Algorithms" once. I agree that the
> segmented iterators are rather appealing as an interface to
> packs-of-pods-datatype. However, I'm not sure the conceptual
> interface to segmented iterators are rich enough. Luke Simonson has
> more experience with the construction of an iterator pattern around
> SIMDs used in a high-performance system than I do. However I seem I
> to remember that the pattern for SIMD iteration always looked like
> this:
>
> Loop A?
> Loop B*
> Loop C?
>
> Where Loops A & C handle the unaligned start/end (if they exist) and
> Loop B* handles (0 or more of) the aligned packs of the
> sequence. Handling the edge-cases for A & C were definitely not the
> same --- there are memory exceptions that can occur at the
> end-of-range for Loop C that can't (or usually don't) occur for Loop
> A when doing unaligned loads/stores. I seem to remember that it was
> insufficient to just mark "not Loop B" but it was necessary to
> explicitly mark foreach operations as "Loop A" vs. "Loop C". It
> seems that this division of Loop A/B/C will need to be exposed
> through the iterator.
I think you should read the paper again and look at the implementation
of one of the algorithms there. They all have that structure.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
_______________________________________________
There is no support for catching such errors, at least there is none yet.
Translating them to C++ exceptions sounds like translating SIGABRT or
other errors to C++ exceptions to me, in a similar fashion to what
Visual C++ does with SEH; i.e. not a good idea.
On a technical note, it might be a bit complicated to do, since that
state is global so it could be hard in the presence of threads.
We'll nevertheless look into the details of this.
I think I missed something, then. Are we not applying sequential
algorithms (e.g., fill, for_each, ...) to hierarchical data structures?
>> 1) It doesn't play nice with standard C++ looping constructs,
>> specifically for loops.
>
> The whole point of hierarchical algorithms is to avoid flat loops for
> segmented data structures.
It certainly avoids flat loops, to the point of outright precluding
their use. I can add here a 1a) How would you go about iterating
through segmented data structures in parallel?
>> 2) I have yet to see a benchmark comparison between segmented
>> iterators and alternatives (other than a side-note claim in the
>> paper). I'm open to looking at some specific comparisons if one has
>> already done this, and this is something I'm interested enough in that
>> I might try to do myself. Seems to me a more general solution is a
>> "flat_multirange_iterator" templated on the outer range and the
>> "segmentation depth". I.e., a flat_multirange_iterator< vector<
>> vector<T> >, 2> would iterate over the T's.
>
> That's gonna be slow. Lots of testing and branching.
>
> "Note that the joined range incurs a performance cost due to the
> need to check if the end of a range has been reached internally
> during traversal."
First, that quote applies directly to a join_iterator, though I'll give
you that it *may* apply to my hypothetical a flat_multirange_iterator.
What I'm having trouble with is it *looks* to me like you have the same
tests and branches with a segmented iterator as with a
flat_multirange_iterator. Comparing the nested for-loop on page 5 of
the Austern paper (which is what a segmented iterator would give you)
with the flattened while-loop on page 6 (let's replace the
while-condition with true put an inserted break statement when !(node !=
nodes.end())), they both have the same number of iterator increments,
comparisons, and assignments. What am I missing? Are compilers just
able to produce better code for nested for-loops than flattened loops?
>> To address your claim that "that's the same problem ... that segmented
>> iterators are solving", what I meant by a join_range or joint_range
>> is, essentially, a flattened out Boost.Fusion sequence of ranges of
>> *differing* types (but with compatible value_type's and
>> references). E.g., the thing that boost::join [1] returns.
>> Segmented iterators can't deal with such ranges (at least with a
>> strict reading of Austern's paper),
>
> I don't see why you say that.
Show me what the equivalent nested for-loop looks like if a join_range
can be "segmentedly" iterated over. If I tried, I'd be inclined to
write a for-loop over a Boost.Fusion sequence. If you want to include a
join_iterator within the family of segmented iterators, you'd have to
enrich the interface of segmented iterators to include Boost.Fusion
sequences. Is that desirable? Maybe?
[...]
>> http://www.boost.org/doc/libs/1_44_0/libs/range/doc/html/range/reference/utilities/join.html
>
> Looks like this doc is missing something about the value/reference
> type requirements, BTW.
Yes :(
- Jeff
Actually, I just realized that making those loop modifications is
"cheating" :/ Unless you did some kind of Java-style iteration.
Whoops,
> Show me what the equivalent nested for-loop looks like if a join_range
> can be "segmentedly" iterated over. If I tried, I'd be inclined to write
> a for-loop over a Boost.Fusion sequence. If you want to include a
> join_iterator within the family of segmented iterators, you'd have to
> enrich the interface of segmented iterators to include Boost.Fusion
> sequences. Is that desirable? Maybe?
It seems useful, but it wouldn't be usable for the main reason segmented
iterators were created, deque and trees, and that concept wouldn't
really be interoperable with segmented iterators.