I found the execution time of the latter to be higher than the former, as if many DO loops were executed instead of just one. Why use global array operations then? Isn't it better to stick to plain old DO loops? Thanks,
On 17 Jun., 17:41, deltaquattro <deltaquat...@gmail.com> wrote:
> Hi,
> I was wondering whether global array operations, introduced in f90,
> can have a negative impact on performance.
> [snip]
> I found the execution time of the latter to be higher than the former,
> as if many DO loops were executed instead of just one. Why use
> global array operations then? Isn't it better to stick to plain old DO
> loops? Thanks,
> regards,
> deltaquattro
This is quite a strange observation and raises some questions:
1) What optimisation options did you use?
2) Which compiler did you use? The gcc 4.0 and 4.1 Fortran compilers for instance are still pretty much in their infancy, so one would expect bugs and strange behaviour there. Use 4.2 or 4.3 instead, if you use gfortran.
3) How did you measure execution time? I find that accurate timing on a computer is a nontrivial task. The 'time' command on my machine shows up to 200% variance. I can only assume you used some clever and appropriately precise way of measuring.
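To make the timing question concrete, here is a minimal sketch of one common way to reduce measurement noise: repeat the timed region many times and keep the best of several runs. The array sizes, the repeat counts and the program name are assumptions chosen for illustration; they are not taken from the original post.

```fortran
program time_init
  implicit none
  ! Hypothetical sizes and repeat counts, chosen only for illustration.
  integer, parameter :: nr = 200, ntheta = 200
  integer, parameter :: nrep = 20, ninner = 1000
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: dp = kind(1.0d0)
  real :: r(0:nr+1, ntheta)
  integer(i8) :: t0, t1, rate
  real(dp) :: elapsed, best
  integer :: rep, k

  call system_clock(count_rate=rate)
  best = huge(best)
  do rep = 1, nrep
     call system_clock(t0)
     ! Repeat the tiny operation so that it lasts much longer than one clock tick.
     do k = 1, ninner
        r(0, :)    = real(k)
        r(nr+1, :) = real(k)
     end do
     call system_clock(t1)
     elapsed = real(t1 - t0, dp) / real(rate, dp)
     best = min(best, elapsed)
  end do

  ! Print an element of r so the compiler cannot discard the timed work.
  print *, 'best of', nrep, 'runs (s):', best, ' check:', r(0, 1)
end program time_init
```

Taking the minimum rather than the mean is a design choice: other processes on the machine can only make a run slower, so the fastest run is usually the least disturbed one.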
I'm not a compiler specialist but AFAIK, array operations should not usually be slower than explicit loop constructs.
Why? When using array operations like, say, x = MATMUL(A,b) in contrast to two nested DO-loops, the compiler has a greater amount of information at hand about what it is you want to do, which allows it to use more aggressive optimisation methods to generate code, or to generate calls to (more or less optimised) runtime libraries; the latter is done by all compilers I know. Additionally, the gfortran compiler has the '-fexternal-blas' option, which tells the compiler to automagically generate calls to an optimised vendor BLAS for some matrix operations such as MATMUL, instead of using the runtime library. I've never tried this, but I'd expect a tuned ATLAS library to speed things up nicely for large matrices.
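As a small illustration of the MATMUL point, the sketch below computes the same matrix-vector product once with two nested DO loops and once with the intrinsic. The size n and all variable names are made up for the example; which version is faster depends on the compiler and options, so the program only checks that the two agree.

```fortran
program matvec_demo
  implicit none
  integer, parameter :: n = 4          ! hypothetical size
  real :: a(n, n), b(n), x_loop(n), x_intr(n)
  integer :: i, j

  call random_number(a)
  call random_number(b)

  ! Explicit version: two nested DO loops.
  ! The inner loop runs down a column of a, i.e. through contiguous memory.
  x_loop = 0.0
  do j = 1, n
     do i = 1, n
        x_loop(i) = x_loop(i) + a(i, j) * b(j)
     end do
  end do

  ! Array-intrinsic version: the compiler may inline this or call a library.
  x_intr = matmul(a, b)

  ! The two results should agree up to rounding.
  print *, 'max abs difference:', maxval(abs(x_loop - x_intr))
end program matvec_demo
```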
A second benefit of array operations is their conciseness. Take the MATMUL example again: a single call as opposed to two nested DO-loops. Or think of copying part of an array into another array, anything really! IMHO, a lot of scientific code completely disregards maintainability issues for the sake of the highest possible degree of code optimisation. Using array operations makes your code more concise, more readable and therefore easier to maintain in the long run! It *should* also improve, or at least not hurt, performance.
> do i=1, ntheta
>    r(0,i) = rhub
>    r(nr+1,i) = rmax
>    dt(0,i) = 0.0
>    dft(0,i) = 0.0
>    dfr(0,i) = 0.0
>    dt(nr+1,i) = 0.0
>    dft(nr+1,i) = 0.0
>    dfr(nr+1,i) = 0.0
> end do
>
> r(0,:) = rhub
> r(nr+1,:) = rmax
> dt(0,:) = 0.0
> dft(0,:) = 0.0
> dfr(0,:) = 0.0
> dt(nr+1,:) = 0.0
> dft(nr+1,:) = 0.0
> dfr(nr+1,:) = 0.0
>
> I found the execution time of the latter to be higher than the former,
> as if many DO loops were executed instead of just one. Why use
> global array operations then? Isn't it better to stick to plain old DO
> loops? Thanks,
Normally an initialization loop like this one would be faster as separate loops than one fused loop because it's faster to access memory consecutively rather than jumping around as implied by the fused loop. However in this case the loops appear to be setting boundary values so they are traversing rows rather than columns of the arrays. As a consequence the code jumps around in memory no matter what the compiler does and loop fusion can win out because it implies less loop overhead which otherwise would be of negligible importance compared to memory access considerations (assuming that the data set is too large to fit in cache).
One thing to investigate is whether the r(i,j), dt(i,j), dft(i,j), and dfr(i,j) always get accessed together. If so, you could group them as a derived type and the above loop could go 4X as fast as the structure of arrays code listed above.
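A minimal sketch of the derived-type idea, assuming the four quantities really are always used together. The type name cell, the component layout and the sizes are hypothetical; only the names r, dt, dft, dfr, rhub, rmax, nr and ntheta come from the original post.

```fortran
program aos_demo
  implicit none
  integer, parameter :: nr = 8, ntheta = 5        ! hypothetical sizes
  real, parameter :: rhub = 0.1, rmax = 1.0        ! hypothetical values

  ! Hypothetical derived type: the four values for one grid point sit
  ! next to each other in memory.
  type :: cell
     real :: r, dt, dft, dfr
  end type cell

  type(cell) :: c(0:nr+1, ntheta)
  integer :: i

  ! One pass over the boundary rows touches two memory locations per i
  ! (one per boundary cell) instead of eight in the separate-array layout.
  do i = 1, ntheta
     c(0, i)    = cell(rhub, 0.0, 0.0, 0.0)
     c(nr+1, i) = cell(rmax, 0.0, 0.0, 0.0)
  end do

  print *, c(0, 1)%r, c(nr+1, ntheta)%r
end program aos_demo
```

The trade-off is that code which only needs one of the four fields now strides over the other three, so this layout helps only if the fields are accessed together.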
--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
deltaquattro <deltaquat...@gmail.com> wrote:
> I was wondering whether global array operations, introduced in f90,
> can have a negative impact on performance. Compare:
[elided initialization with DO loops and whole array operations]
> I found the execution time of the latter to be higher than the former,
> as if many DO loops were executed instead of just one. Why use
> global array operations then? Isn't it better to stick to plain old DO
> loops? Thanks,
Note that the usual terminology is something more like "whole array operations" or even just an unmodified "array operations" instead of "global".
The main reasons are clarity and conciseness. If it doesn't help clarity and conciseness, don't do it. That is, no doubt, an oversimplification; there are exceptions, etc. But it's a good first approximation. Every once in a while they might also get you faster execution, but if that is your primary reason for using them, and you don't have specific knowledge of exactly why to expect faster execution from your particular case, then your efforts are probably misplaced.
Execution time is actually not the sole measure of code "goodness". In many cases, it isn't even particularly high on the list of important things. Sometimes it isn't on the list at all. Other times it is at the very top of the list. All generalizations are false, your mileage may vary, etc.
In that regard, is execution time of an initialization such as this actually significant in your code? While possible, that would be unusual, and might suggest that the choice of algorithms is less than ideal. There can be efficient algorithms like that, but they are rare. As Dennis says, it can be tricky to even measure execution times precisely enough to time initializations like this. I'm supposing that perhaps you are just using this as an example of more "interesting" cases.
In answer to Dennis, by the way, it is *VERY* common for whole array operations to be slower than DO loops. No, it is not at all strange. It is much closer to the usual state of affairs. There are a whole host of reasons.
1. Compilers have had over 5 decades of time to develop techniques of optimizing loops. Progress has been made in that time. There has only been about a decade or two (some work preceded the f90 standard; other compilers didn't really start until later) of significant work on optimizing array expressions. Things have improved and are still improving, but it just is not at the level of experience of DO-loop optimization.
2. Array temporaries are often a big deal in whole-array expressions. A naive (aka straightforward) application of the rules very often involves such array temporaries, which are expensive in time. The compiler has to do a fair amount of work to figure out whether they can be elided. See point 1. That's probably not the case for your example, but it is a common one.
3. Your example illustrates the problem of "loop fusion". The naive (again aka straightforward) application of the rules for your code example *DOES* imply a separate loop for each array operation (complete with all loop overhead). That's how the operations are defined. It is an optimization for the compiler to recognize when it can usefully fuse these multiple loops. See point 1.
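Point 3 can be sketched as follows. Taken literally, two whole-array statements such as dt(0,:) = 0.0 and dft(0,:) = 0.0 correspond to two separate loops; fusing them into one is an optimisation the compiler may or may not perform. The sizes and the second pair of arrays (dt2, dft2) are assumptions added so the two forms can be compared.

```fortran
program fusion_demo
  implicit none
  integer, parameter :: nr = 8, ntheta = 5        ! hypothetical sizes
  real :: dt(0:nr+1, ntheta),  dft(0:nr+1, ntheta)
  real :: dt2(0:nr+1, ntheta), dft2(0:nr+1, ntheta)
  integer :: i

  dt = 1.0; dft = 1.0; dt2 = 1.0; dft2 = 1.0

  ! Unfused form: what  dt(0,:) = 0.0 ; dft(0,:) = 0.0  means when each
  ! array statement is translated on its own, with its own loop overhead.
  do i = 1, ntheta
     dt(0, i) = 0.0
  end do
  do i = 1, ntheta
     dft(0, i) = 0.0
  end do

  ! Fused form: a single loop doing both assignments per iteration,
  ! which is what the hand-written DO loop in the original post does.
  do i = 1, ntheta
     dt2(0, i)  = 0.0
     dft2(0, i) = 0.0
  end do

  ! Both forms produce identical arrays, so this prints T.
  print *, all(dt == dt2) .and. all(dft == dft2)
end program fusion_demo
```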
--
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain
On 17 Jun., 20:31, nos...@see.signature (Richard Maine) wrote:
> [snipety-snip]
> In answer to Dennis, by the way, it is *VERY* common for whole array > operations to be slower than DO loops. No, it is not all all strange. It > is much closer to the usual state of afairs. There are a whole host of > reasons.
Beats me. After all, when using array operations you have additional information either readily available, or you are able to extract it fairly easily, which is not always the case in DO loops. Talk about aliasing, strides, contiguous memory locations etc. But then again, "See point 1". I actually find this hard to believe, but given that I'm rather new to the delightful post-77 Fortran world and haven't really done any serious benchmarking on array operations vs DO loops, I'll gladly trust your judgement on this. Thanks for enlightening us!
> 1. Compilers have had over 5 decades of time to develop techniques of > optimizing loops. Progress has been made in that time. There has only > been about a decade or two (some work preceeded the f90 standard; other > compilers didn't really start until later) of significant work on > optimizing array expressions. Things have improved and are still > improving, but it just is not at the level of experience of DO-loop > optimization.
OK, here's my newfound corner of gcc development that I feel like doing, as soon as I have more time on my hands than right now. After all, despite my earlier ramblings about conciseness and maintainability, performance DOES matter in many cases that are relevant to me :)
> 2. Array temporaries are often a big deal in whole-array expressions. A > naive (aka straightforward) applicaion of the rules very often involves > such array temporaries, which are expensive in time. The compiler has to > do a fair amount of work to figure out whether they can be elided. See > point 1. That's probably not the case for your example, but it is a > common one.
The Intel compiler (10.1, maybe earlier versions as well) throws a warning at runtime if it finds itself needing to create an array temporary; I found myself changing pieces of my code due to those warnings. Gonna have a look if the gfortran guys already have a feature request about this...
> 3. Your example illustrates the problem of "loop fusion". The naive > (again aka straightforward) application of the rules for your code > example *DOES* imply separate loop for each array operation (complete > with all loop overhead). That's how the operations are defined. It is an > optimization for the compiler to recognize when it can usefully fuse > these multiple loops. See point 1.
In article <9e1269be-de8c-4971-857d-e95bf639d...@i76g2000hsf.googlegroups.com>, Dennis Wassel <dennis.was...@googlemail.com> writes:
> On 17 Jun., 20:31, nos...@see.signature (Richard Maine) wrote:
>> 1. Compilers have had over 5 decades of time to develop techniques of >> optimizing loops. Progress has been made in that time. There has only >> been about a decade or two (some work preceeded the f90 standard; other >> compilers didn't really start until later) of significant work on >> optimizing array expressions. Things have improved and are still >> improving, but it just is not at the level of experience of DO-loop >> optimization.
> OK, here's my newfound corner of gcc development that I feel like > doing, as soon as I have more time on my hands than right now. After > all, despite my earlier ramblings about conciseness and > maintainability, performance DOES matter in many cases that are > relevant to me :)
For starters, you can see what gfortran does by using the -fdump-tree-original option. Try it with
subroutine po(x,y)
   real x(3,3), y(3,3)
   x = 1.
   y = 0.
   x = matmul(x,y)
end subroutine po
If you're really curious about the internal goop, use -fdump-tree-all.
>> 2. Array temporaries are often a big deal in whole-array expressions. A >> naive (aka straightforward) applicaion of the rules very often involves >> such array temporaries, which are expensive in time. The compiler has to >> do a fair amount of work to figure out whether they can be elided. See >> point 1. That's probably not the case for your example, but it is a >> common one.
> The Intel compiler (10.1, maybe earlier versions as well) throws a > warning at runtime if it finds itself needing to create an array > temporary; I found myself changing pieces of my code due to those > warnings. > Gonna have a look if the gfortran guys already have a feature request > about this...
There is currently no warning and AFAIK no request for such a feature. gfortran has fairly decent dependency analysis, but in certain situations it will err on the safe side and generate a temporary array even if it isn't necessarily needed.
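To show why a compiler sometimes needs a temporary at all, here is a small sketch with an overlapping array section. In an array assignment the right-hand side is conceptually evaluated completely before anything is stored, so a straightforward forward loop without a temporary gives a different answer. The array size and names are made up for the example.

```fortran
program temp_demo
  implicit none
  integer, parameter :: n = 5          ! hypothetical size
  integer :: a(n), b(n), i

  a = (/ (i, i = 1, n) /)              ! a = 1 2 3 4 5
  b = a

  ! Array assignment: the right-hand side is conceptually evaluated in
  ! full before any element is stored, so old values are shifted by one.
  a(2:n) = a(1:n-1)                    ! a becomes 1 1 2 3 4

  ! Naive forward loop with no temporary: each store clobbers the value
  ! that the next iteration reads.
  do i = 2, n
     b(i) = b(i-1)                     ! b becomes 1 1 1 1 1
  end do

  print *, a
  print *, b
end program temp_demo
```

In this particular case a compiler can avoid the temporary by running the loop backwards; proving that such a trick is safe is the dependency analysis mentioned above.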
On 17 Jun., 22:41, ka...@troutmask.apl.washington.edu (Steven G. Kargl) wrote:
> In article <9e1269be-de8c-4971-857d-e95bf639d...@i76g2000hsf.googlegroups.com>,
> For starters, you can see what gfortran does by using the > -fdump-tree-original option. Try it with
> subroutine po(x,y)
>    real x(3,3), y(3,3)
>    x = 1.
>    y = 0.
>    x = matmul(x,y)
> end subroutine po
> If you're really curious about the internal goop, use -fdump-tree-all.
This sounds dangerously like "don't do this, unless you *really* want to". Right'o, gimme something to fill my umpteen MBs of terminal buffer :)
> There is currently no warning and AFAIK no request for such a feature. > gfortran has fairly decent dependency analysis, but in certain situation > it will err on the safe side and generate a temporary array even if > it isn't necessarily needed.
>>r(0,:) = rhub
>>r(nr+1,:) = rmax
>>dt(0,:) = 0.0
>>dft(0,:) = 0.0
>>dfr(0,:) = 0.0
>>dt(nr+1,:) = 0.0
>>dft(nr+1,:) = 0.0
>>dfr(nr+1,:) = 0.0
>>I found the execution time of the latter to be higher than the former,
>>as if many DO loops were executed instead of just one. Why use
>>global array operations then? Isn't it better to stick to plain old DO
>>loops? Thanks,
> Normally an initialization loop like this one would be faster as
> separate loops than one fused loop because it's faster to access
> memory consecutively rather than jumping around as implied by the
> fused loop. However in this case the loops appear to be setting
> boundary values so they are traversing rows rather than columns of
> the arrays. As a consequence the code jumps around in memory no
> matter what the compiler does and loop fusion can win out because
> it implies less loop overhead which otherwise would be of negligible
> importance compared to memory access considerations (assuming that
> the data set is too large to fit in cache).
The cache effect can be complicated in cases like this. If the different statements are on the same elements of the same array, then a single loop helps them stay in cache.
If speed is that important, you might try reversing the subscript order (in the whole program). Well, the general rule is to arrange the subscripts such that the leftmost subscript changes fastest in array operations. That is the order they are stored in memory, the order they will be done in array operations, and the order for I/O if just an array name is specified.
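The storage-order rule can be sketched like this: Fortran stores arrays column by column, so the leftmost subscript should vary in the innermost loop. The sizes n and m and the fill values are assumptions made for the example.

```fortran
program order_demo
  implicit none
  integer, parameter :: n = 4, m = 3   ! hypothetical sizes
  real :: a(n, m)
  integer :: i, j

  ! Leftmost subscript in the innermost loop: consecutive iterations
  ! touch consecutive memory locations.
  do j = 1, m
     do i = 1, n
        a(i, j) = real(i + n*(j - 1))
     end do
  end do

  ! Flattening the array shows the elements in storage order:
  ! a(1,1), a(2,1), ..., a(n,1), a(1,2), ...
  print *, reshape(a, (/ n*m /))

  ! A row section such as a(1,:) picks every n-th stored element,
  ! which is the access pattern of the boundary assignments r(0,:) etc.
  print *, a(1, :)
end program order_demo
```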
> One thing to investigate is whether the r(i,j), dt(i,j), dft(i,j), > and dfr(i,j) always get accessed together. If so, you could group > them as a derived type and the above loop could go 4X as fast as > the structure of arrays code listed above.
The old struct of array vs. array of struct trick.
(snip, and previously snipped DO vs. array expressions)
> I'm not a compiler specialist but AFAIK, array operations should not > usually be slower than explicit loop constructs.
To make a fair comparison, it should be separate DO loops vs. array operations, and separate DO loops vs. a single DO loop. Then you can separate the difference due to memory access patterns and actual instructions.
My usual rule is that simple array operations are better than (or at least as good as) DO loops, but more complicated ones are slower. (Especially if temporary arrays are used.)
Craig Powers wrote:
> glen herrmannsfeldt wrote:
>> If speed is that important, you might try reversing the
>> subscript order (in the whole program).
> I suspect, given the context, that outside of this initialization, the
> subscripts are accessed in the correct order.
That is the problem with posting only a small part.
Though I might wonder if this is such a small part, how one could notice the speed difference. It seemed worth a reminder, just in case.
On Jun 17, 5:36 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> Dennis Wassel wrote:
> (snip, and previously snipped DO vs. array expressions)
> > I'm not a compiler specialist but AFAIK, array operations should not > > usually be slower than explicit loop constructs.
> To make a fair comparison, it should be separate DO loops > vs. array operations, and separate DO loops vs. > a single DO loop. Then you can separate the difference > due to memory access patterns and actual instructions.
> My usual rule is that simple array operations are better than > (or at least as good as) DO loops, but more complicated ones > are slower. (Especially if temporary arrays are used.)
An assignment of a scalar to an array should never use a temporary array. Even the simplest compiler should get that right.
On 17 Giu, 19:21, Dennis Wassel <dennis.was...@googlemail.com> wrote:
> On 17 Jun., 17:41, deltaquattro <deltaquat...@gmail.com> wrote: [..]
> This is quite a strange observation and raises some questions:
> 1) What optimisation options did you use?
None.
> 2) Which compiler did you use?
Compaq Visual Fortran.
> 3) How did you measure execution time? > I find that accuarate timing on a computer is a nontrivial task. The > 'time' command on my machine shows up to 200% variance. I can only > assume you used some clever and appropriately precise way of > measuring.
When Richard told you exactly the same, you didn't ask which "clever and appropriately precise way of measuring" he used, but you quickly thanked him for "enlightening us". I can only assume that for you using "some clever and appropriate precise way of measuring" means asking for Richard's agreement on the subject.
> I'm not a compiler specialist but AFAIK, array operations should not > usually be slower than explicit loop constructs.
Well, then you could just try some example for yourself and see what happens. My experience comes from different CFD codes I wrote using whole array operations and single DO loops, and with my compiler I often found significant execution time differences in real life codes on real life cases. That's enough for me to start asking questions on the ng, and yes thanks, I know enough Fortran to reverse index ordering outside initialization loops: that was just an example.
On 18 Jun., 10:20, deltaquattro <deltaquat...@gmail.com> wrote:
> > 3) How did you measure execution time? > > I find that accuarate timing on a computer is a nontrivial task. The > > 'time' command on my machine shows up to 200% variance. I can only > > assume you used some clever and appropriately precise way of > > measuring.
> When Richard told you exactly the same, you didn't ask which "clever > and appropriately precise way of measuring" he used, but you quickly > thanked him for "enlightening us". I can only assume that for you > using "some clever and appropriate precise way of measuring" means > asking for Richard's agreement on the subject.
Hey, easy on the noobs! That's outright flaming, when all I was trying to do is to help! Had you earlier made any suggestion that you know your way around Fortran coding very well, thank you, I might as well have left answering to the experts. Instead you *now* post something I can summarise as "STFU n00b", which tends to scare away new members. Not gonna cow me down, but others would have left for good by now.
But then again, who am I telling? Of course you know all that!
> > I'm not a compiler specialist but AFAIK, array operations should not > > usually be slower than explicit loop constructs.
> Well, then you could just try some example for yourself and see what > happens. My experience comes from different CFD codes I wrote using > whole array operations and single DO loops, and with my compiler I > often found significant execution time differences in real life codes > on real life cases.
> On 18 Jun., 10:20, deltaquattro <deltaquat...@gmail.com> wrote:
> > > 3) How did you measure execution time? > > > I find that accuarate timing on a computer is a nontrivial task. The > > > 'time' command on my machine shows up to 200% variance. I can only > > > assume you used some clever and appropriately precise way of > > > measuring.
> > When Richard told you exactly the same, you didn't ask which "clever > > and appropriately precise way of measuring" he used, but you quickly > > thanked him for "enlightening us". I can only assume that for you > > using "some clever and appropriate precise way of measuring" means > > asking for Richard's agreement on the subject.
> Hey, easy on the noobs! > That's outright flaming, when all I was trying to do is to help! > Had you earlier made any suggestion that you know yourself around > Fortran coding very well, thank you, I might as well have left > answering to the experts. > Instead you *now* post something I can summarise as "STFU n00b", which > tends to scare away new members. Not gonna cow me down, but others > would have left for good by now.
> But then again, who am I telling? Of course you know all that!
I'm not flaming and I don't think you are a novice: on the contrary, since you talked about the possibility of you being involved in gcc development, I think you know Fortran way better than me. Anyway, I would never say "STFU n00b" to anyone under any circumstances, if I understand what that means (I'm not sure, because English is not my mother tongue and I don't live in an English speaking country). The point is that when I posted a genuine doubt about the speed of whole array operations, you replied asking me if I had been clever enough in doing my timings. When Richard confirmed my doubts, you were ready to accept his words. To me, it seemed as if you were dismissing my doubts just because I'm not an expert like Richard, robin, glenn, Paul, Steve and all the many other guys on this newsgroup. If this isn't so, then I apologize to you and to the rest of the newsgroup: I have misinterpreted your words, maybe also because of my imperfect knowledge of English.
On 18 Jun., 14:18, deltaquattro <deltaquat...@gmail.com> wrote:
> On 18 Giu, 11:41, Dennis Wassel <dennis.was...@googlemail.com> wrote:
> I'm not flaming and I don't think you are a novice: on the contrary, > since you talked about the possibility of you being involved in gcc > development, I think you know Fortran way better than me. > Anyway, I would never say "STFU n00b" to anyone under any > circumstances, if I understand what that means (I'm not sure since > because English is not my mother tongue, and I don't live in an > English speaking country). > The point is that when I posted a genuine doubt about speed of whole > array operations, you replied asking me if I have been clever enough > in doing my timings. When Richard confirmed my doubts, you were ready > to accept his words. To me, it seemed as if you were dismissing my > doubts just because I'm not an expert as Richard, robin, glenn, Paul, > Steve and all the many other guys on this newsgroup.
*sigh* I'm afraid it sounded exactly like that.
Sorry if I sounded as if I just dismissed your doubts as pointless, I am in no position to do that; I wanted to point out that proper timing is nontrivial, and whether you're aware of that. Apparently you are! With Richard, his post (and google, I admit!) gives me some idea about his background, so I'll tend to trust in what this guy says.
Besides, the fact that array operations may indeed be slower than explicit loops was (and still is) hard for me to believe. Learned something today!
> If this isn't so, then I apologize to you and to the rest of the > newsgroup: I have misinterpreted your words, maybe also because of my > imperfect knowledge of English.
> Best Regards,
> deltaquattro
No offence meant, and none taken (I hope)!
BTT: Is (compiler) optimisation of array operations that much of a nontrivial task or just something that nobody has come around to doing, yet? Judging from their mailing list, gfortran seems to have many construction sites and a number of them apparently with higher priorities.
But deltaquattro reported this against Compaq, and I also remember someone from the gfortran mailing list mentioning that ifort shines on DO-loop optimisation. Alas, I really feel like digging into this, but got to get my current project done first.
deltaquattro <deltaquat...@gmail.com> wrote:
> On 17 Giu, 19:21, Dennis Wassel <dennis.was...@googlemail.com> wrote:
> > On 17 Jun., 17:41, deltaquattro <deltaquat...@gmail.com> wrote:
> > 3) How did you measure execution time?
> > I find that accurate timing on a computer is a nontrivial task. The
> > 'time' command on my machine shows up to 200% variance. I can only
> > assume you used some clever and appropriately precise way of
> > measuring.
> When Richard told you exactly the same, you didn't ask which "clever > and appropriately precise way of measuring" he used, but you quickly > thanked him for "enlightening us".
As long as I'm being mentioned by name, I'll note that I thought I was agreeing with Dennis on this particular point/question. I don't recall answering it at all, but rather reinforcing it when I said:
>>> As Dennis says, it can be tricky to even measure execution times
>>> precisely enough to time initializations like this.
I did disagree on other points, including the larger one about array operations often being slower (or not). But for the specific case here, where it is just an initialization, and thus presumably pretty quick, I had exactly the same question, even if I stated it implicitly instead of with a question mark. I just didn't really need to ask because I agreed with your (deltaquattro's) general observation, even though I wondered about how this particular case was measured.
>>> If speed is that important, you might try reversing the >>> subscript order (in the whole program).
>> I suspect, given the context, that outside of this initialization, the >> subscripts are accessed in the correct order.
> That is the problem with posting only a small part.
> Though I might wonder if this is such a small part, how one > could notice the speed difference.
It might be part of an outer loop that is executed many times. I had something of this nature happen in one of my programs---not down to the subscripting, but simple initialization-to-zero. A bunch of stuff needed to be initialized to zero on each pass through the outer loop, and doing it using DO loops (which also, incidentally, cut out initialization of unused portions of a ragged array) realized a substantial savings in program execution time for certain input sets.
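A rough sketch of the ragged-array case described above: only the used part of each column is reset inside the outer loop, instead of zeroing the whole array with w = 0.0. The array name w, the per-column lengths in used, and the sizes are all hypothetical.

```fortran
program ragged_demo
  implicit none
  integer, parameter :: maxlen = 6, ncol = 4   ! hypothetical sizes
  integer :: used(ncol)                        ! hypothetical per-column lengths
  real :: w(maxlen, ncol)
  integer :: i, j

  used = (/ 2, 6, 1, 3 /)
  w = -1.0          ! sentinel so the untouched elements stay visible

  ! Reset only the part of each column that is actually in use.
  ! A whole-array  w = 0.0  would write all maxlen*ncol elements.
  do j = 1, ncol
     do i = 1, used(j)
        w(i, j) = 0.0
     end do
  end do

  ! 2 + 6 + 1 + 3 = 12 of the 24 elements were written.
  print *, 'elements zeroed:', count(w == 0.0), 'of', size(w)
end program ragged_demo
```

The same restriction can also be written with array sections, w(1:used(j), j) = 0.0, inside the loop over j; the saving comes from touching less memory, not from the DO syntax itself.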
> Is (compiler) optimisation of array operations that much of a > nontrivial task or just something that nobody has come around to > doing, yet? Judging from their mailing list, gfortran seems to have > many construction sites and a number of them apparently with higher > priorities.
If you do exactly the same thing in both cases, the compiler should do pretty well, except in cases that modify the source array. There are cases where you know that no array elements are used after they are modified (using DO loops), but the compiler doesn't (using array expressions).
Using vector subscripts as an example, say you have arrays X and Y, both dimensioned N, where X has a permutation of the numbers from 1 to N.
DO I=1,N
   Y(I)=Y(X(I))
ENDDO
reorders Y based on the permutation, where you know that the values in X are unique.
Y=Y(X)
does it using vector subscripts, but the compiler does not know that the values are unique, so a temporary array is needed. If you read in X from a file, there is no way that the compiler could possibly know the values are unique. If you generated X directly, it is unlikely that even the best optimizer would figure it out.
> > Is (compiler) optimisation of array operations that much of a > > nontrivial task or just something that nobody has come around to > > doing, yet? Judging from their mailing list, gfortran seems to have > > many construction sites and a number of them apparently with higher > > priorities.
> If you do exactly the same thing in both cases, the compiler > should do pretty well, except in cases that modify the source array. > There are cases where you know that no array elements are used > after they are modified (using DO loops), but the compiler doesn't > (using array expressions).
> Using vector subscripts as an example, say you have arrays X and Y, > both dimensioned N, where X has a permutation of the numbers from > 1 to N.
> DO I=1,N
>    Y(I)=Y(X(I))
> ENDDO
*Errrrrm* Am I getting something wrong? This won't work, consider for instance X = (2 1). Except for the special case X(I) >= I \forall I, this will produce garbage (and for permutations, that special case is only satisfied by the identity). Point is, I'm fairly sure you need temporaries here as well.
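The X = (2 1) counterexample can be checked with a few lines. The values stored in Y are made up; the program runs the in-place DO loop and the vector-subscript assignment on the same data and prints both results.

```fortran
program perm_demo
  implicit none
  integer, parameter :: n = 2
  integer :: x(n), i
  real :: y_loop(n), y_vec(n)

  x = (/ 2, 1 /)                 ! the permutation from the counterexample
  y_loop = (/ 10.0, 20.0 /)      ! hypothetical data
  y_vec  = y_loop

  ! In-place DO loop: i=1 stores y(2) into y(1); i=2 then reads the
  ! already overwritten y(1), so the original y(1) is lost.
  do i = 1, n
     y_loop(i) = y_loop(x(i))
  end do

  ! Vector-subscript assignment: the right-hand side is evaluated
  ! before any store, so this is a true permutation.
  y_vec = y_vec(x)

  print *, y_loop                ! 20.0 20.0
  print *, y_vec                 ! 20.0 10.0
end program perm_demo
```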
> reorders Y based on the permutation, where you know that > the values in X are unique.
> Y=Y(X)
>
> does it using vector subscripts, but the compiler does not
> know that the values are unique, so a temporary array is
> needed. If you read in X from a file, there is no way
> that the compiler could possibly know the values are unique.
> If you generated X directly, it is unlikely that even the
> best optimizer would figure it out.