Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

defined operator & assignment speed & memory usage problem

6 views

Damian

Jul 12, 2008, 11:49:13

This is a slightly modified reposting of an earlier posting that
received no replies:

C++ has a construct referred to as expression templates that make it
possible to greatly increase the speed and decrease the memory
utilization of programs that use overloaded arithmetic ("defined
operators and assignments" in Fortran). While I don't know enough to
advocate adopting the C++ approach, I do think that not having a
solution to the problem expression templates solves puts Fortran at a
very significant disadvantage in performance and it seems to me that
performance is one of the most cited reasons for using Fortran.

Consider the following:

type(foo) :: a,b,c,d

d = a + b*c

where =, +, and * are a defined assignment and two defined operators.
Since many (most?) modern microprocessors can perform a multiply/add
combination in a single clock cycle, it's a travesty to have separate
procedures for + and *. But combining them into one procedure such
as

d = add_multiply(a,b,c)

is a travesty in terms of notational clarity in my view (particularly
for much longer expressions). Furthermore, the evaluation of the
first expression above will likely involve the creation of temporary
results that could easily be eliminated in the second expression. This
is especially important when each object contains large amounts of
data (gigabytes in my applications). In fact, all temporary results
can be eliminated by the following

call add_multiply_assign(d,a,b,c)

Additionally, there are issues related to cache locality that arise
if the derived type contains long arrays. The combined procedure can
use the result of a multiplication immediately for a subsequent
addition, rather than writing the result back to main memory inside
the * operator for later retrieval by the + operator.

As I understand them, C++ expression templates allow the programmer
to have the best of both worlds by telling the compiler to resolve
expressions of the first form above into those of the second form.
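As a rough illustration (mine, not from any of the C++ projects mentioned here), the mechanism can be sketched in a few dozen lines; the types `Vec`, `Mul`, and `AddMul` are invented for this example. The overloaded operators build lightweight proxies instead of computing anything, so the whole right-hand side of `d = a + b*c` collapses into one fused loop at assignment, with no temporary arrays:

```cpp
#include <cstddef>
#include <vector>

struct Vec;

// Proxy for the elementwise product b*c: stores references, computes nothing yet.
struct Mul {
    const Vec &b, &c;
    double operator[](std::size_t i) const;
};

// Proxy for a + (b*c).
struct AddMul {
    const Vec& a;
    Mul m;
    double operator[](std::size_t i) const;
};

struct Vec {
    std::vector<double> v;
    explicit Vec(std::size_t n, double x = 0.0) : v(n, x) {}
    double operator[](std::size_t i) const { return v[i]; }
    // Defined assignment from the proxy: one loop, one multiply-add per
    // element, and no intermediate array is ever allocated.
    Vec& operator=(const AddMul& e) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = e[i];
        return *this;
    }
};

inline double Mul::operator[](std::size_t i) const { return b[i] * c[i]; }
inline double AddMul::operator[](std::size_t i) const { return a[i] + m[i]; }

inline Mul    operator*(const Vec& b, const Vec& c) { return Mul{b, c}; }
inline AddMul operator+(const Vec& a, Mul m)        { return AddMul{a, m}; }
```

With this in place, `d = a + b*c` keeps the infix notation but compiles to a single loop, much like an explicit `call add_multiply_assign(d, a, b, c)` would.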

A current Fortran compiler might be able to combine operators
automatically during interprocedural optimization, but I don't know
enough to know what things I might mistakenly do to thwart that
process. For example, I tend to use deeply nested derived types:
derived types containing derived type components containing derived
type components... Will this make it harder for the compiler to
accomplish the requisite interprocedural optimizations?

It seems better to provide a language facility rather than to let this
be compiler-dependent. The differences in performance and memory
usage could be substantial. Can someone with knowledge of C++
expression templates explain the pros and cons? Is there another
approach that could provide the same benefits?

Damian

Dick Hendrickson

Jul 12, 2008, 14:27:35
Damian wrote:
> This is a slightly modified reposting of an earlier posting that
> received no replies:
>
> C++ has a construct referred to as expression templates that make it
> possible to greatly increase the speed and decrease the memory
> utilization of programs that use overloaded arithmetic ("defined
> operators and assignments" in Fortran). While I don't know enough to
> advocate adopting the C++ approach, I do think that not having a
> solution to the problem expression templates solves puts Fortran at a
> very significant disadvantage in performance and it seems to me that
> performance is one of the most cited reasons for using Fortran.
>
> Consider the following:
>
> type(foo) :: a,b,c,d
>
> d = a + b*c
>
> where =, +, and * are a defined assignment and two defined operators.
> Since many (most?) modern microprocessors can perform a multiply/add
> combination in a single clock cycle, it's a travesty to have separate
> procedures for + and *.
I don't know anything about C++ templates, and I'm not sure I
understand your question any better this time than I did the
first time. So, this might not be the best answer ;).

But, why do you think that expression will resolve into a
combined multiply add? Surely in any reasonable program
the derived types will have complicated insides and the
operating procedures will do more than just multiply
or add a single component.

I think the solution to the problem is to use a compiler that
does aggressive inlining of the derived type procedures
and then let normal optimization work. The drawback is
that you'd (almost for sure) have to compile the module at
the same time as the rest of the code, which can be a
compile time hog. But, compilers are now very good at
straight line code; generally they write better assembly
code than a human can.

Given that I don't see your example as being typical of a
real problem, I don't think a new syntax is the answer.

Dick Hendrickson

Richard Maine

Jul 12, 2008, 14:44:40
Dick Hendrickson <dick.hen...@att.net> wrote:

> But, why do you think that expression will resolve into a
> combined multiply add? Surely in any reasonable program
> the derived types will have complicated insides and the
> operating procedures will do more than just multiply
> or add a single component.

Particularly when, as

> Damian wrote:
> > ...especially important when each object contains large amounts of


> > data (gigabytes in my applications).

...


> > But combining them into one procedure such as
> >
> > d = add_multiply(a,b,c)
> >
> > is a travesty in terms of notational clarity in my view (particularly
> > for much longer expressions).

You really have objects that contain gigabytes of data, expect this to
resolve into a single multiply-add, and have it appear in much longer
expressions? Perhaps so, but I have trouble picturing this as typical of
many applications.

--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain

dpb

Jul 12, 2008, 17:07:35
Damian wrote:
> This is a slightly modified reposting of an earlier posting that
> received no replies:
...

> Consider the following:
>
> type(foo) :: a,b,c,d
>
> d = a + b*c
>
> where =, +, and * are a defined assignment and two defined operators.
> Since many (most?) modern microprocessors can perform a multiply/add
> combination in a single clock cycle, it's a travesty to have separate
> procedures for + and *. ...
...

Where I'm confused mostly is that if the expression at the TYPE level
resolves at the implementation level to a combined multiply/add and the
processor has the facility, why wouldn't you expect the compiler
optimizer to recognize that anyway?

I'm missing something here, too...

--

Damian

Jul 12, 2008, 22:12:28
On Jul 12, 11:44 am, nos...@see.signature (Richard Maine) wrote:


The question of whether this is a special case seems to be a recurring
theme. I can name roughly half a dozen major scientific software
development projects where groups are doing exactly what I'm
describing; however, all of the projects except mine abandoned Fortran
for C++ 5-10 years ago because the Fortran language and compilers made
this kind of work excruciatingly painful. (I've published two journal
articles and filed one patent just on the issue of avoiding memory
leaks with recent Fortran compilers and that doesn't even get to the
question of efficiency addressed in my original posting.) Amongst the
people doing this kind of work, I'm one of the only die-hards left
holding out hope that Fortran will advance to support the kind of
programming I'm describing.

To explain just how broadly applicable this work is, here's some more
detail:

My work involves creating derived types that are reusable for solving
partial differential equations across numerous disciplines within the
physical and engineering sciences. The basic idea is to imbue
software with high levels of reusability by modeling software
constructs after highly reusable physical and mathematical
constructs. For example, anyone who has studied fluid mechanics,
solid mechanics, electromagnetics, quantum mechanics, or special/
general relativity is familiar with the concept of a "field" as a
function of 3D space and time. Since the mathematical concept of a
"field" is highly reusable across so many disciplines, a derived type
that represents the state (components) and behavior (type-bound
procedures) of a field can be reused across just as many disciplines.
For example, there are a lot of problems that can be solved with 3D
Fourier fields, so I have created a derived type whose components hold
3D Fourier coefficients and whose type-bound procedures provide
arithmetic and differential operators useful in solving ordinary and
partial differential equations using discrete Fourier transforms. I
have reused these derived types and similar types on problems ranging
from quantum mechanics to fluid mechanics and electromagnetics. The
result is dramatically reduced development time over traditional
programming approaches. I wrote a Maxwell's equations solver in a
single day, coupled it to a Navier-Stokes equation solver, and had
working code in less than a week (after someone very familiar with the
subject but using a more traditional, procedural programming style
predicted it would be naive to think it could be done that fast).

If anybody wants references that describe my work in Fortran and the
work of half a dozen or so similar C++ projects, I will be glad to
provide them. I can provide articles that describe the software
design as well as articles showing that publishable new science is
coming out of this work regularly across multiple disciplines. In
these fields, many of the cutting edge problems are solved with
distributed data structures on massively parallel platforms, so
honestly, even the gigabyte-sized objects I use would be considered
modest in some arenas. At a minimum, I hope I can put to rest any
notion that this work is unique in its approach or needs.

I'm stuck between a world where no one doing the kind of work I do
understands why I use Fortran and no one who uses Fortran understands
why I want to do the things I do.

Damian

Damian

Jul 12, 2008, 23:16:41

What's missing on my end is understanding of and experience with
interprocedural optimization. At least one compiler developer tells me
his company's compiler will combine the operations I'm describing, so
I'll have to do some more testing to find out. I guess I'm assuming
that if the compilers could generally be trusted to do this no matter
how complicated the expression, then the whole notion of expression
templates would not exist in C++, but maybe there are specific reasons
why it's harder for C++ compilers to do the desired interprocedural
optimizations than it is for Fortran compilers. I'll have to do some
testing and write more at a later date, but it still seems to me
better to have a language construct than to rely on the compiler for
something like this.

Damian

dpb

Jul 13, 2008, 00:19:22
Damian wrote:
...
> ... I guess I'm assuming

> that if the compilers could generally be trusted to do this no matter
> how complicated the expression, then the whole notion of expression
> templates would not exist in C++, but maybe there are specific reasons
> why it's harder for C++ compilers to do the desired interprocedural
> optimizations than it is for Fortran compilers. ...

I'll admit I'm not that up on C++ templates, but it wasn't my
understanding that their purpose was optimization; I thought it was
generalization of code at the source level?

--

Tim Prince

Jul 13, 2008, 01:30:45
Fortran has a language construct for fused multiply-add, if that's the
point you are stuck on.

d = a + b*c
allows fused multiply-add;
d = a + (b*c)
requires compilation "as if" separate operations are performed. Not that
I expect to see code written in practice so as to make this distinction.
I don't know why you would split this expression between functions; if you
do, it doesn't look right for a compiler to fuse multiply-add.
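For the narrow multiply-add point, C and C++ make the same distinction available explicitly through `std::fma`; this small example is mine, not Tim's, and the function names are invented:

```cpp
#include <cmath>

// Fortran's "d = a + b*c" may be fused; "d = a + (b*c)" must behave as if
// two separately rounded operations were performed. C++ spells the fused
// form out explicitly:
inline double mul_add_fused(double b, double c, double a) {
    return std::fma(b, c, a);   // one rounding for the whole multiply-add
}

inline double mul_add_separate(double b, double c, double a) {
    double p = b * c;           // rounded product
    return a + p;               // rounded sum: two roundings total
}
```

The two forms can differ in the last bit of the result, which is exactly why the "as if" rule for the parenthesized expression matters.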

After so many years of fused multiply-add coming and going, even with a
reappearance planned for a future extension of the Intel AVX instruction
set, we still don't have conclusive data establishing an advantage for it.
Sure, my applications might gain 5% in performance on Itanium or Power PC
with it, but that's insufficient to make those machines competitive. On
these machines, code sequences with sequential dependency may run twice as
fast with the fused multiply-add, but those code sequences still must be
considered undesirable, at least when they aren't vectorizable.

But, maybe you're talking about something else. If you are trying to ask
about a distinction between Fortran and C++, you haven't made it clear.

Tim Prince

Jul 13, 2008, 02:03:11
Certain C++ templates can be an advantage for optimization, at least
inner_product(), when seen as a partial equivalent of Fortran dot_product.
If these things can't be optimized, there's no hope for C++ to be
competitive.
The usual template practice of "loop until reaching object past end of
loop" rather than "give me a way to calculate loop count" can be a severe
obstacle to optimization. The only reason for it AFAIK is to permit
iterators on pointers, such as linked lists, at the expense of making it
nearly as difficult to optimize simpler cases.
One of the difficulties is the total lack of protection against the case
of a practically unterminated loop. For the arithmetically countable
case, there are 2 possibilities:
a) loop count calculable from (object.end - object.begin), where
optimization may be important
b) object.end < object.begin, probably an error, but not defined as such.
32-bit compilers are considered obligated to attempt to let this wrap
around memory, using all addresses except those between object.end and
object.begin. I've noticed 64-bit compilers which don't attempt this
case, as it would surely hang or crash the program.
So you could draw the conclusion there was no intention of facilitating
optimization or error diagnosis. Surely, few Fortran programmers want to
go there.
You seem to have interpreted this thread as an attempt to draw
distinctions between C++ and Fortran. I've carried that idea too far.

Damian

Jul 13, 2008, 06:18:41

I'm not really trying to ask about a distinction between Fortran and
C++. I'm simply mentioning C++ because it's one language that addressed
the problem I'm pointing out, but I can't seem to convince most
Fortran programmers that this is a problem. It goes well beyond the
issue of the performance of fused multiply/add. Please recall that my
original posting also talked about cache utilization, memory
utilization (excessive temporary results), and notational convenience.
Each of these issues has been discussed in the research literature but
the only language construct I've found that addresses it happens to be
a C++ construct.

I'm trying to point out an issue that comes up in a lot of projects
where one is attempting to support a syntax in which software
constructs closely mirror the analytical constructs they represent,
e.g. where the right-hand side of the 3D heat equation can be written
as

type(field) :: T
real :: diffusion_coefficient

dT_dt = Laplacian(T)/diffusion_coefficient

where T will likely contain a large 3D data set over some region of
space and where Laplacian(T) contains code like

function Laplacian(T)
type(field), intent(in) :: T
type(field) :: Laplacian
Laplacian = d2_dx2(T) + d2_dy2(T) + d2_dz2(T)

where the last line computes second-order partial derivatives and
where, for reasons I can explain if desired, most of the algorithms
for computing those derivatives will involve multiplication, so again
we see the multiply/add combination.

Think of writing Fortran modules that give a user of those modules the
ability to use a syntax closer to Mathematica but with the speed
advantages that come with using a highly optimizable, compilable
language like Fortran. To support that kind of work, the developer
provides individual operators such as differentiation, addition, and
multiplication. If the developer can't allow the user to split those
operators (syntactically) without being able to later fuse them at
compilation time, then there is no way for this approach to be
competitive performance-wise. That alone would not deter me because I
generally find that the development-time reductions are so significant
(literally reducing some projects that would normally take weeks or
months down to a few days) that the total time to solution
(development time plus run time) justifies the approach. However, now
that I've proven that in several arenas, I'm trying to turn my
attention to the performance question -- hence the original posting.

Damian


Again, I hope the above example gives a sense of the general utility
of this approach. I could have chosen virtually any partial
differential equation in any field and could have written the code to
solve it just as fast so long as the underlying derived type can be
assumed to perform the desired operations. This is about as general
as it gets in terms of the disciplines this can touch. It just
involves thinking a different way about the programming process.

Damian

Jul 13, 2008, 06:55:33

Here are a couple of links describing expression templates:

http://ubiety.uwaterloo.ca/~tveldhui/papers/Expression-Templates/exprtmpl.html
http://www.ddj.com/cpp/184401627

There is a good chance I'm not even understanding this material enough
to know whether it addresses the issue I'm describing. If so, I'll
drop any mention of expression templates and hope it's clear that my
issue remains one worth addressing somehow.

I also probably should not have over-emphasized fusing the multiply/add
combination. It's really a more general question of fusing any two
defined operators and whether that process should be left to the
compiler or whether the programmer should be allowed to stipulate it.

Damian

dpb

Jul 13, 2008, 09:51:14
Tim Prince wrote:
...

> You seem to have interpreted this thread as an attempt to draw
> distinctions between C++ and Fortran. ...

Not really, I was trying to understand the question/suggestion having
virtually no C++ background. Hence any suggestion in my question re:
Fortran vis a vis C++ was an attempt for education enough to understand
the feature desired and why the need for a syntax to gain the desired
resulting optimization.

On reflection, I suppose my initial thought of "of course the compiler
will optimize it" is perhaps somewhat over-optimistic in a case of
sufficient source/data complexity. But, it still seems to me that if
the actual computation at the implementation level boils down to the
specific sequence of instructions, those will be recognizable.

It seems somewhat unlikely to me that if the root algorithm would
distill to this it wouldn't show up in the code generation despite
complexity in the source code or data storage.

I can see that if the data are widely distributed one could have issues
in performance owing to cache or such. My thought would be the overall
optimization then would be far bigger than simply worrying over whether
one particular instruction over another were used; I could envision
that a more efficient data ordering could overwhelm the effects of the
single multiply/add operation vis a vis separate multiply and add.

--

Damian

Jul 13, 2008, 09:59:05

These are really encouraging and useful insights. What you say is
corroborated by what at least one vendor has told me, so perhaps I
should be more optimistic about the ability of compilers to optimize
such code. More importantly, I need to arm myself with some empirical
data to see if this bears out in practice and across multiple
compilers. (I probably should have done that before submitting the
original posting.) If it does, I think it would be (yet another)
feather in Fortran's cap. I really hope so.

Damian

James Giles

Jul 13, 2008, 17:17:25
I just read through several articles about expression templates.
One article stated (unequivocally) that such templates were
essentially the same as the old Algol 60 'call by name' parameter
passing mechanism. With tricks like 'Jensen's device' being
possible. Well, there's a sound reason that 'call by name' has
been avoided since the early 60's: it's *very* error prone. And
the errors tend to be hard to find and hard to correct. Worse,
they often generate plausible wrong answers - that is, errors
that the programmer doesn't notice are there.

While expression templates have some advantages, I think we
should avoid them if *any* alternative is even remotely acceptable.
Given that it's only of importance if the code in question has been
directly identified as the program's bottleneck, and solutions (that
Damian counts as ugly) are available, what's the hurry? An "ugly"
solution is often quite acceptable in a bottleneck.

Note: the original purpose of expression templates seems to have
been to do some of the things Fortran elemental expressions can
do. A common example is:

A = B*C

Where they are all arrays, expression templates can do the multiply
and assignment without creating a large temporary to hold the product
and then copying it into A. The problem is that the template mechanism
makes dependency analysis the responsibility of the programmer.
C++ will calmly do the operation as told even if there's aliasing
involved.
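To make the aliasing hazard concrete, here is a toy deferred-evaluation example of my own (not James's); `Vec`, `Shifted`, `shift_back`, and `assign` are invented names. Fortran array semantics require the right-hand side to behave as if fully evaluated before assignment; a naive template-style in-place loop does not:

```cpp
#include <cstddef>
#include <initializer_list>
#include <vector>

struct Vec {
    std::vector<double> v;
    Vec(std::initializer_list<double> x) : v(x) {}
};

// Proxy for "a shifted back by one": element i reads a[i-1]. Nothing is
// evaluated until assignment.
struct Shifted { const Vec& a; };

inline Shifted shift_back(const Vec& a) { return Shifted{a}; }

// Temporary-free assignment: dst[i] = src[i-1] for i >= 1 (dst[0] kept).
// If dst aliases src, iteration i reads a value an earlier iteration has
// already overwritten -- the dependency analysis is the programmer's job.
inline void assign(Vec& dst, Shifted s) {
    for (std::size_t i = 1; i < dst.v.size(); ++i)
        dst.v[i] = s.a.v[i - 1];
}
```

Starting from a = {1, 2, 3}, copy semantics would give {1, 1, 2}, but `assign(a, shift_back(a))` silently propagates a[0] and yields {1, 1, 1}.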

--
J. Giles

"I conclude that there are two ways of constructing a software
design: One way is to make it so simple that there are obviously
no deficiencies and the other way is to make it so complicated
that there are no obvious deficiencies." -- C. A. R. Hoare

"Simplicity is prerequisite for reliability" -- E. W. Dijkstra


Damian

Jul 13, 2008, 17:57:25
On Jul 13, 2:17 pm, "James Giles" <jamesgi...@worldnet.att.net> wrote:

> An "ugly" solution is often quite acceptable in a bottleneck.

Wise words.

Thanks for your time in researching this. I knew there would be
people on this newsgroup who could evaluate this approach better than
I could! I won't fret about this any further until I have more
reason. I have profiled the code and, honestly, the issue described
in the original posting is not the bottleneck. Those FFT's are inside
a defined operator, but obviously that's one case where expression
templates wouldn't help even if used properly.

A final note: Amdahl's law suggests that if I want to speed up my code
by the factors of hundreds to thousands that are common on massively
parallel platforms, I'll eventually need every dusty corner of the
code to scale. Anything that doesn't will eventually become the
bottleneck. But I'm years away from that level of scalability, so
I'll cross that bridge when I come to it and focus on the biggest
bottlenecks first.

Damian

Damian

Jul 13, 2008, 18:05:10

At the end of my 2nd paragraph, I meant to write, "The bottleneck are
3D FFT's. Those FFT's are inside a defined operator, but obviously
that's one case where expression templates wouldn't help even if used
properly."

James Giles

Jul 13, 2008, 18:22:56
Damian wrote:
...

> At the end of my 2nd paragraph, I meant to write, "The bottleneck are
> 3D FFT's. Those FFT's are inside a defined operator, but obviously
> that's one case where expression templates wouldn't help even if used
> properly."

I kind of assumed something of the sort. I wasn't going to
nit-pick about the fact you hadn't even mentioned FFTs before.

Arjen Markus

Jul 16, 2008, 06:05:37
On Jul 12, 17:49, Damian <dam...@rouson.net> wrote:
> This is a slightly modified reposting of an earlier posting that
> received no replies:
>

Having read this thread (and the articles you mentioned),
I think I understand the problem. I wrote the program
below to measure the performance of such assignments.

Typically, using overloaded operations takes twice as long
as an explicit do-loop (see the comments, measured on a Windows PC
with Cygwin and MinGW but also on Linux with Ifort).

One thing that I have not tested is the following idea:
Rather than work with allocated blocks of memory _per variable_,
you could use a pool of allocated blocks. That way you do not
need to copy data if that data is a temporary result. And
you minimize the memory leaks.

(There are drawbacks to that method! Local variables and such
need to be cleaned up. And probably I need to write it down
more properly than expect you to read my mind ;)).

Anyhow, see the program.

Regards,

Arjen
-------
! assign_comp.f90 --
! Various ways to implement "A = B * C" where A, B and C
! are derived types
!
! g95:
! Plain:
! operator: 16.45 16.95 seconds
! do-loop: 11.02 11.44 seconds
! -O2:
! operator: 10.58 10.88 seconds
! do-loop: 6.92 7.11 seconds
!
! gfortran:
! Plain:
! operator: 15.86 15.80 seconds
! do-loop: 13.25 13.50 seconds
! -O2:
! operator: 11.47 11.67 seconds
! do-loop: 7.41 7.37 seconds
!
!
module assignment
implicit none

integer, parameter :: fieldsize = 10000000

type field
logical :: temp = .false.
real, dimension(:), pointer :: data
end type field

interface assignment(=)
module procedure assign_field
module procedure assign_field_scalar
end interface

interface operator(*)
module procedure multiply_fields
end interface
contains

subroutine assign_field( left, right )
type(field), intent(inout) :: left
type(field), intent(in) :: right

if ( right%temp ) then
if ( associated(left%data) ) then
deallocate( left%data )
endif
left%temp = .false.
left%data => right%data
else
if ( .not. associated(left%data) ) then
allocate( left%data(fieldsize) )
endif
left%temp = .false.
left%data = right%data
endif

end subroutine assign_field

subroutine assign_field_scalar( left, right )
type(field), intent(inout) :: left
real, intent(in) :: right

if ( .not. associated(left%data) ) then
allocate( left%data(fieldsize) )
endif
left%temp = .false.
left%data = right

end subroutine assign_field_scalar

type(field) function multiply_fields( op1, op2 )
type(field), intent(in) :: op1, op2

multiply_fields%temp = .true.

if ( op1%temp ) then
multiply_fields%data => op1%data
elseif ( op2%temp ) then
multiply_fields%data => op2%data
else
allocate( multiply_fields%data(1:fieldsize) )
endif

multiply_fields%data = op1%data * op2%data

end function multiply_fields

! Do it the classic way
!
subroutine multiply_assign( left, op1, op2 )
type(field) :: left, op1, op2
integer :: i

if ( .not. associated(left%data) ) then
allocate( left%data(1:fieldsize) )
endif

do i = 1,fieldsize
left%data(i) = op1%data(i) * op2%data(i)
enddo
end subroutine multiply_assign

end module assignment

program test_assign
use assignment

implicit none

type(field) :: field_a
type(field) :: field_b
type(field) :: field_c

real :: time1, time2, cumulative1, cumulative2
integer :: count

character(len=10) :: dummy

field_b = 1.1
field_c = 21.4

cumulative1 = 0.0
cumulative2 = 0.0

do count = 1,100
call cpu_time( time1 )
field_a = field_b * field_c
call cpu_time( time2 )

cumulative1 = cumulative1 + (time2-time1)


call cpu_time( time1 )
call multiply_assign( field_a, field_b, field_c )
call cpu_time( time2 )

cumulative2 = cumulative2 + (time2-time1)
enddo

write(*,*) field_a%data(1), field_b%data(1), field_c%data(1)

write(*,*) 'Method 1 (operator): ', cumulative1
write(*,*) 'Method 2 (do-loop): ', cumulative2

write(*,'(a)') 'Waiting ...'
read(*,'(a)') dummy


end program

Damian

Jul 17, 2008, 12:30:12

Aha! So we finally have some empirical data. I'm going to try this
code out with the IBM XL Fortran compiler because I have been told it
will do interprocedural optimization that might eliminate the
differences between the two approaches.

Damian

Arjen Markus

Jul 17, 2008, 13:43:39
On Jul 17, 18:30, Damian <dam...@rouson.net> wrote:
>
> Aha! So we finally have some empirical data.  I'm going to try this
> code out with the IBM XL Fortran compiler because I have been told it
> will do interprocedural optimization that might eliminate the
> differences between the two approaches.
>
> Damian

That would be interesting indeed.

By the way, I have thought out the memory pool I mentioned. I hope to
get a small program together before my holidays to illustrate it (and
measure the performance), but otherwise it will be in a few weeks
time. Just remind me of it (after 10 August).

Regards,

Arjen

Arjen Markus

Aug 12, 2008, 08:49:48

Well, I am back and I have completed the almost but not quite trivial
program that uses a memory pool, as I promised. The code is below.

Using g95 on a Windows PC via MinGW (but those are mere details, I
suppose) I get something like this:

1. Without optimisation (default options only) the memory pool version
   is 5 to 10% slower than the do-loop version.
2. With optimisation (-O2) the two versions are almost equally fast;
   marginal differences only.
3. If I replace the assignment in the function multiply_fields by a
   subroutine that does the same thing, but does not retain the pointer
   attributes, the memory pool version is actually faster by 30-50% in
   the unoptimised case. With optimisation there is no significant
   difference.

The drawbacks of the memory pool version in an actual program (one that
would require substantive additional work, though) are that it would use
more memory (there is a constant or almost constant overhead) and that
the user would be confronted with the need to manage the memory in some
way, to get rid of local variables and left-over intermediate results.
I think this can be fairly easily automated (up to a point), so the
disadvantage is not that big.

The good thing though is that you can simply specify array operations
like:

a = a + dt * diffusion * laplace(a) / dx**2

(dt, dx being scalars) without performance overhead or memory leaks.

Regards,

Arjen

------ memory pool version ------

! mempool.f90 --
! Implement a memory pool that is then used for
! measuring the performance of a derived type with overloaded
! operations.
! (see: assign_comp.f90)


! Various ways to implement "A = B * C" where A, B and C
! are derived types
!

! Managing the stack level:
! - A higher stack level means the results are more "temporary"
! - The stack level is increased by 2:
! call accum( sum , 2*x )
! 2*x is a temporary variable but the associated memory should
! not be reused within the subroutine accum


!
module assignment
implicit none

integer, parameter :: fieldsize = 10000000

private :: stack_level
private :: block
private :: block_pool

type block
integer :: level


real, dimension(:), pointer :: data
end type

type(block), dimension(10) :: block_pool

integer, parameter :: UNASSIGNED = -1
integer, parameter :: PERSISTENT = 0
integer, parameter :: TEMPORARY = huge(0)
integer, save :: stack_level = 1

type field
integer :: pool_id = UNASSIGNED
end type field

interface assignment(=)
module procedure assign_field
module procedure assign_field_scalar
end interface

interface operator(*)
module procedure multiply_fields
end interface
contains

real function value( obj, idx )
type(field) :: obj
integer :: idx

value = block_pool(obj%pool_id)%data(idx)
end function value

subroutine enter_routine
stack_level = stack_level + 1
end subroutine enter_routine

subroutine leave_routine
stack_level = stack_level - 1
end subroutine leave_routine

subroutine init_pool
integer :: i

do i = 1,size(block_pool)
block_pool(i)%level = TEMPORARY
allocate( block_pool(i)%data(1:fieldsize) )
enddo

end subroutine init_pool

subroutine find_block( obj, pid )
type(field), intent(in) :: obj
integer :: pid

integer :: i

if ( obj%pool_id /= UNASSIGNED ) then
pid = obj%pool_id
! write(*,*) 'Reusing: ', pid
else
pid = -1
do i = 1,size(block_pool)
if ( block_pool(i)%level > stack_level ) then
pid = i
! write(*,*) 'Assigned: ', pid, block_pool(i)%level
exit
endif
enddo
endif

if ( pid == -1 ) then
stop 'Not enough memory blocks!'
endif

end subroutine find_block

subroutine assign_field( left, right )
type(field), intent(inout) :: left
type(field), intent(in) :: right

integer :: pid

if ( block_pool(right%pool_id)%level > stack_level ) then
if ( left%pool_id /= UNASSIGNED ) then
block_pool(left%pool_id)%level = TEMPORARY
endif
pid = right%pool_id
left%pool_id = pid
block_pool(pid)%level = stack_level
else
call find_block( left, pid )

block_pool(pid)%level = stack_level
block_pool(pid)%data = block_pool(right%pool_id)%data
endif

end subroutine assign_field

subroutine assign_field_scalar( left, right )
type(field), intent(inout) :: left
real, intent(in) :: right

integer :: pid

call find_block( left, pid )

left%pool_id = pid
block_pool(pid)%level = stack_level ! TODO: figure out if this should be the minimum?
block_pool(pid)%data = right

end subroutine assign_field_scalar

type(field) function multiply_fields( op1, op2 )
type(field), intent(in) :: op1, op2

integer :: pid

call find_block( multiply_fields, pid )

multiply_fields%pool_id = pid

block_pool(pid)%level = stack_level + 1
block_pool(pid)%data = block_pool(op1%pool_id)%data * &
block_pool(op2%pool_id)%data
! write(*,*) ' Multiply - using ', pid

!! Alternatively:
!! call multiply_array( block_pool(pid)%data, block_pool(op1%pool_id)%data, &
!!                      block_pool(op2%pool_id)%data )

end function multiply_fields

!
! Possible optimisation: the arrays no longer have the pointer attribute
!
subroutine multiply_array( a, b, c )
real, dimension(:), intent(out) :: a
real, dimension(:), intent(in) :: b, c

a = b * c

end subroutine multiply_array


! Do it the classic way
!
subroutine multiply_assign( left, op1, op2 )
type(field) :: left, op1, op2
integer :: i

integer :: pid

call find_block( left, pid )

left%pool_id = pid

block_pool(pid)%level = stack_level

do i = 1,fieldsize
block_pool(pid)%data(i) = block_pool(op1%pool_id)%data(i) &
* block_pool(op2%pool_id)%data(i)
enddo

end subroutine multiply_assign

end module assignment

program test_assign
use assignment

implicit none

type(field) :: field_a
type(field) :: field_b
type(field) :: field_c

real :: time1, time2, cumulative1, cumulative2

real :: walltime1, walltime2
integer :: count, count1, count2, countrate

character(len=10) :: dummy

call init_pool

field_b = 1.1
field_c = 21.4

field_a = field_b * field_c

cumulative1 = 0.0
cumulative2 = 0.0
walltime1 = 0.0
walltime2 = 0.0

call system_clock( count_rate = countrate )
write(*,*) 'Counts per second: ', countrate

do count = 1,100
call system_clock( count1 )


call cpu_time( time1 )
field_a = field_b * field_c

! write(*,*) 'fields: ', field_a%pool_id, field_b%pool_id, field_c%pool_id
call cpu_time( time2 )
call system_clock( count2 )

cumulative1 = cumulative1 + (time2-time1)

walltime1 = walltime1 + (count2-count1)


call system_clock( count1 )


call cpu_time( time1 )
call multiply_assign( field_a, field_b, field_c )
call cpu_time( time2 )

call system_clock( count2 )

cumulative2 = cumulative2 + (time2-time1)

walltime2 = walltime2 + (count2-count1)
enddo

write(*,*) value(field_a,1), value(field_b,1), value(field_c,1)

write(*,*) 'Method 1 (operator): ', cumulative1, walltime1/countrate
write(*,*) 'Method 2 (do-loop):  ', cumulative2, walltime2/countrate

end program test_assign

fj
13 Aug 2008, 12:54:50

This thread is interesting: I faced a similar problem when trying to
overload the classical operators. I also observed efficiency
variations, depending on the compiler, when I compared operators with
routines like add_multiply_assigned.

I don't use FFT, just variables associated with partial derivatives:

TYPE var
REAL(r8) :: value ! the value of the variable
REAL(r8) :: deriv(ndmax) ! deriv(i)=d(value)/dx(list(i))
INTEGER :: index(ndmax) ! the list of associated state variables
END TYPE

Of course, such a simple derived type does not correspond to gigabytes
of memory. But the associated field may become much larger when the
number of nodes (or meshes) is high:

TYPE(var),ALLOCATABLE :: T(:) ! temperature field with SIZE(T) > 100000

But as you can see, the field is described as an array. Therefore I
don't have all the troubles Damian met, because I can take advantage
of elemental functions for overloading operators. However, a problem
occurs when computing simple expressions. I attach two equivalent test
cases, the first one written with subroutine calls, the second one
using operators instead. Here are the results with three compilers:

g95 -O3 : G95 (GCC 4.0.3 (g95 0.92!) Jul 30 2008)

3.3424919999999996 4.230357

gfortran -O3 : GNU Fortran (GCC) 4.3.0 20080102 (experimental) [trunk revision 131253]

1.0338430000000001 1.4547790000000000

ifort -O3 : ifort (IFORT) 10.0 20070426

0.972852000000000 0.759885000000000

As you see, the compilers g95 and gfortran show a decrease of
performance when using operators.
On the contrary, operators are more efficient than subroutines with
ifort.

MODULE var ! a little bit too large to be published here, but it
           ! contains only pure subroutines and elemental functions
...
END MODULE

SUBROUTINE test1(nx)

USE var
IMPLICIT NONE

INTEGER,INTENT(in) :: nx

TYPE(var_type) :: P(nx),T(nx),n(nx),rho(nx),x(nx)
REAL(r8) :: M=18.e-3_r8,V=50._r8,R=8.314_r8
INTEGER :: i

DO i=1,nx
CALL newMainVar(P(i),1.e5_r8,2*i)
CALL newMainVar(T(i),1000._r8,2*i-1)
ENDDO

DO i=1,nx

CALL newVar(x(i),T(i))
CALL inverseVar(x(i))
CALL addVar(x(i),T(i))
CALL multVar(x(i),2._r8)
CALL minusVar(x(i),T(i))
CALL divVar(x(i),2._r8)

CALL newVar(n(i),P(i))
CALL divVar(n(i),T(i))
CALL multVar(n(i),V/R)

CALL newVar(rho(i),n(i))
CALL multVar(rho(i),M/V)

ENDDO

!write(*,*) 'PV/RT',n(1)%value,n(1)%deriv(1:n(1)%nderiv)
!write(*,*) 'MP/RT',rho(1)%value,rho(1)%deriv(1:rho(1)%nderiv)

END SUBROUTINE

SUBROUTINE test2(nx)

USE var
IMPLICIT NONE

INTEGER,INTENT(in) :: nx

TYPE(var_type) :: P(nx),T(nx),n(nx),rho(nx),x(nx)
REAL(r8) :: M=18.e-3_r8,V=50._r8,R=8.314_r8
INTEGER :: i

DO i=1,nx
CALL newMainVar(P(i),1.e5_r8,2*i)
CALL newMainVar(T(i),1000._r8,2*i-1)
ENDDO

x=((1._r8/T+T)*2._r8-T)/2._r8 ! an arbitrary expression
n=(P/T)*(V/R)
rho=n*(M/V)

!write(*,*) 'PV/RT',n(nx)%value,n(nx)%deriv(1:n(nx)%nderiv)
!write(*,*) 'MP/RT',rho(nx)%value,rho(nx)%deriv(1:rho(nx)%nderiv)

END SUBROUTINE

PROGRAM main
IMPLICIT NONE
DOUBLE PRECISION :: t0,t1,t2
CALL cpu_time(t0)
CALL test1(500000)
CALL cpu_time(t1)
CALL test2(500000)
CALL cpu_time(t2)
WRITE(*,*) t1-t0,t2-t1
END PROGRAM

James Van Buskirk
13 Aug 2008, 19:31:48
"fj" <franco...@irsn.fr> wrote in message
news:4715e7e8-acaf-4218...@v57g2000hse.googlegroups.com...

> But as you see, the field is described as an array. Therefore, I don't
> have all the troubles met by Damian because I can take advantage of
> elemental functions for overloading operators. However a problem
> occurs when computing simple expressions. I join two identical test
> cases, the first one written in calling subroutines, the second one in
> using operators instead. Here are the results with 3 compilers :

> As you see, the compilers g95 and gfortran show a decrease of
> performance when using operators.
> On the contrary, operators are more efficient than subroutines with
> ifort.

Have you checked that ifort is returning correct results? I had an
issue years ago with assignment via an elemental subroutine and as far
as I know it has never been fixed. Try

http://home.comcast.net/~kmbtib/Fortran_stuff/elem_assign.f90

with the latest ifort and see if it works.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


John Harper
13 Aug 2008, 22:27:57
In article <AM2dndLkqN958z7V...@comcast.com>,

James Van Buskirk <not_...@comcast.net> wrote:
>
>Have you checked that ifort is returning correct results? I had an
>issue years ago with assignment via an elemental subroutine and as far
>as I know it has never been fixed. Try
>
>http://home.comcast.net/~kmbtib/Fortran_stuff/elem_assign.f90
>
>with the latest ifort and see if it works.

I don't know if this is the latest ifort, but it gave a wrong answer
for new x on our Linux system:

puka[~]$ ifort -V testassign.f90
Intel(R) Fortran Compiler for applications running on IA-32, Version 10.1 Build 20080312 Package ID: l_fc_p_10.1.015
Copyright (C) 1985-2008 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY

Intel Fortran 10.1-2047
GNU ld version 2.15.92.0.2 20040927
puka[~]$ ./a.out
original x = 7 11 13
permuted x = 11 13 7
new x = 11 13 11
puka[~]$

All three of gfortran, g95 and Sun f95 gave the last line of output as
new x = 11 13 7
(with various amounts of blank space, as is permitted for * format).
IMHO that is correct.

-- John Harper, School of Mathematics, Statistics and Computer Science,
Victoria University, PO Box 600, Wellington 6140, New Zealand
e-mail john....@vuw.ac.nz phone (+64)(4)463 6780 fax (+64)(4)463 5045

fj
14 Aug 2008, 04:55:13
On 14 August, 01:31, "James Van Buskirk" <not_va...@comcast.net> wrote:
> "fj" <francois.j...@irsn.fr> wrote in message

In my test, the results from ifort were correct. But your test does
indeed give a wrong answer with my version of ifort, which is not the
latest one (10.0).

In your case, you use an elemental subroutine for assignment. I have
no routine of that kind in my "var" module: all the functions are
ELEMENTAL, but all the subroutines are just PURE.

So I think I avoided the ifort bug by chance!
