Expanding a dynamically allocated array of a derived type: is there any standard Fortran option to achieve better performance than the textbook approach?

FortranFan

unread,

Nov 2, 2016, 11:37:44 PM11/2/16

to

Say one has a dynamically allocated array of a derived type of size N and the array needs to be expanded in order to add a few more elements to the back of the array while retaining all the existing values: is there any standard Fortran option that can yield better performance than the one described in Modern Fortran Explained by Metcalf et al.?

Per Metcalf et al. (see section 15.5.3 on Transferring an allocation in Modern Fortran Explained), starting with Fortran 2003 and the introduction of MOVE_ALLOC intrinsic that reduces 2 copy steps to one, the recommended approach on changing the size of an dynamically allocated array appears to be to first allocate a temporary array with the desired new shape, copy the existing array onto the temporary, add desired new elements, and employ MOVE_ALLOC to transfer the allocation of temporary to the actual array. The canonical sequence is:

-- begin code snippet --
..
type(foo_t), allocatable :: foo(:) !** foo_t mainly holds an array of real64
type(foo_t), allocatable :: tmp(:) ! intrinsic type
..
! say foo is currently allocated with size N
allocate (tmp(M)) ! M on the order of 2*N
tmp(1:N) = foo ! copy existing values of foo
tmp(..) = .. ! add new elements
call move_alloc( from=tmp, to=foo ) ! transfer allocation back to foo
..
-- end snippet --

The above approach is straight-forward but it can become quite time-consuming as foo gets bigger and bigger. Is there a better option than that described above?
More specifically, is there any approach with standard Fortran that can provide a couple of orders of magnitude or better performance than that noticed using the above approach?

For a use case I have where foo gets appreciably large relative to system resources (i.e., N becomes large and so does the data held by each element of foo), the above shown sequence can take anywhere from 2 to 100 CPU seconds for a case being tried: the copy operation to transfer existing elements of foo to the temporay array as well as the MOVE_ALLOC step take up most of this time, as one would expect. This then becomes a performance bottleneck for the task at hand.

It is desired to bring this time down to 0.01 CPU seconds and make it relatively independent of the size of foo array size! That is, what is being sought is two orders of magnitude or better in performance improvement for the operation of adding more elements to the end of a dynamically allocated array of a derived type! Is this even possible to achieve this using *standard Fortran* (i.e., no OpenMP/MPI or vendor-specific directives, etc. are to be considered at this stage) while retaining the basic design where foo remains an ALLOCATABLE rank-one array of foo_t derived type?

Thanks much in advance,

Ian Harvey

unread,

Nov 2, 2016, 11:48:25 PM11/2/16

to

Some of the comments in the code snippet above imply that spare space is
being allocated each time that the array is grown - the number of
elements in use (N) being tracked separately to Fortran's concept of the
array size. Are you doing that? That can dramatically reduce the
number of times that the time consuming reallocation is triggered.

> The above approach is straight-forward but it can become quite
> time-consuming as foo gets bigger and bigger. Is there a better
> option than that described above? More specifically, is there any
> approach with standard Fortran that can provide a couple of orders of
> magnitude or better performance than that noticed using the above
> approach?
>
> For a use case I have where foo gets appreciably large relative to
> system resources (i.e., N becomes large and so does the data held by
> each element of foo), the above shown sequence can take anywhere from
> 2 to 100 CPU seconds for a case being tried: the copy operation to
> transfer existing elements of foo to the temporay array as well as
> the MOVE_ALLOC step take up most of this time, as one would expect.
> This then becomes a performance bottleneck for the task at hand.
>
> It is desired to bring this time down to 0.01 CPU seconds and make it
> relatively independent of the size of foo array size! That is, what
> is being sought is two orders of magnitude or better in performance
> improvement for the operation of adding more elements to the end of a
> dynamically allocated array of a derived type! Is this even possible
> to achieve this using *standard Fortran* (i.e., no OpenMP/MPI or
> vendor-specific directives, etc. are to be considered at this stage)
> while retaining the basic design where foo remains an ALLOCATABLE
> rank-one array of foo_t derived type?

If each element of the array is itself of reasonable size, it can help
to hold each element via an allocatable component in a wrapper type,
then you can move_alloc that component across to the new array rather
than having to actually copy the element value.

FortranFan

unread,

Nov 3, 2016, 12:41:34 AM11/3/16

to

On Wednesday, November 2, 2016 at 11:48:25 PM UTC-4, Ian Harvey wrote:

> ..

> Some of the comments in the code snippet above imply that spare space is
> being allocated each time that the array is grown - the number of
> elements in use (N) being tracked separately to Fortran's concept of the
> array size. Are you doing that? That can dramatically reduce the
> number of times that the time consuming reallocation is triggered.
>

Yes, considerable care is taken to perform this reallocation the fewest number of times. It's just that some scenarios have been put forward where data come streaming unexpectedly (upset conditions) and the program needs to handle those as well.

> ..

>
> If each element of the array is itself of reasonable size, it can help
> to hold each element via an allocatable component in a wrapper type,
> then you can move_alloc that component across to the new array rather
> than having to actually copy the element value.

Yes, foo_t is in fact a wrapper type of allocatable components. What you say with "move_alloc that component across to the new array" is what I'm after - are you thinking of an elemental subroutine bound to the type, say named move, that does the move operation instead of a copy:

call tmp(1:N)%move( foo ) ! this replaces <tmp(1:N) = foo> step in the snippet
! from the original post

The above is what I've been considering, but was looking for readers' views on this. The above does help me tremendously in terms of performance but I wasn't sure of it from a code design point-of-view.

Or more importantly, if an user can do above, what I was hoping to get at ultimately was whether the language itself can introduce some form of additional "MOVE SEMANTICS", something along the lines of "move assignment", which might bring broader applicability of the concept in more circumstances involving the derived type?

Thanks much for your input,

Stefano Zaghi

unread,

Nov 3, 2016, 1:26:04 AM11/3/16

to

Dear FortranFan and Ian,

I am very interested on this topic, but I do not completely follow the last post of Ian and subsequent replay of FortranFan.

The great Jacob Williams posted a nice article some months ago on how "partitioning" the array could (drammatically) speedup the re.allocation, see it here

http://degenerateconic.com/dynamically-sizing-arrays/

Incidentally, the speedup seems to be comparable to the 2 orders wanted by FortranFan :-)

What I am not completely followed is the difference between

---code
tmp(1:N) = foo
! and
call tmp(1:N)%move(foo)
---end code

Namely, in the TPB "move", that is elemental, you are copying the members of "foo" element by element, so "move" could be something like

---code
elemental subroutine move(self, wrapper)
class(foo_t), intent(inout) :: self
! first case
type(foo_t), intent(in) :: wrapper
! or
type(foo_t), intent(inout) :: wrapper

! first case
self%memeber = wrapper%member
! or
call move_alloc(from=wrapper%member, to=self%member)
end subroutine move
---end code

If I understand you, can you explain why the second case (move_alloc) is better? In the first case we have an automatic lhs re-allocation with a copy, in the second what we have?

Thank you for any help.

My best regards.

Ian Harvey

unread,

Nov 3, 2016, 2:43:49 AM11/3/16

to

On 2016-11-03 16:26, Stefano Zaghi wrote:
> Dear FortranFan and Ian,
>
> I am very interested on this topic, but I do not completely follow
> the last post of Ian and subsequent replay of FortranFan.
>
> The great Jacob Williams posted a nice article some months ago on how
> "partitioning" the array could (drammatically) speedup the
> re.allocation, see it here
>
> http://degenerateconic.com/dynamically-sizing-arrays/
>
> Incidentally, the speedup seems to be comparable to the 2 orders
> wanted by FortranFan :-)

That's effectively the same approach as suggested by the code snippet in
the original post, except the snippet uses a factor for the expansion of
the array size rather than adding a fixed chunk size.

That page also suggests a linked list, which is also very reasonable, if
such a data structure can still be considered "retaining the basic design".

> What I am not completely followed is the difference between
>
> ---code
> tmp(1:N) = foo
> ! and
> call tmp(1:N)%move(foo)
> ---end code
>
> Namely, in the TPB "move", that is elemental, you are copying the members of "foo" element by element, so "move" could be something like
>
> ---code
> elemental subroutine move(self, wrapper)
> class(foo_t), intent(inout) :: self
> ! first case
> type(foo_t), intent(in) :: wrapper
> ! or
> type(foo_t), intent(inout) :: wrapper

I don't think it makes sense for the move procedure to be a binding. If
foo_t is extended, it probably doesn't make sense for objects of dynamic
type foo_t to be moved into an object of the extension type. Also, if
the procedure is a binding the passed argument has to be polymorphic,
which isn't going to help performance and can be a nuisance with respect
to PURE procedures.

The argument representing the variable being moved to would be intent(out).

> ! first case
> self%memeber = wrapper%member
> ! or
> call move_alloc(from=wrapper%member, to=self%member)
> end subroutine move
> ---end code
>
> If I understand you, can you explain why the second case (move_alloc)
> is better? In the first case we have an automatic lhs re-allocation
> with a copy, in the second what we have?

If the storage size of wrapper%member is larger than a descriptor for a
scalar object (which may be just storage for an address or two) then you
are moving less stuff and saving on the need to allocate self%member.
There'll be a little bit of bookkeeping associated with MOVE_ALLOC, but
as the destination object is not allocated that should be small.

Ron Shepard

unread,

Nov 3, 2016, 3:03:46 AM11/3/16

to

On 11/2/16 10:37 PM, FortranFan wrote:
> Is this even possible to achieve this using *standard Fortran* (i.e., no OpenMP/MPI or vendor-specific directives, etc. are to be considered at this stage) while retaining the basic design where foo remains an ALLOCATABLE rank-one array of foo_t derived type?

I was going to say yes, a linked list data structure will eliminate all
of the copying and simplify the allocations, but then you added that
last part "ALLOCATABLE rank-one array" condition at the end.

With a linked list, you can just keep allocating new elements as needed
without touching the previous elements. You don't need to allocate them
one at a time, you could also allocate them in small blocks to reduce
the overhead, but you need to add logic to remember how many elements
you are actually using at any time.

However, this data structure is not a rank-one array. So you might
consider the possibility of relaxing this condition, and with this other
data structure, your other problems can be solved. There might be an
intermediate approach where you do some of the work with the linked
list, and then when it is complete, converting it to the rank one array
form with a single allocation+copy for the rest of the work on the data.

$.02 -Ron Shepard

Stefano Zaghi

unread,

Nov 3, 2016, 3:50:15 AM11/3/16

to

Dear Ian, thank you for your kind replay.

Il giorno giovedì 3 novembre 2016 07:43:49 UTC+1, Ian Harvey ha scritto:
> That's effectively the same approach as suggested by the code snippet in
> the original post, except the snippet uses a factor for the expansion of
> the array size rather than adding a fixed chunk size.

I missed the "similarity", sorry, I miss the "M" at all...

> That page also suggests a linked list, which is also very reasonable, if
> such a data structure can still be considered "retaining the basic design".

Yes, linked list has almost O(1) for put/del, but the access is not O(1) as an array, maybe for FortranFan the access to element is more critical: there are so many data structures with different bias, hash table could be a compromise. In this case, I like to learn more on array data structure.

> I don't think it makes sense for the move procedure to be a binding. If
> foo_t is extended, it probably doesn't make sense for objects of dynamic
> type foo_t to be moved into an object of the extension type. Also, if
> the procedure is a binding the passed argument has to be polymorphic,
> which isn't going to help performance and can be a nuisance with respect
> to PURE procedures.

Sure, I only try to follow (failing) the FortranFan snippet.

> If the storage size of wrapper%member is larger than a descriptor for a
> scalar object (which may be just storage for an address or two) then you
> are moving less stuff and saving on the need to allocate self%member.

Oh, this is interesting. So in general, "array_lhs = tmp_rhs" is slower than "call move_alloc(from=tmp_rhs, to=array_lhs)" (assuming that "tmp_rhs" is a temporary array that could be deallocated), is this right? In case "tmp_rhs" is not temporary I have to do a "standard" copy I think.

> There'll be a little bit of bookkeeping associated with MOVE_ALLOC, but
> as the destination object is not allocated that should be small.

Mmm, sorry, I am losting... this means

+ if the destination (to=dest) is already allocated then the "bookkeeping" of move_alloc is relevant <=> source=dest is slower than move_alloc(from=dest, to=source);
+ if the destination (to=dest) is NOT already allocated then the "bookkeeping" of move_alloc is small <=> source=dest is similar to move_alloc(from=dest, to=source);

Is my understanding right?

Thank you again.

My best regards.

Wolfgang Kilian

unread,

Nov 3, 2016, 4:15:17 AM11/3/16

to

On 03.11.2016 07:42, Ian Harvey wrote:
> On 2016-11-03 16:26, Stefano Zaghi wrote:
>> Dear FortranFan and Ian,
>>
>> I am very interested on this topic, but I do not completely follow
>> the last post of Ian and subsequent replay of FortranFan.
>>
>> The great Jacob Williams posted a nice article some months ago on how
>> "partitioning" the array could (drammatically) speedup the
>> re.allocation, see it here
>>
>> http://degenerateconic.com/dynamically-sizing-arrays/
>>
>> Incidentally, the speedup seems to be comparable to the 2 orders
>> wanted by FortranFan :-)
>
> That's effectively the same approach as suggested by the code snippet in
> the original post, except the snippet uses a factor for the expansion of
> the array size rather than adding a fixed chunk size.
>
> That page also suggests a linked list, which is also very reasonable, if
> such a data structure can still be considered "retaining the basic design".

I'll just throw in the caveat that a linked list as such may perform
poorly, due to memory-fetch and cache-locality issues. With a list, the
compiler is responsible for storing data items, and it may be more or
less successful in arranging an efficient layout. Eventually, it can
become a tradeoff between efficient storing and efficient retrieving of
data.

-- Wolfgang

--
E-mail: firstnameini...@domain.de
Domain: yahoo

Ian Harvey

unread,

Nov 3, 2016, 5:53:44 AM11/3/16

to

On 2016-11-03 18:50, Stefano Zaghi wrote:
>> If the storage size of wrapper%member is larger than a descriptor
>> for a scalar object (which may be just storage for an address or
>> two) then you are moving less stuff and saving on the need to
>> allocate self%member.
>
> Oh, this is interesting. So in general, "array_lhs = tmp_rhs" is
> slower than "call move_alloc(from=tmp_rhs, to=array_lhs)" (assuming
> that "tmp_rhs" is a temporary array that could be deallocated), is
> this right? In case "tmp_rhs" is not temporary I have to do a
> "standard" copy I think.

"In general" is probably fair enough, assuming that the size of tmp_rhs
is reasonable. It will also depend on things like whether array_lhs
needs to be reallocated for the assignment, whether finalizers are
involved, compiler smarts, etc.

An implicit assumption here is that the source and destination are
already allocatable, if they are not, then making those things
allocatable may have implications. Another assumption is that a move of
the value is all that is required - if you actually need a copy of the
value, then there's not much point considering the speed of a move.

(In terms of the original "make my array bigger faster" question, the
move_alloc's being discussed were on components of the elements of the
array being moved, not the array itself - that's a separate move_alloc.)

>> There'll be a little bit of bookkeeping associated with MOVE_ALLOC,
>> but as the destination object is not allocated that should be
>> small.
>
> Mmm, sorry, I am losting... this means
>
> + if the destination (to=dest) is already allocated then the "bookkeeping" of move_alloc is relevant <=> source=dest is slower than move_alloc(from=dest, to=source);
> + if the destination (to=dest) is NOT already allocated then the "bookkeeping" of move_alloc is small <=> source=dest is similar to move_alloc(from=dest, to=source);
>
> Is my understanding right?

You can't be so definitive about the relativities.

If the source object is allocated and the destination object is also
allocated, then MOVE_ALLOC has to deallocate the destination (and just
before that, finalize it, if relevant). That will chew some cycles,
exactly how many depends on things like the thing being deallocated. If
the destination object is not allocated then MOVE_ALLOC doesn't have to
do that work, it can get on with updating the descriptors for the source
and destination objects.

Louis Krupp

unread,

Nov 3, 2016, 5:54:18 AM11/3/16

to

I could be wrong, but I don't think there's a magic solution here. If
you were doing the memory allocation in C and you're only allocating
one array and you're using realloc(), it might be fast as long as the
array you were reallocating was the last one in the heap block
(something that's pretty much guaranteed if the array is big enough).
Reallocation means extending the end of the array, allocating more
virtual memory as needed, and returning. There's no need to move any
data.

Can you do the same thing in Fortran? I don't know.

If you can't, then here's a thought:

Allocate a big array. An enormous array. Bigger than the biggest
array you'll ever need. It will span multiple pages in virtual
memory, but that's OK. As long as you don't touch a page, you'll
never have to allocate a physical page for it. So whatever you do,
don't initialize the whole array. Just keep track of how many array
elements you're using, and ignore everything else.

If you allocated 100,000 elements and you find yourself trying to use
100,001 of them, then abort. You've exceeded your design limit.

If that seems like a problem, remember that no matter what you do,
there's a limit to how much memory you can allocate. So don't be shy.
If 100,000 seems small, allocate an array of 1,000,000 elements. Or
10,000,000.

Try it. If it doesn't work, you can always go back and do it the hard
way.

Louis

Stefano Zaghi

unread,

Nov 3, 2016, 6:19:22 AM11/3/16

to

Dear Ian thank you again.

Il giorno giovedì 3 novembre 2016 10:53:44 UTC+1, Ian Harvey ha scritto:
> "In general" is probably fair enough, assuming that the size of tmp_rhs
> is reasonable. It will also depend on things like whether array_lhs
> needs to be reallocated for the assignment, whether finalizers are
> involved, compiler smarts, etc.

Sure, I am trying to depict a very "minimal guide-line" for me, just an array of integers for example, no finalizers, pointers members...

I do not want to add noise to the FortranFan question, I am sorry.

> An implicit assumption here is that the source and destination are
> already allocatable, if they are not, then making those things
> allocatable may have implications. Another assumption is that a move of
> the value is all that is required - if you actually need a copy of the
> value, then there's not much point considering the speed of a move.

Yes, you are right, mine was only tautological thinking at "viva-voce".

> (In terms of the original "make my array bigger faster" question, the
> move_alloc's being discussed were on components of the elements of the
> array being moved, not the array itself - that's a separate move_alloc.)

Oh, I was sure on both, sorry for my over/under-sight.

> You can't be so definitive about the relativities.
>
> If the source object is allocated and the destination object is also
> allocated, then MOVE_ALLOC has to deallocate the destination (and just
> before that, finalize it, if relevant).
> That will chew some cycles,
> exactly how many depends on things like the thing being deallocated.

Yea, this is what I assumed, thus I was surprised about the (possible) difference between "self%member=wrapper%member" and "call move_alloc(from=wrapper%member, to=self%member)". Probably I had done a lot of confusion with FortranFan question... Later I'll try to carefully re-read FortranFan post. Sorry for the noise.

My best regards.

herrman...@gmail.com

unread,

Nov 3, 2016, 2:12:45 PM11/3/16

to

On Wednesday, November 2, 2016 at 8:37:44 PM UTC-7, FortranFan wrote:
> Say one has a dynamically allocated array of a derived type
> of size N and the array needs to be expanded in order to
> add a few more elements to the back of the array while
> retaining all the existing values: is there any standard
> Fortran option that can yield better performance than the
> one described in Modern Fortran Explained by Metcalf et al.?

(snip)

> -- begin code snippet --
> ..
> type(foo_t), allocatable :: foo(:) !** foo_t mainly holds an array of real64
> type(foo_t), allocatable :: tmp(:) ! intrinsic type
> ..
> ! say foo is currently allocated with size N
> allocate (tmp(M)) ! M on the order of 2*N
> tmp(1:N) = foo ! copy existing values of foo
> tmp(..) = .. ! add new elements
> call move_alloc( from=tmp, to=foo ) ! transfer allocation back to foo
> ..
> -- end snippet --

> The above approach is straight-forward but it can become
> quite time-consuming as foo gets bigger and bigger.

It can, but note that as the array gets bigger, it needs
to be copied less and less often. The total number of
copies is O(log N). (Personally, I usually increase mine
by a factor of 5/3, but as long as the array grows
exponentially, the number of copies is O(log N).)

> Is there a better option than that described above?
> More specifically, is there any approach with standard
> Fortran that can provide a couple of orders of magnitude
> or better performance than that noticed using the
> above approach?

The C realloc() function has the ability to reallocate
without copying, if the system can arrange for that.
For some systems, if the array is at the end of memory,
it can be extended without copying. If you loop
reallocating a single array, eventually it will be large
enough to be at the end, and extend that way.

Note that if you loop with two arrays, both can't
be at the end of memory, and so will be moved
(and data copied) each time.

As noted, move_alloc() allows for twice the
efficiency of the previous method.

> For a use case I have where foo gets appreciably large
> relative to system resources (i.e., N becomes large and so
> does the data held by each element of foo), the above shown
> sequence can take anywhere from 2 to 100 CPU seconds for
> a case being tried: the copy operation to transfer existing
> elements of foo to the temporay array as well as the
> MOVE_ALLOC step take up most of this time, as one
> would expect.

The MOVE_ALLOC step should be fast, as all it does, at
least for the way it usually works, is update some pointers.

> This then becomes a performance bottleneck for the
> task at hand.

Since we don't know the task at hand, it is hard to say more.

One way is to make a linked list of reasonable sized arrays,
or an array of arrays.

But the optimal answer depends on what you do with the array.

If you do as above, the total number of elements copied will
be less than twice the final size of the array. Assuming that you
do something with those element, that time should be more than
the copying time.

In some cases, the problem isn't the time, but when the time
occurs. For an interactive program, there can be a noticeable
delay when the larger copies occur, but maybe not so bad
otherwise.

> It is desired to bring this time down to 0.01 CPU seconds
> and make it relatively independent of the size of foo array
> size! That is, what is being sought is two orders of magnitude
> or better in performance improvement for the operation of
> adding more elements to the end of a dynamically allocated
> array of a derived type!

I suspect that you are running into virtual memory problems,
where the system starts swapping during the copy. If so, the
solution is to buy more RAM for your computer.

To test it out, set the initial allocation to the maximum size that
it needs to be for that run, then try the program. Then you know
that it isn't doing any copying, and the speed is the speed it is.

> Is this even possible to achieve this
> using *standard Fortran* (i.e., no OpenMP/MPI or vendor-specific
> directives, etc. are to be considered at this stage) while retaining
> the basic design where foo remains an ALLOCATABLE rank-one
> array of foo_t derived type?

As far as I know, it isn't. I don't know that there is a reason why
something similar to C's realloc() couldn't be done, but so far it
isn't. (You could use C interoperability, call C routines, and
convert C pointers to Fortran pointers. But I suspect this isn't
really the problem.)

As I noted above, one solution is an array or arrays.
That complicates access, and it depends much on what you
use the array for. If you need random access to array element,
it usually works pretty well. Allocate an array of structures containing
allocatable arrays, and then allocate each one as the previous one
fills up. Keep them all the same size, so you can calculate which
array, and which element of the subarray using simple math.

No moving ever needs to be done.

herrman...@gmail.com

unread,

Nov 3, 2016, 2:24:00 PM11/3/16

to

On Thursday, November 3, 2016 at 1:15:17 AM UTC-7, Wolfgang Kilian wrote:

(snip on linked list)

> I'll just throw in the caveat that a linked list as such may perform
> poorly, due to memory-fetch and cache-locality issues. With a list, the
> compiler is responsible for storing data items, and it may be more or
> less successful in arranging an efficient layout. Eventually, it can
> become a tradeoff between efficient storing and efficient retrieving of
> data.

Yes. The solution is to use a linked list of reasonable sized arrays.
Fill up the most recently allocated array, when it is full, add a new
one onto the list.

But just about as good is an array of arrays. Often a fixed sized
array of arrays is enough for the largest you can ever imagine, but
if not, use the MOVE_ALLOC method on the array of arrays.

If, for example, you allocate a 1,000,000 element array of
(structures containing) allocatable arrays, and then allocate
each array at 1,000,000 elements, the total size can be
up to one trillion elements. That should last at least a
few years. The space overhead isn't huge. The elements
of the subarray will be sequential in memory for fast access.

Past that, it depends too much on what you want to do
with the array(s).

herrman...@gmail.com

unread,

Nov 3, 2016, 2:36:59 PM11/3/16

to

On Thursday, November 3, 2016 at 2:54:18 AM UTC-7, Louis Krupp wrote:

(snip on reallocation, MOVE_ALLOC, and taking too long)

> I could be wrong, but I don't think there's a magic solution here. If
> you were doing the memory allocation in C and you're only allocating
> one array and you're using realloc(), it might be fast as long as the
> array you were reallocating was the last one in the heap block
> (something that's pretty much guaranteed if the array is big enough).
> Reallocation means extending the end of the array, allocating more
> virtual memory as needed, and returning. There's no need to move any
> data.

C doesn't guarantee that, but for Unix-like systems it is usually true.

Other systems might allocate differently, and not have that ability.

(snip)

> If you can't, then here's a thought:

> Allocate a big array. An enormous array. Bigger than the biggest
> array you'll ever need. It will span multiple pages in virtual
> memory, but that's OK. As long as you don't touch a page, you'll
> never have to allocate a physical page for it. So whatever you do,
> don't initialize the whole array. Just keep track of how many array
> elements you're using, and ignore everything else.

As far as I know, this works on many systems.

It seems that what Linux does when you allocate a large array
like that is setup virtual memory all pointing to the same zero-filled,
write protected page. When you write to the page, Linux sees the
attempt to write, allocates a new page for you, and updates the
page tables.

It can do this even if you ask for more that the total virtual
storage available, up to the size of the virtual address space.

It has the side effect that ALLOCATE will succeed, even when the
necessary virtual memory isn't available, and then access will fail
sometime later.

Note that you can read from the pages without problems, it is
only when you write to them that they are actually allocated.

-- glen

FortranFan

unread,

Nov 3, 2016, 8:30:37 PM11/3/16

to

On Thursday, November 3, 2016 at 12:41:34 AM UTC-4, FortranFan wrote:

> On Wednesday, November 2, 2016 at 11:48:25 PM UTC-4, Ian Harvey wrote:
>
> > ..
> >

> > If each element of the array is itself of reasonable size, it can help
> > to hold each element via an allocatable component in a wrapper type,
> > then you can move_alloc that component across to the new array rather
> > than having to actually copy the element value.
>

> .. the move operation instead of a copy:

>
> call tmp(1:N)%move( foo ) ! this replaces <tmp(1:N) = foo> step in the snippet

.. The above does help me tremendously in terms of performance ..

Shown below are some results with a rather simple code example that is listed further down in this post, hopefully this will make things clearer. So the point is as explained by Ian Harvey, instead of following the book example, if one employs the MOVE_ALLOC intrinsic for the component(s) of the FOO_T derived type in the original post, considerable improvements can be noticed.

In the code below, two sequences are attempted: 1) resize_copy, a 'contain'ed procedure that follows the steps shown by Metcalf et al. and which were described in the original post and 2) resize_move which is along the lines of Ian Harvey's suggestion and which is something I've had the back of my mind for a while now, ever since I had become familiar with C++ 11 standard revision.

Readers will notice the performance differences between the two approaches.

---
So the question I want to pose is this: is it possible to formalize this somehow in a future revision of the Fortran standard in order to make it more easier and clearer for coders who need to work with any kind of *containers* of derived types? That is, I am wondering if it's possible to introduce some form of "move assignment" semantics in addition to the existing defined assignment which is effectively a copy assignment. See this for some background with respect to C++: https://en.wikipedia.org/wiki/Move_assignment_operator.

I bring this up because in the general context of derived types and type extensions, it is rather difficult for a coder to be fully sure of how to design the type (as in FOO_T below) and how to setup the move scheme (as in move_foo_dat below) so that containers of the type, an array being the simplest, can be grown and shrunk efficiently. So if some rules and constraints and syntax and such are introduced in the language around the derived type components and methods acting on them, perhaps it can be made more straightforward.
---

For the code listed below, here're some execution results using gfortran for a 64-bit target on a Windows 7 workstation with 16 GB RAM, Intel Core i7-6820HQ @ 2.7 GHz CPU:

N = 100000 --------------------------------
# Array CPU Time CPU Time
Elements resize_copy resize_move
(seconds) (seconds)
-------------------------------------------
1000 0.2893 3.9E-3
5000 2.433 3.9E-3
10000 30.98 4.0E-3
15000 203.9 4.2E-3
-------------------------------------------

N = 50000 ---------------------------------
# Array CPU Time CPU Time
Elements resize_copy resize_move
(seconds) (seconds)
-------------------------------------------
2000 0.2709 2.0E-3
10000 1.721 2.3E-3
20000 19.81 1.9E-3
-------------------------------------------

-- begin code --
module mykinds_m

use, intrinsic :: iso_fortran_env, only : I8 => int64, WP => real64

implicit none

end module

module foo_m

use mykinds_m, only : WP

implicit none

private

type, public :: foo_t
real(WP), allocatable :: dat(:)
contains
procedure, pass(this), public :: set => set_foo
end type foo_t

public :: move_foo_dat

contains

elemental subroutine move_foo_dat( rhs, lhs )

type(foo_t), intent(inout) :: rhs
type(foo_t), intent(inout) :: lhs

call move_alloc( from=rhs%dat, to=lhs%dat )

return

end subroutine move_foo_dat

impure elemental subroutine set_foo( this, n )

class(foo_t), intent(inout) :: this
integer, intent(in) :: n

if ( allocated(this%dat) ) then
deallocate( this%dat)
end if

allocate( this%dat(n) )
call random_number( this%dat )

return

end subroutine set_foo

end module foo_m
program p

use mykinds_m, only : I8, WP
use foo_m, only : foo_t, move_foo_dat

implicit none

integer, parameter :: N = 100000
integer, parameter :: DATA_SIZE = 15000
type(foo_t), allocatable :: foo(:)

print *, "Initial array size, N = ", N
print *, "Size of data in each array element, DATA_SIZE = ", DATA_SIZE

! Set up foo array
allocate( foo(N) )
call foo%set( DATA_SIZE )

print *, "Before reallocation:"
print *, "size(foo) = ", size(foo)
print *, "foo(N)%dat(1) = ", foo(N)%dat(1)
print *

!call resize_copy()
call resize_move()

print *
print *, "After reallocation:"
print *, "size(foo) = ", size(foo)
print *, "foo(N)%dat(1) = ", foo(N)%dat(1)

stop

contains

subroutine resize_copy()

type(foo_t), allocatable :: tmp(:)
real(WP) :: start_time
real(WP) :: end_time

call my_cpu_time( start_time )

! Canonical sequence for array growth per Metcalf et al.
allocate( tmp(2*N) )
tmp(1:N) = foo

call move_alloc( from=tmp, to=foo )

call my_cpu_time( end_time )

print "(*(g0.4))", "Reallocation sequence: CPU time = ", (end_time - start_time), &
" seconds."

return

end subroutine resize_copy

subroutine resize_move()

type(foo_t), allocatable :: tmp(:)
real(WP) :: start_time
real(WP) :: end_time

call my_cpu_time( start_time )

! Possible sequence for array growth for types with ALLOCATABLE components
allocate( tmp(2*N) )
call move_foo_dat( foo, tmp(1:N) )

call move_alloc( from=tmp, to=foo )

call my_cpu_time( end_time )

print "(*(g0.4))", "Reallocation sequence: CPU time = ", (end_time - start_time), &
" seconds."

return

end subroutine resize_move

subroutine my_cpu_time( time )

!.. Argument list
real(WP), intent(inout) :: time

!.. Local variables
integer(I8) :: tick
integer(I8) :: rate

call system_clock (tick, rate)

time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

return

end subroutine my_cpu_time

end program p
-- end code --

herrman...@gmail.com

unread,

Nov 3, 2016, 11:44:04 PM11/3/16

to

On Thursday, November 3, 2016 at 5:30:37 PM UTC-7, FortranFan wrote:
> On Thursday, November 3, 2016 at 12:41:34 AM UTC-4, FortranFan wrote:

(snip)

> > .. the move operation instead of a copy:
> >
> > call tmp(1:N)%move( foo ) ! this replaces <tmp(1:N) = foo> step in the snippet
> .. The above does help me tremendously in terms of performance ..

(snip)

OK, so you have an allocatable array, dimension 100000 where each
element has an allocatable array of 15000 64 bit real values, for about 12GB.

This is like the array or arrays that I mentioned in my previous post.

In this case, there is no need to copy the subarrays, as they aren't
changing size. You could MOVE_ALLOC them individually yourself.

From the times, it looks like you are running into swapping, which makes some
sense if you are copying 12GB or data to another 12GB array with only 16GB RAM.

The actual copy problem comes when you try to reallocate an array
of contiguous data. That has to be done as one operation.

But for an array of arrays, create the new resized array, MOVE_ALLOC
each subarray from the old to the new, then deallocate the old.

> Readers will notice the performance differences between
> the two approaches.

> So the question I want to pose is this: is it possible to formalize
> this somehow in a future revision of the Fortran standard in order
> to make it more easier and clearer for coders who need to work
> with any kind of *containers* of derived types?

A suggestion to MOVE_ALLOC arrays element by element
when it is possible to do that?

> That is, I am wondering if it's possible to introduce some
> form of "move assignment" semantics in addition to the
> existing defined assignment which is effectively a copy
> assignment. See this for some background with
> respect to
> C++: https://en.wikipedia.org/wiki/Move_assignment_operator.

In this case, you only have the allocatable array in the structure.

If you had other data, you would copy the other data, and
then MOVE_ALLOC the array(s).

If you do it as a structure assignment, and it automatically
allocates the subarrays in a copy, then you could just
deallocate the originals inside the loop.

> I bring this up because in the general context of derived
> types and type extensions, it is rather difficult for a coder
> to be fully sure of how to design the type (as in FOO_T below)
> and how to setup the move scheme (as in move_foo_dat below)
> so that containers of the type, an array being the simplest,
> can be grown and shrunk efficiently.

Anytime you are working with 12GB of data, you have to
be very careful with allocating and copying.

> So if some rules and constraints and syntax and such are
> introduced in the language around the derived type
> components and methods acting on them, perhaps it
> can be made more straightforward.

The one that is still complicated is the case of a single large
array, where there needs to be enough space for the original
and the copy. You should have enough real memory to hold
both the original and the copy.

(snip)

herrman...@gmail.com

unread,

Nov 4, 2016, 2:55:34 AM11/4/16

to

On Wednesday, November 2, 2016 at 8:37:44 PM UTC-7, FortranFan wrote:

> Say one has a dynamically allocated array of a derived type of
> size N and the array needs to be expanded in order to add a
> few more elements to the back of the array while retaining all
> the existing values:

(big snip)

To be sure I understand, if you do a structure assignment when the
structure contains allocatables, it does allocate on assignment for
those arrays.

If so, it seems that it would be nice to have a way to do such
assignment that would not assign (or allocate) such.

That way, one could assign (copy) all the non-allocatables,
and, separately, MOVE_ALLOC the allocatable.

Otherwise, one has to separately assign all the non-allocatable,
and then MOVE_ALLOC the allocatables.

Seems to me that this is what the OP is asking about.

Wolfgang Kilian

unread,

Nov 4, 2016, 3:14:09 AM11/4/16

to

This excludes any version of intrinsic, defined or type-bound
assignment, since the source argument would have to be INTENT(INOUT).
Defining an explicit procedure for doing the data transfer is
straighforward, however. Just don't use assignment here.