!-----------------------------------------------------------------------
program IOSpeed
  use system_time_mod
  use ifport
  implicit none
  integer :: l, n, n1, n2, n3, iostat, sysstat
  real :: fs
  type(timer) :: tr, tw
  real, dimension(:,:,:), allocatable :: data

  call system_time_init()
  !---------------------------------------------------------------------
  ! Loop over increasing file size.
  !---------------------------------------------------------------------
  sysstat = system("touch /tmp/test")
  do l = 1, 10
    n  = 512000*(2**l)
    n1 = 512
    n2 = 1000
    n3 = n/n1/n2
    allocate( data(n1,n2,n3) )
    sysstat = system("rm /tmp/test")
    open(1, file='/tmp/test', form='BINARY', status='NEW', iostat=iostat)
    call start_timer(tw)
    write(1) data
    call stop_timer(tw)
    rewind(1)
    call start_timer(tr)
    read(1) data
    call stop_timer(tr)
    fs = (1.0*n*4)/1024000          ! file size in MB (n 4-byte reals)
    write(0,*) fs, fs/tw%telapsed, fs/tr%telapsed
    close(1)
    deallocate(data)
  end do
  !---------------------------------------------------------------------
  call exit(0)
end program IOSpeed
!-----------------------------------------------------------------------
A few caveats about the code:
1) I'm using the Intel Fortran compiler, hence the "use ifport" to
enable the system() calls.
2) I'm using a timing module of my own creation (just a front end for
system_clock()).
3) The FORM='BINARY' file type is an extension, not standard Fortran
(to my knowledge).
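For anyone who wants to reproduce this, here is a minimal sketch of
what the timing module provides. The real one differs in detail; only
the names timer, telapsed, system_time_init, start_timer, and
stop_timer are taken from the program above, everything else is a
guess:

```fortran
module system_time_mod
  implicit none
  integer :: clock_rate = 1       ! ticks per second, set at init
  type timer
    integer :: tstart   = 0
    real    :: telapsed = 0.0     ! elapsed time in seconds
  end type timer
contains
  subroutine system_time_init()
    integer :: c
    call system_clock(c, clock_rate)   ! query the clock resolution once
  end subroutine system_time_init
  subroutine start_timer(t)
    type(timer), intent(inout) :: t
    call system_clock(t%tstart)
  end subroutine start_timer
  subroutine stop_timer(t)
    type(timer), intent(inout) :: t
    integer :: tstop
    call system_clock(tstop)
    t%telapsed = real(tstop - t%tstart)/real(clock_rate)
  end subroutine stop_timer
end module system_time_mod
```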
If you run the code, it prints, for each file size, the file size and
the write and read I/O speeds in MB/sec.
My machine:
1) Intel quad core 64-bit
2) Fedora Core 7
3) /tmp is ext3 (I converted another disk to ext2 (no journaling) and
see the same behavior)
Here is a GNUPLOT figure of the performance, for three separate runs.
http://www.voxproperty.com/images/2008_01_09_f90_io_speed_vs_filesize.png
You will notice a fairly predictable performance pattern for each file
size. The sampling along the x-axis is non-uniform--data occurs at
powers of 2: 4 MB, 8 MB, ..., 2048 MB. Write performance declines
precipitously at 256 MB, whereas read performance declines
precipitously at 1024 MB.
I'm guessing that this problem is likely more a system issue than a
fortran issue, but I know there is some significant brain power here,
and that someone will have an answer.
Thank you for your time...
Morgan Brown
(snip doing FORM='BINARY' I/O)
> Here is a GNUPLOT figure of the performance, for three separate runs.
> http://www.voxproperty.com/images/2008_01_09_f90_io_speed_vs_filesize.png
> You will notice a fairly predictable performance pattern for each file
> size. The sampling along the x-axis is non-uniform--data occurs at
> powers of 2: 4 MB, 8 MB, ..., 2048 MB. Write performance declines
> precipitously at 256 MB, whereas read performance declines
> precipitously at 1024 MB.
How much real memory do you have on your system?
If you start swapping, performance of any operation will drop like
a rock, especially I/O, since swapping and I/O compete for disk access.
That wouldn't explain why write and read decline starting at
different points, though. It might be that WRITE copies the
data to a buffer of equal size before writing it out, while READ
doesn't need such a big buffer?
> I'm guessing that this problem is likely more a system issue than a
> fortran issue, but I know there is some significant brain power here,
> and that someone will have an answer.
-- glen
Thanks for your response.
> How much real memory do you have on your system?
I have 4 GB, but there are other apps running on here, competing for
memory. I just ran on another (very quiet) machine, which has only
2 GB of memory, and I get a different result.
4 GB machine (replotted with lines and points):
http://www.voxproperty.com/images/2008_01_09_f90_io_speed_vs_filesize_brown01.png
2 GB machine:
http://www.voxproperty.com/images/2008_01_09_f90_io_speed_vs_filesize_brown02.png
If I interpret the two plots correctly, the "precipitous decline" in
both read and write performance occurs at a smaller file size on the
2 GB machine.
> If you start swapping, performance of any operation will drop like
> a rock, especially I/O, since swapping and I/O compete for disk access.
>
> That wouldn't explain why write and read decline starting at
> different points, though. It might be that WRITE copies the
> data to a buffer of equal size before writing it out, while READ
> doesn't need such a big buffer?
I'm wondering if this is an ifort issue. I do not see any obvious
swapping going on, but...
When I get the energy tomorrow I will try with gfortran.
Thanks,
Morgan
(I wrote)
>>How much real memory do you have on your system?
> I have 4Gb, but there are other apps running on here, competing for
> memory. I just ran on another (very quiet) machine, which has only
> 2Gb of memory, and I get a different result.
You don't say which OS you use, which can make some difference,
nor whether it is a 32-bit or 64-bit OS.
(snip)
Also, I just noticed one big problem with your program, which
greatly affects its memory usage: you don't initialize the
array. On many systems, when it is allocated, all the page-table
entries point to one page of zeros. Look at the
memory usage while the program is running. The actual
memory isn't allocated until the array is initialized.
(I believe Linux and some other Unix-like systems do
that. I don't know about Windows systems.)
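A quick way to force the pages to really exist is to fill the array
right after allocation, for example (any nonzero fill would do):

```fortran
allocate( data(n1,n2,n3) )
call random_number(data)   ! touch every page so physical memory is actually committed
```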
>>If you start swapping, performance of any operation will drop like
>>a rock, especially I/O, since swapping and I/O compete for disk access.
>>That wouldn't explain why write and read decline starting at
>>different points, though. It might be that WRITE copies the
>>data to a buffer of equal size before writing it out, while READ
>>doesn't need such a big buffer?
Otherwise, I would say "don't do that."
You might try not writing the whole array at once.
You can at least do timing on something like (with an extra
integer :: i declared):
call start_timer(tw)
do i = 1, n3
   write(1) data(:,:,i)
enddo
call stop_timer(tw)
For FORM='UNFORMATTED' this will write it as multiple records,
instead of just one, which may or may not make it faster.
-- glen
Linux running Fedora Core 7 (64-bit) with quad core Intel
> Also, I just noticed one big problem with your program, which
> greatly affects its memory usage: you don't initialize the
> array.
I was initializing it before, but it didn't seem to make a difference.
I changed the test code to loop over n3, writing (n1,n2) slices, and
timed each write. I print the results below, for the first, second,
and last third of the write. Interesting--even for small files, the
I/O rate monotonically decreases as you get further into the file.
> For FORM='UNFORMATTED' this will write it as multiple records,
> instead of just one, which may or may not make it faster.
The path of least resistance (and the only path with gfortran) is to
do unformatted writes. However, I want flat binary files for later use
outside of Fortran. Oh well, maybe I should return to vanilla
unformatted writes with fixed record length.
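(For the record, Fortran 2003 stream access would give exactly that, a
flat binary file with no record markers, if the compiler supports it.
A sketch:)

```fortran
open(1, file='/tmp/test', form='unformatted', access='stream', status='replace')
write(1) data    ! written with no record markers: a flat binary file
close(1)
```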
Thanks,
Morgan
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
4.000000 361.4066 1200.86006728223
first third: write rate (MB/s) Infinity
second third: write rate (MB/s) 234.8270
last third: write rate (MB/s) 247.3749
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
8.000000 263.1493 1050.41422489356
first third: write rate (MB/s) 426.5918
second third: write rate (MB/s) 266.4255
last third: write rate (MB/s) 188.5790
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
16.00000 246.2385 1074.75639403597
first third: write rate (MB/s) 351.1053
second third: write rate (MB/s) 224.5157
last third: write rate (MB/s) 204.8707
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
32.00000 239.1803 1047.21788929981
first third: write rate (MB/s) 268.3126
second third: write rate (MB/s) 242.5312
last third: write rate (MB/s) 213.0988
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
64.00000 243.8591 1052.30015602092
first third: write rate (MB/s) 265.5897
second third: write rate (MB/s) 241.9495
last third: write rate (MB/s) 227.0722
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
128.0000 246.8387 1055.10445883907
first third: write rate (MB/s) 259.5407
second third: write rate (MB/s) 247.2273
last third: write rate (MB/s) 234.9698
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
256.0000 243.1740 1028.15349790298
first third: write rate (MB/s) 254.0984
second third: write rate (MB/s) 248.8201
last third: write rate (MB/s) 228.1858
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
512.0000 113.7644 983.021700340661
first third: write rate (MB/s) 252.9854
second third: write rate (MB/s) 240.1486
last third: write rate (MB/s) 54.78432
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
1024.000 48.70774 1056.61644875273
first third: write rate (MB/s) 250.2715
second third: write rate (MB/s) 47.29613
last third: write rate (MB/s) 27.43272
-----------------------------------------
filesize(MB) write rate(MB/s) read rate(MB/s)
2048.000 31.69512 39.3168732882498
first third: write rate (MB/s) 68.94968
second third: write rate (MB/s) 27.00253
last third: write rate (MB/s) 23.19384
That said, my app has had a max file size of 100 - 130 GB, and it only
reads sections at a time, not the entire array.
I have significantly improved the disk rate by using fast 15k rpm SCSI
RAID 0 scratch disk arrays. On my workstation this included finding a
SCSI controller that was much faster than the one that came on the PC
workstation motherboard.
On a single SATA II 7200 rpm drive, my app's I/O rate is on the order
of 53 MBytes/s. On the 4-disk RAID 0 SCSI pack, the rate gets up to
130 MBytes/sec.
This app does unformatted Fortran (Intel IVF) writes and reads.
I welcome anyone who can suggest further improvement approaches.
John
In my app, the disk files are mainly temporary files. When I said over
1 TB, I meant total throughput. I should be able to write a wrapper
that breaks up the file I/O into multiple mutually exclusive parts.
I have a RAID5-enabled disk, but I believe it is ATA. Thanks very
much for the advice.
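A sketch of what such a wrapper could look like using standard
direct-access I/O, one fixed-length record per (n1,n2) slice. (Caveat:
recl=4*n1*n2 assumes RECL is in bytes, which is gfortran's default;
with ifort you need -assume byterecl, otherwise RECL counts 4-byte
words.)

```fortran
open(1, file='/tmp/test', form='unformatted', access='direct', recl=4*n1*n2)
do i = 1, n3
   write(1, rec=i) data(:,:,i)   ! each slice is one fixed-length record
end do
close(1)
```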
Morgan
What can be done, hardware-wise (bus speed, controller, ...) or
software-wise, to get closer to these advertised data rates?
Note that in my experience Win 32 or 64 and Linux 32 or 64 have very
similar results, with Linux having slightly faster I/O. Fortran and C
approaches were basically the same.
John S
This is probing aspects of I/O buffering that are far beyond my feeble
brain!
> The SCSI ultras advertise a maximum data rate of 320 MBytes/sec for a
> single drive. My results with RAID 0 with 4 SCSIs have a maximum data rate
> 1/3 of that rate.
Sure you aren't confusing the drive sustained throughput with the speed
of the SCSI adapter? There is such a thing as an Ultra 320 SCSI adapter,
but I've not seen drives that can sustain such high throughput. I believe
you need a 64-bit bus to utilize the throughput of an Ultra 320 SCSI adapter.
A 32-bit PCI bus running at 33 MHz provides 4 bytes x 33 MHz = 132 MB/sec.
Your benchmark is testing the Linux buffer cache as much as the
actual I/O. Linux will keep as much of your data in memory as it
can, evicting other idle applications if necessary. I wouldn't be
surprised if a second repetition of the test gave different results.
You should run vmstat in a separate window and see where the bytes
are going. Here's the output on an idle laptop.
==============
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 172 248140 75280 30324 0 0 17 2 75 62 1 0 97 1
0 0 172 248140 75280 30324 0 0 0 0 63 63 1 0 99 0
0 0 172 248140 75280 30324 0 0 0 0 42 27 0 0 100 0
0 0 172 248140 75280 30324 0 0 0 0 47 39 1 0 99 0
==============
The "1" tells vmstat to print a sample every second. Read the
vmstat man page for full details. Pay attention to the "bi" and
"bo" (blocks in, blocks out) columns; they show the actual disk
I/O. Also keep an eye on "cache" and "so" (swap out).
My guesses:
* During the write test, your data gets stuffed into RAM and
only goes to disk when the system has nothing better to
do. You should see the "free" drop (depending on what else
is in memory) and the "cache" number go up (maybe "buff"
too). You might see some "bo" for actual disk I/O, but
that won't hold up your program until the cache/buff gets
full and all your writes have to go to disk.
* Still during the write test, keep an eye on "so". These
are idle applications being swapped out to make room for
your data in RAM. I suspect this causes your early
slowdown at 256 MB writing.
* During the read test, you probably won't see any "bi"
until the 2 GB test. This would be because your data is
still cached in RAM, even if it has also been written to
disk, and there is no need to perform any physical I/O to
get it back.
These are just guesses. If the measurements prove me wrong, then
that's that.
--
pa at panix dot com
I didn't actually look at your times. It might be that the system
has a disk write buffer, so the smaller ones go to the buffer,
not to disk. You could put the CLOSE inside the timing block,
though even that doesn't guarantee the data is on the disk.
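For example (a sketch: moving the CLOSE inside the timed region, plus
a sync via ifport's system(), which asks the kernel to push dirty
pages out; even this measures "handed to the kernel" rather than
"on the platter"):

```fortran
call start_timer(tw)
write(1) data
close(1)                    ! flush the runtime library's buffers
sysstat = system("sync")    ! ask the kernel to write dirty pages to disk
call stop_timer(tw)
```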
-- glen
So even though I'm not explicitly doing asynchronous I/O here, it
seems the system (the Fortran runtime plus the Linux buffer cache) is
doing a pretty good job of using spare cycles and memory to hide I/O
waiting. If anything, I've found the "sweet spot" for this out-of-core
computation on this machine: a 256 MB file size. If I want to do
out-of-core on larger problems, I may just have to bite the bullet.
Thanks for all the insights,
Morgan
> In my app, the disk files are mainly temporary files.
[...]
> I have a RAID5-enabled disk,
RAID5 is good for data integrity and disc use efficiency,
but is known to be relatively slow for writing.
If you are using very large temporary files (and can take the
risk of a disk failure, if you just need to restart the program),
it might be better to use a "naked" disc or a RAID0 array.
For better security (but more money for discs :-) you can use
a combination of RAID0 and RAID1 (i.e. RAID10).