Reading CSV files with missing data at the end

Arjen Markus

Nov 23, 2009, 9:24:21 AM

Hello,

the program below (which first writes a small CSV file and then reads
it) causes a runtime
error when reading the second line:

! Read a CSV file with missing fields
!
program test_csv
    implicit none
    character(len=10) :: x, y, z
    character(len=80) :: line
    integer           :: i

    ! Write a small test file; the second and third lines end in a comma
    open( 10, file = 'test.csv' )
    write( 10, '(a)' ) 'X,,Z'
    write( 10, '(a)' ) 'Y,,'
    write( 10, '(a)' ) 'Z,,'
    close( 10 )

    open( 10, file = 'test.csv' )

    x = '?'
    y = '?'
    z = '?'

    do i = 1,2
        read( 10, '(a)' ) line
        write(*,*) '>>',trim(line),'<<'
        read( line, * ) x, y, z   ! fails at run time on the second line

        write(*,*) '>',trim(x), '<'
        write(*,*) '>',trim(y), '<'
        write(*,*) '>',trim(z), '<'
    enddo
end program

I was rather surprised: the second line reads "Y,,", and the program fails on
reading three strings from it.

The double comma in the first line is treated as I expected: no value is
assigned to the variable associated with that empty field (y in this case).

So I would have thought the two trailing commas both define an empty field
and only x would be read. Instead, the last comma seems to be treated
differently and the read statement fails.

Is this what should happen according to the standard? Why is the
interpretation of the two commas different depending on what comes after them?

Regards,

Arjen

m_b_metcalf

Nov 23, 2009, 9:58:49 AM

Don't you need an additional comma separator to mark the third, empty,
field?

write( 10, '(a)' ) 'Y,,,'
write( 10, '(a)' ) 'Z,,,'

Regards,

Mike Metcalf

Richard Maine

Nov 23, 2009, 12:23:49 PM

Arjen Markus <arjen.m...@gmail.com> wrote:

...

> I was rather surprised: the second line reads "Y,,", and the program fails on
> reading three strings from it.

[with a list-directed read]

> The double comma in the first line is treated as I expect it would: no
> value is assigned to the
> variable associated with that empty field (y in this case).
>
> So I would have thought the two trailing commas both define an empty
> field and only x would
> be read. Instead, the last comma seems to be treated differently and
> the read statement
> fails.
>
> Is this what should happen according to the standard? Why is the
> interpretation of the
> two commas different depending on what comes after them?

The interpretation of the commas is not different. In both cases the
comma terminates the field. Your problem actually has nothing to do with
the commas at all. You can duplicate the essence of it with no commas at
all. To do that, try reading just a single value from a line. Make the
first line read " and the second one read " " (in both cases, without
the quotes - that is the second line is blank).

The problem is in the definition of an empty field for list-directed
input. A comma terminates the current field in all relevant cases (well,
ok, not if the comma is quoted and not if you have decimal='comma',
but those cases aren't relevant to the current question). The end of a
line can terminate the field, but only if the field is non-empty.

That is the reason for the difference you see. It is not the comma, but
rather the end of line that is interpreted differently depending on
whether or not there is data in the field. If there is no data in the
field, the runtime system goes to the next line to find the field.
List-directed input is *NOT* restricted to having all the data for a
single read on a single line. To get an empty field, you need to
actively terminate the field with a comma (or a slash will do if it is
the last field).

As Mike said, adding an extra comma will fix your problem, but I wanted
to explain why.

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain
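
The rule described above can be checked with a minimal sketch that reads from an internal file instead of a disk file. The program name, the `samples` array and the use of `iostat` are illustrative assumptions, not part of the original posts. It tries three inputs: the failing `Y,,`, the variant with an extra comma, and the variant ending in a slash.

```fortran
! Sketch: which inputs let a list-directed read of three items succeed?
program null_field_demo
    implicit none
    character(len=10) :: x, y, z
    character(len=80) :: line          ! a scalar internal file is one record
    character(len=4)  :: samples(3)    ! hypothetical test inputs
    integer           :: i, ios

    samples = [ 'Y,, ', 'Y,,,', 'Y,,/' ]

    do i = 1, size(samples)
        line = samples(i)
        x = '?'; y = '?'; z = '?'
        read( line, *, iostat = ios ) x, y, z
        if ( ios == 0 ) then
            write(*,'(7a)') trim(samples(i)), ': x=', trim(x), &
                            ' y=', trim(y), ' z=', trim(z)
        else
            ! x, y and z are undefined after a failed read, so do not print them
            write(*,'(2a,i0)') trim(samples(i)), ': read failed, iostat = ', ios
        end if
    end do
end program null_field_demo
```

The first input should report a nonzero `iostat` (the exact value is processor dependent), because the end of the record does not terminate the empty third field. The other two should succeed with `x=Y` and with `y` and `z` still holding `?`.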

Arjen Markus

Nov 24, 2009, 2:42:08 AM

On 23 nov, 18:23, nos...@see.signature (Richard Maine) wrote:

Ah, so the comma is not a _separator_ but a _terminator_!
That explains the behaviour indeed.

Well, the problem is indeed solved by adding another separator, but the
discussion arose in a context where the CSV file is written with commas
as separators in mind.

Thanks for the explanation.

Regards,

Arjen

Richard Maine

Nov 24, 2009, 6:43:52 AM

Arjen Markus <arjen.m...@gmail.com> wrote:

> Ah, so the comma is not a _separator_ but a _terminator_!
> That explains the behaviour indeed.

I don't think that is an accurate explanation. (And while I might have
used the word "terminator", I'm pretty sure the standard uses the word
"separator").

The problem is not the comma at all. Forget the comma; look at my
example that had no comma at all. The problem is with the end of line.

I don't think I'll go rewrite it; I'll just suggest rereading what I
wrote before... and stop focusing on the comma.

Arjen Markus

Nov 24, 2009, 7:51:44 AM

On 24 nov, 12:43, nos...@see.signature (Richard Maine) wrote:

Okay, I got it - the end of the character string is the point where
things break. Hm, an elegant solution has presented itself:

! Read a CSV file with missing fields
!
program test_csv
    implicit none
    character(len=10) :: x, y, z
    integer           :: i

    ! Two-record internal file: line(1) holds the data, line(2) the padding
    character(len=80), dimension(2) :: line

    open( 10, file = 'test.csv' )
    write( 10, '(a)' ) 'X,,Z'
    write( 10, '(a)' ) 'Y,,'
    write( 10, '(a)' ) 'Z,,'
    close( 10 )

    open( 10, file = 'test.csv' )

    x = '?'
    y = '?'
    z = '?'

    line(2) = ',,,,,,' ! Enough commas to satisfy even reading
                       ! an empty line

    do i = 1,2
        read( 10, '(a)' ) line(1)
        write(*,*) '>>',trim(line(1)),'<<'
        read( line, * ) x, y, z ! Yes, read from the array!

        write(*,*) '>',trim(x), '<'
        write(*,*) '>',trim(y), '<'
        write(*,*) '>',trim(z), '<'
    enddo
end program

This way the read statement can always find enough data.

Regards,

Arjen

Gordon Sande

Nov 24, 2009, 8:57:06 AM

I would have summarized the discussion this way: the comma is a separator
that also acts as a continuation operator that will bypass line ends.
You need to either terminate the field or the line to inhibit the
continuation aspect. Putting in another comma is fine until you decide to
read another field, at which point you will again fall into the next line.

> Well, the problem is indeed solved by adding another separator, but the
> discussion arose in a context where the CSV file is written with commas
> as separators in mind.

What was the mindset on terminating the data record? I would have expected
something like comma as separator and semicolon as terminator. So data records
and media lines are not the same. It allows sets of three small data records to
be put on the same line for convenience, as an example of the tendency to want
to "optimize" things at a later stage.

At one point I got and read the specification of the .SYLK file format which
has all the complexities that one might encounter in csv files. Multiple short
and very long data records are a feature of such data.

Arjen Markus

Nov 24, 2009, 10:18:14 AM

On 24 nov, 14:57, Gordon Sande <g.sa...@worldnet.att.net> wrote:
>
> What was the mind set on terminating the data record? I would have expected
> something like comma as separator and semicolon as terminator. So data records
> and media lines are not the same. It allows sets of three small data records to
> be put on the same line for convenience as an example of the tendency to want
> to "optimize" things at a later stage.
>
> At one point I got and read the specification of the .SYLK file format which
> has all the complexities that one might encounter in csv files. Multiple short
> and very long data records are a feature of such data.
>

I had a discussion about the support for missing values in Fortran in
conjunction with CSV files. The reading problem was settled for missing
values in the middle of a record, but then the actual data files turned out
to be missing data at the end.

If you export CSV files from a spreadsheet program like MS Excel, then there
is no terminating semicolon or anything else to indicate the end of the
record other than the end-of-line character.

But the second version of my sample program, inspired by the discussion
here, does the job nicely.

Regards,

Arjen

Richard Maine

Nov 24, 2009, 12:09:27 PM

Arjen Markus <arjen.m...@gmail.com> wrote:

> Okay, I got it - the end of the character string is the point where
> things break. Hm, an elegant solution has presented itself:

...

> line(2) = ',,,,,,' ! Enough commas to satisfy even reading
> ! an empty line

Alternatively,

line(2) = '/'

ought to do the trick and be general to an arbitrary number of fields.
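
A minimal sketch of the slash variant, assuming the same two-record internal file as in the program above. The names `buffer` and `samples` are illustrative, and the variables are reset before every read so that a skipped field shows up as `?` instead of a value left over from the previous line.

```fortran
! Sketch: a '/' in the second record ends the read for any number of fields
program slash_sentinel_demo
    implicit none
    character(len=10) :: x, y, z
    character(len=80) :: buffer(2)     ! record 1: data line, record 2: sentinel
    character(len=4)  :: samples(3)    ! hypothetical input lines
    integer           :: i, ios

    samples   = [ 'X,,Z', 'Y,, ', '    ' ]   ! the last one is an empty line
    buffer(2) = '/'

    do i = 1, size(samples)
        buffer(1) = samples(i)
        x = '?'; y = '?'; z = '?'      ! reset, otherwise old values survive
        read( buffer, *, iostat = ios ) x, y, z
        if ( ios /= 0 ) then
            write(*,'(a,i0)') 'read failed, iostat = ', ios
            cycle
        end if
        write(*,'(8a)') '[', trim(samples(i)), '] x=', trim(x), &
                        ' y=', trim(y), ' z=', trim(z)
    end do
end program slash_sentinel_demo
```

One caveat with this approach: list-directed input also treats blanks as separators and a slash anywhere outside quotes as the end of the read, so unquoted fields that contain spaces or slashes (dates such as 11/23/09, for instance) will not come through intact.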

glen herrmannsfeldt

Nov 24, 2009, 1:53:15 PM

Richard Maine <nos...@see.signature> wrote:
> Arjen Markus <arjen.m...@gmail.com> wrote:

>> Okay, I got it - the end of the character string is the point where
>> things break. Hm, an elegant solution has presented itself:

>> line(2) = ',,,,,,' ! Enough commas to satisfy even reading

>> ! an empty line

> Alternatively,

> line(2) = '/'

> ought to do the trick and be general to an arbitrary number of fields.

My first thought, without actually looking it up, would be that
'/' would be like end of file, such that the following list items are
not set to null, but are undefined. Then, as usual for end of file,
all items from the read are undefined. Better to look it up, though.

-- glen

Richard Maine

Nov 24, 2009, 2:14:32 PM

glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

> > line(2) = '/'
>
> > ought to do the trick and be general to an arbitrary number of fields.
>
> My first thought, without actually looking it up, would be that
> '/' would be like end of file, such that the following list items are
> not set to null, but are undefined. Then, as usual for end of file,
> all items from the read are undefined. Better to look it up, though.

No, it isn't like end-of-file for multiple reasons. For that matter, the
above is not an accurate description of what happens for end-of-file
either. The inaccuracy is related to the difference.

An end-of-file is an exceptional condition - sort of like an error,
though not quite. One of the ways it is like an error is that *ALL* the
input items become undefined - not just the "following" ones.

A "/" on the other hand, terminates the list-directed read normally.

I would also slightly quibble with describing "/" as setting list items
to null. The list items are not set to anything; their value is left
unchanged. I suppose you are probably trying to describe it as having
the same effect as null fields. The list items are the variables in the
io-list - not the fields.
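
The difference between the two cases can be illustrated with a short sketch using integer items. The program and variable names are assumptions for illustration only, and `is_iostat_end` requires a Fortran 2003 compiler.

```fortran
! Sketch: a slash ends the read normally, running out of records does not
program slash_vs_eof_demo
    implicit none
    integer           :: a, b, c, ios
    character(len=20) :: line

    ! Case 1: null field plus slash. The read succeeds and the items
    ! without a value keep what they held before the read.
    a = 1; b = 2; c = 3
    line = '10,,/'
    read( line, *, iostat = ios ) a, b, c
    write(*,'(a,i0,a,3(1x,i0))') 'slash: iostat = ', ios, ', values =', a, b, c

    ! Case 2: the record ends before the third field is terminated.
    ! This raises an end-of-file condition; a, b and c are all
    ! undefined afterwards, so they are not printed.
    a = 1; b = 2; c = 3
    line = '10,,'
    read( line, *, iostat = ios ) a, b, c
    write(*,'(a,l1)') 'no terminator: end-of-file condition? ', is_iostat_end(ios)
end program slash_vs_eof_demo
```

The first line of output should show `iostat = 0` with the values 10, 2 and 3; the second should report `T`.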

glen herrmannsfeldt

Nov 24, 2009, 2:46:45 PM

Richard Maine <nos...@see.signature> wrote:
> glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

>> > line(2) = '/'

>> > ought to do the trick and be general to an arbitrary number of fields.

>> My first thought, without actually looking it up, would be that
>> '/' would be like end of file, such that the following list items are
>> not set to null, but are undefined. Then, as usual for end of file,
>> all items from the read are undefined. Better to look it up, though.

> No, it isn't like end-of-file for multiple reasons. For that matter, the
> above is not an accurate description of what happens for end-of-file
> either. The inaccuracy is related to the difference.

> An end-of-file is an exceptional condition - sort of like an error,
> though not quite. One of the ways it is like an error is that *ALL* the
> input items become undefined - not just the "following" ones.

I said that, but in the following sentence.

I wonder, though, if computer technology has changed over the
years. end-of-file isn't so exceptional as it used to be.
Maybe for Fortran 2013 the same treatment can be given in the
end-of-file case?

> A "/" on the other hand, terminates the list-directed read normally.

> I would also slightly quibble with describing "/" as setting list items
> to null. The list items are not set to anything; their value is left
> unchanged. I suppose you are probably trying to describe it as having
> the same effect as null fields. The list items are the variables in the
> io-list - not the fields.

Yes, I was thinking that it was like reading in zero length character
input, but it seems that isn't the case. I have used list-directed
input, but maybe only for numeric values.

So, somewhat similar to C in that values in the list can keep
their previous value, though C can also do that in the EOF case.

Also, the treatment for quoted or apostrophed input and record
boundaries could be inconvenient. In the case of fixed record length,
it might be that a quote or apostrophe accidentally comes at the end
of record. There is no way to indicate that the input continues
on the next record. Given an arbitrary string and fixed record length
it isn't always possible to write out the string such that it can
be read in with the original value.

-- glen

frank

Nov 24, 2009, 4:18:49 PM

On Tue, 24 Nov 2009 04:51:44 -0800, Arjen Markus wrote:

> Okay, I got it - the end of the character string is the point where
> things break. Hm, an elegant solution has presented itself:

[code elided]

> This way the read statement can always find enough data.

I don't think you're there yet, Arjen:

dan@dan-desktop:~/source$ gfortran csv1.f90 -Wall -Wextra -o out
dan@dan-desktop:~/source$ ./out
>>X,,Z<<
>X<
>?<
>Z<
>>Y,,<<
>Y<
>?<
>Z<
dan@dan-desktop:~/source$ cat outlook1.csv
First Name,Last Name,Middle Name,Name,E-mail Address
alison,,,alison,gree...@unm.edu
,,,Amazon.com,store...@amazon.com
Andy,,,Andy,an...@firstinter.net
Arjen,,,Arjen,arjen....@wldelft.nl

[snip hey look, ^^^you could become your output]
dan@dan-desktop:~/source$ cat test.csv
X,,Z
Y,,
Z,,
dan@dan-desktop:~/source$ cat csv1.f90

! gfortran csv1.f90 -Wall -Wextra -o out
! ./out >text1.txt
dan@dan-desktop:~/source$

So, test.csv starts out looking like outlook1.csv, and ends up looking
the way it is cat'ed.

Has anyone gotten a hold of Andy in the last several months? Last time I
heard, he was on a walkabout or something.
--
frank

"Guns: yes, they are harmful."

Richard Maine

Nov 24, 2009, 6:55:27 PM

glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

> Richard Maine <nos...@see.signature> wrote:
> > glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
>
> >> > line(2) = '/'
>
> >> > ought to do the trick and be general to an arbitrary number of fields.
>
> >> My first thought, without actually looking it up, would be that
> >> '/' would be like end of file, such that the following list items are
> >> not set to null, but are undefined. Then, as usual for end of file,
> >> all items from the read are undefined. Better to look it up, though.
>
> > No, it isn't like end-of-file for multiple reasons. For that matter, the
> > above is not an accurate description of what happens for end-of-file
> > either. The inaccuracy is related to the difference.
>
> > An end-of-file is an exceptional condition - sort of like an error,
> > though not quite. One of the ways it is like an error is that *ALL* the
> > input items become undefined - not just the "following" ones.
>
> I said that, but in the following sentence.

Oh. I see, sort of. I had trouble following the intended meaning there,
particularly the connection between the two sentences.