csv-file processing.

Robert de Bock

unread,

Nov 8, 2002, 4:18:06 AM11/8/02

to

Hello,

To process csv files I can define a separator like this in awk:
awk 'BEGIN { FS = "," } { print $2 }'

csv files are separated with a "," which is not the same as the awk-line above:
echo "field-one,field1","field-two,field2" | awk 'BEGIN { FS = "," } { print $2 }'

I was hoping for the result:
field-two,field2

instead I got:
field1

Which is correct if the separator was *only* a comma, but with lots of csv files, the separator is "," ( including the beginning " and ending " )

Now, my question: How can I tell awk to use "," as a separator, so csv files are processed the way I want/expect?

Tia,
Robert de Bock.

John L

unread,

Nov 8, 2002, 6:03:37 AM11/8/02

to

"Robert de Bock" <rob...@dive.nl> wrote in message news:20021108101806....@dive.nl...

The short answer is that parsing csv files is hard, and if
you do a search in this NG, you will see how to do it.

The longer answer is there is an anomaly that might be
a bug in gnu awk, but this (below) works in Solaris nawk.

You are asking about a simple subset of csv files,
where each field is wrapped in quotes: "a","b","c","d"
So this ought to be a case for simply setting FS appropriately.

FS="\","\" works for all bar the first and last fields.
So to take care of the first one, we can add quote marks
at the start of line to FS: FS="^\"|\",\""
(Note that this increases the number of fields by one, as
there is now an empty field at the start of the line before
the initial quotation mark.)

Now that works in Solaris nawk, but not with gawk.
The reason is that the ^ in FS ties what follows to
the start of the line in nawk but to the start of the field
(separator) in Gnu awk. (Time to rtfm, methinks.)

So if you have Gnu awk, search the NG for a comprehensive
solution, and if you have nawk (or something that works
similarly), then you also need to deal with the final quote:
FS="^\"|\",\"|\"$"

John.

John L

unread,

Nov 8, 2002, 6:34:24 AM11/8/02

to

"John L" <j...@lammtarra.fslife.co.uk> wrote in message news:aqg5hu$4qr$1...@news6.svr.pol.co.uk...

> The longer answer is there is an anomaly that might be
> a bug in gnu awk, but this (below) works in Solaris nawk.
>
> You are asking about a simple subset of csv files,
> where each field is wrapped in quotes: "a","b","c","d"
> So this ought to be a case for simply setting FS appropriately.
>
> FS="\","\" works for all bar the first and last fields.
> So to take care of the first one, we can add quote marks
> at the start of line to FS: FS="^\"|\",\""
> (Note that this increases the number of fields by one, as
> there is now an empty field at the start of the line before
> the initial quotation mark.)
>
> Now that works in Solaris nawk, but not with gawk.
> The reason is that the ^ in FS ties what follows to
> the start of the line in nawk but to the start of the field
> (separator) in Gnu awk. (Time to rtfm, methinks.)
>

Oops, forgot to add an example that shows the different
behaviour of the two awk implementations. Take the line:

"a","b",""c"","d"

where the fields are: a, b, "c" (including quotes) and d
with nawk, whereas gawk has two (extra) empty fields
delimited by the adjacent quotation marks either side of
the c. (Ignoring empty fields at the start and end of line.)

Which is "correct", I cannot say.

John.

Robert de Bock

unread,

Nov 8, 2002, 7:37:41 AM11/8/02

to

Hi,

On Fri, 8 Nov 2002 11:34:24 -0000
"John L" <j...@lammtarra.fslife.co.uk> wrote:

>
> "John L" <j...@lammtarra.fslife.co.uk> wrote in message news:aqg5hu$4qr$1...@news6.svr.pol.co.uk...
> > The longer answer is there is an anomaly that might be
> > a bug in gnu awk, but this (below) works in Solaris nawk.
> >
> > You are asking about a simple subset of csv files,
> > where each field is wrapped in quotes: "a","b","c","d"
> > So this ought to be a case for simply setting FS appropriately.
> >
> > FS="\","\" works for all bar the first and last fields.
> > So to take care of the first one, we can add quote marks
> > at the start of line to FS: FS="^\"|\",\""
> > (Note that this increases the number of fields by one, as
> > there is now an empty field at the start of the line before
> > the initial quotation mark.)

You brougt me on an idea! (or maybe you made me understood the real issue...) As a solution, I now have this line:

echo "one,1","two,2","three,3" | sed -e 's/^"//;s/"$//' | awk -v d=\",\" 'BEGIN { FS = d ; OFS = FS } { print $2 }'

which substitute the leading and ending quote (") and results in:

---
[robert@cousteau:~] $ echo \"one,1\",\"two,2\",\"three,3\" | sed -e 's/^"//;s/"$//' | awk -v d=\",\" 'BEGIN { FS = d ; OFS = FS } { print $1 }'
one,1
[robert@cousteau:~] $ echo \"one,1\",\"two,2\",\"three,3\" | sed -e 's/^"//;s/"$//' | awk -v d=\",\" 'BEGIN { FS = d ; OFS = FS } { print $2 }'
two,2
[robert@cousteau:~] $ echo \"one,1\",\"two,2\",\"three,3\" | sed -e 's/^"//;s/"$//' | awk -v d=\",\" 'BEGIN { FS = d ; OFS = FS } { print $3 }'
three,3
---

Which is what I am looking for.

> >
> > Now that works in Solaris nawk, but not with gawk.
> > The reason is that the ^ in FS ties what follows to
> > the start of the line in nawk but to the start of the field
> > (separator) in Gnu awk. (Time to rtfm, methinks.)
> >
>
> Oops, forgot to add an example that shows the different
> behaviour of the two awk implementations. Take the line:
> "a","b",""c"","d"
> where the fields are: a, b, "c" (including quotes) and d
> with nawk, whereas gawk has two (extra) empty fields
> delimited by the adjacent quotation marks either side of
> the c. (Ignoring empty fields at the start and end of line.)

This also seems to work with the lines above.

Thanks!

Robert de Bock.

Joseph Shkolnik

unread,

Nov 8, 2002, 10:47:13 AM11/8/02

to

Hi Tia,

Sorry, it'll be a little bit more complex.

If $1 has symbol " inside need use regular expression to get everything
between two symbols ". Better not use Awk base fields parsing, but parse
everything inside Awk.

E.g.

s=$0;

/* remove everything before first double quote */
i=match(s,"^[^\"]*\"");

/* s1 in starts after 1st */
s1=substr(s,RSTART+RLENGTH);

/* find second */
i=match(s1,"[^\"]*\"");

/* get context between 1st & 2nd */
field_one=substr(s1,1,RLENGTH-1);

/* next field: */
s=substr(s1,RSTART+RLENGTH);

/* remove everything before first double quote of second field*/
i=match(s,"^[^\"]*\"");
/* s1 in starts after 1st */
s1=substr(s,RSTART+RLENGTH);
...

It should be not too complex to write loop around similar Awk fragment to
get all fields one by one. Need just remember that need get rid not only
from " but also from commas Awk regular expressions, function 'match' with
RSTART/RLENGTH are very power .

Regards,
-Joseph

"Robert de Bock" <rob...@dive.nl> wrote in message
news:20021108101806....@dive.nl...

Adrian Davis

unread,

Nov 11, 2002, 3:40:27 AM11/11/02

to

Robert de Bock <rob...@dive.nl> wrote in message news:<20021108101806....@dive.nl>...

If you are interested in parsing CSV files in AWK I suggest you check
out "setcsv" which can be found at:-

http://www.deja.com/=dnc/getdoc.xp?AN=603309980

...it will even parse CSV files produced by Micosoft Excell, which
allows for newlines to be embedded in quoted fields!!

Regards,
=Adrian=