
Parsing a CSV file containing quoted strings (themselves containing commas)

Pierre-Philippe Ravier

Mar 27, 2000

Hello,

I have a file with lines like this:
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk", "jiuiieziure, ioiooii"

As you can see, the fields are separated by commas. I simply need to extract
some columns to make another file, which is one of the most common tasks for
awk. However, I face two problems:
- Some columns are strings surrounded by speech marks. I want to get rid of
them.
- Some of these strings may contain commas, which confuse awk because it
takes them as separators without knowing that they are part of the field. I
didn't manage to make awk understand that in this case a comma must not be
used as a separator.

The only solution I found is the following:
- Parse the file a first time with awk, character by character, finding the
commas inside strings and the speech marks myself, in order to eliminate
them (I need neither the speech marks nor the commas inside the strings). I
keep all the other characters.
- Parse the result again to grab the columns I want.

The first pass takes a lot of time because I must go character by
character...
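
To give an idea, the first pass looks roughly like this (simplified; it
walks each line one character at a time, dropping the speech marks and any
commas that fall inside them), and the second pass is then an ordinary
awk job with comma as the field separator:

{
    out = ""; inq = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "\"") { inq = !inq; continue }  # drop the speech mark
        if (c == "," && inq) continue            # drop commas inside strings
        out = out c
    }
    print out
}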

Is there a more elegant way to do what I need? Maybe awk cannot do it any
better; in that case I can switch to another tool if you have a suggestion.

Best regards.


Robert M Lowrey

Mar 28, 2000

Awk cannot do it?!?! NEVER!

Run the command: awk -f ref.awk input.dat | awk -f ref2.awk > output.dat

ref.awk reformats the file with | as the field separator; ref2.awk is just a
sample that shows how to set FS to the new field separator, so you can write
your own awk to grab the columns you want.

It should run reasonably fast.
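
For example (just a sketch), once the data is |-separated, you could grab,
say, columns 1 and 8 with a one-liner instead of ref2.awk:

awk -f ref.awk input.dat | awk -F'|' '{ print $1, $8 }'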

Robert L.

======start of ref.awk======
#
# Assumption: records do NOT contain the | character!!!!
# Assumption: quoted fields contain either 0 or 1 commas
#
# take the line apart and build a table of fields
{
    gsub(/,/, "|", $0)           # change commas to bars
    gsub(/ \|/, "|", $0)         # get rid of spaces before and after bars
    gsub(/\| /, "|", $0)
    n = split($0, pad, "|")      # split the line into table pad[]
    for (x = 1; x <= n; x++) {   # rebuild and deal with embedded commas
        rq = substr(pad[x], 1, 1)
        q = gsub(/"/, "", pad[x])
        if (q == 1 && rq == "\"") {   # lone opening quote: field was split
            gsub(/"/, "", pad[x+1])
            pad[x] = pad[x] ", " pad[x+1]
            pad[x+1] = ""
        }
    }
}

# rebuild the line with the new field separators and print
{
    rec = ""
    for (x = 1; x <= n; x++) {
        if (rec == "")
            rec = pad[x]
        else if (pad[x] != "")   # skip null fields
            rec = rec "|" pad[x]
    }
    print rec
}
======end of ref.awk======
======end of ref.awk======

======start of ref2.awk======
BEGIN { FS="|" }

{ print "Record: " NR
print "Field 1: " $1
print "Field 2: " $2
print "Field 3: " $3
print "Field 4: " $4
print "Field 5: " $5
print "Field 6: " $6
print "Field 7: " $7
print "Field 8: " $8
print "Field 9: " $9
print "Field 10: " $10 }
======end of ref2.awk======

======start of input.dat======
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk","jiuiieziure, ioiooii"
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk","jiuiieziure, ioiooii"
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk","jiuiieziure, ioiooii"
1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk","jiuiieziure, ioiooii"
======end of input.dat======

======start of my output.dat======
Record: 1
Field 1: 1
Field 2: 45
Field 3: 789
Field 4: 41
Field 5: popo
Field 6: klrpkretp
Field 7: rrtt
Field 8: klklkllk, djkdjf
Field 9: klkmmlk
Field 10: jiuiieziure, ioiooii
Record: 2
Field 1: 1
Field 2: 45
Field 3: 789
Field 4: 41
Field 5: popo
Field 6: klrpkretp
Field 7: rrtt
Field 8: klklkllk, djkdjf
Field 9: klkmmlk
Field 10: jiuiieziure, ioiooii
Record: 3
Field 1: 1
Field 2: 45
Field 3: 789
Field 4: 41
Field 5: popo
Field 6: klrpkretp
Field 7: rrtt
Field 8: klklkllk, djkdjf
Field 9: klkmmlk
Field 10: jiuiieziure, ioiooii
Record: 4
Field 1: 1
Field 2: 45
Field 3: 789
Field 4: 41
Field 5: popo
Field 6: klrpkretp
Field 7: rrtt
Field 8: klklkllk, djkdjf
Field 9: klkmmlk
Field 10: jiuiieziure, ioiooii
======end of my output.dat======


"Pierre-Philippe Ravier" <ppra...@bigfoot.com> wrote in message
news:38dfe...@news2.cluster1.telinco.net...

PEZ

Mar 28, 2000

In article <38dfe...@news2.cluster1.telinco.net>,

As someone said, the comma must be the worst possible field separator to
choose. Anyway, you should search the archives for "Parsing CSV" and you'll
find lots and lots of input. A few months ago we took this around several
rounds; here's my suggestion from that thread:

#!/usr/bin/awk -f
BEGIN { FS=SUBSEP; OFS="|" }

{
    result = setcsv($0, ",")
    print                    # fields are now in $1..$NF, joined by OFS
}

# setcsv(str, sep) - parse CSV (MS specification) input
#   str, the string to be parsed. (Most likely $0.)
#   sep, the separator between the values.
#
# After a call to setcsv the parsed fields are found in $1 to $NF.
# setcsv returns 1 on success and 0 on failure.
#
# By Peter Strömberg aka PEZ.
# Based on setcsv by Adrian Davis. Modified to handle a separator
# of choice and embedded newlines. The basic approach is to take the
# burden off of the regular expression matching by replacing ambiguous
# characters with characters unlikely to be found in the input. For
# this the character "\035" is used.
#
# Note 1. Prior to calling setcsv you must set FS to a character which
#         can never be found in the input. (Consider SUBSEP.)
# Note 2. If setcsv can't find the closing double quote for the string
#         in str it will consume the next line of input by calling
#         getline and call itself until it finds the closing double
#         quote or no more input is available (considered a failure).
# Note 3. Only the "" representation of a literal quote is supported.
# Note 4. setcsv will probably misbehave if sep used as a regular
#         expression can match anything other than what a call to
#         index() would match.
#
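# For example, with sep = "," and FS = SUBSEP, a line like
#     a,"b,c",d
# first has all of its commas turned into SUBSEP; the match loop then
# finds the quoted region, turns the SUBSEPs inside it back into commas
# and strips the quotes, so the resulting fields are:  a | b,c | d
#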
function setcsv(str, sep,   i, middle) {
    gsub(/""/, "\035", str)
    gsub(sep, FS, str)

    while (match(str, /"[^"]*"/)) {
        middle = substr(str, RSTART+1, RLENGTH-2)
        gsub(FS, sep, middle)
        str = sprintf("%.*s%s%s", RSTART-1, str, middle,
                      substr(str, RSTART+RLENGTH))
    }

    if (index(str, "\"")) {
        return ((getline) > 0) ? setcsv(str (RT != "" ? RT : RS) $0, sep) : !setcsv(str "\"", sep)
    } else {
        gsub(/\035/, "\"", str)
        $0 = str

        for (i = 1; i <= NF; i++)
            if (match($i, /^"+$/))
                $i = substr($i, 2)

        $1 = $1 ""
        return 1
    }
}
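
Save it as, say, csv.awk (the name doesn't matter) and run it like:

awk -f csv.awk yourfile.csv

Each record comes out with its fields separated by | (the OFS set in the
BEGIN block), so picking out columns afterwards is trivial.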

Deja will probably screw up the indentation of the script. Use = in vim or
the equivalent in whatever editor you are using to fix it.

Regards,
/Peter
--
-= Spam safe(?) e-mail address: pez68 at netscape.net =-


Sent via Deja.com http://www.deja.com/
Before you buy.

Charles Demas

Mar 28, 2000

In article <38dfe...@news2.cluster1.telinco.net>,
Pierre-Philippe Ravier <ppra...@bigfoot.com> wrote:
>Hello,
>
>I have a file with lines like this:
>1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk", "jiuiieziure, ioiooii"
>
>[snip]
>
>Is there a more elegant way to do what I need? Maybe awk cannot do it any
>better; in that case I can switch to another tool if you have a suggestion.

If you assume that each line has complete (matched) quoted strings, then
with " as the FS the even-numbered fields are the ones that were inside
quotes. So, to replace the commas in such fields with a space, one could do
the following:

awk -F'"' '/"/ {for(i=2;i<=NF;i+=2){gsub(/,/, " ", $i)}}{print}' infile

The above replaces the commas in the quoted fields with spaces; the double
quotes themselves also come out as spaces, because modifying a field makes
awk rebuild $0 with the default OFS (a space).

To eliminate the double quotes entirely, set OFS="" like this:

awk -F'"' 'BEGIN{OFS=""}
/"/ {for(i=2;i<=NF;i+=2){gsub(/,/, " ", $i)}}{print}' infile

for your infile:

1, 45, 789, 41, popo, "klrpkretp", rrtt, "klklkllk,djkdjf", "klkmmlk", "jiuiieziure, ioiooii"

this last script produces:

1, 45, 789, 41, popo, klrpkretp, rrtt, klklkllk djkdjf, klkmmlk, jiuiieziure ioiooii


The assumption of matched double quotes on each and every line might not be
valid for your data, but if it is, this approach would work.

Some CSV files can be screwed up and have badly quoted strings.

You might want to check first whether there are any lines with unmatched
quotes. Something like this should tell you how many such lines there are
(it deletes every character except the quotes, cancels out the pairs, and
counts the lines that still have a quote left over):

sed -e 's/[^"]//g;s/""//g' infile | grep '"' | wc -l

You could check with awk too, perhaps with something more complex:

awk '/"/ { a = $0; gsub(/[^"]/, "", a)
           if (length(a) % 2 == 1) { cnt++
               print cnt ". Record", NR, "had unmatched quotes:"; print } }
     END { if (cnt == 0) { print "No unmatched double quotes" }
           if (cnt > 1)  { print "There were", cnt, "problem lines" } }' infile


Chuck Demas
Needham, Mass.

--
Eat Healthy | _ _ | Nothing would be done at all,
Stay Fit | @ @ | If a man waited to do it so well,
Die Anyway | v | That no one could find fault with it.
de...@tiac.net | \___/ | http://www.tiac.net/users/demas
