
CSV files with some double-quoted fields with commas inside


Seb
Nov 5, 2010, 9:10:21 PM
Hi,

Some CSV (comma-separated values) files have the following layout for
each line:

field1,field2,"a,b,c",field4, "field5"

As you can see there are fields that may or may not be surrounded by
double quotes, and each double-quoted field may contain one or more
strings separated by commas. Also, there may be whitespace between
field separators.

I've seen some posts offering ways to parse this, but nothing seems very
efficient. I'd imagine that FS could be set to a suitable regexp, but
it's difficult to design one that handles these issues. Any tips would
be appreciated.

Cheers,

--
Seb

Hermann Peifer
Nov 6, 2010, 5:05:42 AM
On 06/11/2010 02:10, Seb wrote:
> Hi,
>
> Some CSV (comma-separated values) files have the following layout for
> each line:
>
> field1,field2,"a,b,c",field4, "field5"
>

Other CSV files might have escaped double quotes inside double quoted
fields, etc. The number of potential pitfalls is endless.

Whenever I have to handle one of these CSV files, I use awk.csv, from
http://lorance.freeshell.org/csv/
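
As a rough illustration of what such a parser does (this is a simplified sketch, not awk.csv itself, and it ignores escaped quotes entirely): scan the line character by character and toggle an in-quotes flag, so that commas inside double quotes stay part of the field.

```shell
# simplified sketch of quote-aware CSV splitting in plain awk;
# escaped quotes are NOT handled here
printf 'field1,field2,"a,b,c",field4, "field5"\n' |
awk '{
    n = 0; field = ""; inq = 0
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "\"") inq = !inq                    # toggle quoted state
        else if (c == "," && !inq) { f[++n] = field; field = "" }
        else field = field c
    }
    f[++n] = field
    for (i = 1; i <= n; i++) print i ": " f[i]
}'
```

The third field comes out as `a,b,c` with its commas intact; note that the space before `"field5"` is kept as field content.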

Hermann

Ed Morton
Nov 6, 2010, 9:19:46 AM
On 11/5/2010 8:10 PM, Seb wrote:
> Hi,
>
> Some CSV (comma-separated values) files have the following layout for
> each line:
>
> field1,field2,"a,b,c",field4, "field5"
>
> As you can see there are fields that may or may not be surrounded by
> double quotes, and each double-quoted field may contain one or more
> strings separated by commas. Also, there may be whitespace between
> fields separators.

Not really. This:

a, b ,c

should be treated as "a", " b ", and "c" not "a", "b", and "c". If you
don't want leading/trailing whitespace to be part of the field, then don't put
it in the file when you create it.

And there may be escaped quotes within the fields where "escaping" them may be
done by a backslash in front of them or pairs of quotes, or....

>
> I've seen some posts offering ways to parse this, but nothing seems very
> efficient. I'd imagine that FS could be set to a suitable regexp, but
> it's difficult to design one that handles these issues. Any tips would
> be appreciated.

The main problem with parsing CSV files is that there are various different
file formats referred to as "CSV", not one unique standard.

Here are a few definitions of "CSV formats":

http://tools.ietf.org/html/rfc4180
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
http://www.ricebridge.com/products/csvman/reference.htm
http://www.catb.org/~esr/writings/taoup/html/ch05s02.html
http://edoceo.com/utilitas/csv-file-format

Having said that, Lorance Stinson's CSV parser in awk seems to be what I hear
most people referring to when it comes to parsing CSV files, and it seems to
have some options for handling different input formats, so maybe that'll work
for you. See http://lorance.freeshell.org/csv/.

Ed.

Seb
Nov 6, 2010, 11:01:19 AM
On Sat, 06 Nov 2010 08:19:46 -0500,
Ed Morton <morto...@gmail.com> wrote:

> And there may be escaped quotes within the fields where "escaping"
> them may be done by a backslash in front of them or pairs of quotes,
> or....

Thanks Ed, Hermann, for the suggestions. This gave me an idea. I am
exporting M$ Access tables with mdb-export (from the mdbtools package).
This command has an option (-d) which allows one to select the field
delimiter. By using "\t" (TAB) for delimiting fields, it was much
simpler to parse each field by using FS="\t"... duh!
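For illustration, here is a made-up sample line shaped like a tab-delimited export; with FS="\t" the embedded comma needs no special handling:

```shell
# assuming mdb-export -d '\t' produced tab-separated lines like this
# one, awk splits them directly on tabs; the comma inside field 2 is
# just data
printf '1\tJones, A.\tactive\n' |
awk -F'\t' '{ print NF " fields; second is: " $2 }'
# prints: 3 fields; second is: Jones, A.
```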

On a related note, some of the fields need reformatting. One can easily
do this with sprintf, so for instance I can use
newfield34=sprintf("%.1f", $34). The problem is that if the field is
blank (missing), then one gets a "0.0". Is:

if ($34 ~ /^[ ]*$/) {
    newfield34 = ""
} else {
    newfield34 = sprintf("%.1f", $34)
}

the best way to deal with this? It seems awfully inefficient.

Cheers,

--
Seb

Hermann Peifer
Nov 6, 2010, 11:41:47 AM
On 06/11/2010 16:01, Seb wrote:
>
> On related note, some of the fields need reformatting. One can easily
> do this with sprintf, so for instance I can use
> newfield34=sprintf("%.1f", $34). The problem is that if the field is
> blank (missing), then one gets a "0.0". Is:
>
> if ($34 ~ /^[ ]*$/) {
> newfield34=""
> } else {newfield34=sprintf("%.1f", $34)}
>
> the best way to deal with this? It seems awfully inefficient.
>

I usually do something like this:
{ $34 = $34+0 == $34 ? sprintf("%.1f", $34) : "" }

The above test for numbers is perhaps not absolutely bullet-proof; it does,
however, work nicely in practice, at least with the data that I receive.
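For example (a small demonstration, not Hermann's exact data): the self-comparison is true only when the field compares equal to its numeric value, i.e. when it looks like a number, so numeric strings get reformatted and everything else is blanked.

```shell
# $1+0 == $1 holds for "3.5" (numeric string) but not for "abc",
# where the comparison falls back to string comparison "0" vs "abc"
printf '3.5\nabc\n' |
awk '{ print "[" ($1+0 == $1 ? sprintf("%.1f", $1) : "") "]" }'
# prints:
# [3.5]
# []
```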

Hermann

Ed Morton
Nov 6, 2010, 12:16:23 PM

Not much better, but assuming you're only formatting fields that contain digits:

newfield34=($34 ~ /[[:digit:]]/ ? sprintf("%.1f", $34) : "")

or, arguably more legibly but a bit less efficiently since you'll call sprintf()
even for empty fields:

fmt34=($34 ~ /[[:digit:]]/ ? "%.1f": "%s")
newfield34=sprintf(fmt34, $34)
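Either variant behaves the same on the data; a made-up two-line sample (using [0-9], which is equivalent to [[:digit:]] for ASCII digits):

```shell
# blank fields come out empty instead of "0.0"; numeric fields are
# reformatted to one decimal place
printf 'a\t12.34\nb\t\n' |
awk -F'\t' '{
    newfield2 = ($2 ~ /[0-9]/ ? sprintf("%.1f", $2) : "")
    print $1 ": [" newfield2 "]"
}'
# prints:
# a: [12.3]
# b: []
```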

Regards,

Ed.


Seb
Nov 6, 2010, 1:06:58 PM
On Sat, 06 Nov 2010 11:16:23 -0500,
Ed Morton <morto...@gmail.com> wrote:

[...]

> Not much better, but assuming you're only formatting fields that
> contain digits:

> newfield34=($34 ~ /[[:digit:]]/ ? sprintf("%.1f", $34) : "")

> or, arguably more legibly but a bit less efficiently since you'll call
> sprintf() for even empty fields:

> fmt34=($34 ~ /[[:digit:]]/ ? "%.1f": "%s")
> newfield34=sprintf(fmt34, $34)

Neat! Somehow, this "... ? ... : ..." construct has a very short
shelf-life in my brain.

Thank you both,

--
Seb

Ed Morton
Nov 7, 2010, 11:14:41 AM
On 11/6/2010 10:01 AM, Seb wrote:
> On Sat, 06 Nov 2010 08:19:46 -0500,
> Ed Morton<morto...@gmail.com> wrote:
>
>> And there may be escaped quotes within the fields where "escaping"
>> them may be done by a backslash in front of them or pairs of quotes,
>> or....
>
> Thanks Ed, Hermann, for the suggestions. This gave me an idea. I am
> exporting M$ Access tables with mdb-export (from the mdbtools package).
> This command has an option (-d) which allows one to select the field
> delimiter. By using "\t" (TAB) for delimiting fields, it was much
> simpler to parse each field by using FS="\t"... duh!

Yeah, unless any fields in your CSV file contain a tab character :-).

I wrote the script below, which I called ctsv.awk, to convert from CSV format to
a tab-separated-value format (which I'm calling TSV for want of a better name)
and back.

By default, when doing CSV->TSV conversion it:
a) replaces newlines with control-N characters,
b) replaces tabs with control-Ts,
c) replaces escaped quotes (\" or "") inside quoted fields with just a quote (")
d) replaces commas outside of quoted fields with tabs
e) discards quotes around fields

and when doing TSV->CSV conversion it does the reverse:
a) replaces control-N characters with newlines,
b) replaces control-Ts with tabs,
c) replaces quotes (") with escaped quotes ("")
d) replaces tabs with commas
e) quotes fields containing commas, quotes, or newlines.

So, while in the TSV format you can operate on your data using awk just by
setting FS and OFS to tab, then convert it back to CSV later if necessary.
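Once the data is in TSV form, ordinary tab-based one-liners work on it; a made-up example:

```shell
# in the TSV representation an embedded comma is just field content,
# so editing the third field is a plain FS/OFS job
printf 'field1\ta,b,c\tfield3\n' |
awk 'BEGIN { FS = OFS = "\t" } { $3 = "edited"; print }'
```

The output keeps the comma-laden second field intact and replaces only the third field.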

Try it if you like - it should be fine unless your CSV file contains things like
\\" inside a quoted field.

Ed.

$ cat ctsv.awk
# Converts Comma-Separated-Value (CSV) <-> Tab-Separated-Value (TSV) files
#
# Usage:
#
# awk thisFile in.csv > out.tsv
# awk -v c2t=1 thisFile in.csv > out.tsv
# awk -v t2c=1 thisFile in.tsv > out.csv
#
# Variables you can overwrite using -v:
#
# tab = the character to replace tabs in the CSV file.
# nl = the character to replace newlines in the CSV file.
# fs = the field separator in the CSV files.
# eq = the characters for escaped quotes in the CSV files.
# qtspc = a flag to indicate it's necessary to quote fields
# in the TSV files that start or end with spaces.
# caterr = a string to replace "cat>&2" for printing to stderr.

BEGIN {
    if (t2c) {
        # quote fields that have leading or trailing spaces?
        qtspc = (qtspc == "" ? 0 : qtspc)

        FS  = "\t"
        deq = "\"\""
    } else {
        c2t = 1

        FS  = ""
        deq = "[\"\\\\]\""
    }

    tab = (tab == "" ? "¶" : tab)
    nl  = (nl  == "" ? "♫" : nl)
    fs  = (fs  == "" ? "," : fs)
    eq  = (eq  == "" ? deq : eq)

    caterr = (caterr == "" ? "cat>&2" : caterr)
}

c2t {
    if (FNR == 1) { rec = ors = ""; quoted = 0 }

    if (($0 ~ nl) || ($0 ~ tab)) {
        printf "ERROR: Input record %d:\n <<<%s>>>\n\
already contains newline(%s) and/or tab(%s) replacements.\n\n",
            FNR, $0, nl, tab | caterr
        exit 1
    }

    printf "%s%s", rec, (quoted ? nl : ors)
    rec = ""; ors = ORS

    gsub(eq, RS)
    for (i = 1; i <= NF; i++) {
        char = $i
        if ($i == fs)   { if (!quoted) char = "\t" }
        if ($i == "\"") { quoted = !quoted; char = "" }
        if ($i == RS)   { char = "\"" }
        rec = rec char
    }
}

t2c {
    gsub(nl, "\n")
    gsub(tab, "\t")
    gsub(/\"/, eq)
    sep = ""
    for (i = 1; i <= NF; i++) {
        q = ($i ~ /[,\"\n]/ ? "\"" : "")
        q = (qtspc && ($i ~ /^[[:space:]]|[[:space:]]$/) ? "\"" : q)
        printf "%s%s%s%s", sep, q, $i, q
        sep = fs
    }
    print ""
}

END {
    printf "%s%s", rec, ors
}

Seb
Nov 7, 2010, 4:48:11 PM
On Sun, 07 Nov 2010 10:14:41 -0600,
Ed Morton <morto...@gmail.com> wrote:

> On 11/6/2010 10:01 AM, Seb wrote:
>> On Sat, 06 Nov 2010 08:19:46 -0500,
>> Ed Morton<morto...@gmail.com> wrote:

>>> And there may be escaped quotes within the fields where "escaping"
>>> them may be done by a backslash in front of them or pairs of quotes,
>>> or....

>> Thanks Ed, Hermann, for the suggestions. This gave me an idea. I am
>> exporting M$ Access tables with mdb-export (from the mdbtools
>> package). This command has an option (-d) which allows one to select
>> the field delimiter. By using "\t" (TAB) for delimiting fields, it
>> was much simpler to parse each field by using FS="\t"... duh!

> Yeah, unless any fields in your CSV file contain a tab character :-).

Exactly... as Hermann said, the potential pitfalls with these files are
endless.


> I wrote the script below, which I called ctsv.awk, to convert from CSV
> format to a tab-separated-value format (which I'm calling TSV for want
> of a better name) and back.

> By default, when doing CSV->TSV conversion it:
> a) replaces newlines with control-N characters,
> b) replaces tabs with control-Ts,
> c) replaces escaped quotes (\" or "") inside quoted fields with just a
> quote (")
> d) replaces commas outside of quoted fields with tabs

> e) discards quotes around fields

> and when doing TSV->CSV conversion it does the reverse:
> a) replaces control-N characters with newlines,
> b) replaces control-Ts with tabs,
> c) replaces quotes (") with escaped quotes ("")
> d) replaces tabs with commas
> e) quotes fields containing commas, quotes, or newlines.

> So, while in the TSV format you can operate on your data using awk
> just by setting FS and OFS to tab, then convert it back to CSV later
> if necessary.

> Try it if you like - it should be fine unless your CSV file contains
> things like \\" inside a quoted field.

Thanks for this contribution! It's a great help before doing anything else
with these files.

One question though: what is the purpose of replacing the newline
character in either direction of the conversion?


--
Seb

Ed Morton
Nov 7, 2010, 7:03:18 PM

So that the resulting records are all on one line and you can use the newline
character as the RS in the TSV format. The alternative is that you need to come
up with some other character to use as the RS.

Ed.
