
CSV Files revisited


PEZ

Jan 12, 2000
Hello,

Today I had to deal with Excel CSV files using awk. I recalled that
quite a few solutions were posted here recently. Before I tried any of
those I just had to try for myself, and came up with the function
below. It assumes $1 to $NF have already been populated by awk using
a non-regexp FS. The function then attempts to "rejoin" any double-
quoted strings that have been "accidentally" split by awk. The idea is
that awk does it almost right and I just have to fix the (hopefully
few) exceptions. Like so:

function rejoin_csvstrs(a, n, i, j) {
    for (i in a)
        delete a[i]
    for (i = 1; i <= NF; i++) {
        a[++n] = $i
        # do we have a double quoted string here?
        if (substr(a[n], 1, 1) == "\"") {
            # yes. does it contain the closing "?
            if (!match(a[n], /([^"]("")+"|[^"\\]")$/)) {
                # nope. we need to rejoin it.
                for (j = 1; i + j <= NF; j++) {
                    a[n] = a[n] FS $(i + j)
                    # is this the field with the closing "?
                    if (match($(i + j), /([^"]("")+"|[^"\\]"|^")$/)) {
                        i += j
                        break
                    }
                }
                if (RLENGTH == -1)
                    return "-1 Can't find closing \" for field " i
            }
            # remove enclosing "s
            a[n] = substr(a[n], 2, length(a[n])-2)
        }
    }
    return n
}

The regular expression tries to make sure the ending " really ends the
string. I'm not sure I got it all correct. In particular, I think it
screws up on backslash-quoted "s. Since my CSV input came from Excel
that was not a problem for me. Excel uses "" to represent a literal ".
Maybe the function should also exchange those "" with just "?

Note that this approach doesn't assume comma is the delimiter. In fact
in my case semicolons are used, since comma is the decimal "point" in
numbers here in Sweden. MS in Sweden has chosen the extension SKV
where you Americans would see CSV. (Colon spells kolon in Swedish.)

I also recall someone benchmarking the other CSV-handling functions. I
wonder how my function would fare in that benchmark.... It should be
pretty fast, I think, since my approach doesn't do much work on rows
and fields that don't need the processing.

If anyone cares to try this function out please inform me (via this
newsgroup) of what bugs and bad behaviour it shows.

Thanks for your attention,
/Peter
--
-= Spam safe(?) e-mail address: pez68 at netscape.net =-



PEZ

Jan 12, 2000
In article <85ier6$ars$1...@nnrp1.deja.com>,

It bugged me that the function couldn't deal with backslash-quoted
double quotes (\") very well. So I've revised the regular expressions
some to at least do it better. It's still easy to confuse the function
by mixing the "" and the \" notations. But any sane input shouldn't do
that anyway. Here's the updated function:

function rejoin_csvstrs(a, n, i, j) {
    for (i in a)
        delete a[i]
    for (i = 1; i <= NF; i++) {
        a[++n] = $i
        # do we have a double quoted string here?
        if (substr(a[n], 1, 1) == "\"") {
            # yes. does it contain the closing "?
            if (!match(a[n], /([^"]("")+"|(\\")+"|[^"\\]")$/)) {
                # nope, we need to rejoin it.
                for (j = 1; i + j <= NF; j++) {
                    a[n] = a[n] FS $(i + j)
                    # is this the field with the closing "?
                    if (match($(i + j), /([^"]("")+"|[^"\\]"|^(\\")*")$/)) {
                        i += j
                        break
                    }
                }
                if (RLENGTH == -1)
                    return "-1 Can't find closing \" for field " i
            }
            # remove enclosing "s
            a[n] = substr(a[n], 2, length(a[n])-2)
        }
    }
    return n
}

Used like this:

{
    len = rejoin_csvstrs(b)
    printf("%s %s %s : ", NR, NF, len)
    for (i = 1; i < len; i++)
        printf("%s", b[i] FS)
    print b[i]
}

(Called with awk -F \;)

On this input:

1;"Dreamcast\"";3;4
1;";Dreamcast";3;4
1;";Dreamcast;\""";3;4
1;";Dreamcast;;3;4
1;";Dreamcast;"3;4
"Dreamcast";2;3
"Dreamcast";""Arcade"" Mode";;;;Gösta
"Dreamcast; ""Arcade"" Mode";;;;Gösta
"Dreamcast; ""Arcade"" Mode""""""";;;;Gösta
"Dreamcast; ""Arcade"" Mode"""""";;;;Gösta
"Dreamcast; ""Arcade"" Mode"""""";";;;Gösta
1;3;Bufo Bufo;"Dreamcast; ""Arc;ade"" Mode";;;;Gösta
"Dreamcast; ""Arc;ade"" Mode;;;;Gösta
1;2;3;4;5;pez;kat;wil;is a family;10

It produces this:

1 4 4 : 1;Dreamcast\";3;4
2 5 4 : 1;;Dreamcast;3;4
3 6 4 : 1;;Dreamcast;\"";3;4
4 6 -1 Can't find closing " for field 2 : 1
5 5 -1 Can't find closing " for field 2 : 1
6 3 3 : Dreamcast;2;3
7 6 6 : Dreamcast;"Arcade"" Mode;;;;Gösta
8 6 5 : Dreamcast; ""Arcade"" Mode;;;;Gösta
9 6 5 : Dreamcast; ""Arcade"" Mode"""""";;;;Gösta
10 6 -1 Can't find closing " for field 1 : "Dreamcast; ""Arcade"" Mode"""""";;;;Gösta
11 6 4 : Dreamcast; ""Arcade"" Mode"""""";;;;Gösta
12 10 8 : 1;3;Bufo Bufo;Dreamcast; ""Arc;ade"" Mode;;;;Gösta
13 7 -1 Can't find closing " for field 1 : "Dreamcast; ""Arc;ade"" Mode;;;;Gösta
14 10 10 : 1;2;3;4;5;pez;kat;wil;is a family;10

As you can see line 3 isn't correctly dealt with. But I can't bring
myself to clutter the regexp any more to deal with it. In fact it could
even be argued that one should use two different regexps (thus two
functions) for the different kinds of quoting.

FWIW,

Harlan Grove

Jan 12, 2000
I've written a few lines on this in the past. Your approach
is basically the same as the one I use. My only difference is
using gsub(/"/, "\"", current_field) % 2 == 0 to check
whether current_field has an even number of double quotes,
which, given Excel's use of "" as a literal ", is the
indicator that you've found the ending field delimiter.
[I don't think MS has ever written any software in which
backslash was anything other than a directory separator -
Xenix doesn't count.]



PEZ

Jan 13, 2000
In article <254d6108...@usw-ex0109-069.remarq.com>,

Harlan Grove <hrlngrv...@aol.com.invalid> wrote:
> I've written a few lines on this in the past. Your approach
> is basically the same I use. My only difference is using

Oh. I didn't realize. Must check your csv postings then. Hmmm, your
function hangs on my sample input. Checked it some more and found that
Peter J. Farley had nailed that bug. Now your csvsplit function deals
with the real MS CSV specs, I see. It produces quite different output
from my rejoin_csvstrs function. But I reckon I could rather easily add
the MS behaviour since, as you say, our approaches are almost exactly
the same.

> gsub(/"/, "\"", current_field) % 2 == 0 to check if
> current_field has an even number of double quotes, which,
> given Excel's use of "" as literal ", is the indicator that
> you've found the ending field delimiter. [I don't think MS
> has ever written any software in which backslash was
> anything other than a directory separator - Xenix doesn't
> count.]

That makes sense. (Assuming you mean != 0 and "_un_even number" there.)
Though I really wanted to allow for \" as well since I have seen that
in some csv-ish files in the past. (And MS has written a lot of
software with backslash as a quoting character. Even if you don't count
Xenix, which I do count.)

There are a few aspects of csvsplit() that confuse me. I think it must
exploit some things I don't know about awk. You call it with $0 as the
only argument. Yet the function seems to just throw it away and replace
its copy with the result of "length(f[newf])", which seems not to be
used anywhere in the program. What's going on?

I'll copy the bug-fixed csvsplit here for convenience:

BEGIN { FS="," }
{ n = csvsplit($0) }
function csvsplit( n, newf, oldf, fs) {
    fs = FS
    oldf = newf = 0
    while ( oldf < NF ) {
        f[++newf] = $(++oldf)
        if ( f[newf] ~ /^"/ ) {
            while ( gsub(/"/, "\"", f[newf]) % 2 ) {
                if ( oldf >= NF ) {
                    if ((getline) > 0) {
                        oldf = 0
                        fs = "\n"
                    }
                    else break
                } else fs = FS
                f[newf] = f[newf] fs $(++oldf)
            }
            sub(/^"/, "", f[newf])
            sub(/"$/, "", f[newf])
            gsub(/""/, "\"", f[newf])
        }
        n = length(f[newf])
    }
    return newf
}

I thought making f global was unnecessary, but I don't know, since it
seems you know something I don't.

Regards,

PEZ

Jan 13, 2000
In article <85ier6$ars$1...@nnrp1.deja.com>,
PEZ <pe...@my-deja.com> wrote:
> Today I had to deal with Excel CSV files using awk. I recalled that
> quite a few solutions were posted here recently. Before I tried any
> of those I just had to try for myself, and came up with the function
> below. It assumes $1 to $NF have already been populated by awk using
> a non-regexp FS. The function then attempts to "rejoin" any double-
> quoted strings that have been "accidentally" split by awk. The idea
> is that awk does it almost right and I just have to fix the
> (hopefully few) exceptions. Like so:
>
> function rejoin_csvstrs(a, n, i, j) {

[snip]

OK, so I've revised the function some. I took Harlan's little hint and
skipped dealing with backslashed double quotes. I also added the
details so it can deal with fields having embedded newlines. Here's the
function:

function rejoin_csvstrs(a, n, i, j, continued) {
    for (i = 1; i <= NF; i++) {
        if (!continued)
            a[++n] = $i
        if (substr(a[n], 1, 1) == "\"") {
            for (j = i + 1; !match(a[n], /([^"]("")+|[^"])"$/) && j <= NF; j++)
                a[n] = a[n] FS $j
            if (RLENGTH == -1) {
                if ((getline) <= 0)
                    return n " Unexpected EOF"
                a[n] = a[n] "\n" $1
                continued = 1
                i = 0
                continue
            }
            else {
                a[n] = substr(a[n], 2, length(a[n])-2)
                gsub(/""/, "\"", a[n])
                continued = 0
                i = j - 1
            }
        }
    }
    return n
}

I put it to the test against some sample (insane) input like this:

1;";Dreamcast;"";3;4
1;";Dreamcast;;3;4
"Dreamcast";2;3

"A;""B"";";C
1;"Dreamcast";3;4;"pez"


1;";Dreamcast";3;4
"Dreamcast";2;3
1;";Dreamcast;"3;4
"Dreamcast";""Arcade"" Mode";;;;Gösta
"Dreamcast; ""Arcade"" Mode";;;;Gösta
"Dreamcast; ""Arcade"" Mode""""""";;;;Gösta
"Dreamcast; ""Arcade"" Mode"""""";;;;Gösta
"Dreamcast; ""Arcade"" Mode"""""";";;;Gösta
1;3;Bufo Bufo;"Dreamcast; ""Arc;ade"" Mode";;;;Gösta
"Dreamcast; ""Arc;ade"" Mode;;;;Gösta
1;2;3;4;5;pez;kat;wil;is a family;10

1;";Dreamcast;"";3;4

Harlan's csvsplit() and my rejoin_csvstrs() disagree on a few of the
entries. Though I'm not sure which one does the right thing. I also
opened the input in Excel and found that Excel disagrees with both
functions. Anyway, on sane input all three agree.

Since I don't do things like [ gsub(/"/, "\"", str) % 2 ] to check for
the ending quote of strings I thought my function would be faster. But
I was wrong. On the insane input above (about 20,000 lines of it) my
function runs at about 80% of the speed of Harlan's. (Using gawk 3.0.3
on Win95.) And on sane input Harlan's function is about twice as fast
as mine. (Excel beats both by orders of magnitude. While Harlan's and
my functions took 20+ secs to finish the job, Excel just opened it in
under a second.) But why is Harlan's function so much faster than
mine? In desperation I tried using regexps instead of my substr()
calls, but that only slowed performance down (as I had expected it
would). It's something else, and I can't see it.

Well, that was it for now.

Harlan Grove

Jan 13, 2000
In article <85l3aj$9bf$1...@nnrp1.deja.com>, PEZ <pez68@my-deja.com> wrote:
> In article <85ier6$ars$1...@nnrp1.deja.com>,

<snip>

> for (j = i + 1; !match(a[n], /([^"]("")+|[^"])"$/) && j <= NF; j++)
>     a[n] = a[n] FS $j

I'd guess one problem is here. My own functions use
gsub(/"/, "\"", a[n]) % 2 == 0 rather than your match().

<snip>

>Harlan's csvsplit() and my rejoin_csvstrs() disagree
>on a few of the entries. Though I'm not sure which one
>does the right thing. . . .

So what are the differences?

> . . . I also opened the input in Excel and found that
>Excel disagrees with both functions.

Again, what are the differences?

> . . . Anyway, on sane input all three agree. Since I
>don't do things like [ gsub(/"/, "\"", str) % 2 ] to
>check for the ending quote of strings I thought my
>function would be faster. But I was wrong. On the insane
>input above (about 20,000 lines of it) my function runs at
>about 80% of the speed of Harlan's. (Using gawk 3.0.3 on
>Win95.) And on sane input Harlan's function is about
>twice as fast as mine. . . .

Yes, mine works better when there are few double quotes as
it does less work. Yours works better when there are many
double quotes - same reasons as for the different
performance of my approach and that of Jim Monty's.

> . . . (Excel beats both by orders of magnitude. While Harlan's
>and my functions took 20+ secs to finish the job,
>Excel just opened it in under a second.) . . .

Maybe a function of the underlying I/O. If Excel slurps
LARGE chunks of disk files into RAM (64KB+) at a time while
gawk reads more modest chunks (say 2-4KB at a time), Excel
could be much faster. Also, Excel's CSV parser is almost
certainly implemented in C, processing a character at a
time. Since it'd be using char pointers, it's going to tear
through the file MUCH QUICKER than any awk could. Shouldn't
be surprising: a tuned, single purpose C function had
better be able to run several times faster than a script.

PEZ

Jan 13, 2000
In article <1fa3a9c0...@usw-ex0109-069.remarq.com>,

Harlan Grove <hrlngrv...@aol.com.invalid> wrote:
> In article <85l3aj$9bf$1...@nnrp1.deja.com>, PEZ <pez68@my-deja.com> wrote:
> > In article <85ier6$ars$1...@nnrp1.deja.com>,
>
> <snip>
>
> > for (j = i + 1; !match(a[n], /([^"]("")+|[^"])"$/) && j <= NF; j++)
> >     a[n] = a[n] FS $j
>
> I'd guess one problem is here. My own functions use
> gsub(/"/, "\"", a[n]) % 2 == 0 rather than your match().

Yes, definitely. I tried using your gsub() instead and it made all the
difference. That's far from what my intuition told me, but it was good
to know.

> <snip>
>
> >Harlan's csvsplit() and my rejoin_csvstrs() disagree
> >on a few of the entries. Though I'm not sure which one
> >does the right thing. . . .
>
> So what are the differences?

My function tries harder to find a closing double quote. On line 7 in
that sample input my function doesn't accept "3 and instead goes on to
find a lone quote ending the field. The input is erroneous (as far as
I know the specs) so I think either result can be accepted.

> > . . . I also opened the input in Excel and found that
> >Excel disagrees with both functions.
>
> Again, what are the differences?

On row 7 in that input Excel agrees with csvsplit(), but later on Excel
starts to build more fields from dubious input than either your
function or mine does. I'd say the functions go from pickiest
(rejoin_csvstrs) to most forgiving (Excel), via csvsplit which chooses
a golden middle way.

> > . . . (Excel beats both by orders of magnitude. While Harlan's
> >and my functions took 20+ secs to finish the job,
> >Excel just opened it in under a second.) . . .
>
> Maybe a function of the underlying I/O. If Excel slurps
> LARGE chunks of disk files into RAM (64KB+) at a time while
> gawk reads more modest chunks (say 2-4KB at a time), Excel
> could be much faster. Also, Excel's CSV parser is almost
> certainly implemented in C, processing a character at a
> time. Since it'd be using char pointers, it's going to tear
> through the file MUCH QUICKER than any awk could. Shouldn't
> be surprising: a tuned, single purpose C function had
> better be able to run several times faster than a script.

Yes, it was not very surprising that Excel was the fastest. I just
mentioned it because I observed it, and after having waited at my
prompt for those scripts I was a tad surprised that Excel was _that_
fast.

By the way I tried the scripts on an AIX 4.3 box and the nawk there
seems to be equally fast on both our functions. Maybe gawk is better
optimized for really simple regexps like that one in your gsub()?

Adrian Davis

Jan 14, 2000

> Hello,

>
> Today I had to deal with Excel csv files using awk.

Just like you I thought I'd have a go at this challenge. The following
function sets $1, $2 ... $NF to the appropriate values. Two double
quotes ("") and an escaped quote (\") are converted to a literal quote
("). In this example the field separator should be set to "\034":-

#=(Start of function)=#
function setcsv() {
    gsub(/\\"/, "\035")
    gsub(/""/, "\035")
    while (match($0, /"[^"]*"/)) {
        middle = substr($0, RSTART+1, RLENGTH-2)
        gsub(",", "\036", middle)
        $0 = substr($0, 1, RSTART-1) middle substr($0, RSTART+RLENGTH)
    }
    gsub("\035", "\"")
    gsub(",", "\034")
    gsub("\036", ",")
}
#=(End of function)=#

The lines containing "\035" can be omitted if the ("") and (\") literal
conversions are not required.
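A minimal driver sketch (it assumes the setcsv() function above is
pasted into the same script; the field printing is just for
illustration):

BEGIN { FS = "\034" }
{
    setcsv()   # rewrites $0 so awk re-splits it into $1..$NF on "\034"
    for (i = 1; i <= NF; i++)
        printf "field %d: [%s]\n", i, $i
}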

Regards,
=Adrian=


PEZ

Jan 14, 2000
In article <85n0jo$8ck$1...@starburst.uk.insnet.net>,

"Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:
> Just like you I thought I'd have a go at this challenge. The
> following function sets $1, $2 ... $NF to the appropriate values.
> Two double quotes ("") and an escaped quote (\") are converted to a
> literal quote ("). In this example the field separator should be set
> to "\034":-
>
> #=(Start of function)=#
> function setcsv() {
>     gsub(/\\"/, "\035")
>     gsub(/""/, "\035")
>     while (match($0, /"[^"]*"/)) {
>         middle = substr($0, RSTART+1, RLENGTH-2)
>         gsub(",", "\036", middle)
>         $0 = substr($0, 1, RSTART-1) middle substr($0, RSTART+RLENGTH)
>     }
>     gsub("\035", "\"")
>     gsub(",", "\034")
>     gsub("\036", ",")
> }
> #=(End of function)=#

YES! I also thought about going right at $1 to $NF because it feels
cleaner. You only need to call setcsv() in the first rule after BEGIN
and the rest of any awk script can be used as is. For the real-life
case which started me writing my function this is much better, and I
have exchanged my function for this one (after changing it so it deals
with semicolons instead of commas). It meant I could use the original
awk script as it was when I started.

Two things are lacking from your approach though, as I see it:
1) The separator needs to be hardcoded into the script.
2) It doesn't deal with embedded newlines in the string fields.

I guess 1) is harder to fix than 2)?

Your function kicks butt compared with mine in terms of performance.
What takes my function 30 secs your function does in 6 secs! (Simple,
sane input.)

(On my not-so-simple, insane sample input your function is also much
faster, but since it doesn't produce the correct result it doesn't
matter. =))

Thanks for your contribution. My real life case felt messy with my own
function. Now that is fixed.

Adrian Davis

Jan 14, 2000

> Two things are lacking from your approach though, as I see it:
> 1) The separator needs to be hardcoded into the script.
> 2) It doesn't deal with embedded newlines in the string fields.
>
> I guess 1) is harder to fix than 2)?

1) How would you like this to work? I'll see what I can think of.
2) Yep, you're right, far too fiddly. However, I haven't (as yet!)
come across an application which puts newlines into string fields.

> (On my not-so-simple, insane sample input your function is also much
> faster, but since it doesn't produce the correct result it doesn't
> matter. =))

I'm not sure how to deal with "insane" input in a general sense. The
definition of "correct result" would be application specific. I think
that in most cases it would be better to fix the application which is
producing the "insane" data!! :-)

Regards,
=Adrian=


PEZ

Jan 14, 2000
In article <85ncsk$c0h$1...@starburst.uk.insnet.net>,

"Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:
>
> > Two things are lacking from your approach though, as I see it:
> > 1) The separator needs to be hardcoded into the script.
> > 2) It doesn't deal with embedded newlines in the string fields.
> >
> > I guess 1) is harder to fix than 2)?
>
> 1) How would you like this to work? I'll see what I can think of.

It was very easy:

function setcsv(sep) {
    gsub(/["\\]"/, "\035")
    while (match($0, /"[^"]*"/)) {
        middle = substr($0, RSTART+1, RLENGTH-2)
        gsub(sep, "\036", middle)
        $0 = substr($0, 1, RSTART-1) middle substr($0, RSTART+RLENGTH)
    }
    gsub("\035", "\"")
    gsub(sep, "\034")
    gsub("\036", sep)
}

Yes, I replaced the first two gsubs with one. It's just the way I would
do it. In the rest of the code I only exchanged the hardcoded "," for
sep.
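A quick standalone check that the combined gsub() catches both
notations (the sample string is made up):

BEGIN {
    # /["\\]"/ matches a quote or a backslash followed by a quote,
    # i.e. both the "" and the \" escape notations:
    s = "a\"\"b\\\"c"                 # contains one "" and one \"
    print gsub(/["\\]"/, "\035", s)   # -> 2
}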

> 2) Yep you're right, far too fiddly. However, I haven't (as yet!) come
> across an application which puts newlines into string fields.

You haven't come across Excel and the likes then. Believe me, this is
rather common. I'm not sure your algorithm can be tweaked to deal with
it without becoming a mess, which would destroy it, since it's so
wonderfully clean and straightforward as it is. (But I doubt I will be
able to resist _trying_ to add the embedded-newline functionality to
it.... I'll let you know if I come up with something.)

> > (On my not-so-simple, insane sample input your function is also
> > much faster, but since it doesn't produce the correct result it
> > doesn't matter. =))
>
> I'm not sure how to deal with "insane" input in a general sense. The
> definition of "correct result" would be application specific. I
> think that in most cases it would be better to fix the application
> which is producing the "insane" data!! :-)

Hehe. Well, by "insane" I don't mean the input is necessarily corrupt.
My "insane" sample data is more about stressing the definition and
putting the CSV functions to a test. Dealing with corrupt data is
another deal. If you can't fix the application producing it (it could
be out of your control, for instance) you might want to preprocess and
clean the data before deciding what to do with it.

Regards,

Adrian Davis

Jan 14, 2000
I've just realised (Duh!) that empty (null) strings will be output as a
double quote instead of null. The "Extra line" should fix this problem.

Regards,
=Adrian=

#=(Start of function)=#
function setcsv() {
    gsub(/\\"/, "\035")
    gsub(/""/, "\035")
    while (match($0, /"[^"]*"/)) {
        middle = substr($0, RSTART+1, RLENGTH-2)
        gsub(",", "\036", middle)
        $0 = substr($0, 1, RSTART-1) middle substr($0, RSTART+RLENGTH)
    }
    gsub("\035", "\"")
    gsub(",", "\034")
    gsub("\036", ",")
    gsub("\034\"\034", "\034\034") # Extra line!!
}
#=(End of function)=#


PEZ

Jan 14, 2000
In article <85njbg$3d5$1...@nnrp1.deja.com>,

PEZ <pe...@my-deja.com> wrote:
> In article <85ncsk$c0h$1...@starburst.uk.insnet.net>,
> "Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:
> >
> > > Two things are lacking from your approach though, as I see it:
> > > 1) The separator needs to be hardcoded into the script.
> > > 2) It doesn't deal with embedded newlines in the string fields.
> > 2) Yep, you're right, far too fiddly. However, I haven't (as yet!)
> > come across an application which puts newlines into string fields.

> (But I doubt I will be
> able to resist _trying_ to add the embedded-newline functionality to
> it.... I'll let you know if I come up with something.)

No, I didn't solve it. I thought I had, but it doesn't work as I
expected. I do think I have the main idea right. I just think I'm
missing something obvious. Maybe someone can help?

function setcsv(str, sep) {
    gsub(/[\\"]"/, "\035", str)
    while (match(str, /"[^"]*"/)) {
        middle = substr(str, RSTART+1, RLENGTH-2)
        gsub(sep, "\036", middle)
        str = substr(str, 1, RSTART-1) middle substr(str, RSTART+RLENGTH)
    }
    while (index(str, "\"")) {
        if ((getline) <= 0)
            break
        setcsv(str "\n" $0, sep)
    }
    gsub("\035", "\"", str)
    gsub(sep, "\034", str)
    gsub("\036", sep, str)
    $0 = str
}

It's supposed to be called like setcsv($0, ","). It should then
recursively call itself with the next line of input concatenated onto
the so-far-processed str, with a "\n" squeezed in between, as long as
the first gsub and the following while loop don't kill all the double
quotes. But it seems to just eat up the entire file and then give up.
It still works on data without embedded newlines though.

I tried to follow what happens by adding the line:

print ++c, str, "\n===="

as the first statement in the function, and then ran it on data like
this:

1,2,"An entry with embedded new
line", 4, 5
a,b,c,d
x,y,z

It produces this:

1 A,B,"C D E F",G
====
2 1,2,"An entry with embedded new
====
3 1,2,"An entry with embedded new
line", 4, 5
====
4 1,2,"An entry with embedded new
a,b,c,d
====
4 1,2,"An entry with embedded new
a,b,c,d
x,y,z
====

The first line parses correctly. Then on the second line the first
recursive call seems to work as expected, but it fails to match the
obvious (to me) entry, and then throws that third line away and starts
concatenating the rest of the lines. (Even if they contain double
quotes it never again throws lines away, as far as I could see.)

Adrian Davis

Jan 14, 2000

> You haven't come across Excel and the likes then. Believe me, this
> is rather common. I'm not sure your algorithm can be tweaked to deal
> with it without becoming a mess, which would destroy it, since it's
> so wonderfully clean and straightforward as it is. (But I doubt I
> will be able to resist _trying_ to add the embedded-newline
> functionality to it.... I'll let you know if I come up with
> something.)

The problem with this one is the character you want to deal with is the same
as the record separator. Is it possible to get Excel to use a different
record separator?

Regards,
=Adrian=

PS: Have you seen my other post Re: fixing the "null" string problem?


PEZ

Jan 14, 2000
In article <85nn6b$6m4$1...@nnrp1.deja.com>,
PEZ <pe...@my-deja.com> wrote:
> In article <85njbg$3d5$1...@nnrp1.deja.com>,

> function setcsv(str, sep) {
>     gsub(/[\\"]"/, "\035", str)
>     while (match(str, /"[^"]*"/)) {
>         middle = substr(str, RSTART+1, RLENGTH-2)
>         gsub(sep, "\036", middle)
>         str = substr(str, 1, RSTART-1) middle substr(str, RSTART+RLENGTH)
>     }
>     while (index(str, "\"")) {
>         if ((getline) <= 0)
>             break
>         setcsv(str "\n" $0, sep)
>     }
>     gsub("\035", "\"", str)
>     gsub(sep, "\034", str)
>     gsub("\036", sep, str)
>     $0 = str
> }

I'm blushing here. The problem was that I thought literally "while
there are still double quotes..." Somehow my brain translated that
into a while loop.... Here's a working version:

function setcsv(str, sep) {
    gsub(/[\\"]"/, "\035", str)
    while (match(str, /"[^"]*"/)) {
        middle = substr(str, RSTART+1, RLENGTH-2)
        gsub(sep, "\036", middle)
        str = substr(str, 1, RSTART-1) middle substr(str, RSTART+RLENGTH)
    }
    if (index(str, "\"")) {
        if ((getline) <= 0)
            return 0
        setcsv(str "\n" $0, sep)
    }
    else {
        gsub("\035", "\"", str)
        gsub(sep, "\034", str)
        gsub("\036", sep, str)
        $0 = str
        return 1
    }
}

Now it deals with a separator of choice and with embedded newlines.
And it's still bleeding fast compared to the other solutions. And it
still does the natural thing and goes right at $1 to $NF.

I'd say it's the best solution suggested so far. Benchmark it please.
Only thing wrong with it is that it wasn't my own idea. =)
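In case anyone wants to benchmark it, here's a minimal driver sketch
(it assumes the setcsv(str, sep) function above is defined in the same
script):

BEGIN { FS = "\034" }   # setcsv() leaves "\034" between the fields
{
    setcsv($0, ";")     # split on ";" while honouring quoted fields
    for (i = 1; i <= NF; i++)
        printf "%s%s", $i, (i < NF ? " | " : "\n")
}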

PEZ

Jan 14, 2000
In article <85nmlb$i1u$1...@starburst.uk.insnet.net>,

"Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:
> The problem with this one is the character you want to deal with is
the same
> as the record separator. Is it possible to get Excel to use a
different
> record separator?

Nope. Excel works as Excel works. It's Microsoft's CSV specification
we are dealing with anyway, so I think it's fair to honour it. But
check my other post. I did nail it, with a very slight change to your
original function.

> PS: Have you seen my other post Re: fixing the "null" string problem?

Nope. I'll check for it. That means there's still a bug to smash?

PEZ

Jan 14, 2000
In article <85njj0$fl3$1...@starburst.uk.insnet.net>,

"Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:
> I've just realised (Duh!) that empty (null) strings will be output as
a
> double quote instead of null. The "Extra line" should fix this
problem.

> gsub("\034\"\034","\034\034") # Extra line!!

Hmmm, yes it does. But what happens with a string that should just
contain a double quote then? Now that I look at it... a string
containing just a double quote is printed as two double quotes...
Houston, we have a problem.

Let's hope some of the other guys come to the rescue.

Regards,

Harlan Grove

Jan 14, 2000
In article <85nmlb$i1u$1...@starburst.uk.insnet.net>, "Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:

<snip>

>The problem with this one is the character you want to
>deal with is the same as the record separator. Is it
>possible to get Excel to use a different record separator?

About as possible as getting Microsoft to become a not-for-profit
entity. In other words, maybe a few days after hell freezes over. Or
the nonsarcastic answer: no.

Harlan Grove

Jan 14, 2000
In article <85ncsk$c0h$1...@starburst.uk.insnet.net>, "Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:

<snip>

>>(On my not-so-simple, insane sample input your function
>>is also much faster, but since it doesn't produce the
>>correct result it doesn't matter. =))
>
>I'm not sure how to deal with "insane" input in a general
>sense. The definition of "correct result" would be
>application specific. I think that in most cases it would
>be better to fix the application which is producing the
>"insane" data!! :-)

Begs the question whether the awk script writer gets to
tell the world's largest software producer what their specs
should be. In the real world, an 800-pound gorilla like
Excel gets to do whatever it damn well pleases, and scripts
written to work with it have to accommodate its specs.

I know I'm overreacting (slightly), but Excel is what it
is, just like tax laws are what they are. You or I or
anyone else who hangs out in comp.lang.awk has exactly
zero influence on what Excel's CSV specs are. Therefore,
scripts that have to deal with its CSV output have to adapt
themselves to Excel's specs, not vice versa.

PEZ

Jan 14, 2000
In article <85njj0$fl3$1...@starburst.uk.insnet.net>,

"Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:
> I've just realised (Duh!) that empty (null) strings will be output as
a
> double quote instead of null. The "Extra line" should fix this
problem.

[snip]

> gsub("\034\"\034","\034\034") # Extra line!!

There are a few problems with this fix. What if empty strings are
found at the beginning or end of a row (entry)? And as I pointed out
earlier, it's not only empty strings that go wrong; it's strings
solely containing any number (including zero) of double quotes. You
always get one double quote too many. There are a few ways to deal
with that, even a few ways to deal with it "afterwards" like your
"Extra line" there. Here's one way to do it. Replace your "Extra line"
with:

for (i = 1; i <= NF; i++)
    if (match($i, /^"+$/))
        $i = substr($i, 2)
return 0

It adds some processing time to the function of course. On my "insane"
data setcsv() is still almost twice as fast as Harlan's csvsplit().
But on my "sane" data sample setcsv() runs at only about 75% of the
speed of csvsplit() (it's still faster than my rejoin_csvstrs(),
though). As Harlan points out, his function doesn't do much work when
it's not called for.
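To be explicit about where the loop would sit, the tail of setcsv()
would then read something like this (a hypothetical merge of Adrian's
function and the fix above; it relies on FS being "\034" so the
rewritten $0 re-splits into $1..$NF):

    gsub("\035", "\"")
    gsub(",", "\034")        # rewriting $0 here re-splits it on FS
    gsub("\036", ",")
    for (i = 1; i <= NF; i++)
        if (match($i, /^"+$/))   # field is nothing but quotes:
            $i = substr($i, 2)   # drop the one quote too many
    return 0
}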

I'll post the setcsv function with my latest changes in its own thread.

Jim Monty

Jan 14, 2000

PEZ

Jan 15, 2000
In article <85o5r8$9o0$1...@nnrp02.primenet.com>,

Hmmm. I fail to see how reading that posting (again) would help here.
Since Adrian's setcsv() chooses a different path from your splitcsv(),
few lessons can be learnt that way. (And I have read Jeffrey's book. I
too think that that regexp is cool. It's one of the best computer-
related books in my library, maybe second only to The Perl Cookbook
and Advanced Perl Programming.)

However as you might have noticed by now I've nailed the bug myself.
But I do feel it could be done with less computing power. Please feel
free to have a go at it.

Adrian Davis

Jan 17, 2000

> Begs the question whether the awk script writer gets to
> tell the world's largest software producer what their specs
> should be. In the real world, an 800-pound gorilla like
> Excel gets to do whatever it damn well pleases, and scripts
> written to work with it have to accommodate its specs.

Just to clear up a minor point here: I didn't know the application
producing the "insane" data was Excel. In any case I have zero
experience with Excel, but I do not think it unreasonable to assume
that an application which produces some kind of "delimited" file
should allow a choice of delimiter - others do!!

Regards,
=Adrian=


Adrian Davis

Jan 17, 2000

> > double quote instead of null. The "Extra line" should fix this
> > problem.
>
> > gsub("\034\"\034","\034\034") # Extra line!!
>
> Hmmm, yes it does. But what happens with a string that should just
> contain a double quote then? Now that I look at it... a string
> containing just a double quote is printed as two double quotes...
> Houston, we have a problem.

I suppose there should also be...

gsub(/(^"\034)|(\034"$)/,"\034")

...at the end of the script to deal with empty strings at the
beginning and end. I know this doesn't cope with situations where a
field is supposed to have a single double quote. I gather that the
main problems seem to be with Excel output; does a single double quote
have a particular meaning for Excel? There are all sorts of
possibilities for "escaping" and special meanings, and I guess some of
those will need to be dealt with on an application-specific basis.

Regards,
=Adrian=

PEZ

Jan 17, 2000
In article <85uvnq$fng$1...@starburst.uk.insnet.net>,

"Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:
> I suppose there should also be...
>
> gsub(/(^"\034)|(\034"$)/,"\034")
>
> ...at the end of the script to deal with empty strings at begining
and end.
> I know this doesn't cope with situations where a field is supposed to
have a
> single double quote. I gather that the main problems seem to be
dealing with
> Excel output, does a single double quote have a particular meaning for
> Excel? There are all sorts of possibilities for "escaping" and special
> meanings, and I guess some of those will need to be delt with on an
> application specific basis.

Not really. It's a general problem with the algorithm as such. But
check my latest setcsv posting (in a new thread) for a general solution.

Harlan Grove

Jan 17, 2000
In article <85ulcj$ato$1...@starburst.uk.insnet.net>, "Adrian Davis" <Adrian...@lewisham.gov.uk> wrote:
Maybe true of more technical software, but 'office
productivity' software like spreadsheets generally supports
only rigidly defined file formats. In the case of CSV files,
the field separator can be modified through the OS's 'list
separator' setting, but it can't be changed in the
application itself. And the only record separator is the
CR-LF sequence (under Windows) or CR (under Mac OS).

So the general rule for mass market software is 'don't
expect much flexibility in general, don't expect any in
file formats.'

Elisa Roselli

Jan 21, 2000
It may not be directly relevant to your posting or to this forum, but
we too have been dealing with files that transit back and forth
between Excel and Unix via semicolon CSV, and I need to warn you that
we suspect a major bug in the Excel "export to csv" function. It was
not keeping all the field delimiters, especially at the ends of lines
and where there were many successive empty fields. We had so much
trouble with it that we finally gave up using it. What I do now is
copy-paste each of the columns we need into a separate ASCII file,
which I then reassemble under Unix with a paste -d";". YMMV, of
course.

EFR

PEZ wrote in message <85ier6$ars$1...@nnrp1.deja.com>...
>Today I had to deal with Excel CSV files using awk. (snipped)

PEZ

Jan 21, 2000
In article <8694eh$g8c$1...@wanadoo.fr>,

"Elisa Roselli" <e.ro...@volusoft.com> wrote:
> It may not be directly relevant to your posting or to this forum, but
we too
> have been dealing with files that transit back and forth between
Excel and
> Unix via semi-colon csv, and I need to warn you that we suspect a
major bug
> in the Excel "export to csv" function. It was not keeping all the
field
> delimiters, especially at ends of lines and where there were many
successive
> empty fields. We had so much trouble with it that finally we gave up
using
> it. What I do now is copy-paste each of the columns we need into a
separate
> ascii file which I then reassemble under Unix with a paste -d";".
YMMV, of
> course.

I have stumbled across problems like that as well. But for me it
worked if I added a last dummy field beyond the last column, one that
always contained something, and always the same thing. Your workaround
only works if there are no ";" in the data, I think. Maybe I
misunderstood something.

Harlan Grove

Jan 21, 2000
In article <8694eh$g8c$1...@wanadoo.fr>, "Elisa Roselli"
<e.ro...@volusoft.com> wrote:
>It may not be directly relevant to your posting or to this
>forum, but we too have been dealing with files that
>transit back and forth between Excel and Unix via
>semicolon CSV, and I need to warn you that we suspect a major
>bug in the Excel "export to csv" function. It was not
>keeping all the field delimiters, especially at ends of
>lines and where there were many successive empty fields.
>We had so much trouble with it that finally we gave up
>using it. What I do now is copy-paste each of the columns
>we need into a separate ascii file which I then reassemble
>under Unix with a paste -d";". YMMV, of course.

Definitely off-topic, and I could point you to the apt
thread in comp.apps.spreadsheets, but it's quicker to just
give the fix in Excel: always include an extra column that
contains ONLY either a single apostrophe/single quote or
the formula ="". Both give a string result of zero length,
which is NOT the same as a truly blank/empty cell. It's
enough to force Excel to put semicolons or commas between
each column. Or you could select the entire range, then use
Go To Special to select only the blank cells, enter a
single single quote in the first of these, and enter it
using the Ctrl+Enter key combination. [Another in the fine
comp.lang.awk tradition of posting non-awk solutions.]
