Re: parsing csv files in gawk

Ed Morton

Apr 1, 2017, 8:56:49 AM

On 4/1/2017 12:37 AM, Sivaram Neelakantan wrote:
<snip>
> The file was hand created on a linux box. And there is no control
> chars other than \n for the line endings. I did check this first.
>
> And here's the weird part
>
> $ cat -v sample.csv
> a,b,43,-134,-1.2,"as,d",,3,,
> "aa",b,4335,-134,-1.2,"bs",,,,3
> ,,6,7,",aaa"
> ,,1,".s,"
>
> removing the 2 ,, after 3 in 2nd row fixes the 3rd row extra field issue
>
> $ awk -f s.awk sample.csv
> [a] [b] [43] [-134] [-1.2] ["as,d"] [] [3] [] []
> ["aa"] [b] [4335] [-134] [-1.2] ["bs"] [] [] [] [3]
> [] [6] [7] [",aaa"]
> [] [1] [".s,"]
>
> why?

To be clear - that's not "fixing" an issue, it's introducing one. Your
input lines 3 and 4 both start with 2 commas, so each of them has 2 leading null
fields and the output for those 2 lines SHOULD start with [] [].
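One way to confirm the leading empty fields without involving FPAT at all is to count the commas at the start of each line, since each one delimits one empty field. This is only a rough sketch: the function name leading_commas is made up for illustration, and it uses nothing beyond POSIX awk's substr().

```sh
# Hypothetical helper: print the number of leading commas on each input line.
# Each leading comma delimits one empty field at the start of the record.
leading_commas() {
    awk '{
        n = 0
        while (substr($0, n + 1, 1) == ",")
            n++
        print n
    }' "$@"
}

# The four sample lines quoted above.
printf '%s\n' \
    'a,b,43,-134,-1.2,"as,d",,3,,' \
    '"aa",b,4335,-134,-1.2,"bs",,,,3' \
    ',,6,7,",aaa"' \
    ',,1,".s,"' | leading_commas
```

On the sample this prints 0, 0, 2, 2 (one number per line), which matches the [] [] expected at the start of the last two output lines.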

I've emailed the GNU awk providers to ask them to check if this minimized
version of the problem:

$ cat tst.awk
BEGIN { FPAT="[^,]*" }
{
    print NF, $0
    for (i=1;i<=NF;i++)
        print "\t" i, "[" $i "]"
    print ""
}

$ cat -v file.csv
,,3
,,3

$ awk -f tst.awk file.csv
3 ,,3
1 []
2 []
3 [3]

2 ,,3
1 []
2 [3]

(i.e. line 2 is printing with 1 fewer leading field than it should) is a bug or
not and, if it is not, what explains that behavior. I'll post what I hear
back.
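For input this simple (no quoted fields), splitting on a plain comma FS gives an independent reference for what NF should be on each line. A small sketch, with nf_by_comma as a made-up name; it is not a CSV parser and would miscount any line that has a comma inside quotes.

```sh
# Hypothetical helper: report NF per line using plain comma splitting.
# Only meaningful when no field contains a quoted comma.
nf_by_comma() {
    awk -F, '{ print NF }' "$@"
}

# Same two lines as file.csv above.
printf ',,3\n,,3\n' | nf_by_comma
```

This prints 3 for both lines, so any FPAT="[^,]*" result other than NF=3 on either line is suspect.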

>
> in this test, if there are extra commas at the end of rows, the next row
> has an extra field if that too starts with commas

Again - that "extra" field really IS there in the input. The problem is a field
being dropped in some cases (e.g. the 2nd line of my input above), not a field
being added.

I've cross-posted this to comp.lang.awk, where this question really belongs and
where other awk experts might see it and shed some light on it.

Ed.

Janis Papanagnou

Apr 1, 2017, 9:20:52 AM

I get a different result here...

3 ,,3
1 []
2 []
3 [3]

3 ,,3
1 []
2 []
3 [3]

$ awk --version
GNU Awk 4.1.2, API: 1.1 (GNU MPFR 3.1.0-p3, GNU MP 5.0.2)
Copyright (C) 1989, 1991-2015 Free Software Foundation.

Janis

Ed Morton

Apr 1, 2017, 9:39:57 AM

Your output looks correct to me. I'm on:

$ gawk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.

Ed.

Ed Morton

Apr 1, 2017, 9:40:24 AM

On 4/1/2017 8:20 AM, Janis Papanagnou wrote:

Janis Papanagnou

Apr 1, 2017, 9:57:15 AM

I get the same (wrong) results with gawk 4.1.4 in my environment.

(There's obviously some point in not jumping on the latest release
for production purposes, specifically in cases where no appropriate
regression tests are available.)

Janis

Sivaram Neelakantan

Apr 1, 2017, 10:41:47 AM

On Sat, Apr 01 2017, Janis Papanagnou wrote:

[snipped 33 lines]

[snipped 27 lines]

> I get a different result here...
>
> 3 ,,3
> 1 []
> 2 []
> 3 [3]
>
> 3 ,,3
> 1 []
> 2 []
> 3 [3]
>
>
> $ awk --version
> GNU Awk 4.1.2, API: 1.1 (GNU MPFR 3.1.0-p3, GNU MP 5.0.2)
> Copyright (C) 1989, 1991-2015 Free Software Foundation.
>
> Janis
>

[snipped 14 lines]

My version on Debian is

$ gawk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.

sivaram
--

Ed Morton

Apr 1, 2017, 11:29:01 AM

Unfortunately it's just the gawk you get when you install Cygwin (and apparently
Debian); I didn't specifically choose to install that version.

I heard back from Andy Schorr (GNU awk developer) by email and it is a bug:

> This worked OK in 4.1.3, but is broken in 4.1.4. It is related to this
> ChangeLog entry:
>
> 2015-09-18 Arnold D. Robbins <arn...@skeeve.com>
>
> * field.c (fpat_parse_field): Always use rp->non_empty instead
> of only if in_middle. The latter can be true even if we've
> already parsed part of the record. Thanks to Ed Morton
> for the bug report.
>
> diff --git a/field.c b/field.c
> index 6a7c6b1..ed31098 100644
> --- a/field.c
> +++ b/field.c
> @@ -1598,9 +1598,8 @@ fpat_parse_field(long up_to, /* parse only up to this field number */
>
>      if (in_middle) {
>          regex_flags |= RE_NO_BOL;
> -        non_empty = rp->non_empty;
> -    } else
> -        non_empty = false;
> +    }
> +    non_empty = rp->non_empty;
>
>      eosflag = false;
>      need_to_set_sep = true;
>
> Reversing this patch fixes the bug, but reintroduces the bug that
> was fixed by this patch. :-) Here's the test case for that bug:
>
> ==> test/fpat5.awk <==
> BEGIN {
>     FPAT = "([^,]*)|(\"[^\"]+\")"
>     OFS = ";"
> }
>
> p != 0 { print NF }
>
> { $1 = $1 ; print }
>
> ==> test/fpat5.in <==
> "A","B","C"
>
> ==> test/fpat5.ok <==
> "A";"B";"C"
>
>
> *** fpat5.ok 2017-01-26 13:52:53.285369000 -0500
> --- _fpat5 2017-04-01 09:55:20.122459000 -0400
> ***************
> *** 1 ****
> ! "A";"B";"C"
> --- 1 ----
> ! "A";;"B";"C"

They're investigating.....

Ed.

Janis Papanagnou

Apr 1, 2017, 11:59:06 AM

On 01.04.2017 17:28, Ed Morton wrote:
> On 4/1/2017 8:57 AM, Janis Papanagnou wrote:
>>
>> (There's obviously some point in not jumping on the latest release
>> for production purposes, specifically in cases where no appropriate
>> regression tests are available.)
>

> Unfortunately it's just the gawk you get when you install cygwin (and
> apparently debian), it wasn't a choice I made to install it specifically.

It's interesting that Cygwin is so quick to add new versions. My Linux
distribution is comparatively slow here, so I often have to download my tools
manually if I want a specific new version. In other cases I have to fall
back to older releases because of newly introduced bugs or inconsistencies.
(Note that the comment was not addressed to you; it was rather me confirming
to myself my conservative approach here. Each user has to weigh the drawbacks
of either path individually.)

>
> I heard back from Andy Schorr (GNU awk developer) by email and it is a bug:
>

>> [...]

Thanks for filing that report.

Janis

Ed Morton

Apr 2, 2017, 10:48:55 AM

I've been poking at this a little and, as best I can tell, doing this before
processing each record:

oFPAT=FPAT; FPAT=""; FPAT=oFPAT

will work around the problem for now while we wait for a fix. e.g.:

$ cat tstFPAT.awk
BEGIN { FPAT="[^,]*" }
{
    # workaround for the gawk 4.1.4 FPAT bug: re-assign FPAT for every record
    oFPAT=FPAT; FPAT=""; FPAT=oFPAT

    print NF, $0
    for (i=1;i<=NF;i++)
        print 1, "\t" i, "[" $i "]"
    print ""
}

$ awk -f tstFPAT.awk file1.csv
3 ,,3
1 1 []
1 2 []
1 3 [3]

3 ,,3
1 1 []
1 2 []
1 3 [3]
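
Going by the versions reported in this thread (4.1.2 and 4.1.3 split correctly, 4.1.4 does not), a wrapper script could check which gawk it is running before trusting FPAT. A rough sketch only: the function name is invented, it takes the first line of the `gawk --version` output as its argument, and it flags just 4.1.4 because nothing in this thread says which later release carries a fix.

```sh
# Hypothetical helper: succeed (exit 0) if the given version banner is the
# release reported as broken in this thread (GNU Awk 4.1.4).
# Assumption: only 4.1.4 is flagged; later releases are not covered here.
is_affected_gawk() {
    case $1 in
        "GNU Awk 4.1.4"|"GNU Awk 4.1.4,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# In real use the banner would come from: gawk --version | head -n 1
banner='GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)'
if is_affected_gawk "$banner"; then
    echo "gawk 4.1.4 detected: reset FPAT before each record"
fi
```

The FPAT reset should be harmless on unaffected versions too, so applying it unconditionally is the simpler choice; the check is mainly useful for logging or for deciding when the workaround can be retired.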

Regards,

Ed.

Sivaram Neelakantan

Apr 6, 2017, 11:54:26 PM

On Sun, Apr 02 2017, Ed Morton wrote:

[snipped 134 lines]

>
> I've been poking at this a little and best I can tell doing this
> before processing each record:
>
> oFPAT=FPAT; FPAT=""; FPAT=oFPAT
>
> will work around the problem for now while we wait for a fix. e.g.:
>
> $ cat tstFPAT.awk
> BEGIN { FPAT="[^,]*" }
> {
>     oFPAT=FPAT; FPAT=""; FPAT=oFPAT
>
>     print NF, $0
>     for (i=1;i<=NF;i++)
>         print 1, "\t" i, "[" $i "]"
>     print ""
> }
>
> $ awk -f tstFPAT.awk file1.csv
> 3 ,,3
> 1 1 []
> 1 2 []
> 1 3 [3]
>
> 3 ,,3
> 1 1 []
> 1 2 []
> 1 3 [3]
>
> Regards,
>
> Ed.
>
>

Thanks for the workaround, appreciate it.

sivaram
--