
CSV FPAT issue with GAWK


Marc de Bourget

Oct 21, 2016, 1:46:32 PM
It seems there is an issue with the CSV split pattern
https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

"As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:

FPAT = "([^,]*)|(\"[^\"]+\")"

Content of csv file "gawkcsvtest.csv":
Field1,Field2,Field3
Yes,"""Yes,No""",No

With a good CSV viewer, the data should look like:
Field1 Field2 Field3
Yes "Yes,No" No

# GAWKCSVtest.awk
BEGIN {
FPAT = "([^,]*)|(\"[^\"]+\")"
}

{
print "Rec. " NR " Field 1: " $1
print "Rec. " NR " Field 2: " $2
print "Rec. " NR " Field 3: " $3
}

gawk -f GAWKCSVtest.awk gawkcsvtest.csv
=>
Rec. 1 Field 1: Field1
Rec. 1 Field 2: Field2
Rec. 1 Field 3: Field3
Rec. 2 Field 1: Yes
Rec. 2 Field 2: """Yes
Rec. 2 Field 3: No"""

# TAWKCSVTEST.awk
BEGIN {
# TAWK manual page 202
FS = 0 # Disable FS
# CSV files with Separator ,
FPAT = /"([^"]|"")*"|[^,"]*/
}

{
print "Rec. " NR " Field 1: " $1
print "Rec. " NR " Field 2: " $2
print "Rec. " NR " Field 3: " $3
}

awkw -f TAWKCSVTEST.awk gawkcsvtest.csv
=>
Rec. 1 Field 1: Field1
Rec. 1 Field 2: Field2
Rec. 1 Field 3: Field3
Rec. 2 Field 1: Yes
Rec. 2 Field 2: """Yes,No"""
Rec. 2 Field 3: No

So, TAWK seems to work properly while GAWK does not.
How can the GAWK FPAT be changed to match the correct TAWK version?
I couldn't manage it myself because of the different FPAT syntax.

Ed Morton

Oct 21, 2016, 2:20:59 PM
This would do it:

FPAT = "([^,]*)|(\"([^\"]*(\"\")?[^\"]*)*\")"

Ed.

Ed Morton

Oct 21, 2016, 2:22:38 PM
sorry wasn't thinking, all you need is:

FPAT = "([^,]*)|(\"([^\"]|\"\")*\")"

Marc de Bourget

Oct 21, 2016, 2:33:33 PM
Hi Ed, great, thank you! First tests were successful.
If we don't find further issues Arnold might want to consider updating the documentation with your version.

Ed Morton

Oct 21, 2016, 2:42:20 PM
Sure but it depends on what CSV flavor you want to parse. "" might not be the
way to escape a ", it might be \" in a given CSV (so you could add a |\\" to the
regexp to work for both):

"([^"]|""|\\")*"|[^,]*

but then you may be accepting/handling CSVs as valid that are actually invalid
in the given domain or otherwise mishandling them.

In any case, just to show they're actually the same, FYI here's the tawk FPAT
you provided:

FPAT = /"([^"]|"")*"|[^,"]*/

used as-is with gawk (just changed the delimiters):

$ cat GAWK*CSVtest.awk
{
print "Rec. " NR " Field 1: " $1
print "Rec. " NR " Field 2: " $2
print "Rec. " NR " Field 3: " $3
}

$ awk -v FPAT='"([^"]|"")*"|[^,"]*' -f GAWKCSVtest.awk gawkcsvtest.csv
Rec. 1 Field 1: Field1
Rec. 1 Field 2: Field2
Rec. 1 Field 3: Field3
Rec. 2 Field 1: Yes
Rec. 2 Field 2: """Yes,No"""
Rec. 2 Field 3: No

The version where it's specified inside the file is just complicated by the need
to escape "s and some parens to make the grouping explicit.

Ed.

Marc de Bourget

Oct 21, 2016, 2:54:03 PM
Thank you very much Ed.
If we can trust Wikipedia, it is almost always "" for escaping:
https://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules:
"...
Fields with embedded commas or double-quote characters must be quoted.
1997,Ford,E350,"Super, luxurious truck"

Each of the embedded double-quote characters must be represented by a pair of double-quote characters.
1997,Ford,E350,"Super, ""luxurious"" truck"
..."

Thank you once more. I appreciate it.

Ed Morton

Oct 22, 2016, 5:10:39 PM
I did a bit more testing and discovered you can't just combine handling of \"
and "" escaped quotes in one regexp:

$ cat tst.csv
a,"b1""b2"",""b3""","c1,c2","d1\"d2\",\"d3\"","",f

Works for \":

$ awk -v FPAT='([^,]*)|("([^"]|\\\\")*")' '{ for (i=1;i<=NF;i++) printf "%d %d <%s>\n", NR, i, $i }' tst.csv
1 1 <a>
1 2 <"b1""b2"">
1 3 <""b3""">
1 4 <"c1,c2">
1 5 <"d1\"d2\",\"d3\"">
1 6 <"">
1 7 <f>

Works for "":

$ awk -v FPAT='([^,]*)|("([^"]|\"\")*")' '{ for (i=1;i<=NF;i++) printf "%d %d <%s>\n", NR, i, $i }' tst.csv
1 1 <a>
1 2 <"b1""b2"",""b3""">
1 3 <"c1,c2">
1 4 <"d1\"d2\">
1 5 <\"d3\"">
1 6 <"">
1 7 <f>

Fails for "" and \" together (notice output field 4 ends in ," and 5 is a single "):

$ awk -v FPAT='([^,]*)|("([^"]|\\\\"|\"\")*")' '{ for (i=1;i<=NF;i++) printf "%d %d <%s>\n", NR, i, $i }' tst.csv
1 1 <a>
1 2 <"b1""b2"",""b3""">
1 3 <"c1,c2">
1 4 <"d1\"d2\",\"d3\"",">
1 5 <">
1 6 <f>

Since you shouldn't be using both "" and \" together in one input file anyway I
don't feel any concern about that and there's various gsub()s you can do inside
the script to handle it if you want to, e.g.:

$ awk -v FPAT='([^,]*)|("([^"]|\"\")*")' '{ gsub(/\\"/,RS); for (i=1;i<=NF;i++) { gsub(RS,"\\\"",$i); printf "%d %d <%s>\n", NR, i, $i } }' tst.csv
1 1 <a>
1 2 <"b1""b2"",""b3""">
1 3 <"c1,c2">
1 4 <"d1\"d2\",\"d3\"">
1 5 <"">
1 6 <f>


Regards,

Ed.

Marc de Bourget

Oct 23, 2016, 9:35:40 AM
Thank you Ed. "" is sufficient for me.
I now use this version derived from your version:

BEGIN {
FS = 0 # Disable FS, needed for Thompson AWK
FIELDSEP = "," # Set the field separator here
FPAT = "([^" FIELDSEP "]*)|(\"([^\"]|\"\")*\")"
}

This version works both for GAWK and TAWK.
In Europe we often use FIELDSEP = ";" so it can be easily adjusted.
I use this in combination with a csvdequote and csvenquote function.

Ed Morton

Oct 23, 2016, 9:55:21 AM
If you'd like to handle embedded newlines within fields too, this is what I use:

$ cat decsv.awk
BEGIN { FPAT="([^,]*)|(\"([^\"]|\"\")*\")"; OFS="," }
{
# prepend previous incomplete record segment if any
$0 = prev $0
numq = gsub(/"/,"&")
if ( numq % 2 ) {
# this is inside double quotes so incomplete record
prev = $0 RT
#prev = $0 OFS # uncomment to replace RS with OFS inside fields
next
}
prev = ""

printf "Record %d:\n", ++recNr
for (i=1;i<=NF;i++) {
printf "\t$%d=<%s>\n", i, $i
}
}

$ cat tst.csv
a,"b1""b2""
,""b3""","c1,
c2","d1""d2
"",""d3""","",f

$ awk -f decsv.awk tst.csv
Record 1:
$1=<a>
$2=<"b1""b2""
,""b3""">
$3=<"c1,
c2">
$4=<"d1""d2
"",""d3""">
$5=<"">
$6=<f>

Massage to suit...

Ed.

Ed Morton

Oct 23, 2016, 10:04:20 AM
On 10/23/2016 8:55 AM, Ed Morton wrote:
> On 10/23/2016 8:35 AM, Marc de Bourget wrote:
<snip>
>> I now use this version derived from your version:
>>
>> BEGIN {
>> FS = 0 # Disable FS, needed for Thompson AWK
>> FIELDSEP = "," # Set the field separator here
>> FPAT = "([^" FIELDSEP "]*)|(\"([^\"]|\"\")*\")"
>> }
>>
>> This version works both for GAWK and TAWK.
>> In Europe we often use FIELDSEP = ";" so it can be easily adjusted.

Good idea but rather than adding a new variable FIELDSEP when we should already
be setting OFS to the real input separator character (assuming we want to output
the same format as we input), here's what I use to handle embedded newlines
after tweaking to use the OFS value instead of hard-coding "," twice (once in
OFS and once in FPAT as in my previous post):

$ cat decsv.awk
BEGIN { OFS=","; FPAT="([^" OFS "]*)|(\"([^\"]|\"\")*\")" }
Regards,

Ed.

Marc de Bourget

Oct 23, 2016, 1:34:50 PM
Thank you Ed, good idea.

Ed Morton

Oct 23, 2016, 2:43:28 PM
Final version, tweaked to separate construction of the record from use of the
record to make use of the record easier/clearer:

$ cat decsv.awk
function buildRec( _numQuotes, _gotRec) {
# prepend previous incomplete record segment if any
$0 = PrevSeg $0 # NOTE: PrevSeg is global
_numQuotes = gsub(/"/,"&")
if ( _numQuotes % 2 ) {
# this is inside double quotes so incomplete record
PrevSeg = $0 RT
#PrevSeg = $0 OFS # uncomment to replace RS with OFS inside fields
}
else {
PrevSeg = ""
_gotRec = 1
}
return _gotRec
}

BEGIN {
OFS=","
FPAT="([^" OFS "]*)|(\"([^\"]|\"\")*\")"
}

!buildRec() { next }

# From here on we have an awk record where each quoted field
# may include OFSs, escaped quotes (""), or newlines.
{
printf "Record %d:\n", ++recNr
for (i=1;i<=NF;i++) {
printf "\t$%d=<%s>\n", i, $i
}
}

$ cat tst.csv
a,"b1""b2""
,""b3""","c1,
c2","d1""d2
"",""d3""","",f
g,"h1""h2""",i,"
j
","k1""k2""",l

$ awk -f decsv.awk tst.csv
Record 1:
$1=<a>
$2=<"b1""b2""
,""b3""">
$3=<"c1,
c2">
$4=<"d1""d2
"",""d3""">
$5=<"">
$6=<f>
Record 2:
$1=<g>
$2=<"h1""h2""">
$3=<i>
$4=<"
j
">
$5=<"k1""k2""">
$6=<l>

Sorry for all the iterations, I kept getting ideas to improve it and it seems
like something that could be generally useful. I'm done now!

Regards,

Ed.

Marc de Bourget

Oct 23, 2016, 3:40:59 PM
Thank you, Ed. I'll also want to contribute something I think is useful.
It's my final version. It does not take care of newlines,
but it dequotes and enquotes double quotes properly. Even poorly formatted
csv gets "healed" with correct output. I've also added an "alwaysquote"
option to my enquote function which can be set to "1", e.g.
csvenquote($i, OFS, 1),
because some people like everything to be quoted in csv files, see
https://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules
"Any field may be quoted (that is, enclosed within double-quote characters)."
"1997","Ford","E350"
The debug result files csvdequote.txt and csvenquote.csv show correct values.

Content of csv test file "csvin3comma.csv":
Field1,Field2,Field3,Field4
One,,Two,Three
What,,"Is,",It
What,,Is,It
What,,"""Yes,No""",It
What,,Is,It
"""What""",,Is,It
"""What""","""yes, no",Is,It
20,,aaa20,No
20,"""",aaa20,Yes
21,"""Hello""",abc21,"""Sir"""
23,,xyz00,","
24,"""Yes!""",pdq24,9
28,,89,3
25,"""Yes,No""","26,pd",q24
25,,"26,pd","q""Sir""24"

Now my awk file awkCSV_comma2.awk for use with
gawk -f awkcsv_comma2.awk csvin3comma.csv

# awkCSV_comma2.awk
function csvdequote( str ) {
if (str ~ /^\".*\"$/) {
sub(/^\"/, "", str)
sub(/\"$/, "", str)
gsub(/\"\"/, "\"", str)
}
return ( str )
}

function csvenquote( str, separator, alwaysquote ) {
if ((str ~ separator) || (str ~ /\"/)) {
gsub(/\"/, "\"\"", str)
str = "\"" str "\""
}
else if (alwaysquote) {
str = "\"" str "\""
}
return ( str )
}

# ARGV[1]: csv file
BEGIN {
FS = 0 # Disable FS, needed for Thompson AWK
OFS = ","
FPAT = "([^" OFS "]*)|(\"([^\"]|\"\")*\")"
}

{
for (i=1; i<=NF; i++) {
$i = csvdequote($i)

# Debug:
print "dequoted: Rec. " NR " Field i: " $i
printf("%s", $i) >"csvdequote.txt"
if (i < NF)
printf("%s", "\t") >"csvdequote.txt"
else
printf("%s", "\n") >"csvdequote.txt"
}
print ""

# Do something with data (change contents as usual for $1, $0 etc.)

for (i=1; i<=NF; i++) {
$i = csvenquote($i, OFS, 0)

# Debug:
print "enquoted: Rec. " NR " Field i: " $i
printf("%s", $i) >"csvenquote.csv"
if (i < NF)
printf("%s", OFS) >"csvenquote.csv"
else
printf("%s", "\n") >"csvenquote.csv"
}
print ""
}

Ed Morton

Oct 24, 2016, 9:49:51 AM
Could you give a specific example of that? Usually trying to repair a CSV fails
because you never really KNOW what the intent of the bad input was, so it's best
to just report it and let the user fix it rather than try to repair it on the fly. Or
do you just mean it'll remove quotes from the fields that don't NEED them?

> if (str ~ /^\".*\"$/) {

No need for the above test, and if you DID want to test something then testing
the result of the gsub() I'm about to show would do.

> sub(/^\"/, "", str)
> sub(/\"$/, "", str)

The above 2 sub()s can be written as a single gsub():

gsub(/^"|"$/, "", str)

> gsub(/\"\"/, "\"", str)
> }
> return ( str )
> }

So the above function could be written more concisely as:

function csvdequote( str ) {
gsub(/^"|"$/, "", str)
gsub(/\"\"/, "\"", str)
return ( str )
}

>
> function csvenquote( str, separator, alwaysquote ) {
> if ((str ~ separator) || (str ~ /\"/)) {
> gsub(/\"/, "\"\"", str)
> str = "\"" str "\""
> }
> else if (alwaysquote) {
> str = "\"" str "\""
> }
> return ( str )
> }

The above could be written more concisely as:

function csvenquote( str, separator, alwaysquote ) {
gsub(/\"/, "\"\"", str)
if ( (str ~ separator) || (str ~ /\"/) || alwaysquote ) {
str = "\"" str "\""
}
return ( str )
}

> # ARGV[1]: csv file
> BEGIN {
> FS = 0 # Disable FS, needed for Thompson AWK
> OFS = ","
> FPAT = "([^" OFS "]*)|(\"([^\"]|\"\")*\")"
> }
>
> {
> for (i=1; i<=NF; i++) {
> $i = csvdequote($i)
>
> # Debug:
> print "dequoted: Rec. " NR " Field i: " $i
> printf("%s", $i) >"csvdequote.txt"
> if (i < NF)
> printf("%s", "\t") >"csvdequote.txt"
> else
> printf("%s", "\n") >"csvdequote.txt"

The above 6 lines can be (and usually is) written as 1:

printf "%s%s", $i, (i<NF ? "\t" : ORS) >"csvdequote.txt"

> }
> print ""
>
> # Do something with data (change contents as usual for $1, $0 etc.)
>
> for (i=1; i<=NF; i++) {
> $i = csvenquote($i, OFS, 0)
>
> # Debug:
> print "enquoted: Rec. " NR " Field i: " $i
> printf("%s", $i) >"csvenquote.csv"
> if (i < NF)
> printf("%s", OFS) >"csvenquote.csv"
> else
> printf("%s", "\n") >"csvenquote.csv"

The above 6 lines can be (and usually is) written as 1:

printf "%s%s", $i, (i<NF ? OFS : ORS) >"csvenquote.csv"

> }
> print ""

Why the blank line output here and above after the dequote?

> }
>

Thanks for sharing.

Ed.

Marc de Bourget

Oct 24, 2016, 10:28:11 AM
Thank you Ed.
Most of your comments concern debugging output (like the blank line).
I do know what you write and appreciate it, but I think I have another
programming style, and my solution works fast, properly and reliably
and is thoroughly tested, so I see no need to change much.
I would be glad if someone would appreciate it as a proper solution.

As for the example, I have encountered the issue that especially in files
where everything is quoted, something like "Zur Brunnenwiese" (which should
be depicted with quotes) is quoted wrongly in csv files:
"No","Hotel","Name","Country"
"24",""Zur Brunnenwiese"","Germany"

With my awk script above and the change: $i = csvenquote($i, OFS, 1)
the output for csvenquote.csv is:
"No","Hotel","Name","Country"
"24","""Zur Brunnenwiese""","Germany"

So the quotes for "Zur Brunnenwiese" get repaired.

Marc de Bourget

Oct 24, 2016, 10:52:01 AM
Sorry, I have made an error in my example with csvenquote($i, OFS, 1):
Here is the correct version of the original csv:

Original csv:
"No","Hotel","Country"
"24",""Zur Brunnenwiese"","Germany"

csvenquote.csv:
"No","Hotel","Country"
"24","""Zur Brunnenwiese""","Germany"

csvdequote.txt:
No Hotel Country
24 "Zur Brunnenwiese" Germany


Marc de Bourget

Oct 24, 2016, 1:34:43 PM
On Monday, October 24, 2016 at 3:49:51 PM UTC+2, Ed Morton wrote:
> So the above function could be written more concisely as:
>
> function csvdequote( str ) {
> gsub(/^"|"$/, "", str)
> gsub(/\"\"/, "\"", str)
> return ( str )
> }

BTW, it seems there is an issue with your code:
gsub(/^"|"$/, "", str)
This deletes all leading quotes instead of only the first one.
Please compare your result file csvdequote.txt with mine
(which is correct) using the test file csvin3comma.csv.

Marc de Bourget

Oct 24, 2016, 1:52:44 PM
OK, this seems to be a TAWK issue. This code isn't compatible with TAWK.
Maybe I have found another little issue but it's too late now for testing
and I'm tired. If I test further I'll make too many errors. Thank you!


Ed Morton

Oct 24, 2016, 3:28:41 PM
On 10/24/2016 12:52 PM, Marc de Bourget wrote:
> On Monday, October 24, 2016 at 7:34:43 PM UTC+2, Marc de Bourget wrote:
>> On Monday, October 24, 2016 at 3:49:51 PM UTC+2, Ed Morton wrote:
>>> So the above function could be written more concisely as:
>>>
>>> function csvdequote( str ) {
>>> gsub(/^"|"$/, "", str)
>>> gsub(/\"\"/, "\"", str)
>>> return ( str )
>>> }
>>
>> BTW, it seems there is an issue with your code:
>> gsub(/^"|"$/, "", str)
>> This deletes all leading quotes instead of only the first one.

No, it doesn't. If you're using an awk that does that then your awk is broken as
it's misinterpreting " as "+

>> Please compare your result file csvdequote.txt with mine
>> (which is correct) using the test file csvin3comma.csv.
>
> OK, this seems to be a TAWK issue. This code isn't compatible with TAWK.

Then stop using that version of TAWK as it may have other breakage too that you
haven't come across yet. gsub() does NOT iterate on the results of its
operation (otherwise x="a,b,c,d"; gsub(/.,./,"y",x) would turn x into "" instead
of "y,y") and ^" simply does NOT mean ^"+ in an ERE (if it did, how WOULD you
write "the first char in the string"?)

> Maybe I have found another little issue but it's too late now for testing
> and I'm tired. If I test further I'll make too many errors. Thank you!

Sounds good.

Ed.

Marc de Bourget

Oct 24, 2016, 6:05:21 PM
Same behaviour with MAWK as for TAWK:
test.awk:
{
print $0
gsub(/^"|"$/, "", $0)
print $0
}

test.txt:
"""Hello"""
"""""Hello"""""""""
""Hello""
"Hello"
""""

>mawk -f test.awk test.txt
"""Hello"""
Hello""
"""""Hello"""""""""
Hello""""""""
""Hello""
Hello"
"Hello"
Hello
""""

I'm not sure if this is a TAWK bug because MAWK behaves the same.

Ed Morton

Oct 24, 2016, 7:03:25 PM
Well, let's see if they're consistent at least in their treatment of regexps.
Given input of ""Hello"" what does this output:

$ cat tst.awk
{
orig=$0; print "0 <" $0 ">"
$0=orig; gsub(/^"|"$/,""); print "1 <" $0 ">"
$0=orig; gsub(/^"/,""); print "2 <" $0 ">"
$0=orig; sub(/^"|"$/,""); print "3 <" $0 ">"
$0=orig; sub(/^"/,""); print "4 <" $0 ">"
$0=orig; match($0,/^"|"$/); print "5 <" substr($0,RSTART+RLENGTH) ">"
$0=orig; split($0,t,/^"|"$/); print "6 <" t[2] ">"
}

$ gawk -f tst.awk file
0 <""Hello"">
1 <"Hello">
2 <"Hello"">
3 <"Hello"">
4 <"Hello"">
5 <"Hello"">
6 <"Hello">

Regards,

Ed.

Marc de Bourget

Oct 25, 2016, 4:10:27 AM
GAWK: >gawk -f test.awk file
0 <""Hello"">
1 <"Hello">
2 <"Hello"">
3 <"Hello"">
4 <"Hello"">
5 <"Hello"">
6 <"Hello">

TAWK: >awkw -f test.awk file
0 <""Hello"">
1 <Hello">
2 <Hello"">
3 <"Hello"">
4 <"Hello"">
5 <"Hello"">
6 <"Hello">

MAWK: >mawk -f test.awk file
0 <""Hello"">
1 <Hello">
2 <Hello"">
3 <"Hello"">
4 <"Hello"">
5 <"Hello"">
6 <>

I would not recommend using the strange code: gsub(/^"|"$/, "", str).
For <starts with> and <ends with> expressions, normally sub() is used.
It is not obvious what you want to achieve, neither to me nor to TAWK and MAWK.

BTW, the mentioned other issue with your csvdequote function is that it
destroys quotes from uneven pairs of quotes in wrongly formatted csv,
e.g. One,",Two,Three

-> The " gets deleted which is not good while my function keeps the ".

Dave Sines

Oct 25, 2016, 8:50:03 AM
Marc de Bourget <marcde...@gmail.com> wrote:

[ treatment of /^"|"$/ in various awks ]

> Same behaviour with MAWK as for TAWK:
> test.awk:
> {
> print $0
> gsub(/^"|"$/, "", $0)
> print $0
> }
>
> test.txt:
> """Hello"""
> """""Hello"""""""""
> ""Hello""
> "Hello"
> """"
>
>>mawk -f test.awk test.txt
> """Hello"""
> Hello""
> """""Hello"""""""""
> Hello""""""""
> ""Hello""
> Hello"
> "Hello"
> Hello
> """"
>
> I'm not sure if this is a TAWK bug because MAWK behaves the same.

bwk's awk95 and heirloom nawk agree with gawk. This works around the
problem with mawk; I don't have tawk:

gsub(/(^")|("$)/, "", str)

Marc de Bourget

Oct 25, 2016, 9:08:09 AM
Thank you (I have already tried this), but this does not work with my MAWK version (http://www.klabaster.com/freeware.htm). It does not work with TAWK either: all leading quotes are deleted instead of, as wished, only the first one.

Ed Morton

Oct 25, 2016, 10:00:34 AM
There is nothing wrong with that code, tawk and mawk are just broken and someone
should report the bug to the maintainers of those tools so they can fix them.

In an ERE ^" means "one double quote at the start of a string". If you want to
describe "one or more double quotes at the start of a string" then that is ^"+.
tawk and mawk are incorrectly treating ^" as if it were ^"+ when used in gsub()
but not in other contexts - it's THAT simple.

gawk treats the regexp semantically identically in all contexts while both mawk
and tawk treat the regexp when used in gsub() differently from the same regexp
in other contexts and I assume treats that regexp differently in gsub() than
other regexps in gsub (I assume x="a,b,c,d"; gsub(/.,./,"y",x) would not turn x
into "" but let us know if it does, and why does it make sense for
gsub(/^"|"$/,"") to remove all "s from the start of the string but only remove 1
from the end?) and mawk additionally treats it differently again when used in
split() (see the mawk output above) so that's good that we discovered a second
bug in mawk that someone should report too.

> For <starts with> and <ends with> expressions normally sub() is used.

It can be but it doesn't have to be. Whenever you use the word "normally" in
software you can replace it with "sometimes" as it doesn't matter how often
something else happens and you shouldn't write code that relies on it.

> It is not obvious what you want to achieve nor is it for TAWK and MAWK.

It is completely obvious. tawk and mawk are simply broken. It happens. Someone
needs to report the bug to the maintainers so they can be fixed.

> BTW, the mentioned other issue with your csvdequote function is that it
> destroys quotes from uneven pairs of quotes in wrongly formatted csv,
> e.g. One,",Two,Three

The tool should not try to guess what a broken CSV was intended to represent.

> -> The " gets deleted which is not good while my function keeps the ".
>

It's invalid input so whatever any tool does with it is fine, though an error
message would be nice.

Ed.

Marc de Bourget

Oct 25, 2016, 10:36:25 AM
Thank you very much, Ed.
No need to think further about the dequote function anymore.

I have decided my final solution will use the csvsplit function
by Harlan Grove (amended by Peter Farley) mentioned here:
https://groups.google.com/forum/#!topic/comp.lang.awk/Uhvndv5atnw
with the change that it will repopulate $0.
The new function csvsplitfields() dequotes csv and repopulates
$0, $1, ... to these values.
Best of all: Even FS doesn't need to be changed.
Only an enquote function is needed for output later.
Your enquote function looks nicer than mine so I'll use your function.

function csvsplitfields( newf, oldf, fs, i) {
# csvsplitfields dequotes csv and repopulates $0, $1, ... to these values
fs = FS
oldf = newf = 0
while ( oldf < NF ) {
f[++newf] = $(++oldf)
if ( f[newf] ~ /^"/ ) {
while ( gsub(/"/, "\"", f[newf]) % 2 ) {
if ( oldf >= NF ) {
if ((getline) > 0) {
oldf = 0
fs = "\n"
}
else break
}
else fs = FS
f[newf] = f[newf] fs $(++oldf)
}
sub(/^"/, "", f[newf])
sub(/"$/, "", f[newf])
gsub(/""/, "\"", f[newf])
}
}
$0 = ""
for (i=1; i<=newf; i++)
$i = f[i]
delete f
return newf
}

function csvenquote( str, separator, alwaysquote ) {
gsub(/\"/, "\"\"", str)
if ((str ~ separator) || (str ~ /\"/) || alwaysquote) {
str = "\"" str "\""
}
return ( str )
}

# ARGV[1]: csv file
BEGIN {
FS = OFS = ","
}

{
# csvsplitfields dequotes csv and repopulates $0, $1, ... to these values
csvsplitfields()

# Do something with data (change contents as usual for $1, $0 etc.)
# ...

for (i=1; i<=NF; i++)
$i = csvenquote($i, FS, 0)
print $0 >"result.txt"
}

This might be the final version ... Thank you Ed, great dialogue!

Ed Morton

Oct 25, 2016, 8:20:40 PM
> Best of all: Even FS doesn't need to be changed.

That's initially an attractive idea and can be done by only setting FPAT within
the buildRec() function in the version I posted BUT (and it's a big "but") you
need to always be aware that if you do that then you cannot change a field after
that function is called, or awk will re-split the record on commas instead of
leaving it split on fields as you want. So, for example, you couldn't simply
do $3="new value" or $(NF+1)="whatever" or swap fields 1 and 2 by simply doing
t=$2; $2=$1; $1=t, etc.
> function csvsplitfields( newf, oldf, fs, i) {
<snip>

Idk if it'll work or not in all cases as it's quite complicated, but I HATE the
unnecessary use of getline (which among other things introduces additional
complexity when handling multiple files, as it will jump right across a file1
end/file2 start boundary when file1 contains mismatched quotes, so you could
never write code to alert you about it if you wanted to), it's extremely
lengthy for all it does, and I'd expect it to be extremely slow given the nested
loops and repeated re-construction of $0 every time $i is assigned.

> function csvenquote( str, separator, alwaysquote ) {
> gsub(/\"/, "\"\"", str)
> if ((str ~ separator) || (str ~ /\"/) || alwaysquote) {
> str = "\"" str "\""
> }
> return ( str )
> }
>
> # ARGV[1]: csv file
> BEGIN {
> FS = OFS = ","
> }
>
> {
> # csvsplitfields dequotes csv and repopulates $0, $1, ... to these values
> csvsplitfields()
>
> # Do something with data (change contents as usual for $1, $0 etc.)
> # ...
>
> for (i=1; i<=NF; i++)
> $i = csvenquote($i, FS, 0)
> print $0 >"result.txt"
> }
>
> This might be the final version ... Thank you Ed, great dialogue!
>

Sounds good and thank you too. FWIW my final version is:

$ cat decsv.awk
function buildRec() {
$0 = PrevSeg $0
PrevSeg = ( gsub(/"/,"&") % 2 ? $0 RT : "" )
return (PrevSeg == "" ? 1 : 0)
}

BEGIN {
OFS = ","
FPAT = "([^" OFS "]*)|(\"([^\"]|\"\")*\")"
}

!buildRec() { next }

{
printf "Record %d:\n", ++recNr
for (i=1;i<=NF;i++) {
printf "\t$%d=<%s>\n", i, $i
}
}

and if I ever do need to manipulate the quotes (I've just never hit that need
yet in my usage) then I'll just add the sub()s as separate functions called
after buildRec(). If you wanted it to be completely portable, btw, you'd just
need a while(match($0,/same regexp/)) loop instead of using FPAT.
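That portable loop might look like this (a sketch; it avoids FPAT and should work in any POSIX awk, though here it is only exercised on a single record):

```shell
# Split one record by repeatedly match()ing the same alternation the
# FPAT version uses; POSIX leftmost-longest matching makes the quoted
# branch win over [^,]* whenever the field starts with a quote.
printf 'Yes,"""Yes,No""",No\n' |
awk '{
    line = $0
    n = 0
    while (1) {
        match(line, /"([^"]|"")*"|[^,]*/)
        f[++n] = substr(line, RSTART, RLENGTH)
        if (RLENGTH >= length(line)) break          # consumed the rest
        line = substr(line, RSTART + RLENGTH + 1)   # skip the comma
    }
    for (i = 1; i <= n; i++) printf "Field %d: %s\n", i, f[i]
}'
```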

Regards,

Ed.