How to set the appropriate field separator for a CSV file in which some fields contain commas?

Hongyi Zhao

Jan 11, 2017, 8:50:52 PM

Hi all,

See the following file for detail:

https://github.com/jedisct1/dnscrypt-proxy/blob/master/dnscrypt-resolvers.csv

I want to use the comma as the FS, but some fields in the file themselves
contain commas. How do I deal with this?

Regards
--
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Hongyi Zhao

Jan 12, 2017, 12:15:31 AM

On Thu, 12 Jan 2017 01:50:50 +0000, Hongyi Zhao wrote:

> Hi all,
>
> See the following file for detail:
>
> https://github.com/jedisct1/dnscrypt-proxy/blob/master/dnscrypt-resolvers.csv
>
> I want to use the comma as the FS, but there are some field's in the
> file which includes comma. How to deal with this issue?

Searching Google gives me the following method for gawk:

awk -v FPAT='[^,]*|"[^"]*"' ...

Thomas 'PointedEars' Lahn

Jan 12, 2017, 8:51:42 AM

Hongyi Zhao wrote:

> On Thu, 12 Jan 2017 01:50:50 +0000, Hongyi Zhao wrote:
>> See the following file for detail:
>>
>> https://github.com/jedisct1/dnscrypt-proxy/blob/master/dnscrypt-resolvers.csv
>>
>> I want to use the comma as the FS, but there are some field's in the
>> file which includes comma. How to deal with this issue?
>
> Searching the google gives me the following method with gawk:
>
> awk -v FPAT='[^,]*|"[^"]*"' ...

The second alternative should come first because fields matched by it are
more likely longer, than shorter, than fields matched by the first
alternative. Following the “longest match wins” principle of “greedy”
regular expression matching, reversing the order of the alternative would
allow them to be matched earlier.

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

Thomas 'PointedEars' Lahn

Jan 12, 2017, 8:53:23 AM

Hongyi Zhao wrote:

> On Thu, 12 Jan 2017 01:50:50 +0000, Hongyi Zhao wrote:
>> I want to use the comma as the FS, but there are some field's in the
>> file which includes comma. How to deal with this issue?
>
> Searching the google gives me the following method with gawk:
>
> awk -v FPAT='[^,]*|"[^"]*"' ...

The second alternative should come first because the fields it matches are
more likely to be longer than those matched by the first alternative.
Following the “longest match wins” principle of “greedy” regular expression
matching, reversing the order of the alternatives would allow the
corresponding fields to be matched earlier.

Hongyi Zhao

Jan 12, 2017, 9:50:54 AM

On Thu, 12 Jan 2017 14:53:20 +0100, Thomas 'PointedEars' Lahn wrote:

>> awk -v FPAT='[^,]*|"[^"]*"' ...
>
> The second alternative should come first because fields matched by it
> are more likely longer, than shorter, than fields matched by the first
> alternative. Following the “longest match wins” principle of “greedy”
> regular expression matching, reversing the order of the alternatives
> would allow the corresponding fields to be matched earlier.

Thanks for your notes.

A. Mehoela

Jan 12, 2017, 12:54:51 PM

Hongyi Zhao wrote:
> On Thu, 12 Jan 2017 01:50:50 +0000, Hongyi Zhao wrote:
>
>> Hi all,
>>
>> See the following file for detail:
>>
>> https://github.com/jedisct1/dnscrypt-proxy/blob/master/dnscrypt-resolvers.csv
>>
>> I want to use the comma as the FS, but there are some field's in the
>> file which includes comma. How to deal with this issue?
>
> Searching the google gives me the following method with gawk:
>
> awk -v FPAT='[^,]*|"[^"]*"' ...
>
> Regards
>

Wrong, because "Embedded double quote characters may then be represented
by a pair of consecutive double quotes"
(https://en.wikipedia.org/wiki/Comma-separated_values,
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm).
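To make the quoting rule cited above concrete, here is a minimal Python sketch. The helper name `unquote_csv_field` is made up for illustration and is not part of any library; it assumes the raw field has already been isolated, and only shows how a doubled quote inside a quoted field maps back to a single quote.

```python
def unquote_csv_field(raw):
    """Return the value of one raw CSV field (hypothetical helper)."""
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        # Drop the enclosing quotes, then collapse each "" into one ".
        return raw[1:-1].replace('""', '"')
    # Unquoted fields are taken literally.
    return raw


print(unquote_csv_field('"bar""baz"'))   # bar"baz
print(unquote_csv_field('"a, b"'))       # a, b
print(unquote_csv_field('plain'))        # plain
```

A field pattern that stops at the first inner quote would hand this helper only part of the field, which is the objection being raised here.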

Hongyi Zhao

Jan 12, 2017, 7:12:38 PM

On Thu, 12 Jan 2017 18:54:39 +0100, A. Mehoela wrote:

>> Searching the google gives me the following method with gawk:
>>
>> awk -v FPAT='[^,]*|"[^"]*"' ...
>>
>> Regards
>>
>>
> Wrong, because " Embedded double quote characters may then be
> represented by a pair of consecutive double quotes,"
> (https://en.wikipedia.org/wiki/Comma-separated_values,
> http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm)

What's your solution for this issue?

wil...@wilbur.25thandclement.com

Jan 12, 2017, 10:15:09 PM

Thomas 'PointedEars' Lahn <Point...@web.de> wrote:
> Hongyi Zhao wrote:
>
>> On Thu, 12 Jan 2017 01:50:50 +0000, Hongyi Zhao wrote:
>>> See the following file for detail:
>>>
>>> https://github.com/jedisct1/dnscrypt-proxy/blob/master/dnscrypt-resolvers.csv
>>>
>>> I want to use the comma as the FS, but there are some field's in the
>>> file which includes comma. How to deal with this issue?
>>
>> Searching the google gives me the following method with gawk:
>>
>> awk -v FPAT='[^,]*|"[^"]*"' ...
>
> The second alternative should come first because fields matched by it are
> more likely longer, than shorter, than fields matched by the first
> alternative. Following the “longest match wins” principle of “greedy”
> regular expression matching, reversing the order of the alternative would
> allow them to be matched earlier.

That depends on the implementation. I think it would need to be a
backtracking matcher which also intentionally short-circuited to get the
desired behavior.

Strictly speaking, changing the order shouldn't matter. The alternation
operator (|) in regular expressions defines an expression that matches the
union of the alternatives. What you're describing is ordered choice; it's
one of the defining distinctions between regular expressions and parsing
expression grammars (PEGs).

Because the gawk documentation describes FPAT as a regular expression, without
any qualifications, I'd be surprised if changing the order mattered, and
surprised if the proposed solution actually works reliably--not merely
because quoted fields tend to be longer.

Rakesh Sharma

Jan 16, 2017, 8:28:30 AM

On Friday, 13 January 2017 05:42:38 UTC+5:30, Hongyi Zhao wrote:
> On Thu, 12 Jan 2017 18:54:39 +0100, A. Mehoela wrote:
>
> >> Searching the google gives me the following method with gawk:
> >>
> >> awk -v FPAT='[^,]*|"[^"]*"' ...
> >>
> >> Regards
> >>
> >>
> > Wrong, because " Embedded double quote characters may then be
> > represented by a pair of consecutive double quotes,"
> > (https://en.wikipedia.org/wiki/Comma-separated_values,
> > http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm)
>
> What's your solution for this issue?
>

Using Perl you could split on the regex shown:

my $re = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/;
print "<$_>" for split m/$re/, $your_csv_string;

Hongyi Zhao

Jan 16, 2017, 8:03:02 PM

On Mon, 16 Jan 2017 05:28:27 -0800, Rakesh Sharma wrote:

> Using Perl you could split on the regex shown:
>
> my $re = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/; print "<$_>" for split
> m/$re/, $your_csv_string;

Thanks. I'm currently learning Python; what's the corresponding Python
version of the above code?

Thomas 'PointedEars' Lahn

Jan 16, 2017, 11:13:36 PM

You have replied to a superseded posting.

wil...@wilbur.25thandClement.com wrote:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Your real name belongs there.

Funny MTA, BTW :)

> Thomas 'PointedEars' Lahn <Point...@web.de> wrote:
>> Hongyi Zhao wrote:
>>> Searching the google gives me the following method with gawk:
>>>
>>> awk -v FPAT='[^,]*|"[^"]*"' ...
>>
>> The second alternative should come first because fields matched by it are
>> more likely longer, than shorter, than fields matched by the first
>> alternative. Following the “longest match wins” principle of “greedy”
>> regular expression matching, reversing the order of the alternative would
>> allow them to be matched earlier.
>
> That depends on the implementation. I think it would need to be a
> backtracking matcher which also intentionally short-circuited to get the
> desired behavior.

As every *efficient* regex engine.

> Strictly speaking, changing the order shouldn't matter.

You are wrong.

> The alternation operator (|) in regular expressions defines an expression
> that matches the union of the alternatives. […]

Dead wrong. Read the excellent “Mastering Regular Expressions” by Jeffrey
E. F. Friedl. A free preview of the relevant chapter is available on the
O’Reilly Media Web site.

Rakesh Sharma

Jan 17, 2017, 2:41:45 AM

On Tuesday, 17 January 2017 06:33:02 UTC+5:30, Hongyi Zhao wrote:
> On Mon, 16 Jan 2017 05:28:27 -0800, Rakesh Sharma wrote:
>
> > Using Perl you could split on the regex shown:
> >
> > my $re = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/; print "<$_>" for split
> > m/$re/, $your_csv_string;
>
> Thanks, I'm learning python currently, what's the corresponding python
> version for the above code?
>

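#!/usr/bin/env python3

import re

# your_csv_string holds the string to be split up (sample value shown)
your_csv_string = '"foo","bar,baz","bla"'

# Split on each comma that is followed by an even number of double quotes
print(re.split(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))', your_csv_string))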

Eric

Jan 17, 2017, 5:40:04 AM

On 2017-01-17, Thomas 'PointedEars' Lahn <Point...@web.de> wrote:
> You have replied to a superseded posting.

Has he? You know that superseded posts get out into the world before they
are superseded. I too seem to have the message he replied to, and I do
not have any superseding message (in spite of collecting from more than
one news provider).

> wil...@wilbur.25thandClement.com wrote:
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Your real name belongs there.

Not if he doesn't want it to!

> Funny MTA, BTW :)

You haven't heard of tin? It's something of a classic (and still going).
And the releases are named after whisky. But it's not an MTA, it's a
newsreader.

Eric
--
ms fnd in a lbry

Jim Beard

Jan 17, 2017, 10:06:08 AM

On Tue, 17 Jan 2017 05:13:33 +0100, Thomas 'PointedEars' Lahn wrote:

> 25thandClement.com

If that is just down the street from Bill's Hamburgers and the State
Theater, and a block away from Geary, you may be identifying more about
yourself than you intended.

Or perhaps not. The Richmond District is a nice area.

Cheers!

jim b.

--
UNIX is not user-unfriendly, it merely expects users to be computer-
friendly.

Thomas 'PointedEars' Lahn

Jan 17, 2017, 10:31:41 AM

Eric wrote:

> On 2017-01-17, Thomas 'PointedEars' Lahn <Point...@web.de> wrote:
>> You have replied to a superseded posting.
>
> Has he?

Yes.

> [blub]

I do not need to be lectured by you about Usenet, thank you very much.

The fact is that I superseded the posting, and that he replied to the
superseded one instead of the one that superseded it. So he may have
referred to things I meant to say differently. Hence the notice.

>> wil...@wilbur.25thandClement.com wrote:
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> Your real name belongs there.
>
> Not if he doesn't want it to!

It belongs there if this should not be my only follow-up to his posting.

>> Funny MTA, BTW :)
>
> You haven't heard of tin? It's something of a classic (and still going).
> And the releases are named after whisky. But it's not an MTA, its a
> newsreader.

I meant what I wrote. Think about it.

And next time: Read, think, post. In that order.

Score adjusted, F'up2 poster

Thomas 'PointedEars' Lahn

Jan 17, 2017, 10:55:23 AM

Hongyi Zhao wrote:

> On Thu, 12 Jan 2017 18:54:39 +0100, A. Mehoela wrote:
>>> Searching the google gives me the following method with gawk:
>>>
>>> awk -v FPAT='[^,]*|"[^"]*"' ...
>>
>> Wrong, because " Embedded double quote characters may then be
>> represented by a pair of consecutive double quotes,"
>> (https://en.wikipedia.org/wiki/Comma-separated_values,
>> http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm)
>
> What's your solution for this issue?

1. Reverse the alternation for greater efficiency (AISB):

awk -v FPAT='"[^"]*"|[^,]*' ...

2. Handle the special case:

awk -v FPAT='"([^"]|"")*"|[^,]*' ...

[Here, although the second alternative within the first alternative
matches longer substrings, it is more efficient to put it in this
order because the substrings matched by [^"] are more likely than
those matched by "".]

I have not tested this approach with AWK but it is well-tested elsewhere
(I have written solidly working parsers based on it).

See also “Mastering Regular Expressions” (AISB) whose free chapter gave me
the idea (where “"([^\\]|\\")*"” is suggested to match string literals
delimited by double-quotes potentially containing escaped double-quotes [it
can be easily extended to allow for arbitrary escape sequences]).

Thomas 'PointedEars' Lahn

Jan 17, 2017, 11:32:57 AM

Hongyi Zhao wrote:

> On Mon, 16 Jan 2017 05:28:27 -0800, Rakesh Sharma wrote:
>> Using Perl you could split on the regex shown:
>>
>> my $re = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/; print "<$_>" for split
>> m/$re/, $your_csv_string;

Unnecessarily complicated, very likely also unnecessarily inefficient
because of all the assertions required.

With non-trivial field contents, you should match *fields* instead of
separators:

my $your_csv_string = '"foo","bar""baz","bla"';
while ($your_csv_string =~ /"(?:[^"]|"")*"|[^,]+/g) {
print "<$&>";
}

Note that using “+” instead of “*” is essential, *also in AWK*; otherwise
you will match zero-length non-fields [in Perl you can work around this with
“if ($& ne "")”, but why bother?]

> Thanks, I'm learning python currently, what's the corresponding python
> version for the above code?

Python is as off-topic here as is Perl, but FWIW:

#!/usr/bin/env python3

import re

your_csv_string = '"foo","bar""baz","bla"'
print(re.findall('"(?:[^"]|"")*"|[^,]+', your_csv_string))

#---------------------------------------------------------

RTFM.

Eric

Jan 17, 2017, 1:10:04 PM

On 2017-01-17, Thomas 'PointedEars' Lahn <Point...@web.de> wrote:
> Eric wrote:
>
>> On 2017-01-17, Thomas 'PointedEars' Lahn <Point...@web.de> wrote:
>>> You have replied to a superseded posting.
>>
>> Has he?
>
> Yes.
>

>> You know that superseded posts get out into the world before they are
>> superseded. I too seem to have the message he replied to, and I do not
>> have any superceding message (in spite of collecting from more than one
>> news provider).

(evidence restored which you rudely removed)

>
> I do not need to be lectured by you about Usenet, thank you very much.

Perhaps not, but in my view the evidence suggests otherwise. In any case
this is a public place and my remarks were as much for others as for
you.

Incidentally, you were the one who started lecturing about Usenet, and
you do it quite often. Why are you entitled while I am not?

> The fact is that I superseded the posting, and that he replied to the
> superseded one instead of the one that superseded it. So he may have
> referred to things I did meant to say differently. Hence the notice.

Yes, they may be the facts, but I have still not seen the superseding
posting (as I said above), and I assume that he hasn't either.

>>> wil...@wilbur.25thandClement.com wrote:
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> Your real name belongs there.
>>
>> Not if he doesn't want it to!
>
> It belongs there if this should not be my only follow-up to his posting.

What on earth has that got to do with it?

>>> Funny MTA, BTW :)
>>
>> You haven't heard of tin? It's something of a classic (and still going).
>> And the releases are named after whisky. But it's not an MTA, its a
>> newsreader.
>
> I meant what I wrote. Think about it.

I have. In my world, MTA stands for Mail Transfer Agent, and there is no
evidence of the involvement of any such software in his message. If you
think it stands for something else, please tell us.

> And next time: Read, think, post. In that order.

This is my normal practice. Obviously my thinking is different from
yours, which is no excuse for you to assume that I do not think. If in
your opinion I am wrong, you could explain why - unjustified claims of
wrongness achieve nothing.

> Score adjusted,

Score as you wish, mentioning it is an implied criticism.

> F'up2 poster

This is not the first time you have replied to someone publicly while
specifying that they should reply to you privately. I prefer my reply to
be equally public.

As an aside, we know that the apostrophe there indicates that something
is omitted. It may just be my Australian upbringing, but my first
reaction is to assume that exactly three letters are omitted.

Thomas 'PointedEars' Lahn

Jan 17, 2017, 3:47:59 PM

“Eric” trolled again:

> On 2017-01-17, Thomas 'PointedEars' Lahn <Point...@web.de> wrote:
>> Eric wrote:
>>>> wil...@wilbur.25thandClement.com wrote:
>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> Your real name belongs there.
>>> Not if he doesn't want it to!
>> It belongs there if this should not be my only follow-up to his posting.
>
> What on earth has that got to do with it?

You are not ready for the answer yet.

>>>> Funny MTA, BTW :)
>>> You haven't heard of tin? It's something of a classic (and still going).
>>> And the releases are named after whisky. But it's not an MTA, its a
>>> newsreader.
>> I meant what I wrote. Think about it.
>
> I have. In my world, MTA stands for Mail Transfer Agent, and there is no
> evidence of the involvement of any such software in his message. If you
> think it stands for something else, please tell us.

I wrote this below an e-mail address, genius.

You should lose the royal “we”; you are neither authorized nor qualified.

>> And next time: Read, think, post. In that order.
>
> This is my normal practice. […]

Well, it certainly does not appear to be so.

>> Score adjusted,
>
> Score as you wish, mentioning it is an implied criticism.

Yes, it is an implied criticism of your stupidity or your playing stupid.
Had you just wondered, you would have written me an e-mail, as it is
customary. But no, you had to attack the person only. You wanted a flame
war, that’s what you got – so don’t go crying about it. And you got it,
until now. I really have better things to do.

>> F'up2 poster
>
> This is not the first time you have replied to someone publicly while
> specifying that they should reply to you privately. I prefer my reply to
> be equally public.

Don’t you worry, honey, the SNR of your postings is so low, pending your
changing attitude by 180°, I will not reply to you anymore. At all.
*PLONK*

> As an aside, we know that the apostrophe there indicates that something
> is omitted. It my just be my Australian upbringing, but my first
> reaction is to assume that exactly three letters are omitted.

That statement says more about you than about me.

<http://catb.org/esr/faqs/smart-questions.html#not-losing>

And now you *really* have to pay attention: FOAD.

Thomas 'PointedEars' Lahn

Jan 17, 2017, 4:56:29 PM

Thomas 'PointedEars' Lahn wrote:

> Hongyi Zhao wrote:
>> On Mon, 16 Jan 2017 05:28:27 -0800, Rakesh Sharma wrote:
>>> Using Perl you could split on the regex shown:
>>>
>>> my $re = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/; print "<$_>" for split
>>> m/$re/, $your_csv_string;
>
> Unnecessarily complicated, very likely also unnecessarily inefficient
> because of all the assertions required.
>
> With non-trivial field contents, you should match *fields* instead of
> separators:
>
> my $your_csv_string = '"foo","bar""baz","bla"';
> while ($your_csv_string =~ /"(?:[^"]|"")*"|[^,]+/g) {
> print "<$&>";
> }
>
> Note that using “+” instead of “*” is essential, *also in AWK*; otherwise
> you will match zero-length non-fields [in Perl you can work around this
> with “if ($& ne "")”, but why bother?]

There is a problem with the approach that matches fields. CSV allows empty,
non-quoted fields that neither the variant with “*” nor the one with “+”
would match. The approach that splits on the delimiter can handle that, and
it can also handle the allowed case of only one field per record (although I
still think it is possible to devise a simpler, more efficient expression).
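The empty-field problem can be reproduced with a short Python sketch. It reuses the “+” field pattern from the earlier post; the sample record is made up and is meant to contain four fields, two of them empty.

```python
import re

# Field-matching pattern from the earlier post ("+" variant).
FIELD = re.compile(r'"(?:[^"]|"")*"|[^,]+')

# Made-up record with four fields: a, (empty), "b,c", (empty)
record = 'a,,"b,c",'

fields = FIELD.findall(record)
print(fields)       # ['a', '"b,c"']
print(len(fields))  # 2, although the record has four fields
```

Both empty fields vanish, so every field after the first empty one ends up with the wrong index.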

If I had Perl or Python, I would not attempt to reinvent the wheel, and
would use a CSV module/library available for those languages instead.

AWK supports POSIX Extended Regular Expressions, therefore it does not
support assertions, so an AWK solution that works within those
constraints would need to be found instead.
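One way to stay within those constraints is to skip regular expressions and scan the record character by character. The following is only a sketch in Python, with a made-up function name `split_csv_record`; it assumes one record per line (no newlines inside quoted fields) and has not been ported to or tested in AWK, although it uses nothing beyond indexing and string comparison.

```python
def split_csv_record(line, sep=","):
    """Split one CSV record into fields (hypothetical helper, no regex)."""
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"' and line[i + 1:i + 2] == '"':
                current.append('"')   # doubled quote -> literal quote
                i += 1                # skip the second quote of the pair
            elif ch == '"':
                in_quotes = False     # closing quote
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes = True          # opening quote
        elif ch == sep:
            fields.append("".join(current))  # separator ends the field
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))   # last field, possibly empty
    return fields


print(split_csv_record('"foo","bar""baz","bla"'))  # ['foo', 'bar"baz', 'bla']
print(split_csv_record('a,,"b,c",'))               # ['a', '', 'b,c', '']
```

Unlike the field-matching patterns, this keeps empty fields and also returns a single field for a record without any separator.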

wil...@wilbur.25thandclement.com

Jan 17, 2017, 8:15:12 PM

Thomas 'PointedEars' Lahn <Point...@web.de> wrote:

<snip>

>> Thomas 'PointedEars' Lahn <Point...@web.de> wrote:
>>> Hongyi Zhao wrote:
>>>> Searching the google gives me the following method with gawk:
>>>>
>>>> awk -v FPAT='[^,]*|"[^"]*"' ...
>>>
>>> The second alternative should come first because fields matched by it are
>>> more likely longer, than shorter, than fields matched by the first
>>> alternative. Following the “longest match wins” principle of “greedy”
>>> regular expression matching, reversing the order of the alternative would
>>> allow them to be matched earlier.
>>
>> That depends on the implementation. I think it would need to be a
>> backtracking matcher which also intentionally short-circuited to get the
>> desired behavior.
>
> As every *efficient* regex engine.
>
>> Strictly speaking, changing the order shouldn't matter.
>
> You are wrong.

And yet,

$ echo '"foo" ,' | gawk -v FPAT='[^,]*|"[^"]*"' '{ print "[" $1 "]" }'
["foo" ]
$ echo '"foo" ,' | gawk -v FPAT='"[^"]*"|[^,]*' '{ print "[" $1 "]" }'
["foo" ]

Notice that the longest match is captured ([^,]*) regardless of its position
in the alternation.

>> The alternation operator (|) in regular expressions defines an expression
>> that matches the union of the alternatives. […]
>
> Dead wrong. Read the excellent “Mastering Regular Expressions” by Jeffrey
> E. F. Friedl. A free preview of the relevant chapter is available on the
> O’Reilly Media Web site.

I find the distinction between regex-directed and text-directed regular
expressions misleading. IMO the former (regex-directed) aren't actually
regular expressions. Depending on your perspective, implementations which
short-circuit on the first alternation are either buggy, or poorly written.
But because of Perl many people seem to understand that it's how regular
expressions are supposed to behave.

That book is wrong on another point: an NFA implementation _can_ produce the
proper greedy matching behavior that doesn't eagerly exit an alternation.
Google "Thompson NFA".

For real regular expressions, NFAs and DFAs are equivalent. Implementations
taking either an NFA or DFA approach produce equivalent matching behavior. The
difference is that DFA-based implementations can't do substring captures.
And NFA implementations cannot implement backreferences efficiently. In
fact, perhaps handling backreferences performantly (in the common case, at
least) is why Perl's implementation departed from the semantics of regular
expressions.

A better description of regular expressions and implementations can be had
here:

https://swtch.com/~rsc/regexp/regexp1.html
https://swtch.com/~rsc/regexp/regexp2.html
https://swtch.com/~rsc/regexp/regexp3.html

FWIW, Mastering Regular Expressions is a decent book. I don't remember where
my copy of it went. Probably in storage somewhere. But it's misleading on
some aspects, I think.

Hongyi Zhao

Jan 17, 2017, 8:44:22 PM

On Tue, 17 Jan 2017 16:55:18 +0100, Thomas 'PointedEars' Lahn wrote:

> 1. Reverse the alternation for greater efficiency (AISB):

What's the meaning of AISB?

>
> awk -v FPAT='"[^"]*"|[^,]*' ...

Great solution. But I still can't figure out why the regexp _"[^"]*"_
can be used here to capture the embedded double quote characters, as in
the following:

$ echo '"foo","bar""baz","bla"' | awk -v FPAT='"[^"]*"|[^,]*' '{ print $2 }'
"bar""baz"

Thomas 'PointedEars' Lahn

Jan 17, 2017, 9:41:17 PM

Straw man. I have not said anything else.

Thomas 'PointedEars' Lahn

Jan 17, 2017, 10:01:47 PM

Hongyi Zhao wrote:

> On Tue, 17 Jan 2017 16:55:18 +0100, Thomas 'PointedEars' Lahn wrote:
>> 1. Reverse the alternation for greater efficiency (AISB):
>
> What's the meaning of AISB?

You already know what I am going to say.

>> awk -v FPAT='"[^"]*"|[^,]*' ...
>
> Great solution.

Well, it is *your* solution, or at least the one that *you* found somewhere.
I just tweaked it a little bit :)

> But I still cann't figure out why the regexp _"[^"]*"_
> can be used here to capture the Embedded double quote characters like the
> following:
>
> $ echo '"foo","bar""baz","bla"' | awk -v FPAT='"[^"]*"|[^,]*' '{ print
> $2 }'
> "bar""baz"

In theory, "[^"]*" matches "". But that is not the whole field number 2, so
I, too, am surprised that it does work here (probably as surprised as the
person who said earlier that it was wrong). It would be better to ask
about AWK problems in an AWK newsgroup – *after* consulting the available
reference material.

Hongyi Zhao

Jan 18, 2017, 3:12:48 AM

On Wed, 18 Jan 2017 04:01:43 +0100, Thomas 'PointedEars' Lahn wrote:

>> But I still cann't figure out why the regexp _"[^"]*"_
>> can be used here to capture the Embedded double quote characters like
>> the following:
>>
>> $ echo '"foo","bar""baz","bla"' | awk -v FPAT='"[^"]*"|[^,]*' '{ print
>> $2 }'
>> "bar""baz"
>
> In theory, "[^"]*" matches "". But that is not the whole field number
> 2, so I, too, am surprised that it does work here (probably as surprised
> as the person who said earlier that it were wrong). It would be better
> to ask about AWK problems in an AWK newsgroup – *after* consulting the
> available reference material.

I did some further testing with the above example:

$ echo '"foo","bar""baz","bla"' | awk -v FPAT='"[^"]*"|[^,]*' '{ print NF, $1, $2, $3 }'
3 "foo" "bar""baz" "bla"

$ echo '"foo","bar""baz","bla"' | awk -v FPAT='"[^"]*"' '{ print NF,$1,$2,$3,$4 }'
4 "foo" "bar" "baz" "bla"

As you can see, with and without the alternation `|', the pattern
_"[^"]*"_ splits out different contents. That seems strange to me.

Rakesh Sharma

Jan 18, 2017, 8:39:09 AM

echo '"foo","bar""baz","bla"' | awk -v FPAT='("[^"]*")*' '{ print NF,$1,$2,$3 }'

Eric

Jan 18, 2017, 9:40:10 AM

On 2017-01-17, Thomas 'PointedEars' Lahn <Point...@web.de> wrote:

> On 2017-01-17, Eric <er...@deptj.eu> wrote:
8>< -------- (correct attribution restored above ...
... also according to Mr Lahn he will not be reading this,
so I have quoted selectively to comment on just one thing)

>>>>> wil...@wilbur.25thandClement.com wrote:
>>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> Your real name belongs there.
>>>>>
>>>>> Funny MTA, BTW :)
>>>> You haven't heard of tin? It's something of a classic (and still going).
>>>> And the releases are named after whisky. But it's not an MTA, its a
>>>> newsreader.
>>> I meant what I wrote. Think about it.
>>
>> I have. In my world, MTA stands for Mail Transfer Agent, and there is no
>> evidence of the involvement of any such software in his message. If you
>> think it stands for something else, please tell us.
>
> I wrote this below an e-mail address

8>< --------
Now I know what he meant. Not as obvious as he presumably thought it
was.

Never mind, I should have ignored him.

Ben Bacarisse

Jan 18, 2017, 11:16:59 AM

Can you say what you find strange? Did you see my other post where I
explain that it's not "[^"]*" that is matching the field in question?

--
Ben.

Jerry Peters

Jan 18, 2017, 6:14:27 PM

Hongyi Zhao <hongy...@gmail.com> wrote:
> On Mon, 16 Jan 2017 05:28:27 -0800, Rakesh Sharma wrote:
>
>> Using Perl you could split on the regex shown:
>>
>> my $re = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/; print "<$_>" for split
>> m/$re/, $your_csv_string;
>
> Thanks, I'm learning python currently, what's the corresponding python
> version for the above code?
>
> Regards

If you're going to use python, why not use the csv module instead of a
regex? Google 'python csv' for details.
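As a minimal sketch of that suggestion, the standard-library `csv` module handles both commas inside quoted fields and doubled quotes. The sample data below is made up; a real script would pass an open file object to `csv.reader` instead of the `io.StringIO` buffer.

```python
import csv
import io

# Made-up sample data standing in for the real file.
sample = io.StringIO(
    'name,comment\n'
    '"Doe, Jane","said ""hi"""\n'
    'plain,\n'
)

rows = list(csv.reader(sample))
for row in rows:
    print(row)
# ['name', 'comment']
# ['Doe, Jane', 'said "hi"']
# ['plain', '']
```

The reader strips the enclosing quotes, collapses doubled quotes, and keeps the trailing empty field.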

Hongyi Zhao

Jan 19, 2017, 2:50:25 AM

On Wed, 18 Jan 2017 16:16:55 +0000, Ben Bacarisse wrote:

> Can you say what you find strange?

I mean that it seems strange because of my limited understanding of the
behavior of _"[^"]*"_. Sorry for my ambiguous description.

> Did you see my other post

Which post? I can only see this post in this thread.

> where I explain that it's not "[^"]*" that is matching the field in
> question?

Ben Bacarisse

Jan 19, 2017, 6:44:52 AM

Hongyi Zhao <hongy...@gmail.com> writes:

> On Wed, 18 Jan 2017 16:16:55 +0000, Ben Bacarisse wrote:
>
>> Can you say what you find strange?
>
> I mean due to my limited understanding on the behavior of _"[^"]*"_, it
> seems strange. Sorry for my ambiguous description.

I suspect you understand that pattern as much as you need to. What you
probably don't fully understand is the way | joins patterns together.

>> Did you see my other post
>
> Which post? I can only see this post in this thread.

Ah, maybe I forgot to send it -- I certainly wrote it! You wondered
(some time ago now) how "[^"]*" could match "abc""def" and I explained
that it doesn't -- it's the other half of the pattern that matches it,
i.e. [^,]*.

And now you are puzzled that "[^"]*"|[^,]* and "[^"]*" match differently.
You seemed to suggest that adding or removing the second part (the
[^,]* alternative) is somehow changing what "[^"]*" is matching. It
isn't.

In this example

echo '"abc""def"' | awk -v 'FPAT="[^"]*"' '{print $1}'

the pattern matches only "abc". In the version with two alternatives

echo '"abc""def"' | awk -v 'FPAT="[^"]*"|[^,]*' '{print $1}'

the pattern "[^"]*" still matches just "abc" but the other possibility --
any number of non-comma characters -- matches "abc""def" and the longer
match always wins.
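The "longer match wins" behaviour described above can be contrasted with an ordered-choice engine in a short Python sketch. Python's `re` module tries alternatives left to right, so the helper `longest_alternative` (a made-up name) emulates the longest-match rule by trying each alternative separately at one position and keeping the longest result. This is only an approximation for top-level alternatives, not a full POSIX matcher.

```python
import re


def longest_alternative(alternatives, text, pos=0):
    """Try every alternative at pos and return the longest match found."""
    best = None
    for pattern in alternatives:
        m = re.compile(pattern).match(text, pos)
        if m and (best is None or len(m.group()) > len(best)):
            best = m.group()
    return best


text = '"abc""def"'
alts = ['"[^"]*"', '[^,]*']

# Python's own alternation is ordered: the first alternative that matches wins.
print(re.match('|'.join(alts), text).group())            # "abc"
print(re.match('|'.join(reversed(alts)), text).group())  # "abc""def"

# Picking the longest candidate gives the same answer in either order.
print(longest_alternative(alts, text))                   # "abc""def"
print(longest_alternative(list(reversed(alts)), text))   # "abc""def"
```

With the longest-match rule the order of the alternatives makes no difference to the result, which is consistent with the awk output shown earlier in the thread.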

--
Ben.

Hongyi Zhao

Jan 19, 2017, 8:28:43 AM

On Thu, 19 Jan 2017 11:44:47 +0000, Ben Bacarisse wrote:

> I suspect you understand that pattern as much as you need to. What you
> probably don't fully understand is the way | joins patterns together.

Yes, this is a more complicated issue. You can refer to the discussion
here:

---------
From: Hongyi Zhao <hongy...@gmail.com>
Newsgroups: comp.lang.awk
Subject: Regexp "[^"]*" in awk.
---------

Hongyi Zhao

Jan 19, 2017, 8:45:18 AM

On Thu, 19 Jan 2017 11:44:47 +0000, Ben Bacarisse wrote:

[snip]

> the pattern "[^"]" still matches just "abc" but the other possibility --
> any number of non-comma characters -- matches "abc""def" and the longer
> match always wins.

This is just what I couldn't figure out before. Thanks a lot for your
wonderful explanation.

And I have also posted the related topic on comp.lang.awk, following your
suggestion:

--------
From: Hongyi Zhao <hongy...@gmail.com>
Newsgroups: comp.lang.awk
Subject: Regexp "[^"]*" in awk.
--------

Rakesh Sharma

Jan 19, 2017, 10:11:37 AM

On Thursday, 19 January 2017 19:15:18 UTC+5:30, Hongyi Zhao wrote:
> On Thu, 19 Jan 2017 11:44:47 +0000, Ben Bacarisse wrote:
>
> [snip]
> > the pattern "[^"]" still matches just "abc" but the other possibility --
> > any number of non-comma characters -- matches "abc""def" and the longer
> > match always wins.
>
> This is just want I cann't figure out before. Thanks a lot for your
> wonderful explanations.
>

This is not correct. Watch...

perl -le '
$_ = q{"abc""def"};
$r = qr/"[^"]*"|[^,]*/;
m/$r/ and print $&; # matches "abc" due to the "[^"]*" component

# now reverse the regex components
$R = qr/[^,]*|"[^"]*"/;
m/$R/ and print $&; # matches "abc""def" due to the [^,]* component
'

So, the rule for | in a regex is that the alternatives are tried left to
right, and the search stops as soon as the first match is found. There is no
longest/shortest match criterion applied here.

What is happening with your awk FPAT variable is that some further processing
is going on under the hood, which makes it behave the way you are observing.

Take a look at the example FPAT regex I showed where we can grab the fields
without the use of the |.

HTH

A. Mehoela

Jan 19, 2017, 3:23:15 PM

Hongyi Zhao wrote:

> On Thu, 12 Jan 2017 18:54:39 +0100, A. Mehoela wrote:
>
>>> Searching the google gives me the following method with gawk:
>>>
>>> awk -v FPAT='[^,]*|"[^"]*"' ...
>>>
>>> Regards
>>>
>> Wrong, because "Embedded double quote characters may then be
>> represented by a pair of consecutive double quotes,"
>> (https://en.wikipedia.org/wiki/Comma-separated_values,
>> http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm)
>
> What's your solution for this issue?
>
> Regards

Extend the pattern so that a quoted field may also contain pairs of
double quotes.
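
One way to read that suggestion, as a sketch rather than a complete CSV
parser: widen the quoted alternative to "([^"]|"")*" so that a doubled
quote is accepted inside a quoted field. With gawk the widened pattern
could presumably be assigned to FPAT directly; the helper below
(csv_fields is a made-up name) imitates the same field-by-field matching
with match(), so it does not need FPAT. It assumes one record per line
and well-formed quoting, and it leaves the quotes in place.

```shell
# csv_fields: read CSV records on stdin, print each field on its own line.
# Sketch only: fields containing newlines and malformed quoting are not handled.
csv_fields() {
    awk '{
        rest = $0
        while (1) {
            # Anchored at the start of what is left; awk takes the longest
            # of the two alternatives, so a complete quoted field wins.
            match(rest, /^("([^"]|"")*"|[^,]*)/)
            print substr(rest, 1, RLENGTH)
            rest = substr(rest, RLENGTH + 1)
            if (rest == "") break      # end of the record
            rest = substr(rest, 2)     # skip the separating comma
        }
    }'
}

printf '%s\n' 'x,"say ""hi"", ok",y' | csv_fields
```

For the sample record this should print three lines: x, then
"say ""hi"", ok" (quotes included), then y.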

Ben Bacarisse

Jan 19, 2017, 4:19:42 PM

Rakesh Sharma <shar...@hotmail.com> writes:

> On Thursday, 19 January 2017 19:15:18 UTC+5:30, Hongyi Zhao wrote:
>> On Thu, 19 Jan 2017 11:44:47 +0000, Ben Bacarisse wrote:
>>
>> [snip]
>> > the pattern "[^"]*" still matches just "abc" but the other possibility --
>> > any number of non-comma characters -- matches "abc""def" and the longer
>> > match always wins.
>>
>> This is just what I couldn't figure out before. Thanks a lot for your
>> wonderful explanation.
>>
>
> This is not correct. Watch...
>
> perl -le '
> $_ = q{"abc""def"};
> $r = qr/"[^"]*"|[^,]*/;
> m/$r/ and print $&; # matches "abc" due to the "[^"]*" component
>
> # now reverse the regex components
> $R = qr/[^,]*|"[^"]*"/;
> m/$R/ and print $&; # matches "abc""def" due to the [^,]* component
> '
>
> So, the rule in | regex is that the alternation is searched L->R and the
> search stops as soon as the first match is found.

I don't think you can say "the rule".

> There is no longest/shortest match criterion applied here.

I think there is in AWK, though I don't know what the standards say.
For example:

$ awk '{gsub(/"[^"]*"|[^,]*/, "X"); print}' <<<'"abc""def"'
X

I've tried gawk and mawk and got the same results. The
"native" // syntax also gives the same match as gsub (no surprises
there).

grep (GNU grep) also picks the longest match:

$ grep -E '^("[^"]*"|[^,]*)$' <<<'"abc""def"'
"abc""def"

Using grep's match colouring option and a slightly different input
avoids any confusion or alteration caused by the added ^(...)$ but it's
harder to post helpful output here.

Emacs also matches the longest alternative as does sed:

$ sed -e 's/"[^"]*"\|[^,]*/X/' <<<'"abc""def"'
X

Of course all these may be using the same library so how many programs
behave in some particular way is not informative.

<snip>
--
Ben.

Kaz Kylheku

Jan 19, 2017, 5:30:00 PM

On 2017-01-19, Rakesh Sharma <shar...@hotmail.com> wrote:
> So, the rule in | regex is that the alternation is searched L->R and the
> search stops as soon as the first match is found. There is no
> longest/shortest match criterion applied here.

The | operator commutes. A|B and B|A are equivalent.

Typically, both are compiled together into an NFA graph, that is
often translated to a DFA.

If a regex is executed by simulation of the NFA, the
simulator traverses both branches simultaneously, in effect.
It keeps track of the set of NFA graph nodes in which the
machine can be at any given input symbol.

If the NFA is converted to a DFA, there are then no traces
of the original | syntax at all.
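
A quick way to check the commutativity claim against whatever awk is
installed, as a sketch: first_match is a hypothetical helper that prints
the leftmost match awk reports for a pattern. The patterns and the input
are the ones already used in this thread.

```shell
# first_match ERE STRING: print the leftmost match of ERE in STRING,
# as chosen by the awk on this system.
first_match() {
    printf '%s\n' "$2" |
        awk -v re="$1" '{ if (match($0, re)) print substr($0, RSTART, RLENGTH) }'
}

first_match '"[^"]*"|[^,]*' '"abc""def"'   # quoted alternative first
first_match '[^,]*|"[^"]*"' '"abc""def"'   # alternatives swapped
first_match '"[^"]*"'       '"abc""def"'   # quoted alternative on its own
```

With a longest-match awk the first two calls should both print the whole
input, "abc""def", whichever way round the alternatives are written, and
the third should print only "abc".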

Kaz Kylheku

Jan 19, 2017, 6:57:48 PM

On 2017-01-19, Hongyi Zhao <hongy...@gmail.com> wrote:
> On Thu, 19 Jan 2017 11:44:47 +0000, Ben Bacarisse wrote:
>
>> I suspect you understand that pattern as much as you need to. What you
>> probably don't fully understand is the way | joins patterns together.
>
> Yes, this is a more complicated issue. You can reference the discussion
> here:

Not really complicated; just some distracting side discussion
there makes it seem that way.

The breakdown is this:

Ignoring "Unixy embellishments" like anchoring and register capture, a
regular expression is equivalent to a set of strings, and to a finite
state machine for recognizing those strings:

regex <==> set <==> FSM

for instance the regex ab* denotes the countably infinite set of strings
{ "a", "ab", "abb", "abbb", ... }

Given a regex A|B, this simply denotes a set union set(A) U set(B).

The A|B regex denotes the combined set of strings from the
set denoted by A and from the set denoted by B.

For instance a*|bb+ denotes the set

{ "", "a", "aa", ..., "bb", "bbb", ... }

In a set, the order doesn't matter, so we could list the
elements in increasing length:

{ "", "a", "aa", "bb", "aaa", "bbb", "aaaa", "bbbb", ... }

The "" empty string is included because it comes from a*, which
matches zero or more, and so matches that empty string.

When we are searching, matching, splitting and so forth, using a regex,
the set interpretation applies: the longest matching element of the set
is identified in the given input. The regex implementation answers
for us questions like "does this input string occur in the set?"
or "how long is the longest prefix of this input which is still
found in the set?"
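
The "longest prefix still found in the set" question can be put to awk
directly. This is only a sketch: longest_prefix is a made-up helper that
anchors the pattern at the start of the input and prints whatever
match() reports there.

```shell
# longest_prefix ERE STRING: print the longest prefix of STRING that belongs
# to the set denoted by ERE. Empty output means only "" matched (or nothing).
longest_prefix() {
    printf '%s\n' "$2" | awk -v re="^($1)" '{
        if (match($0, re)) print substr($0, 1, RLENGTH)
        else print ""
    }'
}

for s in aaab bbbx bx; do
    printf '%s -> [%s]\n' "$s" "$(longest_prefix 'a*|bb+' "$s")"
done
```

For a*|bb+ this should give [aaa] for aaab, [bbb] for bbbx, and [] for
bx: a single b is not in the set, so only the empty string from a*
matches there.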

Ben Bacarisse

Jan 19, 2017, 7:03:10 PM

Kaz Kylheku <221-50...@kylheku.com> writes:

> On 2017-01-19, Rakesh Sharma <shar...@hotmail.com> wrote:
>> So, the rule in | regex is that the alternation is searched L->R and the
>> search stops as soon as the first match is found. There is no
>> longest/shortest match criterion applied here.
>
> The | operator commutes. A|B and B|A are equivalent.
>
> Typically, both are compiled together into an NFA graph, that is
> often translated to a DFA.

That's the comp.theory answer. Not all regexp definitions (let alone
implementations) fit that picture. Perl's A|B is defined in a rather
operational way -- you feel you need to know things about the algorithm
to be sure what's going to match.

<snip>
--
Ben.

William Ahern

Jan 20, 2017, 5:00:11 PM

The POSIX standard defines the most relevant behavior.

    If the pattern permits a variable number of matching characters and thus
    there is more than one such sequence starting at that point, the longest
    such sequence is matched. For example, the BRE "bb*" matches the second to
    fourth characters of the string "abbbc", and the ERE
    "(wee|week)(knights|night)" matches all ten characters of the string
    "weeknights".

    -- POSIX.1-2008, 2016 edition, Base Definitions, ch. 9 sec. 1

There's no substitute for actually testing the implementation, but the
specification should establish expectations. Especially when it's consonant
with the formal definition of regular expressions. Although FPAT is an
extension, there's no reason to believe an implementation would diverge from
the specification in the relevant regard.

It's unfortunate that in the example above it's only the first subexpressions
of the two alternations that, combined, match the longest string, so the
example does not distinguish longest-match from first-match behavior. The
English description is unequivocal, "the longest ... sequence is matched", at
least if taken at face value. Maybe the example should be reordered. Does
anybody here participate in the standards activity? It might be worth
submitting a defect report through
http://austingroupbugs.net/view_all_bug_page.php

FWIW, the perlre man page describes Perl's divergent behavior and the
consequence.

    Alternatives are tried from left to right, so the first alternative found
    for which the entire expression matches, is the one that is chosen. This
    means that alternatives are not necessarily greedy. For example: when
    matching "foo|foot" against "barefoot", only the "foo" part will match, as
    that is the first alternative tried, and it successfully matches the
    target string. (This might not seem important, but it is important when
    you are capturing matched text using parentheses.)
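
For contrast, the same foo|foot example can be tried with a POSIX-style
tool. A small sketch, assuming a sed that accepts -E for extended regular
expressions; mark_first is a made-up helper that brackets the first match.

```shell
# mark_first ERE STRING: put [ ] around the first match of ERE in STRING.
# Assumes sed supports -E (extended regular expressions).
mark_first() {
    printf '%s\n' "$2" | sed -E "s/$1/[&]/"
}

mark_first 'foo|foot' barefoot
mark_first 'foot|foo' barefoot
```

Under the longest-match rule quoted from POSIX, both calls should print
bare[foot], regardless of the order of the alternatives, whereas the
perlre text describes Perl stopping at "foo" for the first ordering.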

Thomas 'PointedEars' Lahn

Jan 21, 2017, 10:21:16 PM

Ben Bacarisse wrote:

> In this example
>
> echo '"abc""def"' | awk -v 'FPAT="[^"]*"' '{print $1}'
>
> the pattern matches only "abc". In the version with two alternatives
>
> echo '"abc""def"' | awk -v 'FPAT="[^"]*"|[^,]*' '{print $1}'
>
> the pattern "[^"]*" still matches just "abc" but the other possibility --
> any number of non-comma characters -- matches "abc""def" and the longer
> match always wins.

Ah, that makes sense: POSIX ERE are different here. Being so used to
ECMAScript RegExp and PCRE (where the first matching alternative wins),
I did not think of that.

Then there should indeed not be a performance gain from switching the
alternatives in AWK and egrep(1) as long as it is implemented this way.

Hongyi Zhao

Jan 23, 2017, 9:55:55 PM

On Thu, 19 Jan 2017 11:44:47 +0000, Ben Bacarisse wrote:

> In this example
>
> echo '"abc""def"' | awk -v 'FPAT="[^"]*"' '{print $1}'

What's the difference between the following forms:

echo '"abc""def"' | awk -v FPAT='"[^"]*"'

or

echo '"abc""def"' | awk -v 'FPAT="[^"]*"'

Which form is preferable?

Eric

Jan 28, 2017, 11:10:04 AM

On 2017-01-24, Hongyi Zhao <hongy...@gmail.com> wrote:
> On Thu, 19 Jan 2017 11:44:47 +0000, Ben Bacarisse wrote:
>
>> In this example
>>
>> echo '"abc""def"' | awk -v 'FPAT="[^"]*"' '{print $1}'
>
> What's the difference between the following forms:
>
> echo '"abc""def"' | awk -v FPAT='"[^"]*"'
>
> or
>
> echo '"abc""def"' | awk -v 'FPAT="[^"]*"'
>
> Which form is preferable?

As they stand, neither, since they both produce an error!

They both need '{print $1}' (or some other AWK action) at the end of
them.

Assuming that is done, the difference is that the first quotes only the
variable value in the -v option and the other quotes the variable name
(and equals sign) as well. There is _no_ difference in behavior as they
are. Some people may think that the second looks neater, but I think
the first is clearer.

The first allows you to let a shell substitution provide all or part
of the variable name - I can't think of a reason for doing that at
the moment, but no doubt someone will. The second is only _necessary_
if the variable name is to contain something that the shell would
otherwise substitute - but most such things would leave you with an
invalid variable name!
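
To see that the two forms hand awk exactly the same arguments, the shell
can be asked to print what it would pass along. A minimal sketch:
show_args and the variable name are made up for the illustration and
stand in for the awk invocation.

```shell
# show_args: print each argument in brackets, one per line, exactly as the
# invoked program would receive it after the shell has removed the quoting.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

show_args -v FPAT='"[^"]*"'      # only the value is quoted
show_args -v 'FPAT="[^"]*"'      # name, equals sign and value are quoted

# With the name left outside the quotes, the shell can supply it.
name=FPAT
show_args -v "$name"='"[^"]*"'
```

All three calls should print the same two lines, [-v] followed by
[FPAT="[^"]*"], which is why awk cannot tell the forms apart.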