Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Regexp "[^"]*" in awk.

60 views
Skip to first unread message

Hongyi Zhao

unread,
Jan 18, 2017, 3:04:32 AM1/18/17
to
Hi all,

See the following testing:

$ echo '"foo","bar""baz","bla"' | awk -v FPAT='"[^"]*"|[^,]*' '{ print
$2 }'
"bar""baz"

$ awk -V
GNU Awk 4.1.60, API: 1.2
Copyright (C) 1989, 1991-2015 Free Software Foundation.

Any hints for the result of the above regexp?

Regards
--
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Kaz Kylheku

unread,
Jan 18, 2017, 3:17:07 AM1/18/17
to
On 2017-01-18, Hongyi Zhao <hongy...@gmail.com> wrote:
> Hi all,
>
> See the following testing:
>
> $ echo '"foo","bar""baz","bla"' | awk -v FPAT='"[^"]*"|[^,]*' '{ print
> $2 }'
> "bar""baz"
>
> $ awk -V
> GNU Awk 4.1.60, API: 1.2
> Copyright (C) 1989, 1991-2015 Free Software Foundation.
>
> Any hints for the result of the above regexp?

A|B finds the longest possible match for A and for
B. The longer of the two wins.

[^,]* matches "bar""baz".

"[^"]*" matches only "bar".

Hongyi Zhao

unread,
Jan 18, 2017, 7:58:04 AM1/18/17
to
On Wed, 18 Jan 2017 08:17:05 +0000, Kaz Kylheku wrote:

> A|B finds the longest possible match for A and for B. The longer of the
> two wins.

Is this the rule used only by awk or a rule also applied to regexps in
other tools, say, python/perl/sed, etc.?

Regards

>
> [^,]* matches "bar""baz".
>
> "[^"]*" matches only "bar".





Janis Papanagnou

unread,
Jan 18, 2017, 8:21:15 AM1/18/17
to
On 18.01.2017 13:58, Hongyi Zhao wrote:
> On Wed, 18 Jan 2017 08:17:05 +0000, Kaz Kylheku wrote:
>
>> A|B finds the longest possible match for A and for B. The longer of the
>> two wins.
>
> Is this the rule used only by awk or a rule also applied to regexps in
> other tools, say, python/perl/sed, etc.?

This "rule" is (as formulated) just not true as you can easily verify...

$ awk 'match($0,/(aabb|naa)/){print substr($0,RSTART,RLENGTH)}' <<< nnaabbnn
naa

Actually the longest match for A and the longest match for B is considered,
and in the example (obviously) the first one selected that is found.

Janis

Manuel Collado

unread,
Jan 18, 2017, 4:02:34 PM1/18/17
to
El 18/01/2017 14:21, Janis Papanagnou escribió:
> On 18.01.2017 13:58, Hongyi Zhao wrote:
>> On Wed, 18 Jan 2017 08:17:05 +0000, Kaz Kylheku wrote:
>>
>>> A|B finds the longest possible match for A and for B. The longer of the
>>> two wins.
>>
>> Is this the rule used only by awk or a rule also applied to regexps in
>> other tools, say, python/perl/sed, etc.?
>
> This "rule" is (as formulated) just not true as you can easily verify...
>
> $ awk 'match($0,/(aabb|naa)/){print substr($0,RSTART,RLENGTH)}' <<< nnaabbnn
> naa
>
> Actually the longest match for A and the longest match for B is considered,
> and in the example (obviously) the first one selected that is found.

Just to clarify things. My understanding is that match():

1. Sets RSTART for the first substring that matches the regexp, and

2. Sets RLENGTH for the longest substring that starts at RSTART and
matches the regexp.

Hope this helps.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Kaz Kylheku

unread,
Jan 18, 2017, 6:31:57 PM1/18/17
to
On 2017-01-18, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 18.01.2017 13:58, Hongyi Zhao wrote:
>> On Wed, 18 Jan 2017 08:17:05 +0000, Kaz Kylheku wrote:
>>
>>> A|B finds the longest possible match for A and for B. The longer of the
>>> two wins.
>>
>> Is this the rule used only by awk or a rule also applied to regexps in
>> other tools, say, python/perl/sed, etc.?
>
> This "rule" is (as formulated) just not true as you can easily verify...

I'm afraid you are leading yourself astray here.

> $ awk 'match($0,/(aabb|naa)/){print substr($0,RSTART,RLENGTH)}' <<< nnaabbnn
> naa

Here, the regular expression is not simply being matched.

The function is called "match", but it in fact performs a search.

The outer loop of that search has nothing to do with the regex.

> Actually the longest match for A and the longest match for B is considered,
> and in the example (obviously) the first one selected that is found.

This is because, effectively, the function does something like:

for i from 0 to len(str)-1
sub = substr(str, 0, len(str))
cnt = regex_match(reg, sub)
if (cnt > 0)
break # report match of length cnt at position i

The embedded regex_match itself strictly obeys the behavior that
given A|B, the longer of the two branch matches determines
the match length. The outer loop uses the regex API to find
the leftmost substring of str which matches reg. That
substring is the longest possible match at that leftmost
matching position that must be reported.

Janis Papanagnou

unread,
Jan 18, 2017, 7:25:17 PM1/18/17
to
On 19.01.2017 00:31, Kaz Kylheku wrote:
> On 2017-01-18, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 18.01.2017 13:58, Hongyi Zhao wrote:
>>> On Wed, 18 Jan 2017 08:17:05 +0000, Kaz Kylheku wrote:
>>>
>>>> A|B finds the longest possible match for A and for B. The longer of the
>>>> two wins.
>>>
>>> Is this the rule used only by awk or a rule also applied to regexps in
>>> other tools, say, python/perl/sed, etc.?
>>
>> This "rule" is (as formulated) just not true as you can easily verify...
>
> I'm afraid you are leading yourself astray here.
>
>> $ awk 'match($0,/(aabb|naa)/){print substr($0,RSTART,RLENGTH)}' <<< nnaabbnn
>> naa
>
> Here, the regular expression is not simply being matched.
>
> The function is called "match", but it in fact performs a search.
>
> The outer loop of that search has nothing to do with the regex.

Where did you get that from? - I inspected the gawk source and could not find
that loop that you propose below.

Janis

Janis Papanagnou

unread,
Jan 18, 2017, 7:58:28 PM1/18/17
to
On 19.01.2017 00:31, Kaz Kylheku wrote:
>> [ match() function ]
>

In addition to my previous post one more comment on your code below...

Ordinary (non-backtracking based) regexp matching is of complexity O(N).
If you assume the code below you would get a severe degradation in
performance. The outer loop would make it O(N^2). To me this looks like
code broken by design. (So, again, I'm interested where you got that
from.) Doing some as hoc performance tests with GNU awk with 16 lines
of 4 million characters per line (non-matching, i.e. full pattern match
without any early break), I do not get the performance degradation we
should expect.

Janis

> This is because, effectively, the function does something like:
>
> for i from 0 to len(str)-1
> sub = substr(str, 0, len(str))

BTW, assuming you meant something like this here

sub = substr(str, i, len(str))

Kaz Kylheku

unread,
Jan 18, 2017, 11:52:59 PM1/18/17
to
On 2017-01-19, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 19.01.2017 00:31, Kaz Kylheku wrote:
>>> [ match() function ]
>>
>
> In addition to my previous post one more comment on your code below...
>
> Ordinary (non-backtracking based) regexp matching is of complexity O(N).
> If you assume the code below you would get a severe degradation in
> performance.

That model with the search loop specifies the abstract behavior
(the result which is obtained), not necessarily the implementation.

Just like a naive substring search specifies what is to be found;
a Boyer-Moore or Knuth-Pratt-Morrison search will find it
faster in many circumstances.

It illustrates that the semantics of the | operator doesn't
change under the substring searching application of a regex.

> The outer loop would make it O(N^2). To me this looks like
> code broken by design. (So, again, I'm interested where you got that

"Unoptimized" isn't "broken".

The worst O(N*N) case doesn't always occur. E.g, suppose that
the regex matches at most M characters, M << N. Then it is
O(N*M). That worst case is only triggered if a number of
false matches occurs proportional to N times, each of which
is of a length proportional to M.

The regex a.b will not require k*N*N operations to find
the "abc" substring in aaaaaaaaaazzzzaaazaa...aaaaabcaaazzz...

> from.) Doing some as hoc performance tests with GNU awk with 16 lines
> of 4 million characters per line (non-matching, i.e. full pattern match
> without any early break), I do not get the performance degradation we
> should expect.

What are your test cases?

>> This is because, effectively, the function does something like:
^^^^^^^^^^^^ as in, producing this effect
>>
>> for i from 0 to len(str)-1
>> sub = substr(str, 0, len(str))
>
> BTW, assuming you meant something like this here
>
> sub = substr(str, i, len(str))

Yes.

Janis Papanagnou

unread,
Jan 19, 2017, 1:38:01 AM1/19/17
to
On 19.01.2017 05:52, Kaz Kylheku wrote:
> On 2017-01-19, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 19.01.2017 00:31, Kaz Kylheku wrote:
>>>> [ match() function ]
>>>
>>
>> In addition to my previous post one more comment on your code below...
>>
>> Ordinary (non-backtracking based) regexp matching is of complexity O(N).
>> If you assume the code below you would get a severe degradation in
>> performance.
>
> That model with the search loop specifies the abstract behavior
> (the result which is obtained), not necessarily the implementation.

The point was that the abstract code you provided makes no sense to
me. It immediately imposes an O(N^2) complexity where only O(N) would
do. The basic principle of the regexp search is that the input data
is traversed just once. The size of the regexp is in that respect a
constant finite factor.

(And any concrete implementation would need some counterpart of your
abstract model. So I'm still looking forward seeing concrete pointers.)

>
> Just like a naive substring search specifies what is to be found;
> a Boyer-Moore or Knuth-Pratt-Morrison search will find it
> faster in many circumstances.
>
> It illustrates that the semantics of the | operator doesn't
> change under the substring searching application of a regex.

The longest (greedy) match applies for patterns containing e.g.
quantifiers like * and + , where arbitrary amounts of an entity
can appear. In that case the longest match is taken in typical
implementations.

With '|' there's just an alternative branch defined in the FSM.
(There's no count (or something like that) that compares lengths
of matches in each branch; it's just not necessary.)

>
>> The outer loop would make it O(N^2). To me this looks like
>> code broken by design. (So, again, I'm interested where you got that
>
> "Unoptimized" isn't "broken".

It's not about optimization. It's about implementing something
unnecessary and the wrong way. (But I doubt anyway that it's
implemented in such a broken way; actual code references are welcome.)

>
> The worst O(N*N) case doesn't always occur. E.g, suppose that
> the regex matches at most M characters, M << N. Then it is
> O(N*M). That worst case is only triggered if a number of
> false matches occurs proportional to N times, each of which
> is of a length proportional to M.
>
> The regex a.b will not require k*N*N operations to find
> the "abc" substring in aaaaaaaaaazzzzaaazaa...aaaaabcaaazzz...

For Regular Expressions const * O(N) (or k * O(N) if you prefer)
suffices.

>
>> from.) Doing some as hoc performance tests with GNU awk with 16 lines
>> of 4 million characters per line (non-matching, i.e. full pattern match
>> without any early break), I do not get the performance degradation we
>> should expect.
>
> What are your test cases?

Repeating partial (incomplete) matches (i.e. no exact match), and one
exact match near to the end of the 16 million character long line.
pattern:
/(aabb|naa|XXXX|JJJJJJJJJxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)/
data:
A:BB:CCC:DDDD:EEEEE:FFFFFF:...(~16M)...JJJJJJJJJJXXXXXXXXXXXXX


Code references (preferable from GNU awk where we have public code
available) would help to show your point (if valid) and shorten the
abstract discussion we currently have.

Janis

Kaz Kylheku

unread,
Jan 19, 2017, 4:03:30 AM1/19/17
to
On 2017-01-19, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 19.01.2017 05:52, Kaz Kylheku wrote:
>> On 2017-01-19, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>> On 19.01.2017 00:31, Kaz Kylheku wrote:
>>>>> [ match() function ]
>>>>
>>>
>>> In addition to my previous post one more comment on your code below...
>>>
>>> Ordinary (non-backtracking based) regexp matching is of complexity O(N).
>>> If you assume the code below you would get a severe degradation in
>>> performance.
>>
>> That model with the search loop specifies the abstract behavior
>> (the result which is obtained), not necessarily the implementation.
>
> The point was that the abstract code you provided makes no sense to
> me. It immediately imposes an O(N^2) complexity where only O(N) would
> do.

More correctly, it relaxes the worst case upper bound on running
time to that.

Poor, yet very simple, algorithms have excellent uses.

They can clearly specify *what* is to be computed, so then
faster algorithms can be checked that they compute the same thing.

They provide a simple implementation when performance isn't
important.

The basic principle of the regexp search is that the input data

> (And any concrete implementation would need some counterpart of your
> abstract model. So I'm still looking forward seeing concrete pointers.)

Gawk-wise, see the function re_search_internal in regexec.c.

See the loop starting here:


for (;; match_first += incr)
{
err = REG_NOMATCH;
if (match_first < left_lim || right_lim < match_first)
goto free_return;

/* Advance as rapidly as possible through the string, until we
find a plausible place to start matching. This may be done
with varying efficiency, so there are various possibilities:
only the most common of them are specialized, in order to
save on code size. We use a switch statement for speed. */

and where it later calls the actual regex matcher:

/* It seems to be appropriate one, then use the matcher. */
/* We assume that the matching starts from 0. */
mctx.state_log_top = mctx.nbkref_ents = mctx.max_mb_elem_len = 0;
match_last = check_matching (&mctx, fl_longest_match,
range >= 0 ? &match_first : NULL);


What is check_matching? This:

/* Check whether the regular expression match input string INPUT or not,
and return the index where the matching end, return -1 if not match,
or return -2 in case of an error.
FL_LONGEST_MATCH means we want the POSIX longest matching.
If P_MATCH_FIRST is not NULL, and the match fails, it is set to the
next place where we may want to try matching.
Note that the matcher assume that the maching starts from the current
index of the buffer. */

static int
internal_function __attribute_warn_unused_result__
check_matching (re_match_context_t *mctx, int fl_longest_match,
int *p_match_first)


See? A function that appears to implement a pure regex automaton which
reports how many characters of the input match, if at all.

The function is optimized in how it searches forward through the string;
it doesn't just throw a "check_matching" call at every character
position. But it *could* do that. Quite possibly, the first versions of
it might have been like that.

Hongyi Zhao

unread,
Jan 19, 2017, 8:50:04 AM1/19/17
to
On Wed, 18 Jan 2017 14:21:14 +0100, Janis Papanagnou wrote:

>>> A|B finds the longest possible match for A and for B. The longer of
>>> the two wins.
>>
>> Is this the rule used only by awk or a rule also applied to regexps in
>> other tools, say, python/perl/sed, etc.?
>
> This "rule" is (as formulated) just not true as you can easily verify...
>
> $ awk 'match($0,/(aabb|naa)/){print substr($0,RSTART,RLENGTH)}' <<<
> nnaabbnn naa

But the following testings support the rule (the examples are given by
Ben Bacarisse):

$ echo '"abc""def"' | awk -v 'FPAT="[^"]*"' '{print $1}'
"abc"
$ echo '"abc""def"' | awk -v 'FPAT="[^"]*"|[^,]*' '{print $1}'
"abc""def"

Regards

Janis Papanagnou

unread,
Jan 20, 2017, 6:23:28 AM1/20/17
to
On 19.01.2017 14:50, Hongyi Zhao wrote:
> On Wed, 18 Jan 2017 14:21:14 +0100, Janis Papanagnou wrote:
>
>>>> A|B finds the longest possible match for A and for B. The longer of
>>>> the two wins.
>>>
>>> Is this the rule used only by awk or a rule also applied to regexps in
>>> other tools, say, python/perl/sed, etc.?
>>
>> This "rule" is (as formulated) just not true as you can easily verify...
>>
>> $ awk 'match($0,/(aabb|naa)/){print substr($0,RSTART,RLENGTH)}' <<<
>> nnaabbnn naa
>
> But the following testings support the rule (the examples are given by
> Ben Bacarisse):

No, it doesn't explain anything, unfortunately. (I think Ben already
pointed you to his explanation.)

>
> $ echo '"abc""def"' | awk -v 'FPAT="[^"]*"' '{print $1}'
> "abc"

Here you match a quoted string.

> $ echo '"abc""def"' | awk -v 'FPAT="[^"]*"|[^,]*' '{print $1}'
> "abc""def"

Here you additionally match any sequence that doesn't contain a comma (or
a quoted string).

From these samples you cannot infer anything. You need the exact matched
pattern to make statements about what's selected in case of alternatives
(that's why I used match(), a function with provides the match parameters).
That's why I gave you an example upthread that shows that in A|B you will
not (as had been suggested by another poster) get "the longer of A or B".

Janis

>
> Regards
>

Janis Papanagnou

unread,
Jan 20, 2017, 6:53:59 AM1/20/17
to
On 19.01.2017 10:03, Kaz Kylheku wrote:
> On 2017-01-19, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 19.01.2017 05:52, Kaz Kylheku wrote:
>>> On 2017-01-19, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>>> On 19.01.2017 00:31, Kaz Kylheku wrote:
>>>>>> [ match() function ]
>>>>>
>>>>
>>>> In addition to my previous post one more comment on your code below...
>>>>
>>>> Ordinary (non-backtracking based) regexp matching is of complexity O(N).
>>>> If you assume the code below you would get a severe degradation in
>>>> performance.
>>>
>>> That model with the search loop specifies the abstract behavior
>>> (the result which is obtained), not necessarily the implementation.
>>
>> The point was that the abstract code you provided makes no sense to
>> me. It immediately imposes an O(N^2) complexity where only O(N) would
>> do.
>
> More correctly, it relaxes the worst case upper bound on running
> time to that.

The worst case is O(K*N) not O(K*N*N), isn't it? (Presuming that there's
not something unnecessary implemented.)

>
> Poor, yet very simple, algorithms have excellent uses.

I think the nice thing about the theory of Regular Expressions is that we
have (without additional efforts) a fast O(K*N) method to match patterns.

>
> They can clearly specify *what* is to be computed, so then
> faster algorithms can be checked that they compute the same thing.
>
> They provide a simple implementation when performance isn't
> important.
>
> The basic principle of the regexp search is that the input data

[something is missing here]

>
>> (And any concrete implementation would need some counterpart of your
>> abstract model. So I'm still looking forward seeing concrete pointers.)
>
> Gawk-wise, see the function re_search_internal in regexec.c.
>
> See the loop starting here:

But this is (as far as I see) not a nested loop (which would impose an
unnecessary O(K*N*N) complexity). It rather seems to just quickly try to
find the first position where to start the matching. (See first comment:
"Advance as rapidly as possible through the string, until we find a
plausible place to start matching.")
The subsequent code seems to indicate that the purpose is that depending
on the character set they have to implement some additional means to find
the proper beginning (since there's obviously a problem with multi-byte
(UCS) or variable-byte (UTF) encodings).
(I'll dig deeper into that code if I have more time, and followup later
for additional iinsights.)

Janis Papanagnou

unread,
Jan 20, 2017, 6:57:35 AM1/20/17
to
I had used match() upthread but you can also use sub() to see that effect

$ awk 'sub(/(aabb|naa)/,"X")' <<< nnaabbxx
nXbbxx

>
> Janis
>
>>
>> Regards
>>
>

Kaz Kylheku

unread,
Jan 20, 2017, 7:22:40 AM1/20/17
to
The situation is that wherever A|B matches in the input, at that place
in the input, A matches m characters (possibly zero) and B matches n
characters. The length of the match is max(m, n).

The match function finds a leftmost match whose position is determined
independently of this consideration. The first (leftmost) match for A|B
is found, regardless of its length. That length is then max(m, n).

Can you demonstrate a counterexample?

Even if match searched for the longest matching substring of A|B,
the length of that match at the matching position would be max(m, n).

That would also even be the case if it searched for the righmost
substring.

You seem to have gratuitously misinterpreted the wording in some utterly
nonsensical way (whose exact structure I can hardly guess at it).

Kaz Kylheku

unread,
Jan 20, 2017, 7:33:14 AM1/20/17
to
On 2017-01-20, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 19.01.2017 10:03, Kaz Kylheku wrote:
>> On 2017-01-19, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>> On 19.01.2017 05:52, Kaz Kylheku wrote:
>>>> On 2017-01-19, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>>>> On 19.01.2017 00:31, Kaz Kylheku wrote:
>>>>>>> [ match() function ]
>>>>>>
>>>>>
>>>>> In addition to my previous post one more comment on your code below...
>>>>>
>>>>> Ordinary (non-backtracking based) regexp matching is of complexity O(N).
>>>>> If you assume the code below you would get a severe degradation in
>>>>> performance.
>>>>
>>>> That model with the search loop specifies the abstract behavior
>>>> (the result which is obtained), not necessarily the implementation.
>>>
>>> The point was that the abstract code you provided makes no sense to
>>> me. It immediately imposes an O(N^2) complexity where only O(N) would
>>> do.
>>
>> More correctly, it relaxes the worst case upper bound on running
>> time to that.
>
> The worst case is O(K*N) not O(K*N*N), isn't it? (Presuming that there's
> not something unnecessary implemented.)
>
>>
>> Poor, yet very simple, algorithms have excellent uses.
>
> I think the nice thing about the theory of Regular Expressions is that we
> have (without additional efforts) a fast O(K*N) method to match patterns.

I think it's useful also because it provides an algebra for
expressing large, possibly infinite sets of strings, which
also correspond to finite automata.

>>
>> They can clearly specify *what* is to be computed, so then
>> faster algorithms can be checked that they compute the same thing.
>>
>> They provide a simple implementation when performance isn't
>> important.
>>
>> The basic principle of the regexp search is that the input data
>
> [something is missing here]
>
>>
>>> (And any concrete implementation would need some counterpart of your
>>> abstract model. So I'm still looking forward seeing concrete pointers.)
>>
>> Gawk-wise, see the function re_search_internal in regexec.c.
>>
>> See the loop starting here:
>
> But this is (as far as I see) not a nested loop (which would impose an

Didn't I reveal the call to check_matching? That contains a loop,
of course:

while (!re_string_eoi (&mctx->input))
{
re_dfastate_t *old_state = cur_state;

Do you believe that if check_matching fails, the outer loop will advance
past all of the characters that were scanned, without backtracking?

Janis Papanagnou

unread,
Jan 20, 2017, 10:42:04 AM1/20/17
to
Performance evidence of /a/ and /(a|a+)/ in a file full of 'a' (see below).

>
> Even if match searched for the longest matching substring of A|B,
> the length of that match at the matching position would be max(m, n).
>
> That would also even be the case if it searched for the righmost
> substring.
>
> You seem to have gratuitously misinterpreted the wording in some utterly
> nonsensical way (whose exact structure I can hardly guess at it).

(I'm starting to believe that our dispute will not lead anywhere. I'll add
one final perspective that (given that you start with terms like "utterly
nonsensical") may or may not be enlightening.)

The basic question was whether always longest matches are selected (or not)
in all regexp expressions (in awk or the other mentioned languages). We can
observe that matching serves (at least) two purposes; "boolean" (true/false),
and "operational" purpose. In the first case which is covered by expressions
like
/pattern/ { action }
we can't see what's going on inside the /pattern/ match, which is okay since
all we want is to perform an action if it's true, if it matches. Whether
[internally] all matching substrings are determined until the longest match
is found is irrelevant. (Although we would expect an implementation to do
an early exit.) The second case is where we want to do some operation on
the matched word (examples have been provided already); match() allows to
identify the matching substring, and sub() and gsub() allow replacements of
matched substrings. We have seen that not necessarily the longest match was
taken in the cases where we're "operational" on the matched portion of the
word.
At this point we should also be aware that the "rules" that the OP is asking
for is for the purpose to get to know the visible behaviour of the tool he
uses. And we have yet to see an example where we can derive for sure that
[internally] the longest match would be effective, and yet there was also no
other evidence given.
On the contrary it needs explanation why /a/ matches as fast as /(a|a+)/ if
it would be true (as you said) that the longest match would be relevant in
/(a|a+)/ which would clearly be dominated by the /a+/ part. But tests show
that the /a+/ part does not dominate the /a/ part in performance.

I already said that previously but for completeness; longest matches _are_
relevant, but specifically in other regexp contexts than A|B; e.g. a+, a*, or
generally in cases where at any actual match position more than one pattern
applies. Longest matches are also the standard in most regexp parsers[*].
That's the reason why
awk 'sub(/(aabb|naa)/,"X")' <<< nnaabbxx
replaces "naa" (not the [overall] longest match /aabb/), but
awk 'sub(/(a|a+)/,"x")' <<< aaaaaaaa
replaces [at first possible occurrence] all "a" (the longest match for /a+/),
and, for completeness,
awk 'gsub(/(a|aa)/,"x")' <<< aaaaaaaaa # odd number of 'a'
replaces all "a", four times matched by /aa/ (longest match) and one time by
/a/.

Hope that clarifies matters.

Janis

[*] Even POSIX seems to mandate that in the relevant contexts, BTW and IIRC.

Janis Papanagnou

unread,
Jan 20, 2017, 11:04:31 AM1/20/17
to
On 20.01.2017 16:42, Janis Papanagnou wrote:
> On 20.01.2017 13:22, Kaz Kylheku wrote:
>> On 2017-01-20, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>> On 19.01.2017 14:50, Hongyi Zhao wrote:
>>>> On Wed, 18 Jan 2017 14:21:14 +0100, Janis Papanagnou wrote:
>>>>
>>>>>>> A|B finds the longest possible match for A and for B. The longer of
>>>>>>> the two wins.
>>>>>>
>>>>>> Is this the rule used only by awk or a rule also applied to regexps in
>>>>>> other tools, say, python/perl/sed, etc.?

Rushing ahead in a try to not escalate the dispute further, I try to condense
and give my resume, based on two of the examples I gave. With the thesis "The
longer of the two wins." in mind;

In general *not* if the match of the whole matched word is considered:
> awk 'sub(/(aabb|naa)/,"X")' <<< nnaabbxx
=> not the longer of the two wins here.

*yes* in case of considering any specific substring portion of the whole word:
> awk 'gsub(/(a|aa)/,"x")' <<< aaaaaaaaa
=> the longer aa has preference here.

(The OP may derive a more appropriate rule than "the longer wins" from those
examples.)

Janis

Janis Papanagnou

unread,
Jan 20, 2017, 11:27:03 AM1/20/17
to
On 20.01.2017 13:33, Kaz Kylheku wrote:
> On 2017-01-20, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>
>> I think the nice thing about the theory of Regular Expressions is that we
>> have (without additional efforts) a fast O(K*N) method to match patterns.
>
> I think it's useful also because it provides an algebra for
> expressing large, possibly infinite sets of strings, which
> also correspond to finite automata.

Yes, it's an invaluable method.

In the context of this talk it may be noteworthy to point out that
matching a word that is part of the Chomsky-3 language defined by the
Regular Expression _for specific substrings_ only can be extended by
adding .* to the front and rear of the pattern string, so that the
new regexp defines all words with matching substrings as part of the
corresponing regexp-defined language. Why I mention that is that it
provides the possibility to not introduce additional spurious string
traversing loops if all we want is to determine whether a parsed word
is or is not part of the Ch-3 language defined by the regexp.

>>
>> But this is (as far as I see) not a nested loop (which would impose an
>
> Didn't I reveal the call to check_matching? That contains a loop,
> of course:
>
> while (!re_string_eoi (&mctx->input))
> {
> re_dfastate_t *old_state = cur_state;
>
> Do you believe that if check_matching fails, the outer loop will advance
> past all of the characters that were scanned, without backtracking?

I do not believe anything here; as I said, I just need some more time
(and a bit quietude) to analyse the code.

Janis

Kaz Kylheku

unread,
Jan 20, 2017, 6:16:40 PM1/20/17
to
I don't fathom how you can use performance to confirm or refute
the length of a match. Five characters matched in a year is the
same result as those same five characters matched in a microsecond.

Can you show any situation where a match for some A|B matches
a place where A alone would have a 7 character match, and B alone
a 3 character match (or vice versa), but a 3 character match is returned
rather than a 7?

> The basic question was whether always longest matches are selected (or not)
> in all regexp expressions (in awk or the other mentioned languages). We can

The basic question was, why is this longer thing found rather than
this shorter prefix of it.

> observe that matching serves (at least) two purposes; "boolean" (true/false),
> and "operational" purpose. In the first case which is covered by expressions
> like
> /pattern/ { action }
> we can't see what's going on inside the /pattern/ match, which is okay since
> all we want is to perform an action if it's true, if it matches. Whether
> [internally] all matching substrings are determined until the longest match
> is found is irrelevant.

That is perfectly true.

For instance, to determine whether a string S has a prefix which matches
regular expression R, we indeed do not have to find the longest such
prefix! Basically, we can feed the characters through the equivalent
automaton and stop as soon as it hits an accceptance state. (It may
even start in one, if the regex matches the empty string).

Similar reasoning applies to finding whether a substring of S matches.

> At this point we should also be aware that the "rules" that the OP is asking
> for is for the purpose to get to know the visible behaviour of the tool he
> uses. And we have yet to see an example where we can derive for sure that
> [internally] the longest match would be effective, and yet there was also no
> other evidence given.
> On the contrary it needs explanation why /a/ matches as fast as /(a|a+)/ if
> it would be true (as you said) that the longest match would be relevant in
> /(a|a+)/ which would clearly be dominated by the /a+/ part. But tests show
> that the /a+/ part does not dominate the /a/ part in performance.

a|a++ denotes the same set of strings as a+.

These two regexes will produce different NFA graphs, if not optimized
at the abstract syntax tree. I think when the DFA subset construction
is applied, we get identical automata.
>
> I already said that previously but for completeness; longest matches _are_
> relevant, but specifically in other regexp contexts than A|B; e.g. a+, a*, or
> generally in cases where at any actual match position more than one pattern
> applies.

How is A|B not a situation where more than one pattern
applies at a given match position?

> Longest matches are also the standard in most regexp parsers[*].
> That's the reason why
> awk 'sub(/(aabb|naa)/,"X")' <<< nnaabbxx
> replaces "naa" (not the [overall] longest match /aabb/), but

naa is absolutely the longest match at the second character
of the string. aabb does not match there.

These two expressions are mutually exclusive: one always matches
zero characters if the other matches nonzero.

Yet, the length of the match is max(m, n) as before.

> awk 'sub(/(a|a+)/,"x")' <<< aaaaaaaa
> replaces [at first possible occurrence] all "a" (the longest match for /a+/),

With this regex, every possible match for a+ is also a match
for a, and is at least as long.
0 new messages