why is $ not literal mid-string in an ERE?

Ed Morton

Aug 18, 2022, 9:27:51 AM

When I write a regexp that has a `$` in the middle of it I write it as
either of:

sed 's/foo\$bar/stuff/'
sed 's/foo[$]bar/stuff/'

so that it's clear the `$` should be treated literally. Given that, I've
never noticed before that an unescaped `$` mid-regexp is treated
differently in BREs vs EREs, e.g.:

$ echo 'foo$bar' | sed 's/foo$bar/stuff/'
stuff

$ echo 'foo$bar' | sed -E 's/foo$bar/stuff/'
foo$bar

As far as I can see, the relevant quotes of the POSIX spec
(https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html)
for BREs are:

-----
$
The <dollar-sign> shall be special when used as an anchor.

A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a <dollar-sign>
as an anchor when used as the last character of a subexpression. The
<dollar-sign> shall anchor the expression (or optionally subexpression)
to the end of the string being matched; the <dollar-sign> can be said to
match the end-of-string following the last character.
-----

and for EREs (emphasis mine):

-----
$
The <dollar-sign> shall be special when used as an anchor.

A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the last
character of a string. For example, the EREs "ef$" and "(ef$)" match
"ef" in the string "abcdef", but fail to match in the string "cdefab",
and **the ERE "e$f" is valid, but can never match because the 'f'
prevents the expression "e$" from matching ending at the last character**.
-----

So, the BRE section doesn't explicitly state what `$` means when it's
not at the end of a regexp but given the "special when used as an
anchor" statement, it makes sense to take that as meaning it's literal
otherwise and that is how the various tools I've tried are interpreting it.
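
A minimal sketch of that BRE reading, using grep rather than sed. `count_bre` is a made-up helper for illustration only, and the counts in the comments assume a grep that behaves like the tools mentioned above:

```shell
# Hypothetical helper: $1 is a BRE, the remaining arguments are input
# lines; prints how many of those lines the BRE matches.
count_bre() {
    re=$1
    shift
    printf '%s\n' "$@" | grep -c -e "$re"
}

# '$' mid-regexp is taken literally: only the line a$b matches.
count_bre 'a$b' 'a$b' 'ab'    # prints 1

# '$' as the last character is an anchor: only the line ending in b
# matches; the line b$ (which ends in a literal '$') does not.
count_bre 'b$' 'a$b' 'b$'     # prints 1
```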

The ERE section, however, has that same statement about `$` being
special when used as an anchor, but then goes on to state that when it's
mid-regexp, e.g. `e$f`, it should NOT be treated literally even though
doing so means the regexp that includes it can never match anything.
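
The examples from the quoted ERE text can be tried out with something like the following sketch. `count_ere` is a made-up helper, and the counts in the comments assume an implementation that follows the quoted wording:

```shell
# Hypothetical helper: $1 is an ERE, the remaining arguments are input
# lines; prints how many of those lines the ERE matches.
# grep exits 1 when nothing matches, which is not an error here.
count_ere() {
    re=$1
    shift
    printf '%s\n' "$@" | grep -E -c -e "$re" || true
}

count_ere 'ef$'   'abcdef' 'cdefab'   # prints 1: only abcdef ends in ef
count_ere '(ef$)' 'abcdef' 'cdefab'   # prints 1: same, inside a subexpression
count_ere 'e$f'   'e$f' 'ef'          # prints 0: valid, but can never match
count_ere 'e\$f'  'e$f' 'ef'          # prints 1: escaped '$' is literal
```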

That ERE specification seems odd - why interpret `$` in a way that's
different from BREs and results in a regexp that can never match
anything instead of simply treating it as literal, same as BREs do?

Does anyone have any insight into why a `$` mid-regexp is treated that
way in EREs?

Ed.

Oğuz

Aug 18, 2022, 2:29:26 PM

On 8/18/22 4:27 PM, Ed Morton wrote:
> That ERE specification seems odd - why interpret `$` in a way that's
> different from BREs and results in a regexp that can never match
> anything instead of simply treating it as literal, same as BREs do?

The standard simply documents existing practice here. Under XRAT A.9.3.8
it says:

> The ability of '^', '$', and '*' to be non-special in certain circumstances may be confusing to
> some programmers, but this situation was changed only in a minor way from historical practice
> to avoid breaking many historical scripts. Some consideration was given to making the use of
> the anchoring characters undefined if not escaped and not at the beginning or end of strings.
> This would cause a number of historical BREs, such as "2^10", "$HOME", and "$1.35", that
> relied on the characters being treated literally, to become invalid.
> ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped
> anchor character has never matched its literal counterpart outside a bracket expression. Some
> implementations treated "foo$bar" as a valid expression that never matched anything; others
> treated it as invalid. POSIX.1-202x mandates the former, valid unmatched behavior.

Ed Morton

Aug 18, 2022, 2:44:21 PM

On 8/18/2022 1:29 PM, Oğuz wrote:
> On 8/18/22 4:27 PM, Ed Morton wrote:
>> That ERE specification seems odd - why interpret `$` in a way that's
>> different from BREs and results in a regexp that can never match
>> anything instead of simply treating it as literal, same as BREs do?
>
> The standard simply documents existing practice here. Under XRAT A.9.3.8
> it says:

OK, I see that at:

https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_08

>
>> The ability of '^', '$', and '*' to be non-special in certain
>> circumstances may be confusing to
>> some programmers, but this situation was changed only in a minor way
>> from historical practice
>> to avoid breaking many historical scripts. Some consideration was
>> given to making the use of
>> the anchoring characters undefined if not escaped and not at the
>> beginning or end of strings.
>> This would cause a number of historical BREs, such as "2^10", "$HOME",
>> and "$1.35", that
>> relied on the characters being treated literally, to become invalid.
>> ERE anchoring has been different from BRE anchoring in all historical
>> systems. An unescaped
>> anchor character has never matched its literal counterpart outside a
>> bracket expression. Some
>> implementations treated "foo$bar" as a valid expression that never
>> matched anything; others
>> treated it as invalid. POSIX.1-202x mandates the former, valid
>> unmatched behavior.

Them discussing `foo$bar` in that article when I used that exact string
in my question is quite a coincidence!

Thanks,

Ed.

Helmut Waitzmann

Aug 18, 2022, 3:47:54 PM

Ed Morton <morto...@gmail.com>:

>When I write a regexp that has a `$` in the middle of it I write it
>as either of:
>
> sed 's/foo\$bar/stuff/'
> sed 's/foo[$]bar/stuff/'
>
>so that it's clear the `$` should be treated literally. Given that,
>I've never noticed before

So do I.

I see it the other way round: it's the BRE specification that seems
odd. Why should an unescaped, unbracketed dollar sign be treated
specially only when it's at the end of an expression? (That
constraint does no harm, though, because a dollar sign that is meant
literally can be escaped or bracketed even when it's not at the end
of an expression.)

>Does anyone have any insight into why a `$` mid-regexp is treated
>that way in EREs?

I don't know, but I guess it's the principle “Keep it simple”: A
dollar sign is special unless it is either inside a bracket
expression or preceded by a quoting backslash. There is no
additional constraining rule “unless it is inside of an expression”,
because no such rule is needed to get dollar signs with their
literal meaning into regular expressions.
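
As a rough illustration of that point, escaping or bracketing the dollar sign gives the literal meaning in both dialects. `try_all` is just a made-up wrapper, and `-E` is assumed to be supported by the local sed:

```shell
# Run the same substitution as a BRE (no flag) and as an ERE (-E),
# once with an escaped '$' and once with a bracketed '$'.
try_all() {
    for flag in '' -E; do
        for re in 'foo\$bar' 'foo[$]bar'; do
            # $flag is deliberately unquoted so that the empty value
            # vanishes instead of becoming an empty argument.
            printf '%s\n' 'foo$bar' | sed $flag "s/$re/stuff/"
        done
    done
}

try_all    # prints "stuff" four times
```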

And I guess with BREs it was too late to abandon that constraining
rule without breaking existing utilities.

But that all is just a guess.

Ed Morton

Aug 18, 2022, 4:40:49 PM

Right, but there are other characters that are treated as metachars based
on context, e.g. `}`, `]`, and `)` are only metachars if they succeed
`{`, `[`, and `(` respectively, otherwise they're literal, and `^`, `-`
and `]` mean different things depending on where they appear inside a
bracket expression, so it wouldn't be much of a leap to make the meaning
of `^` and `$` outside of a bracket expression context-sensitive too.
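
A small sketch of the bracket-expression cases, for anyone who wants to try them. `count` is a made-up helper, not a standard utility:

```shell
# Hypothetical helper: $1 is a BRE, the remaining arguments are input
# lines; prints how many of those lines match.
# grep exits 1 when nothing matches, which is not an error here.
count() {
    re=$1
    shift
    printf '%s\n' "$@" | grep -c -e "$re" || true
}

count '[^a]'  'a'    # prints 0: a leading ^ negates the list
count '[a^]'  '^'    # prints 1: ^ anywhere else in the list is literal
count '[a-c]' 'b'    # prints 1: - between two characters is a range
count '[ac-]' 'b'    # prints 0: a trailing - is literal, so b is not listed
count '[]a]'  ']'    # prints 1: ] first in the list is literal
```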

Ed.

Kaz Kylheku

Aug 19, 2022, 1:34:21 PM

On 2022-08-18, Ed Morton <morto...@gmail.com> wrote:
> When I write a regexp that has a `$` in the middle of it I write it as
> either of:
>
> sed 's/foo\$bar/stuff/'
> sed 's/foo[$]bar/stuff/'

Because $ can have the special anchoring meaning, even when
it occurs in the middle of a regex:

$ grep -E 'abc$|def'

matches lines ending in abc, or containing def.

To get the behavior you want, the exact rule would have to be
rooted in the abstract syntax: that a $ which has a right
sibling in the syntax tree becomes automatically literal.
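
The grep command above reads from standard input, so here is a self-contained version with sample input. The three input lines and the `alt_demo` wrapper are made up for illustration:

```shell
# '$' sits in the middle of the regex text but at the end of the
# left branch of the alternation, so it still anchors that branch.
alt_demo() {
    printf '%s\n' 'xabc' 'abcx' 'xdefx' | grep -E 'abc$|def'
}

alt_demo
# prints:
#   xabc     (ends in abc)
#   xdefx    (contains def)
# abcx is not printed: it contains abc, but not at the end of the line.
```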

> so that it's clear the `$` should be treated literally. Given that, I've

Treating characters literally or not based on their position in the
syntax is a bad idea in the first place.

For instance, it's a misfeature of some regex implementations that )
becomes literal, without escaping, if it is unmatched, rather than
being flagged as a syntax error.

Consistency is best.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Ed Morton

Aug 19, 2022, 8:30:10 PM

On 8/19/2022 12:34 PM, Kaz Kylheku wrote:
> On 2022-08-18, Ed Morton <morto...@gmail.com> wrote:
>> When I write a regexp that has a `$` in the middle of it I write it as
>> either of:
>>
>> sed 's/foo\$bar/stuff/'
>> sed 's/foo[$]bar/stuff/'
>
> Because $ can have the special anchoring meaning, even when
> it occurs in the middle of a regex:
>
> $ grep -E 'abc$|def'
>
> matches lines ending in abc, or containing def.

All the relevant ERE text actually talks about regexp subexpressions in
the same way the BRE text talks about whole regexps; I just didn't want
to get into that. In the above case the `$` is at the end of a
subexpression, so it is being treated consistently between BREs and
EREs in that regard.

Ed.