
regex confusion -- not matching; think it should?


Linda Walsh

Jun 12, 2013, 6:53:50 PM
to bug-bash
The trace looks approximately like this:
>>./ifc#137(handle_bonding_ops)> (( 18>3 ))
>>./ifc#138(handle_bonding_ops)> [[ mode=balance-rr 0 =~
^([a-zA-Z][-a-zA-Z0-9_]+)=(.+)[[:space:]]+[a-zA-Z][-a-zA-Z0-9_]+=.+.*$ ]]
>>./ifc#142(handle_bonding_ops)> [[ mode=balance-rr 0 =~ ^([a-zA-Z][-a-zA-Z0-9_]+)=(.+)$ ]]
>>./ifc#145(handle_bonding_ops)> break

(#138 is all 1 line)
---- the source code looks like this:

my id='[a-zA-Z][-a-zA-Z0-9_]+'
while ((${#bond_ops}>3)); do
if [[ $bond_ops =~ ^($id)=(.+)[[:space:]]+$id=.+.*$ ]]; then
...
elif [[ "$bond_ops" =~ ^($id)=(.+)$ ]]; then
...
else break; fi

---
I would think the 2nd match would match it, but no luck...
Note the 2nd source line has double quotes due to testing...
Originally it had no quotes, as I don't believe they are
necessary in this case. Regardless, neither way matches.

So if not obvious, bond_ops has "mode=balance-rr 0" in it.

Thanks...


Linda Walsh

Jun 12, 2013, 8:52:26 PM
to bug-bash


Linda Walsh wrote:
> The trace looks approximately like this:
>>> ./ifc#137(handle_bonding_ops)> (( 18>3 ))
>>> ./ifc#138(handle_bonding_ops)> [[ mode=balance-rr 0 =~
x^-extra space
at least it works now!
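
For anyone following along, here is a minimal sketch of how a stray space can defeat an anchored pattern like the second one above. It assumes the extra space sat at the very start of the value, which the thread does not confirm; `check` and `re` are names made up for this illustration.

```shell
#!/usr/bin/env bash
# Identifier pattern copied from the original post.
id='[a-zA-Z][-a-zA-Z0-9_]+'
# Second regex from the post, stored in a variable (hypothetical name "re").
re="^($id)=(.+)\$"

# check: hypothetical helper that reports whether its argument matches $re.
check() {
  if [[ $1 =~ $re ]]; then
    printf 'match: key=%s value=%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    printf 'no match: %q\n' "$1"
  fi
}

check 'mode=balance-rr 0'    # first character is a letter, so ^ can anchor
check ' mode=balance-rr 0'   # assumed leading space: ^[a-zA-Z] cannot match it
```

Printing the value with `%q` (or `declare -p bond_ops`) makes otherwise invisible whitespace show up in a trace.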

DJ Mills

Jun 19, 2013, 12:29:55 PM
to Linda Walsh, bug-bash
On Wed, Jun 12, 2013 at 6:53 PM, Linda Walsh <ba...@tlinx.org> wrote:

> The trace looks approximately like this:
>
>> ./ifc#137(handle_bonding_ops)> (( 18>3 ))
>>> ./ifc#138(handle_bonding_ops)> [[ mode=balance-rr 0 =~
>>>
>> ^([a-zA-Z][-a-zA-Z0-9_]+)=(.+)[[:space:]]+[a-zA-Z][-a-zA-Z0-9_]+=.+.*$
> ]]
>
>> ./ifc#142(handle_bonding_ops)> [[ mode=balance-rr 0 =~
>>> ^([a-zA-Z][-a-zA-Z0-9_]+)=(.+)$ ]]
>>> ./ifc#145(handle_bonding_ops)> break
>>>
>>
> (#138 is all 1 line)
> ---- the source code looks like this:
>
> my id='[a-zA-Z][-a-zA-Z0-9_]+'
> while ((${#bond_ops}>3)); do
> if [[ $bond_ops =~ ^($id)=(.+)[[:space:]]+$id=.+.*$ ]]; then
> ...
> elif [[ "$bond_ops" =~ ^($id)=(.+)$ ]]; then
> ...
> else break; fi
>
> ---
> I would think the 2nd match would match it, but no luck...
> Note the 2nd source line has double quotes due to testing...
> Originally it had no quotes, as I don't believe they are
> necessary in this case. Regardless, neither way matches.
>
> So if not obvious, bond_ops has "mode=balance-rr 0" in it.
>
> Thanks...
>
>
>
Just FYI, you should be using [[:alpha:]] and [[:alnum:]], as they're safe
for all locales. You can't count on a-z or A-Z unless the locale is C or
POSIX. And no, quotes on the LHS of [[ are not needed, as wordsplitting and
pathname expansion do not occur within the [[ keyword.
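
A minimal sketch of that suggestion, applied to the pattern from the original post. The function name `parse_op` is hypothetical, and this is only one way to write it.

```shell
#!/usr/bin/env bash
# Same identifier pattern, written with POSIX character classes instead of
# a-z / A-Z ranges. The leading '-' inside the bracket expression is literal.
id='[[:alpha:]][-[:alnum:]_]+'

# parse_op: hypothetical helper that splits "key=value" and prints both parts.
parse_op() {
  local bond_ops=$1
  # $bond_ops is left unquoted on purpose: word splitting and pathname
  # expansion are not applied inside [[ ]], so the embedded space is harmless.
  if [[ $bond_ops =~ ^($id)=(.+)$ ]]; then
    printf 'key=%s value=%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

parse_op 'mode=balance-rr 0'
```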

Chris Down

Jun 19, 2013, 12:39:56 PM
to DJ Mills, Linda Walsh, bug-bash
On 20 June 2013 00:29, DJ Mills <daniel...@gmail.com> wrote:
> wordsplitting and pathname expansion do not occur within the
> [[ keyword.

$ > foo
$ [[ foo == * ]] && echo bar
bar

Greg Wooledge

Jun 19, 2013, 12:43:49 PM
to Chris Down, bug-bash
That's pattern matching, which is neither word splitting nor pathname
expansion.

What DJ said was accurate, but perhaps confusing. The *left* hand side
of an = or == or =~ operator inside [[...]] doesn't need to be quoted,
because no special things are done on the left hand side. There are
potentially some special interpretations on the *right* hand side,
but not as many as there would be outside of [[...]].
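
To illustrate the right-hand-side behaviour, a small sketch follows. It assumes bash 3.2 or later with default compatibility settings, where quoting the right side of `=~` makes it a literal string; `demo_rhs` is a made-up name.

```shell
#!/usr/bin/env bash
# demo_rhs: hypothetical helper showing how quoting changes the right-hand side.
demo_rhs() {
  local s='foo' re='^f.o$'

  # == : an unquoted right side is a pattern, a quoted one is a literal string.
  [[ $s == f* ]]     && echo 'unquoted f*: pattern, matches foo'
  [[ $s == "f*" ]]   || echo 'quoted f*: literal, does not match foo'
  [[ 'f*' == "f*" ]] && echo 'quoted f*: equals only the string f*'

  # =~ : an unquoted right side is a regex, a quoted one is a literal string.
  [[ $s =~ $re ]]    && echo 'unquoted regex: matches foo'
  [[ $s =~ "$re" ]]  || echo 'quoted regex: literal, does not match foo'
}

demo_rhs
```

In every case the left side needs no quotes; only the right side changes meaning when quoted.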

Chris Down

Jun 19, 2013, 12:56:26 PM
to Greg Wooledge, bug-bash
On 20 June 2013 00:43, Greg Wooledge <woo...@eeg.ccf.org> wrote:
> On Thu, Jun 20, 2013 at 12:39:56AM +0800, Chris Down wrote:
>> On 20 June 2013 00:29, DJ Mills <daniel...@gmail.com> wrote:
>> > wordsplitting and pathname expansion do not occur within the
>> > [[ keyword.
>>
>> $ > foo
>> $ [[ foo == * ]] && echo bar
>> bar
>
> That's pattern matching, which is neither word splitting nor pathname
> expansion.

Interesting, that's a misconception about RHS matching I've held for a
while now then. Thanks for pointing it out.

DJ Mills

Jun 19, 2013, 1:45:54 PM
to Chris Down, Greg Wooledge, bug-bash
You can easily see that by doing the same thing without the file. It'll
still be true.
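
A sketch of that check, run in a freshly created empty directory. It assumes default shell options (`nullglob` and `failglob` off); `empty_dir_demo` is a hypothetical name.

```shell
#!/usr/bin/env bash
# empty_dir_demo: runs in a subshell, so the cd does not affect the caller.
empty_dir_demo() (
  tmp=$(mktemp -d) || exit 1
  cd "$tmp" || exit 1

  # Inside [[ ]] the * is a pattern, so this is true even with no files around.
  [[ foo == * ]] && echo 'pattern match succeeded with no files present'

  # Outside [[ ]] the same * undergoes pathname expansion; with nothing to
  # match (and nullglob off) it is left as a literal asterisk.
  set -- *
  echo "outside [[ ]], the glob expanded to: $1"

  cd / && rmdir "$tmp"
)

empty_dir_demo
```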

Linda Walsh

Jun 19, 2013, 1:53:02 PM
to DJ Mills, bug-bash


DJ Mills wrote:
> On Wed, Jun 12, 2013 at 6:53 PM, Linda Walsh <ba...@tlinx.org
> <mailto:ba...@tlinx.org>> wrote:
>
> The trace looks approximately like this:
>
> ./ifc#137(handle_bonding_ops)> (( 18>3 ))
> ./ifc#138(handle_bonding_ops)> [[ mode=balance-rr 0 =~
>
>
> ^([a-zA-Z][-a-zA-Z0-9_]+)=(.+)[[:space:]]+[a-zA-Z][-a-zA-Z0-9_]+=.+.*$
> ]]
>
> ./ifc#142(handle_bonding_ops)> [[ mode=balance-rr 0 =~
> ^([a-zA-Z][-a-zA-Z0-9_]+)=(.+)$ ]]
> ./ifc#145(handle_bonding_ops)> break
>
>
> (#138 is all 1 line)
> ---- the source code looks like this:
>
> my id='[a-zA-Z][-a-zA-Z0-9_]+'
> while ((${#bond_ops}>3)); do
> if [[ $bond_ops =~ ^($id)=(.+)[[:space:]]+$id=.+.*$ ]]; then
> ...
> elif [[ "$bond_ops" =~ ^($id)=(.+)$ ]]; then
> ...
> else break; fi
>
> ---
> I would think the 2nd match would match it, but no luck...
> Note the 2nd source line has double quotes due to testing...
> Originally it had no quotes, as I don't believe they are
> necessary in this case. Regardless, neither way matches.
>
> So if not obvious, bond_ops has "mode=balance-rr 0" in it.
>
> Thanks...
>
>
>
> Just FYI, you should be using [[:alpha:]] and [[:alnum:]], as they're
> safe for all locales.
----
Since this is for the Linux kernel, I'd be better off
just using locale=C. As for the quotes, see the 2nd and 3rd lines
of the bottom paragraph... ;-)


> You can't count on a-z or A-Z unless the locale is
> C or POSIX. And no, quotes on the LHS of [[ are not needed, as

Dan Douglas

Jun 19, 2013, 7:12:57 PM
to bug-...@gnu.org, DJ Mills, Linda Walsh
On Wednesday, June 19, 2013 12:29:55 PM DJ Mills wrote:
> On Wed, Jun 12, 2013 at 6:53 PM, Linda Walsh <ba...@tlinx.org> wrote:
>
> > The trace looks approximately like this:
> >
> >> ./ifc#137(handle_bonding_ops)> (( 18>3 ))
> >>> ./ifc#138(handle_bonding_ops)> [[ mode=balance-rr 0 =~
> >>>
> >> ^([a-zA-Z][-a-zA-Z0-9_]+)=(.+)[[:space:]]+[a-zA-Z][-a-zA-Z0-9_]+=.+.*$
> > ]]
> >
> >> ./ifc#142(handle_bonding_ops)> [[ mode=balance-rr 0 =~
> >>> ^([a-zA-Z][-a-zA-Z0-9_]+)=(.+)$ ]]
> >>> ./ifc#145(handle_bonding_ops)> break
> >>>
> >>
> > (#138 is all 1 line)
> > ---- the source code looks like this:
> >
> > my id='[a-zA-Z][-a-zA-Z0-9_]+'
> > while ((${#bond_ops}>3)); do
> > if [[ $bond_ops =~ ^($id)=(.+)[[:space:]]+$id=.+.*$ ]]; then
> > ...
> > elif [[ "$bond_ops" =~ ^($id)=(.+)$ ]]; then
> > ...
> > else break; fi
> >
> > ---
> > I would think the 2nd match would match it, but no luck...
> > Note the 2nd source line has double quotes due to testing...
> > Originally it had no quotes, as I don't believe they are
> > necessary in this case. Regardless, neither way matches.
> >
> > So if not obvious, bond_ops has "mode=balance-rr 0" in it.
> >
> > Thanks...
> >
> >
> >
> Just FYI, you should be using [[:alpha:]] and [[:alnum:]], as they're safe
> for all locales. You can't count on a-z or A-Z unless the locale is C or
> POSIX. And no, quotes on the LHS of [[ are not needed, as wordsplitting and
> pathname expansion do not occur within the [[ keyword.

Where does it say that you can count on [[:alpha:]] being the same in non-POSIX
locales? I see it defined for the POSIX locale.

Thanks to mksh, posh, etc not supporting POSIX character classes at all, I'm
not so sure it's actually better in practice. (talking about standard shell
pattern matching of course)

--
Dan Douglas

Greg Wooledge

Jun 20, 2013, 8:09:38 AM
to bug-...@gnu.org
On Wed, Jun 19, 2013 at 06:12:57PM -0500, Dan Douglas wrote:
> Thanks to mksh, posh, etc not supporting POSIX character classes at all, I'm
> not so sure it's actually better in practice. (talking about standard shell
> pattern matching of course)

I'm fairly sure nobody on the entire planet uses those shells except
their authors and you.

Now, since this is a bash mailing list, it's reasonable to talk about
bash. If you're writing a script in bash, you MUST NOT use the [a-z]
or [A-Z] ranges, or any other alphabetic ranges, unless you are
working in the POSIX locale. If you use an alphabetic range in any
other locale, you invite disaster.

Here is disaster:

imadev:~$ echo Hello World | tr A-Z a-z
h�MM� w�SM�

That is why you MUST NOT use alphabetic ranges in non-POSIX locales.
Here's how you SHOULD do it:

imadev:~$ echo Hello World | LANG=C tr A-Z a-z
hello world
imadev:~$ echo Hello World | tr '[:upper:]' '[:lower:]'
hello world

The latter is preferred if there is any chance you are working with
non-ASCII letters, as it will handle them:

imadev:~$ echo �bc | tr '[:upper:]' '[:lower:]'
�bc

In the POSIX locale, � isn't part of A-Z, so it is not matched:

imadev:~$ echo �bc | LANG=C tr A-Z a-z
�bc
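
The two approaches above could be wrapped up roughly as follows. This is a sketch with made-up function names, and it uses `LC_ALL` rather than `LANG` on the assumption that the caller's environment may already set `LC_ALL` or `LC_COLLATE`, either of which takes precedence over `LANG`.

```shell
#!/usr/bin/env bash
# to_lower_ascii: hypothetical helper. Forces the C locale for this one
# command, so A-Z and a-z mean exactly the 26 ASCII letters.
to_lower_ascii() { LC_ALL=C tr 'A-Z' 'a-z'; }

# to_lower_locale: hypothetical helper. Uses character classes, so the
# current locale decides which characters count as upper and lower case.
to_lower_locale() { tr '[:upper:]' '[:lower:]'; }

echo 'Hello World' | to_lower_ascii
echo 'Hello World' | to_lower_locale
```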

Dan Douglas

Jun 21, 2013, 3:39:14 AM
to Greg Wooledge, bug-...@gnu.org
On Thu, Jun 20, 2013 at 7:09 AM, Greg Wooledge <woo...@eeg.ccf.org> wrote:
> On Wed, Jun 19, 2013 at 06:12:57PM -0500, Dan Douglas wrote:
>> Thanks to mksh, posh, etc not supporting POSIX character classes at all, I'm
>> not so sure it's actually better in practice. (talking about standard shell
>> pattern matching of course)
>
> I'm fairly sure nobody on the entire planet uses those shells except
> their authors and you.

I'm talking about the entire family of pdksh-derived shells. mksh
ships with Android. oksh on openbsd. pdksh on SUA / interix. I'm sure
some use posh for testing. Collectively I'd say they're at least as
significant as dash, probably more.

> Now, since this is a bash mailing list, it's reasonable to talk about
> bash. If you're writing a script in bash, you MUST NOT use the [a-z]
> or [A-Z] ranges, or any other alphabetic ranges, unless you are
> working in the POSIX locale. If you use an alphabetic range in any
> other locale, you invite disaster.

I can't reproduce this on a GNU system using en_US.UTF-8

Are you saying this because certain implementations tend to behave
this way, or because it's implied by the spec? I'd assume this has
more to do with your C library than to do with Bash specifically.
According to POSIX the character ranges look just as bad as the
character classes. There's even text which says implementations may
offer extensions that do not even include those characters required
for the C locale, and I don't see anything that says what should occur
for non-POSIX locales.

Greg Wooledge

Jun 21, 2013, 8:00:47 AM
to Dan Douglas, bug-...@gnu.org
On Fri, Jun 21, 2013 at 02:39:14AM -0500, Dan Douglas wrote:
> > If you're writing a script in bash, you MUST NOT use the [a-z]
> > or [A-Z] ranges, or any other alphabetic ranges, unless you are
> > working in the POSIX locale. If you use an alphabetic range in any
> > other locale, you invite disaster.
>
> I can't reproduce this on a GNU system using en_US.UTF-8

Yes, *some* implementations go out of their way to try to make [a-z]
work in the "intuitive" way. But if you want your script to be portable,
you can't rely on that.
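
One way to see what a given system does is to probe it. The sketch below (with a hypothetical `probe_range` function) tests a few uppercase letters against `[a-z]` in bash pattern matching, first in the current locale and then with the C locale forced. The result is expected to vary with the C library, the bash version, and shell options such as `globasciiranges`, so no particular output is promised for the first call.

```shell
#!/usr/bin/env bash
# probe_range: hypothetical helper. Reports whether [a-z] matches any of a
# few uppercase letters under the locale in effect when it is called.
probe_range() {
  local c hits=''
  for c in A B Y Z; do
    [[ $c == [a-z] ]] && hits+=$c
  done
  if [[ -n $hits ]]; then
    echo "[a-z] also matched: $hits"
  else
    echo "[a-z] matched lowercase only"
  fi
}

probe_range                   # current locale: result depends on the system
( LC_ALL=C; probe_range )     # C locale forced in a subshell
```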

> Are you saying this because certain implementations tend to behave
> this way, or because it's implied by the spec?

Because real computers behave in the way I demonstrated. "imadev"
is the workstation sitting under my desk. It's what I'm typing this
email on right now. It runs HP-UX 10.20.

There are a very large number of old HP-UX 10.20 and 11.11 machines
in the world.

> I'd assume this has
> more to do with your C library than to do with Bash specifically.

True.
