basic regexp with {0}

John Doe

Jul 14, 2015, 7:40:50 AM
I'm going through regular expressions and found in man grep that for
basic regexp {m} is repetition operator and means that "preceding item
is matched exactly m times".

This matches:
$ echo apple | grep -G "apples\{0\}"
apple

But this matches too:
$ echo apples | grep -G "apples\{0\}"
apples

I realise that examples might look a bit silly but nevertheless I
understand that in first case it matches when ending s is exactly 0
times (=not present at all) as in 'apple'. But in second example there
is an ending 's' and it also matches. I don't quite understand why... Could
you give me some hint?

John


John Doe

Jul 14, 2015, 8:15:08 AM
Also, I have some issues going through examples from the book "Beginning
Portable Shell Scripting", for example:

[...]
the expression ba\(na\)* can match ba, bana, banana or bananana but it
cannot match banan
[...]

But when I do echo 'banan' | grep -G 'ba\(na\)*' it does match :(

What am I doing wrong?

John



Kaz Kylheku

Jul 14, 2015, 8:22:18 AM
On 2015-07-14, John Doe <john.doe@notpresent> wrote:
> I'm going through regular expressions and found in man grep that for
> basic regexp {m} is repetition operator and means that "preceding item
> is matched exactly m times".
>
> This matches:
> $ echo apple | grep -G "apples\{0\}"
> apple

grep *searches* the input for a matching substring.

If you want to use grep for testing whether a string is in the
language described by the regex, anchor the regex with ^ and $.
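
The difference between searching and whole-line matching can be sketched as follows. This is only an illustration: the two helper names are made up, the sample words come from this thread, and `-x` (the POSIX whole-line option) is used as an alternative to writing `^` and `$` by hand.

```shell
# in_language PATTERN STRING (hypothetical helper)
# Succeeds only if the whole STRING matches the BRE PATTERN.
# -x is the POSIX option for whole-line matching.
in_language() {
    printf '%s\n' "$2" | grep -q -x -e "$1"
}

# contains_match PATTERN STRING (hypothetical helper)
# Succeeds if any substring of STRING matches PATTERN (plain grep search).
contains_match() {
    printf '%s\n' "$2" | grep -q -e "$1"
}

# Compare both behaviours for a few sample words.
for word in ba bana banan banana; do
    if contains_match 'ba\(na\)*' "$word"; then s=yes; else s=no; fi
    if in_language 'ba\(na\)*' "$word"; then l=yes; else l=no; fi
    printf '%-7s search=%s whole-line=%s\n' "$word" "$s" "$l"
done
```

Of the four words, "banan" is the only one for which the two checks should disagree: it contains "bana", but the word as a whole is not in the language.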

Janis Papanagnou

Jul 14, 2015, 8:39:47 AM
On 14.07.2015 14:15, John Doe wrote:
> On 14.07.2015 13:40, John Doe wrote:
>> I'm going through regular expressions and found in man grep that for
>> basic regexp {m} is repetition operator and means that "preceding item
>> is matched exactly m times".
>>
>> This matches:
>> $ echo apple | grep -G "apples\{0\}"
>> apple
>>
>> But this matches too:
>> $ echo apples | grep -G "apples\{0\}"
>> apples

Yes, the part /apple/ in the input string "apples" is matched by the grep
regexp, where the /s/ part of the regexp is not effective, because of
the 0-repetition part.

>>
>> I realise that examples might look a bit silly but nevertheless I
>> understand that in first case it matches when ending s is exactly 0
>> times (=not present at all) as in 'apple'. But in second example there
>> is an ending 's' and it also matches. I don't quite understand why... Could
>> you give me some hint?
>
> Also, I have some issues going through examples from the book "Beginning
> Portable Shell Scripting", for example:
>
> [...]
> the expression ba\(na\)* can match ba, bana, banana or bananana but it cannot
> match banan
> [...]
>
> But when I do echo 'banan' | grep -G 'ba\(na\)*' it does match :(

As above; because the /ba/ (and the /bana/) would already match the
respective part of the (sub-)string, "ba" (or resp. "bana").

>
> What am I doing wrong?

Nothing wrong with what you're doing. Just try to understand what the
pattern actually does in the input string sequence.

Janis

>
> John
>
>
>

John Doe

Jul 14, 2015, 9:15:03 AM
On 14.07.2015 14:39, Janis Papanagnou wrote:
> On 14.07.2015 14:15, John Doe wrote:
>>> But this matches too:
>>> $ echo apples | grep -G "apples\{0\}"
>>> apples
>
> Yes, the part /apple/ in the input string "apples" is matched by the grep
> regexp, where the /s/ part of the regexp is not effective, because of
> the 0-repetition part.

Now I anchored the regexp like this:

echo 'apples' | grep -G '^apples\{0\}$'

Shouldn't it match only 'apple' line now?

>> Also, I have some issues going through examples from the book "Beginning
>> Portable Shell Scripting", for example:
>>
>> [...]
>> the expression ba\(na\)* can match ba, bana, banana or bananana but it cannot
>> match banan
>> [...]
>>
>> But when I do echo 'banan' | grep -G 'ba\(na\)*' it does match :(
>
> As above; because the /ba/ (and the /bana/) would already match the
> respective part of the (sub-)string, "ba" (or resp. "bana").

So, is my misunderstanding about anchoring the pattern? For example:

echo 'bananananan' | grep -G '^ba\(na\)*$' doesn't match
echo 'bananananana' | grep -G '^ba\(na\)*$' matches

>> What am I doing wrong?
>
> Nothing wrong with what you're doing. Just try to understand what the
> pattern actually does in the input string sequence.

I'm trying to follow you but have some uncertainty. Is this what you're
saying with the banana example:

echo 'banan' | grep -G 'ba\(na\)*'

The possible pattern(s) are: ba, bana, banana, bananana, ...
The string is banan
Two of the possible patterns match banan
Result is that banan matches ba\(na\)*

Is this correct?

John

Janis Papanagnou

Jul 14, 2015, 10:33:49 AM
On 14.07.2015 15:14, John Doe wrote:
> On 14.07.2015 14:39, Janis Papanagnou wrote:
>> On 14.07.2015 14:15, John Doe wrote:
>>>> But this matches too:
>>>> $ echo apples | grep -G "apples\{0\}"
>>>> apples
>>
>> Yes, the part /apple/ in the input string "apples" is matched by the grep
>> regexp, where the /s/ part of the regexp is not effective, because of
>> the 0-repetition part.
>
> Now I anchored the regexp like this:
>
> echo 'apples' | grep -G '^apples\{0\}$'
>
> Shouldn't it match only 'apple' line now?

Exactly. And it does so.

>
>>> Also, I have some issues going through examples from the book "Beginning
>>> Portable Shell Scripting", for example:
>>>
>>> [...]
>>> the expression ba\(na\)* can match ba, bana, banana or bananana but it cannot
>>> match banan
>>> [...]
>>>
>>> But when I do echo 'banan' | grep -G 'ba\(na\)*' it does match :(
>>
>> As above; because the /ba/ (and the /bana/) would already match the
>> respective part of the (sub-)string, "ba" (or resp. "bana").
>
> So, is my misunderstanding about anchoring the pattern? For example:
>
> echo 'bananananan' | grep -G '^ba\(na\)*$' doesn't match
> echo 'bananananana' | grep -G '^ba\(na\)*$' matches

Now the anchoring requires that the whole line match. The first
example does not, because it has a trailing "n" that is not part of the
pattern, while the second sample fulfills the pattern. Note that without
anchoring, /ba\(na\)*/ would already match at "ba", whereas with
anchoring the pattern must match all the way to the end of the line,
i.e. the repetitions must be completely satisfied, and a spurious final
"n" will spoil the match.

>
>>> What am I doing wrong?
>>
>> Nothing wrong with what you're doing. Just try to understand what the
>> pattern actually does in the input string sequence.
>
> I'm trying to follow you but have some uncertainty. Is this what you're saying
> with the banana example:
>
> echo 'banan' | grep -G 'ba\(na\)*'
>
> The possible pattern(s) are: ba, bana, banana, bananana, ...

The possible _matches_ of the pattern expression in the string are ba,
bana, banana, bananana, etc., yes.

(As an aside: regexp matchers typically try for the longest match, but
that is not important for the considerations here.)

> The string is banan
> Two of the possible patterns match banan

There are two possibilities for a match of the pattern given the string
"banan"; "ba" and "bana".

Maybe it's easier to understand when coming from the theory of formal
languages; the pattern expression, say 'ba\(na\)*' defines a language
consisting of the words "ba", "bana", "banana", etc., and the matching
parser searches for a substring sequence of characters that is defined
in that language. IOW, it's not important in that context whether the
whole string is part of the language defined by the pattern expression.
But, as previously illustrated, you may control some aspects by anchors.

> Result is that banan matches ba\(na\)*
>
> Is this correct?

With the slight adjustments of semantics; of What matches What. (YMMV)

It's confusing that the terminology does not seem consistent in the literature.
(Not even within a single book; e.g., Robbins' "awk Programming" book
says where the match operator '~' is defined, that in 'exp ~ /regexp/'
"exp (taken as a string) matches regexp". But in the glossary "Regexp"
it is said that "the regexp ... matches any string ...". [3rd Edition])

Janis

>
> John
>

John Doe

Jul 14, 2015, 10:47:01 AM
On 14.07.2015 16:33, Janis Papanagnou wrote:
> On 14.07.2015 15:14, John Doe wrote:
>> Now I anchored the regexp like this:
>>
>> echo 'apples' | grep -G '^apples\{0\}$'
>>
>> Shouldn't it match only 'apple' line now?
>
> Exactly. And it does so.

Damn.. It matches also 'apples':

$ echo 'apple' | grep -G '^apples\{0\}$'
apple
$ echo 'apples' | grep -G '^apples\{0\}$'
apples

I'm still missing something :) But when I do this:

$ echo 'apple' | sed 's/^apples\{0\}$/x/'
x
$ echo 'apples' | sed 's/^apples\{0\}$/x/'
apples

So it looks like grep matches apples against ^apples\{0\}$ but sed
doesn't. Help! :)

> It's confusing that the terminology seems not consistent in literature.
> (Not even in a single book; e.g., the Robbin's "awk Programming" Book
> says where the match operator '~' is defined, that in 'exp ~ /regexp/'
> "exp (taken as a string) matches regexp". But in the glossary "Regexp"
> it is said that "the regexp ... matches any string ...". [3rd Edition])

Thank you!

John

Janis Papanagnou

Jul 14, 2015, 10:58:38 AM
On 14.07.2015 16:46, John Doe wrote:
> On 14.07.2015 16:33, Janis Papanagnou wrote:
>> On 14.07.2015 15:14, John Doe wrote:
>>> Now I anchored the regexp like this:
>>>
>>> echo 'apples' | grep -G '^apples\{0\}$'
>>>
>>> Shouldn't it match only 'apple' line now?
>>
>> Exactly. And it does so.
>
> Damn.. It matches also 'apples':
>
> $ echo 'apple' | grep -G '^apples\{0\}$'
> apple
> $ echo 'apples' | grep -G '^apples\{0\}$'
> apples

Hmm.. - not on my system.

$ echo 'apples' | grep -G '^apples\{0\}$'
# no output

$ echo 'apple' | grep -G '^apples\{0\}$'
apple

Since the /s\{0\}/ is redundant you can omit it; what does your system
return on a

echo ... | grep -G '^apple$'

It should certainly *not* match "apples".

>
> I'm still missing something :) But when I do this:
>
> $ echo 'apple' | sed 's/^apples\{0\}$/x/'
> x
> $ echo 'apples' | sed 's/^apples\{0\}$/x/'
> apples
>
> So it looks like grep matches apples against ^apples\{0\}$ but sed doesn't.
> Help! :)

In principle (if the regexp syntax is equivalent) grep should not work
differently from sed. (And, as I'd expect it, it does not on my system.)

Janis

John Doe

Jul 14, 2015, 11:13:54 AM
On 14.07.2015 16:58, Janis Papanagnou wrote:
> Since the /s\{0\}/ is redundant you can omit it; what does your system
> return on a
>
> echo ... | grep -G '^apple$'
>
> It should certainly *not* match "apples".

It doesn't:

$ echo 'apple' | grep -G '^apple$'
apple
$ echo 'apples' | grep -G '^apple$'
$

>> I'm still missing something :) But when I do this:
>>
>> $ echo 'apple' | sed 's/^apples\{0\}$/x/'
>> x
>> $ echo 'apples' | sed 's/^apples\{0\}$/x/'
>> apples
>>
>> So it looks like grep matches apples against ^apples\{0\}$ but sed doesn't.
>> Help! :)
>
> In principle (if the regexp syntax is equivalent) grep should not work
> differently from sed. (And, as I'd expect it, it does not on my system.)

My system:

$ grep --version # GNU grep 2.6.3
$ sed --version # GNU sed version 4.2.1

Anyway, it looks like it is necessary to keep eyes open for surprises
like this. Thank you both Janis and Kaz for answers!

John



Janis Papanagnou

Jul 14, 2015, 6:03:28 PM
On 14.07.2015 17:13, John Doe wrote:
>
> My system:
>
> $ grep --version # GNU grep 2.6.3
> $ sed --version # GNU sed version 4.2.1

Mine is:

$ grep --version
grep (GNU grep) 2.10

>
> Anyway, it looks like it is necessary to keep eyes open for surprises like
> this.

Or it could be a (since fixed) bug in grep? You could try out a newer
grep version to tell a possible bug apart from possible environmental effects.

Janis

Ben Bacarisse

Jul 14, 2015, 9:06:26 PM
John Doe <john.doe@notpresent> writes:
<snip>
> This matches:
> $ echo apple | grep -G "apples\{0\}"
> apple
>
> But this matches too:
> $ echo apples | grep -G "apples\{0\}"
> apples

A possibly handy tip that has not yet come up... If you turn colours on
(--color=auto) you will see what is actually being matched.
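
When colour is not practical (for example when output is captured by a script), a similar check can be sketched with `-o`, which prints only the matched part of each line. Like `--color`, `-o` is an extension (GNU and BSD grep have it) and not POSIX. The helper name below is made up for illustration.

```shell
# show_match PATTERN STRING (hypothetical helper)
# Prints only the part of STRING that grep actually matched.
# -o is a GNU/BSD extension, not a POSIX option.
show_match() {
    printf '%s\n' "$2" | grep -o -e "$1"
}

# The matched part is shorter than the input line in both cases.
show_match 'ba\(na\)*' banan
show_match 'apples\{0\}' apples
```

With a recent GNU grep this should print "bana" and "apple". An older grep affected by a \{0\} bug may print something else for the second line.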

<snip>
--
Ben.

Joep van Delft

Jul 15, 2015, 3:58:06 AM
On Wed, 15 Jul 2015 00:03:21 +0200
Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> On 14.07.2015 17:13, John Doe wrote:
> >
> > My system:
> >
> > $ grep --version # GNU grep 2.6.3
> > $ sed --version # GNU sed version 4.2.1
>
> Mine is:
>
> $ grep --version
> grep (GNU grep) 2.10

You do realize that GNU grep 2.6.3 is five years behind on bug fixes
(and GNU grep 2.10 four years)?

I did not test it, but this entry from NEWS for 2.7 seems relevant:

X{0,0} is implemented correctly. It used to be a synonym of X{0,1}.
[bug present since "the beginning"] [1]

[1] http://git.savannah.gnu.org/cgit/grep.git/tree/NEWS#n496
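
A quick check for whether a given grep shows this behaviour might look like the sketch below. It relies only on the symptom reported in this thread (the anchored pattern wrongly matching "apples"); the function name is made up, and it has not been run against grep 2.6.3.

```shell
# grep_has_zero_repeat_bug (hypothetical helper)
# Succeeds if this grep lets ^apples\{0\}$ match "apples",
# i.e. behaves as if \{0\} were \{0,1\}.
grep_has_zero_repeat_bug() {
    printf '%s\n' apples | grep -q '^apples\{0\}$'
}

if grep_has_zero_repeat_bug; then
    printf '%s\n' 'this grep treats \{0\} like \{0,1\}; a newer version may fix it'
else
    printf '%s\n' 'this grep handles \{0\} as expected'
fi
```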

Kind regards,

Joep


John Doe

Jul 15, 2015, 4:24:28 AM
On 15.07.2015 09:58, Joep van Delft wrote:
> You do realize that GNU grep 2.6.3 is five years behind on bug fixes
> (and GNU grep 2.10 four years)?
>
> I did not test it, but this entry from NEWS for 2.7 seems relevant:
>
> X{0,0} is implemented correctly. It used to be a synonym of X{0,1}.
> [bug present since "the beginning"] [1]

No, I didn't realise :/ I tested the examples on another machine where
grep is 2.16 and they worked fine. I have Centos 6.6.

Thank you for this tip.

John

Janis Papanagnou

Jul 15, 2015, 5:54:09 AM
On 15.07.2015 09:58, Joep van Delft wrote:
> On Wed, 15 Jul 2015 00:03:21 +0200
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>
>> On 14.07.2015 17:13, John Doe wrote:
>>>
>>> My system:
>>>
>>> $ grep --version # GNU grep 2.6.3
>>> $ sed --version # GNU sed version 4.2.1
>>
>> Mine is:
>>
>> $ grep --version
>> grep (GNU grep) 2.10
>
> You do realize that GNU grep 2.6.3 is five years behind on bug fixes
> (and GNU grep 2.10 four years)?

I use the version that my distro supports, and I certainly don't update
everything myself (with only few exceptions). So, no, I didn't realize.

>
> I did not test it, but this entry from NEWS for 2.7 seems relevant:
>
> X{0,0} is implemented correctly. It used to be a synonym of X{0,1}.
> [bug present since "the beginning"] [1]
>
> [1] http://git.savannah.gnu.org/cgit/grep.git/tree/NEWS#n496

Looks like my suspicion was correct. Thanks.

Janis

>
> Kind regards,
>
> Joep
>
>

Thomas 'PointedEars' Lahn

Jul 24, 2015, 2:12:55 PM
It does not appear to have been emphasized that “-G” is a GNU extension, in
*GNU* grep. From the TeXinfo manual (“[p]info grep”) of version 2.21:

| ‘-G’
| ‘--basic-regexp’
| Interpret the pattern as a basic regular expression (BRE). This is
| the default.

POSIX grep does not have this option; instead, “-e” makes it clear that the
expression is a BRE there:

<http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html>

(Therefore, portable code should not use the “-G” option, but “grep -e”.)

“--color” and “--colour” also are GNU extensions:

| ‘--color[=WHEN]’
| ‘--colour[=WHEN]’
| Surround the matched (non-empty) strings, matching lines, context
| lines, file names, line numbers, byte offsets, and separators (for
| fields and groups of context lines) with escape sequences to
| display them in color on the terminal. The colors are defined by
| the environment variable ‘GREP_COLORS’ and default to
| ‘ms=01;31:mc=01;31:sl=:cx=:fn=35:ln=32:bn=32:se=36’ for bold red
| matched text, magenta file names, green line numbers, green byte
| offsets, cyan separators, and default terminal colors otherwise.
| The deprecated environment variable ‘GREP_COLOR’ is still
| supported, but its setting does not have priority; it defaults to
| ‘01;31’ (bold red) which only covers the color for matched text.
| WHEN is ‘never’, ‘always’, or ‘auto’.

The manual is not clear about this, but “auto” appears to be the default
when WHEN is omitted. For me,

alias grep='grep --color'

in ~/.bash_aliases (which I keep for my aliases and functions for interactive
bash sessions, sourced in ~/.bashrc) has sufficed to date. Now I realize
that I also can set GREP_COLORS to make GNU grep’s output fit my terminal’s
colors better.

An interesting side effect of “--color” and regular expressions that allow
alternation (such as EREs) is that you can use GNU grep (with “-E”, or “\|”
instead of “|”) or (better) GNU egrep to highlight matches *in context*:

egrep --color '…|$' …

*All* lines will be output (because all have an ending, “$”), and matches
will be highlighted.
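
A concrete version of this trick could look like the sketch below. Several things here are assumptions made for illustration: the helper name, the sample input, the use of `grep -E` in place of `egrep`, and `--color=always` so that the highlighting survives a pipe (plain `--color` would normally suppress it when the output is not a terminal).

```shell
# highlight_in_context ERE [FILE...] (hypothetical helper)
# Prints every input line and highlights matches of ERE.
# The "|$" alternative matches the (empty) end of every line, so no
# line is filtered out. --color is a GNU extension.
highlight_in_context() {
    pattern=$1
    shift
    grep -E --color=always -e "$pattern|\$" "$@"
}

# All three lines are printed; "apple" is highlighted where it occurs.
printf '%s\n' 'apple pie' 'cherry tart' 'apple crumble' |
    highlight_in_context 'apple'
```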

(I found this gem somewhere on Stack Overflow.)

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.

Thomas 'PointedEars' Lahn

Jul 24, 2015, 2:14:20 PM
egrep --color -e '…|$' …

Janis Papanagnou

Jul 27, 2015, 5:38:58 AM
On 24.07.2015 20:10, Thomas 'PointedEars' Lahn wrote:
[...]
>
> An interesting side effect of “--color” and regular expressions that allow
> alternation (such as EREs) is that you can use GNU grep (with “-E”, or “\|”
> instead of “|”) or (better) GNU egrep to highlight matches *in context*:
>
> egrep --color -e '…|$' …
>
> *All* lines will be output (because all have an ending, “$”), and matches
> will be highlighted.

This is indeed a nice trick.

Note, though, that for larger files you'd then need to use your window
manager to scroll the text. So, alternatively, to navigate, you could use
a pager like 'less' which also supports regexp searches and highlighting
of the found matches.

Janis

> [...]


Geoff Clare

Jul 27, 2015, 8:41:08 AM
Thomas 'PointedEars' Lahn wrote:

> It does not appear to have been emphasized that “-G” is a GNU extension, in
> *GNU* grep. From the TeXinfo manual (“[p]info grep”) of version 2.21:
>
> | ‘-G’
> | ‘--basic-regexp’
> | Interpret the pattern as a basic regular expression (BRE). This is
> | the default.
>
> POSIX grep does not have this option; instead, “-e” makes it clear that the
> expression is a BRE there:
>
> <http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html>

It says that BRE is the default, not that the -e option-argument is
always a BRE. You can use "grep -E -e ERE" to use an ERE.

> (Therefore, portable code should not use the “-G” option, but “grep -e”.)

POSIX also says that a pattern_list operand "shall be treated as if it
were specified as -e pattern_list."

Therefore portable code does not need to use -e, it just needs to
ensure it does not use -E or -F, or non-standard RE-flavour selection
options like -G or -P.

--
Geoff Clare <net...@gclare.org.uk>

Janis Papanagnou

Jul 27, 2015, 10:53:38 AM
On 27.07.2015 14:36, Geoff Clare wrote:
[ POSIX ]
>
> [...] You can use "grep -E -e ERE" to use an ERE.
>
[...]

> Therefore portable code does not need to use -e, it just needs to
> ensure it does not use -E or -F, [...]

I have difficulties semantically merging those two statements.

Janis

Geoff Clare

Jul 28, 2015, 8:41:09 AM
You need the context of the unquoted middle part in order to make
sense of the last part. Although I could have made it clearer by
saying "Therefore to use a BRE portable code does not need to use -e".

The point is that -e has no effect on whether the search pattern is a
BRE, ERE or fixed string; it is only the -E and -F options (or lack
thereof) which control that.
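
As a small illustration of that point, the sketch below searches the same line with and without `-e`, and then with `-F` and `-E`. The sample input is made up; it is chosen because "+" is an ordinary character in a BRE but a repetition operator in an ERE.

```shell
# One input line, searched with different option combinations.
# Only -E / -F (or their absence) select the flavour; -e does not.
input='a+b'

printf '%s\n' "$input" | grep 'a+b'         # BRE: "+" is literal, matches
printf '%s\n' "$input" | grep -e 'a+b'      # still a BRE, matches
printf '%s\n' "$input" | grep -F -e 'a+b'   # fixed string, matches
# ERE: "a+b" means one or more "a" followed by "b", which is absent here.
printf '%s\n' "$input" | grep -E -e 'a+b' || printf '%s\n' 'ERE: no match'
```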

--
Geoff Clare <net...@gclare.org.uk>

Janis Papanagnou

Jul 28, 2015, 8:44:40 AM
Thanks. I feared there was something subtle I didn't know.

Janis

Thomas 'PointedEars' Lahn

Aug 8, 2015, 6:52:52 AM
The first statement is correct. The second one is not.

Portable code does need to use “-e” because any positional argument that is
not preceded by “-e” but begins with “-” is parsed as an option (on
BSD/MacOS X usually only if it is not preceded by a non-option argument, but
I would not rely on that); if the grep(1) variant does not support that
option, the invocation will fail; if it supports it, anything unwanted may
happen.

However, portable code does not need to use “-e” in order to have the
expression interpreted as a BRE. grep(1) uses BRE by default, and egrep(1)
uses ERE by default. Therefore, portable code should not use “grep -E -e”,
but “egrep -e”.

Thomas 'PointedEars' Lahn

Aug 8, 2015, 6:53:37 AM
Thomas 'PointedEars' Lahn wrote:

> Janis Papanagnou wrote:
>> On 27.07.2015 14:36, Geoff Clare wrote:
>> [ POSIX ]
>>> [...] You can use "grep -E -e ERE" to use an ERE.
>> [...]
>>> Therefore portable code does not need to use -e, it just needs to
>>> ensure it does not use -E or -F, [...]
>> I have difficulties semantically merging those two statements.
>
> The first statement is correct.

… for POSIX and GNU grep.

> The second one is not. […]

Geoff Clare

Aug 10, 2015, 8:41:07 AM
Thomas 'PointedEars' Lahn wrote:

> Portable code does need to use “-e” because any positional argument that is
> not preceded by “-e” but begins with “-” is parsed as an option

That's what the "--" delimiter is for:

grep -- -foo file

will do exactly the same as

grep -e -foo file

The purpose of -e is to allow specifying multiple REs on the command
line. If you only want to give grep one RE, you don't need -e.
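
A minimal sketch of both uses, with made-up sample input (the `lines` helper exists only for this illustration):

```shell
# Sample input containing a line that starts with "-".
lines() { printf '%s\n' '-foo' 'bar' 'baz'; }

# Both forms search for the pattern "-foo" instead of parsing it as options.
lines | grep -- -foo
lines | grep -e -foo

# -e may be repeated to give several patterns; a line is selected
# if it matches any of them.
lines | grep -e -foo -e baz
```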

> grep(1) uses BRE by default, and egrep(1)
> uses ERE by default. Therefore, portable code should not use “grep -E -e”,
> but “egrep -e”.

Depends what you mean by "portable". If you need a script to work on
*very* old systems which don't support grep -E, then egrep is the way
to go, but otherwise grep -E is a better bet because egrep was removed
from POSIX in 2001 (after 9 years of requiring it but calling it
obsolescent).
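
For a script that has to cope with both situations, one possible approach is to probe for `grep -E` and fall back to `egrep` only when the probe fails. This is a sketch under that assumption; the variable name `EGREP` is chosen here for illustration.

```shell
# Prefer "grep -E"; fall back to egrep on systems where -E is unsupported.
if printf '%s\n' ab | grep -E 'a|b' >/dev/null 2>&1; then
    EGREP='grep -E'
else
    EGREP='egrep'
fi

# $EGREP is deliberately unquoted so that "grep -E" splits into two words.
printf '%s\n' banana | $EGREP '^ba(na)+$'
```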

--
Geoff Clare <net...@gclare.org.uk>