
Bug of grep -E


iPack

Dec 6, 2017, 10:14:06 AM
to bug-gnu-utils
[urain39@urain39-pc ~]$ cat test
https://konachan.com/image/a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg

[urain39@urain39-pc ~]$ cat test | grep -Eo '[0-9a-f]{32}/[0-9A-Za-z%_\.\-]+'
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg

[urain39@urain39-pc ~]$ cat test | grep -Eo '[0-9a-f]{32}/[0-9A-Za-z\-%_\.]+'
a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20

Is it a bug, or just my syntax error?

Eric Blake

Dec 6, 2017, 10:33:07 AM
to iPack, bug-gnu-utils
Your syntax error.

In the C locale,

[0-9A-Za-z%_\.\-] matches digits, letters, %, _, \ (listed twice, but
the second listing is ignored), ., and -.

[0-9A-Za-z\-%_\.] matches digits, letters, the range of ASCII bytes
between \ and % (whoops - in ASCII, \ is 92 but % is 37 - you have a
backwards range, so that portion of the range expression matches nothing
at all), then _, \, and . Hence, '-' is not one of the characters
matched, and grep's output is shorter. POSIX permits the implementation
you saw; it also permits an implementation that refuses to grep at all
by declaring your regex invalid because of the backwards range.
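
The byte values behind that backwards range can be checked from the shell. This is only a small sketch, assuming a POSIX printf (where an argument that starts with a quote prints the numeric value of the following character) and an ASCII-based system; the variable name `codes` is just for illustration:

```shell
# Print the numeric codes of backslash and percent.
# A leading single quote in a printf argument asks for the character's code.
codes=$(printf '%d %d\n' "'\\" "'%")
printf '%s\n' "$codes"   # 92 37 on ASCII systems
```

Because the start point (backslash) has a higher code than the end point (percent), the range \-% has nothing between its endpoints.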

In non-C locales, use of - in a [] expression that is not either the
first or the last member of the set is implementation-defined, and all
bets are off on what it matches (lately, GNU tools have been moving
towards rational-range-interpretation, which means treating the range as
the same bytes as it would match in the C locale; but other
implementations, or even older versions of GNU tools, tried to get fancy
and match any character that would collate between the two endpoints,
which gets weird fast).

It _looks_ like you were trying to use \- and \. as escape characters.
But inside [] (at least, the Extended Regular Expression syntax of 'grep
-E' as defined by POSIX), \ is not an escape character; and nothing
needs escaping (there are only special rules about where ], ^, and - are
handled). Yes, there are other flavors of regex engines (perl, for
example) where \ DOES act as an escape even inside []. Which is why it
is essential that you know the quirks of each regex engine you are
targeting.
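
A quick way to see that \ is an ordinary character inside [] is to feed grep a line that contains a backslash but no dot. This is a sketch, assuming a grep that follows the POSIX ERE rules described above; the sample input and the variable name `matched` are made up for illustration:

```shell
# Under POSIX ERE rules, [\.] is a set of two characters: backslash and dot.
# The input line a\b contains a backslash but no dot.
matched=$(printf '%s\n' 'a\b' | LC_ALL=C grep -Eo '[\.]')
printf '%s\n' "$matched"   # a lone backslash when \ is literal inside []
```

An engine that treated \. as an escaped dot would find nothing to match in that line.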

By the way, bug-gnu-utils is no longer the preferred bug reporting
address for grep; it means your version of grep is probably quite
outdated. These days, 'grep --help' suggests bug-...@gnu.org for
reporting bugs.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org


John Cowan

Dec 6, 2017, 10:45:50 AM
to iPack, bug-gnu-utils
Backslash is not an escape in character classes. The only way to get a -
in a character class is to make sure it is at the end or the beginning. So
in your second pattern, the sequence \-% means "every character from
backslash to percent", which is no characters at all.

Bob Proulx

Dec 6, 2017, 1:32:02 PM
to iPack, bug-gnu-utils
iPack wrote:
> [urain39@urain39-pc ~]$ cat test | grep -Eo '[0-9a-f]{32}/[0-9A-Za-z\-%_\.]+'
> a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20
>
> Is it a bug, or just my syntax error?

Recent versions of grep (at least on my Debian system) report this as
an invalid expression for the reasons already noted by others. Here
using your verbatim pattern (with the invalid \- \. escaping):

$ grep -Eo '[0-9a-f]{32}/[0-9A-Za-z\-%_\.]+' /dev/null
grep: Invalid range end

Updating would add this improved validation reporting capability. :-)
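
To find out which behavior a given grep has without reading its output, the exit status can be inspected. This is a rough sketch, assuming the usual convention that grep exits with 1 when nothing matches and with a value greater than 1 on an error such as an invalid pattern; the names `status` and `verdict` are illustrative only:

```shell
# Probe how the installed grep treats the backwards range \-% from the report.
# Exit status 1: pattern accepted, nothing matched in the empty input.
# Exit status above 1: pattern rejected, or some other error occurred.
status=0
LC_ALL=C grep -E '[\-%]' /dev/null >/dev/null 2>&1 || status=$?
case $status in
  1) verdict='accepted' ;;
  *) verdict='rejected' ;;
esac
printf 'backwards range %s (exit status %s)\n' "$verdict" "$status"
```

Either result is permitted by POSIX, so a script that must run on several systems should not depend on one of them.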

Bob
