TRE regex

James K. Lowden

Jun 6, 2016, 9:35:38 PM
Back in 2009, Matthias-Christian Ott ported Ville Laurikari's regex
library, apparently with the intention of replacing the one in base,
originally from Henry Spencer.

http://mail-index.netbsd.org/tech-userlevel/2009/08/03/msg002477.html

What happened? Other than the announcement, I find no discussion about
it. I see it's in pkgsrc, all well and good, but why was the project
undertaken and the work not brought into base?

In case you are feeling complacent about NetBSD's regex, the awk
documentation relies on it, and falls short. Awk claims to implement
regex per egrep(1) -- providing no further description -- but that's
just docurot:

$ echo aaa | egrep 'a{3}' | wc -l
1
$ echo aaa | awk '/a{3}/' | wc -l
0
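
The same comparison can be repeated on any system with a small script.
This is only a sketch: the helper names count_grep and count_awk are
made up for illustration, and the awk figure depends on which awk is
installed (one that supports interval expressions reports 1, one that
does not reports 0).

```shell
#!/bin/sh
# Hypothetical helpers (not part of any tool): count the lines of the
# text in $2 that match the extended regular expression in $1.

count_grep() {
    # grep -c prints the number of matching lines (0 if none).
    printf '%s\n' "$2" | grep -E -c -- "$1"
}

count_awk() {
    # The pattern is handed to awk as a dynamic regex via -v; note that
    # awk processes backslash escapes in -v assignments.
    printf '%s\n' "$2" | awk -v re="$1" '$0 ~ re { n++ } END { print n + 0 }'
}

echo "grep -E: $(count_grep 'a{3}' aaa)"
echo "awk:     $(count_awk 'a{3}' aaa)"
```

If the two numbers differ, the awk in question does not follow the
egrep it is being compared with for this pattern.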

As far as I know, we have 3 regex definitions in base: GNU grep, NetBSD
sed (with regex(3), defined by re_format(7)), and NetBSD awk. It would
be an improvement IMO to use one implementation for all utilities in
base, to make them internally consistent and dependable (and
reproducible), even at the expense of compatibility with GNU's
implementations.

--jkl

Christos Zoulas

Jun 6, 2016, 11:37:57 PM
In article <20160606213525.46a4...@schemamania.org>,
James K. Lowden <tech-us...@netbsd.org> wrote:
>Back in 2009, Matthias-Christian Ott ported Ville Laurikari's regex
>library, apparently with the intention of replacing the one in base,
>originally from Henry Spencer.
>
> http://mail-index.netbsd.org/tech-userlevel/2009/08/03/msg002477.html
>
>What happened? Other than the announcement, I find no discussion about
>it. I see it's in pkgsrc, all well and good, but why was the project
>undertaken and the work not brought into base?

It is in /usr/lib/libtre* and there are unit-tests that check both it
and the Spencer regex.

>In case you are feeling complacent about NetBSD's regex, the awk
>documentation relies on it, and falls short. Awk claims to implement
>regex per egrep(1) -- providing no further description -- but that's
>just docurot:
>
> $ echo aaa | egrep 'a{3}' | wc -l
> 1
> $ echo aaa | awk '/a{3}/' | wc -l
> 0
>

Awk implements its own regex engine (see b.c), but that does not
implement the regex(3) API.

>As far as I know, we have 3 regex definitions in base: GNU grep, NetBSD
>sed (with regex(3), defined by re_format(7)), and NetBSD awk. It would
>be an improvement IMO to use one implementation for all utilities in
>base, to make them internally consistent and dependable (and
>reproducible), even at the expense of compatibility with GNU's
>implementations.

14 (libc, TRE, grep, nvi, cvs, libiberty x 6, gettext, diffutils, less)
15 if you count awk.

christos

David Holland

Jun 7, 2016, 12:31:31 AM
On Mon, Jun 06, 2016 at 09:35:25PM -0400, James K. Lowden wrote:
> As far as I know, we have 3 regex definitions in base: GNU grep, NetBSD
> sed (with regex(3), defined by re_format(7)), and NetBSD awk. It would
> be an improvement IMO to use one implementation for all utilities in
> base, to make them internally consistent and dependable (and
> reproducible), even at the expense of compatibility with GNU's
> implementations.

Too bad awk is a language with its own definition... why not fix Perl
to use standard regexps while you're at it?

--
David A. Holland
dhol...@netbsd.org

Alistair Crooks

Jun 9, 2016, 2:31:36 AM
On 6 June 2016 at 18:35, James K. Lowden <jklo...@schemamania.org> wrote:
> Back in 2009, Matthias-Christian Ott ported Ville Laurikari's regex
> library, apparently with the intention of replacing the one in base,
> originally from Henry Spencer.
>
> http://mail-index.netbsd.org/tech-userlevel/2009/08/03/msg002477.html
>
> What happened? Other than the announcement, I find no discussion about
> it. I see it's in pkgsrc, all well and good, but why was the project
> undertaken and the work not brought into base?

It was brought into base. The USE_LIBTRE definition causes things to
happen in libc if it's defined.

However, tre itself has no accommodation of basic regexps. Basic
regexps are the default in sed. Strange things happen when you attempt
to compile a basic regexp with an implementation expecting an extended
regexp, to the point where build.sh would not complete.
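
A small illustration of that kind of mismatch, using grep only because
it exposes both syntaxes from the command line (plain grep takes a
basic regexp, grep -E an extended one). The pattern and the helper
names are invented for the example; it says nothing about what tre or
build.sh actually do.

```shell
#!/bin/sh
# Hypothetical helpers: count lines of $2 matching $1 as a BRE or an ERE.
match_bre() { printf '%s\n' "$2" | grep -c -- "$1"; }
match_ere() { printf '%s\n' "$2" | grep -E -c -- "$1"; }

pattern='^a+$'

# As an ERE, '+' means "one or more", so the pattern matches "aaa".
echo "ERE on aaa: $(match_ere "$pattern" aaa)"
# As a BRE, '+' is an ordinary character, so the same pattern matches
# only the literal string "a+".
echo "BRE on aaa: $(match_bre "$pattern" aaa)"
echo "BRE on a+:  $(match_bre "$pattern" 'a+')"
```

The same pattern text selects different lines depending on which
syntax the compiler assumes, with no error reported in either case.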

I do have a partial fix for that - take a look at the recently-added
regextend(3) in othersrc/external/bsd - but until I've finished
bringing that into libc, tre-based regexps will have to wait.

There are other implementation-dependent issues, too, not necessarily
specific to tre - the ability to backtrack, to match a specific number
of repetitions, wide character support, efficient searching, etc.
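
Back-references are one case where such differences tend to show up,
since matching them in general needs some form of backtracking. POSIX
defines them for basic regexps, so grep can serve as a quick probe.
This is a sketch with an invented helper name, not a test of tre or of
any particular libc.

```shell
#!/bin/sh
# Hypothetical helper: count lines of $2 matching the basic regexp $1.
count_bre() { printf '%s\n' "$2" | grep -c -- "$1"; }

# \(ab\)\1 requires the text captured by the group to appear twice.
echo "abab: $(count_bre '\(ab\)\1' abab)"
echo "abcd: $(count_bre '\(ab\)\1' abcd)"
# A bounded repetition, written as a BRE interval.
echo "aaa:  $(count_bre '^a\{3\}$' aaa)"
```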

> In case you are feeling complacent about NetBSD's regex, the awk
> documentation relies on it, and falls short. Awk claims to implement
> regex per egrep(1) -- providing no further description -- but that's
> just docurot:
>
> $ echo aaa | egrep 'a{3}' | wc -l
> 1
> $ echo aaa | awk '/a{3}/' | wc -l
> 0
>
> As far as I know, we have 3 regex definitions in base: GNU grep, NetBSD
> sed (with regex(3), defined by re_format(7)), and NetBSD awk. It would
> be an improvement IMO to use one implementation for all utilities in
> base, to make them internally consistent and dependable (and
> reproducible), even at the expense of compatibility with GNU's
> implementations.

The awk documentation describes Bell Labs egrep, for fairly obvious reasons.
The egrep in NetBSD is from GNU grep.

However, overall, you should look at Russ Cox's tutorials on regular
expressions at

https://swtch.com/~rsc/regexp/

Highly recommended.

Regards,
Alistair

James K. Lowden

Jun 11, 2016, 4:21:22 PM
On Tue, 7 Jun 2016 03:37:33 +0000 (UTC)
chri...@astron.com (Christos Zoulas) wrote:

> Awk implements its own (see b.c) (but that does not implement the
> API).

Thank you. A nice bit of code, too, btw.

> >As far as I know, we have 3 regex definitions in base: GNU grep,
> >NetBSD sed (with regex(3), defined by re_format(7)), and NetBSD
> >awk. It would be an improvement IMO to use one implementation for
> >all utilities in base, to make them internally consistent and
> >dependable (and reproducible), even at the expense of compatibility
> >with GNU's implementations.
>
> 14 (libc, TRE, grep, nvi, cvs, libiberty x 6, gettext, diffutils,
> less) 15 if you count awk.

Why wouldn't I count awk?

Gah. I was under the naive impression that re_format and libc ruled
except for GNU utilities.

--jkl

James K. Lowden

Jun 11, 2016, 4:24:39 PM
On Wed, 8 Jun 2016 23:31:07 -0700
Alistair Crooks <a...@pkgsrc.org> wrote:

> On 6 June 2016 at 18:35, James K. Lowden <jklo...@schemamania.org>
> wrote:
> > Back in 2009, Matthias-Christian Ott ported Ville Laurikari's regex

> It was brought into base. The USE_LIBTRE definition causes things to
> happen in libc if it's defined.

I'm not on current, but didn't think I was too far behind, given that
so many years had passed. Thanks for the pointer and the work.

> Strange things happen when you attempt to compile a basic regexp with
> an implementation expecting an extended regexp, to the point where
> build.sh would not complete.

Easy to imagine. IMO, since basic regex has been "obsolete" since the
Late Bronze Age, it would be better to move build.sh to use ERE, and
(then) make that the default in sed. Either that, or "Having two kinds
of REs is a botch" is just whining.
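
For a sense of what such a conversion involves, here is the same
substitution written both ways. A sketch only: it assumes a sed that
accepts -E for extended regexps, and the function names and sample
text are made up.

```shell
#!/bin/sh
# Basic regexp: grouping and intervals need backslashes.
bre_form() { printf '%s\n' "$1" | sed 's/\(a\{3\}\)/[\1]/'; }
# Extended regexp: the same operators are written bare.
ere_form() { printf '%s\n' "$1" | sed -E 's/(a{3})/[\1]/'; }

bre_form xaaay    # prints x[aaa]y
ere_form xaaay    # prints x[aaa]y
```

Each script fragment would need the backslashes on its grouping,
interval and alternation operators reviewed by hand, since a character
that is literal in one syntax is an operator in the other.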

> I do have a partial fix for that - take a look at the recently-added
> regextend(3) in othersrc/external/bsd -
> but until I've finished bringing that into libc, tre-based regexps
> will have to wait.

[will have a look]

> > In case you are feeling complacent about NetBSD's regex, the awk
> > documentation relies on it, and falls short. Awk claims to
> > implement regex per egrep(1)

> The awk documentation describes Bell Labs egrep, for fairly obvious
> reasons. The egrep in NetBSD is from GNU grep.

I suspected as much, Alistair, thanks for confirming. You see, I've
been using NetBSD for just 17 years, so I rely on the manual. I realize
that sometimes one has to know the history and lore to understand how
things work, but I don't think that's a good thing.

It seems awk regex has no NetBSD documentation, as many readers of
this list were doubtless aware. The reference to egrep has been obsolete
since nawk was adopted (NetBSD 2.0) because afaik we've used GNU grep
since well before then.

At Christos's behest, I looked at external/historical/nawk/dist/b.c.
I don't see any way to convert it to use the Posix API, nor any
appetite to write a new awk. I guess just documenting its regex
implementation is the best we can hope for.

> https://swtch.com/~rsc/regexp/
>
> Highly recommended.

Agreed, absolutely.

--jkl
