[9fans] regular expressions in plan9 different from the ones in unix? (at least linux)

Folkert van Heusden

Feb 22, 2007, 5:32:04 PM

Hi,

A user of a program of mine (http://www.vanheusden.com/multitail/) tries
to use plan9 regexps under linux and doesn't succeed.
Am I right that plan9 regular expressions are not compatible with the
ones of "regular" unix?

Folkert van Heusden

--
www.vanheusden.com/multitail - multitail is tail on steroids. multiple
windows, filtering, coloring, anything you can think of
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

William Josephson

Feb 22, 2007, 6:28:14 PM

On Thu, Feb 22, 2007 at 11:16:26PM +0100, Folkert van Heusden wrote:
> A user of a program of mine (http://www.vanheusden.com/multitail/) tries
> to use plan9 regexps under linux and doesn't succeed.
> Am I right that plan9 regular expressions are not compatible with the
> ones of "regular" unix?

Many unix programs don't use ``extended'' regular expressions by
default. See regexp(7) on Plan 9 or try egrep/grep -E under Unix.

Russ Cox

Feb 22, 2007, 6:52:41 PM

> Many unix programs don't use ``extended'' regular expressions by
> default. See regexp(7) on Plan 9 or try egrep/grep -E under Unix.

The Plan 9 regexp library matches the old Unix egrep command.
Any regexp you'd try under Plan 9 should work with new egreps,
though not vice versa -- new egreps tend to have newfangled
additions like [:upper:] and \w and {4,6} for repetition.
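
As a rough illustration of the difference on a Unix system, the sketch below uses POSIX regcomp(3) and regexec(3), not the Plan 9 library. The helper name re_match is hypothetical. It assumes a POSIX libc, where a cflags value of 0 selects basic REs and REG_EXTENDED selects extended REs.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if text matches pattern, 0 if it does
 * not, and -1 if the pattern fails to compile.
 * cflags is 0 for basic REs or REG_EXTENDED for extended REs.
 */
int re_match(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, re_match("ab+c", 0, "abbc") should report no match, because + is an ordinary character in a basic RE, while re_match("ab+c", REG_EXTENDED, "abbc") should match. A pattern using the newer additions, such as "^[[:upper:]]{4,6}$" with REG_EXTENDED, is accepted by POSIX EREs even though the Plan 9 library has no such syntax.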

Russ

Joel Salomon

Feb 23, 2007, 1:36:55 AM

On 2/22/07, Russ Cox <r...@swtch.com> wrote:
> The Plan 9 regexp library matches the old Unix egrep command.
> Any regexp you'd try under Plan 9 should work with new egreps,
> though not vice versa -- new egreps tend to have newfangled
> additions like [:upper:] and \w and {4,6} for repetition.

This came up as I was implementing my C lexer for the compilers class
I'm taking. How hard would it be to allow access to regcomp(2)'s
internals, so I could build up a regexp part by part, à la lex?

For example, to recognize C99 hexadecimal floating-point constants, I
wrote a second program that builds up the regexp piece by piece using
smprint(2), then compiles the whole thing:

char *decdig = "([0-9])",
	*hexdig = "([0-9A-Fa-f])",
	*sign = "([+\\-])",
	*dot = "(\\.)",
	*dseq, *dexp, *dfrac, *decflt,
	*hseq, *bexp, *hfrac, *hexflt;
dseq = smprint("(%s+)", decdig);
dexp = smprint("([Ee]%s?%s)", sign, dseq);
dfrac = smprint("((%s?%s%s)|(%s%s))", dseq, dot, dseq, dseq, dot);
decflt = smprint("(%s%s?)|(%s%s)", dfrac, dexp, dseq, dexp);
regcomp(decflt); // make sure it compiles
print("decfloat: %s\n", decflt);

hseq = smprint("(%s+)", hexdig);
bexp = smprint("([Pp]%s?%s)", sign, dseq);
hfrac = smprint("((%s?%s%s)|(%s%s))", hseq, dot, hseq, hseq, dot);
hexflt = smprint("0[Xx](%s|%s)%s", hfrac, hseq, bexp);
regcomp(hexflt); // make sure it compiles
print("hexfloat: %s\n", hexflt);
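
A rough POSIX counterpart of the hex-float half of this program, for readers on Unix. It assumes regcomp(3) with REG_EXTENDED and uses snprintf(3) in place of smprint(2). The function name compile_hexfloat, the buffer sizes, and the ^...$ anchors are illustrative choices, not part of the original program.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: build an ERE for C99 hexadecimal floating
 * constants (without the f/l suffix) from smaller pieces, anchor it to
 * the whole string, and compile it into *re.
 * Returns 0 on success, like regcomp(3).
 */
int compile_hexfloat(regex_t *re)
{
    const char *hexdig = "[0-9A-Fa-f]";
    const char *sign = "[+-]";
    const char *dot = "\\.";
    const char *dseq = "([0-9]+)";
    char hseq[64], bexp[128], hfrac[256], pattern[1024];

    /* one or more hex digits */
    snprintf(hseq, sizeof hseq, "(%s+)", hexdig);
    /* binary exponent: p or P, optional sign, decimal digits */
    snprintf(bexp, sizeof bexp, "([Pp]%s?%s)", sign, dseq);
    /* fraction: optional digits, dot, digits; or digits, dot */
    snprintf(hfrac, sizeof hfrac, "((%s?%s%s)|(%s%s))",
             hseq, dot, hseq, hseq, dot);
    /* whole constant: prefix, fraction or digit sequence, exponent */
    snprintf(pattern, sizeof pattern, "^0[Xx](%s|%s)%s$",
             hfrac, hseq, bexp);

    return regcomp(re, pattern, REG_EXTENDED | REG_NOSUB);
}
```

After a successful call, regexec(&re, "0x1.8p3", 0, NULL, 0) should return 0, while a string with no binary exponent, such as "0x1.8", should give REG_NOMATCH. Call regfree(&re) when done.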

I know that regcomp builds up the Reprog by combining subprograms with
catenation and alternation &c., but I’d be loath to try tinkering
there directly without a much better understanding of the algorithm.
I’ve glanced through the documents at swtch.com/????? and the regcomp
source code, just haven’t had the time for an in-depth study.

Would such a project be a worthwhile use of time? (Might it develop
into the asteroid to kill the dinosaur waiting for it?)

--Joel

William K. Josephson

Feb 23, 2007, 2:07:16 AM

On Fri, Feb 23, 2007 at 01:27:56AM -0500, Joel Salomon wrote:
> Would such a project be a worthwhile use of time? (Might it develop
> into the asteroid to kill the dinosaur waiting for it?)

Why go to the trouble? For C, the lexer is easy
enough to just write by hand.

Alberto Cortes

Feb 23, 2007, 4:54:35 AM

Folkert van Heusden said:

> Hi,
>
> A user of a program of mine (http://www.vanheusden.com/multitail/) tries
> to use plan9 regexps under linux and doesn't succeed.
> Am I right that plan9 regular expressions are not compatible with the
> ones of "regular" unix?

They are different. I am not very sure what you mean by "regular"
UNIX regexp; as far as I know, in Linux each command seems to use
a different set of regexps.

As for plan9, you can read regexp(6) at:

http://plan9.bell-labs.com/magic/man2html/6/regexp

Sam also supports structural regexps:

http://plan9.bell-labs.com/sources/contrib/uriel/mirror/se.pdf

--
Alberto Cortés

Gorka Guardiola

Feb 23, 2007, 6:19:47 AM

Also, I am not sure if you can use expressions with big unicode
characters in Unix; last time I looked with sed, you could not.

--
- curiosity sKilled the cat

erik quanstrom

Feb 23, 2007, 7:16:44 AM

utf-8 encoding will "just work" unless the gnu folk are
rearranging characters with the bucky bit set, or
the result depends on knowing the width of a character,
e.g. in

a) a character class
b) matching a single character with ".".

for example for a file "fu" with these lines

α0
β0
α1

(no leading tab) i get these results with no
locale settings at all.

; grep δ fu
δ0

works because as far as grep is concerned, the string
i asked for 03 b4 is in there. this works, too

; egrep '(ε|δ)0' fu
ε0
δ0

and this works because there is a character before
"0" on the line:

; egrep '.0' fu
ε0
δ0

but this doesn't

; egrep '[αβ]0' fu
; egrep '^.0' fu

this is for gnu grep version

; egrep --version
egrep (GNU grep) 2.5.1
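
The byte-width effect can be sketched with POSIX regcomp(3) in the "C" locale. This is an illustration under assumptions, not a description of what GNU grep does internally: the helper name match_in_c_locale is hypothetical, and the text is α (UTF-8 bytes ce b1) followed by "0".

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern as an ERE in the "C" locale,
 * where every byte counts as one character, and report whether text
 * matches. Returns 1 for a match, 0 for no match, -1 if the pattern
 * does not compile.
 */
int match_in_c_locale(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    setlocale(LC_ALL, "C");
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* α followed by 0; the literal is split so that "0" is not read as
 * part of the \xb1 escape. */
static const char alpha_zero[] = "\xce\xb1" "0";
```

Here match_in_c_locale(".0", alpha_zero) should succeed, since some byte precedes the "0", while match_in_c_locale("^.0$", alpha_zero) should fail, since two bytes precede it and "." covers only one.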

- erik

Gorka Guardiola

Feb 23, 2007, 7:22:56 AM

If it doesn't for one case, then it doesn't.

On 2/23/07, erik quanstrom <quan...@coraid.com> wrote:
> ; egrep '[αβ]0' fu
> ; egrep '^.0' fu
>

erik quanstrom

Feb 23, 2007, 8:05:39 AM

i don't think that sort of absolutist thinking really works.
i used gnu grep (and all the other gnu tools) on utf-8 stuff
from the time of the first sam release for unix till i stopped using
linux for much development. i never had a problem with
g(ed|sed|awk|e?grep) tripping on utf-8 when the locale was
unset or "C". i did keep in mind that . wasn't going to match
"☺", though.

we all know the limitations of our tools. that doesn't make
them broken.

just because plan 9 does bad things if you exceed NPROCS,
doesn't make it broken.

- erik

On 2/23/07, erik quanstrom <quan...@coraid.com> wrote:

> ; egrep '[αβ]0' fu
> ; egrep '^.0' fu
>

Joel C. Salomon

Feb 23, 2007, 8:35:35 AM

For a useful and significant subset of C, the lexer is easy enough to
just write by hand. I was trying for full C99 (what were those ISO
guys drinking?). I spent far too much time on it to call the task
"easy".

I have what I believe is a pretty complete C lexer
(http://www.tip9ug.jp/who/chesky/comp/lex.c). It still is far from
being integrated into a full grammar, but it scans cpp(1) output
nicely. I tested it against some of the odder "features" of C99—UCNs,
hex floats, &c.—and it seems to work.

Some parts were easy, some less so, and some looked easy until they
turned out to be subtly wrong. Recognizing whether the number seen is
an integer (in decimal, octal, or hex) or a real number was one of the
hard parts, and one I gladly handed off to a regexp. The way I
generated the regexp may not be ideal, as someone pointed out to me
off-list, but hand-generated code that recognizes what sort of number
was seen would be exactly equivalent to the regexp, and less readable.
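
For comparison, a hand-written recognizer for just the hexadecimal floating constants might look like the following sketch. It is not taken from lex.c; the names hex_run and is_hex_float are hypothetical, and the floating suffix (f, l) is left out to match the regexp posted earlier in the thread.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the hex digits at the start of s. */
static size_t hex_run(const char *s)
{
    size_t n = 0;

    while (isxdigit((unsigned char)s[n]))
        n++;
    return n;
}

/*
 * Return 1 if s is exactly a C99 hexadecimal floating constant
 * (without suffix), 0 otherwise.
 */
int is_hex_float(const char *s)
{
    size_t before, after = 0, n = 0;

    /* prefix: 0x or 0X */
    if (s[0] != '0' || (s[1] != 'x' && s[1] != 'X'))
        return 0;
    s += 2;

    /* digits, optionally a dot and more digits */
    before = hex_run(s);
    s += before;
    if (*s == '.') {
        s++;
        after = hex_run(s);
        s += after;
    }
    if (before + after == 0)
        return 0;               /* need at least one hex digit */

    /* the binary exponent is mandatory: p or P, optional sign, digits */
    if (*s != 'p' && *s != 'P')
        return 0;
    s++;
    if (*s == '+' || *s == '-')
        s++;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n > 0 && s[n] == '\0';
}
```

For example, is_hex_float("0x1.8p3") and is_hex_float("0x.8P-1") should return 1, while is_hex_float("0x1.8") should return 0 because the exponent is missing.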

--Joel

Russ Cox

Feb 23, 2007, 12:41:45 PM

Lex has three benefits:

1) You don't have to write the lexer directly.
2) What you do have to write is fairly concise.
3) The resulting lexer is fairly efficient.

It has two main drawbacks:

4) The input model does not always match your
own program's input model, creating a messy interface.
5) Once you need more than regular expressions,
lexers written with state variables and such can get
very opaque very fast.

Many on this list would argue that (1) and (2) do not
outweigh (4) and (5), instead suggesting that writing a
lexer by hand is not too difficult and ends up being
more maintainable than a lex spec in the long run.
And of course, for a well-written by-hand lexer,
you get to keep (3).

Creating new entry hooks in the regexp library doesn't
preserve (1), (2), or (3). And if much of your time is
spent in lexical analysis (as Ken claimed was true for
the Plan 9 compilers), losing (3) is a big deal.
So that seems like not a very good replacement for lex.

All that said, lex has been used to write a lot of C
compilers, and can be used in that context without
running into much of (4) or (5). Why not just use lex here?

Russ

Darren Bane

Feb 26, 2007, 4:54:40 AM

Apologies if this seems pedantic, but someone mentioned UNIX, which has
a very specific meaning, and is different to UNIX-like.

Alberto Cortes <alco...@it.uc3m.es> wrote:
> Folkert van Heusden said:
>
>> Hi,
>>
>> A user of a program of mine (http://www.vanheusden.com/multitail/) tries
>> to use plan9 regexps under linux and doesn't succeed.
>> Am I right that plan9 regular expressions are not compatible with the
>> ones of "regular" unix?
>
> They are different. I am not very sure what you mean by "regular"
> UNIX regexp,

They're defined by Extended REs in

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

(if you haven't registered for free with The Open Group, it's Chapter 9).
Basic REs are also specified, but only for backwards compatibility.

Extended REs look to my reading like a strict superset of Plan 9 REs
(UTF-8 issues notwithstanding), so any Plan 9 RE should be a UNIX RE.

> as far as I know in Linux each command seems to use
> different sets of regexps.

Legally speaking, Linux is not UNIX since it never passed the SUS
test suite. Therefore there is no guarantee that it uses UNIX regexps,
and it's possible that some tools won't understand Plan 9 REs.

!snip!
--
Darren Bane