A user of a program of mine (http://www.vanheusden.com/multitail/) tries
to use plan9 regexps under linux and doesn't succeed.
Am I right that plan9 regular expressions are not compatible with the
ones of "regular" unix?
Folkert van Heusden
--
www.vanheusden.com/multitail - multitail is tail on steroids. multiple
windows, filtering, coloring, anything you can think of
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
Many unix programs don't use ``extended'' regular expressions by
default. See regexp(6) on Plan 9 or try egrep/grep -E under Unix.
The Plan 9 regexp library matches the old Unix egrep command.
Any regexp you'd try under Plan 9 should work with new egreps,
though not vice versa -- new egreps tend to have newfangled
additions like [:upper:] and \w and {4,6} for repetition.
Russ
This came up as I was implementing my C lexer for the compilers class
I'm taking. How hard would it be to allow access to regcomp(2)'s
internals, so I could build up a regexp part-by-part a la lex?
For example, to recognize C99 hexadecimal floating-point constants, I
wrote a second program that builds up the regexp piece-by-piece using
smprint(2), then compiles the whole thing:
	char *decdig = "([0-9])",
	     *hexdig = "([0-9A-Fa-f])",
	     *sign = "([+\\-])",
	     *dot = "(\\.)",
	     *dseq, *dexp, *dfrac, *decflt,
	     *hseq, *bexp, *hfrac, *hexflt;

	dseq = smprint("(%s+)", decdig);
	dexp = smprint("([Ee]%s?%s)", sign, dseq);
	dfrac = smprint("((%s?%s%s)|(%s%s))", dseq, dot, dseq, dseq, dot);
	decflt = smprint("(%s%s?)|(%s%s)", dfrac, dexp, dseq, dexp);
	regcomp(decflt); // make sure it compiles
	print("decfloat: %s\n", decflt);

	hseq = smprint("(%s+)", hexdig);
	bexp = smprint("([Pp]%s?%s)", sign, dseq);
	hfrac = smprint("((%s?%s%s)|(%s%s))", hseq, dot, hseq, hseq, dot);
	hexflt = smprint("0[Xx](%s|%s)%s", hfrac, hseq, bexp);
	regcomp(hexflt); // make sure it compiles
	print("hexfloat: %s\n", hexflt);
I know that regcomp builds up the Reprog by combining subprograms with
catenation and alternation &c., but I’d be loath to try tinkering
there directly without a much better understanding of the algorithm.
I’ve glanced through the documents at swtch.com/????? and the regcomp
source code, but haven’t had the time for an in-depth study.
Would such a project be a worthwhile way to spend time? (Or might it
turn out to be the asteroid that kills the dinosaur waiting for it?)
--Joel
Why go to the trouble? For C, the lexer is easy
enough to just write by hand.
> Hi,
>
> A user of a program of mine (http://www.vanheusden.com/multitail/) tries
> to use plan9 regexps under linux and doesn't succeed.
> Am I right that plan9 regular expressions are not compatible with the
> ones of "regular" unix?
They are different. I am not very sure what you mean by "regular"
UNIX regexps; as far as I know, in Linux each command seems to use
a different set of regexps.
As for plan9, you can read regexp(6) at:
http://plan9.bell-labs.com/magic/man2html/6/regexp
Sam also supports structural regexps:
http://plan9.bell-labs.com/sources/contrib/uriel/mirror/se.pdf
--
Alberto Cortés
--
Dennis Davis, BUCS, University of Bath, Bath, BA2 7AY, UK
D.H....@bath.ac.uk
--
- curiosity sKilled the cat
gnu grep seems to have trouble with two things:
a) a character class
b) matching a single character with ".".
for example for a file "fu" with these lines
α0
β0
α1
(no leading tab) i get these results with no
locale settings at all.
; grep β fu
β0
works because as far as grep is concerned, the utf-8
bytes of the character i asked for (u+03b2) are in there. this works, too
; egrep '(α|β)0' fu
α0
β0
and this works because there is a character before
"0" on the line:
; egrep '.0' fu
α0
β0
but this doesn't
; egrep '[αβ]0' fu
; egrep '^.0' fu
this is for gnu grep version
; egrep --version
egrep (GNU grep) 2.5.1
- erik
On 2/23/07, erik quanstrom <quan...@coraid.com> wrote:
> ; egrep '[αβ]0' fu
> ; egrep '^.0' fu
>
we all know the limitations of our tools. that doesn't make
them broken.
just because plan 9 does bad things if you exceed NPROCS,
doesn't make it broken.
- erik
For a useful and significant subset of C, the lexer is easy enough to
just write by hand. I was trying for full C99 (what were those ISO
guys drinking?). I spent far too much time on it to call the task
"easy".
I have what I believe is a pretty complete C lexer
(http://www.tip9ug.jp/who/chesky/comp/lex.c). It still is far from
being integrated into a full grammar, but it scans cpp(1) output
nicely. I tested it against some of the odder "features" of C99—UCNs,
hex floats, &c.—and it seems to work.
Some parts were easy, some less so, and some looked easy until they
turned out to be subtly wrong. Recognizing whether the number seen is
an integer (in decimal, octal, or hex) or a real number was one of the
hard parts, and one I gladly handed off to a regexp. The way I
generated the regexp may not be ideal, as someone pointed out to me
off-list, but hand-generated code that recognizes what sort of number
was seen would be exactly equivalent to the regexp, and less readable.
--Joel
Lex has three main advantages:
1) You don't have to write the lexer directly.
2) What you do have to write is fairly concise.
3) The resulting lexer is fairly efficient.
It has two main drawbacks:
4) The input model does not always match your
own program's input model, creating a messy interface.
5) Once you need more than regular expressions,
lexers written with state variables and such can get
very opaque very fast.
Many on this list would argue that (1) and (2) do not
outweigh (4) and (5), instead suggesting that writing a
lexer by hand is not too difficult and ends up being
more maintainable than a lex spec in the long run.
And of course, for a well-written by-hand lexer,
you get to keep (3).
Creating new entry hooks in the regexp library doesn't
preserve (1), (2), or (3). And if much of your time is
spent in lexical analysis (as Ken claimed was true for
the Plan 9 compilers), losing (3) is a big deal.
So that seems like not a very good replacement for lex.
All that said, lex has been used to write a lot of C
compilers, and can be used in that context without
running into much of (4) or (5). Why not just use lex here?
Russ
Alberto Cortes <alco...@it.uc3m.es> wrote:
> Folkert van Heusden said:
>
>> Hi,
>>
>> A user of a program of mine (http://www.vanheusden.com/multitail/) tries
>> to use plan9 regexps under linux and doesn't succeed.
>> Am I right that plan9 regular expressions are not compatible with the
>> ones of "regular" unix?
>
> They are different. I am not very sure what you mean by "regular"
> UNIX regexp,
They're defined by Extended REs in
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
(if you haven't registered for free with The Open Group, it's Chapter 9).
Basic REs are also specified, but only for backwards compatibility.
Extended REs look to my reading like a strict superset of Plan 9 REs
(UTF-8 issues notwithstanding), so any Plan 9 RE should be a UNIX RE.
> as far as I know, in Linux each command seems to use
> a different set of regexps.
Legally speaking, Linux is not UNIX since it never passed the SUS
test suite. Therefore there is no guarantee that it uses UNIX regexps,
and it's possible that some tools won't understand Plan 9 REs.
!snip!
--
Darren Bane