
[Caml-list] camlp4 and lexers


Pietro Abate

May 15, 2008, 11:01:57 AM
to caml...@yquem.inria.fr
Hi all,
This question was asked a few weeks ago, and again last week. However, I
still don't really get how to proceed. I hope we can boil it down to a small
example to understand the camlp4 internals a bit better.

Say I want to write a small parser for regexps (or an arithmetic
calculator), but I don't want to extend the OCaml grammar to do that. I
just want to create a minimal lexer and a minimal grammar to parse
expressions like (aaa*|b?);c

The parser part is easy (below). The part I don't understand is how to
create a lexer. I had a look at the ocsigen xmlcaml lexer and the camlp4
lexer, but I still haven't found a minimal example I can use without
getting confused.

In particular, the problem below is that I want my lexer to give me back
CHAR tokens (different from camlp4's CHAR of char * string) and not
strings. I could do the same with the camlp4 lexer, but then all my regexps
would have to be written as ('a''a''a' *) etc., which doesn't look good.
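
To make the goal concrete, here is a minimal sketch of the kind of tokens I
have in mind, written in plain OCaml with no camlp4 dependency. The token
type and the tokenize function are hypothetical names made up for this
illustration; this is not the camlp4 lexer interface, only the lexing step
that such a lexer would have to perform.

```ocaml
(* Hypothetical token type: operators become keywords, any other
   character is a single CHAR token, EOI marks the end of the input. *)
type token =
  | KWD of string
  | CHAR of char
  | EOI

(* The characters that have a special meaning in the regexp syntax. *)
let is_operator c = String.contains "|;?*+.()" c

(* Turn a string into a token list, skipping blanks. *)
let tokenize (s : string) : token list =
  let rec go i acc =
    if i >= String.length s then List.rev (EOI :: acc)
    else
      match s.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc
      | c when is_operator c -> go (i + 1) (KWD (String.make 1 c) :: acc)
      | c -> go (i + 1) (CHAR c :: acc)
  in
  go 0 []
```

With this, "aaa*" is lexed as three CHAR tokens followed by KWD "*", so the
regexp can be written without quoting every character.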

A while ago I did something similar with the old camlp4 [1] using
plexer, but this is not possible anymore...

A while ago Nicolas suggested copying the Camlp4.PreCast module and the
lexer module and customizing them. I think it should be possible to just
use Struct.Grammar.Static.Make with a new lexer instead... but, as I
said, I'm not able to write a very minimal lexer for this example...
Maybe I'm confused about this.

I think a minimal example will help more than one person here.

thanks :)
p


-------------------------- This is my parser...

module RegExGram = Struct.Grammar.Static.Make(RegExpLexer)

let regex = RegExGram.Entry.mk "regex"

EXTEND RegExGram
  GLOBAL: regex;

  regex: [[ e1 = SELF ; "|" ; e2 = concat -> Alt(e1,e2)
          | e1 = concat -> e1 ]  (* no "|": fall through to concat *)
  ];

  concat: [[ e1 = SELF ; ";"; e2 = seq -> Seq(e1,e2)
           | e1 = SELF ; e2 = seq -> Seq(e1,e2)
           | e1 = seq -> e1 ]
  ];

  seq: [[ e1 = simple ; "?" -> Opt e1
        | e1 = simple ; "*" -> Star e1
        | e1 = simple ; "+" -> Plus e1
        | e1 = simple -> e1 ]
  ];

  simple: [[ "." -> Dot
           | "("; e1 = regex; ")" -> e1
           | `CHAR(s) -> Sym s ]
  ];

END

----------------------

[1] http://groups.google.com/group/fa.caml/browse_thread/thread/e26569427cc8879d

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Pietro Abate

May 16, 2008, 11:25:34 AM
to caml...@yquem.inria.fr
Hi again.

I have a minimal (?) lexer (attached) working with the grammar below.
For the purpose of this exercise I used ulex. I started with the cduce
lexer and removed all cduce-specific functions. However, I'm not entirely
happy.

First, I'd like to have another example using ocamllex and not ulex (one
less dependency), but I guess this is not too hard to do.

Second, I've copy-pasted some code in the lexer to instantiate the
camlp4 modules, but I'm not sure what is required and what is not. I
mean, I can look at the camlp4 module signatures, but without documentation
there are a lot of functions that I don't really understand. Can anybody
explain the signatures of the Loc, Token and Error modules?
How are these functions used within the camlp4 parsing machinery?
- Token.match_keyword
- Token.extract_string
- Token.Filter.mk
- Token.Filter.filter
- Token.Filter.define_filter
- Token.Filter.keyword_added
- Token.Filter.keyword_removed
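
For discussion, here is a stand-alone mock of how I currently read these
functions. It is only a sketch under assumptions: it does not use camlp4 at
all, token streams with locations are simplified to plain lists, and the
bodies are my guess at the intended behaviour (match_keyword tests a token
against a keyword string, extract_string returns the text a token carries,
a filter promotes tokens to keywords and then applies any filters
registered with define_filter, and the keyword_added / keyword_removed
hooks do nothing). The token type is the hypothetical one from my example,
not the camlp4 one.

```ocaml
(* Hypothetical token type for the regexp example. *)
type token = KWD of string | CHAR of char | EOI

module Token = struct
  type t = token

  let to_string = function
    | KWD s -> Printf.sprintf "KWD %S" s
    | CHAR c -> Printf.sprintf "CHAR %C" c
    | EOI -> "EOI"

  (* True only when the token is exactly the keyword [kwd]. *)
  let match_keyword kwd = function
    | KWD kwd' -> kwd = kwd'
    | _ -> false

  (* The text carried by a token. *)
  let extract_string = function
    | KWD s -> s
    | CHAR c -> String.make 1 c
    | EOI -> ""

  module Filter = struct
    (* Simplification: a list of tokens instead of a stream of
       (token, location) pairs. *)
    type token_filter = token list -> token list

    type t = {
      is_kwd : string -> bool;         (* asks whether a string is a keyword *)
      mutable filter : token_filter;   (* user-defined filters, composed *)
    }

    let mk is_kwd = { is_kwd; filter = (fun toks -> toks) }

    (* Wrap the current filter with a new one. *)
    let define_filter x f = x.filter <- f x.filter

    (* Promote CHAR tokens that are keywords, then run the user filters. *)
    let filter x toks =
      let promote = function
        | CHAR c when x.is_kwd (String.make 1 c) -> KWD (String.make 1 c)
        | tok -> tok
      in
      x.filter (List.map promote toks)

    (* Hooks for keywords appearing in or disappearing from the grammar;
       nothing to do in this mock. *)
    let keyword_added _ _ _ = ()
    let keyword_removed _ _ = ()
  end
end
```

If this reading is right, a lexer could emit only CHAR tokens and leave the
decision of what counts as a keyword to the filter, but I'd like someone who
knows the real machinery to confirm or correct it.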

Third, I'm not sure if this is the real minimal example I was looking
for. I have the impression I could reuse the Camlp4.PreCast.Loc module,
but I'm not sure if I can reuse Camlp4.PreCast.Token, since it is
tied to the token type definition. I don't think I can reuse/extend
the caml_token type... Making the lexer extensible would be great!

Hope this helps.

comments ?

pietro


This is the _tags file to compile it:
---------- _tags -------
"parser.ml": use_camlp4, pp(camlp4of)
"ulexer.ml": pkg_ulex, use_camlp4, syntax_camlp4o
"ulexer.mli": use_camlp4, pkg_ulex
-----------

+ Nicolas' universal myocamlbuild.ml

-------------------- parser.ml -----------------------

type t =
  | Seq of t * t
  | Alt of t * t
  | Opt of t
  | Star of t
  | Plus of t
  | Dot
  | Sym of char

open Ulexer

module RegExGram = Camlp4.Struct.Grammar.Static.Make(Ulexer)

let regex = RegExGram.Entry.mk "regex"

(* I guess I don't need to use KWD *)

EXTEND RegExGram
  GLOBAL: regex;

  regex: [[ e1 = SELF ; `KWD "|" ; e2 = concat -> Alt(e1,e2)
          | e1 = concat -> e1 ]
  ];

  concat: [[ e1 = SELF ; `KWD ";"; e2 = seq -> Seq(e1,e2)
           | e1 = SELF ; e2 = seq -> Seq(e1,e2)
           | e1 = seq -> e1 ]
  ];

  seq: [[ e1 = simple ; `KWD "?" -> Opt e1
        | e1 = simple ; `KWD "*" -> Star e1
        | e1 = simple ; `KWD "+" -> Plus e1
        | e1 = simple -> e1 ]
  ];

  simple: [[ `KWD "." -> Dot
           | `KWD "("; e1 = regex; `KWD ")" -> e1
           | `CHAR(s) -> Sym s ]
  ];

END

let from_string s = RegExGram.parse_string regex (Loc.mk "<string>") s

------------------------------------------------------
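
As a reference point for checking the output of the grammar, here is a small
hand-written recursive-descent parser for the same syntax, in plain OCaml
without camlp4 or ulex. It is a sketch under assumptions: the parse function
and the Parse_error exception are names invented for this illustration, it
reads characters directly instead of tokens, and it does not skip
whitespace. It is meant to build left-nested Seq and Alt nodes the way the
left-recursive rules read, but I have not compared it against the camlp4
parser.

```ocaml
type t =
  | Seq of t * t
  | Alt of t * t
  | Opt of t
  | Star of t
  | Plus of t
  | Dot
  | Sym of char

(* Raised with the offset of the offending character. *)
exception Parse_error of int

let parse (s : string) : t =
  let n = String.length s in
  let pos = ref 0 in
  let peek () = if !pos < n then Some s.[!pos] else None in
  let advance () = incr pos in
  let fail () = raise (Parse_error !pos) in
  (* Can the next character start a [simple] expression? *)
  let starts_simple = function
    | Some ('|' | ';' | '?' | '*' | '+' | ')') | None -> false
    | Some _ -> true
  in
  (* regex: concat ("|" concat)*, left associative *)
  let rec regex () =
    let rec loop left =
      match peek () with
      | Some '|' -> advance (); loop (Alt (left, concat ()))
      | _ -> left
    in
    loop (concat ())
  (* concat: seq ((";")? seq)*, left associative *)
  and concat () =
    let rec loop left =
      match peek () with
      | Some ';' -> advance (); loop (Seq (left, seq ()))
      | c when starts_simple c -> loop (Seq (left, seq ()))
      | _ -> left
    in
    loop (seq ())
  (* seq: simple followed by at most one postfix operator *)
  and seq () =
    let e = simple () in
    match peek () with
    | Some '?' -> advance (); Opt e
    | Some '*' -> advance (); Star e
    | Some '+' -> advance (); Plus e
    | _ -> e
  (* simple: ".", a parenthesised regex, or a single character *)
  and simple () =
    match peek () with
    | Some '.' -> advance (); Dot
    | Some '(' ->
      advance ();
      let e = regex () in
      if peek () = Some ')' then (advance (); e) else fail ()
    | Some c when starts_simple (Some c) -> advance (); Sym c
    | _ -> fail ()
  in
  let e = regex () in
  if !pos = n then e else fail ()
```

For the example from my first message, parse "(aaa*|b?);c" gives
Seq (Alt (Seq (Seq (Sym 'a', Sym 'a'), Star (Sym 'a')), Opt (Sym 'b')), Sym 'c').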


ulexer.ml
ulexer.mli
myocamlbuild.ml