
[Caml-list] help with regular expression


zaid khalid

Dec 6, 2010, 6:50:06 AM
to caml...@yquem.inria.fr
Hi Folks

I would like some help writing regular expressions in OCaml. I know how to write them informally, but not in OCaml syntax. For example, I want to write "a* | (aba)* ".

My other question: if I want the regular expression to match the whole string rather than a substring, what symbol do I need to attach to it? That is, I want only specific strings to be accepted (such as " ", a, aa, aaa, aba, abaaba), but not ab or abaa.


Hint I am using (Str.regexp)
Thanks


David Allsopp

Dec 6, 2010, 7:07:09 AM
to zaid khalid, caml...@yquem.inria.fr
zaid Khalid wrote:
> Hi Folks
>
> I want some help in writing regular expressions in Ocaml, as I know how to write it
> in informal way but in Ocaml syntax I can not. For example I want to write "a* | (aba)* ".

This question would better be posted on the beginners' list - http://caml.inria.fr/resources/forums.en.html#id2267683

Regular expressions can be done using the Standard Library with the Str module (as you've found); see http://caml.inria.fr/pub/docs/manual-ocaml/libref/Str.html. Assuming you have loaded/linked str.cm[x]a, your expression above is Str.regexp "a*\\|\\(aba\\)*". The regexp syntax is given in the docs for the Str.regexp function. Remember that the regular expression is written as an OCaml string literal, so every backslash in it must be doubled (and to match a literal backslash in your regexp you have to write "\\\\").
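
A minimal sketch of the above, assuming the str library is linked (the helper name matched_prefix_length is made up for illustration). Note that Str.string_match only requires a prefix of the string to match, and this pattern can match the empty string, so it succeeds on every input:

```ocaml
(* Build with the str library linked, e.g.
   ocamlfind ocamlopt -package str -linkpkg example.ml *)

(* "a* | (aba)*" in Str syntax; each backslash is doubled because the
   pattern is written as an OCaml string literal. *)
let re = Str.regexp "a*\\|\\(aba\\)*"

(* Hypothetical helper: length of the prefix of [s] matched by [re],
   or None if there is no match at position 0. *)
let matched_prefix_length s =
  if Str.string_match re s 0 then Some (Str.match_end ()) else None

let () =
  List.iter
    (fun s ->
      match matched_prefix_length s with
      | Some n -> Printf.printf "%S: matched %d character(s)\n" s n
      | None -> Printf.printf "%S: no match\n" s)
    [ ""; "aaa"; "ab"; "b" ]
```

Since "ab" and "b" both report a match here, the unanchored pattern is not enough on its own to reject strings outside the language.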

> Another question if I want the string to be matched against the regular expression
> to be matched as whole string not as substring what symbol I need to attach to the
> substring, i.e if I want only concrete strings accepted (like (" ", a , aa , aaa,
> aba, abaaba), but not ab or not abaa).

Use ^ and $ at the beginning and end of your regexp to ensure that it matches the entire string only - "^\\(a*\\|\\(aba\\)*\\)$"
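
As a rough sketch of whole-string matching with this anchored pattern (the function name accepts is hypothetical, and the str library is assumed to be linked): in Str, $ also matches just before a newline character, so the sketch additionally checks that the match ends at the end of the string.

```ocaml
(* Build with the str library linked, e.g.
   ocamlfind ocamlopt -package str -linkpkg example.ml *)

(* Anchored version of "a* | (aba)*". *)
let re = Str.regexp "^\\(a*\\|\\(aba\\)*\\)$"

(* Hypothetical helper: true only if the whole of [s] is matched.
   The extra match_end check rejects inputs such as "a\n", where $
   would otherwise match just before the newline. *)
let accepts s =
  Str.string_match re s 0 && Str.match_end () = String.length s

let () =
  List.iter
    (fun s -> Printf.printf "%S -> %b\n" s (accepts s))
    [ ""; "a"; "aa"; "aaa"; "aba"; "abaaba"; "ab"; "abaa" ]
```

The $ anchor is still needed even with the length check: without it, the matcher could stop after the first alternative has matched a short prefix and never try the second one.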

> Hint I am using (Str.regexp)

There are other libraries (e.g. pcre-ocaml) which provide different (I would say more powerful, rather than strictly better!) implementations.


David

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Sylvain Le Gall

Dec 6, 2010, 8:11:50 AM
to caml...@inria.fr
On 06-12-2010, David Allsopp <dra-...@metastack.com> wrote:

> zaid Khalid wrote:
>>
>
>> Hint I am using (Str.regexp)
>
> There are other libraries (e.g. pcre-ocaml) which provide different (I
> would say more powerful, rather than strictly better!)
> implementations.
>
>

There are also syntax extensions like mikmatch, which help to write regexps
in a very readable syntax:

  match str with
  | RE bol "a"* | "ab"* eol ->
      true
  | _ ->
      false

http://martin.jambon.free.fr/mikmatch-manual.html
http://martin.jambon.free.fr/mikmatch.html

You can use pcre and str with mikmatch.

Regards,
Sylvain Le Gall

Dawid Toton

Dec 6, 2010, 12:31:47 PM
to caml-list

I also had problems with Str (regexp descriptions being unreadable,
error-prone and hard to generate dynamically) and decided just to stop
using Str.
I have a tiny module [1] made with clarity in mind. It is pure OCaml. It
defines operators like $$ to be used in regexp construction. This way
syntax of the expressions is checked at compile time. Also, it is
trivial to build them at run time.
The whole "engine" is contained in a relatively short function
HRegex.subwords_of_subexpressions, so I believe anybody can hack it
without much effort.

I haven't measured performance of this implementation. I expect it to be
slow when processing long strings. It's just OK for my needs so far.
Anyway, the important part is the module interface. It expresses my
point of view on this topic.

The code is available in a mercurial repository [2].

The example "a* | (aba)* " would become:

open HRegex.Operators

let rx = (!* !$ "a") +$ (!* !$ "aba")

Dawid

[1]
http://hg.ocamlcore.org/cgi-bin/hgwebdir.cgi/hlibrary/hlibrary/raw-file/tip/HRegex.mli
[2] http://hg.ocamlcore.org/cgi-bin/hgwebdir.cgi/hlibrary/hlibrary

Martin Jambon

Dec 6, 2010, 3:41:19 PM
to caml...@yquem.inria.fr
On 12/06/10 05:11, Sylvain Le Gall wrote:
> On 06-12-2010, David Allsopp <dra-...@metastack.com> wrote:
>> zaid Khalid wrote:
>>>
>>
>>> Hint I am using (Str.regexp)
>>
>> There are other libraries (e.g. pcre-ocaml) which provide different (I
>> would say more powerful, rather than strictly better!)
>> implementations.
>>
>>
>
> There is also syntax extension like mikmatch, that helps to write regexp
> in a very meaningful syntax:
>
> match str with
> | RE bol "a"* | "ab"* eol ->
> true
> | _ ->
> false

If I understand the original problem correctly, the solution is:

  match str with
  | RE ("a"* | "aba"*) eos ->
      (* RE always matches at the beginning of the string,
         eos enforces a match at the end of the string,
         and the vertical bar has the lowest priority,
         so parentheses are needed. *)
      true
  | _ ->
      false

I would recommend the pcre variant mostly for one feature that is not
provided by str: lazy quantifiers, i.e. "repeat as little as possible
before trying to match what comes next".
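
To illustrate what a lazy quantifier buys you, here is a small sketch using Str only (the helper first_match and the sample input are made up for illustration, and the str library is assumed to be linked). Str's * is greedy, so "<.*>" runs to the last '>'; in PCRE syntax the lazy form would be written <.*?>, while with Str the usual workaround is a negated character set:

```ocaml
(* Build with the str library linked, e.g.
   ocamlfind ocamlopt -package str -linkpkg example.ml *)

(* Greedy: .* repeats as much as possible. *)
let greedy = Str.regexp "<.*>"

(* Workaround without lazy quantifiers: forbid '>' inside the tag. *)
let bounded = Str.regexp "<[^>]*>"

(* Hypothetical helper: the text matched at position 0, if any. *)
let first_match re s =
  if Str.string_match re s 0 then Some (Str.matched_string s) else None

let () =
  let show name re =
    match first_match re "<a><b>" with
    | Some m -> Printf.printf "%s: %s\n" name m
    | None -> Printf.printf "%s: no match\n" name
  in
  show "greedy" greedy;
  show "bounded" bounded
```

The negated-set trick only works when a single character delimits the match; for multi-character delimiters a lazy quantifier is much more convenient.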


Martin
