On SVN : automatic conversion of Regex to Hampi Grammar

27 views
Skip to first unread message

Devdatta

unread,
Jan 22, 2010, 4:08:07 AM1/22/10
to hampi...@googlegroups.com
Hi

I have decided to send out the alpha release of the tool I have been
hacking on that converts a PCRE regular expression to Hampi Grammar.
In the latest svn, you can cd to the lib/regex-hampi directory to find
the code.

*** Using the library

You will need a standard perl install. The library is the
'pcre_tohampi.pm' file and the YAPE folder. Thanks to Jeff Pinyan for
the nice YAPE::RegEx library that I harnesses. The library is under
the MIT license. The library returns a string containing the Hampi CFG
that will generate the same strings that the regex does. The CFG non
terminals are named as follows :

The non terminal for the whole regex is flax0
the non terminal for 1st capturing group is flax1
and so on ...

So, usually you will want to get the CFG in string form and do something like :

assert stringvar in flax0;

Please go through the code in the hampiregexTest.pl. It demonstrates
how to use the library. If it is still not clear, ask on this mailing
list.

*** The Test Engine

For testing this tool, I have also included a sample test engine that
does the following :
1> It takes a list of files as input , where each line of each file
contains a regular expression. If no file is given, it looks for input
via STDIN.
2> For each regex (i.e each line), it generates a Hampi Grammar and
runs the Hampi String Solver on it.
3> The result of the Hampi String Solver is then matched with the
original regex using the RegEx engine present in perl.

So for e.g, you can try running (make sure you have compiled Hampi first)
perl hampiregexTest.pl < sampleRegex

from the regex-hampi directory and see the results.

***

Special thanks to Stephen McCamant for a *lot* of help.

Please feel free to send me feedback/bugs/patched on/off the mailing
list. I haven't concentrated on performance, so I am sure there are
lots of performance improvements possible.

Hope this is helpful to someone


Regards
Devdatta

Tim Hinrichs

unread,
Jan 22, 2010, 11:05:41 AM1/22/10
to hampi...@googlegroups.com
Hi Devdatta,

Thanks for putting this together--it'll help us out quite a bit!

I have a bug report. On the following regexp, the test script
reports failure.

(^[\w\.=-]+)

Here's the Hampi input file produced.

var string:1;
cfg q2 := \000 | [\045 - \046] | [\048 - \057] | \061 | [\065 - \090]
| \095 | [\097 - \122];
cfg q3 := (q2)+;
cfg q1 := q3;
cfg flax0 := q1;

assert string in flax0;

It looks like the problem is the occurrence of the null string \000
in q2 because Hampi returns the empty string.

Tim

Devdatta

unread,
Jan 22, 2010, 2:22:05 PM1/22/10
to hampi...@googlegroups.com
Hi Tim,

Thanks for using the code and catching the bug. I have done the fix to
the code and added this regex to the sampleRegex file. Please update
your copy. Note that you need to enclose the regex with slashes on
both sides.

Regards
Devdatta

2010/1/22 Tim Hinrichs <thin...@gmail.com>:

Devdatta

unread,
Jan 27, 2010, 6:24:14 PM1/27/10
to hampi...@googlegroups.com
Hi all,

I just updated the code, and have changed the API a little bit. Also I
have uploaded a large number of new regular expressions as test cases.
Please see the code hampiregexTest.pl to see how to use the new API.

I am interested in seeing regular expressions anyone of you finds
really common and/or bugs, so be sure to let me know if you find
anything interesting.

Regards
devdatta

2010/1/22 Devdatta <dev.a...@gmail.com>:

Adam Kiezun

unread,
Jan 27, 2010, 7:03:11 PM1/27/10
to hampi...@googlegroups.com
Devdatta,
This is great stuff! I think it would make a lot of sense to put some
advertisement for it on the Hampi website.

Can you write something: some info about it and some examples of how
to use it? I will then add a "news flash" item to the main page.

./adam

Devdatta

unread,
Jan 28, 2010, 2:23:47 AM1/28/10
to hampi...@googlegroups.com
I have created http://code.google.com/p/hampi/wiki/PCREToHampi

Cheers
Devdatta

Adam Kiezun

unread,
Feb 2, 2010, 10:53:59 AM2/2/10
to hampi...@googlegroups.com
Thanks heaps Devdatta!
I put a note on the Hampi website: http://people.csail.mit.edu/akiezun/hampi/

./ada,

Tim Hinrichs

unread,
Feb 8, 2010, 5:21:29 PM2/8/10
to hampi...@googlegroups.com
Hi Devdatta,

Thanks again for writing the regexp converter. I've been using it
within a wrapper around Hampi, and it's been a life-saver. Here's a
brain dump of the additional features I would have liked and the
problems I ran into.

One feature I hacked in myself is the ability to create unique CFG
names for each regexp. This is invaluable when you're asking Hampi
to solve multiple (converted) regexps simultaneously. The interface
I used was simple: have TOTHEHAMPI take an argument that's meant to
be unique. Example.

my $p=pcre_tohampi->new(qr/abc/);
$p->parse;
print $p->tothehampi("_unique1_");
//toplevel flax_unique1_0


my $p=pcre_tohampi->new(qr/def/);
$p->parse;
print $p->tothehampi("_unique2_");
//toplevel flax_unique2_0

If I remember right, I only had to touch a handful of places in the
code. I don't know if there's gensym functionality in perl that
might be more appropriate, but this worked fine for me.


The only bugs I fixed were escape characters that weren't handled,
e.g. \(, \$, \^. That just took adding the appropriate entries to %
slashhash.

I also had some problems with the @ symbol, though it was never clear
whether the problem was in the regex converter, hampi, or (less
likely) my code. The fix was to add \@ to %slashhash and escape @ in
regexps.

Attached is the altered pcre_tohampi.pm.

Tim

pcre_tohampi.pm

Devdatta

unread,
Feb 9, 2010, 12:37:03 AM2/9/10
to hampi...@googlegroups.com
Hi Tim

Did you do an svn up before this? I had sent out a mail on how I put
in more testcases and changed the interface a bit. In the newer
version (Jan 27), slashhash contains 30 special chars (which I think
cover all the ones you have added). Also the code is a lot more
cleaner.

Regarding the other two points :

1> I think the feature you are talking about is very important. It
should be pretty trivial to add. I will make the change and do a
commit soon (hopefully by tomorrow). It just wasn't a part of my use
case - but now that I see your viewpoint its a no-brainer.

2> Regarding the '@' issue. I think this is because perl arrays start
with an @ which breaks the eval. I don't think we can do anything
about this other than escaping it and adding it to the slashhash. It
will be great if you can send me some sample test cases so that I can
add them to the list of tests.

I will hopefully have something ready for you by tomorrow.

Thanks a lot for using the tool!


Regards
Devdatta

Devdatta

unread,
Feb 9, 2010, 1:00:00 AM2/9/10
to hampi...@googlegroups.com
Hi Tim,

I have pushed in all the changes I mentioned in the previous mail. I
have also added @ to slashhash.

Its an ugly hack, but I don't think I know anything better. Let me
talk to someone who actually knows perl and see if I can get to
something.


Cheers
Devdatta

Tim Hinrichs

unread,
Feb 9, 2010, 10:53:26 AM2/9/10
to hampi...@googlegroups.com
I hadn't done an svn update--was under a deadline.

Maybe you could escape the @ in the string you're given before the
eval. Still perhaps an ugly hack but at least one the user doesn't
do himself.

Here's an example I believe I had trouble with.

/[@a-zA-Z0-9\\.]+/

Tim

Devdatta

unread,
Feb 9, 2010, 1:51:33 PM2/9/10
to hampi...@googlegroups.com
Hi Tim,

Thanks for the example. I think your method of escaping '@' is the
most pragmatic idea.

Another possibility is the use of ' instead of / , as the regex
delimiter. This turns off interpolation by the qr operator. Thus, eval
'qr'.'\'[@a-zA-Z0-9\\.]+\'' works directly. Ofcourse, now you need to
escape the single quote inside the regex. Turning off interpolation
is also much more safer. Depending on the type of input you have, you
decide to use either method. I will add a note to the wiki.

Cheers
Devdatta

Tim Hinrichs

unread,
Feb 9, 2010, 4:18:19 PM2/9/10
to hampi...@googlegroups.com
Here's another example. It looks like we're missing \| from %slashhash.

[\\\^\$\(\)\[\]\{\}\.\*\?\|\+\- \w!@#%&:/="><;,'`~]+

Tim

Devdatta

unread,
Feb 9, 2010, 4:56:12 PM2/9/10
to hampi...@googlegroups.com
Hi

fixed. Also added both examples to the sampleRegex file.


cheers
devdatta

Reply all
Reply to author
Forward
0 new messages