I have decided to send out the alpha release of the tool I have been
hacking on that converts a PCRE regular expression to Hampi Grammar.
In the latest svn, you can cd to the lib/regex-hampi directory to find
the code.
*** Using the library
You will need a standard perl install. The library is the
'pcre_tohampi.pm' file and the YAPE folder. Thanks to Jeff Pinyan for
the nice YAPE::RegEx library that I harnesses. The library is under
the MIT license. The library returns a string containing the Hampi CFG
that will generate the same strings that the regex does. The CFG non
terminals are named as follows :
The non terminal for the whole regex is flax0
the non terminal for 1st capturing group is flax1
and so on ...
So, usually you will want to get the CFG in string form and do something like :
assert stringvar in flax0;
Please go through the code in the hampiregexTest.pl. It demonstrates
how to use the library. If it is still not clear, ask on this mailing
list.
*** The Test Engine
For testing this tool, I have also included a sample test engine that
does the following :
1> It takes a list of files as input , where each line of each file
contains a regular expression. If no file is given, it looks for input
via STDIN.
2> For each regex (i.e each line), it generates a Hampi Grammar and
runs the Hampi String Solver on it.
3> The result of the Hampi String Solver is then matched with the
original regex using the RegEx engine present in perl.
So for e.g, you can try running (make sure you have compiled Hampi first)
perl hampiregexTest.pl < sampleRegex
from the regex-hampi directory and see the results.
***
Special thanks to Stephen McCamant for a *lot* of help.
Please feel free to send me feedback/bugs/patched on/off the mailing
list. I haven't concentrated on performance, so I am sure there are
lots of performance improvements possible.
Hope this is helpful to someone
Regards
Devdatta
Thanks for putting this together--it'll help us out quite a bit!
I have a bug report. On the following regexp, the test script
reports failure.
(^[\w\.=-]+)
Here's the Hampi input file produced.
var string:1;
cfg q2 := \000 | [\045 - \046] | [\048 - \057] | \061 | [\065 - \090]
| \095 | [\097 - \122];
cfg q3 := (q2)+;
cfg q1 := q3;
cfg flax0 := q1;
assert string in flax0;
It looks like the problem is the occurrence of the null string \000
in q2 because Hampi returns the empty string.
Tim
Thanks for using the code and catching the bug. I have done the fix to
the code and added this regex to the sampleRegex file. Please update
your copy. Note that you need to enclose the regex with slashes on
both sides.
Regards
Devdatta
2010/1/22 Tim Hinrichs <thin...@gmail.com>:
I just updated the code, and have changed the API a little bit. Also I
have uploaded a large number of new regular expressions as test cases.
Please see the code hampiregexTest.pl to see how to use the new API.
I am interested in seeing regular expressions anyone of you finds
really common and/or bugs, so be sure to let me know if you find
anything interesting.
Regards
devdatta
2010/1/22 Devdatta <dev.a...@gmail.com>:
Can you write something: some info about it and some examples of how
to use it? I will then add a "news flash" item to the main page.
./adam
./ada,
Thanks again for writing the regexp converter. I've been using it
within a wrapper around Hampi, and it's been a life-saver. Here's a
brain dump of the additional features I would have liked and the
problems I ran into.
One feature I hacked in myself is the ability to create unique CFG
names for each regexp. This is invaluable when you're asking Hampi
to solve multiple (converted) regexps simultaneously. The interface
I used was simple: have TOTHEHAMPI take an argument that's meant to
be unique. Example.
my $p=pcre_tohampi->new(qr/abc/);
$p->parse;
print $p->tothehampi("_unique1_");
//toplevel flax_unique1_0
my $p=pcre_tohampi->new(qr/def/);
$p->parse;
print $p->tothehampi("_unique2_");
//toplevel flax_unique2_0
If I remember right, I only had to touch a handful of places in the
code. I don't know if there's gensym functionality in perl that
might be more appropriate, but this worked fine for me.
The only bugs I fixed were escape characters that weren't handled,
e.g. \(, \$, \^. That just took adding the appropriate entries to %
slashhash.
I also had some problems with the @ symbol, though it was never clear
whether the problem was in the regex converter, hampi, or (less
likely) my code. The fix was to add \@ to %slashhash and escape @ in
regexps.
Attached is the altered pcre_tohampi.pm.
Tim
Did you do an svn up before this? I had sent out a mail on how I put
in more testcases and changed the interface a bit. In the newer
version (Jan 27), slashhash contains 30 special chars (which I think
cover all the ones you have added). Also the code is a lot more
cleaner.
Regarding the other two points :
1> I think the feature you are talking about is very important. It
should be pretty trivial to add. I will make the change and do a
commit soon (hopefully by tomorrow). It just wasn't a part of my use
case - but now that I see your viewpoint its a no-brainer.
2> Regarding the '@' issue. I think this is because perl arrays start
with an @ which breaks the eval. I don't think we can do anything
about this other than escaping it and adding it to the slashhash. It
will be great if you can send me some sample test cases so that I can
add them to the list of tests.
I will hopefully have something ready for you by tomorrow.
Thanks a lot for using the tool!
Regards
Devdatta
I have pushed in all the changes I mentioned in the previous mail. I
have also added @ to slashhash.
Its an ugly hack, but I don't think I know anything better. Let me
talk to someone who actually knows perl and see if I can get to
something.
Cheers
Devdatta
Maybe you could escape the @ in the string you're given before the
eval. Still perhaps an ugly hack but at least one the user doesn't
do himself.
Here's an example I believe I had trouble with.
/[@a-zA-Z0-9\\.]+/
Tim
Thanks for the example. I think your method of escaping '@' is the
most pragmatic idea.
Another possibility is the use of ' instead of / , as the regex
delimiter. This turns off interpolation by the qr operator. Thus, eval
'qr'.'\'[@a-zA-Z0-9\\.]+\'' works directly. Ofcourse, now you need to
escape the single quote inside the regex. Turning off interpolation
is also much more safer. Depending on the type of input you have, you
decide to use either method. I will add a note to the wiki.
Cheers
Devdatta
[\\\^\$\(\)\[\]\{\}\.\*\?\|\+\- \w!@#%&:/="><;,'`~]+
Tim
fixed. Also added both examples to the sampleRegex file.
cheers
devdatta