Re: [re2-dev] Similar API like test_search() in python

156 views
Skip to first unread message
Message has been deleted

Paul Wankadia

unread,
Nov 19, 2015, 6:18:14 AM11/19/15
to swati, re2...@googlegroups.com
On Thu, Nov 19, 2015 at 9:52 PM, swati <swatiup...@gmail.com> wrote:

 I am new to re2 and I want to use re2 for my app.My objective is to search from number of pattern, which pattern match with a given message. Which of the re2 API is best suited for my objective,which will take less time then test_search.

It sounds like RE2::Set (https://github.com/google/re2/blob/master/re2/set.h) might be what you want.

Message has been deleted

Paul Wankadia

unread,
Nov 20, 2015, 7:17:38 AM11/20/15
to swati upadhyaya, re2...@googlegroups.com
On Fri, Nov 20, 2015 at 9:56 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

thanks Paul.. its working..
what i have to do in order to get match portion as key value pair.Is there any api which will give me like hash ?

RE2 supports named capturing groups, but has no equivalent to Python's groupdict() method. Instead, there are mappings from capturing group name to submatch array index and vice versa. Please see https://github.com/google/re2/blob/master/re2/re2.h#L450-L459 for further information.

swati upadhyaya

unread,
Nov 23, 2015, 12:25:26 AM11/23/15
to Paul Wankadia, re2...@googlegroups.com
Hi All,
I have checked the banchmark of some regex lib. re2 has the best But when I have checked with one pattern and one message,its taking lotes of time to process and getting the match portion.its taking 225 matching per sec .Below is my code in cpp also the pattern and msg is present.I am looping for 50 K time the same pattern and mesg.

const char* msg = "<14>Mar 2 11:34:38 89.237.143.23 MSWinEventLog 1 Security 6500 Fri Mar 02 11:34:37 2012 4816\tMicrosoft-Windows-Security-Auditing\tN/A\tN/A\tSuccess Audit\txyz.abc.com\tUser Logoff\tRPC detected an integrity violation while decrypting an incoming message.     Peer Name: %1    Protocol Sequence: %2    Security Error: %3";
std::string pattern = "MSWinEventLog\\s*(?:(?:(?:\\s+)))\\s*(?:\\s*(?:(?:(?:\\d\\s+)))\\s*)?\\s*(?:(?P<event_log__string>(?:\\S+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:\\s+)))\\s*\\s*(?:(?P<event_id__0>(?:4816)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?P<event_source__all>(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?P<event_category__all>(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*RPC\\s*(?:(?P<action__0>(?:detected)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?P<object__0>(?:integrity violation)))\\s*\\s*(?:(?:(?:while decrypting an incoming message.*?)))\\s*Peer\\ Name\\:\\s*(?:(?P<peer_name__all>(?:.*?)))\\s*Protocol\\s*(?:(?:(?:.*?)))\\s*Security\\ Error\\:\\s*(?:(?P<error_code__0>(?:\\S+)))\\s*";
g_test_timer_start ();
      for(count ; count < loop_limit ; count++){
    string str(msg);
    re2::StringPiece input(str);

    re2::RE2 re(pattern);

    int groupSize = re.NumberOfCapturingGroups();

    vector<re2::RE2::Arg> argv(groupSize);
      vector<re2::RE2::Arg*> args(groupSize); 
      vector<re2::StringPiece> ws(groupSize); 
      for (int i = 0; i < groupSize; ++i) { 
        args[i] = &argv[i]; 
        argv[i] = &ws[i]; 
      } 
  re2::RE2::PartialMatchN(input, re, &(args[0]), groupSize); 
 
  string s;
  //for (int i = 0; i < groupSize; ++i){
    //std::cout << ws[i] << std::endl << "\n" ;
 // }


    const map<int,string>& name= re.CapturingGroupNames();
   



  


    map<int,string> ret = name;
    map<int,string> :: iterator iter;
    iter = ret.begin();

    re2::StringPiece match[groupSize];

    int length = str.length();
    std::string event;
    struct timeval start, end;
    }
    double timeForLoopLimit = g_test_timer_elapsed ();
      std::cout << "\nsimpleCEFParser; Ran simpleCEFParser for " <<loop_limit << "times" << "; time_taken = " << timeForLoopLimit  << "seconds, " << " speed =" << loop_limit/timeForLoopLimit;



I am not sure whether I have done some mistake or re2 match itself is slow?

Paul Wankadia

unread,
Nov 23, 2015, 6:13:34 AM11/23/15
to swati upadhyaya, re2...@googlegroups.com
On Mon, Nov 23, 2015 at 4:25 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

std::string pattern = "MSWinEventLog\\s*(?:(?:(?:\\s+)))\\s*(?:\\s*(?:(?:(?:\\d\\s+)))\\s*)?\\s*(?:(?P<event_log__string>(?:\\S+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:\\s+)))\\s*\\s*(?:(?P<event_id__0>(?:4816)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?P<event_source__all>(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*\\s*(?:(?P<event_category__all>(?:.*?)))\\s*\\s*(?:(?:(?:[\t]+)))\\s*RPC\\s*(?:(?P<action__0>(?:detected)))\\s*\\s*(?:(?:(?:.*?)))\\s*\\s*(?:(?P<object__0>(?:integrity violation)))\\s*\\s*(?:(?:(?:while decrypting an incoming message.*?)))\\s*Peer\\ Name\\:\\s*(?:(?P<peer_name__all>(?:.*?)))\\s*Protocol\\s*(?:(?:(?:.*?)))\\s*Security\\ Error\\:\\s*(?:(?P<error_code__0>(?:\\S+)))\\s*";

I'm guessing that this regular expression was generated by a program, so I'm not going to analyse it, but I'm happy to see a use case in the wild for the coalesce logic that I implemented this year. :)

g_test_timer_start ();
      for(count ; count < loop_limit ; count++){

Sorry, this is the wrong place to start the timer and begin the loop. The biggest problem is the repeated RE2 object construction, which is going to be rather expensive for such a lengthy and complicated regular expression. However, there is also the repeated string and vector object construction, which will be much less expensive, but not free.

  re2::RE2::PartialMatchN(input, re, &(args[0]), groupSize); 

This is what you should time and loop – and nothing else.

swati upadhyaya

unread,
Nov 23, 2015, 11:37:07 PM11/23/15
to Paul Wankadia, re2...@googlegroups.com
Hi Paul,
As you suggested I have start the timer before PartialMatchN and end it after the same and loop it for 50K times. its accessing 900 message per second.Also I have checked with creating re object for the same pattern its creating re2 object 8700 per sec(checked with loop of 50K). So matching itself is slowi think
for python also i have checked its giving key value pair but for 50000 times its 125 per sec.


Actually I have checked with different library PCRE also its speed is much faster for the same pattern and message.but for some other pattern its not returning.I dont know why....But for the same pattern its processing and giving key value pair around 15K per second which is very fast..

I heard and also read that re2 is the best among all the regex lib. So trying to find if i am doing something wrong so its taking so much of time compared to pcre..

Thanks

Paul Wankadia

unread,
Nov 24, 2015, 2:00:49 AM11/24/15
to swati upadhyaya, re2...@googlegroups.com
On Tue, Nov 24, 2015 at 3:37 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

As you suggested I have start the timer before PartialMatchN and end it after the same and loop it for 50K times. its accessing 900 message per second.Also I have checked with creating re object for the same pattern its creating re2 object 8700 per sec(checked with loop of 50K). So matching itself is slowi think 
for python also i have checked its giving key value pair but for 50000 times its 125 per sec.


Actually I have checked with different library PCRE also its speed is much faster for the same pattern and message.but for some other pattern its not returning.I dont know why....But for the same pattern its processing and giving key value pair around 15K per second which is very fast..

I heard and also read that re2 is the best among all the regex lib. So trying to find if i am doing something wrong so its taking so much of time compared to pcre..

I had a hunch that the regular expression was "too big", so I ran some of my own benchmarks:

BM_MSWinEventLog_UTF8_NoSubmatches       1000000              2270 ns/op         137.86 MB/s
BM_MSWinEventLog_Latin1_NoSubmatches     1000000              2049 ns/op         152.69 MB/s
re2/testing/regexp_benchmark.cc:1495: NumberOfCapturingGroups = 8
re2/testing/regexp_benchmark.cc:1496: ProgramSize = 632
re2/testing/regexp_benchmark.cc:1499: ProgramFanout = 5
re2/testing/regexp_benchmark.cc:1500: Histogram of fanout:
re2/testing/regexp_benchmark.cc:1504:   0: 186
re2/testing/regexp_benchmark.cc:1504:   2: 21
re2/testing/regexp_benchmark.cc:1504:   3: 1
re2/testing/regexp_benchmark.cc:1504:   4: 32
re2/testing/regexp_benchmark.cc:1504:   5: 2
BM_MSWinEventLog_UTF8_Submatches            2000           1564023 ns/op           0.20 MB/s
re2/testing/regexp_benchmark.cc:1530: NumberOfCapturingGroups = 8
re2/testing/regexp_benchmark.cc:1531: ProgramSize = 398
re2/testing/regexp_benchmark.cc:1534: ProgramFanout = 4
re2/testing/regexp_benchmark.cc:1535: Histogram of fanout:
re2/testing/regexp_benchmark.cc:1539:   0: 108
re2/testing/regexp_benchmark.cc:1539:   2: 21
re2/testing/regexp_benchmark.cc:1539:   3: 16
re2/testing/regexp_benchmark.cc:1539:   4: 19
BM_MSWinEventLog_Latin1_Submatches         50000             31536 ns/op           9.93 MB/s

As you can see, using Latin-1 (by specifying RE2::Latin1 as an argument to the RE2 constructor) appears to be roughly fifty times faster – but why? :)

The answer is that the regular expression becomes small enough to use the BitState execution engine, which is much faster than the NFA execution engine. https://github.com/google/re2/blob/master/re2/re2.cc#L635-L643 shows you the (somewhat arbitrary) limits that RE2 imposes. As per the numbers above, the regular expression uses 632 instructions in UTF-8, but only 398 instructions in Latin-1, so the latter is under the BitState limit of 500 instructions. (Note also that the input string must not be "too big" either.)

IMO, it would be best if you could optimise the regular expression, but I can appreciate that this might not be an easy solution. Simply increasing the BitState limits in your copy of the RE2 code would be another possibility, I suppose. ;)

swati upadhyaya

unread,
Nov 24, 2015, 4:14:49 AM11/24/15
to Paul Wankadia, re2...@googlegroups.com
Thank you Paul, your trick works for me..that is i have increased the BitState limits and the speed increases for match.But is  doing so is ok? or it has some other effects?

For me I have to search from number of patterns which one is matching which I am doing using set as you suggested.then from that matched pattern I have to extract key value pair for which I have to create re2 object. So for every that matched pattern I have to create the re2 object and then match it. So I have to take benchmark of overall process of matching.

1) creating re2 object
2) matching to extract key value pair

Only matching is increased also overall is also increased previously it was around 200 now its 300 per sec :)

Paul Wankadia

unread,
Nov 24, 2015, 5:29:00 AM11/24/15
to swati upadhyaya, re2...@googlegroups.com
On Tue, Nov 24, 2015 at 8:14 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

Thank you Paul, your trick works for me..that is i have increased the BitState limits and the speed increases for match.But is  doing so is ok? or it has some other effects?

The number of bits allocated per execution is the product of the number of instructions and the length of the input string, but is limited by MaxBitStateVector, so raising that limit increases the maximum memory footprint. This means that bigger regular expressions and/or longer input strings can be accommodated, but that costs memory.

For me I have to search from number of patterns which one is matching which I am doing using set as you suggested.then from that matched pattern I have to extract key value pair for which I have to create re2 object. So for every that matched pattern I have to create the re2 object and then match it. So I have to take benchmark of overall process of matching.

1) creating re2 object
2) matching to extract key value pair

Only matching is increased also overall is also increased previously it was around 200 now its 300 per sec :)

If you build a vector of RE2 objects as you add to the RE2::Set such that each regular expression has the same index in both structures, then you can just map back to the RE2 object instead of constructing and destructing one every time. That will speed things up a wee bit more. (Plus any entries in the DFA state cache will not be wasted.)

swati upadhyaya

unread,
Nov 27, 2015, 6:07:16 AM11/27/15
to Paul Wankadia, re2...@googlegroups.com
Hi Paul,
I didnt fine any bitstate kind of thing in python API .So for python i cannot do anything?

Paul Wankadia

unread,
Nov 28, 2015, 1:13:19 AM11/28/15
to swati upadhyaya, re2...@googlegroups.com
On Fri, Nov 27, 2015 at 10:07 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

I didnt fine any bitstate kind of thing in python API .So for python i cannot do anything? 

I'm not sure what you mean. Are you referring to https://github.com/facebook/pyre2/ or to the Python re module? The former is just a wrapper, so linking against your tweaked RE2 library will be sufficient. The latter is an entirely different implementation of regular expressions and I have approximately zero knowledge of its internals.

swati upadhyaya

unread,
Dec 1, 2015, 6:07:34 AM12/1/15
to Paul Wankadia, re2...@googlegroups.com
Hi paul i am using the https://github.com/facebook/pyre2/ but as u suggested i have linked my re2 lib its working fine with .compile and groupdict api but my code is using test_search so its crashing

 matching = pattern.test_search(line)
 AttributeError: 're2.Pattern' object has no attribute 'test_search'

Paul Wankadia

unread,
Dec 2, 2015, 12:51:28 AM12/2/15
to swati upadhyaya, re2...@googlegroups.com
On Tue, Dec 1, 2015 at 10:07 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

Hi paul i am using the https://github.com/facebook/pyre2/ but as u suggested i have linked my re2 lib its working fine with .compile and groupdict api but my code is using test_search so its crashing

 matching = pattern.test_search(line)
 AttributeError: 're2.Pattern' object has no attribute 'test_search'

I'm not too familiar with the wrapper, but https://github.com/facebook/pyre2/blob/master/_re2.cc#L158 suggests that the type should actually be _re2.RE2_Regexp, so it does seem as though you have the wrong kind of object there.

swati upadhyaya

unread,
Dec 9, 2015, 12:52:17 AM12/9/15
to Paul Wankadia, re2...@googlegroups.com
Hi
Is re2 has some limitation related to extracting named group key value pair. For me I have a regex pattern that has 109 captured named group and in python re2.compile is throwing exception "sorry, but this version only supports 100 named groups". I am using the latest git code of re2.. Could you help

Thanks

Paul Wankadia

unread,
Dec 9, 2015, 1:02:43 AM12/9/15
to swati upadhyaya, re2...@googlegroups.com
On Wed, Dec 9, 2015 at 4:52 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

Is re2 has some limitation related to extracting named group key value pair. For me I have a regex pattern that has 109 captured named group and in python re2.compile is throwing exception "sorry, but this version only supports 100 named groups". I am using the latest git code of re2.. Could you help

Googling for the "sorry, but this version only supports 100 named groups" error message indicates that it comes from the Python re module, not from RE2.

swati upadhyaya

unread,
Dec 21, 2015, 1:02:23 AM12/21/15
to Paul Wankadia, re2...@googlegroups.com
Hi
I found
git://github.com/axiak/pyre2.git also is it also using re2 ?because changing the bitstate and rebuilding it is not working 

Paul Wankadia

unread,
Dec 21, 2015, 1:17:30 AM12/21/15
to swati upadhyaya, re2...@googlegroups.com
On Mon, Dec 21, 2015 at 5:02 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

I found 
git://github.com/axiak/pyre2.git also is it also using re2 ?because changing the bitstate and rebuilding it is not working 
Do you have more than one version of the RE2 library installed on your system? If so, have you checked that the Python wrapper is using your customised version of the RE2 library?

swati upadhyaya

unread,
Dec 21, 2015, 1:23:41 AM12/21/15
to Paul Wankadia, re2...@googlegroups.com
For my test code its using my customised version and increases the speed but when I am testing with the real time system its not increasing the speed although I have checkek the re2.so installed time its showing current time. Also re2.compile is giving me some other object not re2.Pattern. So I thought that my real time system is using the above code thats why its not increasing the speed also. So just need to confirm because https://pypi.python.org/pypi/re2/ here it is mentioned that its using re2 wrapper. I am a bit confused in between the two pyre2

Thanks

Paul Wankadia

unread,
Dec 21, 2015, 1:46:25 AM12/21/15
to swati upadhyaya, re2...@googlegroups.com
On Mon, Dec 21, 2015 at 5:23 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

For my test code its using my customised version and increases the speed but when I am testing with the real time system its not increasing the speed although I have checkek the re2.so installed time its showing current time. Also re2.compile is giving me some other object not re2.Pattern. So I thought that my real time system is using the above code thats why its not increasing the speed also. So just need to confirm because https://pypi.python.org/pypi/re2/ here it is mentioned that its using re2 wrapper. I am a bit confused in between the two pyre2

I have no experience with either version, but having just looked at both repositories, I can now tell you that they are completely different implementations, so your confusion is entirely reasonable. Axiak's pyre2 module is absolutely nothing like Facebook's pyre2 module under the hood and they evidently have user-visible API differences as well. The way forward now is to ensure that whichever version you have in production is what you use for developing and testing. :(

swati upadhyaya

unread,
Dec 21, 2015, 4:05:16 AM12/21/15
to Paul Wankadia, re2...@googlegroups.com
So the above https://pypi.python.org/pypi/re2/ is not using re2? :(

Paul Wankadia

unread,
Dec 21, 2015, 9:23:44 PM12/21/15
to swati upadhyaya, re2...@googlegroups.com
On Mon, Dec 21, 2015 at 8:05 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

So the above https://pypi.python.org/pypi/re2/ is not using re2? :(

That appears to be Axiak's version, but both versions are wrapping the RE2 library, so you should be able to make your changes work either way.

swati upadhyaya

unread,
Dec 23, 2015, 5:03:17 AM12/23/15
to Paul Wankadia, re2...@googlegroups.com
I have checked we re2 iits allowing duplicate keynames for example
(?:(?P<event_log__string>(?:\\S+)))(?:(?P<event_log__string>(?:\\S+))) re2 is compiling this pattern
Is it not ambiguous as we have the same group captured name

Also checked with re its raise exception
 redefinition of group name 'event_log__string' as group 2; was group 1


Is this the limitation of re2?

swati upadhyaya

unread,
Dec 24, 2015, 2:29:59 AM12/24/15
to Paul Wankadia, re2...@googlegroups.com
Anyone check this? whether its a limitation of re2 ?

Paul Wankadia

unread,
Dec 24, 2015, 9:18:50 AM12/24/15
to swati upadhyaya, re2...@googlegroups.com
On Wed, Dec 23, 2015 at 9:03 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

I have checked we re2 iits allowing duplicate keynames for example
(?:(?P<event_log__string>(?:\\S+)))(?:(?P<event_log__string>(?:\\S+))) re2 is compiling this pattern
Is it not ambiguous as we have the same group captured name

Also checked with re its raise exception
 redefinition of group name 'event_log__string' as group 2; was group 1


Is this the limitation of re2?

I presume that the decision was made to follow Perl in this regard by considering the leftmost capturing group to have the name. If you need to detect such regular expressions, I suggest comparing the number of map elements returned by NamedCapturingGroups() with the number of map elements returned by CapturingGroupNames().

swati upadhyaya

unread,
Dec 24, 2015, 10:41:58 PM12/24/15
to Paul Wankadia, re2...@googlegroups.com
Thanks but i have to use python not cpp :(

Paul Wankadia

unread,
Dec 25, 2015, 10:21:57 PM12/25/15
to swati upadhyaya, re2...@googlegroups.com
On Fri, Dec 25, 2015 at 2:41 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

Thanks but i have to use python not cpp :(

If you want to raise a Python exception in order to emulate what the Python re module does here, then you need to customise the Python wrapper. Calling PyErr_SetString() is the easiest way to raise a Python exception from C/C++ code, I believe, so this will hopefully be fairly straightforward.

swati upadhyaya

unread,
Jan 6, 2016, 1:07:10 AM1/6/16
to Paul Wankadia, re2...@googlegroups.com
Hello,
(?P<_datetime_11>(1[0-2]|0[1-9]|(?<![1-3])[1-9])[/-]\d{1,2}[/-]20\d{2}T\d{2}:\d{2}:\d{2}[.]\d+[+|-]\d{2}:\d{2})

re2 is unable to compile this and giving error
re2.error: (11, 'invalid perl operator: (?<')

re2 doesnt support lookbehind assertion?

Paul Wankadia

unread,
Jan 6, 2016, 1:41:54 AM1/6/16
to swati upadhyaya, re2...@googlegroups.com
On Wed, Jan 6, 2016 at 5:07 PM, swati upadhyaya <swatiup...@gmail.com> wrote:

(?P<_datetime_11>(1[0-2]|0[1-9]|(?<![1-3])[1-9])[/-]\d{1,2}[/-]20\d{2}T\d{2}:\d{2}:\d{2}[.]\d+[+|-]\d{2}:\d{2})

re2 is unable to compile this and giving error
re2.error: (11, 'invalid perl operator: (?<')

re2 doesnt support lookbehind assertion?

Sorry, RE2 doesn't support look-ahead and look-behind assertions. (It also doesn't support backreferences.)

Reply all
Reply to author
Forward
0 new messages