Efficiently detecting what patterns matched without capturing?

68 views
Skip to first unread message

jeffrey.w...@gmail.com

unread,
Sep 9, 2019, 2:06:52 PM9/9/19
to re2j-discuss
Greetings,

We are using RE2/J to attempt to detect which pattern matched an input, out of a fairly long disjunction of patterns.  In other words, we are using the engine like the following (simplified example).

if (Pattern.compile("abcd|efgh|ijkl").matches(input)) {
  // pattern set 1 matched
} else if (Pattern.compile("mnop|qrst|uvwx").matches(input)) {
  // pattern set 2 matched
}
// etc.

This is working well, but I was trying to figure out if it would be more efficient to somehow have a single RE2 instance, and use some other kind of metadata to detect whether "pattern set 1" or "pattern set 2" had actually matched.  So first I tried updating to RE2/J 1.3, which adds support for named groups, and writing something like

Matcher matcher = Pattern.compile("(?P<1>abcd|efgh|ijkl)|(?P<2>mnop|qrst|uvwx)").matcher(input);
try {
  matcher.group("1");
  // pattern set 1 matched
} catch (IllegalArgumentException e) {
  // pattern set 1 did not match
}
// etc.

This was unsurprisingly very slow, and not just because of the exception handling (I made a quick tweak to add a hasGroup method to avoid the exception handling overhead, which didn't really help).

Next, I tried implementing a new "mark" and instruction in RE2, to simply mark an integer in a bitmask during machine evaluation.  So instead of the above, we could have

Matcher matcher = Pattern.compile("(?M<1>abcd|efgh|ijkl)|(?M<2>mnop|qrst|uvwx)").matcher(input);
if (matcher.getMarks().get(1)) {
  // pattern set 1 matched
} else if (matcher.getMarks().get(2)) {
  // pattern set 2 matched
}
// etc.

You can see the commit that implements this here.  It seems to work fine (aside from some corner cases, which I would need to fix if proceeding), but performance is actually not very good when I plug this into our "real" code.  I'm not sure if that is due to the fact that certain optimizations in Parser are unavailable to me when using the marks (ex: maybeConcat), or if it's because a longer NFA of the disjunction of two patterns performs worse than two separate NFA instances for each separate pattern, beyond a certain point of complexity, or if it's due to some other reason.

Does this seem like it has the potential to work, if I implement more optimizations?  Is the general idea fine but I'm on the wrong track?  Or is what I'm trying to accomplish fundamentally impossible?  Any feedback is welcome; thanks.
Reply all
Reply to author
Forward
0 new messages