Greetings,
We are using RE2/J to attempt to detect which pattern matched an input, out of a fairly long disjunction of patterns. In other words, we are using the engine like the following (simplified example).
if (Pattern.compile("abcd|efgh|ijkl").matches(input)) {
// pattern set 1 matched
} else if (Pattern.compile("mnop|qrst|uvwx").matches(input)) {
// pattern set 2 matched
}
// etc.
This is working well, but I was trying to figure out if it would be more efficient to somehow have a single RE2 instance, and use some other kind of metadata to detect whether "pattern set 1" or "pattern set 2" had actually matched. So first I tried updating to RE2/J 1.3, which adds support for named groups, and writing something like
Matcher matcher = Pattern.compile("(?P<1>abcd|efgh|ijkl)|(?P<2>mnop|qrst|uvwx)").matcher(input);
try {
matcher.group("1");
// pattern set 1 matched
} catch (IllegalArgumentException e) {
// pattern set 1 did not match
}
// etc.
This was unsurprisingly very slow, and not just because of the exception handling (I made a quick tweak to add a hasGroup method to avoid the exception handling overhead, which didn't really help).
Next, I tried implementing a new "mark" and instruction in RE2, to simply mark an integer in a bitmask during machine evaluation. So instead of the above, we could have
Matcher matcher = Pattern.compile("(?M<1>abcd|efgh|ijkl)|(?M<2>mnop|qrst|uvwx)").matcher(input);
if (matcher.getMarks().get(1)) {
// pattern set 1 matched
} else if (matcher.getMarks().get(2)) {
// pattern set 2 matched
}
// etc.
You can see the commit that implements this
here. It seems to work fine (aside from some corner cases, which I would need to fix if proceeding), but performance is actually not very good when I plug this into our "real" code. I'm not sure if that is due to the fact that certain optimizations in Parser are unavailable to me when using the marks (ex: maybeConcat), or if it's because a longer NFA of the disjunction of two patterns performs worse than two separate NFA instances for each separate pattern, beyond a certain point of complexity, or if it's due to some other reason.
Does this seem like it has the potential to work, if I implement more optimizations? Is the general idea fine but I'm on the wrong track? Or is what I'm trying to accomplish fundamentally impossible? Any feedback is welcome; thanks.