Efficiently detecting what patterns matched without capturing?

68 views

Skip to first unread message

jeffrey.w...@gmail.com

unread,

Sep 9, 2019, 2:06:52 PM9/9/19

to re2j-discuss

Greetings,

We are using RE2/J to attempt to detect which pattern matched an input, out of a fairly long disjunction of patterns. In other words, we are using the engine like the following (simplified example).

if (Pattern.compile("abcd|efgh|ijkl").matches(input)) {

// pattern set 1 matched

} else if (Pattern.compile("mnop|qrst|uvwx").matches(input)) {

// pattern set 2 matched

}

// etc.

This is working well, but I was trying to figure out if it would be more efficient to somehow have a single RE2 instance, and use some other kind of metadata to detect whether "pattern set 1" or "pattern set 2" had actually matched. So first I tried updating to RE2/J 1.3, which adds support for named groups, and writing something like

try {

matcher.group("1");

// pattern set 1 matched

} catch (IllegalArgumentException e) {

// pattern set 1 did not match

}

// etc.

This was unsurprisingly very slow, and not just because of the exception handling (I made a quick tweak to add a hasGroup method to avoid the exception handling overhead, which didn't really help).

Next, I tried implementing a new "mark" and instruction in RE2, to simply mark an integer in a bitmask during machine evaluation. So instead of the above, we could have

if (matcher.getMarks().get(1)) {

// pattern set 1 matched

} else if (matcher.getMarks().get(2)) {

// pattern set 2 matched

}

// etc.

You can see the commit that implements this here. It seems to work fine (aside from some corner cases, which I would need to fix if proceeding), but performance is actually not very good when I plug this into our "real" code. I'm not sure if that is due to the fact that certain optimizations in Parser are unavailable to me when using the marks (ex: maybeConcat), or if it's because a longer NFA of the disjunction of two patterns performs worse than two separate NFA instances for each separate pattern, beyond a certain point of complexity, or if it's due to some other reason.

Does this seem like it has the potential to work, if I implement more optimizations? Is the general idea fine but I'm on the wrong track? Or is what I'm trying to accomplish fundamentally impossible? Any feedback is welcome; thanks.

Reply all

Reply to author

Forward

0 new messages