I want to implement an optimization for trailing match all expressions. An example is global replace operations like say host extraction from a url: ^https?://(?:www\.)?([^/]+)/.*$' —> ‘\1’. The engine will match the group, reach the match all at the end and spend linear time walking it even when it doesn’t have to.
I attempted to implement this optimization here
https://github.com/google/re2/pull/607. I tried to keep it simple:
- after parsing the regexp, find if the optimization applies: trailing greedy match all, dot matches newline, oneline mode
- rebuild a stripped regexp without the match all and flag this in Prog. If there was a capture group, note its index
- run matching on the stripped regexp and use the unmatched portion of the input string to assign to the group if any
The approach is simple and verified by tests and benchmarks that it works. It has two drawbacks though:
1. after stripping the trailing match all from the compiled regex there is a round trip to string and compilation again
2. each RE2 object holds a second RE2 object if the optimization applies
Questions:
1. Is this optimization desirable?
2. If yes, is this implementation approach acceptable? Is there a better way?
Cheers,