Questions:
*) Are there compelling reasons to use icu::Regex(Matcher|Pattern) instead of RE2 in the browser and net code or is this just historical cruft?
*) Would anyone object to switching the ~10 uses of icu::Regex over to RE2 and banning further use of ICU's regular expression support in chromium?
The rest of this email is historical background - feel free to skip it if you have informed opinions on the first two questions or don't care.
Currently we use at least 5 different regular expression engines inside chromium:
*) ICU
*) RE2
*) V8 (irregexp)
*) libxml
*) Yarr
We don't use PCRE any more (I hope) although we did in the distant past.
ICU has a regular expression library that's been used in various places in chromium since time immemorial. It's currently used in the download, autofill, policy and signin systems (in the browser), some IDN detection logic in net/, and a pair of one-offs in the renderer. At a casual glance, the regular expression in all of these cases appears to come from C++ code. It also appears that most, if not all, of the code using ICU's regular expression support predates the addition of RE2 to the tree.
RE2 was added to the tree in July 2012 following this thread:
https://groups.google.com/a/chromium.org/d/msg/chromium-dev/1QVicr35r74/r0XYqsOrEdkJ with the code review here:
https://chromiumcodereview.appspot.com/10575037. ICU regex wasn't used there because the regular expression itself was provided by an extension, not C++ code, and ICU has exponential runtime on some backtracking regexes where RE2 has guaranteed polynomial execution time for arbitrary regexes. Since landing, the use of RE2 has spread to some chromeos code, other parts of the extension system, and some gpu infrastructure, and I recently added a caller in the blocked plugin implementation.
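To make the exponential-versus-polynomial point concrete, here is a toy C++ sketch. It is not how ICU or RE2 is actually implemented, and every name in it (Tok, MakePattern, BacktrackMatch, SimulateMatch) is made up for illustration. It only handles patterns that are a sequence of required or optional literal characters, which is enough to express the classic bad case a?a?...a?aa...a (k optional 'a's followed by k required 'a's) matched against a string of k 'a's. One matcher backtracks, the other tracks the set of reachable pattern positions, which is the general idea behind automata-based engines; both count the work they do.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One pattern token: a literal character that is either required or optional.
struct Tok {
  char c;
  bool optional;
};

// Builds k optional 'a's followed by k required 'a's (hypothetical helper).
std::vector<Tok> MakePattern(std::size_t k) {
  std::vector<Tok> p;
  for (std::size_t i = 0; i < k; ++i) p.push_back({'a', true});
  for (std::size_t i = 0; i < k; ++i) p.push_back({'a', false});
  return p;
}

// Backtracking full match: at each optional token, first try to consume a
// character, then fall back to skipping the token. `steps` counts calls.
bool BacktrackMatch(const std::vector<Tok>& p, std::size_t i,
                    const std::string& s, std::size_t j,
                    std::uint64_t& steps) {
  ++steps;
  if (i == p.size()) return j == s.size();
  const bool can_consume = j < s.size() && s[j] == p[i].c;
  if (p[i].optional) {
    if (can_consume && BacktrackMatch(p, i + 1, s, j + 1, steps)) return true;
    return BacktrackMatch(p, i + 1, s, j, steps);
  }
  return can_consume && BacktrackMatch(p, i + 1, s, j + 1, steps);
}

// State-set full match: keep the set of pattern positions reachable after
// each input character. Work is bounded by pattern length * input length.
bool SimulateMatch(const std::vector<Tok>& p, const std::string& s,
                   std::uint64_t& steps) {
  const std::size_t n = p.size();
  std::vector<bool> cur(n + 1, false), next(n + 1, false);
  // An optional token can be skipped without consuming input. Skips only
  // move from position i to i + 1, so one left-to-right pass is enough.
  auto close = [&](std::vector<bool>& st) {
    for (std::size_t i = 0; i < n; ++i) {
      ++steps;
      if (st[i] && p[i].optional) st[i + 1] = true;
    }
  };
  cur[0] = true;
  close(cur);
  for (char ch : s) {
    std::fill(next.begin(), next.end(), false);
    for (std::size_t i = 0; i < n; ++i) {
      ++steps;
      if (cur[i] && p[i].c == ch) next[i + 1] = true;
    }
    close(next);
    cur.swap(next);
  }
  return cur[n];
}
```

With MakePattern(20) and a string of 20 'a's, both functions report a match, but the backtracking version takes over a million steps while the state-set version takes fewer than two thousand. With a pattern supplied by an extension rather than by our own C++ code, we can't rule out inputs shaped like this.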
V8 has a JIT regular expression engine used for javascript called irregexp. It can only be used within a V8 context and has security implications (since it uses RWX pages) so it isn't used outside of executing javascript.
libxml's regular expression engine is used by libxml itself and as far as I can tell is not used anywhere else in chromium.
Yarr is a regular expression engine that's part of JavaScriptCore, the javascript engine used by Safari and some other WebKit ports. We use it in a tiny number of places in WebKit and (until very recently) exposed it via WebKit API to use in a few places in chromium. This is the only reason we maintain project files for the JavaScriptCore part of the WebKit repository. Interestingly, yarr is also used by mozilla:
http://mxr.mozilla.org/mozilla-central/source/js/src/yarr/YarrParser.h.
Of these, ICU, RE2 and Yarr are the only reasonably general-purpose engines. Given how rarely we use Yarr I plan to just get rid of our uses and drop the dependency completely. That leaves ICU's regular expression engine and RE2. Do we really need both? We depend on ICU for many things, so we clearly can't stop using the library completely. But given that there appear to be use cases that require RE2, I think we'd be better off using just one regular expression library from chromium code instead of two.
- James