Regular expression libraries in chromium episode [1-9][0-9]*

James Robinson

unread,

Mar 26, 2013, 6:14:23 PM3/26/13

to Chromium-dev

Questions:

*) Are there compelling reasons to use icu::Regex(Matcher|Pattern) instead of RE2 in the browser and net code or is this just historical cruft?

*) Would anyone object to switching the ~10 uses of icu::Regex over to re2 and banning further use of icu's regular expression use in chromium?

The rest of this email is historical background - feel free to skip it if you have informed opinions on the first two questions or don't care.

Currently we use at least 5 different regular expression engines inside chromium:

*) ICU

*) RE2

*) V8 (irregexp)

*) libxml

*) Yarr

We don't use PCRE any more (I hope) although we did in the distant past.

ICU has a regular expression library that's been used in various places in chromium since time immemorial. It's currently used in the download, autofill, policy and signin systems (in the browser), some IDN detection logic in net/, and a pair of one-offs in the renderer. At a mostly casual glance it looks like in these cases the regular expression is always coming from C++ code. It appears the most, if not all, of the code using ICU's regular expression support originates from before RE2 was added to the tree.

RE2 was added to the tree in July 2012 following this thread: https://groups.google.com/a/chromium.org/d/msg/chromium-dev/1QVicr35r74/r0XYqsOrEdkJ with the code review here: https://chromiumcodereview.appspot.com/10575037. The reason that ICU regex wasn't used here was the regular expression itself was provided from an extension, not C++ code, and ICU has exponential runtime on some backtracking regexes where RE2 has guaranteed polynomial execution time for arbitrary regexes. Since landing, the use of RE2 has spread to some chromeos code, other parts of the extension system, some gpu infrastructure and I recently added a caller in the blocked plugin implementation.

V8 has a JIT regular expression engine used for javascript called irrexp. It can only be used within a V8 context and has security implications (since it uses RWX pages) so it isn't used outside of executing javascript.

libxml's regular expression engine is used by libxml itself and as far as I can tell is not used anywhere else in chromium.

Yarr is a regular expression engine that's part of JavaScriptCore, the javascript engine used by Safari and some other WebKit ports. We use it in a tiny number of places in WebKit and (until very recently) exposed it via WebKit API to use in a few places in chromium. This is the only reason we maintain project files for the JavaScriptCore part of the WebKit repository. Interestingly, yarr is also used by mozilla: http://mxr.mozilla.org/mozilla-central/source/js/src/yarr/YarrParser.h.

Of these, ICU RE2 and Yarr are the only reasonably general-purpose engines. Given how rarely we use Yarr I plan to just get rid of our uses and drop the dependency completely. That leaves ICU's regular expression engine and RE2. Do we really need both? We depend on ICU for many things, so we clearly can't stop using the library completely, but given that there appear to be use cases that require us to use RE2 I think we'd be better off switching to using just one regular expression library from chromium code instead of two.

- James

Tony Chang

unread,

Mar 26, 2013, 7:23:19 PM3/26/13

to James Robinson, Chromium-dev, Jungshik Shin (신정식, 申政湜)

I was concerned that RE2 wasn't unicode aware, but James showed me that it is and by default assumes that the input is UTF-8. AFAIK, that's why we were using ICU's regular expression library in the first place, but it should be OK to standardize on RE2.

Jungshik may know additional reasons for using ICU instead of RE2, but as long as the autofill tests continue to pass, seems like it would be OK to switch to RE2.

tony

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev

Ilya Sherman

unread,

Mar 26, 2013, 7:53:31 PM3/26/13

to Tony Chang, James Robinson, Chromium-dev, Jungshik Shin (신정식, 申政湜)

When we compared the two for Autofill about a year ago, we found that ICU was *much* faster than RE2 in practice (assuming you have cached regexes, as Autofill does). This performance impact is quite noticeable for Autofill, so I'd really prefer not to remove ICU unless RE2's performance has somehow improved dramatically since then.

James Robinson

unread,

Mar 26, 2013, 8:11:12 PM3/26/13

to Ilya Sherman, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜)

Do you have a performance test I could run to validate and profile the performance difference? I do notice that currently we have the autofill regexes stored as utf8 string literals in the compiled binary which we then inflate into UTF16 at runtime, which can't be great.

- James

Ilya Sherman

unread,

Mar 26, 2013, 9:40:10 PM3/26/13

to James Robinson, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜), David Holloway

cc: David Holloway

I don't have a performance test on hand, no. I think we wrote a one-off test at the time, but David might remember if we had something more permanent.

On Tue, Mar 26, 2013 at 5:11 PM, James Robinson <jam...@google.com> wrote:

Do you have a performance test I could run to validate and profile the performance difference? I do notice that currently we have the autofill regexes stored as utf8 string literals in the compiled binary which we then inflate into UTF16 at runtime, which can't be great.

I'm pretty sure the regex comparisons far outweigh the string copies, but I don't have any numbers on hand to back that statement up with.

James Robinson

unread,

Mar 26, 2013, 9:56:22 PM3/26/13

to Ilya Sherman, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜), David Holloway

On Tue, Mar 26, 2013 at 6:40 PM, Ilya Sherman <ishe...@chromium.org> wrote:

cc: David Holloway

I don't have a performance test on hand, no. I think we wrote a one-off test at the time, but David might remember if we had something more permanent.

The statements "performance is important here" and "there is no performance test" are mutually exclusive. If there's no performance test, how do you know if the performance has regressed since you did the initial investigation?

I could try something synthetic like running the autofill regexes against something like the text content of our page cyclers, but that wouldn't reflect how we actually use the library very well.

Elliott Sprehn

unread,

Mar 27, 2013, 12:20:13 PM3/27/13

to James Robinson, Ilya Sherman, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜), David Holloway

If we don't even have a test we should probably just switch to RE2 and see what happens. :)

Nico Weber

unread,

Mar 27, 2013, 12:49:04 PM3/27/13

to Elliott Sprehn, James Robinson, Ilya Sherman, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜), David Holloway

On Wed, Mar 27, 2013 at 9:20 AM, Elliott Sprehn <esp...@chromium.org> wrote:

If we don't even have a test we should probably just switch to RE2 and see what happens. :)

I think it's fairly well-known that re2 is on the slow side. Since you're now aware of potential performance problems, I don't think you should move ahead with this without at least some simple manual checking. (Yes, there should have been a perf test.)

Elliott Sprehn

unread,

Mar 27, 2013, 12:51:05 PM3/27/13

to Nico Weber, James Robinson, Ilya Sherman, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜), David Holloway

On Wed, Mar 27, 2013 at 12:49 PM, Nico Weber <tha...@chromium.org> wrote:

On Wed, Mar 27, 2013 at 9:20 AM, Elliott Sprehn <esp...@chromium.org> wrote:

If we don't even have a test we should probably just switch to RE2 and see what happens. :)

I think it's fairly well-known that re2 is on the slow side. Since you're now aware of potential performance problems, I don't think you should move ahead with this without at least some simple manual checking. (Yes, there should have been a perf test.)

So if we know RE2 is slow, why not switch to ICU and get rid of RE2? It doesn't make sense that we're trying to eliminate the known faster one. Why do we need RE2?

- E

Nico Weber

unread,

Mar 27, 2013, 12:53:57 PM3/27/13

to Elliott Sprehn, James Robinson, Ilya Sherman, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜), David Holloway

See upthread: RE2 doesn't take exponential time for malicious input while most other engines do. So it's often used with potentially malicious inputs.

- E

Darin Fisher

unread,

Mar 27, 2013, 1:12:25 PM3/27/13

to Nico Weber, Elliott Sprehn, James Robinson, Ilya Sherman, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜), David Holloway

Any thoughts on how to turn this from lore into something self-documenting? Maybe a common interface to the two engines with factory methods and comments that call attention to the differences?

Ilya Sherman

unread,

Mar 27, 2013, 8:39:30 PM3/27/13

to James Robinson, Tony Chang, Chromium-dev, Jungshik Shin (신정식, 申政湜), David Holloway

I found some data from about a year ago that the libphonenumber team compiled. TL;DR: RE2 is faster in some cases and slower in others; libphonenumber ended up using a combination of the two. Note that Chrome compiles in the libphonenumber library.

With RE2:

Benchmark Time(ns) CPU(ns) Iterations

-------------------------------------------------------------

BM_ParseAmericanNumbers/0 49864 45258 14583
BM_ParseAmericanNumbers/1 205461 206298 3684

BM_ValidateI18nNumbers/0 37923 37000 20000

BM_ValidateI18nNumbers/1 181166 164567 3889
BM_ValidateAmericanNumbers/0 27674 27857 28000

BM_ValidateAmericanNumbers/1 245842 245171 2692

BM_ValidateNzNumbers/0 15981 14814 41176
BM_ValidateNzNumbers/1 122266 123436 5833

BM_FormatAmericanNumbers/0 18891 17972 41176

BM_FormatAmericanNumbers/1 11495 12000 66667
BM_ParseE164Numbers/1 129224 124420 5385

With ICU Regexp:

Benchmark Time(ns) CPU(ns) Iterations

-------------------------------------------------------------

BM_ParseAmericanNumbers/0 51273 57000 10000
BM_ParseAmericanNumbers/1 737510 740000 1000

BM_ValidateI18nNumbers/0 37110 39087 19444

BM_ValidateI18nNumbers/1 367595 358306 1842
BM_ValidateAmericanNumbers/0 27322 27386 25926

BM_ValidateAmericanNumbers/1 476626 463398 1489

BM_ValidateNzNumbers/0 15830 16900 53846
BM_ValidateNzNumbers/1 261247 264000 2500

BM_FormatAmericanNumbers/0 19554 17371 43750

BM_FormatAmericanNumbers/1 11523 10928 46667
BM_ParseE164Numbers/1 651815 630000 1000

I think that in these benchmarks, the regexp compile time should be negligible since we cache the compiled regexps. We had some severe performance issues though with RE2 (x15 slower) and very large regexps containing some Unicode characters if I remember correctly. That is why we use the ICU regexp engine in the phone number matcher (not covered by the benchmarks above) only when it's more appropriate. That means that if you use the PhoneNumberMatcher class, you will have to keep using ICU (or both RE2 and ICU).

I still seem to recall that we ran some benchmarks for the Autofill code that showed ICU to be noticeably faster for our use case, but I couldn't find any record of this in my email history. Unfortunately, the team member who worked on this, and hence would be most likely to remember the details, is no longer working either on Chromium or at Google :(

Jungshik Shin (신정식, 申政湜)

unread,

Mar 28, 2013, 6:04:58 PM3/28/13

to Tony Chang, James Robinson, Chromium-dev

Sorry for the late reply. I'm replying JFYI (because it seems that there are other reasons - mainly perf - to keep ICU RE).

2013/3/26 Tony Chang <to...@chromium.org>

I was concerned that RE2 wasn't unicode aware, but James showed me that it is and by default assumes that the input is UTF-8. AFAIK, that's why we were using ICU's regular expression library in the first place, but it should be OK to standardize on RE2.

Does RE2 support all the Unicode character classes, script names, etc? Ok. I found that it does ( https://re2.googlecode.com/hg/doc/syntax.html ). A cursory comparision with ICU RE ( http://userguide.icu-project.org/strings/regexp ) seems that both are on par in terms of the capability (of course, we don't use all the features although we may in the future).

Then, the remaining question would be how often its Unicode data is updated. (well, Chrome's copy of ICU hasn't been updated for a while, either so that I'm not too much worried about it unless it lags behind multi-years).

Jungshik

James Robinson

unread,

Mar 28, 2013, 6:32:30 PM3/28/13

to Jungshik Shin (신정식, 申政湜), Tony Chang, Chromium-dev

On Thu, Mar 28, 2013 at 3:04 PM, Jungshik Shin (신정식, 申政湜) <js...@chromium.org> wrote:

Sorry for the late reply. I'm replying JFYI (because it seems that there are other reasons - mainly perf - to keep ICU RE).

2013/3/26 Tony Chang <to...@chromium.org>

I was concerned that RE2 wasn't unicode aware, but James showed me that it is and by default assumes that the input is UTF-8. AFAIK, that's why we were using ICU's regular expression library in the first place, but it should be OK to standardize on RE2.

Does RE2 support all the Unicode character classes, script names, etc? Ok. I found that it does ( https://re2.googlecode.com/hg/doc/syntax.html ). A cursory comparision with ICU RE ( http://userguide.icu-project.org/strings/regexp ) seems that both are on par in terms of the capability (of course, we don't use all the features although we may in the future).

Then, the remaining question would be how often its Unicode data is updated. (well, Chrome's copy of ICU hasn't been updated for a while, either so that I'm not too much worried about it unless it lags behind multi-years).

OK, thanks. It appears that the unicode data is checked in and updated by a set of scripts that point to http://www.unicode.org/Public/6.0.0/ucd/. I'm not sure how often this is updated but I have confirmed that the files we have checked in match what is on unicode.org today. Is that a reasonable process? Would it be helpful or desirable to use the same data tables for RE2 as we do for ICU?

Assuming this is somewhat sane then the only remaining question is performance. The only hard data benchmarking data I've seen so far has shown RE2 faster nearly all the time. There's anecdotal evidence of ICU being faster for some workloads but nobody seems to have any benchmarks or tests showing this. I'll try to reproduce these claims and characterize them. We should definitely be using the fastest library we can and understand which is faster. It sounds like the answer might be RE2 unless the source data has certain classes of Unicode code points but we could do a runtime switch on that if it was worthwhile.

- James

Jeffrey Yasskin

unread,

Mar 28, 2013, 11:49:35 PM3/28/13

to James Robinson, Jungshik Shin (신정식, 申政湜), Tony Chang, Chromium-dev

2¢:
* RE2 doesn't expose us to DoS attacks. My understanding is that Russ
compromised the average-case performance some in order to bound the
worst-cast running time.

* If you use ICU for a regex whose source is checked into Chrome, and
then you run it over user-controlled data, you need to be sure that
this regex doesn't exhibit exponential time on any user input. I,
personally, don't know how to do that, and
http://userguide.icu-project.org/strings/regexp#TOC-Performance-Tips
is more a set of debugging guidelines than rules for reviewers to
follow.

Jeffrey

Jungshik Shin (신정식, 申政湜)

unread,

Mar 29, 2013, 2:19:49 PM3/29/13

to James Robinson, Tony Chang, Chromium-dev

2013/3/28 James Robinson <jam...@google.com>

On Thu, Mar 28, 2013 at 3:04 PM, Jungshik Shin (신정식, 申政湜) <js...@chromium.org> wrote:

Sorry for the late reply. I'm replying JFYI (because it seems that there are other reasons - mainly perf - to keep ICU RE).

2013/3/26 Tony Chang <to...@chromium.org>

I was concerned that RE2 wasn't unicode aware, but James showed me that it is and by default assumes that the input is UTF-8. AFAIK, that's why we were using ICU's regular expression library in the first place, but it should be OK to standardize on RE2.

Does RE2 support all the Unicode character classes, script names, etc? Ok. I found that it does ( https://re2.googlecode.com/hg/doc/syntax.html ). A cursory comparision with ICU RE ( http://userguide.icu-project.org/strings/regexp ) seems that both are on par in terms of the capability (of course, we don't use all the features although we may in the future).

Then, the remaining question would be how often its Unicode data is updated. (well, Chrome's copy of ICU hasn't been updated for a while, either so that I'm not too much worried about it unless it lags behind multi-years).

OK, thanks. It appears that the unicode data is checked in and updated by a set of scripts that point to http://www.unicode.org/Public/6.0.0/ucd/. I'm not sure how often this is updated but I have confirmed that the files we have checked in match what is on unicode.org today. Is that a reasonable process? Would it be helpful or desirable to use the same data tables for RE2 as we do for ICU?

Yes, the process to update the Unicode data for RE2 seems reasonable.

BTW, I took a closer look at the differences between ICU RE and RE2. I went to https://www.corp.google.com/~rsc/re2-designdoc-nonconf.html to find this:

For internationalized character classes, RE2 implements the Unicode 5.1 General Category property (e.g., \pN or \p{Lu}) as well as the Unicode Script property (e.g., \p{Greek}). These should be used whenever matches are not intended to be limited to ASCII characters (e.g., \pN or \p{Nd} instead of [[:digit:]] or \d). RE2 does not implement the other Unicode properties (see Unicode Technical Standard #18: Unicode Regular Expressions).

OTOH, according to http://userguide.icu-project.org/strings/regexp , ICU Regular Expressions conform to Unicode Technical Standard #18 , Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2.

Level 1 defined in UTR 18 includes more than what's supported by RE2. For instance, set operations (intersection, union, difference) do not seem to be supported by RE2 while they're by ICU RE. So, there are some features missing in RE2 but available in ICU RE.

However, I have not found Chrome's code using these missing features (unless Autofill or libphonenumber regex patterns use them, which I didn't check). This does not mean that we'll never use them (e.g. "domain name spoofing" check in net/base may use more features later).

In some cases (net/base : IDN functions), having to convert UTF-16 strings to UTF-8 to use RE2 may introduce a small (likely to be with almost no user-perceptible impact) perf hit.

Given all these, at this point, it looks OK to switch if Autofill's potential perf concerns are resolved/addressed.

Jungshik

Reply all

Reply to author

Forward