How much costful is to instantiate a RegExp?

82 views
Skip to first unread message

Mateus Felipe Cordeiro Caetano Pinto (Twinsen)

unread,
Oct 11, 2022, 7:11:04 AM10/11/22
to Dart Misc
Hello,

The title of my message says it all.

I know that Dart had a lot of problems with RegExp performance. As far as I know, it has been solved through a refactor in the RegExp engine, although I couldn't find any detailed information on this.

However, my question is more related to how the object itself works.

In a code review of a code in my company, one of my coworkers made the following code:

static final _nonDigitsRegExp = RegExp(r'\D');
static final _onlyDigitsRegExp = RegExp(r'\d');

Accordingly to him, this was done to avoid calling the RegExp constructor everytime it's needed, as "RegExp can be expensive to create".

I feel like this is a premature micro-optimization that does not make much sense. Also, I think that the cost of the RegExp is paid whenever it's called (e.g. `firstMatch`), instead of when the object is created, although I have no evidence on this. He said that he would agree it me "if [I] can show that the compiler can optimize it".

Does his reasoning makes sense, or does mine? Or maybe is there something both of us didn't consider?

Is there any documentation that could give us more insight on this matter?

Lasse Reichstein Holst Nielsen

unread,
Mar 8, 2023, 11:21:59 AM3/8/23
to Dart Misc, Mateus Felipe Cordeiro Caetano Pinto (Twinsen)
> I know that Dart had a lot of problems with RegExp performance. As far as I know, it has been solved through a refactor in the RegExp engine, although I couldn't find any detailed information on this.

Not sure what this refers to, unless it's from before Dart switched from PCRE to the Irregexp engine. That was eons ago.


It's hard to say what creating an RegExp costs, because it depends on the regexp pattern. And how the implementation optimizes, which is always going to be speculative,
and will likely differ between platforms (native vs. web).

It's never free. Calling the `RegExp` constructor includes parsing the pattern (unless runtime internals cache between creations using the same pattern and flags, and who knows if it does),
and parsing is one of the more expensive tasks that a program can do. Parsing is generally expensive at the CPU level because it's usually full of unpredictable branches.
It needs to happen when calling the `RegExp` constructor, because that's when it must throw if the input is not valid.
But RegExp patterns are usually short, so again it's hard to say something definite.

The main compilation cost may be paid on first call (and maybe a later call too), since the implementation may compile separately for one-byte and two-byte string inputs,
so it needs to see the first input before knowing what kind of input to compile for. 
The backend might share compilation artifacts between RegExp instances with the same pattern. Or it might not.

But it's not free to create the object, it always performs a memory allocation, and it needs to check the pattern syntax.

I'd personally always create a variable caching a regexp that I expect to reuse repeatedly.
But I'd cache any other cacheable object allocations that I expect to repeat as well (if I can't use a const object creation inline, and RegExps cannot be const).

This is performance. If the code is not on a critical path, maybe performance doesn't matter.
And if it is on a critical path, you should have measurements showing where the pain points are.


But I personally prefer to avoid creating RegExps repeatedly too, because it's unnecessary overhead, and it has unpredictable performance, because it 
depends so much on which optimizations apply, and I prefer to not having to guess about those.

But that's just, like, my opinion man.
/L

Lasse R.H. Nielsen

unread,
Mar 9, 2023, 9:16:32 AM3/9/23
to Mateus Felipe Cordeiro Caetano Pinto (Twinsen), Dart Misc
On Wed, Mar 8, 2023 at 9:00 PM Mateus Felipe Cordeiro Caetano Pinto (Twinsen) <mateu...@gmail.com> wrote:

> Calling the `RegExp` constructor includes parsing the pattern (unless runtime internals cache between creations using the same pattern and flags, and who knows if it does),
and parsing is one of the more expensive tasks that a program can do.

I didn't consider that the parsing would happen in the construction of the object, I though it would happen when comparing and the parsing would be then cached somehow. Good to know.

> It needs to happen when calling the `RegExp` constructor, because that's when it must throw if the input is not valid.

A little side-topic, but if we had RegExp literals could we optimize away this cost?

Maybe!
If RegExp literals require a compile-time constant pattern, then the compiler could check that the pattern is valid, and the runtime can delay parsing it until first use, because it's known to not throw.

We won't get RegExp literals, though. Definitely not worth the syntactic cost, because raw strings gets us almost all the benefits that JavaScript gets from RegExp literals (and with much fewer parsing issues - parsing regexp literals is one of the hard parts of parsing JavaScript).

An alternative is that compilers get smart and recognize constant arguments to the `RegExp` constructor and pre-check the input,
then direct the object creation through a different path for pre-checked inputs so they won't be eagerly parsed at runtime.
That's possible today. (Heck, maybe it's being done today, I don't know. And what happens when compiling to JavaScript is two levels of unknowns.)

It's not clear that it'll be a win, because most code won't create a RegExp that it doesn't actually use.
The final variables used to cache here are all lazy-initialized, so the program won't even create the RegExp object unless the variable is read.

Probably won't cost much either.
 

> But I personally prefer to avoid creating RegExps repeatedly too, because it's unnecessary overhead, and it has unpredictable performance, because it 
depends so much on which optimizations apply, and I prefer to not having to guess about those.

Thanks, your opinion on the topic is much valuable! I think by coworker was generally right, after all!

I'd at least say that they weren't wrong. Whether anyone is strictly "right" here is not something I'd answer without actual performance measurements. 

Cheers!
/L


--
Lasse R.H. Nielsen - l...@google.com  
'Faith without judgement merely degrades the spirit divine'
Google Denmark ApS - Frederiksborggade 20B, 1 sal - 1360 København K - Denmark - CVR nr. 28 86 69 84

Mateus Felipe Cordeiro Caetano Pinto (Twinsen)

unread,
Mar 9, 2023, 9:20:58 AM3/9/23
to Dart Misc, Lasse Reichstein Holst Nielsen, Mateus Felipe Cordeiro Caetano Pinto (Twinsen)
> Not sure what this refers to, unless it's from before Dart switched from PCRE to the Irregexp engine. That was eons ago.

That's exactly the case. This is why I don't even considered it.

> Calling the `RegExp` constructor includes parsing the pattern (unless runtime internals cache between creations using the same pattern and flags, and who knows if it does),
and parsing is one of the more expensive tasks that a program can do.

I didn't consider that the parsing would happen in the construction of the object, I though it would happen when comparing and the parsing would be then cached somehow. Good to know.

> It needs to happen when calling the `RegExp` constructor, because that's when it must throw if the input is not valid.

A little side-topic, but if we had RegExp literals could we optimize away this cost?

> But I personally prefer to avoid creating RegExps repeatedly too, because it's unnecessary overhead, and it has unpredictable performance, because it 
depends so much on which optimizations apply, and I prefer to not having to guess about those.

Thanks, your opinion on the topic is much valuable! I think by coworker was generally right, after all!


Em quarta-feira, 8 de março de 2023 às 13:21:59 UTC-3, Lasse Reichstein Holst Nielsen escreveu:
Reply all
Reply to author
Forward
0 new messages