Avoiding/predicting long match times

67 views
Skip to first unread message

Lukas Backström

unread,
Oct 31, 2024, 8:25:40 AM10/31/24
to PCRE2 discussion list
Hello!

For many years the Erlang/OTP (https://erlang.org) project has used a modified version of pcre to provide regexp to the systems built using Erlang/OTP. The modifications we did to pcre were in order to preserve the soft real-time properties that Erlang guarantees. We added many yielding points inside pcre so that when a match took long time, we could yield execution and then continue executing later on.

We have been postponing migrating to anothing regexp library for many years, but it is now hightime to do the switch and we are wondering how to achieve something similar with pcre2, hopefully without modifying it.

One thing that has changed in Erlang since pcre was originally integrated is the possibility to run long jobs in a separate thread pool, but this decision needs to be done before the call to match and always using the thread pool is too expensive for normal usage of pcre.

So, we are looking for a way to either limit or predict the worstcase running time for a match. We have three ideas about how to do this so far and wanted to see if anyone here has any other ideas or sees issues with our ideas.

1. Supply max X characters to match and then if partial match is returned we continue running in the thread pool.
2. Use pcre2_pattern_info (or something else) to predict if any constructs used can cause very long running times.
3. Use re2 instead (something we would rather avoid as it would break backwards compatability, introduce a new dependency and other things).
4. Any other ideas that we have not thought of?

Lukas
Erlang/OTP core team

Giuseppe D'Angelo

unread,
Oct 31, 2024, 8:49:59 AM10/31/24
to Lukas Backström, PCRE2 discussion list
Hello,
There is already a facility to limit "how many steps" PCRE2 takes
during matching before returning an error, see pcre2_set_match_limit.
You could set it to a sufficient low number, and if the match fails
because of the limit being reached, do the match on another thread
(with a higher limit).
However this won't give you any guarantees of running time; and the
definition of what a step is is not really well-defined (esp. when JIT
is used) so the limit might end up being a magic number.

Hope this helps,
--
Giuseppe D'Angelo

Lukas Larsson

unread,
Oct 31, 2024, 9:01:56 AM10/31/24
to Giuseppe D'Angelo, PCRE2 discussion list
Hello!

We have looked at using that and it might be a good option. What we wondered here is whether each "step" represents roughly the same amount of time? Or can there be steps that are vastly more costly than other steps?

Philip Hazel

unread,
Oct 31, 2024, 12:32:39 PM10/31/24
to Lukas Larsson, Giuseppe D'Angelo, PCRE2 discussion list
> We have looked at using that and it might be a good option. What we wondered here is whether each "step" represents roughly the same amount of time? Or can there be steps that are vastly  more costly than other steps?

In the interpreter, each "step" processes one opcode in the compiled regex, but the times will vary. For example, "match one character" obviously takes less time than "match one character any number of times". But perhaps the difference isn't vast. I assume that the "steps" in the JIT are similar, but I do not actually know. You could do a temporary hack to pcre2_match.c to insert a timer for the main loop and then a pile of tests to see how different the times are.

Another possibility, but much more expensive, would be to make use of the "auto callout" facility. This inserts a callout before each pattern item. See pcre2api and pcre2callout man pages.

Philip

Lukas Backström

unread,
Mar 5, 2025, 2:49:55 AMMar 5
to PCRE2 discussion list
Hello!

In case anyone is interested, we opted for doing doing a similar approach to what we did for pcre, that is we added our own instrumentation to pcre2 to yield in specific places. You can see the PR with the end result here: https://github.com/erlang/otp/pull/9299

Lukas

Nicholas Wilson

unread,
Mar 5, 2025, 9:03:19 AMMar 5
to PCRE2 discussion list
Hi Lukas,

That's great to hear!

I have been thinking about your use-case since you last posted. I am sorry that things are slow to change in PCRE2 (since we only work in our free time).

I tried to see your pull request, but couldn't see what were the changes you made. I imagine you added yielding to the main matching loop in the regex interpreter?

As a long-term goal I could see us (potentially) making the following changes:
* Work harder to ensure that regex compilation is linear-time (at the moment it's at last worst-case O(n^2) in a handful of places). Knowing that compilation always completes in a reasonable length of time would be preferable to the current situation.
* Accept a patch, or write one ourselves, to allow for incremental/yieldable matching
* There are just a handful of regex operations which can take a long time - making these yieldable would be hard. A good example is "(x*)Y\1" which will match "xxxxxYxxxxx". The "\1" is a single atomic operation in our matcher, but it can examine an arbitrary amount of text (bounded only by the length of the input string - not the length of the pattern string). There would be, possibly, ways to make these yieldable as well.

Would these changes help you?
Are you interested in improvements beyond simply making matching yieldable?

Thank you very much for coming back to the thread to update us on your progress.

Nick

Fredrik Frantzen

unread,
Mar 11, 2025, 4:56:25 AMMar 11
to PCRE2 discussion list
Hi Nicholas,

As you imagined, we added yielding to the main matching loop, and unicode validation. We are interested in support for yielding for regex compilation as well.

Yes we would like to have yielding support pushed upstream, feel free to review the changes so that we can make it happen or salvage what you need.

Let me know if I can be of assistance. I am currently uplifting the changes to 10.45.

BR
Fredrik

Nicholas Wilson

unread,
Mar 17, 2025, 12:46:58 PMMar 17
to PCRE2 discussion list
Thank you very much Fredrik and Lukas.

(I tried to reply before, but the list seems to have swallowed my message. Apologies if this is a duplicate reply.)

I won't be able to look at this straight away. It's a relatively invasive change that will require some careful thought, some careful testing, and maybe a bit of API bikeshedding (to make sure this feature is useful and suitable for all downstream users).

Your branch looks very useful though, and I will see if I can tackle this for the 10.46 release in six months' time.

I've made a ticket to ensure it's not forgotten: https://github.com/PCRE2Project/pcre2/issues/730

I'll let you know when I come round to revisiting this.

Nick

Reply all
Reply to author
Forward
0 new messages