Performance of regular expression matches

116 views
Skip to first unread message

Jan Krüger

unread,
Mar 14, 2024, 9:20:28 AMMar 14
to elixir-lang-core
I've recently had to work on a code base that parses largish RDF XML files. Part of the code base does relatively simple but regular expression matches, but since the files are large, quite a lot of Regex.run calls. While profiling I've noticed, that there are callouts to :erlang.system_info, which fetches the PCRE version BEAM was compiled against.

An example regular expression from the code base in question matches the schema part of a URL. I've replaced Regex.run with erlang's :re.run for testing purposes, and at least for this case, there performance gain is quite dramatic.

Comparing fprof results:

```
RDF.IRI.scheme/1                                               1176473   30615.618    2354.355
---
RDF.IRI.scheme/1                                               1176473    3531.955    2353.905
```

I found this thread in the google group, which actually talk about the reasoning for fetching the version, and proposes and alternative.

https://groups.google.com/g/elixir-lang-core/c/CgFdxIONvGg/m/HN9ryeVXAwAJ?pli=1

Especially

```
Taking a further look at the code, the issue with recompiling regexes on the fly is that it makes executing the regexes more expensive, as we need to compute the version on every execution. We could store the version in ETS but that would have performance issues. Storing in a persistent_term would be great, but at the moment we support Erlang/OTP 20+. Thoughts?
```

Since this has a fairly noticeable impact, at least on all tests I've run, I wanted to start a discussion, if this could be implemented/improved now.

José Valim

unread,
Mar 14, 2024, 11:37:43 AMMar 14
to elixir-l...@googlegroups.com
I have pushed a fix to main. But also note we provide precompiled Elixir versions per OTP version. Using a matching version will always give you the best results and that's not only about regexes. :)

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/44d498c7-82a4-46d2-89be-7919400e0297n%40googlegroups.com.

Jan Krüger

unread,
Mar 14, 2024, 12:13:38 PMMar 14
to elixir-lang-core
True, but reading the regex module code, it seems like the call to :re.version would always get made, regardless of which elixir/erlang version is in use? I'm quite confident that in all our environments the elixir and OTP versions match, so it's not that Regex.run is slow because it ends up in the code path that doesn't use the compiled regular expression. The overhead definitely came from the call to :re.version. At least for small regexes.

Anyway. Thanks for fixing it. We'll try it out, as soon as it's available in the next release.

Jan Krüger

unread,
Mar 14, 2024, 12:15:40 PMMar 14
to elixir-lang-core
I read the commit, and I don't it fixes what our actual problem was. See my comment above. The problem is the actual call to :re.version, not the recompilation of the regex

On Thursday, March 14, 2024 at 4:37:43 PM UTC+1 José Valim wrote:

José Valim

unread,
Mar 14, 2024, 12:53:28 PMMar 14
to elixir-l...@googlegroups.com
I will benchmark but I would be very surprised if :re.version() is the one to blame. It takes 3-4us on my machine.

marcel...@googlemail.com

unread,
Mar 14, 2024, 12:55:47 PMMar 14
to elixir-lang-core
I'm the maintainer of RDF.ex library with the RDF.IRI module mentioned in the OP. I can confirm that this fix doesn't affect the problem, since we're actually not using `URI.parse/1` most of the time (we use it only when dealing with relative URIs). Even in this case the `Regex.version/0` call in `Regex.safe_run/3` (https://github.com/elixir-lang/elixir/blob/b8fca42e58850b56f65d0fb8a2086f2636141f61/lib/elixir/lib/regex.ex#L533) still performs the `:erlang.system_info/0` call. 

Jan Krüger

unread,
Mar 14, 2024, 1:01:16 PMMar 14
to elixir-lang-core
Thanks a lot. I'm also happy to share our case, and my fprof results, if that helps. I am very sure that my erlang, and elixir versions match, on the machine where I've tested this. Replacing Regex.run with an identical call to :re.run should show the performance improvement I've mentioned. The regex we've tested this on is: 

~r/^([a-z][a-z0-9\+\-\.]*):/i

José Valim

unread,
Mar 14, 2024, 5:21:25 PMMar 14
to elixir-l...@googlegroups.com
Do you have benchmarks or only the fprof results? fprof is not a benchmarking tool: comparing fprof results from different code may be misleading. Proper benchmarking is preferrable. I am benchmarking locally and I cannot measure any relevant difference even with the whole version checking removed.

Jan Krüger

unread,
Mar 15, 2024, 2:20:11 AMMar 15
to elixir-lang-core
The difference was definitely measurable just in pure running time of the code, setting aside fprof. I'll post what I have after work today.

marcel...@googlemail.com

unread,
Mar 15, 2024, 2:53:47 AMMar 15
to elixir-lang-core
The benchmark results I'm getting are indeed not as dramatic as the fprof results, but on the other hand also more than the 5% mentioned in the PR which introduced the check: https://github.com/elixir-lang/elixir/pull/9040

```elixir
regex = ~r/^([a-z][a-z0-9\+\-\.]*):/i
re_pattern = regex.re_pattern

Benchee.run(%{
  "Regex.run/2" => fn -> Regex.run(regex, "foo") end,
  ":re.run/3" => fn -> :re.run("foo", re_pattern, [{:capture, :all, :binary}]) end
})
```

```
Name                  ips        average  deviation         median         99th %
:re.run/3          2.88 M      346.90 ns  ±3623.51%         333 ns         417 ns
Regex.run/2        1.98 M      504.74 ns  ±5851.21%         416 ns         542 ns

Comparison:
:re.run/3          2.88 M
Regex.run/2        1.98 M - 1.46x slower +157.84 ns
```

Manish sharma

unread,
Mar 15, 2024, 2:58:12 AMMar 15
to elixir-l...@googlegroups.com

How Machine Learning Services Help Business?

  • With Machine Learning consulting services businesses can consider cost reduction while boosting performance.
  • It helps organizations to timely finish the task with utmost accuracy.
  • Retrieve information using cutting edge software tools.
  • Machine learning works according to recent trends and specifications.
  • It automates the analysis of past patterns and historical data to predict the future.



--
Kind Regards, 
Manish Kr. Sharma 
Digital Marketing Manager




José Valim

unread,
Mar 15, 2024, 3:22:29 AMMar 15
to elixir-l...@googlegroups.com
The 5% also take into account the option processing and result handling. The version check itself is a subset of that. I was not able to measure sensible gains after removing it.

marcel...@googlemail.com

unread,
Mar 15, 2024, 3:26:21 AMMar 15
to elixir-lang-core
I quickly checked how a persistent term cached implementation would compare, which turned out to perform almost equivalent. It seems the :re.version and :erlang.system_info(:endian) values are already cached.

```elixir
defmodule RegexPersistent do
  def version do
    case :persistent_term.get(__MODULE__, nil) do
      nil ->
        version = {:re.version(), :erlang.system_info(:endian)}
        :persistent_term.put(__MODULE__, version)
        version

      version ->
        version
    end
  end

  defp safe_run(
         %Regex{re_pattern: compiled, source: source, re_version: version, opts: compile_opts},
         string,
         options
       ) do
    case version() do
      ^version -> :re.run(string, compiled, options)
      _ -> :re.run(string, source, translate_options(compile_opts, options))
    end
  end

  def run(%Regex{} = regex, string, options \\ []) when is_binary(string) do
    return = Keyword.get(options, :return, :binary)
    captures = Keyword.get(options, :capture, :all)
    offset = Keyword.get(options, :offset, 0)

    case safe_run(regex, string, [{:capture, captures, return}, {:offset, offset}]) do
      :nomatch -> nil
      :match -> []
      {:match, results} -> results
    end
  end

  defp translate_options(<<?u, t::binary>>, acc), do: translate_options(t, [:unicode, :ucp | acc])
  defp translate_options(<<?i, t::binary>>, acc), do: translate_options(t, [:caseless | acc])
  defp translate_options(<<?x, t::binary>>, acc), do: translate_options(t, [:extended | acc])
  defp translate_options(<<?f, t::binary>>, acc), do: translate_options(t, [:firstline | acc])
  defp translate_options(<<?U, t::binary>>, acc), do: translate_options(t, [:ungreedy | acc])

  defp translate_options(<<?s, t::binary>>, acc),
    do: translate_options(t, [:dotall, {:newline, :anycrlf} | acc])

  defp translate_options(<<?m, t::binary>>, acc), do: translate_options(t, [:multiline | acc])

  defp translate_options(<<?r, t::binary>>, acc) do
    IO.warn("the /r modifier in regular expressions is deprecated, please use /U instead")
    translate_options(t, [:ungreedy | acc])
  end

  defp translate_options(<<>>, acc), do: acc
  defp translate_options(rest, _acc), do: {:error, rest}
end


regex = ~r/^([a-z][a-z0-9\+\-\.]*):/i
re_regex = regex.re_pattern


Benchee.run(%{
  "Regex.run/2" => fn -> Regex.run(regex, "foo") end,
  "RegexPersistent.run/2" => fn -> RegexPersistent.run(regex, "foo") end,
  ":re.run/3" => fn -> :re.run("foo", re_regex, [{:capture, :all, :binary}]) end
})
```

Results:
```
:re.run/3                    2.72 M      367.06 ns  ±3579.21%         333 ns         458 ns
Regex.run/2                  1.79 M      557.84 ns  ±5817.83%         417 ns         542 ns
RegexPersistent.run/2        1.79 M      558.06 ns  ±7018.25%         375 ns         541 ns

Comparison:
:re.run/3                    2.72 M
Regex.run/2                  1.79 M - 1.52x slower +190.77 ns
RegexPersistent.run/2        1.79 M - 1.52x slower +190.99 ns
```

Jan Krüger

unread,
Mar 15, 2024, 3:31:10 AMMar 15
to elixir-lang-core
Alright. If you can't see it, then it must have been something in my environment. What I did when working on this is run fprof to identify potential performance problems, and the version checked showed up as a substantial part of the time spent in the regex code. Is that a valid use of fprof in your opinion? Since we're running this in a very tight loop I actually also wanted to get rid of the keyword.get calls when running regexes, and swapped out Regex.run with :re.run, and that substantially improved the performance overall.

I think I didn't then go, and profile specifically if removing the version check alone will improve the performance by itself. So all I have to back up that the version check is the root cause, is fprof.

José Valim

unread,
Mar 15, 2024, 3:39:38 AMMar 15
to elixir-l...@googlegroups.com
fprof is great at telling what in a given workflow is taking time but comparing fprof results won't tell you by how much it got faster. For that you will have to benchmark it again. For tight-loops though, I can see how removing the version check, option handling and everything else speeds up performance. I think it is fine to go that route if you need to.

I am also not sure if fprof will consider the time spent on NIFs. I assume most time is spent on the regex engine but if that is not fully considered in fprof, that could affect measurements. But I am speculating here, I truly don't know. :)

Jan Krüger

unread,
Mar 15, 2024, 3:47:10 AMMar 15
to elixir-lang-core
The NIFs might be an explanation, for why this shows up as a larger part of the execution time, than it actually is. I hadn't considered that.

It probably makes sense for us to keep the :re.run in any event. I think the motivation for the thread was also just to give a heads up that there might be more of a performance issue here, than you guys assumed when introducing this version check. If it turns out to be a mirage, then I guess that's just as well :)
Reply all
Reply to author
Forward
0 new messages