[S] Change in code/re2[main]: reduce string copies in prefilter

0 views
Skip to first unread message

Jim Quinn (Gerrit)

unread,
Oct 6, 2022, 10:43:32 AM10/6/22
to re2...@googlegroups.com

Jim Quinn has uploaded this change for review.

View Change

reduce string copies in prefilter

When prefilter operates on longer regexes with many alterantes, the
merging of a and b into ab results in unnecessary string shuffling. By
moving the larger of the two directly into ab, then inserting the
smaller contents, we can avoid that work.

(If C++17 were enabled for RE2, we could use std::set<>::merge to make this
cheaper, but that's not possible in C++11.)

Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
---
M re2/prefilter.cc
1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/re2/prefilter.cc b/re2/prefilter.cc
index a8af821..f5ed631 100644
--- a/re2/prefilter.cc
+++ b/re2/prefilter.cc
@@ -8,6 +8,7 @@
#include <stdint.h>
#include <string>
#include <vector>
+#include <utility>

#include "util/util.h"
#include "util/logging.h"
@@ -286,13 +287,6 @@
return "";
}

-// Add the strings from src to dst.
-static void CopyIn(const std::set<std::string>& src,
- std::set<std::string>* dst) {
- for (ConstSSIter i = src.begin(); i != src.end(); ++i)
- dst->insert(*i);
-}
-
// Add the cross-product of a and b to dst.
// (For each string i in a and j in b, add i+j.)
static void CrossProduct(const std::set<std::string>& a,
@@ -343,8 +337,13 @@
Info *ab = new Info();

if (a->is_exact_ && b->is_exact_) {
- CopyIn(a->exact_, &ab->exact_);
- CopyIn(b->exact_, &ab->exact_);
+ // Avoid string copies by moving the larger exact_.set into
+ // ab directly, then merge in the smaller set.
+ if (a->exact_.size() < b->exact_.size())
+ std::swap(a, b);
+
+ ab->exact_ = std::move(a->exact_);
+ ab->exact_.insert(b->exact_.begin(), b->exact_.end());
ab->is_exact_ = true;
} else {
// Either a or b has is_exact_ = false. If the other

To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

Gerrit-Project: re2
Gerrit-Branch: main
Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
Gerrit-Change-Number: 60490
Gerrit-PatchSet: 1
Gerrit-Owner: Jim Quinn <jimq...@google.com>
Gerrit-MessageType: newchange

Jim Quinn (Gerrit)

unread,
Oct 6, 2022, 10:45:28 AM10/6/22
to re2...@googlegroups.com

Jim Quinn uploaded patch set #2 to this change.

View Change

reduce string copies in prefilter

When prefilter operates on longer regexes with many alterantes, the
merging of a and b into ab results in unnecessary string shuffling. By
moving the larger of the two directly into ab, then inserting the
smaller contents, we can avoid that work.

(If C++17 were enabled for RE2, we could use std::set<>::merge to make this cheaper, but that's not possible in C++11.)

Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
---
M re2/prefilter.cc
1 file changed, 24 insertions(+), 9 deletions(-)

To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

Gerrit-Project: re2
Gerrit-Branch: main
Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
Gerrit-Change-Number: 60490
Gerrit-PatchSet: 2
Gerrit-Owner: Jim Quinn <jimq...@google.com>
Gerrit-MessageType: newpatchset

Jim Quinn (Gerrit)

unread,
Oct 6, 2022, 10:46:44 AM10/6/22
to re2...@googlegroups.com

Jim Quinn uploaded patch set #3 to this change.

View Change

reduce string copies in prefilter

When prefilter operates on longer regexes with many alternates, the

merging of a and b into ab results in unnecessary string shuffling. By
moving the larger of the two directly into ab, then inserting the
smaller contents, we can avoid that work.

(If C++17 were enabled for RE2, we could use std::set<>::merge to make this cheaper, but that's not possible in C++11.)

Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
---
M re2/prefilter.cc
1 file changed, 24 insertions(+), 9 deletions(-)

To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

Gerrit-Project: re2
Gerrit-Branch: main
Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
Gerrit-Change-Number: 60490
Gerrit-PatchSet: 3

Akshay Jaggi (Gerrit)

unread,
Oct 7, 2022, 10:14:25 AM10/7/22
to Jim Quinn, re2...@googlegroups.com

Attention is currently required from: Jim Quinn.

View Change

1 comment:

  • File re2/prefilter.cc:

    • Patch Set #1, Line 345: ab->exact_

      There is a subtle difference in logic between the previous code and this version.

      In the earlier version, content of a->exact_ and b->exact_ was copied (set-unioned) into ab->exact_.

      In this version, ab->exact_ = a->exact_ UNION b->exact_.

      The difference being that any pre-existing values in ab->exact_ (that may be defined by the constructor of Info()) will get lost in the second version.

      Now looking at the Info class, it doesn't seem possible right now, but we don't know how this file is touched or evolves in the future.

      I was thinking of the following two options:
      1. simply do ab->exact_.insert(a) and ab->exact_.insert(b), keeping the current behavior intact.
      2. overload the constructor of Info to take two sets, whose union it stores in exact_. Then, we simply call Info *ab = Info(a->exact_, b->exact_).
      Doing so will ensure anyone in future changing the constructor to set any pre-existing value in exact_ does not need to worry about the code here.

      2 can follow 1 in a separate CL as well. 😊

To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

Gerrit-Project: re2
Gerrit-Branch: main
Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
Gerrit-Change-Number: 60490
Gerrit-PatchSet: 3
Gerrit-Owner: Jim Quinn <jimq...@google.com>
Gerrit-CC: Akshay Jaggi <a...@google.com>
Gerrit-Attention: Jim Quinn <jimq...@google.com>
Gerrit-Comment-Date: Fri, 07 Oct 2022 14:14:20 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Gerrit-MessageType: comment

Akshay Jaggi (Gerrit)

unread,
Oct 7, 2022, 10:38:02 AM10/7/22
to Jim Quinn, re2...@googlegroups.com

Attention is currently required from: Jim Quinn.

View Change

1 comment:

  • File re2/prefilter.cc:

    • There is a subtle difference in logic between the previous code and this version. […]

      Spoke too soon.
      Reading more about it, 1 doesn't help at all, since the contents of a set are const.

      Consider doing 2 to provide a safer interface (unless re2 maintainers are okay with this design as well).

      C++17 merge would have been super useful indeed. Can't we do something like:
      if C++17 do X else do Y? :P

To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

Gerrit-Project: re2
Gerrit-Branch: main
Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
Gerrit-Change-Number: 60490
Gerrit-PatchSet: 3
Gerrit-Owner: Jim Quinn <jimq...@google.com>
Gerrit-CC: Akshay Jaggi <a...@google.com>
Gerrit-Attention: Jim Quinn <jimq...@google.com>
Gerrit-Comment-Date: Fri, 07 Oct 2022 14:37:57 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Comment-In-Reply-To: Akshay Jaggi <a...@google.com>
Gerrit-MessageType: comment

Jim Quinn (Gerrit)

unread,
Oct 7, 2022, 10:57:18 AM10/7/22
to Akshay Jaggi, re2...@googlegroups.com

Attention is currently required from: Akshay Jaggi.

View Change

1 comment:

  • File re2/prefilter.cc:

    • Patch Set #1, Line 345: ab->exact_

      There is a subtle difference in logic between the previous code and this version. […]

    • I'm not the steward obviously, but I'm going to advise against 2.

      `Alt` is a static factory "constructor" for this class, and documented as such on
      line 209. There are 13 other functions that do similar work, and make similar assumptions about the state of a new Info().

      If we had a magic wand, I agree that traditional constructors may be useful, however, we get nearly all the performance improvements with these few lines.

      I think what you're recommending would be a cleaner C++ implementation. However, I'd expect the entirety of this class's factory functions to be moved over, to keep it consistent with other functions like `Quest`. I'm not seeing the ROI on that.

      ---

      C++17 would be useful. If I were the owner, I probably wouldn't want this to be the first place that someone introduces a `#ifdef cpp17`, but if that mechanism already exists elsewhere, I'd be happy to update it.

      In practice? It's probably not a huge win in our production workload. In tests on my end, I was seeing calls to `Alt` like:

      ```
      |a.exact| = 1, |b.exact| = 0. Copy 1 strings into ab. Copy 0 string into ab.

      Destroy |a|, destroy |b| (2 string destructors called)

      ... <15k iterations later>

      |a.exact| = 15k. |b.exact| = 1. Copy 15k strings into ab. Copy 1 string into ab.

      Destroy |a|, destroy |b|, (15001 string desctructors called)
      ```

      With this version:

      ```
      |a.exact| = 15k, |b.exact| = 1. Move 15k in O(1), copy 1 string, destroy 1 string.
      ```

      We're *still* destroying 15k strings when we could just be potentially saving them. That being said, the move saves the lion share of the work. If cpp17 is a no-go, I'm not hung up on it.

To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

Gerrit-Project: re2
Gerrit-Branch: main
Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
Gerrit-Change-Number: 60490
Gerrit-PatchSet: 3
Gerrit-Owner: Jim Quinn <jimq...@google.com>
Gerrit-CC: Akshay Jaggi <a...@google.com>
Gerrit-Attention: Akshay Jaggi <a...@google.com>
Gerrit-Comment-Date: Fri, 07 Oct 2022 14:57:15 +0000

Akshay Jaggi (Gerrit)

unread,
Oct 7, 2022, 12:00:07 PM10/7/22
to Jim Quinn, re2...@googlegroups.com

Attention is currently required from: Jim Quinn.

View Change

1 comment:

  • File re2/prefilter.cc:

    • I'm not the steward obviously, but I'm going to advise against 2. […]

      Ah! That makes sense.
      Thanks for the detailed description.

      I saw a few compiler based #ifdefs in re2, but non specifically for C++17.

      Keeping this as resolved.

To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

Gerrit-Project: re2
Gerrit-Branch: main
Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
Gerrit-Change-Number: 60490
Gerrit-PatchSet: 3
Gerrit-Owner: Jim Quinn <jimq...@google.com>
Gerrit-CC: Akshay Jaggi <a...@google.com>
Gerrit-Attention: Jim Quinn <jimq...@google.com>
Gerrit-Comment-Date: Fri, 07 Oct 2022 16:00:02 +0000
Gerrit-HasComments: Yes
Gerrit-Has-Labels: No
Comment-In-Reply-To: Jim Quinn <jimq...@google.com>

Paul Wankadia (Gerrit)

unread,
Oct 10, 2022, 10:09:42 AM10/10/22
to Jim Quinn, Paul Wankadia, Akshay Jaggi, re2...@googlegroups.com

Attention is currently required from: Jim Quinn.

Patch set 5:Code-Review +2

View Change

    To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

    Gerrit-Project: re2
    Gerrit-Branch: main
    Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
    Gerrit-Change-Number: 60490
    Gerrit-PatchSet: 5
    Gerrit-Owner: Jim Quinn <jimq...@google.com>
    Gerrit-Reviewer: Paul Wankadia <jun...@google.com>
    Gerrit-CC: Akshay Jaggi <a...@google.com>
    Gerrit-Attention: Jim Quinn <jimq...@google.com>
    Gerrit-Comment-Date: Mon, 10 Oct 2022 14:09:36 +0000
    Gerrit-HasComments: No
    Gerrit-Has-Labels: Yes
    Gerrit-MessageType: comment

    Paul Wankadia (Gerrit)

    unread,
    Oct 10, 2022, 10:21:46 AM10/10/22
    to Jim Quinn, Paul Wankadia, Akshay Jaggi, re2...@googlegroups.com

    Paul Wankadia submitted this change.

    View Change


    Approvals: Paul Wankadia: Looks good to me, approved
    reduce string copies in prefilter

    When prefilter operates on longer regexes with many alternates, the
    merging of a and b into ab results in unnecessary string shuffling. By
    moving the larger of the two directly into ab, then inserting the
    smaller contents, we can avoid that work.

    (If C++17 were enabled for RE2, we could use std::set<>::merge to make this cheaper, but that's not possible in C++11.)

    Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
    Reviewed-on: https://code-review.googlesource.com/c/re2/+/60490
    Reviewed-by: Paul Wankadia <jun...@google.com>
    ---
    M re2/prefilter.cc
    1 file changed, 27 insertions(+), 9 deletions(-)

    diff --git a/re2/prefilter.cc b/re2/prefilter.cc
    index a8af821..979c404 100644
    --- a/re2/prefilter.cc
    +++ b/re2/prefilter.cc
    @@ -7,6 +7,7 @@
    #include <stddef.h>
    #include <stdint.h>
    #include <string>
    +#include <utility>
    #include <vector>

    #include "util/util.h"

    @@ -286,13 +287,6 @@
    return "";
    }

    -// Add the strings from src to dst.
    -static void CopyIn(const std::set<std::string>& src,
    - std::set<std::string>* dst) {
    - for (ConstSSIter i = src.begin(); i != src.end(); ++i)
    - dst->insert(*i);
    -}
    -
    // Add the cross-product of a and b to dst.
    // (For each string i in a and j in b, add i+j.)
    static void CrossProduct(const std::set<std::string>& a,
    @@ -343,8 +337,14 @@

    Info *ab = new Info();

    if (a->is_exact_ && b->is_exact_) {
    - CopyIn(a->exact_, &ab->exact_);
    - CopyIn(b->exact_, &ab->exact_);
    +    // Avoid string copies by moving the larger exact_ set into

    + // ab directly, then merge in the smaller set.
    + if (a->exact_.size() < b->exact_.size()) {
    +      using std::swap;
    + swap(a, b);

    + }
    + ab->exact_ = std::move(a->exact_);
    + ab->exact_.insert(b->exact_.begin(), b->exact_.end());
    ab->is_exact_ = true;
    } else {
    // Either a or b has is_exact_ = false. If the other

    To view, visit change 60490. To unsubscribe, or for help writing mail filters, visit settings.

    Gerrit-Project: re2
    Gerrit-Branch: main
    Gerrit-Change-Id: I02dc4e294fe7e7252d0eb9fd045aa1bf97d0cdc8
    Gerrit-Change-Number: 60490
    Gerrit-PatchSet: 6
    Gerrit-Owner: Jim Quinn <jimq...@google.com>
    Gerrit-Reviewer: Paul Wankadia <jun...@google.com>
    Gerrit-CC: Akshay Jaggi <a...@google.com>
    Gerrit-MessageType: merged
    Reply all
    Reply to author
    Forward
    0 new messages