Handling CRLF: Is this the correct approach?

61 views
Skip to first unread message

Jerry Ding

unread,
Mar 25, 2021, 3:44:40 PM3/25/21
to re2-dev
Dear RE2 development community,

I used RE2 for an application and needed to make a change so that the library handles the '^' and '$' metacharacters in the same way boost::regex does in multi-line mode.

To be more exact, I wanted a "\r\n" sequence to have a match for '$' at the start and a match for '^' at the end, but with no '^' or '$' matches between the two characters. I also wanted '\f' and lone '\r' characters to act like newline characters.

I think I figured out how to do it from reading the RE2 source code, but since people from this community have a much broader understanding of the library I wonder if I can ask whether I did this correctly?

(To be clear, this isn't a proposed change to the upstream, I'm creating a fork for my use case with minor modifications. I'll be happy to contribute an option flag to enable this if there's interest though!)

Basically what I did was add an additional flag to DFA::State in the same fashion as the word boundary flags, and updated DFA::RunStateOnByte to mark these flags and the line end / start markers accordingly. Then I updated Prog::EmptyFlags to set the line end / start markers accordingly, since this seemed to be where the NFA mode checks for empty matches. I believed nothing needed to change in Prog::ComputeByteMap because '\r', '\f' and '\n' are all handled differently.

The diff is available at https://gist.github.com/JubilantJerry/292a741db8f38c142dbe26b4dd9ff4af. The line numbers might be a bit different from the original though.

It did seem to work on a simple test case, but I don't know how to test the NFA mode specifically, or whether anything would be different if I used RE2::Set. I'm just wondering if there's a clear high level problem with my approach?

Many thanks,
Jerry Ding

Paul Wankadia

unread,
Mar 29, 2021, 11:27:41 AM3/29/21
to Jerry Ding, re2-dev
On Fri, Mar 26, 2021 at 6:44 AM Jerry Ding <jubila...@gmail.com> wrote:

Basically what I did was add an additional flag to DFA::State in the same fashion as the word boundary flags, and updated DFA::RunStateOnByte to mark these flags and the line end / start markers accordingly. Then I updated Prog::EmptyFlags to set the line end / start markers accordingly, since this seemed to be where the NFA mode checks for empty matches. I believed nothing needed to change in Prog::ComputeByteMap because '\r', '\f' and '\n' are all handled differently.

The diff is available at https://gist.github.com/JubilantJerry/292a741db8f38c142dbe26b4dd9ff4af. The line numbers might be a bit different from the original though.

I haven't scrutinised the patch, but I imagine that I would do more or less what you have described. Good spelunking and good hacking. :)

It did seem to work on a simple test case, but I don't know how to test the NFA mode specifically, or whether anything would be different if I used RE2::Set. I'm just wondering if there's a clear high level problem with my approach?

RE2::Set should be fine. To test the different execution engines for consistency, take a look at tester.h. I'm not sure how to make PCRE behave how you want for CRLF, but I'm guessing that it should be possible with at most just a little more hacking.

jam...@gmail.com

unread,
Mar 29, 2021, 11:31:32 AM3/29/21
to re2-dev
I think this might require delaying the match by two bytes instead of one. I don't think this patch quite does that. But maybe the patch works in a subset of cases and that's good enough.

Paul Wankadia

unread,
Mar 29, 2021, 12:31:16 PM3/29/21
to jam...@gmail.com, re2-dev
On Tue, Mar 30, 2021 at 2:31 AM jam...@gmail.com <jam...@gmail.com> wrote:

I think this might require delaying the match by two bytes instead of one. I don't think this patch quite does that. But maybe the patch works in a subset of cases and that's good enough.

Yeah, this does make me a bit nervous for that reason. My intuition was that you could get away with handling \r followed by \n specially, but there was some mental handwaving too. And something is nagging at the back of my mind about propagating "needed" flags correctly.

Jerry Ding

unread,
Mar 31, 2021, 4:22:54 PM3/31/21
to re2-dev
Thank you for the responses!

I'm wondering why there's a concern that the match might need to be delayed by two bytes?
My understanding is that the situation here is analogous to having word boundary markers around two-character words, e.g. RE2 treats " cc " as " \bcc\b " and by analogy my change treats " \r\n " as " $\r\n^ ". Yet there is no delaying by more than one byte here.
Then again, I'm not very knowledgeable about why matches might even need to be delayed one byte, let alone two. Is it mainly because the zero width flags could appear "before the next byte" as an alternative to being "after the current byte"? If that's the reasoning then it seems like it'll be fine since there's no way to move a zero width flag two bytes over.

Andrew Gallant

unread,
Apr 1, 2021, 7:39:18 AM4/1/21
to Jerry Ding, re2-dev
A thing to note is that RE2's DFA has a total maximum alphabet size of
257, instead of 256 as you might expect. (In practice, the alphabet
size is much smaller due to combining bytes into equivalence classes.
But the max is 257.) The reason is that at the end of a search, if
you've reached the end of the haystack, you need to run the DFA once
more on a special sentinel transition that represents the end of the
input. In RE2's DFA, that's called 'kByteEndText'[1], and it is used
as the final transition here[2].

That kByteEndText transition is necessary to make $ and \b work, since
both of those assertions can be satisfied when at the end of the
input. But as the second link above shows, when a search completes, it
doesn't always necessarily run on the special kByteEndText sentinel
transition. It can also run on the byte immediately following the end
of the input[3]. Recall that RE2's DFA searches within a substring of
the input[4]. So for example, if you have the string, c='foobar\r\n'
and the caller wants to search for matches of '\w$' within the
substring c[0:6], then that should report a match. If you've modified
RE2 to have '$' recognize either '\r\n' or '\n' as line endings, then
the only way for this to work is if your search reads two bytes past
the end of the haystack and into the surrounding context.

With that said, re-reading your initial post, it sounds like you're
okay with just '\r' being treated as a line ending. In which case,
maybe you _can_ get away with not scanning two bytes past the end of
the search text. (Although, IIRC, only very old Mac systems use a bare
'\r' as a line ending.)

I still have a nagging feeling in my mind that something isn't quite
right though. I think I would just advise to think through this and
write a lot of tests for it. And in particular, write tests that
search a sub-string of a larger string, as permitted by the DFA's API.

Good luck!

[1] - https://github.com/google/re2/blob/f8e389f3acdc2517562924239e2a188037393683/re2/dfa.cc#L131
[2] - https://github.com/google/re2/blob/f8e389f3acdc2517562924239e2a188037393683/re2/dfa.cc#L1485-L1498
[3] - https://github.com/google/re2/blob/f8e389f3acdc2517562924239e2a188037393683/re2/dfa.cc#L1490
[4] - https://github.com/google/re2/blob/13ebb377c6ad763ca61d12dd6f88b1126bd0b911/re2/dfa.cc#L1822-L1824
> --
> You received this message because you are subscribed to the Google Groups "re2-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to re2-dev+u...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/re2-dev/2b243c9d-8a59-4236-8c26-d556eac47a33n%40googlegroups.com.

Paul Wankadia

unread,
Apr 1, 2021, 8:29:30 AM4/1/21
to Jerry Ding, re2-dev
On Thu, Apr 1, 2021 at 7:22 AM Jerry Ding <jubila...@gmail.com> wrote:

I'm wondering why there's a concern that the match might need to be delayed by two bytes?
My understanding is that the situation here is analogous to having word boundary markers around two-character words, e.g. RE2 treats " cc " as " \bcc\b " and by analogy my change treats " \r\n " as " $\r\n^ ". Yet there is no delaying by more than one byte here.
Then again, I'm not very knowledgeable about why matches might even need to be delayed one byte, let alone two. Is it mainly because the zero width flags could appear "before the next byte" as an alternative to being "after the current byte"? If that's the reasoning then it seems like it'll be fine since there's no way to move a zero width flag two bytes over.

Let's work through an example. For the pattern foo$ and the text foo\nbar, the DFA steps like this:

    f o o \n b a r 
0: ^
    f o o \n b a r 
1:   ^
    f o o \n b a r 
2:     ^
    f o o \n b a r 
3:       ^
   f o o \n b a r 
4:         ^

The DFA can't know where the end of the line is until it steps over the \n. At that point, it's at offset 4 and reports the match... at offset 3, so we say that the match was delayed by one byte. RE2 does so to support zero-width assertions that "look forward" – and that even includes \z because the DFA can't know where the end of the text is until it steps over the kByteEndText symbol. (Andrew's reply talks about that.)

In order to handle the text foo\r\nbar correctly, the DFA would have to be at offset 5 and then report the match at offset 3, but it won't do that due to https://github.com/google/re2/blob/master/re2/dfa.cc#L1455-L1462, for example, and possibly other places where the one-byte delay is assumed.

Does that make sense?

Jerry Ding

unread,
Apr 1, 2021, 4:38:35 PM4/1/21
to re2-dev
Thank you Paul and Andrew for the very detailed explanation (with illustrations and references too)! I now understand the reasoning behind the two byte delay.

It's interesting that as Andrew mentioned, handling the lone '\r' as a line ending might be sufficient to get around the two byte requirement. Luckily, it happens that I wanted this behavior anyway for compatibility with boost::regex.

So in the case of Paul's example of "foo\r\nbar", the match for "foo$" is found at offset 4 (for a match from offset 0 to offset 3) because regardless of the characters after '\r', we are sure to end the line at the '\r'.

But indeed I'll be sure to create many test cases to verify that the patch works as expected, especially when using the internal interface to match against substrings of the full context. With the explanations I also was able to figure out how the code uses the one byte delay (e.g. what happens to "before flags" and "after flags", what happens when a match is found on a normal character such that no lookahead is needed, etc), so I think I should be fine now.

Paul Wankadia

unread,
Apr 1, 2021, 11:01:27 PM4/1/21
to Jerry Ding, re2-dev
On Fri, Apr 2, 2021 at 7:38 AM Jerry Ding <jubila...@gmail.com> wrote:

But indeed I'll be sure to create many test cases to verify that the patch works as expected, especially when using the internal interface to match against substrings of the full context. With the explanations I also was able to figure out how the code uses the one byte delay (e.g. what happens to "before flags" and "after flags", what happens when a match is found on a normal character such that no lookahead is needed, etc), so I think I should be fine now.

Great to hear. Happy hacking. :)

Reply all
Reply to author
Forward
0 new messages