Regular expression string searching & matching

Clint O

unread,

Mar 4, 2018, 9:41:17 AM3/4/18

to

Hi:

I'm working on an implementation of Brzozowski derivatives[1] to translate
regular expressions into a DFAs for potential use as a RE debugger and
programming tool. I've been using the paper "Regular-expression derivatives
reexamined" by Owens, Reppy, and Turon[2] as my guide for the implementation.

One of the interesting things about using derivatives is that it allows for
some elegant RE set notation. So, rather than just the union -

r: a | b # a OR b

we also have set intersection and complement as well.

I've got things working pretty well, and I'm trying to test things out.
Unfortunately, I haven't seen much discussion on how string search works,
except for what little I know about the behavior of flex and grep.

Match is pretty straightforward: You begin in the starting state q0 and read
the input and follow the state transitions. If you get to the end of the
string without hitting the error state and you are in an accepting state
(Brzozowski refers to states as "nullable"), then this string is accepted by
this RE.

Search is a different beast: Here we have a string of input and we are
interested in knowing if any substring in the input matches our RE (like
grep).

The naive approach I have taken has been to start the DFA at every possible
position in the string, and start simulating the DFA. If I hit any accepting
states, I record the start, stop offsets. If I hit the error state or end of
string, I move to the next starting position. I then merge all of the
intervals at the end to cover the cases where I hit an accepting state
multiple times for a given start point. Note: I assume this is the faithful
way to do it because it's my understanding that we should aim for the longest
possible match. If I stop at the first accepting state, I will miss
potentially longer matches.

I hit an interesting case when trying out finding C-comments. Owens mentions
that with complement you can find non-nested C-comments with:

/\* ~(.* \*/ .*) \*/

and this works for me. However, I wasn't able to find an RE _without_ using
complement that exhibits identical behavior because of the inherent greedy
nature of REs that make it scan through successive non-overlapping comments in
the text until the final closing '*/'.

There was an interesting stackoverflow discussion[3] on this with links to
explanatory regex interpreters, and when I tried:

/\*[^*]*\*+(?:[^/*][^*]*\*+)*/

for me it continues through the input until it finds the _last_ '*/'.

I also tried constructing my own versions and came up with similar results.

Questions:

1) Am I applying the RE/DFA search algorithm correctly?

2) Is it possible to implement the non-greedy versions of '*', '+', '?' using
an RE->DFA implementation? A quick search made it sound like only backtracking
implementations can implement this behavior?

Thanks,

-Clint

[1] Brzozowski, Janusz A. (1964). Derivatives of regular expressions. Journal
of the ACM, 11(4), 481-494.

[2] S Owens, J Reppy, A Turon. (2009). Regular-expression derivatives
re-examined. Journal of Functional Programming 19 (2), 173-190

[3]
https://stackoverflow.com/questions/13014947/regex-to-match-a-c-style-multiline-comment

Ben Hanson

unread,

Mar 7, 2018, 2:59:10 PM3/7/18

to

[/][*]([^*]|[*]+[^*/])*[*]+[/]

is what you are looking for. I ran into this when developing my lexer
generator library lexertl in C++. Having a debug::dump() function
really helped me grok what was going on.

The trick of course is realising that you have to exclude the
characters that follow (i.e. the [^*/] part). That is the bit that
clobbers the greedy behaviour. I've had to remind myself of that on
more than one occasion recently!

Hope that helps.

Regards,

Ben

Ben Hanson

unread,

Mar 9, 2018, 9:45:40 AM3/9/18

to

I missed your question about non-greedy repeats.

Yes, it is possible. See build_dfa() in generator.hpp from lexertl.

Basically non-greedy transitions are snipped when building the dfa. I build a
regex syntax tree as suggested in the Dragon Book and I keep track of greedy
flags in the tree and that is passed down to partition/equivset.hpp and from
there to the generator. The thing you have to careful about is respecting that
the left side takes priority (i.e. the regex or sub-regex that came first).

Regards,

Ben

Clint O

unread,

Mar 9, 2018, 9:47:09 AM3/9/18

to

Hi Ben:

Thanks for your post. I did try your regular expression (and a few small
variations on it), but it exhibits the same behavior as the others I have
tried.

The difference with the complement version is that the accepting state I end
up with has all transitions to the error state (which guarantees termination
after match) where as these seem to still accept characters even after
matching the closing '*/'. It's possible I have a bug in my implementation, so
I'm still looking at it.

Thanks,

-Clint

On Wednesday, March 7, 2018 at 11:59:10 AM UTC-8, Ben Hanson wrote:
> [/][*]([^*]|[*]+[^*/])*[*]+[/]
>
> is what you are looking for. I ran into this when developing my lexer
> generator library lexertl in C++. Having a debug::dump() function
> really helped me grok what was going on.
>
> The trick of course is realising that you have to exclude the
> characters that follow (i.e. the [^*/] part). That is the bit that
> clobbers the greedy behaviour. I've had to remind myself of that on
> more than one occasion recently!

[This should work, it's a standard example in compiler texts. -John]

Clint O

unread,

Mar 10, 2018, 9:47:58 AM3/10/18

to

On Friday, March 9, 2018 at 6:47:09 AM UTC-8, Clint O wrote:
> On Wednesday, March 7, 2018 at 11:59:10 AM UTC-8, Ben Hanson wrote:
> > [/][*]([^*]|[*]+[^*/])*[*]+[/]
>

> [This should work, it's a standard example in compiler texts. -John]

/This/ actually worked for me (one character change):

[/][*]([^*]|[*]+[^/])*[*]+[/]
^

Thanks for looking at this!

-Clint

Ben Hanson

unread,

Mar 10, 2018, 9:48:21 AM3/10/18

to

Here's the dump of the state machine for that regex using lexertl:

State: 0
[/] -> 1

State: 1
[*] -> 2

State: 2
[^*] -> 2
[*] -> 3

State: 3
[^*/] -> 2
[*] -> 3
[/] -> 4

State: 4
END STATE

Regards,

Ben

Ben Hanson

unread,

Mar 12, 2018, 4:19:29 PM3/12/18

to

> /This/ actually worked for me (one character change):
>
> [/][*]([^*]|[*]+[^/])*[*]+[/]

Your modified regex produces the following state machine:

State: 0
[/] -> 1

State: 1
[*] -> 2

State: 2
[^*] -> 2
[*] -> 3

State: 3
[^*/] -> 2

[*] -> 4
[/] -> 5

State: 4
[^*/] -> 2
[*] -> 4
[/] -> 6

State: 5
END STATE

State: 6
END STATE

[^*] -> 2
[*] -> 3

Which will match

/***/a*/

in its entirety, when if should only match

/***/

Regards,

Ben
[Doesn't that depend on whether you interpret the END STATE in state 6 to stop even
if there's more input? -John]

Clint O

unread,

Mar 12, 2018, 9:36:20 PM3/12/18

to

On Monday, March 12, 2018 at 1:19:29 PM UTC-7, Ben Hanson wrote:
> > /This/ actually worked for me (one character change):
> >
> > [/][*]([^*]|[*]+[^/])*[*]+[/]
>
> Your modified regex produces the following state machine:
>

[snip]

>
> Which will match
>
> /***/a*/
>
> in its entirety, when if should only match
>
> /***/
>
> Regards,
>
> Ben
> [Doesn't that depend on whether you interpret the END STATE in state 6 to
stop even
> if there's more input? -John]

Interesting. I'm not seeing this behavior with the sample input you've
provided. Again, I'm willing to concede that I have a bug :) What I'm doing is
simulating the DFA until I get to an error state or I hit EOF. So, this
guarantees I'll record the longest match I've found.

I could post the states that I come up with, but my state dumper also prints
out the RE it's currently processing (the actual expression). The successive
computation of derivatives can sometimes produce some rather abhorrent output,
and it's not always obvious (to me) what's going on. I'll work on a cleaner
presentation and try to post this.

It also looks like you are running a DFA minimizer (like Hopcroft) on your
result since I am not producing a minimal DFA. That also may help me figure
out if I'm producing the right automaton because they'd match...

Thanks,

-Clint

Ben Hanson

unread,

Mar 12, 2018, 9:50:24 PM3/12/18

to

> [Doesn't that depend on whether you interpret the END STATE in state 6 to stop even
> if there's more input? -John]

That is why we have the "leftmost longest" rule. Try it out with any regex engine (including flex).

Regards,

Ben

Hans-Peter Diettrich

unread,

Mar 13, 2018, 3:33:53 PM3/13/18

to

Am 11.03.2018 um 21:52 schrieb Ben Hanson:
>> /This/ actually worked for me (one character change):
>>
>> [/][*]([^*]|[*]+[^/])*[*]+[/]
>
> Your modified regex produces the following state machine:
>
> State: 0
> [/] -> 1
>
> State: 1
> [*] -> 2
>
> State: 2
> [^*] -> 2
> [*] -> 3

I'm just curious about the notation.
What happens if neither pattern matches?
What's the purpose of [^*] in state 2, as opposed to state 1?

DoDi

Ben Hanson

unread,

Mar 13, 2018, 3:36:32 PM3/13/18

to

> It also looks like you are running a DFA minimizer (like Hopcroft) on your
> result since I am not producing a minimal DFA. That also may help me figure
> out if I'm producing the right automaton because they'd match...

Actually, although I have a minimising routine I'm not applying it here.

I intersect every character class in the regex to produce non-overlapping sets
and then these are grouped in equivalence sets for each state in the DFA. This
is the part that I didn't find in the Dragon Book (or any other compiler book
I looked in) and it was a message on this group from Vern Paxon that tipped me
off: https://compilers.iecc.com/comparch/article/94-10-091

(I only paid attention to the equivalence classes bit by the way!)

I intersect the equivalence sets as I build the DFA and I have found with this
technique you kind of have to go out of your way to get a non minimal DFA by
default. I've imagined an even more rigorous approach where character classes
are intersected as they are added to the syntax tree up front, but I've not
got around to trying to implement it.

I'd be interested to see your DFA output once you've cleaned it up a bit.
Using derivatives is interesting too - I've seen it discussed but never tried
to get into that.

Regards,

Ben

Ben Hanson

unread,

Mar 13, 2018, 5:32:47 PM3/13/18

to

On Tuesday, 13 March 2018 19:33:53 UTC, Hans-Peter Diettrich wrote:
> > State: 0
> > [/] -> 1
> >
> > State: 1
> > [*] -> 2
> >
> > State: 2
> > [^*] -> 2
> > [*] -> 3
>
> I'm just curious about the notation.
> What happens if neither pattern matches?
> What's the purpose of [^*] in state 2, as opposed to state 1?

In state 2 you are either looping on [^*] (anything other than '*') or changing state to state 3 on '*'.

I'm not sure I understand your question.

Regards,

Ben

Message has been deleted

Clint O

unread,

Mar 20, 2018, 2:00:24 PM3/20/18

to

[ reposted to try to make the special characters look right ]

/·[*]·([^*] | [*]+·[^/])*·[*]+·/

<RegexConcat [13]=0x10ccc6f60 0x10ccc6f98 0x10ccc67f0 '·'>
+---<RegexConcat [11]=0x10ccc6f98 0x10ccc6ef0 0x10ccc6898 '·'>
| +---<RegexConcat [10]=0x10ccc6ef0 0x10ccc6710 0x10ccc6b00 '·'>
| | +---<RegexConcat [2]=0x10ccc6710 0x10ccc67f0 0x10ccc6908 '·'>
| | | +---<RegexSym [0]=0x10ccc67f0 '/'>
| | | `---<RegexSym [1]=0x10ccc6908 '*'>
| | `---<RegexStar [9]=0x10ccc6b00 0x10ccc6c88 '*'>
| | `---<RegexOr [8]=0x10ccc6c88 0x10ccc6b70 0x10ccc6c18 '|'>
| | +---<RegexSym [4]=0x10ccc6b70 '^*'>
| | `---<RegexConcat [7]=0x10ccc6c18 0x10ccc6898 0x10ccc6da0 '·'>
| | +---<RegexPlus [5]=0x10ccc6898 0x10ccc6908 '+'>
| | | `---<RegexSym [1]=0x10ccc6908 '*'>
| | `---<RegexSym [6]=0x10ccc6da0 '^/'>
| `---<RegexPlus [5]=0x10ccc6898 0x10ccc6908 '+'>
| `---<RegexSym [1]=0x10ccc6908 '*'>
`---<RegexSym [0]=0x10ccc67f0 '/'>

q0: /·[*]·([^*] | [*]+·[^/])*·[*]+·/
[/] q2
['\x00'-.0-ÿ] q1
q1: ∅
['\x00'-ÿ] q1
q2: [*]·([^*] | [*]+·[^/])*·[*]+·/
[*] q3
['\x00'-)+-ÿ] q1
q3: ([^*] | [*]+·[^/])*·[*]+·/
[*] q4
['\x00'-)+-ÿ] q3
q4: ([*]*·[^/]·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/
[*] q6
['\x00'-)+-.0-ÿ] q3
[/] q5
q5: ε
['\x00'-ÿ] q1
q6: (([*]*·[^/] | ε)·([^*] | [*]+·[^/])*·[*]+ | [*]*)·/
[*] q8
['\x00'-)+-.0-ÿ] q3
[/] q7
q7: ([^*] | [*]+·[^/])*·[*]+·/ | ε
[*] q4
['\x00'-)+-ÿ] q3
q8: ((([*]*·[^/] | ε)·([^*] | [*]+·[^/])* | [*]*·[^/]·([^*] | [*]+·[^/])*)·[*]+ | [*]*)·/
[*] q8
['\x00'-)+-.0-ÿ] q3
[/] q7
Total DFA states: 9
Total RE instances: 35

Clint O

unread,

Mar 22, 2018, 8:43:56 PM3/22/18

to

On 2018-03-20, Clint O <clint...@gmail.com> wrote:
> [ reposted to try to make the special characters look right ]
>

Thanks John for reposting this. It looks much better now.

In summary:

q5,q7 are accepting states since they contain epsilon. q2 represents the
error state.

The key to success with this algorithm is recognizing previously calculated
derivatives/expressions. When you no longer calculate unique derivatives,
the DFA construction terminates. As you can see the expressions can get
hairy pretty quickly. I don't know if you can glean much from the
successive expressions generated. It's akin to the method of walking the
parse tree but it bypasses the NFA construction entirely.

Thanks,

-Clint