Groups keyboard shortcuts have been updated
Dismiss
See shortcuts

[boost] [xpressive] Support for multi-capture and balancing groups

64 views
Skip to first unread message

Erik Rydgren

unread,
Oct 2, 2010, 8:52:23 PM10/2/10
to bo...@lists.boost.org
Hi!

My company have been using pcre for a long while but there has been gripes
about it only returning the last
matched value from a capture group. Because of this I have been searching
for a C++ regex engine that can
handle the same stuff as the .NET implementation can do, to no avail. During
my searches I've stumbled on
several forum threads where others have been searching for the same thing
but it doesn't seem to exist
a regular expression library in C/C++ that handles both named captures and
multicapture.

Found boost.xpressive and it had almost everything we need. It's open
source, fast, got flexible api and
named captures. But alas, just as all other C and C++ based implementations
I have found, it lacked multiple captures.

So, I added it. On top of that I added support for balancing groups
(http://blog.stevenlevithan.com/archives/balancing-groups).
But the syntax for the pop capture and capture conditional is slightly
different then the .NET version to better fit xpressive.

Syntax for pop capture:
dynamic: (?P<-name>stuff)
static: (name -= stuff)

Syntax for capture conditional:
dynamic: (?P(name)stuff)
static: (name &= stuff)

There is no support for the (?<name-othername>stuff) construct.

All captures made by a group is stored in sub_match::captures which is a
vector of sub_match_capture objects.
A sub_match_capture behaves like a stripped down sub_match. It can be put in
an ostream and has a length and
helper function for returning a string.

The changes are in the vault and can be found here:
http://tinyurl.com/3aak7mp

It can be unpacked against trunk from 2010-10-02 or the 1.44.0 release. I've
run the dynamic
regression tests without errors and I have added some tests for the new
functionality.
The code it only tested on Visual Studio 2010 since I don't have access to
any other compiler.

Please give feedback on my changes since I would love to see them in an
official release.
Thanks in advance.

Regards,

Erik


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Eric Niebler

unread,
Oct 2, 2010, 11:12:33 PM10/2/10
to bo...@lists.boost.org
On 10/2/2010 5:52 PM, Erik Rydgren wrote:
> Hi!
>
> My company have been using pcre for a long while but there has been
> gripes about it only returning the last matched value from a capture
> group. Because of this I have been searching for a C++ regex engine
> that can handle the same stuff as the .NET implementation can do, to
> no avail. During my searches I've stumbled on several forum threads
> where others have been searching for the same thing but it doesn't
> seem to exist a regular expression library in C/C++ that handles both
> named captures and multicapture.
>
> Found boost.xpressive and it had almost everything we need. It's
> open source, fast, got flexible api and named captures. But alas,
> just as all other C and C++ based implementations I have found, it
> lacked multiple captures.

It sort of has them, just not in the form you happen to be looking for.
In xpressive, you can call named regexes from other regexes. When you do
that, you end up with nested match_results. If you quantify a named
regex, you end up with a sequence of match_results, kind of like
multicapture. Sadly, it's not very efficient to create a tree of
match_results, and xpressive gives you no help in navigating this tree.
It's a bit of an ugly hack.

FWIW, Boost.Regex has multicapture if you compile with a certain flag, IIRC.

> So, I added it.

Whoa, cool!

> On top of that I added support for balancing groups
> (http://blog.stevenlevithan.com/archives/balancing-groups). But the
> syntax for the pop capture and capture conditional is slightly
> different then the .NET version to better fit xpressive.
>
> Syntax for pop capture:
> dynamic: (?P<-name>stuff)
> static: (name -= stuff)
>
> Syntax for capture conditional:
> dynamic: (?P(name)stuff)
> static: (name &= stuff)
>
> There is no support for the (?<name-othername>stuff) construct.

I'll need to read up on what those constructs do. Can you send some
pointers?

> All captures made by a group is stored in sub_match::captures which
> is a vector of sub_match_capture objects. A sub_match_capture behaves
> like a stripped down sub_match. It can be put in an ostream and has a
> length and helper function for returning a string.
>
> The changes are in the vault and can be found here:
> http://tinyurl.com/3aak7mp
>
> It can be unpacked against trunk from 2010-10-02 or the 1.44.0
> release. I've run the dynamic regression tests without errors and I
> have added some tests for the new functionality. The code it only
> tested on Visual Studio 2010 since I don't have access to any other
> compiler.
>
> Please give feedback on my changes since I would love to see them in
> an official release. Thanks in advance.

This sounds really great and I have every intention of taking this
change once I grok it and look over the code. Can you open a feature
request ticket at http://svn.boost.org so I don't forget, because I'm a
little busy at the moment.

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com

Erik Rydgren

unread,
Oct 3, 2010, 4:44:19 AM10/3/10
to bo...@lists.boost.org
> On 10/2/2010 5:52 PM, Erik Rydgren wrote:
>> Hi!
>>
>> My company have been using pcre for a long while but there has been
>> gripes about it only returning the last matched value from a capture
>> group. Because of this I have been searching for a C++ regex engine
>> that can handle the same stuff as the .NET implementation can do, to
>> no avail. During my searches I've stumbled on several forum threads
>> where others have been searching for the same thing but it doesn't
>> seem to exist a regular expression library in C/C++ that handles both
>> named captures and multicapture.
>>
>> Found boost.xpressive and it had almost everything we need. It's
>> open source, fast, got flexible api and named captures. But alas,
>> just as all other C and C++ based implementations I have found, it
>> lacked multiple captures.
>
> It sort of has them, just not in the form you happen to be looking for.
> In xpressive, you can call named regexes from other regexes. When you do
> that, you end up with nested match_results. If you quantify a named
> regex, you end up with a sequence of match_results, kind of like
> multicapture. Sadly, it's not very efficient to create a tree of
> match_results, and xpressive gives you no help in navigating this tree.
> It's a bit of an ugly hack.
>
Yea, I realized that but it wasn't practical for our needs.
I also did a static solution that used actions to make the captures just to
try them out.

> FWIW, Boost.Regex has multicapture if you compile with a certain flag,
> IIRC.
>

Ok, I didn't know that. Will take a second look at Boost.Regex then.

>> So, I added it.
>
> Whoa, cool!
>

That is the respose I was hoping for :)

>> On top of that I added support for balancing groups
>> (http://blog.stevenlevithan.com/archives/balancing-groups). But the
>> syntax for the pop capture and capture conditional is slightly
>> different then the .NET version to better fit xpressive.
>>
>> Syntax for pop capture:
>> dynamic: (?P<-name>stuff)
>> static: (name -= stuff)
>>
>> Syntax for capture conditional:
>> dynamic: (?P(name)stuff)
>> static: (name &= stuff)
>>
>> There is no support for the (?<name-othername>stuff) construct.
>
> I'll need to read up on what those constructs do. Can you send some
> pointers?
>

I already did, this blogpost explains them without fuss
http://blog.stevenlevithan.com/archives/balancing-groups.
But the very short version is that a (?P<-tag>exp) first matches exp then
removes the last capture from an earlier group named tag. If the tag group
haven't captured anything yet it fails and backtracks.
The (?P(tag)exp) is a shorthand if-then-else where the else part always
matches. Pseudo code: if (tag has matched) { exp must match } else { true }.

To demonstrate here is the regression definitions I've made

; multi capture
[test175]
str=aabb
pat=(..)*
br0=aabb
cp0_0=aabb
br1=bb
cp1_0=aa
cp1_1=bb
[end]

; multi capture several groups
[test176]
str=abba
pat=(.){2}(.){2}
br0=abba
cp0_0=abba
br1=b
cp1_0=a
cp1_1=b
br2=a
cp2_0=b
cp2_1=a
[end]

; multi capture, pop capture with backreference, check capture
[test177]
str=startabccbarest
pat=^(.*?)(?P<n>.)+(?P<-n>(?P=n))+(?P(n)(?!))(.*)$
br0=startabccbarest
cp0_0=startabccbarest
br1=start
cp1_0=start
br2=
br3=rest
cp3_0=rest
[end]

; match count
[test178]
str=aabb
pat=^(?P<n>a)*(?P<-n>b)*(?P(n)(?!))$
br0=aabb
br1=
[end]

; match count, fail on pop
[test179]
str=aabbb
pat=^(?P<n>a)*(?P<-n>b)*$
[end]

; match count, fail on check
[test180]
str=aab
pat=^(?P<n>a)*(?P<-n>b)*(?P(n)(?!))$
[end]

>> All captures made by a group is stored in sub_match::captures which
>> is a vector of sub_match_capture objects. A sub_match_capture behaves
>> like a stripped down sub_match. It can be put in an ostream and has a
>> length and helper function for returning a string.
>>
>> The changes are in the vault and can be found here:
>> http://tinyurl.com/3aak7mp
>>
>> It can be unpacked against trunk from 2010-10-02 or the 1.44.0
>> release. I've run the dynamic regression tests without errors and I
>> have added some tests for the new functionality. The code it only
>> tested on Visual Studio 2010 since I don't have access to any other
>> compiler.
>>
>> Please give feedback on my changes since I would love to see them in
>> an official release. Thanks in advance.
>
> This sounds really great and I have every intention of taking this
> change once I grok it and look over the code. Can you open a feature
> request ticket at http://svn.boost.org so I don't forget, because I'm a
> little busy at the moment.
>

Will do.

Erik Rydgren

unread,
Oct 3, 2010, 5:16:20 AM10/3/10
to bo...@lists.boost.org
> On 10/2/2010 5:52 PM, Erik Rydgren wrote:
>> Hi!
>>
>> My company have been using pcre for a long while but there has been
>> gripes about it only returning the last matched value from a capture
>> group. Because of this I have been searching for a C++ regex engine
[SNIP]

>
> This sounds really great and I have every intention of taking this
> change once I grok it and look over the code. Can you open a feature
> request ticket at http://svn.boost.org so I don't forget, because I'm a
> little busy at the moment.
>
Ticket #4704: Support for multicapture and balancing groups

> --
> Eric Niebler
> BoostPro Computing
> http://www.boostpro.com

/ Erik

OvermindDL1

unread,
Oct 3, 2010, 5:58:47 AM10/3/10
to bo...@lists.boost.org
On Sun, Oct 3, 2010 at 3:16 AM, Erik Rydgren <er...@rydgrens.net> wrote:
>> On 10/2/2010 5:52 PM, Erik Rydgren wrote:
>>>
>>> Hi!
>>>
>>> My company have been using pcre for a long while but there has been
>>> gripes about it only returning the last matched value from a capture
>>> group. Because of this I have been searching for a C++ regex engine
>
> [SNIP]
>>
>> This sounds really great and I have every intention of taking this
>> change once I grok it and look over the code. Can you open a feature
>> request ticket at http://svn.boost.org so I don't forget, because I'm a
>> little busy at the moment.
>>
> Ticket #4704: Support for multicapture and balancing groups

For note, depending on what exactly you need to parse, but if you can
statically define the parse then Boost.Spirit can do all you want and
a whole lot more with a lot more simple code and it will execute a lot
faster. If you have an example of what you want done, I (or others)
can give an example Spirit code to do the same if you want a
comparison.

Erik Rydgren

unread,
Oct 3, 2010, 9:46:20 AM10/3/10
to bo...@lists.boost.org
> On Sun, Oct 3, 2010 at 3:16 AM, Erik Rydgren <er...@rydgrens.net> wrote:
>>> On 10/2/2010 5:52 PM, Erik Rydgren wrote:
>>>>
>>>> Hi!
>>>>
>>>> My company have been using pcre for a long while but there has been
>>>> gripes about it only returning the last matched value from a capture
>>>> group. Because of this I have been searching for a C++ regex engine
>>
>> [SNIP]
>>>
>>> This sounds really great and I have every intention of taking this
>>> change once I grok it and look over the code. Can you open a feature
>>> request ticket at http://svn.boost.org so I don't forget, because I'm a
>>> little busy at the moment.
>>>
>> Ticket #4704: Support for multicapture and balancing groups
>
> For note, depending on what exactly you need to parse, but if you can
> statically define the parse then Boost.Spirit can do all you want and
> a whole lot more with a lot more simple code and it will execute a lot
> faster. If you have an example of what you want done, I (or others)
> can give an example Spirit code to do the same if you want a
> comparison.

That is a kind offer but unfortunately I don't have the luxury of static
expressions. The regex patterns is stored in a database and other sources
and we use them for a wide range of tasks. But I surly will take a look at
Spirit anyway, just so I might have another tool in my belt ready when
needed.

Thanks
Erik

Eric Niebler

unread,
Oct 6, 2010, 9:34:03 PM10/6/10
to Erik Rydgren, bo...@lists.boost.org
On Oct 3, 2:16 am, "Erik Rydgren" <e...@rydgrens.net> wrote:
> > On 10/2/2010 5:52 PM, Erik Rydgren wrote:
> >> Hi!
>
> >> My company have been using pcre for a long while but there has been
> >> gripes about it only returning the last matched value from a capture
> >> group. Because of this I have been searching for a C++ regex engine
> [SNIP]
>
> > This sounds really great and I have every intention of taking this
> > change once I grok it and look over the code. Can you open a feature
> > request ticket athttp://svn.boost.orgso I don't forget, because I'm a

> > little busy at the moment.
>
> Ticket #4704: Support formulticaptureand balancing groups

Hi Erik,

I've been looking at your changes and have some comments. Please see
the bug report: https://svn.boost.org/trac/boost/ticket/4704. In
particular, the last one is a deal-breaker until it's fixed:

"""
Oh wait! I just realized from looking at your change to
mark_end_matcher that you are always populating the captures member of
the sub_match struct (for non-hidden captures). This will have
disastrous performance implications. There cannot potentially be an
allocation every time you match a marked sub-expression. I can't
accept this change.

I suggest you find a way to make this behavior opt-in.
"""

I hope we can find a solution to this problem.

--
Eric Niebler
BoostPro Computing

www.boostpro.com

Reply all
Reply to author
Forward
0 new messages