Regexp: Why can't I get submatch names and match results in a single call?

beatgammit

Mar 11, 2012, 3:03:54 PM
to golan...@googlegroups.com
First off, the regexp package is awesome. I love the new one compared to the old one.

However, it seems that I have to exec a compiled regex twice to get the submatch names and the submatches.

Say I have a regexp like this:

"(?P<lws>\\s*)(?P<token>\\w+)(?P<attrs>\\(([^)]*)\\))?"

This would match any of the following:

"    hello(foo='bar', quux='baz')"
"hello(foo='bar', quux='baz')"
"hello"
"    hello"

I need to know how much leading white space there is, and the optional attribute list. However, it seems I have to do the following:

n := reg.SubmatchNames(str)
s := reg.FindStringSubmatch(str)
matches := make(map[string]string)
for i, name := range(n[1:]) {
    matches[n] = s[i]
}

This type of pattern is pretty common, and I was surprised that there wasn't a call that did both, because other regex engines that allow naming matches have this sort of thing.

For example, in the D programming language:


Is there a reason why this was left out? This would be a really nice feature.

Thanks!

Rémy Oudompheng

Mar 11, 2012, 3:16:52 PM
to beatgammit, golan...@googlegroups.com
On March 11, 2012 at 20:03, beatgammit <beatg...@gmail.com> wrote:
> I need to know how much leading white space there is, and the optional
> attribute list. However, it seems I have to do the following:
>
> n := reg.SubmatchNames(str)
> s := reg.FindStringSubmatch(str)
> matches := make(map[string]string)
> for i, name := range(n[1:]) {
>     matches[n] = s[i]
> }

I can't find a "SubmatchNames" here. Maybe it's SubexpNames()?

Rémy.

beatgammit

Mar 11, 2012, 6:14:14 PM
to golan...@googlegroups.com, beatgammit
Yeah, you're right, I meant SubexpNames. I just wrote it up really quickly without looking at my code.

Anyway, I think there should be a method that returns a mapping of match name to match value. I can get the same data, but it requires two function calls, which would evaluate the expression twice. I'm doing this repeatedly in my parser, and it just makes sense to have a call that returns both.

beatgammit

Mar 11, 2012, 6:18:38 PM
to golan...@googlegroups.com, beatgammit
On further reading, I guess I missed that SubexpNames only returns an array of names in the regex, not the names that have values assigned to them. I guess there's no way to guarantee that a name is matched to a value.

Is this just an oversight, or is this not there on purpose?

I know the package is new, but it kind of makes using names in a regex useless.

Matthew

Mar 11, 2012, 7:30:23 PM
to golan...@googlegroups.com, beatgammit
I assume you're supposed to use the index of each match as an index into SubexpNames():

// names[0] is an empty string and matches[0] is the whole matched string; trim them.
// Obviously, blindly slicing matches[1:] will trigger a panic if there is no match at all.
names := re.SubexpNames()[1:]
matches := re.FindStringSubmatch(str)[1:]

data := make(map[string]string)
for i, m := range matches {
    n := names[i] // this is the name of the current match
    data[n] = m
}
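
For anyone who wants to run this, here is a minimal sketch of the same idea wrapped in a function, with a guard for the no-match case. The helper name namedMatches and the variable lineRE are made up for illustration and are not part of the regexp package. The pattern is an adjusted version of the one in the question, assumed here so that the attribute list is optional.

```go
package main

import (
	"fmt"
	"regexp"
)

// lineRE is an assumed variant of the pattern from the question,
// changed so that the attribute list may be absent.
var lineRE = regexp.MustCompile(`(?P<lws>\s*)(?P<token>\w+)(?P<attrs>\(([^)]*)\))?`)

// namedMatches is a hypothetical helper. It runs the regexp once and
// pairs every named group with its submatch. The second result is
// false when s does not match at all.
func namedMatches(re *regexp.Regexp, s string) (map[string]string, bool) {
	matches := re.FindStringSubmatch(s)
	if matches == nil {
		return nil, false // no match, so slicing with [1:] would panic
	}
	names := re.SubexpNames()[1:] // drop the entry for the whole match
	data := make(map[string]string, len(names))
	for i, m := range matches[1:] {
		if names[i] == "" {
			continue // unnamed group, nothing to key it by
		}
		data[names[i]] = m
	}
	return data, true
}

func main() {
	for _, s := range []string{"    hello(foo='bar', quux='baz')", "hello", "!!!"} {
		data, ok := namedMatches(lineRE, s)
		fmt.Printf("%q -> %q (matched: %v)\n", s, data, ok)
	}
}
```

Note that SubexpNames() only reads the names out of the compiled pattern and takes no input string, so the expression itself is executed once per call here, by FindStringSubmatch.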

beatgammit

Mar 12, 2012, 4:20:36 PM
to golan...@googlegroups.com, beatgammit
I guess that makes sense. I think I misread the documentation. This actually doesn't seem all that bad, just grab the names in one call, then use the names wherever you parse. It just seems like it would be handy to have a name->match map instead of having to do that myself. I could always create a function for it though...

Matthew

Mar 12, 2012, 10:05:54 PM
to golan...@googlegroups.com, beatgammit
One thing to note is that you may still want to add something like if m == "" { continue }. Otherwise you will create an entry in the map for a name, but the value will be an empty string.

Also, I noticed something silly in my code sample: instead of the [1:] twice rigmarole, just use a for loop that starts at 1 and ends at len(matches). ("Drop the [1:]. Just 1. It's cleaner.") 
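
Roughly what I mean, as a sketch rather than a definitive version: the loop starts at 1 and skips empty submatches. The names nonEmptyMatches and tokenRE are hypothetical, and the pattern is an assumed variant of the one in the question with the attribute list made optional.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenRE is an assumed variant of the pattern from the question.
var tokenRE = regexp.MustCompile(`(?P<lws>\s*)(?P<token>\w+)(?P<attrs>\(([^)]*)\))?`)

// nonEmptyMatches is a hypothetical helper. It indexes names and matches
// in parallel, starting at 1 to skip the whole-match entry, and leaves
// out groups whose submatch is the empty string.
func nonEmptyMatches(re *regexp.Regexp, s string) map[string]string {
	names := re.SubexpNames()
	matches := re.FindStringSubmatch(s) // nil when there is no match
	data := make(map[string]string)
	for i := 1; i < len(matches); i++ {
		if names[i] == "" || matches[i] == "" {
			continue // unnamed group, or nothing captured
		}
		data[names[i]] = matches[i]
	}
	return data
}

func main() {
	fmt.Printf("%q\n", nonEmptyMatches(tokenRE, "    hello"))
	fmt.Printf("%q\n", nonEmptyMatches(tokenRE, "hello(foo='bar')"))
}
```

Because len(matches) is 0 when nothing matches, this form needs no separate guard against the panic. One caveat: an empty string does not tell you whether a group matched nothing or did not participate at all. If that difference matters, FindStringSubmatchIndex reports negative indexes for groups that did not participate.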


I am not wise enough to speak to whether this or something like it should be a convenience function. I could see an argument in favor of it, as well as an argument that use cases vary enough (some might just want, say, a struct) that devs should write whatever function is most convenient for them.
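
To illustrate the struct case, here is one possible shape, purely as a sketch. The type parsedLine, the function parseLine and the variable attrRE are invented for this example, and the pattern is again an assumed variant of the one in the question. The groups are read by position: 1 is lws, 2 is token, 3 is attrs and 4 is the unnamed group inside the parentheses.

```go
package main

import (
	"fmt"
	"regexp"
)

// attrRE is an assumed variant of the pattern from the question.
var attrRE = regexp.MustCompile(`(?P<lws>\s*)(?P<token>\w+)(?P<attrs>\(([^)]*)\))?`)

// parsedLine is a hypothetical result type for the parser described
// in the question.
type parsedLine struct {
	Indent int    // length in bytes of the leading whitespace
	Token  string // the word itself
	Attrs  string // text between the parentheses, empty if absent
}

// parseLine is a hypothetical helper that fills a parsedLine from a
// single regexp execution. It reports false when s does not match.
func parseLine(s string) (parsedLine, bool) {
	m := attrRE.FindStringSubmatch(s)
	if m == nil {
		return parsedLine{}, false
	}
	return parsedLine{Indent: len(m[1]), Token: m[2], Attrs: m[4]}, true
}

func main() {
	p, ok := parseLine("    hello(foo='bar', quux='baz')")
	fmt.Printf("%+v %v\n", p, ok)
}
```

The trade-off against the map version is that the struct ties the code to this one pattern, but callers get typed fields instead of string lookups.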