regexp syntax and named Unicode character classes

Tom Payne

Jan 7, 2020, 1:21:58 PM
to golang-nuts
Hi,

tl;dr How should I use named Unicode character classes in regexps?

I'm trying to write a regular expression that matches Go identifiers, which start with a Unicode letter or underscore followed by zero or more Unicode letters, decimal digits, and/or underscores.

Based on the regexp syntax, and the variables in the unicode package which mention the classes "Letter" and "Number, decimal digit", I was expecting to write something like:

  identifierRegexp := regexp.MustCompile(`\A[[\p{Letter}]_][[\p{Letter}][\p{Number, decimal digit}]_]*\z`)

However, this pattern does not compile, giving the error:

  regexp: Compile(`\A[[\p{Letter}]_][[\p{Letter}][\p{Number, decimal digit}]_]*\z`): error parsing regexp: invalid character class range: `\p{Letter}`

Using the short names for character classes (L for Letter, Nd for Number, decimal digit) does work, however:

  identifierRegexp := regexp.MustCompile(`\A[\pL_][\pL\p{Nd}_]*\z`)


Is this simply an oversight that Unicode character classes like "Letter" and "Number, decimal digit" are not available for use in regexps, or should I be using them differently?

Many thanks,
Tom
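
For readers who want to try the working short-name pattern from the post above, here is a minimal sketch. The helper name isIdentifier and the sample strings are illustrative assumptions, not part of the original post, and the pattern makes no attempt to exclude Go keywords.

```go
package main

import (
	"fmt"
	"regexp"
)

// identifierRegexp uses the short Unicode class names from the post:
// L for letters and Nd for decimal digits.
var identifierRegexp = regexp.MustCompile(`\A[\pL_][\pL\p{Nd}_]*\z`)

// isIdentifier is a hypothetical helper for illustration only.
// It reports whether s has the shape of a Go identifier.
func isIdentifier(s string) bool {
	return identifierRegexp.MatchString(s)
}

func main() {
	// Sample inputs chosen for illustration.
	for _, s := range []string{"x", "_tmp1", "αβγ", "1abc", "a-b", ""} {
		fmt.Printf("%q: %v\n", s, isIdentifier(s))
	}
}
```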

Ian Lance Taylor

Jan 7, 2020, 1:36:02 PM
to Tom Payne, golang-nuts
The strings you can use with \p are the ones listed in
unicode.Categories and unicode.Scripts. So use \pL as you do in the
second example.

Ian

Tom Payne

Jan 7, 2020, 1:39:29 PM
to golang-nuts
Thank you :) Is this worth adding to the regexp/syntax documentation? I'd happily contribute a patch.

Ian Lance Taylor

Jan 7, 2020, 1:43:44 PM
to Tom Payne, golang-nuts
On Tue, Jan 7, 2020 at 10:39 AM Tom Payne <twp...@gmail.com> wrote:
>
> Thank you :) Is this worth adding to the regexp/syntax documentation? I'd happily contribute a patch.

I think so, if it can be described precisely and tersely. Thanks.

Ian

alan...@gmail.com

Jan 7, 2020, 3:44:53 PM
to golang-nuts
As Go's regular expressions are based on RE2, I always use the latter's documentation page to check what is and isn't allowed.

Note, though, that \C, which RE2 normally allows, isn't allowed in Go.

Alan
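
The \C difference mentioned above can be checked directly by asking the regexp package to compile the escape and inspecting the error. A small sketch, where compileErr is a hypothetical helper; the exact error text is left to the runtime rather than quoted here.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileErr returns the error, if any, from compiling pattern.
// (Hypothetical helper for illustration.)
func compileErr(pattern string) error {
	_, err := regexp.Compile(pattern)
	return err
}

func main() {
	// \C is listed in the RE2 syntax page but rejected by Go's regexp.
	fmt.Println(`\C:`, compileErr(`\C`))
	// \pL is accepted, so the error here is nil.
	fmt.Println(`\pL:`, compileErr(`\pL`))
}
```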