Are Regular Expression classes Unicode aware?


Peter W A Wood

Jul 9, 2020, 10:19:20 AM
to racket...@googlegroups.com
I was experimenting with regular expressions to try to emulate the Python isalpha() String method. Using a simple [a-zA-Z] character class worked for the English alphabet (ASCII characters):

> (regexp-match? #px"^[a-zA-Z]+$" "hello")
#t
> (regexp-match? #px"^[a-zA-Z]+$" "h1llo")
#f

It then dawned on me that the Python isalpha() method was Unicode aware:

>>> "é".isalpha()
True

I started scratching my head as to how to achieve the equivalent using a regular expression in Racket. I tried the same regular expression with a non-English character in the string. To my surprise, the regular expression matched the non-ASCII characters.

> (regexp-match? #px"^[a-zA-Z]+$" "h\U+FFC3\U+FFA9llo")
#t

Are Racket regular expression character classes Unicode aware or is there some other explanation why this regular expression matches?

Peter

Sorawee Porncharoenwase

Jul 9, 2020, 10:32:18 AM
to Peter W A Wood, Racket list

Racket REPL doesn’t handle unicode well. If you try (regexp-match? #px"^[a-zA-Z]+$" "héllo") in DrRacket, or write it as a program in a file and run it, you will find that it does evaluate to #f.


--
You received this message because you are subscribed to the Google Groups "Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to racket-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/2197C34F-165D-4D97-97AD-F158153316F5%40gmail.com.

Ryan Culpepper

Jul 9, 2020, 10:53:07 AM
to Sorawee Porncharoenwase, Peter W A Wood, Racket list
If you want a regular expression that does match the example string, you can use the \p{property} notation. For example:

  > (regexp-match? #px"^\\p{L}+$" "h\uFFC3\uFFA9llo")
  #t

The "Regexp Syntax" docs have a grammar for regular expressions with links to examples.

Ryan
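
A side note on the example string, offered as a guess rather than a diagnosis: the two escaped code points look like the UTF-8 bytes of "é" (C3 A9) with FF00 added to each, which would be consistent with the REPL mangling described earlier in the thread. The Python sketch below checks that arithmetic and looks up the general category of the two code points; the variable names are only illustrative.

```python
import unicodedata

# UTF-8 encoding of the precomposed "é" (U+00E9) is the two bytes C3 A9.
utf8_bytes = "\u00e9".encode("utf-8")

# The two code points from the example string above.
mangled = "\uffc3\uffa9"

# Assumption for illustration: each mangled code point is a UTF-8 byte plus 0xFF00,
# so masking off the high byte should give back the original UTF-8 bytes.
low_bytes = bytes(ord(ch) & 0xFF for ch in mangled)

# General categories of the mangled code points.
categories = [unicodedata.category(ch) for ch in mangled]

print(utf8_bytes == low_bytes)
print(categories)
```

Both code points fall in a Letter category, which would explain why \p{L} matches them even though they are not the "é" that was typed.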


Philip McGrath

Jul 9, 2020, 2:43:19 PM
to Sorawee Porncharoenwase, Peter W A Wood, Racket list
On Thu, Jul 9, 2020 at 10:32 AM Sorawee Porncharoenwase <sorawe...@gmail.com> wrote:

> Racket REPL doesn’t handle unicode well. If you try (regexp-match? #px"^[a-zA-Z]+$" "héllo") in DrRacket, or write it as a program in a file and run it, you will find that it does evaluate to #f.

See this issue for workarounds, including installing the `readline-gpl` package: https://github.com/racket/racket/issues/3223

But you may have some other issues: for me, `(regexp-match? #px"^[a-zA-Z]+$" "h\U+FFC3\U+FFA9llo")` gives an error saying "read-syntax: no hex digit following `\U`"
 
For the original question:


On Thu, Jul 9, 2020 at 7:19 AM Peter W A Wood <peter...@gmail.com> wrote:
> I was experimenting with regular expressions to try to emulate the Python isalpha() String method.

You'd want to benchmark, but, for this purpose, I have a hunch you might get better performance by using `in-string` with a `for/and` loop (which can use unsafe operations internally). That is probably especially so if you were content to just test `char-alphabetic?`, which follows Unicode's definition of "alphabetic" rather than Python's idiosyncratic one. Here's an example:

#lang racket

(module+ test
  (require rackunit))

(define (char-letter? ch)
  ;; not the same as `char-alphabetic?`: see
  ;; https://docs.python.org/3/library/stdtypes.html#str.isalpha
  (case (char-general-category ch)
    [(lm lt lu ll lo) #t]
    [else #f]))

(define (string-is-alpha? str)
  (for/and ([ch (in-string str)])
    (char-letter? ch)))

(module+ test
  (check-true (string-is-alpha? "hello"))
  (check-false (string-is-alpha? "h1llo"))
  (check-true (string-is-alpha? "héllo")))
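
For comparison, the Python documentation linked in the code comment above defines alphabetic characters as those whose general category is one of Lm, Lt, Lu, Ll, or Lo. A minimal Python sketch of the same category check follows; the helper name `is_alpha_by_category` is made up for illustration and is not part of any library.

```python
import unicodedata

# Categories listed in the Python docs for str.isalpha().
LETTER_CATEGORIES = frozenset({"Lm", "Lt", "Lu", "Ll", "Lo"})

def is_alpha_by_category(s: str) -> bool:
    """Hypothetical helper mirroring the category check in the Racket code above."""
    # Unlike a bare all(), require at least one character, as str.isalpha() does.
    return bool(s) and all(unicodedata.category(ch) in LETTER_CATEGORIES for ch in s)

# Print the helper's answer next to the built-in one for a few samples.
for sample in ["hello", "h1llo", "h\u00e9llo", ""]:
    print(repr(sample), is_alpha_by_category(sample), sample.isalpha())
```

One difference worth noting: a `for/and` loop over an empty string returns #t, whereas str.isalpha() returns False for an empty string, so an exact emulation would need an extra non-empty check on the Racket side.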

Sorawee Porncharoenwase

Jul 9, 2020, 2:57:40 PM
to Philip McGrath, Peter W A Wood, Racket list

I did in fact try installing readline-gpl (raco pkg install readline-gpl), but it didn’t change anything. Interestingly, the bug in #3223 persists for me, too. This suggests that I didn’t install or invoke it correctly. Do you need to run racket with any flag to make readline-gpl take effect?

But yes, the problem is definitely due to readline. Sam suggested that I try racket -q, which suppresses readline, and with that there’s no issue.

George Neuner

Jul 9, 2020, 4:47:44 PM
to racket...@googlegroups.com
On Thu, 9 Jul 2020 14:43:03 -0400, Philip McGrath
<phi...@philipmcgrath.com> wrote:

>On Thu, Jul 9, 2020 at 10:32 AM Sorawee Porncharoenwase <
>sorawe...@gmail.com> wrote:
>
>> Racket REPL doesn’t handle unicode well. If you try (regexp-match?
>> #px"^[a-zA-Z]+$" "héllo") in DrRacket, or write it as a program in a file
>> and run it, you will find that it does evaluate to #f.
>>
>See this issue for workarounds, including installing the `readline-gpl`
>package: https://github.com/racket/racket/issues/3223
>
>But you may have some other issues: for me, `(regexp-match?
>#px"^[a-zA-Z]+$" "h\U+FFC3\U+FFA9llo")` gives an error saying "read-syntax:
>no hex digit following `\U`"

It works if you remove the '+' sign. \U and \u are defined to take
hexadecimal values, which are unsigned. For comparison, \x fails with
the same error if the value is signed.

George

Peter W A Wood

Jul 10, 2020, 6:28:38 AM
to racket...@googlegroups.com
Dear Ryan

Thank you very much for the kind, detailed explanation which I will study carefully. It was not my intention to reply to you off-list. I hope I have correctly addressed this reply to appear on-list.

Peter

> On 10 Jul 2020, at 15:47, Ryan Culpepper <rmculp...@gmail.com> wrote:
>
> (I see this went off the mailing list. If you reply, please consider CCing the list.)
>
> Yes, I understood your goal of trying to capture the notion of Unicode "alphabetic" characters with a regular expression.
>
> As far as I know, Unicode doesn't have a notion of "alphabetic", but it does assign every code point to a "General category", consisting of a main category and a subcategory. There is a category called "Letter", which seems like one reasonable generalization of "alphabetic".
>
> In Racket, you can get the code for a character's category using `char-general-category`. For example:
>
> > (char-general-category #\A)
> 'lu
> > (char-general-category #\é)
> 'll
> > (char-general-category #\ß)
> 'll
> > (char-general-category #\7)
> 'nd
>
> The general category for "A" is "Letter, uppercase", which has the code "Lu", which Racket turns into the symbol 'lu. The general category of "é" is "Letter, lowercase", code "Ll", which becomes 'll. The general category of "7" is "Number, decimal digit", code "Nd".
>
> In Racket regular expressions, the \p{category} syntax lets you recognize characters from a specific category. For example, \p{Lu} recognizes characters with the category "Letter, uppercase", and \p{L} recognizes characters with the category "Letter", which is the union of "Letter, uppercase", "Letter, lowercase", and so on.
>
> So the regular expression #px"^\\p{L}+$" recognizes sequences of one or more Unicode letters. For example:
>
> > (regexp-match? #px"^\\p{L}+$" "héllo")
> #t
> > (regexp-match? #px"^\\p{L}+$" "straße")
> #t
> > (regexp-match? #px"^\\p{L}+$" "二の句")
> #t
> > (regexp-match? #px"^\\p{L}+$" "abc123")
> #f ;; No, contains numbers
>
> There are still some problems to watch out for, though. For example, accented characters like "é" can be expressed as a single pre-composed code point or "decomposed" into a base letter and a combining mark. You can get the decomposed form by converting the string to "decomposed normal form" (NFD), and the regexp above won't match that string.
>
> > (map char-general-category (string->list (string-normalize-nfd "é")))
> '(ll mn)
> > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
> #f
> 
> One fix would be to call `string-normalize-nfc` first, but some letter-modifier pairs don't have pre-composed versions. Another fix would be to expand the regexp to include modifiers. You'd have to decide which is better based on your application.
>
> Ryan
>
>
>
> On Fri, Jul 10, 2020 at 2:10 AM Peter W A Wood <peter...@gmail.com> wrote:
> Ryan
>
> > On 9 Jul 2020, at 22:52, Ryan Culpepper <rmculp...@gmail.com> wrote:
> >
> > If you want a regular expression that does match the example string, you can use the \p{property} notation. For example:
> >
> > > (regexp-match? #px"^\\p{L}+$" "h\uFFC3\uFFA9llo")
> > #t
> >
> > The "Regexp Syntax" docs have a grammar for regular expressions with links to examples.
> >
> > Ryan
>
> Thanks. I used héllo as an example. I was wondering if there was a way of specifying a regular expression group for Unicode “alphabetic” characters.
>
> On reflection, it seems a somewhat esoteric requirement that is almost impossible to satisfy. By way of example, would
> “Straße” be considered alphabetic? Would “二の句” be considered alphabetic?
>
> Strangely, Python considered the Japanese characters as being alphabetic but will not accept “Straße” as a valid string. (I suspect this is due to some problem relating to Locale.)
>
> >>> "二の句".isalpha()
> True
> >>> “Straße".isalpha()
> File "<stdin>", line 1
> “Straße".isalpha()
> ^
> SyntaxError: invalid character in identifier
>
> Clearly, trying to identify “Unicode” alphabetic characters is far from straightforward, though it may well be useful in processing some language texts.
>
> Peter
>
>
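
A note on the SyntaxError in the quoted Python session: it does not appear to be locale-related. The literal there opens with a typographic quote (“) instead of a straight one, so Python never reads it as a string. With plain double quotes the check runs, as this small sketch shows (the sample list is only illustrative):

```python
# Same checks as in the quoted session, but with plain ASCII double quotes.
samples = ["Straße", "二の句", "abc123"]

# Map each sample to the result of the built-in str.isalpha().
results = {s: s.isalpha() for s in samples}

for s, ok in results.items():
    print(s, ok)
```

Here "Straße" is reported as alphabetic, since ß is a lowercase letter, and only "abc123" is rejected.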

Peter W A Wood

Jul 11, 2020, 6:27:32 AM
to racket...@googlegroups.com
Dear Ryan

Thank you for both your full, complete and understandable explanation and a working solution which is more than sufficient for my needs.

I created a very simple function based on the regexp that you suggested and tested it against a number of cases:


#lang racket
(require test-engine/racket-tests)

(check-expect (alpha? "") #f) ; empty string
(check-expect (alpha? "1") #f)
(check-expect (alpha? "a") #t)
(check-expect (alpha? "hello") #t)
(check-expect (alpha? "h1llo") #f)
(check-expect (alpha? "\u00E7c\u0327") #t) ; çç
(check-expect (alpha? "noe\u0308l") #t) ; noël
(check-expect (alpha? "\U01D122") #f) ; 𝄢 (bass clef)
(check-expect (alpha? "\u216B") #f) ; Ⅻ (roman numeral)
(check-expect (alpha? "\u0BEB") #f) ; ௫ (5 in Tamil)
(check-expect (alpha? "二の句") #t) ; Japanese word "ninoku"
(check-expect (alpha? "مدينة") #t) ; Arabic word "madina"
(check-expect (alpha? "٥") #f) ; Arabic number 5
(check-expect (alpha? "\u0628\uFCF2") #t) ; Arabic letter beh with shaddah
(define (alpha? s)
  ;; normalize to NFC first so decomposed letters recompose where possible
  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))
(test)

I suspect that there are some cases this does not handle, in scripts that require multiple code points to render a single character, such as Arabic with pronunciation marks, e.g. دُ نْيَا. At the moment, I don’t have the time (or need) to investigate further.
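
To illustrate that suspicion, and the alternative Ryan mentioned of widening the check to include modifiers, here is a rough Python sketch. It is only one possible reading of "letters plus marks"; the helper name `is_alpha_with_marks` is hypothetical, and accepting any Mark category after any letter is an assumption, not a rule taken from the thread.

```python
import unicodedata

def is_alpha_with_marks(s: str) -> bool:
    """Hypothetical helper: a letter first, then only letters or combining marks."""
    s = unicodedata.normalize("NFC", s)
    if not s or not unicodedata.category(s[0]).startswith("L"):
        return False
    # Major class "L" is Letter, "M" is Mark (Mn, Mc, Me).
    return all(unicodedata.category(ch)[0] in "LM" for ch in s)

# Dal followed by damma: NFC has no precomposed form, so two code points remain.
dal_damma = "\u062f\u064f"
print(len(unicodedata.normalize("NFC", dal_damma)))
print([unicodedata.category(ch) for ch in dal_damma])
print(is_alpha_with_marks(dal_damma))
```

Because the damma stays a separate nonspacing mark after NFC normalization, a letters-only pattern would reject this pair, while the widened check accepts it. The sketch does not verify that a given mark is meaningful on the letter before it.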

The depth of Racket’s Unicode support is impressive.

Once again, thanks.

Peter



Ryan Culpepper

Jul 11, 2020, 7:41:38 AM
to Peter W A Wood, Racket Users
Great, I'm glad it was useful!

Ryan

