How to support UTF


Evgueni Gordienko

Jan 7, 2018, 9:27:12 AM
to parboiled2.org User List
Hi All,

It looks like parboiled2 supports only ASCII by default. In
CharPredicate:
val LowerAlpha = CharPredicate('a' to 'z')
val UpperAlpha = CharPredicate('A' to 'Z')

What is the easiest way to enable UTF for parboiled2?

Thank you,
Evgueni

Mathias Doenitz

Jan 7, 2018, 9:50:33 AM
to parboil...@googlegroups.com
Hi Evgueni,

parboiled operates on 16-bit `Char`s, so its support for Unicode is identical to that of everything else on the JVM.
As long as you don't need to deal with supplementary characters, whose code points are above U+FFFF and which therefore cannot be represented as single 16-bit chars, you should be fine.

That said: the default character classes defined on the `CharPredicate` companion indeed only cover standard ASCII, as that can be done even more efficiently.

You can easily define your own CharPredicates via `CharPredicate.from`, e.g.

```
val AllLowerAlpha = CharPredicate.from(Character.isLowerCase)
```
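
For illustration, here is a minimal sketch of what `Character.isLowerCase` accepts, using only the JDK so that it runs without parboiled2 on the classpath. The object name `LowerCaseDemo` is hypothetical. The function has the same `Char => Boolean` shape as the one passed to `CharPredicate.from` above.

```scala
object LowerCaseDemo {
  // Same shape as the function handed to CharPredicate.from: Char => Boolean.
  val isLower: Char => Boolean = c => Character.isLowerCase(c)

  // Sample inputs: ASCII, accented Latin letters, and a Chinese character.
  val samples: List[Char] = List('a', '\u00F1', '\u00E7', '\u00D1', '腾')

  // Keeps only the samples that the predicate accepts.
  val accepted: List[Char] = samples.filter(isLower)
}
```

Accented lower-case Latin letters pass, while characters from scripts without case, such as Chinese, do not.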

HTH and cheers,
Mathias

---
mat...@parboiled.org
http://www.parboiled.org
> --
> You received this message because you are subscribed to the Google Groups "parboiled2.org User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to parboiled-use...@googlegroups.com.
> Visit this group at https://groups.google.com/group/parboiled-user.
> To view this discussion on the web visit https://groups.google.com/d/msgid/parboiled-user/6a279f01-ae8f-49c9-a6a9-152e6bcb8f29%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

evgueni....@verizon.com

Jan 11, 2018, 3:05:19 AM
to parboiled2.org User List
Hi Mathias,

The suggested fix works for Spanish and Portuguese, but not for Chinese.
I attached a simple test Maven project to show it.

Still investigating.
Thank you,
Evgueni
testutfparboiled.tar.gz

evgueni....@verizon.com

Jan 11, 2018, 1:52:45 PM
to parboiled2.org User List
Hi All,

Confirmed that UTF-8 characters that fit in 2 bytes are accepted by the parboiled2 parser.
3-byte characters such as Chinese ones (for example 腾) are not supported.

Is it possible to change the parser (say, to assume 4 bytes per UTF-8 character instead of 2 as now) to support extensions?

Thanks,
Evgueni

Mathias Doenitz

Jan 11, 2018, 4:32:59 PM
to parboil...@googlegroups.com
Hi Evgueni,

> Is it possible to change the parser (say, to assume 4 bytes per UTF-8 character instead of 2 as now) to support extensions?


UTF-8 decoding is done before parsing, when you load a source into a Java String.
Java Strings represent characters that do not fit into 16 bits as two chars (a surrogate pair).

If you simply parse these "long characters" as two 16-bit Java chars, things should just work.
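
To make the difference between UTF-8 bytes and 16-bit chars concrete, here is a small JDK-only sketch. The object name `CharLengthDemo` and the choice of U+20000 as a sample supplementary character are assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets

object CharLengthDemo {
  // Number of bytes needed to encode the string as UTF-8.
  def utf8Bytes(s: String): Int = s.getBytes(StandardCharsets.UTF_8).length

  // Number of 16-bit chars, which is what a Char-based parser steps over.
  def chars(s: String): Int = s.length

  // Number of Unicode code points in the string.
  def codePoints(s: String): Int = s.codePointCount(0, s.length)

  // A character inside the 16-bit range (3 bytes in UTF-8, but a single Char).
  val bmp: String = "腾"

  // U+20000 lies above U+FFFF, so the String stores it as two chars.
  val supplementary: String = new String(Character.toChars(0x20000))
}
```

The byte count of the UTF-8 encoding does not matter to the parser. Only characters above U+FFFF occupy two chars in the String.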

HTH and cheers,
Mathias

---
mat...@parboiled.org
http://www.parboiled.org


Alexander Myltsev

Jan 11, 2018, 4:42:11 PM
to parboiled2.org User List
The interesting thing about Chinese characters is that they don't match the proposed predicates. Consider an attempt to match "腾讯体育-NBA全网独播":

Character.isAlphabetic('腾') = true
Character.isLowerCase('腾') = false
Character.isUpperCase('腾') = false

Meaning that both

val AllLowerAlpha = CharPredicate.from(Character.isLowerCase)
val AllUpperAlpha = CharPredicate.from(Character.isUpperCase)

fail. But CharPredicate.from(Character.isAlphabetic) is impossible to define since isAlphabetic is of type Int => Boolean.
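
One possible workaround, sketched here under the assumption that only characters up to U+FFFF matter, is to wrap the `Int`-based method in a `Char => Boolean` function. The names `AlphabeticAdapter` and `isAlphabeticChar` are hypothetical, and the sketch uses only the JDK. Passing the wrapper to `CharPredicate.from` is expected to work because it has the required shape, but that step is not exercised here.

```scala
object AlphabeticAdapter {
  // Character.isAlphabetic only accepts an Int code point,
  // so the Char is widened explicitly before the call.
  val isAlphabeticChar: Char => Boolean = c => Character.isAlphabetic(c.toInt)

  // The two halves of a surrogate pair are not alphabetic on their own,
  // so characters above U+FFFF are not matched by this wrapper.
  val highSurrogateOfU20000: Char = Character.toChars(0x20000)(0)
}
```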


Mathias, do you propose to preprocess the input string somehow?

Mathias Doenitz

Jan 11, 2018, 4:49:52 PM
to parboil...@googlegroups.com
> Mathias, do you propose to preprocess the input string somehow?

No.
I'd simply UTF8 decode the target characters into a Java String and look at what 16-bit values make up these characters.
Then I'd simply match these.
That's all.
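
A minimal sketch of that inspection step, using only the JDK. The names `CodeUnitDump` and `units` are hypothetical helpers, not part of parboiled2.

```scala
object CodeUnitDump {
  // Formats every 16-bit char of the string as four upper-case hex digits.
  def units(s: String): List[String] =
    s.toList.map(c => f"${c.toInt}%04X")
}
```

Calling `CodeUnitDump.units` on a decoded string lists the values a rule would have to match. A character above U+FFFF shows up as two entries, a high surrogate followed by a low surrogate.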

Cheers,

Alexander Myltsev

Jan 12, 2018, 1:30:38 AM
to parboiled2.org User List
So Evgueni would have to split every Chinese character into its 2-byte representation, which I suspect looks different from the original string. That does not seem very comfortable to use.

Evgueni, could you check whether

val IsLetter = CharPredicate.from(Character.isLetter)

suits your needs?

Alexander Myltsev

Jan 12, 2018, 2:15:21 AM
to parboiled2.org User List
Mathias,

also, do you think implementing CharPredicate.from: Int => Boolean is a good idea?

Mathias Doenitz

Jan 12, 2018, 4:48:29 AM
to parboil...@googlegroups.com
> also, do you think implementing CharPredicate.from: Int => Boolean is a good idea?

How would you want to use it?

Alexander Myltsev

Jan 12, 2018, 5:19:14 AM
to parboiled2.org User List
CharPredicate.from(Character.isAlphabetic) ?

Mathias Doenitz

Jan 12, 2018, 5:27:21 AM
to parboil...@googlegroups.com
Yes, ok.
My question was misleading.

What I meant was: How would you want to implement it?

CharPredicate extends (Char ⇒ Boolean), so implementing `from(Int => Boolean)` will not work.
You'd have to introduce a new `IntCharPredicate extends (Int ⇒ Boolean)` that then also matches two characters instead of only one.
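
To illustrate what "matching two characters instead of only one" could mean, here is a rough JDK-only sketch. `CodePointMatch` and `matchAt` are hypothetical names. This is not parboiled2 code, and it says nothing about how such a predicate would be wired into the parser.

```scala
object CodePointMatch {
  // Tests the code point starting at `index` against an Int-based predicate.
  // Returns the number of chars consumed (1 or 2) on success, or 0 on failure.
  def matchAt(input: String, index: Int, p: Int => Boolean): Int =
    if (index < 0 || index >= input.length) 0
    else {
      val cp = input.codePointAt(index)
      if (p(cp)) Character.charCount(cp) else 0
    }

  // Character.isAlphabetic adapted to the Int => Boolean shape used above.
  val alphabetic: Int => Boolean = cp => Character.isAlphabetic(cp)
}
```

A caller would advance its cursor by the returned count, so a character above U+FFFF is consumed as a whole rather than one surrogate at a time.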

evgueni....@verizon.com

Jan 15, 2018, 6:41:43 PM
to parboiled2.org User List
Hi All,

Character.isLetter does the job.

Many thanks,
Evgueni

evgueni....@verizon.com

Jan 15, 2018, 6:43:23 PM
to parboiled2.org User List

scala> Character.isLetter('腾')
res0: Boolean = true
