FYI: Unicode considering changing recommended definition of \w

Karl Williamson

May 14, 2011, 7:31:51 PM
to Perl5 Porters
This was not posted for public comment.

They are considering changing \w to be the same as \p{xid_continue}.
There are several other options under consideration, but they do want to
at least get the MIDDLE DOT and the ANO TELEIA into \w.

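The change would affect the following code points (in diff-like output,
where "<" marks characters that would drop out of \w and ">" marks
characters that would be added to it):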
> 00B7 # MIDDLE DOT
< 037A # GREEK YPOGEGRAMMENI
> 0387 # GREEK ANO TELEIA
< 0488 # COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
< 0489 # COMBINING CYRILLIC MILLIONS SIGN
> 1369 # ETHIOPIC DIGIT ONE
...
> 1371 # ETHIOPIC DIGIT NINE
> 19DA # NEW TAI LUE THAM DIGIT ONE
< 20DD # COMBINING ENCLOSING CIRCLE
< 20DE # COMBINING ENCLOSING SQUARE
< 20DF # COMBINING ENCLOSING DIAMOND
< 20E0 # COMBINING ENCLOSING CIRCLE BACKSLASH
< 20E2 # COMBINING ENCLOSING SCREEN
< 20E3 # COMBINING ENCLOSING KEYCAP
< 20E4 # COMBINING ENCLOSING UPWARD POINTING TRIANGLE
> 2118 # SCRIPT CAPITAL P
> 212E # ESTIMATED SYMBOL
< 24B6 # CIRCLED LATIN CAPITAL LETTER A
...
< 24CF # CIRCLED LATIN CAPITAL LETTER Z
< 24D0 # CIRCLED LATIN SMALL LETTER A
...
< 24E9 # CIRCLED LATIN SMALL LETTER Z
< 2E2F # VERTICAL TILDE
< A670 # COMBINING CYRILLIC TEN MILLIONS SIGN
< A671 # COMBINING CYRILLIC HUNDRED MILLIONS SIGN
< A672 # COMBINING CYRILLIC THOUSAND MILLIONS SIGN
< FC5E # ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
< FC5F # ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
< FC60 # ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
< FC61 # ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
< FC62 # ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
< FC63 # ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
< FDFA # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
< FDFB # ARABIC LIGATURE JALLAJALALOUHOU
< FE70 # ARABIC FATHATAN ISOLATED FORM
< FE72 # ARABIC DAMMATAN ISOLATED FORM
< FE74 # ARABIC KASRATAN ISOLATED FORM
< FE76 # ARABIC FATHA ISOLATED FORM
< FE78 # ARABIC DAMMA ISOLATED FORM
< FE7A # ARABIC KASRA ISOLATED FORM
< FE7C # ARABIC SHADDA ISOLATED FORM
< FE7E # ARABIC SUKUN ISOLATED FORM
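
For anyone who wants to spot-check a few of these code points, here is a minimal sketch in Python (not Perl). It rests on two assumptions that are not from the message above: that `str.isidentifier()` can stand in for an XID_Continue test, and that the Unicode tables bundled with the local Python are recent enough. Note that Python's own `\w` is not the TR18 recommended `\w`, so its column is shown only for contrast; the helper names are made up for illustration.

```python
import re

def in_xid_continue(ch: str) -> bool:
    """True if ch may continue an identifier, i.e. has XID_Continue."""
    # str.isidentifier() wants XID_Start (or "_") first, then XID_Continue,
    # so prefixing a known-good start character isolates the test to ch.
    return ("a" + ch).isidentifier()

def matches_python_w(ch: str) -> bool:
    """True if Python's own Unicode-aware \\w matches ch (not TR18's \\w)."""
    return re.fullmatch(r"\w", ch) is not None

# A few code points taken from the list above.
SAMPLES = {
    "\u00b7": "MIDDLE DOT",
    "\u0387": "GREEK ANO TELEIA",
    "\u037a": "GREEK YPOGEGRAMMENI",
}

for ch, name in SAMPLES.items():
    print(f"U+{ord(ch):04X} {name}: "
          f"XID_Continue={in_xid_continue(ch)}, "
          f"Python \\w={matches_python_w(ch)}")
```

In Perl itself the comparable check would be matching against \p{XID_Continue} and \w directly; the results there depend on the Unicode version the perl was built with.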

Tom Christiansen

May 15, 2011, 9:20:52 PM
to Karl Williamson, Perl5 Porters Mailing List
Thanks Karl. I was wondering about that.

Any word yet on tr18 updates?

Thanks,

--tom

Karl Williamson

May 15, 2011, 10:11:51 PM
to Tom Christiansen, Perl5 Porters Mailing List

No. They often don't put up their meeting minutes for a long time, and
they are hard to decipher. You could monitor the web site.

Tom Christiansen

May 15, 2011, 10:14:18 PM
to Karl Williamson, Perl5 Porters Mailing List
Karl Williamson <pub...@khwilliamson.com> wrote
on Sun, 15 May 2011 20:11:51 MDT:

>No. They often don't put up their meeting minutes for a long time, and
>are hard to decipher. You could monitor the web site

I already looked for the minutes, but I can only see the approved ones.

--tom

Father Chrysostomos

May 16, 2011, 1:14:24 AM
to Perl5 Porters, Karl Williamson
Karl Williamson wrote:
> This was not posted for public comment.
>
> They are considering changing \w to be the same as \p{xid_continue}.
> There are several other options under consideration, but they do want to
> at least get the MIDDLE DOT and the ANO TELEIA into \w.

It seems strange to me that they should consider a punctuation mark a word character. The ano teleia is just as much a punctuation mark as is the question mark (U+037E).

Karl Williamson

May 16, 2011, 1:32:27 AM
to Father Chrysostomos, Perl5 Porters
It's because it's canonically equivalent to MIDDLE DOT. Now why that is,
I don't know.

In Catalan, something like the middle dot is apparently used in words,
and there has been pressure to encode a new character for that purpose,
which they would rather not do. But it sounds like a mistake for them
to try to bend Greek to Catalan rules.

Father Chrysostomos

May 16, 2011, 2:09:03 AM
to Karl Williamson, Perl5 Porters

On May 15, 2011, at 10:32 PM, Karl Williamson wrote:

> On 05/15/2011 11:14 PM, Father Chrysostomos wrote:
>> Karl Williamson wrote:
>>> This was not posted for public comment.
>>>
>>> They are considering changing \w to be the same as \p{xid_continue}.
>>> There are several other options under consideration, but they do want to
>>> at least get the MIDDLE DOT and the ANO TELEIA into \w.
>>
>> It seems strange to me that they should consider a punctuation mark a word character. The ano teleia is just as much a punctuation mark as is the question mark (U+037E).
>>
>>
> It's because its canonically equivalent to MIDDLE DOT.

That sounds like a mistake. Maybe they should break that equivalence. (In fact, that would explain some of the bizarre problems I’ve encountered with programs that try to canonicalise text without asking. That would fix those problems.)
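
The equivalence under discussion, and the way normalization silently rewrites the Greek punctuation, can be checked with a short Python sketch (not Perl). It assumes only the standard `unicodedata` module; the helper names are hypothetical, and the Greek question mark is included because it was mentioned earlier in the thread as a comparable case.

```python
import unicodedata

ANO_TELEIA = "\u0387"           # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"           # MIDDLE DOT
GREEK_QUESTION_MARK = "\u037e"  # GREEK QUESTION MARK
SEMICOLON = ";"

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def survives_nfc(ch: str) -> bool:
    """True if NFC normalization leaves the character unchanged."""
    return unicodedata.normalize("NFC", ch) == ch

for greek, generic in [(ANO_TELEIA, MIDDLE_DOT),
                       (GREEK_QUESTION_MARK, SEMICOLON)]:
    print(f"U+{ord(greek):04X} vs U+{ord(generic):04X}: "
          f"equivalent={canonically_equivalent(greek, generic)}, "
          f"Greek form survives NFC={survives_nfc(greek)}")
```

Any program that normalizes text without asking will therefore replace the Greek characters with their generic counterparts, which matches the kind of problem described above.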

Karl Williamson

May 16, 2011, 6:31:28 PM
to Father Chrysostomos, Perl5 Porters
On 05/16/2011 12:09 AM, Father Chrysostomos wrote:
>
> On May 15, 2011, at 10:32 PM, Karl Williamson wrote:
>
>> On 05/15/2011 11:14 PM, Father Chrysostomos wrote:
>>> Karl Williamson wrote:
>>>> This was not posted for public comment.
>>>>
>>>> They are considering changing \w to be the same as \p{xid_continue}.
>>>> There are several other options under consideration, but they do want to
>>>> at least get the MIDDLE DOT and the ANO TELEIA into \w.
>>>
>>> It seems strange to me that they should consider a punctuation mark a word character. The ano teleia is just as much a punctuation mark as is the question mark (U+037E).

Does it make sense at all for it to be part of an identifier?

Father Chrysostomos

May 17, 2011, 2:29:06 PM
to Karl Williamson, Perl5 Porters

On May 16, 2011, at 3:31 PM, Karl Williamson wrote:

> On 05/16/2011 12:09 AM, Father Chrysostomos wrote:
>>
>> On May 15, 2011, at 10:32 PM, Karl Williamson wrote:
>>
>>> On 05/15/2011 11:14 PM, Father Chrysostomos wrote:
>>>> Karl Williamson wrote:
>>>>> This was not posted for public comment.
>>>>>
>>>>> They are considering changing \w to be the same as \p{xid_continue}.
>>>>> There are several other options under consideration, but they do want to
>>>>> at least get the MIDDLE DOT and the ANO TELEIA into \w.
>>>>
>>>> It seems strange to me that they should consider a punctuation mark a word character. The ano teleia is just as much a punctuation mark as is the question mark (U+037E).
>
> Does it make sense at all for it to be part of an identifier?

I don’t think so.
