iswalpha and locales


Renji

Jun 25, 2016, 4:48:29 AM
to ISO C++ Standard - Future Proposals
As I can see on en.cppreference.com, iswalpha returns true for "any alphabetic character specific to the current locale". Thanks to this, on Debian iswalpha(L'か') (か is a hiragana character) returns false if I don't call std::setlocale before iswalpha, and iswalpha(L'か') returns true if I call std::setlocale with any valid locale name (tested with en_US.UTF-8, ru_RU.UTF-8, and even C.UTF-8). This makes no sense. I'm working with a Unicode character, so why do I need locales? And if the locale contains some important data, why doesn't the name of the locale matter?

Proposal: iswalpha must not depend on any locale.
PS Sorry if my English is bad.

Bo Persson

Jun 25, 2016, 7:24:17 AM
to std-pr...@isocpp.org
On 2016-06-25 10:48, Renji wrote:
> As I can see on en.cppreference.com
> <http://en.cppreference.com/w/cpp/string/wide/iswalpha>, iswalpha returns
> true for "any alphabetic character specific to the current *locale*".
> Thanks to this, on Debian iswalpha(L'か') (か is a hiragana
> character) returns false if I don't call std::setlocale before iswalpha,
> and iswalpha(L'か') returns true if I call std::setlocale with *any*
> valid locale name (tested with en_US.UTF-8, ru_RU.UTF-8, and even
> C.UTF-8). This makes no sense. I'm working with a *Uni*code character, so
> why do I need locales? And if the locale contains some important data,
> why doesn't the name of the locale matter?
>
> Proposal: iswalpha must not depend on any locale.

Don't know about Japanese, but in countries using Latin alphabets the
number of characters in the national alphabet varies.

Characters like åäöüÿïâéÀËÏ *very much* depend on the chosen locale.


Thiago Macieira

Jun 25, 2016, 1:34:50 PM
to std-pr...@isocpp.org
On Saturday, 25 June 2016 01:48:29 PDT Renji wrote:
> As I can see on en.cppreference.com
> <http://en.cppreference.com/w/cpp/string/wide/iswalpha>, iswalpha returns
> true for "any alphabetic character specific to the current *locale*".
> Thanks to this, on Debian iswalpha(L'か') (か is a hiragana character)
> returns false if I don't call std::setlocale before iswalpha, and
> iswalpha(L'か') returns true if I call std::setlocale with *any* valid
> locale name (tested with en_US.UTF-8, ru_RU.UTF-8, and even C.UTF-8).
> This makes no sense. I'm working with a *uni*code character, so why do I
> need locales? And if the locale contains some important data, why doesn't
> the name of the locale matter?
>
> Proposal: iswalpha must not depend on any locale.

That's an implementation detail, not a C standard issue.

The reason is that until you call setlocale(), the runtime in your C++
Standard Library implementation knows only of the US-ASCII C locale, in which
the only alphabetic characters are those in US-ASCII. That is, the default
locale is "C.ANSI_X3.4-1986".

The moment you set the locale to something Unicode, then it knows about the
entire Unicode range.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Aso Renji

Jun 25, 2016, 2:19:06 PM
to std-pr...@isocpp.org
Thiago Macieira <thi...@macieira.org> wrote in his message of Sat, 25 Jun
2016 20:34:47 +0300:

> The reason is that until you call setlocale(), the runtime in your C++
> Standard Library implementation knows only of the US-ASCII C locale
In other words, iswalpha accepts a Unicode character (wint_t) but
acknowledges only ASCII characters. No, it is a C standard issue. If you say
"I can work with Unicode", you must support all Unicode characters
(Unicode includes all national alphabets). If you say "I don't support
Unicode characters", you must not accept Unicode characters.

Or maybe wint_t is not Unicode? In that case, do you know of any other
character encoding with wide (not multi-byte) characters?

--
Written with the Opera mail client: http://www.opera.com/mail/

Aso Renji

Jun 25, 2016, 2:25:01 PM
to std-pr...@isocpp.org
Bo Persson <b...@gmb.dk> wrote in his message of Sat, 25 Jun 2016 14:24:06
+0300:

> Don't know about Japanese, but in countries using Latin alphabets the
> number of characters in the national alphabet varies.
Unicode contains ALL national alphabets. Therefore the number of characters
in Unicode does NOT vary. And iswalpha works with Unicode characters.

Thiago Macieira

Jun 25, 2016, 2:39:15 PM
to std-pr...@isocpp.org
wchar_t and wint_t are not required to be Unicode. The call to setlocale()
changes the encoding they use.

You should use char16_t and char32_t to be sure to have Unicode.

Aso Renji

Jun 25, 2016, 2:50:32 PM
to std-pr...@isocpp.org
Thiago Macieira <thi...@macieira.org> писал(а) в своём письме Sat, 25 Jun
2016 21:39:10 +0300:

> You should use char16_t and char32_t to be sure to have Unicode.
But iswalpha with a char16_t argument doesn't exist. In that case I change my
proposal to adding an isalpha version for UTF-16/UTF-32 characters, for
example isu16alpha.

Nicol Bolas

Jun 25, 2016, 3:06:43 PM
to ISO C++ Standard - Future Proposals, asor...@gmail.com

First, the number of Unicode codepoints does vary. Newer versions add new valid Unicode codepoints. There is a fixed upper limit of course, but there are large blocks below that limit which are unallocated.

Second, `iswalpha` does not work with Unicode. It works with wide characters, which may be Unicode. They also may not. It depends on your implementation, and your implementation may depend on your locale.

Nicol Bolas

Jun 25, 2016, 3:09:47 PM
to ISO C++ Standard - Future Proposals, asor...@gmail.com

Here's the problem with that. Unicode has very complex rules about what constitutes an "alphabetic" character. Indeed, it has a large list; for every codepoint it defines, it says if that codepoint is alphabetic or not. So too for many other questions like `isupper` and so forth.

That's a lot of tables to be including. I believe that the table can be compacted via clever programming to be of reasonable length. But that would require a good implementation to prove it.

Also, such functions cannot be applied to a `char16_t`, because a `char16_t` is not a Unicode codepoint. It is only a UTF-16 code unit, which may be a valid codepoint on its own or one half of a surrogate pair encoding a codepoint, in accord with the UTF-16 encoding rules.

Jeffrey Yasskin

Jun 25, 2016, 3:52:47 PM
to std-pr...@isocpp.org
I think Thiago has it: when wchar_t and iswalpha were designed,
Unicode hadn't won yet, so they're designed to work with non-Unicode
encodings. I don't know of any truly double-byte encodings other than
UTF-16, but wchar_t could be 4 bytes and encode GB-18030 instead of
UTF-32.

The committee is interested in improving our Unicode support, and
we're actively looking at proposals to do so, including
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0353r0.html
and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0244r1.html.
I think we'd also welcome a set of predicate functions that assume
their input is in one of the UTF formats and check for the Unicode
character class. You may have to figure something out for
multi-code-unit characters, or limit the proposal to char32_t.

Jeffrey (Library Evolution chair)

asor...@gmail.com

Jun 25, 2016, 4:00:54 PM
to ISO C++ Standard - Future Proposals, asor...@gmail.com


On Saturday, June 25, 2016 at 22:09:47 UTC+3, Nicol Bolas wrote:


That's a lot of tables to be including. I believe that the table can be compacted via clever programming to be of reasonable length. But that would require a good implementation to prove it.
bool isu32alpha(wint_t code)
{
    std::setlocale(LC_ALL,"any_unicode_locale");//implementation defined
    return iswalpha(code);
}
I've already tested this (see the first post); it works very well.
Or:
bool isu32alpha(wint_t code)
{
    // table: a 0x110000-bit bitmap, one bit per codepoint
    return code<=0x10FFFF ? (table[code/8] & (1<<(code%8))) != 0 : false;
}
A 0x110000-bit table is a reasonable length, at least if you keep the table in a separate .cpp file.

Nicol Bolas

Jun 25, 2016, 8:04:31 PM
to ISO C++ Standard - Future Proposals, asor...@gmail.com
On Saturday, June 25, 2016 at 4:00:54 PM UTC-4, asor...@gmail.com wrote:
On Saturday, June 25, 2016 at 22:09:47 UTC+3, Nicol Bolas wrote:
That's a lot of tables to be including. I believe that the table can be compacted via clever programming to be of reasonable length. But that would require a good implementation to prove it.
bool isu32alpha(wint_t code)
{
    std::setlocale(LC_ALL,"any_unicode_locale");//implementation defined
    return iswalpha(code);
}
I've already tested this (see the first post); it works very well.

Implementation-dependent code is implementation-dependent. So you've proved nothing.

Or:
bool isu32alpha(wint_t code)
{
    // table: a 0x110000-bit bitmap, one bit per codepoint
    return code<=0x10FFFF ? (table[code/8] & (1<<(code%8))) != 0 : false;
}
A 0x110000-bit table is a reasonable length, at least if you keep the table in a separate .cpp file.

... Have you looked at the Unicode tables? They are not small. Like I said, there are ways to make them smaller (and your code makes them far bigger than necessary, since only ~15% of the codepoint range is assigned). But there is no proof-of-concept that shows that it won't bloat executables by 100KB.

Also, there is no guarantee that `wint_t` can store a Unicode codepoint, so that API isn't reasonable. If you're serious about Unicode support, you need to focus on the types that actually store Unicode encodings.

asor...@gmail.com

Jun 25, 2016, 11:29:51 PM
to ISO C++ Standard - Future Proposals, asor...@gmail.com


On Sunday, June 26, 2016 at 3:04:31 UTC+3, Nicol Bolas wrote:

... Have you looked at the Unicode tables? They are not small. Like I said, there are ways to make them smaller (and your code makes them far bigger than necessary, since only ~15% of the codepoint range is assigned). But there is no proof-of-concept that shows that it won't bloat executables by 100KB.

Yes, there are ways to make them smaller. But there is no way to make them faster. Sacrifice time to save 100KB? We don't load programs from floppies anymore; we have gigabytes of RAM and at least gigabytes of disk space. In the modern world, 100KB is an insignificant sacrifice for speed. In any case, on the desktop, executables get their C function implementations from the libc6-dev package on Linux or from User32.dll on Windows. If Linux or Windows bloats by 100KB, you can't even notice.
PS A single std::regex bloats executables by 100KB. I think we must remove std::regex from the C++11 standard library.
Also, there is no guarantee that `wint_t` can store a Unicode codepoint, so that API isn't reasonable. If you're serious about Unicode support, you need to focus on the types that actually store Unicode encodings.
Okay, char32_t. 

Thiago Macieira

Jun 26, 2016, 12:07:57 AM
to std-pr...@isocpp.org
On Saturday, 25 June 2016 20:29:50 PDT asor...@gmail.com wrote:
> Yes, there are ways to make them smaller. But there is no way to make them
> *faster*. Sacrifice time to save 100KB? We don't load programs from floppies
> anymore; we have gigabytes of RAM and at least gigabytes of disk space. In

No, we don't. There are many modern microcontroller-class CPUs with less than
1 MB of flash and around a hundred kilobytes of RAM, or less. Why shouldn't
we program them with C++?

Another reason is that often the library gets supplied with the executable.
Once we load the entire set of Unicode tables for all attributes, it may be
well over a megabyte. In fact, the ICU data table is an 18 MB library. That
means your small 100-line C++ program is at least that big.

I don't mean to discourage you. I do think we need some more Unicode support
in the standard library, like better conversion functions from the current
locale to UTF-16 and 32. Qt has provided that for 15 years, so why not the
standard library?

> > Also, there is no guarantee that `wint_t` can store a Unicode codepoint,
> > so that API isn't reasonable. If you're serious about Unicode support, you
> > need to focus on the types that actually store Unicode encodings.
>
> Okay, char32_t.

asor...@gmail.com

Jun 26, 2016, 1:39:17 AM
to ISO C++ Standard - Future Proposals


On Sunday, June 26, 2016 at 7:07:57 UTC+3, Thiago Macieira wrote:

No, we don't. There are many modern microcontroller-class CPUs with less than
1 MB of flash and a around a hundred kilobytes of RAM, or less. Why shouldn't
we program them with C++?

Because microcontroller-class CPUs can't provide hardware support for std::thread, std::mutex, etc. Also, I don't think this support is present in microcontroller OSes (if the microcontroller has one).
Yes, std::thread and iswalpha are different things. But if you write a program for a microcontroller, you should be prepared for some restrictions and forget about full C++11 support.

Another reason is that often the library gets supplied with the executable.
Once we load the entire set of Unicode tables for all attributes, it may be
well over a megabyte. In fact, the ICU data table is an 18 MB library. That
means your small 100-line C++ program is at least that big.

In that case my program bloats by 18 MB only if I use isu32alpha. Otherwise a smart compiler can remove the code of the unused function. But if I do use isu32alpha, I need proper Unicode support. I don't need a defective wide-character function that can't work with (wide) character codes above 127. If the price of proper support is 18 MB, so be it.

Patrice Roy

Jun 26, 2016, 1:52:16 AM
to std-pr...@isocpp.org
18 MB means we lose many platforms where we are today. It's 9 times the size of my whole current project (with debug info in!), which runs on embedded devices. Please remember that we want Unicode support, but not at the cost of dropping target platforms that are very much alive today.

I'll be happy to see good Unicode support proposals at committee meetings, but I'd strongly advise that they be aware of such issues.


asor...@gmail.com

Jun 26, 2016, 2:04:59 AM
to ISO C++ Standard - Future Proposals


On Sunday, June 26, 2016 at 8:52:16 UTC+3, Patrice Roy wrote:
18 MB means we lose many platforms where we are today. It's 9 times the size of my whole current project (with debug info in!), which runs on embedded devices. Please remember that we want Unicode support, but not at the cost of dropping target platforms that are very much alive today.

Unicode uses the codespace from 0 to 0x10FFFF. Therefore we need at most 0x110000 bits for a single predicate function. 18 MB is enough for a hundred predicates. I don't think you really need so many.

Jeffrey Yasskin

Jun 26, 2016, 4:28:35 AM
to std-pr...@isocpp.org
On Sun, Jun 26, 2016 at 7:07 AM, Thiago Macieira <thi...@macieira.org> wrote:
> On Saturday, 25 June 2016 20:29:50 PDT asor...@gmail.com wrote:
>> Yes, there are ways to make them smaller. But there is no way to make them
>> *faster*. Sacrifice time to save 100KB? We don't load programs from floppies
>> anymore; we have gigabytes of RAM and at least gigabytes of disk space. In
>
> No, we don't. There are many modern microcontroller-class CPUs with less than
> 1 MB of flash and around a hundred kilobytes of RAM, or less. Why shouldn't
> we program them with C++?
>
> Another reason is that often the library gets supplied with the executable.
> Once we load the entire set of Unicode tables for all attributes, it may be
> well over a megabyte. In fact, the ICU data table is an 18 MB library. That
> means your small 100-line C++ program is at least that big.

ICU does have ways to subset its data tables to include only the parts
you use. The proposal author should probably validate that to show us
what kinds of subsets are already possible, but it'll also be possible
to add new subsets if the C++ library wants to make finer-grained
distinctions.

Standard libraries targeting microcontrollers may need ways to ship only
subsets of Unicode, according to user assertions about which characters
they actually use, but I'm pretty confident it's possible to make the
size reasonable.
to make the size reasonable.

Jeffrey

Nicol Bolas

Jun 26, 2016, 9:08:45 AM
to ISO C++ Standard - Future Proposals, asor...@gmail.com

Again, let's not forget that most of that range is not actually assigned and therefore takes up 0 bits.

Not all of the properties in the Unicode tables are binary. Indeed, most are not. Case-conversion, for example, cannot be binary. It has to specify how you go from codepoint X to one or more codepoints YZW. For each codepoint. That cannot take up a single bit per codepoint.

That being said, I firmly believe that 18MB is much larger than the Unicode tables need to be. That there must be clever ways to make that table much smaller (on the order of hundreds of kilobytes rather than megabytes). But as of yet, I have not undertaken the task of proving that, so that doesn't mean much.

And even if I'm wrong, I bet we can provide certain very useful features that require less than the full Unicode table space. For example, I'd bet that the non-compatibility Unicode normalization forms require much less table space than the compatibility ones. I'd bet that grapheme cluster iteration requires much less table space than case conversion.

Again, not proven. But it'd certainly be a worthy research project. If Unicode normalization functions only cost 15KB of executable room, that might be a reasonable tradeoff. Obviously, if you don't call them at all, you should get zero increase.

Nicol Bolas

Jun 26, 2016, 9:11:32 AM
to ISO C++ Standard - Future Proposals
On Sunday, June 26, 2016 at 12:07:57 AM UTC-4, Thiago Macieira wrote:
On Saturday, 25 June 2016 20:29:50 PDT asor...@gmail.com wrote:
> Yes, there are ways to make them smaller. But there is no way to make them
> *faster*. Sacrifice time to save 100KB? We don't load programs from floppies
> anymore; we have gigabytes of RAM and at least gigabytes of disk space. In

No, we don't. There are many modern microcontroller-class CPUs with less than
1 MB of flash and around a hundred kilobytes of RAM, or less. Why shouldn't
we program them with C++?

To be fair, microcontroller code probably isn't doing Unicode case-conversion or string comparisons. So as long as the mere presence of such functions in the standard library doesn't cause bloat (and why would it?), I don't think that case would be a problem.

asor...@gmail.com

Jun 26, 2016, 10:03:02 AM
to ISO C++ Standard - Future Proposals


On Sunday, June 26, 2016 at 11:28:35 UTC+3, Jeffrey Yasskin wrote:

ICU does have ways to subset its data tables to include only the parts
you use. The proposal author should probably validate that to show us
what kinds of subsets are already possible, but it'll also be possible
to add new subsets if the C++ library wants to make finer-grained
distinctions.

Okay, let's use the subset model. Unicode uses 273 Unicode blocks, 271792 codes in total. We can write something like this:
#include <algorithm>
#include <cstdint>

struct u32_subset
{
    int32_t lower_bound,size;
    const int8_t*table;
};
bool isu32alpha(int32_t code)
{
    static const u32_subset unicode_blocks[273]={/*some large table*/};

    // Find the last block whose lower_bound is <= code.
    const u32_subset*block=std::upper_bound(unicode_blocks,unicode_blocks+273,code,
        [](int32_t c,const u32_subset&b){return c<b.lower_bound;});
    if(block==unicode_blocks)return false;// code is below the first block
    --block;
    size_t offset=size_t(code-block->lower_bound);
    return offset<size_t(block->size) ? (block->table[offset/8]&(1<<(offset&7)))!=0 : false;
}

u32_subset requires 4 bytes for lower_bound, 4 bytes for size, and 8 bytes for the table pointer: 273*16=4368 bytes total.
All 273 tables require 271792 bits, or 33974 bytes total.
4368+33974=38342 bytes total.
Is 38 KB still so big that your 4+ GB desktop can't afford it?

asor...@gmail.com

Jun 26, 2016, 10:03:50 AM
to ISO C++ Standard - Future Proposals, asor...@gmail.com


On Sunday, June 26, 2016 at 16:08:45 UTC+3, Nicol Bolas wrote:
On Sunday, June 26, 2016 at 2:04:59 AM UTC-4, asor...@gmail.com wrote:
On Sunday, June 26, 2016 at 8:52:16 UTC+3, Patrice Roy wrote:
18 MB means we lose many platforms where we are today. It's 9 times the size of my whole current project (with debug info in!), which runs on embedded devices. Please remember that we want Unicode support, but not at the cost of dropping target platforms that are very much alive today.

Unicode uses the codespace from 0 to 0x10FFFF. Therefore we need at most 0x110000 bits for a single predicate function. 18 MB is enough for a hundred predicates. I don't think you really need so many.

Again, let's not forget that most of that range is not actually assigned and therefore takes up 0 bits.

Not all of the properties in the Unicode tables are binary. Indeed, most are not. Case-conversion, for example, cannot be binary. It has to specify how you go from codepoint X to one or more codepoints YZW. For each codepoint. That cannot take up a single bit per codepoint.

Let's concentrate on predicate functions (those that return a bool value). I don't believe isalpha or isupper has this sort of problem. Although we should decide what "isupper" means in languages without upper- and lowercase characters, in Japanese for example.
That being said, I firmly believe that 18MB is much larger than the Unicode tables need to be. That there must be clever ways to make that table much smaller (on the order of hundreds of kilobytes rather than megabytes). But as of yet, I have not undertaken the task of proving that, so that doesn't mean much.

In my answer to Jeffrey Yasskin, I compressed the tables to 38 KB. But the price of this is a binary search with eight (log2(273)) comparisons, and eight problems for the branch prediction unit. At least on the desktop I would prefer to avoid this.

Nicol Bolas

Jun 26, 2016, 11:12:51 AM
to ISO C++ Standard - Future Proposals, asor...@gmail.com

1: The world of C++ is greater than the world of "4+GB desktop" computers.

2: Why are you using `lower_bound` for a table? That's like making an `unordered_map`, then using a range-for loop to search for an item by its key. You should be able to get the exact block index for a Unicode codepoint with some simple mathematics.

3: 38KB is still way too big for this information.

Nicol Bolas

Jun 26, 2016, 11:14:09 AM
to ISO C++ Standard - Future Proposals, asor...@gmail.com
On Sunday, June 26, 2016 at 10:03:50 AM UTC-4, asor...@gmail.com wrote:
On Sunday, June 26, 2016 at 16:08:45 UTC+3, Nicol Bolas wrote:
On Sunday, June 26, 2016 at 2:04:59 AM UTC-4, asor...@gmail.com wrote:
On Sunday, June 26, 2016 at 8:52:16 UTC+3, Patrice Roy wrote:
18 MB means we lose many platforms where we are today. It's 9 times the size of my whole current project (with debug info in!), which runs on embedded devices. Please remember that we want Unicode support, but not at the cost of dropping target platforms that are very much alive today.

Unicode uses the codespace from 0 to 0x10FFFF. Therefore we need at most 0x110000 bits for a single predicate function. 18 MB is enough for a hundred predicates. I don't think you really need so many.

Again, let's not forget that most of that range is not actually assigned and therefore takes up 0 bits.

Not all of the properties in the Unicode tables are binary. Indeed, most are not. Case-conversion, for example, cannot be binary. It has to specify how you go from codepoint X to one or more codepoints YZW. For each codepoint. That cannot take up a single bit per codepoint.

Let's concentrate on predicate functions (those that return a bool value). I don't believe isalpha or isupper has this sort of problem. Although we should decide what "isupper" means in languages without upper- and lowercase characters, in Japanese for example.

Before you can talk about what things Unicode should have, you should probably look at how Unicode currently works. Unicode already has an answer for what the case properties of CJK ideograms are. The only thing any prospective C++ API for Unicode should do is provide Unicode's answers for questions that Unicode has answers to.

That being said, I firmly believe that 18MB is much larger than the Unicode tables need to be. That there must be clever ways to make that table much smaller (on the order of hundreds of kilobytes rather than megabytes). But as of yet, I have not undertaken the task of proving that, so that doesn't mean much.

In my answer to Jeffrey Yasskin, I compressed the tables to 38 KB.

No, you did not. You compressed a single table. That's not "tables".

asor...@gmail.com

Jun 26, 2016, 12:44:21 PM
to ISO C++ Standard - Future Proposals, asor...@gmail.com


On Sunday, June 26, 2016 at 18:12:51 UTC+3, Nicol Bolas wrote:

2: Why are you using `lower_bound` for a table? That's like making an `unordered_map`, then using a range-for loop to search for an item by its key. You should be able to get the exact block index for a Unicode codepoint with some simple mathematics.
Because Unicode blocks have random sizes. I could use Unicode planes with fixed sizes and simple mathematics instead. But then you would start complaining: "oh, six Unicode planes contain 65536*6=393216 characters! You want to spend 393216 bits, or 49 KB? That's a very, very large amount of memory!"
You can get high speed with simple code, or you can get minimum memory usage. But you can't get both at the same time.

3: 38KB is still way too big for this information.
Only if you know a more economical method. If a more economical method doesn't exist, then 38 KB is a reasonable size.

Thiago Macieira

Jun 26, 2016, 1:14:45 PM
to std-pr...@isocpp.org
On Saturday, 25 June 2016 22:39:16 PDT asor...@gmail.com wrote:
> Because microcontroller-class CPUs can't provide hardware support for
> std::thread, std::mutex, etc. Also, I don't think this support is present
> in microcontroller OSes (if the microcontroller has one).
> Yes, std::thread and iswalpha are different things. But if you write a
> program for a microcontroller, you should be prepared for some restrictions
> and forget about full C++11 support.

I think you know as much about microcontroller OSes as you know about
microcontrollers themselves.

I've been working with the people developing Zephyr OS and there they have
fibers and protothreads. There's no reason those primitives couldn't be
supported, if needed.

And let me repeat: I do want some more Unicode support. Just remember the C++
rule of not paying for the cost of things you're not using. Unfortunately,
Unicode character properties are one of those things that cost a lot, just
like the rest of the locale databases and timezones.

Thiago Macieira

Jun 26, 2016, 1:24:18 PM
to std-pr...@isocpp.org
On Sunday, 26 June 2016 06:08:44 PDT Nicol Bolas wrote:
> That being said, I firmly believe that 18MB is much larger than the Unicode
> tables *need* to be. That there must be clever ways to make that table much
> smaller (on the order of hundreds of kilobytes rather than megabytes). But
> as of yet, I have not undertaken the task of *proving* that, so that
> doesn't mean much.
>
> And even if I'm wrong, I bet we can provide certain very useful features
> that require less than the full Unicode table space. For example, I'd bet
> that the non-compatibility Unicode normalization forms require much less
> table space than the compatibility ones. I'd bet that grapheme cluster
> iteration requires much less table space than case conversion.

ICU comes with a tool to select which properties and which locales to include
in your data pack. It's just not an easy tool to use and I personally know of
no one that has successfully deployed the data file with it.

Un-selecting entries from the "lines" of the database is often a short-sighted
decision. You may think "my application will not be run in Thailand" and thus
remove support for Thai graphemes along with their locale information. But
then you may get a Thai customer calling your application's support, and they
may not even be in Thailand.

Un-selecting "columns" would be safer, but you often don't know which
properties your application needs. You might think, like you said above, that
you don't need the compatibility normalisations, only to find out that
Internationalised Domain Names do need NFKC.

Also, ICU 57.1 isn't 18 MB:

$ v -h /usr/share/icu/57.1/icudt57l.dat
-rw-r--r-- 1 root root 25M Jun 15 06:14 /usr/share/icu/57.1/icudt57l.dat

Thiago Macieira

Jun 26, 2016, 1:29:51 PM
to std-pr...@isocpp.org
On Sunday, 26 June 2016 06:11:31 PDT Nicol Bolas wrote:
> To be fair, microcontroller code probably isn't doing Unicode
> case-conversion or string comparisons. So as long as the mere *presence* of
> such functions in the standard library doesn't cause bloat (and why would
> it?), I don't think that case would be a problem.

Hopefully, but often people don't remember that when creating their Internet of
Things protocols. Trust me, we're running into that in the Open Connectivity
Foundation: many things dictated by IEEE and the CoRE initiative are case
insensitive, and since data is often encoded in UTF-8, the logical
conclusion...

Nicol Bolas

Jun 26, 2016, 3:51:09 PM
to ISO C++ Standard - Future Proposals, asor...@gmail.com
On Sunday, June 26, 2016 at 12:44:21 PM UTC-4, asor...@gmail.com wrote:
On Sunday, June 26, 2016 at 18:12:51 UTC+3, Nicol Bolas wrote:

2: Why are you using `lower_bound` for a table? That's like making an `unordered_map`, then using a range-for loop to search for an item by its key. You should be able to get the exact block index for a Unicode codepoint with some simple mathematics.

Why divide it up by Unicode's arbitrary blocks to begin with? That makes searching require a lot of conditional branching and cache misses.

I would start by dividing it into Unicode planes, and dividing each plane into X regions of Y codepoints apiece. Different properties, depending on the distribution of attributes, could have different region sizes. The idea being that you choose region sizes based on the specific distribution within a plane. The overall goal being that you can jump from a codepoint directly to the specific set of codepoints from which to fetch the exact value.

If all of the codepoints in a region share the same value, you can instead jump to a function that returns that default value, rather than fetching it from a table of duplicate entries. Or rather more to the point, each region within a plane is implemented as a function, which may use a table to generate the return value, return a single default, employ RLE encoding, or any number of other tricks to reduce the overall size of the compiled binary.

All of which would be much faster than a binary search.

3: 38KB is still way too big for this information.
Only if you know a more economical method. If a more economical method doesn't exist, then 38 KB is a reasonable size.

OK, allow me to say what I mean a different way. Having that information available at all is not worth 38KB to most users. The number of people who truly need to know if a codepoint is an alphabetic character or not is far less than the number of people who need, for example, Unicode normalization. Or grapheme cluster iteration. And so forth.

`isalpha` is not worth the cost. And no, normalization doesn't use that property. Even Unicode collation doesn't use that property, and that's about the most property-laden operation that Unicode offers.

Nicol Bolas

unread,
Jun 26, 2016, 3:53:02 PM6/26/16
to ISO C++ Standard - Future Proposals
On Sunday, June 26, 2016 at 1:24:18 PM UTC-4, Thiago Macieira wrote:
On Sunday, June 26, 2016 06:08:44 PDT Nicol Bolas wrote:
> That being said, I firmly believe that 18MB is much larger than the Unicode
> tables *need* to be. That there must be clever ways to make that table much
> smaller (on the order of hundreds of kilobytes rather than megabytes). But
> as of yet, I have not undertaken the task of *proving* that, so that
> doesn't mean much.
>
> And even if I'm wrong, I bet we can provide certain very useful features
> that require less than the full Unicode table space. For example, I'd bet
> that the non-compatibility Unicode normalization forms require much less
> table space than the compatibility ones. I'd bet that grapheme cluster
> iteration requires much less table space than case conversion.

ICU comes with a tool to select which properties and which locales to include
in your data pack. It's just not an easy tool to use and I personally know of
no one that has successfully deployed the data file with it.

Un-selecting entries from the "lines" from the database is often a short-
sighted decision. You may think "my application will not be run in Thailand"
and thus remove support for Thai grapheme support along with its locale
information. But then you may get a Thai customer calling your application's  
support and they may not even be in Thailand.

Un-selecting "columns" would be safer, but you often don't know which
properties your application needs. You might think like you said above that
you don't need the non-compatibility normalisations, only to find out that
Internationalised Domain Names does need NFKC.

Well, removing specific properties is a compile-time decision, since those functions simply don't exist. Remember: we're not talking about what to "remove" necessarily; we're talking about what should be added to the standard library. And if we don't add the compatibility normalization forms (because we deem them to be too costly), then whatever IDN needs is irrelevant; the standard library simply doesn't support it.

The general idea I'm trying to get to is that some operations are worth spending X amount of memory, and some operations are not. We should try to ascertain how much memory each Unicode operation that uses Unicode properties costs, so that we can determine which ones are worth supporting and which ones aren't.


Also, ICU 57.1 isn't 18 MB:

$ v -h /usr/share/icu/57.1/icudt57l.dat
-rw-r--r-- 1 root root 25M Jun 15 06:14 /usr/share/icu/57.1/icudt57l.dat

They store it as a file to be directly included? No run-length encoding of series of elements that contain the same value or anything?

Thiago Macieira

unread,
Jun 26, 2016, 4:02:23 PM6/26/16
to std-pr...@isocpp.org
On Sunday, June 26, 2016 12:53:01 PDT Nicol Bolas wrote:
> > Also, ICU 57.1 isn't 18 MB:
> >
> > $ v -h /usr/share/icu/57.1/icudt57l.dat
> > -rw-r--r-- 1 root root 25M Jun 15 06:14 /usr/share/icu/57.1/icudt57l.dat
>
> They store it as a file to be directly included? No run-length encoding of
> series of elements that contain the same value or anything?

They store it as a binary file (the L is for "little-endian") and they apply
compression to it, as far as I know.

It's just that ICU is more than just the Unicode property tables. It's ALL the
Unicode tables, plus CLDR, plus the IANA/Olson timezone database, plus
whatever else I haven't found out yet.

Qt has a portion of the Unicode tables and CLDR too. We do compress the
Unicode tables, but just for QString & QChar usage, they're several hundred kB
already.

asor...@gmail.com

unread,
Jun 26, 2016, 4:42:01 PM6/26/16
to ISO C++ Standard - Future Proposals, asor...@gmail.com


On Sunday, June 26, 2016 at 22:51:09 UTC+3, Nicol Bolas wrote:

Why divide it up by Unicode's arbitrary blocks to begin with? That makes searching require a lot of conditional branching and cache misses.

Because it excludes unassigned codepoints from the table. After that, yes, we can use more complicated compression, but at the price of more complicated and slower code.

OK, allow me to say what I mean a different way. Having that information available at all is not worth 38KB to most users. The number of people who truly need to know if a codepoint is an alphabetic character or not is far less than the number of people who need, for example, Unicode normalization. Or grapheme cluster iteration. And so forth.
If you are happy with a minimal `return (c>='a' && c<='z') || (c>='A' && c<='Z');` implementation, then just use isalpha instead. If you need more, it has a price.

Tony V E

unread,
Jun 26, 2016, 5:53:17 PM6/26/16
to ISO C++ Standard - Future Proposals, asor...@gmail.com
> If you are happy with a minimal `return (c>='a' && c<='z') || (c>='A' && c<='Z');` implementation...


That ("of course") doesn't work with EBCDIC. :-)

My real point being that none of this stuff is as simple as it seems. And even when we know it's complicated, it is still more complicated than that. 

As suggested by others, if we layer or slice functionality correctly, we can get close* to 'don't use what you don't pay for', but there is a ton of work to get there.

[*for sufficiently large values of 'close']

There is a reason ICU is huge. I hope no one is suggesting we could get all the functionality at 1/10 the price. 


Sent from my BlackBerry portable Babbage Device
Sent: Sunday, June 26, 2016 4:42 PM
To: ISO C++ Standard - Future Proposals
Subject: Re: [std-proposals] iswalpha and locales

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposal...@isocpp.org.
To post to this group, send email to std-pr...@isocpp.org.

asor...@gmail.com

unread,
Jun 26, 2016, 6:26:56 PM6/26/16
to ISO C++ Standard - Future Proposals, asor...@gmail.com


On Monday, June 27, 2016 at 0:53:17 UTC+3, Tony V E wrote:

As suggested by others, if we layer or slice functionality correctly, we can get close* to 'don't use what you don't pay for', but there is a ton of work to get there.
Okay, how about this?
template<int64_t flags>
bool isu32alpha(char32_t code)
{
    return ((flags & EN_LANGUAGE_SUPPORT) && en_isu32alpha(code)) ||
           ((flags & RU_LANGUAGE_SUPPORT) && ru_isu32alpha(code)) ||
           ((flags & JA_LANGUAGE_SUPPORT) && ja_isu32alpha(code)) ||
           // ...one clause per supported alphabet...
           false;
}

// default (non-template) implementation
bool isu32alpha(char32_t code)
{
    // code supporting all national alphabets
}

By default you get support for all national alphabets. That is the best choice on a desktop, where Unicode support is part of the OS. But if you need only partial support, you can select which alphabets you need. This is a compile-time choice, so a smart compiler can remove all unused functions.

Thiago Macieira

unread,
Jun 27, 2016, 12:22:25 AM6/27/16
to std-pr...@isocpp.org
On Sunday, June 26, 2016 15:26:55 PDT asor...@gmail.com wrote:
> But if you need only partial
> support, you can select which alphabets you need. This is a compile-time
> choice, so a smart compiler can remove all unused functions.

See my email when I said that developers (and product managers) are often
wrong about what locales they'll need in their applications.

Matthew Woehlke

unread,
Jun 27, 2016, 10:30:54 AM6/27/16
to std-pr...@isocpp.org
On 2016-06-25 15:06, Nicol Bolas wrote:
> On Saturday, June 25, 2016 at 2:25:01 PM UTC-4, Aso Renji wrote:
>> On 2016-06-25 07:24, Bo Persson wrote:
>>> Don't know about Japanese, but in countries using latin alphabets the
>>> number of characters in the national alphabet vary.
>> Unicode contain ALL national alphabets. Therefore number of characters in
>>
>> unicode NOT vary. And isWalpha work with unicode characters.
>
> First, the number of Unicode codepoints does vary. Newer versions add new
> valid Unicode codepoints. There is a fixed upper limit of course, but there
> are large blocks of that limit which are unallocated.

It doesn't vary as a function of the human-language portion of a locale.
(At least, I would hope not on any sane implementation.)

--
Matthew

Matthew Woehlke

unread,
Jun 27, 2016, 11:11:16 AM6/27/16
to std-pr...@isocpp.org
On 2016-06-25 20:04, Nicol Bolas wrote:
> ... Have you looked at the Unicode tables? They are not small. Like I said,
> there are ways to make them smaller (and your code makes them far bigger
> than necessary, since only ~15% of the codepoint range is assigned). But
> there is no proof-of-concept that shows that it won't bloat executables by
> 100KB.

Huh? Why on earth would you bake the tables into the executable ROM? Any
sane implementation is going to store them in shared memory.

For grins, I wrote a simple test program that calls 'iswalpha' on its
argument... it is 8618 bytes. (For comparison, I wrote a program that
does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
bytes. Hardly 100 KiB.)

--
Matthew

Jean-Marc Bourguet

unread,
Jun 27, 2016, 11:18:42 AM6/27/16
to ISO C++ Standard - Future Proposals, asor...@gmail.com
On Saturday, June 25, 2016 at 20:19:06 UTC+2, Aso Renji wrote:
Or, maybe wint_t is not Unicode? In that case, do you know of any other character
encoding with wide (not multi-byte) characters?


The encoding used for wchar_t is locale specific. The encoding model of C and C++ is that a locale has one charset and three encodings for that charset:

- a narrow encoding using char as the encoding unit, which can have shift state and be multi-byte;

- a wide encoding using wchar_t, which may not have shift state and may not use several wchar_t units to represent a code point;

- an external encoding with fewer restrictions -- the best known is that end of line may be represented by something other than a single character -- observable by comparing binary and text file IO.

If my understanding is correct, you can have a locale using UTF-8 as the narrow encoding, UTF-32 as the wide encoding, and UTF-16 with a BOM at the start and CR-LF as the line separator as the external encoding. I'm pretty sure -- I don't have access to that hardware/software combination anymore -- that I've used systems which had available at the same time:

- ascii char set, char and wchar_t are using directly the code point value

- ISO-8859-X charsets, char and wchar_t were using directly the code point value (note that this is different from the Linux behavior which is to use the code point value for char and the Unicode code point value for wchar_t -- note also that this is the case I wanted to check if I was remembering correctly)

- CJK charsets with char being a serialized form of EUC and wchar_t grouping the number of chars needed for the character (EUC is a way to encode a subset of ISO 2022 streams without using shift state with a maximum of 4 bytes per char)

- the unicode charset, char is UTF-8 and wchar_t is the code point.

Most language/region pairs had variant locales for several charsets (for instance Latin-1, Latin-9 and Unicode for the French locales, each of which would have used a different wide encoding).

Yours

Matthew Woehlke

unread,
Jun 27, 2016, 11:27:29 AM6/27/16
to std-pr...@isocpp.org
Okay, clarification... yes, you need to store the data *somewhere*. So I
guess you are talking specifically about OS-less embedded platforms
where the executable - including statically linked standard library - is
possibly the only thing on the device.

I'd argue that Unicode support should not be required for freestanding
implementations. (How many people on tiny embedded systems are dealing
with Unicode, anyway?) That seems to solve the problem neatly...

--
Matthew

Matthew Woehlke

unread,
Jun 27, 2016, 11:56:54 AM6/27/16
to std-pr...@isocpp.org
BTW, I can represent the tables for iswalpha in 3612 bytes (not counting
metadata e.g. the table size).

Others:

iswalpha: 3612
iswblank: 68
iswgraph: 3588
iswlower: 4580
iswprint: 3572
iswpunct: 2020
iswspace: 76
iswupper: 4428
iswxdigit: 28

total: 21972

This is just the data table sizes, of course, not including code. Some,
obviously, may be better implemented as pure functions. Also, in
fairness, the code to use these requires a lower_bound (i.e. non-trivial
branching).

(Because the program I used to generate these is overly simplified, I
believe the numbers are all 4 bytes larger than is actually needed.
Also, no iswdigit, as that is specified as true for exactly '0'-'9',
which is trivial to implement statically as two compares and a Boolean and.)
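The iswdigit implementation mentioned above, sketched out. The function name here is made up; the point is only that the C standard pins iswdigit to exactly the ten characters '0' through '9', so no table is needed.

```cpp
// Hypothetical stand-in for iswdigit: two compares and a Boolean AND.
inline bool is_wdigit_sketch(wchar_t c) {
    return c >= L'0' && c <= L'9';
}
```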

--
Matthew

Jeffrey Yasskin

unread,
Jun 27, 2016, 1:34:58 PM6/27/16
to std-pr...@isocpp.org
Look into what ICU actually does to implement functions like u_isalpha
(http://icu-project.org/apiref/icu4c/uchar_8h.html#a86cc4f937e33bcea3772c6faf3e293c1).
I'm guessing it's similar to the encoding Matthew Woehlke used in his
most recent message, but whoever proposes these functions should be
able to answer questions about the most common existing practice.

> Lets concentrate to predicate (return bool value) functions. I'm don't
> believe that isalpha or isupper have this sort of problems. Although we
> should decide what is "isupper" means in languages without upper and lower
> characters. In Japanese language for example.

To elaborate on Nicol's point about Unicode already defining this,
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt says:

304B;HIRAGANA LETTER KA;Lo;0;L;;;;;N;;;;;

meaning 'か' is Lo, which is the "Letter, other" category:
http://www.unicode.org/reports/tr44/#General_Category_Values.

http://www.unicode.org/reports/tr18/#Compatibility_Properties says
"Letter, other" isn't counted as Uppercase, so it'd return false from
isupper().

Jeffrey

Nicol Bolas

unread,
Jun 27, 2016, 1:54:17 PM6/27/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com


On Monday, June 27, 2016 at 11:27:29 AM UTC-4, Matthew Woehlke wrote:
On 2016-06-27 11:11, Matthew Woehlke wrote:
> On 2016-06-25 20:04, Nicol Bolas wrote:
>> ... Have you looked at the Unicode tables? They are not small. Like I said,
>> there are ways to make them smaller (and your code makes them far bigger
>> than necessary, since only ~15% of the codepoint range is assigned). But
>> there is no proof-of-concept that shows that it won't bloat executables by
>> 100KB.
>
> Huh? Why on earth would you bake the tables into the executable ROM? Any
> sane implementation is going to store them in shared memory.
>
> For grins, I wrote a simple test program that calls 'iswalpha' on its
> argument... it is 8618 bytes. (For comparison, I wrote a program that
> does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
> bytes. Hardly 100 KiB.)

Okay, clarification... yes, you need to store the data *somewhere*. So I
guess you are talking specifically about OS-less embedded platforms
where the executable - including statically linked standard library - is
possibly the only thing on the device.

It's not just that. If an application statically links to the standard library, there's no reason for it to be loading the Unicode table at runtime.
 
I'd argue that Unicode support should not be required for freestanding
implementations. (How many people on tiny embedded systems are dealing
with Unicode, anyway?) That seems to solve the problem neatly...

Well, the standard already has such requirements. Freestanding implementations can omit most of the standard library, providing only support for most of Chapter 18, the type traits from 20.10, and the atomics.

If even <string> isn't a requirement, I see no reason why Unicode operations would be.

Nicol Bolas

unread,
Jun 27, 2016, 1:56:56 PM6/27/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Monday, June 27, 2016 at 11:56:54 AM UTC-4, Matthew Woehlke wrote:
On 2016-06-27 11:27, Matthew Woehlke wrote:
> On 2016-06-27 11:11, Matthew Woehlke wrote:
>> On 2016-06-25 20:04, Nicol Bolas wrote:
>>> ... Have you looked at the Unicode tables? They are not small. Like I said,
>>> there are ways to make them smaller (and your code makes them far bigger
>>> than necessary, since only ~15% of the codepoint range is assigned). But
>>> there is no proof-of-concept that shows that it won't bloat executables by
>>> 100KB.
>>
>> Huh? Why on earth would you bake the tables into the executable ROM? Any
>> sane implementation is going to store them in shared memory.
>>
>> For grins, I wrote a simple test program that calls 'iswalpha' on its
>> argument... it is 8618 bytes. (For comparison, I wrote a program that
>> does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
>> bytes. Hardly 100 KiB.)
>
> Okay, clarification... yes, you need to store the data *somewhere*. So I
> guess you are talking specifically about OS-less embedded platforms
> where the executable - including statically linked standard library - is
> possibly the only thing on the device.
>
> I'd argue that Unicode support should not be required for freestanding
> implementations. (How many people on tiny embedded systems are dealing
> with Unicode, anyway?) That seems to solve the problem neatly...

BTW, I can represent the tables for iswalpha in 3612 bytes (not counting
metadata e.g. the table size).

What steps have you taken to determine if your `iswalpha` implementation is actually returning whether a Unicode codepoint is an alphabetic character? Because I highly doubt it's really doing that.

Aso Renji

unread,
Jun 27, 2016, 2:57:22 PM6/27/16
to std-pr...@isocpp.org
'Jeffrey Yasskin' via ISO C++ Standard - Future Proposals
<std-pr...@isocpp.org> wrote in a message dated Mon, 27 Jun 2016
20:34:08 +0300:

> Look into what ICU actually does to implement functions like u_isalpha
> (http://icu-project.org/apiref/icu4c/uchar_8h.html#a86cc4f937e33bcea3772c6faf3e293c1).
(Sarcasm on) It's very simple and fast code. (Sarcasm off) An endless chain
of defines, with some complicated conversions (see
_UTRIE2_INDEX_FROM_CP). I guess this makes the tables more compact
(uchar_props_data.h, which contains these tables, is 304 KB in size). But
it doesn't make them faster.

source/common/uchar.c:

U_CAPI UBool U_EXPORT2
u_isalpha(UChar32 c) {
    uint32_t props;
    GET_PROPS(c, props);
    return (UBool)((CAT_MASK(props)&U_GC_L_MASK)!=0);
}

source/common/utrie2.h:

#define GET_PROPS(c, result) ((result)=UTRIE2_GET16(&propsTrie, c));

#define UTRIE2_GET16(trie, c) _UTRIE2_GET((trie), index, (trie)->indexLength, (c))

#define _UTRIE2_GET(trie, data, asciiOffset, c) \
    (trie)->data[_UTRIE2_INDEX_FROM_CP(trie, asciiOffset, c)]

#define _UTRIE2_INDEX_FROM_CP(trie, asciiOffset, c) \
    ((uint32_t)(c)<0xd800 ? \
        _UTRIE2_INDEX_RAW(0, (trie)->index, c) : \
        (uint32_t)(c)<=0xffff ? \
            _UTRIE2_INDEX_RAW( \
                (c)<=0xdbff ? UTRIE2_LSCP_INDEX_2_OFFSET-(0xd800>>UTRIE2_SHIFT_2) : 0, \
                (trie)->index, c) : \
            (uint32_t)(c)>0x10ffff ? \
                (asciiOffset)+UTRIE2_BAD_UTF8_DATA_OFFSET : \
                (c)>=(trie)->highStart ? \
                    (trie)->highValueIndex : \
                    _UTRIE2_INDEX_FROM_SUPP((trie)->index, c))

Matthew Woehlke

unread,
Jun 27, 2016, 3:58:10 PM6/27/16
to std-pr...@isocpp.org
I didn't actually write the function; I wrote a little program to
compute the data tables. Said program generates them by... invoking
iswalpha. I therefore submit that the assumption of accuracy is not
unreasonable :-). (And yes, I called setlocale first.)

I take it you haven't figured out how I achieved the reported sizes?

In effect, I consider the data table to be a bitmap (one bit per code
point) of whether a particular attribute applies or not, which is then
compressed by RLE, only I record the offset of the next run rather than
the run length. (Storing the value per run is not necessary, since it
can only alternate between two possible values.) To determine if a
particular code point has the attribute, one therefore uses the code
point to do a binary lower bound search into the array; the attribute
state is the resulting array index modulo 2.
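The scheme described above can be sketched as follows. The boundary table here is a toy covering only ASCII letters, not the real iswalpha data; only the encoding and lookup technique are the point.

```cpp
#include <algorithm>
#include <iterator>

// The table holds the codepoints at which the attribute flips, starting
// from "off" at codepoint 0. Toy data: only ASCII letters are "on".
static const char32_t kAlphaRuns[] = {
    U'A', U'Z' + 1,   // attribute on for 'A'..'Z'
    U'a', U'z' + 1,   // attribute on for 'a'..'z'
};

inline bool has_attribute(char32_t c) {
    // Count how many boundaries are <= c; odd parity means we are inside
    // an "on" run. upper_bound gives that count directly as an index.
    auto it = std::upper_bound(std::begin(kAlphaRuns), std::end(kAlphaRuns), c);
    return (it - std::begin(kAlphaRuns)) % 2 != 0;
}
```

No per-run value is stored, exactly as described: the value alternates, so the index parity carries it.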

In actuality, my program iterates over possible code points, checks if
the attribute state for the current code point differs from the
attribute state for the previous code point, and if so dumps an index
record. Counting the number of outputted records and multiplying by the
size of the index gives the number of bytes needed to store the table.

Disclaimer: The data is for code points 0 - 0x10FFFF. And, obviously
(given the above), the values are correct for the behavior of the
various functions on my system, which may be out of date.

And, in fact, the previously mentioned sizes could in theory be reduced
by an additional 25% by packing the indices into three bytes each rather
than four (a little more, even, if packed into 21 *bits* each), but the
further performance hit from doing so might not be worthwhile.

The flip side of course is that those 3612 are ONLY useful for telling
you if a code point "is alphabetic". They're useless for case folding,
normalization, etc. You're maximizing the space optimization of a
specific use case (iswalpha) at the expense of everything else.

--
Matthew

Matthew Woehlke

unread,
Jun 27, 2016, 4:19:09 PM6/27/16
to std-pr...@isocpp.org
On 2016-06-27 15:57, Matthew Woehlke wrote:
> On Monday, June 27, 2016 at 11:56:54 AM UTC-4, Matthew Woehlke wrote:
>> BTW, I can represent the tables for iswalpha in 3612 bytes (not counting
>> metadata e.g. the table size).
>
> In effect, I consider the data table to be a bitmap (one bit per code
> point) of whether a particular attribute applies or not, which is then
> compressed by RLE, only I record the offset of the next run rather than
> the run length. (Storing the value per run is not necessary, since it
> can only alternate between two possible values.) To determine if a
> particular code point has the attribute, one therefore uses the code
> point to do a binary lower bound search into the array; the attribute
> state is the resulting array index modulo 2.
>
> And, in fact, the previously mentioned sizes could in theory be reduced
> by an additional 25% by packing the indices into three bytes each rather
> than four (a little more, even, if packed into 21 *bits* each), but the
> further performance hit from doing so might not be worthwhile.

Okay, after that and some more thinking... I can get the tables for
*all* attributes down to about 9232 bytes. This is by assuming that I
can encode the code point index into 24 bits and the character class (at
least, bits needed to derive the values for the various isw????
functions) into another 8 bits, so that I still have 4 bytes per run
(and thus don't trash performance by unaligned reads).
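That packed layout might look like the sketch below. The entries and class bits are made up for illustration (`kAlphaBit`, `kUpperBit` are invented names); a real table would be generated from the Unicode data.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// Each 32-bit entry packs a 24-bit run-start codepoint with an 8-bit
// class byte whose bits answer the various isw* questions.
constexpr std::uint8_t kAlphaBit = 0x01;
constexpr std::uint8_t kUpperBit = 0x02;

constexpr std::uint32_t entry(char32_t start, std::uint8_t cls) {
    return (static_cast<std::uint32_t>(start) << 8) | cls;
}

// Toy data: only ASCII letters carry any class bits.
static const std::uint32_t kRuns[] = {
    entry(0x0000, 0),                    // controls, punctuation, digits...
    entry(U'A', kAlphaBit | kUpperBit),  // 'A'..'Z'
    entry(U'Z' + 1, 0),
    entry(U'a', kAlphaBit),              // 'a'..'z'
    entry(U'z' + 1, 0),
};

inline std::uint8_t class_of(char32_t c) {
    // Find the last run whose start is <= c; comparing against a key with
    // the class byte maxed out makes the packed entries order correctly.
    std::uint32_t key = entry(c, 0xFF);
    auto it = std::upper_bound(std::begin(kRuns), std::end(kRuns), key);
    return static_cast<std::uint8_t>(*(it - 1) & 0xFF);
}
```

One binary search then serves every isw* predicate, which is the trade-off being described: a single shared table instead of one small table per function.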

Of course, this is a further trade-off of space versus performance,
since now every function has to deal with that entire array, compared to
the straight bitmap approach where some had very, very small tables. A
real implementation based on this method would likely want to separate
iswblank/iswspace from the rest; these need only 144 bytes, which is
likely a worthwhile trade-off as it reduces the number of compares
needed by about 7-8, vs. only about 2 for the others.

I'm also assuming independent implementations of iswdigit and iswxdigit
as these are specified being true only for very limited characters (0-9
and 0-9A-Fa-f, respectively).

--
Matthew

Thiago Macieira

unread,
Jun 27, 2016, 11:39:16 PM6/27/16
to std-pr...@isocpp.org
On Monday, June 27, 2016 21:39:29 PDT Aso Renji wrote:
> (Sarcasm on) It's very simple and fast code. (Sarcasm off) An endless chain
> of defines, with some complicated conversions (see
> _UTRIE2_INDEX_FROM_CP). I guess this makes the tables more compact
> (uchar_props_data.h, which contains these tables, is 304 KB in size). But
> it doesn't make them faster.

But the same tables support the other unicode properties. There's a good
chance that code that tests if a given codepoint is an uppercase character
will check elsewhere if it's lowercase, or alphabetic.

That would mean you'd pay a higher up-front cost, but then no more.

Of course, all of this is QoI.

Thiago Macieira

unread,
Jun 27, 2016, 11:45:07 PM6/27/16
to std-pr...@isocpp.org
On Monday, June 27, 2016 11:11:08 PDT Matthew Woehlke wrote:
> Huh? Why on earth would you bake the tables into the executable ROM? Any
> sane implementation is going to store them in shared memory.

Huh... and what do you think executable ROM is? Shared memory. Since it's ROM,
it can't be changed, which means it can be shared between multiple instances
of the same program (if the OS supports shared memory in the first place).

> I'd argue that Unicode support should not be required for freestanding
> implementations. (How many people on tiny embedded systems are dealing
> with Unicode, anyway?) That seems to solve the problem neatly...

You'd be surprised. As I said in another email, if you combine "data is
encoded in UTF-8" (like JSON) with "case-insensitive data", you suddenly
require Unicode tables.

It's very likely the protocols that require this are poorly designed if they
are meant to be implemented in small OSes and in hardware/firmware, but they
exist.

FYI, there are Unicode tables inside the Linux kernel.
- VFAT is case-insensitive.
- VFAT stores filenames in UTF-16.

asor...@gmail.com

unread,
Jun 28, 2016, 12:29:35 AM6/28/16
to ISO C++ Standard - Future Proposals


On Tuesday, June 28, 2016 at 6:39:16 UTC+3, Thiago Macieira wrote:

But the same tables support the other unicode properties. There's a good
chance that code that tests if a given codepoint is an uppercase character
will check elsewhere if it's lowercase, or alphabetic.

The STL supports only 12 properties. 12 properties, across 6 planes of 65536 codepoints each, requires 12*6*65536 bits total. That's about half a megabyte. Not so big if you store it in some shared library. Yes, it's bigger than the ICU tables, but the code for this table is far simpler and faster.

#include <cstdint>

const int isalpha_offset = 0;
const int isdigit_offset = 1;
//...
const int isspace_offset = 11;
const int max_offset = 12;

// one entry per plane; null for planes with no data
extern const uint8_t* const plane_table[17];

bool u_isctype(char32_t c, int prop)
{
    const uint8_t* plane = plane_table[c / 65536];
    int bit = (c % 65536) * max_offset + prop;
    return plane ? (plane[bit / 8] & (1 << (bit & 7))) != 0 : false;
}

bool u_isalpha(char32_t c) { return u_isctype(c, isalpha_offset); }

asor...@gmail.com

unread,
Jun 28, 2016, 12:33:35 AM6/28/16
to ISO C++ Standard - Future Proposals, asor...@gmail.com


On Tuesday, June 28, 2016 at 7:29:35 UTC+3, asor...@gmail.com wrote:


const int8_t*plane=[c/65536];
Oops, of course that should be plane_table[c/65536];

Nicol Bolas

unread,
Jun 28, 2016, 12:58:26 AM6/28/16
to ISO C++ Standard - Future Proposals, asor...@gmail.com
On Tuesday, June 28, 2016 at 12:29:35 AM UTC-4, asor...@gmail.com wrote:
On Tuesday, June 28, 2016 at 6:39:16 UTC+3, Thiago Macieira wrote:

But the same tables support the other unicode properties. There's a good
chance that code that tests if a given codepoint is an uppercase character
will check elsewhere if it's lowercase, or alphabetic.

If you're talking about cache coherency... I'm not sure how likely that is.

Generally speaking, most Unicode operations only test a small set of properties. Normalization (the non-compatibility forms) only cares about whether it's a combining character and what its composed/decomposed form is. Case conversion would test the current case of a codepoint (series) and what its converted form is. It wouldn't be testing if it's lowercase and then uppercase. And so on.

Collation is probably the one where cross-property coherency is likely to be of the greatest value.

STL support only 12 properties.

If the goal is to get meaningful Unicode support into the C++ standard, then it does not matter what properties the standard library currently supports.

Meaningful Unicode support means having sufficient data to perform Unicode operations. Those 12 properties that the C++ standard library provides are completely useless for Unicode operations. These operations require properties, but they require Unicode properties.

We shouldn't be trying to make Unicode fit within C/C++'s garbage interface. We should be trying to improve C++'s interface to match Unicode. Or rather, replace C++'s terrible interface with one that matches Unicode.

Nicol Bolas

unread,
Jun 28, 2016, 1:02:06 AM6/28/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Monday, June 27, 2016 at 3:58:10 PM UTC-4, Matthew Woehlke wrote:
The flip side of course is that those 3612 are ONLY useful for telling
you if a code point "is alphabetic". They're useless for case folding,
normalization, etc. You're maximizing the space optimization of a
specific use case (iswalpha) at the expense of everything else.

Right. Which is why I'm confused as to why you created this compression algorithm, then applied it to the least important and useful character property data (and from a source of dubious quality at that, rather than from the actual property tables).

To me, the key question for Unicode support in the standard (when it comes to table size) is this: what is the actual memory cost for these Unicode features:

1: Normalization, for each form.
2: Grapheme cluster iteration.
3: Text segmentation.
4: Case conversion.
5: Collation.

Your compression scheme might be a good tool, but you're aiming it in the wrong direction. Though I find myself concerned about your use of a log(n)-based algorithm.

asor...@gmail.com

unread,
Jun 28, 2016, 1:46:52 AM6/28/16
to ISO C++ Standard - Future Proposals, asor...@gmail.com


On Tuesday, June 28, 2016 at 7:58:26 UTC+3, Nicol Bolas wrote:

Meaningful Unicode support means having sufficient data to perform Unicode operations. Those 12 properties that the C++ standard library provides are completely useless for Unicode operations. These operations require properties, but they require Unicode properties.
Why useless? iswdigit = the Decimal_Number property, iswalpha = the Letter property, iswblank = the Space_Separator property. Yes, Unicode also provides more specific properties. But I can live without the Titlecase_Letter property if I just need to split text into words.

We shouldn't be trying to make Unicode fit within C/C++'s garbage interface. We should be trying to improve C++'s interface match Unicode. Or rather, replace C++'s terrible interface with one that matches Unicode.
At the moment I'm only trying to make the C/C++ wide-character interface work with wide characters (characters with codes of 128 and above).

Thiago Macieira

unread,
Jun 28, 2016, 1:48:54 AM6/28/16
to std-pr...@isocpp.org
On Monday, June 27, 2016 22:46:52 PDT asor...@gmail.com wrote:
> But I can live without the Titlecase_Letter property if I just
> need to split text into words.

You can. Can you say the same for everyone?

FrankHB1989

unread,
Jun 28, 2016, 1:54:02 AM6/28/16
to ISO C++ Standard - Future Proposals
Not all properties are needed all the time. This is true for everyone.


On Tuesday, June 28, 2016 at 1:48:54 PM UTC+8, Thiago Macieira wrote:

Thiago Macieira

unread,
Jun 28, 2016, 1:55:53 AM6/28/16
to std-pr...@isocpp.org
On Monday, June 27, 2016 22:54:01 PDT FrankHB1989 wrote:
> Not all properties are needed at any time. This is true to everyone.

We're talking about an API. Are you able to say NO ONE needs that API?

asor...@gmail.com

unread,
Jun 28, 2016, 1:57:15 AM6/28/16
to ISO C++ Standard - Future Proposals


On Tuesday, June 28, 2016 at 8:48:54 UTC+3, Thiago Macieira wrote:

You can. Can you say the same for everyone?
For everyone, we can later add new properties to the universal iswctype function. Right now I need at least these 12 properties to work for Unicode, not just ASCII, characters.

Thiago Macieira

unread,
Jun 28, 2016, 3:21:26 AM6/28/16
to std-pr...@isocpp.org
On Monday, June 27, 2016 22:57:14 PDT asor...@gmail.com wrote:
> On Tuesday, June 28, 2016 at 8:48:54 UTC+3, Thiago Macieira wrote:
> > You can. Can you say the same for everyone?
>
> For everyone, we can later add new properties to the universal iswctype
> <http://www.cplusplus.com/reference/cwctype/iswctype/> function. Right now
> I need at least these 12 properties for Unicode, not ASCII, characters.

We're not talking about "now". We're talking about standardising something for
2020. We have the time to do it right.

asor...@gmail.com

unread,
Jun 28, 2016, 3:43:38 AM6/28/16
to ISO C++ Standard - Future Proposals
On Tuesday, June 28, 2016 at 10:21:26 UTC+3, Thiago Macieira wrote:

We're not talking about "now". We're talking about standardising something for
2020. We have the time to do it right.
Okay, then we can create a set of functions with standardized isu32UNICODE_PROPERTY_NAME names, without any complexity guarantee, so an implementation can use any method to reduce memory consumption, or any method to get faster code.

Matthew Woehlke

unread,
Jun 28, 2016, 10:31:09 AM6/28/16
to std-pr...@isocpp.org
On 2016-06-27 23:44, Thiago Macieira wrote:
> On Monday, June 27, 2016 at 11:11:08 PDT Matthew Woehlke wrote:
>> Huh? Why on earth would you bake the tables into the executable ROM? Any
>> sane implementation is going to store them in shared memory.
>
> Huh... and what do you think executable ROM is? Shared memory. Since it's ROM,
> it can't be changed, which means it can be shared between multiple instances
> of the same program (if the OS supports shared memory in the first place)

...which is much less efficient than sharing across *different* programs.

> FYI, there are Unicode tables inside the Linux kernel.
> - VFAT is case-insensitive.
> - VFAT stores filenames in UTF-16.

"In the kernel" was on my list of "reasonable places to put such tables"
:-). Or even in NVRAM (i.e. firmware), if you're talking about an
embedded platform.

--
Matthew

Thiago Macieira

unread,
Jun 28, 2016, 11:01:17 AM6/28/16
to std-pr...@isocpp.org
Why can't we require O(1) complexity?

Thiago Macieira

unread,
Jun 28, 2016, 11:03:42 AM6/28/16
to std-pr...@isocpp.org
On Tuesday, June 28, 2016 at 10:30:54 PDT Matthew Woehlke wrote:
> On 2016-06-27 23:44, Thiago Macieira wrote:
> > On Monday, June 27, 2016 at 11:11:08 PDT Matthew Woehlke wrote:
> >> Huh? Why on earth would you bake the tables into the executable ROM? Any
> >> sane implementation is going to store them in shared memory.
> >
> > Huh... and what do you think executable ROM is? Shared memory. Since it's
> > ROM, it can't be changed, which means it can be shared between multiple
> > instances of the same program (if the OS supports shared memory in the
> > first place)
>
> ...which is much less efficient than sharing across *different* programs.

That is included. Executable ROM (i.e., .text sections) will be shared across
multiple invocations, even of different programs, if they use the same
sections of ROM.

> > FYI, there are Unicode tables inside the Linux kernel.
> > - VFAT is case-insensitive.
> > - VFAT stores filenames in UTF-16.
>
> "In the kernel" was on my list of "reasonable places to put such tables"
>
> :-). Or even in NVRAM (i.e. firmware), if you're talking about an
> embedded platform.

Well, even since the 1980s, for me any "ROM" is actually some kind of PROM,
like Flash memory, NVRAM, etc.

asor...@gmail.com

unread,
Jun 28, 2016, 12:15:43 PM6/28/16
to ISO C++ Standard - Future Proposals


On Tuesday, June 28, 2016 at 18:01:17 UTC+3, Thiago Macieira wrote:

Why can't we require O(1) complexity?
O(1) complexity is no guarantee of fast code; code that is *always* slow is also O(1). On the other hand, some users have microcontrollers with 1 MB of memory, and if we give them fast O(1) code, its tables eat a significant share of that 1 MB, especially if the code must support many properties.

Thiago Macieira

unread,
Jun 28, 2016, 12:36:48 PM6/28/16
to std-pr...@isocpp.org
On Tuesday, June 28, 2016 at 09:15:43 PDT asor...@gmail.com wrote:
> On Tuesday, June 28, 2016 at 18:01:17 UTC+3, Thiago Macieira wrote:
> > Why can't we require O(1) complexity?
>
> O(1) complexity is no guarantee of fast code; code that is *always* slow is
> also O(1). On the other hand, some users have microcontrollers with 1 MB of
> memory, and if we give them fast O(1) code, its tables eat a significant
> share of that 1 MB, especially if the code must support many properties.

Given some of the operations that may be constructed with those property
queries, O(1) complexity is probably a far better option.

Not sure the standard should require it.

Matthew Woehlke

unread,
Jun 28, 2016, 12:43:15 PM6/28/16
to std-pr...@isocpp.org
On 2016-06-28 11:01, Thiago Macieira wrote:
> On Tuesday, June 28, 2016 at 00:43:38 PDT asor...@gmail.com wrote:
>> On Tuesday, June 28, 2016 at 10:21:26 UTC+3, Thiago Macieira wrote:
>>> We're not talking about "now". We're talking about standardising something
>>> for
>>> 2020. We have the time to do it right.
>>
>> Okay, then we can create a set of functions with standardized
>> isu32UNICODE_PROPERTY_NAME names, without any complexity guarantee, so an
>> implementation can use any method to reduce memory consumption, or any
>> method to get faster code.
>
> Why can't we require O(1) complexity?

...because it limits the ways in which the algorithms might be
implemented. See for example my (notional) implementation using only
about 9 KiB; I achieve that using RLE which makes the functions O(logN)
for N = the size of the data table. (Which, okay, since that's constant
and not a function of the input, you could maybe argue *is* O(1), but...)

As we've discussed to death, the implementation almost certainly
involves a size/speed trade-off. If you require favoring speed, you are
likely also requiring an implementation that must have a very large data
table.

Leaving it open-ended allows implementations that must run on very
constrained systems to decide what is an acceptable trade-off.

--
Matthew

Matthew Woehlke

unread,
Jun 28, 2016, 12:50:18 PM6/28/16
to std-pr...@isocpp.org
On 2016-06-28 11:03, Thiago Macieira wrote:
> On Tuesday, June 28, 2016 at 10:30:54 PDT Matthew Woehlke wrote:
>> On 2016-06-27 23:44, Thiago Macieira wrote:
>>> On Monday, June 27, 2016 at 11:11:08 PDT Matthew Woehlke wrote:
>>>> Huh? Why on earth would you bake the tables into the executable ROM? Any
>>>> sane implementation is going to store them in shared memory.
>>>
>>> Huh... and what do you think executable ROM is? Shared memory. Since it's
>>> ROM, it can't be changed, which means it can be shared between multiple
>>> instances of the same program (if the OS supports shared memory in the
>>> first place)
>>
>> ...which is much less efficient than sharing across *different* programs.
>
> That is included. Executable ROM (i.e., .text sections) will be shared across
> multiple invocations, even of different programs, if they use the same
> sections of ROM.

Are we talking about the same thing? I'm talking about the tables being
embedded as inline static data in the .exe (or equivalent), *not* a
shared library. I don't think that can be shared, as that would imply a)
that the data exactly starts and ends at page boundaries, and b) the OS
can share memory between different .exe's when they happen to have pages
with identical content. (Do OS's actually do that?)

Possible clarification: by "ROM", above, I'm talking about e.g. the
.rodata section. Not "ROM" in the hardware sense.

--
Matthew

Thiago Macieira

unread,
Jun 28, 2016, 1:12:59 PM6/28/16
to std-pr...@isocpp.org
On Tuesday, June 28, 2016 at 12:42:57 PDT Matthew Woehlke wrote:
> > Why can't we require O(1) complexity?
>
> ...because it limits the ways in which the algorithms might be
> implemented. See for example my (notional) implementation using only
> about 9 KiB; I achieve that using RLE which makes the functions O(logN)
> for N = the size of the data table. (Which, okay, since that's constant
> and not a function of the input, you could maybe argue *is* O(1), but...)
>
> As we've discussed to death, the implementation almost certainly
> involves a size/speed trade-off. If you require favoring speed, you are
> likely also requiring an implementation that must have a very large data
> table.
>
> Leaving it open-ended allows implementations that must run on very
> constrained systems to decide what is an acceptable trade-off.

The standard does require the complexity of certain algorithms and containers.
Given that the most likely use of properties on characters is to loop over a
string, you have to remember that a non-O(1) property lookup will mean the
entire loop will have a higher complexity than O(n).

Thiago Macieira

unread,
Jun 28, 2016, 1:20:35 PM6/28/16
to std-pr...@isocpp.org
On Tuesday, June 28, 2016 at 12:49:57 PDT Matthew Woehlke wrote:
> > That is included. Executable ROM (i.e., .text sections) will be shared
> > across multiple invocations, even of different programs, if they use the
> > same sections of ROM.
>
> Are we talking about the same thing? I'm talking about the tables being
> embedded as inline static data in the .exe (or equivalent), *not* a
> shared library.

I wasn't considering that. I thought you meant data in a library of some sort.

> I don't think that can be shared, as that would imply a)
> that the data exactly starts and ends at page boundaries, and b) the OS
> can share memory between different .exe's when they happen to have pages
> with identical content. (Do OS's actually do that?)

Starting data at page boundaries is quite easy.

Sharing pages even if they don't come from the same file is possible (look up
the Linux "kernel samepage merging" feature), but it's more efficient if they
come from the same file. Hence ICU having a file with the data.

> Possible clarification: by "ROM", above, I'm talking about e.g. the
> .rodata section. Not "ROM" in the hardware sense.

Off-topic:

Strictly speaking, you're talking about the ELF read-only segment(s) (not
sections). There are multiple sections in those segments: at the very least
.text, in addition to .rodata.

Jean-Marc Bourguet

unread,
Jun 28, 2016, 4:48:03 PM6/28/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Tuesday, June 28, 2016 at 18:43:15 UTC+2, Matthew Woehlke wrote:
On 2016-06-28 11:01, Thiago Macieira wrote:
> On Tuesday, June 28, 2016 at 00:43:38 PDT asor...@gmail.com wrote:
>> On Tuesday, June 28, 2016 at 10:21:26 UTC+3, Thiago Macieira wrote:
>>> We're not talking about "now". We're talking about standardising something
>>> for
>>> 2020. We have the time to do it right.
>>
>> Okay, then we can create a set of functions with standardized
>> isu32UNICODE_PROPERTY_NAME names, without any complexity guarantee, so an
>> implementation can use any method to reduce memory consumption, or any
>> method to get faster code.
>
> Why can't we require O(1) complexity?

...because it limits the ways in which the algorithms might be
implemented. See for example my (notional) implementation using only
about 9 KiB; I achieve that using RLE which makes the functions O(logN)
for N = the size of the data table. (Which, okay, since that's constant
and not a function of the input, you could maybe argue *is* O(1), but...)

It is possible, in 10 KiB, to store the 59 binary properties of Unicode for the Basic
Multilingual Plane in such a way that they are accessible in O(1).

Access function is something like:

uint64_t get_value1(char32_t c) {
    // Decompose the code point into three mixed-radix digits; each digit
    // indexes one level of the trie.
    size_t i = c;
    size_t d1 = i % BinaryPropertiesBlockSizeL1;
    i /= BinaryPropertiesBlockSizeL1;
    size_t d2 = i % BinaryPropertiesBlockSizeL2;
    i /= BinaryPropertiesBlockSizeL2;
    size_t d3 = i % BinaryPropertiesBlockSizeL3;
    i /= BinaryPropertiesBlockSizeL3;
    // Walk the three levels; identical blocks are stored once and shared,
    // so each Start table may map many blocks to the same child.
    i = BinaryPropertiesStartL3[i];
    i = BinaryPropertiesStartL2[i+d3];
    i = BinaryPropertiesStartL1[i+d2];
    // 64-bit word packing the binary properties of this code point.
    return BinaryPropertiesValues[i+d1];
} // get_value1

bool is_uppercase(char32_t c) {
    // The Uppercase property happens to sit in bit 8 of the packed word.
    return bool((get_value1(c) >> 8) & 1);
} // is_uppercase

(IIRC, that's a compression scheme suggested somewhere in the Unicode documentation; my implementation of the compression part is a stupid greedy one.)

Thus I don't see the point of allowing O(log N) just for the smaller size.

For speed, I'd just ensure that / and % can be done with masks and shifts; beyond that, I wouldn't be surprised if cache pressure were the most important factor for anything but benchmarks.

Yours,

--
Jean-Marc

Arthur O'Dwyer

unread,
Jun 28, 2016, 7:44:15 PM6/28/16
to ISO C++ Standard - Future Proposals, mwoehlk...@gmail.com
On Tuesday, June 28, 2016 at 1:48:03 PM UTC-7, Jean-Marc Bourguet wrote:
> On Tuesday, June 28, 2016 at 18:43:15 UTC+2, Matthew Woehlke wrote:
>> On 2016-06-28 11:01, Thiago Macieira wrote:
>>> On Tuesday, June 28, 2016 at 00:43:38 PDT asor...@gmail.com wrote:
>>>> 
>>>> Without any complexity guarantee, so an implementation can use any
>>>> method to reduce memory consumption, or any method to get faster code.
>>> 
>>> Why can't we require O(1) complexity?
>>
>> ...because it limits the ways in which the algorithms might be
>> implemented. See for example my (notional) implementation using only
>> about 9 KiB; I achieve that using RLE which makes the functions O(logN)
>> for N = the size of the data table. (Which, okay, since that's constant
>> and not a function of the input, you could maybe argue *is* O(1), but...)
>
> It is possible, in 10 KiB, to store the 59 binary properties of Unicode for the Basic
> Multilingual Plane in such a way that they are accessible in O(1). [...]

> I don't see the point of allowing O(log N) just for the smaller size.

You guys should stop talking about big-O notation; you're talking past each other.
In this subthread, asorenji's original point was that the Standard shouldn't specify implementation details of the semi-proposed isu32xxx functions.
He mistakenly used the term "complexity guarantee" to describe this idea.
Thiago correctly pointed out that the Standard could safely require O(1) complexity, because duh.
Matthew for some reason reverse-nitpicked the nitpick, despite agreeing that Thiago was technically correct about the meaning of "O(1)".
And so on...

Just drop it, and let the other subthreads resume with discussion of the actual feature space. Let's be reasonable and assume that implementors will always do what's best for their particular platform.

And in future, when discussing computational complexity, remember to define your terms!  Any algorithm is O(N) for suitable definitions of N, and any algorithm is O(1) for other definitions of N.  (In this case, the only halfway sane definition of "N" is "the size of Unicode"; unfortunately that's not a fully sane definition, because the size of Unicode is a global constant.)

–Arthur

FrankHB1989

unread,
Jun 29, 2016, 12:15:26 AM6/29/16
to ISO C++ Standard - Future Proposals


On Tuesday, June 28, 2016 at 1:55:53 PM UTC+8, Thiago Macieira wrote:
On Monday, June 27, 2016 at 22:54:01 PDT FrankHB1989 wrote:
> Not all properties are needed all the time. This is true for everyone.

We're talking about API. Are you able to say NO ONE needs that API?

When someone needs some API, they do not need everything to be available at the same time. Although full Unicode support is good, the lack of a full Unicode API, or the cost of supporting one, should not be a reason to make a specific API unusable, or too difficult and subtle to use, unless that is technically unavoidable. I don't see that `iswalpha` is such a case.

BTW, about the original question of the OP: since ISO C states explicitly that "the sets of characters tested for by the `iswalpha` function" are locale-specific, and nothing Unicode-related is specified, I don't see any actual problem in the specification. I also think the wide-character API should never be specific to Unicode. Namely, the related interface should be Unicode-agnostic; only implementations need care about how to make it work well with current or future Unicode versions. But the discussion has drifted too far in this topic; if a Unicode API is needed, it would be better to start a new topic.