That's a lot of tables to be including. I believe that the table can be compacted via clever programming to be of reasonable length. But that would require a good implementation to prove it.
On Saturday, June 25, 2016, 22:09:47 UTC+3, Nicol Bolas wrote:
> That's a lot of tables to be including. I believe that the table can be
> compacted via clever programming to be of reasonable length. But that would
> require a good implementation to prove it.

bool isu32alpha(wint_t code)
{
    std::setlocale(LC_ALL, "any_unicode_locale"); // implementation-defined
    return iswalpha(code);
}

I already tested this (see the first post); it works very well.

Or:

bool isu32alpha(wint_t code)
{
    return code <= 0x10FFFF ? (table[code / 8] & (1 << (code % 8))) != 0 : false;
}

A 0x110000-bit table is a reasonable size, at least if you keep the table in a separate .cpp file.
... Have you looked at the Unicode tables? They are not small. Like I said, there are ways to make them smaller (and your code makes them far bigger than necessary, since only ~15% of the codepoint range is assigned). But there is no proof-of-concept that shows that it won't bloat executables by 100KB.
Also, there is no guarantee that `wint_t` can store a Unicode codepoint, so that API isn't reasonable. If you're serious about Unicode support, you need to focus on the types that actually store Unicode encodings.
No, we don't. There are many modern microcontroller-class CPUs with less than
1 MB of flash and around a hundred kilobytes of RAM, or less. Why shouldn't
we program them with C++?
Another reason is that often the library gets supplied with the executable.
Once we load the entire set of Unicode tables for all attributes, it may be
well over a megabyte. In fact, the ICU data table is an 18 MB library. That
means your small 100-line C++ program is at least that big.
18MB means we get away from many platforms where we are today. It's 9 times the size of my whole current project (with debug info in!) which runs on embedded devices. Please remember that we want Unicode support, but not at the cost of not being able to support target platforms which are very much alive today.
On Saturday, June 25, 2016 20:29:50 PDT, asor...@gmail.com wrote:
> Yes, there are ways to make them smaller. But there is no way to make them
> *faster*. Sacrifice time for saving 100KB? We don't load programs from
> floppies anymore; we have gigabytes of RAM and at least gigabytes of disk
> space. [...]

No, we don't. There are many modern microcontroller-class CPUs with less than
1 MB of flash and around a hundred kilobytes of RAM, or less. Why shouldn't
we program them with C++?
ICU does have ways to subset its data tables to include only the parts
you use. The proposal author should probably validate that to show us
what kinds of subsets are already possible, but it'll also be possible
to add new subsets if the C++ library wants to make finer-grained
distinctions.
On Sunday, June 26, 2016 at 2:04:59 AM UTC-4, asor...@gmail.com wrote:
> On Sunday, June 26, 2016, 8:52:16 UTC+3, Patrice Roy wrote:
>> 18MB means we get away from many platforms where we are today. It's 9
>> times the size of my whole current project (with debug info in!) which
>> runs on embedded devices. Please remember that we want Unicode support,
>> but not at the cost of not being able to support target platforms which
>> are very much alive today.
>
> Unicode uses the codespace from 0 to 0x10FFFF. Therefore we need at most
> 0x110000 bits for a single predicate function. 18 MB is enough for a
> hundred predicates. I don't think you really need that many.
Again, this ignores that most of that range is not actually assigned and therefore need not take up any bits at all.
Not all of the properties in the Unicode tables are binary. Indeed, most are not. Case conversion, for example, cannot be binary: it has to specify how you go from codepoint X to one or more codepoints YZW, for each codepoint. That cannot take up a single bit per codepoint.
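The distinction can be sketched in code. Everything below is illustrative, not any library's real data structure; the one data point shown (sharp s uppercasing to "SS") is a real entry from Unicode's SpecialCasing.txt:

```cpp
// Illustrative sketch only: a predicate needs one bit per codepoint, but a
// case-conversion entry must carry up to three result codepoints, because
// Unicode full case mappings can expand a single codepoint into several.
struct CaseMapping {
    char32_t from;   // source codepoint
    char32_t to[3];  // result codepoints; full mappings need up to three
    unsigned count;  // how many elements of 'to' are used
};

// One real mapping from SpecialCasing.txt:
// uppercase(U+00DF LATIN SMALL LETTER SHARP S) = "SS" (two codepoints).
constexpr CaseMapping kSharpS = { U'\u00DF', { U'S', U'S', 0 }, 2 };
```

So even before compression tricks, a case-mapping table entry is on the order of 16 bytes, versus one bit for a predicate; the two kinds of property simply cannot share a bitmap representation.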
That being said, I firmly believe that 18MB is much larger than the Unicode tables need to be. That there must be clever ways to make that table much smaller (on the order of hundreds of kilobytes rather than megabytes). But as of yet, I have not undertaken the task of proving that, so that doesn't mean much.
On Sunday, June 26, 2016, 16:08:45 UTC+3, Nicol Bolas wrote:
> Again, this ignores that most of that range is not actually assigned and
> therefore need not take up any bits at all.
>
> Not all of the properties in the Unicode tables are binary. Indeed, most
> are not. Case conversion, for example, cannot be binary: it has to specify
> how you go from codepoint X to one or more codepoints YZW, for each
> codepoint. That cannot take up a single bit per codepoint.

Let's concentrate on predicate (bool-returning) functions. I don't believe
that isalpha or isupper has this sort of problem. Although we should decide
what "isupper" means in languages without uppercase and lowercase characters
(in Japanese, for example).

> That being said, I firmly believe that 18MB is much larger than the
> Unicode tables need to be. That there must be clever ways to make that
> table much smaller (on the order of hundreds of kilobytes rather than
> megabytes). But as of yet, I have not undertaken the task of proving that,
> so that doesn't mean much.

In my answer to Jeffrey Yasskin, I compressed the tables to 38 KB.
2: Why are you using `lower_bound` for a table? That's like making an `unordered_map`, then using a range-for loop to search for an item by its key. You should be able to get the exact block index for a Unicode codepoint with some simple mathematics.
3: 38KB is still way too big for this information.
On Sunday, June 26, 2016, 18:12:51 UTC+3, Nicol Bolas wrote:
> 2: Why are you using `lower_bound` for a table? That's like making an
> `unordered_map`, then using a range-for loop to search for an item by its
> key. You should be able to get the exact block index for a Unicode
> codepoint with some simple mathematics.

Because Unicode blocks have random sizes.

> 3: 38KB is still way too big for this information.

Only if you know a more economical method. If a more economical method
doesn't exist, then 38KB is a reasonable size.
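The binary-search approach being debated here might look like the following sketch. The ranges are a tiny illustrative subset of the real Alphabetic data, and all names are hypothetical:

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical sketch: with irregularly sized ranges, a sorted array of
// range starts plus a binary search finds the enclosing range in O(log N)
// comparisons, which is why variable-size blocks push you toward
// lower_bound/upper_bound instead of direct indexing.
struct Range { char32_t first; char32_t last; };

constexpr Range kAlphaRanges[] = {
    { 0x0041, 0x005A },  // 'A'..'Z'
    { 0x0061, 0x007A },  // 'a'..'z'
    { 0x0391, 0x03A9 },  // Greek capitals (approximate; 0x03A2 is unassigned)
    { 0x0410, 0x044F },  // main Cyrillic letters
};

bool is_alpha_sketch(char32_t c) {
    // Find the first range whose start is greater than c, then step back one.
    auto it = std::upper_bound(std::begin(kAlphaRanges), std::end(kAlphaRanges), c,
        [](char32_t cp, const Range& r) { return cp < r.first; });
    if (it == std::begin(kAlphaRanges)) return false;
    return c <= (it - 1)->last;
}
```

The table cost is two codepoints per range, which is compact; the lookup cost is the O(log N) search Nicol is objecting to.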
On Sunday, June 26, 2016 06:08:44 PDT, Nicol Bolas wrote:
> That being said, I firmly believe that 18MB is much larger than the Unicode
> tables *need* to be. That there must be clever ways to make that table much
> smaller (on the order of hundreds of kilobytes rather than megabytes). But
> as of yet, I have not undertaken the task of *proving* that, so that
> doesn't mean much.
>
> And even if I'm wrong, I bet we can provide certain very useful features
> that require less than the full Unicode table space. For example, I'd bet
> that the non-compatibility Unicode normalization forms require much less
> table space than the compatibility ones. I'd bet that grapheme cluster
> iteration requires much less table space than case conversion.
ICU comes with a tool to select which properties and which locales to include
in your data pack. It's just not an easy tool to use and I personally know of
no one that has successfully deployed the data file with it.
Un-selecting "lines" (entries) from the database is often a short-sighted
decision. You may think "my application will not be run in Thailand" and thus
remove support for Thai graphemes along with Thai locale information. But
then you may get a Thai customer calling your application's support, and they
may not even be in Thailand.
Un-selecting "columns" would be safer, but you often don't know which
properties your application needs. You might think, as you said above, that
you don't need the compatibility normalisations, only to find out that
Internationalised Domain Names do need NFKC.
Also, ICU 57.1 isn't 18 MB:

$ ls -lh /usr/share/icu/57.1/icudt57l.dat
-rw-r--r-- 1 root root 25M Jun 15 06:14 /usr/share/icu/57.1/icudt57l.dat
Why divide it up by Unicode's arbitrary blocks to begin with? That makes searching require a lot of conditional branching and cache misses.
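What direct indexing over fixed-size chunks (rather than Unicode's named blocks) could look like is sketched below. All table contents are tiny demo stand-ins marking only the ASCII letters, not real Unicode data, and every name is hypothetical:

```cpp
#include <cstdint>

// Sketch of a two-stage table with fixed 256-codepoint chunks: the chunk
// index is just c >> 8 and the position within it is c & 0xFF, so the
// lookup needs no search over irregular block boundaries and no branching
// beyond the range check.
constexpr int kChunkBits = 8;                             // 256 codepoints per chunk
constexpr unsigned kNumChunks = 0x110000u >> kChunkBits;  // 4352 chunks

// Stage 2 holds one bitmap per distinct chunk (one bit per codepoint).
// Demo data: bitmap 0 is empty; bitmap 1 marks 'A'-'Z' and 'a'-'z'.
const uint8_t stage2[][256 / 8] = {
    {},
    { 0, 0, 0, 0, 0, 0, 0, 0,
      0xFE, 0xFF, 0xFF, 0x07, 0xFE, 0xFF, 0xFF, 0x07,
      0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0 },
};

// Stage 1 maps each chunk to a stage-2 bitmap. Identical chunks (e.g. the
// all-zero chunks covering unassigned ranges) share one bitmap, which is
// where the space savings come from. Here only chunk 0 is non-empty.
const uint16_t stage1[kNumChunks] = { 1 };

bool lookup(char32_t c) {
    if (c > 0x10FFFF) return false;
    const uint8_t* bits = stage2[stage1[c >> kChunkBits]];
    unsigned off = c & 0xFF;               // position within the chunk
    return (bits[off / 8] >> (off % 8)) & 1;
}
```

With only ~15% of the codepoint space assigned, most chunks of a real table collapse onto a handful of shared bitmaps, and the lookup stays two dependent loads with no conditional branching on block boundaries.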
OK, allow me to say what I mean a different way. Having that information available at all is not worth 38KB to most users. The number of people who truly need to know if a codepoint is an alphabetic character or not is far less than the number of people who need, for example, Unicode normalization. Or grapheme cluster iteration. And so forth.
From: asor...@gmail.com | Sent: Sunday, June 26, 2016 4:42 PM | To: ISO C++ Standard - Future Proposals | Subject: Re: [std-proposals] iswalpha and locales
As suggested by others, if we layer or slice functionality correctly, we can get close to "don't pay for what you don't use", but there is a ton of work to get there.
Or maybe wint_t is not Unicode? In that case, do you know of any other
character encoding with wide (not multi-byte) characters?
On 2016-06-27 11:11, Matthew Woehlke wrote:
> On 2016-06-25 20:04, Nicol Bolas wrote:
>> ... Have you looked at the Unicode tables? They are not small. Like I said,
>> there are ways to make them smaller (and your code makes them far bigger
>> than necessary, since only ~15% of the codepoint range is assigned). But
>> there is no proof-of-concept that shows that it won't bloat executables by
>> 100KB.
>
> Huh? Why on earth would you bake the tables into the executable ROM? Any
> sane implementation is going to store them in shared memory.
>
> For grins, I wrote a simple test program that calls 'iswalpha' on its
> argument... it is 8618 bytes. (For comparison, I wrote a program that
> does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
> bytes. Hardly 100 KiB.)
Okay, clarification... yes, you need to store the data *somewhere*. So I
guess you are talking specifically about OS-less embedded platforms
where the executable - including statically linked standard library - is
possibly the only thing on the device.
I'd argue that Unicode support should not be required for freestanding
implementations. (How many people on tiny embedded systems are dealing
with Unicode, anyway?) That seems to solve the problem neatly...
On 2016-06-27 11:27, Matthew Woehlke wrote:
> On 2016-06-27 11:11, Matthew Woehlke wrote:
>> For grins, I wrote a simple test program that calls 'iswalpha' on its
>> argument... it is 8618 bytes. (For comparison, I wrote a program that
>> does NOTHING AT ALL, and it is 8455 bytes. That's a difference of... 163
>> bytes. Hardly 100 KiB.)
> [...]
BTW, I can represent the tables for iswalpha in 3612 bytes (not counting
metadata e.g. the table size).
But the same tables support the other Unicode properties. There's a good
chance that code that tests if a given codepoint is an uppercase character
will check elsewhere if it's lowercase, or alphabetic.
const int8_t *plane = planes[c / 65536]; // 'planes' is a placeholder name for the top-level table
On Tuesday, June 28, 2016, 6:39:16 UTC+3, Thiago Macieira wrote:
> But the same tables support the other Unicode properties. There's a good
> chance that code that tests if a given codepoint is an uppercase character
> will check elsewhere if it's lowercase, or alphabetic.

The STL supports only 12 properties.
The flip side, of course, is that those 3612 bytes are ONLY useful for
telling you if a code point "is alphabetic". They're useless for case
folding, normalization, etc. You're maximizing the space optimization of a
specific use case (iswalpha) at the expense of everything else.
Meaningful Unicode support means having sufficient data to perform Unicode operations. The 12 properties that the C++ standard library provides are completely useless for Unicode operations. These operations do require properties, but they require *Unicode* properties.
We shouldn't be trying to make Unicode fit within C/C++'s garbage interface. We should be trying to improve C++'s interface to match Unicode. Or rather, replace C++'s terrible interface with one that matches Unicode.
You can. Can you say the same for everyone?
We're not talking about "now". We're talking about standardising something for
2020. We have the time to do it right.
Why can't we require O(1) complexity?
On 2016-06-28 11:01, Thiago Macieira wrote:
> On Tuesday, June 28, 2016 00:43:38 PDT, asor...@gmail.com wrote:
>> On Tuesday, June 28, 2016, 10:21:26 UTC+3, Thiago Macieira wrote:
>>> We're not talking about "now". We're talking about standardising
>>> something for 2020. We have the time to do it right.
>>
>> Okay, then we can create a set of functions with standard
>> isu32UNICODE_PROPERTY_NAME names, without any complexity guarantee, so an
>> implementation can use any method to reduce memory consumption, or any
>> method to get faster code.
>
> Why can't we require O(1) complexity?
...because it limits the ways in which the algorithms might be
implemented. See for example my (notional) implementation using only
about 9 KiB; I achieve that using RLE which makes the functions O(logN)
for N = the size of the data table. (Which, okay, since that's constant
and not a function of the input, you could maybe argue *is* O(1), but...)
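The RLE scheme described above can be sketched as a sorted list of "toggle" codepoints plus a binary search. The data below is a tiny illustrative subset, not the actual 9 KiB table, and the function name is hypothetical:

```cpp
#include <algorithm>
#include <iterator>

// Sketch: encode a boolean property as the sorted codepoints at which its
// value flips, starting from 'false'. A codepoint's value is the parity of
// the number of toggle points at or before it, found by binary search, so
// lookup is O(log N) in the number of runs. That is what keeps the table
// small but makes the complexity depend on the data, not the input.
constexpr char32_t kToggles[] = {
    0x0041, 0x005B,   // on for 'A'..'Z'
    0x0061, 0x007B,   // on for 'a'..'z'
    0x0410, 0x0450,   // on for the main Cyrillic letter run
};

bool rle_lookup_sketch(char32_t c) {
    auto it = std::upper_bound(std::begin(kToggles), std::end(kToggles), c);
    // An odd count of toggle points at or before c means the property is on.
    return (std::distance(std::begin(kToggles), it) % 2) != 0;
}
```

Each run costs two codepoints regardless of its length, so long unassigned stretches are nearly free; the trade-off is exactly the O(log N) lookup discussed above.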
On Monday, June 27, 2016 22:54:01 PDT, FrankHB1989 wrote:
> Not all properties are needed at any time. This is true to everyone.
We're talking about an API. Are you able to say NO ONE needs that API?