On 12/07/2025 14:51, bil til wrote:
> On Sat, 12 Jul 2025 at 15:05, 'Scott Morgan' via lua-l
> <lu...@googlegroups.com> wrote:
>>
>
>> ..., e.g. CreateFileA/CreateFileW. Generally, the *A form just converts
>> the text from locale 8-bit to UTF-16 and uses the *W calls internally.
Not sure why you're quoting this bit.
> Unicode16 has fixed 2 bytes per char (this is nice for strlenw command
> - just use relatively fast strlen and device by 2).
False. For starters, what is 'Unicode16'? Can you point to its
specification? Do you mean UTF-16, which is the common 16-bit encoding
for Unicode? That uses one *or two* 16-bit words to encode a single
char. Unicode covers far more characters than the 65536 possible with a
single 16-bit number.
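To make the one-or-two-words point concrete, here's a minimal C11
sketch (U+1F600 is just an arbitrary codepoint above U+FFFF, picked for
illustration):

  #include <stdio.h>
  #include <uchar.h>

  int main(void) {
      /* U+1F600 lies above U+FFFF, so UTF-16 must encode it as a
         surrogate pair: two 16-bit units for one character. */
      const char16_t s[] = u"\U0001F600";
      /* sizeof includes the trailing NUL unit, hence the -1. */
      printf("UTF-16 units for U+1F600: %zu\n",
             sizeof s / sizeof s[0] - 1);   /* prints 2 */
      return 0;
  }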
UCS-2 started out, in the late '80s/early '90s, as a single 16-bit word
per char encoding. The idea was that 16 bits would cover everything,
but it was quickly realised that they couldn't handle all the
codepoints needed.
If you want a single-word-per-char encoding, UTF-32 is required. But
you still have to handle things like combining chars, even after
normalisation.
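A quick sketch of that last point (U+0308 is just one example; any
combining mark without a precomposed form behaves the same way):

  #include <stdio.h>
  #include <uchar.h>

  int main(void) {
      /* "g" + U+0308 COMBINING DIAERESIS renders as a single glyph,
         but it remains two codepoints even in UTF-32, and NFC cannot
         merge them because no precomposed form exists. */
      const char32_t s[] = U"g\u0308";
      printf("UTF-32 units: %zu\n",
             sizeof s / sizeof s[0] - 1);   /* prints 2 */
      return 0;
  }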
> Getting the char length of a UTF string (either 8 or 16) is an
> "elaborate task" - you have to analyze every UTF char in the string
> successively, which has typically 1-4 bytes (1-4 bytes is the same max
> length for UTF8 and UTF16 as I see, although I am only familiar with
> UTF8). So the typically very important and heavily used strlen
> function of C for UTF strings needs much more time.
Not really true. UTF-8 and UTF-16 give you a hint, in the first
byte/word of each char, of how many bytes/words are involved, so you
can skim the intervening bytes easily. A bit more complex than plain
strlen, but not by much.
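For example, a codepoint-counting strlen over well-formed UTF-8 is only
a few lines of C. (utf8_strlen is a hypothetical helper, not an API
from any particular library, and it does no validation.)

  #include <stddef.h>

  /* Count codepoints in a NUL-terminated, well-formed UTF-8 string.
     The lead byte of each char says how long the sequence is, so the
     continuation bytes can simply be skipped. */
  size_t utf8_strlen(const char *s) {
      size_t count = 0;
      while (*s) {
          unsigned char lead = (unsigned char)*s;
          if      (lead < 0x80) s += 1;   /* 0xxxxxxx: 1-byte (ASCII) */
          else if (lead < 0xE0) s += 2;   /* 110xxxxx: 2-byte seq     */
          else if (lead < 0xF0) s += 3;   /* 1110xxxx: 3-byte seq     */
          else                  s += 4;   /* 11110xxx: 4-byte seq     */
          count++;
      }
      return count;
  }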
strlen never truly worked the way you think anyway, as you'd still have
to support legacy multi-byte charsets like Shift JIS or GB 18030 if you
wanted your code to be properly multi-language. It should only be
considered a method for counting bytes, not characters/glyphs, in an
8-bit-clean char encoding.
> The Unicode16 encoding used in Windows makes sense for internal RAM of
> the program, which typically I think should be fairly safe assumption.
> But already for HDD/SDD files, this conditions will become
> questionable, I would classify Unicode encoded files with fix number
> of Bytes per encoded char as generally dangerous.
This is all nonsense!
> I just read the wiki article "Unicode in Windows", and I am really
> surprised that as you say only WinNT used Unicode
False. Re-read my previous email; I never said that.
WinNT started with UCS-2, then moved to UTF-16, where it still stands
today. Modern Windows releases are versions of NT. It never went away;
it just got rebranded.
I don't want to be rude, but it's pretty clear you don't know what
you're talking about with regard to Windows or the Unicode standard.
You're using made-up terms, and evidently aren't aware of the history.
Please stop; you're not helping anyone.
Scott