On 09/09/16 14:54, Alf P. Steinbach wrote:
> On 09.09.2016 12:46, David Brown wrote:
>>
>> I haven't heard stories about gcc botching unicode
>
> It used to be that conforming UTF-8 source code with non-ASCII string
> literals could be compiled with either g++, or with MSVC, but not both.
> g++ would choke on a BOM at the start, which it should have accepted.
> And MSVC lacked an option to tell it the source encoding, and would
> interpret it as Windows ANSI if it was UTF-8 without BOM.
>
> At the time the zealots recommended restricting oneself to a subset of
> the language that could be compiled with both g++ and MSVC. This was
> viewed as a problem with MSVC, and/or with Windows conventions, rather
> than botched encoding support in the g++ compiler. I.e., one should use
> UTF-8 encoded source without BOM (which didn't challenge g++) and only
> pure ASCII narrow literals (ditto), and no wide literals (ditto),
> because that could also, as they saw it, be compiled with MSVC.
>
> They were lying to the compiler, and in the blame assignment, and as is
> usual that gave ungood results.
UTF-8 files rarely use a BOM - its only use is to avoid
misinterpretation of the file as Latin-1 or some other encoding. So it
is understandable, but still incorrect, for gcc to fail when given
UTF-8 source code with a BOM. It is less understandable, and at least
as bad, for MSVC to require a BOM. gcc fixed their code in version
4.4 - has MSVC fixed their compiler yet?
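For what it's worth, a UTF-8 BOM is just the three bytes 0xEF 0xBB
0xBF at the front of the file, so tolerating one is cheap. A rough
sketch (the helper name is made up) of what any tool reading UTF-8
text could do:

    #include <fstream>
    #include <istream>

    // Hypothetical helper: skip a UTF-8 BOM (0xEF 0xBB 0xBF) at the
    // start of a stream if present, otherwise leave it at the start.
    void skip_utf8_bom(std::istream& in)
    {
        char bom[3] = {};
        in.read(bom, 3);
        if (in.gcount() == 3 &&
            static_cast<unsigned char>(bom[0]) == 0xEF &&
            static_cast<unsigned char>(bom[1]) == 0xBB &&
            static_cast<unsigned char>(bom[2]) == 0xBF)
        {
            return;          // BOM found - stream now points past it
        }
        in.clear();          // a short read may have set eof/failbit
        in.seekg(0);         // no BOM - rewind to the beginning
    }

    int main()
    {
        std::ifstream source("example.cpp", std::ios::binary);
        skip_utf8_bom(source);
        // ... read the rest as plain UTF-8 ...
    }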
But that seems like a small issue to call "botching unicode".
gcc still doesn't allow extended identifiers, however - I think MSVC
does? Clang certainly does.
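To illustrate what that means in practice, something like this (the
identifier is an arbitrary example) is, as far as I know, accepted by
clang but rejected by g++ when the name is written directly as UTF-8:

    // Extended identifier spelled directly in UTF-8 in the source.
    double π = 3.14159265358979;

    int main()
    {
        return static_cast<int>(π);
    }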
>
>
>> - but I have heard
>> many stories about how Windows has botched it
>
> Well, you can tell that it's propaganda by the fact that it's about
> assigning blame. IMO assigning blame elsewhere for ported software that
> is, or was, of very low quality.
Are you referring to gcc as "very low quality ported software"? That's
an unusual viewpoint.
>
> But it's not just ported Unix-land software that's ungood: the
> propaganda could probably not have worked if Windows itself wasn't full
> of the weirdest stupidity and bugs, including, up front in modern
> Windows, that Windows Explorer, the GUI shell, scrolls away what you're
> clicking on so that double-clicks generally have unpredictable effects,
> or just don't work. It's so stupid that one suspects the internal
> sabotage between different units, that Microsoft is infamous for. Not to
> mention the lack of UTF-8 support in the Windows console subsystem.
> Which has an API that effectively restricts it to UCS-2. /However/,
> while there is indeed plenty wrong in general with Windows, Windows did
> get Unicode support a good ten years before Linux, roughly.
From a very quick google search, I see references to Unicode in Linux
from the late 1990s - so ten years is perhaps an exaggeration. It
depends on what you mean by "support", of course - for the most part,
Linux really doesn't care about character sets or encodings. Strings
are a bunch of bytes terminated by a null character, and the OS will
pass them around, use them for file system names, etc., with an almost
total disregard for the contents - barring a few special characters
such as "/". As far as I understand it, the Linux kernel itself isn't
much bothered with Unicode at all; it's only the locale-related stuff,
libraries like iconv, font libraries, and of course things like X that
need to concern themselves with Unicode. This was one of the key
reasons for liking UTF-8 - it meant very little had to change.
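As a trivial illustration (the file name is just an arbitrary
example), creating a file with a non-ASCII name on Linux is nothing
more than handing the kernel the UTF-8 bytes:

    #include <fstream>

    int main()
    {
        // "smörgåsbord.txt" spelled out as UTF-8 bytes.  The kernel
        // treats the name as an opaque byte string; only '/' and
        // '\0' are special.
        std::ofstream out("sm\xC3\xB6rg\xC3\xA5sbord.txt");
        out << "bytes in, bytes out\n";
    }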
>
> UTF-8 was the old AT&T geniuses' (Ken Thompson and Rob Pike) idea for
> possibly getting the Unix world on a workable path towards Unicode, by
> supporting pure ASCII streams without code change. And that worked. But
> it was not established in general until well past the year 2000.
>
Maybe MS and Windows simply suffer from being the brave pioneer here,
and the *nix world watched them and then saw how to do it better.
>
>> , and how the C++ (and C)
>> standards botched it to make it easier to work with Windows' botched API.
>
> That doesn't make sense to me, sorry.
The C++ standard currently has char, wchar_t, char16_t and char32_t
(plus signed and unsigned versions of char; the other types are
distinct types with no signed or unsigned variants). wchar_t is of
platform-dependent size, making it (and wide strings) pretty much
useless for platform-independent code. But it is the way it is because
MS wanted a 16-bit wchar_t to suit their UCS-2 / UTF-16 APIs. The
whole thing would have been much simpler if the language had simply
stuck to 8-bit code units throughout, letting the compiler and/or
locale settings figure out the input encoding and the run-time
encoding. If you really need a type for holding a single Unicode
character, then the only sane choice is a 32-bit type.
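To make the platform dependence concrete, a little sketch (the sizes
mentioned in the comments are typical values, not guarantees):

    #include <iostream>

    int main()
    {
        // Typically wchar_t is 2 bytes with MSVC and 4 bytes with
        // gcc/clang on Linux - which is exactly the portability
        // problem with wide strings.
        std::cout << "char:     " << sizeof(char)     << '\n'
                  << "wchar_t:  " << sizeof(wchar_t)  << '\n'
                  << "char16_t: " << sizeof(char16_t) << '\n'
                  << "char32_t: " << sizeof(char32_t) << '\n';

        // One code point outside the BMP is always a single char32_t
        // value, but needs a surrogate pair (two code units) in
        // char16_t.
        char32_t single = U'\U0001F600';
        char16_t pair[] = u"\U0001F600";
        std::cout << "UTF-16 code units: "
                  << (sizeof(pair) / sizeof(pair[0]) - 1)   // prints 2
                  << '\n';
        (void)single;
    }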
>
>
>> [snip]
>> Well, these things are partly a matter of opinion, and partly a matter
>> of fitting with the APIs, libraries, toolkits, translation tools, etc.,
>> that you are using. If you are using Windows API and Windows tools, you
>> might find UTF-16 more convenient. But as the worst compromise between
>> UTF-8 and UTF-32,
>
> Think about e.g. ICU.
>
> Do you believe that the Unicode guys themselves would choose the worst
> compromise?
>
When Unicode was started, everyone thought 16 bits would be enough.
So designs that date back to before Unicode 2.0 naturally used 16-bit
types, since at the time a single 16-bit code unit covered every code
point. When ICU was started, there was no UTF-32 (I don't know if
there was a UTF-8 at the time). The best choice back then was UCS-2 -
it is only as Unicode outgrew 16 bits that this became a poor choice.
> That does not make technical sense.
>
> So, this claim is technically nonsense. It only makes sense socially, as
> an in-group (Unix-land) versus out-group (Windows) valuation: "they're
> bad, they even smell bad; we're good". I.e. it's propaganda.
No, the claim is technically correct - UTF-16 has all the disadvantages
of both UTF-8 and UTF-32, and misses several of the important
(independent) advantages of those two encodings. Except for
compatibility with existing UTF-16 APIs and code, there are few or no
use-cases where UTF-16 is a better choice than UTF-8 and also a better
choice than UTF-32. The fact that the world has pretty much
standardised on UTF-8 for transmission of Unicode is a good indication
of this - UTF-16 is /only/ used where compatibility with existing UTF-16
APIs and code is of overriding concern.
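A small sketch of the "worst of both" point (the smiley is just an
arbitrary character outside the BMP):

    #include <iostream>
    #include <string>

    int main()
    {
        // "A" plus U+1F600, in the three encodings.
        std::u16string s16 = u"A\U0001F600";
        std::u32string s32 = U"A\U0001F600";
        std::string    s8  = "A\xF0\x9F\x98\x80";  // same text as UTF-8 bytes

        std::cout << s16.size() << '\n';  // 3 - UTF-16 is variable-width
        std::cout << s32.size() << '\n';  // 2 - one code unit per code point
        std::cout << s8.size()  << '\n';  // 5 - but 'A' stays one ASCII byte

        // s16[1] and s16[2] form a surrogate pair, so indexing by
        // code unit doesn't give code points (no UTF-32 advantage),
        // and the code units aren't bytes, so there are byte-order
        // issues and no ASCII pass-through (no UTF-8 advantage).
    }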
Historically, ICU and Windows (and Java, and Qt) all made the
sensible decision of using UCS-2 when they started, but apart from Qt
they have failed to make a transition to UTF-8 once it was clear that
this was the way forward.
>
> In order to filter out propaganda, think about whether it makes value
> evaluations like "worst", or "better", and where it does that, is it
> with respect to defined goals and full technical considerations?
>
The advantages of UTF-8 over UTF-16 are clear and technical, not just
"feelings" or propaganda.