On 08.02.2019 12:27, Ralf Goertz wrote:
> On Fri, 8 Feb 2019 11:00:59 +0100, "Alf P. Steinbach"
> <alf.p.stein...@gmail.com> wrote:
>
>> You state, in the form of an incorrect assertion that presumably I
>> should rush to correct (hey, someone's wrong on the internet!), that
>> you don't know any purpose of a BOM.
>>
>> OK then.
>>
>> A BOM serves two main purposes:
>>
>> * It identifies the general encoding scheme (UTF-8, UTF-16 or UTF-32),
>> with high probability.
>
> BOM for UTF-N with N>8 is fine IMHO. But as I understand it UTF-8 is
> meant to be as compatible as possible with ASCII. So if you happen to
> have a »UTF-8« file that doesn't contain any non-ASCII characters then
> why should it have a BOM? This can easily happen, e.g. when you decide
> to erase those fancy quotation marks I just used and replace them with
> ordinary ones like in "UTF-8". Suddenly, the file is pure ASCII but has
> an unnecessary BOM.
It's not unnecessary if the intent is to edit the file further, because
then it records which encoding should be used with this file.
Otherwise I'd just save as pure ASCII.
Done.
> If the file contains non-ASCII characters you'll
> notice that soon enough. My favourite editor (vim) is very good at
> detecting that without the aid of BOMs and I guess others are, too.
Evidently vim doesn't have to deal with many Windows ANSI encoded files,
where every byte sequence is valid.
It's possible to apply statistical measures over large stretches of
text, but that is necessarily grossly inefficient compared to just
checking three bytes, and that difference in efficiency matters for
tools such as compilers.
For an editor that loads the whole file anyway, and has an interactive
user in front of it who can guide it, maybe it doesn't matter.
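For concreteness, here's a minimal sketch in C++ of the "just check
three bytes" approach (my illustration only, assuming we just care about
the UTF-8 BOM and have a file path at hand):

    #include <fstream>
    #include <string>

    // Sketch: true if the file at `path` starts with the three UTF-8 BOM
    // bytes EF BB BF. Binary mode, so no newline translation interferes.
    bool starts_with_utf8_bom( const std::string& path )
    {
        std::ifstream f( path, std::ios::binary );
        unsigned char b[3] = {};
        f.read( reinterpret_cast<char*>( b ), 3 );
        return f.gcount() == 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF;
    }

Contrast that constant-cost peek with decoding and statistically scoring
the whole text, which is what a compiler would otherwise have to do for
every source file.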
> And BOMs can be a burden, for instance when you want to quickly
> concatenate two files with "cat file1 file2 >outfile". Then you end up
> with a BOM in the middle of a file which doesn't conform to the
> standard AFAIK.
Binary `cat` is a nice tool when it's not misapplied.
I guess the argument, which you've picked up from somebody else, is that
it's plain impossible to make a corresponding text concatenation tool.
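In case it helps, here is a rough sketch of such a text concatenation
tool in C++ (again just my illustration, and again assuming only the
UTF-8 BOM matters): each input file is copied to standard output, with a
leading BOM dropped from every file after the first, so that at most one
BOM survives, at the very start of the result. The name text_cat is a
placeholder of mine.

    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Sketch of a BOM-aware "text cat".
    void text_cat( const std::vector<std::string>& paths )
    {
        const std::string bom = "\xEF\xBB\xBF";
        bool first = true;
        for( const std::string& path : paths )
        {
            std::ifstream in( path, std::ios::binary );
            std::string data(
                (std::istreambuf_iterator<char>( in )),
                std::istreambuf_iterator<char>() );
            if( !first && data.compare( 0, bom.size(), bom ) == 0 )
            {
                data.erase( 0, bom.size() );   // drop the interior BOM
            }
            std::cout << data;
            first = false;
        }
    }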
>> * It identifies the byte order for the multibyte unit encodings.
>
> As I said, for those BOMs are fine.
>
>> Since its original definition was a zero-width space it can be
>> treated as removable whitespace, and AFAIK that was the original
>> intent.
>
> But they increase the file size which can cause problems (in the above
> mentioned case of an ASCII only UTF-8 file).
Not having the BOMs for files intended to be used with Windows tools
causes problems of correctness.
In the above-mentioned case, calling /not forgetting the encoding/ a
"problem" sounds to me like turning black into white and vice versa.
I'd rather /not/ throw away the encoding information, and would see the
throwing-away, if that were enforced, as a serious problem.
> I really don't understand
> why UTF-8 has not become standard on Windows even after so many years of
> its existence.
As I see it, it's a war between Microsoft and the other platforms, where
each side tries its best to subtly and not-so-subtly sabotage the other.
Microsoft does things like not supporting UTF-8 in Windows consoles
(input doesn't work at all for non-ASCII characters), not supporting
UTF-8 locales in Windows, hiding the UTF-8 sans BOM encoding far down in
a very long list of useless encodings in the VS editor's GUI for
encoding choice, letting that editor save with the system-dependent
Windows ANSI encoding by default, and even (Odin save us!) using that as
the default basic execution character set in Visual C++ -- a /system
dependent/ encoding as the basic execution character set.
*nix-world folks do things such as restricting the JSON format, in newer
versions of its RFC, to UTF without a BOM, and permitting a BOM to be
treated as an error.
Very political, as I see it.
Not engineering.
Cheers!,
- Alf