On 5/26/2017 10:39, Anton Ertl wrote:
> Alex <al...@rivadpm.com> writes:
>> Stupidly, Microsoft say this;
>>
>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
>
> |Because Unicode plain text is a sequence of 16-bit code values, it is
> |sensitive to the byte ordering used when the text is written.
>
> So Microsoft means UTF-16 when they write "Unicode". For UTF-16, byte
> order is an issue, so a byte-order mark is not completely superfluous.
> Also, because UTF-16 is not ASCII-compatible on byte-addressed
> machines, the main disadvantage of prepending a BOM does not play a
> role (tools that work on 8-bit characters don't work on UTF-16 anyway,
> BOM or no). But these reasons don't transfer to UTF-8.
>
> - anton
>
The table has the UTF-8 "BOM" in it. It takes a fairly narrow reading of
the text to assume that by "Unicode" they mean only UTF-16 and longer and
not UTF-8, and I suspect that is not what they mean.

"Therefore, Unicode has defined a character (U+FEFF) and a noncharacter
(U+FFFE) as byte order marks. They are mirror byte images of each other."

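That's so wrong it's not even wrong. It's the byte encoding of U+FEFF
that looks like FE FF under UTF-16BE or FF FE under UTF-16LE. U+FFFE is a
noncharacter that will never be assigned; it is what you see when you
read a BOM with the wrong byte order, and is not some kind of BOM itself.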
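
To make that concrete, here is a small Python sketch of my own (not
anything from the Microsoft page). It encodes the single code point
U+FEFF under each encoding and then reads the little-endian bytes back
with the wrong byte order. The helper names bom_bytes and read_big_endian
are made up for illustration.

```python
BOM = "\ufeff"  # U+FEFF, the one code point used as a byte order mark


def bom_bytes():
    """Return the bytes U+FEFF encodes to under each encoding."""
    return {
        "utf-16-be": BOM.encode("utf-16-be"),
        "utf-16-le": BOM.encode("utf-16-le"),
        "utf-8": BOM.encode("utf-8"),
    }


def read_big_endian(raw):
    """Interpret two bytes as one big-endian 16-bit code unit."""
    return int.from_bytes(raw, "big")


if __name__ == "__main__":
    for name, raw in bom_bytes().items():
        print(name, raw.hex(" ").upper())
    # Reading the little-endian BOM with the wrong byte order gives 0xFFFE
    swapped = read_big_endian(bom_bytes()["utf-16-le"])
    print(f"LE bytes read as BE: U+{swapped:04X}")
```

The same code point yields FE FF, FF FE, or EF BB BF depending on the
encoding. The value FFFE only shows up when the reader guesses the byte
order wrong.
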
It's basically a train wreck. Wikipedia notes:
"many pieces of software on Microsoft Windows such as Notepad treat the
BOM as a required magic number rather than use heuristics. These tools
add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless
the BOM is present or the file contains only ASCII. Google Docs also
adds a BOM when converting a document to a plain text file for download."
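
For what it's worth, a reader does not have to treat the BOM as
required. Below is a minimal Python sketch, assuming the input really is
UTF-8 with or without the EF BB BF signature. The function names are
hypothetical. The "utf-8-sig" codec and codecs.BOM_UTF8 are part of the
standard library.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Report whether the data starts with the UTF-8 signature EF BB BF."""
    return raw.startswith(codecs.BOM_UTF8)


def decode_utf8_optional_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if it is present."""
    # "utf-8-sig" strips a leading BOM on decode and also accepts
    # input that has no BOM at all.
    return raw.decode("utf-8-sig")


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "h\u00e9llo".encode("utf-8")
    without_bom = "h\u00e9llo".encode("utf-8")
    print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
    print(decode_utf8_optional_bom(with_bom) == decode_utf8_optional_bom(without_bom))
```

Both inputs decode to the same string, so the BOM works as an optional
hint and not as a magic number.
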
--
Alex