RFD: Patching INCLUDED for UTF-8

Brad Eckert

May 24, 2017, 7:17:22 PM

Now that the Unicode crowd has settled on UTF-8 for the most part (more than 50% of web pages now use UTF-8), it seems to be the new text file standard.

My proposal is to have INCLUDED look at the first three bytes of the text file. If it starts with the 3-byte sequence 0xEF,0xBB,0xBF then skip those three bytes. This is the byte-order mark of UTF-8 files.
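
As a rough illustration of the check described above, here is a minimal sketch in Python rather than Forth. The helper name `skip_utf8_bom` is hypothetical and not taken from any Forth system; it assumes the source is available as a seekable binary stream, peeks at the first three bytes, and rewinds when they are not the BOM.

```python
import io

UTF8_BOM = b"\xef\xbb\xbf"  # the 3-byte sequence 0xEF,0xBB,0xBF


def skip_utf8_bom(stream):
    """Advance a seekable binary stream past a leading UTF-8 BOM, if present.

    Returns True when a BOM was skipped. Hypothetical helper for
    illustration only; not part of any Forth system.
    """
    start = stream.tell()
    if stream.read(3) == UTF8_BOM:
        return True
    stream.seek(start)  # no BOM: rewind so no source text is lost
    return False


# Simulated source file that an editor saved with a BOM in front.
src = io.BytesIO(UTF8_BOM + b": hello .\" hi\" ;\n")
skip_utf8_bom(src)
first_line = src.readline()  # the interpreter would start parsing here
```

Files shorter than three bytes, or files without the mark, are left untouched because the stream is rewound to where it started.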

Maybe some Forths already do this. SwiftForth doesn't, but I'll soon be adding a patch.

Anton Ertl

May 25, 2017, 6:17:49 AM

Brad Eckert <hwf...@gmail.com> writes:
>Now that the Unicode crowd has settled on UTF-8 for the most part (more than 50% of web pages now use UTF-8), it seems to be the new text file standard.
>
>My proposal is to have INCLUDED look at the first three bytes of the text file. If it starts with the 3-byte sequence 0xEF,0xBB,0xBF then skip those three bytes. This is the byte-order mark of UTF-8 files.

UTF-8 does not need a byte-order mark (it has only one byte order).
This sequence is some Windowsism. Maybe it will die out like
ones-complement, or maybe it will persist. Should we put code in
Windows Forth systems to ignore this sequence? Maybe. Should we put
code in non-Windows Forth systems to ignore this sequence? What for?
Should we standardize ignoring this sequence? Probably not.

An easy way to deal with it is to produce a noop word with that name,
and have a space after that word in all files that start with this
word.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2017: http://www.euroforth.org/ef17/

Alex

May 25, 2017, 7:09:25 AM

On 5/25/2017 10:32, Anton Ertl wrote:
> Brad Eckert <hwf...@gmail.com> writes:
>> Now that the Unicode crowd has settled on UTF-8 for the most part (more than 50% of web pages now use UTF-8), it seems to be the new text file standard.
>>
>> My proposal is to have INCLUDED look at the first three bytes of the text file. If it starts with the 3-byte sequence 0xEF,0xBB,0xBF then skip those three bytes. This is the byte-order mark of UTF-8 files.
>
> UTF-8 does not need a byte-order mark (it has only one byte order).
> This sequence is some Windowsism. Maybe it will die out like
> ones-complement, or maybe it will persist. Should we put code in
> Windows Forth systems to ignore this sequence? Maybe. Should we put
> code in non-Windows Forth systems to ignore this sequence? What for?
> Should we standardize ignoring this sequence? Probably not.
>
> An easy way to deal with it is to produce a noop word with that name,
> and have a space after that word in all files that start with this
> word.
>
> - anton
>

The Unicode Consortium's advice is here:
http://unicode.org/faq/utf_bom.html#bom10 Point 3 is worth noting.

3. Some byte oriented protocols expect ASCII characters at the beginning
of a file. If UTF-8 is used with these protocols, use of the BOM as
encoding form signature should be avoided.

--
Alex

Brad Eckert

May 25, 2017, 12:40:48 PM

On Thursday, May 25, 2017 at 4:09:25 AM UTC-7, Alex wrote:
> If UTF-8 is used with these protocols, use of the BOM as
> encoding form signature should be avoided.
>

I'll give Bill a call.

Julian Fondren

May 25, 2017, 12:51:40 PM

On Thursday, May 25, 2017 at 5:17:49 AM UTC-5, Anton Ertl wrote:
> Should we put
> code in non-Windows Forth systems to ignore this sequence? What for?

Windows customers. They can use FTP to update their website on your
server, introducing byte-order marks from their editor. They can
upload some OS-agnostic code to theforth.net that includes it.

> An easy way to deal with it is to produce a noop word with that name,
> and have a space after that word in all files that start with this
> word.

It's hard, though, to take such care with an invisible threat.

Alex

May 25, 2017, 4:28:16 PM

On 5/25/2017 17:51, Julian Fondren wrote:
> On Thursday, May 25, 2017 at 5:17:49 AM UTC-5, Anton Ertl wrote:
>> Should we put
>> code in non-Windows Forth systems to ignore this sequence? What for?
>
> Windows customers. They can use FTP to update their website on your
> server, introducing byte-order marks from their editor. They can
> upload some OS-agnostic code to theforth.net that includes it.

Stupidly, Microsoft say this;
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx

>
>> An easy way to deal with it is to produce a noop word with that name,
>> and have a space after that word in all files that start with this
>> word.
>
> It's hard, though, to take such care with an invisible threat.
>

It's visible as ï»¿ using code page 1252.
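
For readers who want to see this for themselves, the following is a small sketch (in Python, chosen only for illustration) that encodes U+FEFF as UTF-8 and then misreads those bytes as code page 1252. The variable names are made up for this example.

```python
# U+FEFF encoded as UTF-8 gives the three bytes EF BB BF.
bom = "\ufeff".encode("utf-8")

# A tool that assumes code page 1252 shows those bytes as three
# ordinary Latin-1 range characters instead of hiding them.
as_cp1252 = bom.decode("cp1252")

print(bom.hex(), as_cp1252)
```

The three characters are U+00EF, U+00BB and U+00BF, which is the familiar garbage that shows up at the top of a file when an editor or terminal does not recognise the mark.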

It may be that you want your Forth to remove it during INCLUDE but I
would hate to see it standardized.

--
Alex

Anton Ertl

May 26, 2017, 5:45:11 AM

Alex <al...@rivadpm.com> writes:
>Stupidly, Microsoft say this;
>https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx

|Because Unicode plain text is a sequence of 16-bit code values, it is
|sensitive to the byte ordering used when the text is written.

So Microsoft means UTF-16 when they write "Unicode". For UTF-16, byte
order is an issue, so a byte-order mark is not completely superfluous.
Also, because UTF-16 is not ASCII-compatible on byte-addressed
machines, the main disadvantage of prepending a BOM does not play a
role (tools that work on 8-bit characters don't work on UTF-16 anyway,
BOM or no). But these reasons don't transfer to UTF-8.

Anton Ertl

May 26, 2017, 6:02:09 AM

Julian Fondren <julian....@gmail.com> writes:
>On Thursday, May 25, 2017 at 5:17:49 AM UTC-5, Anton Ertl wrote:
>> Should we put
>> code in non-Windows Forth systems to ignore this sequence? What for?
>
>Windows customers. They can use FTP to update their website on your
>server, introducing byte-order marks from their editor. They can
>upload some OS-agnostic code to theforth.net that includes it.

That would be a bug. So maybe Windows Forth systems should produce a
warning or error when they see a BOM, so that people don't run into
this unawares.

Given that Windows users are often clueless about how to fix the
problems caused by their software, one could solve such problems at
the server, if they occur frequently: A site like theforth.net could
clean up the files. FTP sites typically have something that transfers
files from incoming/ to the final place; that could also clean up the
files.
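
A cleanup step of the kind suggested here could look roughly like the sketch below. It is written in Python purely as an illustration; the function name `strip_bom_in_place` is hypothetical, and how it would be hooked into a real site's upload path is left open.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom_in_place(path):
    """Remove a leading UTF-8 BOM from the file at `path`.

    Returns True if a BOM was found and removed, False if the file
    was left unchanged. Hypothetical helper for illustration only.
    """
    file_path = Path(path)
    data = file_path.read_bytes()
    if not data.startswith(UTF8_BOM):
        return False
    file_path.write_bytes(data[len(UTF8_BOM):])
    return True
```

Only the first three bytes are ever examined, so a U+FEFF that appears later in a file is not touched. The boolean result lets the caller log or warn about uploads that carried the mark.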

>> An easy way to deal with it is to produce a noop word with that name,
>> and have a space after that word in all files that start with this
>> word.
>
>It's hard, though, to take such care with an invisible threat.

As long as Forth systems produce a warning or error when they see a
BOM, it's not invisible.

Alex

May 26, 2017, 8:27:17 AM

On 5/26/2017 10:39, Anton Ertl wrote:
> Alex <al...@rivadpm.com> writes:
>> Stupidly, Microsoft say this;
>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
>
> |Because Unicode plain text is a sequence of 16-bit code values, it is
> |sensitive to the byte ordering used when the text is written.
>
> So Microsoft means UTF-16 when they write "Unicode". For UTF-16, byte
> order is an issue, so a byte-order mark is not completely superfluous.
> Also, because UTF-16 is not ASCII-compatible on byte-addressed
> machines, the main disadvantage of prepending a BOM does not play a
> role (tools that work on 8-bit characters don't work on UTF-16 anyway,
> BOM or no). But these reasons don't transfer to UTF-8.
>
> - anton
>

The table has the UTF-8 "BOM" in it. It takes a fairly narrow reading of
the text to assume that they aren't referring to UTF-8 but only
UTF-longer as Unicode, and I suspect they aren't.

"Therefore, Unicode has defined a character (U+FEFF) and a noncharacter
(U+FFFE) as byte order marks. They are mirror byte images of each other."

That's so wrong it's not even wrong. It's the byte encoding of U+FEFF
that looks like FE FF or FF FE under UTF-16xx. U+FFFE is undefined, and
is not some kind of BOM.
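
The byte patterns in question can be checked with a short sketch (Python, used here only for illustration; the dictionary name `encoded` is made up) that encodes the single code point U+FEFF under the two UTF-16 byte orders and under UTF-8.

```python
import codecs

# Encode the one code point U+FEFF under each encoding form.
encoded = {
    name: "\ufeff".encode(name)
    for name in ("utf-16-be", "utf-16-le", "utf-8")
}

for name, raw in encoded.items():
    print(name, raw.hex())

# The standard library exposes the same byte sequences as constants.
same_as_constants = (
    encoded["utf-16-be"] == codecs.BOM_UTF16_BE
    and encoded["utf-16-le"] == codecs.BOM_UTF16_LE
    and encoded["utf-8"] == codecs.BOM_UTF8
)
```

In UTF-16 the two results are byte-swapped images of each other, which is what lets a reader detect byte order. The UTF-8 result is always the same three bytes, so it carries no byte-order information.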

It's basically a train wreck. Wikipedia notes:

"many pieces of software on Microsoft Windows such as Notepad treat the
BOM as a required magic number rather than use heuristics. These tools
add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless
the BOM is present or the file contains only ASCII. Google Docs also
adds a BOM when converting a document to a plain text file for download."

--
Alex

Anton Ertl

May 26, 2017, 11:04:12 AM

Alex <al...@rivadpm.com> writes:
>On 5/26/2017 10:39, Anton Ertl wrote:
>> Alex <al...@rivadpm.com> writes:
>>> Stupidly, Microsoft say this;
>>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
>>
>> |Because Unicode plain text is a sequence of 16-bit code values, it is
>> |sensitive to the byte ordering used when the text is written.
>>
>> So Microsoft means UTF-16 when they write "Unicode". For UTF-16, byte
>> order is an issue, so a byte-order mark is not completely superfluous.
>> Also, because UTF-16 is not ASCII-compatible on byte-addressed
>> machines, the main disadvantage of prepending a BOM does not play a
>> role (tools that work on 8-bit characters don't work on UTF-16 anyway,
>> BOM or no). But these reasons don't transfer to UTF-8.
>>
>> - anton
>>
>
>The table has the UTF-8 "BOM" in it. It takes a fairly narrow reading of
>the text to assume that they aren't referring to UTF-8 but only
>UTF-longer as Unicode, and I suspect they aren't.

That may be the case, but the sentence I cited above, which is one of
three in the introductory paragraph, as well as the discussion of byte
order differences further on, show where their focus is. UTF-8 is, at
best, an afterthought on that page. Given that they don't mention it
except in the table, my guess is that the table entry was added later,
after someone complained about the missing UTF-8 entry in the table.

>"many pieces of software on Microsoft Windows such as Notepad treat the
>BOM as a required magic number rather than use heuristics. These tools
>add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless
>the BOM is present or the file contains only ASCII. Google Docs also
>adds a BOM when converting a document to a plain text file for download."

That's probably a stronger reason for BOM problems than the web page
above.

Brad Eckert

June 1, 2017, 12:56:00 AM

On Friday, May 26, 2017 at 5:27:17 AM UTC-7, Alex wrote:
> It's basically a train wreck. Wikipedia notes:
>
> "many pieces of software on Microsoft Windows such as Notepad treat the
> BOM as a required magic number rather than use heuristics. These tools
> add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless
> the BOM is present or the file contains only ASCII. Google Docs also
> adds a BOM when converting a document to a plain text file for download."
>

That's exactly my point. When in Rome, do as the Romans. The UTF-8 BOM used as a magic number is really a magic number. Tools in the chain may magically insert it without you seeing it. Windows apps now, maybe web apps later. So, why not just adopt it and move on?

Anton Ertl

June 1, 2017, 2:27:17 AM

Brad Eckert <hwf...@gmail.com> writes:
>That's exactly my point. When in Rome, do as the Romans. The UTF-8 BOM used
>as a magic number is really a magic number. Tools in the chain may magically
>insert it without you seeing it. Windows apps now, maybe web apps later.
>So, why not just adopt it and move on?

I am in Vienna, not in Rome. The main advantage of UTF-8 is a certain
amount of ASCII compatibility. Just because Microsoft wants to break
that compatibility does not mean we should follow.