On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
> Evidently (and completely inadvertently) this exchange has just
> illustrated one of the inadmissable assumptions:
>
> "unicode as a medium is universal in the same way that ASCII used to be"
Ironically, your post was not Unicode.
Seriously. I am 100% serious.
Your post was sent using a legacy encoding, Windows-1252, also known as
CP-1252, which is most certainly *not* Unicode. Whatever software you
used to send the message correctly flagged it with a charset header:
Content-Type: text/plain; charset=windows-1252
Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
encodings correctly (or at all!): it screws up the encoding, then sends a
reply with no charset line at all. This is one bug that cannot be blamed
on Google Groups -- or on Unicode.
> I wrote a number of ellipsis characters ie codepoint 2026 as in:
Actually you didn't. You wrote a number of ellipsis characters, hex byte
\x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
code point U+2026 in Unicode, but the two are as distinct as ASCII and
EBCDIC.
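To make the distinction concrete, here is a minimal Python 3 sketch using only the standard codecs. The byte value and code point are the ones discussed above; the variable names are just for illustration.

```python
# The byte that was actually sent, in the Windows-1252 (CP1252) charset.
raw = b"\x85"

# Decoding with the declared charset yields the Unicode code point U+2026.
text = raw.decode("cp1252")
print(hex(ord(text)))         # 0x2026

# Encoded as UTF-8, the same character is a different byte sequence.
utf8 = text.encode("utf-8")
print(utf8)                   # b'\xe2\x80\xa6'
```

The character is the same in both cases; the bytes on the wire are not.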
> Somewhere between my sending and your quoting those ellipses became the
> replacement character FFFD
Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
encodings and character sets. It doesn't just assume things are ASCII;
it makes a half-hearted attempt to be charset-aware, and does it badly. I can
only imagine that it was written back in the Dark Ages where there were a
lot of different charsets in use but no conventions for specifying which
charset was in use. Or perhaps the author was smoking crack while coding.
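What MT-NewsWatcher does internally is unknown, so the following is only one plausible mechanism, not a diagnosis. The symptom is easy to reproduce in Python 3: take CP1252 bytes and decode them as if they were UTF-8, with lenient error handling.

```python
# Text containing an ellipsis, encoded the way the original post was sent.
raw = "Wait\u2026".encode("cp1252")      # b'Wait\x85'

# Byte 0x85 is not a valid start of a UTF-8 sequence, so a lenient
# decoder substitutes the replacement character U+FFFD for it.
garbled = raw.decode("utf-8", errors="replace")
print(ascii(garbled))                    # 'Wait\ufffd'
```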
> Leaving aside whose fault this is (very likely buggy google groups),
> this mojibaking cannot happen if the assumption "All text is ASCII" were
> to uniformly hold.
This is incorrect. People forget that ASCII has evolved since the first
version of the standard in 1963. There have actually been five versions
of the ASCII standard, plus one unpublished version. (And that's not
including the things which are frequently called ASCII but aren't.)
ASCII-1963 didn't even include lowercase letters. It was also missing some
graphic characters like braces, and included at least two characters no
longer used, the up-arrow and left-arrow. The control characters were
also significantly different from today.
ASCII-1965 was unpublished and unused. I don't know the details of what
it changed.
ASCII-1967 is a lot closer to the ASCII in use today. It made
considerable changes to the control characters, moving, adding, removing,
or renaming at least half a dozen control characters. It officially added
lowercase letters, braces, and some others. It replaced the up-arrow
character with the caret and the left-arrow with the underscore. It was
ambiguous, allowing variations and substitutions, e.g.:
- character 33 was permitted to be either the exclamation
  mark ! or the logical OR symbol |
- consequently character 124 (vertical bar) was always
  displayed as a broken bar ¦, which explains why even today
  many keyboards show it that way
- character 35 was permitted to be either the number sign # or
  the pound sign £
- character 94 could be either a caret ^ or a logical NOT ¬
Even the humble comma could be pressed into service as a cedilla.
ASCII-1968 didn't change any characters, but allowed the use of LF on its
own. Previously, you had to use either LF/CR or CR/LF as newline.
ASCII-1977 removed the ambiguities from the 1967 standard.
The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
Unfortunately I haven't been able to find out what changes were made -- I
presume they were minor, and didn't affect the character set.
So as you can see, even with actual ASCII, you can have mojibake. It's
just not normally called that. But if you are given an arbitrary ASCII
file of unknown age, containing code 94, how can you be sure it was
intended as a caret rather than a logical NOT symbol? You can't.
Then there are at least 30 official variations of ASCII, strictly
speaking part of ISO-646. These 7-bit codes were commonly called "ASCII"
by their users, despite the differences, e.g. replacing the dollar sign $
with the international currency sign ¤, or replacing the left brace
{ with the letter s with caron š.
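A rough Python 3 sketch of what that means for a reader of the bytes. The tables here are hypothetical and contain only the two substitutions mentioned above; real ISO-646 variants differ in more positions, and the names `VARIANTS` and `decode_7bit` are made up for illustration.

```python
# Hypothetical override tables: code position -> character in that variant.
VARIANTS = {
    "us": {},                      # plain US-ASCII, no substitutions
    "currency": {0x24: "\u00a4"},  # $ replaced by the currency sign
    "caron": {0x7B: "\u0161"},     # { replaced by s with caron
}

def decode_7bit(data, variant):
    """Decode 7-bit bytes, applying one variant's substitutions."""
    overrides = VARIANTS[variant]
    return "".join(overrides.get(byte, chr(byte)) for byte in data)

data = b"cost ${x}"
for name in VARIANTS:
    print(name, decode_7bit(data, name))
```

The same nine bytes come out as three different strings, and nothing in the bytes themselves says which reading was intended.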
One consequence of this is that the MIME type for ASCII text is called
"US ASCII", despite the redundancy, because many people expect "ASCII"
alone to mean whatever national variation they are used to.
But it gets worse: there are proprietary variations on ASCII which are
commonly called "ASCII" but aren't, including dozens of 8-bit so-called
"extended ASCII" character sets, which is where the problems *really*
pile up. Back in the 1980s and early 1990s people invariably called
these "ASCII", even though they used 8 bits and contained anything up
to 256 characters.
Just because somebody calls something "ASCII", doesn't make it so; even
if it is ASCII, doesn't mean you know which version of ASCII; even if you
know which version, doesn't mean you know how to interpret certain codes.
It simply is *wrong* to think that "good ol' plain ASCII text" is
unambiguous and devoid of problems.
> With unicode there are in-memory formats, transportation formats eg
> UTF-8,
And the same applies to ASCII.
ASCII is a *seven-bit code*. It will work fine on computers where the
word-size is seven bits. If the word-size is eight bits, or more, you
have to pad the ASCII code. How do you do that? Pad the most-significant
end or the least significant end? That's a choice there. How do you pad
it, with a zero or a one? That's another choice. If your word-size is
more than eight bits, you might even pad *both* ends.
In C, a char is defined as the smallest addressable unit of the machine
that can contain the basic character set, not necessarily eight bits.
Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits
as a "byte" and/or char. Your in-memory representation of ASCII "a" could
easily end up as bits 001100001 or 0000000001100001.
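The padding choices are easy to see if you write the bits out. This is a small Python 3 sketch working on bit strings rather than real machine words, purely for illustration.

```python
code = ord("a")                # 97
seven = format(code, "07b")    # the 7-bit ASCII code: '1100001'

# Padding to 8 bits: three of the possible choices.
print("0" + seven)             # zero at the most significant end:  01100001
print("1" + seven)             # one at the most significant end:   11100001
print(seven + "0")             # zero at the least significant end: 11000010

# Wider units, zero-padded at the most significant end.
print(format(code, "09b"))     # 001100001
print(format(code, "016b"))    # 0000000001100001
```

Only the first 8-bit form is what people today think of as "ASCII in a byte"; the others are equally valid ways to store the same 7-bit code.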
And then there is the question of whether ASCII characters should be Big
Endian or Little Endian. I'm referring here to bit endianness, rather
than bytes: should character 'a' be represented as bits 1100001 (most
significant bit to the left) or 1000011 (least significant bit to the
left)? This may be relevant with certain networking protocols. Not all
networking protocols are big-endian, nor are all processors. The Ada
programming language even supports both bit orders.
When transmitting ASCII characters, the networking protocol could include
various start and stop bits and parity codes. A single 7-bit ASCII
character might be anything up to 12 bits in length on the wire. It is
simply naive to imagine that the transmission of ASCII codes is the same
as the in-memory or on-disk storage of ASCII.
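As an illustration, here is a Python 3 sketch of one common kind of framing. It assumes an asynchronous serial line with one start bit, data sent least significant bit first, even parity, and one or two stop bits. Other links use other framings, and `frame` is a name invented for this example.

```python
def frame(char, stop_bits=1):
    """Return the bits sent for one 7-bit character, in wire order."""
    code = ord(char)
    if code > 0x7F:
        raise ValueError("not a 7-bit character")
    data = format(code, "07b")[::-1]    # least significant bit first
    parity = str(data.count("1") % 2)   # even parity: total count of 1s is even
    return "0" + data + parity + "1" * stop_bits   # start bit 0, stop bits 1

print(frame("a"))                       # 0100001111
print(len(frame("a")))                  # 10
print(len(frame("a", stop_bits=2)))     # 11
```

Seven bits of character become ten or eleven bits on the wire, and none of those framing bits exist in memory or on disk.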
You're lucky to be active in a time when most common processors have
standardized on a single bit-order, and when most (but not all) network
protocols have done the same. But that doesn't mean that these issues
don't exist for ASCII. If you get a message that purports to be ASCII
text but looks like this:
"\tS\x1b\x1b{\x02u{'\x1b\x13B"
you should suspect strongly that it is "Hello World!" which has been
accidentally bit-reversed by some rogue piece of hardware.
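If you want to check a suspicion like that, reversing the seven bits of each character is a few lines of Python 3. This is a sketch that assumes pure 7-bit input; `reverse7` and `reverse_text` are names made up for the example.

```python
def reverse7(char):
    """Reverse the seven bits of one ASCII character."""
    bits = format(ord(char), "07b")
    return chr(int(bits[::-1], 2))

def reverse_text(text):
    """Bit-reverse every character of a 7-bit string."""
    return "".join(reverse7(c) for c in text)

# 'a' is 1100001; with the bit order flipped it reads 1000011.
print(format(ord("a"), "07b"), format(ord(reverse7("a")), "07b"))

scrambled = reverse_text("Hello World!")
print(ascii(scrambled))          # mostly control characters and punctuation
print(reverse_text(scrambled))   # reversing again restores: Hello World!
```

Reversal is its own inverse, so the same function both scrambles and recovers the text.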
--
Steven