Some history, and some hope.
On 10/13/16 16:24, Paul Gilmartin wrote:
> Hmmm... You asked about Danish,
> but your Mail Agent seems to be speaking Finnish.
:-)
The advantage in the non-EBCDIC* world is that the lower half of 8-bit
space is rather more consistent. And that space is where we have some
serious trouble on this side of the line (pipe symbol versus
exclamation, square brackets, curly braces).
Years ago, Edwin Hart (then at JHU APL) and others worked through SHARE
to normalize EBCDIC into a code page which could be translated to/from
non-EBCDIC* consistently and reliably. We've discussed it in the
lists/fora, perhaps this particular list/forum, even recently. (I've
slept since then.) The result of the SHARE effort was what some call
"Code Page 37 version 2". IBM never fully took up the customer-produced
code page, but they did listen and they gave us CP 1047.
Outside of IBM, most have an affinity for a _one-to-one reversible
mapping_ which treats the EBCDIC side as CP37V2 and the non-EBCDIC* side
as ISO-8859-1. This doesn't help the Poles, I suppose. (It would have
been nice if IBM had a Polish code page which could use the /same
translate table/ and match-up with a Polish non-EBCDIC code page.)
Witness Dignus: aside from newline (see below) their default
/translation is the same/ as that gleaned from this two-decades-old
SHARE effort. Nice work. Good job.
CP 1047 is the best we have, if we are to live in the world IBM has
created for us.
(And some people accept the "CP1047" tag even though they're really
talking CP37V2.)
Sadly, CP 1047 doesn't help the Poles (nor the Danes, nor the Finns).
But now it appears we can change locale. Fabulous!
Thankfully locale variables (LANG, LC_CTYPE, et al) are indicated using
an even smaller subset of EBCDIC than those code points which map from
"low order non-EBCDIC".
There is still the problem that a stream of bytes might not be
recognized. Tagging files with charset ABC or code page 123 is clumsy at
best.
*Here's hope:*
Newline is always non-printable whether EBCDIC or non-EBCDIC*.
Given a stream of bytes of unknown meaning (but reasonably expecting
"plain text") one can trigger on 0x15 and be reasonably sure the
preceding is EBCDIC or trigger on 0x0A and be reasonably sure the
preceding is not. (And one can strip off or append 0x0D as needed.)
If the content is a shell script, locale variables can be recognized and
respected.
XML, HTML, and source code can trivially include reliable cues to the
proper locale for rendering.
Again, for a byte stream text file, look for EBCDIC "NL" newline or look
for non-EBCDIC "LF" linefeed. EBCDIC NL will never appear in non-EBCDIC
printable plain text. Non-EBCDIC LF will never appear in EBCDIC
printable plain text. It's a good test.
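For the curious, the newline test can be sketched in a few lines of Python. This is only an illustration of the idea, not anybody's shipping code: the function name guess_family and its three return values are made up here, and it assumes the input really is plain text.

```python
EBCDIC_NL = 0x15  # EBCDIC "NL" newline
LF = 0x0A         # non-EBCDIC "LF" linefeed


def guess_family(data: bytes) -> str:
    """Guess 'ebcdic', 'non-ebcdic', or 'unknown' from the newline bytes present.

    Hypothetical helper: assumes the stream is plain text, so that
    neither newline byte can turn up as a printable character.
    """
    has_nl = EBCDIC_NL in data
    has_lf = LF in data
    if has_nl and not has_lf:
        return "ebcdic"
    if has_lf and not has_nl:
        return "non-ebcdic"
    # Neither byte, or both: the plain-text assumption does not hold
    # (or the file has no line ends at all), so refuse to guess.
    return "unknown"


# "HELLO" followed by NL, using code points shared by CP 37 and CP 1047.
sample_ebcdic = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x15])
print(guess_family(sample_ebcdic))  # ebcdic
print(guess_family(b"hello\r\n"))   # non-ebcdic
```

Returning "unknown" when both bytes appear is a deliberate choice in this sketch: binary data or a botched translation is more likely than text at that point.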
This is where even Dignus doesn't quite get it: They translate EBCDIC
0x15 to non-EBCDIC 0x0A. (Actual non-EBCDIC for "newline" is 0x85.) But
their table only helps with the above test, and _makes sense_ for cases
where someone did an un-measured translation. So I can't fault them.
Once the result of the EBCDIC (or not) check is known, one can apply
locale and "convert" appropriately, i.e., beyond the cramped walls of
8-bit space.
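A rough sketch of that last step in Python, under some loud assumptions: the EBCDIC-or-not answer is already in hand, the EBCDIC side is treated as CP 37 (Python's "cp037" codec; as far as I know the standard library does not ship a CP 1047 codec), and the non-EBCDIC side is treated as ISO-8859-1. The function name to_text is made up for illustration.

```python
def to_text(data: bytes, is_ebcdic: bool, ebcdic_codec: str = "cp037") -> str:
    """Decode a plain-text byte stream to Unicode with "\n" line ends.

    Hypothetical helper. The codec choices are assumptions; a real
    tool would pick them from the locale.
    """
    if is_ebcdic:
        text = data.decode(ebcdic_codec)
        # Python's cp037 codec maps EBCDIC NL (0x15) to U+0085 (NEL),
        # the "actual" newline; fold it to LF for everyday use.
        return text.replace("\x85", "\n")
    text = data.decode("latin-1")
    # Strip the 0x0D of CR LF pairs, leaving bare LF.
    return text.replace("\r\n", "\n")


print(repr(to_text(bytes([0xC8, 0xC9, 0x15]), is_ebcdic=True)))  # 'HI\n'
print(repr(to_text(b"hi\r\n", is_ebcdic=False)))                 # 'hi\n'
```

Once the bytes are Unicode text, the pipe-versus-exclamation and bracket trouble becomes a question of which EBCDIC codec was named, not of the 8-bit table on the other side.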
-- R; <><
* I say non-EBCDIC here because "ASCII" has baggage for many. Y'all know
what I mean.