
UTF-5


/replace qwertyuiop with news/

Mar 13, 2013, 3:25:15 PM
I realise April 1st is almost upon us, and I haven't let my compiler
escape in years... no time to do anything new and exciting, so I am
considering just doing something to help bring INTERCAL into the 21st
century (slowly and carefully of course); I may be able to implement
what I describe below if enough whisky is available.

There are a few different, incompatible character encodings used by
different INTERCAL compilers and/or different I/O mechanisms (Turing
tapes, CLC-INTERCAL's various I/O mechanisms, etc). This is all well
and good, but I was wondering whether there might be a suitable INTERCAL
way to support, say, Unicode, or different character encodings. So I am
thinking of adding a side effect to the I/O statements which would
select the encoding to use for the next operation. This would work as
follows:

1. attempting to read out / write in a "compiler" register will have the
side effect of modifying the encoding. The choice of "compiler"
registers is because at present they are the only registers which cannot
be used in I/O operations (well, nothing stops you using them, but it won't
work).

2. the selection applies to the opposite operation, so you "read out _1"
to select the character encoding to be used for subsequent "write in"
statements, and vice versa. This is obviously the right way to do it,
and there is no need to justify the choice.

3. the encoding selected will be found in the twospot register with the
same number as the crawling horror used, so for example "do read out _42"
uses the number in :42 to set the encoding. We require two numbers,
defining the encoding to use for "tail" and "hybrid" I/O, so these are
naturally provided by interleaving the two numbers (tail ¢ hybrid).

4. the operation also selects a transformation for the Numeric I/O,
using the "spot" register with the appropriate number.
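
The interleaving in point 3 can be sketched in Python. This is a minimal sketch, assuming the usual INTERCAL mingle semantics (the left operand supplies the more significant bit of each pair); the names `mingle` and `unmingle` are made up for illustration, and the post does not say how the runtime would actually split the twospot value.

```python
from typing import Tuple


def mingle(left: int, right: int) -> int:
    """Interleave two 16-bit values into one 32-bit value.

    Assumption: bits of `left` take the more significant position of each
    bit pair, bits of `right` the less significant one.
    """
    if not (0 <= left <= 0xFFFF and 0 <= right <= 0xFFFF):
        raise ValueError("mingle operands must fit in 16 bits")
    result = 0
    for bit in range(16):
        result |= ((right >> bit) & 1) << (2 * bit)
        result |= ((left >> bit) & 1) << (2 * bit + 1)
    return result


def unmingle(value: int) -> Tuple[int, int]:
    """Split a 32-bit value back into its (left, right) 16-bit halves."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    left = right = 0
    for bit in range(16):
        right |= ((value >> (2 * bit)) & 1) << bit
        left |= ((value >> (2 * bit + 1)) & 1) << bit
    return left, right


# Hypothetical: :42 holds "tail = #1" mingled with "hybrid = #2".
selector = mingle(1, 2)
tail_encoding, hybrid_encoding = unmingle(selector)
```

With these definitions, a runtime handling "read out _42" would unmingle the contents of :42 to obtain the two encoding numbers.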

The following encodings will be available initially for tail/hybrid I/O:

#1 - CLC-INTERCAL's alphanumeric I/O -- this is the default for "tail"
I/O when the program is compiled in CLC ("sick") mode; note that it will
now be possible to use this for "hybrid" I/O as well.

#2 - CLC-INTERCAL's binary I/O -- default for "hybrid" I/O when the
program is compiled in CLC ("sick") mode.

#3 - C-INTERCAL's binary I/O -- default for all I/O when the program is
compiled in C ("ick") mode.

#4 - undefined at present

#5 - Unicode, represented as UTF-5, which is derived from, but is not
compatible with, the modified Baudot used by CLC-INTERCAL for
Alphanumeric I/O. The "shift" codes acquire new meanings which make it
possible to specify multi-character sequences mapping Unicode characters
to UTF-5 and vice versa. I haven't yet written a specification for the UTF-5
encoding, but I'm sure I'll think of something.

#0, #6 to #65535: undefined

The following transformations will be available for numeric output:

#1 - CLC-INTERCAL's Roman -- default for "sick" programs

#2 - C-INTERCAL's Roman -- default for "ick" programs

#3 - wimp mode -- default if the program runs in wimp mode, and
definitely not recommended in any case. Attempting to set this
transformation at runtime will cause a splat if the program was not
started in wimp mode: the main point is to be able to "unwimp" a
program by selecting transformation #1 or #2, and be able to "re-wimp"
it later.

#0, #4 to #65535: undefined

The following transformations will be available for numeric input:

#1 - Normal numeric input, where you type "FOUR TWO" to get #42

#2 - wimp mode -- default if the program runs in wimp mode, and
like the wimp output mode, attempting to select this mode when
the program was not started in wimp mode will result in a splat.

#0, #3 to #65535: undefined
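
For illustration, transformation #1 for numeric input ("FOUR TWO" gives #42) could look like the following. This is a rough sketch only: the function name is hypothetical, and the accepted word list (including "OH" for zero) is an assumption about the traditional INTERCAL digit names, not something stated in this post.

```python
# Assumed digit names; real compilers may accept additional spellings.
DIGIT_WORDS = {
    "ZERO": 0, "OH": 0, "ONE": 1, "TWO": 2, "THREE": 3, "FOUR": 4,
    "FIVE": 5, "SIX": 6, "SEVEN": 7, "EIGHT": 8, "NINE": 9,
}


def parse_numeric_input(line: str) -> int:
    """Turn a line of spelled-out digits into the number it denotes."""
    words = line.split()
    if not words:
        raise ValueError("no digits supplied")
    value = 0
    for word in words:
        if word not in DIGIT_WORDS:
            raise ValueError("not a digit name: " + word)
        value = value * 10 + DIGIT_WORDS[word]
    return value


answer = parse_numeric_input("FOUR TWO")
```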

(Setting the encoding and/or transformation to a value marked
"undefined" in the above lists will not result in a splat; however, I
don't know, and don't want to know, what happens if you do.)

C

John Cowan

Mar 13, 2013, 9:13:23 PM
On Wednesday, March 13, 2013 3:25:15 PM UTC-4, /replace qwertyuiop with news/ wrote:

> I haven't yet written a specification for the UTF-5
> encoding, but I'm sure I'll think of something.

Fortunately, someone else has:

http://tools.ietf.org/html/draft-jseng-utf5-01 encodes
any Unicode code point using 1 to 6 quintets, each
carrying 4 bits of payload. To make this semi-compatible
with Baudot, use FIGS FIGS to specify that following
text is in UTF-5. There is, of course, no way back.
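
A minimal Python sketch of the quintet scheme described above, working on raw 5-bit values rather than printable characters. It assumes that the high bit of a quintet flags the first quintet of each code point and that leading zero nibbles are dropped; both points should be checked against the draft itself. The function names are hypothetical.

```python
from typing import Iterable, List


def utf5_quintets(code_point: int) -> List[int]:
    """Encode one code point as 1 to 6 quintets with 4 payload bits each."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    nibbles = []
    while True:
        nibbles.append(code_point & 0xF)
        code_point >>= 4
        if code_point == 0:
            break
    nibbles.reverse()
    nibbles[0] |= 0x10  # assumed: high bit marks the start of a code point
    return nibbles


def utf5_decode(quintets: Iterable[int]) -> List[int]:
    """Decode a stream of quintets back into code points."""
    code_points: List[int] = []
    current = None
    for q in quintets:
        if q & 0x10:
            if current is not None:
                code_points.append(current)
            current = q & 0xF
        else:
            if current is None:
                raise ValueError("continuation quintet before any start")
            current = (current << 4) | q
    if current is not None:
        code_points.append(current)
    return code_points


stream = [q for ch in "INTERCAL" for q in utf5_quintets(ord(ch))]
```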

/replace qwertyuiop with news/

Mar 14, 2013, 3:41:10 AM
On 2013-03-14, John Cowan wrote:
> On Wednesday, March 13, 2013 3:25:15 PM UTC-4, /replace qwertyuiop with news/ wrote:
>
>> I haven't yet written a specification for the UTF-5
>> encoding, but I'm sure I'll think of something.
>
> Fortunately, someone else has:

Nope, that's a completely different proposal which apes the current
UTF-8 encoding (using the highest bit to indicate position in a
multi-character sequence), and is therefore completely unsuitable
for INTERCAL. Instead, I am going to redefine the Baudot shift codes
for that, which would result in something completely incompatible with
Baudot.

I suppose to avoid any possible confusion I'll call it "UTF5" instead
of "UTF-5". Completely different name.

Provisionally I'm thinking of having a "length block" and a "data block"
within each Baudot sequence. The shift code "letters" will start
a length block, and the shift code "figures" will start a data
block. Each length block encodes the lengths of the multi-character
sequences in the following data block. For example:

letters-5-2-1-figures-X-Y-Z

where X is a 5-character sequence, Y a 2-character sequence, and Z a
single character. This is incompatible with Baudot, but single
characters can use the same representation as Baudot letters, with a
suitable encoding of the "shift state" in 2-character sequences,
reserving longer sequences for characters which cannot be represented in
Baudot (or extended Baudot).

If no shift codes appear in the string, it is assumed to be a data
block of 1-character sequences, which means that any Baudot alphabetic
string not containing shift codes will also be a valid UTF5 string
with the same meaning. That's as far as compatibility with Baudot
will go.
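
The length block / data block idea can be sketched as a small Python decoder. This is only an illustration under stated assumptions: the shift codes are stand-in strings, lengths are plain integers (the post does not say which Baudot symbols encode them), and `decode_blocks` is a made-up name.

```python
from typing import Any, List, Sequence, Tuple

LETTERS = "<letters>"  # stand-in for the Baudot "letters" shift code
FIGURES = "<figures>"  # stand-in for the Baudot "figures" shift code


def decode_blocks(symbols: Sequence[Any]) -> List[Tuple[Any, ...]]:
    """Group data symbols into multi-character sequences."""
    sequences: List[Tuple[Any, ...]] = []
    lengths: List[int] = []
    mode = "plain"  # no shift code seen yet: every symbol stands alone
    i = 0
    while i < len(symbols):
        sym = symbols[i]
        i += 1
        if sym == LETTERS:
            mode, lengths = "length", []
        elif sym == FIGURES:
            # Consume exactly as many data symbols as the lengths announce.
            for n in lengths:
                chunk = tuple(symbols[i:i + n])
                if len(chunk) < n or LETTERS in chunk or FIGURES in chunk:
                    raise ValueError("data block shorter than announced")
                sequences.append(chunk)
                i += n
            mode, lengths = "data", []
        elif mode == "length":
            lengths.append(int(sym))
        elif mode == "plain":
            sequences.append((sym,))
        else:
            raise ValueError("data symbol not covered by any length")
    if lengths:
        raise ValueError("length block without a data block")
    return sequences


example = [LETTERS, 5, 2, 1, FIGURES,
           "x1", "x2", "x3", "x4", "x5", "y1", "y2", "z1"]
decoded = decode_blocks(example)
```

Here `decoded` holds one 5-symbol sequence, one 2-symbol sequence and one single symbol, matching the 5-2-1 example; a list with no shift codes at all decodes to one-symbol sequences.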

I'm fairly sure no other encoding does this, so we'll be fine.

C

ais523

Mar 14, 2013, 4:18:29 PM
I personally prefer to go along the lines of doing what nobody else does
via finding standards that aren't generally used. For instance, there's
a POSIX standard format for archives, that nobody uses, and it's what I
use to distribute C-INTERCAL.

Also, always doing things the opposite of the intuitive way is something
I'd warn you against; it's approaching on dangerous consistency. You
should feel free to mix it up a little. In this case, I note that the
shift codes have to strictly alternate between letters and figures in
your system, which seems like an awkward inefficiency. I suggest you
just use the same shift code to alternate between lengths and
characters, freeing up the other shift code for something entirely
different.

--
ais523

/replace qwertyuiop with news/

Mar 15, 2013, 2:35:11 AM
On 2013-03-14, ais523 wrote:
> Also, always doing things the opposite of the intuitive way is something
> I'd warn you against; it's approaching on dangerous consistency. You
> should feel free to mix it up a little. In this case, I note that the
> shift codes have to strictly alternate between letters and figures in
> your system, which seems like an awkward inefficiency. I suggest you
> just use the same shift code to alternate between lengths and
> characters, freeing up the other shift code for something entirely
> different.

That's a good point. But it's not really inefficient, as it allows you
to convert a sequence even if the start is truncated, by just skipping
until you find a length block; if you use just one shift code you can't
do that, as you don't know what is length and what is data. I could make
sure that you can tell a length block from a data block from the
contents, but I suspect this would be more inefficient than using both
shift codes.
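
The resynchronisation argument can be illustrated with a few lines of Python. This is a sketch under assumptions: the shift codes are stand-in strings, lengths are plain integers, and `resync` is a hypothetical helper, not part of any INTERCAL runtime.

```python
from typing import Any, List

LETTERS = "<letters>"  # stand-in for the shift code starting a length block
FIGURES = "<figures>"  # stand-in for the shift code starting a data block


def resync(symbols: List[Any]) -> List[Any]:
    """Drop whatever precedes the first complete length block."""
    if LETTERS in symbols:
        return symbols[symbols.index(LETTERS):]
    if FIGURES in symbols:
        return []  # tail of a data block whose lengths were lost
    return symbols  # no shift codes: plain 1-character sequences


# A stream whose start was cut off in the middle of a length block.
truncated = [2, 1, FIGURES, "y1", "y2", "z1", LETTERS, 1, FIGURES, "w"]
recovered = resync(truncated)
```

Note that a truncated tail containing no shift codes at all cannot be told apart from a plain string, so the sketch passes it through unchanged.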

Also, you can still use sequences of shift codes to do something
completely different, just like CLC-INTERCAL uses them to introduce
lowercase letters and symbols not present in Baudot. I just have to
state that the meaning of empty blocks is undefined and leave that for
future extensions.

Alternatively I can use a different approach, interleaving the data and
length blocks, occasionally introducing a shift code which means "start
of a new interleaved block". For example, a sequence of three symbols,
with lengths 3, 4 and 1 (call it A1,A2,A3,B1,B2,B3,B4,C1) could be
represented as:

length=3,A1,length=4,A2,length=1,A3,length=0,B1,B2,B3,B4,C1

where length=X means the appropriate Baudot symbol to indicate that
length (to be specified) and length=0 shifts from interleaved blocks to
just data block until the end of the block; or it could be:

length=3,A1,length=0,A2,A3,shift,length=4,B1,length=1,B2,length=0,B3,B4,C1

where the "shift" separates the two blocks. This also leaves the other
shift code for something completely different. Also, the encoding can
be made more efficient by introducing "double length" codes, which
specify two lengths in a single symbol: whether these can be used
depends on your data (there are 30 possible values and the maximum
length is 7 Baudot characters so length=0 to length=7 represent single
lengths, and length=8 to length=27 could represent two lengths in the
range 1 to 4 and 0 to 4 respectively; the remaining 2 Baudot characters
could also be used for something completely different). So the above
sequence could also be represented as one of (writing length=X:Y to
represent double lengths):

length=3:4,A1,length=1:0,A2,A3,B1,B2,B3,B4,C1
length=3:4,A1,A2,A3,B1,B2,B3,B4,shift,length=1:0,C1
length=3:4,A1,A2,A3,B1,B2,B3,B4,shift,length=1,C1,length=0
etc.
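
A Python sketch of the single-length interleaved forms and of the double-length arithmetic. Everything concrete here is an assumption made for illustration: lengths are plain integers, the shift code is a stand-in string, the names are hypothetical, and the numbering of double-length codes (8 + 5 * (first - 1) + second) is just one mapping that fills the range 8 to 27; the post leaves the actual assignment unspecified.

```python
from typing import Any, List, Sequence, Tuple

SHIFT = "<shift>"  # stand-in for the shift code that starts a new block


def decode_interleaved(symbols: Sequence[Any]) -> List[Tuple[Any, ...]]:
    """Decode length,data,length,data,... blocks ended by length 0."""
    sequences: List[Tuple[Any, ...]] = []
    i = 0
    while i < len(symbols):
        lengths: List[int] = []
        data: List[Any] = []
        while i < len(symbols) and symbols[i] != SHIFT:
            n = symbols[i]  # a length position
            i += 1
            if not isinstance(n, int):
                raise ValueError("expected a length code")
            if n == 0:
                # length=0: the rest of this block is plain data.
                while i < len(symbols) and symbols[i] != SHIFT:
                    data.append(symbols[i])
                    i += 1
                break
            lengths.append(n)
            if i < len(symbols) and symbols[i] != SHIFT:
                data.append(symbols[i])  # the data symbol paired with it
                i += 1
        i += 1  # step over the shift code, if any
        if sum(lengths) != len(data):
            raise ValueError("lengths do not match the data in this block")
        pos = 0
        for n in lengths:
            sequences.append(tuple(data[pos:pos + n]))
            pos += n
    return sequences


def pack_double(first: int, second: int) -> int:
    """Hypothetical code for two lengths: first in 1..4, second in 0..4."""
    if not (1 <= first <= 4 and 0 <= second <= 4):
        raise ValueError("lengths out of range for a double code")
    return 8 + 5 * (first - 1) + second


def unpack_double(code: int) -> Tuple[int, int]:
    """Inverse of pack_double for codes 8..27."""
    if not 8 <= code <= 27:
        raise ValueError("not a double-length code")
    first, second = divmod(code - 8, 5)
    return first + 1, second
```

The 4 x 5 = 20 combinations occupy exactly codes 8 to 27, which together with the 8 single lengths uses 28 of the 30 non-shift symbols, in line with the count given above.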

This seems very intuitive to me, so I'm not doing the opposite of
intuitive :-) I hope nobody copies it to make it a standard.

I also just thought of an advantage of using this sort of encoding,
where you have several possible representations of a sequence: you could
hide a different sequence in it like you can hide a Whitespace program
in another program. How to do that is left as an exercise to the
reader.

C
