Strings, charsets, and encodings, oh my!

Dan Sugalski

Nov 11, 2004, 11:42:29 AM

to perl6-i...@perl.org

Or something like that.

Anyway, I'm nailing down the last bits of functionality for the
changes to the string system. There's still going to be a fair amount
of cleanup (including the eradication of some globals) once this is
in and merged, but I wanted to give folks a heads up, and a refresher
on the scheme going in.

We're going to continue the parrot tradition of confusing naming,
referring to any length-delimited wad of bytes as a string. They go
in string registers, PMCs can hold 'em, they're maintained internally
with the STRING structure, and so on. (This is true and not going to
change, regardless of whether it's a good idea or not.) Each string
has attached to it an encoding and a charset.

Strings are a sequence of graphemes. A grapheme is the smallest
logical unit of text. We'd call 'em characters, except there are
issues there with typography so we're not going to. A grapheme is
composed of one or more code points. And a code point is a 32 bit
integer.

The encoding code is responsible for managing the underlying byte
buffer. It's the layer that translates between code points and real
bytes, making the buffer *look* like a contiguous sequence of 32-bit
integers, even if it really isn't. (If, for example, the buffer is
UTF-8 data, where a 32-bit integer can be between 1 and 6 bytes, or
the buffer is sparse, or zip/gzip/bzip compressed)
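A minimal sketch of what "making the buffer look like a sequence of 32-bit integers" can mean for UTF-8 data. The function name utf8_codepoints is hypothetical, not part of any Parrot API. It decodes only the 1 to 4 byte forms that current UTF-8 (RFC 3629) permits; the 5 and 6 byte forms from the original UTF-8 definition are rejected, and overlong forms are not checked.

```python
def utf8_codepoints(buf: bytes):
    """Yield the code points stored in a UTF-8 byte buffer, one integer each.

    Hypothetical illustration of an encoding layer; not Parrot code.
    Overlong sequences and surrogate values are not rejected here.
    """
    i = 0
    while i < len(buf):
        lead = buf[i]
        if lead < 0x80:                # 0xxxxxxx: one byte
            length, cp = 1, lead
        elif lead >> 5 == 0b110:       # 110xxxxx: two bytes
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:      # 1110xxxx: three bytes
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:     # 11110xxx: four bytes
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > len(buf):
            raise ValueError(f"truncated sequence at offset {i}")
        for b in buf[i + 1:i + length]:
            if b >> 6 != 0b10:         # continuation bytes are 10xxxxxx
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += length
```

The caller sees a flat stream of integers and never needs to know that the characters occupied different numbers of bytes.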

The charset code is responsible for managing the graphemes in a
string, translating between graphemes and code points, giving basic
meaning to graphemes, and doing basic manipulation of the graphemes.
In this case basic meaning is classification -- is this grapheme a
whitespace/alpha/numeric/punctuation/line break character, and basic
manipulation is case changes and insertions and deletions.

A picture looks something like:

So your parrot string ops (and C API calls) always talk to a string's
charset code, which then will talk to the encoding code (maybe--it's
OK for this code to cheat if it knows it's OK), which then dives into
the actual buffer data. Parrot string ops never go past the charset.
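The layering described above could be sketched roughly as follows. All class and method names here (Fixed8Encoding, AsciiCharset, LayeredString) are made up for illustration and are not the Parrot C API. The point is only that string-level operations call the charset, and the charset calls the encoding to reach the bytes.

```python
class Fixed8Encoding:
    """Encoding layer: one byte per code point (like the email's fixed_8)."""

    def codepoint_count(self, buf: bytes) -> int:
        return len(buf)

    def get_codepoint(self, buf: bytes, index: int) -> int:
        return buf[index]


class AsciiCharset:
    """Charset layer: gives code points meaning (classification, case)."""

    def __init__(self, encoding):
        self.encoding = encoding

    def grapheme_count(self, buf: bytes) -> int:
        # In ASCII every grapheme is exactly one code point.
        return self.encoding.codepoint_count(buf)

    def is_whitespace(self, buf: bytes, index: int) -> bool:
        return chr(self.encoding.get_codepoint(buf, index)) in " \t\n\r\f\v"

    def upcase(self, buf: bytes) -> bytes:
        out = bytearray()
        for i in range(self.encoding.codepoint_count(buf)):
            cp = self.encoding.get_codepoint(buf, i)
            if 0x61 <= cp <= 0x7A:     # 'a'..'z' -> 'A'..'Z'
                cp -= 0x20
            out.append(cp)
        return bytes(out)


class LayeredString:
    """String ops talk only to the charset, never to the encoding or buffer."""

    def __init__(self, buf: bytes, charset):
        self.buf = buf
        self.charset = charset

    def length(self) -> int:
        return self.charset.grapheme_count(self.buf)

    def upcase(self) -> "LayeredString":
        return LayeredString(self.charset.upcase(self.buf), self.charset)
```

A string built as LayeredString(b"hi there", AsciiCharset(Fixed8Encoding())) answers length() and upcase() without the caller ever touching the byte buffer directly.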

For our purposes, graphemes and code points are all *virtual* -- that
is, the values may not be directly represented in the underlying
buffer. If the buffer is gzipped the encoding layer will do the
decompression as it needs to so it can present code points to the
charset layer, and the charset layer synthesizes code points as it
needs to if it needs to. Byte access, on the other hand, is always
real -- that is, when you ask for byte N from a string you will
always get the real byte N, or an exception if this byte isn't
accessible.
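To make the real/virtual distinction concrete, here is a small sketch using a gzip-compressed buffer. The class GzipFixed8Buffer and its method names are assumptions for illustration only. Byte access returns what is physically stored (compressed data), while code point access decompresses first.

```python
import gzip


class GzipFixed8Buffer:
    """Hypothetical compressed buffer holding one-byte-per-code-point data."""

    def __init__(self, data: bytes):
        self.raw = gzip.compress(data)   # what actually sits in memory

    def get_byte(self, n: int) -> int:
        """Real access: byte n of the stored buffer, or an exception."""
        if not 0 <= n < len(self.raw):
            raise IndexError("byte not accessible")
        return self.raw[n]

    def get_codepoint(self, n: int) -> int:
        """Virtual access: code point n of the logical, decompressed text."""
        data = gzip.decompress(self.raw)  # a real layer would cache this
        if not 0 <= n < len(data):
            raise IndexError("code point not accessible")
        return data[n]
```

For a buffer built from Latin-1 text, get_codepoint(0) is the first character of the text, whereas get_byte(0) is the first byte of the gzip header.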

This real/virtual access is in for reasons of practicality. Code
should *never* be accessing strings by byte. The only reason to
access things by byte is if you want the real data in the buffer for
something like IO or other low-level things.

When things are done, encodings and charsets will be dynamically
loadable -- that is, while parrot will ship with quite a few, only
the ones you actually need will be loaded in. This makes for a
smaller runtime footprint (so no need to load in ICU if your program
is all about Latin-1 data) and for easier upgrading and extension (We
don't have any of the Asian charsets, nor do we have most of the
ISO-8859 sets. Yet).

Now, with this in mind it's *very* important to draw a distinction
between what is an encoding and what is a charset. This gets somewhat
muddled, especially since many of the standards for this stuff define
both encoding and charset semantics. This means that we have to be
somewhat careful, and it means that we will have charsets and
encodings with the same names in some circumstances.

Things which define grapheme semantics are charsets. ASCII is a
charset. ISO-8859-x is a charset. Unicode is a charset. Shift-JIS is
a charset. EBCDIC is an abomination, but it's also a charset. RAD-50
is a charset. These all define how graphemes behave and what they
mean.

Things which define how bytes dance are encodings. UTF-8 is an
encoding, UTF-16 is an encoding. Byte is an encoding. (Though I'm
calling it fixed_8) Shift-JIS is an encoding. RAD-50 is an encoding.

Now, in some circumstances semantics are mushed together enough that
it's somewhat difficult to tease them apart (like in many of the
Asian charset/encoding standards) so we'll have some fun there. We'll
live, and worst case everyone just pivots to Unicode and pretends not
to worry about it.

It is also important to keep in mind that not all charset/encoding
pairs are allowable, and that charsets can require certain encodings
to be used with them. Unicode, for example, won't allow the RAD-50 or
byte encodings, since they don't have sufficient range. ASCII *could*
use the UTF-32 encoding if it wanted, though that'd be wasteful.
Charsets may have a preferred encoding as well, which is also fine,
though we'll prefer they not worry too much about that. (So we can
swap in compressing and sparse encodings, for example)
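The range argument above can be sketched as a simple check: an encoding is usable with a charset only if it can represent the charset's largest code point. The table names and pair_allowed are hypothetical, and range is the only criterion modeled here; a real charset could impose other requirements.

```python
# Largest value each encoding can represent (illustrative figures).
ENCODING_MAX = {
    "rad50": 39,          # RAD-50 has a 40-character repertoire
    "fixed_8": 0xFF,      # one byte per code point
    "utf8": 0x10FFFF,
    "utf32": 0x10FFFF,
}

# Largest code point each charset defines.
CHARSET_MAX = {
    "ascii": 0x7F,
    "iso-8859-1": 0xFF,
    "unicode": 0x10FFFF,
}


def pair_allowed(charset: str, encoding: str) -> bool:
    """True if the encoding has enough range for every code point in the charset."""
    return ENCODING_MAX[encoding] >= CHARSET_MAX[charset]
```

With these tables, pairing Unicode with rad50 or fixed_8 is refused, while pairing ASCII with utf32 is allowed even though it wastes three bytes per character.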

Anyway, with all this, things should work out reasonably well. The
bytecode-level API has already been specified, which allows pretty
much all of the underlying complexity to be hidden (and, indeed,
allows the existence of non-Unicode data to be hidden if that's what
you really want) from bytecode programs, which is fine.

This should be all checked in and working in the next day or two, at
which point I want to merge back into the main tree. We'll lose
Unicode support at that point, but putting together a Unicode charset
library should be straightforward. We will probably want to take a
look at some sort of pmc-class-style preprocessing code, since the
charset libraries are all awfully similar, so inheriting's not a bad
thing to do. OTOH, I'm not sure we'll have enough of these to matter.

The basic libraries at final merge, if you're following along, will be:

encoding: fixed_8 (byte == codepoint)
charset: binary, ascii, ISO-8859-1 (latin-1)

I'd like to get Unicode up to speed quickly at that point, as well as
either Shift-JIS or one of the GB sets, though I'm not sure I'll have
the time to do so. From there we'll see where we go.
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Larry Wall

Nov 11, 2004, 12:19:39 PM

to perl6-i...@perl.org

All in all, looks really good, especially the fact that it defaults
to a grapheme view rather than a codepoint view. I also like the
escape valve for drilling down to bytes if you really need it, but
it reminds me that we'll need something similar for drilling down
to codepoints for those charsets that define graphemes with multiple
codepoints. If Parrot ops "never go past the charset", and if there's
only one charset view possible for a string, then it seems to imply
that either we have to force a conversion (yuck) to change the view
from a grapheme-oriented charset to a codepoint-oriented charset,
or we need some way of having the two different charset interfaces
to the same string simultaneously, or we need some other way to drill
through to codepoints much like the byte view provides.

Larry

Dan Sugalski

Nov 11, 2004, 12:33:53 PM

to perl6-i...@perl.org

At 9:19 AM -0800 11/11/04, Larry Wall wrote:
>All in all, looks really good, especially the fact that it defaults
>to a grapheme view rather than a codepoint view. I also like the
>escape valve for drilling down to bytes if you really need it, but
>it reminds me that we'll need something similar for drilling down
>to codepoints for those charsets that define graphemes with multiple
>codepoints.

Ah, I'm too close to the source.

The charset API defines both "get_grapheme(s)" and "get_codepoint(s)"
entry points. get_codepoint returns a single 32-bit integer; the rest
return STRINGs with the appropriate stuff in 'em. I
think this covers what you're worried about.
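A rough sketch of the difference between the two views. Only the names get_grapheme and get_codepoint come from the message above; the signatures and the graphemes helper are assumptions. The clustering rule used here (attach combining marks to the preceding base character) is a simplification and not the full Unicode grapheme cluster algorithm.

```python
import unicodedata


def graphemes(text: str) -> list:
    """Split text into base-plus-combining-marks clusters (simplified)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch       # combining mark joins the previous cluster
        else:
            clusters.append(ch)      # anything else starts a new cluster
    return clusters


def get_codepoint(text: str, index: int) -> int:
    """Code point view: a single integer."""
    return ord(text[index])


def get_grapheme(text: str, index: int) -> str:
    """Grapheme view: a string, possibly holding several code points."""
    return graphemes(text)[index]
```

For the text "e" followed by U+0301 (combining acute accent) and then "a", the grapheme view has two elements while the code point view has three, and index 1 means something different in each.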

Ron Blaschke

Nov 14, 2004, 7:04:39 AM

to Dan Sugalski, perl6-i...@perl.org

Thursday, November 11, 2004, 5:42:29 PM, Dan Sugalski wrote:
> Or something like that.

[snip]

FWIW, I really like the idea.

Will there be a data type for "characters," or are those just strings
with a single grapheme?

As a side note, the Java people decided on UTF-16 Unicode "char"s,
and spent some good time getting Supplementary Characters (> U+FFFF) to
work.
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

Ron

Dan Sugalski

Nov 14, 2004, 4:41:41 PM

to Ron Blaschke, perl6-i...@perl.org

At 1:04 PM +0100 11/14/04, Ron Blaschke wrote:
>Thursday, November 11, 2004, 5:42:29 PM, Dan Sugalski wrote:
>> Or something like that.
>
>[snip]
>
>FWIW, I really like the idea.
>
>Will there be a data type for "characters," or are those just strings
>with a single grapheme?

Strings with a single grapheme. "Characters" can be multiple code
points, so it's the only way to do it properly.

There is direct code point access, and those are 32-bit unsigned
ints. We're going to frown on most uses of those, since it's a good
way to find yourself behaving really badly in a number of cases.

Luckily Leo's last name'll make sure that we at least manage it
properly in parrot. :)

>As a side note, the Java people decided on UTF-16 Unicode "char"s,
>and spent some good time getting Supplementary Characters (> U+FFFF) to
>work.
>http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

Yeah, that was something I didn't want to deal with. Unicode's got
the largest range of code points, and it says 32 bits are enough. If
it goes 64-bit at some point, well... Hopefully I'll be long-retired
and not caring any more. :)
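For context on the Supplementary Characters issue, here is a small sketch of why a 16-bit "char" needs special handling above U+FFFF while a 32-bit code point does not. The function name utf16_units is hypothetical; the arithmetic is the standard UTF-16 surrogate pair calculation.

```python
def utf16_units(cp: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if cp < 0x10000:
        return [cp]                     # fits in a single 16-bit unit
    v = cp - 0x10000                    # 20 bits left to split in two
    high = 0xD800 | (v >> 10)           # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)          # bottom 10 bits -> low surrogate
    return [high, low]
```

A code point such as U+1F600 comes back as two units, so code that indexes by 16-bit unit can land in the middle of a character. With 32-bit code points every value is a single integer.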