1) This is explicitly arguing for one viewpoint, not against another.
(That is, I've not included any such arguments.)
2) I've written it in a manner which doesn't directly mention Parrot,
so that I can use it in other contexts.
3) There are two separate issues with respect to Parrot: the strings
semantics, and the internal representation/implementation. This
addresses only the former. It also doesn't include an API
recommendation. This is because the semantics need to be sorted out
first.
4) There are a few topics yet-to-be-covered.
5) I should probably provide a (much shorter) synopsis.
----------------------------------------------------------------------
What's a String?
This article is intended to clarify some concepts related to strings.
The motivation is that they are a frequent source of confusion for
developers working in various programming languages. This lays out a
framework for thinking about strings--a viewpoint, if you will. It
reflects a conceptual model that's already embraced by some programming
environments, but even those environments often lack a detailed
explanation. This document also aims to disambiguate various
terms--clarifying some and providing self-consistent definitions of
others which are used inconsistently in different contexts.
So, what is a string?
In the most general sense, a string is a data type in a programming
language, meant to model text. It's meant to allow manipulation,
viewing, analysis, and searching of the stuff you read, or which
computer programs read on your behalf.
A bit more concretely:
Definition: A string represents a sequence of abstract characters.
So we can't go much further until we say a few words about what a
"character" is. But the two important words to remember from that
definition are "represents" and "characters". (And, "sequence" is
important too, but you won't overlook that one.)
What's a character?
A character, in fuzzy terms, is the smallest unit of meaning in a
language.
A character is an abstract concept. Characters are things like letters,
numbers, punctuation marks, the Japanese symbol for "middle"--stuff
like that. The concept of a character is supposed to capture an
intuitive notion--the notion usually understood by the common language
speaker. (We are, after all, modelling what's essentially text--the
stuff that humans read. That's the whole point.) That said, precisely
determining what constitutes a character, and what doesn't, is
essentially a judgement call, and a decision. Fortunately, essentially
all existing standards agree on this, in a general sense. For instance,
the lowercase "a" and the uppercase "A" are considered to be different
characters, but stylistic variants (the letter "a" in different fonts,
for instance) are not (instead, they're just different graphical
representations of the same abstract character). These could have been
modelled differently--you could consider "a" and "A" just variants of
the same character. But, as it turns out, no one does it that way.
And I mention that now to prepare you for some arbitrariness--some
conventions that could have gone another way, but which are actually
still intuitively meaningful.
Notice that we haven't said anything about how a computer program (or
language) might model a character. That's important. So far we've just
said *what* a computer program is trying to model, not how it might do
it.
And also, just to hit the point home, we're not talking about the
"char" datatype in C. To keep things clear, we'll refer to that as a
"byte", when it comes up.
What is Unicode?
At this point I'd better explain what "Unicode" means, before I slip
and use the word. People like to throw the word around a lot, and it's
not always clear which of the following related (but different) things
they mean. There are at least 6 different possibilities:
1) The Unicode Consortium, a group of companies and organizations that
got together to sort out the handling of international text. ("Sort
out" in the sense of making it trivial to allow production of text
documents containing a mixture of different languages, rather than
nearly impossible.)
2) The standard produced by that group, currently at version 4.0.
3) The character model which forms the basis of that standard, and
which also forms the basis of what I'm talking about here.
4) The database of character properties (or the properties themselves)
assembled by the Unicode Consortium. These properties are things like,
"this character has an uppercase version over here", "this character
represents the decimal digit 3", etc.
5) A family of encodings (word defined below) used for the interchange
of international text between computer programs (by means of files or
network transmissions, for instance). Often, "Unicode" is used to refer
to a specific one; that usage is inherently ambiguous, but harkens back
to the first version of the Unicode standard, which defined only one
encoding.
6) International (multi-language) text handling, in general. This usage
arises because internationalization-aware applications tend to base
their implementations on the Unicode character model, and tend to
support the Unicode encodings. But it's important to recognize this
usage, because sometimes the term "Unicode" is used in contexts which
don't involve any of the above per se, but rather just indicate an
awareness and handling of international text issues.
Of these, (3) is the only one which someone could really
"disagree" with. (That is, the consortium, standard, database, and
encodings exist and are precisely defined--there's nothing
philosophical there to reject.) But the other important point is that
these are all separate things--sometimes people will cite the need to
support other encodings as an argument against the character model,
which doesn't follow logically.
So the string/character model I'm espousing is based largely on the
Unicode Character Model, but it would be misleading to say that I'm
advocating, "use Unicode". (What would that mean, anyway?) But it's (3)
that does the conceptual heavy lifting, and (4) which does the grunt
work. Or, (3) is the concept, and (4) is the details.
I'll try to avoid using the word "Unicode" by itself, to avoid
confusion, but if I do, then I'm probably referring to the Unicode
Standard, or something it specifies.
So let's review. A string is a concrete thing--a data type in a
language; it represents a sequence of (abstract) characters, a
character is a unit of meaning in a language, and a character has
properties which have been collected into a database. To go a bit
further, the fundamental question you can ask a string is, "what's your
Nth character". All else flows from that.
What's a code point?
So if the fundamental question you can ask a string is, "what's your
Nth character", we have to talk a bit about how one would
programmatically express that answer--that is, what sort of data type
is/can/should be used to represent a character? Fundamentally, a
character can of course be represented opaquely--in an object-oriented
language, you could have a character object. That's quite reasonable. As
it turns out, people find it convenient to programmatically represent a
character by an integer (think "whole number", not a specific data type
here). It's convenient for several reasons--it's compact and easy to
refer to in speech. And if the fundamental thing you can ask a string
is what its Nth character is, then the fundamental things you do with a
character are to look up its properties, and to test it for equality against
other characters. So if you just go through and give each character a
little serial number, then you can find the properties of a character
by using its number as an index into a property table (i.e., character
3's properties are at slot number 3 in the table), and you can tell
that 2 characters are different characters by checking whether they are
represented by different numbers.
A number used to represent a character like this is called a "code
point". Which number you choose to assign to which character is
fundamentally arbitrary--its one and only use is as a unique lookup
key. The letter "A" could be assigned the number 1, or 42, or 31337, or
2001, or anything else. It doesn't matter. Now, various standards have
numbered various characters in various ways, though many don't actually
think in these terms, so to be clear if you are going to refer to
"character number 42" or "the code point for the letter 'A'", then you
really need to point out what numbering scheme you are talking about.
Fortunately, the Unicode Standard has numbered *all* of them--it's
given a number to essentially every character in every
digitally-represented language in the world. Since its numbering scheme
is comprehensive, and since the numbering scheme is arbitrary, without
loss of generality we are going to use the Unicode standard's numbering
scheme for the rest of this discussion, and we'll go ahead and pretend
it's the only one. So when we say, "the code point for the letter 'A'",
we'll mean the code point that the Unicode Standard assigns to it,
which (in case you are wondering by now) is 65. (See, the number itself
is not very interesting!)
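
To make the "serial number" idea concrete, here is a minimal sketch in Python 3. It assumes two things that the discussion above doesn't depend on: that Python's built-in str is indexed by code point, and that the standard unicodedata module can stand in for the property table. The variable names are just for illustration.

```python
import unicodedata

text = "A3 \u4e2d"  # "A", "3", a space, and the CJK ideograph for "middle"

# Ask the string for its Nth character, then ask for that character's number.
first = text[0]
code_point = ord(first)

# The number is only a lookup key: use it to fetch properties.
properties = {
    "name": unicodedata.name(first),
    "category": unicodedata.category(first),      # "Lu" means uppercase letter
    "lowercase": first.lower(),
    "digit_value": unicodedata.decimal(text[1]),  # value of the character "3"
}

# Two characters are the same character exactly when their numbers match.
same = ord("a") == ord("A")
```

Nothing here depends on the particular value 65; it only matters that "A" and "a" got different numbers and that each number leads to the right properties.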
So, let's review again. For various practical reasons, it's preferable
to programmatically represent characters using integers, you have to
pick an arbitrary numbering scheme, and somebody's done that, and it's
a good one. This numbering scheme defines a one-to-one correspondence
between numbers (code points) and characters, and that makes it
tempting to pretend that characters *are* numbers. But it's important
to keep in the back of your mind an awareness that the numbers merely
help you pick out the characters, and it's the characters themselves
which are important, and characters are *abstract*--they never actually
live inside of a computer program. [Note: Of course, some numbers don't
represent any character--there are only so many characters. So to be
mathematically precise, there's a one-to-one correspondence between a
subset of integers and all characters.]
What's an encoding?
So, what we've covered so far is the type of stuff you need to work
with text in-memory. We can use the number 65 to represent the letter
"A", and we have a properties database to tell us whether A is a
number, or whitespace, or whatever.
But, text being what it is, people tend to want to save it to disk, or
transfer it between applications. That is, they want to do IO
operations on strings.
Now as it turns out, IO is always an operation which involves bytes.
Even if your IO is hidden under an abstraction layer, at the bottom
you're always reading and writing bytes. Strings are not,
fundamentally, bytes. So how do you create bytes from some high-level
construct like a string? You define a serialization algorithm, that's
how. That's an encoding:
Definition: An encoding is a mapping from sequences of abstract
characters to sequences of bytes (and vice versa).
Since we've said that a string is supposed to represent a sequence of
abstract characters, we could also just say, "An encoding is a mapping
from strings to sequences of bytes (and vice versa)."
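
As a small sketch of that definition, assuming Python 3 (where str.encode and bytes.decode play the roles of the two directions of the mapping), the same string serializes to different bytes depending on which encoding is named:

```python
text = "A\u4e2d"   # "A" followed by the CJK ideograph for "middle"

# Encoding: string -> bytes. The result depends on which encoding is named.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

# Decoding: bytes -> string. Naming the same encoding gets the text back.
round_trip = utf8_bytes.decode("utf-8")
```

The string itself never changed; only its serialized form did.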
*Important Note*: This is, in particular, an area where there is much
conflicting and inconsistent terminology. The concept that I just
defined is the exact same thing that the Unicode Standard refers to as
a "Character Map". That would probably be the preferred term, since I
don't believe it's been used elsewhere with a differing meaning. But,
it's not a commonly used term, so I'm going to use "encoding" instead,
with some hesitation, so that someone jumping into the middle of this
discussion has a chance of figuring out what I'm talking about. My
usage of "encoding" matches the usage of the XML specification, and the
Objective-C Foundation framework. Java and MIME use the word "charset"
for the same concept, and IANA uses "character set". Others use the
term "character set" for something different. But the important thing
to remember is just that, for the purposes of this document, I'm using
the definition above, and when you are reading elsewhere and encounter
any of those terms, you should look for a definition of what is meant
there.
The absolutely crucial thing to remember about an encoding is that it
defines an interchange format--that's 100% what it's for. Its purpose
is to let you communicate textual data between processes (or, store it
to the filesystem for later retrieval by some process). Encodings have
nothing, fundamentally, to do with any sort of in-memory representation
of a string.
The other thing to remember about an encoding is that there are lots of
them--lots and lots of national and industry standards define lots of
different ones. They tend to have names like, "ASCII" or "ISO-8859-1"
or "Shift-JIS" or "Big5" or "UTF-8".
The reason there are so many of them is that, in the early days of
computing, most everyone got it into their heads that they needed to
represent textual data using only one byte per character, which meant
that an encoding could only handle 256 different characters. This was
fine for English (and in fact ASCII only encodes 128 characters), but
as soon as you hit Europe you needed more characters--for French and
German and Greek and Icelandic and Polish and Turkish and so on. You
need more than 256 characters to handle all of that, so different
encodings were invented to handle different collections of characters.
This doesn't sound so bad, and it's fine as long as everyone just talks
to the guy next door, but it quickly turns into a mess if you try to
step outside--you can't create a single text document with words from
multiple languages (for some combinations), and (potentially worse)
when you read in a file created by someone else, you need to know what
encoding they used (that is, you need metadata, of the sort supplied by
MIME and HTTP headers, but not supplied by most filesystems). This
wasn't fun for anyone.
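
The need for metadata is easy to see in a sketch. Assuming Python 3 and its bundled codecs (the codec names below are Python's spellings), a single byte with no label attached decodes to three unrelated characters:

```python
raw = b"\xe9"   # one byte read from a file, with no metadata attached

as_latin1 = raw.decode("iso-8859-1")   # U+00E9, e with acute accent
as_greek = raw.decode("iso-8859-7")    # U+03B9, Greek small letter iota
as_cyrillic = raw.decode("cp1251")     # U+0439, Cyrillic small letter short i
```

None of these decodings is wrong as far as the bytes are concerned; only the sender knows which one was meant.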
The obvious thing to do, in retrospect, was to come up with a different
plan, one which would allow you to handle all languages without a
separate song-and-dance for each one. That's in fact what the Unicode
Consortium came together to do. And they did it. It was an enormous
undertaking.
Okay, back to the point. Not only are there lots of different
encodings, but by their nature most only know how to encode strings
containing a limited collection of characters. ASCII, for example,
doesn't know how to encode Japanese Kanji. But the Unicode Standard
defines a collection of encodings (UTF-8, UTF-16, and UTF-32, with big-
and little-endian variants of the latter two) which can encode strings
containing basically any character.
It's important to take a moment to point out how general the above
definition of "encoding" is, and what it doesn't imply:
1) As mentioned, an encoding doesn't have to know how to encode every
possible string. And reciprocally, not every sequence of bytes will be
decodable by every encoding.
2) Although an encoding is really a pair of algorithms (one to
serialize, one to de-serialize), that doesn't imply that an encoding
has to be strictly invertible. That is, if I encode a string into a
sequence of bytes, and then decode that sequence of bytes into a
string, I might not get back the same string I started with. For
example, an encoding might, conceptually, strip accent characters.
3) An encoding doesn't necessarily operate on a character-by-character
basis. That is, the bytes to encode "ab" might not be the concatenation
of the bytes to encode "a" with the bytes to encode "b". It's even
possible, based on the above definition, that an encoding might be able
to encode the string "ab", but not the string "ba". (In practice, I
don't know if this ever occurs "in the wild".)
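
The three points above can each be poked at with a short sketch, assuming Python 3 and its standard codecs. The helper names can_encode and can_decode are made up for this example. For point 2, the loss is requested through Python's "ignore" error handler, which is only a stand-in for an encoding that is lossy by design.

```python
def can_encode(text, encoding):
    """Return True if the named encoding can serialize this string."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def can_decode(data, encoding):
    """Return True if these bytes are valid input for the named encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# (1) Not every string is encodable, and not every byte sequence is decodable.
kanji_in_ascii = can_encode("\u4e2d", "ascii")
stray_byte_in_utf8 = can_decode(b"\xff", "utf-8")

# (2) A lossy round trip: the accented letter is silently dropped.
lossy = "caf\u00e9".encode("ascii", errors="ignore").decode("ascii")

# (3) Not character-by-character: Python's "utf-16" codec writes a byte order
# mark at the start of its output, so encoding the pieces separately and
# concatenating them gives different bytes than encoding the whole string.
whole = "ab".encode("utf-16")
pieces = "a".encode("utf-16") + "b".encode("utf-16")
```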
So the key points here are: (1) an encoding is all about IO, (2) not
all encodings guarantee data integrity, and (3) the Unicode encodings
_do_ guarantee data integrity. Another important point is that the
choice of an encoding is a crucial piece of metadata for bytes
undergoing IO, and thus it's the sort of thing which is indicated
explicitly in higher-level protocols and formats (HTTP, MIME, and XML,
to name a few), and it is almost always indicated *by name*. IANA
maintains a registry of these, and many protocols use the IANA names to
specify encodings.
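
A sketch of what "indicated by name" looks like in practice, assuming Python 3 and its codecs module. Python's codec names and aliases overlap with the IANA names but are not the registry itself, and decode_body is a hypothetical helper, not part of any protocol library:

```python
import codecs

# Different spellings of a name resolve to the same codec.
canonical = codecs.lookup("ISO-8859-1").name
alias_matches = canonical == codecs.lookup("latin-1").name

def decode_body(raw, declared_charset):
    """Decode bytes using the encoding name that arrived alongside them."""
    return raw.decode(codecs.lookup(declared_charset).name)

body = decode_body(b"caf\xe9", "ISO-8859-1")
```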
Another important point: The Unicode Character Model (unlike many other
standards) thinks of the process of going from strings to bytes as a
multi-step process. That's useful from a pedagogical point of view, but
in fact no one much cares about the results of the intermediate steps.
(That is, Unicode describes the process as: characters --> code points
--> code units --> bytes, and I'm saying that the code units aren't
very useful to think about.) The reason for this is very
simple--encodings are only about data interchange, and data interchange
protocols and formats want to specify the encoding with a single name,
which picks out the whole strings-to-bytes mapping. They don't much
care if two different encodings could be thought of as agreeing on one
of those steps, and differing on another. If you're writing a library
to convert between sequences of bytes in different encodings, then this
can also be useful (to minimize code duplication), but it really has
nothing to do with the semantics of a string.
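
For the curious, the intermediate steps can be made visible with a short sketch, assuming Python 3 and the standard struct module. It takes one character through code point, UTF-16 code units, and bytes:

```python
import struct

text = "\U0001F600"            # one character, code point U+1F600

code_point = ord(text)
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

# Reinterpret the big-endian bytes as two unsigned 16-bit code units.
code_units = struct.unpack(">2H", utf16_be)
```

The two UTF-16 variants agree on the code units and differ only in byte order, which is exactly the kind of shared intermediate step that matters to a conversion library and to nobody else.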
What's a Character Set?
As mentioned above, some people use the term "character set" to mean
what I'm calling "encoding".
In my usage, a "character set" is something simpler--it's a set of
characters, in the mathematical sense of "set". It's an unordered
collection of characters. The letter "A", the comma, and the Japanese
character for "middle"--there, that's a character set. The Unicode
Standard uses the term "abstract character repertoire" for this notion.
That's actually less ambiguous terminology, but quite a mouthful. I'll
try to use "repertoire".
A character set, in this sense, isn't a terribly interesting concept,
and it's a good term to stay away from, given the ambiguity. But people
often say things like, "the ASCII character set". Now, in my usage, the
ASCII standard primarily defines an encoding. However, subject to a few
caveats mentioned above, an encoding implicitly picks out a character
set--specifically, the set of characters that it can encode. So in
light of this, I can meaningfully say "the ASCII character set", and
talk about "how the Shift-JIS encoding handles characters in the
ISO-8859-1 character set". I mention this not to particularly
encourage the usage, but because you'll hear it, and because it's a bit
less awkward to say than things like, "how the Shift-JIS encoding
handles characters encodable in the ISO-8859-1 encoding". And also,
it's convenient to say, "ASCII characters" (through a small abuse of
language) to talk about "ABC...", even in the context of another
encoding.
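
To show the "set that an encoding implicitly picks out" as an actual set, here is a sketch assuming Python 3. The repertoire function is hypothetical, and it only works for single-byte encodings, where trying all 256 byte values enumerates everything the encoding can represent:

```python
def repertoire(encoding):
    """Collect every character a single-byte encoding can represent."""
    chars = set()
    for value in range(256):
        try:
            chars.add(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            pass   # this byte value is unassigned in the encoding
    return chars

ascii_set = repertoire("ascii")
latin1_set = repertoire("iso-8859-1")
```

With the sets in hand, "ASCII characters" in the context of another encoding is just set membership: every member of ascii_set is also in latin1_set.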
Wow, so we've covered a lot of ground. Let's sum up again: Conceptually
strings model a sequence of abstract characters--that's their job.
Encodings (or Character Maps) define interchange formats, and are only
important to IO--to let you transmit strings between processes (via the
network or via files). Encodings are basically algorithms, but in usage
they tend to be identified by conventional names.
What's a Locale?
One of the things that people like to do with strings is to manipulate
them. Two prototypical things they want to do are case transformations
(uppercasing and lowercasing) and sorting. And they also want to do
things like create string representations of numbers and dates, and the
inverse--interpret strings as representing numbers and dates.
Now as it turns out, different people want to do these things in
different ways--think of "1/31/2004" v. "31/1/2004" v. "2004-1-31".
Consequently, that means the sorting algorithms and number formatting
algorithms need to know which way you want to do things (and which way
I want to do them). Traditionally that means that these types of
algorithms need to take a parameter (explicitly or implicitly) to
specify this choice. Traditionally, this parameter is called a
"locale".
Now this word can cause a lot of confusion. It doesn't need to. In the
most concrete sense, a "locale" just specifies a set of algorithms and
settings--things like the format to use for dates, the comparison
operation to use for sorting, and the characters to use for the decimal
and thousands separators for numbers (think of "1,000.50" v.
"1.000,50"). In fact, to reflect this definition, I'm going to coin a
new term. I'm going to call this a "Text Preferences Set", or "TPS" for
short, to emphasize this definition.
Now where the complexity tends to come in also explains where the term
"locale" came from. People, as it turns out, don't (often) sit around
thinking up new sort orderings, or new formats to display numbers and
dates. More often, they want to do these things in a way that matches
the custom of some language or country or region--they want to sort
strings to match somebody's phone book or dictionary. Consequently,
they find it convenient to specify a TPS by specifying a language or
country, and use notations such as "en" for English and "en_US" for US
English (v. British English). But from an API perspective, this is just
a way to go look up a TPS--once you have one, it doesn't much matter
where it came from (looked up by some name or built up
programmatically, piecemeal), you just use it.
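
A minimal sketch of a TPS as plain data, assuming Python 3. Everything here (the TextPreferences class, its three fields, the BY_NAME table, and the two formatting functions) is invented for illustration and is not an API proposal; a real TPS would carry many more settings.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TextPreferences:
    date_format: str      # strftime pattern
    decimal_sep: str
    thousands_sep: str

# Looking a TPS up by name is just one way to obtain one.
BY_NAME = {
    "en_US": TextPreferences("%m/%d/%Y", ".", ","),
    "de_DE": TextPreferences("%d.%m.%Y", ",", "."),
}

def format_number(value, prefs):
    text = f"{value:,.2f}"    # fixed layout first, e.g. "1,000.50"
    swap = {",": prefs.thousands_sep, ".": prefs.decimal_sep}
    return "".join(swap.get(ch, ch) for ch in text)

def format_date(day, prefs):
    return day.strftime(prefs.date_format)

# A TPS built up piecemeal works exactly the same way.
custom = TextPreferences("%Y-%m-%d", ".", " ")
```

The formatting functions never ask where the preferences came from: format_number(1000.5, BY_NAME["de_DE"]) and format_number(1000.5, custom) go through the same code path.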
Sorting, and Tailorings
Now as I hinted at above, people in different countries like to sort
strings in different ways. Fair enough. That makes it sound like I need
to define a whole bunch of binary string comparison functions--one for
each different custom. One for German, one for Swedish, one for
English, one for Japanese, etc. As it turns out, there's a more compact
way to do this.
The Unicode Standard defines a base collation algorithm (sort
algorithm)--you can use it to sort strings composed of any characters
at all, but it doesn't necessarily give you an order that matches any
particular language's custom (though for many languages it's at least
reasonable). This is called the Unicode Collation Algorithm (UCA). Now,
this algorithm supports the concept of "tailorings"--tweaks to change the
sort behavior for certain characters (or sequences of characters). The
idea is that you start with the UCA, and supply one set of tailorings
to get the customary English sort order, another set to get the German
sort order, etc. There are three really nice things about this
approach:
1) Rather than writing a whole new algorithm, you just tweak a base
algorithm, and these tweaks are usually data-driven. (That is, they
usually take the form of modifications to a weighting table, rather
than an algorithmic difference.)
2) Under any set of tailorings, you can sort strings containing any
characters. So, you can meaningfully talk about how Japanese words sort
under the English tailorings (or English locale or traditional English
TPS). It's just that, naturally, the sorting of Japanese words won't
differ under English v. German sorting rules (but sorting of some
English words might), and English words probably won't sort differently
under Japanese v. Taiwanese sorting rules (but Japanese and Chinese
words might).
3) In contexts where you don't know what language is relevant, you have
a reasonable fallback sort order.
So for sorting, a language (or more generally, a TPS) just provides a
bit of tuning on top of a default algorithm, and you can sort with or
without taking language into account.
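
The shape of "base algorithm plus data-driven tailoring" can be sketched in a few lines of Python 3. To be clear about the assumptions: this is a toy and not the Unicode Collation Algorithm. The default weights are improvised from canonical decompositions, and sort_key and SWEDISH_LIKE are hypothetical names. It only illustrates that a tailoring can be a small table of overrides.

```python
import unicodedata

def sort_key(text, tailoring=None):
    """Two-level toy key: all primary weights first, then all accents."""
    tailoring = tailoring or {}
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in tailoring:
            weight, accent = tailoring[ch]
        else:
            # Default: weigh by the base letter, break ties on the accents.
            decomposed = unicodedata.normalize("NFD", ch)
            weight, accent = (decomposed[0], 0), decomposed[1:]
        primary.append(weight)
        secondary.append(accent)
    return (primary, secondary)

# A tailoring is just data: overrides for a few characters.
SWEDISH_LIKE = {
    "\u00e5": (("z", 1), ""),   # a with ring sorts after z
    "\u00e4": (("z", 2), ""),   # a with diaeresis sorts after that
    "\u00f6": (("z", 3), ""),   # o with diaeresis sorts last
}

words = ["zebra", "\u00e4pple", "apple"]
default_order = sorted(words, key=sort_key)
tailored_order = sorted(words, key=lambda w: sort_key(w, SWEDISH_LIKE))
```

Under the default key the accented word lands next to "apple"; under the tailoring it moves after "zebra". Strings containing characters the tailoring never mentions still sort, using the default weights.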
Now, it's important to note that the UCA isn't just sorting in terms of
the numerical values of the Unicode code points assigned to characters.
But you can sort this way too--it's sometimes called "binary
order"--and it can make sense to do this in situations where you don't
much care _what_ the sort order is, just that you have something
canonical (for instance, so that you can count duplicates in a list).
The reason you might want to use binary order in cases like this is
that it's faster, and simpler to implement. (Also, you might know that
it's an acceptable sort order for your problem domain--maybe you're
sorting part numbers, for instance.)
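
A sketch of the part-number case, assuming Python 3, where comparing two str values compares them code point by code point, so a plain sorted() gives the order described above. The sample data is made up.

```python
from itertools import groupby

part_numbers = ["b-20", "A-7", "b-20", "a-15", "A-7"]

# Code point order: uppercase "A" (65) sorts before lowercase "a" (97).
canonical = sorted(part_numbers)

# groupby only merges adjacent equal items, so it needs some canonical
# order, but it doesn't care which one.
counts = {key: len(list(group)) for key, group in groupby(canonical)}
```

No human would file "A-7" and "a-15" that way, but for counting duplicates it makes no difference.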
So, another sum-up: A TPS (or "locale") is a set of preferences
relating to how you want to carry out certain string operations, and
often these preferences take the form of "tailorings" or tweaks on top
of a base algorithm.
Another place where this sort of thing comes up is with case
mappings--how you uppercase or lowercase strings depends somewhat on
language-based conventions.
Another use for uppercasing or lowercasing in English is to perform
case-insensitive string comparison, so that "foo" and "FOO" and "Foo"
and "FoO" compare as the same. In English, you can either uppercase
everything before comparison, or lowercase everything--it doesn't much
matter which you do, the result will be the same. In other languages,
that's not necessarily the case. Conveniently, the Unicode Standard
defines a process called "case folding", which is designed to let you
do case-insensitive comparisons. Case folding transforms strings into
case-canonicalized representations (conceptually like uppercasing then
lowercasing everything), and it does this in a single,
non-language-sensitive way. So even though language is relevant to case
mappings, you can use case folding to do case-insensitive string
comparisons without specifying a language (or TPS).
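
Here is a sketch of why "just lowercase both sides" isn't enough outside English, assuming Python 3, whose str.casefold method implements Unicode case folding. The helper same_ignoring_case is a hypothetical name.

```python
upper_word = "STRASSE"
mixed_word = "Stra\u00dfe"    # written with the sharp s, U+00DF

# Lowercasing leaves the sharp s alone, so the two spellings stay different.
equal_when_lowercased = upper_word.lower() == mixed_word.lower()

# Uppercasing expands the sharp s to "SS", so here the choice matters.
equal_when_uppercased = upper_word.upper() == mixed_word.upper()

# Case folding is defined for exactly this purpose and needs no language.
equal_when_folded = upper_word.casefold() == mixed_word.casefold()

def same_ignoring_case(a, b):
    return a.casefold() == b.casefold()
```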
So a take-away point here is that the Unicode Standard provides
infrastructure in a number of areas for doing operations in a
TPS-sensitive or TPS-insensitive manner, and provides a smooth
transition between the two.
A note about Language
There's one thing about language-sensitive operations which people tend
to overlook, and which is vitally important: for string operations,
it's not the language _of_the_string_ which is important, it's the
language _of_the_reader_. The canonical example here is this: Consider
a list of names, some of them German, and some of them Swedish. I, as
an English-speaking reader, would want to see these sorted in the
_English_ sort order. I don't much care how Germans or Swedes would
sort them. My local phone book doesn't take the national origin of a
person into account when organizing the listing. It doesn't matter--the
point of having it sorted is to allow people in the US to look up a
name. This means that there isn't a problem of how to compare "English
strings" and "German strings"--the language of the sort operation, not
of the individual strings, determines the binary comparison algorithm
used in the sort. That also means that it's a non-problem to have a
single string containing words from different languages.
This also means that, in the US, the German word "STRASSE" (on a sign
or in the name of a company, for instance) would lowercase as
"strasse", even though in Germany it might be preferable to use
"straße".
What's a Grapheme Cluster?
More about this later, but briefly: As it turns out, certain languages
have certain funny conventions about grouping sequences of characters
into what they see as one "thing". For instance, Spanish considers the
character sequence "ll" to be a separate letter (not two "l"'s), and
similarly for "rr". Now, they don't really treat them specially from an
"information processing" point of view--I don't believe that there are
any encodings which treat "ll" as a single character, and I don't think
that Spanish typewriters have a separate key for "ll". It's just that
they think of it as a separate letter--it shows up when they are
reciting the alphabet, and they see the word "llama" as containing four
letters (the first letter being called "ell-yay", with the word
pronounced "yama", not "lama"), and they would sort that word after
"luego" (because "ll" comes after "l").
The Unicode Standard calls this a "grapheme cluster", and (as mentioned
above) it's relevant to localized sorting and word-length counting.
This is a somewhat unfortunate term, since "grapheme" is a linguistics
term, and it means something different to linguists. Previous versions
of the Unicode Standard used the term "grapheme", but they've migrated
to using "grapheme cluster" to minimize confusion, though the standard
is still somewhat jumbled in this regard.
But the brief take-home idea here is: "grapheme cluster" ("grapheme"
for short, with reservations), in this context, is a sequence of one or
more characters (a range in a string) which natural-language users
would see as a unit. This is again a situation where language (or TPS)
is relevant, but where the Unicode Standard provides a default
TPS-independent implementation which allows for tailorings. So again
there's a smooth transition from language-independent to
language-dependent settings.
Also, importantly, a grapheme cluster is a notion built on top of
characters (it's a cluster of characters), and choosing a language lets
you refine how you break up a string into grapheme clusters, but it's
just a refinement--"adding a language into the mix" doesn't pick out a
different semantic construct, it just helps you customize your choice of
what ranges make up single graphemes.
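
As a rough sketch of "a cluster of characters, refined by language", assuming Python 3: the function below is a deliberately simplified stand-in, not the Unicode Standard's segmentation rules. Its default behavior only attaches combining marks to the preceding character, and the digraphs parameter is a hypothetical way to express a language-specific refinement such as the Spanish "ll".

```python
import unicodedata

def simple_clusters(text, digraphs=()):
    """Split text into ranges a reader might see as single units."""
    clusters = []
    i = 0
    while i < len(text):
        if text[i:i + 2].lower() in digraphs:
            # Language-specific refinement: treat the pair as one unit.
            clusters.append(text[i:i + 2])
            i += 2
        elif clusters and unicodedata.combining(text[i]):
            # Default behavior: a combining mark joins what precedes it.
            clusters[-1] += text[i]
            i += 1
        else:
            clusters.append(text[i])
            i += 1
    return clusters

accented = simple_clusters("cafe\u0301")             # combining acute accent
plain = simple_clusters("llama")
spanish = simple_clusters("llama", digraphs=("ll",))
```

The refinement changes where the boundaries fall ("llama" goes from five units to four), but the result is still a list of ranges over the same underlying characters.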
I haven't yet covered a few important topics, such as different
character sequences representing equivalent graphemes, canonical and
compatibility equivalence, and Unicode normalization forms. I also
haven't said anything yet about concrete implementation or API
guidelines.
JEff
Hmmm... very good.
One question.
Does (that which the masses normally refer to as) binary data
fall inside or outside the scope of a string?
--
Bryan C. Warnock
bwarnock@(gtemail.net|raba.com)
On Wed, Apr 28, 2004 at 04:22:07AM -0700, Jeff Clites wrote:
: As it turns out, people find it convenient to programmatically represent a
: character by an integer (think "whole number", not a specific data type
: here).
After being so careful to define "character" abstractly, this whole
passage misleads the reader into believing that any such abstract
character can be represented by a single integer (code point).
Only a subset of characters can be represented by a single code point.
Many characters require multiple code points. I see this as a critical
point--it's at the one-to-many interfaces that things tend to break,
and that's precisely why Perl 6 has the four abstraction levels it does:
Level 0: bytes
Level 1: codepoints don't fit into bytes
Level 2: graphemes don't fit into codepoints
Level 3: characters don't fit into graphemes
(where I've used the term "characters" in the language-sensitive sense.)
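
A small sketch of the first three levels, assuming Python 3 and the standard unicodedata module. The example character is a "q" with a tilde, chosen because it has no single-code-point form, so even NFC normalization leaves it as a sequence:

```python
import unicodedata

text = "q\u0303"     # "q" plus a combining tilde: one thing to a reader

composed = unicodedata.normalize("NFC", text)
code_points = [ord(ch) for ch in composed]   # level 1 view
utf8_bytes = composed.encode("utf-8")        # level 0 view
```

One grapheme, two code points, three bytes (in UTF-8): each level gives a different answer to "how long is this string?".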
Not making this distinction also causes you to leave out a level
of collation:
Level 0: binary sorting
Level 1: codepoint sorting
Level 2: language-independent grapheme sorting (UCA)
Level 3: UCA plus tailorings
: It's convenient for several reasons--it's compact and easy to
: refer to in speech. And if the fundamental thing you can ask a string
: is what its Nth character is, then the fundamental things you do with a
: character is look up its properties, and test it for equality against
: other characters. So if you just go through and give each character a
: little serial number, then you can find the properties of a character
: by using its number as an index into a property table (i.e., character
: 3's properties are at slot number 3 in the table), and you can tell
: that 2 characters are different characters by checking whether they are
: represented by different numbers.
But this is really only true of codepoints, not of graphemes or
characters. I realize that oversimplifying is a useful pedagogical
technique, but when you do that you ought to "unlie" in the same
document somewhere. (I'll grant you that you promise to unlie in
your final paragraph, kinda sorta.)
: Fortunately, the Unicode Standard has numbered *all* of them--it's
: given a number to essentially every character in every
: digitally-represented language in the world.
Um, no--not unless you've defined how to multiplex the multiple
integers of the codepoints in a grapheme into a single integer, and
I haven't heard that the Unicode consortium has come up with such
a definition.
: So, let's review again. For various practical reasons, it's preferable
: to programmatically represent characters using integers, you have to
: pick an arbitrary numbering scheme, and somebody's done that, and it's
: a good one. This numbering scheme defines a one-to-one correspondence
: between numbers (code points) and characters,
There you go again. You need to settle on one definition of character
or the other. I kind of like the abstract definition, but that's not
how you're using it here.
: and that makes it
: tempting to pretend that characters *are* numbers. But it's important
: to keep in the back of your mind an awareness that the numbers merely
: help you pick out the characters, and it's the characters themselves
: which are important, and characters are *abstract*--they never actually
: live inside of a computer program.
Cain't have it both ways...
[Note: Of course, some numbers don't
: represent any character--there are only so many characters. So to be
: mathematically precise, there's a one-to-one correspondence between a
: subset of integers and all characters.]
And many characters are not represented by any integer, but by a sequence
of integers.
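As a hedged illustration in Python (chosen only for convenience): some base-plus-mark combinations have a precomposed codepoint and some do not. The sketch assumes that Unicode has no precomposed "q with tilde", which is true as far as I know; the names are made up for the example.

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT composes to a single codepoint under NFC.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# 'q' + COMBINING TILDE has no precomposed form (assumption stated above),
# so even the composed normalization form keeps two codepoints.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

print(len(e_acute), len(q_tilde))
```

So a reader's single "character" can remain a sequence of integers no matter how it is normalized.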
: Also, importantly, a grapheme cluster is a notion built on top of
: characters (it's a cluster of characters), and choosing a language lets
: you refine how you break up a string into grapheme clusters, but it's
: just a refinement--"adding a language into the mix" doesn't pick out a
: different semantic construct, it just helps you customize your choice of
: what ranges make up single graphemes.
I'd say a grapheme cluster functions as a "character" by your original
definition, so this is another case where you're using "character"
to mean something less than that. Also the last sentence seems to
be calling a grapheme cluster a grapheme, which is confusing. A grapheme
cluster is a cluster of graphemes, kinda by definition...
: I haven't yet covered a few important topics, such as different
: character sequences representing equivalent graphemes, canonical and
s/character/codepoint/
: compatibility equivalence, and Unicode normalization forms. I also
: haven't said anything yet about concrete implementation or API
: guidelines.
I await your coverage of those topics with interest.
Larry
> {snipped, obviously}
>
> Hmmm... very good.
>
> One question.
>
> Does (that which the masses normally refer to as) binary data
> fall inside or outside the scope of a string?
Outside. Conceptually, JPEG isn't a string any more than an XML
document is an MP3.
Some languages make this very clear by providing a separate data type
to hold a "blob of bytes". Java uses a byte[] for this (an array of
bytes), rather than a String. And Objective-C (via the Foundation
framework) has an NSData class for this (whereas strings are
represented via NSString).
Now, languages such as Perl5 can get away with trojaning binary data
into a string, because some encodings (for example, ISO-8859-1 and
MacRoman) have the property that any sequence of bytes can be decoded
into a string. That is, you can take an arbitrary blob of bytes, and
_pretend_ that it represents textual data encoded in ISO-8859-1 (for
example). But it's sort of a hack, and subverts the semantic purpose of
a string. (And it implies that you can uppercase a JPEG, for instance).
Only some encodings let you get away with this--for example, not every
byte sequence is valid UTF-8, so an arbitrary byte blob likely wouldn't
decode if you tried to pretend that it was the UTF-8-encoded version of
something. The major practical downside of doing something like this is
that it leads to confusion, and propagates the viewpoint that a string
is just a blob of bytes. And the conceptual downside is that if a
string is fundamentally intended to represent textual data, then it
doesn't make much sense to use it to represent something non-textual.
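A small sketch of the point above, in Python (used here only as a convenient illustration; the variable names are hypothetical). The four bytes are a common opening of a JPEG file. Pretending they are ISO-8859-1 always "works", pretending they are UTF-8 does not, and uppercasing the pretend text produces something that can no longer be turned back into bytes:

```python
# A common opening byte sequence of a JPEG (JFIF) file.
blob = bytes([0xFF, 0xD8, 0xFF, 0xE0])

# ISO-8859-1 maps every byte value to a character, so this never fails.
as_text = blob.decode("iso-8859-1")

# UTF-8 has invalid byte sequences, so an arbitrary blob may not decode.
try:
    blob.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# "Uppercasing the JPEG": 0xFF (y with diaeresis) maps to U+0178,
# which has no ISO-8859-1 encoding, so the bytes cannot be recovered.
shouted = as_text.upper()
try:
    shouted.encode("iso-8859-1")
    round_trips = True
except UnicodeEncodeError:
    round_trips = False

print(utf8_ok, round_trips)
```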
JEff
I'm not vehemently opposed to redefining the meaning of "string"
this way, but I would like to point out that the term used to have
a more general meaning. Witness terms like "bit string".
: Some languages make this very clear by providing a separate data type
: to hold a "blob of bytes". Java uses a byte[] for this (an array of
: bytes), rather than a String. And Objective-C (via the Foundation
: framework) has an NSData class for this (whereas strings are
: represented via NSString).
Another approach is to say that (in general) strings are sequences
of abstract integers, and byte strings (and their ilk) impose size
constraints, while text strings impose various semantic constraints.
This is more in line with the historical usage of "string".
: Now, languages such as Perl5 can get away with trojaning binary data
: into a string, because some encodings (for example, ISO-8859-1 and
: MacRoman) have the property that any sequence of bytes can be decoded
: into a string. That is, you can take an arbitrary blob of bytes, and
: _pretend_ that it represents textual data encoded in ISO-8859-1 (for
: example). But it's sort of a hack, and subverts the semantic purpose of
: a string.
Hmm, that implies a logical ordering of constraints that was not
present at the time.
: (And it implies that you can uppercase a JPEG, for instance).
: Only some encodings let you get away with this--for example, not every
: byte sequence is valid UTF-8, so an arbitrary byte blob likely wouldn't
: decode if you tried to pretend that it was the UTF-8-encoded version of
: something. The major practical downside of doing something like this is
: that it leads to confusion, and propagates the viewpoint that a string
: is just a blob of bytes. And the conceptual downside is that if a
: string is fundamentally intended to represent textual data, then it
: doesn't make much sense to use it to represent something non-textual.
I think of a string as a fundamental data type that can be *used* to
represent text when properly typed. But strings are more fundamental
than text--you can have a string of tokens, for instance. Just because
various string types were confused in the past is no reason to settle
on a single string type as "the only true string". If you can do it,
fine, but you'll have to come up with a substitute name for the more
general concept, or you're going to be fighting the culture continually
from here on out. I don't like culture wars...
I'm speaking strictly on a cultural level there. I'm certainly of
the opinion that Perl 6's Str type should assume textiness, and that
bit or byte or object strings should be declared some other way.
Alternately, the term "string" could be relegated to the category of
things that are too general to instantiate, and then we force text
strings to be declared as Text or some such. "String" would become
a role or some such instead. But that's language design, and I'm in
the wrong list for that...
Larry
>> Does (that which the masses normally refer to as) binary data
>> fall inside or outside the scope of a string?
> Some languages make this very clear by providing a separate data type
> to hold a "blob of bytes".
Back to Parrot, which isn't covered by the manifesto. But anyway we
already need[1] "enum_stringrep_blob" or "_bytes". I can't imagine that
we use a different data type, this would totally mess with Perl
compatibility.
We must ensure that such a string is never upscaled to another string
representation. We can do all byte-wise operations on such a string, but
e.g. appending an utfX string or such should be an error.
The main problem currently seems to be IO, where the best thing would be
to move the current hacks into a separate layer above the buffered
layer. An additional parameter for open (or layer manipulation
features) can select byte-wise IO.
[1]
- transparent IO
e.g. $ parrot md5sum.imc a.out
- freeze/thaw
- writing packfiles from PASM
> JEff
leo
> Jeff Clites <jcl...@mac.com> wrote:
>> On Apr 28, 2004, at 4:57 AM, Bryan C. Warnock wrote:
>
>>> Does (that which the masses normally refer to as) binary data
>>> fall inside or outside the scope of a string?
>
>> Some languages make this very clear by providing a separate data type
>> to hold a "blob of bytes".
>
> Back to Parrot, which isn't covered by the manifesto. But anyway we
> already need[1] "enum_stringrep_blob" or "_bytes".
Certainly, for the things you've listed under [1] there's no problem
with using a separate data type.
> I can't imagine that
> we use a different data type, this would totally mess with Perl
> compatibility.
Not necessarily (or, that wasn't my intention). For Ponie, we can do
this:
1) Just always implicitly assume "iso-8859-1" when creating strings
which Perl5 would have interpreted as binary.
2) To handle certain features of Perl5 semantics, we could set a flag,
at the PerlString level, to indicate that it should have Perl5-ish
semantics. (That depends on whether a string created in Perl5 code and
passed to Perl6 code should act Perl5-ish or Perl6-ish there. That is,
is its semantics set by its creation context or its use context.) See
below for an example of a case I'm thinking where the semantics might
differ:
> We must ensure that such a string is never upscaled to another string
> representation. We can do all byte-wise operations on such a string,
> but
> e.g. appending an utfX string or such should be an error.
Although, Perl5 lets you append a "utf-8" string to a "binary" string.
But the behavior is odd. For instance, consider this Perl5 behavior
(not sure if it's a feature or a bug):
$a = chr(0xC8);
# append a "utf-8" character, then pull it off
$b = substr($a.chr(0x212b), 0, 1);
print $a; # these print....
print $b; # ...the same thing
print lc($a); # these print...
print lc($b); # ...different things
if( $a eq $b ) { print "yes" } # this prints yes
So, in Perl5, not only does the behavior of a (non-utf-8?) string
change if it "touches" something utf-8-ish, but it does this despite
"eq" telling us the strings are the same. (And, since lc() has no
effect on $a, the implication is that the string is sort of
half-ASCII-half-binary; that is, case mapping has no effect on
characters > 127, which implies they are somehow "uninterpreted"?)
But this behavior could be accommodated (if it's not a bug) at the
PerlString level by special-casing the relevant operations for the
Ponie case.
> The main problem currently seems to be IO, where the best thing would
> be
> to move the current hacks into a separate layer above the buffered
> layer. An additional parameter for open (or layer manipulation
> features) can select byte-wise IO.
Yes, my intention there was for read-as-strings, you'd push a
string-ification layer onto the stack. For byte-wise IO, you wouldn't.
Anded or ored?
: 1) Just always implicitly assume "iso-8859-1" when creating strings
: which Perl5 would have interpreted as binary.
Well, that's what we initially tried to do in Perl 5, but it turned
out to break a lot of programs. Whether Ponie wants to break those
programs is another matter.
: 2) To handle certain features of Perl5 semantics, we could set a flag,
: at the PerlString level, to indicate that it should have Perl5-ish
: semantics. (That depends on whether a string created in Perl5 code and
: passed to Perl6 code should act Perl5-ish or Perl6-ish there. That is,
: is its semantics set by its creation context or its use context.) See
: below for an example of a case I'm thinking where the semantics might
: differ:
I don't think we want to import Perl 5 semantics (or lack thereof) into
Perl 6. Ponie could at least mark strings from a raw filehandle as
"presumed binary" for Perl 6, even if Ponie ignores the distinction
for the sake of backward compatibility. But I'd rather break the
interfaces between Ponie and Perl 6 occasionally than preserve Perl
5's inconsistent semantics in Perl 6. Perhaps type declarations
on the Perl 6 end can keep things sane at the interface.
: >We must ensure that such a string is never upscaled to another string
: >representation. We can do all byte-wise operations on such a string,
: >but
: >e.g. appending an utfX string or such should be an error.
:
: Although, Perl5 lets you append a "utf-8" string to a "binary" string.
: But the behavior is odd. For instance, consider this Perl5 behavior
: (not sure if it's a feature or a bug):
Well, a bug is just a feature you intend to get rid of. :-)
: $a = chr(0xC8);
: # append a "utf-8" character, then pull it off
: $b = substr($a.chr(0x212b), 0, 1);
:
: print $a; # these print....
: print $b; # ...the same thing
:
: print lc($a); # these print...
: print lc($b); # ...different things
:
: if( $a eq $b ) { print "yes" } # this prints yes
:
: So, in Perl5, not only does the behavior of a (non-utf-8?) string
: change if it "touches" something utf-8-ish, but it does this despite
: "eq" telling us the strings are the same. (And, since lc() has no
: effect on $a, the implication is that the string is sort of
: half-ASCII-half-binary; that is, case mapping has no effect on
: characters > 127, which implies they are somehow "uninterpreted"?)
:
: But this behavior could be accommodated (if it's not a bug) at the
: PerlString level by special-casing the relevant operations for the
: Ponie case.
It's a feature we don't intend to propagate. :-)
: >The main problem currently seems to be IO, where the best thing would
: >be
: >to move the current hacks into a separate layer above the buffered
: >layer. An additional parameter for open (or layer manipulation
: >features) can select byte-wise IO.
:
: Yes, my intention there was for read-as-strings, you'd push a
: string-ification layer onto the stack. For byte-wise IO, you wouldn't.
Actually, if I recall, the :raw layer in Perl 5 ends up popping off the
default string layer. But the effect is presumably the same.
Larry
> Yes, my intention there was for read-as-strings, you'd push a
> string-ification layer onto the stack. For byte-wise IO, you wouldn't.
Ok. I/O maintainers, please jump in.
leo
And my thoughts in this regard, to be more specific, are that each layer
has a top and a bottom (as in the current design), and each "side" is
either string-oriented or byte-oriented. So you could potentially have
byte-byte (eg, buffering), string-byte, byte-string, and string-string
layers (some being more common than others). The "bottom" layer is
special--its "top" side works in bytes, and it doesn't really have a
bottom (that's the OS interface). To read strings, you need a
string-byte layer (if the nomenclature is top-bottom), and for such a
layer you need to specify an encoding. You just have to match round
pegs to round holes and square pegs to square holes as you stack them.
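For comparison only, here is how that kind of stacking looks with Python's standard io module. This is a sketch of the idea, not a proposal for Parrot's IO API, and the variable names are hypothetical. The bottom of the stack yields bytes, a buffering layer is byte-byte, and the string-byte layer on top is where the encoding has to be specified:

```python
import io

# Stand-in for the OS-level bottom layer: its "top" side works in bytes.
payload = "na\u00efve\n".encode("utf-8")

# Byte-wise IO: bottom layer plus a byte-byte buffering layer.
byte_stack = io.BufferedReader(io.BytesIO(payload))
raw_bytes = byte_stack.read()

# String IO: the same two layers, plus a string-byte layer on top,
# which cannot be constructed sensibly without naming an encoding.
text_stack = io.TextIOWrapper(
    io.BufferedReader(io.BytesIO(payload)), encoding="utf-8"
)
line = text_stack.readline()

print(type(raw_bytes).__name__, type(line).__name__)
```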
BTW, I've seen the phrase "IO filter" a few places, but not a
definition. What are these supposed to be?
JEff
On Apr 28, 2004, at 5:05 PM, Larry Wall wrote:
> On Wed, Apr 28, 2004 at 03:30:07PM -0700, Jeff Clites wrote:
> : Outside. Conceptually, JPEG isn't a string any more than an XML
> : document is an MP3.
>
> I'm not vehemently opposed to redefining the meaning of "string"
> this way, but I would like to point out that the term used to have
> a more general meaning. Witness terms like "bit string".
Good point. However, the more general usage seems to have largely
fallen out of use (to the extent to which I'd forgotten about it until
now). For instance, the Java String class lacks this generality.
Additionally, ObjC's NSString and (from what I can tell) Python and
Ruby conceive of strings as textual.
[And of course, it would be permissible in terms of English usage to
say that a bit string isn't a string, much like a fire house isn't a
house, and a suspected criminal isn't necessarily a criminal, and
melted ice isn't ice.]
> : Some languages make this very clear by providing a separate data type
> : to hold a "blob of bytes". Java uses a byte[] for this (an array of
> : bytes), rather than a String. And Objective-C (via the Foundation
> : framework) has an NSData class for this (whereas strings are
> : represented via NSString).
>
> Another approach is to say that (in general) strings are sequences
> of abstract integers, and byte strings (and their ilk) impose size
> constraints, while text strings impose various semantic constraints.
> This is more in line with the historical usage of "string".
Yes, though I think that this diverges from current usage (in general
programming contexts), and more importantly promotes the confusion that
"text" is inherently byte-based (or even, semantically number-based).
The parenthesized point there is that a representation of text as a
sequence of numbers is an implementation detail--it's not inherent in
the notion of text. The semantics of text do not imply that it is a
semantic constraint layered on top of a sequence of numbers. In the
vein of the Perl philosophy of making different things look different,
I think it's important to linguistically distinguish between the two.
Many programming languages do that, and users of those languages suffer
less confusion in this area.
The key point is that text and uninterpreted byte sequences are
semantically oceans apart. I'd say that as data types, byte sequences
are semantically much simpler than hashes (for instance), and
strings-as-text are much more complex. It makes little sense to
bitwise-not text, or to uppercase bytes.
> : (And it implies that you can uppercase a JPEG, for instance).
> : Only some encodings let you get away with this--for example, not every
> : byte sequence is valid UTF-8, so an arbitrary byte blob likely wouldn't
> : decode if you tried to pretend that it was the UTF-8-encoded version of
> : something. The major practical downside of doing something like this is
> : that it leads to confusion, and propagates the viewpoint that a string
> : is just a blob of bytes. And the conceptual downside is that if a
> : string is fundamentally intended to represent textual data, then it
> : doesn't make much sense to use it to represent something non-textual.
>
> I think of a string as a fundamental data type that can be *used* to
> represent text when properly typed. But strings are more fundamental
> than text--you can have a string of tokens, for instance. Just because
> various string types were confused in the past is no reason to settle
> on a single string type as "the only true string". If you can do it,
> fine, but you'll have to come up with a substitute name for the more
> general concept, or you're going to be fighting the culture continually
> from here on out. I don't like culture wars...
I think the more general concept is "array".
The major problem with using "string" for the more general concept is
confusion. People do tend to get really confused here. If you define
"string of blahs" to mean "sequence of blahs" (to match the historical
usage), that's on its face reasonable. But people jump to the
conclusion that a string-as-bytes is re-interpretable as a
string-as-text (and vice-versa) via something like a cast--a
reinterpretation of the bytes of some in-memory representation. As a
general sequence, one wouldn't be tempted to think that a
string-of-quaternions was necessarily re-interpretable as a
string-of-PurchaseOrders. I don't think it's culturally possible to
shake this view of text-is-really-just bytes without using distinct
terminology.
I'm not vehemently opposed to jettisoning the word "string" entirely,
and instead using "Text" and "Sequence" for the above concepts--that's
the usual way to deal with an ambiguous term. But the downside is that
it forms a learning barrier for people coming from other languages. I
think that "string" meaning text, and "array" meaning general sequence
would be the most consistent with current general usage. But my main
concern is that we distinguish between different concepts, by using
different names.
I believe that bringing clarity to this area is crucial.
Since Perl5 doesn't give you a way to manipulate a byte sequence as
anything other than a string, I think it's an open question whether
current Perl users really think in terms of a generalized string, or
whether they've just not been given the tools to distinguish. It would
be interesting to know whether many programmers, faced with the
question "what's a string", would provide an answer which isn't
text-centric.
JEff
As a VM for multiple languages, Parrot must be more general than
any one of those languages, though, yes?
> The key point is that text and uninterpreted byte sequences are
> semantically oceans apart. I'd say that as data types, byte sequences
> are semantically much simpler than hashes (for instance), and
> strings-as-text are much more complex. It makes little sense to
> bitwise-not text, or to uppercase bytes.
If your "text" is taken from a size-two character set, it makes
perfect sense to complement (bitwise-not) it. Bit strings and text
strings are oceans apart like Alaska and Russia.
> The major problem with using "string" for the more general concept is
> confusion. People do tend to get really confused here. If you define
> "string of blahs" to mean "sequence of blahs" (to match the historical
> usage), that's on its face reasonable. But people jump to the
> conclusion that a string-as-bytes is re-interpretable as a
> string-as-text (and vice-versa) via something like a cast--a
> reinterpretation of the bytes of some in-memory representation.
It is thus reinterpretable---via (de-)serialization. Take a "text"
string, serialize it in memory as UTF-8, say, to get a bit string, and
do ands ors and nots to your heart's content. If the in-memory
representation is already UTF-8, the serialization is nothing more than
changing the string's charset+encoding to "binary". Compilers for
languages like Perl 5, which treat strings as text or bits depending on
the operation being performed, can insert the
serialization/deserialization ops automatically as needed.
>>>> Jeff Clites <jcl...@mac.com> 2004-05-01 18:23:02 >>>
> [Finishing this discussion on p6i, since it began here.]
>> Good point. However, the more general usage seems to have largely
>> fallen out of use (to the extent to which I'd forgotten about it until
>> now). For instance, the Java String class lacks this generality.
>> Additionally, ObjC's NSString and (from what I can tell) Python and
>> Ruby conceive of strings as textual.
>
> As a VM for multiple languages, Parrot must be more general than
> any one of those languages, though, yes?
Yes, but not more general than any of them need.
In contrast, parrot doesn't allow for the possibility that INTVALs
could represent complex numbers (3 + 2i), for instance--if a language
wants those as its numerical primitives, it would layer them on top of
INTVALs, possibly--but in fact no language wants that anyway. That is,
parrot shouldn't try to provide a generality which can be built on top
of something simpler, or which no language actually needs.
But so far (in this thread), I've been talking about the general
concept--it was leading to recommendations for Parrot, but is turning
into just recommendations for Perl6, probably.
I do think, though, that in practical terms languages that exist today
fall into 2 categories: those which handle international text, and
those which don't. Those which do, do so uniformly--they have one or
maybe two internal representations for text, and they don't try to
model it as "bytes in an interchange format plus indication of encoding
plus something else". Languages in the latter category almost always
think of text as ASCII, and model it as a buffer of bytes, and they
also allow bytes with values > 127, but tend to leave them
uninterpreted (that is, under things like case transformation).
It's pretty straightforward to programmatically model a string in
a manner which closely matches what the more text-sophisticated languages
expect. For the other category, you can either (a) enforce simpler
semantics at the PMC layer, or (b) gift them with full international
text handling. I don't know for a fact that the Ruby community, for
instance, wouldn't be thrilled to get full international text handling
for free. I don't think we've asked.
If we're going to generalize an implementation, we'd best look around
and find out the appropriate direction for generalization, rather than
just guessing. I think Parrot's trying to overdo it.
>> The key point is that text and uninterpreted byte sequences are
>> semantically oceans apart. I'd say that as data types, byte sequences
>> are semantically much simpler than hashes (for instance), and
>> strings-as-text are much more complex. It makes little sense to
>> bitwise-not text, or to uppercase bytes.
>
> If your "text" is taken from a size-two character set, it makes
> perfect sense to complement (bitwise-not) it. Bit strings and text
> strings are oceans apart like Alaska and Russia.
Yes and no.
In practical terms, if you bitwise-not some UTF-16 bytes (as bytes or
16-bit ints), you'll end up with bytes which don't represent any
characters at all. (Because all characters in the Unicode repertoire
have values < 2^21, and once you bitwise-not they'll all be >= 2^31.)
And UTF-8 and UTF-32 suffer similar problems, and probably Shift-JIS
too. You can only get away with this sort of thing if you are thinking
in terms of encodings in which any byte sequence is interpretable in
that encoding, which is only some encodings (ISO-8859-* fall into this
category, for instance).
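A hedged sketch of the UTF-8 case in Python (illustration only; the names are hypothetical). It serializes some text, bitwise-nots every byte, and then tries to read the result back under two encodings:

```python
text = "Hello"
utf8 = text.encode("utf-8")

# Bitwise-not each byte of the serialized form.
flipped = bytes(b ^ 0xFF for b in utf8)

# The flipped bytes are no longer a valid UTF-8 sequence.
try:
    flipped.decode("utf-8")
    still_utf8 = True
except UnicodeDecodeError:
    still_utf8 = False

# ISO-8859-1 accepts any byte sequence, so it "decodes", though the
# result has no textual relationship to the original.
as_latin1 = flipped.decode("iso-8859-1")

# Flipping again at the byte level restores the original serialization.
restored = bytes(b ^ 0xFF for b in flipped).decode("utf-8")

print(still_utf8, restored == text)
```

The operation is well defined and reversible on the bytes; it just does not stay inside the set of things that are text.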
Bitwise-not-ing is simply not a text operation. Another way to see that
is to remember that the assignment of numbers to characters is
arbitrary, so doing a mathematical transformation based on those
numbers isn't meaningful. Certainly, you can precisely define and
implement such a transformation, but it has no meaning as an operation
_on_text_.
If you take a top-down approach to designing a string API, you won't be
tempted to think of anything as "size two"--all the confusion comes
from working bottom-up. By analogy, look at objects. Objects represent
a certain approach to factoring and organizing computer programs, with
a focus on their behavior, and a hiding of their implementation
details. People like to invent serialization formats for them, so that
they can persist them or send them between processes--to pick an
example, think of an XML-based serialization format. Now, nobody today
is seriously tempted to think of objects as just blobs of XML, plus an
interpretation--to base their in-memory representation on XML, and
create object APIs which reflect this. That would be thinking about
objects the wrong way. Now, imagine if history had gone differently--if
XML had been invented before object-oriented programming. You'd have
all of these documents sitting around holding structured data. When
objects began to materialize, as a concept, undoubtedly they'd arise as
an approach to in-memory manipulation of XML. It would be difficult to
get people thinking about objects top-down--to realize that the concept
had nothing to do with the serialization format.
My claim is that this is what has happened with text. People are locked
into a view based on the numerous interchange formats for text
(encodings). But if you start top-down, you'd never be tempted to think
that when I type an "A" (as I just did), what I _mean_ depends on the
encoding that my email client ends up choosing to send this message.
Quite the contrary--I don't know what encoding (or more generally,
format) it's going to choose. All I care is that it picks one which
will allow my text to get to you without losing anything.
But certainly, people have a strong tendency to think bottom-up. I
believe it's a historical accident, but it may be psychologically
impossible to get some people to think about this area differently. But
I'm completely convinced (having worked with text systems which take a
top-down approach), that thinking top-down makes things much clearer
and much simpler (and actually leads to fewer security-related bugs).
In particular, if your text model is locked in encoding land, then you
force individual programmers to know all of the details of various
encodings in order to work with text. With a top-down approach, they
just need to think in terms of text, and they need to realize that when
reading in some bytes off of disk which are supposed to represent text,
they need to know which format was used by the process which created
the file (but they don't need to know the details of that format).
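To make that last point concrete, a small Python sketch (illustrative only; the names are hypothetical). The programmer never inspects the byte layout of either encoding, but does have to name the right one:

```python
# What the writing process put on disk: "café", serialized as UTF-8.
data = "caf\u00e9".encode("utf-8")

# Naming the format that was actually used recovers the text.
right = data.decode("utf-8")

# Naming a different format also "succeeds", but yields different text.
wrong = data.decode("iso-8859-1")

print(right == "caf\u00e9", len(right), len(wrong))
```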
>> The major problem with using "string" for the more general concept is
>> confusion. People do tend to get really confused here. If you define
>> "string of blahs" to mean "sequence of blahs" (to match the historical
>> usage), that's on its face reasonable. But people jump to the
>> conclusion that a string-as-bytes is re-interpretable as a
>> string-as-text (and vice-versa) via something like a cast--a
>> reinterpretation of the bytes of some in-memory representation.
>
> It is thus reinterpretable---via (de-)serialization. Take a "text"
> string, serialize it in memory as UTF-8, say, to get a bit string, and
> do ands ors and nots to your heart's content.
Yes, precisely--you can serialize a string into a bag of bytes, and do
whatever you want with that. Similarly, you can serialize an object
using some serialization scheme (Data::Dumper? XML? Perl's freeze/thaw?
Python's pickle? Parrot's freeze?), and manipulate the bytes of that.
But you wouldn't be tempted to want to bitwise-not the raw memory
locations which implement an object. You'd never think to do _that_,
and if someone suggested it you'd immediately think of two problems:
(1) it's not semantically meaningful (the internal representation is
supposed to be opaque, and changeable without changing the visible
behavior), and (2) if you did that, you'd likely end up with something
which no longer represented an object--you'd just get junk.
> If the in-memory representation is already UTF-8, the serialization is
> nothing more than changing the string's charset+encoding to "binary".
Yes, and in light of the above, I have 2 problems with that: (1) It's
basing externally-visible behavior on an internal implementation
detail, and (2) people seem to be forgetting that if you do that, you
won't be able to "go back"--that is, bitwise-not-ing a "UTF-8 string"
won't leave you with something interpretable as UTF-8. I don't think
that people would expect to bitwise-not some text, and have the result
be non-textual. (But then, I don't believe someone would ever seriously
want to bitwise-not some _text_.)
> Compilers for languages like Perl 5, which treat strings as text or
> bits depending on the operation being performed, can insert the
> serialization/deserialization ops automatically as needed.
Yes, that's fine in terms of convenience, but leads people to think of
text _as_ bytes. That's going semantically in the wrong direction. And
witness the confusion that brings (i.e., this whole thread). If you
keep them clearly separate (text operations on text, binary operations
on binary data, some operations on both), things are clearer.
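As one existing example of keeping them separate, Python 3 has distinct str and bytes types. This is a sketch of how that separation shows up, not a claim about how Parrot or Perl 6 should spell it; the helper name is made up for the example:

```python
def raises_type_error(operation):
    """Return True if calling operation() raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False

text = "abc"    # text
data = b"abc"   # uninterpreted bytes

# Mixing the two without an explicit encode/decode step is an error.
mixing_fails = raises_type_error(lambda: text + data)

# Bitwise-not is simply not defined on text.
not_on_text_fails = raises_type_error(lambda: ~text)

# On bytes, a bitwise operation is an ordinary per-byte computation.
inverted = bytes(b ^ 0xFF for b in data)

print(mixing_fails, not_on_text_fails, inverted)
```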
JEff
> All in all, very well written.
Thanks.
> I do, of course, have a few quibbles:
>
> On Wed, Apr 28, 2004 at 04:22:07AM -0700, Jeff Clites wrote:
> : As it turns out, people find it convenient to programmatically
> : represent a character by an integer (think "whole number", not a
> : specific data type here).
>
> After being so careful to define "character" abstractly, this whole
> passage misleads the reader into believing that any such abstract
> character can be represented by a single integer (code point).
> Only a subset of characters can be represented by a single code point.
> Many characters require multiple code points.
No, this is 100% intentional, and not meant to be a pedagogical fib,
actually. I'm saying that an entry in the big table in the Unicode
Standard is describing a character. This is in fact consistent with the
definition of "abstract character" put forth in the Unicode Standard.
They make the careful point that a "character" isn't necessarily what a
naive language user would see as a character (and they call the latter
a "grapheme"). That said, I think that this concept of a character
really is in fact trying to capture an intuitive, natural-language
concept. But the Unicode character repertoire is a product of
compromises, and of the desire to maintain backward compatibility with
previous standards. For instance, there's a separate entry for the
Angstrom Sign v. Latin Capital Letter A with Ring Above. Ideally, these
wouldn't be distinguished. They are, because they are in Shift-JIS, and
it was desirable to be able to round-trip between important national
standards and Unicode-defined encodings. And even conceptually that's
not that bad--presumably, Shift-JIS distinguished between the two
because someone thought of them as semantically different. But examples
such as this don't imply that a code point is trying to pick out a
different concept than a character, but just that in some cases things
may not mesh with a particular person's intuition.
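The Angstrom Sign case can be checked directly. The following is a small sketch using Python's standard unicodedata module (the variable names are just for illustration): the two entries are distinct characters with distinct code points, yet normalization treats them as canonically equivalent.

```python
import unicodedata

angstrom = "\u212b"   # ANGSTROM SIGN
a_ring = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE

# Two separate entries in the table, each with its own name.
names = (unicodedata.name(angstrom), unicodedata.name(a_ring))

# As characters they are different...
distinct = angstrom != a_ring

# ...but NFC normalization maps the Angstrom Sign onto the letter.
same_after_nfc = unicodedata.normalize("NFC", angstrom) == a_ring
```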
Also, I don't have a problem with the formalization of a concept
breaking with its informal usage in spots. For instance, floating point
numbers and integers are supposed to model the mathematical notion of a
number, but the former are bounded in range and precision, whereas the
latter concept is not. But that's just a practical shortcoming--not a
whole separate concept.
> I see this as a critical
> point--it's at the one-to-many interfaces that things tend to break,
> and that's precisely why Perl 6 has the four abstraction levels it
> does:
>
> Level 0: bytes
> Level 1: codepoints don't fit into bytes
> Level 2: graphemes don't fit into codepoints
> Level 3: characters don't fit into graphemes
>
> (where I've used the term "characters" in the language-sensitive
> sense.)
I see there as ideally being just two levels:
First off, I'd say bytes don't have anything to do with a (text)
string, as an in-memory data type. You can serialize a string into
bytes, but you can serialize a hash into bytes as well. It's not
productive to have a byte-based view of either. You can't ask, "what's
the first byte of this string", any more than you can ask "what's the
first byte of this hash". But you can ask, "if this string were
serialized using the UTF-8 encoding, what would the first byte of the
result be", just as you could ask, "if this hash were serialized using
Data::Dumper, what would the first byte of the result be".
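A short sketch of that question, in Python for illustration only: the "first byte" is a property of a particular serialization, not of the string, so the answer changes with the encoding you ask about.

```python
s = "\u00e9"  # a one-character string: e with acute accent

# Ask the same question of three different serializations.
first_bytes = {
    encoding: s.encode(encoding)[0]
    for encoding in ("utf-8", "latin-1", "utf-16-be")
}
# utf-8     -> 0xC3 (first of two bytes, C3 A9)
# latin-1   -> 0xE9 (the only byte)
# utf-16-be -> 0x00 (first of two bytes, 00 E9)
```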
So I worry about your level 0, because it promotes the idea that a
string is semantically byte-based. Even more importantly, if a
byte-based operation is intended to on-the-fly serialize a string using
some default encoding, manipulate the result, and then re-create a
string from those bytes, then in the UTF-8 case you're quite often
going to end up with a byte sequence which can't be de-serialized into
a string. So you'll just end up shredding your string if you try to do
byte-based regex replacements.
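Here is a sketch of that shredding, using Python's re module on a bytes value (the sample text and pattern are invented for the example). The "." in a byte-based pattern matches one byte, which is only half of the two-byte UTF-8 sequence for the accented letter.

```python
import re

text = "caf\u00e9"                    # four characters
serialized = text.encode("utf-8")     # five bytes: b'caf\xc3\xa9'

# Byte-based replacement: "." consumes only the 0xC3 lead byte.
shredded = re.sub(rb"caf.", b"tea", serialized)   # b'tea\xa9'

try:
    shredded.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    # 0xA9 is now a continuation byte with nothing to continue.
    decodable = False

# The same pattern applied to the string itself behaves as intended.
replaced_as_text = re.sub(r"caf.", "tea", text)
```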
(That said, in the approach I'm pushing for of having a separate data
type to hold "raw bytes", perhaps ByteArray, it makes perfect sense to
allow some regexes to be applied to that--essentially searching for
certain byte sequences.)
Level 1 is what I'm calling characters--no problem there semantically.
Level 2 is graphemes--sequences of characters. Okay-ish.
Level 3 I don't think should be a different level, though I'm not
certain I 100% understand what you have in mind. To my way of thinking,
a grapheme is basically that which a language user would think of as a
"character". The Unicode Standard defines a language-agnostic concept
of a grapheme (sort of a general consensus across languages), plus the
concept of language-specific refinements. So I'd say that picking a
language lets you refine what sequences count as graphemes, but doesn't
pick out an entire separate concept.
So it feels to me like there should be per-character and per-grapheme
operations--like we just need two levels. But I need to give this area
some more thought--there's something a bit slippery about counting
graphemes.
> Not making this distinction also causes you to leave out a level
> of collation:
>
> Level 0: binary sorting
This binary sorting is how you have to sort ByteArrays, but not how
you'd naturally sort strings.
> Level 1: codepoint sorting
And I think you could have variations here--sorting by numerical code
point order, and sorting as though your strings were in various
normalization forms (C, D, KC, KD).
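As a rough illustration of those variations, the sketch below (Python, with an invented three-word list) sorts the same strings by raw code point order, and then as though they were in normalization forms C and D. The flavor is chosen per sort, via the key.

```python
import unicodedata

# Precomposed e-acute + "a", decomposed e + combining acute + "b", and "f".
words = ["\u00e9a", "e\u0301b", "f"]

# Plain numerical code point order of the strings as given.
by_code_point = sorted(words)

# Code point order, comparing as though every string were in NFC / NFD.
as_nfc = sorted(words, key=lambda w: unicodedata.normalize("NFC", w))
as_nfd = sorted(words, key=lambda w: unicodedata.normalize("NFD", w))
```

With this particular list the three sorts produce three different orders, even though all of them are "codepoint sorting" in some sense.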
> Level 2: language-independent grapheme sorting (UCA)
> Level 3: UCA plus tailorings
UCA and UCA-plus tailorings are two choices, but I don't think they are
really two "levels" of sorting.
I think the sorting choices are very much like sorting using different
comparison operators--not inherently "levels", but really chosen on a
per-sort basis. All-in-all, sorting seems more straightforward than
regex matches, since we already had the concept of there being
different sort flavors.
I think we should beware of being overly-contextual--that is, of
thinking of these things as "levels" or "modes", to be specified via
"use" directives, rather than clearly specified on a per-operation
basis. That is, ideally I could look at a line of code and know what it
would do, without having to look higher in the code for "use"
directives. (But specifying the "mode" as part of a regex fragment
would work nicely, if that's the basic idea.)
> : It's convenient for several reasons--it's compact and easy to
> : refer to in speech. And if the fundamental thing you can ask a string
> : is what its Nth character is, then the fundamental things you do
> : with a character is look up its properties, and test it for equality
> : against other characters. So if you just go through and give each
> : character a little serial number, then you can find the properties
> : of a character by using its number as an index into a property table
> : (i.e., character 3's properties are at slot number 3 in the table),
> : and you can tell that 2 characters are different characters by
> : checking whether they are represented by different numbers.
>
> But this is really only true of codepoints, not of graphemes or
> characters.
The key here is that your usage of these terms diverges from mine, and
from the Unicode Standard's. To me (and to the Unicode Consortium, by
my reading of the standard), a "code point" is a number representing an
(abstract) character. A grapheme is a sequence of one or more abstract
characters, intended to correspond to some natural-language concept of
a single unit of text.
> I realize that oversimplifying is a useful pedagogical
> technique, but when you do that you ought to "unlie" in the same
> document somewhere. (I'll grant you that you promise to unlie in
> your final paragraph, kinda sorta.)
As I said above, I do actually mean what I was saying literally, though
the parts I haven't gotten to would have hit that point home.
> : Fortunately, the Unicode Standard has numbered *all* of them--it's
> : given a number to essentially every character in every
> : digitally-represented langauge in the world.
>
> Um, no--not unless you've defined how to multiplex the multiple
> integers of the codepoints in a grapheme into a single integer, and
> I haven't heard that the Unicode consortium has come up with such
> a definition.
No, exactly--they've numbered characters, not graphemes. This point in
fact brings up a worry I have about your grapheme-level of semantics. I
think it makes perfect sense to say that two strings are graphemically
equivalent (despite being composed of different characters), but it
gets dicey to say they're made of the "same graphemes". Saying that
implies that you have a data type to uniquely represent "a grapheme",
and there isn't a convenient data type for that (unless someone goes
and numbers all possible ones).
> : So, let's review again. For various practical reasons, it's
> : preferable to programatically represent characters using integers,
> : you have to pick an arbitrary numbering scheme, and somebody's done
> : that, and it's a good one. This numbering scheme defines a one-to-one
> : correspondence between numbers (code points) and characters,
>
> There you go again. You need to settle on one definition of character
> or the other.
Yep, I did, but I think your brain is rejecting it because it's not the
definition you expected.
I'm very much avoiding a fuzzy definition of character as something
like, "what a user would generally see as a single thing".
> I kind of like the abstract definition, but that's not how you're
> using it here.
The definition is that someone went through the abstract choices and
made some specific judgment calls, and decided what abstract notions to
distinguish between, and what not to. Then, they recorded their
decisions in a big table.
To use my example from before, I don't think one can give a
knock-down-drag-out argument that "a" and "A" are not just stylistic
variants of the same characters, and are in fact two different
characters. You could certainly think of them that way. But, once
you've made a decision one way or the other about how to look at them,
you've laid down part of a precise definition of a general, fuzzy
concept.
> : and that makes it
> : tempting to pretend that characters *are* numbers. But it's important
> : to keep in the back of your mind an awareness that the numbers merely
> : help you pick out the characters, and it's the characters themselves
> : which are important, and characters are *abstract*--they never
> : actually live inside of a computer program.
>
> Cain't have it both ways...
It's an isomorphism. Someone picked the list of characters, then
numbered them. It doesn't matter that "A" was given number 65, but
since it was, it's unambiguous to say, "the character to which Unicode
gives the number 65", or even just the shorter "character 65". But I
think in general it's important to remember, even when speaking like
that, that a code point is literally a number, which represents
something non-numerical (a character).
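In Python, for instance, the two directions of that correspondence are the built-ins ord and chr, and the number is what you use to look up a character's properties. This is only a sketch of the idea, not a claim about how any implementation stores its tables.

```python
import unicodedata

code_point = ord("A")      # character -> number
character = chr(65)        # number -> character

# The number is a handle for looking up the character's properties.
name = unicodedata.name(character)
category = unicodedata.category(character)   # "Lu": uppercase letter
```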
> : [Note: Of course, some numbers don't
> : represent any character--there are only so many characters. So to be
> : mathematically precise, there's a one-to-one correspondence between a
> : subset of integers and all characters.]
>
> And many characters are not represented by any integer, but by a
> sequence of integers.
Not in my definition, or in Unicode's. I'd state this as, "Many
graphemes are not represented by any integer, but by a sequence of
integers".
For instance, see <http://www.unicode.org/reports/tr29/>, which states:
One or more Unicode characters may make up what the user thinks of
as a character or basic unit of the language. To avoid ambiguity
with the computer use of the term character, this is called a
grapheme cluster. For example, “G” + acute-accent is a grapheme
cluster: it is thought of as a single character by users, yet is
actually represented by two Unicode code points.
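The quoted example can be sketched as follows in Python. Note the loud assumption: Python's standard library has no grapheme cluster segmentation, so naive_clusters below is a made-up, crude approximation that only attaches combining marks to the preceding character. It is not the algorithm from the report cited above.

```python
import unicodedata

s = "G\u0301"   # "G" followed by COMBINING ACUTE ACCENT

# Indexing and length are per character (code point): two of them here.
code_points = [ord(c) for c in s]


def naive_clusters(text):
    """Group each combining mark with the character before it.

    Hypothetical helper: a rough stand-in for grapheme cluster
    boundaries, ignoring most of the real segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


clusters = naive_clusters(s + "x")   # one cluster for G+accent, one for "x"
```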
> : Also, importantly, a grapheme cluster is a notion built on top of
> : characters (it's a cluster of characters), and choosing a langauge
> : lets you refine how you break up a string into grapheme clusters, but
> : it's just a refinement--"adding a language into the mix" doesn't pick
> : out a different semantic construct, it just help you customize your
> : choice of what ranges make up single graphemes.
>
> I'd say a grapheme cluster functions as a "character" by your original
> definition, so this is another case where you're using "character"
> to mean something less than that. Also the last sentence seems to
> be calling a grapheme cluster a grapheme, which is confusing. A
> grapheme cluster is a cluster of graphemes, kinda by definition...
Nope, they're synonyms--a "grapheme cluster" is a preferred term for
this usage of "grapheme", to distinguish it from the linguistic usage.
For instance, the above-referenced document states:
In previous documentation, default grapheme clusters were previously
referred to as “locale-independent graphemes”. The term cluster has
been added to emphasize that the term grapheme as used differently
in linguistics.
I assume what they had in mind was something akin to, "a
grapheme-forming cluster (of characters)". But it is a bit
muddled--their usage is not entirely consistent.
(And shame on the Unicode Consortium for picking this term for this
concept, and then disambiguating in a confusing way.)
My goal in all of this is to provide concrete rather than fuzzy
definitions where at all possible. The one-to-one mapping between code
points and characters makes a lot of sense to me--not only do I believe
that is what the Unicode Consortium intended (despite compromises), but
if one tries to say "no no, code points don't correspond to characters,
groups of them do", then you're left wondering just what sort of thing
the Unicode Consortium went through and numbered.
And I will have some more to say about why it doesn't really bother me
that <o-with-acute-accent> and <o, combining-acute-accent> are two
_different_ character sequences which are graphemically equivalent (or,
which are unequal but equivalent under various notions of equivalence).
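A minimal sketch of "unequal but equivalent", in Python. The function name canonically_equivalent is invented here, and comparing NFD forms is just one reasonable way to test canonical equivalence.

```python
import unicodedata

precomposed = "\u00f3"    # o-with-acute-accent, one character
decomposed = "o\u0301"    # o, combining-acute-accent, two characters


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after putting both in the same normal form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


equal_as_sequences = precomposed == decomposed
equivalent = canonically_equivalent(precomposed, decomposed)
```

Plain equality compares character sequences and says no; the equivalence test is a separate, explicitly requested operation and says yes.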
> : I haven't yet covered a few important topics, such as different
> : character sequences representing equivalent graphemes, canonical and
>
> s/character/codepoint/
>
> : compatability equivalence, and Unicode normalization forms. I also
> : haven't said anything yet about concrete implementation or API
> : guidelines.
>
> I await your coverage of those topics with interest.
Thanks. I need to write that up soon; I suppose I'll post it to p6l, as
that seems more appropriate at this point.
JEff
Overall, I agree with, and like, your description of text handling.
However, I believe that there is one point at which your text vs
objects analogy breaks down. Unlike objects, text has a *literal*
syntax in most languages. So, for example, in Python I can say
str = u"A"
to cause str to contain a single character, the capital letter A.
However, this statement is only true because my program code (which is
text) contains a character A. And in practice, program source is
*serialised* text, with all of the encoding issues that implies. (For
a stronger illustration of this, consider the case where that
character was A with an acute accent, and the source file was Latin-1
encoded). Language compilers tend to avoid the issue of defining a
means of specifying the encoding of program source files. Often, this
is done by sticking to ASCII and assuming byte <-> character
equivalences, which is OK until someone puts a non-ASCII byte (which
represents a particular character in *their* preferred encoding) into
a string literal.
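To illustrate the non-ASCII case, here is a sketch in Python that treats a line of source as the bytes it really is on disk (the byte string below is constructed for the example). The same byte inside the literal becomes a different character, or no character at all, depending on which encoding the reader assumes.

```python
# A source line as stored on disk: the literal holds the single byte 0xC1.
source_bytes = b'str = u"\xc1"'

as_latin1 = source_bytes.decode("latin-1")   # literal holds A with acute
as_cp1251 = source_bytes.decode("cp1251")    # literal holds Cyrillic BE

try:
    source_bytes.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    # 0xC1 can never start a valid UTF-8 sequence.
    readable_as_utf8 = False

# The character that ends up inside the quotes under each assumption.
literal_chars = (as_latin1[-2], as_cp1251[-2])
```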
I haven't managed to be as clear and precise as I'd like in the above
paragraph, but I hope you get my point. Encodings are messy, and
string literals force programmers to deal with encodings in a way
which obfuscates an otherwise clean "characters are abstract" model.
Paul.