In light of ongoing discussions of Parrot's string model, I've decided
to prepare a document spelling out my general viewpoint on the subject
of strings. It's intended also to supply a self-consistent set of
terminology. It will frame my future comments. Some notes:
1) This is explicitly arguing for one viewpoint, not against another.
(That is, I've not included any such arguments.)
2) I've written it in a manner which doesn't directly mention Parrot,
so that I can use it in other contexts.
3) There are two separate issues with respect to Parrot: the strings
semantics, and the internal representation/implementation. This
addresses only the former. It also doesn't include an API
recommendation. This is because the semantics need to be sorted out
first.
4) There are a few topics yet-to-be-covered.
5) I should probably provide a (much shorter) synopsis.
----------------------------------------------------------------------
What's a String?
This article is intended to clarify some concepts related to strings.
The motivation is that they are a frequent source of confusion for
developers, working in various programming language. This lays out a
framework for thinking about strings--a viewpoint, if you will. It
reflects a conceptual model that's already embraced by some programming
environments, but even those environments often lack a detailed
explanation. This document also aims to disambiguate various
terms--clarifying some and providing self-consistent definitions of
others which are used inconsistently in different contexts.
So, what is a string?
In the most general sense, a string is a data type in a programming
language, meant to model text. It's meant to allow manipulation,
viewing, analysis, searching, of the stuff you read, or which computer
programs read on your behalf.
A bit more concretely:
Definition: A string represents a sequence of abstract characters.
So we can't go much further until we say a few words about what a
"character" is. But the two important words to remember from that
definition are "represents" and "characters". (And, "sequence" is
important too, but you won't overlook that one.)
What's a character?
A character, in fuzzy terms, is the smallest unit of meaning in a
language.
A character is an abstract concept. Characters are things like letters,
numbers, punctuation marks, the Japanese symbol for "middle"--stuff
like that. The concept of a character is supposed to capture an
intuitive notion--the notion ususally understood by the common language
speaker. (We are, after all, modelling what's essentially text--the
stuff that humans read. That's the whole point.) That said, precisely
determining what constitutes a character, and what doesn't, is
essentially a judgement call, and a decision. Fortunately, essentially
all existing standards agree on this, in a general sense. For instance,
the lowercase "a" and the uppercase "A" are considered to be different
characters, but stylistic variants (the letter "a" in different fonts,
for instance) are not (instead, they're just different graphical
representations of the same abstract character). These could have been
modelled differently--you could consider "a" and "A" just variants of
the same character. But, as it turns out, no one does it that way.
And I mention that now to prepare you for some arbitrariness--some
conventions that could have gone another way, but which are actually
still intuitively meaningful.
Notice that we haven't said anything about how a computer program (or
language) might model a character. That's important. So far we've just
said *what* a computer program is trying to model, not how it might do
it.
And also, just to hit the point home, we're not talking about the
"char" datatype in C. To keep things clear, we'll refer to that as a
"byte", when it comes up.
What is Unicode?
At this point I'd better explain what "Unicode" means, before I slip
and use the word. People like to throw the word around a lot, and it's
not always clear which of the following related (but different) things
they mean. There are at least 6 different possibilities:
1) The Unicode Consortium, a group of companies and organizations that
got together to sort out the handling of international text. ("Sort
out" in the sense of making it trivial to allow production of text
documents containing a mixture of different langauges, rather than
nearly impossible.)
2) The standard produced by that group, currently at version 4.0.
3) The character model which forms the basis of that standard, and
which also forms the basis of what I'm talking about here.
4) The database of character properties (or the properties themselves)
assembled by the Unicode Consortium. These properties are things like,
"this character has an uppercase version over here", "this character
represents the decimal digit 3", etc.
5) A family of encodings (word defined below) used for the interchange
of international text between computer programs (by mean of files or
network transmissions, for instance). Often, "Unicode" is used to refer
to a specific one; that usage is inherently ambiguous, but harkens back
to the first version of the Unicode standard, which defined only one
encoding.
6) International (multi-language) text handling, in general. This usage
arises because internaltionalization-aware applications tend to base
their implementations on the Unicode character model, and tend to
support the Unicode encodings. But it's important to recognize this
usage, because sometimes the term "Unicode" is used in contexts which
don't involve any of the above per se, but rather just indicate an
awareness and handling of international text issues.
Of these, really (3) is the only one which someone could really
"disagree" with. (That is, the consortium, standard, database, and
encodings exist and are precisely defined--there's nothing
philosophical there to reject.) But the other important point is that
these are all separate things--sometimes people will cite the need to
support other encodings as an argument against the character model,
which doesn't follow logically.
So the string/character model I'm espousing is based largely on the
Unicode Character Model, but it would be misleading to say that I'm
advocating, "use Unicode". (What would that mean, anyway?) But it's (3)
that does the conceptual heavy lifting, and (4) which does the grunt
work. Or, (3) is the concept, and (4) is the details.
I'll try to avoid using the word "Unicode" by itself, to avoid
confusion, but if I do, then I'm probably referring to the Unicode
Standard, or something it specifies.
So let's review. A string is a concrete thing--a data type in a
language; it represents a sequence of (abstract) characters, a
character is a unit of meaning in a language, and a character has
properties which have been collected into a database. To go a bit
further, the fundamental question you can ask a string is, "what's your
Nth character". All else flows from that.
What's a code point?
So if the fundamental question you can ask a string is, "what's your
Nth character", we have to talk a bit about how one would
programmatically express that answer--that is, what sort of data type
is/can/should be use to represent a character? Fundamentally, a
character can of course be represented opaquely--in an object-oriented
language, you could have character object. That's quite reasonable. As
it turns out, people find it convenient to programmatically represent a
character by an integer (think "whole number", not a specific data type
here). It's convenient for several reasons--it's compact and easy to
refer to in speech. And if the fundamental thing you can ask a string
is what its Nth character is, then the fundamental things you do with a
character is look up its properties, and test it for equality against
other characters. So if you just go through and give each character a
little serial number, then you can find the properties of a character
by using its number as an index into a property table (i.e., character
3's properties are at slot number 3 in the table), and you can tell
that 2 characters are different characters by checking whether they are
represented by different numbers.
A number used to represent a character like this is called a "code
point". Which number you choose to assign to which character is
fundamentally arbitrary--its one and only use is as a unique lookup
key. The letter "A" could be assigned the number 1, or 42, or 31337, or
2001, or anything else. It doesn't matter. Now, various standards have
numbered various characters in various ways, though many don't actually
think in these terms, so to be clear if you are going to refer to
"character number 42" or "the code point for the letter 'A'", then you
really need to point out what numbering scheme you are talking about.
Fortunately, the Unicode Standard has numbered *all* of them--it's
given a number to essentially every character in every
digitally-represented langauge in the world. Since its numbering scheme
is comprehensive, and since the numbering scheme is arbitrary, without
loss of generality we are going to use the Unicode standard's numbering
scheme for the rest of this discussion, and we'll go ahead and pretend
it's the only one. So when we say, "the code point for the letter 'A'",
we'll mean the code point that the Unicode Standard assigns to it,
which (in case you are wondering by now) is 65. (See, the number itself
is not very interesting!)
So, let's review again. For various practical reasons, it's preferable
to programatically represent characters using integers, you have to
pick an arbitrary numbering scheme, and somebody's done that, and it's
a good one. This numbering scheme defines a one-to-one correspondence
between numbers (code points) and characters, and that makes it
tempting to pretend that characters *are* numbers. But it's important
to keep in the back of your mind an awareness that the numbers merely
help you pick out the characters, and it's the characters themselves
which are important, and characters are *abstract*--they never actually
live inside of a computer program. [Note: Of course, some numbers don't
represent any character--there are only so many characters. So to be
mathematically precise, there's a one-to-one correspondence between a
subset of integers and all characters.]
What's an encoding
So, what we've convered so far is the type of stuff you need to work
with text in-memory. We can use the number 65 to represent the letter
"A", and we have a properties database to tell us whether A is a
number, or whitespace, or whatever.
But, text being what it is, people tend to want to save it to disk, or
transfer it between applications. That is, they want to do IO
operations on strings.
Now as it turns out, IO is alway an operation which involves bytes.
Even if your IO is hidden under an abstraction layer, at the bottom
you're always reading and writing bytes. Strings are not,
fundamentally, bytes. So how do you create bytes from some high-level
construct like a string? You define a serialization algorithm, that's
how. That's an encoding:
Definition: An encoding is a mapping from sequences of abstract
character to sequences of bytes (and vice versa).
Since we've said that a string is supposed to represent a sequence of
abstract characters, we could also just say, "An encoding is a mapping
from strings to sequences of bytes (and vice versa)."
*Important Note*: This is, in particular, an area where there is much
conflicting and inconsistent terminology. The concept that I just
defined is the exact same thing that the Unicode Standard refers to as
a "Character Map". That would probably be the preferred term, since I
don't believe it's been used elsewhere with a differing meaning. But,
it's not a commonly used term, so I'm going to use "encoding" instead,
with some hesitation, so that someone jumping into the middle of this
discussion has a chance of figuring out what I'm talking about. My
usage of "encoding" matches the usage of the XML specification, and the
Objective-C Foundation framework. Java and MIME use the word "charset"
for the same concept, and IANA uses "character set". Others use the
term "character set" for something different. But the important thing
to remember is just that, for the purposes of this document, I'm using
the definition above, and when you are reading elsewhere and encounter
any of those terms, you should look for a definition of what is meant
there.
The absolutely crucial thing the remember about an encoding is that it
defines an interchange format--that's 100% what it's for. It's purpose
is to let you communicate textual data between processes (or, store it
to the filesystem for later retrieval by some process). Encodings have
nothing, fundamentally, to do with any sort of in-memory representation
of a string.
The other thing to remember about an encoding is that there are lots of
them--lots and lots of national and industry standard define lots of
different ones. They tend to have names like, "ASCII" or "ISO-8859-1"
or "Shift-JIS" or "Big5" or "UTF-8".
The reason there are so many of them is that, in the early days of
computing, most everyone got it into their heads that they needed to
represent textual data using only one byte per character, which meant
that an encoding could only handle 256 different characters. This was
fine for English (and in fact ASCII only encodes 128 characters), but
as soon as you hit Europe you needed more characters--for French and
German and Greek and Icelandic and Polish and Turkish and so on. You
need more than 256 characters to handle all of that, so different
encodings where invented to handle different collections of characters.
This doesn't sound so bad, and it's fine as long as everyone just talks
to the guy next door, but it quickly turns into a mess if you try to
step outside--you can't create a single text document with words from
multiple languages (for some combinations), and (potentially worse)
when you read in a file created by someone else, you need to know what
encoding they used (that is, you need metadata, of the sort supplied by
MIME and HTTP headers, but not supplied by most filesystems). This
wasn't fun for anyone.
The obvious thing to do, in retropect, was to come up with a different
plan, one which would allow you to handle all languages without a
separate song-and-dance for each one. That's in fact what the Unicode
Consortium came together to do. And they did it. It was an enormous
undertaking.
Okay, back to the point. Not only are there lots of different
encodings, but by their nature most only know how to encode strings
containing a limited collection of characters. ASCII, for example,
doesn't know how to encode Japanese Kanji. But the Unicode Standard
defines a collection of encodings (UTF-8, UTF-16, and UTF-32, with big-
and little-endian variant of the latter two) which can encode strings
containing basically any character.
It's important to take a moment to point out how general the above
definition of "encoding" is, and what it doesn't imply:
1) As mentioned, an encoding doesn't have to know how to encode every
possible string. And reciprocally, not every sequence of bytes will be
decodable by every encoding.
2) Although an encoding is really a pair of algorithms (one to
serialize, one to de-serialize), that doesn't imply that an encoding
has to be strictly invertible. That is, if I encode a string into a
sequence of bytes, and then decode that sequence of bytes into a
string, I might not get back the same string I started with. For
example, and encoding might, conceptually, strip accent characters.
3) An encoding doesn't necessarily operate on a character-by-character
basis. That is, the bytes to encode "ab" might not be the concatenation
of the bytes to encode "a" with the bytes to encode "b". It's even
possible, based on the above definition, that an encoding might be able
to encode the string "ab", but not the string "ba". (In practice, I
don't know if this ever occurs "in the wild".)
So the key points here are: (1) an encoding is all about IO, (2) not
all encodings guarantee data integrity, and (3) the Unicode encodings
_do_ guarantee data integrity. Another important point is that the
choice of an encoding is a crucial piece of metadata for bytes
undergoing IO, and thus it's the sort of thing which is indicated
explicitly in higher-level protocols and formats (HTTP, MIME, and XML,
to name a few), and it is almost always indicated *by name*. IANA
mantiains a registry of these, and many protocols use the IANA names to
specify encodings.
Another important point: The Unicode Character Model (unlike many other
standards) thinks of the process of going from strings to bytes as a
multi-step process. That's useful from a pedagogical point of view, but
in fact no one much cares about the results of the intermediate steps.
(That is, Unicode describes the process as: characters --> code points
--> code units --> bytes, and I'm saying that the code units aren't
very useful to think about.) The reason for this is very
simple--encodings are only about data interchange, and data interchange
protocols and formats want to specify the encoding with a single name,
which picks out the whole strings-to-bytes mapping. They don't much
care if two different encodings could be thought of as agreeing on one
of those steps, and differing on another. If you're writing a library
to convert between sequences of bytes in different encodings, then this
can also be useful (to minimize code duplication), but it really has
nothing to do with the semantics of a string.
What's a Character Set?
As mentioned above, some people use the term "character set" to mean
what I'm calling "encoding".
In my usage, a "character set" is something simpler--it's a set of
characters, in the mathematical sense of "set". It's an unordered
collection of characters. The letter "A", the comma, and the Japanese
character for "middle"--there, that's a character set. The Unicode
Standard uses the term "abstract character repertoire" for this notion.
That's actually less ambigous terminology, but quite a mouthful. I'll
try to use "repertoire".
A character set, in this sense, isn't a terribly interesting concept,
and it's a good term to stay away from, given the ambiguity. But people
often say things like, "the ASCII character set". Now, in my usage, the
ASCII standard primarily defines an encoding. However, subject to a few
caveats mentioned above, an encoding implicitly picks out a character
set--specifically, the set of characters that it can encode. So in
light of this, I can meaningfully say "the ASCII character set", and
talk about "how the Shift-JIS encoding handles characters in the
ISO-8859-1 character set". I mentioning this not to particularly
encourage the usage, but because you'll hear it, and because it's a bit
less awkward to say than things like, "how the Shift-JIS encoding
handles characters encodable in the ISO-8859-1 encoding". And also,
it's convenient to say, "ASCII characters" (through a small abuse of
language) to talk about "ABC...", even in the context of another
encoding.
Wow, so we've covered a lot of ground. Let's sum up again: Conceptually
strings model a sequence of abstract characters--that's their job.
Encodings (or Character Maps) define interchange formats, and are only
important to IO--to let you transmit strings between processes (via the
network or via files). Encodings are basically algorithms, but in usage
they tend to be identified by conventional names.
What's a Locale?
One of the things that people like to do with strings is to manipulate
them. Two prototypical things they want to do are case transformations
(uppercasing and lowercasing) and sorting. And they also want to do
things like create string representations of numbers and dates, and the
inverse--interpret strings as representing numbers and dates.
Now as it turns out, different people want to do these things in
different ways--think of "1/31/2004" v. "31/1/2004" v. "2004-1-31".
Consequently, that means the sorting algorithms and number formatting
algorithms need to know which way you want to do things (and which way
I want to do them). Traditionally that means that these types of
algorithms need to take a parameter (explicitly or implicitly) to
specify this choice. Traditionally, this parameter is called a
"locale".
Now this word can cause a lot of confusion. It doesn't need to. In the
most concrete sense, a "locale" just specifies a set a algorithms and
settings--things like the format to use for dates, the comparison
operation to use for sorting, and the characters to use for the decimal
and thousands separators for numbers (think of "1,000.50" v.
"1.000,50"). In fact, to reflect this definition, I'm going to coin a
new term. I'm going to call this a "Text Preferences Set", or "TPS" for
short, to emphasize this defintion.
Now where the complexity tends to come in also explains where the term
"locale" came from. People, as it turns out, don't (often) sit around
thinking up new sort orderings, or new formats to display numbers and
dates. More often, they want to do these things in a way that matches
the custom of some language or country or region--they want to sort
strings to match somebody's phone book or dictionary. Consequently,
they find it convenient to specify a TPS by specifying a langauge or
country, and use notations such as "en" for English and "en_US" for US
English (v. British English). But from an API perspective, this is just
a way to go look up a TPS--once you have one, it doesn't much matter
where it came from (looked up by some name or build up
programmatically, piecemeal), you just use it.
Sorting, and Tailorings
Now as I hinted at above, people in different countries like to sort
strings in different ways. Fair enough. That makes it sound like I need
to define a whole bunch of binary string comparison functions--one for
each different custom. One for German, one for Swedish, one for
English, one for Japanese, etc. As it turns out, there's a more compact
way to do this.
The Unicode Standard defines a base collation algorithm (sort
algorithm)--you can use it to sort strings composed of any characters
at all, but it doesn't necessarily give you an order that matches any
particular langauge's custom (though for many languages it's at least
reasonable). This is called the Unicode Collation Algorithm. Now, this
algorithms supports the concept of "tailorings"--tweaks to change the
sort behavior for certain characters (or sequences of characters). The
idea is that you start with the UCA, and supply one set of tailorings
to get the customary English sort order, another set to get the German
sort order, etc. There are three really nice things about this
approach:
1) Rather than writing a whole new algorithm, you just tweak a base
algorithm, and these tweaks are usually data-driven. (That is, they
usually take the form of modifications to a weighting table, rather
than an algorithmic difference.)
2) Under any set of tailorings, you can sort strings containing any
characters. So, you can meaningfully talk about how Japanese words sort
under the English tailorings (or English locale or traditional English
TPS). It's just that, naturally, the sorting of Japanese words won't
differ under English v. German sorting rules (but sorting of some
English words might), and English words probably won't sort differently
under Japanese v. Taiwanese sorting rules (but Japanese and Chinese
words might).
3) In contexts where you don't know what language is relevant, you have
a reasonable fallback sort order.
So for sorting, a language (or more generally, a TPS) just provides a
bit of tuning on top of a default algorithm, and you can sort with or
without taking language into account.
Now, it's important to note that the UCA isn't just sorting in terms of
the numerical values of the Unicode code points assigned to charactes.
But you can sort this way too--it's sometimes called "binary
order"--and it can make sense to do this in situations where you don't
much care _what_ the sort order is, just that you have something
canonical (for instance, so that you can count duplicates in a list).
The reason you might want to use binary order in cases like this is
that it's faster, and simpler to implement. (Also, you might know that
it's an acceptable sort order for your problem domain--maybe you're
sorting part numbers, for instance.)
So, another sum-up: A TPS (or "locale") is a set of preferences
relating to how you want to carry out certain string operations, and
often these preferences take the form of "tailorings" or tweaks on top
of a base algorithm.
Another place where this sort of thing comes up is with case
mappings--how you uppercase or lowercase strings depends somewhat on
language-based conventions.
Another use for uppercasing or lowercasing in English is to perform
case-insensitive string comparison, so that "foo" and "FOO" and "Foo"
and "FoO" compare as the same. In English, you can either uppercase
everything before comparison, or lowercase everything--it doesn't much
matter which you do, the result will be the same. In other languages,
that's not necessarily the case. Conveniently, the Unicode Standard
defines a process called "case folding", which is designed to let you
do case-insensitive comparisons. Case folding transforms strings into
case-canonicalized representations (conceptually like uppercasing then
lowercasing everything), and it does this in a single,
non-language-sensitive way. So even though language is relevant to case
mappings, you can use case folding to do case-insensitive string
comparisons without specifying a language (or TPS).
So a take-away point here is that the Unicode Specification provides
infrastructure in a number of areas for doing operations in a
TPS-sensitive or TPS-insensitive manner, and provides a smooth
transtion between the two.
A note about Language
There's one thing about langauge-sensitive operations which people tend
to overlook, and which it vitally important: for string operations,
it's not the language _of_the_string_ which is important, it's the
language _of_the_reader_. The canonical example here is this: Consider
a list of names, some of them German, and some of them Swedish. I, as
an English-speaking reader, would want to see these sorted in the
_English_ sort order. I don't much care how Germans or Swedes would
sort them. My local phone book doesn't take the national origin of a
person into account when organzing the listing. It doesn't matter--the
point of having it sorted it to allow people in the US to look up a
name. This means that there isn't a problem of how to compare "English
strings" and "German strings"--the language of the sort operation, not
of the individual strings, determines the binary comparison algorithm
used in the sort. That also means that it's a non-problem to have a
single string containing words from different languages.
This also means that, in the US, the German word "STRASSE" (on a sign
or in the name of a company, for instance) would lowercase as
"strasse", even though in Germany it might be preferable to use
"straße".
What's a Grapheme Cluster?
More about this later, but briefly: As it turns out, certain langauges
have certain funny conventions about grouping sequences of characters
into what they see as one "thing". For instance, Spanish considers the
character sequence "ll" to be a separate letter (not two "l"'s), and
similarly for "rr". Now, they don't really treat them specially from an
"information processing" point of view--I don't believe that there are
any encodings which treat "ll" as a single character, and I don't think
that Spanish typewriters have a separate key for "ll". It's just that
they think of it as a separate letter--it shows up when they are
reciting the alphabet, and they see the word "llama" as containing four
letters (the first letter being called "ell-yay", with the word
pronounced "yama", not "lama"), and they would sort that word after
"luego" (because "ll" comes after "l").
The Unicode Standard calls this a "grapheme cluster", and (as mentioned
above) it's relevant to localized sorting and word-length counting.
This is a somewhat unfortunate term, since "grapheme" is a linguistics
term, and it means something different to linguists. Previous versions
of the Unicode Standard used to term "grapheme", but they've migrated
to using "grapheme cluster" to minimize confusion, though the standard
is still somewhat jumbled in this regard.
But the brief take-home idea here is: "grapheme cluster" ("grapheme"
for short, with reservations), in this context, is a sequence of one or
more characters (a range in a string) which natural-language users
would see as a unit. This is again a situation where language (or TPS)
is relevant, but where the Unicode Standard provides a default
TPS-independent implementation which allows for tailorings. So again
there's a smooth transition for language-independent to
language-dependent settings.
Also, importantly, a grapheme cluster is a notion built on top of
characters (it's a cluster of characters), and choosing a langauge lets
you refine how you break up a string into grapheme clusters, but it's
just a refinement--"adding a language into the mix" doesn't pick out a
different semantic construct, it just help you customize your choice of
what ranges make up single graphemes.
I haven't yet covered a few important topics, such as different
character sequences representing equivalent graphemes, canonical and
compatability equivalence, and Unicode normalization forms. I also
haven't said anything yet about concrete implementation or API
guidelines.
JEff