The strings design document

Dan Sugalski

Apr 23, 2004, 5:43:19 PM

to perl6-i...@perl.org

Is tacked on. Note that we *do* have to support, as core languages,
ones which don't force Unicode universally (perl 5, python, and ruby)
*and* we have to support the writing of stream filters in pure
parrot, so the goal of 100% pure unadulterated Unicode except at the
very edge isn't attainable, no matter how nice it may be.

Anyway, here you go, and have at it.

Strings, a design document of sorts

A Preamble
==========

Let's get this on the table--I give. Unicode's officially enshrined
as the top level, Officially Blessed, "We think it's really keen"
standard for parrot.

Language support (computer, not human) realities mean we can't be
completely universal this way, and efficiency concerns mean that
internally we'll want to defer conversions for as long as possible, so
the guts need to be more flexible, but the presented model (presented
via ops to bytecode programs) is generally Unicode.

Requirements
============

* Efficiency - The system must do the absolute minimum amount of work
to get the job done

* Correctness - The job that's done must actually be right

* Upgradeability - This stuff's all going to change again in five years
so we really don't want to have to do it over again.

* Flexibility - Since, unfortunately, no one way of looking at
strings is going to be right for everyone

Realities
=========

* There are a lot of different ways of representing text. Many of
them annoying, some of them wildly incompatible, none of them
wrong.

* We don't get to make the call on what is right or wrong

* Some of the languages we support don't do Unicode, or do Unicode
and other things (including perl 5 and Ruby)

Desires
=======

* We want to make it easily possible to do the right thing with string
data

* We want all the troublesome stuff to be as invisible as possible

* We want to make it look like everyone's got what they want without
actually doing it when we don't have to

With that list in mind, here's a solution. (It is, in large part, the
current solution, only with actual explanation to go with the fairly
enigmatic bits)

Definitions
===========

BYTE - 8 bits 'o data

CODE POINT - A 32-bit integer that represents a single thing in a
character set

ENCODING - How code points are mapped to bytes, and vice versa

CHARACTER SET - Contains meta-information about code points. This
includes both the meaning of individual code points
(65 is capital A, 776 is a combining diaeresis) as
well as a set of categorizations of code
points (alpha, numeric, whitespace, punctuation, and
so on), and a sorting order.

CHARACTER - One or more code points which make up a single real
entity. The "oe" (I'm stuck with ASCII here, that should
really be an o with two dots over it) in Leo's last name
is, in the unicode character set, a single character with
two code points, 111 (lowercase o) and 776 (combining
diaeresis). Characters can *not* be legitimately
decomposed into individual code points in most cases.
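
To make the definitions concrete, here is a small sketch in Python (used
purely for illustration, it is not Parrot's API). It builds the character
described above from its two code points and shows that the encoding changes
the bytes but not the code points.

```python
import unicodedata

# One CHARACTER made of two CODE POINTS: 111 (o) and 776 (combining diaeresis).
char = chr(111) + chr(776)

code_points = [ord(c) for c in char]
# The character set supplies the meaning of each code point.
names = [unicodedata.name(c) for c in char]

# ENCODING: the same code points mapped to different BYTE sequences.
utf8_bytes = char.encode("utf-8")        # 1 byte for o, 2 for the diaeresis
utf32_bytes = char.encode("utf-32-be")   # 4 bytes per code point

print(code_points, names)
print(utf8_bytes.hex(), utf32_bytes.hex())
```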

Conceptually
============

The point of the string

The smallest unit of text that Parrot will process is the string,
something that can be put in an S register. These strings have the
following properties:

*) They have an encoding
*) They have a character set
*) They have a language
*) They have a taint status

The above things are independent of the view of the string presented
to bytecode programs--these are metadata elements that describe the
contents of the string as they actually exist, rather than as they
are presented.

Internally parrot is capable of maintaining strings in several
different basic encodings (8-bit, 16-bit, and 32-bit integer, as well
as UTF-8) and may load other encodings on the fly as needed. Parrot is
also capable of maintaining strings in many different character sets
(ASCII, EBCDIC, Unicode, Latin-n, etc) which are also dynamically
loadable. Finally Parrot is capable of maintaining strings in many
different languages, which also may be loaded on the fly.

This is done for maximum efficiency, regardless of the view of the
data presented to the bytecode programs. Conversion to a different
format may be done if needed to properly express the semantics of the
program, but will not be done if not needed.

For example, consider the following:

  use Unicode;
  open FOO, "foo.txt", :charset(latin-3);
  open BAR, "bar.txt", :charset(big5);
  $filehandle = 0;
  while (<>) {
      if ($filehandle++) {
          print FOO $_;
      } else {
          print BAR $_;
      }
      $filehandle %= 2;
  }

Relatively simple, the program reads from the input filehandle and
splits the data, line by line, between two output files. The two
output files have different requirements -- FOO gets data in Latin-3,
while BAR gets it in Big5. The "use Unicode;" thing at the top's a
hand-wavey way of asserting that we want full Unicode text semantics.

Even so, there's no actual reason in this program to convert to
Unicode at all. If the input file is either Latin-3 or Big5, half of
the lines read don't have to be converted to anything. If the input
file's a proper subset of both (like, US ASCII) then none of the
lines read in need any conversion at all.

If Parrot forced all input data to be converted to Unicode internally
then this program would potentially have some significant overhead,
depending on the type of the input file. Given the output, the input
is likely either Latin-3 or Big5, either of which needs some
conversion to get turned into Unicode, while Unicode is guaranteed to
need some conversion for proper output to both files.
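
A rough sketch of the deferred-conversion idea, in Python rather than Parrot.
The names TaggedLine and write_line are made up for this example, and the
Python codec names iso8859_3 and big5 stand in for Parrot charset tags. Each
line keeps the charset it arrived in, and a conversion happens only when the
output handle wants something different.

```python
from dataclasses import dataclass

@dataclass
class TaggedLine:
    data: bytes    # bytes exactly as read
    charset: str   # Python codec name standing in for a charset tag

conversions = 0    # counts how often a real conversion was needed

def write_line(line: TaggedLine, out_charset: str) -> bytes:
    """Return the bytes to write, converting only if the charsets differ."""
    global conversions
    if line.charset == out_charset:
        return line.data                      # nothing to do
    conversions += 1
    text = line.data.decode(line.charset)     # pivot through Unicode
    return text.encode(out_charset)

# Input arrives as Latin-3; lines alternate between BAR (Big5) and FOO (Latin-3).
lines = [TaggedLine(f"line {n}\n".encode("iso8859_3"), "iso8859_3")
         for n in range(4)]
outputs = [write_line(line, "big5" if i % 2 == 0 else "iso8859_3")
           for i, line in enumerate(lines)]
print(conversions)
```

Only the lines headed for BAR are converted here. The sketch does not model
the further shortcut of noticing that a line is plain US ASCII and skipping
the conversion for both files.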

Synthesized code points
=======================

Parrot provides code points for all characters, even for those
character sets/encodings which don't inherently do so. Most sets that
have variable-length encodings use an escape sequence scheme--the
value of the first byte in a character determines whether the
character is a one or more byte sequence. When parrot turns these into
code points it does it by building up the final value. The first byte
is put in the low 8 bits of the integer. If there's a second byte in
the sequence the current value is shifted left 8 bits and the new byte
is stuffed in the low 8 bits. If there's a third byte in the sequence
everything is shifted left again 8 bits and that third byte is stuffed
in the bottom, and so on.

For example, in Shift-JIS, if the first byte is in the range
0x21-0x7E or 0xA1-0xDF the character is a single byte. If the first
byte is in the range 0x81-0x9F or 0xE0-0xEF the character takes two
bytes, with the first byte determining which table the second byte
indexes into. The roman character A is represented by a single byte
0x41, while the Japanese hiragana KA is represented by the byte
sequence 0x82 0xA9. When parrot turns this into code points, it
becomes two integers, 0x00000041 and 0x000082A9. (Though it could
represent them as 16-bit integers, since no character takes three or
more bytes)

While this is somewhat unconventional, it makes the text easy to
process internally as fixed-width integers, is trivially transformed
back into a byte stream, and trivially turned from a byte stream into
integers in the first place. It also has the advantage of making what
was a variable-width encoding (some of which make it difficult or
impossible to tell, if you pick a byte at a random spot in the byte
stream, whether you're in the middle of a character or not) into a
fixed-width encoding. As such it makes a reasonably pleasant way to
manipulate this sort of text.
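
The shift-and-stuff scheme can be sketched as follows, in Python for
illustration. The function names are invented for this example, and the lead
byte ranges are the Shift-JIS ones quoted above. This is a sketch of the idea,
not Parrot's actual decoder.

```python
from typing import List

def is_lead_byte(b: int) -> bool:
    """True if b starts a two-byte Shift-JIS character (ranges from the text)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF

def to_code_points(data: bytes) -> List[int]:
    """Build one integer per character by shifting in each byte."""
    points = []
    i = 0
    while i < len(data):
        value = data[i]
        if is_lead_byte(data[i]):
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            value = (value << 8) | data[i + 1]   # shift left, stuff new byte low
            i += 1
        points.append(value)
        i += 1
    return points

def to_bytes(points: List[int]) -> bytes:
    """Turn synthesized code points back into the original byte stream."""
    out = bytearray()
    for p in points:
        if p > 0xFF:
            out.append(p >> 8)
        out.append(p & 0xFF)
    return bytes(out)

points = to_code_points(b"\x41\x82\xa9")   # roman A, hiragana KA
print([hex(p) for p in points])
```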

Conversion Rules
================

There are two types of conversions, from one thing (encoding,
charset, or language) to a thing of a similar type or to a thing of a
different type.

Similar here means a thing where the conversion is lossless or
accepted as good enough to have no semantic loss--for example
converting US ASCII to most character sets, or pretty much any
character set to Unicode. Different here means a thing where the
conversion is *not* guaranteed lossless--for example converting from
Shift-JIS to US ASCII or from Unicode to Latin-1.

Conversion lossiness is gauged either as a potential loss (where data
*may* be lost) or actual loss (where data, after conversion, *has*
been lost). While, for example, Big5 and Shift-JIS aren't
interchangeable in general so there is potential loss, they both have
US ASCII as a subset so it's possible that the conversion won't
actually lose any information.

Current interpreter settings determine when an exception or warning
is thrown. Some languages may deem it an error to implicitly shift to
an encoding where data may be lost and throw an error any time that
happens, others may defer the error until actual data loss occurs,
and still others may decide that data loss is fine, since if you were
worried about it in the first place you would've done something about
it.

Conversions are not required nor guaranteed to be symmetric. Just
about everything can shift to Unicode, and US ASCII can shift to just
about anything, but the converse is not true.

Since maintaining a full set of conversions is untenable, Parrot
declares that, by definition, all sets can pivot through
Unicode. Unicode pivoting is considered a potential loss of data, so
if the interpreter is set to warn or throw exceptions on potential
loss it will do so, even if the conversion is actually OK. (In which
case someone had better note that somewhere) It's perfectly acceptable
(and, in fact, encouraged) for a set to declare that it can explicitly
pivot to another set, with the actual internal code first going
through Unicode.
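
Here is one way the potential/actual distinction and the Unicode pivot might
fit together, sketched in Python. The policy names, the LossyConversion class,
and the is_similar rule are all assumptions made for this example. The
document does not fix a table of similar pairs.

```python
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}

class LossyConversion(Exception):
    pass

def is_similar(src: str, dst: str) -> bool:
    """Assumed rule: same set, anything to Unicode, or US ASCII to anything."""
    return src == dst or dst in UNICODE_ENCODINGS or src == "ascii"

def convert(data: bytes, src: str, dst: str, policy: str = "actual") -> bytes:
    """policy 'potential' raises if data may be lost, 'actual' raises only
    when data is lost, 'ignore' substitutes '?' and never raises."""
    if policy == "potential" and not is_similar(src, dst):
        raise LossyConversion(f"{src} -> {dst} may lose data")
    text = data.decode(src)                   # pivot through Unicode
    try:
        return text.encode(dst)
    except UnicodeEncodeError as err:
        if policy == "ignore":
            return text.encode(dst, errors="replace")
        raise LossyConversion(f"{src} -> {dst} lost data") from err

# Big5 text that happens to be pure US ASCII converts without actual loss.
ascii_only = "plain text".encode("big5")
print(convert(ascii_only, "big5", "shift_jis", policy="actual"))
```

With policy="potential" the same call raises, which matches the case where
the interpreter complains even though the conversion is actually OK.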

Internals
=========

Internally all strings are tagged with an encoding, a charset, a
language, and a taint status. This is the minimum amount of
information that can be reasonably kept for a string without losing
enough information to damage it if the data is passed into a
subroutine which expects a string parameter rather than a full-blown
PMC.

Tainting status is the simplest thing here, maintainable with a single
bit in the flags word for the string. We have to maintain this so that
the sequence:

set S0, P0
set P0, S0

doesn't lose the taint status of the data in P0, as well as so this:

set S0, P0
some_sub(S0)

passes in a properly tainted string to the some_sub subroutine. We're
encouraging code to use values of the lowest possible type, but we
don't want to be sacrificing safety for it.

Encoding needs to be attached to each string so we have some idea of
how to turn the bytes in the string's buffer into actual code
points. Since we defer transforming the string data until we actually
need to use it, regardless of what logical structure we may think the
string has, we still need to work on the actual structure it has.
This also allows easier processing of data in an encoding different
than whatever parrot may take as 'normal', if it ever does. Each
character set will have a preferred encoding, but people are going to
want to shift encodings around at times. (Especially the various utf-N
encodings)

Character set is attached so we can tell what to do with the code
points that come from the encoding and how to classify them. While we
prefer Unicode, that doesn't mean we're actually *in* unicode
yet. Also, since the possibility exists that we may at least have two
different character sets (either Unicode or binary, even if we declare
there are no others) it's less error-prone to unconditionally use the
set information hanging off the string itself.

The language determines the overridden special behaviour of a
string--how its case mangling should work, overridden character
classifications, and some comparison information. This will often be
overridden in main code, but becomes important in library code.
Language is a "humor Dan" thing. It won't hurt you, really.
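
As a picture of what hangs off each string, here is a minimal sketch in
Python. The class and field names are invented for this example and are not
Parrot's real string header. The point is only that the four tags travel with
the buffer, so taint survives the S register round trip described above.

```python
from dataclasses import dataclass, replace

@dataclass
class PString:
    """Stand-in for a Parrot string: a buffer plus its four tags."""
    buffer: bytes
    encoding: str
    charset: str
    language: str
    tainted: bool = False

@dataclass
class PMC:
    """Stand-in for a PMC wrapping a string value."""
    value: PString

def set_s_from_p(pmc: PMC) -> PString:
    """Like 'set S0, P0': the copy keeps every tag, including taint."""
    return replace(pmc.value)

def set_p_from_s(s: PString) -> PMC:
    """Like 'set P0, S0'."""
    return PMC(replace(s))

p0 = PMC(PString(b"user input", "8-bit", "ascii", "none", tainted=True))
s0 = set_s_from_p(p0)
p0_again = set_p_from_s(s0)
print(s0.tainted, p0_again.value.tainted)
```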

Core functionality
==================

The following functions need to be performed by the core:

*) Transform encodings
*) Transform character sets
*) Get/set byte, code point, and character from a string
*) Get/set substring
*) Get length in bytes, code points, and characters
*) Get/set encoding
*) Get/set character set
*) Get/set language
*) Flatten to and thaw from a binary string
*) Upcase, downcase, and titlecase

These are all unary operations. While binary operations are necessary
for actual use, we'll deal with them after we get basic string
manipulation working.

Opcodes
=======

The following ops are proposed. Note that for many of them there is a
string-native version and a Unicode version--this is noted by a
(u). For Unicode strings these will behave identically, while for
strings that aren't in unicode they perform the operation and
translate to or from unicode as necessary.

getbyte Ix, Sy, Iz
(u)getcodepoint Ix, Sy, Iz
(u)getcharacter Sx, Sy, Iz

Get the byte, codepoint, or character requested. Destination is either
an integer (representing the byte or codepoint) or a string. Sy is the
source string, Iz is the offset in bytes, code points, or characters
from the beginning of the string.
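
The three units of offset can be sketched like this, in Python for
illustration. The function names mirror the ops but this is not Parrot code,
and the characters helper is only a rough approximation of a character (a base
code point plus any combining marks that follow it), not full Unicode
segmentation.

```python
import unicodedata
from typing import List

def characters(text: str) -> List[str]:
    """Group each base code point with the combining marks that follow it."""
    groups: List[str] = []
    for cp in text:
        if groups and unicodedata.combining(cp):
            groups[-1] += cp
        else:
            groups.append(cp)
    return groups

def getbyte(text: str, offset: int, encoding: str = "utf-8") -> int:
    return text.encode(encoding)[offset]

def getcodepoint(text: str, offset: int) -> int:
    return ord(text[offset])

def getcharacter(text: str, offset: int) -> str:
    return characters(text)[offset]

s = "xo\u0308y"   # x, then o + combining diaeresis, then y
print(getbyte(s, 2), getcodepoint(s, 2), repr(getcharacter(s, 2)))
```

The same offset 2 lands on a UTF-8 continuation byte, on the combining
diaeresis, or on the letter y, depending on the unit.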

(u)getstring Sw, Sx, Iy, Iz

This is substr, with the destination guaranteed to be in Unicode for
the (u) case.

setbyte Sx, Iy, Iz
(u)setcodepoint Sx, Iy, Iz
(u)setcharacter Sx, Sy, Iz

Sets the byte, code point, or character at offset Z in destination
string X to the value in Y. Note that in the Unicode case the source is taken
to be a unicode code point or character and translated to the type of
the destination string. These opcodes may throw an exception if the
resulting destination string is illegal (for example if the
destination is a unicode string with illegal combining character
construction, or in the byte case if the resulting buffer is un-decodable)

(u)setstring Sw, Sx, Iy, Iz

This is lvalue substr--the characters at offset Y, count Z (NB
*characters*, not code points) are replaced by the string X. In the
unicode case the string is taken to be unicode and translated to the
type of the destination string.

encoding Ix, Sy
charset Ix, Sy
language Ix, Sy

Returns the encoding, character set, or language of Y.

encodingname Sx, Iy
charsetname Sx, Iy
languagename Sx, Iy

Returns the name of the encoding, character set, or language that
corresponds to the internal value Y. (As returned by the encoding,
charset, and language ops)

findencoding Ix, Sy
findcharset Ix, Sy
findlanguage Ix, Sy

Find the internal value for the encoding, character set, or language
named Y.

bytelength Ix, Sy
codepointlength Ix, Sy
characterlength Ix, Sy

Return the length of Y in bytes, code points, or characters. Length is
actual length, and as such may vary for otherwise identical
strings. (This is especially true for strings that change encoding, as
lengths can vary wildly between a UTF-8 and UTF-32 version of the same
unicode string)

transcode Sx, Iy
transset Sx, Iy
translang Sx, Iy

Change the string to have the specified encoding, character set, or
language. Done in place.

transcode Sx, Sy, Iz
transset Sx, Sy, Iz
translang Sx, Sy, Iz

Generate a new version of Y with the encoding, character set, or
language Z.

tounicode Sx
tounicode Sx, Sy

Change the string to unicode. The one arg version does it in place,
the two arg version generates a new string.

(d)upcase Sx
(d)upcase Sx, Sy
(d)downcase Sx
(d)downcase Sx, Sy
(d)titlecase Sx
(d)titlecase Sx, Sy

Make the string all uppercase, all lower case, or titlecase the first
character. The two-arg versions generate a new string, the one arg
version does it in place. These ops have two variants--the one with a
leading d (dupcase, ddowncase, dtitlecase) use the current interpreter
default language rule for case mangling and set that as the language
for the generated string, while the non-d version uses the information
in the string itself.
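
Why the language matters for case mangling can be shown with a small Python
sketch. The override table and function names are hypothetical, and only the
well-known Turkish dotted/dotless i case is included. Everything else falls
back to the default Unicode mapping.

```python
# Hypothetical per-language overrides; only the Turkish i is shown.
UPCASE_OVERRIDES = {"tr": {"i": "\u0130", "\u0131": "I"}}

def upcase(text: str, language: str) -> str:
    """Uppercase using the language tag carried by the string."""
    overrides = UPCASE_OVERRIDES.get(language, {})
    return "".join(overrides.get(ch, ch.upper()) for ch in text)

def dupcase(text: str, default_language: str) -> tuple:
    """Uppercase with the interpreter default and tag the result with it."""
    return upcase(text, default_language), default_language

print(upcase("istanbul", "tr"), upcase("istanbul", "en"))
```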

decompose Sx, Sy

Take the string in Y and return a version in X which is a flat byte
string with no language, character set, or encoding. (or, rather, the
language none, charset none, and encoding 8-bit binary)

compose Sw, Ix, Iy, Iz

Take the flattened binary string W and mark it as having the encoding
X, character set Y, and language Z. This may throw an exception if the
string doesn't meet the requirements of the language, charset, or
encoding.

compose Sv, Sw, Ix, Iy, Iz

As above, only a new string is generated and the original left alone.
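
A sketch of flatten and thaw, in Python, with strings modeled as plain tuples
of (buffer, encoding, charset, language). The tuple layout and the choice of
ValueError are assumptions for this example, and only the encoding is
validated here, using Python's own codecs.

```python
def decompose(buffer: bytes) -> tuple:
    """Strip the tags: plain 8-bit binary, charset none, language none."""
    return (buffer, "binary", "none", "none")

def compose(raw: bytes, encoding: str, charset: str, language: str) -> tuple:
    """Re-tag a flat byte string, refusing buffers the encoding can't decode."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        raise ValueError(f"buffer is not valid {encoding}") from err
    return (raw, encoding, charset, language)

flat = decompose("o\u0308".encode("utf-8"))
thawed = compose(flat[0], "utf-8", "unicode", "de")
print(flat, thawed)
```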

Exceptions
==========

Here's a list of the exceptions that will be thrown if the string
subsystem comes across things it's not happy about. All of these
exceptions are optional, and may be overridden by interpreter
settings. Additionally, some conversions are deemed less dangerous
than others, and as such there are two different types of conversion
(similar and dissimilar) rather than just one. These exceptions may
also be thrown either because of potential problems (where something
might happen) or actual problems (where something did happen).

* LANG_MISMATCH - thrown whenever a binary operation is done on two
strings with differing languages when there is otherwise no
overriding semantic in place.

* CHARSET_MISMATCH - thrown whenever a binary operation is done on
strings of different character sets.

* LOSSY_CONVERSION - Thrown whenever a conversion would lose
information. This includes getting a plain string from a PMC which
has segmented string data in it. (This would be a PMC which has
some data in Unicode, EBCDIC, and RAD-50, for example, or whose
contents had different languages attached to different parts of the
string data)

* DECOMPOSITION_ERROR - Thrown whenever you try and act on part of a
multi-code point character. This includes doing an ord() on a
string where the character you're ord'ing is made up of two or more
code points.
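
The last case can be sketched in Python. DecompositionError and strict_ord are
names invented for this example, and "part of a multi-code point character" is
approximated as "is a combining mark, or is followed by one".

```python
import unicodedata

class DecompositionError(Exception):
    pass

def strict_ord(text: str, index: int) -> int:
    """ord() of the code point at index, refusing to split a character."""
    is_mark = unicodedata.combining(text[index]) != 0
    has_mark = (index + 1 < len(text)
                and unicodedata.combining(text[index + 1]) != 0)
    if is_mark or has_mark:
        raise DecompositionError(
            "code point is part of a multi-code point character")
    return ord(text[index])

s = "xo\u0308y"
print(strict_ord(s, 0), strict_ord(s, 3))
```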

--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                        have teddy bears and even
                                      teddy bears get drunk

Jeff Clites

Apr 27, 2004, 12:40:12 PM

to Dan Sugalski, perl6-i...@perl.org

On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:

> CHARACTER SET - Contains meta-information about code points. This
> includes both the meaning of individual code points
> (65 is capital A, 776 is a combining diaresis) as
> well as a set of categorizations of code
> points (alpha, numeric, whitespace, punctuation, and
> so on), and a sorting order.

I'm assuming here that you are referring to things like Shift-JIS and
ISO-8859-1 as character sets, right?

Questions (based on that assumption):

[*Note: assume everywhere below that the strings in question are not
explicitly language-tagged (or, are tagged with "Dunno"--however it's
supposed to work).]

1) ISO-8859-1 is used to represent text in several different languages,
including German and Swedish. German and Swedish differ in their sort
order, even for things they have in common. (For example, ö
(o-with-diaeresis) is considered a separate letter in Swedish, but is
just an accented "o" in German.) So (assuming my strings aren't
explicitly language-tagged, or are tagged with "Dunno"), what sort
order does ISO-8859-1 define? I'm not sure whether the national
standards themselves actually define a sort order, so are we going to
define one for every "character set"? In addition, many languages can
be represented in several different "character sets", so that seems to
mean that the sort order for "öut" v. "out" will vary, depending on the
"character set" used for those strings?

2) In light of the above, how do you sort an array of strings, assuming
they're not all in the same "character set"?

3) If the answer to (2) is "you must upgrade them all to UTF-8", then
that means that the sort order for an array might totally change when
you add one new member, right? If the answer is, "for a given pair,
when you compare them during sorting, only upgrade if their character
sets don't match", then you open the door to non-convergent sorting
(ie, the sort might never finish).

My worry here is that if the semantics of the Latin Capital Letter A
("A"), for example (or pick any other character), are allowed to differ
between different "character sets", then we'll have problems for any
binary string operation.

JEff

Jarkko Hietaniemi

Apr 27, 2004, 12:57:29 PM

to perl6-i...@perl.org, Jeff Clites, Dan Sugalski, perl6-i...@perl.org

> 1) ISO-8859-1 is used to represent text in several different languages,
> including German and Swedish. German and Swedish differ in their sort
> order, even for things they have in common. (For example, ö
> (o-with-diaeresis) is considered a separate letter in Swedish, but is
> just a accented "o" in German.) So (assuming my strings aren't
> explicitly langauge-tagged, or are tagged with "Dunno"), what sort
> order does ISO-8859-1 define? I'm not sure whether the national
> standards themselves actually define a sort order, so are we going to

National standards yes, ISO 8859 (and the like) not. In other words,
sorting standards exist, but they have (quite rightly) nothing to do
with sorting standards. Real life sorting is messy (multiple passes,
some parts may be ignored in some passes, acronyms, etc.) and worlds
apart from "let's compare the bytes one by one" or even from "let's
compare code points" or even from "let's compare grapheme (clusters)".

> define one for every "character set"? In addition, many languages can
> be represented in several different "character set", so that seems to
> mean that the sort order for "öut" v. "out" will vary, depending on the
> "character set" used for those strings?

FWIW, I think binding language to strings is a Mistake. But I have
decided to give up trying to argue anymore about it since Dan seems
to be convinced that it will solve some problems.

Larry Wall

Apr 27, 2004, 1:04:36 PM

to perl6-i...@perl.org

I can't answer for Dan regarding implementation issues, but from
a (computer) language point of view, consistency is better than
correctness on this issue, because there is no single definition of
"correct" until you specify what you mean by "correct". So at the
first three Unicode support levels in Perl 6 (bytes, codepoints, and
graphemes), sorting is by default always "Unicodabetical" regardless
of the actual encoding of the string. You may, of course, always be
more explicit in your sort command. Alternately, you may go to the
fourth level, "letters", by specifying a particular language for the
current lexical scope, in which case the default sort order for that
language applies everywhere in that lexical scope. Arguably this
is getting up into the library layers rather than Parrot internals,
and I think Dan has it right to concentrate on the first three levels
with pure Unicode semantics (whatever those are this week :-).
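
A small Python illustration of sorting by code point regardless of how the
strings were stored. The sample words and the encodings they arrive in are
made up for this example. Python compares str values code point by code point,
which is the ordering sketched here.

```python
# The same kind of text, stored in three different encodings.
stored = [
    ("zebra".encode("utf-16-le"), "utf-16-le"),
    ("\u00f6ut".encode("latin-1"), "latin-1"),
    ("out".encode("utf-8"), "utf-8"),
]

# Decode to code points first, then sort by code point value.
decoded = [raw.decode(enc) for raw, enc in stored]
unicodabetical = sorted(decoded)
print(unicodabetical)
```

The o with diaeresis sorts after z here because its code point (246) is
higher, which is consistent but is not what either a German or a Swedish
reader would call correct.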

Larry

Dan Sugalski

Apr 27, 2004, 1:25:33 PM

to Jeff Clites, perl6-i...@perl.org

At 9:40 AM -0700 4/27/04, Jeff Clites wrote:
>On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:
>
>>CHARACTER SET - Contains meta-information about code points. This
>> includes both the meaning of individual code points
>> (65 is capital A, 776 is a combining diaresis) as
>> well as a set of categorizations of code
>> points (alpha, numeric, whitespace, punctuation, and
>> so on), and a sorting order.
>
>I'm assuming here that you are referring to
>things like Shift-JIS and ISO-8859-1 as
>character sets, right?

Sort of. Shift-JIS is actually both a character
set and an encoding, which makes life a bit
confusing if not downright annoying.

>Questions (based on that assumption):
>
>[*Note: assume everywhere below that the strings
>in question are not explicitly language-tagged
>(or, are tagged with "Dunno"--however it's
>supposed to work).]
>
>1) ISO-8859-1 is used to represent text in
>several different languages, including German
>and Swedish. German and Swedish differ in their
>sort order, even for things they have in common.
>(For example, ö (o-with-diaeresis) is considered
>a separate letter in Swedish, but is just a
>accented "o" in German.) So (assuming my strings
>aren't explicitly langauge-tagged, or are tagged
>with "Dunno"), what sort order does ISO-8859-1
>define? I'm not sure whether the national
>standards themselves actually define a sort
>order, so are we going to define one for every
>"character set"? In addition, many languages can
>be represented in several different "character
>set", so that seems to mean that the sort order
>for "öut" v. "out" will vary, depending on the
>"character set" used for those strings?

That's possible, yes.

Each character set has a default sort ordering
(amongst other things), which will be used in the
absence of overriding data.

>2) In light of the above, how do you sort an
>array of strings, assuming they're not all in
>the same "character set"?

You don't. Cross-set comparisons aren't
valid--either the strings get promoted to a
common set or an exception is thrown. Throwing an
exception will be the default.

>3) If the answer to (2) is "you must upgrade
>them all to UTF-8", then that means that the
>sort order for an array might totally change
>when you add one new member, right? If the
>answer is, "for a given pair, when you compare
>them during sorting, only upgrade if their
>character sets don't match", then you open the
>door to non-convergent sorting (ie, the sort
>might never finish).

Yep, that is a potential problem. The likely
case, though, is that adding a string of a
different type (character set or language) makes
sorting impossible and pitches an exception
instead.

>My worry here is that if the semantics of the
>Latin Capital Letter A ("A"), for example (or
>pick any other character), are allowed to differ
>between different "character sets", then we'll
>have problems for any binary string operation.

I've not really gotten into binary string
operations. In general, cross-type operations
will either throw exceptions or force an upgrade
to a compatible character set. Upgrades will (or
at least should) be sticky, so if you throw, say,
a unicode string into an array full of Latin-1
characters, by the time you're done sorting
everything'll be promoted to Unicode and worst
case you'll have some ringing as the conversion
propagates through.

I may, though, be completely deluded about that one.
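
For what it's worth, the throw-or-promote behaviour can be sketched in
Python. The Tagged class, CharsetMismatch, and the promote flag are invented
for this example, and UTF-8 stands in for "Unicode". It makes no claim that
every element ends up promoted by the end of the sort, only that mismatched
pairs are upgraded in place when they meet.

```python
from functools import cmp_to_key

class CharsetMismatch(Exception):
    pass

class Tagged:
    def __init__(self, data: bytes, charset: str):
        self.data, self.charset = data, charset

    def promote(self) -> None:
        """Sticky upgrade: re-encode in place so later comparisons see Unicode."""
        if self.charset != "utf-8":
            self.data = self.data.decode(self.charset).encode("utf-8")
            self.charset = "utf-8"

def compare(a: Tagged, b: Tagged, promote: bool = False) -> int:
    if a.charset != b.charset:
        if not promote:
            raise CharsetMismatch(f"{a.charset} vs {b.charset}")  # the default
        a.promote()
        b.promote()
    return (a.data > b.data) - (a.data < b.data)

items = [
    Tagged("\u00f6ut".encode("latin-1"), "latin-1"),
    Tagged("out".encode("latin-1"), "latin-1"),
    Tagged("zebra".encode("utf-8"), "utf-8"),
]
ordered = sorted(items, key=cmp_to_key(lambda a, b: compare(a, b, promote=True)))
result = [t.data.decode(t.charset) for t in ordered]
print(result)
```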

Jarkko Hietaniemi

Apr 27, 2004, 1:38:02 PM

to Dan Sugalski, perl6-i...@perl.org, Jeff Clites

Dan Sugalski wrote:

> At 7:57 PM +0300 4/27/04, Jarkko Hietaniemi wrote:
>
>>> 1) ISO-8859-1 is used to represent text in several different languages,
>>> including German and Swedish. German and Swedish differ in their sort
>>> order, even for things they have in common. (For example, ö
>>> (o-with-diaeresis) is considered a separate letter in Swedish, but is
>>> just a accented "o" in German.) So (assuming my strings aren't
>>> explicitly langauge-tagged, or are tagged with "Dunno"), what sort
>>> order does ISO-8859-1 define? I'm not sure whether the national
>>> standards themselves actually define a sort order, so are we going to
>>
>>National standards yes, ISO 8859 (and the like) not. In other words,
>>sorting standards exist, but they have (quite rightly) nothing to do
>>with sorting standards.
>
> ?

Ooops. Replace the last "sorting" with "character". That's what I get,
errrm, what you get, from writing email while watching evening news :-)

>> Real life sorting is messy (multiple passes,
>>some parts may be ignored in some passes, acronyms, etc.) and worlds
>>apart from "let's compare the bytes one by one" or even from "let's
>>compare code points" or even from "let's compare grapheme (clusters)".
>
> True enough, though what I want the language for
> is as much case-mangling as sorting.

I just think that having languages for strings is akin to
having types (dimensioned or -less) for numbers.
(Making 2 kg plus 3 Hz to croak, that kind of thing.)

--
Jarkko Hietaniemi <j...@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

Dan Sugalski

Apr 27, 2004, 1:28:40 PM

to Jarkko Hietaniemi, perl6-i...@perl.org, Jeff Clites

At 7:57 PM +0300 4/27/04, Jarkko Hietaniemi wrote:

>> 1) ISO-8859-1 is used to represent text in several different languages,
>> including German and Swedish. German and Swedish differ in their sort
>> order, even for things they have in common. (For example, ö
>> (o-with-diaeresis) is considered a separate letter in Swedish, but is
>> just a accented "o" in German.) So (assuming my strings aren't
>> explicitly langauge-tagged, or are tagged with "Dunno"), what sort
>> order does ISO-8859-1 define? I'm not sure whether the national
>> standards themselves actually define a sort order, so are we going to
>
>National standards yes, ISO 8859 (and the like) not. In other words,
>sorting standards exist, but they have (quite rightly) nothing to do
>with sorting standards.

> Real life sorting is messy (multiple passes,
>some parts may be ignored in some passes, acronyms, etc.) and worlds
>apart from "let's compare the bytes one by one" or even from "let's
>compare code points" or even from "let's compare grapheme (clusters)".

True enough, though what I want the language for
is as much case-mangling as sorting.

>> define one for every "character set"? In addition, many languages can
>> be represented in several different "character set", so that seems to
>> mean that the sort order for "öut" v. "out" will vary, depending on the
>> "character set" used for those strings?
>
>FWIW, I think binding language to strings is a Mistake. But I have
>decided to give up trying to argue anymore about it since Dan seems
>to be convinced that it will solve some problems.

Well, it's always possible that, once we get deeper into this,

a) I get over the snit
and
b) I realize what a profoundly stupid idea it was in the first place.

Wouldn't be the first time, and probably not the last either.

Jeff Clites

Apr 28, 2004, 6:13:24 AM

to Dan Sugalski, perl6-i...@perl.org

On Apr 27, 2004, at 10:25 AM, Dan Sugalski wrote:

> At 9:40 AM -0700 4/27/04, Jeff Clites wrote:
>> On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:
>>
>>> CHARACTER SET - Contains meta-information about code points. This
>>> includes both the meaning of individual code points
>>> (65 is capital A, 776 is a combining diaresis) as
>>> well as a set of categorizations of code
>>> points (alpha, numeric, whitespace, punctuation, and
>>> so on), and a sorting order.
>>
>> I'm assuming here that you are referring to things like Shift-JIS and
>> ISO-8859-1 as character sets, right?
>
> Sort of. Shift-JIS is actually both a character set and an encoding,
> which makes life a bit confusing if not downright annoying.

I think you're basically forcing this concept onto national standards
which lack it. I don't think that most of the national standards
actually define the semantics of the characters they encode
(categorizations, case mapping, sort order), and although they assign
byte sequences to represent their characters, I'm not sure they
actually present this in terms of assigning integers to them, in the
sense of code points v. byte sequences.

So it sounds like we are going to make up a set of semantics,
individually, for each character set which doesn't explicitly define
its own (which, I think, is most of them). So, we have two choices:
(1) do this arbitrarily and in ways which make different character
sets/encodings actively conflict, or (2) come up with an assignment of
semantics which makes them all fit together nicely, so that (for
instance) the letter "A" comes out as having lowercase version "a",
isAlpha, isNonNumeric, isHexDigit, isNotWhitespace, isNotPunctuation,
etc., for all of the character sets. Well, option (2) is what the
Unicode Consortium spent years doing--coming up with a comprehensive
list of the semantics and categorizations of every character
represented in every major character set/standard.
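
As a rough illustration of what such a shared table of semantics looks
like in practice, here is a small sketch in Python, using the Unicode
data that ships with its standard library. The describe() helper and
the choice of properties are assumptions made for this example; nothing
here is Parrot API.

```python
import string
import unicodedata

def describe(code_point):
    """Look up a few Unicode properties for one code point.

    Hypothetical helper for illustration; the property set is arbitrary.
    """
    ch = chr(code_point)
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # e.g. Lu = uppercase letter
        "lowercase": ch.lower(),
        "is_alpha": ch.isalpha(),
        "is_hex_digit": ch in string.hexdigits,
        "is_whitespace": ch.isspace(),
    }

capital_a = describe(65)    # the "65 is capital A" from the design doc
diaeresis = describe(776)   # the "776 is a combining diaeresis"
```

The answers come from one table keyed by the character, with no
reference to whichever encoding the character arrived in.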

The bottom line is that I don't think that anyone ever intended that
the letter "A" have different semantics in each-and-every character
set/encoding. In fact, they're all trying to provide potentially
different ways to represent _the_same_ character. (I'm using "A" here
because it's easy to represent in an email--other character choices
might be more illustrative.) And the main task the Unicode Consortium
carried out is to reconcile all of these. Forget about the encodings
and things like that they've defined (UTF-8/16/32, etc.)--they're
incidental. The important thing they did is not to define yet-another
character set--they created the logical union of all of the others.
They figured out where there was overlap, where they agreed, where
there were inconsistencies, and dealt with them.

You've got to tear yourself away from this byte-centric view. It's the
wrong mindset. Strings represent text. Text is made of characters.
Characters are abstract things--the Platonic forms of letters, numbers,
punctuation, etc. All of these different character sets/encodings are
trying to digitally represent the same things--not to pick out whole
separate notions. The letter A is the letter A is the letter A. When I
type some text into my text editor and save it, I get a popup menu of
choices for what encoding to use. My choice of UTF-16 v. Shift-JIS v.
Latin-1 is inconsequential--those are just different ways to represent
_the_same_ text. All that matters is that when something, later, reads
that file, it has a way to know which choice I made, so that it can
decode those bytes and get to the text I saved.

So no matter how we choose to represent things in memory, the semantics
can't depend on what on-disk representation I chose--it's supposed to
be _the_same_ text.
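
A minimal sketch of that claim, in Python and assuming its built-in
codecs: the same text saved in different encodings gives different
bytes, but decoding each with the matching codec recovers identical
text, so case mapping and comparison behave the same afterwards. The
helper name and sample strings are invented for the example.

```python
def save_and_reload(text, encodings):
    """Encode text under each encoding, then decode it back.

    Returns (bytes per encoding, decoded text per encoding).
    """
    encoded = {name: text.encode(name) for name in encodings}
    decoded = {name: data.decode(name) for name, data in encoded.items()}
    return encoded, decoded

# Western European text: three on-disk forms, one text.
latin_bytes, latin_text = save_and_reload(
    "café", ["latin-1", "utf-8", "utf-16"])

# Japanese text: Shift-JIS versus two Unicode encodings.
jp_bytes, jp_text = save_and_reload(
    "テキスト", ["shift_jis", "utf-8", "utf-16"])

# Operations on the decoded text do not depend on the encoding chosen.
upper_forms = {name: value.upper() for name, value in latin_text.items()}
```

The only thing the reader needs is the encoding label, which is the
"way to know which choice I made" mentioned above.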

And frankly it wouldn't take long to write a text editor which lets you
sort and do case mapping, but doesn't let you save to a file. In cases
like that (no IO), the notion of a character set or encoding need never
come into play. But, I've still got text that I'm programmatically
manipulating. Encoding only comes into play during IO (or the
preparation for IO).

>> 2) In light of the above, how do you sort an array of strings,
>> assuming they're not all in the same "character set"?
>
> You don't. Cross-set comparisons aren't valid--either the strings get
> promoted to a common set or an exception is thrown. Throwing an
> exception will be the default.
>
>> 3) If the answer to (2) is "you must upgrade them all to UTF-8", then
>> that means that the sort order for an array might totally change when
>> you add one new member, right? If the answer is, "for a given pair,
>> when you compare them during sorting, only upgrade if their character
>> sets don't match", then you open the door to non-convergent sorting
>> (ie, the sort might never finish).
>
> Yep, that is a potential problem. The likely case, though, is that
> adding a string of a different type (character set or language) makes
> sorting impossible and pitches an exception instead.

Just throwing exceptions all of the time doesn't seem to be the most
useful thing to do. We can do semantically better.

>> My worry here is that if the semantics of the Latin Capital Letter A
>> ("A"), for example (or pick any other character), are allowed to
>> differ between different "character sets", then we'll have problems
>> for any binary string operation.
>
> I've not really gotten into binary string operations. In general,
> cross-type operations will either throw exceptions or force an upgrade
> to a compatible character set. Upgrades will (or at least should) be
> sticky, so if you throw, say, a unicode string into an array full of
> Latin-1 characters, by the time you're done sorting everything'll be
> promoted to Unicode and worst case you'll have some ringing as the
> conversion propagates through.
>
> I may, though, be completely deluded about that one.

Well, if you upgrade everything as you go, it will probably converge,
but your sort order will likely depend on your initial order and your
sort algorithm (that is, quicksort v. bubble sort v. heap sort), which
is another way of saying it's indeterminate. This is because not every
array element will end up being matched against all others, so only
some of them will end up getting "upgraded". Certainly, the algorithmic
efficiency will be decreased.
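
To make the underlying problem concrete, here is a small Python sketch.
It assumes two made-up orderings: raw byte order for "latin-1" strings
and a case-insensitive order for "unicode" strings. Both orderings and
the less_than() helper are hypothetical, chosen only to show the effect.
When a mixed pair is promoted just for that one comparison, the
comparison stops being transitive, and no sort algorithm can give a
well-defined answer.

```python
def charset_key(charset, text):
    """Sort key under a hypothetical per-charset ordering."""
    if charset == "latin-1":
        # Raw byte order: "Z" (90) sorts before "a" (97).
        return text.encode("latin-1")
    # Stand-in for a case-insensitive Unicode collation.
    return text.casefold()

def less_than(left, right):
    """Compare two (charset, text) pairs, promoting only on mismatch."""
    (left_set, left_text), (right_set, right_text) = left, right
    if left_set == right_set:
        return (charset_key(left_set, left_text)
                < charset_key(right_set, right_text))
    # Mixed pair: promote both to "unicode" for this comparison only.
    return (charset_key("unicode", left_text)
            < charset_key("unicode", right_text))

a = ("latin-1", "Z")
b = ("latin-1", "a")
c = ("unicode", "m")

# a < b (byte order), b < c and c < a (promoted order): a cycle.
cycle = less_than(a, b) and less_than(b, c) and less_than(c, a)
```

Making the upgrade sticky removes the cycle eventually, but which
comparisons happen before the upgrade depends on the algorithm and the
initial order.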

If upgrades are sticky (which makes sense, in order to minimize
duplicated computation), then (due to the "character set" discussion
above), the semantics of my strings will change upon sorting them
(since their character sets will change).

See how that all doesn't make much sense?

JEff

Jarkko Hietaniemi

Apr 28, 2004, 2:16:00 PM

to perl6-i...@perl.org, Jeff Clites, Dan Sugalski, perl6-i...@perl.org

> I think you're basically forcing this concept onto national standards
> which lack it. I don't think that most of the national standards
> actually define the semantics of the characters they encode
> (categorizations, case mapping, sort order), and although they assign
> byte sequences to represent their characters, I'm not sure they
> actually present this in terms of assigning integers to them, in the
> sense of code points v. byte sequences.

Yeah. Let's take, say, ISO 8859-1:

http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n411.pdf

No "semantics", just an assignment of abstract characters to numbers
and the respective bit patterns.

Jeff Clites

Apr 30, 2004, 11:38:18 AM

to Dan Sugalski, Perl 6 Internals

On Apr 28, 2004, at 5:01 AM, Dan Sugalski wrote:

> At 3:17 AM -0700 4/28/04, Jeff Clites wrote:
>> On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:
>>

>>> For example, consider the following:
>>>
>>> use Unicode;
>>> open FOO, "foo.txt", :charset(latin-3);
>>> open BAR, "bar.txt", :charset(big5);
>>> $filehandle = 0;
>>> while (<>) {
>>>     if ($filehandle++) {
>>>         print FOO $_;
>>>     } else {
>>>         print BAR $_;
>>>     }
>>>     $filehandle %= 2;
>>> }
>>

>> What's the input record separator here?
>
> The filehandle default, which depends on the encoding and character
> set of the input data, or so Larry's told me.

So the nature of my question here is that I assume the input record
separator will be set as a string, with something similar to: $/ = "\n"
or $/ = "----" or whatever.

If that's the case, presumably the user won't have to keep resetting it
as they open files stored in different encodings, if (from their
point of view) they're using the same separator--they'll just set it
once. But having it defined as a string would seem to imply that you'll
have to transcode as you read to a common representation, in order to
find the line endings. That is, if $/ was assigned "latin-1" when it
was created, then you'll be forced to transcode to UTF-8 (or something)
as you read, right?

JEff

Larry Wall

Apr 30, 2004, 12:02:10 PM

to Perl 6 Internals

On Fri, Apr 30, 2004 at 08:38:18AM -0700, Jeff Clites wrote:

: On Apr 28, 2004, at 5:01 AM, Dan Sugalski wrote:
:
: >At 3:17 AM -0700 4/28/04, Jeff Clites wrote:
: >>On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:
: >>
: >>>For example, consider the following:
: >>>
: >>> use Unicode;
: >>> open FOO, "foo.txt", :charset(latin-3);
: >>> open BAR, "bar.txt", :charset(big5);
: >>> $filehandle = 0;
: >>> while (<>) {
: >>>     if ($filehandle++) {
: >>>         print FOO $_;
: >>>     } else {
: >>>         print BAR $_;
: >>>     }
: >>>     $filehandle %= 2;
: >>> }
: >>
: >>What's the input record separator here?
: >
: >The filehandle default, which depends on the encoding and character
: >set of the input data, or so Larry's told me.
:
: So the nature of my question here is that I assume the input record
: separator will be set as a string, with something similar to: $/ = "\n"
: or $/ = "----" or whatever.

Well, it's very good of you to state your assumption out front,
because it happens to be inaccurate. There is no $/ anymore.
Input record separator is an attribute of the filehandle in Perl 6,
for some definition of attribute, and some definition of filehandle,
which may or may not involve real attributes and/or layers.

And before you ask, chomping is also filehandle dependent. In fact,
it depends on each line, since if the input record separator is
a pattern, it can match different ways. So chomping will generally
be done right within the <>, if you've asked for autochomping.
Alternately, the filehandle can mark the string somehow to indicate
where it should be chomped if you decide to chomp it later.
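
A rough sketch of why chomping becomes per-line once the separator is a
pattern, written in Python with its re module. The read_records() name
and the idea of returning the matched separator alongside each record
are assumptions for illustration, not how Perl 6 filehandles are
specified.

```python
import re

def read_records(text, separator_pattern):
    """Yield (record, matched_separator) pairs.

    The matched separator can differ from one record to the next, so
    the amount to chomp has to be remembered per record.
    """
    position = 0
    for match in re.finditer(separator_pattern, text):
        yield text[position:match.start()], match.group(0)
        position = match.end()
    if position < len(text):
        # Final record with no trailing separator.
        yield text[position:], ""

records = list(read_records("a\r\nb\nc", r"\r\n|\n"))
```

Here the first record needs two characters chomped and the second only
one, which is the information the filehandle would have to carry if
chomping is deferred.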

And just as a BTW, if you've asked for autochomping, you'd better use

for <> {...}

rather than

while <> {...}

since Perl 6 probably won't do the Perl 5 hack that makes the latter mean

while defined($_ = <>) {...}
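
The pitfall has a close analogue in Python, sketched below: once lines
arrive already chomped, a blank line is an empty string, and an empty
string is false. A loop that tests the value itself stops early, while
iteration does not. The chomped_lines() helper is invented for this
example.

```python
def chomped_lines(text):
    """Yield each line with its line ending already removed."""
    for line in text.splitlines():
        yield line

source = "first\n\nthird\n"

# Loop driven by the truth of the value: stops at the blank line.
reader = chomped_lines(source)
seen_by_while = []
while (line := next(reader, None)):
    seen_by_while.append(line)

# Loop driven by iteration: sees every line, blank or not.
seen_by_for = list(chomped_lines(source))
```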

And before you point out that <> in a list context will use up all your
memory, I'll point out that it doesn't in Perl 6. :-)

Offhand, I can't think of any more words to put in your mouth...

: If that's the case, presumably the user won't have to keep resetting it
: as they open files stored in different encodings, if (from their
: point of view) they're using the same separator--they'll just set it
: once. But having it defined as a string would seem to imply that you'll
: have to transcode as you read to a common representation, in order to
: find the line endings. That is, if $/ was assigned "latin-1" when it
: was created, then you'll be forced to transcode to UTF-8 (or something)
: as you read, right?

$/ is gone. But if there were a $/, it would do the Right Thing. :-)

(Which, in Perl 6, is to have consistent Unicode semantics regardless
of the supposed encoding of the string.)

Arguably, this discussion should be happening in p6l rather than p6i...

Larry

Jeff Clites

Apr 30, 2004, 10:21:50 PM

to Perl 6 Internals

On Apr 30, 2004, at 9:02 AM, Larry Wall wrote:

> On Fri, Apr 30, 2004 at 08:38:18AM -0700, Jeff Clites wrote:
> : On Apr 28, 2004, at 5:01 AM, Dan Sugalski wrote:
> :
> : >At 3:17 AM -0700 4/28/04, Jeff Clites wrote:
> : >>On Apr 23, 2004, at 2:43 PM, Dan Sugalski wrote:
> : >>
> : >>>For example, consider the following:
> : >>>
> : >>> use Unicode;
> : >>> open FOO, "foo.txt", :charset(latin-3);
> : >>> open BAR, "bar.txt", :charset(big5);
> : >>> $filehandle = 0;
> : >>> while (<>) {
> : >>>     if ($filehandle++) {
> : >>>         print FOO $_;
> : >>>     } else {
> : >>>         print BAR $_;
> : >>>     }
> : >>>     $filehandle %= 2;
> : >>> }
> : >>
> : >>What's the input record separator here?
> : >
> : >The filehandle default, which depends on the encoding and character
> : >set of the input data, or so Larry's told me.
> :
> : So the nature of my question here is that I assume the input record
> : separator will be set as a string, with something similar to: $/ = "\n"
> : or $/ = "----" or whatever.

....

> $/ is gone. But if there were a $/, it would do the Right Thing. :-)
>
> (Which, in Perl 6, is to have consistent Unicode semantics regardless
> of the supposed encoding of the string.)
>
> Arguably, this discussion should be happening in p6l rather than p6i...

Well, the implementation point that I was getting at, which perhaps I
should have stated more clearly up front, was this: if one gets to
specify a default input-record-separator, and that's done as a
string, then you're going to have to (in Dan's plan as stated)
transcode your input-record-separator and your input stream to a common
character set/encoding, so you're paying the computational price that
Dan indicated the above code could avoid. If the input-record-separator
is specified as a byte-sequence rather than as a string, then the
plan I had in mind would also avoid the overhead of "decoding" the
string.

So the implementation point was that this example doesn't seem to argue
that Dan's plan has a performance advantage.
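
The two cases can be sketched roughly as follows, in Python. The
function names are hypothetical and neither is meant to mirror Dan's
design or mine exactly; they only show where the decoding cost lands.

```python
def split_decoded(raw, encoding, separator):
    """Separator given as a string: decode the stream, then split."""
    return raw.decode(encoding).split(separator)

def split_raw(raw, separator_bytes):
    """Separator given as bytes: split without decoding anything."""
    return raw.split(separator_bytes)

raw = "año\nniño".encode("latin-1")

decoded_records = split_decoded(raw, "latin-1", "\n")
raw_records = split_raw(raw, b"\n")

# The byte form of the "same" separator depends on the stream encoding.
utf16_newline = "\n".encode("utf-16-le")
```

A byte-level split is only safe when the separator's bytes cannot occur
inside the encoding of some other character, which is one more reason
the string form ends up forcing a decode.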

JEff
