One change to the strings document

Dan Sugalski

Apr 25, 2004, 4:34:49 PM
to perl6-i...@perl.org
Just a heads up, there are two things that have been pointed out.

First, the transset op is transcharset. The abbreviation was a bit sloppy.

Second, in spots where "character" is used, substitute "grapheme", as
I'm going to. Noting, of course, that a grapheme is *not* a glyph.
Glyphs are display things that we're staying very very (very!) far
away from. The change'll go into the op names--getglyph instead of
getcharacter and suchlike things.

Hopefully using a different word'll help people remember that
glyph!=codepoint, though we'll see how well that one works.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk

Bryan C. Warnock

Apr 25, 2004, 9:34:53 PM
to Dan Sugalski, perl6-i...@perl.org
On Sun, 2004-04-25 at 16:34, Dan Sugalski wrote:
> Just a heads up, there are two things that have been pointed out.
>
> First, the transset op is transcharset. The abbreviation was a bit sloppy.
>
> Second, in spots where "character" is used, substitute "grapheme", as
> I'm going to. Noting, of course, that a grapheme is *not* a glyph.
> Glyphs are display things that we're staying very very (very!) far
> away from. The change'll go into the op names--getglyph instead of
> getcharacter and suchlike things.
>
> Hopefully using a different word'll help people remember that
> glyph!=codepoint, though we'll see how well that one works.

I don't understand. Substitute grapheme for character, as you're
staying away from glyphs, but "getglyph" for "getcharacter"?

And what about codepoints that *are* glyphs and/but aren't graphemes?

--
Bryan C. Warnock
bwarnock@(gtemail.net|raba.com)

Dan Sugalski

Apr 26, 2004, 8:12:29 AM
to Bryan C. Warnock, perl6-i...@perl.org
At 9:34 PM -0400 4/25/04, Bryan C. Warnock wrote:
>On Sun, 2004-04-25 at 16:34, Dan Sugalski wrote:
>> Just a heads up, there are two things that have been pointed out.
>>
>> First, the transset op is transcharset. The abbreviation was a bit sloppy.
>>
>> Second, in spots where "character" is used, substitute "grapheme", as
>> I'm going to. Noting, of course, that a grapheme is *not* a glyph.
>> Glyphs are display things that we're staying very very (very!) far
>> away from. The change'll go into the op names--getglyph instead of
>> getcharacter and suchlike things.
>>
>> Hopefully using a different word'll help people remember that
>> glyph!=codepoint, though we'll see how well that one works.
>
>I don't understand. Substitute grapheme for character, as you're
>staying away from glyphs, but "getglyph" for "getcharacter"?

Gah. And that sound is the sound of me banging my head against the
wall because I'm an idiot. It's grapheme, everywhere.

>And what about codepoints that *are* glyphs and/but aren't graphemes?

Where do we have those? (I'm getting tempted instead to just call
them fred--it'll at least avoid some of this confusion...)

Bryan C. Warnock

Apr 26, 2004, 5:22:15 PM
to Dan Sugalski, perl6-i...@perl.org
On Mon, 2004-04-26 at 08:12, Dan Sugalski wrote:
> At 9:34 PM -0400 4/25/04, Bryan C. Warnock wrote:
> >On Sun, 2004-04-25 at 16:34, Dan Sugalski wrote:
> >> Just a heads up, there are two things that have been pointed out.
> >>
> >> First, the transset op is transcharset. The abbreviation was a bit sloppy.
> >>
> >> Second, in spots where "character" is used, substitute "grapheme", as
> >> I'm going to. Noting, of course, that a grapheme is *not* a glyph.
> >> Glyphs are display things that we're staying very very (very!) far
> >> away from. The change'll go into the op names--getglyph instead of
> >> getcharacter and suchlike things.
> >>
> >> Hopefully using a different word'll help people remember that
> >> glyph!=codepoint, though we'll see how well that one works.
> >
> >I don't understand. Substitute grapheme for character, as you're
> >staying away from glyphs, but "getglyph" for "getcharacter"?
>
> Gah. And that sound is the sound of me banging my head against the
> wall because I'm an idiot. It's grapheme, everywhere.
>
> >And what about codepoints that *are* glyphs and/but aren't graphemes?
>
> Where do we have those? (I'm getting tempted instead to just call
> them fred--it'll at least avoid some of this confusion...)

Beats me. I don't know what you mean by grapheme. Or glyph.
:-)

The web has a wide variety of definitions, most of them centered on some
association with a spoken language (the grapheme/phoneme
association).

While that certainly covers what I think you mean - letters,
ideographs, diacritical combinations, etc. - and I'm fairly
certain that extends to other written representations of
language - punctuation, white space, numerics - I don't know if
it extends to things that aren't. The Arabic tatweel (0x0640),
for instance, is purely a typesetting construct.

Then you've got non-language things like math operators,
arrows, and "dingbats".

And *then* you've got several ranges of "Presentation Forms",
which Unicode explicitly references as glyphs. For instance,
see 0xFB50 - 0xFDFF, Arabic Presentation Forms-A.
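
For anyone who wants to poke at these two cases, here is a minimal sketch using Python's standard unicodedata module. The choice of Python and the helper name describe are illustrative assumptions, not anything from the strings document. It looks up the tatweel and the first code point of Arabic Presentation Forms-A, and shows that the presentation form carries a compatibility decomposition back to an ordinary letter.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category, decomposition) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )

# The tatweel is encoded like any other character, even though its job is typographic.
print(describe("\u0640"))

# First code point of Arabic Presentation Forms-A: a positional form whose
# compatibility decomposition points back at the ordinary letter.
print(describe("\uFB50"))

# Compatibility normalization (NFKC) folds the presentation form away.
print(unicodedata.normalize("NFKC", "\uFB50") == "\u0671")
```

So the presentation forms are still characters as far as the encoding is concerned; they just happen to name a particular shape.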

Perhaps fred *is* better.

Jeff Clites

Apr 27, 2004, 5:00:13 AM
to Bryan C. Warnock, Dan Sugalski, Perl 6 Internals
On Apr 26, 2004, at 5:12 AM, Dan Sugalski wrote:

> At 9:34 PM -0400 4/25/04, Bryan C. Warnock wrote:
>
>> And what about codepoints that *are* glyphs and/but aren't graphemes?
>
> Where do we have those? (I'm getting tempted instead to just call them
> fred--it'll at least avoid some of this confusion...)

There shouldn't be those anywhere. At least under the usual
definitions, a glyph is a graphic representation of a character (so
different fonts define different glyphs to represent the same
character), and a grapheme is a sequence of one or more characters
which a common language user would consider as a unit. [Note that this
usage differs from what a linguist means by a "grapheme", so the
Unicode standard currently uses the term "grapheme cluster" rather than
"grapheme", to minimize confusion.]

And further, the Unicode standard defines character (or abstract
character) as picking out an "abstract meaning _or_ abstract shape", so
a character for the "ff ligature" seems to be picking out something
related to a visual representation, but it's actually not picking out a
glyph (since that ligature looks different in different fonts).

(And ideally ligatures such as the above wouldn't be considered
separate characters, but several standards treat them that way, and
consequently Unicode includes them for backward compatibility with
these standards. So for new usage they should be avoided, instead
letting a rendering engine display a ligature glyph for a sequence of
two "f" characters. But you'll still encounter them "in the wild".)
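
A small sketch of the ligature case, again assuming Python and its standard unicodedata module purely for illustration. The "ff ligature" is a single character with its own name, and compatibility normalization (NFKC) turns it back into two "f" characters, while canonical normalization (NFC) leaves it alone.

```python
import unicodedata

ligature = "\uFB00"

# One character, with a name of its own.
print(unicodedata.name(ligature))   # LATIN SMALL LIGATURE FF
print(len(ligature))                # 1

# Compatibility normalization replaces it with the two letters it stands for.
compat = unicodedata.normalize("NFKC", ligature)
print(compat, len(compat))          # ff 2

# Canonical normalization does not touch it: the ligature is only
# compatibility-equivalent to "ff", not canonically equivalent.
print(unicodedata.normalize("NFC", ligature) == ligature)  # True
```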

(Also, I'm using the term "character" to match the Unicode standard's
usage, but it's the same thing for which others are using the word
"codepoint". But I'm avoiding the latter usage because it's got some
problems: (1) a code point is a number which picks out an abstract
character--there's a one-to-one mapping between the two, but they're
different things; (2) a "code point" implies an assignment of numbers
to abstract characters, and if you're thinking of an approach like the
one Dan spelled out, then you need to say _which_ assignment of numbers
to characters you're talking about at any given time; and (3) it's
supposed to be "code point", not "codepoint".)
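
To illustrate point (2), here is a hedged sketch in Python (chosen only for illustration; the helper name code_in is hypothetical). The same abstract character gets a different number depending on which assignment you ask, so a bare number means nothing until the charset is named.

```python
ch = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

def code_in(charset):
    """Return the number a single-byte charset assigns to ch."""
    return ch.encode(charset)[0]

# Three legacy single-byte charsets, three different numbers for one character.
for charset in ("latin-1", "cp437", "mac_roman"):
    print(f"{charset}: 0x{code_in(charset):02X}")

# The Unicode code point happens to match the Latin-1 value.
print(f"Unicode: U+{ord(ch):04X}")
```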

JEff

Dan Sugalski

Apr 27, 2004, 9:57:51 AM
to Jeff Clites, Bryan C. Warnock, Perl 6 Internals
At 2:00 AM -0700 4/27/04, Jeff Clites wrote:
>On Apr 26, 2004, at 5:12 AM, Dan Sugalski wrote:
>
>>At 9:34 PM -0400 4/25/04, Bryan C. Warnock wrote:
>>
>>>And what about codepoints that *are* glyphs and/but aren't graphemes?
>>
>>Where do we have those? (I'm getting tempted instead to just call
>>them fred--it'll at least avoid some of this confusion...)
>
>There shouldn't be those anywhere. At least under the usual
>definitions, a glyph is a graphic representation of a character (so
>different fonts define different glyphs to represent the same
>character), and a grapheme is a sequence of one or more characters
>which a common language user would consider as a unit. [Note that
>this usage differs from what a linguist means by a "grapheme", so
>the Unicode standard currently uses the term "grapheme cluster"
>rather than "grapheme", to minimize confusion.]

I think... we'll stick with grapheme and deal with the confusion,
though it may ultimately only be me that's confused. I'm used to that
by now, though.

So, are we otherwise set on the plan, such that it can be implemented
and we can work on some of the parts that haven't been specified?
(Like the behaviour of binary operations and some of the underlying
functionality that the regex engine will need and libraries will have
to implement?) This'd be a good time to speak up as I just want it
all to be *done*, dammit. :)

Leopold Toetsch

Apr 27, 2004, 10:07:31 AM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:

> So, are we otherwise set on the plan, such that it can be implemented
> and we can work on some of the parts that haven't been specified?

Some syntax glue for PASM/PIR is needed. How do I specify
encoding/charset/language? The same holds for IO. We first have to get
various strings in and out to test the implementation.

We need a default fallback behavior in the absence of ICU code too.

WRT char/code point/grapheme/character/glyph a glossary *with* examples
would really be helpful.
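
As a rough idea of the kind of example such a glossary could carry, here is a sketch in Python (an illustrative choice; the function name simple_graphemes is hypothetical). It contrasts code points with graphemes by gluing combining marks onto the preceding base character. This is a deliberate simplification and not the full set of grapheme cluster rules.

```python
import unicodedata

def simple_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Simplified on purpose: real grapheme cluster boundaries have more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # nonzero combining class: attach to previous cluster
        else:
            clusters.append(ch)     # base character: start a new cluster
    return clusters

word = "cafe\u0301"                 # "e" followed by COMBINING ACUTE ACCENT

print(len(word))                    # 5 code points
print(len(simple_graphemes(word)))  # 4 graphemes

# Canonical composition turns the last two code points into one.
print(unicodedata.normalize("NFC", word) == "caf\u00E9")  # True
```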

leo

Dan Sugalski

Apr 27, 2004, 11:17:35 AM
to l...@toetsch.at, perl6-i...@perl.org

I'll add in all these in a second draft. I don't think any of them
alter the semantics or design, though, so I'll take 'em in parallel.

Jeff Clites

Apr 27, 2004, 12:41:34 PM
to Dan Sugalski, Perl 6 Internals
On Apr 27, 2004, at 6:57 AM, Dan Sugalski wrote:

> So, are we otherwise set on the plan, such that it can be implemented
> and we can work on some of the parts that haven't been specified?

I have several questions/comments/concerns, and I'm splitting them up
into bite-sized pieces. I've just sent out the first one.

JEff

Dan Sugalski

Apr 27, 2004, 1:15:13 PM
to Jeff Clites, Perl 6 Internals

Cool. Then we'll work that out before we go any further, and if I get
bored I'm sure the event&IO design can use the work. :)
