catcodes of superscripts, etc.

Will Robertson

Jun 1, 2010, 1:24:29 AM

to Unicode-based TeX for Mac OS X and other platforms, uni...@googlegroups.com, Jonathan Kew

(Sorry for the cross-posting, but this seems of very minor possible interest to a few different people.)

Hi Jonathan,

I've just noticed that superscript i and n have catcode 11, as do the Latin & Greek subscripts. (Code points appended.)

I'm going to change this in unicode-math (so they're all cc12), but do you think this is something that should be changed in the ini file?

(I'll check this with LuaTeX, too.)

The reason it's a problem is that users try to write things like $\Cⁿ$ and the expected behaviour, well, isn't.

Many thanks,
-- Will

2071 {i}
207f {n}

2090 {a}
2091 {e}
1d62 {i}
2092 {o}
1d63 {r}
1d64 {u}
1d65 {v}
2093 {x}
1d66 {\beta}
1d67 {\gamma}
1d68 {\rho}
1d69 {\phi}
1d6a {\chi}
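
One way to confirm the report on a given installation is to ask the engine directly. This is only a minimal sketch: it assumes a XeLaTeX format that loads the usual Unicode catcode setup, and it checks just four of the code points listed above. Note that TeX's `"` hexadecimal notation needs upper-case digits.

```latex
% Compile with xelatex. Nothing is typeset; the values go to the terminal and log.
\documentclass{article}
\begin{document}
\message{U+2071: \the\catcode"2071}% superscript i
\message{U+207F: \the\catcode"207F}% superscript n
\message{U+2090: \the\catcode"2090}% subscript a
\message{U+1D66: \the\catcode"1D66}% subscript beta
\end{document}
```

A reported value of 11 means the character is a "letter" and can become part of a control word such as \Cⁿ; 12 means "other", so the control word would end before it.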

Ross Moore

Jun 1, 2010, 1:31:22 AM

to uni...@googlegroups.com

Hi Will,

On 01/06/2010, at 3:24 PM, Will Robertson wrote:

> (Sorry for the cross-posting, but this seems of very minor possible
> interest to a few different people.)
>
> Hi Jonathan,
>
> I've just noticed that catcodes of superscript i and n have catcode
> 11, as do the latin & greek subscripts. (Code points appended.)
>
> I'm going to change this in unicode-math (so they're all cc12),

Yes, this seems more sensible.
Surely the only reason to have cc11 is if you want to allow
the character to be used within macro-names.

What other characters have cc11 ?
Does this include pre-formed accented Latin characters?
Presumably most other languages' chars are OK for cc11 too.

> but do you think this is something that should be changed in the
> ini file?
>
> (I'll check this with LuaTeX, too.)
>
> The reason it's a problem is that users try to write things like $
> \Cⁿ$ and the expected behaviour, well, isn't.

Indeed, it would. :-)

>
> Many thanks,
> -- Will
>
>
> 2071 {i}
> 207f {n}
>
> 2090 {a}
> 2091 {e}
> 1d62 {i}
> 2092 {o}
> 1d63 {r}
> 1d64 {u}
> 1d65 {v}
> 2093 {x}
> 1d66 {\beta}
> 1d67 {\gamma}
> 1d68 {\rho}
> 1d69 {\phi}
> 1d6a {\chi}
>

Cheers,

Ross

------------------------------------------------------------------------
Ross Moore ross....@mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------

Will Robertson

Jun 1, 2010, 2:07:55 AM

to uni...@googlegroups.com

On 01/06/2010, at 3:01 PM, Ross Moore wrote:

> What other characters have cc11 ?
> Does this include pre-formed accented Latin characters?
> Presumably most other languages' chars are OK for cc11 too.

As far as I know, all characters that Unicode classes as alphabetic letters are also catcode-11 letters in XeTeX.

E.g.,

\def\éβЖ{odd}\show\éβЖ

This has actually been remarked upon by the same tester who found the superscript case; he was surprised to write

\def\C{\mathbb{C}}
...
$\Cβ$

and have an "undefined control sequence" error.

But I explained that that's XeTeX's standard behaviour and I don't want to change it.
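
The mechanism can be sketched as follows. Because β has catcode 11, TeX's scanner keeps reading letters after the backslash and ends up with the single control word \Cβ, which nobody has defined. Anything that ends the control word early avoids this. The document below is an illustration only; it assumes xelatex with unicode-math (for \mathbb and for β in math) and reuses the \def from the example above.

```latex
% Compile with xelatex; assumes unicode-math and its default math font are installed.
\documentclass{article}
\usepackage{unicode-math}
\def\C{\mathbb{C}}   % as in the message above; \def overrides any existing \C
\begin{document}
% $\Cβ$   % would be scanned as the single control sequence \Cβ (undefined)
$\C β$    % a space ends the control word \C before β
$\C{}β$   % an empty group ends it as well
\end{document}
```

This is the same rule that makes $\CB$ fail for a Latin B; the only new part is that non-ASCII letters take part in it too.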

(I guess I could have a package option to make all Greek letters cc12, but I don't like the idea that that would make the source less portable.)

Will

Jonathan Kew

Jun 1, 2010, 4:43:13 AM

to uni...@googlegroups.com

On 1 Jun 2010, at 07:07, Will Robertson wrote:

> On 01/06/2010, at 3:01 PM, Ross Moore wrote:
>
>> What other characters have cc11 ?
>> Does this include pre-formed accented Latin characters?
>> Presumably most other languages' chars are OK for cc11 too.
>
> As far as I know, all characters that Unicode classes as alphabetic letters are also catcode-11 letters in XeTeX.

That's correct (or at least it's intended to be). The unicode-letters.tex file that sets up these codes is created from the Unicode character database, and relies on the General Category property to decide which characters get catcode 11.

>
> E.g.,
>
> \def\éβЖ{odd}\show\éβЖ
>
> This has actually been remarked upon by the same tester who found the superscript case; he was surprised to write
>
> \def\C{\mathbb{C}}
> ...
> $\Cβ$
>
> and have an "undefined control sequence" error.
>
> But I explained that that's XeTeX's standard behaviour and I don't want to change it.
>
> (I guess I could have a package option to turn all greek cc12, but I don't like the idea that that would make the source less portable.)

I'd be very reluctant to deviate from the standard Unicode properties in the "global" defaults. In a matter like this, we'd never find a "perfect" solution that suits every user and use-case. I think it's better to follow an established standard, even though it may lead to surprising results for some people in some situations, than to try to reinvent the classification of the (literally) thousands of characters that could occur in a file. That path leads to endless

If someone really wants to be able to write $\Cβ$ rather than $\C β$ (even though it has to be $\C B$ if the second letter is Latin), I'd suggest that they use an \everymath hook or something like that to change the catcodes just within the math environment.
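
A minimal sketch of such a hook, under some assumptions: it handles only β (U+03B2), it appends to \everymath and \everydisplay rather than overwriting them, and it relies on math mode being a group so that the catcode change is undone when the formula ends. The preamble mirrors the earlier example and is not taken from any package.

```latex
% Compile with xelatex; assumes unicode-math is installed.
\documentclass{article}
\usepackage{unicode-math}
\def\C{\mathbb{C}}
% Append to the hooks instead of replacing whatever is already in them.
\everymath\expandafter{\the\everymath \catcode"03B2=12 }
\everydisplay\expandafter{\the\everydisplay \catcode"03B2=12 }
\begin{document}
\(\Cβ\)   % β is "other" once the hook has run, so the control word is \C
\end{document}
```

Two caveats apply, as far as I understand TeX's scanner. First, after a `$` TeX reads one more token to check for `$$` before the hook is inserted, so in `$\Cβ$` the control sequence directly after the `$` is still scanned with the old catcodes; the `\(...\)` form above, or any token between the `$` and `\C`, avoids that. Second, catcode changes never affect text that has already been tokenized, for example formulas passed as macro arguments.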

JK

Taco Hoekwater

Jun 1, 2010, 4:53:31 AM

to uni...@googlegroups.com

Hi,

Jonathan Kew wrote:
>
> If someone really wants to be able to write $\Cβ$ rather than $\C β$
> (even though it has to be $\C B$ if the second letter is Latin), I'd
> suggest that they use an \everymath hook or something like that to
> change the catcodes just within the math environment.

LuaTeX itself does not actually set up these catcodes (the formats do,
but then I assume this is the same for XeTeX).

For what it's worth: I agree with Jonathan that Unicode letters are
letters, and I believe we should not be making exceptions to that just
to simplify life for users from certain Western countries that only
need ASCII.

Best wishes,
Taco

Will Robertson

Jun 1, 2010, 5:39:17 AM

to uni...@googlegroups.com

On 01/06/2010, at 6:23 PM, Taco Hoekwater wrote:

> LuaTeX itself does not actually set up these catcodes (the formats do,
> but then I assume this is the same for XeTeX).

Right.

> For what it's worth: I agree with Jonathan that Unicode letters are
> letters, and I believe we should not be making exceptions to that just
> to simplify life for users from certain Western countries that only
> need ASCII.

We're all in agreement there, and I'm happy to leave things as they are, but what about the subscripts and superscripts I originally mentioned?

2071 {i}
207f {n}

2090 {a}
2091 {e}
1d62 {i}
2092 {o}
1d63 {r}
1d64 {u}
1d65 {v}
2093 {x}
1d66 {\beta}
1d67 {\gamma}
1d68 {\rho}
1d69 {\phi}
1d6a {\chi}

For all I know, they *are* letters in a language somewhere, but I assumed originally that they were simply symbols that ended up with the wrong designation of being "letters". Am I wrong about this?

We've got three options:

1 - change them to cc12 in the ini file
2 - change them to cc12 in unicode-math
3 - not change them at all

Originally I was asking about #1, and right now I'm with #2, but if you think this will lead to trouble down the road I'm happy to revert to #3.

-- Will

Jonathan Kew

Jun 1, 2010, 6:18:56 AM

to uni...@googlegroups.com

On 1 Jun 2010, at 10:39, Will Robertson wrote:
>
> We're all in agreement there, and I'm happy to leave things as they are, but what about the subscripts and superscripts I originally mentioned?
>
> 2071 {i}

2071;SUPERSCRIPT LATIN SMALL LETTER I;Lm;0;L;<super> 0069;;;;N;;;;;

> 207f {n}

207F;SUPERSCRIPT LATIN SMALL LETTER N;Lm;0;L;<super> 006E;;;;N;;;;;

etc...

>
> For all I know, they *are* letters in a language somewhere, but I assumed originally that they were simply symbols that ended up with the wrong designation of being "letters". Am I wrong about this?

The Unicode category is typically "Lm" = "LETTER, MODIFIER" for the superscripts, or "Ll" = "LETTER, LOWERCASE" for a number of the subscripted ones I checked. So the designation as "letters" is deliberate.

I suspect most of them are used primarily in technical and archaic orthographies (e.g., medievalists transcribing old manuscripts) rather than modern languages, though I wouldn't be surprised if a few of them are in current use somewhere - e.g. a superscripted vowel to represent a schwa vowel, or superscripted n and m for prenasalized stops.

> We've got three options:
>
> 1 - change them to cc12 in the ini file
> 2 - change them to cc12 in unicode-math
> 3 - not change them at all
>
> Originally I was asking about #1, and right now I'm with #2, but if you think this will lead to trouble down the road I'm happy to revert to #3.

I think #3 is the correct default, as it's the only option that has a well-defined, standardized basis.

As this is TeX, where everything is customizable and programmable, I suppose I wouldn't oppose an OPTION somewhere to explicitly change them -- after all, a user can do this with \catcode at any time -- but it should be something that the document has to deliberately "opt in" to. We can't prevent people deviating from standards, but they should at least be made aware that they're deviating.
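
Such an opt-in could be as small as one macro that the document has to call explicitly. The sketch below is hypothetical: the name \SubSupAsOther is invented for illustration and is not part of XeTeX, LuaTeX or unicode-math. It covers exactly the 15 code points listed earlier in the thread.

```latex
% \SubSupAsOther is a hypothetical name used only for this illustration.
\def\SubSupAsOther{%
  \catcode"2071=12 \catcode"207F=12 % superscript i, n
  \catcode"2090=12 \catcode"2091=12 % subscript a, e
  \catcode"2092=12 \catcode"2093=12 % subscript o, x
  \catcode"1D62=12 \catcode"1D63=12 % subscript i, r
  \catcode"1D64=12 \catcode"1D65=12 % subscript u, v
  \catcode"1D66=12 \catcode"1D67=12 % subscript beta, gamma
  \catcode"1D68=12 \catcode"1D69=12 % subscript rho, phi
  \catcode"1D6A=12 % subscript chi
}
```

Calling \SubSupAsOther in the preamble changes only how later input is tokenized. How the characters are then typeset in math is a separate matter that this sketch does not address.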

JK

Will Robertson

Jun 1, 2010, 8:39:01 AM

to uni...@googlegroups.com

On 01/06/2010, at 7:48 PM, Jonathan Kew wrote:

>> We've got three options:
>>
>> 1 - change them to cc12 in the ini file
>> 2 - change them to cc12 in unicode-math
>> 3 - not change them at all
>>
>> Originally I was asking about #1, and right now I'm with #2, but if you think this will lead to trouble down the road I'm happy to revert to #3.
>
> I think #3 is the correct default, as it's the only option that has a well-defined, standardized basis.

Right. I definitely see the logic here.

So although I suspect just maybe that option #2 would have fewer puzzling consequences to unsuspecting users, I'll switch back to option #3.

If I do get lots of feedback about this issue after more people use the package, I'll reconsider. (After all, it's only 15 Unicode characters that I'll be perverting...)

> As this is TeX, where everything is customizable and programmable, I suppose I wouldn't oppose an OPTION somewhere to explicitly change them -- after all, a user can do this with \catcode at any time -- but it should be something that the document has to deliberately "opt in" to. We can't prevent people deviating from standards, but they should at least be made aware that they're deviating.

I don't think it's any harder for the user to read some documentation about not writing $\Cⁿ$ than for them to use a package option that makes their source incompatible with someone else who doesn't use that option. It's one or the other, and unicode is unambiguous enough for me.

Very many thanks for the guidance.

-- Will

Jonathan Kew

Jun 1, 2010, 9:06:32 AM

to uni...@googlegroups.com

On 1 Jun 2010, at 07:07, Will Robertson wrote:

> This has actually been remarked upon by the same tester who found the superscript case; he was surprised to write
>
> \def\C{\mathbb{C}}
> ...
> $\Cβ$
>
> and have an "undefined control sequence" error.

Just an added comment: that's a consequence of mixing real Unicode input with TeX's pure-ASCII math representation -- i.e. the use of control sequences for any characters outside the ASCII range (roughly speaking).

The "true Unicode" approach would be something like

$ℂβ$

which should work fine, assuming adequate fonts are configured. :)
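
As a complete document this might look like the sketch below. It assumes xelatex (or lualatex) with the unicode-math package, and a math font that contains both U+2102 and U+03B2; no font is selected here, so the package's default is used.

```latex
% Compile with xelatex or lualatex; assumes unicode-math and a suitable math font.
\documentclass{article}
\usepackage{unicode-math}
\begin{document}
$ℂβ$   % no control sequences involved, so letter catcodes play no role
\end{document}
```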

JK

Ross Moore

Jun 1, 2010, 4:43:32 PM

to uni...@googlegroups.com

Hi Jonathan, Will and others

On 01/06/2010, at 8:18 PM, Jonathan Kew wrote:

> On 1 Jun 2010, at 10:39, Will Robertson wrote:
>>
>> We're all in agreement there, and I'm happy to leave things as
>> they are, but what about the subscripts and superscripts I
>> originally mentioned?
>>
>> 2071 {i}
>
> 2071;SUPERSCRIPT LATIN SMALL LETTER I;Lm;0;L;<super> 0069;;;;N;;;;;
>
>> 207f {n}
>
> 207F;SUPERSCRIPT LATIN SMALL LETTER N;Lm;0;L;<super> 006E;;;;N;;;;;
>
> etc...

Since there is not a full alphabet here, it seems that these
characters are included in Unicode because they have a special
meaning, for some particular situation(s).

That would make them symbols, rather than letters.
However the names do not suggest this, which I find to be
a bit of a deficiency in Unicode.

>> For all I know, they *are* letters in a language somewhere, but I
>> assumed originally that they were simply symbols that ended up
>> with the wrong designation of being "letters". Am I wrong about this?
>
> The Unicode category is typically "Lm" = "LETTER, MODIFIER" for the
> superscripts, or "Ll" = "LETTER, LOWERCASE" for a number of the
> subscripted ones I checked. So the designation as "letters" is
> deliberate.

One thing that is clear, though, is that they are *not* intended
to be used for superscripts and subscripts in mathematics.
I've heard anecdotal evidence about this, through Barbara
Beeton, I think --- maybe I can find some old emails.

Not that one wouldn't be tempted to do this --- I have even
written some macros to put these characters into Bookmarks,
for section titles which include math symbols.
But I would *never* use them for the body of a document.

You cannot copy/paste from bookmarks, so there is no harm
in using them there.

Aside:
Note that there are now at least 2 separate reasons why $\Cⁿ$
is a *very poor* LaTeX representation for ℂⁿ .
I've ranted before about 1-letter macro names, but \C has
long been used as a macro for a Cyrillic letter or accent
--- it should *not* be used within mathematics, or any other
context!

I cannot find the source file which defines it, but it is
listed in the "Comprehensive LaTeX Symbol List"
e.g.
http://www.ctan.org/tex-archive/info/symbols/comprehensive/
http://ctan.unsw.edu.au/info/symbols/comprehensive/SYMLIST

>
> I suspect most of them are used primarily in technical and archaic
> orthographies (e.g., medievalists transcribing old manuscripts)
> rather than modern languages, though I wouldn't be surprised if a
> few of them are in current use somewhere - e.g. a superscripted
> vowel to represent a schwa vowel, or superscripted n and m for
> prenasalized stops.

Now this I have seen, in medieval manuscripts, and this
could explain why only a few characters are supported this way.
So I'm happy to accept that these characters are correctly
classified as letters. Though I'd doubt that anyone would
really want to use them in macro names.

>
>> We've got three options:
>>
>> 1 - change them to cc12 in the ini file
>> 2 - change them to cc12 in unicode-math
>> 3 - not change them at all
>>
>> Originally I was asking about #1, and right now I'm with #2, but
>> if you think this will lead to trouble down the road I'm happy to
>> revert to #3.

Does a catcode of 11 or 12 make any difference to TeX's
typesetting algorithms? e.g., do kerning pairs apply
when one of the characters is not a letter?

If there is no difference, then surely it doesn't matter
what TeX does with them. The important thing then is
being consistent --- not in following what seems to be
a Unicode designation, just for the sake of it.
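
The kerning question can be tested rather than argued. My understanding is that catcodes are consulted only while input is being turned into tokens, and that kerns and ligatures come from the font by character code, so the two boxes below should come out equally wide. The sketch assumes any LaTeX engine and a font with an A/V kerning pair; it prints the widths instead of asserting a result.

```latex
\documentclass{article}
\begin{document}
\setbox0=\hbox{AV}% both characters have catcode 11 here
\begingroup
  \catcode`\A=12 \catcode`\V=12
  \setbox2=\hbox{AV}% same characters, now tokenized with catcode 12
  \message{widths: \the\wd0 \space vs \the\wd2}
\endgroup
\end{document}
```

If the two widths agree, the choice between cc11 and cc12 affects only which characters may appear in control-word names, not the typeset result.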

> I think #3 is the correct default, as it's the only option that has
> a well-defined, standardized basis.

That is like saying don't eat meat unless it has been killed
in a particular way --- because once this was indeed important.
Now we have moved on and the situation is different to what the
rule was designed for.

I'm not saying that #3 is wrong; just that maybe this issue needs
to be discussed properly within the TeX community, exploring
which issues can arise from the different choices.

>
> As this is TeX, where everything is customizable and programmable,
> I suppose I wouldn't oppose an OPTION somewhere to explicitly
> change them -- after all, a user can do this with \catcode at any
> time -- but it should be something that the document has to
> deliberately "opt in" to. We can't prevent people deviating from
> standards, but they should at least be made aware that they're
> deviating.

Agreed.
But there should be good reasons for the standard being what
it is agreed to be, rather than just following a convention
that is really applicable for some other kind of situation.

>
> JK

Will Robertson

Jun 1, 2010, 11:36:28 PM

to uni...@googlegroups.com

Hi Ross,

On 02/06/2010, at 6:13 AM, Ross Moore wrote:

> One thing that is clear, though, is that they are *not* intended
> to be used for superscripts and subscripts in mathematics.
> I've heard anecdotal evidence about this, through Barbara
> Beeton, I think --- maybe I can find some old emails.
>
> Not that one wouldn't be tempted to do this --- I have even
> written some macros to put these characters into Bookmarks,
> for section titles which include math symbols.
> But I would *never* use them for the body of a document.

The reason I added them all to unicode-math is that many of the other subscript/superscript symbols indeed look like they *are* for use in maths; there's +, -, =, (, ), and all the numerals. (This might just be my ignorance and all these are supposed to be used for textual purposes.)

It seemed like including the letters as well would follow the principle of least surprise (although there is now another surprise with the catcodes that could be just as confusing).

> Note that there are now at least 2 separate reasons why $\Cⁿ$
> is a *very poor* LaTeX representation for ℂⁿ .
> I've ranted before about 1-letter macro names, but \C has
> long been used as a macro for a Cyrillic letter or accent
> --- it should *not* be used within mathematics, or any other
> context!

Right, I agree in principle, and this was just an example. I have a hard time believing that the majority of the TeX-using mathematicians in the world will stop writing these shorthands, however :)

> Does a catcode of 11 or 12 make any difference to TeX's
> typesetting algorithms? e.g., do kerning pairs apply
> when one of the characters is not a letter?
>
> If there is no difference, then surely it doesn't matter
> what TeX does with them. The important thing then is
> being consistent --- not in following what seems to be
> a Unicode designation, just for the sake of it.

That's sort of where I was coming from originally.

I'm not happy about putting an option in the package, but I'm certainly open to more discussion.

Best regards,
-- Will
