
[perl #38931] [RFE] Double-quoted strings automatically determine string type


Patrick R. Michaud

Apr 16, 2006, 2:22:40 PM
to bugs-bi...@rt.perl.org
# New Ticket Created by Patrick R. Michaud
# Please include the string: [perl #38931]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/rt3/Ticket/Display.html?id=38931 >


This is a suggestion regarding double-quoted string literals
in Parrot. Currently double-quoted strings are always assumed
to be ASCII unless prefixed by a different charset identifier
such as 'unicode:' or 'iso-8859-1:'. Unfortunately, this means
that string literals like:

$S1 = "He said, \xabHello\xbb"
$S2 = "3 \u2212 4 = \u207b 1"

are treated as ASCII strings even though they obviously contain
codepoints outside of the ASCII range. (The first results in a
'malformed string' error when compiled, the second chops off the
high-order bits of the \u sequence.)

It would be really helpful to PIR emitters if Parrot could
automatically use the presence of \u or \x in double-quotes
to generate a 'unicode:' or 'iso-8859-1:' string (absent any other
prefix specification which would override). If this
were in place, producing a valid string literal for PIR would
simply be (regardless of the encoding of $S0):

$S1 = escape $S0
$S1 = concat '"', $S1
$S1 = concat $S1, '"'

Currently, an emitter must also check for the presence of any
\u or \x sequences in $S1, and then prefix the double-quoted
literal with 'unicode:' or 'iso-8859-1:' accordingly.

If this can't be easily done, then I will probably create a
"parrot_escape" function in Data::String to handle the generation,
but it would be great if Parrot could handle it natively.
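For concreteness, here is an (untested) sketch of what such a "parrot_escape"
might look like in Perl -- the name comes from the paragraph above, but the
signature and behaviour are just illustration, not an existing API. It escapes
a string into a PIR double-quoted literal and adds a charset prefix only when
one is actually needed:

    use strict;
    use warnings;

    sub parrot_escape {
        my ($str) = @_;
        my ($out, $max) = ('', 0);
        for my $cp (map { ord } split //, $str) {
            $max = $cp if $cp > $max;
            if    ($cp == 0x22)                { $out .= '\\"' }
            elsif ($cp == 0x5c)                { $out .= '\\\\' }
            elsif ($cp >= 0x20 && $cp <= 0x7e) { $out .= chr $cp }
            elsif ($cp <= 0xff)                { $out .= sprintf '\\x%02x', $cp }
            else                               { $out .= sprintf '\\u%04x', $cp }
            # (\u here assumes BMP codepoints; higher planes left aside)
        }
        # pick the cheapest charset that holds the widest codepoint
        my $prefix = $max > 0xff ? 'unicode:'
                   : $max > 0x7f ? 'iso-8859-1:'
                   :               '';
        return $prefix . '"' . $out . '"';
    }

So parrot_escape("3 \x{2212} 4") would yield unicode:"3 \u2212 4", while a
plain ASCII input needs no prefix at all.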

Thanks,

Pm

Nicholas Clark

Apr 16, 2006, 2:35:38 PM
to perl6-i...@perl.org
On Sun, Apr 16, 2006 at 11:22:40AM -0700, Patrick R. Michaud wrote:

> This is a suggestion regarding double-quoted string literals
> in Parrot. Currently double-quoted strings are always assumed
> to be ASCII unless prefixed by a different charset identifier
> such as 'unicode:' or 'iso-8859-1:'. Unfortunately, this means
> that string literals like:
>
> $S1 = "He said, \xabHello\xbb"
> $S2 = "3 \u2212 4 = \u207b 1"
>
> are treated as ASCII strings even though they obviously contain
> codepoints outside of the ASCII range. (The first results in a
> 'malformed string' error when compiled, the second chops off the
> high-order bits of the \u sequence.)

IIRC having ASCII as the default was a deliberate design choice to avoid
the confusion of "is it iso-8859-1 or is it utf-8" when encountering a
string literal with bytes outside the range 0-127.

If so, then I assume that the behaviour of your second example is wrong - it
should also be a malformed string.

If PGE is always outputting UTF-8 literals, what stops it from always
prefixing every literal "unicode:", even if it only uses Unicode characters
0 to 127?

Nicholas Clark

Patrick R. Michaud

Apr 16, 2006, 5:41:05 PM
to Nicholas Clark via RT
On Sun, Apr 16, 2006 at 11:36:10AM -0700, Nicholas Clark via RT wrote:
> IIRC having ASCII as the default was a deliberate design choice to avoid
> the confusion of "is it iso-8859-1 or is it utf-8" when encountering a
> string literal with bytes outside the range 0-127.

Reasonable. Essentially I'm thinking the rule could be that any
double-quoted literal with a \u sequence in it is utf-8,
anything with \x (and no \u) is iso-8859-1, and all else is
ASCII.
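In emitter terms, the rule would amount to something like this (illustrative
Perl only, not an existing PGE/Parrot function), assuming the literal has
already been escaped:

    # Pick a charset prefix for an already-escaped PIR literal.
    sub charset_for_literal {
        my ($escaped) = @_;
        # skip over doubled backslashes so a literal \\ can't
        # masquerade as the start of an escape sequence
        return 'unicode:'    if $escaped =~ /(?<!\\)(?:\\\\)*\\u/;
        return 'iso-8859-1:' if $escaped =~ /(?<!\\)(?:\\\\)*\\x/;
        return '';           # pure ASCII, no prefix needed
    }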

I also proposed on IRC/#parrot that instead of automatically
selecting the encoding, we have an "auto:" prefix for literals
(or pick a better name) that would do the selection for us:

$S1 = auto:"hello world" # ASCII
$S2 = auto:"hello\nworld" # ASCII
$S3 = auto:"hello \xabworld\xbb" # iso-8859-1 (or unicode)
$S4 = auto:"3 \u2212 4 = \u207b 1" # Unicode

Leo suggested doing it without "auto:" for the RFE.

> If so, then I assume that the behaviour of your second example is wrong - it
> should also be a malformed string.

On this I'm only reporting what my parrot is telling me. :-)
If it should be a malformed string, we should have a test for
that (and I'll be glad to write it if this is the case).

> If PGE is always outputting UTF-8 literals, what stops it from always
> prefixing every literal "unicode:", even if it only uses Unicode characters
> 0 to 127?

The short answer is that some string operations on unicode strings
currently require ICU in order to work properly, even if the string
values don't contain any codepoints above 255. One such operation
is "downcase", but there are others.

So, if PGE prefixes every literal as "unicode:" (and this is how
I originally had PGE) then systems w/o ICU inevitably fail with
"no ICU library present" when certain string operations are attempted.
Also, once introduced, unicode strings can easily spread
throughout the system, since an operation (e.g., concat) involving
a UTF-8 string and an ASCII string produces a UTF-8 result
even if all of the codepoints of the string value are in the
ASCII range.

Thus far PGE has handled this by looking at the resulting
escaped literal and prefixing it with "unicode:" only if it
will be needed -- i.e., if the escaped string has a '\x'
or a '\u' somewhere in it. But now I'm starting to have to
do this "check for unicode codepoint" in every PIR-emitting
system I'm working with (PGE, APL, Perl 6, etc.), which is
why I'm thinking that having PIR handle it directly would
be a better choice. (And if we decide not to let PIR handle
it, I'll create a library function for it.)

I also realized this past week that using 'unicode:' on
strings with \x (codepoints 128-255) may *still* be a bit
too liberal -- the « French angles » will still cause
"no ICU library present" errors, but would seemingly work
just fine if iso-8859-1 is attempted. I'm not wanting
to block systems w/o ICU from working on Perl 6,
so falling back to iso-8859-1 in this case seems like the
best of a bad situation. (OTOH, there are some potential
problems with it on output.)

Lastly, I suspect (and it's just a suspicion) that string
operations on ASCII and iso-8859-1 strings are likely to be
faster than their utf-8/unicode counterparts. If this is
true, then the more strings that we can keep in ASCII,
the better off we are. (And the vast majority of string
literals I seem to be generating in PIR contain only ASCII
characters.)

One other option is to make string operations such as
downcase a bit smarter and figure out that it's okay
to use the iso-8859-1 or ASCII algorithms/tables when
the strings involved don't have any codepoints above 255.
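Something like this, say (an untested Perl sketch of the idea, not Parrot's
actual string code): dispatch on the widest codepoint, and reserve the
ICU-backed path for strings that really need it.

    use List::Util 'max';

    sub smart_downcase {
        my ($str) = @_;
        my $widest = max(0, map { ord } split //, $str);
        if ($widest <= 0xff) {
            # ASCII / iso-8859-1 casing is a fixed one-to-one table
            $str =~ tr/A-Z/a-z/;
            $str =~ tr/\xC0-\xD6\xD8-\xDE/\xE0-\xF6\xF8-\xFE/;
            return $str;
        }
        return lc $str;    # stand-in for the full Unicode (ICU) path
    }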

More comments and discussion welcome,

Pm

Patrick R. Michaud

Apr 16, 2006, 5:49:42 PM
to Nicholas Clark via RT
On Sun, Apr 16, 2006 at 04:41:05PM -0500, Patrick R. Michaud wrote:
> > If PGE is always outputting UTF-8 literals, what stops it from always
> > prefixing every literal "unicode:", even if it only uses Unicode characters
> > 0 to 127?
> [...]

> Also, once introduced, unicode strings can easily spread
> throughout the system, since an operation (e.g., concat) involving
> a UTF-8 string and an ASCII string produces a UTF-8 result
> even if all of the codepoints of the string value are in the
> ASCII range.

Oops, I just rechecked and this no longer seems to be the case.
Or maybe it only happens in certain situations.

However, my other points about the problem of using 'unicode:'
on systems w/o ICU still stand. :-)

Pm

Nicholas Clark

Apr 16, 2006, 6:02:16 PM
to Patrick R. Michaud, Nicholas Clark via RT
On Sun, Apr 16, 2006 at 04:41:05PM -0500, Patrick R. Michaud wrote:

> I also realized this past week that using 'unicode:' on
> strings with \x (codepoints 128-255) may *still* be a bit
> too liberal -- the « French angles » will still cause
> "no ICU library present" errors, but would seemingly work
> just fine if iso-8859-1 is attempted. I'm not wanting
> to block systems w/o ICU from working on Perl 6,
> so falling back to iso-8859-1 in this case seems like the
> best of a bad situation. (OTOH, there are some potential
> problems with it on output.)

I haven't been near ICU for about a year, but last time I had dealings with
it, it wasn't horribly portable. Furthermore, it had set itself up for trouble
by having at least an n*m model of the world (compilers * operating systems)
rather than an n+m (treat compiler-related features independently of
operating-system-related features). I was not impressed.

Although clearly ICU is a good enough solution for now, so I'm not suggesting
"burn it", even if it floats like a duck.

> Lastly, I suspect (and it's just a suspicion) that string
> operations on ASCII and iso-8859-1 strings are likely to be
> faster than their utf-8/unicode counterparts. If this is
> true, then the more strings that we can keep in ASCII,
> the better off we are. (And the vast majority of string
> literals I seem to be generating in PIR contain only ASCII
> characters.)
>
> One other option is to make string operations such as
> downcase a bit smarter and figure out that it's okay
> to use the iso-8859-1 or ASCII algorithms/tables when
> the strings involved don't have any codepoints above 255.

IIRC Jarkko's conclusion, from having dealt with it too much in Perl 5, is to
avoid UTF-8 like the plague. Variable-length encodings are fine for
data exchange, but suck internally as soon as you want to manipulate them.
With hindsight, his view was that probably Perl 5 should have gone for
UCS-4 internally. (Oh, and don't repeat *the* Perl 5 Unicode fallacy,
assuming that 8-bit data the user gave you happens to be in ISO-8859-1 if
nothing told you this.)

I think Dan was thinking that internally everything should be fixed width,
and for practical reasons pick the smallest of 8-bit, UCS-2 and UCS-4
internally. Convert variable width to fixed width (losslessly) the first time
you need to do anything to it, and leave it that way.
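i.e., something like this toy sketch (illustrative Perl only, nothing Parrot
actually does): scan once for the widest codepoint, then store every
character at that width.

    use List::Util 'max';

    sub to_fixed_width {
        my ($str) = @_;
        my $widest = max(0, map { ord } split //, $str);
        # smallest uniform cell that holds the widest codepoint
        my ($cell, $tmpl) = $widest <= 0xff   ? (1, 'C*')   # 8 bit
                          : $widest <= 0xffff ? (2, 'S*')   # UCS-2
                          :                     (4, 'L*');  # UCS-4
        my $buf = pack $tmpl, map { ord } split //, $str;
        return ($cell, $buf);   # char i now lives at byte offset i * $cell
    }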

Specifically, even case conversion would be better done as fixed width, as
there's at least one character where 1 of uppercase/titlecase/lowercase has
a different width from the other 2. (That's before you get to special cases
such as Greek sigma)

Presumably therefore Dan's view was that while constants to the assembler
might well be fed in as UTF-8, the generated bytecode should be using the
tersest fixed width it can. I can see sense in this.

Nicholas Clark

Audrey Tang

Apr 16, 2006, 9:22:23 PM
to perl6-i...@perl.org
Nicholas Clark wrote:
> On Sun, Apr 16, 2006 at 11:22:40AM -0700, Patrick R. Michaud wrote:
>> $S1 = "He said, \xabHello\xbb"
>> $S2 = "3 \u2212 4 = \u207b 1"
>>
>> are treated as ASCII strings even though they obviously contain
>> codepoints outside of the ASCII range. (The first results in a
>> 'malformed string' error when compiled, the second chops off the
>> high-order bits of the \u sequence.)
>
> IIRC having ASCII as the default was a deliberate design choice to avoid
> the confusion of "is it iso-8859-1 or is it utf-8" when encountering a
> string literal with bytes outside the range 0-127.

Aye, it was auto-promoting to latin1 and was changed to ascii-by-default
by me and Leo a while ago.

> If so, then I assume that the behaviour of your second example is wrong - it
> should also be a malformed string.
>
> If PGE is always outputting UTF-8 literals, what stops it from always
> prefixing every literal "unicode:", even if it only uses Unicode characters
> 0 to 127?

Indeed, it would be much easier if unicode:"" on an ASCII-only string
could automatically fall back to an ASCII representation, choosing
utf8 (or better, latin1/ucs2) only if there are high-bit parts in it.

A Perlish way to solve this is to introduce another pragma, similar to
"n_operators", that controls the encoding of all string literals of the
PIR program:

.pragma encoding utf8

Once written that way, you can simply use literal « » in the program,
which reads better than \xab and \xbb anyway... :-)

Audrey


Leopold Toetsch

Apr 17, 2006, 7:52:43 AM
to Nicholas Clark, Patrick R. Michaud, Nicholas Clark via RT

On Apr 17, 2006, at 0:02, Nicholas Clark wrote:

> I think Dan was thinking that internally everything should be fixed width,
> and for practical reasons pick the smallest of 8-bit, UCS-2 and UCS-4
> internally. Convert variable width to fixed width (losslessly) the first
> time you need to do anything to it, and leave it that way.

Not really. He had the model of not converting a string at all, based
on an example where a string was just passed through parrot. I'd
prefer the fixed width encoding scheme.

leo

Larry Wall

Apr 17, 2006, 7:41:49 PM
to Leopold Toetsch, Nicholas Clark, Patrick R. Michaud, Nicholas Clark via RT
On Mon, Apr 17, 2006 at 01:52:43PM +0200, Leopold Toetsch wrote:
:

We actually discussed this several times early in the design of
Perl 6. One other point that hasn't been (re)made yet is that the
fixed-width approach scales quite well when combined with abstract
strings represented as fragments, since different fragments can be
stored in different sizes but can present a unified abstract interface.
And often one can just get away with upgrading the current fragment
rather than the whole string. The fragment approach also tends to
have nicer COW properties. It's not so nice for substr(), of course,
but you probably shouldn't be doing that on character strings anyway.
That's what buffers and arrays are for.
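A toy sketch of the fragment idea, purely illustrative (Parrot has no such
type): each fragment keeps its own cell width, and appending wide text never
forces the earlier fragments to upgrade.

    package FragmentString;
    use List::Util 'max';

    sub new { bless { frags => [] }, shift }

    # appending upgrades only the new fragment, never the existing ones
    sub append {
        my ($self, $text) = @_;
        my $widest = max(0, map { ord } split //, $text);
        my $width  = $widest <= 0xff ? 1 : $widest <= 0xffff ? 2 : 4;
        push @{ $self->{frags} }, { width => $width, text => $text };
        return $self;
    }

    # the unified abstract interface: callers never see fragment widths
    sub as_string { join '', map { $_->{text} } @{ $_[0]{frags} } }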

Whether Parrot wants to do fragmentary strings is another matter.
If we stick with fixed uniform strings then languages such as Perl 6
could certainly build something more flexible on top of that, though
perhaps not as efficiently.

Larry

Patrick R. Michaud

Apr 18, 2006, 9:27:00 AM
to Audrey Tang via RT
On Sun, Apr 16, 2006 at 06:24:58PM -0700, Audrey Tang via RT wrote:
> Nicholas Clark wrote:
> > IIRC having ASCII as the default was a deliberate design choice to avoid
> > the confusion of "is it iso-8859-1 or is it utf-8" when encountering a
> > string literal with bytes outside the range 0-127.
>
> Aye, it was auto-promoting to latin1 and was changed to ascii-by-default
> by me and Leo a while ago.

After reading some of the comments and thinking about it some
more, I think having double-quoted strings auto-promote to
iso-8859-1 by default is probably not a good idea.

> > If PGE is always outputting UTF-8 literals, what stops it from always
> > prefixing every literal "unicode:", even if it only uses Unicode characters
> > 0 to 127?
>
> Indeed, it would be much easier if unicode:"" on an ASCII-only string
> could automatically fall back to an ASCII representation, choosing
> utf8 (or better, latin1/ucs2) only if there are high-bit parts in it.

Yes, this would be a good solution also.

(It doesn't resolve the original problem that prompted this
RFE, so I'll open a separate ticket for that.)

Thanks,

Pm


Patrick R. Michaud via RT

May 1, 2006, 4:09:15 PM
to perl6-i...@perl.org
FWIW, I'm retracting this particular RFE (and closing the ticket). I've
since decided it's better to have things the way they are now.

Pm
