[perl #36794] [BUG] substr opcode segfault

Will Coleda

Aug 3, 2005, 2:58:25 PM

to bugs-bi...@rt.perl.org

# New Ticket Created by Will Coleda
# Please include the string: [perl #36794]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/rt3/Ticket/Display.html?id=36794 >

With r8787, the following tcl code:

puts \u666

causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
and forces a few tcl-unicode escape tests into TODOs.

A short PIR test that is equivalent:

.sub main @MAIN
$S0 = "\\u666"
$I0 = 0x666
$S1 = chr $I0 # works, but substr doesn't like this string.
substr $S0, 0, 5, $S1
.end

Running the PIR through gdb, I get the stack trace below.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00a98fff
0xffff8cf4 in ___memcpy () at /System/Library/Frameworks/System.framework/PrivateHeaders/ppc/cpu_capabilities.h:189
189 /System/Library/Frameworks/System.framework/PrivateHeaders/ppc/cpu_capabilities.h: No such file or directory.
in /System/Library/Frameworks/System.framework/PrivateHeaders/ppc/cpu_capabilities.h
(gdb) bt
#0 0xffff8cf4 in ___memcpy () at /System/Library/Frameworks/System.framework/PrivateHeaders/ppc/cpu_capabilities.h:189
#1 0x0002d04c in string_replace (interpreter=0xd00180, src=0xe5b1c0, offset=0, length=5, rep=0xe5a630, d=0x0) at src/string.c:1238
#2 0x00093a88 in Parrot_substr_s_ic_ic_s (cur_opcode=0xd125fc, interpreter=0xd00180) at ops/string.ops:245
#3 0x00205c5c in runops_slow_core (interpreter=0xd00180, pc=0xd125fc) at src/runops_cores.c:153
#4 0x0004c6f0 in runops_int (interpreter=0xd00180, offset=0) at src/interpreter.c:750
#5 0x00044d70 in runops (interpreter=0xd00180, offs=0) at src/inter_run.c:81
#6 0x0001680c in Parrot_runcode (interpreter=0xd00180, argc=1, argv=0xbffff9d0) at src/embed.c:831
#7 0x000165bc in Parrot_runcode (interpreter=0xd00180, argc=1, argv=0xbffff9d0) at src/embed.c:765
#8 0x000043c8 in main (argc=1, argv=0xbffff9d0) at imcc/main.c:637

Leopold Toetsch

Aug 3, 2005, 4:44:47 PM

to perl6-i...@perl.org, bugs-bi...@netlabs.develooper.com

On Aug 3, 2005, at 20:58, Will Coleda (via RT) wrote:

>
> causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
> and forces a few tcl-unicode escape tests into TODOs.
>
> A short PIR test that is equivalent:
>
> .sub main @MAIN
> $S0 = "\\u666"
> $I0 = 0x666
> $S1 = chr $I0 # works, but substr doesn't like this string.
> substr $S0, 0, 5, $S1
> .end

> #1 0x0002d04c in string_replace (interpreter=0xd00180, src=0xe5b1c0,
> offset=0, length=5, rep=0xe5a630, d=0x0) at src/string.c:1238

string_replace still has the old code, which relies on fixed-width encodings
with 1, 2, or 4 bytes per char; that is of course not true for utf8.
This needs fixing.
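
To make the fixed-width assumption concrete, here is a minimal Python sketch (not Parrot's actual string_replace, whose source is not shown in this thread). It compares a byte span computed as offset * bytes_per_char with one found by walking the UTF-8 buffer. The helper names fixed_width_span and utf8_span are hypothetical, introduced only for illustration.

```python
def fixed_width_span(offset, length, bytes_per_char):
    """Byte span computed the fixed-width way: every char has the same size."""
    start = offset * bytes_per_char
    return start, start + length * bytes_per_char


def utf8_span(buf, offset, length):
    """Byte span found by walking the UTF-8 buffer one code point at a time."""
    def skip(pos, count):
        for _ in range(count):
            if pos >= len(buf):
                break
            pos += 1
            # Continuation bytes look like 0b10xxxxxx; they belong to the
            # character that started before them, so step over them.
            while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
                pos += 1
        return pos

    start = skip(0, offset)
    return start, skip(start, length)


src = "\\u666".encode("utf-8")    # the five ASCII chars \ u 6 6 6
rep = chr(0x666).encode("utf-8")  # U+0666 needs two bytes in UTF-8

# One character, but a fixed 1-byte-per-char model sees only half of it.
naive = fixed_width_span(0, 1, 1)
walked = utf8_span(rep, 0, 1)
print(len(rep), naive, walked)

# Replacing chars 0..5 of src with rep, using spans measured in bytes.
start, end = utf8_span(src, 0, 5)
result = src[:start] + rep + src[end:]
print(result.decode("utf-8") == "\u0666")
```

Any length or copy size derived from the naive span is off as soon as a character takes more than one byte, which is the kind of mismatch that can hand memcpy a bad size.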

Takers wanted,
leo

Joshua Juran

Aug 3, 2005, 8:27:13 PM

to perl6-i...@perl.org

On Aug 3, 2005, at 2:58 PM, Will Coleda (via RT) wrote:

> With r8787, the following tcl code:
>
> puts \u666
>
> causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
> and forces a few tcl-unicode escape tests into TODOs.

Duh, because it's *evil*.

:-)

Josh

Leopold Toetsch

Aug 4, 2005, 5:31:25 AM

to perl6-i...@perl.org, bugs-bi...@netlabs.develooper.com

Will Coleda (via RT) wrote:

> causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
> and forces a few tcl-unicode escape tests into TODOs.
>
> A short PIR test that is equivalent:
>
> .sub main @MAIN
> $S0 = "\\u666"
> $I0 = 0x666
> $S1 = chr $I0 # works, but substr doesn't like this string.
> substr $S0, 0, 5, $S1
> .end

Fixed - r8805

Thanks for testing and providing the test,
leo

Nicholas Clark

Aug 10, 2005, 6:23:37 AM

to Leopold Toetsch, perl6-i...@perl.org, bugs-bi...@netlabs.develooper.com

I thought that one thing Jarkko learned from perl 5's Unicode model was that
the amount of code and pain to support a variable length encoding was
greater than the space saving that that encoding gives.

In turn Dan had decided that Parrot should internally unpack to some form
of fixed width encoding. So all Unicode would be stored internally in the
shortest of ISO-8859-1, UCS-16 and UCS-32 that encompassed all the code
points used.
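
The "shortest encoding that encompasses all the code points used" rule can be sketched as follows. This is an illustration in Python, not Parrot code. It assumes that the "UCS-16" and "UCS-32" above mean 2-byte and 4-byte fixed-width forms, and it uses Python's latin-1, utf-16-le and utf-32-le codecs as stand-ins. The function name narrowest_fixed_width is hypothetical.

```python
def narrowest_fixed_width(text):
    """Return (codec name, bytes per char) for the smallest fixed-width form."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return "latin-1", 1      # ISO-8859-1: one byte per char
    if highest <= 0xFFFF:
        return "utf-16-le", 2    # one 16-bit unit per char inside the BMP
    return "utf-32-le", 4        # one 32-bit unit per char


for sample in ("abc", "\u0666", "\U0001F600"):
    codec, width = narrowest_fixed_width(sample)
    # With a fixed width, the byte length is always width * char count.
    assert len(sample.encode(codec)) == width * len(sample)
    print(repr(sample), codec, width)
```

With such a representation, the byte offset of character n is simply n * width, so substr-style operations need no scanning.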

1: My memory may be wrong on this
2: It may not have been explicit
3: I may have missed an explicit change

But having dealt with the fun of variable length encodings, my gut feeling
is with Jarkko, that it's probably better to stay fixed width internally.

Nicholas Clark

Leopold Toetsch

Aug 10, 2005, 8:56:46 AM

to parrotbug...@parrotcode.org

Nicholas Clark via RT wrote:

> I thought that one thing Jarkko learned from perl 5's Unicode model was that
> the amount of code and pain to support a variable length encoding was
> greater than the space saving that that encoding gives.
>
> In turn Dan had decided that Parrot should internally unpack to some form
> of fixed width encoding. So all Unicode would be stored internally in the
> shortest of ISO-8859-1, UCS-16 and UCS-32 that encompassed all the code
> points used.

Yes, with the enhancement (also proposed by Dan) that a conversion to
fixed width encoding is done lazily, i.e. on demand. The substr would be
typically such a place to change encoding to fixed.
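
A rough Python sketch of the lazy idea, under the assumption that a string keeps its compact UTF-8 bytes until an indexed operation such as substr forces a fixed-width form. The LazyString class and its methods are hypothetical and do not mirror Parrot's STRING structure.

```python
class LazyString:
    """Hypothetical string: keeps UTF-8 bytes until indexing is needed."""

    def __init__(self, utf8_bytes):
        self.utf8 = utf8_bytes  # compact variable-width storage
        self.fixed = None       # list of code points, built on demand

    def _to_fixed(self):
        # Convert once, on first demand; later calls reuse the result.
        if self.fixed is None:
            self.fixed = [ord(ch) for ch in self.utf8.decode("utf-8")]
            self.utf8 = None
        return self.fixed

    def replace(self, offset, length, rep):
        """substr-style replace; forces both operands to fixed width."""
        chars = self._to_fixed()
        chars[offset:offset + length] = rep._to_fixed()

    def __str__(self):
        if self.fixed is None:
            return self.utf8.decode("utf-8")
        return "".join(map(chr, self.fixed))


s = LazyString("\\u666".encode("utf-8"))
r = LazyString(chr(0x666).encode("utf-8"))
still_lazy = s.fixed is None  # nothing converted yet
s.replace(0, 5, r)            # the substr-like call triggers conversion
print(still_lazy, str(s) == "\u0666")
```

Strings that are only read, printed or concatenated never pay for the conversion; only the ones that get indexed do.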

> But having dealt with the fun of variable length encodings, my gut feeling
> is with Jarkko, that it's probably better to stay fixed width internally.

My gut feeling is just the same.

> Nicholas Clark

leo

Nicholas Clark

Aug 10, 2005, 9:24:31 AM

to Leopold Toetsch, parrotbug...@parrotcode.org

On Wed, Aug 10, 2005 at 02:56:46PM +0200, Leopold Toetsch wrote:
> Nicholas Clark via RT wrote:
>
> >I thought that one thing Jarkko learned from perl 5's Unicode model was
> >that
> >the amount of code and pain to support a variable length encoding was
> >greater than the space saving that that encoding gives.
> >
> >In turn Dan had decided that Parrot should internally unpack to some form
> >of fixed width encoding. So all Unicode would be stored internally in the
> >shortest of ISO-8859-1, UCS-16 and UCS-32 that encompassed all the code
> >points used.
>
> Yes, with the enhancement (also proposed by Dan) that a conversion to
> fixed width encoding is done lazily, i.e. on demand. The substr would be
> typically such a place to change encoding to fixed.

Aha. That's the subtlety that I missed from all this. The form of the "fix"

> >But having dealt with the fun of variable length encodings, my gut feeling
> >is with Jarkko, that it's probably better to stay fixed width internally.
>
> My gut feeling is just the same.

Thanks for the clarification.

Nicholas Clark