Transferring control between code segments, eval, and suchlike things

Dan Sugalski

Jan 22, 2003, 3:00:37 PM
to perl6-i...@perl.org
Okay, since this has all come up, here's the scoop from a design perspective.

First, the branch opcodes (branch, bsr, and the conditionals) are all
meant for movement within a segment of bytecode. They are *not*
supposed to leave a segment. Doing so was arguably a bad idea; now
it's officially an error. If you need to leave, branch to an op that
can transfer across boundaries.

Design Edict #1: Branches, meaning any transfer of control that
takes an offset, may *not* escape the current bytecode segment.

Next, jumps. Jumps take absolute addresses, so they either need fixup
at load time (blech), are only valid in dynamically generated code
(okay, but limiting), or can only jump to values in registers (that's
fine). Jumps aren't a problem in general.

Design Edict #2: Jumps may go anywhere.

Destinations. These are a pain, since if we can go anywhere then the
JIT has to do all sorts of nasty and unpleasant things to compensate,
and to make every op a valid destination. Yuck.

Design Edict #3: All destinations *must* be marked as such in the
bytecode metadata segment. (I am officially nervous about this, as I
can see a number of ways to subvert this for evil)

I'm only keeping jumps (and their corresponding jsr) around for
nostalgic reasons, and with the vague hope they may be useful. I'm
not sure about this.

Design Edict #4: Dan is officially iffy on jumps, but can see them as
useful for lower-level statically bound languages such as Forth,
Scheme, or C.

That leads us to

Design Edict #5: Dan will accommodate semantics for languages outside
the core set (perl, python, ruby) only if they don't compromise
performance for the core set.

Calling actual routines--subs, methods, functions, whatever--at the
high level isn't done with branches or jumps. It is, instead, done
with the call series of ops. (call, callmeth, callcc, tailcall,
tailcallmeth, tailcallcc (though that one makes my head hurt),
invoke) These are specifically for calling code that's potentially in
other segments, and to call into them at fixed points. I think these
need to be hashed out a bit to make them more JIT-friendly, but
they're the primary transfer destination point.

Design Edict #6: The first op in a sub is always a valid
jump/branch/control transfer destination.

Now. Eval. The compile opcode going in is phenomenally cool (thanks,
Leo!) but has pointed out some holes in the semantics. I got
handwavey and, well, it shows. No cookie for me.

The compreg op should compile the passed code in the language that is
indicated and should load that bytecode into the current interpreter.
That means that if there are any symbols that get installed because
someone's defined a sub then, well, they should get installed into
the interpreter's symbol tables.

Compiled code is an interesting thing. In some cases it should execute
and return a value, in some cases it should return a sub PMC, and in
some cases it should install a bunch of stuff in a symbol table and
then return a value. These correspond to:

    eval "print 12";

    $foo = eval "sub bar{return 1;}";

    require "foo.pm";

respectively. It's sort of a mixed bag, and unfortunately we can't
count on the code doing the compilation to properly handle the
semantics of the language being compiled. So...

Design Edict #7: the compreg opcode will execute the compiled code,
calling in with parrot's calling conventions. If it should return
something, then it had darned well better build it and return it.

Oh, and:

Design Edict #8: compreg is prototyped. It takes a single string and
must return a single PMC. The compiler may cheat as need be. (No need
to check and see if it returned a string, or an int)

Yes, this does mean that when we want to compile plain assembly and get
a sub ref back, we need to do extra work in the assembly we pass
in. Tough, we can deal. If it was dead-simple it wouldn't be
assembly. :)

I think that's it. Let's have at it and see where the edicts need fixing.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Benjamin Stuhl

Jan 22, 2003, 6:24:46 PM
to Dan Sugalski, perl6-i...@perl.org
At 03:00 PM 1/22/2003 -0500, you wrote:
>Okay, since this has all come up, here's the scoop from a design perspective.
>
>First, the branch opcodes (branch, bsr, and the conditionals) are all
>meant for movement within a segment of bytecode. They are *not* supposed
>to leave a segment. To do so was arguably a bad idea, now it's officially
>an error. If you need to do so, branch to an op that can transfer across
>boundaries.
>
>Design Edict #1: Branches, which is any transfer of control that takes an
>offset, may *not* escape the current bytecode segment.

Seems reasonable, especially when the bytecode loader may not guarantee
the relative placement of segments (think mmap()). Although, all this would
seem to suggest that we'd need/want a special-purpose allocator for bytecode
segments, since every sub has to fit within precisely one segment (and I
know _I'd_ like to keep bytecode segments on their own memory pages, to
e.g. maximize sharing on fork()).

>Next, jumps. Jumps take absolute addresses, so either need fixup at load
>time (blech), are only valid in dynamically generated code (okay, but
>limiting), or can only jump to values in registers (that's fine). Jumps
>aren't a problem in general.

Fixups aren't so bad if we make the jump opcode itself take an index into a
table of fixups (thus letting the bytecode stream stay read-only). Register
jumps are dangerous, since parrot can't control what the user code loads
into the register (while we can theoretically protect the fixup table from
anything short of native code).
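
A minimal sketch of the fixup-table idea, assuming a toy model rather than Parrot's actual loader: the names build_fixups and resolve_jump, the (opname, operand) layout, and the base address are all hypothetical. It shows a jump operand acting as an index into a table that is resolved at load time, so the bytecode itself is never rewritten.

```python
# Hypothetical sketch, not Parrot's real loader. The bytecode is an immutable
# tuple of (opname, operand) pairs; a `jump` operand is an index into a
# separate fixup table rather than an absolute address.

def build_fixups(label_offsets, base_address):
    """Resolve each label offset to an absolute address at load time."""
    return [base_address + offset for offset in label_offsets]


def resolve_jump(bytecode, fixups, pc):
    """Return the absolute address the jump at `pc` transfers to."""
    name, index = bytecode[pc]
    if name != "jump":
        raise ValueError(f"op {pc} is {name!r}, not a jump")
    if not 0 <= index < len(fixups):
        raise IndexError(f"op {pc}: fixup index {index} is out of range")
    return fixups[index]


bytecode = (("jump", 1), ("noop", 0))      # never modified after loading
fixups = build_fixups([0, 8], base_address=0x1000)
print(hex(resolve_jump(bytecode, fixups, 0)))
```

Loading the same bytecode at a different base address only changes the table, which is the property that lets the bytecode stream stay read-only. The cost is the extra indirection on every jump.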

>Design Edict #2: Jumps may go anywhere.
>
>Destinations. These are a pain, since if we can go anywhere then the JIT
>has to do all sorts of nasty and unpleasant things to compensate, and to
>make every op a valid destination. Yuck.
>
>Design Edict #3: All destinations *must* be marked as such in the bytecode
>metadata segment. (I am officially nervous about this, as I can see a
>number of ways to subvert this for evil)

Marked destinations are very important; as for evil subversion, how about
just saying "untrusted code only gets pure interpretation, and the
untrusting interpreter bounds-checks everything"?

[snip]


>Calling actual routines--subs, methods, functions, whatever--at the high
>level isn't done with branches or jumps. It is, instead, done with the
>call series of ops. (call, callmeth, callcc, tailcall, tailcallmeth,
>tailcallcc (though that one makes my head hurt), invoke) These are
>specifically for calling code that's potentially in other segments, and to
>call into them at fixed points. I think these need to be hashed out a bit
>to make them more JIT-friendly, but they're the primary transfer
>destination point
>
>Design Edict #6: The first op in a sub is always a valid
>jump/branch/control transfer destination

Wouldn't make much sense if you had a sub but couldn't call it, now would
it? :-D

>Now. Eval. The compile opcode going in is phenomenally cool (thanks, Leo!)
>but has pointed out some holes in the semantics. I got handwavey and,
>well, it shows. No cookie for me.
>
>The compreg op should compile the passed code in the language that is
>indicated and should load that bytecode into the current interpreter. That
>means that if there are any symbols that get installed because someone's
>defined a sub then, well, they should get installed into the interpreter's
>symbol tables.
>
>Compiled code is an interesting thing. In some cases it should return a
>sub PMC, in some cases it should execute and return a value, and in some
>cases it should install a bunch of stuff in a symbol table and then
>return a value. These correspond to:
>
>
> eval "print 12";
>
> $foo = eval "sub bar{return 1;}";
>
> require foo.pm;
>
>respectively. It's sort of a mixed bag, and unfortunately we can't count
>on the code doing the compilation to properly handle the semantics of the
>language being compiled. So...
>
>Design Edict #7: the compreg opcode will execute the compiled code,
>calling in with parrot's calling conventions. If it should return
>something, then it had darned well better build it and return it.

How does this play with

eval 'sub bar { change_foo(); } BEGIN { bar(); } (...stuff that depends on
foo...)';

? The semantics of BEGIN{} would seem to require that bar be installed into
the symbol table immediately... but then how do we reproduce that if we're
e.g. loading precompiled bytecode?

>Oh, and:
>
>Design Edict #8: compreg is prototyped. It takes a single string and must
>return a single PMC. The compiler may cheat as need be. (No need to check
>and see if it returned a string, or an int)
>
>Yes, this does mean that for plain assembly that we want to compile and
>return a sub ref for we need to do extra in the assembly we pass in.
>Tough, we can deal. If it was dead-simple it wouldn't be assembly. :)

That makes sense.

-- BKS

Dan Sugalski

Jan 23, 2003, 12:11:20 AM
to Benjamin Stuhl, perl6-i...@perl.org
At 6:24 PM -0500 1/22/03, Benjamin Stuhl wrote:
>At 03:00 PM 1/22/2003 -0500, you wrote:
>>Okay, since this has all come up, here's the scoop from a design perspective.
>>
>>First, the branch opcodes (branch, bsr, and the conditionals) are
>>all meant for movement within a segment of bytecode. They are *not*
>>supposed to leave a segment. To do so was arguably a bad idea, now
>>it's officially an error. If you need to do so, branch to an op
>>that can transfer across boundaries.
>>
>>Design Edict #1: Branches, which is any transfer of control that
>>takes an offset, may *not* escape the current bytecode segment.
>
>Seems reasonable. Especially when the bytecode loader may not
>guarantee the relative placement of segments (think mmap()).
>Although,
>all this would seem to suggest that we'd need/want a special-purpose
>allocator for bytecode segments, since every sub has to fit within
>precisely
>one segment (and I know _I'd_ like to keep bytecode segments on
>their own memory pages, to e.g. maximize sharing on fork()).

Every sub doesn't have to fit in a single segment, though. There may
well be a half-zillion subs in any one segment. (Though one segment
per sub does give us some interesting possibilities for GCing unused
code)

>>Next, jumps. Jumps take absolute addresses, so either need fixup at
>>load time (blech), are only valid in dynamically generated code
>>(okay, but limiting), or can only jump to values in registers
>>(that's fine). Jumps aren't a problem in general.
>
>Fixups aren't so bad if we make the jump opcode itself take an index
>into a table of fixups (thus letting the bytecode stream stay
>read-only). Register jumps
>are dangerous, since parrot can't control what the user code loads
>into the register (while we can theoretically protect the fixup
>table from anything short of
>native code).

Indirection. Ick. :)

Though, on the other hand, a jump with an integer constant
destination is pretty pointless, so we could consider using that to
index into a jump table. OTOH, it'd be the only thing using the jump
table, so I'm not sure it's worth it. Might speed things up some.
I'll think on that for a bit.

>>Design Edict #2: Jumps may go anywhere.
>>
>>Destinations. These are a pain, since if we can go anywhere then
>>the JIT has to do all sorts of nasty and unpleasant things to
>>compensate, and to make every op a valid destination. Yuck.
>>
>>Design Edict #3: All destinations *must* be marked as such in the
>>bytecode metadata segment. (I am officially nervous about this, as
>>I can see a number of ways to subvert this for evil)
>
>Marked destinations are very important; as for evil subversion, how
>about just saying "untrusted code only gets pure interpretation, and
>the untrusting interpreter bounds-checks everything"?

True, and we'll not be JITting safe-mode code, or at least likely not,
because of the resource constraint checking.

>[snip]
>>Calling actual routines--subs, methods, functions, whatever--at the
>>high level isn't done with branches or jumps. It is, instead, done
>>with the call series of ops. (call, callmeth, callcc, tailcall,
>>tailcallmeth, tailcallcc (though that one makes my head hurt),
>>invoke) These are specifically for calling code that's potentially
>>in other segments, and to call into them at fixed points. I think
>>these need to be hashed out a bit to make them more JIT-friendly,
>>but they're the primary transfer destination point
>>
>>Design Edict #6: The first op in a sub is always a valid
>>jump/branch/control transfer destination
>
>Wouldn't make much sense if you had a sub but couldn't call it, now
>would it? :-D

Don't tempt the JAPHers!

If the compiler unit has special semantics it applies to chunks of
things that it's compiling, then it had darned well better apply them.

Or, more simply, it's the perl compiler module's responsibility to
get the BEGIN blocks executed.

Leopold Toetsch

Jan 23, 2003, 4:21:06 AM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski wrote:

> Okay, since this has all come up, here's the scoop from a design
> perspective.


Hard stuff did meet my printer at midnight, reading it onscreen twice
didn't help ;-)

First:

Definition #0: A bytecode segment is a sequence of code which is loaded
into memory with no execution of such code interspersed. So all subs,
modules, whatever, loaded from umpteen files may be one code segment, *if*
the runloop wasn't entered. Or: as soon as the code is running, loading
additional bytecode puts this code into a different bytecode segment.


> Design Edict #1: Branches, which is any transfer of control that takes
> an offset, may *not* escape the current bytecode segment.

> Design Edict #2: Jumps may go anywhere.


> Design Edict #3: All destinations *must* be marked as such in the
> bytecode metadata segment. (I am officially nervous about this, as I can
> see a number of ways to subvert this for evil)


I would define: Jumps may go to any location acquired via a set_addr call
or to branch tables. Jumping somewhere else may kill your dog.

Jumping to a set_addr label is recognized already; jump tables will
probably need some marker around them, so that the jump targets won't
get killed by dead code elimination.


> I'm only keeping jumps (and their corresponding jsr) around for
> nostalgic reasons, and with the vague hope they may be useful. I'm not
> sure about this.


They would be useful for a computed goto.


s/compreg/compile/g for($below);


> The compreg op should compile the passed code ...

> Design Edict #7: the compreg opcode will execute the compiled code,
> calling in with parrot's calling conventions. If it should return
> something, then it had darned well better build it and return it.


If the compile opcode has to execute the code, I would call it "eval".

But: When compile and eval are separate stages, the HL might be able to
pull the compile stage out of e.g. loops. So I think keeping compiling
and evaling separate makes sense.
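
As a rough illustration of why separate stages help, here is a sketch that uses Python's built-in compile() and eval() as stand-ins. This is only an analogy: it is not Parrot's compile opcode, and the expression and variable names are made up for the example. The compile stage is hoisted out of the loop and only the eval stage repeats.

```python
# Python's built-in compile() and eval() stand in for the two stages here;
# they are an analogy, not Parrot's opcodes.

source = "x * 2 + 1"

# Compile stage: done once, outside the loop.
code = compile(source, "<string>", "eval")

# Eval stage: the same code object is run on every iteration.
results = [eval(code, {"x": x}) for x in range(5)]
print(results)
```

If compiling always implied executing, the source string would have to be recompiled on every pass through the loop.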


Thanks for putting this together,
leo


Leopold Toetsch

Jan 23, 2003, 4:47:20 AM
to Benjamin Stuhl, Dan Sugalski, perl6-i...@perl.org
Benjamin Stuhl wrote:

> At 03:00 PM 1/22/2003 -0500, you wrote:

> ... Although,


> all this would seem to suggest that we'd need/want a special-purpose
> allocator for bytecode segments, since every sub has to fit within
> precisely
> one segment (and I know _I'd_ like to keep bytecode segments on their
> own memory pages, to e.g. maximize sharing on fork()).


IMHO this is a big waste of memory - and running this page-aligned code
JITted doesn't buy anything.


>> Design Edict #7: the compreg opcode will execute the compiled code,
>> calling in with parrot's calling conventions. If it should return
>> something, then it had darned well better build it and return it.
>
>
> How does this play with
>
> eval 'sub bar { change_foo(); } BEGIN { bar(); } (...stuff that depends
> on foo...)';
>
> ? The semantics of BEGIN{} would seem to require that bar be installed
> into the symbol table immediately... but then how do we reproduce that
> if we're e.g. loading
> precompiled bytecode?


Precompiled PBC and eval is a PITA. This issue seems to imply some extra
parsing during load time and setting up symbols. I don't know yet how to
handle this.

leo

Jason Gloudon

Jan 23, 2003, 4:16:47 PM
to Dan Sugalski, perl6-i...@perl.org
On Wed, Jan 22, 2003 at 03:00:37PM -0500, Dan Sugalski wrote:

> Destinations. These are a pain, since if we can go anywhere then the
> JIT has to do all sorts of nasty and unpleasant things to compensate,
> and to make every op a valid destination. Yuck.

Arbitrary jumps are not that difficult to deal with in the JIT. The JIT
compiler can handle jumps to arbitrary addresses by falling back into the
interpreter if the destination does not coincide with a previously known entry
point, reentering the JIT code later at a safe point. pbc2c generated code does
this. This way the JIT does not have to support making every instruction a safe
branch destination.
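
The fallback scheme can be sketched as a toy run loop. Everything here is an assumption made for illustration (the run function, the op tuples, and the "jit"/"interp" labels); it is not how the Parrot JIT or pbc2c is actually structured. It only shows the control flow: a jump to an unknown target drops into the interpreter, and execution re-enters compiled code at the next known entry point.

```python
# Hypothetical toy model: `entry_points` is the set of op indexes at which
# compiled code may safely be entered. The trace records which engine ran
# each op.

def run(ops, entry_points, start_pc=0):
    trace = []
    pc, mode = start_pc, "interp"
    while 0 <= pc < len(ops):
        if pc in entry_points:
            mode = "jit"          # safe point: (re)enter compiled code
        name, operand = ops[pc]
        trace.append((pc, mode))
        if name == "end":
            break
        if name == "jump":
            pc = operand
            if pc not in entry_points:
                mode = "interp"   # unknown target: fall back to the interpreter
        else:
            pc += 1
    return trace


ops = [("jump", 2), ("noop", None), ("noop", None), ("noop", None), ("end", None)]
print(run(ops, entry_points={0, 3}))
```

In this run the jump lands on op 2, which is not an entry point, so op 2 is interpreted; op 3 is an entry point, so the rest runs as compiled code.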

--
Jason

Juergen Boemmels

Jan 23, 2003, 4:25:26 PM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> writes:

> Okay, since this has all come up, here's the scoop from a design perspective.
>
> First, the branch opcodes (branch, bsr, and the conditionals) are all
> meant for movement within a segment of bytecode. They are *not*
> supposed to leave a segment. To do so was arguably a bad idea, now
> it's officially an error. If you need to do so, branch to an op that
> can transfer across boundaries.
>
>
> Design Edict #1: Branches, which is any transfer of control that takes
> an offset, may *not* escape the current bytecode segment.

Okay with that.

> Next, jumps. Jumps take absolute addresses, so either need fixup at
> load time (blech), are only valid in dynamically generated code (okay,
> but limiting), or can only jump to values in registers (that's
> fine). Jumps aren't a problem in general.
>
>
> Design Edict #2: Jumps may go anywhere.

In the sense that every possible target (via #3) can be reached with a
jump, but bad things may happen if the target isn't valid.

> Destinations. These are a pain, since if we can go anywhere then the
> JIT has to do all sorts of nasty and unpleasant things to compensate,
> and to make every op a valid destination. Yuck.
>
>
> Design Edict #3: All destinations *must* be marked as such in the
> bytecode metadata segment. (I am officially nervous about this, as I
> can see a number of ways to subvert this for evil)

This is no more or less evil than
    branch -1
The destinations can be range-checked at load time, the assembler will
hopefully emit these offsets correctly, and they will be read-only after
compilation.
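
A load-time check along these lines might look like the sketch below. It is a toy model under stated assumptions, not Parrot's loader: check_branches, the (opname, operand) pairs, and the idea that offsets are counted in ops relative to the branch itself are all hypothetical. It rejects branches that leave the segment (Edict #1) and branches whose target is not marked in the metadata (Edict #3).

```python
# Hypothetical model, not Parrot's real loader: a segment is a list of
# (opname, operand) pairs and `destinations` is the set of op indexes that
# the metadata marks as valid targets.

def check_branches(ops, destinations):
    """Return error strings for branches that leave the segment or miss a marked destination."""
    errors = []
    for pc, (name, operand) in enumerate(ops):
        if name != "branch":
            continue
        target = pc + operand  # assumed: offset is relative to the branch op itself
        if not 0 <= target < len(ops):
            errors.append(f"op {pc}: branch leaves the segment (target {target})")
        elif target not in destinations:
            errors.append(f"op {pc}: target {target} is not a marked destination")
    return errors


segment = [("noop", 0), ("branch", 2), ("noop", 0), ("noop", 0), ("branch", -5)]
print(check_branches(segment, destinations={3}))
```

Because the check runs once at load time and the segment is read-only afterwards, nothing needs to be re-verified while the code runs.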

> I'm only keeping jumps (and their corresponding jsr) around for
> nostalgic reasons, and with the vague hope they may be useful. I'm not
> sure about this.
>
>
> Design Edict #4: Dan is officially iffy on jumps, but can see them as
> useful for lower-level statically bound languages such as forth,
> Scheme, or C.
>
>
> That leads us to
>
> Design Edict #5: Dan will accommodate semantics for languages outside
> the core set (perl, python, ruby) only if they don't compromise
> performance for the core set.
>
>
> Calling actual routines--subs, methods, functions, whatever--at the
> high level isn't done with branches or jumps. It is, instead, done
> with the call series of ops. (call, callmeth, callcc, tailcall,
> tailcallmeth, tailcallcc (though that one makes my head hurt), invoke)
> These are specifically for calling code that's potentially in other
> segments, and to call into them at fixed points. I think these need to
> be hashed out a bit to make them more JIT-friendly, but they're the
> primary transfer destination point

These calls are always jumps or jsr in disguise. In the end they
always do a goto ADDRESS(something). This means that every
sub/method/continuation must be marked per #3.

> Design Edict #6: The first op in a sub is always a valid
> jump/branch/control transfer destination

This is essentially #3.

> Now. Eval. The compile opcode going in is phenomenally cool (thanks,
> Leo!) but has pointed out some holes in the semantics. I got handwavey
> and, well, it shows. No cookie for me.
>
>
> The compreg op should compile the passed code in the language that is
> indicated and should load that bytecode into the current
> interpreter. That means that if there are any symbols that get
> installed because someone's defined a sub then, well, they should get
> installed into the interpreter's symbol tables.

The compile would not install the symbols in the interpreter's symbol
table; it would store them somewhere in the bytecode metadata. The eval
should install them in the interpreter's symbol table.

The problem really starts if BEGIN {...} blocks are used, because they
will be evaluated after the block is compiled but before the whole
compile is finished.

> Compiled code is an interesting thing. In some cases it should return
> a sub PMC, in some cases it should execute and return a value, and in
> some cases it should install a bunch of stuff in a symbol table and
> then return a value. These correspond to:
>
>
>
> eval "print 12";
>
> $foo = eval "sub bar{return 1;}";
>
> require foo.pm;
>
> respectively. It's sort of a mixed bag, and unfortunately we can't
> count on the code doing the compilation to properly handle the
> semantics of the language being compiled. So...
>
>
> Design Edict #7: the compreg opcode will execute the compiled code,
> calling in with parrot's calling conventions. If it should return
> something, then it had darned well better build it and return it.

I find it better to leave compile and eval separate.
The compile opcode should simply return a bytecode PMC which can then
be invoked sometime later.

> Oh, and:
>
> Design Edict #8: compreg is prototyped. It takes a single string and
> must return a single PMC. The compiler may cheat as need be. (No need
> to check and see if it returned a string, or an int)

It should return a bytecode segment.



> Yes, this does mean that for plain assembly that we want to compile
> and return a sub ref for we need to do extra in the assembly we pass
> in. Tough, we can deal. If it was dead-simple it wouldn't be
> assembly. :)

The assembler in assembly might be very simple:

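    open P0, "infile", "r"     # open the source file for reading
    read S0, P0, filesize      # read the source text into S0
    close P0
    compile P0, S0             # compile the source text; result lands in P0
    open P1, "outfile", "w"
    puts P1, P0                # write the compiled result to the output file
    close P1
    end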

The hard part is all hidden in the compile opcode.

bye
boe.
--
Juergen Boemmels boem...@physik.uni-kl.de
Fachbereich Physik Tel: ++49-(0)631-205-2817
Universitaet Kaiserslautern Fax: ++49-(0)631-205-3906
PGP Key fingerprint = 9F 56 54 3D 45 C1 32 6F 23 F6 C7 2F 85 93 DD 47

Nicholas Clark

Jan 24, 2003, 6:46:16 PM
to Dan Sugalski, perl6-i...@perl.org
On Thu, Jan 23, 2003 at 12:11:20AM -0500, Dan Sugalski wrote:

> Every sub doesn't have to fit in a single segment, though. There may
> well be a half-zillion subs in any one segment. (Though one segment
> per sub does give us some interesting possibilities for GCing unused
> code)

For an interpreter that is allowing eval (or a namespace that isn't locked
against eval) I think that you could only GC the old definition of redefined
subroutines, and any anonymous subroutines that become unreferenced.
Anything else is the potential lucky destination of a random future eval.
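
That rule can be written down as a small sketch. The data model is invented for illustration (the collectable function, integer sub ids, and the three inputs are assumptions, not anything from Parrot): a sub is collectable only if nothing references it and it is either anonymous or no longer the current definition of its name.

```python
# Hypothetical model: `subs` maps a sub id to its name (None for anonymous
# subs), `symbol_table` maps each name to the id of its current definition,
# and `referenced` holds the ids still reachable from live data.

def collectable(subs, symbol_table, referenced):
    """Return the ids of subs that no future eval could reach by name."""
    dead = set()
    for sub_id, name in subs.items():
        if sub_id in referenced:
            continue                      # still reachable through a reference
        if name is None:
            dead.add(sub_id)              # unreferenced anonymous sub
        elif symbol_table.get(name) != sub_id:
            dead.add(sub_id)              # old definition of a redefined sub
    return dead


subs = {1: "foo", 2: "foo", 3: None, 4: None, 5: "bar"}
symbol_table = {"foo": 2, "bar": 5}
print(collectable(subs, symbol_table, referenced={4}))
```

Sub 5 is never called in this example, yet it stays alive because it is still the current definition of its name and a later eval could call it.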

Nicholas Clark
