
Using imcc as JIT optimizer


Leopold Toetsch

Feb 20, 2003, 6:09:42 AM
to P6I
Starting from the unbearable fact that optimized compiled C is still
faster than parrot -j (in primes.pasm), I did this experiment:
- do register allocation for JIT in imcc
- use the first N registers as MAPped processor registers

Here is the JIT optimized PASM output of

$ imcc -Oj -o p.pasm primes.pasm
$ cat p.pasm
set ri2, 1
set I5, 50
set I4, 0
print "N primes up to "
print I5
print " is: "
time N1
set rn1, N1 # load
REDO:
set ri0, 2
div ri3, ri2, 2
LOOP:
cmod ri1, ri2, ri0
if ri1, OK # with -O1j unless ri1, NEXT
branch NEXT # deleted
OK:
# deleted
inc ri0
le ri0, ri3, LOOP
inc I4
set I6, ri2
NEXT:
inc ri2
le ri2, I5, REDO
time N0
set rn0, N0 # load
print I4
print "\nlast is: "
print I6
print "\n"
sub rn0, rn1
set N0, rn0 # save
print "Elapsed time: "
print N0
print "\n"
end

The ri? and rn? are processor registers. The above is for Intel (4 mapped
int/float regs); you can translate the ri? to (%ebx, %edi, %esi, %edx).
The processor regs are represented as (-1 - parrot_reg),
i.e. %ebx == -1, %edi == -2 ...

The MAP macro in jit_emit.h would then be:
# define MAP(i) ((i)>= 0 ? 0 : ...map_branch[jit_info->op_i -1-(i)])
where the mappings are directly intval_map or floatval_map. JIT wouldn't
need any further calculations.
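As a rough illustration of what that lookup amounts to (the table contents and the function name below are made up, and the per-op offset `jit_info->op_i` from the macro above is dropped for simplicity):

```c
#include <assert.h>

/* Sketch of the proposed MAP lookup: a non-negative number is a plain
 * parrot register (no processor register, hence 0), while a register
 * encoded as -1 - n indexes a per-architecture table of physical
 * register codes.  The table is illustrative, not Parrot's real
 * intval_map. */
static const int intval_map[] = { 3, 7, 6, 2 }; /* e.g. %ebx %edi %esi %edx */

static int map_reg(int i)
{
    return i >= 0 ? 0 : intval_map[-1 - i];
}
```

With this encoding, map_reg(-1) yields the first mapped register and any non-negative parrot register maps to 0, so the JIT needs no further calculation at run time.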

The load/save instructions get inserted by looking at op_jit[].extcall,
i.e. if the instruction reads or writes a register, it gets saved/loaded
before/after and the parrot register is used instead. (Only the print
and time ops are external on i386.)

I currently have the imcc part for some common cases, enough for the
above output.

What do people think?

For reference: a similar idea: "Of mops and microops"

leo
PS: -O3 C: 3.64s, JIT: ~3.55s.

Sean O'Rourke

Feb 20, 2003, 8:26:58 AM
to Leopold Toetsch, P6I
On Thu, 20 Feb 2003, Leopold Toetsch wrote:
> What do people think?

Cool idea -- a lot of optimization-helpers could eventually be passed on
to the jit (possibly in the metadata?). One thought -- the information
imcc computes should be platform-independent. e.g. it could pass a
control flow graph to the JIT, but it probably shouldn't do register
allocation for a specific number of registers. How much worse do you
think it would be to have IMCC just rank the Parrot registers in order of
decreasing spill cost, then have the JIT take the top N, where N is the
number of available architectural registers?
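A minimal sketch of that platform-independent split (struct, field names, and costs are invented for illustration): imcc does the ranking, and the count of physical registers is the only target-specific input:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical spill-cost ranking: sort parrot registers by decreasing
 * spill cost (the platform-independent work imcc could do), then mark
 * the first n_phys of them as getting a machine register. */
struct preg { int num; int spill_cost; };

static int by_cost_desc(const void *a, const void *b)
{
    return ((const struct preg *)b)->spill_cost
         - ((const struct preg *)a)->spill_cost;
}

static void rank_and_map(struct preg *regs, size_t count, size_t n_phys,
                         int *mapped)
{
    qsort(regs, count, sizeof *regs, by_cost_desc);
    for (size_t i = 0; i < count; i++)
        mapped[i] = i < n_phys;   /* 1: machine register, 0: memory */
}
```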

/s

Leopold Toetsch

Feb 20, 2003, 10:28:24 AM
to Sean O'Rourke, P6I
Sean O'Rourke wrote:

> How much worse do you think it would be to have IMCC just rank the
> Parrot registers in order of decreasing spill cost, then have the JIT
> take the top N, where N is the number of available architectural
> registers?
The registers are already in that order (with -Op or -Oj), so this
wouldn't be a problem. Difficulties arise when it comes to the register
load/save instructions, which get inserted by imcc in my scheme. These
are definitely processor/$arch specific. They depend on the number of
mappable (and non-preserved, too) registers, and on the state of the
op_jit function table.

Of course the CFG and register life information could be passed on to the
JIT, but this seems a little bit complicated, as JIT has its own
sections, which match either a basic block from imcc or are a sequence
of non-JITable instructions.
But in the long run, it could be a way to go. OTOH, PBC compatibility
is not a big point here when JIT is involved: 99% of the time the
code would run on the machine where it was generated.
And it would AFAIK be easier to make some JIT cross-compiler. This would
basically only need the number of mappable registers and the extcall
bits from the jump table, read in from some config file.


> /s


leo

Tupshin Harper

Feb 20, 2003, 2:45:49 PM
to Leopold Toetsch, P6I
Leopold Toetsch wrote:

> Starting from the unbearable fact that optimized compiled C is still
> faster than parrot -j (in primes.pasm)

Lol... what are you going to do when somebody comes along with the
unbearable example of primes.s (optimized x86 assembly), and you are
forced to throw up your hands in defeat? ;-)

Cool idea, if I understand correctly, and I am in awe of how fast the
bloody thing is already.

-Tupshin

Leopold Toetsch

Feb 20, 2003, 4:14:41 PM
to Tupshin Harper, P6I
Tupshin Harper wrote:

> Leopold Toetsch wrote:
>
>> Starting from the unbearable fact that optimized compiled C is still
>> faster than parrot -j (in primes.pasm)
>
>
> Lol...what are you going to do when somebody comes along with the
> unbearable example of primes.s(optimized x86 assembly), and you are
> forced to throw up your hands in defeat? ;-)


It may only be equally fast, that's it :)


> Cool idea, if I understand correctly, and I am in awe of how fast the
> bloody thing is already.


That's integer/float only. When it comes to objects, different things
matter.


> -Tupshin


leo


Daniel Grunblatt

Feb 20, 2003, 7:30:23 PM
to Tupshin Harper, P6I
On Thursday 20 February 2003 18:14, Leopold Toetsch wrote:
> Tupshin Harper wrote:
> > Leopold Toetsch wrote:
> >> Starting from the unbearable fact that optimized compiled C is still
> >> faster than parrot -j (in primes.pasm)
> >
> > Lol...what are you going to do when somebody comes along with the
> > unbearable example of primes.s(optimized x86 assembly), and you are
> > forced to throw up your hands in defeat? ;-)
>
> It only may be equally fast, that's it :)
Nahh, you know it can be faster... maybe in a couple of years ;-D

Leopold Toetsch

Feb 21, 2003, 3:09:36 AM
to Leopold Toetsch, P6I
Leopold Toetsch wrote:


> - do register allocation for JIT in imcc
> - use the first N registers as MAPped processor registers


The "[RFC] imcc calling conventions" didn't get any response. Should I
take this fact as an implicit "yep, fine"?

Here again is the relevant part, which has implications for register
renumbering, as used for JIT optimization:

=head1 Parrot calling conventions (NCI)

Proposed syntax:

$P0 = load_lib "libname"
$P1 = dlfunc $P0, "funcname", "signature"
.nciarg z # I5
.nciarg y # I6
.nciarg x # I7
ncicall $P1 # r = funcname(x, y, z)
.nciresult r

A code snippet like:

set I5, I0
dlfunc P0, P1, "func", "ii"
invoke
set I6, I5

now comes out as:

set ri1, ri0
dlfunc P0, P1, "func", "ii"
invoke
set ri0, ri1

which is clearly not what pdd03 intends. For plain PASM, at least,
the .nciarg/.nciresult are necessary to mark these parrot registers as
fixed and to give imcc some hint that dlfunc is actually using these
registers.

So there are some possibilities:
- disable register renumbering for all compilation units where an
B<invoke> is found
- do it right, i.e. implement the above (or a similar) syntax and rewrite
existing code

leo

Dan Sugalski

Feb 21, 2003, 5:05:43 PM
to P6I
At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
>Starting from the unbearable fact that optimized compiled C is
>still faster than parrot -j (in primes.pasm), I did this experiment:
>- do register allocation for JIT in imcc
>- use the first N registers as MAPped processor registers

This sounds pretty interesting, and I bet it could make things
faster. The one thing to be careful of is that it's easy to get
yourself into a position where you spend more time optimizing the
code you're JITting than you win in the end.

You also have to be very careful that you don't reorder things, since
there's not enough info in the bytecode stream to know what can and
can't be moved. (Which is something we need to deal with in IMCC as
well)
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Leopold Toetsch

Feb 21, 2003, 6:43:44 PM
to Dan Sugalski, P6I
Dan Sugalski wrote:

> At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
>
>> Starting from the unbearable fact that optimized compiled C is still
>> faster than parrot -j (in primes.pasm), I did this experiment:
>> - do register allocation for JIT in imcc
>> - use the first N registers as MAPped processor registers
>
>
> This sounds pretty interesting, and I bet it could make things faster.
> The one thing to be careful of is that it's easy to get yourself into a
> position where you spend more time optimizing the code you're JITting
> than you win in the end.


I don't think so. Efficiency of JIT code depends very much on register
save/restore instructions. imcc does a full parrot register life
analysis and knows when e.g. I17 is rewritten, and can thus assign it the
same register that some instructions above I5 had. Current JIT code
looks at parrot registers and emits saves/loads to keep processor
registers in sync, which is the opposite of my proposal:
map the top N used parrot regs to physical processor registers. This
means imcc emits instructions to get parrot registers up to date, and
not vice versa. The code is already in terms of processor regs.


> You also have to be very careful that you don't reorder things, since
> there's not enough info in the bytecode stream to know what can and
> can't be moved. (Which is something we need to deal with in IMCC as well)

Yep. So I'm trying to get *all* needed info into the bytecode
stream, into the op_info, or as a hack in imcc. See e.g. "[RFC] imcc
calling conventions". Please remember the time when I started digging
into parrot and core.ops: the in/out/inout definitions of P-registers.
These issues are *crucial* for a language *compiler*.
If perl6 or any other language is to run *efficiently*, imcc has to be
a compiler with all needed info at hand and not a plain PASM assembler.

leo

Gopal V

Feb 22, 2003, 2:18:27 AM
to P6I
If memory serves me right, Dan Sugalski wrote:
> This sounds pretty interesting, and I bet it could make things
> faster. The one thing to be careful of is that it's easy to get
> yourself into a position where you spend more time optimizing the
> code you're JITting than you win in the end.

I think that's not the case for ahead-of-time optimisations. As long
as the JIT is not the optimiser, you could take your time optimising.

The topic is really misleading... or am I the one who's wrong?

> You also have to be very careful that you don't reorder things, since
> there's not enough info in the bytecode stream to know what can and
> can't be moved. (Which is something we need to deal with in IMCC as
> well)

I'm assuming that the temporaries are the things being moved around here?
Since imcc already moves them around anyway and the programmer makes
no assumptions about their positions -- this shouldn't be a problem?

The only question I have here is: how does imcc identify loops? I've
been using "if goto" to loop around, which is exactly the way assembly
does it. But that sounds like a lot of work, identifying the loops and
optimising accordingly.

To make it more clear -- identifying tight loops and the usage weights
correctly: 10 uses of $I0 outside the loop vs 1 use of $I1 inside a
100-times loop. Which will come first?

Gopal
--
The difference between insanity and genius is measured by success

Leopold Toetsch

Feb 22, 2003, 6:50:23 AM
to Gopal V, P6I
Gopal V wrote:

> I'm assuming that the temporaries are the things being moved around here ?.


It is not so much a matter of moving things around as a matter of
allocating (and renumbering) parrot (or, for JIT, processor) registers.
These are of course mainly temporaries, but even when you have some
find_lexical/do_something/store_lexical, imcc selects the best register
for all involved ops; temps or "variables", it doesn't really matter.


> The only question I have here , how does imcc identify loops ?. I've
> been using "if goto" to loop around , which is exactly the way assembly
> does it. But that sounds like a lot of work identifying the loops and
> optimising accordingly.


Here are the basic blocks, the CFG and the loop info of:
0 set I0, 10
1 x:
1 unless I0, y
2 dec I0
2 print I0
2 print "\n"
2 branch x
3 y:
3 end

Dumping the CFG:
-------------------------------
0 (0) -> 1 <-
1 (1) -> 2 3 <- 2 0
2 (1) -> 1 <- 1
3 (0) -> <- 1

Loop info
---------
loop 0, depth 1, size 2, entry 0, contains blocks:
1 2
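Not speaking for imcc's actual implementation, but the core of detecting such a loop is small: a depth-first search that finds an edge back to a block still on the DFS stack (here block 2 -> block 1, giving the loop {1, 2}). The adjacency data mirrors the dump above; everything else is an illustrative sketch:

```c
#include <assert.h>

/* CFG from the dump: 0->1, 1->{2,3}, 2->1; -1 terminates a successor
 * list.  A DFS edge whose target is still on the recursion stack is a
 * back edge and closes a natural loop. */
enum { NBLK = 4 };
static const int succ[NBLK][2] = { {1, -1}, {2, 3}, {1, -1}, {-1, -1} };

static int visited[NBLK], on_stack[NBLK];
static int back_src = -1, back_dst = -1; /* the detected back edge */

static void dfs(int b)
{
    visited[b] = on_stack[b] = 1;
    for (int i = 0; i < 2; i++) {
        int s = succ[b][i];
        if (s < 0)
            continue;
        if (on_stack[s]) {            /* back edge: loop entry found */
            back_src = b;
            back_dst = s;
        } else if (!visited[s]) {
            dfs(s);
        }
    }
    on_stack[b] = 0;
}
```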


> To make it more clear -- identifying tight loops and the usage weights
> correctly. 10 uses of $I0 outside the loop vs 1 use of $I1 inside a 100
> times loop. Which will be come first ?.


This is basically the current score calculation used for register
allocation:

r->score = r->use_count + (r->lhs_use_count << 2);

r->score += 1 << (loop_depth * 3);


> Gopal

leo


Gopal V

Feb 22, 2003, 9:43:59 AM
to Leopold Toetsch, P6I
If memory serves me right, Leopold Toetsch wrote:
> > I'm assuming that the temporaries are the things being moved around here ?.
>
>
> It is not so much a matter of moving things around, but a matter of
> allocating (and renumbering) parrot (or for JIT) processor registers.

Ok.. well I sort of understood that the first N registers will be the
ones MAPped? So I thought re-ordering/sorting was the operation performed.

Direct hardware maps (like using CX for loop count etc.) will need to be
platform dependent? Or you could have a fixed reg that can be used for
the loop count (and gets mapped on hardware appropriately).

> > does it. But that sounds like a lot of work identifying the loops and
> > optimising accordingly.

....


> Loop info
> ---------
> loop 0, depth 1, size 2, entry 0, contains blocks:
> 1 2

Hmm.. this is what I said "sounds like a lot of work" ... which still
remains true from my perspective :-)

> r->score = r->use_count + (r->lhs_use_count << 2);
>
> r->score += 1 << (loop_depth * 3);

Ok... the deeper the loop, the more important the var is.. cool.

Leopold Toetsch

Feb 22, 2003, 10:28:13 AM
to Gopal V, P6I
Gopal V wrote:

> If memory serves me right, Leopold Toetsch wrote:


> Ok .. well I sort of understood that the first N registers will be the
> ones MAPped ?. So I thought re-ordering/sorting was the operation performed.


Yep. Register renumbering, so that the top N used (in terms of score)
registers are I0, I1, ... In-1.


> Direct hardware maps (like using CX for loop count etc) will need to be
> platform dependent ?. Or you could have a fixed reg that can be used for
> loop count (and gets mapped on hardware appropriately).


We currently don't have special registers like %ecx for loops; they are
not used in JIT either. My Pentium manual states that these ops are not
the fastest.
But in the long run, we should have some hints that e.g. i386 needs
%ecx as the shift count, or that div uses %edx. But probably i386 is the
only weird architecture with such ugly restrictions - and with far too
few registers.


>>Loop info

> Hmm.. this is what I said "sounds like a lot of work" ... which still
> remains true from my perspective :-)


There is still a lot of work, yes, but some things already are done:

set I10, 10
x:
if I10, ok
branch y
ok:
set I0, 1
sub I10, I10, I0
print I10
print "\n"
branch x
y:
end

Ends up (with imcc -O2p) as:

set I0, 10
set I1, 1
x:
unless I0, y
sub I0, I1
print I0
print "\n"
branch x
y:
end

You can see:

opt1 sub I10, I10, I0 => sub I10, I0
if_branch if ... ok
label ok deleted

found invariant set I0, 1
inserting it in blk 0 after set I10, 10


The latter optimization works outward from the innermost loop.
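The invariant test behind that hoisting step can be sketched with a deliberately simplified rule (the struct and fields are invented): a constant-source instruction whose target has no other definition inside the loop may be moved to the block before the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Toy instruction record: destination register number and whether the
 * source operand is a constant. */
struct ins { const char *op; int dst; int src_is_const; };

/* An instruction is loop-invariant here if its source is a constant
 * and no other instruction in the loop writes its destination (the
 * count includes the instruction itself, hence == 1). */
static int is_invariant(const struct ins *in,
                        const struct ins *loop, size_t n)
{
    if (!in->src_is_const)
        return 0;
    size_t writes = 0;
    for (size_t i = 0; i < n; i++)
        if (loop[i].dst == in->dst)
            writes++;
    return writes == 1;
}
```

In the example above, `set I0, 1` passes this test (constant source, single definition of I0 in the loop) and gets inserted after `set I10, 10` in block 0, while `sub I10, I10, I0` does not (its sources are registers).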


> Gopal

leo

Nicholas Clark

Feb 22, 2003, 11:53:09 AM
to Gopal V, Leopold Toetsch, P6I
Please don't take the following as a criticism of imcc - I'm sure I manage
to write code with things like this all the time.

> r->score = r->use_count + (r->lhs_use_count << 2);
>
> r->score += 1 << (loop_depth * 3);

Doesn't this work only until variables in 11-deep loops go undefined?
(score appears to be a signed int)
I'm not sure how to patch this specific instance - just trap loop depths over
10? Should score be unsigned?

More importantly, how do we trap these sort of things in the general case?

I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc
that added trap code for all classes of undefined behaviour, and caused
code to abort (or something more colourfully "undefined") if anything
undefined gets executed. I realise that code would run very slowly, but it
would be a very very useful debugging tool.

Nicholas Clark

Rafael Garcia-Suarez

Feb 22, 2003, 3:44:12 PM
to perl6-i...@perl.org
Nicholas Clark wrote in perl.perl6.internals :

>
>> > r->score = r->use_count + (r->lhs_use_count << 2);
>> >
>> > r->score += 1 << (loop_depth * 3);
[...]

> I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc
> that added trap code for all classes of undefined behaviour, and caused
> code to abort (or something more colourfully "undefined") if anything
> undefined gets executed. I realise that code would run very slowly, but it
> would be a very very useful debugging tool.

What undefined behaviour are you referring to exactly? The shift
overrun? AFAIK it's very predictable (given one int size). Cases of
potential undefined behavior can usually be detected at compile time. I
imagine that shift-overrun detection can be enabled via an ugly macro
and a cpp symbol.

(What's a nasal demon? Can't find the nasald(8) manpage.)

Leopold Toetsch

Feb 22, 2003, 4:14:51 PM
to Nicholas Clark, Gopal V, P6I
Nicholas Clark wrote:

>>> r->score += 1 << (loop_depth * 3);

> until variables in 11 deep loops go undefined?


Not undefined, but spilled. First: *oops*; but second, of course this is
all not final. I have changed the scoring several times from the code
base that AFAIK Angel Faus implemented. And we don't currently have any
code that goes near the complexity of such a deeply nested loop.

There are probably a *lot* of such gotchas in the whole CFG code in
imcc. I'm currently looking at some failing perl6 tests when using
optimization, all in regex tests, which do a lot of branching.


> I'm not sure how to patch this specific instance - just trap loop depths over
> 10? Should score be unsigned?


A linear counting of loop_depth would do it, e.g.

r->score += 100 * loop_depth;

Or always score vars in more deeply nested loops higher than those
outside, or ...


> More importantly, how do we trap these sort of things in the general case?


With a lot of tests.


> I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc
> that added trap code for all classes of undefined behaviour, and caused
> code to abort (or something more colourfully "undefined") if anything
> undefined gets executed. I realise that code would run very slowly, but it
> would be a very very useful debugging tool.


I'm currently adding asserts to e.g. loop detection code. Last one (to
be checked in) is:

/* we could also take the depth of the first contained
* block, but below is a check, that an inner loop is fully
* contained in an outer loop
*/

This is a check that all blocks of a more deeply nested loop are
contained entirely in the outer loop, so that there can't be basic
blocks outside. But in regex code this seems not to be true - or a
prior stage of optimization messes things up.
These issues are hard to debug, deeply buried in ~400 basic blocks
with ~1000 edges connecting them.

perl6 $ ../imcc/imcc -O1 -d70 t/rx/basic_2.imc 2>&1 | less


> Nicholas Clark


leo

Nicholas Clark

Feb 22, 2003, 4:27:05 PM
to Rafael Garcia-Suarez, perl6-i...@perl.org
On Sat, Feb 22, 2003 at 08:44:12PM -0000, Rafael Garcia-Suarez wrote:
> Nicholas Clark wrote in perl.perl6.internals :
> >
> >> > r->score = r->use_count + (r->lhs_use_count << 2);
> >> >
> >> > r->score += 1 << (loop_depth * 3);
> [...]
> > I wonder how hard it would be to make a --fsummon-nasal-demons flag for gcc
> > that added trap code for all classes of undefined behaviour, and caused
> > code to abort (or something more colourfully "undefined") if anything
> > undefined gets executed. I realise that code would run very slowly, but it
> > would be a very very useful debugging tool.
>
> What undefined behaviour are you referring to exactly ? the shift
> overrun ? AFAIK it's very predictable (given one int size). Cases of

Will you accept a shortcut written in perl? The shift op uses C signed
integers:

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
0

vs

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
1

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
1

vs

$ perl -MConfig -le 'print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
0

(All 4 are Debian GNU/Linux.
And both architectures that give 0 for a shift of 32 happen to give 1 for
a shift of 256.
But I wouldn't count on it for all architectures.)
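In C the portable fix is to guard the shift count rather than rely on any platform's wrap-around; a small sketch:

```c
#include <assert.h>
#include <limits.h>

/* C leaves 1 << n undefined once n reaches the width of the type, so
 * check the count explicitly instead of trusting the hardware's
 * (varying) wrap-around behaviour. */
static unsigned long safe_shl1(unsigned n)
{
    const unsigned width = sizeof(unsigned long) * CHAR_BIT;
    return n >= width ? 0UL : 1UL << n;
}
```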

> potential undefined behavior can usually be detected at compile-time. I

In this specific case, maybe. In the general case, no:
signed integer arithmetic overflowing is undefined behavior.
> imagine that shift overrun detection can be enabled via an ugly macro
> and a cpp symbol.
>
> (what's a nasal demon ? can't find the nasald(8) manpage)

Demons flying out of your nose. One alleged consequence of undefined
behaviour. Another is your computer turning into a butterfly. I guess a
third is "Microsoft releasing a bug-free program".

Nicholas Clark

Nicholas Clark

Feb 22, 2003, 4:39:47 PM
to Rafael Garcia-Suarez, perl6-i...@perl.org
On Sat, Feb 22, 2003 at 09:27:04PM +0000, nick wrote:
> On Sat, Feb 22, 2003 at 08:44:12PM -0000, Rafael Garcia-Suarez wrote:

> > What undefined behaviour are you referring to exactly ? the shift
> > overrun ? AFAIK it's very predictable (given one int size). Cases of
>
> Will you accept a shortcut written in perl? The shift op uses C signed
> integers:

Oops. The logical shift uses *un*signed integers, except under use integer

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
0

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
1234
1

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
0

$ perl -MConfig -le 'use integer; print foreach ($^O, $Config{byteorder}, 1 << 32)'
linux
4321
1

So there's actually no difference in the numbers. But as I'm being a pedant I
ought to get the facts right. [I guess it's my fault for drinking Australian
wine :-)]

Nicholas Clark

Dan Sugalski

Feb 22, 2003, 6:03:18 PM
to Leopold Toetsch, Gopal V, P6I
At 4:28 PM +0100 2/22/03, Leopold Toetsch wrote:

>Gopal V wrote:
>>Direct hardware maps (like using CX for loop count etc) will need to be
>>platform dependent ?. Or you could have a fixed reg that can be used for
>>loop count (and gets mapped on hardware appropriately).
>
>
>We currently don't have special registers, like %ecx for loops, they
>are not used in JIT either. My Pentium manual states, that these ops
>are not the fastest.
>But in the long run, we should have some hints, that e.g. i386 needs
>%ecx as shift count, or that div uses %edx. But probably i386 is the
>only weird architecure with such ugly restrictions - and with far
>too few registers.

I'm OK with adding in documentation that encourages using particular
registers for particular purposes, or having some sort of metadata
for the JIT that notes loop registers or something. As long as it's
out of band and optional, that's cool.

Leopold Toetsch

Feb 23, 2003, 10:43:14 AM
to Dan Sugalski, P6I
Dan Sugalski wrote:

> At 12:09 PM +0100 2/20/03, Leopold Toetsch wrote:
>
>> Starting from the unbearable fact that optimized compiled C is still
>> faster than parrot -j (in primes.pasm), I did this experiment:
>> - do register allocation for JIT in imcc
>> - use the first N registers as MAPped processor registers
>
>
> This sounds pretty interesting, and I bet it could make things faster.


I have now checked in a first version for testing:
- the define JIT_IMCC_OJ in jit.c is disabled - so no impact
- jit2h.pl now defines a MAP macro, which makes jit_cpu.c more readable
Restrictions:
- no vtable ops
- no saving of non-preserved registers (%edx on i386)
So not much will run when experimenting with it.

But I think the numbers are promising, so it's worth a further try.

To enable the whole fun, recompile with JIT_IMCC_OJ enabled, build imcc
and use the -Oj switch (primes.pasm is from examples/benchmarks):

$ time imcc -j -Oj primes.pasm
N primes up to 50000 is: 5133
last is: 49999
Elapsed time: 3.523477

real 0m3.548s

$ ./primes # primes.c -O3 gcc 2.95.2
N primes up to 50000 is: 5133
last is: 49999
Elapsed time: 3.647063

$ time imcc -j -O1 primes.pasm # normal JIT
N primes up to 50000 is: 5133
last is: 49999
Elapsed time: 4.039121

real 0m4.065s

imcc/parrot was built without optimization, but this doesn't matter: no
external code is called for jit/i386 in primes.pasm.
The timings for imcc obviously include compile time too.

leo


Leopold Toetsch

Feb 25, 2003, 5:16:42 AM
to Leopold Toetsch, P6I
Leopold Toetsch wrote:


> - do register allocation for JIT in imcc
> - use the first N registers as MAPped processor registers


I have committed the next bunch of changes and an updated jit.pod.
- it should now be platform independent, *but* other platforms have to
define what they consider preserved (callee-saved) registers and
put these first in the mapped register lists.
- for testing, enable JIT_IMCC_OJ in jit.c, and for platforms != i386:
copy the MAP macro at the bottom of jit/i386/jit_emit.h to your
jit_emit.h
- run programs like so:
imcc -Oj -d8 primes.pasm (-d8 shows the generated ins)

It now runs ~95% of the parrot tests on i386, but YMMV.


Have fun,

leo


Angel Faus

Feb 25, 2003, 1:18:11 PM
to Leopold Toetsch, Gopal V, P6I
Saturday 22 February 2003 16:28, Leopold Toetsch wrote:
> Gopal V wrote:
> > If memory serves me right, Leopold Toetsch wrote:
> >
> >
> > Ok .. well I sort of understood that the first N registers will
> > be the ones MAPped ?. So I thought re-ordering/sorting was the
> > operation performed.
>
> Yep. Register renumbering, so that the top N used (in terms of
> score) registers are I0, I1, ..In-1

With your approach there are three levels of parrot "registers":

- The first N registers, which in JIT will be mapped to physical
registers.

- The other 32 - N parrot registers, which will be in memory.

- The "spilled" registers, which are also in memory, but will have to
be copied to a parrot register (which may be a memory location or a
physical register) before being used.

I believe it would be smarter if we instructed IMCC to generate code
that only uses N parrot registers (where N is the number of machine
registers available). This way we avoid the risk of having to copy
the data twice.

This is also interesting because it gives the register allocation
algorithm all the information about the actual structure of the
machine we are going to run on. I am quite confident that code
generated this way would run faster.

We also need to have a better procedure for saving and restoring
spilled registers, especially in the case of JIT compilation, where it
could be translated to a machine save/restore.

What do you think about it?

-angel

Jason Gloudon

Feb 25, 2003, 8:46:43 AM
to Angel Faus
On Tue, Feb 25, 2003 at 07:18:11PM +0100, Angel Faus wrote:
> I believe it would be smarter if we instructed IMCC to generate code
> that only uses N parrot registers (where N is the number of machine
> register available). This way we avoid the risk of having to copy
> twice the data.

It's not going to be very good if I compile code to pbc on an x86 where there
are about 3 usable registers and try to run it on any other CPU with a lot more
registers.

--
Jason

Leopold Toetsch

Feb 25, 2003, 8:51:57 AM
to af...@corp.vlex.com, Gopal V, P6I
Angel Faus wrote:

> Saturday 22 February 2003 16:28, Leopold Toetsch wrote:
>
> With your approach there are three levels of parrot "registers":
>
> - The first N registers, which in JIT will be mapped to physical
> registers.
>
> - The others 32 - N parrot registers, which will be in memory.
>
> - The "spilled" registers, which are also on memory, but will have to
> be copied to a parrot register (which may be a memory location or a
> physical registers) before being used.


Spilling is really rare; you have to work hard to get a test case :-)
But when it comes to spilling, we should do some register renumbering
(which is the case for processor registers too). The current allocation
is per basic block. When we start spilling, new temp registers are
created, so that the register life range is limited to the usage of the
new temp register and the spill code.
This is rather expensive, as for one spilled register the whole life
analysis has to be redone.


> I believe it would be smarter if we instructed IMCC to generate code
> that only uses N parrot registers (where N is the number of machine
> register available). This way we avoid the risk of having to copy
> twice the data.


I don't think so. When we have all 3 levels of registers, using fewer
parrot registers would just produce more spilled registers.

Actually, I'm currently generating code that uses 32+N registers. The
processor registers are numbered -1, -2 ... for the top used parrot
registers 0, 1, ... But the processor registers are only fixed mirrors
of the parrot registers.


> This is also insteresting because it gives the register allocation
> algorithm all the information about the actual structure of the
> machine we are going to run in. I am quite confident that code
> generated this way would run faster.


All the normal operations basically boil down to 2 different machine
instruction types, e.g. for some binop <op>:

<op>_rm or <op>_rr (i386)
<op>_rrr (RISC arch)

These are surrounded by mov_rm / mov_mr to load/store non-mapped
processor registers from/to parrot registers; the reg(s) are then some
scratch registers like %eax on i386 or r11/r12 for ppc.

See e.g. jit/{i386,ppc}/core.jit

So the final goal could be to emit these load/stores too, which then
could be optimized to avoid duplicate loading/storing. Or imcc could
emit a register move if the parrot register is used again in the next
instruction.
Then processor-specific hints could come in, like:
shr_rr_i for i386 has to have the shift count in %ecx.

> We also need to have a better procedure for saving and restoring
> spilled registers, especially in the case of JIT compilation, where it
> could be translated to a machine save/restore.


I don't see much here. Where should the spilled registers be stored then?


> What do you think about it?


I think when it comes to spilling, we should divide the basic block to
get shorter life ranges, which would then allow register renumbering.


> -angel

leo

Phil Hassey

Feb 25, 2003, 10:06:23 AM
to Leopold Toetsch, af...@corp.vlex.com, Gopal V, P6I
On Tuesday 25 February 2003 08:51, Leopold Toetsch wrote:
> Angel Faus wrote:
> > Saturday 22 February 2003 16:28, Leopold Toetsch wrote:
> >
> > With your approach there are three levels of parrot "registers":
> >
> > - The first N registers, which in JIT will be mapped to physical
> > registers.
> >
> > - The others 32 - N parrot registers, which will be in memory.
> >
> > - The "spilled" registers, which are also on memory, but will have to
> > be copied to a parrot register (which may be a memory location or a
> > physical registers) before being used.
>
> Spilling is really rare, you have to work hard, to get a test case :-)
> But when it comes to spilling, we should do some register renumbering
> (which is the case for processor registers too). The current allocation
> is per basic block. When we start spilling, new temp registers are
> created, so that the register life range is limited to the usage of the
> new temp register and the spill code.
> This is rather expensive, as for one spilled register, the whole life
> analysis has to be redone.

Not knowing much about virtual machine design... Here's a question --
Why do we have a set number of registers? Particularly since JITed code
ends up setting the register constraints again, I'm not sure why parrot
should set up register limit constraints first. Couldn't each code block say
"I need 12 registers for this block" and then the JIT system would go on to
do its appropriate spilling magic with the system registers...

I suspect the answer has something to do with optimized C and not making
things hairy, but I had to ask anyway. :)

...

Phil

Leopold Toetsch

unread,
Feb 25, 2003, 11:23:48 AM2/25/03
to philh...@users.sourceforge.net, P6I
Phil Hassey wrote:

>
> Not knowing much about virtual machine design... Here's a question --
> Why do we have a set number of registers? Particularly since JITed code
> ends up setting the register constraints again, I'm not sure why parrot
> should set up register limit constraints first. Couldn't each code block say
> "I need 12 registers for this block" and then the JIT system would go on to
> do its appropriate spilling magic with the system registers...


This is roughly the approach the current optimizer in jit.c takes. The
optimizer looks at a section (a JITed part of a basic block), checks
register usage and then assigns the top N registers to processor registers.

This has 2 disadvantages:
- it's done at runtime - always. It's pretty fast, but could have
non-trivial overhead for big programs
- as each section and therefore each basic block has its own set of
mapped registers, on almost every boundary of a basic block and when
calling out to non-JITed code, processor registers have to be saved to
parrot's and restored back again. These memory accesses slow things
down, so I want to avoid them where possible.
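[A rough model of that runtime scheme - names are mine, the real code is in jit.c - is: count register usage in a section and map the N most-used registers to processor registers:]

```python
from collections import Counter

def map_top_n(uses, n):
    """uses: the parrot registers of one JITed section, in order of
    appearance.  Returns the n most-used ones, which would get mapped
    to processor registers; ties go to the register seen first."""
    return [reg for reg, _ in Counter(uses).most_common(n)]

# I2 is used 3x, I0 2x, I1 once: with 2 mappable registers,
# I2 and I0 get processor registers, I1 stays in memory.
section = ["I2", "I0", "I2", "I1", "I2", "I0"]
print(map_top_n(section, 2))   # ['I2', 'I0']
```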


> Phil

leo


Angel Faus

unread,
Feb 25, 2003, 8:21:32 PM2/25/03
to Leopold Toetsch, Gopal V, P6I
I explained it very badly. The issue is not spilling (at the parrot
level).

The problem is: if you only pick the highest-priority parrot registers
and put them in real registers, you are losing opportunities where
copying the data once will save you from copying it many times. You
are, in some sense, underspilling.

Let's see an example. Imagine you are compiling this imc, to be run
on a machine which has 3 registers free (after temporaries):

set $I1, 1
add $I1, $I1, 1
print $I1

set $I2, 1
add $I2, $I2, 1
print $I2

set $I3, 1
add $I3, $I3, 1
print $I3

set $I4, 1
add $I4, $I4, 1
print $I4

set $I5, 1
add $I5, $I5, 1
print $I5

print $I1
print $I2
print $I3
print $I4
print $I5

Very silly code indeed, but you get the idea.

Since we have only 5 vars, imcc would turn this into:

set I1, 1
add I1, I1, 1
print I1

set I2, 1
add I2, I2, 1
print I2

set I3, 1
add I3, I3, 1
print I3

set I4, 1
add I4, I4, 1
print I4

set I5, 1
add I5, I5, 1
print I5

print I1
print I2
print I3
print I4
print I5

Now, assuming you put registers I1-I3 in real registers, what would it
take to execute this code in JIT?

It would have to move the values of I4 and I5 from memory to registers
a total of 10 times (4 saves and 6 restores if you assume the JIT is
smart)

[This particular example could be improved by making the jit look if
the same parrot register is going to be used in the next op, but
that's not the point]
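[The 10-move count can be checked with a toy model - mine, not imcc's or jit.c's - where a naive JIT loads before every read and stores after every write of an unmapped register:]

```python
def memory_moves(events, mapped):
    """events: ("r"|"w", reg) accesses in order; every access to a
    register that is not mapped to a machine register costs one move
    between memory and a register."""
    return sum(1 for _, reg in events if reg not in mapped)

# Accesses to I4 and I5 when only I1-I3 are mapped: set (w),
# add (r then w), print (r) for each, plus the two final prints.
events = [("w", "I4"), ("r", "I4"), ("w", "I4"), ("r", "I4"),
          ("w", "I5"), ("r", "I5"), ("w", "I5"), ("r", "I5"),
          ("r", "I4"), ("r", "I5")]
print(memory_moves(events, {"I1", "I2", "I3"}))   # 10 (4 saves, 6 restores)
```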

But, if IMCC knew that there were really only 3 registers in the
machine, it would generate:

set I1, 1
add I1, I1, 1
print I1

set I2, 1
add I2, I2, 1
print I2

set I3, 1
add I3, I3, 1
print I3

fast_save I3, 1

set I3, 1
add I3, I3, 1
print I3

fast_save I3, 2

set I3, 1
add I3, I3, 1
print I3

fast_save I3, 3

print I1
print I2
fast_restore I3, 3
print I3
fast_restore I3, 2
print I3
fast_restore I3, 1
print I3

When running this code in the JIT, it would only require 6 moves (3
saves, 3 restores): exactly the ones generated by imcc.

In reality this would be even better, because as you have the guarantee
of having the data already in real registers, you need fewer
temporaries and so have more machine registers free.

> So the final goal could be, to emit these load/stores too, which
> then could be optimized to avoid duplicate loading/storing. Or imcc
> could emit a register move, if in the next instruction the parrot
> register is used again.

Yes, that's the idea: making imcc generate the loads/stores, using the
info about how many registers are actually available in the real
machine _and_ its own knowledge about the program flow.

An even better goal would be to have imcc know how many temporaries
every JITed op requires, and use this information during register
allocation.

All this is obviously machine dependent: the code generated should
only run in the machine it was compiled for. So we should always keep
the original imc code in case we copy the pbc file to another
machine.

-angel


Nicholas Clark

unread,
Feb 25, 2003, 4:19:01 PM2/25/03
to Angel Faus, Leopold Toetsch, Gopal V, P6I
On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:

[snip lots of good stuff]

> All this is obviously machine dependent: the code generated should
> only run in the machine it was compiled for. So we should always keep
> the original imc code in case we copy the pbc file to another
> machine.

Er, but doesn't that mean that imc code has now usurped the role of parrot
byte code?

I'm not sure what is a good answer here. But I thought that the intent of
parrot's bytecode was to be the same bytecode that runs everywhere. Which
is slightly incompatible with compiling perl code to something that runs
as fast as possible on the machine that you're both compiling and running
on. (These two being the same machine most of the time).

Maybe we're starting to get to the point of having imcc deliver parrot bytecode
if you want to be portable, and something approaching native machine code
if you want speed. Or maybe, if you want the latter, we save "fat" bytecode
files that contain IMC code, bytecode and JIT-food for one or more
processors.

And is this all premature optimisation, given that we haven't got objects,
exceptions, IO or a Z-code interpreter yet?

Nicholas Clark

Leopold Toetsch

unread,
Feb 25, 2003, 5:58:41 PM2/25/03
to Nicholas Clark, Angel Faus, Gopal V, P6I
Nicholas Clark wrote:

> On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:
>
> [snip lots of good stuff]
>
>
>>All this is obviously machine dependent: the code generated should
>>only run in the machine it was compiled for. So we should always keep
>>the original imc code in case we copy the pbc file to another
>>machine.
>>
>
> Er, but doesn't that mean that imc code has now usurped the role of parrot
> byte code?


No. It's like another runtime option. Run "imcc -Oj the.pasm" and you
get what you want: a differently optimized piece of JIT code that might
run faster than "imcc -j the.pasm".
And saying "imcc -Oj -o the.pbc the.pasm" should spit out the fastest
bytecode possible, for your very machine.


> I'm not sure what is a good answer here. But I thought that the intent of
> parrot's bytecode was to be the same bytecode that runs everywhere.


Yep

> ... Which


> is slightly incompatible with compiling perl code to something that runs
> as fast as possible on the machine that you're both compiling and running
> on. (These two being the same machine most of the time).


At the PBC level, imcc already has "-Op", which does parrot register
renumbering (modulo NCI and such, where fixed registers are needed, and
this is -- hmmm -- suboptimal then :), and imcc can write out CFG
information in some machine-independent form, i.e. at basic block level.
But no processor-specific load/store instructions and such.
This can help the JIT optimizer to do the job faster, though it isn't that
easy, because there are non-JITed code sequences interspersed.

I think some difficulties arise when looking at what imcc now is: it's
the assemble.pl generating PBC files. But it's also parrot - it can run
PBC files - and it's both - it can run PASM (or IMC) files
immediately. And the latter can always be as fast as the $arch
allows. Generating PBC doesn't have to use the same compile options -
just as you wouldn't use them when running "gcc -b machine".


> Maybe we're starting to get to the point of having imcc deliver parrot bytecode
> if you want to be portable, and something approaching native machine code
> if you want speed.


IMHO yes; the normal options produce a plain PBC file, more or less
optimized at the PASM level. The -Oj option is definitely a machine
optimization option, which can run or will create a PBC that runs only
on a machine with equally many or fewer mapped registers and the same
external (non-JITted) instructions, i.e. on the same $arch.
But the normal case is that I compile the source for my machine and run
it here - with all possible optimizations.
I never did any cross compilation here. Shipping the source is
enough. Plain PBC is still like an unoptimized executable running
everywhere - not a machine-specific cross-compiled EXE.

> ... Or maybe if you want the latter we save "fat" bytecode


> files, that contain IMC code, bytecode and JIT-food for one or more
> processors.


There is really no need for a fat PBC. Though - as already stated - I
could imagine some cross compile capabilities for -Oj PBCs.


> And is this all premature optimisation, given that we haven't got objects,
> exceptions, IO or a Z-code interpreter yet?


It is a different approach to JIT register allocation. The current
optimizer allocates registers per JITed section, with no chance (IMHO)
to reuse registers after a branch, because the optimizer lacks the
information to know that this branch target will only be reached from
here and that the registers are the same - knowledge that would let it
avoid saving/loading processor registers to memory.

OTOH imcc has almost all this info already at hand (coming out of
CFG/life information needed for allocating parrot regs from $temps). So
the chance for generating faster code is there, IMHO.

Premature optimization - partly, of course, yes/no:
My copy here now runs all parrot tests except op/interp_2 (obvious: this
compares traced instructions, where -Oj inserted some register
load/saves) and the pmc/nci tests, where just the fixed parameter/return
result registers are messed up - the "imcc calling conventions" thread has
a proposal for this.
And yes: we don't have exceptions and threads yet. The other items
don't matter (IMHO).
But we will come to a point where, for certain languages, we will
optimize P-registers, or mix them with I-regs, reusing the same processor
regs. :-)


> Nicholas Clark

leo

Leopold Toetsch

unread,
Feb 25, 2003, 5:10:05 PM2/25/03
to af...@corp.vlex.com, Gopal V, P6I
[ you seem to be living some hours ahead in time ]

Angel Faus wrote:

> I explained very badly. The issue is not spilling (at the parrot
> level)


The problem stays the same: spilling processor registers to parrot's, or
parrot registers to an array.

[ ... ]


> set I3, 1
> add I3, I3, 1
> print I3
>
> fast_save I3, 1
>
> set I3, 1


The "fast_save" above is spilling at the parrot register level; moving
regs to parrot registers is spilling at the processor register level.
Actual machine code could be:

mov 1, %eax # first write to a parrot register
inc %eax # add I3, I3, 1 => (*) add I3, 1 => inc I3
mov %eax, I3 # store reg to parrot registers mem
print I3 # print is external
*) already done now

The above sequence of code wouldn't consume any mapped register - for
the whole sequence originally shown.


>>So the final goal could be, to emit these load/stores too, which
>>then could be optimized to avoid duplicate loading/storing.

> An even better goal would be to have imcc know how many temporaries

> every JITed op requires, and use this information during register
> allocation.


As shown above, yep.


> All this is obviously machine dependent: the code generated should
> only run in the machine it was compiled for. So we should always keep
> the original imc code in case we copy the pbc file to another
> machine.


I'll answer this part in the reply to Nicholas' reply.


> -angel

leo

Nicholas Clark

unread,
Feb 26, 2003, 5:46:18 AM2/26/03
to Leopold Toetsch, Nicholas Clark, Angel Faus, Gopal V, P6I
On Tue, Feb 25, 2003 at 11:58:41PM +0100, Leopold Toetsch wrote:
> Nicholas Clark wrote:

[thanks for the explanation]

> > And is this all premature optimisation, given that we haven't got objects,
> > exceptions, IO or a Z-code interpreter yet?

> And yes: We don't have exceptions and threads yet. The other items,
> don't matter (IMHO).

Well, I think that proper IO would be useful. But I don't think it affects
the innards of the execution system greatly - is there any reason why
parrot (or at least PBC) can't conceptually treat IO the same way that C
does - as just another standard library?

"Z-code interpreter" is obfuscated shorthand for "dynamic opcode libraries"
and "reading foreign bytecode". I regard the first as important, the second
as "would be nice". I think Dan rates "reading foreign bytecode" more
important than I do.

Nicholas Clark

Leopold Toetsch

unread,
Feb 26, 2003, 9:02:13 AM2/26/03
to Nicholas Clark, Nicholas Clark, P6I
Nicholas Clark wrote:

>
> Well, I think that proper IO would be useful. But I don't think it affects
> the innards of the execution system greatly >


No, though we will need some more ops - or not. The current io also defines
a more or less dummy io PMC (e.g. io.ops:open). This could be a full
PMC, with an io_vtable (which could reflect the io stack). The most used
operations would be separate opcodes; others could be methods of this
io_pmc.

> ...- is there any reason why


> parrot (or at least PBC) can't conceptually treat in the same way that C
> treats IO - just another standard library?


Some times ago, I posted: "[RfC] a scheme for core.ops extending" :)


> "Z-code interpreter" is obfuscated shorthand for "dynamic opcode libraries"
> and "reading foreign bytecode". I regard the first as important, the second
> as "would be nice". I think Dan rates "reading foreign bytecode" more
> important than I do.


AFAIK we are not able to execute Z-code directly by just loading a
different opcode library. The Z-ops have parameters encoded in them. So
we can only load a Z-code interpreter/compiler which then reads the
Z-code program, which is simply data then, not bytecode. Though it might
help to have some specialized Z-ops for execution, but this falls under
the above "extending".


> Nicholas Clark


leo


Phil Hassey

unread,
Feb 26, 2003, 12:09:13 PM2/26/03
to Leopold Toetsch, Nicholas Clark, Angel Faus, Gopal V, P6I
[snip]

> > Maybe we're starting to get to the point of having imcc deliver parrot
> > bytecode if you want to be portable, and something approaching native
> > machine code if you want speed.
>
> IMHO yes, the normal options produce a plain PBC file, more or less
> optimized at PASM level, the -Oj option is definitely a machine
> optimization option, which can run or will create a PBC that runs only
> on a machine with equally or less mapped registers and the same external
> (non JITted instructions) i.e. on the same $arch.
> But the normal case is, that I compile the source for my machine and run
> it here - with all possible optimizations.
> I never did do any cross compilation here. Shipping the source is
> enough. Plain PBC is still like an unoptimized executable running
> everywhere - not a machine specific cross compile EXE.
>
> > ... Or maybe if you want the latter we save "fat" bytecode
> > files, that contain IMC code, bytecode and JIT-food for one or more
> > processors.
>
> There is really no need for a fat PBC. Though - as already stated - I
> could imagine some cross compile capabilities for -Oj PBCs.

Seems to me it would be good if

- mycode.pl -- my original code

would be compiled into
- mycode.pbc/imc -- platform-neutral parrot bytecode with (as I sort of
suggested a day ago) no limitations on what registers there are, and no spilling
code, as that comes next... In some ways, this is what IMC code is right now.
Although it might be nice if IMC were binary at this stage (for some
feel-good reason?). The current bytecode from parrot already has potential
for slowing things down, and that's what worries me here.

which when run on any system would generate
- mycode.jit -- a platform specific thing with native compiled code

And as a worst case, if a system didn't have a jit module would just run the
mycode.pbc, albeit not very speedily.

This gives the developer several choices:
1. He can hand out his original source (which would require the target to be
able to compile, jit)
2. He can hand out a platform neutral pbc/imc of compiled code that can be
compiled to full speed (which would require the target to be able to either
jit or just run it.)
3. He can hand out a platform specific .jit (which would require the target
to be able to run it.)

I suspect most end users would be able to use #1 or #2. However, for use on
embedded systems where size is an issue, having #3 as an option would be useful,
as I suspect it would shrink the footprint of parrot somewhat.

Just the thoughts of a future parrot user :) Hope they benefit someone.

Cheers,
Phil

Angel Faus

unread,
Feb 26, 2003, 1:00:37 PM2/26/03
to Leopold Toetsch, Gopal V, P6I

> [ you seem to be living some hors ahead in time ]

Yep, sorry about that.

> The problem stays the same: spilling processors to parrot's or
> parrots to array.
>

Thinking a bit more about it, now I believe that the best way to do it
would be:

(1) First, do a register allocation for machine registers, assuming
that there are N machine registers and infinite parrot registers.

(2) Second, do a register allocation for parrot registers, using an
array as spill area.

The first step assures us that we generate code that always puts data
in the available machine registers, and tries to minimize moves
between registers and physical memory.

The second step tries to put all the data in parrot registers and, if
it is not able to do that, in the parrot spilling area (currently a
PerlArray).

For example, code generated by (1) would look like:

set m3, 1 # m3 is the machine 3d register
add m3, m3, 1
print m3

set $I1, m3 # $I1 is a parrot virtual register

etc...

Then we would do register allocation for the virtual $I1 registers,
hoping to be able to put them all in the 32 parrot registers.

I believe this would be the optimal way to do it, because it actually
models our priorities: first put all data in physical registers,
otherwise try to do it in parrot registers.

This is better than reserving the machine registers for the most used
parrot registers (your original proposal) or doing a physical
register allocation and assuming that we have an infinite number of
parrot registers (my original proposal).
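[A toy sketch of phase (1) - the interface and names are mine, purely illustrative: the hottest values get the N machine registers, and the rest become virtual parrot registers for phase (2) to allocate:]

```python
def phase1(values_by_use, n_machine):
    """values_by_use: values sorted by decreasing use count.  The top
    n_machine get machine registers m0..; the rest become virtual
    parrot registers $I0.. for the second allocation pass."""
    alloc = {}
    for i, v in enumerate(values_by_use):
        if i < n_machine:
            alloc[v] = "m%d" % i               # machine register
        else:
            alloc[v] = "$I%d" % (i - n_machine)  # virtual parrot reg
    return alloc

print(phase1(["a", "b", "c", "d"], 3))
# {'a': 'm0', 'b': 'm1', 'c': 'm2', 'd': '$I0'}
```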

Hope that it now makes more sense,

-angel

Leopold Toetsch

unread,
Feb 26, 2003, 12:33:17 PM2/26/03
to philh...@users.sourceforge.net, P6I
Phil Hassey wrote:

> [snip]


> Although it might be nice if IMC were binary at this stage (for some
> feel-good-reason?).


You mean that a HL like perl6 should produce a binary equivalent to
the current .imc file? Yep - this was discussed already, albeit there
was no discussion of how this should look. And the lexer in imcc is
pretty fast.

> ... The current bytecode from parrot already has potential

> for slowing things down, and that's what worries me here.


I don't see that.


> 3. He can hand out a platform specific .jit (which would require the target
> to be able to run it.)
>
> I suspect most end users would be able to use #1 or #2. However for use on
> embedded systems where size is an issue, having #3 an option would be useful,
> as I suspect it would shrink the footprint of parrot somewhat.


The JIT-PBC for #3 has a somewhat larger size than plain PBC, due to
register load/store ops and an additional CFG/register-usage PBC
section. But running it does require less memory, because the JIT
optimizer doesn't have to create all the internal bookkeeping tables.


> Cheers,
> Phil

leo

Leopold Toetsch

unread,
Feb 26, 2003, 1:54:31 PM2/26/03
to af...@corp.vlex.com, Gopal V, P6I
Angel Faus wrote:

>
> (1) First, do a register allocation for machine registers, assuming
> that there are N machine registers and infinite parrot registers.


This equally uses the top N used registers for processor regs. The
"spilling" for (1) is loading/moving them to parrot registers/temp
registers; only the load/store would be what spilling code makes
out of those. Then you still have 32 parrot registers per kind to allocate.

But it is not as easy as it reads: we have non-preserved registers too,
which can be mapped but are not preserved over function calls, so they
must, when mapped and used, be stored to parrot regs and reloaded after
external function calls, if used again in that block or after. Albeit
load/stores of this kind can be optimized, depending on register usage.


> For example, code generated by (1) would look like:
>
> set m3, 1 # m3 is the machine 3d register
> add m3, m3, 1
> print m3
>
> set $I1, m3 # $I1 is a parrot virtual register


Not exactly: print is an external function.
Assuming ri0 - ri3 are mapped and ri3 is not callee-saved:

set ri0, 1
add ri0, 1
set $I0, ri0 # save for print $I0
set $I1, ri3 # save/preserve the register, when used
print $I0 # external function
set ri3, $I1 # load
add ri3, ri1, ri2 # do something


(For debugging, mapped registers are printed as ri0..x or rn0..y by imcc)


> Hope that it know make more sense,


More, yes. This would give us 32 + N - (0..x) registers, where x is the
number of non-callee-saved registers in the worst case, or 0 most of the
time. The $I1 above can always be a new temp, which would then have a
very limited live range inside one basic block.


> -angel


leo

Phil Hassey

unread,
Feb 26, 2003, 1:58:37 PM2/26/03
to Leopold Toetsch, P6I
> > Although it might be nice if IMC were binary at this stage (for some
> > feel-good-reason?).
>
> You mean, that a HL like perl6 should produce a binary equivalent to
> ther current .imc file? Yep - this was discussed already, albeit there
> was no discussion, how this should look like. And the lexer in imcc is
> pretty fast.
>
> > ... The current bytecode from parrot already has potential
> > for slowing things down, and that's what worries me here.
>
> I don't see that.

My post was more a "wish-list" of what I was hoping parrot would be like in
terms of imc/pbc/jit/whatever. Since I don't completely understand how
parrot works, my comment above was actually more of a guess. But I'll try to
explain what I meant, on the off-chance it was right.

My understanding is that PBC has a limit of 16 (32?) integer registers. When
a code block needs more than 16 registers, they are overflowed into a
PMC.

With a processor with < 16 registers, I guess this would work. Although the
JIT would have to overflow more than what was originally planned in the PBC.
(Or does it just switch back and forth between the VM and the JIT, I don't
know.)

But with a processor with > 16 registers (do such things exist?), Parrot
would be overflowing registers that it could have been using in the JIT. My
guess is that this would slow things down.

Anyway, before I strut my ignorance of VMs and JITs and processors anymore, I
think I will end this message. :)

Thanks,
Phil

Leopold Toetsch

unread,
Feb 26, 2003, 5:32:07 PM2/26/03
to philh...@users.sourceforge.net, P6I
Phil Hassey wrote:

>>>... The current bytecode from parrot already has potential
>>>for slowing things down, and that's what worries me here.
>>>
>>I don't see that.

> My understanding is that PBC has a limit of 16 (32?) integer registers. When

> a code block needs more than 16 registers, they are overflowed into a
> PMC.


There are 32 registers per type. When liveness analysis of all used
temporary registers can't allocate all used vars to a parrot register,
the overflowed vars get spilled into a PerlArray.
This may be different from just "a block needs more than...":
set $I0, 10
add $I1, $I0, 2
print $I1
add $I2, $I0, 3
print $I2

only needs two registers: $I1 and $I2 get the same parrot register,
because their usage doesn't overlap.
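[A minimal linear-scan sketch - with my own assumptions: precomputed intervals, no holes, enough registers, no spilling - of how two non-overlapping usages end up sharing one register:]

```python
def linear_scan(intervals, nregs):
    """intervals: {name: (start, end)} instruction indices.  Returns
    name -> register index; raises IndexError if nregs is exceeded
    (spilling is omitted from this sketch)."""
    alloc, free, active = {}, list(range(nregs)), []
    for name, (start, end) in sorted(intervals.items(),
                                     key=lambda kv: kv[1][0]):
        for e, n in list(active):          # expire finished intervals
            if e < start:
                active.remove((e, n))
                free.append(alloc[n])      # their register is free again
        alloc[name] = free.pop(0)
        active.append((end, name))
    return alloc

# One name lives across the whole block; the other two don't overlap,
# so two registers suffice for three names:
alloc = linear_scan({"$I0": (0, 4), "$I1": (1, 2), "$I2": (3, 4)}, 2)
print(alloc)   # {'$I0': 0, '$I1': 1, '$I2': 1}
```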

> But with a processor with > 16 registers (do such things exist?). Parrot
> would be overflowing registers that it could have been using in the JIT.


RISC processors have a lot of them. But before there are unused processor
registers, we will allocate P and S registers too. When a CPU has more
than 4*32 free registers, we will look again.


> Thanks,
> Phil

leo


Piers Cawley

unread,
Mar 3, 2003, 9:22:17 AM3/3/03
to Nicholas Clark, Angel Faus, Leopold Toetsch, Gopal V, P6I
Nicholas Clark <ni...@unfortu.net> writes:

> On Wed, Feb 26, 2003 at 02:21:32AM +0100, Angel Faus wrote:
>
> [snip lots of good stuff]
>
>> All this is obviously machine dependent: the code generated should
>> only run in the machine it was compiled for. So we should always keep
>> the original imc code in case we copy the pbc file to another
>> machine.
>
> Er, but doesn't that mean that imc code has now usurped the role of parrot
> byte code?
>
> I'm not sure what is a good answer here. But I thought that the intent of
> parrot's bytecode was to be the same bytecode that runs everywhere. Which
> is slightly incompatible with compiling perl code to something that runs
> as fast as possible on the machine that you're both compiling and running
> on. (These two being the same machine most of the time).
>
> Maybe we're starting to get to the point of having imcc deliver parrot bytecode
> if you want to be portable, and something approaching native machine code
> if you want speed. Or maybe if you want the latter we save "fat" bytecode
> files, that contain IMC code, bytecode and JIT-food for one or more
> processors.

Aren't there safety implications with 'fat' code? One could envisage a
malicious fat PBC where the IMC code and the bytecode did different things...

--
Piers

Dennis Haney

unread,
Mar 4, 2003, 10:15:16 AM3/4/03
to P6I
Leopold Toetsch wrote:

> Phil Hassey wrote:
>
>> But with a processor with > 16 registers (do such things exist?).
>> Parrot would be overflowing registers that it could have been using
>> in the JIT.
>
>
>
> RISC processors have a lot of them. But before there are unused
> processor registers, we will allocate P and S registers too. When a
> CPU has more than 4*32 free registers, we will look again.

Like IA64? AFAIK it has 128 integer registers and 128 fp registers...

