
Automatic register spill / restore?


Andy

Jul 4, 2022, 1:48:36 AM

The discussions going on about register to/from stack and load/store
multiple instructions have got me vaguely remembering that there was some
talk about old mainframes that could automatically save to the stack any
registers in danger of being overwritten after a jump to subroutine or such.

Anyone else remember such a thing? Or am I just making things up by
reason of senility and/or madness? (Entirely possible, I'm afraid :-( )


I wonder, if they were real, whether given today's transistor budgets
those mechanisms could be revived in modern clean-sheet CPU designs, to
at least help alleviate some of the issues we see in the current state
of the art?

Or were there some really big downsides to the automagic things that I'm
not remembering well enough?



EricP

Jul 4, 2022, 12:14:41 PM

SPARC register windows. They were an opaque, asynchronous lazy spill/fill
done by kernel-mode traps. Reportedly it had... issues.

A quick search for 'sparc "register window"' finds

[paywalled]
Error Behavior Comparison of Multiple Computing Systems: A Case Study
Using Linux on Pentium, Solaris on SPARC, and AIX on POWER, 2008
https://ieeexplore.ieee.org/abstract/document/4725314/

I found a copy available in an online viewer with a download option:

https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxkY2hlbjhyZXNlYXJjaHxneDoxNzdkNWMxOWVmOWNhMDdm

says:
"The registers introduced to manage the register window in the
SPARC architecture are particularly error-sensitive and contribute
to more than 50% of the hang cases for the Solaris System.
This result indicates that, while using the register window may
improve performance, it can create unexpected reliability problems."



MitchAlsup

Jul 4, 2022, 1:56:51 PM

On Monday, July 4, 2022 at 12:48:36 AM UTC-5, Andy wrote:
> The discussions going on about register to/from stack and load/store
> multiple instructions has got me vaguely remembering that there was some
> talk about old mainframes that could save to stack automatically any
> registers in danger of being overwritten after a jump to subroutine or such.
>
> Anyone else remember such a thing?, or am I just making things up by
> reason of senility and/or madness? (entirely possible I'm afraid :-( )
>
IBM 360 series::
caller allocated a register save area and always kept it in R13
upon any arrival (Callee, interruptee) would perform STM 12,12(13)
which would dump all 16 registers in the save area, as the first step
in properly receiving control.
<
VAX did something similar, but performed it in the CALL side of ISA
processing.
>
> I wonder, that if they were real, is it possible given today's
> transistor core counts those mechanisms could be revived in modern clean
> sheet cpu designs?, to at least help alleviate some of the issues we see
> in the current state of the art perhaps?
<
Effectively that is what ENTER and EXIT do in My 66000 architecture.
>
> Or were there some really big downsides to the automagic things that I'm
> not remembering well enough?
<
Automagic was so slow in VAX that the more modern compilers used JSR
instead of CALL and got rid of ½ of the cycles in calling/returning.

John Levine

Jul 4, 2022, 2:56:37 PM

According to MitchAlsup <Mitch...@aol.com>:
>> Anyone else remember such a thing?, or am I just making things up by
>> reason of senility and/or madness? (entirely possible I'm afraid :-( )
>>
>IBM 360 series::
>caller allocated a register save area and always kept it in R13
>upon any arrival (Callee, interruptee) would perform STM 12,12(13)
>which would dump all 16 registers in the save area, as the first step
>in properly receiving control.

Actually it was STM 14,12,12(13). You chained save area pointers in
R13 in other ways depending on whether your routine was reentrant or
recursive or neither. The first "12" might be a smaller number if your
routine didn't change the high numbered registers. On TSS the sequence
was messier since every routine had both a code pointer (V address) and a
data pointer (R address).
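
For concreteness, a minimal C sketch of the save-area layout that
STM 14,12,12(13) assumes; the struct and field names are illustrative,
not from any IBM header:

    /* Classic S/360 save area: 18 fullwords, 72 bytes. */
    #include <stdint.h>

    struct save_area {
        uint32_t lang_word;    /* word 0: language-dependent use       */
        uint32_t back_chain;   /* word 1: caller's save area (old R13) */
        uint32_t fwd_chain;    /* word 2: callee's save area           */
        uint32_t r14;          /* word 3: return address               */
        uint32_t r15;          /* word 4: entry-point address          */
        uint32_t r0_r12[13];   /* words 5..17: R0 through R12          */
    };

    /* STM 14,12,12(13) stores R14,R15,R0..R12 starting at offset 12
       off R13, i.e. into .r14 onward. R13 itself is never stored, so
       the chain of save areas stays walkable. */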

For obvious reasons interrupts could not depend on register contents
and saved registers in a static place in low memory.

>VAX did something similar, but performed it in the CALL side of ISA
>processing.

The VAX calling sequence had a bit mask of registers to save at
the start of the procedure and very complex CALL/RET instructions that
did the save and set up the stack frame. They were so slow that a lot
of software used the much simpler JSB/RSB that just saved the return
address on the stack and jumped.
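
A back-of-envelope comparison in C, with the CALLS work itemized from
memory (the entry-mask value is hypothetical, and the counts are stack
writes, not cycles):

    #include <stdio.h>

    static int popcount(unsigned m) {
        int n = 0;
        while (m) { n += m & 1u; m >>= 1; }
        return n;
    }

    int main(void) {
        unsigned entry_mask = 0x0FC0;      /* say the mask names R6..R11 */
        int calls = 1                      /* argument count             */
                  + popcount(entry_mask)   /* registers named in mask    */
                  + 3                      /* saved PC, FP, AP           */
                  + 1;                     /* mask word + saved PSW bits */
        int jsb   = 1;                     /* return address only        */
        printf("CALLS: ~%d longwords pushed; JSB: %d\n", calls, jsb);
        return 0;
    }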

SPARC used the old pcc compiler which didn't do very sophisticated
register management, so the SPARC register windows were a way to
save and restore all the registers fast. The contemporary 801
project used the much better PL.8 compiler so their calling sequence
just saved the registers it needed to. The 801 did not have load
or store multiple but the ROMP put it back in so they could get
full memory speed on the load/store traffic.


--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Niklas Holsti

Jul 4, 2022, 2:57:21 PM

On 2022-07-04 19:14, EricP wrote:
> Andy wrote:
>> The discussions going on about register to/from stack and load/store
>> multiple instructions has got me vaguely remembering that there was
>> some talk about old mainframes that could save to stack automatically
>> any registers in danger of being overwritten after a jump to
>> subroutine or such.
>>
>> Anyone else remember such a thing?, or am I just making things up by
>> reason of senility and/or madness? (entirely possible I'm afraid :-( )
>>
>>
>> I wonder, that if they were real, is it possible given today's
>> transistor core counts those mechanisms could be revived in modern
>> clean sheet cpu designs?, to at least help alleviate some of the
>> issues we see in the current state of the art perhaps?
>>
>> Or were there some really big downsides to the automagic things that
>> I'm not remembering well enough?
>
> Sparc register windows. They were opaque, asynchronous lazy spill/fill
> which was done by kernel mode traps. Reportedly it had... issues.


What sort of "issues"? Performance wrt alternatives? SPARC processors
are still used extensively in space systems, in particular on ESA
missions. I've written embedded SW for several such systems and I'm not
aware of any serious issues.

Register windows do complicate static WCET analysis a bit, because it is
not easy to predict exactly which call or which return can cause a trap.
But an upper bound can be calculated that is not very pessimistic. And
who uses static WCET analysis any more :-(

I have read that one port of gcc for SPARC offers a "flat"
register-usage model as an alternative to register windows. In the
"flat" model the same set of 32 registers is used at all points in the
program, and the register windows are never rotated (except for trap
handling, I assume). I suppose this "flat" option was implemented for a
reason, but the reason was not explained, IIRC.

A minor annoyance of register windows, and of the standard trap handlers
for rotating windows, is that every non-leaf subroutine must allocate 96
octets of stack for saving 24 registers. This area is used if, and only
if, the subroutine executes a call that causes a register-ring-overflow
trap. A programming style that uses many small subroutines can lead to
surprisingly large stack usage.
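
A quick illustration of how that reservation adds up; the 96 octets per
frame are from the paragraph above, the call depth is made up:

    #include <stdio.h>

    int main(void) {
        int save_area = 96;   /* octets reserved per non-leaf frame */
        int depth     = 200;  /* hypothetical deepest call chain    */
        printf("window save areas alone: %d octets of stack\n",
               save_area * depth);   /* 19200 octets                */
        return 0;
    }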


> A quicky search of 'sparc "register window" ' finds
>
> [paywalled]
> Error behavior comparison of multiple computing systems: A case study
> using linux on pentium, solaris on sparc, and aix on power, 2008
> https://ieeexplore.ieee.org/abstract/document/4725314/
>
> I found a copy available in a online viewer with download option:
>
> https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxkY2hlbjhyZXNlYXJjaHxneDoxNzdkNWMxOWVmOWNhMDdm


Thanks for this reference...


> says:
> "The registers introduced to manage the register window in the
> SPARC architecture are particularly error-sensitive and contribute
> to more than 50% of the hang cases for the Solaris System.
> This result indicates that, while using the register window may
> improve performance, it can create unexpected reliability problems."


These are not "natural", real-life errors and hangs. That study injects
single-bit errors into memory words and system registers and sees what
happens. It is to be expected that critical system registers, such as
the ones involved in register-ring management, are very sensitive to
such simulated HW errors. This is not a flaw in the logical
architecture, but HW architects may want to make these registers
especially robust.

SPARC processors used in space applications are radiation-tolerant and
use triple modular redundancy for critical registers.

Marcus

Jul 5, 2022, 2:33:03 AM

On 2022-07-04, MitchAlsup wrote:
> On Monday, July 4, 2022 at 12:48:36 AM UTC-5, Andy wrote:
>> The discussions going on about register to/from stack and load/store
>> multiple instructions has got me vaguely remembering that there was some
>> talk about old mainframes that could save to stack automatically any
>> registers in danger of being overwritten after a jump to subroutine or such.
>>
>> Anyone else remember such a thing?, or am I just making things up by
>> reason of senility and/or madness? (entirely possible I'm afraid :-( )
>>
> IBM 360 series::
> caller allocated a register save area and always kept it in R13
> upon any arrival (Callee, interruptee) would perform STM 12,12(13)
> which would dump all 16 registers in the save area, as the first step
> in properly receiving control.

How does this differ from a regular STM to stack (except that the
allocation is done by the caller instead of the callee)? E.g. 68k style:

movem.l d0-d7/a0-a6,-(sp)

Niklas Holsti

Jul 5, 2022, 3:54:33 AM

On 2022-07-04 21:56, John Levine wrote:
>
> SPARC used the old pcc compiler which didn't do very sophisticated
> register management, so the SPARC register windows were a way to
> save and restore all the registers fast.


Nitpick: not "all the registers". When a SPARC rotates the register
window in connection with a call, 16 registers visible to the caller are
saved (become inaccessible). These are the 8 "in" registers and the 8
"local" registers of the caller. The callee gets 16 new registers: 8 new
"local" registers and 8 new "out" registers.

The 8 "out" registers of the caller are seen in the callee as the 8 "in"
registers of the callee. Successive windows overlap on those 8
registers, which of course are used to pass parameters.

In addition there are 8 "global" registers that are not held in the
register ring and are always accessible with the same names.

IIRC...
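
To make the overlap concrete, a toy C model of the rotation; NWINDOWS
and the physical layout are illustrative, not taken from any SPARC
manual:

    #include <stdio.h>

    #define NWINDOWS 8
    static unsigned long win[NWINDOWS * 16]; /* outs+locals per window  */
    static unsigned long globals[8];         /* %g0..%g7, never rotated */

    /* Visible register r (0..31) of window w -> backing storage.
       The ins of window w alias the outs of the caller's window w+1. */
    static unsigned long *reg(int w, int r) {
        if (r < 8)  return &globals[r];                      /* %g0..%g7 */
        if (r < 16) return &win[w * 16 + (r - 8)];           /* %o0..%o7 */
        if (r < 24) return &win[w * 16 + (r - 16) + 8];      /* %l0..%l7 */
        return &win[((w + 1) % NWINDOWS) * 16 + (r - 24)];   /* %i0..%i7 */
    }

    int main(void) {
        int cwp = 3;
        *reg(cwp, 8) = 42;          /* caller writes %o0 (a parameter)  */
        cwp = (cwp + NWINDOWS - 1) % NWINDOWS;  /* SAVE rotates window  */
        printf("callee sees %%i0 = %lu\n", *reg(cwp, 24));  /* 42       */
        return 0;
    }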

Anton Ertl

Jul 5, 2022, 4:04:34 AM

Niklas Holsti <niklas...@tidorum.invalid> writes:
>And who uses static WCET analysis any more :-(

I would presume that those applications that needed WCET (worst-case
execution time) analysis in the past still need it. What would they
use instead?

>I have read that one port of gcc for SPARC offers a "flat"
>register-usage model as an alternative to register windows. In the
>"flat" model the same set of 32 registers is used at all points in the
>program, and the register windows are never rotated (except for trap
>handling, I assume). I suppose this "flat" option was implemented for a
>reason, but the reason was not explained, IIRC.

I read somewhere that the register windows could be used for fast task
switching instead of for fast calling. To make use of that you would
need a compiler that does not use the windows for calling; it would
also have to limit its register usage to 16 (one window per task) or
24 (two windows per task) registers.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Ivan Godard

Jul 5, 2022, 5:34:44 AM

On 7/5/2022 12:54 AM, Anton Ertl wrote:
> Niklas Holsti <niklas...@tidorum.invalid> writes:
>> And who uses static WCET analysis any more :-(
>
> I would presume that those applications that needed WCET (worst-case
> execution time) analysis in the past still need it. What would they
> use instead?
>
>> I have read that one port of gcc for SPARC offers a "flat"
>> register-usage model as an alternative to register windows. In the
>> "flat" model the same set of 32 registers is used at all points in the
>> program, and the register windows are never rotated (except for trap
>> handling, I assume). I suppose this "flat" option was implemented for a
>> reason, but the reason was not explained, IIRC.
>
> I read somewhere that the register windows could be used for fast task
> switching instead of for fast calling. To make use of that you would
> need a compiler that does not use the windows for calling; it would
> also have to limit its register usage to 16 (one window per task) or
> 24 (two windows per task) registers.


Shouldn't need such limits. You can flush the windows by a recursive
call (making a dummy window) that stops when the end interrupt happens.
Then switch the stack and do a recursive return of the number of dummy
windows that were injected when the new stack was previously switched
out of, saved somewhere (possibly in the dummy). The flush/restore can
tell how many dummy windows there are by counting the recursive calls
before the interrupt; the overflow routine doesn't have to spill any of
the dummies, nor the restore refill them. It does have to spill/refill
the non-dummy windows, but that's inevitable.

It would go even faster if the hardware refilled lazily, assuming that the
spilled windows don't need to be window-block aligned; do you know if
that was required by the implementation?

MitchAlsup

Jul 5, 2022, 12:10:14 PM

On Tuesday, July 5, 2022 at 1:33:03 AM UTC-5, Marcus wrote:
> On 2022-07-04, MitchAlsup wrote:
> > On Monday, July 4, 2022 at 12:48:36 AM UTC-5, Andy wrote:
> >> The discussions going on about register to/from stack and load/store
> >> multiple instructions has got me vaguely remembering that there was some
> >> talk about old mainframes that could save to stack automatically any
> >> registers in danger of being overwritten after a jump to subroutine or such.
> >>
> >> Anyone else remember such a thing?, or am I just making things up by
> >> reason of senility and/or madness? (entirely possible I'm afraid :-( )
> >>
> > IBM 360 series::
> > caller allocated a register save area and always kept it in R13
> > upon any arrival (Callee, interruptee) would perform STM 12,12(13)
> > which would dump all 16 registers in the save area, as the first step
> > in properly receiving control.
> How does this differ from a regular STM to stack (except that the
> allocation is done by the caller instead of the callee)? E.g. 68k style:
<
It was a linked list, not a stack.

EricP

Jul 5, 2022, 12:39:18 PM

Right, the errors were simulated, but as I read it they contend the
fragility is due to the register windows and the way they were manipulated.

> SPARC processors used in space applications are radiation-tolerant and
> use triple modular redundancy for critical registers.

I can't find the references just now but I have seen it
discussed in the past so I'll see what I can dig up.

IIRC the primary issue was that the phase delay in window save/restore
was opaque, so code could only know for sure that memory was up to
date by explicitly flushing through OS traps -
something akin to manual cache coherence.
I would be paranoid about things like general OS calls, exceptions,
interrupts, user mode task switches not seeing a coherent view of memory,
and would flush the pending register window on entry to these
until each can be proved safe, on a case-by-case basis.

One issue is that even for in-order implementations it can require a large
number of hardware registers of which programmers can only use a small number,
similar to the costs of bank-switched register sets like Arm32.

H&P note of SPARC:
- "Given that each window has 16 unique registers, an implementation of
  SPARC can have as few as 40 physical registers and as many as 520,
  although most have 128 to 136.
  ...
  The danger of register windows is that the larger number of
  registers could slow down the clock rate."

One compiler writer notes:
- "The [Sparc] architecture requires a designer to implement 120 registers,
of which only 24 to 29 are available for use by the compiler writer."

- "Setjmp is an exceptional case for saving register windows.
Because setjmp saves processor state, it is necessary for it to
force the hidden register state to the stack and to save the current
state into the jump buffer. ...
This makes setjmp an exceptionally slow operation."

- some comments in comp.compilers
https://compilers.iecc.com/comparch/article/94-02-130
https://compilers.iecc.com/comparch/article/94-02-134


EricP

Jul 5, 2022, 1:22:51 PM

The size of the SPARC saved HW register window is model dependent,
from 2 to 32, so to flush it you would have to do 32 SAVE instructions,
which could potentially trigger multiple OS traps.

The hardware doesn't seem to have the tracking registers necessary
to optimize save/restore, though maybe later models did.

One improvement might be to have a model register that can be
read to find the HW window size in user mode.
Another could be a FLUSH instruction.

It looks like it just has a Valid bit per register window.
It also needs a Modified bit to track changes since last spill/fill,
and low-water and high-water spill/refill stack pointers.
That would allow it to only flush windows modified since reload,
spill-ahead when the window is 3/4 full, refill-ahead when 1/4 full,
and minimize the window management stalls.
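
A sketch of what that tracking might look like, under the stated
assumptions (per-window Valid and Modified bits, 1/4 and 3/4
watermarks); none of this reflects a real implementation:

    #include <stdbool.h>

    #define NWIN 32

    struct wfile {
        bool valid[NWIN], dirty[NWIN];
        int  head, tail;     /* circular indexes; head = newest window */
        int  occupied;       /* number of valid windows                */
    };

    /* Run after each SAVE/RESTORE retires; assumes deeper frames
       exist on the stack whenever a refill is attempted. */
    static void manage(struct wfile *w) {
        while (w->occupied > 3 * NWIN / 4) {      /* spill-ahead  */
            if (w->dirty[w->tail]) {
                /* store window tail's 16 registers to its stack slot */
                w->dirty[w->tail] = false;
            }
            w->valid[w->tail] = false;
            w->tail = (w->tail + 1) % NWIN;
            w->occupied--;
        }
        while (w->occupied < NWIN / 4) {          /* refill-ahead */
            w->tail = (w->tail + NWIN - 1) % NWIN;
            if (!w->valid[w->tail]) {
                /* load window tail's 16 registers from its stack slot;
                   a still-valid, unmodified window needs no reload    */
                w->valid[w->tail] = true;
                w->dirty[w->tail] = false;
            }
            w->occupied++;
        }
    }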

John Levine

Jul 5, 2022, 3:02:59 PM

According to MitchAlsup <Mitch...@aol.com>:
>> > IBM 360 series::
>> > caller allocated a register save area and always kept it in R13
>> > upon any arrival (Callee, interruptee) would perform STM 12,12(13)
>> > which would dump all 16 registers in the save area, as the first step
>> > in properly receiving control.
>> How does this differ from a regular STM to stack (except that the
>> allocation is done by the caller instead of the callee)? E.g. 68k style:
><
>It was a linked list not a stack.

Depends on the details. In PL/I it was a stack; in Fortran and COBOL, which
did not allow recursion, it was a linked list of static save areas.

>> > Automagic was so slow in VAX that the more modern compilers used JSR
>> > instead of CALL and got rid of ½ of the cycles in calling/returning.

JSB actually, but yes, many of the complex VAX instructions were so slow that
they weren't useful.

VAX appears to have been designed in an era when microcode was much faster
than RAM so everything was limited by memory speed. Unfortunately, by the
time there were actual VAX computers, that wasn't true any more.

Niklas Holsti

Jul 5, 2022, 3:04:11 PM

But they also say: "Injections into the system registers indicate that
each of the three processors (the Pentium 4, UltraSPARC IIIi, and
POWER5) have two to three critical registers that are very sensitive to
errors (see Figure 9 through Figure 11), those being IDTR, GDTR and CR4
in the Pentium 4; TBA and SET_SOFTINT in the UltraSPARCIIIi; and SPRG1,
IAR, and MSR in the POWER5."

I still don't think that this paper shows an intrinsic reliability flaw
in the register-window idea, at least not one that cannot be compensated
by increasing the robustness of the critical register HW.

However, I see now that UltraSPARC IIIi, used in the paper, uses the
SPARC V9 (or JPS1) architecture, while my experience is with the earlier
SPARC versions V7 and V8. I think the register-window mechanism is the
same in V9, but the paper's description of the mechanism disagrees in
some details with my understanding -- for example, the paper says that
each window has 32 registers, not 24 (visible) or 16 (owned by the window).


>> SPARC processors used in space applications are radiation-tolerant and
>> use triple modular redundancy for critical registers.
>
> I can't find the references just now but I have seen it
> discussed in the past so I'll see what I can dig up.
>
> IIRC the primary issue was the phase delay in window save/restore
> was opaque so code could only know for sure that memory was up to
> date by explicit flushing through OS traps -
> something akin to manual cache coherence.


All the occupied windows in the register ring have to be stored into
memory (into the areas reserved for this in each stack frame) when a
thread is suspended, and at least one windowful has to be reloaded when
the thread is resumed. This is in addition to storing and reloading the
8 "global" registers and any other thread-specific context.

A debugger may also prefer to store the ring into memory so that it can
access the saved SP, FP, and register-allocated locals of upper-level
calls. But at least in SPARC V7 and V8 that is all deterministic and
program-controlled. AFAIK the SPARC processors do not autonomously store
register-ring contents into memory in the background; it has to be
programmed explicitly.

A modern OoO processor may have hundreds of working registers, not
architecturally visible to the programmer. I don't know how those are
handled on thread switch, but it is probably complicated.


> I would be paranoid about things like general OS calls, exceptions,
> interrupts, user mode task switches not seeing a coherent view of memory,
> and would flush the pending register window on entry to these
> until each can be proved safe, on a case-by-case basis.


Traps and interrupts work in the same way as subroutine calls. OS calls
are typically done by triggering a SW trap. Exceptions may have to
unwind the stack which involves both RESTORE instructions (as in normal
returns from calls) and reloading register windows that had been pushed
out of the register ring. Some work, but not horribly complex.


> One issue is that even for in-order implementations it can require a large
> number of hardware registers of which programmers can only use a small number,
> similar to the costs of bank-switched register sets like Arm32.


But the same issue arises in OoO systems without register windows,
right? Hundreds of working registers, invisible to the programmer.


> H&P note of Sparc:
> - Given that each window has 16 unique registers, an implementation of
>   SPARC can have as few as 40 physical registers and as many as 520,
>   although most have 128 to 136.
>   ...
>   The danger of register windows is that the larger number of
>   registers could slow down the clock rate."


I suppose the OoO processors with large numbers of registers should have
the same clock-rate risk, but perhaps the OoO parallelism compensates.


> One compiler writer notes:
> - "The [Sparc] architecture requires a designer to implement 120 registers,
>   of which only 24 to 29 are available for use by the compiler writer."


Same issue for OoO machines.


> - "Setjmp is an exceptional case for saving register windows.
>   Because setjmp saves processor state, it is necessary for it to
>   force the hidden register state to the stack and to save the current
>   state into the jump buffer. ...
>   This makes setjmp an exceptionally slow operation."


Hm. That may be necessary for some weird uses of setjmp/longjmp, such as
implementing some cheapo threading or coroutining services, but I don't
think it is needed for the more normal use of setjmp/longjmp as an
exception-handling mechanism.
Those discussions of setjmp/longjmp implementations reveal some
complexity, but it certainly works on SPARCs, and IMO it can't be seen
as a major drawback. But I may be prejudiced because my applications
use exceptions either never (embedded SW) or rarely (other SW).

Niklas Holsti

Jul 5, 2022, 3:30:24 PM

On 2022-07-05 10:54, Anton Ertl wrote:
> Niklas Holsti <niklas...@tidorum.invalid> writes:
>> And who uses static WCET analysis any more :-(
>
> I would presume that those applications that needed WCET (worst-case
> execution time) analysis in the past still need it. What would they
> use instead?


They can seldom use /static/ WCET analysis because the execution speed
of modern processors is too unpredictable, thanks to all the
acceleration mechanisms that are ever-evolving and poorly documented, not
to mention multi-cores and their resource-contention problems, not to
mention out-of-order and speculative processing...

The only pleasure aficionados of static WCET analysis (like myself) have
nowadays is the schadenfreude we feel when all those acceleration
mechanisms turn out to be side channels leaking secrets -- Spectre and
Meltdown etc.

People who need WCETs instead use "hybrid" methods that combine
fine-grained execution-time measurements with some static control-flow
analysis to compute a probabilistic "WCET estimate". Possibly combined
with randomized HW to motivate their use of "extreme-value statistics"
for computing the reliability of that probabilistic WCET estimate.


>> I have read that one port of gcc for SPARC offers a "flat"
>> register-usage model as an alternative to register windows. In the
>> "flat" model the same set of 32 registers is used at all points in the
>> program, and the register windows are never rotated (except for trap
>> handling, I assume). I suppose this "flat" option was implemented for a
>> reason, but the reason was not explained, IIRC.
>
> I read somewhere that the register windows could be used for fast task
> switching instead of for fast calling. To make use of that you would
> need a compiler that does not use the windows for calling; it would
> also have to limit its register usage to 16 (one window per task) or
> 24 (two windows per task) registers.


That seems possible, but I haven't come across a real example.

In addition to the cases of one or two windows per task in a "flat"
model without any within-task use of SAVE/RESTORE instructions, one
could instead partition the register ring so that each task could use a
task-specific sector of the ring, containing any number of windows, with
the task using SAVE/RESTORE for call/return in the usual way (non-"flat").

Niklas Holsti

Jul 5, 2022, 3:46:13 PM

Replying to myself:

On 2022-07-05 22:04, Niklas Holsti wrote:

> A modern OoO processor may have hundreds of working registers, not
> architecturally visible to the programmer. I don't know how those are
> handled on thread switch, but it is probably complicated.


On second thought, the invisible working registers should be handled
fully by the OoO HW if the thread switch is implemented by normal code
that only manipulates the architecturally visible registers. So the
SPARC register windows do complicate thread switching more than these
invisible working registers do.

MitchAlsup

Jul 5, 2022, 4:07:33 PM

On Tuesday, July 5, 2022 at 2:04:11 PM UTC-5, Niklas Holsti wrote:
> On 2022-07-05 19:38, EricP wrote:
> > Niklas Holsti wrote:
<snip>
> >
> > Right, the errors were simulated but as I read it they contend the
> > fragility is due to the register windows and the way they were manipulated.
> But they also say: "Injections into the system registers indicate that
> each of the three processors (the Pentium 4, UltraSPARC IIIi, and
> POWER5) have two to three critical registers that are very sensitive to
> errors (see Figure 9 through Figure 11), those being IDTR, GDTR and CR4
> in the Pentium 4; TBA and SET_SOFTINT in the UltraSPARCIIIi; and SPRG1,
> IAR, and MSR in the POWER5."
<
I don't know of any processors that can usefully withstand someone coming
along and randomly flipping bits in PSR, Root Pointers, Page table structures,
and other control registers. SPARC should be little different.
>
> I still don't think that this paper shows an intrinsic reliability flaw
> in the register-window idea, at least not one that cannot be compensated
> by increasing the robustness of the critical register HW.
<
Agreed.
<
>
> However, I see now that UltraSPARC IIIi, used in the paper, uses the
> SPARC V9 (or JPS1) architecture, while my experience is with the earlier
> SPARC versions V7 and V8. I think the register-window mechanism is the
> same in V9, but the paper's description of the mechanism disagrees in
> some details with my understanding -- for example, the paper says that
> each window has 32 registers, not 24 (visible) or 16 (owned by the window).
<
SPARC V9 added some new OS functionality over register windows that
was supposed to make some of the background overhead go away.
<
> >> SPARC processors used in space applications are radiation-tolerant and
> >> use triple modular redundancy for critical registers.
> >
> > I can't find the references just now but I have seen it
> > discussed in the past so I'll see what I can dig up.
> >
> > IIRC the primary issue was the phase delay in window save/restore
> > was opaque so code could only know for sure that memory was up to
> > date by explicit flushing through OS traps -
> > something akin to manual cache coherence.
<
> All the occupied windows in the register ring have to be stored into
> memory (into the areas reserved for this in each stack frame) when a
> thread is suspended, and at least one windowful has to be reloaded when
> the thread is resumed. This is in addition to storing and reloading the
> 8 "global" registers and any other thread-specific context.
<
Imagine passing a pointer to the argument which arrives in a register window.
>
> A debugger may also prefer to store the ring into memory so that it can
> access the saved SP, FP, and register-allocated locals of upper-level
> calls. But at least in SPARC V7 and V8 that is all deterministic and
> program-controlled. AFAIK the SPARC processors do not autonomously store
> register-ring contents into memory in the background, it has to be
> programmed explicitly.
>
> A modern OoO processor may have hundreds of working registers, not
> architecturally visible to the programmer. I don't know how those are
> handled on thread switch, but it is probably complicated.
<
But those OoO registers are in use, whereas the invisible windows of SPARC
cannot be "in use".
<
> > I would be paranoid about things like general OS calls, exceptions,
> > interrupts, user mode task switches not seeing a coherent view of memory,
> > and would flush the pending register window on entry to these
> > until each can be proved safe, on a case-by-case basis.
<
When we shut down a company, we had a server (SPARC V8) that had been up
for 7 years 4 months and several days. It had every one of its CPU modules
replaced, every one of its memory modules replaced, and most of the disk
farm replaced; yet the system was never out of service or "down". Thus, I don't
see how register windows did those kinds of systems "any harm".
<
> Traps and interrupts work in the same way as subroutine calls.
<
I have to fundamentally disagree with this statement. Subroutine calls know
who they are transferring control to, and that it is within its own address space.
Traps can be made to smell like subroutine calls, but are spurious in nature.
Interrupts are not like subroutine calls, because they are entirely asynchronous
with the thread being interrupted. You can make them smell like unexpected
subroutine calls if you desire.
<
> OS calls
> are typically done by a triggering a SW trap. Exceptions may have to
> unwind the stack which involves both RESTORE instructions (as in normal
> returns from calls) and reloading register windows that had been pushed
> out of the register ring. Some work, but not horribly complex.
<
> > One issue is that even for in-order implementations it can require a large
> > number of hardware registers of which programmers can only use a small number,
> > similar to the costs of bank-switched register sets like Arm32.
<
> But the same issue arises in OoO systems without register windows,
> right? Hundreds of working registers, invisible to the programmer.
<
But in OoO design, one may have hundreds of physical registers, but over the
span of 200-ish clocks, every one of them can be used, without ever crossing
a subroutine boundary. Yes, the access time will be similar due to size/area/ports,
but the registers are in use, whereas in a register window, they cannot be in use.
<
To build an OoO SPARC one would need the large window of the OoO and then
add to that the excesses of the register window. Worst of all cases (except Itanic-
like rotating register.....)....
<
> > H&P note of Sparc:
> > - Given that each window has 16 unique registers, an implementation of
> > SPARC can have as few as 40 physical registers and as many as 520,
> > although most have 128 to 136.
> > ...
> > The danger of register windows is that the larger number of
> > registers could slow down the clock rate."
<
Whereas MIPS got to high frequencies rather easily, SPARC never did.
Read into that what you will.
<
> I suppose the OoO processors with large numbers of registers should have
> the same clock-rate risk, but perhaps the OoO parallelism compensates.
<
You can "pipeline" the register file access time away. What you cannot pipeline
away is the trap handling complexity.
<
> > One compiler writer notes:
> > - "The [Sparc] architecture requires a designer to implement 120 registers,
> > of which only 24 to 29 are available for use by the compiler writer."
> Same issue for OoO machines.
> > - "Setjmp is an exceptional case for saving register windows.
> > Because setjmp saves processor state, it is necessary for it to
> > force the hidden register state to the stack and to save the current
> > state into the jump buffer. ...
> > This makes setjmp an exceptionally slow operation."
> Hm. That may be necessary for some weird uses of setjmp/longjmp, such as
> implementing some cheapo threading or coroutining services, but I don't
> think it is needed for the more normal use of setjmp/longjmp as an
> exception-handling mechanism.
> > - some comments in comp.compilers
> > https://compilers.iecc.com/comparch/article/94-02-130
> > https://compilers.iecc.com/comparch/article/94-02-134
> Those discussions of setjmp/longjmp implementations reveal some
> complexity, but it certainly works on SPARCs, and IMO it can't be seen
> as a major draw-back. But I may be prejudiced because my applications
> use exceptions either never (embedded SW) or rarely (other SW).
<
So, while it is not a MAJOR DrawBack, it is onerous enough that the
designer should proceed with great caution.

MitchAlsup

Jul 5, 2022, 4:11:28 PM

On Tuesday, July 5, 2022 at 2:30:24 PM UTC-5, Niklas Holsti wrote:
> On 2022-07-05 10:54, Anton Ertl wrote:
> > Niklas Holsti <niklas...@tidorum.invalid> writes:
> >> And who uses static WCET analysis any more :-(
> >
> > I would presume that those applications that needed WCET (worst-case
> > execution time) analysis in the past still need it. What would they
> > use instead?
> They can seldom use /static/ WCET analysis because the execution speed
> of modern processors is too unpredictable, thanks to all the
> acceleration mechanism that are ever-evolving and poorly documented, not
> to mention multi-cores and their resource-contention problems, not to
> mention out-of-order and speculative processing...
>
> The only pleasure aficionados of static WCET analysis (like myself) have
> nowadays is the schadenfreude we feel when all those acceleration
> mechanisms turn out to be side channels leaking secrets -- Spectre and
> Meltdown etc.
<
Have we finally reached the point that we suspect ANY and EVERY new way
to extract ILP is LIKELY to create side-channels?
<
So, the only sane architecture point is to define the boundary between
architecture (SW model) and microarchitecture (HW model) such that
microarchitectural state is never visible even with a high-precision clock.

MitchAlsup

Jul 5, 2022, 4:13:05 PM

What happens when context switches are not performed by running
instructions, but by the arrival of a "context switch" message, and HW
does everything wrt saving old state and absorbing new state?

Michael S

Jul 5, 2022, 4:49:06 PM

Actually, it did, eventually.
Fujitsu got to 4250 MHz, Oracle to 5000 MHz.
MIPS never reached quite that high. But, then again, it didn't last that
long and even at the peak of its power had a much lower development budget.

EricP

Jul 5, 2022, 5:39:58 PM

Yes, as Mitch also points out, these registers (mostly) contain
currently valid data - it's just that their 5-bit architecture
register names are prefixed by the 5-bit Window Pointer that was
current when they were written. But user mode doesn't have access
to the full 10-bit architecture register numbers.

What an OoO implementation might do is free up the physical registers
after an older window was spilled, allowing them to be reused for
in-flight instructions. But it would still need to have enough
physical registers for the worse case window save sets,
plus a minimum to issue new instructions.
So that is about 32*16+16 = 528 + say 200 for in-flight = 728.

But I would want a fully automatic spill/fill mechanism to go with this
so when Renamer, which assigns free registers, sees the free list is too
small it can trigger a spill and get back a block of 16 free physicals.

The register window is a circular buffer, with 6-bit head and tail
circular indexes (5 bits index plus a wrap bit), Valid and Modify bits
for each set, and a LDM/STM mechanism attached to automatically
spill/fill sets without trapping to kernel mode.

The large number of effective architecture registers could be a problem.
Using SRAM style Rename tables would require 10-bit arch reg id index,
1024 entries of say 10 bits each for the 728 physical registers,
to hold the architecture-to-physical map.
That's a pretty big table to checkpoint for each conditional branch.

OoO register windows gets expensive real fast.
That probably explains their scarcity in the wild.
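
The checkpoint arithmetic, spelled out (the windowed numbers are from
the paragraph above; the flat-file baseline is an assumed comparison):

    #include <stdio.h>

    int main(void) {
        int windowed = 1024 * 10;  /* 10-bit arch ids -> 1024 entries
                                      of 10 bits for ~728 physregs    */
        int flat     = 64 * 8;     /* e.g. 64 arch regs, 8-bit ids
                                      for ~200 physregs (assumed)     */
        printf("windowed map: %d bits (%d bytes) per checkpoint\n",
               windowed, windowed / 8);
        printf("flat map:     %d bits (%d bytes) per checkpoint\n",
               flat, flat / 8);
        return 0;
    }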


Andy

Jul 5, 2022, 8:09:18 PM

On 5/07/22 05:56, MitchAlsup wrote:

> IBM 360 series::
> caller allocated a register save area and always kept it in R13
> upon any arrival (Callee, interruptee) would perform STM 12,12(13)
> which would dump all 16 registers in the save area, as the first step
> in properly receiving control.
> <
> VAX did something similar, but performed it in the CALL side of ISA
> processing.
>>
>> I wonder, that if they were real, is it possible given today's
>> transistor core counts those mechanisms could be revived in modern clean
>> sheet cpu designs?, to at least help alleviate some of the issues we see
>> in the current state of the art perhaps?
> <
> Effectively that is what ENTER and EXIT do in My 66000 architecture.
>>
Heh, okay then, problem solved I guess. (I really should get around to
emailing you for a copy of the My 66000 manual, but I've still got a ton
of PDFs littering my desktop at the moment)

However, on the other hand, I'm still wondering if a fully transparent
mechanism existed or is possible.
I.e. a program JSRs to a subroutine and just uses the registers it needs
without any special instructions or calling conventions needed.
When it hits the return instruction all the registers it disturbed are
restored to their former values automatically (except for the nominated
resultant register/s).


>> Or were there some really big downsides to the automagic things that I'm
>> not remembering well enough?
> <
> Automagic was so slow in VAX that the more modern compilers used JSR
> instead of CALL and got rid of ½ of the cycles in calling/returning.

So hopefully there's enough transistors to go around these days that
such a thing could be made faster than programmed spill/fill, maybe?


MitchAlsup

Jul 5, 2022, 8:44:33 PM

On Tuesday, July 5, 2022 at 7:09:18 PM UTC-5, Andy wrote:
> On 5/07/22 05:56, MitchAlsup wrote:
>
> > IBM 360 series::
> > caller allocated a register save area and always kept it in R13
> > upon any arrival (Callee, interruptee) would perform STM 12,12(13)
> > which would dump all 16 registers in the save area, as the first step
> > in properly receiving control.
> > <
> > VAX did something similar, but performed it in the CALL side of ISA
> > processing.
> >>
> >> I wonder, that if they were real, is it possible given today's
> >> transistor core counts those mechanisms could be revived in modern clean
> >> sheet cpu designs?, to at least help alleviate some of the issues we see
> >> in the current state of the art perhaps?
> > <
> > Effectively that is what ENTER and EXIT do in My 66000 architecture.
> >>
> Heh, okay then, problem solved I guess. (I really should get around to
> emailing you for a copy of the My 66000 manual, but I've still got a ton
> of PDFs littering my desktop at the moment)
>
> However on the other hand I'm still wondering if a fully transparent
> mechanism existed/is possible.
<
> IE a program JSRs to a subroutine and just uses the registers it needs
> without any special instructions or calling conventions needed.
<
Effectively, Mill does this without adding instructions.
My 66000 does this with instructions.
<
On the other hand, up to ½ of all subroutine calls are to leaf subroutines
and in My 66000 ISA, if a subroutine does not touch R16..R31 it neither
has to nor expends any effort to save/restore registers. In my opinion
this is where you want the overhead to be minimal--don't do stuff that
does not need doing. For such subroutines the overhead is:
<
a) setting up arguments to the yet to be called subroutine
b) calling the subroutine
c) performing the subroutine
d) RET
e) doing something with the return value.
<
Any and all subroutines that can be performed using R0..R16 and do not
need any Local_data_area on the stack are overhead free.
<
{You can't get rid of (a) and (e) without inlining the subroutine.
And you can't get rid of (c), leaving only the 2 transfers of control}
<
> When it hits the return instruction all the registers it disturbed are
> restored to their former values automatically (except for the nominated
> resultant register/s).
<
You are making the assumption that values in registers cost more
to recalculate than to save and restore--which is often untrue. It is
often the case that one can recalculate a value in one cycle where
saving and restoring might cost up to 6 cycles. Then there are
registers the compiler KNOWS are not holding a value that is still
alive. Allowing these to die at certain control-transfer points saves
overhead.
<
> >> Or were there some really big downsides to the automagic things that I'm
> >> not remembering well enough?
> > <
> > Automagic was so slow in VAX that the more modern compilers used JSR
> > instead of CALL and got rid of ½ of the cycles in calling/returning.
<
> So hopefully there's enough transistors to go around these days that
> such a thing could be made faster than programed spill/fill maybe?
<
I am betting in that direction with ENTER and EXIT, and zero-instruction
context switching. With our current transistor budgets, one should be able
to deliver/receive at least 4-8 registers per cycle to/from the stack. This
saves 3-7 AGENs and consumes no more than ¼ cache line per cycle.
It is hard to imagine a series of LDs and STs that would save this much
power, cycles, accesses,... Also with ENTER and EXIT one can put preserved
registers on a stack where they cannot be destroyed and guarantee that
the ABI is preserved in both directions across the CALL boundary.

Ivan Godard

Jul 5, 2022, 8:58:33 PM

Whereas if the HW at the call site knows which registers have live
content then it can silently do the save/restore without instructions -
and without compiler analysis. If the save/restore is inherently lazy
and the callee doesn't use the location then there's no overhead at all.
That's what Mill tries to do.

However, IMO the biggest value of Mill-style implicit save compared to
the my66 (and all others here) explicit save is future-proofing. The two
methods can be made overhead-equivalent with enough work, but I prefer
to keep the sausage-making hidden.

MitchAlsup

Jul 5, 2022, 10:01:58 PM

It would be easy enough to have the ENTER instruction just associate
a stack address with the registers annotated, and then only move them
to stack if they get written; i.e., lazily. Then the EXIT instruction can
restore only those modified. However, this can cause "sequencing"
issues as the registers are not 'dense' whereas they were dense in
the encoding. I prefer not to have too many sequencing issues.

Andy

Jul 5, 2022, 11:10:40 PM

On 5/07/22 04:14, EricP wrote:

>
> Sparc register windows. They were opaque, asynchronous lazy spill/fill
> which was done by kernel mode traps. Reportedly it had... issues.

Yep, while I'm vaguely aware of SPARC register windows, that's not
exactly what I was looking for, since I'm pretty sure it was one of the
older, perhaps lesser-known, CISCy mainframes, not a newish RISC-style
microprocessor.
And I think it used a fairly straightforward register file aside from the
fact the programmer never needed to manually save and restore registers
when calling subroutines and the like.


But aside from that, I was never struck with the impression that
hardware-managed register windows were a particularly great idea.

If someone is going to put 120-odd registers into a CPU, surely making
them all programmer visible in some way and letting them / the compiler
or operating system decide how best to carve the register file up would
be the better option.

And of course with fresh pop-corn in hand, watching the ensuing
technical / corporate / tribal / religious / jihadist arguments flying
to and fro on how best to do that, would no doubt be hugely and/or
endlessly entertaining! ;-)

Anton Ertl

Jul 6, 2022, 6:36:54 AM

MitchAlsup <Mitch...@aol.com> writes:
>Whereas MIPS got to high frequencies rather easily, SPARC never did.

Bullshit.

MHz Architecture, CPU
1000 MIPS R16000A (select customers), at least 800MHz in 2004
1200 SPARC UltraSPARC III Cu (released 2001 with at least 900MHz)
1500 MIPS 1074K
2000 MIPS P5600, P6600
4250 SPARC64 XII (Fujitsu)
5000 SPARC M8 (Oracle)

>Read into that what you will.

I read into that that SPARC had a ~3 year clock rate advantage on MIPS
in the early 2000s.

SGI/MIPS had difficulties competing in the GHz race and eventually
bowed out; the higher-clocked cores we see much later are embedded
cores.

Sun and Fujitsu persevered on. Fujitsu introduced the OoO Sparc64 V
in 2003 and reached competitive clock rates over time. Sun/Oracle
continued for a while with in-order cores and lower clock rates, until
they introduced the 2850-3000MHz SPARC T4 with OoO in 2011, and almost
doubled the clock rate compared to the in-order 1650MHz SPARC T3.
Later OoO CPUs from Oracle further increased the clock rate and also
the width. But apparently the customers had defected to AMD64 in the
meantime, so Oracle canceled SPARC after the M8.

John Dallman

Jul 6, 2022, 10:36:45 AM

In article <2022Jul...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> But apparently the customers had defected to AMD64 in
> the meantime, so Oracle canceled SPARC after the M8.

Oracle cancelled SPARC slightly before the M8 shipped, by getting rid of
the development team and a lot of the Solaris staff. They claimed that it
would carry on getting faster for years, via specialised on-chip
co-processors, but they were not very convincing in that, nor in their
claims that customers would be able to run Solaris until at least 2031.

John

EricP

Jul 6, 2022, 11:44:00 AM

A little poking about finds a description of the OoO SPARC64 V circa-2004
microarchitecture and how it handles registers to get a higher clock rate.

Integer register GPR (General Purpose Registers) has 8 read ports,
2x2 read for integer ops, 2x2 read for AGU.

To access the GPRs faster it:
(a) limits the register windows to 8 sets.
(b) splits the physical register file into 2 sections:
the slower large GPR, and fast access JWR Joint Work Register.

The JWR keeps 3 windows (64*8 bytes in total) for the current window and
those either side of it. Read data from JWR is fed to execution units.
Both JWR and GPR are updated at the same time on commit.
When a window switch occurs, hardware copies window data
between JWR and GPR in the background.

(Hmmm... if JWR and GPR are both updated on commit,
how does that not put the GPR back as the critical path limit?)

On some implementations, it says, the results of integer operations
are maintained in a 32-entry GUB (GPR Update Buffer) which has 8R 4W ports.
There is also a FUB for floats.
I gather results are copied from GUB to GPR on commit.



Marcus

Jul 6, 2022, 11:45:36 AM

I can still run Commodore BASIC v2.0 in 2022, so I guess the claim that
Solaris will be runnable in 2031 is true.

(That does not make it a sound investment, though)

>
> John

BGB

Jul 6, 2022, 2:15:43 PM

Partial issues:
Encoding: 7-bit register fields won't really fit effectively into a
32-bit instruction format;
LUTs: For FPGA, 7-bit access in LUTRAM is significantly less
resource-efficient than 5 or 6 bits.


It would appear though that many of the RISC-V implementations
effectively have 3 copies of the register file (the U/S/M modes having
separate copies of the register file; so would be around 96 registers
internally).

...



Though, I guess one could also debate whether it would be viable to
implement a variant of the Itanium ISA on an FPGA (haven't looked into
it enough to figure how easily an IA-64 core could fit into an XC7A100T
or similar).


Would likely need a partial software emulation layer though to emulate
things like an SVGA card plugged into a PCI bus and similar, like if one
hopes to be able to run an IA64 build of Windows or similar on it.


This seems like an area where something like a DE10 or similar could
potentially have an advantage.

Looking it up, in a quick search it doesn't appear anyone has done IA-64
for the MiSTer or similar...

...


EricP

Jul 6, 2022, 2:31:12 PM

Andy wrote:
> On 5/07/22 04:14, EricP wrote:
>
>>
>> Sparc register windows. They were opaque, asynchronous lazy spill/fill
>> which was done by kernel mode traps. Reportedly it had... issues.
>
> Yep while I'm vaguely aware of SPARC register windows, that's not
> exactly what I was looking for, since I'm pretty sure it was one of the
> older perhaps lesser known CISCy mainframes not a newish RISC style
> microprocessor.

Back around 1980..84 there were two seminal research projects that
popularized the whole RISC approach
- Stanford MIPS by Hennessy et al, which was later reworked
some and launched commercially as the MIPS R2000
- Berkeley RISC with register windows, aka RISC-I, by Patterson et al,
which was later commercialized as SPARC architecture and inspired ARM.

Patterson is one of the originators behind the current RISC-V ISA.

> And I think it used a fairly strait forward register file aside from the
> fact the programmer never needed to manually save and restore registers
> when calling subroutines and the like.
>
>
> But aside from that, I was never struck with the impression that
> hardware managed register windows were a particularly great idea.

In days of yore when memory was 500 ns it was probably really smart.
Berkeley RISC had 78 32-bit 2R1W registers, which would be enough for
3 windows (3*16+16) plus sundry housekeeping. Subroutine SAVE and
RESTORE would have taken a clock or so if it didn't trigger a spill.

But the technology had to be good enough to cram 78 registers on 1 chip.
It needed 44500 4um NMOS transistors which only became possible a few
years earlier.

> If someone is going to put 120 odd registers into a CPU surely making
> them all programmer visible in some way and letting them / the compiler
> or operating system decide how best to carve the register file up would
> be the better option.

Per-priority register banks in ARM have the same effect.
You pay for holding all that architectural state in live registers
but can't use them.

Jecel Assumpção Jr

Jul 7, 2022, 8:08:10 PM

On Monday, July 4, 2022 at 2:48:36 AM UTC-3, Andy wrote:
> The discussions going on about register to/from stack and load/store
> multiple instructions has got me vaguely remembering that there was some
> talk about old mainframes that could save to stack automatically any
> registers in danger of being overwritten after a jump to subroutine or such.

The AT&T Hobbit (CRISP) had a stack cache with a hardware spiller/refiller that took advantage of any otherwise unused memory cycles.

-- Jecel

Andy

Jul 9, 2022, 8:57:01 PM

On 8/07/22 12:08, Jecel Assumpção Jr wrote:

> The AT&T Hobbit (CRISP) had a stack cache with a hardware spiller/refiller that took advantage of any otherwise unused memory cycles.
>
> -- Jecel

Thanks, good to know that other CPUs also had the feature.

Andy

Jul 9, 2022, 9:38:23 PM

On 7/07/22 06:15, BGB wrote:

<snip>
>
> Though, I guess one could also debate whether it would be viable to
> implement a variant of the Itanium ISA on an FPGA (haven't looked into
> it enough to figure how easily an IA-64 core could fit into an XC7A100T
> or similar).
>
>
> Would likely need a partial software emulation layer though to emulate
> things like an SVGA card plugged into a PCI bus and similar, like if one
> hopes to be able to run an IA64 build of Windows or similar on it.
>
>
> This seems like an area where something like a DE10 or similar could
> potentially have an advantage.
>
> Looks it up, in a quick search it doesn't appear anyone has done IA-64
> for the MiSTer or similar...
>
> ...

Hmmm, somehow I think it'd be easier and cheaper to buy a real Itanium
based workstation off eBay than trying to buy an FPGA large enough to
hold a soft-core version of an Itanium. YMMV


Then there's the issue of the compiler to deal with. I imagine progress
in VLIW scheduling compiler research has continued on since Itanium
effectively died, but would anyone be motivated enough to collect the
latest advancements and update a compiler just for the Itanium machines
still working out there?


There are probably far better / smaller / easier VLIW style cores to
study and replicate in a FPGA than Itanium I think.

But then with VLIW the hardware is just half the battle; you still need
to program it so that it runs at near peak performance, which I take it
is the harder part, again YMMV.





BGB

Jul 9, 2022, 11:12:57 PM

On 7/9/2022 8:38 PM, Andy wrote:
> On 7/07/22 06:15, BGB wrote:
>
> <snip>
>>
>> Though, I guess one could also debate whether it would be viable to
>> implement a variant of the Itanium ISA on an FPGA (haven't looked into
>> it enough to figure how easily an IA-64 core could fit into an
>> XC7A100T or similar).
>>
>>
>> Would likely need a partial software emulation layer though to emulate
>> things like an SVGA card plugged into a PCI bus and similar, like if
>> one hopes to be able to run an IA64 build of Windows or similar on it.
>>
>>
>> This seems like an area where something like a DE10 or similar could
>> potentially have an advantage.
>>
>> Looks it up, in a quick search it doesn't appear anyone has done IA-64
>> for the MiSTer or similar...
>>
>> ...
>
> Hmmm, somehow I think it'd be easier and cheaper to buy a real Itanium
> based workstation off Ebay, than trying to buy a FPGA large enough to
> hold a soft core version of an Itanium. YMMV
>

I am left wondering if I could make it fit, at least in a basic sense,
on something like an XC7A100T. Dev-boards with these (such as the Nexys
A7) are available for around ~ $270 or so last I looked (and this is
basically what I am using for my BJX2 Core).


At least at a superficial level, the IA-64 ISA isn't *that* far beyond
what I have already done with the BJX2 ISA.


Most obvious difference is that the IA-64 register file would be
significantly larger. Would also probably need to omit the IA-32
decoder, ...



In this case, the idea would partly be to emulate parts of the ISA on
top of itself (likely via hardware traps).


If I were to do it via a modified BJX2 core, would potentially replace
the RISC-V alt-mode with an IA-64 alt-mode, and considerably expand the
size of the register file and similar.

Though, this looks concerning: the amount of expansion needed would
likely push the core beyond the resource limits of the XC7A100T.


If I were to approach the register file design in a similar way to
what I have done with my BJX2 core, I will effectively need a 512-entry
register file (likely also 8R4W if using 64-bit ports). Probably "more
sane" to use multiple smaller register files.

This seems a little absurd...

This might require a bigger FPGA...

And/or come up with a more cost-effective way to implement such a
register file.


Decided to leave out idle thoughts about possible ways to try to
approach implementing the register file.


>
> Then there's the issue of the compiler to deal with. I imagine progress
> in VLIW scheduling compiler research has continued on since Itanium
> effectively died, but would anyone be motivated enough to collect the
> latest advancements and update a compiler just for the Itanium machines
> still working out there?
>

AFAIK, GCC can target IA-64.
Not sure how good its code generation is.
Apparently the target has been deprecated though.


>
> There are probably far better / smaller / easier VLIW style cores to
> study and replicate in a FPGA than Itanium I think.
>

It seems like I am one of the (relatively few) people doing VLIW on FPGA
(at all).

Most of the other people I know of, are doing RISC variants (and/or
RISC-V implementations).

Looks like pretty much no one is bothering with soft-core processors for
IA-64.


> But then with VLIW the hardware is just half the battle, you still need
> to program it so that it runs at near peak performance, which I take it
> is the harder part, again YMMV.
>

Yeah...

With my existing ISA, my C compiler gets nowhere near the full speed of
what is possible. Can do a little better by writing hand-optimized ASM,
but this doesn't scale very well.


Thomas Koenig

unread,
Jul 10, 2022, 4:14:03 AM7/10/22
to
Andy <nos...@nowhere.com> schrieb:

> Then there's the issue of the compiler to deal with. I imagine progress
> in VLIW scheduling compiler research has continued on since Itanium
> effectively died, but would anyone be motivated enough to collect the
> latest advancements and update a compiler just for the Itanium machines
> still working out there?

Itanium is still supported by recent gcc versions, but it hasn't seen
much work in recent years, unsurprisingly.

Anton Ertl

unread,
Jul 10, 2022, 7:08:09 AM7/10/22
to
Andy <nos...@nowhere.com> writes:
>Then there's the issue of the compiler to deal with. I imagine progress
>in VLIW scheduling compiler research has continued on since Itanium
>effectively died

I guess there are still a few stragglers, but the caravan has moved
on. Even when EPIC research was a hot topic, the progress in
compilers was not as good as the EPIC fans had hoped. I would not
expect significant progress on that front, and even less when you
use gcc.

Robert Swindells

unread,
Jul 10, 2022, 8:29:16 AM7/10/22
to
On Sat, 9 Jul 2022 22:12:48 -0500, BGB wrote:

> On 7/9/2022 8:38 PM, Andy wrote:
>> On 7/07/22 06:15, BGB wrote:
>>
>> <snip>
>>>
>>> Though, I guess one could also debate whether it would be viable to
>>> implement a variant of the Itanium ISA on an FPGA (haven't looked into
>>> it enough to figure how easily an IA-64 core could fit into an
>>> XC7A100T or similar).
>>>
>>>
>>> Would likely need a partial software emulation layer though to emulate
>>> things like an SVGA card plugged into a PCI bus and similar, like if
>>> one hopes to be able to run an IA64 build of Windows or similar on it.
>>>
>>>
>>> This seems like an area where something like a DE10 or similar could
>>> potentially have an advantage.
>>>
>>> Looks it up, in a quick search it doesn't appear anyone has done IA-64
>>> for the MiSTer or similar...
>>>
>>> ...
>>
>> Hmmm, somehow I think it'd be easier and cheaper to buy a real
>> Itanium-based workstation off eBay than trying to buy an FPGA large
>> enough to hold a soft-core version of an Itanium. YMMV
>>
>>
> I am left wondering if I could make it fit, at least in a basic sense,
> on something like an XC7A100T. Dev-boards with these (such as the Nexys
> A7) are available for around ~ $270 or so last I looked (and this is
> basically what I am using for my BJX2 Core).

It won't be as fast, but you can get an XC6SLX100 for quite a low price;
search for "pano logic g2".

You need to do a bit of soldering to add a JTAG connector.

John Dallman

unread,
Jul 10, 2022, 11:21:47 AM7/10/22
to
In article <tadg3l$19jd4$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:

> Looks like pretty much no one is bothering with soft-core
> processors for IA-64.

Nor emulators. There's enough hardware still around for those who need it,
and very few people regard IA-64 as fun.

John

Stefan Monnier

unread,
Jul 10, 2022, 11:26:45 AM7/10/22
to
> Looks like pretty much no one is bothering with soft-core processors for IA-64.

AFAIK no one wants to run IA-64 code anywhere at all (except for very
rare legacy applications which are being ported as fast as possible to
new hardware, maybe).

Any kind of IA-64 soft-core (or emulator for that matter) sounds like
masochism more than anything.


Stefan

John Dallman

unread,
Jul 10, 2022, 12:01:10 PM7/10/22
to
In article <jwvsfn97zzl.fsf-...@gnu.org>,
mon...@iro.umontreal.ca (Stefan Monnier) wrote:

> AFAIK no one wants to run IA-64 code anywhere at all (except for very
> rare legacy applications which are being ported as fast as possible
> to new hardware, maybe).

The residual HP-UX customer base is transitioning. NonStop has probably
finished its transition, since x86 hardware for it has been available
since 2014. The remaining VMS customer base is on IA-64, plus some
ancient Alphas. The commercial release of VMS on x86-64 ships next week,
but most of the compilers are still cross-compilers running on IA-64.

> Any kind of IA-64 soft-core (or emulator for that matter) sounds
> like masochism more than anything.

Yup.

John

Thomas Koenig

unread,
Jul 10, 2022, 12:21:20 PM7/10/22
to
John Dallman <j...@cix.co.uk> schrieb:

> The residual HP-UX customer base is transitioning. NonStop has probably
> finished its transition, since x86 hardware for it has been available
> since 2014. The remaining VMS customer base is on IA-64, plus some
> ancient Alphas. The commercial release of VMS on x86-64 ships next week,

I didn't know that.

Do you have any more details?

John Dallman

unread,
Jul 10, 2022, 1:03:24 PM7/10/22
to
In article <taeu9t$808$1...@newsreader4.netcologne.de>,
tko...@netcologne.de (Thomas Koenig) wrote:

> John Dallman <j...@cix.co.uk> schrieb:
> > The commercial release of VMS on x86-64 ships next week,
>
> I didn't know that.
>
> Do you have any more details?

<https://vmssoftware.com/about/faq/>
<https://vmssoftware.com/about/news/2022-07-08-state-of-the-92-release-ready/>

John

BGB

unread,
Jul 10, 2022, 2:16:46 PM7/10/22
to
I guess it can be raised as a question: given all of the design
problems that exist with IA-64, why did people think it was a good
design in the first place?...

Namely, either the "EPIC" part works exactly as intended, or it is
going to suck, hard...


>
> Stefan

BGB

unread,
Jul 10, 2022, 2:25:10 PM7/10/22
to
On 7/10/2022 6:02 AM, Anton Ertl wrote:
> Andy <nos...@nowhere.com> writes:
>> Then there's the issue of the compiler to deal with, I imagine progress
>> in VLIW scheduling compiler research has continued on since Itanium
>> effectively died
>
> I guess there are still a few stragglers, but the caravan has moved
> on. Even when EPIC research was a hot topic, the progress in
> compilers was not as good as the EPIC fans had hoped. I would not
> expect significant progress on that front, and even less when you
> use gcc.
>

Yes, probably true.

One drawback with the EPIC / Itanium approach is that either:
The compiler produces something that matches what the ISA can express;
Or, it is going to suck.


One advantage of the approach I was using (essentially a similar
approach to what was used by the TMS320, with a similar ASM syntax as
well, *) is that even if the compiler sucks at the whole VLIW thing, at
least the code density is still fairly reasonable.

*: Well, or typically:
OP2 | OP1
Vs:
OP2
|| OP1

But, my assembler is flexible on this: if it sees either '|' or '||' it
will assume that one wants to encode the instruction to run in parallel.
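
Schematically, the marker check amounts to something like this
(made-up code for illustration, not the actual assembler source):

static int starts_parallel(const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;                  /* skip leading whitespace */
    return line[0] == '|';       /* '|' and '||' treated alike */
}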


Given BGBCC's limited effectiveness here, if the IA-64 approach were
used, the generated machine code would likely be around 3x bigger (with
BGBCC currently generating an average-case bundle length of ~ 1.2 .. 1.3
or so).
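
(Back-of-envelope, using the standard IA-64 encoding of 128-bit
bundles with 3 op slots: filling only ~1.25 slots per bundle works out
to 16/1.25 ~= 12.8 bytes per useful op, versus ~4 bytes/op for a
mostly 32-bit serial encoding, which is roughly where the 3x figure
comes from.)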


> - anton

Thomas Koenig

unread,
Jul 11, 2022, 1:36:41 AM7/11/22
to
John Dallman <j...@cix.co.uk> schrieb:
Thanks.

As a former DEC employee told me: The "OPEN" is silent.

Andy

unread,
Jul 13, 2022, 11:50:15 PM7/13/22
to
On 10/07/22 15:12, BGB wrote:


> I am left wondering if I could make it fit, at least in a basic sense,
> on something like an XC7A100T. Dev-boards with these (such as the Nexys
> A7) are available for around ~ $270 or so last I looked (and this is
> basically what I am using for my BJX2 Core).
>
>
> At least at a superficial level, the IA-64 ISA isn't *that* far beyond
> what I have already done with the BJX2 ISA.
>

If you say so, looks like Mount Everest to me though...


>
> Most obvious difference is that the IA-64 register file would be
> significantly larger. Would also probably need to omit the IA-32
> decoder, ...
>

Perhaps something smaller, a Transmeta Crusoe or Efficeon maybe: only 64
registers if you include the deep speculation, 32 if you skip that, and
the IA-32 decoding is just a re-assembly of the Code Morphing firmware
you can find on the internet.




> In this case, the idea would partly be to emulate parts of the ISA on
> top of itself (likely via hardware traps).
>
>
> If I were to do it via a modified BJX2 core, would potentially replace
> the RISC-V alt-mode with an IA-64 alt-mode, and considerably expand the
> size of the register file and similar.

Hmmm,

>
> Though, this looks concerning: the amount of expansion needed would
> likely push the core beyond the resource limits of the XC7A100T.

Maybe skipping the great big Intel CPU cores is for the best. ;-)

> If I were to approach the register file design in a similar way to
> what I have done with my BJX2 core, I will effectively need a 512-entry
> register file (likely also 8R4W if using 64-bit ports). Probably "more
> sane" to use multiple smaller register files.
>
> This seems a little absurd...

Agreed

>
> This might require a bigger FPGA...

Oh no...

>
> And or come up with a more cost-effective way to implement such a
> register file.

Possibly, not sure myself.


>> Then there's the issue of the compiler to deal with. I imagine
>> progress in VLIW scheduling compiler research has continued on since
>> Itanium effectively died, but would anyone be motivated enough to
>> collect the latest advancements and update a compiler just for the
>> Itanium machines still working out there?
>>
>
> AFAIK, GCC can target IA-64.
>   Not sure how good its code generation is.
>   Apparently the target has been deprecated though.
>

Always wondered how good GCC would be at generating code for a VLIW CPU.
I just assumed those so inclined would steal whatever language front-end
they could find and write the bulk of the VLIW specific compiler from
scratch.


>> There are probably far better / smaller / easier VLIW style cores to
>> study and replicate in a FPGA than Itanium I think.
>>
>
> It seems like I am one of the (relatively few) people doing VLIW on FPGA
> (at all).
>

Aside from the odd DSP-core, you might be right.

If only Transmeta could have held on a little longer, or done things
differently, like opening up the internal instruction set so that
hackers and compiler writers could have targeted their cores directly
with better GCC code generation... they might still be around with huge
sales in Android devices right now, and VLIW research could have gotten
the injection of resources it needed to gain the performance to stay
competitive, at least.

Although Nvidia isn't exactly setting the world on fire in CPU sales
either...


> Most of the other people I know of, are doing RISC variants (and/or
> RISC-V implementations).
>

RISC is pretty much the text-book common denominator these days.

I kinda hope VLIW makes a mainstream comeback somehow; the current
CISC/RISC duopoly doesn't seem particularly healthy for the long term.

Maybe massive machine learning trained compilers can make a dent in the
software side of the VLIW equation?


> Looks like pretty much no one is bothering with soft-core processors for
> IA-64.

I'm thinking that's probably for the best. ;-)


>> But then with VLIW the hardware is just half the battle, you still
>> need to program it so that it runs at near peak performance, which I
>> take it is the harder part, again YMMV.
>>
>
> Yeah...
>
> With my existing ISA, my C compiler gets nowhere near the full speed of
> what is possible. Can do a little better by writing hand-optimized ASM,
> but this doesn't scale very well.


Seems to me that is the nub of the issue: if your WEX hardware is pretty
much working as intended, then getting decent code generation out of your
compiler might be the best bang for your buck.

I'm sure there's still plenty of research papers to be read on the
subject, and if you happen to invent some new way to efficiently pack
many operations into a string of wide words, well, fortune and glory and
that jazz could be yours for the taking...

Or possibly the 8000lb gorilla will stomp on your neck and steal your
lunch money just like it did to Transmeta...

Too early to tell, I guess... :-)










BGB

unread,
Jul 14, 2022, 5:15:39 AM7/14/22
to
On 7/13/2022 10:49 PM, Andy wrote:
> On 10/07/22 15:12, BGB wrote:
>
>
>> I am left wondering if I could make it fit, at least in a basic sense,
>> on something like an XC7A100T. Dev-boards with these (such as the
>> Nexys A7) are available for around ~ $270 or so last I looked (and
>> this is basically what I am using for my BJX2 Core).
>>
>>
>> At least at a superficial level, the IA-64 ISA isn't *that* far beyond
>> what I have already done with the BJX2 ISA.
>>
>
> If you say so, looks like Mount Everest to me though...
>

It is complicated, granted, but at a basic level most of the parts are
not *that* complicated. Main problem, as noted, would mostly be the
stupidly large register file.
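
One semi-standard FPGA trick that might make it less absurd is bank
replication with a "live value table" (LVT): each write port gets its
own full copy of the register file, and a small table records which
copy last wrote each register. A rough C model of the idea (sizes and
names made up for illustration, not anything from my actual core):

#include <stdint.h>

#define NREGS   512   /* the absurd IA-64-ish case  */
#define NWPORTS 4     /* write ports (as in 8R4W)   */

typedef struct {
    uint64_t bank[NWPORTS][NREGS]; /* one full copy per write port */
    uint8_t  lvt[NREGS];           /* which copy is live, per reg  */
} regfile_t;

static void rf_write(regfile_t *rf, int wport, int reg, uint64_t val)
{
    rf->bank[wport][reg] = val;
    rf->lvt[reg] = (uint8_t)wport;  /* remember who wrote last */
}

static uint64_t rf_read(const regfile_t *rf, int reg)
{
    return rf->bank[rf->lvt[reg]][reg];  /* read the live copy */
}

The catch being that in hardware each bank also gets replicated per
read port, so block-RAM usage grows multiplicatively; whether that
actually fits an XC7A100T is another question.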


>
>>
>> Most obvious difference is that the IA-64 register file would be
>> significantly larger. Would also probably need to omit the IA-32
>> decoder, ...
>>
>
> Perhaps something smaller, Transmeta Crusoe or Efficion maybe, only 64
> registers if you include the deep speculation, 32 if you skip that, and
> the IA-32 decoding is just a re-assemble of the Code Morphing firmware
> you can find on the internet.
>


I had considered a few times maybe trying to do an x86 emulator on BJX2,
but this is one of those "never got around to it" issues.

Would need to go directly to JIT though, as there is pretty much no hope
of usable performance with a conventional interpreter.

And, on a 50 MHz CPU core, I would probably be lucky if it even matched
the performance of the original IBM PC.

Would likely also need instructions to allow faking the behavior of x86
style ALU and branch ops (my ISA lacks condition codes, and these would
be expensive to emulate).
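
The usual pure-software dodge for the flags is lazy evaluation: stash
the operands and result of the last flag-setting op, and only
materialize individual flag bits when something actually reads them.
A quick sketch of the idea (hypothetical C, not from any actual
emulator):

#include <stdint.h>

enum { LF_ADD, LF_SUB };

typedef struct {
    uint32_t a, b, res;  /* operands and result of the last op */
    int      op;
} lazyflags_t;

static void lf_add(lazyflags_t *lf, uint32_t a, uint32_t b)
{ lf->a = a; lf->b = b; lf->res = a + b; lf->op = LF_ADD; }

static void lf_sub(lazyflags_t *lf, uint32_t a, uint32_t b)
{ lf->a = a; lf->b = b; lf->res = a - b; lf->op = LF_SUB; }

static int lf_zf(const lazyflags_t *lf)  /* ZF: result was zero */
{ return lf->res == 0; }

static int lf_cf(const lazyflags_t *lf)  /* CF: carry / borrow */
{
    switch (lf->op) {
    case LF_ADD: return lf->res < lf->a;  /* unsigned wrap */
    case LF_SUB: return lf->a < lf->b;    /* borrow taken  */
    }
    return 0;
}

In a JIT, most of the flag computation then folds away entirely, since
the typical pattern is a compare followed immediately by a branch on a
single flag.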


>
>> In this case, the idea would partly be to emulate parts of the ISA on
>> top of itself (likely via hardware traps).
>>
>>
>> If I were to do it via a modified BJX2 core, would potentially replace
>> the RISC-V alt-mode with an IA-64 alt-mode, and considerably expand
>> the size of the register file and similar.
>
> Hmmm,
>

In any case, not going to do this; it was more a hypothetical.


>>
>> Though, this looks concerning: the amount of expansion needed would
>> likely push the core beyond the resource limits of the XC7A100T.
>
> Maybe skipping the great big Intel CPU cores is for the best. ;-)
>

Probably true.

I had previously wanted to buy a board with an XC7A200T (Nexys Video),
but lacked money.

Now it seems they are sold out pretty much everywhere...



>> If I were to approach the register file design in a similar way to
>> what I have done with my BJX2 core, I will effectively need a
>> 512-entry register file (likely also 8R4W if using 64-bit ports).
>> Probably "more sane" to use multiple smaller register files.
>>
>> This seems a little absurd...
>
> Agreed
>
>>
>> This might require a bigger FPGA...
>
> Oh no...
>
>>
>> And/or come up with a more cost-effective way to implement such a
>> register file.
>
> Possibly, not sure myself.
>
>
>>> Then there's the issue of the compiler to deal with. I imagine
>>> progress in VLIW scheduling compiler research has continued on since
>>> Itanium effectively died, but would anyone be motivated enough to
>>> collect the latest advancements and update a compiler just for the
>>> Itanium machines still working out there?
>>>
>>
>> AFAIK, GCC can target IA-64.
>>    Not sure how good its code generation is.
>>    Apparently the target has been deprecated though.
>>
>
> Always wondered how good GCC would be at generating code for a VLIW CPU.
> I just assumed those so inclined would steal whatever language front-end
> they could find and write the bulk of the VLIW-specific compiler from
> scratch.
>

Dunno. I wrote my whole compiler from scratch.

But, given how much it sucks with my own ISA, and what would be needed
for "good" results with IA-64, it would likely be straight up terrible...


>
>>> There are probably far better / smaller / easier VLIW style cores to
>>> study and replicate in a FPGA than Itanium I think.
>>>
>>
>> It seems like I am one of the (relatively few) people doing VLIW on
>> FPGA (at all).
>>
>
> Aside from the odd DSP-core, you might be right.
>

Possibly.

I suspect the sinking of the Itanium had done a lot to sour the
reputation of VLIW in general.

Like, Itanium did for VLIW what the Hindenburg did for airships...


> If only Transmeta could have held on a little longer, or did things
> differently, like opening up the internal instruction set so that
> hackers and compiler writers could have targeted more optimal GCC code
> generation to their cores... they might still be around with huge sales
> in Android devices right now, and VLIW research could have got the
> injection of resources it needed to gain the performance needed to stay
> competitive at least.
>
> Although Nvidia isn't exactly setting the world on fire in CPU sales
> either...
>

Yeah, quite possible, it could have been interesting.

Emulation is one of those areas where one is almost invariably going to
take a loss; so if targeting the underlying ISA, it could maybe have
been more competitive.

Maybe also not try to set oneself up as "compete with Intel or bust".


>
>> Most of the other people I know of, are doing RISC variants (and/or
>> RISC-V implementations).
>>
>
> RISC is pretty much the text-book common denominator these days.
>

Pretty much.


> I kinda hope VLIW makes a mainstream comeback somehow; the current
> CISC/RISC duopoly doesn't seem particularly healthy for the long term.
>
> Maybe massive machine learning trained compilers can make a dent in the
> software side of the VLIW equation?
>

Not sure here.


In my case, it is kinda lots of fiddly details.

I recently got things a little better, by fiddling a fair bit with the
logic for shuffling instructions around. It tries to reduce interlocking
and improve cases for bundling.

Then, I ended up needing to add in logic to limit how much shuffling it
does, and to try to hash and cache the results of intermediate
comparisons, mostly as the "more advanced" shuffling cases were starting
to result in the process taking an unreasonably long time.

Part of the issue seems to be that the shuffling process only takes a
limited window into account at a time (which I expanded from 3 to 7
instructions), and at each point the locally cheapest choice isn't
necessarily the globally cheapest one.

The only real alternative would be to evaluate the entire basic block
for each possible swapping decision.

Though, this only really happens in a minority of cases (most cases
don't have quite so much "instruction mobility").
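
FWIW, the overall shape is roughly like the following (a schematic
sketch in C, not the actual BGBCC code; the cost model, hash, and
structs are stand-ins):

#define WIN 7   /* window size */

typedef struct {
    int id;                /* unique id per instruction form */
    int dst, src1, src2;   /* register operands              */
    int is_load;
} op_t;

/* register dependences only; memory deps are handled separately
   via the aliasing checks */
static int deps(const op_t *a, const op_t *b)
{
    return b->src1 == a->dst || b->src2 == a->dst ||
           b->dst  == a->dst ||
           a->src1 == b->dst || a->src2 == b->dst;
}

/* stand-in cost model: prefer hoisting loads earlier */
static int cheaper_swapped(const op_t *a, const op_t *b)
{
    return b->is_load && !a->is_load;
}

/* called over successive windows (n <= WIN) of a basic block;
   bubbles movable ops, caching pairwise decisions by hash
   (a real version would store full keys to avoid collisions) */
static void shuffle_window(op_t *w, int n)
{
    static signed char memo[1024];   /* 0=unknown, 1=swap, 2=keep */
    for (int pass = 0; pass < n; pass++)
        for (int i = 0; i + 1 < n; i++) {
            if (deps(&w[i], &w[i + 1]))
                continue;            /* not movable past each other */
            int k = (w[i].id * 31 + w[i + 1].id) & 1023;
            if (!memo[k])
                memo[k] = cheaper_swapped(&w[i], &w[i + 1]) ? 1 : 2;
            if (memo[k] == 1) {
                op_t t = w[i]; w[i] = w[i + 1]; w[i + 1] = t;
            }
        }
}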



Some gains here were due to adding heuristics to infer "non-aliasing"
memory accesses, say:
Same base register but non-overlapping displacements;
SP and non-SP in some combinations;
SP and GBR (Stack and Globals are never the same memory);
...

But, not much can be done with indexed loads/stores here, since there
is little basis for inferring potential overlap or non-overlap between
those cases.


Have also made the inference that SP- or GBR-based loads/stores can't
alias with indexed loads/stores. While a little hand-wavy, this is
"probably true".


Though, jumbo-form instructions and instructions with relocations
(typically the same instruction) are classified as "immovable" (though
the window is now large enough that it can shuffle "around" these
instructions in many cases).


>
>> Looks like pretty much no one is bothering with soft-core processors
>> for IA-64.
>
> I'm thinking that's probably for the best. ;-)
>

Probably true, after thinking more on it.


>
>>> But then with VLIW the hardware is just half the battle, you still
>>> need to program it so that it runs at near peak performance, which I
>>> take it is the harder part, again YMMV.
>>>
>>
>> Yeah...
>>
>> With my existing ISA, my C compiler gets nowhere near the full speed
>> of what is possible. Can do a little better by writing hand-optimized
>> ASM, but this doesn't scale very well.
>
>
> Seems to me that is the nub of the issue: if your WEX hardware is pretty
> much working as intended, then getting decent code generation out of your
> compiler might be the best bang for your buck.
>

Yeah, on the hardware side, WEX is fairly straightforward.
The CPU doesn't need to figure anything out, just see the bits and go.

Compiler is a whole different matter though, but recently I have been
making at least a little progress here.



Did get a bit of a speedup in some cases by switching the L2 back to
being direct-mapped.

Doom and ROTT seem to get a little faster.
Quake and Hexen aren't really affected either way.


Got a little bit of a speedup in Heretic, but this was mostly after
recompiling it (seemed to benefit from some of my more recent work on my
compiler).

Heretic went from 8 to 10 fps (before recompile) to 12 to 16 fps post
recompile.

Hexen was unaffected by the recompile, still mostly stuck at around 8 fps.

Recompile got ROTT from around 10 to 12 fps, up to around 16 to 20.

Doom is now semi-consistently pulling off upwards of 20 fps.


Weirdly, the gains from switching the L2 back to DM only seem to really
come into effect "after" recompiling code with the improved shuffling
and bundling. Not sure why this would be.



Though, curiously, despite the better numbers and the theoretical
reduction in interlocks (since the compiler is optimizing to try to
avoid them), there would also appear to be a relative increase in the
number of clock cycles spent on interlock stalls (odd).

Then again, it could be due to a reduction in the number of non-bundled
instructions, which reduces the number of instructions sitting around to
"absorb" these clock cycles.


> I'm sure there's still plenty of research papers to be read on the
> subject, and if you happen to invent some new way to efficiently pack
> many operations into a string of wide words, well, fortune and glory and
> that jazz could be yours for the taking...
>
> Or possibly the 8000lb gorilla will stomp on your neck and steal your
> lunch money just like it did to Transmeta...
>
> Too early to tell, I guess... :-)
>

If anyone uses my stuff, it is probably a win.

If they want to pay me to keep working on it, that is better...