Why 32-bit?

Terje Slettebø

Jun 27, 2008, 3:54:59 AM

(Apologies if this has been discussed before - I haven't found it in
the archives)

Hi.

I've spent quite a lot of time now updating the extASM assembler
(http://home.broadpark.no/~terjesl/eng/extasm.html - I'll submit it to
the RISC OS file archives when the conversion is done) to become 26/32-bit
neutral, and I'm questioning the advantage of a 32-bit only
processor (like the XScale). To me, at least, it seems that the
disadvantages outweigh the advantages, but I may have a different
viewpoint than most in this respect, as I program in ARM assembly
(extASM is written in assembly, itself, as it was a necessity to
achieve acceptable performance at the time of the 8 MHz ARM2
processor).

The disadvantage I speak of is the following: One of the design
principles of the ARM processor is that every instruction can be
conditionally executed, and on the 26-bit architecture, there was no
performance loss or additional complexity involved for using it also
for function calls (BL), which is a common occurrence. For example,
you could write code like the following:

CMP R0,#<value>
BLLT less_than
BLEQ equal
BLGT greater_than

This makes the code succinct, elegant and fast, as you can do
quite complicated conditionals and operations without needing a branch
(and the associated pipeline flushes).
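
For reference, a minimal sketch of why this cost nothing on the 26-bit architecture: R15 holds the PSR flags in its top bits alongside the PC, so BL copies them into R14 together with the return address, and the return instruction can put them back. The routine body is elided, and less_than is simply the label from the example above.

```asm
.less_than
...              ; body of the routine; it may freely change N, Z, C and V
MOVS PC,R14      ; 26-bit return: restores the PC and the caller's flags from R14
```

In a 32-bit mode the flags live in the separate CPSR, so R14 carries only the return address.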

However, using the recommended coding practice for 32-bit only ARM
processors (like the XScale), where you forego the preservation of
flags across function calls, you have to write it something like this:

CMP R0,#<value>
BLT less_than_branch
BGT greater_than_branch
BL equal
B over
.less_than_branch
BL less_than
B over
.greater_than_branch
BL greater_than
.over

Back to the bad old days of the x86 processor (and most others, in
this respect)... Much more verbose and less elegant, and also slower,
due to all the branches and pipeline flushes.

Alternatively, you may write some code to save and restore the flags
in functions:

.equal
MRS R12,CPSR
...
MSR CPSR,R12
MOV R15,R14

However, this uses an extra register, takes longer time (than without
it), or makes even more complex code if you need to use LDM/STM for
saving/restoring several registers.
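
A minimal sketch of the LDM/STM variant, assuming a full descending stack on R13 and that R4 is free to hold the saved flags; the register choice is illustrative only, and the routine body is elided as above. The _f suffix is standard ARM syntax for writing just the flags field, which some older assemblers spell CPSR_flg.

```asm
.equal
STMFD R13!,{R4,R14}   ; save a work register and the return address
MRS R4,CPSR           ; keep a copy of the caller's flags in R4
...                   ; body of the routine; it may change the flags but not R4
MSR CPSR_f,R4         ; put back only the flags field (where N, Z, C and V live)
LDMFD R13!,{R4,PC}    ; restore R4 and return; no ^, so the PSR is untouched
```

Compared with a routine that does not preserve flags, this adds the MRS and the MSR to every call.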

My question is: Why?

To set the context: The ARM processor (core) is designed for
_embedded_ systems (at least, that's its main application area today),
which are typically resource constrained. How on earth would you
need more than 64 MB application code on such devices? I'm not talking
about the available data space - that has always been 4 GB, all the
way from ARM2. I'm talking about more than 64 MB _code_.

Frankly, even used in a desktop computer setting (like the Iyonix),
I'd think it's rather rare to use more than 64 MB for applications
(again, I'm only talking about the program code, not its data) -
however, I'd like to hear if someone else has a different experience:
I'd like to know that the above perceived added clumsiness to an
otherwise elegant processor has an important reason (it had better
have, given the above).

Furthermore: Even with a 32-bit program counter, how are you going to use
it? Say you have more than 64 MB application code, and you want to
call a function defined above the 64 MB point: How are you going to
call it? The B/BL instructions only handle +/- 64 MB, as before...

Your thoughts, please.

Regards,

Terje

Rob Kendrick

Jun 27, 2008, 4:23:07 AM

On Fri, 27 Jun 2008 00:54:59 -0700, Terje Slettebø wrote:

> To set the context: The ARM processor (core) is designed for _embedded_
> systems (at least, that's its main application area today), which are
> typically resource constrained. How on earth would you need more than
> 64 MB application code on such devices? I'm not talking about the
> available data space - that has always been 4 GB, all the way from ARM2.
> I'm talking about more than 64 MB _code_.

Easily, and often. I speak from first-hand experience, as well as from
dealing with customers who also have this need.

And anyway, nobody said that anything ARM Ltd. has ever done has been
sane, elegant, and pleasant. In fact, it's only been very recently that
they've managed to design a CPU that worked and people wanted -
previously they were just things Acorn designed with go-faster stripes.

B.

Terje Slettebø

Jun 27, 2008, 4:47:49 AM

On 27 Jun, 10:23, Rob Kendrick <n...@rjek.com> wrote:
> On Fri, 27 Jun 2008 00:54:59 -0700, Terje Slettebø wrote:
> > To set the context: The ARM processor (core) is designed for _embedded_
> > systems (at least, that's its main application area today), which are
> > typically resource constrained. How on earth would you need more than
> > 64 MB application code on such devices? I'm not talking about the
> > available data space - that has always been 4 GB, all the way from ARM2.
> > I'm talking about more than 64 MB _code_.
>
> Easily, and often. I speak from first-hand experience, as well as from
> dealing with customers who also have this need.

Ok.

> And anyway, nobody said that anything ARM Ltd. has ever done has been
> sane, elegant, and pleasant.

Well, I do. The ARM processor has always been an elegant processor, in
terms of the programmer's model (which also translates to code density
and speed):

- 16 32-bit general purpose registers (when e.g. the x86 has
something like 4+ general-purpose registers, and for quite a while
they weren't even completely general-purpose)
- The PC may be used as any other register
- Three-register architecture (so you may add two numbers and put the
result in a third register)
- You may perform a shift and another operation in the same instruction
- Conditional execution for every instruction
etc.

The ARM was way ahead of its time, and in several aspects still is.

> In fact, it's only been very recently that
> they've managed to design a CPU that worked and people wanted -
> previously they were just things Acorn designed with go-faster stripes.

I don't understand what you're talking about, here. What was the
problem with the earlier ARM generations?

Note that I'm talking about the core here, which was the topic of this
posting, not the surrounding infrastructure, like caches, etc. That's
a separate issue, and shouldn't be involved when evaluating the core
design.

However, that's another thing: From what I can tell, there's nothing
in the ARM architecture that makes it only suitable for the low-power,
low-performance uses it's mainly put to today; on the contrary. In fact,
when it came, it was faster than the competition at the time (386/486).
Unfortunately, that's no longer the case: As the x86 became the
"standard" PC processor, huge amounts of cash have been poured into its
development, enabling them to work around limitations in its original
design (even if it means huge increases in complexity and power use),
so now it's much faster than the typical ARM processors, despite an
inferior internal design.

What I'd like to have is a high-performance ARM-based desktop
computer, but seeing that ARM Ltd. focuses exclusively on the embedded
market, that seems even less likely than an ARM/RISC OS-powered
laptop...

Rob Kendrick

Jun 27, 2008, 4:57:14 AM

On Fri, 27 Jun 2008 01:47:49 -0700, Terje Slettebø wrote:

>> And anyway, nobody said that anything ARM Ltd. has ever done has been
>> sane, elegant, and pleasant.
>
> Well, I do. The ARM processor has always been an elegant processor, in
> terms of the programmer's model (which also translates to code density
> and speed):

No, it *was* an elegant processor. Hacks such as Thumb and
Jazelle, the sheer ugliness of the VFP, the DSP extensions, the
pointlessness of things such as "CLZ", the hideous complexity introduced
with the 32 bit support, etc etc etc all add to this.

> I don't understand what you're talking about, here. What was the problem
> with the earlier ARM generations?

ARM 8 was the first CPU they designed. It didn't work so well, and was
rampantly unpopular. (Not helped by the fact that the StrongARM was four
times the speed for the same power consumption.)

ARM 9 was the first CPU that they actually designed from the ground up
that worked, and people actually wanted.

> However, that's another thing: From what I can tell, there's nothing in
> the ARM architecture that makes it only suitable for the low-power,
> low-performance uses it's mainly put to today; on the contrary. In fact,
> when it came, it was faster than the competition at the time (386/486).

It was faster than 386s and 486s because as a simple CPU it's faster.
However, modern x86 CPUs employ tricks that are impossible to do usefully
in ARM (such as branch prediction) or techniques that are very power
hungry (massive super-scalar) to make them utterly whip ARMs. ARM will
never conquer the desktop PC market unless it can /emulate/ current PCs
faster than the real thing. The gap is much too wide for that to happen
now - which is why they don't even bother attempting. In fact, they
won't have much joy even trying to reach MIPS's or PowerPC's performance
for very similar reasons.

B.

Ben Avison

Jun 27, 2008, 5:18:28 AM

On Fri, 27 Jun 2008, Terje Slettebø <tsle...@gmail.com> wrote:
> The disadvantage I speak of is the following: One of the design
> principles of the ARM processor is that every instruction can be
> conditionally executed, and on the 26-bit architecture, there was no
> performance loss or additional complexity involved for using it also
> for function calls (BL), which is a common occurrence. For example,
> you could write code like the following:
>
> CMP R0,#<value>
> BLLT less_than
> BLEQ equal
> BLGT greater_than

One of the reasons, I suspect, is that compilers are very poor at
generating code that operates on multiple condition codes. Code like

CMP R0, #3
TEQCS R0, #10
BLLS thing ; called if r0 was 0, 1, 2 or 10

is even harder for compilers.

Note that you can avoid adding branches to your example without faffing
around with flag preservation in the subroutines by doing:

CMP R0, #<value>
ADRLT R1, less_than
ADREQ R1, equal
ADRGT R1, greater_than
MOV LR, PC
MOV PC, R1

or if you're happy to target v5 architectures only (eg XScale) then you
don't need to corrupt R1:

CMP R0, #<value>
ADRLT R14, less_than
ADREQ R14, equal
ADRGT R14, greater_than
BLX R14

> My question is: Why?

Here are some suggestions:

* I imagine there are many OSes which unlike RISC OS (which was designed
for 26-bit ARMs) don't have all code in the bottom 64MB of address space.

* Some ARM chips don't have an MMU. Without a 32-bit PC, you'd be placing
constraints on the location of RAM/ROM in the physical address space.

Having said that, I expect pressure on the PSR may have been more immediate
than pressure on the PC. With 32-bit PSRs, you get

* extra processor modes, ABT, UND and SYS. Even RISC OS has used these
32-bit modes for abort and undefined instruction handling since RISC OS
3.5. Without them, you couldn't use the FPEmulator or lazy-task-swapped
pages in SVC mode without corrupting R14_svc, which again would be a major
headache especially for compilers. SYS mode is invaluable for kernel
threading schemes.

* the Q bit, needed for saturated arithmetic instructions that are
invaluable for codecs.

* T and J bits, enabling additional instruction sets. Thumb is used in
embedded environments where code density is critical, and Jazelle for
accelerating Java.

And there are several more added in ARMv6, though no existing RISC OS
platforms use these yet.

Ben

Gavin Wraith

Jun 27, 2008, 5:30:20 AM

In message <f3c2183e-c94d-4c4f...@k13g2000hse.googlegroups.com>
Terje Slettebø <tsle...@gmail.com> wrote:

Good to hear from you Terje. I always admired extASM.

> The disadvantage I speak of is the following: One of the design
> principles of the ARM processor is that every instruction can be
> conditionally executed,

> .......... [snip] ............

> Back to the bad old days of the x86 processor (and most others, in
> this respect)... Much more verbose and less elegant, and also slower,
> due to all the branches and pipeline flushes.

> .......... [snip] ............

> My question is: Why?
>
> To set the context: The ARM processor (core) is designed for
> _embedded_ systems (at least, that's its main application area today),
> which are typically resource constrained. How on earth would you
> need more than 64 MB application code on such devices? I'm not talking
> about the available data space - that has always been 4 GB, all the
> way from ARM2. I'm talking about more than 64 MB _code_.
>
> Frankly, even used in a desktop computer setting (like the Iyonix),
> I'd think it's rather rare to use more than 64 MB for applications
> (again, I'm only talking about the program code, not its data) -
> however, I'd like to hear if someone else has a different experience:
> I'd like to know that the above perceived added clumsiness to an
> otherwise elegant processor has an important reason (it had better
> have, given the above).
>
> Furthermore: Even with a 32-bit program counter, how are you going to use
> it? Say you have more than 64 MB application code, and you want to
> call a function defined above the 64 MB point: How are you going to
> call it? The B/BL instructions only handle +/- 64 MB, as before...
>
> Your thoughts, please.

I think you have to address these questions to those who developed
ARM. Why did they chuck out the clever 26-bit wheeze? I doubt whether
RISC OS figured anywhere in their plans. As a matter of interest,
were there ever any ARM CPUs used in embedded devices that ran in 26-bit
mode?

--
Gavin Wraith (ga...@wra1th.plus.com)
Home page: http://www.wra1th.plus.com/

Terje Slettebø

Jun 27, 2008, 6:03:50 AM

On 27 Jun, 10:57, Rob Kendrick <n...@rjek.com> wrote:
> On Fri, 27 Jun 2008 01:47:49 -0700, Terje Slettebø wrote:
> >> And anyway, nobody said that anything ARM Ltd. has ever done has been
> >> sane, elegant, and pleasant.
>
> > Well, I do. The ARM processor has always been an elegant processor, in
> > terms of the programmer's model (which also translates to code density
> > and speed):
>
> No, it *was* an elegant processor. Hacks such as Thumb and
> Jazelle, the sheer ugliness of the VFP, the DSP extensions, the
> pointlessness of things such as "CLZ", the hideous complexity introduced
> with the 32 bit support, etc etc etc all add to this.

Ah, then we are in agreement after all. I was thinking of Acorn when
you said ARM Ltd, as I thought ARM Ltd was formed and involved in the
design earlier than this.

I also see you're well versed in the design. Yes, all the stuff you
mention I guess is especially geared towards embedded design, with
little or no use in a desktop computer processor.

Just to make it absolutely clear: I have no intention of implementing
support in extASM for things like the Thumb instruction set, for
which I have no use, and which would hardly be of use for other RISC
OS users, either.

> > I don't understand what you're talking about, here. What was the problem
> > with the earlier ARM generations?
>
> ARM 8 was the first CPU they designed. It didn't work so well, and was
> rampantly unpopular. (Not helped by the fact that the StrongARM was four
> times the speed for the same power consumption.)
>
> ARM 9 was the first CPU that they actually designed from the ground up
> that worked, and people actually wanted.

Then I understand your argument. :)

> > However, that's another thing: From what I can tell, there's nothing in
> > the ARM architecture that makes it only suitable for the low-power,
> > low-performance uses it's mainly put to today; on the contrary. In fact,
> > when it came, it was faster than the competition at the time (386/486).
>
> It was faster than 386s and 486s because as a simple CPU it's faster.
> However, modern x86 CPUs employ tricks that are impossible to do usefully
> in ARM (such as branch prediction)

Why is that impossible to do usefully on the ARM?

> or techniques that are very power
> hungry (massive super-scalar) to make them utterly whip ARMs. ARM will
> never conquer the desktop PC market unless it can /emulate/ current PCs
> faster than the real thing. The gap is much too wide for that to happen
> now - which is why they don't even bother attempting. In fact, they
> won't have much joy even trying to reach MIPS's or PowerPC's performance
> for very similar reasons.

Yep, with x86 having become the dominant architecture on the desktop,
you'd need to perform better than them, and I guess once the IBM PC
became the "standard", neither Acorn nor ARM even attempted to try to
win that market... Which I guess is reasonable given the resources
available.

To their credit, they're doing very well in the embedded world,
though, but again, I almost dare not think how the ARM could have
fared, had just as many resources been poured into making it more
powerful, as has been poured into the x86 processors...

Rob Kendrick

Jun 27, 2008, 6:14:52 AM

On Fri, 27 Jun 2008 03:03:50 -0700, Terje Slettebø wrote:

>> It was faster than 386s and 486s because as a simple CPU it's faster.
>> However, modern x86 CPUs employ tricks that are impossible to do
>> usefully in ARM (such as branch prediction)
>
> Why is that impossible to do usefully on the ARM?

Conditional execution on so many instructions makes it difficult (you
essentially have many many more things to consider to make accurate
predictions.)

The complexity involved in calculating instruction dependencies is also
why no super-scalar ARM exists, because the circuitry to just decide what
could safely be executed in parallel would most likely be larger than the
rest of the CPU.

B.

Terje Slettebø

Jun 27, 2008, 5:26:47 PM

On Jun 27, 12:14 pm, Rob Kendrick <n...@rjek.com> wrote:
> On Fri, 27 Jun 2008 03:03:50 -0700, Terje Slettebø wrote:
> >> It was faster than 386s and 486s because as a simple CPU it's faster.
> >> However, modern x86 CPUs employ tricks that are impossible to do
> >> usefully in ARM (such as branch prediction)
>
> > Why is that impossible to do usefully on the ARM?
>
> Conditional execution on so many instructions makes it difficult (you
> essentially have many many more things to consider to make accurate
> predictions.)

I thought branch prediction was mainly (or completely) determined by
statistics (i.e. the path most used recently is the one predicted).
However, I see from http://en.wikipedia.org/wiki/Branch_prediction
that there are several schemes for this.

That's another thing I'm looking forward to: Learning more about
microprocessor design, and the ARM-based ones in particular. With the
x86 processors, I just haven't felt like working at a lower level than
C.

Anyway, the same feature (conditionally executed instructions) also
eliminates many branches, and with it the possibility of branch
misprediction and associated pipeline flushing, so this might more
than compensate for the difficulty of doing branch prediction [1].

Alternatively, one may always preserve flags in subroutines, but this
gives an overhead for all code using them, not just the ones needing
flags preservation, and for small functions in tight loops, that
overhead may be significant.

> The complexity involved in calculating instruction dependencies is also
> why no super-scalar ARM exists, because the circuitry to just decide what
> could safely be executed in parallel would most likely be larger than the
> rest of the CPU.

I did a search for this, and interestingly, apparently there does
exist a superscalar ARM processor, the ARM Cortex-A8
(http://www.arm.com/products/CPUs/ARM_Cortex-A8.html). It's "only"
dual-issue, though.

[1] As if to add insult to injury regarding losing conditional
execution for BL instructions, pipeline stalls are likely more
expensive with the longer pipeline of the XScale processor.

Terje Slettebø

Jun 27, 2008, 5:37:08 PM

On Jun 27, 11:18 am, "Ben Avison" <usenetspam...@avison.me.uk> wrote:

> On Fri, 27 Jun 2008, Terje Slettebø <tslett...@gmail.com> wrote:
> > The disadvantage I speak of is the following: One of the design
> > principles of the ARM processor is that every instruction can be
> > conditionally executed, and on the 26-bit architecture, there was no
> > performance loss or additional complexity involved for using it also
> > for function calls (BL), which is a common occurrence. For example,
> > you could write code like the following:
>
> > CMP R0,#<value>
> > BLLT less_than
> > BLEQ equal
> > BLGT greater_than
>
> One of the reasons, I suspect, is that compilers are very poor at
> generating code that operates on multiple condition codes. Code like
>
> CMP R0, #3
> TEQCS R0, #10
> BLLS thing ; called if r0 was 0, 1, 2 or 10
>
> is even harder for compilers.

Yes, that may be.

> Note that you can avoid adding branches to your example without faffing
> around with flag preservation in the subroutines by doing:
>
> CMP R0, #<value>
> ADRLT R1, less_than
> ADREQ R1, equal
> ADRGT R1, greater_than
> MOV LR, PC
> MOV PC, R1

Interesting technique. However, it relies on the subroutines being
reasonably close for the ADR instructions (or they may need to be
"expanded", potentially up to four instructions, which could negate
any advantage compared to the alternatives).
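
One way around the range limit, sketched here as an assumption rather than anything taken from the code above: keep the three addresses in a nearby table of words and load the chosen one with a conditional LDR. LDR leaves the flags alone, so the conditions set by CMP stay valid, and each load is a single instruction however far away the routine is. The *_ptr labels are made up for this example, and DCD is the ObjAsm spelling of the word directive (EQUD in the BASIC assembler).

```asm
CMP R0,#<value>
LDRLT R1,less_than_ptr      ; pick up the full 32-bit address of the routine
LDREQ R1,equal_ptr
LDRGT R1,greater_than_ptr
MOV R14,PC                  ; PC reads as this instruction + 8: the return point
MOV PC,R1                   ; call the selected routine
...                         ; execution continues here after the routine returns

.less_than_ptr              ; address table; keep it within 4 KB of the LDRs
DCD less_than               ; and somewhere execution cannot run into it
.equal_ptr
DCD equal
.greater_than_ptr
DCD greater_than
```

The price is a memory read for the address, which may or may not beat an expanded ADR depending on the processor and its cache.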

> > My question is: Why?
>
> Here are some suggestions:
>
> * I imagine there are many OSes which unlike RISC OS (which was designed
> for 26-bit ARMs) don't have all code in the bottom 64MB of address space.
>
> * Some ARM chips don't have an MMU. Without a 32-bit PC, you'd be placing
> constraints on the location of RAM/ROM in the physical address space.
>
> Having said that, I expect pressure on the PSR may have been more immediate
> than pressure on the PC. With 32-bit PSRs, you get
>
> * extra processor modes, ABT, UND and SYS. Even RISC OS has used these
> 32-bit modes for abort and undefined instruction handling since RISC OS
> 3.5. Without them, you couldn't use the FPEmulator or lazy-task-swapped
> pages in SVC mode without corrupting R14_svc, which again would be a major
> headache especially for compilers. SYS mode is invaluable for kernel
> threading schemes.
>
> * the Q bit, needed for saturated arithmetic instructions that are
> invaluable for codecs.
>
> * T and J bits, enabling additional instruction sets. Thumb is used in
> embedded environments where code density is critical, and Jazelle for
> accelerating Java.
>
> And there are several more added in ARMv6, though no existing RISC OS
> platforms use these yet.

Thanks for the points.

Rob Kendrick

Jun 27, 2008, 6:04:20 PM

On Fri, 27 Jun 2008 14:26:47 -0700, Terje Slettebø wrote:

> I did a search for this, and interestingly, apparently there does exist
> a superscalar ARM processor, the ARM Cortex-A8
> (http://www.arm.com/products/CPUs/ARM_Cortex-A8.html). It's "only"
> dual-issue, though.

My understanding is that it uses spare parts of the CPU to process the
next instruction when stalled on memory, and the memory-stall-causing
instruction doesn't share any registers with it. So it's pretty
simplistic.

B.

druck

Jun 27, 2008, 6:28:14 PM

On 27 Jun 2008 Terje Slettebø <tsle...@gmail.com> wrote:
> The disadvantage I speak of is the following: One of the design
> principles of the ARM processor is that every instruction can be
> conditionally executed, and on the 26-bit architecture, there was no
> performance loss or additional complexity involved for using it also
> for function calls (BL), which is a common occurrence.

Yes we all loved that when we hand wrote whole apps in assembler,
because we were young and foolish.

[snip]

<32bit flag preserving code>

> However, this uses an extra register, takes longer time (than without
> it), or makes even more complex code if you need to use LDM/STM for
> saving/restoring several registers.

> My question is: Why?

Having converted hundreds of thousands of lines of 26bit assembler,
the truth is that while the flag preserving behaviour was implemented
extensively by both assembler authors and compilers, the actual number
of routines which rely on flag preservation is tiny. Therefore the
extra instruction overhead isn't so much of a problem, particularly if
you let the compiler take care of it.

> To set the context: The ARM processor (core) is designed for
> _embedded_ systems (at least, that's its main application area today),
> which are typically resource constrained. How on earth would you
> need more than 64 MB application code on such devices? I'm not talking
> about the available data space - that has always been 4 GB, all the
> way from ARM2. I'm talking about more than 64 MB _code_.

Because they run OSes which utilise a flat 4GB address space and
require vastly more than 64MB of code to be visible to any one
process.

> Frankly, even used in a desktop computer setting (like the Iyonix),
> I'd think it's rather rare to use more than 64 MB for applications
> (again, I'm only talking about the program code, not its data) -
> however, I'd like to hear if someone else has a different experience:
> I'd like to know that the above perceived added clumsiness to an
> otherwise elegant processor has an important reason (it had better
> have, given the above).

You are still thinking 26bit RISC OS. Code out there in the real world
is much larger and spread over huge shared libraries which consume
large amounts of address space.

> Furthermore: Even with a 32-bit program counter, how are you going to use
> it? Say you have more than 64 MB application code, and you want to
> call a function defined above the 64 MB point: How are you going to
> call it? The B/BL instructions only handle +/- 64 MB, as before...

Indirected jumps via LDR PC as the SCL uses.
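
A minimal sketch of such a call, assuming the target's address is held in a word within 4 KB of the call site; far_routine and far_routine_ptr are hypothetical names, and this is not taken from the SCL itself.

```asm
.caller
STMFD R13!,{R14}           ; keep our own return address
MOV R14,PC                 ; return point: the instruction after the LDR
LDR PC,far_routine_ptr     ; load a full 32-bit address straight into the PC
LDMFD R13!,{PC}            ; far_routine returns here; now return ourselves

.far_routine_ptr
DCD far_routine            ; word directive (EQUD in BASIC); any 32-bit address
```

Because the address comes from memory rather than from an offset in the instruction, the reach of B/BL does not matter.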

---druck

--
The ARM Club Free Software - http://www.armclub.org.uk/free/
The 32bit Conversions Page - http://www.quantumsoft.co.uk/druck/

druck

Jun 27, 2008, 6:15:42 PM

On 27 Jun 2008 Rob Kendrick <nn...@rjek.com> wrote:
> No, it *was* an elegant processor. Hacks such as Thumb and
> Jazelle, the sheer ugliness of the VFP, the DSP extensions, the
> pointlessness of things such as "CLZ", the hideous complexity introduced
> with the 32 bit support, etc etc etc all add to this.

All those are additions after Acorn, the point is the 26bit ARM core
was very elegant.

druck

Jun 27, 2008, 6:42:30 PM

On 27 Jun 2008 Rob Kendrick <nn...@rjek.com> wrote:

> On Fri, 27 Jun 2008 03:03:50 -0700, Terje Slettebø wrote:
>>> It was faster than 386s and 486s because as a simple CPU it's faster.
>>> However, modern x86 CPUs employ tricks that are impossible to do
>>> usefully in ARM (such as branch prediction)

A number of the XScale range have branch prediction, some of the
later ones actually work well enough to justify it.

>> Why is that impossible to do usefully on the ARM?

> Conditional execution on so many instructions makes it difficult (you
> essentially have many many more things to consider to make accurate
> predictions.)

The conditional instructions relieve a number of pipeline stalls, but
the density of larger scale execution changes (loops, subroutines)
still makes it worthwhile.

> The complexity involved in calculating instruction dependencies is also
> why no super-scalar ARM exists, because the circuitry to just decide what
> could safely be executed in parallel would most likely be larger than the
> rest of the CPU.

A number of ARM designs feature the lightweight alternative of super
pipelining, which parallelises different instruction classes while
falling short of full superscalar behavior.

Interestingly, there was some scope to go down the VLIW route with
conditional instructions, but we all know what gave that a bad name.

Rob Kendrick

Jun 27, 2008, 6:45:48 PM

On Fri, 27 Jun 2008 23:15:42 +0100, druck wrote:

> On 27 Jun 2008 Rob Kendrick <nn...@rjek.com> wrote:
>> No, it *was* an elegant processor. Hacks such as Thumb and
>> Jazelle, the sheer ugliness of the VFP, the DSP extensions, the
>> pointlessness of things such as "CLZ", the hideous complexity
>> introduced with the 32 bit support, etc etc etc all add to this.
>
> All those are additions after Acorn, the point is the 26bit ARM core was
> very elegant.

Precisely my point.

B.

Terje Slettebø

Jun 30, 2008, 6:47:50 AM

On 28 Jun, 00:28, druck <n...@druck.freeuk.com> wrote:

> On 27 Jun 2008 Terje Slettebø <tslett...@gmail.com> wrote:
>
> > The disadvantage I speak of is the following: One of the design
> > principles of the ARM processor is that every instruction can be
> > conditionally executed, and on the 26-bit architecture, there was no
> > performance loss or additional complexity involved for using it also
> > for function calls (BL), which is a common occurrence.
>
> Yes we all loved that when we hand wrote whole apps in assembler,
> because we were young and foolish.

Some of us still love that when we have to update said applications to
become 32-bit compatible many years later (i.e. now). :) (and would
not have had to do that if later processors were backwards
compatible).

By the way, this particular application was actually more or less
necessary to write in assembly at the time, if it was to run with
acceptable speed on 8 MHz ARM2 processors, which were common then.

It might have been fast enough using C, but it would likely have ended
up as rather low-level bit-fiddling C, so I doubt you'd have gained that
much by it, and it's questionable whether C compilers at the time
optimised well enough to allow it to be written in a more high-level
language.

Nevertheless, it was originally written in ARM assembly by Eivind
Hagen, so that's what I built on (adding FP instructions, FP and
string expressions, a reimplemented macro system, and several other
things, also speeding it up a lot), and it made sense at the time.

Writing ARM assembly also gives a certain joy and satisfaction of
being in complete control and being able to do anything, but that's
another thing. :)

> > However, this uses an extra register, takes longer time (than without
> > it), or makes even more complex code if you need to use LDM/STM for
> > saving/restoring several registers.
> > My question is: Why?
>
> Having converted hundreds of thousands of lines of 26bit assembler,
> the truth is that while the flag preserving behaviour was implemented
> extensively by both assembler authors and compilers, the actual number
> of routines which rely on flag preservation is tiny. Therefore the
> extra instruction overhead isn't so much of a problem, particularly if
> you let the compiler take care of it.

Yep. In this case, I did one of the things Castle recommends, which is
not relying on PSR preservation across subroutine calls, not least
to keep it compatible with ARM2/3 (which lack MSR/MRS), and
avoid having to write "dodgy" code to make it work on them. The reason
for this backwards compatibility decision is quite simple: I'm doing
the conversion on an A3010, which, I think, doesn't have MSR/MRS.

I'm just curious: Would you recommend doing PSR-preservation in
subroutines or not in assembly code? If MRS/MSR is relatively cheap,
it could remove a "gotcha" involving not being able to rely on PSR-
preservation for certain instructions.

> > To set the context: The ARM processor (core) is designed for
> > _embedded_ systems (at least, that's its main application area today),
> > which are typically resource constrained. How on earth would you
> > need more than 64 MB application code on such devices? I'm not talking
> > about the available data space - that has always been 4 GB, all the
> > way from ARM2. I'm talking about more than 64 MB _code_.
>
> Because they run OSes which utilise a flat 4GB address space and
> require vastly more than 64MB of code to be visible to any one
> process.

Ok.

> > Frankly, even used in a desktop computer setting (like the Iyonix),
> > I'd think it's rather rare to use more than 64 MB for applications
> > (again, I'm only talking about the program code, not its data) -
> > however, I'd like to hear if someone else has a different experience:
> > I'd like to know that the above perceived added clumsiness to an
> > otherwise elegant processor has an important reason (it better have!
> > Given the above).
>
> You are still thinking 26bit RISC OS. Code out there in the real world
> is much larger and spread over huge shared libraries which consume
> large amounts of address space.

No, I was mostly thinking embedded applications (the main market for
ARM), but I guessed that desktop computers might use larger
applications, which is why I used RISC OS as an example. It may well
be that this is not the case, though, i.e. that embedded applications
may be large, or, as you write below, that one reason may be flat
memory area or lack of virtual/physical memory mapping.

> > Furthermore: Even with 32 bit program counter, how are you going to use
> > it? Say you have more than 64 MB application code, and you want to
> > call a function defined above the 64 MB point: How are you going to
> > call it? The B/BL instructions only handle +/- 64 MB, as before...
>
> Indirected jumps via LDR PC as the SCL uses.

Yep, that's one way.
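
To make the range question concrete, here is a small Python sketch of how a B/BL instruction word is built. The helper name encode_bl is invented for the example. As far as I read the encoding, the offset field is a signed 24-bit count of words relative to the instruction address plus 8, which gives a direct reach of roughly +/- 32 MB; on the 26-bit machines the address wraps within the 64 MB space, so every address can still be reached there.

```python
def encode_bl(source, target, link=True):
    """Encode an unconditional B or BL placed at `source`, going to `target`.

    The instruction holds a signed 24-bit word offset measured from
    source + 8, because the PC runs two instructions ahead.
    """
    offset = target - (source + 8)
    if offset % 4:
        raise ValueError("target must be word aligned relative to source")
    words = offset // 4
    if not -(1 << 23) <= words < (1 << 23):
        raise ValueError("target is out of direct branch range")
    opcode = 0xEB000000 if link else 0xEA000000  # condition field = always
    return opcode | (words & 0x00FFFFFF)

print(hex(encode_bl(0x8000, 0x8000)))  # a BL to itself: 0xebfffffe
```

Anything the function rejects has to go through a register or a literal instead, which is where the LDR PC style of indirect jump mentioned above comes in.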

Terje Slettebø

Jun 30, 2008, 6:57:37 AM

On 28 Jun, 00:42, druck <n...@druck.freeuk.com> wrote:
> A number of ARM designs feature the lightweight alternative of super
> pipelining, which parallelises different instruction classes while
> falling short of full superscalar behavior.
>
> Interestingly there was some scope to go down the VLIW route with
> conditional instructions, but we all know what gave that a bad name.

I guess you're thinking of the Itanium, here? Yes, when I first read
about it many years ago, I found it quite exciting in the way it
resembled ARM in several ways (and RISC in general in some ways), such
as large register bank, conditional execution of instructions, etc.

The VLIW aspect I was rather indifferent about, as I didn't have any
experience with it, but from what I've read, it appears it has been
hard to write compilers that utilise the VLIW-architecture efficiently
(I've also heard about failed attempts at this in the past).

Rob Kendrick

Jun 30, 2008, 7:13:38 AM

On Mon, 30 Jun 2008 03:57:37 -0700, Terje Slettebø wrote:

> The VLIW aspect I was rather indifferent about, as I didn't have any
> experience with it, but from what I've read, it appears it has been hard
> to write compilers that utilise the VLIW-architecture efficiently (I've
> also heard about failed attempts at this in the past).

The whole and single point of VLIW is to move complexity out of the CPU
and into the compiler. Specifically, it is intended to remove the
complex dependency analysis required for super-scalar execution from the
CPU, and have the compiler deal with it by rearranging instructions such
that all the sub-instructions in the very long instructions can always be
executed in parallel safely.

Thus, it is unsurprising people have complained that writing compilers
for them is difficult :)
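
As a rough illustration of the job that lands on the compiler, here is a Python sketch of a very naive bundle packer. Everything in it (the Op tuple, conflicts, pack_bundles, the register names) is made up for the example; it keeps the operations in program order and is deliberately conservative, whereas a real VLIW compiler would also reorder operations to fill the slots.

```python
from collections import namedtuple

# Hypothetical three-address operation: one destination, some sources.
Op = namedtuple("Op", "name dest srcs")

def conflicts(earlier, later):
    """True if `later` may not share a bundle with `earlier`."""
    return (earlier.dest in later.srcs      # read after write
            or later.dest in earlier.srcs   # write after read
            or later.dest == earlier.dest)  # write after write

def pack_bundles(ops, width):
    """Greedy in-order packing of ops into bundles of at most `width`."""
    bundles = []
    current = []
    for op in ops:
        # Start a new bundle when this one is full or a dependency exists.
        if len(current) == width or any(conflicts(e, op) for e in current):
            bundles.append(current)
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles

ops = [
    Op("a", "r1", ("r2", "r3")),
    Op("b", "r4", ("r5", "r6")),
    Op("c", "r7", ("r1", "r4")),  # needs the results of a and b
    Op("d", "r8", ("r2", "r5")),
]
for bundle in pack_bundles(ops, width=3):
    print([op.name for op in bundle])
```

With three slots available the sketch still issues only two operations per bundle, because it never moves d up next to a and b. Finding that kind of reordering, for every program, is the hard part.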

B.

druck

Jun 30, 2008, 6:59:34 PM

On 30 Jun 2008 Terje Slettebø <tsle...@gmail.com> wrote:

> On 28 Jun, 00:42, druck <n...@druck.freeuk.com> wrote:
>> A number of ARM designs feature the lightweight alternative of super
>> pipelining, which parallelises different instruction classes while
>> falling short of full superscalar behavior.
>>
>> Interestingly there was some scope to go down the VLIW route with
>> conditional instructions, but we all know what gave that a bad name.

> I guess you're thinking of the Itanium, here? Yes, when I first read
> about it many years ago, I found it quite exciting in the way it
> resembled ARM in several ways (and RISC in general in some ways), such
> as large register bank, conditional execution of instructions, etc.

Yes, the predicate registers were ARM condition codes on steroids and
would allow all sorts of interesting parallelisation of control
structures with only the results from the correct path being
committed.

However, despite a couple of good features, Intel could not resist
their old ways of shovelling in untold reams of crap, as if it were
the x86 all over again, much to the horror of HP, massively
overcomplicating the architecture and ensuring it would always
struggle for performance, as it still does 20 years later.

> The VLIW aspect I was rather indifferent about, as I didn't have any
> experience with it, but from what I've read, it appears it has been
> hard to write compilers that utilise the VLIW-architecture efficiently
> (I've also heard about failed attempts at this in the past).

Hard, yes, but work done once by the compiler makes far more sense
than doing it on every run via complex superscalar dependency checking
built in to embedded processors. And definitely better than moving to
dual core.

druck

Jun 30, 2008, 6:49:09 PM

On 30 Jun 2008 Terje Slettebø <tsle...@gmail.com> wrote:
> Yep. In this case, I did one of the things Castle recommends, which is
> not relying on PSR preservation across subroutine calls, not least
> to keep it compatible with ARM2/3 (which lack MSR/MRS),
> and to avoid having to write "dodgy" code to make it work on them. The reason
> for this backwards-compatibility decision is quite simple: I'm doing
> the conversion on an A3010, which, I think, doesn't have MSR/MRS.

Well, I don't really see any merit in continuing to support ARM2/3 for
32bit ports. If there is anyone out there still using those
processors, they can continue to use the 26bit versions. It's far, far
easier if you use MSR/MRS.

> I'm just curious: Would you recommend doing PSR-preservation in
> subroutines or not in assembly code? If MRS/MSR is relatively cheap,
> it could remove a "gotcha" involving not being able to rely on PSR-
> preservation for certain instructions.

As a rule I don't flag preserve. There are a very small number of
routines, normally in algorithmic situations, that might require it,
but most of the time flags are used to return error conditions, so
just explicitly setting or clearing V with MSR is all that's needed at
the end of the routine.
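
For anyone who wants to see what that amounts to, here is a small Python model of the flag write. The function names are invented, and the flags field is simplified to the four condition bits N, Z, C and V at the top of the 32-bit CPSR. Note that writing the flags with an immediate replaces all four bits, which is fine precisely because the caller is not relying on them being preserved.

```python
V_FLAG = 1 << 28         # overflow flag position in the 32-bit CPSR
FLAGS_MASK = 0xF0000000  # N, Z, C and V occupy the top four bits

def msr_cpsr_flags(cpsr, immediate):
    """Model MSR CPSR_f,#immediate: replace the flags, keep the rest."""
    return (cpsr & ~FLAGS_MASK & 0xFFFFFFFF) | (immediate & FLAGS_MASK)

def return_status(cpsr, failed):
    """Leave a routine with V set on error and V clear on success."""
    return msr_cpsr_flags(cpsr, V_FLAG if failed else 0)

cpsr = 0x60000010  # Z and C set, user mode
print(hex(return_status(cpsr, failed=True)))   # 0x10000010
print(hex(return_status(cpsr, failed=False)))  # 0x10
```

The caller then only has to test V after the call (BVS or a VS-conditional instruction), which matches the usual RISC OS habit of returning with V set and R0 pointing at an error block.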

> I wrote:-

>> You are still thinking 26bit RISC OS. Code out there in the real world
>> is much larger and spread over huge shared libraries which consume
>> large amounts of address space.

> No, I was mostly thinking embedded applications (the main market for
> ARM), but I guessed that desktop computers might use larger
> applications, which is why I used RISC OS as an example. It may well
> be that this is not the case, though, i.e. that embedded applications
> may be large, or, as you write below, that one reason may be flat
> memory area or lack of virtual/physical memory mapping.

These days many, if not most, embedded applications are far larger than
the whole of RISC OS. A typical smartphone will have an OS 32x the
size of the RISC OS ROM, and require as much RAM as in many Iyonixes.

Terje Slettebø

Jul 1, 2008, 5:39:38 AM

On 1 Jul, 00:49, druck <n...@druck.freeuk.com> wrote:

> On 30 Jun 2008 Terje Slettebø <tslett...@gmail.com> wrote:
>
> > Yep. In this case, I did one of the things Castle recommends, which is
> > not relying on PSR preservation across subroutine calls, not least
> > to keep it compatible with ARM2/3 (which lack MSR/MRS),
> > and to avoid having to write "dodgy" code to make it work on them. The reason
> > for this backwards-compatibility decision is quite simple: I'm doing
> > the conversion on an A3010, which, I think, doesn't have MSR/MRS.
>
> Well I don't really see any merit in continuing to support ARM2/3 for
> 32bit ports.

Well, the thing is that I couldn't do the conversion work unless the
result also ran on ARM2/3: I'm doing the conversion on an A3010, as
mentioned, and otherwise I wouldn't be able to test it as I go.

> If there is anyone out there still using those
> processors, they can continue to use the 26bit versions. It's far, far
> easier if you use MSR/MRS.

Right, and as you also write, I'm going for non-PSR preserving, not
least as I found that the number of subroutine calls actually relying
on PSR preservation was comparatively small.

> >> You are still thinking 26bit RISC OS. Code out there in the real world
> >> is much larger and spread over huge shared libraries which consume
> >> large amounts of address space.
> > No, I was mostly thinking embedded applications (the main market for
> > ARM), but I guessed that desktop computers might use larger
> > applications, which is why I used RISC OS as an example. It may well
> > be that this is not the case, though, i.e. that embedded applications
> > may be large, or, as you write below, that one reason may be flat
> > memory area or lack of virtual/physical memory mapping.
>
> These days many, if not most, embedded applications are far larger than
> the whole of RISC OS. A typical smartphone will have an OS 32x the
> size of the RISC OS ROM, and require as much RAM as in many Iyonixes.

Ah, interesting.

I guess one potentially useful extension to extASM could be to enable
"auto expansion" of B/BL instructions (as it does for ADR and
load/store today, if the target is outside the range), so that one
might write a B/BL over the whole 32-bit address space, without having
to manually rewrite it if it's outside the direct branch range.
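
A minimal sketch of what such an expansion might do, written in Python rather than as extASM source. This is not how extASM works today; assemble_call and the constant names are invented here. It emits a plain BL when the target fits the signed 24-bit word offset, and otherwise a three-word sequence that sets up LR by hand and loads the PC from a literal stored right after it.

```python
ADD_LR_PC_4 = 0xE28FE004     # ADD LR,PC,#4    ; LR = address after the literal
LDR_PC_LITERAL = 0xE51FF004  # LDR PC,[PC,#-4] ; load PC from the next word

def assemble_call(source, target):
    """Return the instruction words for an unconditional call at `source`."""
    if (target - source) % 4:
        raise ValueError("target must be word aligned relative to source")
    words = (target - (source + 8)) // 4
    if -(1 << 23) <= words < (1 << 23):
        return [0xEB000000 | (words & 0x00FFFFFF)]  # ordinary BL
    # Out of range: long call with the 32-bit target as an inline literal.
    return [ADD_LR_PC_4, LDR_PC_LITERAL, target & 0xFFFFFFFF]

print([hex(w) for w in assemble_call(0x8000, 0x8010)])
print([hex(w) for w in assemble_call(0x8000, 0x10000000)])
```

The sketch covers the unconditional case only. A conditional call (BLLT and friends) could not simply put the condition on both instructions, because when the condition fails execution would fall through into the literal word, so the literal would have to live elsewhere or be branched over. The expanded form also grows from one word to three, which the assembler would have to account for when resolving labels.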

Richard Russell

Jul 7, 2008, 6:31:34 PM

On Jun 27, 9:47 am, Terje Slettebø <tslett...@gmail.com> wrote:
> Well, I do. The ARM processor has always been an elegant processor, in
> terms of the programmer's model (which also translates to code density
> and speed):

Whilst, as an unashamed IA-32 (not x86) architecture enthusiast, I am
forced to concede that many of your points are valid, I can't agree
with the "code density" one. The code density of ARM (measured in
bytes, of course) is actually quite poor, because every instruction
uses 4 bytes. Surely that was the very motivation for the
introduction of the Thumb instruction set?

As a comparison, the entire 'BBC BASIC for Windows' interpreter
(including the IA-32 assembler, which I bet is a great deal more
complicated than the ARM assembler in BASIC 5) plus all the language
extensions like structures and dual 40/64 bit floats, is less than 32
Kbytes.

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

Rob Kendrick

Jul 7, 2008, 7:06:53 PM

On Mon, 07 Jul 2008 15:31:34 -0700, Richard Russell wrote:

> Whilst, as an unashamed IA-32 (not x86) architecture enthusiast, I am
> forced to concede that many of your points are valid, I can't agree with
> the "code density" one. The code density of ARM (measured in bytes, of
> course) is actually quite poor, because every instruction uses 4 bytes.
> Surely that was the very motivation for the introduction of the Thumb
> instruction set?

Sure, but for four bytes you get a lot of functionality. Such as an
entire jump table in one instruction, or stacking all your registers in
one instruction, or writing 10x4 bytes of memory out at once, or doing
things like a = b + (c << d), which requires many more instructions on
other architectures. You often also get to use the conditional execution
of instructions such that you don't need branches to jump over code,
too. Thumb just /improves/ matters, although not as much as other
approaches (such as swapping in your code on demand from compressed ROM).
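
To show how much fits into one word, here is a Python sketch that builds the encoding of an ADD with a register-specified shift, i.e. a = b + (c << d) as a single instruction. The function names are made up for the example and the field layout is the standard ARM data-processing format as I remember it; only the ADD case is handled.

```python
SHIFT_TYPES = {"LSL": 0, "LSR": 1, "ASR": 2, "ROR": 3}

def encode_add_shifted(rd, rn, rm, shift, rs, cond=0xE):
    """Encode ADD<cond> Rd,Rn,Rm,<shift> Rs as a single 32-bit word."""
    for reg in (rd, rn, rm, rs):
        if not 0 <= reg <= 15:
            raise ValueError("register numbers run from 0 to 15")
    return (cond << 28                 # condition field (0xE = always)
            | 0x4 << 21                # opcode 0100 = ADD
            | rn << 16                 # first operand register
            | rd << 12                 # destination register
            | rs << 8                  # register holding the shift amount
            | SHIFT_TYPES[shift] << 5  # kind of shift
            | 1 << 4                   # shift amount comes from a register
            | rm)                      # second operand, shifted before the add

def add_shifted(b, c, d):
    """What the LSL form computes on 32-bit values: a = b + (c << d)."""
    return (b + (c << (d & 0xFF))) & 0xFFFFFFFF

# ADD R0,R1,R2,LSL R3
print(hex(encode_add_shifted(0, 1, 2, "LSL", 3)))  # 0xe0810312
print(add_shifted(1, 3, 4))                        # 49
```

The condition, the operation, three registers, the shift kind and the shift register all share the same four bytes, which is the trade being described: a fixed instruction size, but a lot of work per instruction when the code can make use of it.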

> As a comparison, the entire 'BBC BASIC for Windows' interpreter
> (including the IA-32 assembler, which I bet is a great deal more
> complicated than the ARM assembler in BASIC 5) plus all the language
> extensions like structures and dual 40/64 bit floats, is less than 32
> Kbytes.

Is that compressed, though? I suspect there's a lot of room for
improvement in BBC BASIC's footprint, but it's always had performance as
the primary aim.

B.

Richard Russell

Jul 8, 2008, 5:51:10 PM

On Jul 8, 12:06 am, Rob Kendrick <n...@rjek.com> wrote:
> Sure, but for four bytes you get a lot of functionality. Such as an
> entire jump table in one instruction, or stacking all your registers in
> one instruction, or writing 10x4 bytes of memory out at once, or doing
> things like a = b + (c << d)

When you need it, yes. But there's the rub: unless you're very lucky,
only in a small proportion of cases will you be able to perform
multiple operations like this; the majority of instructions will be
simple operations. The situation is probably made worse when compiled
rather than hand-assembled.

> > As a comparison, the entire 'BBC BASIC for Windows' interpreter

> > ... is less than 32 Kbytes.
>
> Is that compressed, though?

Compressed? Certainly not! That's the actual size of the code.

Torben Ægidius Mogensen

Jul 23, 2008, 11:18:03 AM

Richard Russell <ne...@rtrussell.co.uk> writes:

> On Jun 27, 9:47 am, Terje Slettebø <tslett...@gmail.com> wrote:
>> Well, I do. The ARM processor has always been an elegant processor, in
>> terms of the programmer's model (which also translates to code density
>> and speed):
>
> Whilst, as an unashamed IA-32 (not x86) architecture enthusiast, I am
> forced to concede that many of your points are valid, I can't agree
> with the "code density" one. The code density of ARM (measured in
> bytes, of course) is actually quite poor, because every instruction
> uses 4 bytes. Surely that was the very motivation for the
> introduction of the Thumb instruction set?

32-bit ARM has much better code density than most other 32-bit RISC
processors, such as MIPS, PowerPC, SPARC and Alpha, and it is not
significantly different from IA-32. While IA-32 has byte-coded
instructions, you tend to use several bytes for each instruction, both
due to prefix instructions and due to immediate constants (such as
offsets). Since IA-32 has only 8 GP registers, you tend to use stack
offsets more than in ARM, so you end up using about the same total
number of bytes as ARM code.

Thumb and Thumb2 have significantly better code density than plain
32-bit ARM and IA-32.

Torben