Learning ARM assembler...

Gazza

Oct 10, 2009, 8:16:22 PM

I'm interested in cutting my teeth in ASM. Are there any resources out
there for beginners? I've been writing in BASIC for years and I think
I'm ready to move up a gear. I would need something fairly recent. By
recent, I mean aware of the changes required to be 26/32 neutral. I
would probably start off with using the assembler built into BASIC
rather than buying any development tools, so I'd prefer something that
was either neutral in that respect or something that concentrated on
the BASIC assembler.

I know I could move onto C, but with the two competing environments
out there, which one? GCC or Norcroft?

Any help would be appreciated...

Rob Kendrick

Oct 10, 2009, 9:41:39 PM

On Sat, 10 Oct 2009 17:16:22 -0700 (PDT)
Gazza <use...@garethlock.com> wrote:

> I'm interested in cutting my teeth in ASM. Are there any resources out
> there for beginners? I've been writing in BASIC for years and I think
> I'm ready to move up a gear.

<snip>

> I know I could move onto C, but with the two competing environments
> out there, which one? GCC or Norcroft?

I'd certainly recommend C over assembler; it's a skill that can be
applied equally to other computers, and the compiler will generate
better code than you can write, at least until you're very very
experienced at writing the assembler.

And GCC. Castle C/C++ is an expensive extravagance.

B.

John-Mark Bell

Oct 10, 2009, 9:42:57 PM

On Sat, 2009-10-10 at 17:16 -0700, Gazza wrote:
> I'm interested in cutting my teeth in ASM. Are there any resources out
> there for beginners? I've been writing in BASIC for years and I think
> I'm ready to move up a gear. I would need something fairly recent. By
> recent, I mean aware of the changes required to be 26/32 neutral. I
> would probably start off with using the assembler built into BASIC
> rather than buying any development tools, so I'd prefer something that
> was either neutral in that respect or something that concentrated on
> the BASIC assembler.

You could do worse than Peter Cockerell's ARM Assembly Language
Programming: http://www.peter-cockerell.net/aalp/

As for 32bit neutrality, http://www.iyonix.com/32bit/ contains all the
information you need.

> I know I could move onto C, but with the two competing environments
> out there, which one? GCC or Norcroft?

Frankly, it doesn't matter. It's trivial to write code that compiles
correctly with both. Given you don't appear to want to buy development
tools, you're probably best off with GCC, as it's free.

John.

Paul Stewart

Oct 11, 2009, 3:20:30 AM

On Oct 11, 1:16 am, Gazza <use...@garethlock.com> wrote:
> I'm interested in cutting my teeth in ASM. Are there any resources out
> there for beginners? I've been writing in BASIC for years and I think
> I'm ready to move up a gear. I would need something fairly recent. By
> recent, I mean aware of the changes required to be 26/32 neutral. I
> would probably start off with using the assembler built into BASIC
> rather than buying any development tools, so I'd prefer something that
> was either neutral in that respect or something that concentrated on
> the BASIC assembler.
>

The RISC OS Connect site contains some links to software development
resources for Assembler, C and BBC BASIC.
http://www.riscos-connect.org/links/links-devel.asp

Regards
Paul

Matthew Phillips

Oct 11, 2009, 3:23:46 AM

In message <20091011024...@trite.i.flarn.net.i.flarn.net>

I'd certainly agree with that. I used to do ARM assembly language stuff 19
years ago, and when I came back to RISC OS wrote CPMFS in assembly language,
but since then I've largely used C. C is easier to learn and much more
productive to write.

> And GCC. Castle C/C++ is an expensive extravagance.

Go for GCC to start with because it's good and free. That way you can find
out whether you like the language.

The Castle (Acorn/Norcroft) C compiler is good, but probably of most
relevance if you are wanting to compile RISC OS from ROOL's sources or to
maintain existing projects. By all means buy it from ROOL if you want to
support them with a bit of money!

--
Matthew Phillips
Dundee

James Peacock

Oct 11, 2009, 4:34:53 AM

Gazza wrote:

> I know I could move onto C, but with the two competing environments
> out there, which one? GCC or Norcroft?

GCC is free, so you should use that to investigate the language.

Personally, I prefer Norcroft as it compiles things more quickly and
I tend to write things where Norcroft's nice inline assembler syntax
is useful. But if you are learning, none of this will make much of a
difference.

James

Jake Waskett

Oct 11, 2009, 5:54:04 AM

On Sun, 11 Oct 2009 08:23:46 +0100, Matthew Phillips wrote:

> In message <20091011024...@trite.i.flarn.net.i.flarn.net>
> on 11 Oct 2009 Rob Kendrick wrote:
>> I'd certainly recommend C over assembler; it's a skill that can be
>> applied equally to other computers, and the compiler will generate
>> better code than you can write, at least until you're very very
>> experienced at writing the assembler.
>
> I'd certainly agree with that. I used to do ARM assembly language stuff
> 19 years ago, and when I came back to RISC OS wrote CPMFS in assembly
> language, but since then I've largely used C. C is easier to learn and
> much more productive to write.

I'd say it depends on why you want to learn and what you hope to get out
of it. I've been programming in C for 15 years now (yikes!), and I'd
agree that it's certainly more productive than any assembler, but is it as
much fun?

ARM's special. It's unique. Perhaps this is just me; I don't know. It
was the first assembly language that I learned properly (not counting
half-hearted attempts to learn the Z80 while my age was still in single
figures), so perhaps nostalgia has something to do with it. But -
regardless of how biased I may be - I've always found it occupies a
special place in computer languages.

With all assembly languages one gets a feeling of power and control. I
imagine it's like driving a (somewhat temperamental) race car - it's
rather wonderful to know that there's basically nothing between your mind
and the machine. You have complete control, and something approximating
complete understanding of what's going on. And if you're the kind of
person who's fond of micro-optimisations, you can spend many a happy hour
tweaking to eliminate an extra clock cycle. As a teenager, I used to
spend many a happy evening optimising graphics routines.

With most assembly languages - every assembly language other than ARM, as
far as I can tell - this comes at a price: complexity and mess (x86 code
being an extreme example of this). It really is chaos, and it is nearly
impossible for a human being to memorise. Not so for ARM. It's simple.
Frighteningly so, at first, perhaps. (Perhaps this is less true now than
it was in the ARM2 days when I learned [complexity of the ARM has steadily
increased over time; I haven't been able to obtain ARMv7 documentation
yet, but I gather that the trend is continuing].) But you quickly realise
that the simplicity covers everything you need, and the flexible design of
the ARM allows you to combine primitives in surprisingly sophisticated
ways. The instruction set was designed by an experienced and highly
skilled assembly-language programmer, and to be blunt it shows.

There seems to be some kind of received wisdom that, when learning
programming languages, one should start with high level languages and work
downwards, through C, down to assembler. I'm not sure whether I agree
with this. Maybe it is an easier way to learn, but I'm not convinced that
it's better. Does it help you to understand the machine?

Jake

Rob Kendrick

Oct 11, 2009, 6:37:35 AM

On Sun, 11 Oct 2009 09:34:53 +0100
James Peacock <j.peaco...@googlemail.com> wrote:

> Personally, I prefer Norcroft as it compiles things more quickly and
> I tend to write things where Norcroft's nice inline assembler syntax
> is useful. But if you are learning, none of this will make much of a
> difference.

No, GCC's quicker, because you can cross compile from a cheap but
monstrous PC :)

GCC's inline assembler scheme's a bit nasty, but I've always preferred
to use functions instead. Most open source projects favour this
approach, too.

B.

Ron

Oct 11, 2009, 7:32:16 AM

In message <02083a60-2a2c-4ee1...@c3g2000yqd.googlegroups.com>
Gazza <use...@garethlock.com> wrote:

> I'm interested in cutting my teeth in ASM. Are there any resources out
> there for beginners? I've been writing in BASIC for years and I think
> I'm ready to move up a gear. I would need something fairly recent. By
> recent, I mean aware of the changes required to be 26/32 neutral. I
> would probably start off with using the assembler built into BASIC
> rather than buying any development tools, so I'd prefer something that
> was either neutral in that respect or something that concentrated on
> the BASIC assembler.
>

http://www.marvell.com/technologies/what-is-a-cpu/risccomp.jsp
has a small introduction to a CPU compatible with the Iyonix.

There was a large PDF on the ARM site about ARM chips in general.
I should still have it on my machine, if you can't find it.

Maybe the Archive DVD, or there could be an article series available
from Archive.
I'm not sure if any 32-bit assembler articles were done, as I haven't
read Archive since our user group disbanded. Buying the DVD looks like a
good option.

Ron M.

Xavier

Oct 11, 2009, 7:52:37 AM

On 11 oct, 11:54, Jake Waskett <j...@waskett.org> wrote:
> On Sun, 11 Oct 2009 08:23:46 +0100, Matthew Phillips wrote:

> > In message <20091011024139.2fb85...@trite.i.flarn.net.i.flarn.net>

Well Jake.
Your view is exactly mine.
ARM has something more.
Like you I tried in my teenage years to code on a Z80 platform.
Then on a 80286 one.
Not much pleasure with these trials.
With the 68000 machines, a big step towards ease of coding was made.
When the Archimedes came out, coding in assembler became a joyful
experience, which, to me,
is miles ahead of what C can bring to a developer.
Of course I understand that if you want to develop a word processor,
database program etc...
assembly is not the best suited language.
Anything that has to be really fast, is graphics oriented, and
intended to run on slow target machines
(ARM2 or ARM3 systems) has to, IMHO, make use of ARM assembly code.

Chris Johnson

Oct 11, 2009, 1:20:42 PM

In article <5e4b89a...@ron1954.woosh.co.nz>,

Ron <be...@woosh.co.nz> wrote:
> Maybe the Archive DVD, or there could be an article series
> available from Archive. I'm not sure if any 32-bit assembler
> articles were done, as I haven't read Archive since our user group
> disbanded. Buying the DVD looks like a good option.

There have been a number of assembler articles in Archive.

An 8 part series, starting at Vol 1 No10 (July 1988) by Alan Glover.

Another 8 part series, starting in Vol 10 No 1 (Oct 1996) by myself.

A five part series starting in Vol 7 No 9 (Jun 1994) by James Riden.

Seven articles by Harriet Bazley starting Vol 14 No 2 (Nov 2000).

A number of single articles on various aspects of assembler.

If starting out to learn with simple things, then the 32-bit aspect
should not impinge too much as long as you use the correct way to
return from a subroutine. There is stuff on the Iyonix web site
detailing things to watch out for.

There are also three articles by Paul Skirrow on 'Writing for 32-bit
RISC OS'. Part three (Vol 16 No 4 Jan 2003) explains clearly how to
write 26-bit/32-bit neutral code in assembler.

Get the Archive DVD and all this will be at your finger tips at the
click of a mouse. The HTML version has full (live) indexes and cross
references.

Chris...

--
Chris Johnson

Ste (news)

Oct 11, 2009, 4:55:03 PM

In article <0XhAm.27211$X31....@newsfe17.ams2>,

Jake Waskett <ja...@waskett.org> wrote:
> ARM's special. It's unique. Perhaps this is just me; I don't know. It
> was the first assembly language that I learned properly (not counting
> half-hearted attempts to learn the Z80 while my age was still in single
> figures), so perhaps nostalgia has something to do with it. But -
> regardless of how biased I may be - I've always found it occupies a
> special place in computer languages.

If you ever see ARM v7, your dreams will be laid to waste, I'm afraid. Urgh.

Ta,

Steve

--
Steve Revill @ Home
Note: All opinions expressed herein are my own.

Rob Kendrick

Oct 11, 2009, 7:27:24 PM

On Sun, 11 Oct 2009 21:55:03 +0100
"Ste (news)" <st...@revi11.plus.com> wrote:

> In article <0XhAm.27211$X31....@newsfe17.ams2>,
> Jake Waskett <ja...@waskett.org> wrote:
> > ARM's special. It's unique. Perhaps this is just me; I don't
> > know. It was the first assembly language that I learned properly
> > (not counting half-hearted attempts to learn the Z80 while my age
> > was still in single figures), so perhaps nostalgia has something to
> > do with it. But - regardless of how biased I may be - I've always
> > found it occupies a special place in computer languages.
>
> If you ever see ARM v7, your dreams will be laid to waste, I'm
> afraid. Urgh.

ARM v5 and above cannot be sensibly called a RISC machine anymore.
Yes, it's an utter utter mess. And most of the nonsense they've added
is to try to compete with x86.

David Seal (editor of the ARM ARM and credited with the FPE) is said to
have a law: if adding a new feature uses so many transistors as for
them to be better spent on increasing the cache, don't implement it.

B.

Ron

Oct 11, 2009, 9:14:00 PM

In message <50a8a931b7chr...@spamcop.net>
Chris Johnson <chrisjoh...@spamcop.net> wrote:
<snip excellent review>

> Get the Archive DVD and all this will be at your finger tips at the
> click of a mouse. The HTML version has full (live) indexes and cross
> references.
>
> Chris...
>

Thanks, I'll be doing that, the $NZ is at a 20 year high against the
pound, so I'm looking out for great deals.

I'm not a professional programmer, so correct me, but Gazza has been
working on string handling programs and I think this is an area where
Python excels for power and ease of use. There is probably a lot of
cross platform Python scripts available, that could either be run with
little or nothing to change, or converted/applied to Basic.
It is quick to learn, and yet Python and its close cousin Perl are used
in web-site design among many other things.

Ron M.

David Holden

Oct 12, 2009, 2:53:07 AM

On 12-Oct-2009, Ron <be...@woosh.co.nz> wrote:

> Thanks, I'll be doing that, the $NZ is at a 20 year high against the
> pound, so I'm looking out for great deals.

You should also have a look at <http://www.riscworld.co.uk>. The book 'ARM
Assembly Language Programming' by Peter Cockerell was serialised in RISC
World so you could buy the incredibly reasonably priced RISC World DVD or CD
or even read the book online.

--
David Holden - APDL - <http://www.apdl.co.uk>

Rob Kendrick

Oct 12, 2009, 4:48:26 AM

On Sun, 11 Oct 2009 09:54:04 GMT
Jake Waskett <ja...@waskett.org> wrote:

> There seems to be some kind of received wisdom that, when learning
> programming languages, one should start with high level languages and
> work downwards, through C, down to assembler. I'm not sure whether I
> agree with this. Maybe it is an easier way to learn, but I'm not
> convinced that it's better. Does it help you to understand the
> machine?

C helps you understand how computers work; after all, it's simply a
portable assembler :) Assembler helps you understand how the CPU
works, but that's about it; there's more to how the machine works than
that, and neither C nor assembler help.

At least with C you'll get your program finished sooner, as well as it
likely being faster :)

B.

Gavin Wraith

Oct 12, 2009, 5:49:14 AM

In message <8e86d4a...@ron1954.woosh.co.nz>
Ron <be...@woosh.co.nz> wrote:

> I'm not a professional programmer, so correct me, but Gazza has been
> working on string handling programs and I think this is an area where
> Python excels for power and ease of use. There is probably a lot of
> cross platform Python scripts available, that could either be run with
> little or nothing to change, or converted/applied to Basic.
> It is quick to learn, and yet Python and its close cousin Perl are used
> in web-site design among many other things.

These remarks also apply to Lua, which is arguably easier to learn,
runs faster, and is freely available on RISC OS in the form of RiscLua from
http://www.wra1th.plus.com/lua/ . Strings are held in Lua not as plain arrays
but interned in a hash table (each distinct string is stored once), making for
very quick comparison and lookup. The design of
algorithms for string handling depends critically on how strings are
represented in memory, so beware of thinking that algorithms that are
efficient in one language must also be efficient in another. It ain't
necessarily so!
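
To illustrate why the in-memory representation matters, here is a minimal sketch of string interning in C. It is not Lua's actual implementation; the names `intern`, `hash_string` and `TABLE_SIZE` are invented for this example. The idea is that each distinct string is stored once, so two interned strings are equal exactly when their pointers are equal.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 64   /* number of buckets; arbitrary for this sketch */

struct interned {
    struct interned *next;   /* next entry in the same bucket */
    unsigned long hash;      /* cached hash of the text */
    char text[];             /* flexible array member holding the string */
};

static struct interned *table[TABLE_SIZE];

/* Simple multiplicative string hash (djb2 style). */
static unsigned long hash_string(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Return the single stored copy of s, or NULL if allocation fails. */
const char *intern(const char *s)
{
    unsigned long h = hash_string(s);
    struct interned **bucket = &table[h % TABLE_SIZE];

    for (struct interned *e = *bucket; e != NULL; e = e->next)
        if (e->hash == h && strcmp(e->text, s) == 0)
            return e->text;              /* already stored: reuse it */

    size_t len = strlen(s);
    struct interned *e = malloc(sizeof *e + len + 1);
    if (e == NULL)
        return NULL;
    e->next = *bucket;
    e->hash = h;
    memcpy(e->text, s, len + 1);         /* copy including the NUL */
    *bucket = e;
    return e->text;
}
```

Once strings are interned, comparing two of them is a single pointer comparison, whereas building a new string always costs a hash and a lookup. An algorithm that builds many short-lived strings may therefore behave quite differently from one written for plain character arrays.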

--
Gavin Wraith (ga...@wra1th.plus.com)
Home page: http://www.wra1th.plus.com/

Jake Waskett

Oct 12, 2009, 6:04:19 AM

On Mon, 12 Oct 2009 09:48:26 +0100, Rob Kendrick wrote:

> On Sun, 11 Oct 2009 09:54:04 GMT
> Jake Waskett <ja...@waskett.org> wrote:
>
>> There seems to be some kind of received wisdom that, when learning
>> programming languages, one should start with high level languages and
>> work downwards, through C, down to assembler. I'm not sure whether I
>> agree with this. Maybe it is an easier way to learn, but I'm not
>> convinced that it's better. Does it help you to understand the
>> machine?
>
> C helps you understand how computers work; after all, it's simply a
> portable assembler :) Assembler helps you understand how the CPU works,
> but that's about it; there's more to how the machine works than that,
> and neither C nor assembler help.

Apologies, that was a poor choice of words on my part. You're right, of
course: the machine is much more complicated than just the CPU, but I
think that the ability to understand how the CPU itself works is a very
valuable thing. Besides being fun, it can be very useful knowledge when
optimising (or helping a weak compiler to optimise) code in a higher-level
language, too. A startling number of programmers don't even understand
how a compiler translates, say, a "while" loop (for some reason, these
people often seem to be Visual Basic programmers).
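
As a rough illustration of the point about "while" loops: a compiler typically lowers the loop to a test plus a pair of branches. The sketch below writes the same loop twice in C, once normally and once with explicit gotos that mirror that branch structure. The function names are made up for this example, and real compilers often rearrange the test (for instance, moving it to the bottom of the loop).

```c
#include <assert.h>
#include <stddef.h>

/* Sum the first n elements using an ordinary while loop. */
int sum_while(const int *values, size_t n)
{
    assert(values != NULL || n == 0);
    int total = 0;
    size_t i = 0;
    while (i < n) {
        total += values[i];
        i++;
    }
    return total;
}

/* The same loop with the control flow spelled out as branches. */
int sum_goto(const int *values, size_t n)
{
    assert(values != NULL || n == 0);
    int total = 0;
    size_t i = 0;
loop_test:
    if (!(i < n))
        goto loop_done;      /* conditional branch out of the loop */
    total += values[i];
    i++;
    goto loop_test;          /* unconditional branch back to the test */
loop_done:
    return total;
}
```

Both functions compute the same result; the second simply makes visible the compare-and-branch shape that ends up in the generated assembler.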

I've found C quite annoying lately. Depending on the situation, I find
it's either too low-level, and I know I'd be more productive in Python or
C++, or it's too high-level, and I want control at the asm level.

> At least with C you'll get your program finished sooner, as well as it
> likely being faster :)

I think that's usually true, because it takes less effort to use a more
efficient algorithm. Having said that, I reckon given the same algorithm,
a human can easily out-code the best compiler (on ARM, at least; I'm
not so sure about other platforms).

The last time I had the time to try this was a graphics driver on an
XScale gadget. The code had to rotate the display by 90 degrees in
software, while converting from RGB565 to RGB555 (or possibly the other
way around). It was performance critical, and seemed a perfect
opportunity to play around with ARM assembler, so I hand-translated the
code. As I recall, my first attempt was 2.5x as fast as the compiler's
most optimised output. I'm not sure how representative that was, though;
it was a crappy Microsoft compiler that generated hideous code. It would
have been interesting to try GCC and ARM's compiler.
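
For anyone curious what such a routine looks like before hand-optimising, here is a plain C sketch of the kind of loop described. It is not the original driver code: the function names are invented, the conversion direction (RGB565 to RGB555) and the clockwise rotation are assumptions, and it makes no attempt at the cache-friendly access patterns a tuned version would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RGB565 (RRRRRGGG GGGBBBBB) to RGB555 (0RRRRRGG GGGBBBBB):
   shift red and green down one bit, dropping green's lowest bit,
   and keep the five blue bits where they are. */
uint16_t rgb565_to_rgb555(uint16_t p)
{
    return (uint16_t)(((p >> 1) & 0x7FE0u) | (p & 0x001Fu));
}

/* Rotate a width x height image 90 degrees clockwise into dst
   (which becomes height x width), converting each pixel on the way. */
void rotate90_convert(const uint16_t *src, uint16_t *dst,
                      size_t width, size_t height)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            size_t dst_x = height - 1 - y;   /* source row becomes a column */
            size_t dst_y = x;                /* source column becomes a row */
            dst[dst_y * height + dst_x] =
                rgb565_to_rgb555(src[y * width + x]);
        }
    }
}
```

The inner loop is where hand-written assembler would aim to win: several pixels loaded per instruction, the mask constants kept in registers, and the shift folded into the data-processing instructions.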

Jake

Rob Kendrick

Oct 12, 2009, 6:33:36 AM

On Mon, 12 Oct 2009 10:04:19 GMT
Jake Waskett <ja...@waskett.org> wrote:

> > C helps you understand how computers work; after all, it's simply a
> > portable assembler :) Assembler helps you understand how the CPU
> > works, but that's about it; there's more to how the machine works
> > than that, and neither C nor assembler help.
>
> Apologies, that was a poor choice of words on my part. You're right,
> of course: the machine is much more complicated than just the CPU,
> but I think that the ability to understand how the CPU itself works
> is a very valuable thing.

C teaches you how CPUs work. Assembler teaches you how a specific CPU
works.

> I've found C quite annoying lately. Depending on the situation, I
> find it's either too low-level, and I know I'd be more productive in
> Python or C++, or it's too high-level, and I want control at the asm
> level.

C's very rarely too high level to achieve something, unless you're
writing drivers or operating systems, at which point perhaps 1% of
your code has to be written in assembler. (Try looking at Linux or QNX
for nice examples of this.)

If it's not high level enough for you, use a utility library like
GLib. This gives you almost all the features of modern interpreted
languages like Python, but with the added advantage of performance.

> > At least with C you'll get your program finished sooner, as well as
> > it likely being faster :)
>
> I think that's usually true, because it takes less effort to use a
> more efficient algorithm. Having said that, I reckon given the same
> algorithm, a human can easily out-code the best compiler (on ARM,
> at least; I'm not so sure about other platforms).

Possibly, but it'd take them a lot longer. Perhaps more time to do
than would ever be saved by the more efficient version :)

B.

Ron

Oct 12, 2009, 9:33:30 AM

In message <7jg22nF...@mid.individual.net>
"David Holden" <Spa...@apdl.co.uk> wrote:

>
> On 12-Oct-2009, Ron <be...@woosh.co.nz> wrote:
>
> > Thanks, I'll be doing that, the $NZ is at a 20 year high against the
> > pound, so I'm looking out for great deals.
>
> You should also have a look at <http://www.riscworld.co.uk>. The book 'ARM
> Assembly Language Programming' by Peter Cockerell was serialised in RISC
> World so you could buy the incredibly reasonably priced RISC World DVD or CD
> or even read the book online.
>

I had a quick look, looks good, and there is an earlier article on
BASIC driven assembler also. I noticed another bargain mentioned,
RiscCAD for 15ukp. I'll check out your site also. I missed using
Apollonius PDT on the Iyonix, it would have been fast.

Ron M.

Theo Markettos

Oct 12, 2009, 11:39:04 AM

Rob Kendrick <nn...@rjek.com> wrote:
> No, GCC's quicker, because you can cross compile from a cheap but
> monstrous PC :)
>
> GCC's inline assembler scheme's a bit nasty, but I've always preferred
> to use functions instead. Most open source projects favour this
> approach, too.

I'd recommend starting with C. If you're a purely-BASIC programmer you need
to get your head around some of the ways that another language can be
different. C isn't the best language for this, but it's probably easier
than assembler. The basic C language isn't too complex: it's the libraries
where the complexity comes, and you can go as deep into those as you want
(or not).

When you have a feel for C you can also inspect the output for the compiler
to see what sort of assembly code it's producing. If you start with simple
functions you can learn roughly how your C code gets broken down into
instructions. That, combined with an introduction to ARM assembler, will
give you a rundown on how things work. Whether you want to learn the other
details about how you'd write an assembly program or function from scratch
is up to you.
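
A small example of the sort of function worth starting with is sketched below. The function name is invented, and the assembler in the comment is only what an ARM compiler might plausibly emit, not the output of any particular tool. With GCC, the -S option stops after compilation and writes the assembly to a .s file (for example "gcc -S -O2 scaled.c", assuming the source is saved as scaled.c).

```c
/* Return a plus four times b.
 *
 * Under the usual ARM procedure call standard, a arrives in r0 and b
 * in r1, and the result goes back in r0. An optimising compiler might
 * reduce the whole function to something like:
 *
 *     ADD   r0, r0, r1, LSL #2    ; r0 = a + (b << 2)
 *     MOV   pc, lr                ; return to caller
 *
 * showing how the multiply by four becomes a free shift on the
 * second operand.
 */
int add_scaled(int a, int b)
{
    return a + b * 4;
}
```

Comparing a handful of functions like this against their .s files shows quickly how arguments, arithmetic and returns map onto registers and instructions.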

If you want more transferable skills, use a separate assembler program (eg
asasm, which comes with GCC, or objasm, that comes with Norcroft). The
BASIC assembler is friendly to BASIC programmers, but ends up quite limiting
in some respects (like you can't link its code with C).

Given that the tools are now better and free, there's no reason to pick up
the bad habits that some of us picked up by learning a while ago when we
only had less capable tools available.

Theo

Matthew Phillips

Oct 12, 2009, 3:26:19 PM

In message <0XhAm.27211$X31....@newsfe17.ams2>

on 11 Oct 2009 Jake Waskett wrote:

> On Sun, 11 Oct 2009 08:23:46 +0100, Matthew Phillips wrote:
>
> > I'd certainly agree with that. I used to do ARM assembly language stuff
> > 19 years ago, and when I came back to RISC OS wrote CPMFS in assembly
> > language, but since then I've largely used C. C is easier to learn and
> > much more productive to write.
>
> I'd say it depends on why you want to learn and what you hope to get out
> of it. I've been programming in C for 15 years now (yikes!), and I'd
> agree that it's certainly more productive than any assembler, but is it as
> much fun?
>
> ARM's special. It's unique. Perhaps this is just me; I don't know. It
> was the first assembly language that I learned properly (not counting
> half-hearted attempts to learn the Z80 while my age was still in single
> figures), so perhaps nostalgia has something to do with it. But -
> regardless of how biased I may be - I've always found it occupies a
> special place in computer languages.

At the risk of contradicting my earlier posting, I do agree with a lot of
what you say. However, beyond BASIC the first language I learnt was Z80
assembly language, and it wasn't that bad. The small number of registers was
the main limitation, and the fact that quite a lot of instructions you would
have liked to use did not exist! But the main pleasure I got from it was
writing stuff which worked blindingly fast (or as blinding as 4MHz gets)
compared to the BASIC. Plus being able to access hardware directly.

Yes, ARM assembly language is really nice to write by hand, and assembly
language does give you a much better feel for what the computer really does.
But RISC OS has so many SWIs to do what you need that you'll never really get
the feeling of control and mastery of the machine that you would have with an
old 8-bit micro.

But you're right that for the hobbyist the choice of which language to learn
next depends very much on our motivations.

All I would say is that moving from BASIC to C the biggest thing you will
miss is easy handling of string memory allocation. But on the other hand,
one of the biggest gains is the possibility for strings longer than 255
bytes!
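
To make the contrast concrete: in BASIC, a$ = a$ + b$ just works, because the interpreter manages the storage. A minimal sketch of the same thing in C follows; the function name "concat" is made up for this example, and the caller takes on the job of freeing the result.

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated string holding a followed by b, or NULL
   if allocation fails. The caller must free() the result. */
char *concat(const char *a, const char *b)
{
    size_t len_a = strlen(a);
    size_t len_b = strlen(b);
    char *out = malloc(len_a + len_b + 1);   /* +1 for the terminating NUL */
    if (out == NULL)
        return NULL;
    memcpy(out, a, len_a);
    memcpy(out + len_a, b, len_b + 1);       /* copies b's NUL as well */
    return out;
}
```

Nothing here limits the length to 255 bytes, but forgetting the free(), or writing past the end of the buffer, is exactly the kind of mistake BASIC never lets you make.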

It's probably even easier to write bad C than bad BASIC. Find some good
tutorials and be disciplined! And try to avoid the articles in old Archive
magazines by Paul Johnson. There were a lot of faults in the example
programmes and his C style was not always very idiomatic.

--
Matthew Phillips
Dundee

Rob Kendrick

Oct 12, 2009, 3:57:59 PM

On Mon, 12 Oct 2009 20:26:19 +0100
Matthew Phillips <mn...@sinenomine.freeserve.co.uk> wrote:

> All I would say is that moving from BASIC to C the biggest thing you
> will miss is easy handling of string memory allocation. But on the
> other hand, one of the biggest gains is the possibility for strings
> longer than 255 bytes!

Of course, you do this in BASIC in exactly the same way you do it in C;
blocks of memory with functions to manipulate them :)

To be honest, I think string handling in most applications is
infrequent enough for it not to be a complete ballache.

> It's probably even easier to write bad C than bad BASIC.

Well, the closest to misuse of the syntax you can get is Duff's
Device. (Look it up and have a bucket handy.) Much more evil is
available through BASIC's lax token-based system.
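
For those without a bucket to hand, here is a sketch of Duff's Device in its common memory-to-memory form. Tom Duff's original wrote every value to a single output register rather than incrementing the destination, and assumed a non-zero count; the function name and the zero check here are additions for this example.

```c
#include <stddef.h>

/* Copy count shorts from 'from' to 'to', with the loop unrolled eight
   times. The switch jumps into the middle of the do-while body to deal
   with the count % 8 leftover elements on the first pass. */
void duff_copy(short *to, const short *from, size_t count)
{
    if (count == 0)
        return;                      /* the classic version assumes count > 0 */

    size_t n = (count + 7) / 8;      /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

It is legal C because case labels are just labels inside the switch body, so they may sit inside the do-while. Whether it is still faster than a plain loop on a modern compiler is another matter.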

B.

Gazza

Oct 13, 2009, 3:02:41 PM

On Oct 11, 6:20 pm, Chris Johnson <chrisjohnson+n...@spamcop.net>
wrote:
> In article <5e4b89a850.b...@ron1954.woosh.co.nz>,

This sounds like the best way to go then... I guess the next question
is where to get this DVD and how much do I have to fork out?

Vince M Hudd

Oct 13, 2009, 3:48:03 PM

Gazza <use...@garethlock.com> wrote:
> On Oct 11, 6:20 pm, Chris Johnson <chrisjohnson+n...@spamcop.net> wrote:

> > Get the Archive DVD and all this will be at your finger tips at the
> > click of a mouse. The HTML version has full (live) indexes and cross
> > references.

> This sounds like the best way to go then... I guess the next question is
> where to get this DVD and how much do I have to fork out?

http://www.archivemag.co.uk

--
Vince M Hudd
Soft Rock Software

Terje Slettebø

Oct 13, 2009, 6:12:36 PM

Hi Gazza.

Good to hear that someone is still interested in learning assembly
programming. :)

I still think it's a useful skill to know how things happen at the
lowest level of the computer...

Not exactly a tutorial, but Richard Murray has some nice resources and
inspiration related to ARM programming at his homepage:
http://www.heyrick.co.uk/assembler/index.html

As an aside, I'm currently updating the extASM assembler
(http://home.broadpark.no/~terjesl/eng/extasm.html) to the latest ARM version
(ARMv7), as it currently only supports up to ARMv3 (ARM2/3/6).

extASM is itself written in ARM assembly... :) This was a necessity at
the time it was made (about 14 years ago), but I still think there's
something to be said for programming in assembly code...

The updating of extASM was spurred on by the latest developments
involving RISC OS being ported to the Beagle Board, and potentially
other ARM Cortex/OMAP devices. This may well give RISC OS a new lease
of life...

I agree with Jake Waskett's comments on ARM assembly: the reason I
still care for ARM/RISC OS is that I find both elegant...

Yes, ARMv7 is much more complex (a much larger instruction set) than
the early ARM versions, but I still think it retains a certain
elegance, and the larger instruction set yields a more powerful
processor...

Also, before anyone thinks I'm stuck in the past: In my job, I develop
web applications using PHP, and have been studying and using C++ for
many years (and before that, BASIC and C), as well as having played
with other languages, like Haskell.

I find that low-level languages, like assembly, and high-level
languages, like C++ and Haskell, both give me a certain kind of
pleasure, in different ways...

Also, I think it's useful to have knowledge of, and experience with,
the whole gamut of programming languages, from very low-level to very
high-level.

Regards,

Terje

Rob Kendrick

Oct 13, 2009, 6:36:07 PM

On Tue, 13 Oct 2009 15:12:36 -0700 (PDT)
Terje Slettebø <tsle...@gmail.com> wrote:

> Yes, ARMv7 is much more complex (a much larger instruction set) than
> the early ARM versions, but I still think it retains a certain
> elegance, and the larger instruction set yields a more powerful
> processor...

But by ARMv9, we'll have something more CISCy than x86. No,
seriously. It's dangerously close to that complex now. We already get
to deal with the joys of remembering which instructions can be
conditionally executed, and which only work on even-numbered registers,
which operate on which halves of words, etc etc.

ARM is no longer elegant. ARM is now a CPU designed for performance,
and not for the joy of the programmer.

B.

Terje Slettebø

Oct 13, 2009, 7:17:18 PM

On 14 Okt, 00:36, Rob Kendrick <n...@rjek.com> wrote:
> On Tue, 13 Oct 2009 15:12:36 -0700 (PDT)
>

I agree that some of the elegance is gone, such as:

- It used to be that all instructions could be conditionally executed,
but that has now been sacrificed to make room for more instructions
(as you point out).

- It also used to be possible to call and return from subroutines
using single instructions, preserving the PSR. That's no longer the
case, with 32-bit only ARMs.

- With the previous point goes that (unless you use multi-instruction
sequences to save and restore the PSR) "BL" and "SWI"s no longer
preserve the PSR, leading to another "symmetry" being lost (in
addition to that about conditional execution).
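
To make the first point concrete: in the classic 32-bit ARM encoding the top four bits of every instruction hold the condition, and it is the old "never" pattern (1111) that later architecture versions reuse for instructions that cannot be conditional. A minimal Python sketch of that decoding (the function names are mine, purely for illustration):

```python
# Condition mnemonics indexed by bits 31-28 of a classic ARM instruction.
CONDITIONS = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
              "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"]


def condition_of(instr):
    """Return the condition mnemonic encoded in the top four bits."""
    return CONDITIONS[(instr >> 28) & 0xF]


def in_unconditional_space(instr):
    """True if the instruction uses the 1111 pattern (originally NV)."""
    return ((instr >> 28) & 0xF) == 0xF


print(condition_of(0xE1A00000))            # MOV r0, r0
print(condition_of(0x01A00000))            # MOVEQ r0, r0
print(in_unconditional_space(0xF5D0F000))  # PLD [r0], which has no conditional form
```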

Still, I wouldn't put it as bleak as saying that it's no longer
elegant, and keep in mind that as long as you can tolerate the last
point here ("BL" and "SWI" not preserving the PSR), you can write
essentially identical ARM code as before.

I have direct evidence of this, having recently updated extASM to be
32-bit compatible, and it didn't involve that much of a job (most of
it was quite mechanical, having to do with the mentioned PSR
preserving).
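
For anyone who hasn't met the 26-bit layout, here is a rough sketch of why the flags used to survive a call for free: R15 held the flags, interrupt disable bits, PC and mode in a single word, so BL copied the lot into R14 and "MOVS pc, lr" put it all back. On 32-bit ARMs the flags live in the separate CPSR, so a plain "MOV pc, lr" returns without restoring them. The helper name below is made up for the example:

```python
def split_r15_26bit(r15):
    """Split a 26-bit-mode R15 value into the fields packed inside it."""
    return {
        "flags_nzcv": (r15 >> 28) & 0xF,         # bits 31-28: N, Z, C, V
        "irq_disabled": bool(r15 & (1 << 27)),   # bit 27: I
        "fiq_disabled": bool(r15 & (1 << 26)),   # bit 26: F
        "pc": r15 & 0x03FFFFFC,                  # bits 25-2: word-aligned PC
        "mode": r15 & 0x3,                       # bits 1-0: processor mode
    }


# N flag set, PC = 0x8000, mode 3 (SVC)
fields = split_r15_26bit(0x80008003)
print(hex(fields["pc"]), fields["mode"], bin(fields["flags_nzcv"]))
```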

Furthermore, what we've lost in simplicity, we've gained in power,
such as the SIMD and VFP instructions. Finally, we have ARM chips that
have built-in floating point computation (even SIMD and vector
operations), and quite frankly, I think that's worth the added
complexity.

Besides, without this growth, the ARM processor would likely not have
had much chance in today's market, so you can lament the added
complexity and loss of elegance, but without it, there might not have
been anything to lament about...

As for the large increase in number of instructions (many of which are
rather specialised, and not at all obvious what they may be used
for...), I guess I feel that more than most, having to implement all
of that... :) Still, the flip side is that with all this power
(especially for SoCs like OMAP), I feel like a kid let loose in a toy
store... :)

Rob Kendrick

Oct 13, 2009, 7:27:31 PM

On Tue, 13 Oct 2009 16:17:18 -0700 (PDT)
Terje Slettebø <tsle...@gmail.com> wrote:

> I agree that some of the elegance is gone, such as:
>
> - It used to be that all instructions could be conditionally executed,
> but that has now been sacrificed to make room for more instructions
> (as you point out).

There have always been a handful of instructions it was not meaningful
to conditionally execute (TEQP, and some other corner cases), but now
it has become silly.

> - It also used to be possible to call and return from subroutines
> using single instructions, preserving the PSR. That's no longer the
> case, with 32-bit only ARMs.

On the plus side, we have BX, which allows transparent interworking
between ARM code and that horror that is Thumb.

> - With the previous point goes that (unless you use multi-instruction
> sequences to save and restore the PSR) "BL" and "SWI"s no longer
> preserve the PSR, leading to another "symmetry" being lost (in
> addition to that about conditional execution).

SWI has all but been deprecated. Encoding data in the instruction has
been a bad idea since the StrongARM; it pollutes the data cache. Under
Linux, using the new syscall standard in EABI makes Firefox on ARM
around 30% faster, due to the lack of data cache destruction. Another
feature RISC OS will sadly never get :-/

> Still, I wouldn't put it as bleak as saying that it's no longer
> elegant, and keep in mind that as long as you can tolerate the last
> point here ("BL" and "SWI" not preserving the PSR), you can write
> essentially identical ARM code as before.

But not take advantage of any of the new performance offered.

> I have direct evidence of this, having recently updated extASM to be
> 32-bit compatible, and it didn't involve that much of a job (most of
> it was quite mechanical, having to do with the mentioned PSR
> preserving).

There are also memory access issues with newer CPUs where seemingly
valid code will explode. (Which is why many applications will need a
recompile to work on the Beagleboard.)

> Furthermore, what we've lost in simplicity, we've gained in power,
> such as the SIMD and VFP instructions. Finally, we have ARM chips that
> have built-in floating point computation (even SIMD and vector
> operations), and quite frankly, I think that's worth the added
> complexity.

Although for any decent performance you're forced to use
single precision. Double precision in A8 and upwards is deprecated,
and not pipelined. And you still have the problem that C code running
on an x86 is faster than hand-crafted assembler on the fastest ARM. The
VFP has been deprecated in A8 and above, like the FPA10 was in ARM11.
Don't rely on it appearing in future processors.

> Besides, without this growth, the ARM processor would likely not have
> had much chance in today's market, so you can lament the added
> complexity and loss of elegance, but without it, there might not have
> been anything to lament about...

Well, the performance hacks are aimed at PDAs and netbooks, there'd
still be a huge market in microcontrollers, where the ARM7TDMI still
reigns supreme in number of units shipped.

B.

Xavier

Oct 13, 2009, 8:10:51 PM

Well gentlemen, it's very interesting to read your posts.
I remember reading about the Cortex instruction set and features,
and I quickly understood there must be far more than 37 000
transistors.
It's not the Acorn philosophy anymore (simple, fast, sturdy, inexpensive).

Terje Slettebø

Oct 14, 2009, 3:37:51 AM

On 14 Oct, 01:27, Rob Kendrick <n...@rjek.com> wrote:
> On Tue, 13 Oct 2009 16:17:18 -0700 (PDT)
> Terje Slettebø <tslett...@gmail.com> wrote:
> > - With the previous point goes that (unless you use multi-instruction
> > sequences to save and restore the PSR) "BL" and "SWI"s no longer
> > preserve the PSR, leading to another "symmetry" being lost (in
> > addition to that about conditional execution).
>
> SWI has all but been deprecated. Encoding data in the instruction has
> been a bad idea since the StrongARM; it pollutes the data cache. Under
> Linux, using the new syscall standard in EABI makes Firefox on ARM
> around 30% faster, due to the lack of data cache destruction. Another
> feature RISC OS will sadly never get :-/

I didn't quite follow this... How does encoding data in the
instruction pollute the data cache?

> > Furthermore, what we've lost in simplicity, we've gained in power,
> > such as the SIMD and VFP instructions. Finally, we have ARM chips that
> > have built-in floating point computation (even SIMD and vector
> > operations), and quite frankly, I think that's worth the added
> > complexity.
>
> Although for any decent performance you're forced to use
> single precision. Double precision in A8 and upwards is deprecated,
> and not pipelined. And you still have the problem that C code running
> on an x86 is faster than hand-crafted assembler on the fastest ARM. The
> VFP has been deprecated in A8 and above, like the FPA10 was in ARM11.
> Don't rely on it appearing in future processors.

Interesting, I've never heard of this... *sigh* The ARM processor
seems very much a moving target, nowadays... I had barely implemented
VFP as defined in ARMv5, before I got the ARMv7 ARM, where it has been
deprecated, and replaced with VFP v.2, with a completely different
instruction set and syntax... Now they have deprecated it again, although
"just" the vector mode version of it, it seems (not all of VFP).

Doing a search on this, I found this page: http://lua-users.org/lists/lua-l/2009-08/msg00338.html:

"The VFP vector mode is not true SIMD. It's about quickly issuing
multiple operations in succession. But turning it on and off involved
a pipeline flush and programming it was quite tricky. I guess it
wasn't popular outside of handcoded assembly."

Do you know any more about it than this, i.e. why it has been
deprecated, and where you have the information from? (I have the ARMv7
ARM, as well as various Cortex A8 and A9 documents, but I've mostly
concentrated on implementing the instructions at this point, not
studying the whole specification)

Thanks for the heads-up: I guess this means the vector notation in
extASM will go now...

> > Besides, without this growth, the ARM processor would likely not have
> > had much chance in today's market, so you can lament the added
> > complexity and loss of elegance, but without it, there might not have
> > been anything to lament about...
>
> Well, the performance hacks are aimed at PDAs and netbooks, there'd
> still be a huge market in microcontrollers, where the ARM7TDMI still
> reigns supreme in number of units shipped.

Right, but PDAs and netbooks/laptops are also some of the most
promising targets for a RISC OS port.

Jake Waskett

Oct 14, 2009, 5:04:18 AM

On Wed, 14 Oct 2009 00:37:51 -0700, Terje Slettebø wrote:

>> SWI has all but been deprecated. Encoding data in the instruction has
>> been a bad idea since the StrongARM; it pollutes the data cache. Under
>> Linux, using the new syscall standard in EABI makes Firefox on ARM
>> around 30% faster, due to the lack of data cache destruction. Another
>> feature RISC OS will sadly never get :-/
>
> I didn't quite follow this... How does encoding data in the instruction
> pollute the data cache?

Basically, because the kernel has to work out what service is requested,
so it reads the SWI instruction. On processors with unified caches, it's
quite efficient, because the instruction is already in the cache as a
result of having been executed. But with split caches, it's inefficient,
effectively loading an entire data cache line for something that should
already be known.
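
As a rough illustration of what "reads the SWI instruction" means: a handler typically loads the word at LR-4 as data and masks off the top byte, leaving the 24-bit number. The sketch below does the same masking in Python; the function names are mine, and it deliberately ignores the condition field.

```python
def is_swi(instr):
    """True if bits 27-24 are all ones, i.e. the word is a SWI/SVC."""
    return ((instr >> 24) & 0xF) == 0xF


def swi_number(instr):
    """Extract the 24-bit number embedded in a SWI instruction word."""
    if not is_swi(instr):
        raise ValueError("not a SWI instruction: %#010x" % instr)
    return instr & 0x00FFFFFF


# The instruction word is fetched as *data* by the handler, which is
# where the extra data cache line fill comes from on a split-cache CPU.
print(hex(swi_number(0xEF000000)))  # SWI 0
print(hex(swi_number(0xEF020040)))  # a SWI with bit 17 (the "X" bit) set
```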

I wonder if RISC OS could support another system call convention alongside
the classical SWI interface. It might be necessary to use the old
interface for some SWIs with register usage that made change impossible,
but I wouldn't be surprised if most could be supported. It might even be
possible to dynamically patch applications to use the new interface.

Rob Kendrick

Oct 14, 2009, 5:53:46 AM

On Wed, 14 Oct 2009 09:04:18 GMT
Jake Waskett <ja...@waskett.org> wrote:

> I wonder if RISC OS could support another system call convention
> alongside the classical SWI interface. It might be necessary to use
> the old interface for some SWIs with register usage that made change
> impossible, but I wouldn't be surprised if most could be supported.
> It might even be possible to dynamically patch applications to use
> the new interface.

You can most likely implement the EABI syscall system as a stand-alone
module in RISC OS that registers a normal SWI handler. But it will get
you no performance benefit, for reasons I described in my reply to
Terje a moment ago.

B.

Rob Kendrick

Oct 14, 2009, 5:52:38 AM

On Wed, 14 Oct 2009 00:37:51 -0700 (PDT)
Terje Slettebø <tsle...@gmail.com> wrote:

> > SWI has all but been deprecated. Encoding data in the instruction
> > has been a bad idea since the StrongARM; it pollutes the data
> > cache. Under Linux, using the new syscall standard in EABI makes
> > Firefox on ARM around 30% faster, due to the lack of data cache
> > destruction. Another feature RISC OS will sadly never get :-/
>
> I didn't quite follow this... How does encoding data in the
> instruction pollute the data cache?

Because you read the instruction back as data to find out which SWI
number you are trying to call. This destroys a cache line with useless
data. With the new syscall system, the "SWI number" you want to call is
in a register, avoiding this.

Fortunately, you can provide both syscall systems at the same time,
because the SWI number used by the new system is fixed (though a pure
new-style kernel never reads it). Unfortunately, the cost of this
compatibility is that you get none of the performance benefit (because
you still need to read the instruction to see whether it's a new or
old style syscall.)
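
A toy model of that trade-off, assuming the Linux-style arrangement in which an EABI call is "SWI 0" with the call number in r7, while an old-style call carries the number in the instruction itself. The function name and the tuple it returns are invented for the example. Note that RISC OS already uses SWI 0 for OS_WriteC, so this exact test would not carry over unchanged.

```python
def classify_syscall(instr, r7):
    """Decide which convention a SWI used and return (style, number).

    The instruction word still has to be read to make the decision,
    which is exactly why supporting both styles keeps the cache cost.
    """
    comment = instr & 0x00FFFFFF  # 24-bit field embedded in the SWI
    if comment == 0:
        return ("new", r7)        # number was passed in a register
    return ("old", comment)       # number was encoded in the instruction


print(classify_syscall(0xEF000000, 4))  # new style, call 4 taken from r7
print(classify_syscall(0xEF900004, 0))  # old style, number taken from the instruction
```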

> > Although for any decent performance you're forced to use
> > single precision. Double precision in A8 and upwards is
> > deprecated, and not pipelined. And you still have the problem
> > that C code running on an x86 is faster than hand-crafted assembler
> > on the fastest ARM. The VFP has been deprecated in A8 and above,
> > like the FPA10 was in ARM11. Don't rely on it appearing in future
> > processors.
>
> Interesting, I've never heard of this... *sigh* The ARM processor
> seems very much a moving target, nowadays... I had barely implemented
> VFP as defined in ARMv5, before I got the ARMv7 ARM, where it has been
> deprecated, and replaced with VFP v.2, with a completely different
> instruction set and syntax... Now they have deprecated it again, although
> "just" the vector mode version of it, it seems (not all of VFP).

In A8, the VFP is implemented using something called "VFP Lite", a
cut-down simplified version that is not pipelined. Its performance is
terrible. However, they have introduced the NEON instruction set,
which is very funky (and very CISCy). However, it lacks
double-precision arithmetic.

> Do you know any more about it than this, i.e. why it has been
> deprecated, and where you have the information from? (I have the ARMv7
> ARM, as well as various Cortex A8 and A9 documents, but I've mostly
> concentrated on implementing the instructions at this point, not
> studying the whole specification)

Sure; ARM's site says it's deprecated in favour of NEON, and the
dreadful implementations in current CPUs reflect that. ARM have never
been interested in maintaining backwards compatibility; and most of
their customers don't care because of their business models.

> > > Besides, without this growth, the ARM processor would likely not
> > > have had much chance in today's market, so you can lament the added
> > > complexity and loss of elegance, but without it, there might not
> > > have been anything to lament about...
> >
> > Well, the performance hacks are aimed at PDAs and netbooks, there'd
> > still be a huge market in microcontrollers, where the ARM7TDMI still
> > reigns supreme in number of units shipped.
>
> Right, but PDAs and netbooks/laptops are also some of the most
> promising targets for a RISC OS port.

Sure, but that's not ARM's main interest, which was what I was trying
to say.

B.

Michael Gerbracht

Oct 14, 2009, 7:00:34 AM

In article <20091013233...@trite.i.flarn.net.i.flarn.net>,

Rob Kendrick <nn...@rjek.com> wrote:
> On Tue, 13 Oct 2009 15:12:36 -0700 (PDT)
> Terje Slettebø <tsle...@gmail.com> wrote:

> > Yes, ARMv7 is much more complex (a much larger instruction set) than
> > the early ARM versions, but I still think it retains a certain
> > elegance, and the larger instruction set yields a more powerful
> > processor...

> But by ARMv9, we'll have something more CISCy than x86. [...]

> ARM is no longer elegant. ARM is now a CPU designed for performance,
> and not for the joy of the programmer.

When ARM was born the philosophy was that a processor with only a few
instructions is faster than a very complex CISC CPU. Also it could be clocked
much faster. Is this not true anymore?

I just wonder why we do not have an ARM with a limited instruction set which
clocks at, say, 10 GHz?

No offense here, I am just interested.

Michael

--
Please replace "nospam" by "m.gerbracht" when replying by mail

John Kortink

Oct 14, 2009, 7:31:44 AM

On Wed, 14 Oct 2009 13:00:34 +0200, Michael Gerbracht
<nos...@cityweb.de> wrote:

>In article <20091013233...@trite.i.flarn.net.i.flarn.net>,
> Rob Kendrick <nn...@rjek.com> wrote:
>> On Tue, 13 Oct 2009 15:12:36 -0700 (PDT)
>

>[...]
>
>> ARM is no longer elegant. ARM is now a CPU designed for performance,
>> and not for the joy of the programmer.
>
>When ARM was born the philosophy was that a processor with only a few
>instructions is faster than a very complex CISC CPU. Also it could be clocked
>much faster. Is this not true anymore?

It never was. There's not just clock speed, but number of
cycles per instruction as well. CISC CPUs generally need
many more cycles per instruction than RISC CPUs. They
probably do (very roughly) the same amount of work per
clock cycle, although a CISC CPU will tend to do much
more in parallel.
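
Put as arithmetic, run time is instruction count times cycles per instruction, divided by clock rate, so a faster clock alone settles nothing. A small sketch with entirely made-up figures (neither "CPU" below is a real part):

```python
def run_time_seconds(instructions, cycles_per_instruction, clock_hz):
    """Execution time = instructions * CPI / clock frequency."""
    return instructions * cycles_per_instruction / clock_hz


# Hypothetical RISC-style CPU: more, simpler instructions, low CPI.
risc = run_time_seconds(1_500_000, 1.5, 8_000_000)
# Hypothetical CISC-style CPU: fewer instructions, higher CPI, faster clock.
cisc = run_time_seconds(1_000_000, 4.0, 16_000_000)

print(risc, cisc)
```

With these invented numbers the slower-clocked part still finishes slightly later, which is the point: the three factors have to be taken together.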

John Kortink

--

Email : kor...@inter.nl.net
Homepage : http://www.inter.nl.net/users/J.Kortink

Those who can, do. Those who can't, manage.

Jake Waskett

Oct 14, 2009, 8:31:48 AM

As I recall, EABI uses SWI 0, so I don't think it could be implemented
without clashing with OS_WriteC. And as you say, even if it could be
implemented, it wouldn't buy any benefit.

But there are more options than just RISC OS SWI or EABI SWI. Anything
that results in a controlled switch to a privileged mode would work, which
on ARM basically means undefined instructions, prefetch aborts and data
aborts. Undefined instructions suffer from the same problem as SWIs - the
kernel would need to read the instruction to check that it is the "right"
kind of undefined instruction, and deliberate data aborts would probably
cost a register for holding a (useless) address, but prefetch aborts
(through jumps to a specific, invalid memory page) could be quite
efficient. As I recall Windows CE used to use that method for system
calls (I merely mention this as a point of interest).

Rob Kendrick

Oct 14, 2009, 8:48:57 AM

Several operating systems use data abort as a method for entering
system calls. I'm not entirely sure why.

B.

Theo Markettos

Oct 14, 2009, 12:44:52 PM

Jake Waskett <ja...@waskett.org> wrote:
> But there are more options than just RISC OS SWI or EABI SWI. Anything
> that results in a controlled switch to a privileged mode would work, which
> on ARM basically means undefined instructions, prefetch aborts and data
> aborts. Undefined instructions suffer from the same problem as SWIs - the
> kernel would need to read the instruction to check that it is the "right"
> kind of undefined instruction, and deliberate data aborts would probably
> cost a register for holding a (useless) address,

I'm quite surprised there isn't a coprocessor register for holding the
instruction that caused the abort/SWI - it's not uncommon to want to
unpick what happened (eg emulating missing instructions by sitting on the
undefined trap) and it would seem to solve the cache problems.

(No use on an OS that doesn't know about the existence of such a register,
of course...)

Theo

Terje Slettebø

Oct 14, 2009, 6:56:08 PM

On 14 Oct, 11:52, Rob Kendrick <n...@rjek.com> wrote:
> On Wed, 14 Oct 2009 00:37:51 -0700 (PDT)
> Terje Slettebø <tslett...@gmail.com> wrote:
> > > SWI has all but been deprecated. Encoding data in the instruction
> > > has been a bad idea since the StrongARM; it pollutes the data
> > > cache. Under Linux, using the new syscall standard in EABI makes
> > > Firefox on ARM around 30% faster, due to the lack of data cache
> > > destruction. Another feature RISC OS will sadly never get :-/
>
> > I didn't quite follow this... How does encoding data in the
> > instruction pollute the data cache?
>
> Because you read the instruction back as data to find out which SWI
> number you are trying to call. This destroys a cache line with useless
> data. The new syscall system, the "SWI number" you want to call is in
> a register, avoiding this.

Ah, right. If there was a way to store the SWI number in an unused
register, it should have been possible to potentially dynamically
reassemble ARM code to swap SWI for e.g. MOV Rx, #<SWI number> + BKPT,
and then forward the call from the breakpoint vector to the SWI code,
but there's the matter of that "unused register"...
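
One further wrinkle with the MOV idea, sketched under the assumption that a single data-processing immediate is used: an ARM immediate is an 8-bit value rotated right by an even amount, so not every 24-bit SWI number fits in one MOV. The helper names below are mine:

```python
def rotate_left32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF


def encode_arm_immediate(value):
    """Return (rot, imm8) such that imm8 ROR (2*rot) == value, or None."""
    for rot in range(16):
        imm8 = rotate_left32(value, 2 * rot)
        if imm8 <= 0xFF:
            return rot, imm8
    return None


print(encode_arm_immediate(0x40))     # fits directly
print(encode_arm_immediate(0x20000))  # bit 17 alone fits, with a rotation
print(encode_arm_immediate(0x20040))  # bit 17 plus a low bit does not fit
```

A number that cannot be encoded would need a second instruction or a literal load, and a literal load would bring a data read back in.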

> > > Although for any decent performance you're forced to use
> > > single precision. Double precision in A8 and upwards is
> > > deprecated, and not pipelined. And you still have the problem
> > > that C code running on an x86 is faster than hand-crafted assembler
> > > on the fastest ARM. The VFP has been deprecated in A8 and above,
> > > like the FPA10 was in ARM11. Don't rely on it appearing in future
> > > processors.
>

> In A8, the VFP is implemented using something called "VFP Lite", a
> cut-down simplified version that is not pipelined. Its performance is
> terrible. However, they have introduced the NEON instruction set,
> which is very funky (and very CISCy). However, it lacks
> double-precision arithmetic.
>
> > Do you know any more about it than this, i.e. why it has been
> > deprecated, and where you have the information from? (I have the ARMv7
> > ARM, as well as various Cortex A8 and A9 documents, but I've mostly
> > concentrated on implementing the instructions at this point, not
> > studying the whole specification)
>
> Sure; ARM's site says it's deprecated in favour of NEON, and the
> dreadful implementations in current CPUs reflect that.

I've done a search at the ARM site, as well as read documents on the
Cortex A9, etc., but I haven't found anywhere that it says that VFP/
double precision operations have been deprecated. Could you provide a
link?

What it does say, however, is that VFP vector mode is deprecated,
which is rather different from all of VFP being deprecated. It means
that you may still use VFP code (which includes single and double
precision operations), just not in vector mode (or it needs support
code for the vector mode to work).

From the Cortex A9 TRM:

"The Cortex-A9 NEON MPE hardware supports single and double-precision
add, subtract, multiply, divide, multiply and accumulate, and square
root operations as described in the ARM VFPv3 architecture.
...
ARMv7 deprecates the use of VFP vector mode. The Cortex-A9 NEON MPE
hardware does not support VFP vector operations. In this manual, the
term vector refers to Advanced SIMD integer, polynomial and single-
precision vector operations. The Cortex-A9 NEON MPE provides high
speed VFP operation without support code. However, if an application
requires VFP vector operation, then it must use support code."

Rob Kendrick

Oct 14, 2009, 9:11:02 PM

On Wed, 14 Oct 2009 15:56:08 -0700 (PDT)
Terje Slettebø <tsle...@gmail.com> wrote:

> > > I didn't quite follow this... How does encoding data in the
> > > instruction pollute the data cache?
> >
> > Because you read the instruction back as data to find out which SWI
> > number you are trying to call. This destroys a cache line with
> > useless data. The new syscall system, the "SWI number" you want to
> > call is in a register, avoiding this.
>
> Ah, right. If there was a way to store the SWI number in an unused
> register, it should have been possible to potentially dynamically
> reassemble ARM code to swap SWI for e.g. MOV Rx, #<SWI number> + BKPT,
> and then forward the call from the breakpoint vector to the SWI code,
> but there's the matter of that "unused register"...

Yes, but how do you know when to trigger this without reading the SWI
out? What about programs that generate SWI instructions on the fly to
call?

> I've done a search at the ARM site, as well as read documents on the
> Cortex A9, etc., but I haven't found anywhere that it says that VFP/
> double precision operations have been deprecated. Could you provide a
> link?

The discussion on the Lua mailing list that somebody posted earlier has
these citations.

B.

Rob Kendrick

Oct 14, 2009, 9:17:02 PM

On Wed, 14 Oct 2009 15:56:08 -0700 (PDT)
Terje Slettebø <tsle...@gmail.com> wrote:

> From the Cortex A9 TRM:
>
> "The Cortex-A9 NEON MPE hardware supports single and double-precision
> add, subtract, multiply, divide, multiply and accumulate, and square
> root operations as described in the ARM VFPv3 architecture.

(Sorry for re-reply)

http://infocenter.arm.com/help/topic/com.arm.doc.dui0204i/CIHDIBDG.html

You'll notice from this that a 64 bit/double precision float type is not
available on NEON, and the performance of the VFPlite in A8 is so
dreadful as to not be worth using.

(NEON does, however, support 64 bit integers and lots of funky maths on
them, which is quite exciting.)

B.

Jake Waskett

Oct 15, 2009, 6:00:58 AM

On Wed, 14 Oct 2009 13:48:57 +0100, Rob Kendrick wrote:

> Several operating systems use data abort as a method for entering system
> calls. I'm not entirely sure why.

Out of interest, which OSes are you thinking of?

Rob Kendrick

Oct 15, 2009, 6:28:31 AM

L4, and some versions of CE.

B.

Rodolph Perfetta

Oct 15, 2009, 8:50:54 AM

Rob Kendrick wrote:
> On Tue, 13 Oct 2009 16:17:18 -0700 (PDT)
> Terje Slettebø <tsle...@gmail.com> wrote:
>
>> Furthermore, what we've lost in simplicity, we've gained in power,
>> such as the SIMD and VFP instructions. Finally, we have ARM chips that
>> have built-in floating point computation (even SIMD and vector
>> operations), and quite frankly, I think that's worth the added
>> complexity.
>
> Although for any decent performance you're forced to use
> single precision. Double precision in A8 and upwards is deprecated,
> and not pipelined. And you still have the problem that C code running
> on an x86 is faster than hand-crafted assembler on the fastest ARM. The
> VFP has been deprecated in A8 and above, like the FPA10 was in ARM11.
> Don't rely on it appearing in future processors.

Double precision and VFP have not been deprecated, far from it. As you
mentioned in a later reply, A8 has VFP lite, which is optimised for size
(in gates). A9 has a VFP implementation optimised for speed and is a lot
faster.

VFP is the recommended option for floating point (single and double).
Neon is a SIMD engine and should be used as such (doing multiple
operations in parallel); for one operation VFP should be used. Also Neon
is not IEEE 7xx (I never remember the exact number) compliant, VFP is.

HTH,
Rodolph.

Jake Waskett

Oct 15, 2009, 9:08:09 AM

Interesting. L4 has a history of extreme attention being paid to the
performance implications of every design decision, so I wouldn't expect
such a choice to be made unless there were very good reasons for it. I
must admit, though, that I can't think what those are - prefetch seems far
more sensible to me.

Jake Waskett

Oct 15, 2009, 9:37:05 AM

On Thu, 15 Oct 2009 02:11:02 +0100, Rob Kendrick wrote:

> On Wed, 14 Oct 2009 15:56:08 -0700 (PDT) Terje Slettebø
> <tsle...@gmail.com> wrote:
>
>> > > I didn't quite follow this... How does encoding data in the
>> > > instruction pollute the data cache?
>> >
>> > Because you read the instruction back as data to find out which SWI
>> > number you are trying to call. This destroys a cache line with
>> > useless data. The new syscall system, the "SWI number" you want to
>> > call is in a register, avoiding this.
>>
>> Ah, right. If there was a way to store the SWI number in an unused
>> register, it should have been possible to potentially dynamically
>> reassemble ARM code to swap SWI for e.g. MOV Rx, #<SWI number> + BKPT,
>> and then forward the call from the breakpoint vector to the SWI code,
>> but there's the matter of that "unused register"...
>
> Yes, but how do you know when to trigger this without reading the SWI
> out?

If you arrive via a <insert fast syscall method here> instruction, you'll
enter the kernel via a different exception vector than that which is used
for SWIs, so you don't need to read the instruction. You can just use the
SWI number directly.

If you arrive via a SWI instruction, then you always read the SWI
instruction & extract the bits you want. You then patch the code so that
it doesn't use a SWI instruction (I'm not familiar with BKPT, so I can't
comment on that specifically, but the basic scheme ought to work
regardless of the exact syscall method).

Since the patched code would be larger than the original, it would be
necessary to maintain a "patch buffer" that's reachable via a branch from
what was the SWI instruction. (As a bonus, if you were to attach an
invalid page to either end of this patch buffer, it would make a good
target address for data aborts, accessible via PC-relative addressing.)
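
The "reachable via a branch" constraint can be made concrete. An ARM B instruction holds a signed 24-bit word offset relative to the branch's address plus 8, giving a reach of roughly +/-32MB. Here is a sketch of how a patcher might build the replacement word; the function name is invented, and a real patcher would also have to synchronise the instruction cache afterwards:

```python
def encode_branch(from_addr, to_addr):
    """Build an unconditional ARM 'B' from from_addr to to_addr."""
    offset = to_addr - (from_addr + 8)  # PC reads as instruction address + 8
    if offset % 4:
        raise ValueError("branch target must be word aligned")
    words = offset >> 2
    if not -(1 << 23) <= words < (1 << 23):
        raise ValueError("patch buffer out of branch range")
    return 0xEA000000 | (words & 0x00FFFFFF)


# Replace the SWI at 0x8000 with a branch to a stub at 0x20000.
print(hex(encode_branch(0x8000, 0x20000)))
```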

There are probably a few "weird" SWIs that do non-standard things with
registers, and hence cannot be safely patched. Also, code might be
running from ROM, or it might be impossible to map a patch buffer into the
available virtual address space. So it would be necessary to check
whether it is safe to patch. If it isn't, just execute the SWI as is done
at present.

There would be an associated performance cost with the first SWI
instruction, but that's unavoidable. The cost for later 'syscalls' should
be much less. And new programs could (optionally) be compiled to use the
new-style syscall directly, making them faster.

> What about programs that generate SWI instructions on the fly to
> call?

Just execute them. Worst-case scenario is that they'll run a little
slower. Realistically, CallASWI has been around for years, and self-
modifying code has been best avoided since the StrongARM was introduced,
so such programs are probably so ancient that performance problems aren't
an issue.

Jake

Rob Kendrick

Oct 15, 2009, 9:56:35 AM

On Thu, 15 Oct 2009 13:37:05 GMT
Jake Waskett <ja...@waskett.org> wrote:

> > Yes, but how do you know when to trigger this without reading the
> > SWI out?
>
> If you arrive via a <insert fast syscall method here> instruction,
> you'll enter the kernel via a different exception vector than that
> which is used for SWIs, so you don't need to read the instruction.
> You can just use the SWI number directly.

If you want to remain compatible, you need to use SWI for EABI, however.

B.

Jake Waskett

Oct 15, 2009, 3:49:35 PM

On Thu, 15 Oct 2009 14:56:35 +0100, Rob Kendrick wrote:

> If you want to remain compatible, you need to use SWI for EABI, however.

I don't understand. Are you saying that it's necessary to use EABI-style
SWIs in order to be compatible with something? If so, with what, and why?

Rob Kendrick

Oct 15, 2009, 7:29:03 PM

Code from other EABI platforms. In the same way Linux and RISC iX have
RISC OS personalities. There is a standard SWI number used for EABI,
and a looser standard for different UNIX-flavoured system calls.

B.

Terje Slettebø

Oct 16, 2009, 3:01:38 AM

You're right. ARM has also confirmed this (I asked them).

Rob Kendrick

Oct 16, 2009, 4:18:35 AM

On Thu, 15 Oct 2009 13:50:54 +0100
Rodolph Perfetta <rodolph....@arm.com> wrote:

> A9 has a VFP implementation optimised for speed and is a lot
> faster.

Although I make a habit of not believing the performance claims of
ARM CPUs until they're in production :)

B.

Jake Waskett

Oct 16, 2009, 5:04:15 AM

Yes, I looked into this about six months ago. As I recall, it's SWI 0x0 -
the same as OS_WriteC. I guess it would be possible to set an internal
flag when loading non-RO apps, but it could be messy. Anyway I'm not sure
how much of an advantage there would be in running POSIX-type apps on RISC
OS: RO is a great platform, but it isn't designed to be a POSIX platform,
and is weak in that respect. It seems to me that improving the
performance of RISC OS "SWI"s would be far more beneficial than emulation
layers for running code written for other platforms.

Rob Kendrick

Oct 16, 2009, 5:52:32 AM

On Fri, 16 Oct 2009 09:04:15 GMT
Jake Waskett <ja...@waskett.org> wrote:

> > Code from other EABI platforms. In the same way Linux and RISC iX
> > have RISC OS personalities. There is a standard SWI number used
> > for EABI, and a looser standard for different UNIX-flavoured system
> > calls.
>
> Yes, I looked into this about six months ago. As I recall, it's SWI
> 0x0 - the same as OS_WriteC. I guess it would be possible to set an
> internal flag when loading non-RO apps, but it could be messy.

Not as messy as the rest of the state that the WIMP stores on each
process already.

> Anyway I'm not sure how much of an advantage there would be in
> running POSIX-type apps on RISC OS: RO is a great platform, but it
> isn't designed to be a POSIX platform, and is weak in that respect.

Well, there's plenty of value of running POSIX applications under RISC
OS. Look at Firefox and NetSurf (which is POSIX) for examples.

There have been efforts in the past to allow running of ARM Linux
binaries directly under RISC OS, and it worked after a fashion. (A
module that provided Linux's OABI syscall interface via SWI)

> It seems to me that improving the performance of RISC OS "SWI"s would
> be far more beneficial than emulation layers for running code written
> for other platforms.

But it's not possible to do and get an advantage, unless you want to
throw away all your existing software.

B.

Ron

Oct 16, 2009, 6:25:09 AM

In message <gemini.krgxo30...@softrock.co.uk>
Vince M Hudd <sp...@softrock.co.uk> wrote:

> Gazza <use...@garethlock.com> wrote:
> > On Oct 11, 6:20 pm, Chris Johnson <chrisjohnson+n...@spamcop.net> wrote:
>
> > > Get the Archive DVD and all this will be at your finger tips at the
> > > click of a mouse. The HTML version has full (live) indexes and cross
> > > references.
>
> > This sounds like the best way to go then... I guess the next question is
> > where to get this DVD and how much do I have to fork out?
>
> http://www.archivemag.co.uk
>
I have just found a history of ARM, from Acorn through to ARM9
in the year 2000.

http://www.realworldtech.com/page.cfm?ArticleID=RWT110900000000&p=1

Real World Technologies - ARM's Race to Embedded World Domination
by Paul DeMone

Cheers Ron M

Jake Waskett

Oct 16, 2009, 9:26:39 AM

On Fri, 16 Oct 2009 10:52:32 +0100, Rob Kendrick wrote:

> On Fri, 16 Oct 2009 09:04:15 GMT
> Jake Waskett <ja...@waskett.org> wrote:
>
>> > Code from other EABI platforms. In the same way Linux and RISC iX
>> > have RISC OS personalities. There is a standard SWI number used for
>> > EABI, and a looser standard for different UNIX-flavoured system
>> > calls.
>>
>> Yes, I looked into this about six months ago. As I recall, it's SWI
>> 0x0 - the same as OS_WriteC. I guess it would be possible to set an
>> internal flag when loading non-RO apps, but it could be messy.
>
> Not as messy as the rest of the state that the WIMP stores on each
> process already.

Fair point. :-)

>> It seems to me that improving the performance of RISC OS "SWI"s would
>> be far more beneficial than emulation layers for running code written
>> for other platforms.
>
> But it's not possible to do and get an advantage, unless you want to
> throw away all your existing software.

I'm not sure if that's true. I've outlined a system that would work (I
reckon), but can't be sure without testing whether it would offer a
performance advantage or not. My guess is that although it would cost an
extra 5 or so instructions per "SWI", the improved d-cache behaviour would
lead to a small overall gain in performance.

Terje Slettebø

Oct 16, 2009, 5:06:45 PM

Just a quick comment to this earlier posting:

On 14 Oct, 00:36, Rob Kendrick <n...@rjek.com> wrote:
> On Tue, 13 Oct 2009 15:12:36 -0700 (PDT)

> But by ARMv9, we'll have something more CISCy than x86. No,
> seriously. It's dangerously close to that complex now. We already get
> to deal with the joys of remembering which instructions can be
> conditionally executed, and which only work on even-numbered registers,
> which operate on which halves of words, etc etc.

>
> ARM is no longer elegant. ARM is now a CPU designed for performance,
> and not for the joy of the programmer.

A few days ago, I saw a great interview with Steve Furber, one of the
people behind the development of the ARM processor
(http://www.computinghistory.org.uk/det/5438/Steve-Furber-Interview-17-08-2009/),
and it's quite clear that two of the aims behind the ARM processor were
performance and low interrupt latency.

It may be argued that these things are still driving the design, in
particular performance/Watt.

In other words, the "joy of the programmer" doesn't appear to have
been a design aim, and in fact, RISC processors have tended to be more
difficult to program in assembly than CISC processors, due to it
typically requiring more instructions to perform a task, compared to
the CISC version.

Fortunately, the ARM marries ease of programming with performance,
but my point is that it appears they are still following the original
aims, even if it has led to a drastically more complex processor
(especially due to all the specialised SIMD, etc. instructions).

Anyway, what is elegant is subjective, and I still find ARM overall to
be an elegant processor (in particular with regard to other mainstream
processors), and SIMD/FP support are definitely useful additions.

Rob Kendrick

Oct 16, 2009, 7:35:31 PM

On Fri, 16 Oct 2009 14:06:45 -0700 (PDT)
Terje Slettebø <tsle...@gmail.com> wrote:

> A few days ago, I saw a great interview with Steve Furber, one of the
> people behind the development of the ARM processor
> (http://www.computinghistory.org.uk/det/5438/Steve-Furber-Interview-17-08-2009/),
> and it's quite clear that two of the aims behind the ARM processor was
> performance, and low interrupt latency.

tbh, I had always assumed the main aim was to build something that
worked, and something cheap :)

> In other words, the "joy of the programmer" doesn't appear to have
> been a design aim, and in fact, RISC processors have tended to be more
> difficult to program in assembly than CISC processors, due to it
> typically requiring more instructions to perform a task, compared to
> the CISC version.

On the other hand, the instruction set was devised by a programmer who
loved the way 6502 worked.

> Anyway, what is elegant is subjective, and I still find ARM overall to
> be an elegant processor (in particular with regard to other mainstream
> processors), and SIMD/FP support are definitely useful additions.

You may also like PowerPC.

Steve Fryatt

Oct 16, 2009, 8:51:22 PM

Rob Kendrick <nn...@rjek.com> wrote:

> On Fri, 16 Oct 2009 14:06:45 -0700 (PDT) Terje Slettebø
> <tsle...@gmail.com> wrote:
>
> > A few days ago, I saw a great interview with Steve Furber, one of the
> > people behind the development of the ARM processor
> > (http://www.computinghistory.org.uk/det/5438/Steve-Furber-Interview-17-08-2009/),
> > and it's quite clear that two of the aims behind the ARM processor was
> > performance, and low interrupt latency.
>
> tbh, I had always assumed the main aim was to build something that worked,
> and something cheap :)

The interrupt latency issue is one that Steve seems to mention a lot in
these talks. From memory (having heard a couple of his recent Acorn-group
talks), the problem was that they wanted to design the Archimedes (as it
became) using the same techniques as the Beeb. The Beeb relied on
interrupts to operate, and Steve and Sophie found that the CISC processors
of the day were, in the worst case, simply not up to things like disc or
network access.

--
Steve Fryatt - Leeds, England

http://www.stevefryatt.org.uk/

Terje Slettebø

Oct 17, 2009, 3:30:58 AM

On 17 Oct, 01:35, Rob Kendrick <n...@rjek.com> wrote:
> On Fri, 16 Oct 2009 14:06:45 -0700 (PDT)
> Terje Slettebø <tslett...@gmail.com> wrote:
> > A few days ago, I saw a great interview with Steve Furber, one of the
> > people behind the development of the ARM processor
> > (http://www.computinghistory.org.uk/det/5438/Steve-Furber-Interview-17-08-2009/),
> > and it's quite clear that two of the aims behind the ARM processor was
> > performance, and low interrupt latency.
>
> tbh, I had always assumed the main aim was to build something that
> worked, and something cheap :)

I assume you're joking, here, because of your earlier positive
comments on ARM, at least the earlier versions. :) Clearly, the
simplest would be to use an off-the-shelf processor, but they didn't
find any they were satisfied with, so they decided to try and make
their own.

In a way, I guess you can say that simplicity was part of the original
aim, because they were (intentionally) set up in such a way that they
had to keep the design very simple...

> > In other words, the "joy of the programmer" doesn't appear to have
> > been a design aim, and in fact, RISC processors have tended to be more
> > difficult to program in assembly than CISC processors, due to it
> > typically requiring more instructions to perform a task, compared to
> > the CISC version.
>
> On the other hand, the instruction set was devised by a programmer who
> loved the way 6502 worked.

Yes, and I found that a cute little processor, too (I did some
assembly programming for it on the BBC Micro).

Don't get me wrong: I like simplicity, as well. Just so you know where
I'm coming from, these are some of the things I in particular
appreciate about the ARM (which were even more unique at the time it
was released):

16 32-bit general-purpose registers
Three-register architecture
Conditional execution of every instruction
Optional updating of the PSR
Optional shift of one of the operands
32-bit instructions
Simple, yet complete instruction set

The extASM assembler (mentioned in an earlier message) augments this
by providing "auto-expansion": it increases the range of operation of
each instruction by automatically changing it, or substituting it
with several instructions. For example:

MOV R0,#&1234 -> MOV R0,#&34 : ORR R0,R0,#&1200
AND &FFFFFF -> BIC &FF000000
LDR R0,label_outside_range -> ADR R0,label_outside_range (auto-expanded) : LDR R0,[R0]
etc.

This makes it possible to write assembly code with less concern for
the limitations of the instructions (although you should understand how
they are transformed, when needed, so that you don't inadvertently
create inefficient code).
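
To make the auto-expansion idea concrete, here is a minimal C sketch of the rule behind it: an ARM data-processing immediate must be an 8-bit value rotated right by an even amount, so a constant that doesn't fit has to be built from several pieces. The function names are made up for illustration, and the greedy split is only an assumption about one way to do it; it is not extASM's actual algorithm and won't always find the shortest sequence.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits. */
static uint32_t rol32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* An ARM data-processing immediate is an 8-bit value rotated right by an
 * even amount, so v is encodable if some even left-rotation fits in 8 bits. */
static int is_arm_immediate(uint32_t v)
{
    unsigned rot;
    for (rot = 0; rot < 32; rot += 2)
        if (rol32(v, rot) <= 0xFFu)
            return 1;
    return 0;
}

/* Hypothetical helper (not extASM's code): greedily split v into chunks
 * that are each encodable and OR together to give v, e.g. for a MOV
 * followed by ORRs. Returns the chunk count, or -1 if more than max
 * chunks would be needed. */
static int split_immediate(uint32_t v, uint32_t *chunks, int max)
{
    int n = 0;
    while (v != 0) {
        unsigned shift = 0;
        uint32_t chunk;
        if (n == max)
            return -1;
        /* Find the lowest non-zero bit pair, so the shift stays even. */
        while (((v >> shift) & 3u) == 0u)
            shift += 2;
        chunk = v & (0xFFu << shift);   /* take up to 8 bits from there */
        chunks[n++] = chunk;
        v &= ~chunk;
    }
    return n;
}
```

With this sketch, &1234 splits into &234 and &1000 (a MOV followed by an ORR), which is a different but equally valid pair from the &34/&1200 shown above. It also shows why AND with &FFFFFF has to become BIC with &FF000000: only the complement is encodable.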

Now, some of the things in the above list don't quite hold, anymore,
namely:

Conditional execution of every instruction
Simple, yet complete instruction set

I'm no expert in microprocessor design, and maybe real-life use has a
need for 30 or so different variations of the multiply instruction, as
well as the plethora of other instructions in ARMv7, some of them
appearing to be very specialised.

I, too, prefer general facilities, compared to specialised features
and instructions (which is why I liked ARM so much when it came, and
why I disliked, and still do, the x86, with its abundance of special
cases and baggage from the past).

However, I also realise that for the ARM to be able to compete in its
market (or any market), it may need to acquire specialised or esoteric
features. I don't care much about Thumb, either (and it won't be
implemented in extASM), but I understand that in some segments of
their market, code density is important (even if it reduces
performance), and it is, after all, their market.

I would have preferred that ARM focused more on performance, and less
on low power use, competing against processors like x86, rather than
aiming for the mobile market, which could have avoided quite a bit of
these things (such as Thumb). However, I also realise that they may
not have had much chance against giants like Intel in this market, and
it appears that going for the niche market of mobile and embedded
devices has been a winning move.

Since you're clearly not happy about the way the ARM processor has
evolved, let me ask you this: What would you have preferred that ARM
(the company) did differently, and would it have made business sense
(enabling them to survive and grow their business)?

> > Anyway, what is elegant is subjective, and I still find ARM overall to
> > be an elegant processor (in particular with regard to other mainstream
> > processors), and SIMD/FP support are definitely useful additions.
>
> You may also like PowerPC.

I don't know much about it (and finding an assembly code reference for
it online has proven difficult), but from what I remember, having
looked at it some in the past, I think it lacks quite a few of the
things that I like about ARM, as given in the above list.

Just to be clear: I like simplicity, elegance, and generality of
design just as much as you. It's just that the reality of the market
you're in will shape the design, and sometimes it may be difficult to
keep some of these things as a design evolves.

Case in point: As the number of instructions grew, it has apparently
been difficult to keep the feature of having all instructions
conditionally executable, and given the choice between, say,
unconditional SIMD instructions, and no SIMD instructions at all, I
think the former makes sense as a choice.

Theo Markettos

Oct 17, 2009, 12:15:30 PM

Steve Fryatt <ne...@stevefryatt.org.uk> wrote:
> The interrupt latency issue is one that Steve seems to mention a lot in
> these talks. From memory (having heard a couple of his recent Acorn-group
> talks), the problem was that they wanted to design the Archimedes (as it
> became) using the same techniques as the Beeb. The Beeb relied on
> interrupts to operate, and Steve and Sophie found that the CISC processors
> of the day were, in the worst case, simply not up to things like disc or
> network access.

It also allowed them to cut some corners on other hardware. For example,
floppy controllers since the original IBM PC were designed for DMA. They
don't have big buffers, but the idea is that they DMA the floppy data
straight into memory. That means a DMA controller and some extra latches to
enable memory to be driven, which cost money (and potentially silicon area).

The Archimedes architecture has fast interrupts (FIQs), and so saves the
cost of a DMA controller. When the floppy controller produces a byte, the
processor gets a FIQ, stops what it is doing, grabs the byte, stores it in
memory then gets back to work. Due to the separate overlay registers
R8-R14, there's minimal overhead in switching to the FIQ routine. Econet
works on the same principle.

This means you can throw away the extra hardware demands, but it doesn't
scale. If the floppy controller is producing 10Kbyte/s, you can cope with
10K interrupts/sec. If it's producing 10Mbytes/s, you can't. The extra
hardware really does win in the end because it decouples the CPU from the
I/O.
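
A back-of-envelope version of that scaling argument, as a small C sketch. The 8 MHz clock and the 40 cycles per FIQ used in the note below are illustrative assumptions rather than measured figures, and `irq_cpu_load` is a made-up name.

```c
/* Fraction of CPU time spent servicing one interrupt per byte.
 * 1.0 means the processor does nothing else; above 1.0 it cannot keep up.
 * All three inputs are assumptions supplied by the caller. */
static double irq_cpu_load(double bytes_per_sec,
                           double cycles_per_irq,
                           double cpu_hz)
{
    return bytes_per_sec * cycles_per_irq / cpu_hz;
}
```

At 10 Kbyte/s, `irq_cpu_load(10e3, 40.0, 8e6)` comes to about 5% of the CPU. At 10 Mbyte/s the same sum asks for 50 times more cycles than the processor has, which is the point at which only the extra DMA hardware will do.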

But the idea of FIQs is a good one, especially in the embedded sphere. And
it does allow you to do things you couldn't previously do, with (AFAICS)
minimal knock-on effects for other things.

Theo

druck

Oct 19, 2009, 2:34:06 PM

Jake Waskett wrote:
> On Wed, 14 Oct 2009 10:53:46 +0100, Rob Kendrick wrote:
>
>> On Wed, 14 Oct 2009 09:04:18 GMT
>> Jake Waskett <ja...@waskett.org> wrote:
>>
>>> I wonder if RISC OS could support another system call convention
>>> alongside the classical SWI interface. It might be necessary to use
>>> the old interface for some SWIs with register usage that made change
>>> impossible, but I wouldn't be surprised if most could be supported. It
>>> might even be possible to dynamically patch applications to use the new
>>> interface.
>>>
>> You can most likely implement the EABI syscall system as a stand-alone
>> module in RISC OS that registers a normal SWI handler.
>

> As I recall, EABI uses SWI 0, so I don't think it could be implemented
> without clashing with OS_WriteC. And as you say, even if it could be
> implemented, it wouldn't buy any benefit.

We already have a similar mechanism in OS_CallASWI, but it can't be
patched into applications as it requires an extra register to be
passed to the SWI.

---druck

Gazza

Nov 3, 2009, 9:53:51 AM

When I started this thread a while back looking for a beginner's guide
to programming Acorn machines in BBC BASIC's built in assembler, I
never knew I'd open such a can of worms. It's been fun reading it.

Theo Markettos

Nov 3, 2009, 11:11:52 AM

So what did you end up doing? Did we convince you that you really wanted to
learn Haskell, or .NET, or Z? ;-)

Theo

Gazza

Nov 4, 2009, 1:10:48 PM

On Nov 3, 4:11 pm, Theo Markettos <theom+n...@chiark.greenend.org.uk>
wrote:

Well... I think I'll eventually end up doing what I started out
thinking about in the sense that I'll one day get that Invaders clone
you saw running at the show on the 3rd October moved from raw BASIC
to, at least, partial assembler. Once I have the speed and knowledge
from that, I might try and implement some sort of starfield or
background overlay with the space station in the background a la the
original arcade hardware. The other thing that comes to mind is maybe
try and implement the bunkers a la the original with pixel level
destruction rather than bumping them around like pinball targets. The
other things I'd like to get right are the timings (running the game
on a 48MHz A7000+ is about right, whereas it's really too fast on an
RPC) and the accuracy of some of the collision detection code (trying
to destroy a zig-zag type invader missile is a bit hit and miss). But
where to start??
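
For the timing problem, one common approach is to tie the game logic to a clock rather than to how fast the machine can loop. A minimal sketch in C, assuming you can read a centisecond counter (BASIC's TIME, or OS_ReadMonotonicTime); `ticks_due` and its parameters are hypothetical names, and the same arithmetic carries straight over to BASIC.

```c
/* Hypothetical helper: given the current centisecond clock, work out how
 * many whole game ticks have elapsed since *last_cs and advance *last_cs
 * by exactly that many ticks, so leftover time carries over to the next
 * call. tick_cs must be non-zero. */
static unsigned ticks_due(unsigned long now_cs,
                          unsigned long *last_cs,
                          unsigned tick_cs)
{
    unsigned long elapsed = now_cs - *last_cs;   /* unsigned: wrap-safe */
    unsigned n = (unsigned)(elapsed / tick_cs);
    *last_cs += (unsigned long)n * tick_cs;
    return n;
}
```

The main loop would call this once per pass and run the movement code that many times before redrawing, so a faster machine redraws more often but the invaders move at the same speed.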

I'd want it 32 bit friendly and I'd want it to work pretty much as it
does now. I'd probably just convert the event handlers to asm and
leave the rest of the stuff in BASIC. After all it's the animation and
re-draw stuff that needs to be quick. This would have two advantages
over doing it in C. I would be able to work on the existing code-base
rather than re-coding from scratch. I wouldn't have to learn the ins
and outs of how BASIC stores files written in PRINT# format.

I might do it in C first, just to see if compilation as opposed to
BASIC mashing is any faster, but I've always fancied trying to learn
asm. Simple generic ARM certainly seems to be more palatable to me
than x86. Certainly to begin with anyhow. It also has the advantage
that there's a perfectly good assembler already built into the
machine... lol

Terje Slettebø

Nov 5, 2009, 5:12:49 AM

On 4 Nov, 19:10, Gazza <use...@garethlock.com> wrote:
>
> Well... I think I'll eventually end up doing what I started out
> thinking about in the sense that I'll one day get that Invaders clone
> you saw running at the show on the 3rd October moved from raw BASIC
> to, at least, partial assembler.

> <snip>

> I might do it in C first, just to see if compilation as opposed to
> BASIC mashing is any faster, but I've always fancied trying to learn
> asm. Simple generic ARM certainly seems to be more palatable to me
> than x86. Certainly to begin with anyhow. It also has the advantage
> that there's a perfectly good assembler already built into the
> machine... lol

Sounds like fun. :)

I think you will find learning assembly code rewarding... I agree with
you about x86 vs ARM: I would never want to write x86 assembly code...
Then I'd rather go for a higher-level language. However, I've found
writing ARM assembly quite fun and rewarding, and as mentioned in an
earlier post, I'm currently updating the extASM ARM assembler, which
is itself written in assembly (about 25,000 lines of it), to the
latest ARM models (ARMv7).

extASM is a free-standing assembler, but if it turns out to be a
practical possibility, now that we have RISC OS Open, it might also be
used as the basis for an updated BASIC assembler...

There are a _lot_ of new instructions in ARMv7, compared to
ARMv3/ARMv4 (the ones currently in use in computers like RiscPC),
including a powerful floating point instruction set (something that has
been sorely lacking for years), long multiply instructions, and SIMD
instructions, and these are currently unavailable in the BASIC
assembler.

Also, to take advantage of them, you need to run RISC OS on hardware
with the latest ARM models, such as the BeagleBoard, or Touch Book (if
we get RISC OS ported to the latter).

Regards,

Terje

P.S. Feel free to contact me at tsle...@broadpark.no, if you have
questions about ARM assembly, or if you'd like to discuss something.

jl

Nov 5, 2009, 6:58:11 AM

In article
<ae63c1c0-e236-43a2...@15g2000yqy.googlegroups.com>,

Gazza <use...@garethlock.com> wrote:
> On Nov 3, 4:11 pm, Theo Markettos <theom+n...@chiark.greenend.org.uk>
> wrote:
> > Gazza <use...@garethlock.com> wrote:
> > > When I started this thread a while back looking for a beginner's guide
> > > to programming Acorn machines in BBC BASIC's built in assembler, I
> > > never knew I'd open such a can of worms. It's been fun reading it.
> >
> > So what did you end up doing? Did we convince you that you really wanted to
> > learn Haskell, or .NET, or Z? ;-)
> >
> > Theo

> Well... I think I'll eventually end up doing what I started out
> thinking about in the sense that I'll one day get that Invaders clone
> you saw running at the show on the 3rd October moved from raw BASIC
> to, at least, partial assembler. Once I have the speed and knowledge
> from that, I might try and implement some sort of starfield or
> background overlay with the space station in the background a la the
> original arcade hardware.

I did that once for my program !Planets. A small machine code routine
which wrote directly to screen memory. It is a hack and only works in one
mode - but work it did.

Jochen

--

------------------------------------
Limavady and the Roe Valley
Roe Valley News Browser
http://www.jochenlueg.freeuk.com

Christopher Bazley

Nov 6, 2009, 2:22:13 PM

In message <02083a60-2a2c-4ee1...@c3g2000yqd.googlegroups.com>
Gazza <use...@garethlock.com> wrote:

> I'm interested in cutting my teeth in ASM. Are there any resources out
> there for beginners. I've been writing in BASIC for years and I think
> I'm ready to move up a gear.

Up a gear? I'm not sure that metaphor works: In higher gears the
wheels of a car rotate more times per revolution of the engine's
crankshaft. Likewise, a single statement in a high-level programming
language typically encapsulates more behaviour than a statement in a
low-level language. You would use assembly language to get more
'traction'! :-)

Anyway, enough of strained analogies.

> I would need something fairly recent. By
> recent, I mean aware of the changes required to be 26/32 neutral. I
> would probably start off with using the assembler built into BASIC
> rather than buying any development tools, so I'd prefer something that
> was either neutral in that respect or something that concentrated on
> the BASIC assembler.

I started out with ARM assembly language programming using Nick
Roberts's ASM ( http://tigger.orpheusweb.co.uk/programs/misc.html ).
This is similar to Acorn's ObjAsm in that it comes with a desktop
front-end powered by the FrontEnd module and generates AOF files
suitable for linking with compiled C source code. ASM has some nice
features (named data structures, conditional assembly, macros and a
few built-in functions for use in constant expressions) and the major
benefit that it is free.

However, I eventually abandoned ASM in favour of Acorn's ObjAsm
because ASM is far less powerful than the assembler built into BBC
BASIC. I needed an assembler with all the features of the BASIC
assembler but which could generate AOF output for linking with new
code that I had written in C for Star Fighter 3000. As a result, the
source code for Star Fighter 3000 is currently a mixture of C, ASM and
ObjAsm assembly language!

The trouble was that ASM doesn't support loops and its support for
conditional assembly is severely limited in that sections of code can
only be included or excluded depending on whether or not a 'flag' has
been defined. In contrast, ObjAsm allows arithmetic, boolean and
string variables with either local or global scope, and can evaluate
complex expressions as the condition for its WHILE and IF directives.

As an example of the kind of thing that would be impossible using ASM,
here is a snippet of the texture mapping code for SF3000:

        GBLA    text_compile
text_compile SETA firstmapper
        WHILE   text_compile <= 4
        ROUT
        [ text_compile = 1
text_res SETA 2
bonus_texture SETL {FALSE}
        |
        [ text_compile = 2
text_res SETA 1
bonus_texture SETL {FALSE}
        |
        [ text_compile = 3
text_res SETA 0
bonus_texture SETL {FALSE}
        |
        [ text_compile = 4
text_res SETA -1
bonus_texture SETL {FALSE}
        ]
        ]
        ]
        ]

Richard Murray provides an appraisal of various assemblers for RISC OS
machines here: http://www.heyrick.co.uk/assembler/apcsasm.html
I certainly disagree with his opinion that "ASM leaves objasm
standing", although no doubt someone will be along soon to say that I
have unfairly impugned ASM.

> I know I could move onto C, but with the two competing environments
> out there, which one? GCC or Norcroft?

If I'm quite honest, I don't think that time spent learning assembly
language in this day and age is time well spent (except perhaps for
the edification and amusement of enthusiasts).

I made the transition from programming BASIC to programming in C
almost exactly ten years ago. At the time I was disappointed with the
poor performance of early programs that I wrote in C (which no doubt
used floating point arithmetic liberally). In retrospect I suppose I
was still in thrall to the proud claims on the packaging of Archimedes
games that they were '100% hand-optimised machine code' (as if that
were a good thing!). However, I have since realised that using
efficient and appropriate algorithms is far more important than the
choice of language used to implement them.

If optimisation is required then an experienced C programmer can use
simple tricks like using automatic variables to avoid dereferencing a
pointer multiple times, whilst keeping in mind the limited number of
variable registers available within each function. (It is no faster to
load an automatic variable's value from the stack than it would be to
access a structure member by dereferencing a pointer, and the ARM has
already wasted time by copying that member's value to the stack!)
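
A small sketch of that trick, with a made-up structure and function names. Whether it helps depends on the compiler and on how many registers are free, so treat it as an illustration rather than a guaranteed win.

```c
#include <stddef.h>

/* Hypothetical structure holding a multiplier. */
struct gain { int scale; };

/* Reads g->scale through the pointer inside the loop. Because out[i]
 * might alias g->scale, the compiler may have to reload it each time. */
static void scale_naive(int *out, const int *in, size_t n,
                        const struct gain *g)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * g->scale;
}

/* Copies the member into an automatic variable first: one dereference,
 * and the value can stay in a register if one is free. */
static void scale_cached(int *out, const int *in, size_t n,
                         const struct gain *g)
{
    const int scale = g->scale;
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * scale;
}
```

Both functions give the same results as long as `out` doesn't overlap `*g`; the second simply tells the compiler that the multiplier cannot change during the loop.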

I often optimise my C programs after examining the ARM code produced
by the compiler, but nowadays I would almost never abandon the
abstraction and portability of C in favour of assembly language. It's
nice to be able to compile the same source code for other platforms,
even if the results are sub-optimal! :-)

My early attempts at writing C were stunted by the fact that I was
translating directly from BASIC (sometimes mentally, sometimes using a
program that I had written to do the conversion automatically). There
are superficial similarities because they are both imperative
languages, but C is so much more powerful than BASIC that I urge you
to discard your preconceptions and buy a decent textbook on C
programming. I like 'The Practice of Programming' by Kernighan and
Pike.

For example, the lack of function pointers in BASIC makes it almost
impossible to write reusable code that doesn't suffer from hopeless
interdependencies between modules. Other features of C which initially
passed me by were the ternary operator ?: (best used in moderation),
the fact that the three expressions following a 'for' statement can be
*anything* (whereas BASIC's FOR keyword can only be used to iterate
through a linear numeric series), and the nicer alternative to GOTO
provided by the 'break' and 'continue' statements.
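
Those four features fit into one short sketch. All the names here are invented for the example; nothing in it comes from Star Fighter 3000.

```c
#include <stddef.h>

/* Hypothetical singly linked list of integers. */
struct node { int value; struct node *next; };

/* A function pointer type: any function taking an int, returning a truth value. */
typedef int (*predicate)(int);

static int is_negative(int v) { return v < 0; }
static int is_even(int v)     { return v % 2 == 0; }

/* Count the values that satisfy 'pred', giving up at the first value that
 * satisfies 'stop'. The caller decides both tests, so this code never
 * needs to know what is being counted. */
static size_t count_matching(const struct node *head,
                             predicate pred, predicate stop)
{
    size_t count = 0;
    const struct node *n;
    /* The 'for' expressions walk a list, not a numeric series. */
    for (n = head; n != NULL; n = n->next) {
        if (stop(n->value))
            break;              /* leave the loop without a GOTO */
        if (!pred(n->value))
            continue;           /* skip to the next node */
        count++;
    }
    return count;
}

/* The ternary operator chooses between two values in one expression. */
static const char *plural(size_t n)
{
    return n == 1 ? "match" : "matches";
}
```

Passing `is_even` and `is_negative` to `count_matching` counts the even values up to the first negative one; swapping in different functions changes the behaviour without touching the loop.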

HTH,
--
Chris Bazley
Star Fighter 3000: http://starfighter.acornarcade.com/

Terje Slettebø

Nov 7, 2009, 6:32:42 PM

On 6 Nov, 20:22, Christopher Bazley
<christopher.baz...@blueyonder.co.uk> wrote:
> In message <02083a60-2a2c-4ee1-91f7-41894e394...@c3g2000yqd.googlegroups.com>

Hi Chris.

Great to hear from one of the authors (or maybe the author) of
Starfighter 3000, my favourite Archimedes game... :)

> Richard Murray provides an appraisal of various assemblers for RISC OS
> machines here:http://www.heyrick.co.uk/assembler/apcsasm.html
> I certainly disagree with his opinion that "ASM leaves objasm
> standing", although no doubt someone will be along soon to say that I
> have unfairly impugned ASM.

And extASM isn't even mentioned. Hrmmmf... :)

Granted, it's understandable, as hardly anyone knows about it, not
least because it has no (proper) homepage, and it hadn't been updated
for over a decade, until recently.

Still, even at that time, it supported essentially a superset of the
BASIC assembler functionality (including loops, conditional assembly,
arbitrarily complex int/float/string expressions, etc.).

> > I know I could move onto C, but with the two competing environments
> > out there, which one? GCC or Norcroft?
>
> If I'm quite honest, I don't think that time spent learning assembly
> language in this day and age is time well spent (except perhaps for
> the edification and amusement of enthusiasts).

I think that's enough justification, actually... :)

Besides, as mentioned in other posts, it's useful to know what is
happening "at the bottom", even when using a high-level language. It
could help to avoid or pinpoint performance problems, for example.

It can also help you understand how high-level programming languages
work. For example, I had no problem understanding the "pointer"
feature in C, after having done some assembly programming earlier:
It's simply the address of a variable (or some other entity)... :)

I've heard that some have had great conceptual problems with
understanding pointers, but that didn't happen to me...
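
For anyone coming from assembler, the correspondence can be shown in a few lines of C. The ARM instructions in the comments are only roughly what a compiler might emit, not output from any particular one, and `swap_ints` is just an example name.

```c
/* Swap two ints given their addresses. Under the ARM procedure call
 * standard the two pointers arrive in R0 and R1. */
static void swap_ints(int *a, int *b)
{
    int t = *a;   /* LDR R2,[R0]   load the word at address a */
    *a = *b;      /* LDR R3,[R1] : STR R3,[R0] */
    *b = t;       /* STR R2,[R1]   store the saved word at address b */
}
```

Calling `swap_ints(&x, &y)` passes the addresses of `x` and `y`, which is exactly what you would put in R0 and R1 by hand.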

> I made the transition from programming BASIC to programming in C
> almost exactly ten years ago. At the time I was disappointed with the
> poor performance of early programs that I wrote in C (which no doubt
> used floating point arithmetic liberally). In retrospect I suppose I
> was still in thrall to the proud claims on the packaging of Archimedes
> games that they were '100% hand-optimised machine code' (as if that
> were a good thing!). However, I have since realised that using
> efficient and appropriate algorithms is far more important than the
> choice of language used to implement them.

Also, nowadays, a compiler is typically able to do at least as good a
job as an assembly programmer.

Still, there are some areas where I've found assembly programming to
fit quite nicely, and an assembler is one of them. :) Assembly is good
at "bit fiddling", and there's a lot of that in an assembler/
disassembler.

Still, if I were to write an assembler today, I'd likely do it in
C++, using the Spirit parser framework, or something like that, which
would enable one to work at a higher level of abstraction, and leave
more of the "boilerplate" to the compiler/library.

> If optimisation is required then an experienced C programmer can use
> simple tricks like using automatic variables to avoid dereferencing a
> pointer multiple times, whilst keeping in mind the limited number of
> variable registers available within each function. (It is no faster to
> load an automatic variable's value from the stack than it would be to
> access a structure member by dereferencing a pointer, and the ARM has
> already wasted time by copying that member's value to the stack!)

Here you are using your knowledge of assembly code to point out
possible performance optimisations in C code... Was it still not time
well spent learning...? ;)

Regards,

Terje

chris...@bigfoot.com

Nov 8, 2009, 6:22:42 AM

On Nov 7, 11:32 pm, Terje Slettebø <tslett...@gmail.com> wrote:
> On 6 Nov, 20:22, Christopher Bazley
>
> <christopher.baz...@blueyonder.co.uk> wrote:
> > In message <02083a60-2a2c-4ee1-91f7-41894e394...@c3g2000yqd.googlegroups.com>
>
> Hi Chris.
>
> Great to hear from one of the authors (or maybe the author) of
> Starfighter 3000, my favourite Archimedes game... :)

I'd like to take credit for that, but I really can't. All I did was
hack the copy protection, fix lots of bugs, rewrite the sound player
module, make the game code into a state machine, and bolt on a desktop
interface. A huge amount of work, but not really the part that people
appreciate. :-)

> > If optimisation is required then an experienced C programmer can use
> > simple tricks like using automatic variables to avoid dereferencing a
> > pointer multiple times, whilst keeping in mind the limited number of
> > variable registers available within each function. (It is no faster to
> > load an automatic variable's value from the stack than it would be to
> > access a structure member by dereferencing a pointer, and the ARM has
> > already wasted time by copying that member's value to the stack!)
>
> Here you are using your knowledge of assembly code to point out
> possible performance optimisations in C code... Was it still not time
> well spent learning...? ;)

You make a good point, although conceivably that kind of thing could
be taught without knowledge of assembly language, by knowing certain
features of the target architecture/PCS.

For example:
1) In each function a maximum of 6 fast temporary storage locations
are available to hold automatic variables of type 'char', 'int',
'short int', 'long int' or 'void *'.
2) If too many automatic variables of the above types are declared
then some will be held in slower temporary storage.
3) Using the '&' operator to get a pointer to an automatic variable
forces it to be held in slower storage.
4) Automatic structures are always held in slower storage locations.
5) Calling a function or assigning to an l-value through a pointer
invalidates any values previously accessed through pointers that might
otherwise have been cached in fast temporary storage locations.
...etc
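
A minimal sketch of the first and last rules in that list, assuming a made-up structure and made-up function names (none of this comes from a real program). The second function copies the structure members into automatic variables once, so they need not be reloaded after every store through 'out':

```c
#include <stddef.h>

/* Hypothetical structure, used only for illustration. */
struct sprite {
    int scale;
    int offset;
};

/* Naive version: 's->scale' and 's->offset' are read through the pointer
 * inside the loop. Because 'out' might alias '*s', the compiler may have
 * to reload them after each store through 'out'. */
void transform_naive(const struct sprite *s, const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * s->scale + s->offset;
}

/* Cached version: the two members are copied into automatic variables
 * once, so they can stay in fast storage for the whole loop. */
void transform_cached(const struct sprite *s, const int *in, int *out, size_t n)
{
    const int scale = s->scale;
    const int offset = s->offset;

    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * scale + offset;
}
```

Both functions give the same results as long as 'out' does not overlap '*s'. Whether the second is actually faster depends on the compiler and on how many other automatic variables are competing for registers.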

--
Christopher Bazley

Gavin Wraith

Nov 8, 2009, 7:27:32 AM

In message <23f71761-f389-49be...@v25g2000yqk.googlegroups.com>
chris...@bigfoot.com wrote:

> You make a good point, although conceivably that kind of thing could
> be taught without knowledge of assembly language, by knowing certain
> features of the target architecture/PCS.

One of my criteria for a good programming language is to what extent it
allows the programmer a useful mental picture of what actually happens
when the program runs - the operational semantics. No picture can
be entirely accurate; it is a question of compromise. Assembly languages
can be particularly useful educationally in this way. However, I think
that particular architectures, adopted long in the past, tend to have
a restricting influence on the choice of algorithms and datastructures
that are encouraged by the languages influenced by them. C, for example,
stems from the PDP11 era, when everything was based on buffers and pointers,
as opposed, say, to the use of hashing; so C promotes the mental picture that
pointer values are actual addresses, as opposed to any fancier sort of
descriptors that the compiled code may actually be using. I am not arguing
that there is anything wrong with this, just that we can be lulled into
making assumptions that may in some circumstances be limiting.

--
Gavin Wraith (ga...@wra1th.plus.com)
Home page: http://www.wra1th.plus.com/

Richard Russell

Nov 8, 2009, 10:07:39 AM

On Nov 5, 10:12 am, Terje Slettebø <tslett...@gmail.com> wrote:
> I agree with you about x86 vs ARM: I would never want to write x86 assembly code...

It's simply a case of what you're used to. Having been an x86
assembly language programmer for more than 20 years, I find it much
easier to understand and program than ARM assembler. For me, thinking
in terms of conditional instructions rather than conditional jumps is
quite alien, requiring more of an intellectual leap from high-level
programming.

It's true that the old 16-bit x86 was horrible to program, because of
the asymmetrical nature of its instruction set (certain instructions
only being able to use certain registers). Modern 32-bit IA-32 code
is much nicer, because (with few exceptions) all the registers are
equivalent.

I'm not qualified to comment on RISC OS, but on MS Windows using a
hybrid of BBC BASIC and assembler is a very powerful technique. All
my 'large' applications are written that way. In the case of games
programming, David Williams' site illustrates the sort of thing the
technique can achieve (you'll need a Windows PC to run the programs,
but there are YouTube videos of some of them):

http://www.bb4w-games.com/

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

Terje Slettebø

Nov 8, 2009, 5:31:32 PM

On 8 Nov, 16:07, Richard Russell <n...@rtrussell.co.uk> wrote:

Hi Richard.

> On Nov 5, 10:12 am, Terje Slettebø <tslett...@gmail.com> wrote:
>
> > I agree with you about x86 vs ARM: I would never want to write x86 assembly code...
>
> It's simply a case of what you're used to. Having been an x86
> assembly language programmer for more than 20 years, I find it much
> easier to understand and program than ARM assembler. For me, thinking
> in terms of conditional instructions rather than conditional jumps is
> quite alien, requiring more of an intellectual leap from high-level
> programming.
>
> It's true that the old 16-bit x86 was horrible to program, because of
> the asymmetrical nature of its instruction set (certain instructions
> only being able to use certain registers). Modern 32-bit IA-32 code
> is much nicer, because (with few exceptions) all the registers are
> equivalent.

You're right, I guess it depends to a large extent on what you're used
to, and also, x86 and ARM have become closer to each other: x86 has
become more regularised, as you say (mostly, any register may be used
for any instruction, and we don't have the horrible segmented memory
system, anymore...), while ARM has grown quite a bit more complex (a
lot more instructions).

Also, I understand what you mean about the conditionally executed
instructions. They take some getting used to, but once you've wrapped
your head around them, they can lead to very succinct code (and
eliminate many jumps). For example:

IF (A=1 OR B=2) AND C=3 AND D=4 THEN...

CMP R0,#1
CMPNE R1,#2
CMPEQ R2,#3
CMPEQ R3,#4
BEQ ...
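
To check the logic of a sequence like this without an ARM to hand, it can be modelled in C. This is only a sketch: it tracks just the Z flag (the only one that matters for EQ/NE), and the function names are made up for illustration.

```c
#include <stdbool.h>

/* Models only the Z (equal) flag that CMP sets in the status register. */
static bool cmp(int reg, int imm) { return reg == imm; }

/* Emulates the five-instruction ARM sequence step by step and returns
 * true when the final BEQ would be taken. */
bool arm_sequence_branches(int r0, int r1, int r2, int r3)
{
    bool z;

    z = cmp(r0, 1);            /* CMP   R0,#1 : always executed             */
    if (!z) z = cmp(r1, 2);    /* CMPNE R1,#2 : only if A was not 1         */
    if (z)  z = cmp(r2, 3);    /* CMPEQ R2,#3 : only if (A=1 OR B=2) held   */
    if (z)  z = cmp(r3, 4);    /* CMPEQ R3,#4 : only if C=3 held as well    */
    return z;                  /* BEQ ...     : taken when Z is still set   */
}

/* The high-level condition the sequence is meant to implement. */
bool basic_condition(int a, int b, int c, int d)
{
    return (a == 1 || b == 2) && c == 3 && d == 4;
}
```

Running both functions over every combination of small values for the four inputs should give identical answers, which is a quick way to convince yourself the condition codes have been chained correctly.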

By the way, your BBC for Windows is very nice. :) (I've got a licensed
copy of it)

Regards,

Terje

Richard Russell

Nov 8, 2009, 6:30:05 PM

On Nov 8, 10:31 pm, Terje Slettebø <tslett...@gmail.com> wrote:
> IF (A=1 OR B=2) AND C=3 AND D=4 THEN...
>
> CMP R0,#1
> CMPNE R1,#2
> CMPEQ R2,#3
> CMPEQ R3,#4
> BEQ ...

Certainly that is elegant and succinct, but there's no guarantee it
will result in better performance or shorter code. For example, the
most straightforward x86 equivalent (listed below) is exactly the same
length (20 bytes) and, on modern processors with branch prediction,
conditional jumps aren't expensive.

If you know in advance something about the statistics (for example you
know that D is hardly ever 4) then doing that test first and bailing
out if it fails may be more efficient than testing all four registers
every time.
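
As a rough sketch of that idea in C (the function name and the 'tests' counter are purely illustrative, and the assumption that D is rarely 4 is just the example given above):

```c
#include <stdbool.h>

/* Evaluates (A=1 OR B=2) AND C=3 AND D=4, testing D first on the
 * assumption that it is the comparison most likely to fail.
 * '*tests' receives the number of comparisons actually performed. */
bool condition_rare_first(int a, int b, int c, int d, int *tests)
{
    *tests = 1;
    if (d != 4) return false;   /* bail out early in the common case */
    *tests = 2;
    if (c != 3) return false;
    *tests = 3;
    if (a == 1) return true;
    *tests = 4;
    return b == 2;
}
```

When D is not 4 only one comparison is made, whereas the straight-line conditional version always steps through all four compare instructions.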

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

83 F8 01 cmp eax,1
74 05 je l1
83 FB 02 cmp ebx,2
75 08 jne l2
83 F9 03 .l1 cmp ecx,3
75 03 jne l2
83 FA 04 cmp edx,4
74 xx .l2 je ...

Rob Kendrick

Nov 8, 2009, 7:18:36 PM

On Sun, 8 Nov 2009 15:30:05 -0800 (PST)
Richard Russell <ne...@rtrussell.co.uk> wrote:

> > IF (A=1 OR B=2) AND C=3 AND D=4 THEN...
> >
> > CMP R0,#1
> > CMPNE R1,#2
> > CMPEQ R2,#3
> > CMPEQ R3,#4
> > BEQ ...
>
> Certainly that is elegant and succinct, but there's no guarantee it
> will result in better performance or shorter code. For example, the
> most straightforward x86 equivalent (listed below) is exactly the same
> length (20 bytes) and, on modern processors with branch prediction,
> conditional jumps aren't expensive.

The idea here for the ARM is that doing the above and not having branch
prediction is a lot simpler for similar performance, and thus more
power efficient. Doing the above for anything much more complex
however doesn't win you anything, because it's cheaper to take the
pipeline flush than execute all those no-ops. MIPS has an elegant
approach to this, though one that is quite confusing for newcomers: you
have something called the branch delay slot. The instruction after a branch will be
executed regardless; at this point, it's already made its way through
half the pipeline, so you might as well finish it off.
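
A toy model of a delay slot in C, for anyone who finds it easier to see than to read about. The three-instruction "machine" here is invented for illustration and has nothing to do with the real MIPS encoding; it only shows the ordering.

```c
#include <stddef.h>

/* Invented instruction set: add a constant, branch unconditionally, halt. */
enum op { OP_ADDI, OP_BRANCH, OP_HALT };

struct insn {
    enum op op;
    int     value;   /* ADDI: amount to add; BRANCH: index of the target */
};

/* Runs a toy program on a single accumulator, modelling a one-instruction
 * branch delay slot: the instruction after a branch is always executed
 * before control reaches the branch target. */
int run_with_delay_slot(const struct insn *prog)
{
    int acc = 0;
    size_t pc = 0;      /* instruction being executed now   */
    size_t next = 1;    /* instruction that will run after it */

    while (prog[pc].op != OP_HALT) {
        size_t after = next + 1;
        if (prog[pc].op == OP_ADDI)
            acc += prog[pc].value;
        else /* OP_BRANCH */
            after = (size_t)prog[pc].value;  /* takes effect after the slot */
        pc = next;
        next = after;
    }
    return acc;
}
```

With a program of "add 1; branch to 4; add 10; add 100; add 1000; halt", the "add 10" sits in the delay slot and still runs, while the "add 100" is skipped.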

B.

Terje Slettebø

Nov 9, 2009, 12:53:17 PM

On 9 Nov, 01:18, Rob Kendrick <n...@rjek.com> wrote:
> On Sun, 8 Nov 2009 15:30:05 -0800 (PST)
> Richard Russell <n...@rtrussell.co.uk> wrote:
> > > IF (A=1 OR B=2) AND C=3 AND D=4 THEN...
>
> > > CMP R0,#1
> > > CMPNE R1,#2
> > > CMPEQ R2,#3
> > > CMPEQ R3,#4
> > > BEQ ...
>
> > Certainly that is elegant and succinct, but there's no guarantee it
> > will result in better performance or shorter code. For example, the
> > most straightforward x86 equivalent (listed below) is exactly the same
> > length (20 bytes) and, on modern processors with branch prediction,
> > conditional jumps aren't expensive.
>
> The idea here for the ARM is that doing the above and not having branch
> prediction is a lot simpler for similar performance, and thus more
> power efficient.

Also, I'm no processor expert, but it's my understanding that with
today's deep pipelines, a branch misprediction can be quite costly
(more than earlier, shorter pipelines).

Regards,

Terje

Rob Kendrick

Nov 9, 2009, 12:58:52 PM

On Mon, 9 Nov 2009 09:53:17 -0800 (PST)
Terje Slettebø <tsle...@gmail.com> wrote:

> > The idea here for the ARM is that doing the above and not having
> > branch prediction is a lot simpler for similar performance, and
> > thus more power efficient.
>
> Also, I'm no processor expert, but it's my understanding that with
> today's deep pipelines, a branch misprediction can be quite costly
> (more than earlier, shorter pipelines).

Yes, but we have speculative and out-of-order execution to get around
that :)

B.

druck

Nov 9, 2009, 4:50:10 PM

Terje Slettebø wrote:
> You're right, I guess it depends to a large extent on what you're used
> to, and also, x86 and ARM have become closer to each other: x86 has
> become more regularised, as you say (mostly, any register may be used
> for any instruction, and we don't have the horrible segmented memory
> system, anymore...), while ARM has grown quite a bit more complex (a
> lot more instructions).

They are still as different as oranges and fish. Despite the additional
instructions added from ARMv4 onwards, ARM is still a pretty orthogonal,
fixed instruction length, 3 operand, load/store architecture. The x86
ISA is a variable length, 2 operand, register starved, reg/mem, bloated
piss poor pile of cack.

---druck

Rob Kendrick

Nov 9, 2009, 7:50:29 PM

On Mon, 09 Nov 2009 21:50:10 +0000
druck <ne...@druck.org.uk> wrote:

> They are still as different as oranges and fish. Despite the
> additional instructions added from ARMv4 onwards, ARM is still a
> pretty orthogonal, fixed instruction length, 3 operand, load/store
> architecture. The x86 ISA is a variable length, 2 operand, register
> starved, reg/mem, bloated piss poor pile of cack.

Actually, it's becoming increasingly unorthogonal. Such as
instructions that only work on even-numbered registers. Additionally,
modern x86s actually have more general-purpose registers than ARM (16,
versus 15), and is getting scarily close to ARM's performance per Watt
(which is why ARM have added all the bloated cack to try to catch up
again.)

B.

Jake Waskett

Nov 10, 2009, 5:48:51 AM

For the most part, though, you can just ignore the instructions that have
been added to ARM. The ARMv4 instruction set is (obviously) more than
adequate for programming a complete machine, and those core instructions
are still available and as flexible and orthogonal as ever, even in much
more recent ARM architectures. I recognise that there's a technical
ugliness aspect, but in practical terms does it really matter if an
obscure new instruction is a bit peculiar?

On the other hand, x86 is far from orthogonal, even for the basics. Try
writing a reasonably efficient compiler back end. There are so many
special cases in x86 (shifts and multiplies, for example) that the
resulting code is a hideous mess of special cases. It's even worse if you
try to emit binary code directly, as x86 encoding is bizarre.

That's the way I'd sum up the differences: ARM is a clean, RISC-style core
with some ugly additions; x86 is a mess throughout. And because you have
to deal with that mess effectively, you have to hold it in your head.

Terje Slettebø

Nov 10, 2009, 6:59:31 AM

On 9 Nov, 18:58, Rob Kendrick <n...@rjek.com> wrote:
> On Mon, 9 Nov 2009 09:53:17 -0800 (PST)
> Terje Slettebø <tslett...@gmail.com> wrote:
> > > The idea here for the ARM is that doing the above and not having
> > > branch prediction is a lot simpler for similar performance, and
> > > thus more power efficient.
>
> > Also, I'm no processor expert, but it's my understanding that with
> > today's deep pipelines, a branch misprediction can be quite costly
> > (more than earlier, shorter pipelines).
>
> Yes, but we have speculative and out-of-order execution to get around
> that :)

How? Branch prediction is already speculative execution. Is there any
mainstream processor that speculatively executes both paths of a
branch? And of multiple branches after each other, as in the case of
Richard's example code?

Regards,

Terje

Rob Kendrick

Nov 10, 2009, 7:02:39 AM

On Tue, 10 Nov 2009 03:59:31 -0800 (PST)
Terje Slettebø <tsle...@gmail.com> wrote:

> How? Branch prediction is already speculative execution. Is there any
> mainstream processor that speculatively executes both paths of a
> branch?

Sure; most modern x86s. Remember, branch prediction is more about
pre-loading instructions from the likely path, not hedging one's bets.
Speculative execution becomes somewhat easier when you've already done
the dependency analysis required by out-of-order execution.

B.

Richard Russell

Nov 10, 2009, 7:28:37 AM

On Nov 10, 10:48 am, Jake Waskett <j...@waskett.org> wrote:
> On the other hand, x86 is far from orthogonal, even for the basics.

I don't accept that. In the IA-32 instruction set non-orthogonal
instructions are the exception rather than the rule and, rather as for
the newer ARM instructions, you can largely ignore them if you prefer.

> There are so many special cases in x86 (shifts and multiplies, for example)

Shifts by a *constant* number of bits aren't special cases: you can
shift any of the general-purpose registers by any number of bits.
Shifts by a *variable* number of bits are special only to the extent
that the shift count must be in a specific register.

Regular 32-bit multiplies aren't special cases either; you can
multiply any register by any other register (or any register by a
constant). The only special case is when you need to multiply two 32-
bit values together to give a 64-bit product.

> It's even worse if you try to emit binary code directly, as x86 encoding is bizarre.

Huh? What's 'bizarre' about it? BBC BASIC for Windows includes a
full x86 assembler (accepting 16-bit and 32-bit instruction variants,
floating point instructions and MMX instructions) which totals less
than 9Kbytes including all the tables. The actual 'encoding' logic is
about 2 Kbytes.

Terje Slettebø

Nov 10, 2009, 9:34:07 AM

On 10 Nov, 13:02, Rob Kendrick <n...@rjek.com> wrote:
> On Tue, 10 Nov 2009 03:59:31 -0800 (PST)
> Terje Slettebø <tslett...@gmail.com> wrote:
> > How? Branch prediction is already speculative execution. Is there any
> > mainstream processor that speculatively executes both paths of a
> > branch?
>
> Sure; most modern x86s. Remember, branch prediction is more about
> pre-loading instructions from the likely path, not hedging one's bets.
> Speculative execution becomes somewhat easier when you've already done
> the dependency analysis required by out-of-order execution.

I'm afraid I still don't follow you... From what I understand, branch
prediction is about predicting the most likely outcome of a
conditional branch, and then continuing along the predicted path,
speculatively fetching, possibly executing (but not retiring)
instructions along that path. If it predicted correctly, then all is
fine, and it can retire the finished instructions, and continue along
that path.

However, if it didn't predict correctly, then unless it _also_
speculatively executes along the nonpredicted path, it would have to
flush the pipeline, and go down that path, instead.
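
To make that description concrete, here is a sketch of the textbook two-bit saturating counter predictor. This is a swapped-in classroom scheme, not a claim about what any particular x86 or ARM core does, and the function name is made up.

```c
#include <stdbool.h>
#include <stddef.h>

/* Counts how often a two-bit saturating counter mispredicts a sequence of
 * branch outcomes. Counter states 0..1 predict "not taken", 2..3 predict
 * "taken". Each misprediction stands for one pipeline flush. */
size_t count_mispredictions(const bool *taken, size_t n)
{
    unsigned counter = 1;       /* start in the weak "not taken" state */
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        bool predicted = counter >= 2;
        if (predicted != taken[i])
            misses++;                        /* wrong path was fetched */
        if (taken[i]) {
            if (counter < 3) counter++;      /* move towards "taken" */
        } else {
            if (counter > 0) counter--;      /* move towards "not taken" */
        }
    }
    return misses;
}
```

A branch that is nearly always taken is mispredicted only occasionally, while one that alternates between taken and not taken can be mispredicted every time, which is where the cost of a deep pipeline really shows.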

I also don't understand your reference to "hedging one's bets" (maybe
due to me not being native English speaking). If you agree with my
description of branch prediction as given above, then at least we have
a common understanding of that.

I'm not sure where out-of-order execution figures in this picture.
Could you perhaps describe in more detail how this is supposed
to mitigate the cost of branch misprediction?

Regards,

Terje

Rob Kendrick

Nov 10, 2009, 9:39:20 AM

On Tue, 10 Nov 2009 06:34:07 -0800 (PST)
Terje Slettebø <tsle...@gmail.com> wrote:

> However, if it didn't predict correctly, then unless it _also_
> speculatively executes along the nonpredicted path, it would have to
> flush the pipeline, and go down that path, instead.

Quite; and modern CPUs do have multiple execution units, and can do
just this.

> I also don't understand your reference to "hedging one's bets" (maybe
> due to me not being native English speaking).

To hedge one's bet is to gamble both ways.

> I'm not sure where out-of-order execution figures in this picture.
> Could you perhaps describe in more detail how this is supposed
> to mitigate the cost of branch misprediction?

The dependency information gathered that is essential for out-of-order
execution to work is also very useful for the other optimisations
mentioned.

B.

Jake Waskett

Nov 10, 2009, 10:47:45 AM

On Tue, 10 Nov 2009 04:28:37 -0800, Richard Russell wrote:

> On Nov 10, 10:48 am, Jake Waskett <j...@waskett.org> wrote:
>> On the other hand, x86 is far from orthogonal, even for the basics.
>
> I don't accept that. In the IA-32 instruction set non-orthogonal
> instructions are the exception rather than the rule and, rather as for
> the newer ARM instructions, you can largely ignore them if you prefer.

I disagree. I'd say that non-orthogonality is the norm in x86, but
usually it's mild enough that you don't notice. I think that x86
programmers also develop habits that help to avoid non-orthogonality (for
example if you always use ESP as a stack pointer then you don't notice
that you can't use another register for the same purpose).

A good example that affects almost every instruction is the addressing
mode restriction. x86 instructions, as you know, generate addresses as
the sum of a displacement plus a register plus a second register shifted
left by 0, 1, 2, or 3. The scheme is fairly flexible, but the second
register cannot be ESP. (If memory serves correctly, there's also an
impossible combination of registers, but I may be mistaken about that.) So
ESP isn't truly a general-purpose register, which introduces nastiness
into any code generator, as it has to take this quirk into account.
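
A small sketch of the check a code generator ends up making, written in C. The function name and the register array are hypothetical; the register numbers are the ones used in the ModR/M and SIB bytes, where the index value that would mean ESP is used to mean "no index" instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA-32 register numbers as used in the ModR/M and SIB bytes. */
enum reg32 { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

/* Computes displacement + base + (index << scale_log2), the general
 * 32-bit addressing form. Returns false for combinations that cannot be
 * encoded: a shift outside 0..3, or ESP as the index register. */
bool effective_address(const uint32_t regs[8], enum reg32 base,
                       enum reg32 index, unsigned scale_log2,
                       int32_t disp, uint32_t *out)
{
    if (scale_log2 > 3 || index == ESP)
        return false;
    *out = regs[base] + (regs[index] << scale_log2) + (uint32_t)disp;
    return true;
}
```

ESP is accepted as the base but rejected as the index, which is exactly the kind of special case a register allocator then has to know about.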

There are also very minor things that may not count (depending on your
definition) as non-orthogonal, but which nevertheless affect code
generators and both human- and machine-based optimisation. For example,
consider the fact that EAX gets special treatment as an accumulator. Many
instructions have a more compact encoding when used in this way, which can
result in faster code. So the optimum performance depends on your choice
of registers.

>> There are so many special cases in x86 (shifts and multiplies, for
>> example)
>
> Shifts by a *constant* number of bits aren't special cases: you can
> shift any of the general-purpose registers by any number of bits. Shifts
> by a *variable* number of bits are special only to the extent that the
> shift count must be in a specific register.

I'm puzzled why you say "only to the extent" here - I'm having difficulty
in thinking of a more "special" requirement. Having to use a particular
register is pretty inconvenient. It's bad enough having only eight
registers to begin with, but when there are restrictions on how they are
used it is even less pleasant.

> Regular 32-bit multiplies aren't special cases either; you can multiply
> any register by any other register (or any register by a constant). The
> only special case is when you need to multiply two 32-bit values
> together to give a 64-bit product.

Sorry, you're quite right - multiplies are flexible.

>> It's even worse if you try to emit binary code directly, as x86
>> encoding is bizarre.
>
> Huh? What's 'bizarre' about it?

The assignment of opcodes is seemingly arbitrary - there are few regular
patterns. There's no obvious relationship between the opcode and the
function, whether a modr/m byte follows, whether an immediate quantity
(and of what size) follows, etc. ARM is trivial in comparison - you can
write a disassembler in a few hundred lines of C. That's why x86 decoders
are so huge (and in turn why Atom can't beat ARM's power consumption).

To be fair to x86, it is a variable instruction length encoding, for which
the priority is code size rather than regularity, but even so it is uglier
than necessary.

> BBC BASIC for Windows includes a full
> x86 assembler (accepting 16-bit and 32-bit instruction variants,
> floating point instructions and MMX instructions) which totals less than
> 9Kbytes including all the tables. The actual 'encoding' logic is about
> 2 Kbytes.

That's impressive. Does that perform any optimisation such as choosing
the most compact encoding?

Jake

Rob Kendrick

Nov 10, 2009, 10:53:40 AM

On Tue, 10 Nov 2009 15:47:45 GMT
Jake Waskett <ja...@waskett.org> wrote:

> The assignment of opcodes is seemingly arbitrary - there are few
> regular patterns. There's no obvious relationship between the opcode
> and the function, whether a modr/m byte follows, whether an immediate
> quantity (and of what size) follows, etc. ARM is trivial in
> comparison - you can write a disassembler in a few hundred lines of
> C. That's why x86 decoders are so huge (and in turn why Atom can't
> beat ARM's power consumption).

Tricky, this: I think you'll find an Atom uses a similar amount of
power encoding an MP3, or similar task. (Simply because it gets it
done faster.)

Where things get interesting is modern x86 CPUs in power saving mode,
where they can mostly turn off. Unfortunately, they've not got that
down enough yet. When they do, they can spring into life and run at
full tilt, do the work very quickly, and then switch themselves off,
while the ARM is still crunching away.

> To be fair to x86, it is a variable instruction length encoding, for
> which the priority is code size rather than regularity, but even so
> it is uglier than necessary.

Amusingly, ARM now suggest using the variable-length instruction ISA
travesty called Thumb-2 now, as they claim it gives better performance.

(I have no idea if it actually does, all the Thumb-2-capable hardware
I've developed on can't do the normal ARM instruction set, making
comparisons tricky.)

B.

Jake Waskett

Nov 10, 2009, 11:55:41 AM

On Tue, 10 Nov 2009 15:53:40 +0000, Rob Kendrick wrote:

> On Tue, 10 Nov 2009 15:47:45 GMT
> Jake Waskett <ja...@waskett.org> wrote:
>
>> The assignment of opcodes is seemingly arbitrary - there are few
>> regular patterns. There's no obvious relationship between the opcode
>> and the function, whether a modr/m byte follows, whether an immediate
>> quantity (and of what size) follows, etc. ARM is trivial in comparison
>> - you can write a disassembler in a few hundred lines of C. That's why
>> x86 decoders are so huge (and in turn why Atom can't beat ARM's power
>> consumption).
>
> Tricky, this: I think you'll find an Atom uses a similar amount of power
> encoding an MP3, or similar task. (Simply because it gets it done
> faster.)

You might be right, but I could be convinced either way. I'd like to see
some figures.

> Where things get interesting is modern x86 CPUs in power saving mode,
> where they can mostly turn off. Unfortunately, they've not got that
> down enough yet. When they do, they can spring into life and run at
> full tilt, do the work very quickly, and then switch themselves off,
> while the ARM is still crunching away.

I've seen this described as a "race to idle", but I'm not sure whether it
applies to Atom, which is pretty feeble in terms of performance. I don't
have any figures to hand, but I wouldn't be surprised if a recent, fast
Cortex A8 was about as fast as an Atom, at least on integer code. So it
could race to idle just as effectively.

> Amusingly, ARM now suggest using the variable-length instruction ISA
> travesty called Thumb-2 now, as they claim it gives better performance.
>
> (I have no idea if it actually does, all the Thumb-2-capable hardware
> I've developed on can't do the normal ARM instruction set, making
> comparisons tricky.)

You're obviously ahead of me on this - I haven't actually *used* Thumb-2,
but have merely read about it. My understanding is that it's faster than
ARM when using 16-bit memory (as Flash often is). I gather that it *can*
be faster than ARM even when using a wider memory due to better cache
utilisation, but I imagine that this would depend on the characteristics
of the code being executed.

Rob Kendrick

Nov 10, 2009, 12:01:42 PM

On Tue, 10 Nov 2009 16:55:41 GMT
Jake Waskett <ja...@waskett.org> wrote:

> I've seen this described as a "race to idle", but I'm not sure
> whether it applies to Atom, which is pretty feeble in terms of
> performance. I don't have any figures to hand, but I wouldn't be
> surprised if a recent, fast Cortex A8 was about as fast as an Atom,
> at least on integer code. So it could race to idle just as
> effectively.

Sure, if you can find anybody who'll sell you a 1.6GHz dual-core A8,
for integer non-memory-bound operations, they might be of similar
performance. Floating point on an A8's a joke.

Also remember that almost all A8 CPUs out there lack any
high-speed external bus, meaning external peripheral access (such as
hard discs, networking, loads of other stuff) have to go over
SPI, local memory bus, or similar low-bandwidth high-overhead bus.

> > Amusingly, ARM now suggest using the variable-length instruction ISA
> > travesty called Thumb-2 now, as they claim it gives better
> > performance.
> >
> > (I have no idea if it actually does, all the Thumb-2-capable
> > hardware I've developed on can't do the normal ARM instruction set,
> > making comparisons tricky.)
>
> You're obviously ahead of me on this - I haven't actually *used*
> Thumb-2, but have merely read about it. My understanding is that
> it's faster than ARM when using 16-bit memory (as Flash often is). I
> gather that it *can* be faster than ARM even when using a wider
> memory due to better cache utilisation, but I imagine that this would
> depend on the characteristics of the code being executed.

Very much so, which is why I've taken ARM's claims with a pinch of salt.

B.

Richard Russell

Nov 10, 2009, 12:38:26 PM

On Nov 10, 3:47 pm, Jake Waskett <j...@waskett.org> wrote:
> A good example that affects almost every instruction is the addressing
> mode restriction. x86 instructions, as you know, generate addresses as
> the sum of a displacement plus a register plus a second register shifted
> left by 0, 1, 2, or 3. The scheme is fairly flexible, but the second
> register cannot be ESP.

Well, esp isn't a 'general purpose' register so I discount that. On
any processor which has a dedicated stack pointer, I wouldn't expect
it to be usable in the same ways as the general-purpose registers, so
I don't consider that to be a 'non-orthogonal' feature. I'm looking
at it from the point of view of a programmer, *not* from an
instruction encoding standpoint.

> For example,
> consider the fact that EAX gets special treatment as an accumulator. Many
> instructions have a more compact encoding when used in this way, which can
> result in faster code. So the optimum performance depends on your choice
> of registers.

That's why I said you can largely ignore the non-orthogonal
instructions (which includes the special accumulator instructions).
Doing so may, as you say, result in an encoding being slightly longer
than it might otherwise be, but I'd be surprised if execution *time*
was affected other than in rare 'edge cases' such as pushing code into
another cache line.

> Having to use a particular register is pretty inconvenient.

To an experienced x86 programmer it is second-nature, but clearly this
*is* a case of non-orthogonality. I might argue that having to
increase the size of the instruction to accommodate a field for the
'shift count' register would be pretty wasteful!

> There's no obvious relationship between the opcode and the
> function, whether a modr/m byte follows, whether an immediate quantity
> (and of what size) follows, etc.

Considering the technology available when the original 8086/8088 was
developed (from which all x86 processors derive) I bet there *is* some
pattern to it (Intel will have wanted to minimize the decoding
logic)! Unfortunately the need to shoehorn in additional instructions
whilst maintaining binary compatibility with those early processors
will have adversely affected the regularity of the encoding.

> That's impressive. Does that perform any optimisation such as choosing
> the most compact encoding?

The currently-released version will always use the accumulator-
specific encoding (when available) in preference to the 'orthogonal'
encoding. This *usually* results in the most compact code, but
there's a special case (when a 32-bit operand can be represented as a
sign-extended 8-bit value) when it doesn't. The next release of BB4W,
due in the new year, addresses this issue and will always use the most
compact encoding.
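
A minimal sketch of that special case in C, using "add eax, imm" as the example. This is not the BB4W source, just an illustration; the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Emits "add eax, imm" into buf (at least 5 bytes) and returns the length.
 * Two encodings are considered:
 *   05 id     ADD EAX, imm32    accumulator-specific form, 5 bytes
 *   83 C0 ib  ADD r/m32, imm8   general form, sign-extended, 3 bytes */
size_t emit_add_eax_imm(uint8_t *buf, int32_t imm)
{
    if (imm >= -128 && imm <= 127) {
        buf[0] = 0x83;
        buf[1] = 0xC0;              /* ModR/M: mod=11, /0 = ADD, r/m = EAX */
        buf[2] = (uint8_t)imm;      /* low byte; the CPU sign-extends it   */
        return 3;
    }

    uint32_t u = (uint32_t)imm;
    buf[0] = 0x05;
    buf[1] = (uint8_t)(u & 0xFF);   /* immediate stored little-endian */
    buf[2] = (uint8_t)((u >> 8) & 0xFF);
    buf[3] = (uint8_t)((u >> 16) & 0xFF);
    buf[4] = (uint8_t)((u >> 24) & 0xFF);
    return 5;
}
```

So "add eax,1" comes out as 83 C0 01 (three bytes) rather than 05 01 00 00 00 (five bytes), while a constant that doesn't fit in a signed byte still gets the accumulator form.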

Martin Bazley

Nov 10, 2009, 2:12:12 PM

The following bytes were arranged on 10 Nov 2009 by Richard Russell :

> On Nov 10, 10:48 am, Jake Waskett <j...@waskett.org> wrote:
> > There are so many special cases in x86 (shifts and multiplies, for example)
>
> Shifts by a *constant* number of bits aren't special cases: you can
> shift any of the general-purpose registers by any number of bits.
> Shifts by a *variable* number of bits are special only to the extent
> that the shift count must be in a specific register.
>

Whereas on ARM, of course, you can shift (or rotate) the second
parameter of any instruction, by any value, in any direction, arithmetic
or logical, including a special 'extend' option, and by any constant or
a number held in any register. :-)
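
For anyone who hasn't met the barrel shifter, here is a rough C model of what it does to the second operand. It is a simplified sketch: it covers shift amounts 0 to 31 only and leaves out RRX (which needs the carry flag); the function names are invented.

```c
#include <stdint.h>

enum shift_type { LSL, LSR, ASR, ROR };

/* Models the shifted second operand of an ARM data-processing instruction.
 * The amount is masked to 0..31 here purely to keep the C well defined;
 * larger amounts behave differently on a real ARM and are not modelled. */
uint32_t shifter_operand(uint32_t value, enum shift_type type, unsigned amount)
{
    amount &= 31;
    if (amount == 0)
        return value;

    switch (type) {
    case LSL: return value << amount;
    case LSR: return value >> amount;
    case ASR: {
        uint32_t shifted = value >> amount;
        if (value & 0x80000000u)                  /* negative: copy sign bit down */
            shifted |= ~(0xFFFFFFFFu >> amount);
        return shifted;
    }
    case ROR: return (value >> amount) | (value << (32 - amount));
    }
    return value;
}

/* ADD Rd, Rn, Rm, LSL #2 : an add and a multiply-by-four in one instruction. */
uint32_t add_with_shift(uint32_t rn, uint32_t rm)
{
    return rn + shifter_operand(rm, LSL, 2);
}
```

The second function is the sort of thing you get for free when indexing a word array: base plus index times four in a single instruction.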

Well, except for the new instructions, which, as mentioned above, are
best ignored.

Just thought I might point that out...

--
__<^>__ "Your pet, our passion." - Purina
/ _ _ \ "Your potential, our passion." - Microsoft, a few months later
( ( |_| ) )
\_> <_/ ======================= Martin Bazley ==========================

Gwyn

Nov 10, 2009, 2:19:04 PM

In message <20091110155...@trite.i.flarn.net.i.flarn.net>
Rob Kendrick <nn...@rjek.com> wrote:

>On Tue, 10 Nov 2009 15:47:45 GMT
>Jake Waskett <ja...@waskett.org> wrote:
>
>> The assignment of opcodes is seemingly arbitrary - there are few
>> regular patterns. There's no obvious relationship between the opcode
>> and the function, whether a modr/m byte follows, whether an immediate
>> quantity (and of what size) follows, etc. ARM is trivial in
>> comparison - you can write a disassembler in a few hundred lines of
>> C. That's why x86 decoders are so huge (and in turn why Atom can't
>> beat ARM's power consumption).
>
>Tricky, this: I think you'll find an Atom uses a similar amount of
>power encoding an MP3, or similar task. (Simply because it gets it
>done faster.)

Surely you mean a similar amount of energy - ie more power for a shorter
time.

[snip]

--
Gwyn

druck

Nov 10, 2009, 5:01:09 PM

Rob Kendrick wrote:
> Actually, it's becoming increasingly unorthogonal. Such as
> instructions that only work on even-numbered registers. Additionally,
> modern x86s actually have more general-purpose registers than ARM (16,
> versus 15),

That's AMD64 not x86, different fish.

> and is getting scarily close to ARM's performance per Watt

> (which is why ARM have added all the bloated cack to try to catch up
> again.)

Still at least 3 binary magnitudes higher last time I looked.

---dru

Richard Russell

Nov 10, 2009, 5:55:45 PM

On Nov 10, 10:01 pm, druck <n...@druck.org.uk> wrote:
> That's AMD64 not x86, different fish.

Still very recognisably a member of the x86-family. In fact it's
almost binary compatible, if the INC instructions are avoided (they
have become REX prefix bytes in the x86-64) - and of course it's
fully compatible when switched to 32-bit mode. To misuse a different
metaphor, the x86 and x86-64 are as different as cheese and cheese!
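
To make the INC/REX point concrete, here is a rough Python sketch of how the single bytes 0x40 to 0x4F are read in each mode. The function name and output strings are my own invention; the bit layout (0100WRXB) is the standard REX format. Note that the one-byte DEC forms (0x48 to 0x4F) are affected in the same way as the INC forms.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def describe_byte(opcode, long_mode):
    """Interpret a byte in the range 0x40-0x4F in 32-bit or 64-bit mode."""
    if not 0x40 <= opcode <= 0x4F:
        raise ValueError("only bytes 0x40-0x4F are handled by this sketch")
    if long_mode:
        # In 64-bit mode the byte is a REX prefix: 0100 W R X B
        w = (opcode >> 3) & 1
        r = (opcode >> 2) & 1
        x = (opcode >> 1) & 1
        b = opcode & 1
        return "rex W=%d R=%d X=%d B=%d" % (w, r, x, b)
    # In 32-bit mode 0x40+r is INC r32 and 0x48+r is DEC r32
    op = "inc" if opcode < 0x48 else "dec"
    return "%s %s" % (op, REG32[opcode & 7])

legacy = describe_byte(0x43, long_mode=False)
prefix = describe_byte(0x48, long_mode=True)
```

Code that sticks to the two-byte FF /0 and FF /1 forms of INC and DEC avoids the clash.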

Rob Kendrick

Nov 11, 2009, 1:56:28 AM

On Tue, 10 Nov 2009 22:01:09 +0000
druck <ne...@druck.org.uk> wrote:

> Rob Kendrick wrote:
> > Actually, it's becoming increasingly unorthogonal. Such as
> > instructions that only work on even-numbered registers.
> > Additionally, modern x86s actually have more general-purpose
> > registers than ARM (16, versus 15),
>
> That's AMD64 not x86, different fish.

It's an x86 CPU that has a different mode that mops up some ugliness
from previous generations and deprecates/disables older functionality.
To me, that sounds similar to the 26->32 bit mode changes, the removal
of the old FPA10 instruction set and replacement with VFP/NEON, and
bags of other changes made to ARM.

B.

Terje Slettebø

Nov 11, 2009, 3:25:35 AM

On 10 Nov, 18:38, Richard Russell <n...@rtrussell.co.uk> wrote:
>
> Considering the technology available when the original 8086/8088 was
> developed (from which all x86 processors derive) I bet there *is* some
> pattern to it (Intel will have wanted to minimize the decoding
> logic)! Unfortunately the need to shoehorn in additional instructions
> whilst maintaining binary compatibility with those early processors
> will have adversely affected the regularity of the encoding.

The same is to some degree the case for the newer ARM instructions:
They are still divided into classes that generally occupy a pattern
for each class, but they are often not as straightforward to encode/
decode as the original instruction set.

For example, one mnemonic may result in several different rather
distinct opcodes, depending on mnemonic options and arguments.

To put it in positive terms: This makes it more challenging to write
an assembler for it... :)

Regards,

Terje

Jake Waskett

Nov 11, 2009, 5:18:22 AM

On Tue, 10 Nov 2009 09:38:26 -0800, Richard Russell wrote:
> Well, esp isn't a 'general purpose' register so I discount that. On any
> processor which has a dedicated stack pointer, I wouldn't expect it to
> be usable in the same ways as the general-purpose registers, so I don't
> consider that to be a 'non-orthogonal' feature. I'm looking at it from
> the point of view of a programmer, *not* from an instruction encoding
> standpoint.

That seems a rather circular argument: an architectural decision to have a
dedicated stack pointer *is* a non-orthogonal feature. It may be a
justifiable one, perhaps, and it might not inconvenience the regular x86
programmer too much, but it's still non-orthogonal.

Regarding your last point, I'm also looking at this from the viewpoint of
a programmer. Some time ago, I was working on a dynamic translator that
translated RISC instructions to x86. Using ESP as a general purpose
register (which it very nearly is) bought me some performance, but at the
cost of a great big wedge of logic that had to handle all of the special
cases.

> [re short-form encodings]

> That's why I said you can largely ignore the non-orthogonal instructions
> (which includes the special accumulator instructions). Doing so may, as
> you say, result in an encoding being slightly longer than it might
> otherwise be, but I'd be surprised if execution *time* was affected
> other than in rare 'edge cases' such as pushing code into another cache
> line.

You might want to read Intel's (and AMD's) optimisation manuals. They
both stress the importance of keeping instructions short (some processors
can issue up to 4 instructions per cycle, but only if they all fit in a
single cache line).

>> Having to use a particular register is pretty inconvenient.
>
> To an experienced x86 programmer it is second-nature, but clearly this
> *is* a case of non-orthogonality. I might argue that having to increase
> the size of the instruction to accommodate a field for the 'shift count'
> register would be pretty wasteful!

That's exactly what the imm8 variants of the x86 shift instructions do!
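
For illustration, a small Python sketch of the two register-direct SHL encodings (C1 /4 ib for an immediate count, D3 /4 for a count in CL). The helper names are hypothetical; the opcodes are the standard 32-bit ones.

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def modrm(mod, reg, rm):
    """Pack the three ModR/M fields into one byte."""
    return (mod << 6) | (reg << 3) | rm

def shl_imm8(reg, count):
    """SHL r32, imm8: opcode C1, /4 in the reg field, then the count byte."""
    return bytes([0xC1, modrm(3, 4, REG32[reg]), count & 0xFF])

def shl_cl(reg):
    """SHL r32, CL: opcode D3, /4 in the reg field, count taken from CL."""
    return bytes([0xD3, modrm(3, 4, REG32[reg])])

by_constant = shl_imm8("ebx", 5)   # shl ebx,5
by_register = shl_cl("ebx")        # shl ebx,cl
```

The constant-count form spends an extra byte on the count, while the variable-count form saves that byte by hard-wiring CL.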

>> There's no obvious relationship between the opcode and the function,
>> whether a modr/m byte follows, whether an immediate quantity (and of
>> what size) follows, etc.
>
> Considering the technology available when the original 8086/8088 was
> developed (from which all x86 processors derive) I bet there *is* some
> pattern to it (Intel will have wanted to minimize the decoding logic)!

Actually, early x86 processors (and, I believe, the 8080) were microcoded,
so decoding was essentially trivial (probably involving shifting the
opcode byte by a few bits and using the result as an index into the
microcode ROM). The current x86 execution model, which essentially
involves hardware decode and translation to RISC-like instructions for
execution, is a comparatively recent invention.

You're probably right, though, that there is some kind of pattern in
there. But whatever it is is far from obvious.

Terje Slettebø

Nov 11, 2009, 5:52:01 AM

On 11 Nov, 11:18, Jake Waskett <j...@waskett.org> wrote:

Hi Jake.

> On Tue, 10 Nov 2009 09:38:26 -0800, Richard Russell wrote:
> > Well, esp isn't a 'general purpose' register so I discount that. On any
> > processor which has a dedicated stack pointer, I wouldn't expect it to
> > be usable in the same ways as the general-purpose registers, so I don't
> > consider that to be a 'non-orthogonal' feature. I'm looking at it from
> > the point of view of a programmer, *not* from an instruction encoding
> > standpoint.
>
> That seems a rather circular argument: an architectural decision to have a
> dedicated stack pointer *is* a non-orthogonal feature. It may be a
> justifiable one, perhaps, and it might not inconvenience the regular x86
> programmer too much, but it's still non-orthogonal.

You might be interested to know that the use of R13 as anything other
than a stack pointer has been deprecated... Section 2.3 of the ARMv7
ARM:

"The use of SP for any purpose other than as a stack pointer is
deprecated."

Why they've done this is not explained. Possibly due to R13 being an
implicit register for some Thumb operations (similar to the x86).

Regards,

Terje

Richard Russell

Nov 11, 2009, 5:55:42 AM

On Nov 11, 10:18 am, Jake Waskett <j...@waskett.org> wrote:
> That seems a rather circular argument: an architectural decision to have a
> dedicated stack pointer *is* a non-orthogonal feature.

I disagree. You might as well say that an architectural decision to
have a dedicated program counter register is non-orthogonal, or an
architectural decision to have a dedicated flags register is non-
orthogonal!

> You might want to read Intel's (and AMD's) optimisation manuals. They
> both stress the importance of keeping instructions short

Please give a specific reference in the Intel Architecture
Optimization Reference Manual (with which I'm fairly familiar, but
can't remember ever having read that). Doing the experiment, i.e.
comparing the actual execution times of the accumulator-version and
the 'orthogonal' version, the longer encoding was, if anything, very
slightly *faster*!

Richard Russell

Nov 11, 2009, 9:50:00 AM

On Nov 11, 10:55 am, Richard Russell <n...@rtrussell.co.uk> wrote:
> the longer encoding was, if anything, very slightly *faster*!

Repeating the experiment more carefully, to make sure that alignment
issues weren't affecting the results, the execution times of the 5-
byte instruction using 'eax' and the 6-byte instruction using 'ebx'
were identical on my Pentium 4 HT. I've listed the code I used, and
the results, below.

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

Code:

DIM P% 200
P% = (P%+15) AND -16
[
.test1
or eax,123456
or eax,123456
or eax,123456
or eax,123456
or eax,123456
or eax,123456
or eax,123456
or eax,123456
loop test1
ret
]
P% = (P%+15) AND -16
[
.test2
or ebx,123456
or ebx,123456
or ebx,123456
or ebx,123456
or ebx,123456
or ebx,123456
or ebx,123456
or ebx,123456
loop test2
ret
]

C% = 1000000000 : REM loop count; CALL passes C% to the code in ecx
TIME=0:CALL test1:T%=TIME:PRINT T%
TIME=0:CALL test2:T%=TIME:PRINT T%

Result:

100418F0 .test1
100418F0 0D 40 E2 01 00 or eax,123456
100418F5 0D 40 E2 01 00 or eax,123456
100418FA 0D 40 E2 01 00 or eax,123456
100418FF 0D 40 E2 01 00 or eax,123456
10041904 0D 40 E2 01 00 or eax,123456
10041909 0D 40 E2 01 00 or eax,123456
1004190E 0D 40 E2 01 00 or eax,123456
10041913 0D 40 E2 01 00 or eax,123456
10041918 E2 D6 loop test1
1004191A C3 ret
10041920 .test2
10041920 81 CB 40 E2 01 00 or ebx,123456
10041926 81 CB 40 E2 01 00 or ebx,123456
1004192C 81 CB 40 E2 01 00 or ebx,123456
10041932 81 CB 40 E2 01 00 or ebx,123456
10041938 81 CB 40 E2 01 00 or ebx,123456
1004193E 81 CB 40 E2 01 00 or ebx,123456
10041944 81 CB 40 E2 01 00 or ebx,123456
1004194A 81 CB 40 E2 01 00 or ebx,123456
10041950 E2 CE loop test2
10041952 C3 ret
296
296
>
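
As a cross-check on the listing above, the bytes can be reproduced by hand. The following Python sketch (helper names are my own) rebuilds the short accumulator form (0D id), the general form (81 /1 id) and the LOOP displacement, using the addresses shown in the listing.

```python
import struct

def or_eax_imm32(imm):
    """Short accumulator form: opcode 0D followed by a 32-bit immediate."""
    return b"\x0D" + struct.pack("<I", imm)

def or_reg_imm32(reg_index, imm):
    """General form: opcode 81, ModR/M with mod=11 and /1 in the reg field."""
    modrm = 0xC0 | (1 << 3) | reg_index
    return bytes([0x81, modrm]) + struct.pack("<I", imm)

def loop_rel8(loop_addr, target):
    """LOOP rel8: displacement is relative to the end of the 2-byte LOOP."""
    disp = target - (loop_addr + 2)
    if not -128 <= disp <= 127:
        raise ValueError("target out of range for a rel8 displacement")
    return bytes([0xE2, disp & 0xFF])

short_form = or_eax_imm32(123456)                # or eax,123456
long_form = or_reg_imm32(3, 123456)              # or ebx,123456 (ebx = 3)
loop1 = loop_rel8(0x10041918, 0x100418F0)        # loop test1
loop2 = loop_rel8(0x10041950, 0x10041920)        # loop test2
```

The 5-byte and 6-byte encodings differ only in the opcode and the extra ModR/M byte; the immediate is the same four little-endian bytes in both.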