All very interesting you might say, but it intrigues me as this sort
of instruction is usually only used in multiprocessor systems as a
software semaphore.
Why did Acorn add this instruction to the Arm3? Are they planning to
produce a multiprocessor archimedes??!? that would be pretty cool!
Andy (dreaming of 8 x Arm3's in the same box)
---------------------------------------------------------------------------
Andrew Mell : to...@gnu.ai.mit.edu, am...@isis.cs.du.edu
---------------------------------------------------------------------------
Indeed, but not necessarily not interleavable with other memory operations
(sorry about the double negative :-). In particular, to fully support the
SWP on a system with multiple memory bus masters the memory control logic
which decides which bus master has access to the memory next would have to
force an interlock between the memory read and memory write of the SWP
instruction. Now, the ARM3 has a LOCK pin for this, but to support
multi-processors you need to connect it to something :-).
>All very interesting you might say, but it intrigues me as this sort
>of instruction is usually only used in multiprocessor systems as a
>software semaphore.
>
>Why did Acorn add this instruction to the Arm3?
Because a long time ago, when we were very young (;-) we tried to write a
multi-threaded OS (ARX) and we ``found'' (sic, thought) that it was
spending a lot of time going into supervisor mode and disabling interrupts
so that it could implement mutexes (for user mode code - including the OS,
which ran in user mode too). In theory SWP allows user code to implement
mutexes efficiently.
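
As a rough illustration of the user-mode mutex idea, here is a minimal sketch in C. It assumes C11 `atomic_exchange` as a stand-in for SWP (one atomic read-then-write of a single word); the type `swp_mutex` and the function names are hypothetical and are not taken from ARX or RISC OS.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef struct { atomic_int word; } swp_mutex;

/* One SWP-like step: store 1 and inspect what was there before. */
bool swp_mutex_try_lock(swp_mutex *m)
{
    return atomic_exchange(&m->word, 1) == 0;
}

/* Spin until the lock is obtained. No supervisor call and no
 * interrupt disabling is involved; a real OS would probably
 * yield to the scheduler instead of spinning forever. */
void swp_mutex_lock(swp_mutex *m)
{
    while (!swp_mutex_try_lock(m)) {
        /* busy-wait */
    }
}

/* Release the lock; only the current holder should call this. */
void swp_mutex_unlock(swp_mutex *m)
{
    assert(atomic_load(&m->word) == 1);
    atomic_store(&m->word, 0);
}
```

On a single-processor machine the spin in `swp_mutex_lock` only makes sense if the holder can be scheduled while the waiter spins, so a yield or a block on failure is the more likely design.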
As far as I am concerned the MP aspects of SWP are bonuses (clearly these
were considered at the same time - or the LOCK pin wouldn't be there).
Notice that SWP always bypasses the cache; again this is MP support, however
there is an omission here in that it is impossible to do a (reliable) read
from external memory (you might get the cache contents instead!)
John Bowler (jbo...@acorn.co.uk)
> Notice that SWP always bypasses the cache; again this is MP support, however
> there is an omission here in that it is impossible to do a (reliable) read
> from external memory (you might get the cache contents instead!)
If you're using it to implement semaphores, this is not a problem, as you'd
never need to access the semaphore with any instruction other than SWP.
Cheers, Julian.
BTW. You wouldn't happen to know the instruction format for SWP, by any
chance? If a software emulator can be written for it for ARM2 machines
(like the FPE - or even add it to the FPE) then we can all start using
it.
Yeah - offload the VDU and sound onto another processor! Way to go!
Actually, that instruction is really handy just for things like coping with
busses which have multiple-mastering, and even for multi-tasking (single
processor) for an operation that is non-interruptible by *itself*... Normally
when you're handling semaphores with a multi-tasking OS, you have to disable
interrupts, which of course does wonders for the interrupt latency, etc...
However, using SWP you could conceivably do it using only that instruction,
without having to worry about enabling/disabling interrupts.
It's a handy type of instruction - I wonder why Acorn didn't include it from the
start? Probably for the same reason why they didn't include multiplication in
the ARM1... get the easy ( ;-} ) bits working first!
Ug!
>I notice that the Arm3 has a new instruction over the Arm2 which is
>SWP. It swaps a byte or a word between register and external memory.
>(uninterruptible between the read and write)
>All very interesting you might say, but it intrigues me as this sort
>of instruction is usually only used in multiprocessor systems as a
>software semaphore.
>Why did Acorn add this instruction to the Arm3? Are they planning to
>produce a multiprocessor archimedes??!? that would be pretty cool!
Probably for the same reasons that they made you use 3 extra bytes
of information in a VDU 19 call (VDU 19,x,x;b;b;b; where b were all 0)
in the original BBC. 6 or so years after they did that, along came the
Archimedes, which actually USED those 3 bytes (for r,g,b triplet)
I would say that Acorn were just being sensible and making sure that
they have support for the multiprocessing instructions in all their
'old' machines by the time software comes out for their multiprocessing
machines (i.e. no 'sudden' compatibility break, because although old
machines aren't multiprocessing, they can still handle the instructions
in the new programs.)
But, ACORN, PLEASE don't feel you have to prove me right and delay
the release of an Acorn Y-MP (oooh! Dat sounds nice!) for another 6
years!!! ;-) (Hmmm... I wonder how many people an Acorn Y-MP will seat!)
--
_____ . . ___ _____ .
/ /| /| / \ / /| The Master of the Arcane jw...@cs.aukuni.ac.nz
/ / | / | / / / /-| (A.K.A. Jason Williams)
/ / |/ | \___/ / / | "Dent... Arthur Dent? ... You're a jerk, Dent."
>jo...@acorn.co.uk (John Bowler) writes:
>> Notice that SWP always bypasses the cache; again this is MP support, however
>> there is an omission here in that it is impossible to do a (reliable) read
>> from external memory (you might get the cache contents instead!)
>If you're using it to implement semaphores, this is not a problem, as you'd
>never need to access the semaphore with any instruction other than SWP.
Two remarks:
- SWP only handles binary (i.e. two-valued) semaphores. Normal semaphores
(`normal' as in `those proposed by E.W. Dijkstra') are multivalued.
- If you're using semaphores, you're protecting some data against multiple
use. The semaphore might not be gotten from cache because of the SWP
circumventing it, but the actual data could!
Tiggr
Yes; there is no problem with the semaphore, but the semaphore must be
protecting some state which is shared. When a processor has claimed that
semaphore it probably needs to read the state and to obtain consistent
results when it reads it. If the data is in cacheable memory the only way
it can do that is to use sequences of the form:-
    SWP rx, rx, [raddr]  ; read a value out
    STR rx, [raddr]      ; and put it back... :-(
The alternative is to allocate shared data in uncacheable memory. This
requires some OS intervention (a user program cannot simply allocate
shareable data structures out of its own heap unless the whole heap is
uncacheable) and uncacheable data obviously has a performance hit.
>BTW. You wouldn't happen to know the instruction format for SWP, by any
> chance? If a software emulator can be written for it for ARM2 machines
> (like the FPE - or even add it to the FPE) then we can all start using
> it.
RISC iX 1.2 emulates the SWP instruction on machines which do not support
it. RISC OS doesn't. The assembler syntax is:-
    SWP{cond}{B} Rd, Rm, [Rn]
the semantics (except for the cache behaviour and so on) are:-
    MOV          <temp>, Rm
    LDR{cond}{B} Rd, [Rn]
    STR{cond}{B} <temp>, [Rn]
(ie the SWP Rx, Rx, [Raddr] example above *does* store the *old* Rx value
in [Raddr]... :-).
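
To make the ordering concrete, the three-step semantics can be sketched as a small C model. This is only an illustration for the word-sized case: `regs` is a hypothetical register file, `mem` stands for the word at [Rn], and the model ignores atomicity, the condition field and the cache entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Model of SWP Rd, Rm, [Rn] (word form). The source register is
 * copied to a temporary first, so Rd and Rm may be the same. */
void swp_model(uint32_t regs[16], unsigned rd, unsigned rm, uint32_t *mem)
{
    assert(rd < 16 && rm < 16);
    uint32_t temp = regs[rm];   /* MOV <temp>, Rm    */
    regs[rd] = *mem;            /* LDR Rd, [Rn]      */
    *mem = temp;                /* STR <temp>, [Rn]  */
}
```

With `rd == rm` the register ends up holding the old memory contents and memory ends up holding the old register value, which is the behaviour the SWP Rx, Rx, [Raddr] example relies on.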
The instruction format is:-
    bit 31                                                    bit 0
    c.o.n.d.0.0.0.1 0.B.0.0.n.n.n.n d.d.d.d.0.0.0.0 1.0.0.1.m.m.m.m

    c.o.n.d - the condition
    B       - 0 = swap word
              1 = swap byte
    n.n.n.n - Rn
    d.d.d.d - Rd
    m.m.m.m - Rm
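
For anyone wanting to assemble the instruction by hand or recognise it in an emulator, the bit layout above can be sketched in C as follows. The function names `encode_swp` and `is_swp` are made up for this example, and the mask in `is_swp` is simply derived from the fixed bits in the layout; it is not taken from RISC iX or any other existing emulator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the 32-bit SWP{cond}{B} Rd, Rm, [Rn] instruction word. */
uint32_t encode_swp(unsigned cond, bool byte, unsigned rd, unsigned rm,
                    unsigned rn)
{
    assert(cond < 16 && rd < 16 && rm < 16 && rn < 16);
    return ((uint32_t)cond << 28)              /* c.o.n.d            */
         | (UINT32_C(1) << 24)                 /* bits 27..23: 00010 */
         | ((uint32_t)(byte ? 1 : 0) << 22)    /* B                  */
         | ((uint32_t)rn << 16)                /* n.n.n.n            */
         | ((uint32_t)rd << 12)                /* d.d.d.d            */
         | (UINT32_C(0x9) << 4)                /* bits 7..4: 1001    */
         | (uint32_t)rm;                       /* m.m.m.m            */
}

/* True if the fixed bits of the word match the SWP layout,
 * ignoring the condition, B and the three register fields. */
bool is_swp(uint32_t insn)
{
    return (insn & UINT32_C(0x0FB00FF0)) == UINT32_C(0x01000090);
}
```

For example, `encode_swp(0xE, false, 0, 1, 2)` gives 0xE1020091, i.e. an unconditional SWP R0, R1, [R2].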
Data aborts (from the memory manager) leave Rd/Rm as they were before.
SWP bypasses the ARM3 cache, although the write operation still updates
the cache (if the address is cached). I don't know whether the read
will cause the rest of that part of the cache to be updated (I assume
not, and the programmer should not care :-)
John Bowler (jbo...@acorn.co.uk)
Wrong. You simply use a guard value. Load that guard value into a register
swap it with the semaphore location and if the value you get back from the
location is not the guard value then increment or decrement the location
and swap it back. If you do get the guard value then you either loop trying
again or block the process trying to semaphore and come back later.
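
The guard-value scheme might look roughly like this in C. It is a sketch only: C11 `atomic_exchange` stands in for SWP, `GUARD` is an arbitrary value assumed never to occur as a real count, and the function names are hypothetical. Nothing here bounds how long a caller may spin.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define GUARD INT_MIN   /* assumed never to be a legitimate count */

/* One attempt: swap the guard in, and if a real count came back,
 * adjust it and swap it back. Returns false if somebody else was
 * in the middle of an update (we got the guard back). */
bool guarded_add_once(atomic_int *sem, int delta, int *old_out)
{
    int old = atomic_exchange(sem, GUARD);      /* guard in, count out */
    if (old == GUARD)
        return false;
    int prev = atomic_exchange(sem, old + delta);  /* new count back */
    assert(prev == GUARD);   /* nobody else may write while we hold it */
    (void)prev;
    *old_out = old;
    return true;
}

/* Loop until the update goes through; returns the previous count. */
int guarded_fetch_add(atomic_int *sem, int delta)
{
    int old;
    while (!guarded_add_once(sem, delta, &old)) {
        /* spin, or block the caller and retry later */
    }
    return old;
}
```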
Nicko
+-----------------------------------------------------------------------------+
| Nicko van Someren, nb...@cl.cam.ac.uk, (44) 223 358707 or (44) 860 498903 |
+-----------------------------------------------------------------------------+
> The alternative is to allocate shared data in uncacheable memory. This
> requires some OS intervention (a user program cannot simply allocate
> shareable data structures out of its own heap unless the whole heap is
> uncacheable) and uncacheable data obviously has a performance hit.
This could be implemented fairly easily. You could have a shared memory
manager which controls a block of uncacheable RAM, and allocates chunks
of it to whichever tasks need it. Tasks could then place their own
structures in their allocated parts of the shared ram, and pass handles
between each other by messages.
A better solution would be to add another LDR instruction to the ARM which
will always ignore the cache. Actually, will the cache retain its contents
while it is switched off? It might be just a matter of temporarily turning the
cache off while you read the data in question...
Cheers, Julian.
PS. Thanks for the info about the SWP instruction.
AAAAAAAAAAAAARRRRRRRRGGGGGGGGGGGHHHHH!!!!!!!!!!!!! As the saying goes. One
bypasses the cache at one's own peril - the data may have been updated in the
cache, but not written back yet.
The use of a block of uncacheable memory, as Julian suggested, is an extremely
good idea. How about it Acorn?
-Gavin.
--
The main "user" of well brought up, and educated, children is the community
at large. So if you really believe in "user pays", charge the correct users
- stop overloading parents with financial penalties.
******* These comments have no known correlation with dept. policy! *******
Umm maybe someone hasn't read the stuff about the ARM3 SWI's put out.
One of those allows you to mark blocks of the memory map as uncacheable.
Thus no problem with reading/writing to it directly...
Philip
--
Philip R. Banks ban...@rata.vuw.ac.nz (An Arc owner with a mission) @@@@@/|
(Quite what mission I am not too sure yet. But I do have one!) @@@@/#|
"There may be an om in moment,but there's very few folk in focus!" @@@/--|
--'Hallowed be thy Name' by Emerson,Lake & Palmer. @@/###|
This is only a heuristic approach. If you have more than one processor
using the event count (semaphore) you can end up with one or more processors
always getting the guard value and consequentially wasting time which they
should be spending feeding into the queue (or whatever the event count is
managing) spinning on the lock, or rescheduling themselves.
I know of no reliable way of synthesising a fetch&add primitive from ARM
SWP without knowing the number of processors involved, or making
detailed assumptions about the behaviour of the bus arbitration logic.
(``reliable'' in this case means ``guaranteed to complete in a certain
maximum time'').
John Bowler (jbo...@acorn.co.uk)
>
>AAAAAAAAAAAAARRRRRRRRGGGGGGGGGGGHHHHH!!!!!!!!!!!!! As the saying goes. One
>bypasses the cache at one's own peril - the data may have been updated in the
>cache, but not written back yet.
Not true. The ARM3 cache is write through/around, not copy back. The
updateable region indicates whether to write through (ie update the cache
contents on a cache hit), or to write around (ie just update the memory, and
not the cache contents, on a cache hit). A write on an ARM3 will never
complete before the memory has been updated. However, this shouldn't be
relied on as this may not be true for future ARMs.
___________________________________________________________________
Ashley Stevens aste...@acorn.co.uk
Acorn Computers, 645 Newmarket Rd, Cambridge, UK. Tel.(0223) 214411
and...
In article <1991Aug22.0...@rata.vuw.ac.nz> ban...@rata.vuw.ac.nz (Philip Banks) writes:
> Umm maybe someone hasn't read the stuff about the ARM3 SWI's put out.
> One of those allows you to mark blocks of the memory map as uncacheable.
> Thus no problem with reading/writing to it directly...
There has been a slight loss of context here :-). On current Acorn systems
there is no problem whatsoever about having data (shared by several tasks
or not) cached - there is only one CPU, and it only has one cache, and it
is always consistent with the external memory.
On these systems you might still want to use SWP to control access to shared
data in a multi-threaded process - SWP is not *interruptible*. Such data
can be safely read and written using LDR/LDM and STR/STM.
As Julian points out a separate cache-bypassing variant of LDR would be
useful *for multi-processor systems* (LDP? LDR but from physical memory
rather than the cache?). Similarly, if an ARM chip had (either) a write
back cache (not write through) or a write buffer a cache bypassing/write
buffer flushing (as appropriate) STR variant would also be useful. But none
of this matters until you have an MP system where the multiple processors
have homogeneous access to the memory.
Having to allocate uncacheable memory for inter-thread communication forces
a certain model on the OS - the model is probably perfectly acceptable for
a single-thread-per-address-space system (I haven't thought about this in
detail), but it could be more tricky for multiple-thread-per-address-space
systems, where much more of the data tends to be shared. However I am
dubious about suggesting that cache bypassing is, itself, sufficient for
such a system - it may have too great a performance cost. Hardware
maintenance of cache consistency would make things easier - then SWP/LDP do
not need to bypass the cache, but bus-locking is still required to ensure
consistency of shared data.
John Bowler (jbo...@acorn.co.uk)
>I know of no reliable way of synthesising a fetch&add primitive from ARM
>SWP without knowing the number of processors involved, or making
>detailed assumptions about the behaviour of the bus arbitration logic.
>(``reliable'' in this case means ``guaranteed to complete in a certain
>maximum time'').
As soon as I saw the SWP instruction, my first thought was "Hum. Not really
good enough. What you really need is an indivisible compare-and-swap
instruction." and indeed this would seem to be true. The trouble is,
a CMPSWAP needs to do both a load and a store. You give it an address to
load from, a value to compare with, a value to store, and an address to store
to. It loads the data from the load address, compares it with the compare
value, and only if some condition is met as a result (usually equal)
do you store the other value into the store address, all as one indivisible
instruction. You also get told whether the store happened or not. I have a
feeling that needing to do both a load and a store would play havoc with
the ARM internals though, and specifying four registers could be tricky
unless the shift field was used for one of them. Note that the load address
and store address are usually the same, and then you load your semaphore in
as the compare value, increment or decrement it to get the store value,
do the compare-and-swap, if the compare failed loop back to the load of
the semaphore value. In this way you can do a pretty bomb proof counting
semaphore as a four instruction spin loop. CMPSWP also works really
well for having several pre-emptive processes all allocating and freeing
blocks of memory in a common shared memory area.
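
The spin loop described above can be sketched in C, with C11 `atomic_compare_exchange_strong` standing in for the hypothetical CMPSWP instruction (load address and store address being the same). The function names are made up for the example, and the down operation shown here fails rather than blocks when the count is zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Decrement the count if it is positive. Load the count, compute the
 * new value, compare-and-swap, and loop back if the compare failed. */
bool cas_sem_try_down(atomic_int *count)
{
    int seen = atomic_load(count);
    while (seen > 0) {
        if (atomic_compare_exchange_strong(count, &seen, seen - 1))
            return true;
        /* on failure, seen now holds the current count; retry */
    }
    return false;   /* count was zero: caller must wait or give up */
}

/* Increment the count, retrying until the compare succeeds. */
void cas_sem_up(atomic_int *count)
{
    int seen = atomic_load(count);
    assert(seen >= 0);
    while (!atomic_compare_exchange_strong(count, &seen, seen + 1)) {
        /* seen refreshed by the failed compare; retry */
    }
}
```

Unlike a guard-value scheme built on a plain swap, the location never holds a "busy" marker here, so a process that is pre-empted in the middle of the loop does not leave the semaphore unusable for everyone else.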
This deficiency of the SWP type instructions seems to be a common illness
on Risc chips - the 88000 has an almost identical instruction and indeed
at my previous company we had no end of hassle converting our CMPSWP
code from the MV/Eclipse minis to the 88000 SWP instruction. In one case
we just couldn't make it work with more than two processes involved.
Fortunately we were only using two processes, but it was supposed to be
a general purpose n-processes mechanism.
Owen.
The views expressed are my own and are not necessarily those of Acorn.