Instruction Set and OS Needs

Georges Brun-Cottan

unread,

Jun 5, 1996, 3:00:00 AM6/5/96

to

Hi,

I was wondering about a big challenge found in distributed systems but
also in other domains such databases or general concurrency control
areas.

What's the problem:

Status of data in memory may change and require additional treatment
(locking it, mapping it from a persistent store, invoking a stub for
an rpc and so on). A recurrent problem is how to realize this
functionality so that I have no overhead (i.e no check at the code
level, no indirection + function call, ...) when I do not need it and
that I have a limited (less than current interruption/trap/exception)
overhead when I need it.

What are current solutions :

Mainly two that are bad : 1) including code that test for possibly
needed treatment before each access to the data. 2) using mechanisms
not made for that purpose and so inefficient, mainly page or
misalignment fault.

Example of 1) are : Inserting instructions for locking before
accessing the data (e.g test-and-set and so on). Testing status flags
associated to data to check if the data is currently mapped in memory
(it have been swapped on second level store, migrated on another site
and so on).

Example of 2) are : Changing memory protection so that next access
will generate an exception that will effectively install the
data. That works bad when the issue is to mapped the data and not at
all when what you need is adding a treatment (take a lock, invoking an
RPC and so on).

What do we need :

Upon accessing the data, I want to be able to change the load/write of
this data to the execution of the data considered now as an
opcode. This switch could be made based, for example, on the
interpretation of a bit or a word associated to each address. Would
this bit be one, the processor upon the load will change execution
stream so that the next instruction is to reinterpret the loaded data
as an opcode and proceed.

Notice than this operation has to be executed in the same context than
the load/store operation. So, no need to context switch, stack switch,
cache or TLB flushes... We must be able to implement this operation in
a very fast way. Beside the memory architecture, the added logic to
the processor seem me (a real neophyte) small. I am even wondering if
the PAL code of Alpha processors for example, can be used for this
purpose...

How the solution can be used :

For distributed system and databases it is really obvious,so let me
take my example in the concurrency control domain. Consider processes
using Unix System V shared memory. If data races can occur, processes
synchronize generally using system V semaphores. And that whether or
not the shared segment is currently shared. Let's consider that when
an application attach to a shared segment it install at the data
address (presumably after moved the data in a second address planed to
that purpose) a call instruction that branch to a function that first
grant lock before retrieving this data. When data are not shared, we
avoid all semaphore system calls; we access directly the data. The
overhead is only the (des)installation overhead when data are actually
(un)shared. There is lot of such example. Think about synchronization
access in multi-threaded system. You execute in synchronization code
before accessing data only when several threads are launched, when
capabilities are given so that race condition can effectively occur
on a resource and so on...

Do not you think that this idea is worth to be discussed ?

Regards,

PS: Subsidiary question:
Is there any processors I am not aware of that exhibit such
functionality ?
--
georges Brun-Cottan
projet SOR, INRIA, B.P. 105 Rocquencourt, 78153 Le Chesnay Cedex, France.
Tel.: +33(1)39-63-54-26; Fax: +33(1)39-63-53-30;
<georges.b...@inria.fr> <http://prof.inria.fr/SOR/members/bruncott.html>

Bryan O'Sullivan

unread,

Jun 6, 1996, 3:00:00 AM6/6/96

to

In article <4p51t8$n...@darkstar.UCSC.EDU> brun...@dormeur.inria.fr
(Georges Brun-Cottan) writes:

b> Upon accessing the data, I want to be able to change the load/write
b> of this data to the execution of the data considered now as an
b> opcode.

This is quite ugly, as it requires either an extra bit of data to be
added on to a word, for some bit patterns to be treated magically
(eugh), or (most reasonably) for there to be a special instruction
that either loads one word or jumps to it, based on the contents of
another word.

It is not obvious to me that doing this transparently in hardware
would be any faster than doing it in software. In particular, the
obvious software-only solution to this problem is to use a header word
for each object you are interested in. The header word is an
instruction wide, and instruction-aligned. For vanilla data that
needs no special treatment, the header word just contains a return
instruction. If you need some other magic to be done, the header word
contains a jump to some routine that performs the necessary magic.

Accessing objects of this kind is simple: first, you unconditionally
jump to the header word, which may or may not do anything useful, then
you access your object. This will cost you a few cycles in the no-op
case, but it shouldn't be much more expensive than some hypothetical
hardware solution of similar ilk (after all, the bottleneck will be
cache/memory access); more importantly, it doesn't require adding a
wart to your instruction set architecture.

Some extant systems use just this software-only approach for a variety
of purposes. The Tagless G-Machine, for example, uses unconditional
jumps in place of tagging for certain operations (the implementation
of the STG Machine in the Glasgow Haskell Compiler, for example, uses
such jumps to do marking during garbage collection). The last time I
looked, whether or not this provided any improvement in performance
over vanilla checking of tags was pretty much moot. It certainly made
object headers fat.

<b

--
Let us pray:
What a Great System. b...@eng.sun.com
Please Do Not Crash. b...@serpentine.com
^G^IP@P6 http://www.serpentine.com/~bos

James Edward Bennett

unread,

Jun 8, 1996, 3:00:00 AM6/8/96

to

In article <4p51t8$n...@darkstar.UCSC.EDU>,
brun...@dormeur.inria.fr (Georges Brun-Cottan) writes:

|> Status of data in memory may change and require additional treatment
|> (locking it, mapping it from a persistent store, invoking a stub for
|> an rpc and so on). A recurrent problem is how to realize this
|> functionality so that I have no overhead (i.e no check at the code
|> level, no indirection + function call, ...) when I do not need it and
|> that I have a limited (less than current interruption/trap/exception)
|> overhead when I need it.

...

To solve this, Georges suggests:

|> Upon accessing the data, I want to be able to change the load/write of

|> this data to the execution of the data considered now as an

|> opcode. This switch could be made based, for example, on the
|> interpretation of a bit or a word associated to each address. Would
|> this bit be one, the processor upon the load will change execution
|> stream so that the next instruction is to reinterpret the loaded data
|> as an opcode and proceed.

I worked on a machine once (long ago enough that I don't remember
which one) that had an "EXEC" opcode, which allowed you to load a value
from memory and execute it like it was an opcode. But modern machines
don't do this, because it breaks the instruction fetch/decode/execute
pipeline. Your proposal has the same problem.

A simpler solution is to store a function pointer in the data structure,
and have it point to a null procedure for the case when no locking is
required.

Jim Bennett

Michael Furman

unread,

Jun 8, 1996, 3:00:00 AM6/8/96

to

In article <4p51t8$n...@darkstar.UCSC.EDU>, brun...@dormeur.inria.fr says...
> [....]
>What's the problem:

>
>Status of data in memory may change and require additional treatment
>(locking it, mapping it from a persistent store, invoking a stub for
>an rpc and so on). A recurrent problem is how to realize this
>functionality so that I have no overhead (i.e no check at the code
>level, no indirection + function call, ...) when I do not need it and
>that I have a limited (less than current interruption/trap/exception)
>overhead when I need it.
>

>What are current solutions :
>
>Mainly two that are bad : 1) including code that test for possibly
>needed treatment before each access to the data. 2) using mechanisms
>not made for that purpose and so inefficient, mainly page or
>misalignment fault.
>
>Example of 1) are : Inserting instructions for locking before
>accessing the data (e.g test-and-set and so on). Testing status flags
>associated to data to check if the data is currently mapped in memory
>(it have been swapped on second level store, migrated on another site
>and so on).
>
>Example of 2) are : Changing memory protection so that next access
>will generate an exception that will effectively install the
>data. That works bad when the issue is to mapped the data and not at
>all when what you need is adding a treatment (take a lock, invoking an
>RPC and so on).
>

As I understand you are considering two problems:
[a] memory mapping;
[b] using locks to access shared memory.

For typical cases [a]: swapped or migrated to another site date, IMO using
existent hardware support (page fault e.t.c) is good enough because it
does not have any overhead (since memory mapping already being used in
system) in case that no actions needed. Otherwise moving data from external
memory or anothe computer will be much longer process then processing one
interrupt.

For [b] IMO you forget about releasing lock after the end of transaction.
If you mean protecting just one atomic access to shared memory - it is
already done bu hardware (or not needed in single CPU system). If you
are locking object for some sequence of read / write operations (real case),
you have to release lock at the end - it can not be done automatically
neither by hardware nor software. Thus you have to insert unlocking code
anyway.

So I don't se real example where such hardware mechanism would be needed.

---------------------------------------------------------------
Michael Furman, (603)893-1109
Geophysical Survey Systems, Inc. fax:(603)889-3984
13 Klein Drive - P.O. Box 97 en...@gssi.mv.com
North Salem, NH 03073-0097 71543...@compuserve.com
---------------------------------------------------------------

Marc Shapiro

unread,

Jun 8, 1996, 3:00:00 AM6/8/96

to

In article <4p51t8$n...@darkstar.UCSC.EDU> brun...@dormeur.inria.fr (Georges Brun-Cottan) writes:

From: brun...@dormeur.inria.fr (Georges Brun-Cottan)
Date: 5 Jun 1996 22:32:08 GMT

Upon accessing the data, I want to be able to change the load/write of
this data to the execution of the data considered now as an
opcode. This switch could be made based, for example, on the
interpretation of a bit or a word associated to each address.

Fine-grain protection as found in some architectures, such as the IBM 801,
provides an elegant solution to this problem. In the 801 it was implemented
as a few extra protection bits in the TLB, no more expensive to test than the
usual protection bits.

This is a recurrent problem. Some support for this would be really useful in
many, many places. I have always wondered how come hardware architects never
took this seriously.

Marc
--
Marc Shapiro

M. Shapiro, INRIA Rocquencourt, BP 105, 78153 Le Chesnay Cedex, France.
Tel.: +33 (1) 39-63-53-25. Fax: +33 (1) 39-63-53-30.
e-mail: marc.s...@inria.fr. http://www-sor.inria.fr/SOR/

Georges Brun-Cottan

unread,

Jun 8, 1996, 3:00:00 AM6/8/96

to

In article <4p5eon$h...@darkstar.UCSC.EDU> b...@serpentine.com

(Bryan O'Sullivan) writes:

> It is not obvious to me that doing this transparently in hardware
> would be any faster than doing it in software.

It is neither obvious to me. But I feel it may be faster. It's why I
also ask to computer architecture guys.

> Accessing objects of this kind is simple: first, you unconditionally
> jump to the header word, which may or may not do anything useful, then
> you access your object. This will cost you a few cycles in the no-op
> case, but it shouldn't be much more expensive than some hypothetical
> hardware solution of similar ilk (after all, the bottleneck will be
> cache/memory access); more importantly, it doesn't require adding a
> wart to your instruction set architecture.

Your point looks to me like this: as the bottleneck is along the
memory/cache path, do not care about optimizing the code
instructions. Weird. Let's consider an application with good locality
properties. Its code and data accesses pattern fits well with the
cache policy. Your assertion

> (after all, the bottleneck will be cache/memory access);

is weak. Even with no cache misses, you do no want to pay the cost for
dummy call/return operations, even if instructions and data are
already in the cache. If the programmer and the compiler did a good
job, you will loose because the bottleneck will be no more on the
cache/memory access.

Well. To advance, we need to have number. I will look further on this
way. If anyone (Munin, Midway or Spin team ?) have data, it would be
nice...

BTW, one person points me on Tera computer. This architecture exhibit
something that could be used for that purpose.

> http://www.tera.com/hardware-overview.html
> In addition, every word has associated with it four additional
> bits of access state which, among other things, help implement
> lightweight synchronization operations. Waiting for
> synchronization is initially busy, but turns into non-busy
> waiting after a programmable number of attempts. At this time a
> trap occurs and the stream state is saved and enqueued at the
> target memory word. Subsequent references to that word are made
> to trap by another access state bit. Traps does not change
> instruction stream privilege in the processor.

Regards,

Shubu Mukherjee

unread,

Jun 8, 1996, 3:00:00 AM6/8/96

to

>>>>> In article <4p51t8$n...@darkstar.UCSC.EDU>, brun...@dormeur.inria.fr (Georges Brun-Cottan) writes:

G> What's the problem:

G> Status of data in memory may change and require additional treatment
G> (locking it, mapping it from a persistent store, invoking a stub for
G> an rpc and so on). A recurrent problem is how to realize this
G> functionality so that I have no overhead (i.e no check at the code
G> level, no indirection + function call, ...) when I do not need it and
G> that I have a limited (less than current interruption/trap/exception)
G> overhead when I need it.

This can be done via fine-grain access control, which can be
implemented with or without processor modification. For example, it
can implemented with a snoopy device on the memory bus [Wisconsin
T-Zero, Sequent STiNG] or directly in the processor TLB [MIT
M-machine]. These machines use fine-grain access control to implement
cache-coherent shared memory.

The basic idea is that each cache block is tagged with a few state
bits. A processor can issue a load or store to a cache block only in
certain states. Loads or stores to blocks in all other states will
generate an access fault (called block access fault), which will be
vectored to the processor. So, if someone steals the cache block
underneath you, the state bits will be reset and the cache block will
be invalidated. The next time you try to load from it, you will get
an access fault. Thus, cache hits will proceed at full speed, while
only cache misses to predesignated pages/segments will be intercepted
and forced to generate an access fault.

-Shubu
--

-------------------------------------------------------------------------
Shubu Mukherjee Univeristy of Wisconsin-Madison, Computer Sciences