I am just confusing about the following and i hope someone can clear
it out for me :)
I have read about paging and i understand that in pentium the CR3
register holds the starting address of the page table in the memory.
This register is generally called the Page Table Base Register (PTBR)
Pure segmentation in based maping needs also a table however it is
often fully cached in the processor as its size tend to be small.
Now it seems the modern processor instruction set , implicitely or
explicitely deals with different segments in the application, i.e the
code is segmented. we don't have only one linera adress space but
many. However, windows NT and linux both ignore segmentation memory
mapping since according to the documentation all the application
segments descriptors are equal, they are all mapped into the same
linear space with base =0 and limit 4GB ( 32 bits processors).
Having said that how the page table of each segment can be retrieved
as there is only one PTBR in the system? there should not be only one
PTBR!
thank you
The Intel scheme doesn't use a page table per segment -- it uses the
segmentation hardware to map to a linear virtual address space, and
then uses the paging system to map this to physical memory.
No. There is only one linear address space. Different segments can start
at different offsets within that linear space (and can overlap), but the
result is always a simple linear address.
>However, windows NT and linux both ignore segmentation memory
>mapping since according to the documentation all the application
>segments descriptors are equal, they are all mapped into the same
>linear space with base =0 and limit 4GB ( 32 bits processors).
Yes.
>Having said that how the page table of each segment can be retrieved
>as there is only one PTBR in the system? there should not be only one
>PTBR!
There is only one set of page tables. You start with a segment number and
offset (ie, CS:123456). You add the starting address for that segment to
the offset, and that produces a linear address. You look up the linear
address in the page tables, and that produces a physical address. The
physical address goes out on the bus.
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
Many thanks for your reply. I said above and you agreed on that Linux
and Windows ignore segmentation by setting the base in the segment
descriptor to zero. therefore,the linear addresses of CS:123456 and DS:
123456 are the same and equal to Base+123456= 123456. As such any
address of the code segment, data segment and any other segment will
point to the same entry in the page table. how this can happen? sorry
but i am still confused
Thanks Joe.... and since windows and linux ignore segmentation by
setting the base value of all descriptors to zeros, how can we protect
between the different segments. as i said above Cs: offset, DS: offset
will be translated to the value offset then passed to the page
table....where it will index the same entry!!
Yes, they all point to the same place in the same pagetable, because
they access the same byte of memory. The only difference is what you are
allowed to do to the byte. If it is accessed through the CS, then you
can execute it. If it is accessed through the DS, then you can
read/write it. What you can do is then controlled further by the
settings of the writable/executable/kernel/user flags in the descriptor
used. Thus, if you access the byte through the CS, but the CS is set to
be a user segment, then if the pagetable lists the page as either data,
or kernel data, or kernel code, then you will get a fault.
In other words, the pagetable defines what you type of memory something
is, whereas the segment descriptor defines what you are trying to do to
the memory. Thus, if your user code is given only a user data and a user
code descriptor, (remember that it cannot create its own descriptors)
then it can only access pages marked as user data or user code. If the
descriptor and the pagetable entry do not match according to a specific
set of rules, then a fault occurs.
Hope that helps.
Matt
Read too :)
--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
ma...@storagecraft.com
http://www.storagecraft.com
but you can also use the pagetable to limit to execute only, with no
read. (probably screws up any compilers that store constants in the code
segments, but can be done.)
Matt
Thanks Matt for your answer but i am still confused despite trying to
look over more online materials. May i give the following example,
assume a user application with 2 segments: code and data. each segment
has 4 pages. therefore the operating systems will create a page table
of 8 valid entries. however the page number from the code segment
ranges from 0 to 3 as does the page number from the data segment. they
are fully overlapped and half of the page table and as such the
application physical address space can't be addressed. i really don't
know what i am missing...
No, the OS creates a page table of four entries. In the usual 32 bit
flat model, the code and data segments map to the same place in the
linear address space. If a program makes a code reference to CS:100,
the segment setup will map that to location 100 in the linear address
space, which is in page 0, so it's 100 bytes into wherever the OS maps
page 0. If a program makes a data reference to DS:100, it is mapped
to the exact same place, location 100 in the linear address space,
which being the same location is also 100 bytes into the mapped page
zero.
For most purposes, on the 32 bit x86 it's easiest to pretend that the
segments don't exist at all. The OS generally sets up all the segments
to do a 1:1 mapping of each segment into the linear address space.
R's,
John
:) thanks John. but after reading your post i got confused further.
Your explanation means that DS:100 and CS:100 will point to the same
byte/word in the physical memory (as they transit through the same
page table entry and have both the same relative offset) although that
physical location should hold two separate items...please forgive my
ignorance...it seems i am missing an elementary concept
There are two totally different mechanisms for protecting information.
1) Putting it in NON-overlapping segments. Thus CD:100 and DS:100 are
different addresses, in different pages. You separate your 4Gb address
space into chunks which are then given to processes. Each chunk
(segment) has a specific type, such as user data, kernel code etc. Each
is seen by the code as starting at 0, and ending at some arbitrary number.
2) Putting in overlapping segments, but separating the
code/data/kernel/user stuff by actually giving them different addresses.
The pagetable then controls the protection, more than the
segmentation, and each segment starts and ends at some arbitrary value.
This is not a problem, as you usually compile/link your code to use
separate addresses for code and data.
Most compilers are written for the second mechanism, and assume that all
code and data segments cover the entire address range. Separating
processes is then a case of changing the pagetable pointer, so that all
code thinks it is running at the same addresses, and accessing data at
the same addresses, regardless of where it is loaded in real memory.
Matt
Right. It's a single flat address space.
> although that physical location should hold two separate items...
No, it shouldn't. This is a Von Neumann architecture, not a Harvard
architecture.
Although it is possible to set up 386 segments so that the code and
data map to different places, nobody does so. On the 286 you had to
do separate code and data segments because few programs could fit all
of the code and data into a single 64K segment. On the 386 a single
segment can map the entire 32 bit linear address space, so that's what
we do.
As someone else noted, the way we keep code and data separate is to
put them at different addresses, and we can use page protection to
(mostly) prohibit broken programs from writing into their code.
> Your explanation means that DS:100 and CS:100 will point to the same
> byte/word in the physical memory (as they transit through the same
> page table entry and have both the same relative offset) although that
> physical location should hold two separate items...
Indeed, at first glance it does look like something crazy's going on. The
secret is that if page 0 is a code page then your memory manager will
never return a block of memory containing the linear address 100. And so
your compiler will (should) never generate a reference to DS:100.
When your memory manager allocates memory and returns a linear address,
that address encodes the PDE, PTE and offset. The only way this address
could be 100 is if PDE=0, PTE=0, offset=100. That implies the first page
in the first page table. If the first page was previously allocated for
code, then the memory manager won't return any address in that range when
it allocates heap memory - it might map page 1 and return the address
PDE=0, PTE=1, offset=100.
So the two addresses you're thinking of (the 100th byte in code and the
100th byte in data) will look something like this (illustration only, I
haven't done the arithmetic):
CS:100 -> PDE=0, PTE=0, offset=100 -> linear 0x00000064
DS:4196 -> PDE=0, PTE=1, offset=100 -> linear 0x00001064
Cheers,
Ciaran
--
Ciaran Keating
Amadan Technologies
and the newer "no-execute" ... countermeasure for (buffer overflow)
attacks that polute data areas with executable instructions and then
attempt to get execution transferred there.
Researcher: CPU No-Execute Bit Is No Big Security Deal
http://www.techweb.com/wire/security/166403451
'No Execute' Flag Waves Off Buffer Attacks
http://www.washingtonpost.com/wp-dyn/articles/A55209-2005Feb26.html
What's the new /NoExecute switch that's added to the boot.ini file
http://www.windowsitpro.com/Article/ArticleID/46302/46302.html
CPU-Based Security: The NX Bit
http://hardware.earthweb.com/chips/article.php/3358421
A detailed description of the Data Execution Prevention (DEP) feature in
Windows XP Service Pack 2, Windows XP Tablet PC Edition 2005, and
Windows Server 2003
http://support.microsoft.com/kb/875352
misc. past posts mentioning buffer overflow
http://www.garlic.com/~lynn/subintegrity.html#overflow
OK i think i understand now. Many thanks to you all. Considering
windows and linux memory mapping in IA32, most of the compilers opt
for the flat memory model. As such, the offset "X" in an instruction
with operand DS:X is actually relative to the beginning of the entire
flat logical address space of the application and not the Data
segment. This does not match the segmentation section in the
silberschatz OS textbook! actually he talks more about the theoritical
segmentation concept
Please Is there any reference or textbook that i can consult
addressing compilers code generation practical implementation
techniques , e.g flat model, segmented model , etc
Many thanks
That's not surprising--segmented architectures have long been more
popular in academia than in real life. The idea of tidily
constraining each address to its proper segment seems appealing, and
works OK for the toy programs you write in programming classes, but
doesn't work very well in interestingly large practical programs.
Segments worked adequately on Multics, back when both memories and
addresses were small, but the Intel chips killed them dead. On the
286, the segments were too small for a lot of data structures, and
poorly chosen bit layout in segment selectors made it needlessly hard
to use multiple segments for a single array or data structure. Also,
loading segment registers on a 286 was very slow, so there was a
performance advantage if you designed your program to use as few
segments as possible. The 386 segments were just as slow, and Intel's
decision to do paging in a single linear address space rather than per
segment made the segments all but useless since unlike the 286, using
segments didn't let you address more memory than the "tiny model" that
uses one code segment and one data segment both mapped to the same
memory. IBM 390 mainframes had an addressing mode similar to what the
386 would have been if there were a page table per segment, but the
newer Z series has flat 64 bit addressing.
>Please Is there any reference or textbook that i can consult
>addressing compilers code generation practical implementation
>techniques , e.g flat model, segmented model , etc
Compiler books generally assume flat model addressing unless they
specifically say otherwise, and in any event I can't think of many
compiler techniques useful for segmented addressing. My book "Linkers
and Loaders" might be useful to help understand how programs and data
are laid out in memory.
R's,
John
Exactly. We just don't worry about CS, DS, etc and deal with
protection strictly at the paging level.
Near as I can tell, you're assuming the segmentation is relevant --
the whole point of the way Windows and Linux do their VM is that it's
not. So CS:100 and DS:100 do indeed point to the same place: the
same place in the linear address space since CS and DS are both
zeroed, and then the same place in the physical memory since the
paging system only has one set of page tables per process.
It's just treating the whole thing as a single, linear, per-process
address space. So you're perfectly free to try to read an instruction
from address 100, or read or write data from address 100. But if you
try to do both in a single program, you're either doing something very
weird or you've got a bug (and many people would argue that you're
trying to do something weird enough that it constitutes a bug...).
Actually, segmentation works quite
well when it has the right hardware
and programmer mindset support.
Unisys 2200 XPA-series mainframes
which use segmentation (i.e.
"banking" in Unisys-speak) on top
of a very large (2**57 word) paged
intermediate address space are
still around and have been doing
useful work since their introduced
in the 1990's.
And the last time I looked, the OS
for these Unisys 2200 series was
no toy and consisted several million
lines of code.
Some regard Linux as a toy by
comparison.
> Segments worked adequately on Multics, back when both memories and
> addresses were small, but the Intel chips killed them dead. ...
IIRC Multics was pretty much dead
before the 8008 even hit the fan.
> ... On the
> 286, the segments were too small for a lot of data structures, and
> poorly chosen bit layout in segment selectors made it needlessly hard
> to use multiple segments for a single array or data structure. Also,
> loading segment registers on a 286 was very slow, so there was a
> performance advantage if you designed your program to use as few
> segments as possible. The 386 segments were just as slow, ...
Unfortunately true.
> ...and Intel's
> decision to do paging in a single linear address space rather than per
> segment made the segments all but useless since unlike the 286, using
> segments didn't let you address more memory than the "tiny model" that
> uses one code segment and one data segment both mapped to the same
> memory. ...
IMHO, the use of a single linear
address space wasn't so much the
problem as the fact that it was
2**32 bytes long, the same size
as the physical address space,
rather than at least 2*35 bytes
long, large enough to hold every
possible based segment.
Unfortunately, I don't think
there were enough unused bits in
the segment descriptors to pull
this off while maintaining
backward compatibility.
> ... IBM 390 mainframes had an addressing mode similar to what the
A great textbook, and the one I picked for teaching senior-level OS
from this semester. But I made a point of skipping the segmentation
section (because it's just not used at this point).
> Please Is there any reference or textbook that i can consult
> addressing compilers code generation practical implementation
> techniques , e.g flat model, segmented model , etc
Actually, Silberschatz does it as well as anybody I know. The thing
is, an architecture supporting a model doesn't mean the OS has to use
it.
> R's,
> John
>
Any comments from the B5500/6700/A-Series builders or customers?
--
Blaming one party is easy. Finding a solution is hard.
Patrick Scheible
How's that working for you? Virus-wise?
[...]
> IIRC Multics was pretty much dead
> before the 8008 even hit the fan.
This would have been in 1972, which makes this statement wrong to the
degree of being completely ridicolous, as easily verified on the web.
I must admit I'd forgotten about the Burroughs machines. My
impression is that they're the healthiest segmented machines around
today, but they also suffer from performance and address space issues.
Fine thanks. On my FreeBSD box, the protection between processes and a
design that doesn't run everything as the superuser is much more important
than putting code and data in separate address spaces.
For the latter, recent x86 models have per-page no-execute protection,
but I hear it doesn't help much.
there is MVS and various descendants ... where the same ("segmented")
image of the kernel appears in every virtual address space ... along
with the "common segment" ... which was an early MVS gimick allowing
pointer-passing paradigm to continue to work between different
applications and various subsystems functions when the were moved into
different virtual address spaces (i.e. application could squirrel
something away in the "common segment" and make a subsystem call,
passing a pointer to the "common segment" data). of course,
"dual-address" space ... and follow-on "access registers" ... were
attempt to obsolete the need for the common segment ... aka allowing
called routines (in different virtual address spaces) to "reach" back
into the virtual address space of the calling routine.
misc. recent posts mentioning common segment
http://www.garlic.com/~lynn/2007g.html#59 IBM to the PCM market(the sky is falling!!!the sky is falling!!)
http://www.garlic.com/~lynn/2007k.html#27 user level TCP implementation
http://www.garlic.com/~lynn/2007o.html#10 IBM 8000 series
http://www.garlic.com/~lynn/2007q.html#26 Does software life begin at 40? IBM updates IMS database
http://www.garlic.com/~lynn/2007q.html#68 Direction of Stack Growth
http://www.garlic.com/~lynn/2007r.html#56 CSA 'above the bar'
http://www.garlic.com/~lynn/2007r.html#69 CSA 'above the bar'
the ingrained (MVS) "common segment" even resulted in custom hardware
support in later machine generations. in lots of implementations,
table-look-aside (TLB) hardware implementation is virtual address space
"associative" (each TLB entry is associated with a specific virtual
address space). Segment sharing can result in the same virtual address
(information) in the same (shared) segment appearing multiple times in
the TLB (associated with use by specific virtual address spaces). The
"common segment" use was so prevalent in MVS ... that it justified
special TLB handling ... where the dominant TLB association was virtual
address space ... but there was a special case for common segment
entries ... to eliminate all the (unncessary) duplicate entries.
If everything is in the same address space - then any non-superuser can read
anything, and the very notion of super/non-super users is null and void, like
it is in Win9x/Me.
> For the latter, recent x86 models have per-page no-execute protection,
> but I hear it doesn't help much.
NX bit is in PAE (64bit PTEs) mode only. Helps a lot.
In the case of high-performance, numeric-intensive computing, the
Burroughs B5000/6000/7000/A Series machines (now known as Unisys
ClearPath MCP systems) admittedly do not stack up all that well. It's my
impression that for many this problem space is the only one in which to
measure "performance", but this facet of computing is only one of many.
In general server-oriented roles, especially OLTP and transactional data
base applications, the Unisys MCP architecture has superior qualities
and stacks up much better against other systems. The variable-length
segmentation is not entirely responsible for that, of course, but it
certainly is a contributing factor.
John's comment about address space is one that I really cannot
understand, though. All of the MCP machines going back to the B5500 have
had huge virtual address spaces, just not ones that are allocated all in
one piece. I'm not sure how you could compute the total virtual address
space limit on the current systems, but it's easily many trillions of
bytes PER TASK. You would hit some practical limitations, such as the
maximum size of the system's segment descriptor tables, long before
running out of virtual address space provided by the architecture. In
almost 40 years of working with these systems, I've never seen an
application on them that came even close to pressuring the virtual
address space, let alone exceeding it. There were serious PHYSICAL
address space problems with the B6700/7700 systems in the late 1970s and
early '80s, but these were resolved by the mid-80s in the A Series line,
and done so largely without impact on existing applications.
So are segment register loads on any x86 CPU.
x86-style segmentation requires the _developer_ (not even the compiler) to be
aware of it.
First of all, SS != DS issue, which is especially "fine" in DLLs with their own
data segment. MakeProcInstance for callbacks is second issue.
For tiny segments (286, including 16bit Windows), huge pointers are major
issue.
This, combined with high cost of segment register load and portability, caused
the most OS designers to abandon segment support from the OSes in the timeframe
of late 1980ies (design) and early-to-mid 90ies (market).
First, the 'everything' that is supposed to be in the same address
space is supposed to refer to the address space of a single process,
which is flat, meaning a pointer to anything in this address space is
just a 0-based offset. Second, MMUs usually support per-page access
permission with different privilege levels. Otherwise, they would be
quite useless for implementing memory protection.
John said it was his "impression," which is something short of a claim
to knowing the one and only truth. I'm sure he's willing to stand
corrected, just as the rest of us are open to education.
>
> In the case of high-performance, numeric-intensive computing, the
> Burroughs B5000/6000/7000/A Series machines (now known as Unisys
> ClearPath MCP systems) admittedly do not stack up all that well. It's my
> impression that for many this problem space is the only one in which to
> measure "performance", but this facet of computing is only one of many.
>
> In general server-oriented roles, especially OLTP and transactional data
> base applications, the Unisys MCP architecture has superior qualities
> and stacks up much better against other systems. The variable-length
> segmentation is not entirely responsible for that, of course, but it
> certainly is a contributing factor.
>
> John's comment about address space is one that I really cannot
> understand, though. All of the MCP machines going back to the B5500 have
> had huge virtual address spaces, just not ones that are allocated all in
> one piece. I'm not sure how you could compute the total virtual address
> space limit on the current systems, but it's easily many trillions of
> bytes PER TASK. You would hit some practical limitations, such as the
> maximum size of the system's segment descriptor tables, long before
> running out of virtual address space provided by the architecture. In
> almost 40 years of working with these systems, I've never seen an
> application on them that came even close to pressuring the virtual
> address space, let alone exceeding it. There were serious PHYSICAL
> address space problems with the B6700/7700 systems in the late 1970s and
> early '80s, but these were resolved by the mid-80s in the A Series line,
> and done so largely without impact on existing applications.
Can you give an overview of how A-Series segmentation works? "RTFM" is
OK in my case, since I have the manual...
Louis
(I've trimmed follow-ups, since as far as I know, Linux hasn't been
ported to the A-Series. Of course, I'm willing to stand corrected on that.)
No MMU supports ACLs based on users/groups for memory accesses. The only
support is kernel/user, read-only/writeable and sometimes - not on all CPUs -
execute/no-execute.
It was sort of an editorial "we". But if Intel's paging
implementation had included an execute permission (before the recently
added NX bit), buffer overflow exploits wouldn't be a serious problem.
Control Data's last great mainframe series, the Cyber 180, was strongly
influenced by Multics. It used a segmented architecture, with 16 rings, of
which only 9 were actually used. It went into production in the mid-1980s,
although I doubt any of them are still in use today.
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
>> just a 0-based offset. Second, MMUs usually support per-page access
>> permission with different privilege levels.
>
> No MMU supports ACLs based on users/groups for memory accesses. The only
> support is kernel/user, read-only/writeable and sometimes - not on all CPUs -
> execute/no-execute.
Why would it need to support ACLs? Since the page table is
per-process, and all the MMUs I'm familiar with support kernel/user,
just set the user-mode permissions based on the process owner's (or
whatever) access. No reason to make the MMU complicated to support
something that's easily done in software.
I do recognize not having execute/no-execute permission is a problem,
but that's independent of ACLs.
> :) thanks John. but after reading your post i got confused further.
> Your explanation means that DS:100 and CS:100 will point to the same
> byte/word in the physical memory (as they transit through the same
> page table entry and have both the same relative offset) although that
> physical location should hold two separate items...please forgive my
> ignorance...it seems i am missing an elementary concept
No, the compiler will not generate code like that. The compiler is aware
of the fact that it's generating code for a machine with "non-separate I
and D space", i.e. there is *one* linear address space per process, and
code and data share this address space. If you intend to write assembler
code, you'll have to be aware of that fact, too.
So, if you have 8k program code (i.e. 2 pages of 4k each) and 8k data
(i.e. 2 pages), then they will be mapped to different areas in this one
address space, e.g. the code might be mapped from 0..8k and the data
might be mapped to 128k..136k.
The other way round, if you have a NOP instruction at address CS:100,
then reading a byte from DS:100 will return 0x90, the opcode for a NOP
instruction.
HTH,
Josef
--
These are my personal views and not those of Fujitsu Siemens Computers!
Josef Möllers (Pinguinpfleger bei FSC)
If failure had no penalty success would not be a prize (T. Pratchett)
Company Details: http://www.fujitsu-siemens.com/imprint.html
What a MMU support need not necesarily be describable in terms of NTFS
file system access permissions, despite that you may know these
better.
> The only
> support is kernel/user,
aka 'different privilege levels'
> read-only/writeable and sometimes - not on all CPUs -
> execute/no-execute.
aka 'access permisssions
Actual MMUs are not necessarily that simplistic. For instance, an
ARMv5 MMU supports per-page access permission which can be 'no
access', 'ro-access in priviledged mode, no access in user mode',
'ro-access in both priviledged mode and user mode', 'r/w access in
privileged mode, no access in user mode', 'r/w access in privileged
mode, ro-access in user mode', 'r/w access in privileged mode and in
user mode'. Additionally, each descriptor belongs to one of sixteen
'domains' and the exists a 'domain controll register', which can,
individually for each process and every domain, grant either 'client'
or 'manager' access to memory belonging to that particular domain,
where 'client' access (in all modes) are checked against the access
permission bits, while manager accesses are not.
BTW, the assumption that what one does not understand will certainly
be wrong is a common, but usually wrong one.
I know they were in use as late as 2001. I worked at the time for AWI
(which was owned by CDC at one time.) One of our older lottery contracts
(Florida) was still running with Cyber 180's at that time.
--
Mike McGinn
"more kidneys than eyes"
Registered Linux User 377849
> IBM 390 mainframes had an addressing mode similar to what the
> 386 would have been if there were a page table per segment, but the
> newer Z series has flat 64 bit addressing.
Minor correction: The only difference between Z and S/390 (in this
regard) is the address space size. Z still supports Access-Register
mode (which is the one comparable to a machine with segmentation).
Note that what IBM calls "segments" are really just the second level
in the hierarchical page table, and Z adds three new "region" levels
to cover the larger address range.
Perhaps you meant to say that Z operating systems use a single flat
view where their S/390 predecessors exploited AR mode to get more
effective virtual address range (multiple 2G spaces).
Michel.
re:
http://www.garlic.com/~lynn/2007t.html#9 How the pages tables of each segment is located
hot off the press:
Buffer Overflows Are Top Threat, Report Says
http://www.darkreading.com/document.asp?doc_id=139871
from above:
Research data says buffer overflow bugs outnumber Web app
vulnerabilities, and some severe Microsoft bugs are on the decline
... snip ...
as before ... lots of past posts mentioning buffer overlow
threat/problems
http://www.garlic.com/~lynn/subintegrity.html#overflow
Right. Now that there's a 64 bit address space, I wouldn't expect anyone
to use AR mode other than for backward compatbility.
Ah, no. AR mode is used extensively for communication between address
spaces. For example, let's say you issue an SQL request to DB2. The
DB2 subsystem can write the results of the Select directly into your
address space. That is its primary function.
A secondary use for AR mode is for dataspaces (and related entities,
like Hiperspaces), which *are* used for additional data storage that
won't fit in a single 2GB address space. It is reasonable to expect
that the use of dataspaces will decrease as more and mode code becomes
64-bit aware, and that AR mode will end up mainly doing IPC.
I'd think that in OLTP, fast context switching would be important,
which you get from the stack architecture. How does the segmentation
help? Burroughs style segments are certainly helpful for reliability,
since they make it nearly impossible to clobber program code, but the
extra memory traffic to load all those segment descriptors has to be
paid for somehow.
>> John's comment about address space is one that I really cannot
>> understand, though. All of the MCP machines going back to the B5500
>> have had huge virtual address spaces, just not ones that are
>> allocated all in one piece.
The limit I'm wondering about is per-segment, not overall. On the 286,
there were plenty of segments (8K per process plus 8K global) but the
per-segment size was the problem.
Most of what I know about the Burroughs and descendants' architecture
is from Blaauw and Brooks. Their description is somewhat confusing
(not really their fault, since the hardware architecture is
phenomenally complicated), but as far as I can tell, each segment is
limited to 32K words. I realize that multidimensional arrays are
arrays of pointers so each row of an array is a separate segment, but
do you never have structures or text blobs that don't fit in 15 bits
of intra segment address.
>(I've trimmed follow-ups, since as far as I know, Linux hasn't been
>ported to the A-Series.
Too bad.
R's,
John
No B6700 or its successors has ever been taken down by a buffer overflow.
Not even on a B6700 heavily loaded with mixed production and development
work in 1972. Never. How much is that worth?
Typically today these are called "MCP systems" by everyone but Unisys
marketing, since the marketdroids have changed the names so often that
nobody else can keep track of them. And all the rest of the multiple OSs
named MCP have faded away, so the term "MCP system" has become fairly
specific.
On Tue, 27 Nov 2007 06:18:56 +0000 (UTC), John L wrote:
> I'd think that in OLTP, fast context switching would be important,
> which you get from the stack architecture. How does the segmentation
> help?
Mostly it does not help with process switching. It helped more when memory
was scarce, because you did not need to make much in particular present in
order to swap out a segment. (And code segments are read-only and thus
could be replaced without swapping out.) But as with any machine, swapping
and overlaying was always a huge kludge; it didn't get around the fact that
disk is slower than RAM.
MCP systems have never been that fast on process initiation, termination,
and switching, even with the stack architecture to help. They are great in
that they carry a lot of information with the process, and that you can
switch from one very complex environment to another with no more oerhead
than a trivial switch, but all in all really lightweight processes are not
part of the design.
> Burroughs style segments are certainly helpful for reliability,
> since they make it nearly impossible to clobber program code, but the
> extra memory traffic to load all those segment descriptors has to be
> paid for somehow.
Nah, they are just about always in the fastest CPU cache. Was probably more
of an issue in the B6700 days. However, in those days the fact that the
code is extremely compact made a lot of difference -- fewer memory fetches
for code, more likely to find the code word in cache. And the compactness
of the code was in part due to the work done in the CPU for the
segmentation and the stack.
>>> John's comment about address space is one that I really cannot
>>> understand, though. All of the MCP machines going back to the B5500
>>> have had huge virtual address spaces, just not ones that are
>>> allocated all in one piece.
This was Paul's comment. There is one caveat. It's true that the virtual
address space of a task has always been essentially unlimited. However, at
one time the physical address space was limited to 1 megaword (6MB). You
can't use a huge amount of virtual memory if your physical is severely
limited. This was becoming a major problem by the late 1970s. There were
several generations of systems with kludges which allowed more physical
memory on a system, but it was difficult for a single task to use more than
1MW. Finally -- I guess it was in the mid 1980s -- they completed the
architecture changes which allowed even a single task to use amounts of
physical memory which are still not common. (I think it runs into some
limitations at half a terabyte, but I'm not sure, and even that could be
very easily expanded within the current struture.) And again, that is a
limitation on physical memory, not virtual.
But this may have been where the idea about a limited address space came
from.
> The limit I'm wondering about is per-segment, not overall. On the 286,
> there were plenty of segments (8K per process plus 8K global) but the
> per-segment size was the problem.
On most current MCP systems, I can declare a single row of an array to be
up to 2^28 words. (The MCP will actually segment the physical memory
allocated to such a large array, but this is totally invisible to the
program. At one time this invisible segmentation carried a performance
penalty, but larger control stores have pretty much eliminated the
penalty.) But there's hardly ever a need to use a single array row that way
-- software which does so is usually doing structuring within a large flat
area, and on MCP system you would use at least some of the MCP's memory
structuring instead of doing it yourself.
There's probably still a limitation on the size of code segments. But the
compilers take care of that, and it's totally transparent. It's been years
since I was even aware of the size of code segments except in bizarre cases
where it would be a clue to some strange bug.
If you want a lot of big segments, just declare
ARRAY A [0:2**20-1, 0:2**20-1, 0:2**20-1];
(Note that exponentiation is Algol is **.) That gives you a billion array
rows of a million words each. A googol of them would only take a couple
more lines. You would not be able to use them all due to physical memory
limitations and the number of lifetimes it would take to go through them,
but you could compile the program and access some random words from
anywhere just to prove the point.
Ironically, the most noticeable limit today is one you aren't likely to
hear about. IOs are still limited to 2^16 words (6 * 2^16 bytes).
> Most of what I know about the Burroughs and descendants' architecture
> is from Blaauw and Brooks.
I'm not familiar with their book. I recommend Elliot Organick's "Computer
System Organization: The B-5700-B-6700 Series". Used copies are not
numerous but are not hard to find either. It won't give all the details,
but it's well written. Three decades out of date, but the basics haven't
changed.
> Their description is somewhat confusing
> (not really their fault, since the hardware architecture is
> phenomenally complicated), but as far as I can tell, each segment is
> limited to 32K words.
Totally incorrect. I can't even guess where this misconception came from.
As mentioned above, the modern limit is about 2^28 words. In programing
terms it was never less than 2^20 words. There were various limits that
were smaller, but they did not affect programming.
Remember that most MCP segmentation was always transparent to the
programmer.
Code segments? Maybe code segments are limited to 2^15 words, though
actually 2^13 comes to mind. But it's totally transparent. When the
compiler fills up a code segment, it generates a branch to the next code
segment. End of problem.
> I realize that multidimensional arrays are
> arrays of pointers so each row of an array is a separate segment, but
> do you never have structures or text blobs that don't fit in 15 bits
> of intra segment address.
Huh? I do not even have a context to hang this in. Perhaps you are thinking
that structures have to be implemented within a segment? But no, if you
have a structure within a structure, the embedded structure is simply
represented by a descriptor, which has all the capabilities of a top level
descriptor. That's why the virtual address space is essentially unlimited.
This method does cause probems in copying structures, which is not a strong
point under the MCP.
If that's not it, please explain a bit more what type of structure you are
thinking of.
>>(I've trimmed follow-ups, since as far as I know, Linux hasn't been
>>ported to the A-Series.
>
> Too bad.
However, C was ported long ago, and POSIX compliant programs run pretty
well. I realize that's not the same thing by a long shot, but a lot of
useful programs have become available under the MCP as a result.
Edward
--
Art Works by Melynda Reid: http://paleost.org
>...
> I'm not familiar with their book. I recommend Elliot Organick's "Computer
> System Organization: The B-5700-B-6700 Series". Used copies are not
> numerous but are not hard to find either. It won't give all the details,
> but it's well written. Three decades out of date, but the basics haven't
> changed.
http://bitsavers.org/pdf/burroughs/B5000_5500_5700/Organick_B5700_B6700_1973.pdf
Enjoy!!
Cheers, Mike.
---------------------------------------------------------------
Mike Hore mike_h...@OVE.invalid.aapt.net.au
---------------------------------------------------------------
--
Posted via a free Usenet account from http://www.teranews.com
I would guess that it came from the B-5500 where there were only 15
bits for address. I was told once that this number came more or less
from the IBM 7090 family, which also had only 15 bits of address and
which were competitors to the B5500.
>>>(I've trimmed follow-ups, since as far as I know, Linux hasn't been
>>>ported to the A-Series.
>>
>> Too bad.
>
>However, C was ported long ago, and POSIX compliant programs run pretty
>well. I realize that's not the same thing by a long shot, but a lot of
I haven't looked at any of this, but will comment that it's hard to do
something like Linux in a Burroughs-style machine because C in a more
conventional machine lets you get away with things that the Burroughs
architecture would prohibit. And there are differences in coding
style. In the B5500 MCP there were lots of arrays, taking advantage
of the array descriptor facilities, but no structures. In Unix/Linux
there tend to be arrays or lists of structures. Things that are grouped
together in a structure in Unix tend to be scattered into a whole bunch
of arrays in MCP. In B5500 MCP a central datum was the mix index, a
small number representing a job currently in the mix and used as an index
into all the arrays of information about that job. In Unix the process ID
number serves somewhat the same purpose, but PIDs run into thousands, with
most of the available numbers being unused at any one time.
The B5500 was an outstanding batch job machine and a lousy time sharing
machine. The reason it was lousy was that absolute addresses get into the
stack, so when a job is swapped completely out it has to be swapped back
in to the same memory addresses it previously occupied. This was corrected
by an architectural change in the B6500. Also the disk I/O was slow
enough that swapping a large job in and out was seriously time consuming.
While there was a time sharing version of B5500 MCP, many sites instead
ran the batch MCP and an application called R/C or Remote/Card with was
similar to Wylbur as used on OS/360. R/C allows a terminal user to
call up and edit a program file, submit it as a batch job for execution,
and then examine the output when the job completes.
First, Louis' point concerning my comment on John L's "impression" is
well taken. As I read John's sentence concerning the Burroughs machines,
the first clause was an impression, but the second was a conclusion. If
that was not what John intended, then I apologize.
Louis asked if I could give an overview of how the Burroughs
segmentation works, and John Ahlstrom has privately encouraged me to do
so. I think I know this reasonably well from a software perspective, but
as John L points out, the architecture is phenomenally complicated. The
elements of the architecture are also extremely synergistic, and it's
impossible to talk about one aspect (such as segmentation) in the
absence of others. As a result, this is a very long post -- about 9900
words.
In order to understand the context in which memory addressing and
segmentation operate, I also have to talk to some degree about stacks,
compilers, codefiles, and the very tight integration between the
hardware architecture and the MCP (operating system) architecture. This
isn't easy, so please bear with me while I lay down some groundwork and
then to try to describe (without pictures) how segmentation and memory
addressing for this very complex, but fascinating architecture work.
Background.
I will start with a short history lesson. The origin of the Burroughs
stack/descriptor architecture was the B5000 of 1962. Bob Barton is
generally credited with the conceptual design of the architecture. This
machine was a major departure from anything Burroughs had attempted
before. It was clearly influenced by the Rice University (where Barton
had spent some time) computer, particularly by the Rice system's use of
"codewords", from which the concept of descriptors arose. I believe the
Ferranti Atlas was also an influence on the B5000 design. A somewhat
updated version of the architecture was released as the B5500 in 1964
and stayed in production until 1970. At the very end there was briefly a
version called the B5700, but it was really a B5500. (The B5900
mentioned below is a contemporary of the other Bx900 models and not a
variant of the B5500.)
Burroughs began work on a successor machine, the B6500, in 1965, but it
was not released until 1969. It had a very difficult introduction to
customer use, and did not work properly until it was updated and
re-released as the B6700 in 1971. All B6500 CPUs in the field were then
replaced with B6700s.
The B6500/B6700 was a substantially different design from the B5500 --
the two instruction sets were quite different, as were the format of
descriptors and other control words. Limits on the size of both physical
and virtual address spaces were increased dramatically. A dramatically
different way of handling character-oriented data was also introduced.
About the only things that got carried over without change were the word
size (48 bits), the integer and floating point word formats, and the
six-bit character code, known as BCL. The MCP operating system was also
completely redesigned. Nonetheless, the two designs are conceptually
similar, and both the B6500/6700 hardware and system software should be
seen as refinements of those for the B5500.
The Burroughs B6700 is the basis for the Unisys ClearPath MCP systems
that are still being sold today. There have been some tweaks to the
instruction set, and a major change was made in the mid-1980s to the way
memory is addressed to accommodate still larger physical address spaces,
but the processor architecture (at least from the perspective of
user-level software) has remained essentially the same since 1970. The
architecture is now referred to internally as "E-mode" (I think the "E"
stands for "emulation"), and its formal specification has gone through a
number of levels, known as Beta, Gamma, Delta, and (the latest) Epsilon.
The machines have had a variety of marketing names over the years. Since
the various names can be confusing to those not familiar with the
product line, here is a summary:
Burroughs B5000/5500/5700 -- the original early 1960s architecture.
Burroughs B6500/6700/7700/6800/7800/B5900/6900/7900
Burroughs/Unisys A Series (MicroA, A1-A7, A9-A13, A14-A19)
Unisys ClearPath 4600/4800/5600/5800/6800
Unisys ClearPath NX4200 (the first of the models where the hardware is a
standard Intel box and the MCP architecture is emulated on the Intel
processor. This approach is currently used on all small-to-medium
performance models)
Unisys ClearPath LX5000/6000/7000 (later emulated systems)
Unisys ClearPath Libra (these are also referred to as the "CS" series;
the Libra 300, 400, and 520 models are emulated systems)
Having so many models, we need a generic name, so I'll follow the
current Unisys convention and refer to them as MCP systems
(acknowledging that other, now obsolete, Burroughs/Unisys architectures
also used operating systems named "MCP").
With that background, I will try to describe how memory addressing and
segmentation works on the B6700, since discussing that particular model
gives a good conceptual base. The mechanism was similar, but more
primitive, on the B5500. This mechanism changed with the A Series to
expand the physical address space and implement improvements in some
other areas. I'll discuss those differences later. Even though the B6700
has been obsolete for 30 years, I'll talk about it mostly in the present
tense.
Words and Tags.
The best place to start is with word formats. The B6700 (like the B5500)
has a 48-bit data word. One of the significant changes from the B5500,
however, was the addition of some extra bits to the word, called the
"tag". The B6700 and later systems through E-mode Beta had a three-bit
tag. E-mode Gamma and subsequent systems use a four-bit tag.
The tag indicates what type of information the word contains. Generally,
tags 0, 2, 4, and 6 are data types that user-level code can freely
manipulate; the rest are control types that are used by a combination of
the hardware and operating system kernel. The tag values for the B6700 are:
0 = single-precision data word (numeric or character data)
1 = IRW (indirect reference word) or SIRW (stuffed indirect
reference word). These are effectively indirect addresses to
words in a program stack. A brief discussion of what "stuffed"
means is covered in the section on Addressing Data, below.
2 = double-precision data word (typically numeric)
3 = general purpose protected control word: stack linkage, code
segment descriptor. All words in a code segment (one containing
executable instructions) also must have tags of 3.
4 = SIW (step index word): control word for the STBR (step branch)
looping instruction -- a nice idea, but the performance was
poor, and STBR is no longer supported on current systems. This
tag value is now used by the MCP as a special marker word in
stacks (e.g., for fault trap indicators).
5 = data descriptor. These words are the basis for data
segmentation and are discussed in detail below.
6 = uninitialized operand. Words with this tag value can be
overwritten, but reading them generates a fault interrupt. Used
as the initial and NULL value for pointers.
7 = PCW (program control word). The entry point to a procedure
(subroutine). Also used as one form of dynamic branch address.
The additional tag values introduced with E-mode Gamma are primarily
used for optimizations of descriptors after a segment has been allocated
and are not significant for this conceptual discussion.
There are user-mode instructions that can read and set the tag on the
word at top of stack. You could theoretically do quite a bit of mischief
with this on the B6700, but the compilers are a trusted component of the
system, so in practice this was not a problem. In more recent systems,
the microcode carefully restricts which tag values can be set on data
words in the stack. The Set Tag instruction (STAG) running in user-level
code cannot now, for example, set tag 5 on a data word with bit 47=1
(i.e., create a present data descriptor -- see below for what this means).
This brings up the subject of "codefiles" (files containing executable
code) and compilers. A codefile is a controlled entity under the MCP.
Codefiles can be generated only by compilers, which are programs that
must be "blessed" using a privileged MCP command. There is no assembler
for these systems, nor anything that allows a user to generate arbitrary
sequences of object code. Recompiling the source code for, say, the
COBOL-85 compiler does not generate a compiler, just an ordinary program
that inputs source code and outputs a file of instructions and other
run-time data. That file is not a codefile, however. It cannot be
executed and cannot be turned into a codefile. There is no facility for
a program (even one running with the highest privileges) to read that
file and present any of its data to the hardware as a code segment.
Words of instructions in a code segment must have tags of 3. There is no
way for user-level code to set those tags, let alone create a code
segment and branch to it. The I/O hardware has a variant of the disk
read operation that will apply tags of 3 to data as it is being read and
transferred to memory, but only the MCP has access to the physical I/O
hardware, so there is no way for user programs to initiate such an
operation.
[When I say "no way" here, I mean that the hardware/MCP architecture is
specifically designed to prohibit such a thing, and I'm not aware of
even a conceptual attack that could get around the design. There have
been some holes discovered in the past, though, and of course it's
entirely possible that there are more that have not yet been brought to
light. I won't speculate on the likelihood of their existence. To me,
the weakest part of this design is the trust that must be placed in the
compilers. If you can generate arbitrary code, or modify a valid
codefile externally (e.g., on a backup tape) and reload it, you can get
around at least some of the system's protections, but it's still not easy).]
This discussion of codefiles and compilers highlights a basic
characteristic of MCP memory segmentation: code and data are entirely
separate entities and are allocated and managed by the system
separately. There are code segments and data segments, and while they
are allocated from the same system-global heap and may be adjacent in
physical memory, logically they are separate and addressed entirely
differently. Both types of segments can be created only by the MCP. The
contents of code segments are loaded solely from codefiles. Code
segments are read-only, and as we will see, are automatically reentrant.
Data Segment Descriptors.
Descriptors are called such because they "describe" an area of memory.
MCP systems are a form of capability architecture, and the descriptors
are the capability -- you have to have access to the descriptor to
access the data it describes. Descriptors are the basis of memory
addressing and memory segmentation.
A tag-5 word in the B6700 architecture represents a data descriptor. The
word has a number of fields which I will identify using what is called
partial-word notation, [s:n], where "s" is the starting bit number and
"n" is the length of the field in bits. Bit 47 is high order and bit 0
is low order; all MCP systems use big-endian addressing.
[47:1] Presence bit (commonly known as the "P-bit")
[46:1] Copy bit
[45:1] Indexed bit
[44:1] Paged bit
[43:1] Read-only bit
[42:3] Element Size field
0 = single precision words
1 = double precision words
2 = four-bit characters (packed decimal)
3 = six-bit characters (BCL, now obsolete)
4 = eight-bit characters (EBCDIC)
[39:20] Length or index field (determined by [45:1])
[19:20] Segment starting address
Accessing the data in a segment is done by means of "indexing" the
descriptor with a zero-relative offset into the segment. There are a
series of instructions that do this, as will be detailed below.
The Presence bit indicates whether the memory area represented by the
descriptor is physically present in memory. It is the basis for the
virtual memory mechanism. "Virtual memory" is IBM's clever marketing
term that everyone has adopted, but it does not accurately describe what
goes on in MCP systems. "Automated memory segment overlay" is a more
accurate term, and is what Burroughs called it before IBM's term caught
everyone's attention.
When the Presence bit is 1, the segment is physically present in memory
starting at the real address in [19:20]. (Note that in A Series and
later systems, the real memory address is no longer contained in
descriptors, as will be discussed later).
When the Presence bit is zero, the segment is not present in memory (or
has never been allocated) and the descriptor is said to be "absent"
(although it's really the data area that is absent). Attempting to
access a segment through an absent descriptor generates a "Presence bit
interrupt" -- what other systems would call a page fault. When handling
this interrupt, the MCP interprets the value in the address field as
follows:
* If zero, this indicates that the segment has never been allocated.
The MCP simply allocates an area of length specified by [39:20],
relocating or overlaying other physically-present segments as
necessary. The MCP clears the allocated area to binary zeroes
(primarily to wipe out any nasty tags that may be there from prior
use of that space), fixes up the address field in the descriptor and
sets its Presence bit. Exiting from the interrupt handler causes the
hardware to restart the instruction which generated the interrupt.
The compilers generate code in the initialization section of
procedures to build these "untouched" descriptors directly in the
program's stack. The physical memory area is not allocated until
(and unless) the program actually references it.
* Some types of objects require additional information from the
compiler (e.g., for multi-dimensional array-of-arrays structures,
the length field in the descriptor only specifies the length of the
first dimension so certain bit patterns in the address field for
untouched descriptors point to data structures the compiler has
built within the codefile that carry the necessary information for
the other dimensions).
* Otherwise, the segment was once present in memory but has been
rolled out, usually due to pressure from other memory allocation
activity. The value in the address field of the descriptor indicates
the location within the task's "overlay file" (a temporary, unnamed
file the MCP allocates for each task) where the data associated with
this descriptor has been rolled out. The MCP allocates an
appropriate area of memory, reads the segment from the overlay file,
fixes up the descriptor with the new real memory address, and exits
the interrupt handler to restart the instruction that was
interrupted.
The Copy bit indicates that a tag-5 word is not the original descriptor
for an area, but rather a copy of it. A non-copy descriptor is said to
be the "mom" descriptor for an area, and there can only be one such mom.
Copy descriptors are generated automatically by a number of
instructions, principally those involved with indexing and loading words
on the top of stack (e.g., to pass an array as a parameter to a
procedure). Present copies point directly to the segment's memory area;
absent copies point to the mom.
In the B6700, there is no central page table, per se, to keep track of
memory areas. The mom descriptor effectively serves the purpose of a
page table entry. Every area of physical memory (allocated or available)
is surrounded by a set of "memory link" words that are used by the MCP
memory allocation routines to keep track of allocated areas and locate
available areas of appropriate size. For allocated areas, one of these
link words points back to the mom descriptor. The handling of moms and
copies is another thing that has changed with the A Series and later models.
The Indexed bit indicates whether the descriptor points to a whole
segment or to one element within a segment. When the bit is 0, the
length/index field contains the length of the segment. Some indexing
instructions generate an "indexed" copy descriptor with both the Indexed
and Copy bits set to 1 (all indexed descriptors are by definition
copies). Indexed descriptors are typically used as a pointer, e.g., as a
destination address for store operators or as a call-by-reference
parameter to a procedure. They are also used as starting addresses for
string manipulation instructions. Algol has a "pointer" type; it
represents an indexed data descriptor.
In an indexed descriptor, the length field is replaced by a
zero-relative offset into the segment. The physical address of that
element (assuming the segment is present in memory) is the sum of the
base address in [19:20] and the offset in the length field.
Depending on the value of the Element Size field [42:3], a descriptor
could be word oriented or character oriented. Physically, memory is
accessed as whole words. When a character-oriented descriptor is
indexed, its length field is replaced by a specially-encoded offset.
Bits [35:16] are a word offset within the segment; bits [39:4] are a
character offset within that word. This obviously limits the offset for
character pointers to 393215 for eight-bit characters and twice that for
four-bit packed decimal digits (although this limit can be effectively
eased by the use of paged areas, discussed next). The string
instructions understand these character offsets and transparently handle
character operations that start in the middle of words.
The Paged bit indicates whether the memory area for the descriptor is
monolithic (0) or paged (1). If paged (the original term for this is
"segmented", but talking about segmented segments can be confusing), the
address field does not point to the address of the data, but instead to
a vector of other descriptors. Each of these descriptors in turn points
to a data segment. In memory, this looks identical to a two-dimensional
array-of-arrays structure. The difference is in how the software
accesses the data. With a standard two-dimensional array, the software
must explicitly compute and apply an index for each dimension. With a
paged segment, the software is unaware of the second dimension. When the
indexing instructions encounter a descriptor with the Paged bit set,
they partition the index value into a page number and page offset and
automatically re-index the second dimension. If an indexed copy is
generated, it is a copy of the descriptor for the page, not the original
(paged) descriptor, and the index offset will be that within the page.
This helps alleviate the smaller limit for word offset available with
indexed character descriptors.
For the B6700, the page size was 256 words. Data areas were typically
paged when they exceeded 1024 words, but the programmer had some control
over this. The page size on current systems is 8192 words. The paging
threshold is adjustable by the system administrator, and by default is
also 8192 words. As with all memory areas in the system, the individual
pages are not allocated until first referenced. The array of page
descriptors (called a "dope vector") is not even allocated until one of
the pages is initially accessed. All pages are individually relocatable
and overlayable.
String operations that run off the end of a physical memory area
generate a "segmented [paged] array" interrupt. The MCP responds to this
interrupt by locating the next page in sequence (if there is one) and
restarting the operation. If there is no next page (which would also be
the case if the string operation ran off the end of an unpaged segment),
the result is a "Segmented Array Error", with which everyone who has
programmed for MCP systems is no doubt intimately familiar. Unless
trapped, this error (a form of boundary violation) is fatal to the program.
The Read-only bit is just that -- if 1, it marks the segment as
non-writeable. This is primarily used for segments that represent
constant pools generated by the compiler and which are loaded from the
codefile.
The Element Size field, as discussed above, determines whether the
descriptor represents single precision word, double precision word, or
character-oriented data. It is quite common, especially with COBOL
programs, to have multiple descriptors with differing size fields
pointing to the same segment. Only one of these could be the mom, of
course; the rest would have to be copies. This allows the software to
address the same memory area as a mixture of word and character fields.
The B5500 strictly used six-bit characters; the B6700 was basically an
eight-bit EBCDIC machine, but could also handle six- and four-bit
characters. Support for the six-bit codes was dropped from the
architecture in the Bx900 models, ca. 1980. I understand that the latest
Libra models have added support for 16-bit characters.
The role of the Length/index and Address fields has been covered in the
discussion above, so nothing additional needs to be said about them here.
Code Segment Descriptors.
Thus far, the discussion has been about data descriptors. There are also
descriptors for code segments. They are similar to data descriptors, but
simpler. Code segment descriptors have a tag of 3 and the Presence bit,
Copy bit, Length, and Address fields of data descriptors. Bits [45:6]
are not used. Code segments cannot be indexed or paged. They are by
definition read only. The only element size they support is single
precision words.
Code segment descriptors live in a special type of data segment called
the Segment Dictionary. An image of this segment is built by the
compiler (all descriptors being in their absent, untouched form, of
course) and stored in the codefile. The Segment Dictionary is loaded by
the MCP as part of task initiation. In addition to code segment
descriptors, the Segment Dictionary may contain (read-only) data
descriptors for constant pools and scalar constant values. The Segment
Dictionary in memory is actually a type of stack, although not for
push/pop type of activity. As we will see, stacks in MCP systems are a
central element in the addressing environment, and it is for this
purpose that a Segment Dictionary is loaded as a stack. Segment
Dictionaries are also sharable -- if multiple tasks are initiated from
the same codefile, the Segment Dictionary is loaded only once and the
separate tasks are linked to this common copy. Thus, all of the object
code and read-only constant pools for a program are automatically reentrant.
Object code is addressed using a three-part index. The first part is the
"segment number", which is the code segment descriptor's offset within
the Segment Dictionary. The second part is the word offset within the
segment. The third part is the instruction syllable (byte) offset within
the word. These numbers are zero-relative and generally written in hex,
so an address of 03C:0041:3 indicates segment #60, word offset 65, byte
offset 3.
The processor uses variable-length instructions and can branch to any
syllable offset within a segment. By using one of several dynamic branch
instructions with a Program Control Word (PCW, tag 7) as an argument, a
program can branch across segments -- in fact PCWs allow programs to
branch and call procedures across stacks. When a program branches to a
different code segment (either directly or by means of a procedure
call), the segment is made present if it is not already so, and the
segment length and base memory address are loaded into registers within
the processor designated for that purpose. Intra-segment branches use
only the word and syllable offsets, and therefore do not need to
continually reference the code segment descriptor.
The Presence bit in a code segment descriptor indicates physical
presence or absence of the segment in memory, just as for data segments.
For an absent descriptor, the address field indicates the offset within
the codefile where the segment starts. Code segments and constant pools
are not loaded from the codefile until first reference. When a program
is initiated, the Segment Dictionary is the only thing that is loaded
from the codefile. Everything else is loaded as a result of Presence bit
interrupts.
Since codefiles and code segments are guaranteed to be read only, code
segments (and constant pools loaded from the codefile) are never rolled
out to an overlay file. The physical memory area is simply deallocated
and the codefile offset restored in the absent descriptor (that offset
is stored in one of the memory link words while the segment is present
and the descriptor's address field points to the area in memory). The
code or data is simply reloaded from the codefile the next time a task
trips over the descriptor's Presence bit.
Resizing Segments.
Descriptors obviously support bounds checking, along with the dynamic
relocation and overlay of real memory areas, but they have another
significant advantage -- the dynamic resizing of data segments. Since
the length of a segment is part of the descriptor, the basis for bounds
checking is centralized. The physical memory area can be resized by a
user program at run time, the length in the descriptor will be updated,
and future indexing operations will check against the new length.
The B6700 and all later systems support the programmatic resizing of
data segments. Code segments are immutable, so resizing them is not a
meaningful operation. Support for resizing varies by language, but in
Algol, it is performed using the RESIZE intrinsic procedure.
RESIZE(A,N) resizes the array A to a new length of N. The unit of N
is determined by the Element Size field in the descriptor. The old
segment is deallocated and its contents are lost. The segment with
the new length will not be allocated until it is next referenced.
RESIZE(A,N,RETAIN) allocates a new segment of length N and copies up
to N units from the old segment to the new one. If the new length is
shorter, any remaining units from the old segment are lost; if the
new length is longer, the remainder of the segment is filled with
binary zeros. Once the data is copied to the new segment, the old
segment is deallocated.
RESIZE(A,N,PAGED) is similar to the RETAIN option, but creates the
new data segment with a paged data descriptor.
RESIZE works on individual segments. In the case of multi-dimensional
array-of-arrays structures, the row for each final dimension can be
resized independently and to differing lengths.
Actually, RESIZE can be applied to any of the dimensions of a
multi-dimensional array. Given an Algol declaration of
ARRAY M [0:99, 0:49, 0:63, 0:4095];
RESIZE(M[4,7,*,*], 75, RETAIN) would resize the M[4,7] dope vector in
the third dimension from its original length of 64 to a new length of
75. This would create 11 new, untouched descriptors of length 4096 at
the end of that dope vector.
These resizable array-of-arrays structures are often used, especially in
Algol, to implement flexible, safe, dynamic storage allocation schemes
for user programs.
Addressing Data -- Indexing and the Stack.
Note that in the discussion thus far, nothing has been said about
user-level programs having access to memory addresses -- real or
virtual. Data descriptors for the B6700 can hold real memory addresses,
but the content of descriptors is controlled only by hardware
instructions and the MCP, not user-level code. User-level code (and all
but small parts of the MCP kernel, for that matter) does not manipulate
addresses, it manipulates offsets into memory areas. Separate memory
areas generally relate to separate objects for a programming language
(e.g., arrays for Algol, Pascal, and Fortran, or 01 levels for COBOL).
Since there are no addresses, there can be no invalid addresses. You can
try to use an invalid offset, of course, but that offset must be applied
against a descriptor, where it is ALWAYS checked against the length
field. If the offset is less than zero or greater than or equal to the
length, your task becomes the happy recipient of the "Invalid Index"
bounds violation interrupt. Depending on the language you are using, you
can trap this interrupt (actually, what you trap is the MCP's response
to the interrupt), but you can't restart the operation which caused it
-- your only options are to branch to a recovery routine or enter the
catch portion of a try/catch block. If you do not trap the interrupt,
your task is terminated by the MCP.
Another thing that has not been mentioned yet is registers. There are
certainly registers in the processor -- the B6700 had dozens of them --
but the instruction set accesses them implicitly. None of them are
accessed directly by user-level code. (Actually, on the B6700 that
wasn't quite true -- compilers would generate code to access some of the
registers directly, but this was not common. Access to these registers
has been tightened up considerably on more recent systems). Instead of
loading addresses directly into registers and using those to read and
store real or virtual memory locations, user-level programs on MCP
systems access data in two ways: (a) directly in a stack or (b) outside
of a stack by going through a descriptor that is in a stack.
A stack in the MCP architecture is simultaneously three things:
(a) a memory area for push/pop expression evaluation,
(b) the stack frames (activation records) and return history for
procedure calls, and
(c) the basis for addressing global and local variables in a
program.
The B5000 and all of its descendants were designed to support Algol. To
understand how stack addressing works, you need to understand how Algol
programs can be structured. If you don't know Algol, think Pascal --
they're very similar.
An Algol program is constructed as a series of nested blocks. Each block
can contain local variables (including procedure declarations), but also
has addressability to the variables in all of its more global (i.e.,
containing) blocks. The body of a procedure can be considered a block,
whether it has local declarations or not. The scope rules for
identifiers in this scheme are the same as for a two-level language such
as C, it's just that there can be more levels. In MCP systems, this
nesting is referred to as the "lexicographical level" (lex level or LL).
The global code ("outer block") of most user programs runs at LL=2.
First-level procedures (such as you would have in a C program) run at
LL=3. Any procedures declared within those first-level procedures
(something you can't do in C) would run at LL=4, and so forth.
The Segment Dictionary is loaded as a stack with an environment of LL=1.
Therefore, multiple tasks initiated from the same codefile see their
code segments and constant pools as globals at a higher level -- hence
the Segment Dictionary and all of the segments based on it are
reentrant. The MCP stack (another that is purely an addressing
environment, not a data space for a task) is at LL=0. Note that stack
addressing can cross stacks. Also note that on the B6700, all MCP
globals were in the scope of the user programs. This allowed them to
call MCP service procedures directly, as if they were global to the user
program (which, in fact, they were). There is no "service call" or
"branch communicate" or other major environment change to access O/S
services -- user programs simply call what to them are global procedures.
This accessibility to the MCP stack was recognized as a serious security
issue fairly early on, and later systems blocked direct access to LL=0.
User programs now still call MCP service routines as normal procedures,
but access to the procedure entry points and other global objects is
provided through indirect addresses in the Segment Dictionary. These
indirect addresses are initially set up to cause a fault interrupt when
first accessed, at which point the MCP verifies the access and replaces
the intentionally bad word with a valid SIRW (tag=1, Stuffed Indirect
Reference Word) to the MCP global. Subsequent references to the service
routine merely require a one-level indirection to reach the entry point.
The "stuffed" for an SIRW refers its ability to address a word in a
stack outside the current scope chain, and in particular, across stacks.
What gets stuffed in the word is sufficient environment information to
allow out-of-scope, cross-stack addressing. A normal IRW (also tag=1)
only addresses within the current scope chain. There is an instruction
(STFF) that converts a normal IRW to an SIRW. Perhaps inevitably, it is
commonly called the "get stuffed" operator. (Current systems no longer
generate normal IRWs -- all reference words are now generated as SIRWs
unconditionally, so STFF has become a no-op).
The local variables for a block are allocated in the stack. In the case
of procedures, this provides efficient recursion. Simple blocks (i.e.,
nested BEGIN/ENDs containing declarations) are implemented as
parameterless procedure calls. Therefore, more-global variables are
lower in the stack, or possibly in a separate stack (note that unlike
some systems, MCP stacks grow from low addresses to high addresses -- a
push increments the top-of-stack address, not decrements it).
Addressing within the current scope chain is a very common thing to do,
so MCP systems provide a series of base registers, called the "D" (for
"display") registers. There is one D register for each lex level, and it
contains the absolute (real) memory address to the base of a block's
stack frame. The B6700 had 32 such D registers, but this proved (for
once) to be more than necessary. Later systems cut back to 16 D
registers (allowing user procedures nested 13 levels deep -- I doubt
that I've ever coded anything that goes more than four levels deep). The
B5900, which was the first microcoded processor and was based on
bit-slice chips, tried to get by with four D registers (0, 1, 2, and
current LL), but that didn't work too well, and that approach was
abandoned in later designs.
The simplest addressing mode for MCP systems is based on these D
registers and uses a construct known as an "address couple". The address
couple has two fields, LL (which selects a D register) and an offset
from the address in that D register. This is written "(LL,offset)" --
thus (2,17) refers to the 17-th (zero relative) word in the global
(LL=2) address space of a program. For an Algol program, this address
space would be the outer block; for COBOL, it would be WORKING-STORAGE;
for FORTRAN, it would be COMMON; for a C program it would be the
environment for static declarations.
With that introduction to stack addressing, here is the key concept:
scalar variables are allocated in the stack; non-scalar variables
(arrays, structures, records, whatever) are allocated outside the stack.
A descriptor is allocated in the stack to access that non-scalar area.
To illustrate this, here is a simple Algol program:
1: BEGIN
2: INTEGER I; % I is allocated at (2,2)
3: ARRAY A[0:99]; % descriptor for A at (2,3)
4: I:= 1;
5: DO
6: A[I]:= A[I-1]+I
7: UNTIL I:= I+1 > 100;
8: END.
The stack offset for I starts at 2 because there are two linkage words
at the base of a stack frame for (a) the procedure return address and
(b) a control word called the MSCW (Mark Stack Control Word) that allows
the processor to reconstruct the D registers to the values for the
calling environment upon procedure (or in this case, block) exit.
Here is the code that Algol would generate for this snippet (note that
this is slightly idealized and out of order from what the current Algol
compiler would generate, but both the concept and effect are accurate).
Line 1: (nothing)
Line 2: ZERO push a zero value onto the stack at (2,2)
Line 3: LT48 000006400000 push a skeletal descriptor with length=100
onto the stack at (2,3)
LT8 5 push a literal 5 onto the stack
STAG set a tag=5 onto the skeletal descriptor; this
leaves an untouched descriptor at (2,3)
PUSH flush the top-of-stack (TOS) words into memory
LT16 2048 push literal 2048 (this is a marker word for the
MCP BLOCKEXIT procedure called at the end)
BSET 47 set bit 47 in TOS word
LT8 6 push a literal 6
STAG set tag=6 on the BLOCKEXIT marker word
Line 4: ONE push a literal 1 onto the stack
NAMC (2,2) Name Call: push an IRW for I onto the stack
STOD Store Destructive: store the 1 to the (2,2)
address; pop both the IRW and literal 1
Line 5: (nothing)
Line 6: VALC (2,2) Value Call: push a copy of the value of I
NAMC (2,3) push an IRW to the descriptor for A
INDX Index: pop both parameters; push an indexed copy
descriptor for A[I] (think: address of A[I])
VALC (2,2) push another copy of I
ONE push a literal 1
SUBT Subtract: pop both parameters and push the
value of I-1
NAMC (2,3) push IRW to the descriptor for A
NXLV index and load value: index A by I-1; pop both
parameters and push the value of A[I-1]
VALC (2,2) push a copy of the value of I
ADD pop top two values; push the value of A[I-1]+1
STOD store A[I-1]+I in A[I]; pop both addr & value
Line 7: VALC (2,2) push a copy of the value of I
ONE push a literal 1
ADD pop top two values and push the value of I+1
NTGR integerize TOS word with rounding
NAMC (2,2) push an IRW for I onto the stack
STON Store Nondestructive: store I+1 back into I;
pop the address but leave value I+1 on TOS
LT8 100 push a literal 100 onto the stack
GRTR Compare Greater: [(I+1) > 100]; pop both values;
push result of comparison: 1 if true, 0 if false
BRFL 4:4 branch false: if low-order bit of TOS word is 0,
branch to Line 6 (word 4, syllable 4 in the
current code segment); pop the TOS word
Line 8: MKST construct and push a MSCW (Mark Stack Control
Word) in preparation for a procedure call
NAMC (1,4) push an IRW to the PCW for the MCP's BLOCKEXIT
procedure (actually, for an MCP intrinsic, it
would be an IRW to an SIRW in the Segment
Dictionary to the PCW in the MCP stack for
BLOCKEXIT). This procedure is not passed any
parameters, but if it were, they would be pushed
into the stack at this point.
ENTR Enter: call the BLOCKEXIT procedure
EXIT Exit Procedure: cut back the stack (thus
destroying this activation record), and exit the
outer block of the program (this exits back into
an MCP procedure which terminates the task and
disposes of the stack's memory and related
resources)
When this program is initiated, the MCP reads some information from the
codefile that tells it how to set up the data stack, including a
recommended initial size. If the Segment Dictionary is not already
present (due to another task executing the same codefile), a "code
stack" is allocated for the Segment Dictionary and its image is loaded
from the codefile. There is a base area of the data stack that the MCP
uses for task management, which it also sets up. No program globals are
loaded, however -- this will be done by stack-building code generated by
the compiler for the outer block's data segment (as is shown for I and A
in the example above). Instead, the MCP creates a dummy stack frame that
makes it appear as if this task has called a procedure, but the return
address from that call is set up as the entry point to the outer block's
code segment.
The MCP also constructs a TOSCW (Top of Stack Control Word) at the base
of the stack, which tells the hardware how to find the top of stack and
the base of the top stack frame. From that, the processor can
reconstruct all of the stack linkage, D registers, return instruction
address, and so forth. After building the initial stack image, the MCP
simply links the task into the READYQ, the prioritized list of tasks
waiting for a processor. Once the task rises to the top of this queue, a
processor is assigned to it, at which point the processor "exits" into
the entry point.
The B6700 has a instruction, MVST (Move to Stack) that switches the
processor from its current stack to another one. This instruction
constructs a TOSCW for the current stack and uses the TOSCW for the new
stack to reconstruct stack linkage and registers settings. Later systems
did context switching in different ways, but it appears that on the
current Libra systems, MVST is once again how it's done.
Note that once the MCP sets up the initial stack image and releases the
new task to the READYQ, all further saving and restoring of registers
and other state information is handled automatically by the hardware.
Since all registers have specific purposes (i.e., there are no
general-purpose registers being used who knows when and for what), the
hardware knows when the value of a register needs to be pushed into
memory or recalled. This applies not only to context switches between
tasks, but also to all procedure calls. Hardware interrupts are
implemented as a forced procedure call on the stack that currently has
the interrupted processor, so the same state-saving mechanism is used
for interrupts as well.
Back on the Algol example above, the very first thing that happens when
the processor exits to the entry point is a Presence bit interrupt as it
detects the Presence bit is zero in the descriptor for the outer block's
code segment. Execution continues once this code segment is made present.
The stack-building code at the beginning of the outer block creates the
local variables for the stack frame and pushes them onto the stack. In
the case of integer I, this is simply a literal zero; in the case of
array A, the code constructs an untouched data descriptor of length 100.
The 100 words of memory for the array will not be allocated until the
descriptor is first "touched" and its zero Presence bit detected. This
will happen the first time the NXLV (Index and Load Value) instruction
is executed in Line 6. Note that the INDX (Index) instruction executed
earlier does not cause a Presence bit interrupt, since it only generates
an indexed copy descriptor and does not attempt to access an array
element. The INDX instruction effectively acts as a "load address"
instruction. Bounds checking takes place on both INDX and NXLV, however.
The program then proceeds to initialize the value of I (some compilers
would fold this assignment into the stack-building code, but Algol does
not), and execute the DO loop that iterates through the elements of
array A. To someone used to register-based architectures, this code
probably looks like it generates a lot of memory accesses -- all those
VALC and index operators, not to mention the stack pushes and pops. On
the B6700 that certainly was true, as there was essentially no caching,
except for two TOS registers. More recent implementations use caching
extensively, however, and most of the apparent memory references would
stay inside the processor.
Another thing that may be apparent is that Algol does very little
optimization. It is a one-pass compiler and, for better or worse, emits
instructions in pretty much a what-you-code-is-what-you-get manner.
This program contains an intentional bug, which becomes apparent on the
last iteration of the DO loop. The value of I is 100, which is greater
than the upper bound of array A. When I compiled this and ran it on an
MCP system, I got the following message from the MCP:
2228 MSRHI3:INVALID INDEX @ (00100600)
2228 HISTORY: 003:0001:3 (00100600).
F-DS 2228 (PAUL)OBJECT/SIMPLE/ALGOL ON OPS.
This indicates a bounds violation on line 100600 (a sequence number that
is part of the source file line) at code segment address 003:0001:3,
which is the INDX instruction for Line 6 in the example above. This
bounds checking is not due to any debug or diagnostic mode I enabled for
the compiler or the object code -- it's implicit in the segmented
addressing mechanism for the architecture and cannot be turned off.
The value 2228 is the MCP-assigned task number. F-DS indicates the
program was terminated (discontinued, or "DS-ed" in MCP parlance) due to
a fault interrupt. Although it is not apparent from this example, the
HISTORY line is a trace of return addresses -- it shows the history of
procedure calls that got to the point where the fault occurred.
Assuming this bug did not exist (i.e., the comparison on line 7 was ">
99" instead of ">100"), the loop would have terminated when the value of
I reached 100 and control would have fallen into the END statement for
the block. The NTGR instruction for line 7 is due to the numeric format
used with all MCP systems since the B5000 -- integers are implemented as
a subset of floating-point values, and integer overflow generates a
floating-point result. NTGR normalizes the TOS value as an integer and
generates a fault interrupt if it exceeds the limits of integer
representation (+/- 2**39-1).
The call to BLOCKEXIT at the end of the code for the block is generated
by the compiler to dispose of any complex objects (arrays, files, etc.)
that were declared in this stack frame. The compiler generates a tag-6
marker word at the end of stack-building code that serves as a parameter
to BLOCKEXIT. This marker word contains a bitmask indicating which types
of resources BLOCKEXIT should look for. Failure to call BLOCKEXIT when
required would result in memory leaks, and the presence of this call is
another example of the trust the system places in its compilers. More
recent E-mode levels include a "blockexit bit" in one of the stack
linkage words that can be used by the MCP to enforce proper disposal of
stack frame resources before the frame can be exited.
A Series Enhancements to the Descriptor Mechanism.
The Address field of data and code segment descriptors is 20 bits wide,
which allows for a total of 1048576 words (6 MB). The B6700 has the same
maximum physical memory size, so the field width was adequate. In the
late 1960s and early '70s 1 MW seemed near infinite, but as systems
became larger through the '70s (and especially as the use on-line
applications and data bases grew during this period), this upper limit
on physical memory of 1 MW became grossly inadequate. The
B6800/7800/6900/7900 implemented a somewhat crude paging technique (the
infamous "Global(tm)Memory") that helped somewhat on multi-processor
systems, but the physical address space for a given processor at any one
time remained at 1 MW.
The A Series models introduced starting in the early 1980s addressed
this problem by implementing a concept known as ASD (Actual Segment
Descriptor). Heretofore, the "mom" descriptor in a program's data stack
was the owner of a memory area and pointed directly to it when the
segment was present in memory. There was no room to expand the address
field in the 48 bit descriptor word, so the role of owner was moved from
the mom to a central ASD table in memory. Instead of a real memory
address, the Address field in descriptors now holds an index into this
ASD table. On the latest processors, each entry in this table contains
eight 48-bit words, of which only the first three are used by the
hardware. The actual location, length, and status of each allocated
memory area is now stored in these table entries, hence the ASD name. It
functions similarly to the page table in other virtual memory
architectures, except that the "pages" are variable-length segments.
Most processors use caching to reduce the incidence of real memory
accesses to this table, effectively implementing a form of TLB.
The concept of "mom" and "copy" descriptors no longer exists in the
architecture, at least not in anything like the way it was in the B6700.
All descriptors in data segments (except untouched descriptors) point to
an ASD table entry. In fact, since E-mode level Gamma introduced the use
of four-bit tags (initially for the A11 and A16 models in the early
1990s), tag-5 words are used only to represent untouched descriptors.
Once the area is allocated, descriptors accessible to user-level code
(now called "virtual segment descriptors" or VSDs) use tag values of C,
D, E, and F to identify various combinations of indexed/unindexed and
paged/unpaged descriptors.
The ASD index field in these VSD words is currently 23 bits in length,
allowing for a maximum of just over 8 million segments in the system.
The maximum length of a segment is still limited to 2**20-1 words,
although it appears this could be extended. The physical address field
in the ASD is currently 36 bits, allowing a maximum physical memory
space of 64 GW or 384 GB. The large Libra systems I have encountered
recently have 4-8 GW of physical memory, which appears to be more than
adequate.
With so much physical memory available, most systems today run with
large amounts of the memory space unallocated, and almost no overlay due
to memory allocation pressure takes place. As a result, the Presence bit
mechanism now largely serves as an allocate-on-first-reference
capability. Should you fill up the physical memory, however, the
automated overlay ("virtual memory") mechanism will still do its thing.
The size of the ASD table is established at system initialization time,
based on the total size of physical memory and a factor that is settable
by the system administrator. Running out of ASD table entries is a
no-no, and causes the system to halt.
A serious issue with the B6700 design was management of copy
descriptors. Every time the presence, location, or size of a segment
changed, not only did the mom descriptor need to be fixed up, but all of
the copies, too. The only way to find those copies was to search for
them, so copy descriptors were only permitted in stacks. There were
(still are) special instructions to search for these copies, and a
considerable software investment was made to minimize the number of
stacks that needed to be searched, but stack-search overhead could be
fierce, especially on a system that was near or beyond the thrashing point.
The ASD implementation considerably helped this situation. Copies no
longer contain real addresses to the data area or to the mom -- instead
they point to the ASD table entry for the segment. This table index does
not change through the life of the segment, so copies no longer need to
be found and fixed up when the location, size, or status of a segment
changes.
One of the nicest aspects of the ASD implementation was that it had
essentially no impact on user applications. Since descriptors are
managed by the MCP, the details of how the indexing instructions compute
real addresses are opaque to user-level code. It was necessary to
recompile some programs, although not specifically to support the ASD
addressing changes -- the B6700-era compilers emitted code that would
access fields of descriptors (e.g., to determine the length of an area
in Algol, you could obtain a tagless copy of the descriptor and isolate
bits [39:20]). Starting with E-mode level Gamma, the length field was
not even in the VSDs accessible by user-level code anymore, so new
instructions (e.g., GLEN to determine the length of a segment) were
implemented to perform these functions, and the old methods (along with
the Algol syntax that supported them) were deprecated. Some other
model-specific instruction sequences (such as directly accessing
processor registers) were eliminated at the same time, all of which
improved the security of the system and reduced somewhat the reliance on
trusted compilers. That reliance was not eliminated entirely, however.
There was a lengthy transition period that allowed users to recompile
their programs so that their codefiles would be compliant with newer
processors.
Issues with Descriptors and the MCP Architecture in General.
The first thing almost everyone comments on when first being exposed to
the stack- and descriptor-based architecture of MCP systems is the
memory access overhead of push/pop on the stack and of having to index
through descriptors to reach data. The second thing that gets comments
is the lack of user-accessible registers.
There is no question that these characteristics of the architecture add
overhead and (at least in the myopic view in which most seem to consider
performance issues) degrade performance. This overhead is at least
partially offset, however, by:
* the lack of unnecessary state saving,
* the efficiencies resulting from variable-length memory segments,
* the efficiencies resulting from unnecessary memory allocation by
delaying it until first reference,
* the efficiencies resulting from code and data segments being
closely related to language objects,
* the efficiencies resulting from being able safely to access data
and code across addressing environments and inter-task
(marshalling data across process boundaries is a foreign concept
in the MCP),
* the efficiencies in context switching,
* the efficiencies in interrupt handling, and
* the efficiencies resulting from hardware and operating system
environments that were designed specifically for each other.
In an I/O-intensive, transaction-server environment (which is what the
MCP systems are designed for), this performance trade-off balances out
better than you might think. Where the architecture loses at the micro
level of instruction performance, it gains at the macro level of system
performance. For a server, that's what you want. Need to do
high-performance numerical computation? MCP systems are probably not the
ones you should consider using. Need to do high-performance transaction
processing and safely balance the needs of hundreds or thousands of
tasks competing for processors, memory, and I/O paths? MCP systems do
quite well in that solution space.
There is another aspect to performance that I do not think is considered
often enough -- reliability performance. The lack of low-level bounds
checking in other systems terrifies me, and it should terrify you. The
idea that bound violations can be prevented simply by programmers "being
careful" is both silly and irresponsible. Giving addresses to
programmers is like giving whiskey and car keys to teenagers -- sooner
or later something stupid is going to happen, and it's probably going to
be sooner. I say this being a programmer. The current problems we have
with malware are largely due to unchecked memory accesses and allowing
data to be treated as code. These are problems that MCP systems simply
do not have.
As I have said before in this space, there is a cost to using
descriptors and hardware-enforced bounds protection. This is also a cost
to not using descriptors and hardware-enforced bounds protection.
In my opinion, the MCP architecture has two major problems -- and
neither of them relates to performance. The first is the reliance on
trusted compilers. As discussed briefly above, tweaks to the
architecture over the past 30 years have improved this situation, but
the security of the system is still too dependent on the quality of code
that the compilers generate. Barring a social engineering attack, it is
quite difficult get untrusted code into the system in a form that can be
executed. Once an untrustworthy compiler is authorized, though, havoc is
possible. I am not aware that penetration of untrusted code has ever
been a problem since the early B6700 days when some major holes were
exposed in the enforcement of codefile integrity, but this is too ripe
and area for potential abuse, and one that requires enforcement in too
many parts of the system, to be considered an adequate aspect of the
architecture.
The second problem is that the segmentation, addressing, and memory
management mechanisms of the system are built for hierarchical,
block-structured languages such as Algol and Pascal. Memory objects
effectively "belong" to the block that declared them, and are
automatically deallocated when that block exits. This approach also
works fine for COBOL, and is adequate for FORTRAN (at least through
FORTRAN-77). It works poorly, though, for languages which rely on
heap-based memory management, where an object can have a life after its
originating environment no longer exists.
The MCP compilers and operating system go to great lengths to prevent
"up-level pointers" -- the existence of references to a
locally-allocated segment that can be stored in a more global scope. The
system does not have an efficient way to locate and invalidate such
up-level references when the locally-allocated segment is deallocated,
so their use is prohibited.
Current MCP systems support a C compiler, but the performance of its
code is not all that good, partly because C is basically a high-level
assembler for register-based, flat-address architectures (to which the
MCP architecture is a nearly complete antithesis), and partly because
the C heap is currently implemented as a series of array rows, with C
pointers implemented using integer offsets into the runtime
environment's heap space. It works, but the result is not very efficient.
The problem is even worse for object-oriented languages such as Java. A
descriptor-based capability architecture should be a good fit with an
object-oriented language (a descriptor, after all, is a primitive form
of object), but the current MCP architecture is too closely tied with
the Algol memory model to work well natively with Java. I consider this
to be both a real shame, and a real threat to the future viability of
the MCP architecture.
Those issues aside, I think the MCP architecture is still the most
interesting, if least appreciated, one on the market today. There are
other interesting aspects of the architecture that I have either passed
over quickly (such as stack linkage and cross-stack addressing) or
ignored altogether (such as the process model and the concepts of task
families and synchronous and asynchronous dependent tasks). Then there
are the MCP stack, the stack vector, procedure calls, accidental entries
("thunks"), parameter passing, string and data movement instructions,
server libraries, and connection libraries. These are all also well
integrated with the segmentation, stack, and addressing issues discussed
above. In fact, one of the endlessly fascinating things about this
architecture to me is that, for all of its complexity, everything fits
into a nicely integrated whole. It's quite elegant, really.
For those willing to RTFM, the Unisys support web site allows free
access to essentially all of the documentation for current systems. You
can access this through the front door by going to
http://support.unisys.com/ and clicking the "Access documentation" link
at very bottom of page. Click through the agreement and you will be
presented with a page that can search the documentation. You can also
access documents directly if you know their URL. Here are some useful ones:
ClearPath Libra 680/690 (latest version of E-mode level Epsilon spec):
http://public.support.unisys.com/c71/docs/Libra680-1.1/68787530-004.pdf
ClearPath NX6820/6830 (E-model level Delta spec):
http://public.support.unisys.com/aseries/docs/Hardware/70205547-001.pdf
Current Algol language reference:
http://public.support.unisys.com/aseries/docs/ClearPath-
MCP-11.1/PDF/86000098-507.pdf
Bitsavers has a good collection of documents for the older MCP systems
under:
http://www.bitsavers.org/pdf/burroughs/
In particular, you might want to look at the following documents under
that URL:
Narrative Description of the B5500 MCP:
B5000_5500_5700/1023579_Narrative_Description_Of_B5500_MCP_Oct66.pdf
B5500 Reference Manual (architecture reference):
B5000_5500_5700/1021326_B5500_RefMan_May67.pdf
B5500 Extended Algol:
B5000_5500_5700/1028024_B5500_ExtendedAlgol_Jul67.pdf
Elliot Organick's 1973 book on the MCP architecture:
B5000_5500_5700/Organick_B5700_B6700_1973.pdf
B6700 Reference Manual (architecture reference):
B6500_6700/1058633_B6700_RefMan_May72.pdf
A good paper by Hauck and Dent on the B6500/6700 stack mechanism:
B6500_6700/1035441_B6500_B7500_Stack_Mechanism_1968.pdf
If on first reading you don't understand this architecture, you're
running about average. I will happily try to reply to questions and
comments.
I will certainly agree with Edward that process initiation and
termination carry a fair amount of overhead, but disagree about
switching among processes. Because register state is saved and restored
automatically by the hardware, there is very little unsaved state that
needs to be handled during a process switch. On the B6700 and on the
latest high-end Libra machines there is a single instruction that does
the switch. The overhead Edward may be considering is operating system
overhead for process accounting -- e.g., accumulated CPU time -- but
that is done out of choice, not architectural necessity.
The current hardware limit on the size of a single data segment is
2^20-1 words. I thought that was also the upper limit on the size of a
user-level array row, and seeing 2^28 quoted (and assuming that meant
2^28-1), I had to check this out. I therefore wrote the following Algol
program:
BEGIN
DEFINE MAXBIT = 28 #;
ARRAY A[0:2**MAXBIT-1];
A[0]:= 0;
A[2**(MAXBIT DIV 2)-1]:= 2**(MAXBIT DIV 2)-1;
A[2**MAXBIT-1]:= 2**MAXBIT-1;
PROGRAMDUMP (ARRAYS);
END.
This did not compile (using the MCP 11.1 DMALGOL compiler on a Libra
300), generating an error on the declaration for array A, "This
dimension is declared with too many elements".
Then I tried MAXBIT=27. That compiled, but running the program generated
a run-time error on the assignment to A[0] (which is where the array
would have been allocated, being the first reference to one of its
elements), "DIMENSION SIZE ERROR 1=134217728".
Next, I tried MAXBIT=26, and that both compiled and executed
successfully. The memory dump for the task generated by the next-to-last
statement confirmed that the array was segmented, and allocated as 8193
pages of 8192 words each (the last page is zero length, which is a
hardware requirement when the length of the segmented array is a
multiple of 8192). The dump also confirmed that only pages 0 and 8191
were actually allocated in memory, as they were the only ones that were
touched.
The Libra 300 is an E-model level Delta machine (it's also an emulated
machine -- the MCP architecture is implemented in software using a
standard Intel Pentium box running Windows Server 2003). It is possible
that the latest, E-mode level Epsilon machines may allow higher limits,
but I do not have ready access to one to test this.
>
> There's probably still a limitation on the size of code segments. But the
> compilers take care of that, and it's totally transparent. It's been years
> since I was even aware of the size of code segments except in bizarre cases
> where it would be a clue to some strange bug.
>
> If you want a lot of big segments, just declare
>
> ARRAY A [0:2**20-1, 0:2**20-1, 0:2**20-1];
>
> (Note that exponentiation is Algol is **.) That gives you a billion array
> rows of a million words each. A googol of them would only take a couple
> more lines. You would not be able to use them all due to physical memory
> limitations and the number of lifetimes it would take to go through them,
> but you could compile the program and access some random words from
> anywhere just to prove the point.
>
> Ironically, the most noticeable limit today is one you aren't likely to
> hear about. IOs are still limited to 2^16 words (6 * 2^16 bytes).
In other words, the largest unpaged memory segment that can be rolled
out to disk is 64K words in length, because the MCP requires virtual
memory paging to be done in one I/O. Permanently resident segments can
theoretically be 2^20-1 words in length, though.
>
>> Most of what I know about the Burroughs and descendants' architecture
>> is from Blaauw and Brooks.
>
> I'm not familiar with their book. I recommend Elliot Organick's "Computer
> System Organization: The B-5700-B-6700 Series". Used copies are not
> numerous but are not hard to find either. It won't give all the details,
> but it's well written. Three decades out of date, but the basics haven't
> changed.
>
>> Their description is somewhat confusing
>> (not really their fault, since the hardware architecture is
>> phenomenally complicated), but as far as I can tell, each segment is
>> limited to 32K words.
>
> Totally incorrect. I can't even guess where this misconception came from.
> As mentioned above, the modern limit is about 2^28 words. In programing
> terms it was never less than 2^20 words. There were various limits that
> were smaller, but they did not affect programming.
>
> Remember that most MCP segmentation was always transparent to the
> programmer.
>
> Code segments? Maybe code segments are limited to 2^15 words, though
> actually 2^13 comes to mind. But it's totally transparent. When the
> compiler fills up a code segment, it generates a branch to the next code
> segment. End of problem.
The word offset in a code segment is limited to 2^13-1 in a PCW (Program
Control Word, tag=7) and in the intra-segment branch address format of
branch instructions.
The important point here, I think, is that both Edward and I have been
using MCP systems for a long, long time, and both of us are quite
familiar with them, but not only do we not worry about segment size
limitations in our daily work, gosh -- we're not even sure what they are
anymore. I had to dive into the architecture reference manual to address
the issues in this thread. There are limits, but they are not
constraints on practical use. That was not the case on the B5500, which
had much lower limits all around, but for the B6700 and later systems,
virtual addressability for either code or data has never been much of a
concern.
snip very detailed descriptions
> Those issues aside, I think the MCP architecture is still the most
> interesting, if least appreciated, one on the market today.
Thank you very much Paul for that wonderful description. I have been
looking for something like that for a long time. I also appreciate that
you talked about the disadvantages/flaws in the system as well as its
advantages as it is good to get a dispassionate post like that from an
obviously interested and dedicated "Burroughsian". :-)
One other issue which some might consider a flaw or disadvantage is that
it seems hard to see if you could do a JIT compiler for this system. Is
that right?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Cool. Thanks very much for posting that. I knew some, but not nearly
all, of it, once upon a time. I'll have to reread the ASD stuff.
I'd have to say that among the disadvantages of the MCP architecture are
(1) its native use of EBCDIC instead of ASCII and (2) the 48-bit word
format. It would be hard to take a C program reading 32-bit
twos-complement integers or IEEE floating point numbers, translate it to
ALGOL, run it on an MCP system, and get the same results (after finding
the buffer overruns that were there all along without anyone knowing).
Louis
> snip other stuff
> One other issue which some might consider a flaw or disadvantage is that
> it seems hard to see if you could do a JIT compiler for this system. Is
> that right?
>
> --
> -StephenFuld
> (e-mail address disguised to prevent spam)
Stephen,
MCP systems have supported JIT compilation since at least the mid to
late 1980's.
The first I encountered and contributed to was the S-Code compiler.
The MCP was extended to provide the ability to allow code to be
generated "on the fly" in memory (JIT wasn't as common-place term back
then since Java and .net was still almost a decade later).
Then when we added TADS support for COBOL85, C and Pascal83 using the
Slice compiler architecture we leveraged the same MCP support.
When using these products, "JIT" compilations are taking place
following your input.
The Slice TADS libraries for these products include the respective
compiler that allow you to compile a statement (or block of
statements) at a time in the intended context.
There were other experimental "JIT" tools in-house at the time that I
probably can't talk about, but needless to say there is no additional
barrier to implement a trusted JIT versus implementing a regular
trusted compiler.
Matt.
At the time IBM was absolutely dominant in business computing, and
IBM had turned its back on ASCII and put forth EBCDIC in its place.
We can speculate that IBM figured EBCDIC would take over and ASCII
would wither away; that was usually the way with de facto standards
established by IBM. So if Burroughs wanted to be a factor in business
computing they had to support EBCDIC. Open question whether Burroughs
thought ASCII would wither away and EBCDIC would come to dominate.
The 48-bit word format was carried over from the B-5500 and was well
thought out. I've been told that Burroughs management insisted on
some compatibility between the B-5500 and the systems that were to
follow it, and thus the architects were not free to improve on some
aspects. I think it's pretty clever that the 48 bit numeric format
allows integers and floats to have the same representation, so you
don't need to be always converting from one to the other. Then
48 bits was cool in the days of 6-bit character codes since you got
8 characters to the word.
C came along a lot later.
OK. That wasn't apparent from Paul's post. Can any compiler generate
code on the fly or is it an "extra" permission beyond being able to
generate codefiles?
--
- Stephen Fuld
I'm familiar with the history (once upon a time, I ported code from the
B5500 to the B6700), and I'm not saying that the 48-bit word was a bad
idea. My point is that no machine has a chance of being taken seriously
by most programmers today unless it does native mode ASCII, 32-bit
twos-complement integers, and IEEE floating point.
In our industry, there is a tendency to believe that newer is better,
and C is automatically assumed to be an improvement over its predecessors.
Louis
IEEE fp and twos-complement, I agree. ASCII as compared to EBCDIC, I
donÄt see as a hardware constraint. However, software that needs to know
the size of the integers it is using is, in almost all cases, broken by
design.
Summary: The 48-bitness does not matter to any software I want to care
about.
Jan
Think of all those C programs that use int32 for portability!
fixed bin(17,0) is the way to go!
--
mac the naïf
> snip other stuff
> > MCP systems have supported JIT compilation since at least the mid to
> > late 1980's.
> > The first I encountered and contributed to was the S-Code compiler.
> > The MCP was extended to provide the ability to allow code to be
> > generated "on the fly" in memory
>
> OK. That wasn't apparent from Paul's post. Can any compiler generate
> code on the fly or is it an "extra" permission beyond being able to
> generate codefiles?
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)
The compilers that generate code JIT or on the fly need to have a
runtime container/sandpit/environment in which to generate and to
reference in-memory JIT'ed code. In the case of the Slice TADS
libraries this is the runtime Test And Debug environment. They need
access to the necessary MCP API/entry point and when installed the
library has to be given the appropriate additional privileges that a
regular compiler does not need.
Matt.
Thanks. That makes sense.
It is indeed ironic that the B6700 handled TWO different character sets
(three if you include hex, which is still handled identically to eight-bit
code), but did not include the eventual winner. As has been pointed out, in
1970 IBM's star was in the ascendent and EBCDIC looked like the right
choice. It turned out to be the wrong choice, but I don't think anyone
could have known that at the time. The watchword was "no one ever got fired
for buying IBM". Today substitute Microsoft for IBM -- the cult has become
just as strong. Microsoft will follow IBM, not this year, maybe not even
next decade, but eventually. Who will be right then? Personally I'm not
risking any money on it. (Though it may be worth pointing out that anyone
who bought Unisys stock when it was around $1, around 1991, and sold it
10-15 years later made a pretty profit.)
Maybe programmers put value in particular integer formats, but that's
religion. For integers, I know of no advantage to one format over another.
Floating point is another matter; IEEE floating point has real advantages.
But how many programs are using IEEE floating point in 32-bit words? Today
as in 1970, 32-bit floating point is OK for toy programs but not much else.
Scientific programs running on the IBM 360 were famous for producing
cockeyed results due to insufficient precision, and many programmers got
used to automatically using double precision. OTOH, 48-bit Burroughs
floating point worked fine for most of those same programs, since it had
almost twice the mantissa precision (roughly 37 bits vs 21 bits after
considering various factors). So the 48-bit word actually saved a lot of
that expensive core memory for many programs, which on IBM equipment had to
use 64-bit floating point.
It's too bad that IEEE floating point doesn't specify an integer subset. I
don't think it would have been difficult. Might even be an easy extension
now. The Burroughs format still has the advantage there.
ASCII vs EBCDIC is yet another religion. I have MCP programs which read and
write files directly on Windows servers. I declare the files with
INTMODE=EBCDIC, EXTMODE=ASCII. End of problem. In fact, for the most part
the architecture is agnostic with respect to ASCII vs EBCDIC -- it's just
8-bit characters. The only exception that comes to mind is in converting
binary to decimal and vice versa, where IIRC there is no option to generate
ASCII digits instead of EBCDIC digits or to interpret zones as ASCII on
input. (I *think* that the IBM 360 did have such an option, but it's been a
loooooong time.) Certainly the software is heavily locked in to EBCDIC, but
I don't see EBDCIC on the screen any more than I am seeing ASCII as I write
this -- I just see letters in the Roman alphabet. It's really a pretty
minor issue for working programmers. But religion is always a major issue.
And in truth ... ASCII is the loser too. Modern systems use Unicode.
Edward
--
Art Works by Melynda Reid: http://paleo.org
This is actually documented in FAQ # 10026491. Please check it out ... I
spent a year and a half in correspondence persuading them to document it!
(I actually had a situation where a COBOL program needed to declare as
large as possible an 01 level without taking any risk that it might fail on
another CPU. Plus those of us in the business of writing general purpose
software like to know the limits of any system it might possibly run on.)
or if the wrap is a problem try
Edward
--
Art Works by Melynda Reid: http://paleo.org
That's an accurate statement about traditional Algol programming and
represents the state of most software running under the MCP. However, for
some 10-15 years now Algol has had good object-oriented structures. There's
no polymorphism or inheritance, but they are still extremely useful. Algol
today can look quite a bit different from even well written Algol of 25
years ago. Still, I'm sure that the use of memory is radically different
from that of other systems even with similar declared structures.
> The B5500 was an outstanding batch job machine and a lousy time sharing
> machine. The reason it was lousy was that absolute addresses get into the
> stack, so when a job is swapped completely out it has to be swapped back
> in to the same memory addresses it previously occupied. This was corrected
> by an architectural change in the B6500.
It may have been improved, but the B6700 still had major problems with this
kind of swapping. Stacks still contained absolute addresses. They could be
swapped back in to different locations, but at the cost of going through
the stack to fix up every single descriptor. I suppose the difference was
that this was possible on the B6700 and not on the B5500 (on the B6700 the
absolute addresses occurred only in descriptors, which could be identified
by their tags). I further suppose this is probably corrected today, since
descriptors now contain ASD numbers instead of memory addresses ... now
that we have plenty of memory and don't need swapping any more.
Edward
--
Art Works by Melynda Reid: http://paleo.org
I believe the hardware operator "Masked Search for Equal" was created to
make stack searching for copy descriptors a little faster. I vaguely
remember that when the system was thrashing, the mask showed up
prominently in the register display.
Louis
> Today as in 1970, 32-bit floating point is OK for toy programs but
> not much else.
Toys are a big market. CPU (and GPU) makers can hardly ignore them.
If you can double the effective size and bandwidth of your RAM with
no observable downside on your application, you'd be a pretty poor
engineer not to.
Sure enough. I changed my little Algol test program to make the arrays
one word shorter, thus:
BEGIN
DEFINE MAXBIT = 28 #;
ARRAY A[0:2**MAXBIT-2];
A[0]:= 0;
A[2**(MAXBIT DIV 2)-1]:= 2**(MAXBIT DIV 2)-1;
A[2**MAXBIT-2]:= 2**MAXBIT-2;
PROGRAMDUMP (ARRAYS);
END.
This makes the array length 2^MAXBIT-1 instead of 2^MAXBIT. Using
MAXBIT=28 now compiles on an NX4201 (an E-mode Beta system running MCP
7.0) and the Libra 300 (an E-mode Delta system running MCP 11.1) that I
tried before. It also gets a Dimension Size error on both, as the FAQ
indicated it should. MAXBIT=28 should work on E-mode Epsilon systems
(Libra 185 and later), though.
Using MAXBIT=27 allows the program to compile and run on the Libra 300.
I had to go all the way down to MAXBIT=20 in order for the program to
run on the NX4201, again as the FAQ indicated should happen.
The NX4201 (along with all the other E-mode Beta and Gamma models) is no
longer supported, so it appears that the maximum array size you can
currently count on for cross-system compatibility is 2^27-1 words.
Thanks, Edward. I learned something from this. Now all I need to do is
find an application that requires a virtually contiguous 768 MB memory
vector.
I doubt it was created solely for that, but it was definitely used for that
and I imagine that the requirements of stack searching were a major
consideration in the design of the operator. Probably made it a lot faster,
not just a little -- the search still had to access every word in the
stack, but did not have to fetch or interpret instructions while doing so.
If you read the description of the masksearch intrinsic in the ALGOL
manual, there's one important thing it doesn't tell you: the tag takes part
in both the mask and the target. I don't remember if this is clear in the
architecture manuals.
OK, for those not In The Know: the operator (mnemonic SRCH) takes three
parameters on the stack: a target, a mask, and an array location. The
target and mask are single full words. The array location is an indexed
descriptor -- if you read all of Paul's treatise, then you know what this
is, but otherwise just think of it as a location in an array. The SRCH
operator compares words of the array with the target word, moving toward
the beginning of the array, comparing only the bits which are on in the
mask. You can also describe this as searching for a word where ((target EQV
arrayword) AND mask) has all bits off. The operator returns the array
subscript of the first match, or -1 if no match is found.
When ALGOL programs use this operator via the MASKSEARCH or ARRAYSEARCH
intrinsics, the most commonly written mask is REAL (NOT FALSE), which is
all bits on (used because the generated code is only two bytes: ONE, LNOT).
But the tag is still zero. In ALGOL there is no way to create a mask
parameter with a non-zero tag, so there is no point in documenting that the
tag takes part in the comparison. A double array may have tags = 2 on the
data words, but with no way to make the tag non-zero in the mask, that's
irrelevant.
Of course the MCP languages (ESPOL in the early days, NEWP now) can set the
mask tag. That means that they can search for various kinds of descriptors,
checking both the tag and other bits such as the copy and indexed bits.
Various combinations of this were used to fix up descriptors for overlays.
As Paul described, a great deal of this is no longer applicable on ASD
systems, and I imagine that a much smaller range of the capabilities is now
used in this context.
> I vaguely
> remember that when the system was thrashing, the mask showed up
> prominently in the register display.
I never actually saw this, but I remember the story. On the B6700, the CPU
maintenance panel had a light for every bit in the CPU, including the
top-of-stack registers. The code for putting the CPU into the idle state
could be arranged so that the TOS registers generated particular patterns.
In the early days, it spelled out the word IDLE; later it was changed to
the big B logo. (Altering this pattern was one of the most popular local
MCP patches.)
Managers didn't like to see their expensive computer idle. (I'm not
claiming that all or even most managers were this clueless, just that this
is the story as I heard it.) Operators could handle this by loading more
work, even if that work didn't really need to be done. (Most never learned
that they could disable the IDLE just by entering "?L:GO L" on the console,
and in any case that didn't work until WFL was implemented on release 2.3,
around 1973.) But when the CPU was idle because all tasks were waiting on
overlays, the IDLE display showed up just as bright as when there was
really no work. So the operators would pile on more work trying to get rid
of the IDLE ...
Eventually the pattern would change and the IDLE would go away. The system
would still be waiting on overlays, totally thrashing, but the CPU would be
pegged handling the overhead of doing the overlays. Most of this overhead
was stack searching (which Paul described), mostly using the SRCH operator.
The mask would spend a lot of time in one of the TOS registers, and not
surprisingly the mask had most bits on. So there would be these waves of
solid lights moving across the display. Very impressive to the managers,
who realized that their expensive computers were very busy.
There was a name for this effect, but I can't remember it.
>Floating point is another matter; IEEE floating point has real advantages.
>But how many programs are using IEEE floating point in 32-bit words? Today
>as in 1970, 32-bit floating point is OK for toy programs but not much else.
>Scientific programs running on the IBM 360 were famous for producing
>cockeyed results due to insufficient precision, and many programmers got
>used to automatically using double precision.
The IBM 360 didn't have IEEE floating point, so your example doesn't work.
There are many significant modern scientific applications which can
get accurate results with 32-bit IEEE fp.
-- greg
A totally excellent technical and historical treatise. I've seen good
descriptions of the architecture, but this one is particularly lucid, and
includes integrated historical information, which I don't recall seeing
elsewhere.
> The compilers generate code in the initialization section of
> procedures to build these "untouched" descriptors directly in the
> program's stack. [...]
> The Copy bit indicates that a tag-5 word is not the original descriptor
> for an area, but rather a copy of it. A non-copy descriptor is said to
> be the "mom" descriptor for an area, and there can only be one such mom.
The alert reader will have wondered whether these conditions co-exist. In
fact the "untouched mom" was an important condition, and was referred to
internally in the code as a "virgin mom". I don't remember whether a virgin
mom was exactly the same as an untouched mom or involved some additional
distinction. In any case, this terminology for some reason never made it
into the public documentation.
> String operations that run off the end of a physical memory area
> generate a "segmented [paged] array" interrupt. The MCP responds to this
> interrupt by locating the next page in sequence (if there is one) and
> restarting the operation.
Note that at this point Paul is describing the Early Days. Paged arrays
could be expensive in some contexts due to these interrupts and the fixup
required to restart the operator -- just filling, say, a four kiloword
array with zeroes using a string operator (which would seem to be the
efficient method from a naive viewpoint) would result in 16 interrupts. Now
it's all handled in the microcode and the overhead is miniscule, plus as
Paul pointed out the pages are much larger now and so the change of pages
happens far less often anyway.
> Segment
> Dictionaries are also sharable -- if multiple tasks are initiated from
> the same codefile, the Segment Dictionary is loaded only once and the
> separate tasks are linked to this common copy. Thus, all of the object
> code and read-only constant pools for a program are automatically reentrant.
This is literally true, and very useful. However, once one works where
reentrancy of the code itself is a given, one begins to realize that there
are more degrees of reentrancy than IBM's old classification (non-reusable,
serially reusable, reentrant) allowed for. In particular, the complexity of
the environment may mean that code has its own copy of some data but uses
shared data in other cases, and this use of shared data can arise in
unexpected ways. The fact that code is literally reentrant does not absolve
the programmer of the responsibility for coordinating access to shared
data, which turns out to be more difficult than reentrancy of the code
itself.
> the content of descriptors is controlled only by hardware
> instructions and the MCP, not user-level code. User-level code (and all
> but small parts of the MCP kernel, for that matter) does not manipulate
> addresses, it manipulates offsets into memory areas.
Fully true now, and always true for end-user programming. As Paul described
later, the compilers did generate code which built and manipulated
descriptors using bit fields, all of which had to be removed for the ASD
migration. I remember somewhere around mark 38 finding and reporting a
situation where the ALGOL compiler was still using the old version. I don't
recall the exact resolution, but it was messy because it meant that the
three release recompile rule wasn't adequate to assure that affected
programs were recompiled before being run on ASD systems. Most of the code
migration depended on the three-release rule.
> This accessibility to the MCP stack was recognized as a serious security
> issue fairly early on, and later systems blocked direct access to LL=0.
I think that the maintenance issue was the real nightmare, more so than
security. With the compilers generating code with offsets into the MCP
stack, those offsets could not be changed. Procedures declared in the MCP
for public use had to be address-equated to specific D0 offsets, and the
assignments had to be managed manually. All of this went away with the
indirect linkage, which is basically handled the same as library linkage.
> The B6700 had 32 such D registers, but this proved (for
> once) to be more than necessary. Later systems cut back to 16 D
> registers (allowing user procedures nested 13 levels deep -- I doubt
> that I've ever coded anything that goes more than four levels deep).
I think that GEMCOS used up to about D9 or D10. But that wasn't all real
programmatic nesting; there were some levels of unnamed blocks added for
reasons I no longer recall (probably bad reasons, based on what I do
remember about the GEMCOS code).
The real advantage of having so many (16) D registers is simply that you
never have to worry about the limit, since as Paul explained, 16 is
basically limitless. If I have a procedure I want to insert within another
procedure (or a large amount of code to $include in a procedure), I can do
it. I don't need to worry what level the host procedure is at, nor how much
nesting is within the code being inserted. It's just a non-issue, and thus
something I don't have to waste time thinking about.
> bounds checking is not due to any debug or diagnostic mode I enabled for
> the compiler or the object code -- it's implicit in the segmented
> addressing mechanism for the architecture and cannot be turned off.
And because it's part of the ISA, the bounds checking is typically done in
parallel with other CPU functions. As a result, it does not incur the
performance penalties typical of software-based bounds checking.
> The first thing almost everyone comments on when first being exposed to
> the stack- and descriptor-based architecture of MCP systems is the
> memory access overhead of push/pop on the stack and of having to index
> through descriptors to reach data.
Paul gives a good summary of the fact that caching in the CPU mitigates
most of this. Even the B6700 had two explicit "top of stack" (TOS)
registers to buffer the need to push and pull from main memory (which IIRC
started as core but was thin-film by the time the B6700's life ended). The
architecture manual even had details about when these TOS registers (called
A and B) had to be pushed into memory or refreshed from memory, per the
requirements of each operator. The B7700, the higher-performance system of
the same era, had a much larger TOS cache, but I do not remember the
details.
> Need to do
> high-performance numerical computation? MCP systems are probably not the
> ones you should consider using.
However, in the days when Burroughs supported FORTRAN, you might have
wanted to develop your numerical software on MCP systems even if you
planned to run it on other systems.
When I was working for the State of Florida DHSMV in the 1970s, a federal
government agency (perhaps NHTSA) visited to install a statistics package
which they were using to collect data from every state. They had already
installed it on IBM systems and one other (perhaps Univac but I'm not
sure). Not surprisingly, they had to make more modifications for the B6700.
But at the end, they said they wished they had done the B6700 first,
because then they would have had almost no problems on the other systems.
The B6700 didn't require a lot of specialized changes, but rather allowed
(and even required) them to write standard, portable FORTRAN (as much as it
could be standardized in those days).
> In my opinion, the MCP architecture has two major problems -- and
> neither of them relates to performance.
I would add a third problem: the Burroughs/Unisys Sales Prevention
Departments. They never figured out how to sell it. My favorite example of
this is that in the 1960s and 1970s, the B5000/B6700 were very popular in
academic circles due to the interesting architecture. Quite a number of
universites had B6700s, some of the best known being UCSD, the University
of Tasmania, and SUNY Fredonia. You would think that marketing would say
hey, students are the next generation of programmers, let's sell B6700s to
universities at the very lowest possible cost so that the next generation
(who will also be driving decisions on what to buy) is familiar with them.
Instead, they appear to have said hey, this looks like a good profit
center, let's see how much money we can get out of them. Needless to say,
academic interest wasn't sufficient to make up for the pricing.
> C is basically a high-level
> assembler for register-based, flat-address architectures
In the 1970s it was said that FORTRAN was the world's most popular
assembler language. It has been replaced in this distinction by C: not only
is C now far more popular, but Fortran (yes, the capping changed) has
evolved into a good, modern programming language.
> There are
> other interesting aspects of the architecture that I have either passed
> over quickly (such as stack linkage and cross-stack addressing)
I have long held that the most important aspect of the architecture is the
environment management. This is a lot harder to explain than a stack, so
most people think "stack machine". The operators to enter and exit
procedures have to do some fancy linking to update the proper D registers
for the new environment. At first thought this may seem easy -- when you
are thinking "stack machine" -- but try to imagine what happens when you
are linked to a procedure in a library (a different environment entirely,
not part of the task's D register sequence, see Paul's decription of
SIRWs), and then pass that procedure as a parameter to a higher or lower
level procedure, which calls the passed procedure, not even knowing what
lex level it's going to run at. (If you followed that, then you either have
a lot of experience with such environments or do not yet fully comprehend
the complexity.) Actually it can be done pretty simply if you don't mind a
lot of extra memory accesses, but that would mess with performance, so
determining when the D-update can be truncated is critical. The only
operators which the architecture manuals never fully explained were the
enter and exit operators -- the writers got to a certain point and then
kind of waved their arms around saying "the D registers get updated".
In the early days of ALGOL, there were test programs to see if ALGOL
compilers properly handled complex environments. I recently ran across one
such program written by Donald Knuth. (I think it's linked from the
Wikipedia article on ALGOL, which unfortunately has some bad information in
other parts.) Burroughs ALGOL never had to worry about such programs --
Knuth's test was actually a pretty trivial one, though apparently a lot of
compilers had trouble with it. The B6700 ISA just did what was needed, and
the compiler didn't have to break a sweat.
Stacks weren't the only place you could find absolute addresses on the
(Pre-ASD) B6700. The MCP had them hidden in a number of different structures
in those days. In some cases they weren't even in descriptors.
The B6700 never was able to swap stacks in and out like they were data
areas. The only way to get a stack written to disk was to use a mechanism
called SWAPPER. With SWAPPER, all the memory areas belonging to a task were
allocated in a contiguous subset of memory. Then, when a task was sent to
disk, all of it's memory went at the same time. When it came back into
memory, the MCP knew it had to go through the entire subspace and adjust all
the absolute addresses.
John Keiser ( retired MCP type and former SWAPPER expert )
> I never actually saw this, but I remember the story. On the B6700, the CPU
> maintenance panel had a light for every bit in the CPU, including the
> top-of-stack registers. The code for putting the CPU into the idle state
> could be arranged so that the TOS registers generated particular patterns.
> In the early days, it spelled out the word IDLE; later it was changed to
> the big B logo. (Altering this pattern was one of the most popular local
> MCP patches.)
>
> Managers didn't like to see their expensive computer idle. (I'm not
> claiming that all or even most managers were this clueless, just that this
> is the story as I heard it.) Operators could handle this by loading more
> work, even if that work didn't really need to be done. (Most never learned
> that they could disable the IDLE just by entering "?L:GO L" on the console,
> and in any case that didn't work until WFL was implemented on release 2.3,
> around 1973.) But when the CPU was idle because all tasks were waiting on
> overlays, the IDLE display showed up just as bright as when there was
> really no work. So the operators would pile on more work trying to get rid
> of the IDLE ...
Hence the widely deployed 'Speedo' patch, which replaced the 'meatball'
(The Burroughs 'B') with
W A
I T
for I/O,
P
B
for Presence Bit, and a smiley face for Idle
Almost exact opposite is true: 32bit floating points are perfectly
good for wast majority of today's floating point workloads - signal
and image processing, 3D rendering and other multimedia tasks.
Today's situation is radically different from the '70s when majority
of fp cycles went into solving PDEs.
Gosh, no one is reading carefully enough to catch my error. That should be
((target NEQV arrayword) AND mask) has all bits off
And many thanks for doing so. It's been a long time since I was caring
for and feeding a B-5500 - certainly the most fun machine in my experience.
Would you also say that it's hard to talk about because with so many
of the models being emulated it is so easy to make changes and
improvements to the way things work? Or is the emulator rarely changed?
Also I'm unclear about the range of models - is there just one emulator
or are there different emulators for different models?
I find the MCP systems difficult to talk about for just the reason that
I mentioned in the quote above -- the architecture is both very complex
and very synergistic. Everything fits together so well, and all of the
elements interact so beneficially, that it's a real challenge to find a
place to start the discussion without quickly becoming bogged down in
related issues. Apparently I was somewhat successful in doing this in my
long post, but it was a challenge just to focus on the segmentation
aspect. There's a lot more really interesting stuff to talk about.
As far as I know, there is currently only one emulator, but it comes in
two flavors -- one for Index boxes with x86 chips and one for x64 chips.
The emulator for the x86 chips emulates E-mode level Delta, and has
worked that way since the LX5000 systems were released in 1998. Prior to
that, the NX4200 and A2100 systems used an emulator based on E-mode
level Beta. I have not had access to any of the x64-based systems, so
have to guess that they use a Delta-level emulator as well. I am not
sure what the differences between the Delta and Epsilon (used in the
Libra 185 and later non-emulated systems) levels are, but from a cursory
inspection they appear to be quite similar. Perhaps someone from Unisys
could respond and clarify this issue.
Your point about emulated implementations being easy to change is a good
one, but Unisys seems to have been very disciplined about this. First,
and to their great credit, they have been committed to making the
emulated systems completely code-compatible with the "hard" systems. You
can literally move code and data between the two (as long as they are
both at supported levels) without worrying about it. Playing fast and
loose with the ISA just because it's implemented in C++ would break this
level of compatibility.
Another thing to consider is that all of the current MCP systems (and
all of them built since the early 1980s) have really been emulated
systems. The "hard" Libra boxes on the high end of the product line are
based on a custom processor that is driven by microcode. Changes there
have to be made carefully so as not to break the MCP and end-user
applications. The same considerations apply to the Intel-based systems.
You can think of the Intel systems as just being based on a different
processor with different microcode, but which implements the same ISA.
Paul made pretty much the points I would have made. In fact, if there are
any differences among the emulated models, I'm less aware of them than I am
of differences among the microcode models. However, as Paul explains, these
differences are miniscule; with very trivial exceptions the models are
entirely code-compatible. And note that the LX100 series, of which I think
the LX140 is current, runs on standard PC hardware. You buy the PC, and you
buy CDs and a dongle from Unisys for $3000, and you have a development or
demo system (not licensed for production work and technically single-user).
And there have always been a few differences among models, even going back
to the B6700 and its first successor. The difference that catches the
unaware most often is the inconsistent handling of bit 47 of numeric
operands. The B5500 used bit 47 as an early form of protection. To maintain
compatibility with the numeric format (I think Paul talked about this), the
B6700 did not use bit 47 (and to this day, floating point is really a
47-bit format). Bit 47 could be used when the word held characters or was
just a bucket of bit fields, but it was always quite explicitly undefined
what bit 47 was in the result of an arithmetic operator (as opposed to a
character or logical operator, where it was always well defined). And in
fact it has changed quite regularly from one model to another, presumably
for the convenience of the microcode writers. I've seen good programmers
caught by this even in the past few years. (Even the B6700 was microcoded,
though the microcode wasn't field-loadable. I don't think any major
computer since the 1950s, perhaps even earlier, was fully hard-wired.
People in DP -- now called IT -- became aware of microcode mainly when it
became field-loadable.)
Another known exception was a "scan forward" of a pointer (which remember
is not what other architectures call a pointer, but a descriptor to a
position in an array containing text) where the distance to scan is
negative. Some models used the absolute value of the distance, and some
treated it as zero. I don't think any did a scan backward.
But these are memorable precisely because they are so rare.
There have been changes due to the evolution of the architecture, but off
the top of my head these are the only two I can remember which actually
varied seemingly randomly.
An aspect of the system which made changes much more controllable was (and
is) the "three release recompile rule". This is a software-only rule, so
Paul only mentioned it briefly if at all. He did discuss at length the ways
that code must be "blessed" before it can run. This gives the MCP a hook
where it can check what release of the compiler was used to generate the
code. If the code is more than three releases old, the MCP refuses to run
the code. If it is exactly three releases old, the MCP runs the code but
issues a warning, and there are tools to help with batch recompilation when
necessary.
This made it possible to change the architecture. It's never changed
radically, as you can see from Paul's lengthy description, but bits and
pieces have changed. The transitions have always been smoothed over by the
compilers, which in some cases even generated code to check what model of
CPU was in use and to execute different code depending on the answer.
So it's the mediation by the compilers which really made changes much
easier. But that ease has never been abused; the consistency of the ISA has
been much more remarkable than the changes.
<snip>
> And there have always been a few differences among models, even going back
> to the B6700 and its first successor.
<snip>
> Another known exception was a "scan forward" of a pointer (which remember
> is not what other architectures call a pointer, but a descriptor to a
> position in an array containing text) where the distance to scan is
> negative. Some models used the absolute value of the distance, and some
> treated it as zero. I don't think any did a scan backward.
>
> But these are memorable precisely because they are so rare.
There was at least one other memorable incompatibility, this one between
the B6700 and its "big brother" B7700, ca. 1975. It involved the
difference between what I would call a "sliding move" and a "smearing
move". Both are overlapping moves. In a sliding move, the destination is
always at a lower address (i.e., array offset) than the source, so that
the memory words used as a destination never become words used as a
source. In a smearing move, the destination address is slightly higher
than the source, so that destination addresses eventually become source
addresses during the move. The critical thing is the delta between the
destination and source addresses.
In Burroughs Extended Algol, a sliding move would look like this:
ARRAY A [0:300];
REPLACE POINTER(A[10],0) BY POINTER(A[11],0) FOR 200 WORDS;
That is, A[10] is replaced by the contents of A[11], A[11] is replaced
by the contents of A[12], and so forth, in that order, for a total of
200 words. The "POINTER()" intrinsic is simply a type coercion function
to satisfy the compiler and generates no code. Actually, I could have
written the statement this way and generated the same effect,
REPLACE A[10] BY POINTER(A[11],0) FOR 200 WORDS;
since the destination must by definition be an address, so the POINTER()
coercion is implied. Writing the statement this way, however,
REPLACE A[10] BY A[11] FOR 200 WORDS;
generates an entirely different result -- that copies the value of A[11]
repetitively into A[10] through A[209].
[By the way, there are a few instructions required to set up the
destination, source, and length in the stack, but the MCP architecture
does the actual move in one instruction, TWSD, Transfer Words
Destructive, the "destructive" referring to what happens to the
parameters in the stack at the end of the instruction. The same transfer
instruction (but not parameter setup code) would be used for all three
cases above.]
As long as the starting destination address (index) was at least one
less than the starting source address, this move worked properly on both
the B6700 and B7700 systems.
The Algol equivalent for a smearing move would be this:
REPLACE POINTER(A[11]) BY POINTER(A[10]) FOR 200 WORDS;
Note that the destination address is one word higher than the source
address, so that A[11] is replaced by the contents of A[10], A[12] is
replaced by the contents of A[11] (which used to be the contents of
A[10], or so you would think), and so forth.
The B7700 handled this the way you would think it should, effectively
smearing the value of A[10] into A[11] through A[210].
The B6700 handled this smearing move in an, er, non-intuitive manner
when the delta between source and destination addresses was less than
two words. I now forget the details, but I think A[11] was replaced by
A[10], A[12] was replaced by the original value of A[11], A[13] was
replaced by the original value of A[12], etc. If the delta between
addresses was two or more words, both systems worked the same.
The difference turned out to be in the way that the two CPUs handled the
interleaving of memory reads and writes. The B7700 was the first MCP
system to implement extensive caching, lookahead, and multiple pending
memory operations, so its designers had carefully thought out how this
interleaving should work. The B6700 was a more primitive design
(although hardly primitive in any absolute sense) and did not take into
account the interaction between source and destination memory references
when they were close together.
The actual problem that I came in contact with was in the form of a
COBOL program that was part of a benchmark for a U.S. Federal agency.
The original program had been written for an IBM 360 system and
implemented the smearing move within the same 01 record by means of
redefines. I think the purpose of the move was to clear the record to an
initial value based on the first few (probably four) characters in the
record.
We had converted the program to Burroughs COBOL 68 and were running it
on both a B6700 and a B7700, and were getting inconsistent results from
the benchmark tests. It took a while to realize that the inconsistency
was based on which system the program was running on, and somewhat
longer to narrow the problem down to the smearing move (which is
difficult to detect in COBOL, as you have to relate a perfectly ordinary
MOVE statement to the fact that the source and destination actually
overlap in the 01 record definition).
When this realization came to light, a couple of us who had been working
with the systems since the early B6500 days, and had previously noticed
this two-word delta requirement, collectively let out what could be best
described as the "programmer's ahhhhh..." Everyone else on the benchmark
team looked at us like we had gotten hold of some really bad stuff. It
being the mid-1970s, that would have been an entirely reasonable, if in
this case incorrect, supposition.
Overlapping moves (sliding or smearing) are either prohibited or
explicitly noted as undefined by the ANSI COBOL standards, but I don't
know of any implementation that actually detects this, and of course it
makes no difference when you are trying to demonstrate a customer's
workload on your product. As a result, to this day I avoid smearing
moves like the plague.
Sorry -- "Index box" should have been "Intel box" -- rogue spell check
correction.
A (small) smearing move seems like an ingenious way to repeat a
small/variable length array across a larger array, but I would still try
to avoid it like the plague.
My "carefully trained" intuition about what is easy and what is hard to
handle efficiently on a modern cached cpu shudders. :-)
I would definitely prefer to use special-case code for a few distinct
short lengths (1, 2, 4, 8 bytes), and then handle all others as a
regular double loop, but always with the source pattern saved away in
separate storage. (If it was really critical, then I could handle 3, 5
and possibly 7 with unrolled-by-4 code.)
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
I used to train my intuition about such things, but machines were
simpler then. Nowadays, I do what I teach and assume that their
properties are unpredictable - even on the same machine. Even mere
firmware updates can cause a change of efficiency to such things
(and that has happened to systems I was managing).
|> I would definitely prefer to use special-case code for a few distinct
|> short lengths (1, 2, 4, 8 bytes), and then handle all others as a
|> regular double loop, but always with the source pattern saved away in
|> separate storage. (If it was really critical, then I could handle 3, 5
|> and possibly 7 with unrolled-by-4 code.)
I would use your generic approach for all sizes, and tune the others
only if profiling indicated it was essential.
Regards,
Nick Maclaren.
When I was taught IBM S/360 assembler, that was the way I was taught to
do things like setting a print line to all blanks prior to putting real
data into it. You would use the MVC (Move Characters) instruction which
would copy a variable length (up to 255 bytes) from source to
destination. So doing MVC addr, addr+1, x would copy the first
character at addr to the next x characters. We were even told that if
we were going to use that a lot on a particular variable, buffer, etc.
to precede he variable in storage with a single byte initialized to the
"clearing value" to facilitate the operation.
I suspect that this was such a common idiom that IBM probably optimized
the microcode for such situations. Can anyone in the know confirm that?
> My "carefully trained" intuition about what is easy and what is hard to
> handle efficiently on a modern cached cpu shudders. :-)
Yes, of course no modern cached cpu has an MVC instruction. :-)
cp67 delivered to the university had a special page of all zeros as part
of formating boot volume. when initializing a virtual address space, the
tables were set so that all pages pointed to the "zero page". i changed
that so that the backing store table indicated "no page" for each page.
page fault processing for "no page" then was changed to fill the virtual
page location allocated in real memory with zeros by instructions.
As you point out that it was common coding practice on 360 to using
overlapping move to clear space. However, studying the 360/67 functional
specs. document indicated that it would be faster to fill a whole page
with zeros with storing registers ... aka save all registers
... initialize ten or so registers to zero, setup the other registers
for BXLE loop and do STM of ten registers of zero at a time.
Old 360 functional characteristic manuals ... giving detailed
instruction timing forumulas
http://bitsavers.org/pdf/ibm/360/funcChar/
some of the coding performance trade-offs could be different
in different models. 360/67 functional characteristic manuals
(including detailed description of virtual memory and segment
hardware)
http://bitsavers.org/pdf/ibm/360/funcChar/A27-2719-0_360-67_funcChar.pdf
http://bitsavers.org/pdf/ibm/360/funcChar/GA27-2719-2_360-67_funcChar.pdf
I have some vague recollection that at some stage, some processor
implementations looked at overlapping moves for special case processing
involving doubleword operation at a time (even tho it was nominally a
byte at a time instruction).
For 370, "long instructions" were introduced that had field length in
register (instead of encoded in the instruction) ... and could be used
for several mbytes. the MVCL instruction also allowed for different
source and destination lengths and specifying a "PAD" byte whne the
source was shorter than destination (source length zero byte and target
length was 4k). However, it took some machine generations before there
was optimized MVCL microcode for this case that beat the STM loop.
current MVCL long instruction description (given operation in 24-bit,
31-bit, and 64-bit addressing modes):
http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/7.5.90?SHELF=DZ9ZBK03&DT=20040504121320
REP MOVSB/MOVSW/MOVSD comes close enough (to MVC) to make a fast hw
implementation somewhat troublesome.
I.e. REP STOSB and REP MOVS could have been the "speed of light"
standard for memory fills and copies, but a quick google will find
thousands of links to handwritten replacements for various special cases.
Good point. I guess the advantage of the single instruction approach is
its generality. I wonder if the IBM Z series compilers now generate MVC
or various special cases depending on the need.
They always did. They didn't always choose which according to any
known logic, though ....
Regards,
Nick Maclaren.
I believe that Bob Rogers has given presentations at Share that indicated
that this MVC to addr+1 from addr is optimized in current z-series servers.
This reminds me of a different kind of model discrepancy. In the early
days, it was a given that the WORDS version was always faster. Of course,
if you didn't want to move a multiple of six bytes, using "transfer words"
required doing some extra stuff at possibly both ends of the string, but on
occasion it was worth the extra programming for the speedup. And if the
source and destination were not at the same offset within the word, it just
didn't work.
For some now-forgotten reason, probably in the early 1980s, I wrote a
program to test the difference in timing, and was able to run it on a
number of different CPUs. Imagine my surprise when I found that the B7900
showed the reverse of the usual difference: it was FASTER on character
moves than on word moves, at least if the character moves were
word-aligned. (I'm pretty sure that non-word-aligned character moves were
still slower, but since that wasn't a surprise, I don't remember it
clearly. In fact, this one anomalous result is essentially all I remember
about the entire experiment.) I don't think I ever learned the reason,
though I suspect a microcode bug (performance bug, not functional bug).
This means that code optimized to use "transfer words" became de-optimized
when run on a B7900.
I have a distant but fairly distinct memory of discussing this with John
Keiser in a public area at a CUBE conference in Detroit, probably the first
time I met him. Of course that was a long time ago and I don't know how
well I can trust the memory.
I believe you are trying to remember the one failing the B5500 CPU had
and that was the OP Code LTSL "0000" to place a word of "0"s (zeros) in
the top of stack was octal 0000. So to flush memory with "0"s from
Processor Panel clear the T reg and S reg, set Trof, set the switch to
hold T reg and hit the start button. The result was pushing a word of
"0's into the stack at address 0 thru end of memory. Could also flush
memory by loading a binary card with code loop of LTSL "0000" then branch
unconditional back to LTSL "0000". Another method to flush memory was
from the I/O panel set bit D24, clear the W reg, set mem cycle switch
and push start button. Most program bugs or hardware failures on B5500
CPU that I was involved with would result in branching to some area of
memory that had been flushed with "0"s so CPU would execute this code as
LTSL "0000" and push a word of "0"s into the stack. Since there was no
distinction between code and data on the B5500 CPU this process would
continue untill the limit of memory was hit, so memory dumps would be
useless containing nothing but "0"s.
B6700 had same OP Code for Lit 0 but code words had to have tag 3 so
only code could be executed and if there were more than 3 stack overflow
interrupts in a row then the CPU would Super Halt and a useful memory
dump could be taken. Early B6500 and model 1 B6700 CPUs did not have the
Super Halt feature and would over write the contents of memory with the
IRW, P1 & P2 words of the stack overflow interrupt until the limit of
memory. So only good part of memory dump was below where problem
occurred.
My recollections
Ex FE
Yes, unless you use the "trick" I mentioned before about preceding the
addr with the desired character.
> I believe that Bob Rogers has given presentations at Share that indicated
> that this MVC to addr+1 from addr is optimized in current z-series servers.
Neat, thanks.
z990/z9 and a new z6 are certainly modern cached CPUs.
If IBM ever decides to publish SPECInt2006 for z6 I wouldn't be
surprized to find it in the close fight for the 3rd place with likes
of Intel's Montvale and AMD's Opteron (Intel's Merom and IBM's own
Power6 untouchable by now in this particular benchmark).
One detail that some readers may be missing in this discussion is that
the "transfer words" operators always adjusted the source and
destination addresses (i.e., indexed data descriptors) forward to the
next word boundary if those addresses were not already on a word
boundary. Therefore, the programmer had to worry about the potential
partial words at each end of the move if they wanted to do a
word-oriented transfer.
If my memory serves, the difference between the performance of "transfer
characters" and "transfer words" instructions was significant, or at
least noticeable, on the B6x00 processors. The B7x00 designs paid a lot
more attention to the efficiency of this sort of thing, since moving
characters from one place in memory to another is typically the most
common thing that COBOL-based applications do.
An FE once told me that the B7700 did not quite get all of the memory
interleaving and boundary conditions for its transfer instructions right
(in terms of efficiency), but that the lessons learned from that design
were plowed back into the B7800 and later systems, and that the extra
overhead for handling variable offsets within words was swamped by the
memory access time on those later systems. That makes Edward's finding
on the B7900 all the more surprising, and I agree that it was probably
due to some sub-optimal microcode and not something intrinsic in the
design. In any case, I suspect that the relative difference in
performance between character- and word-oriented moves on the B7900 was
a lot less than that on the B6700.
My suspicion is that, today, with all of the caching and pipelining in
modern processors, you would not see any difference at all between the
performance of character- and word-oriented moves.
It was easy to get confused and think of transferring words as a quick
way to transfer multiples of six bytes, and this adjustment to a word
boundary sometimes caught people by surprise.
Louis
I also remember the smearing "MVC addr,addr+1" being a recommended
technique for clearing an area of memory on S/360 and /370 systems. One
issue with this technique is that it requires an equal number of reads
and writes to memory. Of course, a good processor design could probably
optimize away most of the read memory accesses, but I would be surprised
if the S/360 processors (below the Model 85, at least) were that smart.
The sad part about the model discrepancy/benchmark story I related
earlier in this thread was that the B6700/7700 (and current systems,
too) have a much more straightforward and efficient way to clear an area
of memory.
Most data transfer (move) operators on these systems require three
parameters to be set up in the stack: a destination address, a source
address, and a length. Some operators take additional parameters,
depending on their functionality, but simple transfers require just
these three.
Note that the source and destination "addresses" are really indexed data
descriptors. A nice feature of the MCP architecture is that the source
operand for a transfer instruction can be either an address or a data
word. Since all words have tags, the processor can distinguish between
an indexed data descriptor (tag=5 on B6000-series machines) and a data
word (tag=0). If the source operand is a data word, the processor uses
that value as the source data, circularly reusing the bytes in the word
until the transfer length is exhausted. The word's value is held in a
processor register, so there are no memory reads while the transfer
takes place, just writes -- unless there are partial words affected at
the beginning or end, of course.
The benchmark I mentioned came from an IBM S/360 shop, so I can
understand an IBM-focused programmer using the same smearing technique
for clearing memory in COBOL that was popular in assembler. The program
from that benchmark that highlighted the discrepancy between the B6700
and B7700 processors probably did something like this:
01 BIG-RECORD.
02 BIG-PREFIX PIC X(1).
02 BIG-AREA PIC X(4095).
MOVE SPACE TO BIG-PREFIX.
MOVE BIG-RECORD TO BIG-AREA.
BIG-PREFIX may have been X(4) or X(8) in the benchmark program, but
anything less than X(12) [two words] would have brought forth the
problem on the B6700. Note that this snippet generates a smearing move
of BIG-RECORD to BIG-AREA for 4095 characters. Moving a larger area to a
smaller one is a well-defined operation in COBOL (the move is truncated
by the length of the shorter area), so this is not an error, although
many compilers will generate a "move truncation" warning message in this
case.
The following shows the code that is generated for these move statements
using the current MCP 11.1 COBOL-85 compiler. The hex sequences after
the "%" are the actual opcodes. As far as I can tell, this is exactly
the same code that would have been generated for B6000 systems in the 1970s.
510700 MOVE SPACE TO BIG-PREFIX.
0002:0000:0 ZERO % B0
0002:0000:1 NAMC (2,002) % 5002
0002:0000:3 INDX % A6
0002:0000:4 VALC (1,003) % 2003
0002:0001:0 ONE % B1
0002:0001:1 TUND % E6
510800 MOVE BIG-RECORD TO BIG-AREA.
0002:0001:2 ONE % B1
0002:0001:3 NAMC (2,002) % 5002
0002:0001:5 INDX % A6
0002:0002:0 ZERO % B0
0002:0002:1 NAMC (2,002) % 5002
0002:0002:3 INDX % A6
0002:0002:4 LT16 4095 (0FFF) % B30FFF
0002:0003:1 TUND % E6
For the first statement:
* The ZERO/NAMC/INDX sequence pushes an indexed data descriptor
(IDD) to offset 0 in the BIG-RECORD area, which is implemented as
an 8-bit character array. The descriptor for BIG-RECORD is located
at (2,2) (i.e., lex level=2, offset=2) in the stack. This is the
destination operand.
* VALC (1,3) pushes a word consisting of six EBCDIC spaces copied
from location (1,3) in the Segment Dictionary stack. This word was
created by the compiler. This is the source operand.
* ONE pushes a literal value of 1 onto the stack. This is the length
operand.
* TUND is the "transfer characters unconditional destructive"
operator. It simply transfers characters from the source to the
destination for the specified length. "Destructive" means that the
operator deletes all three operands from the stack at completion;
the "update" version of this operator (TUNU) would have left
updated source and destination operands in the stack for a
subsequent transfer operation.
The second statement works similarly, but uses a different type of
source operand. The ONE/NAMC/INDEX sequence pushes an IDD to the
destination at offset 1 within BIG-RECORD. The ZERO/NAMC/INDEX sequence
pushes an IDD to the source at offset zero in the same area. LT16 pushes
the value 4095 onto the stack. TUND works as described above, but this
time in a memory-to-memory instead of register-to-memory fashion, and
with a much longer length operand.
Now contrast that code sequence with what would have been generated if
the programmer had just moved spaces to BIG-RECORD in a straightforward
fashion:
511200 MOVE SPACE TO BIG-RECORD.
0002:0003:2 ZERO % B0
0002:0003:3 NAMC (2,002) % 5002
0002:0003:5 INDX % A6
0002:0004:0 VALC (1,003) % 2003
0002:0004:2 LT16 4096 (1000) % B31000
0002:0004:5 TUND % E6
This is exactly the same as for the first statement above, except for
the LT16 instruction pushing a length of 4096 onto the stack instead of
the ONE instruction pushing a length of 1. Not only is it less than half
the bytes of code, it's quite a bit more efficient. So much for the
value of programmer-induced low-level performance optimization --
especially when crossing system architectures. It just goes to show that
sort of thing should really be the province of the compiler and ISA
implementation.
Interestingly, the S/360 smearing technique does not work even on more
modern MCP systems. I compiled the code above and ran it on both E-mode
level Beta (NX4201) and Delta (Libra 300) systems. Both of them are
emulated systems, and on both the smear does not work if the size of
BIG-PREFIX is less than one word (6 bytes). What I see is BIG-AREA
receiving a copy of what is in BIG-PREFIX for exactly the length of
BIG-PREFIX, but after that binary zeros (which is the initial value for
an area established by the MCP memory allocation routines). This
indicates to me that source word fetches are taking place before
destination word stores (as you would expect) and that the processor is
not attempting to accommodate the overlapping addresses for the move.
At least the behavior is now consistent between these two versions of
E-mode. Can someone please try this on an Epsilon-level machine and
report the results?
Microcode :-)
Seriously. At least on some models, that case was recognised and
optimised.
Regards,
Nick Maclaren.
And my suspicion is that today with all of the caching and pipelining
in modern processors, you indeed would see no significant difference
between the performance of character- and word-oriented moves when at
least on one end of transfer misses L2 cache.
On the other hand, I suspect that for more common (at least for medium-
sized transfers) case of both source and destination being cached in
L1/L2 the difference between character- and word-oriented moves is
bigger than ever.
L3-resident transfers (on the processors that at all have L3 cache)
would fall somewhat in between, because the typical L3 can't sustain
transfers at rate of one word per CPU cycle but is faster than one
character/cycle.
Of course, I'd prefer measurements over suspicions. Too busy now to do
it myself.