From a recent post on IBM-MAIN from Bruce Black
[bbl...@FDRINNOVATION.COM]:
>Whenever a 256 byte area currently in I-cache is modified,
>it is transferred to D-cache. But if instructions in that
>area are executed, it must be flushed from the D-cache to
>L2 cache, and re-fetched into I-cache, which causes a
>serious execution penalty. If they repeated on the same
>location, it has to go back to D-cache and all over again.
Assuming this is at least fairly accurate...
Previous IBM CPUs didn't have such a penalty for code that
resides in close proximity to the data it modifies. Any CPU
architecture change is bound to favor different programming
techniques -- but wasn't this seen (by IBM architects) as a
potential MAJOR performance change? Small routines that
otherwise have NO reason to keep data areas separate from
code, or programs that dynamically build/modify code will
perform WORSE on z/900 than on previous hardware. A likely
place for such code is in the heart of parameter-driven
number crunching.
This kinda gets back to the old "where's the instruction
timings" whine we occasionally engage in, because those
timings were accompanied by an overview of factors that
affect timing.
The z/Architecture Principles of Operation does not mention
any details of what disrupts the cache, and thus reduces
CPU performance. Such model-dependent trivia isn't in
PoOPs... but how is a coder to learn of things like this?
I'll gladly RTFM if one exists.
-jcf
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to list...@bama.ua.edu with the message: GET IBM-MAIN INFO
As to publishing instruction timings, yeah, I'd like to see them do that.
But don't be surprised if the timing manual for one model is thicker than
the new PoOps; it's been a long time since such things were simple. Even the
old 370 line had all sorts of special cases in the timing information.
Shmuel (Seymour J.) Metz
> -----Original Message-----
> From: John Ford [SMTP:zj...@YAHOO.COM]
> Sent: Tuesday, February 13, 2001 3:04 PM
----------------------------------------------------------------------
Shmuel (Seymour J.) Metz
I've played the game of using R13 as a combined base and save area pointer
also, but not in the last few decades.
IBM was telling us to keep code and data separate in the early 70's; I
consider that adequate notice, especially for software vendors.
Shmuel (Seymour J.) Metz
> -----Original Message-----
> From: g...@ugcs.caltech.edu [SMTP:g...@ugcs.caltech.edu]
> Sent: Tuesday, February 13, 2001 4:08 PM
>
> Some models of PDP11 have a memory management system so that they
> can address more than 64k (bytes). Some OS used this, possibly
> including putting instructions and data in different places.
>
> Most non-reentrant assembly and even fortran S/3x0 programs that
> I knew had static data areas in the program CSECT. This includes
> save areas. I think it was mentioned not so long ago in this
> group, the trick of putting the save area near the top of the CSECT
> so that R13 can be used for the base register.
>
> Back at least to PL/I (F) static data had its own CSECT,
> and most data areas, including save areas, were dynamically allocated.
>
> I don't know LE at all, but I suspect it does even more with its
> memory allocation.
>
> I believe that it is more modern programming techniques that
> tend to keep code and data apart, more than just unix.
>
> It does seem that more notice of this might have been nice. Still,
> I don't expect too many people to run Fortran G on their z/ machine.
Not directed at you (Seymour) in particular, but where/how
does one hear these things from IBM? I remember reading in
some Performance Guide(s) about separating code and data
back when paging was a major concern, but I've not seen
(that I remember!) a manual that discusses CPU performance
at the level of detail of instruction pipelines. I'm not
saying they don't exist... I'd just like to know what/where
they are.
-jcf
----------------------------------------------------------------------
Shmuel (Seymour J.) Metz
> -----Original Message-----
> From: John Ford [SMTP:zj...@YAHOO.COM]
> Sent: Tuesday, February 13, 2001 5:19 PM
----------------------------------------------------------------------
This is FUD as far as I am concerned. I suggest you have facts before
making a broad statement like this.
Different processors in the past have frequently had penalties for
having code and data in close proximity. The cost/impact of doing so
varies with the general processor design, and is required for
conformance to the complex store-into-instruction-stream rules imposed
by the 360 architecture.
I've seen programs that ran slower than expected on a G5/G6 due to store
into stream and the fact that the cache size changed. I've seen similar
issues on the 9021 when its instruction pipeline was disrupted.
These effects are not limited to IBM-designed processors. This is one
of the reasons that make MIPS meaningless.
Greg
Greg Dyck
z/OS Core Technology Design
Jan Jaeger.
> These effects are not limited to IBM designed processors. This is one
> of the reasons that makes MIPS meaningless.
Ha! Indeed not. The habit the old Sharp APL compiler had of storing
into the instruction stream _CRIPPLED_ the Hitachi S8 (NAS AS/9000).
--
Phil Payne
Phone +44 7785 302803 Fax +44 7785 309674
>
>These effects are not limited to IBM designed processors. This is one
>of the reasons that makes MIPS meaningless.
>Greg
Greg,
You had some good points in your reply.
Just to bring everybody up to speed .. could you continue on with
this "thread" and list out some more reasons why MIPS is misleading.
Thanks,
Ed
My apologies for any FUD. When I wrote "...such a
penalty...", perhaps I should have said "...as much of a
penalty...". But you're right, I may have jumped too far in
my conclusions that I based on a couple posts here and on
"Cheryl's List #50" at
http://www.watsonwalker.com/clist50.html.
I don't have a G6 and a z900 at my disposal to run my own
benchmarks, so I don't have facts. [Even if I DID, how
would I know WHAT my code was doing wrong?] The few
Functional Characteristics manuals I've read in recent
years don't say much about things that disrupt instruction
cache/pipeline operation, so I rely on "word of mouth" [via
Internet in this case] to learn such things.
I return to my original question:
Such model-dependent trivia isn't in PoOPs [or Functional
Characteristics]... but how is a coder to learn of things
like this?
I ask this with an honest desire to know the answer, not to
ridicule or otherwise abuse IBM. And certainly not to
spread FUD.
-jcf
----------------------------------------------------------------------
In any case, I'm not aware of any other vendor that publishes detailed
instruction timings and cache behavior models. I think IBM is right to keep
timings internal. OTOH I think it wouldn't hurt them to make available some
basic machine architecture information... (thinking) but wait, they do. If
you subscribe to the IBM Journal of Research and Development, they publish a
wonderful (dense) description of every new machine family (including
mainframes) at about the time the new machines show up. You can usually get
a subscription through your IBM branch office.
If you don't have access to the JoR&D, your options are really limited.
Sooooooooo..... where does one "go" to learn about the implications (old hat
in the risc world btw) of split I and D caches? Most people who know
anything about this subject either (a) currently work in the field, or a
closely related one, or (b) they learn about it in college.
People who fall in category (a) would be H/W designers primarily and folks
(like Greg et al.) directly involved in the OS design, which is necessarily
closely related to the H/W. People who fall in category (b) might actually
become people in category (a) when they grow up, but most don't. I am
fortunate enough to have spent a portion of my life in category (a) (my
mis-spent youth).
If you really want to get a good primer on this sort of thing without
suffering through college hours again, I strongly recommend you cruise on
over to AMAZON and order yourself a copy of "Computer Architecture, A
Quantitative Approach" by John Hennessy and David Patterson. It's not cheap,
but it IS the bible for H/W weenies.
One thing you can be sure of... while IBM terminology may vary from other
vendors, IBM cpus obey the same physics as every other cpu implemented in
similar technology, e.g. if you compare CMOS with CMOS and avoid comparing
CMOS with ECL. Hennessy and Patterson's IBM examples are very dated, but the
rest of the book is wonderful. It will give you a solid feel for why H/W
designers make otherwise mysterious design choices like separating the
I-stream and the D-stream.
After you've spent a few cozy beverage-hours with Hennessy & Patterson, you
should be able to read the IBM Journal articles again (or for the first
time) and they will make a whole lot more sense.
Chris
>-----Original Message-----
>From: John Ford [mailto:zj...@YAHOO.COM]
>I return to my original question:
>Such model-dependent trivia isn't in PoOPs [or Functional
>Characteristics]... but how is a coder to learn of things
>like this?
>
>I ask this with an honest desire to know the answer, not to
>ridicule or otherwise abuse IBM. And certainly not to
>spread FUD.
----------------------------------------------------------------------
> Chris
>
The "IBM Journal of Research and Development" as well as "IBM Systems
Journal" appear to be available on the web at:
http://www.research.ibm.com/journal/. I don't know if these are the current
hardcopy or if this site is delayed.
----------------------------------------------------------------------
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
Bill
-----Original Message-----
From: Craddock, Chris [mailto:Chris_C...@BMC.COM]
<snip> ..
If you subscribe to the IBM Journal of Research and Development, they
publish a wonderful (dense) description of every new machine family
(including mainframes) at about the time the new machines show up.
While I've still got my soap-box out, this situation should cast a dark
cloud of terror over you folks out there who still write non-reentrant code.
That's precisely the kind of code that's going to drive the z-boxes nuts.
So if you haven't made the switch to reentrant code (whether in your asm
code, or your compiled code) -NOW- is the time to get that out of the way.
Non-reentrant code is going to run like a dog on a 2064 when compared with
the same code implemented reentrantly.
Chris
>-----Original Message-----
>From: Ginnane, Shane [mailto:Shane....@QR.COM.AU]
>Now I'm happy to admit I haven't done the pre-req reading, however the
>last time I tried to fathom one of these "dense" articles, I gave up
>when I hit a formula that was longer than what I consider a reasonable
>paragraph should be.
>About page 2 I seem to recall .....
Of course, like Seymour, I'm on old hardware and not likely to be upgraded
anytime soon.
----------------------------------------------------------------------
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
> -----Original Message-----
> From: Craddock, Chris [SMTP:Chris_C...@BMC.COM]
> Sent: Wednesday, February 14, 2001 6:51 PM
> To: IBM-...@BAMA.UA.EDU
> Subject: Re: z/Architecture I-cache
>
> Shane - you probably didn't spend enough "beverage hours" on it. You'll
> notice I didn't specify what kind of beverage you needed ;o)
>
> While I've still got my soap-box out, this situation should cast a dark
> cloud of terror over you folks out there who still write non-reentrant
> code. That's precisely the kind of code that's going to drive the
> z-boxes nuts.
>
> So if you haven't made the switch to reentrant code (whether in your asm
> code, or your compiled code) -NOW- is the time to get that out of the way.
> Non-reentrant code is going to run like a dog on a 2064 when compared with
> the same code implemented reentrantly.
>
> Chris
>
>
>
----------------------------------------------------------------------
Not to mention all the IBM-supplied services that address the parmlist with
BAL R1,*+4+length_of_parmlist.
-jc-
Shmuel (Seymour J.) Metz
> -----Original Message-----
> From: Chase, John [SMTP:jch...@USSCO.COM]
> Sent: Thursday, February 15, 2001 9:46 AM
>
> Not to mention all the IBM-supplied services that address the parmlist
> with BAL R1,*+4+length_of_parmlist.
----------------------------------------------------------------------
>While I've still got my soap-box out, this situation should cast a
>dark cloud of terror over you folks out there who still write
>non-reentrant code. That's precisely the kind of code that's going to
>drive the z-boxes nuts.
I believe that writing reentrant code may not always save your bacon on this
issue...
Let's assume a block of AMODE31 reentrant code that's loaded in 31-bit
storage. This code has GETMAIN'd a block of 24-bit data storage. The
program logic requires a small bit of code that lives in 24-bit storage. So
the program copies a model of the code segment from its 31-bit storage to a
location in the 24-bit data block. The program then invokes the code
fragment from its target location in the 24-bit data block. I'll bet this
could cause the type of cache issue that we've been discussing.
The answer to your concern is, it depends.
If you build it once and use it many times, you will only hit the overhead
once. Now if you build this generated code, and you build it non-reentrant
where, let's say, you have a save/work area at the end of it, then you will
hit the problem.
A cache line can exist in both the I-cache and the D-cache but only as long
as the D-cache line is not modified. All those constants that you have in
your reentrant programs do not cause a problem because they do not get
changed. (Another reason to group all the constants and LTORGs together is
that you will have fewer duplicated cache lines.)
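To make the ping-pong concrete, here is a toy Python model of the behaviour described in this thread. It is only a sketch: the class and function names are made up for illustration, the 256-byte line size is the figure quoted in this discussion, and nothing here claims to describe the real z900 cache hardware.

```python
LINE_SIZE = 256  # cache line size for the z900 as quoted in this thread


class ToySplitCache:
    """Toy model: a line may sit in both caches until it is stored into."""

    def __init__(self):
        self.in_icache = set()  # line numbers held for instruction fetch
        self.dirty = set()      # line numbers modified in the D-cache
        self.penalties = 0      # flush-and-refetch events seen so far

    def execute(self, addr):
        line = addr // LINE_SIZE
        if line in self.dirty:
            # A modified line has to leave the D-cache and be refetched as code.
            self.dirty.discard(line)
            self.penalties += 1
        self.in_icache.add(line)

    def store(self, addr):
        line = addr // LINE_SIZE
        self.in_icache.discard(line)  # the I-cache copy is no longer valid
        self.dirty.add(line)


def run(code_addr, data_addr, calls):
    """Execute at code_addr, then store at data_addr, 'calls' times over."""
    cache = ToySplitCache()
    for _ in range(calls):
        cache.execute(code_addr)
        cache.store(data_addr)
    return cache.penalties
```

In this model a routine at 0x1000 that stores into a work area at 0x1080 pays on every call after the first, while the same routine storing at 0x1100 (the next line) never pays. Read-only constants never enter the dirty set, which matches the point about constants and LTORGs above.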
John McKown, and all the others that have a bunch of non-reentrant code,
really should think about converting it sooner, rather than later, to
reentrant.
Another thing that is aggravating this problem a little is that the cache
line length for the z-Series is 256 bytes, where before it was 128 bytes.
It is probably a small grain of salt in the wound, but it is one more grain.
Chris Blaicher
BMC Software, Inc.
Austin Research Labs
10415 Morado Circle
Austin, TX 78759
512/340-6154
> -----Original Message-----
> From: Eric Chevalier [mailto:blackb...@HOTMAIL.COM]
> Sent: Thursday, February 15, 2001 12:03 PM
> To: IBM-...@BAMA.UA.EDU
> Subject: Re: z/Architecture I-cache
----------------------------------------------------------------------
> John McKown, and all the others that have a bunch of non-reentrant code,
> really should think about converting it sooner, rather than later, to
> reentrant.
>
I would like to do that. However, there is a "problem". I have
responsibility (I didn't design it!) for a general purpose asm subroutine
which is called literally 1000s (maybe close to 10,000!) times every time
this one program is run. If I were to make this routine truly reentrant by
dynamically acquiring a save area, the execution time would be horrendous!
What the program does is load the data from a VSAM file into 31 bit memory.
The records are variable length. The program reads the records into memory,
creating an in-memory array of pointers to each record. When a "get by key"
request is given from the calling program, this routine does a binary search
of the array to find the key requested. In my opinion, this program should
be eliminated, but it is a MAJOR subroutine in many COBOL programs. Telling
applications to change these programs is simply not an option. I guess that I
could minimize this "problem" by only using the save area on the "load the
file" call. The rest of the calls do not require a save area since the
program does not use any system facilities (it's really just a simple binary
search routine which builds its own table from the VSAM file when first
called).
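For readers who have not seen this kind of routine, the shape of it can be sketched in a few lines of Python. This is an illustration only, not the actual subroutine: the class name, the (key, data) record layout, and the use of the standard `bisect` module are all assumptions. The point is that the "get by key" path touches nothing but the table built at load time, so it needs no scratch storage of its own.

```python
from bisect import bisect_left


class KeyedTable:
    """Load records once, then answer 'get by key' with a binary search."""

    def __init__(self, records):
        # The "load the file" call: done once. 'records' is an iterable of
        # (key, data) pairs standing in for the variable-length VSAM records.
        self._entries = sorted(records, key=lambda rec: rec[0])
        self._keys = [key for key, _ in self._entries]

    def get(self, key):
        # The hot path: read-only binary search over the in-memory array.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._entries[i][1]
        return None
```

For example, `KeyedTable([(b"K3", b"c"), (b"K1", b"a")]).get(b"K1")` returns `b"a"`, and a key that is not in the table returns `None`.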
<snip>
> Chris Blaicher
>
> BMC Software, Inc.
>
>
----------------------------------------------------------------------
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
----------------------------------------------------------------------
>
> I believe that writing reentrant code may not always save your bacon
> on this issue...
>
> Let's assume a block of AMODE31 reentrant code that's loaded in 31-bit
> storage. This code has GETMAIN'd a block of 24-bit data storage. The
> program logic requires a small bit of code that lives in 24-bit
> storage. So the program copies a model of the code segment from its
> 31-bit storage to a location in the 24-bit data block. The program
> then invokes the code fragment from its target location in the 24-bit
> data block. I'll bet this could cause the type of cache issue that
> we've been discussing.
First of all, this is really only a problem for performance-sensitive
code (perhaps in a tight loop executed many times). For code that isn't
in the performance path, I wouldn't give this issue a second thought. If
necessary, you could insert a 'DS CL256' between the "code" and "data"
(assuming you have enough base register space to do so).
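The arithmetic behind the 'DS CL256' suggestion can be checked with a short Python sketch. The function names are hypothetical and the 256-byte line size is the figure quoted elsewhere in this thread. A fixed pad of one full line separates code from data whatever the alignment of the CSECT, whereas padding only up to the next boundary saves space but assumes the load address itself is line-aligned.

```python
LINE_SIZE = 256  # z900 cache line size as quoted in this thread


def separated_by_fixed_pad(code_end, pad=LINE_SIZE, line_size=LINE_SIZE):
    """True if a pad of 'pad' bytes after the code keeps data off its line.

    code_end is the address just past the last code byte.
    """
    last_code_byte = code_end - 1
    first_data_byte = code_end + pad
    return last_code_byte // line_size != first_data_byte // line_size


def pad_to_next_line(offset, line_size=LINE_SIZE):
    """Bytes needed so the next item starts on a line boundary.

    Only meaningful if 'offset' is relative to a line-aligned origin.
    """
    return (-offset) % line_size
```

With a full-line pad the check holds for every possible end address, because the last code byte and the first data byte are always more than 256 bytes apart.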
--
| Edward E. Jaffe | Voice: (310) 338-0400 x318 |
| Mgr., Research & Development | Fax: (310) 338-0801 |
| Phoenix Software International | edj...@phoenixsoftware.com |
| 5200 W. Century Blvd., Suite 800 | USS24J24 at IBMMAIL |
| Los Angeles, CA 90045 | http://www.phoenixsoftware.com |
I do exactly this in my LOGON Pre-Prompt Exit. But the 24-bit code is moved
(i.e., updated) only once and is thereafter read only. Unless the data around
the relocated code (within the same cache line) is updated a lot, the penalty
will only occur the first time the code is relocated.
The performance problem with mixing code and updated data is one of frequency
of execution, reference, and updating, and is one that "most" programs will
not notice. There will ALWAYS be exceptions, as it appears with the
originally mentioned ISV product.
Keith
-- So many stupid people. So few comets.
Why not just dispense with the save area altogether and use BAKR|PR?
Tom Harper
-----Original Message-----
From: McKown, John [mailto:JMc...@HEALTHAXIS.COM]
Sent: Thursday, February 15, 2001 1:10 PM
To: IBM-...@BAMA.UA.EDU
Subject: Re: z/Architecture I-cache
> -----Original Message-----
snipped...
If I were to make this routine truly reentrant by
dynamically acquiring a save area, the execution time would be horrendous!
.snipped
I guess that I
could minimize this "problem" by only using the save area on the "load the
file" call. The rest of the calls do not require a save area since the
program does not use any system facilities (it's really just a simple binary
search routine which builds its own table from the VSAM file when first
called).
----------------------------------------------------------------------
> -----Original Message-----
> From: Eric Chevalier [mailto:blackb...@HOTMAIL.COM]
> I believe that writing reentrant code may not always save your bacon
> on this issue...
>
> Let's assume a block of AMODE31 reentrant code that's loaded in 31-bit
> storage. This code has GETMAIN'd a block of 24-bit data storage. The
> program logic requires a small bit of code that lives in 24-bit
> storage. So the program copies a model of the code segment from its
> 31-bit storage to a location in the 24-bit data block. The program
> then invokes the code fragment from its target location in the 24-bit
> data block. I'll bet this could cause the type of cache issue that
> we've been discussing.
My initial response is "why would you ever need to do that anymore?" - even
the crusty old access methods have supported 31-bit callers for years!
Pardon my ignorance, but is there anything left that requires 24-bit?
But, to answer your specific question, "it depends". If you build the code
in a cache-line aligned area and you don't modify any storage within 256
bytes of that area, then it won't be any different than if you had simply
loaded a 24-bit program. (btw: cache lines on z-boxes are 256 bytes).
OTOH, if you modify any storage that's within the same cache line, you'll
get this behavior. There are actually even more horrible things in store than
just the cache flush behavior. The pipeline is going to be trashed and the
cpu is going to stall until the cache line can be refetched from L2 cache.
Not pretty.
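A quick way to reason about "within the same cache line" is to work out which lines a code fragment occupies and which stores land on them. The Python sketch below does that. The function names and the sample addresses are made up for illustration, and the 256-byte line size is the figure given above.

```python
LINE_SIZE = 256  # cache line size on z-boxes, as stated above


def lines_spanned(start, length, line_size=LINE_SIZE):
    """Set of cache line numbers touched by 'length' bytes at 'start'."""
    first = start // line_size
    last = (start + length - 1) // line_size
    return set(range(first, last + 1))


def conflicting_stores(code_start, code_len, store_addrs, line_size=LINE_SIZE):
    """Store addresses that fall on a line also occupied by the code."""
    code_lines = lines_spanned(code_start, code_len, line_size)
    return [addr for addr in store_addrs if addr // line_size in code_lines]
```

A 40-byte fragment built at the aligned address 0x2000 occupies one line, so only stores from 0x2000 through 0x20FF conflict. The same fragment built at 0x20F0 straddles two lines and doubles the exposed range, which is one reason to build it in a line-aligned area.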
Chris
----------------------------------------------------------------------
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
> -----Original Message-----
> From: Tom Harper [SMTP:tha...@NEONSYS.COM]
> Sent: Thursday, February 15, 2001 2:33 PM
> To: IBM-...@BAMA.UA.EDU
> Subject: Re: z/Architecture I-cache
>
> John,
>
> Why not just dispense with the save area altogether and use BAKR|PR?
>
> Tom Harper
>
>
----------------------------------------------------------------------
What about using the DXD/CXD facility? You define your save/work areas and
getmain them once on load/open, then just save the gotten address (I think
-- it's been a while since I coded this way -- check the ASM Reference for
good info, don't depend on me). In any case, I think you save the getmain
address in a Q-type adcon and the other calls just load the contents of the
adcon to get addressability to the getmained area.
I know I have probably not described the details accurately, but I hope you
get the idea -- get once, use many times.
[To anyone who really knows how this type of coding should be done -- my
apologies for approximating the correct technique.]
Also, even if you get many times, using the "STORAGE OBTAIN" macro instead
of GETMAIN generates a PC instruction instead of an SVC, which is much less
overhead.
HTH
Peter Farley
Senior Consultant
ADP Brokerage Information Services
-----Original Message-----
From: McKown, John [mailto:JMc...@HEALTHAXIS.COM]
Sent: Thursday, February 15, 2001 2:10 PM
To: IBM-...@BAMA.UA.EDU
Subject: Re: z/Architecture I-cache
----------------------------------------------------------------------
On the other hand, if your code really was going to run under LE you could
use the CEEENTRY and CEEEXIT macros in your assembler code and your code would
get its working storage needs off the LE stack. Lots faster than GETMAIN.
The more obvious question is whether you even need a save area? Your caller
provides YOU with a save area, but do you actually call anyone else? If not,
then "no problemo" ditch the save area and run without it.
Chris
> -----Original Message-----
> From: McKown, John [mailto:JMc...@HEALTHAXIS.COM]
> I've been told that BAKR/PR will cause all "h" to break loose when we go
> to LE & if the subroutine abends. LE cannot handle subroutines that use
> BAKR/PR. Or so I've been told.
----------------------------------------------------------------------
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
> -----Original Message-----
> From: Craddock, Chris [SMTP:Chris_C...@BMC.COM]
> Sent: Thursday, February 15, 2001 2:43 PM
> To: IBM-...@BAMA.UA.EDU
> Subject: Re: z/Architecture I-cache
>
> It probably wouldn't make any difference to LE. The linkage stack is
> restored to the point it was at when the recovery routine (ESTAE for LE)
> was set. In your case, it would be "back at the right place" anyway if
> your routine abended and LE retried back into the compiled program code.
>
> On the other hand, if your code really was going to run under LE you could
> use the CEEENTRY and CEEEXIT macros in your assembler code and your code
> would get its working storage needs off the LE stack. Lots faster than
> GETMAIN.
>
> The more obvious question is whether you even need a save area? Your
> caller provides YOU with a save area, but do you actually call anyone
> else? If not, then "no problemo" ditch the save area and run without it.
>
> Chris
>
>
----------------------------------------------------------------------
Not too long ago I found BAKR/PR not necessarily faster than usual
PROLOG/EPILOG with GETMAIN/STORAGE/whatever. This may be different with newer
processors.
For performance the calling code passed as first parameter a block of memory
that the assembler subroutine would point to as its R13 savearea. The block was
big enough for savearea and whatever local variables were needed by subroutine.
Where subroutines nested the initial block was big enough for all subroutine
saveareas and variables. Subroutine A would call B passing some offset into
block as 1st parm. This poor man's stack obviated need for
BAKR/PR/GETMAIN/STORAGE, etc. Speedup was quite pleasing. - Jim Keohane
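The bookkeeping for this kind of caller-supplied stack is simple enough to sketch in Python. This is a model of the offsets only, not of the assembler linkage. The function name is made up, the 72-byte figure is the standard 18-word register save area, and rounding each frame up to a doubleword is an assumption rather than something stated above.

```python
SAVE_AREA_SIZE = 72  # standard 18-fullword register save area


def frame_offsets(local_sizes, save_area=SAVE_AREA_SIZE):
    """Lay out one frame per nested routine inside a single block.

    local_sizes lists the bytes of local variables each routine needs,
    outermost first. Returns (offset of each frame, total block size).
    """
    offsets = []
    next_free = 0
    for locals_len in local_sizes:
        offsets.append(next_free)
        frame = save_area + locals_len
        next_free += (frame + 7) // 8 * 8  # assumed doubleword alignment
    return offsets, next_free
```

For three nested routines needing 24, 0 and 100 bytes of locals, the frames start at offsets 0, 96 and 168 and the caller has to supply a 344-byte block. Each routine points R13 at its own frame and hands the next offset to whatever it calls.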
Tom Harper wrote:
--
Jim Keohane
Brigadier Consultant
LockStar, Inc.
1200 Wall Street West
Lyndhurst, NJ 07071
Tel: (201) 508-3231
Fax: (201) 508-3201
http://www.lockstar.com
May those who love us, love us
And those that don't love us,
May God turn their hearts.
And if he doesn't turn their hearts,
May he turn their ankles
So we'll know them by their limping.
- - - Old Gaelic Blessing
I have used the same technique. I was just trying to find something easy
for John to do to fix up his code...
When we get an abend in a program that uses BAKR/PR, we get (for example)
the following LE abend message:
CEE0374C CONDITION = CEE3201S TOKEN = 00030C81 59C3C5C5 00000000
WHILE RUNNING PROGRAM CREBIM
AT THE TIME OF INTERRUPT
PSW 078D0400 800F5892
GPR 0-3 000F5888 000F5E2C 0010C130 00000010
GPR 4-7 00014AE2 000E6000 00000000 00000000
GPR 8-B 000B3BC8 0FB07D58 00005360 800F57E8
GPR C-F 0002EA80 0010C240 800F5884 00000000
followed by a (non-LE) dump, followed by
IEA995I SYMPTOM DUMP OUTPUT
USER COMPLETION CODE=4083 REASON CODE=00000004
TIME=14.25.33 SEQ=18363 CPU=0000 ASID=0036
PSW AT TIME OF ERROR 078D1400 8F74B6E4 ILC 2 INTC 0D
ACTIVE LOAD MODULE ADDRESS=0F719F98 OFFSET=0003174C
NAME=CEEPLPKA
DATA AT PSW 0F74B6DE - 00181610 0A0D5810 503C5840
GPR 0-3 84000000 84000FF3 00005EF0 0F7125E0
GPR 4-7 00000000 0F713AD0 0F713788 00000000
GPR 8-11 0F713788 00000004 00005360 8F74AD48
GPR 12-15 0002EA80 0F712A98 8F74B6D2 00000004
END OF SYMPTOM DUMP
> -----Original Message-----
> From: Craddock, Chris [SMTP:Chris_C...@bmc.com]
> Sent: Thursday, February 15, 2001 3:43 PM
> To: IBM-...@bama.ua.edu
> Subject: Re: [IBM-MAIN] z/Architecture I-cache
>
> It probably wouldn't make any difference to LE. The linkage stack is
> restored to the point it was at when the recovery routine (ESTAE for LE)
> was set. In your case, it would be "back at the right place" anyway if
> your routine abended and LE retried back into the compiled program code.
<snip>
> -----Original Message-----
> From: Jim Keohane [SMTP:jim...@lockstar.com]
> Sent: Thursday, February 15, 2001 5:17 PM
> To: IBM-...@bama.ua.edu
> Subject: Re: [IBM-MAIN] z/Architecture I-cache
>
> Tom,
>
> Not too long ago I found BAKR/PR not necessarily faster than usual
> PROLOG/EPILOG with GETMAIN/STORAGE/whatever. This may be different with
> newer processors.
>It probably wouldn't make any difference to LE. The linkage stack is
>restored to the point it was at when the recovery routine (ESTAE for LE) was
>set. In your case, it would be "back at the right place" anyway if your
>routine abended and LE retried back into the compiled program code.
>
>On the other hand, if your code really was going to run under LE you could
>use the CEEENTRY and CEETERM macros in your assembler code and your code would
>get its working storage needs off the LE stack. Lots faster than GETMAIN.
>
>The more obvious question is whether you even need a save area? Your caller
>provides YOU with a save area, but do you actually call anyone else? If not,
>then "no problemo" ditch the save area and run without it.
>
But remember that I/O services (GET, PUT, READ, WRITE, etc.) use CALL-style
linkages.
Cheers
Steve Comstock
Telephone: 303-393-8716
www.trainersfriend.com
email: st...@trainersfriend.com
256-B S. Monaco Parkway
Denver, CO 80224
USA
>My initial response is "why would you ever need to do that anymore?" - even
>the crusty old access methods have supported 31-bit callers for years!
>Pardon my ignorance, but is there anything left that requires 24-bit?
DCB exits?
Tony H.
> Eric,
>
> The answer to your concern is, it depends.
>
> If you build it once and use it many times, you will only hit the overhead
> once. Now if you build this generated code, and you build it non-reentrant
> where, let's say, you have a save/work area at the end of it, then you will
> hit the problem.
>
> A cache line can exist in both the I-cache and the D-cache but only as long
> as the D-cache line is not modified. All those constants that you have in
> your reentrant programs do not cause a problem because they do not get
> changed. (Another reason to group all the constants and LTORGs together is
> that you will have fewer duplicated cache lines.)
>
> John McKowen, and all the others that have a bunch of non-reentrant code,
> really should think about converting it sooner, rather than later, to
> reentrant.
>
> Another thing that is aggravating this problem a little is that the cache
> line length for the z-Series is 256 bytes, where before it was 128 bytes.
> It is probably a small grain of salt in the wound, but it is one more grain.
note that MVS (& other systems) had a lot of work done on them in the
3081/3084 time-frame to allocate on cache-line boundaries and in
multiples of cache-lines. the issue was big performance hit between
different processors accessing different data that happen to co-exist
in the same cache line (lots of cross-cache invalidation moving data
back & forth between the different processor caches). some of the
cache synchronization/performance issues between I&D caches are the
same as cache synchronization/performance issues between different
processor caches.
in the risc/harvard architectures with different I&D caches,
store-into data cache, and no I&D cache synchronization ... there are
specific instructions ... typically used by at least loaders that
invalidate i-cache lines and push data-cache lines to memory ... so
that when a loaded program is finally branched to ... its instructions
that started life in the data-cache are flushed to memory so that an
i-cache miss will pick it up.
--
Anne & Lynn Wheeler | ly...@garlic.com - http://www.garlic.com/~lynn/
I have some code that builds instructions along with data those instructions will
use, in a small structure. Then the driving code updates the variables and does an
EX (execute) for the instructions (the instructions are CLC or CLCL instructions
which set condition codes, so it's EX, BC, BC, EX, BC, BC... inline - ugly but fast).
Enough of that... here's the question:
Are the targets of EX instructions likely to be hit by the i-cache line invalidation
problem?
[I've already moved the targets of the EX instructions to their own aligned 256 byte
area so the question is really one of curiosity. I've seen a very tiny benefit from
the rework though it may be better compiler optimization and have nothing to do with
cache misses, and of course someday life will probably change again with 512 byte
cache lines]
On Thu, 15 Feb 2001 12:55:16 -0600, you wrote:
>The answer to your concern is, it depends.
>
>If you build it once and use it many times, you will only hit the overhead
>once. Now if you build this generated code, and you build it non-reentrant
>where, let's say, you have a save/work area at the end of it, then you will
>hit the problem.
Doug Nadel
----------------------------------------
ISPF and OS/390 Tools & Toys page:
http://somebody.home.mindspring.com/
Mail containing HTML or any attachments, including vcf files, is
automatically discarded. If you need to send me an attachment,
please let me know so that I can change my email filters.
>My initial response is "why would you ever need to do that anymore?" - even
>the crusty old access methods have supported 31-bit callers for years!
>Pardon my ignorance, but is there anything left that requires 24-bit?
Surprisingly, yes! All of the legacy, non-VSAM access method
functions in VSE/ESA require RMODE 24, and most of those functions also
require AMODE 24. The example I gave in my earlier post was based on a
code fragment that's part of an AMODE 31, RMODE ANY application. The
application is reading a sequential file. The EOF routine called by
VSE is given control in 24-bit mode. So my application builds a short
stub routine in 24-bit storage that gets control at EOF, switches to
AMODE 31 if that's the mode the app was running in when EOF was
detected, and then transfers to the real EODAD routine.
As you and Ed and others have pointed out, the issue I brought up in
my original post isn't significant if the code involved isn't executed
very often (and my EODAD routine isn't). Still, the cache change
introduced in the z900 adds a small item for us programmers to keep in
mind as we write our code.
>-----Original Message-----
>>Pardon my ignorance, but is there anything left that requires 24-bit?
>Surprisingly, yes! All of the legacy, non-VSAM access method
>functions in VSE/ESA...
On the other hand, has IBM announced support for VSE on the z/900? It seems
like something that would be a lot of work for not much reward?
Chris
As long as (a) the data you modify isn't within a cache-line span of the
code and (b) you don't modify the code itself after you've executed it for
the first time... no, it will be perfectly fine.
However, if you -DO- modify the storage containing the instructions and then
re-execute them, you will perceive a substantial penalty, aside from such
considerations as being undispatched and redispatched in between. IOW, you
probably couldn't tell unless you modified/executed the code frequently.
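Condition (a) comes down to simple address arithmetic. The sketch below is only an illustration in C, with a made-up helper name (same_cache_line); the line sizes are the ones quoted in this thread (256 bytes on the z900, 128 bytes before), not something taken from IBM documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when two addresses fall in the same cache line.
   line_size must be a power of two; the thread gives 256 for the
   z900 and 128 for earlier machines. */
static bool same_cache_line(uintptr_t a, uintptr_t b, uintptr_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    /* Mask off the offset-within-line bits and compare line addresses. */
    return (a & ~(line_size - 1)) == (b & ~(line_size - 1));
}
```

For example, a work area 128 bytes past the start of a routine at X'1000' sits in a different line with 128-byte lines but in the same line with 256-byte lines, which is why layouts that were harmless before can start to hurt.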
One other issue that springs to mind, is the EX instruction. By happenstance
I was in a presentation by Bob Rogers this morning and he had some dark
mutterings about just how bad it would be if the target of an execute
instruction were in modified storage. Apparently the effect is even worse
because of the pipeline disruption caused by the EX itself. So the answer to
your question below would seem to be yes, you -ARE- likely to be hit by this
issue.
Chris
>-----Original Message-----
>From: Doug Nadel [mailto:some...@MINDSPRING.COM]
>Sent: Thursday, February 15, 2001 9:50 PM
>To: IBM-...@BAMA.UA.EDU
>Subject: Re: z/Architecture I-cache
>
>
>I've lightly followed this thread with some interest, but have a question.
>
>I have some code that builds instructions along with data those instructions
>will use, in a small structure. Then the driving code updates the variables
>and does an EX (execute) for the instructions (the instructions are CLC or
>CLCL instructions which set condition codes, so it's EX, BC, BC, EX, BC, BC...
>inline - ugly but fast).
>Enough of that... here's the question:
>
>Are the targets of EX instructions likely to be hit by the i-cache line
>invalidation problem?
>
>[I've already moved the targets of the EX instructions to their own aligned
>256 byte area so the question is really one of curiosity. I've seen a very
>tiny benefit from the rework though it may be better compiler optimization
>and have nothing to do with cache misses, and of course someday life will
>probably change again with 512 byte cache lines]
----------------------------------------------------------------------
> ... and of course someday life will probably change again with 512 byte
> cache lines]
Ain't that the kicker! We've already made some changes in our code to
avoid this problem. Unfortunately, the way some of the changes were
implemented, the problem will come right back if cache line sizes ever
increase. Looking for a solution that will work for a long, long time, I
decided the only truly safe thing to do is acquire a separate 4K page on
a 4K boundary for the dynamically built instructions. It wouldn't make
much sense for the hardware designers to allow a cache line to ever span
a page boundary. Of course, it's always possible that in the distant
future they could increase the machine's page size beyond 4K. If that
ever happens, they can bring me out of retirement to consult on the fix. :-)
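As a rough illustration of the "separate page on a page boundary" idea, here is a portable C sketch. It is not the code described above: the helper name alloc_code_page and the CODE_PAGE_SIZE constant are invented, C11 aligned_alloc stands in for whatever the real code does (on MVS that would presumably be GETMAIN or STORAGE OBTAIN with BNDRY=PAGE), and it only shows the alignment, not how the page becomes executable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CODE_PAGE_SIZE 4096u  /* assumption: 4K pages, as in the post */

/* Returns one whole page, aligned on a page boundary, to hold the
   dynamically built instructions; NULL on failure. Release with free().
   Because the page holds nothing else, no unrelated data can share a
   cache line with the generated code, whatever the line size is. */
static void *alloc_code_page(void)
{
    void *p = aligned_alloc(CODE_PAGE_SIZE, CODE_PAGE_SIZE);
    assert(p == NULL || ((uintptr_t)p % CODE_PAGE_SIZE) == 0);
    return p;
}
```

This holds as long as a cache line never spans a page boundary, which is the assumption the post relies on.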
--
| Edward E. Jaffe | Voice: (310) 338-0400 x318 |
| Mgr., Research & Development | Fax: (310) 338-0801 |
| Phoenix Software International | edj...@phoenixsoftware.com |
| 5200 W. Century Blvd., Suite 800 | USS24J24 at IBMMAIL |
| Los Angeles, CA 90045 | http://www.phoenixsoftware.com |
----------------------------------------------------------------------
> On the other hand, has IBM announced support for VSE on the z/900? It seems
> like something that would be a lot of work for not much reward?
A lot of work? My understanding is that the current version of VSE/ESA
runs fine unchanged on zSeries 900 servers (ESA mode of course). APAR
DY45534 is required to upgrade the stand-alone IOCP program to be able
to generate the IOCDS for a 2064.
--
Edward E. Jaffe
----------------------------------------------------------------------
>My initial response is "why would you ever need to do that anymore?" - even
>the crusty old access methods have supported 31-bit callers for years!
>Pardon my ignorance, but is there anything left that requires 24-bit?
...and to add another example: most of the TSO/E macros like STLINENO,
GTTERM etc. still do.
/martin
On Thu, 15 Feb 2001 23:25:16 -0800, Ed Jaffe wrote:
>Doug Nadel wrote:
>
>> ... and of course someday life will probably change again with 512 byte
>> cache lines]
>
>
>Ain't that the kicker! We've already made some changes in our code to
>avoid this problem. Unfortunately, the way some of the changes were
>implemented, the problem will come right back if cache line sizes ever
>increase.
Doug Nadel
----------------------------------------------------------------------