I ask about the CPU cost of an EX because that same program that I'm working on uses EX a fair amount to move "variable length" strings into a blank-initialized area for reporting purposes. Instead of an EX of an MVC, I could use MVCL or MVCLE. But many have said that an EX of an MVC is less overhead than MVCL in many cases, especially since I know that my length is always no more than 255 characters. I check and report an error if the length is 256 or more.
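For reference, a minimal sketch of this kind of EX'd MVC (illustrative names only; the length is assumed to be validated as 1-255 beforehand):
         LH    R4,STRLEN           true length of the string, 1-255
         BCTR  R4,0                machine length = true length - 1
         EX    R4,MOVESTR          execute the MVC with that length
         ...
MOVESTR  MVC   OUTAREA(0),INSTR    subject instruction; length byte supplied by R4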
As an aside, to whoever it was who recommended the TROO as a way to move bytes from an input area to an output area while testing for "unprintable" bytes - thanks! It made my code much easier to write and understand. I was going to use a TRT and an EX'd MVC in a loop. A TROO in a loop was super easy to code.
--
John McKown
Systems Engineer IV
IT
Administrative Services Group
HealthMarkets(r)
>> move "variable length" strings into a blank-initialized area for
>> reporting purposes.
How about using MVCL and its padding function to fill in the
blanks?
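For illustration, a minimal sketch of that approach (OUTAREA and INSTR are made-up names, and the true source length is assumed to be in R5 already); the pad byte goes in the high-order byte of the 32-bit source-length register, and MVCL blank-fills whatever the source does not cover:
         LA    R2,OUTAREA          destination address
         LA    R3,L'OUTAREA        destination length
         LA    R4,INSTR            source address
*        R5 already holds the true source length (0-255 here)
         ICM   R5,B'1000',=C' '    pad byte = blank, bits 32-39 of R5
         MVCL  R2,R4               move, then blank-fill the remainder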
--
Martin
Pi_cap_CPU - all you ever need around MWLC/SCRT/CMT in z/VSE
more at http://www.picapcpu.de
Fred!
On Jan 12, 2012, at 16:33, "McKown, John" <John....@healthmarkets.com> wrote:
> As an aside, to whomever it was who recommended the TROO as a way to move bytes from an input area to an output area, while testing for "unprintable" bytes - thanks! It made my code much easier to write and understand. I was going to use a TRT and an EX'd MVC in a loop. A TROO in a loop was super easy to code.
What about MVCOS?
--
Kind regards,
-Steve Comstock
The Trainer's Friend, Inc.
303-355-2752
http://www.trainersfriend.com
Somebody oughtta test that theory and publish a table of "break" points (by machine type) at which the more "complex" instructions become "better", and below which you should stick with the "old-fashioned" way to do it. I'd do it but there seems to be a very serious lack of available round tuits for "interesting" work in my life lately. There is far too much "required" stuff in the queue ahead of "interesting".
Just my USD$0.02 worth.
Peter
<begin snippet>
My prior experiences in replacing MVCL/CLCL's with multiple MVC/CLC's
and even MVC/CLC loops for "small" areas (FSVO "small") is that
MVCL/CLCL loses almost every time.
</end snippet>
Preoccupation with these issues is, at best, counter-productive. The
MVCLE is logically simpler and should be used unless one knows that
one has only some fixed, small number n << 256 bytes to move.
No one has ever claimed that the timing differences here are large,
significant ones; and the continuing preoccupation here with
suboptimizing of this sort is, I think, evidence of a pervasive
malaise, a retreat into the familiar that precludes consideration of
more, much more, important design issues.
> EX is indeed expensive, but my guess (untested) is that an EXecuted MVC for small lengths (not only under 256 but even less) is probably still more efficient than MVCL for those lengths, and *definitely* more efficient than MVCLE. My prior experiences in replacing MVCL/CLCL's with multiple MVC/CLC's and even MVC/CLC loops for "small" areas (FSVO "small") is that MVCL/CLCL loses almost every time.
I have a regular need for a CLC over 0..8 bytes (a kind of "pseudo
wildcard" pattern where a trailing * specifies matching of the
preceding substring).
Since it's already Friday here, this is what I came up with after some
experiments (forgive my SPM accent, I trust the intentions are clear).
I'm open to ideas...
LA R1,OP2+L'OP2 Beyond string in case no spaces
TRT OP2,SPACE Find first ' ' in pattern
CR R1,R3
COND NOTEQUAL,DECR,R1 Point at last non-blank, if any
CLI 0(R1),C'*' Is the last non-blank a '*'?
IF NOTEQUAL
CLC OP1,OP2 No '*' - just compare them
ELSE ,
SR R1,R3 Compute length before '*'
IF NOTZERO
DECR R1 Machine length = length - 1
INLINEX R1,CLC,OP1(0),OP2 EXecuted compare on that length
FI ,
FI ,
The INLINEX is my macro and generates (in case of the CLC) this
INLINEX R1,CLC,OP1(0),OP2
+INLINEX0338 CLC OP1(0),OP2
+INLINEY0338 EX R1,INLINEX0338
From what we could tell, the CLC is still warm when EX hits it.
A slower alternative was a computed branch to handle the 9 cases. With our
usage mix, the extra test to use a plain CLC over 8 bytes seems to pay
off.
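Not the actual macro, but a rough sketch of how such an expansion could be generated with &SYSNDX (the run-over-the-length-0-subject flavour shown above):
         MACRO
         INLINEX &REG,&OP,&OP1,&OP2
INLX&SYSNDX &OP &OP1,&OP2
         EX    &REG,INLX&SYSNDX       execute the subject with the real length
         MEND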
> If you're looking to reduce CPU usage you might want to optimize the TRT
> the heck out of the equation. Talk about expensive! [augment with
> imagined or actual sound of cash register "cah-ching" sound for added
> emphasis/effect]
Ok... but how? Would a loop stepping over the max 8 bytes be wiser
to find the first blank? Another idea I had was to step a 2-byte CLC
with '* ' over the string, but the complexity and the end condition spoil the
fun.
Guess I never really measured TRT. A variation of this code is used
to search items in a linked list. I obviously moved the TRT out of the
loop and that might have helped make it faster.
Rob
> OK, I hope I'm not becoming wearisome with my yammering. But I am not too busy right now. And I still really like and respect the z architecture (despite its horrendous price).
>
Of course, that comes not out of our pocket, but your employers'.
And you've often mentioned how cost-conscious they are. So, then,
why are they not considering converting to Linux rather than to
Windows? OS software would be much cheaper; hardware should
be the same (literally); is this then offset by the cost of middleware
and application software?
> I ask about the CPU cost of an EX because that same program that I'm working on uses the EX a fair amount to move "variable length" strings into a blank-initialized area for reporting purposes. Instead of EX of an MVC, I could use MVCL or MVCLE. But many have said that EX of an MVC is less overhead than MVCL in many cases. Especially since I know that my length is always no more than 255 characters. I check and report an error if the length is 256 or more.
>
If EX is so bad, I wonder about a chain of:
TM COUNT,128
BNO *+18
MVC 0(128,DEST),0(SOURCE)
LA DEST,128(,DEST)
LA SOURCE,128(,SOURCE)
TM COUNT,64
BNO *+18
MVC 0(64,DEST),0(SOURCE)
LA DEST,64(,DEST)
LA SOURCE,64(,SOURCE)
...
TM COUNT,1
BNO *+10
MVC 0(1,DEST),0(SOURCE)
(Or wrap it in a loop)
(I haven't been an assembler programmer for three decades; fill
in the blanks.)
-- gil
> If you're looking to reduce CPU usage you might want to optimize the TRT
> the heck out of the equation. Talk about expensive! [augment with
> imagined or actual sound of cash register "cah-ching" sound for added
> emphasis/effect]
>
Boyer-Moore? I guess that's no use for individual characters.
-- gil
In general I tend to agree with this, but I've worked or
consulted at installations that either had problems completing
overnight jobs in their assigned batch window, or just
processing large amounts of data.
While I haven't tried this on very current machines, on older
ones EX added 40 to 50% to the instruction time (EX overhead on
some Amdahl machines was greater); 4 MVCs of 256 bytes were
about the same as a 1K MVCL; and 5 CLI/BE were about the same as
one TRT/B *+4(R2). In each case it paid to identify the most
frequently executed code and look for improvements.
Gerhard Postpischil
Bradford, VT
have you tried SRST?
I had a hard time getting used to SRST's way of using/wanting the
registers - but then... It does an excellent job of searching for one
(and only one) character in a string.
Here is a simple sample for SRST:
L R15,SAVE point in string for cont
LA R14,256(R15) end of the area to scan
LA R0,C'/' character to find, in GR0
SRST R14,R15
* R14 is now on the first /
LA R15,1(R14) resume just past it
LA R14,256(R15) new end of the area to scan
SRST R14,R15
* R14 is now on the second /
Two hints:
1.) SRST should be followed by a JO *-4, but the POP says the minimum
length scanned is 255, so it can be omitted in certain cases.
2.) A found condition is indicated by L (low = located).
A not-found condition is indicated by H (high = not located) -
so an extra JH NOT_FOUND might be useful (or JL LOCATED_CHAR).
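Putting the two hints together, a minimal skeleton might look like this (STRING and NOTFOUND are made-up names):
         LA    R0,C'/'             character to locate, in GR0
         LA    R15,STRING          start of the area to scan
         LA    R14,STRING+L'STRING first byte NOT to be scanned
SCAN     SRST  R14,R15             R14 -> the found byte when CC=1
         JO    SCAN                CC=3: CPU stopped early, resume
         JH    NOTFOUND            CC=2: end reached, no '/' present
*        fall through with CC=1: R14 addresses the '/'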
That makes sense. It sounds like even if you can afford to MVC the
entire buffer (because you know there is room in the destination and
you're not near the edge of the source) then it might make sense to EX
MVC if you know the actual size and it's less than half on average.
For short EX MVC's the burden of getting stuff in the right registers
makes MVCL less interesting.
My preoccupation with this is mostly on Friday ;-) And I guess I
should not write real code on Friday 13th anyway...
The EX CLC is in fact in a loop scanning a linked list for the right
entry among 100-200 elements. My big savings were moving the TRT etc
out of the loop. I was tempted to also take the decision between CLC
and EX CLC out of the loop, but didn't for ease of maintenance.
Rob
Martin,
Haven't, and probably should for my own education. We restrict our
products to older architecture levels for a number of good reasons.
Rob
<begin snippet>
In general I tend to agree with this, but I've worked or consulted at
installations that either had problems completing overnight jobs in
their assigned batch window, or just processing large amounts of data.
</end snippet>
I value GP's concurrence. Let me add, however, that 1) in my
experience batch-window problems are always i/o-related; and 2) the
unwashed always attack them in the wrong way, devoting resources to
"optimizing" instruction sequences that, even if it had been possible
to reduce their CP consumption to zero, would have left the
batch-window problem unresolved.
These applications, like most commercial batch ones, were i/o-bound,
and their resolution required the use of overlapped, asynchronous i/o,
which, for those who know how to do it, is not difficult. What it
was/is in most of these shops was/is, quite literally, unthinkable.
(The RESIDENCE time of a classical MFU can always be cut by a factor
of four or more using asynchronous i/o.)
>On Fri, Jan 13, 2012 at 8:13 AM, Martin Truebner wrote:
>> have you tried SRST?
>
>Haven't, and probably should for my own education. We restrict our
>products to older architecture levels for a number of good reasons.
How old?
SRST was first documented in the second edition of the ESA POO.
That is much older than the Relative and Immediate instructions.
--
Tom Marchant
Egregiously bad SQL can be the villain. I dealt recently with a
situation in which the SQL builtin function max was used repeatedly to
obtain a next, new 'treaty number' in an insurance application. It
did its job, correctly, by scanning all of the rows of one of the
largest DB2 tables I have encountered over and over and over again,
with predictable consequences.
I believe, however, that, as here, bad SQL is very often bad at bottom
because it triggers too much implicit i/o. DB2's read and write
engines are very good indeed at what they do; but both they and the
sort invoked from DB2 can be asked to do too much gratuitous work.
John Gilmore, Ashland, MA 01721 - USA
--
John, Gerhard, right on! I have been even more radical!
First I attempted to *eliminate* the need to have the code
*THERE* in the first place, in the most often executed path.
On one occasion I added an extra field to the record (row)
to avoid generating a key each time.
Another example comes to mind: scrapping a name-and-address
decompression routine and replacing its loops, bit shifts, and
translates with blank-truncation and a table of the 65,000
most common city and street name "words" on file. To
print an address became lightning fast: follow a chain of
1-byte offsets to the next 3-byte placeholder and load the
2-byte index into the table, a few RX (L LA IC), an SLL
to convert the index to an offset, an SR/JNP to detect and
move text between tokens, and lastly an EX-MVC combo.
For this elite group here, this post is really OFF-TOPIC.
You worry about picoseconds because your code runs a zillion
times per... If you weren't, the perennial EX topic would fit
in nicely with my TGIF post - let's have a great weekend!
Andreas F. Geissbuehler
AFG Consultants Inc.
http://www.afgc-inc.com/
Yours likewise!
I got these kinds of optimization mandates because
of the CPU hours and EXCPs it cost to run some batch
jobs and CICS transactions at my clients' service bureau.
> Egregiously bad SQL can be the villain.
Indeed, and big, impressive gains can be made using ancient
methods, writing no more than a few sort/merge exits :-))
Relative instruction performance is a moving target. We run benchmarks whenever
we get a new processor so we can understand the trends. Most new hardware
generations build on the microprocessor design of the prior generation, so the
changes tend to be incremental.
Now and then, the microprocessor gets completely redesigned. One such redesign
occurred with the introduction of the z10. I mentioned our observations re:
EXecute and MVCL performance in my "z10 User Experience" at SHARE in Denver.
Check out slide 13 for this information. Thanks again to David Bond for helping
us make sense of our measurements.
http://proceedings.share.org/client_files/SHARE_in_Denver/S2215EJ161728.pdf
--
Edward E Jaffe
Phoenix Software International, Inc
831 Parkview Drive North
El Segundo, CA 90245
310-338-0400 x318
edj...@phoenixsoftware.com
http://www.phoenixsoftware.com/
As I understand it, and this doesn't seem to have been mentioned, the big
effect on EX has to do with caching. I believe that it should be
near code and not data. (That is, it goes into the instruction cache
instead of the data cache.) Then again, I could have that backwards.
Other than cache, it should be plenty fast enough.
> My preoccupation with this is mostly on Friday ;-) And I guess I
> should not write real code on Friday 13th anyway...
> The EX CLC is in fact in a loop scanning a linked list for the right
> entry among 100-200 elements. My big savings were moving the TRT etc
> out of the loop. I was tempted to also take the decision between CLC
> and EX CLC out of the loop, but didn't for ease of maintenance.
I usually use a hash table. Especially if speed is important.
You could also do binary search, which will find the right entry
with about log(n) comparisons.
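(For 100-200 entries that is roughly 7-8 comparisons, against an average of 50-100 probes for a straight scan of the list.)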
-- glen
>> The EX CLC is in fact in a loop scanning a linked list for the right
>> entry among 100-200 elements.
>
>You could also do binary search, which will find the right entry
>with about log(n) comparisons.
How do you do a binary search on a linked list?
--
Tom Marchant
> I usually use a hash table. Especially if speed is important.
>
> You could also do binary search, which will find the right entry
> with about log(n) comparisons.
Yeah, and I prefer to stop the earth's rotation when I take a sun bath... ;-)
Re-reading, I see I confused you with "the right entry" where it
actually may be more than one, so I have to walk the entire list. In
fact, each entry has up to 5 possible fields to check like this. In
this case the change frequency of the data is higher than the reference
rate, so on average I would have to build the hash table or search
tree on each reference. And I don't really have a context where I
could keep it.
But if you have an efficient hash function handy for 200 strings of
6-8 (uppercase) characters, I'm game. My ad-hoc tests were a bit
disappointing in rehash.
Rob
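For what it's worth, one crude possibility, purely as a sketch (KEY, HASHTAB and a 256-way table of chain heads are made up, and keys are assumed blank-padded to 8 bytes):
         L     R1,KEY              first four bytes of the key
         X     R1,KEY+4            fold in the last four (blank-padded)
         L     R0,=X'9E3779B1'     odd multiplier, golden-ratio flavour
         MSR   R1,R0               scramble the folded word
         SRL   R1,24               keep the top 8 bits: bucket 0-255
         SLL   R1,2                times 4 = fullword offset
         L     R15,HASHTAB(R1)     head of the chain for this bucket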
> One Assembler trick I have seen in speeding up scanning loops was to use a
> CLI instruction to check the first byte of a string and then only doing the
> CLC/CLCL if the CLI matches. This trick even works if doing a binary
> search.
Marginal savings, I think, compared to EX/CLC or CLCL,
for the reason that both CLC and CLCL give up after examining the
first character, should they be unequal.
Might be more fruitful to compare the length of the key with that of an element first,
and then carry out the compare should those lengths be equal.
>One Assembler trick I have seen in speeding up scanning loops was to use a
>CLI instruction to check the first byte of a string and then only doing the
>CLC/CLCL if the CLI matches. This trick even works if doing a binary
>search.
I don't know if the cost of EX is high enough that you would benefit
from doing a one-byte CLC before an EX of a CLC. I don't see how a CLI
will help you though.
--
Tom Marchant
Having the CLC near the EX helps for cache. I also like to assemble it
in-line because the right USINGs apply. We noticed that it is
attractive to run over the CLC (with the length byte 0 as assembled)
and then EX it behind your back to do the real thing. More attractive
than branching over the target, if the instruction lets you.
I doubt whether a branch between the CLC and the EX would be an
advantage. Depending on how often the comparison already fails on the
first byte, you trade an untaken branch against an EX CLC that fails
on the first byte. Guess I should try that some Friday afternoon...
Rob
Ya-but...
SRST came in sometime during the late System/370 era. I have a yellow
book with SRST and CLST defined.
(I've been burned only once by a non-System/370 instruction (ICM), and
that was on a plug-compatible that a Brazilian customer was running in
the early 1990s.) I have burned myself on using the wrong ARCH option in
a C compile when a customer was still running a z900 (ARCH(5)) and I had
accidentally left it set to ARCH(6).
Later,
Ray
--
M. Ray Mullins
Roseville, CA, USA
http://www.catherdersoftware.com/
German is essentially a form of assembly language consisting entirely of
far calls heavily accented with throaty guttural sounds. ---ilvi
French is essentially German with messed-up pronunciation and spelling.
--Robert B Wilson
English is essentially French converted to 7-bit ASCII. ---Christophe
Pierret [for Alain LaBonté]
Tom Marchant correctly mentioned that SRST/CLST came in with ESA, not late
System/370, as a look at my SEARS card just confirmed. However, the
point still applies - SRST/CLST have been around for almost 25 years and
I doubt anyone is still running ES 9000 boxes.
B-tree?
-- gil
> From: "Dan Skomsky, PSTI" <Poodl...@sbcglobal.net>
> Sent: Monday, 16 January 2012 11:49 PM
>
>> One Assembler trick I have seen in speeding up scanning loops was to use a
>> CLI instruction to check the first byte of a string and then only doing the
>> CLC/CLCL if the CLI matches. This trick even works if doing a binary
>> search.
>
On the average (FSVO), how does this compare with Boyer-Moore?
I've seen it suggested to use TRT for the first character, then CLC for
the rest. Works much better for strings beginning with "Z"
than for strings beginning with " ".
> Marginal savings, I think, compared to EX/CLC or CLCL,
> for the reason that both CLC and CLCL give up after examining the
> first character, should they be unequal.
>
> Might be more fruitful to compare the length of the key with that of an element first,
> and then carry out the compare should those lengths be equal.
>
Gives you "=", but not "<" or ">", so precludes binary search.
What was the statement of the problem, anyway?
CDC 3600/3800 had a "Modify following instruction" instruction
that met much of the requirement for EX. And pipelining was of
little import in that era.
-- gil
I have paying customers on 9672s, MP2000, MP3000, etc.
VSE, not z/OS.
Tony Thigpen
Performance is one of the strongest reasons for coding in assembler, and
this discussion characterizes some of the low-hanging fruit available for
the attainment thereof.
The timing differences can be quite significant when the code in question
is embedded in a routine that is executed 100 billion times.
Kerry Tenberg
Austin, Tx
On Thu, Jan 12, 2012 at 1:23 PM, John Gilmore <johnwgil...@gmail.com>wrote:
> My own experience has been much more mixed, but I'd like to accept, in
> order to address what I take to be a more important issue, that Peter
> Farley is right when he says
>
> <begin snippet>
> My prior experiences in replacing MVCL/CLCL's with multiple MVC/CLC's
> and even MVC/CLC loops for "small" areas (FSVO "small") is that
> MVCL/CLCL loses almost every time.
> </end snippet>
>
> Preoccupation with these issues is, at best, counter-productive. The
> MVCLE is logically simpler and should be used unless one knows that
> one has only some fixed, small number n << 256 bytes to move.
>
> No one has ever claimed that the timing differences here are large,
> significant ones; and the continuing preoccupation here with
> suboptimizing of this sort is, I think, evidence of a pervasive
> malaise, a retreat into the familiar that precludes consideration of
> more, much more, important design issues.
>
> OK, I hope I'm not becoming wearisome with my yammering. But I am not too busy right now. And I still really like and
> respect the z architecture (despite its horrendous price).
>
> I ask about the CPU cost of an EX because that same program that I'm working on uses the EX a fair amount to move
> "variable length" strings into a blank-initialized area for reporting purposes. Instead of EX of an MVC, I could use
> MVCL or MVCLE.
As the task is to move stuff to a buffer for reporting purposes,
the CPU time will be negligible compared to the I/O time.
Confucius say: If it works, don't fix it.
> CDC 3600/3800 had a "Modify following instruction" instruction
The S/360 and subsequent machines have one like that also.
In the case of MVC/CLC instructions :-
stc 1,*+5
mvc a(0),b
can be useful.
EX does more than just "insert length" into SS instructions.
The ability to OR bits from the nominated register into the second
byte of the subject instruction is probably rarely exploited
in the case of SS instructions, but can be used to effect with RX instructions,
where you might want to retain, say, the existing index field in the
subject instruction, yet supply bits for the register field of the
subject instruction.
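For example (a sketch with made-up names): to store from a register whose number is only known at run time, while keeping the index field that was assembled into the subject instruction, the register number can be shifted into the high nibble of the low-order byte before the EX:
         LR    R1,R2               R2 holds the number of the register to store
         SLL   R1,4                into the high nibble of the low-order byte
         EX    R1,STREG            OR it into the R field; the X field is kept
         ...
STREG    ST    0,TABLE(R6)         subject: R field 0 here, index R6 as assembled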
> Having the CLC near the EX helps for cache. I also like to assemble it
> in-line because the right USINGs apply. We noticed that it is
> attractive to run over the CLC (with the length byte 0 as assembled)
> and then EX it behind your back to do the real thing. More attractive
> than branching over the target, if the instruction lets you.
A convenient place for the subject instruction is immediately after
a B instruction, thus avoiding the need to execute CLC or MVC twice.
My experience was that executing the MVC or CLC twice (first with
length 0) is better than branching over it. So:
X CLC ONE(0),TWO
EX Rx,X
But it may very well be that current CPUs look sufficiently over the
branch that one could
B Y
X CLC ONE(0),TWO
Y EX Rx,X
Obviously I do not wish to make this kind of decision at each
instance. But once you find this in the deep bowels of a heavy loop, it
is worth thinking about it and putting the optimal one in my INLINEX macro
that does the work:
INLINEX Rx,CLC,ONE(0),TWO
Rob
>> My experience was that executing the MVC or CLC twice (first with
>> length 0) is better than branching over it.
I doubt that doing something little and then full is faster than doing
it full the first time....
If you observed a major difference, I do suspect that it is because
the first execution triggered a page-in (or a swap or a
steal... whatever).
I do NOT like this "inline" technique at all. Also: it makes coding
baseless (only base(s) for data) hard if EXRL is not available. Yes, I
have heard of (and do use) LOCTR and various other techniques to do it
anyway.
> I do NOT like this "inline" technique at all. Also: it makes coding
> baseless (only base(s) for data) hard if EXRL is not available. Yes, I
> have heard of (and do use) LOCTR and various other techniques to do it
> anyway.
I thought that putting instructions between the data was considered
evil practice. But I merely assumed it applied to the target of EX as
well.
When reading the code, I find it breaks the line of thought when I
have to go look for the exact instruction that's targeted by EX. And I
have been bitten a few times because USINGs were different at the EX
and where the target was placed.
This is what I see as abstraction. The details you mention are done
inside the macro and don't affect my source code. My macro even knows
about the instructions that are safe to execute twice, as we discussed.
If I were to go baseless, that would be resolved entirely in my INLINEX
macro (the branch as well as putting the target instruction in the
constants area with the right LOCTR settings). And with *no* branches
coded in my source, the macros are all it takes to go baseless...
Rob
The USING-issue is a strong argument in favor of this: I juggle around USINGs a lot and it is a pain (and error-prone) to set up the same USINGs for a single instruction that needs to be EXecuted.
We use the HLASM Toolkit Structured Programming Macros which means that we can't easily insert an instruction 'after' a Jump instruction. Almost all the Jump instructions are generated by the SPM macros.
Fred!
Right, I'm more and more tempted to drop all USINGs at the end of
subroutines and explicitly state which ones apply upon entry.
Since I have nested subroutines with static scope, it's even more
appropriate. Within the routine itself, I try to have the USING and
DROP at the same nesting level.
> We use the HLASM Toolkit Structured Programming Macros which means that we can't easily insert an instruction 'after' a Jump instruction. Almost all the Jump instructions are generated by the SPM macros.
What's a Jump instruction ;-) Really, mine are only generated by the
structured programming macros (including one to exit the routine as
part of error handling).
Rob
I did a very Q&D test and...
J *+10
CLC 0(1,R10),8(R10)
EXRL R1,*-6
...is about 25% faster than...
CLC 0(1,R10),8(R10)
EXRL R1,*-6
So on a z196 the jump seems to be faster than the compare...
> From: "Paul Gilmartin" <PaulGB...@aim.com>
> Sent: Tuesday, 17 January 2012 7:33 AM
>
>> CDC 3600/3800 had a "Modify following instruction" instruction
>
> The S/360 and subsequent machines have one like that also.
> In the case of MVC/CLC instructions :-
>
> stc 1,*+5
> mvc a(0),b
>
> can be useful.
>
No, no, no, no, no!:
o RENT!? (How does this affect instruction pipelining?)
o The CDC instruction didn't modify the storage; it
modified the execution of the instruction after it had
been fetched from storage, even as EX doesn't modify
its target instruction in storage.
-- gil
>> Having the CLC near the EX helps for cache. I also like to assemble it
>> in-line because the right USINGs apply. We noticed that it is
>> attractive to run over the CLC (with the length byte 0 as assembled)
>> and then EX it behind your back to do the real thing. More attractive
>> than branching over the target, if the instruction lets you.
>
> The USING-issue is a strong argument in favor of this: I juggle around USINGs a lot and it is a pain (and error-prone) to set up the same USINGs for a single instruction that needs to be EXecuted.
>
An alternative is LOCTR (possibly in a macro). (With possible
cache miss consequences. I forget; is the target of EX treated
as a data access or as an instruction access for cache management?)
But "instruction" should be a data type supported for use in
literals: "EX Rx,=INST'CLC ...'. Some programmers have
kludged this with ugly hex constants; the facility should be
made orderly.
> We use the HLASM Toolkit Structured Programming Macros which means that we can't easily insert an instruction 'after' a Jump instruction. Almost all the Jump instructions are generated by the SPM macros.
-- gil
o Compiler developers have done the research on instruction
timings and know better than most end users what sequences
fit the pipelines optimally.
o Compiled code can be re-optimized for a new generation of
hardware simply by recompiling.
o Interpreters can dynamically recompile based on statistical
profiles evaluated at the actual time of execution.
-- gil
The 256-byte cache line containing the target instruction is loaded into I-cache.
Notoriously NOT for the IBM COBOL compilers. I plead ignorance for the PL/1 and Fortran compilers, but the C/C++ compiler is the only current compiler in my personal experience that actually exhibits a knowledge of instruction timings and latency and AGI interrupts, etc., for current and recent pipelined "z" processors.
IMHO, COBOL-generated code is so bad that if I were on the COBOL code-generation development team I would be embarrassed to admit it.
> o Compiled code can be re-optimized for a new generation of
> hardware simply by recompiling.
>
> o Interpreters can dynamically recompile based on statistical
> profiles evaluated at the actual time of execution.
>
The COBOL code generator appears to be the beneficiary of benign neglect.
The net gain just from using the halfword immediate instructions and
relative branching would be significant.
But as noted elsewhere here, there are a lot of very old machines still active.
I suspect that if the source to the code generator were made available with an
NDA, there would be a group who would improve it in the general interest.
Tony Thigpen
My usual "whine". I would likely abandon HLASM for the most part. IF I had a C license. But, one makes due with what one has.
I have been told that part of the reason for the "horrible" code emitted by the COBOL compiler is to guarantee 100% conformance to the ANSI standards. I don't know this for a fact. But COBOL was designed around __decimal__ arithmetic. And getting proper truncations and overflow notifications. So there may be something to this. And let's not even talk about the abomination of the PERFORM verb. Implementing that is a royal PITA, from what I can tell. Mainly because the end of any paragraph may, or may not, return to some other point in the code. Sometimes it "returns" to a PERFORM, and other times it "falls through" to the next paragraph. Oh my aching compiler.
The PL/X compiler also generates 'poor' code. (It's one reason it's been
difficult to convince the 'powers that be' to establish a new Architectural
Level Set for z/OS.)
IBM has hinted that they plan to address these compiler deficiencies--when is
anybody's guess. But, at least they admit there's a problem. That's the first
step...
<begin snippet>
. . . I plead ignorance for the PL/1 and Fortran compilers, but the
C/C++ compiler is the only current compiler in my personal experience
that actually exhibits a knowledge of instruction timings and latency
and AGI interrupts, etc., for current and recent pipelined "z"
processors.
</end snippet>
The IBM optimizing machinery for C/C++ and PL/I is now shared, the
same for both compilers; and the effects of this sharing have been
mixed, mostly good and some few of them very bad.
--jg
--
> IBM has hinted that they plan to address these compiler deficiencies--when is
> anybody's guess. But, at least they admit there's a problem. That's the first
> step...
On Jan 17, 2012, at 10:11, John Gilmore wrote:
>
> The IBM optimizing machinery for C/C++ and PL/I is now shared, the
> same for both compilers; and the effects of this sharing have been
> mixed, mostly good and some few of them very bad.
Sounds like an opportunity for PL/X to join the party.
How's Metal/C?
-- gil
Pretty good, in my limited investigations. When the highest level of optimization is turned on, it can be rather tricky to follow the generated assembler code even knowing precisely what the C code was intended to do. I haven't yet measured the speed of the generated code in any meaningful way for a non-trivial program, but I am seriously impressed by the optimizations that are done and by the compiler's ability to "tune" the instruction set used so that code can be generated that will run on "z" machines from a given architecture level upwards.
Peter
Some of these deficiencies of course reflect its history. PL/S had a
notoriously fertile generate facility that permitted assembly language
to be dropped into source programs. This facility was 'abused' by
some IBM and contractor programmers to write routines that were, in
effect, assembly-language cakes with some PL/X powdered sugar
sprinkled on them.
The cross-platform emphasis in PL/X discourages and was intended to
discourage this sort of thing; but optimizing machinery, less
important in PL/S because resort to assembly language was possible,
does not appear to be much used by PL/X. (Even such obvious things
as moving common subexpressions out of loops and suppressing redundant
subscript arithmetic don't seem to happen.)
IBM knows and has always known how to fix this problem; what has
been|is lacking is the will to do it.
--jg
--
> The cross-platform emphasis in PL/X discourages and was intended to
> discourage this sort of thing; but optimizing machinery, less
> important in PL/S because resort to assemb ly language was possible,
> does not appear to be much used by PL/X. (Even such obvious things
> as moving common subexpressions out of loops and suppressing redundant
> subscript arithmetic don't seem to happen.)
>
> IBM knows and has always known how to fix this problem; what has been|is lacking is the will to do it.
IBM almost a decade ago fixed the problem where it really counts - in
its millicode. That is generated by the GCC suite, with a
private-to-IBM PL8 language front end, and a published(?) middle-end
optimizer and back end code generator. One bumps into the odd
non-millicode module compiled by PL8's CMS-hosted predecessor
implementation, PL.8, but porting of the whole GCC and Linuxy
infrastructure, including ELF object format and other baggage, into
the likes of z/OS would presumably be required before the current PL/X
could be replaced and its code optimized. An outsider can only imagine
the internal geopolitical goings on with regard to all this.
Tony H.
> I did a very Q&D test and...
>
> J *+10
> CLC 0(1,R10),8(R10)
> EXRL R1,*-6
>
> ...is about 25% faster than...
>
> CLC 0(1,R10),8(R10)
> EXRL R1,*-6
>
> So on a z196 the jump seems to be faster than the compare...
This seems unsurprising. Even on much older processors, an
unconditional branch has been predicted as "taken", and so the
instruction stream fetching will be at the EXRL long before execution
gets to the J. If R1 was set some instructions earlier, the EXRL and
target CLC can be set up and ready to go way in advance.
Tony H.
So, this would seem to point towards putting the target near the
instruction, if you can, or at least no more than 244 bytes away (worst
case, maybe a bit less), or possibly grouping frequently executed
targets together using Martin T.'s favorite LOCTR assembler instruction
and hoping that the line stays in cache.
Thoughts?
Later,
Ray