How bad is the EX instruction?

McKown, John

Jan 12, 2012, 10:32:23 AM
to ASSEMBL...@listserv.uga.edu
OK, I hope I'm not becoming wearisome with my yammering. But I am not too busy right now. And I still really like and respect the z architecture (despite its horrendous price).

I ask about the CPU cost of an EX because that same program that I'm working on uses the EX a fair amount to move "variable length" strings into a blank-initialized area for reporting purposes. Instead of EX of an MVC, I could use MVCL or MVCLE. But many have said that EX of an MVC is less overhead than MVCL in many cases. Especially since I know that my length is always no more than 255 characters. I check and report an error if the length is 256 or more.
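
For readers who want the idiom in front of them, the EX of an MVC into a blank-initialized area can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code from the program being discussed: LINE (a print field of 2 to 257 bytes), LEN (a halfword length already validated to 1..255), the labels MOVEIT and DONE, and the use of R1 and R2 are all hypothetical names.

```hlasm
* Sketch only: LINE, LEN, MOVEIT, DONE and the register choices
* are assumed names, not taken from the original program.
* On entry: R2 -> source string, LEN = halfword length (1..255)
         MVI   LINE,C' '              Blank the first byte
         MVC   LINE+1(L'LINE-1),LINE  Ripple the blank through LINE
         LH    R1,LEN                 R1 = number of bytes to move
         LTR   R1,R1                  Anything to move?
         BNP   DONE                   No, leave LINE blank
         BCTR  R1,0                   Machine length is length minus 1
         EX    R1,MOVEIT              Run the MVC with R1's low byte
         B     DONE
MOVEIT   MVC   LINE(0),0(R2)          EX target: length assembled as 0
DONE     DS    0H
```

EX ORs the low-order byte of R1 into the length byte of the MVC before executing it, which is why the target is assembled with a length of zero and why the count is decremented first.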

As an aside, to whomever it was who recommended the TROO as a way to move bytes from an input area to an output area, while testing for "unprintable" bytes - thanks! It made my code much easier to write and understand. I was going to use a TRT and an EX'd MVC in a loop. A TROO in a loop was super easy to code.


--
John McKown
Systems Engineer IV
IT

Administrative Services Group

HealthMarkets(r)

9151 Boulevard 26 * N. Richland Hills * TX 76010
(817) 255-3225 phone *
john....@healthmarkets.com * www.HealthMarkets.com

Martin Truebner

Jan 12, 2012, 11:15:43 AM
to ASSEMBL...@listserv.uga.edu
John,

>> move "variable length" strings into a blank-initialized area for
>> reporting purposes.

How about using MVCL and using the padding function to fill in the
blanks....
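
The MVCL padding approach can be sketched as follows. This is only an illustration under assumptions: LINE, LEN and the register pairs R0/R1 and R14/R15 are hypothetical choices, R2 is assumed to point at the source string, and the literal needs a literal pool (LTORG) within reach.

```hlasm
* Sketch only: LINE, LEN and the register pairs are assumed names.
* On entry: R2 -> source string, LEN = halfword source length
         LA    R0,LINE                Destination address
         LA    R1,L'LINE              Destination length
         LR    R14,R2                 Source address
         LH    R15,LEN                Source length
         ICM   R15,B'1000',=C' '      Pad byte into high byte of R15
         MVCL  R0,R14                 Move, then blank-fill the rest
```

When the source is shorter than the destination, MVCL fills the remainder of the destination with the pad byte, so no separate blanking of LINE is needed. All four registers are changed by the instruction.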

--
Martin

Pi_cap_CPU - all you ever need around MWLC/SCRT/CMT in z/VSE
more at http://www.picapcpu.de

Fred van der Windt

Jan 12, 2012, 11:21:32 AM
to ASSEMBL...@listserv.uga.edu
You're welcome. :-)

Fred!

Sent from my iPad

On Jan 12, 2012, at 16:33, "McKown, John" <John....@healthmarkets.com> wrote:

> As an aside, to whomever it was who recommended the TROO as a way to move bytes from an input area to an output area, while testing for "unprintable" bytes - thanks! It made my code much easier to write and understand. I was going to use a TRT and an EX'd MVC in a loop. A TROO in a loop was super easy to code.

Steve Comstock

Jan 12, 2012, 11:22:37 AM
to ASSEMBL...@listserv.uga.edu

What about MVCOS?

--

Kind regards,

-Steve Comstock
The Trainer's Friend, Inc.

303-355-2752
http://www.trainersfriend.com

* To get a good Return on your Investment, first make an investment!
+ Training your people is an excellent investment

* Try our tool for calculating your Return On Investment
for training dollars at
http://www.trainersfriend.com/ROI/roi.html

John Gilmore

Jan 12, 2012, 11:31:18 AM
to ASSEMBL...@listserv.uga.edu
Instead, why not use MVCLE with the padding byte?
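
A minimal sketch of the MVCLE variant, assuming 24- or 31-bit addressing mode and hypothetical names throughout (LINE, LEN, the label PAD, the pairs R4/R5 and R6/R7, and R2 pointing at the source string):

```hlasm
* Sketch only: LINE, LEN, PAD and the register pairs are assumed.
* On entry: R2 -> source string, LEN = halfword source length
         LA    R4,LINE                Destination address
         LA    R5,L'LINE              Destination length
         LR    R6,R2                  Source address
         LH    R7,LEN                 Source length
PAD      MVCLE R4,R6,C' '             Pad byte comes from operand 3
         BO    PAD                    CC3: interrupted, resume the move
```

Unlike MVCL, MVCLE takes its pad byte from the low-order byte of the third-operand address, and it may end early with condition code 3, so the branch back is part of the idiom.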

--jg


--
John Gilmore, Ashland, MA 01721 - USA

Farley, Peter x23353

Jan 12, 2012, 12:59:27 PM
to ASSEMBL...@listserv.uga.edu
EX is indeed expensive, but my guess (untested) is that an EXecuted MVC for small lengths (not only under 256 but even less) is probably still more efficient than MVCL for those lengths, and *definitely* more efficient than MVCLE. My prior experiences in replacing MVCL/CLCL's with multiple MVC/CLC's and even MVC/CLC loops for "small" areas (FSVO "small") is that MVCL/CLCL loses almost every time.

Somebody oughtta test that theory and publish a table of "break" points (by machine type) at which the more "complex" instructions become "better", and below which you should stick with the "old-fashioned" way to do it. I'd do it but there seems to be a very serious lack of available round tuits for "interesting" work in my life lately. There is far too much "required" stuff in the queue ahead of "interesting".

Just my USD$0.02 worth.

Peter

John Gilmore

Jan 12, 2012, 2:23:12 PM
to ASSEMBL...@listserv.uga.edu
My own experience has been much more mixed, but I'd like to accept, in
order to address what I take to be a more important issue, that Peter
Farley is right when he says

<begin snippet>


My prior experiences in replacing MVCL/CLCL's with multiple MVC/CLC's
and even MVC/CLC loops for "small" areas (FSVO "small") is that
MVCL/CLCL loses almost every time.

</end snippet>

Preoccupation with these issues is, at best, counter-productive. The
MVCLE is logically simpler and should be used unless one knows that
one has only some fixed, small number n << 256 bytes to move.

No one has ever claimed that the timing differences here are large,
significant ones; and the continuing preoccupation here with
suboptimizing of this sort is, I think, evidence of a pervasive
malaise, a retreat into the familiar that precludes consideration of
more, much more, important design issues.

Rob van der Heij

Jan 12, 2012, 6:27:44 PM
to ASSEMBL...@listserv.uga.edu
On Thu, Jan 12, 2012 at 6:59 PM, Farley, Peter x23353
<Peter....@broadridge.com> wrote:

> EX is indeed expensive, but my guess (untested) is that an EXecuted MVC for small lengths (not only under 256 but even less) is probably still more efficient than MVCL for those lengths, and *definitely* more efficient than MVCLE. My prior experiences in replacing MVCL/CLCL's with multiple MVC/CLC's and even MVC/CLC loops for "small" areas (FSVO "small") is that MVCL/CLCL loses almost every time.

I have a regular need for a CLC over 0..8 byte (a kind of "pseudo
wildcard" pattern where a trailing * specifies matching of the
preceding substring).

Since it's already Friday here, this is what I came up with after some
experiments (forgive my SPM accent, I trust the intentions are clear).
I'm open for ideas...

LA R1,OP2+L'OP2 Beyond string in case no spaces
TRT OP2,SPACE Find first ' ' in pattern
CR R1,R3
COND NOTEQUAL,DECR,R1 Point at last non-blank, if any
CLI 0(R1),C'*'
IF NOTEQUAL
CLC OP1,OP2 No '*' - just compare them
ELSE ,
SR R1,R3 Compute length before '*'
IF NOTZERO
DECR R1
INLINEX R1,CLC,OP1(0),OP2
FI ,
FI ,

The INLINEX is my macro and generates (in case of the CLC) this
INLINEX R1,CLC,OP1(0),OP2
+INLINEX0338 CLC OP1(0),OP2
+INLINEY0338 EX R1,INLINEX0338

From what we could tell, the CLC is still warm when EX hits it.

Slower alternative was a computed branch to do the 9 cases. With our
usage mix, the extra test to use a plain CLC over 8 byte seems to pay
off.

Litwinowich, David

Jan 12, 2012, 6:28:22 PM
to ASSEMBL...@listserv.uga.edu
I am currently out of the office, returning Monday, Oct. 24.

Hall, Keven

Jan 12, 2012, 7:05:56 PM
to ASSEMBL...@listserv.uga.edu
If you're looking to reduce CPU usage you might want to optimize the TRT
the heck out of the equation. Talk about expensive! [augment with
imagined or actual sound of cash register "cah-ching" sound for added
emphasis/effect]

Rob van der Heij

Jan 12, 2012, 8:15:32 PM
to ASSEMBL...@listserv.uga.edu
On Fri, Jan 13, 2012 at 1:05 AM, Hall, Keven <keh...@informatica.com> wrote:

> If you're looking to reduce CPU usage you might want to optimize the TRT
> the heck out of the equation. Talk about expensive! [augment with
> imagined or actual sound of cash register "cah-ching" sound for added
> emphasis/effect]

Ok... but how? Would a loop stepping over the max 8 bytes be wiser
to find the first blank? Another idea I had was to step a 2-byte CLC
with '* ' over the string, but the complexity at the end spoils the
fun.

Guess I never really measured TRT. A variation of this code is used
to search items in a linked list. I obviously moved the TRT out of the
loop and that might have helped make it faster.

Rob

Paul Gilmartin

Jan 12, 2012, 9:50:32 PM
to ASSEMBL...@listserv.uga.edu
On Jan 12, 2012, at 08:32, McKown, John wrote:

> OK, I hope I'm not becoming wearisome with my yammering. But I am not too busy right now. And I still really like and respect the z architecture (despite its horrendous price).
>

Of course, that comes out of not our pocket, but your employers'.
And you've often mentioned how cost-conscious they are. So, then,
why are they not considering converting to Linux rather than to
Windows? OS software would be much cheaper; hardware should
be the same (literally); is this then offset by the cost of middleware
and application software?

> I ask about the CPU cost of an EX because that same program that I'm working on uses the EX a fair amount to move "variable length" strings into a blank-initialized area for reporting purposes. Instead of EX of an MVC, I could use MVCL or MVCLE. But many have said that EX of an MVC is less overhead than MVCL in many cases. Especially since I know that my length is always no more than 255 characters. I check and report an error if the length is 256 or more.
>

If EX is so bad, I wonder about a chain of:

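TM COUNT,128 Is the 128 bit of the length set?
BNO *+18 No, skip the MVC and both LAs
MVC 0(128,DEST),0(SOURCE) Yes, move 128 bytes
LA DEST,128(,DEST) Advance destination pointer
LA SOURCE,128(,SOURCE) Advance source pointer

TM COUNT,64 Is the 64 bit of the length set?
BNO *+18 No, skip the MVC and both LAs
MVC 0(64,DEST),0(SOURCE) Yes, move 64 bytes
LA DEST,64(,DEST)
LA SOURCE,64(,SOURCE)

...
TM COUNT,1 Is the 1 bit of the length set?
BNO *+10 No, skip the MVC
MVC 0(1,DEST),0(SOURCE) Yes, move the last byte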

(Or wrap it in a loop)

(I haven't been an assembler programmer for three decades; fill
in the blanks.)

-- gil

Paul Gilmartin

Jan 12, 2012, 9:54:46 PM
to ASSEMBL...@listserv.uga.edu
On Jan 12, 2012, at 17:05, Hall, Keven wrote:

> If you're looking to reduce CPU usage you might want to optimize the TRT
> the heck out of the equation. Talk about expensive! [augment with
> imagined or actual sound of cash register "cah-ching" sound for added
> emphasis/effect]
>

Boyer-Moore? I guess that's no use for individual characters.

-- gil

Gerhard Postpischil

Jan 12, 2012, 10:55:16 PM
to ASSEMBL...@listserv.uga.edu
On 1/12/2012 2:23 PM, John Gilmore wrote:
> No one has ever claimed that the timing differences here are large,
> significant ones; and the continuing preoccupation here with
> suboptimizing of this sort is, I think, evidence of a pervasive
> malaise, a retreat into the familiar that precludes consideration of
> more, much more, important design issues.

In general I tend to agree with this, but I've worked or
consulted at installations that either had problems completing
overnight jobs in their assigned batch window, or just
processing large amounts of data.

While I haven't tried this on very current machines, on older
ones EX added 40 to 50% to the instruction time (EX overhead on
some Amdahl machines was greater); 4 MVCs of 256 bytes were
about the same as a 1K MVCL; and 5 CLI/BE were about the same as
one TRT/B *+4(R2). In each case it paid to identify the most
frequently executed code and look for improvements.

Gerhard Postpischil
Bradford, VT

Hall, Keven

Jan 12, 2012, 11:18:54 PM
to ASSEMBL...@listserv.uga.edu
When looking for a specific byte a CLI loop is my weapon of choice.
Unless I'm dealing with frequently executed code I'm happy to simply
embrace the TRT instruction along with the other unabashedly
CISC-to-the-max members of the z/Architecture instruction set.
Ultimately you just have to experiment because you're optimizing for a
black-box and the only instrumentation available is CPU time used.
Instruction order, branch points and branch target locations can make
big differences so it's worth trying various combinations to see what
works best (putting the first instruction of a tight loop on a
doubleword boundary, for example).

Keven

Martin Truebner

Jan 13, 2012, 2:13:01 AM
to ASSEMBL...@listserv.uga.edu
Rob,

have you tried SRST?

I had a hard time getting used to SRST's way of using/wanting the
registers - but then... It does an excellent job on searching for one
(and only one) character in a string.

Martin Truebner

Jan 13, 2012, 2:33:17 AM
to ASSEMBL...@listserv.uga.edu
Rob,

here is a simple sample for SRST:

L R15,SAVE point in string for cont
LA R14,256(R15)
LA R0,C'/'
SRST R14,R15
* R14 is now on the first /
LA R15,1(R14)
SRST R14,R15
* R14 is now on the second /

Two hints:

1.) SRST should be followed by a JO *-4, but POP says min
length scanned is 255. So it can be omitted in certain cases.

2.) A found condition is indicated by a L (L_located).
A not found condition is indicated by a H (not L_ocated) -
so an extra JH NOT_FOUND might be useful (or JL LOCATED_CHAR).
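
Putting the two hints together, a bounded scan with all three condition codes handled might look like the following. It is a sketch under assumptions: BUF (a 256-byte area), the labels SCAN and NOTFOUND, and the pair R4/R5 are hypothetical, and NOTFOUND is assumed to be defined elsewhere.

```hlasm
* Sketch only: BUF, SCAN, NOTFOUND and R4/R5 are assumed names.
         LA    R0,C'/'                Search character, high bits zero
         LA    R4,BUF                 R4 = first byte to examine
         LA    R5,256(,R4)            R5 = first byte past the area
SCAN     SRST  R5,R4                  Search from R4 up to R5
         BO    SCAN                   CC3: partial scan, R4 advanced
         BH    NOTFOUND               CC2: end reached, no match
*        CC1 falls through here with R5 -> the '/' that was found
```

On a hit the first-operand register is overwritten with the address of the character, so it no longer holds the end address. A second search therefore needs the end address reloaded as well as the start address moved past the hit.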

Rob van der Heij

Jan 13, 2012, 5:13:35 AM
to ASSEMBL...@listserv.uga.edu

That makes sense. It sounds like even if you can afford to MVC the
entire buffer (because you know there is room in the destination and
you're not near the edge of the source) then it might make sense to EX
MVC if you know the actual size and it's less than half on average.

For short EX MVC's the burden of getting stuff in the right registers
makes MVCL less interesting.

My preoccupation with this is mostly on Friday ;-) And I guess I
should not write real code on Friday 13th anyway...
The EX CLC is in fact in a loop scanning a linked list for the right
entry among 100-200 elements. My big savings were moving the TRT etc
out of the loop. I was tempted to also take the decision between CLC
and EX CLC out of the loop, but didn't for ease of maintenance.

Rob

Rob van der Heij

Jan 13, 2012, 5:18:40 AM
to ASSEMBL...@listserv.uga.edu
On Fri, Jan 13, 2012 at 8:13 AM, Martin Truebner <Mar...@pi-sysprog.de> wrote:
> Rob,
>
> have you tried SRST?
>
> I had a hard time getting used to SRST's way of using/wanting the
> registers - but then... It does an excellent job on searching for one
> (and only one) character in a string.

Martin,

Haven't, and probably should for my own education. We restrict our
products to older architecture levels for a number of good reasons.

Rob

John Gilmore

Jan 13, 2012, 8:17:43 AM
to ASSEMBL...@listserv.uga.edu
Gerhard Postpischil wrote:

<begin snippet>


In general I tend to agree with this, but I've worked or consulted at
installations that either had problems completing overnight jobs in
their assigned batch window, or just processing large amounts of data.

</end snippet>

I value GP's concurrence. Let me add, however, that 1) in my
experience batch-window problems are always i/o-related; and 2) the
unwashed always attack them in the wrong way, devoting resources to
"optimizing" instruction sequences that, even if it had been possible
to reduce their CP consumption to zero, would have left the
batch-window problem unresolved.

These applications, like most commercial batch ones, were i/o-bound,
and their resolution required the use of overlapped, asynchronous i/o,
which, for those who know how to do it, is not difficult. What it
was/is in most of these shops was/is, quite literally, unthinkable.
(The RESIDENCE time of a classical MFU can always be cut by a factor
of four or more using asynchronous i/o.)

Tom Marchant

Jan 13, 2012, 8:24:44 AM
to ASSEMBL...@listserv.uga.edu
On Fri, 13 Jan 2012 11:18:40 +0100, Rob van der Heij wrote:

>On Fri, Jan 13, 2012 at 8:13 AM, Martin Truebner wrote:
>> have you tried SRST?


>
>Haven't, and probably should for my own education. We restrict our
>products to older architecture levels for a number of good reasons.

How old?
SRST was first documented in the second edition of the ESA POO.
That is much older than the Relative and Immediate instructions.

--
Tom Marchant

Martin Packer

Jan 13, 2012, 9:13:14 AM
to ASSEMBL...@listserv.uga.edu
Not always I/O-related. Sometimes CPU-related but where SQL tuning would
be more appropriate than application code instruction cycle tuning.

Cheers, Martin

Martin Packer,
Mainframe Performance Consultant, zChampion
Worldwide Banking Center of Excellence, IBM

+44-7802-245-584

email: martin...@uk.ibm.com

Twitter / Facebook IDs: MartinPacker
Blog:
https://www.ibm.com/developerworks/mydeveloperworks/blogs/MartinPacker

John Gilmore

Jan 13, 2012, 9:39:59 AM
to ASSEMBL...@listserv.uga.edu
Martin Packer's point is, of course, well taken.

Egregiously bad SQL can be the villain. I dealt recently with a
situation in which the SQL builtin function max was used repeatedly to
obtain a next, new 'treaty number' in an insurance application. It
did its job, correctly, by scanning all of the rows of one of the
largest DB2 tables I have encountered over and over and over again,
with predictable consequences.

I believe, however, that, as here, bad SQL is very often bad at bottom
because it triggers too much implicit i/o. DB2's read and write
engines are very good indeed at what they do; but both they and the
sort invoked from DB2 can be asked to do too much gratuitous work.

John Gilmore, Ashland, MA 01721 - USA


--

Andreas Geissbuehler

Jan 13, 2012, 9:49:13 AM
to ASSEMBL...@listserv.uga.edu
>> On 1/12/2012 2:23 PM, John Gilmore wrote:
>> No one has ever claimed that the timing differences here are large,
>> significant ones; and the continuing preoccupation here with
>> suboptimizing of this sort is, I think, evidence of a pervasive
>> malaise, a retreat into the familiar that precludes consideration of
>> more, much more, important design issues.
> on 2012-01-12 22:55 Gerhard Postpischil wrote:
> In general I tend to agree with this, but I've worked or
> consulted at installations that either had problems completing
> overnight jobs in their assigned batch window, or just
> processing large amounts of data.

John, Gerhard, right on! I have been even more radical !

First I attempted to *eliminate* the need to have the code
*THERE* in the first place, in the most often executed path.
On one occasion I added an extra field to the record (row)
to avoid generating a key each time.

Another comes to mind, scrapping a name&address
decompression routine and replacing its loops, bit shifts,
translates, by blanks-truncation and a table of the 65000
most common city and street name "words" on file. To
print an address became lightning fast: follow a chain of
1-byte offsets to the next 3-byte place holder and load the
2-byte index into the table, a few RX (L LA IC), an SLL
to convert the index to an offset, an SR/JNP to detect and
move text between tokens and lastly an EX-MVC combo.

For this elite group here, this post is really OFF-TOPIC.
You worry about picoseconds because your code runs zillion
times per... If you weren't, the perennial EX-topic would fit
in nicely with my TGIF post - let's have a great weekend!

Andreas F. Geissbuehler
AFG Consultants Inc.
http://www.afgc-inc.com/

Andreas Geissbuehler

Jan 13, 2012, 10:06:06 AM
to ASSEMBL...@listserv.uga.edu
Original Message From: "John Gilmore"

> Martin Packer's point is, of course, well taken.

Yours likewise!

I got these kinds of optimization mandates because
of CPU Hours and EXCPs it cost to run some batch
jobs and CICS transactions at my clients' service bureau.

> Egregiously bad SQL can be the villain.

Indeed, and big, impressive gains can be made using ancient
methods, writing no more than a few sort/merge exits :-))

Edward Jaffe

Jan 13, 2012, 11:15:10 AM
to ASSEMBL...@listserv.uga.edu
On 1/12/2012 7:32 AM, McKown, John wrote:
> I ask about the CPU cost of an EX because that same program that I'm working on uses the EX a fair amount to move "variable length" strings into a blank-initialized area for reporting purposes. Instead of EX of an MVC, I could use MVCL or MVCLE. But many have said that EX of an MVC is less overhead than MVCL in many cases. Especially since I know that my length is always no more than 255 characters. I check and report an error if the length is 256 or more.

Relative instruction performance is a moving target. We run benchmarks whenever
we get a new processor so we can understand the trends. Most new hardware
generations build on the microprocessor design of the prior generation, so the
changes tend to be incremental.

Now and then, the microprocessor gets completely redesigned. One such redesign
occurred with the introduction of the z10. I mentioned our observations re:
EXecute and MVCL performance in my "z10 User Experience" at SHARE in Denver.
Check out slide 13 for this information. Thanks again to David Bond for helping
us make sense of our measurements.

http://proceedings.share.org/client_files/SHARE_in_Denver/S2215EJ161728.pdf

--
Edward E Jaffe
Phoenix Software International, Inc
831 Parkview Drive North
El Segundo, CA 90245
310-338-0400 x318
edj...@phoenixsoftware.com
http://www.phoenixsoftware.com/

glen herrmannsfeldt

Jan 13, 2012, 9:29:55 PM
to ASSEMBL...@listserv.uga.edu
> For short EX MVC's the burden of getting stuff in the right registers
> makes MVCL less interesting.

As I understand it, and doesn't seem to have been mentioned, the big
effect on EX has to do with caching. I believe that it should be
near code and not data. (That is, it goes into the instruction cache
instead of the data cache.) Then again, I could have that backwards.

Other than cache, it should be plenty fast enough.

> My preoccupation with this is mostly on Friday ;-) And I guess I
> should not write real code on Friday 13th anyway...
> The EX CLC is in fact in a loop scanning a linked list for the right
> entry among 100-200 elements. My big savings were moving the TRT etc
> out of the loop. I was tempted to also take the decision between CLC
> and EX CLC out of the loop, but didn't for ease of maintenance.

I usually use a hash table. Especially if speed is important.

You could also do binary search, which will find the right entry
with about log(n) comparisons.

-- glen

Tom Marchant

Jan 16, 2012, 7:35:14 AM
to ASSEMBL...@listserv.uga.edu
On Fri, 13 Jan 2012 18:29:55 -0800, glen herrmannsfeldt wrote:

>> The EX CLC is in fact in a loop scanning a linked list for the right
>> entry among 100-200 elements.
>

>You could also do binary search, which will find the right entry
>with about log(n) comparisons.

How do you do a binary search on a linked list?

--
Tom Marchant

Dan Skomsky, PSTI

Jan 16, 2012, 7:49:54 AM
to ASSEMBL...@listserv.uga.edu
One Assembler trick I have seen in speeding up scanning loops was to use a
CLI instruction to check the first byte of a string and then only doing the
CLC/CLCL if the CLI matches. This trick even works if doing a binary
search.

Rob van der Heij

Jan 16, 2012, 8:50:58 AM
to ASSEMBL...@listserv.uga.edu
On Sat, Jan 14, 2012 at 3:29 AM, glen herrmannsfeldt
<g...@ugcs.caltech.edu> wrote:

> I usually use a hash table. Especially if speed is important.
>
> You could also do binary search, which will find the right entry
> with about log(n) comparisons.

Yeah, and I prefer to stop the earth rotation when I take a sun bath... ;-)

Re-reading, I see I confused you with "the right entry" where it
actually may be more than one so I have to walk the entire list. In
fact, each entry has up to 5 possible fields to check like this. In
this case the change frequency of the data is more than the reference
rate, so on average I would have to build the hash table or search
tree on each reference. And I don't really have a context where I
could keep it.

But if you have an efficient hash function handy for 200 strings of
6-8 (uppercase) characters, I'm game. My ad-hoc tests were a bit
disappointing in rehash.

Rob

robin

Jan 16, 2012, 9:53:45 AM
to ASSEMBL...@listserv.uga.edu
From: "Dan Skomsky, PSTI" <Poodl...@sbcglobal.net>
Sent: Monday, 16 January 2012 11:49 PM

> One Assembler trick I have seen in speeding up scanning loops was to use a
> CLI instruction to check the first byte of a string and then only doing the
> CLC/CLCL if the CLI matches. This trick even works if doing a binary
> search.

Marginal savings, I think, compared to EX/CLC or CLCL,
for the reason that both CLC and CLCL give up after examining the
first character, should they be unequal.

Might be more fruitful to compare length of the key with that of an element first,
and then carrying out the compare should those lengths be equal.

Tom Marchant

Jan 16, 2012, 10:13:37 AM
to ASSEMBL...@listserv.uga.edu
On Mon, 16 Jan 2012 06:49:54 -0600, Dan Skomsky, PSTI wrote:

>One Assembler trick I have seen in speeding up scanning loops was to use a
>CLI instruction to check the first byte of a string and then only doing the
>CLC/CLCL if the CLI matches. This trick even works if doing a binary
>search.

I don't know if the cost of EX is high enough that you would benefit
from doing a one byte CLC before an EX of a CLC. I don't see how a CLI
will help you though.
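
The one-byte CLC in front of an EX of a CLC can be sketched as below. This is an illustration only, with hypothetical names: KEY, the labels COMPARE, NEXT and FOUND (the latter two assumed to exist in the surrounding loop), R3 pointing at the current list entry, and R1 holding the key length minus one.

```hlasm
* Sketch only: KEY, COMPARE, NEXT and FOUND are assumed names.
* On entry: R3 -> current list entry, R1 = key length minus 1
         CLC   0(1,R3),KEY            Do the first bytes match?
         BNE   NEXT                   No, skip the EX for this entry
         EX    R1,COMPARE             Yes, compare all R1+1 bytes
         BE    FOUND                  Full match
         B     NEXT                   Mismatch further along
COMPARE  CLC   0(0,R3),KEY            EX target: length supplied by R1
```

CLI compares a storage byte against an immediate value fixed at assembly time, so a key that varies at run time needs the one-byte CLC shown here. Whether the extra compare and branch pay for themselves depends on how often the first byte already differs, which is something only a measurement on the target machine can settle.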

--
Tom Marchant

Rob van der Heij

Jan 16, 2012, 10:37:24 AM
to ASSEMBL...@listserv.uga.edu

Having the CLC near the EX helps for cache. I also like to assemble it
in-line because the right USINGs apply. We noticed that it is
attractive to run over the CLC (with the length byte 0 as assembled)
and then EX behind your back to do the real thing. More attractive
than branch over the target if the instruction lets you.

I doubt whether a branch between the CLC and the EX would be an
advantage. Depending on how often the comparison already fails on the
first byte, you trade an untaken branch against an EX CLC that fails
on the first byte. Guess I should try that some Friday afternoon...

Rob

Ray Mullins

Jan 16, 2012, 1:41:17 PM
to ASSEMBL...@listserv.uga.edu

Ya-but...

SRST came in sometime during the late System/370 era. I have a yellow
book with SRST and CLST defined.

(I've been burned only once by a non-System/370 instruction (ICM), and
that was on a plug-compatible that a Brazilian customer was running in
the early 1990s. I have burned myself on using the wrong ARCH option in
a C compile when a customer was still running a z900 (ARCH(5)) and I had
accidentally left it set to ARCH(6).)

Later,
Ray


--
M. Ray Mullins
Roseville, CA, USA
http://www.catherdersoftware.com/

German is essentially a form of assembly language consisting entirely of
far calls heavily accented with throaty guttural sounds. ---ilvi
French is essentially German with messed-up pronunciation and spelling.
--Robert B Wilson
English is essentially French converted to 7-bit ASCII. ---Christophe
Pierret [for Alain LaBonté]

Ray Mullins

Jan 16, 2012, 1:48:55 PM
to ASSEMBL...@listserv.uga.edu
Arrgh. Correction to the below. Not enough caffeine, yet it's late in
the morning...

Tom Marchant correctly mentioned that SRST/CLST came in with ESA, not late
System/370, as a look at my SEARS card just confirmed. However, the
point still applies - SRST/CLST have been around for almost 25 years and
I doubt anyone is still running ES 9000 boxes.

Paul Gilmartin

Jan 16, 2012, 3:25:45 PM
to ASSEMBL...@listserv.uga.edu

B-tree?

-- gil

Paul Gilmartin

Jan 16, 2012, 3:33:58 PM
to ASSEMBL...@listserv.uga.edu
On Jan 16, 2012, at 07:53, robin wrote:

> From: "Dan Skomsky, PSTI" <Poodl...@sbcglobal.net>
> Sent: Monday, 16 January 2012 11:49 PM
>
>> One Assembler trick I have seen in speeding up scanning loops was to use a
>> CLI instruction to check the first byte of a string and then only doing the
>> CLC/CLCL if the CLI matches. This trick even works if doing a binary
>> search.
>

On the average (FSVO), how does this compare with Boyer-Moore?

I've seen suggested TRT for the first character, then CLC for
the rest. Works much better for strings beginning with "Z"
than for strings beginning with " ".

> Marginal savings, I think, compared to EX/CLC or CLCL,
> for the reason that both CLC and CLCL give up after examining the
> first character, should they be unequal.
>
> Might be more fruitful to compare length of the key with that of an element first,
> and then carrying out the compare should those lengths be equal.
>

Gives you "=", but not "<" or ">", so precludes binary search.

What was the statement of the problem, anyway?

CDC 3600/3800 had a "Modify following instruction" instruction
that met much of the requirement for EX. And pipelining was of
little import in that era.

-- gil

Tony Thigpen

Jan 16, 2012, 4:43:41 PM
to ASSEMBL...@listserv.uga.edu
> I doubt anyone is still running ES 9000 boxes.

I have paying customers on 9672s, MP2000, MP3000, etc.
VSE, not z/OS.


Tony Thigpen

Kerry

Jan 16, 2012, 10:21:35 AM
to ASSEMBL...@listserv.uga.edu
Saying that "... sub-optimizing of this sort is, I think, evidence of a
pervasive malaise..." is a short-sighted generalization.

Performance is one of the strongest reasons for coding in assembler and
this discussion characterizes some of the low hanging fruit available for
the attainment thereof.

The timing differences can be quite significant when the code in question
is embedded in a routine that is executed 100 billion times.

Kerry Tenberg
Austin, Tx

On Thu, Jan 12, 2012 at 1:23 PM, John Gilmore <johnwgil...@gmail.com> wrote:

> My own experience has been much more mixed, but I'd like to accept, in
> order to address what I take to be a more important issue, that Peter
> Farley is right when he says
>
> <begin snippet>
> My prior experiences in replacing MVCL/CLCL's with multiple MVC/CLC's
> and even MVC/CLC loops for "small" areas (FSVO "small") is that
> MVCL/CLCL loses almost every time.
> </end snippet>
>
> Preoccupation with these issues is, at best, counter-productive. The
> MVCLE is logically simpler and should be used unless one knows that
> one has only some fixed, small number n << 256 bytes to move.
>
> No one has ever claimed that the timing differences here are large,
> significant ones; and the continuing preoccupation here with
> suboptimizing of this sort is, I think, evidence of a pervasive
> malaise, a retreat into the familiar that precludes consideration of
> more, much more, important design issues.
>

robin

Jan 17, 2012, 12:39:49 AM
to ASSEMBL...@listserv.uga.edu
From: "McKown, John" <John....@healthmarkets.com>
Sent: Friday, 13 January 2012 2:32 AM

> OK, I hope I'm not becoming wearisome with my yammering. But I am not too busy right now. And I still really like and
> respect the z architecture (despite its horrendous price).
>
> I ask about the CPU cost of an EX because that same program that I'm working on uses the EX a fair amount to move
> "variable length" strings into a blank-initialized area for reporting purposes. Instead of EX of an MVC, I could use
> MVCL or MVCLE.

As the task is to move stuff to a buffer for reporting purposes,
the cpu time will be negligible compared to I/O time.

Confucius say: If it works, don't fix it.

robin

Jan 17, 2012, 12:30:49 AM
to ASSEMBL...@listserv.uga.edu
From: "Paul Gilmartin" <PaulGB...@aim.com>
Sent: Tuesday, 17 January 2012 7:33 AM

> CDC 3600/3800 had a "Modify following instruction" instruction

The S/360 and subsequent machines have one like that also.
In the case of MVC/CLC instructions :-

stc 1,*+5
mvc a(0),b

can be useful.

EX does more than just "insert length" into SS instructions.
The ability to OR in bits from the second byte of the subject instruction
along with the content of the nominated register is probably rarely used
in the case of SS instructions, but can be used to effect with RX instructions,
where you might want to retain, say, the existing index field in the
subject instruction, yet supply bits for the Register field of the
subject instruction.
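
A minimal sketch of both uses described above, assuming the usual rule that EX ORs bits 56-63 of its register into bits 8-15 of the subject instruction without changing storage. The names EXDEMO, MOVE, LOAD, IN, OUT and TABLE are invented for illustration, and the program stores into its own CSECT, so it is not reentrant.

```hlasm
EXDEMO   CSECT
R1       EQU   1
R7       EQU   7
R12      EQU   12
R13      EQU   13
R14      EQU   14
R15      EQU   15
         USING EXDEMO,R12
         STM   R14,R12,12(R13)     save the caller registers
         LR    R12,R15             entry address becomes the base
* Case 1: SS instruction, register supplies the length code
         LA    R1,5                string length 5
         BCTR  R1,0                length code is length minus 1
         EX    R1,MOVE             acts as MVC OUT(5),IN
* Case 2: RX instruction, register supplies the R1 field only
         LA    R7,4                index used by the subject
         LA    R1,X'50'            high nibble 5, low nibble 0
         EX    R1,LOAD             acts as L 5,TABLE(R7)
         LM    R14,R12,12(R13)     restore (R5 included)
         SR    R15,R15
         BR    R14
MOVE     MVC   OUT(0),IN           length code 0 as assembled
LOAD     L     0,TABLE(R7)         second byte X'07' as assembled
IN       DC    CL8'ABCDEFGH'
OUT      DC    CL8' '
TABLE    DC    F'10',F'20',F'30'
         END   EXDEMO
```

In case 2 the low nibble of the register is zero, so the index field (7) of the subject instruction survives the OR while the register field becomes 5.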

robin

Jan 17, 2012, 12:49:50 AM
to ASSEMBL...@listserv.uga.edu
From: "Rob van der Heij" <rvd...@gmail.com>
Sent: Tuesday, 17 January 2012 2:37 AM

> Having the CLC near the EX helps for cache. I also like to assemble it
> in-line because the right USINGs apply. We noticed that it is
> attractive to run over the CLC (with the length byte 0 as assembled)
> and then EX behind your back to do the real thing. More attractive
> than branch over the target if the instruction lets you.

A convenient place for the subject instruction is immediately after
a B instruction, thus avoiding the need to execute CLC or MVC twice.

Rob van der Heij

Jan 17, 2012, 3:48:21 AM
to ASSEMBL...@listserv.uga.edu
On Tue, Jan 17, 2012 at 6:49 AM, robin <rob...@dodo.com.au> wrote:
> From: "Rob van der Heij" <rvd...@gmail.com>
> Sent: Tuesday, 17 January 2012 2:37 AM
>
>> Having the CLC near the EX helps for cache. I also like to assemble it
>> in-line because the right USINGs apply. We noticed that it is
>> attractive to run over the CLC (with the length byte 0 as assembled)
>> and then EX behind your back to do the real thing. More attractive
>> than branch over the target if the instruction lets you.
>
>
> A convenient place for the subject instruction is immediately after
> a B instruction, thus avoiding the need to execute CLC or MVC twice.

My experience was that executing the MVC or CLC twice (first with
length 0) is better than branching over it. So:

X        CLC   ONE(0),TWO
         EX    Rx,X

But it may very well be that current CPUs look sufficiently over the
branch that one could

         B     Y
X        CLC   ONE(0),TWO
Y        EX    Rx,X

Obviously I do not wish to make this kind of decision at each
instance. But once you find this in the deep bowels of a heavy loop, it
is worth thinking about it and putting the optimal one in my INLINEX
macro that does the work:

         INLINEX Rx,CLC,ONE(0),TWO
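
For readers who want to see what such a macro might look like, here is a hypothetical sketch of the branch-over variant. It is not Rob's INLINEX: the macro name EXHERE, the generated labels and the demo program around it are all assumptions made for illustration.

```hlasm
         MACRO
&LABEL   EXHERE &REG,&OP,&OPND1,&OPND2
&LABEL   B     E&SYSNDX            branch over the target
T&SYSNDX &OP   &OPND1,&OPND2       target, reached only through EX
E&SYSNDX EX    &REG,T&SYSNDX       execute it with the register byte
         MEND
INLDEMO  CSECT
R0       EQU   0
R1       EQU   1
R12      EQU   12
R13      EQU   13
R14      EQU   14
R15      EQU   15
         USING INLDEMO,R12
         STM   R14,R12,12(R13)     save the caller registers
         LR    R12,R15             entry address becomes the base
         LA    R1,4-1              compare 4 bytes: length code 3
         EXHERE R1,CLC,ONE(0),TWO
         BE    SAME
         LA    R15,4               RC=4: strings differ
         B     EXIT
SAME     SR    R15,R15             RC=0: strings match
EXIT     L     R14,12(,R13)        restore everything except R15
         LM    R0,R12,20(R13)
         BR    R14
ONE      DC    C'ABCDEFGH'
TWO      DC    C'ABCDXXXX'
         END   INLDEMO
```

Because the target is generated at the point of the call, it is assembled under the same USINGs as the surrounding code, which is the property discussed in this thread. Dropping the generated B gives the run-over-it-twice variant for instructions where that is safe.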

Rob

Martin Truebner

Jan 17, 2012, 5:06:24 AM
to ASSEMBL...@listserv.uga.edu
Rob,

>> My experience was that executing the MVC or CLC twice (first with
>> length 0) is better than to branch over it.

I doubt that doing something little and then full is faster than doing
it full the first time....

If you observed a major difference, I suspect that it is because
the first execution triggered a page-in (or a swap or a
steal...whatever).

I do NOT like this "inline" technique at all. Also: it makes baseless
coding (only base(s) for data) hard if EXRL is not available. Yes, I
have heard of (and do use) LOCTR and various other techniques to do it
anyway.

--
Martin

Pi_cap_CPU - all you ever need around MWLC/SCRT/CMT in z/VSE
more at http://www.picapcpu.de

Rob van der Heij

Jan 17, 2012, 6:34:21 AM
to ASSEMBL...@listserv.uga.edu
On Tue, Jan 17, 2012 at 11:06 AM, Martin Truebner <Mar...@pi-sysprog.de> wrote:

> I do NOT like this "inline"-technique at all. Also: it does make coding
> baseless (only base(s) for data) hard, if EXRL is not available. Yes, I
> heard (and do use) of LOCTR and various other techniques to do it
> anyway.

I thought that putting instructions between the data was considered
evil practice. But I merely assumed it applied to the target of EX as
well.

When reading the code, I find it breaks the line of thought when I
have to go look for the exact instruction that's targeted by EX. And I
have been bitten a few times because USINGs were different at the EX
and where the target was placed.

This is what I see as abstraction. The details you mention are done
inside the macro and don't affect my source code. My macro even knows
about the instructions that are safe to execute twice as we discussed.
If I would go baseless, that would be resolved entirely in my INLINEX
macro (the branch as well as putting the target instruction in the
constants area with the right LOCTR settings). And with *no* branches
coded in my source, the macro is all it takes to go baseless...

Rob

Fred van der Windt

Jan 17, 2012, 7:31:59 AM
to ASSEMBL...@listserv.uga.edu
> Having the CLC near the EX helps for cache. I also like to assemble it
> in-line because the right USINGs apply. We noticed that it is
> attractive to run over the CLC (with the length byte 0 as assembled)
> and then EX behind your back to do the real thing. More attractive
> than branch over the target if the instruction lets you.

The USING-issue is a strong argument in favor of this: I juggle around USINGs a lot and it is a pain (and error-prone) to set up the same USINGs for a single instruction that needs to be EXecuted.

We use the HLASM Toolkit Structured Programming Macros which means that we can't easily insert an instruction 'after' a Jump instruction. Almost all the Jump instructions are generated by the SPM macros.

Fred!

Rob van der Heij

Jan 17, 2012, 7:54:22 AM
to ASSEMBL...@listserv.uga.edu
On Tue, Jan 17, 2012 at 1:31 PM, Fred van der Windt
<Fred.van....@mail.ing.nl> wrote:
>> Having the CLC near the EX helps for cache. I also like to assemble it
>> in-line because the right USINGs apply. We noticed that it is
>> attractive to run over the CLC (with the length byte 0 as assembled)
>> and then EX behind your back to do the real thing. More attractive
>> than branch over the target if the instruction lets you.
>
> The USING-issue is a strong argument in favor of this: I juggle around USINGs a lot and it is a pain (and error-prone) to set up the same USINGs for a single instruction that needs to be EXecuted.

Right, I'm more and more tempted to drop all USINGs at the end of
subroutines and explicitly state which ones apply upon entry.
Since I have nested subroutines with static scope, it's even more
appropriate. Within the routine itself, I try to have the USING and
DROP at the same nesting level.

> We use the HLASM Toolkit Structured Programming Macros which means that we can't easily insert an instruction 'after' a Jump instruction. Almost all the Jump instructions are generated by the SPM macros.

What's a Jump instruction ;-) Really, mine are only generated by the
structured programming macros (including one to exit the routine as
part of error handling).

Rob

McKown, John

Jan 17, 2012, 8:06:17 AM
to ASSEMBL...@listserv.uga.edu
At this shop, using CPU costs money. Using I/O doesn't. Wall clock doesn't. Therefore, so long as SLAs are met, it is better to decrease CPU time at the expense of __anything__ else. Yes, even productivity. I must say no more.

--
John McKown
Systems Engineer IV
IT

Administrative Services Group

HealthMarkets(r)

9151 Boulevard 26 * N. Richland Hills * TX 76010
(817) 255-3225 phone *
john....@healthmarkets.com * www.HealthMarkets.com


> -----Original Message-----
> From: IBM Mainframe Assembler List
> [mailto:ASSEMBL...@LISTSERV.UGA.EDU] On Behalf Of robin
> Sent: Monday, January 16, 2012 11:40 PM
> To: ASSEMBL...@LISTSERV.UGA.EDU

Fred van der Windt

Jan 17, 2012, 8:09:59 AM
to ASSEMBL...@listserv.uga.edu
> I doubt that doing something little and then full is faster than doing it full the first time....
>
> If you observed major difference I do suspect that it is because the first execution triggered a pagein (or a swap or a steal...whatever).

I did a very Q&D test and...

         J     *+10
         CLC   0(1,R10),8(R10)
         EXRL  R1,*-6

...is about 25% faster than...

         CLC   0(1,R10),8(R10)
         EXRL  R1,*-6

So on a z196 the jump seems to be faster than the compare...
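
The offsets in the faster variant work out because J is a 4-byte instruction and CLC a 6-byte one: *+10 lands on the EXRL, and *-6 points back at the CLC. A sketch of the same sequence with labels instead of offsets, assuming a machine with EXRL and using invented names (RELDEMO, CMPTGT, DOCMP, DATA):

```hlasm
RELDEMO  CSECT
R0       EQU   0
R1       EQU   1
R10      EQU   10
R12      EQU   12
R13      EQU   13
R14      EQU   14
R15      EQU   15
         STM   R14,R12,12(R13)     save the caller registers
         LARL  R10,DATA            no code base register needed
         LA    R1,8-1              compare 8 bytes: length code 7
         J     DOCMP               4 bytes: same as J *+10
CMPTGT   CLC   0(1,R10),8(R10)     6 bytes: length code 0
DOCMP    EXRL  R1,CMPTGT           same as EXRL R1,*-6
         JE    SAME
         LA    R15,4               RC=4: fields differ
         J     EXIT
SAME     SR    R15,R15             RC=0: fields match
EXIT     L     R14,12(,R13)        restore everything except R15
         LM    R0,R12,20(R13)
         BR    R14
DATA     DS    0D                  keep the LARL target aligned
         DC    CL8'ABCDEFGH',CL8'ABCDEFGH'
         END   RELDEMO
```

This reproduces only the instruction sequence, not the timing test; the 25% figure is Fred's measurement on a z196.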

Paul Gilmartin

Jan 17, 2012, 9:32:15 AM
to ASSEMBL...@listserv.uga.edu
On Jan 16, 2012, at 22:30, robin wrote:

> From: "Paul Gilmartin" <PaulGB...@aim.com>
> Sent: Tuesday, 17 January 2012 7:33 AM
>
>> CDC 3600/3800 had a "Modify following instruction" instruction
>
> The S/360 and subsequent machines have one like that also.
> In the case of MVC/CLC instructions :-
>
> stc 1,*+5
> mvc a(0),b
>
> can be useful.
>

No, no, no, no, no!:

o RENT!? (How does this affect instruction pipelining?)

o The CDC instruction didn't modify the storage; it
modified the execution of the instruction after it had
been fetched from storage. Even as EX doesn't modify
its target instruction in storage.

> EX does more than just "insert length" into SS instructions.
> The ability to OR in bits from the second byte of the subject instruction
> along with the content of the nominated register is probably rarely used
> in the case of SS instructions, but can be used to effect with RX instructions,
> where you might want to retain, say, the existing index field in the
> subject instruction, yet supply bits for the Register field of the
> subject instruction.

-- gil

Paul Gilmartin

Jan 17, 2012, 9:40:06 AM
to ASSEMBL...@listserv.uga.edu
On Jan 17, 2012, at 05:31, Fred van der Windt wrote:

>> Having the CLC near the EX helps for cache. I also like to assemble it
>> in-line because the right USINGs apply. We noticed that it is
>> attractive to run over the CLC (with the length byte 0 as assembled)
>> and then EX behind your back to do the real thing. More attractive
>> than branch over the target if the instruction lets you.
>
> The USING-issue is a strong argument in favor of this: I juggle around USINGs a lot and it is a pain (and error-prone) to set up the same USINGs for a single instruction that needs to be EXecuted.
>

An alternative is LOCTR (possibly in a macro). (With possible
cache miss consequences. I forget; is the target of EX treated
as a data access or as an instruction access for cache management?)

But "instruction" should be a data type supported for use in
literals: "EX Rx,=INST'CLC ...'". Some programmers have
kludged this with ugly hex constants; the facility should be
made orderly.
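
The LOCTR alternative can be sketched like this: the target is written next to the EX in the source, so the USINGs in effect at that point apply to it, but it is assembled under a second location counter and ends up after the rest of the section. The names LOCDEMO, EXTGTS, MOVE1, IN and OUT are assumptions for illustration, and the sketch still needs a code base register because it uses EX rather than EXRL.

```hlasm
LOCDEMO  CSECT
R1       EQU   1
R2       EQU   2
R3       EQU   3
R12      EQU   12
R13      EQU   13
R14      EQU   14
R15      EQU   15
         USING LOCDEMO,R12
         STM   R14,R12,12(R13)     save the caller registers
         LR    R12,R15             entry address becomes the base
         LA    R2,OUT              R2 -> destination
         LA    R3,IN               R3 -> source
         LA    R1,3-1              move 3 bytes: length code 2
         EX    R1,MOVE1            target is assembled elsewhere
* Switch to a second location counter for the EX target
EXTGTS   LOCTR
MOVE1    MVC   0(0,R2),0(R3)       length code 0 as assembled
* Resume the first location counter of the section
LOCDEMO  LOCTR
         LM    R14,R12,12(R13)     nothing to branch over here
         SR    R15,R15
         BR    R14
IN       DC    CL8'ABCDEFGH'
OUT      DC    CL8' '
         END   LOCDEMO
```

The main line needs neither a branch over the target nor a second execution of it; the cost is that the target may sit in a different cache line from the EX.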

> We use the HLASM Toolkit Structured Programming Macros which means that we can't easily insert an instruction 'after' a Jump instruction. Almost all the Jump instructions are generated by the SPM macros.

-- gil

Paul Gilmartin

Jan 17, 2012, 10:32:06 AM
to ASSEMBL...@listserv.uga.edu
On Jan 16, 2012, at 08:21, Kerry wrote:
>
> Performance is one of the strongest reasons for coding in assembler and
> this discussion characterizes some of the low hanging fruit available for
> the attainment thereof.
>
Others have said here that performance is a strong reason
for _not_ coding in assembler:

o Compiler developers have done the research on instruction
timings and know better than most end users what sequences
fit the pipelines optimally.

o Compiled code can be re-optimized for a new generation of
hardware simply by recompiling.

o Interpreters can dynamically recompile based on statistical
profiles evaluated at the actual time of execution.

-- gil

Edward Jaffe

Jan 17, 2012, 10:44:17 AM
to ASSEMBL...@listserv.uga.edu
On 1/17/2012 6:40 AM, Paul Gilmartin wrote:
> I forget; is the target of EX treated as a data access or as an instruction access for cache management?

The 256-byte cache line containing the target instruction is loaded into I-cache.

--
Edward E Jaffe
Phoenix Software International, Inc
831 Parkview Drive North
El Segundo, CA 90245
310-338-0400 x318
edj...@phoenixsoftware.com
http://www.phoenixsoftware.com/

Farley, Peter x23353

Jan 17, 2012, 11:06:17 AM
to ASSEMBL...@listserv.uga.edu
> -----Original Message-----
> From: IBM Mainframe Assembler List [mailto:ASSEMBLER-
> LI...@LISTSERV.UGA.EDU] On Behalf Of Paul Gilmartin
> Sent: Tuesday, January 17, 2012 10:32 AM
> To: ASSEMBL...@LISTSERV.UGA.EDU
> Subject: Re: How bad is the EX instruction?
>
> On Jan 16, 2012, at 08:21, Kerry wrote:
> >
> > Performance is one of the strongest reasons for coding in assembler and
> > this discussion characterizes some of the low hanging fruit available
> > for the attainment thereof.
> >
> Others have said here that performance is a strong reason
> for _not_ coding in assembler:
>
> o Compiler developers have done the research on instruction
> timings and know better than most end users what sequences
> fit the pipelines optimally.

Notoriously NOT for the IBM COBOL compilers. I plead ignorance for the PL/1 and Fortran compilers, but the C/C++ compiler is the only current compiler in my personal experience that actually exhibits a knowledge of instruction timings and latency and AGI interrupts, etc., for current and recent pipelined "z" processors.

IMHO, COBOL generated code is so bad that if I was on the COBOL code-generation development team I would be embarrassed to admit it.

> o Compiled code can be re-optimized for a new generation of
> hardware simply by recompiling.
>
> o Interpreters can dynamically recompile based on statistical
> profiles evaluated at the actual time of execution.
>
--



Kirk Talman

Jan 17, 2012, 11:17:07 AM
to ASSEMBL...@listserv.uga.edu
> From: "Farley, Peter x23353" <Peter....@broadridge.com>

> IMHO, COBOL generated code is so bad that if I was on the COBOL
> code-generation development team I would be embarrassed to admit it.

The COBOL code generator appears to be the beneficiary of benign neglect.

The net result of using only the halfword immediate instructions and the
relative branching would be significant.

But as noted elsewhere here, there are a lot of active very old machines.

I suspect if the source to the code generator were made available w/an
NDA, there would be a group who would improve it for the general interest.


Tony Thigpen

Jan 17, 2012, 11:21:41 AM
to ASSEMBL...@listserv.uga.edu
I would much rather debug a Cobol program any day vs. what passes for
C code nowadays.
Even trying to make simple changes to a C/C++ program can require hours
just to figure out what the program is really doing.

Tony Thigpen

McKown, John

Jan 17, 2012, 11:22:46 AM
to ASSEMBL...@listserv.uga.edu
> -----Original Message-----
> From: IBM Mainframe Assembler List
> [mailto:ASSEMBL...@LISTSERV.UGA.EDU] On Behalf Of Paul Gilmartin
> Sent: Tuesday, January 17, 2012 9:32 AM
> To: ASSEMBL...@LISTSERV.UGA.EDU
> Subject: Re: How bad is the EX instruction?
>

My usual "whine". I would likely abandon HLASM for the most part, IF I had a C license. But one makes do with what one has.

McKown, John

Jan 17, 2012, 11:28:30 AM
to ASSEMBL...@listserv.uga.edu
> -----Original Message-----
> From: IBM Mainframe Assembler List
> [mailto:ASSEMBL...@LISTSERV.UGA.EDU] On Behalf Of Farley,
> Peter x23353
> Sent: Tuesday, January 17, 2012 10:06 AM
> To: ASSEMBL...@LISTSERV.UGA.EDU
> Subject: Re: How bad is the EX instruction?
>
<snip>

>
> Notoriously NOT for the IBM COBOL compilers. I plead
> ignorance for the PL/1 and Fortran compilers, but the C/C++
> compiler is the only current compiler in my personal
> experience that actually exhibits a knowledge of instruction
> timings and latency and AGI interrupts, etc., for current and
> recent pipelined "z" processors.
>
> IMHO, COBOL generated code is so bad that if I was on the
> COBOL code-generation development team I would be embarrassed
> to admit it.
>

I have been told that part of the reason for the "horrible" code emitted by the COBOL compiler is to guarantee 100% conformance to the ANSI standards. I don't know this for a fact. But COBOL was designed around __decimal__ arithmetic. And getting proper truncations and overflow notifications. So there may be something to this. And let's not even talk about the abomination of the PERFORM verb. Implementing that is a royal PITA, from what I can tell. Mainly because the end of any paragraph may, or may not, return to some other point in the code. Sometimes it "returns" to a PERFORM, and other times it "falls through" to the next paragraph. Oh my aching compiler.

Edward Jaffe

Jan 17, 2012, 12:07:30 PM
to ASSEMBL...@listserv.uga.edu
On 1/17/2012 8:06 AM, Farley, Peter x23353 wrote:
>>
>> Others have said here that performance is a strong reason
>> for _not_ coding in assembler:
>>
>> o Compiler developers have done the research on instruction
>> timings and know better than most end users what sequences
>> fit the pipelines optimally.
> Notoriously NOT for the IBM COBOL compilers.

The PL/X compiler also generates 'poor' code. (It's one reason it's been
difficult to convince the 'powers that be' to establish a new Architectural
Level Set for z/OS.)

IBM has hinted that they plan to address these compiler deficiencies--when is
anybody's guess. But, at least they admit there's a problem. That's the first
step...

John Gilmore

Jan 17, 2012, 12:11:50 PM
to ASSEMBL...@listserv.uga.edu
Peter Farley wrote:

<begin snippet>
. . . I plead ignorance for the PL/1 and Fortran compilers, but the
C/C++ compiler is the only current compiler in my personal experience
that actually exhibits a knowledge of instruction timings and latency
and AGI interrupts, etc., for current and recent pipelined "z"
processors.

</end snippet>

The IBM optimizing machinery for C/C++ and PL/I is now shared, the
same for both compilers; and the effects of this sharing have been
mixed, mostly good and some few of them very bad.

--jg


--

Paul Gilmartin

Jan 17, 2012, 12:19:38 PM
to ASSEMBL...@listserv.uga.edu
On Jan 17, 2012, at 10:07, Edward Jaffe wrote:
>
> The PL/X compiler also generates 'poor' code. (It's one reason it's been
> difficult to convince the 'powers that be' to establish a new Architectural
> Level Set for z/OS.)
>
The balance between cost of development and cost of execution may
be biased when the vendor pays for one and the customer for the other.

> IBM has hinted that they plan to address these compiler deficiencies--when is
> anybody's guess. But, at least they admit there's a problem. That's the first
> step...


On Jan 17, 2012, at 10:11, John Gilmore wrote:
>
> The IBM optimizing machinery for C/C++ and PL/I is now shared, the
> same for both compilers; and the effects of this sharing have been
> mixed, mostly good and some few of them very bad.

Sounds like an opportunity for PL/X to join the party.

How's Metal/C?

-- gil

Farley, Peter x23353

Jan 17, 2012, 12:39:15 PM
to ASSEMBL...@listserv.uga.edu
> -----Original Message-----
> From: IBM Mainframe Assembler List [mailto:ASSEMBLER-
> LI...@LISTSERV.UGA.EDU] On Behalf Of Paul Gilmartin
> Sent: Tuesday, January 17, 2012 12:20 PM
> To: ASSEMBL...@LISTSERV.UGA.EDU
> Subject: Re: How bad is the EX instruction?
<Snipped>
> On Jan 17, 2012, at 10:11, John Gilmore wrote:
> >
> > The IBM optimizing machinery for C/C++ and PL/I is now shared, the
> > same for both compilers; and the effects of this sharing have been
> > mixed, mostly good and some few of them very bad.
>
> Sounds like an opportunity for PL/X to join the party.
>
> How's Metal/C?

Pretty good, in my limited investigations. When the highest level of optimization is turned on, it can be rather tricky to follow the generated assembler code even knowing precisely what the C code was intended to do. I haven't yet measured the speed of the generated code in any meaningful way for a non-trivial program, but I am seriously impressed by the optimizations that are done and by the compiler's ability to "tune" the instruction set used so that code can be generated that will run on "z" machines from a given architecture level upwards.

Peter

John Gilmore

Jan 17, 2012, 1:01:35 PM
to ASSEMBL...@listserv.uga.edu
One of the arguments for making PL/X available to customers who are
willing to pay for it is that doing so would give IBM an economic
incentive to fix some of its deficiencies.

Some of these deficiencies of course reflect its history. PL/S had a
notoriously fertile generate facility that permitted assembly language
to be dropped into source programs. This facility was 'abused' by
some IBM and contractor programmers to write routines that were, in
effect, assembly-language cakes with some PL/X powdered sugar
sprinkled on them.

The cross-platform emphasis in PL/X discourages and was intended to
discourage this sort of thing; but optimizing machinery, less
important in PL/S because resort to assembly language was possible,
does not appear to be much used by PL/X. (Even such obvious things
as moving common subexpressions out of loops and suppressing redundant
subscript arithmetic don't seem to happen.)

IBM knows and has always known how to fix this problem; what has
been|is lacking is the will to do it.

--jg


--

Tony Harminc

Jan 17, 2012, 2:37:09 PM
to ASSEMBL...@listserv.uga.edu
On 17 January 2012 13:01, John Gilmore <johnwgil...@gmail.com> wrote:

> The cross-platform emphasis in PL/X discourages and was intended to
> discourage this sort of thing; but optimizing machinery, less
> important in PL/S because resort to assembly language was possible,
> does not appear to be much used by PL/X. (Even such obvious things
> as moving common subexpressions out of loops and suppressing redundant
> subscript arithmetic don't seem to happen.)
>
> IBM knows and has always known how to fix this problem; what has been|is lacking is the will to do it.

IBM almost a decade ago fixed the problem where it really counts - in
its millicode. That is generated by the GCC suite, with a
private-to-IBM PL8 language front end, and a published(?) middle-end
optimizer and back end code generator. One bumps into the odd
non-millicode module compiled by PL8's CMS-hosted predecessor
implementation, PL.8, but porting of the whole GCC and Linuxy
infrastructure, including ELF object format and other baggage, into
the likes of z/OS would presumably be required before the current PL/X
could be replaced and its code optimized. An outsider can only imagine
the internal geopolitical goings on with regard to all this.

Tony H.

Tony Harminc

Jan 17, 2012, 3:05:25 PM
to ASSEMBL...@listserv.uga.edu
On 17 January 2012 08:09, Fred van der Windt
<Fred.van....@mail.ing.nl> wrote:

> I did a very Q&D test and...
>
>          J     *+10
>          CLC   0(1,R10),8(R10)
>          EXRL  R1,*-6
>
> ...is about 25% faster than...
>
>          CLC   0(1,R10),8(R10)
>          EXRL  R1,*-6
>
> So on a z196 the jump seems to be faster than the compare...

This seems unsurprising. Even on much older processors, an
unconditional branch has been predicted as "taken", and so the
instruction stream fetching will be at the EXRL long before execution
gets to the J. If R1 was set some instructions earlier, the EXRL and
target CLC can be set up and ready to go way in advance.

Tony H.

Ray Mullins

Jan 17, 2012, 5:47:51 PM
to ASSEMBL...@listserv.uga.edu
I knew there were VSE folks on those boxes, which is why I chose my
models carefully. ;)

On 2012-01-16 13:43, Tony Thigpen wrote:
> > I doubt anyone is still running ES 9000 boxes.
>
> I have paying customers on 9672s, MP2000, MP3000, etc.
> VSE, not z/OS.
>
>
> Tony Thigpen
>
> -----Original Message-----
> From: Ray Mullins
> Sent: 01/16/2012 01:48 PM
>> Arrgh. Correction to the below. Not enough caffeine, yet it's late in
>> the morning...
>>
>> Tom Marchant correctly mentioned that SRST/CLST came in with ESA, not late
>> System/370, as a look at my SEARS card just confirmed. However, the
>> point still applies - SRST/CLST have been around for almost 25 years and
>> I doubt anyone is still running ES 9000 boxes.
>>
>


--
M. Ray Mullins
Roseville, CA, USA
http://www.catherdersoftware.com/


Ray Mullins

Jan 17, 2012, 6:28:51 PM
to ASSEMBL...@listserv.uga.edu
On 2012-01-17 07:44, Edward Jaffe wrote:
> On 1/17/2012 6:40 AM, Paul Gilmartin wrote:
>> I forget; is the target of EX treated as a data access or as an
>> instruction access for cache management?
>
> The 256-byte cache line containing the target instruction is loaded into
> I-cache.

So, this would seem to point towards putting the target near the
instruction, if you can, or at least no more than 244 bytes away (worst
case, maybe a bit less), or possibly grouping frequently executed
targets together using Martin T.'s favorite LOCTR assembler instruction
and hoping that the line stays in cache.

Thoughts?

Later,
Ray
