How efficient is seemingly very straight-forward and readable code?

Robert AH Prins

unread,

Apr 5, 2013, 9:02:54 PM4/5/13

to

Hi all,

I recently had a look at the code generated by the INDEX builtin. In this
particular program that uses it, I'm using it to add (or not) new
4-character+space items (items do not have a leading space) to a CHAR VAR, i.e.

if index(set, new || ' ') = 0 then
set ||= new || ' ';

The complete sets are later used to index arrays, i.e.

i = (index(set, item || ' ') + 4) / 5;

and although the above is short, sweet and perfectly readable, it turns out to
be far from optimal...

Adding a bit of code to the program to output all 52,128 possible invocations of
the above, with sets varying from 1 to 80 items and item *always* present in the
set, and then using this output in another program to process all 52,128
possible invocations in a loop bracketed with the datetime() builtin showed that
using a simple

do i = 1 to length(set) by 5 until(item = substr(set, i, 4));
end;
i = (i + 4) / 5;

is about 25% faster (datetime()'ed over 100 loops of the 52,128 actual
variations) than

i = (index(set, item || ' ') + 4) / 5;

which makes me wonder, why isn't there an INDEXN or (POSN) builtin that would
increment the position of the next character in "set" to scan for "item" by the
length of "item", in other words,

if set = 'A1 B2 C3 D4 E5 F6 G7 ' and item = 'B2 ', this theoretical "INDEXN"
would do the compare of "item" to "set" only on positions 1, 4, 7 etc of "set",
creating an efficient way to scan these kind of
strings-(ab)used-as-unsorted-table, which in a way resembles the REXX "wordpos"
function.

Any thoughts on this, other than those of the "why bother when you can write the
loop yourself" variety?

Robert
--
Robert AH Prins
robert(a)prino(d)org

glen herrmannsfeldt

unread,

Apr 5, 2013, 7:29:13 PM4/5/13

to

Robert AH Prins <spam...@prino.org> wrote:

> I recently had a look at the code generated by the INDEX builtin. In this
> particular program that uses it, I'm using it to add (or not) new
> 4-character+space items (items do not have a leading space) to a CHAR VAR, i.e.

> if index(set, new || ' ') = 0 then
> set ||= new || ' ';

Hmm. Why not an array of CHAR(4) instead?

> The complete sets are later used to index arrays, i.e.

> i = (index(set, item || ' ') + 4) / 5;

And a hash table to look them up?

> and although the above is short, sweet and perfectly readable, it turns out to
> be far from optimal...

> Adding a bit of code to the program to output all 52,128 possible invocations of
> the above, with sets varying from 1 to 80 items and item *always* present in the
> set, and then using this output in another program to process all 52,128
> possible invocations in a loop bracketed with the datetime() builtin showed that
> using a simple

> do i = 1 to length(set) by 5 until(item = substr(set, i, 4));
> end;
> i = (i + 4) / 5;

> is about 25% faster (datetime()'ed over 100 loops of
> the 52,128 actual variations) than

So not a huge increase in speed.

> i = (index(set, item || ' ') + 4) / 5;

> which makes me wonder, why isn't there an INDEXN or (POSN) builtin that would
> increment the position of the next character in "set" to scan for "item" by the
> length of "item", in other words,

Even better, how about a Boyer-Moore version of INDEX. That would
probably be pretty fast in this case. Though generating the tables
takes some time.

> if set = 'A1 B2 C3 D4 E5 F6 G7 ' and item = 'B2 ', this theoretical
> "INDEXN" would do the compare of "item" to "set" only on positions
> 1, 4, 7 etc of "set", creating an efficient way to scan these kind
> of strings-(ab)used-as-unsorted-table, which in a way resembles
> the REXX "wordpos" function.

Seems to me that people who need to do that will usually use arrays.

> Any thoughts on this, other than those of the "why bother when
> you can write the loop yourself" variety?

-- glen

Robin Vowels

unread,

Apr 6, 2013, 6:34:01 AM4/6/13

to

On Apr 6, 10:02 am, Robert AH Prins <spamt...@prino.org> wrote:
> Hi all,
>
> I recently had a look at the code generated by the INDEX builtin. In this
> particular program that uses it, I'm using it to add (or not) new
> 4-character+space items (items do not have a leading space) to a CHAR VAR, i.e.
>
> if index(set, new || ' ') = 0 then
> set ||= new || ' ';
>
> The complete sets are later used to index arrays, i.e.
>
> i = (index(set, item || ' ') + 4) / 5;
>
> and although the above is short, sweet and perfectly readable, it turns out to
> be far from optimal...
>
> Adding a bit of code to the program to output all 52,128 possible invocations of
> the above, with sets varying from 1 to 80 items and item *always* present in the
> set, and then using this output in another program to process all 52,128
> possible invocations in a loop bracketed with the datetime() builtin showed that
> using a simple
>
> do i = 1 to length(set) by 5 until(item = substr(set, i, 4));
> end;
> i = (i + 4) / 5;
>
> is about 25% faster (datetime()'ed over 100 loops of the 52,128 actual
> variations) than
>
> i = (index(set, item || ' ') + 4) / 5;

You'd expect the time difference because conceptually INDEX
is examining every character of SET, which is 5 more tests
than your loop method.
Plus the fact that you are catenating a blank in the INDEX method.

Really, you'd be better off with an array, as each item is
of fixed length (4 characters).

> which makes me wonder, why isn't there an INDEXN or (POSN) builtin that would
> increment the position of the next character in "set" to scan for "item" by the
> length of "item", in other words,
>
> if set = 'A1 B2 C3 D4 E5 F6 G7 ' and item = 'B2 ', this theoretical "INDEXN"
> would do the compare of "item" to "set" only on positions 1, 4, 7 etc of "set",
> creating an efficient way to scan these kind of
> strings-(ab)used-as-unsorted-table, which in a way resembles the REXX "wordpos"
> function.
>
> Any thoughts on this, other than those of the "why bother when you can write the
> loop yourself" variety?

1. Use an array, as suggested above.
2. If time is an issue, use hashing.

Peter J. Seymour

unread,

Apr 6, 2013, 10:30:07 AM4/6/13

to

On 2013-04-06 02:02, Robert AH Prins wrote:
> Hi all,
>
> I recently had a look at the code generated by the INDEX builtin. In
> this particular program that uses it, I'm using it to add (or not) new
> 4-character+space items (items do not have a leading space) to a CHAR
> VAR, i.e.
>
> if index(set, new || ' ') = 0 then
> set ||= new || ' ';
>
> The complete sets are later used to index arrays, i.e.
>
> i = (index(set, item || ' ') + 4) / 5;
>
> and although the above is short, sweet and perfectly readable, it turns
> out to be far from optimal...
>

.....
> Robert

It reminders me that long ago with the Optimising Compiler, we had some
very specific coding standards that forbade certain things in the
interests of efficiency. Efficiency is still important, but I can't
imagine that it is as critical these days, given the high speed and low
cost of current machines.

Robert AH Prins

unread,

Apr 6, 2013, 12:47:26 PM4/6/13

to

On 2013-04-05 23:29, glen herrmannsfeldt wrote:
> Robert AH Prins <spam...@prino.org> wrote:
>
>> I recently had a look at the code generated by the INDEX builtin. In this
>> particular program that uses it, I'm using it to add (or not) new
>> 4-character+space items (items do not have a leading space) to a CHAR VAR, i.e.
>
>> if index(set, new || ' ') = 0 then
>> set ||= new || ' ';
>
> Hmm. Why not an array of CHAR(4) instead?

Funny...

When I joined my first employer in 1985 as a trainee, everyone of our group
actually used an array, until our instructor showed the string method, and ran
the compiler with the "FLOW" & "COUNT" options, showing that the INDEX method
only resulted in the execution of a single statement, rather than the average
n/2 that an array would use and telling us that using a "long" CHAR VAR would
sort of future-proof the code, with no need to change the bounds of an array.

>> The complete sets are later used to index arrays, i.e.
>
>> i = (index(set, item || ' ') + 4) / 5;
>
> And a hash table to look them up?

The overhead for small sets is probably too high. If items would be a single
character, it would be a no-brainer.

>> and although the above is short, sweet and perfectly readable, it turns out to
>> be far from optimal...
>
>> Adding a bit of code to the program to output all 52,128 possible invocations of
>> the above, with sets varying from 1 to 80 items and item *always* present in the
>> set, and then using this output in another program to process all 52,128
>> possible invocations in a loop bracketed with the datetime() builtin showed that
>> using a simple
>
>> do i = 1 to length(set) by 5 until(item = substr(set, i, 4));
>> end;
>> i = (i + 4) / 5;
>
>> is about 25% faster (datetime()'ed over 100 loops of
>> the 52,128 actual variations) than
>
> So not a huge increase in speed.

Not huge, but definitely significant.

>> i = (index(set, item || ' ') + 4) / 5;
>
>> which makes me wonder, why isn't there an INDEXN or (POSN) builtin that would
>> increment the position of the next character in "set" to scan for "item" by the
>> length of "item", in other words,
>
> Even better, how about a Boyer-Moore version of INDEX. That would
> probably be pretty fast in this case. Though generating the tables
> takes some time.

Don't tell me, I've used BM in the days of Turbo Pascal and the gains were
pretty amazing.

>> if set = 'A1 B2 C3 D4 E5 F6 G7 ' and item = 'B2 ', this theoretical
>> "INDEXN" would do the compare of "item" to "set" only on positions
>> 1, 4, 7 etc of "set", creating an efficient way to scan these kind
>> of strings-(ab)used-as-unsorted-table, which in a way resembles
>> the REXX "wordpos" function.
>
> Seems to me that people who need to do that will usually use arrays.

Actually, at every employer and client I've worked I've seen this kind of code
being used, and until recently I had never given it a second thought, even
though my experience with Turbo Pascal should have given me cause to reconsider.

My INDEXN was partly suggested by the last item on page 57 of

http://proceedings.share.org/client_files/SHARE_in_San_Francisco/Session_12335_handout_4165_0.pdf

i.e.

<quote>
HANDLE operations
.
.
The sensitivity to the underlying type makes the behavior like that in C. So,
for example, adding 1 to a handle increases the associated pointer value by
the number of bytes in the underlying structure type
</quote>

And if there was an INDEXN-like builtin, there would be no need to add a dummy
character to each item to prevent mismatches on partial strings.

Robin Vowels

unread,

Apr 6, 2013, 8:02:32 PM4/6/13

to

On Apr 7, 12:47 am, Robert AH Prins <spamt...@prino.org> wrote:
> On 2013-04-05 23:29, glen herrmannsfeldt wrote:
>

> > Robert AH Prins <spamt...@prino.org> wrote:
>
> >> I recently had a look at the code generated by the INDEX builtin. In this
> >> particular program that uses it, I'm using it to add (or not) new
> >> 4-character+space items (items do not have a leading space) to a CHAR VAR, i.e.
>
> >> if index(set, new || ' ') = 0 then
> >> set ||= new || ' ';
>
> > Hmm. Why not an array of CHAR(4) instead?
>
> Funny...
>
> When I joined my first employer in 1985 as a trainee, everyone of our group
> actually used an array, until our instructor showed the string method, and ran
> the compiler with the "FLOW" & "COUNT" options, showing that the INDEX method
> only resulted in the execution of a single statement, rather than the average
> n/2 that an array would use and telling us that using a "long" CHAR VAR would
> sort of future-proof the code, with no need to change the bounds of an array.

Well, it doesn't. A CHAR VAR had/has a limit of 32767 characters,
whereas an array allowed eight times that. Nowadays arrays
allow 4Gb elements.

There was never any need to change the bounds of an array.
A fixed size array could be used, just as a fixed size area
is used for CHAR VAR.

BTW, for 80 "elements" in your string, that's 160 catenations
to build the string, plus 160 data moves,
plus a catenation every time you search.

> >> The complete sets are later used to index arrays, i.e.
>
> >> i = (index(set, item || ' ') + 4) / 5;
>
> > And a hash table to look them up?
>
> The overhead for small sets is probably too high. If items would be a single
> character, it would be a no-brainer.

Your example uses 5-characters, not 1.

> >> and although the above is short, sweet and perfectly readable, it turns out to
> >> be far from optimal...

> >> Adding a bit of code to the program to output all 52,128 possible invocations of
> >> the above, with sets varying from 1 to 80 items and item *always* present in the
> >> set, and then using this output in another program to process all 52,128
> >> possible invocations in a loop bracketed with the datetime() builtin showed that
> >> using a simple
>
> >> do i = 1 to length(set) by 5 until(item = substr(set, i, 4));
> >> end;
> >> i = (i + 4) / 5;
>
> >> is about 25% faster (datetime()'ed over 100 loops of
> >> the 52,128 actual variations) than
>
> > So not a huge increase in speed.
>
> Not huge, but definitely significant.

In the context of the entire job, probably not significant.

> >> i = (index(set, item || ' ') + 4) / 5;
>
> >> which makes me wonder, why isn't there an INDEXN or (POSN) builtin that would
> >> increment the position of the next character in "set" to scan for "item" by the
> >> length of "item", in other words,
>
> > Even better, how about a Boyer-Moore version of INDEX. That would
> > probably be pretty fast in this case. Though generating the tables
> > takes some time.
>
> Don't tell me, I've used BM in the days of Turbo Pascal and the gains were
> pretty amazing.
>
> >> if set = 'A1 B2 C3 D4 E5 F6 G7 ' and item = 'B2 ', this theoretical
> >> "INDEXN" would do the compare of "item" to "set" only on positions
> >> 1, 4, 7 etc of "set", creating an efficient way to scan these kind
> >> of strings-(ab)used-as-unsorted-table, which in a way resembles
> >> the REXX "wordpos" function.
>
> > Seems to me that people who need to do that will usually use arrays.
>
> Actually, at every employer and client I've worked I've seen this kind of code
> being used, and until recently I had never given it a second thought, even
> though my experience with Turbo Pascal should have given me cause to reconsider.

> And if there was an INDEXN-like builtin, there would be no need to add a dummy
> character to each item to prevent mismatches on partial strings.

There's no need anyway. Look at your loop method. Do you see a
dummy character being added?

Shmuel Metz

unread,

Apr 8, 2013, 10:39:27 AM4/8/13

to

In <asaqre...@mid.individual.net>, on 04/06/2013

at 04:47 PM, Robert AH Prins <spam...@prino.org> said:

>When I joined my first employer in 1985 as a trainee, everyone of our
>group actually used an array, until our instructor showed the string
>method, and ran the compiler with the "FLOW" & "COUNT" options,
>showing that the INDEX method only resulted in the execution of a
>single statement, rather than the average n/2 that an array would
>use and telling us that using a "long" CHAR VAR would sort of
>future-proof the code, with no need to change the bounds of an
>array.

First, it's the execution time that matters, not the number of
statements. Second, using a varying string doesn't future-proff your
code, because you might have to increase the string length. In fact,
the array has the edge for future proofing because the limit is
larger.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to spam...@library.lspace.org

Robert AH Prins

unread,

Apr 8, 2013, 1:26:06 PM4/8/13

to

On 2013-04-08 14:39, Shmuel (Seymour J.) Metz wrote:
> In <asaqre...@mid.individual.net>, on 04/06/2013 at 04:47 PM, Robert AH
> Prins <spam...@prino.org> said:
>
>> When I joined my first employer in 1985 as a trainee, everyone of our group
>> actually used an array, until our instructor showed the string method, and
>> ran the compiler with the "FLOW" & "COUNT" options, showing that the INDEX
>> method only resulted in the execution of a single statement, rather than
>> the average n/2 that an array would use and telling us that using a "long"
>> CHAR VAR would sort of future-proof the code, with no need to change the
>> bounds of an array.
>
> First, it's the execution time that matters, not the number of statements.

Of course.

However, the index method has one additional "advantage" over using an array, in
that it's very natural to incorporate into a multiple condition IF statement:

if (@ext & index(set, item || ' ') ^= 0) |
(@int & #i < #sep) then

> Second, using a varying string doesn't future-proff your code,

"sort of future-proof": if the number of items goes up both methods require
program changes, the only method that would be completely future proof (and a
lot slower) is a linked list.

> because you might have to increase the string length. In fact, the array has
> the edge for future proofing because the limit is larger.

However, using a CHAR (32767) VAR will in many cases be pretty future proof.
I've seen it being used for three-character country codes (natural limit:
200-ish), two-character legal entities (actual limit: 37) and a few more:
internal error message codes, currency codes, authorization lists of CICS
transactions.

John W Kennedy

unread,

Apr 8, 2013, 1:12:06 PM4/8/13

to

Look, you came here complaining that your source code produced
inefficient object code. It has been explained to you how your code
should be fixed, and all you can say is, "I wanna do it my way." Either
fix your code or stop complaining. Your instructor, by the way, was a
moron.

--
John W Kennedy
"...if you had to fall in love with someone who was evil, I can see why
it was her."
-- "Alias"

Robert AH Prins

unread,

Apr 8, 2013, 4:35:32 PM4/8/13

to

The code has already been changed to use a loop and we've actually found a way
to make access direct by adding an extra pointer that is initialized during the
first pass through the data.

In all other posts I've been replying to what was posted in reply to my original
post.

> Your instructor, by the way, was a moron.

No, my instructor, in 1985, was simply someone who had gown up in the company,
knew all the ins and outs of the business and was a pretty decent programmer,
but one who probably never had the need (nor access to the tools) to look at the
performance of the code.

If you want to find real morons, look no further than to the sysprogs of a
Belgian health insurer, who were still running Enterprise PL/I V3.7 on a z9
system using ARCH(2) and the LE "STORAGE(0,0,0,0)" option in 2009...

Peter Flass

unread,

Apr 8, 2013, 4:02:37 PM4/8/13

to

On 4/8/2013 1:26 PM, Robert AH Prins wrote:
>
> However, using a CHAR (32767) VAR will in many cases be pretty future
> proof. I've seen it being used for three-character country codes
> (natural limit: 200-ish), two-character legal entities (actual limit:
> 37) and a few more:
> internal error message codes, currency codes, authorization lists of CICS
> transactions.
>

OK, future-proof in that it can handle larger size country codes if such
are ever defined. Otherwise the character string and the array are both
allocated at run time, you get no savings for "CHAR(32767) VAR" vs.
"(8192) CHAR(4)". In fact, since you no longer need the blank delimiter
you save a bit of storage.

--
Pete

Robin Vowels

unread,

Apr 9, 2013, 3:13:08 AM4/9/13

to

You do get savings in run time.

Storage can be allocated before run time, by making the storage
STATIC.

In the case of the array, of course, the array can be extended
as much as is wished, by making the array CONTROLLED.

John W Kennedy

unread,

Apr 9, 2013, 2:05:55 PM4/9/13

to

CONTROLLED can have severe hidden performance costs, with CHAR(4) being
treated as CHAR(*) and the 4 being treated only as a default at
ALLOCATE time. 99 times out of a hundred, BASED is better.

My usual habit, when I was coding in PL/I, was to build a linked list
(either directly or with CONS cells), and then make an array, either
with copies of the data or with pointers to the original). To improve
batch performance, I generally allocated in AREA variables and used
EMPTY() for fast cleanup. But that scheme assumes that the key set is
read one and not altered during processing; however, I can't offhand
recall that ever /not/ being the case.

Of course, in any modern language, you just use a built-in hashset type
with automatic stretching and shrinking.

--
John W Kennedy
Having switched to a Mac in disgust at Microsoft's combination of
incompetence and criminality.

Robert AH Prins

unread,

Apr 9, 2013, 5:36:41 PM4/9/13

to

On 2013-04-09 18:05, John W Kennedy wrote:
> On 2013-04-09 07:13:08 +0000, Robin Vowels said:
>
>> On Apr 9, 6:02 am, Peter Flass <Peter_Fl...@Yahoo.com> wrote:
>>> On 4/8/2013 1:26 PM, Robert AH Prins wrote:
>>>
>>>> However, using a CHAR (32767) VAR will in many cases be pretty future
>>>> proof. I've seen it being used for three-character country codes
>>>> (natural limit: 200-ish), two-character legal entities (actual limit:
>>>> 37) and a few more: internal error message codes, currency codes,
>>>> authorization lists of CICS transactions.
>>>
>>> OK, future-proof in that it can handle larger size country codes if such
>>> are ever defined. Otherwise the character string and the array are both
>>> allocated at run time, you get no savings for "CHAR(32767) VAR" vs.
>>> "(8192) CHAR(4)". In fact, since you no longer need the blank delimiter
>>> you save a bit of storage.
>>
>> You do get savings in run time.
>>
>> Storage can be allocated before run time, by making the storage STATIC.
>>
>> In the case of the array, of course, the array can be extended as much as
>> is wished, by making the array CONTROLLED.
>
> CONTROLLED can have severe hidden performance costs, with CHAR(4) being
> treated as CHAR(*) and the 4 being treated only as a default at ALLOCATE
> time. 99 times out of a hundred, BASED is better.

I'm not sure if that is still the case, but with the old OS 2.3.0 compiler it
was certainly true. Changing a program that was using string(acct_no(i)) in a
compare in a loop, where acct_no was (simplified) declared as

dcl 1 acct_no(n) ctl,
2 le char (2),
2 acct char (5),
2 ccy char (3);

into using simply "acct_no char(10)" reduced the CPU time from 3+ hours to
around 20 minutes, after Strobe showed this to be one very, very hot spot.

> My usual habit, when I was coding in PL/I, was to build a linked list (either
> directly or with CONS cells), and then make an array, either with copies of
> the data or with pointers to the original).

That's not an uncommon way of combining the extendability of a linked list with
the direct access of an array. From a purely theoretical point of view (I've
never tried it), you should actually be able to do this, provided all items in
the linked list are of the same size (and that size is a multiple of 8) and are
allocated consecutively inside an area, by basing the array over the area, with
an offset of 16 bytes to skip the area header.

> To improve batch performance, I generally allocated in AREA variables and
> used EMPTY() for fast cleanup.

http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=31632

> But that scheme assumes that the key set is read one and not altered during
> processing; however, I can't offhand recall that ever /not/ being the case.
>
> Of course, in any modern language, you just use a built-in hashset type with
> automatic stretching and shrinking.

Dreaming don't cost nothing... :)

John W Kennedy

unread,

Apr 10, 2013, 12:31:31 AM4/10/13

to

It's a language point. Any CONTROLLED variable that is passed as a
parameter or is EXTERNAL /might/ be subject to an ALLOCATE statement
there that /might/ change any dimension or length. It's especially
hideous when you coded BIT(1) and find that every test is a subroutine
call.

--
John W Kennedy
"The bright critics assembled in this volume will doubtless show, in
their sophisticated and ingenious new ways, that, just as /Pooh/ is
suffused with humanism, our humanism itself, at this late date, has
become full of /Pooh./"
-- Frederick Crews. "Postmodern Pooh", Preface

Peter Flass

unread,

Apr 10, 2013, 7:43:19 AM4/10/13

to

Yes, if they could fix *one* thing I'd wish for this. CONTROLLED data
declared with * lengths or bounds could be alterable at run time, and
anything declared with constant lengths or bounds not.

--
Pete

Robert AH Prins

unread,

Apr 10, 2013, 12:49:12 PM4/10/13

to

*process rules(nolaxctl); (Since at least EPLI V3.5)

LAXCTL | NOLAXCTL

Specifying LAXCTL allows a CONTROLLED variable to be declared with a constant
extent and yet to be allocated with a differing extent. NOLAXCTL requires that
if a CONTROLLED variable is to be allocated with a varying extent, then that
extent must be specified as an asterisk or as a non-constant expression.

The following code is illegal under NOLAXCTL:

dcl a bit(8) ctl;
alloc a;
alloc a bit(16);

But this code would still be valid under NOLAXCTL:

dcl b bit(n) ctl;
dcl n fixed bin(31) init(8);
alloc b;
alloc b bit(16);