Speeding up code - am I missing something?

24 views
Skip to first unread message

Robert Prins

unread,
Aug 15, 2018, 5:23:45 PM8/15/18
to
Hi all,

Not really specific x86, Virtual Pascal, or PL/I, but as I've got the same code
written in all three languages, I'll give it a try here:

I have an array (actually three, but that's not very relevant for the issue) of
pointers to items in a linked list. The array is sorted into a specific order,
and after the sort, I use it to print Top-N tables. For the main Top-50 table (x
3) that's easy enough, the first 50 items are selected.

However, at this time the linked list also contains 5 sub-lists, each containing
additional sub-lists for a total of 390 (yes, three-hunderd-and-ninety)
sub-lists for which I want to print Top-10 tables (or if a sub-list contains
just 4 items, a Top-4 table). Because I don't really want to sort the main array
six (x 3) times, I just scan the array from the top, and pick out the items for
each sub-list, until I have 10 (or the 4 mentioned before) items. To speed up
the process a little, I make an initial pass over the array to find the first
item for each sub-list, but that doesn't really improve things a great deal.

So given that "this" is the list, sorted on its main key, with the second and
third columns identifying additional sub-lists,

100, 'a', '1'
099, 'b', '9'
098, 'c', '1'
097, 'b', '8'
096, 'd', '3'
095, 'a', '2'
094, 'c', '8'
093, 'c', '1'
092, 'a', '3'
091, 'f', '4'
090, 'a', '2'

the "Top-50" would contain the above.

For sub-list 'a', I would scan the array, and the resulting "Top-10" would be

100, 'a', '1'
095, 'a', '2'
092, 'a', '3'
090, 'a', '2'

Likewise, for sub-list 'b', I would scan the array, and the resulting "Top-10"
would be, and the initial fill-the-cache scan would have set the starting index
for the 'b' sub-list to the cell containing 099, saving me one superfluous compare.

099, 'b', '9'
097, 'b', '8'

etc, and the same would happen for the sub-lists defined by column 3, the first
being

100, 'a', '1'
098, 'c', '1'
093, 'c', '1'

As I wrote, I make an initial pass over the data to cache the first item for
each sub-list, some of them consist of a single item and are nearly at the end
of the sorted full array, and this saves some time, the count of executed
statements is reduced by around 10%, but if anyone can think of a way to speed
up things a bit more, without having to sort the array(s) five more times (in an
additional 10,000,000 executed statements, see below), please share it with me.

Some numbers:

- the caching code executes around 120,000 PL/I statements
- sorting the three arrays executes around in 2,000,000 PL/I statements (using a
Heapsort)
- printing the 390 top-n tables executes around 7,600,000 PL/I statements, and
the "IF" statement that selects the data to be printed has a hit-percentage of
just 0.5%:

IF STY_TCN = 'Tr' & STY = LIST.TR | /* 208 type 1 */ 1,841,754 executed IF's
STY_TCN = 'Na' & TCN = LIST.NA | /* 93 type 2 */
STY_TCN = 'Yr' & STY = LIST.YR | /* 37 type 3 */
STY_TCN = 'Co' & TCN = LIST.CO | /* 33 type 4 */
STY_TCN = 'Ty' & TCN = LIST.TY | /* 19 type 5 */
STY_TCN = '**' THEN /* 1 main */
DO;
#T = #T + 1; 9,594 - the actual
number of Top-N lines printed

Robert
--
Robert AH Prins
robert(a)prino(d)org

Peter Flass

unread,
Aug 15, 2018, 9:29:30 PM8/15/18
to
Assuming mainframe, the rule-of-thumb I learned was "let the sort do the
work" since DFSORT is always more efficient than any user code. I can't
visualize what you're doing from your description, but if possible I'd make
one pass of the data and build records that can be sorted into the desired
sequence by DFSORT. Then read the sorted records and print the listings. I
would use PLISRTD to build and retrieve the records, but some would
probably prefer two separate programs with the sort in between invoked by
JCL, or build the records and ATTACH the sort, or...

--
Pete

Robert Prins

unread,
Aug 16, 2018, 1:00:55 PM8/16/18
to
The program processes my hitchhike data (yes, really), and I have three
versions, one written in PL/I, which runs on z/OS, one written in Virtual
Pascal, running on Windows, and one where most of the Pascal has been converted
to in-line assembler. All three should produce the same output, and when there
are differences (and there were some recently), I know that I'm doing something
wrong.

The full set of hitchhike data is kept in a single linked list, but every item
in the list contains a load of extra pointers to form additional lists,

1) per trip,
2) per type of driver,
3) per country,
4) per nationality of driver, and
5) per year.

The very first (Turbo) Pascal (V3.01a) version of the program produced just a
few tables (see
<https://prino.neocities.org/miscellaneous/keeping-statistics.html> for a longer
description), but over the years, most as a result of in-ride conversations with
my drivers "Did you ever consider determining ...?", I've added a "few" more tables.

This post refers to a relatively new set of tables, the Top-n ones. The program
produces a Top-50 table of the longest rides (actually three Top-50 tables, for
longest rides (km), longest rides (time), and fastest rides (km/h)) for the
whole set, and those three tables, but then limited to Top-10 (or Top-<10) are
then repeated for every trip (208 at the moment), type of driver (19), country
(1+33), nationality of the driver (93) and year (37) for a total of 390 tables.

Sorting the array of pointers to the items of the list in distance order gives
me all data in that order, and to obtain the top-10 fort a specific sub-list, I
just skip the unwanted data, and which saves me, for example for case 1) above,
three additional sorts (distance/time/velocity) where the trip-number is the
primary key (and ditto for all other sub-lists), at the, as is obvious from the
data presented, expense of rather a hell of a lot of extra compares.

And yes, on z/OS I could use the PLISRTx interface to DFSORT, and
using Virtual Pascal I could, with rather a lot more hassle, spawn M$' sort.exe,
but that would mean writing out the set of data five times in pure text format,
as SORT.EXE still cannot sort on anything other than a single starting column.
Hell, even using PLISRTD requires me to write out the data over and over again...

So, unless someone comes up with some kind of brilliant suggestion (like Paul
Green way back in 1998, when I asked how to speed-up another procedure(*)), I'm
probably stuck with what I have now, whether I like it or not. (I don't, but
c'est la vie...)

And lest you say, "Why the flipping 'ell are we discussing a program that deals
with data relating to hitchhiking?"

My answer to that question?

The very first time (October 2006) I compiled an older it with Enterprise PL/I
(V3.5.0) the compiler abended, not being able to handle the INIT of a based
structure with a power ("INIT ((dont-remember-what**2))"), and it's directly
responsible for

- https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=31632
- https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=84512
- https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=100073
- https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=115540

'nuff said?

Robert
--
PS:

(*) Executed PL/I statement counts:

- pre Paul Green: NAT_SCAN 5,407,617,546
- post Paul Green: NAT_SCAN 1,237,239

Old 2.3.0 compiler actually requires compilation with (NOFOFL): prefix to avoid
a decimal overflow exception abend in the PL/I library code.

Robert Prins

unread,
Aug 16, 2018, 1:46:08 PM8/16/18
to
On 2018-08-16 16:02, Terje Mathisen wrote:
> (newsgroups limited to just clax86, my server does not allow indiscriminate
> cross-posting.)
>
> Robert, you do realize that anything you are going to print will take billions
> of cycles for every second the printer needs to actually print the pages?

Really? ROFL...

Of course I do realise that once I/O is added, all bets are off. But as I've
mentioned in another reply you may already have seen, sometimes others approach
a problem from a completely different angle, with stunning results.

Somewhere in the back of my head I think I might be able to build the top-10 or
top<10 lists in my caching code, but that would mean adding another 5 pointers
to each item, and more arrays to keep track of start- and end-of-list addresses.
It may be worth a try, but I remember that I once changed the PL/I version to
use arrays of pointers and indices in the actual allocated items, and code
generated with, I think Enterprise PL/I V3.9, was pretty AD 1985 Borland TP
3.01a-ish, even at the highest level of optimisation. Virtual Pascal would not
be any better, even two consecutive

a:= array[i].a;
b:= array[i].b

statements result in the calculation of the address of the variable in the array
twice.

Robert
--
Robert AH Prins
robert(a)prino(d)org


>
> Terje

Robert Prins

unread,
Aug 17, 2018, 12:33:07 PM8/17/18
to
On 2018-08-17 06:49, Terje Mathisen wrote:
> Robert Prins wrote:
>> On 2018-08-16 16:02, Terje Mathisen wrote:
>>> (newsgroups limited to just clax86, my server does not allow
>>> indiscriminate cross-posting.)
>>>
>>> Robert, you do realize that anything you are going to print will take
>>> billions of cycles for every second the printer needs to actually print
>>> the pages?
>>
>> Really? ROFL...
>>
>> Of course I do realise that once I/O is added, all bets are off. But as
>> I've mentioned in another reply you may already have seen, sometimes
>> others approach a problem from a completely different angle, with stunning
>> results.
>
> OK, here's the most obvious (?) one:
>
> Create a function for each of these lists, put them all on an array and call
> them in order with the current item:
>
> Each time a list has been completed, the corresponding function pointer
> removes itself from the array of functions to be called.

Great minds (OK, one very great mind and my rather smaller one) seem to think
the same, I'm actually using a similar approach in the follow-up program that
converts the output of "LIFT" to RTF format:

@02:
mov eax, _s
lea eax, [eax * 4 + offset @03]
jmp dword ptr [eax]

align 4

@03:
dd offset @04 // 0 - Trip -> 0 s -> 1
dd offset @05 // 1 - Trip fall through -> 2
dd offset @06 // 2 - Type -> 2 s -> 3
dd offset @07 // 3 - Type fall through -> 4
dd offset @08 // 4 - Cnty -> 4 s -> 5
dd offset @09 // 5 - Cnty fall through -> 6
dd offset @10 // 6 - Nat -> 6 s -> 7
dd offset @11 // 7 - Nat fall through -> 8
dd offset @14 // 8 - Naco / Cnty #I/#E -> 8 s -> 9
dd offset @18 // 9 - Cnty #I/#E fall through -> 10
dd offset @20 // 10 - Year -> 10 s -> 11
dd offset @21 // 11 - Year fall through -> 12
dd offset @22 // 12 - V! tables -> 12 s -> 13
nop
nop
nop
nop

@04:

Sadly, this approach is not (entirely) possible here, because the Country top-10
tables are "incomplete", I've hitched through Bulgaria, Slovakia, and Andorra,
but I've never hitched in those countries.

Obviously I could cater for this, but while we've got this discussion going,
I've decided to try filling the entire top-N in the "caching" code, where I end
up going to nearly the bottom, entry 4,300 of (currently) 4,310 rides, anyway. A
quick test using the old PL/I version, where I can use a compiler option that
tells me how many times each statement is executed, tells me that the executed
statement count is reduced by more than 90% when I use this approach for the
Top-10s of trips. ;)

Only unpleasant side issue is the fact that I'm using 3-D arrays,

dcl 1 t10_trip(trip_total.trip) ctl,
2 tr_ix(3) fixed bin (31) init ((3)0),
2 tr_ktv(3, 10) ptr;

with

#j = t10_trip(lift_list.trip).tr_ix(1) + 1;
if #j <= hbound(tr_ktv, 3) then
t10_trip(lift_list.trip).tr_ktv(1, #j) = lift_ptr;

t10_trip(lift_list.trip).tr_ix(1) = #j;

to fill the top-10 slots, and worse for the Type/Country/Nat tables where I have
to do a lookup of the index. (Which leads to the question, is there a way using
all the new whiter than white x86 instructions to fast match a 4-byte value in
an array of up to (potentially) 234 values? I'm now using the "vpcmpistri" (see
another tread) to find CR's in the buffer (and don't understand why I require a
VPXOR or set them to 0x00), and similar code would probably faster than a
discretely coded version of CMPSD...)

which won't result in pretty code using Virtual Pascal. Need to check what
Enterprise PL/I generates, but even here, based on past experience, I'm somewhat
pessimistic.

More later,

Robert Prins

unread,
Aug 23, 2018, 10:56:01 AM8/23/18
to
On 2018-08-17 19:20, Robert Prins wrote:
> On 2018-08-17 06:49, Terje Mathisen wrote:
>> Robert Prins wrote:
>>> On 2018-08-16 16:02, Terje Mathisen wrote:
>>>> (newsgroups limited to just clax86, my server does not allow indiscriminate
>>>> cross-posting.)

<snip>

> Obviously I could cater for this, but while we've got this discussion going,
> I've decided to try filling the entire top-N in the "caching" code, where I end
> up going to nearly the bottom, entry 4,300 of (currently) 4,310 rides, anyway. A
> quick test using the old PL/I version, where I can use a compiler option that
> tells me how many times each statement is executed, tells me that the executed
> statement count is reduced by more than 90% when I use this approach for the
> Top-10s of trips. ;)
>
> Only unpleasant side issue is the fact that I'm using 3-D arrays,
>
> dcl 1 t10_trip(trip_total.trip) ctl,
>       2 tr_ix(3)        fixed bin (31) init ((3)0),
>       2 tr_ktv(3, 10)   ptr;
>
> with
>
> #j = t10_trip(lift_list.trip).tr_ix(1) + 1;
> if #j <= hbound(tr_ktv, 3) then
>   t10_trip(lift_list.trip).tr_ktv(1, #j) = lift_ptr;
>
> t10_trip(lift_list.trip).tr_ix(1) = #j;
>
> to fill the top-10 slots, and worse for the Type/Country/Nat tables where I
> have to do a lookup of the index. (Which leads to the question, is there a
> way using all the new whiter than white x86 instructions to fast match a
> 4-byte value in an array of up to (potentially) 234 values? I'm now using the
> "vpcmpistri" (see another tread) to find CR's in the buffer (and don't
> understand why I require a VPXOR or set them to 0x00), and similar code would
> probably faster than a discretely coded version of CMPSD...)
>
> which won't result in pretty code using Virtual Pascal.

Needlessly to say, the VP generated code is ugly as (insert your favourite
expletive)!

> Need to check what Enterprise PL/I generates, but even here, based on past
> experience, I'm somewhat pessimistic.

And this code seems to be pretty good using Enterprise PLI V4.5.0. Would need
use another system to see what Enterprise PL/I V5.2.2 generates.

Running time with VP generated code is around 5% less than old (find only
top-of-sublist) code. ;) Now just have to hack it into something more palatable.

One feature sadly missing from Pascal is the fact that PL/I allows you to use
'*' as array extents in a called proc, and the (hidden) descriptors that are
also passed to it allow you to call the same proc with

dcl 1 t10_summ(1) ctl,
2 su_ix(3) fixed bin (31),
2 su_ktv(3,50) ptr;

dcl 1 t10_trip(trip_total.trip) ctl,
2 tr_ix(3) fixed bin (31),
2 tr_ktv(3,10) ptr;

dcl 1 t10_cnty(0:lift_work.#cnty) ctl,
2 co_ix(3) fixed bin (31),
2 co_ktv(3,10) ptr;

dcl 1 t10_year(year_top -> year_list.year:
year_end -> year_list.year) ctl,
2 yr_ix(3) fixed bin (31) init ((3)0),
2 yr_ktv(3,10) ptr;

and a set of builtin functions (lbound and hbound) can be used to retrieve the
low and high bounds of the arrays. ;) It's a bit more convoluted to do this with
Pascal...

Peter Flass

unread,
Aug 23, 2018, 12:11:07 PM8/23/18
to
It's like the old FORTRAN or C, where you have to pass the upper bound as
an argument (lower bound always 1). PL/I and (I think Ada) don't require
this, din't know about ALGOL.

--
Pete

rug...@gmail.com

unread,
Aug 31, 2018, 12:47:18 PM8/31/18
to
Hi,

On Thursday, August 23, 2018 at 9:56:01 AM UTC-5, Robert Prins wrote:
>
> One feature sadly missing from Pascal is the fact that PL/I allows you to use
> '*' as array extents in a called proc, and the (hidden) descriptors that are
> also passed to it allow you to call the same proc with
> ...

I don't quite understand what you mean here.

> and a set of builtin functions (lbound and hbound) can be used to retrieve
> the low and high bounds of the arrays. ;) It's a bit more convoluted to do
> this with Pascal...

Are you talking about conformant arrays? Schemata? Open arrays?
Even Turbo Pascal (only later versions?) had LOW() and HIGH() built-ins.
Is that what you meant?

* https://www.freepascal.org/docs-html/ref/refsu14.html

Robert Prins

unread,
Sep 1, 2018, 4:06:45 AM9/1/18
to
Not really. or maybe.

Can Pascal (choose your flavour) handle passing these two different arrays of
structures to a single procedure:

dcl 1 s1,
2 s2(2,3),
3 v1 char (12),
3 v2 char (15),
2 s3(4,5),
3 v3 char (19),
3 v4 char (1);

dcl 1 t1,
2 t2(7,8),
3 w1 char (14),
3 w2 char (77),
2 t3(3,3),
3 w3 char (12),
3 w4 char (1);

Well, PL/I can:

myproc: procedure(parm);
dcl 1 parm,
2 p2(*,*),
3 w1 char (*),
3 w2 char (*),
2 s3(*,*),
3 w3 char (*),
3 w4 char (*);

and trying to assign a char(14) to w(1,1) will happily cut off the last two
characters when the procedure is passed s1. And when the SUBSCRIPTRANGE is
enabled, you'll get an error when trying to access w3(4,5) when t1 is passed to
the procedure.

I don't think there's an easy way to do the same with Pascal, but I'm ready to
be corrected.

Marco van de Voort

unread,
Sep 1, 2018, 7:20:49 AM9/1/18
to
On 2018-09-01, Robert Prins <rob...@prino.org> wrote:
> I don't think there's an easy way to do the same with Pascal, but I'm
> ready to be corrected.

Not if you force static allocation. Since Delphi2 (1?) strings are mostly
dynamically allocated, and the static allocation stuff didn't develop
further.

rug...@gmail.com

unread,
Sep 3, 2018, 8:32:33 PM9/3/18
to
Hi,

On Saturday, September 1, 2018 at 3:06:45 AM UTC-5, Robert Prins wrote:
> On 2018-08-31 16:47, rug...@gmail.hates.spam wrote:
> >
> > Are you talking about conformant arrays? Schemata? Open arrays?
> > Even Turbo Pascal (only later versions?) had LOW() and HIGH() built-ins.
> > Is that what you meant?
> Not really. or maybe.
>
> Can Pascal (choose your flavour) handle passing these two different arrays of
> structures to a single procedure:
>
> Well, PL/I can:

But I don't grok PL/I, so you'll have to show me in a more Pascal-friendly pseudo-code.

> I don't think there's an easy way to do the same with Pascal, but I'm ready to
> be corrected.

"Pascal" can mean any number of dialects and offshoots. Yes, there are
several (potential) ways to solve the problem.
Reply all
Reply to author
Forward
0 new messages