Pick Tip : Working with BIG items


Tony Gravagno

Oct 8, 2019, 6:48:15 PM
to Pick and MultiValue Databases
OK, here's another tip/trick which will probably earn a lot of rebuttal. Hey, it's just one approach to solving a problem, not THE answer for all that ails ya.

The problem is that in some platforms (* cough * "D3"), VAR<-1> will drag down a process as the variable gets larger. Operations on data in the middle of large dynamic arrays have the same effect.

The cause of the problem is that to place a new attribute at the end of a dynamic array, these specific platforms start scanning from the front of the variable's storage and chase the overflow chain all the way to the end before appending the new data. Similarly, inserting or deleting data in the middle of the block causes everything after it to shift up or down, causing a huge series of frame updates and re-linkages.

One quick way to alleviate some of this pain is to use FlashBASIC, where text strings are manipulated in C memory rather than in frame space. But exactly how it does this isn't documented, and there is another approach that can help even more.

This tip/trick is to create huge blocks in memory and operate within those blocks.

There are a couple ways to do this. First, start with a huge string:
BLOCK = SPACE(10000000) ; * 10MB
There is a one-time performance hit for creating that block as new frames are pulled from local or system overflow and linked.
After that, rather than using VAR<-1>, manipulate substrings within that block:

* Old way:
BLOCK = ""
... * add new attribute
BLOCK<-1> = NEW.DATA
* New way
BLOCK = SPACE(10000000)
BLOCK.SIZE = 0 ;* our own pointer to the logical end of the data
... * add new attribute
DATA.SIZE = LEN(NEW.DATA) + 1 ;* +1 for the attribute mark
BLOCK[ BLOCK.SIZE+1 , DATA.SIZE ] = NEW.DATA : @AM ;* overlay in place, no resize
BLOCK.SIZE += DATA.SIZE

The actual size of the block hasn't changed; we've only updated our own pointer to the end of the data.

I can't explain the internals of why higher BLOCK.SIZE values don't cause the same sort of performance pain we see when the runtime chases to the end of a dynamic array. I think there are at least these two factors:
1) It's just moving forward for a count of bytes, not doing a scan for the last attribute mark. (BAL: MOV vs MICD? Been a while.)
2) When it over-writes the bytes, it's not checking for new frame organization - it knows the size of the data hasn't changed.

D3 has a built-in feature where it will not directly write to the original frames. It copies the existing frames, updates the copy, swaps the pointer over to the copy, then discards the original chain. This is called Update Protection. Some of you may recall that your system runs faster if you turn off Update Protection ... and you may also remember the pain of dealing with GFEs. This is how they "mostly" solved that problem. Imagine how much overhead Update Protection adds here if it does all of that for every modification to a dynamic array in a tight loop.

I can't completely explain the mechanics here. But I can tell you that I've used this technique to cut the processing time of some jobs from tens of minutes down to tens of seconds. No kidding. Point out inaccuracies in my code sample above and argue about semantics if you will, but when implemented properly this technique offers dramatic rewards.

Here's the same technique with a different approach...
Consider using the %functions and operating in C memory rather than in frame space. With %malloc to allocate a block of several megabytes, you avoid operations in frame space completely.
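
A minimal sketch of the shape, assuming your FlashBASIC release exposes the C library through the %function interface (the names and exact signatures here are my assumptions; verify them against the D3 docs):

BUF = %malloc(10000000)   ;* assumed: returns a pointer into the C heap, outside frame space
IF BUF = 0 THEN
   CRT "malloc failed"
   STOP
END
* ... operate on the buffer via the other %functions ...
%free(BUF)                ;* always release it, or the port leaks C memory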

When you're done with these uber-variables, just trim off the back-end:
BLOCK = BLOCK[1,BLOCK.SIZE]
When working with %functions, be sure to release the memory.
Yes, there is a small performance hit there, and one final one for the disk write, but that's a total of three such updates, compared to one on every loop iteration.
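
To pull the pieces together, here's roughly what a whole job looks like with this technique. This is a sketch with my own file and variable names, and with a -1 on the trim because this loop leaves a trailing attribute mark:

BLOCK = SPACE(10000000)   ;* update 1: build the big block once
BLOCK.SIZE = 0
OPEN "SRCFILE" TO SRC.F ELSE STOP 201,"SRCFILE"
SELECT SRC.F
LOOP
   READNEXT ID ELSE EXIT
   READ ITEM FROM SRC.F,ID ELSE CONTINUE
   NEW.DATA = ID:"*":ITEM<1>   ;* whatever you're accumulating
   DATA.SIZE = LEN(NEW.DATA) + 1
   BLOCK[BLOCK.SIZE+1,DATA.SIZE] = NEW.DATA : @AM   ;* overlay, no resize
   BLOCK.SIZE += DATA.SIZE
REPEAT
BLOCK = BLOCK[1,BLOCK.SIZE-1]   ;* update 2: one trim at the end
OPEN "OUTFILE" TO OUT.F ELSE STOP 201,"OUTFILE"
WRITE BLOCK ON OUT.F,"RESULT"   ;* update 3: one disk write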

Fortunately, more modern MV platforms don't show the same pain, because they maintain a pointer to the end of the data stream. But manipulating the middle of a large dynamic array still causes a big shift of data. If you're wondering about doing that kind of manipulation in the middle of the BLOCK, it's just normal string handling, something like:

ORIGINAL = BLOCK[START.HERE,END.HERE]
BLOCK[START.HERE,DATA.SIZE + LEN(ORIGINAL) + 1] = NEW.DATA : ORIGINAL

For that kind of manipulation, ORIGINAL itself should get the same treatment as BLOCK, so that it doesn't resize either while temporarily holding huge strings:
ORIGINAL[1,END.HERE-START.HERE-1] =
     BLOCK[START.HERE,END.HERE] :
     SPACE( LEN(ORIGINAL) - END.HERE-START.HERE-1 )

In all of the examples above I am not at all being careful about my math. This is in part because I'm lazy, and in part to accentuate a point: if your final block size is not exactly the same as the original, then the variable will shift in memory/frames and you will encounter the same performance overhead that you are trying to defeat. The test of your success is simply the timing. If you go through a lot of work and it's not something like 60 times faster, then the math is wrong and the variables are changing sizes. You know you have it right when the job finishes in seconds and you find yourself wondering if it even did the job. :)
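
If you want a quick harness for that timing test, something like this works (assuming SYSTEM(12) returns milliseconds, as it does on D3; check your platform):

* Baseline: dynamic append
START = SYSTEM(12)
V = ""
FOR I = 1 TO 100000
   V<-1> = "SOME DATA"
NEXT I
CRT "Dynamic append: ":(SYSTEM(12) - START):" ms"
* Preallocated overlay, same data
START = SYSTEM(12)
BLOCK = SPACE(1000000)   ;* 100,000 iterations * 10 bytes each, exact fit
BLOCK.SIZE = 0
FOR I = 1 TO 100000
   S = "SOME DATA":@AM
   BLOCK[BLOCK.SIZE+1,LEN(S)] = S
   BLOCK.SIZE += LEN(S)
NEXT I
CRT "Preallocated overlay: ":(SYSTEM(12) - START):" ms"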

As to scalability, I think I tested this up to 20MB, and I think it started to wear down at about 15MB. Beyond that we need to ask ourselves if we're really using the right tools.

Yes, there are other tricks for doing this:
- People learned long ago that sending output into the spooler avoids performance overhead. Just change all of your VAR<-1> statements to PRINT, then copy the resulting hold entry into frame space.
- And it's worth experimenting with DIM BLOCK(10000), building small blocks in each element, and then using MATBUILD/MATPARSE to switch between dynamic arrays when necessary. This approach doesn't work as well when inserting data in the middle of the block.
- Rather than building a huge item in MV, it might be better to build records at the OS level and spawn off processes that will sequentially concatenate new data using something like "cat new >> big". This could work really well though there is overhead with the OS level disk updates, especially in Windows if you use a lot of file handles.

And of course one could say this is all a huge hassle to implement. Yeah, I know, been there, done this. My approach was to create a subroutine that encapsulated the ADD.ATTRIBUTE functionality, and then replace all of the VAR<-1> instances with GOSUBs. It wasn't that bad once the pattern was established. I just had to keep reminding myself that the ugly code was worth the expected reward.
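
For the curious, the shape of that refactor is tiny. A sketch from memory, not the original code; a production version would also watch that BLOCK.SIZE doesn't overrun the preallocated block:

* Everywhere that read VAR<-1> = SOMETHING becomes:
NEW.DATA = SOMETHING
GOSUB ADD.ATTRIBUTE
...
* And in one place:
ADD.ATTRIBUTE:
   DATA.SIZE = LEN(NEW.DATA) + 1
   BLOCK[BLOCK.SIZE+1,DATA.SIZE] = NEW.DATA : @AM
   BLOCK.SIZE += DATA.SIZE
RETURN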

This post isn't intended to cover all solutions. It's specifically to focus on this one class of solution for this one problem. Details about the spooler solution and others can be found in old CDP archives, and someone might want to refresh them into new threads here. My goal with these tips is simply to focus on one point at a time, give each its own space, and subject it to public scrutiny. So if you like other solutions, please post about them in their own threads so that they get the individual reviews they deserve.

And if you have tips of your own, please do profile them in new threads in this group.

HTH
T

MAV

Oct 9, 2019, 4:07:34 AM
to Pick and MultiValue Databases
Hi Tony

Sometimes I have used a method very similar to yours.

In D3, on other occasions, when I had to fill a large dynamic array and didn't know its final size, I reserved memory blocks with DIM:

     BLOCK.SIZE  = 10000
     BLOCK.NUM   = 1
     DATASET.RESERVED = BLOCK.SIZE * BLOCK.NUM
     DATASET.POS = 0
     DIM DATASET(DATASET.RESERVED)
     *** Fill DATASET
     OPEN "FILE" TO FILE ELSE ABORT
     H = 0
     SELECT FILE 
     LOOP
      READNEXT ID ELSE EXIT
      READ REG FROM FILE,ID ELSE CONTINUE
        ***
        *** Conditions, etc.
        ***
        H += 1
        DAT = "Línea de datos  ":H
        GOSUB FILL_DATASET
     REPEAT
     ***
     DATOS = ""
     MATBUILD DATOS FROM DATASET
     STOP
***************************************************************
*** Needs: DAT
***   
FILL_DATASET:
     DATASET.POS += 1
     IF DATASET.POS > DATASET.RESERVED THEN
        * Grow by one more block; this relies on the runtime re-DIM
        * preserving the elements already filled.
        BLOCK.NUM += 1
        DATASET.RESERVED = BLOCK.NUM * BLOCK.SIZE
        DIM DATASET(DATASET.RESERVED)
     END
     DATASET(DATASET.POS) = DAT
     RETURN
     


Marcos Alonso Vega

Tom M

Jan 13, 2020, 7:32:22 PM
to Pick and MultiValue Databases
Hi Tony,

Here's a little two-liner to show the difference in the opcodes:

001 X<-1>='DYNAMIC'
002 X[10000000,7]='DYNAMIC'

This compiles into:

0001 0000 16  LOAD           VAR(1)
0001 0005 07  LOAD 1        
0001 0006 4E  NEGATE        
0001 0007 08  LOAD 0        
0001 0008 08  LOAD 0        
0001 0009 09  LOADS          "DYNAMIC"
0001 0012 92  REPLACE       
0001 0013 18  STOREA         VAR(1)
0001 0018 01  EOL           
0002 0019 05  LOADN          -1258999.0684
0002 0020 05  LOADN          7
0002 0027 16  LOAD           VAR(1)
0002 002C 09  LOADS          "DYNAMIC"
0002 0035 DC  OVLY SUBS     
0002 0036 18  STOREA         VAR(1)
0002 003B 01  EOL           
0003 003C 01  EOL           
0004 003D 45  EXIT          

As you said, the REPLACE method does a linear search to find where the string goes. Then it has to shuffle string space around as part of the replace (or append) operation.

The OVLY SUBS operation must be optimized to go where it needs to go, and it "overlays" the data rather than shifting and moving. I use this extensively whenever I have any heavy string lifting to do. If I can calculate at least a ballpark of how much space I need, then I'll create a string that long in advance. That cut my base64 encoding/decoding time in half.

The result still gets STOREA'd back into the variable, but the path to get there must be less traveled.
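
To make Tom's preallocation point concrete with base64 (my sketch, not Tom's code): the encoded size is exactly computable up front, 4 output bytes for every 3 input bytes, so the output string can be allocated once and never resized:

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
SRC = "Hello MultiValue world"
SRC.LEN = LEN(SRC)
PAD = MOD(3 - MOD(SRC.LEN,3), 3)       ;* how many "=" the tail needs
WORK = SRC:STR(CHAR(0),PAD)            ;* zero-pad to a multiple of 3 bytes
OUT = SPACE((SRC.LEN + PAD) / 3 * 4)   ;* exact final size, allocated once
OUT.POS = 0
FOR I = 1 TO LEN(WORK) STEP 3
   N = SEQ(WORK[I,1])*65536 + SEQ(WORK[I+1,1])*256 + SEQ(WORK[I+2,1])
   Q = B64[INT(N/262144)+1,1]          ;* bits 23-18
   Q = Q:B64[MOD(INT(N/4096),64)+1,1]  ;* bits 17-12
   Q = Q:B64[MOD(INT(N/64),64)+1,1]    ;* bits 11-6
   Q = Q:B64[MOD(N,64)+1,1]            ;* bits 5-0
   OUT[OUT.POS+1,4] = Q                ;* overlay - OUT never grows or shrinks
   OUT.POS += 4
NEXT I
IF PAD THEN OUT[OUT.POS-PAD+1,PAD] = STR("=",PAD)
CRT OUT

Decoding is the same idea in reverse: the plaintext size is LEN(IN)/4*3 minus the padding, so it can be SPACE()d up front too.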

Tom

Peter McMurray

Jan 14, 2020, 5:03:10 PM
to Pick and MultiValue Databases
Whenever I see Var<-1> = someBigItem I check carefully. Is there a better way? Invariably the answer is yes.
Where does this happen? Typically when transferring data between disparate systems, because in Pick a large item like this should normally be split into individual items in a file.
D3 handles dimensioned arrays superbly, just as the original Pick did. Every element of a dimensioned array is itself a multivalued array capable of being identified uniquely and immediately in memory.
We handle significant .csv files daily with credit card transfers. The HOSTS file in DM contains several helpful variants.
A Starcard csv can be any length, as it can contain several days' worth of credit card details for fuel purchases from any Caltex distributor in the country.
 ***** Do Transfer
 CHK = LEN(STARDIR)
 IF STARDIR[CHK,1] NE "\" THEN STARDIR = STARDIR:"\"
 SRCEDIR = "DOS:":STARDIR
 CONVERT "\" TO "/" IN SRCEDIR
 OPEN SRCEDIR TO SRCEF ELSE
     A.ERRINFO = "The Path ":STARDIR:" Is Invalid"
     A.ERRSEV = -1
     CALL "ARGO_ERRORCHK"
     A.ERR = A.ERRINFO
     RETURN
 END
 READ STARITEMS FROM SRCEF,STARFILE ELSE
     A.ERRINFO = "The Path ":STARDIR:STARFILE:" Is Invalid"
     A.ERRSEV = -1
     CALL "ARGO_ERRORCHK"
     A.ERR = A.ERRINFO
     RETURN
 END
 
 SELECT STARITEMS TO SOURCE
 ***** Start Capturing Data
 FINISH = NOVAL
 recCnt = 0
 LASTLIN = NOVAL
 csInvoice = NOVAL
 LOOP
     READNEXT Rec FROM SOURCE ELSE FINISH = AM
 WHILE FINISH = NOVAL DO

In this case we strip the commas and quotation marks from the individual record in an external subroutine, then rebuild the Pick record into internal format for numbers, dates, etc. We add some customer information and store it as an individual Pick item.

Once records are in a separate file the full power of the Pick report can be used.
If one wishes to reverse the procedure one can, as TG suggested, print it and use the HOST reference to dump it out.
However a cleaner way, with more control, is to SELECT the source file and DIM an array to build the output. Orders of magnitude faster than VAR<-1>.
Then simply use the HOST definition to output it.

The issue of inserting into an array is also simple if one starts with a dimensioned array.
Remembering that every element of a dimensioned array is itself an array, simply insert as follows.
:compile mmbp trickit (o)
trickit
..

[820] Creating FlashBASIC Object ( Level 0 ) ...
[241] Successful compile!   2 frame(s) used.
:run mmbp trickit
:ct test initial new

    initial
001 1
002 2
003 3
004 4
005 5
006 6
007 7
008 8
009 9
010 10

    new
001 1
002 2
003 3
004 4
005 5
006 5.1
007 5.2
008 5.3
009 5.4
010 5.5
011 5.6
012 6
013 7
014 8
015 9
016 10

:ct mmbp trickit

    trickit
001 open '','test' to tesf else crt "boom";stop
002 lin = ""
003 for a = 1 to 10
004 lin<-1> = a
005 next a
006 write lin on tesf,"initial"
007 dim new(10)
008 read new from tesf,'initial'
009 for a = 1 to 6
010 new(5)<-1> = "5.":a
011 next a
012 write new on tesf,"new"
013 end

Tony Gravagno

Jan 14, 2020, 6:29:48 PM
to Pick and MultiValue Databases
Thanks for the contribution, Peter. The only thought I have at the moment is that I _sometimes_ avoid dimensioned arrays for three reasons.

First, depending on system and context, DIM is a compiler directive and not a runtime operation. That means the number of elements is fixed at compile time, not run time. Yes, we can re-DIM, but that, MATPARSE, MATBUILD, and some other manipulations trade off a bit of runtime performance, so the trick is to figure out how big the array needs to be and only rebuild it once or twice.

Second, we don't always know the maximum number of elements at compile time. I might allocate 10,000 elements and have to modify the code later.

Third, some environments have a limit on the size of a dimensioned array or maybe the total number of elements. For example, 32k elements was, perhaps only in the distant past, a limit, which might have been for a single dimension X(32767) or maybe split across dimensions (almost certainly wrong) X(1024,32). Or maybe it was X(32766,32766).

My memory of the exact nuances is not fresh and is probably a relic of ancient history. But because of those relics of mind I've lowered my expectations when writing cross-platform code. Honestly, I think the only place I had these considerations was with NebulaXLite (which creates Excel workbooks from BASIC), where old Excel was limited to 32k rows, the rows were defined by hundreds of thousands of attributes, and the code needed to be supported over old AP as well as the latest OS/DBMS. These considerations are probably not valid for any modern system and need to be tested and tossed on a system-by-system basis.

T

Tony Gravagno

Jan 14, 2020, 6:40:28 PM
to Pick and MultiValue Databases
Tom, you've quickly earned my respect in your posts. Thanks for your participation.
Yeah, it's up to the MVDBMS companies to optimize what happens in those REPLACE and OVLY SUBS opcodes. In D3 that's done in assembler/abs or for FlashBASIC in C++. I believe in U2 they addressed those concerns long ago, and with D3 and mvBase sharing engineering resources with U2, I hope that D3 inherits the optimizations at some point. (I'll ask about that.)

LOADN  -1258999.0684 ?
That was a surprise.

Tip: In D3 you can use the Compile verb with the C option to eliminate all of those EOL opcodes. They're only there for debugging and for displaying line numbers when a program aborts. If you're running stable code you can squeeze out a tiny bit of performance (processing+memory) by removing them.
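
For example, something like this at TCL (placeholder file and program names; verify the option letter on your release):

:compile bp myprog (c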

Regards,
T

Wols Lists

Jan 14, 2020, 7:10:10 PM
to mvd...@googlegroups.com
On 14/01/20 23:29, Tony Gravagno wrote:
>
> First, depending on system and context it's a compiler-directive and not
> a runtime operation. This means the number of elements is fixed at
> compile time, not run time. Yes, we can REDIM. That and MATPARSE,
> MATBUILD, and some other manipulations have a trade off a bit of runtime
> performance, so the trick there is to figure out how big the array needs
> to be and only rebuild it once or twice.

Bear in mind you never used to be able to REDIM - it was a PI addition
and I think it doesn't work in Pick flavour.

Secondly, I wrote a bunch of sort routines, and the READ, COUNT, REDIM,
MATPARSE, MATBUILD sequence was well worth doing even for quite small arrays ...
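
A sketch of that sequence, assuming a flavour where DIM takes a runtime expression and MATPARSE/MATBUILD have the common syntax (on D3, a plain ARR = ITEM assignment does the implicit MATPARSE, as Peter's dimit program below shows):

OPEN "MYFILE" TO F.IN ELSE STOP 201,"MYFILE"
READ ITEM FROM F.IN,"MYID" ELSE STOP
N = COUNT(ITEM,@AM) + 1   ;* attribute count
DIM ARR(N)                ;* size the array to exactly what's needed
MATPARSE ARR FROM ITEM    ;* one attribute per element
* ... sort or otherwise manipulate ARR in place ...
MATBUILD ITEM FROM ARR    ;* and back to a single dynamic array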

Cheers,
Wol

Peter McMurray

Jan 14, 2020, 7:18:53 PM
to Pick and MultiValue Databases
Hi TG
DIM is a runtime command, not a compile-time one. I did say that I am selecting and creating csv files of unknown size.
:compile mmbp dimit (o
dimit
.

[820] Creating FlashBASIC Object ( Level 0 ) ...
[241] Successful compile!   2 frame(s) used.

:sane
Term setting: 79,25,0,7,1,8,132,60,n

:run mmbp dimit

 0100175065
645

:ct mmbp dimit

    dimit
001 open '',"dos:d:\d3\" to bigf else crt "boom";stop
002 read it from bigf,"big.txt" else crt "no big.txt";stop
003 chk = dcount(it,@am)
004 dim new(chk)
005 new = it
006 for a = 1 to chk
007 crt @(1,10):new(a)[1,10]
008 next a
009 crt a



Peter McMurray

Jan 14, 2020, 7:26:44 PM
to Pick and MultiValue Databases
Hi
I should also have said never to use re-DIM more than once, as one beginner did when changing the DOSTRANSFER program in R83. He re-dimmed for every record, starting at 1, and killed the system.
You may notice that insert is always OK in a dimensioned array and disastrous in a dynamic array of any size.
The simple fact is that every insert requires the program to find sufficient space to make the new array, move the first part there, append the change, and then append the remainder. It is 42 years since I wrote assembler, where we had MOVE and MOVEP; the latter just overwrote anything in the way. I doubt that much has changed.

Tony Gravagno

Jan 15, 2020, 12:19:18 PM
to Pick and MultiValue Databases
It is correct that from AP forward, DIM accepts an expression for run-time evaluation of the element count. That wasn't the case in R83 (AP Ref v2, page 66), and I was never sure exactly which post-R83 platforms supported expressions. Local array definitions can be re-dimmed; arrays in Common or otherwise globally shared cannot, depending on details. I know this is not a compiler-vs-runtime spec; it's runtime memory management. The point is that each platform handles this in a way that we can hope is consistent, but it's entirely possible that nuances differ. For example, which platforms support a zero-th element? Which platforms default array values to zero, null, or unassigned? Do any platforms support a third dimension? Does every platform faithfully support passing file descriptors as elements across Common? I'm not asking these questions. I'm just saying this is an area where, if one is doing cross-platform development, we need to check the manuals to see how we are going to approach specific concepts. And because I haven't taken the time to do that cross-platform reading on this specific topic, I tend to keep my array usage simple.

Thanks.
T

Tony Gravagno

Jan 15, 2020, 12:21:09 PM
to Pick and MultiValue Databases


On Tuesday, January 14, 2020 at 4:26:44 PM UTC-8, Peter McMurray wrote:
You may notice that insert is always ok in a dimmed array and disastrous in a dynamic array of size.

That's actually the topic of this thread. ;)

Peter McMurray

Jan 15, 2020, 3:09:21 PM
to Pick and MultiValue Databases
ERGO! Problem Solved with DIM Foo(barcnt) :-)



Scott Ballinger

Jan 26, 2020, 7:22:48 PM
to Pick and MultiValue Databases
I'm with Peter on this. Using DIM to resize the array when it gets too big allows building attribute-delimited lists of tens of millions of records with minimal pain in D3. Universe has this problem too, but not nearly as much...

:ct es big.save-list

    big.save-list
001 * build a very large save-list as dimensioned array
002 * paste into your program...
003
004
005 * initialize
006 svlist.no = 1
007 dim svlist.array(svlist.no)
008 svlist.array(1) = ""
009
010
011 * update
012 svlist.array(svlist.no)<-1> = id
013 if len(svlist.array(svlist.no)) gt 4000 then  ;* keep size < 1 frame
014   svlist.no += 1
015   dim svlist.array(svlist.no)
016   svlist.array(svlist.no) = ""
017 end
018
019
020 * write
021 if svlist.array(1) ne "" then
022   open "pointer-file" to pf else stop 201,"pointer-file"
023   write svlist.array on pf,"your-list-name"
024 end
025

Most commonly I use this to leave behind a record (in the form of a save-list of all new or updated IDs) when processing a large number of items.

/Scott Ballinger
Pareto Corporation
Edmonds WA USA