Interesting performance observation


David Knight

Aug 31, 2013, 3:58:10 AM
to mvd...@googlegroups.com
I noticed something interesting the other day.

Using d3/win, I had to write a utility to "pull apart" an ASCII comma-separated, quote-delimited text file stored in the Windows file structure.

Pretty easy, methought; and it was. As it was a comma-separated file, I could pull apart the 'n' fields per line simply by using the group extract OCONV. Looping over 'n', I could use a group OCONV of "Gn,1". I could then test the result to see if the first char was a quote and, if it was, use another OCONV such as 'G0"1' to strip off the quotes. Map the results to my d3 item, and away I go.

Yup, worked fine. And fast too.

But [& some may have guessed this was going to happen], some numb-nuts, when creating the data file, had text fields which included commas, and thus threw out my whole nice OCONV bit. So, I had to change the approach to one where I manually parsed the line looking for commas & quotes, deciding appropriately whether each was, or was not, the beginning/end of a field. Not too hard, either: simply looping through 'n' with a text extract of [n,1].

But here's the thing: it was soooo sloooow in comparison to the OCONV method; which got me to thinking: surely the internals of d3 must be doing something similar anyway, so why is the char-by-char method sooo much slower? How does d3 internally 'find a separator' so much more quickly? I then thought about other tools such as Excel, which when opening an ASCII file can 'look' at it really quickly and work out where the separators are, and thus the fields.

Anyone else know of a better way to handle comma separated, quote delimited ASCII file manipulation that allows for separators within the delimited fields?

Any couch-experts out there explain the performance bit?

Thoughts?

Mike Preece

Aug 31, 2013, 5:10:53 AM
to mvd...@googlegroups.com
Hello David,

Any chance you could post the relevant code?

Anthony Youngman

Aug 31, 2013, 7:03:05 AM
to mvd...@googlegroups.com
Something I did years and years ago, which gave me a massive performance
boost on PI ... MATPARSE.

DIM PARSEARRAY(100,2)

MATPARSE PARSEARRAY FROM LINE, ',"'

It sticks the text in the first dimension, and the delimiter in the
second. So it still takes a bit of effort to parse and rebuild the
dimensioned array, but it's a lot easier to fly through.
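The same idea, sketched in Python for anyone reading along outside Pick (illustrative only; MATPARSE itself is the BASIC statement above): a capturing split keeps the delimiters, so you get the text chunks in one "dimension" and the delimiters in the other.

```python
import re

def matparse(line, delims=',"'):
    """Rough analogue of MATPARSE with a delimiter list: split the
    line on any of the delimiter characters, keeping the delimiters.
    Text chunks end up in one list, delimiters in the other."""
    parts = re.split('([' + re.escape(delims) + '])', line)
    # parts alternates text, delimiter, text, delimiter, ...
    return parts[0::2], parts[1::2]

texts, seps = matparse('"red",blue')
# texts is ['', 'red', '', 'blue']; seps is ['"', '"', ',']
```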
>
> Any couch-experts out there explain the performance bit?
>
> Thoughts?
>
Watch out for your csv definition and where it comes from. There's no
standard, so if the source changes it could easily break your nicely
crafted code.

Cheers,
Wol

George Gallen

Aug 31, 2013, 11:34:50 AM
to mvd...@googlegroups.com
What ticks me off more than the embedded commas, or the doubled quotes
for embedded quotes, are the freakin' embedded <CR>s, usually from
websites, which when the file is read in break one line into two. Then
you need to start checking whether the number of header fields differs
from the number of fields on your current line, and keep concatenating
the next line onto the current one until you have the right number of
fields, hoping all the while that the last field doesn't contain one
as well.
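That glue-the-lines-back-together dance is exactly what a quote-aware reader automates. As a cross-language illustration (not MV code), Python's csv module keeps consuming physical lines until the open quote closes:

```python
import csv
import io

# One logical record spread over two physical lines, because a quoted
# field contains a real newline.
raw = 'name,comment\r\n"Smith","line one\nline two"\r\n'

rows = list(csv.reader(io.StringIO(raw)))
# The reader keeps reading physical lines until the quote balances,
# so the embedded newline survives inside the second field.
```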

George
--
You received this message because you are subscribed to
the "Pick and MultiValue Databases" group.
To post, email to: mvd...@googlegroups.com
To unsubscribe, email to: mvdbms+un...@googlegroups.com
For more options, visit http://groups.google.com/group/mvdbms

Charlie Noah

Aug 31, 2013, 11:35:31 AM
to mvd...@googlegroups.com
David,

I've found that you just can't reliably parse comma-delimited, quote-surrounded CSV files with an OCONV. There are so many loose interpretations of how a CSV should be structured, and so many embedded commas and quotes, that there are more exceptions than rules. Then there's the notoriously bad job Excel does in saving as a CSV. Unfortunately, Open Office doesn't do it a lot better, just differently. This is a huge argument for using tab-delimited with no surround character. That said, you probably aren't given a choice in the file structure, are you?

You're going to have to parse each line character by character, and then be very careful. Test, test, test, and there'll still be something that will bite you somewhere (embedded linefeeds come to mind).
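Character by character it boils down to a small state machine: one flag for "inside quotes", doubled quotes treated as a literal quote, and commas only splitting outside quotes. A sketch (Python for clarity, since the thread's language is Pick BASIC; no embedded-linefeed handling here):

```python
def parse_csv_line(line):
    """Minimal character-by-character CSV field splitter: tracks an
    'in quotes' flag, treats "" inside quotes as a literal quote,
    and only splits on commas outside quotes."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')    # doubled quote -> literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)
        else:
            if ch == '"':
                in_quotes = True       # opening quote
            elif ch == ',':
                fields.append(''.join(buf))
                buf = []
            else:
                buf.append(ch)
        i += 1
    fields.append(''.join(buf))
    return fields

parse_csv_line('"a,b",c,"say ""hi"""')
# -> ['a,b', 'c', 'say "hi"']
```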

Regards,
Charlie

Tiny Bear Wild Bird Store
Home of "Safety Net Shipping"
http://www.TinyBearWildBirdStore.com
Toll Free: 1-855-TinyBear (855-846-9232)

Martin Phillips

Aug 31, 2013, 11:48:56 AM
to mvd...@googlegroups.com

Hi,

> Watch out for your csv definition and where it comes from. There's no
> standard, so if the source changes it could easily break your nicely
> crafted code.

Although there is no fully ratified standard, there is an RFC (4180) that sets out the rules. This is all very well, but lots of software then breaks those rules, so we are not really much better off than having no specification at all.
 
QM has integrated CSV support which is considerably faster than the group extraction process described earlier in this thread. We chose to implement according to the RFC and also with two lower compliance alternatives. We have QMBasic programming statements to read/write/parse/build CSV data and the query processor can directly produce CSV output which leads to the interesting problem of how to handle multivalues in CSV.
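The RFC 4180 quoting rules are small enough to show in a few lines. Python's csv module follows roughly the same conventions, shown here purely as an illustration: fields containing the delimiter, a quote or a line break get quoted, and embedded quotes are doubled.

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['plain', 'has,comma', 'has "quote"'])
line = buf.getvalue()
# -> plain,"has,comma","has ""quote"""
# 'plain' needs no quoting; the other two fields are quoted, with the
# embedded quote doubled, per RFC 4180 conventions.
```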
 
Martin Phillips
Ladybridge Systems Ltd
17b Coldstream Lane, Hardingstone, Northampton NN4 6DB, England
+44 (0)1604-709200

CDMI - Steve T

Aug 31, 2013, 12:00:55 PM
to mvd...@googlegroups.com
David:
here is a READ csv routine (when I'm not using QM)

SUBROUTINE READCSV.SUB(CODE,STRING,ARRAY,CONTINUED)
!
* 10-10-12 smt readcsv.sub
!
*    CODE      = (TAB or CSV)
*    STRING    = line from csv file to be converted to a dynamic array and stored in array
*    ARRAY     = dynamic array to be built or continued
*    CONTINUED = set to 1 if linefeed was within last quoted field and array needs to be continued on next line
*                Initially set to zero in calling program when starting a CSV file but then maintained by this
*                subroutine for the duration of the csv file.
*
IF NOT(ASSIGNED(CONTINUED)) THEN CONTINUED = 0
IF NOT(ASSIGNED(ARRAY))     THEN ARRAY = ''
*
NEW = ''
QUOTED = 0
COMMA.ON = 0
QUOTE.ON = CONTINUED
*
IF CONTINUED THEN
 CONTINUED = 0
 AM$ = COUNT(ARRAY,@AM)
END ELSE
 AM$ = 0
 ARRAY = ''
END
*
LNE = STRING
LNG = LEN(LNE)
*
BEGIN CASE
 CASE CODE = "TAB"
  CONVERT CHAR(9) TO @AM IN LNE
  AM$ += 1
  ARRAY<AM$> := LNE
 CASE CODE = "CSV"
  GOSUB 100
 CASE 1
  CRT \Invalid code. CODE = '\:CODE:\'\
  ARRAY = ''
END CASE
RETURN
*
100 * CSV
*
ESC.STR = ''
ESC.STR<1> = '\\,\", \r, \n, \t,\r\n'
ESC.STR<2> = ' \, ",253,253,252, 253'
CONVERT ", " TO @VM IN ESC.STR
*
LOOP
 X1 = LNE[1,1]
 X2 = LNE[1,2]
 X4 = LNE[1,4]
UNTIL X1 = '' DO
 BG = 2
*
*** setup check for escaped characters
*
 LOCATE(X4,ESC.STR,1;VM$) THEN
  ESC.ON = 5
  CHR = ESC.STR<2,VM$>
 END ELSE
  LOCATE(X2,ESC.STR,1;VM$) THEN
   ESC.ON = 3
   CHR = ESC.STR<2,VM$>
  END ELSE
   ESC.ON = 0
  END
 END
*
 BEGIN CASE
*
*** skip trailing spaces while waiting for comma
*
  CASE COMMA.ON AND X1 # ","
   IF X1 # " " THEN
    CRT \Invalid format, discarding. X1 = '\:X1:\'\
   END
*
*** escaped characters only valid within quoted fields
*
  CASE ESC.ON AND QUOTE.ON
   IF CHR MATCH "1N0N" THEN
    NEW := CHAR(CHR)       ;* new line or tab within quoted field
   END ELSE
    NEW := CHR
   END
   BG = ESC.ON              ;* 3 or 5
*
*** check quote usage
*
  CASE X1 = \"\
   IF QUOTE.ON THEN
    IF X2 = \""\ THEN
     NEW := X1
     BG = 3
    END ELSE
     QUOTE.ON = 0
     COMMA.ON = 1
    END
   END ELSE
    QUOTE.ON = 1
    QUOTED = 1
    IF TRIM(NEW) # '' THEN
     CRT \Invalid format, discarding. NEW = '\:NEW:\'\
    END
    NEW = ''
   END
*
*** check comma usage
*
  CASE X1 = ","
   IF QUOTE.ON THEN
    NEW := X1
   END ELSE
    GOSUB 200              ;* add to array
   END
*
*** all the rest
*
  CASE 1
   NEW := X1
 END CASE
*
*** trim off what was used
*
 LNE = LNE[BG,LNG]
REPEAT
*
*** is this field continued?
*
IF QUOTE.ON THEN
 CONTINUED = 1
 LOCATE("\n",ESC.STR,1;VM$) THEN
  CHR = ESC.STR<2,VM$>
 END ELSE
  CHR = 253
 END
 IF CHR MATCH "1N0N" THEN
  NEW := CHAR(CHR)         ;* new line or tab within quoted field
 END ELSE
  NEW := CHR
 END
END
*
GOSUB 200                      ;* add to array
*
RETURN
*
200 * add to array
*
*** if not quoted then trim leading and trailing spaces
*
IF NOT(QUOTED) THEN
 NEW = TRIMF(NEW)            ;* trim leading spaces
 NEW = TRIMB(NEW)            ;* trim trailing spaces
END
*
*** update array
*
AM$ += 1
ARRAY<AM$> := NEW
*
*** init variables for next field
*
NEW = ''
QUOTED = 0
COMMA.ON = 0
QUOTE.ON = 0
*
RETURN
*
END

here is a subroutine to WRITE csv (when I'm not using QM)

SUBROUTINE WRITECSV.SUB(DO.EXCEL,DYNREC,MAXAMC,RESX)
!
* 10-10-12 smt take dynamic array and convert CSV (comma separated values)
*          please reference document RFC 4180
*
*
*** to make it simple and more obvious, I will enclose EACH field in '"' (double quotes)
*
* do NOT remove a '-' when it's the 1st char if the 2nd char is a number
* this makes 'negative' numbers positive
!
* modifications:
!
RESX = ''
FOR II = 1 TO MAXAMC
 IF II > 1 THEN RESX:= ","
 VAL$ = DYNREC<II>
!
*** remove chars that Excel uses in formulas
!
 IF DO.EXCEL THEN
  CHR$ = VAL$[1,1]
  BEGIN CASE
   CASE CHR$ = "-"
    IF NOT(NUM(VAL$[2,1])) THEN
     VAL$ = VAL$[2,999]
    END
   CASE CHR$ = "="
    VAL$ = VAL$[2,999]
   CASE CHR$ = "+"
    VAL$ = VAL$[2,999]
   CASE CHR$ = "/"
    VAL$ = VAL$[2,999]
   CASE CHR$ = "*"
    VAL$ = VAL$[2,999]
  END CASE
 END
 IF INDEX(VAL$,\"\,1) THEN
  VAL$ = SWAP(VAL$,\"\,\""\)
 END
 RESX:= \"\:VAL$:\"\
NEXT II
!
20000 * end of program
!
RETURN
END


hope this helps
good luck,

Steve Trimble
Computerized Data Mgmt Inc
2705 Justin Matthews Dr
N Little Rock, AR 72116
(501) 615-8674 09:00am - 6:00pm CST



Tony Gravagno

Aug 31, 2013, 4:14:22 PM
to mvd...@googlegroups.com
As far as parsing CSV, including nested commas, quotes, CR's, etc, the
topic has been discussed in various forums before and code has been
posted and discussed in various places. I'm trying to get out of the
habit of eagerly researching such things to respond to notes like
this, posting links to old threads and searching through publicly
available resources that everyone else should know about. Suffice to
say the solution has already been out there for years so you might
want to consider looking around for that rather than continuing to
beat your head against this wall.

As to internals, I'm no expert on this but I'll make a guess. Pick
Assembler has instructions related to Scan to the next Character. The
system delimiters are hardcoded, making it faster to scan to the next
attribute, value, subvalue, or segment mark. There is a byte for
another character there, (and I hope I'm remembering this properly)
where we could set the value of the byte and then call the right
function to use that character as the delimiter. The low-level scan
function would then use that and be very fast about doing it. That
should help to explain why those kinds of operations are fast while
manually parsing strings is not. That said, D3 Windows is written in
C, doesn't use the standard assembly modes, and I'm not sure how they
do low-level functions like that, but I suspect they're emulating the
assembly modes.

I hope someone with more of a clue will tell me I'm all wrong and that
there is another better reason for the performance difference, but
until that happens this is how I believe it works. :)

HTH
T

Kevin Powick

Aug 31, 2013, 6:01:28 PM
to mvd...@googlegroups.com
Lots of ways to go about it, with solutions new and old found by searching this group and the old CDP. Of course, how those solutions work will depend on how consistent/clean your data is.

If it's consistently comma delimited/quoted as "red","green","blue", you can simply create a value-mark-delimited record by using the CHANGE() function on the 3-character combination of double-quote comma double-quote.

NEW.REC = CHANGE(REC, '","', @VM)

The only other manipulation you'll need to do is for each attribute NEW.REC<N>, you'll strip off the very first and very last character, which are the leftover double quotes from the initial conversion.

It's pretty fast and only requires a few lines of code.
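For comparison, the same trick in Python (illustrative only; it assumes, as above, that every field is quoted and there are no embedded doubled quotes):

```python
def split_clean_csv(rec):
    """Split on the 3-character '","' sequence, then strip the
    leftover outer quotes. Only safe when every field is quoted
    and there are no embedded doubled quotes."""
    parts = rec.split('","')
    parts[0] = parts[0][1:]      # drop leading quote of first field
    parts[-1] = parts[-1][:-1]   # drop trailing quote of last field
    return parts

split_clean_csv('"red","green","blue"')
# -> ['red', 'green', 'blue']; embedded commas survive: '"a,b","c"'
# -> ['a,b', 'c']
```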

--
Kevin Powick

fwinans

Aug 31, 2013, 8:09:26 PM
to mvd...@googlegroups.com
Even 'compiled' basic is somewhat interpreted by the d3 virtual machine,
as I understand it. So executing even a small number of basic commands
to handle each byte of the csv file is going to take a bit of time. We
get good results using somewhat convoluted algorithms that offload as
much of the detail work as possible onto pick basic functions, which in
the areas they are designed for run at blindingly fast speed. You can
{somewhat easily} extend the d3 monitor with custom functions, which
should be much faster than passing byte-at-a-time through basic
variables. Or, you could just do this in a C language program or a
linux / cygwin shell script over on the native linux / windows side...
Hint: change all the commas _that_are_inside_a_quoted_string_ to some
unused char, like the backwards single quote mark {grave symbol `} or
perhaps tilde {~} or pipe symbol { | } using such external tools; then,
after import to d3, use the basic CONVERT command to shift 'em all back
to whatever you want them to be, after you've done whatever work they
had been sabotaging.
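That pre-processing step might look like this in Python (a sketch; the tilde placeholder is an assumption and must not occur in the real data):

```python
def mask_quoted_commas(line, placeholder='~'):
    """Replace commas that fall inside quoted strings with a
    placeholder character, so a later naive split (or CONVERT)
    on comma is safe. Doubled-quote escapes are not handled in
    this sketch."""
    out, in_quotes = [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == ',' and in_quotes:
            ch = placeholder
        out.append(ch)
    return ''.join(out)

mask_quoted_commas('"a,b",c')
# -> '"a~b",c' -- the embedded comma is now safe to split around
```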

  Other 'troublesome bytes' you may run across when you import files
that came from beyond D3 are tabs, esc strings from arrow keys users
were permitted to type in ("correcting" their typos, but not really),
and carriage returns that you accidentally let survive into the
resulting d3 item(s). Oh, and don't just suck in binary files as text;
there are mechanisms provided to keep the nasty char(255) byte at bay.
Use them and you'll be ok. Assume any graphics or compiled binary or
compressed file needs such kid-gloves handling.

Were you just curious about this response time disparity, or did you
really need to gulp in csv files on a regular and non-interactive basis?

Tony Gravagno, you could have just passed over such a boring / cliched
topic. I do so all the time, here. And there was a smidgen of new
content in their asking about relative execution speed.

Ahem; it is a little harder than before to get www.google.com to show
you old postings from just that now-disused comp.databases.pick
newsgroup, but if you google for 'advanced group search' it will
provide a hit for the deprecated groups.google.com/advanced_search
page, which lets you type in a limiter on group name / web page domain
name. Put comp.databases.pick in that box, as well as your search info
in the top left area.

David Knight

Sep 1, 2013, 6:47:10 PM
to mvd...@googlegroups.com
Sure. Here's the code section, old & new. NO.OF.FIELDS was previously determined in the commented-out version of section 8210...

*8210 ;* Split code into AMC's
* FOR Y = 1 TO NO.OF.FIELDS
*   EXTRACT.STR = "G" : (Y-1) : DELIM : "1"         ;* Uses D3 G<skip><delim><return>
*   THIS.CODE<Y> = OCONV(CODEARRAY(X),EXTRACT.STR)
*   THIS.CODE<Y> = TRIM(THIS.CODE<Y>)         ;* Strip trailing spaces
*   GOSUB 8220                                                  ;* Strip out quotation marks if quoted
* NEXT Y
* RETURN
*
8210  ;* Split code into AMC's
 Y = 1                                                                 ;* Initialise AMC counter
 TEXTFIELD = 0                                                   ;* Assume not text
 LINE.LENGTH = LEN(CODEARRAY(X))                ;* Length of this line
 FOR M = 1 TO LINE.LENGTH                              ;* Read character by character
  THIS.CHAR = CODEARRAY(X)[M,1]
  BEGIN CASE
  CASE THIS.CHAR = '"'             
    IF TEXTFIELD = 0 THEN                                    ;* Start of textfield
       TEXTFIELD = 1
    END ELSE
       TEXTFIELD = 0                                              ;* End of text field
    END
    GOSUB 8215                                                    ;* Append character
  CASE THIS.CHAR = DELIM
    IF TEXTFIELD = 0 THEN
       GOSUB 8218                                                 ;* Increment Y
    END ELSE
       GOSUB 8215                                                 ;* Append character
    END
  CASE 1
    GOSUB 8215                                                    ;* Append character
  END CASE
 NEXT M
 RETURN
*
8215 ;* Append character
 THIS.CODE<Y> = THIS.CODE<Y> : THIS.CHAR
 RETURN
*
8218 ;* Increment Y
 THIS.CODE<Y> = TRIM(THIS.CODE<Y>)          ;* Strip trailing spaces
 GOSUB 8220                                                   ;* Strip quotation marks
 Y = Y + 1
 RETURN
*
8220 ;* Strip quotation marks if quoted
 IF THIS.CODE<Y>[1,1] = '"' THEN                      ;* Starts with quote
    THIS.CODE<Y> = OCONV(THIS.CODE<Y>,'G1"1')
 END
 RETURN

David Knight

Sep 1, 2013, 6:58:59 PM
to mvd...@googlegroups.com
Thanks to all.

Appears I crossed some form of line here, for which I apologise. In truth, I was not fishing for code or methods, since my specific needs are limited and are for only a particular 3rd party datafile which should be relatively "clean". If ever unreasonable stray characters found their way into the datasource, then, since my clients are paying serious dosh for the data, I'd flick it back at the authors & get 'em to do their job properly! Heh heh.

Those that did post code; thank you anyway. Interesting reading!

FWIW, while I did not search in this instance; I've found searching forums in general quite hard to do, unless you already know *how* the discussion originally went; and thus know what you are looking for. But, as I said; I was not really looking for code.

I was more curious about the performance difference when I manually wrote a routine which I thought internally d3 *must* be doing when performing a G form OCONV. To be fair, I have not Flash compiled this routine; so it is still strictly interpretive; but the performance difference was startling.

As Tony pointed out, d3/win is now written in C anyway; so the 'old' idea of assembler modes at the lowest level in the vm being super-fast *should* be somewhat negated? Maybe not?

Thanks everyone...

Ross Ferris

Sep 2, 2013, 2:54:59 AM
to mvd...@googlegroups.com
As Tony suggested, the reason the OCONV is faster is because it is taking advantage of some intrinsic assembler processing to do with character scanning
 
The fact that D3/Win is now "in C" will have little impact on the performance you are seeing from Basic --> UNTIL you flash compile your code. Do that, and whilst not as fast as a native OCONV, it will still be faster that non-flashed BASIC.
 
The suggestion to use MATPARSE has some merit, especially as that can be made to "break" on multiple characters (I believe this capability was used by CompuSheet to parse formula "quickly")
 
Depending on your requirements (& skill set), there are some nifty things that you COULD/SHOULD/CAN do outside of D3 that could "manipulate" & clean up your input file to the extent that when you read it in from D3, it could already have value marks in the "right" spot (and eliminate "unnecessary" quote marks, keep embedded commas etc etc), assuming the directory that you are picking files up from is, say, an NTFS directory by having a service that subscribes to directory changes (and you would actually pick up your "clean" data from another directory) .... but I'm guessing the costs associated with writing such a thing as a one off would be such that slow D3 performance is fine :-)
 
The same/similar technique could be used to similarly "manipulate" native XLS/XLSX files, strip data from PDF or Word templates .... the list is "endless".
 
Of course you could submit a request to TData/TL for a feature enhancement, then enlist others to make similar submissions, to see if you could get a similar capability within D3 to what Martin has incorporated into QM.

Tony Gravagno

Sep 3, 2013, 7:49:50 PM
to mvd...@googlegroups.com

David, I'm sorry if my comments seemed directed, or if they provoked responses from others that seemed directed. I was being direct, but not backhanded, which some people assume I do as a general practice but really it's Very rare. I think lately I've been typing fast, editing less, and just haven't been as diplomatic as I should in text. I'm not thinking about how text is going to be read but rather about how it's intended - again, it's just direct, not directed, if that makes sense. I've recently been as undiplomatic with Frank and not feeling very good about that either. We win some and lose some.

 

Always my best,

T

 

 

From: David Knight

Appears I crossed some form of line here, for which I apologise. In truth, I was not fishing for code or methods

David Knight

Sep 3, 2013, 9:34:17 PM
to mvd...@googlegroups.com
No offence taken... The original line about code/methods in my original post was an afterthought, added in; which I probably should not have done...

My point [which has been very adequately answered by all, BTW] was that while I suspected the reason for the difference, I was quite gob-smacked at the magnitude. It was HUUGE-AH. Also, I was too lazy to Flash-Compile! Maybe I'll do that & see what happens... in fact, I'll do it now... 3 ASCII data files to process in turn, 41171, 2924 and 8062 lines, totalling 52157 lines. Elapsed time approx 2 mins. Flash compiled: [omg...] 15s!!! If ever there was a case to flash-compile [I used (oc] to gain performance, this is it!!

Wow.. [which BTW was approximately the performance of the routine when it used G-extracts]...

So, there ya go folks...

Ross Ferris

Sep 9, 2013, 7:44:12 PM
to mvd...@googlegroups.com
Now,
 
If only TL would/could fix up the debugger for Flashed code, there wouldn't be much of a compelling reason for ANYONE NOT to use flashed code.
 
We made the decision to ONLY use flashed code from day 1 with Visage - all code maintained/managed from within Visage is automatically flashed, and as you have noted the gains with string manipulation execution speeds are "impressive".
 
Assume you are aware (though others may not be) that in order to CALL a "flashed" subroutine, your main program must also be flashed; OTHERWISE your unflashed version of the subroutine will be used. (This is why "flashed code" APPEARS to consume around twice as much space as unflashed code: both the flashed AND UNFLASHED object are produced.)