Parsing a CSV file in Classic REXX

Arthur T.

unread,

Jul 28, 2016, 9:43:09 PM7/28/16

to

I would like to do something elegantly, rather than via brute
force. I think the technique I've come up with needs the INTERPRET
statement, which I'd rather avoid as a general principle. I'm using
classical Rexx (Regina).

I'm reading a CSV file with a header row. I'd like to find all
rows which do not have addresses. I can find the address column by
some SMOP looking for "Address" in the first line and figuring the
number of commas preceding it. Assuming it's the 4th column, I could
generate a PARSE template of
. ',' . ',' . ',' LineAddress ',' .

But I don't believe there's any way to get PARSE to use that
generated template without something like
interpret 'parse var line.i' template
(Untested. I'd have to think longer as to whether "i" is in or out
of the quotes.)

At this point in my design, I'm not concerned with the edge
cases where Address is the first or last column.

I'd appreciate comments on my technique and suggestions on how
better to do this.

--
Arthur T. - ar23hur "at" pobox "dot" com

A.D. Fundum

unread,

Aug 7, 2016, 1:06:06 PM8/7/16

to

> At this point in my design

Which design?

> I'd appreciate comments on my technique

Which technique?

Do use INTERPRET to write a simple one-liner Rexx calculator, or
somehow refuse to use INTERPRET and write your own, customized, rather
useless parsing-function. IOW, rewrite (whatever you need of) PARSE in
Rexx.

--

Ken Leidner

unread,

Aug 9, 2016, 3:36:33 AM8/9/16

to

So my idea is to count the number of commas before the key word in
the header line and then step down the record and pull the desired
field out of the record.

My second way is elegantly and uses the translate and word functions,
but the translate does have one drawback

You have to know for sure that some character would NEVER be in the
input record, here I picked the $. Use the translate function to
change all spaces to that character and at the same time translate
commas to spaces. Then you can use the word function to pull out your
value. One small snag is that two comma together become two spaces
and get treated as one and you get the wrong word.

However the changestr function can fix that by adding a space between
them. The down side is the extra if statement to change the found
value of a single blank to nothing. Assuming again you are OK with
treating a single blank for your value as missing.

I wrote the statements all together and did not "save" the results
rom each function as they really are only needed to validate the
statement is working, or to see earier how it works.

This first part is the same for both ways, find where the key word
is in the header line.
/* REXX */
rec ="name,phone,sex,LineAddress,city,state,zip" /* example */
numcomma = countstr(",",substr(rec,1 ,pos("LineAddress",rec,1)) )
/* count the number of commas before the line address */

Way One
/* here is an example data line */
rec = "ken,123-4567,M,123 some road,new york,ny,02345"
/* or rec = "ken,123-4567,M,,new york,ny,02345" for the missing one*/
p1 = 1
do i=1 to numcomma
p1 = pos(",",rec,p1)+1
end
address = substr(rec,p1,pos(",",rec,p1)-p1)
q = length(address)
say " address has a length of " q ", a valuse of> " address "<"
Say "Input line was " rec

It give a zero length if the value is missing

Way Two - Using the translate and the word function

rec = "ken,123-4567,M,123 some road,new york,ny,02345"
/* here is an example data line, same as above */

address =Translate(word(translate(changestr(",,",rec,", ,"),"$ "," ,")
,numcomma+1)," ","$")
if address = " " then address = ""
q = length(address)
say " address has a length of " q ", a valuse of> " address "<"
say "Input line was " rec

Enjoy

In article <loblpb1afpaiv8fj7...@4ax.com>,
art...@munged.invalid says...

A.D. Fundum

unread,

Aug 9, 2016, 9:52:43 AM8/9/16

to

> My second way is elegantly and uses the translate and
> word functions, but the translate does have one drawback

Interpretfobia without proper examples of data formats and headers is
wasting everyone's time.

<YMD>,<HIGH>,<HTIME>,<LOW>,<LTIME>

Translate "," of such a proper header to spaces, use WordPos to find
the Nth column, and use PARSE to parse the Nth column. You may have to
use PARSE N times.

If this won't work, then there's something wrong with the (variable?!)
data format again. IRL you'd prefer to ignore the content of the
header, and use a fixed order of fields instead of an unknown,
flexible data format.

Or use specialized SQL'ish tools, written in a faster programming
language.

--

Andreas Schnellbacher

unread,

Aug 9, 2016, 3:18:02 PM8/9/16

to

Arthur T. wrote:

> Assuming it's the 4th column, I could generate a PARSE template of
> . ',' . ',' . ',' LineAddress ',' .

The way to go is to parse a line in a loop at comma positions.

In order to allow double-quoted (or single-quoted) commata in the
string parts, you have to check if a quote char comes first, before a
comma. I usually use pos for that.

If yes, then parse the entire string at once, up to the next quote
char. If not, then the next comma marks the end of the next field.

--
Andreas Schnellbacher

Arthur T.

unread,

Aug 9, 2016, 6:11:18 PM8/9/16

to

In Message-ID:<oe6dnfkPBIVhGjTK...@earthlink.com>,

Ken Leidner <y...@somehost.somedomain> wrote:

>This first part is the same for both ways, find where the key word
>is in the header line.
> /* REXX */
>rec ="name,phone,sex,LineAddress,city,state,zip" /* example */
>numcomma = countstr(",",substr(rec,1 ,pos("LineAddress",rec,1)) )
>/* count the number of commas before the line address */

That's more elegant than my method. Thanks for introducing me
to the COUNTSTR function; I hadn't noticed that it existed.

<snip>

>Way One

This could work for me. Thanks again.

>Way Two - Using the translate and the word function

Definitely more elegant in that it doesn't loop. But that
double TRANSLATE means that I'd have to double the length of the code
(in comments) so I'd be able to figure out what I had done. Not to
mention the time it would take me to understand what you did.

TRANSLATE is one of those instructions which I always have
problems with, except when used straightforwardly. I know it's
powerful, but it's also not intuitive.

Arthur T.

unread,

Aug 9, 2016, 6:11:21 PM8/9/16

to

In Message-ID:<noda99$gdi$1...@news.albasani.net>,

Andreas Schnellbacher <as...@despammed.com> wrote:

>Arthur T. wrote:
>
>> Assuming it's the 4th column, I could generate a PARSE template of
>> . ',' . ',' . ',' LineAddress ',' .
>

<snip>

>
>In order to allow double-quoted (or single-quoted) commata in the
>string parts, you have to check if a quote char comes first, before a
>comma. I usually use pos for that.
>
>If yes, then parse the entire string at once, up to the next quote
>char. If not, then the next comma marks the end of the next field.

Good point. I happen to know that in this case there will be no
quoted strings before the address. But, since I'm writing for a
general case, I should take that into account in my code.

A.D. Fundum

unread,

Aug 21, 2016, 4:41:28 PM8/21/16

to

> I'm writing for a general case

Then you'll also have to count the number of fields, to account for
any comma in unquoted data, and the user will have to specify which
fields are optional, and the user has to specify the type of data per
value. Unquoted data is no general case. There is no such thing as a
standarized CSV file format, so there is no such thing as your general
case. For one you're assuming the use of a comma as the separator,
which may as well be a TAB or a space, with or without your header.

--

bpkm...@gmail.com

unread,

Aug 23, 2017, 5:34:07 PM8/23/17

to

Just create a file as a Tab delimited file and use chr(9) (tab character) to separate the tokens in the string.

Arthur T.

unread,

Aug 24, 2017, 1:17:48 AM8/24/17

to

In
Message-ID:<4cf0f27b-0600-4275...@googlegroups.com>,

bpkm...@gmail.com wrote:

>Just create a file as a Tab delimited file and use chr(9) (tab character) to separate the tokens in the string.

That's not a bad idea. I inherited the file, and I'm not sure
what else might depend on its being separated by commas. Maybe I
could bring it into a spreadsheet program and resave it, though.

Thanks for looking at a year-old posting.