read flat file with enclosed delimiters


master

May 27, 2010, 7:16:46 PM
to jBASE
How do I read a record whose delimiter is part of the field values?

Many programs (Excel, etc.) will enclose in double quotes any field in which a delimiter is part of the text. Take this sample:

Name: Name1
Address: 458, MyCity
Country: MyCountry

In Excel it will appear like this:

Name Address Country
Name1 458, MyCity MyCountry

When converted to CSV it will be:

Name1, "458, MyCity", MyCountry

Now, using READSEQ, the routine breaks the Address field at the first comma (i.e. after 458) instead of treating everything enclosed in "" as a single entity.
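For reference, the behaviour I need is what any quote-aware CSV parser produces; a quick Python illustration (not jBASE, just to show the expected split):

```python
import csv

line = 'Name1, "458, MyCity", MyCountry'
# skipinitialspace=True lets the quote after ", " open a quoted field
fields = next(csv.reader([line], skipinitialspace=True))
print(fields)  # → ['Name1', '458, MyCity', 'MyCountry']
```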

Daniel Klein

May 28, 2010, 9:40:59 AM
to jb...@googlegroups.com
READSEQ cannot be doing this.

How is the CSV file being generated? A better method would be to use a tab-delimited file.

Post your code, a 'jdiag.out' file (as requested in the Posting Guidelines. For Zarquon's sake, doesn't ANYONE read it?) and an actual sample of the file so that a correct solution can be provided.

Dan


--
Please read the posting guidelines at: http://groups.google.com/group/jBASE/web/Posting%20Guidelines

IMPORTANT: Type T24: at the start of the subject line for questions specific to Globus/T24

To post, send email to jB...@googlegroups.com
To unsubscribe, send email to jBASE-un...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/jBASE?hl=en

master

May 28, 2010, 11:44:29 AM
to jBASE
@D. Klein:

I could use lots of delimiters, my favorite being the pipe. But in this case I don't control the format the file comes in or the delimiter used. I DON'T read it directly from Excel; the data is converted to the flat-file format I described before it comes to jBASE for loading.

Different applications have a way of interpreting enclosed delimiters; for instance, in Oracle I just have to declare the "enclosed by" character when defining an external table.

That said, here is a portion of the code minus the inserts.

DIRPATH = '\path\to\dir'   ;* Path to extracts directory
FILENAME = 'cust.file'     ;* Flat file name
OPENSEQ DIRPATH,FILENAME TO MYINPUTFILE ELSE
    CRT 'CANNOT open file'
END

LOOP
    READSEQ D.LINE FROM MYINPUTFILE ELSE
        BREAK
    END
    CONVERT ',' TO @FM IN D.LINE   ;* Here is my field delimiter

    * Then I go on to process the data

    FIELD.COUNT = DCOUNT(D.LINE,@FM)   ;* This value should be the same for all
                                       ;* records; however, if there is an enclosed
                                       ;* comma, the number of fields will be more

    RECORD.ID = D.LINE<1>   ;* Get ID from first field in the flat file

    R.RECORD<0> = RECORD.ID

    R.CUST<FIELD1> = D.LINE<1>   ;* first field, etc.
    R.CUST<FIELD2> = D.LINE<2>
    R.CUST<FIELD3> = D.LINE<3>

    WRITE R.CUST ON FN.CUST,RECORD.ID ON ERROR   ;* Write to jBASE file
        * DO STUFF
    END
REPEAT

Sample Data:

The sample data I sent should be enough

Name: Name1
Address: 458, MyCity
Country: MyCountry

In Excel it will appear like this:

Name Address Country
Name1 458, MyCity MyCountry

When converted to CSV it will be:

Name1, "458, MyCity", MyCountry

The routine should read 3 fields:
Field 1: "Name1"
Field 2: "458, MyCity" - including the comma
Field 3: MyCountry.

Right now this is what is happening:

Field 1: "Name1"
Field 2: "458
Field 3: MyCity"
Field 4: MyCountry.


FFT...@aol.com

May 28, 2010, 3:02:01 PM
to jb...@googlegroups.com
In a message dated 5/28/2010 9:44:57 AM Pacific Daylight Time, mail...@gmail.com writes:

> CONVERT ',' TO @FM IN D.LINE ;* Here is my field delimiter


Do not do this. That's your first problem. This is a big no-no. You do not *know* whether there are embedded commas in the data; here you are making them jump out and smack you in the kisser.
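A blind split on the delimiter shows the failure mode on the sample line; a Python sketch, purely for illustration:

```python
line = 'Name1, "458, MyCity", MyCountry'

# Splitting at every comma (the effect of CONVERT ',' TO @FM) ignores
# the quotes, so the embedded comma breaks the address field in two.
naive = [f.strip() for f in line.split(',')]
print(naive)  # → ['Name1', '"458', 'MyCity"', 'MyCountry'] - 4 fields, not 3
```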

W.J.

Daniel Klein

May 28, 2010, 7:09:25 PM
to jb...@googlegroups.com
If your 'sample data' is really representative of the CSV file then this code should do the trick...

line = 'Name1, "458, MyCity", MyCountry'
EQU DQ TO CHAR(34)
EQU COMMA TO CHAR(44)
newarray = ''
LOOP UNTIL LEN(line) = 0
    comma1.pos = INDEX(line,DQ,1)
    IF comma1.pos = 1 THEN
        comma2.pos = INDEX(line,DQ,2)
        newarray<-1> = TRIM(line[comma1.pos+1,comma2.pos-2])
        line = TRIM(line[comma2.pos+2,-1])
    END ELSE
        newarray<-1> = line[COMMA,1,1]
        line = TRIM(line[COMMA,2,999])
    END
REPEAT
CRT newarray

If the actual CSV file does not have spaces between the fields then you can remove the TRIM()'s.
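For anyone wanting to sanity-check the logic, the same quote-at-start scan can be sketched in Python (names and details are illustrative; it assumes no empty fields):

```python
def split_line(line):
    """If the next field starts with a double quote, take everything up to
    the closing quote; otherwise split at the next comma - the same scan
    as the jBASE routine above."""
    fields = []
    line = line.strip()
    while line:
        if line.startswith('"'):
            close = line.index('"', 1)       # position of the closing quote
            fields.append(line[1:close])
            line = line[close + 1:].lstrip()
            if line.startswith(','):         # skip the comma after the quote
                line = line[1:].lstrip()
        else:
            head, _, line = line.partition(',')
            fields.append(head.strip())
            line = line.lstrip()
    return fields

print(split_line('Name1, "458, MyCity", MyCountry'))
# → ['Name1', '458, MyCity', 'MyCountry']
```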

Dan

Jim Idle

May 29, 2010, 6:14:19 PM
to jb...@googlegroups.com
Clearly, and I mean CLEARLY, your import routine is too simplistic. Why do you expect something as simple as CONVERT to know anything about literal delimiters? Did you read the man page for that function? Just add code to scan and extract the fields into a dynamic array. It is a simple state machine that should be obvious how to program even if you don't know the formal definition of a state machine. It only has two real states: "traversing a literal" and "not traversing a literal"!
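That two-state scan can be sketched in a few lines (Python rather than jBASE BASIC, purely as an illustration):

```python
def split_csv(line, delim=',', quote='"'):
    """Two states: inside a quoted literal, delimiters are data;
    outside a literal, a delimiter ends the current field."""
    fields, buf, in_literal = [], [], False
    for ch in line:
        if ch == quote:
            in_literal = not in_literal      # flip state, drop the quote
        elif ch == delim and not in_literal:
            fields.append(''.join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    fields.append(''.join(buf).strip())
    return fields

print(split_csv('Name1, "458, MyCity", MyCountry'))
# → ['Name1', '458, MyCity', 'MyCountry']
```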

I realize that my postings to this group about helping ourselves and at least trying to learn may seem important to me and trite or preachy to everyone else, but for the love of Zarquon, are you sure you are in a job that you enjoy?

Jim

master

May 29, 2010, 3:57:06 AM
to jBASE
Thanks Dan,

Exactly what I needed.

It worked perfectly.


bw@ht

May 28, 2010, 8:17:27 PM
to jBASE
The simplest way to do this is to read the file directly from the OS file system, as follows:

EQU COMMA TO ','
EQU QTE TO '"'
EQU NULLS TO ''

bbyte = 0        ;* current byte offset into the file
lgth = 10240     ;* bytes to read per OSBREAD
left.overs = ''  ;* partial trailing line carried to the next block
last.byte = 0

LOOP
UNTIL last.byte

    OSBREAD block FROM file.var AT bbyte LENGTH lgth
    bbyte += lgth

    last.byte = INDEX(block, @EOF, 1)
    IF last.byte THEN block = block[1, (last.byte - 1)]

    CHANGE @CRLF TO @FM IN block
    block<1> = left.overs : block<1>
    last.fmc = DCOUNT(block, @FM)

    IF NOT(last.byte) THEN
        left.overs = block<last.fmc>   ;* last line may be incomplete
        last.fmc = last.fmc - 1
    END

    FOR fmc = 1 TO last.fmc

        this.line = block<fmc>
        comma.count = COUNT(this.line, COMMA)
        qt.comma.count = COUNT(this.line, QUOTE(COMMA))

        BEGIN CASE
            CASE comma.count = 2
                CHANGE COMMA TO @VM IN this.line
                CHANGE QTE TO NULLS IN this.line
            CASE qt.comma.count = 2
                CHANGE QUOTE(COMMA) TO @VM IN this.line
            CASE 1
                * otherwise break this.line down using logic
                * driven by the INDEX and GROUP functions
        END CASE

        * {processes}

    NEXT fmc

REPEAT

This requires writing a little extra code to deal with the line and element parsing, but that should all be boilerplate that is reusable, with appropriate modifications to read lengths and CSV delimiter counts, whenever you have to read a CSV file.

In my experience on Windows, the OSBREAD and OSBWRITE functions are much faster than READSEQ and WRITESEQ on larger files, because only a byte-specific, limited portion of the file has to be dealt with at a time.
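The block-read-with-leftovers pattern above can be sketched generically (Python for illustration; the chunk size is arbitrary):

```python
def read_lines_chunked(path, chunk_size=10240):
    """Yield complete lines from a file read in fixed-size blocks, carrying
    any partial trailing line over to the next block (the 'left.overs' idea)."""
    left_overs = ''
    with open(path, 'r', newline='') as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                if left_overs:
                    yield left_overs          # final unterminated line
                return
            block = left_overs + block
            lines = block.split('\n')
            left_overs = lines.pop()          # may be a partial line
            for line in lines:
                yield line.rstrip('\r')
```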

FFT...@aol.com

May 30, 2010, 3:02:14 PM
to jb...@googlegroups.com
The dilithium crystals cannot handle it Captain!  She's gonna blow!

master

May 31, 2010, 7:57:24 AM
to jBASE
Thanks for the insight.

Now I have two *working* options.