ETL, a useful library

The Beez

Feb 3, 2024, 1:15:01 PM
to 4tH-compiler
Hi 4tH-ers!

4tH turns 30 years old this year. In its early years it had serious limitations, but come the noughties it had become viable enough to use in real-life applications. So I often used it for work, e.g. for small data converters. And I still use it. It saved the day many a time.

Yes, that's the reason a template like CONVERT.4tH exists. Or library routines to convert dates or phone numbers. Yes, some of those routines trickled into 4tH - because I often continued work on them in my spare time - especially when I thought a situation might arise in the future where I might need them (although often that never happened).

Some of these custom converters stayed in use for years after I left the company, so they have certainly proven their worth.

Well, in the last ten years I often encountered an application that was able to dump its tables in a CSV file. But you could never be sure how they came out. A file might lack some fields, have some extra fields - and the fields might not be in the expected order. So if you had a program that expected those fields in a specific order, you might have to shuffle things around. Which might introduce some errors - if you were in a hurry - you know what I mean. It was a mess.
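
To give you a (made-up) impression of the problem: one dump might start with

  name,address,phone,email

while the next one - from the very same application - starts with

  phone,name,comment,address

Same data, different shape.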

Now, I could have written something clever for that one occasion, but I believe one should have a more generic solution. And I made one: ETL.4tH.

Now, it's not the most beautiful code I've ever written, but it works - if you follow the rules (there's a rough sketch of the whole idea in Forth after the list):
  1. You need a structure containing the fields you need. It is fixed length, so you might want to analyze the CSV file using CSVSCAN.4tH before sizing the fields. You have to instantiate that structure, so we can store stuff there;
  2. You need a table describing your structure (often called MAPPER). It contains a field for the name of the field in the CSV file, the corresponding field offset in the structure you created and an execution token for the "transformation" word for that field. E.g. if you want to clean up your phone number, enter the word there. You may use the same word for different fields;
  3. You have to calculate how many rows there are in that table. Often the result ends up in the #ETL constant;
  4. You have to create an array that can hold that table. Use /ETL and #ETL to size that array;
  5. Now we're in business. Fill the buffer using REFILL and initiate reading the header of the CSV file. The ETL API provides a routine for that. You have to feed it with the array, the MAPPER table and an execution token to read the field. Yes, you may read fields in a different way - probably even fixed size fields, although I haven't tried this yet;
  6. Now the magic happens. It keeps count of the fields it reads - and when it encounters the name of a field you mapped, it stores its sequence number in the array, along with the offset of the field and the transformation xt;
  7. In the end, it sorts the array by sequence number. It leaves two housekeeping items on the stack. We've read the header;
  8. Now we can enter the main loop - using these two items - by calling the second ETL API call. Just add the address of the structure buffer and (again) an execution token to read the field. Don't lose the housekeeping items - we will (obviously) need them during the next API call;
  9. The trick is - this routine will count the fields as well. Note that in the array those fields are sorted, so the only thing we have to check is: does this number equal one of the fields we're interested in? If so, it will use the structure buffer and the field offset in the array to transfer the data to its correct place. It will use the transformation xt in the array to do any post-processing;
  10. This transformation xt will receive the actual string in the structure buffer, so it can do its thing there. If you don't want to do anything with it, just drop that addr/count pair. Yes, that translates to 2DROP;
  11. Now you are responsible for keeping that string address, so you will often see a STOW to save it. It's not portable to Forth, but in 4tH it is just fine. We can handle that.
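
Since that's a lot of words, here's a rough sketch of the idea in plain Forth (it should run in something like gforth). Mind you: every name below (record, mapper, #mapper, etl-array and so on) is made up for this post - it is NOT the ETL.4tH API, and the real library does things a bit differently (it takes execution tokens for reading fields, keeps its housekeeping on the stack and uses 4tH words like STOW). First the data side, covering steps 1 to 4:

  \ 1. the fixed-length structure: every field is 32 characters here
  32 constant /field
  /field 0 * constant >name                \ offset of NAME in the record
  /field 1 * constant >phone               \ offset of PHONE in the record
  /field 2 * constant /record              \ total size of the record
  create record /record allot              \ the instantiated structure

  \ transformation words receive the string already stored in the record,
  \ so they are ( c-addr u -- ) and may rewrite it in place
  : pass        ( c-addr u -- ) 2drop ;    \ leave the field as it is
  : digits-only ( c-addr u -- )            \ crude cleanup: blank out non-digits
    over + swap ?do
      i c@ [char] 0 [char] 9 1+ within 0= if bl i c! then
    loop ;

  \ 2. the MAPPER: CSV column name, offset in the record, transformation xt
  : s-name  s" name" ;
  : s-phone s" phone" ;

  create mapper
    ' s-name  , >name  , ' pass ,
    ' s-phone , >phone , ' digits-only ,

  \ 3. the number of rows in the MAPPER (cf. #ETL)
  here mapper - 3 cells / constant #mapper

  \ 4. the work array: one (sequence#, offset, xt) triple per mapped column
  create etl-array #mapper 3 * cells allot
  variable #used                           \ how many triples are filled in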
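
And the two passes - reading the header and handling a data row. Again, toy code: ETL.4tH sorts the work array by sequence number (step 7), while I just do a linear search here to keep the sketch short:

  variable seq#                            \ sequence number of the current field

  \ distance to the first comma in a string, or u if there is none
  : >comma ( c-addr u -- n )
    dup 0 ?do
      over i + c@ [char] , = if 2drop i unloop exit then
    loop nip ;

  \ split off the first comma-separated field of a line
  : field-split ( a u -- rest-a rest-u field-a field-u )
    2dup >comma                  \ length of the field
    2 pick over 2>r              \ save the field itself
    dup 2 pick <> if 1+ then     \ skip the comma as well, if there was one
    /string 2r> ;

  \ find a column name in the MAPPER  ( c-addr u -- row# | -1 )
  : mapper-find
    #mapper 0 ?do
      2dup  mapper i 3 * cells + @ execute  compare 0= if
        2drop i unloop exit then
    loop 2drop -1 ;

  \ 6. note (sequence#, offset, xt) of a column we're interested in
  : remember ( seq row# -- )
    3 * cells mapper + >r                  \ the MAPPER row
    #used @ 3 * cells etl-array +          \ the next free triple
    tuck !                                 \ store the sequence number
    r@ cell+ @ over cell+ !                \ store the record offset
    r> 2 cells + @ swap 2 cells + !        \ store the transformation xt
    1 #used +! ;

  \ 5. read the header line and fill the work array (skipping step 7's sort)
  : match-header ( c-addr u -- )
    0 #used !  0 seq# !
    begin dup while
      field-split 2dup mapper-find
      dup 0< if drop 2drop else nip nip seq# @ swap remember then
      1 seq# +!
    repeat 2drop ;

  \ 9. do we care about this sequence number?  ( seq -- triple true | false )
  : in-array?
    #used @ 0 ?do
      dup etl-array i 3 * cells + @ = if
        drop etl-array i 3 * cells + true unloop exit then
    loop drop false ;

  \ 9./10. copy a field into the record and run its transformation
  : move-field ( f-a f-u triple -- )
    >r /field min                          \ clip to the field size
    r@ cell+ @ record +                    \ offset -> address in the record
    dup /field blank                       \ clear the slot first
    swap 2dup 2>r cmove                    \ copy the raw text in
    2r> r> 2 cells + @ execute ;           \ hand it to the transformation word

  \ 8. handle one data line
  : match-row ( c-addr u -- )
    0 seq# !
    begin dup while
      field-split seq# @ in-array?
      if move-field else 2drop then
      1 seq# +!
    repeat 2drop ;

  \ a quick test: extra column, different order
  s" phone,comment,name" match-header
  s" 020-1234567,ignored,Hans" match-row
  record >name  + /field -trailing type cr  \ Hans
  record >phone + /field -trailing type cr  \ 020 1234567 (dash blanked out)
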
I hadn't used that lib for some time, but this month I got an assignment that made it indispensable - especially since it has to be used in a time-critical situation where lots of these files have to be processed in a very short amount of time. So I thanked my past me for providing it - along with half a dozen other libraries.

Sure, I suppose one could build something like it in other languages as well (although I see a few bumps in the road here and there), but given that I have 4tH and know 4tH, I wouldn't feel comfortable giving a timeline for any other language.

Hans Bezemer