ETL using FBP

228 views
Skip to first unread message

Paul Morrison

unread,
May 28, 2020, 3:54:15 PM5/28/20
to Flow Based Programming
ETL using FBP
----------------

In a recent conversation with Felipe Valdés, an experienced programmer who has been an FBP enthusiast for a number of years (he is quoted in my book),  we talked about future directions for FBP, and Felipe suggested, "write a couple basic ETL components and become a ETL DSL, market as ETL, beat the competitors with better usability for the editor." !

Now, I had actually forgotten about ETL, but in my book - more specifically the 2nd edition - there are quite a few references to papers about ETL - many of them in connection with checkpointing, as ETL apps are often quite long-running. 

The transform (the T in ETL) component/subnet can either be prepackaged, using some kind of rules manager (I have some experience doing this in Java), or just be a hand-written FBP component/subnet...  The front and back ends would be components interfacing with various database managers, e.g. JDBC, or straight files... or combinations of the above! 

And of course these ETL-like networks can also interface to anything that "talks data", including file managers, sockets, messaging, etc., etc.



An FBP-based ETL system looks like an interesting project, so here goes:

I have already got a simple JavaFBP./JDBC demo working, so it's (in my view) just a matter of growing it organically... Most of the time was taken installing MySQL!

However... my JDBC program has a lot of stuff hard-wired, so turning it into a reusable module will take some design effort...  Of course people could hand-code the Extract and Load components for every database in their application, using standard templates, but I would like to do better than that!

Another issue that jumped out at me was that the demo (which I took and modified from the Internet) uses double for price... which is a NO-NO!   However, since we would be working with existing databases, we just have to know the currency - see below...

As follows (in Java):

           ResultSet rset = stmt.executeQuery("... select statement...");
           Book book = new Book();
           book.title = rset.getString("title");
           book.author = rset.getString("author");
           book.price = rset.getDouble("price"); 
           book.qty   = rset.getInt("qty");

We did quite a bit of work on business data types when I was still at IBM, some of which is described in https://jpaulm.github.io/busdtyps.html ...

For any locale-independent system, you need to include the currency in a monetary value or price, so we used the internationally accepted 3-character currency codes, plus the value as a String, e.g.  "CAD21.65"  .  Internally this was converted to currency code and the numeric value as BigDecimal (much more appropriate for currency amounts than "double"!)...  Note: if you are not familiar with BigDecimal, it is described at https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html .

That's it for now - I think the Google group would be a good place to start a discussion - unless a GitHub project or other tool would be better...?

Who is up for some FBP-based design and implementation work?!  I am going to put a JDBC program up on GitHub... which will be a good example of how NOT to build ETL components (unless there really is no better way)!   See https://github.com/jpaulm/fbp-etl/tree/master/src/main/java/com/jpaulmorrison/dbtest  - you can see that a lot of things are hard-wired, so it will be quite a challenge to "componentize" it!  MySQL databases are somewhat self-describing (although not completely), and Java has reflection, so we should be able to some componentization - we might also be able to generate E and L components from something like JSON...
                                                                                         
Oh, I almost forgot: we have a quite decent FBP-oriented design tool, https://github.com/jpaulm/drawfbp, which has been evolving over a number of years, so drawing and exchanging design diagrams will not be a problem...

PS I see where Nicholas Chen raised this idea back in 2011 (9 years ago!) - wonder if anything came of that effort?


Message has been deleted

Paul Morrison

unread,
May 31, 2020, 4:12:42 PM5/31/20
to Flow Based Programming
You can now see an early stage of development for this project at https://github.com/jpaulm/fbp-etl  .

The ReadJDBC component is still very hard-wired, so I would welcome suggestions on how to parametrize it.


Paul Morrison

unread,
Jun 1, 2020, 4:15:13 PM6/1/20
to Flow Based Programming
This proposed project - https://github.com/jpaulm/fbp-etl  - is intended to be essentially an "ETL kit" or "cookbook".  Like the Tutorial repo - https://github.com/jpaulm/fbp-tutorial-filter-file - there will (hopefully) be a series of steps, each introducing new concepts and/or components.  Each stage will be illustrated with diagrams, and will have running code attached. This is an "evolutionary" approach - at no point will there be a single "ETL app" - instead, it should be more like a cookbook, with recipes and descriptions of necessary and optional ingredients.

As a first example, Step05 shows a simple diagram, the code generated from it using DrawFBP, and the output from running that code.  The ReadJDBC component has a lot of hard-wired information - the only thing externalized so far is the password!  In later stages, more information will be externalized, to the point where hopefully we can have a single precoded, pretested component.

My book, especially the 2nd ed. - https://www.amazon.com/Flow-Based-Programming-J-Paul-Morrison-ebook/dp/B004PLO66O -  has a number of references to ETL, including several descriptions of the need for, and techniques to implement, checkpointing.  In the book checkpointing is described at a conceptual level, so hopefully we can add checkpointing components to ths project as it evolves.

I should stress that I am not an ETL expert, so I would like very much for people who are more skilled in that area to join in and collaborate, give feedback, etc.  I am however convinced, like a number of others over the years, that FBP is a good match with ETL, so I would like to see if we can get this project off the ground!

Looking forward to hearing from you!


Paul Morrison

unread,
Jun 13, 2020, 12:50:44 PM6/13/20
to Flow Based Programming
This project is moving ahead slowly but steadily (I think), and I would like to report the present stage of development.

I now have a reusable component which reads a full MySQL table and generates a stream of information packets (IPs).  It has no hard-wired application-specific data, but It does however right now make certain assumptions, which I need to open up.  The assumptions are as follows:

  • it assumes that the table column names and object field names are identical
  • what is (currently) hard-wired is the correspondence between VARCHAR and String, DECIMAL and BigDecimal, etc.

The first restriction should be lifted as soon as possible!

I also need to support more types and correspondences, but I am thinking that, rather than making the correspondences one-way, I should check the type of each table column against the corresponding object field type, and see if they are compatible - there is actually a list of these available on-line.


It should be time to start a table writer component soon! 

As before, collaborators who speak (My)SQL and ETL would be most welcome!


Paul Morrison

unread,
Jun 26, 2020, 10:47:51 PM6/26/20
to Flow Based Programming
We now have a component which reads a MySQL table and generates a stream of "row" IPs.  We therefore have made a start at providing the "E" of "ETL"!  Of course data can be obtained from files, or any other input formats.

The correspondence between column names and IP fields is held in a table which can be either specified in the network via an IIP, or in an incoming "regular" IP... so the column names and IP fields do not have to be the same.  Other information is also external to the component, as well as column metadata (obtained from MySQL) and IP class information (via Java reflection).

There is a working example on https://github.com/jpaulm/fbp-etl - the complete network is at Step14.  Here is the current network (the front end only):

Step14.png



Next phase is to start work on the 'L' of 'ETL'...


Does anyone want to collaborate and/or provide an interesting MySQL (or other SQL) database?

Paul Morrison

unread,
Jun 27, 2020, 2:18:31 PM6/27/20
to Flow Based Programming
Here is the output from WriteObjectsToConsole, and also from the JavaFBP run as a whole - WriteObjectsToConsole uses Java reflection to find out the class names for the "row" IPs, and field names and values...  So, this is also a generalized component, containing no application knowledge.

Step10-3.png




Paul Morrison

unread,
Jun 27, 2020, 10:41:05 PM6/27/20
to flow-based-...@googlegroups.com
Almost forgot!  Last thing at the "Extract" end is to convert the BigDecimal 'price' to MPrice data type in Business Data Types -

This may take a bit more work, as we may have to use VARCHAR for the MySQL column type!  Let's see....

Paul Morrison

unread,
Jun 28, 2020, 9:59:01 AM6/28/20
to Flow Based Programming
Looks like we have to use https://docs.oracle.com/javase/tutorial/jdbc/basics/sqlcustommapping.html - help would be appreciated!

Paul Morrison

unread,
Jun 29, 2020, 12:39:16 PM6/29/20
to Flow Based Programming
Dead end! Can't do this with MySQL! Suggestions, anyone?!

Paul Morrison

unread,
Jun 29, 2020, 1:29:25 PM6/29/20
to Flow Based Programming
Guess I'll have to go with VARCHAR!

I have changed my table, so earlier networks in this project (before Step15) no longer works (for me)!  And Step15 is not yet working!

Ged Byrne

unread,
Jun 29, 2020, 1:52:40 PM6/29/20
to flow-based-...@googlegroups.com
Hi Paul,

I thought the MySQL connector supports BigDecimal as standard: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-type-conversions.html

Big Decimal maps best to Numeric or Decimal.  You just have to make sure you give it enough Capacity and Precision: https://docs.oracle.com/javase/1.5.0/docs/guide/jdbc/getstart/mapping.html#107577

Regards, 


Ged



--
You received this message because you are subscribed to the Google Groups "Flow Based Programming" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-progra...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flow-based-programming/24ae42c7-14b3-4558-9391-44ad72b74e03o%40googlegroups.com.

Paul Morrison

unread,
Jun 29, 2020, 3:45:57 PM6/29/20
to Flow Based Programming
Hi Ged,  Good to hear from you!  And you have admitted to knowing MySQL - I may have to bug you a bit!

Much of the thinking is buried in https://jpaulm.github.io/busdtyps.html and https://github.com/jpaulm/jbdtypes : I want to have two data types called Monetary and MPrice, which allow currency abbreviations to be added to the numeric amount, e.g. CAD24.99 (i.e. Canadian Dollars).  I was bombing along until I discovered that MySQL does not allow one to CREATE a named type, although some othr SQL implementations do.

You may have seen earlier that my intent is to develop a component that is basically driven by SQL metadata, and Java reflection - but the link below ( https://docs.oracle.com/javase/tutorial/jdbc/basics/sqlcustommapping.html ) suggests that each Monetary and MPrice would have to consist of two pieces: currency (CHAR) and amount (DECIMAL).  I would like to do better than that!

I do have something sort of working, but it involves a fudge, so I will have to dig a bit deeper!  Do you know of any way to add customized data types to MySQL - it doesn't seem like rocket science!

Thanks for the feedback anyway - when (if!) I have something solid, hopefully I will ask for your feedback!

Best regards,

Paul

On Monday, June 29, 2020 at 1:52:40 PM UTC-4, Ged Byrne wrote:
Hi Paul,

I thought the MySQL connector supports BigDecimal as standard: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-type-conversions.html

Big Decimal maps best to Numeric or Decimal.  You just have to make sure you give it enough Capacity and Precision: https://docs.oracle.com/javase/1.5.0/docs/guide/jdbc/getstart/mapping.html#107577

Regards, 


Ged



On Mon, 29 Jun 2020 at 17:39, Paul Morrison <paul.m...@rogers.com> wrote:
Dead end! Can't do this with MySQL! Suggestions, anyone?!

On Sunday, June 28, 2020 at 9:59:01 AM UTC-4, Paul Morrison wrote:
Looks like we have to use https://docs.oracle.com/javase/tutorial/jdbc/basics/sqlcustommapping.html - help would be appreciated!

On Saturday, June 27, 2020 at 10:41:05 PM UTC-4, Paul Morrison wrote:
Almost forgot!  Last thing at the "Extract" end is to convert the BigDecimal 'price' to MPrice data type in Business Data Types -

This may take a bit more work, as we may have to use VARCHAR for the MySQL column type!  Let's see....

--
You received this message because you are subscribed to the Google Groups "Flow Based Programming" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-programming+unsub...@googlegroups.com.

Ged Byrne

unread,
Jun 29, 2020, 3:59:45 PM6/29/20
to flow-based-...@googlegroups.com
Hi Paul,

Unfortunately not.  I think Oracle are careful to make sure that MySQL doesn’t get too many features. 

Postgres, on the other hand, is fully loaded with features: 


Ged

To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-progra...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Flow Based Programming" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-progra...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flow-based-programming/a4ea9d85-03ce-4bb9-bc99-47f92835c147o%40googlegroups.com.

Paul Morrison

unread,
Jul 6, 2020, 2:45:12 PM7/6/20
to Flow Based Programming
Thanks Ged,

I have gone with VARCHAR for the prices - all the JBDTypes data types have "String constructors" so I am using that in my implementation.

I have pretty much finished development for the "E" end of "ETL" - people can explore it via https://github.com/jpaulm/fbp-etl/tree/master/src/main/java/com/jpaulmorrison/Step15 , and earlier steps if desired.

Feel free to contact me if anyone has questions... or use the Issues for the FBP-ETL project in GitHub.

Cheers,

Paul
To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-programming+unsub...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Flow Based Programming" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-programming+unsub...@googlegroups.com.

Paul Morrison

unread,
Jul 8, 2020, 10:22:19 PM7/8/20
to Flow Based Programming
As I said in my previous post, the "Extract" end of my FBP-ETL project is pretty much working - if you are interested, take a look at https://github.com/jpaulm/fbp-etl - especially Step15...

What I notice is that FBP removes distinctions!  There is nothing special about ETL from an FBP point of view!   FBP removes the distinction between online and batch!  It removes the distinction between application development and prototyping, between development and simulation...  It is language-agnostic!  It is visual, it promotes communication between all the different players: designers, users, developers,  and managers!

Join the FBP revolution!  All you have to do is unlearn the von Neumann paradigm!!
Message has been deleted

Paul Morrison

unread,
Aug 7, 2020, 1:36:52 PM8/7/20
to Flow Based Programming
Must have double-clicked - apologies!

I wanted to say that my simple ETL network is now working - Extract and Load access two MySQL tables - the "final" network diagrams and code are in https://github.com/jpaulm/fbp-etl/tree/master/src/main/java/com/jpaulmorrison/Step25 .  The diagrams and the corresponding Java files comprise multiple levels, so this app illustrates "top down" and "bottom up" development.  Of course, to run this application, you will need to set up your own tables, password, etc., and modify the network accordingly.  (I am not going to give the reader my password!)

I put "final" in quotes because there is nothing special about this network from an FBP point of view... we are just reading from some kind of data store, and writing to another one - in between you can do any processing you feel like, including checkpointing, etc.

Thanks for suggesting this, Felipe, it was fun... and I learned a lot!

Paul M.

Matthew Lai

unread,
Aug 8, 2020, 11:50:47 PM8/8/20
to Flow Based Programming
Mr. Morrison,

I'm also working on some sort of ETL side project, and it looks like FBP may be the right tool for the tasks involved. 

In a nutshell I'm trying to create a poorman's text-audio-sync tool that is used for ESL (English As a Second Language) learning purpose. The idea is simple: find a text book that has the corresponding human-read audio file. Open the text in some editor and also play the audio file. While audio file is playing, the corresponding text is highlighted/underlined. [If you ever read the transcript while listening to TED talks you know what I mean by text-audio-sync!] This feature helps the user to train English listening and reading at the same time, sort of like turning on the closed captions while watching a movie.

Example of human-read audio file and the text file can be found in https://www.gutenberg.org/files/9556/mp3-32bit/9556-3201.mp3 and https://www.gutenberg.org/files/244/244-h/244-h.htm#link2HCH0001

Extract 
- read the mp3 file
- read the text file

Transform
- extract the sentences from the mp3 file eg. use Audacity to look for 1 second silence gaps and extract those audio clips in between those gaps
- extract the sentences from the text file eg. parse the text file to generate sentences using punctuations as delimiters
- match each audio clip to one or more parsed sentence. One matching technique I'm testing now is to convert the audio clip to text using DeepSpeech2, and then match the converted text to the parsed sentence(s).
- collect the matching data {<start audio clip time> <stop audio clip time> <matched sentence(s)>}

Load
- save matching data into some database table
- save mp3 file into some data server
- save text file into some data server

The challenge here is to work with the audio and not just text processing like the 'traditional' ETL applications. I do believe FBP is the right tool for this application as there are subtasks that can be run concurrently (extract audio clips, matching audio clips to parsed sentence(s)) and long batch processing, for tens to hundreds of audio and text files, can be handled well by FBP.

Matthew

Paul Morrison

unread,
Aug 9, 2020, 11:32:12 AM8/9/20
to flow-based-...@googlegroups.com
Hi Matt,

That sounds brilliant!  Let us know how that works out!   I assume you will be using your TCL/TK tool...?

BTW I'm glad I tried your example - turned out that my audio driver wasn't working, so Microsoft very kindly installed a new one, and it seems to be working now!

Best regards,

Paul M.

--
You received this message because you are subscribed to the Google Groups "Flow Based Programming" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-progra...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flow-based-programming/5227fde0-3f82-4023-90ec-ea85b625f19bo%40googlegroups.com.

Matthew Lai

unread,
Aug 12, 2020, 8:43:42 AM8/12/20
to Flow Based Programming
Mr. Morrison,

I'll start with my tclfbp first. But any other fbp implementations should work as DeepSpeech2 is invoked as a cmdline task and not built or linked into fbp executables.

Yours,

Matt
To unsubscribe from this group and stop receiving emails from it, send an email to flow-based-programming+unsub...@googlegroups.com.
Message has been deleted

Paul Morrison

unread,
Sep 5, 2020, 11:10:25 AM9/5/20
to Flow Based Programming
Just a postscript about my FBP-ETL project - I have added another step called "Externalizing Pasword Block" - https://github.com/jpaulm/fbp-etl/tree/master/src/main/java/com/jpaulmorrison/Step30 - I didn't like the fact that reading the password was being done in both "Extract" and "Load", so I have made this function a single block, feeding both "Extract" and "Load", and modified the diagrams accordingly...

This is probably the last enhancement that I will be making to this project for a while... unless I get a suggestion about new directions someone would like this project to take!

As I said before, FBP turns ETL into just another FBP app - it is totally open-ended.  No special logic required!

Comments would be appreciated!  It would also be great if someone wants to build on this foundation!

Paul M.

Paul Morrison

unread,
Nov 7, 2020, 2:13:39 PM11/7/20
to Flow Based Programming
Two months later...  Step30 is working - https://github.com/jpaulm/fbp-etl -  both under Eclipse and from a DOS window.  It is nicely packaged as 3 subnets: "E", "T" and "L"!  "T" also uses the Java Business Types prototype - https://github.com/jpaulm/jbdtypes .  I note that these do not seem to have been much accessed!

Now here is where I reveal my ignorance of Java production packaging, and would appreciate help/guidance/whatever... (I am using Gradle and Maven).   If you look at https://github.com/jpaulm/fbp-etl/blob/master/README.md , you will note that the app uses 4 Maven artifacts (actually 5, as it uses protobuf, but I think this is included in the MySQL jar file).   Two JavaFBP support components are included in the JavaFBP jar file - originally the JavaFBP jar file was pulling in the whole MySQL jar file, but this is not desirable, as JavaFBP may accumulate "stubs" for many different software packages in the future, and we want to avoid "software bloat"!

So... could some kind person look at my build.gradle file, pom, file hierarchy, Readme, etc., and make recommendations about the best way to package this stuff...?

I should mention that the [Java]FBP world is a componentry world, so the packaging strategies we adopt should be widely applicable...

Thanks in advance,

Paul M.

Paul Morrison

unread,
Nov 9, 2020, 11:31:07 AM11/9/20
to Flow Based Programming
I seem to have found the solution: fbp-etl is now a (very) "fat jar", containing all the required dependencies, so you do not have to list them in the command. 

See https://github.com/jpaulm/fbp-etl for code, diagrams, and run instructions (in the Readme).

I must confess I am a little surprised that nobody stepped up with assistance, given that

a) Java is currently the 3rd most popular programming language in use today, according to TIOBE  - https://www.tiobe.com/tiobe-index/ - and

b) the FBP group currently has 780 members.

There must be some Java/Gradle/Maven experts out there!

Cheers, and stay safe, everyone!

Paul M
Reply all
Reply to author
Forward
0 new messages