Mass data batch processing


Jan Lessner

Feb 28, 2017, 7:16:21 AM
to jOOQ User Group
Hello everyone,
Lukas and I started a discussion on the manual pages, which Lukas proposed to move over to the user group instead. The issue: what is the perfect pattern with jOOQ for a DB batch program that has to process a significant amount of data, let's say one million rows per night? I added a real-world use case at the bottom of this post. We assume there are good reasons why the business logic for this batch operation cannot be ported to a stored procedure, so it must run in Java.
Due to the large amount of data, you would like to select from an open DB cursor (a ResultSet in JDBC) without producing intermediate lists in memory, which would make your JVM run out of memory. This is perfectly addressed by jOOQ with something like this:

create.selectFrom(EMAILCONTACT)....stream()

Now I have all the source data provided as a stream. Very nice.
For the inserts and updates we would need batched prepared statements for maximum write speed. Our Oracle DB manages 3,000 inserts per second with batched prepared inserts, so we are fine with one million records per night. And here comes the difficult part: how can I feed multiple batched prepared statements from one source stream? jOOQ provides a few different features for batched prepared statements (batchStore, batch, the Loader API), but there seems to be a missing link for elegantly building a kind of processing pipeline from the single source to the various outputs. Java streams cannot easily be split, so there is no straightforward solution. Lukas brought the term "dispatching" into the discussion. Something like that might be the missing link.
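To make the problem concrete, here is a minimal sketch in plain Java of what such a dispatcher could look like. The Row type and the two flush callbacks are hypothetical stand-ins for jOOQ records and for something like DSLContext.batchInsert / batchUpdate; the point is only to show how one source stream can feed several independently batched outputs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the "dispatching" idea: one source stream, two batched targets.
public class BatchDispatcher {

    // Simplified source row: every row yields a history insert,
    // bounced rows additionally yield an account update.
    public record Row(long customerId, boolean bounced) {}

    private final int batchSize;
    private final List<Row> inserts = new ArrayList<>();
    private final List<Row> updates = new ArrayList<>();
    private final Consumer<List<Row>> insertFlush; // e.g. a batched INSERT
    private final Consumer<List<Row>> updateFlush; // e.g. a batched UPDATE

    public BatchDispatcher(int batchSize,
                           Consumer<List<Row>> insertFlush,
                           Consumer<List<Row>> updateFlush) {
        this.batchSize = batchSize;
        this.insertFlush = insertFlush;
        this.updateFlush = updateFlush;
    }

    // Called once per source row, e.g. from stream.forEach(dispatcher::accept).
    public void accept(Row row) {
        inserts.add(row);
        if (row.bounced())
            updates.add(row);
        if (inserts.size() >= batchSize)
            flush(inserts, insertFlush);
        if (updates.size() >= batchSize)
            flush(updates, updateFlush);
    }

    // Flushes the remaining partial batches at the end of the run.
    public void close() {
        flush(inserts, insertFlush);
        flush(updates, updateFlush);
    }

    private void flush(List<Row> buffer, Consumer<List<Row>> target) {
        if (!buffer.isEmpty()) {
            target.accept(List.copyOf(buffer));
            buffer.clear();
        }
    }
}
```

With jOOQ, the two callbacks would wrap batched prepared statements, and the source would be the open cursor from create.selectFrom(...).stream().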

Any suggestions?

Regards,
Jan



The real-world use case
An external e-mail fulfillment system provides us with a DB table recording which customers were sent an e-mail for advertisement campaigns. Sometimes a few thousand per day, sometimes a few million, depending on the marketing department's activity. A second table holds information about e-mail bounces. Depending on this information, our system has to write entries into the corresponding customers' contact history as an input for the service center application. If the e-mail bounced, the history entry has a different type and content. Additionally, an e-mail bounce leads to an update of the customer's e-mail account status.
So we have a single source, consisting of
  • The e-mail information records
  • joined with the e-mail bounce information if present (i.e. outer join) and
  • joined with the customer's e-mail account record (inner join). In case the e-mail status must be changed, we don't want to select the e-mail account separately, so the join is reasonable.
The results of the process are
  • inserts into a customer contact history table (nearly the same amount as provided e-mail information records) and
  • updates to e-mail accounts (an unpredictable subset, but still a lot)
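The source query described above might look roughly like this (table and column names are invented for illustration):

```sql
SELECT e.customer_id, e.email_id, e.sent_at,
       b.bounce_reason,                       -- NULL if the e-mail did not bounce
       a.account_id, a.status
FROM  email_info    e
LEFT  JOIN email_bounce  b ON b.email_id    = e.email_id     -- outer join: bounce optional
INNER JOIN email_account a ON a.customer_id = e.customer_id  -- inner join: account always needed
```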

Lukas Eder

Feb 28, 2017, 2:57:49 PM
to jooq...@googlegroups.com
Hi Jan,

Thanks for following up on the user group. Before we look into jOOQ options, of course, it would be very interesting to learn why a PL/SQL based solution is not an option here. Does excluding PL/SQL also exclude using an anonymous PL/SQL block (not a stored procedure)? I'm thinking:

CREATE TYPE my_type AS OBJECT (...);
/
CREATE TYPE my_table_type AS TABLE OF my_type;
/

And then:

DECLARE
  records my_table_type;
BEGIN
  records := ?;

  FORALL i IN 1 .. records.COUNT
    INSERT INTO target_table
    VALUES records(i);

  COMMIT;
END;

You could then run the above by sending a chunk of records as the generated MyTableTypeRecord through JDBC or jOOQ:

ctx.execute(aboveSqlString, new MyTableTypeRecord(listOfRecords));

That would probably run quite quickly...

What do you think?
Lukas

--
You received this message because you are subscribed to the Google Groups "jOOQ User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jooq-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Jan Lessner

Feb 28, 2017, 4:36:24 PM
to jOOQ User Group

Hi Lukas,
There are many reasons which might make it difficult to use a stored procedure:
  • Unit testability of the code
  • Reuse of existing Java code modules
  • Complicated business logic on the way from source records to destination records
  • Different databases in development and production
  • Missing know-how to maintain PL/SQL

just to name a few. Taking the mentioned real-world use case as an example: in this case the exact content of the history entries is assembled from a dynamic Excel template using the openL rule engine. This is Java technology without an alternative in PL/SQL.


Regards,

Jan

Emrul Islam

Feb 28, 2017, 5:07:17 PM
to jOOQ User Group
Hi, I just came across this thread and realised I had dealt with a similar situation, which I solved slightly differently. In my case, I didn't need to update the target database until the end of the batch run, but I also couldn't keep everything in memory until the batch was completed. Instead, I ended up using https://github.com/npgall/cqengine

1. I created a persisted disk index with CQEngine
2. As the batch ran I updated the indexes with the results
3. Once the batch was finished I went back and updated the source database with the results

Not sure if that helps you, but it helped me (especially as it meant that the batch jobs became resumable).

Lukas Eder

Mar 1, 2017, 6:10:31 AM
to jooq...@googlegroups.com
Jan:

2017-02-28 22:36 GMT+01:00 Jan Lessner <jles...@gmx.de>:
  • Unit testability of the code
  • Reuse of existing Java code modules
  • Complicated business logic on the way from source records to destination records
  • Different databases in development and production
  • Missing know-how to maintain PL/SQL

Interesting. Well, I don't have all the bits of information, and there might be huge pain hiding in bullet number 3, but I think you might be burning some money on Exadata there if you don't use PL/SQL :) Here's some interesting (biased of course) insight:

I do recommend you reconsider, and at least benchmark your solution against an alternative with more data processing logic in the database.

I contract for a client that loads and unloads several million rows into dozens of tables per night in Oracle through Perl and PL/SQL, a similar order of magnitude to what you're talking about. They also run their more complex business logic in Java processes that get triggered using Oracle AQ from within the database. I don't have numbers on messages per second, but their Oracle instance certainly doesn't squeal yet. Oracle AQ is really completely underrated!

Heck, I'd even recommend not just PL/SQL but plain SQL, using Oracle's multi-table INSERT extension for your specific insertion use case:
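For illustration, a conditional multi-table INSERT along these lines (table and column names are invented, and the account-status update would still need a separate UPDATE or MERGE) could write both kinds of history entries in one statement:

```sql
INSERT ALL
  -- a regular history entry for delivered e-mails
  WHEN bounced = 0 THEN
    INTO contact_history (customer_id, entry_type)
    VALUES (customer_id, 'MAIL_SENT')
  -- a different entry type for bounced e-mails
  WHEN bounced = 1 THEN
    INTO contact_history (customer_id, entry_type)
    VALUES (customer_id, 'MAIL_BOUNCED')
SELECT e.customer_id,
       CASE WHEN b.email_id IS NULL THEN 0 ELSE 1 END AS bounced
FROM  email_info e
LEFT  JOIN email_bounce b ON b.email_id = e.email_id
```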

But anyway, you have your reasons.

So, essentially you're looking for jOOQ native support for precisely that multi-table INSERT statement (which would then also work in other DBs). Unfortunately, it doesn't exist yet, but it's certainly a good idea to support. Multi-table streams could be "bulked", batched, and committed quite nicely using the Loader API (or an extension thereof). We'd need a set of source-to-target mapping functions that transform the input data into different streams of output data.

I've created a feature request for this:
 

just to name a few. Taking the mentioned real-world use case as an example: in this case the exact content of the history entries is assembled from a dynamic Excel template using the openL rule engine. This is Java technology without an alternative in PL/SQL.

Oh, there's a jOOQ user out there who would put that library into the database and run a JVM inside of Oracle :) But that might not be a very good idea...

Cheers,
Lukas

Lukas Eder

Mar 1, 2017, 6:14:09 AM
to jooq...@googlegroups.com

  • Different databases in development and production
Out of curiosity: What's your development database?

Jan Lessner

Mar 1, 2017, 4:49:40 PM
to jOOQ User Group
Thanks for the advice, Emrul. This is indeed not quite what I'm looking for, but anyway: CQEngine looks pretty cool. I will definitely keep it in the back of my mind :-)

Jan Lessner

Mar 1, 2017, 5:14:36 PM
to jOOQ User Group

Am Mittwoch, 1. März 2017 12:10:31 UTC+1 schrieb Lukas Eder:
Interesting. Well, I don't have all the bits of information, and there might be huge pain hiding in bullet number 3, but I think you might be burning some money on Exadata there if you don't use PL/SQL :)

Yes, that's probably right :-)
The reason for using Exadata is not the Java-based operational system but the analytical data warehouse. But we are currently preparing a massive redesign to expand the business, and we will probably split the analytical and operational databases so that Exadata is no longer our basis.
 

I've created a feature request for this:
Thanks a lot. Maybe we can help in the future :-)
 
 Out of curiosity: What's your development database?
 At the moment the developers are working on locally installed Oracle XE instances. But in case we moved to jOOQ (coming from PriDE), we would probably change that to make use of jOOQ's open source edition. As we have almost no Oracle-specific aspects in our code, we could dare to use different databases in development and production.

Lukas Eder

Mar 2, 2017, 5:31:22 AM
to jooq...@googlegroups.com
2017-03-01 23:14 GMT+01:00 Jan Lessner <jles...@gmx.de>:
The reason for using Exadata is not the Java-based operational system but the analytical data warehouse. But we are currently preparing a massive redesign to expand the business, and we will probably split the analytical and operational databases so that Exadata is no longer our basis.

Yes, that's never a bad idea. Combining OLTP and OLAP in a single system only works for so long, i.e. until you grow beyond a certain size.

I've created a feature request for this:
Thanks a lot. Maybe we can help in the future :-)

You definitely can: The more details we know about your use-case and the features you'd like to see, the better this feature will get.

 Out of curiosity: What's your development database?
 At the moment the developers are working on locally installed Oracle XE instances. But in case we moved to jOOQ (coming from PriDE), we would probably change that to make use of jOOQ's open source edition. As we have almost no Oracle-specific aspects in our code, we could dare to use different databases in development and production.

I see, indeed Oracle XE is quite compatible with SE and EE, at least from a developer's perspective. However, I generally recommend not using a database for development that you're not also targeting in production. There are a lot of differences, e.g. in transaction models, and many subtle differences in behaviour that are not covered or abstracted by the SQL language per se.

In recent years, Oracle has made working with its database really easy for developers by providing out-of-the-box Docker images for Oracle 11g and 12c:

Just a thought :)
Lukas

mi...@providencehr.com

Mar 2, 2017, 8:28:23 AM
to jOOQ User Group
Please excuse me if I misunderstand the requirements. You want to process the stream of data, collect it into inserts into the contact history table and updates to the e-mail addresses, and then batch-process these once you've collected enough rows to make the batch efficient. The pain is that Java streams aren't easily split and don't provide an API for windowing. You might take a look at RxJava. You could create an RxJava Observable from your stream of data with multiple subscribers: one subscriber would filter the stream looking for inserts, the other for updates. You could then use the built-in window functionality to collect inserts/updates until you have the number needed for efficiency. Here are a couple of links that flesh this out.

Lukas Eder

Mar 2, 2017, 10:53:16 AM
to jooq...@googlegroups.com
Yes, indeed, an Observer might be a better way to abstract over a stream of data to merge into different target tables.

It would certainly work with a Java 8 Stream as well, but maybe this feature request is just a good enough reason to embrace what appears to be becoming a new de-facto standard SPI in the Java ecosystem: http://www.reactive-streams.org and/or the JDK 9 Flow API:

I'd rather not consider a direct dependency on RxJava, though.
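A dependency-free sketch of that idea on top of the JDK 9 Flow API might look like this. Each subscriber filters the rows it is responsible for and hands them on in fixed-size batches; the flush callback is a hypothetical stand-in for a jOOQ batch statement:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.function.Consumer;
import java.util.function.Predicate;

// A subscriber that filters items and flushes them in fixed-size batches.
public class BatchingSubscriber<T> implements Flow.Subscriber<T> {

    private final Predicate<T> filter;
    private final int batchSize;
    private final Consumer<List<T>> flush;
    private final List<T> buffer = new ArrayList<>();

    // Completed when the publisher signals onComplete (delivery is async).
    public final CountDownLatch done = new CountDownLatch(1);

    public BatchingSubscriber(Predicate<T> filter, int batchSize, Consumer<List<T>> flush) {
        this.filter = filter;
        this.batchSize = batchSize;
        this.flush = flush;
    }

    @Override
    public void onSubscribe(Flow.Subscription subscription) {
        subscription.request(Long.MAX_VALUE); // no backpressure in this sketch
    }

    @Override
    public void onNext(T item) {
        if (filter.test(item)) {
            buffer.add(item);
            if (buffer.size() >= batchSize)
                flushBuffer();
        }
    }

    @Override
    public void onError(Throwable throwable) {
        done.countDown();
    }

    @Override
    public void onComplete() {
        flushBuffer(); // flush the final partial batch
        done.countDown();
    }

    private void flushBuffer() {
        if (!buffer.isEmpty()) {
            flush.accept(List.copyOf(buffer));
            buffer.clear();
        }
    }
}
```

A SubmissionPublisher fed from the jOOQ stream could then be subscribed to twice: once with a filter for the history inserts and once for the account updates.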

The relevant feature request is:

Mike Foody

Mar 2, 2017, 11:14:21 AM
to jooq...@googlegroups.com
Once jOOQ has a Java 9 dependency, that'd be great. In the meantime, most of these libraries have convenience APIs for this purpose. For instance, converting a stream into an RxJava Observable is just Observable.from(jooqStream::iterator). It'd be nicer to say fetchIntoObservable, but not much nicer :-)


Lukas Eder

Mar 2, 2017, 11:40:18 AM
to jooq...@googlegroups.com
2017-03-02 17:13 GMT+01:00 Mike Foody <mi...@providencehr.com>:
Once jOOQ has a Java 9 dependency, that'd be great. In the meantime, most of these libraries have convenience APIs for this purpose. For instance, converting a stream into an RxJava Observable is just Observable.from(jooqStream::iterator). It'd be nicer to say fetchIntoObservable, but not much nicer :-)

Well, you see, ResultQuery.stream() is nicer than StreamSupport.stream(ResultQuery.spliterator(), false) but not much nicer ;-)

Jan Lessner

Mar 4, 2017, 6:24:22 PM
to jOOQ User Group
Actually an interesting idea. It may allow building such a pipeline *on top of* jOOQ, using DSLContext.batchInsert and DSLContext.batchUpdate on the subscriber side.

I will check that out while waiting for Lukas to build in the Java 9 Flow support ;-)