Bulk Import - Moving from DB2CMD with DB2 Load to ibm_db


Allen Cartright

Jan 8, 2019, 11:58:13 PM
to ibm_db
Looking for recommendations on the best way to approach getting large volumes of data into a DB2 database using Python with ibm_db and working with SQL/DDL files.

Today we use DB2CMD scripts on our application servers to load multi-GB data files into DB2 databases.  This is a remote load scenario.  The driver for looking at Python is that we have to add functionality to the scripts, and with the restrictions of Windows CMD scripting it makes sense to look at other options.  That said, because the current solution works, and works well, there isn't a desire to make architectural changes to the data flow.  Namely, we are stuck with the data files being sent to the application server.

In the current environment the scripts use the standard DB2 LOAD command.  Something along the lines of:

     DB2 LOAD CLIENT FROM <filename> OF DEL RESTARTCOUNT 1 MESSAGES <messages file> REPLACE INTO <target table> NONRECOVERABLE
       or
     DB2 LOAD CLIENT FROM <filename> OF DEL RESTARTCOUNT 1 MESSAGES <messages file> REPLACE INTO <target table> COPY YES

This affords:
     * Fast data loads.  Yes, it is a little slower than if the data files were on the DB server, but performance has been very good (9K on low-power, untuned systems).
     * Ability to restart a load
     * On failure the load can be backed out completely
     * Good status reporting
        - Message log file with row-by-row data
        - Status summary:
             Number of rows read      =
             Number of rows skipped   =
             Number of rows loaded    =
             Number of rows rejected  =
             Number of rows deleted   =
             Number of rows committed =

I understand that DB2 LOAD is a command line processor command, so it isn't an option in ibm_db.  And with the data files residing on the application server, CALL ADMIN_CMD isn't an option either.  So what is the best option to:
     * Maintain current configuration of data files on the application server
     * Performance that is equal to or better than a DB2 LOAD
     * Can back out on failure
     * Provide the same or equivalent level of logging (read, skipped, rejected, committed) or at least some way of reconciling what records didn't make it in.
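For context on what an ibm_db fallback might look like: absent LOAD, one option is batched INSERTs with autocommit off, committing per batch so a failure only backs out the in-flight batch, and counting rows read/loaded/rejected for reconciliation.  This is slower than LOAD and only a rough sketch of mine (the function names, DSN, and table are placeholders, not anything ibm_db ships):

```python
import csv


def read_batches(path, batch_size=1000):
    """Yield lists of rows from a delimited (DEL) file in fixed-size batches."""
    with open(path, newline="") as fh:
        batch = []
        for row in csv.reader(fh):
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


def load_file(conn, path, insert_sql, batch_size=1000):
    """Insert a DEL file batch by batch via ibm_db; commit each batch,
    roll back a failed one.  Returns (read, loaded, rejected) counts."""
    import ibm_db  # requires the ibm_db driver and a live connection

    stmt = ibm_db.prepare(conn, insert_sql)
    read = loaded = rejected = 0
    for batch in read_batches(path, batch_size):
        read += len(batch)
        try:
            for row in batch:
                ibm_db.execute(stmt, tuple(row))
            ibm_db.commit(conn)
            loaded += len(batch)
        except Exception:
            ibm_db.rollback(conn)  # back out only the failed batch
            rejected += len(batch)
    return read, loaded, rejected
```

The per-batch commit is what buys restartability: on rerun you can skip the first N committed batches, roughly like RESTARTCOUNT.  It will not match LOAD's throughput, though.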

Because I haven't put enough in this post, I'll add one more spoon-feed item.  In our scripts we do a lot through DDL/SQL files, where the script will execute something along the lines of "db2 -tf <whateverfile.sql>".  I'm not seeing how to do this with Python.

Appreciate any assistance.  Fine with being pointed to resources but would ask that such resources have applicable examples.

Thanks!

Saba Kauser

Jan 9, 2019, 1:18:10 AM
to ibm_db
Hello Allen,

As you mentioned, python ibm_db does not support LOAD.
I am not sure what you meant exactly by this: "In our scripts we do a lot through DDL/SQL files where the script will execute something along the lines of 'db2 -tf <whateverfile.sql>'.  I'm not seeing how to do this with Python."

You could execute CLP commands from python as:
import os
os.system("db2 -tvf file.sql")
(On Windows this needs to run inside a DB2 command window, e.g. os.system("db2cmd /c /w /i db2 -tvf file.sql").)

There is an equivalent of db2 LOAD performance available from the db2 clidriver (the CLI LOAD facility).
An example can be found here -> https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.apdv.sample.doc/doc/cli/s-tbload-c.html
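If shelling out to the CLP is undesirable, the statements in a "db2 -tf" style file can also be run through ibm_db itself by splitting the script on the statement terminator.  A minimal sketch of that idea (my own, not part of ibm_db; it only handles simple ;-terminated statements, not comments containing ";" or compound SQL):

```python
def split_statements(sql_text, terminator=";"):
    """Split a db2 -tf style script into individual SQL statements."""
    parts = [s.strip() for s in sql_text.split(terminator)]
    return [p for p in parts if p]


def run_sql_file(conn, path):
    """Execute each statement in the file via ibm_db (needs a live connection)."""
    import ibm_db

    with open(path) as fh:
        for stmt in split_statements(fh.read()):
            ibm_db.exec_immediate(conn, stmt)
```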

Allen Cartright

Jan 9, 2019, 11:07:03 PM
to ibm_db
Saba,

Appreciate the response.  Given the number of hoops, I think it will be easier to use PowerShell.

Jorge Sierra

Apr 10, 2019, 11:05:24 PM
to ibm_db
I am working on an extension to the ibm_db driver that uses the db2 LOAD C API to insert Parquet files into DB2.  It is almost ready; I need to create a git project and post it there.

Jorge Sierra

Apr 11, 2019, 11:07:17 PM
to ibm_db

First commit.  All this code is experimental on my end.  Could this be incorporated into the official driver?  Maybe.
The function consumes a pyarrow.Table, which can also be created from pandas.