CSV Import performance improvement idea


Omi Chiba

Feb 13, 2012, 5:54:44 PM2/13/12
to web2py-users
I have a problem with the performance of CSV import. I assume it
generates an INSERT statement for every record, so that's 8,000
statements if the CSV file has 8,000 records.

Could we use the bulk_insert method instead, so there is always a
single INSERT statement? That should improve performance significantly.
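For illustration, the idea of batching all rows into one parameterized statement (web2py's bulk_insert takes a list of dicts in much the same spirit) can be sketched with the standard library; the table, column names, and CSV content below are made up:

```python
import csv
import io
import sqlite3

# In-memory database standing in for the real one.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (sku TEXT, qty INTEGER)")

csv_text = "sku,qty\nA1,3\nB2,5\nC3,7\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# One batched, parameterized INSERT instead of one statement per record.
db.executemany("INSERT INTO product (sku, qty) VALUES (:sku, :qty)", rows)
db.commit()

count = db.execute("SELECT COUNT(*) FROM product").fetchone()[0]
print(count)  # 3
```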

Johann Spies

Feb 14, 2012, 2:34:10 AM2/14/12
to web...@googlegroups.com
As I understand it, the database (at least PostgreSQL) uses the COPY statement to import CSV files.  That is much quicker than a series of individual INSERTs.
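A COPY-based load might look like the sketch below, assuming psycopg2 and a reachable PostgreSQL server; the table name and connection string in the usage comment are placeholders:

```python
def copy_sql(table):
    """Build the COPY statement for a headered CSV stream."""
    return f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)"

def copy_from_csv(cur, table, csv_file):
    """Load a whole CSV file in one COPY round trip.

    `cur` is a psycopg2 cursor; copy_expert streams the file to the server.
    """
    cur.copy_expert(copy_sql(table), csv_file)

# Usage (not run here; DSN and table are placeholders):
#   conn = psycopg2.connect("dbname=mydb")
#   with conn, conn.cursor() as cur, open("products.csv") as f:
#       copy_from_csv(cur, "product", f)
```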

Regards
Johann


--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

kenji4569

Feb 14, 2012, 7:48:42 AM2/14/12
to web2py-users
I use gluon.scheduler to import large CSV files in a production
environment, and call db.commit() every 100 records inside the task.
This works well for me.
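The chunked-commit pattern described above can be sketched like this (stdlib sqlite3 standing in for the real database; the table and row data are made up):

```python
import sqlite3

CHUNK = 100  # commit every 100 records, as in the scheduler task

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE item (n INTEGER)")

rows = [(i,) for i in range(250)]  # stand-in for parsed CSV rows
for start in range(0, len(rows), CHUNK):
    # Insert one chunk, then commit so the transaction stays small.
    db.executemany("INSERT INTO item (n) VALUES (?)", rows[start:start + CHUNK])
    db.commit()  # three commits in total here: 100 + 100 + 50 rows

total = db.execute("SELECT COUNT(*) FROM item").fetchone()[0]
print(total)  # 250
```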

Kenji

On Feb 14, 4:34 PM, Johann Spies <johann.sp...@gmail.com> wrote:

Omi Chiba

Feb 14, 2012, 12:02:03 PM2/14/12
to web2py-users
Johann and Kenji san, Thank you !
DB2 has the CPYFRMIMPF command (I think it's the same as COPY on
PostgreSQL), so I will try it.

GoldenTiger

Feb 14, 2012, 5:18:36 PM2/14/12
to web...@googlegroups.com
I remember hitting that performance problem: importing 8,300 records from a CSV file using appadmin and a remote database took over 10 minutes.

These improvements worked for me:

- with a local database, the import from appadmin was fast
- with a remote database, either:
  A) use another DB manager (phpMyAdmin, SQL Buddy, etc.) to import the CSV file, or
  B) copy the file to the remote host and call importcsv from code


Niphlod

Feb 14, 2012, 6:23:07 PM2/14/12
to web...@googlegroups.com
You need to distinguish between the methods...

web2py stands in the middle; not all databases have facilities like PostgreSQL's COPY command.

Also, there are methods to export/import CSV at the database level and at the table level. If you need to restore a table from the admin app, you have to take into account the time to physically transfer the CSV file. The web2py methods can also use uuid fields to do "upserts", and they deal with "reference" and "list:*" types, etc.

If you have huge tables, or machines that aren't performant enough, you shouldn't use the web2py methods; code something tailored to your needs. I don't have issues with normal tables of 500,000 records (file uploaded to the remote host, imported from the local file directly from code, using the DAL).
Of course, if your table contains many columns or large blobs, YMMV.

PS @Kenji: normally, when you restore a table or import something, you want to know which records were committed and which weren't. If you're not using pkeys or uuids for every single record, committing 100 records at a time is normally considered bad practice: if your file contains a bad record, you can't guarantee the consistency of the table.
The time taken to import 8,000 records with a commit every 100 is more than importing 8,000 records with a single final commit, and the single commit retains consistency: if something falls apart, you have a functioning table instead of a broken one.

Again, @all, YMMV: if your CSVs are millions of records and you don't mind database consistency (e.g., all records in the tables are later "polished" or filtered in a smart way), you can commit every n records and store the last committed line somewhere else, so if your import crashes you know where to restart (be aware: writing performant checks and logic to import something "incrementally" into DBs, and making it work in every possible condition, is a quite daunting task).
My point is: don't be shy about using transactions, commits and rollbacks; they are simple to use, very performant, and maybe the most wonderful features of a relational database!
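The consistency argument above can be sketched with a single transaction that is rolled back when a bad record turns up (stdlib sqlite3 as a stand-in; the table and rows are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (n INTEGER PRIMARY KEY)")
db.execute("INSERT INTO t VALUES (1)")
db.commit()

rows = [(2,), (3,), (1,)]  # the last row violates the primary key
try:
    # All rows go into one open transaction; nothing is committed yet.
    db.executemany("INSERT INTO t VALUES (?)", rows)
    db.commit()
except sqlite3.IntegrityError:
    db.rollback()  # the table is left exactly as it was before the import

count = db.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 1 -- no partial import survived
```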

kenji4569

Feb 14, 2012, 8:08:45 PM2/14/12
to web2py-users
@Niphlod: The situation for me is that non-programmer operators
upload 10,000-100,000 product records, mainly for inventory updates,
via an admin interface. It's not huge, but it's large enough to exceed
the server's response-time limit. The upload is not a bulk copy
operation but an insert/update operation for each record, keyed by its
pkey, so I think the upload stays consistent even if it aborts
halfway. I am concerned that a single commit would cause high memory
usage, which should be avoided in a live environment.
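The per-record insert-or-update by primary key described above can be sketched with SQLite's ON CONFLICT upsert as a stand-in (the table and data are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER)")
db.execute("INSERT INTO stock VALUES ('A1', 1)")
db.commit()

# Each uploaded record either inserts a new row or updates the
# existing one with the same primary key.
updates = [("A1", 10), ("B2", 5)]
db.executemany(
    "INSERT INTO stock (sku, qty) VALUES (?, ?) "
    "ON CONFLICT(sku) DO UPDATE SET qty = excluded.qty",
    updates,
)
db.commit()

result = dict(db.execute("SELECT sku, qty FROM stock"))
print(result)  # {'A1': 10, 'B2': 5}
```

Because each record lands on its own pkey, re-running the upload after an abort simply re-applies the same upserts, which matches the consistency claim above.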

Regards,
Kenji

Niphlod

Feb 15, 2012, 5:43:59 PM2/15/12
to web...@googlegroups.com
OK, I got it, but normally databases don't "reserve" or "commit" additional memory (as in RAM) for a single large commit: the only thing that happens in most of them is a growth of the transaction log, which is transferred to the main data storage after the commit. Try some tests and clear up your concerns ^_^
