
[JDBC] Speed improvement - Group batch Insert - Rewrite the INSERT at the driver level (using a parameter)


J. W. Ulbts

Jul 18, 2009, 9:51:31 AM
Hello!

It would be great if someone could add a parameter and code to the JDBC driver that would let it group a list of INSERT INTO statements added to a batch (PreparedStatement and the addBatch(..) method).
That way the INSERT INTO statements from the batch could be rewritten into a single INSERT INTO, which is at least 3 times faster than the current behaviour. At least that was the result of my tests.

To make clear what I mean, here is a simple example.

The normal batch would execute this:
--------------------
INSERT INTO sometable (col1, col2) VALUES (1, 'word1');
INSERT INTO sometable (col1, col2) VALUES (2, 'word2');
INSERT INTO sometable (col1, col2) VALUES (3, 'word3');
INSERT INTO sometable (col1, col2) VALUES (4, 'word4');
--------------------

If the batch insert were rewritten by the driver, it would generate and execute this:
--------------------
INSERT INTO sometable (col1, col2) VALUES (1, 'word1'), (2, 'word2'), (3, 'word3'), (4, 'word4');
--------------------
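To sketch the idea in code: here is a naive merge function (my own illustration, not driver code) that joins such single-row INSERT statements into one multi-value statement. It assumes every statement targets the same table and column list and contains the VALUES keyword exactly once:

```java
import java.util.Arrays;
import java.util.List;

public class InsertMerge {

    // Merge single-row INSERT statements into one multi-value INSERT.
    // Naive sketch: assumes identical table/column lists and that
    // "VALUES" appears exactly once per statement.
    static String mergeInserts(List<String> statements) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < statements.size(); i++) {
            String s = statements.get(i).trim();
            if (s.endsWith(";")) {
                s = s.substring(0, s.length() - 1); // drop trailing ";"
            }
            if (i == 0) {
                sb.append(s); // keep "INSERT INTO ... VALUES (...)" intact
            } else {
                int v = s.toUpperCase().indexOf("VALUES");
                // append only the "(x, y)" tuple of the later statements
                sb.append(", ").append(s.substring(v + "VALUES".length()).trim());
            }
        }
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList(
            "INSERT INTO sometable (col1, col2) VALUES (1, 'word1');",
            "INSERT INTO sometable (col1, col2) VALUES (2, 'word2');",
            "INSERT INTO sometable (col1, col2) VALUES (3, 'word3');",
            "INSERT INTO sometable (col1, col2) VALUES (4, 'word4');");
        System.out.println(mergeInserts(batch));
    }
}
```

A real driver would of course work on the parameter values it already has instead of parsing SQL text, but the generated statement is the same.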

If this is already possible with the PostgreSQL JDBC driver I must have missed it.

I ran into this when one of my users pointed me to the multi-value/grouped INSERT statement, and then I saw that Mark Matthews has just improved the MySQL JDBC driver in this area. He talks about a 10x performance increase, but that's basically because the MySQL server doesn't optimize as well as PostgreSQL. ;)
I could "only" see an improvement of factor 5 with my test data and MySQL InnoDB (IMDb - Keywords.list; reading data from a file and creating the INSERT INTO statements all in one transaction). It's possible that MyISAM tables are up to 10x faster.

In my tests I added the values for the INSERT INTO statement to two Lists (ArrayList; one for the key column and one for the data column) and generated the "INSERT INTO sometable (col1, col2) VALUES (x,y)[, (x,y)];" in my Java code. For PostgreSQL that was 3 times faster than the normal PreparedStatement addBatch with its executeBatch.
I got the same speed improvement with my manually generated INSERT INTO as with the new JDBC driver for MySQL with the "rewriteBatchedStatements" parameter turned on.
I tried value groups (the "(x,y)") with sizes between 250 and 5000 (in steps of 250: 250, 500, 750, ..., 5000). The speedup was pretty stable at 3x for PostgreSQL, but the larger the group, the more memory is used to store the data you are trying to insert.
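For reference, here is a simplified sketch of the manual generation I described (table and column names are just the example ones from above, and I use "?" placeholders here instead of inlining the literal values):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class MultiValueInsert {

    // Build "INSERT INTO sometable (col1, col2) VALUES (?, ?), (?, ?), ..."
    // with one "(?, ?)" group per row.
    static String buildSql(int rows) {
        StringBuilder sb = new StringBuilder(
                "INSERT INTO sometable (col1, col2) VALUES ");
        for (int i = 0; i < rows; i++) {
            if (i > 0) sb.append(", ");
            sb.append("(?, ?)");
        }
        return sb.toString();
    }

    // Bind the two value lists and execute one grouped INSERT.
    static void insertGroup(Connection conn, List<Integer> keys,
                            List<String> words) throws SQLException {
        try (PreparedStatement ps =
                     conn.prepareStatement(buildSql(keys.size()))) {
            for (int i = 0; i < keys.size(); i++) {
                ps.setInt(2 * i + 1, keys.get(i));
                ps.setString(2 * i + 2, words.get(i));
            }
            ps.executeUpdate();
        }
    }
}
```

In the real test the group size (250 to 5000) determines how many "(?, ?)" groups are generated per statement.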


Here is the blog entry of Mark Matthews (MySQL Connector/J developer) about speeding up Batch Inserts:
http://www.jroller.com/mmatthews/entry/speeding_up_batch_inserts_for

The upcoming MySQL driver supporting this feature (Connector/J 5.1 GA and 6.0 Alpha) can be downloaded here:
http://downloads.mysql.com/snapshots.php

Currently the MySQL driver has a bug if your prepared INSERT INTO statement ends with a ";" (which is not needed except in an SQL script, but several developers, myself included, normally add it to the SQL command in the Java code as well).
The driver then creates an INSERT INTO statement that looks like this:
INSERT INTO sometable1 (col1, col2) VALUES (1, 'word');, (2, 'word2'), ...


It would be great if someone could implement this in the JDBC driver so that the grouping can be turned on by a driver URL parameter, as in MySQL Connector/J (?rewriteBatchedStatements=true).
Maybe there are also comments on this. So what do you think?

--
Juergen
______________________________________________________


--
Sent via pgsql-jdbc mailing list (pgsql...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-jdbc

Oliver Jowett

Jul 18, 2009, 10:25:20 AM
J. W. Ulbts wrote:

> The normal batch would execute this:
> --------------------
> INSERT INTO sometable (col1, col2) VALUES (1, 'word1');
> INSERT INTO sometable (col1, col2) VALUES (2, 'word2');
> INSERT INTO sometable (col1, col2) VALUES (3, 'word3');
> INSERT INTO sometable (col1, col2) VALUES (4, 'word4');
> --------------------
>
> If the batch insert is rewritten by the driver it would generate and execute this:
> --------------------
> INSERT INTO sometable (col1, col2) VALUES (1, 'word1'), (2, 'word2'), (3, 'word3'), (4, 'word4');
> --------------------

That would require teaching the driver how to parse INSERT statements,
and in general the driver tries to avoid parsing SQL wherever possible.

Where exactly is the performance benefit that you see coming from? I
would expect the driver to already be reusing a named statement when
batching a PreparedStatement (you could check the loglevel=2 output to
verify this), so you only pay the parsing cost a few times until use of
a named statement kicks in. Is the overhead of Bind/Execute really that
high?

If your use case is just "I want to do bulk inserts as fast as possible"
then perhaps the newly merged COPY support is a better way to go.
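For example, a rough sketch of the COPY path (the pgjdbc-specific calls are only shown in comments since they need the driver on the classpath, and real data would need escaping for COPY's text format):

```java
import java.util.List;

public class CopyText {

    // Format rows in COPY's default text format: tab-separated columns,
    // one newline-terminated row per line. (Real data would need
    // escaping of tabs, newlines and backslashes.)
    static String toCopyText(List<Integer> keys, List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.size(); i++) {
            sb.append(keys.get(i)).append('\t')
              .append(words.get(i)).append('\n');
        }
        return sb.toString();
    }

    // With the driver's copy API the data would then be streamed in
    // roughly like this (pgjdbc-specific, so only sketched here):
    //
    //   CopyManager cm = new CopyManager((BaseConnection) conn);
    //   cm.copyIn("COPY sometable (col1, col2) FROM STDIN",
    //             new StringReader(toCopyText(keys, words)));
}
```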

-O

PS: my reading of the MySQL blog entry that you linked was that it was
the parsing/rewriting that the MySQL driver does that caused a 10x
slowdown in the first place, and the "speedup" was actually just fixing
that.
