Bulk load table

259 views
Skip to first unread message

Michael Alexeev

unread,
Mar 6, 2012, 7:13:00 PM3/6/12
to voltd...@googlegroups.com
Hi All,

What is the preferred way to bulk insert ~20M rows in a table? I know I can queue individual inserts and then execute them together but is there a limitation on the queue size? Or is there a better way?

Thanks,
Mike

Ryan Betts

unread,
Mar 6, 2012, 7:42:31 PM3/6/12
to voltd...@googlegroups.com

Probably writing a client that does single partition inserts will be
fastest. We generate built-in insert procedures for partitioned tables
with pkeys defined - but we don't generate loaders or expose an
optimal bulk loading interface.

Ryan.

Ning Shi

unread,
Mar 6, 2012, 7:47:15 PM3/6/12
to voltd...@googlegroups.com

Once it's loaded for the first time, save the database to a snapshot and use the "recover" action to load it on subsequent start up will be the fastest. To use the "recover" action, just replace "create" or "start" in your start server script.

--
Ning

Ariel Weisberg

unread,
Mar 6, 2012, 7:47:25 PM3/6/12
to voltd...@googlegroups.com
Hi,

The fastest legit way is to do single partition inserts. You might get a bit of improvement by sending multiple rows for each insert by hashing the rows on the partition key in the client and building a volt table for each partition. Once the table for a partition is a meg or two fire it off to the partition and insert each row individually from within the procedure.

The fastest way is build a volt table in the client for each partition by hashing the partition column and then sending the table to the right partition and calling the evil and unsupported load table method. We use that for the TPC-C benchmark. It is liable to break or go away at any time and throws away all data protection. See https://github.com/VoltDB/voltdb/blob/master/src/frontend/org/voltdb/VoltProcedure.java#L268 It's fine for loading data if you are trying to debug something but can't be used in production.

Ariel

On Tue, Mar 6, 2012 at 7:13 PM, Michael Alexeev <michael...@gmail.com> wrote:

Michael Alexeev

unread,
Mar 6, 2012, 9:13:59 PM3/6/12
to voltd...@googlegroups.com
Hi All,

Thanks for great responses!

Yes,  this is not for a production use. I am looking at ENG-2503/04 and just trying to recreate the problem. I guess I can use VoltProcedure::voltLoadTable just to load data for the first time and use save/restore feature going forward.

Ariel,

>The fastest way is build a volt table in the client for each partition by hashing the partition column and then sending the >table to the right partition

I understand you in general but not sure about implementation details :). Let say we have N partitions. The hash function than needs to return values from [0..N) range for all possible PK values, right?
The sudo code could be as following:

VoltTable tables[N];
for each PK value
    idx = hash(PK)
    tables[idx].add(VoltTableRow)

for each table in tables
     VoltProcedure::voltLoadTable(table)

Is that what you were suggesting? If so, how can I get on hold of hash function used for partitioning?

Or did I completely misunderstand you?

Thanks,
Mike



Ariel Weisberg

unread,
Mar 6, 2012, 9:25:32 PM3/6/12
to voltd...@googlegroups.com
Hi,

If you are going to load just once, I would just insert all the data using single row inserts. If you get 100k of those per second you will be done in a few minutes and then snapshot restore takes care of the reloading it using the bulk loading mechanism.

Ariel

John Piekos

unread,
Mar 7, 2012, 4:58:06 PM3/7/12
to voltd...@googlegroups.com
Hi Michael,

FYI - we have a "bulk load" ticket filed for this: https://issues.voltdb.com/browse/ENG-2013
Lots of ideas/opinions, just hasn't been something we've implemented yet.

John
--
John Piekos
VoltDB, Inc.  
Vice President, Engineering
email:  jpi...@voltdb.com 
office+1.978.528.0576
mobile: +1.508.517.3218
web:    www.voltdb.com

Ben Ballard

unread,
Mar 7, 2012, 5:20:09 PM3/7/12
to voltd...@googlegroups.com, voltd...@googlegroups.com
Mike,
I have some sample benchmark code for loading data (individual asynchronous inserts) from a Buffered FileReader.  If that would help. 

-Ben
Sent from my iPhone

Michael Alexeev

unread,
Mar 7, 2012, 9:56:06 PM3/7/12
to voltd...@googlegroups.com
Hi Ben, I solved my immediate problem but if ever need to reload it again than yes, I would like to have an example. I suspect I won't be able to use it as is because in my case I generate dummy rows on a fly. Your's probably read them from some file, right?

Thanks,
Mike

Michael Alexeev

unread,
Mar 7, 2012, 10:08:42 PM3/7/12
to voltd...@googlegroups.com
Hi Ning,


Do I also need to specify the snapshot  name ? If I simply use recover option
$VOLTDB recover catalog $APPNAME.jar deployment deployment.xml \
        license $LICENSE leader $LEADER
Then I get the following:

VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.FATAL 03:04:40,333   [Thread-16] HOST: Nothing to recover from

I was able to restore it using the SnapshotRestore procedure via the client call.

Thanks,
Mike


--
Ning


Ning Shi

unread,
Mar 7, 2012, 10:25:12 PM3/7/12
to voltd...@googlegroups.com
On Mar 7, 2012, at 10:08 PM, Michael Alexeev wrote:

Do I also need to specify the snapshot  name ? If I simply use recover option
$VOLTDB recover catalog $APPNAME.jar deployment deployment.xml \
        license $LICENSE leader $LEADER
Then I get the following:

VoltDB has encountered an unrecoverable error and is exiting.
The log may contain additional information.FATAL 03:04:40,333   [Thread-16] HOST: Nothing to recover from

I was able to restore it using the SnapshotRestore procedure via the client call.

The recover action looks for snapshot files in the locations known to VoltDB and restore the latest one. These locations are usually specified in the deployment file. It includes the auto snapshot directory, and the command log snapshot directory if you are using the enterprise edition. If you didn't specify the auto snapshot directory in the deployment file, by default it will create a directory called "snapshots" in your voltdb root. You can just move your snapshot files to that directory and recover will find them.


--
Ning

Michael Alexeev

unread,
Mar 8, 2012, 7:08:59 PM3/8/12
to voltd...@googlegroups.com
Thanks. That was it.
Reply all
Reply to author
Forward
0 new messages