> The performance page of H2 says approx 29000 records/sec.
No, it says "statements per second". It's a mix of statements.
> There must be something
> obviously wrong in our approach.
I would try to avoid CSV files, and insert the data directly into the
database. That should double the performance, because it's about half
the I/O.
> insert into table (select from csvread(<filename>));
Try using: CREATE TABLE ... AS SELECT
This should improve performance.
> FILE_LOCK=NO;TRACE_LEVEL_FILE=0;TRACE_LEVEL_SYSTEM_OUT=0;LOCK_MODE=3
This will not help. See http://h2database.com/html/performance.html#fast_import
> with 32GB RAM
Maybe you could use in-memory databases?
> CSVREAD ... unoptimized I/O routines
That's true, CSVREAD is relatively slow. I'm not sure if this is the
main reason for your problem, but I will try to improve the
performance for the next release.
Regards,
Thomas
> all the caches should be
> increased by an order of magnitude,
By default the JVM allocates 64 MB heap. If more is available, then H2
could use more, that's true. I will add a feature request:
"CACHE_SIZE: automatically use a fraction of Runtime.maxMemory - maybe
automatically the second level cache."
> and the second-level soft reference
> cache should be enabled (to use any excess memory for caching).
No, because it slows down the performance, unless you have a lot of
memory (see above).
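For context, a soft-reference second-level cache keeps entries only as long as the JVM has spare memory; the extra indirection and garbage-collector interaction on every lookup is where the slowdown comes from. A minimal sketch of the idea (illustrative class and method names, not H2's actual cache implementation):

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a soft-reference second-level cache.
// Names are illustrative, not H2 internals.
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            // The garbage collector cleared the entry under memory pressure.
            map.remove(key);
        }
        return value;
    }
}
```

With plenty of heap the references are rarely cleared and the cache pays off; with a small heap, entries vanish right after the GC runs, so every lookup costs the indirection for nothing.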
> I'm not entirely sure how to modify the values, but H2 also has
> settings for:
> h2.objectCacheSize (defult=1024)
> h2.objectCacheMaxPerElementSize (default=4096)
> Increasing these moderately (say, 2x to 5x) would reduce memory use
> when duplicate values exist between rows.
I think those are reasonable defaults. I don't plan to change them
unless there is proof that it reduces memory usage for common use
cases.
>> CSV
> I'd like to take a stab at this one myself.
Sorry, I already fixed the problem (committed to the trunk). Now CSV
reading is about the same speed as using LineNumberReader (about 0.8s
instead of over 2s for 33 MB). I still use FileReader by default.
Maybe there is a faster way to read from a file, or convert bytes to
characters?
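On the question of a faster way to convert bytes to characters: FileReader uses the platform default charset and a small internal buffer. A minimal, H2-independent sketch of the usual alternative, assuming the CSV data is Latin-1/ASCII so the cheap one-byte-per-char ISO-8859-1 decoder applies:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch: read a text file through an explicit charset decoder with a
// large buffer, instead of FileReader (platform default charset, small
// internal buffer). Returns the number of characters read.
class FastRead {
    static long countChars(String fileName) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream(fileName),
                        StandardCharsets.ISO_8859_1),
                64 * 1024)) {
            long count = 0;
            char[] buf = new char[8 * 1024];
            int n;
            while ((n = in.read(buf)) >= 0) {
                count += n;
            }
            return count;
        }
    }
}
```

The explicit charset also removes a platform dependency; whether it beats FileReader in a given setup is something to measure, not assume.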
> I don't know if you're willing to allow NIO use
It depends. It is not available on all platforms, so I'd like to keep
this optional.
> Ask me some time how I managed to read arrays of primitives from file
> faster than the best of the NIO methods.
Well, that's interesting (especially for the page store mechanism).
Regards,
Thomas
Is there a way to speed up insert creation?
Can I force parallelism while creating indexes?
> why are so many h2 log files created... when my jdbc url says LOG=0 and UNDO_LOG=0?
I can't reproduce this. In my test only one file is created:
<dbname>.0.log.db. Could you create a reproducible test case please?
It would be great if the test case does not have any dependencies
except the H2 jar file (one Java class uses the JDBC API and is run
using a static main method). Please include any initialization code
(CREATE TABLE, INSERT and so on) in the Java class or in a .sql script
file.
One possible optimization is to use the 'nio' or 'nioMapped' file
system. File systems are not yet documented; I will do that in the
next release:
== Pluggable File System ==
This database supports a pluggable file system API. The file system
implementation is selected using a file name prefix. The following
file systems are included:
* zip: read-only zip-file based file system. Format:
zip:/zipFileName!/fileName.
* nio: file system that uses FileChannel instead of
RandomAccessFile (faster in some operating systems).
* nioMapped: file system that uses memory mapped files (faster in
some operating systems).
* split: file system that splits files into 1 GB chunks (stackable
with other file systems).
* memLZF: compressing in-memory file system (experimental; used
for testing).
* memFS: in-memory file system (experimental; used for testing).
As an example, to use the nio file system, use the following
database URL: jdbc:h2:nio:~/test.
To register a new file system, extend the classes
org.h2.store.fs.FileSystem and FileObject, and call the method
FileSystem.register before using it.
> FILE_LOCK=NO
I don't think this has an impact on performance. It is a bit dangerous
to use, because multiple processes could open the same database file
at the same time, which could lead to corruption.
> We are building the index as an offline process after the
> inserts have been finished.
In theory that should help; however, for H2 it could currently be
better to create the indexes before inserting data.
> Can I force parallelism while creating indexes?
No.
> does not do anything special when doing bulk indexing
It does a little bit (pre-sorting batches of rows in memory), but it
doesn't help much. I plan to improve index creation when using the
page store.
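The "pre-sorting batches of rows in memory" idea is essentially an external merge sort. A rough sketch, independent of H2 and with everything kept in memory for illustration (real index creation would spill sorted batches to temporary storage and merge from disk):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of batch pre-sorting for bulk index creation: sort fixed-size
// batches individually, then merge the sorted batches with a priority
// queue. The long values stand in for index keys.
class BatchSort {
    static long[] sortInBatches(long[] keys, int batchSize) {
        List<long[]> batches = new ArrayList<>();
        for (int i = 0; i < keys.length; i += batchSize) {
            long[] batch = Arrays.copyOfRange(
                    keys, i, Math.min(i + batchSize, keys.length));
            Arrays.sort(batch);
            batches.add(batch);
        }
        // The queue holds {batchIndex, positionInBatch} pairs, ordered
        // by the current key of each batch.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
                Comparator.comparingLong((int[] p) -> batches.get(p[0])[p[1]]));
        for (int b = 0; b < batches.size(); b++) {
            queue.add(new int[] { b, 0 });
        }
        long[] result = new long[keys.length];
        int out = 0;
        while (!queue.isEmpty()) {
            int[] p = queue.poll();
            long[] batch = batches.get(p[0]);
            result[out++] = batch[p[1]];
            if (++p[1] < batch.length) {
                queue.add(p); // this batch still has keys left
            }
        }
        return result;
    }
}
```

The win for index creation comes from turning random B-tree insertions into sequential appends of already-sorted keys; the merge step is cheap compared to the page writes it avoids.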
> DB creation & open - first time: 4.5 sec
I guess that's the JVM 'warm up' time (compiling the bytecode). I
don't think there is anything I can do in H2 to solve this.
Regards,
Thomas
> In theory that should help, however for H2 currently it could be
> better to create the indexes before inserting data
If we are creating indexes before the inserts, the inserts become
twice as slow. This causes a huge bottleneck for the main processing
thread.
Hence we are just inserting in one thread and once the insertion is
over, creating indexes using another thread.
That way we are not only speeding up the inserts, but also keeping the
insert thread relatively free and light to process more inserts/sec,
and utilizing multi-core systems a bit better. The two threads (one
for insertion and another for index creation) can run simultaneously
on two different cores, speeding up the whole process.
Let me know if you see any flaw in the above process, or if you have
better ideas to parallelize and increase performance.
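The pipeline described above can be sketched generically with a BlockingQueue (plain Java; the long values stand in for rows or finished work units, and nothing here is H2 API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Generic sketch of the two-thread pipeline: one thread produces work
// (stand-in for the insert thread), the other consumes it (stand-in for
// the index-building thread). A poison pill signals end of input.
class Pipeline {
    static final long POISON = Long.MIN_VALUE;

    static List<Long> run(long rowCount) throws InterruptedException {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(1024);
        List<Long> indexed = new ArrayList<>();
        Thread producer = new Thread(() -> {
            try {
                for (long i = 1; i <= rowCount; i++) {
                    queue.put(i); // "insert" a row
                }
                queue.put(POISON); // no more rows
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        for (;;) { // consumer: "index" rows as they arrive
            long row = queue.take();
            if (row == POISON) {
                break;
            }
            indexed.add(row);
        }
        producer.join();
        return indexed;
    }
}
```

The bounded queue also gives back-pressure: if indexing falls behind, the insert thread blocks instead of letting unprocessed work grow without limit.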
> The files are created while the index creation is in progress.
I can't reproduce this problem, see my test case below. Could you
modify the test case so the files are created?
> Since we are creating indexes after finishing off with the inserts,
> the index creation takes a cool 30+ minutes to complete.
For me it takes about 3 min (using dummy data).
> H2 does not seem to have smarter algorithm for bulk indexing
That's true.
> This does not look like transaction logging, but more like temporary
> space used for sorting etc. while creating indexes.
Creating indexes shouldn't create temporary files.
> If we are creating indexes before the inserts, the inserts become
> twice as slow.
If you use CREATE TABLE ... AS SELECT ... ORDER BY then it shouldn't
be a big problem. It depends on the data, but for me it's only 50%
slower than creating the table without the index. Unless you create a
unique index (in which case you can create a primary key instead), you
would need to use the MySQL mode and the syntax:
CREATE TABLE TestTable(Field1 Varchar2(20), ..., index(field7)) as select ...
"order by" in the select statement on field7 may improve performance.
> Field1 Varchar2(20), Field2 Number(20), Field3 Number(20), Field4 INT, Field5 INT, Field5 INT, Field6 Number(8), Field7 Number(15), Field8 Number(10), Field9 INT);
Not using the precision helps by a few percent in my test:
Field1 Varchar, Field2 Number, Field3 Number, Field4 INT, Field5 INT,
Field5 INT, Field6 Number, Field7 Number, Field8 Number, Field9 INT);
Using BIGINT instead of NUMBER is more than 10% faster, especially
for the indexed column:
Field1 Varchar, Field2 Number, Field3 Number, Field4 INT, Field5 INT,
Field6 int, Field7 bigint, Field8 int, Field9 INT)
My test case:
package db;

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;

import org.h2.store.FileLister;
import org.h2.tools.DeleteDbFiles;

// java -Xmx512m
// create table 232 s, create index 183 s
public class TestCreateIndex {
    public static void main(String[] args) throws Exception {
        Class.forName("org.h2.Driver");
        DeleteDbFiles.execute("data", "test", true);
        String url = "jdbc:h2:nio:data/test;LOG=0;UNDO_LOG=0;" +
                "CACHE_SIZE=262144";
        final Connection conn = DriverManager.getConnection(url);
        System.out.println("creating table...");
        long t = System.currentTimeMillis();
        conn.createStatement().execute(
                "CREATE TABLE TestTable(" +
                "Field1 Varchar2(20), Field2 Number(20), " +
                "Field3 Number(20), Field4 INT, Field5 INT, " +
                "Field6 Number(8), Field7 Number(15), " +
                "Field8 Number(10), Field9 INT) " +
                "as select space(15), x, x, x, x, x, x, x, x " +
                "from system_range(1, 6000000)");
        System.out.println((System.currentTimeMillis() - t) + " ms");
        new Thread() {
            public void run() {
                try {
                    while (!conn.isClosed()) {
                        System.out.println();
                        for (String s : FileLister.
                                getDatabaseFiles("data", "test", true)) {
                            System.out.println(s + ": " +
                                    (new File(s).length() / 1024) + " KB");
                        }
                        Thread.sleep(20000);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }.start();
        System.out.println("creating index...");
        t = System.currentTimeMillis();
        conn.createStatement().execute(
                "Create index test_idx on TestTable(Field7)");
        System.out.println((System.currentTimeMillis() - t) + " ms");
        conn.close();
    }
}
Regards,
Thomas