RAM usage in H2


John Smith

May 29, 2015, 1:45:53 PM
to h2-da...@googlegroups.com
I have several CSV files that I'm loading into an H2 instance (running the beta version of H2, 1.4.187).

The total size of the CSV files on disk is approximately 4 GB.

After importing the data into H2 and adding a few indexes, my .mv.db database file is 3.6 GB. With the same CSV files loaded into an H2 "mem" database, entirely in RAM, Windows Server shows the process using 85 GB. The in-memory database simply explodes in size compared to the file database, even though everything else (tables, indexes, etc.) is identical.

Does that make sense? Even accounting for some of that 85 GB being used as a cache of some sort that might also be present with the file database (I forgot to measure how much RAM the H2 process consumed when using the file database), why is there such a huge increase in RAM usage? My CSV files are only 4 GB, and I add one or two indexes on each of the two tables; I don't see how 4 GB of CSV translates to 85 GB once loaded into RAM.

The second thing is speed. I moved my file database to a RAM disk, then started H2 and executed a query. It took 8.4 minutes. It was a simple query with no joins, but it used an ORDER BY on an unindexed column. The same query against the H2 "mem" database returned in 15 seconds, again with no index on the ORDER BY column. I understand that I could add an index to cut the query time, but that's not the point; I wanted to measure the performance difference between the native "mem" mode and an H2 file database stored on a RAM disk.

So to sum up:
1: Why does H2 use so much more memory in "mem" mode than the file size on disk, for the same amount of data?
2: Why would the "mem" database perform so much faster than a file database stored on a RAM disk?
3: If I'm required to use "mem" mode for the performance we need, is there any way to "defrag" the database while it's in RAM? After several drops, deletes, updates, etc., the mem-based database keeps growing, and I can't shrink it with "shutdown defrag" like I can with a file-based database.

Thanks in advance for any thoughts on all this.


Christian MICHON

Jun 2, 2015, 2:19:58 AM
to h2-da...@googlegroups.com
Are you using commands like "create table mytable as select * from csvread('myfile.csv')"?

If so, I have a few pieces of advice (a combined sketch follows this list):
- Create the schema first, with the right types for the expected columns; otherwise each column will be cast as text with a 2 GB maximum allocation.
- Insert like this example:
  INSERT INTO LANG_CHINESE (ISO, NAME) SELECT * FROM CSVREAD('CHINESE.CSV', NULL, 'UTF8');
- Do not create the indexes first; create them after the CSV import has completed.
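
Putting those points together, a minimal sketch could look like the following (the table, columns, and file name here are purely illustrative; adjust the types to your own data):

  -- Hypothetical example: declare the column types up front,
  -- bulk-load from the CSV file, and only then add indexes.
  -- (Assumes orders.csv contains exactly these four columns, in this order.)
  CREATE TABLE ORDERS (
      ID INT,
      CUSTOMER VARCHAR,
      AMOUNT DECIMAL(12, 2),
      CREATED TIMESTAMP
  );

  INSERT INTO ORDERS (ID, CUSTOMER, AMOUNT, CREATED)
      SELECT * FROM CSVREAD('orders.csv', NULL, 'charset=UTF-8');

  -- Indexes only after the import has finished:
  CREATE INDEX IDX_ORDERS_CUSTOMER ON ORDERS(CUSTOMER);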

Can your data be shared?

John Smith

Jun 2, 2015, 3:37:34 PM
to h2-da...@googlegroups.com
Thanks for your comments! I'm using a custom script that reads the first 20,000 lines of a CSV file and then generates the H2 import command dynamically from that. So it's not a straight "create table X as select * from Y"; it includes the proper column type for every column it detects.

The string/VARCHAR columns could probably be handled better; right now they default to VARCHAR(5000), which is still better than the Integer.MAX_VALUE default I believe H2 uses, but not as small as it could be. Our string lengths vary widely, though, so it's hard to predict a maximum for the string columns. Sometimes the first CSV file imports fine, but then one of the remaining (of hundreds of) CSV files fails on the INSERT command (which is generated at the same time, with the same schema as the CREATE TABLE), so we prefer to keep the VARCHARs a little larger to be safe.

We do add the indexes after all the data is inserted, though, so at least that part looks good.

But I'm still confused about why the RAM usage is so much higher than the size of the file on disk. Wouldn't the data from a 4 GB .mv.db file use roughly the same amount of RAM when loaded into a "mem" database? Even so, we might be able to work around the huge memory usage, as long as we can reclaim the space in RAM when we drop a table and reload it from updated CSV. Right now the "mem" database just keeps growing until eventually H2 stops responding (CPU usage increases but the query never completes; I assume it simply runs out of available memory).

Unfortunately the data is not allowed to leave the building. I might be able to write a small Java program that generates a large database to compare "mem" mode vs. "file" mode, but I'm not sure how helpful that would be.

Noel Grandin

Jun 3, 2015, 1:20:56 AM
to h2-da...@googlegroups.com
It shouldn't be using that much memory.
If you could create a standalone test-case that would be perfect.
If not, perhaps try examining it with a profiler and posting what it finds here.

Thomas Mueller

Jun 3, 2015, 1:42:07 AM
to H2 Google Group
Hi,

Using "varchar(300)" does not use less memory or disk space than "varchar(30000000)" or simply "varchar". The length is a _condition_, and does not affect memory or disk space usage. See also the documentation.

One option would be to use a "file system in memory", for example "jdbc:h2:memFS:test" or "jdbc:h2:memLZF:test". This is slower than pure in memory, but needs less heap memory. You can use defrag and so on there.
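
As a concrete sketch (the "test" database name is just taken from the URLs above, and this assumes the usual compaction command behaves the same way against the in-memory file system):

  -- Connect with a URL such as jdbc:h2:memFS:test (or jdbc:h2:memLZF:test),
  -- then compact it the same way as a file-based database:
  SHUTDOWN DEFRAG;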

> create a schema first

"create table as select" is usually faster. Best is table creation with the right data types, as in "create table(id int, name varchar) as select ... from csvread...".


> why does h2 use so much more memory in "mem" mode vs the file size on disk, for the same amount of data?

It's hard to explain, but yes, it needs much more memory. Part of the reason is that Java uses a lot of memory for objects (java.lang.Integer, java.lang.String, and so on). Internally, a String uses an array of char, which is 2 bytes per character, plus the object and array headers and a reference to the array; on disk, only one byte per character is used for English text, so a short ASCII value can easily take several times as much space as a Java object as it does in the file.


> why would the mem database perform so much quicker than a file database stored in ram?

Disk access is slow.


> if I am required to use the "mem" mode for the performance that we need

For your case, I would create indexes, even for in-memory databases.
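
For the query described above, that would just mean an index on the ORDER BY column, for example (table and column names are placeholders, since the actual query wasn't posted):

  -- Hypothetical: index the column used in the previously unindexed ORDER BY
  CREATE INDEX IDX_MYTABLE_SORTCOL ON MYTABLE(SORT_COL);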

Regards,
Thomas





John Smith

Jun 3, 2015, 2:44:30 PM
to h2-da...@googlegroups.com
Interesting; I didn't know about memFS and memLZF. Unfortunately, neither option works for me: I get an exception in the middle of importing my second CSV file (database version 1.4.187).


General error: "java.lang.ArrayIndexOutOfBoundsException: -2095880" [50000-187] HY000/50000
org.h2.jdbc.JdbcSQLException: General error: "java.lang.ArrayIndexOutOfBoundsException: -2095880" [50000-187]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
    at org.h2.message.DbException.get(DbException.java:168)
    at org.h2.message.DbException.convert(DbException.java:295)
    at org.h2.message.DbException.toSQLException(DbException.java:268)
    at org.h2.message.TraceObject.logAndConvert(TraceObject.java:352)
    at org.h2.jdbc.JdbcStatement.execute(JdbcStatement.java:160)
    at org.h2.server.web.WebApp.getResult(WebApp.java:1390)
    at org.h2.server.web.WebApp.query(WebApp.java:1063)
    at org.h2.server.web.WebApp$1.next(WebApp.java:1025)
    at org.h2.server.web.WebApp$1.next(WebApp.java:1012)
    at org.h2.server.web.WebThread.process(WebThread.java:168)
    at org.h2.server.web.WebThread.run(WebThread.java:93)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -2095880
    at org.h2.store.fs.FileMemData.expand(FilePathMem.java:513)
    at org.h2.store.fs.FileMemData.readWrite(FilePathMem.java:622)
    at org.h2.store.fs.FileMem.read(FilePathMem.java:309)
    at org.h2.store.fs.FileBase.read(FileBase.java:41)
    at org.h2.mvstore.cache.FilePathCache$FileCache.read(FilePathCache.java:81)
    at org.h2.mvstore.DataUtils.readFully(DataUtils.java:429)
    at org.h2.mvstore.FileStore.readFully(FileStore.java:98)
    at org.h2.mvstore.Page.read(Page.java:191)
    at org.h2.mvstore.MVStore.readPage(MVStore.java:1843)
    at org.h2.mvstore.MVMap.readPage(MVMap.java:736)
    at org.h2.mvstore.Page.getChildPage(Page.java:218)
    at org.h2.mvstore.Cursor.min(Cursor.java:129)
    at org.h2.mvstore.Cursor.hasNext(Cursor.java:36)
    at org.h2.mvstore.MVStore.collectReferencedChunks(MVStore.java:1214)
    at org.h2.mvstore.MVStore.freeUnusedChunks(MVStore.java:1183)
    at org.h2.mvstore.MVStore.storeNowTry(MVStore.java:981)
    at org.h2.mvstore.MVStore.storeNow(MVStore.java:973)
    at org.h2.mvstore.MVStore.commitAndSave(MVStore.java:962)
    at org.h2.mvstore.MVStore.beforeWrite(MVStore.java:2097)
    at org.h2.mvstore.MVMap.beforeWrite(MVMap.java:1046)
    at org.h2.mvstore.MVMap.put(MVMap.java:117)
    at org.h2.mvstore.db.TransactionStore.commit(TransactionStore.java:358)
    at org.h2.mvstore.db.TransactionStore$Transaction.commit(TransactionStore.java:779)
    at org.h2.engine.Session.commit(Session.java:507)
    at org.h2.command.Command.stop(Command.java:152)
    at org.h2.command.Command.executeUpdate(Command.java:284)
    at org.h2.jdbc.JdbcStatement.executeInternal(JdbcStatement.java:184)
    at org.h2.jdbc.JdbcStatement.execute(JdbcStatement.java:158)
    ... 7 more 

The same data loads successfully in both mem and file-based modes.

> One option would be to use a "file system in memory", for example "jdbc:h2:memFS:test" or "jdbc:h2:memLZF:test". This is slower than pure in memory, but needs less heap memory. You can use defrag and so on there.


Can you explain how to defrag in mem, memFS, or memLZF mode? The only way I'm aware of to defrag and clean up dead data is to execute a "shutdown defrag" command, which wouldn't be an option with mem-based databases. The roadmap seems to back up my understanding:

John Smith

Jun 3, 2015, 8:42:08 PM
to h2-da...@googlegroups.com
I'm now getting that exception with "mem" mode too. After running tests in memFS and memLZF modes, I tried to reload our database in mem mode, but I get the same error there as well, though the index now appears to be a different value (-2095986 vs. -2095880).

Any ideas? I've even rebooted the server; nothing helps. The CSV files haven't been modified recently, and I've reloaded this database multiple times in mem mode before; this is the first time it has consistently failed.


Unable to perform query
General error: "java.lang.ArrayIndexOutOfBoundsException: -2095968" [50000-181]
org.h2.jdbc.JdbcSQLException: General error: "java.lang.ArrayIndexOutOfBoundsException: -2095968" [50000-187]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
    at org.h2.message.DbException.get(DbException.java:168)
    at org.h2.message.DbException.convert(DbException.java:295)
    at org.h2.server.TcpServerThread.sendError(TcpServerThread.java:221)
    at org.h2.server.TcpServerThread.run(TcpServerThread.java:161)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -2095968
    at org.h2.store.fs.FileMemData.expand(FilePathMem.java:513)
    at org.h2.store.fs.FileMemData.readWrite(FilePathMem.java:622)
    at org.h2.store.fs.FileMem.read(FilePathMem.java:309)
    at org.h2.store.fs.FileBase.read(FileBase.java:41)
    at org.h2.mvstore.cache.FilePathCache$FileCache.read(FilePathCache.java:81)
    at org.h2.mvstore.DataUtils.readFully(DataUtils.java:429)

Thomas Mueller

Jun 10, 2015, 2:12:07 AM
to h2-da...@googlegroups.com
Hi,

I can reproduce the problem. Even though "long" is used for file access, it looks like an integer overflow when the file is larger than 2 GB, which would also explain the negative array index in the exception. I'm currently debugging this.

Regards,
Thomas

Harshad RJ

Jun 14, 2015, 3:48:15 PM
to h2-da...@googlegroups.com
Looks like this issue was fixed by this commit. Thanks!

As an aside, I think confirmed bugs like these should be tracked as issues on GitHub.

Thomas Mueller

Jun 15, 2015, 7:47:18 AM
to H2 Google Group
> I think confirmed bugs like these should be tracked as an issue on Github.

Currently, not all changes have issues. However, all changes are tracked in the change log (except formatting and so on).

Regards,
Thomas

