GROUP BY on "large" file-based tables causes Out of Memory Error


MacMahon McCallister

Apr 21, 2020, 3:49:41 AM
to H2 Database
Hello, I wrote a simple test case to reproduce the problem (it doesn't do any cleanup).

In this scenario the OOM always happens when H2 runs the GROUP BY query on the table with 5 million rows. What could be the cause? During my testing, the JVM was started with -Xmx1024m.
void createTablesForGroupByScenario() throws SQLException {
    String dir = "/path/to/some/dir";
    Integer[] sizes = new Integer[] { 1000, 10_000, 100_000, 1_000_000, 5_000_000 };

    for (Integer size : sizes) {
        System.out.println("Creating table with size: " + size);

        String name = "group_by_" + size;
        String h2Url = "jdbc:h2:file:" + dir + "/" + name
                + ";DB_CLOSE_ON_EXIT=FALSE"
                + ";AUTO_RECONNECT=TRUE"
                + ";FILE_LOCK=NO"
                + ";TRACE_LEVEL_FILE=0"
                + ";TRACE_LEVEL_SYSTEM_OUT=0"
                + ";LOG=0"
                + ";UNDO_LOG=0"
                + ";CACHE_SIZE=" + 65000;

        Connection con = DriverManager.getConnection(h2Url);

        String initSql = "create table result(id bigint, name varchar, phone int);\n";
        RunScript.execute(con, new StringReader(initSql));
        con.commit();

        PreparedStatement st = con.prepareStatement("insert into result values (?, ?, ?)");
        for (int i = 0; i < size; i++) {
            st.setLong(1, i);
            st.setString(2, "name_" + i);
            st.setInt(3, i);
            st.addBatch();
            if (i % 500 == 0) {
                st.executeBatch();
                con.commit();
            }
        }
        st.executeBatch();
        con.commit();
        con.close();
    }
}


void forEveryDbCreatedRunGroupByQuery() throws SQLException {
    String dir = "/path/to/some/dir";
    Integer[] sizes = new Integer[] { 1000, 10_000, 100_000, 1_000_000, 5_000_000 };

    for (Integer size : sizes) {
        System.out.println("Running query for table with size: " + size);

        String name = "group_by_" + size;
        String h2Url = "jdbc:h2:file:" + dir + "/" + name
                + ";DB_CLOSE_ON_EXIT=FALSE"
                + ";AUTO_RECONNECT=TRUE"
                + ";FILE_LOCK=NO"
                + ";TRACE_LEVEL_FILE=0"
                + ";TRACE_LEVEL_SYSTEM_OUT=0"
                + ";LOG=0"
                + ";UNDO_LOG=0"
                + ";CACHE_SIZE=" + 65000;

        Connection con = DriverManager.getConnection(h2Url);

        String sql = "select id, sum(phone) from result group by id;\n";

        long start = System.currentTimeMillis();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(sql);

        int processed = 0;
        while (rs.next()) {
            // 'fake' result-set processing by just counting the rows
            processed++;
        }
        con.close();
        long time = System.currentTimeMillis() - start;
        System.out.println(String.format("Processed %s, time %s ms", processed, time));
    }
}






Noel Grandin

Apr 21, 2020, 4:18:02 AM
to H2 Database
Which version is this?

And what happens when you remove the dangerous options? (LOG and UNDO_LOG)

Evgenij Ryazanov

Apr 21, 2020, 4:30:20 AM
to H2 Database
Hello.

If you don't have an index on the GROUP BY column, you need a lot of memory for such queries in H2.

You can use the EXPLAIN command to check whether optimization is used or not.

create table result(id bigint, name varchar, phone int);

-- Without optimization
explain
select id, sum(phone) from result group by id;
> SELECT
>     "ID",
>     SUM("PHONE")
> FROM "PUBLIC"."RESULT"
>     /* PUBLIC.RESULT.tableScan */
> GROUP BY "ID"

create index i on result(id);

-- With optimization
explain
select id, sum(phone) from result group by id;
> SELECT
>     "ID",
>     SUM("PHONE")
> FROM "PUBLIC"."RESULT"
>     /* PUBLIC.I */
> GROUP BY "ID"
> /* group sorted */

MacMahon McCallister

Apr 21, 2020, 6:12:01 AM
to H2 Database


On Tuesday, 21 April 2020 11:18:02 UTC+3, Noel Grandin wrote:
Which version is this ?

And what happens when you remove the dangerous options? (LOG and UNDO_LOG)

Version: 1.4.200.
Nothing changes if I remove the options. I had actually tried fiddling with the options earlier, but it always fails on the 5M-row table.

On Tuesday, 21 April 2020 11:30:20 UTC+3, Evgenij Ryazanov wrote:
If you don't have an index on the GROUP BY column, you need a lot of memory for such queries in H2.

This kind of makes sense, but still not in this test scenario. How come the previous test cases (up to 1M rows, without an index) run fine, even with memory as low as -Xmx256m:
Executing with size: 1000
Processed 1000, time 30 ms
Executing with size: 10000
Processed 10000, time 50 ms
Executing with size: 100000
Processed 100000, time 241 ms
Executing with size: 1000000
Processed 1000000, time 1925 ms


MacMahon McCallister

Apr 21, 2020, 6:23:03 AM
to H2 Database
To answer myself: the -Xmx setting didn't actually apply properly, and therefore (as suggested earlier) the H2 operation ran out of memory. With a heap of -Xmx1024m it is able to execute the unindexed query on the table with 5M rows within 10 seconds.

But a follow-up question: for these "unindexed GROUP BY" scenarios, does H2 have to read the whole result set into memory?
And besides indexing the table (which I probably can't do), are there any other optimizations to consider?


Evgenij Ryazanov

Apr 21, 2020, 7:31:11 AM
to H2 Database
H2 doesn't need a lot of memory for plain queries without aggregate and window functions; large results are stored on disk automatically. But queries with aggregate or window functions currently need to load the whole result into memory. The only exception is the mentioned optimization for group-sorted queries in the presence of a compatible index.
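(For reference, the point at which H2 moves a large plain-query result out to disk is tunable via the MAX_MEMORY_ROWS setting; the value below is only an illustration, the actual default is considerably larger.)

```sql
-- Number of result rows kept in memory before a plain (non-aggregate)
-- result is buffered to disk; illustrative value, not a recommendation.
SET MAX_MEMORY_ROWS 10000;
```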

Noel Grandin

Apr 21, 2020, 4:04:46 PM
to H2 Database
TBH, retrieving super-large result sets is not something we optimise for.

If you really need that, you can try turning on the LAZY_FETCH feature.
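A sketch of how enabling that could look, assuming the LAZY_QUERY_EXECUTION setting present in recent 1.4.x versions is the one meant here (check your version's documentation):

```sql
-- Produce rows while the result is being read, instead of
-- materializing the whole result first:
SET LAZY_QUERY_EXECUTION 1;
```

The same setting can presumably also be appended to the JDBC URL as `;LAZY_QUERY_EXECUTION=1`. Note that it helps streaming of plain results; the in-memory aggregation for unindexed GROUP BY discussed above is a separate limitation.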
