"SELECT ... WHERE ... IN" on extremely large tables


Tim Fielder

Oct 7, 2019, 7:32:22 AM
to H2 Database
I'm trying to develop an application that parses scientific journal publications from a few sources and collects them in a single database.  The problem is that there are just too damn many of them, and it creates a catch-22: either I need literally years to build the database, or I build it in a few days with no structure, constraints, or indexing and then run out of memory when I try to convert the database into a useful state.

For clarity and simplicity, we'll take the example of parsing a subset of the Scopus publications, 10,000 articles each from the years 1996-2018 (230,000 total articles), and use table definitions that drop the majority of the metadata fields.

If I apply constraints to the tables during initialization, the schema looks like this:
CREATE TABLE document (docid varchar NOT NULL,
                       title varchar,
                       abstract varchar NULL,
                       PRIMARY KEY (docid) );

CREATE TABLE reference (docid varchar NOT NULL,
                        refid varchar NOT NULL,
                        PRIMARY KEY (docid, refid),
                        FOREIGN KEY (docid) REFERENCES document(docid),
                        FOREIGN KEY (refid) REFERENCES document(docid) );

I then use INSERT ... ON DUPLICATE KEY UPDATE and INSERT IGNORE to add rows to the document table for each article, and for each citation of another article, respectively.
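Concretely, the two statements look roughly like this (a simplified sketch against the cut-down schema above, with placeholder values; the MySQL-style syntax assumes H2's MySQL compatibility mode):

-- Full article: insert it, or refresh the metadata if a stub row already exists.
INSERT INTO document (docid, title, abstract)
    VALUES ('doc-1', 'Some title', 'Some abstract')
    ON DUPLICATE KEY UPDATE title = 'Some title', abstract = 'Some abstract';

-- Cited article: insert a stub row only if that docid isn't present yet.
INSERT IGNORE INTO document (docid, title)
    VALUES ('doc-2', 'Some cited title');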

The problem with this approach is that since the tables are indexed, the insert time grows quadratically with the size of the table.  As a result I can handle 230,000 articles in about 2 hours, but the full 46.7 million will take at least 300 days.

In order to defer the application of constraints until after I fully complete parsing, the schema becomes simply:
CREATE TABLE document (docid varchar NOT NULL,
                       title varchar,
                       abstract varchar NULL );

CREATE TABLE tempdoc (docid varchar NOT NULL,
                      title varchar );


CREATE TABLE reference (docid varchar NOT NULL,
                        refid varchar NOT NULL );


CREATE TABLE tempref (docid varchar NOT NULL,
                      refid varchar NOT NULL );


Note the inclusion of the temporary tables, which are likely to contain duplicate keys.  I perform all full article INSERTs on the document table because they are guaranteed to be unique in the data, and all cited article INSERTs on the tempdoc table.

After parsing, I use the following queries to retrieve all rows of the tempdoc table that aren't already in document, and then all rows of the tempref table that are unique and valid:
SELECT docid, MAX(title) FROM tempdoc WHERE docid NOT IN (SELECT docid FROM document) GROUP BY docid;

SELECT DISTINCT docid, refid FROM tempref WHERE refid IN (SELECT docid FROM document);

The trouble with this is that the INSERTs into document following the first query multiply its size by a factor of about 20, and thus by the time I get to the SELECT from tempref the inner SELECT becomes too large.  With a subset of just 230,000 articles, the inclusion of the referenced articles adds another 4.52 million.  If I bump it up to a larger subset of 2.3 million articles, the number of cited articles adds around 43 million additional rows, and my machine's 64 GB of RAM is no match for the task.  The full data set of 46.7 million articles will create at least 900 million additional rows, and I don't think there's enough RAM in the world to handle that.

So my question is: is there a middle ground where I can build this database in a reasonable amount of time and without suplexing my RAM?  I've already tried converting the String article identifiers to long, which results in a performance loss and doesn't solve the out-of-memory error.  I've also tried combining the SELECT and INSERT queries into a single operation to let the database handle all of it on the back end, which is marginally faster but doesn't solve the out-of-memory error.  I most recently tried simply SELECTing the entire tempref table and performing single SELECT COUNT queries for each row to check for existence in the reference table - by my math this solution is expected to finish processing the full 46 million articles sometime around 2700 AD.

Noel Grandin

Oct 7, 2019, 7:40:01 AM
to h2-da...@googlegroups.com


On 2019/10/07 1:32 PM, Tim Fielder wrote:
> The problem with this approach is that since the tables are indexed, the insert time grows quadratically with the size
> of the table. As a result I can handle 230,000 articles in about 2 hours, but the full 46.7 million will take at least
> 300 days.

That should not be the case; insert time should be something like O(n log n), so I'm not sure why it is so slow for you.


> In order to defer the application of constraints until after I fully complete parsing, the schema becomes simply:

If you are going to do something like this, then rather:

(*) insert all rows into tempdoc

(*) CREATE TABLE document AS SELECT DISTINCT .... FROM tempdoc

(*) add constraints to document

and similarly for other tables.
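With the simplified schema from your first message, that would look something like this (a sketch; if the same docid can show up with different titles you would use GROUP BY with MAX(title) instead of DISTINCT, as in your own queries):

CREATE TABLE document AS
    SELECT DISTINCT docid, title FROM tempdoc;

ALTER TABLE document ALTER COLUMN docid SET NOT NULL;
ALTER TABLE document ADD PRIMARY KEY (docid);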

Evgenij Ryazanov

Oct 7, 2019, 7:51:29 AM
to H2 Database
BTW, what version of H2 do you use?

The IN (query) condition needs a lot of memory in 1.4.197 and older versions.

Tim Fielder

Oct 7, 2019, 8:41:58 AM
to H2 Database
I'm using version 1.4.199 for this.  That's good to know though.

Tim Fielder

Oct 7, 2019, 8:47:53 AM
to H2 Database
O(n log n) sounds about right.  Based on the most recent results I get a roughly quadratic fit for the expected time to process X files of 10,000 articles each.  With an x^2 coefficient of 0.21 that's only just above linear, but once you're up to over 4,600 files it's still way too damn long.

I will try the CREATE TABLE ... AS SELECT DISTINCT thing and get back to you.  Thanks.

Tim Fielder

Oct 7, 2019, 2:57:57 PM
to H2 Database
The suggestion of populating the tables via CREATE TABLE ... AS SELECT DISTINCT is a good one so far: it's nearly a 30% performance improvement over the previous method, which cuts about 7 whole days off the expected time to populate the database.

I'm currently testing to determine if it will still experience the out-of-memory exception, as I am still performing one WHERE ... NOT IN query.  It's going to take all night, so I'll check back in the morning.  If it does fail, I have some ideas for how to work around the problem.

I also wanted to ask about performance differences across database modes.  Right now I'm constructing the database in file mode and just storing it on the same machine that's running the application.  I can't set up an actual server to offload the database maintenance onto a second machine, but I might be able to set up an actual MySQL instance on this machine and connect to it.  Would I reasonably expect to see a performance improvement by doing this?

Noel Grandin

Oct 7, 2019, 3:55:17 PM
to h2-da...@googlegroups.com
On Mon, 7 Oct 2019 at 20:58, Tim Fielder <tlfi...@gmail.com> wrote:
> I also wanted to ask about performance differences across database modes.  Right now I'm constructing the database in file mode and just storing it on the same machine that's running the application.  I can't set up an actual server to offload the database maintenance onto a second machine, but

I wouldn't expect running H2 on another machine to improve things. MySQL may or may not be faster, depending on a lot of things.

Also, looking at your schema, and given that you say you are indexing the tables, you may want to consider doing
CREATE TABLE reference (docid varchar NOT NULL,
                        refid varchar NOT NULL );
CREATE TABLE reference_reverse (refid varchar NOT NULL,
                        docid varchar NOT NULL );
and only having a primary key index on each one.

This will make certain lookups faster, since the required data will be grouped closer together.
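In other words, something like this (a sketch of the indexing I have in mind; the point is that the column order of each key matches the direction of the lookup):

ALTER TABLE reference ADD PRIMARY KEY (docid, refid);
ALTER TABLE reference_reverse ADD PRIMARY KEY (refid, docid);

-- reference_reverse holds the same pairs, just keyed the other way around:
INSERT INTO reference_reverse (refid, docid)
    SELECT refid, docid FROM reference;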

Tim Fielder

Oct 9, 2019, 6:23:09 AM
to H2 Database
On Monday, October 7, 2019 at 3:55:17 PM UTC-4, Noel Grandin wrote:
> Also, looking at your schema, and given that you say you are indexing the tables, you may want to consider doing
> CREATE TABLE reference (docid varchar NOT NULL,
>                         refid varchar NOT NULL );
> CREATE TABLE reference_reverse (refid varchar NOT NULL,
>                         docid varchar NOT NULL );
> and only having a primary key index on each one.
>
> This will make certain lookups faster, since the required data will be grouped closer together.


I'll be sure to experiment with that when I can actually get a database built.  In the meantime, I'm still running into the memory overhead limit issue.  Here's my current code:
// Finalise the doctype lookup table.
Statement docType = gConn.createStatement();
docType.execute("ALTER TABLE doctype ADD PRIMARY KEY (abbrev);");
docType.close();

// Build a working table that merges the already-parsed documents with the cited-only ones.
Statement refDoc = gConn.createStatement();
refDoc.execute(
        "CREATE TABLE indexDoc ("
                + " docid varchar,"
                + " title varchar,"
                + " content varchar,"
                + " sourceTitle varchar,"
                + " startPage varchar,"
                + " endPage varchar,"
                + " year varchar,"
                + " indexed bit NOT NULL DEFAULT 0 );");

// Mark every docid that is already in document as indexed.
refDoc.executeUpdate("INSERT INTO indexDoc (docid, indexed) ( SELECT docid, 1 FROM document );");

// Collapse the cited-article rows (refdoc) down to one row per refid.
refDoc.executeUpdate(
        "INSERT INTO indexDoc (docid, title, content, sourceTitle, startPage, endPage, year) ("
                + " SELECT refid, MAX(title), MAX(content), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
                + " FROM refdoc"
                + " GROUP BY refid );");

// Add the not-yet-indexed documents to the Lucene index.
ResultSet refDocs = refDoc.executeQuery("SELECT docid, MAX(title), MAX(content)"
        + " FROM indexDoc"
        + " GROUP BY docid"
        + " HAVING MAX(indexed) = 0;");

while (refDocs.next()) {
    Document doc = new Document();
    doc.add(new Field("ID", refDocs.getString(1), FieldFactory.field(false)));
    doc.add(new Field("title", refDocs.getString(2), FieldFactory.field(false)));
    doc.add(new Field("content", refDocs.getString(3), FieldFactory.field(true)));
    writer.addDocument(doc);
}
refDocs.close();

// Copy the same not-yet-indexed documents into the permanent document table.
refDoc.executeUpdate(
        "INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) ("
                + " SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
                + " FROM indexDoc"
                + " GROUP BY docid"
                + " HAVING MAX(indexed) = 0 );");

refDoc.execute("DROP TABLE indexDoc;");
refDoc.execute("DROP TABLE refdoc;");
refDoc.close();

// Apply the constraints to document now that it is fully populated.
Statement doc = gConn.createStatement();
doc.execute("ALTER TABLE document ADD PRIMARY KEY (docid);");
doc.execute("ALTER TABLE document ADD FOREIGN KEY (docType) REFERENCES doctype(abbrev);");
doc.close();

// Build the de-duplicated reference table and apply its constraints.
Statement ref = gConn.createStatement();
ref.execute("CREATE TABLE reference AS ( SELECT DISTINCT docid, refid FROM tempref );");
ref.execute("DROP TABLE tempref;");
ref.execute("ALTER TABLE reference MODIFY COLUMN docid varchar NOT NULL;");
ref.execute("ALTER TABLE reference MODIFY COLUMN refid varchar NOT NULL;");
ref.execute("ALTER TABLE reference ADD PRIMARY KEY (docid, refid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (docid) REFERENCES document(docid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (refid) REFERENCES document(docid);");
ref.close();

As you can see, I'm also building a Lucene index alongside the database.  To prevent articles from being added to it more than once, I have to use the database to build a list of all articles by their unique IDs and keep track of which ones have already been indexed.  For some reason, this query works just fine:
SELECT docid, MAX(title), MAX(content)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0;

but this query fails due to a memory limit:
INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) (
SELECT docid, MAX(title), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0 );

My suspicion is that nesting a SELECT inside any other statement is going to cause these kinds of errors, and that I will need to instead retrieve a ResultSet and use a PreparedStatement to batch-INSERT into the tables in chunks of ~10,000 rows or so.

Tim Fielder

Oct 9, 2019, 8:35:34 AM
to H2 Database
On Wednesday, October 9, 2019 at 6:23:09 AM UTC-4, Tim Fielder wrote:
> My suspicion is that nesting a SELECT inside any other statement is going to cause these kinds of errors, and that I will need to instead retrieve a ResultSet and use a PreparedStatement to batch-INSERT into the tables in chunks of ~10,000 rows or so.

Well, that doesn't seem to be the case.  I modified the code to:
// Finalise the doctype lookup table.
Statement docType = gConn.createStatement();
docType.execute("ALTER TABLE doctype ADD PRIMARY KEY (abbrev);");
docType.close();

// Build a working table that merges the already-parsed documents with the cited-only ones.
Statement refDoc = gConn.createStatement();
refDoc.execute(
        "CREATE TABLE indexDoc ("
                + " docid varchar,"
                + " title varchar,"
                + " content varchar,"
                + " sourceTitle varchar,"
                + " startPage varchar,"
                + " endPage varchar,"
                + " year varchar,"
                + " indexed bit NOT NULL DEFAULT 0 );");

// Mark every docid that is already in document as indexed.
refDoc.executeUpdate("INSERT INTO indexDoc (docid, indexed) ( SELECT docid, 1 FROM document );");

// Collapse the cited-article rows (refdoc) down to one row per refid.
refDoc.executeUpdate(
        "INSERT INTO indexDoc (docid, title, content, sourceTitle, startPage, endPage, year) ("
                + " SELECT refid, MAX(title), MAX(content), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
                + " FROM refdoc"
                + " GROUP BY refid );");

// Walk the not-yet-indexed documents once, adding each to the Lucene index
// and batch-inserting it into document via a PreparedStatement.
ResultSet refDocs =
        refDoc.executeQuery("SELECT docid, MAX(title), MAX(content), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)"
                + " FROM indexDoc"
                + " GROUP BY docid"
                + " HAVING MAX(indexed) = 0;");
PreparedStatement insert =
        gConn.prepareStatement(
                "INSERT INTO document (docid, title, sourceTitle, startPage, endPage, year) VALUES (?, ?, ?, ?, ?, ?);");
int count = 0;

while (refDocs.next()) {
    Document doc = new Document();
    doc.add(new Field("ID", refDocs.getString(1), FieldFactory.field(false)));
    doc.add(new Field("title", refDocs.getString(2), FieldFactory.field(false)));
    doc.add(new Field("content", refDocs.getString(3), FieldFactory.field(true)));
    writer.addDocument(doc);

    insert.setString(1, refDocs.getString(1));
    insert.setString(2, refDocs.getString(2));
    insert.setString(3, refDocs.getString(4));
    insert.setString(4, refDocs.getString(5));
    insert.setString(5, refDocs.getString(6));
    insert.setString(6, refDocs.getString(7));
    insert.addBatch();

    count++;
    if (count >= 10000) {
        // Flush the batch every ~10,000 rows and start counting again.
        insert.executeBatch();
        count = 0;
    }
}
refDocs.close();
if (count > 0) {
    insert.executeBatch();
}
insert.close();

refDoc.execute("DROP TABLE indexDoc;");
refDoc.execute("DROP TABLE refdoc;");
refDoc.close();

// Apply the constraints to document now that it is fully populated.
Statement doc = gConn.createStatement();
doc.execute("ALTER TABLE document ADD PRIMARY KEY (docid);");
doc.execute("ALTER TABLE document ADD FOREIGN KEY (docType) REFERENCES doctype(abbrev);");
doc.close();

// Build the de-duplicated reference table and apply its constraints.
Statement ref = gConn.createStatement();
ref.execute("CREATE TABLE reference AS ( SELECT DISTINCT docid, refid FROM tempref );");
ref.execute("DROP TABLE tempref;");
ref.execute("ALTER TABLE reference MODIFY COLUMN docid varchar NOT NULL;");
ref.execute("ALTER TABLE reference MODIFY COLUMN refid varchar NOT NULL;");
ref.execute("ALTER TABLE reference ADD PRIMARY KEY (docid, refid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (docid) REFERENCES document(docid);");
ref.execute("ALTER TABLE reference ADD FOREIGN KEY (refid) REFERENCES document(docid);");
ref.close();

I still get an out-of-memory exception on the query
SELECT docid, MAX(title), MAX(content), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)
FROM indexDoc
GROUP BY docid
HAVING MAX(indexed) = 0;

Is it just the case that any conditional SELECT will fail once a table reaches a certain size?

Noel Grandin

Oct 9, 2019, 8:45:50 AM
to h2-da...@googlegroups.com


On 2019/10/09 2:35 PM, Tim Fielder wrote:
>
> Is it just the case that any conditional SELECT will fail once a table reaches a certain size?
>

Quite possibly that code was not smart enough to spool to disk when exceeding RAM size.

You could try building from master and using that; Evgenij has fixed some of those cases.

Noel Grandin

Oct 9, 2019, 8:47:17 AM
to h2-da...@googlegroups.com
Otherwise another trick you can try is manually sharding your query, something like

SELECT ...
WHERE docid % 10 == 0

SELECT ...
WHERE docid % 10 == 1

...

SELECT ...
WHERE docid % 10 == 9
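In actual H2 SQL the comparison is a single =, and since docid is a varchar in your schema you'd have to derive a numeric bucket from it somehow; a rough sketch (bucketing on the last character is purely illustrative):

-- Bucket 0 of 10; repeat with = 1 ... = 9 for the other shards.
SELECT docid, MAX(title), MAX(content), MAX(sourceTitle), MAX(startPage), MAX(endPage), MAX(year)
FROM indexDoc
WHERE MOD(ASCII(RIGHT(docid, 1)), 10) = 0
GROUP BY docid
HAVING MAX(indexed) = 0;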

Tim Fielder

Oct 10, 2019, 7:21:58 AM
to H2 Database
Rebuilt using the latest code from Tuesday, still doesn't work.  Going to try the manual sharding thing next.

Noel Grandin

Oct 10, 2019, 7:25:50 AM
to h2-da...@googlegroups.com


On 2019/10/10 1:21 PM, Tim Fielder wrote:
>
> Rebuilt using the latest code from Tuesday, still doesn't work.  Going to try the manual sharding thing next.
>

Ah, that is a pity.

Before doing that, maybe try manually giving Java lots and lots of memory (e.g. -Xmx64g); your machine will probably go into swap, but that might take less time than writing custom sharding code.

Tim Fielder

Oct 10, 2019, 1:53:26 PM
to H2 Database
Just had the same out-of-memory error with a heap limit of 60 GB, up from 1024 MB previously.

Here's the eclipse.ini I'm using:
-vm
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-1.el7_7.x86_64/jre/bin
-startup
plugins/org.eclipse.equinox.launcher_1.5.300.v20190213-1655.jar
--launcher.library
/home/tfielder/.p2/pool/plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.1000.v20190125-2016
-product
org.eclipse.epp.package.java.product
-showsplash
org.eclipse.epp.package.common
--launcher.defaultAction
openFile
--launcher.appendVmargs
-vmargs
-Declipse.p2.max.threads=10
-Doomph.update.url=http://download.eclipse.org/oomph/updates/milestone/latest
-Doomph.redirection.index.redirection=index:/->http://git.eclipse.org/c/oomph/org.eclipse.oomph.git/plain/setups/
-Dosgi.requiredJavaVersion=1.8
-Dosgi.instance.area.default=@user.home/eclipse-workspace
-XX:+UseG1GC
-XX:+UseStringDeduplication
--add-modules=ALL-SYSTEM
-Dosgi.requiredJavaVersion=1.8
-Dosgi.dataAreaRequiresExplicitInit=true
-Xms256m
-Xmx60g
--add-modules=ALL-SYSTEM

I'm now thoroughly confused as to what might be wrong.  I can try sharding but I think it's going to take all weekend to run it.

Noel Grandin

Oct 10, 2019, 2:43:47 PM
to h2-da...@googlegroups.com

The Eclipse heap has no effect on the heap of the program you're running; you need to tweak the VM arguments in the program's run/debug launch configuration.

Tim Fielder

Nov 18, 2019, 9:38:03 AM
to H2 Database
To follow up:

I got use of a system with 512GB of RAM, which is not nearly as helpful as I'd like.  The JVM is still perfectly happy to eat all of it and go commit seppuku.

The first thing I had to do was split the Lucene index into chunks.  Trying to parse 4700 files of 10,000 documents each resulted in Lucene continuously merging, which slowly consumed all available memory until there was no longer enough to open a new connection to the database.  By instead having Lucene parse each file into a separate directory, I could defer merging until after the parsing was complete.  That allowed me to, at least, complete the initial step of parsing the files and building the temporary tables.

Now I'm trying to copy the temporary tables into permanent tables via the CREATE TABLE ... AS SELECT DISTINCT command, but some of the tables are like 400+ GB in size, so I'm still having some trouble with running out of memory.  Thus far my strategy has been to split the database correction steps into smaller and smaller chunks, which allows me to shut down the application and free all the memory that, for whatever reason, the JVM can't seem to garbage collect.  I'm now up to 15 distinct steps, each taking a full day to run, so it takes quite a while to fix any problems I run into.

I did try the idea of using a MySQL database and connecting to it instead of using H2.  The results I got suggest that using H2 in MySQL mode is faster than running MySQL as a service on localhost.  Memory usage is about the same.  I shifted back to using H2 and got some multi-threading support by writing my temporary tables into separate H2 databases, so that the threads aren't blocked by file writes.

If I get to the point that I can't subdivide my database activities any further and I still can't get it done with half a terabyte of RAM, I'm really not sure where to go next.  I think my best option is probably to use the SCRIPT command to write the table out to a file, then try to do my own file parsing to copy only the distinct rows.
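As a rough sketch of that idea, using H2's CSVWRITE/CSVREAD functions rather than SCRIPT (a flat CSV is easier to de-duplicate with external tools; paths are placeholders):

-- Dump the raw pairs to a flat file so the de-duplication happens outside the JVM heap.
CALL CSVWRITE('/data/tempref.csv', 'SELECT docid, refid FROM tempref');

-- After de-duplicating the file externally, load the result back in one pass.
CREATE TABLE reference AS
    SELECT * FROM CSVREAD('/data/tempref_dedup.csv');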

Noel Grandin

Nov 18, 2019, 10:49:33 AM
to H2 Database
It might be easier to use the Unix uniq or sort -u commands or something similar, followed by
   CREATE TABLE ... AS SELECT

DISTINCT is problematic because it requires building a huge set before performing the insert.

Or use the MERGE command to avoid the DISTINCT step.
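For example (a sketch, assuming the target table already has its primary key; MERGE in H2 inserts new rows and updates existing ones based on the listed key columns, so duplicates never pile up):

MERGE INTO reference (docid, refid)
    KEY (docid, refid)
    SELECT docid, refid FROM tempref;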

Also turn off the undo log with
  SET UNDO_LOG 0
and the transaction log with
   SET LOG 0
at the expense of having a dead DB if anything goes wrong, but these commands can be quite useful during initial population.
Don't forget to re-enable them when you are done.


