Database corruption on 1.3.176: any suggestions on how to avoid?


Rob Van Dyck

Jun 26, 2015, 9:39:00 AM
to h2-da...@googlegroups.com
Hi,

I work for a small company that uses (the latest stable) H2 in our software. Our client base is starting to grow (~100 installations on client computers, most with DBs multiple GBs in size) and we are running into more problems with broken (and sometimes worse: irreparable) H2 DBs. Our clients use many different OSes (all Windows/Mac OS X) on ordinary commodity hardware. To give you an estimate of the failure rate: we have had about 10 broken DBs in the last 6 months.

We currently use an embedded persistent database with default connection properties: "jdbc:h2:file:" + h2Path + ";IFEXISTS=TRUE", after which we set autocommit to false. Only one thread is connected to the DB, and the database was created using the latest stable H2 version.
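For reference, a minimal sketch of how we open that connection (the class and method names are just for illustration; h2Path is whatever path the installation uses, and the H2 jar is assumed to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DbOpen {
    // Builds the URL exactly as described above; IFEXISTS=TRUE makes H2
    // fail fast instead of silently creating a new, empty database.
    static String url(String h2Path) {
        return "jdbc:h2:file:" + h2Path + ";IFEXISTS=TRUE";
    }

    // Opens the single connection used by the DB thread.
    static Connection open(String h2Path) throws SQLException {
        Connection conn = DriverManager.getConnection(url(h2Path));
        conn.setAutoCommit(false); // commits are issued explicitly
        return conn;
    }
}
```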

We know for sure that a few instances happened shortly after our software ran into an out-of-memory situation. We also suspect some happened after an OS-level crash that caused the computer to reboot without a chance to shut down properly (e.g., a power failure or the user pressing the reset button).

The data is privacy sensitive, so we are reluctant to provide it to you unless that is the only option.

We were hoping you might be able to give us some hints about what we could do to avoid these issues:

1. We are converting our embedded persistent H2 DB to a (TCP) server started by a separate process, hoping that OOMs in our software won't corrupt H2 since the H2 process can shut down cleanly. Do you think this might help for OOMs?
2. We are wondering whether we are missing certain properties to set on the connection? We looked at UNDO_LOG and LOG, but the default settings are already the 'safest'.
3. We are using the latest stable version 1.3.176 (with its default storage engine, the B-tree-based PageStore; i.e., we don't use MVStore). Should we consider moving to the beta version? Could that possibly have more protection against these types of failure?
4. We know some instances of corruption happened in a virtualized environment (where the guest OS 'crashed'). We tried to reproduce this by running a Windows 8 guest on a Linux host, where we tried to reset and shutdown our application multiple times (10) while it was performing heavy database updates. We could not reproduce the issue. 
5. One of the issues is that we cannot reliably detect corruption. At one point we ran the H2 recovery tool, which gave us no errors, so we continued using the existing DB, but immediately afterwards H2 complained about corruption. Is this possible (does the recovery tool check for all kinds of errors, or does it skip, e.g., index pages)? Is there a way to know for sure that there is no corruption?
6. We have tried on some occasions to run the recovery tool and re-import the corrupted database, but at least on one occasion this gave us errors so we were unable to restore the data. Unfortunately we do not have the error output anymore.
7. The next time this happens, is there anything that we should check (e.g., the trace file)?
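For context on points 5 and 6, the recover-and-reimport cycle we have been using looks roughly like this (a sketch; directory and database name are placeholders, and the helper class is ours, not part of H2):

```java
public class RecoverCycle {
    // H2's Recover tool (org.h2.tools.Recover) writes its output script
    // next to the database file, named <dbName>.h2.sql. Invoked as:
    //
    //   java -cp h2-1.3.176.jar org.h2.tools.Recover -dir <dir> -db <dbName>
    //
    // The script is then replayed into a fresh database with RunScript:
    //
    //   java -cp h2-1.3.176.jar org.h2.tools.RunScript \
    //        -url jdbc:h2:file:<dir>/<dbName>-restored \
    //        -script <dir>/<dbName>.h2.sql

    // Name of the script file Recover produces for a given database.
    static String recoverScriptName(String dbName) {
        return dbName + ".h2.sql";
    }
}
```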

I'll include some of the stacktraces, maybe this can give you an indication of what might have gone wrong.

Thanks for your answers and/or tips.

Kind regards,
Rob.
stacktraces.log

Thomas Mueller

Jun 27, 2015, 7:30:34 AM
to h2-da...@googlegroups.com
Hi,

I'm sorry that the risk of corruption is that big. I'm not sure what the problem could be. In the past, people reported corruption now and then, but not at such a high rate as you are seeing.

I would not move to the MVStore yet, as there are known problems in case of power failure (with re-ordered writes). I'm working on that right now. There is also a known problem with corruption after out-of-memory, which is fixed in the trunk but not released yet.

I would probably use the old storage format ("mv_store=false" in the database URL). Whether you use the very latest 1.4.x version or 1.3.176 will probably not make a big difference.

I would consider creating online backups regularly, but I'm not sure if that's feasible in your case.
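For example, something along these lines, using H2's BACKUP TO statement, which writes a consistent zip of the database files while the database stays open (the helper names and the backup path are just for illustration):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class OnlineBackup {
    // BACKUP TO is H2-specific SQL; single quotes in the path are
    // escaped by doubling, per standard SQL string literal rules.
    static String backupSql(String zipPath) {
        return "BACKUP TO '" + zipPath.replace("'", "''") + "'";
    }

    // Runs an online backup over an already-open connection.
    static void backup(Connection conn, String zipPath) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(backupSql(zipPath));
        }
    }
}
```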

Regards,
Thomas

--
You received this message because you are subscribed to the Google Groups "H2 Database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2-database...@googlegroups.com.
To post to this group, send email to h2-da...@googlegroups.com.
Visit this group at http://groups.google.com/group/h2-database.
For more options, visit https://groups.google.com/d/optout.

Nicolas Barbier

Jun 28, 2015, 9:50:03 AM
to h2-da...@googlegroups.com
Hello Thomas,

Disclaimer: I am not familiar at all with the internals of H2 (neither PageStore nor MVStore). I just looked a bit at the code, but nothing too fancy. Anything I write here might therefore be entirely wrong :-).

2015-06-27 Thomas Mueller:


> I would not move to the MVStore yet, as there are known problem in
> case of power failure (in case of re-ordered writes). I'm working on
> that right now. There is also a known problem with corruption after
> out-of-memory, which is fixed in the trunk but not released yet.

I wonder how such reordered writes are coped with in PageStore. In a traditional "write-ahead log + main storage area" setting, I would expect that – for each "operation", e.g., an insertion into a table plus the corresponding changes to indexes etc – first the log is written, before any of the corresponding changes to the main storage area are written. Then, a "write barrier" is inserted to make sure that the log writes get to the platter before any of the corresponding main storage area writes get to the platter (hopefully disk drives don't violate this order too much, as consumer-grade drives supposedly do), and only then the writes to the main storage area are performed. If such a write barrier is not used, the OS is very likely to reorder the writes, *especially* if the blocks of the write-ahead log and the main storage area are intermingled in the same file (are they in H2?).

I would not know how to insert a write barrier in Java, except by fsync'ing the log after writing to it, and only then starting to issue the writes to the main storage area. But as H2 seems to put the log in the same file as the main storage area, either a "partial file fsync" would be needed (of the form "only fsync at least bytes X to Y of this file", where bytes X to Y is the location of the part of the log that was just written; I think Java doesn't support such an API?), or the whole file needs to be fsync'ed. The latter would be very bad for performance, as all previously issued writes to the main storage area (which are typically spread out rather randomly) would need to be performed by the OS (and in principle drive) before the call could return.
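To make this concrete: the only "barrier" the standard Java API offers is FileChannel.force(), which syncs the whole file (there is indeed no partial-file fsync). A sketch of the write-log-then-barrier pattern, assuming the log lives in its own file (the class and method names are made up for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteBarrier {
    // Appends one log record and forces it to stable storage before the
    // caller may touch the main storage area. force(false) is roughly
    // fdatasync; force(true) also flushes file metadata.
    static void logThenBarrier(Path logFile, byte[] record) throws IOException {
        try (FileChannel log = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record);
            while (buf.hasRemaining()) {
                log.write(buf);
            }
            // The "write barrier": returns only after the OS has flushed
            // the data (modulo drives that lie about completing a flush).
            log.force(false);
        }
        // ...only now may writes to the main storage area be issued.
    }
}
```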

Even fsync'ing only the log after each "operation" would be bad for performance (severely limiting the number of operations per second), but that could be coped with by delaying fsync'ing the log until a COMMIT happens or main storage area writes are needed because of memory pressure. Any main storage area writes (that correspond to the operations whose log parts are not fsync'ed yet) must then of course be delayed until the corresponding log writes are fsync'ed. It seems that H2 is already implemented like this (except for the write barrier/fsync part); I assume that the writing happens on a separate thread, so that having to wait for fsync'ing would not prevent the SQL-performing code from just continuing while the fsync is in flight (unless reading uses the same thread)?

As I wrote before, if this "write barrier" stuff (or the equivalent "fsync" stuff) is not performed, then the OS can reorder writes in any order it wants before sending them to the drive (I think?). When a power failure occurs, some writes to the main storage area will probably already have occurred before all corresponding writes to the log have, which would then likely result in a corrupt database.

Could you please shed some light onto how H2 tries to cope with this potential problem? By following the code, I could see that H2 makes sure that the log is written ("sent to the OS" in FileUtils.writeFully) before the corresponding changes to the main storage area are, but I could not find anything that would prevent the OS from reordering the writes. If that problem does indeed exist and is not solved by H2, I might take a jab at getting into the code and try to figure out what would be needed to implement such a solution. No guarantees of course, as I have lots of other stuff to do as well; I just think/hope it would be easier to fix H2 than to switch to another DBMS :-).

Assuming that the problem exists, the following would seem to be needed:

* Either find a way to fsync only a part of a file in Java, or (more likely),
* Put the log in a separate file (probably right next to the main storage area file).

  → The second solution introduces a problem where a user could accidentally mix a log file and a main file that don't belong together, but the resulting problems (corrupting the whole DB when replaying the log file) could be avoided by storing some kind of random ID in both files and complaining if they are not the same. Also, a user could mix an old log file and a new main file, which would also result in corrupting the DB when replaying it. A solution for this would be a bit more complex. But then, I don't think that these are scenarios that would need to be protected against in a first implementation.

I think that a similar problem would need solving for MVStore too, but I know too little about MVStore and I haven't given it enough thought (yet) to be able to give you any useful advice.

Greetings,

Nicolas [ works for the same company as Rob ]

Thomas Mueller

Jun 28, 2015, 10:33:00 AM
to h2-da...@googlegroups.com
Hi,

> I wonder how such reordered writes are coped with in PageStore.

So far I assumed that the PageStore works fine, as it uses a write-ahead log (ARIES style), plus fsync after writing the log. Once I have the "simulated reordered writes file system" in place, I can test that. I'm still working on that (I already have some results, but it's not complete yet).

Regards,
Thomas

Nicolas Barbier

Jun 28, 2015, 1:13:48 PM
to h2-da...@googlegroups.com
2015-06-28 Thomas Mueller <thomas.to...@gmail.com>:

> So far I assumed that the PageStore works fine, as it uses a write-ahead log
> (ARIES style), plus fsync after writing the log. Once I have the "simulated
> reordered writes file system" in place, I can test that. I'm still working
> on that (I already have some results, but it's not complete yet).

I could not find which code path H2 uses to perform an fsync after
flushing the log. Could you point me in the right direction?

It seems that org.h2.store.FileStore.sync() is only called when a
manual CHECKPOINT SYNC is issued (and also by
PageStore.writeVariableHeader, which doesn’t seem to be relevant
here). There are other callers of FileDisk.force(), but they don’t
seem relevant to the case of “fsync after writing the log”.

The WriterThread, which seems to perform the time-based flushing of
the log, only calls PageStore.flushLog(), which calls PageLog.flush(),
etc, which in the end doesn't seem to do any fsync’ing.

Did I miss something about the inner workings of H2, or does this
point to a problem?

Greetings,

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

Nicolas Barbier

Jun 28, 2015, 6:25:44 PM
to h2-da...@googlegroups.com
2015-06-28 Nicolas Barbier <nicolas...@gmail.com>:

> It seems that org.h2.store.FileStore.sync() is only called when a
> manual CHECKPOINT SYNC is issued (and also by
> PageStore.writeVariableHeader, which doesn’t seem to be relevant
> here). There are other callers of FileDisk.force(), but they don’t
> seem relevant to the case of “fsync after writing the log”.
>
> The WriterThread, which seems to perform the time-based flushing of
> the log, only calls PageStore.flushLog(), which calls PageLog.flush(),
> etc, which in the end doesn't seem to do any fsync’ing.

This analysis seems consistent with the fact that we often get corrupt
databases after a forced computer reset (I assume that the OS
typically reorders writes rather aggressively). It does not really
explain why our databases also become corrupt after an OOM situation
(so I guess that is a separate issue).

I guess it would be good to perform some strace-testing on a trivial
testcase, to check whether H2 indeed behaves as I expect it to (i.e.,
that it omits calls to f[data]sync when I think it should). OTOH,
having you confirm or reject my proposed solution would be even better
:-).

Assuming for a second that my analysis above is correct, I would
suggest letting PageStore.flushLog() just call sync() (i.e., replacing
the current 3 lines of code in it with a call to the sync() method
declared right underneath it). It seems that callers of flushLog seem
to expect that they can just start writing non-log stuff right after
calling flushLog, which seems consistent with the fact that flushLog
should perform the fsync’ing.

* It seems that flushLog() wouldn’t be called for each “operation” as
it is (mostly) only used when memory pressure kicks in, and therefore
the performance loss wouldn’t be dramatic.
* It would still be better for performance to move the log to a
separate file, so the fsync’ing could be restricted to the log,
instead of forcing all writes to the log + all writes to the main
storage area to hit the platter before returning.
* My proposed solution does not guarantee 100% invulnerability to
corruption in the face of drives that don’t honor fsync. I.e., if they
do reordering of write commands, even if told that there is supposed
to be an fsync in between. I don’t know whether the typical “buggy”
consumer drives do that, or whether they just return “it’s written” to
the OS immediately but keep the order of write commands correct
anyway; If the latter, corruption is avoided even with such drives
(although durability (which I don’t care about) is not).

Nicolas Barbier

Jun 28, 2015, 6:30:29 PM
to h2-da...@googlegroups.com
2015-06-29 Nicolas Barbier <nicolas...@gmail.com>:

> * It seems that flushLog() wouldn’t be called for each “operation” as
> it is (mostly) only used when memory pressure kicks in, and therefore
> the performance loss wouldn’t be dramatic.

Correction: It is also called in time-based fashion, which would cause
a lot of fsync’ing of the database file all the time. That makes
separating the log from the main storage area more important
performance-wise :-(.

Rob Van Dyck

Jul 3, 2015, 7:41:36 AM
to h2-da...@googlegroups.com
Hi Thomas,

Thanks for your initial reply. We would really like to get to the bottom of this and find out what's wrong.

I still have two questions:
  1. Is there any way to know for sure that a database is consistent? E.g., by running Recover with -trace and -transactionLog? Will this check all internal indexes etc.? It is important for us to be able to identify whether a database was corrupted in some way (note that this is different from knowing that there is no corruption left after an export (using Recover) and an import into a new DB).
  2. We have a (new) corrupted database from a machine that isn't suffering from unexpected reboots (no critical errors according to the Windows logs) and didn't experience an out-of-memory situation (that we know of), so we are looking at external factors: e.g., maybe the shadow copy (Windows Restore Checkpoint) has somehow interfered with the DB, or maybe Windows Restore has restored a corrupted snapshot, etc. If you have any ideas we might want to check out, they are very welcome. It is a customer with multiple computers, and this particular computer has had a corrupted database 4 times in about two months. One of their other computers had a corruption once; all the other computers haven't had corruptions, while the same staff use them all interchangeably. The DB is 8 GB in size.
Kind regards,
Rob.

On Saturday, June 27, 2015 at 1:30:34 PM UTC+2, Thomas Mueller wrote:

Thomas Mueller

Jul 3, 2015, 8:47:28 AM
to h2-da...@googlegroups.com
Hi,


> Is there any way to know for sure a database is consistent?

The statement "script to <fileName>" will detect most corruptions. It will not detect corruptions in just the secondary indexes, but those are quite rare.
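For example, it could be used as a periodic probe along these lines (a sketch; the class name and output path are just for illustration, and most corruption would surface as an exception while the script is being written):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class ConsistencyProbe {
    // SCRIPT TO is H2-specific SQL that dumps the whole database as SQL.
    static String scriptSql(String fileName) {
        return "SCRIPT TO '" + fileName.replace("'", "''") + "'";
    }

    // Returns true if SCRIPT TO completed; most kinds of corruption make
    // it fail with an exception while the rows are being read.
    static boolean looksConsistent(Connection conn, String fileName) {
        try (Statement st = conn.createStatement()) {
            st.execute(scriptSql(fileName));
            return true;
        } catch (SQLException e) {
            return false; // inspect e and the .trace.db file for details
        }
    }
}
```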


> We have a (new) corrupted database from a machine that isn't suffering from unexpected reboots

Even though my main priority is now to get the MVStore stable (which hopefully will fully solve the corruption problem), I would still be interested to understand why you have that many corruptions. My guess is that your use case is slightly different from what others do. I have a list of questions I have used before (you have answered some of them already):

- What is your database URL?

- Did you use LOG=0 or LOG=1? Did you read the FAQ about it?

- Did the system ever run out of disk space?

- Could you send the full stack trace of the exception including message text?

- Did you use SHUTDOWN DEFRAG or the database setting DEFRAG_ALWAYS with H2 version 1.3.159 or older?

- How many connections does your application use concurrently?

- Do you use temporary tables?

- With which version of H2 was this database created?
    You can find it out using:
    select * from information_schema.settings where name='CREATE_BUILD'
    or have a look in the SQL script created by the recover tool.

- Did the application run out of memory (once, or multiple times)?

- Do you use any settings or special features (for example cache settings,
    two phase commit, linked tables)?

- Do you use any H2-specific system properties?

- Is the application multi-threaded?

- What operating system, file system, and virtual machine
    (java -version) do you use?

- How did you start the Java process (java -Xmx... and so on)?

- Is it (or was it at some point) a networked file system?

- How big is the database (file sizes)?

- How much heap memory does the Java process have?

- Is the database usually closed normally, or is process terminated
    forcefully or the computer switched off?

- Is it possible to reproduce this problem using a fresh database
    (sometimes, or always)?

- Are there any other exceptions (maybe in the .trace.db file)?
    Could you send them please?

- Do you still have any .trace.db files, and if yes could you send them?

- Could you send the .h2.db file where this exception occurs?

Regards,
Thomas



Harshad RJ

Jul 5, 2015, 1:21:00 AM
to h2-da...@googlegroups.com


On Monday, 29 June 2015 03:55:44 UTC+5:30, Nicolas Barbier wrote:
2015-06-28 Nicolas Barbier <nicolas...@gmail.com>:

> I guess it would be good to perform some strace-testing on a trivial
> testcase, to check whether H2 indeed behaves as I expect it to (i.e.,
> that it omits calls to f[data]sync when I think it should).

Very interested in what you dig up. Let us know.

(Also, this should probably move to a new discussion thread)

Rob Van Dyck

Jul 6, 2015, 7:26:36 AM
to h2-da...@googlegroups.com
Hi Thomas and list,

2015-07-03 Thomas Mueller <thomas.to...@gmail.com>:

>> Is there any way to know for sure a database is consistent?
>
> The statement "script to <fileName>" will detect most corruptions. It will
> not detect corruptions in just the secondary indexes, but those are quite
> rare.

It appears it doesn't detect the error we see most when using this DB:
'Row not found when trying to delete from index
"PUBLIC"."idx_Person_birthDate"'. Both Script and Recover produce an
SQL file without any error messages, but some have strange file sizes
(the original DB is 8 GB; the Script-generated SQL is 6 GB; the
Recover-generated one with extra logging (-trace -transactionLog) is
14 GB; the Recover-generated one without extra logging is 11.7 GB).
The only problem arises when trying to import the SQL file: during the
reimport (RunScript) of the Recover-generated one (without extra
logging) I got a 'java.io.IOException: Stream closed' [1] (somewhere
while doing a rollback?). The strange thing was that the restored DB
was already 12 GB, with extra temp.db files, one of which was more
than 8 GB. At that time I only had 3 GB of storage left on the
particular machine (where I was testing the restore), but it seems to
me that needing more than 20 GB to restore an 8 GB database would be
rather strange.

Do you know what might be the cause of the 'Stream closed' exception?

[1]
org.h2.jdbc.JdbcSQLException: IO Exception: "java.io.IOException:
Stream Closed";
"C:/Users/[..]/AppData/LocalLow/[..]/temp/records-restored.66233827.7.temp.db"
[90031-176]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:344)
at org.h2.message.DbException.get(DbException.java:167)
at org.h2.message.DbException.convertIOException(DbException.java:329)
at org.h2.store.FileStore.setLength(FileStore.java:362)
at org.h2.engine.UndoLog.getLast(UndoLog.java:96)
at org.h2.engine.Session.rollbackTo(Session.java:594)
at org.h2.command.Command.executeUpdate(Command.java:278)
at org.h2.jdbc.JdbcStatement.executeInternal(JdbcStatement.java:186)
at org.h2.jdbc.JdbcStatement.execute(JdbcStatement.java:160)
at org.h2.tools.RunScript.process(RunScript.java:261)
at org.h2.tools.RunScript.process(RunScript.java:191)
at org.h2.tools.RunScript.process(RunScript.java:329)
at org.h2.tools.RunScript.runTool(RunScript.java:142)
at org.h2.tools.RunScript.main(RunScript.java:69)
Caused by: java.io.IOException: Stream Closed
at java.io.RandomAccessFile.length(Native Method)
at org.h2.store.fs.FileDisk.truncate(FilePathDisk.java:422)
at org.h2.store.FileStore.setLength(FileStore.java:357)
... 10 more

>> We have a (new) corrupted database from a machine that isn't suffering
>> from unexpected reboots
>
> Even thought my main priority is now to get the MVStore stable (which
> hopefully will fully solve the corruption problem), I would still be
> interested to understand why you have that many corruptions. My guess is
> that your use case is slightly different than what others do. I have a list
> of questions I have used before (you have answered some of those questions
> already):
>
> - What is your database URL?

We actually have two databases. Both have a separate thread which is
the only one communicating with the DB. They each use a URL like this:
"jdbc:h2:file:" + h2Path + ";IFEXISTS=TRUE" after which we set
autocommit to false.

> - Did you use LOG=0 or LOG=1? Did you read the FAQ about it?

No, we use the default LOG=2.

> - Did the system ever run out of disk space?

Not that we know of. It currently has about 80 GB of 180 GB in use.
Searching the Windows system log for 'quota' and 'space' gave no
results.

> - Could you send the full stack trace of the exception including message
> text?

This is included in the trace.db file right?

> - Did you use SHUTDOWN DEFRAG or the database setting DEFRAG_ALWAYS with H2
> version 1.3.159 or older?

No

> - How many connections does your application use concurrently?

One per DB.

> - Do you use temporary tables?

No.

> - With which version of H2 was this database created?
> You can find it out using:
> select * from information_schema.settings where name='CREATE_BUILD'
> or have a look in the SQL script created by the recover tool.

176.

> - Did the application run out of memory (once, or multiple times)?

That is possible; we have not noticed any out-of-memory errors in our
logs, but we have seen our application stop without logging at least 3
times. That could indicate our process was killed, or that the JVM
exited unexpectedly.

> - Do you use any settings or special features (for example cache settings,
> two phase commit, linked tables)?

No.

> - Do you use any H2-specific system properties?

No.

> - Is the application multi-threaded?

Yes.

> - What operating system, file system, and virtual machine
> (java -version) do you use?

Win7 Home Premium - SP1 (6.1 amd64), NTFS, Java: 1.7.0_79 64.

> - How did you start the Java process (java -Xmx... and so on)?

-Xmx800M -XX:MaxPermSize=128M -XX:-OmitStackTraceInFastThrow
-Dsun.security.smartcardio.t0GetResponse=false

> - Is it (or was it at some point) a networked file system?

No.

> - How big is the database (file sizes)?

8 GB.

> - How much heap memory does the Java process have?

800 MB.

> - Is the database usually closed normally, or is process terminated
> forcefully or the computer switched off?

This particular computer hasn't been powered off incorrectly. Our
application has been terminated (without a clean JVM shutdown) 3 times
according to our logs.

> - Is it possible to reproduce this problem using a fresh database
> (sometimes, or always)?

No.

> - Are there any other exceptions (maybe in the .trace.db file)?
> Could you send them please?

See next question.

> - Do you still have any .trace.db files, and if yes could you send them?

I will attach them.

> - Could you send the .h2.db file where this exception occurs?

This is hard because the data is confidential.

Thanks for your time,
Rob.
global-adjusted.trace.db
records-adjusted.trace.db

Steve McLeod

Jul 6, 2015, 4:35:09 PM
to h2-da...@googlegroups.com
Just adding my experience to confirm Rob's scenario: my desktop app has thousands of users, uses H2 1.3.176, and customer databases occasionally become corrupted. The rate of corruption was much worse in earlier H2 versions. Customer databases are typically some gigabytes in size. As best I can tell from limited information, sudden loss of power is a major cause of the problem.

A couple of years ago I attempted to add a "CHECK DATABASE" command to H2 that would read each row in each table and index, but got overwhelmed by the task.

I'm keen and available to help in any way to improve this corruption situation.

I recently tried updating my desktop app to the latest H2 1.4 release, albeit with MVStore disabled. The database corruption rate was much worse, and I needed to revert to 1.3.176.

Rob Van Dyck

Jul 10, 2015, 10:34:40 AM
to h2-da...@googlegroups.com
Hi Thomas and list,

After investigating another corruption yesterday, we noticed something strange: a few minutes before the DB was found corrupt there was a Volume Shadow Copy created for the drive where our H2 files are located.

Apparently there are multiple reports on the internet suspecting that DBs can get corrupted during Volume Shadow Copy creation (i.e., the original DB gets corrupted, not just the copy), though most are unconfirmed suspicions:
IBM ClearCase VOB: 
Firebird:
Access:

Could that also possibly lead to a problem in H2?

I'll include the corruption stacktrace from right after the Volume Shadow Copy. H2 was trying to do a commit. This was on a Windows 8.1 6.3 amd64 machine running Java 1.7.0_60 64.

We have also checked several instances of H2 corruptions (5) for Volume Shadow copy creations while our application was running:
  • 2 had a Volume Shadow copy taken previously that day
  • 1 had it taken 2 days before
  • 1 had it taken 2 weeks before
  • 1 didn't appear to have had a shadow copy taken while our application was running

Regards,
Rob.
07-10-2015-stacktrace-after-VSS.log

Noel Grandin

Jul 10, 2015, 2:21:04 PM
to h2-da...@googlegroups.com
By themselves those numbers don't necessarily mean much. We would also need to know how often a shadow copy happened and nothing was corrupted.
