File extension independent H2 format


Thotheolh

Dec 10, 2010, 3:11:30 AM
to H2 Database
I was reading the forum and noticed that there are requests to change
file extensions or to somehow customize them.

I would like to propose a change to the H2 file format that would make
H2 files discoverable independently of their file extension, so that
people are free to use any extension they like for their H2 database
files.

The H2 Page Store format seems to have a header that reads
"-- H2 0.5/B --", possibly repeated on three lines. H2 files could be
told apart from other files by this header rather than by file
extension, so users could make up any extension they want as long as
the header is intact.

H2 could also allow users to target specific database files, e.g.
"~/DB/MyDbFiles.myExt". That way, users could have their own extensions
as long as the files have the correct format. H2 would need to read the
header of the file and check the basic format to positively identify
the file as a genuine H2 file.
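
A minimal sketch of such a header check, in Java. The class and method names are hypothetical, and the magic text is simply the header as quoted in this thread; the exact bytes H2 writes on disk are an assumption here, not something taken from the H2 source.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class H2HeaderSniffer {
    // Header text as quoted in this thread (assumed to be plain ASCII on disk).
    static final byte[] MAGIC = "-- H2 0.5/B --".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the magic text. */
    static boolean looksLikeH2(byte[] leadingBytes) {
        if (leadingBytes.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(leadingBytes, MAGIC.length), MAGIC);
    }

    /** Reads only the first few bytes of the file, whatever its extension is. */
    static boolean looksLikeH2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeH2(in.readNBytes(MAGIC.length));
        }
    }
}
```

With this, a file named "MyDbFiles.myExt" would be accepted or rejected purely on its first 14 bytes, not on its name.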

I know H2 has trace and log files too, and the proposed free choice of
file extension would apply to them as well. I would like to propose a
change to the H2 header format that allows this extensibility and
unifies the formats across H2.

Below is my proposal. It is designed to disrupt the current H2 format
as little as possible.

Skeletal format:
--<space>H2<space><version of file format protocol>/<alpha (A), beta (B), stable (S)><space><filetype><space><Engine Version><space>--
Example: "-- H2 0.5/B D 1.2.147--" means: H2, protocol version 0.5,
beta, file type D, created by engine version 1.2.147.

The format should also accept headers without the engine version, for
users who do not wish to reveal which version of H2 created the
database.

File types available:
B - blobs
C - clobs
D - normal database file
T - trace / log files
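
The proposed header line can be sketched as a small parser. This is only an illustration of the proposal, not anything H2 implements; the class name and field names are hypothetical. Because the skeletal format shows a space before the closing "--" while the example omits it, the pattern below accepts both.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProposedHeader {
    // -- H2 <format version>/<A|B|S> <B|C|D|T> [<engine version>][ ]--
    private static final Pattern LINE = Pattern.compile(
            "-- H2 (\\d+\\.\\d+)/([ABS]) ([BCDT])(?: (\\d+(?:\\.\\d+)*))? ?--");

    final String formatVersion;
    final char stage;           // A = alpha, B = beta, S = stable
    final char fileType;        // B = blobs, C = clobs, D = database, T = trace / log
    final String engineVersion; // null when the user chose to omit it

    private ProposedHeader(String formatVersion, char stage, char fileType, String engineVersion) {
        this.formatVersion = formatVersion;
        this.stage = stage;
        this.fileType = fileType;
        this.engineVersion = engineVersion;
    }

    /** Parses one header line; returns empty if the line does not match the proposal. */
    static Optional<ProposedHeader> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new ProposedHeader(
                m.group(1), m.group(2).charAt(0), m.group(3).charAt(0), m.group(4)));
    }
}
```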

End of file format:
-- H2 eof -- //H2 end of file
The eof (end-of-file) marker is not required, but it is good practice
to include it.

Possible drawbacks:
- Slower engine? It could be tweaked and enhanced to handle this.
- Additional space for headers and eofs? Not really much.
- Changing H2 page store format versions is troublesome? Provide
proper, lightweight conversion / file IO tools.
- "Header overload" syndrome, where people take advantage of the
extensible header to dump all sorts of information and metadata into
it? Enforce strict header formatting.

Possible advantages:
- Clearer distinction by using a better header and possibly with eofs
(enders).
- Customizability of file extensions
- Protocol longevity and extensibility.

For those who may want to compare headers and everything:

Currently some H2 files have 3 lines of "-- H2 0.5/B --", but this
could be reduced to a single header line as I proposed.
I estimate that one line of the current H2 header is 14 characters in
UTF-8, so 3 lines come to 42 bytes, or 336 bits.
(Disclaimer: I cannot say these numbers are exactly correct, as I am no
expert in the H2 format.)

With the proposed format, only one header line is needed, plus an
optional eof. Including the optional engine version, header and eof
together come to 35 characters in UTF-8 = 35 bytes or 280 bits.
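
The byte counts above can be checked with a few lines of Java. The class name is hypothetical, and line terminators are deliberately not counted, matching the estimate in the post.

```java
import java.nio.charset.StandardCharsets;

public class HeaderSizes {
    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Three repetitions of the current header line. */
    static int currentHeaderBytes() {
        return 3 * utf8Bytes("-- H2 0.5/B --");
    }

    /** One proposed header line (with engine version) plus the eof marker. */
    static int proposedBytes() {
        return utf8Bytes("-- H2 0.5/B D 1.2.147--") + utf8Bytes("-- H2 eof --");
    }
}
```

Here currentHeaderBytes() is 3 x 14 = 42 and proposedBytes() is 23 + 12 = 35, so the saving is 7 bytes per file.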

Additional goodies:
With the header and eof I proposed, you can add additional data before
the header and after the eof in the manner shown below, which makes the
file even more flexible.

...your other stuff...

//Main H2 data
-- H2 0.5/B D 1.2.147--
<H2 data>
-- H2 eof --

//Another H2 CLOB stored
-- H2 0.5/B C 1.2.147--
<H2 CLOB data>
-- H2 eof --

//Another H2 Trace stored
-- H2 0.5/B T 1.2.147--
<H2 Trace data>
-- H2 eof --

... your other stuff ...

What happens is that you could literally add stuff before and after the
H2 data, and maybe store blobs, normal H2 data and logs together in one
file. The bad thing is that once the file is gone, all is lost.
Another possibility is storing multiple H2 databases and H2 items in a
single file, but that is really, really complex.

Thanks for reading this really long post.

Regards,
Thotheolh.





Ryan How

Dec 10, 2010, 3:39:04 AM
to h2-da...@googlegroups.com
Hi,

Just my 2 cents.

I'm in favour of the customisable file extension, but I don't think
custom data should be allowed before and after the header. I don't think
anything should write to the database file except for H2 itself, even if
just for reliability. If you want to store extra data as configuration
or anything else, then it can be stored in an H2 table.

Also, about having data before the header or after the ender: if you
change the length of the data in front, the entire database file has to
be shifted along, and every time H2 changes the length of the database
file it has to copy the data after the ender. So it would be a huge
performance hit as the database grows in size. It would be much better
to store the data in an H2 table and let H2 manage it.

Anyway, just my thoughts.

Cheers, Ryan

Thotheolh

Dec 10, 2010, 3:49:07 AM
to H2 Database
It's true that H2 should be in charge of the H2 data, and the proper
practice for most file formats out there is to not add unnecessary
data, in order to protect file integrity.

Do post your opinions here and add any ideas you have if you want
to. :D

Regards,
Thotheolh.

Sergi Vladykin

Dec 10, 2010, 4:16:32 AM
to H2 Database
Hi,

I think it is really useless (and even incorrect in some points) to
change the H2 file format as you described. To get a different db file
extension, it would be enough to make it configurable.

regards,
Sergi

ulim

Dec 12, 2010, 7:43:44 AM
to H2 Database
Well, there are three different things:

a) Get away from the current "two-dotted" extension, since many SDKs
cannot handle that.
b) Make the extension configurable.
c) Do away with the extension altogether (and, conceivably, use the
file header for identifying H2 files).

Of those options the first one is probably easy to implement and
wouldn't impede changing to one of the other options later on.
However, it does break backwards compatibility, because the newer H2
engines won't recognize the old extensions anymore, unless some ugly
workaround code is introduced. Therefore I tend to prefer the second
option, because it would allow restoring backwards compatibility - the
user could simply configure .h2.db again.

I am not sure where this file extension is being used in the codebase;
perhaps it is relatively unimportant internally? In that case it would
be enough to just stop adding the .h2.db extension to every database
URL. Users wanting to keep the old .h2.db extension or to use a new one
would simply have to change their JDBC URLs to reflect that. So that
would be another option:

d) Stop adding extensions to JDBC URLs altogether.

Ulrich

Sergi Vladykin

Dec 12, 2010, 2:24:11 PM
to H2 Database
I prefer a mechanism that allows translating the database name from the
URL (String) to a file object (java.io.File). Like this:

interface DatabaseFileLocator {
    File findDatabaseFile(String databaseName);
}

The default implementation would do just what H2 does now (add the
".h2.db" extension and so on), and it could be swapped for your own
implementation with any behavior you want. But this is again about
support for service providers in H2, which is still absent...
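
A minimal sketch of how such a locator could look, assuming the interface proposed in this post. Nothing here exists in H2; the wrapper class, the DEFAULT constant and the withSuffix helper are hypothetical names for illustration only.

```java
import java.io.File;

public class LocatorSketch {
    /** The interface as proposed in this post. */
    interface DatabaseFileLocator {
        File findDatabaseFile(String databaseName);
    }

    /** Mimics the current behavior described in the thread: append ".h2.db". */
    static final DatabaseFileLocator DEFAULT = name -> new File(name + ".h2.db");

    /** A custom locator that appends any suffix the user chooses (may be empty). */
    static DatabaseFileLocator withSuffix(String suffix) {
        return name -> new File(name + suffix);
    }
}
```

For example, LocatorSketch.withSuffix(".myExt").findDatabaseFile("MyDbFiles") yields a file named "MyDbFiles.myExt", while the default still yields "MyDbFiles.h2.db".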

I think that SPI support could solve almost all such problems very
gracefully.

regards,
Sergi

Thomas Mueller

Dec 13, 2010, 2:51:51 PM
to h2-da...@googlegroups.com
Hi,

For backward compatibility I guess the current default suffix ".h2.db"
will stay for a while. The earliest possible change is H2 version 1.4,
and I'm not sure if it makes sense yet. First I want to get rid of all
the other database files (lob files, temp files). The .trace.db file
may get renamed to .log at some point (not sure yet).

There is a feature request for "Database file name suffix: a way to
use no or a different suffix (for example using a slash).". That means
if you use the database URL "jdbc:h2:~/test/" then it would create a
file named "test". This is not implemented yet (patches are welcome),
but what do you think about it?
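
A sketch of the naming rule described in that feature request, in Java. This is hypothetical code (the feature is stated above as not implemented), and the class and method names are made up for illustration; it only handles the path part of the URL, without the "jdbc:h2:" prefix.

```java
public class SuffixRule {
    static final String DEFAULT_SUFFIX = ".h2.db";

    /**
     * Maps the path part of a database URL to a file name:
     * a trailing slash means "no suffix", anything else gets the default suffix.
     */
    static String fileNameFor(String databasePath) {
        if (databasePath.endsWith("/")) {
            // "~/test/" -> "~/test"
            return databasePath.substring(0, databasePath.length() - 1);
        }
        // "~/test" -> "~/test.h2.db"
        return databasePath + DEFAULT_SUFFIX;
    }
}
```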

> -- H2 0.5/B -- and it may have three lines of that header

Yes, the first is the regular header, the second and third may be
encrypted (for encrypted databases).

> -- H2 0.5/B D 1.2.147--

I will consider this for the future. There is already the CREATE_BUILD
setting in the database file. Plus, there is a read-version and
write-version in the header (see PageStore.java class javadoc).

What about h2database.com as the header? Or h2database.org.

The idea is that the header is one page long, and never changes once
the database is created. If it could change, there would be the problem
of how to change it in a transactional way (with a possible power
failure while it's being written). The second and third page contain
header data that can change (both pages are supposed to contain the
exact same data, so that this data can be changed in a transactional
way, and recovered after a power failure).
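
The two-copy idea can be illustrated with a generic sketch. This is not H2's actual code: the checksum layout, the class name and the method names are assumptions made for the example. The writer is assumed to finish writing copy A before it starts on copy B, so at most one copy can be torn by a power failure.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class DualHeaderSketch {
    /** Encodes a payload as [CRC32 as 8 bytes][payload]. */
    static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(8 + payload.length)
                .putLong(crc.getValue())
                .put(payload)
                .array();
    }

    /** Returns the payload if the stored checksum matches, otherwise null. */
    static byte[] decode(byte[] page) {
        if (page == null || page.length < 8) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(page);
        long stored = buf.getLong();
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue() == stored ? payload : null;
    }

    /** Uses copy A if it is intact, otherwise falls back to copy B. */
    static byte[] recover(byte[] copyA, byte[] copyB) {
        byte[] a = decode(copyA);
        return a != null ? a : decode(copyB);
    }
}
```

If the failure hits while copy A is being written, copy B still holds the old, consistent data; if it hits while copy B is being written, copy A already holds the new data.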

Regards,
Thomas

Thotheolh

Dec 13, 2010, 7:22:45 PM
to H2 Database
Hi Thomas, I think a long string like 'h2database.xxx' may be a bit
too long. I think it is better to keep the header as short as possible
to reduce the space it needs. Indeed, the main header data should not
be changed, or else there is a risk of corrupting the data.

Thomas, why may the second and third lines of '-- H2 0.5/B --' get
encrypted for encrypted databases? Are they used as some kind of string
to check whether decryption succeeded? If so, maybe a shortened check
string like '-- c --' or '? c ?' could come in handy, using only a
single string rather than a couple of them.

Bundling clobs and blobs into the main database file can make it very
bulky and slow down H2 significantly. It is better to leave clobs and
blobs on their own for now rather than allowing them to slow H2 down,
I guess.

Regards,
Thotheolh.

Thomas Mueller

Dec 17, 2010, 4:59:52 AM
to h2-da...@googlegroups.com
Hi,

> using a long string like 'h2database.xxx' maybe
> abit too long.

The first page (usually 2 KB) doesn't ever change, and there is a lot
of unused space there.

> why does the second and third line of the '-- H2 0.5/B --' may
> get encrypted for encrypted databases

See the encrypted file source code, SecureFileStore.java

> To bundle clob and blobs into the main database files can make the
> main database file very bulky and slow down H2 significantly.

There are many advantages to use one file, for example it prevents the
'too many open files' problems. Also, there are many problems because
of lost / deleted files. De-duplication is easier. One disadvantage is
that the database file can't shrink quickly after deleting many LOBs.

Regards,
Thomas

Thotheolh

Dec 17, 2010, 7:49:03 AM
to H2 Database
Hi Thomas. It's true that the advantages and disadvantages are as you
said, but would a single storage file slow the H2 engine down when
looking up, reading or writing data, since all the clobs and blobs,
which can be rather huge, are bunched together with the main database?

> > To bundle clob and blobs into the main database files can make the
> > main database file very bulky and slow down H2 significantly.
>
> There are many advantages to use one file, for example it prevents the
> 'too many open files' problems. Also, there are many problems because
> of lost / deleted files. De-duplication is easier. One disadvantage is
> that the database file can't shrink quickly after deleting many LOBs.

Regards,
Thotheolh.

Dario Fassi

Dec 17, 2010, 2:49:45 PM
to h2-da...@googlegroups.com
Hi,
What do you think of using only 2 files, one for all normal columns and another specialized for long data types (LOBs)?
Something similar to "Regular Tablespaces" and "Large Tablespaces" in old DB2 versions.

That way you get the best of both: only two files, one of them a specialized fileStore containing all LOB columns (used only if LOBs are referenced), which can facilitate the locator's implementation too.

regards,
Dario.


Thotheolh

Dec 17, 2010, 8:28:11 PM
to H2 Database
That's a good idea, using one file for normal tables and the other for
huge storage. I did propose splitting the huge storage into CLOBs and
BLOBs so that there wouldn't be a need to mix binary and character-based
storage, which would make it hard for locators to locate resources.

Regards,
Thotheolh.

Thomas Mueller

Dec 20, 2010, 3:30:47 PM
to h2-da...@googlegroups.com
Hi,

> would a single storage file slow down H2 engine from looking
> for and getting or writing data since

No. The only problem (I know) is what I have already described.

> What think you of using only 2 files, one for all normal columns and other specialized for long data types(LOBS).

I thought about this, but I would try to avoid it if possible. What
would be the advantage?

> that can facilitate locator's implementation too

How?

> It did propose to split the huge storage to CLOBS and BLOBS so that there wouldn't be a need to mix binary and character based storage

That's no problem at all.

Regards,
Thomas

Thotheolh

Dec 21, 2010, 7:45:03 AM
to H2 Database
It would be good if H2 could finally squeeze clobs, blobs and all the
data storage into a single file without a significant drop in
performance.

It would be so much easier to handle a single H2 data file than to
handle separate files.

Maybe the H2 data file containing all the data has some sort of data
compression built in by default?

I think 'facilitate locator's implementation' in the previous reply
meant the following: every time data is fetched from the database, it
is usually read sequentially from the head to the end of the data file.
If all the data is mixed into just one file (especially clobs and
blobs, which can sometimes be too large to load into memory for
caching), then when a read is done and the data is too big for the
memory cache, it has to be read sequentially, and that would take a
lot of time.

I am not very good at file-based data storage operations since I don't
have much experience in creating database engines. Generally, what
technique would be used so that blobs and clobs could be stored in the
same database file as the main data without affecting database
performance?

Regards,
Thotheolh.

Dario Fassi

Dec 21, 2010, 12:01:28 PM
to h2-da...@googlegroups.com, "Darío V. Fassi"
Hi,

On 20/12/10 17:30, Thomas Mueller wrote:


>> would a single storage file slow down H2 engine from looking
>> for and getting or writing data since
> No. The only problem (I know) is what I have already described.

I doubt that this is really so. In a database with many LOB columns and rows, the size of this single file can easily grow to disadvantageous levels.
A single LOB field can be the size of several full tables, or even exceed the size of the rest of the database.

Fragmentation at the file system (OS) level will have much more impact on large files, caching (at the OS level) will be less effective, read-ahead will be less effective too, and finally the IO load will inevitably increase.

It is easy to measure the degradation of a database's performance as the data volume increases significantly. I mean, if a db without LOBs is 1 GB in size and with LOBs goes over 10 GB, it would be very optimistic to think that the overall performance will not change. Just imagine defragmenting or compacting a file of that size.
In a two-file scenario, we would have a main file of 1 GB with almost all data + indexes, and a LOB file of 9 GB with LOBs only (not as bad as a file per LOB and not as big as everything in one file).

>> What think you of using only 2 files, one for all normal columns and other specialized for long data types(LOBS).


> I thought about this, but I would try to avoid it if possible. What
> would be the advantage?

I can think of everything stated above, and more:

1) The main file will concentrate almost all indexes and data (except LOBs), and hold references into the LOB file as column values for LOB columns.

2) The LOB file can have a different fileStore (much simpler and more specific), organized in variable-length extents or pages to take advantage of the sequential nature of its contents.

Such a fileStore only needs an avail-list and one index of pointers or references to be used as column values in the main file,
like the old xBase .DBT files, which use a simple and very effective format, or the .tar file format, which was designed for sequential-access devices (or streaming, in Java parlance).
So a locator can be implemented easily (at file level) as the LOB reference pointer + locator offset.

To compact extent contents (if needed), a stream-oriented method like deflate or gzip can be used without harming streaming.
Each extent can have a header with a tag-marker, length, checksum, etc., to make recovery of a broken file easier.

>> that can facilitate locator's implementation too
> How?

It is explained above, but again:

If the LOB fileStore is organized as a sequence of variable-length extents with an index of pointers and of available (or deleted) extents,
a locator can be implemented easily (at file level) as the LOB reference pointer + the locator offset.

Streaming access to LOB contents will be simplified and will benefit too.
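
The "reference pointer + locator offset" idea can be sketched in Java as follows. This is an in-memory illustration under stated assumptions, not an H2 or xBase format: the tag value, the 8-byte extent header and all names are hypothetical, and the avail-list, checksum and compression mentioned above are left out to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class LobExtentSketch {
    static final int TAG = 0x4C4F4231;  // hypothetical tag-marker ("LOB1" in ASCII)
    static final int HEADER_BYTES = 8;  // 4-byte tag + 4-byte length

    // Stands in for the LOB file; a real store would use a file on disk.
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();

    /** Appends one extent and returns its reference pointer (start offset in the file). */
    long append(byte[] lob) {
        long pointer = file.size();
        byte[] header = ByteBuffer.allocate(HEADER_BYTES).putInt(TAG).putInt(lob.length).array();
        file.write(header, 0, header.length);
        file.write(lob, 0, lob.length);
        return pointer;
    }

    /** Reads count bytes at locator position = reference pointer + offset within the LOB. */
    byte[] read(long pointer, int offset, int count) {
        byte[] all = file.toByteArray();
        ByteBuffer buf = ByteBuffer.wrap(all);
        buf.position((int) pointer);
        if (buf.getInt() != TAG) {
            throw new IllegalStateException("no extent starts at " + pointer);
        }
        int length = buf.getInt();
        if (offset < 0 || count < 0 || offset + count > length) {
            throw new IndexOutOfBoundsException("range outside the LOB");
        }
        int start = (int) pointer + HEADER_BYTES + offset;
        return Arrays.copyOfRange(all, start, start + count);
    }
}
```

The main file would then store only the value returned by append() as the column value, and a positioned read never has to scan other extents.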

regards,
Dario.

Thomas Mueller

Dec 26, 2010, 4:14:05 AM
to h2-da...@googlegroups.com
Hi,

> Fragmentation at file system (OS) level will have much more impact on large files, caching (at OS level) will be less effective too,
> read-ahead capabilities will be less effective too and finally IO load
> will increase inevitably.

Could you please provide links to back this up? Or provide a test case
that shows multiple small files are significantly faster than one
large file (given the same file operations)?

> It is easy to measure the degradation of the performance of a database as the data volume is significantly increased.

Do you think the database will be faster if you split it into multiple
files? I don't think so. But if you want, H2 supports the "split file
system". You can easily find out. This also has the advantage that all
files are about the same size (fewer files).

Regards,
Thomas

Dario Fassi

Dec 27, 2010, 11:40:41 AM
to h2-da...@googlegroups.com
Hi,

On 26/12/10 06:14, Thomas Mueller wrote:


>> Fragmentation at file system (OS) level will have much more impact on large files, caching (at OS level) will be less effective too,
>> read-ahead capabilities will be less effective too and finally IO load
>> will increase inevitably.
>
> Could you please provide links to back this up? Or provide a test case
> that shows multiple small files are significantly faster than one
> large file (given the same file operations)?

There is a lot of information on file systems' efforts to mitigate the impact of fragmentation and on smart caching strategies. All of this is strongly OS and file system dependent, but there are many factors in common.
This document has some interesting metrics: http://www.linuxsymposium.org/2006/filesys_frag_slides.pdf

With regard to H2 FileStore usage and fragmentation, "file scattering" may be the most interesting type of fragmentation.
For general background on this subject, start with: http://en.wikipedia.org/wiki/File_system_fragmentation and http://en.wikipedia.org/wiki/File_sequence ; these pages have many references to technical documents and papers.

With regard to caching and read-ahead (or pre-fetch), note that this happens at the hardware, file system and OS levels. Read-ahead (at the hardware level) is a way to reduce IO operations, mainly for sequential access patterns.
Large LOBs are ideal subjects for streaming (a sequential-access IO pattern), in contrast to indexed table rows, which produce mainly random-access IO patterns.

>> It is easy to measure the degradation of the performance of a database as the data volume is significantly increased.
> Do you think the database will be faster if you split it into multiple
> files? I don't think so.

Other DBMSs use many directories with many files as storage, but this isn't the point.
If you mean doing the same thing with many files, it will probably be worse.

I'm talking about separate storage in only 2 files with different FileStore implementations: one for general database metadata, table rows and indexes (good for a random-access IO pattern), kept as small as possible,
and another file containing only LOB objects, organized to facilitate sequential-access IO patterns (over each internal LOB object) and to reduce the size of the other file.

Motivation: in many applications we see that, for any table with LOB columns, only 1 in 4 queries (or fewer) retrieve the LOB columns. I know this can't be generalized, but it doesn't seem unreasonable to think that big LOBs are accessed less frequently than the rest of the common data type values in the same table.
Moreover, it is a common database design practice to use LOB-specific tables like ( id, lob_value ), putting all LOB values in one table and keeping only FK references in the master tables.

About performance and IO load: I don't have a well-done benchmark, but I have an application in production on two sites, under similar conditions except for their database size. We see up to a 20% difference in query performance.
I will try to use "sar", "iostat", etc. to analyze whether this is due to IO wait time or a change in cache hit rate, but it can be tricky to isolate the H2 load from the rest of the IO load.

> But if you want, H2 supports the "split file
> system". You can easily find out. This also has the advantage that all
> files are about the same size (less files)

Remember that we are discussing whether or not it is a good idea to store big LOB values together with the rest of the database objects in a single file when LobsInDatabase is in use.
Anyway, the "split file system" can be useful for doing some performance comparisons between a single large file and many fixed-size files.

regards,
Dario
