LOB Storage Mechanism


Joel Turkel

Feb 19, 2008, 5:30:06 PM
to H2 Database
Hi,

I'm investigating using H2 for an application that will store around
20 million blobs ranging in size from about 10 KB to 10 MB. It looks
like H2 will store blobs inline up to MAX_LENGTH_INPLACE_LOB, or out of
line in one file per blob. Storing them inline won't work well for our
use case, but storing one blob per file won't scale very well even with
LOB_FILES_IN_DIRECTORIES set to true. I know improving LOB storage
is on the roadmap. Any idea on the time frame for this?

Thanks,
Joel

Thomas Mueller

Feb 20, 2008, 3:43:46 PM
to h2-da...@googlegroups.com
Hi,

> storing a blob per file won't scale very well even if
> setting LOB_FILES_IN_DIRECTORIES to true.

Do you know an algorithm that would scale better?

(I have some ideas about how to avoid duplicate files, but that's
probably not an issue here.)

> Any idea on the time frame for this?

The current plan is to create a 1.1 branch later this year where such
features are enabled. Databases will not be fully backward compatible
(an upgrade may be required). The plan is to use the same source code as
the 1.0 branch, only with different default settings (for example,
LOB_FILES_IN_DIRECTORIES enabled by default). I'm not sure when
exactly this will happen (maybe June).

Regards,
Thomas

Joel Turkel

Feb 20, 2008, 4:22:52 PM
to H2 Database
Hi Thomas,

I did some profiling around the current blob support, and it looks
like a large amount of time is being spent in
ValueLob.getNewObjectId(). In my experiment, inserting 100,000 10 KB
blobs into a table with just an integer primary key column and the
blob took 572 seconds. Hacking ValueLob.getNewObjectId() to return a
sequence of increasing integers dropped this time to 213 seconds.
Further hacking to insert all lob data into a single lob data file
dropped the test time to a mere 14 seconds. I suspect the major win
here is performing just one sync on the lob data file at commit time,
instead of syncing/closing each individual lob file after inserting
each lob, because the OS can perform more of the writes asynchronously.
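A minimal sketch of that single-data-file experiment might look like the
following (the class and method names are hypothetical, not the actual H2
code): all LOBs of a transaction are appended to one shared file, and a
single fsync happens at commit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: append every LOB of a transaction to one shared
// data file and sync once at commit, instead of creating, syncing, and
// closing one file per LOB.
public class LobDataFile implements AutoCloseable {
    private final FileChannel channel;

    public LobDataFile(Path file) {
        try {
            this.channel = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends the LOB and returns its offset; the row can then store
    // (file id, offset, length) instead of a per-LOB file name.
    public long append(byte[] lob) {
        try {
            long offset = channel.position();
            channel.write(ByteBuffer.wrap(lob));
            return offset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A single fsync covers all LOBs of the transaction, so the OS can
    // schedule the individual writes asynchronously before commit.
    public void commit() {
        try {
            channel.force(false);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Deferring the sync to commit is what lets the OS batch and reorder the
physical writes, which is consistent with the 14-second result above.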

In terms of lob storage design, I wasn't thinking of anything too
complicated. Perhaps just a sub-directory containing lob data
files. A lob data file would grow until it reached a size limit, and
then an additional lob data file would be created. Some kind of free
list would be used to keep track of free space in the lob data files.
Like row updates, lob updates would not be done in place. This should
hopefully make integration with the transactional subsystem easier.
The row's lob value would continue to store a file name and an offset
in that file (perhaps in the objectId field).
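The free-list idea described above could be sketched roughly like this
(a hypothetical illustration, with made-up names): free extents within a
lob data file are tracked and reused first-fit, and the file grows only
when no extent is large enough.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the free list described above: track free
// extents inside a LOB data file, reuse them first-fit, and append at
// the end of the file when no extent is large enough. Updates would
// allocate a new extent rather than overwrite in place.
public class LobFreeList {
    private static class Extent {
        long offset;
        long length;
        Extent(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final List<Extent> free = new ArrayList<>();
    private long fileLength; // current end of the data file

    // First-fit allocation; shrinks the extent if it is larger than needed.
    public long allocate(long length) {
        for (int i = 0; i < free.size(); i++) {
            Extent e = free.get(i);
            if (e.length >= length) {
                long offset = e.offset;
                e.offset += length;
                e.length -= length;
                if (e.length == 0) {
                    free.remove(i);
                }
                return offset;
            }
        }
        long offset = fileLength; // no free extent: grow the file
        fileLength += length;
        return offset;
    }

    // Called when a LOB is deleted (or after an out-of-place update commits).
    public void release(long offset, long length) {
        free.add(new Extent(offset, length));
    }
}
```

A real implementation would also merge adjacent free extents and persist
the free list, but the allocation logic itself stays this simple.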

-Joel


Thomas Mueller

Mar 2, 2008, 3:32:12 PM
to h2-da...@googlegroups.com
Hi,

What you describe is similar to storing the blob in the data file
itself. It should work well for medium-sized LOBs (maybe 100 bytes to
64 KB). Instead of adding a new storage mechanism, the LOBs could be
stored in a system table (SYSTEM_LOB or so) and then only read when
required.

The advantages of using individual files (for larger LOBs) are: the
space is reclaimed when the LOB is deleted; it is easier to
concurrently write LOBs if the size is unknown; and incremental backup
(backing up only files created or modified since the last backup) is
possible.

> ValueLob.getNewObjectId() to return a sequence of increasing integers

I tried this as well and it looks promising. An optimization is to
cache the directory listings (using a small LRU cache). My prototype is
now faster than the default setting even with a small number of
LOBs; I will continue to work on this.

Regards,
Thomas

Joel Turkel

Mar 3, 2008, 5:03:52 PM
to H2 Database
Hi Thomas,

There are a few reasons why we don't want to store the LOBs in the
data file:

* Cache Efficiency - If the LOBs are stored inline, they'll take up
lots of space in the database page cache. Perhaps this could be solved
by adding something like the NOCACHE table option that Oracle supports?

* Database Size - We'll be storing more than 256GB of LOB data.
Perhaps this could be solved by supporting larger data files or
multiple data files?

* Undo Log Overhead - The undo log stores copies of rows, so copying
inline LOBs around will add some overhead.

* Data File Performance - In our test case of inserting 100,000 10 KB
blobs into a table, 30% of the time was spent in
org.h2.store.DiskFile.allocate(). I suspect this problem could be
fixed, but it still leaves all of the above issues.

Unfortunately, storing LOBs in individual files isn't fast enough for
our use case (15x slower than storing them in LOB data files, even
with the ValueLob.getNewObjectId() fix). I guess that's the beauty of
open source software :) My team is going to look into implementing the
changes I mentioned in my February 20th post, so let me know if you
have any words of advice. We'll post the changes in case there's
anything you'd like to incorporate back into H2.

Thanks,
Joel


Thomas Mueller

Mar 5, 2008, 1:59:26 PM
to h2-da...@googlegroups.com
Hi,

> * Cache Efficiency - If the LOBs are stored inline, they'll take up
> lots of space in the database page cache. Perhaps this could be solved
> by adding something like the NOCACHE table option that Oracle supports?

NOCACHE is a good idea. I will add it to the roadmap (low priority right now).

> * Database Size - We'll be storing more than 256GB of LOB data.
> Perhaps this could be solved by supporting larger data files or
> multiple data files?

Multiple data files are already supported and used by the
LinearHashIndex; however, I will probably remove this index in the next
release as it is very slow and unstable. Each data file has its own
cache and can have logging enabled or not.

> * Data File Performance - In our test case of inserting 100,000 10 KB
> blobs into a table, 30% of the time was spent in
> org.h2.store.DiskFile.allocate(). I suspect this problem could be
> fixed, but it still leaves all of the above issues.

30% is really bad, of course. I would like to fix this problem. Could
you post more information about your test? Maybe I need to re-write
DiskFile.

> so let me know if you
> have any words of advice.

In any case, I would store very small blobs in-place, and use
individual files for objects larger than 128 KB or so. So your
solution is for medium-sized LOBs. It would be great if DiskFile could
be re-used (or replaced). I think there is a potential problem with
not logging and writing multiple objects to the same file: on power
failure, old data could get corrupted. Not so with individual files, or
when using append-only files (for example, tar files). But append-only
requires some kind of 'garbage collection'.
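The size-tiered policy suggested above could be sketched like this (the
class name and thresholds are illustrative, not H2's actual settings):
very small LOBs stay in the row, medium LOBs go to a shared data file,
and large LOBs get an individual file.

```java
// Hypothetical sketch of the size-tiered placement policy described
// above. The thresholds are illustrative only: MAX_IN_PLACE plays the
// role of MAX_LENGTH_INPLACE_LOB, and MAX_SHARED is the ~128 KB cutoff
// suggested for switching to individual files.
public class LobPlacement {
    public enum Storage { IN_PLACE, SHARED_DATA_FILE, INDIVIDUAL_FILE }

    static final long MAX_IN_PLACE = 128;       // very small: store in the row
    static final long MAX_SHARED = 128 * 1024;  // medium: shared LOB data file

    public static Storage choose(long lobLength) {
        if (lobLength <= MAX_IN_PLACE) {
            return Storage.IN_PLACE;
        }
        if (lobLength <= MAX_SHARED) {
            return Storage.SHARED_DATA_FILE;
        }
        return Storage.INDIVIDUAL_FILE; // large: one file per LOB
    }
}
```

Under this policy, the 10 KB blobs from the benchmark earlier in the
thread would land in the shared data file tier, while 10 MB blobs would
keep the individual-file benefits (space reclamation, incremental backup).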

Regards,
Thomas

Joel Turkel

Mar 6, 2008, 1:31:00 PM
to H2 Database
Hi,

> > * Database Size - We'll be storing more than 256GB of LOB data.
> > Perhaps this could be solved by supporting larger data files or
> > multiple data files?
>
> Multiple data files are already supported and used by the
> LinearHashIndex; however, I will probably remove this index in the next
> release as it is very slow and unstable. Each data file has its own
> cache and can have logging enabled or not.

Other than using the LinearHashIndex (which is going away), I assume
there's no way to use multiple data files?

> > * Data File Performance - In our test case of inserting 100,000 10 KB
> > blobs into a table, 30% of the time was spent in
> > org.h2.store.DiskFile.allocate(). I suspect this problem could be
> > fixed, but it still leaves all of the above issues.
>
> 30% is really bad of course. I would like to fix this problem. Could
> you post more information about your test? Maybe I need to re-write
> DiskFile.

I've filed a bug with a test case and profile report to track this
issue: http://code.google.com/p/h2database/issues/detail?id=13

> > so let me know if you
> > have any words of advice.
>
> In any case I would store very small blobs in-place, and use
> individual files for objects larger than 128 KB or so. So your
> solution is for medium sized LOBs. It would be great if DiskFile can
> be re-used (or replaced). I think there is a potential problem with
> not logging and writing multiple objects to the same file: on power
> failure old data could get corrupted. Not so with individual files, or
> when using append-only files (for example tar files). But append-only
> requires some kind of 'garbage collection'.

Could you explain the corruption problem in more detail? My hope was
that not doing lob updates in place and syncing the LOB data files on
commit would avoid many of these issues.

Thanks,
Joel


Thomas Mueller

Mar 6, 2008, 2:44:23 PM
to h2-da...@googlegroups.com
Hi,

> Other than using the LinearHashIndex (which is going away), I assume
> there's no way to use multiple data files?

What I mean is that you can use multiple data files, and you can have a
look at how that is done in LinearHashIndex.

> I've filed a bug with a test case and profile report to track this
> issue: http://code.google.com/p/h2database/issues/detail?id=13

I will have a look, thanks!

> Could you explain the corruption problem in more detail?

Hard drives don't always write in the same order the write methods
were called (elevator optimization or so). Also, there is usually a
write cache (writes are delayed). File systems have algorithms to
detect this and recover in most situations, but if you write to a file
without logging your changes, there might be a problem.

> not doing lob updates in place and syncing the LOB data files on
> commit would avoid many of these issues.

Append-only should be quite safe; however, update-in-place without a
log file is quite dangerous. Basically, you would need to test it. I
think it can be done, but at least checksums should be used (checksums
are used for DiskFile) and recovery algorithms must be implemented.
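The checksum idea could be sketched like this (a hypothetical
illustration using CRC32; not H2's actual on-disk format): each LOB
block carries a checksum of its payload, so recovery can detect a torn
or reordered write after a power failure.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical sketch: store a CRC32 next to each LOB block so recovery
// can detect a torn or reordered write after a power failure and fall
// back to a logged or individual-file copy of the data.
public class ChecksummedBlock {

    // On-disk layout: [4-byte length][payload][4-byte CRC32 of payload].
    public static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer buf = ByteBuffer.allocate(payload.length + 8);
        buf.putInt(payload.length);
        buf.put(payload);
        buf.putInt((int) crc.getValue());
        return buf.array();
    }

    // Returns the payload, or null if the block is corrupt (the recovery
    // case: a partially written or reordered block fails the checksum).
    public static byte[] decode(byte[] block) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        int length = buf.getInt();
        if (length < 0 || length > block.length - 8) {
            return null; // length field itself is damaged
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        int stored = buf.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return stored == (int) crc.getValue() ? payload : null;
    }
}
```

The checksum only detects corruption; a recovery algorithm still has to
decide where to find a good copy of the block, which is why update-in-place
without a log remains the hard case.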

Regards,
Thomas
