MVStore, more than just the new storage backend


Thomas Mueller

Sep 5, 2012, 2:34:21 PM
to h2-da...@googlegroups.com
Hi,

I'm working on a new storage backend for H2, currently called MVStore (for Multi-Version Store). It is not yet ready to be used as the storage backend for the database, but you can try it out in 'standalone' mode if you want (for example to run benchmarks). Please be aware that the API is unstable and will change.

== Map API ==

In addition to being the storage backend for H2, there will be a Java Map API similar to IndexedDB and JDBM3 [1]. The goals and internal architecture of JDBM3 / JDBM4 and MVStore are quite different, so it doesn't make much sense to share code (I discussed this with Jan Kotek).

This is what you can do now. MVMap stands for Multi-Version Map, and implements the java.util.Map interface:

    // open the store (in-memory if fileName is null)
    MVStore s = MVStore.open(fileName);

    // create/get the map "data"
    // the String.class, String.class will be optional later
    MVMap<String, String> map = s.openMap("data",
            String.class, String.class);

    // add some data
    map.put("1", "Hello");
    map.put("2", "World");

    // get the current version, for later use
    long oldVersion = s.getCurrentVersion();

    // from now on, the old version is read-only
    s.incrementVersion();

    // more changes, in the new version
    // changes can be rolled back if required
    // changes always go into 'head' (the newest version)
    map.put("1", "Hi");
    map.remove("2");

    // access the old data (before incrementVersion)
    MVMap<String, String> oldMap =
            map.openVersion(oldVersion);

    // store the newest version to disk
    s.store();

    // print the old version (can be done
    // concurrently with further modifications)
    // this will print Hello World
    System.out.println(oldMap.get("1"));
    System.out.println(oldMap.get("2"));

    // print the newest version ("Hi")
    System.out.println(map.get("1"));

    // close the store - this doesn't write to disk
    s.close();

== Core / Modularity ==

The core only supports very few data types (actually, only String keys and values at the moment), but data types are pluggable, and I will write a module to support all serializable Java objects.

The idea is to keep the core very small and make optional features pluggable. This should simplify testing, make it possible to create a very small Android version, and allow writing a C or C++ version of (just) the core if needed. Pluggable features are, for example: caching, data types / serialization, file system implementations (with things like encryption), the compression algorithm, and alternative tree implementations such as an R-tree to index spatial data. Other components will be implemented on top of the storage, for example SQL and JDBC support, clustering (replication / sharding), and a network server.

== Storage and In-Memory Architecture ==

The basic storage architecture is log structured storage [2] combined with a partially persistent in-memory data structure [3]. 'Partially persistent' doesn't mean only partially stored to disk, but partially immutable (I know this is a bit confusing; I suggest reading the Wikipedia article). For speed, in-memory structures are only copied (COW, copy-on-write) when the version is incremented.
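To illustrate the copy-on-write idea, here is a sketch using a plain immutable binary search tree (not the actual MVStore B-tree pages): an update copies only the path from the root to the changed node, and the old and new versions share everything else, which is why keeping old versions around is cheap.

```java
// Copy-on-write sketch: immutable nodes; put() copies only the path from
// the root to the affected node, so old and new versions share subtrees.
// Illustrative only, not the MVStore implementation.
final class CowTree {
    final String key, value;
    final CowTree left, right;

    CowTree(String key, String value, CowTree left, CowTree right) {
        this.key = key; this.value = value; this.left = left; this.right = right;
    }

    static CowTree put(CowTree t, String key, String value) {
        if (t == null) return new CowTree(key, value, null, null);
        int c = key.compareTo(t.key);
        if (c == 0) return new CowTree(key, value, t.left, t.right);
        if (c < 0) return new CowTree(t.key, t.value, put(t.left, key, value), t.right);
        return new CowTree(t.key, t.value, t.left, put(t.right, key, value));
    }

    static String get(CowTree t, String key) {
        while (t != null) {
            int c = key.compareTo(t.key);
            if (c == 0) return t.value;
            t = (c < 0) ? t.left : t.right;
        }
        return null;
    }
}
```

Holding a reference to an old root is all it takes to keep that version readable, which is why incrementing the version itself is cheap: the copying only happens on the next write.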

Each MVStore supports multiple maps that are stored as B-trees [4]. For spatial queries, there is also a (currently incomplete) R-tree [5] implementation. An MVStore can be purely in-memory or backed by a file.

Data is stored in one file by default, in the form of chunks. A chunk contains a chunk header and a list of pages, stored next to each other. In addition, there is a file header that points to the latest chunk. The file format is quite a bit simpler than the current 'page store'. The MVStore doesn't use in-place updates, and there should be no need for fsync [9], which is the current bottleneck for writes. Pages are variable size and are stored next to each other (no empty space); this alone should improve write performance. Also, each page can optionally be compressed (LZF by default), which should further increase data density. But multiple versions of the same data are sometimes stored, so disk space usage might be a bit higher than now, especially if there are many updates. There is a feature to compact the file (garbage collection and defragmentation), which may run in the background. A detail: truncating and dropping tables (and in the future possibly also range deletes) are faster than they are now, because the leaf pages don't need to be read (around 90% of pages are leaf pages, so it should be about 10 times faster). Other operations such as range counting and skip should also get faster, because it is a counted B-tree [10] (that's possible because pages are copy-on-write [11]).
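The counted B-tree point can be sketched with a simpler counted binary tree: each node caches the size of its subtree, so counting the keys below a given key descends a single path instead of scanning all the leaves (illustrative only; the MVStore stores the counts in its B-tree pages):

```java
// "Counted" tree sketch: every node knows the size of its subtree, so
// rank/skip/range-count queries are O(log n) on a balanced tree.
// Stand-in structure, not the MVStore page format.
final class CountedNode {
    final long key;
    final CountedNode left, right;
    final int size; // number of keys in this subtree

    CountedNode(long key, CountedNode left, CountedNode right) {
        this.key = key; this.left = left; this.right = right;
        this.size = 1 + size(left) + size(right);
    }

    static int size(CountedNode t) { return t == null ? 0 : t.size; }

    static CountedNode insert(CountedNode t, long key) {
        if (t == null) return new CountedNode(key, null, null);
        if (key < t.key) return new CountedNode(t.key, insert(t.left, key), t.right);
        if (key > t.key) return new CountedNode(t.key, t.left, insert(t.right, key));
        return t;
    }

    // number of keys strictly smaller than 'key'; a range count between
    // a and b is simply rank(b) - rank(a), no leaf pages touched
    static int rank(CountedNode t, long key) {
        int r = 0;
        while (t != null) {
            if (key <= t.key) {
                t = t.left;
            } else {
                r += size(t.left) + 1;
                t = t.right;
            }
        }
        return r;
    }
}
```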

The MVStore is a bit similar to LevelDB. As far as I know, LevelDB is a log structured merge tree (LSM tree) and sometimes uses a lot of files, while the MVStore only uses one file (by default) and doesn't copy data unless really required.

The record size limit is the available memory. For large binaries I plan to write a mechanism that splits them into blocks (not started yet), a bit similar to how H2 works now, but maybe it will be a content addressable storage [7] internally.

For each store, there is a metadata map that contains chunk info and information about what maps are available and where they are stored (where the root page is). Other than that, the metadata map is really just like any other map.

To cache pages, a LIRS cache [6] is used currently (but I want to make it pluggable).

For low-level file access, the same file system abstraction as now will be used. There is no .lock.db file; instead, regular file locking will be used by default.

The main advantages compared to the current page store are: simpler file format (no low-level transaction log, no in-place updates), simpler page format (no overflow pages), simpler rollback, no sync required, and native MVCC using COW (copy-on-write). I'm not quite sure yet what the disadvantages are; I guess there are a few. One is: the new storage engine will temporarily need more disk space if you update a lot of data, but I think that's acceptable. Reading isn't aligned to fixed page sizes; I'm not sure if that's a problem (I don't think it is). A transaction log will most likely be needed, at least in some cases (two-phase commit for example); my current plan is to implement it in the form of a map.

== Concurrency ==

The combination of log structured storage and a persistent data structure allows reading old versions of the data without blocking other readers or writers. This includes reading very old data, unless it was explicitly overwritten. Rollback is fast. This simplifies recovery a lot and should simplify MVCC [8], which will probably be the default isolation level in the future (as in PostgreSQL; most likely there will be row-level locking for writes).

Concurrent writes to the head of the same map are not supported, because that would make things slower. If you need them, you need to either split the map or synchronize the writes yourself. I'm quite sure some people will want a true ConcurrentMap, but I will not compromise performance for the others. The more likely case in my view is concurrent read (iterate) and write, and this is supported: read from an older version. That way, readers and writers are isolated.
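The reader/writer isolation described here can be sketched with plain JDK types: a single (or externally synchronized) writer publishes immutable snapshots, and readers iterate whatever snapshot they picked up, unaffected by later writes. SnapshotMap is a hypothetical helper for illustration, not MVStore code.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Snapshot-isolation sketch: writes replace the published map with a new
// immutable copy; readers keep using the snapshot they obtained.
final class SnapshotMap {
    private final AtomicReference<Map<String, String>> head =
            new AtomicReference<>(Collections.<String, String>emptyMap());

    // writer: copy, modify, publish (callers must serialize writes themselves)
    void put(String key, String value) {
        Map<String, String> copy = new TreeMap<>(head.get());
        copy.put(key, value);
        head.set(Collections.unmodifiableMap(copy));
    }

    // reader: a stable snapshot, never affected by later writes
    Map<String, String> snapshot() {
        return head.get();
    }
}
```

The real MVStore shares unchanged pages between versions instead of copying the whole map, but the visible behavior is the same: a reader holding an old version never sees a concurrent writer's changes.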

== SQL / JDBC ==

When used within the database engine, most likely there will be one map per table, and one map per index. For a table, the key is the unique row id (always a long, like now; if the primary key is an int or long, then it is the key of the data map). For an index, the key of the map is the combined index columns (for a unique index), or the combined index columns plus the row id (for a non-unique index); the value is the row id for a unique index, and there is no value at all for a non-unique index.
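A rough sketch of that table/index layout, with types simplified to strings and longs (the field names and the "\0" separator are made up for illustration; the real engine would use typed composite keys, not string concatenation):

```java
import java.util.TreeMap;

// Hypothetical layout sketch: one sorted map for the table data,
// one per index, following the key/value scheme described above.
final class IndexSketch {
    // data map: rowId -> row columns
    final TreeMap<Long, String[]> data = new TreeMap<>();
    // unique index: indexed column value -> rowId
    final TreeMap<String, Long> unique = new TreeMap<>();
    // non-unique index: (column value, rowId) composite key, no value needed
    final TreeMap<String, Boolean> nonUnique = new TreeMap<>();

    void insert(long rowId, String name, String city) {
        data.put(rowId, new String[] { name, city });
        unique.put(name, rowId);                          // e.g. UNIQUE(name)
        nonUnique.put(city + "\0" + rowId, Boolean.TRUE); // e.g. INDEX(city)
    }
}
```

Because the composite key of the non-unique index ends with the row id, an equality lookup on the indexed column becomes a prefix range scan over the sorted map.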

Changing the SQL engine to fully use the MVStore will need some time, maybe a year or so (that's hard to say).

== Performance ==

According to the current benchmarks, the MVStore is much faster than the current page store, for multiple reasons. I believe that on Android it will be faster than SQLite (but I didn't test recently), at least when using the Java API.

== Code ==

The code is currently kept in the following directories (that will change): src/tools/org/h2/dev/store/btree (implementation) and src/test/org/h2/test/store (test and extensions). 


Regards,
Thomas

Rami Ojares

Sep 5, 2012, 3:19:29 PM
to h2-da...@googlegroups.com
Wow, this is much more exciting than Nokia's release of Windows 8 phones
today in New York!
And such a thorough explanation with 11 links. I'm impressed!

- rami

Sergi Vladykin

Sep 5, 2012, 5:13:56 PM
to h2-da...@googlegroups.com
Hi,

Quite interesting stuff, Thomas! Thanks for sharing!

A few points:
You decided to make updates synchronous in favor of single-threaded performance, but concurrent updates are quite an important use case too in this century of multi-core CPUs. Maybe it would not be too hard to make this configurable?

Also, about the Map interface: why doesn't it at least implement NavigableMap, which better reflects the sorted nature of the data structure and directly supports range operations? ConcurrentNavigableMap would be preferable, of course. For example, I created a scalable in-memory table engine with snapshot isolation based on SnapTree (https://github.com/nbronson/snaptree). It would be very interesting to just replace one ConcurrentNavigableMap implementation with another to compare performance.

You say that the page size will be variable; won't it be harder to recover corrupted databases because of this?

Sergi

Thomas Mueller

Sep 6, 2012, 3:04:54 AM
to h2-da...@googlegroups.com
Hi,

About concurrency: it is possible to extend the MVMap to support fully concurrent operations. For example, the R-tree implementation is an extension of the MVMap. The map implementation used is pluggable. According to my tests, for in-memory operations, the current MVMap implementation is about as fast as a TreeMap, which is not synchronized and does not allow concurrent writes. I guess a concurrent MVMap would be slower and more complex, but let's see. There are multiple ways to support fully concurrent operations, for example synchronizing on each page, or using a read-only root node. However, I wonder if it's really such a common use case to update a map concurrently, or to concurrently write to and read from the same version of a map. What exactly is your use case? How do you make sure the threads don't overwrite each other's changes (for concurrent writes)? Don't you need some kind of isolation? For concurrent write to and read from the head, don't you think it's a problem that reading will not be isolated?

NavigableMap: currently it's a java.util.Map, but yes all features of a NavigableMap will be supported later on. Maybe this could be done in the form of a wrapper or abstract base class, to keep the core engine small (similar to java.util.AbstractMap).

SnapTree: I didn't know about this, I will have a look at it.

> You say that the page size will be variable; won't it be harder to recover corrupted databases because of this?

No. The file format is:

[fileHeader] [fileHeader] [chunk]*

The file header is stored (at least) twice for safety. The size of the file header is the native file system block size, or a multiple of it. It contains the pointer to the latest head (actually it will be a list of the latest heads). A chunk is:

[chunkHeader] [page]*

Chunks are also aligned to the file system block size, so a chunk occupies at least one block (but should be around 2 MB or so if you care about write performance). Each chunk header points to the root page of the meta map within this chunk. The meta map contains the positions of the root pages of all other maps (besides other metadata). Pages don't need to be aligned, because they are not overwritten. There are no in-place updates like in the page store and in most databases. At startup, the recovery algorithm just has to check which one is the latest valid chunk. The recovery tool (currently called Dump) reads chunks; that's also not a problem. Each page does have a small page header and a checksum, by the way, so data can be recovered even if the chunk header and some of the pages are corrupt.
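The startup rule ("find the latest valid chunk") could be sketched like this; Chunk here is a stand-in type, not the real chunk header format:

```java
import java.util.List;

// Recovery sketch: scan all chunk headers and keep the newest one whose
// checksum validates. A chunk that was only partially written (e.g. after
// a power failure) fails validation and is simply skipped.
final class Recovery {
    static final class Chunk {
        final long version;
        final boolean checksumOk;
        Chunk(long version, boolean checksumOk) {
            this.version = version; this.checksumOk = checksumOk;
        }
    }

    static Chunk latestValid(List<Chunk> chunks) {
        Chunk best = null;
        for (Chunk c : chunks) {
            if (c.checksumOk && (best == null || c.version > best.version)) {
                best = c;
            }
        }
        return best;
    }
}
```

Because old chunks are never overwritten in place, the previous chunk is still intact on disk, so falling back to it loses at most the changes of the interrupted write.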

Regards,
Thomas

Noel Grandin

Sep 6, 2012, 6:02:34 AM
to h2-da...@googlegroups.com, Thomas Mueller
Hi

Awesome, this is looking great!

Some comments

(1) Perhaps use byte[] as the basic key/value of your interface. This is the lowest common denominator of Java types, and it's easy to wrap String around it.

(2) It's probably a good idea to store a per-block CRC. It's amazing how often corruption manages to sneak through.

(3) I see there is per-page compression! Awesome! I have some tables
that are really going to benefit from this. Is the page size
configurable? I'm happy to trade off larger page sizes for more compression.

Sergi Vladykin

Sep 7, 2012, 4:03:14 AM
to h2-da...@googlegroups.com
Hi,
 
About concurrency: it is possible to extend the MVMap to support fully concurrent operations. For example the R-tree implementation is an extension of the MVMap. The map implementation used is pluggable. According to my tests, for in-memory operations, the current MVMap implementation is about as fast as a TreeMap, which is not synchronized and does not allow concurrent writes. I guess a concurrent MVMap will be slower and more complex, but let's see.

It can be slower on a single thread but allow better throughput in a multi-core environment with multiple threads.
 
There are multiple ways to support fully concurrent operations, for example synchronizing on each page, or using a read-only root node. However, I wonder if it's really such a common use case to update a map concurrently, or to concurrently write to and read from the same version of a map. What exactly is your use case? How do you make sure the threads don't overwrite each other's changes (for concurrent writes)?

It is a very common case. Just imagine two users updating their data in some table concurrently. The updates are independent, on different unique keys, so why should one have to wait for the other? Of course, if it is possible that one thread can incorrectly override changes from another, then the application has to manage this itself, because it is part of the business logic; the storage should not know about it.
 
Don't you need some kind of isolation? For concurrent write to and read from the head, don't you think it's a problem that reading will not be isolated?

Really, it depends. My use case is not that complex (and I exploit this): I have no multiple operations in one transaction, and each operation is just one add, update or delete by primary key, so each update just goes directly to the data structure. When I need to execute a query, I take a snapshot in the table lock operation and remove it on unlock; the main data structure can still be updated while the query runs over the snapshot. Obviously it will be harder to do this in a general way, but I don't think too much harder. Different transaction isolation levels will probably require different approaches, but basically it is all about when to take snapshots and row locks and when to release them. Probably some kind of merge operation will be needed at the map level as well. Maybe the best option would be to make the map transactional itself.
 
NavigableMap: currently it's a java.util.Map, but yes all features of a NavigableMap will be supported later on. Maybe this could be done in the form of a wrapper or abstract base class, to keep the core engine small (similar to java.util.AbstractMap).

That's good!

Sergi

Sergi Vladykin

Sep 7, 2012, 4:23:45 AM
to h2-da...@googlegroups.com, Thomas Mueller
(1) Perhaps use byte[] as the basic key/value of your interface. This is
the lowest common denominator of java types, and it's easy to wrap
String around it.

Not the best advice I've ever heard. Usually you will still have to deserialize at least the key to do comparisons, and you also lose an abstraction level. I would prefer to avoid dealing with byte arrays at all and have an abstract, pluggable serialization/deserialization which will use ObjectInput/ObjectOutput or something similar. Also, it would be nice to take into consideration that some types can be compared without full deserialization but some can't, so it should be possible to do comparisons in the most efficient way.

Sergi

Thomas Mueller

Sep 7, 2012, 12:52:58 PM
to Noel Grandin, h2-da...@googlegroups.com
Hi,

I just saw I didn't send the reply to all...

(1) Perhaps use byte[] as the basic key/value of your interface. This is the lowest common denominator of java types, and it's easy to wrap String around it.

I thought about that, but I found ByteBuffer is fine performance-wise, at least on a regular computer. I'm not sure if that's true for Android as well. One problem is that a ByteBuffer can't auto-extend like a StringBuilder or ByteArrayOutputStream can, so possibly I will use byte[] at some point, or my own implementation (similar to Data in H2).

(2) It's probably a good idea to store a per-block CRC. It's amazing how often corruption manages to sneak through.

There is a checksum per page, but just for the metadata. This is to detect that a file was modified with a text editor or so, or that a file was open in multiple processes, or a power failure (old or missing data at the end of a chunk). Checking every byte of the data is quite an overhead. I don't remember seeing a corrupt checksum in a real-world H2 database file that was caused by media failure or power failure. If I implement a checksum, it will be a partial checksum (checking every nth byte) like in the current H2 page store.

(3) I see there is per-page compression! Awesome! I have some tables that are really going to benefit from this. Is the page size configurable? I'm happy to trade off larger page sizes for more compression.

Yes. The MVStore supports different page sizes within the same store: you can change it on the fly, so during times when there are many writes it could use a smaller page size (with compression disabled), and for defragmentation it could use large page sizes (and a better but slower compression algorithm). By the way, page sizes are currently actually measured in the number of entries (not the number of bytes).

Regards,
Thomas

Thomas Mueller

Sep 7, 2012, 12:57:38 PM
to Sergi Vladykin, h2-da...@googlegroups.com
Hi,

(1) Perhaps use byte[] as the basic key/value of your interface.

I think it's a good point. Especially on Android, ByteBuffer might be quite a lot slower than byte[]. I will definitely test it.

Usually you will still have to deserialize at least the key to do comparisons

I'm not sure how this is related to using byte[] versus ByteBuffer.

I would prefer to avoid dealing with byte arrays at all and have an abstract, pluggable serialization/deserialization which will use ObjectInput/ObjectOutput or something similar.

At some point the data has to be stored to disk, and the question is whether this should be done using a byte[] or a ByteBuffer. Files don't support ObjectInput/ObjectOutput as far as I know.
 
Also it would be nice to take into consideration that some types still can be compared without full deserialization but some can't, so it should be possible to do comparisons in the most effective way.

That's a good point, such features are missing currently. It is something to keep in mind.

Regards,
Thomas

wburzyns

Sep 7, 2012, 3:51:22 PM
to h2-da...@googlegroups.com
I definitely second the idea of having byte[] as keys/values. It's as universal as it could be.

Thomas Mueller

Sep 7, 2012, 4:36:50 PM
to h2-da...@googlegroups.com
Hi,

If the keys and values were individual byte[] arrays, it wouldn't perform well. By default, the keys/values are objects of type java.lang.Object within a Page. If you write your own Page implementation, it could even be int or long (anything).

The bigger question is what the serialization format is before the values are stored to / read from a file, and what the file-level abstraction should be. For a byte[], it could be a RandomAccessFile, and for a ByteBuffer it could be a FileChannel. Of course, a byte[] can be wrapped in a ByteBuffer.

Currently a ByteBuffer is used, and I think I will keep this unless it is a performance problem on Android; if it is, I will probably switch to my own wrapper around byte[], similar to the Data class currently used in H2.

Regards,
Thomas


On Fri, Sep 7, 2012 at 9:51 PM, wburzyns <wbur...@gmail.com> wrote:
I definitely second the idea of having byte[] as keys/values. It's as universal as it could be.


Sergi Vladykin

Sep 8, 2012, 4:37:57 AM
to h2-da...@googlegroups.com, Sergi Vladykin
Hi,


I'm not sure how this is related to using byte[] versus ByteBuffer.

I'm sorry, I thought it was about making a Map<byte[],byte[]>, not about byte[] vs. ByteBuffer.
 
I would prefer to avoid dealing with byte arrays at all and have an abstract, pluggable serialization/deserialization which will use ObjectInput/ObjectOutput or something similar.

At some point the data has to be stored to disk, and the question is whether this should be done using a byte[] or a ByteBuffer. Files don't support ObjectInput/ObjectOutput as far as I know.

Just a simple use case for better understanding: if I want to create storage over off-heap memory, I would prefer not to have any intermediate byte arrays or byte buffers, but have the ability to serialize data directly to the allocated memory region, to avoid copying. So it would be great if the serialization were abstract enough to allow this. A byte array obviously is not abstract at all.

Sergi

Thomas Mueller

Sep 8, 2012, 4:55:23 AM
to h2-da...@googlegroups.com
Hi,

The MVStore is mainly meant to store data on disk. Off-heap memory (in which case the data is stored in a ByteBuffer that was created using ByteBuffer.allocateDirect) can be used in two places: (a) as a large file cache, and (b) as the main 'storage' (using an in-memory file system). Both are possible; I guess (b) is easier to implement (a new file system implementation).

I'm sorry, I thought it was about making Map<byte[],byte[]>

If the application wants to deal with byte arrays then that's possible of course, but in most cases the application (and the database engine) wants to deal with 'real' Java objects to avoid serialization / deserialization for each operation. The database engine will probably use Map<Long, Record> or Map<Long, Object[]> to store records. It is also possible to use Map<Object, ByteBuffer> to avoid copying data, but I guess that's not the normal use case.

Regards,
Thomas

Sergi Vladykin

Sep 8, 2012, 6:09:54 AM
to h2-da...@googlegroups.com

The MVStore is mainly meant to store data on disk. Off-heap memory (in which case the data is stored in a ByteBuffer that was created using ByteBuffer.allocateDirect) can be used in two places: (a) as a large file cache, and (b) as the main 'storage' (using an in-memory file system). Both are possible; I guess (b) is easier to implement (a new file system implementation).

Yes, but currently you serialize pages into an intermediate ByteBuffer and then write it to the FileChannel (I believe you are going to abstract this). I'm saying that in some situations it would be beneficial to avoid this intermediate ByteBuffer entirely. It would be good if the file system abstraction allowed that; the ByteBuffer should be an implementation detail here.

Sergi


Thomas Mueller

Sep 8, 2012, 7:20:39 AM
to h2-da...@googlegroups.com
Hi,

currently you serialize pages into an intermediate ByteBuffer and then write it to the FileChannel (I believe you are going to abstract this).

The file system abstraction is already used, but yes, the data is first written to a ByteBuffer.

I'm saying that in some situations it would be beneficial to avoid this intermediate ByteBuffer entirely.

I guess in most cases people will want to use regular Java objects on the client side (String, Integer, Long, ...); in that case there is no (efficient) way I know of to avoid copying the data first to a ByteBuffer.

I guess what you mean is that if an application uses Map<..., ByteBuffer> or Map<..., byte[]>, the MVStore should avoid copying the data. It could be done, but I'm not sure if this is such an important use case to justify the added complexity. It might make sense if you store relatively large binaries, but I'm not sure. In such cases, it might make sense to store the binaries outside of the database in their own files, just because freeing up disk space would be a lot simpler that way. The MVStore is really more about database records and not so much about really large binaries.

Regards,
Thomas

Sergi Vladykin

Sep 8, 2012, 11:02:01 AM
to h2-da...@googlegroups.com

I guess in most cases people will want to use regular Java objects on the client side (String, Integer, Long, ...); in that case there is no (efficient) way I know of to avoid copying the data first to a ByteBuffer.

Of course, when you work with a conventional file system, it is more efficient to write data to some buffer in memory and then flush it to the file system, instead of writing byte after byte to the file directly; but in the case of an in-memory file system, the situation is the opposite.
 
I guess what you mean is that if an application uses Map<..., ByteBuffer> or Map<..., byte[]>, the MVStore should avoid copying the data. It could be done, but I'm not sure if this is such an important use case to justify the added complexity. It might make sense if you store relatively large binaries, but I'm not sure. In such cases, it might make sense to store the binaries outside of the database in their own files, just because freeing up disk space would be a lot simpler that way. The MVStore is really more about database records and not so much about really large binaries.

It is an interesting use case too, but really I'm talking about a file system abstraction which is independent of the key and value data types. Basically, I suggest using some interface in DataType instead of ByteBuffer (something similar to DataOutput), and getting an instance of that interface from the File abstraction, like file.getOutput(long pos, long size). That way, the file system implementation can decide itself how to do things internally: maybe using a ByteBuffer or byte[], or writing data directly to the location. It does not seem to introduce any complication or performance penalty.

Sergi