Hi,
I'm working on a new storage backend for H2, currently called MVStore (for Multi-Version Store). It is not yet ready to be used as the storage backend for the database, but you can try it out in 'standalone' mode if you want (for example, to run benchmarks). Please be aware that the API is unstable and will change.
== Map API ==
In addition to being the storage backend for H2, there will be a Java Map API similar to IndexedDB and JDBM3 [1]. The goals and internal architecture of JDBM3 / JDBM4 and MVStore are quite different, so it doesn't make much sense to share code (I discussed this with Jan Kotek).
This is what you can do now. MVMap stands for Multi-Version Map, and implements the java.util.Map interface:
// open the store (in-memory if fileName is null)
MVStore s = MVStore.open(fileName);
// create/get the map "data"
// the String.class, String.class will be optional later
MVMap<String, String> map = s.openMap("data",
String.class, String.class);
// add some data
map.put("1", "Hello");
map.put("2", "World");
// get the current version, for later use
long oldVersion = s.getCurrentVersion();
// from now on, the old version is read-only
s.incrementVersion();
// more changes, in the new version
// changes can be rolled back if required
// changes always go into 'head' (the newest version)
map.put("1", "Hi");
map.remove("2");
// access the old data (before incrementVersion)
MVMap<String, String> oldMap =
map.openVersion(oldVersion);
// store the newest version to disk
s.store();
// print the old version (can be done
// concurrently with further modifications)
// this will print "Hello" and "World"
System.out.println(oldMap.get("1"));
System.out.println(oldMap.get("2"));
// print the newest version ("Hi")
System.out.println(map.get("1"));
// close the store - this doesn't write to disk
s.close();
== Core / Modularity ==
The core itself supports only a few data types (actually, only String keys and values at the moment), but data types are pluggable, and I will write a module to support all serializable Java objects.
The idea is to keep the core very small and make optional features pluggable. This should simplify testing, make it possible to create a very small Android version, and allow a C or C++ port of (just) the core if needed. Pluggable features include caching, data types / serialization, file system implementations (with things like encryption), the compression algorithm, and alternative tree implementations such as an R-tree to index spatial data. Other components will be implemented on top of the storage, for example SQL and JDBC support, clustering (replication / sharding), and a network server.
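To illustrate the pluggable data type idea, here is a minimal sketch of what such a module could look like. The interface name, method signatures, and the String implementation are made up for illustration; the actual MVStore API may look quite different:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a pluggable data type: the store only needs
// to know how to write a value to a buffer and read it back.
interface DataType<T> {
    void write(ByteBuffer buf, T value);
    T read(ByteBuffer buf);
}

// Example implementation for String keys and values,
// using a length-prefixed UTF-8 encoding.
class StringDataType implements DataType<String> {
    public void write(ByteBuffer buf, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        buf.putInt(bytes.length);
        buf.put(bytes);
    }
    public String read(ByteBuffer buf) {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

A module for all serializable Java objects would just be another implementation of the same interface, wrapping ObjectOutputStream / ObjectInputStream.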
== Storage and In-Memory Architecture ==
The basic storage architecture is a log structured storage [2] combined with a partially persistent in-memory data structure [3]. 'Partially persistent' doesn't mean only partially stored to disk; it means partially immutable (this is a bit confusing, I know; I suggest reading the Wikipedia article). For speed, in-memory structures are only copied (COW, copy-on-write) when the version is incremented.
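To illustrate the copy-on-write idea, here is a minimal sketch of a partially persistent structure via path copying, using a plain binary search tree instead of a B-tree (the MVStore itself uses B-tree pages; the class and method names are made up for illustration). An update copies only the nodes on the path from the root; the old root still sees the old version:

```java
// Minimal path-copying sketch: nodes are immutable, so an update
// builds a new path to the root and shares all other nodes.
final class Node {
    final String key, value;
    final Node left, right;
    Node(String key, String value, Node left, Node right) {
        this.key = key; this.value = value;
        this.left = left; this.right = right;
    }
    // returns the root of a *new* version; the old root is untouched
    static Node put(Node n, String key, String value) {
        if (n == null) return new Node(key, value, null, null);
        int c = key.compareTo(n.key);
        if (c == 0) return new Node(key, value, n.left, n.right);
        if (c < 0) return new Node(n.key, n.value,
                put(n.left, key, value), n.right);
        return new Node(n.key, n.value,
                n.left, put(n.right, key, value));
    }
    static String get(Node n, String key) {
        while (n != null) {
            int c = key.compareTo(n.key);
            if (c == 0) return n.value;
            n = c < 0 ? n.left : n.right;
        }
        return null;
    }
}
```

Readers holding the old root can keep reading the old version concurrently, without any locking, which is exactly what makes reading old versions cheap.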
Each MVStore supports multiple maps that are stored as B-trees [4]. For spatial queries, there is also a (currently incomplete) R-tree [5] implementation. An MVStore can be kept fully in memory or stored on disk.
Data is stored in one file by default, in the form of chunks. A chunk contains a chunk header and a list of pages, stored next to each other. In addition, there is a file header that points to the latest chunk. The file format is quite a bit simpler than the current 'page store'. The MVStore doesn't use in-place updates, and there should be no need for fsync [9], which is the current bottleneck for writes. Pages have a variable size and are stored next to each other (no empty space); this alone should improve write performance. Also, each page can optionally be compressed (LZF by default), which should further increase data density. However, multiple versions of the same data are sometimes stored, so disk space usage might be a bit higher than now, especially if there are many updates. There is a feature to compact the file (garbage collection and defragmentation), which may run in the background. A detail: truncating and dropping tables (and in the future possibly also range deletes) are faster than they are now, because the leaf pages don't need to be read (around 90% of pages are leaf pages, so it should be about 10 times faster). Other operations such as range counting and skipping should also get faster, because it is a counted B-tree [10] (that's possible because pages are copy-on-write [11]).
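To illustrate why counting gets faster, here is a minimal sketch of the counted B-tree idea: each internal page stores the number of entries below each child pointer, so the total entry count (and similarly the size of a range) can be computed without reading any leaf pages. The class is made up for illustration and leaves out keys, values, and child pointers:

```java
// Sketch of a counted B-tree page: internal pages carry per-child
// entry counts, so counting never has to descend to the leaves.
final class Page {
    final long[] childCounts; // entries below each child; null for a leaf
    final int leafKeyCount;   // number of keys, if this is a leaf
    Page(long[] childCounts, int leafKeyCount) {
        this.childCounts = childCounts;
        this.leafKeyCount = leafKeyCount;
    }
    long totalCount() {
        if (childCounts == null) {
            return leafKeyCount;
        }
        long sum = 0;
        for (long c : childCounts) {
            sum += c;
        }
        return sum;
    }
}
```

Keeping these counts accurate on every update is cheap here precisely because pages are copy-on-write: an update already rewrites the whole path from the leaf to the root, so the counts on that path can be refreshed at the same time.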
The MVStore is a bit similar to leveldb. As far as I know, leveldb uses a log structured merge tree (LSM tree) and sometimes a lot of files, while the MVStore uses only one file (by default) and doesn't copy data unless really required.
The record size is limited only by the available memory. For large binaries I plan to write a mechanism that splits them into blocks (not started yet), a bit similar to how H2 works now, but maybe it will be a content addressable storage [7] internally.
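To illustrate the content addressable storage idea, here is a minimal sketch that splits a binary into fixed-size blocks and stores each block under a hash of its content, so identical blocks are stored only once. The block size, hash function, and class name are all made up for illustration (a real implementation would use a much larger block size and store the blocks in a map in the store):

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.security.MessageDigest;

// Sketch of content addressable block storage: the "address" of a
// block is the hash of its content, which deduplicates equal blocks.
final class BlockStore {
    static final int BLOCK_SIZE = 4; // tiny, for illustration only
    final Map<String, byte[]> blocks = new HashMap<>();

    // splits data into blocks and returns the list of block addresses
    String[] put(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        int n = (data.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            byte[] block = Arrays.copyOfRange(data, i * BLOCK_SIZE,
                    Math.min(data.length, (i + 1) * BLOCK_SIZE));
            String key = Base64.getEncoder().encodeToString(md.digest(block));
            blocks.put(key, block); // identical blocks map to the same key
            keys[i] = key;
        }
        return keys;
    }
}
```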
For each store, there is a metadata map that contains chunk info and information about what maps are available and where they are stored (where the root page is). Other than that, the metadata map is really just like any other map.
To cache pages, a LIRS cache [6] is used currently (but I want to make it pluggable).
For low-level file access, the same file system abstraction is used as now. There is no .lock.db file; instead, regular file locking is used by default.
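To illustrate regular file locking (as opposed to a separate .lock.db marker file), here is a minimal sketch using the standard java.nio API; the class and method names are made up for illustration. The lock is held on the data file itself and is released automatically by the operating system if the process dies:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class FileLocking {
    // Try to acquire an exclusive lock on the store file;
    // returns false if another process already holds the lock.
    static boolean tryLockOnce(File f) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel channel = raf.getChannel();
             FileLock lock = channel.tryLock()) {
            return lock != null;
        }
    }
}
```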
The main advantages compared to the current page store are: a simpler file format (no low-level transaction log, no in-place updates), a simpler page format (no overflow pages), simpler rollback, no sync required, and native MVCC using COW (copy-on-write). I'm not quite sure yet what the disadvantages are; I guess there are a few. One is that the new storage engine will temporarily need more disk space if you update a lot of data, but I think that's acceptable. Reads aren't aligned to fixed page sizes; I'm not sure if that's a problem (I don't think it is). A transaction log will most likely still be needed, at least in some cases (two-phase commit, for example); my current plan is to implement it in the form of a map.
== Concurrency ==
The combination of log structured storage and a persistent data structure makes it possible to read old versions of the data without blocking other readers or writers. This includes reading very old data, unless it was explicitly overwritten. Rollback is fast. This simplifies recovery a lot and should simplify MVCC [8], which will probably be the default isolation level in the future (as in PostgreSQL; most likely there will be row level locking for writes).
Concurrent writes to the head of the same map are not supported, because that would make things slower. If you need concurrent writes, you can either split the map or synchronize the writes yourself. I'm quite sure some people will want a true ConcurrentMap, but I will not compromise performance for everybody else. The more likely case in my view is a concurrent read (iterate) and write, and this is supported: read from an older version. That way, readers and writers are isolated.
== SQL / JDBC ==
When used within the database engine, most likely there will be one map per table and one map per index. For a table, the key is the unique row id (always a long, like now; if the primary key is an int or long, then it is also the key of the data map). For an index, the key of the map is the combined index columns (for a unique index), or the combined index columns plus the row id (for a non-unique index); the value is the row id (or no value at all for a non-unique index).
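To illustrate the proposed layout, here is a minimal sketch using java.util.TreeMap as a stand-in for MVMap; the table, column names, and the composite key encoding (a simple string concatenation) are made up for illustration:

```java
import java.util.TreeMap;

// Sketch of the SQL-to-map layout: one map for the table data,
// one map per index.
public class TableLayout {
    // data map: row id -> row
    TreeMap<Long, String[]> data = new TreeMap<>();
    // unique index on "name": indexed column value -> row id
    TreeMap<String, Long> uniqueIndex = new TreeMap<>();
    // non-unique index on "city": (column value + row id) -> no value
    TreeMap<String, Boolean> nonUniqueIndex = new TreeMap<>();

    void insert(long rowId, String name, String city) {
        data.put(rowId, new String[] { name, city });
        uniqueIndex.put(name, rowId);
        // appending the row id makes the composite key unique,
        // so no value needs to be stored at all
        nonUniqueIndex.put(city + "/" + rowId, Boolean.TRUE);
    }
}
```

An index lookup is then a (range) scan over the index map to find the row ids, followed by lookups in the data map.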
Changing the SQL engine to fully use the MVStore will need some time, maybe a year or so (that's hard to say).
== Performance ==
According to the current benchmarks, the MVStore is much faster than the current page store, for multiple reasons. I believe that on Android it will be faster than SQLite (but I didn't test this recently), at least when using the Java API.
== Code ==
The code is currently kept in the following directories (that will change): src/tools/org/h2/dev/store/btree (implementation) and src/test/org/h2/test/store (test and extensions).
Regards,
Thomas