Lightning Memory DB

Michael Barker

Aug 12, 2013, 5:31:35 PM
to mechanica...@googlegroups.com
Hi,

While doing a bit of work with embedded key/value databases I came
across LMDB (http://symas.com/mdb/) which is the new data store for
OpenLDAP. Howard Chu gave an interesting talk about it at Devoxx
France earlier this year
(http://www.parleys.com/play/517f58f9e4b0c6dcd95464ae/chapter0/about).

It displays some interesting mechanical sympathy, e.g. it doesn't maintain its own data cache but relies on the OS page cache instead, similar to the approach used by the Varnish HTTP proxy.

Mike.

Rajiv Kurian

Aug 12, 2013, 11:05:39 PM
to mechanica...@googlegroups.com
Very interesting. Good benchmarks too. Mike, have you had any personal experience with it?

Michael Barker

Aug 12, 2013, 11:11:53 PM
to mechanica...@googlegroups.com
Not yet. I've done a fair bit with LevelDB, but I'm planning to try it out. I've been digging around in the code a little bit.

Mike.

Rajiv Kurian

Aug 13, 2013, 12:51:52 AM
to mechanica...@googlegroups.com
It would be awesome if you could share your experience after you try it out. I've been looking at LevelDB and HyperDex's fork of LevelDB for some work, but this seems like a very viable option.

Rajiv

Michael Barker

Aug 13, 2013, 12:57:55 AM
to mechanica...@googlegroups.com
My main issue with both LevelDB and LMDB is that the Java bindings are annoying. They follow a model of put(byte[], byte[]), which means that you spend half of your time copying buffers around. At the very least a put(byte[] key, byte[] value, int valueOffset, int valueLength) would be useful. I'm thinking about writing my own bindings for LMDB.
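
For illustration, a hypothetical binding interface with that kind of overload might look like the following; this is not the API of any existing LMDB or LevelDB binding, just a sketch of the idea.

```java
import java.nio.ByteBuffer;

// Hypothetical interface sketching the copy-free overloads discussed above;
// not the API of any existing LMDB or LevelDB binding.
public interface KeyValueStore
{
    // Forces the caller to copy the value into a dedicated array first.
    void put(byte[] key, byte[] value);

    // Lets the caller pass a slice of a larger array without copying it.
    void put(byte[] key, byte[] value, int valueOffset, int valueLength);

    // Direct-buffer variant: the native side can read straight from off-heap memory.
    void put(ByteBuffer key, ByteBuffer value);
}
```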

Mike.

Rajiv Kurian

Aug 13, 2013, 1:03:21 AM
to mechanica...@googlegroups.com
Yeah I have found a lot of JNI libraries doing that. What are you planning to use instead?

Peter Lawrey

Aug 13, 2013, 2:25:47 AM
to mechanica...@googlegroups.com
An interesting range of tests; I should do the same for Chronicle 2.0. (Something to add to my list ;)



Martin Scholl

Aug 13, 2013, 7:13:07 AM
to mechanica...@googlegroups.com
The benchmark results are impressive, but I have my doubts after reading about the design of MDB.

The real impact of a memory-mapped database (obviously) shows up when the data no longer fits into memory. According to the design document[1], MDB employs a block-reusing strategy. That is more than fine if you are building an in-memory data store, but it's a real trade-off: you really don't want to use such a library for longer-lasting transactions or analytical workloads involving listings, because the data will most likely be physically spread all over your disk. In that case, relying on the OS (read: read-ahead) won't help MDB much (so much for the mechanical sympathy of MDB for bigger data)...

It's as always: you really have to read the fine print, and benchmark numbers seldom help you reach the right conclusion about a data store.


Just my $0.02,
Martin

Martin Thompson

Aug 13, 2013, 8:29:21 AM
to mechanica...@googlegroups.com
What was not clear to me in watching the video is what the throughput would be with high numbers of transactions once the DB has become very fragmented after block reuse.  He suggested they sync at the end of each transaction, so this applies whether or not the DB fits in memory.  That would be very random access to disk.


Darach Ennis

Aug 13, 2013, 8:40:10 AM
to mechanica...@googlegroups.com
Hi all,

I'm not sure I agree with the conclusion with respect to doubts and/or memory fit. It looks like, compared to BDB, it is using about a quarter of the RAM, so it would appear fairly optimal with respect to memory utilisation. Anyway, RAM is cheap, just get more! The design choices and tradeoffs are well presented and worth studying. A JNI wrapper looks to be well underway, with a LevelDB API thrown in for good measure: https://github.com/chirino/lmdbjni

However, use cases that impose long-lived read transactions are ill suited to this DB, and the relatively small community (compared to LevelDB) will be an issue for some folk; it may be a little too early for them.

Cheers,

Darach.

Michael Barker

Aug 14, 2013, 3:21:29 AM
to mechanica...@googlegroups.com
My observations so far.  I've started writing my own Java bindings as I don't like the existing ones: they follow the same model as the LevelDB binding and force you to generate loads of garbage.  Mine are currently zero-garbage on write; on read I have to allocate a DirectByteBuffer, but it uses the JNI NewDirectByteBuffer call and just wraps the pointer returned from LMDB (to get to zero garbage on read I'll have to do something evil, like reflectively modifying the address/capacity fields on a DirectByteBuffer, or use Unsafe).  My benchmarks suggest that they add about 200-300ns of overhead; however, my performance test just does a basic put, whereas the db_bench_mdb tool does a write through a cursor, so it's not quite a one-to-one comparison.
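
As an illustration of the "evil" reflective trick mentioned above, a rough sketch might look like the following; the helper class is made up, and modifying the private fields of java.nio.Buffer is unsupported and JVM-specific, so treat it as a sketch rather than a recommendation.

```java
import java.lang.reflect.Field;
import java.nio.Buffer;
import java.nio.ByteBuffer;

// Hypothetical helper: re-point a previously allocated DirectByteBuffer at a
// native address returned by LMDB, avoiding a fresh allocation per read.
// Relies on the private 'address' and 'capacity' fields of java.nio.Buffer.
final class DirectBufferRewrapper
{
    private static final Field ADDRESS;
    private static final Field CAPACITY;

    static
    {
        try
        {
            ADDRESS = Buffer.class.getDeclaredField("address");
            CAPACITY = Buffer.class.getDeclaredField("capacity");
            ADDRESS.setAccessible(true);
            CAPACITY.setAccessible(true);
        }
        catch (NoSuchFieldException e)
        {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Points 'buffer' at 'length' bytes starting at the native 'address'.
    static void rewrap(final ByteBuffer buffer, final long address, final int length)
    {
        try
        {
            ADDRESS.setLong(buffer, address);
            CAPACITY.setInt(buffer, length);
            buffer.clear().limit(length);
        }
        catch (IllegalAccessException e)
        {
            throw new IllegalStateException(e);
        }
    }
}
```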

LMDB has two write options: pwrite (the syscall) or writemap (copy directly to the mmap).  Writemap is about 4x faster than pwrite (~1.5µs vs. ~5µs), but requires the db file to be extended to the full max_mapsize.  That is probably okay, but it is one of those things that can startle ops guys: "What do you mean, on start-up it will automatically consume 2 TiB of disk space?"  A basic single write with writemap is comparable to, but slightly slower than, LevelDB; however, batching makes a much bigger difference, even with pwrite.  With a batch size of around 20 I can get the writes (with pwrite) down to around 700ns, whereas I haven't been able to get LevelDB below 1µs even with batching.  My goal is to reach 1M ops/sec for my current use case.

The compaction point Martin raised is interesting; however, I don't think it will affect my use case.  I am building an index that is a rolling time window over a stream of events, so all values written to the DB will be deleted at some point in the future and their pages released and reused.

Mike.


On 14 August 2013 11:10, Howard Chu <highl...@gmail.com> wrote:


On Tuesday, August 13, 2013 4:13:07 AM UTC-7, Martin Scholl wrote:
The benchmark results are impressive, but I have my doubts after reading about the design of MDB.

The real impact of a memory-mapped database (obviously) shows up when the data no longer fits into memory. According to the design document[1], MDB employs a block-reusing strategy. That is more than fine if you are building an in-memory data store, but it's a real trade-off: you really don't want to use such a library for longer-lasting transactions or analytical workloads involving listings, because the data will most likely be physically spread all over your disk. In that case, relying on the OS (read: read-ahead) won't help MDB much (so much for the mechanical sympathy of MDB for bigger data)...

LMDB turns off readahead. Since the I/O *is* random when the DB is larger than RAM, readahead is actually a liability. With readahead the OS will bring in 100 pages for a single page request, and in so doing it will evict 99 other pages from cache that you probably cared about. (Already discovered this in earlier testing.)

It's as always: you really have to read the fine print, and benchmark numbers seldom help you reach the right conclusion about a data store.

I'd go further - you have to read the source code. And you have to benchmark your own actual workloads.


Martin Thompson

Aug 14, 2013, 3:46:35 AM
to mechanica...@googlegroups.com


On Wednesday, August 14, 2013 8:21:29 AM UTC+1, mikeb01 wrote:

The compaction point Martin raised is interesting; however, I don't think it will affect my use case.  I am building an index that is a rolling time window over a stream of events, so all values written to the DB will be deleted at some point in the future and their pages released and reused.

My concern is that as the pages get reused, the IO becomes very random.  Benchmarking is the best way to determine whether the pattern is an issue or not.

Rajiv Kurian

Aug 14, 2013, 4:21:04 AM
to mechanica...@googlegroups.com
Thanks for the update Mike. Any chance you'll release the JNI bindings?

Michael Barker

Aug 14, 2013, 4:27:03 AM
to mechanica...@googlegroups.com
If we start using them (and LMDB), I'll release them as open source.  Still experimental at this point.



Howard Chu

Aug 14, 2013, 8:11:35 AM
to mechanica...@googlegroups.com
(repost)

Yes, you're right: when the DB doesn't fit in RAM and you have a high number of transactions, there will be very random access to disk. But this is true for *all* disk-based DB designs, and still LMDB is far superior to the others. See http://symas.com/mdb/hyperdex/ for example: LMDB vs. HyperLevelDB with 100 million records and 1 million ops. I also now have the numbers for 10 million ops, not posted there yet but basically the same result.

Howard Chu

Aug 14, 2013, 8:16:57 AM
to mechanica...@googlegroups.com
(repost) LMDB turns off readahead. When the DB is larger than RAM it proves to be detrimental. You may request a single page but the OS will try to readahead 100 pages. If your RAM/cache is already full, 99 pages will be evicted that you probably wanted to keep around.

Sure, the only numbers that actually matter are your own measurements of your own actual workloads. But generic benchmarks are still a good rough indicator, and if you really care to know you can also read the source code yourself to see how things work. But without the rough numbers as a starting point you'd never know where to begin looking.



Martin Scholl

Aug 14, 2013, 8:57:44 AM
to mechanica...@googlegroups.com

On Wed, Aug 14, 2013 at 2:16 PM, Howard Chu <highl...@gmail.com> wrote:
(repost) LMDB turns off readahead. When the DB is larger than RAM it proves to be detrimental. You may request a single page but the OS will try to readahead 100 pages. If your RAM/cache is already full, 99 pages will be evicted that you probably wanted to keep around.

This is exactly the point I wanted to make.

Also, I'd like to highlight that the MDB benchmarks do not reflect object listings, which, depending on size and insertion order, won't make MDB look as good anymore.


Martin.

Martin Thompson

Aug 14, 2013, 9:08:19 AM
to mechanica...@googlegroups.com
Thanks Howard for the clarification.

I'm trying to understand the durability of the transaction in your terminology.  Do you sync the pages to physical storage on commit?  Also do you write complete OS pages per tx, i.e. 4K or 2MB THPs?

Howard Chu

Aug 14, 2013, 9:23:11 AM
to mechanica...@googlegroups.com
In the HyperDex benchmark the commits are asynchronous; a background thread wakes up periodically to perform an fsync. All writes are complete OS pages. Looking over the flush activity, I'd say the actual syncer thread had very little work to do because the OS was already flushing dirty pages continuously.

Howard Chu

Aug 14, 2013, 9:25:26 AM
to mechanica...@googlegroups.com
Sorry, I'm not familiar with your object listings. Can you describe this workload in more detail? With LMDB's zero-copy design, the performance difference between it and other DBs increases as object size goes up.

Martin Thompson

Aug 14, 2013, 9:30:46 AM
to mechanica...@googlegroups.com
To be fair, this is not truly a transaction as a database like Oracle would define it.  Since the data is not synchronously replicated or synchronously made durable, it does not meet the "D" in ACID as a transaction.  This is not a bad thing, it is just not comparing apples with apples.  For me, a transaction should be safe from a crash after a commit.

By having a background thread do an asynchronous fsync (not an fdatasync?), the pages can be scheduled by the IO scheduler for less disk-seek activity.  It will not show up anywhere near as much in a profile as when a sync is applied to each transaction.

Howard Chu

Aug 14, 2013, 9:35:33 AM
to mechanica...@googlegroups.com
Note that LMDB is fully transactional by default. It is simply run asynchronously in this HyperDex benchmark because that is how HyperDex uses HyperLevelDB. The previous microbench results already show synchronous commit behavior.

Howard Chu

Aug 14, 2013, 9:39:13 AM
to mechanica...@googlegroups.com
Yes, LMDB uses fdatasync on platforms where it is supported. Sorry, I really didn't think I needed to specify in this email; it's clearly spelled out in the source code.

Martin Thompson

Aug 14, 2013, 9:53:50 AM
to mechanica...@googlegroups.com
I'm sure I'm not the only person who likes to get a feel for a product's capabilities before committing the time to read and learn to navigate unknown source code.

There has been a lot of FUD surrounding data stores of late regarding what level of reliability is provided by a commit, e.g. MongoDB.  I personally feel any datastore should be upfront about the level of data reliability it provides when committing a transaction without needing to read the source code.

It is great that you are getting impressive benchmarks while doing synchronous transactions to disk.  This will mean multiple page writes per transaction while doing the path copy.  People should be aware that on an SSD this will cause significant wear with high transaction rates compared to a standard journal based transaction system where multiple transactions get batched into single journal pages.  All part of the mechanical sympathy :-)

Howard Chu

Aug 14, 2013, 10:17:10 AM
to mechanica...@googlegroups.com
LMDB is full ACID by default. It uses MVCC and single-writer semantics so all writes are fully serialized. All transactions are fully isolated. All reads are repeatable. These are real transactions; if you make modifications within a txn a subsequent read in the same txn will read what you wrote.

There's definitely a tradeoff here. With LMDB your write load is proportional to the size of your data plus the height of the tree. With a traditional logging system your write load is proportional to 2x the size of your data; records are written to the log first, and then eventually to the datafile. With small data items the logging approach is better. With large data items the LMDB approach is better.

In a journaling system that uses both an undo and a redo log things get even worse. And with an LSM approach, like LevelDB, your write load is the 2x data size of a logging system, plus the continuously growing volume of the overall DB due to compaction and merges.
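
To make the trade-off concrete, here is an illustrative back-of-the-envelope comparison; the page size, tree height, and record size are assumptions picked for the example, not figures from this thread.

```java
// Illustrative only: bytes physically written for one small update under
// assumed parameters (4 KiB pages, B+tree of height 4, 100-byte record).
public final class WriteAmplificationSketch
{
    public static void main(final String[] args)
    {
        final int pageSize = 4096;   // assumed OS/DB page size
        final int treeHeight = 4;    // assumed height of the B+tree (root to leaf)
        final int recordSize = 100;  // payload bytes in the update

        // Copy-on-write B+tree: the leaf plus every page on the path to the root
        // is rewritten, so the cost is roughly pageSize * treeHeight.
        final long copyOnWriteBytes = (long) pageSize * treeHeight;

        // Log-then-datafile: the record is written twice (journal, then in place),
        // ignoring journal page padding.
        final long loggingBytes = 2L * recordSize;

        System.out.printf("copy-on-write: ~%d bytes, logging: ~%d bytes%n",
            copyOnWriteBytes, loggingBytes);
    }
}
```

With a 100-byte record the logging approach clearly writes far less; push the record size toward the page size and the gap closes, which is the crossover described above.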

Howard Chu

Aug 14, 2013, 10:22:54 AM
to mechanica...@googlegroups.com


On Wednesday, August 14, 2013 6:53:50 AM UTC-7, Martin Thompson wrote:
I'm sure I'm not the only person who likes to get a feel for a product's capabilities before committing the time to read and learn to navigate unknown source code.

Is this not upfront enough? http://symas.com/mdb/doc/

Peter Lawrey

Aug 14, 2013, 10:35:51 AM
to mechanica...@googlegroups.com
This looks like an interesting suite of tests, so I have added them to Chronicle as DataStorePerfTest.

With 100K keys of 16 characters (Java doesn't support using byte[] as a key) and 100 byte values, I get the following on my laptop.

Seq write: 1,446 K/s, Seq read: 11,212 K/s, Rnd write: 1,496 K/s, Rnd read: 11,137 K/s
Seq write: 1,448 K/s, Seq read: 11,272 K/s, Rnd write: 1,472 K/s, Rnd read: 11,589 K/s
Seq write: 1,521 K/s, Seq read: 11,299 K/s, Rnd write: 1,481 K/s, Rnd read: 11,586 K/s

Given there is no restriction on memory usage, random and sequential access are much the same, as they are really hitting RAM either way. (Sequential is sorted, Random is shuffled.)

There isn't much basis for comparison as this is a persisted Map<String, byte[]> rather than a database (it supports replication over TCP), but these numbers are not bad compared with the DBs on the list.

Regards,
   Peter.




Jan

Sep 19, 2013, 2:45:42 AM
to mechanica...@googlegroups.com
Hello Howard.

I wanted to ask you a bit more about fragmentation. What if I had an object larger than one page? Does LMDB try to store it in adjacent pages in order to prevent fragmentation? Is there a way to find out how fragmented the DB is? And what is the best way to defragment it?

Jan

Howard Chu

Sep 25, 2013, 3:36:04 AM
to mechanica...@googlegroups.com


On Wednesday, September 18, 2013 11:45:42 PM UTC-7, Jan wrote:
Hello Howard.

I wanted to ask you a bit more about fragmentation. What if I had an object larger than one page? Does LMDB try to store it in adjacent pages in order to prevent fragmentation?

That is already answered in the documentation. I don't answer questions whose answers are already documented.
 
Is there a way to find out how fragmented the DB is?
And what is the best way to defragment it?

 LMDB does not need defragmenting.

Jan

Oct 1, 2013, 2:52:43 AM
to mechanica...@googlegroups.com



OK. I understand that "overflow records occupy a number of contiguous pages". And what about the datafile itself?

Let's say I have a 1MB mapsize. At the beginning I put in ten 100kB objects. Then I delete every second one. Now I should have about 500kB of free space, but I am unable to put in an object larger than 100kB because there is no long enough run of contiguous pages.

So the only solution I have in mind now is to delete all the records from time to time and put them back again, so there is no empty space between them, or just perform mdb_env_copy(). But that is an expensive operation for large datasets.

Would it be possible to maintain free space continuously?
Nicolas Grilly

Apr 23, 2014, 11:54:23 AM
to mechanica...@googlegroups.com
On Wednesday, August 14, 2013 3:53:50 PM UTC+2, Martin Thompson wrote:
It is great that you are getting impressive benchmarks while doing synchronous transactions to disk.  This will mean multiple page writes per transaction while doing the path copy.  People should be aware that on an SSD this will cause significant wear with high transaction rates compared to a standard journal based transaction system where multiple transactions get batched into single journal pages.  All part of the mechanical sympathy :-)

You write that a "standard journal based transaction system" batches "multiple transactions" into "single journal pages". Are you sure about this?

The database engine is expected to fsync after each transaction. Then we have two possibilities: 1) write a full page (usually 4 KB) for each transaction even if only part of the page is used, or 2) only write the necessary bytes to the log, which implies that the next transaction will start on the same page if there is still space available.

In the context of an SSD, the second approach may be inefficient because the flash page cannot be modified in place; the page has to be written elsewhere, or the full block has to be erased and rewritten. With this approach, I would not say that journaling is obviously more efficient than the LMDB approach.

This leaves us with the first approach, which is similar to what LMDB does.

We should be very cautious in trying to evaluate the properties of engines like LevelDB and LMDB on SSDs, because the Flash Translation Layer used by SSDs is proprietary most of the time and it's difficult to know what they really do.

Amirouche Boubekki

Apr 23, 2014, 12:39:34 PM
to mechanica...@googlegroups.com
I started evaluating LMDB for a project, but my application is write heavy and reads are done mostly from the application's cache. LMDB's single writer might be enough for my use case.

I'm sharing my search results in the hope that they are helpful.

I think that the perfect database for my use case would be:

- a k-d tree where k can be at least 3 (plus the value)
- embedded
- an ACID database
- ready for write-heavy loads

And preferably in a language that compiles to machine code like C/C++/D, unlike Java or C#.

I found no k-d tree database with k > 1; when k = 1 it's a key-value store. k-d trees are also called nested trees.

I am also considering MonetDB [0] or rolling my own with the lower-level Stasis framework [1], "a flexible transactional storage library that is geared toward high-performance applications and system developers. It supports concurrent transactional storage, and no-FORCE/STEAL buffer management".


[0] http://www.monetdb.com/
[1] http://code.google.com/p/stasis/
 


Martin Thompson

Apr 23, 2014, 12:47:26 PM
to mechanica...@googlegroups.com

On Wednesday, 23 April 2014 16:54:23 UTC+1, Nicolas Grilly wrote:
On Wednesday, August 14, 2013 3:53:50 PM UTC+2, Martin Thompson wrote:
It is great that you are getting impressive benchmarks while doing synchronous transactions to disk.  This will mean multiple page writes per transaction while doing the path copy.  People should be aware that on an SSD this will cause significant wear with high transaction rates compared to a standard journal based transaction system where multiple transactions get batched into single journal pages.  All part of the mechanical sympathy :-)

You write that a "standard journal based transaction system" batches "multiple transactions" into "single journal pages". Are you sure about this?

There are a number of approaches that can be taken. A simplistic approach of appending to the journal on each transaction as you suggest is possible. This can work well in a low volume system, especially if reacting to burst traffic is not an issue.

In the case of burst traffic it is possible to enqueue transactions to a journal writer that batches writes to disk, filling whole pages, to be efficient and to reduce latency in the burst scenario. Notification of the transaction commit is delayed until the write is complete. I've built many systems like this, and I know of commercial RDBMSs that employ similar techniques.
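
As a minimal sketch of that group-commit pattern (class and method names are invented for illustration, not taken from any product mentioned in this thread; a real journal would also need record framing, omitted here for brevity):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

// Hypothetical group-commit journal writer: callers enqueue a commit record and
// wait on a future; a single writer thread drains whatever has accumulated,
// packs it into whole pages, forces once, then completes every future in the batch.
final class GroupCommitJournal
{
    private static final int PAGE_SIZE = 4096;

    private static final class Pending
    {
        final byte[] record;
        final CompletableFuture<Void> done = new CompletableFuture<>();
        Pending(final byte[] record) { this.record = record; }
    }

    private final BlockingQueue<Pending> queue = new ArrayBlockingQueue<>(1024);
    private final FileChannel journal;

    GroupCommitJournal(final FileChannel journal)
    {
        this.journal = journal;
        final Thread writer = new Thread(this::writeLoop, "journal-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Returns a future that completes once the record has been forced to disk.
    CompletableFuture<Void> commit(final byte[] record) throws InterruptedException
    {
        final Pending pending = new Pending(record);
        queue.put(pending);
        return pending.done;
    }

    private void writeLoop()
    {
        final List<Pending> batch = new ArrayList<>();
        try
        {
            while (true)
            {
                batch.clear();
                batch.add(queue.take());   // block until at least one commit arrives
                queue.drainTo(batch);      // then grab the rest of the burst
                int bytes = 0;
                for (final Pending p : batch) { bytes += p.record.length; }
                final int padded = ((bytes + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
                final ByteBuffer page = ByteBuffer.allocateDirect(padded);
                for (final Pending p : batch) { page.put(p.record); }
                page.position(padded);     // pad out to whole pages
                page.flip();
                journal.write(page);
                journal.force(false);      // one sync covers the whole batch
                for (final Pending p : batch) { p.done.complete(null); }
            }
        }
        catch (final Exception e)
        {
            for (final Pending p : batch) { p.done.completeExceptionally(e); }
        }
    }
}
```

The point is that a single force() amortises across however many transactions arrived while the previous write was in flight, which is what keeps latency bounded under burst load.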


The database engine is expected to fsync after each transaction. Then we have two possibilities: 1) write a full page (usually 4 KB) for each transaction even if only part of the page is used, or 2) only write the necessary bytes to the log, which implies that the next transaction will start on the same page if there is still space available.

Writes need to be for whole blocks; otherwise the device has to read the block, apply the updates, and write back the results, which can hurt performance, especially on HDDs.

There is a third approach of only sync'ing to disk periodically, which a number of databases take: yes, MySQL and MongoDB, I'm looking at you!
 
In the context of an SSD, the second approach may be inefficient because the flash page cannot be modified in place; the page has to be written elsewhere, or the full block has to be erased and rewritten. With this approach, I would not say that journaling is obviously more efficient than the LMDB approach.

This leaves us with the first approach, which is similar to what LMDB does.

We should be very cautious in trying to evaluate the properties of engines like LevelDB and LMDB on SSDs, because the Flash Translation Layer used by SSDs is proprietary most of the time and it's difficult to know what they really do.

If durability is very important you need to have a conversation with your storage vendor and only pick vendors willing to have that conversation :-)

So often it is easier to be resilient via acknowledged replication rather than by syncing to disk. Sync'ing to disk can be considered as replicating to yourself in the future with a partial ACK.

I'm not knocking LMDB. I think the design is really cool and very interesting. For many use cases it can be an excellent and even the best choice.

Martin...
 

Thomas Harning

May 24, 2014, 4:56:59 PM
to mechanica...@googlegroups.com
Any status update on your minimal-garbage binding to LMDB?

Michael Barker

May 27, 2014, 11:16:05 PM
to mechanica...@googlegroups.com
I'm coming back round to working on a system that uses LevelDB over the next month or two.  Migrating to LMDB is one of the potential tasks that we may pick up.  I will post here if there is any progress.

Mike.



Matt Warren

May 28, 2014, 4:25:52 AM
to mechanica...@googlegroups.com
There's a nice in-depth blog series that reviews the technical aspects of LMDB and compares it to LevelDB along the way. It starts here and has some useful info; this one is useful as it shows some scenarios that LMDB doesn't do as well in (admittedly it's about the only such scenario; it does really well in all other cases).



Howard Chu

Jun 19, 2014, 12:21:49 PM
to mechanica...@googlegroups.com
That blog series culminated in Ayende going off and writing their own KV store in C#, called Voron. Looking closer at it, I think they came up with a pretty good approach to write throughput, and I'm considering adopting it for a future LMDB version. It requires a periodic compactor to clean up old journal records though, which is still something I'm leery of.

They change LMDB's copy-on-write strategy and keep the copied pages in a second, ephemeral mmap. At the same time, they write a stream of compressed pages to their journal. Periodically a task must run to copy modified pages from the second mmap back into the main DB mmap, and it necessarily can only do this when it knows no readers are referencing those pages in the main mmap. Once those copies are completed the corresponding chunks of the journal can be released. That's my understanding of it from a high level view, anyway.

The cool part is that the journal is written directly to media (e.g., POSIX O_DIRECT) thus bypassing the OS page cache and obviating any need for fsync. So you get completely safe txn commits without stopping the world waiting for buffered write queues to flush, and of course all of the journal writes are sequential so they're very fast.

There are some added costs of course - page references must all bounce through a translation table to see if the current version of a page is in the 2nd mmap or in the 1st. And the background task must actually fsync its updates to the main DB file before it can truncate the journal. The fsync cost is negligible compared to a sync after every commit, but the translation table slows reads down noticeably.

If the system crashes while the main mmap is being updated, it will be corrupted but the corruption can be trivially repaired by replaying the journal at next startup.

It's a shame they chose to write in C#, instead of writing in C so that their work can be leveraged by the rest of the world.

Howard Chu

Jun 19, 2014, 12:26:45 PM
to mechanica...@googlegroups.com


On Wednesday, April 23, 2014 9:47:26 AM UTC-7, Martin Thompson wrote:

If durability is very important you need to have a conversation with your storage vendor and only pick vendors willing to have that conversation :-)

So often it is easier to be resilient via acknowledged replication rather than by syncing to disk. Sync'ing to disk can be considered as replicating to yourself in the future with a partial ACK.

I'm not knocking LMDB. I think the design is really cool and very interesting. For many use cases it can be an excellent and even the best choice.

Martin...
 
In some use cases it is completely unrivaled. http://symas.com/mdb/inmem/   http://symas.com/mdb/inmem/large.html

Jan Kotek

Jul 3, 2014, 4:13:39 AM
to mechanica...@googlegroups.com

Hi,

Perhaps have a look at MapDB as well. It is pure Java, optimized for almost zero copying and no GC garbage, and it uses mmap'd files as well.

In-memory off-heap mode is only about 3x slower than Java Collections :-)

(I am the author.)

Jan


Viktor Szathmáry

Dec 12, 2014, 3:41:10 PM
to mechanica...@googlegroups.com
Hi Mike,

Any chance of sharing your LMDB JNI wrapper?

Thanks,
  Viktor



Michael Barker

Jan 13, 2015, 4:40:56 PM
to mechanica...@googlegroups.com
Sorry it is unfinished as we didn't end up using LMDB.  I'll dig out what I have and post to github at some point.

Mike.


Kristoffer Sjögren

Jan 26, 2015, 7:00:47 PM
to mechanica...@googlegroups.com
LMDB JNI with support for zero copy if anyone is interested.



ysm562...@gmail.com

Jul 22, 2015, 3:47:24 AM
to mechanical-sympathy
Hi, I'm studying LMDB now, but I can't get the project to run successfully. Would you mind giving me some tips on building the LMDB project with Java?

On Tuesday, August 13, 2013 at 5:31:35 AM UTC+8, mikeb01 wrote:

ivan

Apr 5, 2016, 11:35:27 PM
to mechanical-sympathy
Any plans for Windows 32-bit?

Thanks