ObjectStore design


Jonathan Halliday

Mar 31, 2020, 10:29:03 AM
to narayana-users
 
I've been giving some thought to the way the ObjectStore implementations work. I'm interested mostly in the dominant use case where Narayana is operating as a transaction coordinator, not as a resource manager.
 
The original store implementations use the filesystem directory tree as the index, with one file per tx, named after the tx id. This is slow, as it calls into the O/S for file creation, opening, syncing, closing and deleting. As a result, this store is not widely used in tx coordinator deployments, though it remains the only option full-featured enough for resource manager usage.
 
The journal store uses an append-only log, designed to avoid disk head seeks back in the day when people still used disks with moving parts. It can also batch records, allowing the sync cost to be shared by several tx. That advantage, unlike seek avoidance, carries over to SSDs. However, it comes with a penalty: latency increases in exchange for the greater throughput, as a tx commit must wait for the next batch. Nevertheless, this approach outperforms the original design because it makes fewer O/S calls.
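For a rough sense of that trade-off (the numbers are purely illustrative, not measurements): if a sync costs, say, 1ms and a batch carries 20 records, the amortized sync cost is around 50µs per tx, but a commit that just misses a batch can wait up to an extra batch interval on top of its own sync before it completes.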
 
The accepted wisdom that the append-only log is the optimal structure for our use case seems increasingly questionable. SSDs, now the dominant storage media, have considerable internal concurrency and not only support, but benefit from, the multiple concurrent requests which the log explicitly avoids. There is no inherent reason to order the records, nor to cause one transaction to wait on another. The RMs may need to do that, but the coordinator doesn't.
 
That latency overhead is tolerable on busy systems, where the batching of sync calls continues to be a dominant advantage. But with smaller, more modular deployments becoming common, a lower transaction throughput per server is more common, whilst call latency matters more as fan-out and chaining of calls to different services become prevalent. Furthermore, the latest persistent memory hardware no longer requires an O/S call for the sync operation, so a future-proof design should discount the batching advantage still further.
 
To address the needs of future deployments on hardware that post-dates the design of the existing stores, I've been considering a new approach, incorporating the considerations above and some other observations about the usage patterns. It's common that store records are short-lived (usually just the interval between prepare and commit) and that they are small and mostly uniform in size. Therefore, a design based on a fixed number of reusable parallel 'slots' may work better. A slot may be a pre-existing file in a set of such files, or a region within a larger file. Either way, there is no overhead for creating/opening/closing/deleting, so in that respect it's closer to the journal model than to the original store. Likewise in incorporating the index (i.e. tx id/path) in the record rather than externally in the file system. But by adopting a one-sync-per-tx approach, the design is closer to the original store than to the journal.
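As a rough illustration of what "the index lives inside the record" might mean, here's a sketch of a self-describing fixed-size slot record. The names, sizes and layout are mine, purely for illustration, not the eventual implementation:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical slot record layout: the tx id and type name are embedded in the
// record itself, so the store needs no external index such as a per-tx file name.
final class SlotRecord {
    static final int SLOT_SIZE = 4096; // fixed size, chosen to fit typical coordinator log records

    static ByteBuffer encode(String txId, String typeName, byte[] payload) {
        byte[] id = txId.getBytes(StandardCharsets.UTF_8);
        byte[] type = typeName.getBytes(StandardCharsets.UTF_8);
        int needed = 3 * Integer.BYTES + id.length + type.length + payload.length;
        if (needed > SLOT_SIZE) {
            throw new IllegalArgumentException("record does not fit in one slot");
        }
        ByteBuffer slot = ByteBuffer.allocate(SLOT_SIZE);
        slot.putInt(id.length).put(id);         // embedded index: the tx id
        slot.putInt(type.length).put(type);     // record type / path
        slot.putInt(payload.length).put(payload);
        slot.flip();
        return slot;
    }
}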
 
A tx log write will take an empty slot, write to it and sync it to the persistent store. Apart from a removal from the free-slot queue, this is a concurrent operation and doesn't need to wait for other tx. On the downside, it doesn't amortize the cost of the sync across multiple tx. So: lower latency, but lower throughput on systems loaded heavily enough to benefit from batching. A completed tx releases its slot back into the pool for reuse. That's less space-management overhead than the compaction algorithm of the journal, at the cost of potentially halting the processing of new transactions if all the slots are occupied by tx awaiting recovery. Given that tx log records are small, substantial over-provisioning should not be costly though.
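A minimal sketch of that write path, assuming a pre-allocated file carved into fixed-size slots and a free list of slot indexes; the class and method names are illustrative, not the actual API:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical slot pool: one sync per tx, with no waiting on a shared batch.
final class SlotPool {
    private final FileChannel channel;
    private final int slotSize;
    private final BlockingQueue<Integer> freeSlots;

    SlotPool(Path file, int slotCount, int slotSize) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        this.slotSize = slotSize;
        this.freeSlots = new ArrayBlockingQueue<>(slotCount);
        for (int i = 0; i < slotCount; i++) {
            freeSlots.add(i);
        }
    }

    // Claim a slot, write the record at its offset, then sync just this write.
    int write(ByteBuffer record) throws IOException, InterruptedException {
        int slot = freeSlots.take();                    // blocks only when every slot is occupied
        channel.write(record, (long) slot * slotSize);  // positional writes may proceed concurrently
        channel.force(false);                           // one sync per tx; the cost is not amortized
        return slot;
    }

    // On tx completion the slot becomes reusable. A real store would also need to
    // invalidate the old record (and sync that) before reuse, so recovery never
    // sees a stale entry.
    void release(int slot) {
        freeSlots.add(slot);
    }
}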
 
This design is likely to scale less well than the journal on SSD, as the simpler operation doesn't offset the advantage of batching sync operations. However, on new persistent memory hardware the sync becomes cheap and the advantage goes the other way.
 
I propose to divide the implementation into phases to streamline review and testing. In the first, the store uses a simple in-memory backend, making it similar to the VolatileStore. In the next, it adds a filesystem backend, making it a faster but less feature-rich alternative to the ShadowNoFileLockStore. Finally, it adds an optional persistent memory backend. That last one will be available only on recent (14+) JDKs, so it will be a modular external dependency, much the same as the existing Journal AIO library is.
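One way to picture the phasing is as a single store sitting over a small backend contract, with one backend per phase. This is purely illustrative; the interface and class names are placeholders, not the real code:

import java.io.IOException;

// Hypothetical backend contract; each of the three phases supplies one implementation.
interface SlotBackend {
    void write(int slot, byte[] record) throws IOException;  // place a record in a slot
    byte[] read(int slot) throws IOException;                 // read it back during recovery
    void clear(int slot) throws IOException;                  // invalidate on tx completion
    void flush(int slot) throws IOException;                  // heap: no-op; file: force(); pmem: cache-line flush
}

// Phase one sketch: a heap-backed implementation, comparable to the VolatileStore.
// Phase two would use positional file writes plus FileChannel.force(); phase three
// would delegate the flush to the external pmem library.
final class HeapSlotBackend implements SlotBackend {
    private final byte[][] slots;

    HeapSlotBackend(int slotCount) {
        this.slots = new byte[slotCount][];
    }

    public void write(int slot, byte[] record) { slots[slot] = record.clone(); }
    public byte[] read(int slot) { return slots[slot] == null ? null : slots[slot].clone(); }
    public void clear(int slot) { slots[slot] = null; }
    public void flush(int slot) { /* nothing to sync for volatile storage */ }
}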

Any thoughts on this before I start PRs for the new code?

Jonathan

Mark Little

Mar 31, 2020, 10:44:16 AM
to Jonathan Halliday, narayana-users
Any possible impact on backwards compatibility?


Mark Little

Apr 2, 2020, 5:01:59 AM
to Jonathan Halliday, narayana-users
Ping

Michael Musgrove

Apr 2, 2020, 5:20:38 AM
to narayana-users
I like the ideas. A couple of questions and observations:

Why do RMs need to order records or wait for other transactions?
The single file (memory mapped or otherwise) approach with slots is similar to malloc implementations, so the algorithms should already be well researched.
Although the existing filestore implementations that use a file per transaction could be modified to allow concurrent usage, they cannot avoid the open/close costs you refer to, so the overall benefit should be positive.

Michael

Jonathan Halliday

Apr 2, 2020, 9:32:41 AM
to narayana-users

On 02/04/2020 10:20, Michael Musgrove wrote:

> Why do RMs need to order records or wait for other transactions?
initial state: x = 0, y = 0
tx1 update: x = 1
tx2 update: y = 1
final state: x = 1, y = 1;

If the recovery log contains tx2 but not tx1, the recovered state is
x = 0, y = 1, which is invalid - it's not a state that ever existed and
may violate business constraints. Writing tx2 to disk is insufficient to
guarantee the final state can be recovered - it's necessary to also
ensure that tx1 is written.

> The single file (memory mapped or otherwise) approach with slots is
> similar to malloc implementations so the algorithms should already be
> well researched.

Yes. Much simpler actually, though similar to a slab memory manager that
supports only one allocation block size.


Jonathan


Jonathan Halliday

Apr 2, 2020, 9:32:41 AM
to narayana-users

Not that I am aware of.

Jonathan

On 31/03/2020 15:41, Mark Little wrote:
> Any possible impact on backwards compatibility?


Michael Musgrove

Apr 3, 2020, 5:37:20 AM
to narayana-users
Thanks.

Let us know if you need any input, resources or testing.

Mark Little

Apr 3, 2020, 6:04:27 AM
to Michael Musgrove, narayana-users
Actually that got me thinking … how easy is this going to be to test? Do we need to look at getting specific hardware?

Mark.



Jonathan Halliday

Apr 3, 2020, 6:33:30 AM
to narayana-users

Don't panic, your software engineering minions are moderately competent
and capable of thinking ahead :-)

The SlotStore is designed with pluggable backends - RAM (i.e. Java heap, much like the VolatileStore), regular file system, and persistent memory (pmem). All the shared code (which is most of it) and the first two of the backends run fine on regular hardware and get good test coverage just by configuring the existing tests to run with the new store.
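For instance, assuming the new store is selected the same way the existing ones are (via the ObjectStoreEnvironmentBean's objectStoreType property), a test setup might look something like this; the store class name is just a placeholder:

import com.arjuna.ats.arjuna.common.arjPropertyManager;

public class SlotStoreTestConfig {
    public static void useSlotStore() {
        // Placeholder class name; the point is only that the existing tests can
        // switch store implementation by configuration, as they do today.
        arjPropertyManager.getObjectStoreEnvironmentBean()
                .setObjectStoreType("com.arjuna.ats.internal.arjuna.objectstore.slot.SlotStore");
    }
}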

The pmem-specific backend code, with the exception of a basic adapter, actually lives in the mashona pmem library. It runs either on emulated pmem (which needs a kernel parameter at boot, but no special hardware) or on actual pmem. Testing that library, particularly for performance, is a separate problem, much as testing the Artemis Journal is. But feel free to approve the hardware budget request that's already in for it...
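(For anyone wanting to try the emulated route: as far as I know the usual approach is the Linux memmap boot parameter, e.g. memmap=2G!4G to reserve a 2G region starting at the 4G physical address as pmem; the exact values there are just an example.)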

Full end to end testing of the pmem backend within the context of the
transaction system would require the tx test suite to run on hardware
with either emulated or actual pmem, but even without that you get most
of the problem solved by existing means.

Jonathan

On 03/04/2020 11:04, Mark Little wrote:
> Actually that got me thinking … how easy is this going to be to test? Do
> we need to look at getting specific hardware?

Mark Little

Apr 3, 2020, 6:36:17 AM
to Jonathan Halliday, narayana-users
I'm not aware of any hardware request.

Mark.
