[erlang-questions] mnesia sync_transactions not fsynced?

Emile Joubert

Oct 27, 2011, 7:17:55 AM
to erlang-q...@erlang.org

The docs for mnesia:sync_transaction say that it will wait until data
have been committed and logged to disk (if disk is used). (R14B04)

That appears not to be the case, as a test script illustrates. Repeated
execution of the script (containing paired writes and deletes) should
leave an empty table, but every so often the database starts up
non-empty, indicating that a previous delete was not recorded on disk.

Inserting "disk_log:sync(latest_log)" seems to work around the problem
in the single-node case, but doesn't help when multiple nodes are
involved. What is the recommended approach to guarantee that
transactions are on disk, in both single and multiple node situations?

A previous message on this list appears to indicate that transactions
cannot be made safe in the way that I need:
http://erlang.2086793.n4.nabble.com/Mnesia-disk-logging-and-synchronous-disk-logging-td2087144.html
Is that information still accurate?

-Emile

---- 8< ----- 8< ----- 8< ----- 8< ----- 8< ----- 8< ----- 8< -----
#!/usr/bin/env escript
%%! -sname testnode -mnesia debug verbose

-record(testrec, {id, val=''}).

main([]) ->
    main(["200"]);
main([Arg]) ->
    mnesia_setup(),
    Ids = mnesia:dirty_select(testtab,
                              [{#testrec{id='$1',val='_'},[],['$1']}]),
    io:format("~w entries found: ~w~n", [length(Ids), Ids]),
    runtest(list_to_integer(Arg)).

mnesia_setup() ->
    mnesia:create_schema([node()]),
    mnesia:start(),
    mnesia:create_table(testtab,
                        [{disc_copies, [node()]},
                         {record_name, testrec},
                         {attributes, record_info(fields, testrec)}]),
    mnesia:wait_for_tables([testtab], 1000).

runtest(0) ->
    %disk_log:sync(latest_log),
    halt();
runtest(N) ->
    mnesia:sync_transaction(
      fun () -> mnesia:write(testtab, #testrec{id = N}, write) end),
    mnesia:sync_transaction(fun () -> mnesia:delete({testtab, N}) end),
    runtest(N-1).
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions
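
The single-node workaround mentioned above could be wrapped up roughly as follows. This is a minimal sketch, not a recommended fix: the module and function names are hypothetical, `latest_log` is the log name used in the message and is an internal mnesia detail rather than a documented API, and the flush only covers the local node's log.

```erlang
-module(durable_txn).
-export([sync_transaction_and_flush/1]).

%% Runs Fun in mnesia:sync_transaction/1 and, if it commits, asks disk_log
%% to flush mnesia's transaction log on the local node.
%% Assumption: the log is registered under the name `latest_log`.
sync_transaction_and_flush(Fun) ->
    case mnesia:sync_transaction(Fun) of
        {atomic, Result} ->
            case disk_log:sync(latest_log) of
                ok ->
                    {atomic, Result};
                {error, Reason} ->
                    %% The transaction committed, but the flush did not succeed.
                    {flush_failed, Reason, Result}
            end;
        {aborted, Reason} ->
            {aborted, Reason}
    end.
```

In the test script this would replace the two bare mnesia:sync_transaction/1 calls in runtest/1, at the cost of one log flush per transaction.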

Dan Gudmundsson

Oct 27, 2011, 7:24:59 AM
to Emile Joubert, erlang-q...@erlang.org
The only guarantee you have is that complete commits will be written to disc.

I.e. if you do several operations inside your transaction, either all
of them or none will be saved to that disc (or at least read).

/Dan

Matthew Sackman

Oct 27, 2011, 7:36:25 AM
to erlang-q...@erlang.org
On Thu, Oct 27, 2011 at 01:24:59PM +0200, Dan Gudmundsson wrote:
> The only guarantee you have is that complete commits will be written to disc.

Choice of language here is very important.

Do you agree that the lack of a disk_log:sync means that the complete
commits that are actually written to disk may be a subset of the commits
that mnesia completes?

> I.e. if you do several operations inside your transaction, either all
> of them or none will be saved to that disc (or at least read).

Yes - that last part is very important. It is almost certainly the case
that incomplete transactions are written to disk, it's just they're
detected on startup to be incomplete and thus ignored. That is not the
same as saying "only complete commits are written to disk".

Matthew

Dan Gudmundsson

Oct 27, 2011, 7:52:48 AM
to erlang-q...@erlang.org
On Thu, Oct 27, 2011 at 1:36 PM, Matthew Sackman <mat...@wellquite.org> wrote:
> On Thu, Oct 27, 2011 at 01:24:59PM +0200, Dan Gudmundsson wrote:
>> The only guarantee you have is that complete commits will be written to disc.
>
> Choice of language here is very important.

Sorry, my language is bad Swedish, which gets worse when translated :-/

>
> Do you agree that the lack of a disk_log:sync means that the complete
> commits that are actually written to disk may be a subset of the commits
> that mnesia completes?

Yes that is correct.
If you want more safety you add more nodes/replicas.

/Dan

Håkan Mattsson

Oct 27, 2011, 8:17:07 AM
to Dan Gudmundsson, erlang-q...@erlang.org
On Thu, Oct 27, 2011 at 1:52 PM, Dan Gudmundsson <dan...@gmail.com> wrote:
> On Thu, Oct 27, 2011 at 1:36 PM, Matthew Sackman <mat...@wellquite.org> wrote:
>> On Thu, Oct 27, 2011 at 01:24:59PM +0200, Dan Gudmundsson wrote:
>> Do you agree that the lack of a disk_log:sync means that the complete
>> commits that are actually written to disk may be a subset of the commits
>> that mnesia completes?
>
> Yes that is correct.
> If you want more safety you add more nodes/replicas.

A call to disk_log:sync would increase the probability for the committed data to
be stored on durable media, but there is no guarantee. The data may still be
lost in case of a hard reboot or a power failure. Write caches are evil in this
regard.

Anyway, the subset of the commits that may be lost are the last written ones
before a crash/shutdown. It is not a random subset.

/Håkan
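
For readers unfamiliar with disk_log, the difference between logging a term and syncing the log can be sketched with a standalone log (not mnesia's own). The module name, log name and file path are assumptions for illustration. As noted above, even after disk_log:sync/1 returns, an OS or drive write cache may still hold the data.

```erlang
-module(disk_log_sync_demo).
-export([append_and_sync/2]).

%% Appends Term to the disk_log stored in File, then calls disk_log:sync/1.
%% The log name `demo_log` is arbitrary.
append_and_sync(File, Term) ->
    Log = case disk_log:open([{name, demo_log}, {file, File}]) of
              {ok, L} -> L;
              {repaired, L, _Recovered, _BadBytes} -> L
          end,
    %% After log/2 returns, the term may still be sitting in a buffer.
    ok = disk_log:log(Log, Term),
    %% sync/1 asks for the buffered contents to be written to disk.
    ok = disk_log:sync(Log),
    disk_log:close(Log).
```

A call such as `disk_log_sync_demo:append_and_sync("/tmp/demo.LOG", {txn, 1})` should return `ok` if the file is writable.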

Matthew Sackman

Oct 27, 2011, 8:30:00 AM
to erlang-q...@erlang.org
On Thu, Oct 27, 2011 at 02:17:07PM +0200, Håkan Mattsson wrote:
> A call to disk_log:sync would increase the probability for the committed data to
> be stored on durable media, but there is no guarantee. The data may still be
> lost in case of a hard reboot or a power failure. Write caches are evil in this
> regard.

Yes, but you can turn off disk write caches with other tools. People who
actually care about the data getting to disk when it's claimed it's been
written to disk are very likely to understand the performance
implications of that. Mnesia currently prevents any such configuration.
Ideally, there should be a mechanism whereby mnesia could be configured,
maybe even on a txn-by-txn case as to whether you really want it to hit
disk.

Just out of interest, given this behaviour, what is the point of
mnesia:sync_transaction, as it does not actually sync to disk? In rabbit
we use it to guarantee that dirty_reads on other nodes observe
the new values after the sync_transaction returns, but is
this the only purpose?


Matthew

Emile Joubert

Oct 27, 2011, 8:37:28 AM
to Håkan Mattsson, erlang-q...@erlang.org
On 27/10/11 13:17, Håkan Mattsson wrote:

> Anyway, the subset of the commits that may be lost are the last written ones
> before a crash/shutdown. It is not a random subset.

In that case, would you expect consecutive records with ID numbers
starting at 0 to possibly survive the VM shutdown in the test script I
supplied? I always see a single record with much higher ID.

In both the original RabbitMQ code where this problem was discovered and
the distilled test script the missing transaction was never the last
one, but a random older one.

Emile

Håkan Mattsson

Oct 27, 2011, 9:57:17 AM
to erlang-q...@erlang.org
On Thu, Oct 27, 2011 at 2:30 PM, Matthew Sackman <mat...@wellquite.org> wrote:
> On Thu, Oct 27, 2011 at 02:17:07PM +0200, Håkan Mattsson wrote:
>> A call to disk_log:sync would increase the probability for the committed data to
>> be stored on durable media, but there is no guarantee. The data may still be
>> lost in case of a hard reboot or a power failure. Write caches are evil in this
>> regard.
>
> Yes, but you can turn off disk write caches with other tools. People who
> actually care about the data getting to disk when it's claimed it's been
> written to disk are very likely to understand the performance
> implications of that. Mnesia currently prevents any such configuration.
> Ideally, there should be a mechanism whereby mnesia could be configured,
> maybe even on a txn-by-txn case as to whether you really want it to hit
> disk.

I think it would be better for such a flag to be per table rather than a
parameter to the transaction. If at least one of the tables involved in
the transaction has the flag, it should imply that disk_log:sync is invoked
for the entire transaction.

> Just out of interest, given this behaviour, what is the point of
> mnesia:sync_transaction, as it does not actually sync to disk? In rabbit
> we use it to guarantee that dirty_reads on other nodes observe
> the new values after the sync_transaction returns, but is
> this the only purpose?

The "sync" in sync_transaction is not related to hard disks. It means that
the transaction function returns only once the transaction has been committed
on all nodes. But even though ordinary transactions may return earlier, that
will not affect the consistency of the database.

/Håkan
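
The use described in the question (a dirty read on another node right after sync_transaction returns) might look like the sketch below. The module and function names are hypothetical, and it assumes that Node is connected, holds a replica of the table, and that the table uses the default key position.

```erlang
-module(sync_txn_demo).
-export([write_then_remote_read/3]).

%% Writes Rec with mnesia:sync_transaction/1, then dirty-reads it on Node.
%% Nothing here forces the record onto disk; it only waits for the commit
%% to be carried out on the nodes involved.
write_then_remote_read(Node, Tab, Rec) ->
    Key = element(2, Rec),   % assumes the default key position
    {atomic, ok} = mnesia:sync_transaction(
                     fun() -> mnesia:write(Tab, Rec, write) end),
    rpc:call(Node, mnesia, dirty_read, [Tab, Key]).
```

Passing `node()` as Node runs the read locally, which is enough to try the function out on a single node.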

Håkan Mattsson

Oct 27, 2011, 10:10:51 AM
to Emile Joubert, erlang-q...@erlang.org
2011/10/27 Emile Joubert <em...@rabbitmq.com>:

> On 27/10/11 13:17, Håkan Mattsson wrote:
>
>> Anyway, the subset of the commits that may be lost are the last written ones
>> before a crash/shutdown. It is not a random subset.
>
> In that case, would you expect consecutive records with ID numbers
> starting at 0 to possibly survive the VM shutdown in the test script I
> supplied? I always see a single record with much higher ID.

There should not be any consecutive records. It should be at most one record
according to your test program.

> In both the original RabbitMQ code where this problem was discovered and
> the distilled test script the missing transaction was never the last one, but a
> random older one.

This is not proven by your test program. The result of your test program does
not contradict what I said.

/Håkan
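
The "at most one record" argument can be checked with a small model. This is only a sketch under the assumption that whatever survives a crash is a prefix of the committed operations; it says nothing about mnesia's internals, and the module and function names are made up for illustration.

```erlang
-module(log_prefix).
-export([ops/1, survivors/1, check/1]).

%% The operations runtest(N) commits, in order:
%% write N, delete N, write N-1, delete N-1, ...
ops(N) ->
    lists:append([[{write, Id}, {delete, Id}] || Id <- lists:seq(N, 1, -1)]).

%% Ids left in the table after replaying a list of operations.
survivors(Ops) ->
    lists:foldl(fun({write, Id}, Acc)  -> ordsets:add_element(Id, Acc);
                   ({delete, Id}, Acc) -> ordsets:del_element(Id, Acc)
                end, ordsets:new(), Ops).

%% True if every prefix of the log leaves at most one record behind.
check(N) ->
    Ops = ops(N),
    lists:all(fun(Len) ->
                      length(survivors(lists:sublist(Ops, Len))) =< 1
              end,
              lists:seq(0, length(Ops))).
```

Under this model a prefix that ends right after a write leaves exactly that one record, which can have any ID between 1 and N depending on where the log was cut.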

Jon Watte

Oct 27, 2011, 7:05:54 PM
to Emile Joubert, erlang-q...@erlang.org
 
> runtest(N) ->
>    mnesia:sync_transaction(
>        fun () -> mnesia:write(testtab, #testrec{id = N}, write) end),
>    mnesia:sync_transaction(fun () -> mnesia:delete({testtab, N}) end),
>    runtest(N-1).



This code runs two separate transactions.  One to add a record, and one to delete a record. If you bring down the node after the first one has committed, but before the second one has committed, then you will see one record when you start up again.

Are you claiming that if you run this from 100 through 0, and crash the node at 10, you may see a record with the value 40 in the database? If that is the case, then the cause is simply that all of the transactions from 39 .. 0 never wrote to disk before the crash. This can, roughly, be thought of as a SQL database achieving some particular point in its binlogs, but nothing beyond that point.

The observation here is that mnesia:sync_transaction(), under the conditions observed, does not fulfill the "Durable" part of ACID transactions -- a committed transaction (writing value 39, say) was observed by the system, but after re-start, that point was not durable.
Whether mnesia:sync_transaction() "should be" durable is not something I'm qualified to answer :-) However, reading the documentation, it certainly *sounds* as if it should provide durability:

"This function waits until data have been committed and logged to disk (if disk is used) on every involved node before it returns"

Thus, assuming that you don't have a kernel/device-based write cache/power-fail outage (but simply an Erlang process death outage), then I would expect the last record seen on disk to be the last record observed in the test function.

Sincerely,

jw


Dan Gudmundsson

Oct 28, 2011, 3:18:47 AM
to Jon Watte, erlang-q...@erlang.org
Disk log caches writes as well so you can lose many writes when
halting the process.

/Dan

Håkan Mattsson

Oct 28, 2011, 6:05:12 AM
to Dan Gudmundsson, erlang-q...@erlang.org
For obvious reasons, the write cache in disk_log ought to be disabled
for the transaction log.

/Håkan

Matthew Sackman

Oct 28, 2011, 8:37:21 AM
to erlang-q...@erlang.org
On Fri, Oct 28, 2011 at 12:05:12PM +0200, Håkan Mattsson wrote:
> For obvious reasons, the write cache in disk_log ought to be disabled
> for the transaction log.

As a minimum. You almost certainly want to ensure that the file is
opened with O_DIRECT and O_SYNC too, which are flags which are not
exposed in the file module. Either that, or to actually do the fsync's
before each txn commits.

Matthew

Håkan Mattsson

Oct 28, 2011, 9:21:00 AM
to erlang-q...@erlang.org
On Fri, Oct 28, 2011 at 2:37 PM, Matthew Sackman <mat...@wellquite.org> wrote:
> On Fri, Oct 28, 2011 at 12:05:12PM +0200, Håkan Mattsson wrote:
>> For obvious reasons, the write cache in disk_log ought to be disabled
>> for the transaction log.
>
> As a minimum. You almost certainly want to ensure that the file is
> opened with O_DIRECT and O_SYNC too, which are flags which are not
> exposed in the file module. Either that, or to actually do the fsync's
> before each txn commits.

Maybe. If used naively, these flags may impose bad characteristics on the VM.

/Håkan

Dan Gudmundsson

Oct 28, 2011, 9:21:37 AM
to erlang-q...@erlang.org
On Fri, Oct 28, 2011 at 2:37 PM, Matthew Sackman <mat...@wellquite.org> wrote:
> On Fri, Oct 28, 2011 at 12:05:12PM +0200, Håkan Mattsson wrote:
>> For obvious reasons, the write cache in disk_log ought to be disabled
>> for the transaction log.
>
> As a minimum. You almost certainly want to ensure that the file is
> opened with O_DIRECT and O_SYNC too, which are flags which are not
> exposed in the file module. Either that, or to actually do the fsync's
> before each txn commits.
>

Well I didn't know it was cached, and there is no api to turn it off yet either.
/Dan

Jon Watte

Oct 29, 2011, 9:24:05 PM
to Dan Gudmundsson, erlang-q...@erlang.org
On Fri, Oct 28, 2011 at 12:18 AM, Dan Gudmundsson <dan...@gmail.com> wrote:
> Disk log caches writes as well so you can lose many writes when
> halting the process.
>
> /Dan



 
This means that the implementation does not match the documentation. I think one or the other should change :-)
Or the documentation writers have another understanding of the words "logged to disk" than I do :-)

Sincerely,

jw


Dan Gudmundsson

Oct 30, 2011, 6:59:47 AM
to Jon Watte, erlang-q...@erlang.org
As far as mnesia is concerned, it is logged to disc. It has left the
mnesia call chain and there is nothing more mnesia can do, except sync
the disk, which is a performance penalty that is not acceptable per
transaction.

I have been reminded why the disk_log cache was implemented: when you
push transactions in a short loop, you are pushing more transactions
than can be written to disc.
So everything was stuck in disk_log's message queue.
The way to improve that was to write larger chunks to the disc.
So removing the timeout in disk_log will probably not change the
situation, only that the messages will be stuck in the message queue
instead. They will still not be written to the disc fast enough.

/Dan

Jon Watte

Oct 30, 2011, 11:39:44 PM
to Dan Gudmundsson, erlang-q...@erlang.org
On Sun, Oct 30, 2011 at 3:59 AM, Dan Gudmundsson <dan...@gmail.com> wrote:
> As far as mnesia is concerned, it is logged to disc. It has left the
> mnesia call chain and there is nothing more mnesia can do, except sync
> the disk, which is a performance penalty that is not acceptable per
> transaction.


"Not acceptable" to who? That's what a durable transaction *is*. For most users of actual databases, it is not acceptable that a (durable, isolated) transaction does *not* sync the disk when it claims to synchronize and its results are globally visible.

Imagine building a bank account balance transfer system on a database that allows globally visible changes to be rolled back. First, I transfer money to your account. Then, someone asks how much money you have to cover some payment. Then, the database crashes, and my transfer to you is reverted, even though the transfer was observed globally from outside the transaction. 

Again, you can change the documentation if you want -- but from this discussion, I've learned something new -- what mnesia calls a transactional relational database is not the same as what you'll find in a database systems textbook.

 
> I have been reminded why the disk_log cache was implemented: when you
> push transactions in a short loop, you are pushing more transactions
> than can be written to disc.


This is what back pressure is for. If you need more peak throughput than your disk can provide (when including blocked commits) then you have to use non-synchronous "transactions" (which really aren't), and maybe a replicated table store in a second data center, to reduce the risk of permanent data loss. Mnesia can do this (neat!) similar to how systems like MongoDB can do this, but that's not a durable, isolated transaction system in the face of machine failure.

 
> So everything was stuck in disk_log's message queue.
> The way to improve that was to write larger chunks to the disc.


That's fine. A synchronous transaction commit should not be allowed until that larger block has committed, though. This means that you may collect many transactions waiting for commit, and they all commit with that particular block flush at the same time.


I think a note in the documentation would be useful. I think an implementation that blocks sync transactions until the block has flushed to disk (and thus, unblocking many synchronous transactions at once) might be even better!


Sincerely,

jw
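
The idea of blocking synchronous commits until a whole block has been flushed, and then releasing them together, can be sketched as a small Erlang process. This is an illustration only, not how mnesia or disk_log work: the module name and API are hypothetical, and FlushFun stands in for "write the batch and fsync it".

```erlang
-module(group_commit).
-export([start/1, commit/2, stop/1]).

%% Starts a committer. FlushFun is called once per batch with the list of
%% entries and should return only after they are durable.
start(FlushFun) ->
    spawn(fun() -> loop(FlushFun) end).

%% Blocks the caller until the batch containing Entry has been flushed.
commit(Pid, Entry) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {commit, self(), MRef, Entry},
    receive
        {MRef, Result} ->
            erlang:demonitor(MRef, [flush]),
            Result;
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    end.

stop(Pid) ->
    Pid ! stop,
    ok.

loop(FlushFun) ->
    receive
        {commit, From, Ref, Entry} ->
            %% Collect whatever else is already queued, then flush once.
            Batch = drain([{From, Ref, Entry}]),
            Result = FlushFun([E || {_, _, E} <- Batch]),
            %% Release every caller in the batch with the same result.
            lists:foreach(fun({P, R, _}) -> P ! {R, Result} end, Batch),
            loop(FlushFun);
        stop ->
            ok
    end.

drain(Acc) ->
    receive
        {commit, From, Ref, Entry} ->
            drain([{From, Ref, Entry} | Acc])
    after 0 ->
        lists:reverse(Acc)
    end.
```

Requests that arrive while a flush is in progress wait in the committer's mailbox and form the next batch, so callers experience the back pressure directly: each commit/2 call returns no sooner than the flush that covers it.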
 

Ulf Wiger

Oct 31, 2011, 10:00:09 AM
to Jon Watte, erlang-q...@erlang.org
On 31 Oct 2011, at 04:39, Jon Watte wrote:


> On Sun, Oct 30, 2011 at 3:59 AM, Dan Gudmundsson <dan...@gmail.com> wrote:
>> As far as mnesia is concerned, it is logged to disc. It has left the
>> mnesia call chain and there is nothing more mnesia can do, except sync
>> the disk, which is a performance penalty that is not acceptable per
>> transaction.
>
> "Not acceptable" to who? That's what a durable transaction *is*. For most users of actual databases, it is not acceptable that a (durable, isolated) transaction does *not* sync the disk when it claims to synchronize and its results are globally visible.

Mnesia was never designed to be durable in the same sense as e.g. Oracle et al. If you want to be able to guarantee durability in a single-node installation with rotating disks, you should probably use a raw partition in the first place, and take complete control of memory management, including caching. Most traditional RDBMS had to do this, as replication did not become an option until much later - for example, PostgreSQL didn't introduce synchronous replication until release 9.1, which was released in September 2011.

Mnesia is primarily a ram database, designed for distributed systems. These systems rely more on redundancy than disk-based durability.

While one could imagine adding an option for mnesia to sync the disk, it would not be acceptable if it couldn't be turned off, or at least done only periodically.

The sync(8) man page also includes this caveat:

"On Linux, sync is only guaranteed to schedule the dirty blocks for writing; it can actually take a short time before all the blocks are finally written. The reboot(8) and halt(8) commands take this into account by sleeping for a few seconds after calling sync(2)."


"To achieve high performance, databases will use group commits whereby multiple transactions in a commit cycle will use the same write/sync operation to make all of the transactions durable. This is possible where they are all appending to the same transaction log.

This may mean that the response of an individual commit may be delayed (while waiting for others to join the commit cycle) but the overall throughput is much greater across the whole database because the cost of the write/sync is amortized across multiple transactions. For example, each individual transaction may take 10 ms, but thousands of transactions are all able to commit in the same cycle."

A notable difference between Mnesia and e.g. Oracle is that mnesia has _much_ better response times. One contributing factor is that mnesia doesn't delay transactions in order to increase throughput through batching. In the domain where mnesia is used, latency is usually more important than throughput.

also from the Stackoverflow thread:

"This also raises the question of what is durable? Is a single disk durable? Not if the disk fails. Is a RAID array durable? Not if there is a catastrophic RAID corruption. The only guarantee of durability is where transactions are replicated across multiple remote database instances - but not everybody needs that level of guarantee. Durability should not be considered a binary option but rather as a choice of level of durability."


BR,
Ulf W

Ulf Wiger, CTO, Erlang Solutions, Ltd.


