That appears not to be the case, as a test script illustrates. Repeated
execution of the script (containing paired writes and deletes) should
leave an empty table, but every so often the database starts up
non-empty, indicating that a previous delete was not recorded on disk.
Inserting "disk_log:sync(latest_log)" seems to work around the problem
in the single-node case, but doesn't help when multiple nodes are
involved. What is the recommended approach to guarantee that
transactions are on disk, in both single and multiple node situations?
A previous message on this list appears to indicate that transactions
cannot be made safe in the way that I need:
http://erlang.2086793.n4.nabble.com/Mnesia-disk-logging-and-synchronous-disk-logging-td2087144.html
Is that information still accurate?
-Emile
---- 8< ----- 8< ----- 8< ----- 8< ----- 8< ----- 8< ----- 8< -----
#!/usr/bin/env escript
%%! -sname testnode -mnesia debug verbose

-record(testrec, {id, val=''}).

main([]) ->
    main(["200"]);
main([Arg]) ->
    mnesia_setup(),
    Ids = mnesia:dirty_select(testtab,
                              [{#testrec{id='$1',val='_'},[],['$1']}]),
    io:format("~w entries found: ~w~n", [length(Ids), Ids]),
    runtest(list_to_integer(Arg)).

mnesia_setup() ->
    mnesia:create_schema([node()]),
    mnesia:start(),
    mnesia:create_table(testtab,
                        [{disc_copies, [node()]},
                         {record_name, testrec},
                         {attributes, record_info(fields, testrec)}]),
    mnesia:wait_for_tables([testtab], 1000).

runtest(0) ->
    %disk_log:sync(latest_log),
    halt();
runtest(N) ->
    mnesia:sync_transaction(
      fun () -> mnesia:write(testtab, #testrec{id = N}, write) end),
    mnesia:sync_transaction(fun () -> mnesia:delete({testtab, N}) end),
    runtest(N-1).
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions
I.e. if you do several operations inside your transaction, either all
of them or none will be saved to disc (or at least read).
/Dan
Choice of language here is very important.
Do you agree that the lack of a disk_log:sync means that the complete
commits that are actually written to disk may be a subset of the commits
that mnesia completes?
> I.e. if you do several operations inside your transaction, either all
> of them or none will be saved to disc (or at least read).
Yes - that last part is very important. It is almost certainly the case
that incomplete transactions are written to disk, it's just they're
detected on startup to be incomplete and thus ignored. That is not the
same as saying "only complete commits are written to disk".
Matthew
Sorry, my language is bad Swedish, which gets worse when translated :-/
>
> Do you agree that the lack of a disk_log:sync means that the complete
> commits that are actually written to disk may be a subset of the commits
> that mnesia completes?
Yes that is correct.
If you want more safety you add more nodes/replicas.
/Dan
A call to disk_log:sync would increase the probability for the committed data to
be stored on durable media, but there is no guarantee. The data may still be
lost in case of a hard reboot or a power failure. Write caches are evil in this
regard.
Anyway, the subset of the commits that may be lost are the last written ones
before a crash/shutdown. It is not a random subset.
/Håkan
Yes, but you can turn off disk write caches with other tools. People who
actually care about the data getting to disk when it's claimed it's been
written to disk are very likely to understand the performance
implications of that. Mnesia currently prevents any such configuration.
Ideally, there should be a mechanism whereby mnesia could be configured,
maybe even on a txn-by-txn basis, as to whether you really want it to hit
disk.
Just out of interest, given this behaviour, what is the point of
mnesia:sync_transaction, as it does not actually sync to disk? In rabbit
we use it to guarantee that dirty_reads on other nodes observe the new
values after the sync_transaction returns, but is this the only purpose?
Matthew
> Anyway, the subset of the commits that may be lost are the last written ones
> before a crash/shutdown. It is not a random subset.
In that case, would you expect consecutive records with ID numbers
starting at 0 to possibly survive the VM shutdown in the test script I
supplied? I always see a single record with much higher ID.
In both the original RabbitMQ code where this problem was discovered and
the distilled test script the missing transaction was never the last
one, but a random older one.
Emile
I think that it is better that such a flag would be per table and not a
parameter to the transaction. If at least one of the tables involved in
the transaction has the flag, it should imply that disk_log:sync is invoked
for the entire transaction.
> Just out of interest, given this behaviour, what is the point of
> mnesia:sync_transaction, as it does not actually sync to disk? In rabbit
> we use it to guarantee that dirty_reads on other nodes observe the new
> values after the sync_transaction returns, but is this the only purpose?
The "sync" in sync_transaction is not related to hard disks. It means that
the transaction function returns only once the transaction has been committed
on all nodes. But even though ordinary transactions may return earlier, that
will not affect the consistency of the database.
/Håkan
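
A minimal sketch of the single-node workaround mentioned at the top of the
thread: run the transaction with mnesia:sync_transaction/1, then ask disk_log
to flush the log named latest_log (the name used in the original post). The
module name durable_txn and the function sync_transaction_and_flush/1 are
hypothetical and not part of mnesia. As discussed in this thread, the flush
only raises the likelihood that the commit reaches durable media, and it does
nothing for the logs on other nodes.

```erlang
-module(durable_txn).
-export([sync_transaction_and_flush/1]).

%% Hypothetical helper, not part of mnesia.
%% Runs Fun as a sync_transaction and, if it commits, asks disk_log to
%% flush the local transaction log to disk before returning.
sync_transaction_and_flush(Fun) when is_function(Fun, 0) ->
    case mnesia:sync_transaction(Fun) of
        {atomic, Result} ->
            %% Assumption: latest_log is the disk_log name of mnesia's
            %% transaction log, as used in the original post.
            case disk_log:sync(latest_log) of
                ok ->
                    {atomic, Result};
                {error, Reason} ->
                    %% The transaction did commit; only the flush failed.
                    {error, {log_not_synced, Reason, Result}}
            end;
        {aborted, Reason} ->
            {aborted, Reason}
    end.
```

In the test script this would replace each bare mnesia:sync_transaction call,
at the cost of one flush per transaction.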
There should not be any consecutive records. It should be at most one record
according to your test program.
> In both the original RabbitMQ code where this problem was discovered and
> the distilled test script the missing transaction was never the last one, but a
> random older one.
This is not proven by your test program. The result of your test program does
not contradict what I said.
/Håkan
runtest(N) ->
    mnesia:sync_transaction(
      fun () -> mnesia:write(testtab, #testrec{id = N}, write) end),
    mnesia:sync_transaction(fun () -> mnesia:delete({testtab, N}) end),
    runtest(N-1).
This function waits until data have been committed and logged to disk (if disk is used) on every involved node before it returns
/Dan
/Håkan
As a minimum. You almost certainly want to ensure that the file is
opened with O_DIRECT and O_SYNC too, which are flags which are not
exposed in the file module. Either that, or to actually do the fsync's
before each txn commits.
Matthew
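
As an illustration of the "fsync before reporting success" idea, here is a
minimal sketch at the level of the file module, independent of mnesia and
disk_log. The module name durable_append and the function append_durably/2 are
hypothetical. file:sync/1 asks the operating system to flush the file to disk;
whether the data survives a power failure still depends on drive write caches,
as noted earlier in the thread.

```erlang
-module(durable_append).
-export([append_durably/2]).

%% Hypothetical helper: append Bytes to the file at Path and return ok
%% only after the operating system has been asked to flush it to disk.
append_durably(Path, Bytes) ->
    case file:open(Path, [append, raw, binary]) of
        {ok, Fd} ->
            try
                case file:write(Fd, Bytes) of
                    ok ->
                        %% Flush OS buffers for this file (fsync-like).
                        file:sync(Fd);
                    {error, _} = WriteError ->
                        WriteError
                end
            after
                %% Always release the file descriptor.
                file:close(Fd)
            end;
        {error, _} = OpenError ->
            OpenError
    end.
```

Opening and flushing on every call is deliberately simple; a real log writer
would keep the file open and batch several writes per flush, which is the
trade-off the rest of this thread is about.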
Maybe. If used naively, these flags may impose bad characteristics on the VM.
/Håkan
Well I didn't know it was cached, and there is no api to turn it off yet either.
/Dan
Disk log caches writes as well, so you can lose many writes when
halting the process.
/Dan
I have been reminded why the disk_log cache was implemented: when you
push transactions in a tight loop, you push more transactions than can
be written to disc, so everything got stuck in disk_log's message queue.
The way to improve that was to write larger chunks to the disc.
So removing the timeout in disk_log will probably not change the
situation, only that messages will be stuck in the message queue
instead. They will still not be written to the disc fast enough.
/Dan
As far as mnesia is concerned it is logged to disc. It has left
mnesia's call chain and there is nothing more mnesia can do, except
sync the disk, which is a performance penalty that is not acceptable
per transaction.
On Sun, Oct 30, 2011 at 3:59 AM, Dan Gudmundsson <dan...@gmail.com> wrote:
> As far as mnesia is concerned it is logged to disc. It has left
> mnesia's call chain and there is nothing more mnesia can do, except
> sync the disk, which is a performance penalty that is not acceptable
> per transaction.
"Not acceptable" to who? That's what a durable transaction *is*. For most users of actual databases, it is not acceptable that a (durable, isolated) transaction does *not* sync the disk when it claims to synchronize and its results are globally visible.