[erlang-questions] Mnesia deadlock with large volume of dirty operations?

Brian Acton

Apr 1, 2010, 7:11:05 PM

to erlang-q...@erlang.org

Hi guys,

I am running R13B04 SMP on FreeBSD 7.3. I have a cluster of 7 nodes running
mnesia.

I have a table of 1196143 records using about 1.504GB of storage. It's a
reasonably hot table doing a fair number of insert operations at any given
time.

I decided that since there was a 2GB limit in mnesia that I should do some
cleanup on the system and specifically this table.

Trying to avoid major problems with Mnesia, transaction load, and deadlock,
I decided to do dirty_select and dirty_delete_object individually on the
records.

I started slow, deleting first 10, then 100, then 1000, then 10000, then
100,000 records. My goal was to delete 192593 records total.

The first five deletions went through nicely and caused minimal to no
impact.

Unfortunately, the very last delete blew up the system. My delete command
completed successfully, but on the other nodes it caused mnesia to get stuck
on pending transactions, filled up my message queues, and basically brought
down the whole system. We saw some "Mnesia is overloaded" messages in our
logs on these nodes, but not a ton of them.

Does anyone have any clues on what went wrong? I am attaching my code below
for your review.

--b

Mnesia configuration tunables:

-mnesia no_table_loaders 20
-mnesia dc_dump_limit 40
-mnesia dump_log_write_threshold 10000

Example error message:

** WARNING ** Mnesia is overloaded: {mnesia_tm, message_queue_len, [387,842]}

Sample code:

%% Select every record whose third element is older than Days days.
Select = fun(Days) ->
    {MegaSecs, Secs, _MicroSecs} = now(),
    %% Cutoff in seconds, then back to a now()-style tuple.
    T = MegaSecs * 1000000 + Secs - 86400 * Days,
    TimeStamp = {T div 1000000, T rem 1000000, 0},
    mnesia:dirty_select(offline_msg,
        [{'$1',
          [{'<', {element, 3, '$1'}, {TimeStamp}}],
          ['$1']}])
end.

Count = fun(Days) -> length(Select(Days)) end.

%% Dirty-delete at most Total of the selected records, one at a time.
Delete = fun(Days, Total) ->
    C = Select(Days),
    D = lists:sublist(C, Total),
    lists:foreach(fun(Rec) ->
                      ok = mnesia:dirty_delete_object(Rec)
                  end,
                  D),
    length(D)
end.

Dan Gudmundsson

Apr 2, 2010, 3:19:10 AM

to Brian Acton, erlang-q...@erlang.org

When you use dirty operations, every operation is sent separately to all
nodes, i.e. 192593*6 messages. A transaction could actually have been faster
in this case, with one (large) message containing all the ops sent to each
node.

What you get is an overloaded mnesia_tm (very long message queues), which is
the process that does the actual writing of the data on the other
(participating) mnesia nodes.

So transactions will be blocked waiting for mnesia_tm to process those
200000 messages on the other nodes.

/Dan

________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-questio...@erlang.org

Ovidiu Deac

Apr 2, 2010, 8:47:04 AM

to Dan Gudmundsson, Brian Acton, erlang-q...@erlang.org

To me it sounds like another example of premature optimization which
went wrong? :)

Brian Acton

Apr 2, 2010, 2:22:52 PM

to Ovidiu Deac, Dan Gudmundsson, erlang-q...@erlang.org

I'm sorry. I neglected to tell you what I had done on the previous day.

On the previous day, I had attempted to delete some old records using this
methodology:

mnesia:write_lock_table(offline_msg),
mnesia:foldl(
    fun(Rec, _Acc) ->
        case Rec#offline_msg.expire of
            never ->
                ok;
            TS ->
                if
                    TS < TimeStamp ->
                        mnesia:delete_object(Rec);
                    true ->
                        ok
                end
        end
    end, ok, offline_msg)

This delete finished on the 1st node but subsequently locked up all the
other nodes on a table lock. The cluster blew up and my 24/7 service went
into 1 hr of downtime and recovery.

So to recap,

on day 1 - transaction start, table lock, delete objects - finished in about
2 minutes
on day 2 - dirty select, dirty delete objects - finished in about 2 minutes

In both cases, the cluster blew up and became unusable for at least 20-30
minutes. After 20-30 minutes, we initiated recovery protocols.

Should I try

day 3 - transaction start, no table lock, delete objects

? is the table lock too coarse grained ? considering that the cluster has
blown up twice, i'm obviously a little scared to try another variation....

--b

Dan Gudmundsson

Apr 2, 2010, 4:05:21 PM

to Brian Acton, Ovidiu Deac, erlang-q...@erlang.org

clear_table is the fastest way you can delete it, but it will take a
while when there is a lot of data.

/Dan

Brian Acton

Apr 2, 2010, 4:19:09 PM

to Dan Gudmundsson, Ovidiu Deac, erlang-q...@erlang.org

On this particular table, I do not want to delete all entries. This is why I
posted a separate post to the mailing list. Combining the two threads back,
I want:

One table, I want to delete entries > n days.
Another table, I want to delete all entries.

Both tables are reasonably hot (~1-2 ops per second) and reasonably large (>
1.5GB). I'm hitting the 2GB limit and I need to clean up these tables.

So far, any attempts at maintenance (as outlined in previous emails) have
resulted in Mnesia seizing up and bringing down the cluster.

It sounds like I have to do this in very small increments with wait time
between increments. However, I do not have a method and mechanism for
determining the size of an increment or a wait time between increments. I'm
fine doing ten deletes per 1 second if that's what it takes. However, I'd
like to be able to figure out the maximum number of deletes that I can do in
the minimum amount of time.

I'm definitely open to suggestion on this.

--b

Dan Gudmundsson

Apr 2, 2010, 4:55:11 PM

to Brian Acton, Ovidiu Deac, erlang-q...@erlang.org

Well, I can't give much advice, but I would definitely test this on a
non-live system first.

mnesia:foldl is not the best tool when you are changing a lot of records: it
has to keep every change in memory until you have traversed the whole table.
It is also slow with a lot of changes, since it has to compensate for the
things you have done earlier in the transaction.

I assume you are using dets (disc_only) since you are afraid of the 2G
limit, or is it a memory limit on Windows?

dets is slow; mnesia is primarily a RAM database.

The only way I see is to chunk through the tables, a couple of 100 to 1000
records per transaction or something, and have code that can deal with both
the new and old format while the database is being changed.

Good luck
/Dan
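
A minimal sketch of the chunking Dan describes, assuming the records to
delete have already been selected into a list (for example with the Select
fun from the first post). The module name chunked_purge and the functions
chunks/2 and purge/3 are hypothetical names made up for this illustration;
only mnesia:transaction/1 and mnesia:delete_object/1 are real Mnesia calls.
The chunk size and pause are placeholders to be tuned on a non-live system,
not recommendations.

```erlang
-module(chunked_purge).
-export([chunks/2, purge/3]).

%% Split List into sublists of at most N elements each (N > 0).
chunks([], _N) ->
    [];
chunks(List, N) when is_integer(N), N > 0 ->
    {Chunk, Rest} = take(List, N, []),
    [Chunk | chunks(Rest, N)].

%% Take up to N elements off the front of a list.
take(Rest, 0, Acc) -> {lists:reverse(Acc), Rest};
take([], _N, Acc) -> {lists:reverse(Acc), []};
take([H | T], N, Acc) -> take(T, N - 1, [H | Acc]).

%% Delete Records in batches of ChunkSize, one transaction per batch,
%% sleeping PauseMs milliseconds after each batch so the other nodes
%% get a chance to catch up. Returns the number of records processed.
%% Crashes (badmatch) if a transaction aborts.
purge(Records, ChunkSize, PauseMs) ->
    lists:foldl(
      fun(Chunk, Count) ->
              {atomic, ok} =
                  mnesia:transaction(
                    fun() ->
                            lists:foreach(fun mnesia:delete_object/1, Chunk)
                    end),
              timer:sleep(PauseMs),
              Count + length(Chunk)
      end,
      0,
      chunks(Records, ChunkSize)).
```

With the Select fun from the first post, a call might look like
chunked_purge:purge(Select(30), 500, 1000), i.e. 500 records per transaction
with a one second pause. Each batch then travels to the other nodes as one
transaction rather than as one message per record.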

Brian Acton

Apr 2, 2010, 5:08:32 PM

to Dan Gudmundsson, Ovidiu Deac, erlang-q...@erlang.org

Yes. I am using dets (disc_only) tables in mnesia.

Since I was able to delete 10k records previously, I think I am going to
start with a baseline of 10k records with a 60 second sleep interval.
Hopefully this will work successfully. I wish I knew what a more appropriate
sleep period would be, as the maintenance is now going to take a very long
time.

Thanks for your help,

--b

Bob Ippolito

Apr 2, 2010, 5:14:11 PM

to Brian Acton, Dan Gudmundsson, Ovidiu Deac, erlang-q...@erlang.org

You might want to measure the message_queue_len of the mnesia_tm
processes (on each node) to see if it's getting behind, and tune your
waits based upon if the message queues are small/empty or not.
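
A rough sketch of that measurement, assuming the nodes are reachable over
rpc. The module name tm_backlog and its functions are hypothetical names for
this illustration; whereis/1, erlang:process_info/2 and rpc:call/4 are
standard calls. The threshold and poll interval are assumptions to be tuned,
and unreachable nodes are simply skipped here, which may not be what you
want in production.

```erlang
-module(tm_backlog).
-export([queue_len/1, queue_lens/1, wait_until_drained/3]).

%% Message queue length of the mnesia_tm process on Node, or
%% undefined if mnesia_tm is not running there or the rpc fails.
queue_len(Node) ->
    case rpc:call(Node, erlang, whereis, [mnesia_tm]) of
        Pid when is_pid(Pid) ->
            %% process_info/2 must run on the node that owns Pid.
            case rpc:call(Node, erlang, process_info,
                          [Pid, message_queue_len]) of
                {message_queue_len, Len} -> Len;
                _ -> undefined
            end;
        _ ->
            undefined
    end.

%% Queue lengths for a list of nodes, as [{Node, Len | undefined}].
queue_lens(Nodes) ->
    [{N, queue_len(N)} || N <- Nodes].

%% Block until every reachable node's mnesia_tm queue is at most
%% MaxLen messages long, polling every PollMs milliseconds.
wait_until_drained(Nodes, MaxLen, PollMs) ->
    Busy = [N || {N, Len} <- queue_lens(Nodes),
                 is_integer(Len), Len > MaxLen],
    case Busy of
        [] ->
            ok;
        _ ->
            timer:sleep(PollMs),
            wait_until_drained(Nodes, MaxLen, PollMs)
    end.
```

A delete loop could call
tm_backlog:wait_until_drained(mnesia:system_info(running_db_nodes), 100, 500)
between batches instead of sleeping for a fixed interval.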

Brian Acton

Apr 2, 2010, 7:00:06 PM

to Bob Ippolito, Dan Gudmundsson, Ovidiu Deac, erlang-q...@erlang.org

Well, I went ahead and deleted about 82k messages in 10k batches.

I did this over about a 15 minute period.

The good news is that the system has not crashed.

The bad news is that some of the reported sizes of the tables have grown
dangerously close to the 2GB limit and further that the tables appear to be
wholly inconsistent:

Here is a dump of my 7 nodes

node 1: offline_msg : with 995638 records occupying 2039455556 bytes on disc
node 2: offline_msg : with 1015600 records occupying 2097112225 bytes on disc
node 3: offline_msg : with 995641 records occupying 1797758788 bytes on disc
node 4: offline_msg : with 1015204 records occupying 2096658267 bytes on disc
node 5: offline_msg : with 995615 records occupying 1776787268 bytes on disc
node 6: offline_msg : with 995618 records occupying 1388054291 bytes on disc
node 7: offline_msg : with 995611 records occupying 1388054291 bytes on disc

before I started, the nodes were about 1.36GB on disc. Some of them are now
close to the 2GB limit.

the delete operation was initiated on node 5. the message queues on all of
the nodes are zero across the board. my logs are clean and give the
indication that everything proceeded normally.

i think at this point, my only recourse is to restart nodes 1-5 in the hopes
that they clone from 6 and 7, providing the best space reclamation....