[erlang-questions] Mnesia could not write core file: system_limit


Slobodan Miskovic

Dec 8, 2009, 7:53:42 PM
to Erlang-Questions Questions
Hi all, one of our nodes just popped up the following error:

Mnesia(node@localhost): ** ERROR ** (could not write core file:
system_limit)
** FATAL ** Cannot open log file "/var/mnesia/invoice_item.DCL":
{file_error, "/var/mnesia/invoice_item.DCL", system_limit}

I naively tried mnesia:start() only to have the same error repeat but
for a different table.

Restarting the whole node it now seems ok (error hasn't popped up again)
but...

Problem is that it seems data is now missing from at least one of those
tables. I have made a backup of the whole mnesia after the second
failure, but I don't suppose the data is in there somewhere.

1. What has caused this? There is plenty of space left on the device,
process was running as root, ulimit reports unlimited.

2. Is the data really gone? I have enough information elsewhere to
reconstruct the data, but would like to know where it has gone.

3. How to prevent this from happening again? Periodic backups would not
help with this as I cannot afford to lose data since the last backup even
if I was to do 1hr intervals. Would running another node on another
system be assurance enough?

I find it very strange that some data would get dropped (about 10k
records out of 85k record table), is it a sign of a mnesia bug or is
this something I should anticipate and work around?

Some potentially useful system info:
- Erlang R12B4
- about 200 tables, 50% sets 50% bags, both ram and disc copies for each
table
- node takes about 1.2GB ram when data is loaded (I understand there is
2GB per table limit, or am I misguided)
- du -sh /var/mnesia/
343M /var/mnesia/

- Filesystem Size Used Avail Use% Mounted on
/dev/md/1 233G 49G 184G 22% /

- Slackware Linux 2.6.24.5-smp #2 SMP Wed Apr 30 13:41:38 CDT 2008 i686
Intel(R) Pentium(R) Dual CPU E2200 @ 2.20GHz GenuineIntel GNU/Linux

- node did not generate erl_crash.dump as I brought it down via shell
q(). The only trace I have is the error above and subsequent error
reports of calls to mnesia failing {aborted, {node_not_running...


Any help and pointers are highly appreciated.

Thanks!
Slobo

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Valentin Micic

Dec 9, 2009, 2:58:05 AM
to Slobodan Miskovic, Erlang-Questions Questions
Just a wild guess... could it be that you're running out of available file
descriptors? The error indicates that you cannot open a log file, which may
be caused by this.

V/

Slobodan Miskovic

Dec 9, 2009, 3:24:17 AM
to Erlang-Questions Questions
That is a possibility. The node has been running for about 40 days, but I've had longer runs with no problem on the same system with similar loads.

How would I check the FD usage of a running system? Is there a way to get notified if the node is approaching the limit, in which case I could take some corrective action (such as restarting the node)?

Do open FDs get closed if a process dies? Perhaps that's where the leak is coming from.

The biggest question that remains is why Mnesia lost data. I would understand losing whatever was going to be written when system_limit was reached, but I lost a lot of "old" data as well, data that had been there for days, even weeks.

Thanks
Slobo

Roberto Aloi

Dec 9, 2009, 5:24:17 AM
to Slobodan Miskovic, Erlang-Questions Questions
Hi,

a link that could be useful to investigate possible system limits occurring:

http://www.erlang.org/doc/efficiency_guide/advanced.html#id2265856

Then, you might want to use (but you probably did already):
* erlang:system_info/1
* erlang:memory()
* length(processes())
* erlang:process_info/2

At least, this is what I would do...
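
The calls listed above can be combined into a single snapshot that is cheap to log periodically. This is a minimal sketch, not something from the thread: the module name `node_snapshot` and the choice of counters are assumptions, and `erlang:system_info(process_count)` may not be available on releases as old as R12B (in which case `length(processes())` gives the same number).

```erlang
-module(node_snapshot).
-export([snapshot/0]).

%% Collect a few VM-wide counters that help when hunting for system limits.
%% Returns a property list so it can be logged or compared over time.
snapshot() ->
    [{process_count, erlang:system_info(process_count)},
     {process_limit, erlang:system_info(process_limit)},
     {port_count,    length(erlang:ports())},
     {ets_tables,    length(ets:all())},
     {total_memory,  erlang:memory(total)}].
```

Comparing two snapshots taken some hours apart should show whether processes, ports, or ETS tables are steadily growing.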

Regards,

--
Roberto Aloi
robert...@erlang-consulting.com
http://www.erlang-consulting.com

Bernard Duggan

Dec 9, 2009, 10:06:40 AM
to Slobodan Miskovic, Erlang-Questions Questions
Slobodan Miskovic wrote:
> How would I check the FD usage of a running system?
Quick way to check on linux is to count the number of entries in
/proc/<PID>/fd (excluding . and .. of course). The command "lsof -p
<PID>" will also show you (although it lists a bunch of other things too
and I can't remember off the top of my head how to filter just to file
descriptors that contribute to the process's limit).

> Is there a way to get notified if node is approaching the limit, in which case I could take some corrective action (such as restarting the node).
>

I'll leave that as an exercise for the reader ;) (Although an erlang
process which periodically monitors it seems like it would be simple
enough to implement).
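
One way such a monitoring process might look is sketched below. It is Linux-only because it counts the entries under /proc/self/fd, and the module name `fd_watch`, the interval, and the threshold are all hypothetical. Listing the directory may itself briefly open a descriptor, so the count can be off by one.

```erlang
-module(fd_watch).
-export([start/2, fd_count/0]).

%% Count the open descriptors of the emulator's OS process.
%% Relies on /proc, so this only works on Linux.
fd_count() ->
    case file:list_dir("/proc/self/fd") of
        {ok, Entries}   -> {ok, length(Entries)};
        {error, Reason} -> {error, Reason}
    end.

%% Spawn a process that checks every IntervalMs milliseconds and logs a
%% warning whenever the count is at or above Threshold.
%% Send the returned pid the atom 'stop' to end it.
start(IntervalMs, Threshold) ->
    spawn(fun() -> loop(IntervalMs, Threshold) end).

loop(IntervalMs, Threshold) ->
    case fd_count() of
        {ok, N} when N >= Threshold ->
            error_logger:warning_msg(
              "open file descriptors: ~p (threshold ~p)~n", [N, Threshold]);
        _ ->
            ok
    end,
    receive
        stop -> ok
    after IntervalMs ->
        loop(IntervalMs, Threshold)
    end.
```

For example, `fd_watch:start(60000, 800)` would check once a minute and warn at 800 descriptors; a sensible threshold depends on the limit the node is actually started with.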


> Do open FDs get closed if a process dies? Perhaps that's where the leak is coming from.
>

In general, yes they do. However be aware that many operations in
Erlang spawn a process, and they're not always cleaned up in the way you
might expect (we found a particular case with long-lasting HTTP client
connections where you have to explicitly shut them down, even if the
process that initiated it had crashed). Basically, what you need to be
concerned about is process leaks, not FD leaks per se.

Cheers,

Bernard

Slobodan Miskovic

Dec 9, 2009, 5:03:08 PM
to Erlang-Questions Questions
On Wed, 2009-12-09 at 10:06 -0500, Bernard Duggan wrote:
> > How would I check the FD usage of a running system?
> Quick way to check on linux is to count the number of entries in
> /proc/<PID>/fd (excluding . and .. of course).

Heh, thinking outside of the (VM) box - guess there is no query-able
interface in Erlang for this? Linux is then indeed simple enough:
{ok, FDList} = file:list_dir("/proc/self/fd"),
length(FDList)

What about other platforms, ie. Windows?

> The command "lsof -p
> <PID>" will also show you (although it lists a bunch of other things too
> and I can't remember off the top of my head how to filter just to file
> descriptors that contribute to the process's limit).
>

Would lsof (or similar mechanism) be preferable as I would get a list of
open sockets and network connections which as I understand all
contribute to the max open ports limit. Would I get all those in
the /proc/.../fd list as well?


> > Do open FDs get closed if a process dies? Perhaps that's where the leak is coming from.
> >
> In general, yes they do. However be aware that many operations in
> Erlang spawn a process, and they're not always cleaned up in the way you
> might expect (we found a particular case with long-lasting HTTP client
> connections where you have to explicitly shut them down, even if the
> process that initiated it had crashed). Basically, what you need to be
> concerned about is process leaks, not FD leaks per se.


Hm, I have embedded Yaws running, and occasionally processes terminate
when invalid requests come in. I would have thought those would have
been really dead. I'll have to keep an eye on the running system to see
if number of processes is rising.


Thanks,
Slobo

Bernard Duggan

Dec 9, 2009, 6:00:11 PM
to Slobodan Miskovic, Erlang-Questions Questions
Slobodan Miskovic wrote:
> Heh, thinking outside of the (VM) box - guess there is no query-able
> interface in Erlang for this? Linux is then indeed simple enough:
> {ok, FDList} = file:list_dir("/proc/self/fd"),
> length(FDList)
>
Huh - never noticed the 'self' symlink - that's really handy :)
As far as I know there's no specific interface in Erlang for this - I'd
expect it to either be in maybe os: or erlang:, but I can't see anything
there.

> What about other platforms, ie. Windows?
>
Sorry, I can't help you there.

> Would lsof (or similar mechanism) be preferable as I would get a list of
> open sockets and network connections which as I understand all
> contribute to the max open ports limit. Would I get all those in
> the /proc/.../fd list as well?
>
/proc/self/fd should include all file descriptors that contribute to the
process's limit - pipes, sockets, files etc. It's what we use for
monitoring exactly this issue on some of our C++ code.

> Hm, I have embedded Yaws running, and occasionally processes terminate
> when invalid requests come in. I would have thought those would have
> been really dead. I'll have to keep an eye on the running system to see
> if number of processes is rising.
>
Process leaks are the memory leaks of Erlang - they're relatively easy to
accidentally code if you're not careful.
If you're running a long-running server app I'd suggest looking at
providing an SNMP monitoring system to keep an eye on this kind of thing
- Erlang's built-in SNMP stuff is a pain in the backside to initially
set up, but once you've got it going it's trivial to add a lot of really
useful metrics.

Cheers,

B

Evans, Matthew

Dec 9, 2009, 6:09:43 PM
to Bernard Duggan, Slobodan Miskovic, Erlang-Questions Questions
To handle process leaks (especially if you are using a gen_server / gen_fsm behaviour) I normally return a timeout from my handle_call, handle_info, handle_cast etc. functions.

I have the time the process started and the time of the last event in my State record, and then have logic in the handle_info(timeout,State) function to cause the process to "self destruct" if it thinks it's been around too long.

There is a small performance overhead from starting a timer in your process, but it does trap this sort of problem.
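
The timeout approach described above might be sketched roughly as follows. This is a simplified illustration, not the poster's actual code: the module name `idle_worker`, the `ping` call, and the record fields are hypothetical, and it assumes a modern OTP release for `erlang:monotonic_time/1`. Here the process simply stops once it has been idle for `max_idle_ms`; the stored timestamps are only used for the log message.

```erlang
-module(idle_worker).
-behaviour(gen_server).

-export([start/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-record(state, {started, last_event, max_idle_ms}).

%% Start an unlinked worker that stops itself after MaxIdleMs of inactivity.
start(MaxIdleMs) ->
    gen_server:start(?MODULE, MaxIdleMs, []).

init(MaxIdleMs) ->
    Now = erlang:monotonic_time(millisecond),
    State = #state{started = Now, last_event = Now, max_idle_ms = MaxIdleMs},
    {ok, State, MaxIdleMs}.

%% Every callback returns the timeout again, so the idle timer is re-armed
%% after each event.
handle_call(ping, _From, State) ->
    {reply, pong, touch(State), State#state.max_idle_ms};
handle_call(_Other, _From, State) ->
    {reply, {error, unknown_call}, touch(State), State#state.max_idle_ms}.

handle_cast(_Msg, State) ->
    {noreply, touch(State), State#state.max_idle_ms}.

%% 'timeout' arrives only if no other message came in during max_idle_ms.
handle_info(timeout, State) ->
    Age = erlang:monotonic_time(millisecond) - State#state.started,
    error_logger:info_msg("idle worker stopping after ~p ms~n", [Age]),
    {stop, normal, State};
handle_info(_Msg, State) ->
    {noreply, touch(State), State#state.max_idle_ms}.

terminate(_Reason, _State) -> ok.

code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Record the time of the most recent event.
touch(State) ->
    State#state{last_event = erlang:monotonic_time(millisecond)}.
```

With `{ok, Pid} = idle_worker:start(30000)`, the worker answers `gen_server:call(Pid, ping)` and terminates normally once 30 seconds pass without any message.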

Regards

Matt
