You know your application best, but in general, when an
uncaughtException fires, something unexpected and unhandled has happened
and your program is now in an undefined state. Restarting your
application puts it back in a known-good state.
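A minimal sketch of that restart-on-crash stance (the supervisor that does the actual restarting — forever, monit, upstart, whatever you run — is assumed to live outside the process):

```javascript
// Log the error, then exit non-zero so an external supervisor
// restarts the process in a known-good state.
process.on('uncaughtException', function (err) {
  console.error('uncaught exception, shutting down:', err.stack || err);
  process.exit(1);
});
```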
Not sure if this answers your question, but there's cluster:
http://learnboost.github.com/cluster/
It starts up a master process and a few workers. If I'm not mistaken,
only the worker that crashed would be restarted.
--
Branko Vukelic
bra...@brankovukelic.com
bra...@herdhound.com
--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nod...@googlegroups.com
To unsubscribe from this group, send email to
nodejs+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en
Marak, that's simply untrue. As I said, it's a matter of application
design. I think you're mistaking "how you design your applications"
for "the absolute right way to design applications". Easy mistake;
people often make it.
Here's how I look at it: When an uncaughtException happens, there's
*an* exception. You don't know what it was, you don't really know how
it got there, and you don't know the state of your app. You just
don't. The only safe thing to do is to restart.
--Josh
Sounds like everyone's in violent agreement here.
If there's an error you don't handle, it's not handled, so you'd
better do your best to clean up and get out. Shutting down the
process is one reliable way to do that.
Catch the uncaughtException event and handle it the simple way you say it should be handled... Or FFS choose a language/framework that makes such assumptions for you, if you like that so much :-)
I think this is the key point, and I would like Ryan to address it in the context of his argument. If your app is designed well enough to encapsulate requests, shouldn't you be able to catch *any* exception before it bubbles all the way up to the uncaughtException handler? If you can't, doesn't that mean it's not fully encapsulated, and by extension that this exception could have borked some unknown state in your app? It seems that what people are really arguing is that if you're reasonably sure your application is still in a good state, you can continue and just take your chances.
What people are arguing here is that, once you miss an exception and it bubbles all the way up, you can't realistically have a high enough level of confidence that you *know* everything has been cleaned up. I agree with that.
Ryan, if you want to argue the latter (and I think you should, if you're confident in your argument), please do it with code. Restarting on uncaughtException IS a node best practice, until we come up with a better one. But I think I speak for a lot of people when I say that we WOULD like a better one. So let's see it.
It's more than that. Remember too that process.exit() is a sledgehammer. It disconnects whatever clients are also connected to that process. If you can gracefully exit (stop accepting connections, exit when you have no connections left) that might be a way to go, but it's just another option, and not always the right one.
(I can't control what everyone's code does, and people would rather their SMTP server keeps running.) The plugin hooks are called by the node event loop, not by the Haraka core.
I don't think it's a "best practice" - just a common one. I can see reasons for both sides of the argument. Particularly when we know that node core has had some issues before when catching uncaughtException was the only way to keep your app alive (such as the now-fixed bugs in the http client which weren't trappable with try/catch).
Best practice doesn't mean everybody does it. It means it's a common practice that lots of smart people really recommend. Not saying that you or Ryan aren't smart, but I've heard the restart thing way more than I've heard other valid alternatives. Let's not quibble over semantics here. If you've got a better idea, we're ready to hear it. Otherwise this *has* to be a best practice, or people are setting themselves up for a world of hurt.
I find it interesting that just because someone's authoritative sounding blog post happens to be ranked by Google for a given keyphrase, people tend to believe that person's approach is somehow a community standard.
you're free to continue to spell things wrong but please don't try to teach spelling to others.
-Mikeal
[request.handle]
try {
    request.handle();
    // containerization (covers 3rd-party lib cases also);
    // note: resource allocation is also wrapped in containers
    contextTracker.wrapActivity(doSomething1());
    contextTracker.wrapActivity(doSomething2());
} catch (ex) {
    contextTracker.recover();
    areAllMySystemsInGoodHealth();
    hmmmNo_OK_RestartTheSystemsReportingIssues();
    // containerized activities "destruct" and resources are released
    context.containers.cleanup();
    ifImStillNotGoodWeHaveToCallInReinforcementsButIfImGoodIJustSavedALotOfMoneyOnMyCarInsurance();
}
much later. It's not a different solution; you still have to handle the blow-up-and-restart case.
The discussion is great, I'm not knocking it, but it's not going anywhere when the actual reasons we're giving for why we recommend something aren't actually being addressed by the replies. Seriously, you can't figure out what is affected by an exception once it hits uncaughtException; people should know this. If this wasn't a problem, Ryan wouldn't be working on features in node-core to fix it :)
@Mikeal: That's very interesting. Could you *please* gist a real example where letting the error bubble up to the top totally breaks the program as you say (so as to justify a process restart), but catching it instead at a lower level would work fine?
TIA,
--
Jorge.
This is exactly my point and it feels like some folks on the other side of this debate are missing it. If it bubbles all the way up to uncaughtException, it means you didn't account for it elsewhere. If you know exactly what the scenario is, or you have all these context trackers in place, then the error should not make it all the way to uncaughtException.

For example, Matt's example about the rDNS error. If you know rDNS can throw exceptions, why aren't you catching them? If it throws errors that you can't catch, isn't that a bug?
What am I missing about this?
What those mechanisms really do in the server space is allow programmers to sandbox pieces of code (requests, sub-processes, etc.) into a self-contained context, which, if it encounters an uncaught exception, is then subsequently trashed to prevent total mayhem (ala the infamous 500 server error). The context of that piece of code (again, request, sub-process, etc.) gets cleaned up (and any resources it opened such as file handles, etc.) and the main piece of execution continues serving its purpose merrily.
Is there anything built into Node that would provide this sandboxed context which could have its own uncaught exception handler and to which any opened resources could be bound? Perhaps the developer could initiate it similar to traditional web servers' request or conversation scope concepts, and then when the context of that code has served its purpose, its resources are reclaimed, forcefully, if necessary.
Matt.
What I can offer is a simple example.
You use a database adapter.
When you initialize the adapter it creates a connection pool and a cache system. It asynchronously updates its cache via push events from the databases.
An exception happens in the database adapter.
The exception *could* just be in a single connection getting a new value, it *could* relate to only one http request/response. Or, it could be in the code that updates the cache.
If it's in the code that updates the cache the cache will no longer be updated. You have no way to tell, you didn't even write this adapter.
ry does have a proposal for a solution to problems like this.
He's calling them "domains". You create a "domain", and until you end it, anything that gets put into the event system is attached to the domain, and anything those callbacks create will also be added to the domain. Domains become the exception handlers; you don't use process.uncaughtException anymore.
You write a database adapter and create a domain around your cache code. If you get an exception you clean up and restart all the caching logic.
Your web framework creates a domain around its emit('request') so that it can give a 500 to any requests that cause exceptions.
-Mikeal
Yeah, that's true: you should not attempt to recover from unexpected obscure errors happening in code you don't understand :-)
But what does that have to do with *where* did you catch it ?
In other words, why an unknown error is ok if caught in a try/catch deep into the call stack, but not when it has bubbled all the way up to the global error handler (that is up to the process' uncaughtException handler) ?
--
Jorge.
On Sep 12, 2011, at 6:54 PM, Matt wrote:
On Mon, Sep 12, 2011 at 5:24 PM, Marco Rogers <marco....@gmail.com> wrote:
You're missing the difference between a framework that tries its best to be stable in the face of errors, vs an application that can control all of those errors. I can't control how people write plugins, but I do want to protect them from themselves. I *do* want them to catch exceptions, but sometimes they miss them (sometimes I do too!). So I want to be stable in the face of that.

Now "stable" is maybe something where you would argue that carrying on without quitting isn't going to work. But the only problem case is when you have shared state, which is something I recommend avoiding; albeit sometimes it is unavoidable, in which case I can agree that quitting may be the only safe option.

This just isn't the case. Shared state exists, period.
Totally agree. Absolutely.
--
Jorge
In node.js you most likely *do* know what is happening in a try/catch, because it can't encapsulate all that much code, since so much of it gets put off into callbacks that aren't caught by the try/catch.
Sure, you could call in to a bunch of code you don't know and you'd have the same problem but that is less likely.
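A tiny sketch of why the try/catch scope is so narrow: the callback runs on a later tick, entirely outside the block (the global handler here only exists so the sketch can show where the error actually lands):

```javascript
var caughtSync = false;
var reachedGlobal = false;

process.on('uncaughtException', function (err) {
  // This is where the async throw actually lands.
  reachedGlobal = true;
});

try {
  setTimeout(function () {
    throw new Error('async boom');
  }, 0);
} catch (e) {
  caughtSync = true; // never reached: setTimeout itself didn't throw
}
```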
I think we're also missing what may be the most common use case: errors emitted by event emitters, which only throw if unhandled. With those, you have a very good idea of what state you need to handle and what would be affected by the error if you listen for it on the emitter, but almost no idea once it throws and hits the global exception handler.
-Mikeal
I think we've seen that this is a difficult topic because it's non-trivial to show code that illustrates either side. Being on the "you need to restart" side is particularly frustrating because, as Mikeal said, it hinges on the unexpected, and it's hard to model that in an example. If you want to get frustrated because no one will admit that there's wiggle room here, fine: there's wiggle room. But we're talking about best practice. That's the framing of this discussion. If your goal is to simply destroy best practices so that all we can give newcomers is a weak "it depends", you will have a hard time doing that.
So for example, if there is an error in a subsystem that you didn't anticipate, you can tear the whole thing down and create a whole new db layer with caching and everything. It may seem drastic, but some might prefer it to restarting the whole process. Obviously, this only works if the db layer touches no outside state whatsoever. So the question becomes: how do you ensure that? It feels like Ryan's confidence comes from a position of knowing a whole lot about all the different components of his system. That's great, but not always practical. Or should I say, it takes more effort than just restarting the process. He is working with constraints that make it worth the effort (a non-trivial startup path). That totally makes sense.
But we still need to be talking about real techniques. That's how you convince people. I think my side has at least managed to convey a few concrete use cases that you could address directly with your solutions. I hope Ryan has been silent because he's working up some awesome examples.
:Marco

Honestly, I'm done trying to state that there's two sides to this and have you and Marak and Marco just state that I'm wrong. I'm old enough and ugly enough to not care about arguing.

Have a good day,
Matt.
--
I think everyone I know that is running node in production has some kind of long-running memory leak that requires semi-frequent restarts anyway.
--
After thinking more about it, I'm more interested in seeing techniques that you or Ryan use to mitigate this issue.
I'm particularly intrigued about the idea of building decoupled sub-systems that can recover from error. The more I thought about it (because I'm actually participating in this discussion and not just shooting people down), the more I feel like it aligns with an unformed theory that I was trying to discuss at node summercamp.
There's a single global exception handler, which should never be hit,
ever. All errors should be captured by a callback. If anything
throws, it's some completely unexpected weirdness. *By definition*, I
have not cleaned up the state, since *I didn't catch the error*, so
there's no "clean up and move on" option possible. The only work that
it does at that point is die in a horrible red-and-black fashion.
If your program doesn't do much, or has some reasonably reliable way
to clean up state and shut down gracefully, then sure, do that, and
*then* exit.
If it doesn't do very much, and can only crash in a few specific
ways, and you can detect those, then fine, handle those, or otherwise
guarantee that you're not dropping state on the floor, and keep on
trucking. But such programs are the minority.
Any conversation about "best practices" should be about the common
cases. I'm sure there are exceptions, but I'm willing to bet that any
given program is not an exception unless shown otherwise. That's
because I know the definition of "exception", and I'm not completely
terrible at betting.
Matt claims that Haraka is such an exception, and he seems pretty
bright and claims to be old and ugly, so probably has some experience
in such matters ;) But keeping the state clean requires a lot of
diligence, and there may well be cases where a bug in your stack
(either your code, a dependency, or node-core) makes this actually
impossible.
Qv:
On Tue, Sep 13, 2011 at 10:48, Mikeal Rogers <mikeal...@gmail.com> wrote:
> I know someone recently found a leak in the http client that was fixed.
If you don't exit on uncaughtException events, then you should
probably not just check logs, but also monitor memory usage and open
file descriptors to make sure they don't creep up.
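If you do go the don't-exit route, a crude sketch of the memory side of that monitoring (the threshold and interval are made up for illustration; file-descriptor counting is platform-specific and left out):

```javascript
// Periodically sample memory; bail out to the supervisor if it creeps up.
var LIMIT_BYTES = 1.5 * 1024 * 1024 * 1024; // arbitrary 1.5 GB cap

var monitor = setInterval(function () {
  var usage = process.memoryUsage();
  if (usage.rss > LIMIT_BYTES) {
    console.error('rss crept up to', usage.rss, '- restarting');
    process.exit(1);
  }
}, 60 * 1000);
monitor.unref(); // don't keep the process alive just to watch it
```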
On Tue, Sep 13, 2011 at 12:50 PM, Marco Rogers <marco....@gmail.com> wrote:
After thinking more about it, I'm more interested in seeing techniques that you or Ryan use to mitigate this issue.

I can't speak for Ryan, but personally I just don't. If this sort of problem happens (in my personal instance of Haraka) I just let it happen. There's nothing scary here. No shared state to worry about. There might be database access going on, but I haven't seen any problems with that yet (I use Pg, and expect it to be safe in the event of any kind of exception).

All state is maintained within the connection object, which should go out of scope when the exception is thrown, and the timer eventually fires on it, which issues a disconnect. There's no global heap of connection objects, so I don't need to delete anything else.
I may go in and check the logs from time to time, but I haven't had anything in a while.

I'm particularly intrigued about the idea of building decoupled sub-systems that can recover from error. The more I thought about it (because I'm actually participating in this discussion and not just shooting people down), the more I feel like it aligns with an unformed theory that I was trying to discuss at node summercamp.

One way to do it is something like this:

Connection.prototype.add_cleanup_handler = function (handler) {
    var already_called = false;
    var _handler = function () {
        if (already_called) return;
        already_called = true;
        handler();
    };
    this.cleanup_handlers.push(_handler);
    return _handler;
};

Then when you allocate a resource that must be cleaned up somehow (like an open filehandle):

var ws = fs.createWriteStream(path);
var close_it = connection.add_cleanup_handler(function () { ws.destroy(); });
... // do something with ws
// close it:
close_it();

And finally, in the disconnect method for the Connection you loop through and run all cleanup routines.
====

The other thing to worry about with the "tear down your process" thing is slow connections. Say you're running cluster and 1 process per CPU (the default), on a 4 CPU system. If you have a coding error and an uncaughtException is happening frequently enough that all your processes are restarting often, then you end up with a completely inaccessible system: you're waiting for the slow connections (one connected to each of the 4 processes) to finish processing, while no longer accepting connections. This may not be so much of an issue with http, where you can stick nginx in front to trickle content to those slow connections (and thus free up node to restart), but it's very much an issue with SMTP, where the communication is bidirectional and you can't proxy it.

That's obviously an extreme example, but it's a possibility if you follow this blindly as a best practice without thinking about your architecture and application first.

Matt.
--
All state is maintained within the connection object [...]

In 0.5.3+ this is true of http client and request objects. In 0.4.x it is not; there are close and error cases that are not handled well.
Why can't you proxy (and slow down) SMTP? The amount by which you can decelerate is, I presume, bound only by timeout lengths. Just like HTTP.
SMTP is chatty back and forth. You can proxy it, but it doesn't gain you anything.
I don't think I saw the beginning of this question about slow clients, but I *think* I get the gist of it.

In practice, what I believe most people are doing is setting a timeout on how long they will wait for the existing connections to finish up before restarting. If the timeout is reached, the connections are prematurely closed.
This is all assuming you're using a process pool btw. If you aren't using a process pool then this is a bit harder.
Also, SMTP should have the same issue as HTTP in regard to slow clients.
Sure, a proxy could hold the data for you but it would have to do so in memory which means the workload can't be too large.
--