Why exit on uncaughtException?

348 views

Clint H.

Sep 12, 2011, 11:14:32 AM
to nod...@googlegroups.com
The general recommendation is that Node should exit if an uncaughtException occurs (as described by Felix here http://goo.gl/YPiOl), and in the case of a server, have some kind of monitor re-start Node.

I'm not challenging this--I simply can't find a good explanation as to why this is necessary. And it seems like the kind of thing that should be part of every "NodeJS 101" ramp-up for "old school multi-threaded webapp" developers who want, but are struggling, to understand this.

To be clear, I'm mostly thinking of a webapp scenario. If an exception is thrown and bubbles all the way up, I understand that something has to fail. But why does the entire server need to die? Does the uncaught exception affect the entire event loop or something?

Thanks for the patience!

Ben Noordhuis

Sep 12, 2011, 11:33:15 AM
to nod...@googlegroups.com

You know your application best but in general, when an
uncaughtException fires, something unexpected and unhandled happened
and your program is now in an undefined state. Restarting your
application puts it back in a known-good state.
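A tiny sketch (an invented example, not from Ben's post) of what "undefined state" can mean in practice: a throw mid-operation skips cleanup code, and module-level bookkeeping that every future request depends on is now permanently wrong.

```javascript
// Shared, module-level state that all requests depend on.
let inFlight = 0;

function handle(work) {
  inFlight++;
  work();     // if this throws, the decrement below is skipped...
  inFlight--; // ...and inFlight stays wrong forever
}

try {
  handle(function () { throw new Error('boom'); });
} catch (e) {
  // Even though we caught the exception up here, the damage is done:
  // inFlight is now 1 with nothing actually running.
}
```

Restarting the process is the one sure way to get `inFlight` back to a known-good zero.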

Mike Coolin

Sep 12, 2011, 11:44:57 AM
to nod...@googlegroups.com
I agree with Clint. An error for a given user should not take out the entire server for all users.
One of the benefits of real threads is the ability for a given thread to fail while allowing the server to live on.

Clint H.

Sep 12, 2011, 11:55:11 AM
to nod...@googlegroups.com
It might be helpful if I started an example that someone else can use to explain things...

Let's say I have a Node-based webapp with a function that handles requests to /doSomething. This function calls foo.bar(). Let's say that one day, for whatever reason, foo is undefined, causing an uncaught exception.

Per NodeJS best practice, my uncaughtException handler should cause Node to exit, thus affecting all clients of my webapp. With a more common, multi-threaded webapp container, however, only the thread servicing the /doSomething request would die; not the entire webapp.

Why is the NodeJS event loop not capable of working in a similar fashion?

Again, I'm not challenging Node itself or the best practice. But apparently there's something very fundamental about how the single-threaded event loop paradigm works which I just don't understand (and which other new-to-Node developers will also need to learn early on).

Thanks

Branko Vukelić

Sep 12, 2011, 12:12:06 PM
to nod...@googlegroups.com
On 2011-09-12 08:55 -0700, Clint H. wrote:
> Why is the NodeJS event loop not capable of working in a similar fashion?

Not sure if this answers your question, but there's cluster:

http://learnboost.github.com/cluster/

It starts up a master process, and a few workers. If I'm not mistaken,
only the worker would be restarted if it crashed.

--
Branko Vukelic
bra...@brankovukelic.com
bra...@herdhound.com

IDEA MACHINE
www.brankovukelic.com

Lead Developer
Herd Hound (tm) - Travel that doesn't bite
www.herdhound.com

Love coffee? You might love Loveffee, too.
loveffee.appspot.com

Clint H.

Sep 12, 2011, 12:15:29 PM
to nod...@googlegroups.com
Wondering if the reasoning here is the same as why you would use a "try/catch/finally" clause in a language like Java. For example:

try {
  // Open a database connection
}
catch (Exception e) {
  // do something...maybe log the exception
}
finally {
  // Always, always try to close the connection
}

Is the issue that with event-driven callbacks you might not be able to have a "finally" mechanism that ensures an attempt to close the connection?
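Essentially, yes. A small JavaScript sketch (invented for illustration) shows the ordering problem: a finally block around code that merely *schedules* async work runs on the current tick, as soon as scheduling completes, while the callback (where the exception would actually be thrown) runs on a later tick, outside the try/finally entirely.

```javascript
const order = [];

function doAsyncWork(cb) {
  try {
    setImmediate(cb);        // only *schedules* the work
    order.push('scheduled');
  } finally {
    order.push('finally');   // runs now, before cb ever executes, so it
                             // cannot act as an "always close" guard for
                             // resources the callback is still using
  }
}

doAsyncWork(function () {
  order.push('callback');    // a throw here would bypass the finally above
});
```

After the current tick, `order` is `['scheduled', 'finally']`; `'callback'` is only appended on a later iteration of the event loop.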

Ryan Gahl

Sep 12, 2011, 12:24:38 PM
to nodejs
Killing the server on an uncaughtException is not a "NodeJS best
practice"... it's simply how some people (including Felix) have
decided to handle the situation.

This is about application design, not best practices. It is on you to
containerize your per-request (i.e. app specific, per user) code in a
manner that isolates resources sufficiently from other requests. If
you do that, killing the server is not needed at all. There is nothing
about the node loop that changes how easy or difficult this is to
accomplish. That is strictly determined by your application design. If
you decide it is too much work or risk to a) avoid sharing state
between requests and b) containerize requests in such a manner that
you can safely clean up resources upon exception and move on, then
building a nanny process that crashes and spins up new servers might
be a way to go. It is by no means a "NodeJS best practice" though. In
our case it would actually be WAY more complicated and difficult to go
that route because our startup process is very significant and we'd
have to deal with many intricate issues. It's much easier for us to
isolate the per-request app-specific resources in a manner that allows
us to "detox" that request and move along after an uncaught exception.
We make sure we log and issue alerts for each exception, of course, so
that we know where to look to fix the issue. Many uncaught exceptions
can occur because of something as benign (as far as "the system" is
concerned) as forgetting to validate user input in a certain way
and some user enters text with a newline or apostrophe character (for
a contrived example). Do you really want every instance of that class
of exception to bring down your server? Maybe you do. We don't.

Felix's example in his blog post actually has absolutely nothing to do
with "understanding the inner workings of fs.ReadStream". It's simply
that you must understand that when an uncaught exception occurs
_in_javascript_, no more lines of code within _that_ execution context
will be executed. Node just so happens to wrap each loop iteration
within a master try/catch block, emits an event if any exception gets
bubbled out to that level and then either crashes if you don't handle
that event or executes your handler(s) and moves on to the next
iteration.
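That mechanism can be seen directly. A minimal sketch (the handler body and messages are invented): with an 'uncaughtException' listener installed, a throw that bubbles out of an async callback runs the handler instead of crashing the process, and later loop iterations still execute.

```javascript
const events = [];

process.on('uncaughtException', function (err) {
  events.push('caught: ' + err.message); // handler runs instead of a crash
});

setTimeout(function () {
  throw new Error('boom'); // bubbles out of this tick's execution context
}, 0);

setTimeout(function () {
  events.push('next iteration still runs'); // the loop carries on afterwards
}, 20);
```

Without the listener, the first timer's throw would terminate the process and the second timer would never fire.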

Ryan Gahl

Sep 12, 2011, 12:52:57 PM
to nodejs
Just one little note I want to add. I find it interesting that just
because someone's authoritative sounding blog post happens to be
ranked by Google for a given keyphrase, people tend to believe that
person's approach is somehow a community standard. I don't mean to
take anything away from Felix specifically, but it's a trend I've
noticed on the list. IMHO node is still way too new to have many (if
any) design standards. NPM is a de facto community standard, and most
new people tend to gravitate toward Express for a web framework, but
other than that there are not really very many (if any) standards and
even less in the area of "best practices". There are generally Good
Things You Should Be Doing that are not node-specific, but as far as
I'm concerned node itself isn't adding many unique concerns. It's
often more just understanding how javascript works.
</rant>

Clint H.

Sep 12, 2011, 1:14:06 PM
to nod...@googlegroups.com
Thanks Ryan--extremely helpful.

So to summarize, use common sense. One must ensure that an uncaught exception doesn't have a lasting effect (e.g., leave the app in a state that causes problems for other requests). For example, maybe your webapp has a limit on the number of DB connections it can open. Find an app design that ensures a connection is closed at the end of handling a request, even if an exception is thrown.

On another note: I would add that, to a newbie like myself, Felix seems to be rather prominent in the Node community; that, combined with a few other things I read saying the same thing, led me down the path of assuming "exit on uncaughtException" was the "general recommendation" (as I put it earlier). Anyways, thanks for your patience and willingness to clarify this.

Ryan Gahl

Sep 12, 2011, 1:29:04 PM
to nodejs
Yeah, I wanted to be careful there as Felix is definitely a smart dude
and you wouldn't by any means be doing something wrong to follow his
advice here if it fits your workflow and app design. It's just that
it's not really something that (IMHO) should be viewed as the
categorical "right way to do things in node".

Marak Squires

Sep 12, 2011, 1:35:44 PM
to nod...@googlegroups.com
This is straightforward:

"If your application throws an error which bubbles up to the uncaughtException handler, your application is now in an indeterminate state ( period ). Continuing to run your application in an indeterminate state can yield unpredictable results ( period )."

I don't see where there is room for interpretation here. 

--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nod...@googlegroups.com
To unsubscribe from this group, send email to
nodejs+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en

Ryan Gahl

Sep 12, 2011, 1:46:45 PM
to nodejs
Marak, that's simply untrue. As I said, it's a matter of application
design. I think you're mistaking "how you design your applications"
with "the absolute right way to design applications". Easy mistake,
people often make it.




On Sep 12, 12:35 pm, Marak Squires <marak.squi...@gmail.com> wrote:
> This is straight forward:
>
> "If your application throws an error which bubbles up to the
> uncaughtException handler, your application is now in an indeterminate state
> ( period ). Continuing to run your application in an indeterminate state can
> yield unpredictable results ( period )."
>
> I don't see where there is room for interpretation here.
>
> On Mon, Sep 12, 2011 at 8:14 AM, Clint H. <clinthhar...@gmail.com> wrote:
> > The general recommendation is that Node should exit if an uncaughtException
> > occurs (as described by Felix here http://goo.gl/YPiOl), and in the case

Ryan Gahl

Sep 12, 2011, 1:53:08 PM
to nodejs
server.start(); // uncaughtException here === major issue; crash and
hunt down the bug

handleRequest(); // uncaughtException here === only a server-wide
issue if your app's design dictates that to be the case. Isolating
resources consumed by the request mitigates that risk. One request
should not care what happened in another request. Designing your app
to handle these situations gracefully, without having to resort to
killing and re-spawning the server, is not only possible but in many
cases easier.
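One way to read that distinction is the following sketch (a hypothetical helper, not Ryan's actual code): keep each request's resources in one per-request object and wrap the handler, so a failure is contained and cleaned up without touching other requests. The big caveat driving this whole thread: a plain try/catch like this only contains *synchronous* throws; an exception raised later, inside an async callback, still escapes to uncaughtException.

```javascript
// Run one request's handler with isolated, always-cleaned-up resources.
function runRequest(handler) {
  const resources = { closed: false }; // per-request, shared with no one
  try {
    return { ok: true, value: handler(resources) };
  } catch (err) {
    return { ok: false, error: err.message }; // contain the failure
  } finally {
    resources.closed = true; // cleanup runs for success and sync throws alike
  }
}

// runRequest(function () { return 42; })             -> { ok: true, value: 42 }
// runRequest(function () { throw new Error('no'); }) -> { ok: false, error: 'no' }
```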

Matt

Sep 12, 2011, 2:00:35 PM
to nod...@googlegroups.com
I call bullshit. What's hopefully more likely in a well designed network application is that everything associated with that connection is now in an indeterminate state, and thus that connection (and all the associated state) should be closed and cleaned up.

Marak Squires

Sep 12, 2011, 2:01:28 PM
to nod...@googlegroups.com
Marak, that's simply untrue. As I said, it's a matter of application
design. I think you're mistaking "how you design your applications"
with "the absolute right way to design applications". Easy mistake,
people often make it.

That has nothing to do with my statement, at all. This is computer science, not personal preference.

Just because you've designed your personal application to not melt when it's in an indeterminate state doesn't mean it's still not in an indeterminate state.

If you don't believe me, listen to Ben and Felix, they aren't exactly beginners when it comes to this sort of thing.

- Marak

Joshua Holbrook

Sep 12, 2011, 2:01:56 PM
to nod...@googlegroups.com
I think the thing you're forgetting, Ryan, is that these are
*uncaught* exceptions. If you know how to deal with them you should
deal with them as they happen, either in a callback or with a
try/catch.

Here's how I look at it: When an uncaughtException happens, there's
*an* exception. You don't know what it was, you don't really know how
it got there, and you don't know the state of your app. You just
don't. The only safe thing to do is to restart.

--Josh

Isaac Schlueter

Sep 12, 2011, 2:02:12 PM
to nod...@googlegroups.com
On Mon, Sep 12, 2011 at 11:00, Matt <hel...@gmail.com> wrote:
> I call bullshit. What's hopefully more likely in a well designed network
> application is that everything associated with that connection is now in an
> indeterminate state, and thus that connection (and all the associated state)
> should be closed and cleaned up.

Sounds like everyone's in violent agreement here.

If there's an error you don't handle, it's not handled, so you'd
better do your best to clean up and get out. Shutting down the
process is one reliable way to do that.

Matt

Sep 12, 2011, 2:06:55 PM
to nod...@googlegroups.com
Yes, but the disagreement is over whether this is the only thing you can do. I think in a lot of cases it's best to just close that connection[*] and let others carry on as normal.

[*] It's important though to realise that you can't access the connection/client from the uncaughtException handler, so you have to have a timer fire to close it (and associated resources) at some determined time in the future.

Matt.
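Matt's timer idea might look something like this (all names invented): keep a registry of live connections, and since the uncaughtException handler has no reference to the failing one, sweep away idle connections at some later time.

```javascript
const connections = new Set();

function register(conn) {
  conn.lastActive = Date.now();
  connections.add(conn);
}

// Destroy and forget any connection idle longer than maxIdleMs.
// An uncaughtException handler would schedule this with setTimeout,
// since it can't identify the specific connection that blew up.
function sweepIdle(now, maxIdleMs) {
  connections.forEach(function (conn) {
    if (now - conn.lastActive > maxIdleMs) {
      conn.destroy();
      connections.delete(conn);
    }
  });
}
```

The connection that stopped updating `lastActive` when its handler threw eventually goes idle and gets destroyed, along with whatever resources it holds, while healthy connections survive the sweep.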

Ryan Gahl

Sep 12, 2011, 2:22:46 PM
to nodejs
Let me try this again. The following are factual statements.

1. Doing the whole "kill everything and respawn everything" is _a_way_
to handle the situation.

2. It is possible to design things such that it is not required to do
#1.

3. In many cases #2 is easier than #1 (and vice versa).


btw, Marak, you don't know much about me, and this is _not_ meant to
come off arrogant, but I am not exactly a beginner in this area
either. Your style tends to come off as pretty hostile towards people
that have perspectives that don't match yours, especially towards
people that don't make it a priority to be "seen" a lot in the
community. Please just trust me when I say it's possible to design
things in the manner in which I have described. I know this to be
true, because we have designed such a system... and it works. This is
not a matter of opinion or personal preference, it's a fact. In fact
many "computer scientists" have built similar systems in various
underlying technologies. Don't argue with me that our design
approaches don't work when a) you know nothing of our designs and b) I
come to work every day and witness evidence to the contrary.

or in other words... get off your high horse.

Marco Rogers

Sep 12, 2011, 2:24:48 PM
to nod...@googlegroups.com
I think this is the key point, and I would like Ryan to address this in the context of his argument. If your app is designed well enough to encapsulate requests, shouldn't you be able to catch *any* exception before it bubbles all the way up to the uncaughtException handler? If you can't, doesn't that mean it's not fully encapsulated and, by extension, that this exception could have borked some unknown state in your app? It seems that what people are really arguing is that if you're reasonably sure that your application is still in a good state, you can continue and just take your chances.

I actually agree with this, theoretically. I argued this same thing at Node Summercamp last week. It really sucks to have to restart your entire process because of one exception. Node should be able to do better. And theoretically, it can. But right now, it doesn't. Felix's example is simple and very illustrative. Handling of fs/net streams is asynchronous. If some earlier event handlers fail, some later ones like "end" may not fire. How can you ensure that all of your "cleanup" code will fire? Once you get down to the core node constructs that the application has little control over, you lose a level of confidence in what's happening with your application.

What people are arguing here is that, once you miss an exception and it bubbles all the way up, you can't realistically have a high enough level of confidence that you *know* everything has been cleaned up. I agree with that.

Ryan, if you want to argue the latter (and I think you should if you're confident in your argument), please do it with code. Restarting on uncaughtException IS a node best practice, until we come up with a better one. But I think I speak for a lot of people when I say that we WOULD like a better one. So let's see it.

:Marco

Richard Miller-Smith

Sep 12, 2011, 2:30:45 PM
to nodejs
Surely the discussion is just about what 'handled' means. I pretty
much agree with Ryan that a good design, where you can contextualise
callbacks and events, allows you to carefully prune away any dying or
defiant objects - even if you're not totally sure what the error is or
what caused it.

On the other hand if something does happen completely out of the blue
and can't be matched to an entity in the system what else could node
do but stop and tell you something bad happened?

Richard

Marak Squires

Sep 12, 2011, 2:34:34 PM
to nod...@googlegroups.com
I made a very clear and factual statement. I made no assertions about you or anything you have done.

You seem to have interpreted my statement suggesting you refer to others' judgement as a personal insult to your experience. This is not the case.

The fact you think a personal attack is warranted for someone countering your point of view is irrational, and unappreciated. 

Elijah Insua

Sep 12, 2011, 2:41:16 PM
to nod...@googlegroups.com
Things we've learned from this thread:

1) catch and handle your errors
2) if you decide to let node's global uncaught exception handler catch errors, you probably should restart your process.

-- Elijah

Nicolas Chambrier

Sep 12, 2011, 2:41:40 PM
to nod...@googlegroups.com

Catch the uncaughtException event and handle it the simple way you say it should be handled... Or FFS choose a language/framework that makes such assumptions for you if you like that so much :-)

Matt

Sep 12, 2011, 3:10:37 PM
to nod...@googlegroups.com
On Mon, Sep 12, 2011 at 2:24 PM, Marco Rogers <marco....@gmail.com> wrote:
I think this is the key point, and I would like Ryan to address this in the context of his argument. If your app is designed well enough to encapsulate requests, shouldn't you be able to catch *any* exception before it bubbles all the way up to the uncaughtException handler? If you can't, doesn't that mean it's not fully encapsulated and, by extension, that this exception could have borked some unknown state in your app? It seems that what people are really arguing is that if you're reasonably sure that your application is still in a good state, you can continue and just take your chances.

It's more than that. Remember too that process.exit() is a sledgehammer. It disconnects whatever clients are also connected to that process. If you can gracefully exit (stop accepting connections, exit when you have no connections left) that might be a way to go, but it's just another option, and not always the right one.

So in Haraka we can't be sure to catch all exceptions, because some may happen in plugins' custom callbacks (yes they should probably have code to catch the error, but I can't control what everyone's code does, and people would rather their SMTP server keeps running), which are called by the node event loop, not by the Haraka core. So the uncaughtException handler spews a stack trace, but it only exits if Haraka isn't properly started yet.

What people are arguing here is that, once you miss an exception and it bubbles all the way up, you can't realistically have a high enough level of confidence that you *know* everything has been cleaned up. I agree with that.

It's true, you can't guarantee that (mostly due to JS's lack of destructors, which is really annoying). A typical example might be files left open.
 
Ryan, if you want to argue the latter (and I think you should if you're confident in your argument). Please do it with code. Restarting on uncaughtException IS a node best practice, until we come up with a better one. But I think I speak for a lot of people when I say that we WOULD like a better one. So let's see it.

I don't think it's a "best practice" - just a common one. I can see reasons for both sides of the argument. Particularly when we know that node core has had some issues before when catching uncaughtException was the only way to keep your app alive (such as the now-fixed bugs in the http client which weren't trappable with try/catch).

Matt.

Ryan Gahl

Sep 12, 2011, 3:34:49 PM
to nodejs
Ah, yes, ok... here are some high level notes... sorry I don't have
code I can just extract easily to show you, but I'm hoping the
following discussion can at least let you see that yes it is
possible.

It's context driven design, really. The first thing to get is that
there is nothing magical about adding an explicit try/catch that
somehow removes a layer of having to deal with cleaning up things when
something goes amiss. Wrapping 3 lines of code in a try/catch vs. 1000
lines of code in a try/catch merely narrows it down a bit. Even with
the 3 line try/catch the best you can do is make an assumption in your
handling code as to the nature of the exception, and because there
were only 3 lines of code you feel more warm and fuzzy with continuing
on. Why? Because you know more about the context of the exception.
Your confidence grows even higher if you've seen this exact exception
a number of times ("oh nevermind that error, that just means the flay
rods haven't been calibrated yet").... the context is clearer. The
node loop iteration is just that 1000+ line try/catch block. The
exceptions are actually being "caught" but most people feel pretty
leery about continuing on because they know much less about the
context of the exception. AND THAT'S FINE. I have maintained
throughout this thread that the restarting option is definitely A WAY
to go. But for some systems that restarting logic is going to be much
more costly to author and troublesome to maintain, and it is possible
to design a system that can detect context with a satisfactory level
of confidence and 'right the wrongs' quickly and move on.

If I have a mechanism that lets me know the context an exception
occurred in, I can make some pretty good decisions about how to handle
it. If all of my subsystems and resources are isolated AND I know the
context, I can make even better decisions. If the cost of restarting
everything is greater than the cost of making a decision based on
context it's a simple math equation to figure that the latter is the
better course. And when I say that, I mean _for_some_applications_ and
_if_ you have such mechanisms.

Another tool is a subsystem health check. If I have information about
the context an exception occurred in and I have a means to query my
subsystems to make sure things are ok, man oh man can I make some good
decisions then.

So if I have a mechanism that lets me reliably track context to my
satisfaction, and I have isolated subsystems and resources, and I have
a means to reliably query my subsystems for issues, it should be
fairly easy to imagine that it's within the realm of possibility that
all those things can be combined in a way that makes me feel more warm
and fuzzy about letting the next loop iteration happen. It probably
makes me feel warmer and fuzzier than your level of confidence that
that same uncaught exception that just caused a restart of all your
systems won't happen again in another .025 seconds after the restart
in a devastating server restart loop. If I run into that same bug in
my system at least I have a chance to successfully handle a % of
requests. Similarly bad for the former design is when someone
maliciously bangs against that bug that would cause a restart. "Oh
noess!!!"

Yes, subsystems should be designed so that they can be restarted
easily. At the top level of the stack resilience is often more
important.

Marco Rogers

Sep 12, 2011, 3:45:01 PM
to nod...@googlegroups.com

It's more than that. Remember too that process.exit() is a sledgehammer. It disconnects whatever clients are also connected to that process. If you can gracefully exit (stop accepting connections, exit when you have no connections left) that might be a way to go, but it's just another option, and not always the right one.

I'm not sure what your argument is here. People should definitely be trying to do a graceful exit. I don't think that's under debate. I'm saying that ideally you would be able to determine whether you needed to exit or not. Then I'm arguing that with the current state of node, that's not really possible.
 
can't control what everyone's code does, and people would rather their SMTP server keeps running), which are called by the node event loop, not by the Haraka core.

This is the crux of the issue right? The way node is currently structured, you don't control every piece of code that is executed on the event loop. Modules are pretty autonomous and can add anything they like. It's never clear whether they are doing proper error handling. And you can't just try/catch everything like you can in a synchronous language. Essentially you need a way to "catch" a reference to any asynchronous callback so that you can start to reason about the full state of your app. Ry's new "domains" proposal is seeking to help with that. I'm pretty excited about it.
 
I don't think it's a "best practice" - just a common one. I can see reasons for both sides of the argument. Particularly when we know that node core has had some issues before when catching uncaughtException was the only way to keep your app alive (such as the now-fixed bugs in the http client which weren't trappable with try/catch).

Best practice doesn't mean everybody does it. It means it's a common practice that lots of smart people really recommend. Not saying that you or Ryan aren't smart. But I've heard the restart thing way more than I've heard other valid alternatives. Let's not quibble over semantics here. If you got a better idea, we're ready to hear it. Otherwise this *has* to be a best practice or people are setting themselves up for a world of hurt.

:Marco

--
Marco Rogers
marco....@gmail.com | https://twitter.com/polotek

Life is ten percent what happens to you and ninety percent how you respond to it.
- Lou Holtz

Matt

Sep 12, 2011, 3:59:00 PM
to nod...@googlegroups.com
On Mon, Sep 12, 2011 at 3:45 PM, Marco Rogers <marco....@gmail.com> wrote:

I don't think it's a "best practice" - just a common one. I can see reasons for both sides of the argument. Particularly when we know that node core has had some issues before when catching uncaughtException was the only way to keep your app alive (such as the now-fixed bugs in the http client which weren't trappable with try/catch).

Best practice doesn't mean everybody does it. It means it's a common practice that lots of smart people really recommend. Not saying that you or Ryan aren't smart. But I've heard the restart thing way more than I've heard other valid alternatives. Let's not quibble over semantics here. If you got a better idea, we're ready to hear it. Otherwise this *has* to be a best practice or people are setting themselves up for a world of hurt.

What I'm suggesting is that the world of hurt is very much context dependent. In an SMTP example you might have an exception thrown on the return from a rDNS query. At that point in time of the SMTP connection it's likely the worst thing that's going to happen is you carry on and nothing bad occurs. You've got no resources allocated, your timer closes the connection and the resources associated with it get garbage collected. Your application continues and nothing bad has happened.

A worse case scenario is you left a filehandle open, and this causes a filehandle leak. There's a way to deal with this: register a disconnect hook and clean up any resources you might have allocated. This gets called after the exception has been thrown and logged, when the connection times out, and cleans everything up.

None of this is rocket science, nor any less of a best practice than "just exiting".

Matt.

DARA

Sep 12, 2011, 4:01:10 PM
to nodejs
You will never get consensus on this issue because it reflects your
chosen system safety strategy, and the chosen risk mitigation strategy
can legitimately go either way.

Especially in safety-critical systems where there are multiple
redundant components, the "fail hard" strategy is attractive (but
it is still only a heuristic): bad components go immediately offline
to prevent potential interference with good ones. The restart-Node-
upon-unexpected-exception strategy falls in this category. It also
has the obvious advantage of being really simple.

OTOH, the alternative is graceful degradation to prevent potentially
severe side effects on the environment in which the system is
embedded. The big decision point comes from deciding, in an uncertain
context, whether the risk is that the system is too flawed to continue
or whether it is good enough, and maybe essential enough, to continue.
Remember, if the code still runs you are not in an entirely indeterminate
state; only some parts may be uncertain.

It can be argued that much of the split in opinions in this thread reflects
the relative level of confidence in the ability to decide how much can
be trusted in order to make the continue/abort/restart decision.

But this being one of the more philosophical topics in the field…can
be argued otherwise too.

Rob

On Sep 12, 12:10 pm, Matt <hel...@gmail.com> wrote:
> On Mon, Sep 12, 2011 at 2:24 PM, Marco Rogers <marco.rog...@gmail.com> wrote:
>
...

Marak Squires

Sep 12, 2011, 4:27:37 PM
to nod...@googlegroups.com
I find it interesting that just because someone's authoritative sounding blog post happens to be ranked by Google for a given keyphrase, people tend to believe that person's approach is somehow a community standard.

I don't always agree with Felix, but he is a member of the node.js core contributors team, and has pushed 112 commits to core since September 20th of 2009. 

He is not just someone who "happens to be ranked by google for a given keyphrase".



Ryan Gahl

Sep 12, 2011, 4:39:03 PM
to nodejs
:) - you kind of illustrated the whole "context" thing there by taking
that one line out of it.

The rest of the context around that 1 sentence you pulled out had more
to do with me saying something about a trend I noticed (and that I
found it interesting). I also made sure I stated that I wasn't trying
to take anything away from Felix, and that his advice is not wrong in
any way.

I don't know guys, it's starting to feel kind of like Marak is
trolling me in this thread. Should we ban him?

Ryan Gahl

Sep 12, 2011, 4:42:28 PM
to nodejs
I take it back, that was pure snark.

Marak Squires

Sep 12, 2011, 4:45:03 PM
to nod...@googlegroups.com
It's really okay Ryan, I don't care.

I was trying to help you. It's obvious you don't consider my input valid, so I won't bother trying to give it to you anymore.

Marco Rogers

Sep 12, 2011, 4:55:01 PM
to nod...@googlegroups.com
Haha. There are only a few instances that I can remember where I've seen someone who is active and confident in their use of node be swayed to a completely different opinion simply by conversing on this mailing list. I don't think it has anything to do with how much people respect other people's opinions. Words can only go so far to sway people like Ryan, Felix and Marak. Working code does much better.

Mikeal Rogers

Sep 12, 2011, 4:57:35 PM
to nod...@googlegroups.com
I don't know how this thread got so hostile but I'll chime in now about the actual problem being discussed.

uncaughtException !== "your program had an error".

uncaughtException means that something you did not anticipate failing has failed.

at this time, there is very little you can figure out about the state of your application on uncaughtException. your ability to inspect the "current" state of the application and more importantly how many other stacks might share state with the one that just failed is quite minimal. this is being worked on and hopefully will be improved in the future.

but, at this time, it is most certainly a best practice to begin restarting your process on uncaughtException. By that I do not mean "cut all current connections and kill the server"; rather, you should stop accepting new connections and, when the current ones are done, restart the server.

your node.js application is a single process with an unknown amount of mutable state shared between stacks. an error in one stack may mean that the rest of the process is in a bad state and, most importantly, there REALLY IS NOT a good way to figure it out. your claims that you "know what your application is doing" are a little presumptuous because they assume the stack that failed is even in code that you wrote.

here's a good example. you have a database adapter. that database adapter has a cache. you get an uncaughtException. *if* you can figure out that the exception relates to the db adapter, you still have no idea whether the exception affects the cache control, and it could very likely mean that you don't update your cache ever again.
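A hypothetical sketch of that cache scenario (the cache shape, the `refreshing` flag, and the callback are invented for illustration, not taken from any real adapter) shows how the process can "survive" while shared state is silently stranded:

```javascript
// Hypothetical db-adapter cache: a flag guards refreshes. If the
// callback throws after the flag is set but before it is cleared,
// the uncaughtException handler has no way to know the flag is
// stuck -- so the cache is never refreshed again.
var cache = { data: null, refreshing: false };

function refreshCache(fetchRow) {
  cache.refreshing = true;            // shared mutable state
  fetchRow(function (err, row) {
    cache.data = JSON.parse(row);     // throws on bad input...
    cache.refreshing = false;         // ...so this line never runs
  });
}

process.on('uncaughtException', function (err) {
  // We know *something* failed, but not which state it touched.
  console.error('caught:', err.message);
});

refreshCache(function (cb) {
  setImmediate(function () { cb(null, 'not json'); });
});
```

After the exception, `cache.refreshing` is stuck at `true` and nothing in the handler can discover that.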

you really don't know what is going on when you get an uncaughtException; really, you don't. if you did, you would have trapped it in a try/catch or on an "error" listener.

Ryan's proposal for "domains" goes a long way toward making exceptions like this a lot clearer, and we may get to a place where we can know what is most likely affected by a bubbled exception. but for the time being, start tearing down the process.

-Mikeal

Marco Rogers

unread,
Sep 12, 2011, 5:24:53 PM9/12/11
to nod...@googlegroups.com
This is exactly my point and it feels like some folks on the other side of this debate are missing it. If it bubbles all the way up to uncaughtException, it means you didn't account for it elsewhere. If you know exactly what the scenario is, or you have all these context trackers in place, then the error should not make it all the way to uncaughtException.

For example, Matt's example about the rDNS error. If you know rDNS can throw exceptions, why aren't you catching them? If it throws errors that you can't catch, isn't that a bug?

What am I missing about this? I think you can do lots of things to increase your level of confidence about the state of your app. But that also serves to greatly reduce the need for uncaughtException. Maybe what you should really be arguing is that you're confident that uncaughtException should not be reached. Maybe you should be giving us more best practices that will help avoid uncaughtException.

For what it's worth, I think this is a very important topic and I hope the discussion continues. I want to hear opposing views because, while I agree that restart seems the safest option, I also agree that it sucks.

This is one of those difficult topics because, if you're going against the accepted wisdom, you need to be ready to defend your position. You can't simply state that you don't agree and that others shouldn't accept the conventional wisdom. That's how things should work. Best practices can always be challenged.

But the point of them is that folks who are still learning and need a way forward with certain parts of their system can follow them and be reasonably confident. They are important and they should be properly debunked with real solutions rather than anecdotes about personal experience. Because the fact is that best practices are usually formed from lots of other people's personal experience. It shouldn't be easily dismissed.

:Marco

Marak Squires

unread,
Sep 12, 2011, 5:39:05 PM9/12/11
to nod...@googlegroups.com
Yes, this exactly. I couldn't have worded it better myself. 

On Mon, Sep 12, 2011 at 1:57 PM, Mikeal Rogers <mikeal...@gmail.com> wrote:

Marcel Laverdet

unread,
Sep 12, 2011, 5:52:38 PM9/12/11
to nod...@googlegroups.com
Just my two cents, I feel that Node's uncaught exception behavior is totally reasonable and expected. Trying to ignore exceptions automagically could lead to any number of evils like hanging connections, botched DB transactions, corrupt files, or worse. Node isn't exclusively a multi-user web server and making assumptions that it is would poison the platform for all. I think if you want assumptions about the platform to be made for you, you should be using one of the many great frameworks out there.

Sorry if I repeated any arguments made already, this thread is colossal so I just wanted to get in and out since it seems like the discussion is still going. Hope I'm not beating any dead horses.

Ryan Gahl

unread,
Sep 12, 2011, 6:08:15 PM9/12/11
to nodejs


On Sep 12, 4:24 pm, Marco Rogers <marco.rog...@gmail.com> wrote:
> This is exactly my point and it feels like some folks on the other side of
> this debate are missing it. If it bubbles all the way up to
> uncaughtException, it means you didn't account for it elsewhere. If you know
> exactly what the scenario is, or you have all these context trackers in
> place, then the error should not make it all the way to uncaughtException.

I get that, and that is fine for how a lot of people design their
systems. But the uncaughtException mechanism IS, in actuality, a caught
exception. Node caught it, and even has a stack trace to send along
with the event in the vast majority of cases (excepting primarily
those cases where code explicitly throws something that is not an
instance of Error). That's a very fine level at which to add
instrumentation (heuristics, if you will) to set about dealing with the
issue. I'm beginning to think the major differences of opinion in this
thread stem simply from different design goals and scale of
projects (i.e. mindset).

>
> For example, Matt's example about the rDNS error. If you know rDNS can throw
> exceptions, why aren't you catching them? If it throws errors that you can't
> catch, isn't that a bug?

Sure, it's probably a bug. Maybe your system design dictates that all
such bugs require a full restart. Mine doesn't.
Go back to the case where you forgot to add input validation code and
forgot to wrap each relevant piece in an explicit try/catch (or your
design removes the need to do so), and a single user submits
something that makes that one chunk of code choke.

>
> What am I missing about this? I think you can do lots of things to increase
> your level of confidence about the state of your app. But that also serves
> to greatly reduce the need for uncaughtException. Maybe what you should
> really be arguing is that you're confident that uncaughtException should not
> be reached. Maybe you should be giving us more best practices that will help
> avoid uncaughtException.
>
> For what it's worth, I think this is a very important topic and I hope the
> discussion continues. I want to hear opposing views because, while I agree
> that restart seems the safest option, I also agree that it sucks. This is
> one of those difficult topics because, if you're going against the accepted
> wisdom, you need to be ready to defend your position. You can't simply state
> that you don't agree and that others shouldn't accept the conventional
> wisdom. That's how things should work. Best practices can always be
> challenged. But the point of them is that folks who are still learning and
> need a way forward with certain parts of their system can follow them and be
> reasonably confident. They are important and they should be properly
> debunked with real solutions rather than anecdotes about personal
> experience. Because the fact is that best practices are usually formed from
> lots of other people's personal experience. It shouldn't be easily
> dismissed.

I think the gist of this discussion is this... you say potato and I
say potato (those are pronounced differently).

Where you might do this (where the try/catch is meant to represent
both in-line, i.e. synchronous, error handling and listening for
'error'):

(pseudocode):

try {
  handleRequest();
  try {
    doSomething();
  } catch (someMicroException) {
    handleMicroExceptionGracefully();
  }
} catch (uncaughtRequestException) {
  logger.logRequestException();
}
server.on("uncaughtException", function() {
  blowUp();
});


...I do this...

contextTracker.wrap(handleRequest());
  contextTracker.wrap(doSomething());
server.on("uncaughtException", function() {
  try {
    contextTracker.recover();
  } catch (OMFGException) {
    blowUp();
  }
});

>
> :Marco
> marco.rog...@gmail.com |https://twitter.com/polotek

Ryan Gahl

unread,
Sep 12, 2011, 6:10:22 PM9/12/11
to nodejs
me (staring at blue sky): the sky is blue
some people (it's overcast by them): nope, sky's gray
me: not over here... sky's blue as can be... looking right at it
some people: uhhh, dude... um, yeah.... I can't understand how you'd
say that. Because like, the sky's actually gray. Are you sure you're
not looking at the ground? The ground is definitely blue.
me: sigh... yeah I guess I must be staring at the ground

Maybe more pseudocode? I love pseudocode. The outer try/catches are
node's (i.e. the implicit loop try/catch). I'm sorry I'm not one of
the "uber" community people where I can just point at a github repo
that clears all this up. I have 0 node core commits since ever,
etc.... but by golly the sky... she's a blue.

[server.start]
try {
  server.start();
  contextTracking.start();
  requestHandling.start();
} catch (serverStartException) {
  blowUp();
}

[request.handle]
try {
  request.handle();
  // containerization (covers 3rd party lib cases also); note: resource
  // allocation is also wrapped in containers
  contextTracker.wrapActivity(doSomething1());
  contextTracker.wrapActivity(doSomething2());
} catch (ex) {
  contextTracker.recover();
  areAllMySystemsInGoodHealth();
  hmmmNo?OKRestartTheSystemsReportingIssues();
  // containerized activities "destruct" and resources are released
  context.containers.cleanup();
  ifI'mStillNotGoodWeHaveToCallInReinforcementsButIfI'mGoodIJustSavedALotOfMoneyOnMyCarInsurance();
}





Mikeal Rogers

unread,
Sep 12, 2011, 6:16:07 PM9/12/11
to nod...@googlegroups.com
you're saying poetato and we're saying you might not know how to spell.

you're free to continue to spell things wrong but please don't try to teach spelling to others.

-Mikeal

Ryan Gahl

unread,
Sep 12, 2011, 6:21:23 PM9/12/11
to nodejs
Was that an asshole-ish thing to say?

Yep, yep I think it was. It's ok though, it was funny.

Mark Hahn

unread,
Sep 12, 2011, 6:24:41 PM9/12/11
to nod...@googlegroups.com
I'm glad to see that emotional long threads are spreading to other topics than just control-flow, templates, and coffeescript.  I think every discussion should be lively.

(Sorry for a meta-post).

Clint Harris

unread,
Sep 12, 2011, 6:25:09 PM9/12/11
to nod...@googlegroups.com
> you're saying poetato and we're saying you might not know how to spell.
> you're free to continue to spell things wrong but please don't try to teach spelling to others.


I think that was unnecessary and totally unhelpful.

Regardless of your opinion, the discussion has been helpful to me (the OP who didn't know any better) and will probably be helpful to at least a few other people.

Marco Rogers

unread,
Sep 12, 2011, 6:28:47 PM9/12/11
to nod...@googlegroups.com
Ryan, I mean no offense by this, but perhaps your problem is that you are not very good at stating your case. No one is denying that your personal experience seems to suggest an alternate path. What we're saying is that we're skeptical and you've presented no evidence. This post is the most thorough yet and still it gives no useful information about your solution. Or, to put it less negatively, maybe you're giving us too much credit and assuming we're reading between the lines. Let me ask more specific questions.
 

[request.handle]
try {
  request.handle();
  contextTracker.wrapActivity(doSomething1()); //containerization

What is this? Containerization is not a word, it's not even tech jargon, so it gives me no ideas about how contextTracker actually works. Yes we all know how to wrap callbacks. But these callbacks can spawn other callbacks. How does contextTracker handle nesting and code you have no direct control over?


(covers 3rd party lib cases also). note: resource allocation is also
wrapped in containers
  contextTracker.wrapActivity(doSomething2());

Explain. How does it wrap third party libs? Are you doing this explicitly? If so, how far do you have to understand that library to do so? If the answer is "a lot" then your solution immediately becomes less attractive.
 
} catch (ex) {
  contextTracker.recover();
  areAllMySystemsInGoodHealth();
  hmmmNo?OKRestartTheSystemsReportingIssues();
  // containerized activities "destruct" and resources are released
  context.containers.cleanup();

  ifI'mStillNotGoodWeHaveToCallInReinforcementsButIfI'mGoodIJustSavedALotOfMoneyOnMyCarInsurance();
}


None of this is a counter to what has been said here. You seem to be agreeing actually. If you can track the context, you can catch the error and handle it. That's what everyone is doing now. The only difference is that you are allowing that to happen much later. It's not a different solution. You should have handle the blow up and restart case. And you seem to be suggesting that letting errors bubble up to uncaughtException is a nicer more elegant way to do things in certain cases. I disagree. But to each his own. It still feels like you haven't presented anything new.

Just because you choose to give the color gray the name "blue" doesn't mean we're talking about different things. You just have a different perspective. That's fine. But I'd really like more detail.

:Marco

Mikeal Rogers

unread,
Sep 12, 2011, 6:30:40 PM9/12/11
to nod...@googlegroups.com
the discussion is great, i'm not knocking it, but it's not going anywhere when the actual reasons we're giving for why we recommend something aren't actually being addressed by the replies.

seriously, you can't figure out what is affected by an exception once it hits uncaughtException, people should know this.

if this wasn't a problem Ryan wouldn't be working on features in node-core to fix it :)

-Mikeal

Marco Rogers

unread,
Sep 12, 2011, 6:34:16 PM9/12/11
to nod...@googlegroups.com
much later. It's not a different solution. You should have handle the blow up and restart case.

This should be: "You still have to handle the blow and restart case." Sorry if that was unclear and for the dumb typing today.

:Marco

Mikeal Rogers

unread,
Sep 12, 2011, 6:52:15 PM9/12/11
to nod...@googlegroups.com
Let me back up; I'm starting to get way more hostile than I want to be.

The point of this thread, and really this list, is to:

1) have meaningful discussions that lead to new answers
2) inform new users of best practices and solutions

This thread began with an established practice and a desire to elaborate on why this is a good practice.

Someone disagreed with the reasoning. We've done our best to explain why the assumptions that are being made in that contradiction are not true, that this is in fact a very big problem in node.js that we're trying to improve.

Now, in order for the discussion to mature please refrain from restating how you might trap state. Yes, it's possible to trap some state, that's not what we're talking about. The issue here is the state you *don't know about*. Work is being done in core to allow you to inspect that state and provide better encapsulation, but it's not ready yet.

The more this discussion drags on without maturing the more this is conflicting with the second point of this thread/list which is to inform new users.

As it stands, Felix has stated a very good practice and we have done our best to explain not just why he prefers this practice but why most of us agree with it.

The question:

Why would you want to tear down a process on uncaughtException?

The answer:

Because there is no way to inspect all the state that depended on that stack and clean it up; if you do not begin restarting the process, you could leave it in a bad state indefinitely.

Nothing has been said that contradicts that answer, it stands, please, new users, do this until we have a better solution for you.

-Mikeal


Dean Landolt

unread,
Sep 12, 2011, 6:59:03 PM9/12/11
to nod...@googlegroups.com
On Mon, Sep 12, 2011 at 6:30 PM, Mikeal Rogers <mikeal...@gmail.com> wrote:
the discussion is great, i'm not knocking it, but it's not going anywhere when the actual reasons we're giving for why we recommend something aren't actually being addressed by the replies.

seriously, you can't figure out what is effected by an exception once it hits uncaughtException, people should know this.

if this wasn't a problem Ryan wouldn't be working on features in node-core to fix it :)

Exactly. Though in defense of the other Ryan *ahl I suspect both Ryans are effectively working on the same problem from different angles (Dahl from core, Gahl from userland). I couldn't speak to how effective it is to track all of this context from userland -- it seems pretty damn error-prone to me -- but I can't believe it's impossible.

So perhaps this rule of thumb may work as a compromise: if you're really skilled, very careful, and you've done a ton of testing, you may not need to always tear down your process for errors that bubble to the top of the process (but if you're writing all that bookkeeping code, why not just catch all your errors properly :).

Now for the rest of us, it's probably alright to drop some connections on the floor every so often. And that's what makes it a best practice -- general community consensus, right? And one day we won't have to do all this bookkeeping ourselves -- and when that day comes it may no longer be considered a best practice. But in the meantime, if you're going to spend so many cycles collecting and swapping contexts, perhaps a runtime with threads would be a better option -- the context-swapping burden is easier to bear when the runtime carries much of it for you.

To Ryan's other point about a restart cycle -- if you find yourself in one of those I suspect you're pretty much f'ed -- either way you've got a hotfix deploy in your near future. So I don't think this argument adds much. But I do buy his main premise -- this doesn't need to be done in core. But that's probably not the path of least resistance :)

Jorge

unread,
Sep 12, 2011, 9:20:50 PM9/12/11
to nod...@googlegroups.com
On 13/09/2011, at 00:52, Mikeal Rogers wrote:
>
> The question:
>
> Why would you want to tear down a process on uncaughtException?
>
> The answer:
>
> Because there is no way to inspect all the state that depended on that stack and clean it up so you could leave the process in a bad state indefinitely if you do not begin the process of restarting the process.
>
> Nothing has been said that contradicts that answer, it stands, please, new users, do this until we have a better solution for you.

@Mikeal: That's very interesting. Could you *please* gist a real example where letting the error bubble up to the top totally breaks the program as you say (so as to justify a process restart), but catching it instead at a lower level would work fine?

TIA,
--
Jorge.

Matt

unread,
Sep 12, 2011, 9:54:19 PM9/12/11
to nod...@googlegroups.com
On Mon, Sep 12, 2011 at 5:24 PM, Marco Rogers <marco....@gmail.com> wrote:
This is exactly my point and it feels like some folks on the other side of this debate are missing it. If it bubbles all the way up to uncaughtException, it means you didn't account for it elsewhere. If you know exactly what the scenario is, or you have all these context trackers in place, then the error should not make it all the way to uncaughtException.

For example, Matt's example about the rDNS error. If you know rDNS can throw exceptions, why aren't you catching them? If it throws errors that you can't catch, isn't that a bug?

What am I missing about this?

You're missing the difference between a framework that tries its best to be stable in the face of errors, vs an application that can control all of those errors.

I can't control how people write plugins, but I do want to protect them from themselves. I *do* want them to catch exceptions, but sometimes they miss them (sometimes I do too!). So I want to be stable in the face of that.

Now "stable" is maybe something you would argue that carrying on without quitting isn't going to work. But the only case of problems is when you have shared state, which is something I recommend avoiding, albeit sometimes it is unavoidable, in which case I can agree that quitting may be the only safe option.

Matt.

Matt

unread,
Sep 12, 2011, 10:14:04 PM9/12/11
to nod...@googlegroups.com
I think what's interesting about this discussion is that those on the "uncaughtException is ok sometimes" side can actually see both sides. I can and have said in this thread that yes sometimes carrying on after uncaughtException can be fine, and sometimes it can be bad.

The example I gave is an obvious one: where the exception means you never get to the point in the code where you close a filehandle. This leaves the filehandle open forever, and causes a filehandle leak, which eventually means you run out of filehandles (assuming the error happens more than once).

So there are instances where it's not OK to carry on, but there are a lot of ways to write applications where it's actually fine to handle the uncaughtException and carry on serving "requests" (whatever those may be). Saying that "exiting on uncaughtException is the only sensible thing to do" is very much application-specific, and honestly requires a deeper understanding of your application and how things work than a simple catchphrase can capture.

Matt.

Steve Krenek

unread,
Sep 12, 2011, 10:38:04 PM9/12/11
to nod...@googlegroups.com
Forgive me if this comment seems a bit naive.  I've only been indoctrinated into the async world of Node for about six months or so now, and don't want to provoke any tempers.

Coming from the more traditional web development world, it seems that first the concept of forking processes, and later, as hardware (and programmers' minds) supported it better, threads enabled these scenarios to be handled without killing the entire server.  (I'm assuming most people involved in this discussion are using Node in some sort of server context.)

What those mechanisms really do in the server space is allow programmers to sandbox pieces of code (requests, sub-processes, etc.) into a self-contained context which, if it encounters an uncaught exception, is then trashed to prevent total mayhem (a la the infamous 500 server error).  The context of that piece of code (again, request, sub-process, etc.) gets cleaned up (along with any resources it opened, such as file handles), and the main line of execution continues serving its purpose merrily.

Is there anything built into Node that would provide this sandboxed context which could have its own uncaught exception handler and to which any opened resources could be bound?  Perhaps the developer could initiate it similar to traditional web servers' request or conversation scope concepts, and then when the context of that code has served its purpose, its resources are reclaimed, forcefully, if necessary.  

In my limited experience so far, Node solves a ton of problems that traditional servers do not address, but the single thread of execution does bring its own can of worms, as we've seen from this thread of over 50 entries so far.

Again, if I'm totally off my rocker here, feel free to ignore this, or better yet, enlighten me.  I'm still trying to make sense of all this. 

Cheers!
Steve

Matt

unread,
Sep 12, 2011, 11:55:00 PM9/12/11
to nod...@googlegroups.com
On Mon, Sep 12, 2011 at 10:38 PM, Steve Krenek <st...@thevolary.com> wrote:
What those mechanisms really do in the server space is allow programmers to sandbox pieces of code (requests, sub-processes, etc.) into a self-contained context, which, if it encounters an uncaught exception, is then subsequently trashed to prevent total mayhem (ala the infamous 500 server error).  The context of that piece of code (again, request, sub-process, etc.) gets cleaned up (and any resources it opened such as file handles, etc.) and the main piece of execution continues serving its purpose merrily.  

There's a number of issues at play here.

There are some types of resources (open file handles being the perfect example) that a dying thread won't clean up, node won't clean up, and even some process-based web frameworks won't clean up, unless you do it manually. Secondly, most of the problems people are associating with this issue are a problem of shared state: if you share state within a node process then you're pretty screwed if there's an unhandled exception.

In both cases, no particular architecture design is going to help you, though they do tend to offer certain subsystems to mitigate those problems (for example in the mod_perl world, a particular child process will only process N connections before reaping itself and starting over - this mitigates some of those problems).

Is there anything built into Node that would provide this sandboxed context which could have its own uncaught exception handler and to which any opened resources could be bound?  Perhaps the developer could initiate it similar to traditional web servers' request or conversation scope concepts, and then when the context of that code has served its purpose, its resources are reclaimed, forcefully, if necessary.  

You have to think of Node as very low level. It's basically a programming language with a standard library - it's not a framework at the level you're thinking of. Now at that higher level, should the frameworks built in node do something similar to what I wrote above (handle N requests before restarting)? Probably. It's generally a good idea, although it does sometimes hide certain problems rather than forcing developers to fix them.

Matt.

aaronblohowiak

unread,
Sep 13, 2011, 12:33:08 AM9/13/11
to nodejs
TL;DR: Who cares if you may be able to recover from uncaught
exceptions if you must design your system to recover from unexpected
process/machine failure anyway?

On Sep 12, 10:46 am, Ryan Gahl <ryan.g...@gmail.com> wrote:
> [...] As I said, it's a matter of application
> design.

I would like the systems that I develop to be tolerant of machine
failure, so single-process failure should not be the end of the
world. If you design your system to be tolerant of total machine
failure, then it must necessarily be able to recover from the failure
of a single process. Ryan, not knowing the intricacies of your
application (or application design), I do not know why your process
startup time is so long that restarting is completely untenable --
please do not take the following paragraph to be directly pointed at
you, but rather to push the discussion forward (your application
specifics must be the overriding factor, as we both know).

I fear the implications of thinking "if we are careful enough, our
process will live forever." s/forever/as long as we'd like it to. If
your code is able to handle application resuming from total process
failure AND you are trying to recover from uncaught exception, then I
admire your coding discipline. Unfortunately, I think most
developers tend to view the exception-handling approach as a way to
make it so they can reason about their app as if their process will
complete all operations it begins.

Unfortunately, machines lose power, hard drives die, the OOM killer
can bite you in the ass, node core may have errors, et cetera. In
other words: given a sufficient horizon, your process will
unexpectedly crash for reasons that are not under your control.
Because if this, we must design our system to handle unexpected
process death (this may mean a manual data checking and correcting
process... which is perfectly valid decision in some scenarios...) If
the system can resume from unexpected process death anyway, then why
try to keep the process up? You seem to have a good reason in your
particular case, but I submit that for most use-cases, the diligence
required to recover gracefully from errors does not offset the restart
cost (especially since you must code for the eventuality that your
process will be unexpectedly restarted out from under you anyway...)

Frankly, I plan to never code with "domains" because I will a) write
code assuming that any process/machine can die at any time and b)
design my system to assume ephemerality, which means a larger quantity
of smaller sized processes.

You get some nice things "for free" when you design to the pattern of
multiple small processes, such as: a) leverages multiple cores
automatically b) is easier to spread load to multiple machines c) you
can influence the scheduling based on importance ('nice') d) you can
allow work queues to back up while new code versions are being
deployed without requiring the entire system to become unavailable.

If you code for whole-system fault tolerance (failure is normal), then
the small cost of having to restart failed processes FAR FAR FAR
outweighs the additional user-land code complexity from trying to
mitigate the impact of exceptions (be this in recovering from
uncaughtException handler or through domains.) In other words, let's
trade operational complexity (which we need if we want to build stable
systems anyway) for code complexity (which is the root of hairloss.)

aaronblohowiak

unread,
Sep 13, 2011, 12:42:42 AM9/13/11
to nodejs

> there is no way to inspect all the state that depended on that stack and clean it up so you could leave the process in a bad state indefinitely if you do not begin the process of restarting the process.
>
> Nothing has been said that contradicts that answer, it stands, please, new users, do this until we have a better solution for you.

In the presence of shared mutable state, node core can only give user-
land developers a possibility that they are able to successfully track
all of their application state's dependencies as they write their
code, and handle all of the edge cases when weird shit breaks
unexpectedly. That is extremely difficult for a single programmer to
do, and impossible if you don't have total knowledge of the running
system.

Trying to design a system under which only a single perfect programmer
or a team of perfect programmers with perfect knowledge can be
successful seems... precarious. Worse is better. Let it fail.

Mikeal Rogers

unread,
Sep 13, 2011, 12:43:38 AM9/13/11
to nod...@googlegroups.com
This just isn't the case. Shared state exists, period.

The real question is: faced with an exception in someone else's code you've executed, do you assume that the process is in a bad state, or do you assume it's not?

It is *not* STABLE to continue the process indefinitely if you are unsure whether it's in a bad state. That is in fact the exact opposite of stable.

What you want to do is take the process out of rotation, let it finish its pending requests, and restart it. You can do this without downtime if you do it right. Nobody is saying the only solution to handling exceptions is downtime, that would be unacceptable.




Mikeal Rogers

unread,
Sep 13, 2011, 12:53:59 AM9/13/11
to nod...@googlegroups.com
I can't just gist a complex application to illustrate a complex problem, nor can I design example code to fail in a way that is by definition "unexpected" :)

What I can offer is a simple example.

You use a database adapter.

When you initialize the adapter it creates a connection pool and a cache system. It asynchronously updates its cache via push events from the databases.

An exception happens in the database adapter.

The exception *could* just be in a single connection getting a new value; it *could* relate to only one http request/response. Or, it could be in the code that updates the cache.

If it's in the code that updates the cache, the cache will no longer be updated. You have no way to tell; you didn't even write this adapter.

ry does have a proposal for a solution to problems like this.

He's calling them "domains". You create a "domain", and until you end it anything that gets put into the event system is attached to the domain, and anything those callbacks create will also be added to the domain. Domains become the exception handlers; you don't use process.uncaughtException anymore.

You write a database adapter and create a domain around your cache code. If you get an exception you clean up and restart all the caching logic.

Your web framework creates a domain around its emit('request') so that it can give a 500 to any requests that cause exceptions.

-Mikeal

Charlie McConnell

unread,
Sep 13, 2011, 1:59:49 AM9/13/11
to nod...@googlegroups.com
I think we're being a little fault-intolerant, at least for the Node.js community.  Can't we all just get along? 
--
Charlie McConnell
Support Engineer
Nodejitsu, Inc.

Jorge

unread,
Sep 13, 2011, 6:47:55 AM9/13/11
to nod...@googlegroups.com
Thanks Mikeal.

Yeah, that's true: you should not attempt to recover from unexpected obscure errors happening in code you don't understand :-)

But what does that have to do with *where* you caught it?

In other words, why is an unknown error ok if caught in a try/catch deep in the call stack, but not when it has bubbled all the way up to the global error handler (that is, up to the process' uncaughtException handler)?
--
Jorge.

Matt

unread,
Sep 13, 2011, 8:00:50 AM9/13/11
to nod...@googlegroups.com
On Tue, Sep 13, 2011 at 12:43 AM, Mikeal Rogers <mikeal...@gmail.com> wrote:
On Sep 12, 2011, at 6:54 PM, Matt wrote:

On Mon, Sep 12, 2011 at 5:24 PM, Marco Rogers <marco....@gmail.com> wrote:
This is exactly my point and it feels like some folks on the other side of this debate are missing it. If it bubbles all the way up to uncaughtException, it means you didn't account for it elsewhere. If you know exactly what the scenario is, or you have all these context trackers in place, then the error should not make it all the way to uncaughtException.

For example, Matt's example about the rDNS error. If you know rDNS can throw exceptions, why aren't you catching them? If it throws errors that you can't catch, isn't that a bug?

What am I missing about this?

You're missing the difference between a framework that tries its best to be stable in the face of errors, vs an application that can control all of those errors.

I can't control how people write plugins, but I do want to protect them from themselves. I *do* want them to catch exceptions, but sometimes they miss them (sometimes I do too!). So I want to be stable in the face of that.

Now maybe you would argue that carrying on without quitting isn't going to be "stable". But the only case with problems is when you have shared state, which is something I recommend avoiding; albeit sometimes it is unavoidable, in which case I can agree that quitting may be the only safe option.

This just isn't the case. Shared state exists, period.

Honestly, I'm done trying to state that there's two sides to this and have you and Marak and Marco just state that I'm wrong. I'm old enough and ugly enough to not care about arguing.

Have a good day,

Matt.

Jorge

unread,
Sep 13, 2011, 8:53:46 AM9/13/11
to nod...@googlegroups.com

Totally agree. Absolutely.
--
Jorge

Mikeal Rogers

unread,
Sep 13, 2011, 12:21:08 PM9/13/11
to nod...@googlegroups.com
I understand your question now.

In node.js you most likely *do* know what is happening in a try/catch, because it can't encapsulate all that much code since so much of it gets put off into callbacks that aren't caught by the try/catch.

Sure, you could call into a bunch of code you don't know and you'd have the same problem, but that is less likely.
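A tiny sketch of why try/catch covers so little here: the callback runs after the try block has already returned, so the throw never reaches the catch.

```javascript
var reachedCatch = false;

// Handler of last resort, only so the late throw doesn't kill this demo:
process.on('uncaughtException', function (err) {
  console.log('landed in uncaughtException:', err.message);
});

try {
  setTimeout(function () {
    throw new Error('thrown later'); // runs long after the catch below is gone
  }, 10);
} catch (err) {
  reachedCatch = true; // never runs: the try/catch exited before the callback
}
```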

I think we're also missing what may be the most common use case: errors emitted by event emitters, which only throw if unhandled. With those you have a very good idea of what state you need to handle and what would be affected by the error if you listen for it on the emitter, but almost no idea once it throws and hits the global exception handler.

-Mikeal

Mikeal Rogers

unread,
Sep 13, 2011, 12:24:40 PM9/13/11
to nod...@googlegroups.com
Think about it this way.

A function can access all kinds of shared global state. In your code you can reason about what it might affect by looking at it.

When handling an exception in a global exception handler there is no way to inspect what shared state is affected by that exception. A human being might be able to go in after the fact and figure out what is going on, but it's not safe to continue running the process without restarting.

Again, do not sever all connections and crash hard. Take the process out of rotation and let it complete its pending requests while another process handles new requests.

-Mikeal

Marco Rogers

unread,
Sep 13, 2011, 12:50:53 PM9/13/11
to nod...@googlegroups.com
I think we've seen that this is a difficult topic because it's non-trivial to show code that illustrates either side. Being on the "you need to restart" side is particularly frustrating because, as Mikeal said, it hinges on the unexpected. And it's hard to model that in an example. If you want to get frustrated because no one will admit that there's wiggle room here: fine, there's wiggle room. But we're talking about best practice. That's the framing of this discussion. If your goal is to simply destroy best practices so that all we can give newcomers is a weak "it depends", you will have a hard time doing that.

Also, after the initial posts, I've tried to be careful not to totally dismiss the other side of this argument. I do want to hear fresh ideas on this. Others are much more adamant about their position. That comes from real experience. They're harder to convince. If that frustrates you, feel free to withdraw. Feel free to respond only to those you feel are being constructive. But, if you really have useful knowledge to share, I think this topic is important enough to endure some gruffness from Mikeal (he's really a lovable guy when you get to know him). Let me try to engage you and Ryan Gahl again in a more constructive way.

After thinking more about it, I'm more interested in seeing techniques that you or Ryan use to mitigate this issue. I'm particularly intrigued about the idea of building decoupled sub-systems that can recover from error. The more I thought about it (because I'm actually participating in this discussion and not just shooting people down), the more I feel like it aligns with an unformed theory that I was trying to discuss at node summercamp.

So for example, if there is an error in a subsystem that you didn't anticipate, you can tear the whole thing down and create a whole new db layer with caching and everything. It may seem drastic, but some might prefer it to restarting the whole process. Obviously, this only works if the db layer touches no outside state whatsoever. So the question becomes: how do you ensure that? It feels like Ryan's confidence comes from a position of knowing a whole lot about all the different components of his system. That's great, but not always practical. Or should I say, it takes more effort than just restarting the process. He is working with constraints that make it worth the effort (non-trivial startup path). That totally makes sense.

But we still need to be talking about real techniques. That's how you convince people. I think my side has at least managed to convey a few concrete use cases that you could address directly with your solutions. I hope Ryan has been silent because he's working up some awesome examples.

:Marco

Mikeal Rogers

unread,
Sep 13, 2011, 1:09:16 PM9/13/11
to nod...@googlegroups.com
On Sep 13, 2011, at 9:50 AM, Marco Rogers wrote:

I think we've seen that this is a difficult topic because it's non-trivial to show code that illustrates either side. Being on the "you need to restart" side is particularly frustrating because, as Mikeal, said it hinges on the unexpected. And it's hard to model that in an example. If you want to get frustrated because no one will admit that there's wiggle room here. Fine, there's wiggle room. But we're talking about best practice. That's the framing of this discussion. If your goal is to simply destroy best practices so that all we can give newcomers is a weak "it depends". You will have a hard time doing that.

+1

So for example, if there is a error in a subsystem that you didn't anticipate. You can tear the whole thing down. Create a whole new db layer with caching and everything. It may seem drastic, but some might prefer it to restarting the whole process. Obviously, this only works if the db layer touches no outside state whatsoever. So the question becomes how do you ensure that? It feels like Ryan's confidence comes from a position of knowing a whole lot about all the different components of his system. That's great, but not always practical. Or should I say, it takes more effort than just restarting the process. He is working with constraints that make it worth the effort (non-trivial startup path). That totally makes sense.

I think everyone I know that is running node in production has some kind of long-running memory leak that requires semi-frequent restarts anyway. That won't always be the case; eventually node will be much better and the frameworks and libraries will leak less, but at this time it's a good idea to have a strategy for process cycling.

But we still need to be talking about real techniques. That's how you convince people. I think my side has at least managed to convey a few concrete use cases that you could address directly with your solutions. I hope Ryan has been silent because he's working up some awesome examples.

My frustration, partially, comes from the fact that we really are working on features that will give programmers the tools to understand the state related to a bubbled exception, to inspect more of the state an exception might relate to, so that we don't have to restart the process. Pretending that it's not a problem derails feedback about the tools we're building to help this.

BTW, I'm about to deal with error propagation in stream.pipe() in my streams2 branch. Feedback is welcome.




Nuno Job

unread,
Sep 13, 2011, 1:09:41 PM9/13/11
to nod...@googlegroups.com
Hi,

I think it's also useful to separate concerns here. At least for me this boils down to two very important points:
  1. How do you recover from an error, from the perspective of keeping the application in a correct state
  2. How do you give a developer the right tools to guarantee a response (or something else you need to assure) to a specific request
I like the way node gives you the tools to deal with 1. yourself. You have an event for those exceptions; deal with it however it fits your application. Great.

I dislike the way node deals with 2.: meaning it doesn't. Maybe that's what makes up 1., but that is a concern. Would you build software that you put on a rocket ship based on something that can't guarantee 2.? Javascript is already pretty lousy at being side-effect free, and formal methods are almost nonexistent in the community (at least that I know of; if I'm wrong, point me to the sites and papers because I sure want to read them).

I want to build my reliable services in node. For that I need a level of abstraction where I can ensure that the user gets a response (and hopefully, the correct one). Or if that's not the assurance I need, I might need some other kind of assurance. If this were programmatically possible like 1., I would be happy. Oh well, I'm still happy.

Nuno




Matt

unread,
Sep 13, 2011, 1:44:29 PM9/13/11
to nod...@googlegroups.com
On Tue, Sep 13, 2011 at 1:09 PM, Mikeal Rogers <mikeal...@gmail.com> wrote:

I think everyone I know that is running node in production has some kind of long-running memory leak that requires semi-frequent restarts anyway.

I don't see that with Haraka processing millions of emails a day... Is this a framework issue, or something to do with the core http libraries?

Matt.

Mikeal Rogers

unread,
Sep 13, 2011, 1:48:13 PM9/13/11
to nod...@googlegroups.com
Both, but mainly in third-party libraries.

I know someone recently found a leak in the http client that was fixed.

-Mikeal

Matt

unread,
Sep 13, 2011, 2:11:11 PM9/13/11
to nod...@googlegroups.com
On Tue, Sep 13, 2011 at 12:50 PM, Marco Rogers <marco....@gmail.com> wrote:
After thinking more about it, I'm more interested in seeing techniques that you or Ryan use to mitigate this issue.

I can't speak for Ryan, but personally I just don't.

If this sort of problem happens (in my personal instance of Haraka) I just let it happen. There's nothing scary here. No shared state to worry about. There might be database access going on, but I haven't seen any problems with that yet (I use Pg, and expect it to be safe in the event of any kind of exception).

All state is maintained within the connection object, which should go out of scope when the exception is thrown, and the timer eventually fires on it, which issues a disconnect. There's no global heap of connection objects so I don't need to delete anything else.

I may go in and check the logs from time to time, but I haven't had anything in a while.
 
I'm particularly intrigued about the idea of building decoupled sub-systems that can recover from error. The more I thought about it (because I'm actually participating in this discussion and not just shooting people down), the more I feel like it aligns with an unformed theory that I was trying to discuss at node summercamp.

One way to do it is something like this:

Connection.prototype.add_cleanup_handler = function (handler) {
    var already_called = false;
    // Wrap the handler so it is safe to call more than once
    // (e.g. both explicitly and again from disconnect()):
    var _handler = function () {
        if (already_called) return;
        already_called = true;
        handler();
    };

    this.cleanup_handlers.push(_handler);
    return _handler;
};

Then when you allocate a resource that must be cleaned up somehow (like an open filehandle):

var ws = fs.createWriteStream(path);
var close_it = connection.add_cleanup_handler(function () { ws.destroy() });
... // do something with ws
// close it:
close_it();

And finally, in the disconnect method for the Connection you loop through and run all cleanup routines.
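Putting that together (with a hypothetical minimal Connection constructor for illustration), the disconnect side might look like:

```javascript
// Hypothetical minimal Connection, just enough to show the loop.
function Connection() {
    this.cleanup_handlers = [];
}

Connection.prototype.disconnect = function () {
    // Run every registered cleanup routine exactly once; the list is
    // cleared first so a re-entrant disconnect is a no-op.
    var handlers = this.cleanup_handlers;
    this.cleanup_handlers = [];
    handlers.forEach(function (h) { h(); });
};
```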

====

The other thing to worry about with the "tear down your process" thing is slow connections. Say you're running cluster with 1 process per CPU (the default) on a 4 CPU system. If you have a coding error and an uncaughtException is happening frequently enough that all your processes are restarting often, then you end up with a completely inaccessible system: you're waiting for the slow connections (one connected to each of the 4 processes) to finish processing while no longer accepting new ones. This may not be so much of an issue with http, where you can stick nginx in front to trickle content to those slow connections (and thus free up node to restart), but it's very much an issue with SMTP, where the communication is bidirectional and you can't proxy it. That's obviously an extreme example, but it's what can happen if you follow this blindly as a best practice without thinking about your architecture and application first.

Matt.

Isaac Schlueter

unread,
Sep 13, 2011, 2:32:58 PM9/13/11
to nod...@googlegroups.com
I have an example of a program that needs to exit on unhandled
exceptions: npm. I don't have a gist handy, but there's a git repo,
and it's probably already installed on your computer anyway. Go look
at it.

There's a single global exception handler, which should never be hit,
ever. All errors should be captured by a callback. If anything
throws, it's some completely unexpected weirdness. *By definition*, I
have not cleaned up the state, since *I didn't catch the error*, so
there's no "clean up and move on" option possible. The only work that
it does at that point is die in a horrible red-and-black fashion.

If your program doesn't do much, or has some reasonably reliable way
to clean up state and shut down gracefully, then sure, do that, and
*then* exit.

If it doesn't do very much, and can only crash in a few specific
ways, and you can detect those, then fine, handle those, or otherwise
guarantee that you're not dropping state on the floor, and keep on
trucking. But such programs are the minority.

Any conversation about "best practices" should be about the common
cases. I'm sure there are exceptions, but I'm willing to bet that any
given program is not an exception unless shown otherwise. That's
because I know the definition of "exception", and I'm not completely
terrible at betting.

Matt claims that Haraka is such an exception, and he seems pretty
bright and claims to be old and ugly, so probably has some experience
in such matters ;) But keeping the state clean requires a lot of
diligence, and there may well be cases where a bug in your stack
(either your code, a dependency, or node-core) makes this actually
impossible.

Qv:

On Tue, Sep 13, 2011 at 10:48, Mikeal Rogers <mikeal...@gmail.com> wrote:
> I know someone recently found a leak in the http client that was fixed.

If you don't exit on uncaughtException events, then you should
probably not just check logs, but also monitor memory usage and open
file descriptors to make sure they don't creep up.
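A trivial sketch of that kind of monitoring (hedged: the interval is a placeholder; counting open file descriptors is platform-specific, e.g. reading /proc/self/fd on Linux, so only memory is shown):

```javascript
// Periodically log memory so a slow leak shows up in the logs
// before it takes the process down.
var timer = setInterval(function () {
  var mem = process.memoryUsage();
  console.log('rss=' + mem.rss + ' heapUsed=' + mem.heapUsed);
  // A real setup might alert, or proactively cycle the process,
  // once rss crosses some threshold.
}, 60000);

// Don't let the monitor itself keep an otherwise-finished process alive.
timer.unref();
```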

Mikeal Rogers

unread,
Sep 13, 2011, 2:36:24 PM9/13/11
to nod...@googlegroups.com


On Sep 13, 2011, at 11:11 AM, Matt <hel...@gmail.com> wrote:

On Tue, Sep 13, 2011 at 12:50 PM, Marco Rogers <marco....@gmail.com> wrote:
After thinking more about it, I'm more interested in seeing techniques that you or Ryan use to mitigate this issue.

I can't speak for Ryan, but personally I just don't.

If this sort of problem happens (in my personal instance of Haraka) I just let it happen. There's nothing scary here. No shared state to worry about. There might be database access going on, but I haven't seen any problems with that yet (I use Pg, and expect it to be safe in the event of any kind of exception).

All state is maintained within the connection object, which should go out of scope when the exception is thrown, and the timer eventually fire on it which issues a disconnect. There's no global heap of connection objects so I don't need to delete anything else.

In 0.5.3+ this is true of http client and request objects. In 0.4.x it is not, there are close and error cases that are not handled well.



Matt

unread,
Sep 13, 2011, 2:52:44 PM9/13/11
to nod...@googlegroups.com
On Tue, Sep 13, 2011 at 2:36 PM, Mikeal Rogers <mikeal...@gmail.com> wrote:

All state is maintained within the connection object, which should go out of scope when the exception is thrown, and the timer eventually fire on it which issues a disconnect. There's no global heap of connection objects so I don't need to delete anything else.

In 0.5.3+ this is true of http client and request objects. In 0.4.x it is not, there are close and error cases that are not handled well.

Didn't know that. I don't do HTTP... In that case I revise my statement a bit: if you're doing HTTP then exiting on uncaughtException is the only way to go :)

And to Isaac's point: yes, I can't control what third-party stuff does; this is very much true (which is why I say I can see both sides of the argument), and I'll probably make it an option at some point to exit on uncaughtException for that very reason. But for my personal use of Haraka, and the custom plugins I've written, I'm pretty sure it's all good, and I'd rather not deal with restarts.

(nobody responded to the slow client issue - though I guess with HTTP that is less of an issue if you proxy).

Matt.

Dean Landolt

unread,
Sep 13, 2011, 3:01:52 PM9/13/11
to nod...@googlegroups.com
Why can't you proxy (and slow down) SMTP? The amount by which you can decelerate is, I presume, bound only by timeout lengths. Just like HTTP.

Matt

unread,
Sep 13, 2011, 3:31:25 PM9/13/11
to nod...@googlegroups.com
Why can't you proxy (and slow down) SMTP? The amount by which you can decelerate is, I presume, bound only by timeout lengths. Just like HTTP.


SMTP is chatty back and forth. You can proxy it, but it doesn't gain you anything.

Dean Landolt

unread,
Sep 13, 2011, 4:24:31 PM9/13/11
to nod...@googlegroups.com
On Tue, Sep 13, 2011 at 3:31 PM, Matt <hel...@gmail.com> wrote:
Why can't you proxy (and slow down) SMTP? The amount by which you can decelerate is, I presume, bound only by timeout lengths. Just like HTTP.


SMTP is chatty back and forth. You can proxy it, but it doesn't gain you anything.

So long as you don't exceed a timeout bound it gains you the same property you'd get from nginx sitting on HTTP connections. But I don't know much about the protocol -- I know it's chatty but are the timeouts short and/or vaguely defined? If not you're golden, right?

Matt

unread,
Sep 13, 2011, 4:32:59 PM9/13/11
to nod...@googlegroups.com
No, you're missing the point in the context of a graceful shutdown.

In the no-proxy scenario you have: Client <--> Node and if one of those clients is slow, you wait the entire time of the connection until you can gracefully shut down (and thus trigger cluster to create a new process). If you do that and only have 4 processes running you run the risk of not being able to service clients.

In the proxy scenario you have: Client <--> Proxy <--> Node, and if one of those clients is slow, Node can still feed the response back to the proxy, shut down the connection, and let the proxy do the work of slowly feeding data back to the slow client. In this scenario there's no risk you won't be able to keep servicing connections because your restart will be fast.

SMTP doesn't work like HTTP. It's not a "fire and forget" protocol.

Dean Landolt

unread,
Sep 13, 2011, 4:53:04 PM9/13/11
to nod...@googlegroups.com
I think you're missing my point -- regardless of whether it's HTTP or SMTP if you have a proxy between you and node you can sit on incoming messages until your process is back up to service it. So you're bound by timeout. And yes, I get that session state makes it more difficult in SMTP but certainly not impossible. I never said it was a good idea :)

Mikeal Rogers

unread,
Sep 13, 2011, 5:01:36 PM9/13/11
to nod...@googlegroups.com
I don't think I saw the beginning of this question about slow clients, but I *think* I get the gist of it.

In practice, what I believe most people are doing is setting a timeout on how long they will wait for the existing connections to finish up before restarting. If the timeout is reached the connections are prematurely closed.

This is all assuming you're using a process pool btw. If you aren't using a process pool then this is a bit harder.

Also, SMTP should have the same issue as HTTP in regard to slow clients. Both are TCP; neither can push data at a client faster than the client will accept it without buffering that data into memory. I know there are extra application-layer concerns to think about in addition to this, but they both suffer the same limitation of pushing data via TCP, so this isn't exclusive to either.

Sure, a proxy could hold the data for you but it would have to do so in memory which means the workload can't be too large.

-Mikeal

Matt

unread,
Sep 13, 2011, 5:24:23 PM9/13/11
to nod...@googlegroups.com
On Tue, Sep 13, 2011 at 5:01 PM, Mikeal Rogers <mikeal...@gmail.com> wrote:
I don't think I saw the beginning of this question about slow clients but I *think* i get the gist of it.

In practice, what I believe most people are doing is setting a timeout on how long they will wait for the existing connections to finish up before restarting. If the timeout is reached the connections are pre-maturely closed.

Well of course.
 
This is all assuming you're using a process pool btw. If you aren't using a process pool then this is a bit harder.

Indeed, the issue of not being able to accept connections just hits you immediately then, rather than a bit later.
 
Also, SMTP should have the same issue as HTTP in regard to slow clients.

Not with a proxy. Besides, proxying has other issues with SMTP, since you're trying to deal with the IP address a lot of the time. There are ESMTP extensions to help with that, but they are imperfect.
 
Sure, a proxy could hold the data for you but it would have to do so in memory which means the workload can't be too large.

Well sure, but that's the point.


It's slightly less relevant to normal servicing with node because you don't have the 1-process-per-connection limitations that mod_perl 1.x had, but it's still relevant to this shutdown/restart scenario.

Matt.

Clint Harris

unread,
Sep 14, 2011, 7:01:43 PM9/14/11
to nod...@googlegroups.com
Just a sidenote for other new-to-node folk who are interested in learning more about the domains stuff: Isaac, Mikeal, and a few of the other guys talked about it briefly in the latest episode of NodeUp (around the 23min mark). http://nodeup.com/four
