A huge feature I've missed from the JVM is the ability to send SIGQUIT (Ctrl-\ on Unix, Ctrl-Break on Windows) and get a list of running threads. It's a killer feature for tracking down hung processes, because it lets you see where each thread is stuck.
I've written a small C extension for Ruby that writes a list of threads, with the file and line number each is currently executing, to STDERR upon receiving a SIGQUIT. Simply install the gem and require 'thread-dump', and a trap for SIGQUIT will be registered within that process. (So if you're forking, you need to do the require within the fork, for example.)
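For anyone who wants to see the general shape of the idea, here is a rough pure-Ruby sketch. It is not the C extension's code: it assumes a Ruby new enough to have Thread#backtrace (1.9 or later), and the name dump_threads is made up for illustration. It also assumes a platform where SIGQUIT exists, so the trap is only installed when Signal.list knows about QUIT.

```ruby
# Hypothetical helper (not part of the thread-dump gem): write every
# live thread and its current backtrace to the given IO.
def dump_threads(io = $stderr)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.object_id} status=#{thread.status.inspect}"
    # Thread#backtrace returns nil for a dead thread, hence the fallback.
    (thread.backtrace || []).each { |frame| io.puts "    #{frame}" }
  end
end

# Register the trap only where the platform supports SIGQUIT.
# Like the gem's trap, this applies to the current process only.
Signal.trap('QUIT') { dump_threads } if Signal.list.key?('QUIT')
```

With that loaded, `kill -QUIT <pid>` from another shell should print one entry per thread to STDERR.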
I'm still waiting on the RubyForge project. If you really need this ASAP (as we did), I've put it up at Google Pages:
This helped us track down a nasty bug that was occurring due to the lack of a timeout in Net::HTTP during SSL connect, which still seems to be busted in ruby trunk.
The extension is still very primitive. I was unable to decipher the Ruby C code necessary to unwind the full stack frame for a thread that is not currently executing, but fortunately I was able to at least get the bottom of the stack using the RNode held by the *thread. Help with this feature would be greatly appreciated!
It should be up at thread-dump.rubyforge.org sooner or later.
> I've written a small C extension for Ruby that will send a list of
> threads and their current file/line number executing to STDERR upon
> receiving a SIGQUIT.
Nice. I've wanted this ability a few times.
> This helped us track down a nasty bug that was occurring due to the
> lack of a timeout in Net::HTTP during SSL connect, which still seems
> to be busted in ruby trunk.
Could you describe this in more detail, or possibly post a diff of the changes you made to fix the bug? (We're about to ship a product with embedded ruby using Net::HTTP and SSL, so it would be great to be able to eliminate any such lurking bug.)
> > I've written a small C extension for Ruby that will send a list of
> > threads and their current file/line number executing to STDERR upon
> > receiving a SIGQUIT.
> Nice. I've wanted this ability a few times.
> > This helped us track down a nasty bug that was occurring due to the
> > lack of a timeout in Net::HTTP during SSL connect, which still seems
> > to be busted in ruby trunk.
> Could you describe this in more detail, or possibly post a diff
> of the changes you made to fix the bug? (We're about to ship a
> product with embedded ruby using Net::HTTP and SSL, so it would
> be great to be able to eliminate any such lurking bug.)
> Thanks,
> Bill
We didn't fix it directly; we worked around it by timing out all requests in the outer caller. The bug seems to be inside Net::HTTP's connect method: when SSL is enabled there is a call to "s.connect", and that call is not timed out. Some of our processes were hanging on this call.
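A workaround along those lines can be sketched with the standard timeout library. This is only an illustration, not our actual code: the helper name with_deadline and the :timed_out return value are invented here, and the block stands in for whatever request the outer caller makes.

```ruby
require 'timeout'

# Hypothetical helper: run the block, but give up after `seconds`.
# Returns the block's value, or :timed_out if the deadline passed.
def with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  :timed_out
end
```

The caller would then wrap the whole request, SSL connect included, in something like `with_deadline(30) { http.request(req) }`, where http and req are whatever Net::HTTP objects the caller already has. Note that Timeout works by interrupting the block from another thread, so the block should not be left holding resources that need careful cleanup.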
>> > This helped us track down a nasty bug that was occurring due to the
>> > lack of a timeout in Net::HTTP during SSL connect, which still seems
>> > to be busted in ruby trunk.
>> Could you describe this in more detail, or possibly post a diff
>> of the changes you made to fix the bug? (We're about to ship a
>> product with embedded ruby using Net::HTTP and SSL, so it would
>> be great to be able to eliminate any such lurking bug.)
> We didn't fix it directly, we worked around it by timing out all
> requests in the outer caller. The bug seems to be inside of def
> connect, there is a call to "s.connect" if ssl is enabled, and this
> call is not timed out. Some of our processes were hanging on this
> call.
Interesting. We've been seeing an issue with "s.connect" as well, but only on Windows (ruby 1.8.4), and oddly only when ruby is embedded into our C++ app, and only the *first* time the SSL connect takes place.
For us, we'd see the CPU pegged for about 20 seconds down in openssl.so -> ssleay.dll -> libeay.dll. But it would eventually return. After that, all subsequent SSL connect calls would execute quickly.
I was wondering if it was doing some one-time generation of a private key or something. . . . (But why only when ruby was embedded in our C++ app? Something missing from the environment, I wondered...?)
Anyway, I wasn't getting very far debugging it, as I didn't have symbols for ruby or the ssl libraries. (I was using binaries from the One-click installer.) So I built ruby 1.8.4 and openssl locally with debug symbols, updating to a newer version of OpenSSL (0.9.8e) in the process.
The result: The unexplained "s.connect" delay seems to have vanished.
I would be happier if I knew what had been causing the problem; maybe it's still lurking. But it used to happen like clockwork, and since rebuilding ruby and a newer OpenSSL, I've yet to see the problem again.
Incidentally, our app also runs on OS X, and I have yet to see this "s.connect" problem over there.
What platform(s) are you seeing it on? In your case, it sounded like it may have been hanging indefinitely on you, as opposed to being a ~20 second delay that would eventually return?
> What platform(s) are you seeing it on? In your case, it sounded
> like it may have been hanging indefinitely on you, as opposed to
> being a ~20 second delay that would eventually return?
Yup, this happens on multiple Fedora Core servers, and they were hung up for several hours. We didn't notice this until we started trying to gracefully shut down these threads, and realized a good chunk of them were stuck.
This little tool quickly revealed which line was the culprit :)