Yesterday I spent a bunch of time trying to track down a weird bug where
I was getting malformed_magic errors from read_res_packet.
It turned out in the end that our app was in one (buggy) case returning
an enormous response payload. We're actually running a slightly outdated
Gearman::Client perl library where there's a hard-coded limit of 20
reads before it falls out of the bottom of the read look in
read_res_packet. I see this has been changed in HEAD to grow that limit
for larger messages and to catch the short read when the limit is exceeded.
However, even in HEAD this function still has the issue that in the case
of any error it returns while leaving the socket in a bad state...
either right after a message header but before the payload or partway
through a payload.
This is made worse by the fact that TaskSet->add_task "handles" errors
here by printing to the selected filehandle the string " INITIAL SUBMIT
FAILED\n" and returning a task fail.
So in the event of an error, we:
* Leave the socket in a crazy state that causes a failure on the next
read attempt.
* Print a message to some random unknown filehandle that, in our
mod_perl application, happens to be the HTTP socket before Apache has
sent the response header, causing our HTTP status line to effectively be
" INITIAL SUBMIT FAILED" and thus confusing our load balancers.
* Fail the task but carry on ignoring the fact that the socket is now
hosed, causing all reads from that particular job server to fail until
something closes the socket or the gearman client object is freed.
Since our app uses a Gearman client singleton per Apache child, when an
error occurs that particular socket stays screwed until the child gets
recycled.
It feels like read_res_packet (or some other layer) should close the
socket on error to prevent it getting used for anything else, and to
cause a new connection to be opened before any further messages are
sent/received.
Also, printing INITIAL SUBMIT FAILED to the selected filehandle is not
an appropriate way to handle an error on task submit. Since the
following line marks the task as failed anyway, I don't think we need an
error message here at all.
I've attached a straightforward patch which closes the socket on error
and removes the error message.
I'm not sure that this is the best layer to close the socket, since it
seems that the closed socket ends up hanging around for a while, but I
figured that a closed socket is already one possible error case so
callers that aren't handling that are already broken. However, I'd be
interested in suggestions for alternative fixes.