http_parser HPE_INVALID_CHUNK_SIZE errors after upgrading to node 0.8

Jimb Esser

Sep 20, 2012, 4:48:02 PM
to nod...@googlegroups.com
We're migrating from node 0.6.21 to 0.8.9, and one of our processes that does a database backup (a couple million items fetched over HTTP and HTTPS from the Amazon SimpleDB and DynamoDB APIs) has started getting "HPE_INVALID_CHUNK_SIZE" errors, presumably from node's http_parser.c.  After one of those errors we're generally handed back a partial response, and sometimes also a disconnected socket on the next request, which black-holes the connection without some other fixes.  It generally hits one or two of these errors across the millions of fetched items (maybe tens to hundreds of thousands of individual HTTP requests over just a handful of keep-alive HTTP/1.1 connections).  Rolling back to 0.6.21 makes everything fine.  A diff of http_parser.c between the two versions shows it's had some major work done (though nothing right around where this error is emitted), so perhaps a bug has been introduced, but I know nothing of that code and don't know where to start debugging it.  Any suggestions?  Any recent changes to related code I should try bisecting?  Anyone else seeing this error?
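
For context, the fetch path itself is nothing exotic.  Stripped way down (this is just a sketch, not our actual code; opts stands in for the signed SimpleDB/DynamoDB request options), it looks roughly like:

    var https = require('https');

    // opts is a placeholder for our signed request options
    function fetchPage(opts, cb) {
      var req = https.request(opts, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () { cb(null, body); });
      });
      // this is (roughly) where the failures show up: an error whose
      // err.code is 'HPE_INVALID_CHUNK_SIZE' on the request's error path
      req.on('error', cb);
      req.end();
    }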

  Jimb Esser
  Cloud Party, Inc

Jimb Esser

Sep 20, 2012, 6:00:50 PM
to nod...@googlegroups.com
As a follow-up, it looks like the data I get back from the response often has "\r\n2000\r\n" in the middle of what otherwise looks like good data, so I'm guessing something is going wrong with processing multi-part (chunked? whatever is the right term?) responses, and it sometimes reads too much and then tries to read the next chunk size from the middle of the response data, sometimes getting an "invalid chunk size" error, or, if it gets lucky and reads a small number, just truncating my response (though leaving the "\r\n2000\r\n" in the middle of it).
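
For reference (this is my possibly shaky understanding of chunked transfer encoding, mostly from poking at the traffic), the body of a chunked response is framed as a hex chunk-size line, CRLF, that many bytes of data, CRLF, repeated, and then a zero-size chunk to terminate:

    2000\r\n
    ...8192 (0x2000) bytes of data...\r\n
    2000\r\n
    ...8192 bytes of data...\r\n
    0\r\n
    \r\n

So "\r\n2000\r\n" showing up inside the data handed back to me looks like a chunk-size line that got treated as payload instead of being consumed as framing.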

Ben Noordhuis

Sep 21, 2012, 8:45:41 AM
to nod...@googlegroups.com
On Fri, Sep 21, 2012 at 12:00 AM, Jimb Esser <wast...@gmail.com> wrote:
> As a follow-up, it looks like the data I get back from the response often
> has "\r\n2000\r\n" in the middle of what otherwise looks like good data, so
> I'm guessing something is going wrong with processing multi-part (chunked?
> whatever is the right term?) responses, and it sometimes reads too much and
> then tries to read the next chunk size from the middle of the response data
> (sometimes getting an "invalid chunk size" error, or if it gets lucky and
> reads a small number, it just truncates my response (though leaves the
> "\r\n2000\r\n" in the middle of it).

Do you have some captured traffic I can look at?

Jimb Esser

Sep 21, 2012, 12:40:59 PM
to nod...@googlegroups.com
I have no idea how to capture traffic on Linux, but I could do so if there are some good instructions somewhere.  The problem generally takes hundreds of MB of traffic to reproduce, however, so I'm not sure whether logging that much traffic will change the timing and keep it from occurring, but I'm willing to try.

I did a git bisect and found the revision in the v0.8 history which "caused" the issue, but unfortunately I think that is a red herring: the revision is just the one that fixed keep-alive being broken on that branch [1], and before that revision keep-alive is simply off, so the traffic is drastically different (not to mention orders of magnitude slower in my test).  On the 0.6 branch, however, even with keep-alive functioning correctly, I do not see this problem, which leads me to believe that something broke while keep-alive was also broken on the 0.8 branch - and that makes it much harder to bisect.  I may try cherry-picking in the keep-alive fix while bisecting to see if I can narrow it down, though since each test can take up to half an hour before it's conclusive, I'm not sure how productive that will be.

Ben Noordhuis

Sep 21, 2012, 7:27:04 PM
to nod...@googlegroups.com
On Fri, Sep 21, 2012 at 6:40 PM, Jimb Esser <wast...@gmail.com> wrote:
> I have no idea how to capture traffic on Linux, but I could do so if there
> are some good instructions somewhere. The problem generally takes hundreds
> of MB of traffic to reproduce, however, so I'm not sure if logging that
> traffic will affect it and cause it to not occur, but I'm willing to try.

tcpdump? It's easy:

tcpdump -C 1 -G 5 -i <interface> -n -s 1500 -w <filename> tcp port <port>

The -C and -G options make it rotate the log file every 5 seconds or
when it grows to 1 million bytes, whichever comes first. You may need
to increase the snarf size (-s) to 8192 or higher if you're capturing
loopback traffic.

You can replay the traffic with `tcpdump -A -r <filename>`.

Jimb Esser

Sep 24, 2012, 5:37:16 PM
to nod...@googlegroups.com
I tried tcpdump, and it starts saving off packets (it writes a few dump files, a couple tens of MB), but then seems to stall out - nothing more gets written even though my app is clearly still receiving data.  By the time the bad data arrives, tcpdump has inevitably stopped, and searching through the dumps turns up none of the bad data.

I added some logging to http_parser.c, however, simply dumping out the raw data that it is parsing, and it looks like the data is bad before it even gets in there.  Specifically, in one case I'm looking at (raw data, though all text, at [1]), the tail of the buffer has 4377 bytes that begin what should be an 8192-byte chunk, then a duplicated copy of the last ~4708 bytes or so, followed by "0\r\n\r\n" (the chunked-body terminator).  Sadly, now that I'm familiar with the http_parser code, this really points to a problem somewhere higher up... -_-  I'll do some poking around with tcpdump to see if I can get it (or something else) to dump more reliably; it would be good to have absolute confirmation that the data was fine coming in at the TCP level, although the fact that node 0.6 has no problems makes me pretty certain the input is fine.

On other fronts, I did more tests searching through node's git history and found that before a specific revision [2] this seems to happen much less frequently (once in 6 runs, as opposed to almost every run), but I'm pretty convinced there's nothing wrong with that change itself.  I added some logging both before and after it and found nothing behaving differently, though the frequency changed when I added the logging, so it may just be a delicate timing issue.  It does happen reliably on my work machine, my home machine, and AWS instances, all on drastically different hardware, and it never seems to happen on node 0.6.21 on any of those machines (although if it was only happening 1 in 6 times before, I might simply never have seen it in my tests).

Ben Noordhuis

Sep 24, 2012, 6:27:23 PM
to nod...@googlegroups.com
Can you post the patch you used for logging? I don't see anything
obviously wrong in the dump itself.

It's kind of odd that tcpdump hangs; I've never had issues with it
myself, even on maxed-out Gbit interfaces.

On a completely unrelated note:

{"S":"[6.24719116,-3.552713678800501e-15,2.252045820000001]"}

Clamp your floats, man! :-)

Jimb Esser

Sep 24, 2012, 6:48:56 PM
to nod...@googlegroups.com
The logging patch was to replace the two occurrences of SET_ERRNO(HPE_INVALID_CHUNK_SIZE) with:
          {
            /* dump the entire buffer being parsed, marking the current
               parse position; cast so high bytes print as two hex digits */
            const char *p2;
            for (p2 = data; p2 != data + len; p2++) {
              fprintf(stderr, "0x%02x, ", (unsigned char) *p2);
              if (p == p2) {
                fprintf(stderr, "/******/ ");
              }
            }
          }
          fprintf(stderr, "HPE_INVALID_CHUNK_SIZE case 1\n");
          SET_ERRNO(HPE_INVALID_CHUNK_SIZE);

I then loaded the stderr output into a trivial JS program to write it out as a binary file.
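
(The "trivial JS program" is nothing interesting - roughly this, with made-up filenames:)

    var fs = require('fs');
    // pull the "0xNN" bytes back out of the stderr log and write them as a binary file
    var text = fs.readFileSync('stderr.log', 'utf8');
    var bytes = (text.match(/0x[0-9a-f]{2}/gi) || []).map(function (h) {
      return parseInt(h, 16);
    });
    fs.writeFileSync('dump.bin', new Buffer(bytes));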

When you say you don't see anything obviously wrong in the dump itself, do you mean it looks like a valid HTTP stream?  It looks to me like, near the end, it says an 8k data chunk is coming (2000\r\n), then provides less than 8k of data and starts another chunk (so the original 8k would terminate in the middle of that second chunk).  I am basing this entirely on guesses about the HTTP protocol gleaned from reading http_parser.c, though, so I could be quite mistaken =).

Ben Noordhuis

Sep 24, 2012, 9:17:07 PM
to nod...@googlegroups.com
On Tue, Sep 25, 2012 at 12:48 AM, Jimb Esser <wast...@gmail.com> wrote:
> Patch for logging was to replace the two occurrences of
> SET_ERRNO(HPE_INVALID_CHUNK_SIZE) with:
> {
> const char *p2;
> for (p2=data; p2 != data + len; p2++) {
> fprintf(stderr, "0x%02x, ", *p2);
> if (p == p2) {
> fprintf(stderr, "/******/ ");
> }
> }
> }
> fprintf(stderr, "HPE_INVALID_CHUNK_SIZE case 1\n");
> SET_ERRNO(HPE_INVALID_CHUNK_SIZE);
>
> I then loaded the stderr output into a trivial JS program to write it out as
> a binary file.
>
> When you say you don't see anything obviously wrong in the dump itself, do
> you mean in looks like a valid HTTP stream? It looks to me like, near the
> end, it says an 8k data chunk is coming (2000\r\n), and then provides less
> than 8k data and starts another chunk (with the original 8k terminating in
> the middle of the second chunk). I am basing this entirely on assumptions
> about the HTTP protocol gleaned from assumptions and reading http_parser.c,
> though, so I could be quite mistaken =).

The HTTP parser works across TCP packet boundaries. A request or
response doesn't necessarily fit in a single packet.

Unless you're benchmarking with a concurrency of 1 (a single client
that issues sequential requests), you'll see HTTP requests and
responses getting interleaved.

Jimb Esser

Sep 25, 2012, 12:38:34 AM
to nod...@googlegroups.com
Yes, but, as I understand it, this logging happens while processing a single stream of data, in a single call to http_parser_execute(), specifically inside the for loop that walks over exactly the data I logged.  That should be internally consistent, at least, right?  If the stream says it has an 8192-byte chunk coming, and there are more than that many bytes of data left, then 8192 bytes later there should be another chunk-size line in the stream - which there is not, which is of course why it's throwing the HPE_INVALID_CHUNK_SIZE error when it reads non-hex characters at the offset where it's expecting the next chunk size.

Anyway, I managed to capture it going bad with tcpdump (I added -U -B 1000000 so it writes each packet out immediately and uses a bigger capture buffer; not sure if that helped or the timing was just different).  Short result: it looks like the data was bad as it came in over TCP!  I'll go into detail and post the dumps below, but this quickly leads me to ask (after apologizing for thinking this was a node bug): why does this generally fail on 0.8 and never on 0.6?

My guess is that on 0.6 an HTTP parse error simply disconnected the socket, and all of our AWS APIs retry any request that gets a socket disconnect (those happen rather frequently with AWS).  We also automatically retry on a bunch of 5xx HTTP response codes, so if any of those were being generated, we'd retry on those as well.  I'll add some additional logging and run this on node 0.6 again to see if I can verify this.
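
For concreteness, our request wrapper is shaped roughly like this (a simplified sketch - the names and retry limit are made up, but this is the behavior that would quietly hide a parse-error-turned-disconnect on 0.6):

    var https = require('https');
    var MAX_RETRIES = 3; // illustrative

    function awsRequest(opts, body, cb, attempt) {
      attempt = attempt || 0;
      function retry(err) {
        if (attempt < MAX_RETRIES) {
          return awsRequest(opts, body, cb, attempt + 1);
        }
        cb(err);
      }
      var req = https.request(opts, function (res) {
        var data = '';
        res.on('data', function (chunk) { data += chunk; });
        res.on('end', function () {
          // retry on 5xx, otherwise hand the body back to the caller
          if (res.statusCode >= 500) {
            return retry(new Error('HTTP ' + res.statusCode));
          }
          cb(null, data);
        });
      });
      // socket disconnects (and, presumably, parse errors on 0.6) land here and get retried
      req.on('error', retry);
      req.end(body);
    }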

Back to the details on the symptom at hand (just in case I misinterpreted some of this data, it would be good to get another set of eyes on it).  In this case it didn't actually get an HTTP parser error; it simply handed back what looks like bad data (with the HTTP chunk-size lines left in the middle of it), perhaps because the socket got closed or something (although if it got closed prematurely, it seems like that should have generated a parsing error, since the parser was expecting more bytes?  I'm not sure on this, exactly).  I've uploaded the parsed/printed TCP dump of the end of the relevant stream (starting from before the last good chunk-size line) [1], as well as the tail of the data that was returned to my app, starting from the same point [2].

As far as I can tell, the stream indicated an 8 KB chunk was coming, 8 KB later it indicated another 8 KB chunk was coming, but then 1296 bytes later there's another chunk-size line ("\r\n2000\r\n").  At the packet level, the final packet (starting at seq 230681) appears to be simply repeated data from 1304 bytes back in the stream, but then it terminates unexpectedly (it never gets to the 8 KB of data the earlier chunk size said was coming).  No other packets show up in this stream in my dump.

I cannot think of anything on our side (either my code or node's) that could cause this, so my conclusion is that something is going wrong within Amazon's DynamoDB (or the routing between there and our servers, though I've tried from both inside and outside their datacenters with the same results).  In retrospect, every bad packet I've gotten has been on DynamoDB queries, despite us slamming a couple other Amazon APIs with similar levels of traffic during this test, so that also points the finger in that direction.  I think I will bring this up on the DynamoDB forums, and just make our DDB client more robust when it gets low-level parsing errors (although I'm a little afraid of what could happen if a corrupted response happens to parse correctly because some of our data happens to line up!).

Thank you very much for your help with this.



  Jimb Esser
  Cloud Party, Inc

Ben Noordhuis

Sep 25, 2012, 8:32:38 AM
to nod...@googlegroups.com
Glad you got it solved - or the next best thing: you know what the culprit is. :-)

The reason that you're seeing it with 0.8 and not 0.6 probably is that
http_parser.c became considerably more strict. Error handling on the
JS level hasn't changed significantly.

Jimb Esser

Sep 25, 2012, 12:59:59 PM
to nod...@googlegroups.com
Well, I did a bunch of testing on node 0.6, and after running for hours I received no HTTP parsing errors, no JSON parsing errors on the data returned, and no unexpected socket disconnects.  On node 0.8 I generally get one or two of these within 5-10 minutes.  So there's definitely something different that node 0.8 is doing that affects this, though the tcpdump indicates the problem is on the other end, so there's basically nothing more to try - but it is still worrisome :(.  I think we'll have to stick with node 0.6 in production for a while yet.

For posterity, the thread I started with Amazon is here (no responses yet):

  Jimb Esser