Yes, but, as I understand it, this is logging during processing of a single stream of data, in a single call to http_parser_execute(), specifically inside a for loop iterating over exactly the data I logged. That should be internally consistent, at least, right? If the stream says an 8192-byte chunk is coming, and there are more than that many bytes of data, then 8192 bytes later there should be another chunk size in the stream (there is not, which of course is why it's throwing the HPE_INVALID_CHUNK_SIZE error when it reads non-hex characters at the offset where it's expecting the next chunk size).
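Just to spell out what "internally consistent" means here: a well-formed chunked body is a series of hex size lines, each followed by exactly that many bytes of data and a CRLF, ending with a zero-size chunk, e.g. (sizes and data made up):

    2000\r\n
    <8192 bytes of data>\r\n
    2000\r\n
    <8192 bytes of data>\r\n
    0\r\n
    \r\n

So after consuming the 8192 bytes of a chunk, the parser expects another hex size line right away, and that's where it's hitting non-hex characters instead.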
Anyway, I managed to capture it as it went bad with tcpdump (I added -U -B 1000000; not sure if that helped or if the timing was just different). Short version: it looks like the data was bad as it came in over TCP! I'll go into detail and post the dumps below, but this quickly leads me to ask (after apologizing for thinking this was a node bug): why does this generally fail on 0.8 and never on 0.6?
My guess is that on 0.6 an HTTP parse error simply disconnected the socket, and all of our AWS APIs simply retry any requests which get a socket disconnect (as those happen rather frequently with AWS). We also automatically retry on a bunch of 5xx http response codes, so if any of those were being generated, we'd retry on those as well. I'll add some additional logging and run this on node 0.6 again to see if I can verify this.
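For reference, the retry behavior I'm describing is roughly this shape (hypothetical names, not our actual code):

    // Rough sketch of our existing retry behavior: retry on socket errors and
    // on 5xx responses. Names and structure here are made up for illustration.
    var http = require('http');

    function requestWithRetry(options, body, retriesLeft, callback) {
      var req = http.request(options, function (res) {
        res.setEncoding('utf8');
        var data = '';
        res.on('data', function (chunk) { data += chunk; });
        res.on('end', function () {
          if (res.statusCode >= 500 && retriesLeft > 0) {
            return requestWithRetry(options, body, retriesLeft - 1, callback);
          }
          callback(null, res, data);
        });
      });
      req.on('error', function (err) {
        // My guess is that on 0.6 a parse failure surfaced here as a plain
        // socket error, so it got retried just like any other disconnect.
        if (retriesLeft > 0) {
          return requestWithRetry(options, body, retriesLeft - 1, callback);
        }
        callback(err);
      });
      req.end(body);
    }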
Back to the details of the symptom at hand (in case I misinterpreted some of this data, it would be good to get another set of eyes on it). In this case it didn't actually get an HTTP parser error; it simply handed back what looks like bad data (with the HTTP chunk-size lines included in the middle of it), perhaps because the socket got closed or something (although if it got closed prematurely, it seems like that should have generated a parse error, since the parser was expecting more bytes; I'm not sure on this, exactly). I've uploaded the parsed/printed TCP dump of the end of the relevant stream (starting from before the last good chunk-size line) [1], as well as the tail of the data that was returned to my app, starting from the same point [2].
As far as I can tell, the stream indicated an 8 KB chunk was coming, 8 KB later it indicated another 8 KB chunk was coming, but then only 1296 bytes later there's another chunk-size line ("\r\n2000\r\n", i.e. 0x2000 = 8192 bytes). At the packet level, the final packet (starting at seq 230681) appears to simply repeat data from 1304 bytes back in the stream, and then the stream terminates unexpectedly (it never gets the full 8 KB of data the earlier chunk size said was coming). No other packets show up in this stream in my dump.
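(For what it's worth, 1296 bytes of chunk data plus the 8 bytes of "\r\n2000\r\n" is exactly 1304, which would put the start of the repeated region right at the previous chunk-size line; that's my read of it, anyway.)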
I cannot think of anything on our side (either my code or node's) that could cause this, so my conclusion is that something is going wrong within Amazon's DynamoDB (or in routing between there and our servers, though I've tried from both inside and outside their datacenters with the same results). In retrospect, every bad packet I've gotten has been on DynamoDB queries, despite us slamming a couple of other Amazon APIs with similar levels of traffic during this test, so that also points the finger in that direction. I think I will bring this up on the DynamoDB forums, and just make our DDB client more robust against low-level parsing errors (although I'm a little afraid of what could happen if a bad stream accidentally parses correctly because some of our data happens to match up!).
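Concretely, the hardening I have in mind is just to treat a low-level parse error as retryable, the same as a disconnect or a 5xx. A sketch (hypothetical names, and I still need to verify exactly which 'error' event the parse error comes out of on 0.8):

    // Sketch only: widen the existing "should we retry?" check to include
    // http_parser errors (their codes start with HPE_). Not our actual code.
    function isRetryableFailure(err, res) {
      if (res && res.statusCode >= 500) { return true; }            // existing 5xx retry
      if (!err) { return false; }
      if (err.code === 'ECONNRESET' || err.code === 'EPIPE') {      // existing disconnect retry
        return true;
      }
      if (/^HPE_/.test(err.code || '')) { return true; }            // new: e.g. HPE_INVALID_CHUNK_SIZE
      return false;
    }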
Thank you very much for your help with this.
Jimb Esser
Cloud Party, Inc