Client did not send nnn bytes as expected

Amit Kasher

Dec 2, 2008, 4:13:46 AM
to Google Web Toolkit
Hi,
Does anyone have any new insights about this issue? We've been
investigating for over a year(!), and we don't seem to be the only
ones...

http://tinyurl.com/5rqfp5

Thanks.

Lothar Kimmeringer

Dec 2, 2008, 4:40:06 AM
to Google-We...@googlegroups.com
Amit Kasher wrote:

I have no insights, but what about firing up Wireshark and
logging the packets exchanged between client and server? When
the problem occurs, you should be able to extract the capture
of that specific HTTP session. Maybe that helps to track down
where the problem is.


Regards, Lothar

Amit Kasher

Dec 2, 2008, 4:56:29 AM
to Google Web Toolkit
Thanks.

I have been running the tcpdump sniffer on the server side, and discovered
that, when the problem occurs, the server always receives 80% of the byte
content (I described it here: http://tinyurl.com/5rqfp5). This is very
interesting, but unfortunately led me nowhere.

I haven't managed to reproduce it in over a year, so I can't run a
sniffer on the client. Also, this is a high-capacity internet
application, not an intranet one, so contacting the users even just
for a question is rather difficult, let alone installing a sniffer on
the client side.

Amit

Lothar Kimmeringer

Dec 2, 2008, 5:05:38 AM
to Google-We...@googlegroups.com
Amit Kasher wrote:

> I have been trying tcpdump sniffer in the server side, and discovered
> that the server always receives 80% of the byte content (I described
> it here: http://tinyurl.com/5rqfp5). This is very interesting, but
> unfortunately led me nowhere.

I only just read the first post (shame on me ;-) but I still think
that Wireshark might help here. When the problem occurs, you can
narrow the view down to the packets of that one session by
applying a filter. That way it should be possible to see what
was happening _before_ the packets got truncated.

> I don't manage to reproduce it, for over a year now, so I can't run a
> sniffer in the client. Also, this is a high capacity internet
> application, not intranet, therefore contacting the users even just
> for a question is rather difficult, let alone installing a sniffer in
> the client side.

A sniffer on the client side would be the next step to
consider. To begin with, I think it should be enough to
have one on the server side (listening only to HTTP
traffic).


Regards, Lothar

Amit Kasher

Dec 2, 2008, 6:11:19 AM
to Google Web Toolkit, Amit Monbaz
We will try Wireshark. BTW, the built-in Linux sniffer, tcpdump, is
pretty advanced and we used its filtering feature to pinpoint this
packet reduction. However, the disruption seems to occur at a
lower level in the server OS, or more likely before the server machine
altogether - some network equipment or client-side code / browser.

Thanks again for your help.

Amit

Lothar Kimmeringer

Dec 2, 2008, 6:55:25 AM
to Google-We...@googlegroups.com
Amit Kasher wrote:

> However, the disruption seems to occur somewhere
> lower level in the server OS, or more likely before the server machine
> altogether - some network equipment or client side code / browser.

I doubt that there is a bug in the lower levels of an OS that leads
to the truncation of TCP packets only when they come from a GWT
application running inside Internet Explorer.

With the sniffed packets I was hoping to see a pattern (if the
application calls functions x, y and z, the effect starts
to be observed, etc.). With that you might be able to reproduce
the effect on a local machine, allowing you to take further
steps like installing a sniffer on that box to see whether the
packets are sent truncated or why IE is getting a hiccup.


Regards, Lothar

jchimene

Dec 2, 2008, 10:50:19 AM
to Google Web Toolkit
Hi,

A few questions:

o Are all packets sent to the server the same size?
o What is that size?
o Have you checked for other types of congestion?
o Is this entirely TCP/IP? Have you checked maxrss?
o Have you enabled logging on intermediate nodes to see if there are
congestion issues?
o Is this related to a specific time of day (although it probably
happens between 10:00 and 14:00...)
o Do you have a world-wide net? If so, does the problem travel across
time zones?

Cheers,
jec

Amit Kasher

Dec 2, 2008, 11:11:46 AM
to Google Web Toolkit
Hi,
Thanks for your reply. Answers are inline.

On Dec 2, 5:50 pm, jchimene <jchim...@gmail.com> wrote:
> Hi,
>
> A few questions:
>
> o Are all packets sent to the server the same size?
No, they are not.

> o What is that size?
This depends on the service call - somewhere between 150 and 2000
bytes.
I will mention again that by using a sniffer (tcpdump), it seems that
EVERY time this issue occurs, the actual packets the server receives
are ALWAYS EXACTLY 80% of what it should have received. This, again,
was very encouraging to find as a clue, but unfortunately led me
nowhere.

> o Have you checked for other types of congestion?
Congestion? Unfortunately, I don't have any control over the client's
environment since this is an internet application and I can't
reproduce it.

> o Is this entirely TCP/IP? Have you checked maxrss?
maxrss? I'm not sure I understood the relevance... TCP/IP is obviously
used, it is the underlying protocol of HTTP...

> o Have you enabled logging on intermediate nodes to see if there are
> congestion issues?
I wish I could... I don't have any control over any node before the
server. It is a CentOS VPS hosted internet application. I will state
that this occurred in several hosting providers, in several countries
and geographical locations.

> o Is this related to a specific time of day (although it probably
> happens between 10:00 and 14:00...)
I didn't find any correlation between the time of day and the
occurrence of this. Obviously, this is normalized to the usage load,
as you implied.

> o Do you have a world-wide net? If so, does the problem travel across
> time zones?
My users are not from around the world, but as I stated - this issue
occurred when using hosting providers around the world.

jchimene

Dec 2, 2008, 3:19:01 PM
to Google Web Toolkit


On Dec 2, 9:11 am, Amit Kasher <amitkas...@gmail.com> wrote:
> Hi,
> Thanks for your reply. Answers are inline.
>
> On Dec 2, 5:50 pm, jchimene <jchim...@gmail.com> wrote:> Hi,
>
> > A few questions:
>
> > o Are all packets sent to the server the same size?
>
> No, they are not.
>
> > o What is that size?
>
> This depends on the service call - somewhere between 150 and 2000
> bytes.
> I will mention again that by using a sniffer (tcpdump), it seems that
> EVERY time this issue occurs, the actual packets the server receives
> are ALWAYS EXACTLY 80% of what it should have received. This, again,
> was very encouraging to find as a clue, but unfortunately led me
> nowhere.

At this point, I'd write a ping script and start generating packets of
a certain size. Hammer the server to see if you can reproduce on
demand. If you can reproduce w/ ping, it's not a browser/server issue.

> > o Have you checked for other types of congestion?
>
> Congestion? Unfortunately, I don't have any control over the client's
> environment since this is an internet application and I can't
> reproduce it.

I don't mean the client congestion, I mean congestion en route, i.e.
the cloud between the client and the server. Your later answers seem
to eliminate The Cloud.

> > o Is this entirely TCP/IP? Have you checked maxrss?
>
> maxrss? I'm not sure I understood the relevance... TCP/IP is obviously
> used, it is the underlying protocol of HTTP...

My bad. I meant MTU. But that doesn't sound like it's relevant.

> > o Have you enabled logging on intermediate nodes to see if there are
> > congestion issues?
>
> I wish I could... I don't have any control over any node before the
> server. It is a CentOS VPS hosted internet application. I will state
> that this occurred in several hosting providers, in several countries
> and geographical locations.

So it's sounding more and more like the app. If it's several hosting
providers in several locations, that's pretty much A Clue.

> > o Is this related to a specific time of day (although it probably
> > happens between 10:00 and 14:00...)
>
> I didn't find any correlation between the time of day and the
> occurrence of this. Obviously, this is normalized to the usage load,
> as you implied.
>
> > o Do you have a world-wide net? If so, does the problem travel across
> > time zones?
>
> My users are not from around the world, but as I stated - this issue
> occurred when using hosting providers around the world.
>

OK, so we're down to the app. I'd try constructing a reproducer using
ping with specific packet sizes. Record the output stats and run them
through a formatting routine that will make it easier to check for
problems. The goal here is to check end-to-end transmission w/o using
application layer code. If this transmission failure happens every
day, you should see something happen within 24 hours.

If nothing untoward happens, it's (probably) the app. In that case,
I'd grab a copy of Perl (or whatever you want) to write a client-side
app that faithfully simulates your app's transmission profile. Point
it at one of your servers and stress test that server/client circuit.

Send PCs to several locations if you have to, but get something in the
field that can reproduce this independently of your app and over which
you have complete control. If that's not realistic, you'll need a
"lab" environment, i.e. a machine that isn't a developer box, can be
wiped clean and conveniently set to a known state.
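
A minimal sketch of such a ping reproducer, written in Python rather than as a shell script. It assumes a `ping` binary that accepts `-c` (count) and `-s` (payload size) and prints a summary line of the form "N packets transmitted, M received, X% packet loss", as Linux iputils does; other platforms may differ. The host name and sizes shown in the note below are placeholders, not values taken from this thread's servers.

```python
import re
import subprocess

# Summary line printed by ping (the exact wording is an assumption and
# varies between platforms).
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received.*?([\d.]+)% packet loss"
)


def parse_ping_summary(output):
    """Return (sent, received, loss_percent) from ping output, or None."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def ping_once(host, payload_size, count=5):
    """Run ping with the given ICMP payload size and parse its summary."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-s", str(payload_size), host],
        capture_output=True,
        text=True,
    )
    return parse_ping_summary(result.stdout)


def sweep(host, sizes):
    """Ping the host once per payload size and print one line per size."""
    for size in sizes:
        print(f"size={size:5d} stats={ping_once(host, size)}")
```

Something like `sweep("server.example.com", [150, 454, 969, 2000])` (hypothetical host, sizes within the 150-2000 byte range mentioned above) could be run from cron and the output collected for a day. Note that ICMP echo only exercises the path, not TCP or HTTP, so a clean result does not clear the higher layers.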

Good luck,

jchimene

Dec 2, 2008, 3:29:41 PM
to Google Web Toolkit
Hi Amit,

One other thing:

I'm getting the impression that you also have a custom server. If it's
an identical configuration across all server instances, then you also
have to prove that it's not the server. Again, I'd code a simple HTTP
server in Perl (because there's no problem so intractable that it
can't be made worse with a Perl application) and use it to test
against your application.
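
A rough sketch of such a throwaway test server, using Python's standard `http.server` module instead of Perl. It only compares the declared Content-Length with the number of body bytes that actually arrive and logs any shortfall. The names `read_body`, `LengthCheckHandler` and `serve`, and the port number, are made up for this illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_body(stream, expected):
    """Read up to `expected` bytes; return (data, missing_byte_count)."""
    chunks = []
    remaining = expected
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # peer closed the connection before sending everything
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks), remaining


class LengthCheckHandler(BaseHTTPRequestHandler):
    """Answers every POST and logs whether the body matched Content-Length."""

    def do_POST(self):
        expected = int(self.headers.get("Content-Length", 0))
        data, missing = read_body(self.rfile, expected)
        if missing:
            self.log_message("short body: got %d of %d bytes", len(data), expected)
        self.send_response(200 if missing == 0 else 400)
        self.send_header("Content-Length", "0")
        self.end_headers()


def serve(port=8080):
    """Run the test server until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), LengthCheckHandler).serve_forever()
```

This sketch has no read timeout, so a client that stalls without closing the connection will block the handler. If a server this simple also logs short bodies, the application server stack is unlikely to be the culprit.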

Cheers,
jec

Amit Kasher

Dec 3, 2008, 1:20:38 AM
to Google Web Toolkit
Hi and thanks again for your responses.

A few more subtle observations and insights:
1. It's probably not the server. There are several reasons that lead
us to believe that the server is not the cause of this issue: (a) We
switched hosting providers. (b) These providers reside in completely
different geographical locations - different countries. (c) We have
always been using JBoss on CentOS, but this issue occurs both when we
use Apache as a front end with mod_jk to Tomcat, and when we
eliminate this tier and have clients go directly to Tomcat, using
it as the HTTP server. (d) The tcpdump sniffer explicitly shows that the
server ALWAYS receives EXACTLY 80% of the request payload. Unless this
is something even lower level in that machine (the VPS software used -
Virtuozzo, the network card/driver, etc.), these observations pretty
much provide an alibi for the server... I think we'd better focus on
other places.
2. There are indications that this is not inside the browser either:
(a) It happens in several GWT versions. (b) It happens "to" all
browsers, which provides a strong clue, since this code is completely
different from browser to browser - GWT uses the MsXMLHTTP ActiveX
object in IE, while using completely different objects in other
browsers. Since this is the underlying mechanism used to perform RPC,
and the issue happens with more than one of them, chances are low that
this is the cause.
Still, it seems that this MUST be the GWT/client code, since the
clients to whom this issue occurs much more often don't have
problems on any other websites (we managed to talk to several of
them).
One thing that comes to mind is perhaps the GWT serialization code? I
don't know...

Therefore, currently, aside from the possibility that there's a bug in
the GWT serialization code, there's also the possibility that it's
something in the network, even though these clients are from various
ISPs, and geographical locations. Yes, I notice the dead end as
well...

These observations somewhat reduce the anticipated benefit (let alone
the feasibility...) of several of your (MUCH APPRECIATED, THOUGH)
suggestions:
1. ping from the lab
2. perl HTTP server

Despite that, we ARE happy about any suggestion and willing to put in the
required effort, so we'll try to make progress in these directions.

Our situation now is that we assume that the data arrives corrupted at
the server, and we should see how this data comes out of the client.
Therefore we will also try to install a sniffer on a client computer
on which this occurs (though we have been trying to do that for quite
a long time now).

jchimene

Dec 3, 2008, 10:16:07 AM
to Google Web Toolkit
On Dec 2, 11:20 pm, Amit Kasher <amitkas...@gmail.com> wrote:
> Hi and thanks again for your responses.

No Prob.

If this "opportunity for excellence" is as pervasive as you suspect,
installing software on a client's computer should be a non-starter
from the perspective that installing it on *any* computer *anywhere on
the planet* should reliably reproduce the issue. You say that tcpdump
shows the packet truncation, so I'm not sure I understand the
requirement to install something on a client machine. My goal in these
past responses has been to absolutely prove that it's the
serialization code (by factoring out the serialization code using
ping), not something peculiar to the transport or session layers.

Are you using the public switched network to provide client/server
connectivity? If not, nothing you've said so far would eliminate your
network transport service.

I find it hard to believe it's GWT, as the cargo size is so small as
to be insignificant, and others would have reported this issue by now.
I have to admit that I'm not a user of Java serialization, so there
may have been reports of such serialization issues of which I'm
blissfully unaware. From everything you're saying, it really looks
like the problem is in user-space. It may be a certain code path that
leads to the same serialization invocation logic. I'd start pulling
this code apart, instrumenting the hell out of it and running it
through JUnit or some such automated testing environment. Again, I
understand you've probably done this...

I'm wondering if there's a specific byte-pattern that's causing this.
Have you tried reordering the structure members? Also, have you
eliminated buffer corruption issues? Since it's cross-browser, what
does the -pretty flag + Firebug reveal? Esp. when profiling the code?
(Although I must admit that you've probably tried all that type of
debugging by now).

Good luck,
jec

marcelstoer

Dec 3, 2008, 11:06:49 AM
to Google Web Toolkit
I was recently confronted with the very same exception but in a
slightly different context.
I implemented a Servlet listener that parsed the request before it was
being forwarded through the filter chain to the GWT RPC Servlet. At
the beginning I wasn't careful enough and tinkered with the request a
bit too much, GWT doesn't like that. I now use the GWT RPC and
RPCServletUtils classes to parse the request instead of doing it
myself.

HTH,
Marcel

johann_fr

Dec 3, 2008, 11:31:25 AM
to Google Web Toolkit
Hi,

Just in case: we had much the same problem a year ago, but only
with IE over HTTPS. We finally found the following solution (Apache
configuration):

SetEnvIf User-Agent ".*MSIE.*" \
nokeepalive ssl-unclean-shutdown \
downgrade-1.0 force-response-1.0

It seems that this solution has a small impact on performance, but
that's the only way we found to fix this issue.

Hope it helps.

Johann

markmccall

Dec 3, 2008, 1:48:59 PM
to Google Web Toolkit
Johann,

That solution seems somewhat obscure...how did you arrive at that
solution?

Thanks,
Mark

johann_fr

Dec 4, 2008, 4:48:44 AM
to Google Web Toolkit
Hi,

The issue is about how IE manages XMLHttpRequest over SSL: IE seems to
randomly close the connection before the request is completed.

Some links about this issue and how to solve it:

http://jnylund.typepad.com/joels_blog/2007/09/ie6-ajax-httpss.html
http://www.perkiset.org/forum/ajax/ie_6_ajax_over_ssl-t29.0.html
http://forum.mootools.net/viewtopic.php?pid=11200

In our case, it solved the issue.

Johann

Amit Kasher

Dec 5, 2008, 3:21:25 AM
to Google Web Toolkit
Hi,
We have spent the past 2 days working on this, and have some new
findings.

We made contact with one of our customers who encounters this
issue more frequently than others, and he granted us access to his
computer (using LogMeIn). We installed Wireshark on his computer, as
well as on the server. We managed to reproduce the problem with both
sniffers running, and analyzed the exact corresponding TCP segments
according to their sequence and ack numbers. Here are the results.

This is what happens in the valid state:
The client sends 2 TCP segments for a GWT service call, which are
supposed to be reassembled into a single PDU, the entire
HTTP request. The first segment always contains the HTTP request
header, and the second TCP segment always contains the HTTP request
body. For instance, we see that the client sends a first segment of
size 969 bytes, and a second segment of size 454 bytes. In the server
we see that these 2 segments become 3 segments. The first is still 969
bytes and contains the HTTP request header; the second is 363 bytes
(80% of the original second segment), and the third is the remaining
91 bytes (20% of the original 454 bytes).

In the invalid state, when the problem occurs, the third segment
simply does not arrive at the server. It seems that something along the
way has split the second, 454-byte segment into 2 segments, and only
sent the first one to the server.
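
The segment sizes quoted above can be checked with a few lines of Python. The split rule used here (80% of the segment, rounded down) is only an assumption that happens to reproduce the observed 363/91 split; the captures do not show what actually performs the split.

```python
def split_sizes(segment_size, fraction=0.8):
    """Split a segment size into (first, rest), rounding the first part down."""
    first = int(segment_size * fraction)
    return first, segment_size - first


first, rest = split_sizes(454)
print(first, rest)          # 363 91
print(969 + first + rest)   # 1423, the full request (header + body)
```

So in the failing case the server is left with 969 + 363 = 1332 of the 1423 bytes, i.e. a complete header and a body that is 91 bytes short.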

1. If this is something in the client's machine, how come we don't see
it in the sniffer? (we even tried removing all firewall/antivirus
software and reinstalling the network card driver)
2. If this is not something in the client's machine, how come some
clients encounter this much more than others, while some never
encounter it at all?

Can it be some kind of network equipment that some of our clients
(reminder - different ISPs) go through, and others don't?

Unfortunately, this new info still leaves us clueless...

jchimene

Dec 5, 2008, 1:09:50 PM
to Google Web Toolkit
Hi Amit,

You don't make this easy, do you...

o Just to be clear: goodness happens when the client sends 2 TCP
packets; which become three IP packets on the wire; which are
reassembled by the server into 2 TCP packets.
Badness happens when the client sends 2 TCP packets; which
become three IP packets on the wire; which are reassembled into one
complete TCP packet and 1 incomplete TCP packet.
Can you reproduce this in your lab? I'm guessing "no", otherwise
you would not have deployed the app...

o Do you see a NAK at the client after the dropped fragment?

o Pls. try traceroute from your lab and from the client box. What
are the differences?

o It's now appearing to be an IP issue. The fact that the
fragmentation doesn't occur on the larger packet is interesting.

o The two separate TCP packets lead to an assumption that you can
identify requests from the same client box at the server. IOW, you
have an application-level protocol that lets you reassemble the two
packets into a single request. I'm sure this is the case, but such a
design isn't explicitly stated in your message. Your server
application never sees the 2 -> 3 split, since the normal case is
that your server app only sees 2 packets from the client. I'm
reluctant to say this, but part of this process may require proof
that the protocol design is resilient to network transmission errors.

o I'd start playing around w/ different packet sizes and
transmission rates (via ping) to see if you can trip any triggers. It
may be a combination of buffering/congestion between the client and
the server. Did you try ping w/ different packet sizes? I realize
that you have different servers. Does the connection between the
client and server occur over the public switched network or does it
use a private circuit?

o There have been posts in this thread w/r/t/ SSL and IE. Are they
relevant?

Cheers,
jec

Amit Kasher

Dec 18, 2008, 2:52:47 AM
to Google Web Toolkit
I really appreciate your attempts to help.
In reply to your last post, jec:
* We can't reproduce this in the lab...
* We see a combination of FIN, FIN-ACK and RST.
* We haven't seen any suspicious traceroutes... nothing
differentiating suffering clients from non-suffering clients.
* We don't do anything "special" - these are normal GWT service call
requests from browser to server.
* We tried that, as well as different MTU sizes... no clue
* This occurs without any SSL involved, and regardless of which browser
is being used (IE, FF).

Unfortunately we gave up on the persistent attempt to get to the root
of this issue. We now just assume that it's some low-level network
issue (layers 1-2) that causes some of the packets not to arrive, in an
unexplained combination with a higher-level network issue (layers 3-6)
that causes the packet's data to split at exactly 80%.

In order to deal with this situation, we implemented a high level
(GWT) configurable retry mechanism, with timeout support. This
resolves the symptoms, and in effect solves the problem.
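
The retry mechanism itself was not posted in this thread, so the following is only a generic Python sketch of the idea (retry a failed call a configurable number of times, with the per-attempt timeout left to the caller), not the actual GWT client/server code. All names here (`call_with_retry`, `RetryExhausted`, the parameters) are hypothetical.

```python
import time


class RetryExhausted(Exception):
    """Raised when every attempt has failed."""


def call_with_retry(send, max_attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached.

    send is expected to enforce its own per-attempt timeout and to raise
    an exception (for example TimeoutError) when the request fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)  # pause before the next attempt
    raise RetryExhausted(f"gave up after {max_attempts} attempts") from last_error
```

A caveat that applies to any such scheme: blindly re-sending is only safe when the service call is idempotent, or when the server can recognize and discard a duplicate request.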

We don't mind contributing this mechanism (both client and server
code), if someone is interested or believes GWT needs this kind of
mechanism.

Thanks again,
Amit

markmccall

Dec 18, 2008, 9:14:49 AM
to Google Web Toolkit
I would be interested to see this solution, regardless of whether it is
contributed or not.

jchimene

Dec 18, 2008, 10:14:57 AM
to Google Web Toolkit
Hi Amit,

A few final observations:

o You should be seeing SYN packets that will track the IP packet
segments, particularly when the client sends 2 TCP packets, which
become three IP packets on the wire. Look for retransmission
requests, which would be SYN-ACK packets where the sequence number
doesn't increment (IOW NAKs)

o Compare router configurations in your lab and in the field,
particularly if this is a service provided using a virtual private
network

o Look for indications of mismanaged TCP flow control (window size)
and congestion control

Cheers,
jec