We recently integrated Erlang into our MMO game[1]'s cluster, and
we're having a couple problems, but I want to address them one at a
time.
There are several simulation processes which, as they are needed,
fork() on any of several machines and subsequently connect to beam
(currently running only on one machine) as a C node, using
erl_interface, with an sname of "cSSSS@voc1-X" where S is the integer
sectorid and voc1-X is the machine it runs on, obviously. When the
simulation of that sector of space is no longer needed, the process
disconnects and dies.
This works fine, except occasionally a process will fail to connect on
our production cluster. It then aggressively retries every time an
erl_send() is attempted, and after several consecutive failures will
eventually succeed in connecting.
We are logging nodeups and nodedowns to ensure that as sectors start
up and shut down, they are properly connecting and disconnecting; my
initial thought was that maybe a sector is trying to connect as
"c200", for instance, but that name is taken. This doesn't appear to
be the case. A node that has been down for several hours can still
fail to connect. But certain sectors tend to fail more often, or at
least it seems that way to me at a glance.
To get more info, I increased ei_tracelevel during the erl_connect()
call. This is what I see when there is a failure:
ei_xconnect: Wed Jul 4 10:25:04 2007: -> CONNECT attempt to connect to yt
ei_epmd_r4_port: Wed Jul 4 10:25:04 2007: -> PORT2_REQ alive=yt ip=204.11.209.42
ei_epmd_r4_port: Wed Jul 4 10:25:04 2007: <- PORT2_RESP result=0 (ok)
ei_epmd_r4_port: Wed Jul 4 10:25:04 2007: port=4000 ntype=77 proto=0 dist-high=5 dist-low=5
ei_xconnect: Wed Jul 4 10:25:04 2007: -> CONNECT connected to remote
ei_xconnect: Wed Jul 4 10:25:04 2007: -> CONNECT failed
erl_connect: Input/output error
So it's obviously making the TCP connection, getting through send_name
without any errors, and then recv_status is reading a response other
than "sok" but, unhelpfully, doesn't tell me what the response
actually was and just bombs with EIO. This is what led me to believe
beam thinks there's something wrong with the sname.
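For anyone decoding the reply offline (for example from a packet capture), here is a minimal Python sketch of a status-frame parser. It assumes the documented distribution handshake layout: a 2-byte big-endian length, then the tag 's' followed by the status string. The function names are made up for illustration and are not part of erl_interface.

```python
import struct

# Status strings listed in the Erlang distribution handshake documentation.
KNOWN_STATUSES = {
    "ok": "handshake continues",
    "ok_simultaneous": "handshake continues; the peer drops its own pending attempt",
    "nok": "refused; the peer has its own handshake in progress",
    "not_allowed": "refused; connection disallowed",
    "alive": "the peer believes a connection to this node name is already active",
}

def parse_status_frame(frame: bytes) -> str:
    """Return the status string from one length-prefixed status frame."""
    if len(frame) < 3:
        raise ValueError("frame too short to hold a status message")
    (length,) = struct.unpack(">H", frame[:2])
    payload = frame[2:2 + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    if payload[:1] != b"s":
        raise ValueError(f"unexpected message tag {payload[:1]!r}")
    return payload[1:].decode("latin-1")

def describe_status(frame: bytes):
    """Hypothetical helper: pair the status with a short explanation."""
    status = parse_status_frame(frame)
    return status, KNOWN_STATUSES.get(status, "unrecognised status")

print(describe_status(b"\x00\x03sok"))
print(describe_status(b"\x00\x06salive"))
```

Anything other than "ok" (or "ok_simultaneous") means the peer declined to continue the handshake, which would be consistent with the instant failure seen in the trace.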
The server is using R11B2 on FreeBSD, built from /usr/ports.
The clients are Linux using the erl_interface library 3.5.5.3 from
R11B3 currently. (The changelog between 3.5.5.2 and 3.5.5.3 seems
like it wouldn't make any difference here).
I'm going to patch ei_connect.c to log the actual response in
recv_status to get some insight, but in the meantime I would
appreciate any advice. In particular: is there any way to get more
verbose information from Erlang about connection attempts? Something
like net_kernel:monitor_nodes, but lower-level?
I intend to upgrade both client and server to R11B5 at some point in
the near future, but I don't believe that that will solve this issue,
given the changelogs.
-Andy
* * *
> The server is using R11B2 on FreeBSD, built from /usr/ports.
>
> The clients are Linux using the erl_interface library 3.5.5.3 from
> R11B3 currently. (The changelog between 3.5.5.2 and 3.5.5.3 seems
> like it wouldn't make any difference here).
>
This is a brave assumption, and occasionally a wrong one... I suggest that
you do recompile & re-link using 3.5.5.3 if your target is R11B3, if nothing
else, just to eliminate this as a potential cause.
V.
> So it's obviously making the TCP connection, getting through send_name
> without any errors, and then recv_status is reading a response other
> than "sok" but, unhelpfully, doesn't tell me what the response
> actually was and just bombs with EIO. This is what led me to believe
> beam thinks there's something wrong with the sname.
>
I'd suggest reading someone else's suggestions first :-), but if you
really get stuck, try the following:
* Can you add a net_adm:ping/1 to the c-node and get it to periodically
poll the server so you can see how often the problem occurs? This may
help you see correlations with other events (e.g. load, open sockets, etc.).
* Is the problem isolated to Erlang, or are other protocols affected?
* Double-check that the switches between the two machines are locked
at the correct speed, e.g. 100Mbps full-duplex.
* Check that the switches aren't overloaded in some way. I have not had this
particular problem but have had cases where upgrading a switch did wonders.
* Check the interface stats on the switch and computers to see if there
are any errors (CRC, frame errors, etc.).
* Check MTU. Strange, I know, but it has been known to cause problems. Mostly
on WAN links though.
* Check that subnets/netmask/broadcast addresses agree.
* Get Ethereal ( http://www.ethereal.com/ ) and mirror the ports of
interest to another port on the switch. Plug in a laptop with Ethereal on
it and see what is actually on the network.
* Check and replace cables if you're seeing errors on the interfaces.
I know this sounds like overkill and really basic stuff, plus I don't
know how much access you have to your own servers and network, but if the
lower layers aren't working the higher layers can't.
As I said, see what other people on the list have to say. The above is
more a set of random thoughts on general things to try.
Jeff.
> * Double-check that the switches between the two machines are locked
> at the correct speed, e.g. 100Mbps full-duplex.
This is good advice for people who know what they're doing. But it's
prone to misinterpretation by gefingerpokers.
Verifying that the interfaces are running as expected, e.g. that both
ends have the same idea about what they're doing, is good.
Manually meddling with the settings is trouble prone.
The root problem is that IEEE 802.3:2000 autonegotiation and manual
configuration do not play nicely together.
There are three main cases:
1. Both ethernet interfaces (e.g. your server and your switch)
use autonegotiation. No problem.
2. Both interfaces manually configured to the same settings. No problem.
3. One interface is manually configured to 100Mbit/s full duplex
and the other to autonegotiate. This is bad.
In the third case, the autonegotiating end is required to select
100Mbit/s _half_ duplex. But the other end is using full duplex. The
autonegotiating end will now get _late_ collisions, which will cause
packets to get lost, which causes TCP to behave erratically.
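The three cases can be written down as a small Python sketch. It is only a model of the outcome, under these assumptions: both ends run at 100Mbit/s, an autonegotiating end facing a manually configured partner falls back to half duplex, and two autonegotiating ends both support full duplex. The function names are made up for illustration.

```python
def link_duplex(a_mode: str, b_mode: str):
    """Return the duplex each end actually uses.

    Each mode is "auto" (autonegotiate) or a forced "full" / "half".
    """
    valid = {"auto", "full", "half"}
    if a_mode not in valid or b_mode not in valid:
        raise ValueError("mode must be 'auto', 'full' or 'half'")
    if a_mode == "auto" and b_mode == "auto":
        # Case 1: both ends negotiate the best common mode (assumed full).
        return ("full", "full")
    # An autonegotiating end gets no negotiation signal from a forced
    # partner and has to assume half duplex.
    a = "half" if a_mode == "auto" else a_mode
    b = "half" if b_mode == "auto" else b_mode
    return (a, b)

def has_duplex_mismatch(a_mode: str, b_mode: str) -> bool:
    a, b = link_duplex(a_mode, b_mode)
    return a != b

for pair in [("auto", "auto"), ("full", "full"), ("full", "auto")]:
    print(pair, "->", link_duplex(*pair), "mismatch:", has_duplex_mismatch(*pair))
```

Only the last combination, forced full duplex against autonegotiation, ends up with the two sides disagreeing.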
Matthias
> > * Double-check that the switches between the two machines are locked
> > at the correct speed, e.g. 100Mbps full-duplex.
>
> Verifying that the interfaces are running as expected, e.g. that both
> ends have the same idea about what they're doing, is good.
TCP should take care of collisions and it is already established that data is
being exchanged between the two hosts as TCP connect succeeds which requires
SYN, SYN-ACK, ACK. Ethernet is obviously OK.
I would suggest
1) attach strace - you get a mountain of data but you can see what your EIO is
at the socket level.
2) tcpdump - try and get packet trace of the connection; you can use "tcpdump
-w -X -s0 ..." and just log everything and post-process it later.
It could be a firewall or NAT issue; I've seen stateful firewalls get confused
and block connections for a long time.
It is important to find out what is happening around the connection retries
(I'm assuming this means making new TCP connections). I think packet logs are
your most important tool. Maybe you have an intermittent IP address conflict.
It might be another host outside the client and server which is causing the
problem. Might not even be connected to an erlang node - of course the response
won't make sense then.
You'll slap yourself on the forehead when you see it.
This indeed has happened to me, and it is quite frustrating and difficult to
detect for application-level software developers like myself. I am no
network expert by any stretch of the imagination, but I have learned a lot
about TCP just because I've had to prove to our network services team that
problems are network issues rather than application. In a recent case (a
few weeks ago), two separate networks were joined by a firewall, but to me
they appeared as one network, so I was not even aware that there was a
firewall in between. Anyway, the firewall didn't really like long periods
of inactivity on a TCP connection and would cause errors and then block
future connections for some reason. I don't know the specifics, I only know
that once one of the network guys figured it out, it magically fixed itself.
David
Just to clarify something, though: there is no real I/O error.
erl_interface sets errno to EIO if *anything* goes slightly wrong --
in this case, simply because the remote end sent something in response
other than an "sok" packet. It is not timing out; it fails instantly,
so as you said, the TCP connection is obviously correctly established.
There are also several other connections to beam from different
cnodes on the same machine at the time of the failure.
So the only possibility of a firewall issue is if a firewall were
resetting connections right after they were established, which I've
never heard of before. The firewall rules are not stateful among the
cluster nodes, anyway, and there is no NAT.
I also discovered that I was indeed using 3.5.5.2 on the client and
R11B-2 on the server.
strace - well, a mountain of data is an understatement; it would give
me more like a planet of data. The whole cluster has to be running
for at least a day or so before this shows up for the first time, and
it can happen on any machine, after thousands of simultaneous
player-hours. tcpdump on the erlang port is feasible though and I'll
do that next time.
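Restricting the capture to the distribution traffic keeps the volume manageable. The Python sketch below only assembles a tcpdump command line and does not run it. The host and port are the ones from the trace earlier in the thread (204.11.209.42, port 4000) and 4369 is epmd's default port. The interface name, the output file name and the function name are placeholders.

```python
import shlex

EPMD_PORT = 4369  # epmd's default port

def tcpdump_argv(iface, host, dist_port, outfile, include_epmd=True):
    """Build a tcpdump argument list limited to one host and the Erlang ports."""
    ports = [dist_port] + ([EPMD_PORT] if include_epmd else [])
    port_expr = " or ".join(f"port {p}" for p in ports)
    capture_filter = f"tcp and host {host} and ({port_expr})"
    # -n: no name resolution, -s 0: capture whole packets, -w: write raw pcap
    return ["tcpdump", "-i", iface, "-n", "-s", "0", "-w", outfile, capture_filter]

# "eth0" and "erl-dist.pcap" are hypothetical values.
argv = tcpdump_argv("eth0", "204.11.209.42", 4000, "erl-dist.pcap")
print(shlex.join(argv))
```

The resulting file can be post-processed later, so the capture can stay running until the failure shows up again.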
The important missing piece of data is what the actual response from
beam is when the connection failed, and I am now logging that, but
haven't seen the problem recur yet.
During the time that the erlang connection fails, other connections
continue to work and the rest of the system proceeds normally. The
simulator process that tried to connect gives up each time and goes
about its business for a while before attempting to reconnect to the
erlang part of the system, and each connection attempt either succeeds
or fails within a second.
Another interesting thing to note is that I was manufacturing process
ids from the cnode which were referenced by other processes in the
system, and there was a bug where those processes wouldn't stop
sending messages to them after the cnode went down. This bug has now
been fixed, but I wonder if that had anything to do with it. Guess
I'll find out soon.
Matthias > Verifying that the interfaces are running as expected, e.g. that both
Matthias > ends have the same idea about what they're doing, is good.
Chasing possible ethernet problems is not the first thing I'd do if
presented with the evidence Andy gave.
I posted because a small part of Jeff's advice (above) is prone to
misinterpretation. And also because my calling in life is to prevent
people from making that particular mistake.
Richard Andrews writes:
> TCP should take care of collisions and it is already established
> that data is being exchanged between the two hosts as TCP connect
> succeeds which requires SYN, SYN-ACK, ACK. Ethernet is obviously
> OK.
This is misleading.
Collisions are part of the normal operation of a half-duplex
ethernet. TCP does not take care of them. Collisions are normally
taken care of by ethernet itself (by retransmission*).
Late collisions are not a normal part of ethernet. Late collisions
result in lost packets. TCP will take care of packet losses, but
throughput will often suffer dramatically. Timeouts chosen to be
reasonable for a LAN will also be exceeded.
Getting a successful TCP connection is a necessary condition for
having an OK ethernet, but it's not sufficient.
The rest of your advice is pretty good.
Matthias
* There is a limit to the number of retransmission attempts. Exceeding
that limit is unusual.
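To put numbers on that limit, here is a minimal Python sketch of the standard half-duplex backoff rule (truncated binary exponential backoff). The constants are the usual CSMA/CD values for 10 and 100Mbit/s ethernet: a slot time of 512 bit times, a backoff exponent capped at 10, and the frame given up after 16 attempts. The function names are made up for illustration.

```python
SLOT_BIT_TIMES = 512   # one slot time, in bit times
ATTEMPT_LIMIT = 16     # the frame is discarded after this many collisions
BACKOFF_LIMIT = 10     # the backoff exponent stops growing here

def max_backoff_slots(collisions: int):
    """Largest backoff, in slot times, after the given number of collisions.

    Returns None once the attempt limit is reached and the frame is dropped.
    """
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= ATTEMPT_LIMIT:
        return None
    return 2 ** min(collisions, BACKOFF_LIMIT) - 1

def slot_time_us(mbit_per_s: float) -> float:
    """Duration of one slot time in microseconds at the given link speed."""
    return SLOT_BIT_TIMES / mbit_per_s

for n in (1, 2, 10, 15, 16):
    print(n, max_backoff_slots(n))
print("slot time at 100Mbit/s:", slot_time_us(100), "us")
```

The same 512 bit times also mark the boundary for a late collision: a collision detected after that point in a frame is "late" and is not retried by the ethernet layer.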
I wouldn't consider it my calling in life:-), but it's a pet peeve for
me too (for a number of reasons) - and I think your "causes TCP to
behave erratically" is quite an understatement. The type of broken setup
you're describing can effectively kill a network segment when the
traffic load is high, while it seems to work just fine in light testing
- extremely nasty. And the amount of myth propagation or downright
misinformation is immense - just earlier today I happened to be googling
for how to configure speed/duplex on Solaris (it seems to be arguably
even more bizarre than Linux in this area), and on one of the first hits
I found this:
"Solaris is often unable to correctly auto-negotiate duplex settings
with a link partner (e.g. switch), especially when the switch is set to
100Mbit full-duplex."
Sigh...
--Per Hedeland