LAN discovery does not work on a desktop computer

211 views
Skip to first unread message

Marek Wieczorek

unread,
Sep 21, 2016, 3:30:07 AM9/21/16
to KryoNet users
Hello,

I have two desktop computers (Ubuntu 16.04) and am using them to test LAN discovery functionality, by running a simple server and a simple client (source code shown below). On one of them (Computer A) , everything works like a charm; one the other one (Computer B) the client cannot find the server. Sample execution logs:

------------------------------------- Computer A (WORKING) -------------------------------------
Computer A$ run_server
...
00:05 TRACE: [kryo] Read class 13: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:05 DEBUG: [kryo] Read: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:05 TRACE: [kryo] Object graph complete.
00:05 DEBUG: [kryonet] Responded to host discovery from: /192.168.0.112:45778
00:05 DEBUG: [kryo] Read: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:05 TRACE: [kryo] Object graph complete.
00:05 DEBUG: [kryonet] Responded to host discovery from: /127.0.0.1:45778
00:05 DEBUG: [kryo] Read: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:05 TRACE: [kryo] Object graph complete.
00:05 DEBUG: [kryonet] Responded to host discovery from: /127.0.0.1:45778

Computer A$ run_client
...
00:00 TRACE: [kryo] Write class 13: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:00 DEBUG: [kryo] Write: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:00 TRACE: [kryo] FieldSerializer.write fields of class: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:00 TRACE: [kryo] Object graph complete.
00:00 DEBUG: [kryonet] Broadcasted host discovery on port: 54777
00:00  INFO: [kryonet] Discovered server: /192.168.0.112
00:00  INFO: [kryonet] Discovered server: /127.0.0.1
00:00  INFO: [kryonet] Discovered server: /127.0.0.1
00:05  INFO: [kryonet] Host discovery timed out.
Server 0 address: /192.168.0.112
Server 1 address: /127.0.0.1
Server 2 address: /127.0.0.1

----------------------------------- Computer B (NOT WORKING-----------------------------------
Computer B$ run_server
...
00:05 TRACE: [kryo] Read class 13: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:05 DEBUG: [kryo] Read: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:05 TRACE: [kryo] Object graph complete.
00:05 DEBUG: [kryonet] Responded to host discovery from: /192.168.0.102:52415
00:05 DEBUG: [kryo] Read: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:05 TRACE: [kryo] Object graph complete.
00:05 DEBUG: [kryonet] Responded to host discovery from: /127.0.0.1:52415
00:05 DEBUG: [kryo] Read: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:05 TRACE: [kryo] Object graph complete.
00:05 DEBUG: [kryonet] Responded to host discovery from: /127.0.0.1:52415

Computer B$ run_client
...
00:00 TRACE: [kryo] Write class 13: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:00 DEBUG: [kryo] Write: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:00 TRACE: [kryo] FieldSerializer.write fields of class: com.esotericsoftware.kryonet.FrameworkMessage$DiscoverHost
00:00 TRACE: [kryo] Object graph complete.
00:00 DEBUG: [kryonet] Broadcasted host discovery on port: 54777
00:05  INFO: [kryonet] Host discovery timed out.
-------------------------------------------------------------------------------------------------

Does anyone know why discovery may not work on Computer B? There is nothing special in its configuration and the firewall is not enabled. Interestingly, the server works fine on Computer B and a client from Computer A can find it; it's only the client that doesn't work. (In the example above, both the server and the client are running on the same computer, but the same happens when the server is running on another computer.) I also experimented with different port numbers and connection timeout values, always with the same result.

Regards,
Marek

----------------------------------------- Server code -------------------------------------------
Log.set(Log.LEVEL_TRACE);
Server server = new Server();
server.start();
server.bind(54555, 54777);

----------------------------------------- Client code -------------------------------------------
Log.set(Log.LEVEL_TRACE);
Client client = new Client();
List<InetAddress> addresses = client.discoverHosts(54777, 5000);
for (int i = 0; i < addresses.size(); ++i) {
  final InetAddress address = addresses.get(i);
  System.out.println("Server " + i + " address: " + address);
}
client.close();

Marek Wieczorek

unread,
Sep 22, 2016, 4:15:23 AM9/22/16
to KryoNet users
Things started getting interesting. It appears that the problem only happens on kernel 4.8.0 (4.8.0-040800rc4 to be more precise). When tested on kernel versions <= 4.4.0-38, everything works fine.

I'm curious if anyone else experienced the same behavior. The problem is fully reproducible on my computer.

Marek

Joachim Durchholz

unread,
Sep 22, 2016, 10:12:27 AM9/22/16
to kryone...@googlegroups.com
Am 22.09.2016 um 10:15 schrieb Marek Wieczorek:
> Things started getting interesting. It appears that the problem only
> happens on kernel 4.8.0 (4.8.0-040800rc4 to be more precise). When tested
> on kernel versions <= 4.4.0-38, everything works fine.
>
> I'm curious if anyone else experienced the same behavior. The problem is
> fully reproducible on my computer.

That's good, it means you have a chance to find out what's the
difference. Try ltrace or ktrace to see it.
That and reading the associated API documentation should tell you if
it's a kernel bug, or something that the callers have always been doing
wrong but got away with until 4.4.0-38.
The "-38" indicates it's a patchlevel applied by your distro, the
official Linux kernel is currently at 4.7.4. Such distro patches
typically apply security updates that are found between (in this case)
4.4.0 and 4.7.4, and they might have introduced a bug in networking.
So the potential culprits are:
- Linux kernel itself (unlikely if 4.4.0-37 worked)
- Distro, most likely the security team
- JVM
- JDK
- Kryo

Marek Wieczorek

unread,
Sep 26, 2016, 7:42:26 AM9/26/16
to kryonet-users
First a correction: the first version that doesn't work is 4.8.0-040800rc4 (4.4.0-38 works fine).

I managed to reproduce the same problem on another computer, where I installed kernel 4.8.0-040800rc4, and also for another Java version (java-1.8.0-openjdk-amd64 instead of java-8-oracle).
Running ltrace hasn't given me any clue (although I have no experience in using this tool, so I might have not enabled some important options):

server log (execution interrupted manually after the experiment):
__libc_start_main(0x400620, 4, 0x7ffc53564138, 0x4006d0 <unfinished ...>
JLI_Launch(4, 0x7ffc53564138, 1, 0 <no return ...>
--- SIGINT (Interrupt) ---
+++ exited (status 130) +++

client log:
__libc_start_main(0x400620, 4, 0x7ffcd316e218, 0x4006d0 <unfinished ...>
JLI_Launch(4, 0x7ffcd316e218, 1, 0)                                   = 0
+++ exited (status 0) +++

I'd be very happy to do some more debugging, although I have a very limited experience with this kind of problems. Right now I'm quite convinced that the problem can be easily reproduced on other computers.

Marek

Joachim Durchholz

unread,
Sep 26, 2016, 9:43:25 AM9/26/16
to kryone...@googlegroups.com
Am 26.09.2016 um 13:42 schrieb Marek Wieczorek:
> First a correction: the first version that doesn't work is *4.8.0-040800rc4*
> (4.4.0-38 works fine).
>
> I managed to reproduce the same problem on another computer, where I
> installed kernel 4.8.0-040800rc4, and also for another Java version
> (java-1.8.0-openjdk-amd64 instead of java-8-oracle).
> Running ltrace hasn't given me any clue (although I have no experience in
> using this tool, so I might have not enabled some important options):
>
> server log (execution interrupted manually after the experiment):
> __libc_start_main(0x400620, 4, 0x7ffc53564138, 0x4006d0 <unfinished ...>
> JLI_Launch(4, 0x7ffc53564138, 1, 0 <no return ...>
> --- SIGINT (Interrupt) ---
> +++ exited (status 130) +++
>
> client log:
> __libc_start_main(0x400620, 4, 0x7ffcd316e218, 0x4006d0 <unfinished ...>
> JLI_Launch(4, 0x7ffcd316e218, 1, 0) = 0
> +++ exited (status 0) +++

Hmm... "ltrace java -help" wouldn't work for me, it would run into SIGILL.
Not sure what's going on, but it seems that the JVM is doing things that
can throw off ltrace, maybe the other tool (strace) is better.

strace is tracing calls to the kernel, not to libraries, so it can give
a complete picture but might be slightly lower-level.
You'll want to make sure that it follows all threads and child
processes; there's an option for that.
You might want to limit its output to networking-related calls; the
option for that it described on the man page.

> I'd be very happy to do some more debugging, although I have a very limited
> experience with this kind of problems.

We all started learning at "very limited experience" :-)

> Right now I'm quite convinced that
> the problem can be easily reproduced on other computers.

Quite possible.
However, not everybody is able to easily switch kernels. I'd have to
unravel more of my distro's setup than I care fore, for example, plus my
/boot partition is pretty small by today's standards.
I guess willingness to reproduce will go up by a large margin once you
can present traces and say "here's where the difference starts".

HTH
Jo

Marek Wieczorek

unread,
Sep 27, 2016, 6:13:27 AM9/27/16
to kryonet-users
It's not that I'm not willing to contribute to the solution of the problem; it's just that the learning curve of a new technology appeared to be a bit steeper than I expected, and I truly appreciate any suggestion as to the direction in which to investigate, for everyone's sake :) Reproducing the issue and making tests is not a problem for me.

I did two more experiments, using strace and wireshark. The results are attached to the email.
Files kryo_server.log and kryo_client.log contain kryo logs for the server and the client, respectively.
Similarly, files execute_server.log and execute_client.log contain strace logs.
File wireshark.pcapng contains the wireshark trace of the network traffic during the experiment.
Directories "good" and "bad" contain logs for the successful execution (kernel 4.4.0-38) and the unsuccessful execution (kernel 4.8.0-040800rc4), respectively.

Kryo logs confirm that the problem does exist.
strace logs don't give me any clue. They look pretty much the same, except for memory addresses and ids, which are specific for each execution. Moreover, system calls stop before the actual communication process starts.
Also wireshark logs don't seem to contain anything interesting (these ARP packages seem to be a random garbage, as they were not present in every test run I performed). Both traces (the successful and the unsuccessful execution) show UDP packages flying both ways (client->server, server->client).

I'm curious if anyone else would be able to reproduce the same problem, or there is something peculiar to my setup that makes the discovery process fail (although it happens on two different computers independently). Or maybe the way I'm perfoming LAN discovery is incorrect? (the source code is in the first email of this thread)
I'd also be happy to perform some more experiments, if I got an idea what to test.

Marek
discovery-problem.tgz

Joachim Durchholz

unread,
Sep 27, 2016, 5:01:39 PM9/27/16
to kryone...@googlegroups.com
Am 27.09.2016 um 12:13 schrieb Marek Wieczorek:
> strace logs don't give me any clue. They look pretty much the same, except
> for memory addresses and ids, which are specific for each execution.
> Moreover, system calls stop before the actual communication process starts.

By default, strace following only the main process, and possibly the
main thread.
There are options to change that behaviour; did you try them?

> Also wireshark logs don't seem to contain anything interesting (these ARP
> packages seem to be a random garbage, as they were not present in every
> test run I performed). Both traces (the successful and the unsuccessful
> execution) show UDP packages flying both ways (client->server,
> server->client).

ARP is just machines trying to find out which MAC address is associated
with a given IP address, this is a very low-level part of the networking
protocols.
You can safely ignore these.

You can also ignore all packages except those that have the server as
source and the client as target, or vice versa.

Wireshark is good for finding out what data goes over the wire when, but
not so good at finding out why.

> I'm curious if anyone else would be able to reproduce the same problem, or
> there is something peculiar to my setup that makes the discovery process
> fail (although it happens on two different computers independently). Or
> maybe the way I'm perfoming LAN discovery is incorrect? (the source code is
> in the first email of this thread)

You're using a part of the Kryo API that I do not know.
I took a (very) short look at the code, it seems that the parameters are
two integers -> there's a potential for mixing up the parameter order,
so better check that and maybe debug until the code hits JDK networking
to see whether some intermediate software layer is mixing up parameter
values.
It's a shot in the dark though.

The best option is making strace work for you; it should give you a
definitive clue, particularly if you also compare return codes.

Run the outputs through a good diff tool that highlights not just
diverging lines but also diverging characters. Eclipse's "compare with
-> each other" is pretty good for that task actually, though it may be a
bit slow.
It may help to filter out dates, or replaces memory addresses with IDs.
Read the diff from top to bottom, and whenever some recurring noise
makes the diff too hard to read, do a search-and-replace - Eclipse's
search&replace in regex mode supports backreferences, i.e. it can do
quite a lot without forcing you into writing filter programs.

HTH
Jo

Marek Wieczorek

unread,
Sep 29, 2016, 7:26:02 AM9/29/16
to kryonet-users
El martes, 27 de septiembre de 2016, 23:01:39 (UTC+2), toolforger escribió:
Am 27.09.2016 um 12:13 schrieb Marek Wieczorek:
> strace logs don't give me any clue. They look pretty much the same, except
> for memory addresses and ids, which are specific for each execution.
> Moreover, system calls stop before the actual communication process starts.

By default, strace following only the main process, and possibly the
main thread.
There are options to change that behaviour; did you try them?

Thanks, that was helpful. Running strace with option -f gave a lot more verbose output (see files execute_*_f.log in the attachment).
Successful runs produce output similar to:

16925 write(1, "00:00 DEBUG: [kryonet] Broadcast"..., 64) = 64
16925 write(1, "\n", 1)                 = 1
16925 poll([{fd=25, events=POLLIN|POLLERR}], 1, 5000 <unfinished ...>
16935 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
16935 futex(0x7f19cc0bdc28, FUTEX_WAKE_PRIVATE, 1) = 0
16935 futex(0x7f19cc0bdc54, FUTEX_WAIT_BITSET_PRIVATE, 1, {43343, 257057149}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
16935 futex(0x7f19cc0bdc28, FUTEX_WAKE_PRIVATE, 1) = 0
16935 futex(0x7f19cc0bdc54, FUTEX_WAIT_BITSET_PRIVATE, 1, {43343, 307359659}, ffffffff <unfinished ...>
16925 <... poll resumed> )              = 1 ([{fd=25, revents=POLLIN}])
16925 recvfrom(25, "", 0, 0, {sa_family=AF_INET6, sin6_port=htons(54777), inet_pton(AF_INET6, "::ffff:192.168.0.112", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
16925 write(1, "00:00  INFO: [kryonet] Discovere"..., 56) = 56
16925 write(1, "\n", 1)                 = 1

while unsuccessful runs produce something like:

2765  write(1, "00:00 DEBUG: [kryonet] Broadcast"..., 64) = 64
2765  write(1, "\n", 1)                 = 1
2765  poll([{fd=25, events=POLLIN|POLLERR}], 1, 5000 <unfinished ...>
2775  <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
2775  futex(0x7fe6440bdd28, FUTEX_WAKE_PRIVATE, 1) = 0
2775  futex(0x7fe6440bdd54, FUTEX_WAIT_BITSET_PRIVATE, 1, {533, 20035901}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
2775  futex(0x7fe6440bdd28, FUTEX_WAKE_PRIVATE, 1) = 0
  waiting goes on ...
2765  <... poll resumed> )              = 0 (Timeout)

(everything in spite of the server producing valid responses)
The above seems to indicate that the kernel fails to respond correctly to the poll system call (a kernel bug?).

Anyway, in version 4.8.0-040800rc8 everything started working again. Let's hope that it was just a transitory Linux kernel problem, which was fixed and didn't go to the official release.
I'll stop the investigation for now. Thanks for an interesting conversation!

Marek
discovery-problem.tgz

Joachim Durchholz

unread,
Sep 29, 2016, 7:48:29 AM9/29/16
to kryone...@googlegroups.com
Am 29.09.2016 um 13:26 schrieb Marek Wieczorek:
> Anyway, in version 4.8.0-040800rc*8* everything started working again.
> Let's hope that it was just a transitory Linux kernel problem, which was
> fixed and didn't go to the official release.

Heh. I guess that rc kernels are release candidates, not official
kernels, so you test these to identify problems in new kernel versions
before they get released to the general public.
Kudos for doing that kind of testing, not everybody has the
infrastructure and setup to do that. (I certainly don't, but I'd like to
have it.)

If you find that a RC candidate fails, the Linux kernel crowd is a good
place to turn to. One of Linus' mantras is: "If it's a regression, we
roll back what caused them, no excuses." So if you see a regression like
the one you had here, reporting it to the appropriate subproject should
give you a good chance of having this fixed before that kernel hits the
streets.

Good to know things are working for you :-)
Reply all
Reply to author
Forward
0 new messages