Hello,
In my experiment (SIGMOD-paper-1b), I'm trying to start a 10-node Hadoop cluster on physical machines (it's a rather large experiment :) ) using the Wisconsin cluster.
Unfortunately I'm running into a problem: after a period of time, my worker nodes go missing from Hadoop, and sshd is no longer running on them. For example:
$ ssh slave3
ssh: connect to host slave3 port 22: Connection refused
Looking at the console log, I see the following:
... normal bootup messages ...
[ 26.015237] testbed[1487]: Booting up vnodes
[ 26.071022] testbed[1487]: Booting up subnodes
[ 26.118061] testbed[1487]: No subnodes. Exiting gracefully ...

slave3 login: [36463.579374] systemd[1]: Failed to start Getty on tty1.
[36463.585306] systemd[1]: Failed to start udev Kernel Device Manager.
[36463.598393] systemd[1]: Failed to start Journal Service.
[36463.616533] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.Accounts': Device or resource busy
[36463.629382] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.locale1': Device or resource busy
[36463.642016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.login1': Device or resource busy
[36463.654733] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.hostname1': Device or resource busy
[36463.667765] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.timedate1': Device or resource busy
[36488.640997] systemd[1]: Failed to register name: Connection timed out
[36488.648213] systemd[1]: Failed to set up API bus: Connection timed out
[36488.657643] systemd[1]: Failed to start Login Service.
I have been able to attach to the console (sometimes) and, as root, run "systemctl restart ssh", and ssh works... for a little while.
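For what it's worth, this is roughly what I've been running from the console when I can get in (the unit may be named ssh or sshd depending on the image):

$ sudo systemctl status ssh      # confirm sshd is actually down
$ sudo journalctl -u ssh -b      # sshd's log messages for this boot
$ sudo systemctl restart ssh     # brings it back... for a little while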
In my troubleshooting, before going to bed last night I decided to just launch the Hadoop profile and let it sit overnight, untouched, to see whether the problem would appear on its own. It did not, so I'm led to believe that it isn't something inherent to the Hadoop profile.
... but that then leads me to think it's something I'm doing.
Of note:
1. I do need to change some of the default Hadoop DFS parameters, so I copy down new config files and restart Hadoop (roughly as sketched after this list).
2. Hadoop likes to ssh to the slave nodes as root, and I figured perhaps this was triggering some sort of security mechanism, so I even tried replacing that with an "ssh as me, then sudo" wrapper (also sketched below). That still caused the nodes to crash out.
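For (1), this is roughly what I do from the master; the Hadoop install path and slaves file location here are from my setup and may differ on other images:

$ for n in $(cat /opt/hadoop/etc/hadoop/slaves); do scp hdfs-site.xml "$n":/opt/hadoop/etc/hadoop/; done
$ stop-dfs.sh && start-dfs.sh    # restart HDFS so the new parameters take effect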
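And for (2), the substitution was essentially this (the user and the command are placeholders, not my exact invocation):

$ ssh root@slave3 '<hadoop command>'          # what Hadoop does by default
$ ssh jeff@slave3 'sudo <hadoop command>'     # what I tried instead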
Could someone help me understand what is going on here? What is trying to restart that suite of services, and why is it failing? This has the feel of something written as /etc/init.d scripts fighting with systemd, but I don't even know what it is... and if nothing else, I'd happily just turn it off. I can't make forward progress on my slate of experiments until the nodes stop dying on me...
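If it would help the diagnosis, the next time I catch a node in this state I can grab output like the following (assuming journald is still answering):

$ systemctl list-units --failed    # everything stuck in a failed state
$ journalctl -b -p err             # error-level messages from this boot
$ pstree -p 1                      # see what PID 1 has actually spawned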
Thanks,
-Jeff