ssh (and other daemons) die after period of time


Jeff Ballard

Sep 15, 2017, 12:12:57 PM
to cloudlab-users
Hello,

In my experiment (SIGMOD-paper-1b), I'm trying to start a 10 node Hadoop cluster on physical machines (it's a rather large experiment :) ) using the Wisconsin cluster.

Unfortunately I'm running into a problem where, after a period of time, my worker nodes go missing from Hadoop and sshd is no longer running.  For example:
$ ssh slave3
ssh: connect to host slave3 port 22: Connection refused


Looking on the console log, I see the following:

... normal bootup messages ...
[   26.015237] testbed[1487]: Booting up vnodes
[   26.071022] testbed[1487]: Booting up subnodes
[   26.118061] testbed[1487]: No subnodes. Exiting gracefully ...

slave3 login: [36463.579374] systemd[1]: Failed to start Getty on tty1.
[36463.585306] systemd[1]: Failed to start udev Kernel Device Manager.
[36463.598393] systemd[1]: Failed to start Journal Service.
[36463.616533] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.Accounts': Device or resource busy
[36463.629382] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.locale1': Device or resource busy
[36463.642016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.login1': Device or resource busy
[36463.654733] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.hostname1': Device or resource busy
[36463.667765] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.timedate1': Device or resource busy
[36488.640997] systemd[1]: Failed to register name: Connection timed out
[36488.648213] systemd[1]: Failed to set up API bus: Connection timed out
[36488.657643] systemd[1]: Failed to start Login Service.

I have been able to attach to the console (sometimes) and, as root, run "systemctl restart ssh", and ssh works... for a little while.

In my troubleshooting, I decided that before going to bed last night I would just launch the Hadoop profile and let it sit overnight to see if the problem would appear.  It did not, so I'm led to believe that it isn't something inherent to the Hadoop profile.

... but this then leads me to think it's something I'm doing.

Of note:
1. I do need to change some of the Hadoop default DFS parameters, so I do copy down new files and restart Hadoop.
2. Hadoop likes to ssh to slave nodes as root, and I figured perhaps this was triggering some sort of security thing, so I even tried replacing that with an "ssh as me, then sudo" (roughly along the lines of the sketch below).  That still caused the nodes to crash out.
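
The wrapper I used was essentially this (a minimal sketch; the script name and the commands I actually push through it are illustrative, not verbatim):

#!/bin/sh
# run-on-node: log in to a node as my own user, then escalate with sudo,
# instead of ssh'ing in directly as root.
# Usage: run-on-node <node> <command...>
node="$1"; shift
ssh "${USER}@${node}" sudo "$@"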

Could someone help me understand what is going on here?  What is trying to restart that suite of services, and why are those restarts failing?  This has the feel of something written for /etc/init.d scripts trying to fight with systemd, but I don't even know what it is... and if nothing else I'd even accept turning it off.  I can't even begin to make forward progress on my slate of experiments until the nodes stop dying on me...

Thanks,

-Jeff

Mike Hibler

Sep 15, 2017, 1:10:27 PM
to Jeff Ballard, cloudlab-users, Gary Wong
I notice it is only the slave nodes that have died. I also notice they are
not running any of the emulab daemons either. There seems to be some sort
of mass-extinction event occurring on those nodes.

Most global catastrophes can be traced back to systemd, so I suspect your
instincts are correct.

Gary Wong is the closest thing we have to a hadoop expert, maybe he can
provide some clarity.

Mike Hibler

Sep 15, 2017, 1:58:58 PM
to Jeff Ballard, cloudlab-users, Gary Wong
I thought it might be the OOM process killer too, but I saw no evidence in
any of the logs for that. I manually started up an sshd on "slave2" and it
has been running since.

Do you have any idea what /var/cleaner is:

root@slave0:/var# ls -latc /var/cleaner
-rwsr-xr-x 1 root blocking-PG0 8864 Sep 15 10:18 /var/cleaner

setuid root programs in non-standard places make me nervous.

Jeff Ballard

Sep 15, 2017, 7:44:17 PM
to Mike Hibler, cloudlab-users, Gary Wong
Yeah, I've only been running really tiny jobs, so I would be pretty shocked if it OOM'd on me with these (that's a consideration for when I run my big jobs).

/var/cleaner is the following C program:

#include <fcntl.h>      /* open(), O_WRONLY, O_TRUNC */
#include <stdlib.h>     /* system() */
#include <unistd.h>     /* seteuid(), setuid(), write(), close() */

int main()
{
   /* The binary is setuid root, so take on root for real. */
   seteuid( 0 );
   setuid( 0 );

   /* Flush dirty pages, then cycle swap to empty it out. */
   system( "/bin/sync" );
   system( "/sbin/swapoff -a" );
   system( "/sbin/swapon -a" );

   /* Drop the page cache, dentries, and inodes. */
   int fd = open( "/proc/sys/vm/drop_caches", O_WRONLY | O_TRUNC );
   write( fd, "3\n", 2 );
   close( fd );

   return 0;
}


...basically it flushes swap and tells the kernel to clear out its caches so that each data point in my runs starts with the same (relatively) cold machine.  It also runs on the machine I'm submitting the jobs from (currently the resourcemanager).
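
For what it's worth, the shell equivalent of what the binary does is roughly (run as root):

sync                                  # flush dirty pages to disk
swapoff -a && swapon -a               # empty out swap
echo 3 > /proc/sys/vm/drop_caches     # drop page cache, dentries, and inodes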

I'm still at a loss for what could be causing it.  I'll try reloading my experiment again and keep prodding at it.  If anyone has ideas, I'm quite eager to hear them.

-Jeff



Jeff Ballard

Sep 16, 2017, 11:45:53 PM
to Mike Hibler, cloudlab-users, Gary Wong
TL;DR: The Hadoop profile ships a substantially out-of-date Ubuntu Xenial.  apt-get update; apt-get dist-upgrade seems to fix the problems.

Details:  I'll spare everyone my epistle of systemd/dbus debugging hate.  Ultimately, after a whole bunch of prodding (in which I was able to refine my ability to make the mass-extinction meteor fall quickly), I stumbled across similar mass-extinction-causing events in systemd bug reports... which led me to see how out of date the cloudlab Hadoop profile was.  dist-upgrade took systemd from 229-4ubuntu10 to 229-4ubuntu19; its partner in crime dbus from 1.10.6-1ubuntu3.1 to 1.10.6-1ubuntu3.3; and a whole lot of other updates.  After applying the updates and rebooting all the nodes, Hadoop seems to run as I'd expect.
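
In case anyone else hits this with the same profile, what I did on the nodes boiled down to something like this (a rough sketch; the node list is illustrative, and it uses the same ssh-as-me-then-sudo approach as above):

for node in resourcemanager slave0 slave1 slave2; do
    ssh "${USER}@${node}" 'sudo apt-get update && sudo apt-get -y dist-upgrade && sudo reboot'
done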

Ultimately this solution is rather unfulfilling, since I didn't find an actual root cause, and I'm still worried that the meteor is biding its time, waiting to strike.  In other words, while I have a lot of ideas about what was going wrong, I'm not sure exactly what was wrong, or what the updates fixed.  Certainly not a super fun ending after several days of debugging :/

-Jeff

Gary Wong

Sep 18, 2017, 12:24:29 PM
to Jeff Ballard, Mike Hibler, cloudlab-users
On Sun, Sep 17, 2017 at 03:45:41AM +0000, Jeff Ballard wrote:
> apt-get update; apt-get dist-upgrade seems to fix the problems.

Thanks very much for the debugging... I'll update the profile
accordingly.

Gary.
--
Gary Wong g...@flux.utah.edu http://www.cs.utah.edu/~gtw/