Having trouble with passwordless SSH and NFS

Jeffrey Layton

May 28, 2017, 11:10:20 AM
to ClusterHAT
Good morning,

I tried to configure passwordless ssh to all of the nodes in the ClusterHAT and ran into some problems. I set up the ssh keys on the controller without a passphrase and then ran the command recommended on 8086.net (https://8086.support/content/23/83/en/how-do-i-create-and-copy-the-ssh-key-from-the-controller-pi-to-the-pi-zeros-to-allow-passwordless-logins.html). Here is the output:


pi@controller:~/.ssh $ for I in 1 2 3 4; do echo "Copying to p$I.local";ssh-copy-id pi@p$I.local;done
Copying to p1.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname p1.local: Name or service not known

Copying to p2.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname p2.local: Name or service not known

Copying to p3.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname p3.local: Name or service not known

Copying to p4.local
The authenticity of host 'p4.local (192.168.1.99)' can't be established.
ECDSA key fingerprint is 3b:10:cd:84:68:21:36:82:61:c8:d2:b6:bb:e1:05:7d.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.


For some reason it doesn't like the ".local" extension. So I took that off the command and it seemed to work. Any ideas why it doesn't like ".local"?

Also, when I ssh to a node such as p1, I can't do a passwordless ssh from there to any other p node. Any thoughts? (I usually just NFS-export /home from the "controller" to the "p" nodes, but I'm having problems with that too; it may be related to the ".local" issue.)

TIA!

Jeff

Chris Burton

May 28, 2017, 12:15:42 PM
to ClusterHAT
Hi, 
> For some reason it doesn't like the ".local" extension. So I took that off the command and it seemed to work. Any ideas why it doesn't like ".local"?

If you can't log into them using "ssh p...@p1.local" then I would guess they haven't been powered on long enough to boot up properly, or they haven't joined the bridge correctly.

Depending on what the Pi Zero is doing whilst booting, it can take longer than a minute to come up, so it might be worth waiting a little longer before trying to copy the keys. If you've waited long enough, I'd advise checking that the network interfaces exist ("ifconfig -a" should show the ethpi1 interface) and that they've been added to the bridge correctly ("brctl show" should list br0 with eth0/ethpi1). Once the Zeros have booted you should be able to ping them ("ping -c1 p1.local"), and a few seconds after they answer pings you should be able to reach them via SSH ("ssh p...@p1.local"); the same goes for the other Pi Zeros. Once you can SSH in with "ssh p...@p1.local", you should be able to copy the keys over.
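The wait-then-check sequence above can be scripted so the key copy is only attempted once a node is actually reachable. This is a minimal sketch, not part of the ClusterHAT tools: `wait_for_host` is a hypothetical helper, it assumes the stock iputils `ping` (for `-W`), and the defaults of 30 attempts 10 seconds apart (roughly five minutes) are arbitrary.

```shell
#!/bin/sh
# wait_for_host HOST [TRIES] [DELAY]
# Hypothetical helper: ping HOST once per attempt until it answers,
# giving up after TRIES attempts spaced DELAY seconds apart.
# Returns 0 as soon as HOST answers, 1 if it never did.
wait_for_host() {
  local host="$1" tries="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$tries" ]; do
    # ping fails both while the .local name will not resolve and while
    # the node does not reply yet, so one check covers both boot stages.
    if ping -c1 -W2 "$host" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Used inside the loop from the original post, `for I in 1 2 3 4; do wait_for_host "p$I.local" && ssh-copy-id "pi@p$I.local"; done` skips the copy for any Zero that never became reachable instead of failing with "Could not resolve hostname".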
  
> Also, when I ssh to a node such as p1, I can't do a passwordless ssh to any other p node. Any thoughts?

This is to be expected, as you'd need to create a new key pair on p1, copy its public key to controller/p2/p3/p4, and then repeat that for the other Pi Zeros.
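A rough sketch of that all-to-all copy, run from the controller. `mesh_keys` is a hypothetical helper; it assumes the `pi` user on every machine, that every host name you pass resolves from every other host, and that you are at the keyboard to type each destination's password (and confirm unknown host keys) when `ssh-copy-id` prompts.

```shell
#!/bin/sh
# mesh_keys HOST...
# Hypothetical helper, run from the controller. For every HOST:
#   1. log in and create a passphrase-less RSA key pair if none exists;
#   2. from that HOST, push its public key to every other HOST.
mesh_keys() {
  local src dst
  for src in "$@"; do
    echo "== keys for $src =="
    ssh "pi@$src" 'mkdir -p ~/.ssh && chmod 700 ~/.ssh &&
      { test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa; }'
    for dst in "$@"; do
      # A host does not need its own key copied to itself.
      if [ "$src" != "$dst" ]; then
        # -t gives the remote ssh-copy-id a terminal for password prompts.
        ssh -t "pi@$src" "ssh-copy-id pi@$dst"
      fi
    done
  done
}
```

For the controller plus four Zeros (`mesh_keys controller.local p1.local p2.local p3.local p4.local`) that is five key checks and twenty `ssh-copy-id` runs. With /home shared over NFS the copies become unnecessary, because every node then reads the same ~/.ssh directory.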
 
> (I usually just NFS-export /home from the "controller" to the "p" nodes, but I'm having problems with that too; it may be related to the ".local" issue.)
 
You didn't say what the problem was, but when I was looking at nfsroot on the Pi Zeros I found that rpcbind wasn't always starting on boot (https://github.com/dagon666/rpcbindplumber links to some of the issues), so that might be something to check.
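If rpcbind (or nfs-kernel-server on the controller) turns out to be the culprit, a check along these lines can be run by hand or from a boot script. It is a sketch assuming a systemd-based Raspbian (Jessie is) where both services can be managed with `systemctl`; `ensure_services` is a hypothetical name.

```shell
#!/bin/sh
# ensure_services SERVICE...
# Hypothetical helper: for each named service, restart it if systemd
# does not report it as active, then enable it so it is scheduled to
# start on the next boot as well.
ensure_services() {
  local svc
  for svc in "$@"; do
    if ! systemctl is-active --quiet "$svc"; then
      echo "$svc is not running, restarting it"
      sudo systemctl restart "$svc"
    fi
    sudo systemctl enable "$svc"
  done
}
```

On the controller this would be `ensure_services rpcbind nfs-kernel-server`, and on each Zero `ensure_services rpcbind`. Enabling only schedules the service at boot; it may not cure whatever stops rpcbind from starting reliably.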

Chris.

Jeffrey Layton

May 28, 2017, 1:09:35 PM
to Chris Burton, ClusterHAT
Chris,

Thanks for the tips! My answers are in-line below.


On Sun, May 28, 2017 at 4:15 PM, Chris Burton <bur...@gmail.com> wrote:
> Hi,
>> For some reason it doesn't like the ".local" extension. So I took that off the command and it seemed to work. Any ideas why it doesn't like ".local"?
>
> If you can't log into them using "ssh p...@p1.local" then I would guess they haven't been powered on long enough to boot up properly, or they haven't joined the bridge correctly.
>
> Depending on what the Pi Zero is doing whilst booting up it can take longer than a minute for it to boot up so it might be worth waiting a little longer before trying to copy the keys. If you've waited long enough I'd advise checking the network interfaces exist "ifconfig -a" should show the ethpi1 interface, they've been added to the bridge correctly "brctl show" should list br0 with eth0/ethpi1, then once they've booted up you should be able to ping them "ping -c1 p1.local" and once pingable you should be able to access them via SSH a few seconds later "ssh p...@p1.local" and similar for the other Pi Zeros. Once you can SSH in with "ssh p...@p1.local" you should be able to copy the keys over.


pi@controller:~ $ clusterhat on all
Turning on P1
Turning on P2
Turning on P3
Turning on P4
pi@controller:~ $ date
Sun 28 May 16:36:28 UTC 2017
pi@controller:~ $ date
Sun 28 May 16:37:42 UTC 2017
pi@controller:~ $ date
Sun 28 May 16:39:25 UTC 2017
pi@controller:~ $ date
Sun 28 May 16:42:10 UTC 2017
pi@controller:~ $ for I in 1 2 3 4; do echo "Copying to p$I.local";ssh-copy-id pi@p$I.local;done
Copying to p1.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname p1.local: Name or service not known

Copying to p2.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname p2.local: Name or service not known

Copying to p3.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname p3.local: Name or service not known

Copying to p4.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
pi@controller:~ $ ssh -v p1.local
OpenSSH_6.7p1 Raspbian-5+deb8u3, OpenSSL 1.0.1t  3 May 2016
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
ssh: Could not resolve hostname p1.local: Name or service not known
pi@controller:~ $ ssh p1

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sun May 28 15:04:59 2017 from p4
pi@p1:~ $ 


On the controller node I took a look at the interfaces:

pi@controller:~ $ ifconfig -a
br0       Link encap:Ethernet  HWaddr 00:22:82:ff:fe:01  
          inet addr:192.168.1.95  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::ba27:ebff:fe39:c0c7/64 Scope:Link
          inet6 addr: 2602:304:cf01:8a00:89a0:ba8a:9db1:2a88/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8255 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4369 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5876243 (5.6 MiB)  TX bytes:702485 (686.0 KiB)

eth0      Link encap:Ethernet  HWaddr b8:27:eb:39:c0:c7  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9259 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4740 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6469887 (6.1 MiB)  TX bytes:784942 (766.5 KiB)

ethpi1    Link encap:Ethernet  HWaddr 00:22:82:ff:fe:01  
          inet6 addr: fe80::222:82ff:feff:fe01/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:173 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1644 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:15845 (15.4 KiB)  TX bytes:122321 (119.4 KiB)

ethpi2    Link encap:Ethernet  HWaddr 00:22:82:ff:fe:02  
          inet6 addr: fe80::222:82ff:feff:fe02/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:104 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1557 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:7212 (7.0 KiB)  TX bytes:111494 (108.8 KiB)

ethpi3    Link encap:Ethernet  HWaddr 00:22:82:ff:fe:03  
          inet6 addr: fe80::222:82ff:feff:fe03/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:109 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1541 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:7864 (7.6 KiB)  TX bytes:110089 (107.5 KiB)

ethpi4    Link encap:Ethernet  HWaddr 00:22:82:ff:fe:04  
          inet6 addr: fe80::222:82ff:feff:fe04/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:192 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1491 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:20224 (19.7 KiB)  TX bytes:107021 (104.5 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:148 errors:0 dropped:0 overruns:0 frame:0
          TX packets:148 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:12192 (11.9 KiB)  TX bytes:12192 (11.9 KiB)

wlan0     Link encap:Ethernet  HWaddr b8:27:eb:6c:95:92  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:153 errors:0 dropped:153 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:54087 (52.8 KiB)  TX bytes:0 (0.0 B)



Looks like they are all there. Looking at the Bridges:


pi@controller:~ $ brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.002282fffe01       no              eth0
                                                        ethpi1
                                                        ethpi2
                                                        ethpi3
                                                        ethpi4


I'm not familiar with the bridging so I'm not sure what this means.
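For what it's worth, that `brctl show` output looks healthy: br0 acts as a software switch, and eth0 plus ethpi1 to ethpi4 are its ports, so the Zeros sit on the same Ethernet segment as the controller (consistent with p4.local resolving to 192.168.1.99 earlier, an address on the same LAN). The same membership can be read from sysfs, where the Linux bridge driver lists each port under /sys/class/net/<bridge>/brif/. A minimal sketch, with `missing_bridge_ports` as a hypothetical helper:

```shell
#!/bin/sh
# missing_bridge_ports BRIDGE IFACE...
# Hypothetical helper: print every IFACE that is NOT currently a port of
# BRIDGE. Each bridge port appears as an entry under
# /sys/class/net/<bridge>/brif/; SYSFS_NET overrides that base directory
# (useful for testing).
missing_bridge_ports() {
  local bridge="$1" iface
  shift
  for iface in "$@"; do
    if [ ! -e "${SYSFS_NET:-/sys/class/net}/$bridge/brif/$iface" ]; then
      echo "$iface"
    fi
  done
}
```

`missing_bridge_ports br0 eth0 ethpi1 ethpi2 ethpi3 ethpi4` prints nothing when all five interfaces are attached to the bridge.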

I can't ping p1.local from the controller:


pi@controller:~ $ ping -c1 p1.local
ping: unknown host p1.local

The controller has been up for 15 minutes and the compute nodes have been up 13 minutes when I tried the commands.

I waited another 10 minutes or so and I was able to ping p1.local and log into it with "ssh p...@p1.local". So I tried copying the pub key to the nodes:


pi@controller:~ $ for I in 1 2 3 4; do echo "Copying to p$I.local";ssh-copy-id pi@p$I.local;done
Copying to p1.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.

Copying to p2.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname p2.local: Name or service not known

Copying to p3.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: ERROR: ssh: Could not resolve hostname p3.local: Name or service not known

Copying to p4.local
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.


I can ssh into the nodes now. But 15+ minutes seems like a long time. Any suggestions on debugging this? (or speeding it up?)



  
>> Also, when I ssh to a node such as p1, I can't do a passwordless ssh to any other p node. Any thoughts?
>
> This is to be expected, as you'd need to create a new key pair on p1, copy its public key to controller/p2/p3/p4, and then repeat that for the other Pi Zeros.

Yep. This is why I want to use NFS to export /home from the controller to all of the compute nodes :)
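Once /home is shared, passwordless SSH between all nodes can come down to authorizing a single key: the controller's own public key goes into its own ~/.ssh/authorized_keys, and every node that mounts the same /home sees both files. A sketch assuming the default key path ~/.ssh/id_rsa.pub; `authorize_key` is a hypothetical helper that appends the key only if that exact line is not already present.

```shell
#!/bin/sh
# authorize_key PUBKEY AUTHORIZED_KEYS
# Hypothetical helper: append the one-line public key in PUBKEY to
# AUTHORIZED_KEYS unless that exact line is already there, and apply the
# permissions commonly recommended for sshd (700 dir, 600 file).
authorize_key() {
  local pub="$1" auth="$2" dir key
  dir=$(dirname "$auth")
  key=$(cat "$pub")
  mkdir -p "$dir"
  chmod 700 "$dir"
  touch "$auth"
  chmod 600 "$auth"
  # -x: whole-line match, -F: fixed string (keys contain '+' and '/').
  if ! grep -qxF -- "$key" "$auth"; then
    printf '%s\n' "$key" >> "$auth"
  fi
}
```

On the controller, `authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys` is safe to run repeatedly. Each node still has its own host key, so the first connection to each one will ask you to confirm its fingerprint.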
 
 
>> (I usually just NFS-export /home from the "controller" to the "p" nodes, but I'm having problems with that too; it may be related to the ".local" issue.)
>
> You didn't say what the problem was, but when I was looking at nfsroot on the Pi Zeros I found that rpcbind wasn't always starting on boot (https://github.com/dagon666/rpcbindplumber links to some of the issues), so that might be something to check.

I haven't thought about nfsroot - cool idea (I'll try that at some other point).

In the meantime, I installed the NFS server on the controller node and created /etc/exports:

pi@controller:~ $ more /etc/exports
/home      *(rw,sync,no_subtree_check)
pi@controller:~ $ sudo exportfs -a
pi@controller:~ $ sudo showmount -v
showmount for 1.2.8
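Two hedged notes on this step: `showmount -v` only prints the version (as seen here), whereas `showmount -e localhost` lists what the server is actually exporting; and the `*` wildcard exports /home to any host that can reach the controller. A tighter entry, assuming the 192.168.1.0/24 LAN shown on br0 in the ifconfig output above, might look like the following, followed by `sudo exportfs -ra` to re-read the file:

```
# /etc/exports: share /home read-write with hosts on 192.168.1.0/24 only
/home      192.168.1.0/24(rw,sync,no_subtree_check)
```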

I used the wildcard to get started. I grabbed p4 to test. I edited /etc/fstab to look like the following:


pi@p4:~ $ more /etc/fstab
proc            /proc           proc    defaults          0       0
/dev/mmcblk0p1  /boot           vfat    defaults          0       2
/dev/mmcblk0p2  /               ext4    defaults,noatime  0       1
controller.local:/home     /home      nfs     defaults          0       0
# a swapfile is not a swap partition, no line here
#   use  dphys-swapfile swap[on|off]  for that
pi@p4:~ $ ls -s
total 0


The last line shows no files (I created a "junk" file in the /home of the controller). Running "mount" shows it's not mounted:


pi@p4:~ $ mount
/dev/mmcblk0p2 on / type ext4 (rw,noatime,data=ordered)
devtmpfs on /dev type devtmpfs (rw,relatime,size=218256k,nr_inodes=54564,mode=755)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mmcblk0p1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)


So I "sudo-ed" to /bin/bash on p4 and went to /tmp and tried remounting /home:


pi@p4:~ $ sudo /bin/bash
root@p4:/home/pi# cd /tmp
root@p4:/tmp# mount -a
^C
root@p4:/tmp# mount -a -v
/proc                    : already mounted
/boot                    : already mounted
/                        : ignored
mount.nfs: timeout set for Sun May 28 17:06:37 2017
mount.nfs: trying text-based options 'vers=4,addr=192.168.1.95,clientaddr=192.168.1.99'
mount.nfs: mount(2): Connection refused
mount.nfs: trying text-based options 'vers=4,addr=192.168.1.95,clientaddr=192.168.1.99'
mount.nfs: mount(2): Connection refused
^C


Not sure why it's refusing to mount. Iptables? (It's running, but there aren't any rules.) There is nothing in the logs on the controller or p4; the only thing in there is 'action 17' around rsyslogd.
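"Connection refused" generally means the controller answered but nothing was accepting connections on the port being tried (TCP 2049 for the NFSv4 mount attempted here), which would fit the later fix of restarting rpcbind and nfs-kernel-server. A quick way to probe that from p4, assuming bash and coreutils `timeout` are installed (both are on stock Raspbian); `port_open` is a hypothetical helper:

```shell
#!/bin/sh
# port_open HOST PORT [SECONDS]
# Hypothetical helper: succeed if a TCP connection to HOST:PORT can be
# opened within SECONDS (default 3). Uses bash's /dev/tcp redirection,
# so bash must be installed even if this file is run by another shell.
port_open() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' port_open "$1" "$2" \
    2>/dev/null
}
```

For example, `port_open controller.local 2049 || echo "nothing listening on 2049"`. Running `rpcinfo -p controller.local` additionally lists the RPC services registered with rpcbind on the controller.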


Thanks!

Jeff



Jeffrey Layton

May 29, 2017, 4:03:15 PM
to Chris Burton, ClusterHAT
Chris (or whomever),

Sorry to jump back a little. I got NFS working (I needed to restart rpcbind and nfs-kernel-server), but I'm having some trouble getting the bridging to work. For example, I booted all 4 Zeros and after about 35 minutes only one of them (p4.local) responded to ping. The other Zeros can't be pinged using their .local names ("unknown host").

pi@controller:~ $ brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.002282fffe01       no              eth0
                                                        ethpi1
                                                        ethpi2
                                                        ethpi3
                                                        ethpi4


I didn't add anything to the images on the Zeros, so I don't know why it takes so long for the bridging to come up.

I'm not sure what's causing the issue. Does anyone have any suggestions?

TIA!

Jeff

Chris Burton

May 31, 2017, 7:41:55 PM
to ClusterHAT, bur...@gmail.com
Hi, 
> I didn't add anything to the images on the zeros so I don't know why it takes so long for the bridging to come up.
>
> I'm not sure what's causing the issue. Does anyone have any suggestions?

I'm not sure how you can contact the Pi Zero using just "ssh pi@p4", as the controller doesn't have this information in the standard setup. The only explanation I can think of is that your local router/gateway is handing out the IPs via DHCP and adding the hostnames to its local DNS server (and possibly populating /etc/resolv.conf with a "search" value to use). But that still leaves the question of why p4.local works but not the others, unless maybe it has run out of IP addresses on the DHCP side of things?
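One way to see which resolution path is actually working is to ask the system resolver directly. `getent hosts` goes through the same NSS lookup order that ssh and ping use (/etc/hosts, DNS with any resolv.conf search domain, and mDNS if nsswitch.conf is set up for it). A sketch; `resolve_report` is a hypothetical name:

```shell
#!/bin/sh
# resolve_report HOST...
# Hypothetical helper: print "HOST ADDRESS" for each host using the
# system's NSS lookup, or "HOST unresolved" when the lookup fails.
resolve_report() {
  local h addr
  for h in "$@"; do
    # getent prints "ADDRESS NAME..."; keep only the first address.
    addr=$(getent hosts "$h" | awk '{ print $1; exit }')
    echo "$h ${addr:-unresolved}"
  done
}
```

Running `resolve_report p1 p2 p3 p4 p1.local p2.local p3.local p4.local` on the controller compares the two paths: short names resolving while the .local names do not would suggest the answers are coming from the router's DNS rather than mDNS.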

Can you try the Cluster HAT using a local keyboard/mouse/monitor and no Ethernet cable on the controller? Wait a minute after powering on p1-p4 and see whether "ping -c1 p1.local" .. "ping -c1 p4.local" work. If they do, I'd look at your router/gateway settings/manual.

Chris.

Jeffrey Layton

Jun 4, 2017, 8:57:58 AM
to Chris Burton, ClusterHAT
On Wed, May 31, 2017 at 7:41 PM, Chris Burton <bur...@gmail.com> wrote:
> Hi,
>> I didn't add anything to the images on the zeros so I don't know why it takes so long for the bridging to come up.
>>
>> I'm not sure what's causing the issue. Does anyone have any suggestions?
>
> I'm not sure how you can contact the Pi Zero using just "ssh pi@p4" as the Controller doesn't have this information in the standard setup. The only thing I can think of is your local router/gateway is handing out the IP via DHCP, adding the hostname to the local DNS server (and possibly populating /etc/resolv.conf with a "search" value to use). But that still leaves the question of why p4.local works but not the others, unless it has maybe ran out of IP addresses on the DHCP side of things?

This was indeed the problem. My home router was screwing up the DHCP on the cluster. So I pulled the network cable and the system booted just fine; I can then plug the cable back in to access the Internet. If I bring a node down and then back up, I have to unplug and re-plug the network cable. Sort of a pain, but not a show stopper. Any way I can fix this?

Thanks for all of the help! The ClusterHAT just rocks!

Jeff
 


Chris Burton

Jun 5, 2017, 4:19:59 PM
to ClusterHAT
Hi, 
> This was indeed the problem. My home router was screwing up the DHCP on the cluster. So I pulled the network cable and the system booted just fine; I can then plug the cable back in to access the Internet. If I bring a node down and then back up, I have to unplug and re-plug the network cable. Sort of a pain, but not a show stopper. Any way I can fix this?

I would assume switching to static IP addresses for the controller/pi zeros should work around it.

Depending on the router, you can probably either set fixed leases on the router based on the MAC addresses (to prevent the statically configured IPs from being handed out elsewhere), or use IPs that are outside the DHCP range but inside the IP range used on the LAN.
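A sketch of the static-address route, assuming the images use dhcpcd (the default DHCP client on Raspbian Jessie) and its /etc/dhcpcd.conf `static` options. `dhcpcd_static` is a hypothetical helper; the Zero-side interface name usb0, the address 192.168.1.201/24 and the router 192.168.1.1 are illustrative assumptions, and any addresses you pick must be outside the router's DHCP pool. On the controller, check which interface carries the address (br0 in the ifconfig output earlier in this thread).

```shell
#!/bin/sh
# dhcpcd_static IFACE ADDRESS/PREFIX ROUTER [DNS]
# Hypothetical helper: print a static-address stanza in /etc/dhcpcd.conf
# syntax. DNS defaults to the router address, since many home routers
# also act as the LAN's DNS server.
dhcpcd_static() {
  printf 'interface %s\n' "$1"
  printf 'static ip_address=%s\n' "$2"
  printf 'static routers=%s\n' "$3"
  printf 'static domain_name_servers=%s\n' "${4:-$3}"
}
```

On p1 that could be applied with `dhcpcd_static usb0 192.168.1.201/24 192.168.1.1 | sudo tee -a /etc/dhcpcd.conf` followed by a reboot, using a different address on each Zero. The router-side fixed leases avoid editing every node.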

Chris.