1. When I boot more than ~10 servers at the same time, I see an occasional TFTP issue.
Affected servers successfully lease an IP but generally encounter failures early in the TFTP process, usually while trying to pull down the initial ramdisk. The specific error in this case is "Could not find ramdisk image: initrd.img-5.4-i386". I've tried increasing the number of tftpd instances xinetd will spawn, but that hasn't addressed the problem. After increasing the daemon's verbosity, I see the following in /var/log/daemon:
Nov 25 13:03:12 rocks1 in.tftpd[29294]: RRQ from 10.1.255.216 filename pxelinux.0
Nov 25 13:03:12 rocks1 in.tftpd[29294]: tftp: client does not accept options
Nov 25 13:03:12 rocks1 in.tftpd[29298]: RRQ from 10.1.255.216 filename pxelinux.0
Nov 25 13:03:12 rocks1 in.tftpd[29307]: RRQ from 10.1.255.216 filename pxelinux.cfg/01-60-9f-9d-f0-09-4a
Nov 25 13:03:12 rocks1 in.tftpd[29307]: sending NAK (1, File not found) to 10.1.255.216
Nov 25 13:03:13 rocks1 in.tftpd[29312]: RRQ from 10.1.255.216 filename pxelinux.cfg/0A01FFD8
Nov 25 13:03:13 rocks1 in.tftpd[29313]: RRQ from 10.1.255.216 filename vmlinuz-5.4-i386
Nov 25 13:03:13 rocks1 in.tftpd[29308]: RRQ from 10.1.255.216 filename pxelinux.cfg/0A01FFD8
Nov 25 13:03:15 rocks1 in.tftpd[29316]: RRQ from 10.1.255.206 filename initrd.img-5.4-i386
Nov 25 13:03:18 rocks1 in.tftpd[29322]: RRQ from 10.1.255.216 filename initrd.img-5.4-i386
Nov 25 13:03:21 rocks1 in.tftpd[29166]: tftpd: read: Connection refused
Nov 25 13:03:21 rocks1 in.tftpd[29207]: tftpd: read: Connection refused
Nov 25 13:03:22 rocks1 in.tftpd[29167]: tftpd: read: Connection refused
The 'client does not accept options' message is apparently harmless, but the refused connections are more problematic. The requested files clearly exist and are readable:
[root@rocks1 log]# ls -l /tftpboot/pxelinux
total 20944
-rwxr-xr-x 1 root root 16445066 Nov 23 13:38 initrd.img-5.4-i386
-rwxr-xr-x 1 root root 20020 Nov 2 19:52 memdisk
-rw-r--r-- 1 root root 94600 Nov 23 13:38 memtest
-rwxr-xr-x 1 root root 2949120 Nov 2 19:52 pxeflash.img
-rw-r--r-- 1 root root 13148 Nov 23 13:38 pxelinux.0
drwxrwxr-x 2 root apache 4096 Nov 24 13:55 pxelinux.cfg
-rwxr-xr-x 1 root root 1876372 Nov 23 13:38 vmlinuz-5.4-i386
I see no errors in the Frontend's messages file, nor am I having any network connectivity issues on the subnet. I'm somewhat stumped on where to go next. I could try a network trace, but before wading through the output from 15 clients I thought I'd ask the list.
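For reference, this is the sort of /etc/xinetd.d/tftp stanza I've been adjusting. This is only a sketch: the instances/per_source/cps values are placeholders to experiment with, not Rocks defaults, and the server_args path is an assumption based on my layout.

```
# Illustrative /etc/xinetd.d/tftp -- values are placeholders, not defaults
service tftp
{
        socket_type     = dgram
        protocol        = udp
        wait            = yes
        user            = root
        server          = /usr/sbin/in.tftpd
        server_args     = -s /tftpboot        # assumed serving root
        instances       = 100                 # max simultaneous tftpd servers
        per_source      = 20                  # max instances per client IP
        cps             = 200 2               # conns/sec before a 2s pause
        disable         = no
}
```

After editing, xinetd needs a reload for the new limits to take effect.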
2. When provisioning multiple servers in parallel, the Frontend comes under heavy load and generally starts returning HTTP 503 responses with more than 5-6 clients. The clients request a ks.cfg, get the 503, and then go into the "retrying in # seconds" loop. If I increase the number of clients above ~20, I find myself stuck in what amounts to an infinite retry loop: none of the clients is ever provisioned, because the Frontend doesn't seem able to generate the necessary ks.cfg files. Here is sample output from Apache's SSL logfile:
10.1.255.191 - - [25/Nov/2010:14:07:38 -0500] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=1 HTTP/1.0" 503 25
10.1.255.212 - - [25/Nov/2010:14:07:38 -0500] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=1 HTTP/1.0" 503 25
10.1.255.214 - - [25/Nov/2010:14:07:38 -0500] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=1 HTTP/1.0" 503 25
10.1.255.200 - - [25/Nov/2010:14:07:39 -0500] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=1 HTTP/1.0" 503 25
10.1.255.210 - - [25/Nov/2010:14:07:37 -0500] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=1 HTTP/1.0" 503 25
10.1.255.196 - - [25/Nov/2010:14:07:40 -0500] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=1 HTTP/1.0" 503 25
10.1.255.208 - - [25/Nov/2010:14:07:40 -0500] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=1 HTTP/1.0" 503 25
I tried setting Apache's Timeout directive higher (to 240), hoping that would allow kickstart.cgi to finish, but it hasn't helped. I also tried increasing the number of Apache processes by adjusting StartServers, but that hasn't helped either.
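For what it's worth, the knobs in play are Apache's prefork settings; below is an illustrative httpd.conf stanza of the kind I mean. The numbers are placeholders, not my actual values -- they need to be sized against available RAM, since each kickstart.cgi request ties up a full httpd child for its duration.

```
# Illustrative prefork-MPM tuning in httpd.conf -- numbers are placeholders
<IfModule prefork.c>
    StartServers          16
    MinSpareServers       16
    MaxSpareServers       32
    ServerLimit          256
    MaxClients           256
    MaxRequestsPerChild 1000
</IfModule>
```

(ServerLimit must be at least as large as MaxClients, or Apache silently caps the latter.)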
The server itself never swaps, so I don't think memory is an issue. I'm running on a 2.33 GHz Xeon, but if CPU horsepower is the problem I can look at increasing that. With 32 clients attempting to provision, the Frontend's load average is approximately 11 (and has been for a while):
[root@rocks1 httpd]# uptime
14:10:53 up 17:26, 2 users, load average: 11.28, 10.78, 10.46
Any suggestions would be much appreciated. I can dial back the number of parallel provisioning attempts, but that would have a significant impact on my goal of 1,000 nodes. I haven't seen any documentation on how to run multiple Frontends, but if that's a possibility I could certainly move in that direction. Finally, I don't think bandwidth is an issue yet, as I'm having problems before the clients even attempt to pull down the OS media. I was hoping the BitTorrent capabilities would help with the bandwidth concerns, but I haven't gotten far enough to enable or test them.
Regards,
Damon
Hmmm ... we've seen this in earlier Rocks releases with more than 10 nodes
tftp-ing at a time. TFTP isn't a terribly smart protocol; it doesn't
handle delays/timeouts very well at all.
But TFTP itself was rarely the problem. We've seen a) bad switch
configurations, b) poor NIC behavior, and c) underpowered frontends.
For bad switch configs: some switches have 'smart' detection of
broadcast storms and tftp traffic throttling. You probably want to turn
off all of that "smartness".
For poor NIC behavior: if this is a Dell frontend and the NIC is a
Broadcom, spend the ~$150 USD and get an Intel NIC in there. You won't
be unhappy (and as often as not, that alone may solve the problem for
you, if it was a Dell). Otherwise, if you can turn on NAPI or enable
interrupt moderation, both are quite helpful.
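If the driver supports it, interrupt moderation can be adjusted from userspace with ethtool. The commands below are illustrative only -- the parameter names and accepted values vary by driver, so check what your NIC reports first:

```
# Show the driver's current interrupt-coalescing settings:
ethtool -c eth0
# Adjust rx coalescing -- values here are illustrative, and not every
# driver accepts every parameter:
ethtool -C eth0 rx-usecs 100 rx-frames 64
```

Coalescing trades a little latency for far fewer interrupts per second, which matters when one box is answering dozens of PXE clients at once.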
For underpowered frontends ... with load averages of 11, it's possible
that tftpd isn't being scheduled as rapidly as it needs to be. I don't
know how many cores you are running, but I'd suggest a minimum of a
quad-core, and up to an 8-core machine, for the head node, with
sufficient RAM that there is little memory pressure even under load.
Also, as often as not, the head node's disk is slow, and a slow disk can
render the whole machine sluggish.
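A quick, rough way to check the disk is a synchronous write with dd. This is illustrative -- the path and size are placeholders, and for a real measurement you'd want a file well larger than RAM so the page cache can't flatter the result:

```shell
# Rough sequential-write check for a sluggish head-node disk.
# conv=fdatasync forces the data out to disk before dd exits, so the
# MB/s figure it reports on stderr is honest, not just cache speed.
dd if=/dev/zero of=/tmp/dd_write_test bs=1M count=32 conv=fdatasync
rm -f /tmp/dd_write_test
```

If that reports single-digit MB/s while the installs are running, the disk is part of your load problem.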
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: lan...@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615