Node provision problem


Jason Kost

Apr 3, 2017, 11:32:33 AM
to Warewulf
Hello Warewulfers,

I am attempting to build/rebuild a mid-sized HPC cluster using OpenHPC 1.2, Warewulf 3.7 (I believe that's the version included in OpenHPC), and Scientific Linux 7.2, but am running into problems actually getting the nodes to provision.  Most aggravatingly, I believe I solved this issue once in the past, but for the life of me I can't remember how.

Everything appears to work perfectly up until the point where I actually attempt to boot the compute nodes to pull the image over.  There is no indication of anything wrong in any of the log files, and nothing untoward shows up in a tcpdump.  Everything goes fine until the initfs.gz image is pulled over; then nothing.  It just hangs there forever.  I can see that the filesystem is being created on the nodes (I'm doing a stateful provision).  I have repeatedly destroyed the filesystem before attempting to provision, to no avail.  I have isolated the local network I'm using from the rest of the University, just in case.  I have updated syslinux to 6.03 and replaced the Warewulf versions of those files (pxelinux.0, lpxelinux.0, and ldlinux.c32).  I've checked pxelinux.cfg, tftp, dhcpd.conf, and every other configuration file I know of for errors or incorrect paths.  I've read through every thread here that seemed like it might pertain to this issue and followed any advice I could find, but nothing has had any effect.  So here I am, asking if anyone has any new ideas.

Please let me know if there is any data/info that would be useful in helping to diagnose this issue.  I've included what I could think of/what I've seen frequently included in other posts. 

[root@nucleus syslinux-6.03]# cat /etc/xinetd.d/tftp
# default: off
# description: The tftp server serves files using the trivial file transfer \
#       protocol.  The tftp protocol is often used to boot diskless \
#       workstations, download configuration files to network-aware printers, \
#       and to start the installation process for some operating systems.
service tftp
{
        socket_type             = dgram
        protocol                = udp
        wait                    = yes
        user                    = root
        server                  = /usr/sbin/in.tftpd
        server_args             = -s /var/lib/tftpboot
        disable                 = no
        per_source              = 11
        cps                     = 100 2
        flags                   = IPv4
}

[root@nucleus syslinux-6.03]# wwsh provision print chr1
#### chr1 #####################################################################
           chr1: BOOTSTRAP        = 3.10.0-514.10.2.el7.x86_64
           chr1: VNFS             = SL7
           chr1: FILES            = dynamic_hosts,group,important_paths.sh,munge.key,network,passwd,shadow,slurm.conf
           chr1: PRESHELL         = FALSE
           chr1: POSTSHELL        = FALSE
           chr1: CONSOLE          = UNDEF
           chr1: PXELINUX         = UNDEF
           chr1: SELINUX          = DISABLED
           chr1: KARGS            = "console=ttyS1,115200"
           chr1: FILESYSTEMS      = mountpoint=/boot:dev=sda1:type=xfs:size=500,dev=sda2:type=swap:size=48000,mountpoint=/:dev=sda3:type=xfs:size=fill
           chr1: DISKFORMAT       = sda1,sda2,sda3
           chr1: DISKPARTITION    = sda
           chr1: BOOTLOADER       = sda
           chr1: BOOTLOCAL        = FALSE


[root@nucleus syslinux-6.03]# cat /var/log/messages-20170403 | grep 'chr1'
Mar 31 15:45:44 chr1.localdomain wwlogger: Starting the provision handler
Mar 31 15:45:44 chr1.localdomain wwlogger: Running provision script: adhoc-pre
Mar 31 15:45:44 chr1.localdomain wwlogger: Running provision script: filesystems
Mar 31 15:45:45 chr1.localdomain wwlogger: Root partition found in WWFILESYSTEMS from node configuration.
Mar 31 15:45:48 chr1.localdomain wwlogger: Running provision script: getvnfs
Mar 31 15:45:58 chr1.localdomain wwlogger: Running provision script: config
Mar 31 15:45:58 chr1.localdomain wwlogger: Running provision script: runtimesupport
Mar 31 15:45:58 chr1.localdomain wwlogger: Running provision script: devtree
Mar 31 15:45:58 chr1.localdomain wwlogger: Running provision script: kernelmodules
Mar 31 15:45:59 chr1.localdomain wwlogger: Running provision script: files
Mar 31 15:45:59 chr1.localdomain wwlogger: Running provision script: mkbootable
Mar 31 15:46:00 chr1.localdomain wwlogger: Running provision script: adhoc-post
Mar 31 15:46:00 chr1.localdomain wwlogger: Running provision script: selinux
Mar 31 15:46:00 chr1.localdomain wwlogger: Running provision script: umount
Mar 31 15:46:01 chr1.localdomain wwlogger: Running provision script: postreboot


[root@nucleus syslinux-6.03]# cat /var/log/httpd/access_log-20170403 | grep '42\.10'
192.168.42.10 - - [31/Mar/2017:15:28:51 -0400] "GET /WW/vnfs?hwaddr=f4:ce:46:b9:fb:68 HTTP/1.1" 200 308969569 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:05 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&timestamp= HTTP/1.1" 200 401 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:05 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&fileid=9 HTTP/1.1" 200 16 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:05 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&fileid=1 HTTP/1.1" 200 177 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:05 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&fileid=2 HTTP/1.1" 200 4019 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:06 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&fileid=3 HTTP/1.1" 200 1407 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:06 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&fileid=4 HTTP/1.1" 200 2283 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:06 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&fileid=5 HTTP/1.1" 200 2122 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:06 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&fileid=6 HTTP/1.1" 200 1024 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:06 -0400] "GET /WW/file?hwaddr=f4:ce:46:b9:fb:68&fileid=11 HTTP/1.1" 200 2274 "-" "Wget"
192.168.42.10 - - [31/Mar/2017:15:29:07 -0400] "GET /WW/script?hwaddr=f4:ce:46:b9:fb:68&type=post HTTP/1.1" 200 - "-" "Wget"


[root@nucleus syslinux-6.03]# cat /var/lib/tftpboot/warewulf/pxelinux.cfg/01-f4-ce-46-b9-fb-68
# Configuration for Warewulf node: chr1
# Warewulf data store ID: 10
DEFAULT bootstrap
LABEL bootlocal
LOCALBOOT 0
LABEL bootstrap
SAY Now booting chr1 with Warewulf bootstrap (3.10.0-514.10.2.el7.x86_64)
KERNEL bootstrap/7/kernel
APPEND ro initrd=bootstrap/7/initfs.gz wwhostname=chr1 console=ttyS1,115200 wwmaster=192.168.42.1 wwipaddr=192.168.42.10 wwnetmask=255.255.255.0 wwnetdev=eth0

Any thoughts or advice would be greatly appreciated. 

Sincerely,
Jason Kost

Jason Stover

Apr 3, 2017, 11:58:32 AM
to ware...@lbl.gov
Hi Jason,

OpenHPC 1.2 has 3.6 (with patches); 1.3 has Warewulf 3.7pre.

How did you build the chroot structure? Was it a clean install, or did
you use the golden image to pull an install? With SL7, I had an issue
when /run already existed, and the node just wouldn't boot. It would
provision fine, but boot... nope.

From the provision logs, it appears to be provisioning just fine.
Are you getting anything on the console?

-J
> --
> You received this message because you are subscribed to the Google Groups
> "Warewulf" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to warewulf+u...@lbl.gov.
> To post to this group, send email to ware...@lbl.gov.
> To view this discussion on the web visit
> https://groups.google.com/a/lbl.gov/d/msgid/warewulf/28e40550-492a-454b-b2ea-6943c9a70658%40lbl.gov.
> For more options, visit https://groups.google.com/a/lbl.gov/d/optout.

Jason Kost

Apr 3, 2017, 12:23:15 PM
to Warewulf
Jason,

I have to admit I'm still a little fuzzy on what the golden image is... 

I used the centos-7 template with wwmkchroot. 

When trying to provision a node, the last thing that appears on the console is the "Probing EDD" message.  After sitting at that for about 30 seconds, the caps-lock/num-lock/scroll-lock lights on the keyboard start flashing (like at the start of booting), and the system freezes, requiring a hard reset/shutdown.

I wonder if it might be worthwhile to compile WW 3.7 and give that a shot?  I've been a little hesitant to do so, as I'm not completely certain what would need to be redone (since I'm more or less following the OpenHPC recipe).

Jason

Jason Stover

Apr 3, 2017, 1:04:13 PM
to ware...@lbl.gov
Hi Jason,

The "golden" image is basically just an rsync of a node. So, you'd
set up/install a node as usual, and then use this to create an image
from it.

"Probing EDD" is all you're going to see. You can try adding in
"console=tty1" to the KARGS value:

wwsh provision set chr1 --kargs="console=tty1 console=ttyS1,115200"

Otherwise, you'll only see anything useful over the serial console.

But, as I said, it appears that it's provisioning fine. It's getting
down to postreboot in the logs, so without anything else I can't say
exactly. You could try updating to 3.7, as it does have much better
EL7 support, but from what you've provided I'm not seeing anything on
the provisioner side that's causing it, since it gets to the point
where it exits the bootstrap and runs the chroot init.

There _could_ be an issue there, but I can't see that from the logs
provided, and if you're needing to manually reboot, then it doesn't
look like a usual error that will cause the system to reboot
automatically after 15 seconds or so.

-J

Meij, Henk

Apr 3, 2017, 1:38:46 PM
to Warewulf

Golden image notes (I love it, especially for complicated, large configs which are hard to do in a chroot).  It is also useful if you need to compile software directly on the compute nodes, so it can customize itself during compilation based on the hardware it will actually run on; GROMACS is an example.

https://dokuwiki.wesleyan.edu/doku.php?id=cluster:144


I have some old HP blades where NIC1/NIC2 reverse roles between PXE and regular boot.  Since the private network netmask is 255.255.0.0, I hook them both up to the same switch and define both MACs in the ww db and BIOS.


-Henk



Jason Kost

Apr 5, 2017, 5:27:00 PM
to Warewulf
A little bit of an update:

After a bit of work, I was able to get WW 3.7 compiled and to figure out where it installs itself (the OpenHPC version seems to put everything relative to /, while 3.7 goes into /usr/local/).  Unfortunately, after rebuilding the VNFS image and Warewulf object databases, the original issue remained.

I do, however, have a bit of additional info from redirecting the provisioning output to the console.  As far as I can tell, there are no major errors occurring anywhere (the output just flies by until it hangs, so it's tough to say, and I'm not certain how to redirect it to a file on the head node for later analysis).  The final 4 lines of output before everything hangs are:

[    3.772125] usb 6-1: New full-speed USB device number 2 using uhci_hcd
[    3.898030] Freeing unused kernel memory: 1648k freed
[    3.910315] usb 6-1: New USB device found: idVendor=03f0, idProduct=1027
[    3.920144] usb 6-1: New USB device strings: Mfr=1, Product=2, SerialNumber

It's not clear to me whether the hang is a result of the action in the final line of output, or of whatever comes next not completing successfully.  Nor am I entirely clear on how to discover what would/should come next.

I don't know if this helps to diagnose what's going on or not.  Because I do know that the compute nodes are capable of taking a Scientific Linux install (my cluster was up and running until the head node OS became irrecoverably corrupted), I'm thinking the golden image route may be worth looking into if this issue isn't easily solvable (not that the golden image is guaranteed to work either, but at least the image would be known to function on at least one node).

Any thoughts or comments would be greatly appreciated. 

Sincerely,
Jason Kost

John Hearns

Apr 6, 2017, 3:05:06 AM
to ware...@lbl.gov
Jason, I don't think this will go very far toward solving your problem, but here is something I would try:
When the nodes are 'frozen' at the EDD prompt, can you nmap them to see if any network stack is up, and maybe even ssh in?

I guess the answer will be no, but this is something I always try when nodes have boot problems.
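The suggestion above can be sketched concretely, using the node address 192.168.42.10 from the logs earlier in the thread:

```shell
# See whether the frozen node still answers on the network at all,
# and whether anything is listening on the ssh port.
ping -c 3 192.168.42.10
nmap -Pn -p 22 192.168.42.10
```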


Jason Stover

Apr 6, 2017, 9:27:57 AM
to ware...@lbl.gov
Hi Jason,

From the final output, it appears that provisioning is happening
fine. That looks to be normal output from the system booting... then
hanging when probing the USB bus, or right after.

For logging, you can modify the compute node's /etc/rsyslog.conf and add
in: *.* @[prov ipaddr] -- Make sure the provisioner's syslog is
set up to allow remote logging (or use a dedicated syslog server). That
could let you see a bit more. And although it's messy, make sure the
'debug' kernel argument is there (in the KARGS value), at least until
you get the node booting.

And to expand on what John said: there isn't an SSH daemon running
while the system is provisioning, but the network stack is up, so you
can ping it. The closest you can get is setting a PRESHELL / POSTSHELL
value to spawn a shell before or after the provisioning steps. Or, in
some cases when an error is hit, you have 10 seconds or so to press a
key and enter a debug shell before the node reboots. That's accessible
with a KVM or SOL console.

-J

Jason Kost

Apr 6, 2017, 1:47:38 PM
to Warewulf
Still nothing definitive, but I did find a new message (or at least one I missed previously) in the logs on the provisioning server:

Apr  6 12:44:15 nucleus in.tftpd[10634]: Client 192.168.42.10 finished /warewulf/pxelinux.cfg/01-f4-ce-46-b9-fb-68
Apr  6 12:44:16 nucleus in.tftpd[10635]: Client 192.168.42.10 finished /warewulf/bootstrap/7/kernel
Apr  6 12:44:18 nucleus in.tftpd[10636]: Client 192.168.42.10 finished /warewulf/bootstrap/7/initfs.gz
Apr  6 12:44:39 nucleus wwprovision[5701]: Could not connect to data store!

I don't know if this is the source of my issues, or is just incidental. 

Also, here are the contents of the node's syslog:

Apr  6 16:44:53 chr1.localdomain wwlogger: Starting the provision handler
Apr  6 16:44:53 chr1.localdomain wwlogger: Running provision script: adhoc-pre
Apr  6 16:59:24 chr1.localdomain wwlogger: Running provision script: ipmiconfig
Apr  6 16:59:24 chr1.localdomain wwlogger: Running provision script: filesystems
Apr  6 16:59:24 chr1.localdomain wwlogger: Running provision script: getvnfs

Interestingly, I think I'm now even less sure what the problem is, or where it's arising.  I'm also wondering if there's something incorrect in my KARGS and/or rsyslog.conf setup, as I would have expected more info in my node's log output. 

chr1: KARGS            = "console=tty1 console=ttyS1,115200 debug"

Provisioning Server rsyslog.conf

# rsyslog configuration file

# For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
# If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html

#### MODULES ####

# The imjournal module bellow is now used as a message source instead of imuxsock.
$ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
$ModLoad imjournal # provides access to the systemd journal
#$ModLoad imklog # reads kernel messages (the same are read from journald)
#$ModLoad immark  # provides --MARK-- message capability

# Provides UDP syslog reception
$ModLoad imudp
$UDPServerRun 514

# Provides TCP syslog reception
#$ModLoad imtcp
#$InputTCPServerRun 514

$ModLoad imklog

#### GLOBAL DIRECTIVES ####

# Where to place auxiliary files
$WorkDirectory /var/lib/rsyslog

# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

# File syncing capability is disabled by default. This feature is usually not required,
# not useful and an extreme performance hit
#$ActionFileEnableSync on

# Include all config files in /etc/rsyslog.d/
$IncludeConfig /etc/rsyslog.d/*.conf

# Turn off message reception via local log socket;
# local messages are retrieved through imjournal now.
$OmitLocalLogging on

# File to store the position in the journal
$IMJournalStateFile imjournal.state


#### RULES ####

$template FILENAME,"/var/log/%fromhost-ip%/syslog.log"

*.* ?FILENAME

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog

# Log cron stuff
cron.*                                                  /var/log/cron

# Everybody gets emergency messages
*.emerg                                                 :omusrmsg:*

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log


# ### begin forwarding rule ###
# The statement between the begin ... end define a SINGLE forwarding
# rule. They belong together, do NOT split them. If you create multiple
# forwarding rules, duplicate the whole block!
# Remote Logging (we use TCP for reliable delivery)
#
# An on-disk queue is created for this action. If the remote host is
# down, messages are spooled to disk and sent when it is up again.
#$ActionQueueFileName fwdRule1 # unique name prefix for spool files
#$ActionQueueMaxDiskSpace 1g   # 1gb space limit (use as much as possible)
#$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
#$ActionQueueType LinkedList   # run asynchronously
#$ActionResumeRetryCount -1    # infinite retries if host is down
# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @@remote-host:514
# ### end of the forwarding rule ###

Compute Node rsyslog.conf
# rsyslog configuration file

# For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
# If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html

#### MODULES ####

# The imjournal module bellow is now used as a message source instead of imuxsock.
$ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
$ModLoad imjournal # provides access to the systemd journal
#$ModLoad imklog # reads kernel messages (the same are read from journald)
#$ModLoad immark  # provides --MARK-- message capability

# Provides UDP syslog reception
#$ModLoad imudp
#$UDPServerRun 514

# Provides TCP syslog reception
#$ModLoad imtcp
#$InputTCPServerRun 514


#### GLOBAL DIRECTIVES ####

# Where to place auxiliary files
$WorkDirectory /var/lib/rsyslog

# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

# File syncing capability is disabled by default. This feature is usually not required,
# not useful and an extreme performance hit
#$ActionFileEnableSync on

# Include all config files in /etc/rsyslog.d/
$IncludeConfig /etc/rsyslog.d/*.conf

# Turn off message reception via local log socket;
# local messages are retrieved through imjournal now.
$OmitLocalLogging on

# File to store the position in the journal
$IMJournalStateFile imjournal.state


#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
#*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
#authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
#mail.*                                                  -/var/log/maillog


# Log cron stuff
#cron.*                                                  /var/log/cron

# Everybody gets emergency messages
*.emerg                                                 :omusrmsg:*

# Save news errors of level crit and higher in a special file.
#uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log


# ### begin forwarding rule ###
# The statement between the begin ... end define a SINGLE forwarding
# rule. They belong together, do NOT split them. If you create multiple
# forwarding rules, duplicate the whole block!
# Remote Logging (we use TCP for reliable delivery)
#
# An on-disk queue is created for this action. If the remote host is
# down, messages are spooled to disk and sent when it is up again.
#$ActionQueueFileName fwdRule1 # unique name prefix for spool files
#$ActionQueueMaxDiskSpace 1g   # 1gb space limit (use as much as possible)
#$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
#$ActionQueueType LinkedList   # run asynchronously
#$ActionResumeRetryCount -1    # infinite retries if host is down
# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @@remote-host:514
# ### end of the forwarding rule ###
*.* @192.168.42.1:514

I *think* the lack of info in the log may be due to kernel logging being commented out on the compute node.  I'll change that and see if I get anything more helpful. 
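Enabling kernel logging in the chroot and rebuilding the image might look like the sketch below. The chroot path is a placeholder (the OpenHPC recipe typically refers to it as $CHROOT), and the VNFS name SL7 is taken from the provision config earlier in the thread.

```shell
# Placeholder for wherever wwmkchroot created the image; adjust to yours.
CHROOT=${CHROOT:-/opt/chroots/sl7}

# Uncomment the imklog module so kernel messages reach syslog (and,
# via the forwarding rule, the provisioning server).
sed -i 's|^#\$ModLoad imklog|$ModLoad imklog|' "$CHROOT/etc/rsyslog.conf"

# Rebuild the VNFS so the change lands in the image the nodes pull.
wwvnfs --chroot "$CHROOT" SL7
```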

Jason

Ryan Novosielski

Apr 6, 2017, 2:03:01 PM
to ware...@lbl.gov
> On Apr 6, 2017, at 1:47 PM, Jason Kost <jk...@worcester.edu> wrote:
>
> Still nothing definitive, but I did find a new (or at least one I missed seeing previously) message in the logs on the provisioning server:
>
> Apr 6 12:44:15 nucleus in.tftpd[10634]: Client 192.168.42.10 finished /warewulf/pxelinux.cfg/01-f4-ce-46-b9-fb-68
> Apr 6 12:44:16 nucleus in.tftpd[10635]: Client 192.168.42.10 finished /warewulf/bootstrap/7/kernel
> Apr 6 12:44:18 nucleus in.tftpd[10636]: Client 192.168.42.10 finished /warewulf/bootstrap/7/initfs.gz
> Apr 6 12:44:39 nucleus wwprovision[5701]: Could not connect to data store!
>
> I don't know if this is the source of my issues, or is just incidental.

I can’t see any way that this could possibly work with that error there. FYI, this message comes from nodeconfig.pl, which normally lives in a cgi-bin directory:

if (! $db) {
    warn("wwprovision: Apache Could not connect to data store!\n");
    openlog("wwprovision", "ndelay,pid", LOG_LOCAL0);
    syslog("ERR", "Could not connect to data store!");
    closelog;
    exit;
}

Sounds like you’ve got something wrong with your database connection. If you follow the trail to the DataStore.pm module, etc., you eventually get to a MySQL connection attempt. There could be lots of reasons for that not to work. I’d expect your Apache error log to say more.
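One way to chase this outside of Apache is to try the same connection by hand. A sketch, assuming MySQL is local and the config lives at /etc/warewulf/database.conf; the user and database names below are placeholders, so substitute whatever the file actually says:

```shell
# See what credentials the provisioner is actually configured with.
cat /etc/warewulf/database.conf

# Try a manual connection with those values ("warewulf" user/database
# names here are placeholders; use what database.conf says).
mysql -h localhost -u warewulf -p -e 'SHOW TABLES;' warewulf

# Apache's error log often has the underlying DBI error.
tail -n 50 /var/log/httpd/error_log
```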

=R

--
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novo...@rutgers.edu - 973/972.0922 (2x0922)
|| \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
`'

signature.asc

Jason Stover

Apr 6, 2017, 2:12:56 PM
to ware...@lbl.gov
In updated WW versions, the database files are owned by: root:warewulf

On RHEL, the apache user is added to the warewulf group.

There may be some permission issue that needs to be straightened out
so the /etc/warewulf/database.conf file is readable by the webserver.
Or if permissions are fine, it could have just been a transient
error... I've had DB connects fail before. Rarely, but I have had it
happen.
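A sketch of the fix being described, assuming the configs live in /etc/warewulf and the second file is named database-root.conf (filenames per this thread; adjust to what your install actually has):

```shell
# Expected ownership: root:warewulf, group-readable, with apache in
# the warewulf group.
chgrp warewulf /etc/warewulf/database.conf /etc/warewulf/database-root.conf
chmod 640 /etc/warewulf/database.conf /etc/warewulf/database-root.conf
usermod -a -G warewulf apache

# Apache only picks up new group membership after a restart.
systemctl restart httpd
```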

-J

Jason Kost

Apr 6, 2017, 3:35:10 PM
to Warewulf, novo...@rutgers.edu
This appears to have been an issue with the permissions on the database.conf and database-root files.  I changed ownership to root:apache, and this particular issue seems to have gone away, as can be seen from the syslog output of two different provisioning attempts (before and after the change):

Apr  6 18:09:53 chr1.localdomain wwlogger: Starting the provision handler
Apr  6 18:09:53 chr1.localdomain wwlogger: Running provision script: adhoc-pre
Apr  6 18:24:24 chr1.localdomain wwlogger: Running provision script: ipmiconfig
Apr  6 18:24:24 chr1.localdomain wwlogger: Running provision script: filesystems
Apr  6 18:24:25 chr1.localdomain wwlogger: Running provision script: getvnfs
Apr  6 18:56:29 chr1.localdomain wwlogger: Starting the provision handler
Apr  6 18:56:29 chr1.localdomain wwlogger: Running provision script: adhoc-pre
Apr  6 18:56:30 chr1.localdomain wwlogger: Running provision script: ipmiconfig
Apr  6 18:56:30 chr1.localdomain wwlogger: Running provision script: filesystems
Apr  6 18:56:30 chr1.localdomain wwlogger: Root partition found in WWFILESYSTEMS from node configuration.
Apr  6 18:56:33 chr1.localdomain wwlogger: Running provision script: getvnfs
Apr  6 18:56:47 chr1.localdomain wwlogger: Running provision script: config
Apr  6 18:56:48 chr1.localdomain wwlogger: Running provision script: runtimesupport
Apr  6 18:56:48 chr1.localdomain wwlogger: Running provision script: devtree
Apr  6 18:56:48 chr1.localdomain wwlogger: Running provision script: kernelmodules
Apr  6 18:56:49 chr1.localdomain wwlogger: Running provision script: files
Apr  6 18:56:50 chr1.localdomain wwlogger: Running provision script: mkbootable
Apr  6 18:56:51 chr1.localdomain wwlogger: Running provision script: adhoc-post
Apr  6 18:56:51 chr1.localdomain wwlogger: Running provision script: selinux
Apr  6 18:56:51 chr1.localdomain wwlogger: Running provision script: umount
Apr  6 18:56:51 chr1.localdomain wwlogger: Running provision script: postreboot

Unfortunately, provisioning is still not completing successfully.  I'm going to try wiping the partition data from the node and see if the console output gives any more useful information.

Jason  

Ryan Novosielski

Apr 6, 2017, 3:39:40 PM
to Jason Kost, Warewulf
As Jason Stover said earlier, the expected config is database.conf owned root:warewulf, with apache in the warewulf group.  I'm not sure what to expect if some other part needs to be readable by the warewulf group.

Ryan Novosielski

Apr 6, 2017, 5:03:05 PM
to Kost, Jason, ware...@lbl.gov
Yes, either would solve that problem.  The way you had it might have caused other problems, however. 

You should be able to scroll back in your terminal to see what the error was if you're watching the console. 

--
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark
    `'

On Apr 6, 2017, at 16:50, Kost, Jason <jk...@worcester.edu> wrote:

I believe that's what I intended to do, but I got it backwards.  Yet somehow it still seemed to rectify the problem.  I've changed it to the correct group now. 

Interestingly enough, this time around I managed to get a kernel panic on the node at the same place it usually hangs.  Unfortunately the details flew by too fast to see the source of the panic, and for some reason the console output isn't being redirected to the syslog...  At least there's some outward indication of error rather than hanging with no explanation whatsoever. 

Jason

Jason Kost

Apr 12, 2017, 5:08:26 PM
to Warewulf, jk...@worcester.edu, novo...@rutgers.edu
So, a bit of an update again (unfortunately no good news, but potentially zeroing in on a cause at least). 

I went back to the very beginning and reinstalled my base operating system using Scientific Linux 7.3, then went through the full OpenHPC 1.3 pipeline (which incorporates WW 3.7).  The end result was the same: if the node has a drive with existing partitions, provisioning fails with a kernel panic.  If the partition info on the drive has been wiped, the node hangs indefinitely. 

Through some fiddling around with HP's iLO2, I was finally able to capture the output immediately surrounding the panic, and can now confirm that I am running into exactly the problem brought up in this thread: https://groups.google.com/a/lbl.gov/forum/#!searchin/warewulf/mkbootable/warewulf/cOcHX1AvYfI/CLFquUWAvUMJ

Unfortunately, from reading through the above, and similar threads (most notably https://groups.google.com/a/lbl.gov/forum/#!topic/warewulf/s1mqOMZwJpc), there still doesn't seem to be a straightforward solution to this issue.

Any ideas or suggestions?

Sincerely,
Jason Kost

Jason Kost

Apr 13, 2017, 2:07:05 PM
to Warewulf, jk...@worcester.edu, novo...@rutgers.edu
So, this should be the last update, as I have finally managed to resolve the problem.  It appears to have been a rather silly thing: Warewulf (or some of the underlying commands) isn't able to handle stateful provisioning onto xfs partitions.  Switching the /boot and / partitions to ext4 resolved things. 
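The change amounts to reusing the FILESYSTEMS line from earlier in the thread with ext4 substituted for xfs (a sketch; node name and layout as shown in the provision config above):

```shell
# Same disk layout as the original provision config, but with ext4
# in place of xfs for the /boot and / partitions.
wwsh provision set chr1 --filesystems="mountpoint=/boot:dev=sda1:type=ext4:size=500,dev=sda2:type=swap:size=48000,mountpoint=/:dev=sda3:type=ext4:size=fill"
```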

On a completely unrelated note, is there any news on what's up with the website?  All Warewulf-related resources (other than the Google Groups) have been unavailable for the last week or so. 

Once again, thanks for all the assistance. 

Sincerely,
Jason Kost

Jason Stover

May 2, 2017, 8:11:08 PM
to ware...@lbl.gov, jk...@worcester.edu, novo...@rutgers.edu
Hi Jason,

Sorry, this got lost and I just saw it... Yeah, the bootstrap can
only create partitions of the ext* type; there isn't support for other
filesystem types.

As for the site... I know at one point there was a hardware issue
with the web server. The warewulf.lbl.gov site is deprecated and
mostly in stasis now, as we're trying to move development over to
GitHub (https://github.com/warewulf/warewulf3).

-J