Re: [razor] troubleshooting MK in vmware guest not connecting to razor server


Daniel Pittman

Nov 28, 2012, 1:01:04 PM
to puppet...@googlegroups.com
On Tue, Nov 27, 2012 at 3:16 PM, Alan Laird <alan....@gmail.com> wrote:
>
> I'm new to razor and running into an issue with MK and vmware guests. I
> could really use a suggestion to get past this.
>
> My razor server is a vmware guest and I have two physical servers that
> happily PXE through to the MK and then will install ubuntu based on policy.
> I configured a couple vmware guests and they PXE boot the MK but they do not
> seem to connect to the razor server. Running razor node on the razor
> server only shows the physical hosts and not the VM hosts.
>
> I changed the MK to use the debug image and was able to log in to the node
> and ping the razor server (to validate networking).
>
> Any thoughts?

Can you connect to the Razor server API endpoint as specified in the
configuration, from those problem guests?

--
Daniel Pittman
⎋ Puppet Labs Developer – http://puppetlabs.com
♲ Made with 100 percent post-consumer electrons

Nan Liu

Nov 28, 2012, 1:35:25 PM
to puppet...@googlegroups.com
On Wed, Nov 28, 2012 at 10:01 AM, Daniel Pittman <dan...@puppetlabs.com> wrote:
> Can you connect to the Razor server API endpoint as specified in the
> configuration, from those problem guests?

In the debug MK, check /var/log/rz_mk_common.log for errors and to see which server it is attempting to connect to.
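
A successful ping does not show that the API port itself is reachable. Here is a minimal sketch of a port-level check, assuming Ruby is available on the debug MK and that the API listens on plain TCP. The helper name endpoint_reachable? and the default timeout are illustrative and not part of Razor.

```ruby
require 'socket'

# Hypothetical helper for illustration; not part of Razor or the Microkernel.
# Returns true if a TCP connection to host:port succeeds within `timeout`
# seconds, and false on refusal, unreachable network, DNS failure, or timeout.
def endpoint_reachable?(host, port, timeout = 5)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end
```

For example, endpoint_reachable?('192.0.2.10', 8026) would probe a placeholder address; substitute the host and port from your own Razor configuration.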

Thanks,

Nan

Tom McSweeney

Nov 28, 2012, 2:42:01 PM
to puppet...@googlegroups.com

On Nov 28, 2012, at 2:16 PM, Matt Vogt <vogt...@gmail.com> wrote:

> I've got a really similar issue. Everything is working fine, even through the node checking in the first time, but then it never checks in again, goes inactive and then gets removed after 5 minutes of not checking in. This is at the end of my rz_mk_controller.log:

What version of the Microkernel are you using? This behavior looks a lot like a bug we squashed in a recent release of the MK. The fix was made in v0.9.2.1, so any later release should not have this problem, assuming your problem is the same one Matt reports above.

> Wondering if your log file looks similar. BTW, my log file was rz_mk_controller.log, not rz_mk_common.log

Yes, it is the /var/log/rz_mk_controller.log file, not the /var/log/rz_mk_common.log file. There are also other files in the /tmp directory that may show additional error output...

Matt Vogt

Nov 28, 2012, 2:50:37 PM
to puppet...@googlegroups.com
I'm running the v0.9.3.0 debug MK. Which other files in /tmp should I look at?

Thanks!


Tom McSweeney

Nov 28, 2012, 5:47:35 PM
to puppet...@googlegroups.com
First, I'd check and see if the rz_mk_control_server.rb script is still running.  My guess is that an (uncaught) exception was thrown that caused the process to exit after the first checkin.  Once that process exits, the Microkernel stops checking in.  Once a node stops checking in, it is marked as inactive, and after a configurable amount of time in that inactive state (5 minutes is the default, I think) that node is removed from the system by the Razor daemon (unless it's stopped checking in because an active_model was bound to it)...

If I remember correctly, the exception that was thrown should appear either in the log file I mentioned in my previous reply or in /tmp/rz_mk_control_server.out...or something similar to that name...

If you have trouble finding that output file, let me know.  I should be back at my desk in a few hours, where I can check my laptop for the appropriate filename...

Matt Vogt

Nov 28, 2012, 6:18:58 PM
to puppet-razor
Yes, the script is still running and the rz_mk_controller.log file is full of connection attempts:

E, [2012-11-28T22:56:46.049787 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
E, [2012-11-28T22:57:46.109802 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
E, [2012-11-28T22:58:46.170156 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
E, [2012-11-28T22:59:46.230337 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
E, [2012-11-28T23:00:46.290329 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
E, [2012-11-28T23:01:46.350702 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
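
To see how regularly these failures recur, the timestamps can be pulled out with a few lines of Ruby. This is a minimal sketch that assumes the standard Ruby Logger line format shown above; the helper names are illustrative only.

```ruby
require 'time'

# Extract timestamps from Ruby Logger-style error lines such as
#   E, [2012-11-28T22:56:46.049787 #2789] ERROR -- Object#: ...
# The timestamps carry no zone, so they are read as UTC purely in order to
# take differences between them.
def error_timestamps(log_text)
  pattern = /^E, \[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #\d+\] ERROR/
  log_text.scan(pattern).flatten.map { |stamp| Time.iso8601("#{stamp}Z") }
end

# Seconds between consecutive error entries, rounded to two decimals.
def error_intervals(log_text)
  error_timestamps(log_text).each_cons(2).map { |a, b| (b - a).round(2) }
end

sample = <<~LOG
  E, [2012-11-28T22:56:46.049787 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
  E, [2012-11-28T22:57:46.109802 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
  E, [2012-11-28T22:58:46.170156 #2789] ERROR -- Object#: An exception occurred: Network is unreachable - connect(2)
LOG

puts error_intervals(sample).inspect
```

On the lines above the gaps come out at roughly 60 seconds, which is consistent with the controller still running and retrying on a fixed schedule.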

The rz_mk_control_server.rb.output file has an exception in it, and I've attached it (we'll see if this works).
--
rz_mk_control_server.rb.output

Tom McSweeney

Nov 28, 2012, 9:17:06 PM
to puppet...@googlegroups.com
So, if I remember correctly, the issue is that your Razor Microkernel instances are checking in successfully once, then failing to check in at all after that, is that right?

If it is checking in successfully exactly once, then failing to check in after that, then try this test.  Go and look at the value for the 'mk_uri' parameter that is defined in the '/tmp/mk_conf.yaml' file on that Microkernel instance.  Can you actually ping the Razor server using the IP address shown in that URI?  My guess is that you'll see an IP address in the value for that 'mk_uri' parameter that either doesn't exist or isn't routable from the local network.  This is typically because the 'next-server' IP address is correct in the DHCP server (it has to be for the node to successfully boot into the Microkernel), and that "next-server" IP address actually does point to the Razor server.  This is the value that will be used for the initial checkin (which is succeeding, right?).

If that is the case (that it checked in once successfully, but the IP address in the 'mk_uri' value defined in that file is incorrect), then the issue is likely that the IP address used when defining that same 'mk_uri' value in your '${RAZOR_HOME}/conf/razor_server.conf' file is not correct.  If that is the case, then the new configuration will be returned to the Microkernel instance after the first successful checkin, and that new configuration (including the new value for the 'mk_uri' parameter with the incorrect IP address) will be saved to that '/tmp/mk_conf.yaml' file.

Once the new configuration received from the Razor server has been saved to that file, the Microkernel Controller restarts, and it picks up its configuration from that same file (including the new 'mk_uri' value that it received from Razor, the one with the incorrect IP address for the Razor server).  From there on out, the Microkernel instance will not be able to check in with the Razor server.  After a certain amount of time it will be transitioned to an inactive state (since it hasn't checked in for a while), and a short time later (about 5 minutes using the default configuration) the node will be removed from the list of nodes that Razor is managing (since it hasn't checked in for a "long time").

So, could you check the contents of that file and let me know whether or not you can ping Razor at the IP address in that file?  If not, the fix might be as simple as changing the 'mk_uri' value in your razor configuration file (on the Razor server).  Once that file is changed, just reboot the machines running the Microkernel instances and they should check in successfully until they are bound to a model by a policy (at which time they will reboot) or they are powered off.
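
A minimal sketch of pulling the host and port out of that file so they can be tested directly. It assumes the file is a flat YAML mapping with an 'mk_uri' key; the real layout may differ, and the function name razor_endpoint and the sample address are illustrative only.

```ruby
require 'yaml'
require 'uri'

# Pull the Razor host and port out of the Microkernel configuration text.
# Assumption: the text is a flat YAML mapping that contains an 'mk_uri' key.
# Raises KeyError if the key is missing.
def razor_endpoint(conf_text)
  conf = YAML.safe_load(conf_text) || {}
  uri = URI.parse(conf.fetch('mk_uri'))
  [uri.host, uri.port]
end

# Placeholder content; on a Microkernel instance you would pass
# File.read('/tmp/mk_conf.yaml') instead.
host, port = razor_endpoint("mk_uri: http://192.0.2.10:8026\n")
puts "Razor server expected at #{host}:#{port}"
```

The host printed here is the address to ping (or, better, to connect to on the given port) from the problem node.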

Hopefully that is the issue, let me know if it is not.

Matt Vogt

Nov 28, 2012, 9:49:37 PM
to puppet...@googlegroups.com
Tom,

I had actually checked that, and the mk_uri is a valid, routable IP address that can be pinged from the node (I've scp'ed files from the node to that IP address).

Sorry :)

Tom McSweeney

Nov 28, 2012, 10:03:00 PM
to puppet...@googlegroups.com
I have no idea what's going on, then...

If the Microkernel Controller is still running, and it's still attempting to check in with the Razor server, and if the Razor server is still up and running (you can "see" it at the indicated port from that URI, right?) then you should be seeing checkin requests on the Razor server (and replies back to the Microkernel Controller)...

Are you seeing the checkin requests arriving on the Razor server from those nodes (they should show up in the razor_daemon.log file)?  If so, then the only other thing I can think of is something similar we saw when one user's Razor server had accumulated an obscenely large number of policy entries in its MongoDB instance.  Because the policy collection was so large, checkin with the Razor server was timing out.  That issue was fixed in a relatively recent release of Razor, but you may or may not have updated recently.  Unfortunately you don't seem to be having the same problem, since the error you're seeing is different (theirs was a timeout error, not a "Network is unreachable" error).

Here's a quick Ruby script you can use to test that out (it should be run from your Razor server):
# cat ./get_razor_collection_counts.rb
#!/usr/bin/env ruby
# Print the document count of every collection in Razor's MongoDB database.
require 'rubygems'
require 'mongo'

hostname = '127.0.0.1'  # MongoDB host
port = 27017            # MongoDB port

connection = Mongo::Connection.new(hostname, port)
db = connection.db("project_razor")
db.collections.each do |collection|
  puts "#{collection.name}.count = #{collection.count}"
end
#
As I said, if that's not the issue then I'm fresh out of ideas...the only other thing I can think of doing is to run the debug microkernel (which gathers a lot more data during the initial boot) and to set the 'mk_log_level' parameter in your Razor server configuration file to 'Logger::DEBUG' so that the debug output continues to be collected after the first checkin...

LMK if that script shows anything odd in your MongoDB collection sizes.

Alan Laird

Nov 29, 2012, 4:10:33 PM
to puppet...@googlegroups.com
Well that's interesting....

The VM's /var/log/rz_mk_controller.log is showing something like this:

DEBUG -- Object#: registration_uri = http://false:8026/razor/api/node/register

Tom McSweeney

Nov 29, 2012, 6:34:07 PM
to puppet...@googlegroups.com
What does a 'razor config' show on your Razor server?

Alan Laird

Nov 29, 2012, 6:41:27 PM
to puppet...@googlegroups.com


On Thursday, November 29, 2012 3:34:07 PM UTC-8, Tom McSweeney wrote:
> What does a 'razor config' show on your Razor server?


root@pdxnpalaird02:~# razor config
ProjectRazor Config:
mk_tce_mirror_port: 2157 
persist_host: 127.0.0.1 
default_ipmi_power_state: off 
rz_mk_boot_debug_level:  
default_ipmi_password: ipmi_password 
mk_gem_mirror_uri: http://localhost:2158/gem-mirror 
admin_port: 8025 
mk_checkin_path: /razor/api/node/checkin 
mk_log_level: Logger::ERROR 
mk_checkin_skew: 5 
image_svc_port: 8027 
image_svc_path: /opt/razor/image 
mk_fact_excl_pattern: (^facter.*$)|(^id$)|(^kernel.*$)|(^memoryfree$)|(^operating.*$)|(^osfamily$)|(^path$)|(^ps$)|(^ruby.*$)|(^selinux$)|(^ssh.*$)|(^swap.*$)|(^timezone$)|(^uniqueid$)|(^uptime.*$)|(.*json_str$) 
default_ipmi_username: ipmi_user 
persist_timeout: 10 
mk_checkin_interval: 60 
daemon_min_cycle_time: 30 
image_svc_host: 10.4.35.103 
persist_mode: mongo 
register_timeout: 120 
mk_register_path: /razor/api/node/register 
noun: config 
force_mk_uuid:  
api_port: 8026 
persist_port: 27017 
node_expire_timeout: 300 
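
For anyone comparing this output against the timings discussed earlier in the thread, here is a minimal sketch that parses the "key: value" lines into a Hash. It assumes that node_expire_timeout (in seconds) is the removal window mentioned earlier and that mk_checkin_interval is the check-in period; the function name is illustrative only.

```ruby
# Parse "key: value" lines as printed by `razor config` into a Hash of strings.
# Lines whose key is not a single word (such as the "ProjectRazor Config:"
# banner) are ignored; keys with no value map to an empty string.
def parse_razor_config(output)
  output.each_line.each_with_object({}) do |line, config|
    match = line.match(/\A(\w+):\s*(.*?)\s*\z/)
    config[match[1]] = match[2] if match
  end
end

# A few lines copied from the output above.
sample = <<~OUT
  ProjectRazor Config:
  mk_checkin_interval: 60
  api_port: 8026
  node_expire_timeout: 300
OUT

config = parse_razor_config(sample)
interval = Integer(config['mk_checkin_interval'])
expiry = Integer(config['node_expire_timeout'])
puts "check-in every #{interval}s, expiry after #{expiry}s (#{expiry / 60} min)"
```

Under those assumptions, 300 seconds matches the "5 minutes" figure quoted earlier.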

Tom McSweeney

Nov 29, 2012, 8:09:10 PM
to puppet...@googlegroups.com
Do you, by any chance, have a DNS entry for Razor that is returning false?  Is there a razor.ip value (or razor.server value) in your iPXE script boot line that is set to false?  There are now four ways to find the Razor server...perhaps we need to rethink this a bit.
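
One way to check the boot line is to scan it for razor.* parameters and flag any that are empty or literally "false". This is only a sketch: it assumes the parameters appear as key=value tokens on the kernel command line (readable from /proc/cmdline on a booted Linux system), and the function names and sample boot line are hypothetical.

```ruby
# Collect razor.* parameters from a kernel boot line.
# Assumption: parameters are whitespace-separated key=value tokens.
def razor_boot_params(cmdline)
  cmdline.split.each_with_object({}) do |token, params|
    key, value = token.split('=', 2)
    params[key] = value if key.start_with?('razor.')
  end
end

# Parameters whose value is missing, empty, or the literal text "false".
def suspicious_boot_params(cmdline)
  razor_boot_params(cmdline).select do |_key, value|
    value.nil? || value.empty? || value == 'false'
  end
end

# Hypothetical boot line for illustration only.
sample = 'quiet razor.ip=false razor.server=razor.example.com'
puts suspicious_boot_params(sample).inspect
```

On a Microkernel instance the input would be File.read('/proc/cmdline') rather than the sample string.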

Tom McSweeney

Nov 29, 2012, 8:20:06 PM
to puppet...@googlegroups.com
Then again, perhaps it's an issue with the new code :(  Have you tried a slightly older version of the Razor Microkernel (v0.9.1.7, perhaps)?  If not, please give that a shot and let me know if that works for you (since it doesn't have the "fix" that adds the other sources for determining the Razor server)...

The odd thing is that the change in question (adding in multiple sources for determining the Razor IP address) was added a little over three weeks ago.  Did you just update to a more recent version of the Microkernel, perhaps?  If that's the case, perhaps we need to revisit the syntax of the changes made by @GregSutcliffe (pull request #13) in order to add this new feature...

Sent from my iPad

Alan Laird

Nov 29, 2012, 8:59:55 PM
to puppet...@googlegroups.com
I just built this stack of gear so it's all the latest bits.  I created a vmware guest for the razor server, grabbed two spare dell systems, and created a couple more vm guests.  The hardware works fine and I deployed ubuntu on one of the systems.  

Both the bare metal and the VM are using the same network, DHCP, and razor server.  I only have the one MK installed and it is the debug one. The /var/lib/tftpboot/pxelinux.cfg/default appears to be generic.

So..... Maybe it would be good to explore some of the other methods of passing the razor server to the node.  Or, I might try an older MK....

Thanks!

Alan

Tom McSweeney

Nov 29, 2012, 9:58:18 PM
to puppet...@googlegroups.com
So I guess I'm confused here...is it working for some nodes but not for others? Do the "problem nodes" ever check in successfully with the Razor server or do they show this error from the very beginning (with the hostname in the URI set to "false" for all checkin requests)?

It's pretty apparent that the Microkernel is booting properly, so the issue is probably not related to your "next-server" settings in your TFTP server directory. If I understand what you're seeing correctly, then I think the issue lies either in the configuration that is being returned by Razor (for only some of the nodes? If so, that's odd) or in the initial configuration changes that are made by the Microkernel Controller during the boot process (which should affect ALL Microkernel instances, not just a subset of them). Those configuration parameters are saved to the /tmp/mk_conf.yaml file on the Microkernel. Can you verify that you have a /tmp/mk_conf.yaml file that shows "false" for the hostname of the Razor server in that URI on one of those problem nodes?

Alan Laird

Nov 29, 2012, 11:38:14 PM
to puppet...@googlegroups.com


I have two hardware nodes where MK appears to work correctly and they register with the razor server.  I have two VM guests (in the same environment as the vm razor server) and they fail to register. 

Here are the contents of /tmp/mk_conf.yaml.  The mk_uri is correct: http://10.4.35.103:8026

I wonder why it's showing up as http://false in /var/log/rz_mk_controller.log....



vm_mk.png

Tom McSweeney

Nov 30, 2012, 3:16:31 AM
to puppet...@googlegroups.com
What is in the /tmp/nextServerIP.addr file on the Microkernel instances
that are failing to register?

Alan Laird

Nov 30, 2012, 3:54:58 PM
to puppet...@googlegroups.com


On Friday, November 30, 2012 12:16:31 AM UTC-8, Tom McSweeney wrote:
> What is in the /tmp/nextServerIP.addr file on the Microkernel instances
> that are failing to register?


That is correct as well.  10.4.35.103


Tom McSweeney

Nov 30, 2012, 5:09:55 PM
to puppet...@googlegroups.com
From looking at the code, I'm starting to think this is a "false" is not false type of problem...will check it more when I'm back at my desk (Monday?)
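
To illustrate what that kind of problem looks like in Ruby (this is not the actual Microkernel code, only a sketch of the suspected pattern): only nil and false are falsy, so the String "false" passes a truthiness test and ends up in the URI. The function names are hypothetical; the port, path, and address are the ones reported earlier in the thread.

```ruby
# Naive selection: any value other than nil or the Boolean false wins,
# including the String "false".
def pick_host_naive(candidate, fallback)
  candidate || fallback
end

# Stricter selection: also rejects blank strings and the literal text "false".
def pick_host_strict(candidate, fallback)
  text = candidate.to_s.strip
  (text.empty? || text == 'false') ? fallback : text
end

def registration_uri(host, port = 8026, path = '/razor/api/node/register')
  "http://#{host}:#{port}#{path}"
end

puts registration_uri(pick_host_naive('false', '10.4.35.103'))
puts registration_uri(pick_host_strict('false', '10.4.35.103'))
```

The first line printed reproduces the http://false:8026/razor/api/node/register value seen in the log; the second falls back to the real address.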

Tom McSweeney

Dec 3, 2012, 1:44:47 PM
to puppet...@googlegroups.com
I think a bit more information is needed to debug this issue...can you try the following:
  • if you haven't already, download the latest "debug" Microkernel (v0.9.3.0) and set up Razor so that your nodes will boot using that Microkernel
  • if you haven't already, go into your ${RAZOR_HOME}/conf/razor_server.conf file and set the 'mk_log_level' parameter to a value of Logger::DEBUG (this will ensure that you continue to get debug output in the Microkernel log files, even after the first successful checkin with Razor occurs)
  • reboot one of your "problem nodes" and wait for it to fail to check in with Razor
  • post the contents of the following files (from the instance that is failing to check in) somewhere we can download them from (unless you are able to attach them to an email).  A gzipped (or bzipped) tarfile will do nicely
    • /var/log/rz_mk_controller.log
    • /tmp/rz_mk_control_server.rb.output
    • /tmp/mk_conf.yaml
    • /tmp/first_checkin.yaml
    • /tmp/nextServerIP.addr
    • /tmp/prev_facts.yaml
  • In that same gzipped (or bzipped) tarfile it might also help to include the output of the "dmesg" command (from that same node); just use the "tee" command to capture the output in a file and include it in the same tarfile you are constructing (above)
  • The contents of your ${RAZOR_HOME}/conf/razor_server.conf file
  • A description of the environment that you are using...I guess I just need to understand how you've configured the various machines here (the Razor server, the physical servers, and the VMs).  Also, please do verify that these VMs are successfully checking in once (and only once), but that they are failing to check in after that first successful checkin (and Razor is eventually dropping them from the list of nodes as a result)
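
If it helps, the files listed above could be bundled with a short Ruby script instead of running tar by hand. This is a minimal sketch that assumes a reasonably recent Ruby with the standard rubygems/package and zlib libraries; the function name, constant name, and archive path are illustrative only.

```ruby
require 'rubygems/package'
require 'zlib'

# Files requested in the list above.
DIAGNOSTIC_FILES = %w[
  /var/log/rz_mk_controller.log
  /tmp/rz_mk_control_server.rb.output
  /tmp/mk_conf.yaml
  /tmp/first_checkin.yaml
  /tmp/nextServerIP.addr
  /tmp/prev_facts.yaml
].freeze

# Write every readable file in `paths` into a gzipped tar archive at
# `archive_path`, stored under its base name. Returns the names that were
# added; missing files are skipped rather than treated as errors.
def bundle_diagnostics(paths, archive_path)
  added = []
  Zlib::GzipWriter.open(archive_path) do |gz|
    Gem::Package::TarWriter.new(gz) do |tar|
      paths.each do |path|
        next unless File.file?(path) && File.readable?(path)

        data = File.binread(path)
        name = File.basename(path)
        tar.add_file_simple(name, 0o644, data.bytesize) { |io| io.write(data) }
        added << name
      end
    end
  end
  added
end
```

Calling bundle_diagnostics(DIAGNOSTIC_FILES, '/tmp/mk_diagnostics.tar.gz') on the failing node would produce the archive. To include the dmesg output, capture it to a file first and add that path to the list.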

Hopefully the contents of those files will give a bit more information as to what is going on here...if not I'll try to come up with a custom MK that you can use to gather more diagnostic output for debugging...

Thanks in advance for your help with this.  I don't seem to be able to recreate the error you are seeing in my own environment (running either VMware Workstation 8.x or VMware Workstation 9.x on a 64-bit Linux host and iPXE booting VMs using a Razor instance running in a separate VM using that same VMware environment)...
