Test failures question (long, sorry)

Evan Thomas

unread,

Nov 19, 2014, 6:10:45 PM11/19/14

to Quality Assurance Team

Hi all,

I've been having multiple test failures in QA over the last week, all to test a simple change to validate template in the CloudFormation service. I don't think any of the stock tests touch cloudformation, so I don't see how it should fail euca-sequence-01.

I'm not asking someone to scrutinize every test result, as there are several, but I will list what I have.

EUCA-10129 (also notice it uses EDGE)

Build-n-conf passed, ran it to test the new feature
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1000

changed test to use euca-sequence-4.1, failed several tests, starting with net_test

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1001

and repeated doing so

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1002
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1003

These all occured during the week of the on-boarding. I figured maybe there was some networking stuff that was problematic. Monday when I got back I tried again. Same thing

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1004

Then I thought I'd run against standard testing and see what I got.

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-test&uid=1000

euare and bfebs failures, but no net test failures.

I figured there might be an issue that was fixed in testing, so I merged testing into my branch and tried again.

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1005

Still the same types of failures.

Ran testing again

No net_test failures but euare again and this time instance_suite failures

Ran both one more time,

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-test&uid=1001

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1006

saw that the testing one only had one failure, the other one was failing as usual but then realized that the test wasn't a good comparison as testing was in MANAGED mode and the other one was in EDGE so I killed the other test.

Tried testing in EDGE mode
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1005

Similar failures to what I had seein in the EUCA-10129 test, starting with net_test

Another thing that might be an issue is that maybe the memo fields had changed from quick-launch or something. I created a new branch EUCA-10129-2, which was just the change I had made in EUCA-10129 right into testing. That run was no better.

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129-2&uid=1000

Got even more errors starting it so i thought maybe that was a fluke and ran again

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129-2&uid=1001

Seemed to be consistent with the other errors from before (net_test).

I asked in a previous email about: http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1004

Matt Clark suggest I run the smoke test for configure_imaging and if that worked run the euca-sequence-4.1 smoke test.

I no longer had that set of machines to run against though.

Finally, I decided to try all variables

1) My code vs testing

2) edge vs managed

I created 4 tests, ethomas-EUCA-10129-2-managed, ethomas-test-managed, ethomas-EUCA-10129-2-edge, ethomas-EUCA-10129-2-edge

The managed tests both finished first:
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-test-managed&uid=1000

in this case the "testing" branch is having issues with net_test even in managed mode.

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129-2-managed&uid=1000

This is the fewest number of tests that my code has failed, and it is just start_stop_bfebs which had happened in other testing setups with managed mode

The edge tests took a lot longer to finish (this has consistently been the case over the last week)
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-test-edge&uid=1000

Similar types of failures, net test, and now also ebs and instance suite failures

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129-2-edge&uid=1000

Same as testing case

So as per Matt Clark's advice, I ran the smoke-configure-imaging-service test against the later 3 tests that had failed, and they all passed.

http://10.111.1.120/euca-qa/display_test.php?testname=SMOKE-ethomas-test-edge&uid=1000
http://10.111.1.120/euca-qa/display_test.php?testname=SMOKE-ethomas-test-managed&uid=1000
http://10.111.1.120/euca-qa/display_test.php?testname=SMOKE-ethomas-EUCA-10129-2-edge&uid=1000

But running the euca-sequence-4.1-smoke

http://10.111.1.120/euca-qa/display_test.php?testname=SMOKE-ethomas-test-managed&uid=1001

One failed, with euare failure (which only happened once before and which is the "testing" branch), and autoscaling and bfebs failures (all have happened before)

the rest are (presumably) still running

http://10.111.1.120/euca-qa/display_test.php?testname=SMOKE-ethomas-test-edge&uid=1001
http://10.111.1.120/euca-qa/display_test.php?testname=SMOKE-ethomas-EUCA-10129-2-edge&uid=1001

In all cases instance_suite seems to be taking a long time too.

So with all this stuff, I guess my question is:

1) are we having net_test issues with edge (or managed) mode?

2) What other common failures are happening?

3) Is my test in any way considered "passed"? (The only one that seemed to come close is the one against managed mode, and in that case only the start_stop_bfebs test failed)

4) Is something wrong with any of my configs?

5) What's going on?

If anyone has any ideas, I would be appreciative.

Thanks,

-Evan

Matt Clark

unread,

Nov 19, 2014, 7:02:51 PM11/19/14

to Evan Thomas, Quality Assurance Team

HI Evan,

Thanks for the detailed info!

For the 4.1 test sequence(s)...

I haven't looked at every test but so far the tests which are failing 'net_test' seem to be due to the configure_imaging test unit failing. I think there's a bug in the legacy QA system that results in the test showing 'passed' when it actually failed, one issue in the past was test units which had long names?

In this case it looks like a perms issue when creating the python virtual env on the testing machine/vm?

Since this did not happen in the smoke test(s) we might just be able to update the 4.1 sequence to see if it the test units are out of date.

Does anyone have any objection to updating euca-sequence-4.1?

For the 4.0 test sequence it looks like there was an issue with EBS snapshots maybe?

Thanks!

-M

--

Matt Clark - Software Quality Ninja

Eucalyptus Systems
www.eucalyptus.com

Iglesias, Vic

unread,

Nov 19, 2014, 7:08:56 PM11/19/14

to Matt Clark, Evan Thomas, Quality Assurance Team

Matt,

Go ahead with the update of the sequence.

From: Matt Clark <matt....@eucalyptus.com<mailto:matt....@eucalyptus.com>>
Date: Wednesday, November 19, 2014 at 4:02 PM
To: Evan Thomas <eth...@eucalyptus.com<mailto:eth...@eucalyptus.com>>
Cc: Quality Assurance Team <qa-...@eucalyptus.com<mailto:qa-...@eucalyptus.com>>
Subject: Re: Test failures question (long, sorry)

HI Evan,
Thanks for the detailed info!
For the 4.1 test sequence(s)...
I haven't looked at every test but so far the tests which are failing 'net_test' seem to be due to the configure_imaging test unit failing. I think there's a bug in the legacy QA system that results in the test showing 'passed' when it actually failed, one issue in the past was test units which had long names?
In this case it looks like a perms issue when creating the python virtual env on the testing machine/vm?

Since this did not happen in the smoke test(s) we might just be able to update the 4.1 sequence to see if it the test units are out of date.
Does anyone have any objection to updating euca-sequence-4.1?

For the 4.0 test sequence it looks like there was an issue with EBS snapshots maybe?
Thanks!
-M

www.eucalyptus.com<http://www.eucalyptus.com/>

Matt Clark

unread,

Nov 19, 2014, 7:26:31 PM11/19/14

to Iglesias, Vic, Evan Thomas, Quality Assurance Team

Ok, updated euca-sequence-4.1 so there might be some differences when running the test but after looking at more of Evan's failed tests it looks like there may have been a general networking and/or image issue. Instances which were up, running and granted ssh and icmp access were not accessible.

I see this from the instance(s)'s console output, maybe a bad image or the test is trying to run a service image (imaging, ELB, etc) ?

no instance data found in start-local

cloud-init-nonet waiting 120 seconds for a network device.

cloud-init-nonet gave up waiting for a network device.

Evan,

If you want to re-run a system with euca-sequence-4.1 and freeze the system at the end, I'd be happy to help debug whats going on?

Thanks!

-Matt

Matt Clark

unread,

Nov 20, 2014, 10:52:19 AM11/20/14

to Iglesias, Vic, Evan Thomas, Quality Assurance Team

Quick update on this one...

There were some test issues related to permissions on the test/QA server, now resolved but this is not the main issue. Looks like dhcpd is not running on the nodes. My guess is because there's a server already running on the nodes ...maybe kicked off by the hypervisor?

nobody 6821 0.0 0.0 12888 724 ? S Nov19 0:00 /usr/sbin/dnsmasq --strict-order --pid-file=/var/run/libvirt/network/default.pid --conf-file= --except-interface lo --bind-interfaces --listen-address 192.168.122.1 --dhcp-range 192.168.122.2,192.168.122.254 --dhcp-leasefile=/var/lib/libvirt/dnsmasq/default.leases --dhcp-lease-max=253 --dhcp-no-override --dhcp-hostsfile=/var/lib/libvirt/dnsmasq/default.hostsfile --addn-hosts=/var/lib/libvirt/dnsmasq/default.addnhosts

Not sure if this is the underlying cause yet, but if so the next question here is...

Should eucanetd error out and die under this condition?

There is an error in the log (which I looked for after looking for the dhcpd process) , but errors like this which are known to effect the operation of the node/VMs should be propagated upstream in some form. There's an open bug related to this. I think if eucanetd dies the node will transition to disabled and at least gives the admin/operator a big clue that something needs servicing and will prevent user's VMs from landing on that node.

Thanks!

-Matt

Evan Thomas

unread,

Nov 20, 2014, 2:01:23 PM11/20/14

to Matt Clark, Iglesias, Vic, Quality Assurance Team

Hi Matt,

Is it worth trying this test again then?

-Evan

Matt Clark

unread,

Nov 20, 2014, 2:07:29 PM11/20/14

to Evan Thomas, Iglesias, Vic, Quality Assurance Team

Hey Evan,

I think the test by itself will continue to fail until we figure out what's changed on the nodes. You could kill this other dhcpd instance and maybe restart eucanetd at an early stage in the test(s) and/or kill it now and do the smoke full sequence-4.1.

Shaon suggested a change to support the VPC setup may be to blame, but I haven't looked yet.

Thanks,

-M

Evan Thomas

unread,

Nov 20, 2014, 2:11:36 PM11/20/14

to Matt Clark, Iglesias, Vic, Quality Assurance Team

Hi Matt,

Sorry for my ignorance here, but the dhcpd instance, is it running on one of the two machines the tests are using? If so shouldn't restarting the test nuke everything? Same question about the eucanetd stuff. Finally, what would you suggest to do to consider the code I'm testing "ok", based on these other issues? Has anyone else seen this stuff? If not, has everyone migrated to the new qa system? Is that what I should do? (I know yes ultimately, but I'm sure there's a learning curve as well)

-Evan

Matt Clark

unread,

Nov 20, 2014, 6:43:52 PM11/20/14

to Evan Thomas, Iglesias, Vic, Quality Assurance Team

Hi Evan,

Sorry poor use of wording on my part. The hypervisor is running it's own dhcpd server on the node which (due to a recent change in dhcpd) seems to detect the dual use of UDP 0.0.0.0:67 as a conflict and prevents a second dhcpd server from binding to this port. Doesn't appear to have cared in the past.

This causes eucanetd to fail to run the dhcpd server and instances do not get IPs.

I opened to track this:

https://eucalyptus.atlassian.net/browse/EUCA-10166

If you want to log into your nodes and 'virsh network-destroy default' while the test is in the initial setup stages, you'll probably pass a majority of these failing cases.

...However a heads up on the EBS failures...

There's also a ticket opened now against snapshot failures in general which we discussed in today's bug scrub. This is going to cause a lot of EBS related test units to fail as well.

Thanks!

-M

Matt Clark

unread,

Nov 21, 2014, 2:10:28 AM11/21/14

to Evan Thomas, Iglesias, Vic, Quality Assurance Team

Just fyi,

I added a quick work around to the configure_edge unit to 'virsh net-destroy' the default network from each node so dnsmasq is not running/conflicting with eucanetd's dhcpd server. This fixes most of the issues seen. We still have the snapshot issue as mentioned above...

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129-2-edge&uid=1003&prev_page=http%3A%2F%2F10.111.1.120%2Feuca-qa%2Fstart_test.php&Refresh=Refresh

Cheers,

-M

Timothy Cramer

unread,

Nov 21, 2014, 9:01:23 AM11/21/14

to Matt Clark, Evan Thomas, Iglesias, Vic, Quality Assurance Team

Swathi said Wes would have the snapshot (actually multi part upload) bug fixed today

Tim

Reply all

Reply to author

Forward