Test failures question (long, sorry)

Evan Thomas

Nov 19, 2014, 6:10:45 PM11/19/14
to Quality Assurance Team
Hi all,

  I've been having multiple test failures in QA over the last week, all while testing a simple change to ValidateTemplate in the CloudFormation service.  I don't think any of the stock tests touch CloudFormation, so I don't see how it should be failing euca-sequence-01.

I'm not asking someone to scrutinize every test result, as there are several, but I will list what I have.

EUCA-10129 (also note it uses EDGE)
Build-n-conf passed; I ran it to test the new feature:
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1000

I changed the test to use euca-sequence-4.1 and it failed several tests, starting with net_test:

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1001

and I repeated the run, with the same results:

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1002
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1003

These all occurred during the week of the on-boarding, so I figured maybe there was some networking issue related to that. On Monday when I got back I tried again, with the same result:

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1004

Then I thought I'd run against the standard testing branch and see what I got:

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-test&uid=1000

euare and bfebs failures, but no net_test failures.

I figured there might be an issue that was fixed in testing, so I merged testing into my branch and tried again.
Still the same types of failures.

I ran testing again: no net_test failures, but euare failed again and this time instance_suite failed as well.

I ran both one more time:

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-test&uid=1001

The testing run had only one failure, while the other one was failing as usual, but then I realized this wasn't a good comparison: testing was in MANAGED mode and the other run was in EDGE, so I killed the other test.

I then tried testing in EDGE mode:
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129&uid=1005

Similar failures to what I had seen in the EUCA-10129 tests, starting with net_test.

Another thought was that maybe the memo fields had changed from quick-launch or something, so I created a new branch, EUCA-10129-2, which was just the change from EUCA-10129 applied directly on top of testing. That run was no better:

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129-2&uid=1000

It got even more errors on startup, so I thought maybe that was a fluke and ran it again:

http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-EUCA-10129-2&uid=1001

Seemed to be consistent with the other errors from before (net_test).

Matt Clark suggested I run the smoke test for configure_imaging and, if that worked, run the euca-sequence-4.1 smoke test.

I no longer had that set of machines to run against though.

Finally, I decided to try all combinations of the variables:

1) My code vs testing
2) EDGE vs MANAGED

I created four tests: ethomas-EUCA-10129-2-managed, ethomas-test-managed, ethomas-EUCA-10129-2-edge, and ethomas-test-edge.


In this case the "testing" branch is having issues with net_test even in MANAGED mode.
The MANAGED run of my code had the fewest failures of any of my runs; only start_stop_bfebs failed, which had also happened in other MANAGED-mode setups.

The edge tests took a lot longer to finish (this has consistently been the case over the last week)
http://10.111.1.120/euca-qa/display_test.php?testname=ethomas-test-edge&uid=1000
Similar types of failures: net_test, and now also ebs and instance_suite failures (same as the testing case).


One run failed with a euare failure (which had only happened once before, on the "testing" branch), plus autoscaling and bfebs failures (all of which have happened before).

The rest are (presumably) still running.
In all cases instance_suite seems to be taking a long time as well.


So with all this stuff, I guess my questions are:
1) Are we having net_test issues with EDGE (or MANAGED) mode?
2) What other common failures are happening?
3) Is my test in any way considered "passed"? (The only one that seemed to come close is the one against MANAGED mode, and in that case only the start_stop_bfebs test failed.)
4) Is something wrong with any of my configs?
5) What's going on?

If anyone has any ideas, I would be appreciative.

Thanks,
-Evan

Matt Clark

Nov 19, 2014, 7:02:51 PM11/19/14
to Evan Thomas, Quality Assurance Team
Hi Evan,
Thanks for the detailed info! 
For the 4.1 test sequence(s)...
I haven't looked at every test, but so far the runs that are failing 'net_test' seem to be failing because the configure_imaging test unit failed. I think there's a bug in the legacy QA system that results in a test unit showing 'passed' when it actually failed; one issue in the past was test units with long names.
In this case it looks like a permissions issue when creating the Python virtual env on the testing machine/VM.
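If someone wants to check that part by hand, something like this should show whether virtualenv creation works for the test user (the user and path here are just placeholders):

sudo -u <test-user> virtualenv /tmp/imaging-venv-check    # does venv creation itself succeed?
ls -ld /tmp/imaging-venv-check                             # confirm ownership and permissions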

Since this did not happen in the smoke test(s), we might just be able to update the 4.1 sequence to see if its test units are out of date.
Does anyone have any objection to updating  euca-sequence-4.1?

For the 4.0 test sequence, it looks like there may have been an issue with EBS snapshots.
Thanks!
-M



--
Matt Clark - Software Quality Ninja 
Eucalyptus Systems
www.eucalyptus.com

Iglesias, Vic

Nov 19, 2014, 7:08:56 PM11/19/14
to Matt Clark, Evan Thomas, Quality Assurance Team
Matt,

Go ahead with the update of the sequence.


Matt Clark

Nov 19, 2014, 7:26:31 PM11/19/14
to Iglesias, Vic, Evan Thomas, Quality Assurance Team
OK, I've updated euca-sequence-4.1, so there might be some differences when running the test. After looking at more of Evan's failed tests, though, it looks like there may have been a general networking and/or image issue: instances which were up, running, and granted SSH and ICMP access were not reachable.
I see this in the instances' console output; maybe it's a bad image, or the test is trying to run a service image (imaging, ELB, etc.)?

no instance data found in start-local
cloud-init-nonet waiting 120 seconds for a network device.
cloud-init-nonet gave up waiting for a network device.
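(For anyone who wants to pull the same output themselves, the usual euca2ools command works, assuming admin credentials are sourced and you substitute the instance id from the failed test:)

euca-get-console-output i-XXXXXXXX    # placeholder instance id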

Evan,
If you want to re-run a system with euca-sequence-4.1 and freeze the system at the end, I'd be happy to help debug what's going on.
Thanks!
-Matt

Matt Clark

Nov 20, 2014, 10:52:19 AM11/20/14
to Iglesias, Vic, Evan Thomas, Quality Assurance Team
Quick update on this one...
There were some test issues related to permissions on the test/QA server; those are now resolved, but they were not the main issue. It looks like dhcpd is not running on the nodes. My guess is that's because there's a DHCP server already running on the nodes, maybe kicked off by the hypervisor:

nobody    6821  0.0  0.0  12888   724 ?        S    Nov19   0:00 /usr/sbin/dnsmasq --strict-order --pid-file=/var/run/libvirt/network/default.pid --conf-file= --except-interface lo --bind-interfaces --listen-address 192.168.122.1 --dhcp-range 192.168.122.2,192.168.122.254 --dhcp-leasefile=/var/lib/libvirt/dnsmasq/default.leases --dhcp-lease-max=253 --dhcp-no-override --dhcp-hostsfile=/var/lib/libvirt/dnsmasq/default.hostsfile --addn-hosts=/var/lib/libvirt/dnsmasq/default.addnhosts
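A quick way to confirm which process owns the DHCP port on a node (assuming net-tools is available there):

netstat -lunp | grep ':67 '                        # which process is bound to UDP/67?
ps aux | grep -E 'dnsmasq|dhcpd' | grep -v grep    # any competing DHCP servers running?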

Not sure if this is the underlying cause yet, but if so the next question here is...

Should eucanetd error out and die under this condition? 

There is an error in the log (which I found after looking for the dhcpd process), but errors like this, which are known to affect the operation of the node/VMs, should be propagated upstream in some form. There's an open bug related to this. I think if eucanetd dies, the node will transition to DISABLED, which at least gives the admin/operator a big clue that something needs servicing and prevents users' VMs from landing on that node.

Thanks!

-Matt




Evan Thomas

Nov 20, 2014, 2:01:23 PM11/20/14
to Matt Clark, Iglesias, Vic, Quality Assurance Team
Hi Matt,

  Is it worth trying this test again then?
-Evan

Matt Clark

Nov 20, 2014, 2:07:29 PM11/20/14
to Evan Thomas, Iglesias, Vic, Quality Assurance Team
Hey Evan,
I think the test by itself will continue to fail until we figure out what's changed on the nodes. You could kill this other dhcpd instance and maybe restart eucanetd at an early stage in the test(s), and/or kill it now and run the full sequence-4.1 smoke test.
Shaon suggested a change to support the VPC setup may be to blame, but I haven't looked yet. 
Thanks,
-M

Evan Thomas

Nov 20, 2014, 2:11:36 PM11/20/14
to Matt Clark, Iglesias, Vic, Quality Assurance Team
Hi Matt,

  Sorry for my ignorance here, but the dhcpd instance: is it running on one of the two machines the tests are using? If so, shouldn't restarting the test nuke everything? Same question about the eucanetd stuff. Finally, what would you suggest I do to consider the code I'm testing "OK", given these other issues? Has anyone else seen this? If not, has everyone migrated to the new QA system? Is that what I should do? (I know the answer is ultimately yes, but I'm sure there's a learning curve as well.)

-Evan

Matt Clark

Nov 20, 2014, 6:43:52 PM11/20/14
to Evan Thomas, Iglesias, Vic, Quality Assurance Team
Hi Evan, 
Sorry, poor wording on my part. The hypervisor is running its own DHCP server on the node, and due to a recent change in dhcpd the dual use of UDP 0.0.0.0:67 is detected as a conflict, which prevents a second dhcpd server from binding to the port. It doesn't appear to have cared about this in the past.
This causes eucanetd to fail to start its dhcpd server, so instances do not get IPs.
I opened a ticket to track this.
If you want to log into your nodes and run 'virsh net-destroy default' while the test is in its initial setup stages, you'll probably pass the majority of these failing cases.
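Roughly, per node (a sketch; this assumes the stock libvirt 'default' network and the standard eucanetd init script):

virsh net-destroy default                  # stop libvirt's default network and its dnsmasq
virsh net-autostart default --disable      # keep it from coming back after a reboot
service eucanetd restart                   # let eucanetd start its own dhcpd on UDP/67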

...However, a heads-up on the EBS failures...

There's also a ticket open now against snapshot failures in general, which we discussed in today's bug scrub. That is going to cause a lot of EBS-related test units to fail as well.

Thanks!
-M


Matt Clark

Nov 21, 2014, 2:10:28 AM11/21/14
to Evan Thomas, Iglesias, Vic, Quality Assurance Team
Just FYI,
I added a quick workaround to the configure_edge unit to 'virsh net-destroy' the default network on each node so dnsmasq is not running and conflicting with eucanetd's dhcpd server. This fixes most of the issues seen. We still have the snapshot issue mentioned above; a rough sketch of the workaround is below.
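What it does is roughly the following (just a sketch, assuming root ssh from the QA host and a NODES variable listing the node controllers; the actual test code may differ):

for node in $NODES; do
    ssh root@"$node" "virsh net-destroy default; virsh net-autostart default --disable"
done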


Cheers,
-M

Timothy Cramer

Nov 21, 2014, 9:01:23 AM11/21/14
to Matt Clark, Evan Thomas, Iglesias, Vic, Quality Assurance Team
Swathi said Wes would have the snapshot (actually multi-part upload) bug fixed today.

Tim