15:00:30 <BobBall> #startmeeting XenAPI
15:00:31 <openstack> Meeting started Wed Apr 15 15:00:30 2015 UTC and is due to finish in 60 minutes. The chair is BobBall. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:32 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:33 <BobBall> Howdy!
15:00:35 <openstack> The meeting name has been set to 'xenapi'
15:00:39 <BobBall> johnthetubaguy - ping?
15:01:13 <johnthetubaguy> o/
15:01:14 <johnthetubaguy> hello
15:01:22 <johnthetubaguy> BobBall: hope you had a good holiday
15:01:24 <BobBall> Howdy. Good, good.
15:01:26 <BobBall> I did indeed, thanks!
15:01:33 <BobBall> Forgot that I was on holiday last meeting :)
15:01:57 <BobBall> South Africa is a wonderful place. Definitely recommend it. As long as you lock up tight at night and don't walk around by yourself...
15:02:00 <BobBall> Anyway
15:02:02 <BobBall> #topic CI
15:02:07 <BobBall> Let's start with the fun one - the CI
15:02:18 <BobBall> As you probably know the CI was disabled from voting late last week
15:02:27 <BobBall> The initial suggestion was that the CI was broken
15:02:34 <BobBall> That was unfortunately not the case...
15:02:36 <johnthetubaguy> BobBall: south africa, awesome
15:02:46 <johnthetubaguy> BobBall: ah, the code is broken?
15:02:49 <BobBall> A new, major, race condition has been introduced somewhere which is hitting XenAPI very badly
15:03:03 <johnthetubaguy> yes, you mentioned snapshot
15:03:06 <BobBall> We went from a 10% failure rate to 60+% (not quite sure what the exact rate was)
15:03:14 <johnthetubaguy> but its not affecting libvirt?
15:03:21 <BobBall> Presumably not
15:03:23 <johnthetubaguy> do you have the logs from the failure to look at?
15:03:23 <BobBall> which isn't surprising
15:03:27 <BobBall> since we have very different code paths
15:03:30 <BobBall> http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/screen-n-cpu.txt.gz
15:03:43 <BobBall> Uhhh
15:03:49 <BobBall> I meant: http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/results.html
15:03:59 <BobBall> Which includes that n-cpu log file
15:04:13 <johnthetubaguy> have you found the offending log trace?
15:04:24 <BobBall> Anyway - the interesting thing is that Tempest has recently changed to add a whole bunch of useful identifiers that should help track it down
15:04:34 <BobBall> but I still didn't see anything helpfully obvious :(
15:05:16 <BobBall> It's slightly worrying, perhaps, that most of the failures seem to happen when test snapshot pattern is the last test
15:05:22 <BobBall> but maybe that's due to the timeout we're hitting
15:05:39 <BobBall> just pushing it to the last test executed
15:06:23 <BobBall> What I don't understand is how/why this is showing as a failure to connect to SSH
15:06:23 <johnthetubaguy> hmm
15:06:42 <johnthetubaguy> oh, thats odd, is that post boot from snapshot?
15:07:01 <BobBall> It really _is_ some form of race since even though the pass rate is low, some jobs at the same time are still passing
15:07:54 <BobBall> http://paste.openstack.org/show/203994/ is a list of some passes/fails
15:08:32 <BobBall> so you can see that even over a period of hours (so not just a temporary thing) we're getting a splattering of passes
15:09:16 <BobBall> http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/06/155006/11/15027/results.html is an example of a pass-in-amongst-the-fails
15:09:46 <BobBall> Anyway - either of you have thoughts on where the issue might be from those logs?
15:10:32 <johnthetubaguy> sorry, distracted
15:10:46 <johnthetubaguy> agreed its a race
15:10:53 <johnthetubaguy> just seeing if we found anything in the log
15:11:24 <BobBall> Any thoughts on how it might be a race, since it's actually the SSH connection to the guest that is broken?
15:12:25 <johnthetubaguy> waiting for it to be active?
15:12:29 <johnthetubaguy> unsure
15:12:36 <johnthetubaguy> what's the test doing?
15:13:01 <BobBall> It's writing a pattern of bits to the guest then snapshotting I think
15:14:19 <BobBall> oh - no - it's not doing that any more
15:14:22 <BobBall> I thought it used to
15:14:27 <johnthetubaguy> hmm, odd
15:14:36 <BobBall> now it's just writing the timestamp to the guest, snapshotting, booting that snapshot and testing it
15:14:44 <BobBall> but it can't write the timestamp to the server initially
15:14:46 <BobBall> that's what times out
15:14:49 <BobBall> so it's pre-snapshot
15:15:04 <BobBall> Just booted a newly created image which then can't be accessed
15:15:10 <johnthetubaguy> hmm
15:15:21 <johnthetubaguy> so maybe its not waiting long enough for it to boot?
15:16:08 <BobBall> Not unless we take a substantially longer amount of time than libvirt to download + boot a cirros image
15:16:41 <johnthetubaguy> we do
15:16:47 <BobBall> like minutes longer?
15:16:48 <johnthetubaguy> from what I remember
15:16:55 <BobBall> I think SSH timeout is 120 seconds?
15:17:03 <johnthetubaguy> yeah, that should be OK
15:17:38 <johnthetubaguy> unless we hit some slow path I guess
15:17:39 <johnthetubaguy> hmm
15:17:50 <BobBall> console shows that the guest did boot
15:17:59 <BobBall> however 2015-04-10 15:45:20.980 | wget: can't connect to remote host (169.254.169.254): Network is unreachable
15:18:01 <johnthetubaguy> ah good point
15:18:06 <johnthetubaguy> how long did it take
15:18:11 <BobBall> Hmmmm... Was this not our fault? :/
15:18:22 <BobBall> Could this have been a RAX networking issue?
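[Editor's note: the 120-second figure discussed above is Tempest's SSH validation timeout. A sketch of the relevant tempest.conf fragment follows; the section and option names are from Kilo-era Tempest and are an assumption here, so check your Tempest version's config reference.]

```ini
[compute]
# How long Tempest keeps retrying the SSH connection to a booted guest
# before declaring the kind of failure seen in these CI runs
ssh_timeout = 120
# How long to wait on an established SSH channel before giving up
ssh_channel_timeout = 60
```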
15:18:35 <johnthetubaguy> I don't think so, its on box networking right?
15:19:03 <BobBall> Infra tempest used to be just on HP cloud right?
15:19:04 <BobBall> hmmm
15:19:10 <BobBall> fair point
15:19:24 <BobBall> 2015-04-10 15:45:20.911 | cloud-setup: failed to read iid from metadata. tried 30
15:19:24 <johnthetubaguy> not sure
15:19:27 <BobBall> 2015-04-10 15:45:20.919 | WARN: /etc/rc3.d/S45-cloud-setup failed
15:19:56 <johnthetubaguy> BobBall: we don't run the metadata service, where is that log from?
15:20:11 <BobBall> http://dd6b71949550285df7dc-dda4e480e005aaa13ec303551d2d8155.r49.cf1.rackcdn.com/48/172448/1/15028/run_tests.log
15:20:16 <BobBall> Search for the Cirros boot
15:20:22 <johnthetubaguy> BobBall> the alternative is to try boot with config drive
15:20:26 <BobBall> Looks like the guest doesn't get an IP address
15:20:31 <johnthetubaguy> BobBall: do force config drive
15:20:37 <johnthetubaguy> see if that narrows it down
15:20:53 <BobBall> I think we must be forcing it normally for it to work at all
15:20:59 <johnthetubaguy> it could be a race in the ip tables writing
15:21:06 <BobBall> I wonder if there is an issue where the config drive might not be attached before we boot or something
15:21:09 <BobBall> ?
15:21:18 <johnthetubaguy> seems very unlikely
15:21:18 <BobBall> 2015-04-10 15:45:20.396 | ### ifconfig -a
15:21:18 <BobBall> 2015-04-10 15:45:20.402 | eth0 Link encap:Ethernet HWaddr FA:16:3E:FC:EC:8F
15:21:21 <BobBall> 2015-04-10 15:45:20.408 | inet6 addr: fe80::f816:3eff:fefc:ec8f/64 Scope:Link
15:21:29 <johnthetubaguy> BobBall: is it using the correct image?
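[Editor's note: the two console lines quoted above are the signatures of this failure mode (no IP from the config drive/DHCP path, so the metadata fetch at 169.254.169.254 fails). A quick sketch for scanning a downloaded run log for them; the sample log content below is copied from the quotes above, and the `console.log` file name is a placeholder for the real run_tests.log.]

```shell
# Recreate a tiny excerpt of the guest console output (lines quoted in the meeting)
cat > console.log <<'EOF'
2015-04-10 15:45:20.911 | cloud-setup: failed to read iid from metadata. tried 30
2015-04-10 15:45:20.980 | wget: can't connect to remote host (169.254.169.254): Network is unreachable
EOF

# Count occurrences of the two failure signatures in the log
grep -cE "failed to read iid from metadata|Network is unreachable" console.log  # prints 2
```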
15:21:32 <BobBall> I assume this means the guest definitely didn't get an IP address
15:21:43 <johnthetubaguy> I guess we override the config option
15:21:51 * ijw spies on you
15:22:54 <BobBall> We must
15:23:00 * BobBall hides behind a tree
15:23:17 <ijw> Seems you didn't get DHCP (which is only attempted when a config drive isn't mounted)
15:23:30 <BobBall> OK - so config drive wasn't there, but it _must_ be for RAX.
15:23:41 <BobBall> that's still right isn't it johnthetubaguy?
15:23:48 <BobBall> So the real question is why wasn't there a config drive...
15:23:51 <ijw> And config drive can be enabled with --config-drive=true but equally can be enabled if the global 'always give a config drive' option is set
15:24:05 <BobBall> force_config_drive = always
15:24:12 <BobBall> from n-cpu
15:24:17 <johnthetubaguy> BobBall: I think you are getting confused, the metadata service used isn't the cloud one, it's the one in your openstack setup right?
15:24:38 <BobBall> Oh, yes, I am getting confused.
15:24:40 <johnthetubaguy> BobBall: are you sure its the right image we are launching in this test?
15:24:46 <ijw> Usually I end up poking around in /opt/stack/data/nova/instances for the config_drive file when I'm using libvirt, you can use a loopmount to see if it's good
15:24:50 <johnthetubaguy> maybe the new test used the wrong config
15:25:13 <ijw> But assuming your cloud is not completely buggered I suspect John's right and your image hates you
15:25:19 <BobBall> Bad image!
15:25:42 <ijw> At this time of the morning it's probably just decided to down tools and get a coffee
15:26:01 <BobBall> Using CONF.compute.image_ref which is what lots of others use too
15:26:25 <BobBall> Late afternoon here ijw... Does that mean it's time to down tools and get a cup of tea and cake?
15:26:58 <johnthetubaguy> BobBall: hmm, that sounds reasonable enough
15:27:01 <BobBall> johnthetubaguy: But we were 'agreed' that it looked like a race...
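[Editor's note: the `force_config_drive = always` value BobBall reads out of the n-cpu log is a nova.conf setting. For reference, in the Kilo-era option format the fragment would look like the sketch below.]

```ini
[DEFAULT]
# Attach a config drive to every instance, so the guest does not have to
# rely on DHCP plus the metadata service at 169.254.169.254 during boot
force_config_drive = always
```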
15:27:09 <BobBall> because some runs definitely pass
15:27:32 <BobBall> oh hang on again sorry I'm getting fixated on config drive
15:27:37 <BobBall> which is _not_ the issue, right?
15:27:38 <johnthetubaguy> BobBall: it passes sometimes and not others, so it has to be right, it could be bad state shared between tests when they get reordered
15:27:50 <johnthetubaguy> but thats not xenapi specific
15:28:06 <BobBall> Agreed.
15:28:11 <johnthetubaguy> BobBall: unsure, coming up and not getting an IP could be an issue, but that assumes it actually started
15:28:45 <ijw> BobBall: you're down the road from the pub, it's probably worth just popping over there to see if that's where it's gone
15:28:55 <ijw> And you might as well have a pint while you're at it
15:29:24 <BobBall> Might as well.
15:30:46 <BobBall> OK
15:30:49 <BobBall> I'm going in circles here
15:31:35 <BobBall> Anyway....
15:31:37 <BobBall> Long story short
15:31:40 <BobBall> I disabled the test
15:31:44 <BobBall> so the CI is back commenting on changes
15:31:53 <BobBall> not voting though
15:32:45 <BobBall> Gets us over this initial hurdle
15:33:26 <BobBall> Tests can be re-run by removing it from the tempest_exclusion_list in stackforge/xenapi-os-testing and they get picked up so you can re-run multiple times
15:33:41 <BobBall> changes to xenapi-os-testing get picked up by the CI I mean
15:36:13 <BobBall> Any more on CI johnthetubaguy?
15:36:52 <ijw> johnthetubaguy failed CI, sorry
15:37:03 <johnthetubaguy> not really I am afraid
15:37:14 <johnthetubaguy> we should find why it failed, and what's going on first
15:37:23 <johnthetubaguy> I can try to help, ping me tomorrow morning
15:37:37 <BobBall> OK, will do.
15:38:02 <BobBall> I don't think I've got anything more to add
15:38:29 <BobBall> #topic AOB
15:38:57 <BobBall> Just one from me - if you know anyone in Nanjing who wants to work on OpenStack, tell them to apply to Citrix.
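[Editor's note: the re-run procedure BobBall describes amounts to deleting the test's entry from the exclusion file in stackforge/xenapi-os-testing and pushing the change so the CI picks it up. A hedged sketch; the file contents and the exact test identifier below are hypothetical, and GNU sed is assumed.]

```shell
# Hypothetical excerpt of xenapi-os-testing's tempest_exclusion_list
cat > tempest_exclusion_list <<'EOF'
tempest.scenario.test_snapshot_pattern
tempest.scenario.test_other_skipped_test
EOF

# Re-enable the snapshot test by dropping its line; a pushed change to
# xenapi-os-testing is then picked up by the CI, triggering a re-run
sed -i '/test_snapshot_pattern/d' tempest_exclusion_list
cat tempest_exclusion_list
```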
15:39:32 <BobBall> And that's all :)
15:39:39 <BobBall> johnthetubaguy - anything from you?
15:39:48 <johnthetubaguy> nothing big from me, except
15:40:46 <johnthetubaguy> let's look out for rc1 stuff
15:40:57 <BobBall> What do you mean, look out for it?
15:41:28 <BobBall> Review it, CI it, find more stuff that we want in RC1? ;)
15:41:50 <johnthetubaguy> I mean RC2 really
15:41:58 <johnthetubaguy> let's just keep an eye out for major issues
15:42:01 <johnthetubaguy> like this one :)
15:42:05 <BobBall> Yes
15:42:17 <BobBall> I am worried that this might go out in the release if we can't figure it out in time...
15:42:41 <BobBall> Unfortunately because it's only showing in our CI it's likely to just be us trying to fix it unless we can find a case that reproduces elsewhere
15:44:09 <BobBall> OK - anyway - I think we're done.
15:44:12 <BobBall> So I'll call it there.
15:44:31 <BobBall> Next meeting in a fortnight, so Thursday 30th April
15:44:34 <BobBall> no sorry
15:44:39 <BobBall> Wednesday 29th April!
15:44:42 <johnthetubaguy> cool, thanks
15:44:49 <BobBall> #endmeeting