17:02:20 <krtaylor> #startmeeting ironic-qa
17:02:20 <openstack> Meeting started Wed Dec 9 17:02:20 2015 UTC and is due to finish in 60 minutes. The chair is krtaylor. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:02:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:02:24 <DuncanT> hemna: Moving to main channel. Model updates - we want to kill db access in drivers
17:02:24 <openstack> The meeting name has been set to 'ironic_qa'
17:02:34 <krtaylor> anyone here for ironic-qa?
17:02:35 <hemna> DuncanT, +1
17:02:39 <[1]cdearborn> o/
17:02:41 <rpioso> o/
17:02:43 <jroll> krtaylor: SUP :)
17:02:50 <sinval> o/
17:02:54 <liliars> o/
17:03:16 <krtaylor> hi everyone
17:03:25 <kcalman> o/
17:03:32 <krtaylor> here is the agenda:
17:03:36 <krtaylor> #link https://wiki.openstack.org/wiki/Meetings/Ironic-QA
17:03:59 <krtaylor> not much there, but I know better than to think we won't have anything to talk about :)
17:04:26 <krtaylor> #topic Announcements
17:04:54 <sinval> We have great news about OneView CI, we are able to test deployment workflow, I assume that we will publish the deploy test job by the end of this week :)
17:05:22 <krtaylor> good
17:05:28 <krtaylor> and we have one on the agenda
17:05:31 <krtaylor> OpenStack Health Dashboard
17:05:42 <krtaylor> not sure if that was sambetts
17:05:47 <jroll> also just a quick note, we're making the ipxe job voting, it's been stable for a while: https://review.openstack.org/#/c/255382/
17:06:01 <krtaylor> but here's the link for the Dashboard:
17:06:14 <krtaylor> #link http://status.openstack.org/openstack-health/#/g/project/openstack%252Fironic
17:06:37 <krtaylor> ipxe voting:
17:06:47 <krtaylor> #link https://review.openstack.org/#/c/255382/
17:07:03 <jroll> so now that we've brought the dashboard up... can we talk about how abysmal that failure rate is? can save for open discussion if we like...
17:07:33 <krtaylor> yes, we should, let's table that for open discussion
17:07:34 <jroll> and even worse, though I think it may be a sample size thing: http://status.openstack.org/openstack-health/#/g/project/openstack%252Fironic-python-agent
17:07:37 <jroll> k
17:07:49 <sambetts> o/
17:08:06 <krtaylor> hi sambetts we were just looking at the dashboard
17:08:31 <krtaylor> ok, any other quick announcements?
17:08:32 <sambetts> Yeah I came up on the ML and I thought it was cool
17:08:37 <sambetts> it*
17:09:24 <krtaylor> #topic Grenade and Functional testing
17:09:52 <krtaylor> not sure of any progress this week either, jlvillal is in class
17:10:08 <krtaylor> anyone?
17:10:26 <krtaylor> then onward
17:10:42 <krtaylor> #topic Third party CI
17:10:54 <krtaylor> so, the spec has had a lot of activity :)
17:11:36 <sinval> hehe
17:11:36 <krtaylor> jroll, I read the comments and started a revision, but got distracted by my day job, sorry about the duplicate patch
17:12:07 <jroll> krtaylor: no worries :)
17:12:20 <krtaylor> there were some counterpoint comments
17:12:50 <krtaylor> lucasagomes isn't here...
17:13:12 <jroll> I think we're way past the point of deciding whether or not we're doing this...
17:13:52 <[1]cdearborn> agree
17:14:01 <krtaylor> jroll, agreed, but maybe we can address some of the concerns that prompted the concern
17:14:26 <krtaylor> used concern too many times
17:14:37 <jroll> seems like the concerns were "I don't want my driver dropped from tree, but I can't CI it"
17:14:50 <jroll> idk how to address this unless we just don't do it
17:15:15 <krtaylor> can that driver be tested as a part of the infra testing?
17:15:39 <krtaylor> if it requires special hw, then no
17:15:41 <jroll> almost certainly not, it needs real hardware afaik
17:15:44 <lucasagomes> we could somehow mark drivers in tree that are CI'ed and not CI'ed?
17:15:56 <lucasagomes> having a config option to allow enabling non-CI'ed driver?
17:15:59 <jroll> maybe kvm has a wol
17:16:12 <jroll> lucasagomes: that was an alternative proposed a long time ago that we decided not to do
17:16:17 <jroll> I don't remember off hand all the reasons why
17:16:26 <lucasagomes> jroll, it was a bit different AFAIR
17:16:39 <lucasagomes> the proposal before was you could either enable production drivers or testing drivers
17:17:07 <lucasagomes> (but I wanna point out that, I'm OK with that proposal in the spec. I understand it has been discussed)
17:17:12 <jroll> hmm
17:17:22 <jroll> yeah, I think we don't want to turn back now
17:17:27 <krtaylor> historically, as projects have grown, that has not worked well, if it is in tree, it needs to be tested
17:17:36 <lucasagomes> jroll, yeah, and I'm fine with that
17:18:06 <jroll> cool
17:18:11 <lucasagomes> krtaylor, it's "tested" not as a functional test
17:18:17 <lucasagomes> but unittest is testing
17:18:29 <lucasagomes> I think my idea is more like staging drivers in linux
17:18:39 <jroll> dprince: do you want to talk about this at all or are you good with my replies on the spec patch?
17:18:50 <lucasagomes> which are drivers in-tree but with less promises that they are working
17:19:43 <krtaylor> lucasagomes, yeah, that is what I was proposing at summit, like an attic for drivers that were interesting, but not production
17:20:00 <lucasagomes> krtaylor, ++
17:20:30 <lucasagomes> and I understand that there's an overhead about having driver in tree as well, we have to maintain the code
17:20:38 <lucasagomes> so all ideas have pros/cons
17:20:54 <krtaylor> the spec started that way, but the consensus was that once out of main tested tree, they were noop
17:21:21 <dprince> lucasagomes: maintenance in-tree is much cheaper though
17:21:29 <lucasagomes> dprince, right
17:21:36 <jroll> why?
17:21:44 <dprince> jroll: grep
17:22:09 <jroll> grep -r thing ironic/ mydriver/
17:22:09 <jroll> ?
17:22:22 <dprince> jroll: I'm just not keen on all the split out here
17:22:45 <dprince> jroll: I mean if we expected 10 more power management drivers over the next year, maybe
17:22:45 <jroll> right, I'd like a concrete reason why. most of the issues I've heard are easily solved
17:23:04 <dprince> jroll: even then, the main argument seems to be quality of the "in-tree" drivers
17:23:16 <dprince> jroll: I don't think iboot, and wol are dragging down that quality
17:24:06 <jroll> dprince: we want to be able to say "all drivers in tree are well tested"
17:24:21 <krtaylor> put another way, why would we be different than any other project wrt vendor CI?
17:24:33 <jroll> it's not that drivers are dragging down quality, it's that the quality is 100% unknown
17:24:41 <dprince> krtaylor: other projects got it wrong in some cases
17:24:46 <dprince> krtaylor: took it too far
17:24:48 <lucasagomes> one thing that in-tree drivers help with is expanding the driver interface... It's easy to argue about adding an extra parameter for example in one of the methods if it's in-tree
17:24:51 <lucasagomes> IMO
17:25:07 <lucasagomes> like when get_supported_boot_devices() was extended to add the "task" argument
17:25:20 <lucasagomes> because we needed to check the arch of the node instead of just returning a static list
17:25:44 <lucasagomes> small example, that is easy to reason about if the driver is in-tree when changing the base interfaces
17:25:57 <dprince> just my perspective, I don't see any reason to penalize developers who have taken the extra time to enhance their dev environments with drivers which are super helpful at developing and testing Ironic
17:26:19 <dprince> perhaps out of tree makes sense for some vendors who don't generally engage upstream
17:26:28 <dprince> sort of a pay to play model
17:26:33 <jroll> dprince: I don't see the penalty, is my point
17:26:44 <dprince> jroll: out of tree, out of sight, out of mind
17:26:59 <jroll> and we *want* to keep out of tree drivers working
17:27:01 <dprince> jroll: i.e. I have to go and read the changelogs as to why interface X changed
17:27:02 <jroll> I think they're super valuable
17:27:15 <jroll> we don't want to break that interface
17:27:19 <dprince> jroll: right, I'm saying (in practice) it isn't perfect
17:27:26 <dprince> jroll: look at the includes in all the vendor drivers
17:27:41 <jroll> no, but we're working towards that
17:27:41 <jroll> yep
17:27:43 <dprince> jroll: some of those I think might be considered internal drivers
17:27:44 <jroll> we have a plan to get rid of those
17:28:14 <jroll> and have an actual driver api sort of thing
17:28:30 <dprince> jroll: until there are versioned (internally versioned) libraries out of tree could be painful
17:28:34 <dprince> jroll: that is all I'm saying
17:28:45 <jroll> sure
17:29:17 <jroll> we also have 1.5+ cycles to figure out how to make it less painful
17:29:29 <krtaylor> true
17:29:42 <krtaylor> but, we have already started off down the runway
17:30:32 <lucasagomes> jroll, what about documentation for out-of-tree drivers?
17:30:39 <lucasagomes> things like http://docs.openstack.org/developer/ironic/drivers/wol.html
17:30:57 <lucasagomes> or http://docs.openstack.org/developer/ironic/drivers/vbox.html
17:31:05 <jroll> great question, we'll need to solve that somehow
17:31:17 <jroll> readthedocs is pretty easy, and infra has support for it
17:31:39 <jroll> there's probably other good ways
17:31:51 <lucasagomes> true, yeah we should give it some thought
17:31:57 <krtaylor> in the spirit of big tent and fostering new projects (drivers) maybe there can be a cross-project driver incubation repo
17:32:36 <krtaylor> but we have time to brainstorm that
17:33:25 <krtaylor> so, the issue is that there are valuable drivers that developers are using, that cannot be tested as a part of our infra check pipeline
17:34:02 <lucasagomes> krtaylor, right... they can be tested, but we just don't have $$ to actually test it
17:34:26 <krtaylor> and that is a special case, maybe we can find a vendor that could?
17:34:33 * krtaylor thinks ipmi...
17:34:38 <lucasagomes> I think jroll is looking at it
17:34:41 <lucasagomes> for ipmitool for e.g
17:34:52 <lucasagomes> because that's our "main" (not sure I can say it) driver
17:34:54 <jroll> yeah, I'm on top of ipmitool
17:35:02 <jroll> "recommended" :)
17:35:15 <jroll> so my thought is we roll forward with our plan, try to make it work well for those drivers, and go from there
17:35:15 <lucasagomes> IBM may want to do it for ipminative as well, since pyghmi is something they work on
17:35:39 <krtaylor> yes, we are having those discussions also
17:35:51 <jroll> and if it's super terrible for out of tree drivers, we either work really hard to fix it, or admit that out of tree in ironic is terrible and allow them back in tree with some other mechanism to say it's untested
17:36:18 <krtaylor> works for me
17:37:16 <lucasagomes> cool
17:37:25 <jroll> can we land this spec then? :D
17:37:30 <sambetts> I'm already maintaining an out of tree driver, and it's not that bad right now, nothing has drastically broken anything recently
17:37:42 <krtaylor> jroll, +1000
17:37:45 <lucasagomes> sambetts, yeah we want to make it even better
17:37:51 <sambetts> of course :)
17:37:51 * lucasagomes wants at least
17:38:20 <jroll> lucasagomes: don't forget we had a whole session on the driver interface and came out with a plan :)
17:38:26 <sinval> jroll, I think there are some issues regarding testing, I'm not sure if krtaylor is planning to discuss that during this meeting...
17:38:40 <lucasagomes> jroll, right
17:38:51 <krtaylor> ok, we are winding down on that topic
17:38:57 <jroll> sinval: open discussion? :)
17:39:03 <krtaylor> anyone object to moving on?
17:39:12 <jroll> let's do it
17:39:14 <sinval> jroll: can be
17:39:15 <krtaylor> #topic General QA and Open Discussion
17:39:26 <jroll> sinval: whatcha got
17:40:31 <sinval> deva had comments about the testing section in the spec, something about: "None is not good for a testing spec"
17:40:40 <jroll> sinval: oh, that's been fixed
17:41:14 <krtaylor> that was a joke about a testing spec not having test impact :)
17:41:36 <krtaylor> it was the subject of the entire spec :)
17:41:48 <jroll> my open discussion topic: we have sporadic failures caused by either timeouts and/or pxe failures, still plaguing the gate for months (years?) now, and it's long past the point where we're terrible people for not fixing them
17:41:53 <jroll> #link http://status.openstack.org/elastic-recheck/#1393099
17:41:59 <jroll> #link http://status.openstack.org/elastic-recheck/#1408067
17:42:06 <jroll> top 2 out of 3 on elastic recheck
17:42:29 <jroll> we can't just let these sit anymore, we need someone working on them as hard as possible, and nothing else
17:42:32 <lucasagomes> that timeout is a PITA indeed
17:42:44 <jroll> clark b has the bug numbers memorized - that's how bad it is
17:43:16 <jroll> I tried to get to it, but I have way too much other stuff going on
17:43:25 <jroll> I'm happy to prioritize it on my list, but would rather have a volunteer that knows this stuff well working on it
17:43:37 <jroll> it's a terrible waste of infra resources :(
17:43:38 <krtaylor> jlvillal will more than likely be interested, but I won't volunteer him
17:45:22 <krtaylor> #action find someone to work the transient timeout and pxe failures
17:45:27 * lucasagomes is also too busy to take a look
17:45:36 <lucasagomes> but I'm happy to help
17:45:37 <jroll> no volunteers? :(
17:45:44 <krtaylor> I am too atm, maybe in a few weeks
17:45:57 <lucasagomes> jroll, do we know at least why it takes so long? slow network when PXE booting?
17:46:26 <jroll> lucasagomes: probably slow gate nodes, I don't think localhost networking could be that slow
17:46:34 <lucasagomes> tftp can be a real pain even for local network
17:46:44 <lucasagomes> right :-/
17:47:05 <lucasagomes> sambetts, would be good to test ur tiny ipa there anyway
17:47:10 <lucasagomes> it will consume far fewer resources
17:47:16 <sambetts> I was about to suggest that :-P
17:47:17 <lucasagomes> should be very quick to boot that
17:47:47 <jroll> idk if it's always the boot being slow, either, hard to tell
17:47:51 <jroll> could just be *everything* being slow
17:48:06 <jroll> for instance, jay had a test yesterday where it timed out during cleaning
17:48:18 <sambetts> it's a considerably smaller image to transfer by tftp too, so that might help
17:49:16 <sambetts> :/
17:49:27 <lucasagomes> one thing that is hard to debug is the boot time
17:49:35 <lucasagomes> because every time we power on the nested VM
17:49:44 <lucasagomes> the file which the console is redirected to is overwritten
17:49:56 <lucasagomes> so it's actually hard to see the logs and figure out how long it took to boot
17:49:58 <sambetts> yeah that is really frustrating for debugging inspector too
17:50:15 <jroll> yep, need to fix that as well
17:50:41 <jroll> I guess I'm going to get hacking on that stuff, then
17:51:11 <lucasagomes> yeah, improving the troubleshooting def helps
17:51:23 <jroll> I may also work on moving devstack code into our tree as a plugin, that will make it much easier to work on it I thin
17:51:23 <jroll> k
17:51:28 <jroll> and make devstack people super happy
17:51:32 <lucasagomes> ++
17:51:35 <sinval> krtaylor: did you have the chance to think about those questions about testing in the spec? or are we going to postpone it?
17:52:07 <jroll> sinval: have you looked at the spec since yesterday?
17:52:18 <jroll> sinval: what questions do you have that are still unanswered?
17:52:19 <krtaylor> sinval, yes I replied, we will document
17:52:39 <sinval> jroll, krtaylor I'm reading right now
17:52:56 <jroll> ok
17:53:54 <krtaylor> we are nearing the top of the hour, any other topics, questions?
17:55:46 <sambetts> Nothing from me
17:56:02 <krtaylor> going once...twice...
17:56:08 <sinval> ops
17:56:23 <sinval> 1. Are we considering that the tempest-dsvm-pxe-ipa is enough for testing a patch's impact on a driver?
17:56:29 <sinval> 1.1 If not: What are the base test cases for considering a driver CI as "reliable"?
17:56:38 <sinval> 2. Can a CI implement specific test cases that are not in Ironic tree or even in the Tempest tree to ensure that their driver is not broken by a patch?
17:56:51 <sinval> 3. Are we considering implementation of functional test cases of driver interfaces to ensure that a CI is reliable?
17:57:21 <sinval> just some things to think about, I'm not sure if it is clear to everyone
17:57:47 <krtaylor> good questions, but don't have all the answers atm
17:58:25 <jroll> so, tl;dr 1) it's a good start, and we can add to it. 2) sure? :) 3) I haven't considered it much, but we should totally add things like power off/on calls to the API to tempest if they don't exist already
17:59:00 <krtaylor> agreed :)
17:59:00 <sinval> cool
17:59:07 <jroll> sinval: let's add those to the docs when we write those :)
17:59:17 <sinval> jroll: sure
17:59:29 <krtaylor> yes! I'll sign you up :)
17:59:38 <krtaylor> ok, so I think we are done here
17:59:42 <krtaylor> thanks everyone!
17:59:48 <jroll> thanks krtaylor :D
17:59:58 <sinval> o/
18:00:07 <krtaylor> #endmeeting