17:00:49 #startmeeting ironic
17:00:50 Meeting started Mon Nov 28 17:00:49 2016 UTC and is due to finish in 60 minutes. The chair is jroll. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:51 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:52 o/
17:00:53 The meeting name has been set to 'ironic'
17:00:54 o/
17:00:55 o/
17:00:55 o/
17:00:56 o/
17:01:00 o/
17:01:01 o/
17:01:01 o/
17:01:05 o/
17:01:12 o/
17:01:21 hey everyone :)
17:01:27 o/
17:01:31 #topic announcements and reminders
17:01:35 o/
17:01:45 #info don't forget to sign up for the PTG
17:01:47 #link https://www.eventbrite.com/e/project-teams-gathering-tickets-27549298694
17:01:54 o/
17:01:55 o/
17:02:03 * jroll has no other announcements or reminders, does anyone else have one?
17:02:34 o/
17:03:22 o/
17:03:28 o/
17:03:49 o/
17:03:56 #topic subteam status reports
17:04:04 as always, those are here:
17:04:08 #link https://etherpad.openstack.org/p/IronicWhiteBoard
17:04:11 line 70 this time
17:04:47 o/
17:05:30 dtantsur: how far behind are you with bug triaging? should we hold some bug-triage-thing?
17:05:32 #info dtantsur needs help with bug triage, volunteers very welcome
17:05:46 ^^ volunteers would be great
17:05:55 rloo, I've not done essentially anything for a while, and probably won't be able to any time soon.
17:06:06 looks like we're making good progress on lots of this stuff
17:06:10 at least cleaning up "New" and checking health of "In Progress" things
17:06:21 I don't want to become the sole person in charge, but I'd be glad to help generally with bug triage
17:06:28 dtantsur: so it'd be good to have a volunteer catch up and keep an eye on new ones
17:06:38 thx JayF!!!
17:06:46 I'm adding it to my list, tomorrow I'll help triage some of these bugs
17:06:54 JayF, lucasagomes, feel free to check any of these 42 new bugs
17:07:57 I'm down to spend some time helping with bug triage. I am new but if I could be of help, I'll be glad to contribute.
17:08:40 thanks all!
17:09:09 vgadiraj, sure thing and welcome :-)
17:09:28 thank you all :)
17:10:20 I see that we have 5? specs that need reviews. Are any at the point where spending eg 1 hour will get it done?
17:10:45 I haven't looked lately, but wonder if rolling upgrades is at that point
17:11:08 might be
17:11:22 jroll: oh, i skimmed the latest revision but it looks close.
17:11:25 jroll: I have been away from the rolling upgrade testing work for about the last two weeks. Starting to work on it again now.
17:11:30 xek: what do you think?
17:11:32 is there a list of specs with high priority?
17:11:38 jlvillal: yeah, wondering about the spec :)
17:11:40 jroll: Not sure on the rolling upgrade development work.
17:11:58 yuriyz|2: i am looking at the subteam reports, where they say spec needs reviews.
17:12:05 yuriyz|2: also http://specs.openstack.org/openstack/ironic-specs/priorities/ocata-priorities.html#smaller-things and anything on that page without the spec merged
17:13:42 ok, so I trimmed up the priorities for the week a bit
17:13:51 anything that feels missing there? maybe the next driver comp patch?
17:13:53 * gabriel-bezerra wondering about the status of events from neutron stuff
17:13:59 jroll, yes please
17:14:13 dtantsur: the move node_create thing, yes?
17:14:19 jroll, yep
17:14:20 or maybe just both
17:14:22 thanks
17:14:30 It's not urgent, and probably shouldn't go on this week's priorities
17:14:34 I've just finished another huge patch, but let's start with node_create
17:14:45 but the specific-faults spec is "close enough" to get some design input from others before it goes too much further
17:15:00 so I don't think it should be added to that list, but if anyone wants to take a look it'd be helpful
17:15:04 link?
17:15:17 https://review.openstack.org/#/c/334113/
17:15:29 * dtantsur adds to his todo list for tomorrow
17:15:52 rloo ran away before RFE review, heh
17:16:02 heh
17:16:05 anything else here before that?
17:16:26 #topic RFE review
17:16:34 rloo: hey :)
17:16:34 if we move node create to the conductor, looks like we should have a start-end-fail notifications schema in this case
17:16:50 what do you think?
17:17:02 yuriyz|2, ++ (but it's off topic)
17:17:07 yuriyz|2: let's take it to the patch :)
17:17:15 yuriyz|2++
17:17:22 agree, will prepare a spec change
17:17:34 https://bugs.launchpad.net/ironic/+bug/1630442
17:17:34 Launchpad bug 1630442 in Ironic "[RFE] FSM event for skipping automatic cleaning" [Wishlist,New] - Assigned to Varun Gadiraju (varun-gadiraju)
17:17:41 so, does that need a spec? ^^
17:17:56 I think so, it's changing the state machine (which usually requires a spec yes?)
17:18:01 ++ for usually requires
17:18:03 i'm not even sure we need it ?
17:18:16 * dtantsur also thinks that "skip" is a horrible name for a transition
17:18:30 so folks think the idea has merit, just needs a spec?
17:18:37 dtantsur: ++
17:18:38 yeah, usually when it touches the FSM we require a spec
17:18:41 ++
17:18:44 dtantsur: ++ on the name thing
17:18:44 I think we should have a spec for this case
17:18:50 rloo, how to put it.. I'd not reject it right away before hearing their use case.
17:18:53 rloo: I'm not sure we need it either, but I'm willing to hear it out
17:18:55 dtantsur: ++
17:19:07 ok thx.
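The RFE above asks for a new event in ironic's provisioning state machine that jumps past automatic cleaning. As a rough illustration of what such a transition would mean, here is a minimal, standalone table-driven FSM sketch in Python; the state and event names (including `skip_cleaning`) are simplified inventions for illustration and this is not ironic's actual state machine code.

```python
# Minimal FSM sketch (illustrative only; not ironic's real state machine).
# Transition table: (current_state, event) -> next_state
TRANSITIONS = {
    ("deleting", "clean"): "cleaning",        # normal path: always clean
    ("cleaning", "done"): "available",
    # The RFE effectively asks for an extra event like this one, jumping
    # straight past cleaning.  The meeting preference was to keep passing
    # through "cleaning" (a no-op when there are no steps) instead.
    ("deleting", "skip_cleaning"): "available",
}


def process_event(state: str, event: str) -> str:
    """Return the next state, or raise if the transition is invalid."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError("invalid event %r in state %r" % (event, state))
```

The table form makes the review question concrete: adding `skip_cleaning` is literally adding a row that bypasses the `cleaning` state, which is why a spec is wanted before touching the FSM.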
17:19:10 https://bugs.launchpad.net/ironic/+bug/1633299
17:19:10 and I kinda like the simplicity of always transitioning through all the states (even if it's a no-op, like cleaning in this case)
17:19:10 Launchpad bug 1633299 in Ironic "[RFE] Overcloud deploy resiliency " [Undecided,New]
17:19:17 not sure if we will discuss it here tho
17:19:32 I'd like to rename this one before doing anything with it :P
17:19:47 ++
17:20:19 I'm inclined to say this isn't possible in the general case and reject it on that premise
17:20:31 agreed, this seems like several issues rolled into one
17:20:32 I don't know what to do about this request (as you can guess it's not the first time I've heard it)
17:21:09 Yeah, I'm thinking that RFE is difficult/impossible for Ironic
17:21:20 this is like an ops thing. can we make it easier/better for them? documentation?
17:21:31 unless you have a BMC that can somehow confirm, in a way that ironic can detect, that a reboot has actually happened
17:21:38 rloo: I assigned bug 1630442 to myself months ago when I was looking for something to contribute on, let me move it to unassigned.
17:21:38 bug 1630442 in Ironic "[RFE] FSM event for skipping automatic cleaning" [Wishlist,New] https://launchpad.net/bugs/1630442 - Assigned to Varun Gadiraju (varun-gadiraju)
17:21:42 maybe we should reduce the timeout when we wait for the first heartbeat
17:21:45 yeah, especially with network segregation. If it wasn't for that, a simple ping test could be used
17:21:50 thx vgadiraj
17:21:58 lucasagomes: ping test makes lots of really bad assumptions
17:21:58 lucasagomes: ++
17:22:00 well, the RFE is to ensure that the instance booted properly
17:22:13 which yeah, I agree isn't really possible
17:22:22 JayF, right, totally. But, it would be better than having no check at all
17:22:24 lucasagomes: is it Ironic's problem, and should a node not go active, if a deployer chooses to deploy an image which doesn't set up networking? or firewalls by default?
17:22:31 our pingtest is heartbeat
17:22:31 lucasagomes: I think a bad check is 100x worse than no check at all
17:22:44 jroll, would it be possible to somehow tell neutron to perform a ping test (in the network segregation case) for us?
17:22:46 dtantsur: heartbeat is outside of "make sure the instance booted correctly", though
17:22:48 maybe we should try rebooting if we don't receive a heartbeat in e.g. 20 minutes?
17:22:52 ah, the instance
17:22:56 JayF, right
17:22:57 * dtantsur confused this with another complaint
17:22:59 Right now we make no guarantees, I think that's the sanest way to remain. Some higher-level function (like OOO?) could use something like pingchecking to trigger a redo
17:23:27 JayF, let's maybe keep it as a "generic" test... I think the main problem with this RFE is that we don't have a way for ironic to reach the node after the deployment
17:23:28 hmm. if we provided some generic step/hook so the operator could specify what tests to run to test if it booted up... ?
17:23:33 as much as I wish we could do this, we honestly can't do it in a way that works in all cases
17:23:39 JayF: That is my feeling as well
17:23:42 if we get past that we could perform tests to make sure it has booted
17:23:56 well, it can just as well happen outside of ironic, right?
17:24:02 dtantsur: exactly.
17:24:05 run test, tear down and reschedule the node if it fails
17:24:06 dtantsur, could yeah
17:24:10 +1
17:24:15 And since seeing if the node is 'up' is completely image/deploy/node dependent
17:24:18 well, if you are using ironic via nova?
17:24:35 rloo, tripleo does
17:24:38 I think it's difficult/impossible for Ironic to tackle generically.
17:25:02 this seems maybe? worthy of a design session...
17:25:05 I think it belongs to tripleo-heat-templates. maybe wrap the Nova resource into something running the validation
17:25:09 I think this is really a problem that can only be implemented from nova up. But I don't think the semantics exist for "I've started and now I need to reschedule"
17:26:11 I'd not do something in ironic that is just as easy to implement outside of ironic..
17:26:27 I don't think we can without the ability to support rescheduling
17:26:33 and knowing what the user wanted to schedule on precisely
17:26:41 * JayF notes that he and TheJulia, the two most operator-ish folks in this meeting, both think implementing this RFE is hard/impossible to do in ironic
17:26:47 dtantsur, I agree in parts, even outside ironic, I don't think it's easy
17:26:54 right, rescheduling is out of the question completely
17:27:00 dtantsur, the node will go to active first then get rescheduled, it's kinda odd
17:27:36 well, outside of ironic, you can delete the nova instance, (optionally move node to maintenance) and create it again
17:27:42 so, sounds like we should reject this, yes?
17:27:44 How does nova handle it if you try to boot an image that doesn't boot successfully?
17:27:59 sambetts, good q, idk either
17:28:00 sambetts: nova agent is how we did it at Rackspace
17:28:03 sambetts: it goes to active, boots the vm
17:28:07 does nothing else for you
17:28:12 sambetts: IDK if that was upstream or not, but downstream at Rackspace we had a nova-agent callback
17:28:29 jroll: I agree we should reject it
17:28:33 ++ Reject
17:28:35 me too
17:29:00 how about the '1) The node just doesn't reboot (seen on HP, Dell and Supermicro)' part?
17:29:12 gabriel-bezerra: that, I think, is a lot more interesting
17:29:33 gabriel-bezerra, this bit can be converted to a bug
17:29:36 well, we use power off/on now, so hopefully that's mitigated
17:29:40 but yeah, sounds like a bug
17:29:46 jroll: don't we also have an in-band reboot option?
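The "do it outside ironic" approach agreed above (run a test against the new instance, tear down and redeploy if it fails) can be sketched as a small standalone check. This is an illustration, not ironic code: the function name, host/port choices, and the idea of an orchestrator calling it are all assumptions, and as the discussion notes, a TCP check only proves reachability, never that the instance booted the way the user intended.

```python
import socket


def instance_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves the network path and a listener are up -- a deployer
    image with no network/firewall setup would legitimately fail it, which
    is exactly why ironic can't do this check generically.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# A higher-level orchestrator (e.g. a TripleO/Heat validation wrapper)
# might poll this after deploy and, on persistent failure, delete the
# nova instance and create it again -- the "tear down and reschedule"
# flow from the discussion above.
```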
17:29:46 yeah
17:29:59 JayF: in-band power off
17:30:04 jroll: aha
17:30:18 and then bmc poll until off
17:30:36 I wonder if it is a soft-power-off vs press-and-hold issue
17:30:43 There is a possibility the BMC may have decided to go on a vacation at this point and there is not much we can do then.
17:30:58 TheJulia++ this happens way too often
17:31:36 TheJulia: dtantsur: or even a crazy saturated BMC network problem or similar ... ipmi is udp :x
17:31:56 JayF: That also is a possibility
17:32:00 yeah. if the BMC is not reliable, we can't do much
17:32:03 okay, I've marked this wontfix with a message
17:32:07 thanks!
17:32:16 rloo: what's next?
17:32:20 did you see comment #1, about the provisioning network?
17:32:41 * dtantsur probably participated in the discussion resulting in this comment
17:32:50 rloo: what about it?
17:32:57 dtantsur: if that looks familiar to you, maybe you could comment.
17:33:08 jroll: the idea about keeping the provisioning network around
17:33:19 rloo: that's a massive security issue
17:33:33 rloo, I didn't like the idea back then, nor do I like it now
17:33:38 jroll: right. which they mention too.
17:33:47 jroll: just wanted us to give our opinion on it :)
17:33:50 ok, next up
17:33:51 * lucasagomes doesn't like it either
17:33:53 https://bugs.launchpad.net/ironic/+bug/1633756
17:33:53 Launchpad bug 1633756 in ironic-lib "RFE: Add initial static type hint checking support" [Wishlist,In progress] - Assigned to John L. Villalovos (happycamp)
17:33:53 rloo: IMO it isn't acceptable
17:33:53 ok
17:34:05 i think jlvillal mentioned it at the summit
17:34:12 the "best" way I can think of, is if neutron could perform a test for us in any of the networks
17:34:16 :)
17:34:18 * lucasagomes is moving on
17:34:26 I think I stated my feelings on this at the summit, I don't think it's a very good use of our time, but not going to block it myself
17:34:29 but i don't recall what we decided
17:34:38 I don't think it needs a spec
17:34:42 seems straightforward
17:34:47 I sorta wonder what's the point of this, especially if it's not going to be openstack-wide
17:34:51 and agree with jroll about priority
17:34:56 I'm with jroll and JayF on it
17:35:01 It is low on my priority list. But I hope to get more time to play around with it in my spare time.
17:35:03 JayF: helps us avoid a certain class of bugs
17:35:05 Is there any movement to get this sort of type checking in any other OpenStack project?
17:35:22 I'm just curious if this is something we should do at a larger-than-single-project level, if it gets done
17:35:22 JayF: movement sometimes starts with one ball rolling
17:35:26 I don't know. Someone has to be first.
17:35:27 not that I'm aware of
17:35:29 JayF, we could start the trend :-)
17:35:35 but yeah, this sort of thing would prove it out
17:35:45 Then I'm with the other opinions of "I don't personally care, but don't wanna block either"
17:35:46 worth at least raising to other folks
17:35:48 unless folks are against it, why don't we approve and see what (if anything) happens
17:36:07 my only concern is whether it's going to bloat the code and confuse contributors
17:36:07 idk, I think as a 'thing to play with' there's plenty of other python projects to play with this in
17:36:27 I'm not sure I'd like to -1 patches from newcomers saying "please add static type hints"
17:36:40 * dtantsur already does it for release notes often enough
17:36:51 dtantsur: yep, that's a good point
17:37:00 I'm hoping that this will catch some bugs and improve code quality. Once it is implemented.
17:37:13 overall I think knowing what a method returns/receives kinda helps with code quality and understanding of it
17:37:14 Hard to say though, until the work is done.
17:37:31 jlvillal, tbh I'm not convinced that a substantial share of our bugs is due to the wrong type being sent in or out of a call
17:38:12 Could be. I'm not sure.
17:38:19 do we want to vote? i love votes :)
17:38:27 Maybe we bless it for ironic-lib, and kind of see where it goes?
17:38:51 rloo: only if it is not a boolean voting option :)
17:39:11 TheJulia, my only question is: are we going to require it for all patches from now on?
17:39:22 I like TheJulia's idea too. start with ironic-lib.
17:39:29 dtantsur: I don't think so, but maybe add a test in eventually
17:39:36 dtantsur: we'd only require it if we felt like it was worthwhile.
17:39:39 that way we self-require later on
17:39:49 who is going to do the first retrofit?
17:39:51 IF we see value
17:39:58 IIRC, it is optional even for the checker
17:40:00 I don't think we would require it initially. See what happens. Is it worthwhile.
17:40:00 dtantsur: i think after ironic-lib is done, we can evaluate it and decide then.
17:40:05 NobodyCam: the RFE was filed by jlvillal, and aiui he's the most interested party
17:40:16 * dtantsur is fine with that
17:40:25 okay :)
17:40:27 I would do the work. Though it is not at the top of my priority list either.
17:40:37 As there are much more important things to work on.
17:40:48 More of a spare time thing for me.
17:40:50 okay, if others are fine, I'm fine
17:41:01 * jlvillal needs to find more interesting things to do on weekends ;)
17:41:03 sounds good. thx all. one more ...
17:41:05 https://bugs.launchpad.net/ironic/+bug/1634118
17:41:06 Launchpad bug 1634118 in Ironic "[RFE] Auto cleaning after instance deletions: secure and non-secure projects" [Undecided,New]
17:41:47 jlvillal: fyi left a comment on that RFE about it being approved for ironic-lib only
17:41:53 jroll: Thanks
17:42:01 there is no support for that now on the keystone side
17:42:30 rloo: I think for me this one is similar to the state machine rfe, I don't like it but willing to hear it out in a spec
17:42:31 hmm, so it boils down to selecting clean steps for automated cleaning, right?
17:42:40 I'm really not a fan of having any concept of classification of data, because people misclassify data all the time
17:42:53 my biggest problem is the ironic operator setting it
17:42:55 but we can do it already, with priorities?
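The static type hints being proposed for ironic-lib can be illustrated with a small hypothetical example (the function below is invented for illustration, not taken from ironic-lib). The annotations cost nothing at runtime; a static checker such as mypy uses them to flag mismatched argument and return types before the code ever runs, which is the "certain class of bugs" mentioned above.

```python
def partition_size_mib(size_gib: int, reserved_mib: int = 0) -> int:
    """Convert a size in GiB to MiB, minus any reserved space.

    The annotations are advisory at runtime.  A static checker (e.g.
    mypy) would report an error for a call like partition_size_mib("10"),
    whereas plain Python would only fail at run time, deep inside the
    function, when the str hits the arithmetic.
    """
    return size_gib * 1024 - reserved_mib
```

This also shows why the meeting treated it as low-risk: existing callers are unaffected, and checking can stay optional until the project decides it is worthwhile.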
17:43:00 rather than the project/user themselves
17:43:01 not sure if I grasp the RFE correctly
17:43:06 I'm really nervous about adding more knobs to disable automated cleaning, because I think it adds more ambiguity around data security and guarantees
17:43:08 lucasagomes: this is a per-tenant decision
17:43:11 JayF: ++
17:43:17 ohh
17:43:18 jroll, which brings us back to passing complex data from nova :D
17:43:20 we should have keystone project metadata (not present now)
17:43:25 dtantsur: :|
17:43:44 JayF++
17:43:44 JayF, actually I thought about proposing removing the automated_clean option.. I hate that TripleO disabled it.
17:43:48 yuriyz|2: keystone project == based on the tenant of the instance?
17:43:58 * dtantsur wants to see a spec for sure
17:44:00 JayF: +1000
17:44:00 tenant based
17:44:02 My suggestion to whoever wrote this RFE: Write a custom hardware manager
17:44:11 that skips cleaning based on some metadata on the node, or on disk
17:44:18 JayF: yuriyz|2 proposed it
17:44:29 rather than adding another mechanism to ironic to skip cleaning steps and make whether cleaning happened more ambiguous
17:44:39 JayF, I agree with it
17:44:45 lucasagomes: with me or with the rfe?
17:44:50 with you
17:44:54 cool, ty :D
17:45:30 yuriyz|2: do you have a use case for that?
17:45:34 so, reject or hear this out in a spec?
17:45:36 having N interfaces to disable/enable the same things sounds like a bad UX, terrible for troubleshooting as well
17:46:00 no, this is not my priority currently
17:46:03 I'm with JayF, I just think it is asking for trouble to have more knobs and to expect each tenant to be set up correctly with some sort of metadata eventually. I would immediately see operators demand an override knob.
17:46:06 I mean, I am -1 to the RFE as written, and am skeptical that a spec could change my mind
17:46:16 * dtantsur remembers that we have an RFE now to skip only in-band cleaning.. which adds even more mess to the picture.
17:46:28 dtantsur: @_@
17:46:32 ok. yuriyz|2 said it isn't a priority so even if we ask for a spec, it probably won't happen soon. why don't we reject and someone can bring it up again if they need/want it
17:46:37 rloo++
17:46:44 you ok with that yuriyz|2?
17:46:45 jroll, I've requested a spec on that... but I don't believe it's going to land
17:46:51 agree
17:46:53 dtantsur, hah
17:47:02 I'm for rejecting as well.
17:47:08 thx yuriyz|2 and everyone else. That's it for my 4 today :)
17:47:09 jroll, what I would love to see though, is the conductor not starting IPA if all requested *manual* clean steps are OOB
17:47:13 but this is offtopic
17:47:24 * rloo passes the baton back to jroll
17:47:25 i don't think the tenant is the place for this. even within a single tenant there might be classified and unclassified instances.
17:47:28 thanks rloo
17:47:41 rloo: are you marking that rejected then?
17:47:48 rloo: thank you for this section!
17:47:51 #topic open discussion
17:47:54 anybody have a thing?
17:47:56 jroll: yeah, i'm going to go through them all and make sure they're marked or whatever
17:48:02 cool, thanks
17:48:17 dtantsur: that sounds sorta like a bug to me, presuming you can specify interfaces in manual cleaning (i.e. rather than just assuming a step existing in OOB precludes it from existing IB)
17:49:02 NobodyCam: yw :)
17:49:03 * jroll waits a minute
17:49:21 sorry if it was already discussed, just wanted to make sure all ( jlvillal ) are ok with merging ironic-qa back into this meeting and handling the work via subteam reports
17:49:21 JayF, yeah.. we have this problem with drac RAID which is fully OOB
17:49:29 do we have a topic for 3rd party CI?
17:49:31 krtaylor: oh yeah, jlvillal pinged me this morning
17:49:35 I am okay with canceling the QA meeting.
17:49:49 sweet, I'll make a note in the meeting page
17:49:50 Thanks krtaylor
17:49:52 krtaylor: do you mind doing the irc-meetings patch?
17:50:04 jroll, sure, will do
17:50:08 krtaylor: If you need help, let me know.
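JayF's suggestion above — a custom hardware manager that skips erase steps based on metadata rather than a new ironic knob — could look roughly like the sketch below. It mimics the ironic-python-agent `HardwareManager` convention (`get_clean_steps` returning step dictionaries, a method per step), but it is written as a self-contained illustration: it does not subclass the real base class, and the `skip_cleaning` property key is an invented example.

```python
class SkippableCleaningHardwareManager:
    """Standalone sketch of an IPA-style hardware manager.

    In a real deployment this would subclass
    ironic_python_agent.hardware.HardwareManager and be installed in
    the ramdisk; here it is self-contained so the idea can be shown
    in isolation.
    """

    def get_clean_steps(self, node, ports):
        # Advertise erase_devices with priority 0 (i.e. not run during
        # automated cleaning) when the operator has flagged the node,
        # otherwise with a normal priority.  This keeps the decision in
        # the deployer's own code instead of a new ironic API knob.
        skip = node.get("properties", {}).get("skip_cleaning", False)
        return [{
            "step": "erase_devices",
            "priority": 0 if skip else 10,
            "interface": "deploy",
            "reboot_requested": False,
            "abortable": True,
        }]

    def erase_devices(self, node, ports):
        # A real implementation would wipe the disks here.
        return "erased"
```

The trade-off raised in the meeting still applies: a deployer doing this owns the ambiguity about whether cleaning actually happened on a given node.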
17:50:11 xavierr: we don't have a standing topic here, but if someone has something to bring up they are welcome to add it to the agenda
17:50:47 xavierr, did you have a CI question?
17:50:52 jroll: rloo: Thought maybe for these updated RFEs: Should we maybe link to the meeting log so whoever filed the bug can see the discussion that led to the decision?
17:51:05 JayF: yup, that's my plan!
17:51:09 btw, anyone doing zuul + ansible for 3rd party ci?
17:51:15 rloo: awesome
17:51:21 OneView CI is back. we are working to fix https://bugs.launchpad.net/ironic/+bug/1503855 and bring the agent_pxe_oneview job back
17:51:21 Launchpad bug 1503855 in Ironic "Set boot device while server is on" [High,In progress] - Assigned to Galyna Zholtkevych (gzholtkevych)
17:51:32 xavierr: \o
17:51:33 err
17:51:34 instead of zuul + oneview
17:51:36 \o/
17:51:39 maybe we need a subsection for each CI status?
17:51:41 \o/
17:51:47 or just report it under your driver
17:51:48 \o/
17:51:54 we already have sections for them
17:51:55 ++ report under the driver
17:51:57 dtantsur: yeah, good catch
17:52:17 gabriel-bezerra, we use puppet
17:52:18 ++ report under the driver
17:52:31 oops, zuul + jenkins**
17:53:04 gabriel-bezerra, last time I heard about it, this method was not yet recommended for 3rd party CI
17:53:04 infra requires status on the test systems page, can we link to that? I'd hate to start a new place to record status
17:53:12 gabriel-bezerra, check with #openstack-infra
17:53:17 ++
17:53:24 I've seen something about upstream infra swapping jenkins for ansible.
17:53:34 dtantsur: thanks. i'll check it
17:53:45 gabriel-bezerra: I believe the current recommendation is to keep using jenkins until zuul v3
17:54:29 thanks
17:54:35 gabriel-bezerra, there have been several talks about ansible over the years, but not sure of the latest status
17:55:13 v3 openstack zuul or netflix zuul? :P lol
17:55:21 anything else here?
17:55:30 nope
17:55:34 thank you all, great meeting!
17:55:42 thank you
17:55:46 yes, thanks all, see you next time
17:55:48 Thank you everyone
17:55:48 #endmeeting