15:10:29 <dtantsur> #startmeeting ironic
15:10:30 <openstack> Meeting started Mon May 13 15:10:29 2019 UTC and is due to finish in 60 minutes. The chair is dtantsur. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:10:31 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:10:34 <openstack> The meeting name has been set to 'ironic'
15:10:46 <etingof> \o/
15:10:46 <dtantsur> hi all, sorry for the late start (first day after 2 weeks of absence, sigh)
15:10:50 <bdodd> o/
15:10:52 <kaifeng> o/
15:10:53 <rpittau> meeting! o/
15:10:55 <hjensas> o/
15:10:55 <arne_wiebalck> o/
15:11:00 <jroll> morning :)
15:11:09 <dtantsur> our agenda is pretty light:
15:11:12 <dtantsur> #link https://wiki.openstack.org/wiki/Meetings/Ironic
15:11:12 <rloo> o/
15:11:28 <dtantsur> #topic Announcements / Reminder
15:11:46 <dtantsur> #info The Summit is over, thanks all :)
15:11:50 <dtantsur> #link https://dtantsur.github.io/posts/ironic-denver-2019/ dtantsur's notes from the Summit + PTG in Denver
15:12:10 <dtantsur> Subjectively, a lot is going on around ironic
15:12:14 <arne_wiebalck> thanks for the notes dtantsur, that's useful
15:12:17 <rloo> thx dtantsur, for your notes on the summit/ptg!
15:12:32 <rpittau> thank you for the notes dtantsur :)
15:12:42 <dtantsur> #info The next Summit + PTG will be in Shanghai, November 4 - 6, and the PTG will be November 6 - 8, 2019
15:13:13 <dtantsur> #link https://www.openstack.org/summit/shanghai-2019
15:13:55 <dtantsur> if you've dreamt of going to China, this is your chance :) registration is open already
15:14:19 <rpittau> well I guess I can't miss 3 summits in a row :P
15:14:25 <dtantsur> heh
15:14:35 <dtantsur> I have mixed feelings about the next one :)
15:14:37 <dtantsur> anyway
15:14:40 <dtantsur> anything else to announce?
15:14:54 <dtantsur> hmm, maybe this one:
15:15:03 <dtantsur> #link https://www.openstack.org/bare-metal/ The official bare metal program
15:15:17 <cdearborn> o/ sorry i'm late!
15:15:21 <dtantsur> also you'll notice that we have Baremetal SIG business as part of our agenda
15:15:35 <dtantsur> anything else?
15:15:38 <mgoddard> o/
15:15:59 <mjturek> o/
15:15:59 <mkrai> o/
15:16:20 <rajinir> o/
15:16:41 <dtantsur> No action items from the last week, and I doubt we have anything new on the whiteboard (or do we?)
15:17:00 <dtantsur> I guess TheJulia will propose the priorities for Train officially next week
15:17:23 <baha> o/
15:17:42 <dtantsur> those who like spoilers, check https://etherpad.openstack.org/p/DEN-train-ironic-ptg after line 639
15:17:45 <dtantsur> * 539
15:18:22 <dtantsur> #topic Deciding on priorities for the coming week
15:20:17 <dtantsur> what do you think about the list as it is now?
15:21:07 <TheJulia> I didn't have time to update it, so I think it is okay to just carry it forward if it is up to date.
15:21:23 <TheJulia> Re priorities, next Monday most likely :(
15:21:24 <dtantsur> reasonably up-to-date, I just added the mdns thingy
15:21:34 <rpittau> I did update some stuff based on merged things last week
15:21:46 <etingof> let's review this as well please? -- https://review.opendev.org/655685
15:21:46 <patchbot> patch 655685 - ironic-specs - Add spec for indicator management - 6 patch sets
15:21:47 <TheJulia> dtantsur: perfect :)
15:22:14 <dtantsur> etingof: fair enough, added
15:22:17 <dtantsur> any other comments?
15:22:29 <rloo> I haven't read the retirement spec (https://review.opendev.org/#/c/656799/) but we did discuss that at ptg, wrt beth's ask for other states? so wondering if more work is required in that spec.
15:22:30 <patchbot> patch 656799 - ironic-specs - Add support for node retirement - 4 patch sets
15:22:51 <arne_wiebalck> rloo: it's on the agenda :)
15:23:13 <dtantsur> rloo: yeah, we have this on the agenda, and I'd like it on priorities to highlight this discussion
15:23:15 <rloo> arne_wiebalck: oh, is that what 'needs follow-up from the PTG' is about?
15:23:20 <dtantsur> rloo: yes
15:23:23 * etingof has 10-patch long chain against sushy-tools to review...
15:23:29 * dtantsur needs time and spoons to type letters about that spec
15:23:36 <rloo> arne_wiebalck, dtantsur: ok, thx for clarifying
15:23:39 <dtantsur> arne_wiebalck: you can find some of the information in my summary
15:23:52 <arne_wiebalck> dtantsur: ok
15:23:59 <dtantsur> etingof: I'm still -1 on half of them, but ignoring in case somebody cares less..
15:24:27 * rloo thinks there are a lot of priorities for this week but i guess we had them all in previous weeks.
15:24:28 <etingof> dtantsur, I've refactored them heavily since your last review
15:24:38 <dtantsur> okay, I'll check again if my concerns still hold
15:24:53 <etingof> though the things you did not like still there ;)
15:24:56 <dtantsur> rloo: exactly
15:25:06 <dtantsur> etingof: well, here we go :) do you really want me to review them? ;)
15:25:42 <etingof> dtantsur, absolutely!
15:26:02 <rloo> is cisco-ucs out?
15:26:10 <dtantsur> TheJulia: ^^?
15:26:39 <etingof> dtantsur, I hope that with my recent additions (chassis, indicators and vmedia boot emulation) the design I am proposing makes more sense...
15:26:45 <TheJulia> Yeah, pretty much. :(
15:27:06 <rloo> ok, i'm deleting that from vendor priorities then.
15:27:14 <dtantsur> etingof: if you found a real manager object in libvirt - yes, otherwise I expect a similar comment
15:27:33 <dtantsur> anyway, does the list look good?
15:28:19 <TheJulia> I did chat with ?ianw? In Denver and he stressed that we should do what is best for the project.
15:28:30 <dtantsur> sigh
15:28:39 <etingof> dtantsur, it's the other way round - if you find a real manager and chassis objects in libvirt, then your design makes sense ;)
15:28:50 <TheJulia> Yeah. :(
15:29:08 <TheJulia> Anyway, airplane mode time.
15:29:21 <dtantsur> #topic Baremetal SIG
15:29:26 <dtantsur> Anything for today?
15:29:54 <TheJulia> The white paper could use some assistance with regards to use cases
15:30:01 <rloo> dtantsur: as an intro -- did anyone mention why we have baremetal sig in this meeting?
15:30:24 <dtantsur> rloo: I did not, because I was not present when it was decided
15:30:55 <rpittau> when was that decided ?
15:30:58 <TheJulia> Tl;dr there needs to be a periodic reminder, so the meeting was the best time
15:31:17 <rloo> dtantsur: i can't recall now, if there was email/announcement about this. if there wasn't, there should be, or mention it here?
15:31:43 <TheJulia> During one of the baremetal sessions there was consensus that it would be good to at least raise as part of this meeting given overlaps.
15:32:21 <TheJulia> Chris hodge... appears to be offline at the moment, but this is largely for human wrangling, so if there are no items then we can skip past.
15:33:04 <rloo> fair enough, and that is fine with me. could we have an action item for someone (chris?) to send out email to both groups, about this being in the ironic meeting?
15:33:26 <rloo> (unless folks disagree with it)
15:33:35 <TheJulia> I concur, I won’t be able to do it this week.
15:33:40 <rloo> (since not everyone was at ptg/summit)
15:33:51 <TheJulia> Anyway, they just closed the boarding door, need to quite literally disconnect now :(
15:34:06 <arne_wiebalck> I can ping Chris.
15:34:11 <TheJulia> Have a wonderful day everyone
15:34:23 <rloo> BYE TheJulia!
15:34:23 <dtantsur> safe flight TheJulia
15:34:32 <rpittau> TheJulia: safe flight :)
15:34:55 <dtantsur> arne_wiebalck: please! I think we can do it off-meeting.
15:34:58 <rloo> ok, AI for arne_wiebalck to ping chris to ask chris to send email out etc.
15:36:14 <dtantsur> ++
15:36:20 <dtantsur> #topic RFE review
15:36:54 <dtantsur> #link https://review.opendev.org/#/c/656799/ Support for node retirement / nodes in failure states
15:36:55 <patchbot> patch 656799 - ironic-specs - Add support for node retirement - 4 patch sets
15:37:09 <dtantsur> as I said, I did not have time to write more detailed thoughts - sorry for that
15:37:24 <dtantsur> but something that did come up in the room is that people want a new state more than a new flag
15:37:34 <arne_wiebalck> this spec is basically a summary of a discussion I had with jroll and TheJulia
15:37:39 <dtantsur> I was initially quite against it, but then got more or less convinced
15:38:12 <dtantsur> although this may be a difference between retiring nodes and "failing" them..
15:38:34 <arne_wiebalck> how do you mark an active node for retirement when retired is a state?
15:38:49 <dtantsur> with a state transition that does not, however, tear it down
15:39:12 <arne_wiebalck> so instances can be on nodes in state retired?
15:39:16 <dtantsur> that's what people wanted for nodes at fault: keep it intact, but mark it as very broken
15:39:18 <dtantsur> yep
15:39:22 <arne_wiebalck> ok
15:39:27 <jroll> hmm
15:39:28 <dtantsur> which is going to confuse the heck out of nova, I suspect
15:39:55 <rpittau> interesting, it goes slightly against logic imho
15:39:58 <jroll> I'd love to see the proposed state machine
15:40:08 <jroll> er, state diagram
15:40:11 <arne_wiebalck> the idea behind the flag was to be very close to 'maintenance', an attribute you assign to a node
15:40:25 <dtantsur> the thing about maintenance is that it can happen in literally any state
15:40:28 <arne_wiebalck> and such a flag would not interfere with the state machine
15:40:38 <arne_wiebalck> retired as well
15:40:39 <dtantsur> does it make sense to retire a node in "enroll"? "cleaning"? "manageable"?
15:40:40 <rpittau> well in theory retirement too
15:40:59 <arne_wiebalck> dtantsur: I think so, why not?
15:41:00 <rloo> i think more thinking is needed here. wondering if we want a 'retire' state, AND some flag eg 'next-step' or 'next-phase'
15:41:01 <rpittau> I think it does
15:41:46 <dtantsur> arne_wiebalck: if you mark an available node as retired, how do you prevent nova from scheduling to it?
15:42:14 <arne_wiebalck> how about using the same mechanism 'maintenance' does?
15:42:26 <arne_wiebalck> the same way
15:42:33 <dtantsur> maintenance has been a part of our API since the inception
15:42:42 <dtantsur> so it's something everybody is used to checking
15:42:50 <dtantsur> older tooling would not take retired into account
15:43:12 <dtantsur> I think this ^^ should be in the spec even if we keep the flag approach
15:43:14 <arne_wiebalck> older tooling like ...?
15:43:37 <dtantsur> nova, metalsmith, metal3 are the tools I'm aware of
15:43:39 <arne_wiebalck> we could also set maintenance in addition to retired
15:43:58 <dtantsur> maintenance has side effects like preventing cleaning from working
15:44:12 <rloo> at what point is the node 'retired' ?
15:44:12 <arne_wiebalck> true, I take that back :)
15:44:32 <rloo> we want the node to-be-retired, but when is it 'retired'?
15:44:39 <arne_wiebalck> rloo, it is not retired, it's more marked for retirement
15:44:47 <rloo> i'm having trouble understanding the state of the node in this.
15:45:13 <rloo> arne_wiebalck: from the spec, it seems like what you want is to indicate 'do not make this available'
15:45:21 <arne_wiebalck> rloo: right
15:45:53 <arne_wiebalck> I'd like to mark an 'active' node as on its way out
15:45:55 <rloo> arne_wiebalck: and (based on problem description) you want to be able to search/list nodes
15:46:03 <arne_wiebalck> rloo: correct
15:46:28 <dtantsur> arne_wiebalck: but not only active nodes, all states?
15:46:29 * arne_wiebalck thinks rloo is preparing a suggestion
15:46:37 <arne_wiebalck> dtantsur: yes
15:46:40 <rloo> so end user/nova invokes 'delete' on their instance.
15:46:50 <rloo> i'm not preparing a solution, just thinking about the problem. sorry.
15:47:00 <arne_wiebalck> dtantsur: also nodes in clean_failed may be retired
15:47:07 <arne_wiebalck> rloo: :)
15:47:47 <rloo> seems like one could want to 'retire' a node in any state. in some states, the operator could just do an 'openstack node delete', right, no need to say 'openstack node retire ...'.
15:48:15 <arne_wiebalck> rloo: correct
15:48:18 <dtantsur> yeah, this ^^ is something I'm not quite sure about. I understand we want to avoid unprovisioning an active node right away, but why keep nodes in available?
15:48:28 <rloo> wondering if it is sufficient to put in a mechanism to say 'do not make available'. but if we did that, it would not necessarily mean it is for retirement.
15:48:52 <arne_wiebalck> maintenance is very close
15:48:58 <arne_wiebalck> but has the cleaning issue
15:49:03 <rloo> wasn't there a case where someone might want to do firmware update after an instance is removed, so they don't want the cleaned node to go to available right away?
15:49:14 <rloo> (and firmware update isn't part of the cleaning)
15:49:29 <rloo> ie, they may want the node to go to mgt after cleaning, not available.
15:50:20 <rloo> do we need a 'retirement' state, or do we want a mechanism to move instance-deleted nodes from cleaning to some non-avail state like manage?
15:51:24 <arne_wiebalck> rloo: we don't necessarily need to change the state of the node I think
15:52:16 <rloo> arne_wiebalck: so for you, you're good if you could indicate 'when deleting instance, do cleaning and go to manage (instead of available)', and then indicate via eg extra specs or something 'this is due for retirement'?
15:52:53 <rloo> i am not against a 'retire*' state or something, just trying to understand/generalize.
15:53:03 <arne_wiebalck> rloo: if it is easy to extract the list of nodes in this state: yes
15:53:11 <arne_wiebalck> rloo: sounds good :)
15:53:27 <dtantsur> I wonder if this solution generalizes to the case discussed on the PTG
15:53:40 <arne_wiebalck> what way discussed at the ptg?
15:53:44 <rloo> now i'd like some more operator feedback (wrt explicit 'retire*' something). and dtantsur, yeah.
15:53:46 <arne_wiebalck> s/way/was/
15:53:55 <dtantsur> "We detected that this node is broken; do not kill it right away, but mark as faulty"
15:54:10 <arne_wiebalck> sounds very close
15:54:40 <arne_wiebalck> rloo's suggestion seems to cover both, no?
15:54:46 <rpittau> although faulty sounds much more close to maintenance
15:54:47 <rloo> arne_wiebalck: i think so :)
15:55:38 <arne_wiebalck> question still is whether this should be a state
15:55:47 <rloo> rpittau: true. i think the idea was that 'faulty' was not quite the same as 'maintenance' and we shouldn't overload/use 'maintenance' for non-maintenance things. of course, we haven't really defined what 'maintenance' really means :)
15:56:12 <dtantsur> arne_wiebalck: it depends on how you want the existing API consumers to behave wrt retirement
15:56:16 <rloo> arne_wiebalck: right, that's what i wanted to hear from you. do you need/want these nodes-to-be-retired, to end up in some new state?
15:56:21 <dtantsur> (not only on that, but it's a big part of the question)
15:56:31 <dtantsur> a new state will force all tooling to handle it explicitly
15:56:42 <dtantsur> a new flag on a node will be ignored by all consumers we don't update explicitly
15:57:20 <rpittau> I totally agree with the non-abuse of maintenance, but I see 2 new states here, if we go down that road
15:57:38 <arne_wiebalck> dtantsur: for the consumers, I thought ironic gives the nodes rather than nova fetching the nodes, no?
15:57:44 <rloo> rpittau: i have nothing against new states -- as long as they make sense :)
15:57:48 <dtantsur> rpittau: maintenance is ruled out because it will prevent cleaning from working
15:58:03 <dtantsur> arne_wiebalck: nova fetches nodes with some filters
15:59:16 <arne_wiebalck> dtantsur: ah, ok
15:59:29 <arne_wiebalck> dtantsur: I thought ironic hides them from nova
15:59:34 <arne_wiebalck> the ones in maintenance
16:00:10 <dtantsur> I think it's done on the nova side
16:00:27 <dtantsur> arne_wiebalck: https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L886-L894
16:00:30 <arne_wiebalck> ok ... that makes things more complicated ... in any case :)
16:00:41 <dtantsur> we're a bit out of time, let's move the discussion back to the spec?
16:00:59 <dtantsur> we need to cover existing API consumers impact, allocation API impact, etc
16:01:05 <dtantsur> I also left one comment about the API design
16:01:08 <dtantsur> thanks all!
16:01:15 <arne_wiebalck> ok, please comment on the spec then, thanks!
16:01:30 <dtantsur> #endmeeting
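
The flag-versus-state trade-off from the RFE review (a new flag is ignored by any consumer that is not updated, while a new state is visible to anything that filters on provision state) can be sketched in Python. This is a rough illustration only, not Ironic or nova code: the `retired` flag, the `retired` state name and both filter functions are hypothetical and were not agreed on in the meeting.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    provision_state: str
    maintenance: bool = False
    retired: bool = False  # hypothetical flag, not an existing node field


def legacy_schedulable(node: Node) -> bool:
    """Hypothetical consumer written before retirement existed.

    It only knows the 'available' state and the maintenance flag.
    """
    return node.provision_state == "available" and not node.maintenance


def updated_schedulable(node: Node) -> bool:
    """The same hypothetical consumer after being taught about the flag."""
    return legacy_schedulable(node) and not node.retired


nodes = [
    Node("n1", "available"),
    Node("n2", "available", retired=True),      # flag approach
    Node("n3", "retired"),                      # state approach (hypothetical state)
    Node("n4", "available", maintenance=True),
]

legacy = [n.name for n in nodes if legacy_schedulable(n)]
updated = [n.name for n in nodes if updated_schedulable(n)]

print(legacy)   # ['n1', 'n2']: the un-updated consumer still picks the flagged node
print(updated)  # ['n1']
```

In this sketch the un-updated consumer happens to skip the node in the new state only because it filters on an allow-list of states. A consumer that enumerates or validates known states would instead have to handle the new value explicitly, which is the cost dtantsur points out for the state approach.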