15:00:18 <TheJulia> #startmeeting ironic
15:00:18 <openstack> Meeting started Mon May 11 15:00:18 2020 UTC and is due to finish in 60 minutes.  The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:19 <TheJulia> o/
15:00:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:21 <iurygregory> o/
15:00:22 <rpittau> o/
15:00:22 <openstack> The meeting name has been set to 'ironic'
15:00:27 <TheJulia> \o
15:00:29 <kaifeng> o/
15:00:29 <stendulker> o/
15:00:30 <TheJulia> Good morning everyone!
15:00:31 <erbarr> o/
15:00:36 <arne_wiebalck> o/
15:00:38 <rloo> o/
15:01:11 <TheJulia> Our agenda can be found on the wiki.
15:01:13 <TheJulia> #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting
15:01:15 <ajya> o/
15:01:28 <TheJulia> #topic Announcements / Reminders
15:01:39 <TheJulia> Three items to note this week!
15:01:41 <mgoddard> \o
15:02:01 <rpioso> o/
15:02:03 <TheJulia> #info Victoria Priorities Document under discussion, please join that discussion in review!
15:02:10 <dtantsur> o/
15:02:12 <TheJulia> #link https://review.opendev.org/#/c/720100/
15:02:13 <patchbot> patch 720100 - ironic-specs - WIP - Victoria Cycle Priorit(y|ies) - 1 patch set
15:02:48 <TheJulia> #info Baremetal Whitepaper effort will be holding its third session on Tuesday May 12th, at 2PM UTC
15:02:55 <TheJulia> #link https://cern.zoom.us/j/94248770580
15:03:34 <TheJulia> #info PTG Topics etherpad is up for comments/additions/thoughts/crazy ideas.
15:03:36 <TheJulia> #link https://etherpad.opendev.org/p/Ironic-VictoriaPTG-Planning
15:03:50 <TheJulia> Does anyone have anything else to announce or remind us of this week?
15:05:14 * iurygregory doesn't
15:05:33 * TheJulia goes and sees if we had any action items from last week
15:06:29 <TheJulia> No meeting related action items, so it seems like we could skip to reviewing subteam status.
15:06:33 <dtantsur> ++
15:06:45 * TheJulia wonders if we need to file a "Take over the world" or "world domination" action item every week.
15:06:59 <dtantsur> won't hurt at least, will it?
15:07:10 <TheJulia> It will be more words to type every monday morning :)
15:07:13 <iurygregory> ++
15:07:44 <iurygregory> I can type the action if needed
15:07:47 <TheJulia> Okay, well then I guess we'll move on!
15:08:00 <TheJulia> #topic Review sub-team status reports
15:08:09 <TheJulia> #link https://etherpad.opendev.org/p/IronicWhiteBoard
15:08:28 <TheJulia> \o/
15:08:31 <iurygregory> \o/
15:08:34 <dtantsur> well, at least they're back :)
15:08:38 <rajinir> o/
15:08:41 <TheJulia> Starting around line 220 in the etherpad
15:10:10 <TheJulia> Any thoughts on logging the wsme changes on the items to review?
15:10:22 <dtantsur> +1
15:10:27 <rpittau> yeah
15:10:48 <iurygregory> ++
15:10:49 <TheJulia> k, if someone wants to add some of those links now it would be awesome
15:11:42 <TheJulia> I guess we can nuke the software raid item?
15:12:00 <arne_wiebalck> yes
15:12:09 <TheJulia> done!
15:12:11 <iurygregory> Do we want to keep track of the old items from the Grenade work, or can I remove old things? =)
15:12:21 <TheJulia> iurygregory: remove the old things!
15:12:29 <iurygregory> TheJulia, sure
15:12:52 <dtantsur> usually stuff older than 2 weeks can be safely removed
15:12:58 <dtantsur> (unless it's still up-to-date)
15:13:01 <TheJulia> ++
15:13:07 <iurygregory> dtantsur, ok =)
15:13:08 <TheJulia> Looks like the v6 ci jobs need a rebase
15:13:49 <iurygregory> I think we are done with Python3
15:13:55 <iurygregory> wdyt rpittau ?
15:14:04 <iurygregory> I think we can remove the topic =)
15:14:08 <rpittau> we can remove that yes
15:15:23 <TheJulia> Looks like two of the dib changes need reviews
15:16:12 <rpittau> yes, please, they're the last 2
15:16:50 <TheJulia> Adding to the list
15:17:53 <TheJulia> I wonder if there is any interest in deployment state callbacks with nova
15:18:15 <TheJulia> I guess that really depends on the number of new deployments being performed against baremetal
15:19:04 <arne_wiebalck> And the power state callbacks were quite some work already.
15:19:15 <dtantsur> could be a huge scalability improvement
15:19:35 <dtantsur> wait for CERN to start hitting problems with periodic sync? :-P
15:19:40 <TheJulia> Yeah, I'd just prefer it be driven by those actually actively encountering it
15:20:00 <arne_wiebalck> :)
15:20:10 <TheJulia> Anyway, shall we proceed to priorities for the coming week?
15:21:06 <dtantsur> ++
15:21:24 <TheJulia> #topic Priorities for the coming week
15:21:30 <TheJulia> #link https://etherpad.opendev.org/p/IronicWhiteBoard
15:22:40 <dtantsur> can we get the release model spec there please?
15:22:50 <TheJulia> dtantsur: please add it
15:23:10 <dtantsur> done
15:23:23 <rpittau> TheJulia: the ci failure on the dib conversion patch was because of the timeout
15:24:26 <dtantsur> TheJulia: https://review.opendev.org/#/c/688299/ needs removing -2 (not urgent)
15:24:26 <patchbot> patch 688299 - python-ironicclient - Add `network_data` ironic node attribute support - 9 patch sets
15:24:37 <TheJulia> Thanks
15:24:51 <TheJulia> Done
15:25:49 <TheJulia> That looks good to me, but does anyone see anything we're missing or that needs to be added?
15:26:10 <dtantsur> nope, looks fine
15:26:31 <rpittau> lgtm :)
15:26:36 <TheJulia> dtantsur: what about timeouts/retries to jsonrpc?
15:26:57 <dtantsur> mm, yeah, there are stable backports to merge
15:27:04 * dtantsur is sleepy today
15:27:20 <iurygregory> it's monday =)
15:27:53 <TheJulia> it is a monday
15:27:53 <dtantsur> oh, https://review.opendev.org/725867 another fun one
15:27:54 <patchbot> patch 725867 - ironic - Mark more configuration options as reloadable - 2 patch sets
15:28:30 <dtantsur> and https://review.opendev.org/#/c/726378/ could use opinions
15:28:31 <patchbot> patch 726378 - ironic-lib - image_convert: retry resource unavailable and make... - 3 patch sets
15:28:37 * dtantsur loves his own definition of "nope"
15:28:45 <iurygregory> hehehe
15:29:14 <dtantsur> it's not just Monday, it's Monday after a night when a storm tried to make a musical instrument out of our balcony
15:29:25 <iurygregory> wow
15:29:43 <kaifeng> an idea for 726378, maybe use a reduced memory limit on retry?
15:30:06 <dtantsur> kaifeng: how will reducing help? if anything, it will increase the chance of failing
15:30:58 <dtantsur> anyway, let's discuss on the patch
15:31:12 <kaifeng> if memory is low, retrying with the same resource limit seems identical
15:31:32 <kaifeng> np
15:32:37 <TheJulia> I looked at the source of qemu-img, and it dynamically scales as it assembles, so stepping down the amount of ram may not help if it truly needs to map out something basically fragmeneted
15:32:43 <TheJulia> fragmented
15:33:11 <TheJulia> Which kind of caused me to think of https://review.opendev.org/#/c/726483/ as a guard so we hopefully avoid OOM conditions
15:33:11 <patchbot> patch 726483 - ironic - WIP: Guard conductor from consuming all of the ram - 1 patch set
15:33:28 <TheJulia> Anyway, the list looks good to me. Are we good to proceed?
15:33:31 <dtantsur> ++
15:33:38 <iurygregory> ++
15:34:06 <kaifeng> ++
15:34:10 <TheJulia> We have no explicit topics. I believe we've also basically covered the SIG item, unless there is more arne_wiebalck ?
15:34:17 <TheJulia> Which I guess takes us to RFE Review
15:34:18 <dtantsur> I actually did have a topic..
15:34:20 <arne_wiebalck> nope
15:34:23 * dtantsur not sure where it went
15:34:37 <dtantsur> It was about the new release model proposal, maybe we should just ask people to review it
15:35:26 <TheJulia> dtantsur: I moved it to announcements earlier, unless you really think there is something to discuss right now?
15:35:40 <dtantsur> ah, ok
15:35:49 <dtantsur> no, I think it requires careful reading
15:35:56 <dtantsur> I just hope people do read it :)
15:36:46 <TheJulia> Okay, then RFE review it is!
15:36:49 <TheJulia> #topic RFE Review
15:37:03 <TheJulia> Looks like we have two items that have been proposed, would anyone like to introduce them?
15:37:05 <dtantsur> yeah
15:37:14 <dtantsur> They're not mine, but I've added them, soo
15:37:21 <dtantsur> #link https://storyboard.openstack.org/#!/story/2007646 More convenient state transitions in case of failures
15:37:31 <dtantsur> this came from the Friday's SPUC (one of the action items)
15:37:50 <dtantsur> two pretty minor additions that may potentially make newcomers' life easier
15:38:44 <iurygregory> this one sounds interesting =)
15:39:02 <TheJulia> seems reasonable, at least from a 10,000 foot view
15:39:03 <kaifeng> we have just discussed this kind of guard today :)
15:39:13 <dtantsur> to be clear, it doesn't add anything new, just aliases for existing things
15:39:20 <dtantsur> but ones that are hopefully easier to discover and remember
15:39:35 <dtantsur> I had another RFE with aliases for provisioning verbs, but I've lost it in storyboard....
15:39:54 <dtantsur> ... found it, thank you firefox
15:40:12 <dtantsur> #link https://storyboard.openstack.org/#!/story/2007551 An RFE in a similar spirit with 2 more actions
15:41:03 <TheJulia> +1000
15:41:28 <dtantsur> both are pretty easy to implement. so, while I can do it, I encourage anyone who wants to get into ironic development to take them
15:41:55 <dtantsur> any objections to either of these two?
15:42:00 <iurygregory> Can I take the second? =)
15:42:07 <dtantsur> iurygregory: sure, assign yourself
15:42:22 <TheJulia> no objections at all
15:42:28 <iurygregory> done =)
15:42:36 <kaifeng> no objection, in addition I'd like to seek a transition from deploy wait to deploy failed
15:43:02 <dtantsur> kaifeng: mmm, interesting, maybe we should make 'abort' do that?
15:43:14 <dtantsur> rather than being an alias for 'deleted'?
15:43:34 <dtantsur> (this is re 1st RFE action #2)
15:43:38 <rpittau> abort suggests a return to a precedent state though
15:43:39 <rloo> for 2007646, (sorry if this is bikeshedding), but I wonder if 'recover' is the right word. this is just moving the node's provision state, right?
15:43:54 <rloo> oh wait. what does 'initial proposal (obsolete)' mean?
15:43:56 <kaifeng> dtantsur: quite alike, there is no way to cancel; we can only wait until it fails by itself.
15:44:09 <dtantsur> rloo: the initial text of the RFE as filed by arne_wiebalck
15:44:13 <rloo> wait. i have to actually READ it...
15:44:15 <rpittau> maybe call it explicitly 'fail' ?
15:44:30 <dtantsur> rpittau: we even have a 'fail' action, we just don't expose it
15:45:07 <dtantsur> rloo: 'recover' is based on the questions I receive pretty often: "How do I recover a node from the error state?" and similar
15:45:18 <dtantsur> I agree that it's far from being perfect
15:45:23 <kaifeng> inspector uses abort to fail an inspection, so maybe it suits ironic
15:45:52 <dtantsur> abort on clean wait causes clean failed
15:46:03 <dtantsur> yeah, I agree, abort on deploy wait should end up in deploy failed
15:47:03 <dtantsur> updated https://storyboard.openstack.org/#!/story/2007646
15:47:33 <kaifeng> looks like i misunderstood the recover, so it just moves the node out of a failed state, do we care about the clean up too?
15:47:47 <rloo> i'm confused. in the rfe, it sez 'a node can be recovered by applying the deleted transition'. i don't think that's what the initial proposal is. or maybe i misunderstand.
15:48:12 <dtantsur> the proposed 'recover' action is an alias for whatever action moves a failed node to a non-failed state
15:48:20 <dtantsur> e.g. deploy failed -> (deleted) -> available
15:48:48 <dtantsur> based on the current state and target_provision_state
15:48:57 <dtantsur> it won't have any new logic behind it
15:49:03 <rloo> i don't think that's what was desired. i think it is more like 'deploying' -> error, but the user wants the node to go back to active state.
15:49:27 <rloo> more like nova's reset, i forgot the exact command.
15:49:27 <dtantsur> I'd quite like a 'retry' action
15:49:36 <kaifeng> oh, so it's an alias to avoid the old word which is kind of misleading, but maybe it doesn't help for non-newcomers :)
15:50:10 <dtantsur> arne_wiebalck pointed out that "undeploy" is not an obvious way to make a failed node available again
15:50:35 <dtantsur> btw do we have an agreement on the 2nd half, i.e. supporting abort on 'deploy wait'?
15:50:47 <arne_wiebalck> dtantsur: yes
15:51:24 <rloo> abort works on 'clean failed' and 'inspect failed' ?
15:51:28 <dtantsur> rloo: yes
15:51:31 <dtantsur> wait
15:51:36 <dtantsur> on 'wait'
15:51:48 <JayF> abort is to get you from *_wait -> * failed
15:51:49 <dtantsur> abort works on 'clean wait' and 'inspect wait', but not on 'deploy wait'
15:51:58 <TheJulia> Quick note, we only have about 8 minutes left.
15:52:14 <arne_wiebalck> rloo: the original idea was to have reset-state command, but maybe that is too big of a hammer
15:52:15 <rloo> am looking at our state xsition diagram, and don't see 'abort' from 'clean failed': https://docs.openstack.org/ironic/latest/_images/states.svg
15:52:28 <TheJulia> arne_wiebalck: but the purpose of a big hammer is actually what is needed
15:52:29 <JayF> clean failed is the target state after an abort
15:52:38 <JayF> clean failed is restored by going manage->provide
15:52:51 <JayF> which puts it in manageable and restarts cleaning
15:52:58 <dtantsur> TheJulia: I'm not convinced the big hammer is needed if we get other things right
15:53:09 <dtantsur> a good thing about editing the database is that people know they're in danger..
15:53:18 <TheJulia> dtantsur: that presumes we can see and handle all edge cases
15:53:32 <dtantsur> that's our mission :)
15:53:34 <JayF> dtantsur: that's a really unreasonable answer for the real world, IMO
15:53:45 <TheJulia> we already... for a very long time now, have had people delete nodes completely as their big hammer... except that is also the last thing we want them to ever do
15:53:56 <dtantsur> JayF: that's why we have API and any levels of protection
15:54:01 <rloo> downstream we've seen where the node is in a wedged state. i don't recall the details now.
15:54:12 <dtantsur> trust me, your position on "don't let people disable cleaning" is not shared by a lot of field folks either :)
15:54:19 <JayF> dtantsur: and we made that configurable :)
15:54:32 <dtantsur> and people just disabled the heck out of it
15:54:54 <JayF> dtantsur: real talk, not from current job, but I've seen manual DB edits cause downtime when someone fat fingered something
15:55:04 <JayF> dtantsur: I don't think that's a reasonable alternative
15:55:06 <rloo> i think this discussion might be more relevant if we provided details wrt when a node gets wedged. (haven't had time to dig into that)
15:55:11 <dtantsur> the edge case I don't know how to handle (a PTG topic) is how to break out from stuck "deploying"
15:55:24 <TheJulia> rloo: ++
15:55:30 <dtantsur> but just editing it to 'available' may get us into trouble
15:55:40 <TheJulia> dtantsur: ++
15:55:50 <rloo> i don't think our usecase was to move a node to available.
15:56:01 <arne_wiebalck> rloo: one example is the conductor crashing during deploy
15:56:12 <arne_wiebalck> rloo: this leaves the nodes in error
15:56:23 <dtantsur> error is not a stuck state
15:56:31 <rloo> it was that a bm node has an instance. maybe it was rebuilding and something happened in the code (I don't recall). the node is still 'active' (from nova sense) but we can't update the ironic state.
15:56:58 <rloo> ie, i think we can put the instance into active in nova, but no equiv in ironic.
15:57:15 <dtantsur> my objection is based on the fact that we don't know what will happen to a node if we just edit the state in the DB, bypassing locks, cleaning, and so on
15:57:31 <arne_wiebalck> rloo: in nova and cinder you can set the state to whatever the admin thinks is right
15:57:33 <rloo> dtantsur: error isn't a stuck state BUT there isn't any way to put that ironic node into 'active' :-(
15:57:40 <dtantsur> rloo: 'deleted'
15:57:44 <TheJulia> rloo: that is actually the adopt feature, except it only starts from ?two? possible states, not later on in the ownership of the node
15:57:52 <dtantsur> which is very obscure, hence the 'recover' proposal
15:57:56 <rloo> no. deleted won't put it into active, it'll make it available.
15:57:57 <JayF> dtantsur: I'd assume any such API would cancel locks and such as well, although that is pretty complex
15:58:01 <dtantsur> ah, sorry
15:58:03 <rloo> or did something change recently?
15:58:16 <dtantsur> rloo: how do you imagine putting a half-deleted node back to active?
15:58:26 <dtantsur> it might have had its VIFs already removed..
15:58:37 <rloo> dtantsur: how does nova put a half-deployed (NOT DELETED) node back to active?
15:58:47 <rloo> dtantsur: it relies on the admin knowing what they are doing.
15:58:56 <TheJulia> Everyone, we have two minutes left and this really seems like a topic that needs higher bandwidth discussion
15:59:02 <rloo> dtantsur:  we (me anyway) are not talking about deleting anything.
15:59:10 <TheJulia> and an understanding of failure cases that people presently encounter
15:59:11 <dtantsur> rloo: 'error' happens when deleting fails
15:59:14 <rloo> ++ agree with Julia.
15:59:30 <rloo> we're talking about (I am) rebuilding.
15:59:31 <dtantsur> and 'rebuild' can recover it back to 'active'
15:59:47 <dtantsur> note that deleting->(fail)->error is the only way to get to error
16:00:01 <kaifeng> some *-ing states are transitional and there is no way to quit; restarting the conductor can recover the state, but I believe it would be nice to have some guarding task.
16:00:02 <rloo> i don't really think this discussion is productive. i think we need real use cases.
16:00:10 <TheJulia> s/use/failure/
16:00:28 <rloo> cuz we're mixing deploy/delete/error/failure/active/available.
16:00:32 <TheJulia> Anyway, does anyone have anything else to discuss or raise before we end the meeting?
16:00:34 <TheJulia> rloo: ++++
16:00:34 * dtantsur will respond kaifeng after the meeting
16:00:40 <TheJulia> rloo: a chart is needed, honestly
16:00:50 <rloo> i like our state diagram :)
16:01:23 <dtantsur> if anybody wants to take the 'abort' part - feel welcome
16:01:32 <dtantsur> this one seems not controversial
16:01:56 <TheJulia> Thanks everyone! Have a wonderful week!
16:02:00 <kaifeng> it's a simple change anyone can take it :)
16:02:12 <rloo> dtantsur: maybe a new RFE, I think it means allow ironic node in 'deploy wait' to be aborted?
16:02:42 <rloo> 'deploy wait' -> abort -> 'deploy failed' ?
16:02:43 <dtantsur> rloo: https://storyboard.openstack.org/#!/story/2007646
16:02:45 <TheJulia> I think deploy already kind of allows it
16:02:50 <TheJulia> I _think_
16:02:52 <dtantsur> no
16:03:03 <dtantsur> it has 'deleted', but that brings it all the way through cleaning
16:03:08 <TheJulia> ahh
16:03:20 <TheJulia> for an admin that is likely "okay"
16:03:31 <rloo> in fact, it isn't 'deploy wait' i think it is 'callback wait' or something odd like that. would be great to rename that state...
16:03:38 <dtantsur> wait call-back, yeah
16:04:17 <rloo> dtantsur: thx, i see you updated 2007646. I'm good with 2. But not with 1. Not yet anyway.
16:04:23 <rloo> I'll comment in the rfe.
16:04:38 <dtantsur> thanks!
16:05:38 <dtantsur> TheJulia: time for #endmeeting?
16:05:50 <TheJulia> #endmeeting
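
Note on the RFE discussion (story 2007646): the two ideas raised in the meeting, 'abort' from a wait state and 'recover' as an alias for an existing verb, can be pictured as simple lookup tables. The sketch below is illustrative only. All table and function names are hypothetical and are not taken from ironic's code, the 'deploy wait' entry reflects the proposal rather than current behavior, and resolving 'recover' from the current state alone is a simplification (the discussion also mentions target_provision_state).

```python
# Hypothetical sketch of the transitions discussed in the meeting.
# None of these names are ironic APIs; states and verbs are plain strings.

# 'abort' as described in the log: a *_wait state goes to the matching failed state.
ABORT_TARGETS = {
    "clean wait": "clean failed",
    "inspect wait": "inspect failed",
}

# Proposed in the RFE discussion, not existing behavior at the time of the meeting.
PROPOSED_ABORT_TARGETS = {
    "deploy wait": "deploy failed",
}

# 'recover' adds no new logic: it resolves to a verb that already exists.
# Assumption: only the two examples mentioned in the log are listed here.
RECOVER_ALIASES = {
    "deploy failed": "deleted",  # deploy failed -> (deleted) -> available
    "clean failed": "manage",    # clean failed -> (manage) -> manageable
}


def abort_target(state, include_proposed=False):
    """Return the state a node would land in after 'abort'."""
    table = dict(ABORT_TARGETS)
    if include_proposed:
        table.update(PROPOSED_ABORT_TARGETS)
    if state not in table:
        raise ValueError("abort is not allowed from state %r" % state)
    return table[state]


def resolve_recover(state):
    """Return the existing verb that 'recover' would stand for."""
    if state not in RECOVER_ALIASES:
        raise ValueError("nothing to recover from state %r" % state)
    return RECOVER_ALIASES[state]


if __name__ == "__main__":
    print(abort_target("clean wait"))
    print(abort_target("deploy wait", include_proposed=True))
    print(resolve_recover("deploy failed"))
```

Keeping the proposed entry in its own table makes the uncontroversial half of the RFE (abort on 'deploy wait') visible as a one-line addition, while the 'recover' alias stays a pure lookup over verbs that already exist.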