15:00:18 #startmeeting ironic
15:00:18 Meeting started Mon May 11 15:00:18 2020 UTC and is due to finish in 60 minutes. The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:19 o/
15:00:20 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:21 o/
15:00:22 o/
15:00:22 The meeting name has been set to 'ironic'
15:00:27 \o
15:00:29 o/
15:00:29 o/
15:00:30 Good morning everyone!
15:00:31 o/
15:00:36 o/
15:00:38 o/
15:01:11 Our agenda can be found on the wiki.
15:01:13 #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting
15:01:15 o/
15:01:28 #topic Announcements / Reminders
15:01:39 Three items to note this week!
15:01:41 \o
15:02:01 o/
15:02:03 #info Victoria Priorities Document under discussion, please join that discussion in review!
15:02:10 o/
15:02:12 #link https://review.opendev.org/#/c/720100/
15:02:13 patch 720100 - ironic-specs - WIP - Victoria Cycle Priorit(y|ies) - 1 patch set
15:02:48 #info The Baremetal Whitepaper effort will be holding its third session on Tuesday May 12th, at 2PM UTC
15:02:55 #link https://cern.zoom.us/j/94248770580
15:03:34 #info PTG Topics etherpad is up for comments/additions/thoughts/crazy ideas.
15:03:36 #link https://etherpad.opendev.org/p/Ironic-VictoriaPTG-Planning
15:03:50 Does anyone have anything else to announce or remind us of this week?
15:05:14 * iurygregory doesn't
15:05:33 * TheJulia goes and sees if we had any action items from last week
15:06:29 No meeting-related action items, so it seems like we could skip to reviewing subteam status.
15:06:33 ++
15:06:45 * TheJulia wonders if we need to file a "Take over the world" or "world domination" action item every week.
15:06:59 won't hurt at least, will it?
15:07:10 It will be more words to type every Monday morning :)
15:07:13 ++
15:07:44 I can type the action if needed
15:07:47 Okay, well then I guess we'll move on!
15:08:00 #topic Review sub-team status reports
15:08:09 #link https://etherpad.opendev.org/p/IronicWhiteBoard
15:08:28 \o/
15:08:31 \o/
15:08:34 well, at least they're back :)
15:08:38 o/
15:08:41 Starting around line 220 in the etherpad
15:10:10 Any thoughts on logging the wsme changes on the items to review?
15:10:22 +1
15:10:27 yeah
15:10:48 ++
15:10:49 k, if someone wants to add some of those links now it would be awesome
15:11:42 I guess we can nuke the software RAID item?
15:12:00 yes
15:12:09 done!
15:12:11 Do we want to keep track of the old items from the Grenade work, or can I remove old things? =)
15:12:21 iurygregory: remove the old things!
15:12:29 TheJulia, sure
15:12:52 usually stuff older than 2 weeks can be safely removed
15:12:58 (unless it's still up-to-date)
15:13:01 ++
15:13:07 dtantsur, ok =)
15:13:08 Looks like the v6 CI jobs need a rebase
15:13:49 I think we are done with Python3
15:13:55 wdyt rpittau ?
15:14:04 I think we can remove the topic =)
15:14:08 we can remove that, yes
15:15:23 Looks like two of the dib changes need reviews
15:16:12 yes, please, they're the last 2
15:16:50 Adding to the list
15:17:53 I wonder if there is any interest in deployment state callbacks with nova
15:18:15 I guess that really depends on the number of new deployments being performed against baremetal
15:19:04 And the power state callbacks were quite some work already.
15:19:15 could be a huge scalability improvement
15:19:35 wait for CERN to start hitting problems with periodic sync? :-P
15:19:40 Yeah, I'd just prefer it be driven by those actually actively encountering it
15:20:00 :)
15:20:10 Anyway, shall we proceed to priorities for the coming week?
15:21:06 ++
15:21:24 #topic Priorities for the coming week
15:21:30 #link https://etherpad.opendev.org/p/IronicWhiteBoard
15:22:40 can we get the release model spec there please?
15:22:50 dtantsur: please add it
15:23:10 done
15:23:23 TheJulia: the CI failure on the dib conversion patch was because of the timeout
15:24:26 TheJulia: https://review.opendev.org/#/c/688299/ needs the -2 removed (not urgent)
15:24:26 patch 688299 - python-ironicclient - Add `network_data` ironic node attribute support - 9 patch sets
15:24:37 Thanks
15:24:51 Done
15:25:49 That looks good to me, but does anyone see anything we're missing or that needs to be added?
15:26:10 nope, looks fine
15:26:31 lgtm :)
15:26:36 dtantsur: what about timeouts/retries for jsonrpc?
15:26:57 mm, yeah, there are stable backports to merge
15:27:04 * dtantsur is sleepy today
15:27:20 it's Monday =)
15:27:53 it is a Monday
15:27:53 oh, https://review.opendev.org/725867 another fun one
15:27:54 patch 725867 - ironic - Mark more configuration options as reloadable - 2 patch sets
15:28:30 and https://review.opendev.org/#/c/726378/ could use opinions
15:28:31 patch 726378 - ironic-lib - image_convert: retry resource unavailable and make... - 3 patch sets
15:28:37 * dtantsur loves his own definition of "nope"
15:28:45 hehehe
15:29:14 it's not just Monday, it's the Monday after a night when a storm tried to make a musical instrument out of our balcony
15:29:25 wow
15:29:43 an idea for 726378: maybe use a reduced memory limit on the retry?
15:30:06 kaifeng: how will reducing help?
if anything, it will increase the chance of failing
15:30:58 anyway, let's discuss on the patch
15:31:12 if memory is low, retrying with the same resource limit seems identical
15:31:32 np
15:32:37 I looked at the source of qemu-img, and it dynamically scales as it assembles, so stepping down the amount of RAM may not help if it truly needs to map out something basically fragmented
15:33:11 Which kind of caused me to think of https://review.opendev.org/#/c/726483/ as a guard so we hopefully avoid OOM conditions
15:33:11 patch 726483 - ironic - WIP: Guard conductor from consuming all of the ram - 1 patch set
15:33:28 Anyway, the list looks good to me. Are we good to proceed?
15:33:31 ++
15:33:38 ++
15:34:06 ++
15:34:10 We have no explicit topics. I believe we've also basically covered the SIG item, unless there is more, arne_wiebalck?
15:34:17 Which I guess takes us to RFE Review
15:34:18 I actually did have a topic..
15:34:20 nope
15:34:23 * dtantsur not sure where it went
15:34:37 It was about the new release model proposal; maybe we should just ask people to review it
15:35:26 dtantsur: I moved it to announcements earlier, unless you really think there is something to discuss right now?
15:35:40 ah, ok
15:35:49 no, I think it requires careful reading
15:35:56 I just hope people do read it :)
15:36:46 Okay, then RFE review it is!
15:36:49 #topic RFE Review
15:37:03 Looks like we have two items that have been proposed; would anyone like to introduce them?
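(Editor's aside: the image_convert retry idea discussed above for patch 726378 can be sketched roughly as below. The function name, the stderr heuristic, and the retry parameters are illustrative assumptions, not ironic-lib's actual implementation.)

```python
import subprocess
import time

def convert_with_retries(cmd, attempts=3, delay=1.0):
    """Run an image-conversion command, retrying on transient memory pressure.

    Sketch of the 'retry on resource unavailable' idea from patch 726378;
    the error-matching heuristic and parameters are assumptions.
    """
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return proc.stdout
        # qemu-img reports allocation failures on stderr; only those are
        # worth retrying -- anything else fails immediately.
        if 'Cannot allocate memory' not in proc.stderr or attempt == attempts:
            raise RuntimeError('conversion failed: ' + proc.stderr.strip())
        time.sleep(delay)

# Example (assumes qemu-img is installed):
# convert_with_retries(['qemu-img', 'convert', '-O', 'raw', 'in.qcow2', 'out.raw'])
```

As noted in the discussion, retrying with the same limit may just fail again; the patch review is where the actual backoff policy was hashed out.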
15:37:05 yeah
15:37:14 They're not mine, but I've added them, soo
15:37:21 #link https://storyboard.openstack.org/#!/story/2007646 More convenient state transitions in case of failures
15:37:31 this came from Friday's SPUC (one of the action items)
15:37:50 two pretty minor additions that may potentially make newcomers' lives easier
15:38:44 this one sounds interesting =)
15:39:02 seems reasonable, at least from a 10,000-foot view
15:39:03 we have just discussed this kind of guard today :)
15:39:13 to be clear, it doesn't add anything new, just aliases for existing things
15:39:20 but ones that are hopefully easier to discover and remember
15:39:35 I had another RFE with aliases for provisioning verbs, but I've lost it in storyboard....
15:39:54 ... found it, thank you firefox
15:40:12 #link https://storyboard.openstack.org/#!/story/2007551 An RFE in a similar spirit with 2 more actions
15:41:03 +1000
15:41:28 both are pretty easy to implement. so, while I can do it, I encourage anyone who wants to get into ironic development to take them
15:41:55 any objections to either of these two?
15:42:00 Can I take the second? =)
15:42:07 iurygregory: sure, assign yourself
15:42:22 no objections at all
15:42:28 done =)
15:42:36 no objection; in addition, I'd like to seek a transition from deploy wait to deploy failed
15:43:02 kaifeng: mmm, interesting, maybe we should make 'abort' do that?
15:43:14 rather than being an alias for 'deleted'?
15:43:34 (this is re 1st RFE action #2)
15:43:38 abort suggests a return to a precedent state though
15:43:39 for 2007646 (sorry if this is bikeshedding), I wonder if 'recover' is the right word. this is just moving the node's provision state, right?
15:43:54 oh wait. what does 'initial proposal (obsolete)' mean?
15:43:56 dtantsur: quite alike, there is no way to cancel; we can only wait for it to fail by itself.
15:44:09 rloo: the initial text of the RFE as filed by arne_wiebalck
15:44:13 wait. i have to actually READ it...
15:44:15 maybe call it explicitly 'fail'?
15:44:30 rpittau: we even have a 'fail' action, we just don't expose it
15:45:07 rloo: 'recover' is based on the questions I receive pretty often: "How do I recover a node from the error state?" and similar
15:45:18 I agree that it's far from being perfect
15:45:23 inspector uses abort to fail an inspection, so maybe it suits ironic
15:45:52 abort on clean wait causes clean failed
15:46:03 yeah, I agree, abort on deploy wait should end up in deploy failed
15:47:03 updated https://storyboard.openstack.org/#!/story/2007646
15:47:33 looks like i misunderstood the recover, so it just moves the node out of a failed state; do we care about the cleanup too?
15:47:47 i'm confused. in the rfe, it sez 'a node can be recovered by applying the deleted transition'. i don't think that's what the initial proposal is. or maybe i misunderstand.
15:48:12 the proposed 'recover' action is an alias for whatever action moves a failed node to a non-failed state
15:48:20 e.g. deploy failed -> (deleted) -> available
15:48:48 based on the current state and target_provision_state
15:48:57 it won't have any new logic behind it
15:49:03 i don't think that's what was desired. i think it is more like 'deploying' -> error, but the user wants the node to go back to active state.
15:49:27 more like nova's reset, i forgot the exact command.
15:49:27 I'd quite like a 'retry' action
15:49:36 oh, so it's an alias to avoid the old word which is kind of misleading, but maybe it doesn't help for non-newcomers :)
15:50:10 arne_wiebalck pointed out that "undeploy" is not an obvious way to make a failed node available again
15:50:35 btw do we have an agreement on the 2nd half, i.e. supporting abort on 'deploy wait'?
15:50:47 dtantsur: yes
15:51:24 abort works on 'clean failed' and 'inspect failed' ?
15:51:28 rloo: yes
15:51:31 wait
15:51:36 on 'wait'
15:51:48 abort is to get you from *_wait -> *_failed
15:51:49 abort works on 'clean wait' and 'inspect wait', but not on 'deploy wait'
15:51:58 Quick note, we only have about 8 minutes left.
15:52:14 rloo: the original idea was to have a reset-state command, but maybe that is too big of a hammer
15:52:15 am looking at our state transition diagram, and don't see 'abort' from 'clean failed': https://docs.openstack.org/ironic/latest/_images/states.svg
15:52:28 arne_wiebalck: but the purpose of a big hammer is actually what is needed
15:52:29 clean failed is the target state after an abort
15:52:38 clean failed is restored by going manage->provide
15:52:51 which puts it in manageable and restarts cleaning
15:52:58 TheJulia: I'm not convinced the big hammer is needed if we get other things right
15:53:09 a good thing about editing the database is that people know they're in danger..
15:53:18 dtantsur: that presumes we can see and handle all edge cases
15:53:32 that's our mission :)
15:53:34 dtantsur: that's a really unreasonable answer for the real world, IMO
15:53:45 we already... for a very long time now, have had people delete nodes completely as their big hammer... except that is also the last thing we want them to ever do
15:53:56 JayF: that's why we have an API and levels of protection
15:54:01 downstream we've seen where the node is in a wedged state. i don't recall the details now.
15:54:12 trust me, your position on "don't let people disable cleaning" is not shared by a lot of field folks either :)
15:54:19 dtantsur: and we made that configurable :)
15:54:32 and people just disabled the heck out of it
15:54:54 dtantsur: real talk, not from my current job, but I've seen manual DB edits cause downtime when someone fat-fingered something
15:55:04 dtantsur: I don't think that's a reasonable alternative
15:55:06 i think this discussion might be more relevant if we provided details wrt when a node gets wedged.
(haven't had time to dig into that)
15:55:11 the edge case I don't know how to handle (a PTG topic) is how to break out from a stuck "deploying"
15:55:24 rloo: ++
15:55:30 but just editing it to 'available' may get us into trouble
15:55:40 dtantsur: ++
15:55:50 i don't think our use case was to move a node to available.
15:56:01 rloo: one example is the conductor crashing during deploy
15:56:12 rloo: this leaves the nodes in error
15:56:23 error is not a stuck state
15:56:31 it was that a bm node has an instance. maybe it was rebuilding and something happened in the code (I don't recall). the node is still 'active' (in the nova sense) but we can't update the ironic state.
15:56:58 ie, i think we can put the instance into active in nova, but no equiv in ironic.
15:57:15 my objection is based on the fact that we don't know what will happen to a node if we just edit the state in the DB, bypassing locks, cleaning, and so on
15:57:31 rloo: in nova and cinder you can set the state to whatever the admin thinks is right
15:57:33 dtantsur: error isn't a stuck state BUT there isn't any way to put that ironic node into 'active' :-(
15:57:40 rloo: 'deleted'
15:57:44 rloo: that is actually the adopt feature, except it only starts from ?two? possible states, not later on in the ownership of the node
15:57:52 which is very obscure, hence the 'recover' proposal
15:57:56 no. deleted won't put it into active, it'll make it available.
15:57:57 dtantsur: I'd assume any such API would cancel locks and such as well, although that is pretty complex
15:58:01 ah, sorry
15:58:03 or did something change recently?
15:58:16 rloo: how do you imagine putting a half-deleted node back to active?
15:58:26 it might have had its VIFs already removed..
15:58:37 dtantsur: how does nova put a half-deployed (NOT DELETED) node back to active?
15:58:47 dtantsur: it relies on the admin knowing what they are doing.
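(Editor's aside: the abort and 'recover' semantics debated above can be sketched as a toy lookup table. This is an illustration of the proposals in stories 2007646/2007551 as described in the discussion, not ironic's real state-machine code; the table contents and function name are assumptions.)

```python
# Toy model of the verbs discussed above -- NOT ironic's implementation.

# 'abort' today takes 'clean wait'/'inspect wait' to their failed states;
# story 2007646 proposes extending it to 'deploy wait' as well.
ABORT = {
    'clean wait': 'clean failed',
    'inspect wait': 'inspect failed',
    'deploy wait': 'deploy failed',   # proposed addition, not current behavior
}

# 'recover', per dtantsur: an alias for whatever existing action already
# moves a failed node to a non-failed state -- no new logic behind it.
RECOVER_ALIAS = {
    'deploy failed': 'deleted',   # deploy failed -> (deleted) -> available
    'clean failed': 'manage',     # then 'provide' restarts cleaning
    'inspect failed': 'manage',
}

def resolve(state, verb):
    """Resolve a verb against the toy tables; reject invalid combinations."""
    table = {'abort': ABORT, 'recover': RECOVER_ALIAS}[verb]
    try:
        return table[state]
    except KeyError:
        raise ValueError('%r is not allowed in state %r' % (verb, state))
```

The point dtantsur makes above is exactly this shape: 'recover' would be pure aliasing over existing transitions, resolved from the current state (and target_provision_state), with no new state-machine paths.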
15:58:56 Everyone, we have two minutes left and this really seems like a topic that needs higher-bandwidth discussion
15:59:02 dtantsur: we (me anyway) are not talking about deleting anything.
15:59:10 and an understanding of the failure cases that people presently encounter
15:59:11 rloo: 'error' happens when deleting fails
15:59:14 ++ agree with Julia.
15:59:30 we're talking about (I am) rebuilding.
15:59:31 and 'rebuild' can recover it back to 'active'
15:59:47 note that deleting->(fail)->error is the only way to get to error
16:00:01 some *-ing states are transitional and there is no way to quit; restarting the conductor can recover the state, but I believe it would be nice to have some guarding task.
16:00:02 i don't really think this discussion is productive. i think we need real use cases.
16:00:10 s/use/failure/
16:00:28 cuz we're mixing deploy/delete/error/failure/active/available.
16:00:32 Anyway, does anyone have anything else to discuss or raise before we end the meeting?
16:00:34 rloo: ++++
16:00:34 * dtantsur will respond to kaifeng after the meeting
16:00:40 rloo: a chart is needed, honestly
16:00:50 i like our state diagram :)
16:01:23 if anybody wants to take the 'abort' part - feel welcome
16:01:32 this one seems not controversial
16:01:56 Thanks everyone! Have a wonderful week!
16:02:00 it's a simple change; anyone can take it :)
16:02:12 dtantsur: maybe a new RFE? I think it means allowing an ironic node in 'deploy wait' to be aborted
16:02:42 'deploy wait' -> abort -> 'deploy failed' ?
16:02:43 rloo: https://storyboard.openstack.org/#!/story/2007646
16:02:45 I think deploy already kind of allows it
16:02:50 I _think_
16:02:52 no
16:03:03 it has 'deleted', but that brings us all the way through cleaning
16:03:08 ahh
16:03:20 for an admin that is likely "okay"
16:03:31 in fact, it isn't 'deploy wait'; i think it is 'callback wait' or something odd like that. would be great to rename that state...
16:03:38 wait call-back, yeah
16:04:17 dtantsur: thx, i see you updated 2007646. I'm good with 2. But not with 1. Not yet anyway.
16:04:23 I'll comment in the rfe.
16:04:38 thanks!
16:05:38 TheJulia: time for #endmeeting?
16:05:50 #endmeeting
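(Editor's appendix: for readers following the state-machine thread, the recovery paths mentioned in the log map to these baremetal CLI verbs. The node name is a placeholder, and the 'deploy wait' abort is only a proposal in story 2007646, not current behavior.)

```shell
# 'clean failed' recovery path mentioned at 15:52: manage -> provide
openstack baremetal node manage node-0    # clean failed -> manageable
openstack baremetal node provide node-0   # manageable -> cleaning -> available

# Abort a waiting operation; today this covers 'clean wait'/'inspect wait'.
# Extending it to 'deploy wait' is what story 2007646 proposes.
openstack baremetal node abort node-0

# 'deploy failed' back to available via the (admittedly obscure) undeploy
# verb, i.e. the 'deleted' transition discussed at 15:48.
openstack baremetal node undeploy node-0
```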