15:00:40 #startmeeting ironic
15:00:40 Meeting started Mon Aug 21 15:00:40 2023 UTC and is due to finish in 60 minutes. The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:40 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:40 The meeting name has been set to 'ironic'
15:00:46 o/
15:00:48 o/
15:01:04 Julia Kreger proposed openstack/ironic master: DNM Enable OVN https://review.opendev.org/c/openstack/ironic/+/885087
15:01:06 Good morning Ironic'ers!
15:01:09 A reminder we operate under the OpenInfra Foundation CoC https://openinfra.dev/legal/code-of-conduct
15:01:16 #topic Announcements/Reminders
15:01:21 o/
15:01:24 #note Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio: https://tinyurl.com/ironic-weekly-prio-dash
15:01:36 I'm also going to note that
15:01:46 #note Bobcat non-client library freeze is Thursday, Aug 24
15:02:25 Finally, one about the PTG
15:02:35 #note PTG is virtual and taking place October 23-27 2023
15:02:41 #link https://etherpad.opendev.org/p/ironic-ptg-october-2023
15:02:51 please use the etherpad to collect topics of interest for the PTG
15:03:19 Any comments/questions on the announcements, or anything to add?
15:04:13 * dtantsur has nothing
15:04:37 I'm going to skip the next item; we do not have action items from the last meeting to follow up on.
15:04:42 #topic Review Ironic CI Status
15:04:56 We have a couple of CI-related items on the agenda I wanna let folks know about before we get into general status
15:05:38 frickler brought it to our attention in IRC Friday afternoon that Ironic is one of the projects left with the most zuul config errors
15:05:49 so, apparently our power just dropped, I don't know how much longer we'll be on
15:06:03 this is basically when CI is so broken that zuul can't even read the config (usually it means we haven't had any patches pass testing since the zuul queue change, months or maybe a year+ ago)
15:06:10 #link https://od42.de/ironic
15:06:30 that link doesn't work for me, but if I use the filter manually it's obvious we have old bugfix and stable branches plagued by the issue
15:07:06 o/ (man I'm late)
15:07:09 heck, it looks like python-ironicclient gates as recent as yoga are impacted
15:07:42 dtantsur: TheJulia: Is it one of your teams that uses bugfix/ branches downstream?
15:07:53 That's one of the big pieces of info I'm missing: if these are bugfix branches, which ones can we nuke
15:08:03 JayF: we used to rely on them; no longer I think
15:08:08 Mine does not
15:08:23 not even for ancient releases, right rpittau?
15:08:25 Ack, let me take this action then
15:08:26 I *believe* there was a list made ~1 year ago which enumerated a ton of branches that could be dropped
15:08:38 JayF dtantsur: not anymore no
15:08:44 #action JayF to audit zuul-config-errors, propose retirement of clearly-abandoned branches and try to fix broken ones
15:08:49 JayF: seems I need to do some URL quoting on that redirect :-/
15:08:56 we plan to use bugfix for metal3 upstream but only the latest 1-2
15:08:59 frickler: yeah, it happens but it's obvious what's broken :)
15:09:03 okay, so as far as OCP is concerned, we're fine with going back to short-lived bugfix branches
15:09:07 yep
15:09:20 metal3 - as rpittau said (thanks!)
15:09:21 btw I need to propose new bugfix branches this week :D
15:09:29 Would someone who isn't me mind hitting the list with a "bugfix branch update" saying some of this
15:09:37 with a proposal for how long they should live, etc?
15:09:48 I don't wanna just guess and rugpull, but it's obvious right now we keep 'em up for too long
15:09:52 I thought we already updated our docs?
15:09:52 We can simply get back to what I proposed in the spec back then
15:09:59 and yeah, the docs
15:10:01 ack
15:10:14 JayF: fixed
15:10:19 to summarize my understanding of the existing docs: bugfix branches are retired when their lettered counterpart goes out
15:10:29 frickler: oooh very nice
15:10:55 lol stable/pike NGS
15:10:56 the bugfix branches last for at most 6 months
15:11:05 that screams "jay didn't retire this with the rest" :(
15:11:09 then they can get pulverized
15:11:14 ack; that works for me
15:11:23 So the other CI-related item we have is
15:11:27 yeah, 6 months matches my recollection
15:11:38 After the chat in IRC last week about janders's change and not getting tested on real hardware
15:11:52 I reached out to the HPE team, they claim to have fixed HPE Third Party CI.
15:12:06 https://review.opendev.org/c/openstack/ironic/+/889750 one of their examples, has a run on it
15:12:14 #note HPE Third Party CI is functioning again.
15:12:20 \o/
15:12:25 nice
15:13:02 Is there anything generically about CI we need to speak about?
15:13:16 I think other than the endless sqlite locking battles we've been cleaner than usual?
15:13:41 the metalsmith src jobs are busted at the moment
15:13:45 this impacts ipa CI
15:14:19 ack
15:14:20 I think it depends on the fact that they still use focal
15:14:26 so proposed https://review.opendev.org/c/openstack/metalsmith/+/892146
15:14:29 that makes sense to me, and is a forced migration anyway
15:14:39 we shouldn't release "B" with it on focal anyway
15:14:43 yep
15:15:04 Thanks for looking into that.
15:15:09 Are there any other outstanding CI items?
15:15:09 also CS9 jobs in metalsmith -> https://review.opendev.org/c/openstack/metalsmith/+/869374
15:15:26 landed that one just now
15:15:34 great, thanks!
15:15:40 Do we have a userbase for metalsmith?
15:15:49 I feel like I only ever hear about it when CI is broken
15:16:07 I assume CERN since arne_wiebalck has some activity on it?
15:16:14 nope
15:16:35 mmm not sure, maybe TheJulia or hjensas know?
15:16:36 I just ran into it since zuul is not happy with my raid rebuild patch
15:16:44 Interesting. OK. Maybe I'll add that to PTG topics.
15:16:57 sorry what might I know?
15:17:04 TheJulia: if we have any known users of metalsmith
15:17:18 Just RHOSP
15:17:20 afaik
15:17:32 When I created it, I hoped that people would start using it just in general, as a handy CLI
15:17:44 Maybe I was naive, and we need something equivalent in ironicclient (and the backing API)...
15:17:52 RHOSP is a pretty big user of it then :D
15:18:06 dtantsur: it's possible for all those things to be true at the same time :D
15:18:24 dtantsur: metalsmith can blaze a trail, we can use that to figure out how to make it work in primary clients/apis
15:18:45 the planned but never implemented Deployment API was the next logical step for me
15:18:49 that's good to know though, I just want someone to have a use case for it, RHOSP totally counts
15:18:58 dtantsur: maybe toss that on PTG topics and we can res it?
15:19:12 nobody is going to make time to do it if we don't talk about it and hype it up
15:19:14 I added Metalsmith to the ptg topic list
15:19:16 I can be your hype man dtantsur
15:19:17 I don't see a point in that. We've had these discussions over and over again.
15:19:19 heh
15:19:28 Until someone has a vested interest, it just does not happen...
15:19:33 Problem is, at least in my circles, it gets viewed as this "alternative to ironic" or "replacement of ironic"
15:19:42 and people don't really grok that it is just a client
15:19:44 lol
15:19:50 yeah
15:20:03 Maybe the answer from PTG is gonna be to make better docs out of it :)
15:20:18 I was talking to kubajj this morning about how doing non-nova Ironic deploys is not very intuitive
15:20:23 I think the real issue is tons of people don't know how to actually *use* ironic
15:20:25 Or decide how we can decompose metalsmith into smaller bits and gradually merge
15:20:28 and AFAICT we lack a directive doc on how to do it exactly
15:20:29 even though there are videos, pages, everything else
15:20:29 TheJulia: YES
15:20:35 OH YEAH
15:20:37 instance_info anyone?
15:20:39 almost like we need a class
15:20:47 Who does not like JSON fields without schema validation?
15:21:09 most people I talk to don't even think along those lines, it is a big scary thing they just don't understand in general
15:21:19 If the most basic thing Ironic is doing needs to be taught... we're losing :(
15:21:41 Well, we don't always like to frame Ironic this way
15:21:49 but Ironic deployments are super easy to do... if you have nova in front
15:21:58 it might just be resistance to information because they have no need to touch it because that is not their primary role
15:22:02 By the way, the outreachy season is coming. If we have an easy win, we can try proposing it.
15:22:06 this is a secondary use case and we've always treated it as a secondary use case, whether that's right/wrong/etc
15:23:03 dtantsur: I will have an MLH intern around that season too
15:23:14 dtantsur: but we'd need rough docs to be able to get an intern to curate it into not-terrible docs
15:23:32 We have rough docs, no?
15:23:45 dtantsur: I'll note: incoming interns is also why I'm working on contributor guide updates now (probably Tues you'll see a post with a radical improvement on our ironic-in-devstack docs)
15:23:52 dtantsur: if so I couldn't find them in 10 minutes while working /kuba
15:23:55 *w/kuba
15:24:12 is https://docs.openstack.org/ironic/latest/user/index.html what you mean?
15:24:22 particularly, https://docs.openstack.org/ironic/latest/user/deploy.html
15:24:42 that is exactly the doc I was looking for earlier
15:24:46 dtantsur+++++
15:24:55 kubajj: ^^ fyi I'll also slack the link to you
15:25:11 I've spent quite some time on this document, but I'm sure it can be improved much further
15:25:19 especially the configdrive explanation is lacking
15:25:43 yeah I got feedback about this stuff being confusing from a lost-potential-user the other day too
15:25:48 JayF: I was reading this in the morning but got some error, thought it might be just bifrost
15:25:59 * TheJulia wonders what is the attention span window we should be focusing on
15:26:53 "what is documentation in the tiktok generation" might be another way of framing that mental musing
15:27:00 heh
15:27:01 lol
15:27:18 I think that is based on a flawed premise: we're not doing a good job of making docs discoverable for the altavista generation either ;)
15:27:23 We have a lot of vague concepts. Like instance_info itself.
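For readers following along, the deploy guide linked above boils down to filling in instance_info by hand and triggering the deploy yourself. Below is a minimal sketch only, assuming the baremetal CLI from python-ironicclient; the node name, image URL, and checksum are placeholders rather than values from this meeting, and the exact flags should be checked against the deploy guide.

    # Hedged sketch of a standalone (non-nova) deploy; all values are placeholders.
    baremetal node set node-0 \
        --instance-info image_source=http://example.com/images/my-image.qcow2 \
        --instance-info image_checksum=<checksum-of-the-image>
    baremetal node validate node-0   # checks that deploy prerequisites are satisfied
    baremetal node deploy node-0     # requests the "active" provision state

Whether that sequence is discoverable without nova in front is exactly the documentation gap being discussed here.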
15:27:24 should we do a "bare metal deployment dance"?
15:27:37 we have a large number of docs, it's borderline impossible to know which one you need
15:27:53 and the vague concepts like dtantsur points out make it hard to know what to search for
15:28:25 Well, dunno. I think "User Guide" is a pretty natural place to look
15:28:30 * JayF was convinced at SCALE20x by a librarian that we need one
15:28:38 I'd be more worried that people run away screaming after reading the Installation Guide :D
15:29:31 I'm going to add a note at PTG about this, maybe we can take a swing at it or at least think about it in the intervening time
15:29:57 well Julia beat me to it, but it's on that doc :D
15:30:02 moving on
15:30:10 #topic Review ongoing 2023.2 workstreams
15:30:20 doc topic added to ptg etherpad
15:30:26 #link https://etherpad.opendev.org/p/IronicWorkstreams2023.2
15:31:17 It's too early to fully declare victory
15:31:23 but this has been a crazy productive cycle it seems
15:31:30 so much impactful stuff landing and in progress
15:32:51 Where are we at on the nova side of shard key usage?
15:33:14 testing and positive feedback to johnthetubaguy
15:33:23 then I think he goes begging for reviews
15:33:34 I was struggling with devstack but got it working, so I should have an env to test that in this week
15:34:11 TheJulia: unless you have time and want to dedicate time to it, let me commit to doing that test on Tues
15:34:22 TheJulia: then we can free that up for John and hopefully he lands it
15:35:36 Any other questions/comments/discussions on in-progress work streams?
15:35:48 I can re-review, I think the last time I looked at the code I had high confidence in it
15:35:52 same
15:35:57 I just want to actually test it
15:36:02 Since it is all well-walked pattern changes
15:36:25 let's sync after the meeting on it
15:36:32 ack, going to move on
15:36:39 Nothing listed for RFE Review; skipping that section.
15:36:46 #topic Open Discussion
15:36:50 I had one item for here:
15:37:17 PTL and TC nominations are open. I strongly encourage Ironic contributors to run for PTL and/or TC. If you're interested in being PTL, talk to me.
15:37:31 If nobody else has self-nominated for PTL by midweek, I will re-nominate myself for a third term.
15:38:08 That's all I had, just wanted to draw attention there.
15:38:13 Anything else for open discussion?
15:38:14 Democracy is good, letting Jay get a break is even better!
15:38:17 Go people go!
15:38:21 So, yes, one funny bug
15:38:31 https://bugs.launchpad.net/ironic/+bug/2032377 was brought to my attention by my fellow operator
15:38:41 Eh, I don't mind being PTL tbh; I just appreciate that we cycle leadership and don't wanna break tradition :)
15:38:50 it's stupidly simple, but I have no idea how to work around it cleanly
15:39:33 we can't really leave cloud-init AND glean installed on the same IPA image
15:39:37 that's the bug there, yeah?
15:39:48 nope
15:40:06 it *sounds* like the image has a pre-baked config drive
15:40:08 and we don't find it
15:40:10 Imagine we're cleaning a node that had a configdrive. And IPA has a configdrive.
15:40:11 and we create a new one
15:40:14 and *boom*
15:40:25 oh
15:40:31 dtantsur: ah, in IPA world we should never, ever respect config on disk
15:40:34 that is potentially a security bug
15:40:40 Right. But we do.
15:41:35 so it is when the ramdisk boots, it finds/attaches the config drive data embedded in the iso
15:41:35 we'd almost need glean to have an option to filter block devices and/or look for a different label
15:41:43 and doesn't unmount it for operations, it sounds like?
15:42:12 TheJulia: still simpler. Glean is looking for a configdrive. There are two: the one it should use (in the CD) and the old one on disk.
15:42:25 the one on the disk shouldn't be a block device...
15:42:25 yet
15:42:28 wut
15:42:29 for cleaning
15:42:35 yeah?
15:42:37 TheJulia: how can it be NOT a block device?
15:42:50 it would need to be attached to the loopback to become a device
15:42:50 JayF: cleaning is the biggest problem; after the cleaning the rogue partition will be gone.
15:42:52 or on a not-cleaned device doing a second deploy
15:43:16 we need to look at the simple-init code and if we can get a ramdisk log that would be super helpful
15:43:19 TheJulia: you're talking about a file; configdrive is a partition (on disk) or a whole device (CD)
15:43:36 OH
15:43:41 the whole CD is labeled config-2
15:43:42 wut
15:43:50 yep, that's how DHCP-less works in Ironic
15:43:52 that is how ISO-based configdrives work in VMs
15:43:56 as well
15:44:34 I didn't realize that was how the vmedia ramdisk worked
15:44:45 TheJulia: not always, only when Node.network_data is used
15:44:52 OH
15:45:05 wheeeeeeeeeeeeee
15:46:22 Yeah, I agree that's a nasty bug.
15:46:37 I agree I don't know how we fix it without changes in glean.
15:46:44 AND service steps will increase the scope of the bug
15:47:35 yeah. maybe we should talk to ianw, I can drop him an email
15:47:42 I think we don't have enough information to fully comprehend the bug, since if they have a pre-existing configuration drive, and one based on the image itself, it is sort of a case we never expected
15:47:46 ++
15:48:11 even like glean-use-part-uuid=AAAA-BBBB-CCCC-DDDD
15:48:12 since we expected the node to be cleaned, but this is an instance as a ramdisk with its own config drive
15:48:15 in the kernel command line
15:48:22 TheJulia: we do actually. IPA has a configdrive because it's how DHCP-less works. The disk has a configdrive because it's cleaning after deployment.
15:48:23 just some way for Ironic to signal to glean "use this one"
15:48:40 yeah that's why the bug is tricky; both configdrives are valid, just not in the same context
15:48:56 yeah, but we know it at that point, they are doing it themselves outside of cleaning, which is how I'm reading the bug
15:49:11 mmm?
15:49:52 hmm
15:49:59 we really need to talk with the filer of the bug and ask questions
15:50:13 I talk to him all the time but I don't know which questions you have in mind
15:50:32 We know what is going on, we don't know how to fix it
15:50:39 Well, I'm confused on what exactly they are doing
15:50:56 1) Normal deployment with DHCP-less IPA; 2) Instance tear down; 3) Boom
15:51:08 3) *Boom with DHCP-less cleaning
15:51:17 Configure node.network_data. Deploy the node. Clean the node. At clean time both the original deployed configdrive AND the IPA-node.network_data configdrive exist.
15:51:53 so, our ipa ramdisk should have the smarts to know where to boot from, and not hunt through the OS disks when booted that way
15:52:05 *should* being the keyword there
15:52:07 yep, that's what dtantsur and I were talking about with glean
15:52:17 b/c we'll need to tell glean exactly which partition to help it dedupe
15:52:27 ... or just explicitly run it
15:52:31 for contrast, that's glean's current logic: https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-early.sh#L34-L38
15:52:49 Merged openstack/bifrost master: Remove Fedora from the CI https://review.opendev.org/c/openstack/bifrost/+/892123
15:53:18 yeah, that is semi-problematic
15:53:27 TheJulia: running Glean manually is non-trivial since it gets triggered from udev
15:53:51 yeah
15:54:31 I mean, couldn't we do something like:
15:54:40 1) Disable glean from autorun, via udev and everything else
15:55:03 2) On IPA startup, look for ipa-network-data=blah and if it exists, do some mounting then run LN 53 from glean-early.sh?
15:55:07 glean uses integrated network interface enumeration
15:55:15 so you can't run it late?
15:55:19 not really
15:55:28 yeah, then we need glean to get some sorta hint
15:55:35 that says "no really, this configdrive"
15:55:43 You probably can, but it's going to happen quite late
15:55:50 e.g. currently IPA is After:network-online
15:55:59 good point
15:56:13 we'd basically have to write a separate unit, at which point we've reinvented the wheel
15:56:14 We could have our own service that goes Before
15:56:16 right
15:56:34 * JayF votes for kernel cmdline or on-disk glean config that points it explicitly at a partition uuid
15:56:53 since we should know the partition uuid at create time, yeah?
15:56:58 okay, lemme talk to Ian, maybe he has an opinion too
15:57:04 JayF: we can use /dev/sr0 really..
15:57:13 dtantsur: that is going to potentially vary based on hardware
15:57:18 also true
15:57:20 dtantsur: which is why I'd prefer a uuid-based approach
15:57:42 Aight, we have 3 minutes left
15:57:46 any items remain for open discussion?
15:57:59 I have a quick question regarding the docs. Is there any preferred location where I should describe the hierarchy of kernel/ramdisk parameters? I did not find the current state described anywhere.
15:58:23 you can manually invoke the glean script with the right network info
15:58:30 that's all the udev systemd integration does
15:58:52 kubajj: doc/source/install/configure-glance-images.rst
15:59:01 kubajj: from a cursory running of rg deploy_ramdisk doc/
15:59:03 you could also potentially use fancier udev rules to do what you want. udev is magic but also indecipherable
15:59:06 clarkb: possibly, but then we need to stop using simple-init
15:59:49 .. which may not be a terrible thing because then we can include whatever we invent in all ramdisks (currently it's opt-in)
15:59:52 * JayF hears the bells ring for the top of the hour
15:59:54 #endmeeting
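As a footnote to the glean discussion above: the ambiguity comes down to a label-based lookup. The sketch below is only the rough shape of what glean-early.sh does, not the actual upstream code, and glean-use-part-uuid is the hypothetical hint floated in the meeting, not an existing glean option; the config-2 label and the /mnt/config mount point follow the usual configdrive convention.

    # Rough shape of the early configdrive lookup (NOT the real glean-early.sh):
    # pick a block device labeled config-2 and mount it. During DHCP-less
    # cleaning two such devices can exist at once - the virtual media CD that
    # carries node.network_data and the leftover configdrive partition from the
    # previous deployment - so this lookup can latch onto the wrong one.
    candidates=$(blkid -t LABEL="config-2" -o device)   # e.g. /dev/sr0 and /dev/sda2
    config_dev=$(echo "$candidates" | head -n1)         # first match wins
    mount -o ro "$config_dev" /mnt/config

    # Hypothetical disambiguation hint as floated above (glean-use-part-uuid is
    # not an existing glean option): Ironic would pass the intended source on
    # the kernel command line and glean would prefer it when present. The
    # meeting left open whether the hint should be a partition UUID or a device
    # path such as /dev/sr0; a UUID variant might look like this.
    hint=$(sed -n 's/.*glean-use-part-uuid=\([^ ]*\).*/\1/p' /proc/cmdline)
    if [ -n "$hint" ] && [ -e "/dev/disk/by-partuuid/$hint" ]; then
        config_dev=$(readlink -f "/dev/disk/by-partuuid/$hint")
    fi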