15:00:40 <JayF> #startmeeting ironic
15:00:40 <opendevmeet> Meeting started Mon Aug 21 15:00:40 2023 UTC and is due to finish in 60 minutes.  The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:40 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:40 <opendevmeet> The meeting name has been set to 'ironic'
15:00:46 <TheJulia> o/
15:00:48 <kubajj> o/
15:01:04 <opendevreview> Julia Kreger proposed openstack/ironic master: DNM Enable OVN  https://review.opendev.org/c/openstack/ironic/+/885087
15:01:06 <JayF> Good morning Ironic'ers!
15:01:09 <JayF> A reminder we operate under the OpenInfra Foundation CoC https://openinfra.dev/legal/code-of-conduct
15:01:16 <JayF> #topic Announcements/Reminders
15:01:21 <dtantsur> o/
15:01:24 <JayF> #note Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio: https://tinyurl.com/ironic-weekly-prio-dash
15:01:36 <JayF> I'm also going to note that
15:01:46 <JayF> #note Bobcat non-client library freeze is Thursday, Aug 24
15:02:25 <JayF> Finally, one about the PTG
15:02:35 <JayF> #note PTG is virtual and taking place October 23-27 2023
15:02:41 <JayF> #link https://etherpad.opendev.org/p/ironic-ptg-october-2023
15:02:51 <JayF> please use the etherpad to chat about topics of interest for the PTG
15:03:19 <JayF> Any comments/questions on the announcements, or anything to add?
15:04:13 * dtantsur has nothing
15:04:37 <JayF> I'm going to skip the next item; we do not have action items from the last meeting to follow up on.
15:04:42 <JayF> #topic Review Ironic CI Status
15:04:56 <JayF> We have a couple of CI-related items on the agenda I wanna let folks know about before we get into general status
15:05:38 <JayF> frickler brought it to our attention in IRC Friday afternoon that Ironic is one of the projects left with the most zuul config errors
15:05:49 <TheJulia> so, apparently our power just dropped, I don't know how much longer we'll be on
15:06:03 <JayF> this is basically when CI is so broken that zuul can't even read the config (usually it means we haven't had any patches pass testing since the zuul queue change, months or maybe a year+ ago)
15:06:10 <JayF> #link https://od42.de/ironic
15:06:30 <JayF> that link doesn't work for me, but if I use the filter manually it's obvious we have old bugfix and stable branches plagued by the issue
15:07:06 <rpittau> o/ (man I'm late)
15:07:09 <JayF> heck, it looks like python-ironicclient gates as recent as yoga are impacted
15:07:42 <JayF> dtantsur: TheJulia: Is it one of your teams that use bugfix/ branches downstream?
15:07:53 <JayF> That's one of the big pieces of info I'm missing: if these are bugfix branches, which ones can we nuke
15:08:03 <dtantsur> JayF: we used to rely on them; no longer I think
15:08:08 <TheJulia> Mine does not
15:08:23 <dtantsur> not even for ancient releases, right rpittau?
15:08:25 <JayF> Ack, let me take this action then
15:08:26 <TheJulia> I *believe* there was a list made ~1 year ago which enumerated a ton of branches that could be dropped
15:08:38 <rpittau> JayF dtantsur: not anymore no
15:08:44 <JayF> #action JayF to audit zuul-config-errors, propose retirement of clearly-abandoned branches and try to fix broken ones
15:08:49 <frickler> JayF: seems I need to do some URL quoting on that redirect :-/
15:08:56 <rpittau> we plan to use bugfix for metal3 upstream but only the latest 1-2
15:08:59 <JayF> frickler: yeah, it happens but it's obvious what's broken :)
15:09:03 <dtantsur> okay, so as far as OCP is concerned, we're fine with going back to short-lived bugfix branches
15:09:07 <rpittau> yep
15:09:20 <dtantsur> metal3 - as rpittau said (thanks!)
15:09:21 <rpittau> btw I need to propose new bugfix branches this week :D
15:09:29 <JayF> Would someone who isn't me mind hitting the list with a "bugfix branch update" saying some of this
15:09:37 <JayF> with a proposal for how long they should live, etc?
15:09:48 <JayF> I don't wanna just guess and rugpull, but it's obvious right now we keep 'em up for too long
15:09:52 <rpittau> I thought we'd already updated our docs?
15:09:52 <dtantsur> We can simply get back to what I proposed in the spec back then
15:09:59 <dtantsur> and yeah, the docs
15:10:01 <JayF> ack
15:10:14 <frickler> JayF: fixed
15:10:19 <JayF> to summarize my understanding of the existing docs: bugfix branches are retired when their lettered counterpart goes out
15:10:29 <JayF> frickler: oooh very nice
15:10:55 <JayF> lol stable/pike NGS
15:10:56 <rpittau> the bugfix branches last for at most 6 months
15:11:05 <JayF> that screams "jay didn't retire this with the rest" :(
15:11:09 <rpittau> then they can get pulverized
15:11:14 <JayF> ack; that works for me
15:11:23 <JayF> So the other CI-related item we have is
15:11:27 <dtantsur> yeah, 6 months matches my recollection
15:11:38 <JayF> After the chat in IRC last week about janders's change not getting tested on real hardware
15:11:52 <JayF> I reached out to the HPE team, they claim to have fixed HPE Third Party CI.
15:12:06 <JayF> https://review.opendev.org/c/openstack/ironic/+/889750 is one of their examples and has a run on it
15:12:14 <JayF> #note HPE Third Party CI is functioning again.
15:12:20 <dtantsur> \o/
15:12:25 <rpittau> nice
15:13:02 <JayF> Is there anything general about CI we need to speak about?
15:13:16 <JayF> I think other than the endless sqlite locking battles we've been cleaner than usual?
15:13:41 <rpittau> the metalsmith src jobs are busted at the moment
15:13:45 <rpittau> this impacts ipa CI
15:14:19 <JayF> ack
15:14:20 <rpittau> I think it depends on the fact that they still use focal
15:14:26 <rpittau> so proposed https://review.opendev.org/c/openstack/metalsmith/+/892146
15:14:29 <JayF> that makes sense to me, and is a forced migration anyway
15:14:39 <JayF> we shouldn't release "B" with it on focal anyway
15:14:43 <rpittau> yep
15:15:04 <JayF> Thanks for looking into that.
15:15:09 <JayF> Are there any other outstanding CI items?
15:15:09 <rpittau> also CS9 jobs in metalsmith -> https://review.opendev.org/c/openstack/metalsmith/+/869374
15:15:26 <JayF> landed that one just now
15:15:34 <rpittau> great, thanks!
15:15:40 <JayF> Do we have a userbase for metalsmith?
15:15:49 <JayF> I feel like I only ever hear about it when CI is broken
15:16:07 <JayF> I assume CERN since arne_wiebalck has some activity on it?
15:16:14 <arne_wiebalck> nope
15:16:35 <rpittau> mmm not sure, maybe TheJulia or hjensas know ?
15:16:36 <arne_wiebalck> I just ran into it since zuul is not happy with my raid rebuild patch
15:16:44 <JayF> Interesting. OK. Maybe I'll add that to PTG topics.
15:16:57 <TheJulia> sorry what might I know?
15:17:04 <JayF> TheJulia: if we have any known users of metalsmith
15:17:18 <TheJulia> Just RHOSP
15:17:20 <TheJulia> afaik
15:17:32 <dtantsur> When I created it, I hoped that people would start using it just in general, as a handy CLI
15:17:44 <dtantsur> Maybe I was naive, and we need something equivalent in ironicclient (and the backing API)...
15:17:52 <JayF> RHOSP is a pretty big user of it then :D
15:18:06 <JayF> dtantsur: it's possible for all those things to be true at the same time :D
15:18:24 <JayF> dtantsur: metalsmith can blaze a trail; we can use that to figure out how to make it work in primary clients/apis
15:18:45 <dtantsur> the planned but never implemented Deployment API was the next logical step for me
15:18:49 <JayF> that's good to know though, I just want someone to have a use case for it, RHOSP totally counts
15:18:58 <JayF> dtantsur: maybe toss that on PTG topics and we can res it?
15:19:12 <JayF> nobody is going to make time to do it if we don't talk about it and hype it up
15:19:14 <TheJulia> I added Metalsmith to the ptg topic list
15:19:16 <JayF> I can be your hype man dtantsur
15:19:17 <dtantsur> I don't see a point in that. We've had these discussions over and over again.
15:19:19 <dtantsur> heh
15:19:28 <dtantsur> Until someone has a vested interest, it just does not happen...
15:19:33 <TheJulia> Problem is, at least in my circles, it gets viewed as this "alternative to ironic" or "replacement of ironic"
15:19:42 <TheJulia> and people don't really grok that it is just a client
15:19:44 <dtantsur> lol
15:19:50 <TheJulia> yeah
15:20:03 <JayF> Maybe the answer from PTG is gonna be to make better docs out of it :)
15:20:18 <JayF> I was talking to kubajj this morning about how doing non-nova Ironic deploys is not very intuitive
15:20:23 <TheJulia> I think the real issue is tons of people don't know how to actually *use* ironic
15:20:25 <dtantsur> Or decide how we can decompose metalsmith into smaller bits and gradually merge
15:20:28 <JayF> and AFAICT we lack a directive doc on how to do it exactly
15:20:29 <TheJulia> even though there are videos, pages, everything else
15:20:29 <JayF> TheJulia: YES
15:20:35 <dtantsur> OH YEAH
15:20:37 <dtantsur> instance_info anyone?
15:20:39 <TheJulia> almost like we need a class
15:20:47 <dtantsur> Who does not like JSON fields without schema validation?
15:21:09 <TheJulia> most people I talk to don't even think along those lines, it is a big scary thing they just don't understand in general
15:21:19 <dtantsur> If the most basic thing Ironic is doing needs to be taught... we're losing :(
15:21:41 <JayF> Well, we don't always like to frame Ironic this way
15:21:49 <JayF> but Ironic deployments are super easy to do... if you have nova in front
15:21:58 <TheJulia> it might just be resistance to information because they have no need to touch it because that is not their primary role
15:22:02 <dtantsur> By the way, the outreachy season is coming. If we have an easy win, we can try proposing it.
15:22:06 <JayF> this is a secondary use case and we've always treated it as a secondary use case, whether that's right/wrong/etc
15:23:03 <JayF> dtantsur: I will have an MLH intern around that season too
15:23:14 <JayF> dtantsur: but we'd need rough docs to be able to get an intern to curate it into not-terrible docs
15:23:32 <dtantsur> We have rough docs, no?
15:23:45 <JayF> dtantsur: I'll note: incoming interns is also why I'm working on contributor guide updates now (probably Tues you'll see a post with a radical improvement on our ironic-in-devstack docs)
15:23:52 <JayF> dtantsur: if so I couldn't find them in 10 minutes while working w/ kuba
15:24:12 <dtantsur> is https://docs.openstack.org/ironic/latest/user/index.html what you mean?
15:24:22 <dtantsur> particularly, https://docs.openstack.org/ironic/latest/user/deploy.html
15:24:42 <JayF> that is exactly the doc I was looking for earlier
15:24:46 <JayF> dtantsur+++++
15:24:55 <JayF> kubajj: ^^ fyi I'll also slack the link to you
15:25:11 <dtantsur> I've spent quite some time on this document, but I'm sure it can be improved much further
15:25:19 <dtantsur> especially the configdrive explanation is lacking
15:25:43 <JayF> yeah I got feedback about this stuff being confusing from a lost-potential-user the other day too
15:25:48 <kubajj> JayF: I was reading this in the morning but got some error, thought it might be just bifrost
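For reference, a minimal standalone (non-nova) deploy along the lines of that user guide might look roughly like this. It's only a sketch: the node name, image URL, checksum and config drive contents are placeholders, and it assumes a whole-disk image served over HTTP.

    # Sketch only: point the node at an image via instance_info, then deploy.
    openstack baremetal node set NODE \
        --instance-info image_source=http://example.com/images/cirros.qcow2 \
        --instance-info image_checksum=0ba1bbf3a5852657f3aa2c1405090996
    # Optional sanity check before deploying.
    openstack baremetal node validate NODE
    # Deploy, passing a config drive so the instance gets metadata at first boot.
    openstack baremetal node deploy NODE \
        --config-drive '{"meta_data": {"hostname": "example-node"}}'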
15:25:59 * TheJulia wonders what is the attention span window we should be focusing on
15:26:53 <TheJulia> "what is documentation in the tiktok generation" might be another way of framing that mental musing
15:27:00 <dtantsur> heh
15:27:01 <rpittau> lol
15:27:18 <JayF> I think that is based on a flawed premise: we're not doing a good job of making docs discoverable for the altavista generation either ;)
15:27:23 <dtantsur> We have a lot of vague concepts. Like instance_info itself.
15:27:24 <rpittau> should we do a "bare metal deployment dance" ?
15:27:37 <JayF> we have a large number of docs, it's borderline impossible to know which one you need
15:27:53 <JayF> and the vague concepts like dtantsur points out makes it hard to know what to search for
15:28:25 <dtantsur> Well, dunno. I think "User Guide" is a pretty natural place to look
15:28:30 * JayF was convinced at SCALE20x by a librarian that we need one
15:28:38 <dtantsur> I'd be more worried that people run away screaming after reading the Installation Guide :D
15:29:31 <JayF> I'm going to add a note at PTG about this, maybe we can take a swing at it or at least think about it in the intervening time
15:29:57 <JayF> well Julia beat me to it, but it's on that doc :D
15:30:02 <JayF> moving on
15:30:10 <JayF> #topic Review ongoing 2023.2 workstreams
15:30:20 <TheJulia> doc topic added to ptg etherpad
15:30:26 <JayF> #link https://etherpad.opendev.org/p/IronicWorkstreams2023.2
15:31:17 <JayF> It's too early to fully declare victory
15:31:23 <JayF> but this has been a crazy productive cycle it seems
15:31:30 <JayF> so much impactful stuff landing and in progress
15:32:51 <TheJulia> Where are we at on the nova side of shards key usage?
15:33:14 <JayF> testing and positive feedback to johnthetubaguy
15:33:23 <JayF> then I think he goes begging for reviews
15:33:34 <JayF> I was struggling with devstack but got it working, so I should have an env to test that in this week
15:34:11 <JayF> TheJulia: unless you have time and want to dedicate it to this, let me commit to doing that test on Tues
15:34:22 <JayF> TheJulia: then we can free that up for John and hopefully he lands it
15:35:36 <JayF> Any other questions/comments/discussions on in-progress work streams?
15:35:48 <TheJulia> I can re-review, I think the last time I looked at the code I had high confidence in it
15:35:52 <JayF> same
15:35:57 <JayF> I just want to actually test it
15:36:02 <TheJulia> Since it is all well walked pattern changes
15:36:25 <TheJulia> lets sync after the meeting on it
15:36:32 <JayF> ack, going to move on
15:36:39 <JayF> Nothing listed for RFE Review; skipping that section.
15:36:46 <JayF> #topic Open Discussion
15:36:50 <JayF> I had one item for here:
15:37:17 <JayF> PTL and TC nominations are open. I strongly encourage Ironic contributors to run for PTL and/or TC. If you're interested in being PTL talk to me.
15:37:31 <JayF> If nobody else has self-nominated for PTL by midweek, I will re-nominate myself for a third term.
15:38:08 <JayF> That's all I had, just wanted to draw attention there.
15:38:13 <JayF> Anything else for open discussion?
15:38:14 <dtantsur> Democracy is good, letting Jay get a break is even better!
15:38:17 <dtantsur> Go people go!
15:38:21 <dtantsur> So, yes, one funny bug
15:38:31 <dtantsur> https://bugs.launchpad.net/ironic/+bug/2032377 was brought to my attention by my fellow operator
15:38:41 <JayF> Eh, I don't mind being PTL tbh; I just appreciate that we cycle leadership and don't wanna break tradition :)
15:38:50 <dtantsur> it's stupidly simple, but I have no idea how to work around it cleanly
15:39:33 <JayF> we can't really leave cloud-init AND glean installed on the same IPA image
15:39:37 <JayF> that's the bug there, yeah?
15:39:48 <dtantsur> nope
15:40:06 <TheJulia> it *sounds* like the image has a pre-baked config drive
15:40:08 <TheJulia> and we don't find it
15:40:10 <dtantsur> Imagine we're cleaning a node that had a configdrive. And IPA has a configdrive.
15:40:11 <TheJulia> and we create a new one
15:40:14 <TheJulia> and *boom*
15:40:25 <TheJulia> oh
15:40:31 <JayF> dtantsur: ah, in IPA world we should never, ever respect config on disk
15:40:34 <JayF> that is potentially a security bug
15:40:40 <dtantsur> Right. But we do.
15:41:35 <TheJulia> so when the ramdisk boots, it finds/attaches the config drive data embedded in the ISO
15:41:35 <JayF> we'd almost need glean to have an option to filter block devices and/or look for a different label
15:41:43 <TheJulia> and doesn't unmount it for operations it sounds like?
15:42:12 <dtantsur> TheJulia: still simpler. Glean is looking for a configdrive. There are two: the one it should use (in the CD) and the old one on disk.
15:42:25 <TheJulia> the one on the disk shouldn't be a block device...
15:42:25 <TheJulia> yet
15:42:28 <TheJulia> wut
15:42:29 <JayF> for cleaning
15:42:35 <JayF> yeah?
15:42:37 <dtantsur> TheJulia: how can it be NOT a block device?
15:42:50 <TheJulia> it would need to be attached to the loopback to become a device
15:42:50 <dtantsur> JayF: cleaning is the biggest problem; after the cleaning the rogue partition will be gone.
15:42:52 <JayF> or on a not-cleaned device doing a second deploy
15:43:16 <TheJulia> we need to look at the simple-init code and if we can get a ramdisk log that would be super helpful
15:43:19 <dtantsur> TheJulia: you're talking about a file; configdrive is a partition (on disk) or a whole device (CD)
15:43:36 <TheJulia> OH
15:43:41 <TheJulia> the whole CD is labeled config-2
15:43:42 <TheJulia> wut
15:43:50 <dtantsur> yep, that's how DHCP-less works in Ironic
15:43:52 <JayF> that is how ISO-based configdrives work in VM
15:43:56 <JayF> as well
15:44:34 <TheJulia> I didn't realize that was how vmedia ramdisk worked
15:44:45 <dtantsur> TheJulia: not always, only when Node.network_data is used
15:44:52 <TheJulia> OH
15:45:05 <TheJulia> wheeeeeeeeeeeeee
15:46:22 <JayF> Yeah, I agree that's a nasty bug.
15:46:37 <JayF> I agree I don't know how we fix it without changes in glean.
15:46:44 <JayF> AND service steps will increase the scope of the bug
15:47:35 <dtantsur> yeah. maybe we should talk to ianw, I can drop him an email
15:47:42 <TheJulia> I think we don't have enough information to fully comprehend the bug, since if they have a pre-existing configuration drive and one based on the image itself, it is sort of a case we never expected
15:47:46 <TheJulia> ++
15:48:11 <JayF> even like glean-use-part-uuid=AAAA-BBBB-CCCC-DDDD
15:48:12 <TheJulia> since we expected the node to be cleaned, but this is an instance as a ramdisk with its own config drive
15:48:15 <JayF> in the kernel command line
15:48:22 <dtantsur> TheJulia: we do actually. IPA has a configdrive because it's how DHCP-less works. The disk has a configdrive because it's cleaning after deployment.
15:48:23 <JayF> just some way for Ironic to signal to glean "use this one"
15:48:40 <JayF> yeah that's why the bug is tricky; both configdrives are valid just not in the same context
15:48:56 <TheJulia> yeah, but we know it at that point, they are doing it themselves outside of cleaning, which is how I'm reading the bug
15:49:11 <dtantsur> mmm?
15:49:52 <TheJulia> hmm
15:49:59 <TheJulia> we really need to talk with the filer of the bug and ask questions
15:50:13 <dtantsur> I talk to him all the time but I don't know which questions you have in mind
15:50:32 <dtantsur> We know what is going on, we don't know how to fix it
15:50:39 <TheJulia> Well, I'm confused on what exactly they are doing
15:50:56 <dtantsur> 1) Normal deployment with DHCP-less IPA; 2) Instance tear down; 3) Boom
15:51:08 <dtantsur> 3) *Boom with DHCP-less cleaning
15:51:17 <JayF> Configure node.network_data. Deploy the node. Clean the node. At clean time both the original deployed configdrive AND the IPA-node.network_data configdrive exists.
15:51:53 <TheJulia> so, our ipa ramdisk should have the smarts to know where to boot from, and not hunt through the OS disks when booted that way
15:52:05 <TheJulia> *should* being the keyword there
15:52:07 <JayF> yep, that's what dtantsur and I were talking about with glean
15:52:17 <JayF> b/c we'll need to tell glean exactly which partition to use, to help it dedupe
15:52:27 <TheJulia> ... or just explicitly run it
15:52:31 <dtantsur> for contrast, that's the current glean's logic: https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-early.sh#L34-L38
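In other words, the probe is label-only. A rough paraphrase of that glean-early.sh logic (not the verbatim script, just the shape of the label-based search):

    # Paraphrase only: glean-early looks for any block device labelled
    # "config-2" and mounts the first match, so both the IPA vmedia CD and a
    # leftover on-disk configdrive partition qualify.
    config_dev=$(blkid -t LABEL=config-2 -o device | head -n1)
    if [ -n "$config_dev" ]; then
        mkdir -p /mnt/config
        mount -o ro "$config_dev" /mnt/config
    fi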
15:52:49 <opendevreview> Merged openstack/bifrost master: Remove Fedora from the CI  https://review.opendev.org/c/openstack/bifrost/+/892123
15:53:18 <TheJulia> yeah, that is semi-problematic
15:53:27 <dtantsur> TheJulia: running Glean manually is non-trivial since it gets triggered from udev
15:53:51 <TheJulia> yeah
15:54:31 <JayF> I mean, couldn't we do something like:
15:54:40 <JayF> 1) Disable glean from autorun, via udev and everything else
15:55:03 <JayF> 2) On IPA startup, look for ipa-network-data=blah and if it exists, do some mounting then run LN 53 from glean-early.sh?
15:55:07 <TheJulia> glean uses integrated network interface enumeration
15:55:15 <JayF> so you can't run it late?
15:55:19 <TheJulia> not really
15:55:28 <JayF> yeah, then we need glean to get some sorta hint
15:55:35 <JayF> that says "no really, this configdrive"
15:55:43 <dtantsur> You probably can, but it's going to happen quite late
15:55:50 <dtantsur> e.g. currently IPA is After:network-online
15:55:59 <JayF> good point
15:56:13 <JayF> we'd basically have to write a separate unit, at which point we've reinvented the wheel
15:56:14 <dtantsur> We could have our own service that goes Before
15:56:16 <dtantsur> right
15:56:34 * JayF votes for kernel cli or on-disk glean config that points it explicitly at a partition uuid
15:56:53 <JayF> since we should know the partition uuid at create time, yeah?
15:56:58 <dtantsur> okay, lemme talk to Ian, maybe he has an opinion too
15:57:04 <dtantsur> JayF: we can use /dev/sr0 really..
15:57:13 <JayF> dtantsur: that is going to potentially vary based on hardware
15:57:18 <dtantsur> also true
15:57:20 <JayF> dtantsur: which is why I'd prefer a uuid-based approach
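To make the idea concrete, a purely hypothetical sketch of the ramdisk side of such a hint; glean-use-part-uuid is just the name floated above, not an existing glean or kernel option:

    # Hypothetical sketch: read a partition-UUID hint from the kernel command
    # line and mount exactly that device instead of probing by label.
    hint=$(tr ' ' '\n' < /proc/cmdline | sed -n 's/^glean-use-part-uuid=//p')
    if [ -n "$hint" ]; then
        dev=$(blkid -t PARTUUID="$hint" -o device)
        [ -n "$dev" ] && mount -o ro "$dev" /mnt/config
    fi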
15:57:42 <JayF> Aight, we have 3 minutes left
15:57:46 <JayF> any items remain for open discussion?
15:57:59 <kubajj> I have a quick question regarding the docs. Is there a preferred location where I should describe the hierarchy of kernel/ramdisk parameters? I did not find the current state described anywhere.
15:58:23 <clarkb> you can manually invoke the glean script with the right network info
15:58:30 <clarkb> that's all the udev/systemd integration does
15:58:52 <JayF> kubajj:  doc/source/install/configure-glance-images.rst
15:59:01 <JayF> kubajj: from a cursory running of rg deploy_ramdisk doc/
15:59:03 <clarkb> you could also potentially use fancier udev rules to do what you want. udev is magic but also indecipherable
15:59:06 <dtantsur> clarkb: possibly, but then we need to stop using simple-init
15:59:49 <dtantsur> .. which may not be a terrible thing because then we can include whatever we invent in all ramdisks (currently it's opt-in)
15:59:52 * JayF hears the bells ring for the top of the hour
15:59:54 <JayF> #endmeeting