16:00:02 <bauzas> #startmeeting nova
16:00:02 <opendevmeet> Meeting started Tue May  2 16:00:02 2023 UTC and is due to finish in 60 minutes.  The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:02 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:02 <opendevmeet> The meeting name has been set to 'nova'
16:00:11 <bauzas> welcome everyone
16:00:19 <elodilles> o/
16:00:25 <bauzas> #link https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
16:00:42 <gmann> o/
16:00:50 <gibi> o/
16:00:54 <gibi> (on a spotty connection)
16:01:19 <dansmith> bauzas: got it
16:02:02 <bauzas> coolio, let's start
16:02:11 <bauzas> #topic Bugs (stuck/critical)
16:02:21 <bauzas> #info One Critical bug
16:02:30 <bauzas> #link https://bugs.launchpad.net/nova/+bug/2012993
16:02:35 <bauzas> shall be quickly reverted
16:02:45 <bauzas> and I'll propose backports as soon as it lands
16:03:25 <bauzas> not sure we need to discuss it right now
16:03:38 <bauzas> so we can move on, unless people wanna know more about it
16:04:07 <sean-k-mooney> i have +w'ed it so its on its way
16:04:21 <bauzas> yup, and as you wrote, two backports are planned
16:04:26 <bauzas> down to 2023.1 and Zed
16:04:45 <bauzas> anyway, moving on
16:04:56 <bauzas> #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New 20 new untriaged bugs (+3 since the last meeting)
16:05:16 <bauzas> artom: I assume you didn't have a lot of time for going through those bugs ?
16:05:29 <bauzas> again, no worries
16:05:32 <Uggla> o/
16:05:53 <bauzas> #info Add yourself in the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster
16:06:08 <bauzas> Uggla: you're the next on the list, happy to axe some of the bugs ?
16:06:19 <artom> bauzas, trying to do a few now
16:06:28 <bauzas> cool, no rush
16:06:36 <Uggla> bauzas, yep ok for me.
16:06:43 <bauzas> thanks
16:06:45 <artom> For https://bugs.launchpad.net/nova/+bug/2018172 I think we need kashyap to weigh in, 'virtio' may not be a thing in RHEL 9.1?
16:06:51 <bauzas> #info bug baton is being passed to Uggla
16:06:55 <bauzas> artom: looking
16:07:01 <dansmith> uh, what
16:07:03 <artom> Or maybe just close immediately and direct them to a RHEL... thing
16:07:07 <kashyap> artom: Wait; it's impossible "virtio" can't be a thing in RHEL9.1.  /me looks
16:07:21 <dansmith> hah, yeah
16:07:27 <dansmith> but looks like maybe just a particular virtio video model/
16:07:30 <bauzas> that's a libvirt exception
16:07:42 <bauzas> so, IMHO invalid
16:07:50 <bauzas> (for the project)
16:08:23 <artom> Right, but before that I'd like to confirm that Nova is doing the right thing
16:08:46 <dansmith> is the video model an extra spec/
16:08:50 <opendevreview> Rafael Weingartner proposed openstack/nova master: Nova to honor "cross_az_attach" during server(VM) migrations  https://review.opendev.org/c/openstack/nova/+/864760
16:08:57 * dansmith curses his ? key
16:09:10 <artom> I guess I'm going about it backwards - it's just an easier thing to check initially, RHEL 9.1 support for virtio video
16:09:12 <artom> dansmith, yeah
16:09:14 <sean-k-mooney> its an image property
16:09:19 <bauzas> yeah
16:09:41 <dansmith> is virtio the only option there or is it like virtio plus a device model name/
16:09:47 <sean-k-mooney> virtio should be a thing in rhel9
16:09:48 <bauzas> we just generate a domain that turns out to be invalid given the image metadata
16:09:49 <dansmith> just wondering if they're specifying the wrong thing
16:09:56 <sean-k-mooney> technically it maps to virtio-gpu
16:10:05 <bauzas> that rings a bell to me
16:10:39 <bauzas> anyway, I'd propose to punt on this bug today by asking the reporter for their flavor and image
16:11:02 <dansmith> why are we even discussing this on the meeting?
16:11:43 <bauzas> I guess because artom wanted to triage it
16:11:48 <bauzas> so please => incomplete it
16:11:49 <artom> I just wanted to ping kashyap on it, is all
16:11:49 <sean-k-mooney> i guess it was a bug artom thought should be brought to our attention based on the triage
16:11:57 <bauzas> and ask for the image metadata
16:12:11 <kashyap> artom: We can sort it out off-meeting :) Thx!
16:12:25 <sean-k-mooney> artom: i have been using hw_video_model=virtio for years so its valid for sure
16:12:48 <sean-k-mooney> but ya we can move on
16:12:56 <kashyap> (Yeah; plain "virtio" must work)
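For context, `hw_video_model` is a Glance image property that the nova libvirt driver maps to the guest's video device; with `hw_video_model=virtio` the generated domain should carry a virtio video model, roughly like the hypothetical, trimmed fragment below (exact attributes vary by libvirt/QEMU version).

```python
# Illustrative only: a trimmed, hypothetical <video> element of the kind the
# libvirt driver emits for hw_video_model=virtio; the reporter's libvirt
# apparently rejected a device along these lines.
import xml.etree.ElementTree as ET

video_fragment = "<video><model type='virtio'/></video>"
assert ET.fromstring(video_fragment).find('model').get('type') == 'virtio'
```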
16:13:20 <bauzas> moving on so
16:13:36 <kashyap> Is it still on-topic to discuss a CirrOS seg-fault that dansmith pointed out earlier?
16:13:41 <bauzas> #topic Gate status
16:13:48 <bauzas> #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs
16:13:53 <bauzas> #link https://etherpad.opendev.org/p/nova-ci-failures
16:14:04 <bauzas> it's time to discuss gate failures now
16:14:26 <bauzas> dansmith: shoot the good news and the bad ones if you have any
16:14:49 <dansmith> tldr is that I have got the ceph job moved to jammy and cephadm and it's working finally, and:
16:15:10 <dansmith> that I found a bunch of places where we thought we were requiring SSHABLE before volume activities but it was being silently ignored
16:15:17 <dansmith> basically what gibi asserted
16:15:28 <dansmith> so I've got a stack of tempest patches (and cinder-tempest-plugin) to not only fix those,
16:15:51 <dansmith> but also sanity check and raise an error if someone asks for SSHABLE but without the requisite extra stuff to actually honor it
16:16:13 <dansmith> that helps us pass the ceph job in the new config but will also likely help improve volume failures in the other jobs
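A minimal sketch of the kind of sanity check being described; the helper name and signature below are invented for illustration, the real changes are in the tempest and cinder-tempest-plugin patches linked further down.

```python
# Rough sketch, not the actual tempest code: fail loudly when a test asks to
# wait for SSHABLE without the validation pieces needed to actually SSH in,
# instead of silently skipping the check.
def check_sshable_request(wait_until, validatable, validation_resources):
    if wait_until == 'SSHABLE' and not (validatable and validation_resources):
        raise ValueError(
            "wait_until='SSHABLE' requested without validatable=True and "
            "validation resources; the SSH check would be silently ignored")
    return True
```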
16:16:37 <bauzas> cool
16:16:43 <bauzas> thanks for the janitorisation
16:16:57 <gibi> dansmith: awesome
16:17:06 <gibi> thank you
16:17:16 <bauzas> which patches shall be looked at for people who care ?
16:17:24 <dansmith> it's been a lot of work and frustration, but happy to see it becoming fruitful
16:17:36 <dansmith> well, no patches to nova, but I can get you a link, hang on
16:17:56 <gmann> this one #link https://review.opendev.org/q/topic:sshable-volume-tests
16:18:00 <dansmith> this stack on tempest: https://review.opendev.org/c/openstack/tempest/+/881925
16:18:14 <dansmith> and this one against cinder-tempest https://review.opendev.org/c/openstack/cinder-tempest-plugin/+/881764
16:18:30 <dansmith> the nova patch I have up is a DNM just to test our job with that full stack, but we don't need to merge anything
16:18:55 <bauzas> #link https://review.opendev.org/q/topic:sshable-volume-tests Tempest patches that add more ssh checks for volume-related tests
16:19:11 <bauzas> cool excellent, thanks a lot dansmith for this hard work
16:19:26 * dansmith nods
16:19:32 <bauzas> that's noted.
16:19:50 <bauzas> any other CI failure to relate or mention ?
16:20:22 <bauzas> looks not, excellent
16:20:37 <bauzas> #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&pipeline=periodic-weekly Nova&Placement periodic jobs status
16:21:08 <bauzas> the fips job run had a node failure, but should be green next week
16:22:11 <bauzas> on a side note, I'm trying to add some kind of specific vgpu job that would use the mtty sample framework for validating the usage we have on a periodic basis
16:22:55 <bauzas> #info Please look at the gate failures and file a bug report with the gate-failure tag.
16:23:01 <bauzas> #info STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-test-failures
16:23:06 <bauzas> that's it for me on that topic
16:23:08 <bauzas> moving on
16:23:15 <bauzas> #topic Release Planning
16:23:19 <bauzas> #link https://releases.openstack.org/bobcat/schedule.html
16:23:31 <bauzas> #info Nova deadlines are set in the above schedule
16:23:37 <bauzas> #info Bobcat-1 is in 1 week
16:23:56 <bauzas> #info Next Tuesday 9th is stable branches review day https://releases.openstack.org/bobcat/schedule.html#b-nova-stable-review-day
16:24:24 <bauzas> I'll communicate on that review day by emailing -discuss
16:24:46 <bauzas> but let's discuss that more in the stable topic we have later
16:25:04 <bauzas> #topic Review priorities
16:25:04 <bauzas> #link https://review.opendev.org/q/status:open+(project:openstack/nova+OR+project:openstack/placement+OR+project:openstack/os-traits+OR+project:openstack/os-resource-classes+OR+project:openstack/os-vif+OR+project:openstack/python-novaclient+OR+project:openstack/osc-placement)+(label:Review-Priority%252B1+OR+label:Review-Priority%252B2)
16:25:06 <bauzas> #info As a reminder, cores eager to review changes can +1 to indicate their interest, +2 for committing to the review
16:25:11 <bauzas> #topic Stable Branches
16:25:16 <bauzas> elodilles: go for it
16:25:23 <elodilles> #info stable nova versions were released for zed (26.1.1) and for yoga (25.1.1) last week
16:25:29 <elodilles> otherwise not much happened
16:25:35 <bauzas> right
16:25:38 <elodilles> #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci
16:25:50 <elodilles> that's all from me
16:27:05 <bauzas> excellent thanks
16:27:30 <bauzas> and as I said just before, we should hopefully get through some stable patches next week
16:27:36 <elodilles> \o/
16:28:02 <bauzas> #topic Open discussion
16:28:11 <bauzas> (enriquetaso) Discuss the blueprint: NFS Encryption Support for qemu
16:28:14 <bauzas> enriquetaso: shoot
16:28:18 <enriquetaso> hi
16:28:21 <enriquetaso> sure
16:28:30 <enriquetaso> As discussed at the PTG a couple of weeks ago, I've proposed the blueprint.
16:28:42 <enriquetaso> #link https://blueprints.launchpad.net/nova/+spec/nfs-encryption-support
16:28:57 <enriquetaso> Summary: Cinder is working on supporting encryption on NFS volumes. To do this, the NFS driver uses LUKS inside qcow2.
16:29:10 <enriquetaso> This affects Nova because Nova cannot handle qemu + LUKS inside qcow2 disk format at the moment.
16:29:25 <enriquetaso> What are your thoughts?
16:29:35 <enriquetaso> should I mention that rolling upgrades would be a problem in the bp ?
16:30:20 <bauzas> lemme reopen the ptg notes
16:30:50 <bauzas> right
16:31:51 <bauzas> we basically said we were quite okay with the proposed design but we were wondering if it was worth not scheduling to old computes
16:32:03 <bauzas> we have three options here :
16:32:23 <bauzas> 1/ avoid scheduling to old computes (by adding a prefilter)
16:33:07 <bauzas> 2/ prevent this feature at the API level by checking the compute service versions
16:33:59 <bauzas> 3/ do some compute check that would prevent the volume from being encrypted under some preconditions
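As a rough illustration of option 2 (the constant and helper name here are made up, assuming nova's usual minimum-service-version pattern with `get_minimum_version_all_cells`):

```python
# Hypothetical sketch of option 2: refuse the operation at the API level until
# every nova-compute in the deployment reports a new enough service version.
from nova import objects

MIN_COMPUTE_NFS_LUKS = 68  # placeholder; the real number is assigned at merge time


def assert_nfs_encryption_supported(context):
    minver = objects.service.get_minimum_version_all_cells(
        context, ['nova-compute'])
    if minver < MIN_COMPUTE_NFS_LUKS:
        raise RuntimeError('encrypted NFS volumes require all computes '
                           'to be upgraded first')
```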
16:34:36 <bauzas> #2 seems a bit harsh to me
16:35:02 <sean-k-mooney> didn't we discuss adding a trait
16:35:07 <bauzas> we did it
16:35:09 <sean-k-mooney> so 1
16:35:21 <bauzas> without having full quorum, hence me restating the options
16:35:34 <sean-k-mooney> well i vote 1 or file a spec
16:36:04 <bauzas> then I tend to say option #1 and a specless blueprint, which is basically the direction we went down
16:36:12 <gibi> I'm OK with 1/
16:36:16 <sean-k-mooney> because if we dont just use a trait/prefilter then i think we need a spec to explain why that is not sufficient and describe it in detail
16:37:01 <dansmith> so a trait of "this is newer than X"?
16:37:09 <sean-k-mooney> no
16:37:16 <dansmith> that's kinda fundamentally wrong, so you need to expose it as a feature flag
16:37:22 <sean-k-mooney> COMPUTE_something
16:37:36 <bauzas> sean-k-mooney: do we all agree now that we accept a new trait saying something like "I_CAN_ENCRYPT_YOUR-STUFF"
16:37:38 <dansmith> is there any compute config that needs to be enabled? if so, ideally that would control the exposure (or not) of the trait
16:37:38 <sean-k-mooney> to report the hypervisor capability to support LUKS in qcow
16:37:56 <sean-k-mooney> dansmith: i think this just needs to check the qemu/libvirt version
16:38:00 <bauzas> my only concern is the traits inflation but that's a string
16:38:02 <sean-k-mooney> and report it statically if we are above that
16:38:09 <sean-k-mooney> i dont think we need a config option
16:38:17 <bauzas> sean-k-mooney: that's why I was considering option 3
16:38:35 <dansmith> sean-k-mooney: yeah, that's just a little annoying I think
16:38:39 <bauzas> which would be "meh dude, you don't have what I need, I'll just do what I can do"
16:38:45 <sean-k-mooney> bauzas: i dont think optional encryption is acceptable
16:38:47 <dansmith> because it becomes not as much a feature flag but a shadow version number
16:39:22 <bauzas> sean-k-mooney: true
16:39:28 <sean-k-mooney> if you asked for the storage to be encrypted we either need to do it or raise an error
16:40:05 <bauzas> sounds then reasonable to ERROR the instance
16:40:12 <enriquetaso> what does a `new trait` involve?
16:40:18 <sean-k-mooney> dansmith: im not against a min compute service version check in the api as well by the way
16:40:24 <bauzas> that's option 3
16:40:25 <sean-k-mooney> i dont really think 2 is harsh
16:40:29 <bauzas> s/3/2
16:40:46 <sean-k-mooney> we normally dont enable features until the cloud is fully upgraded
16:40:54 <dansmith> I'm not saying that's how it needs to be, I'm just saying it feels like we're bordering on trait abuse here
16:40:56 <dansmith> so FWIW,
16:41:10 <sean-k-mooney> we have in the past done this on a per compute host basis
16:41:15 <dansmith> we could also have a scheduler filter that requires a service version at or above a number
16:41:20 <sean-k-mooney> but in general i think a min compute version check is preferable
16:41:23 <dansmith> and we could add hints/advice to the scheduler for this sort of thing
16:41:31 <dansmith> which would be nice for this and other things I imagine
16:41:49 <bauzas> yeah this sounds quite a reasonable tradeoff
16:42:03 <sean-k-mooney> this being?
16:42:20 <bauzas> I'm just wondering whether we expose the service version on the scheduling side
16:42:31 <sean-k-mooney> i think you can already schedule based on the compute service version
16:42:45 <sean-k-mooney> with either the json or the compute capabilities filter
16:42:58 <sean-k-mooney> but in any case we want this to work without any configuration required
16:43:20 <sean-k-mooney> bauzas: why not just do 2?
16:43:24 <gibi> I might be missing something, but if this feature needs compute code + libvirt/qemu version then a simple compute version check is not enough
16:43:34 <dansmith> sean-k-mooney: Im saying make it integrated
16:43:44 <dansmith> sean-k-mooney: kinda like a prefilter
16:43:56 <bauzas> wait https://github.com/openstack/nova/blob/master/nova/scheduler/request_filter.py#L399
16:44:17 <dansmith> gibi: yeah, you need service version and libvirt version and qemu version right?
16:44:20 <sean-k-mooney> so if feature x requires min version y, have a prefilter or similar that will request a host with that min version
16:44:25 <bauzas> that's ephemeral encryption
16:44:50 <sean-k-mooney> ya thats separate
16:44:56 <sean-k-mooney> we are talking about encryption for nfs
16:45:09 <sean-k-mooney> for cinder volume backend that use nfs
16:45:10 <dansmith> gibi: that's kinda why I just think this is trait pollution because we end up with all these feature flags for any possible combination of several versions
16:45:24 <dansmith> and especially in three years, that trait is useless as everything exposes it all the time
16:45:55 <dansmith> so I'm not sure what the best plan is here, to be clear, I'm just saying none of these simple things feels right
16:46:07 <sean-k-mooney> dansmith: not really; using a trait we report an abstract capability. but in general i would prefer both a trait and a compute version bump
16:46:43 <dansmith> that's the thing though:
16:47:02 <sean-k-mooney> enriquetaso: how is this feature requested on the volume by the way?
16:47:08 <dansmith> just exposing a "can do nfs encryption" because all the versions are new enough just feels like we could have a thousand of those things
16:47:20 <enriquetaso> when attaching the volume sean-k-mooney
16:47:35 <sean-k-mooney> how exactly? an attribute on the volume?
16:47:44 <sean-k-mooney> enriquetaso: we have two distinct but related problems
16:47:55 <sean-k-mooney> for attachment we know the host and have to check if the host supports this
16:48:03 <sean-k-mooney> for new boots or move operations
16:48:13 <sean-k-mooney> we need to find another host that also supports it
16:48:38 <bauzas> that's why I tend to lean on option 2
16:48:40 <sean-k-mooney> and we need this to work without any operator configuring of aggregates etc.
16:48:46 <bauzas> it's an interim solution
16:48:54 <dansmith> bauzas: me too, but 2 is not enough right?
16:49:07 <sean-k-mooney> option 2 works if and only if our min libvirt/qemu versions are new enough
16:49:10 <dansmith> because you could be running an older libvirt/qemu
16:49:11 <enriquetaso> mmh, the volume attr says it's encrypted and the volume image format is qcow2: https://review.opendev.org/c/openstack/nova/+/854030/6/nova/virt/libvirt/utils.py
16:49:18 <dansmith> and I suspect it also depends on brick versions?
16:49:24 <enriquetaso> not sure I understand the questions
16:49:42 <sean-k-mooney> enriquetaso: i was asking because if we are booting a new vm we would need to look that up
16:49:55 <sean-k-mooney> and then include it as an input to placement and the scheduler in some way
16:50:00 <bauzas> dansmith: hah, true, I was considering an API check, but that would require the libvirt versions, so never mind my foolishness
16:50:24 <bauzas> yeah, so that's a tuple (libvirt, compute) for accepted versions
16:50:26 <sean-k-mooney> well it may or may not, again it depends on whether our min version is above or below the version that introduced it
16:50:44 <bauzas> you need libvirt AND compute versions to be recent enough
16:50:44 <enriquetaso> oh, the volume doesnt have anything special, it's just volume type=nfs and encrypted=true sean-k-mooney
16:51:22 <dansmith> sean-k-mooney: yeah, well, that's a good reason not to just add the compute+libvirt+qemu into a trait IMHO
16:51:35 <sean-k-mooney> dansmith: right which is not what i suggested
16:51:57 <dansmith> I know :)
16:51:57 <sean-k-mooney> i very explicitly said have a trait for the capability to support LUKS in qcow
16:52:06 <gibi> dansmith: on the trait pollution: either we encode a list of versions as a capability on the compute side, or we expose those versions to the scheduler / placement and code up a similar mapping of capability - versions on the scheduler side. We just move around similar logic
16:52:40 <dansmith> gibi: yeah, understand, it's just that traits are supposed to be timeless right?
16:52:47 <bauzas> we have 5 mins to find a solution or defer to a spec, honestly
16:53:00 <dansmith> I'm not saying I know which solution (combination) is best, I'm just saying nothing feels particularly natural to me
16:53:19 <sean-k-mooney> so i was suggesting having the driver check the requirements and report COMPUTE_LUKS_IN_QCOW if it supports it
16:53:50 <sean-k-mooney> and have a prefilter request that if the volume is nfs and qcow and encrypted=true
16:54:01 <gibi> dansmith: we will have a bunch of traits yes, but I don't see what problem that causes. We never remove min compute version checks from the code either
16:54:07 <sean-k-mooney> and additionally have a min compute service check for the rolling upgrade case
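A sketch of that overall proposal, with made-up trait name, version constants, and helpers; real prefilters live in nova/scheduler/request_filter.py (linked above), and the min-compute-service-version check sketched earlier would additionally gate the API during a rolling upgrade.

```python
# Sketch only: illustrative names, not actual nova code.
MIN_LIBVIRT_LUKS_IN_QCOW = (8, 0, 0)   # placeholder minimum versions
MIN_QEMU_LUKS_IN_QCOW = (6, 2, 0)
LUKS_IN_QCOW_TRAIT = 'COMPUTE_LUKS_IN_QCOW'   # hypothetical os-traits entry


def driver_capability_traits(libvirt_version, qemu_version):
    """Driver side: advertise the capability only when the hypervisor supports it."""
    if (libvirt_version >= MIN_LIBVIRT_LUKS_IN_QCOW
            and qemu_version >= MIN_QEMU_LUKS_IN_QCOW):
        return {LUKS_IN_QCOW_TRAIT}
    return set()


def luks_in_qcow_prefilter(volume, required_traits):
    """Scheduler side: require the trait for encrypted qcow2-on-NFS volumes."""
    if (volume.get('driver_volume_type') == 'nfs'
            and volume.get('format') == 'qcow2'
            and volume.get('encrypted')):
        required_traits.add(LUKS_IN_QCOW_TRAIT)
        return True
    return False
```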
16:54:36 <sean-k-mooney> the other thing about traits is they are meant to be virt driver independent
16:55:08 <gibi> LUKS_IN_QCOW does not seem to be libvirt dependent
16:55:13 <sean-k-mooney> to dansmith's timeless point, i.e. if we add one it should be reusable by other virt drivers
16:55:19 <sean-k-mooney> ya i was just thinking that
16:55:25 <sean-k-mooney> its the qcow bit but
16:55:32 <sean-k-mooney> we have that already
16:56:17 <sean-k-mooney> https://github.com/openstack/os-traits/blob/master/os_traits/compute/ephemeral.py#L18 and https://github.com/openstack/os-traits/blob/master/os_traits/compute/image.py#L28
16:56:23 <sean-k-mooney> almost would work together
16:56:43 <sean-k-mooney> but we cant assume that ephemeral encryption support for LUKS means nfs also works
16:56:58 <enriquetaso> LUKS_IN_QCOW is already a config option on Nova gibi ?
16:57:13 <sean-k-mooney> enriquetaso: not really
16:57:23 <sean-k-mooney> or not that im aware of
16:57:35 <sean-k-mooney> in the context of cinder
16:57:39 <gibi> enriquetaso: we are discussing creating a new placement trait to represent if a compute supports luks in qcow
16:57:54 <enriquetaso> gibi++ thanks
16:58:19 <bauzas> I honestly think we need to settle the dust
16:58:49 <bauzas> given the very short time we have left, I hereby propose that enriquetaso create a spec describing the feature
16:58:58 <bauzas> we could then chime in on the upgrade concerns
16:59:06 <enriquetaso> okay u.u
16:59:15 <bauzas> enriquetaso: are you familiar with the spec process ?
16:59:26 <enriquetaso> is it too different from the cinder one?
16:59:39 <enriquetaso> bauzas, do you have a doc? :P
16:59:55 <bauzas> enriquetaso: I can provide you pointers and more than that : guidance
17:00:11 <enriquetaso> sure bauzas
17:00:19 <bauzas> #agreed enriquetaso to provide a spec for this feature
17:00:33 <bauzas> #action bauzas to provide enriquetaso details on the spec process
17:00:40 <bauzas> we're on time
17:00:45 <bauzas> the agenda is done
17:00:56 <bauzas> so thanks all and see you next week
17:01:01 <bauzas> #endmeeting