16:00:08 <bauzas> #startmeeting nova
16:00:08 <opendevmeet> Meeting started Tue May 30 16:00:08 2023 UTC and is due to finish in 60 minutes.  The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:08 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:08 <opendevmeet> The meeting name has been set to 'nova'
16:00:25 <bauzas> #link https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
16:00:28 <bauzas> welcome folks
16:00:41 <elodilles> o/
16:02:04 <dansmith> o/
16:02:30 <bauzas> mmm
16:02:35 <bauzas> doesn't look like a lot of people are around
16:03:04 <bauzas> but we can make it a slow start and hopefully people will join
16:04:15 <bauzas> #topic Bugs (stuck/critical)
16:04:20 <bauzas> #info No Critical bug
16:04:24 <bauzas> #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New 15 new untriaged bugs (+0 since the last meeting)
16:04:34 <bauzas> auniyal: any bug you wanted to discuss ?
16:05:17 <bauzas> looks like he's not around
16:05:20 <bauzas> moving on
16:05:25 <bauzas> #info Add yourself in the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster
16:05:29 <bauzas> #info bug baton is being passed to bauzas
16:05:37 <bauzas> ok, next topic then
16:05:42 <bauzas> #topic Gate status
16:05:46 <bauzas> #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs
16:05:50 <bauzas> #link https://etherpad.opendev.org/p/nova-ci-failures
16:06:09 <bauzas> I'll be honest, gerrit was a bit off my radar since the last week
16:06:31 <bauzas> I plead guilty but I have an OpenInfra prezo to prepare
16:06:46 <bauzas> any CI failures people wanna discuss ?
16:06:50 <dansmith> gate has been not too bad I think
16:07:04 <bauzas> nice
16:07:12 * gibi lurks
16:07:30 <bauzas> the very few gerrit emails I glanced at were indeed positive
16:07:49 <bauzas> let's assume we're living in a quiet world and move on
16:07:51 <Uggla_> o/
16:07:59 <bauzas> #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&pipeline=periodic-weekly Nova&Placement periodic jobs status
16:08:14 <bauzas> the recent failures we've seen on the periodics seem to be resolved ^
16:08:23 <bauzas> all of them are green
16:08:31 <bauzas> so I guess it's fixed
16:08:47 <bauzas> #info Please look at the gate failures and file a bug report with the gate-failure tag.
16:08:51 <bauzas> #info STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-test-failures
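[Editor's note: the linked testing guide asks that a recheck comment identify the suspected failure rather than being a bare "recheck" — for example, something along the lines of "recheck <job name> failed due to bug #NNNNNNN", with the bug filed and tagged gate-failure first. The comment wording here is illustrative, not a required syntax.]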
16:09:09 <bauzas> voila for this topic, anything else to address on that ?
16:09:44 <bauzas> -
16:09:50 <bauzas> #topic Release Planning
16:09:55 <bauzas> #link https://releases.openstack.org/bobcat/schedule.html
16:09:59 <bauzas> #info Nova deadlines are set in the above schedule
16:10:03 <bauzas> #info Nova spec review day next week !
16:10:12 <bauzas> I should write in caps actually
16:10:32 <bauzas> UPLOAD YOUR SPECS, AMEND THEM, TREAT THEM, MAKE THEM READY
16:10:32 <elodilles> yepp, a reminder on that day might help ;)
16:10:48 <bauzas> elodilles: indeed
16:11:11 <bauzas> #action bauzas to notify -discuss@ about the spec review day
16:12:00 <bauzas> I've seen a couple of new specs
16:12:09 <bauzas> I'll slowly make a round in advance if I have time
16:12:25 <bauzas> next topic if nothing else
16:12:27 <bauzas> ,
16:12:42 <bauzas> #topic Vancouver PTG Planning
16:12:51 <bauzas> #info please add your topics and names to the etherpad https://etherpad.opendev.org/p/vancouver-june2023-nova
16:13:09 <bauzas> so, I created an etherpad (actually I reused the one automatically created)
16:13:57 <bauzas> given it will be an exercise to guess who's around and when, I'd love it if people could add their topic ideas to this etherpad and ideally mention their availability
16:14:44 <bauzas> also, if people not able to join wanna bring some topics worth discussing at the PTG, that'd be nice
16:15:22 <bauzas> not sure we'll have a quorum, but I just hope we could somehow try to have some kind of synchronous discussion at the PTG
16:15:47 <bauzas> so we would capture the outcome of such discussions for following them up after the PTG
16:16:03 <bauzas> I won't lie, that'll be a challenge anyway.
16:16:45 <bauzas> to help with that,
16:16:47 <bauzas> #info The table #24 is booked for the whole two days. See the Nova community there !
16:17:32 <bauzas> so, yeah, I should stick around this table for the two days, except when I need to present a breakout session or when I have a Forum session to moderate, or depending on my bathroom needs
16:18:06 <bauzas> (the last one being a transitive result of the number of coffee shots I'll take)
16:18:24 * gibi drops
16:18:26 <bauzas> anyway, the word is passed.
16:18:34 <bauzas> #topic Review priorities
16:18:39 <bauzas> #link https://review.opendev.org/q/status:open+(project:openstack/nova+OR+project:openstack/placement+OR+project:openstack/os-traits+OR+project:openstack/os-resource-classes+OR+project:openstack/os-vif+OR+project:openstack/python-novaclient+OR+project:openstack/osc-placement)+(label:Review-Priority%252B1+OR+label:Review-Priority%252B2)
16:18:43 <bauzas> gibi: \o
16:18:47 <bauzas> #info As a reminder, cores eager to review changes can +1 to indicate their interest, +2 for committing to the review
16:18:53 <bauzas> #topic Stable Branches
16:19:07 <bauzas> elodilles: wanna bring some points ?
16:19:11 <elodilles> o7
16:19:16 <elodilles> #info stable/train is unblocked, as openstacksdk-functional-devstack job is fixed for train
16:19:21 <bauzas> huzzah
16:19:22 <elodilles> \o/
16:19:34 <elodilles> so:
16:19:37 <elodilles> #info stable gates should be OK
16:19:40 <bauzas> that'll somehow change a bit of the next topics we'll discuss
16:19:51 <elodilles> #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci
16:20:03 <elodilles> that's all from me
16:20:04 <bauzas> elodilles: thanks
16:20:11 <bauzas> and yeah, I added a bullet point
16:20:31 <bauzas> but it seems to me we don't have quorum today, so I'll reformulate
16:21:16 <bauzas> based on the ML thread https://lists.openstack.org/pipermail/openstack-discuss/2023-May/033833.html,
16:21:36 <bauzas> do people think it's reasonable to EOL stable/train ?
16:21:59 <dansmith> personally I do
16:22:04 <bauzas> I saw the discussion that happened while I was away, and I wanted to reply this morning
16:22:16 <bauzas> but I preferred to defer any reply after this meeting
16:22:18 <dansmith> the point of keeping those around was so communities could form around them if needed,
16:22:34 <dansmith> but if even we (redhat) aren't keeping it up to date, I think that's a good sign that it should go away
16:23:12 <bauzas> dansmith: well, some folks are continuing to backport a few things to train
16:23:23 <bauzas> and we have some stable cores that do reviews on that branch
16:23:26 <bauzas> but,
16:23:29 <dansmith> but not critical CVEs
16:23:38 <bauzas> as I said in the email, the two critical CVEs don't have a path forward
16:23:56 <bauzas> dansmith: this is unfortunately the problem
16:24:03 <dansmith> okay, I see there's more traffic than I expected
16:24:09 <dansmith> then I guess I don't care
16:24:10 <bauzas> it's not exactly that I'm against backporting the VMDK fix
16:24:25 <dansmith> but I do think it's confusing for people to see a community-maintained repo that is missing such large fixes
16:24:34 <bauzas> but we'll break oslo.utils semver if we do so
16:25:15 <dansmith> yeah, idk, I lean towards EOLing
16:25:21 <bauzas> for the other CVE (the brick one), I'd say I don't know how difficult it would be to propose a backport
16:25:41 <dansmith> the brick fix is not going to be backported AFAIK
16:25:57 <dansmith> and I don't think the cinder one (which is the most important) is being backported past xena
16:26:28 <dansmith> our part of the fix for that CVE is pretty minor and we could backport it, but without the FC part of the brick fix it's not complete
16:26:28 <bauzas> that's why I personally also lean toward EOLing stable/train
16:26:47 <dansmith> only Debian spoke up about/against EOLing, right?
16:26:51 <bauzas> correct
16:27:01 <bauzas> I can reply and explain the problem again
16:27:31 <dansmith> shrug
16:27:49 <dansmith> cinder does still have a train branch
16:27:56 <dansmith> and the last thing there was the VMDK fix
16:28:15 <dansmith> so does brick, but the last thing was last summer
16:28:30 <elodilles> I also stated that we could keep train open as long as the gate is working, though indeed it is unfortunate that CVE fixes don't get backported :/
16:28:36 <dansmith> idk, almost seems like there's not enough people to care to even push us one way or the other :)
16:28:49 <bauzas> I would indeed be more inclined if some other project like cinder made the same move at roughly the same time as us
16:29:20 <bauzas> dansmith: elodilles: honestly I guess I'll propose a patch for -eol and people can -1 if they care
16:29:33 <bauzas> that's probably where we'll capture most of the concerns
16:29:42 <dansmith> ack
16:29:46 <bauzas> or we could go round and round forever
16:29:48 <elodilles> ++
16:30:16 <bauzas> #action bauzas to propose a gerrit change for tagging stein-eol so people could vote on it
16:30:30 <elodilles> you mean train-eol :)
16:30:35 <bauzas> damn
16:30:38 <bauzas> #undo
16:30:38 <opendevmeet> Removing item from minutes: #action bauzas to propose a gerrit change for tagging stein-eol so people could vote on it
16:30:45 <elodilles> stein-eol is long gone for nova ;)
16:30:46 <bauzas> #action bauzas to propose a gerrit change for tagging train-eol so people could vote on it
16:30:54 <bauzas> elodilles: my brain fscked
16:31:13 <bauzas> anyway, I guess we're done on this
16:31:22 <bauzas> #topic Open discussion
16:31:23 <elodilles> ++
16:31:36 <bauzas> geguileo: you had a point :)
16:31:41 <bauzas> (geguileo) Change to os-brick's connect_volume idempotency
16:31:47 <geguileo> bauzas: yes, thanks
16:31:49 <bauzas> #link https://review.opendev.org/c/openstack/os-brick/+/882841
16:31:52 <bauzas> #link https://bugs.launchpad.net/nova/+bug/2020699
16:32:11 <geguileo> So with the latest CVE on SCSI volumes we decided in os-brick to make some changes
16:32:45 <geguileo> Specifically os-brick would remove any existing devices before starting the connect_volume code
16:32:55 <geguileo> and then proceeds with the actual attachment
16:33:09 <geguileo> this means that connect_volume will no longer be idempotent
16:33:17 <geguileo> (which is not something we ever promised it would be)
16:33:26 <geguileo> that seems to break some Nova operations
16:33:57 <geguileo> particularly the rescue/unrescue operations
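[Editor's note: a minimal sketch of the behavior change geguileo describes, assuming the usual os-brick connector shape where connect_volume(connection_properties) returns a dict with a 'path' key; the flow and device paths below are illustrative, not actual nova or os-brick code.]

    # Illustrative only: why a stashed device path becomes unsafe once
    # connect_volume() cleans up pre-existing devices instead of being idempotent.
    def rescue_flow_sketch(connector, connection_properties):
        # Original attachment: nova stashes the returned path, e.g. /dev/sdc.
        original = connector.connect_volume(connection_properties)
        stashed_path = original['path']

        # Rescue: connect_volume() runs again for the same volume. With the new
        # cleanup, any existing device (/dev/sdc) is removed first and the volume
        # may reappear under a different path, e.g. /dev/sda.
        rescue = connector.connect_volume(connection_properties)

        # Unrescue: restoring a stashed guest config that still points at
        # stashed_path would reference a device that no longer exists; only the
        # path from the most recent connect_volume() call is valid.
        return stashed_path, rescue['path']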
16:34:05 <dansmith> is there some way we can check to see if we need to run connect_volume()?
16:34:24 <bauzas> geguileo: thanks for spotting this in advance
16:34:30 <dansmith> because otherwise it's hard to know if we're restarting after a recovery or just a service restart
16:34:39 <dansmith> whether or not we should run that
16:35:08 <bauzas> yeah that's one of the problems
16:35:22 <geguileo> dansmith: that could be solved if Nova didn't stash and then unstash the instance config
16:35:33 <geguileo> and instead rebuilt the instance config
16:35:40 <dansmith> meaning the guest xml?
16:35:41 <geguileo> I believe sean-k-mooney[m] mentioned that
16:35:45 <geguileo> dansmith: yes
16:36:00 <bauzas> geguileo: we also recreate the XML on stop/start fwiw
16:36:11 <dansmith> okay, but if we're running on a readonly-root host or recovering from a disaster, none of that would be available after a restart for us
16:36:27 <dansmith> yeah, the guest XML isn't something we can assume hangs around long-term IMHO
16:36:41 <geguileo> dansmith: could nova know if it's attaching the volume for the recovery instance on the same host that was running the instance?
16:36:42 <dansmith> most of nova is designed to assume that it's temporary
16:36:53 <bauzas> yeah and in general, we don't consider libvirt a source of truth for persisting some instance metadata
16:37:01 <dansmith> bauzas: right
16:38:00 <dansmith> geguileo: for the rescue case specifically? if the instance was already stopped and you go straight to rescue, we wouldn't really know when connect was last run.. could have been weeks ago, with multiple host reboots since
16:38:01 <bauzas> would the spec on dangling volumes help with the problem ?
16:38:17 <dansmith> bauzas: I don't think so
16:38:19 <bauzas> I mean, our BDM/volume reconciliation
16:38:33 <dansmith> that's not the problem, it's the underlying host devices
16:38:48 <bauzas> yeah I remember the context of the CVE :)
16:39:21 <bauzas> and I guess only brick knows whether there is a residue ?
16:39:29 <dansmith> geguileo: so let's say we reboot with one instance stopped that points to /dev/vdf,
16:39:49 <dansmith> then we reboot, and someone spawns a new instance, we call connect_volume() for that new instance, it gets /dev/vdf,
16:40:09 <dansmith> then the user starts the old was-stopped instance, we can't look at the instance xml config and know that the volume is wrong right?
16:40:20 <dansmith> because /dev/vdf exists, but it's no longer relevant for this instance
16:40:55 <geguileo> dansmith: well, Nova could check the ID of the volume and see if it matches the information returned by os-brick
16:41:07 <dansmith> point being, there could be multiple instances with disks that point to stale host devices at any given point..
16:41:09 <geguileo> but I don't know if we want Nova to be on the business of doing those checks
16:41:17 <dansmith> yeah, ideally not
16:41:19 <geguileo> dansmith: yes, that could happen
16:42:05 <bauzas> could nova ask brick such thing ?
16:42:08 <dansmith> geguileo: maybe everywhere we currently do connect_volume() we do a disconnect first and ignore an error? that generates a lot of churn, but maybe that's safer?
16:42:29 <dansmith> bauzas: that's what I was wondering.. if there was some validate_volume() call or something we could run to know if we should run connect
16:42:41 <bauzas> like, before doing connect_volume(), do a check_vol()
16:42:43 <bauzas> ahah
16:42:54 <geguileo> dansmith: os-brick will be doing that already with optimized code
16:43:20 <geguileo> dansmith: that's the issue that I'm bringing up, that the new code that cleans things up breaks the idempotency, so there's no need for nova to do that call you mention
16:43:37 <geguileo> os-brick cannot do a proper check_vol like we would want
16:43:48 <geguileo> because the volume can have changed in the background
16:44:06 <geguileo> for example change the size, point a different volume in the backend (like happened in the CVE)
16:44:26 <dansmith> okay, I'm just not sure how we can know what the right thing to do is
16:44:27 <geguileo> there are a bunch of messes that could happen, and I'm not sure we would want this to depend on nova calling a method
16:45:04 <geguileo> dansmith: well, if you call connect_volume all instances in Nova should use the device it returns from that moment onwards
16:45:10 <bauzas> geguileo: I'm confused, who's responsible for knowing that all connectors are present on the host ?
16:45:17 <bauzas> I thought it was brick
16:45:23 <dansmith> geguileo: okay I thought you didn't want us to do that
16:45:27 <geguileo> what do you mean by connectors?
16:45:48 <bauzas> wrong wording, my bad.
16:45:51 <dansmith> geguileo: or are you just saying any time we call connect_volume() we need to examine the result and make sure the instance is (changed, if necessary) to use that?
16:46:16 <bauzas> geguileo: I'd say the device number
16:46:44 <geguileo> dansmith: what I don't want is nova to call connect_volume, get the value and stash it (/dev/sdc), then call connect_volume again (/dev/sda) to use it in the rescue instance, and then use the stashed value (/dev/sdc) that no longer exists
16:47:05 <geguileo> dansmith: yes, that's it
16:47:18 <dansmith> geguileo: okay
16:47:25 <geguileo> dansmith: if connect_volume is called, make sure that is used in all the instances that use the same cinder volume on that host
16:47:53 <dansmith> sounds like the solution is that we need to call connect_volume every time we're going to start an instance for any reason, and update the XML (rescue or otherwise) to use that value before we start
16:48:43 <geguileo> dansmith: that would be slower but safe
16:49:12 <geguileo> the optimised mechanism where we could tweak Nova to only do that when necessary could lead to unintended bugs
16:49:13 <bauzas> yeah, I was hoping some mechanism that would prevent the unnecessary roundtrip but if that's hard, then...
16:49:18 <dansmith> yeah I just don't know how we could do it safely otherwise, unless there's something that maintains an exclusive threadsafe set of host devices to make sure we never get them confused
16:50:02 <geguileo> dansmith: if the host doesn't reboot and connect_volume is not called again, then we could assume that the device is correct
16:50:13 <bauzas> geguileo: but we don't know this
16:50:27 <bauzas> geguileo: we only know about the service restart
16:50:40 <geguileo> bauzas: could nova query the uptime or something?
16:50:47 <dansmith> no :)
16:51:03 <geguileo> ok, then the whole connect_volume is probably the only way  :'-(
16:51:08 <dansmith> let's not do anything like that for host or service uptime.. I think we just need to do the safe/slow approach
16:51:24 <bauzas> geguileo: we have other host-dependent resources like mdevs that we treat this way without needing to know whether it's a reboot
16:53:08 <geguileo> so I guess that's the solution then?
16:53:37 <dansmith> I think so.. not ideal for sure
16:53:43 <bauzas> in general, we follow some workflow which is "lookup my XML, find the specific resources attached, then query whatever needed to know whether those still exist, and if not, ask whatever else to recreate them"
16:54:08 <bauzas> (at service restart I mean)
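[Editor's note: a rough sketch of the restart reconciliation pattern bauzas describes for host-dependent resources such as mdevs; every helper name below is hypothetical and only illustrates the shape of the workflow.]

    # Hypothetical helpers: on service restart, re-derive host resources from
    # the guest definition instead of trusting that they survived the restart.
    def reconcile_on_restart(guest, host):
        for resource in guest.attached_host_resources():   # e.g. mdev UUIDs found in the XML
            if not host.resource_exists(resource):
                host.recreate_resource(resource)            # re-provision before the guest starts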
16:54:35 <geguileo> with changes to cinder drivers we could have a proper check if it's valid, or have the connect-volume be faster (and idempotent when it can)
16:54:56 <geguileo> they would need to provide extra information to validate the device
16:55:10 <dansmith> let's do the slow/safe thing now, and if connect_volume() can be idempotent and faster in the future, then that's cool
16:55:20 <geguileo> dansmith: sounds like a plan
16:55:48 <geguileo> since that way the newer approach would probably not require additional nova changes
16:55:52 <geguileo> just os-brick and cinder
16:55:55 <dansmith> yep
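[Editor's note: a sketch of the agreed "slow but safe" approach, again assuming connect_volume(connection_properties) returns a dict with a 'path' key; the guest-config helper names are hypothetical, not actual nova driver code.]

    # Always re-run connect_volume() before starting a guest (boot, rescue, or
    # otherwise) and point each disk at the freshly returned device path instead
    # of a previously stashed one.
    def prepare_guest_volumes(connector, guest_config, volume_connections):
        # volume_connections: mapping of volume id -> connection_properties
        for volume_id, conn_props in volume_connections.items():
            device_info = connector.connect_volume(conn_props)
            guest_config.set_disk_source(volume_id, device_info['path'])  # hypothetical helper
        return guest_config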
16:56:59 <geguileo> ok, then I have nothing else to say on the topic
16:57:06 <dansmith> geguileo: unrelated, but since you're here.. what's cinder's plan for stable/train?
16:57:15 <dansmith> I assume no backports of this CVE back that far right?
16:57:23 <dansmith> and are you all thinking about EOLing at some point?
16:57:46 <geguileo> dansmith: definitely no backports that far
16:57:49 <dansmith> we have concerns about keeping branches open that look maintained but don't have backports of high-profile CVEs (two for us now)
16:58:08 <geguileo> I believe at some point it was discussed to stop supporting anything before Yoga
16:58:29 <dansmith> ack
16:58:36 <bauzas> anyway, we're on time
16:58:44 <bauzas> the nova meeting is ending in 1 min
16:59:07 <dansmith> thanks geguileo
16:59:16 <geguileo> thank you all
16:59:20 <bauzas> geguileo: dansmith: I guess we've arrived at a conclusion
16:59:25 <bauzas> thanks both of you
16:59:29 <bauzas> and thanks all
16:59:48 <bauzas> if nothing else,
16:59:53 <bauzas> bye
16:59:59 <bauzas> #endmeeting