16:00:08 #startmeeting nova
16:00:08 Meeting started Tue May 30 16:00:08 2023 UTC and is due to finish in 60 minutes. The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:08 The meeting name has been set to 'nova'
16:00:25 #link https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
16:00:28 welcome folks
16:00:41 o/
16:02:04 Oj
16:02:30 mmm
16:02:35 doesn't sound like a lot of people are around
16:03:04 but we can make it a slow start and hopefully people will join
16:04:15 #topic Bugs (stuck/critical)
16:04:20 #info No Critical bug
16:04:24 #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New 15 new untriaged bugs (+0 since the last meeting)
16:04:34 auniyal: any bug you wanted to discuss ?
16:05:17 looks like he's not around
16:05:20 moving on
16:05:25 #info Add yourself to the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster
16:05:29 #info bug baton is being passed to bauzas
16:05:37 ok, next topic then
16:05:42 #topic Gate status
16:05:46 #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs
16:05:50 #link https://etherpad.opendev.org/p/nova-ci-failures
16:06:09 I'll be honest, gerrit has been a bit off my radar since last week
16:06:31 I plead guilty, but I have an OpenInfra prezo to prepare
16:06:46 any CI failures people wanna discuss ?
16:06:50 the gate has been not too bad I think
16:07:04 nice
16:07:12 * gibi lurks
16:07:30 the very few gerrit emails I glanced at were indeed positive
16:07:49 let's assume we're living in a quiet world and move on
16:07:51 o/
16:07:59 #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&pipeline=periodic-weekly Nova&Placement periodic jobs status
16:08:14 the recent failures we've seen on the periodics seem to be resolved ^
16:08:23 all of them are green
16:08:31 so I guess it's fixed
16:08:47 #info Please look at the gate failures and file a bug report with the gate-failure tag.
16:08:51 #info STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-test-failures
16:09:09 voila for this topic, anything else to address on that ?
16:09:44 -
16:09:50 #topic Release Planning
16:09:55 #link https://releases.openstack.org/bobcat/schedule.html
16:09:59 #info Nova deadlines are set in the above schedule
16:10:03 #info Nova spec review day next week !
16:10:12 I should write in caps actually
16:10:32 UPLOAD YOUR SPECS, AMEND THEM, TREAT THEM, MAKE THEM READY
16:10:32 yepp, a reminder on that day might help ;)
16:10:48 elodilles: indeed
16:11:11 #action bauzas to notify -discuss@ about the spec review day
16:12:00 I've seen a couple of new specs
16:12:09 I'll slowly make a round in advance if I have time
16:12:25 next topic if nothing else
16:12:27 ,
16:12:42 #topic Vancouver PTG Planning
16:12:51 #info please add your topics and names to the etherpad https://etherpad.opendev.org/p/vancouver-june2023-nova
16:13:09 so, I created an etherpad (actually I reused the one automatically created)
16:13:57 given it will be an exercise to guess who's around and when, I'd love it if people could add their topic ideas to this etherpad and ideally mention their availability
16:14:44 also, if people who aren't able to join wanna bring some topics worth discussing at the PTG, that'd be nice
16:15:22 not sure we'll have a quorum, but I just hope we could somehow try to have some kind of synchronous discussion at the PTG
16:15:47 so we would capture the outcome of such discussions for following them up after the PTG
16:16:03 I won't lie, that'll be a challenge anyway.
16:16:45 worth helping,
16:16:47 #info Table #24 is booked for the whole two days. See the Nova community there !
16:17:32 so, yeah, I should stick around this table for the two days, except when I need to present a breakout session or when I have a Forum session to moderate, or depending on my bathroom needs
16:18:06 (the last one being a transitive result of the number of coffee shots I'll take)
16:18:24 * gibi drops
16:18:26 anyway, the word is passed.
16:18:34 #topic Review priorities
16:18:39 #link https://review.opendev.org/q/status:open+(project:openstack/nova+OR+project:openstack/placement+OR+project:openstack/os-traits+OR+project:openstack/os-resource-classes+OR+project:openstack/os-vif+OR+project:openstack/python-novaclient+OR+project:openstack/osc-placement)+(label:Review-Priority%252B1+OR+label:Review-Priority%252B2)
16:18:43 gibi: \o
16:18:47 #info As a reminder, cores eager to review changes can +1 to indicate their interest, +2 for committing to the review
16:18:53 #topic Stable Branches
16:19:07 elodilles: wanna bring some points ?
16:19:11 o7
16:19:16 #info stable/train is unblocked, as the openstacksdk-functional-devstack job is fixed for train
16:19:21 huzzah
16:19:22 \o/
16:19:34 so:
16:19:37 #info stable gates should be OK
16:19:40 that'll somehow change a bit of the next topics we'll discuss
16:19:51 #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci
16:20:03 that's all from me
16:20:04 elodilles: thanks
16:20:11 and yeah, I added a bullet point
16:20:31 but it seems to me we don't have quorum today, so I'll reformulate
16:21:16 based on the ML thread https://lists.openstack.org/pipermail/openstack-discuss/2023-May/033833.html,
16:21:36 do people think it's reasonable to EOL stable/train ?
16:21:59 personally I do
16:22:04 I saw the discussion that happened while I was away, and I wanted to reply this morning
16:22:16 but I preferred to defer any reply until after this meeting
16:22:18 the point of keeping those around was so communities could form around them if needed,
16:22:34 but if even we (redhat) aren't keeping it up to date, I think that's a good sign that it should go away
16:23:12 dansmith: well, some folks are continuing to backport a few things to train
16:23:23 and we have some stable cores that do reviews on that branch
16:23:26 but,
16:23:29 but not critical CVEs
16:23:38 as I said in the email, the two critical CVEs don't have a path forward
16:23:56 dansmith: this is unfortunately the problem
16:24:03 okay, I see there's more traffic than I expected
16:24:09 then I guess I don't care
16:24:10 it's not exactly that I'm against backporting the VMDK fix
16:24:25 but I do think it's confusing for people to see a community-maintained repo that is missing such large fixes
16:24:34 but we'll break oslo.utils semver if we do so
16:25:15 yeah, idk, I lean towards EOLing
16:25:21 for the other CVE (the brick one), I'd say I don't know how difficult it would be to propose a backport
16:25:41 the brick fix is not going to be backported AFAIK
16:25:57 and I don't think the cinder one (which is the most important) is being backported past xena
16:26:28 our part of the fix for that CVE is pretty minor and we could backport it, but without the FC part of the brick fix it's not complete
16:26:28 that's why I personally also lean towards EOLing stable/train
16:26:47 only Debian spoke up about/against EOLing, right?
16:26:51 correct
16:27:01 I can reply and explain the problem again
16:27:31 shrug
16:27:49 cinder does still have a train branch
16:27:56 and the last thing there was the VMDK fix
16:28:15 so does brick, but the last thing was last summer
16:28:30 i also stated that we could keep train open as long as the gate is working, though indeed it is unfortunate that CVE fixes don't get backported :/
16:28:36 idk, almost seems like there's not enough people who care to even push us one way or the other :)
16:28:49 I would indeed be more inclined if some other project like cinder took the same step as us at roughly the same time
16:29:20 dansmith: elodilles: honestly I guess I'll propose a patch for -eol and people can -1 if they care
16:29:33 that's probably where we'll capture most of the concerns
16:29:42 ack
16:29:46 or we could go round forever
16:29:48 ++
16:30:16 #action bauzas to propose a gerrit change for tagging stein-eol so people could vote on it
16:30:30 you mean train-eol :)
16:30:35 damn
16:30:38 #undo
16:30:38 Removing item from minutes: #action bauzas to propose a gerrit change for tagging stein-eol so people could vote on it
16:30:45 stein-eol is long gone for nova ;)
16:30:46 #action bauzas to propose a gerrit change for tagging train-eol so people could vote on it
16:30:54 elodilles: my brain fscked
16:31:13 anyway, I guess we're done on this
16:31:22 #topic Open discussion
16:31:23 ++
16:31:36 geguileo: you had a point :)
16:31:41 (geguileo) Change to os-brick's connect_volume idempotency
16:31:47 bauzas: yes, thanks
16:31:49 #link https://review.opendev.org/c/openstack/os-brick/+/882841
16:31:52 #link https://bugs.launchpad.net/nova/+bug/2020699
16:32:11 So with the latest CVE on SCSI volumes we decided in os-brick to make some changes
16:32:45 Specifically, os-brick would remove any existing devices before starting the connect_volume code
16:32:55 and then proceed with the actual attachment
16:33:09 this means that connect_volume will no longer be idempotent
16:33:17 (which is not something we ever promised it would be)
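
[A toy sketch of the behaviour change described above, for illustration only; ToyConnector and all names in it are hypothetical, not os-brick code. The point is that once connect_volume() tears down existing devices before attaching, calling it twice invalidates whatever path the first call returned:]

```python
# Hypothetical stand-in for an os-brick connector; not real os-brick code.
class ToyConnector:
    def __init__(self):
        self._counter = 0
        self._attached = {}  # volume id -> host device path

    def connect_volume(self, connection_info):
        vol = connection_info["volume_id"]
        # New behaviour: tear down any existing device for this volume
        # first, instead of returning the one that is already there.
        self._attached.pop(vol, None)
        self._counter += 1
        path = "/dev/sd" + "abcdefgh"[self._counter]
        self._attached[vol] = path
        return {"path": path}

conn = ToyConnector()
info = {"volume_id": "vol-1"}
first = conn.connect_volume(info)["path"]   # e.g. /dev/sdb
second = conn.connect_volume(info)["path"]  # e.g. /dev/sdc
# A caller that stashed `first` (as nova's rescue flow does) now
# references a device that no longer exists on the host.
assert first != second
```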
16:33:26 that seems to break some Nova operations
16:33:57 particularly the rescue/unrescue operations
16:34:05 is there some way we can check to see if we need to run connect_volume()?
16:34:24 geguileo: thanks for spotting this in advance
16:34:30 because otherwise it's hard to know if we're restarting after a recovery or just a service restart
16:34:39 whether or not we should run that
16:35:08 yeah, that's one of the problems
16:35:22 dansmith: that could be solved if Nova didn't stash and then unstash the instance config
16:35:33 and instead rebuilt the instance config
16:35:40 meaning the guest xml?
16:35:41 I believe sean-k-mooney[m] mentioned that
16:35:45 dansmith: yes
16:36:00 geguileo: we also recreate the XML on stop/start fwiw
16:36:11 okay, but if we're running on a readonly-root host or recovering from a disaster, none of that would be available after a restart for us
16:36:27 yeah, the guest XML isn't something we can assume hangs around long-term IMHO
16:36:41 dansmith: could nova know if it's attaching the volume for the recovery instance on the same host that was running the instance?
16:36:42 most of nova is designed to assume that it's temporary
16:36:53 yeah, and in general we don't consider libvirt a source of truth for persisting some instance metadata
16:37:01 bauzas: right
16:38:00 geguileo: for the rescue case specifically? if the instance was already stopped and you go straight to rescue, we wouldn't really know when the last time the connect was run.. could have been weeks ago and multiple host reboots since
16:38:01 would the spec on dangling volumes help the problem ?
16:38:17 bauzas: I don't think so
16:38:19 I mean, our BDM/volume reconciliation
16:38:33 that's not the problem, it's the underlying host devices
16:38:48 yeah, I remember the context of the CVE :)
16:39:21 and I guess only brick knows whether there is a residue ?
16:39:29 geguileo: so let's say we have one instance stopped that points to /dev/vdf,
16:39:49 then we reboot, and someone spawns a new instance, we call connect_volume() for that new instance, it gets /dev/vdf,
16:40:09 then the user starts the old was-stopped instance, we can't look at the instance xml config and know that the volume is wrong, right?
16:40:20 because /dev/vdf exists, but it's no longer relevant for this instance
16:40:55 dansmith: well, Nova could check the ID of the volume and see if it matches the information returned by os-brick
16:41:07 point being, there could be multiple instances with disks that point to stale host devices at any given point..
16:41:09 but I don't know if we want Nova to be in the business of doing those checks
16:41:17 yeah, ideally not
16:41:19 dansmith: yes, that could happen
16:42:05 could nova ask brick such a thing ?
16:42:08 geguileo: maybe everywhere we currently do connect_volume() we do a disconnect first and ignore any error? that generates a lot of churn, but maybe that's safer?
16:42:29 bauzas: that's what I was wondering.. if there was some validate_volume() call or something we could run to know if we should run connect
16:42:41 like, before doing connect_volume(), do a check_vol()
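
[An illustrative toy of the /dev/vdf scenario dansmith walks through above, showing why a simple "does the stashed path still exist" check could not serve as that validation; every name here is made up, not nova or os-brick code:]

```python
# Toy model of a host handing out virtio device nodes after a reboot.
host_devices = {}  # device path -> volume id currently backing it

def toy_connect(volume_id: str) -> str:
    # Hand out the first free /dev/vdX node, as the host would.
    for letter in "abcdefgh":
        path = f"/dev/vd{letter}"
        if path not in host_devices:
            host_devices[path] = volume_id
            return path
    raise RuntimeError("no free device nodes")

# Before the reboot: the stopped instance's XML stashed vol-old's path.
stashed_path = toy_connect("vol-old")  # /dev/vda

# Host reboots: all host-side attachments are gone.
host_devices.clear()

# A new instance spawns first and gets the same node for another volume.
toy_connect("vol-new")                 # /dev/vda again

# The stashed path still exists, but it now belongs to vol-new, so an
# existence check would wrongly say the old instance's config is fine.
assert stashed_path in host_devices
assert host_devices[stashed_path] != "vol-old"
```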
16:42:43 ahah
16:42:54 dansmith: os-brick will be doing that already with optimized code
16:43:20 dansmith: that's the issue that I'm bringing up, that the new code that cleans things up breaks the idempotency, so there's no need for nova to do that call you mention
16:43:37 os-brick cannot do a proper check_vol like we would want
16:43:48 because the volume can have changed in the background
16:44:06 for example changing the size, or pointing to a different volume in the backend (like happened in the CVE)
16:44:26 okay, I'm just not sure how we can know what the right thing to do is
16:44:27 there are a bunch of messes that could happen, and I'm not sure we would want this to depend on nova calling a method
16:45:04 dansmith: well, if you call connect_volume, all instances in Nova should use the device it returns from that moment onwards
16:45:10 geguileo: I'm confused, who's responsible for knowing that all connectors are present on the host ?
16:45:17 I thought it was brick
16:45:23 geguileo: okay, I thought you didn't want us to do that
16:45:27 what do you mean by connectors?
16:45:48 wrong wording, my bad.
16:45:51 geguileo: or are you just saying any time we call connect_volume() we need to examine the result and make sure the instance is (changed, if necessary) to use that?
16:46:16 geguileo: I'd say the device number
16:46:44 dansmith: what I don't want is nova to call connect_volume, get the value and stash it (/dev/sdc), then call connect_volume again (/dev/sda) to use it in the rescue instance, and then use the stashed value (/dev/sdc) that no longer exists
16:47:05 dansmith: yes, that's it
16:47:18 geguileo: okay
16:47:25 dansmith: if connect_volume is called, make sure that is used in all the instances that use the same cinder volume on that host
16:47:53 sounds like the solution is that we need to call connect_volume every time we're going to start an instance for any reason, and update the XML (rescue or otherwise) to use that value before we start
16:48:43 dansmith: that would be slower but safe
16:49:12 an optimised mechanism where we could tweak Nova to only do that when necessary could lead to unintended bugs
16:49:13 yeah, I was hoping for some mechanism that would prevent the unnecessary roundtrip, but if that's hard, then...
16:49:18 yeah, I just don't know how we could do it safely otherwise, unless there's something that maintains an exclusive threadsafe set of host devices to make sure we never get them confused
16:50:02 dansmith: if the host doesn't reboot and connect_volume is not called again, then we could assume that the device is correct
16:50:13 geguileo: but we don't know this
16:50:27 geguileo: we only know about the service restart
16:50:40 bauzas: could nova query the uptime or something?
16:50:47 no :)
16:51:03 ok, then the whole connect_volume is probably the only way :'-(
16:51:08 let's not do anything like that for host or service uptime.. I think we just need to do the safe/slow approach
16:51:24 geguileo: we have other host-dependent resources, like mdevs, that we treat this way without needing to know if it's a reboot
16:53:08 so I guess that's the solution then?
16:53:37 I think so.. not ideal for sure
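
[A minimal sketch of the slow/safe plan agreed above; every name below is hypothetical, not nova's actual code path. The idea is that every guest start (boot, rescue, unrescue, ...) re-runs connect_volume() and rebuilds the disk config from its return value, rather than trusting a device path stashed at the previous attach:]

```python
from dataclasses import dataclass

@dataclass
class ToyBDM:
    # Hypothetical block device mapping: what nova would carry around.
    connection_info: dict
    device_path: str = ""

class FakeConnector:
    # Hypothetical connector that reports where the volume got attached.
    def connect_volume(self, connection_info):
        return {"path": "/dev/disk/by-id/" + connection_info["volume_id"]}

def start_instance(name, bdms, connector):
    for bdm in bdms:
        # Unconditionally re-establish the host attachment; with the new
        # os-brick behaviour this also cleans up stale devices first.
        bdm.device_path = connector.connect_volume(bdm.connection_info)["path"]
    # Rebuild (never unstash) the guest definition with the fresh paths.
    disks = "".join(f"<disk dev='{b.device_path}'/>" for b in bdms)
    return f"<domain name='{name}'>{disks}</domain>"

xml = start_instance("inst-1", [ToyBDM({"volume_id": "vol-1"})], FakeConnector())
```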
16:53:43 in general, we follow some workflow which is "look up my XML, find the specific resources attached, then query whatever is needed to know whether those still exist, and if not, ask whatever else to recreate them"
16:54:08 (at service restart I mean)
16:54:35 with changes to the cinder drivers we could have a proper check of whether it's valid, or have connect_volume be faster (and idempotent when it can be)
16:54:56 they would need to provide extra information to validate the device
16:55:10 let's do the slow/safe thing now, and if connect_volume() can be idempotent and faster in the future, then that's cool
16:55:20 dansmith: sounds like a plan
16:55:48 since that way the newer approach would probably not require additional nova changes
16:55:52 just os-brick and cinder
16:55:55 yep
16:56:59 ok, then I have nothing else to say on the topic
16:57:06 geguileo: unrelated, but since you're here.. what's cinder's plan for stable/train?
16:57:15 I assume no backports of this CVE back that far, right?
16:57:23 and are you all thinking about EOLing at some point?
16:57:46 dansmith: definitely no backports that far
16:57:49 we have concerns about keeping branches open that look maintained but don't have backports of high-profile CVEs (two for us now)
16:58:08 I believe at some point it was discussed to stop supporting anything before Yoga
16:58:29 ack
16:58:36 anyway, we're on time
16:58:44 the nova meeting is ending in 1 min
16:59:07 thanks geguileo
16:59:16 thank you all
16:59:20 geguileo: dansmith: I guess we've arrived at a conclusion
16:59:25 thanks both of you
16:59:29 and thanks all
16:59:48 if nothing else,
16:59:53 bye
16:59:59 #endmeeting