16:02:54 <Uggla> #startmeeting nova
16:02:54 <opendevmeet> Meeting started Tue Jul 29 16:02:54 2025 UTC and is due to finish in 60 minutes.  The chair is Uggla. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:02:54 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:02:54 <opendevmeet> The meeting name has been set to 'nova'
16:03:05 <elodilles> o/
16:03:06 <Uggla> Hello everyone
16:03:16 <gmaan> o/
16:03:30 <sp-bmilanov> hi!
16:03:43 <opendevreview> Stephen Finucane proposed openstack/nova master: api: Add response body schemas for simple tenant usage APIs  https://review.opendev.org/c/openstack/nova/+/956096
16:04:05 <Uggla> Let's start smoothly so people can join
16:04:28 <Uggla> #topic Bugs (stuck/critical)
16:04:39 <Uggla> #info No Critical bug
16:05:06 <Uggla> #topic Gate status
16:05:14 <Uggla> #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs
16:05:21 <Uggla> #link https://etherpad.opendev.org/p/nova-ci-failures-minimal
16:05:30 <Uggla> #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&branch=stable%2F*&branch=master&pipeline=periodic-weekly&skip=0 Nova&Placement periodic jobs status
16:05:38 <Uggla> #info Please look at the gate failures and file a bug report with the gate-failure tag.
16:05:47 <Uggla> #info Please try to provide a meaningful comment when you recheck
16:06:14 <Uggla> #info (gibi): The ceph regression is now tracked in https://tracker.ceph.com/issues/72203 and I pinged two of the maintainers to let them know we need help.
16:06:20 * bauzas waves late, was in meeting
16:06:24 <Uggla> #info (gibi): https://review.opendev.org/c/openstack/grenade/+/955865 is merged. If you still see grenade failures due to timeout during DB dumping then please let me know in the bug.
16:06:59 <gmaan> ++
16:07:00 <Uggla> gibi, I think ^ is related to issues we had last week. Any news about them?
16:07:26 <gibi> I added this to the agenda today as FYI :)
16:07:32 <gibi> s/this/these/
16:07:45 <gibi> so no news since two hours ago :)
16:08:29 <Uggla> ok thanks, so this is still in progress.
16:08:43 <gibi> yepp
16:09:36 <Uggla> #topic tempest-with-latest-microversion job status
16:09:43 <Uggla> #link https://zuul.opendev.org/t/openstack/builds?job_name=tempest-with-latest-microversion&skip=0
16:09:56 <gmaan> no update on this. I need to schedule some time for this
16:10:05 <Uggla> no worries, thanks
16:10:23 <Uggla> #topic Release Planning
16:10:31 <Uggla> #link https://releases.openstack.org/flamingo/schedule.html
16:10:40 <Uggla> #info Nova deadlines are set in the above schedule
16:11:06 <Uggla> We are ~1 month before feature freeze.
16:11:28 <Uggla> #topic Review priorities
16:11:37 <Uggla> #link https://etherpad.opendev.org/p/nova-2025.2-status
16:12:03 <Uggla> I'm not sure it is fully up to date, I will have a look tomorrow.
16:12:59 <Uggla> I think the main priorities are eventlet removal and vTPM live migration.
16:13:28 <Uggla> Both features have patches to review unless I'm wrong.
16:14:10 <Uggla> #topic OpenAPI
16:14:20 <Uggla> #link https://review.opendev.org/q/topic:%22openapi%22+(project:openstack/nova+OR+project:openstack/placement)+-status:merged+-status:abandoned
16:14:27 <Uggla> #info 13 remaining.
16:15:05 <Uggla> I guess gmaan and sean-k-mooney managed to do some reviews. Thanks
16:16:02 <Uggla> #topic Stable Branches
16:16:11 <Uggla> elodilles, the floor is yours
16:16:17 <elodilles> #info stable branches (stable/2025.1 and stable/2024.*) seem to be in OK state
16:16:27 <elodilles> release nova stable versions:
16:16:33 <elodilles> 31.0.1 (2025.1 Epoxy) https://review.opendev.org/c/openstack/releases/+/955058
16:16:36 <elodilles> 30.0.2 (2024.2 Dalmatian) https://review.opendev.org/c/openstack/releases/+/947847
16:16:42 <elodilles> 29.2.2 (2024.1 Caracal) https://review.opendev.org/c/openstack/releases/+/954474
16:16:56 <elodilles> still waiting for release liaison's review o:)
16:17:03 <elodilles> (the latter is preparation for 2024.1 Caracal to move to Unmaintained in October)
16:17:09 <elodilles> #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci
16:17:27 <elodilles> that's all from me -> back to you Uggla
16:17:36 <Uggla> elodilles, I will review them tomorrow.
16:17:45 <elodilles> Uggla: thanks in advance o/
16:18:20 <Uggla> #topic vmwareapi 3rd-party CI efforts Highlights
16:18:29 <fwiesel> Hi, no update from my side.
16:18:44 <Uggla> fwiesel, thanks.
16:18:47 <fwiesel> Uggla: back to you
16:18:59 <Uggla> #topic Gibi's news about eventlet removal.
16:19:03 <gibi> o/
16:19:07 <Uggla> #link Blog: https://gibizer.github.io/categories/eventlet/
16:19:13 <Uggla> #link nova-scheduler series is ready for core review, starting at https://review.opendev.org/c/openstack/nova/+/947966
16:19:33 <gibi> Thanks to the reviews we have Kamil's patch approved https://review.opendev.org/c/openstack/nova/+/949754
16:20:07 <gibi> and we are too commits from https://review.opendev.org/c/openstack/nova/+/948450/35 that will enable n-sch in threading mode in nova-next
16:20:13 <gibi> s/too/two/
16:21:11 <gibi> I recently pushed a series of unit test re-enabling changes to the top of the series mostly Tpool and Threading.lock copying fixes
16:21:52 <gibi> ahh and we have an ssl socket wrapping change as well that is an interesting one and still failing on the gate after my initial fix
16:22:01 <gibi> https://review.opendev.org/c/openstack/nova/+/955915
16:22:55 <gibi> anyhow my goal is to land https://review.opendev.org/c/openstack/nova/+/948450 before the FF (n-sch) or as a stretch goal https://review.opendev.org/c/openstack/nova/+/951957/17 (n-api)
16:22:59 <gibi> that is it.
16:23:32 <gibi> there is nothing major in between n-sch and n-api so I'm hopeful about the stretch goal
16:24:08 <gibi> back to you Uggla if no questions
16:24:32 <Uggla> That would be great. So next cycle you may focus on compute.
16:24:46 <gibi> vncproxy and compute yepp
16:24:58 <gibi> Kamil has the n-cond on his plate
16:26:02 <Uggla> thanks gibi, moving on
16:26:09 <Uggla> #topic Open discussion
16:26:32 <sean-k-mooney> sorry was distracted. but here now o/
16:27:10 <Uggla> gibi, I think you want to discuss --> just want to socialize a small independent test speed improvement series: https://review.opendev.org/q/topic:%22unit-test-speedup%22 Easy to review and low risk patches.
16:27:24 <Uggla> Hi sean-k-mooney
16:27:48 <sean-k-mooney> hum that sounds interesting
16:27:57 <gibi> yeah just a heads up that I run a lot of unit test these days so I spent an afternoon speeding up some of the slow unit test
16:28:00 <gibi> s
16:28:28 <gibi> mostly changing timeout values or mocking time.sleep in test
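The approach gibi describes, mocking time.sleep so that a test no longer waits in real time, might look like the sketch below. It is not taken from the unit-test-speedup series; wait_for_ready and its parameters are hypothetical names invented for illustration.

```python
import time
import unittest
from unittest import mock


def wait_for_ready(check, retries=5, interval=2.0):
    """Hypothetical helper: poll check() until it returns True."""
    for _ in range(retries):
        if check():
            return True
        # In a real run this is where the test would lose wall-clock time.
        time.sleep(interval)
    return False


class WaitForReadyTest(unittest.TestCase):
    # Patching time.sleep replaces the real delay with a no-op mock,
    # so three "2 second" retries finish almost instantly.
    @mock.patch("time.sleep")
    def test_gives_up_after_retries(self, mock_sleep):
        result = wait_for_ready(lambda: False, retries=3, interval=2.0)
        self.assertFalse(result)
        # The retry logic is still verified through the mock's call record.
        self.assertEqual(3, mock_sleep.call_count)
        mock_sleep.assert_called_with(2.0)
```

Saved as a test module, this can be run with python -m unittest. The retry behaviour is still asserted, only the real waiting is removed.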
16:28:42 <sean-k-mooney> cool. mainly unit tests not functional at this point, will any of it carry over or are they all targeted
16:28:51 <gmaan> nice
16:28:55 <gibi> they are all targeted
16:28:59 <sean-k-mooney> ack
16:29:08 <sean-k-mooney> im happy to review those
16:29:19 <gibi> when I will start enabling threading in func test you can expect similar speed up series there as well
16:29:32 <gmaan> I will check
16:29:41 <gibi> thanks
16:30:14 <sean-k-mooney> i currently cheat and when i want to run unit tests a lot i move to a system with 48 cores so it would be nice to not need that if adding mocking of time.sleep etc. can do this in the gate too
16:30:42 <sean-k-mooney> the unit test i dont find to be too bad but the functional test are slower.
16:31:16 <gmaan> if you run api sample test too in functional then yes it is slower. I usually separate tgem
16:31:19 <sean-k-mooney> gibi: have you added this to the nova status etherpad
16:31:20 <gmaan> them
16:31:33 <gibi> sean-k-mooney: not yet, doing it now...
16:32:01 <sean-k-mooney> cool ill add myself to the bottom patch and try to do a pass on all of them this week
16:32:54 <sean-k-mooney> i want to check them out and run them locally but skimming them they all seem to make sense
16:33:07 <gibi> nothing fancy :)
16:33:57 <sean-k-mooney> before we move on did you have a strategy to find these slow tests
16:34:08 <sean-k-mooney> or were you just looking at the slow test list and going hum why is that so slow
16:34:13 <gibi> tox prints the slow tests
16:34:28 <gibi> I just checked those
16:34:44 <sean-k-mooney> ack, its been a while since i looked at that list
16:35:04 <sean-k-mooney> i see it but i dont dig into it often. thanks for taking the time
16:35:23 <Uggla> gibi, 'tox prints the slow tests' with a specific command or ?
16:35:48 <gibi> Uggla: after each successful target run :)
16:36:03 <gmaan> yeah, at the end you can see those
16:36:43 <Uggla> Shame on me I did not notice that...
16:36:45 <sean-k-mooney> Uggla: https://github.com/openstack/nova/blob/master/tox.ini#L91
16:36:53 <gibi> -> flag = self._event.wait(timeout)
16:37:00 <gibi> ahh wrong buffer
16:37:06 <gibi> https://github.com/openstack/nova/blob/master/tox.ini#L54
16:37:18 <sean-k-mooney> so context, we use stestr which uses subunit internally as a db
16:37:19 <gibi> so you can use stestr to print it
16:37:35 <sean-k-mooney> and there are some tools to extract things like slowest tests and print them
16:37:46 <sean-k-mooney> that's what stestr slowest is doing
16:38:00 <Uggla> good to know thanks
16:38:00 <sean-k-mooney> it's looking at the last test run in the subunit db and listing them
16:38:45 <Uggla> sean-k-mooney I'm just wondering if the 48 cores machine is part of your homelab ? Sorry I'm curious
16:39:17 <sean-k-mooney> yes it's part of the hardware im considering replacing, its pretty old at this point
16:39:29 * gibi has 16 cores 32 threads desktop CPU
16:39:55 <sp-bmilanov> sean-k-mooney sounds like a dual-xeon workstation from mid 2010s
16:40:13 <sean-k-mooney> yep
16:40:27 <sean-k-mooney> Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
16:41:03 <gmaan> I recently bought two 16 core dell poweredge R630 but need to setup those
16:41:43 <Uggla> gmaan for homelab too ?
16:42:00 <sean-k-mooney> honestly you can get much more power efficient systems now
16:42:00 <gmaan> yeah, my both old desktop are dead
16:42:25 <sean-k-mooney> but its my main compute node in my home openstack, i just pop onto the host if i want to run fast
16:43:20 <sean-k-mooney> our unit tests are very very parallelisable
16:43:28 <sean-k-mooney> anyway i think we are a little off topic
16:43:35 <Uggla> yepp
16:43:55 <Uggla> any other topic ?
16:44:08 <sp-bmilanov> yep
16:44:45 <Uggla> oh sp-bmilanov, please go ahead. And sorry, I might have removed your topic thinking it was covered last time
16:44:54 <sp-bmilanov> Uggla: no probs
16:45:28 <sp-bmilanov> yeah, so from last time, gibi, dansmith, you might remember the nova-compute forced restart bug, I drafted a spec: https://review.opendev.org/c/openstack/nova-specs/+/955812
16:45:53 <sp-bmilanov> does it make sense to discuss it with the hindsight of Sean's comments? (thanks sean-k-mooney!)
16:46:29 <sp-bmilanov> as I understand it, it would be best if we add some extra RPC functionality for the source to ask the destination if the instance is running there
16:46:42 <sean-k-mooney> just opening that now
16:46:54 <sean-k-mooney> ah the restart during live migration issue
16:47:18 <sean-k-mooney> so there are probably a few different ways to address that. but there are trade offs for all of them
16:47:34 <sp-bmilanov> IMO admin-locking it would be a compromise, what you suggested with the self-heal sounds best
16:48:14 <sean-k-mooney> so i think there are different mitigations for different usecases. where we can gracefully shutdown i think we likely should abort the live migration at the libvirt level
16:48:24 <sean-k-mooney> and then clean it up at the nova level on start up
16:48:45 <sean-k-mooney> putting the instances to error and locking them i think is a step we could take to improve the situation
16:49:16 <sean-k-mooney> but the ultimate fix would be to be able to resume the management of the migration and allow it to succeed if it in fact did
16:49:51 <sean-k-mooney> i feel like an incremental approach is what we should try to do but i dont know what others think
16:50:11 <sean-k-mooney> sp-bmilanov: is this something you would have time to work on or are you raising it as a pain point that you think we should prioritize
16:50:19 <gibi> last week we discussed that sp-bmilanov is interested in the non-graceful restart case
16:50:33 <gibi> so we need something in the startup code path that prevents the original problem
16:51:36 <sp-bmilanov> gibi: yes, the ungraceful restart case is what we experienced ; sean-k-mooney pain point first, I am not sure if I can help implement something that you've triaged as so complex
16:51:37 <gibi> ie. the duplication of instances reported https://bugs.launchpad.net/nova/+bug/2092391
16:52:02 * sp-bmilanov should maybe link the bug to the spec
16:52:43 <gibi> I believe we should be able to detect during compute startup that we have a duplication based on the fact that one of the domains is on a host that is not equal to the instance.host
16:53:12 <gibi> or based on that we have an unfinished migration
16:53:16 <gibi> in the DB
16:53:33 <sp-bmilanov> just an aside, I guess that is not a good approach since you haven't brought it up, but why not try using a flag in the database to check if the migration is at a point where the instance is running at the destination
16:53:39 <sean-k-mooney> so the duplication i think will need us to do an rpc to the dest compute
16:53:59 <sean-k-mooney> and determine if we should revert or run post-live-migration
16:54:02 <gibi> sp-bmilanov: I hope we have a migration state that signals that already
16:55:50 <opendevreview> Stephen Finucane proposed openstack/nova master: api: Address issues with images APIs  https://review.opendev.org/c/openstack/nova/+/956102
16:56:37 <sean-k-mooney> sp-bmilanov: so resumeing the guest is not done by nova
16:56:40 <gibi> anyhow I think it makes sense to review a spec and prepare a PTG discussion about it
16:56:43 <sean-k-mooney> its done by libvirt
16:56:55 <sean-k-mooney> and the source compute node will update the db state if its still running
16:57:15 <sean-k-mooney> the problem currently is if you restart it the thread we have monitoring the libvirt job will be killed
16:57:25 <sean-k-mooney> and we wont do that update.
16:57:57 <sean-k-mooney> but on startup we can see if the job is still running and either resume waiting or make an rpc to see if it succeeded. or check if the vm is running locally
16:58:14 <sean-k-mooney> if its running locally and there is no migration job active in libvirt it means the migration failed
16:58:25 <sean-k-mooney> if the vm is not local it means either it crashed or it moved
16:58:38 <sean-k-mooney> so we could be more intelligent on startup
16:58:48 <sean-k-mooney> but someone needs to spend time documenting this behavior
16:58:50 <sean-k-mooney> and writing it up
16:59:10 <sean-k-mooney> so yes i think a spec/ptg discussion would be good
16:59:11 <gibi> yepp I agree
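The startup checks sean-k-mooney outlines above could be sketched roughly as follows. This is only an illustration of the decision order under discussion, not Nova code: Action, decide_on_startup, decide_after_destination_reply and their boolean inputs are hypothetical names, and the outcome chosen for a guest that is not found on the destination (revert) is an assumption based on the "revert or run post-live-migration" remark, not an agreed design.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes for a source compute restarting mid-migration."""
    RESUME_WAITING = "resume waiting on the libvirt migration job"
    MARK_FAILED = "treat the migration as failed"
    ASK_DESTINATION = "ask the destination compute over RPC"
    RUN_POST_LIVE_MIGRATION = "run post-live-migration"
    REVERT = "revert the migration"


def decide_on_startup(job_active, vm_running_locally):
    """First pass, using only what the source host can see locally."""
    if job_active:
        # The libvirt job survived the restart: pick up monitoring again.
        return Action.RESUME_WAITING
    if vm_running_locally:
        # Guest still here and no active job: the migration did not happen.
        return Action.MARK_FAILED
    # Guest is gone: it either crashed or moved, which local state
    # alone cannot distinguish.
    return Action.ASK_DESTINATION


def decide_after_destination_reply(running_on_destination):
    """Second pass, once the destination has answered (assumed outcomes)."""
    if running_on_destination:
        return Action.RUN_POST_LIVE_MIGRATION
    return Action.REVERT
```

Splitting the decision in two keeps the RPC to the destination as a last resort, used only when the local libvirt state cannot explain where the guest went.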
17:00:25 * sp-bmilanov is still wondering why "make an rpc to see if it succeeded" cannot be checked from the DB, but he needs to make sense of sean-k-mooney's messages offline
17:00:58 <sp-bmilanov> s/checked from the DB/set from the dst and then checked from the src from the DB/
17:01:31 <sean-k-mooney> well
17:01:45 <sean-k-mooney> the compute agent does not have direct db access for one
17:01:55 <sean-k-mooney> so just checking the db state is already an rpc to the conductor
17:02:06 <sean-k-mooney> but the responsibility for updating the host of the vm is on the source node
17:02:14 <sean-k-mooney> as part of post live migrate
17:02:33 <sean-k-mooney> so on start up if we need to check if its on the dest node the source node needs to call the dest to do that
17:02:52 <sean-k-mooney> i.e. assuming the vm is not still running locally
17:03:39 <sean-k-mooney> we need to know if it crashed or if its actually active on the other host. and we can't do that with only local knowledge or by checking the db.
17:04:50 <Uggla> We are at the top of the hour. So I propose to close the meeting. ^ discussion can happen right after if needed.
17:05:06 <sp-bmilanov> thanks Uggla
17:05:20 <Uggla> Thanks all.
17:05:22 <Uggla> #endmeeting