16:00:27 <bauzas> #startmeeting nova
16:00:27 <opendevmeet> Meeting started Tue Nov 26 16:00:27 2024 UTC and is due to finish in 60 minutes. The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:27 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:27 <opendevmeet> The meeting name has been set to 'nova'
16:01:18 <bauzas> #link https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
16:01:27 <tkajinam> o/
16:01:33 <elodilles> o/
16:01:48 <s3rj1k> hi all
16:02:27 <Uggla> o/
16:03:30 <bauzas> hey
16:04:22 <bauzas> starting slowly
16:04:31 <bauzas> #topic Bugs (stuck/critical)
16:04:49 <bauzas> #info No Critical bug
16:05:02 <bauzas> #info Add yourself to the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster
16:05:08 <bauzas> any questions about bugs?
16:07:06 <bauzas> ok, moving on
16:07:35 <bauzas> #topic Gate status
16:07:43 <bauzas> #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs
16:07:48 <bauzas> #link https://etherpad.opendev.org/p/nova-ci-failures-minimal
16:07:59 <bauzas> #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&branch=stable%2F*&branch=master&pipeline=periodic-weekly&skip=0
16:08:14 <bauzas> #info Please look at the gate failures and file a bug report with the gate-failure tag.
16:08:19 <bauzas> #info Please try to provide a meaningful comment when you recheck
16:08:29 <bauzas> I saw a couple of failures but those are known issues
16:08:40 <bauzas> anything about CI failures that is pretty new?
16:08:52 <bauzas> (all periodics are green)
16:10:00 <bauzas> looks like not, moving on
16:10:09 <bauzas> #topic Release Planning
16:10:15 <bauzas> #link https://releases.openstack.org/epoxy/schedule.html
16:10:20 <bauzas> #action bauzas to add Epoxy nova deadlines in the schedule
16:10:45 <bauzas> I'm pretty much done with the patch proposal but I need to fix something before uploading it
16:11:03 <bauzas> #topic Review priorities
16:11:10 <bauzas> #link https://etherpad.opendev.org/p/nova-2025.1-status
16:11:28 <bauzas> the page should be up to date, feel free to use it and amend it
16:12:46 <bauzas> anything about that?
16:13:15 <gibi> o/
16:13:24 <sean-k-mooney> o/ nothing from me on that topic
16:13:33 <bauzas> cool
16:13:41 <bauzas> #topic Stable Branches
16:13:49 <bauzas> elodilles: shoot
16:14:02 <elodilles> #info stable/2024.2 gate seems to be OK
16:14:12 <elodilles> #info stable/2024.1 gate is blocked on grenade-skip-level & stable/2023.2 is blocked on nova-grenade-multinode
16:14:21 <elodilles> the failure is due to the stable/2023.1 -> unmaintained/2023.1 transition; devstack and grenade fixes are proposed
16:14:52 <elodilles> and actually the 2024.1 branch fix (grenade) patch is already in the gate queue
16:15:05 <elodilles> though another workaround is to set these jobs to non-voting - given that the gate should not rely on an unmaintained branch
16:15:21 <elodilles> see further details:
16:15:28 <elodilles> #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci
16:15:58 <elodilles> and that's all from me about stable branches for now
16:16:09 <bauzas> thanks
16:16:13 <elodilles> np
16:16:30 <bauzas> #topic vmwareapi 3rd-party CI efforts Highlights
16:16:33 <bauzas> fwiesel: around?
16:17:20 <bauzas> looks like he's AFK
16:17:24 <bauzas> no worries, moving on
16:17:28 <fwiesel> Sorry, I am here
16:17:31 <bauzas> ah
16:17:41 <bauzas> anything to raise from your side?
16:17:59 <fwiesel> There was a regression in oslo.utils (master) and I have created a change to fix it: https://review.opendev.org/c/openstack/oslo.utils/+/936247
16:18:23 <fwiesel> Hopefully the builds will be back to the two failures and I will tackle those then.
16:18:26 <fwiesel> That's it from my side
16:18:28 <sean-k-mooney> ah, is that related to removing netifaces?
16:18:55 <bauzas> okay, gtk
16:18:57 <bauzas> thanks
16:19:11 * tkajinam is aware of the proposed fix and will ping the other cores to get that in
16:19:23 <bauzas> nice, thanks tkajinam
16:19:24 <tkajinam> fwiesel, if you need a new release with the fix early then ping me
16:19:35 <tkajinam> once that is merged
16:20:01 <fwiesel> tkajinam: Thanks, I'll let you know
16:20:39 <bauzas> cool
16:20:47 <bauzas> then moving to the last item on the agenda
16:20:54 <bauzas> #topic Open discussion
16:21:01 <bauzas> nothing on the agenda, so anything, anyone?
16:21:11 <s3rj1k> there is this https://bugs.launchpad.net/nova/+bug/2089386
16:21:34 <sean-k-mooney> i have one followup from last week too
16:21:48 <sean-k-mooney> let's start with s3rj1k's topic
16:22:29 <bauzas> ok, s3rj1k, shoot
16:23:30 <s3rj1k> the idea is to allow host discovery to be concurrent, both CLI and internal, using distributed locking
16:24:25 <sean-k-mooney> so perhaps i can provide some context
16:24:46 <s3rj1k> this is mostly needed for k8s-like envs where discovery is run in multiple places
16:24:47 <sean-k-mooney> s3rj1k is interested in using the discover hosts periodic in an HA env
16:25:09 <bauzas> s3rj1k: I think that topic requires a proper discussion that can't be done during a meeting
16:25:18 <sean-k-mooney> currently we require that if you use the periodic, it's enabled on at most one host
16:25:29 <sean-k-mooney> they would like to address that pain point
16:25:35 <bauzas> if we want to discuss the design, it has to be an async conversation in a properly formatted document
16:26:04 <bauzas> that's the reason why we introduced our specification process for those kinds of feature requests
16:26:14 <s3rj1k> bauzas: spec? or is an RFE enough this time?
16:26:30 <sean-k-mooney> so this would definitely be a spec if you were going to work on it
16:26:36 <bauzas> s3rj1k: are you familiar with spec writing or do you need guidance?
16:27:03 <s3rj1k> bauzas: done one for neutron, so all ok
16:27:11 <sean-k-mooney> i think before going that far, however, s3rj1k wanted some initial feedback on whether this is in scope for nova to fix
16:27:44 <bauzas> sean-k-mooney: well, I'm not sure we have a quorum today for such a design discussion
16:28:10 <bauzas> if that was something before the PTG, we would have said "sure, just add that to the PTG and we'll discuss it"
16:28:24 <sean-k-mooney> that's still an option
16:28:37 <sean-k-mooney> i suggested that s3rj1k bring it here to advertise that it exists
16:28:47 <bauzas> honestly, I haven't yet formally written the nova deadlines for Epoxy but we're already running short on time
16:28:49 <sean-k-mooney> and then start either a mailing list or spec discussion after that
16:29:19 <bauzas> what exact problem are we trying to solve then?
16:29:45 <sean-k-mooney> currently if you enable the discover hosts periodic task in more than one scheduler it can get duplicate key errors from the db
16:29:47 <bauzas> are we speaking of concurrent nova-scheduler services that need to be HA active-active for X reasons?
16:29:54 <sean-k-mooney> as 2 processes can race to create the mappings
16:29:59 <sean-k-mooney> leading to errors in the logs
16:30:04 <sean-k-mooney> we don't actually support that today
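[For context: the periodic being discussed is controlled by nova's [scheduler]/discover_hosts_in_cells_interval config option, which defaults to -1 (disabled). A minimal sketch of the single-host setup described above as the supported one; the 300-second interval is only an example:]

    # nova.conf -- set this on at most ONE nova-scheduler host; leaving
    # the option at its default (-1) everywhere else ensures only a
    # single process periodically creates host mappings.
    [scheduler]
    discover_hosts_in_cells_interval = 300
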
16:30:09 <bauzas> I think we always said that nova-scheduler has to be active-passive
16:30:13 <sean-k-mooney> but our documentation on that is kind of lacking
16:30:17 <sean-k-mooney> no
16:30:26 <bauzas> I pretty much bet we documented it
16:30:28 <sean-k-mooney> the scheduler has been supported in active-active for a very long time
16:30:31 <bauzas> never
16:30:35 <sean-k-mooney> yes
16:31:03 <tkajinam> as far as I can tell TripleO in the past deployed it in all controllers
16:31:12 <bauzas> with placement, we thought that we /could/ run it active-active but there were reasons not to
16:31:21 <sean-k-mooney> nope
16:31:26 <bauzas> tkajinam: which was a bug that we raised a couple of times
16:31:33 <sean-k-mooney> downstream it's been active-active since like 16, maybe before
16:31:41 <bauzas> and I think TripleO changed it to A-P
16:31:49 <bauzas> for that exact reason
16:31:53 <sean-k-mooney> nope
16:31:56 <tkajinam> no
16:32:22 <sean-k-mooney> ok, well, i think we need a longer discussion on this RFE
16:32:36 <sean-k-mooney> likely a spec, and we probably don't have time to complete it in epoxy
16:32:50 <sean-k-mooney> but we should discuss this more async
16:33:22 <s3rj1k> no prob, thanks sean-k-mooney for taking the lead on explaining
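[For context: a minimal sketch of the general direction floated in the RFE, not an agreed design. It uses a hypothetical helper around nova's HostMapping object and simply tolerates the duplicate-key race; the distributed-locking variant s3rj1k mentioned is not shown, and nova's real discovery code in nova/objects/host_mapping.py may wrap this error differently:]

    # Illustrative sketch only: make concurrent discovery runs tolerant
    # of racing to create the same host mapping.
    from oslo_db import exception as db_exc

    from nova import objects

    objects.register_all()

    def ensure_host_mapping(ctx, cell_mapping, host):
        try:
            hm = objects.HostMapping(ctx, host=host,
                                     cell_mapping=cell_mapping)
            hm.create()
        except db_exc.DBDuplicateEntry:
            # Another scheduler or CLI run created the mapping first;
            # the desired end state exists either way, so this is not
            # an error worth logging as a failure.
            pass
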
16:34:28 <bauzas> I have to admit that none of that tribal knowledge is written in https://docs.openstack.org/nova/latest/admin/scheduling.html
16:34:43 <sean-k-mooney> it's also not in the config option
16:35:01 <sean-k-mooney> i left my initial feedback on the bug when i triaged it as Opinion
16:35:30 <sean-k-mooney> i didn't mark it as Invalid as i thought we should at least discuss it more widely first
16:36:01 <bauzas> for now, we should document that active-passive HA configuration for sure
16:36:16 <sean-k-mooney> for the periodic only
16:36:19 <bauzas> because indeed, we know that there is no eventual consistency between schedulers
16:36:27 <sean-k-mooney> the scheduler should generally be deployed active-active
16:36:42 <bauzas> that's your opinion :)
16:36:44 <sean-k-mooney> but also the periodic has performance issues
16:36:50 <sean-k-mooney> bauzas: it's what we use in our product
16:36:58 <sean-k-mooney> and what almost all installers do by default
16:37:42 <bauzas> https://specs.openstack.org/openstack/nova-specs/specs/abandoned/parallel-scheduler.html
16:37:53 <tkajinam> yeah > almost all installers do by default
16:38:11 <sean-k-mooney> that's a different proposal
16:38:36 <bauzas> I literally quote the first sentence of that spec:
16:38:42 <bauzas> "If you running two nova-scheduler processes they race each other, they don't find out about each others choices until the DB gets updated by the nova-compute resource tracker. This has lead to many deployments opting for an Active/Passive HA setup for the nova-scheduler process."
16:39:10 <tkajinam> people may prefer using act-act for simplicity, and to avoid the clustering mechanism needed to implement active-passive
16:39:24 <tkajinam> at least without a large warning :-P
16:39:31 <sean-k-mooney> bauzas: that does not really apply as of placement
16:39:55 <sean-k-mooney> bauzas: i would consider it very incorrect advice to document that active-active is not supported
16:40:50 <gibi> yeah, the goal of placement is to shrink the race window between parallel schedulers
16:41:07 <gibi> it is a solved problem for those resources that are tracked in placement
16:41:13 <bauzas> I don't disagree with the fact that HA active-active schedulers is a problem to solve
16:41:30 <gibi> for those that are not tracked there, the compute manager has a lock around the claim to prevent overallocation
16:41:36 <gibi> and we have alternates to reschedule
16:41:50 <bauzas> gibi: exactly, hence the A/P mechanism
16:41:57 <gibi> no, this is A A
16:42:10 <gibi> the only A P problem is in the periodic discovery
16:42:13 <bauzas> in the very early times, we were considering reschedules as a way to address the problem
16:42:38 <bauzas> we moved away from that tenet by wanting to reduce reschedules, which indeed led to a broader problem
16:43:10 <gibi> we reduced reschedules with placement
16:43:16 <bauzas> originally, the scheduler wasn't intended to provide an exact solution
16:43:21 <gibi> and we improved reschedules with alternate generation
16:43:30 <bauzas> right, which is why we never solved that problem
16:43:46 <bauzas> we reduced the scope of reschedules, that's it
16:43:49 <sean-k-mooney> we solved it to the point that we recommend active-active as the default
16:43:50 <gibi> in a distributed system you have limits on what you can solve exactly
16:44:11 <gibi> I agree with sean-k-mooney, we can recommend A A
16:44:19 <gibi> actually OSP 18 does A A A
16:44:42 <gibi> (or as many As as you want :D)
16:44:43 <sean-k-mooney> right, our product does not support active-passive, but i believe that was true in 17 as well
16:44:45 <bauzas> A A A is OK to me with resources tracked by placement
16:45:16 <sean-k-mooney> anyway, perhaps we should move on?
16:45:23 <bauzas> agreed
16:45:28 <sean-k-mooney> we can talk about this more but probably don't need to in the meeting
16:45:32 <bauzas> and agreed on the fact we need a spec
16:45:43 <bauzas> but maybe the solution is to add more resources to placement
16:46:04 <bauzas> or consider this a non-solvable problem and accept reschedules as a caveat
16:46:01 <sean-k-mooney> well, that is the general direction anyway
16:46:11 <sean-k-mooney> but that does not address the reported problem
16:46:17 <gibi> on the proposal of distributed discovery, I can suggest doing the discovery outside of the scheduler periodic to avoid the race
16:46:19 <sean-k-mooney> nova-audit would
16:46:28 <bauzas> anyway, moving on
16:46:31 <sean-k-mooney> gibi: yes, it's very different
16:46:41 <bauzas> s3rj1k: fancy writing a spec?
16:46:41 <sean-k-mooney> bauzas: ack, so i had one quick topic
16:46:47 <bauzas> sean-k-mooney: shoot
16:46:51 <s3rj1k> gibi: similar issue would be with the CLI, check out the RFE
16:47:05 <s3rj1k> bauzas: will do
16:47:12 <sean-k-mooney> so last week i raised adding rodolfo to os-vif core
16:47:23 <sean-k-mooney> i sent a mail to the list and no one objected
16:47:24 <gibi> s3rj1k: I mean if you only run the discovery from a single CLI session at a time then I assume there is no race
16:47:37 <sean-k-mooney> so if there is no other objection here i will proceed with that after the call.
16:48:01 <s3rj1k> gibi: yes, need external control on how the CLI gets run
16:48:16 <s3rj1k> let's move on, yes
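[For context: the CLI path gibi and s3rj1k are discussing is nova-manage. A sketch of running discovery serially from deployment tooling instead of (or in addition to) the scheduler periodic; it assumes the tooling itself guarantees only one invocation runs at a time:]

    # Run host discovery from a single control point (for example a
    # post-deploy hook) after adding computes; --verbose lists the
    # hosts that were mapped.
    nova-manage cell_v2 discover_hosts --verbose
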
16:48:31 <gibi> sean-k-mooney: no objection on my side
16:48:51 <bauzas> sean-k-mooney: no objections indeed
16:49:17 <sean-k-mooney> ack, so that is all i had
16:49:36 <tkajinam> I have no objections, rather a +1 :-) (I'm not a core, though)
16:50:18 <sean-k-mooney> i'll send a mail to the list and then i'll add them after that
16:50:37 <sean-k-mooney> just to keep a record of it beyond this meeting
16:51:59 <bauzas> ++
16:52:07 <bauzas> okay, then I think we're done for today
16:52:11 <bauzas> anything else?
16:52:39 <bauzas> looks like not
16:52:44 <bauzas> have a good end of day
16:52:47 <bauzas> thanks all
16:52:50 <bauzas> #endmeeting