16:00:27 #startmeeting nova
16:00:27 Meeting started Tue Nov 26 16:00:27 2024 UTC and is due to finish in 60 minutes. The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:27 The meeting name has been set to 'nova'
16:01:18 #link https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
16:01:27 o/
16:01:33 o/
16:01:48 hi all
16:02:27 o/
16:03:30 hey
16:04:22 starting slowly
16:04:31 #topic Bugs (stuck/critical)
16:04:49 #info No Critical bug
16:05:02 #info Add yourself to the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster
16:05:08 any questions about bugs?
16:07:06 ok, moving on
16:07:35 #topic Gate status
16:07:43 #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs
16:07:48 #link https://etherpad.opendev.org/p/nova-ci-failures-minimal
16:07:59 #link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&branch=stable%2F*&branch=master&pipeline=periodic-weekly&skip=0
16:08:14 #info Please look at the gate failures and file a bug report with the gate-failure tag.
16:08:19 #info Please try to provide a meaningful comment when you recheck
16:08:29 I saw a couple of failures but those are known issues
16:08:40 anything about CI failures that's new?
16:08:52 (all periodics are green)
16:10:00 looks like not, moving on
16:10:09 #topic Release Planning
16:10:15 #link https://releases.openstack.org/epoxy/schedule.html
16:10:20 #action bauzas to add Epoxy nova deadlines in the schedule
16:10:45 I'm mostly done with the patch proposal but I need to fix something before uploading it
16:11:03 #topic Review priorities
16:11:10 #link https://etherpad.opendev.org/p/nova-2025.1-status
16:11:28 the page should be up to date, feel free to use it and amend it
16:12:46 anything about that?
16:13:15 o/
16:13:24 o/ nothing from me on that topic
16:13:33 cool
16:13:41 #topic Stable Branches
16:13:49 elodilles: shoot
16:14:02 #info stable/2024.2 gate seems to be OK
16:14:12 #info stable/2024.1 gate is blocked on grenade-skip-level & stable/2023.2 is blocked on nova-grenade-multinode
16:14:21 failure is due to the stable/2023.1->unmaintained/2023.1 transition; devstack and grenade fixes are proposed
16:14:52 and actually the 2024.1 branch fix (grenade) patch is already in the gate queue
16:15:05 though: another workaround is to set these jobs as non-voting, given that the gate should not rely on an unmaintained branch
16:15:21 see further details:
16:15:28 #info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci
16:15:58 and that's all from me about stable branches now
16:16:09 thanks
16:16:13 np
16:16:30 #topic vmwareapi 3rd-party CI efforts Highlights
16:16:33 fwiesel: around?
16:17:20 looks like he's AFK
16:17:24 no worries, moving on
16:17:28 Sorry, I am here
16:17:31 ah
16:17:41 anything to raise from your side?
16:17:59 There was a regression in oslo.utils (master) and I have created a change to fix it: https://review.opendev.org/c/openstack/oslo.utils/+/936247
16:18:23 Hopefully the builds will be back to the two failures and I will tackle those then.
16:18:26 That's from my side
16:18:28 ah, is that related to removing netifaces?
16:18:55 okay, gtk
16:18:57 thanks
16:19:11 * tkajinam is aware of the proposed fix and will ping the other cores to get that in
16:19:23 nice, thanks tkajinam
16:19:24 fwiesel, if you need a new release with the fix early then ping me
16:19:35 once that is merged
16:20:01 tkajinam: Thanks, I'll let you know
16:20:39 cool
16:20:47 then moving to the last item from the agenda
16:20:54 #topic Open discussion
16:21:01 nothing in the agenda, so anything, anyone?
16:21:11 there is this https://bugs.launchpad.net/nova/+bug/2089386
16:21:34 i have one followup from last week too
16:21:48 let's start with s3rj1k's topic
16:22:29 ok, s3rj1k, shoot
16:23:30 idea is to allow host discovery to be concurrent, both CLI and internal, using distributed locking
16:24:25 so perhaps i can provide some context
16:24:46 this is mostly needed for k8s-like envs where discovery is run in multiple places
16:24:47 s3rj1k is interested in using the discover hosts periodic in an HA env
16:25:09 s3rj1k: I think that topic requires a proper discussion that can't be done during a meeting
16:25:18 currently we require that if you use the periodic it's enabled on at most one host
16:25:29 they would like to address that pain point
16:25:35 if we want to discuss the design, it has to be an async conversation in a properly formatted document
16:26:04 that's the reason why we introduced our specification program for those kinds of feature requests
16:26:14 bauzas: spec? or is an RFE enough this time?
16:26:30 so this would definitely be a spec if you were going to work on it
16:26:36 s3rj1k: are you familiar with spec writing or do you need guidance?
16:27:03 bauzas: done one for neutron, so all ok
16:27:11 i think before going that far, however, s3rj1k wanted some initial feedback on whether this is in scope for nova to fix
16:27:44 sean-k-mooney: well, I'm not sure we have a quorum today for such a design discussion
16:28:10 if that was something before the PTG, we would have said "sure, just add that to the PTG and we'll discuss it"
16:28:24 that's still an option
16:28:37 i suggested that s3rj1k bring it here to advertise that it exists
16:28:47 honestly, I haven't yet formally written the nova deadlines for Epoxy but we're already running short on time
16:28:49 and then start either a mailing list or spec discussion after that
16:29:19 what exact problem are we trying to solve then?
16:29:45 currently if you enable the discover hosts periodic task in more than one scheduler you can get a duplicate key error from the db
16:29:47 are we speaking of concurrent nova-scheduler services that need to be HA active-active for X reasons?
16:29:54 as 2 processes can race to create the mappings
16:29:59 leading to errors in the logs
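
(The race described above, two discovery runs inserting the same host mapping, is what the distributed locking idea targets. A minimal sketch, assuming a tooz coordination backend is available, of what that could look like; the backend URL, member id, lock name and run_host_discovery() helper are hypothetical illustrations, not the actual proposal.)

```python
from tooz import coordination

# Hypothetical backend URL and member id; any tooz backend
# (etcd, zookeeper, redis, ...) would work the same way.
coordinator = coordination.get_coordinator(
    'etcd3+http://127.0.0.1:2379', b'nova-scheduler-host-1')
coordinator.start(start_heart=True)

# Whichever scheduler (or CLI run) acquires the lock performs the
# discovery; the others wait instead of racing on the host-mapping
# insert and hitting duplicate key errors.
with coordinator.get_lock(b'discover-hosts'):
    run_host_discovery()  # hypothetical stand-in for the discovery code

coordinator.stop()
```
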
16:30:04 we don't actually support that today
16:30:09 I think we always said that nova-scheduler has to be active-passive
16:30:13 but our documentation on that is kind of lacking
16:30:17 no
16:30:26 I'd pretty much bet we documented it
16:30:28 the scheduler has been supported in active-active for a very long time
16:30:31 never
16:30:35 yes
16:31:03 as far as I can tell TripleO in the past deployed it on all controllers
16:31:12 with placement, we thought that we /could/ run it active-active but there were reasons not to
16:31:21 nope
16:31:26 tkajinam: which was a bug that we raised a couple of times
16:31:33 downstream it's been active-active since like 16, maybe before
16:31:41 and I think TripleO changed it to A-P
16:31:49 for that exact reason
16:31:53 nope
16:31:56 no
16:32:22 ok well i think we need a longer discussion on this RFE request
16:32:36 likely a spec, and we probably don't have time to complete it in Epoxy
16:32:50 but we should discuss this more async
16:33:22 no prob, thanks sean-k-mooney for taking the lead on explaining
16:34:28 I have to admit that none of that tribal knowledge is written in https://docs.openstack.org/nova/latest/admin/scheduling.html
16:34:43 it's also not in the config option's help
16:35:01 i left my initial feedback on the bug when i triaged it as Opinion
16:35:30 i didn't mark it as Invalid as i thought we should at least discuss it more widely first
16:36:01 for now, we should document that active-passive HA configuration for sure
16:36:16 for the periodic only
16:36:19 because indeed, we know that there is no eventual consistency between schedulers
16:36:27 the scheduler should generally be deployed active-active
16:36:42 that's your opinion :)
16:36:44 but also the periodic has performance issues
16:36:50 bauzas: it's what we use in our product
16:36:58 and what almost all installers do by default
16:37:42 https://specs.openstack.org/openstack/nova-specs/specs/abandoned/parallel-scheduler.html
16:37:53 yeah > almost all installers do by default
16:38:11 that's a different proposal
16:38:36 I literally quote the first sentence of that spec:
16:38:42 "If you running two nova-scheduler processes they race each other, they don't find out about each others choices until the DB gets updated by the nova-compute resource tracker. This has lead to many deployments opting for an Active/Passive HA setup for the nova-scheduler process."
16:39:10 people may prefer act-act for simplicity, to avoid the clustering mechanism needed for active-passive.
16:39:24 without large warning :-P
16:39:31 bauzas: that does not really apply as of placement
16:39:55 bauzas: i would consider it to be very incorrect advice to document that active-active is not supported
16:40:50 yeah, the goal of placement is to shrink the race window between parallel schedulers
16:41:07 it is a solved problem for those resources that are tracked in placement
16:41:13 I don't disagree with the fact that HA active-active schedulers is a problem to solve
16:41:30 for those that are not tracked there, the compute manager has a lock around the claim to prevent overallocation
16:41:36 and we have alternatives to reschedule
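
(To make that claim flow concrete: a rough sketch of how parallel schedulers avoid racing on placement-tracked resources — ask for allocation candidates, then claim atomically, falling back to an alternate on conflict. The endpoint, token and ids are hypothetical placeholders, and this is a simplified illustration rather than nova's actual scheduler code.)

```python
import requests

# Hypothetical endpoint and token; a real deployment discovers
# placement via the service catalog and authenticates with keystone.
PLACEMENT = 'http://placement.example.com'
HEADERS = {
    'X-Auth-Token': '<token>',
    'OpenStack-API-Version': 'placement 1.36',
}

# 1) Ask placement which providers can fit the requested resources.
resp = requests.get(
    f'{PLACEMENT}/allocation_candidates',
    params={'resources': 'VCPU:1,MEMORY_MB:2048,DISK_GB:10'},
    headers=HEADERS,
)
candidates = resp.json()['allocation_requests']

# 2) Claim a candidate for the instance. Placement applies claims
#    atomically, so if a parallel scheduler grabbed the same
#    inventory first the PUT returns 409 Conflict and we move to
#    the next (alternate) candidate instead of overallocating.
consumer = '<instance-uuid>'  # hypothetical consumer id
for candidate in candidates:
    claim = {
        'allocations': candidate['allocations'],
        'consumer_generation': None,   # None means a brand-new consumer
        'project_id': '<project-id>',  # hypothetical
        'user_id': '<user-id>',        # hypothetical
    }
    put = requests.put(f'{PLACEMENT}/allocations/{consumer}',
                       json=claim, headers=HEADERS)
    if put.status_code == 204:
        break  # claim won; schedule the instance on this host
```
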
16:41:50 gibi: exactly, hence the A/P mechanism
16:41:57 no, this is A-A
16:42:10 the only A-P problem is in the periodic discovery
16:42:13 in the very early times, we were considering reschedules as a way to address the problem
16:42:38 we stopped that tenet by wanting to reduce the reschedules, leading indeed to a broader problem
16:43:10 we reduced reschedules with placement
16:43:16 originally, the scheduler wasn't intended to provide an exact solution
16:43:21 and we improved reschedules with alternate generation
16:43:30 right, which is why we never solved that problem
16:43:46 we reduced the scope of reschedules, that's it
16:43:49 we solved it to the point that we recommend active-active as the default
16:43:50 in a distributed system you have limits on what you can solve exactly
16:44:11 I agree with sean-k-mooney, we can recommend A-A
16:44:19 actually OSP 18 does A A A
16:44:42 (or as many As as you want :D)
16:44:43 right, our product does not support active-passive but i believe that was true in 17 as well
16:44:45 A A A is OK to me with resources tracked by placement
16:45:16 anyway, perhaps we should move on?
16:45:23 agreed
16:45:28 we can talk about this more but probably don't need to in the meeting
16:45:32 and agreed on the fact we need a spec
16:45:43 but maybe the solution is to add more resources to placement
16:46:01 well, that is the general direction anyway
16:46:04 or consider this a non-solvable problem and accept reschedules as a caveat
16:46:11 but that does not address the reported problem
16:46:17 on the proposal of a distributed discover, I can suggest doing the discover outside of a scheduler periodic to avoid the race
16:46:19 nova-audit would
16:46:28 anyway, moving on
16:46:31 gibi: yes, it's very different
16:46:41 s3rj1k: fancy writing a spec?
16:46:41 bauzas: ack, so i had one quick topic
16:46:47 sean-k-mooney: shoot
16:46:51 gibi: a similar issue would be with the CLI, check out the RFE
16:47:05 bauzas: will do
16:47:12 so last week i raised adding rodolfo to os-vif core
16:47:23 i sent a mail to the list and no one objected
16:47:24 s3rj1k: I mean, if you ensure the discover is only run from a single CLI session at a time then I assume there is no race
16:47:37 so if there is no other objection here i will proceed with that after the call.
16:48:01 gibi: yes, we'd need external control on how the CLI gets run
16:48:16 let's move on, yes
16:48:31 sean-k-mooney: no objection on my side
16:48:51 sean-k-mooney: no objections indeed
16:49:17 ack, so that is all i had
16:49:36 I have no objections but +1 :-) (I'm not a core, though)
16:50:18 i'll send a mail to the list and then i'll add them after that
16:50:37 just to keep a record of it beyond this meeting
16:51:59 ++
16:52:07 okay, then I think we're done for today
16:52:11 anything else?
16:52:39 looks like not
16:52:44 have a good end of day
16:52:47 thanks all
16:52:50 #endmeeting