Friday, 2025-07-04

04:10 *** ykarel__ is now known as ykarel
09:58 <opendevreview> Stephen Finucane proposed openstack/nova master: api: Return 404 on bad project ID for 'os-quota-sets'  https://review.opendev.org/c/openstack/nova/+/937125
10:00 <stephenfin> gibi: sean-k-mooney: Can we get https://review.opendev.org/c/openstack/nova/+/952266 in? It's rather annoying when running unit tests 😅
10:16 <gibi> +2
10:17 <stephenfin> ty
10:51 <opendevreview> Stephen Finucane proposed openstack/nova master: api: Deprecate v2 API  https://review.opendev.org/c/openstack/nova/+/954102
10:51 <opendevreview> Stephen Finucane proposed openstack/nova master: WIP: api: Remove version controller split  https://review.opendev.org/c/openstack/nova/+/954103
11:05 <sean-k-mooney> stephenfin: sure
11:25 <opendevreview> Biser Milanov proposed openstack/nova stable/2024.1: Hardcode the use of iothreads for KVM.  https://review.opendev.org/c/openstack/nova/+/954113
11:28 <opendevreview> Biser Milanov proposed openstack/nova stable/2024.1: Hardcode the use of iothreads for KVM.  https://review.opendev.org/c/openstack/nova/+/954113
11:29 <opendevreview> Biser Milanov proposed openstack/nova stable/2024.1: StorPool: Pass the instance UUID and device_name to os-brick  https://review.opendev.org/c/openstack/nova/+/954115
11:30 <sp-bmilanov> sean-k-mooney: Hi, you can ignore the backport chain https://review.opendev.org/c/openstack/nova/+/954115, it is not meant to actually be merged
11:30 <sean-k-mooney> ack, well the master version is not applicable either
11:31 <sp-bmilanov> yep, I am aware there's an effort to go about adding iothreads another way
11:31 <sean-k-mooney> we are not going to make this a per-host config option
12:50 <sean-k-mooney> masahito: by the way, even if we are past the spec freeze I would encourage you to work on a PoC of the iothread and queue patches based on the current version of the spec. While we may not have time to do an extensive review of it, if you have a working implementation with tempest tests etc. at the start of the new cycle in October, I think it would be reasonable to advocate for
12:50 <sean-k-mooney> reviewing and completing the overall functionality early in the cycle.
12:51 <sean-k-mooney> depending on how early that was done you may even have time to look at the next steps, although that's less likely to happen in time for 2026.1
12:51 <gibi> FYI, I see a significant failure rate in our multi-node gate because RabbitMQ is not accessible from compute1: https://bugs.launchpad.net/nova/+bug/2115980 It started 3 days ago
12:52 <sean-k-mooney> gibi: yeah, so that's a Zuul issue
12:52 <gibi> is there a tracker for it from the infra perspective?
12:52 <sean-k-mooney> gibi: tl;dr in the old implementation, nodesets were not allowed to be provisioned from different providers
12:52 <sean-k-mooney> in the new one they can be, but that breaks our jobs because they use the local IPs and expect all the VMs to be on the same neutron network
12:53 <sean-k-mooney> gibi: in terms of a tracker I am not sure, but dansmith and clarkb were talking about it last night
12:53 <gibi> OK, reading back then...
12:53 <gibi> thanks
12:53 <sean-k-mooney> they know that this is happening and I think they are looking to disable that feature
12:54 <sean-k-mooney> but I don't know if that change has been done yet; the openstack-infra channel is probably the best place to follow up
12:57 <gibi> ack, I've pinged infra about it
12:59 <sean-k-mooney> I believe this is basically just a boolean we need to flip for our tenant. Long term we could refactor the multi-node role that we use in the devstack job to use the floating IPs and stretch the deployment over the WAN, but that is likely to cause other issues so I don't think that will pan out
13:01 <gibi> 14:59 < fungi> gibi: there's https://review.opendev.org/954064 which we should be auto-upgrading to within the next 24 hours
13:01 <gibi> so the fix is in the pipe
13:01 <sean-k-mooney> cool
13:02 <fungi> yeah, we automatically upgrade our zuul servers through rolling restarts over the weekend
13:03 <sean-k-mooney> fungi: I would still suggest that it should be possible to disable the ability to use multiple providers at the nodeset or tenant level
13:03 <fungi> booting a node in another provider for a multi-node job should only ever be a fallback in cases where the job would have otherwise been reported as a node failure
13:04 <fungi> the fact that it was happening in recent cases was a logic bug, which that change addresses
13:04 <sean-k-mooney> fungi: right, but we know our multi-node devstack jobs will never work in that case
13:04 <sean-k-mooney> so would it not be better to allow declaring whether split providers can be tolerated at a job/buildset level
13:04 <fungi> disabling the fallback would just mean always reporting node_failure in those cases rather than trying anyway and possibly failing for other reasons
13:05 <sean-k-mooney> right, which may actually be a better developer experience
13:05 <sean-k-mooney> it means I don't have to debug why it failed just to see it was multi-provider
13:06 <fungi> so the job would still fail, but yes, maybe the benefit is that it returns sooner, doesn't waste as many resources, and gives a clearer failure reason for those
13:06 <sean-k-mooney> if we modified our devstack jobs to always use publicly routable IPs for all our networking that would be one thing
13:07 <sean-k-mooney> fungi: for what it's worth, I am happy the multi-provider capability was finally added; I just wish it was more controllable
13:08 <fungi> yeah, but also the fact that two nodes boot in the same provider region doesn't guarantee low latency between them
13:09 <fungi> anyway, once we're running on the referenced change, the incidence of this particular failure should be on par with the prior frequency of node_failure results in those jobs, which I hope was exceedingly rare to begin with
13:10 <sean-k-mooney> fungi: latency is not really the concern; we do expect routability between the nodes, using their local IPs
13:10 <fungi> ah, yeah, also booting in the same provider region doesn't guarantee that their local addresses can reach each other, though it's a relatively safe assumption
13:13 <sean-k-mooney> fungi: well, that depends on how you configured the provider
13:14 <sean-k-mooney> specifically, you can configure the subnet, and I think the network, in the provider section
13:14 <fungi> right. in many providers we, as the users, aren't configuring that
13:14 <sean-k-mooney> to ensure that all nodes provided by it guarantee that
13:15 <sean-k-mooney> in which case we are assuming there is only one network and that is being used by default
13:15 <sean-k-mooney> which is OK but not ideal
13:15 <fungi> a lot of openstack public clouds do shared provider networks for the server ports, but yes, those can generally all still reach each other even in cases where they allocate out of disjoint networks (they just bounce off a gateway address)
13:16 <opendevreview> Merged openstack/nova master: db: Resolve alembic deprecation warning  https://review.opendev.org/c/openstack/nova/+/952266
13:16 <sean-k-mooney> yep, so our current multi-node jobs all depend on that today
13:18 <sean-k-mooney> ideally we would be able to express that on the nodeset definition, "node_must_be_routeable" or something like that; I assume we can't assume all nodes have IPv6 yet
13:18 <sean-k-mooney> if we could, we could have the existing multi-node role that creates the tunnel mesh do that over IPv6 and then use the IPv4 addresses it provides for our jobs
13:19 <sean-k-mooney> i.e. instead of using the IP from the default route, use the one for the multi-node bridge
13:19 <sean-k-mooney> I'm referring to https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/multi-node-bridge just to be clear
13:20 <sean-k-mooney> if we could rely on that mesh working across providers we would not have to care for the most part, as that VXLAN tunnel should resolve any firewall issues we have
13:21 <sean-k-mooney> although running Ceph RBD over that likely won't be a fun time
13:21 <sean-k-mooney> let's just see how things go once that patch lands
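(A minimal sketch of the address-selection idea above, assuming a Linux node: pick the address bound to the mesh bridge instead of the one behind the default route. The bridge name br-infra is a hypothetical stand-in for whatever interface the multi-node-bridge role actually creates.)

```python
import fcntl
import socket
import struct

SIOCGIFADDR = 0x8915  # Linux ioctl to read an interface's IPv4 address


def default_route_ip() -> str:
    """The IP the host would use toward the internet (what jobs use today)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("203.0.113.1", 9))  # UDP connect; no traffic is sent
        return s.getsockname()[0]


def interface_ip(ifname: str) -> str:
    """The IPv4 address bound to a specific interface, e.g. the VXLAN bridge."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        packed = fcntl.ioctl(
            s.fileno(),
            SIOCGIFADDR,
            struct.pack("256s", ifname.encode()[:15]),
        )
    return socket.inet_ntoa(packed[20:24])


if __name__ == "__main__":
    print("default-route IP:", default_route_ip())
    # "br-infra" is a hypothetical bridge name for illustration
    print("bridge IP:", interface_ip("br-infra"))
```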
13:24 <fungi> another option is an early acceptance check in the job to confirm the nodes are able to reach one another satisfactorily over the interfaces it wants them to use, and then the nodeset can be discarded and the build automatically retried if not. we do something similar in a common role to make sure nodes are able to reach internet resources, for example
13:28 <sean-k-mooney> ya, that's not a bad idea; we could do it as a pre-playbook, which would allow the job to potentially reschedule to a different nodeset
13:28 <sean-k-mooney> that's honestly the best of both worlds, as we may be able to avoid rechecking
13:28 <sean-k-mooney> and if we do it as a role we could selectively add it to the jobs that actually need it
13:29 <sean-k-mooney> that may end up being all the multi-node jobs, but perhaps not
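(A minimal sketch of such a pre-run acceptance check, assuming a script run from a hypothetical role; the peer IPs below are placeholders for the nodeset's private addresses. A non-zero exit in a Zuul pre-run playbook causes the build to be retried, typically on a fresh nodeset.)

```python
#!/usr/bin/env python3
"""Hypothetical pre-run connectivity check for multi-node jobs."""
import subprocess
import sys

PEERS = ["10.0.0.11", "10.0.0.12"]  # placeholder private IPs of the other nodes


def reachable(ip: str) -> bool:
    # Three pings, two-second per-packet timeout, quiet output.
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "2", "-q", ip],
        capture_output=True,
    )
    return result.returncode == 0


def main() -> int:
    unreachable = [ip for ip in PEERS if not reachable(ip)]
    if unreachable:
        print(f"peers unreachable over private IPs: {unreachable}",
              file=sys.stderr)
        return 1  # non-zero in pre-run => Zuul retries the build
    return 0


if __name__ == "__main__":
    sys.exit(main())
```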
14:17 <opendevreview> Takashi Natsume proposed openstack/nova master: Update contributor guide for 2025.2 Flamingo  https://review.opendev.org/c/openstack/nova/+/944603
14:19 <gboutry> Hello nova, is there anything else required for this review (apart from reviews): https://review.opendev.org/c/openstack/nova/+/953737? I'd be quite interested in the backport to 2025.1
14:27 <gibi> gboutry: no, I think we're just waiting for reviews there
14:28 <gibi> but the US is out today and our resident stable branch reviewer elodilles is on PTO
14:28 <gibi> bauzas: sean-k-mooney: ^^ could you hit https://review.opendev.org/c/openstack/nova/+/953737
14:28 <sean-k-mooney> oh, the backport, sure
14:30 <sean-k-mooney> done
14:30 <sean-k-mooney> stephenfin: ^ if you're still here could you also look at that
14:30 <gibi> thanks
14:49 <gboutry> Thank you very much!
15:11 <opendevreview> Balazs Gibizer proposed openstack/nova master: [pci]Keep used dev in Placement regardless of dev_spec  https://review.opendev.org/c/openstack/nova/+/954149
15:14 <sean-k-mooney> gibi: so I did something similar to that for the PCI device table a long time ago
15:14 <sean-k-mooney> gibi: is that only for the Placement side or was there a regression on the device table as well
15:14 <gibi> it is only on the PCI-in-Placement side
15:14 <sean-k-mooney> ack
15:15 <sean-k-mooney> I guess we missed that edge case in the original implementation
15:15 <gibi> btw the PCI tracker side is also a bit buggy: after the inconsistency is handled by deleting the VM, the device that is no longer in the dev_spec is not marked deleted until the compute is restarted again
15:16 <gibi> we had a bug in the edge case handling, and we have an edge case of an edge case during VM deletion that was not really handled well
15:16 <sean-k-mooney> that's because of the caching we have, I think
15:16 <gibi> yep, we don't re-read the hypervisor view and remove the dev
15:16 <sean-k-mooney> right
15:16 <sean-k-mooney> so the actual correct behavior is to refuse to start the compute at all
15:17 <sean-k-mooney> i.e. if a device is in use by a VM and you remove it from the dev spec
15:17 <sean-k-mooney> that should be a hard error
15:17 <gibi> we can change to that, but that is a hard block if one PCI device died
15:17 <sean-k-mooney> but we didn't want to do that when we were fixing the original bug
15:18 <gibi> I have to drop for the weekend, but I have to go back to the bugfix to do the proper tests next week anyhow
15:18 <gibi> o/
15:18 <sean-k-mooney> gibi: yep, although if we change this now we probably should add a workaround flag
15:18 <sean-k-mooney> gibi: o/
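(An illustrative sketch of the trade-off discussed above, not Nova's actual code: either keep an in-use device visible even after it left the dev spec, or hard-error at compute start-up, possibly behind a workaround flag. All names here are hypothetical.)

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class PciDevice:
    address: str
    instance_uuid: Optional[str]  # set when a VM is using the device


def reconcile(devices: List[PciDevice],
              dev_spec_addresses: Set[str],
              strict_startup: bool = False) -> List[PciDevice]:
    """Return the devices the compute should keep reporting to Placement."""
    keep = []
    for dev in devices:
        if dev.address in dev_spec_addresses:
            keep.append(dev)
        elif dev.instance_uuid is not None:
            if strict_startup:
                # The "hard error" option: refuse to start the compute
                # when an in-use device was removed from the dev spec.
                raise RuntimeError(
                    f"{dev.address} is used by instance "
                    f"{dev.instance_uuid} but is no longer in the dev spec")
            # Otherwise keep the used device visible so its Placement
            # allocation stays consistent until the VM goes away.
            keep.append(dev)
        # Devices that are unused and out of the spec are simply dropped.
    return keep
```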
15:19 <gmaan> sean-k-mooney: you asked me to remind you about the manager role series review, whenever you have time https://review.opendev.org/q/topic:%22bp/policy-manager-role-default%22+status:open
15:20 <gmaan> I migrated all existing migration (live and cold) tempest tests to use the manager role; the changes are linked under the same topic
15:21 <gmaan> one tempest change adding abort/force-complete tests is in progress, which I am working on in parallel; those things are covered in nova-side tests too, but I just want to write tempest tests as well if we can
15:22 <sean-k-mooney> cool, I'm trying to start wrapping up for the week so I likely won't get to it till Monday
15:22 <gmaan> no issue, thanks
15:23 <sean-k-mooney> https://review.opendev.org/c/openstack/tempest/+/953847/6/tempest/lib/api_schema/response/compute/v2_1/servers.py so we just didn't have a tempest schema for that before?
15:24 <sean-k-mooney> I guess tempest never tested those APIs in the past then?
15:24 <sean-k-mooney> I feel like any test that tries to use abort/force-complete would be racy
15:25 <sean-k-mooney> unless you're planning to add something to slow down the migration by creating memory load
15:25 <sean-k-mooney> but even then that seems hard to properly test
15:26 <gmaan> yeah, we did not have tests/schema for the list of in-progress live migrations; tempest has a schema for listing all migrations though
15:27 <gmaan> yeah, I am trying, in a try block: if the migration is still going on then perform the force-complete/abort
15:27 <gmaan> I know that is the best we can test, and we log if the test is not able to perform the operation
15:28 <gmaan> I am not 100% convinced that will help or make a good tempest test, but in some cases it can perform the things it wants to
15:31 <gmaan> for abort, it is like: if the test is able to perform the abort migration it passes, otherwise it skips, saying 'abort was not performed because the migration completed before that'
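(An illustrative sketch of that race-tolerant pattern, not the actual tempest change; FakeClient and all of its method names are hypothetical stand-ins for the real compute admin client.)

```python
import unittest


class FakeClient:
    """Pretends the migration finished before we could abort it."""

    def latest_migration_status(self, server_id):
        return "completed"  # simulate losing the race

    def abort_migration(self, server_id):
        raise RuntimeError("migration is not running")


class LiveMigrationAbortTest(unittest.TestCase):
    client = FakeClient()
    server_id = "fake-server"

    def test_abort_in_progress_live_migration(self):
        # Only attempt the abort while the migration is still running;
        # otherwise skip rather than fail, since there is nothing to abort.
        if self.client.latest_migration_status(self.server_id) != "running":
            self.skipTest("abort not performed: migration completed first")
        try:
            self.client.abort_migration(self.server_id)
        except RuntimeError as exc:
            # Abort can still race with completion; skip, don't fail.
            self.skipTest(f"migration finished before abort: {exc}")


if __name__ == "__main__":
    unittest.main()
```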
18:00 <opendevreview> Merged openstack/nova master: Replace utils.spawn_n with spawn  https://review.opendev.org/c/openstack/nova/+/948076
