*** | ykarel__ is now known as ykarel | 04:10 |
opendevreview | Stephen Finucane proposed openstack/nova master: api: Return 404 on bad project ID for 'os-quota-sets' https://review.opendev.org/c/openstack/nova/+/937125 | 09:58 |
stephenfin | gibi: sean-k-mooney: Can we get https://review.opendev.org/c/openstack/nova/+/952266 in? It's rather annoying when running unit tests 😅 | 10:00 |
gibi | +2 | 10:16 |
stephenfin | ty | 10:17 |
opendevreview | Stephen Finucane proposed openstack/nova master: api: Deprecate v2 API https://review.opendev.org/c/openstack/nova/+/954102 | 10:51 |
opendevreview | Stephen Finucane proposed openstack/nova master: WIP: api: Remove version controller split https://review.opendev.org/c/openstack/nova/+/954103 | 10:51 |
sean-k-mooney | stephenfin: sure | 11:05 |
opendevreview | Biser Milanov proposed openstack/nova stable/2024.1: Hardcode the use of iothreads for KVM. https://review.opendev.org/c/openstack/nova/+/954113 | 11:25 |
opendevreview | Biser Milanov proposed openstack/nova stable/2024.1: Hardcode the use of iothreads for KVM. https://review.opendev.org/c/openstack/nova/+/954113 | 11:28 |
opendevreview | Biser Milanov proposed openstack/nova stable/2024.1: StorPool: Pass the instance UUID and device_name to os-brick https://review.opendev.org/c/openstack/nova/+/954115 | 11:29 |
sp-bmilanov | sean-k-mooney: Hi, you can ignore the backport chain: https://review.opendev.org/c/openstack/nova/+/954115, it is not meant to actually be merged | 11:30 |
sean-k-mooney | ack, well, the master version is not applicable either | 11:30 |
sp-bmilanov | yep, I am aware there's an effort to go about adding iothreads another way | 11:31 |
sean-k-mooney | we are not going to make this a per-host config option | 11:31 |
sean-k-mooney | masahito: by the way, even though we are past the spec freeze, I would encourage you to work on a PoC of the iothread and queue patches based on the current version of the spec. While we may not have time to do an extensive review of it, if you have a working implementation with tempest tests etc. at the start of the new cycle in October, I think it would be reasonable to advocate for | 12:50 |
sean-k-mooney | reviewing and completing the overall functionality early in the cycle. | 12:50 |
sean-k-mooney | depending on how early that is done, you may even have time to look at the next steps, although that is less likely to happen in time for 2026.1 | 12:51 |
gibi | fyi I see a significant failure rate in our multi-node gate because rabbit is not accessible from compute1 https://bugs.launchpad.net/nova/+bug/2115980 It started 3 days ago | 12:51 |
sean-k-mooney | gibi: ya, so that's a zuul issue | 12:52 |
gibi | is there a tracker for it from the infra perspective? | 12:52 |
sean-k-mooney | gibi: tl;dr in the old implementation, nodesets were not allowed to be provisioned from different providers | 12:52 |
sean-k-mooney | in the new one they can, but that breaks our jobs because they use the local IPs, since they expect all the VMs to be on the same neutron network | 12:52 |
sean-k-mooney | gibi: in terms of a tracker I am not sure, but dansmith and clarkb were talking about it last night | 12:53 |
gibi | OK, reading back then... | 12:53 |
gibi | thanks | 12:53 |
sean-k-mooney | they know that this is happening and I think they are looking to disable that feature | 12:53 |
sean-k-mooney | but I don't know if that change has been done yet; the openstack-infra channel is probably the best place to follow up | 12:54 |
gibi | ack, I've pinged infra about it | 12:57 |
sean-k-mooney | I believe this is basically just a boolean we need to flip for our tenant. Long term we could refactor the multi-node role that we use in the devstack job to use the floating IPs and stretch the deployment over the WAN, but that is likely to cause other issues, so I don't think that will pan out | 12:59 |
gibi | 14:59 < fungi> gibi: there's https://review.opendev.org/954064 which we should be auto-upgrading to within the next 24 hours | 13:01 |
gibi | so fix is in the pipe | 13:01 |
sean-k-mooney | cool | 13:01 |
fungi | yeah, we automatically upgrade our zuul servers through rolling restarts over the weekend | 13:02 |
sean-k-mooney | fungi: I would still suggest that it should be possible to disable the ability to use multiple providers at the nodeset or tenant level | 13:03 |
fungi | booting a node in another provider for a multi-node job should only ever be a fallback in cases where the job would have otherwise been reported as a node failure | 13:03 |
fungi | the fact that it was happening in recent cases was a logic bug, which that change addresses | 13:04 |
sean-k-mooney | fungi: right, but we know our multi-node devstack jobs will never work in that case | 13:04 |
sean-k-mooney | so would it not be better to allow declaring whether split providers can be tolerated at the job/buildset level | 13:04 |
fungi | disabling the fallback would just mean always reporting node_failure in those cases rather than trying anyway and possibly failing for other reasons | 13:04 |
sean-k-mooney | right, which may actually be a better developer experience | 13:05 |
sean-k-mooney | it means I don't have to debug why it failed just to see it was multi-provider | 13:05 |
fungi | so the job would still fail, but yes maybe the benefit is that it returns sooner, doesn't waste as many resources, and gives a clearer failure reason for those | 13:06 |
sean-k-mooney | if we modified our devstack jobs to always use publicly routable IPs for all our networking, that would be one thing | 13:06 |
sean-k-mooney | fungi: for what it's worth, I am happy the multi-provider capability was finally added; I just wish it was more controllable | 13:07 |
fungi | yeah, but also the fact that two nodes boot in the same provider region doesn't guarantee low latency between them | 13:08 |
fungi | anyway, once we're running on the referenced change, the incidence of this particular failure should be on par with prior frequency of node_failure results in those jobs, which i hope was exceedingly rare to begin with | 13:09 |
sean-k-mooney | fungi: latency is not really the concern; we do expect routability between the nodes, but using their local IPs | 13:10 |
fungi | ah, yeah also booting in the same provider region doesn't guarantee that their local addresses can reach each other, though it's a relatively safe assumption | 13:10 |
sean-k-mooney | fungi: well, that depends on how you configured the provider | 13:13 |
sean-k-mooney | specifically, you can configure the subnet, and I think the network, in the provider section | 13:14 |
fungi | right. in many providers we, as the users, aren't configuring that | 13:14 |
sean-k-mooney | to ensure that all nodes provided by it guarantee that | 13:14 |
sean-k-mooney | in which case we are assuming there is only one network and that it is being used by default | 13:15 |
sean-k-mooney | which is OK but not ideal | 13:15 |
fungi | a lot of openstack public clouds do shared provider networks for the server ports, but yes those can generally all still reach each other even in cases where they allocate out of disjoint networks (they just bounce off a gateway address) | 13:15 |
opendevreview | Merged openstack/nova master: db: Resolve alembic deprecation warning https://review.opendev.org/c/openstack/nova/+/952266 | 13:16 |
sean-k-mooney | yep, so our current multi-node jobs all depend on that today | 13:16 |
sean-k-mooney | ideally we would be able to express that on the nodeset definition, "node_must_be_routeable" or something like that; I assume we can't assume all nodes have IPv6 yet | 13:18 |
sean-k-mooney | if we could, we could have the existing multi-node role that creates the tunnel mesh do that over IPv6 and then use the IPv4 addresses it provides for our jobs | 13:18 |
sean-k-mooney | i.e. instead of using the IP from the default route, use the one for the multi-node bridge | 13:19 |
sean-k-mooney | I'm referring to https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/multi-node-bridge just to be clear | 13:19 |
sean-k-mooney | if we could rely on that mesh working across providers, we would not have to care for the most part, as that vxlan tunnel should resolve any firewall issues we have | 13:20 |
sean-k-mooney | although running ceph rbd over that likely won't be a fun time | 13:21 |
sean-k-mooney | let's just see how things go once that patch lands | 13:21 |
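To make the idea in the preceding messages concrete, here is a minimal sketch (not the actual zuul-jobs code) of selecting the multi-node-bridge address instead of the default-route IP. It assumes a Linux node with iproute2 and that the bridge is named `br-infra`, the default used by the multi-node-bridge role; the fallback helper is purely illustrative.

```python
# Sketch only: prefer the multi-node-bridge address over the default-route IP.
# Assumes a Linux node with iproute2; "br-infra" is the bridge name the
# zuul-jobs multi-node-bridge role uses by default.
import json
import subprocess
from typing import Optional


def bridge_ipv4(bridge: str = "br-infra") -> Optional[str]:
    """Return the first IPv4 address bound to the given bridge, if any."""
    out = subprocess.run(
        ["ip", "-json", "addr", "show", "dev", bridge],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return None  # the bridge does not exist on this node
    for iface in json.loads(out.stdout):
        for addr in iface.get("addr_info", []):
            if addr.get("family") == "inet":
                return addr["local"]
    return None


def default_route_ipv4() -> str:
    """Fallback: the source IP the kernel would use for the default route."""
    out = subprocess.run(
        ["ip", "-json", "route", "get", "8.8.8.8"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)[0]["prefsrc"]


# Jobs would then use the mesh address when present and only fall back to
# the provider-local IP otherwise.
host_ip = bridge_ipv4() or default_route_ipv4()
```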
fungi | another option is an early acceptance check in the job to confirm the nodes are able to reach one another satisfactorily over the interfaces it wants them to use, and then the nodeset can be discarded and the build automatically retried if not. we do something similar in a common role to make sure nodes are able to reach internet resources, for example | 13:24 |
sean-k-mooney | ya, that's not a bad idea; we could do it as a pre-playbook, which would allow the job to potentially reschedule to a different nodeset | 13:28 |
sean-k-mooney | that's honestly the best of both worlds, as we may be able to avoid rechecking | 13:28 |
sean-k-mooney | and if we do it as a role, we could selectively add it to the jobs that actually need it | 13:28 |
sean-k-mooney | that may end up being all the multi-node jobs, but perhaps not | 13:29 |
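A rough sketch of what such an early acceptance check could look like, written here as a standalone Python script rather than an Ansible role; the peer IP list and the exit-code convention are assumptions for illustration, not an existing zuul-jobs interface.

```python
# Sketch of an early node-reachability gate for a multi-node job. Exiting
# non-zero from a pre-run phase is what would let Zuul retry the build on a
# fresh nodeset. The PEERS list is a stand-in for the job's real inventory.
import subprocess
import sys

PEERS = ["192.0.2.10", "192.0.2.11"]  # hypothetical private IPs of the other nodes


def reachable(ip: str, attempts: int = 3) -> bool:
    """Ping a peer a few times; any success counts as reachable."""
    for _ in range(attempts):
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", ip],
            stdout=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            return True
    return False


unreachable = [ip for ip in PEERS if not reachable(ip)]
if unreachable:
    print(f"peers not reachable over local IPs: {unreachable}", file=sys.stderr)
    sys.exit(1)  # fail the pre-phase so the build is rescheduled
```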
opendevreview | Takashi Natsume proposed openstack/nova master: Update contributor guide for 2025.2 Flamingo https://review.opendev.org/c/openstack/nova/+/944603 | 14:17 |
gboutry | Hello nova, is there anything else required for this change (apart from reviews): https://review.opendev.org/c/openstack/nova/+/953737? I'd be quite interested in the backport to 2025.1 | 14:19 |
gibi | gboutry: no, I think we are just waiting for reviews there | 14:27 |
gibi | but the US is out today and our resident stable branch reviewer elodilles is on PTO | 14:28 |
gibi | bauzas: sean-k-mooney: ^^ could you hit https://review.opendev.org/c/openstack/nova/+/953737 | 14:28 |
sean-k-mooney | oh, the backport, sure | 14:28 |
sean-k-mooney | done | 14:30 |
sean-k-mooney | stephenfin: ^ if you're still here, could you also look at that | 14:30 |
gibi | thanks | 14:30 |
gboutry | Thank you very much! | 14:49 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [pci]Keep used dev in Placement regardless of dev_spec https://review.opendev.org/c/openstack/nova/+/954149 | 15:11 |
sean-k-mooney | gibi: so I did something similar to that for the PCI device table a long time ago | 15:14 |
sean-k-mooney | gibi: is that only for the placement side, or was there a regression on the device table as well | 15:14 |
gibi | it is only the PCI in Placement side | 15:14 |
sean-k-mooney | ack | 15:14 |
sean-k-mooney | I guess we missed that edge case in the original implementation | 15:15 |
gibi | btw the PCI tracker side is also a bit buggy: after the inconsistency is handled by deleting the VM, the device that is no longer in the dev spec is not marked deleted until the compute is restarted again | 15:15 |
gibi | we had a bug in the edge case handling and we have an edge case of an edge case during VM deletion that was not really handled well | 15:16 |
sean-k-mooney | that's because of the caching we have, I think | 15:16 |
gibi | yepp, we don't re-read the hypervisor view and remove the dev | 15:16 |
sean-k-mooney | right | 15:16 |
sean-k-mooney | so the actual correct behavior is to refuse to start the compute at all | 15:16 |
sean-k-mooney | i.e. if a device is in use by a VM and you remove it from the dev spec | 15:17 |
sean-k-mooney | that should be a hard error | 15:17 |
gibi | we can change to that, but that is a hard block if one PCI device has died | 15:17 |
sean-k-mooney | but we didn't want to do that when we were fixing the original bug | 15:17 |
gibi | I have to drop for the weekend, but I need to get back to the bugfix anyway next week to do the proper tests | 15:18 |
gibi | o/ | 15:18 |
sean-k-mooney | gibi: yep, although if we change this now we probably should add a workaround flag | 15:18 |
sean-k-mooney | gibi: o/ | 15:18 |
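For illustration, a sketch of how the hard startup error and the workaround flag discussed above could fit together. The option name, the group registration, and the tracker data structures are hypothetical; nova's actual PCI tracker code is organised differently.

```python
# Illustrative sketch only: a hard startup error with an escape-hatch
# workaround flag. The option name and the device attributes used here are
# hypothetical, not nova's real configuration or PCI tracker internals.
from oslo_config import cfg

CONF = cfg.CONF
CONF.register_opts(
    [cfg.BoolOpt(
        "allow_used_pci_device_removal",  # hypothetical option name
        default=False,
        help="If True, log a warning instead of refusing to start "
             "nova-compute when a PCI device still assigned to an instance "
             "has been removed from [pci]device_spec.",
    )],
    group="workarounds",
)


def check_dev_spec_consistency(used_devices, dev_spec_addresses):
    """Fail hard if an in-use device is no longer matched by the dev spec."""
    orphaned = [d for d in used_devices if d.address not in dev_spec_addresses]
    if orphaned and not CONF.workarounds.allow_used_pci_device_removal:
        raise RuntimeError(
            "PCI devices %s are assigned to instances but no longer match "
            "[pci]device_spec; refusing to start"
            % [d.address for d in orphaned]
        )
```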
gmaan | sean-k-mooney: you asked me to remind you about the manager-role series review, whenever you have time https://review.opendev.org/q/topic:%22bp/policy-manager-role-default%22+status:open | 15:19 |
gmaan | I migrated all existing migration (live and cold) tempest tests to use the manager role; the changes are linked under the same topic | 15:20 |
gmaan | one tempest change adding abort/force-complete tests is in progress, which I am working on in parallel; those things are also covered in nova-side tests, but I want to write tempest tests as well if we can | 15:21 |
sean-k-mooney | cool, I'm trying to start wrapping up for the week, so I likely won't get to it till Monday | 15:22 |
gmaan | no issue, thanks | 15:22 |
sean-k-mooney | https://review.opendev.org/c/openstack/tempest/+/953847/6/tempest/lib/api_schema/response/compute/v2_1/servers.py so we just didn't have a tempest schema for that before? | 15:23 |
sean-k-mooney | I guess tempest never tested those APIs in the past then? | 15:24 |
sean-k-mooney | I feel like any test that tries to use abort/force complete would be racy | 15:24 |
sean-k-mooney | unless you're planning to add something to slow down the migration by creating memory load | 15:25 |
sean-k-mooney | but even then, that seems hard to test properly | 15:25 |
gmaan | yeah, we did not have tests or a schema for the in-progress live migration list; tempest has a schema for the migration list (list all migrations) though | 15:26 |
gmaan | yeah, I am trying it in a try block: if the migration is still going on, then perform the force complete/abort | 15:27 |
gmaan | I know that is the best we can test, and we log if the test is not able to perform the operation | 15:27 |
gmaan | I am not 100% convinced that will make a good tempest test, but in some cases it can perform the things it wants to | 15:28 |
gmaan | for abort, it is like: if the test is able to perform the abort migration it passes, otherwise the test is skipped saying 'abort was not performed because the migration completed before that' | 15:31 |
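A condensed sketch of the race-tolerant pattern gmaan describes: abort only if the migration is still running, otherwise skip. The client and helper method names approximate tempest's compute clients and are assumptions, not the code in the change under review.

```python
# Sketch of a race-tolerant abort test, assuming a tempest admin test class
# context (self.create_test_server, admin clients, etc.). Client method names
# are approximations of tempest's compute client interface.
def test_abort_live_migration(self):
    server = self.create_test_server(wait_until="ACTIVE")
    self.admin_servers_client.live_migrate_server(
        server["id"], host=None, block_migration="auto")

    # The migration may already be finished by the time we look for it.
    migrations = self.admin_migrations_client.list_migrations(
        instance_uuid=server["id"])["migrations"]
    running = [m for m in migrations if m["status"] == "running"]
    if not running:
        raise self.skipException(
            "abort was not performed because the migration completed first")

    # DELETE /servers/{id}/migrations/{migration_id} aborts the migration;
    # even this can race with completion, which the real test must tolerate.
    self.admin_servers_client.delete_migration(
        server["id"], running[0]["id"])
```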
opendevreview | Merged openstack/nova master: Replace utils.spawn_n with spawn https://review.opendev.org/c/openstack/nova/+/948076 | 18:00 |