Thursday, 2025-08-14

gibido we have a node pool shortage? I see that the tip of the integrated gate queue is blocked waiting for an executor for small jobs like openstack-tox-docs 14:47
fungigibi: there's some inefficiency with node assignments at the moment; per discussion in #opendev, corvus is working on tweaking it to address that14:50
gibifungi: thanks for the information14:51
fungiwe're still working to iron out corner cases in zuul-launcher (the nodepool-launcher rewrite/replacement) that emerged under load and weren't apparent from basic test scenarios14:52
fungiin most cases it's dealing with random errors and disconnects from cloud providers that have proved to be hard to predict and design defensively for14:53
fungiso there have been a series of incremental improvements addressing the more common problems we witness, which then makes it easier and easier to spot the less common ones14:54
gibithanks for the hard work14:55
opendevreviewStephen Finucane proposed openstack/pbr master: util: Fix deprecation warnings  https://review.opendev.org/c/openstack/pbr/+/95726216:07
opendevreviewStephen Finucane proposed openstack/pbr master: util: Make handling of description_file clearer  https://review.opendev.org/c/openstack/pbr/+/95726316:07
opendevreviewStephen Finucane proposed openstack/pbr master: util: Skip normalization of description if README opts present  https://review.opendev.org/c/openstack/pbr/+/95726416:07
opendevreviewStephen Finucane proposed openstack/pbr master: util: Deprecate description_file  https://review.opendev.org/c/openstack/pbr/+/95726516:07
clarkbgibi: it is worth noting that at least part of the problem is due to https://bugs.launchpad.net/nova/+bug/2095364. There is a feedback loop where bugs that end up in production impact our ability to boot and manage VMs with nova.16:49
clarkbJust something to keep in mind when dealing with problems like that. In a way they may seem innocuous ("just use an older microversion"), but there are two problems here. The first is discoverability: how would an end user know to even attempt that? And second, bugs that break our ability to manage nodes impact the ability to land other changes16:50
fungiyeah, hopefully the provider upgrades with the fix soon16:50
clarkbI guess we don't use microversions when listing nodes within zuul, so that wasn't impacting the service itself. Just our ability to debug things, since listings stopped working with the CLI tool17:01
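For anyone following along, the "older microversion" workaround is roughly this shape with openstacksdk. A minimal sketch, assuming the compute proxy honours a manually assigned default_microversion; the "mycloud" cloud name and the "2.1" version are illustrative placeholders, not values taken from the bug report:

    import openstack

    # Connect using a clouds.yaml entry; "mycloud" is a placeholder name.
    conn = openstack.connect(cloud="mycloud")

    # Pin an older compute microversion instead of whatever the client
    # would otherwise negotiate; "2.1" is illustrative only.
    conn.compute.default_microversion = "2.1"

    # The server listing clarkb mentions would then run with the pinned version.
    for server in conn.compute.servers():
        print(server.id, server.status)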
sean-k-mooneyI was going to ask a similar question based on the node failure in https://review.opendev.org/c/openstack/nova/+/94241317:32
clarkbsean-k-mooney: ya, it's all related. There was a significant enough slowdown in the system last night that we were consistently losing a race condition where zuul thought allocated nodes were not properly allocated and it bailed out early17:33
clarkblast night we restarted the system to pick up some fixes that we had been working on landing through the day, which corrected that problem for the most part17:34
clarkbI think the issues now are in the other direction: rather than marking things failed because it's checking too quickly when processing is slow, we're just slowly assigning nodes17:34
sean-k-mooneyshould we hold off on rechecks until tomorrow17:34
sean-k-mooneyi.e. give it a bit of time to stabilise17:35
clarkbmaybe? on the one hand, having the load helps verify that the fixes we've been pushing out are working17:35
sean-k-mooneythe bug you reference has backports proposed to all affected branches and there is only one pending merge17:35
clarkbsean-k-mooney: yes but it also first hit us in april17:36
clarkbit's great it's getting fixed. But at the same time it seems like we really drag our feet on these things that directly impact zuul17:36
sean-k-mooneywe fixed it in master in march :)17:36
clarkbbut when zuul goes sideways we've got people banging on the doors demanding answers almost immediately17:36
clarkbI'm just trying to point out how there is an imbalance in that feedback loop where we're doing the best with what we've been given17:37
sean-k-mooneyclarkb: for what it's worth, no one reported this as an issue for our ci to nova17:37
sean-k-mooneyat least not that I'm aware of17:37
sean-k-mooneywhen we were working on this it was just being fixed as a general slowdown in api access etc.17:37
sean-k-mooneyi.e. not a gate blocker or critical issue17:38
clarkbit's not a general slowdown fwiw. It's a "if you use default client configurations then the cloud stops working for you" bug17:38
clarkband there isn't any way to know that you can work around it without finding the issue in launchpad17:38
sean-k-mooneyit only happens if you have instances that do not have a request spec 17:38
sean-k-mooneywhich basically means you have db corruption17:38
fungiwhich was likely the result of some other bug or failure, yes17:39
clarkbright, but that's also caused by nova... I as the end user or as the operator don't really have a say in that17:39
sean-k-mooneybut apparently that happens more than we are aware of...17:39
fungiwe churn enough boots/deletes that random stray cosmic rays turn up a lot of weird failure cases in providers17:39
sean-k-mooneyfungi: ya, part of the problem is we often don't get visibility into the cause of the original issue17:40
fungiback when i worked in data center networking, i kept close tabs on solar activity because it generally meant an uptick in random bitflips in memory on huge core routers17:40
sean-k-mooneywe know there can be instances that end up in this state; we have not been told anything about how they got that way, so we are just fixing a side effect rather than the root cause with this change17:40
fungi(and my employer was too cheap to spring for ecc ram in the routers)17:40
clarkbsean-k-mooney: ya, I don't know how they end up there either. But rackspace may be willing to dig into it more if you ask?17:41
clarkbjamesdenton isn't in here but has been helpful in tracking things down like that in the past17:41
sean-k-mooneyclarkb: well, if infra has issues like this going forward, feel free to shout at us17:42
clarkbsean-k-mooney: fwiw there is another issue we discovered around quota errors, and melwitt helped track it down iirc.17:42
clarkblet me see if I can find it17:42
sean-k-mooneyis this the other slow api over time thread on the mailing list17:42
sean-k-mooneyI didn't follow that closely17:43
fungiunrelated i think17:44
clarkbsean-k-mooney: https://bugs.launchpad.net/nova/+bug/185632917:44
clarkbzuul wants to use the fault field to determine if the issue was running out of quota (which is considered a retryable error since quota usage should drop eventually) or if the problem is something we can't expect to resolve itself through normal usage17:44
clarkbbut due to this bug in some clouds the fault field is simply unset so we have to do the conservative thing and assume it won't be resolved17:45
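As a rough sketch of the classification zuul is after (not zuul's actual code): with openstacksdk you could inspect the fault of an ERROR'd server and treat only quota-looking messages as retryable, while an empty fault, which is the bug being discussed, falls through to the conservative answer. The "mycloud" name and the simple string match are placeholders:

    import openstack

    def is_retryable(server):
        """Guess whether an ERROR'd server failed on quota (worth retrying
        later) or on something that will not clear up by itself."""
        fault = server.fault or {}
        if not isinstance(fault, dict):
            fault = {"message": getattr(fault, "message", "")}
        message = (fault.get("message") or "").lower()
        if not message:
            # Bug 1856329 territory: no fault recorded at all, so assume
            # the failure will not resolve itself.
            return False
        return "quota" in message

    conn = openstack.connect(cloud="mycloud")
    for server in conn.compute.servers():
        if server.status == "ERROR":
            print(server.id, "retry later" if is_retryable(server) else "give up")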
sean-k-mooneyI'm trying to parse the description17:45
sean-k-mooneyclarkb: ah ok, and this is saying that it depends on whether the instance is buried in cell0, meaning it passed the api quota check but then failed to schedule to a host17:46
sean-k-mooneyor it's in a real cell where it is empty?17:46
fungithose are also nuances that are hard for us to determine as end-users of the service17:47
clarkbright, we have no idea17:47
sean-k-mooneywell, we return a "no valid host" if it's not a quota issue17:47
clarkbthe problem is that it isn't included regardless of the error, aiui17:48
clarkbit's simply unset17:48
sean-k-mooneyso this bug is very old and not super helpful. could you explain what is happening and what you want to happen17:48
clarkbsean-k-mooney: server create foo; server show foo; the result is the server is in ERROR state but the fault field is empty17:49
sean-k-mooneyright, just got that from the commit message https://review.opendev.org/c/openstack/nova/+/69917617:49
clarkbwhat should happen is that any time a server is in an error state, some sort of fault value is set indicating to the user the type of error, so they can either adjust their usage of the cloud, wait for quota to become free, or contact their cloud provider17:49
sean-k-mooneythe problem is the behaviour between server show and server list differs17:49
sean-k-mooneyand in the server show case we have a bug where we don't get the fault from the cell db17:49
sean-k-mooneythe bug is reporting that it works for show but not for list by the way17:52
sean-k-mooneyclarkb: let me go rebase that and see if we can get up-to-date ci results17:53
sean-k-mooneyI'm not convinced the patch is entirely correct17:55
clarkbsean-k-mooney: it's possible we're doing list and not show and I've got them backwards17:55
clarkba trivial test case would be to do a list and a show and compare the values17:55
clarkbthey should both match and both be non-empty17:55
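Something along these lines would exercise that comparison with openstacksdk, checking the fault reported by the list path against the show path for each ERROR'd server (the "mycloud" name is a placeholder):

    import openstack

    conn = openstack.connect(cloud="mycloud")

    for listed in conn.compute.servers():
        if listed.status != "ERROR":
            continue
        # Fetch the same server individually (the "show" path) and compare.
        shown = conn.compute.get_server(listed.id)
        print(listed.id)
        print("  fault via list:", listed.fault)
        print("  fault via show:", shown.fault)
        # Per the discussion, both should be non-empty and should match.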
sean-k-mooneyya, you might be doing list with a specific uuid in the query17:56
sean-k-mooneyit's almost but not quite the same17:56
sean-k-mooneybut ya, this should be easy to replicate17:56
sean-k-mooneyclarkb: the bit I'm not convinced about is that the fix is grouping the instances by cell and then doing a call to each cell db to get the faults for the instances in that cell17:57
sean-k-mooneywe normally batch these requests in parallel using some scatter_gather functions17:58
sean-k-mooneyinstead of looping over the cell dbs serially17:59
melwittsean-k-mooney: agree that the idea of the fix is right, but the implementation isn't how we normally add cell awareness and should instead be done consistently with the others17:59
sean-k-mooneyso https://review.opendev.org/c/openstack/nova/+/699176/17/nova/objects/instance.py#1565 likely should use that pattern. gibi and dansmith know that code better than I do, as it's being rewritten to a degree for the eventlet removal17:59
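For readers who have not seen the pattern being referenced: instead of looping over the cell databases one at a time, the per-cell query is fanned out concurrently and the results merged afterwards. A generic illustration of that shape using only concurrent.futures; it is not nova's actual scatter_gather helper, and query_cell_for_faults is a made-up placeholder:

    from concurrent.futures import ThreadPoolExecutor

    def query_cell_for_faults(cell_name, instance_uuids):
        # Placeholder for "ask one cell DB for the faults of these
        # instances"; it returns an empty result per uuid so the sketch runs.
        return {uuid: None for uuid in instance_uuids}

    def gather_faults(instances_by_cell):
        """Query every cell in parallel rather than serially."""
        results = {}
        with ThreadPoolExecutor() as pool:
            futures = {
                cell: pool.submit(query_cell_for_faults, cell, uuids)
                for cell, uuids in instances_by_cell.items()
            }
            for cell, future in futures.items():
                results[cell] = future.result()
        return results

    print(gather_faults({"cell1": ["uuid-a"], "cell2": ["uuid-b", "uuid-c"]}))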
sean-k-mooneymelwitt: are you planning to work on this?18:07
sean-k-mooneyor at least bring it up with others to see if anyone else can pick it up18:07
sean-k-mooneyi know you have a lot on your plate before FF18:07
clarkbfwiw I didn't intend to force melwitt to work on this. More than happy to have had the help tracking down the bug in the first place18:07
melwittsean-k-mooney: it's been on my todo list for a while to review :/ and I pinged rene about it to let him know. and kinda hoped he would get some people to gravitate :P but yeah it at least needs some review18:08
sean-k-mooneywell, I don't either. I'm just wondering what the best way to highlight this to the rest of the nova team is. I was thinking of adding it to the next irc team meeting for visibility18:08
clarkbI just want to call out that these issues that appear minor have potentially noticeable impacts on users and operators, and sometimes those bubble up to "your ci results are slow or gave an invalid result"18:09
sean-k-mooneyclarkb: no, this is a good thing to do; I just don't want to lose track of it18:09
clarkbthey are minor from a "the cloud is up and functioning" standpoint but less so from a "users are able to use the cloud to perform useful work" standpoint18:09
melwittwe could put it on the nova status etherpad; I am still getting used to using it, but that might be a way to keep better track of it18:09
sean-k-mooneyya, we could add a known bugs section for things like this18:10
sean-k-mooneyit's not quite a low-hanging-fruit bug, but it's a "if you have spare cycles this has potentially high value" one18:10
sean-k-mooneymelwitt: I would have used the review priority flag for this if I had just come across it, but we decided to use the etherpad instead..18:11
melwittyeah. I'll find a place to add it18:11
sean-k-mooneyclarkb: so just an idea, would an infra pain points session at the ptg be useful18:12
sean-k-mooneyclarkb: i.e. a session where you could tell us all the pain points you have with using openstack from zuul 18:12
melwittI have never gotten onto the review priority flag.. I thought it got added to everything but maybe it was not overused and did not become noise. I kept forgetting about it18:12
clarkbsean-k-mooney: possibly. I think part of the reason we're noticing things more lately is that we're trying to switch from nodepool to zuul-launcher, which necessarily involves rewriting things to change how it functions (to allow any process to handle any request in any cloud and to allow mixed nodesets between openstack vms, aws vms, k8s containers etc)18:13
sean-k-mooneyclarkb: ya, that makes sense. opendev is in a somewhat unique position where we have a closed loop to some degree18:14
clarkbbasically the rewrite there makes us more sensitive to errors because we have to be smarter about handling them due to the added complexity of the featureset18:16
clarkbin the past we were a lot heavier on brute-force solutions with nodepool18:16
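As a toy illustration of the "any process can handle any request" idea described above (not zuul-launcher's actual code or data model): identical workers pull node requests off a shared queue, and any of them can satisfy a request regardless of which backend its label maps to. The labels and worker count are made up:

    import queue
    import threading

    # A shared queue of pending node requests, each identified by a label.
    requests = queue.Queue()
    for label in ["openstack-vm", "aws-vm", "k8s-pod"]:
        requests.put(label)

    def launcher_worker(worker_id):
        # Any worker can take any request; a real launcher would talk to
        # the matching cloud provider here.
        while True:
            try:
                label = requests.get_nowait()
            except queue.Empty:
                return
            print(f"worker {worker_id} handling a {label} request")

    workers = [threading.Thread(target=launcher_worker, args=(i,)) for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()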
sean-k-mooneyI assume you have reduced the latency by waiting less too18:16
melwittsean-k-mooney: just added it to this existing section here https://etherpad.opendev.org/p/nova-2025.2-status#L13118:16
* sean-k-mooney rarely scrolls all the way to the end18:18
sean-k-mooneyshould we put the priority section at the top?18:18
melwittI don't see a way to add review priority on gerrit?18:18
sean-k-mooneyit's configured per project18:18
melwittmaybe, but I would want to ask rene first before moving stuff he put there18:18
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/69917618:19
sean-k-mooneyif you click reply there, it's one of the options beside code review and workflow18:19
melwittok I got it, sorry. I was dumb and didn't realize it was in the Reply area18:19
sean-k-mooneysylvain wanted to change the name so now its subtext is "Core Review Promise" or "Contributor Review Promise"18:20
sean-k-mooneywhich to me actually made the field less useful and didn't really do what I originally wanted.18:21
sean-k-mooneybut ya, I never got around to removing it when we said we should try the status etherpad instead18:21
fungithat's good feedback18:51
fungias part of the bridging-the-gap work i was looking yesterday at how many projects enabled a "review-priority" label in gerrit, and how they leverage it seems to vary widely even though they copied the same basic name18:52
fungi(related to contributors relating their inability to figure out how maintainers indicate prioritization and how to get involved in helping set priorities)18:52
opendevreviewJeremy Stanley proposed openstack/project-config master: Replace 2025.2/Flamingo key with 2026.1/Gazpacho  https://review.opendev.org/c/openstack/project-config/+/95746721:23
*** cloudnull10977461 is now known as cloudnull23:56
