gibi | do we have some node pool shortage? I see that the tip of the integrated gate queue is blocked on waiting for executor for small jobs like openstack-tox-docs | 14:47 |
---|---|---|
fungi | gibi: there's some inefficiency with node assignments at the moment, per discussion in #opendev corvus is working on tweaking it to address that | 14:50 |
gibi | fungi: thanks for the information | 14:51 |
fungi | we're still working to iron out corner cases in zuul-launcher (the nodepool-launcher rewrite/replacement) that emerged under load which weren't apparent from basic test scenarios | 14:52 |
fungi | in most cases it's dealing with random errors and disconnects from cloud providers that have proved to be hard to predict and design defensively for | 14:53 |
fungi | so there have been a series of incremental improvements addressing the more common problems we witness, which then makes it easier and easier to spot the less common ones | 14:54 |
gibi | thanks for the hard work | 14:55 |
opendevreview | Stephen Finucane proposed openstack/pbr master: util: Fix deprecation warnings https://review.opendev.org/c/openstack/pbr/+/957262 | 16:07 |
opendevreview | Stephen Finucane proposed openstack/pbr master: util: Make handling of description_file clearer https://review.opendev.org/c/openstack/pbr/+/957263 | 16:07 |
opendevreview | Stephen Finucane proposed openstack/pbr master: util: Skip normalization of description if README opts present https://review.opendev.org/c/openstack/pbr/+/957264 | 16:07 |
opendevreview | Stephen Finucane proposed openstack/pbr master: util: Deprecate description_file https://review.opendev.org/c/openstack/pbr/+/957265 | 16:07 |
clarkb | gibi: it is worth noting that at least part of the problem is due to https://bugs.launchpad.net/nova/+bug/2095364. There is a feedback loop where bugs that end up in production impact our ability to boot and manage VMs with nova. | 16:49 |
clarkb | Just something to keep in mind when dealing with problems like that. In a way they may seem innocuous ("just use an older microversion") but there are two problems here. The first is discoverability: how would an end user know to even attempt that? And second, bugs that break our ability to manage nodes impact the ability to land other changes | 16:50 |
fungi | yeah, hopefully the provider upgrades with the fix soon | 16:50 |
clarkb | I guess we don't use microversions when listing nodes within zuul so that wasn't impacting the service itself. Just our ability to debug things as listings stopped working with the cli tool | 17:01 |
sean-k-mooney | i was going to ask a similar question based on the node failure in https://review.opendev.org/c/openstack/nova/+/942413 | 17:32 |
clarkb | sean-k-mooney: ya it's all related. There was a significant enough slowdown in the system last night that we were consistently losing a race condition where zuul thought allocated nodes were not properly allocated and it bailed out early | 17:33 |
clarkb | last night we restarted the system to pick up some fixes that we had been working on landing through the day which corrected that problem for the most part | 17:34 |
clarkb | I think the issues now are in the other direction: rather than marking things failed because it's checking too quickly while processing is slow, we're just slowly assigning nodes | 17:34 |
sean-k-mooney | should we hold off on rechecks until tomorrow | 17:34 |
sean-k-mooney | i.e. give it a bit of time to stabilise | 17:35 |
clarkb | maybe? on the one hand having the load helps verify the fixes we've been pushing out are helping | 17:35 |
sean-k-mooney | the bug you reference has backports proposed to all affected branches and there is only one pending merge | 17:35 |
clarkb | sean-k-mooney: yes but it also first hit us in april | 17:36 |
clarkb | it's great it's getting fixed. But at the same time it seems like we really drag our feet on these things that directly impact zuul | 17:36 |
sean-k-mooney | we fixed it in master in march :) | 17:36 |
clarkb | but when zuul goes sideways we've got people banging on the doors demanding answers almost immediately | 17:36 |
clarkb | I'm just trying to point out how there is an imbalance in that feedback loop where we're doing the best with what we've been given | 17:37 |
sean-k-mooney | clarkb: for what it's worth no one reported this as an issue for our ci to nova | 17:37 |
sean-k-mooney | at least not that i'm aware of | 17:37 |
sean-k-mooney | when we were working on this it was just being fixed as a general slowdown in api access etc | 17:37 |
sean-k-mooney | i.e. not a gate blocker or critical issue | 17:38 |
clarkb | it's not a general slowdown fwiw. It's a "if you use default client configurations then the cloud stops working for you" bug | 17:38 |
clarkb | and there isn't any way to know that you can work around it without finding the issue in launchpad | 17:38 |
sean-k-mooney | it only happens if you have instances that do not have a request spec | 17:38 |
sean-k-mooney | which basically means you have db corruption | 17:38 |
fungi | which was likely the result of some other bug or failure, yes | 17:39 |
clarkb | right but that's also caused by nova... I as the end user or as the operator don't really have a say in that | 17:39 |
sean-k-mooney | but apparently that happens more than we are aware of... | 17:39 |
fungi | we churn enough boots/deletes that random stray cosmic rays turn up a lot of weird failure cases in providers | 17:39 |
sean-k-mooney | fungi: ya part of the problem is we often don't get visibility into the cause of the original issue | 17:40 |
fungi | back when i worked in data center networking, i kept close tabs on solar activity because it generally meant an uptick in random bitflips in memory on huge core routers | 17:40 |
sean-k-mooney | we know there can be instances that end up in this state, we have not been told anything about how they got that way, so we are just fixing a side effect rather than the root cause with this change | 17:40 |
fungi | (and my employer was too cheap to spring for ecc ram in the routers) | 17:40 |
clarkb | sean-k-mooney: ya we don't know how they end up there either. But rackspace may be willing to dig into it more if you ask? | 17:41 |
clarkb | jamesdenton isn't in here but has been helpful in tracking things down like that in the past | 17:41 |
sean-k-mooney | clarkb: well if infra have issues like this going forward feel free to shout at us | 17:42 |
clarkb | sean-k-mooney: fwiw there is another issue we discovered and melwitt helped track it down around quota errors iirc. | 17:42 |
clarkb | let me see if I can find it | 17:42 |
sean-k-mooney | is this the other slow-api-over-time thread on the mailing list | 17:42 |
sean-k-mooney | i didn't follow that closely | 17:43 |
fungi | unrelated i think | 17:44 |
clarkb | sean-k-mooney: https://bugs.launchpad.net/nova/+bug/1856329 | 17:44 |
clarkb | zuul wants to use the fault field to determine if the issue was running out of quota (which is considered a retryable error since quota usage should drop eventually) or if the problem is something we can't expect to resolve itself through normal usage | 17:44 |
clarkb | but due to this bug in some clouds the fault field is simply unset so we have to do the conservative thing and assume it won't be resolved | 17:45 |
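To illustrate the trade-off clarkb describes, here is a minimal sketch of the classification decision a node launcher has to make from a server's status and fault field. This is not zuul's actual code; the helper name, return labels, and the quota-matching heuristic are assumptions for illustration only.

```python
def classify_error_server(status: str, fault: dict | None) -> str:
    """Decide whether an ERROR server is worth retrying.

    Hypothetical helper, not zuul's real implementation: it only shows why
    an unset fault forces the conservative (non-retryable) choice.
    """
    if status != "ERROR":
        return "ok"
    if not fault:
        # The fault is unset on some clouds (bug 1856329), so we cannot
        # distinguish a quota failure from a permanent one and must assume
        # the worst.
        return "permanent-failure"
    message = fault.get("message", "").lower()
    if "quota" in message or "exceeded" in message:
        # Quota usage should drop as other nodes are deleted, so this is
        # safe to retry later.
        return "retryable-quota"
    return "permanent-failure"
```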
sean-k-mooney | i'm trying to parse the description | 17:45 |
sean-k-mooney | clarkb: ah ok and this is saying that it depends on whether the instance is buried in cell 0, meaning it passed the api quota check but then failed to schedule to a host | 17:46 |
sean-k-mooney | or it's in a real cell where it is empty? | 17:46 |
fungi | those are also nuances that are hard for us to determine as end-users of the service | 17:47 |
clarkb | right we have no idea | 17:47 |
sean-k-mooney | well we return a no valid host if it's not a quota issue | 17:47 |
clarkb | the problem is that it isn't included regardless of the error aiui | 17:48 |
clarkb | it's simply unset | 17:48 |
sean-k-mooney | so this bug is very old and not super helpful. could you explain what is happening and what you want to happen | 17:48 |
clarkb | sean-k-mooney: server create foo; server show foo; results are server is in ERROR state but the fault field is empty | 17:49 |
sean-k-mooney | right, just got that from the commit message https://review.opendev.org/c/openstack/nova/+/699176 | 17:49 |
clarkb | what should happen is anytime a server is in an error state some sort of fault value is set indicating to the user the type of error so they can either adjust their usage of the cloud, wait for quota to become free, or contact their cloud provider | 17:49 |
sean-k-mooney | the problem is the behavior between server show and server list differs | 17:49 |
sean-k-mooney | and in the server show case we have a bug where we don't get the fault from the cell db | 17:49 |
sean-k-mooney | the bug is reporting that it works for show but not for list by the way | 17:52 |
sean-k-mooney | clarkb: let me go rebase that and see if we can get up to date ci results | 17:53 |
sean-k-mooney | i'm not convinced the patch is entirely correct | 17:55 |
clarkb | sean-k-mooney: it's possible we're doing list and not show and I've got them backwards | 17:55 |
clarkb | a trivial test case would be to do a list and a show and compare the values | 17:55 |
clarkb | they should both match and both be non empty | 17:55 |
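A rough version of that trivial test with openstacksdk might look like the following; the cloud name is a placeholder, and the `fault` attribute on the Server object is an assumption about the SDK's field naming rather than something confirmed in this discussion.

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # placeholder cloud name

# Compare the fault field reported by "list" vs "show" for ERROR servers.
for listed in conn.compute.servers(details=True):
    if listed.status != "ERROR":
        continue
    shown = conn.compute.get_server(listed.id)
    print(listed.id, "list fault:", listed.fault, "show fault:", shown.fault)
    # Both values should match and be non-empty; per the bug, one of them
    # comes back unset on affected deployments.
```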
sean-k-mooney | ya you might be doing list with a specific uuid in the query | 17:56 |
sean-k-mooney | it's almost but not quite the same | 17:56 |
sean-k-mooney | but ya this should be easy to replicate | 17:56 |
sean-k-mooney | clarkb: the bit i'm not convinced about is that the fix is grouping the instances by cell and then doing a call to each cell db to get the faults for the instances in that cell | 17:57 |
sean-k-mooney | we normally batch these requests in parallel using some scatter_gather functions | 17:58 |
sean-k-mooney | instead of looping over the cell dbs serially | 17:59 |
melwitt | sean-k-mooney: agree that the idea of the fix is right but the implementation isn't how we normally add cell awareness and should instead be done consistent with the others | 17:59 |
sean-k-mooney | so https://review.opendev.org/c/openstack/nova/+/699176/17/nova/objects/instance.py#1565 likely should use that pattern. gibi and dansmith know that code better than i do, as it's being rewritten to a degree for the eventlet removal | 17:59 |
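For readers unfamiliar with the pattern sean-k-mooney refers to: nova normally fans the per-cell database calls out in parallel rather than looping over cells serially. The sketch below is a generic illustration of that scatter/gather idea using concurrent.futures, not nova's real scatter_gather helpers; the `cells`, `instance_uuids_by_cell`, and `query_faults` names are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def get_faults_per_cell(cells, instance_uuids_by_cell, query_faults):
    """Fetch instance faults from every cell DB in parallel.

    Illustrative only: ``cells`` and ``query_faults`` stand in for nova's
    cell mappings and per-cell DB query; the real code uses cell-targeted
    contexts and its own scatter/gather machinery.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(cells) or 1) as pool:
        # Submit one query per cell so slow cells do not serialize the rest.
        futures = {
            pool.submit(query_faults, cell, instance_uuids_by_cell[cell]): cell
            for cell in cells
        }
        for future, cell in futures.items():
            # Gather each cell's results; a per-cell timeout/error policy
            # would go here in a real implementation.
            results[cell] = future.result()
    return results
```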
sean-k-mooney | melwitt: are you planning to work on this? | 18:07 |
sean-k-mooney | or at least bring it up with others to see if anyone else can pick it up | 18:07 |
sean-k-mooney | i know you have a lot on your plate before FF | 18:07 |
clarkb | fwiw I didn't intend on forcing melwitt to work on this. More than happy to have had the help tracking down the bug in the first place | 18:07 |
melwitt | sean-k-mooney: it's been on my todo list for a while to review :/ and I pinged rene about it to let him know. and kinda hoped he would get some people to gravitate :P but yeah it at least needs some review | 18:08 |
sean-k-mooney | well i don't either. i'm just wondering what the best way to highlight this to the rest of the nova team is. i was thinking of adding it to the next irc team meeting for visibility | 18:08 |
clarkb | I just want to call out that these issues that appear minor have potentially noticeable impacts on users and operators, and sometimes those bubble up to "your ci results are slow or gave an invalid result" | 18:09 |
sean-k-mooney | clarkb: no this is a good thing to do, i just don't want to lose track of it | 18:09 |
clarkb | they are minor from a "the cloud is up and functioning standpoint" but less so from a "users are able to use the cloud to perform useful work" standpoint | 18:09 |
melwitt | we could put it on the nova status etherpad, I am still getting used to using it but that might be a way to keep better track of it | 18:09 |
sean-k-mooney | ya we could add a known bugs section for things like this | 18:10 |
sean-k-mooney | it's not quite a low-hanging-fruit bug but it's an "if you have spare cycles this has potentially high value" one | 18:10 |
sean-k-mooney | melwitt: i would have used the review priority flag for this if i had just come across it, but we decided to use the etherpad instead.. | 18:11 |
melwitt | yeah. I'll find a place to add it | 18:11 |
sean-k-mooney | clarkb: so just an idea, would an infra pain points session at the ptg be useful | 18:12 |
sean-k-mooney | clarkb: i.e. a session where you could tell us all the pain points you have with using openstack from zuul | 18:12 |
melwitt | I have never gotten onto the review priority flag.. I thought it got added to everything but maybe it was not overused and did not become noise. I kept forgetting about it | 18:12 |
clarkb | sean-k-mooney: possibly. I think part of the reason we're noticing things more lately is that we're trying to switch from nodepool to zuul-launcher, which necessarily involves rewriting things to change how it functions (to allow any process to handle any request in any cloud and to allow mixed nodesets between openstack vms, aws vms, k8s containers etc) | 18:13 |
sean-k-mooney | clarkb: ya that makes sense. opendev is in a somewhat unique position where we have a closed loop to some degree | 18:14 |
clarkb | basically the rewrite there makes us more sensitive to errors because we have to be smarter about handling them due to the added complexity of the featureset | 18:16 |
clarkb | in the past we were a lot more brute force solution heavy with nodepool | 18:16 |
sean-k-mooney | i assum you have reduced the latency by waiting less too | 18:16 |
melwitt | sean-k-mooney: just added it to this existing section here https://etherpad.opendev.org/p/nova-2025.2-status#L131 | 18:16 |
* sean-k-mooney rarely scrolls all the way to the end | 18:18 |
sean-k-mooney | should we put the priority section at the top? | 18:18 |
melwitt | I don't see a way to add review priority on gerrit? | 18:18 |
sean-k-mooney | it's configured per project | 18:18 |
melwitt | maybe but I would want to ask rene first before moving stuff he put there | 18:18 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/699176 | 18:19 |
sean-k-mooney | if you click reply there, it's one of the options beside code review and workflow | 18:19 |
melwitt | ok I got it, sorry. I was dumb and didn't realize it was in the Reply area | 18:19 |
sean-k-mooney | sylvain wanted to change the name so now its subtext is "Core Review Promise" or "Contributor Review Promise" | 18:20 |
sean-k-mooney | which to me actually made the field less useful and didn't really do what i originally wanted. | 18:21 |
sean-k-mooney | but ya i never got around to removing it when we said we should try the status etherpad instead | 18:21 |
fungi | that's good feedback | 18:51 |
fungi | as part of the bridging-the-gap work i was looking yesterday at how many projects enabled a "review-priority" label in gerrit, and how they leverage it seems to vary widely even though they copied the same basic name | 18:52 |
fungi | (related to contributors relating their inability to figure out how maintainers indicate prioritization and how to get involved in helping set priorities) | 18:52 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Replace 2025.2/Flamingo key with 2026.1/Gazpacho https://review.opendev.org/c/openstack/project-config/+/957467 | 21:23 |
*** cloudnull10977461 is now known as cloudnull | 23:56 |