gibi | do we have some node pool shortage? I see that the tip of the integrated gate queue is blocked on waiting for executor for small jobs like openstack-tox-docs | 14:47 |
---|---|---|
fungi | gibi: there's some inefficiency with node assignments at the moment, per discussion in #opendev corvus is working on tweaking it to address that | 14:50 |
gibi | fungi: thanks for the information | 14:51 |
fungi | we're still working to iron out corner cases in zuul-launcher (the nodepool-launcher rewrite/replacement) that emerged under load which weren't apparent from basic test scenarios | 14:52 |
fungi | in most cases it's dealing with random errors and disconnects from cloud providers that have proved to be hard to predict and design defensively for | 14:53 |
fungi | so there have been a series of incremental improvements addressing the more common problems we witness, which then makes it easier and easier to spot the less common ones | 14:54 |
gibi | thanks for the hard work | 14:55 |
opendevreview | Stephen Finucane proposed openstack/pbr master: util: Fix deprecation warnings https://review.opendev.org/c/openstack/pbr/+/957262 | 16:07 |
opendevreview | Stephen Finucane proposed openstack/pbr master: util: Make handling of description_file clearer https://review.opendev.org/c/openstack/pbr/+/957263 | 16:07 |
opendevreview | Stephen Finucane proposed openstack/pbr master: util: Skip normalization of description if README opts present https://review.opendev.org/c/openstack/pbr/+/957264 | 16:07 |
opendevreview | Stephen Finucane proposed openstack/pbr master: util: Deprecate description_file https://review.opendev.org/c/openstack/pbr/+/957265 | 16:07 |
clarkb | gibi: it is worth noting that at least part of the problem is due to https://bugs.launchpad.net/nova/+bug/2095364. There is a feedback loop where bugs that end up in production impact our ability to boot and manage VMs with nova. | 16:49 |
clarkb | Just something to keep in mind when dealing with problems like that. In a way they may seem innocuous ("just use an older microversion") but there are two problems here. The first is discoverability: how would an end user know to even attempt that? And second, bugs that break our ability to manage nodes impact the ability to land other changes | 16:50 |
fungi | yeah, hopefully the provider upgrades with the fix soon | 16:50 |
clarkb | I guess we don't use microversions when listing nodes within zuul so that wasn't impacting the service itself. Just our ability to debug things as listings stopped working with the cli tool | 17:01 |
sean-k-mooney | i was going to ask a similar question based on the node failure in https://review.opendev.org/c/openstack/nova/+/942413 | 17:32 |
clarkb | sean-k-mooney: ya it's all related. There was a significant enough slowdown in the system last night that we were consistently losing a race condition where zuul thought allocated nodes were not properly allocated and it bailed out early | 17:33 |
clarkb | last night we restarted the system to pick up some fixes that we had been working on landing through the day which corrected that problem for the most part | 17:34 |
clarkb | I think the issues now are in the other direction: rather than marking things failed because it's checking too quickly while processing is slow, we're just slowly assigning nodes | 17:34 |
sean-k-mooney | should we hold off on rechecks until tomorrow | 17:34 |
sean-k-mooney | i.e. give it a bit of time to stabilise | 17:35 |
clarkb | maybe? on the one hand having the load helps verify the fixes we've been pushing out are helping | 17:35 |
sean-k-mooney | the bug you reference has backports proposed to all affected branches and there is only one pending merge | 17:35 |
clarkb | sean-k-mooney: yes but it also first hit us in april | 17:36 |
clarkb | it's great it's getting fixed. But at the same time it seems like we really drag our feet on these things that directly impact zuul | 17:36 |
sean-k-mooney | we fixed it in master in march :) | 17:36 |
clarkb | but when zuul goes sideways we've got people banging on the doors demanding answers almost immediately | 17:36 |
clarkb | I'm just trying to point out how there is an imbalance in that feedback loop where we're doing the best with what we've been given | 17:37 |
sean-k-mooney | clarkb: for what it's worth no one reported this as an issue for our ci to nova | 17:37 |
sean-k-mooney | at least not that i'm aware of | 17:37 |
sean-k-mooney | when we were working on this it was just being fixed as a general slowdown in api access etc | 17:37 |
sean-k-mooney | i.e. not a gate blocker or critical issue | 17:38 |
clarkb | it's not a general slowdown fwiw. It's a "if you use default client configurations then the cloud stops working for you" bug | 17:38 |
clarkb | and there isn't any way to know that you can work around it without finding the issue in launchpad | 17:38 |
sean-k-mooney | it only happens if you have instances that do not have a request spec | 17:38 |
sean-k-mooney | which basically means you have db corruption | 17:38 |
fungi | which was likely the result of some other bug or failure, yes | 17:39 |
clarkb | right but that's also caused by nova... I as the end user or as the operator don't really have a say in that | 17:39 |
sean-k-mooney | but apparently that happens more than we are aware of... | 17:39 |
fungi | we churn enough boots/deletes that random stray cosmic rays turn up a lot of weird failure cases in providers | 17:39 |
sean-k-mooney | fungi: ya part of the problem is we often don't get visibility into the cause of the original issue | 17:40 |
fungi | back when i worked in data center networking, i kept close tabs on solar activity because it generally meant an uptick in random bitflips in memory on huge core routers | 17:40 |
sean-k-mooney | we know there can be instances that end up in this state, we have not been told anything about how they got that way, so we are just fixing a side effect rather than the root cause with this change | 17:40 |
fungi | (and my employer was too cheap to spring for ecc ram in the routers) | 17:40 |
clarkb | sean-k-mooney: ya we don't know how they end up there either. But rackspace may be willing to dig into it more if you ask? | 17:41 |
clarkb | jamesdenton isn't in here but has been helpful in tracking things down like that in the past | 17:41 |
sean-k-mooney | clarkb: well if infra have issues like this going forward feel free to shout at us | 17:42 |
clarkb | sean-k-mooney: fwiw there is another issue we discovered and melwitt helped track it down around quota errors iirc. | 17:42 |
clarkb | let me see if I can find it | 17:42 |
sean-k-mooney | is this the other slow-api-over-time thread on the mailing list | 17:42 |
sean-k-mooney | i didn't follow that closely | 17:43 |
fungi | unrelated i think | 17:44 |
clarkb | sean-k-mooney: https://bugs.launchpad.net/nova/+bug/1856329 | 17:44 |
clarkb | zuul wants to use the fault field to determine if the issue was running out of quota (which is considered a retryable error since quota usage should drop eventually) or if the problem is something we can't expect to resolve itself through normal usage | 17:44 |
clarkb | but due to this bug in some clouds the fault field is simply unset so we have to do the conservative thing and assume it won't be resolved | 17:45 |
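To illustrate the trade-off clarkb describes, here is a minimal sketch of the classification decision a node launcher has to make from a server's status and fault field. This is not zuul's actual code; the helper name, return labels, and the quota-matching heuristic are assumptions for illustration only.

```python
def classify_error_server(status: str, fault: dict | None) -> str:
    """Decide whether an ERROR server is worth retrying.

    Hypothetical helper, not zuul's real implementation: it only shows why
    an unset fault forces the conservative (non-retryable) choice.
    """
    if status != "ERROR":
        return "ok"
    if not fault:
        # The fault is unset on some clouds (bug 1856329), so we cannot
        # distinguish a quota failure from a permanent one and must assume
        # the worst.
        return "permanent-failure"
    message = fault.get("message", "").lower()
    if "quota" in message or "exceeded" in message:
        # Quota usage should drop as other nodes are deleted, so this is
        # safe to retry later.
        return "retryable-quota"
    return "permanent-failure"
```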
sean-k-mooney | i'm trying to parse the description | 17:45 |
sean-k-mooney | clarkb: ah ok and this is saying that it depends on whether the instance is buried in cell 0, meaning it passed the api quota check but then failed to schedule to a host | 17:46 |
sean-k-mooney | or it's in a real cell where it is empty? | 17:46 |
fungi | those are also nuances that are hard for us to determine as end-users of the service | 17:47 |
clarkb | right we have no idea | 17:47 |
sean-k-mooney | well we return a no valid host if it's not a quota issue | 17:47 |
clarkb | the problem is that it isn't included regardless of the error aiui | 17:48 |
clarkb | it's simply unset | 17:48 |
sean-k-mooney | so this bug is very old and not super helpful. could you explain what is happening and what you want to happen | 17:48 |
clarkb | sean-k-mooney: server create foo; server show foo; results are server is in ERROR state but the fault field is empty | 17:49 |
sean-k-mooney | right, just got that from the commit message https://review.opendev.org/c/openstack/nova/+/699176 | 17:49 |
clarkb | what should happen is anytime a server is in an error state some sort of fault value is set indicating to the user the type of error so they can either adjust their usage of the cloud, wait for quota to become free, or contact their cloud provider | 17:49 |
sean-k-mooney | the problem is the behavior between server show and server list differs | 17:49 |
sean-k-mooney | and in the server show case we have a bug where we don't get the fault from the cell db | 17:49 |
sean-k-mooney | the bug is reporting that it works for show but not for list by the way | 17:52 |
sean-k-mooney | clarkb: let me go rebase that and see if we can get up to date ci results | 17:53 |
sean-k-mooney | i'm not convinced the patch is entirely correct | 17:55 |
clarkb | sean-k-mooney: it's possible we're doing list and not show and I've got them backwards | 17:55 |
clarkb | a trivial test case would be to do a list and a show and compare the values | 17:55 |
clarkb | they should both match and both be non empty | 17:55 |
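A rough version of that trivial test with openstacksdk might look like the following; the cloud name is a placeholder, and the `fault` attribute on the Server object is an assumption about the SDK's field naming rather than something confirmed in this discussion.

```python
import openstack

conn = openstack.connect(cloud="mycloud")  # placeholder cloud name

# Compare the fault field reported by "list" vs "show" for ERROR servers.
for listed in conn.compute.servers(details=True):
    if listed.status != "ERROR":
        continue
    shown = conn.compute.get_server(listed.id)
    print(listed.id, "list fault:", listed.fault, "show fault:", shown.fault)
    # Both values should match and be non-empty; per the bug, one of them
    # comes back unset on affected deployments.
```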
sean-k-mooney | ya you might be doing list with a specific uuid in the query | 17:56 |
sean-k-mooney | it's almost but not quite the same | 17:56 |
sean-k-mooney | but ya this should be easy to replicate | 17:56 |
sean-k-mooney | clarkb: the bit i'm not convinced about is that the fix is grouping the instances by cell and then doing a call to each cell db to get the faults for the instances in that cell | 17:57 |
sean-k-mooney | we normally batch these requests in parallel using some scatter_gather functions | 17:58 |
sean-k-mooney | instead of looping over the cell dbs serially | 17:59 |
melwitt | sean-k-mooney: agree that the idea of the fix is right but the implementation isn't how we normally add cell awareness and should instead be done consistent with the others | 17:59 |
sean-k-mooney | so https://review.opendev.org/c/openstack/nova/+/699176/17/nova/objects/instance.py#1565 likely should use that pattern. gibi and dansmith know that code better than i do, as it's being rewritten to a degree for the eventlet removal | 17:59 |
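For readers unfamiliar with the pattern sean-k-mooney refers to: nova normally fans the per-cell database calls out in parallel rather than looping over cells serially. The sketch below is a generic illustration of that scatter/gather idea using concurrent.futures, not nova's real scatter_gather helpers; the `cells`, `instance_uuids_by_cell`, and `query_faults` names are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def get_faults_per_cell(cells, instance_uuids_by_cell, query_faults):
    """Fetch instance faults from every cell DB in parallel.

    Illustrative only: ``cells`` and ``query_faults`` stand in for nova's
    cell mappings and per-cell DB query; the real code uses cell-targeted
    contexts and its own scatter/gather machinery.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(cells) or 1) as pool:
        # Submit one query per cell so slow cells do not serialize the rest.
        futures = {
            pool.submit(query_faults, cell, instance_uuids_by_cell[cell]): cell
            for cell in cells
        }
        for future, cell in futures.items():
            # Gather each cell's results; a per-cell timeout/error policy
            # would go here in a real implementation.
            results[cell] = future.result()
    return results
```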
sean-k-mooney | melwitt: are you planning to work on this? | 18:07 |
sean-k-mooney | or at least bring it up with others to see if anyone else can pick it up | 18:07 |
sean-k-mooney | i know you have a lot on your plate before FF | 18:07 |
clarkb | fwiw I didn't intend on forcing melwitt to work on this. More than happy to have had the help tracking down the bug in the first place | 18:07 |
melwitt | sean-k-mooney: it's been on my todo list for a while to review :/ and I pinged rene about it to let him know. and kinda hoped he would get some people to gravitate :P but yeah it at least needs some review | 18:08 |
sean-k-mooney | well i don't either. i'm just wondering what the best way to highlight this to the rest of the nova team is. i was thinking of adding it to the next irc team meeting for visibility | 18:08 |
clarkb | I just want to call out that these issues that appear minor have potentially noticeable impacts on users and operators, and sometimes those bubble up to "your ci results are slow or gave an invalid result" | 18:09 |
sean-k-mooney | clarkb: no this is a good thing to do, i just don't want to lose track of it | 18:09 |
clarkb | they are minor from a "the cloud is up and functioning standpoint" but less so from a "users are able to use the cloud to perform useful work" standpoint | 18:09 |
melwitt | we could put it on the nova status etherpad, I am still getting used to using it but that might be a way to keep better track of it | 18:09 |
sean-k-mooney | ya we could add a known bugs section for things like this | 18:10 |
sean-k-mooney | it's not quite a low-hanging-fruit bug but it's an "if you have spare cycles this has potentially high value" one | 18:10 |
sean-k-mooney | melwitt: i would have used the review priority flag for this if i had just come across it, but we decided to use the etherpad instead.. | 18:11 |
melwitt | yeah. I'll find a place to add it | 18:11 |
sean-k-mooney | clarkb: so just an idea, would an infra pain points session at the ptg be useful | 18:12 |
sean-k-mooney | clarkb: i.e. a session where you could tell us all the pain points you have with using openstack from zuul | 18:12 |
melwitt | I have never gotten onto the review priority flag.. I thought it got added to everything but maybe it was not overused and did not become noise. I kept forgetting about it | 18:12 |
clarkb | sean-k-mooney: possibly. I think part of the reason we're noticing things more lately is that we're trying to switch from nodepool to zuul-launcher, which necessarily involves rewriting things to change how it functions (to allow any process to handle any request in any cloud and to allow mixed nodesets between openstack vms, aws vms, k8s containers etc) | 18:13 |
sean-k-mooney | clarkb: ya that makes sense. opendev is in a somewhat unique position where we have a closed loop to some degree | 18:14 |
clarkb | basically the rewrite there makes us more sensitive to errors because we have to be smarter about handling them due to the added complexity of the featureset | 18:16 |
clarkb | in the past we were a lot more brute force solution heavy with nodepool | 18:16 |
sean-k-mooney | i assum you have reduced the latency by waiting less too | 18:16 |
melwitt | sean-k-mooney: just added it to this existing section here https://etherpad.opendev.org/p/nova-2025.2-status#L131 | 18:16 |
* sean-k-mooney rarely scrolls all the way to the end | 18:18 |
sean-k-mooney | should we put the priority section at the top? | 18:18 |
melwitt | I don't see a way to add review priority on gerrit? | 18:18 |
sean-k-mooney | it's configured per project | 18:18 |
melwitt | maybe but I would want to ask rene first before moving stuff he put there | 18:18 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/699176 | 18:19 |
sean-k-mooney | if you click reply there, it's one of the options beside code review and workflow | 18:19 |
melwitt | ok I got it, sorry. I was dumb and didn't realize it was in the Reply area | 18:19 |
sean-k-mooney | sylvain wanted to change the name so now its subtext is "Core Review Promise" or "Contributor Review Promise" | 18:20 |
sean-k-mooney | which to me actually made the field less useful and didn't really do what i originally wanted. | 18:21 |
sean-k-mooney | but ya i never got around to removing it when we said we should try the status etherpad instead | 18:21 |
fungi | that's good feedback | 18:51 |
fungi | as part of the bridging-the-gap work i was looking yesterday at how many projects enabled a "review-priority" label in gerrit, and how they leverage it seems to vary widely even though they copied the same basic name | 18:52 |
fungi | (related to contributors relating their inability to figure out how maintainers indicate prioritization and how to get involved in helping set priorities) | 18:52 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Replace 2025.2/Flamingo key with 2026.1/Gazpacho https://review.opendev.org/c/openstack/project-config/+/957467 | 21:23 |
*** cloudnull10977461 is now known as cloudnull | 23:56 |