corvus | hrm, log rotation has happened on zl01 but not zl02 | 00:23 |
---|---|---|
clarkb | that rotation is happening in python (not using logrotate) | 00:25 |
corvus | perhaps fallout from the memory issues? | 00:25 |
clarkb | ya maybe the object in python crashed trying to allocate memory? | 00:26 |
clarkb | or even the file copy itself? | 00:26 |
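(For context: the rotation in question happens inside the Python process rather than via logrotate. A minimal sketch of that general approach using the stdlib handler; the path and settings here are assumptions for illustration, not Zuul's actual logging config.)

```python
# Illustration only: time-based rotation handled inside the Python process,
# assumed to resemble (not reproduce) the launcher's logging setup.
import logging
import logging.handlers

logger = logging.getLogger("zuul.Launcher")
handler = logging.handlers.TimedRotatingFileHandler(
    "/var/log/zuul/launcher-debug.log",  # hypothetical path
    when="midnight",   # roll over once a day
    backupCount=7,     # keep a week of rotated files
)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)
```

Because the rollover (rename and reopen) runs in-process when a record is emitted, a process that is failing memory allocations can skip or abort rotation, which is consistent with the speculation above.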
clarkb | however 8e437d947d1c4d6eb53212fd5ce8d53c this request was one that failed per https://zuul.opendev.org/t/openstack/build/e7fe7c44d29242f9993be9137f5f6cf9 and it appears to have been a single node request whose noderequest was fulfilled by zl01 | 00:27 |
clarkb | though I suppose that maybe zl02 was then responsible for handling request completion for the entire request? | 00:27 |
corvus | yeah, they often trade off :) | 00:28 |
corvus | i was looking up request fd545f68533344b8b64135ed500cb4e4 | 00:28 |
clarkb | looks like zl02 handled the ssh keyscanning for that node | 00:29 |
clarkb | I don't see anything indicating why the request failed | 00:29 |
clarkb | unless Exception loading ZKObject at /zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c/revision which happened early in the process was to blame? | 00:29 |
corvus | no that's benign | 00:29 |
corvus | 2025-08-13 22:27:02,098 DEBUG zuul.Launcher: [e: 38edb69d08d34ae4b9cc0faee9bef52a] [req: fd545f68533344b8b64135ed500cb4e4] Request <NodesetRequest uuid=fd545f68533344b8b64135ed500cb4e4, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/fd545f68533344b8b64135ed500cb4e4> nodes ready: [<OpenstackProviderNode uuid=fbfe4e20998942d188a06e9a90270dbe, label=ubuntu-noble-8GB, state=ready, | 00:29 |
corvus | provider=opendev.org%2Fopendev%2Fzuul-providers/rax-dfw-main>] | 00:29 |
corvus | 2025-08-13 22:27:02,100 DEBUG zuul.Launcher: [e: 38edb69d08d34ae4b9cc0faee9bef52a] [req: fd545f68533344b8b64135ed500cb4e4] Marking request <NodesetRequest uuid=fd545f68533344b8b64135ed500cb4e4, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/fd545f68533344b8b64135ed500cb4e4> as fulfilled | 00:29 |
clarkb | I thought so considering it seemed to proceed afterwards | 00:29 |
corvus | i think the launcher thought everything was fine with that request | 00:30 |
clarkb | maybe need to look at the scheduler then? | 00:30 |
corvus | yep | 00:30 |
clarkb | the schedulers have ERROR zuul.LauncherClient: Exception loading ZKObject at /zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c/revision | 00:31 |
corvus | that's fine too | 00:31 |
clarkb | on zuul01 we have 2025-08-14 00:04:34,373 INFO zuul.Pipeline.openstack.gate: [e: 6892d4f49d6e479fa8af9c2296e48c07] Node(set) request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c>: failure for openstack-tox-pep8 | 00:32 |
corvus | (https://review.opendev.org/957184 will reduce or eliminate a bunch of those; that change is approved) | 00:32 |
clarkb | problem is very few jobs seem to be running right now | 00:32 |
clarkb | maybe like 50% of them; the others are hitting this issue, so it may not land until we figure out what is going on | 00:33 |
clarkb | jumping up a level and looking at the event logs it kinda seems like that is what we get log wise | 00:37 |
clarkb | ok that log message originates from onNodesProvisioned when request.fulfilled is not true. request.fulfilled requires the state to be fulfilled but it is accepted in that log line | 00:40 |
clarkb | on zl01 we get 2025-08-14 00:04:33,972 DEBUG zuul.Launcher: [e: 6892d4f49d6e479fa8af9c2296e48c07] [req: 8e437d947d1c4d6eb53212fd5ce8d53c] Marking request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c> as fulfilled | 00:40 |
clarkb | then half a second later on zuul01 we get 2025-08-14 00:04:34,373 INFO zuul.Pipeline.openstack.gate: [e: 6892d4f49d6e479fa8af9c2296e48c07] Node(set) request <NodesetRequest uuid=8e437d947d1c4d6eb53212fd5ce8d53c, state=accepted, labels=['ubuntu-noble-8GB'], path=/zuul/nodeset/requests/8e437d947d1c4d6eb53212fd5ce8d53c>: failure for openstack-tox-pep8 | 00:41 |
clarkb | the zl01 log indicates it is marking it fulfilled but the state logged on zuul01 is still accepted | 00:41 |
clarkb | so I think the problem is that we are not transitioning the state record properly for some reason | 00:41 |
clarkb | I don't see any changes to the code that marks things fulfilled in the set of changes today | 00:43 |
corvus | i think the launcher may be running very slowly and we're now consistently losing a race | 00:46 |
clarkb | ya I just noticed in the launcher code we send the event for nodes provisioned then update the request attributes | 00:46 |
corvus | i think it's worth restarting the launchers to clear things out | 00:46 |
corvus | exactly that yes | 00:46 |
clarkb | which would be a race condition we can lose if things are slower than the network | 00:46 |
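(A stripped-down illustration of the race being described; the function names and data structures below are hypothetical stand-ins, not Zuul's actual code. "store" stands in for ZooKeeper and "events" for the scheduler's event queue.)

```python
# Hypothetical sketch of the ordering bug, not Zuul's actual implementation.

def fulfill_request_buggy(request, store, events):
    # BUG: the scheduler can pop this event and re-read the request from
    # the store before the state update below has been written, seeing
    # state == "accepted" and treating the request as failed.
    events.append(("nodes-provisioned", request["uuid"]))
    request["state"] = "fulfilled"
    store[request["uuid"]] = dict(request)

def fulfill_request_fixed(request, store, events):
    # Persist the state change first, then notify; any reader woken by the
    # event is now guaranteed to observe state == "fulfilled".
    request["state"] = "fulfilled"
    store[request["uuid"]] = dict(request)
    events.append(("nodes-provisioned", request["uuid"]))
```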
corvus | do you want to restart launchers while i work on a code fix? | 00:46 |
corvus | and maybe do something about logs while they're down | 00:47 |
clarkb | corvus: sure. Just docker compose down && docker compose up -d? | 00:47 |
clarkb | or do you want me to pull first? | 00:47 |
clarkb | I guess I should pull since we had some updates | 00:48 |
clarkb | I'm doing zuul01 now. Will pull then down then up -d as it doesn't need log updates. Then I'll go look at what needs to be done while zl02 is down | 00:49 |
clarkb | Not zuul01. zl01 | 00:49 |
clarkb | ok it appears to be running. Moving on to zl02 | 00:52 |
clarkb | zl02 is down. I'm gzipping the large log file in case there is any info in there worth saving. Then I'll remove the inflated file version if gzip doesn't do that automatically and start the service up again | 00:55 |
clarkb | zl02 launcher is starting up now | 00:57 |
clarkb | corvus: the logfile is gzipped and renamed with a dated suffix. We'll probably want to delete that file in a day or two if we don't end up digging into it for more data | 01:02 |
corvus | ++ thx | 01:02 |
tkajinam | I was about to ask if there is any maintenance going on, seeing frequent node failure but I think you all are aware of the situation. | 01:03 |
* tkajinam is still reading the log | 01:03 |
clarkb | things look happier after I restarted the launchers. Newly pushed / enqueued changes aren't showing up with a bunch of orange in the ui | 01:04 |
clarkb | I'm going to have to pop out soon for dinner. But will check back in on this in the morning | 01:05 |
clarkb | looking at the zuul components list I think our weekend reboots may have gotten stuck or crashed again too | 01:07 |
clarkb | I'm not going to dig into that deeper at the moment since it's late and already Thursday UTC time so things have mostly been working despite that | 01:08 |
clarkb | corvus: ^ just fyi that is probably something to followup on next too | 01:08 |
corvus | ack | 01:10 |
corvus | i'll clean out the leaked image files; i should be able to do it safely while the launchers are running, just based on timestamps. | 01:31 |
corvus | nl01 is not fully functional due to https://review.opendev.org/957296 -- i'm going to restart it | 02:16 |
corvus | 2025-08-14 02:23:15,164 ERROR zuul.Launcher: openstack.exceptions.HttpException: HttpException: 503: Server Error for url: https://ord.servers.api.rackspacecloud.com/v2/637776/flavors/detail?is_public=None, Service Unavailable | 02:25 |
corvus | now we're getting errors from rax-ord | 02:25 |
corvus | 2025-08-14 02:27:17,384 ERROR zuul.Launcher: keystoneauth1.exceptions.connection.ConnectTimeout: Request to https://ord.servers.api.rackspacecloud.com/v2/637776/servers timed out | 02:27 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable rax-ord https://review.opendev.org/c/opendev/zuul-providers/+/957297 | 02:28 |
corvus | infra-root: ^ i self-approved that | 02:29 |
opendevreview | Merged opendev/zuul-providers master: Disable rax-ord https://review.opendev.org/c/opendev/zuul-providers/+/957297 | 02:29 |
corvus | 2025-08-14 02:41:00,922 ERROR zuul.Launcher: Exception: Unable to find flavor: opendev.xxlarge in osuosl-regionone-main | 02:41 |
corvus | apparently there is an opendev.xxxlarge | 02:43 |
corvus | there's no timestamp or anything, so i don't know if that changed | 02:43 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Correct osuosl flavor name https://review.opendev.org/c/opendev/zuul-providers/+/957298 | 02:44 |
corvus | infra-root: ^ i'm self-approving that too | 02:44 |
opendevreview | Merged opendev/zuul-providers master: Correct osuosl flavor name https://review.opendev.org/c/opendev/zuul-providers/+/957298 | 02:44 |
corvus | it's pretty bad if we can't talk to a cloud at startup... we may need to parallelize the startup process and make the providers a bit more independent. | 02:47 |
opendevreview | Takashi Kajinami proposed opendev/system-config master: WIP: Add OpenVox to mirror https://review.opendev.org/c/opendev/system-config/+/957299 | 02:51 |
corvus | we can't deal with the load right now in this degraded state, i'm going to dump the periodic pipeline. | 02:51 |
corvus | okay, afaict the launchers are working as well as they can be atm with ord misbehaving | 03:22 |
corvus | i had to dump the periodic, periodic-stable, and check-arm64 queues in openstack | 03:23 |
corvus | i also had to re-enqueue some changes in zuul/gate in order to get it to stop trying to boot nodes on ord | 03:23 |
*** mrunge_ is now known as mrunge | 08:49 |
stephenfin | clarkb: fungi: When you're around, I think we want a v7.0.1 release for https://review.opendev.org/c/openstack/pbr/+/957262/ and above | 10:50 |
stephenfin | ...and above | 10:51 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: Raise connection pool for boto3 in s3 upload role https://review.opendev.org/c/zuul/zuul-jobs/+/957218 | 12:42 |
tristanC[m] | It looks like I was mentioned last week in this room, but I can't find the message with element, apparently the "jump to notification" feature is not working. So please contact me again if needed! | 13:19 |
frickler | tristanC[m]: likely this one? https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2025-08-07.log.html#opendev.2025-08-07.log.html#t2025-08-07T18:15:24 | 13:26 |
tristanC[m] | frickler: ha thanks! So are you grepping these irclogs file to find matrix mention? | 13:31 |
tristanC[m] | clarkb: that's right, that's done for "reproducibility" where the image timestamps are set to a known value. But that's not super useful, next time I update the matrix gerritbot I'll use the current time of the day. | 13:32 |
tristanC[m] | But as far as I know, the service has been humming away without noticeable bug, so I'm inclined to leave it that way unless you want to fix that. | 13:33 |
frickler | tristanC[m]: I just looked for a mention of your nick last week. but for larger searches I do grep my local logs sometimes. one more advantage of IRC over matrix, I'd think, haven't found a way to save logs from element yet | 13:49 |
fungi | while i can't recommend the matrix-rs plugin for weechat in its present state due to significant missing features, it does give me the ability to log matrix room discussions just as if they were irc channels | 13:51 |
corvus | frickler: Room Info -> Export Chat | 13:53 |
corvus | in element | 13:53 |
fungi | frickler: on 957262 why are you saying "(alias, arg) in DEPRECATED_CFG" won't match in that spot if it was matching a few lines earlier (the entire reason for the change is to get it to stop outputting those warnings unless they were actually used) | 13:58 |
corvus | i'm still not happy with the launcher behavior; i have some stuff to do this morning, but i'll resume looking into that when i'm done. | 13:58 |
corvus | the launchers aren't stuck, they're making progress, just slowly and poorly. i wouldn't try to do anything manual. | 13:59 |
fungi | noted, thanks corvus | 13:59 |
frickler | fungi: from my understanding it cannot have worked before, either | 14:02 |
fungi | so DEPRECATED_CFG isn't a list of tuples then? | 14:04 |
fungi | or is the problem more nuanced? | 14:04 |
fungi | i'm just trying to figure out where the excess warning messages were coming from if that was dead code as you say | 14:05 |
frickler | fungi: you can check the definitions at the beginning of the same file | 14:05 |
frickler | these are nested tuples, with alias = (section, keyname), and only that tuple is being used as key in DEPRECATED_CFG | 14:06 |
fungi | huh, i think the definition may have missing commas | 14:06 |
frickler | I commented on that, too, but that's a different - albeit related - bug | 14:07 |
fungi | the dict keys should be ('metadata', 'home_page') rather than concatenated strings i guess | 14:07 |
frickler | yes | 14:07 |
fungi | implicit string concatenation strikes again | 14:07 |
fungi | oh, *some* of them match | 14:07 |
fungi | ('metadata', 'requires_dist') is a key, for example | 14:08 |
fungi | it's just the first few that are missing commas and so will never match | 14:08 |
frickler | is there a sample of the excess warning messages showing up in logs somewhere? | 14:08 |
fungi | stephenfin: ^ did you have a public sample to link? otherwise i can fish one out of a random job log for a pbr-using project | 14:09 |
fungi | basically some of the keys in the current DEPRECATED_CFG dict aren't working at all because of the missing commas leading to implicit concatenated strings, while others are matching because they're proper tuples as expected | 14:12 |
fungi | prior to 957262 pbr is emitting a warning for every config option it iterates over if that tuple is a key in the DEPRECATED_CFG dict (so all of the not-broken ones), with 957262 it only does so if the option was present in the configuration it's checking, since the conditional above it short-circuits the loop first | 14:14 |
frickler | I still claim nothing should match at all, since e.g. (alias, arg) = (('metadata', 'home_page'), 'url') in CFG_TO_PY_SETUP_ARGS, while DEPRECATED_CFG has (modulo missing comma) only ('metadata', 'home_page') as dict key | 14:15 |
fungi | ah, i see, so alias is also a tuple | 14:16 |
frickler | yes and it is getting split as "section, option = alias" in L410. so using "(section, option)" in the deprecation check could also work and might be more readable | 14:17 |
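(The missing-comma failure mode is easy to reproduce in isolation; this is a toy dict, not pbr's actual definitions.)

```python
# Toy reproduction of the implicit string concatenation bug; keys and
# messages are invented, the real table lives in pbr.
DEPRECATED_CFG = {
    # Missing comma: the two adjacent literals concatenate, so the key is
    # the plain string 'metadatahome_page', not a 2-tuple.
    ('metadata' 'home_page'): 'home_page is deprecated',
    # Correct form: a real ('section', 'option') tuple key.
    ('metadata', 'requires_dist'): 'requires_dist is deprecated',
}

print(list(DEPRECATED_CFG.keys()))
# ['metadatahome_page', ('metadata', 'requires_dist')]
print(('metadata', 'home_page') in DEPRECATED_CFG)      # False: broken key
print(('metadata', 'requires_dist') in DEPRECATED_CFG)  # True:  working key
```

That matches the split behavior noted above: the entries with real tuple keys can warn, while the concatenated ones never match anything.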
clarkb | Looks like corvus is already looking at launcher things. I'll probably start my morning looking at the zuul weekly reboot status (after noticing inconsistency on the components page last night) then look at the pbr changes. | 14:57 |
clarkb | but first breakfast | 14:57 |
clarkb | TASK [zuul-executor : Check if Zuul Executor containers are running] failed on ze05 with rc -13. This is what we believe to be the ssh ControlPersist race with starting a task as the ssh process shuts down | 15:03 |
clarkb | so I guess that is "good" news in that it isn't a new error | 15:03 |
clarkb | We can probably just manually start the cronjob today if we want to kick things off sooner but I think with the launcher situation it's probably best to address launcher things first and possibly just wait for the late friday cronjob to fire. Then check the results first thing monday | 15:04 |
clarkb | the run prior appears to have been successful | 15:04 |
clarkb | it would be nice if ansible had some if rc code == -13 then retry option | 15:10 |
corvus | clarkb: did we merge the change to do keepalives? | 15:48 |
clarkb | corvus: yes | 15:50 |
clarkb | `ssh_args = -o ControlMaster=auto -o ControlPersist=180s -o ServerAliveInterval=60` is the current /etc/ansible/ansible.cfg entry on bridge | 15:51 |
stephenfin | Thanks, frickler 🤦 I've cleaned that up now https://review.opendev.org/c/openstack/pbr/+/957262 | 16:10 |
fungi | and added tests! | 16:11 |
fungi | very nice | 16:11 |
corvus | i get a 500 error for /usr/launcher-venv/bin/openstack --os-cloud opendevzuul-rax-flex --os-region SJC3 server list | 16:26 |
corvus | seems to work for dfw3 | 16:26 |
corvus | Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. | 16:27 |
corvus | <class 'KeyError'> | 16:27 |
fungi | hah | 16:27 |
fungi | dan_with: do you happen to know whether there's maintenance or an incident going on in flex sjc3 right now that would cause the nova api to return a 500 error? ^ | 16:28 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable raxflex-sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/957449 | 16:29 |
fungi | i'll test with my personal account too | 16:29 |
corvus | opendevci account works | 16:29 |
corvus | may be a problem with opendevzuul | 16:30 |
fungi | though i did notice a vm i have there spontaneously rebooted at 11:19 utc today | 16:30 |
clarkb | I want to say we had this problem before and it required intervention on the cloud side | 16:30 |
* clarkb tries to find notes | 16:30 |
opendevreview | Merged opendev/zuul-providers master: Disable raxflex-sjc3 https://review.opendev.org/c/opendev/zuul-providers/+/957449 | 16:31 |
clarkb | according to my irc logs jamesdenton helped debug this issue on April 9 | 16:32 |
clarkb | at the time there were three instances in an Error state but nova couldn't list them due to keyerrors | 16:33 |
corvus | we are down 3 cloud regions; this is making a significant dent in our capacity. | 16:33 |
clarkb | https://bugs.launchpad.net/nova/+bug/2095364 its this bug | 16:34 |
clarkb | or was this bug. Not 100% certain it is the same problem this time | 16:34 |
clarkb | I think we can try an older microversion and attempt to delete error'd nodes that way | 16:35 |
clarkb | I'll look into that | 16:35 |
corvus | earlier one of the launchers saw: 2025-08-14 16:28:22,056 ERROR zuul.Launcher: <class 'sqlalchemy.exc.OperationalError'> | 16:35 |
clarkb | `sudo openstack --os-cloud opendevzuul-rax-flex --os-region SJC3 --os-compute-api-version 2.54 server list` this works on bridge | 16:36 |
corvus | clarkb: if you have a process for listing servers using the earlier microversion, i'd be interested in the process and/or the current output | 16:36 |
clarkb | corvus: ^ that is it | 16:36 |
corvus | ++ | 16:36 |
clarkb | I'll hold off on deleting anything but there are a number of servers in an ERROR state. | 16:36 |
corvus | thanks! give me 1 second, then i think we can go to town deleting | 16:37 |
clarkb | setting the version to 2.95 also works | 16:37 |
clarkb | I don't know why I have a much older version in my command history too, but I think that confirms this is the problem | 16:37 |
corvus | okay i think we can delete stuff now | 16:38 |
clarkb | corvus: do you want me to do that? | 16:38 |
corvus | yes please | 16:39 |
clarkb | ok I will attempt to manually delete every instance in sjc3 in an error state | 16:39 |
corvus | i just saw the same error in dfw3, then dfw3 worked again in a subsequent attempt a few seconds later | 16:41 |
corvus | (sjc3 still seems consistent) | 16:41 |
corvus | and now it works | 16:42 |
fungi | huh | 16:42 |
corvus | (after clarkb deleted stuff, i think) | 16:42 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Revert "Disable raxflex-sjc3" https://review.opendev.org/c/opendev/zuul-providers/+/957452 | 16:42 |
clarkb | yes deletions are complete and a listing shows no more ERROR'd nodes | 16:42 |
fungi | that's "fun" | 16:42 |
opendevreview | Merged opendev/zuul-providers master: Revert "Disable raxflex-sjc3" https://review.opendev.org/c/opendev/zuul-providers/+/957452 | 16:43 |
clarkb | last time the cloud intervened on our behalf and deleted the nodes for us | 16:44 |
clarkb | but it is good to know that we can do it if we use the correct magical microversion | 16:45 |
fungi | yeah, that's a great find | 16:45 |
fungi | though it's supposedly fixed in a stable point release for dalmatian, has a fix merged in stable/caracal (no point release yet), and one proposed for stable/bobcat that hasn't merged | 16:47 |
clarkb | I don't know which version rackspace flex has deployed but maybe dan_with can try to get that fix on their radar | 16:48 |
clarkb | thinking out loud here and I'm not convinced this is a good idea: but we could probably get away with forcing microversion 2.95 on all clouds that support 2.96? | 16:51 |
clarkb | that would require looking up supported versions but launchers are already caching things on startup about clouds so maybe this would fit into that process? Basically look up microversions if present and, if version 2.95 is in the valid versions, use it; otherwise use whatever default the sdk picks | 16:52 |
corvus | i'm not sure if that affected the launcher... now that i know more of what's going on, that may have been cli only | 16:52 |
clarkb | huh I guess that could be if the sdk listing things isn't requesting the field with a keyerror or is using a "compatible" microversion for some reason? | 16:53 |
corvus | we don't use the sdk method for listing | 16:54 |
clarkb | got it. That may explain it then as we probably aren't doing anything with microversions like the cli or sdk would | 16:55 |
corvus | the launchers do a straight keystoneauth get of '/servers/detail' | 16:55 |
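(For reference, pinning a microversion on that kind of raw request just means sending the compute microversion header. A hedged sketch of how that could look; the cloud and region names are from this incident, the rest is ordinary keystoneauth/openstacksdk usage rather than the launcher's actual code.)

```python
# Sketch only: a raw /servers/detail GET with a pinned compute microversion,
# not the launcher's real implementation.
import openstack
from keystoneauth1.adapter import Adapter

conn = openstack.connect(cloud="opendevzuul-rax-flex", region_name="SJC3")
compute = Adapter(
    session=conn.session,
    service_type="compute",
    region_name="SJC3",
    interface="public",
    # Sends "OpenStack-API-Version: compute 2.95" on each request, staying
    # below the newer code path that raised the KeyError in this nova bug.
    default_microversion="2.95",
)
servers = compute.get("/servers/detail").json()["servers"]
errored = [s["id"] for s in servers if s["status"] == "ERROR"]
print(errored)
```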
clarkb | fungi: looking at the timestamps on that nova bug seems like a lot of those changes just merged | 17:15 |
clarkb | I guess we shouldn't be surprised the fixes haven't rolled out to actual clouds yet | 17:15 |
fungi | yeah, agreed | 17:15 |
fungi | apropos of nothing in particular, https://github.com/mfontanini/presenterm looks like a presentty clone in rust (though using markdown instead of restructuredtext) | 17:29 |
corvus | okay, i have identified 2 improvements i'd like to make to the launchers. they aren't trivial, but they aren't too complex. it's going to take a bit. i don't have any quick fixes to deal with the backlog. i'm just going to try to plow through writing these changes and we can try to get them merged asap. | 17:34 |
corvus | after that, i'll fix up the event stuff from yesterday. even though that was catastrophic, as long as we don't start swapping again, i think it's slightly lower priority. | 17:34 |
clarkb | makes sense | 17:35 |
fungi | sounds good, thanks! | 17:41 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/957456 Include requested nodes in quota percentage calculation [NEW] | 17:43 |
corvus | that's the first change -- hopefully the commit msg explains sufficiently | 17:43 |
corvus | the second one i'm going to work on is roughly "order the node objects so we fill old multinode requests sooner" | 17:44 |
fungi | yep, commit message made sense, now seeing if i can apply that understanding to the diff | 17:45 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/957457 Sort nodes by request time [NEW] | 18:13 |
corvus | that's change #2 | 18:13 |
corvus | was not expecting sequential numbers for those | 18:14 |
corvus | i think the systemic problem that's going on is that: | 18:23 |
corvus | any time the quota usage for a provider drops, the launchers see an opportunity and create a node request for that provider, regardless of how many other backlogged requests there are | 22:23 |
corvus | even if that only happens one at a time (because after it creates one node, it should account for that allocation and defer the next request), given enough time, all the node requests can end up with allocated nodes, as each one has plenty of opportunities to find a "hole" | 18:23 |
corvus | the first change should help us avoid that situation by accounting for the backlog earlier, so that zuul doesn't see the hole in the first place | 18:24 |
corvus | some leakage will still happen due to multinode requests; the second change helps get them cleared out in the right order. | 18:24 |
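(A toy rendering of those two ideas, deliberately simplified; the field names are invented and this is not the code in 957456 or 957457.)

```python
# Toy model only: illustrates the intent of the two fixes, not Zuul's code.

def provider_has_room(provider, backlog):
    # Fix 1: count nodes already promised to accepted-but-unfulfilled
    # requests as used quota, so a momentary dip in usage is not mistaken
    # for a free slot (the "hole" described above).
    promised = sum(len(req["labels"]) for req in backlog if not req["fulfilled"])
    return provider["in_use"] + provider["building"] + promised < provider["quota"]

def build_order(pending_nodes):
    # Fix 2: serve nodes for the oldest requests first, so partially
    # fulfilled multinode requests get completed instead of sitting on
    # ready nodes while newer requests jump ahead.
    return sorted(pending_nodes, key=lambda n: n["request_created_time"])
```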
clarkb | corvus: I have some questions on https://review.opendev.org/c/zuul/zuul/+/957457 | 18:27 |
corvus | thx replied | 18:31 |
clarkb | corvus: thanks one followup posted | 18:35 |
clarkb | side note: I love that datetime.datetime.utcfromtimestamp is deprecated and will be removed in a future python release because you are supposed to use timezone aware objects to represent datetimes in UTC | 18:37 |
clarkb | "utcfromtimestamp" isn't timezone aware enough I guess | 18:38 |
clarkb | posix epoch times are recorded as beginning from midnight january 1 1970 UTC. Asking for the datetime in UTC X seconds after the epoch is timezone aware... | 18:38 |
corvus | clarkb: thx fixed | 18:39 |
clarkb | corvus: are you pushing a new patchset? | 18:48 |
clarkb | want to make sure I wait for that to +2 before lunch if so | 18:48 |
corvus | oh sorry, git-review was asking me a question i forgot to answer | 19:16 |
corvus | hope you didn't wait for lunch | 19:16 |
fungi | git-review can be so nosey sometimes | 19:18 |
*** dhill is now known as Guest24174 | 19:31 |
clarkb | I did not wait :) | 19:39 |
clarkb | I've approved it now | 19:40 |
JayF | clarkb: utcfromtimestamp() returns a tzinfo=None object; so while you get a datetime that is UTC, it doesn't return a tz-aware value | 19:50 |
JayF | they can't change the return without breaking the api | 19:51 |
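(For the record, the replacement the deprecation points at is the timezone-aware form.)

```python
from datetime import datetime, timezone

ts = 1755129600  # an arbitrary POSIX timestamp

# Deprecated since Python 3.12: returns a naive datetime (tzinfo is None)
# even though its wall-clock fields are UTC.
naive = datetime.utcfromtimestamp(ts)

# Replacement: an aware datetime that carries tzinfo=timezone.utc.
aware = datetime.fromtimestamp(ts, tz=timezone.utc)

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC
```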
JayF | I am unable to reach security.openstack.org | 20:36 |
JayF | which is resolving to static02.opendev.org for me | 20:36 |
JayF | only via ipv4, once I tried ipv6 it's working | 20:38 |
JayF | mtr implies it's somewhere close to the lasthop that's dropping | 20:38 |
fungi | i can ping it over v4 and v6 | 20:39 |
fungi | so if it's close to the last hop, it might be that you've got an asymmetrical route so your traceroute shows the drop at the hop where they converge (the provider's border) | 20:40 |
fungi | if you want to msg me your ip address i could try to traceroute to it from static02 for comparison | 20:40 |
clarkb | JayF: I mean that seems reasonable if A) the function name is literally UTC already and B) they are going to delete it and break everyone anyway | 20:41 |
clarkb | I dunno just seems like if you're going to break it then breaking a subset of people is probably better | 20:41 |
JayF | I don't really agree; you can't use tzinfo=None and tzinfo=!None objects interchangeably. It's just a bad interface from the beginning and any pain now is just paying back for that original sin | 20:42 |
fungi | it sends a clear message | 20:42 |
fungi | to any other functions that might try to misbehave | 20:42 |
JayF | fungi: https://usercontent.irccloud-cdn.com/file/fQNI38xe/image.png | 20:43 |
corvus | i have packet loss to both review02 and static02 | 20:45 |
fungi | JayF: yeah, going back the other way i see packet loss starting in or just past tukw.qwest.net | 20:46 |
fungi | potentially at the lumen/qwest peer there | 20:46 |
JayF | just broken internet stuff, got it | 20:47 |
JayF | makes sense | 20:47 |
JayF | corvus: are you my ISP-brother lol | 20:47 |
corvus | probably not; maybe just a bad internet day :) | 20:48 |
corvus | fungi: clarkb https://review.opendev.org/957464 is the extra fix to the event thing from yesterday (also, swest added a change to that stack too) | 20:49 |
fungi | static is in rackspace dfw, while review is in vexxhost ca-ymq-1, so different clouds in different parts of the continent even | 20:49 |
corvus | so i've got a whole sequence stacked up now, with the 2 changes from this morning, the 3 event sequence changes, and then 2 more small patches on that. | 20:50 |
corvus | stack end is at https://review.opendev.org/957296 | 20:50 |
corvus | if everything looks good to folks, i'll probably take a build of that and manually deploy it to launchers | 20:50 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Replace 2025.2/Flamingo key with 2026.1/Gazpacho https://review.opendev.org/c/openstack/project-config/+/957467 | 21:23 |
clarkb | corvus: I have a few questions on https://review.opendev.org/c/zuul/zuul/+/957464 | 21:47 |
corvus | replied | 21:58 |
clarkb | corvus: thanks I think that behavior I'm thinking of with the .get is only on certain classes that have a list of attributes that zuul loops through to load up | 22:06 |
clarkb | this one doesn't do that so I was mistaken | 22:06 |
corvus | ++ i like this pattern better | 22:08 |
corvus | (can't always use it tho) | 22:08 |
corvus | i think it's going to be a long time until we get an image build we can use with the launchers. | 22:19 |
corvus | (and even longer until those changes actually merge) | 22:19 |
corvus | i wonder what we could do to improve things.... | 22:19 |
corvus | i could try to monkeypatch those fixes into the running server, but that's a lot to patch and easy to get wrong | 22:20 |
corvus | we could dequeue some stuff to clear out the queue | 22:20 |
corvus | i could try to build an image manually | 22:21 |
corvus | or i could push up a change exclusively to build an image; if we clear the zuul tenant queues, that job might get done in a reasonable time | 22:22 |
corvus | i'm open to other suggestions | 22:22 |
clarkb | change just to build an image seems like maybe something worth trying, then fall back to a locally built image? | 22:23 |
corvus | ack. i'll try that. | 22:23 |
corvus | okay, that change is uploaded, its one job is running now. i also am running the test suite locally on that change. | 22:28 |
corvus | it's also worth noting that even when we restart the launcher with these changes, the existing node requests are still going to be a bit of a mess. so until we work through that backlog, we may continue to have problems. we also may not be able to work through the backlog due to the way the multinode jobs have wedged things. so even then, we may need to start dequeuing some things. | 22:30 |
corvus | er, i said the job was running, i should have said the node was building... :/ | 22:35 |
corvus | unit tests and linters pass locally; so if the job ever runs, we're good | 22:37 |
corvus | 2025-08-14 22:38:24,549 ERROR zuul.Launcher: openstack.exceptions.HttpException: HttpException: 503: Server Error for url: https://dfw.servers.api.rackspacecloud.com/v2/637776/servers, Service Unavailable | 22:41 |
corvus | we're getting 503 errors in rax-dfw (classic) now | 22:42 |
corvus | we were getting those with rax-ord yesterday, so i turned it off | 22:42 |
corvus | should we do the same for dfw? | 22:43 |
corvus | that is our only remaining region in rax classic | 22:43 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable rax-dfw https://review.opendev.org/c/opendev/zuul-providers/+/957470 | 22:46 |
opendevreview | Merged opendev/zuul-providers master: Disable rax-dfw https://review.opendev.org/c/opendev/zuul-providers/+/957470 | 22:47 |
corvus | i'm aggressively cleaning up logs on the launchers | 22:59 |
clarkb | fwiw trying to list servers from bridge in rax dfw is very slow to return (I haven't gotten results yet) and I wonder if the proxy will eventually 503 | 23:02 |
clarkb | oh wait no there it goes it listed | 23:02 |
corvus | the graphs and logs suggest that it's spotty | 23:02 |
clarkb | corvus: I wonder if the 503 problem is just something we have to be resilient against happening occasionally? | 23:02 |
corvus | but when there are problems, it's hours at a time | 23:02 |
clarkb | ya that | 23:02 |
clarkb | ah ok | 23:02 |
corvus | yeah, if it were occasionally, then i think we could work around it, but it's looking a lot more like a 6 hour stretch where the launcher can't see what servers there are | 23:03 |
corvus | pulling images | 23:06 |
corvus | the launchers are restarted | 23:09 |
corvus | i think we're going to need to dequeue at least some of the queue items with multinode jobs in order to un-wedge things | 23:15 |
corvus | i will start picking some off. | 23:15 |
clarkb | is that because we're still susceptible to the ordering problem with those node requests hanging around? | 23:22 |
clarkb | (basically we need new ones to get the ordering right?) | 23:22 |
corvus | yep, and because so many of the multinode requests are partially completed and they're hogging the quota as ready nodes waiting for their friends | 23:23 |
corvus | i dequeued 2 kolla changes, and the number of in-use nodes in ovh-bhs1 has increased from 1 to 25 | 23:26 |
corvus | i think it will be able to slowly make progress in that region at this point | 23:26 |
*** cloudnull10977461 is now known as cloudnull | 23:56 |