opendevreview | Merged opendev/system-config master: launch: further DNS cleanups https://review.opendev.org/c/opendev/system-config/+/880400 | 00:40 |
ianw | the try.gitea.io cert has expired, which is a bit annoying for testing against it | 01:13 |
ianw | ok; scoped access tokens have this written all over it. i found that by tracing it back to the top-level api router and "git blame" -> https://github.com/go-gitea/gitea/commit/de484e86bc495a67d2f122ed438178d587a92526 | 01:24 |
ianw | filed a couple of issues about this; notes in https://review.opendev.org/c/opendev/system-config/+/877541 | 03:23 |
*** Trevor is now known as Guest11302 | 04:06 | |
opendevreview | Ian Wienand proposed opendev/zone-opendev.org master: Add DNS servers for Ubuntu Jammy refresh https://review.opendev.org/c/opendev/zone-opendev.org/+/880576 | 05:55 |
opendevreview | Ian Wienand proposed opendev/zone-opendev.org master: Add Jammy refresh NS records https://review.opendev.org/c/opendev/zone-opendev.org/+/880577 | 06:07 |
*** amoralej|off is now known as amoralej | 06:16 | |
opendevreview | Ian Wienand proposed opendev/system-config master: inventory : add Ubuntu Jammy DNS refresh servers https://review.opendev.org/c/opendev/system-config/+/880579 | 06:16 |
ianw | ^ that's getting closer; the hosts are up. i need to think through a few things so we can have two adns servers | 06:22 |
opendevreview | Ian Wienand proposed opendev/system-config master: dns: abstract names https://review.opendev.org/c/opendev/system-config/+/880580 | 06:31 |
ianw | ^ that's a start | 06:31 |
ianw | i'll think about it some more too | 06:31 |
dpawlik | dansmith: hey, soon I would like to make a release for ci-log-processing, but I still see some leftovers that are not sent to opensearch. There are just a few, but some each week. Almost all of the failures from this week were caused by parsing the performance.json file - | 08:13 |
dpawlik | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1c2/periodic/opendev.org/x/networking-opencontrail/master/noc-tempest-neutron-plugin/1c21c82/controller/logs/performance.json | 08:13 |
dpawlik | the "MemoryCurrent": 18446744073709551615 seems to be "too big" to the Opensearch field | 08:14 |
dpawlik | dansmith: is it correct? Are you using the performance index in Opensearch? | 08:14 |
dpawlik | I see that most of the errors in the performance log come from the project x/networking-opencontrail. Is it still used? Can we remove the periodic job "noc-tempest-neutron-plugin"? | 08:18 |
opendevreview | Daniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh https://review.opendev.org/c/zuul/zuul-jobs/+/880517 | 09:01 |
opendevreview | Daniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh https://review.opendev.org/c/zuul/zuul-jobs/+/880517 | 09:18 |
gthiemonge | Hi Folks, there are multiple jobs stuck in zuul | 12:14 |
gthiemonge | ex: https://zuul.openstack.org/status/change/880435,1 | 12:15 |
fungi | gthiemonge: thanks for the heads up. i was on vacation last week so need to catch up on what might have changed, but it looks like our weekend upgrade stopped a quarter of the way through the executors too: https://zuul.opendev.org/components | 12:51 |
fungi | i wonder if ze04 is the culprit | 12:51 |
fungi | it has what look to be a bunch of hung git processes dating back to thursday | 12:53 |
fungi | i think the lingering git cat-file --batch-check processes are a red herring. executors from both before and after container restarts seem to have a bunch of them too | 13:09 |
fungi | 2023-04-17 13:09:57,978 DEBUG zuul.ExecutorServer: Waiting for 2 jobs to end | 13:10 |
fungi | so for some reason there are two builds on ze04 it can't seem to terminate | 13:10 |
fungi | 2023-04-17 13:10:44,015 DEBUG zuul.ExecutorServer: [e: b54117e7ad044144b1d1cce0bd252f19] [build: c4157ad90b3c4db383d3ac5fb6ce9707] Stop job | 13:10 |
fungi | 2023-04-17 13:10:44,015 DEBUG zuul.ExecutorServer: Unable to find worker for job c4157ad90b3c4db383d3ac5fb6ce9707 | 13:10 |
fungi | we have "Unable to find worker for job" messages dating back over a week though, basically all the way back to the start of our log retention. and not just on ze04 either, executors from before and after the restart seem to have them, so probably not related? | 13:14 |
fungi | i think the executors are just being very, very, very slow to gracefully stop, like taking about a day each | 13:19 |
fungi | looks like ze01 started around 2023-04-15 00:00z on schedule | 13:19 |
fungi | started stopping i mean | 13:20 |
fungi | then ~48 minutes later ze02 began its graceful stop | 13:20 |
fungi | and it took over a day; ze03 began to stop around 10:30z today | 13:21 |
fungi | and ze04 a little after 11:50z so that was less than 1.5 hours | 13:23 |
fungi | so i guess the delay was really just ze02 for some reason | 13:23 |
fungi | ze04 will probably finish stopping soon | 13:24 |
fungi | so maybe the builds that have been queued for so long are unrelated to the restart slowness. certainly quite a few of them are stuck in check since before the restarts were initiated anyway | 13:25 |
fungi | gthiemonge: i notice a disproportionately large number of these jobs waiting for node assignments are for octavia changes, and i seem to remember octavia has relied heavily on nested-virt node types in the past. what are the chances all these waiting jobs want nested-virt nodes? maybe i should start looking at potential problems supplying nodes from some specific providers | 13:30 |
fungi | the octavia-v2-dsvm-scenario-ipv6-only build for 879874,1 has been waiting for a nested-virt-ubuntu-focal node since 2023-04-14 05:36:32z | 13:31 |
fungi | that was node request 300-0020974848, so i guess i'll look into where that ended up | 13:31 |
dansmith | dpawlik: no, tbh, I didn't realize the performance index was actually ingesting those values now, but I see that it is | 13:32 |
dansmith | dpawlik: I agree that memory value must be wrong, but I'd have to go dig to figure out why.. can you ignore those logs for now? | 13:32 |
gthiemonge | fungi: yeah, these jobs are using nested-virt nodes, and I think that most of them (but not all) are centos-9-stream-based jobs | 13:36 |
fungi | dpawlik: dansmith: apropos of nothing in particular, 18446744073709551615 is 2^64-1, so looks like something passed a -1 as an unsigned int | 13:36 |
fungi | unsigned 64-bit wide int specifically | 13:37 |
fungi | likely whatever was being measured didn't exist/had no value and attempted to communicate that with a -1 | 13:37 |
dansmith | ack | 13:37 |
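A quick check of the arithmetic in the exchange above (a minimal Python sketch; the only input is the MemoryCurrent value from the linked performance.json):

```python
# -1 written into an unsigned 64-bit field wraps around to 2**64 - 1,
# which is exactly the MemoryCurrent value in the linked performance.json.
sentinel = -1
as_uint64 = sentinel % 2**64   # reinterpret as unsigned 64-bit
print(as_uint64)               # 18446744073709551615
assert as_uint64 == 2**64 - 1 == 18446744073709551615
```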
fungi | looks like nl03 took the lock for nr 300-0020974848 at 05:36:38 and logged boot failures in vexxhost-ca-ymq-1, then nl04 picked it up at 05:39:52 and tried to boot in ovh-bhs1 but failed and marked the last node attempt for deletion at 05:50:03, but i see no further attempts by any providers to service the request after that point nor was it released as a node error | 13:45 |
fungi | ever since 2023-04-14 05:50:03 there's nothing about it | 13:46 |
fungi | i need to step away for a few to run a quick errand, but if any other infra-root is around and wants to have a look, i think the first example's breadcrumb trail ends at nl04 | 13:49 |
*** dviroel__ is now known as dviroel | 13:53 | |
Clark[m] | Executors logging that they are unable to find workers for builds is normal when you have more than one executor. Basically the executor is finding that it can't process a build because it is running on another executor. | 14:03 |
Clark[m] | The issue with builds being stuck seems similar to the issue corvus and I looked into last week. https://review.opendev.org/c/zuul/nodepool/+/880354 is expected to make that better and I think restarting breaks the deadlock so landing that change and deploying new images should get stuff moving again | 14:05 |
Clark[m] | This should be independent of slow zuul restarts since executor stops wait on running builds and Nodepool deadlocking happens before builds begin | 14:06 |
fungi | makes sense | 14:48 |
opendevreview | Daniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh https://review.opendev.org/c/zuul/zuul-jobs/+/880517 | 15:01 |
clarkb | I've approved that change just now | 15:10 |
clarkb | infra-root Wednesday starting at about 20:00 UTC looks like a good time for an etherpad outage and data migration/server swap for me. Any objections to this time? If not I'll go ahead and announce it to service-announce | 15:19 |
fungi | clarkb: thanks, it looked good to me but i wasn't clear whether there was a reason it sat unapproved and was trying to skim the zuul channel for additional related discussion | 15:20 |
clarkb | I don't think there was any particular reason. If i had noticed that it wasn't merged on friday I would've approved it then (though I was also afk due to kids being out of school) | 15:22 |
fungi | makes sense | 15:23 |
fungi | i guess once updated images are available we can pull and restart the launchers? | 15:23 |
clarkb | fungi: ansible will automatically do that for us in the opendev hourly job runs | 15:23 |
fungi | oh, right | 15:23 |
fungi | which should hopefully also get all those deadlocked node requests going again | 15:24 |
clarkb | fungi: https://review.opendev.org/c/opendev/zone-gating.dev/+/880214 may interest you. Makes a 1hour ttl default consistent across the dns zone files we manage (the others have already been updated) | 15:24 |
clarkb | correct, since the deadlock is due to in-process state | 15:24 |
fungi | infra-prod-service-nameserver hit RETRY_LIMIT in deploy for 880214 just now | 15:36 |
fungi | ansible said bridge01.opendev.org was unreachable | 15:37 |
fungi | "zuul@bridge01.opendev.org: Permission denied (publickey)." | 15:37 |
fungi | i guess we haven't authorized that project key? | 15:38 |
fungi | intentionally? | 15:38 |
fungi | presumably our periodic deploy will still apply the change | 15:39 |
*** amoralej is now known as amoralej|off | 15:42 | |
clarkb | fungi: gating.dev is the one you had a change up to add jobs for right? I suspect that yes we need the project key to be added to bridge | 15:44 |
clarkb | and yes the daily job should get us in sync | 15:44 |
clarkb | (that is what happened with the static changes to gating.dev just had to wait for the daily run) | 15:44 |
fungi | yeah, that was https://review.opendev.org/879910 which merged 10 days ago, so i guess that's when it started | 15:48 |
fungi | we have these keys authorized so far: zuul-system-config-20180924 zuul-project-config-20180924 zuul-zone-zuul-ci.org-20200401 zuul-opendev.org-20200401 | 15:50 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Allow opendev/zone-gating.dev project on bridge https://review.opendev.org/c/opendev/system-config/+/880661 | 15:56 |
fungi | clarkb: ^ like that i guess | 15:56 |
clarkb | yes that looks right | 15:57 |
fungi | ftr, i obtained the key with `wget -qO- https://zuul.opendev.org/api/tenant/openstack/project-ssh-key/opendev/zone-gating.dev.pub` | 15:58 |
opendevreview | Clark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup https://review.opendev.org/c/opendev/system-config/+/880672 | 16:23 |
clarkb | I think ^ is a reasonable workaround for the gerrit replication issue we discovered during the recent gerrit upgrade | 16:23 |
clarkb | fungi: does wednesday at 20:00 UTC for a ~90 minute etherpad outage and server move work for you? | 16:26 |
fungi | yeah, sgtm | 16:27 |
clarkb | thanks. you tend to be on top of project happenings and are a good one to ask for that sort of thing | 16:27 |
fungi | mmm. actually not project-related but i may not be around at that time... i can do later though, like maybe 22:00 or 23:00z | 16:28 |
fungi | wedding anniversary and we were looking at going up the island to a place that doesn't open until 19:00z so probably wouldn't be back early enough to make 20:00 maintenance | 16:29 |
clarkb | later times also work for me | 16:30 |
clarkb | re zuul restart slowness I think it is related to the nodepool node stuff after all. In particular I think ze04 is "running" a paused job that is waiting on one or more of the jobs that are queued to run | 16:31 |
clarkb | this will in theory clear up automatically with the nodepool deployment but we should keep an eye on the whole thing | 16:31 |
clarkb | looks like some of the queued jobs are running? | 16:32 |
fungi | interesting... i wonder if that's why ze02 took almost 1.5 days to gracefully stop | 16:32 |
clarkb | all four launchers did restart just over 15 minutes ago which should've pulled in that latest image (it was promoted ~33 minutes ago) | 16:33 |
clarkb | and the swift change at the top of the queue just started its last remaining job | 16:34 |
clarkb | * top of the check queue | 16:34 |
fungi | perfect | 16:35 |
fungi | ze04 is still waiting on one of those to complete, looks like | 16:47 |
clarkb | ya it will likely take 3 or more hours if it is one of the tripleo buildsets | 16:48 |
clarkb | fungi: it's the paused job in the gate for 879863 | 16:58 |
clarkb | fungi: you can look on ze04 in /var/lib/zuul/builds to get the running build uuids. Then grep that out of https://zuul.opendev.org/api/tenant/openstack/status | 16:59 |
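A rough sketch of that lookup, assuming it runs on the executor itself and that a plain substring match against the status JSON is good enough (the path and URL are the ones mentioned above):

```python
# Rough sketch: list the build uuids the executor is currently working on,
# then look for each of them in the tenant status JSON, as described above.
import os
import urllib.request

BUILDS_DIR = "/var/lib/zuul/builds"  # per the message above
STATUS_URL = "https://zuul.opendev.org/api/tenant/openstack/status"

build_uuids = os.listdir(BUILDS_DIR)  # directory names are the build uuids
status = urllib.request.urlopen(STATUS_URL).read().decode("utf-8")

for uuid in build_uuids:
    # A plain substring search stands in for the "grep" step.
    print(uuid, "in status" if uuid in status else "not in status")
```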
clarkb | hrm I expected https://review.opendev.org/c/opendev/system-config/+/880672/1/playbooks/zuul/gerrit/files/cleanup-replication-tasks.py#25 to trigger in https://zuul.opendev.org/t/openstack/build/da3c4879c4ec47ab938665020cdfc2fe/log/review99.opendev.org/docker/gerrit-compose_gerrit_1.txt but it isn't in there | 17:21 |
clarkb | oh we docker-compose down to do renames and that will only trigger on the first startup but we don't collect logs from before | 17:25 |
opendevreview | Clark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup https://review.opendev.org/c/opendev/system-config/+/880672 | 17:32 |
clarkb | that mimics wait-for-it a bit in its logging output | 17:32 |
clarkb | I think ze04 should restart in about an hour | 17:38 |
clarkb | I've just updated the meeting agenda with what I'm aware of as being current. Please add content or let me know what is missing and I'll send that out later today | 18:16 |
fungi | thanks! | 18:17 |
clarkb | while I sorted out lunch it looks like ze04 was restarted. | 19:24 |
fungi | yes, it's working on 5 now | 19:25 |
clarkb | I think we are in good shape to finish up the restart now. We can probably check it tomorrow to ensure it completes | 19:25 |
fungi | agreed | 19:25 |
clarkb | going to send an announcement for the etherpad outage now. I'll indicate 22:00 UTC to 23:30 UTC wednesday the 19th | 19:28 |
johnsom | E: Failed to fetch https://mirror.ca-ymq-1.vexxhost.opendev.org/ubuntu/pool/universe/v/vlan/vlan_2.0.4ubuntu1.20.04.1_all.deb Unable to connect to mirror.ca-ymq-1.vexxhost.opendev.org:https: [IP: 2604:e100:1:0:f816:3eff:fe0c:e2c0 443] | 19:32 |
johnsom | https://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96 | 19:33 |
johnsom | Looks like there is an issue at vexxhost | 19:33 |
clarkb | if I open that link I get a download. | 19:34 |
clarkb | of course I'm going over ipv4 from here | 19:35 |
clarkb | hitting it via ipv6 also works. So whatever it is isn't a complete failure | 19:35 |
clarkb | could be specific to the test node too | 19:35 |
johnsom | It would not be the first time there was an IPv6 routing issue | 19:36 |
opendevreview | Clark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup https://review.opendev.org/c/opendev/system-config/+/880672 | 19:37 |
clarkb | https://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96/log/job-output.txt#3315 there it seems to indicate it tried both ipv4 and ipv6 | 19:38 |
clarkb | https://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96/log/job-output.txt#821 and it fails quite early in the job too. | 19:40 |
clarkb | which means it is unlikely that job payload caused it to happen | 19:40 |
clarkb | definitely seems like a test node that couldn't route internally to the cloud but generally had network connectivity (otherwise the job wouldn't run at all). But lets see the mirror side | 19:41 |
fungi | the boot failures in vexxhost which were contributing to the deadlocked node requests did mostly look like unreachable nodes too, so i wonder if there are some network reachability problems | 19:42 |
clarkb | no OOMs or unexpected reboots of the mirror node. and the apache process isn't new either | 19:42 |
clarkb | ianw: when you're around can you clarify whether or not you think we need to wait for upstream gitea to fix those api interaction things you posted bugs for before we upgrade? You are -1 on the change and I'm not sure if that means you think this is a big enough problem to hold off upgrading for now | 21:41 |
ianw | clarkb: umm, i guess i'm not sure. they've put both issues in the 1.19.2 target tracker | 21:57 |
clarkb | ya I think the main issue is if anyone is using the APIs as an unauthenticated user | 21:58 |
ianw | the external thing would be that the organisation list is now an authenticated call. i mean, i doubt anyone is using that though | 21:58 |
clarkb | the basic auth 401 problem is minor since you can force it with most tools seems like | 21:58 |
ianw | yeah, there may be other bits that have fallen under the same thing, i didn't audit them | 21:58 |
clarkb | I guess we can wait to be extra cautious. I'm mostly worried about letting it linger and then forgetting. But it is still fresh at this point | 22:00 |
ianw | i could go either way, i'm not -1 now we understand things, although i doubt we'll forget as we'll get any updates on those bugs | 22:02 |
ianw | it might be a breaking change for us with the icon setting stuff? i can't remember how that works, but that may walk the org list from an unauthenticated call? | 22:03 |
clarkb | it does it via the db actually | 22:04 |
clarkb | and it seems to work in the held node (there are logos iirc) | 22:04 |
ianw | it does hit it anonymously -> https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea-set-org-logos/tasks/main.yaml#L1 | 22:07 |
clarkb | ya that's the task my change updated https://review.opendev.org/c/opendev/system-config/+/877541/6/playbooks/roles/gitea-set-org-logos/tasks/main.yaml which sent us down the rabbit hole | 22:09 |
ianw | oh doh, right | 22:10 |
ianw | for some reason i had in my head that was on the test path | 22:12 |
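For reference, the org-list call under discussion is roughly the following (a sketch only; the host and token are hypothetical, and per the discussion the anonymous variant no longer returns the organization list in gitea 1.19):

```python
# Sketch of the /api/v1/orgs call discussed above (hypothetical host/token).
# Per the discussion, in gitea 1.19 the anonymous call no longer returns the
# organization list, while the token-authenticated call still does.
import json
import urllib.request

GITEA_URL = "https://gitea.example.org"      # hypothetical host
TOKEN = "REPLACE_WITH_A_REAL_TOKEN"          # hypothetical token

def list_orgs(token=None):
    req = urllib.request.Request(f"{GITEA_URL}/api/v1/orgs")
    if token:
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(list_orgs())       # anonymous: affected by the 1.19 behaviour change
print(list_orgs(TOKEN))  # authenticated: returns the org list
```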
ianw | clarkb: do you think we should run that cleanup script in a cron job? | 22:13 |
ianw | the gerrit replication cleanup script, sorry, to be clear | 22:13 |
clarkb | ianw: I think we could run it there as well. The files don't seem large and there are only a "few" thousand of them right now so we can probably get away with just doing it at container startup | 22:14 |
clarkb | the upside to doing it at startup is that it prevents a race in generating those errors in the logs at startup. The downside is we'd only run it at startup and we might not see if there are other types of files that leak or if it stops/doesn't work for some reason | 22:15 |
ianw | i guess startup and a cron job? | 22:16 |
clarkb | I'm definitely open to feedback on that. I was thinking about artificially injecting some of the leaked files into the test nodes too but it gets weird because ideally we would write real replication tasks that should replicate and those that shouldn't and check that the ones we want to be removed are removed and that the ones we want to replicate are replicated but we don't test | 22:17 |
clarkb | replication in the test nodes | 22:17 |
clarkb | basically to test this properly got really complicated quickly and I decided to push what I had early rather than focus on making it perfect | 22:17 |
ianw | fair enough. i guess we could just do an out-of-band test type thing with dummy files and make sure it removes what we want | 22:18 |
clarkb | ya that might be the easiest thing | 22:18 |
clarkb | ianw: re cronjob I think we may not have cron in the container images. We'd have to trigger a cronjob that ran docker exec? This should work fine just trying to think of the best way to write it down essentially | 22:22 |
ianw | yeah that's what i was thinking; cron from review that calls a docker exec | 22:22 |
clarkb | and maybe run it hourly or daily? | 22:23 |
clarkb | I'll do it in a followup change since we don't want the cron until the script is in the image running on the host | 22:24 |
ianw | i'd say daily is enough | 22:25 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run the replication task cleanup daily https://review.opendev.org/c/opendev/system-config/+/880688 | 22:41 |
clarkb | Something like that maybe. I tried to capture some of the oddities of this change in the commit message. We don't actually have anything like this running today. Not sure if reusing the shell container is appropriate. Again feedback very much welcome | 22:41 |
ianw | i think the mariadb backups are fairly similar | 22:58 |
ianw | clarkb: dropped a comment on run v exec and using --rm with run, if we want to use that | 23:02 |
clarkb | ianw: re --rm we aren't rm'ing that container today | 23:08 |
clarkb | that might make a good followup but I think we should leave it as is until we change it globally | 23:08 |
ianw | but that will create a new container on every cron run? why do we need to keep them? | 23:14 |
clarkb | it doesn't create a new container. It never deletes the container so it hangs around. You can see it if you run `sudo docker ps -a` on review02 | 23:17 |
clarkb | I don't think we need to keep them but I don't know that there is a good way to manage `docker-compose up -d` and also somewhat atomically remove the shell container it creates | 23:17 |
opendevreview | Clark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup https://review.opendev.org/c/opendev/system-config/+/880672 | 23:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run the replication task cleanup daily https://review.opendev.org/c/opendev/system-config/+/880688 | 23:22 |
clarkb | ianw: ^ that adds testing | 23:22 |
opendevreview | Clark Boylan proposed opendev/system-config master: Explicitly disable offline reindexing during project renames https://review.opendev.org/c/opendev/system-config/+/880692 | 23:27 |
clarkb | and that is something I noticed when working on the previous change | 23:28 |
clarkb | ianw: fwiw on the --rm thing I don't know that this was an anticipated problem when the shell pattern was used. I do kinda like having an obvious place to run things with less potential for impacting the running services though. However, maybe it is simpler to have fewer moving parts and we should try to factor out the shell container. This would affect our upgrade processes though | 23:31 |
clarkb | as they rely on this container for example | 23:31 |
clarkb | ok last call for meeting agenda topics as I'm running out of time before I need to find dinner | 23:31 |
clarkb | and sent | 23:41 |