Monday, 2023-04-17

opendevreview	Merged opendev/system-config master: launch: further DNS cleanups https://review.opendev.org/c/opendev/system-config/+/880400	00:40
ianw	the try.gitea.io cert has expired, which is a bit annoying for testing against it	01:13
ianw	ok; scoped access tokens has this written all over it. i found that by tracing it back to the top-level api router and "git blame" -> https://github.com/go-gitea/gitea/commit/de484e86bc495a67d2f122ed438178d587a92526	01:24
ianw	filed a couple of issues about this; notes in https://review.opendev.org/c/opendev/system-config/+/877541	03:23
*** Trevor is now known as Guest11302		04:06
opendevreview	Ian Wienand proposed opendev/zone-opendev.org master: Add DNS servers for Ubuntu Jammy refresh https://review.opendev.org/c/opendev/zone-opendev.org/+/880576	05:55
opendevreview	Ian Wienand proposed opendev/zone-opendev.org master: Add Jammy refresh NS records https://review.opendev.org/c/opendev/zone-opendev.org/+/880577	06:07
*** amoralej|off is now known as amoralej		06:16
opendevreview	Ian Wienand proposed opendev/system-config master: inventory : add Ubuntu Jammy DNS refresh servers https://review.opendev.org/c/opendev/system-config/+/880579	06:16
ianw	^ that's getting closer; the hosts are up. i need to think through a few things so we can have two adns servers	06:22
opendevreview	Ian Wienand proposed opendev/system-config master: dns: abstract names https://review.opendev.org/c/opendev/system-config/+/880580	06:31
ianw	^ that's a start	06:31
ianw	i'll think about it some more too	06:31
dpawlik	dansmith: hey, soon I would like to make a release for ci-log-processing, but I still see some leftovers that are not sent to OpenSearch. There are just a few, but each week. Almost all of the logs from this week were because of parsing the performance.json file -	08:13
dpawlik	https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1c2/periodic/opendev.org/x/networking-opencontrail/master/noc-tempest-neutron-plugin/1c21c82/controller/logs/performance.json	08:13
dpawlik	the "MemoryCurrent": 18446744073709551615 seems to be "too big" for the Opensearch field	08:14
dpawlik	dansmith: is it correct? Are you using the performance index in Opensearch?	08:14
dpawlik	I see that most of the errors in the performance log come from project: x/networking-opencontrail . Is it still used? can we remove the periodic job "noc-tempest-neutron-plugin" ?	08:18
opendevreview	Daniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh https://review.opendev.org/c/zuul/zuul-jobs/+/880517	09:01
opendevreview	Daniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh https://review.opendev.org/c/zuul/zuul-jobs/+/880517	09:18
gthiemonge	Hi Folks, there are multiple jobs stuck in zuul	12:14
gthiemonge	ex: https://zuul.openstack.org/status/change/880435,1	12:15
fungi	gthiemonge: thanks for the heads up. i was on vacation last week so need to catch up on what might have changed, but it looks like our weekend upgrade stopped a quarter of the way through the executors too: https://zuul.opendev.org/components	12:51
fungi	i wonder if ze04 is the culprit	12:51
fungi	it has what look to be a bunch of hung git processes dating back to thursday	12:53
fungi	i think the lingering git cat-file --batch-check processes are a red herring. executors from both before and after container restarts seem to have a bunch of them too	13:09
fungi	2023-04-17 13:09:57,978 DEBUG zuul.ExecutorServer: Waiting for 2 jobs to end	13:10
fungi	so for some reason there are two builds on ze04 it can't seem to terminate	13:10
fungi	2023-04-17 13:10:44,015 DEBUG zuul.ExecutorServer: [e: b54117e7ad044144b1d1cce0bd252f19] [build: c4157ad90b3c4db383d3ac5fb6ce9707] Stop job	13:10
fungi	2023-04-17 13:10:44,015 DEBUG zuul.ExecutorServer: Unable to find worker for job c4157ad90b3c4db383d3ac5fb6ce9707	13:10
fungi	we have "Unable to find worker for job" messages dating back over a week though, basically all the way back to the start of our log retention. and not just on ze04 either, executors from before and after the restart seem to have them, so probably not related?	13:14
fungi	i think the executors are just being very, very, very slow to gracefully stop, like taking about a day each	13:19
fungi	looks like ze01 started around 2023-04-15 00:00z on schedule	13:19
fungi	started stopping i mean	13:20
fungi	then ~48 minutes later ze02 began its graceful stop	13:20
fungi	and it took over a day, ze03 began to stop around 10:30z today	13:21
fungi	and ze04 a little after 11:50z so that was less than 1.5 hours	13:23
fungi	so i guess the delay was really just ze02 for some reason	13:23
fungi	ze04 will probably finish stopping soon	13:24
fungi	so maybe the builds that have been queued for so long are unrelated to the restart slowness. certainly quite a few of them are stuck in check since before the restarts were initiated anyway	13:25
fungi	gthiemonge: i notice a disproportionately large number of these jobs waiting for node assignments are for octavia changes, and i seem to remember octavia has relied heavily on nested-virt node types in the past. what are the chances all these waiting jobs want nested-virt nodes? maybe i should start looking at potential problems supplying nodes from some specific providers	13:30
fungi	the octavia-v2-dsvm-scenario-ipv6-only build for 879874,1 has been waiting for a nested-virt-ubuntu-focal node since 2023-04-14 05:36:32z	13:31
fungi	that was node request 300-0020974848, so i guess i'll look into where that ended up	13:31
dansmith	dpawlik: no, tbh, I didn't realize the performance index was actually ingesting those values now, but I see that it is	13:32
dansmith	dpawlik: I agree that memory value must be wrong, but I'd have to go dig to figure out why.. can you ignore those logs for now?	13:32
gthiemonge	fungi: yeah, these jobs are using nested-virt nodes, and I think that most of them (but not all) are centos-9-stream-based jobs	13:36
fungi	dpawlik: dansmith: apropos of nothing in particular, 18446744073709551615 is 2^64-1, so looks like something passed a -1 as an unsigned int	13:36
fungi	unsigned 64-bit wide int specifically	13:37
fungi	likely whatever was being measured didn't exist/had no value and attempted to communicate that with a -1	13:37
dansmith	ack	13:37
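
Note: fungi's reading of the MemoryCurrent value can be checked directly. The following is a minimal Python sketch, not taken from ci-log-processing; `as_signed_64` and `sanitize_counter` are hypothetical helper names, and treating the all-ones value as "no data" is an assumption based on the guess made in the channel.

```python
import struct

# Value reported for "MemoryCurrent" in the performance.json discussed above.
MEMORY_CURRENT = 18446744073709551615

# 2**64 - 1 is the largest value an unsigned 64-bit integer can hold.
assert MEMORY_CURRENT == 2**64 - 1


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


def sanitize_counter(value: int):
    """Hypothetical helper: map the all-ones sentinel to None.

    Assumes (as suggested in the channel) that 2**64 - 1 means
    "no value available" rather than a real measurement.
    """
    return None if value == 2**64 - 1 else value


print(as_signed_64(MEMORY_CURRENT))      # the same bits read as a signed int
print(sanitize_counter(MEMORY_CURRENT))  # sentinel dropped before indexing
```

A filter along these lines, applied before the document is sent to the index, would be one way to "ignore those logs for now" without dropping the rest of the file.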
fungi	looks like nl03 took the lock for nr 300-0020974848 at 05:36:38 and logged boot failures in vexxhost-ca-ymq-1, then nl04 picked it up at 05:39:52 and tried to boot in ovh-bhs1 but failed and marked the last node attempt for deletion at 05:50:03, but i see no further attempts by any providers to service the request after that point nor was it released as a node error	13:45
fungi	ever since 2023-04-14 05:50:03 there's nothing about it	13:46
fungi	i need to step away for a few to run a quick errand, but if any other infra-root is around and wants to have a look, i think the first example's breadcrumb trail ends at nl04	13:49
*** dviroel__ is now known as dviroel		13:53
Clark[m]	Executors logging that they are unable to find workers for builds is normal when you have more than one executor. Basically the executor is finding that it can't process a build because it is running on another executor.	14:03
Clark[m]	The issue with builds being stuck seems similar to the issue corvus and I looked into last week. https://review.opendev.org/c/zuul/nodepool/+/880354 is expected to make that better and I think restarting breaks the deadlock so landing that change and deploying new images should get stuff moving again	14:05
Clark[m]	This should be independent of slow zuul restarts since executor stops wait on running builds and Nodepool deadlocking happens before builds begin	14:06
fungi	makes sense	14:48
opendevreview	Daniil Gan’kov proposed zuul/zuul-jobs master: Quote file name in URL in download-logs.sh https://review.opendev.org/c/zuul/zuul-jobs/+/880517	15:01
clarkb	I've approved that change just now	15:10
clarkb	infra-root Wednesday starting at about 20:00 UTC looks like a good time for an etherpad outage and data migration/server swap for me. Any objections to this time? If not I'll go ahead and announce it to service-announce	15:19
fungi	clarkb: thanks, it looked good to me but i wasn't clear whether there was a reason it sat unapproved and was trying to skim the zuul channel for additional related discussion	15:20
clarkb	I don't think there was any particular reason. If i had noticed that it wasn't merged on friday I would've approved it then (though I was also afk due to kids being out of school)	15:22
fungi	makes sense	15:23
fungi	i guess once updated images are available we can pull and restart the launchers?	15:23
clarkb	fungi: ansible will automatically do that for us in the opendev hourly job runs	15:23
fungi	oh, right	15:23
fungi	which should hopefully also get all those deadlocked node requests going again	15:24
clarkb	fungi: https://review.opendev.org/c/opendev/zone-gating.dev/+/880214 may interest you. Makes a 1hour ttl default consistent across the dns zone files we manage (the others have already been updated)	15:24
clarkb	correct since the deadlock is due to in process state	15:24
fungi	infra-prod-service-nameserver hit RETRY_LIMIT in deploy for 880214 just now	15:36
fungi	ansible said bridge01.opendev.org was unreachable	15:37
fungi	"zuul@bridge01.opendev.org: Permission denied (publickey)."	15:37
fungi	i guess we haven't authorized that project key?	15:38
fungi	intentionally?	15:38
fungi	presumably our periodic deploy will still apply the change	15:39
*** amoralej is now known as amoralej|off		15:42
clarkb	fungi: gating.dev is the one you had a change up to add jobs for right? I suspect that yes we need the project key to be added to bridge	15:44
clarkb	and yes the daily job should get us in sync	15:44
clarkb	(that is what happened with the static changes to gating.dev just had to wait for the daily run)	15:44
fungi	yeah, that was https://review.opendev.org/879910 which merged 10 days ago, so i guess that's when it started	15:48
fungi	we have these keys authorized so far: zuul-system-config-20180924 zuul-project-config-20180924 zuul-zone-zuul-ci.org-20200401 zuul-opendev.org-20200401	15:50
opendevreview	Jeremy Stanley proposed opendev/system-config master: Allow opendev/zone-gating.dev project on bridge https://review.opendev.org/c/opendev/system-config/+/880661	15:56
fungi	clarkb: ^ like that i guess	15:56
clarkb	yes that looks right	15:57
fungi	ftr, i obtained the key with `wget -qO- https://zuul.opendev.org/api/tenant/openstack/project-ssh-key/opendev/zone-gating.dev.pub`	15:58
opendevreview	Clark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup https://review.opendev.org/c/opendev/system-config/+/880672	16:23
clarkb	I think ^ is a reasonable workaround for the gerrit replication issue we discovered during the recent gerrit upgrade	16:23
clarkb	fungi: does wednesday at 20:00 UTC for a ~90 minute etherpad outage and server move work for you?	16:26
fungi	yeah, sgtm	16:27
clarkb	thanks. you tend to be on top of project happenings and are a good one to ask for that sort of thing	16:27
fungi	mmm. actually not project-related but i may not be around at that time... i can do later though, like maybe 22:00 or 23:00z	16:28
fungi	wedding anniversary and we were looking at going up the island to a place that doesn't open until 19:00z so probably wouldn't be back early enough to make 20:00 maintenance	16:29
clarkb	later times also work for me	16:30
clarkb	re zuul restart slowness I think it is related to the nodepool node stuff after all. In particular I think ze04 is "running" a paused job that is paused waiting on one or more of the jobs that are queued to run	16:31
clarkb	this will in theory clear up automatically with the nodepool deployment but we should keep an eye on the whole thing	16:31
clarkb	looks like some of the queued jobs are running?	16:32
fungi	interesting... i wonder if that's why ze02 took almost 1.5 days to gracefully stop	16:32
clarkb	all four launchers did restart just over 15 minutes ago which should've pulled in that latest image (it was promoted ~33 minutes ago)	16:33
clarkb	and the swift change at the top of the queue just started its last remaining job	16:34
clarkb	* top of the check queue	16:34
fungi	perfect	16:35
fungi	ze04 is still waiting on one of those to complete, looks like	16:47
clarkb	ya it will likely take 3 or more hours if it is one of the tripleo buildsets	16:48
clarkb	fungi: it's the pause job in the gate for 879863	16:58
clarkb	fungi: you can look on ze04 in /var/lib/zuul/builds to get the running build uuids. Then grep that out of https://zuul.opendev.org/api/tenant/openstack/status	16:59
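
Note: the lookup clarkb describes (take the build UUIDs found on the executor, then search the tenant status document for them) can be sketched in Python. This is an illustrative sketch only: the layout of the real status JSON is not shown in this log, so the code walks the whole tree instead of assuming one, the `"uuid"` key is an assumption, and `sample_status` is made-up data.

```python
def find_builds(node, uuids, path=()):
    """Yield (path, dict) for every dict in a JSON tree whose "uuid" is in uuids.

    The tree is walked recursively, so no particular nesting of
    pipelines, queues or jobs is assumed.
    """
    if isinstance(node, dict):
        uuid = node.get("uuid")
        if isinstance(uuid, str) and uuid in uuids:
            yield path, node
        for key, value in node.items():
            yield from find_builds(value, uuids, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_builds(value, uuids, path + (index,))


# Made-up stand-in for the parsed status document.
sample_status = {
    "pipelines": [
        {
            "name": "gate",
            "jobs": [
                {"name": "example-paused-job",
                 "uuid": "c4157ad90b3c4db383d3ac5fb6ce9707"},
                {"name": "example-other-job",
                 "uuid": "00000000000000000000000000000000"},
            ],
        },
    ],
}

# On the executor these would be the directory names under /var/lib/zuul/builds.
running = {"c4157ad90b3c4db383d3ac5fb6ce9707"}

matches = list(find_builds(sample_status, running))
for path, build in matches:
    print(path, build.get("name"))
```

In real use the set of UUIDs would come from listing the builds directory and the document from fetching the status URL and parsing it with `json.load`; both steps are left out here so the sketch runs without network or host access.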
clarkb	hrm I expected https://review.opendev.org/c/opendev/system-config/+/880672/1/playbooks/zuul/gerrit/files/cleanup-replication-tasks.py#25 to trigger in https://zuul.opendev.org/t/openstack/build/da3c4879c4ec47ab938665020cdfc2fe/log/review99.opendev.org/docker/gerrit-compose_gerrit_1.txt but it isn't in there	17:21
clarkb	oh we docker-compose down to do renames and that will only trigger on the first startup but we don't collect logs from before	17:25
opendevreview	Clark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup https://review.opendev.org/c/opendev/system-config/+/880672	17:32
clarkb	that mimics wait-for-it a bit in its logging output	17:32
clarkb	I think ze04 should restart in about an hour	17:38
clarkb	I've just updated the meeting agenda with what I'm aware of as being current. Please add content or let me know what is missing and I'll send that out later today	18:16
fungi	thanks!	18:17
clarkb	while I sorted out lunch it looks like ze04 was restarted.	19:24
fungi	yes, it's working on 5 now	19:25
clarkb	I think we are in good shape to finish up the restart now. We can probably check it tomorrow to ensure it completes	19:25
fungi	agreed	19:25
clarkb	going to send an announcement for the etherpad outage now. I'll indicate 22:00 UTC to 23:30 UTC wednesday the 19th	19:28
johnsom	E: Failed to fetch https://mirror.ca-ymq-1.vexxhost.opendev.org/ubuntu/pool/universe/v/vlan/vlan_2.0.4ubuntu1.20.04.1_all.deb Unable to connect to mirror.ca-ymq-1.vexxhost.opendev.org:https: [IP: 2604:e100:1:0:f816:3eff:fe0c:e2c0 443]	19:32
johnsom	https://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96	19:33
johnsom	Looks like there is an issue at vexxhost	19:33
clarkb	if I open that link I get a download.	19:34
clarkb	of course I'm going over ipv4 from here	19:35
clarkb	hitting it via ipv6 also works. So whatever it is isn't a complete failure	19:35
clarkb	could be specific to the test node too	19:35
johnsom	It would not be the first time there was an IPv6 routing issue	19:36
opendevreview	Clark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup https://review.opendev.org/c/opendev/system-config/+/880672	19:37
clarkb	https://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96/log/job-output.txt#3315 there it seems to indicate it tried both ipv4 and ipv6	19:38
clarkb	https://zuul.opendev.org/t/openstack/build/6fa53bb38b8045a7b55d3180a8df1e96/log/job-output.txt#821 and it fails quite early in the job too.	19:40
clarkb	which means it is unlikely that job payload caused it to happen	19:40
clarkb	definitely seems like a test node that couldn't route internally to the cloud but generally had network connectivity (otherwise the job wouldn't run at all). But let's see the mirror side	19:41
fungi	the boot failures in vexxhost which were contributing to the deadlocked node requests did mostly look like unreachable nodes too, so i wonder if there are some network reachability problems	19:42
clarkb	no OOMs or unexpected reboots of the mirror node. and the apache process isn't new either	19:42
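
Note: the manual check above (fetching the mirror over IPv4 and then over IPv6) can be approximated with a small per-address TCP probe. This is a minimal sketch using only the Python standard library; `probe` is a hypothetical helper name, and a successful TCP connect only shows reachability from the machine running it, not from the failing test node.

```python
import socket

FAMILY_LABELS = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def probe(host, port, timeout=5.0):
    """Attempt a TCP connect to every address the name resolves to.

    Returns a list of (family_label, address, reachable) tuples, one
    per address returned by getaddrinfo.
    """
    results = []
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        label = FAMILY_LABELS.get(family, str(family))
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results.append((label, sockaddr[0], True))
        except OSError:
            # Covers refused connections, timeouts and unreachable networks.
            results.append((label, sockaddr[0], False))
    return results
```

Calling `probe("mirror.ca-ymq-1.vexxhost.opendev.org", 443)` from a host inside the affected cloud region would show whether one address family fails while the other works, which is the distinction being discussed here.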
clarkb	ianw: when you're around can you clarify whether or not you think we need to wait for upstream gitea to fix those api interaction things you posted bugs for before we upgrade? You are -1 on the change and not sure if that means you think this is a big enough problem to hold off upgrading for now	21:41
ianw	clarkb: umm, i guess i'm not sure. they've put both issues in the 1.19.2 target tracker	21:57
clarkb	ya I think the main issue is if anyone is using the APIs as an unauthenticated user	21:58
ianw	the external thing would be that the organisation list is now an authenticated call. i mean, i doubt anyone is using that though	21:58
clarkb	the basic auth 401 problem is minor since you can force it with most tools seems like	21:58
ianw	yeah, there may be other bits that have fallen under the same thing, i didn't audit them	21:58
clarkb	I guess we can wait to be extra cautious. I'm mostly worried about letting it linger and then forgetting. But it is still fresh at this point	22:00
ianw	i could go either way, i'm not -1 now we understand things, although i doubt we'll forget as we'll get any updates on those bugs	22:02
ianw	it might be a breaking change for us with the icon setting stuff? i can't remember how that works, but that may walk the org list from an unauthenticated call?	22:03
clarkb	it does it via the db actually	22:04
clarkb	and it seems to work in the held node (there are logos iirc)	22:04
ianw	it does hit it anonymously -> https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea-set-org-logos/tasks/main.yaml#L1	22:07
clarkb	ya that's the task my change updated https://review.opendev.org/c/opendev/system-config/+/877541/6/playbooks/roles/gitea-set-org-logos/tasks/main.yaml which sent us down the rabbit hole	22:09
ianw	oh doh, right	22:10
ianw	for some reason i had in my head that was on the test path	22:12
ianw	clarkb: do you think we should run that cleanup script in a cron job?	22:13
ianw	the gerrit replication cleanup script, sorry, to be clear	22:13
clarkb	ianw: I think we could run it there as well. The files don't seem large and there are only a "few" thousand of them right now so we can probably get away with just doing it at container startup	22:14
clarkb	the upside to doing it at startup is that it prevents a race in generating those errors in the logs at startup. The downside is we'd only run it at startup and we might not see if there are other types of files that leak or if it stops/doesn't work for some reason	22:15
ianw	i guess startup and a cron job?	22:16
clarkb	I'm definitely open to feedback on that. I was thinking about artificially injecting some of the leaked files into the test nodes too but it gets weird because ideally we would write real replication tasks that should replicate and those that shouldn't and check that the ones we want to be removed are removed and that the ones we want to replicate are replicated but we don't test	22:17
clarkb	replication in the test nodes	22:17
clarkb	basically to test this properly got really complicated quickly and I decided to push what I had early rather than focus on making it perfect	22:17
ianw	fair enough. i guess we could just do an out-of-band test type thing with dummy files and make sure it removes what we want	22:18
clarkb	ya that might be the easiest thing	22:18
clarkb	ianw: re cronjob I think we may not have cron in the container images. We'd have to trigger a cronjob that ran docker exec? This should work fine just trying to think of the best way to write it down essentially	22:22
ianw	yeah that's what i was thinking; cron from review that calls a docker exec	22:22
clarkb	and maybe run it hourly or daily?	22:23
clarkb	I'll do it in a followup change since we don't want the cron until the script is in the image running on the host	22:24
ianw	i'd say daily is enough	22:25
opendevreview	Clark Boylan proposed opendev/system-config master: Run the replication task cleanup daily https://review.opendev.org/c/opendev/system-config/+/880688	22:41
clarkb	Something like that maybe. I tried to capture some of the oddities of this change in the commit message. We don't actually have anything like this running today. Not sure if reusing the shell container is appropriate. Again feedback very much welcome	22:41
ianw	i think the mariadb backups are fairly similar	22:58
ianw	clarkb: dropped a comment on run v exec and using --rm with run, if we want to use that	23:02
clarkb	ianw: re --rm we aren't rm'ing that container today	23:08
clarkb	that might make a good followup but I think we should leave it as is until we change it globally	23:08
ianw	but that will create a new container on every cron run? why do we need to keep them?	23:14
clarkb	it doesn't create a new container. It never deletes the container so it hangs around. You can see it if you run `sudo docker ps -a` on review02	23:17
clarkb	I don't think we need to keep them but I don't know that there is a good way to manage `docker-compose up -d` and also somewhat atomically remove the shell container it creates	23:17
opendevreview	Clark Boylan proposed opendev/system-config master: Prune invalid replication tasks on Gerrit startup https://review.opendev.org/c/opendev/system-config/+/880672	23:22
opendevreview	Clark Boylan proposed opendev/system-config master: Run the replication task cleanup daily https://review.opendev.org/c/opendev/system-config/+/880688	23:22
clarkb	ianw: ^ that adds testing	23:22
opendevreview	Clark Boylan proposed opendev/system-config master: Explicitly disable offline reindexing during project renames https://review.opendev.org/c/opendev/system-config/+/880692	23:27
clarkb	and that is something I noticed when working on the previous change	23:28
clarkb	ianw: fwiw on the --rm thing I don't know that this was an anticipated problem when the shell pattern was used. I do kinda like having an obvious place to run things with less potential for impacting the running services though. However, maybe it is simpler to have fewer moving parts and we should try to factor out the shell container. This would affect our upgrade processes though	23:31
clarkb	as they rely on this container for example	23:31
clarkb	ok last call for meeting agenda topics as I'm running out of time before I need to find dinner	23:31
clarkb	and sent	23:41
