Monday, 2023-05-01

ianw: it's just weird that one happened for and then
ianw: one added the jammy servers and the other removed them  [00:45]
ianw: removed the old ones  [00:45]
ianw: ok: []  [00:45]
ianw: it gathered facts ok  [00:45]
ianw: i dunno; happened well before anything relating to nameservers happened.  might be a big coincidence  [00:50]
ianw: it is perhaps semi-common i guess  [00:53]
fungi: might be a misbehaving middlebox somewhere in that cloud region  [01:02]
ianw: fungi/clarkb: not for now ... but i wasn't sure where we got to on AAAA glue records for.  If we want to add them, we probably need to ask (my preference) but if we're ok with not having them, we can cross it off
fungi: ianw: related, we may want to ask vexxhost to add ipv6 reverse dns for ns04  [01:19]
ianw: yeah, i don't think irc is effective for that  [01:20]
ianw: ok i logged a low priority ticket  [01:27]
ianw: #status log shutdown, and that have been replaced with, and adns02.opendev.org  [02:13]
opendevstatus: ianw: finished logging  [02:13]
opendevreview: Ian Wienand proposed openstack/project-config master: project-config-grafana: filter opendev-buildset-registry
opendevreview: Merged opendev/system-config master: Add logging During Statup for haproxy-statsd
clarkb: it is so quiet today  [15:14]
clarkb: I'm going to get a gitea 1.19.2 change up after local system updates. They didn't fix the header issue but that has been a long standing problem so I think we can proceed with 1.19.2  [15:21]
clarkb: I've just spot checked zuul and nodepool services and believe that we are running quay images at this point. Restarts over the weekend appear successful.  [15:35]
clarkb: One thing it looks like we will need to do is manually prune out the old docker hub images since our regular pruning hangs onto them  [15:35]
clarkb: cc corvus not sure if that is worth warning zuul users about  [15:36]
opendevreview: Clark Boylan proposed opendev/system-config master: Update gitea to 1.19.2
opendevreview: Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node
clarkb: cleaned up my old hold and put another in place for ^ but I anticipate we can upgrade soon  [15:48]
clarkb: the centos 9 stream mirror is broken. repomd.xml points to files that don't exist. This problem originates in our upstream mirror  [16:26]
clarkb: (throwing that out there so that when everyone is back to work tomorrow they can short circuit the debugging)  [16:26]
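An inconsistency like this can be confirmed by parsing repomd.xml and checking that every file it references actually exists under the mirror's repodata/ directory. A minimal sketch of the first half of that check (the XML sample and the short hashes in it are made up for illustration; a real check would fetch repomd.xml from the mirror and then HEAD each href):

```python
# Sketch: list the metadata files a repomd.xml promises will exist.
# A 404 on any of these hrefs means the mirror is inconsistent.
import xml.etree.ElementTree as ET

SAMPLE_REPOMD = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <location href="repodata/3ea0-primary.xml.gz"/>
  </data>
  <data type="filelists">
    <location href="repodata/9b1c-filelists.xml.gz"/>
  </data>
</repomd>"""

NS = {"r": "http://linux.duke.edu/metadata/repo"}
root = ET.fromstring(SAMPLE_REPOMD)
hrefs = [d.find("r:location", NS).attrib["href"]
         for d in root.findall("r:data", NS)]
print(hrefs)
# → ['repodata/3ea0-primary.xml.gz', 'repodata/9b1c-filelists.xml.gz']
```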
clarkb: I'm getting my change for zookeeper-statsd together and will be regenerating the robot account's token just to ensure we are starting fresh. Nothing should be using it yet anyway but good extra step for safety.  [16:49]
clarkb: s/token/docker cli passwd/  [16:49]
clarkb: corvus: ^ for that I am having to press ^D twice for it to emit a password entered on the command line. Is this expected? I'm worried the first control character may end up in the input somehow. I'll use echo -n 'value' | zuul-client encrypt instead I guess  [16:54]
corvus: clarkb: re pruning -- i don't think that's something we need to warn people about  [16:55]
corvus: clarkb: yes, 2 ctrl-d's is expected when not immediately following a newline  [16:55]
corvus: that's a shell thing  [16:56]
clarkb: TIL. fwiw echo -n '' seems to work. Just prefix it with a space to prevent it from going into history  [16:57]
corvus: yep it's nice to see what you're doing :)  [16:59]
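For the record: the double Ctrl-D is terminal line discipline, not the tool — on a non-empty line the first Ctrl-D only flushes the partial (newline-less) buffer to the reader, and only a Ctrl-D on an empty buffer delivers EOF. Piping sidesteps all of that and delivers exactly the bytes given, with no trailing newline. A small sketch of that property (`wc -c` stands in for the real stdin consumer, and the secret is a placeholder):

```python
# The pipe hands the consumer exactly len(secret) bytes: no trailing
# newline is appended and no interactive EOF handling is involved.
import subprocess

secret = "hunter2"  # placeholder value
result = subprocess.run(["wc", "-c"], input=secret.encode(),
                        capture_output=True, check=True)
print(int(result.stdout))  # byte count equals len(secret)
```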
corvus: clarkb: i went to go check exactly which image is running for zuul... and i see this:  [17:06]
corvus: 4a9793f4f9fa   0021610b5ea6   "/usr/bin/dumb-init …"   2 days ago   Up 2 days             zuul-scheduler_scheduler_1  [17:06]
corvus: so then i run docker inspect 0021610b5ea6  [17:06]
corvus: and i see:                 "org.zuul-ci.change_url": ""  [17:06]
corvus: and that does not look right to me  [17:07]
opendevreview: Clark Boylan proposed opendev/system-config master: Base jobs for image publishing
corvus: do you think something about how we're building images is broken and not attaching those labels correctly now?  maybe that's the most recent layer that has a label?  [17:08]
corvus: i get that with docker inspect locally too  [17:09]
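`docker inspect` prints a JSON array, and the image labels live under `.Config.Labels` (a `--format` Go template or jq can pull one out directly). A minimal sketch of reading the label from that JSON, using a trimmed, made-up sample of the output:

```python
# Extract one OCI label from (a trimmed, invented sample of) the JSON
# that `docker inspect <image>` emits; labels live at .Config.Labels.
import json

inspect_json = '[{"Config": {"Labels": {"org.zuul-ci.change_url": ""}}}]'
config = json.loads(inspect_json)[0]["Config"]
value = config["Labels"].get("org.zuul-ci.change_url")
print(repr(value))  # the label is present but empty: the symptom above
```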
clarkb: corvus: I think the component reported version looks fairly accurate  [17:09]
clarkb: so I suspect this has to do with metadata and not pushing stale content  [17:09]
clarkb: corvus: does the most recent docker hub image look better?  [17:10]
corvus: yeah, the build date looks correct too  [17:10]
corvus: points to
clarkb: infra-root I think is ready for review now. Should be safe to land whenever we are ready to debug it. zookeeper-statsd has not had any new images since I synced docker hub to so don't need to sync that before we switch either  [17:11]
corvus: clarkb: i think i see the issue  [17:12]
clarkb: ok I'm still trying to figure out where we set the value. Must be in the jobs somewhere  [17:12]
corvus: working on a change  [17:12]
opendevreview: James E. Blair proposed zuul/zuul-jobs master: Add labels to build-container-image
corvus: clarkb: ^  [17:15]
corvus: looks like another case of the build-container-image bitrotting between when we made it originally and when we finally started using it.  [17:15]
clarkb: corvus: heh ya the buildx tasks have them and they were copied over more recently  [17:16]
fungi: i guess 881285 is going to need 881919 too?  [17:18]
clarkb: fungi: not strictly necessary but very nice to have yes  [17:18]
clarkb: probably worth waiting on then we can be sure the label fix works too  [17:18]
corvus: yeah, it's super hard to map images back to what they're running without it, so i'd support waiting for 919 before doing any more builds :)  [17:18]
corvus: i'd like to restart zuul again to catch the changes that merged over the weekend; any objections?  [17:20]
corvus: i'll just run the zuul_reboot playbook  [17:20]
fungi: sounds good to me, thanks  [17:20]
clarkb: corvus: it's quiet today (holidays elsewhere in the world) and the zookeeper content should make it even less of an impact. I'm good with this  [17:21]
clarkb: oh ya zuul_reboot is the graceful one. Should go quickly anyway  [17:21]
corvus: running now in screen on bridge  [17:23]
clarkb: corvus: there is a periodic job that may need to be dealt with in openstack now that I look  [17:23]
clarkb: it is queued though which means it isn't on an executor yet so maybe it is fine  [17:23]
corvus: yeah, probably how the last reboot made it through  [17:24]
corvus: i wish there were a way to copy the tooltip to get the node request id  [17:24]
opendevreview: Merged zuul/zuul-jobs master: Add labels to build-container-image
corvus: clarkb: found a nodepool bug causing that stuck request  [17:33]
corvus: change linked in  [17:37]
corvus: i believe a restart of nl01 will correct the immediate problem; maybe we should just land that change and let the subsequent automatic restart handle that.  [17:38]
clarkb: yup I've approved the change and the auto hourly deployment should handle it automatically  [17:40]
clarkb: I'm going to pop out for lunch soon so won't approve the change in system-config yet. Happy for someone else to if they can watch it otherwise I'll aim to +A it when I get back  [18:01]
fungi: clarkb: 881285 won't actually upload a new image though, right? we need another change to merge for that?  [18:03]
clarkb: fungi: I think it will because those jobs are added/modified so they should run?  [18:04]
fungi: oh, yeah i guess the upload happens in gate  [18:04]
fungi: okay, i'll check it once it merges  [18:04]
clarkb: oh but ya the post may not. we can push up a noop dockerfile change if we need it to trigger more stuff  [18:05]
clarkb: a number of our dockerfiles have a timestamp comment for this purpose now  [18:05]
fungi: right, that's what i was assuming we'd need to test, but we can do that if necessary after you finish your lunch  [18:07]
rlandy: fungi: hi ... has anyone reported anything wrt centos9 mirrors? Looks like all jobs (not just tripleo related) are failing with "error: Status code: 404 for" or some similar error. Example log:
rlandy: another example:
fungi: rlandy: first i've heard of it, but we mirror from other mirrors so i guess it's worth checking those to see if they're stale  [18:35]
fungi: indicates it was updated at the top of the hour  [18:37]
fungi: says we're pulling from mirror.rackspace.com  [18:38]
fungi: and i don't see a 3ea088796d71ec43bd0450022bddc9365606b1996065fac43595e4ef6798af11-primary.xml.gz at
fungi: so my guess is that their mirror is behind  [18:39]
fungi: or somehow got rolled back  [18:40]
rlandy: ok - so we're back to that mirror caching fun  [18:40]
fungi: yes, our mirror is only ever as reliable as the mirrors we copy from  [18:40]
fungi: and apparently the mirror network for centos is not all that reliable  [18:41]
rlandy: thank you  [18:41]
fungi: rlandy: switched us from to in december because facebook's mirrors stopped updating, according to the commit message  [18:43]
rlandy: yep - I remember we switched a few times last year  [18:45]
rlandy: going to give it a few hours to see if the mirrors sync up  [18:45]
opendevreview: Merged opendev/system-config master: Base jobs for image publishing
Clark[m]: I posted about the mirror thing earlier today. I confirmed our upstream mirror has the same issue  [19:18]
fungi: ahh, i missed that, thanks  [19:22]
clarkb: fungi: corvus: the quay thing failed in deploy on the zuul job. Likely due to the ongoing zuul restart? I didn't think about that interaction. It did push a change tag but did not update latest. I think because we do need an image change to trigger the promote job  [19:41]
clarkb: I'm working on that change now  [19:41]
clarkb: it did restart the zk statsd service on zk04. Image is identical to the one running before so that was just a docker bookkeeping change  [19:42]
opendevreview: Clark Boylan proposed opendev/system-config master: Force zookeeper-statsd rebuild
clarkb: nl01's launcher restarted ~36 minutes ago  [19:44]
fungi: yeah, gate did run system-config-upload-image-zookeeper-statsd successfully, promote doesn't seem to do any tagging  [19:44]
clarkb: and the stuck job is running  [19:44]
clarkb: fungi: yup and if you visit the image location on quay you'll see the gate change tag but latest is still old  [19:44]
clarkb: nothing unexpected that I see so far. just what we anticipated might be an issue which is good  [19:45]
clarkb: hrm my gitea 1.19.2 change failed presumably on lack of authentication. Implying that authentication is required?  [19:49]
opendevreview: Clark Boylan proposed opendev/system-config master: Update gitea to 1.19.2
clarkb: no log removed (which we can do because this shouldn't need authentication) to see more info  [19:51]
clarkb: the infra-prod-run-zuul job failed due to zm02 failing to copy project config. We no log that so I don't know what happened to make that fail (plenty of disk space)  [19:59]
clarkb: I guess keep an eye on it for recurrences and we can dig in if necessary  [19:59]
clarkb: fungi: corvus: want to review (and hopefully approve) so that we can see zookeeper-statsd go end to end with container publishing  [20:46]
clarkb: oh heh the gitea thing I know what it is. pebkac  [21:07]
opendevreview: Clark Boylan proposed opendev/system-config master: Update gitea to 1.19.2
opendevreview: Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node
clarkb: the good news is that having that issue had me realize we don't need to no log that request since it isn't privileged. I expect this to work now and put a hold in place  [21:11]
corvus: fyi, due to the low load, i paused a handful of executors ahead of the reboot script to reduce the overall upgrade time  [21:39]
corvus: that seems to be working as expected so far  [21:39]
clarkb: I saw that. A couple of executors are done too  [21:41]
opendevreview: Merged opendev/system-config master: Force zookeeper-statsd rebuild
corvus: yeah, i'm continuing my rolling window of keeping "about half" paused ahead of the script  [21:43]
corvus: incidentally, the zookeeper persistent recursive watches change has had a noticeable impact on the zk latency and outstanding requests metrics.  [21:45]
corvus: that merged on april 11 (and if there's any doubt, you can see it in the zk watches graph)  [21:46]
opendevreview: Dmitriy Rabotyagov proposed openstack/project-config master: Add job to ansible-config_template to Galaxy
opendevreview: Dmitriy Rabotyagov proposed openstack/project-config master: Add job to ansible-config_template to Galaxy
clarkb: boom! our first promoted image on the hourly zuul jobs should deploy that for us (we didn't trigger the zuul job after the image build)  [21:50]
clarkb: I'll work on getting a few more of those queued up  [21:51]
clarkb: now to find some easy to swap images that haven't updated on dockerhub since I did the sync  [21:55]
clarkb: this should reduce the amount of syncing we/I end up needing to do  [21:55]
clarkb: as I look at this I'm realizing that there is going to be a bit to do to get things moved. Stuff like our base python images create dependency issues. I think for now I'm going to ignore that though. If we get the leaf images moved then we can rebuild once the base images move too  [21:59]
corvus: clarkb: why not start with base?  [22:08]
clarkb: corvus: I guess I can. My main concern with doing that is that if we need to update base for some reason urgently we may not be ready to consume it from its new location everywhere  [22:09]
clarkb: the risk of that is low though and might be good motivation :)  [22:09]
corvus: ok fair.  no strong opinion here  [22:09]
clarkb: I think doing it leaf image first is probably more effort but also "safer" from that perspective  [22:09]
corvus: btw, friendly reminder exists in case it's helpful :)  [22:10]
clarkb: I'm working on two changes at the moment. One to update the base jobs and one to update ircbot  [22:17]
clarkb: We can use review to decide which approach we prefer  [22:17]
opendevreview: Clark Boylan proposed opendev/system-config master: Move ircbot to
ianw: gitea change looks good.  i don't think we use any authenticated endpoints now?  [22:25]
clarkb: ianw: we do to create repos and orgs and stuff  [22:26]
opendevreview: Clark Boylan proposed opendev/system-config master: Move python builder/base images to
clarkb: I'll work on a third change that updates our Dockerfiles to consume ^  [22:56]
clarkb: my concern with that is a lot of images will update all at once... we can hash that out if we want to split things up in review  [22:56]
opendevreview: Clark Boylan proposed opendev/system-config master: Consume python base images from
clarkb: There are a lot of moving pieces here. I think we can pause here since the general thing has been shown to work. Think about the approach we want to take / discuss it in the meeting tomorrow. Write down a plan/todo list and then get it done  [23:04]
clarkb: I'm going to shift gears here and check up on the gitea upgrade then get a meeting agenda sent out  [23:05]
clarkb: at first glance gitea 1.19.2 seems to be working
clarkb: thinking a bit about the work. It might make sense to try and do a "sprint" for that. Pick a couple of days in the near future and just focus on getting as much of that done as possible. Then we ideally don't end up with stale images for very long and can have people around to double check services are happy with their new images  [23:08]
clarkb: Meeting agenda has been updated. Probably with too much detail. Please let me know if there is anything else to add/change/edit  [23:22]
ianw: clarkb: i think the gerrit acl indent, etc. all got merged  [23:33]
clarkb: oh neat I'll double check and clean that up  [23:34]
ianw: and the renames at the bottom are done, right?  [23:34]
clarkb: oh yup. Thanks  [23:35]
ianw: nameserver status is accurate; i might just remove the old servers later today as there's been no problem after i shut them down yesterday morning (my time)  [23:35]
clarkb: the acl updates did merge. Any idea if we applied them specially to ensure they all got updated?  [23:36]
ianw: that's a good point, i'll go back and check the deploy  [23:36]
opendevreview: Ian Wienand proposed opendev/ master: Remove old DNS servers
ianw: To ssh://
ianw:  ! [remote rejected] HEAD -> refs/meta/config (prohibited by Gerrit: project state does not permit write)  [23:43]
clarkb: that is probably a read only project  [23:43]
ianw: i didn't think of that  [23:43]
clarkb: I think that is fine. If we ever make it not read only we'll sync a current good config  [23:43]
ianw: yeah, the r/o projects all failed like that  [23:43]
ianw: all the errors were the doesn't permit write  [23:45]
clarkb: how long did it take (that could be good info)  [23:46]
ianw: 55 minutes  [23:47]
clarkb: agenda sent  [23:47]
clarkb: ianw: maybe we should increase the timeout of that job (assuming it is 60 minutes) and then we can just merge changes and not worry about manual runs  [23:47]
ianw:   name: infra-prod-manage-projects  [23:52]
ianw:     parent: infra-prod-playbook  [23:52]
ianw:     timeout: 4800  [23:52]
ianw: probably enough headroom  [23:53]

Generated by 2.17.3 by Marius Gedminas - find it at!