Tuesday, 2022-12-13

clarkb	ok I'll push a new patchset	00:05
clarkb	but first sending the meeting agenda	00:05
opendevreview	Clark Boylan proposed opendev/system-config master: Update tox.ini for tox v4 https://review.opendev.org/c/opendev/system-config/+/867269	00:09
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: ensure-kubernetes: add microk8s support https://review.opendev.org/c/zuul/zuul-jobs/+/866953	00:16
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: [wip] zuul-jobs-test-registry-buildset-registry-k8s-microk8s https://review.opendev.org/c/zuul/zuul-jobs/+/867063	00:16
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: ensure-kubernetes: add microk8s support https://review.opendev.org/c/zuul/zuul-jobs/+/866953	00:24
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: [wip] zuul-jobs-test-registry-buildset-registry-k8s-microk8s https://review.opendev.org/c/zuul/zuul-jobs/+/867063	00:24
*** rlandy is now known as rlandy\|out		00:28
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: ensure-kubernetes: add microk8s support https://review.opendev.org/c/zuul/zuul-jobs/+/866953	00:35
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: [wip] zuul-jobs-test-registry-buildset-registry-k8s-microk8s https://review.opendev.org/c/zuul/zuul-jobs/+/867063	00:35
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: use-buildset-registry: support microk8s https://review.opendev.org/c/zuul/zuul-jobs/+/867063	01:01
*** dviroel\|rover\|out is now known as dviroel\|rover		01:12
*** dviroel\|rover is now known as dviroel\|rover\|out		01:36
*** yadnesh\|away is now known as yadnesh		04:47
opendevreview	Tim Beermann proposed zuul/zuul-jobs master: Add yamllint job. https://review.opendev.org/c/zuul/zuul-jobs/+/866679	06:37
opendevreview	Tim Beermann proposed zuul/zuul-jobs master: Add yamllint job. https://review.opendev.org/c/zuul/zuul-jobs/+/866679	06:47
opendevreview	Tim Beermann proposed zuul/zuul-jobs master: Add yamllint job. https://review.opendev.org/c/zuul/zuul-jobs/+/866679	06:59
*** sandy__ is now known as ysandeep		07:57
*** jpena\|off is now known as jpena		08:27
opendevreview	Merged openstack/project-config master: Flip jeepyb over to building Gerrit 3.6 https://review.opendev.org/c/openstack/project-config/+/867314	09:41
*** ysandeep is now known as ysandeep\|brb		10:37
*** ysandeep\|brb is now known as ysandeep		11:09
*** rlandy\|out is now known as rlandy		11:17
*** dviroel\|rover\|out is now known as dviroel\|rover		11:18
*** frenzy_friday is now known as frenzy_friday\|doc		11:34
*** sfinucan is now known as stephenfin		12:04
*** yadnesh is now known as yadnesh\|away		13:34
*** dasm\|off is now known as dasm		14:00
*** frenzy_friday\|doc is now known as frenzy_friday		14:57
clarkb	gmann: basically as far as we know the change in "trigger vote" vs "submit requirement" label rendering on list pages isn't configurable. I think what is happening is that Gerrit only shows you labels that are required to be set to merge on the summary views. What you can do is search by specific labels and values to get those listings though.	15:01
fungi	or you can make those labels required, i suppose?	15:02
*** marios is now known as marios\|out		15:05
clarkb	ya maybe you can set them up where they can't be lowest negative value to merge and then that would implicitly work for most things?	15:08
clarkb	infra-root we just got email saying the volume backing nb02's build dir needs emergency maintenance. I don't think this is a big deal for us we'll just build images less quickly	15:17
clarkb	they don't want us touching the volume until given the all clear	15:17
*** pojadhav is now known as pojadhav\|dinner		15:23
mnasiadka	I have a feeling that the Zuul UI has some problem refreshing - some jobs are not updated for the last 30 minutes	15:31
mnasiadka	ok, it got updated just now ;)	15:32
clarkb	depending on what you mean by not updating that's perfectly normal. There may be more demand for nodes than we have quota for, delaying node assignments. Some jobs take more than half an hour to complete so won't flip from running to success/fail for a long period etc	15:34
mnasiadka	clarkb: I rather meant that the log stream stopped updating for like 30 minutes, and then I got loads of logs on the console stream (and the job was marked as completed) - sounded like some networking issue at a nodepool provider	15:39
clarkb	infra-root I've got a couple of small bookkeeping changes that would be good to land https://review.opendev.org/c/opendev/system-config/+/867269 addresses tox v4 in system-config (it runs a lot of tests but they all seem to pass) and https://review.opendev.org/c/openstack/project-config/+/867282 re-sorts gate ahead of post-review and fixes the trigger mechanism per corvus' suggestion	15:46
*** JasonF is now known as JayF		15:53
*** dviroel\|rover is now known as dviroel\|rover\|lunch		15:57
clarkb	corvus: not sure if you saw but we didn't end up restarting Zuul over the weekend due to an issue with our restart playbook and new ansible on bridge (docker-compose wants to make a tty by default but new ansible doesn't have one). This has been corrected and I'd like to run that playbook soonish. I've got meetings through the infra meeting but then I should be able to start it. This	16:02
clarkb	will switch us to python3.11 based images among other things. Any concerns with doing that today?	16:02
opendevreview	Merged openstack/project-config master: Scale iweb back to 25% of possible quota https://review.opendev.org/c/openstack/project-config/+/867261	16:13
corvus	clarkb: that sounds good -- i can actually run the restart playbook if you like.	16:29
clarkb	corvus: oh that would be great	16:29
corvus	clarkb: ` flock -n /var/run/zuul_reboot.lock /usr/local/bin/ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml >> /var/log/ansible/zuul_reboot.log 2>&1` run that in screen?	16:33
clarkb	yup that seems to match the cronjob entry	16:34
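The `flock -n` wrapper in the command above takes an exclusive lock without blocking, so a second invocation gives up immediately instead of queueing behind a run that is already in progress. A minimal Python sketch of the same pattern, assuming a Linux host; the function name and the temporary lock path are illustrative and not taken from system-config:

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock on path, or None if it is already held."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes this fail right away instead of waiting, like `flock -n`.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


# Hypothetical lock path for the demo; the cron entry uses /var/run/zuul_reboot.lock.
lock_path = os.path.join(tempfile.mkdtemp(), "zuul_reboot.lock")
first = try_lock(lock_path)   # acquires the lock
second = try_lock(lock_path)  # refused while `first` is still open
```

Closing `first` releases the lock, after which a new `try_lock` call succeeds again.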
corvus	cool, i'll wait for https://review.opendev.org/865923 to land then kick that off and keep an eye on it	16:36
clarkb	corvus: oh and be sure to run it on bridge01.opendev.org not old bridge	16:36
corvus	(i don't want to split 865923 across a restart)	16:36
corvus	clarkb: have we updated the cname yet?	16:37
clarkb	corvus: no because old bridge was never a cname, only an A record	16:37
corvus	(thanks for the reminder)	16:37
clarkb	but it is a good reminder for trying to come up with a way to make this more obvious/easier to get right	16:38
corvus	how close are we to terminating bridge.openstack?	16:38
corvus	and maybe we could go ahead and make a bridge.opendev.org cname....	16:38
clarkb	corvus: I think we are actually close at this point. There is a set of changes I still need to review to do an encrypted tarball backup thing and I think the idea was to run that on old bridge before removing it	16:39
clarkb	ya bridge.opendev.org CNAME bridge01.opendev.org doesn't help when using bridge.openstack.org but will avoid confusion in the future if there is a bridge02 swap	16:39
corvus	and we can get started on that muscle memory now	16:42
opendevreview	Clark Boylan proposed opendev/zone-opendev.org master: Add bridge -> bridge01 CNAME https://review.opendev.org/c/opendev/zone-opendev.org/+/867540	16:44
clarkb	actually there is a bug in that I think. I need a '.' at the end of the target name right?	16:44
fungi	infra-prod-service-nodepool failed to deploy 867261	16:45
opendevreview	Clark Boylan proposed opendev/zone-opendev.org master: Add bridge -> bridge01 CNAME https://review.opendev.org/c/opendev/zone-opendev.org/+/867540	16:45
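On the trailing-dot question: in a zone file, a name that does not end in '.' is treated as relative and has the zone origin appended, so a CNAME target written as bridge01.opendev.org without the final dot would expand to bridge01.opendev.org.opendev.org. A small Python sketch of that expansion rule; the helper name is hypothetical and it only models the qualification step, not a full zone parser:

```python
def qualify(name, origin):
    """Expand a zone-file name the way relative names are qualified against the origin."""
    if name.endswith("."):
        return name    # already fully qualified
    if name == "@":
        return origin  # "@" stands for the origin itself
    return f"{name}.{origin}"


origin = "opendev.org."
with_dot = qualify("bridge01.opendev.org.", origin)
without_dot = qualify("bridge01.opendev.org", origin)
short_form = qualify("bridge01", origin)
```

Either the fully qualified form with the trailing dot or the short relative form `bridge01` resolves to the intended target.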
clarkb	fungi: nb02 is having a bad time	16:45
fungi	ahh, about the maintenance i guess	16:46
clarkb	fungi: did the job fail and still apply the change to nl03? or did nl03 not get updated	16:46
fungi	i'm digging up the log on bridge	16:46
clarkb	it updated the config on nl03 at least	16:46
fungi	yeah, it was nb02.opendev.org	16:46
fungi	"TASK [sync-project-config : Sync project-config repo] ... fatal: [nb02.opendev.org]: FAILED! ... the output has been hidden due to the fact that 'no_log: true' was specified for this result"	16:47
fungi	[Tue Dec 13 16:48:32 2022] blk_update_request: I/O error, dev xvdb, sector 1459993088 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0	16:48
fungi	it's still spewing those	16:48
clarkb	ya and they asked us to not touch the volume while they figure it out so I think we just let it be for now	16:48
*** dviroel\|rover\|lunch is now known as dviroel\|rover		16:50
fungi	wfm	16:51
clarkb	worst case we create a new volume and build new images with a fresh cache. That might need some zk/nodepool record keeping to make happy too but I think it isn't terrible	16:51
opendevreview	Ade Lee proposed openstack/project-config master: Add FIPS job for ubuntu https://review.opendev.org/c/openstack/project-config/+/867112	16:57
opendevreview	Ade Lee proposed zuul/zuul-jobs master: Add ubuntu to enable-fips role https://review.opendev.org/c/zuul/zuul-jobs/+/866881	17:12
opendevreview	Ade Lee proposed openstack/project-config master: Add FIPS job for ubuntu https://review.opendev.org/c/openstack/project-config/+/867112	17:15
*** jpena is now known as jpena\|off		17:36
*** ysandeep is now known as ysandeep\|out		17:50
*** pojadhav\|dinner is now known as pojadhav		18:10
corvus	i have run the zuul_pull playbook to get latest images everywhere; starting the restart playbook shortly	18:12
corvus	restart playbook is running	18:14
corvus	in screen	18:14
fungi	thanks!	18:19
ianw	... i seem to have very few review.o.o mails again :/	19:04
opendevreview	Merged opendev/zone-opendev.org master: Add bridge -> bridge01 CNAME https://review.opendev.org/c/opendev/zone-opendev.org/+/867540	19:17
ianw	I got 3 emails about ^ after I +2 +W'd it. but none of the emails about its creation :/ according to logs, gerrit has sent and mimecast has received a bunch of mail. i'll have to take it up with internal IT :/	19:24
*** dasm is now known as dasm\|off		19:26
fungi	logged into the mm3 server as admin (had to confirm the account e-mail address first since we never did that) and at https://lists.zuul-ci.org/mailman3/domains/ i see two "mail host" entries as expected, but both use the same "web host" of lists.opendev.org	19:32
clarkb	fungi: ok I think we should check with the api for creating those if a web host value is settable on creation and maybe just edit these in the django admin and sort through the others in our CI system?	19:33
fungi	there's an "edit" link but if i try to follow it i get a 403 forbidden from https://lists.zuul-ci.org/admin/sites/site/1/change/	19:33
clarkb	huh I wonder if we have to make the admin more of an admin	19:35
clarkb	also we can probably edit this via the db directly...	19:35
fungi	or if that url isn't plumbed correctly	19:35
clarkb	check the web logs I guess	19:36
fungi	"GET /admin/sites/site/1/change/ HTTP/1.1" 403 5907 "https://lists.zuul-ci.org/mailman3/domains/"	19:38
clarkb	is that from apache or the python server? usually the python server had extra content for errors but maybe that's only for 500s	19:39
fungi	AH01630: client denied by server configuration: proxy:uwsgi://localhost:8080/admin/sites/site/1/change/, referer: https://lists.zuul-ci.org/mailman3/domains/	19:39
fungi	that's it	19:39
clarkb	oh ya we might have made this stuff localhost only?	19:40
clarkb	I seem to remember something about that? I can't recall if we did that or not	19:40
fungi	quite possible, and for good reason	19:40
clarkb	I remember the topic coming up. I don't recall the outcome	19:40
clarkb	fungi: the apache config has a require local for /admin	19:45
fungi	yeah, i'm fiddling with ssh port forwarding now	19:46
fungi	forwarding 8080 seems to not really work though	19:49
fungi	i don't even seem to be able to perform basic get requests over the socket	19:50
fungi	https://paste.opendev.org/show/bkz1sr9XIHp5kCsjHNVv/	19:51
clarkb	8080 is uwsgi not http iirc	19:52
clarkb	you need to talk to the regular port 443 but with a source originating from the mail server	19:53
fungi	ahh, that makes more sense	19:53
fungi	yeah, that gets me a login page	19:55
fungi	and i'm able to get into it with the admin account credentials	19:56
fungi	and yes, the sites panel does list only a single site which has both a domain name and display name of lists.opendev.org	19:58
fungi	so i guess we need one created for lists.zuul-ci.org and then associate that with the lists.zuul-ci.org mail domain	19:58
clarkb	fungi: and maybe we hold a test node and test that?	19:59
fungi	yes	19:59
clarkb	just to avoid unexpected fallout	19:59
clarkb	but ya that sounds like the next thing to do	19:59
fungi	oh, i'm definitely not switching anything in the production webui for this now	19:59
clarkb	:)	19:59
fungi	i just wanted to confirm what django has is what i thought we were seeing	20:00
clarkb	fungi: re testing we might want to double check the api doesn't have a method for setting that. If it does we can set it then compare how it ends up	20:00
clarkb	https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/domains.html doesn't seem to have any web host content though :(	20:01
fungi	104.130.140.226 is the last node i held for this, and i haven't deleted it yet, so can poke around on there	20:02
clarkb	sounds good. I need to find lunch	20:02
fungi	i need to start cooking dinner anyway	20:02
fungi	looks like rax sent us another update about 5 minutes ago, seems they're still working on the volume problem impacting nb02	20:48
fungi	"...our monitoring systems have detected a problem with the server which hosts your Cloud Block Storage device, nb02.opendev.org/main01, ... at 2022-12-13T20:42:18.499971. We are currently investigating..."	20:50
ianw	i've got the mail open and can keep an eye on that too	21:07
clarkb	we are waiting on one job to end on ze01 before it restarts. I think it must be a paused job because there is no ansible running	21:29
ianw	looks like we just got a mail saying the nb02 storage is restored. i need to school run then can poke to make sure it's happy	21:29
clarkb	I'm looking at what is taking zuul so long to restart and note that a number of tripleo changes have a number of long running jobs with multiple retries. At first glance looking at the logs for those jobs it looks like the nodes are "going away" and are not reachable for log collection after some sort of failure	21:44
clarkb	that reminds me of the network manager thing	21:44
clarkb	however that should only affect lookups on the host? ssh into the hosts should still work (even if the dns reverse lookup that ssh does fails)	21:45
clarkb	but first things first, did we ever land a fix for the NM thing? the original was reverted iirc	21:45
clarkb	https://review.opendev.org/c/openstack/project-config/+/866475 looks like the new version of the fix has not landed	21:46
* clarkb reviews this now		21:46
clarkb	I think we are about 2 hours away from ze01 restarting	21:49
clarkb	about 6 hours after the request :(	21:49
clarkb	I don't see a ton of retries in jobs outside of tripleo but they do exist https://zuul.opendev.org/t/openstack/build/28b291e7955e48cf84fc20378b354d45/log/job-output.txt#216-223 is an interesting example because it hit a simple job and early on but still recorded logs	21:56
clarkb	The behavior there is different from what we see in the long running tripleo jobs	21:56
clarkb	that said I wonder if some sort of name resolution thing is the underlying issue and that NM dns fix above might help	21:56
*** dviroel\|rover is now known as dviroel\|out		21:58
opendevreview	Merged openstack/project-config master: Ensure NetworkManager doesn't override /etc/resolv.conf https://review.opendev.org/c/openstack/project-config/+/866475	21:59
clarkb	that landed quicker than I expected. If nb02 isn't happy yet that should be fine as we run the nodepool job hourly	22:05
clarkb	(so we'll catch up soon enough)	22:05
ianw	nb02 has /opt mounted ro, i'm just going to reboot	22:11
clarkb	fungi: did you change anything with mm3? I'm seeing zuul lists listed under lists.opendev.org and vice versa now	22:14
clarkb	it definitely didn't do that before	22:14
clarkb	it seems to be functional though	22:14
ianw	i now realise rebooting with a possibly corrupt large /opt was not a sensible thing to do	22:19
fungi	clarkb: i altered nothing as far as i know	22:21
fungi	i confirmed the admin account	22:22
fungi	other than that i merely viewed things as far as i could tell. maybe mm3 is subject to heisenberg's uncertainty?	22:22
clarkb	there is a FILTER_VHOST setting which causes the web to do the filtering iirc	22:24
fungi	but i agree i'm seeing all mailing lists shown under both sites now	22:25
clarkb	https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/MAXK2AAP7HGSTQDFSBID7DVUGXLUHO4G/	22:25
fungi	i wonder if somehow loading the django admin interface wrote something	22:25
clarkb	that seems to say that the behavior we saw before with the domain name in the top left is expected, but filter vhosts should filter things as we had before	22:25
clarkb	however: "For Postorius, it works only if in the domains view, each mail host has a unique web host."	22:26
clarkb	I guess we should try to reproduce this on the held test node?	22:29
clarkb	I don't know what is going on there. If this is a side effect of confirming an admin account that seems to be a really bad bug in mailman	22:30
fungi	i have a feeling it's a side effect of me logging into the django admin interface and browsing around without clicking anything that "looks like it would change something"	22:31
clarkb	ya that seems most likely.	22:32
fungi	perhaps there are things that change things but don't make it apparent that simply viewing them would do so	22:32
ianw	PV VG Fmt Attr PSize PFree	22:32
ianw	/dev/xvdc1 main lvm2 a-- <1024.00g 0	22:32
ianw	[unknown] main lvm2 a-m <1024.00g 0	22:32
clarkb	I'm saying mailman shouldn't behave that way :) but ya I guess we try to reproduce it on the held node and then sort it out from there?	22:32
ianw	i don't think this is good?	22:32
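In the `pvs` output above, the third character of the Attr column is `m` when a physical volume is missing, which matches the row whose device shows as [unknown]. A rough Python sketch that flags such rows, assuming the six-column layout pasted above; the function name is illustrative:

```python
def missing_pvs(pvs_output):
    """Return (device, vg) pairs whose pv_attr marks the PV as missing."""
    missing = []
    for line in pvs_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == "PV":
            continue  # skip the header and anything that is not a full row
        device, vg, _fmt, attr = fields[:4]
        if len(attr) >= 3 and attr[2] == "m":
            missing.append((device, vg))
    return missing


sample = """
PV         VG   Fmt  Attr PSize     PFree
/dev/xvdc1 main lvm2 a--  <1024.00g 0
[unknown]  main lvm2 a-m  <1024.00g 0
"""
result = missing_pvs(sample)
```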
fungi	ianw: that looks like it's lost track of the underlying volume. is that after rebooting?	22:34
fungi	it was apparently xvdb we were seeing errors for	22:35
fungi	ahh, you did say rebooting	22:35
clarkb	oh right we have 2x1TB volumes now	22:35
ianw	yeah, that was in rescue mode. i've commented out the /opt mount in fstab and hopefully it comes back to the main system now	22:35
fungi	it should	22:36
ianw	it was very dumb to reboot this and assume it would work	22:36
clarkb	so ya I guess the one for xvdb doesn't have the lvm bits there anymore?	22:36
fungi	likely more than just the lvm pv header missing	22:36
fungi	i'm going to guess we've effectively lost /opt and need to recreate and repopulate it	22:36
ianw	it's just rebooting out of rescue mode, that will make it easier to reason about	22:37
fungi	thankfully those contents aren't precious, just convenient	22:37
fungi	anyway, as for the listserv, seems like it isn't in dire straits so i'll probably try to tackle further research tomorrow morning with a clear head	22:38
clarkb	fungi: yes, everything seems fine other than it showing extra data for the two domains	22:39
ianw	ok, it's up	22:39
clarkb	fungi: I don't think it is critical, more of a huh how did that happen that I noticed trying to find an email in the archive	22:39
ianw	well, i can see both pv's now in "regular" mode	22:39
fungi	yeah, i think it could be an offshoot of us not having properly separate sites created for the two domains	22:39
fungi	ianw: oh! did the volume end up active too? (just not mounted)	22:40
ianw	running fsck /dev/mapper/main-main now ... see what happens	22:40
fungi	maybe it simply needed to reconnect	22:40
clarkb	ianw: possible that your rescue instance wasn't able to read the lvm data due to being too old or something?	22:40
clarkb	oh ya maybe it was just a connectivity issue and rebooting a couple of times convinced it to be present again	22:40
ianw	maybe? it could see one pv, but not the other ... could be anything i guess; yay clouds	22:41
clarkb	fungi: fwiw I poked around the mm3 api docs a bit more and couldn't find evidence that we can set the domain web_host at creation time. However, that doesn't mean it isn't possible. But I think we start with a test node and try to reproduce there then work forward to the behavior we had and expect	22:41
clarkb	I'm working on my semi monthly opendev update email here https://etherpad.opendev.org/p/V-YLkq0iEJyhBi4hHsFU	22:42
fungi	not sure whose idea it was to encapsulate scsi over ip networks, but i've never been a fan	22:42
clarkb	I realized I hadn't sent one in a while and writing that is half the info I need for the opendev annual report content :)	22:42
clarkb	Hoping to move on to annual report content this week too for both opendev and zuul	22:42
*** rlandy is now known as rlandy\|out		22:43
fungi	i'm also going to try to get a preliminary 2023 engagement report generated, though i may temporarily patch out the mailing list activity collector until i get time to rewrite it to grok mm3's api	22:43
fungi	since right now it only knows how to scrape the mbox files provided with the old pipermail archives	22:44
clarkb	fungi: and now they are all in a xapian index or something right? So ya api seems necessary	22:46
fungi	er, 2022 engagement report i mean ;)	22:47
clarkb	https://www.gerritcodereview.com/3.7.html#important-notes am I reading that correctly that 3.7 requires an offline upgrade too? wow	22:54
ianw	ok, nb02 has fsck'd /opt and it did some thing but is clean now	22:57
clarkb	I think https://github.com/go-gitea/gitea/issues/21880 may be what is holding up gitea 1.18	22:59
clarkb	ianw: excellent, then the next hourly run of ansible should update our nodepool element content with the NM dns fix	22:59
ianw	^^ yeah that's a bit unclear	22:59
ianw	it does seem like it's saying offline upgrade	23:00
ianw	i'm just clearing out /dib_tmp before restarting container	23:00
clarkb	ianw: ya I think it is. I'm surprised since they have avoided those for a long time	23:00
clarkb	ugh that tripleo change is doing its third rerun of the 2ish hour long job	23:05
ianw	oh oh	23:10
ianw	nb is unhappy, let me post in matrix	23:10
clarkb	ianw: nb02?	23:11
ianw	yeah, looks like openstacksdk backtrace(s)	23:12
clarkb	note that the other threads for building appear to be working	23:18
clarkb	So I don't think this is a halt and catch fire moment which also explains why we haven't noticed for a while :/	23:19
clarkb	I expect ze01 to finally restart in about 30 minutes	23:50
ianw	i guess openstack-tox-py38 has pinned itself to bionic, but tox-py38 hasn't? -> https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/zuul.d/jobs.yaml#L247	23:55
clarkb	ianw: no tox-pyXY come from zuul-jobs and we cannot pin them there	23:55
clarkb	ianw: you'll just need to specify a nodeset in a variant probably. Zuul and Nodepool do this	23:55
ianw	ahh, it must have been quite a minute since i looked at the dib queue	23:55
ianw	yeah	23:55
opendevreview	Ian Wienand proposed openstack/diskimage-builder master: tox jobs: pin to correct nodesets https://review.opendev.org/c/openstack/diskimage-builder/+/867579	23:58
opendevreview	Clark Boylan proposed openstack/diskimage-builder master: tox jobs: pin to correct nodesets https://review.opendev.org/c/openstack/diskimage-builder/+/867579	23:59
clarkb	there was a small yaml thing so I just fixed it	23:59