Tuesday, 2022-12-13

clarkbok I'll push a new patchset00:05
clarkbbut first sending the meeting agenda00:05
opendevreviewClark Boylan proposed opendev/system-config master: Update tox.ini for tox v4  https://review.opendev.org/c/opendev/system-config/+/86726900:09
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-kubernetes: add microk8s support  https://review.opendev.org/c/zuul/zuul-jobs/+/86695300:16
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] zuul-jobs-test-registry-buildset-registry-k8s-microk8s  https://review.opendev.org/c/zuul/zuul-jobs/+/86706300:16
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-kubernetes: add microk8s support  https://review.opendev.org/c/zuul/zuul-jobs/+/86695300:24
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] zuul-jobs-test-registry-buildset-registry-k8s-microk8s  https://review.opendev.org/c/zuul/zuul-jobs/+/86706300:24
*** rlandy is now known as rlandy|out00:28
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-kubernetes: add microk8s support  https://review.opendev.org/c/zuul/zuul-jobs/+/86695300:35
opendevreviewIan Wienand proposed zuul/zuul-jobs master: [wip] zuul-jobs-test-registry-buildset-registry-k8s-microk8s  https://review.opendev.org/c/zuul/zuul-jobs/+/86706300:35
opendevreviewIan Wienand proposed zuul/zuul-jobs master: use-buildset-registry: support microk8s  https://review.opendev.org/c/zuul/zuul-jobs/+/86706301:01
*** dviroel|rover|out is now known as dviroel|rover01:12
*** dviroel|rover is now known as dviroel|rover|out01:36
*** yadnesh|away is now known as yadnesh04:47
opendevreviewTim Beermann proposed zuul/zuul-jobs master: Add yamllint job.  https://review.opendev.org/c/zuul/zuul-jobs/+/86667906:37
opendevreviewTim Beermann proposed zuul/zuul-jobs master: Add yamllint job.  https://review.opendev.org/c/zuul/zuul-jobs/+/86667906:47
opendevreviewTim Beermann proposed zuul/zuul-jobs master: Add yamllint job.  https://review.opendev.org/c/zuul/zuul-jobs/+/86667906:59
*** sandy__ is now known as ysandeep07:57
*** jpena|off is now known as jpena08:27
opendevreviewMerged openstack/project-config master: Flip jeepyb over to building Gerrit 3.6  https://review.opendev.org/c/openstack/project-config/+/86731409:41
*** ysandeep is now known as ysandeep|brb10:37
*** ysandeep|brb is now known as ysandeep11:09
*** rlandy|out is now known as rlandy11:17
*** dviroel|rover|out is now known as dviroel|rover11:18
*** frenzy_friday is now known as frenzy_friday|doc11:34
*** sfinucan is now known as stephenfin12:04
*** yadnesh is now known as yadnesh|away13:34
*** dasm|off is now known as dasm14:00
*** frenzy_friday|doc is now known as frenzy_friday14:57
clarkbgmann: basically as far as we know the change in "trigger vote" vs "submit requirement" label rendering on list pages isn't configurable. I think what is happening is that Gerrit only shows you labels that are required to be set to merge on the summary views. What you can do is search by specific labels and values to get those listings though.15:01
fungior you can make those labels required, i suppose?15:02
*** marios is now known as marios|out15:05
clarkbya maybe you can set them up where they can't be lowest negative value to merge and then that would implicitly work for most things?15:08
clarkbinfra-root we just got email saying the volume backing nb02's build dir needs emergency maintenance. I don't think this is a big deal for us we'll just build images less quickly15:17
clarkbthey don't want us touching the volume until given the all clear15:17
*** pojadhav is now known as pojadhav|dinner15:23
mnasiadkaI have a feeling that Zuul UI has some problem refreshing - some jobs are not updated for the last 30 minutes15:31
mnasiadkaok, it got updated just now ;)15:32
clarkbdepending on what you mean by not updating that's perfectly normal. There may be more demand for nodes than we have quota for delaying node assignments. Some jobs take more than half an hour to complete so won't flip from running to success/fail for a long period etc15:34
mnasiadkaclarkb: I rather meant that the log stream stopped updating for like 30 minutes, and then I got loads of logs on the console stream (and the job was marked as completed) - sounded like some networking issue at a nodepool provider15:39
clarkbinfra-root I've got a couple of small bookkeeping changes that would be good to land https://review.opendev.org/c/opendev/system-config/+/867269 addresses tox v4 in system-config (it runs a lot of tests but they all seem to pass) and https://review.opendev.org/c/openstack/project-config/+/867282 re-sorts gate ahead of post-review and fixes the trigger mechanism per corvus' suggestion15:46
*** JasonF is now known as JayF15:53
*** dviroel|rover is now known as dviroel|rover|lunch15:57
clarkbcorvus: not sure if you saw but we didn't end up restarting Zuul over the weekend due to an issue with our restart playbook and new ansible on bridge (docker-compose wants to make a tty by default but new ansible doesn't have one). This has been corrected and I'd like to run that playbook soonish. I've got meetings through the infra meeting but then I should be able to start it. This16:02
clarkbwill switch us to python3.11 based images among other things. Any concerns with doing that today?16:02
opendevreviewMerged openstack/project-config master: Scale iweb back to 25% of possible quota  https://review.opendev.org/c/openstack/project-config/+/86726116:13
corvusclarkb: that sounds good -- i can actually run the restart playbook if you like.16:29
clarkbcorvus: oh that would be great16:29
corvusclarkb: ` flock -n /var/run/zuul_reboot.lock /usr/local/bin/ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml >> /var/log/ansible/zuul_reboot.log 2>&1` run that in screen?16:33
clarkbyup that seems to match the cronjob entry16:34
corvuscool, i'll wait for https://review.opendev.org/865923 to land then kick that off and keep an eye on it16:36
clarkbcorvus: oh and be sure to run it on bridge01.opendev.org not old bridge16:36
corvus(i don't want to split 865923 across a restart)16:36
corvusclarkb: have we updated the cname yet?16:37
clarkbcorvus: no because old bridge was never a cname, only an A record16:37
corvus(thanks for the reminder)16:37
clarkbbut it is a good reminder for trying to come up with a way to make this more obvious/easier to get right16:38
corvushow close are we to terminating bridge.openstack?16:38
corvusand maybe we could go ahead and make a bridge.opendev.org cname....16:38
clarkbcorvus: I think we are actually close at this point. There is a set of changes I still need to review to do an encrypted tarball backup thing and I think the idea was to run that on old bridge before removing it16:39
clarkbya bridge.opendev.org CNAME bridge01.opendev.org doesn't help when using bridge.openstack.org but will avoid confusion in the future if there is a bridge02 swap16:39
corvusand we can get started on that muscle memory now16:42
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Add bridge -> bridge01 CNAME  https://review.opendev.org/c/opendev/zone-opendev.org/+/86754016:44
clarkbactually there is a bug in that I think. I need a '.' at the end of the target name right?16:44
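[Editor's note: a minimal sketch of why the trailing '.' matters here, assuming standard zone-file semantics in which a name without a trailing dot is relative to the zone origin. The `qualify` helper is hypothetical, written only for illustration, and is not part of zone-opendev.org.]

```python
def qualify(name: str, origin: str) -> str:
    """Expand a zone-file name the way a DNS server reads it.

    A name ending in '.' is already fully qualified; any other name
    is treated as relative and has the zone origin appended.
    """
    if not origin.endswith("."):
        origin += "."
    if name.endswith("."):
        return name
    if name == "@":
        # '@' is zone-file shorthand for the origin itself.
        return origin
    return f"{name}.{origin}"


# Target written without the trailing dot: the origin gets appended.
print(qualify("bridge01.opendev.org", "opendev.org"))
# -> bridge01.opendev.org.opendev.org.

# Target written with the trailing dot: used as-is.
print(qualify("bridge01.opendev.org.", "opendev.org"))
# -> bridge01.opendev.org.
```

[A short relative target such as `bridge01` would also resolve to the intended name under the same rule.]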
fungiinfra-prod-service-nodepool failed to deploy 86726116:45
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Add bridge -> bridge01 CNAME  https://review.opendev.org/c/opendev/zone-opendev.org/+/86754016:45
clarkbfungi: nb02 is having a bad time16:45
fungiahh, about the maintenance i guess16:46
clarkbfungi: did the job fail and still apply the change to nl03? or did nl03 not get updated16:46
fungii'm digging up the log on bridge16:46
clarkbit updated the config on nl03 at least16:46
fungiyeah, it was nb02.opendev.org16:46
fungi"TASK [sync-project-config : Sync project-config repo] ... fatal: [nb02.opendev.org]: FAILED! ... the output has been hidden due to the fact that 'no_log: true' was specified for this result"16:47
fungi[Tue Dec 13 16:48:32 2022] blk_update_request: I/O error, dev xvdb, sector 1459993088 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 016:48
fungiit's still spewing those16:48
clarkbya and they asked us to not touch the volume while they figure it out so I think we just let it be for now16:48
*** dviroel|rover|lunch is now known as dviroel|rover16:50
clarkbworst case we create a new volume and build new images with a fresh cache. That might need some zk/nodepool record keeping to make happy too but I think it isn't terrible16:51
opendevreviewAde Lee proposed openstack/project-config master: Add FIPS job for ubuntu  https://review.opendev.org/c/openstack/project-config/+/86711216:57
opendevreviewAde Lee proposed zuul/zuul-jobs master: Add ubuntu to enable-fips role  https://review.opendev.org/c/zuul/zuul-jobs/+/86688117:12
opendevreviewAde Lee proposed openstack/project-config master: Add FIPS job for ubuntu  https://review.opendev.org/c/openstack/project-config/+/86711217:15
*** jpena is now known as jpena|off17:36
*** ysandeep is now known as ysandeep|out17:50
*** pojadhav|dinner is now known as pojadhav18:10
corvusi have run the zuul_pull playbook to get latest images everywhere; starting the restart playbook shortly18:12
corvusrestart playbook is running18:14
corvusin screen18:14
ianw... i seem to have very few review.o.o mails again :/19:04
opendevreviewMerged opendev/zone-opendev.org master: Add bridge -> bridge01 CNAME  https://review.opendev.org/c/opendev/zone-opendev.org/+/86754019:17
ianwI got 3 emails about ^ after I +2 +W'd it.  but none of the emails about its creation :/  according to logs, gerrit has sent and mimecast has received a bunch of mail.  i'll have to take it up with internal it :/19:24
*** dasm is now known as dasm|off19:26
fungilogged into the mm3 server as admin (had to confirm the account e-mail address first since we never did that) and at https://lists.zuul-ci.org/mailman3/domains/ i see two "mail host" entries as expected, but both use the same "web host" of lists.opendev.org19:32
clarkbfungi: ok I think we should check with the api for creating those if a web host value is settable on creation and maybe just edit these in the django admin and sort through the others in our CI system?19:33
fungithere's an "edit" link but if i try to follow it i get a 403 forbidden from https://lists.zuul-ci.org/admin/sites/site/1/change/19:33
clarkbhuh I wonder if we have to make the admin more of an admin19:35
clarkbalso we can probably edit this via the db directly...19:35
fungior if that url isn't plumbed correctly19:35
clarkbcheck the web logs I guess19:36
fungi"GET /admin/sites/site/1/change/ HTTP/1.1" 403 5907 "https://lists.zuul-ci.org/mailman3/domains/"19:38
clarkbis that from apache or the python server? usually the python server had extra content for errors but maybe that's only for 500s19:39
fungiAH01630: client denied by server configuration: proxy:uwsgi://localhost:8080/admin/sites/site/1/change/, referer: https://lists.zuul-ci.org/mailman3/domains/19:39
fungithat's it19:39
clarkboh ya we might have made this stuff localhost only?19:40
clarkbI seem to remember something about that? I can't recall if we did that or not19:40
fungiquite possible, and for good reason19:40
clarkbI remember the topic coming up. I don't recall the outcome19:40
clarkbfungi: the apache config has a require local for /admin19:45
fungiyeah, i'm fiddling with ssh port forwarding now19:46
fungiforwarding 8080 seems to not really work though19:49
fungii don't even seem to be able to perform basic get requests over the socket19:50
clarkb8080 is uwsgi not http iirc19:52
clarkbyou need to talk to the regular port 443 but with a source originating from the mail server19:53
fungiahh, that makes more sense19:53
fungiyeah, that gets me a login page19:55
fungiand i'm able to get into it with the admin account credentials19:56
fungiand yes, the sites panel does list only a single site which has both a domain name and display name of lists.opendev.org19:58
fungiso i guess we need one created for lists.zuul-ci.org and then associate that with the lists.zuul-ci.org mail domain19:58
clarkbfungi: and maybe we hold a test node and test that?19:59
clarkbjust to avoid unexpected fallout19:59
clarkbbut ya that sounds like the next thing to do19:59
fungioh, i'm definitely not switching anything in the production webui for this now19:59
fungii just wanted to confirm what django has is what i thought we were seeing20:00
clarkbfungi: re testing we might want to double check the api doesn't have a method for setting that. If it does we can set it then compare how it ends up20:00
clarkbhttps://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/domains.html doesn't seem to have any web host content though :(20:01
fungi104.130.140.226 is the last node i held for this, and i haven't deleted it yet, so can poke around on there20:02
clarkbsounds good. I need to find lunch20:02
fungii need to start cooking dinner anyway20:02
fungilooks like rax sent us another update about 5 minutes ago, seems they're still working on the volume problem impacting nb0220:48
fungi"...our monitoring systems have detected a problem with the server which hosts your Cloud Block Storage device, nb02.opendev.org/main01, ... at 2022-12-13T20:42:18.499971. We are currently investigating..."20:50
ianwi've got the mail open and can keep an eye on that too21:07
clarkbwe are waiting on one job to end on ze01 before it restarts. I think it must be a paused job because there is no ansible running21:29
ianwlooks like we just got a mail saying the nb02 storage is restored.  i need to school run then can poke to make sure it's happy21:29
clarkbI'm looking at what is taking zuul so long to restart and note that a number of tripleo changes have a number of long running jobs with multiple retries. At first glance looking at the logs for those jobs it looks like the nodes are "going away" and are not reachable for log collection after some sort of failure21:44
clarkbthat reminds me of the network manager thing21:44
clarkbhowever that should only affect lookups on the host? ssh into the hosts should still work (even if the dns reverse lookup that ssh does fails)21:45
clarkbbut first things first, did we ever land a fix for the NM thing? the original was reverted iirc21:45
clarkbhttps://review.opendev.org/c/openstack/project-config/+/866475 looks like the new version of the fix has not landed21:46
* clarkb reviews this now21:46
clarkbI think we are about 2 hours away from ze01 restarting21:49
clarkbabout 6 hours after the request :(21:49
clarkbI don't see a ton of retries in jobs outside of tripleo but they do exist https://zuul.opendev.org/t/openstack/build/28b291e7955e48cf84fc20378b354d45/log/job-output.txt#216-223 is an interesting example because it hit a simple job and early on but still recorded logs21:56
clarkbThe behavior there is different from what we see in the long running tripleo jobs21:56
clarkbthat said I wonder if some sort of name resolution thing is the underlying issue and that NM dns fix above might help21:56
*** dviroel|rover is now known as dviroel|out21:58
opendevreviewMerged openstack/project-config master: Ensure NetworkManager doesn't override /etc/resolv.conf  https://review.opendev.org/c/openstack/project-config/+/86647521:59
clarkbthat landed quicker than I expected. If nb02 isn't happy yet that should be fine as we run the nodepool job hourly22:05
clarkb(so we'll catch up soon enough)22:05
ianwnb02 has /opt mounted ro, i'm just going to reboot22:11
clarkbfungi: did you change anything with mm3? I'm seeing zuul lists listed under lists.opendev.org and vice versa now22:14
clarkbit definitely didn't do that before22:14
clarkbit seems to be functional though22:14
ianwi now realise rebooting with a possibly corrupt large /opt was not a sensible thing to do22:19
fungiclarkb: i altered nothing as far as i know22:21
fungii confirmed the admin account22:22
fungiother than that i merely viewed things as far as i could tell. maybe mm3 is subject to heisenberg's uncertainty?22:22
clarkbthere is a FILTER_VHOST setting which causes the web to do the filtering iirc22:24
fungibut i agree i'm seeing all mailing lists shown under both sites now22:25
fungii wonder if somehow loading the django admin interface wrote something22:25
clarkbthat seems to say that the behavior we saw before with the domain name in the top left is expected, but filter vhosts should filter things as we had before22:25
clarkbhowever: "For Postorius, it works only if in the domains view, each mail host has a unique web host."22:26
clarkbI guess we should try to reproduce this on the held test node?22:29
clarkbI don't know what is going on there. If this is a side effect of confirming an admin account that seems to be a really bad bug in mailman22:30
fungii have a feeling it's a side effect of me logging into the django admin interface and browsing around without clicking anything that "looks like it would change something"22:31
clarkbya that seems most likely.22:32
fungiperhaps there are things that change things but don't make it apparent that simply viewing them would do so22:32
ianw  PV         VG   Fmt  Attr PSize     PFree22:32
ianw  /dev/xvdc1 main lvm2 a--  <1024.00g    0 22:32
ianw  [unknown]  main lvm2 a-m  <1024.00g    0 22:32
clarkbI'm saying mailman shouldn't behave that way :) but ya I guess we try to reproduce it on the held node and then sort it out from there?22:32
ianwi don't think this is good?22:32
fungiianw: that looks like it's lost track of the underlying volume. is that after rebooting?22:34
fungiit was apparently xvdb we were seeing errors for22:35
fungiahh, you did say rebooting22:35
clarkboh right we have 2x1TB volumes now22:35
ianwyeah, that was in rescue mode.  i've commented out the /opt mount in fstab and hopefully it comes back to the main system now22:35
fungiit should22:36
ianwit was very dumb to reboot this and assume it would work22:36
clarkbso ya I guess the one for xvdb doesn't have the lvm bits there anymore?22:36
fungilikely more than just the lvm pv header missing22:36
fungii'm going to guess we've effectively lost /opt and need to recreate and repopulate it22:36
ianwit's just rebooting out of rescue mode, that will make it easier to reason about22:37
fungithankfully those contents aren't precious, just convenient22:37
fungianyway, as for the listserv, seems like it isn't in dire straits so i'll probably try to tackle further research tomorrow morning with a clear head22:38
clarkbfungi: yes, everything seems fine other than it showing extra data for the two domains22:39
ianwok, it's up22:39
clarkbfungi: I don't think it is critical, more of a huh how did that happen that I noticed trying to find an email in the archive22:39
ianwwell, i can see both pv's now in "regular" mode22:39
fungiyeah, i think it could be an offshoot of us not having properly separate sites created for the two domains22:39
fungiianw: oh! did the volume end up active too? (just not mounted)22:40
ianwrunning fsck /dev/mapper/main-main now ... see what happens22:40
fungimaybe it simply needed to reconnect22:40
clarkbianw: possible that your rescue instance wasn't able to read the lvm data due to being too old or something?22:40
clarkboh ya maybe it was just a connectivity issue and rebooting a couple of times convinced it to be present again22:40
ianwmaybe?  it could see one pv, but not the other ... could be anything i guess; yay clouds22:41
clarkbfungi: fwiw I poked around the mm3 api docs a bit more and couldn't find evidence that we can set the domain web_host at creation time. However, that doesn't mean it isn't possible. But I think we start with a test node and try to reproduce there then work forward to the behavior we had and expect22:41
clarkbI'm working on my semi monthly opendev update email here https://etherpad.opendev.org/p/V-YLkq0iEJyhBi4hHsFU22:42
funginot sure whose idea it was to encapsulate scsi over ip networks, but i've never been a fan22:42
clarkbI realized I hadn't sent one in a while and writing that is half the info I need for the opendev annual report content :)22:42
clarkbHoping to move on to annual report content this week too for both opendev and zuul22:42
*** rlandy is now known as rlandy|out22:43
fungii'm also going to try to get a preliminary 2023 engagement report generated, though i may temporarily patch out the mailing list activity collector until i get time to rewrite it to grok mm3's api22:43
fungisince right now it only knows how to scrape the mbox files provided with the old pipermail archives22:44
clarkbfungi: and now they are all in a xapian index or something right? So ya api seems necessary22:46
fungier, 2022 engagement report i mean ;)22:47
clarkbhttps://www.gerritcodereview.com/3.7.html#important-notes am I reading that correctly that 3.7 requires an offline upgrade too? wow22:54
ianwok, nb02 has fsck'd /opt and it did some thing but is clean now22:57
clarkbI think https://github.com/go-gitea/gitea/issues/21880 may be what is holding up gitea 1.1822:59
clarkbianw: excellent, then the next hourly run of ansible should update our nodepool element content with the NM dns fix22:59
ianw^^ yeah that's a bit unclear22:59
ianwit does seem like it's saying offline upgrade23:00
ianwi'm just clearing out /dib_tmp before restarting container23:00
clarkbianw: ya I think it is. I'm surprised since they have avoided those for a long time23:00
clarkbugh that tripleo change is doing its third rerun of the 2ish hour long job23:05
ianwoh oh23:10
ianwnb is unhappy, let me post in matrix23:10
clarkbianw: nb02?23:11
ianwyeah, looks like openstacksdk backtrace(s)23:12
clarkbnote that the other threads for building appear to be working23:18
clarkbSo I don't think this is a halt and catch fire moment which also explains why we haven't noticed for a while :/23:19
clarkbI expect ze01 to finally restart in about 30 minutes23:50
ianwi guess openstack-tox-py38 has pinned itself to bionic, but tox-py38 hasn't? -> https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/zuul.d/jobs.yaml#L24723:55
clarkbianw: no tox-pyXY come from zuul-jobs and we cannot pin them there23:55
clarkbianw: you'll just need to specify a nodeset in a variant probably. Zuul and Nodepool do this23:55
ianwahh, it must have been quite a minute since i looked at the dib queue23:55
opendevreviewIan Wienand proposed openstack/diskimage-builder master: tox jobs: pin to correct nodesets  https://review.opendev.org/c/openstack/diskimage-builder/+/86757923:58
opendevreviewClark Boylan proposed openstack/diskimage-builder master: tox jobs: pin to correct nodesets  https://review.opendev.org/c/openstack/diskimage-builder/+/86757923:59
clarkbthere was a small yaml thing so I just fixed it23:59
