Tuesday, 2023-08-15

opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Remove pre-bullseye release workaround  https://review.opendev.org/c/openstack/diskimage-builder/+/89139107:36
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Install netplan.io for Debian Bookworm  https://review.opendev.org/c/openstack/diskimage-builder/+/89132307:40
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Change default value of DIB_DEBIAN_ALT_INIT_PACKAGE  https://review.opendev.org/c/openstack/diskimage-builder/+/89129907:42
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Stop creating default user for cloud-init  https://review.opendev.org/c/openstack/diskimage-builder/+/89132207:42
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Stop creating default user for cloud-init  https://review.opendev.org/c/openstack/diskimage-builder/+/89132207:43
opendevreviewMerged zuul/zuul-jobs master: Stop testing ensure-kubernetes with crio under ubuntu bionic  https://review.opendev.org/c/zuul/zuul-jobs/+/89134116:19
opendevreviewMerged zuul/zuul-jobs master: Support ensure-kubernetes on bookworm  https://review.opendev.org/c/zuul/zuul-jobs/+/89133916:26
clarkbheads up it looks like zuul executors are running out of disk space cc infra-root corvus16:29
clarkbI wonder if /var/lib/zuul didn't get properly mounted on the new servers16:30
clarkbso we're running with less disk space in general16:30
clarkblooking at cacti I think this may be the case but I'm still trying to sort out breakfast so I haven't loaded ssh keys to confirm directly yet16:30
fungiyeah, there is no /var/lib/zuul filesystem16:32
fungiit's just using the rootfs16:32
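[Editor's note] The check described above (is /var/lib/zuul its own filesystem, or just a directory on the rootfs?) can be sketched in a few lines. This is a minimal illustration, not the tooling used in the channel; the helper name describe_path is hypothetical, and the paths in the demo are only examples taken from the discussion.

```python
import os
import shutil


def describe_path(path):
    """Return (is_mount_point, percent_used) for an existing path.

    is_mount_point is False when the path is an ordinary directory that
    lives on its parent's filesystem (for example, on the rootfs).
    """
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    return os.path.ismount(path), percent_used


if __name__ == "__main__":
    # Example paths from the discussion; they may not exist on every machine.
    for path in ("/", "/var/lib/zuul", "/opt"):
        if os.path.exists(path):
            is_mount, pct = describe_path(path)
            print(f"{path}: mount point={is_mount}, {pct:.0f}% used")
        else:
            print(f"{path}: not present on this machine")
```

If a path that is expected to be a dedicated volume reports mount point=False, its data is landing on whatever filesystem contains it.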
corvusoh neat i can help16:32
fungishould we pause the executors to limit further damage while we (re)add volumes?16:33
Clark[m]cool I'm going to try and finish feeding everyone then jump back to real keyboard16:33
corvusfungi: they are self-protecting (they are taking themselves offline)16:34
fungioh, good16:34
fungii'm volume hunting now16:34
corvushttps://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1  shows they are basically all offline now16:35
corvusfascinating though that they've been skirting the limit for weeks/months now16:35
fungipretty amazing it lasted this long16:35
corvusi guess maybe nova just merged the commit that broke the camel's back or something?16:36
fungiseems so16:36
fungithe executors are in rax-dfw16:36
fungii don't see any dedicated cinder volumes... in the past did we just mount a partition from the ephemeral disk?16:37
corvusso we do bind-mount /var/lib/zuul; and we have a big volume on /opt which is mostly empty.  i guess we don't have any ansible that links the two?16:37
corvusfungi: i think ephemeral (/opt) is what we used but i don't know for sure16:38
fungithat seems the most likely, yes16:38
Clark[m]Ya I don't recall either.16:38
fungimaybe we used to symlink it?16:38
Clark[m]It is possible if the volumes were created with the old servers they were set up to auto delete when the old servers went away. Otherwise the ephemeral disk seems likely16:39
Clark[m]I don't think it was symlinked because cacti has a graph for that device16:39
fungii don't think rax supports auto-deleting volumes16:39
corvusi'm trying to get cacti to spit out a 6 month graph of /opt but it's buggy16:39
fricklercorvus: yes the range selector is broken, you can get 1y at the bottom of e.g. http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=69591&rra_id=all16:41
corvusokay if you just click on a graph you can get the different time periods, which is good enough16:41
corvusyeah that16:41
corvusit's interesting that we don't have a graph for /opt before this16:42
corvuser before the new server16:42
corvusand we did for /var/lib/zuul16:42
Clark[m]Implies /opt was part of /16:42
fungii suspect we didn't have a separate /opt partition16:42
corvusso i think we must have remounted the ephemeral volume there16:42
fungiyeah, that16:42
fungido we want to just do it that way again?16:42
corvusand i guess we did that manually and had no automation for it16:43
Clark[m]Launch node can do it16:43
Clark[m]But it is a manual selection in the command16:43
corvusoh, maybe i missed an argument to launch node16:43
fungiwe can start from either end of the list and work our way inward?16:43
corvusokay so later, i'll update the deployment docs and make a note of that16:43
fungipresumably this will be less work than rebuilding the servers16:43
corvusand for now, yeah, i think we should stop; move dirs around; unmount; update fstab; reboot16:44
fungii can start with ze1216:44
corvushow about i formulate a list of steps so we can make sure it's consistent16:45
fungii started to jot down the commands i was planning to run16:45
corvusi'll start that on ze0116:45
fungiokay, moving to that pad16:46
corvuslet's use fungi's pad16:46
fungiokay, moving back16:46
Clark[m]I was wrong launch node can set up arbitrary paths for cinder volumes but the ephemeral node is hard coded to swap and /opt16:46
corvusdo we want a hard down or graceful?16:47
corvusthere may be some jobs still running, and graceful would let them finish, but would take longer16:47
Clark[m]Graceful might take a while for jobs that will ultimately fail16:47
corvusgood point, there is a high (but not 100%) job failure risk16:48
fungii took a stab at putting the full set of commands in https://etherpad.opendev.org/p/d0l2F7iYeyHzVw1EpSBy but please adjust to your liking16:49
fungipresumably the rootfs will drain a bit as jobs stop, so i'm 50/50 on whether graceful is better16:50
corvusfungi: that lgtm; you want to try that on ze12 now and let us know how it goes?16:50
fungistraight away16:50
corvuslet's go ahead and run stop on all the executors16:50
fungiyeah, good call16:51
corvusi'll do that from bridge with ansible16:51
fungiwill we still need to down the containers afterward, or is that part of stopping them?16:51
corvusi will down the containers16:51
fungithanks, crossed it off the pad then16:52
corvusthis way any jobs that are aborted don't get restarted on an executor that we then immediately down and abort the job again16:52
corvusokay hard-stopped all executors16:52
corvusfeel free to proceed on ze1216:52
fungishall i #status notice Zuul job execution is temporarily paused while we rearrange local storage on the servers16:53
corvusfungi: status lgtm16:53
fungi#status notice Zuul job execution is temporarily paused while we rearrange local storage on the servers16:53
opendevstatusfungi: sending notice16:53
fungiworking on 12 now16:53
-opendevstatus- NOTICE: Zuul job execution is temporarily paused while we rearrange local storage on the servers16:53
fungii should have timed it16:53
opendevstatusfungi: finished sending notice16:56
fungii missed a few commands initially which i'm adding to the pad16:56
fungistill waiting for the zuul data to move to the other fs though16:57
fungii'll grab the correct ownership/permissions once it completes and fill in the chmod/chown commands16:57
fungiwent ahead and filled them in while waiting16:59
clarkbdo we need a step between 5 and 6 to wait for zuul to stop complete?16:59
fungianyway, i'll get a wall clock estimate of the move time16:59
corvusclarkb: i did a hard stop; they're all down16:59
clarkbah got it16:59
fungiwe'll presumably want to do the rest of these in parallel once i'm sure all the commands are right, since this step takes a while17:01
clarkbmakes sense. We want to see that ze12 is happy with the end result prior to starting on the others right?17:02
corvusagreed.  i have windows open ready for ze01-ze0617:04
corvusgiven the time it's taking... maybe we should decide we're comfortable with steps 1-5 and start those everywhere but then hold there for any changes from ze12?17:06
clarkbcorvus: I think the risk with 1-5 is if our hourly jobs run and rewrite project config in the intermediate /opt state17:07
clarkbthat is actually a risk either way I guess17:07
clarkbmaybe we need to stop ansible on bridge too?17:07
clarkboh wait17:07
clarkbthere are no running executors17:07
clarkbso no jobs17:07
fungii'm good with that17:07
clarkbso ya I think 1-5 should be pretty safe17:08
corvusyeah, as long as there's no cron (but when we bring ze12 up that risk returns)17:08
fungii also have my screen sessions on 7-11 ready17:08
clarkbfungi: where in the process is 12?17:08
fungistill on line 517:08
fungithe big data move17:08
corvusthen i'll run lines 1-4 on ze01-ze06 now.  sound good?17:08
clarkbshould we add a step 4.5 that deletes the builds content?17:08
clarkbI don't know where the majority of that file copying is happening (maybe it is the git dirs?)17:09
fungiprogress report on 12 is that /opt now contains 15G of data17:09
corvuswe can do 4.517:09
clarkboh and those are hardlinks so clearing out the data in builds may not help much17:09
fungithe rootfs is only 40gb and a lot of it is operating system, so probably not too much longer17:09
clarkbjust thinking that when we start the executor back up again it will clear the builds anyway17:09
corvusno idea what mv would do with those hardlinks though; it could expand them17:09
corvus(since we're crossing filesystems)17:10
clarkbyes it expands them17:10
clarkbso I think 4.5 is prudent17:10
fungimakes sense17:10
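[Editor's note] The concern above is that hard-linked files share one copy of their data on disk, and a copy that handles each file independently writes that data once per link. The sketch below only demonstrates that effect with Python's shutil.copytree, which does not track hard links. Whether a particular mv implementation expands or preserves links when crossing filesystems depends on the tool and is not verified here. The helper names are hypothetical.

```python
import os
import shutil
import tempfile


def unique_bytes(directory):
    """Sum file sizes under directory, counting each inode only once."""
    seen = set()
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                total += st.st_size
    return total


def demo():
    """Return (bytes before, bytes after) a copy that ignores hard links."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.mkdir(src)
        original = os.path.join(src, "object")
        with open(original, "wb") as f:
            f.write(b"x" * 4096)
        # A second name for the same inode: no extra data is stored.
        os.link(original, os.path.join(src, "object-link"))
        dst = os.path.join(tmp, "dst")
        # copytree copies every file independently, so the link is expanded.
        shutil.copytree(src, dst)
        return unique_bytes(src), unique_bytes(dst)


if __name__ == "__main__":
    before, after = demo()
    print(f"unique bytes in source: {before}, in copy: {after}")
```

Deleting data that is about to be discarded anyway, as proposed for the builds directory, avoids paying that cost regardless of how the move tool treats links.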
corvusi put it in as new #317:11
fungiokay, so there's a new step 3, yep17:11
corvusthat look good?17:11
fungii'll try to retrofit it on 1217:11
corvusi wouldn't worry about ze1217:11
fungi12 just finished the data move17:11
clarkbyes that path lgtm17:11
corvusthey will get deleted when zuul starts17:11
fungiokay, good17:11
corvus(or you could do it anyway, it's fine either way)17:12
clarkb* the path in #317:12
corvusokay, i'll run step 1-6 on ze01-ze06 now, good?17:12
fungisounds good. i've finished 12 up to all but the reboot17:13
fungipresumably we don't want to reboot any until we're done with them all?17:13
clarkbI'll disable ansible on bridge17:13
clarkbif I do that I think you can reboot 12 when ready17:13
corvusfungi: /opt doesn't look right17:13
fungii'm working on 7-11 now17:13
fungid'oh, thanks17:14
fungii forgot to rm the old mountpoint17:14
clarkbbridge should avoid running ansible.17:14
fungiperms and ownership match so no need to preserve it17:14
fungi12 look better now?17:15
corvusthat looks right :)17:15
fungii added a new step 8 to rm it17:15
fungiokay, working on 7-11 now17:16
clarkbwhat was in /opt prior to the rm ?17:16
corvus01-06 all have step 6 in progress now17:16
corvusclarkb: /newopt17:16
corvuslike /opt/newopt17:17
clarkboh I see. Is the problem actually step 9 then?17:17
clarkbits moving all of newopt into /opt?17:17
fungiwell, step 9 was a problem because of the missing step 817:17
corvusclarkb: fungi fixed the steps so they should work for new hosts17:17
corvusyeah, but with step 8 now, the procedure should be good17:18
fungibuild data deletion is taking some time on a few of these17:18
clarkbok I'm not understanding why that helps since we're unmounting /opt/ then deleting what was mounted under the root fs for /opt17:18
corvusclarkb: step 8 is new17:19
fungiclarkb: step 8 was previously not there, it removes the mountpoint17:19
corvusclarkb: old procedure was "unmount /opt but leave /opt directory stub in place then move newopt underneath that directory"17:19
corvusclarkb: new procedure is "unmount /opt, delete stub /opt directory; rename /newopt to /opt"17:19
clarkboh I see mv's behavior is changing because the destination is no longer there17:20
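[Editor's note] The difference between the old and new procedure can be reproduced in a scratch directory. This sketch uses Python's shutil.move as a stand-in for mv (an assumption for illustration; the channel used the shell), since it is documented to move the source inside the destination when the destination is an existing directory. The directory names mirror the discussion but live under a temporary root, and move_and_list is a hypothetical helper.

```python
import os
import shutil
import tempfile


def move_and_list(remove_stub_first):
    """Move 'newopt' to 'opt' and return the resulting contents of 'opt'."""
    with tempfile.TemporaryDirectory() as root:
        newopt = os.path.join(root, "newopt")
        opt = os.path.join(root, "opt")
        os.mkdir(newopt)
        os.mkdir(opt)  # the empty mountpoint stub left after unmounting
        with open(os.path.join(newopt, "data"), "w"):
            pass
        if remove_stub_first:
            os.rmdir(opt)  # corresponds to the added "rm the old mountpoint" step
        shutil.move(newopt, opt)
        return sorted(os.listdir(opt))


if __name__ == "__main__":
    print("stub left in place:", move_and_list(remove_stub_first=False))
    print("stub removed first:", move_and_list(remove_stub_first=True))
```

With the stub left in place the data ends up one level too deep (opt/newopt); with the stub removed the move is a plain rename and the data sits directly under opt.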
fungistep 6 is now underway for 7-1117:20
corvusstep 6 just finished for ze01-06.17:20
corvusshould i continue with the process?  maybe just do ze01 and double check it?17:21
clarkbsounds good to me17:21
fungipresumably the move goes faster now with the build data deleted first17:21
corvusze01 complete through step 1317:22
fungiso 1 and 12 are done except for rebooting17:22
corvusze01 lgtm; feel free to double check17:22
fungi01 lgtm17:23
clarkbya me too17:23
fungiwith ansible blocked on bridge do we want to reboot one (or both) of those?17:23
fungistep 6 finished on the remainder, i'll go through the rest of the steps on those now17:24
fungiokay, all executors are done except for rebooting17:26
clarkbmy expectation is that rebooting 01 and/or 12 at this point is fine17:26
fungido we want to reboot one and see if it comes up properly?17:26
corvusi'm still working on ze01-0617:26
fungioh, got it, thought they were finished17:26
corvusokay all done except reboot17:28
fungido you want to reboot 01 and see if it works properly?17:28
corvuslet's just reboot 12 and if it looks good do the rest17:28
fungioh, i can do 1217:28
fungirebooting 12 now17:28
clarkbnote we probably need to manually docker compose up -d?17:28
corvusprobably so17:29
fungii'm able to ssh back into it now17:30
fungino zuul processes running, as anticipated17:30
fungiwant me to up -d the container?17:30
corvusi'm staging ansible commands to do the rest of the hosts once we're happy with 1217:31
clarkbas expected it says it is clearing out builds 17:31
fungithanks, so ansible to reboot them and then to up their containers after?17:31
corvusyep.  and actually, why don't i go ahead and reboot all the others now?17:32
corvusze01-ze11 rebooting17:33
fungionce we're sure they're running builds again, i can #status notice Zuul job execution has resumed with additional disk space on the servers17:34
corvusze12 seems to be doing executor things17:34
fungiyes, a flurry in the log17:34
fungiincluding ansible to remote nodes17:34
clarkbonce we're happy with the state of things I can remove the DISABLE-ANSIBLE flag file on bridge17:34
corvusi'm happy to up the rest now.  sound good?17:34
fungi/var/lib/zuul is at 15% used on ze1217:35
fungicorvus: thanks, please proceed!17:35
corvusthe others should be up now17:35
corvuscomponent page looks good17:36
fungihttps://zuul.opendev.org/components does show them all running17:36
corvusgrafana agrees17:36
clarkbwe are running a few versions ahead now but I don't expect that to be a problem17:36
fungia bunch of commits later than the rest of the servers, but probably fine17:36
clarkbthere weren't any schema changes iirc17:36
corvushrm.  we should put the schema api version in the components api endpoint17:37
fungithat's a stellar idea17:37
corvusbut i agree, i don't recall any changes17:37
corvusclarkb: i think you're good to remove the flag file17:37
fungihttps://zuul.opendev.org/t/openstack/stream/5e3436e0211c48f1a990cf5a6d03cce6?logfile=console.log is a running build17:37
clarkbcorvus: done17:38
corvusthanks for the help and sorry i missed that during the server replacement17:39
fungishall i #status notice Zuul job execution has resumed with additional disk space on the servers17:40
fungior wait until we see a job succeed?17:40
fungilooks like https://zuul.opendev.org/t/openstack/build/ef52e4829b4b40d8bfa91d5466748614 just succeeded17:40
clarkbcorvus: it didn't occur to me to check on that so as much a fault on my end as yours17:40
clarkbthank you for jumping on fixing it17:40
fungiyes, thanks for the help and sorry for not spotting the problem myself17:41
fungi#status notice Zuul job execution has resumed with additional disk space on the servers17:42
opendevstatusfungi: sending notice17:42
-opendevstatus- NOTICE: Zuul job execution has resumed with additional disk space on the servers17:42
fungibuilds seem to be succeeding, so i didn't see any point in holding off sending that now17:42
opendevstatusfungi: finished sending notice17:45
fungii can talk about plans to roll out service upgrades, but need to disappear at any minute for an early dinner out so probably won't be able to review/approve things for at least an hour20:03
clarkbfungi: I too need lunch but I'm thinking maybe we start with the mm3 stuff that you feel is safe20:03
clarkbI want to do etherpad but also feel like I need to repage some of that in so I'm comfortable with it again20:04
fungiyeah, those first three changes you +2'd should be entirely safe20:04
clarkbmaybe plan to do etherpad early tomorrow instead. And then I'll work on cleaning up the gitea change to get it ready soon20:04
fungifor mm320:04
clarkbfungi: in that case lets start there20:04
fungii'd be fine merging the etherpad update too, but understand if you want to space them out so we don't end up with two fires to fight at the same time20:04
fungiokay, my ride is here. ttfn20:05
clarkbya exactly. That and I want to make sure I remember all the bits we were concerned about before proceeding with etherpad20:05
clarkbfrickler: I just sent email to the repo discuss list about the star limits issue20:14
clarkbmostly asking them if stars should be improved to avoid this, if there are any options for listing these changes, and what considerations should be made when increasing index.maxTerms20:14
clarkband now I should eat lunch20:15
corvusi'm going to go ahead and restart the rest of zuul just to get everybody on the same commit20:32
corvus#status restarted the rest of zuul so all components are at the same version20:42
opendevstatuscorvus: unknown command20:42
corvus#status log restarted the rest of zuul so all components are at the same version20:42
opendevstatuscorvus: finished logging20:42
clarkbsounds good20:57
clarkbfungi: one of the things I had been meaning to do is check our settings config file for etherpad against a 1.9.1 example. I've removed all the settings we already set from this paste and left what is remaining that we can configure https://paste.opendev.org/show/bBzSkMUyO8LTvR6SafX8/21:12
clarkbLooking at that config I think we are ok to proceed with upgrading as either defaults are fine or if we want to configure things they are for UX behaviors that can be done after the upgrade21:13
clarkbI think I'm happy to proceed after doing that and looking at the change again. But we should start with your mailman stuff when you are ready21:13
clarkbthe gitea access log format change was to intentionally remove source port info from the log lines because some parsers don't expect the port. I hope you don't use a reverse proxy...21:42
* clarkb sorts out how to override the format back to what it was before21:42
opendevreviewClark Boylan proposed opendev/system-config master: Update to Gitea 1.20  https://review.opendev.org/c/opendev/system-config/+/88699321:47
clarkbfungi: the last remaining todo assuming my latest changes check out is the one around url prefix types and whether or not we need to exclude some. Gitea made a recent point release to add some items to that list by default due to security concerns. Not sure if you are aware of a general list of protocols to avoid21:50
fungiokay, back22:05
fungisorry that took longer than anticipated22:06
clarkbno worries. I'm still around if you want to approve your mailman changes22:07
fungimeh, browsers are generally better at blacklisting unsafe protocol schemes than worrying about it server-side22:08
fungiservers aren't unsafe by hyperlinking things, it's up to browsers to protect their users22:08
clarkbfungi: its the protocols that get linkified in markdown in your browser aiui22:08
clarkbya ok22:08
clarkbbut ya I think if you're good I'm good for you to +A those 3 mm changes22:18
fungithe pasted etherpad config lgtm btw22:18
fungiapproving those 3 mm3 changes now and will check on stuff once they deploy22:19
clarkbfungi: note the pasted config is what we are not setting in our config. When you say it lgtm you are saying you are happy with us not setting those values?22:19
fungiyes those look like reasonable defaults i mean22:19
fungiindentationOnNewLine being false is interesting. i think that's a behavior change but one i welcome22:20
opendevreviewMerged opendev/system-config master: Pin importlib_resources<6 in mailman images  https://review.opendev.org/c/opendev/system-config/+/89022022:36
opendevreviewMerged opendev/system-config master: Make mailman3 DB migration check PyVer-agnostic  https://review.opendev.org/c/opendev/system-config/+/89025322:58
opendevreviewMerged opendev/system-config master: Use magic domain guessing in Mailman 3  https://review.opendev.org/c/opendev/system-config/+/86798722:58
funginow just waiting for the deploy jobs22:59
clarkbhourly deployments should finish up soon then it will do the deployment job for the changes above23:14
fungithere will probably be a brief outage of the ml web interfaces, the services take a couple of minutes to come up after the containers start23:16
clarkbfungi: looks like the jobs ran and I can still reach the service23:27
clarkbnot sure if it changed anything though23:28
clarkbthe jobs did report success though23:28
fungiyeah, i'm checking to see if they restarted23:42
fungi23:01 process start times23:43
fungiuwsgi and python3 processes23:45
fungiso looks like they did23:45
Clark[m]I think that may have been for the second job that ran not the third. Did we need a restart to be triggered by the third too?23:46
fungiyeah, we do23:46
fungi-rw-r--r-- 1 systemd-network systemd-journal 13465 Aug 15 23:21 /var/lib/mailman/web/settings.py23:47
fungiit contains "SITE_ID = 0" but https://lists.zuul-ci.org/archives/ still says lists.opendev.org in the corner23:47
fungiso the config got updated after the containers were restarted23:47
fungii guess i should just manually down/up -d them for now?23:48
fungiwe should add a restart handler trigger for changes to that file?23:49
Clark[m]Ya I think we should probably do both23:53
fungiwe have apache restart/reload handlers but not anything for restarting the containers23:53
Clark[m]If the settings file changes we restart services23:53
Clark[m]I think the LE handler for jitsi meet shows how to do that23:53
fungithanks, checking23:53
fungiyeah playbooks/roles/letsencrypt-create-certs/handlers/restart_jitsi_meet.yaml has docker-compose restart web23:55
