Tuesday, 2023-08-15

opendevreview	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Remove pre-bullseye release workaround https://review.opendev.org/c/openstack/diskimage-builder/+/891391	07:36
opendevreview	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Install netplan.io for Debian Bookworm https://review.opendev.org/c/openstack/diskimage-builder/+/891323	07:40
opendevreview	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Change default value of DIB_DEBIAN_ALT_INIT_PACKAGE https://review.opendev.org/c/openstack/diskimage-builder/+/891299	07:42
opendevreview	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Stop creating default user for cloud-init https://review.opendev.org/c/openstack/diskimage-builder/+/891322	07:42
opendevreview	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Stop creating default user for cloud-init https://review.opendev.org/c/openstack/diskimage-builder/+/891322	07:43
opendevreview	Merged zuul/zuul-jobs master: Stop testing ensure-kubernetes with crio under ubuntu bionic https://review.opendev.org/c/zuul/zuul-jobs/+/891341	16:19
opendevreview	Merged zuul/zuul-jobs master: Support ensure-kubernetes on bookworm https://review.opendev.org/c/zuul/zuul-jobs/+/891339	16:26
clarkb	heads up it looks like zuul executors are running out of disk space cc infra-root corvus	16:29
clarkb	I wonder if /var/lib/zuul didn't get properly mounted on the new servers	16:30
clarkb	so we're running with less disk space in general	16:30
clarkb	looking at cacti I think this may be the case but I'm still trying to sort out breakfast so I haven't loaded ssh keys to confirm directly yet	16:30
fungi	ohboy	16:32
fungi	yeah, there is no /var/lib/zuul filesystem	16:32
fungi	it's just using the rootfs	16:32
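A check like the one fungi describes (is `/var/lib/zuul` its own filesystem, or just a directory on the rootfs?) can be sketched in Python. This is a minimal illustration, not the command actually used in the channel; the helper name `describe_path` is hypothetical.

```python
import os
import shutil


def describe_path(path):
    """Report whether `path` is a mount point and how full its filesystem is."""
    usage = shutil.disk_usage(path)  # statistics for the filesystem holding `path`
    percent = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "path": path,
        "is_mount": os.path.ismount(path),
        "percent_used": percent,
    }


if __name__ == "__main__":
    # Paths taken from the discussion; they are skipped if absent on this host.
    for candidate in ("/", "/var/lib/zuul", "/opt"):
        if os.path.exists(candidate):
            print(describe_path(candidate))
```

If `is_mount` is false for a directory, its data lives on whatever filesystem holds the parent directory, which is the situation reported here for the new executors.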
corvus	oh neat i can help	16:32
fungi	should we pause the executors to limit further damage while we (re)add volumes?	16:33
Clark[m]	cool I'm going to try and finish feeding everyone then jump back to real keyboard	16:33
corvus	fungi: they are self-protecting (they are taking themselves offline)	16:34
fungi	oh, good	16:34
fungi	i'm volume hunting now	16:34
corvus	https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1 shows they are basically all offline now	16:35
corvus	fascinating though that they've been skirting the limit for weeks/months now	16:35
fungi	pretty amazing it lasted this long	16:35
corvus	i guess maybe nova just merged the commit that broke the camels back or something?	16:36
fungi	seems so	16:36
fungi	the executors are in rax-dfw	16:36
fungi	i don't see any dedicated cinder volumes... in the past did we just mount a partition from the ephemeral disk?	16:37
corvus	so we do bind-mount /var/lib/zuul; and we have a big volume on /opt which is mostly empty. i guess we don't have any ansible that links the two?	16:37
corvus	fungi: i think ephemeral (/opt) is what we used but i don't know for sure	16:38
fungi	that seems the most likely, yes	16:38
Clark[m]	Ya I don't recall either.	16:38
fungi	maybe we used to symlink it?	16:38
Clark[m]	It is possible if the volumes were created with the old servers they were set up to auto delete when the old servers went away. Otherwise the ephemeral disk seems likely	16:39
Clark[m]	I don't think it was symlinked because cacti has a graph for that device	16:39
fungi	i don't think rax supports auto-deleting volumes	16:39
corvus	i'm trying to get cacti to spit out a 6 month graph of /opt but it's buggy	16:39
frickler	corvus: yes the range selector is broken, you can get 1y at the bottom of e.g. http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=69591&rra_id=all	16:41
corvus	okay if you just click on a graph you can get the different time periods, which is good enough	16:41
corvus	yeah that	16:41
corvus	http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=71433&rra_id=all	16:41
corvus	it's interesting that we don't have a graph for /opt before this	16:42
corvus	er before the new server	16:42
corvus	and we did for /var/lib/zuul	16:42
Clark[m]	Implies /opt was part of /	16:42
fungi	i suspect we didn't have a separate /opt partition	16:42
corvus	so i think we must have remounted the ephemeral volume there	16:42
fungi	yeah, that	16:42
corvus	yep	16:42
fungi	do we want to just do it that way again?	16:42
corvus	and i guess we did that manually and had no automation for it	16:43
Clark[m]	Launch node can do it	16:43
Clark[m]	But it is a manual selection in the command	16:43
corvus	oh, maybe i missed an argument to launch node	16:43
fungi	we can start from either end of the list and work our way inward?	16:43
corvus	okay so later, i'll update the deployment docs and make a note of that	16:43
fungi	presumably this will be less work than rebuilding the servers	16:43
corvus	and for now, yeah, i think we should stop; move dirs around; unmount; update fstab; reboot	16:44
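The "update fstab" part of that plan amounts to pointing the existing ephemeral-disk entry at a different mount point. The actual commands lived in the etherpad and are not reproduced in this log, so the following is only a sketch under assumptions: the function name `retarget_mount` and the device name `/dev/xvde1` are hypothetical.

```python
def retarget_mount(fstab_text, old_mountpoint, new_mountpoint):
    """Return fstab text with entries mounted at old_mountpoint moved to new_mountpoint."""
    lines = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 2 and not fields[0].startswith("#")
        if is_entry and fields[1] == old_mountpoint:
            fields[1] = new_mountpoint  # second fstab field is the mount point
            line = "\t".join(fields)
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical fstab contents; the real device name is not given in the log.
    sample = "# ephemeral disk\n/dev/xvde1 /opt ext4 errors=remount-ro 0 2\n"
    print(retarget_mount(sample, "/opt", "/var/lib/zuul"), end="")
```

Comment lines and unrelated entries pass through unchanged, so the function can be run over a whole file.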
fungi	i can start with ze12	16:44
corvus	how about i formulate a list of steps so we can make sure it's consistent	16:45
fungi	https://etherpad.opendev.org/p/d0l2F7iYeyHzVw1EpSBy	16:45
fungi	i started to jot down the commands i was planning to run	16:45
corvus	https://etherpad.opendev.org/p/B09GJw93S-x7U564-kfw	16:45
corvus	i'll start that on ze01	16:45
fungi	okay, moving to that pad	16:46
corvus	let's use fungi's pad	16:46
fungi	okay, moving back	16:46
Clark[m]	I was wrong launch node can set up arbitrary paths for cinder volumes but the ephemeral node is hard coded to swap and /opt	16:46
corvus	do we want a hard down or graceful?	16:47
corvus	there may be some jobs still running, and graceful would let them finish, but would take longer	16:47
Clark[m]	Graceful might take a while for jobs that will ultimately fail	16:47
corvus	good point, there is a high (but not 100%) job failure risk	16:48
fungi	i took a stab at putting the full set of commands in https://etherpad.opendev.org/p/d0l2F7iYeyHzVw1EpSBy but please adjust to your liking	16:49
fungi	presumably the rootfs will drain a bit as jobs stop, so i'm 50/50 on whether graceful is better	16:50
corvus	fungi: that lgtm; you want to try that on ze12 now and let us know how it goes?	16:50
corvus	actually	16:50
fungi	straight away	16:50
corvus	let's go ahead and run stop on all the executors	16:50
fungi	yeah, good call	16:51
corvus	i'll do that from bridge with ansible	16:51
fungi	thanks	16:51
fungi	will we still need to down the containers afterward, or is that part of stopping them?	16:51
corvus	i will down the containers	16:51
fungi	thanks, crossed it off the pad then	16:52
corvus	this way any jobs that are aborted don't get restarted on an executor that when then immediately down and abort the job again	16:52
corvus	s/when/we/	16:52
corvus	okay hard-stopped all executors	16:52
corvus	feel free to proceed on ze12	16:52
fungi	shall i #status notice Zuul job execution is temporarily paused while we rearrange local storage on the servers	16:53
corvus	fungi: status lgtm	16:53
fungi	#status notice Zuul job execution is temporarily paused while we rearrange local storage on the servers	16:53
opendevstatus	fungi: sending notice	16:53
fungi	working on 12 now	16:53
-opendevstatus- NOTICE: Zuul job execution is temporarily paused while we rearrange local storage on the servers		16:53
fungi	i should have timed it	16:53
opendevstatus	fungi: finished sending notice	16:56
fungi	i missed a few commands initially which i'm adding to the pad	16:56
fungi	still waiting for the zuul data to move to the other fs though	16:57
fungi	i'll grab the correct ownership/permissions once it completes and fill in the chmod/chown commands	16:57
fungi	went ahead and filled them in while waiting	16:59
clarkb	do we need a step between 5 and 6 to wait for zuul to stop complete?	16:59
clarkb	*completely	16:59
fungi	anyway, i'll get a wall clock estimate of the move time	16:59
corvus	clarkb: i did a hard stop; they're all down	16:59
clarkb	ah got it	16:59
fungi	we'll presumably want to do the rest of these in parallel once i'm sure all the commands are right, since this step takes a while	17:01
clarkb	makes sense. We want to see that ze12 is happy with the end result prior to starting on the others, right?	17:02
corvus	agreed. i have windows open ready for ze01-ze06	17:04
corvus	given the time it's taking... maybe we should decide we're comfortable with steps 1-5 and start those everywhere but then hold there for any changes from ze12?	17:06
clarkb	corvus: I think the risk with 1-5 is if our hourly jobs run and rewrite project config in the intermediate /opt state	17:07
clarkb	that is actually a risk either way I guess	17:07
clarkb	maybe we need to stop ansible on bridge too?	17:07
clarkb	oh wait	17:07
clarkb	there are no running executors	17:07
clarkb	so no jobs	17:07
fungi	i'm good with that	17:07
clarkb	so ya I think 1-5 should be pretty safe	17:08
corvus	yeah, as long as there's no cron (but when we bring ze12 up that risk returns)	17:08
fungi	i also have my screen sessions on 7-11 ready	17:08
clarkb	fungi: where in the process is 12?	17:08
fungi	still on line 5	17:08
fungi	the big data move	17:08
corvus	then i'll run lines 1-4 on ze01-ze06 now. sound good?	17:08
clarkb	should we add a step 4.5 that deletes the builds content?	17:08
clarkb	I don't know where the majority of that file copying is happening (maybe it is the git dirs?)	17:09
fungi	progress report on 12 is that /opt now contains 15G of data	17:09
corvus	we can do 4.5	17:09
clarkb	oh and those are hardlinks so clearing out the data in builds may not help much	17:09
fungi	the rootfs is only 40gb and a lot of it is operating system, so probably not too much longer	17:09
clarkb	just thinking that when we start the executor back up again it will clear the builds anyway	17:09
corvus	no idea what mv would do with those hardlinks though; it could expand them	17:09
corvus	(since we're crossing filesystems)	17:10
clarkb	yes it expands them	17:10
clarkb	so I think 4.5 is prudent	17:10
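The concern about hard links can be quantified before a cross-filesystem move. Whether links survive a copy depends on the tool and its options, so this sketch only measures the two extremes: bytes if every inode is stored once versus bytes if every path becomes its own file. The helper names and the `build-a`/`build-b` layout are hypothetical.

```python
import os
import tempfile


def tree_sizes(root):
    """Return (bytes counting each inode once, bytes counting every path)."""
    seen = set()
    unique_bytes = 0
    per_path_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            per_path_bytes += st.st_size
            key = (st.st_dev, st.st_ino)  # identifies the underlying inode
            if key not in seen:
                seen.add(key)
                unique_bytes += st.st_size
    return unique_bytes, per_path_bytes


def demo():
    """Build a tiny tree with one file hard-linked into two build directories."""
    with tempfile.TemporaryDirectory() as root:
        original = os.path.join(root, "repo-object")
        with open(original, "wb") as f:
            f.write(b"x" * 4096)
        for build in ("build-a", "build-b"):
            os.mkdir(os.path.join(root, build))
            os.link(original, os.path.join(root, build, "repo-object"))
        return tree_sizes(root)


if __name__ == "__main__":
    print(demo())
```

The gap between the two numbers is the extra space a link-expanding copy would need, which is the motivation for deleting build data before the move.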
fungi	makes sense	17:10
corvus	i put it in as new #3	17:11
fungi	okay, so there's a new step 3, yep	17:11
corvus	that look good?	17:11
fungi	i'll try to retrofit it on 12	17:11
corvus	i wouldn't worry about ze12	17:11
fungi	12 just finished the data move	17:11
clarkb	yes that path lgtm	17:11
corvus	they will get deleted when zuul starts	17:11
fungi	okay, good	17:11
corvus	(or you could do it anyway, it's fine either way)	17:12
clarkb	* the path in #3	17:12
corvus	okay, i'll run step 1-6 on ze01-ze06 now, good?	17:12
clarkb	yes	17:12
fungi	sounds good. i've finished 12 up to all but the reboot	17:13
fungi	presumably we don't want to reboot any until we're done with them all?	17:13
clarkb	I'll disable ansible on bridge	17:13
fungi	thanks	17:13
clarkb	if I do that I think you can reboot 12 when ready	17:13
corvus	fungi: /opt doesn't look right	17:13
fungi	i'm working on 7-11 now	17:13
fungi	oh	17:13
fungi	d'oh, thanks	17:14
fungi	i forgot to rm the old mountpoint	17:14
clarkb	bridge should avoid running ansible.	17:14
fungi	perms and ownership match so no need to preserve it	17:14
fungi	12 look better now?	17:15
corvus	that looks right :)	17:15
fungi	i added a new step 8 to rm it	17:15
fungi	okay, working on 7-11 now	17:16
clarkb	what was in /opt prior to the rm ?	17:16
corvus	01-06 all have step 6 in progress now	17:16
corvus	clarkb: /newopt	17:16
corvus	like /opt/newopt	17:17
clarkb	oh I see. Is the problem actually step 9 then?	17:17
clarkb	its moving all of newopt into /opt?	17:17
fungi	well, step 9 was a problem because of the missing step 8	17:17
corvus	clarkb: fungi fixed the steps so they should work for new hosts	17:17
corvus	yeah, but with step 8 now, the procedure should be good	17:18
fungi	build data deletion is taking some time on a few of these	17:18
clarkb	ok I'm not understanding why that helps since we're unmounting /opt/ then deleting what was mounted under the root fs for /opt	17:18
corvus	clarkb: step 8 is new	17:19
fungi	clarkb: step 8 was previously not there, it removes the mountpoint	17:19
corvus	clarkb: old procedure was "unmount /opt but leave /opt directory stub in place then move newopt underneath that directory"	17:19
corvus	clarkb: new procedure is "unmount /opt, delete stub /opt directory; rename /newopt to /opt"	17:19
clarkb	oh I see mv's behavior is changing because the destination is no longer there	17:20
corvus	yep	17:20
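The difference between the old and new procedure comes down to move semantics: moving a directory onto an existing directory places it inside, while moving onto a missing path renames it. Python's `shutil.move` follows the same rule, so it can illustrate the point; the function name `move_newopt` and the file name `data` are hypothetical.

```python
import os
import shutil
import tempfile


def move_newopt(stub_exists):
    """Move <root>/newopt to <root>/opt and return what ends up inside <root>/opt."""
    with tempfile.TemporaryDirectory() as root:
        newopt = os.path.join(root, "newopt")
        opt = os.path.join(root, "opt")
        os.mkdir(newopt)
        open(os.path.join(newopt, "data"), "w").close()
        if stub_exists:
            os.mkdir(opt)  # leftover mount point directory
        shutil.move(newopt, opt)
        return sorted(os.listdir(opt))


if __name__ == "__main__":
    print("stub left in place:", move_newopt(True))
    print("stub removed first:", move_newopt(False))
```

With the stub left in place the result is the nested `opt/newopt` layout corvus spotted on ze12; with the stub removed first, the contents land directly under `opt`.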
fungi	step 6 is now underway for 7-11	17:20
corvus	step 6 just finished for ze01-06.	17:20
corvus	should i continue with the process? maybe just do ze01 and double check it?	17:21
fungi	sure	17:21
clarkb	sounds good to me	17:21
fungi	presumably the move goes faster now with the build data deleted first	17:21
corvus	ze01 complete through step 13	17:22
fungi	so 1 and 12 are done except for rebooting	17:22
corvus	yep	17:22
corvus	ze01 lgtm; feel free to double check	17:22
fungi	01 lgtm	17:23
clarkb	ya me too	17:23
fungi	with ansible blocked on bridge do we want to reboot one (or both) of those?	17:23
fungi	step 6 finished on the remainder, i'll go through the rest of the steps on those now	17:24
fungi	okay, all executors are done except for rebooting	17:26
clarkb	my expectation is that rebooting 01 and/or 12 at this point is fine	17:26
fungi	do we want to reboot one and see if it comes up properly?	17:26
corvus	i'm still working on ze01-06	17:26
fungi	oh, got it, thought they were finished	17:26
corvus	okay all done except reboot	17:28
fungi	do you want to reboot 01 and see if it works properly?	17:28
corvus	let's just reboot 12 and if it looks good do the rest	17:28
fungi	oh, i can do 12	17:28
fungi	rebooting 12 now	17:28
clarkb	note we probably need to manually docker compose up -d?	17:28
corvus	probably so	17:29
fungi	i'm able to ssh back into it now	17:30
fungi	no zuul processes running, as anticipated	17:30
fungi	want me to up -d the container?	17:30
corvus	yep	17:30
fungi	started	17:31
corvus	i'm staging ansible commands to do the rest of the hosts once we're happy with 12	17:31
clarkb	as expected it says it is clearing out builds	17:31
fungi	thanks, so ansible to reboot them and then to up their containers after?	17:31
corvus	yep. and actually, why don't i go ahead and reboot all the others now?	17:32
fungi	wfm	17:32
corvus	ze01-ze11 rebooting	17:33
fungi	once we're sure they're running builds again, i can #status notice Zuul job execution has resumed with additional disk space on the servers	17:34
corvus	ze12 seems to be doing executor things	17:34
fungi	yes, a flurry in the log	17:34
fungi	including ansible to remote nodes	17:34
clarkb	once we're happy with the state of things I can remove the DISABLE-ANSIBLE flag file on bridge	17:34
corvus	i'm happy to up the rest now. sound good?	17:34
clarkb	wfm	17:35
fungi	/var/lib/zuul is at 15% used on ze12	17:35
fungi	corvus: thanks, please proceed!	17:35
corvus	the others should be up now	17:35
corvus	component page looks good	17:36
fungi	https://zuul.opendev.org/components does show them all running	17:36
fungi	yep	17:36
corvus	grafana agrees	17:36
clarkb	we are running a few versions ahead now but I don't expect that to be a problem	17:36
fungi	a bunch of commits later than the rest of the servers, but probably fine	17:36
clarkb	there weren't any schema changes iirc	17:36
corvus	hrm. we should put the schema api version in the components api endpoint	17:37
fungi	that's a stellar idea	17:37
corvus	but i agree, i don't recall any changes	17:37
corvus	clarkb: i think you're good to remove the flag file	17:37
fungi	https://zuul.opendev.org/t/openstack/stream/5e3436e0211c48f1a990cf5a6d03cce6?logfile=console.log is a running build	17:37
clarkb	corvus: done	17:38
corvus	thanks for the help and sorry i missed that during the server replacement	17:39
fungi	shall i #status notice Zuul job execution has resumed with additional disk space on the servers	17:40
fungi	or wait until we see a job succeed?	17:40
fungi	looks like https://zuul.opendev.org/t/openstack/build/ef52e4829b4b40d8bfa91d5466748614 just succeeded	17:40
clarkb	corvus: it didn't occur to me to check on that so as much a fault on my end as yours	17:40
clarkb	thank you for jumping on fixing it	17:40
fungi	yes, thanks for the help and sorry for not spotting the problem myself	17:41
fungi	#status notice Zuul job execution has resumed with additional disk space on the servers	17:42
opendevstatus	fungi: sending notice	17:42
-opendevstatus- NOTICE: Zuul job execution has resumed with additional disk space on the servers		17:42
fungi	builds seem to be succeeding, so i didn't see any point in holding off sending that now	17:42
opendevstatus	fungi: finished sending notice	17:45
fungi	i can talk about plans to roll out service upgrades, but need to disappear at any minute for an early dinner out so probably won't be able to review/approve things for at least an hour	20:03
clarkb	fungi: I too need lunch but I'm thinking maybe we start with the mm3 stuff that you feel is safe	20:03
clarkb	I want to do etherpad but also feel like I need to repage some of that in so I'm comfortable with it again	20:04
fungi	yeah, those first three changes you +2'd should be entirely safe	20:04
clarkb	maybe plan to do etherpad early tomorrow instead. And then I'll work on cleaning up the gitea change to get it ready soon	20:04
fungi	for mm3	20:04
clarkb	fungi: in that case lets start there	20:04
fungi	i'd be fine merging the etherpad update too, but understand if you want to space them out so we don't end up with two fires to fight at the same time	20:04
fungi	okay, my ride is here. ttfn	20:05
clarkb	ya exactly. That and I want to make sure I remember all the bits we were concerned about before proceeding with etherpad	20:05
clarkb	frickler: I just sent email to the repo discuss list about the star limits issue	20:14
clarkb	mostly asking them if stars should be improved to avoid this, if there are any options for listing these changes, and what considerations should be made when increasing index.maxTerms	20:14
clarkb	and now I should eat lunch	20:15
corvus	i'm going to go ahead and restart the rest of zuul just to get everybody on the same commit	20:32
corvus	#status restarted the rest of zuul so all components are at the same version	20:42
opendevstatus	corvus: unknown command	20:42
corvus	#status log restarted the rest of zuul so all components are at the same version	20:42
opendevstatus	corvus: finished logging	20:42
clarkb	sounds good	20:57
clarkb	fungi: one of the things I had been meaning to do is check our settings config file for etherpad against a 1.9.1 example. I've removed all the settings we already set from this paste and left what is remaining that we can configure https://paste.opendev.org/show/bBzSkMUyO8LTvR6SafX8/	21:12
clarkb	Looking at that config I think we are ok to proceed with upgrading as either defaults are fine or if we want to configure things they are for UX behaviors that can be done after the upgrade	21:13
clarkb	I think I'm happy to proceed after doing that and looking at the change again. But we should start with your mailman stuff when you are ready	21:13
clarkb	the gitea access log format change was to intentionally remove source port info from the log lines because some parsers don't expect the port. I hope you don't use a reverse proxy...	21:42
* clarkb sorts out how to override the format back to what it was before		21:42
opendevreview	Clark Boylan proposed opendev/system-config master: Update to Gitea 1.20 https://review.opendev.org/c/opendev/system-config/+/886993	21:47
clarkb	fungi: the last remaining todo assuming my latest changes check out is the one around url prefix types and whether or not we need to exclude some. Gitea made a recent point release to add some items to that list by default due to security concerns. Not sure if you are aware of a general list of protocols to avoid	21:50
fungi	okay, back	22:05
fungi	sorry that took longer than anticipated	22:06
clarkb	no worries. I'm still around if you want to approve your mailman changes	22:07
fungi	meh, browsers are generally better at blacklisting unsafe protocol schemes than worrying about it server-side	22:08
fungi	servers aren't unsafe by hyperlinking things, it's up to browsers to protect their users	22:08
clarkb	fungi: its the protocols that get linkified in markdown in your browser aiui	22:08
clarkb	ya ok	22:08
fungi	right	22:10
clarkb	but ya I think if you're good I'm good for you to +A those 3 mm changes	22:18
fungi	the pasted etherpad config lgtm btw	22:18
fungi	approving those 3 mm3 changes now and will check on stuff once they deploy	22:19
clarkb	fungi: note the pasted config is what we are not setting in our config. When you say it lgtm you are saying you are happy with us not setting those values?	22:19
fungi	yes those look like reasonable defaults i mean	22:19
clarkb	cool	22:19
fungi	indentationOnNewLine being false is interesting. i think that's a behavior change but one i welcome	22:20
opendevreview	Merged opendev/system-config master: Pin importlib_resources<6 in mailman images https://review.opendev.org/c/opendev/system-config/+/890220	22:36
opendevreview	Merged opendev/system-config master: Make mailman3 DB migration check PyVer-agnostic https://review.opendev.org/c/opendev/system-config/+/890253	22:58
opendevreview	Merged opendev/system-config master: Use magic domain guessing in Mailman 3 https://review.opendev.org/c/opendev/system-config/+/867987	22:58
fungi	now just waiting for the deploy jobs	22:59
clarkb	hourly deployments should finish up soon then it will do the deployment job for the changes above	23:14
fungi	yep	23:15
fungi	there will probably be a brief outage of the ml web interfaces, the services take a couple of minutes to come up after the containers start	23:16
clarkb	fungi: looks like the jobs ran and I can still reach the service	23:27
clarkb	not sure if it changed anything though	23:28
clarkb	the jobs did report success though	23:28
fungi	yeah, i'm checking to see if they restarted	23:42
fungi	23:01 process start times	23:43
fungi	uwsgi and python3 processes	23:45
fungi	so looks like they did	23:45
Clark[m]	Excellent	23:45
Clark[m]	I think that may have been for the second job that ran not the third. Did we need a restart to be triggered by the third too?	23:46
fungi	oh	23:46
fungi	yeah, we do	23:46
fungi	-rw-r--r-- 1 systemd-network systemd-journal 13465 Aug 15 23:21 /var/lib/mailman/web/settings.py	23:47
fungi	it contains "SITE_ID = 0" but https://lists.zuul-ci.org/archives/ still says lists.opendev.org in the corner	23:47
fungi	so the config got updated after the containers were restarted	23:47
fungi	i guess i should just manually down/up -d them for now?	23:48
fungi	we should add a restart handler trigger for changes to that file?	23:49
Clark[m]	Ya I think we should probably do both	23:53
fungi	we have apache restart/reload handlers but not anything for restarting the containers	23:53
Clark[m]	If the settings file changes we restart services	23:53
Clark[m]	I think the LE handler for jitsi meet shows how to do that	23:53
fungi	thanks, checking	23:53
fungi	yeah playbooks/roles/letsencrypt-create-certs/handlers/restart_jitsi_meet.yaml has docker-compose restart web	23:55
