Monday, 2021-05-17

openstackgerrit	Ian Wienand proposed openstack/diskimage-builder master: Fix DISTRO_NAME in Fedora elements https://review.opendev.org/c/openstack/diskimage-builder/+/791627	00:02
openstackgerrit	Ian Wienand proposed opendev/system-config master: zuul job : collect some more logs https://review.opendev.org/c/opendev/system-config/+/791055	00:08
*** janders has quit IRC		00:17
*** Dmitrii-Sh has quit IRC		00:17
*** yoctozepto has quit IRC		00:17
*** zbr has quit IRC		00:17
*** Dmitrii-Sh has joined #opendev		00:17
*** zbr has joined #opendev		00:17
*** janders has joined #opendev		00:17
*** yoctozepto has joined #opendev		00:17
openstackgerrit	Merged opendev/system-config master: Double the default number of ansible forks https://review.opendev.org/c/opendev/system-config/+/791528	00:26
openstackgerrit	Merged openstack/diskimage-builder master: Add fedora-containerfile element https://review.opendev.org/c/openstack/diskimage-builder/+/790365	00:52
*** ricolin has joined #opendev		03:12
*** ricolin has quit IRC		03:12
*** ricolin has joined #opendev		03:14
*** ykarel_ has joined #opendev		03:48
*** ykarel_ has quit IRC		03:51
*** ykarel_ has joined #opendev		03:51
openstackgerrit	Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633	03:57
*** ykarel_ is now known as ykarel		03:59
*** akahat is now known as akahat|ruck		04:26
openstackgerrit	Ian Wienand proposed opendev/system-config master: Run haproxy as root user https://review.opendev.org/c/opendev/system-config/+/791634	04:30
openstackgerrit	Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633	04:32
*** ralonsoh has joined #opendev		04:38
*** vishalmanchanda has joined #opendev		04:43
*** marios has joined #opendev		04:53
openstackgerrit	Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633	04:59
*** ysandeep|away is now known as ysandeep		05:10
*** sboyron has joined #opendev		05:51
*** darshna has joined #opendev		05:51
*** sboyron has quit IRC		05:52
*** sboyron has joined #opendev		05:55
openstackgerrit	Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633	06:04
*** logan- has quit IRC		06:12
*** logan- has joined #opendev		06:15
*** slaweq has joined #opendev		06:35
*** mkowalski has quit IRC		06:43
*** mkowalski has joined #opendev		06:43
*** brinzhang has joined #opendev		06:43
*** amoralej|off is now known as amoralej		06:44
*** gibi has quit IRC		06:54
*** fressi has joined #opendev		06:59
*** iurygregory has quit IRC		07:12
*** hashar has joined #opendev		07:17
*** iurygregory has joined #opendev		07:21
*** andrewbonney has joined #opendev		07:21
openstackgerrit	Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633	07:30
*** tosky has joined #opendev		07:32
*** ysandeep is now known as ysandeep|lunch		07:49
*** lucasagomes has joined #opendev		07:58
*** jpena|off is now known as jpena		07:58
*** whoami-rajat has joined #opendev		08:09
*** dtantsur|afk is now known as dtantsur		08:09
*** ykarel is now known as ykarel|lunch		08:13
*** gibi has joined #opendev		08:21
frickler	mnaser: any update about the IPv6 situation yet? this is still affecting my daily work by forcing me to explicitly require accessing opendev.org via v4 only	08:22
*** brinzhang_ has joined #opendev		08:55
*** brinzhang has quit IRC		08:58
openstackgerrit	Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633	09:03
openstackgerrit	Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633	09:05
*** ysandeep|lunch is now known as ysandeep		09:11
*** ykarel|lunch is now known as ykarel		09:27
*** hrw has joined #opendev		09:32
hrw	morning	09:32
hrw	can https://storage.gra.cloud.ovh.net be configured to show logs? or zuul configured to not store logs there?	09:32
hrw	"Network Error (Unable to fetch URL, check your network connectivity, browser plugins, ad-blockers, or try to refresh this page) https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_16b/777062/2/check-arm64/openstack-tox-py39-arm64/16b46d9/"	09:33
*** ralonsoh has quit IRC		09:54
*** ralonsoh has joined #opendev		09:57
*** hashar is now known as hasharAway		09:58
frickler	hrw: the logs expire after some time, I think 4 weeks, that job seems to be a bit older, so you'd need to rerun it in order to get new logs	10:24
hrw	ah. thanks	10:25
hrw	too many tabs in monday morning patch check and looked in wrong place for job age	10:26
*** gibi has quit IRC		10:42
*** gibi has joined #opendev		10:43
*** jpena is now known as jpena|off		10:58
*** jpena|off is now known as jpena		11:00
*** fressi has quit IRC		11:14
*** fressi has joined #opendev		11:15
openstackgerrit	Ian Wienand proposed opendev/system-config master: [DNM] Use new docker ipv6tables option to map haproxy ports https://review.opendev.org/c/opendev/system-config/+/791633	11:21
ianw	clarkb / fungi: ^ in short, haproxy switched to running as a user. the simple thing to do is to just run as root	11:22
ianw	however, i've been exploring the new options that use ip6tables to make ipv6 much more workable for us. they're still experimental in docker, but i think it's worth fleshing it out just so we understand the option	11:23
ianw	it is incremental steps. it would mean we could expose 80/443 to containers on ipv4 and ipv6 without having to give them capabilities to bind to low ports, or fiddle other settings	11:25
*** jpena is now known as jpena|lunch		11:30
*** fressi has quit IRC		11:36
*** fressi has joined #opendev		11:47
fungi	ianw: some of the permissions errors were about opening files for write though too, i guess ones we bindmount into the container?	11:55
*** yoctozepto has quit IRC		11:55
*** ykarel has quit IRC		11:55
*** ykarel has joined #opendev		12:03
*** hasharAway has quit IRC		12:09
openstackgerrit	Hitesh Kumar proposed openstack/diskimage-builder master: Migrate from testr to stestr https://review.opendev.org/c/openstack/diskimage-builder/+/789246	12:10
*** hashar has joined #opendev		12:10
openstackgerrit	Merged openstack/diskimage-builder master: Fix DISTRO_NAME in Fedora elements https://review.opendev.org/c/openstack/diskimage-builder/+/791627	12:15
*** jpena|lunch is now known as jpena		12:25
*** amoralej is now known as amoralej|lunch		12:25
*** yoctozepto has joined #opendev		12:31
openstackgerrit	chandan kumar proposed openstack/project-config master: Added publish-openstack-python-tarball job https://review.opendev.org/c/openstack/project-config/+/791745	12:38
*** marios is now known as marios|call		13:02
openstackgerrit	chandan kumar proposed openstack/project-config master: Added publish-openstack-python-tarball job https://review.opendev.org/c/openstack/project-config/+/791745	13:11
*** amoralej|lunch is now known as amoralej		13:13
*** marios|call is now known as marios		13:46
kopecmartin	fungi: hi, can you have a look when you have a moment please https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/789481	13:54
*** artom has joined #opendev		14:00
*** ysandeep is now known as ysandeep|afk		14:00
*** chandankumar is now known as raukadah		14:16
fungi	infra-root: per discussion from friday, i've forced a logrotate on zuul02 just now, and am in the process of copying /var/log/zuul to /var/log/zuul.tmp (which is on the rootfs, there's plenty of space there for now)	14:23
clarkb	fungi: thanks! I'm just starting to sit down to start my day, though I've got to visit the optometrist in a bit	14:25
fungi	i'll write up the rest of the plan for it in https://etherpad.opendev.org/p/zuul-swapfs-2021-05-17	14:25
fungi	clarkb: no sweat. i'm happy to wait on the scheduler restart until you're back and have increased your terminal font size a bunch ;)	14:25
*** hashar is now known as hasharAway		14:25
corvus	fungi: etherpad link is empty for me	14:26
fungi	yeah, for me to; i haven't actually pulled it up and started writing just yet ;)	14:27
corvus	oh ho	14:27
fungi	er, me too	14:27
corvus	fungi: why do we need a scheduler restart?	14:27
clarkb	corvus: to fix the swap partition on the zuul02 (we need to repartition xvde to do that and /var/log/zuul shares the same device)	14:28
corvus	gotcha, that's what swapfs means :)	14:29
clarkb	corvus: https://review.opendev.org/c/opendev/system-config/+/791554 is the underlying issue that was fixed. It affects zuul02, zk04-06, two mirrors and review02	14:29
clarkb	I was going to do a more extensive double check today to make sure I didn't miss any on friday	14:29
clarkb	looks like ianw has fixed review02 already	14:29
fungi	yeah, something (i forget what?) changed such that our launch-node script began passing megabytes assuming they were gigabytes	14:30
fungi	so we have servers with 8mb swap devices	14:30
clarkb	fungi: ianw wrote a change when spinning up review02 to limit swap space to 8GB because review02 has ~120GB of memory and that is far too large of a swapfile/partition. Unfortunately this change got the scale wrong and caused things to be limited to 8MB not 8GB	14:31
*** fressi has quit IRC		14:31
fungi	aha, thanks, makes sense	14:32
fungi	oh, right, fixed in 791554 as you said	14:33
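The unit mix-up described above can be sketched in a few lines. This is an illustration only, not the code from the launch-node script or make_swap.sh: the function names are hypothetical, and it assumes sizes are handled in megabytes, that swap defaults to the amount of RAM, and that the 8 GB cap was written without the factor of 1024.

```python
# Hypothetical illustration of the swap-size cap bug discussed above.
# Assumptions: sizes are in megabytes and swap defaults to the RAM size.

MAX_SWAP_GB = 8


def swap_size_mb_buggy(ram_mb: int) -> int:
    """The cap is meant to be 8 GB but is compared as if it were already MB."""
    return min(ram_mb, MAX_SWAP_GB)


def swap_size_mb_fixed(ram_mb: int) -> int:
    """Convert the cap to megabytes before comparing."""
    return min(ram_mb, MAX_SWAP_GB * 1024)


if __name__ == "__main__":
    # 4 GB, 8 GB and roughly 120 GB of RAM, expressed in MB
    for ram_mb in (4096, 8192, 120 * 1024):
        print(ram_mb, swap_size_mb_buggy(ram_mb), swap_size_mb_fixed(ram_mb))
```

With the buggy cap every host ends up with 8 MB of swap regardless of memory, which matches the symptom reported in the channel.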
clarkb	the two mirror hosts use swapfiles and should be easy fixes. I'll try to pick those up after the optometrist	14:34
clarkb	zk04-06 use partitions but nothing seems to be using /opt there (what the swap partition shares a device with) so those are also likely easy. Means zuul02 is the only complicated fixup	14:35
clarkb	side note: looks like logrotate was working properly since the fix for zuul02 logrotation landed	14:36
fungi	yep, i can confirm that seemed to work fine	14:42
clarkb	fwiw my list of servers that need swap fixes is based on changes to our inventory file since the bug landed. I think that is a reasonably complete list but will try and do an ansible check against all the things today	14:43
*** ysandeep|afk is now known as ysandeep		14:44
*** vishalmanchanda has quit IRC		14:54
clarkb	I have fixed up mirror.iad3.inmotion.opendev.org's swap situation. Doing the osuosl mirror next	15:03
clarkb	osuosl mirror is done now too	15:10
clarkb	I'll hold off on zookeeper servers until after my errand today to avoid needing to leave it in a half state	15:13
clarkb	fungi: did you see https://review.opendev.org/c/opendev/system-config/+/791634/1/playbooks/roles/haproxy/files/docker/docker-compose.yaml should address the haproxy problems you worked around over the weekend?	15:14
clarkb	passes testing which did hit the problem previously right?	15:14
clarkb	frickler: hrw: correct we set the swift metadata on our log uploads to clean them up after ~1 month	15:17
clarkb	the volume and size of the logs is quite large which pushes us to do that	15:18
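For readers unfamiliar with Swift object expiry, here is a minimal sketch of the kind of metadata involved. `X-Delete-After` is a standard Swift header whose value is a lifetime in seconds. The 30-day figure is an assumption standing in for "~1 month", and the helper name is hypothetical; the actual upload code used by the Zuul jobs is not shown in this log.

```python
# Sketch: build the Swift object-expiry header for a log upload.
# X-Delete-After is a standard Swift header (value in seconds).
# RETENTION_DAYS is an assumed value standing in for "~1 month".

RETENTION_DAYS = 30


def expiry_headers(retention_days: int = RETENTION_DAYS) -> dict:
    """Return headers asking Swift to delete the object after N days."""
    seconds = retention_days * 24 * 60 * 60
    return {"X-Delete-After": str(seconds)}


if __name__ == "__main__":
    print(expiry_headers())
```

Once the lifetime passes, Swift removes the object, which is why the old log URL quoted earlier no longer resolves.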
*** mlavalle has joined #opendev		15:34
fungi	clarkb: hadn't reviewed it yet, thanks for the reminder	15:35
fungi	infra-root: i think the plan at https://etherpad.opendev.org/p/zuul-swapfs-2021-05-17 is reasonably complete, but update/correct it if you spot obvious problems in there	15:35
*** ykarel has quit IRC		15:39
fungi	the longest parts of the outage will likely be moving the saved logs back into the recreated filesystem, and waiting for the scheduler to finish initializing... any feeling for whether it'll be long enough that we should do a status notice in there?	15:44
clarkb	fungi: we probably should as even the fastest restarts often get noticed	15:46
fungi	i'll add it as a step	15:46
fungi	it's in there before stopping the container now	15:51
*** amoralej is now known as amoralej|off		15:52
fungi	clarkb: i need to switch to kneading pizza dough for a bit, but let me know when you're back from your appointment and i'll get started working through the steps outlined there	15:53
clarkb	fungi: will do!	15:55
*** ysandeep is now known as ysandeep|away		15:59
*** marios is now known as marios|out		16:06
*** artom has quit IRC		16:07
*** artom has joined #opendev		16:07
*** lucasagomes has quit IRC		16:12
*** DSpider has joined #opendev		16:20
*** DSpider has quit IRC		16:26
*** marios|out has quit IRC		16:29
*** dtantsur is now known as dtantsur|afk		16:36
openstackgerrit	Merged opendev/system-config master: Run haproxy as root user https://review.opendev.org/c/opendev/system-config/+/791634	16:56
*** jpena is now known as jpena|off		17:01
*** artom has quit IRC		17:04
*** ralonsoh has quit IRC		17:22
*** hamalq has joined #opendev		17:25
clarkb	fungi: I'm back at this point	17:30
fungi	can you still see?	17:32
clarkb	I can, they only numbed my eyeballs and did not dilate them	17:32
clarkb	I'm pulling up the etherpad now to review the steps	17:32
clarkb	fungi: the etherpad lgtm. Do we also want to stop ansible from running on zuul02 while we do that work? (it won't restart services but it might modify /var/log/zuul contents?)	17:36
fungi	oh, because ansible may set ownership on that path or something?	17:37
clarkb	also we need to do similar with the zk servers. Except instead of /var/log/zuul it is /opt.	17:37
clarkb	fungi: yup	17:37
fungi	i can stick zuul02 in the emergency disable list now, just a sec	17:37
fungi	okay, it's in there	17:38
fungi	i guess we should give it a few minutes in case it's running a playbook which hasn't noticed that	17:38
clarkb	++	17:39
clarkb	infra-root once swap fixups are done the next thing I want to do is delete zuul01, please check if you want to preserve anything on that host	17:40
clarkb	fungi: question about this fs stuff: do we actually want to preserve lost+found across fses? I don't think so?	17:44
clarkb	maybe we do? your copy on zuul02 did copy it fwiw	17:44
clarkb	I guess it is empty	17:44
fungi	yeah, i tend to just ignore it	17:44
fungi	i can delete it after the rsync	17:45
clarkb	fungi: oh I would keep it around I just thought its content was fs specific	17:45
clarkb	and so copying it from one fs to another may not make sense	17:45
fungi	i mean i can delete the copy in the temporary location	17:45
clarkb	got it	17:45
*** andrewbonney has quit IRC		17:46
fungi	so that we don't overwrite the one for the new fs	17:46
clarkb	++	17:46
clarkb	fwiw the process you have on the etherpad looks very similar to what the zk's need so I may just stick a copy and edit it at the bottom of that etherpad too?	17:46
clarkb	will help me not miss anything when doing the zk's	17:46
fungi	okay, added lost+found cleanup and emergency disable list stuff to the plan	17:49
fungi	so i don't forget	17:49
fungi	clarkb: yeah, feel free to plagiarize it for the zk servers where it makes sense, continuing to use that pad seems fine too	17:49
clarkb	fungi: I think I'm ready to do a zk server if you think we should wait a bit on zuul otherwise I'll wait for zuul to finish first	17:55
fungi	here's an idea... since the zk servers are redundant, we can use one to test particularly the parted command syntax	17:56
fungi	in case i got something subtly wrong	17:56
clarkb	fungi: sounds like a good idea. Do you want to drive that or should I? zk04 is a follower so a good candidate	17:58
clarkb	(we can do the leader last just in case)	17:58
fungi	clarkb: your zk plan doesn't include stopping/starting zk. that's needed, right?	17:58
clarkb	fungi: stopping zk shouldn't be needed because /opt isn't used by zk	17:58
fungi	ohh	17:58
clarkb	the only thing in /opt is a mostly empty /opt/containerd	17:58
fungi	okay cool	17:59
clarkb	(there are two subdirs of that dir and no files)	17:59
fungi	yeah plan there lgtm then	17:59
fungi	none of them would be outages anyway i guess?	17:59
clarkb	ya there shouldn't be any outages unless we do something very wrong	17:59
corvus	clarkb: did the zuul02 swap happen yet? i'm wondering if we can squeeze the encrypt change into that sequence	18:01
fungi	corvus: it hasn't happened yet, let's add it	18:01
clarkb	ya we're going to do at least one zk server first to make sure the parted commands are happy	18:01
fungi	corvus: can you plug the commands you want into the outline in https://etherpad.opendev.org/p/zuul-swapfs-2021-05-17 ?	18:01
clarkb	fungi: I've started a root screen on zk04	18:02
clarkb	and I'm editing the emergency.yaml now to put all the zks in it	18:02
fungi	joined	18:02
corvus	fungi: sorry i just mean if we merge 791765 first, then we'll restart into the new decrypt-on-executor code	18:02
clarkb	fungi: note ^ that you need to do an image pull and a full restart for that	18:03
corvus	though if we do that, we'll need to run the zuul stop/start playbooks	18:03
clarkb	corvus: and ya I can review those changes after zks are cleaned up	18:03
fungi	corvus: got it, yeah the current plan was just to down and up the container. if we need executors restarted too than can rework the plan to include that	18:04
fungi	s/than/then/	18:04
clarkb	fungi: I'm proceeding with zk04 now	18:05
fungi	lgtm so far	18:06
clarkb	fungi: I edited the parted command for zk to use 4096 MB instead of 8192 to match memory	18:06
fungi	yup	18:06
fungi	that looks right	18:07
clarkb	fungi: you ready for me to run the parted command now?	18:07
fungi	yes	18:07
fungi	perfecto	18:08
clarkb	fungi: I think that went well	18:11
clarkb	I'll do zk05 and zk06 now, do you want me to do those in a root screen too?	18:11
*** hasharAway is now known as hashar		18:12
fungi	not necessary, that was straightforward and as you said nothing's actually using it	18:12
clarkb	fungi: zk05 is done now if you want to double check it	18:16
clarkb	doing zk06 next	18:16
fungi	looking	18:17
fungi	05 lgtm	18:18
fungi	for zuul restarts, what's the playbook we normally run from bridge?	18:18
fungi	i guess the process would be to stop all the containers on the executor as stated in the maintenance plan and do the filesystem work, then instead of just upping those run the pull and restart playbook(s)?	18:19
clarkb	fungi: there is a system-config/playbooks/zuul_pull.yaml and a zuul_start.yaml and a zuul_stop.yaml	18:19
clarkb	I think you want to do a zuul_pull.yaml then a zuul_stop.yaml then a zuul_start.yaml	18:19
fungi	since it's not a single restart playbook, i suppose i can replace the `docker-compose down` on zuul02 with the full stop playbook on the bridge instead	18:20
clarkb	yes	18:21
clarkb	zk06 is done now too if you can take a quick look then I'll remove the zks from emergency	18:22
clarkb	fungi: oh also the queue dumping and restoring is a bit different on zuul02 now	18:24
fungi	we need the change merged and image updated before we pull, right?	18:24
clarkb	fungi: yes changes need to be merged and images promoted first	18:24
fungi	clarkb: i copied the queue dump/restore from root's command history on zuul02, are there missing steps?	18:25
clarkb	fungi: re queue dumping you need to run it out of a checkout on zuul02 since there isn't one in /opt anymore. Also you need to edit the commands to reenqueue to use docker exec	18:25
clarkb	fungi: if corvus set that up in roots homedir then it should work let me check	18:25
clarkb	yup looks like corvus did that for us, thank you corvus	18:25
clarkb	fungi: ^ you should be good	18:26
clarkb	I'm going to go and review changes for zuul now	18:26
fungi	thanks!	18:26
clarkb	fungi: oh you want to remove zuul02 from the emergency file so that those playbooks function	18:38
clarkb	fungi: instead of using the emergency file we should use disable-ansible to prevent automated ansible from doing things while we do the human controlled ansible	18:38
clarkb	I'm going to remove zk04-06 from the emergency file now	18:38
clarkb	fungi: ^ I left zuul02 in the emergency file but I think you can go ahead and remove it	18:39
clarkb	I'm going to grab lunch while we wait for zuul to CI those changes	18:40
*** artom has joined #opendev		18:48
fungi	oh, yep, removing zuul02 from the emergency disable list	18:48
fungi	that would get in the way of us running those playbooks, for sure	18:48
fungi	where/when should we use disable-ansible in that sequence? now, i guess?	18:50
clarkb	fungi: maybe closer to when we are ready to run it	18:50
clarkb	since that puts a big roadblock on the zuul jobs and they can pile up	18:50
fungi	yeah	18:50
fungi	okay, i've got the continuous deployment disable/resume added to the plan	18:54
corvus	clarkb, fungi: the zuul decrypt patches are approved, if all goes well they should land in ~1 hour. if you need to restart zuul02 before then, that's fine. if it happens after, i can update the plan with the extra commands.	19:09
clarkb	I think we can wait. All of the other hosts have been sorted as far as swap goes. zuul02 is the last one on my list (though I still need to do a wider check)	19:11
fungi	corvus: nah, no rush, and i think i got the commands right if you can just double-check	19:11
clarkb	currently plenty of memory free on zuul02	19:11
corvus	oh you already changed, cool i'll check	19:11
corvus	fungi: looks correct to me	19:12
fungi	awesome, thanks!	19:12
clarkb	once we're done with that I think zuul01 will be up for deletion too	19:13
openstackgerrit	Clark Boylan proposed openstack/project-config master: Stop requiring registered nicks for IRC https://review.opendev.org/c/openstack/project-config/+/791818	19:28
clarkb	I said I would push ^ up last meeting (I'm putting tomorrow's agenda together)	19:28
fungi	probably worth keeping an extra close eye on the results of that once it's in place, given the recent drams	19:29
fungi	drama	19:29
clarkb	++	19:31
clarkb	we also don't need to land it just yet since we've got a few other things in the fire	19:32
*** hashar has quit IRC		19:50
clarkb	looking ahead at my week I'm thinking wednesday may be a good day to try the mailman ansible stuff	19:54
clarkb	corvus: fungi: looks like the zuul changes hit a problem in the gate (the upload image job timed out)	19:56
clarkb	do we want to dequeue then enqueue to speed things up? or just wait for unittest to finish and reapprove?	19:56
fungi	i'm in no rush	19:58
corvus	i'll rejigger it	19:59
corvus	i ran zuul promote --tenant zuul --pipeline gate --changes 791514,2	20:01
clarkb	things look queued the way we want them	20:01
clarkb	corvus: and you had to docker exec that?	20:02
corvus	ya	20:02
fungi	cool	20:03
clarkb	If you have anything to add to the meeting agenda do it soon. I think all my edits are in now. Just need to mail it out	20:07
corvus	i'm out to run an errand; biab.	20:34
*** sboyron has quit IRC		21:01
*** gothicserpent has quit IRC		21:05
*** whoami-rajat has quit IRC		21:11
*** slaweq has quit IRC		21:11
fungi	791514 has merged and its zuul-promote-image build succeeded. is that all we were waiting for to be able to pull?	21:18
clarkb	fungi: I think we want the end of that stack to merge	21:19
clarkb	there are 3 changes total	21:19
fungi	or i guess we wanted the other two in as well, yeah	21:19
fungi	looks like they're merging now	21:19
corvus	and merged; let's check the promote job	21:20
fungi	we want to see 791775 succeed its zuul-promote-image build i think	21:20
clarkb	ya	21:20
clarkb	the promote job just succeeded	21:22
clarkb	corvus: do you also want to double check the info on docker hub? You did that last time but I think that was because the job failed?	21:23
corvus	the job succeeded, so i think we're good	21:24
clarkb	fungi: you ready?	21:25
fungi	okay, yep, moving forward	21:25
clarkb	I guess let me know what I can do to help. I am around	21:25
fungi	i have root screen sessions on both bridge and zuul02 if anyone wants to follow along	21:26
fungi	starting with disabling ansible	21:26
clarkb	I've attached to both of them	21:26
fungi	and pulling images	21:27
fungi	looks like it worked	21:28
clarkb	The about an hour ago image update looks right to me	21:29
fungi	though `docker image ls` on zuul02 shows the most recent image is from "About an hour ago"	21:29
fungi	but yeah, i guess that's when the gate job to build the image completed	21:29
clarkb	yup because the image timestamp is when it was built which happened in the gate job	21:29
fungi	scheduler image id is b6c06442196d	21:29
fungi	i'll dump queues and send the status notice next	21:30
clarkb	++	21:30
corvus	yeah i believe that timestamp interpretation is correct	21:30
fungi	#status notice The Zuul service at zuul.opendev.org will be offline for a few minutes (starting now) in order for us to make some needed filesystem changes; if the outage lasts longer than anticipated we'll issue further notices	21:31
openstackstatus	fungi: sending notice	21:31
-openstackstatus- NOTICE: The Zuul service at zuul.opendev.org will be offline for a few minutes (starting now) in order for us to make some needed filesystem changes; if the outage lasts longer than anticipated we'll issue further notices		21:31
fungi	stopping services now	21:31
fungi	says it completed	21:32
fungi	working on the fs changes to zuul02 next	21:32
clarkb	I double checked 02 and it indeed has no containers running	21:32
openstackstatus	fungi: finished sending notice	21:34
fungi	interestingly the debug logs never updated after i called logrotate, but the non-debug logs have	21:34
fungi	anyone want to double-check me on that before i umount the original fs?	21:35
clarkb	I'm not sure I understand what you mean by that	21:35
clarkb	-rw-r--r-- 1 zuuld zuuld 3225855785 May 17 14:18 debug.log.1 exists	21:36
clarkb	which is from when you rotated earlier today	21:36
fungi	last modified timestamp on /var/log/zuul/debug.log and /var/log/zuul.tmp/debug.log are 14:23	21:36
clarkb	-rw-r--r-- 1 zuuld zuuld 1625733762 May 17 21:31 debug.log is what I see	21:36
fungi	similar for web-debug.log	21:36
fungi	okay that's super weird	21:37
clarkb	I think you are looking at the log.1 files?	21:37
clarkb	those are from earlier today when you rotated by hand	21:37
fungi	-rw-r--r-- 1 zuuld zuuld 28966363 May 17 14:23 debug.log	21:38
clarkb	but the current log files all seem to have current timestamps for me	21:38
fungi	if i ls -l the directory that's what it shows	21:38
clarkb	that isn't getting truncated?	21:38
fungi	nope line after it is this	21:38
fungi	-rw-r--r-- 1 zuuld zuuld 3225855785 May 17 14:18 debug.log.1	21:38
fungi	if i ls -l the file directly it shows a different timestamp	21:38
clarkb	the file size isn't what I see either	21:38
fungi	it's like the output is cached/stale or something	21:39
clarkb	I am not able to reproduce that	21:39
fungi	nevermind	21:40
corvus	i see clarkb's	21:40
fungi	i was scrolling back my tmux window which had an old ls -l in it :/	21:40
fungi	not scrolling back the screen buffer	21:40
fungi	okay, moving ahead!	21:40
clarkb	its me	21:40
clarkb	I'm out of the dir now	21:41
fungi	aha, thanks	21:41
fungi	it complains the partition is not aligned, do we care?	21:42
clarkb	I think make_swap.sh does log_2 math to avoid that (however 8192 should be log_2 aligned)	21:42
clarkb	I didn't get similar when doing 4096 on zks	21:43
clarkb	oh it's sector alignments?	21:43
fungi	looks that way	21:43
fungi	the "s" suffix	21:43
clarkb	you could tell it to do 2048 sectors for the first partition	21:44
fungi	i don't know where/how it's inferring those sector numbers	21:44
fungi	like that?	21:45
clarkb	presumably?	21:46
fungi	nope, still not aligned, plus new errors	21:46
fungi	i don't know where it's getting the 1953 sector start	21:47
clarkb	heh its still the same error for swap. Is the issue the 1 in 1 8192 ?	21:47
fungi	i can try 0	21:48
clarkb	fungi: I think we want to start at sector 2048	21:48
clarkb	not at 0	21:48
fungi	ahh	21:48
fungi	so shift the values by +2048 like that?	21:49
fungi	or start at 2049 instead of 1?	21:49
clarkb	no because that is still bytes	21:49
clarkb	https://askubuntu.com/questions/201164/proper-alignment-of-partitions-on-an-advanced-format-hdd-using-parted says any multiple of 8 is probably fine so maybe we're ok with the original if we shift it by 8MB ?	21:50
fungi	so like that? i'm honestly not quite sure what you're suggesting, nor why we didn't see the same on other servers	21:50
clarkb	fungi: well I'm not sure what the command needs to be. But we want to express start at sector 2048 and end 8GB later. Then start the next partition from that point forward	21:51
clarkb	it seems that it zero indexes so you don't need to do the +1	21:51
fungi	there we go	21:51
fungi	parted /dev/xvde --script -- mklabel msdos mkpart primary linux-swap 8 8200 mkpart primary ext2 8200 -1	21:51
fungi	that did not error on me	21:51
clarkb	yup that lgtm (fungi did the shift by 8 MB thing)	21:51
fungi	(8192+8=8200)	21:51
clarkb	I think that is good and we can proceed	21:52
fungi	lsblk says xvde1 202:65 0 7.6G 0 part	21:53
fungi	close enough i guess	21:53
clarkb	ya	21:54
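The two numbers that puzzled the channel, the 1953 sector start and the 7.6G reported by lsblk, both fall out of simple unit arithmetic. The sketch below assumes that parted reads unsuffixed numbers as decimal megabytes (10**6 bytes) and that the device uses 512-byte logical sectors; neither was confirmed in the log. It does not attempt to model parted's alignment check, so it says nothing about why a start of 8 was accepted.

```python
# Unit arithmetic behind the parted output discussed above.
# Assumptions: unsuffixed parted values are decimal megabytes (10**6 bytes)
# and the device has 512-byte logical sectors.

MB = 10**6
GIB = 2**30
SECTOR_BYTES = 512


def mb_to_sector(mb: int) -> int:
    """Sector containing the given decimal-megabyte offset."""
    return (mb * MB) // SECTOR_BYTES


def mb_to_gib(mb: int) -> float:
    """Size of a decimal-megabyte span expressed in binary gibibytes."""
    return mb * MB / GIB


if __name__ == "__main__":
    print(mb_to_sector(1))                  # start offset of "1"
    print(round(mb_to_gib(8200 - 8), 2))    # swap partition from 8 to 8200
```

Under those assumptions an offset of 1 lands in sector 1953, and the 8192 MB swap partition is about 7.63 GiB, which lsblk shows as 7.6G.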
fungi	logfiles are moving back to the new partition now	21:55
fungi	looks like it finished quickly	21:56
clarkb	we are ready to run the start playbook now?	21:57
fungi	contents of the new /var/log/zuul look correct to me	21:57
clarkb	corvus: ^ fyi	21:57
fungi	yeah, switching over to the bridge screen to run that if everyone's ready	21:58
clarkb	I'm ready	21:58
fungi	in theory the excitement won't begin until it tries to run some jobs anyway	21:58
fungi	starting it now	21:58
fungi	and that's completed	21:59
corvus	logs lgtm	21:59
corvus	scheduler starting	22:00
clarkb	corvus: does the saving of keys double check that the file isn't already there or is it unconditional?	22:00
clarkb	(since they should already be there?)	22:00
corvus	unconditional	22:00
fungi	once it's clear of the cat jobs, i'll start reenqueuing	22:00
corvus	and it's not the slow part; reading them from zk is	22:01
corvus	so i don't think adding a condition would speed that up	22:01
corvus	(though, there might be a bit of extra computation happening to write them out)	22:01
corvus	anyway, we're going to drop the filesystem stuff soon anyway, so i don't think that part is worth digging into	22:01
clarkb	ok	22:02
fungi	yeah, i figured that was transitional	22:02
corvus	i think as soon as we write an export utility, we can drop it	22:02
fungi	looks like we're through the cat jobs now?	22:05
clarkb	fungi: yup but it isn't done parsing yet I don't think	22:05
fungi	ahh, no not yet	22:05
fungi	i still see a few cats flashing by	22:06
corvus	more tenants	22:06
clarkb	I think it is up now. The tenant list loads as does openstack status	22:06
corvus	yep	22:06
fungi	okay, starting to reenqueue	22:07
openstackgerrit	Clark Boylan proposed opendev/system-config master: Better swap alignment https://review.opendev.org/c/opendev/system-config/+/791832	22:07
fungi	some builds are already running	22:08
clarkb	they have console logs too	22:08
fungi	yup	22:08
clarkb	previously when we had trouble with the yaml it failed before it got that far	22:08
fungi	so i think the revert of the revert is good now	22:09
corvus	\o/ that's definitely more than last time :)	22:09
fungi	that was ~half an hour downtime, not terrible	22:09
*** iurygregory has quit IRC		22:10
clarkb	corvus: fungi: any objections to me deleting zuul01 and its dns records once we're happy with zuul02s restart?	22:10
corvus	clarkb: no objection	22:11
fungi	clarkb: no objection	22:11
fungi	i checked my homedir on it earlier	22:11
clarkb	cool I'll do that as soon as fungi gives the all clear on 02	22:11
fungi	not that i have a habit of keeping things of any value on random servers	22:11
fungi	reenqueue finished, doing cleanup now	22:12
clarkb	I'm out of both root screens now too fwiw (I think you can close those up when you are happy with them fungi)	22:12
fungi	all finished	22:13
fungi	#status log Updated swap and log filesystem sizes on zuul02, and restarted all Zuul services on cdc99a3	22:14
openstackstatus	fungi: finished logging	22:14
fungi	some builds have already succeeded	22:16
fungi	i think it's good	22:17
clarkb	infra-root I will delete zuul01.openstack.org with id ef3deb18-e494-46eb-97a2-90fb8198b5d3 that look correct to you?	22:17
clarkb	fungi: the failures I see appear to be actual failures which is another good sign	22:18
*** iurygregory has joined #opendev		22:19
fungi	clarkb: that uuid looks like what openstack server show gives me for zuul01	22:19
clarkb	cool I'm going to issue the delete command now, thank you for double checking	22:19
clarkb	the deletion is done. Doing dns cleanup then will status log it	22:22
clarkb	also I didn't delete the zk01-03 dns records so will do that next	22:23
clarkb	#status log Deleted zuul01.openstack.org (ef3deb18-e494-46eb-97a2-90fb8198b5d3) and its DNS records as zuul02.opendev.org has replaced it.	22:25
openstackstatus	clarkb: finished logging	22:25
ianw	clarkb: urgh, sorry about the missing *1024, what a mess	22:26
clarkb	ianw: I reviewed the change too :) no worries	22:27
clarkb	ianw: I think we are all done now as far as cleanup goes, but I want to see if I can figure out having ansible check all the hosts before I declare victory	22:27
clarkb	zk01-03.openstack.org A and AAAA records are now cleaned up too	22:27
ianw	fungi: yeah, on the permissions issue with haproxy, we do paper over a lot running as root	22:29
ianw	i believe the way to do it most securely is with the user namespace stuff; so in the haproxy model	22:30
ianw	https://review.opendev.org/c/opendev/system-config/+/791633/8/playbooks/roles/haproxy/tasks/main.yaml	22:30
ianw	it creates a haproxy user as UID 99, so making /var/haproxy/* owned by 100099 on disk running with the "zuul" subuid makes things work	22:30
ianw	although, the exact location the zuul subuid is created is still a bit of a mystery to me. i'm assuming we do it in ansible, somewhere?	22:31
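The UID arithmetic in ianw's description can be written out as a small sketch. With user-namespace remapping, a UID inside the container corresponds to the start of the subordinate UID range plus that UID on the host. The base of 100000 is an assumption inferred from the 100099 and 99 values quoted above, not read from any configuration, and the function name is hypothetical.

```python
# Sketch of the container-to-host UID mapping under user namespaces.
# SUBUID_BASE is assumed: it is inferred from 100099 - 99 in the discussion.

SUBUID_BASE = 100000


def host_uid(container_uid: int, subuid_base: int = SUBUID_BASE) -> int:
    """Host-side UID that owns files written by container_uid."""
    return subuid_base + container_uid


if __name__ == "__main__":
    # haproxy runs as UID 99 inside the container
    print(host_uid(99))
```

This is why files under /var/haproxy need to be owned by 100099 on disk for the in-container haproxy user to write to them.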
clarkb	infra-root I'm going to run bridge:~clarkb/playbooks/swap-inspector.yaml and see what that tells me to double check things	22:35
clarkb	it relies on ansible fact gathering and debug module to emit info when a host has less than 128MB of swap total	22:36
clarkb	fun that actually still shows zk04 as having too little swap because we cache facts	22:37
clarkb	I guess I run it once, then double check my list is what we already fixed, then rm the cached facts for those hosts and rerun	22:37
clarkb	there are actually a few servers that have no swap	22:40
clarkb	that was not caused by the issue we have had with make_swap.sh so I'll ignore those for now (but maybe we want to swapfile them)	22:40
clarkb	fungi: corvus check out the periodic jobs in openstack tenant. They are all listed as error	22:42
clarkb	except for one	22:42
clarkb	I'm deleting cache entries from /var/cache/ansible/facts for the hosts we just fixed swap on	22:43
clarkb	swap lgtm for the hosts that had the tiny swap problem based on that playbook	22:45
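The check clarkb describes can be approximated as follows. This is a Python sketch, not the swap-inspector.yaml playbook, whose contents are not shown in the log. It assumes the gathered facts are available as a mapping from host name to a dict containing the standard `ansible_swaptotal_mb` fact; the host names in the example are made up, and separating hosts with no swap from hosts with tiny swap is an addition based on the remarks above.

```python
# Sketch: classify hosts by total swap using cached Ansible facts.
# Assumptions: facts_by_host maps host name -> facts dict containing
# "ansible_swaptotal_mb"; the 128 MB threshold comes from the discussion.

SWAP_THRESHOLD_MB = 128


def classify_swap(facts_by_host: dict) -> dict:
    """Group hosts into no swap, tiny swap (< threshold), and ok."""
    result = {"none": [], "tiny": [], "ok": []}
    for host, facts in sorted(facts_by_host.items()):
        swap_mb = facts.get("ansible_swaptotal_mb", 0)
        if swap_mb == 0:
            result["none"].append(host)
        elif swap_mb < SWAP_THRESHOLD_MB:
            result["tiny"].append(host)
        else:
            result["ok"].append(host)
    return result


if __name__ == "__main__":
    example = {  # hypothetical hosts and values
        "host-a": {"ansible_swaptotal_mb": 8},
        "host-b": {"ansible_swaptotal_mb": 8191},
        "host-c": {"ansible_swaptotal_mb": 0},
    }
    print(classify_swap(example))
```

As noted in the channel, results like these are only as fresh as the fact cache, so stale entries need to be removed before rerunning the check.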
corvus	clarkb: ack	22:46
clarkb	we should have the hourly opendev deploy jobs starting in about 11 minutes and we can cross check against that I guess	22:49
clarkb	but there isn't a whole lot of pointers to why there were errors in the dashboard that I see	22:50
clarkb	the hourly opendev deploy jobs seem to be running just fine (which gives more weight to corvus' explanation in #zuul)	23:01
*** tosky has quit IRC		23:43
*** hamalq has quit IRC		23:56
*** hamalq has joined #opendev		23:57
