clarkb | I suspect it is the first name in the name list that determines the output file. I'm not sure https://review.opendev.org/c/opendev/system-config/+/791060 will fix it? | 00:13 |
---|---|---|
clarkb | maybe instead we should just swap the order of the names in the zuul02 list and have zuul.opendev.org come first? | 00:13 |
ianw | ohh, you know what, i think you're right | 00:21 |
fungi | but also interpolating the filename in the vhost configs makes sense | 00:23 |
*** openstackgerrit has joined #opendev | 00:26 | |
openstackgerrit | Ian Wienand proposed opendev/zone-opendev.org master: Add acme challenge for zuul01 https://review.opendev.org/c/opendev/zone-opendev.org/+/791069 | 00:26 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: zuul-web : use hostname for LE cert https://review.opendev.org/c/opendev/system-config/+/791060 | 00:26 |
ianw | it might be worth doing ^ anyway just for consistency of having the cert cover the hostname as well as the CNAMEs | 00:27 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: bootloader: remove extlinux/syslinux path https://review.opendev.org/c/openstack/diskimage-builder/+/541129 | 00:34 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Futher bootloader cleanups https://review.opendev.org/c/openstack/diskimage-builder/+/790878 | 00:34 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Add fedora-containerfile element https://review.opendev.org/c/openstack/diskimage-builder/+/790365 | 00:44 |
*** whoami-rajat has quit IRC | 01:13 | |
*** brinzhang has joined #opendev | 02:03 | |
openstackgerrit | Steve Baker proposed openstack/diskimage-builder master: Add element block-device-efi-lvm https://review.opendev.org/c/openstack/diskimage-builder/+/790192 | 02:46 |
openstackgerrit | Steve Baker proposed openstack/diskimage-builder master: WIP Add a growvols utility for growing LVM volumes https://review.opendev.org/c/openstack/diskimage-builder/+/791083 | 02:46 |
*** hemanth_n has joined #opendev | 02:52 | |
*** brinzhang_ has joined #opendev | 03:21 | |
*** brinzhang has quit IRC | 03:24 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] test devstack https://review.opendev.org/c/openstack/diskimage-builder/+/791091 | 04:07 |
*** ralonsoh has joined #opendev | 04:31 | |
*** ykarel has joined #opendev | 04:38 | |
*** brinzhang0 has joined #opendev | 04:50 | |
*** brinzhang_ has quit IRC | 04:54 | |
*** marios has joined #opendev | 04:58 | |
*** hemanth_n has quit IRC | 05:00 | |
*** hemanth_n has joined #opendev | 05:00 | |
*** vishalmanchanda has joined #opendev | 05:06 | |
*** darshna has joined #opendev | 05:08 | |
jrosser | i think /etc/ci/mirror_info.sh might be broken for bullseye due to missing VERSION_ID | 05:22 |
*** slaweq has joined #opendev | 06:26 | |
*** ykarel has quit IRC | 06:46 | |
*** jpena|off is now known as jpena | 06:48 | |
*** zbr has quit IRC | 06:49 | |
*** zbr has joined #opendev | 06:51 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-devstack: allow for minimal configuration of pull location https://review.opendev.org/c/zuul/zuul-jobs/+/791116 | 06:57 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [dnm] testing devstack 791085 https://review.opendev.org/c/zuul/zuul-jobs/+/791117 | 07:00 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-devstack: allow for minimal configuration of pull location https://review.opendev.org/c/zuul/zuul-jobs/+/791116 | 07:03 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [dnm] testing devstack 791085 https://review.opendev.org/c/zuul/zuul-jobs/+/791117 | 07:03 |
*** lucasagomes has joined #opendev | 07:04 | |
*** amoralej|off is now known as amoralej | 07:06 | |
*** andrewbonney has joined #opendev | 07:25 | |
*** tosky has joined #opendev | 07:47 | |
*** jaicaa has quit IRC | 08:36 | |
*** jpena is now known as jpena|lunch | 11:32 | |
*** jhesketh has quit IRC | 11:43 | |
fungi | jrosser: if it's the same problem as before, it's because base-files 11 is missing version information which base-files 11.1 will provide once it migrates to bullseye | 11:56 |
fungi | so ansible can't find a version and substitutes "n/a" | 11:56 |
jrosser | hrrm, is there a workaround anyone has been using for this? | 11:58 |
fungi | not sure, to be honest. i mean, bullseye isn't released yet so it makes some sense that ansible doesn't recognize it | 12:15 |
fungi | i expect the ansible community considers it to be working as designed | 12:16 |
jrosser | i've tried to patch stuff to insert VERSION_ID=11 into /etc/os-release | 12:17 |
fungi | you could try diffing /etc/os-release between base-files 11 and 11.1 and see if it's one of the other missing values which is needed | 12:18 |
jrosser | https://zuul.opendev.org/t/openstack/build/fa59a627370949be96e3982d31683837/log/job-output.txt#2844 | 12:19 |
*** amoralej is now known as amoralej|lunch | 12:21 | |
*** jhesketh has joined #opendev | 12:21 | |
fungi | jrosser: the source for that is here: https://opendev.org/opendev/base-jobs/src/branch/master/roles/mirror-info/templates/mirror_info.sh.j2#L26-L34 | 12:24 |
fungi | is there a fallback value we could grab when VERSION_ID is unset, do you think? | 12:25 |
fungi | i'm setting up a bullseye machine now to see if i can get any ideas | 12:25 |
fungi | all mine are either sid (which has had base-files 11.1 for months) or buster | 12:25 |
*** jpena|lunch is now known as jpena | 12:33 | |
jrosser | fungi: there is potential fallback information in the node info | localhost | Distro: Debian 11.0 | 12:33 |
jrosser | but i guess there are two slightly different things: making mirror_info.sh robust, and then somewhat separately the ansible n/a version | 12:34 |
fungi | jrosser: yeah, so this is the diff between base-files 11 and 11.1 os-release files: http://paste.openstack.org/show/805351/ | 12:39 |
fungi | and this is the diff of /etc/debian_version as a possibility: http://paste.openstack.org/show/805353/ | 12:41 |
jrosser | oh, hmm https://zuul.opendev.org/t/openstack/build/5132e46c48f64f4ba324b70f94d86eab/log/zuul-info/host-info.debian-bullseye.yaml#126-132 | 12:41 |
jrosser | so it might not be unreasonable for VERSION_ID to fall back to ansible_distribution_major_version | 12:43 |
fungi | it does mix contexts a bit, but yeah we could essentially use ansible jinja interpolation to "hard code" a fallback value into the script | 12:44 |
jrosser | given that the script is a template that should be doable | 12:44 |
fungi | we could even switch to setting those variables with ansible and just using the values from /etc/os-release as fallbacks, though that's more likely to introduce regressions | 12:46 |
jrosser | indeed - that's why i thought it was a good question for here as i'm sure there's good reason for how it is now | 12:47 |
fungi | well, lots of this grew out of shell scripts we ran with jenkins in the long-long ago, in the beforetime | 12:48 |
jrosser | really the motivation here is to get bullseye working ASAP even though it's unreleased, as that lets us drop a decent %age of CI jobs maybe a cycle earlier | 12:48 |
jrosser | what with bionic/focal centos8/stream buster/bullseye the support matrix is really full right now | 12:48 |
fungi | yep, i totally get that. also worth noting the only thing we ultimately use VERSION_ID in is the wheel mirror url, so that could be reworked as well maybe | 12:51 |
fungi | i feel like templating in a fallback string for the VERSION_ID assignment when it's unassigned or maybe even just preassigning before sourcing /etc/os-release would be the safest solution | 12:54 |
jrosser | just to make it super obvious there could be an ANSIBLE_DISTRIBUTION_MAJOR_VERSION={{ ansible_distribution_major_version }} so it's really clear when someone looks at the generated script | 12:56 |
fungi | well, code comments in the script work too | 12:57 |
jrosser | of course :) | 12:57 |
fungi | i'll push up a prototype and we can hash it out in review | 12:57 |
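(For context, a minimal sketch of the "preassign before sourcing" option discussed above; this is not the actual 791176 change, and it assumes the mirror_info.sh.j2 template can simply seed the variable from the Ansible fact:)

```
# Seed VERSION_ID from the Ansible fact so a bullseye /etc/os-release that
# omits it still leaves the script with a usable value; sourcing os-release
# afterwards lets a real VERSION_ID win whenever the release defines one.
VERSION_ID={{ ansible_distribution_major_version }}
source /etc/os-release
```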
openstackgerrit | Jeremy Stanley proposed opendev/base-jobs master: Test VERSION_INFO default for mirror-info role https://review.opendev.org/c/opendev/base-jobs/+/791176 | 13:02 |
openstackgerrit | Jeremy Stanley proposed opendev/base-jobs master: Revert "Test VERSION_INFO default for mirror-info role" https://review.opendev.org/c/opendev/base-jobs/+/791177 | 13:02 |
fungi | jrosser: for changes to base-jobs content, because it's a trusted repo where we don't get to take advantage of speculative job config changes, we change the base-test job first and then we can try do-not-merge changes in untrusted repos which set base-test as the parent for some obvious jobs | 13:04 |
fungi | if dnm changes parenting jobs to base-test work after 791176 merges, then we would merge the revert and propose a similar change to the normal mirror-info role used by the base job | 13:05 |
jrosser | ah ok | 13:06 |
*** amoralej|lunch is now known as amoralej | 13:11 | |
*** DSpider has joined #opendev | 13:12 | |
*** hemanth_n has quit IRC | 13:23 | |
mnasiadka | I started to notice ntp not being started on debian nodepool instances; timedatectl says "NTP service: inactive" - not all the time, but every 2nd-3rd CI run in kolla-ansible - any idea what might be wrong? | 13:36 |
fungi | do you collect the system journal on those builds? or maybe syslog? it will probably have some indication | 13:41 |
mnasiadka | ntpd claims it's running, but not synchronized - so maybe it's just a timedatectl flaw that it has problems finding ntpd running | 13:42 |
fungi | could it be that ntpd simply hasn't settled yet by the time you're checking it? | 13:43 |
mnasiadka | well, I'm fine with unsynchronized, I'm not really fine with timedatectl saying NTP service: inactive - but maybe timedatectl has some problems checking ntpd (it's rather tied to systemd-timesyncd) | 13:45 |
fungi | yeah, seems like a very systemd-centric tool | 13:47 |
fungi | which i suppose is fine if you're running systemdos | 13:47 |
fungi | and also don't care that much about precision and discipline in your time sources | 13:52 |
fungi | mnasiadka: do you have an example build with that i can look at? | 13:53 |
fungi | i would expect it to say something like "NTP synchronized: no" | 13:54 |
mnasiadka | fungi: I think it's really that Kolla-Ansible prechecks rely on timedatectl (which ignores ntpd - it only checks for systemd-timesyncd) - but I don't think we've seen them fail in the past on Debian. recent build: https://935f2aace51477baa019-09dce2ec9ab39d19fdc97cba82216d08.ssl.cf2.rackcdn.com/787701/6/check/kolla-ansible-debian-source/b491ad2/primary/logs/ansible/deploy-prechecks | 13:55 |
fungi | the syslog it collected from that node claimed ntpd started at 12:35:16 and was instructed to accept large time jumps to get the clock in sync | 13:57 |
fungi | May 13 12:35:17 debian-buster-rax-ord-0024662430 ntpd[649]: error resolving pool 0.debian.pool.ntp.org: Temporary failure in name resolution (-3) | 13:58 |
fungi | so it was having trouble with dns resolution there | 13:58 |
fungi | because ntpd started before unbound | 13:58 |
mnasiadka | oops | 13:59 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add zuul02 to inventory https://review.opendev.org/c/opendev/system-config/+/790481 | 13:59 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Clean up zuul01 from inventory https://review.opendev.org/c/opendev/system-config/+/790484 | 13:59 |
clarkb | fungi: ianw I left a review on https://review.opendev.org/c/opendev/system-config/+/791060 which is what triggered my updates above | 13:59 |
clarkb | I'm thinking we do the swap then can improve the cert configs after? I think that simplifies stuff as its one less system to worry about LE on | 14:00 |
fungi | mnasiadka: well, it's built to handle that, it starts polling timeservers just after unbound starts, according to syslog | 14:00 |
fungi | though as of 12:40:35 it still complains "kernel reports TIME_ERROR: 0x41: Clock Unsynchronized" | 14:00 |
clarkb | fungi: ianw if you agree I think we can probably proceed to try and do https://etherpad.opendev.org/p/opendev-zuul-server-swap today. I'm feeling much better | 14:00 |
fungi | clarkb: yeah, that sounds fairly straightforward | 14:00 |
fungi | though systemd reported "Reached target System Time Synchronized. | 14:02 |
fungi | at 12:35:12 | 14:02 |
fungi | which was before ntpd started? | 14:02 |
fungi | the "Clock Unsynchronized" errors are apparently endemic of a system clock which is too erratic for ntpd to properly discipline | 14:04 |
fungi | so maybe that's what's going on | 14:05 |
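(A few shell checks one could run on such a node to query ntpd directly rather than through timedatectl, which mostly knows about systemd-timesyncd; the journal unit name is an assumption for Debian's ntp package:)

```
ntpq -pn                        # peer list; a leading '*' means a peer is selected and we are synced
ntpq -c rv                      # system clock variables, including sync status and offset
journalctl -u ntp | tail -n 50  # startup ordering and resolution errors like the one quoted above
```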
mnasiadka | might be, will add some debug and check | 14:15 |
clarkb | fungi: I also want to test my playbook at https://review.opendev.org/c/opendev/system-config/+/790487 before we do the swap over so will look at that after bootstrapping my morning. If that looks good and zuul02 looks good I think we can proceed with the swap whenever others are ready | 14:21 |
*** sshnaidm is now known as sshnaidm|afk | 15:00 | |
*** lucasagomes has quit IRC | 15:03 | |
fungi | clarkb: any opinion on the approach in 791176 | 15:33 |
fungi | ? | 15:33 |
clarkb | fungi: basically use the ansible value first and then let os-release override. If os-release doesn't override then at least we have something? that should work | 15:36 |
fungi | yeah, i mean, i expect it to work. mainly that there are several ways we could go about solving it and i was aiming for the one with the least chance to cause regressions in behavior | 15:37 |
clarkb | fungi: I think my only concern would be that the ansible fact and the os-release value may have different value types? like one could be a number and the other a string in some situations? But for this weird pre-release debian situation it should be fine? | 15:40 |
fungi | i think it's always going to be a string in the end because it's a text file template to a shell script | 15:41 |
fungi | so even numbers are strings | 15:41 |
*** marios is now known as marios|out | 15:43 | |
clarkb | right I meant more at a high level, like for ubuntu can one be 20.04 and the other Focal Fossa | 15:45 |
clarkb | 11 vs bullseye etc | 15:45 |
clarkb | I'm not worried about that to the point where we can't make the change though | 15:47 |
*** marios|out has quit IRC | 15:47 | |
clarkb | os-release is the winner and that will preserve existing behavior for us which should be sufficient | 15:47 |
openstackgerrit | Merged opendev/system-config master: Add zuul02 to inventory https://review.opendev.org/c/opendev/system-config/+/790481 | 15:52 |
clarkb | I'm ssh'd into zuul02 ^ and running a tail on syslog watching for ansible | 15:53 |
clarkb | it will probably be a few minutes before it gets there though. I'll try to keep an eye on it | 15:53 |
*** jpena is now known as jpena|off | 16:01 | |
clarkb | I think there is an ssh host key problem with new zuul02 and bridge | 16:16 |
clarkb | I ran the script to scan the dns ssh key records and that usually populates things properly | 16:18 |
clarkb | not sure what is going on yet | 16:18 |
clarkb | and now I'm grumpy that ssh reports errors using a sha256 hash of the key and keyscan gives you the base64 encoding of the key | 16:21 |
clarkb | you'd think having an option to make those line up would be done by now | 16:22 |
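(They can be lined up manually; a hedged example, noting that reading a key from stdin needs a reasonably recent OpenSSH:)

```
# Turn ssh-keyscan output into SHA256 fingerprints comparable to ssh's error message
ssh-keyscan -t rsa zuul02.opendev.org 2>/dev/null | ssh-keygen -lf -
# Show the fingerprint of whatever known_hosts already holds for that host
ssh-keygen -F zuul02.opendev.org -f ~/.ssh/known_hosts -l
```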
clarkb | base has already failed and LE should fail next | 16:23 |
clarkb | fungi: ^ do you see what may have happened there? | 16:23 |
clarkb | ssh keys on the host were generated may 10 which is when I booted it | 16:25 |
clarkb | I found my sudo sshfp.py command on bridge and that looks correct | 16:25 |
clarkb | ok I think I see what happened, there must've been IP reuse | 16:28 |
clarkb | we have an earlier entry in known_hosts with a different key, but the file was last updated around when I booted the new instance and the last entry in the file matches what I see when running keyscan on localhost | 16:28 |
clarkb | I'll just remove the older entry and that should make things work | 16:28 |
clarkb | hrm it is still asking me for a key | 16:29 |
clarkb | ok the current entry seems to be for the ipv6 address but ansible uses ipv4 | 16:31 |
clarkb | now they are both in there with what appears to be the correct key based on on-host ssh keyscanning | 16:32 |
clarkb | Looks like we bailed out of the runs for that change merge | 16:32 |
clarkb | fungi: ^ should I run base, LE, and then zuul by hand? | 16:32 |
fungi | clarkb: yeah, sorry didn't get a chance to look yet but i agree we fail to clean up stale records for hostkeys of deleted servers in the known_hosts file so i can see where it would cause problems. in the future we might consider generating known_hosts instead | 16:34 |
fungi | and yes running those playbooks seems safe | 16:35 |
clarkb | ok I'll start base now | 16:35 |
clarkb | er let me wait for the puppet else run to finish to avoid any conflicts | 16:35 |
clarkb | base is running now | 16:39 |
*** amoralej is now known as amoralej|off | 16:40 | |
clarkb | I forgot to do -f 50 | 16:41 |
clarkb | this might be a while. I'll touch the ansible stoppage file on bridge if it gets closer to 1700 UTC to avoid conflict with the hourly runs | 16:42 |
*** timburke_ has joined #opendev | 16:45 | |
clarkb | #status log Ran disable-ansible on bridge to avoid conflicts with reruns of playbooks to configure zuul02 | 16:46 |
openstackstatus | clarkb: finished logging | 16:46 |
*** timburke has quit IRC | 16:48 | |
clarkb | if anyone is wondering you really do not want to forget the -f 50 | 16:58 |
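(Roughly the shape of those manual runs on bridge; the checkout location and playbook paths are assumptions, the point is the -f 50 forks flag so hosts are handled in parallel instead of Ansible's default five at a time:)

```
cd /home/zuul/src/opendev.org/opendev/system-config
sudo ansible-playbook -f 50 playbooks/base.yaml
sudo ansible-playbook -f 50 playbooks/letsencrypt.yaml
sudo ansible-playbook -f 50 playbooks/service-zuul.yaml
```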
fungi | cereal execution | 17:04 |
fungi | as in go eat a bowl of some and check back later | 17:05 |
fungi | or watch the zuul episode on openshift.tv | 17:05 |
clarkb | looks like I also need to run service-borg-backup.yaml in my list of playbooks | 17:10 |
clarkb | base is done. There were a few issues on other hosts like the rc -13 and apt-get autoremove -y on a few hosts being unhappy. I don't think those affect zuul so will proceed | 17:18 |
clarkb | letsencrypt playbook is running now | 17:19 |
fungi | these would also have been rerun in the daily job, right? | 17:24 |
clarkb | yes, but that doesn't happen for another 12 hours or something | 17:24 |
clarkb | and I want to maybe get a zuul swap in today | 17:24 |
clarkb | I'm hoping I can get zuul02 all configured, we can double check it, eat lunch, then come back and run through the plan on the etherpad | 17:25 |
clarkb | depending on how this goes maybe tomorrow we can land mailman updates too | 17:26 |
clarkb | we'll see :) | 17:26 |
fungi | yeah, i wasn't suggesting we wait, just checking that it would have been run within a day under normal circumstances | 17:27 |
clarkb | yup they would be | 17:27 |
*** andrewbonney has quit IRC | 17:28 | |
clarkb | borg backup is now done. Next is running the zuul playbook | 17:28 |
clarkb | actually I may need to run zookeeper first to update the firewall rules there | 17:29 |
clarkb | double checking on that | 17:29 |
clarkb | ya zk playbook comes before zuul playbook | 17:30 |
clarkb | oh nevermind base runs the iptables role too so this is already done (but doesn't hurt to run service-zookeeper.yaml anyway) | 17:31 |
fungi | sure | 17:31 |
clarkb | I realized this after I started it and it reported a bunch of noops :) | 17:31 |
clarkb | ok that's done. I'm going to run service-zuul.yaml now. Remember we don't expect this to cause problems because we shouldn't start zuul services on the new scheduler. But keep an eye open :) | 17:33 |
clarkb | I notice that we may install apt-transport-https on newer systems that no longer need it | 17:36 |
fungi | yeah, i think it's supported directly on focal? | 17:41 |
clarkb | the start containers task was skipped on zuul02 for the scheduler (we expected and wanted this) | 17:42 |
clarkb | fungi: ya | 17:42 |
clarkb | reloading the scheduler failed (I guess I should've expected this too, going to check if there are any tasks we want that run after that) | 17:43 |
clarkb | otherwise looks good from the ansible side | 17:43 |
*** ralonsoh has quit IRC | 17:43 | |
clarkb | ah that is a handler so it should happen after everything else | 17:44 |
clarkb | I think that means we are good | 17:44 |
clarkb | zuul-web has a handler too that I don't see firing to reload apache2 so maybe I'll just do that by hand to be double sure | 17:44 |
clarkb | that is done. infra-root can you look over zuul02.opendev.org and see if it looks put together to you? | 17:45 |
clarkb | Note: we do not want zuul containers to be running there yet ( and they are not according to docker ps -a ) | 17:45 |
clarkb | I'm going to test my gearman server config update playbook against ze01 and zm01 next | 17:46 |
clarkb | thats done and looks good. I have restored the zuul.conf states on those two hosts to what they should be for now | 17:50 |
clarkb | I will remove the disable ansible file now | 17:50 |
clarkb | corvus: ^ you've probably got the best sense for what a zuul scheduler should look like. Any chance you may have time to look at zuul02.opendev.org? (Not sure when your openshift.tv thing ends) | 17:57 |
clarkb | specifically the things I'm less sure of are the zk and gearman certs/keys/ca | 17:58 |
clarkb | I'm going to take a break and start heating up some lunch. https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 is the DNS update change that we will need to manually merge during the swap, reviews on that would be great. Also looking over https://etherpad.opendev.org/p/opendev-zuul-server-swap if you haven't yet and double checking zuul02 looks happy | 18:03 |
clarkb | one difference I notice is that /opt/zuul/ doesn't exist on zuul02. We run the queue dumping script out of that so it isn't strictly necessary to have on zuul02 for this swap (and we could clone it to one of our homedirs if we need to) | 18:06 |
clarkb | and now really finding food | 18:07 |
corvus | clarkb: are you having second breakfast? | 18:14 |
corvus | oh lunch | 18:14 |
clarkb | corvus: the kids break from school at ~11am for lunch so I end up eating late breakfast/early lunch often | 18:15 |
clarkb | I'm waiting for the oven to preheat and just found a problem with zuul02: cannot ssh to review.o.o due to host key problems | 18:15 |
corvus | clarkb: yeah, i assume we just cloned /opt/zuul at some point. it's not kept updated. we could just manually do that again. | 18:16 |
corvus | /var/log/zuul looks good (sufficient space) | 18:17 |
clarkb | we do write out a known_hosts at /home/zuuld/.ssh/known_hosts with our gerrit and the opendaylight gerrit host keys in it but at least ours doesn't seem to work for some reason (I've not cross checked the values yet) | 18:17 |
clarkb | I'll keep looking at this after food if no one else beats me to it | 18:18 |
fungi | could be that openssh on the new ubuntu is expecting a different key format? | 18:18 |
fungi | though i would have expected our integration jobs to find that if so | 18:19 |
clarkb | fungi: it reported it used rsa in the error | 18:19 |
clarkb | `ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 zuul@review.opendev.org gerrit ls-projects` is the command I ran as zuuld on zuul02 if anyone else wnts to check it | 18:19 |
corvus | i see a different value reported by the server | 18:19 |
corvus | ssh-keyscan != known_hosts | 18:19 |
corvus | known_hosts on zuul01 == zuul02 | 18:20 |
clarkb | I was just going to say I wonder how zuul01 works, is it possible we bind mount some other bath? | 18:20 |
clarkb | s/bath/path/ | 18:20 |
clarkb | anyway oven needs attention. Back in a bit | 18:20 |
corvus | ssh test on 01 fails too | 18:20 |
corvus | so afaict, they are == | 18:21 |
corvus | maybe client.set_missing_host_key_policy(paramiko.WarningPolicy()) actually also means "don't fail if they differ" | 18:24 |
fungi | comparing /home/zuuld/.ssh/known_hosts right? | 18:25 |
fungi | that seems to be what we bindmount into the container | 18:26 |
*** d34dh0r53 has quit IRC | 18:32 | |
fungi | yeah, seems that's the one | 18:33 |
fungi | clarkb: oh! it's sha2-256 rather than sha1 | 18:35 |
fungi | i think that's the problem? | 18:36 |
fungi | as for why it's not breaking for the current server, corvus's explanation seems reasonable | 18:36 |
fungi | or maybe paramiko is still using sha1 | 18:36 |
clarkb | fungi: gerrit doesn't serve the sha2-256 though iirc | 18:38 |
clarkb | so that would all have to be client side for user verification | 18:38 |
clarkb | basically that shouldn't have any impact on the known_hosts file, its purely a wire thing | 18:39 |
clarkb | and gerrit shouldn't even attempt it because its sshd doesn't support it | 18:39 |
fungi | ahh, maybe i'm just thrown off by the sha2 fingerprint | 18:40 |
clarkb | corvus: does zuul set that client policy? I suspect that may be it if we somehow magically work | 18:40 |
clarkb | also I feel like we've fixed this before (it was a port 22 vs 29418 mixup or something along those lines). I wonder if the changes never merged | 18:40 |
corvus | clarkb: yeah that's a paste from zuul code | 18:41 |
fungi | it does seem like the ssh hostkey hash in the known_hosts there doesn't match what i get for either 22 or 29418 on the current server | 18:43 |
fungi | the one i get for 29418 matches this: https://opendev.org/opendev/system-config/src/branch/master/inventory/service/host_vars/review01.openstack.org.yaml#L73 | 18:45 |
fungi | could we be prepopulating with something other than gerrit_self_hostkey? | 18:45 |
clarkb | fungi: its https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/all.yaml#L1 | 18:46 |
fungi | looks like we're using https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/all.yaml#L1 | 18:46 |
clarkb | the gerrit vars aren't necessarily exposed to the zuul hosts in ansible | 18:47 |
clarkb | maybe the easiest thing to do is update all.yaml to match the gerrit value for now? | 18:47 |
fungi | yeah | 18:47 |
clarkb | ok I'm still juggling food. I can write that chagne after I eat (or feel free to push it I won't care too much :) ) | 18:48 |
fungi | i'm double-checking that gerrit_ssh_rsa_pubkey_contents isn't also used for something different | 18:48 |
*** amoralej|off is now known as amoralej | 18:51 | |
*** amoralej is now known as amoralej|off | 18:52 | |
fungi | this supposedly writes it out on the gerrit server: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerrit/tasks/main.yaml#L107-L113 | 18:52 |
*** DSpider has quit IRC | 18:55 | |
fungi | but the value in ~gerrit2/review_site/etc/ssh_host_rsa_key.pub doesn't match the gerrit_ssh_rsa_pubkey_contents value, it matches the key part from gerrit_self_hostkey | 18:56 |
fungi | same goes for review02... this is most perplexing | 18:57 |
clarkb | fungi: do we override them in private vars? | 18:57 |
fungi | oh, could be | 18:58 |
clarkb | ya looks like we do though I'm not quite in a spot to cross check values. One thing I notice is we don't set it for the zuul schedulers but do for mergers and executors | 18:59 |
fungi | yes, gerrit_ssh_rsa_pubkey_contents is overridden in 7 different places in private host_vars and group_vars :/ | 18:59 |
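(The sort of search that turns those up on bridge; the private vars location is an assumption:)

```
sudo grep -rln gerrit_ssh_rsa_pubkey_contents \
    /etc/ansible/hosts/group_vars /etc/ansible/hosts/host_vars
```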
clarkb | I think that is likely how we end up with the wrong value on the scheduler | 18:59 |
clarkb | I wonder if that all.yaml value in system-config is largely there for testing | 18:59 |
clarkb | I'm thinking maybe we update the zuul-scheduler.yaml group var to include this, rerun service-zuul.yaml, double check ssh works, then make a todo to clean this up? | 19:00 |
clarkb | fungi: if that sounds reasonable I can do the group var update for zuul-scheduler.yaml now | 19:01 |
clarkb | actually I'm beginning to wonder if the wires are super crossed here | 19:01 |
fungi | it feels like we're reusing pubkey file contents as known hosts entries, but not very well | 19:02 |
clarkb | in the zuul executor file it almost feels like this is the value for zuul's ssh pubkey but I need to read that ansible | 19:03 |
clarkb | no nevermind I think we do the right thing we just don't set this value at all on the zuul-scheduler group. However, known_hosts is written by the base zuul role which we run against the zuul group | 19:05 |
clarkb | I think the short term fix here is to set that var in group_vars/zuul.yaml then rerun service-zuul.yaml | 19:06 |
fungi | yes, agreed, it would at least be consistent with how we're doing the other zuul servers | 19:06 |
clarkb | and then push up a change to all.yaml undefining it and sprinkle it into the testing vars as necessary? | 19:06 |
clarkb | I'll do the bridge update now | 19:06 |
fungi | longer term we should probably do something about the divergence between gerrit_ssh_rsa_pubkey_contents and gerrit_self_hostkey in system-config | 19:07 |
fungi | yeah, maybe that | 19:07 |
clarkb | ok that's done. I want to check the git log really quickly, then I'll rerun service-zuul.yaml | 19:08 |
clarkb | fungi: ^ | 19:09 |
fungi | yeah | 19:10 |
fungi | yep, that looks like the correct value | 19:11 |
clarkb | ok running service-zuul.yaml now | 19:11 |
clarkb | Assuming that works are there other sanity checks people think we should run? | 19:13 |
fungi | maybe make sure you can reach the geard port from some other servers? | 19:14 |
clarkb | fungi: ya, I'm not sure how to do that with the ssl stuff but I guess we can try to sort that out. Similarly ensure that zuul02 can connect to the zk cluster | 19:15 |
fungi | oh, though zuul won't be running so, right | 19:15 |
clarkb | ya you'd need to boostrap some stuff | 19:16 |
fungi | you'd need a fake geard listening on the port | 19:16 |
clarkb | probably doable, possibly a lot of work | 19:16 |
clarkb | fungi: yes and one that does ssl | 19:16 |
fungi | if it's not working at cut-over i guess we can sort it out then | 19:16 |
clarkb | ls-projects works now. There is a blank line between the two known hosts entries now though. I don't think this is an issue but maybe it is? | 19:17 |
clarkb | I can ls-projects on the other gerrit in the known hosts so ya seems to not be a problem | 19:19 |
clarkb | I can connect to zk04 from zuul02 using nc. Not sure how to set it up to use the ssl and auth stuff | 19:21 |
clarkb | iptables and ip6tables report that port 4730 is open for gearman connections at least | 19:21 |
clarkb | without doing a ton of additional faked-out bootstrapping and learning to use the zk client with ssl by hand, I'm not sure there's more connectivity checking we can do | 19:22 |
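(About the most that can be checked without real credentials loaded is that the TLS handshake succeeds on the relevant ports; a hedged example, where the zookeeper TLS port is an assumption and full verification would also need the client cert/key/CA paths from zuul.conf passed via -cert/-key/-CAfile:)

```
# From a merger/executor: confirm the gearman TLS port on the new scheduler answers
openssl s_client -connect zuul02.opendev.org:4730 </dev/null
# From zuul02: confirm the zookeeper TLS port answers
openssl s_client -connect zk04.opendev.org:2281 </dev/null
```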
clarkb | https://etherpad.opendev.org/p/opendev-zuul-server-swap I think I'm fast approaching step 7's "when happy" assertion | 19:23 |
clarkb | s/step/line/ | 19:23 |
clarkb | I also gave the openstack release team a heads up a little while ago | 19:23 |
clarkb | zuul seems fairly idle too. Maybe let people look at 02 for another hour or so and then plan to proceed with the swap? If I can get a second set of hands to help with things like manually merging the dns update and updating rax dns that would be great. Then I can focus on the ansible-playbook stuff | 19:25 |
fungi | yeah, sounds fine. i'll be around | 19:50 |
fungi | was hoping to go out for a walk this afternoon, but our pest control people were due to come today and they still haven't shown up, so probably not getting out for a walk | 19:51 |
corvus | clarkb: when do you want to start? any chance it's ~now? | 20:04 |
*** clayg has quit IRC | 20:09 | |
*** fresta has quit IRC | 20:09 | |
*** jonher has quit IRC | 20:09 | |
*** clayg has joined #opendev | 20:09 | |
*** jonher has joined #opendev | 20:09 | |
clarkb | corvus: in about 15-20 minutes? | 20:10 |
clarkb | but I guess I can speed up and do ~now :) | 20:10 |
*** mhu has joined #opendev | 20:11 | |
clarkb | I've started a root screen on bridge | 20:11 |
corvus | clarkb: i'm ready to help whenever you are :) | 20:11 |
clarkb | thanks! | 20:11 |
corvus | (and my next task is a power-out ups maintenance, so i can't really overlap :) | 20:12 |
clarkb | corvus: can you review https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 but don't approve it yet? one of the steps in the plan is to manually merge that in gerrit after we stop zuul | 20:13 |
clarkb | (and maybe you want to get ready to be able to manually merge that when the time is appropriate?) | 20:13 |
clarkb | I'll tell the openstack release team we will proceed shortly | 20:14 |
corvus | lgtm | 20:14 |
clarkb | ok I ran the disable-ansible script to prevent the hourly jobs from getting in our way | 20:15 |
clarkb | I've notified the openstack release team. I think I'm ready to dump queues and stop zuul. Once zuul is stopped someone can submit https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 in gerrit | 20:16 |
corvus | i'll look up how to do that :) | 20:16 |
clarkb | corvus: I think you do the promotion thing via ssh with your admin creds now to your regular user. then unpromote after being done, or you can try to do it as the admin user alone | 20:17 |
clarkb | corvus: do you think I should proceed with dumping queues and stopping zuul? nothing to wait on on your end? | 20:17 |
corvus | yeah pulled up the docs at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#gerrit-admins | 20:17 |
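(The manual merge itself boils down to a Gerrit ssh command once an account is temporarily in the Administrators group; change and patchset numbers are placeholders here:)

```
ssh -p 29418 <user>@review.opendev.org gerrit review --submit <change>,<patchset>
```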
corvus | clarkb: i think you're gtg | 20:17 |
*** fressi has joined #opendev | 20:18 | |
clarkb | ok queues dumped. Stopping zuul next | 20:18 |
clarkb | zuul is stopped I think you can approve the dns update whenever you are ready | 20:19 |
clarkb | s/approve/submit/ | 20:20 |
corvus | ack | 20:20 |
openstackgerrit | Merged opendev/zone-opendev.org master: Swap zuul.opendev.org CNAME to zuul02.opendev.org https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 | 20:21 |
*** fressi has quit IRC | 20:21 | |
clarkb | once that shows up on gitea servers I'll run the nameserver playbook | 20:21 |
corvus | submitted | 20:21 |
clarkb | looks like it is there | 20:21 |
clarkb | running nameserver playbook now | 20:21 |
clarkb | my record ttl is under a minute now. Will check it resolves properly before proceeding | 20:23 |
corvus | zuul.opendev.org. 300 IN CNAME zuul02.opendev.org. | 20:23 |
clarkb | zuul.opendev.org. 300 IN CNAME zuul02.opendev.org. | 20:24 |
corvus | that's from my local resolver | 20:24 |
clarkb | yup I see the same thing. Proceeding with updating gearman config on executors and mergers | 20:24 |
clarkb | corvus: maybe you want to do a status notice? also I think the rax update for zuul.openstack.org is less urgent but that needs doing too | 20:24 |
* fungi joins the screen session late-ish | 20:24 | |
corvus | fungi: ^ rax updates maybe? | 20:25 |
fungi | yeah, i can get that now | 20:25 |
corvus | infra-root calls "not it" on ttw dns updates ;) | 20:25 |
clarkb | ok the gearman config update lgtm on zm04 so I'll proceed to the next step which is starting zuul again | 20:26 |
clarkb | corvus: you ready for ^? | 20:26 |
corvus | clarkb: yep | 20:26 |
clarkb | things are started | 20:27 |
clarkb | looks like zuul01 was properly ignored | 20:27 |
clarkb | s/started/starting/ | 20:27 |
corvus | status notice Zuul has been migrated to a new VM. It should be up and operating now with no user visible changes to the service or hostname, but you may need to reload the status page. | 20:27 |
corvus | clarkb: how's that for sending in a minute or so? | 20:27 |
clarkb | lgtm | 20:27 |
clarkb | I'm going to sort out copying the queues.sh script now while we wait for it to come up | 20:28 |
corvus | it's saving keypairs. i'll double check some shasums | 20:28 |
fungi | okay, zuul.openstack.org cname has been updated to point at zuul02.opendev.org, ttl was already 5min | 20:28 |
corvus | fungi: what about updating it to point to zuul.opendev.org? | 20:29 |
fungi | oh, could do that too | 20:29 |
clarkb | fungi: ya I checked its ttl a few days ago and it was already low | 20:29 |
corvus | it's a double cname, but at this point we're probably not worried about network efficiency for folks still using it | 20:29 |
fungi | nsd should be smart enough to return both records when asked for the zuul.opendev.org cname, at least i know bind does it that way | 20:29 |
corvus | spot checks of some keys match on both hosts | 20:29 |
corvus | good point | 20:30 |
clarkb | ok I've just realized that to restore queues I need the zuul enqueue command to be present. I suspect this may be easiest if I start a shell on the scheduler container and run it there? | 20:30 |
clarkb | I assume we don't want a global install anymore? | 20:30 |
corvus | clarkb: ++ | 20:30 |
fungi | okay, zuul.openstack.org is now updated to be a cname to zuul.opendev.org | 20:30 |
clarkb | fungi: thanks | 20:30 |
fungi | saves us a step on future server changes | 20:30 |
corvus | clarkb, fungi: did we not make a "zuul" alias? | 20:31 |
corvus | i think we did that with nodepool | 20:31 |
corvus | anyway, container shell for now, then later we can make a one-liner to have "zuul" do "docker-compose exec ...." | 20:31 |
fungi | oh, shell command alias/wrapper? | 20:31 |
corvus | fungi: ya | 20:31 |
clarkb | corvus: I don't see any using `which zuul` | 20:32 |
corvus | huh, i can't find that on nodepool either | 20:32 |
corvus | probably a change from mordred sitting in review | 20:32 |
clarkb | corvus: zuul02:/root/queues.sh has been edited to do docker exec can you check that and see if it looks right? | 20:32 |
fungi | looks like /usr/local/bin/zuul on the old server is the old style entrypoint consolescript | 20:32 |
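(A hedged sketch of the container-shell approach and the one-liner wrapper corvus mentions above; the compose file path, service name, and change/patchset values are assumptions or placeholders:)

```
# interactive wrapper idea: make "zuul" exec the CLI inside the scheduler container
alias zuul='docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul'
# queues.sh entries then take roughly this form
zuul enqueue --tenant openstack --pipeline check \
     --project opendev.org/opendev/system-config --change <change>,<patchset>
```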
clarkb | I think zuul is up now | 20:33 |
clarkb | I'm ready to run the queues.sh script if it looks good to yall | 20:33 |
corvus | clarkb: lgtm. will be slow, but script isn't long. | 20:33 |
clarkb | ok running it now | 20:33 |
fungi | from here i get to the webui | 20:33 |
fungi | nothing enqueued yet | 20:33 |
fungi | i take that back, four changes enqueued | 20:34 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Fix typo in gerrit sysadmin doc https://review.opendev.org/c/opendev/system-config/+/791314 | 20:34 |
clarkb | they should be showing up now ya | 20:34 |
corvus | there's another for ya :) | 20:34 |
fungi | make that five ;) | 20:34 |
corvus | retry_limit | 20:35 |
clarkb | ya but that's an airship job that does that a lot? may or may not indicate a problem | 20:35 |
clarkb | oh I'm seeing a number of other retries now | 20:35 |
corvus | other attempts | 20:35 |
clarkb | probably not airship specific | 20:35 |
clarkb | yaml.constructor.ConstructorError: could not determine a constructor for the tag '!encrypted/pkcs1-oaep' I see that on ze01 | 20:37 |
corvus | clarkb: we may be out of version sync | 20:37 |
corvus | probably need to run the pull playbook then a full restart | 20:37 |
clarkb | corvus: ok | 20:37 |
fungi | oh, yes that makes sense, that work did just merge | 20:37 |
corvus | (my guess is scheduler version is > executor) | 20:37 |
fungi | so executors/mergers need restarting | 20:38 |
clarkb | running pull now | 20:38 |
corvus | #status notice Zuul is in the process of migrating to a new VM and will be restarted shortly. | 20:38 |
openstackstatus | corvus: sending notice | 20:38 |
corvus | clarkb: in the mean time, can you save queues again (to a second file)? | 20:38 |
-openstackstatus- NOTICE: Zuul is in the process of migrating to a new VM and will be restarted shortly. | 20:38 | |
clarkb | ya though I may need to do it on 01 | 20:39 |
clarkb | corvus: the pulls seem to report they didn't do updates? | 20:39 |
corvus | clarkb: probably ok | 20:39 |
corvus | (probably images were already local) | 20:39 |
clarkb | but why would we be out of sync in that case? | 20:39 |
corvus | (probably only restarts are needed, but good to double check) | 20:39 |
clarkb | not sure I understand that last message | 20:40 |
fungi | clarkb: executors weren't restarted when the images updated | 20:40 |
corvus | oh hrm, if there really was a global restart with everything up to date then... | 20:40 |
fungi | were they? | 20:40 |
clarkb | fungi: the output from the zuul_pull.sh implies this is the case | 20:40 |
clarkb | Pulling executor ... status: image is up to date for z... | 20:41 |
fungi | 20:27 start time, so yeah i guess they were | 20:41 |
corvus | clarkb: let's make sure the image on zuul02 is up to date -- did the pull playbook do that? | 20:41 |
clarkb | checking | 20:41 |
openstackstatus | corvus: finished sending notice | 20:41 |
clarkb | Pulling scheduler ... status: image is up to date for z... | 20:42 |
clarkb | looks like it | 20:42 |
corvus | clarkb: can we do one more full stop / start just to make sure we got everything? | 20:42 |
corvus | then if it happens again, we'll call it a bug | 20:42 |
clarkb | yes, I'll do the dump then stop | 20:42 |
*** vishalmanchanda has quit IRC | 20:43 | |
fungi | definitely worrisome that executors couldn't parse the secrets from zk | 20:43 |
clarkb | stopped, running start now | 20:43 |
corvus | fungi: that was parsing the secrets over gearman | 20:43 |
fungi | ohh | 20:44 |
fungi | right that's still going through gearman | 20:44 |
clarkb | we should be coming back up again | 20:44 |
corvus | we ship secret ciphertext over gearman now, then executors decrypt, we only got to the "decode off the wire" stage on the executor, not quite as far as the decrypt step | 20:44 |
corvus | fungi: if they got to the decrypt step, they would get the keys from zk | 20:44 |
fungi | yeah i forgot the secrets hadn't moved to zk as part of the serialization work | 20:45 |
corvus | the latest scheduler and executor images were built from the same change | 20:45 |
corvus | (i checked docker image inspect on zuul02 and ze10) | 20:45 |
corvus | "org.zuul-ci.change": "788376", on both | 20:45 |
clarkb | I've prepped zuul02:/root/queues.new.sh | 20:46 |
fungi | i'm being hovered over to start heating a wok, but will try to be quick | 20:48 |
*** slaweq has quit IRC | 20:49 | |
clarkb | web loads again | 20:49 |
clarkb | corvus: should I re-enqueue a change maybe? | 20:50 |
corvus | there's one already | 20:50 |
clarkb | yup see a couple now actually | 20:50 |
clarkb | looks like they are retrying | 20:50 |
clarkb | I see the same traceback on ze01 concurrent with the newer restart | 20:51 |
corvus | well, that's fascinating | 20:51 |
corvus | i wonder if yaml is different on our image builds, or if there's a path we're not testing | 20:51 |
clarkb | I6d94c1d8da8b68e5fb60c27e73039155a02fb485 maybe? | 20:52 |
corvus | oh that's certainly the change that broke it, but i don't see how | 20:53 |
clarkb | gotcha | 20:53 |
corvus | there's *extensive* testing of secrets | 20:53 |
clarkb | I suspect that the 13 day old executor image on ze01 isn't the one we want to run with as a fallback | 20:55 |
corvus | we need a sync'd executor and scheduler image | 20:55 |
corvus | and unfortunately i don't think we tagged our last restart | 20:55 |
corvus | but i think we should be able to fall all the way back to 4.2.0? | 20:56 |
clarkb | oh thats an idea | 20:56 |
corvus | (especially now that the keys are on disk) | 20:56 |
corvus | clarkb: i think that's my vote | 20:56 |
clarkb | ok let me see what that looks like as far as changes we need to make | 20:57 |
corvus | oh... 1 thing | 20:57 |
corvus | that might have the old repo layout | 20:57 |
corvus | yep | 20:58 |
clarkb | yes it does | 20:58 |
corvus | that will use extra space and cause a longer restart, but not a blocker | 20:59 |
corvus | (we might as well just rm -rf before restarting to clean up the extra space) | 20:59 |
clarkb | ok so we're still good to proceed. How do we want to do the modification to zuul's docker-compose configs? Should I just manually do that on my checkout of system-config on bridge? | 21:00 |
corvus | clarkb: yeah, why don't you do that | 21:00 |
corvus | clarkb: i'll run the stop playbook now, and delete the git caches | 21:00 |
clarkb | ok my computer has decided now is a fine time to swap or something | 21:00 |
clarkb | I'll work on the docker-compose updates though | 21:01 |
clarkb | image: docker.io/zuul/zuul-scheduler:4.2.0 - does that look right? | 21:03 |
corvus | yep | 21:03 |
clarkb | and I'm doing that for all the zuul services | 21:04 |
corvus | ++ | 21:04 |
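(One hedged way to make that edit across the checkout on bridge rather than by hand; the role/template paths and the assumption that the templates pin :latest are guesses, and editing the files directly works just as well:)

```
cd /home/zuul/src/opendev.org/opendev/system-config
grep -rl 'docker.io/zuul/zuul-.*:latest' playbooks/roles \
    | xargs sed -i 's|\(docker.io/zuul/zuul-[a-z-]*\):latest|\1:4.2.0|'
```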
fungi | okay, dinner's cooked and i'm back | 21:05 |
clarkb | corvus: ok thats done. Are you ready for me to run service-zuul.yaml and then rerun the gearman config updater? | 21:05 |
corvus | oh because the playbook will write old config files? | 21:05 |
openstackgerrit | Ade Lee proposed zuul/zuul-jobs master: Add role to enable FIPS on a node https://review.opendev.org/c/zuul/zuul-jobs/+/788778 | 21:05 |
corvus | clarkb: yes i am ready | 21:05 |
clarkb | ok running in the screen, and yup, because it will update the config file to point at old zuul | 21:06 |
clarkb | so we just rerun the fixer playbook after too | 21:06 |
corvus | clarkb: in fact, i'm now ready for you to proceed all the way to restart | 21:06 |
corvus | (i have stopped everything and cleared out the git repo caches) | 21:06 |
fungi | caught up again, so looks like we did a rollback to 4.2.0 everywhere | 21:06 |
fungi | or are in progress with that | 21:07 |
clarkb | fungi: we're doing that | 21:07 |
fungi | yep, don't let me interrupt | 21:07 |
fungi | but lmk if there's something i can help with | 21:08 |
clarkb | the mergers are pulling 4.2.0 now | 21:11 |
clarkb | then it will be executors then scheduler images | 21:11 |
corvus | while we're waiting, i have a revert staged locally; i'd like to merge that today and restart into it, verify it works, then tag it (timing on that will obviously depend on what happens next, but that's the process i'd like to do) | 21:13 |
fungi | makes sense | 21:13 |
clarkb | sounds good to me | 21:14 |
fungi | i agree i'd have expected the rather extensive testing to pick that up before the change merged though, so the fix is guaranteed to be an interesting one | 21:15 |
clarkb | ok service-zuul is done no errors and rc is 0 | 21:16 |
clarkb | running the config fixup next | 21:16 |
clarkb | corvus: that is done. I think we are ready to start again if you are ready? | 21:16 |
clarkb | do you want to run the start playbook or should I? | 21:17 |
corvus | clarkb: go for it | 21:17 |
clarkb | ok running zuul_start.yaml now | 21:17 |
clarkb | and done | 21:18 |
clarkb | this startup will take longer because repos need to be cloned right? | 21:18 |
corvus | yes | 21:19 |
corvus | this should get a lot faster soon because the files that the cat jobs are asking for are going to be persisted in zk | 21:20 |
fungi | that'll be nice | 21:21 |
corvus | we're actually almost at the point of doing that (i think those changes are just coming ready for review now) | 21:21 |
corvus | aside from the fact that they now have 2 things ahead of them instead of just one :) | 21:22 |
clarkb | is there a way to see progress? it seems idler than I would expect | 21:22 |
clarkb | when tailing the scheduler debug log | 21:22 |
corvus | executor/merger logs and grafana | 21:22 |
clarkb | the merger queue is supposedly 0? | 21:23 |
corvus | the zuul job queue is not high which means we haven't sent the cat jobs out yet | 21:23 |
clarkb | zm01 hasn't merged anything in about 5 minutes | 21:23 |
corvus | we're... ratelimited on github? | 21:24 |
corvus | it's possible we're sitting in a sleep in the github driver waiting to query more branches | 21:25 |
fungi | maybe the successive restarts pushed us over query quota there | 21:25 |
corvus | yeah, i'm like 90% sure that's what's happening | 21:26 |
clarkb | hrm I see we're submitting a small number of cat jobs since the most recent start and they seem to all be things hosted on opendev? | 21:26 |
clarkb | but maybe I'm looking at the wrong logs stuff | 21:26 |
corvus | 2021-05-13 21:20:52,327 WARNING zuul.GithubRateLimitHandler: API rate limit reached, need to wait for 386 seconds | 21:27 |
corvus | opendev is the first tenant, openstack is after that | 21:27 |
corvus | and the github projects are in openstack | 21:27 |
clarkb | got it | 21:27 |
corvus | the 5m has expired and it's running again | 21:27 |
clarkb | and ya I see it doing a lot of work now | 21:28 |
clarkb | and zm01 is happily busy | 21:28 |
corvus | should be able to watch progress on 'zuul job queue' in grafana | 21:28 |
corvus | https://grafana.opendev.org/d/5Imot6EMk/zuul-status?viewPanel=19&orgId=1 | 21:28 |
clarkb | thanks | 21:28 |
rm_work | so, zuul good again? :D | 21:29 |
fungi | rm_work: close | 21:29 |
corvus | about 4,000 git clones away, we hope :) | 21:29 |
fungi | at least we think so. hard to say for sure until we see it actually run some jobs successfully | 21:29 |
rm_work | heh | 21:29 |
rm_work | sitting here, finger hovering the return key on a `git review` | 21:30 |
fungi | since this is a totally new server for the scheduler, there's plenty which could go wrong | 21:30 |
clarkb | assuming this fixes things I would like to continue to land the followup changes, but can leave DISABLE-ANSIBLE in place and do the dns and service-zuul.yaml playbook runs by hand after rebasing my 4.2.0 update onto the zuul01 cleanup change. | 21:34 |
clarkb | then we can remove DISABLE-ANSIBLE when we start corvus' plan for revert and applying the revert and all that | 21:34 |
clarkb | nova clones have started. Hopefully we move quickly after that | 21:35 |
*** darshna has quit IRC | 21:38 | |
corvus | 1200 bottles of beer on the wall.... | 21:42 |
fungi | clone one down, checkouts abound... | 21:43 |
mordred | 1500 bottles of beer on the wall ... | 21:45 |
corvus | nearly done | 21:45 |
clarkb | its up | 21:45 |
clarkb | should I enqueue a change? | 21:45 |
corvus | clarkb: i just did in zuul | 21:46 |
clarkb | k switching tenants on the dashboard | 21:46 |
clarkb | I see a console log | 21:47 |
clarkb | https://zuul.opendev.org/t/zuul/stream/e9f59ab4d01f4f729ec844cba722456b?logfile=console.log | 21:47 |
corvus | and it's actually running playbooks | 21:47 |
corvus | clarkb: maybe re-enqueue now? | 21:47 |
clarkb | corvus: will do | 21:47 |
clarkb | in progress now | 21:48 |
clarkb | and actually now that I think about it I think we're ok to keep the 300 ttl and leave zuul01.openstack.org in the emergency file if we just want to remove DISABLE-ANSIBLE and proceed with revert stuff | 21:49 |
clarkb | it's a fairly safe steady state, just with a low ttl; we can clean that up a bit further out in the near future | 21:49 |
fungi | yep | 21:49 |
clarkb | the only issue is the gearman server directive on zm* and ze* | 21:50 |
clarkb | I can push a new change that only updates that | 21:50 |
clarkb | we have a successful tox-linters job against zuul | 21:50 |
corvus | on a change uploaded by me, no less. shocking! | 21:51 |
clarkb | fungi: corvus: opinions on fixing the gearman server config? do we want to blaze ahead and land the existing changes to do that or would you prefer we stay somewhat nimble with changing ttls and cleaning up zuul01 and I can push a change that only updates the gearman config | 21:51 |
clarkb | (we're steady state right now due to DISABLE-ANSIBLE) | 21:52 |
clarkb | rm_work: I think we're cautiously optimistic at this point if you want to push | 21:52 |
corvus | clarkb: i say go all the way with existing changes | 21:53 |
rm_work | :P | 21:53 |
rm_work | thanks | 21:53 |
clarkb | corvus: ok wfm. | 21:53 |
clarkb | fungi: corvus: can you review https://review.opendev.org/c/opendev/zone-opendev.org/+/790483/ ? | 21:53 |
clarkb | then next is https://review.opendev.org/c/opendev/system-config/+/790484 | 21:53 |
clarkb | that will swap us back to latest but we want that for revert testing anyway | 21:54 |
corvus | clarkb: was already doing that, +2 on both | 21:54 |
clarkb | thanks! | 21:54 |
corvus | clarkb: mind if i go offline for a little while? maybe 30m? | 21:54 |
clarkb | corvus: sure things seem happy enough | 21:55 |
clarkb | corvus: before you go any reason to not remove DISABLE-ANSIBLE? | 21:55 |
ianw | o/ ... just reading the excitement in scrollback ... | 21:55 |
corvus | i'd like to squeeze this ups work in while building/updating is happening | 21:55 |
clarkb | it will revert our gearman config and our docker-compose updates | 21:55 |
clarkb | I'm ok with manually rerunning those playbooks again if necessary | 21:55 |
corvus | clarkb: don't think that's a problem; i don't think we're going to auto restart anything | 21:55 |
clarkb | corvus: ok I'll do that now so that merging those changes can automatically do the right thing | 21:55 |
openstackgerrit | Ade Lee proposed zuul/zuul-jobs master: Add role to enable FIPS on a node https://review.opendev.org/c/zuul/zuul-jobs/+/788778 | 21:56 |
clarkb | and done | 21:56 |
fungi | clarkb: what was in need of fixing with the gearman server config? | 21:56 |
corvus | clarkb, mordred, fungi: maybe you can go ahead and w+1 the revert? | 21:56 |
fungi | yup | 21:56 |
corvus | so that we get new images asap | 21:56 |
clarkb | fungi: the old config points mergers and executors at zuul01.openstack.org. As part of the upgrade I ran an out of band playbook to set it to zuul02.opendev.org instead so that we could control when they swapped over | 21:57 |
clarkb | I have approved the dns ttl cleanup | 21:57 |
clarkb | fungi: the zuul01 cleanup change makes that value zuul02.opendev.org permanently | 21:57 |
clarkb | corvus: yup looking at that change now | 21:58 |
fungi | ahh, yep | 21:58 |
corvus | biab. | 21:58 |
clarkb | ianw: tldr is we swapped zuul01 for zuul02 and discovered some recent changes to zuul executors were not happy. We have since rolled zuul02 back to zuul 4.2.0 which seems to be working | 21:58 |
*** corvus has quit IRC | 21:58 | |
clarkb | ianw: we'll roll forward again with a revert of the zuul executor changes to see that that properly addresses the issue then zuulians can work on fixes | 21:59 |
clarkb | looks like fungi got the zuul revert change so that is in the pipeline | 21:59 |
fungi | we rolled everything back to 4.2.0 right, not just the zuul02 scheduler? | 22:00 |
clarkb | fungi: correct | 22:00 |
clarkb | also to be clear zuul01 is not in use and is in the emergency file (this makes running the zuul stop/start/etc playbooks safe) | 22:01 |
fungi | anyway, cleanup changes are approved | 22:01 |
clarkb | should we do a #status log We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem. | 22:02 |
clarkb | er should that be #status notice? | 22:02 |
clarkb | oh you know what we didn't copy is the timing dbs but meh | 22:02 |
fungi | a bit wordy, but sure it works | 22:02 |
clarkb | #status notice We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem. | 22:03 |
openstackstatus | clarkb: sending notice | 22:03 |
-openstackstatus- NOTICE: We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem. | 22:03 | |
openstackgerrit | Merged opendev/zone-opendev.org master: Reset zuul.o.o CNAME TTL to default https://review.opendev.org/c/opendev/zone-opendev.org/+/790483 | 22:03 |
openstackstatus | clarkb: finished sending notice | 22:06 |
ianw | ok cool thanks. i guess the zuul gate isn't tied up in devstack issues at least | 22:06 |
clarkb | ianw: yup and re devstack it sounds like gmann wants to revert and do the ovn switch properly | 22:10 |
clarkb | rather than try and fix every new random issue that was masked by using the CI jobs to do the switch and not the devstack configs | 22:10 |
clarkb | service-nameserver job has started | 22:11 |
clarkb | I had to reapprove 790484 because fungi approved it before its dependency merged. But that is done now | 22:13 |
ianw | clarkb: yeah, i think i'm on board with the revert too, because it's not quite as simple to just stick that new file in and be done with it | 22:13 |
clarkb | ianw: ya it basically entirely relied on changing the job config to work in the job and won't work anywhere else | 22:14 |
ianw | i think it's probably worth making ensure-devstack install devstack from zuul checkout and running that in the devstack gate as a check on this | 22:14 |
clarkb | I see the new TTL for zuul.opendev.org so that looks good | 22:15 |
clarkb | ianw: ya devstack could run a job that uses it with an up to date checkout so that pre merge testing works but otherwise its not doing much | 22:15 |
clarkb | at this point I think we are just waiting on the zuul01 cleanup change and corvus' revert in zuul. I haven't seen anything from zuul jobs that have run to indicate any major problems | 22:16 |
clarkb | the one problem I identified is we didn't copy the timing dbs over to the new server so we don't have that data, but it's not the end of the world | 22:16 |
ianw | i get that the ensure-devstack role is explicitly about using devstack, not testing it, but "can i use devstack" is a pretty good devstack test as well :) | 22:16 |
clarkb | ++ | 22:17 |
clarkb | the hourly service-zuul.yaml run is likely to run before my zuul01 removal change lands. What this means is our docker-compose.yaml files on scheduler+web, mergers, and executors will be updated, as will the zuul.conf on mergers and executors, removing the manual changes we have made | 22:24 |
clarkb | this should be fine as long as we put those changes back again before doing a restart | 22:24 |
clarkb | in fact I may just run the gearman config fixup playbook after that run finishes as we want the docker-compose changes to go away to do the revert | 22:25 |
fungi | which may be easier than temporarily adding them all to the emergency list | 22:25 |
clarkb | yup | 22:25 |
clarkb | and when we get to the restart point we only need the gearman fixes in place, which means as long as I've put those back we could potentially restart things on the revert before the zuul01 cleanup has fully applied (then I can just manually go through it after the fact or let the periodic run catch it tonight) | 22:26 |
clarkb | service-zuul.yaml is done now. I'll rerun the out of band gearman config fix now | 22:53 |
clarkb | and that is done. We should be able to restart zuul safely on the revert now (docker-compose is back to latest) as long as 790484 runs before the next hourly pass | 22:54 |
*** corvus has joined #opendev | 22:56 | |
corvus | o/ | 22:57 |
clarkb | corvus: all seems to be going well so far | 22:58 |
clarkb | corvus: still waiting on the zuul01 removal to land, but I went ahead and reran the gearman config fixup after the previous hourly service-zuul.yaml ran | 22:58 |
corvus | cool, looking at eavesdrop now... | 22:58 |
clarkb | corvus: I think that means we're in an ok spot right now to restart on the zuul revert (docker-compose should point to latest now and gearman configs are correct) | 22:58 |
corvus | i *think* we can copy over the timing db if we want | 22:59 |
corvus | but also, it'll sort itself out soon :) | 22:59 |
clarkb | corvus: I'm not super worried about it because ya it will handle itself | 22:59 |
fungi | it may even clean itself up a bit | 22:59 |
clarkb | if we are going to do a restart maybe wait for 790484 to land first so that we can get that in without waiting again (it should be soon I think) | 22:59 |
corvus | looks like the revert just landed | 23:00 |
clarkb | is this restart going to be another slow one because it will clone into the other repo paths? | 23:00 |
corvus | clarkb: yep :( | 23:00 |
clarkb | (wondering if we should keep the old repos around or if that complicates stuff somehow) | 23:00 |
corvus | no choice, zuul is going to delete them this time | 23:00 |
clarkb | got it | 23:00 |
corvus | (i thought about leaving the other ones around, but it would have been more surgery to delete the current ones when we switch back because zuul *wouldn't* do it in that case) | 23:01 |
clarkb | so ya my only suggestion then is maybe wait for 790484 to land, I think it may only be ~10 minutes away? then do restart? | 23:01 |
corvus | (because it would have thought it already did the delete of the old scheme) | 23:01 |
corvus | cool, i have some screws i still need to attach, i'll go do that :) | 23:01 |
clarkb | ianw: oh btw I ended up just changing the order of the altnames in the cert to fix the issue you were looking at. I was trying to keep moving parts to a minimum, but if we still like those improvements (the log capture for sure) we can do followups to land them | 23:05 |
clarkb | and this way we don't need to add more acme records to openstack.org | 23:05 |
clarkb | that will just get deleted next week :) | 23:05 |
ianw | yeah that's cool. i think i had in my head the idea that we'd mostly use the inventory_hostname for consistency in general, but whatever works | 23:07 |
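A spot check like the following (a sketch, not necessarily the exact command anyone ran here; the -ext option needs openssl 1.1.1+) shows which altname ended up as the certificate subject and how the SANs are ordered:

    echo | openssl s_client -connect zuul.opendev.org:443 -servername zuul.opendev.org 2>/dev/null \
        | openssl x509 -noout -subject -ext subjectAltName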
openstackgerrit | Merged opendev/system-config master: Clean up zuul01 from inventory https://review.opendev.org/c/opendev/system-config/+/790484 | 23:09 |
clarkb | corvus: ^ my timing was really good too :) | 23:10 |
corvus | approximately 10 minutes :) | 23:10 |
corvus | clarkb: so what's next? | 23:10 |
clarkb | corvus: I just looked at zm01 and ze01 and they both have the correct docker-compose (latest) config as well as gearman configs | 23:11 |
clarkb | corvus: maybe run the pull script and double check that image looks like the one we want on the hosts and we can restart? | 23:11 |
corvus | running pull now | 23:12 |
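One way to double check what a pull actually landed on each host is to compare the image id/digest after pulling; the compose directory and image name below are assumptions for illustration, not the exact opendev layout:

    # pull whatever the compose file references, then inspect the local image
    cd /etc/zuul-scheduler && sudo docker-compose pull
    sudo docker image inspect zuul/zuul-scheduler:latest --format '{{.Id}} {{.RepoDigests}}'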
clarkb | for dumping queues on zuul02 we may need extra tooling. But previously I ran the dump on zuul01 and copied it to zuul02 after zuul01 was stopped with no problem | 23:12 |
clarkb | can probably just use zuul01 for the queue dump still | 23:12 |
corvus | i'll try on 2 | 23:13 |
corvus | python3 ~corvus/zuul/tools/zuul-changes.py https://zuul.opendev.org | 23:14 |
corvus | that works | 23:14 |
clarkb | cool | 23:14 |
corvus | clarkb: then we need to mutate the output right? | 23:14 |
clarkb | corvus: yes you need to prefix the entries with the docker exec command one sec | 23:14 |
clarkb | prefix the data with "docker exec zuul-scheduler_scheduler_1" on each line | 23:15 |
corvus | i'm going to make a custom zuul-changes for now | 23:15 |
clarkb | /root/queues.sh is an example | 23:15 |
clarkb | if you run that as not root you will also need a sudo at the front | 23:15 |
corvus | i'll add it in so it works either way | 23:16 |
corvus | ~root/zuul-changes.py https://zuul.opendev.org | 23:16 |
corvus | sudo docker exec zuul-scheduler_scheduler_1 zuul enqueue --tenant openstack --pipeline gate --project opendev.org/openstack/neutron-lib --change 791134,1 | 23:16 |
corvus | produces output like that ^ | 23:16 |
clarkb | that looks right to me | 23:17 |
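The mutation being discussed is mechanical; a one-liner along these lines (a sketch equivalent to the custom script above, with the tool path and container name taken from the conversation) produces the same prefixed enqueue commands:

    python3 ~corvus/zuul/tools/zuul-changes.py https://zuul.opendev.org \
        | sed 's|^|sudo docker exec zuul-scheduler_scheduler_1 |' > queues.sh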
corvus | clarkb: zuul-promote-image failed for the revert | 23:17 |
corvus | looks like it may have failed after the retag though | 23:17 |
corvus | (failed deleting the tag) | 23:17 |
clarkb | ah ya that races sometimes iirc | 23:18 |
corvus | i'll check dockerhub and see if they look ok | 23:18 |
clarkb | ++ | 23:18 |
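Checking dockerhub can be done from any machine with a reasonably recent docker cli; a sketch (the image names are the published zuul images, assumed here) that shows the digest currently behind each latest tag:

    docker manifest inspect zuul/zuul-executor:latest | grep -m1 '"digest"'
    docker manifest inspect zuul/zuul-scheduler:latest | grep -m1 '"digest"'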
fungi | so long as that was all that failed, i guess we should still be clear to pull | 23:18 |
corvus | hrm, not looking good | 23:20 |
corvus | i think it promoted zuul but not zuul-executor | 23:20 |
corvus | i'll try re-enqueing that | 23:20 |
clarkb | corvus: ok | 23:21 |
*** tosky has quit IRC | 23:23 | |
clarkb | one of the jobs failed but I think it was a js one? | 23:23 |
clarkb | (on the reenqueue) | 23:23 |
fungi | if it's the tarball publication, yeah that's been broken | 23:24 |
corvus | yeah, image promote looks good now | 23:24 |
corvus | i'll pull again | 23:24 |
corvus | done | 23:27 |
corvus | clarkb: i guess we save/restart/reenqueue now? | 23:28 |
clarkb | corvus: I think so. | 23:28 |
corvus | running that now | 23:28 |
clarkb | the gearman config still looks good | 23:28 |
corvus | cat jobs are going; we can watch the graph again | 23:33 |
clarkb | seems like it went faster this time, but still waiting for the layout to be loaded | 23:38 |
corvus | yeah, we may have one job still running? | 23:39 |
corvus | and it may time out, which may mean we have to restart :? | 23:39 |
corvus | i'm not sure why it went faster | 23:39 |
corvus | it was slower than the earlier restarts, but faster than the last one | 23:40 |
corvus | that last job finished | 23:40 |
clarkb | I think zm01 didn't have its old repos cleaned up | 23:40 |
clarkb | which may explain the speed | 23:41 |
corvus | clarkb: oh yep, i forgot to do that on the mergers | 23:41 |
corvus | i'll shut them down and clean them up in a bit | 23:41 |
corvus | loaded now | 23:41 |
clarkb | ok. opendev%2Fsystem-config shows an older system-config head but it should be updated if it actually gets those merge jobs so I don't think that is a problem | 23:41 |
clarkb | looks like web is responding again | 23:42 |
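"Responding" can be verified with a quick poll of the REST API; a minimal sketch (the /api/info and per-tenant status endpoints are standard zuul-web routes):

    curl -sf https://zuul.opendev.org/api/info | python3 -m json.tool
    curl -sf https://zuul.opendev.org/api/tenant/openstack/status >/dev/null && echo "status ok"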
corvus | re-enqueing | 23:42 |
corvus | btw, mouse over the 'waiting' labels on the tripleo-ci jobs | 23:43 |
clarkb | https://zuul.opendev.org/t/openstack/stream/ed1ad277335145788067faf26d412417?logfile=console.log that seems to be actually doing something? | 23:44 |
clarkb | oh nice on the mouse over | 23:44 |
clarkb | spot checking some jobs this looks happy | 23:45 |
clarkb | The next thing to do is probably to manually run service-zuul in the foreground and ensure that it noops? We can wait for zuul to be a bit further along in its happiness before we do that though | 23:46 |
corvus | gimme a sec to manually clean up the mergers | 23:47 |
clarkb | yup no rush | 23:47 |
clarkb | hrm a number are stuck at "updating repositories" but I suspect that is because they need to fetch/clone repos/changes | 23:48 |
corvus | yep | 23:48 |
clarkb | ah yup just saw at least one proceed from that state | 23:48 |
corvus | since the cat jobs didn't do much priming | 23:48 |
corvus | re-enqueue finished | 23:48 |
fungi | oh, is it on-demand clones now? | 23:49 |
fungi | ahh, for the executors | 23:49 |
corvus | ooh, check out the waiting mouseovers on the system-config deploy queue item | 23:50 |
clarkb | I was just doing that :) | 23:50 |
corvus | mergers are back up and running | 23:51 |
corvus | #status log restarted zuul on commit ddb7259f0d4130f5fd5add84f82b0b9264589652 (revert of executor decrypt) | 23:51 |
openstackstatus | corvus: finished logging | 23:51 |
clarkb | I've got the service-zuul.yaml command queued up in the screen if anyone wants to look at it. I think we're ok to run that now and get ahead of the deploy jobs and just ensure that it noops | 23:53 |
clarkb | I'll run that at 00:00 if I don't hear any objections | 23:54 |
corvus | clarkb: lgtm | 23:54 |
fungi | yep, ready | 23:55 |
clarkb | ok I'll go ahead and run it then | 23:55 |
clarkb | (now I mean) | 23:55 |
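The foreground run queued up in the screen session is an ordinary ansible-playbook invocation; a sketch of that kind of run (the path is an assumption about where system-config is checked out on the bastion, not the exact command in the screen), with --diff making a no-op easy to confirm:

    sudo ansible-playbook --diff \
        /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-zuul.yaml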
corvus | we have succeeding jobs | 23:55 |
fungi | so much green | 23:55 |
fungi | both in the screen session and the status page i guess | 23:55 |
ianw | clarkb: ++ be good to know it's in a steady state | 23:59 |