clarkb | I suspect it is the first name in the name list that determines the output file. I'm not sure https://review.opendev.org/c/opendev/system-config/+/791060 will fix it? | 00:13 |
---|---|---|
clarkb | maybe instead we should just swap the order of the names in the zuul02 list and have zuul.opendev.org come first? | 00:13 |
ianw | ohh, you know what, i think you're right | 00:21 |
fungi | but also interpolating the filename in the vhost configs makes sense | 00:23 |
*** openstackgerrit has joined #opendev | 00:26 | |
openstackgerrit | Ian Wienand proposed opendev/zone-opendev.org master: Add acme challenge for zuul01 https://review.opendev.org/c/opendev/zone-opendev.org/+/791069 | 00:26 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: zuul-web : use hostname for LE cert https://review.opendev.org/c/opendev/system-config/+/791060 | 00:26 |
ianw | it might be worth doing ^ anyway just for consistency of having the cert cover the hostname as well as the CNAMEs | 00:27 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: bootloader: remove extlinux/syslinux path https://review.opendev.org/c/openstack/diskimage-builder/+/541129 | 00:34 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Futher bootloader cleanups https://review.opendev.org/c/openstack/diskimage-builder/+/790878 | 00:34 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Add fedora-containerfile element https://review.opendev.org/c/openstack/diskimage-builder/+/790365 | 00:44 |
*** whoami-rajat has quit IRC | 01:13 | |
*** brinzhang has joined #opendev | 02:03 | |
openstackgerrit | Steve Baker proposed openstack/diskimage-builder master: Add element block-device-efi-lvm https://review.opendev.org/c/openstack/diskimage-builder/+/790192 | 02:46 |
openstackgerrit | Steve Baker proposed openstack/diskimage-builder master: WIP Add a growvols utility for growing LVM volumes https://review.opendev.org/c/openstack/diskimage-builder/+/791083 | 02:46 |
*** hemanth_n has joined #opendev | 02:52 | |
*** brinzhang_ has joined #opendev | 03:21 | |
*** brinzhang has quit IRC | 03:24 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [WIP] test devstack https://review.opendev.org/c/openstack/diskimage-builder/+/791091 | 04:07 |
*** ralonsoh has joined #opendev | 04:31 | |
*** ykarel has joined #opendev | 04:38 | |
*** brinzhang0 has joined #opendev | 04:50 | |
*** brinzhang_ has quit IRC | 04:54 | |
*** marios has joined #opendev | 04:58 | |
*** hemanth_n has quit IRC | 05:00 | |
*** hemanth_n has joined #opendev | 05:00 | |
*** vishalmanchanda has joined #opendev | 05:06 | |
*** darshna has joined #opendev | 05:08 | |
jrosser | i think /etc/ci/mirror_info.sh might be broken for bullseye due to missing VERSION_ID | 05:22 |
*** slaweq has joined #opendev | 06:26 | |
*** ykarel has quit IRC | 06:46 | |
*** jpena|off is now known as jpena | 06:48 | |
*** zbr has quit IRC | 06:49 | |
*** zbr has joined #opendev | 06:51 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-devstack: allow for minimal configuration of pull location https://review.opendev.org/c/zuul/zuul-jobs/+/791116 | 06:57 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [dnm] testing devstack 791085 https://review.opendev.org/c/zuul/zuul-jobs/+/791117 | 07:00 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: ensure-devstack: allow for minimal configuration of pull location https://review.opendev.org/c/zuul/zuul-jobs/+/791116 | 07:03 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [dnm] testing devstack 791085 https://review.opendev.org/c/zuul/zuul-jobs/+/791117 | 07:03 |
*** lucasagomes has joined #opendev | 07:04 | |
*** amoralej|off is now known as amoralej | 07:06 | |
*** andrewbonney has joined #opendev | 07:25 | |
*** tosky has joined #opendev | 07:47 | |
*** jaicaa has quit IRC | 08:36 | |
*** jpena is now known as jpena|lunch | 11:32 | |
*** jhesketh has quit IRC | 11:43 | |
fungi | jrosser: if it's the same problem as before, it's because base-files 11 is missing version information which base-files 11.1 will provide once it migrates to bullseye | 11:56 |
fungi | so ansible can't find a version and substitutes "n/a" | 11:56 |
jrosser | hrrm, is there a workaround anyone has been using for this? | 11:58 |
fungi | not sure, to be honest. i mean, bullseye isn't released yet so it makes some sense that ansible doesn't recognize it | 12:15 |
fungi | i expect the ansible community considers it to be working as designed | 12:16 |
jrosser | i've tried to patch stuff to insert VERSION_ID=11 into /etc/os-release | 12:17 |
fungi | you could try diffing /etc/os-release between base-files 11 and 11.1 and see if it's one of the other missing values which is needed | 12:18 |
jrosser | https://zuul.opendev.org/t/openstack/build/fa59a627370949be96e3982d31683837/log/job-output.txt#2844 | 12:19 |
*** amoralej is now known as amoralej|lunch | 12:21 | |
*** jhesketh has joined #opendev | 12:21 | |
fungi | jrosser: the source for that is here: https://opendev.org/opendev/base-jobs/src/branch/master/roles/mirror-info/templates/mirror_info.sh.j2#L26-L34 | 12:24 |
fungi | is there a fallback value we could grab when VERSION_ID is unset, do you think? | 12:25 |
fungi | i'm setting up a bullseye machine now to see if i can get any ideas | 12:25 |
fungi | all mine are either sid (which has had base-files 11.1 for months) or buster | 12:25 |
*** jpena|lunch is now known as jpena | 12:33 | |
jrosser | fungi: there is potential fallback information in the node info | localhost | Distro: Debian 11.0 | 12:33 |
jrosser | but i guess there are two slightly different things: making mirror_info.sh robust, and then somewhat separately the ansible n/a version | 12:34 |
fungi | jrosser: yeah, so this is the diff between base-files 11 and 11.1 os-release files: http://paste.openstack.org/show/805351/ | 12:39 |
fungi | and this is the diff of /etc/debian_version as a possibility: http://paste.openstack.org/show/805353/ | 12:41 |
jrosser | oh, hmm https://zuul.opendev.org/t/openstack/build/5132e46c48f64f4ba324b70f94d86eab/log/zuul-info/host-info.debian-bullseye.yaml#126-132 | 12:41 |
jrosser | so it might not be unreasonable for VERSION_ID to fall back to ansible_distribution_major_version | 12:43 |
fungi | it does mix contexts a bit, but yeah we could essentially use ansible jinja interpolation to "hard code" a fallback value into the script | 12:44 |
jrosser | given that the script is a template that should be doable | 12:44 |
fungi | we could even switch to setting those variables with ansible and just using the values from /etc/os-release as fallbacks, though that's more likely to introduce regressions | 12:46 |
jrosser | indeed - that's why i thought it was a good question for here as i'm sure there's good reason for how it is now | 12:47 |
fungi | well, lots of this grew out of shell scripts we ran with jenkins in the long-long ago, in the beforetime | 12:48 |
jrosser | really the motivation here is to get bullseye working ASAP even though it's unreleased, as that lets us drop a decent %age of CI jobs maybe a cycle earlier | 12:48 |
jrosser | what with bionic/focal centos8/stream buster/bullseye the support matrix is really full right now | 12:48 |
fungi | yep, i totally get that. also worth noting the only thing we ultimately use VERSION_ID in is the wheel mirror url, so that could be reworked as well maybe | 12:51 |
fungi | i feel like templating in a fallback string for the VERSION_ID assignment when it's unassigned or maybe even just preassigning before sourcing /etc/os-release would be the safest solution | 12:54 |
jrosser | just to make it super obvious there could be an ANSIBLE_DISTRIBUTION_MAJOR_VERSION={{ ansible_distribution_major_version }} so it's really clear when someone looks at the generated script | 12:56 |
fungi | well, code comments in the script work too | 12:57 |
jrosser | of course :) | 12:57 |
fungi | i'll push up a prototype and we can hash it out in review | 12:57 |
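(For context, a minimal sketch of the "preassign before sourcing" option discussed above; this is not the actual 791176 change, and it assumes the mirror_info.sh.j2 template can simply seed the variable from the Ansible fact:)

```
# Seed VERSION_ID from the Ansible fact so a bullseye /etc/os-release that
# omits it still leaves the script with a usable value; sourcing os-release
# afterwards lets a real VERSION_ID win whenever the release defines one.
VERSION_ID={{ ansible_distribution_major_version }}
source /etc/os-release
```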
openstackgerrit | Jeremy Stanley proposed opendev/base-jobs master: Test VERSION_INFO default for mirror-info role https://review.opendev.org/c/opendev/base-jobs/+/791176 | 13:02 |
openstackgerrit | Jeremy Stanley proposed opendev/base-jobs master: Revert "Test VERSION_INFO default for mirror-info role" https://review.opendev.org/c/opendev/base-jobs/+/791177 | 13:02 |
fungi | jrosser: for changes to base-jobs content, because it's a trusted repo where we don't get to take advantage of speculative job config changes, we change the base-test job first and then we can try do-not-merge changes in untrusted repos which set base-test as the parent for some obvious jobs | 13:04 |
fungi | if dnm changes parenting jobs to base-test work after 791176 merges, then we would merge the revert and propose a similar change to the normal mirror-info role used by the base job | 13:05 |
jrosser | ah ok | 13:06 |
*** amoralej|lunch is now known as amoralej | 13:11 | |
*** DSpider has joined #opendev | 13:12 | |
*** hemanth_n has quit IRC | 13:23 | |
mnasiadka | I started to notice ntp not being started on debian nodepool instances; timedatectl says "NTP service: inactive" - not all the time, but every 2nd-3rd CI run in kolla-ansible - any idea what might be wrong? | 13:36 |
fungi | do you collect the system journal on those builds? or maybe syslog? it will probably have some indication | 13:41 |
mnasiadka | ntpd claims it's running, but not synchronized - so maybe it's just a timedatectl flaw that it has problems finding ntpd running | 13:42 |
fungi | could it be that ntpd simply hasn't settled yet by the time you're checking it? | 13:43 |
mnasiadka | well, I'm fine with unsynchronized, I'm not really fine with timedatectl saying NTP service: inactive - but maybe timedatectl has some problems checking ntpd (it's rather tied to systemd-timesyncd) | 13:45 |
fungi | yeah, seems like a very systemd-centric tool | 13:47 |
fungi | which i suppose is fine if you're running systemdos | 13:47 |
fungi | and also don't care that much about precision and discipline in your time sources | 13:52 |
fungi | mnasiadka: do you have an example build with that i can look at? | 13:53 |
fungi | i would expect it to say something like "NTP synchronized: no" | 13:54 |
mnasiadka | fungi: I think it's really that Kolla-Ansible prechecks rely on timedatectl (which ignores ntpd - it only checks for systemd-timesyncd) - but I don't think we've seen them fail in the past on Debian. recent build: https://935f2aace51477baa019-09dce2ec9ab39d19fdc97cba82216d08.ssl.cf2.rackcdn.com/787701/6/check/kolla-ansible-debian-source/b491ad2/primary/logs/ansible/deploy-prechecks | 13:55 |
fungi | the syslog it collected from that node claimed ntpd started at 12:35:16 and was instructed to accept large time jumps to get the clock in sync | 13:57 |
fungi | May 13 12:35:17 debian-buster-rax-ord-0024662430 ntpd[649]: error resolving pool 0.debian.pool.ntp.org: Temporary failure in name resolution (-3) | 13:58 |
fungi | so it was having trouble with dns resolution there | 13:58 |
fungi | because ntpd started before unbound | 13:58 |
mnasiadka | oops | 13:59 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add zuul02 to inventory https://review.opendev.org/c/opendev/system-config/+/790481 | 13:59 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Clean up zuul01 from inventory https://review.opendev.org/c/opendev/system-config/+/790484 | 13:59 |
clarkb | fungi: ianw I left a review on https://review.opendev.org/c/opendev/system-config/+/791060 which is what triggered my updates above | 13:59 |
clarkb | I'm thinking we do the swap then can improve the cert configs after? I think that simplifies stuff as its one less system to worry about LE on | 14:00 |
fungi | mnasiadka: well, it's built to handle that, it starts polling timeservers just after unbound starts, according to syslog | 14:00 |
fungi | though as of 12:40:35 it still complains "kernel reports TIME_ERROR: 0x41: Clock Unsynchronized" | 14:00 |
clarkb | fungi: ianw if you agree I think we can probably proceed to try and do https://etherpad.opendev.org/p/opendev-zuul-server-swap today. I'm feeling much better | 14:00 |
fungi | clarkb: yeah, that sounds fairly straightforward | 14:00 |
fungi | though systemd reported "Reached target System Time Synchronized. | 14:02 |
fungi | at 12:35:12 | 14:02 |
fungi | which was before ntpd started? | 14:02 |
fungi | the "Clock Unsynchronized" errors are apparently endemic of a system clock which is too erratic for ntpd to properly discipline | 14:04 |
fungi | so maybe that's what's going on | 14:05 |
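(A few shell checks one could run on such a node to query ntpd directly rather than through timedatectl, which mostly knows about systemd-timesyncd; the journal unit name is an assumption for Debian's ntp package:)

```
ntpq -pn                        # peer list; a leading '*' means a peer is selected and we are synced
ntpq -c rv                      # system clock variables, including sync status and offset
journalctl -u ntp | tail -n 50  # startup ordering and resolution errors like the one quoted above
```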
mnasiadka | might be, will add some debug and check | 14:15 |
clarkb | fungi: I also want to test my playbook at https://review.opendev.org/c/opendev/system-config/+/790487 before we do the swap over so will look at that after bootstrapping my morning. If that looks good and zuul02 looks good I think we can proceed with the swap whenever others are ready | 14:21 |
*** sshnaidm is now known as sshnaidm|afk | 15:00 | |
*** lucasagomes has quit IRC | 15:03 | |
fungi | clarkb: any opinion on the approach in 791176 | 15:33 |
fungi | ? | 15:33 |
clarkb | fungi: basically use the ansible value first and then let os-release override. If os-release doesn't override then at least we have something? that should work | 15:36 |
fungi | yeah, i mean, i expect it to work. mainly that there are several ways we could go about solving it and i was aiming for the one with the least chance to cause regressions in behavior | 15:37 |
clarkb | fungi: I think my only concern would be that the ansible fact and the os-release value may have different value types? like one could be a number and the other a string in some situations? But for this weird pre-release debian situation it should be fine? | 15:40 |
fungi | i think it's always going to be a string in the end because it's a text file template to a shell script | 15:41 |
fungi | so even numbers are strings | 15:41 |
*** marios is now known as marios|out | 15:43 | |
clarkb | right I meant more at a high level, like for ubuntu can one be 20.04 and the other Focal Fossa | 15:45 |
clarkb | 11 vs bullseye etc | 15:45 |
clarkb | I'm not worried about that to the point where we can't make the change though | 15:47 |
*** marios|out has quit IRC | 15:47 | |
clarkb | os-release is the winner and that will preserve existing behavior for us which should be sufficient | 15:47 |
openstackgerrit | Merged opendev/system-config master: Add zuul02 to inventory https://review.opendev.org/c/opendev/system-config/+/790481 | 15:52 |
clarkb | I'm ssh'd into zuul02 ^ and running a tail on syslog watching for ansible | 15:53 |
clarkb | it will probably be a few minutes before it gets there though. I'll try to keep an eye on it | 15:53 |
*** jpena is now known as jpena|off | 16:01 | |
clarkb | I think there is an ssh host key problem with new zuul02 and bridge | 16:16 |
clarkb | I ran the script to scan the dns ssh key records and that usually populates things properly | 16:18 |
clarkb | not sure what is going on yet | 16:18 |
clarkb | and now I'm grumpy that ssh reports errors using a sha256 hash of the key and keyscan gives you the base64 encoding of the key | 16:21 |
clarkb | you'd think having an option to make those line up would be done by now | 16:22 |
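(They can be lined up manually; a hedged example, noting that reading a key from stdin needs a reasonably recent OpenSSH:)

```
# Turn ssh-keyscan output into SHA256 fingerprints comparable to ssh's error message
ssh-keyscan -t rsa zuul02.opendev.org 2>/dev/null | ssh-keygen -lf -
# Show the fingerprint of whatever known_hosts already holds for that host
ssh-keygen -F zuul02.opendev.org -f ~/.ssh/known_hosts -l
```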
clarkb | base has already failed and LE should fail next | 16:23 |
clarkb | fungi: ^ do you see what may have happened there? | 16:23 |
clarkb | ssh keys on the host were generated may 10 which is when I booted it | 16:25 |
clarkb | I found my sudo sshfp.py command on bridge and that looks correct | 16:25 |
clarkb | ok I think I see what happened, there must've been IP reuse | 16:28 |
clarkb | we have an earlier entry in known_hosts with a different key, but the file was last updated around when I booted the new instance and the last entry in the file matches what I see when running keyscan on localhost | 16:28 |
clarkb | I'll just remove the older entry and that should make things work | 16:28 |
clarkb | hrm it is still asking me for a key | 16:29 |
clarkb | ok the current entry seems to be for the ipv6 address but ansible uses ipv4 | 16:31 |
clarkb | now they are both in there with what appears to be the correct key based on on-host ssh keyscanning | 16:32 |
clarkb | Looks like we bailed out of the runs for that change merge | 16:32 |
clarkb | fungi: ^ should I run base, LE, and then zuul by hand? | 16:32 |
fungi | clarkb: yeah, sorry didn't get a chance to look yet but i agree we fail to clean up stale records for hostkeys of deleted servers in the known_hosts file so i can see where it would cause problems. in the future we might consider generating known_hosts instead | 16:34 |
fungi | and yes running those playbooks seems safe | 16:35 |
clarkb | ok I'll start base now | 16:35 |
clarkb | er let me wait for the puppet else run to finish to avoid any conflicts | 16:35 |
clarkb | base is running now | 16:39 |
*** amoralej is now known as amoralej|off | 16:40 | |
clarkb | I forgot to do -f 50 | 16:41 |
clarkb | this might be a while. I'll touch the ansible stoppage file on bridge if it gets closer to 1700 UTC to avoid conflict with the hourly runs | 16:42 |
*** timburke_ has joined #opendev | 16:45 | |
clarkb | #status log Ran disable-ansible on bridge to avoid conflicts with reruns of playbooks to configure zuul02 | 16:46 |
openstackstatus | clarkb: finished logging | 16:46 |
*** timburke has quit IRC | 16:48 | |
clarkb | if anyone is wondering you really do not want to forget the -f 50 | 16:58 |
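(Roughly the shape of those manual runs on bridge; the checkout location and playbook paths are assumptions, the point is the -f 50 forks flag so hosts are handled in parallel instead of Ansible's default five at a time:)

```
cd /home/zuul/src/opendev.org/opendev/system-config
sudo ansible-playbook -f 50 playbooks/base.yaml
sudo ansible-playbook -f 50 playbooks/letsencrypt.yaml
sudo ansible-playbook -f 50 playbooks/service-zuul.yaml
```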
fungi | cereal execution | 17:04 |
fungi | as in go eat a bowl of some and check back later | 17:05 |
fungi | or watch the zuul episode on openshift.tv | 17:05 |
clarkb | looks like I also need to run service-borg-backup.yaml in my list of playbooks | 17:10 |
clarkb | base is done. There were a few issues on other hosts like the rc -13 and apt-get autoremove -y on a few hosts being unhappy. I don't think those affect zuul so will proceed | 17:18 |
clarkb | letsencrypt playbook is running now | 17:19 |
fungi | these would also have been rerun in the daily job, right? | 17:24 |
clarkb | yes, but that doesn't happen for another 12 hours or something | 17:24 |
clarkb | and I want to maybe get a zuul swap in today | 17:24 |
clarkb | I'm hoping I can get zuul02 all configured, we can double check it, eat lunch, then come back and run through the plan on the etherpad | 17:25 |
clarkb | depending on how this goes maybe tomorrow we can land mailman updates too | 17:26 |
clarkb | we'll see :) | 17:26 |
fungi | yeah, i wasn't suggesting we wait, just checking that it would have been run within a day under normal circumstances | 17:27 |
clarkb | yup they would be | 17:27 |
*** andrewbonney has quit IRC | 17:28 | |
clarkb | borg backup is now done. Next is running the zuul playbook | 17:28 |
clarkb | actually I may need to run zookeeper first to update the firewall rules there | 17:29 |
clarkb | double checking on that | 17:29 |
clarkb | ya zk playbook comes before zuul playbook | 17:30 |
clarkb | oh nevermind base runs the iptables role too so this is already done (but doesn't hurt to run service-zookeeper.yaml anyway) | 17:31 |
fungi | sure | 17:31 |
clarkb | I realized this after I started it and it reported a bunch of noops :) | 17:31 |
clarkb | ok that's done. I'm going to run service-zuul.yaml now. Remember we don't expect this to cause problems because we shouldn't start zuul services on the new scheduler. But keep an eye open :) | 17:33 |
clarkb | I notice that we may install apt-transport-https on newer systems that no longer need it | 17:36 |
fungi | yeah, i think it's supported directly on focal? | 17:41 |
clarkb | the start containers task was skipped on zuul02 for the scheduler (we expected and wanted this) | 17:42 |
clarkb | fungi: ya | 17:42 |
clarkb | reloading the scheduler failed (I guess I should've expected this too, going to check if there are any tasks we want that run after that) | 17:43 |
clarkb | otherwise looks good from the ansible side | 17:43 |
*** ralonsoh has quit IRC | 17:43 | |
clarkb | ah that is a handler so it should happen after everything else | 17:44 |
clarkb | I think that means we are good | 17:44 |
clarkb | zuul-web has a handler too that I don't see firing to reload apache2 so maybe I'll just do that by hand to be double sure | 17:44 |
clarkb | that is done. infra-root can you look over zuul02.opendev.org and see if it looks put together to you? | 17:45 |
clarkb | Note: we do not want zuul containers to be running there yet ( and they are not according to docker ps -a ) | 17:45 |
clarkb | I'm going to test my gearman server config update playbook against ze01 and zm01 next | 17:46 |
clarkb | thats done and looks good. I have restored the zuul.conf states on those two hosts to what they should be for now | 17:50 |
clarkb | I will remove the disable ansible file now | 17:50 |
clarkb | corvus: ^ you've probably got the best sense for what a zuul scheduler should look like. Any chance you may have time to look at zuul02.opendev.org? (Not sure when your openshift.tv thing ends) | 17:57 |
clarkb | specifically the things I'm less sure of are the zk and gearman certs/keys/ca | 17:58 |
clarkb | I'm going to take a break and start heating up some lunch. https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 is the DNS update change that we will need to manually merge during the swap, reviews on that would be great. Also looking over https://etherpad.opendev.org/p/opendev-zuul-server-swap if you haven't yet and double checking zuul02 looks happy | 18:03 |
clarkb | one difference I notice is that /opt/zuul/ doesn't exist on zuul02. We run the queue dumping script out of that so it isn't strictly necessary to have on zuul02 for this swap (and we could clone it to one of our homedirs if we need to) | 18:06 |
clarkb | and now really finding food | 18:07 |
corvus | clarkb: are you having second breakfast? | 18:14 |
corvus | oh lunch | 18:14 |
clarkb | corvus: the kids break from school at ~11am for lunch so I end up eating late breakfast/early lunch often | 18:15 |
clarkb | I'm waiting for the oven to preheat and just found a problem with zuul02: cannot ssh to review.o.o due to host key problems | 18:15 |
corvus | clarkb: yeah, i assume we just cloned /opt/zuul at some point. it's not kept updated. we could just manually do that again. | 18:16 |
corvus | /var/log/zuul looks good (sufficient space) | 18:17 |
clarkb | we do write out a known_hosts at /home/zuuld/.ssh/known_hosts with our gerrit and the opendaylight gerrit host keys in it but at least ours doesn't seem to work for some reason (I've not cross checked the values yet) | 18:17 |
clarkb | I'll keep looking at this after food if no one else beats me to it | 18:18 |
fungi | could be that openssh on the new ubuntu is expecting a different key format? | 18:18 |
fungi | though i would have expected our integration jobs to find that if so | 18:19 |
clarkb | fungi: it reported it used rsa in the error | 18:19 |
clarkb | `ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 zuul@review.opendev.org gerrit ls-projects` is the command I ran as zuuld on zuul02 if anyone else wnts to check it | 18:19 |
corvus | i see a different value reported by the server | 18:19 |
corvus | ssh-keyscan != known_hosts | 18:19 |
corvus | known_hosts on zuul01 == zuul02 | 18:20 |
clarkb | I was just going to say I wonder how zuul01 works, is it possible we bind mount some other bath? | 18:20 |
clarkb | s/bath/path/ | 18:20 |
clarkb | anyway oven needs attention. Back in a bit | 18:20 |
corvus | ssh test on 01 fails too | 18:20 |
corvus | so afaict, they are == | 18:21 |
corvus | maybe client.set_missing_host_key_policy(paramiko.WarningPolicy()) actually also means "don't fail if they differ" | 18:24 |
fungi | comparing /home/zuuld/.ssh/known_hosts right? | 18:25 |
fungi | that seems to be what we bindmount into the container | 18:26 |
*** d34dh0r53 has quit IRC | 18:32 | |
fungi | yeah, seems that's the one | 18:33 |
fungi | clarkb: oh! it's sha2-256 rather than sha1 | 18:35 |
fungi | i think that's the problem? | 18:36 |
fungi | as for why it's not breaking for the current server, corvus's explanation seems reasonable | 18:36 |
fungi | or maybe paramiko is still using sha1 | 18:36 |
clarkb | fungi: gerrit doesn't serve the sha2-256 though iirc | 18:38 |
clarkb | so that would all have to be client side for user verification | 18:38 |
clarkb | basically that shouldn't have any impact on the known_hosts file, its purely a wire thing | 18:39 |
clarkb | and gerrit shouldn't even attempt it because its sshd doesn't support it | 18:39 |
fungi | ahh, maybe i'm just thrown off by the sha2 fingerprint | 18:40 |
clarkb | corvus: does zuul set that client policy? I suspect that may be it if we somehow magically work | 18:40 |
clarkb | also I feel like we've fixed this before (it was a port 22 vs 29418 mixup or something along those lines). I wonder if the changes never merged | 18:40 |
corvus | clarkb: yeah that's a paste from zuul code | 18:41 |
fungi | it does seem like the ssh hostkey hash in the known_hosts there doesn't match what i get for either 22 or 29418 on the current server | 18:43 |
fungi | the one i get for 29418 matches this: https://opendev.org/opendev/system-config/src/branch/master/inventory/service/host_vars/review01.openstack.org.yaml#L73 | 18:45 |
fungi | could we be prepopulating with something other than gerrit_self_hostkey? | 18:45 |
clarkb | fungi: its https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/all.yaml#L1 | 18:46 |
fungi | looks like we're using https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/all.yaml#L1 | 18:46 |
clarkb | the gerrit vars aren't necessarily exposed to the zuul hosts in ansible | 18:47 |
clarkb | maybe the easiest thing to do is update all.yaml to match the gerrit value for now? | 18:47 |
fungi | yeah | 18:47 |
clarkb | ok I'm still juggling food. I can write that chagne after I eat (or feel free to push it I won't care too much :) ) | 18:48 |
fungi | i'm double-checking that gerrit_ssh_rsa_pubkey_contents isn't also used for something different | 18:48 |
*** amoralej|off is now known as amoralej | 18:51 | |
*** amoralej is now known as amoralej|off | 18:52 | |
fungi | this supposedly writes it out on the gerrit server: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerrit/tasks/main.yaml#L107-L113 | 18:52 |
*** DSpider has quit IRC | 18:55 | |
fungi | but the value in ~gerrit2/review_site/etc/ssh_host_rsa_key.pub doesn't match the gerrit_ssh_rsa_pubkey_contents value, it matches the key part from gerrit_self_hostkey | 18:56 |
fungi | same goes for review02... this is most perplexing | 18:57 |
clarkb | fungi: do we override them in private vars? | 18:57 |
fungi | oh, could be | 18:58 |
clarkb | ya looks like we do though I'm not quite in a spot to cross check values. One thing I notice is we don't set it for the zuul schedulers but do for mergers and executors | 18:59 |
fungi | yes, gerrit_ssh_rsa_pubkey_contents is overridden in 7 different places in private host_vars and group_vars :/ | 18:59 |
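(The sort of search that turns those up on bridge; the private vars location is an assumption:)

```
sudo grep -rln gerrit_ssh_rsa_pubkey_contents \
    /etc/ansible/hosts/group_vars /etc/ansible/hosts/host_vars
```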
clarkb | I think that is likely how we end up with the wrong value on the scheduler | 18:59 |
clarkb | I wonder if that all.yaml value in system-config is largely there for testing | 18:59 |
clarkb | I'm thinking maybe we update the zuul-scheduler.yaml group var to include this, rerun service-zuul.yaml, double check ssh works, then make a todo to clean this up? | 19:00 |
clarkb | fungi: if that sounds reasonable I can do the group var update for zuul-scheduler.yaml now | 19:01 |
clarkb | actually I'm beginning to wonder if the wires are super crossed here | 19:01 |
fungi | it feels like we're reusing pubkey file contents as known hosts entries, but not very well | 19:02 |
clarkb | in the zuul executor file it almost feels like this is the value for zuul's ssh pubkey but I need to read that ansible | 19:03 |
clarkb | no nevermind I think we do the right thing we just don't set this value at all on the zuul-scheduler group. However, known_hosts is written by the base zuul role which we run against the zuul group | 19:05 |
clarkb | I think the short term fix here is to set that var in group_vars/zuul.yaml then rerun service-zuul.yaml | 19:06 |
fungi | yes, agreed, it would at least be consistent with how we're doing the other zuul servers | 19:06 |
clarkb | and then push up a change to all.yaml undefining it and sprinkle it into the testing vars as necessary? | 19:06 |
clarkb | I'll do the bridge update now | 19:06 |
fungi | longer term we should probably do something about the divergence between gerrit_ssh_rsa_pubkey_contents and gerrit_self_hostkey in system-config | 19:07 |
fungi | yeah, maybe that | 19:07 |
clarkb | ok that's done. I want to check the git log really quickly, then I'll rerun service-zuul.yaml | 19:08 |
clarkb | fungi: ^ | 19:09 |
fungi | yeah | 19:10 |
fungi | yep, that looks like the correct value | 19:11 |
clarkb | ok running service-zuul.yaml now | 19:11 |
clarkb | Assuming that works are there other sanity checks people think we should run? | 19:13 |
fungi | maybe make sure you can reach the geard port from some other servers? | 19:14 |
clarkb | fungi: ya, I'm not sure how to do that with the ssl stuff but I guess we can try to sort that out. Similarly ensure that zuul02 can connect to the zk cluster | 19:15 |
fungi | oh, though zuul won't be running so, right | 19:15 |
clarkb | ya you'd need to boostrap some stuff | 19:16 |
fungi | you'd need a fake geard listening on the port | 19:16 |
clarkb | probably doable, possibly a lot of work | 19:16 |
clarkb | fungi: yes and one that does ssl | 19:16 |
fungi | if it's not working at cut-over i guess we can sort it out then | 19:16 |
clarkb | ls-projects works now. There is a blank line between the two known hosts entries now though. I don't think this is an issue but maybe it is? | 19:17 |
clarkb | I can ls-projects on the other gerrit in the known hosts so ya seems to not be a problem | 19:19 |
clarkb | I can connect to zk04 from zuul02 using nc. Not sure how to set it up to use the ssl and auth stuff | 19:21 |
clarkb | iptables and ip6tables report that port 4730 is open for gearman connections at least | 19:21 |
clarkb | without doing a ton of additional faked-out bootstrapping and learning to use the zk client with ssl by hand, I'm not sure there's more connectivity checking we can do | 19:22 |
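(About the most that can be checked without real credentials loaded is that the TLS handshake succeeds on the relevant ports; a hedged example, where the zookeeper TLS port is an assumption and full verification would also need the client cert/key/CA paths from zuul.conf passed via -cert/-key/-CAfile:)

```
# From a merger/executor: confirm the gearman TLS port on the new scheduler answers
openssl s_client -connect zuul02.opendev.org:4730 </dev/null
# From zuul02: confirm the zookeeper TLS port answers
openssl s_client -connect zk04.opendev.org:2281 </dev/null
```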
clarkb | https://etherpad.opendev.org/p/opendev-zuul-server-swap I think I'm fast approaching step 7's "when happy" assertion | 19:23 |
clarkb | s/step/line/ | 19:23 |
clarkb | I also gave the openstack release team a heads up a little while ago | 19:23 |
clarkb | zuul seems fairly idle too. Maybe let people look at 02 for another hour or so and then plan to proceed with the swap? If I can get a second set of hands to help with things like manually merging the dns update and updating rax dns that would be great. Then I can focus on the ansible-playbook stuff | 19:25 |
fungi | yeah, sounds fine. i'll be around | 19:50 |
fungi | was hoping to go out for a walk this afternoon, but our pest control people were due to come today and they still haven't shown up, so probably not getting out for a walk | 19:51 |
corvus | clarkb: when do you want to start? any chance it's ~now? | 20:04 |
*** clayg has quit IRC | 20:09 | |
*** fresta has quit IRC | 20:09 | |
*** jonher has quit IRC | 20:09 | |
*** clayg has joined #opendev | 20:09 | |
*** jonher has joined #opendev | 20:09 | |
clarkb | corvus: in about 15-20 minutes? | 20:10 |
clarkb | but I guess I can speed up and do ~now :) | 20:10 |
*** mhu has joined #opendev | 20:11 | |
clarkb | I've started a root screen on bridge | 20:11 |
corvus | clarkb: i'm ready to help whenever you are :) | 20:11 |
clarkb | thanks! | 20:11 |
corvus | (and my next task is a power-out ups maintenance, so i can't really overlap :) | 20:12 |
clarkb | corvus: can you review https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 but don't approve it yet? one of the steps in the plan is to manually merge that in gerrit after we stop zuul | 20:13 |
clarkb | (and maybe you want to get ready to be able to manually merge that when the time is appropriate?) | 20:13 |
clarkb | I'll tell the openstack release team we will proceed shortly | 20:14 |
corvus | lgtm | 20:14 |
clarkb | ok I ran the disable-ansible script to prevent the hourly jobs from getting in our way | 20:15 |
clarkb | I've notified the openstack release team. I think I'm ready to dump queues and stop zuul. Once zuul is stopped someone can submit https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 in gerrit | 20:16 |
corvus | i'll look up how to do that :) | 20:16 |
clarkb | corvus: I think you do the promotion thing via ssh with your admin creds now to your regular user. then unpromote after being done, or you can try to do it as the admin user alone | 20:17 |
clarkb | corvus: do you think I should proceed with dumping queues and stopping zuul? nothing to wait on on your end? | 20:17 |
corvus | yeah pulled up the docs at https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#gerrit-admins | 20:17 |
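(The manual merge itself boils down to a Gerrit ssh command once an account is temporarily in the Administrators group; change and patchset numbers are placeholders here:)

```
ssh -p 29418 <user>@review.opendev.org gerrit review --submit <change>,<patchset>
```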
corvus | clarkb: i think you're gtg | 20:17 |
*** fressi has joined #opendev | 20:18 | |
clarkb | ok queues dumped. Stopping zuul next | 20:18 |
clarkb | zuul is stopped I think you can approve the dns update whenever you are ready | 20:19 |
clarkb | s/approve/submit/ | 20:20 |
corvus | ack | 20:20 |
openstackgerrit | Merged opendev/zone-opendev.org master: Swap zuul.opendev.org CNAME to zuul02.opendev.org https://review.opendev.org/c/opendev/zone-opendev.org/+/790482 | 20:21 |
*** fressi has quit IRC | 20:21 | |
clarkb | once that shows up on gitea servers I'll run the nameserver playbook | 20:21 |
corvus | submitted | 20:21 |
clarkb | looks like it is there | 20:21 |
clarkb | running nameserver playbook now | 20:21 |
clarkb | my record ttl is under a minute now. Will check it resolves properly before proceeding | 20:23 |
corvus | zuul.opendev.org. 300 IN CNAME zuul02.opendev.org. | 20:23 |
clarkb | zuul.opendev.org. 300 IN CNAME zuul02.opendev.org. | 20:24 |
corvus | that's from my local resolver | 20:24 |
clarkb | yup I see the same thing. Proceeding with updating gearman config on executors and mergers | 20:24 |
clarkb | corvus: maybe you want to do a status notice? also I think the rax update for zuul.openstack.org is less urgent but that needs doing too | 20:24 |
* fungi joins the screen session late-ish | 20:24 | |
corvus | fungi: ^ rax updates maybe? | 20:25 |
fungi | yeah, i can get that now | 20:25 |
corvus | infra-root calls "not it" on ttw dns updates ;) | 20:25 |
clarkb | ok the gearman config update lgtm on zm04 so I'll proceed to the next step which is starting zuul again | 20:26 |
clarkb | corvus: you ready for ^? | 20:26 |
corvus | clarkb: yep | 20:26 |
clarkb | things are started | 20:27 |
clarkb | looks like zuul01 was properly ignored | 20:27 |
clarkb | s/started/starting/ | 20:27 |
corvus | status notice Zuul has been migrated to a new VM. It should be up and operating now with no user visible changes to the service or hostname, but you may need to reload the status page. | 20:27 |
corvus | clarkb: how's that for sending in a minute or so? | 20:27 |
clarkb | lgtm | 20:27 |
clarkb | I'm going to sort out copying the queues.sh script now while we wait for it to come up | 20:28 |
corvus | it's saving keypairs. i'll double check some shasums | 20:28 |
fungi | okay, zuul.openstack.org cname has been updated to point at zuul02.opendev.org, ttl was already 5min | 20:28 |
corvus | fungi: what about updating it to point to zuul.opendev.org? | 20:29 |
fungi | oh, could do that too | 20:29 |
clarkb | fungi: ya I checked its ttl a few days ago and it was already low | 20:29 |
corvus | it's a double cname, but at this point we're probably not worried about network efficiency for folks still using it | 20:29 |
fungi | nsd should be smart enough to return both records when asked for the zuul.opendev.org cname, at least i know bind does it that way | 20:29 |
corvus | spot checks of some keys match on both hosts | 20:29 |
corvus | good point | 20:30 |
clarkb | ok I've just realized that to restore queues I need the zuul enqueue command to be present. I suspect this may be easiest if I start a shell on the scheduler container and run it there? | 20:30 |
clarkb | I assume we don't want a global install anymore? | 20:30 |
corvus | clarkb: ++ | 20:30 |
fungi | okay, zuul.openstack.org is now updated to be a cname to zuul.opendev.org | 20:30 |
clarkb | fungi: thanks | 20:30 |
fungi | saves us a step on future server changes | 20:30 |
corvus | clarkb, fungi: did we not make a "zuul" alias? | 20:31 |
corvus | i think we did that with nodepool | 20:31 |
corvus | anyway, container shell for now, then later we can make a one-liner to have "zuul" do "docker-compose exec ...." | 20:31 |
fungi | oh, shell command alias/wrapper? | 20:31 |
corvus | fungi: ya | 20:31 |
clarkb | corvus: I don't see any using `which zuul` | 20:32 |
corvus | huh, i can't find that on nodepool either | 20:32 |
corvus | probably a change from mordred sitting in review | 20:32 |
clarkb | corvus: zuul02:/root/queues.sh has been edited to do docker exec can you check that and see if it looks right? | 20:32 |
fungi | looks like /usr/local/bin/zuul on the old server is the old style entrypoint consolescript | 20:32 |
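(A hedged sketch of the container-shell approach and the one-liner wrapper corvus mentions above; the compose file path, service name, and change/patchset values are assumptions or placeholders:)

```
# interactive wrapper idea: make "zuul" exec the CLI inside the scheduler container
alias zuul='docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul'
# queues.sh entries then take roughly this form
zuul enqueue --tenant openstack --pipeline check \
     --project opendev.org/opendev/system-config --change <change>,<patchset>
```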
clarkb | I think zuul is up now | 20:33 |
clarkb | I'm ready to run the queues.sh script if it looks good to yall | 20:33 |
corvus | clarkb: lgtm. will be slow, but script isn't long. | 20:33 |
clarkb | ok running it now | 20:33 |
fungi | from here i get to the webui | 20:33 |
fungi | nothing enqueued yet | 20:33 |
fungi | i take that back, four changes enqueued | 20:34 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Fix typo in gerrit sysadmin doc https://review.opendev.org/c/opendev/system-config/+/791314 | 20:34 |
clarkb | they should be showing up now ya | 20:34 |
corvus | there's another for ya :) | 20:34 |
fungi | make that five ;) | 20:34 |
corvus | retry_limit | 20:35 |
clarkb | ya but that's an airship job that does that a lot? may or may not indicate a problem | 20:35 |
clarkb | oh I'm seeing a number of other retries now | 20:35 |
corvus | other attempts | 20:35 |
clarkb | probably not airship specific | 20:35 |
clarkb | yaml.constructor.ConstructorError: could not determine a constructor for the tag '!encrypted/pkcs1-oaep' I see that on ze01 | 20:37 |
corvus | clarkb: we may be out of version sync | 20:37 |
corvus | probably need to run the pull playbook then a full restart | 20:37 |
clarkb | corvus: ok | 20:37 |
fungi | oh, yes that makes sense, that work did just merge | 20:37 |
corvus | (my guess is scheduler version is > executor) | 20:37 |
fungi | so executors/mergers need restarting | 20:38 |
clarkb | running pull now | 20:38 |
corvus | #status notice Zuul is in the process of migrating to a new VM and will be restarted shortly. | 20:38 |
openstackstatus | corvus: sending notice | 20:38 |
corvus | clarkb: in the mean time, can you save queues again (to a second file)? | 20:38 |
-openstackstatus- NOTICE: Zuul is in the process of migrating to a new VM and will be restarted shortly. | 20:38 | |
clarkb | ya though I may need to do it on 01 | 20:39 |
clarkb | corvus: the pulls seem to report they didn't do updates? | 20:39 |
corvus | clarkb: probably ok | 20:39 |
corvus | (probably images were already local) | 20:39 |
clarkb | but why would we be out of sync in that case? | 20:39 |
corvus | (probably only restarts are needed, but good to double check) | 20:39 |
clarkb | not sure I understand that last message | 20:40 |
fungi | clarkb: executors weren't restarted when the images updated | 20:40 |
corvus | oh hrm, if there really was a global restart with everything up to date then... | 20:40 |
fungi | were they? | 20:40 |
clarkb | fungi: the output from the zuul_pull.sh implies this is the case | 20:40 |
clarkb | Pulling executor ... status: image is up to date for z... | 20:41 |
fungi | 20:27 start time, so yeah i guess they were | 20:41 |
corvus | clarkb: let's make sure the image on zuul02 is up to date -- did the pull playbook do that? | 20:41 |
clarkb | checking | 20:41 |
openstackstatus | corvus: finished sending notice | 20:41 |
clarkb | Pulling scheduler ... status: image is up to date for z... | 20:42 |
clarkb | looks like it | 20:42 |
corvus | clarkb: can we do one more full stop / start just to make sure we got everything? | 20:42 |
corvus | then if it happens again, we'll call it a bug | 20:42 |
clarkb | yes, I'll do the dump then stop | 20:42 |
*** vishalmanchanda has quit IRC | 20:43 | |
fungi | definitely worrisome that executors couldn't parse the secrets from zk | 20:43 |
clarkb | stopped, running start now | 20:43 |
corvus | fungi: that was parsing the secrets over gearman | 20:43 |
fungi | ohh | 20:44 |
fungi | right that's still going through gearman | 20:44 |
clarkb | we should be coming back up again | 20:44 |
corvus | we ship secret ciphertext over gearman now, then executors decrypt, we only got to the "decode off the wire" stage on the executor, not quite as far as the decrypt step | 20:44 |
corvus | fungi: if they got to the decrypt step, they would get the keys from zk | 20:44 |
fungi | yeah i forgot the secrets hadn't moved to zk as part of the serialization work | 20:45 |
corvus | the latest scheduler and executor images were built from the same change | 20:45 |
corvus | (i checked docker image inspect on zuul02 and ze10) | 20:45 |
corvus | "org.zuul-ci.change": "788376", on both | 20:45 |
clarkb | I've prepped zuul02:/root/queues.new.sh | 20:46 |
fungi | i'm being hovered over to start heating a wok, but will try to be quick | 20:48 |
*** slaweq has quit IRC | 20:49 | |
clarkb | web loads again | 20:49 |
clarkb | corvus: should I re-enqueue a change maybe? | 20:50 |
corvus | there's one already | 20:50 |
clarkb | yup see a couple now actually | 20:50 |
clarkb | looks like they are retrying | 20:50 |
clarkb | I see the same traceback on ze01 concurrent with the newer restart | 20:51 |
corvus | well, that's fascinating | 20:51 |
corvus | i wonder if yaml is different on our image builds, or if there's a path we're not testing | 20:51 |
clarkb | I6d94c1d8da8b68e5fb60c27e73039155a02fb485 maybe? | 20:52 |
corvus | oh that's certainly the change that broke it, but i don't see how | 20:53 |
clarkb | gotcha | 20:53 |
corvus | there's *extensive* testing of secrets | 20:53 |
clarkb | I suspect that the 13 day old executor image on ze01 isn't the one we want to run with as a fallback | 20:55 |
corvus | we need a sync'd executor and scheduler image | 20:55 |
corvus | and unfortunately i don't think we tagged our last restart | 20:55 |
corvus | but i think we should be able to fall all the way back to 4.2.0? | 20:56 |
clarkb | oh thats an idea | 20:56 |
corvus | (especially now that the keys are on disk) | 20:56 |
corvus | clarkb: i think that's my vote | 20:56 |
clarkb | ok let me see what that looks like as far as changes we need to make | 20:57 |
corvus | oh... 1 thing | 20:57 |
corvus | that might have the old repo layout | 20:57 |
corvus | yep | 20:58 |
clarkb | yes it does | 20:58 |
corvus | that will use extra space and cause a longer restart, but not a blocker | 20:59 |
corvus | (we might as well just rm -rf before restarting to clean up the extra space) | 20:59 |
clarkb | ok so we're still good to proceed. How do we want to do the modification to zuul's docker-compose configs? Should I just manually do that on my checkout of system-config on bridge? | 21:00 |
corvus | clarkb: yeah, why don't you do that | 21:00 |
corvus | clarkb: i'll run the stop playbook now, and delete the git caches | 21:00 |
clarkb | ok my computer has decided now is a fine time to swap or something | 21:00 |
clarkb | I'll work on the docker-compose updates though | 21:01 |
clarkb | image: docker.io/zuul/zuul-scheduler:4.2.0 - does that look right? | 21:03 |
corvus | yep | 21:03 |
clarkb | and I'm doing that for all the zuul services | 21:04 |
corvus | ++ | 21:04 |
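(One hedged way to make that edit across the checkout on bridge rather than by hand; the role/template paths and the assumption that the templates pin :latest are guesses, and editing the files directly works just as well:)

```
cd /home/zuul/src/opendev.org/opendev/system-config
grep -rl 'docker.io/zuul/zuul-.*:latest' playbooks/roles \
    | xargs sed -i 's|\(docker.io/zuul/zuul-[a-z-]*\):latest|\1:4.2.0|'
```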
fungi | okay, dinner's cooked and i'm back | 21:05 |
clarkb | corvus: ok thats done. Are you ready for me to run service-zuul.yaml and then rerun the gearman config updater? | 21:05 |
corvus | oh because the playbook will write old config files? | 21:05 |
openstackgerrit | Ade Lee proposed zuul/zuul-jobs master: Add role to enable FIPS on a node https://review.opendev.org/c/zuul/zuul-jobs/+/788778 | 21:05 |
corvus | clarkb: yes i am ready | 21:05 |
clarkb | ok running in the screen, and yup, because it will update the config file to point at old zuul | 21:06 |
clarkb | so we just rerun the fixer playbook after too | 21:06 |
corvus | clarkb: in fact, i'm now ready for you to proceed all the way to restart | 21:06 |
corvus | (i have stopped everything and cleared out the git repo caches) | 21:06 |
fungi | caught up again, so looks like we did a rollback to 4.2.0 everywhere | 21:06 |
fungi | or are in progress with that | 21:07 |
clarkb | fungi: we're doing that | 21:07 |
fungi | yep, don't let me interrupt | 21:07 |
fungi | but lmk if there's something i can help with | 21:08 |
clarkb | the mergers are pulling 4.2.0 now | 21:11 |
clarkb | then it will be executors then scheduler images | 21:11 |
corvus | while we're waiting, i have a revert staged locally; i'd like to merge that today and restart into it, verify it works, then tag it (timing on that will obviously depend on what happens next, but that's the process i'd like to do) | 21:13 |
fungi | makes sense | 21:13 |
clarkb | sounds good to me | 21:14 |
fungi | i agree i'd have expected the rather extensive testing to pick that up before the change merged though, so the fix is guaranteed to be an interesting one | 21:15 |
clarkb | ok service-zuul is done no errors and rc is 0 | 21:16 |
clarkb | running the config fixup next | 21:16 |
clarkb | corvus: that is done. I think we are ready to start again if you are ready? | 21:16 |
clarkb | do you want to run the start playbook or should I? | 21:17 |
corvus | clarkb: go for it | 21:17 |
clarkb | ok running zuul_start.yaml now | 21:17 |
clarkb | and done | 21:18 |
clarkb | this startup will take longer because repos need to be cloned right? | 21:18 |
corvus | yes | 21:19 |
corvus | this should get a lot faster soon because the files that the cat jobs are asking for are going to be persisted in zk | 21:20 |
fungi | that'll be nice | 21:21 |
corvus | we're actually almost at the point of doing that (i think those changes are just coming ready for review now) | 21:21 |
corvus | aside from the fact that they now have 2 things ahead of them instead of just one :) | 21:22 |
clarkb | is there a way to see progress? it seems idler than I would expect | 21:22 |
clarkb | when tailing the scheduler debug log | 21:22 |
corvus | executor/merger logs and grafana | 21:22 |
clarkb | the merger queue is supposedly 0? | 21:23 |
corvus | the zuul job queue is not high which means we haven't sent the cat jobs out yet | 21:23 |
clarkb | zm01 hasn't merged anything in about 5 minutes | 21:23 |
corvus | we're... ratelimited on github? | 21:24 |
corvus | it's possible we're sitting in a sleep in the github driver waiting to query more branches | 21:25 |
fungi | maybe the successive restarts pushed us over query quota there | 21:25 |
corvus | yeah, i'm like 90% sure that's what's happening | 21:26 |
clarkb | hrm I see we're submitting a small number of cat jobs since the most recent start and they seem to all be things hosted on opendev? | 21:26 |
clarkb | but maybe I'm looking at the wrong logs stuff | 21:26 |
corvus | 2021-05-13 21:20:52,327 WARNING zuul.GithubRateLimitHandler: API rate limit reached, need to wait for 386 seconds | 21:27 |
corvus | opendev is the first tenant, openstack is after that | 21:27 |
corvus | and the github projects are in openstack | 21:27 |
clarkb | got it | 21:27 |
corvus | the 5m has expired and it's running again | 21:27 |
clarkb | and ya I see it doing a lot of work now | 21:28 |
clarkb | and zm01 is happily busy | 21:28 |
corvus | should be able to watch progress on 'zuul job queue' in grafana | 21:28 |
corvus | https://grafana.opendev.org/d/5Imot6EMk/zuul-status?viewPanel=19&orgId=1 | 21:28 |
clarkb | thanks | 21:28 |
rm_work | so, zuul good again? :D | 21:29 |
fungi | rm_work: close | 21:29 |
corvus | about 4,000 git clones away, we hope :) | 21:29 |
fungi | at least we think so. hard to say for sure until we see it actually run some jobs successfully | 21:29 |
rm_work | heh | 21:29 |
rm_work | sitting here, finger hovering the return key on a `git review` | 21:30 |
fungi | since this is a totally new server for the scheduler, there's plenty which could go wrong | 21:30 |
clarkb | assuming this fixes things I would like to continue to land the followup changes, but can leave DISABLE-ANSIBLE in place and do the dns and service-zuul.yaml playbook runs by hand after rebasing my 4.2.0 update onto the zuul01 cleanup change. | 21:34 |
clarkb | then we can remove DISABLE-ANSIBLE when we start corvus' plan for revert and applying the revert and all that | 21:34 |
clarkb | nova clones have started. Hopefully we move quickly after that | 21:35 |
*** darshna has quit IRC | 21:38 | |
corvus | 1200 bottles of beer on the wall.... | 21:42 |
fungi | clone one down, checkouts abound... | 21:43 |
mordred | 1500 bottles of beer on the wall ... | 21:45 |
corvus | nearly done | 21:45 |
clarkb | its up | 21:45 |
clarkb | should I enqueue a change? | 21:45 |
corvus | clarkb: i just did in zuul | 21:46 |
clarkb | k switching tenants on the dashboard | 21:46 |
clarkb | I see a console log | 21:47 |
clarkb | https://zuul.opendev.org/t/zuul/stream/e9f59ab4d01f4f729ec844cba722456b?logfile=console.log | 21:47 |
corvus | and it's actually running playbooks | 21:47 |
corvus | clarkb: maybe re-enqueue now? | 21:47 |
clarkb | corvus: will do | 21:47 |
clarkb | in progress now | 21:48 |
clarkb | and actually now that I think about it I think we're ok to keep the 300 ttl and leave zuul01.openstack.org in the emergency file if we just want to remove DISABLE-ANSIBLE and proceed with revert stuff | 21:49 |
clarkb | it's a fairly safe steady state, just with a low ttl; we can clean that up a bit further out in the near future | 21:49 |
fungi | yep | 21:49 |
clarkb | the only issue is the gearman server directive on zm* and ze* | 21:50 |
clarkb | I can push a new change that only updates that | 21:50 |
clarkb | we have a successful tox-linters job against zuul | 21:50 |
corvus | on a change uploaded by me, no less. shocking! | 21:51 |
clarkb | fungi: corvus: opinions on fixing the gearman server config? do we want to blaze ahead and land the existing changes to do that or would you prefer we stay somewhat nimble with changing ttls and cleaning up zuul01 and I can push a change that only updates the gearman config | 21:51 |
clarkb | (we're steady state right now due to DISABLE-ANSIBLE) | 21:52 |
clarkb | rm_work: I think we're cautiously optimistic at this point if you want to push | 21:52 |
corvus | clarkb: i say go all the way with existing changes | 21:53 |
rm_work | :P | 21:53 |
rm_work | thanks | 21:53 |
clarkb | corvus: ok wfm. | 21:53 |
clarkb | fungi: corvus: can you review https://review.opendev.org/c/opendev/zone-opendev.org/+/790483/ ? | 21:53 |
clarkb | then next is https://review.opendev.org/c/opendev/system-config/+/790484 | 21:53 |
clarkb | that will swap us back to latest but we want that for revert testing anyway | 21:54 |
corvus | clarkb: was already doing that, +2 on both | 21:54 |
clarkb | thanks! | 21:54 |
corvus | clarkb: mind if i go offline for a little while? maybe 30m? | 21:54 |
clarkb | corvus: sure things seem happy enough | 21:55 |
clarkb | corvus: before you go any reason to not remove DISABLE-ANSIBLE? | 21:55 |
ianw | o/ ... just reading the excitement in scrollback ... | 21:55 |
corvus | i'd like to squeeze this ups work in while building/updating is happening | 21:55 |
clarkb | it will revert our gearman config and our docker-compose updates | 21:55 |
clarkb | I'm ok with manually rerunning those playbooks again if necessary | 21:55 |
corvus | clarkb: don't think that's a problem; i don't think we're going to auto restart anything | 21:55 |
clarkb | corvus: ok I'll do that now so that merging those changes can automatically do the right thing | 21:55 |
openstackgerrit | Ade Lee proposed zuul/zuul-jobs master: Add role to enable FIPS on a node https://review.opendev.org/c/zuul/zuul-jobs/+/788778 | 21:56 |
clarkb | and done | 21:56 |
fungi | clarkb: what was in need of fixing with the gearman server config? | 21:56 |
corvus | clarkb, mordred, fungi: maybe you can go ahead and w+1 the revert? | 21:56 |
fungi | yup | 21:56 |
corvus | so that we get new images asap | 21:56 |
clarkb | fungi: the old config points mergers and executors at zuul01.openstack.org. As part of the upgrade I ran an out of band playbook to set it to zuul02.opendev.org instead so that we could control when they swapped over | 21:57 |
clarkb | I have approved the dns ttl cleanup | 21:57 |
clarkb | fungi: the zuul01 cleanup change makes that value zuul02.opendev.org permanently | 21:57 |
clarkb | corvus: yup looking at that change now | 21:58 |
fungi | ahh, yep | 21:58 |
corvus | biab. | 21:58 |
clarkb | ianw: tldr is we swapped zuul01 for zuul02 and discovered some recent changes to zuul executors were not happy. We have since rolled zuul02 back to zuul 4.2.0 which seems to be working | 21:58 |
*** corvus has quit IRC | 21:58 | |
clarkb | ianw: we'll roll forward again with a revert of the zuul executor changes to see that that properly addresses the issue then zuulians can work on fixes | 21:59 |
clarkb | looks like fungi got the zuul revert change so that is in the pipeline | 21:59 |
fungi | we rolled everything back to 4.2.0 right, not just the zuul02 scheduler? | 22:00 |
clarkb | fungi: correct | 22:00 |
clarkb | also to be clear zuul01 is not in use and is in the emergency file (this makes running the zuul stop/start/etc playbooks safe) | 22:01 |
fungi | anyway, cleanup changes are approved | 22:01 |
clarkb | should we do a #status log We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem. | 22:02 |
clarkb | er should that be #status notice? | 22:02 |
clarkb | oh you know what we didn't copy is the timing dbs but meh | 22:02 |
fungi | a bit wordy, but sure it works | 22:02 |
clarkb | #status notice We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem. | 22:03 |
openstackstatus | clarkb: sending notice | 22:03 |
-openstackstatus- NOTICE: We are cautiously optimistic that Zuul is functional now on the new server. We ran into some unexpected problems and want to do another restart in the near future to ensure a revert addresses the source of that problem. | 22:03 | |
openstackgerrit | Merged opendev/zone-opendev.org master: Reset zuul.o.o CNAME TTL to default https://review.opendev.org/c/opendev/zone-opendev.org/+/790483 | 22:03 |
openstackstatus | clarkb: finished sending notice | 22:06 |
ianw | ok cool thanks. i guess the zuul gate isn't tied up in devstack issues at least | 22:06 |
clarkb | ianw: yup and re devstack it sounds like gmann wants to revert and do the ovn switch properly | 22:10 |
clarkb | rather than try and fix every new random issue that was masked by using the CI jobs to do the switch and not the devstack configs | 22:10 |
clarkb | service-nameserver job has started | 22:11 |
clarkb | I had to reapprove 790484 because fungi approved it before its dependency merged. But that is done now | 22:13 |
ianw | clarkb: yeah, i think i'm on board with the revert too, because it's not quite as simple to just stick that new file in and be done with it | 22:13 |
clarkb | ianw: ya it basically entirely relied on changing the job config to work in the job and won't work anywhere else | 22:14 |
ianw | i think it's probably worth making ensure-devstack install devstack from zuul checkout and running that in the devstack gate as a check on this | 22:14 |
clarkb | I see the new TTL for zuul.opendev.org so that looks good | 22:15 |
clarkb | ianw: ya devstack could run a job that uses it with an up to date checkout so that pre merge testing works but otherwise its not doing much | 22:15 |
clarkb | at this point I think we are just waiting on the zuul01 cleanup change and corvus' revert in zuul. I haven't seen anything from zuul jobs that have run to indicate any major problems | 22:16 |
clarkb | the one problem I identified is we didn't copy the timing dbs over to the new server so we don't have that data, but it's not the end of the world | 22:16 |
ianw | i get that the ensure-devstack role is explicitly about using devstack, not testing it, but "can i use devstack" is a pretty good devstack test as well :) | 22:16 |
clarkb | ++ | 22:17 |
clarkb | the hourly service-zuul.yaml run is likely to run before my zuul01 removal change lands. What this means is our docker-compose.yaml files on scheduler+web, mergers, and executors will be updated, as will the zuul.conf on mergers and executors, removing the manual changes we have made | 22:24 |
clarkb | this should be fine as long as we put those changes back again before doing a restart | 22:24 |
clarkb | in fact I may just run the gearman config fixup playbook after that run finishes as we want the docker-compose changes to go away to do the revert | 22:25 |
fungi | which may be easier than temporarily adding them all to the emergency list | 22:25 |
clarkb | yup | 22:25 |
clarkb | and when we get to the restart point we only need the gearman fixes in place, which means as long as I've put those back we could potentially restart things on the revert before the zuul01 cleanup has fully applied (then I can just manually go through it after the fact or let the periodic run catch it tonight) | 22:26 |
clarkb | service-zuul.yaml is done now. I'll rerun the out of band gearman config fix now | 22:53 |
clarkb | and that is done. We should be able to restart zuul safely on the revert now (docker-compose is back to latest) as long as 790484 runs before the next hourly pass | 22:54 |
*** corvus has joined #opendev | 22:56 | |
corvus | o/ | 22:57 |
clarkb | corvus: all seems to be going well so far | 22:58 |
clarkb | corvus: still waiting on the zuul01 removal to land, but I went ahead and reran the gearman config fixup after the previous hourly service-zuul.yaml ran | 22:58 |
corvus | cool, looking at eavesdrop now... | 22:58 |
clarkb | corvus: I think that means we're in an ok spot right now to restart on the zuul revert (docker-compose should point to latest now and gearman configs are correct) | 22:58 |
corvus | i *think* we can copy over the timing db if we want | 22:59 |
corvus | but also, it'll sort itself out soon :) | 22:59 |
clarkb | corvus: I'm not super worried about it because ya it will handle itself | 22:59 |
fungi | it may even clean itself up a bit | 22:59 |
clarkb | if we are going to do a restart maybe wait for 790484 to land first so that we can get that in without waiting again (it should be soon I think) | 22:59 |
corvus | looks like the revert just landed | 23:00 |
clarkb | is this restart going to be another slow one because it will clone into the other repo paths? | 23:00 |
corvus | clarkb: yep :( | 23:00 |
clarkb | (wondering if we should keep the old repos around or if that complicates stuff somehow) | 23:00 |
corvus | no choice, zuul is going to delete them this time | 23:00 |
clarkb | got it | 23:00 |
corvus | (i thought about leaving the other ones around, but it would have been more surgery to delete the current ones when we switch back because zuul *wouldn't* do it in that case) | 23:01 |
clarkb | so ya my only suggestion then is maybe wait for 790484 to land, I think it may only be ~10 minutes away? then do restart? | 23:01 |
corvus | (because it would have thought it already did the delete of the old scheme) | 23:01 |
corvus | cool, i have some screws i still need to attach, i'll go do that :) | 23:01 |
clarkb | ianw: oh btw I ended up just changing the order of the altnames in the cert to fix the issue you were looking at. I was trying to keep moving parts to a minimum, but if we still like those improvements (the log capture for sure) we can do followups to land them | 23:05 |
clarkb | and this way we don't need to add more acme records to openstack.org | 23:05 |
clarkb | that will just get deleted next week :) | 23:05 |
ianw | yeah that's cool. i think i had in my head the idea that we'd mostly use the inventory_hostname for consistency in general, but whatever works | 23:07 |
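A spot check like the following (a sketch, not necessarily the exact command anyone ran here; the -ext option needs openssl 1.1.1+) shows which altname ended up as the certificate subject and how the SANs are ordered:

    echo | openssl s_client -connect zuul.opendev.org:443 -servername zuul.opendev.org 2>/dev/null \
        | openssl x509 -noout -subject -ext subjectAltName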
openstackgerrit | Merged opendev/system-config master: Clean up zuul01 from inventory https://review.opendev.org/c/opendev/system-config/+/790484 | 23:09 |
clarkb | corvus: ^ my timing was really good too :) | 23:10 |
corvus | approximately 10 minutes :) | 23:10 |
corvus | clarkb: so what's next? | 23:10 |
clarkb | corvus: I just looked at zm01 and ze01 and they both have the correct docker-compose (latest) config as well as gearman configs | 23:11 |
clarkb | corvus: maybe run the pull script and double check that image looks like the one we want on the hosts and we can restart? | 23:11 |
corvus | running pull now | 23:12 |
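One way to double check what a pull actually landed on each host is to compare the image id/digest after pulling; the compose directory and image name below are assumptions for illustration, not the exact opendev layout:

    # pull whatever the compose file references, then inspect the local image
    cd /etc/zuul-scheduler && sudo docker-compose pull
    sudo docker image inspect zuul/zuul-scheduler:latest --format '{{.Id}} {{.RepoDigests}}'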
clarkb | for dumping queues on zuul02 we may need extra tooling. But previously I ran the dump on zuul01 and copied it to zuul02 after zuul01 was stopped with no problem | 23:12 |
clarkb | can probably just use zuul01 for the queue dump still | 23:12 |
corvus | i'll try on 2 | 23:13 |
corvus | python3 ~corvus/zuul/tools/zuul-changes.py https://zuul.opendev.org | 23:14 |
corvus | that works | 23:14 |
clarkb | cool | 23:14 |
corvus | clarkb: then we need to mutate the output right? | 23:14 |
clarkb | corvus: yes you need to prefix the entries with the docker exec command one sec | 23:14 |
clarkb | prefix the data with "docker exec zuul-scheduler_scheduler_1" on each line | 23:15 |
corvus | i'm going to make a custom zuul-changes for now | 23:15 |
clarkb | /root/queues.sh is an example | 23:15 |
clarkb | if you run that as not root you will also need a sudo at the front | 23:15 |
corvus | i'll add it in so it works either way | 23:16 |
corvus | ~root/zuul-changes.py https://zuul.opendev.org | 23:16 |
corvus | sudo docker exec zuul-scheduler_scheduler_1 zuul enqueue --tenant openstack --pipeline gate --project opendev.org/openstack/neutron-lib --change 791134,1 | 23:16 |
corvus | produces output like that ^ | 23:16 |
clarkb | that looks right to me | 23:17 |
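The mutation being discussed is mechanical; a one-liner along these lines (a sketch equivalent to the custom script above, with the tool path and container name taken from the conversation) produces the same prefixed enqueue commands:

    python3 ~corvus/zuul/tools/zuul-changes.py https://zuul.opendev.org \
        | sed 's|^|sudo docker exec zuul-scheduler_scheduler_1 |' > queues.sh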
corvus | clarkb: zuul-promote-image failed for the revert | 23:17 |
corvus | looks like it may have failed after the retag though | 23:17 |
corvus | (failed deleting the tag) | 23:17 |
clarkb | ah ya that races sometimes iirc | 23:18 |
corvus | i'll check dockerhub and see if they look ok | 23:18 |
clarkb | ++ | 23:18 |
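Checking dockerhub can be done from any machine with a reasonably recent docker cli; a sketch (the image names are the published zuul images, assumed here) that shows the digest currently behind each latest tag:

    docker manifest inspect zuul/zuul-executor:latest | grep -m1 '"digest"'
    docker manifest inspect zuul/zuul-scheduler:latest | grep -m1 '"digest"'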
fungi | so long as that was all that failed, i guess we should still be clear to pull | 23:18 |
corvus | hrm, not looking good | 23:20 |
corvus | i think it promoted zuul but not zuul-executor | 23:20 |
corvus | i'll try re-enqueing that | 23:20 |
clarkb | corvus: ok | 23:21 |
*** tosky has quit IRC | 23:23 | |
clarkb | one of the jobs failed but I think it was a js one? | 23:23 |
clarkb | (on the reenqueue) | 23:23 |
fungi | if it's the tarball publication, yeah that's been broken | 23:24 |
corvus | yeah, image promote looks good now | 23:24 |
corvus | i'll pull again | 23:24 |
corvus | done | 23:27 |
corvus | clarkb: i guess we save/restart/reenqueue now? | 23:28 |
clarkb | corvus: I think so. | 23:28 |
corvus | running that now | 23:28 |
clarkb | the gearman config still looks good | 23:28 |
corvus | cat jobs are going; we can watch the graph again | 23:33 |
clarkb | seems like it went faster this time, but still waiting for the layout to be loaded | 23:38 |
corvus | yeah, we may have one job still running? | 23:39 |
corvus | and it may time out, which may mean we have to restart :? | 23:39 |
corvus | i'm not sure why it went faster | 23:39 |
corvus | it was slower than the earlier restarts, but faster than the last one | 23:40 |
corvus | that last job finished | 23:40 |
clarkb | I think zm01 didn't have its old repos cleaned up | 23:40 |
clarkb | which may explain the speed | 23:41 |
corvus | clarkb: oh yep, i forgot to do that on the mergers | 23:41 |
corvus | i'll shut them down and clean them up in a bit | 23:41 |
corvus | loaded now | 23:41 |
clarkb | ok. opendev%2Fsystem-config shows an older system-config head but it should be updated if it actually gets those merge jobs so I don't think that is a problem | 23:41 |
clarkb | looks like web is responding again | 23:42 |
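"Responding" can be verified with a quick poll of the REST API; a minimal sketch (the /api/info and per-tenant status endpoints are standard zuul-web routes):

    curl -sf https://zuul.opendev.org/api/info | python3 -m json.tool
    curl -sf https://zuul.opendev.org/api/tenant/openstack/status >/dev/null && echo "status ok"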
corvus | re-enqueing | 23:42 |
corvus | btw, mouse over the 'waiting' labels on the tripleo-ci jobs | 23:43 |
clarkb | https://zuul.opendev.org/t/openstack/stream/ed1ad277335145788067faf26d412417?logfile=console.log that seems to be actually doing something? | 23:44 |
clarkb | oh nice on the mouse over | 23:44 |
clarkb | spot checking some jobs this looks happy | 23:45 |
clarkb | The next thing to do is probably to manually run service-zuul in the foreground and ensure that it noops? We can wait for zuul to be a bit further along in its happiness before we do that though | 23:46 |
corvus | gimme a sec to manually clean up the mergers | 23:47 |
clarkb | yup no rush | 23:47 |
clarkb | hrm a number are stuck at "updating repositories" but I suspect that is because they need to fetch/clone repos/changes | 23:48 |
corvus | yep | 23:48 |
clarkb | ah yup just saw at least one proceed from that state | 23:48 |
corvus | since the cat jobs didn't do much priming | 23:48 |
corvus | re-enqueue finished | 23:48 |
fungi | oh, is it on-demand clones now? | 23:49 |
fungi | ahh, for the executors | 23:49 |
corvus | ooh, check out the waiting mouseovers on the system-config deploy queue item | 23:50 |
clarkb | I was just doing that :) | 23:50 |
corvus | mergers are back up and running | 23:51 |
corvus | #status log restarted zuul on commit ddb7259f0d4130f5fd5add84f82b0b9264589652 (revert of executor decrypt) | 23:51 |
openstackstatus | corvus: finished logging | 23:51 |
clarkb | I've got the service-zuul.yaml command queued up in the screen if anyone wants to look at it. I think we're ok to run that now and get ahead of the deploy jobs and just ensure that it noops | 23:53 |
clarkb | I'll run that at 00:00 if I don't hear any objections | 23:54 |
corvus | clarkb: lgtm | 23:54 |
fungi | yep, ready | 23:55 |
clarkb | ok I'll go ahead and run it then | 23:55 |
clarkb | (now I mean) | 23:55 |
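The foreground run queued up in the screen session is an ordinary ansible-playbook invocation; a sketch of that kind of run (the path is an assumption about where system-config is checked out on the bastion, not the exact command in the screen), with --diff making a no-op easy to confirm:

    sudo ansible-playbook --diff \
        /home/zuul/src/opendev.org/opendev/system-config/playbooks/service-zuul.yaml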
corvus | we have succeeding jobs | 23:55 |
fungi | so much green | 23:55 |
fungi | both in the screen session and the status page i guess | 23:55 |
ianw | clarkb: ++ be good to know it's in a steady state | 23:59 |