*** tosky_ is now known as tosky | 14:31 | |
Clark[m] | fungi: I'm just about around. I should find some food. | 14:32 |
fungi | cool, i have some notes in the questions section of https://etherpad.opendev.org/p/listserv-inplace-upgrade-testing-2021 mainly based on experiences from the lists.k.i upgrade | 14:33 |
Clark[m] | fungi: one thing I thought of was we still want to disable the mailman service even though we don't use the main service on that server. That way we don't get a useless mailman set of processes running from the systemd switch | 14:37 |
fungi | yeah, i was unable to work out how to add a disable symlink for the systemd unit for it in advance though | 14:40 |
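For what it's worth, masking (a symlink to /dev/null in /etc/systemd/system) is one way a unit can be neutralized before it even exists; a hedged sketch of that approach, which was not actually done here, and the unit name is an assumption:

```sh
# Hedged sketch: pre-masking a unit so the sysvinit-to-systemd migration
# cannot start it. The unit name is an assumption based on services
# mentioned later in this log.
ln -s /dev/null /etc/systemd/system/mailman.service
systemctl daemon-reload
systemctl is-enabled mailman.service   # should report "masked"
```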
clarkb | fungi: oh and we should consider disabling the backup crons as well so that we aren't trying to run those alongside other work | 14:47 |
clarkb | and then its possible that the borg install may need manual fixup since that is pip installed iirc | 14:49 |
clarkb | post upgrade I mean | 14:50 |
fungi | yeah, i've also added the corresponding enables at the end for those, but we can determine after what testing should be done | 14:50 |
fungi | i do have a feeling anything pip installed system-wide will be just plain broken, at least until ansible reruns | 14:51 |
fungi | since it will be for an entirely incorrect python interpreter | 14:51 |
clarkb | yup and even ansible running may not be sufficient because it will see the install is present | 14:51 |
clarkb | but shouldn't be too bad to sort that out post upgrade | 14:51 |
fungi | maybe not? ansible will be asking pip if it's installed and pip will be asking the newer python which will say it isn't | 14:52 |
fungi | (i think?) | 14:52 |
clarkb | fungi: it might be a virtualenv? but ya maybe that will error in the right way too | 14:52 |
fungi | we'll just have to see, but yeah maybe the first thing we do after taking it out of the emergency disable list is a manual ansible run? | 14:53 |
clarkb | ya it does an ansible pip module command in a virtualenv | 14:53 |
clarkb | fungi: that seems like a reasonable thing to do | 14:53 |
clarkb | I think we can move the venv aside and have it reinstall | 14:53 |
clarkb | if it doesn't handle it automatically I mean | 14:53 |
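A minimal sketch of the post-upgrade fixup discussed above, assuming the borg virtualenv lives under /opt; the path is an assumption, not taken from this log:

```sh
# Hedged sketch: move the stale virtualenv aside so the next Ansible run
# recreates it against the new system Python. The path is an assumption.
mv /opt/borg /opt/borg.old-python
# then remove the host from the emergency disable list and trigger a manual
# Ansible run from the bastion so the pip tasks reinstall into a fresh venv
```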
fungi | i'm in the rackspace dashboard ready to click the clicky for imaging once the server is down | 14:54 |
fungi | i'll do the steps on lines 280-292 now | 14:54 |
clarkb | sounds good | 14:54 |
fungi | status log should be sufficient for this maintenance, you think? | 14:59 |
fungi | announced well in advance and it's a slow time | 14:59 |
clarkb | ++ | 14:59 |
fungi | #status log The mailing list services for lists.airshipit.org, lists.opendev.org, lists.openstack.org, lists.starlingx.io, and lists.zuul-ci.org will be offline over the next 6 hours for server upgrades, messages will be sent to the primary discussion lists at each site once the maintenance concludes | 15:01 |
opendevstatus | fungi: finished logging | 15:01 |
fungi | okay, ready for me to poweroff the server? | 15:01 |
clarkb | I guess so | 15:02 |
fungi | here we go! | 15:02 |
fungi | as soon as the api reports it shutoff i'll begin the snapshot | 15:02 |
clarkb | I guess also check your ssh connection has gone away? | 15:03 |
fungi | yes | 15:04 |
fungi | if the console indicates it's done but it hasn't gone offline i'll server stop it | 15:04 |
fungi | yeah, vnc can't connect, i'll issue a server stop | 15:05 |
fungi | now it's showing shutoff, proceeding with image creation | 15:05 |
fungi | it's queued, name is lists.openstack.org_2021-09-12_pre-upgrade | 15:07 |
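fungi drove this through the Rackspace dashboard; for reference, a hedged openstackclient equivalent of the same snapshot flow (server name and the polling step are illustrative assumptions):

```sh
# Hedged sketch of the snapshot step via openstackclient rather than the dashboard.
openstack server stop lists.openstack.org
openstack server image create --name lists.openstack.org_2021-09-12_pre-upgrade \
    lists.openstack.org
# poll until the image status reports "active"
openstack image show lists.openstack.org_2021-09-12_pre-upgrade -f value -c status
```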
clarkb | does the web dashboard give you a progress indicator for that? I think the api does if you can figure out how to work it, but also unsure how accurate it is in any case | 15:07 |
fungi | basically the same. it gives you a useless progress indicator | 15:07 |
fungi | right now it says "queued: preparing for snapshot..." | 15:08 |
fungi | this will take a while, so i'm going to step away and just check it every few minutes until it finishes | 15:08 |
fungi | no point in hovering | 15:08 |
clarkb | sounds good I'll do the same then. Ping me when you see it as done | 15:09 |
fungi | where "a few minutes" is somewhere between 3-4.5 hours based on earlier testing, i think? | 15:09 |
fungi | i will ping you when we're done with this wait, yep | 15:09 |
clarkb | ya iirc it was in the range of several hours. But the server was online then. Hopefully it goes a little quicker with it offline | 15:10 |
*** diablo_rojo is now known as Guest7049 | 15:51 | |
fungi | here's hoping | 16:02 |
fungi | it's at "saving" now, noting "this step duration is based on underlying virtual hard disk (vhd) size" | 16:04 |
fungi | clarkb: it's active now | 17:35 |
fungi | i'm going to boot the server and make sure the services we disabled haven't started (in a root screen session in case you want to attach) | 17:36 |
Clark[m] | Ok I'm migrating back to my desk | 17:37 |
fungi | root screen session on lists.o.o is up | 17:38 |
clarkb | I'm attached to the screen | 17:38 |
fungi | the services we disabled are still not running based on ps | 17:39 |
fungi | i mived the esm unenroll and puppet uninstall to the beginning of the upgrade, due to complications observed on the lists.k.i upgrade | 17:39 |
fungi | er, moved | 17:39 |
fungi | are you good with that? | 17:39 |
clarkb | ya I noticed that and makes sense to me | 17:39 |
fungi | ua gets weird following a dist upgrade, in particular | 17:40 |
fungi | ua and puppet are cleaned up, proceeding to update the package lists, though this and the dist-upgrade should no-op ideally | 17:42 |
clarkb | yup | 17:42 |
fungi | looks like detaching from ua doesn't actually remove the sources list entries | 17:42 |
fungi | i'm going to clean that up now | 17:42 |
fungi | maybe because ansible? | 17:42 |
clarkb | fungi: no I think they are always there on ubuntu | 17:43 |
fungi | the esm sources? | 17:43 |
clarkb | yes they are prsent on my local bionic fileserver for example but it isn't enrolled in esm | 17:43 |
clarkb | then some other mechanism actually turns it on I think | 17:43 |
fungi | ahh, okay, ignoring that, then | 17:43 |
clarkb | its not entirely clear to me how that works (and I suspect that is by design) | 17:43 |
clarkb | hrm though now I'm double checking and trying to see where they are on my local machine | 17:44 |
clarkb | I remember looking at my local machine when we set up the esm stuff and being confused at how much was present | 17:44 |
fungi | ubuntu-release-upgrader-core is already installed so no need to add it | 17:45 |
clarkb | maybe I was wrong about the sources.list entries. It is in the apt auto update by default | 17:45 |
fungi | note that on bridge.o.o we don't have esm sources by default | 17:45 |
clarkb | which I guess gets ignored if you don't have the sources.list definitions | 17:45 |
clarkb | so ya you can probably clean those up | 17:46 |
fungi | yeah, removing /etc/apt/sources.list.d/ubuntu-esm-infra.list will solve that | 17:46 |
clarkb | its easy enough to reenroll if that becomes necessary so rm'ing that file should be safe | 17:46 |
fungi | doing it now | 17:46 |
fungi | package list re-updated, dist-upgrade still predictably no-io | 17:47 |
fungi | no-op | 17:47 |
fungi | ready for do-release-upgrade? | 17:48 |
clarkb | I guess so seems like everything looks the way we expected. The snapshot succeeded right? that would be the only other thing I can think of to check | 17:48 |
fungi | yes, claims to have a 42.14 gb image in dfw named lists.openstack.org_2021-09-12_pre-upgrade with a status of "active" | 17:49 |
clarkb | then ya I think we are ready | 17:49 |
fungi | pushing the shiny, candy-red button | 17:49 |
fungi | telling it to run the alternate ssh in case we need to fix sshd | 17:50 |
clarkb | yup | 17:50 |
fungi | i'll add the iptables rule for it in a second screen window | 17:50 |
clarkb | I've ssh'd in via port 1022 (but you should too :) ) | 17:51 |
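do-release-upgrade offers to run a fallback sshd on port 1022 during the upgrade; a hedged sketch of the corresponding firewall rule and connection (the exact rule fungi added may differ):

```sh
# Hedged sketch: permit the upgrade's fallback sshd on port 1022, then connect.
iptables -I INPUT -p tcp --dport 1022 -j ACCEPT
ssh -p 1022 root@lists.openstack.org
```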
fungi | i have just now | 17:51 |
fungi | continuing | 17:51 |
fungi | you good with the package changes? | 17:53 |
clarkb | I think so. It seemed the stuff it complained about not having candidates was all version-specific packages that will be replaced by other version-specific packages installed via virtual packages a level up | 17:54 |
fungi | i agree | 17:54 |
fungi | proceeding | 17:55 |
clarkb | neat the default site override thing we do prevents it from trying to generate all the mailman languages | 17:58 |
clarkb | I expect we'll be ok with that since we don't use a lot of languages and have content already ? | 17:58 |
fungi | yeah, seems so. i guess we just check everything out later and see if there's anything broken/missing | 17:58 |
clarkb | ya | 17:58 |
fungi | worst case it'll just be the webui/archive not rendering or looking strangely | 17:59 |
clarkb | and can copy files from list.kc.io if we need them quickly | 17:59 |
clarkb | fungi: we might need to keep the old setting for the arp bit? the comment says xen pv guests need it | 18:02 |
fungi | it wants to update /etc/sysctl.conf, and we seem to specifically set -net.ipv4.conf.eth0.arp_notify = 1 with a comment that it's needed for pv xen guests | 18:02 |
fungi | yeah | 18:02 |
fungi | the maintainers version of that file has all lines commented out anyway, so keeping ours should be fine | 18:03 |
clarkb | I suspect that comes from rax and not anything we set. I say we keep our version | 18:03 |
fungi | i suppose this is an artifact of this being an ancient ported flavor | 18:03 |
fungi | seems the conffile updates occur in a nondeterministic order, but keeping our modified login.defs | 18:10 |
clarkb | etherpad says /etc/login.defs is a keep our version | 18:10 |
clarkb | (we override the uid and gid ranges iirc) | 18:10 |
clarkb | fungi: etherpad says this is a keep of our version? | 18:19 |
fungi | yep | 18:21 |
fungi | and the one after it was as well | 18:21 |
clarkb | looks like it knows about the languages after all. I guess the selection tool needs the site dir but then generation doesn't | 18:22 |
clarkb | (that is good means we probably don't need to do anything) | 18:22 |
fungi | and installing the maintainer's ntp.conf | 18:23 |
fungi | keeping our sshd_config | 18:24 |
clarkb | noting that the current swap device appears to be xvdc1 so the warnings about initramfs there show it doing the right thing | 18:26 |
clarkb | but also we don't really suspend to memory except for server migrations? its probably super minor | 18:26 |
clarkb | potentially related my laptop has started to refuse to suspend to memory on linux 5.14 | 18:27 |
clarkb | I guess I should double check my initramfs and swap device uuids | 18:27 |
clarkb | fungi: this is another keep | 18:27 |
fungi | yep, keeping the unattended-upgrades config | 18:28 |
fungi | it hasn't asked about iptables yet, right? | 18:28 |
fungi | oh, or logind.conf | 18:28 |
clarkb | fungi: not that I have seen | 18:28 |
clarkb | ya neither have shown up | 18:28 |
clarkb | I guess it won't ask about those | 18:31 |
fungi | removing obsolete packages now | 18:31 |
clarkb | time to remove obsolete packages (which means the installs are done) | 18:31 |
fungi | upgrade is complete, before restarting i'll re-disable services in a second screen window | 18:34 |
clarkb | ++ | 18:34 |
fungi | currently there are no apache, mailman or exim processes still | 18:35 |
clarkb | but we will disable them anyway to ensure the systemd units get disabled? | 18:36 |
fungi | yep | 18:37 |
fungi | "Removed /etc/systemd/system/mailman-qrunner.service." | 18:37 |
fungi | that was the only one which changed | 18:37 |
clarkb | and we expected that | 18:37 |
fungi | okay, ready for the reboot? | 18:37 |
clarkb | I think so | 18:37 |
fungi | done | 18:37 |
fungi | it's back up and i've started the root screen session again | 18:38 |
clarkb | give me a sec to join | 18:38 |
clarkb | I'm in the screen | 18:38 |
fungi | looks like htcacheclean started on boot but not apache proper | 18:39 |
clarkb | that should be fine | 18:39 |
fungi | the apt-get clean wants to remove a couple of old linux-headers package versions | 18:39 |
fungi | agreeing to it | 18:39 |
clarkb | fungi: maybe double check that isn't the kernel we are running? | 18:39 |
clarkb | but ya agreeing to it seems safe | 18:40 |
fungi | Linux lists 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | 18:40 |
clarkb | cool that doesn't match what it wants to remove | 18:40 |
fungi | it wants to remove linux-headers-4.4.0-212 and linux-headers-4.4.0-213 yeah | 18:40 |
fungi | (it generally won't remove the kernel you're booted on, and it won't see the headers package as needing cleaned up if that version of the kernel package is installed, for future reference) | 18:41 |
clarkb | good to know | 18:41 |
clarkb | before we start the next upgrade should we double check that swap was enabled properly as a follow-on from the warnings about finding the swap device for resume (I think that doesn't matter as that is just for suspending to memory though) | 18:42 |
fungi | ready to do-release-upgrade for focal? | 18:42 |
fungi | oh, yep | 18:42 |
fungi | lgty? | 18:42 |
clarkb | swap looks good to me. I think we can proceed with the next do-release-upgrade | 18:42 |
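A hedged sketch of the swap sanity check mentioned above (the resume conffile path is a common Ubuntu default, not something confirmed in this log):

```sh
# Hedged sketch: confirm swap came back after the reboot and see what the
# initramfs thinks the resume (suspend-to-disk) device is.
swapon --show                                   # expect /dev/xvdc1 per the earlier observation
free -h
cat /etc/initramfs-tools/conf.d/resume 2>/dev/null
```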
fungi | and underway | 18:43 |
fungi | starting the alternate sshd again | 18:43 |
clarkb | I'm on via the alternate | 18:43 |
fungi | and added the iptables rule for it and connected, yeah | 18:43 |
fungi | continuing | 18:44 |
fungi | accepting the package changes if that also looks right to you | 18:45 |
clarkb | oh I missed it while coordinating some questions around lunch. I'm sure its fine though | 18:45 |
fungi | cool | 18:46 |
clarkb | it looks similar to what I expect from what little I see there :) | 18:46 |
fungi | going now | 18:46 |
clarkb | interesting that it is prompting for the list of services to restart rather than just a binary yes/no restart prompt | 18:47 |
clarkb | I'm ok with that list though it has exim4 and apache in it. | 18:48 |
clarkb | as I suspect that is all the 'yes' selection was doing previously | 18:48 |
fungi | yep, agreeing to it now | 18:49 |
fungi | will need to check that it doesn't actually restart exim4 and apache2 | 18:49 |
clarkb | I added a note to do that in the etherpad | 18:50 |
fungi | it's prompting for a different set of service restarts next | 18:55 |
clarkb | fungi: I guess we accept this list too | 18:55 |
fungi | yep | 18:55 |
clarkb | I think it didn't prompt to auto restart so we just accept the lists it gives us instead | 18:55 |
clarkb | and then we are equivalent to selection yes to auto restart if it had prompted | 18:55 |
clarkb | fungi: it is asking about sysctl.conf. We probably want to double check the diff then keep again? | 19:04 |
fungi | yeah, looking | 19:05 |
fungi | same situation as before, so keeping ours | 19:06 |
clarkb | ++ | 19:06 |
clarkb | This is a keep according to the etherpad because we configure snmpd with ansible | 19:07 |
fungi | yep, keeping | 19:07 |
clarkb | this is another keep of our version | 19:10 |
fungi | keeping our unattended upgrades config too | 19:10 |
clarkb | and another keep for sshd_config | 19:10 |
fungi | yep, kept | 19:12 |
clarkb | time to remove obsolete packages | 19:17 |
fungi | yep, agreeing | 19:19 |
clarkb | now we check that apache2 exim4 et al are not running and reboot? | 19:23 |
fungi | yeah, looking | 19:23 |
fungi | just htcacheclean | 19:24 |
fungi | so i think we're safe to reboot | 19:24 |
clarkb | sounds like it | 19:24 |
fungi | doing | 19:24 |
fungi | so once this is booted again we'll be on focal! | 19:25 |
fungi | it's taking longer to boot than i would expect | 19:26 |
clarkb | ya | 19:26 |
fungi | vnc isn't connecting | 19:27 |
clarkb | I wonder if we'll have to hard reboot it from the api? | 19:28 |
clarkb | seems like vnc (and maybe even ping) should work if it was running at all | 19:28 |
fungi | yeah, still failing to connect. i'll hard reboot the instance | 19:29 |
clarkb | ok | 19:29 |
fungi | a hard reboot put it into error state | 19:30 |
clarkb | hrm | 19:31 |
clarkb | {'message': 'Failure', 'code': 500, 'created': '2021-09-12T19:30:26Z'} doesn't give much indication to what failed | 19:31 |
fungi | fault | {'message': 'Failure', 'code': 500, 'created': '2021-09-12T19:30:26Z'} | 19:32 |
clarkb | do we file an issue with rax? (or a phone call?) | 19:33 |
fungi | i suppose i could try to stop and start the server again | 19:33 |
clarkb | ok | 19:33 |
fungi | a server stop put it into shutoff state at least | 19:34 |
fungi | starting it again now | 19:34 |
fungi | it's staying in shutoff now | 19:35 |
clarkb | ya I see that too | 19:35 |
fungi | tried starting it again, still in shutoff | 19:36 |
clarkb | fungi: are you using the web ui or the api? I wonder if the web ui for the instance shows anything useful that the api might not | 19:36 |
fungi | api | 19:37 |
fungi | well, cli | 19:37 |
fungi | trying from the webui now | 19:37 |
fungi | doesn't seem to do anything | 19:38 |
clarkb | I see it in an error state via openstackclient now fwiw | 19:38 |
clarkb | with the same failure 500 fault message | 19:38 |
fungi | oh, yep, it went back to error | 19:38 |
clarkb | it did update the timestamp on that though so it wasn't just returning the same message | 19:38 |
fungi | the webui says "The server has encountered an error. Contact support for troubleshooting." | 19:38 |
fungi | i guess i'll open a ticket now | 19:38 |
clarkb | ok | 19:38 |
clarkb | is there any info I can help gather? I assume you've got it under control | 19:44 |
fungi | ticket #210912-ord-0000303 | 19:44 |
fungi | yeah, i don't know what else to do at this stage | 19:44 |
fungi | if we get closer to the end of the maintenance window we'll probably want to status notice that things are running over, but in the meantime i guess we hope support gets back to us quickly | 19:45 |
clarkb | Also something like #status notice The server hosting mailing lists for airship, opendev, openstack, starlingx, and zuul entered an error state during its operating system upgrades. We have filed a ticket with the cloud provider to help debug the cause. | 19:45 |
clarkb | ya | 19:45 |
fungi | if they don't get back to us quickly, we'll need to think about what bringing up a replacement server from the image we made looks like | 19:46 |
clarkb | As far as other options go I think we can boot from our snapshot and essentially revert. Ideally we'd do that with a nova rebuild of the existing server to preserve the IP address but that seems risky considering the server is in an error state. We can boot it on an entirely new instance as well (and try really hard to deal with the IP address stuff?). Remember we have to recover the | 19:47 |
clarkb | snapshot to update /etc/fstab to remove the swap partition until we can boot it properly and create a new swap partition | 19:47 |
clarkb | A third option would be to boot a new focal instance and move all of the lists over to it (I don't know what that involves, I suspect it might be difficult to not email all the list owners when the new lists are created) | 19:47 |
clarkb | oh we do have the flag to not email them though I bet we could set that in the hostvars for a new instance | 19:48 |
clarkb | fungi: I'll take a break for lunch now then check back in after to see if we've got a response. Let me know if you think there is anything else I can be doing | 19:50 |
fungi | my gut says this is an old pv flavor and the new focal kernel won't work on it | 19:51 |
clarkb | oh hrm. | 19:51 |
fungi | in which case we might be able to force it to boot with the bionic kernel | 19:51 |
clarkb | ya or have them change the flavor for us under the hood? | 19:51 |
fungi | it rebooted into bionic just fine | 19:51 |
fungi | yeah, maybe, i dunno how messy that is | 19:52 |
clarkb | to something that handles focal (which we do have focal nodes so it is possible) | 19:52 |
clarkb | ya I have no idea either | 19:52 |
clarkb | and yup bionic booted | 19:52 |
clarkb | ok eating lunch will check in in a bit | 19:52 |
fungi | k | 19:52 |
clarkb | fungi: I do wonder if we shouldn't consider a phone call before giving up on waiting. It's annoying to do but iirc ianw wrote down the account verification details in the typical spot for doing that | 19:53 |
clarkb | anyway I'm hungry and I can smell lunch :) | 19:53 |
fungi | yeah, apparently we've been continuously in-place upgrading this since precise (12.04 lts) | 19:54 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM just testing mailman with bionic https://review.opendev.org/c/opendev/system-config/+/808569 | 20:09 |
clarkb | fungi: ^ just to give us an idea if our ansible has any problems with bionic | 20:09 |
clarkb | fungi: thinking out loud here: if we wanted to get prepped we could boot a new instance off of the snapshot and then sort out its swap situation. However, I need a bit more time away from the keyboard so can't help with that just yet | 20:10 |
clarkb | but then if we have to we can update DNS and all that to use that server while we figure out our next move | 20:10 |
fungi | mmm, yeah | 20:11 |
fungi | ticket update | 20:16 |
fungi | The server is failing to boot because the bootloader cannot load the grub/kernel files. | 20:17 |
fungi | message: xenopsd internal error: XenguestHelper.Xenctrl_dom_linux_build_failure(2, " panic: xc_dom_core.c:616: xc_dom_find_loader: no loader\\\"") | 20:17 |
fungi | they recommend booting in rescue mode. i'll give it a shot now | 20:17 |
Clark[m] | https://run.tournament.org.il/xenserver-internal-error-failure-no-loader-found/ | 20:20 |
fungi | i think there must be a bit of a disconnect with the webui. it let me select to reboot into rescue mode, and gave me a temporary ssh password, but the server never actually rebooted | 20:20 |
Clark[m] | fungi: seems that says it is common if you try to run a non pv aware kernel? | 20:20 |
fungi | oh, i just needed to wait a little longer | 20:20 |
fungi | "i'm ssh'd into the rescue rootfs now | 20:22 |
Clark[m] | Maybe check what kernel is installed then we see if maybe we need a different kernel for pv support? | 20:23 |
Clark[m] | I should be back to the keyboard soon | 20:23 |
fungi | looks like xvdb1 is our production rootfs | 20:24 |
fungi | fsck runs clean on it | 20:25 |
fungi | i've got it mounted on /mnt | 20:25 |
fungi | installed kernel packages are linux-image-3.2.0-77-virtual linux-image-4.15.0-156-generic linux-image-4.4.0-214-generic linux-image-5.4.0-84-generic | 20:26 |
clarkb | fungi: the -virtual in the really old kernel makes me wonder if we need a -virtual kernel | 20:27 |
fungi | primary grub boot entry is for 5.4.0-84 | 20:27 |
* clarkb loads up ubuntu package search | 20:27 | |
clarkb | fungi: https://packages.ubuntu.com/focal-updates/linux-image-virtual says the generic kernel is the virtual kernel :/ | 20:30 |
clarkb | that was true for bionic as well, but bionic did boot. So ya maybe they remove paravirtualization support? | 20:30 |
fungi | we could try just switching to 4.15.0-156 in the grub config | 20:31 |
clarkb | ya that would at least allow us to check if it boots I guess | 20:31 |
fungi | though the error really seems like it's having trouble with the bootloader not the kernel? | 20:31 |
fungi | there's a grub-xen package | 20:32 |
clarkb | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699381 <- says it could be the compression algorithm for the initramfs? | 20:33 |
clarkb | https://packages.ubuntu.com/focal/amd64/grub-xen/filelist doesn't seem to have much in that grub-xen package /me goes to look for what is in those files | 20:34 |
fungi | we presently have grub-pc installed | 20:34 |
fungi | but yeah i think this must be related to the server still using a pv flavor | 20:35 |
clarkb | looks like hvm instances have grub-pc installed, but I think that is expected since hvm is far more like a normal thing | 20:35 |
clarkb | https://eengstrom.github.io/musings/booting-xen-on-ubuntu-via-grub-with-uefi | 20:36 |
fungi | https://xenproject.org/2015/01/07/using-grub-2-as-a-bootloader-for-xen-pv-guests/ | 20:36 |
clarkb | hrm my link is for bionic though and it was fine | 20:37 |
fungi | grub-xen seems to be teh grub2 successor of pvgrub | 20:37 |
clarkb | fungi: neat, I guess maybe we try that then? | 20:38 |
clarkb | how do we install that properly on the recovery installation? chroot into the other image and then do it? | 20:38 |
clarkb | https://packages.ubuntu.com/focal/amd64/grub-xen-bin/filelist is installed as a dep of grub-xen and that seems to pull in all the bits there | 20:39 |
fungi | looks like prior to the upgrade we were running grub-pc:amd64 2.02~beta2-36ubuntu3.32 | 20:39 |
fungi | i can try to install grub-xen within a chroot | 20:40 |
fungi | chroot of the production rootfs i mean | 20:40 |
fungi | that can be fiddly when it comes to block device detection for grub-install | 20:41 |
clarkb | fungi: that seems reasonable given the error provided by the cloud I think | 20:41 |
clarkb | fungi: disk image builder does do it somehow | 20:41 |
fungi | but i'll recheck the grub config | 20:41 |
clarkb | I'll see if I can discern what dib does | 20:41 |
fungi | it's going to uninstall grub-gfxpayload-lists and grub-pc | 20:42 |
fungi | oh, that's fun... "Temporary failure resolving 'us.archive.ubuntu.com'" | 20:42 |
clarkb | `/usr/sbin/grub-install '--modules=part_msdos part_gpt lvm biosdisk' --force /dev/loop0` | 20:42 |
clarkb | I think dib manually executes the installation | 20:43 |
fungi | ahh, i'm going to need to mount devfs and proc et cetera | 20:43 |
fungi | yeesh, systemd wants several dozen things mounted (no joke) | 20:45 |
fungi | haha, though the reason dns isn't working is that we configure the server to do lookups through a local unbound | 20:49 |
fungi | undoing that in /etc/resolv.conf for a bit | 20:49 |
clarkb | noted | 20:50 |
fungi | but i did also mount /dev, /proc and /sys because it's probably going to need those when doing stuff with the grub packages | 20:50 |
fungi | complained about being unable to log to /dev/pts | 20:51 |
fungi | because i didn't mount that of course | 20:51 |
fungi | looks like it scanned and chose the rescue boot partition instead of the one i'm chrooted into | 20:52 |
clarkb | fungi: if the package is installed I suspect that you can run something like the command dib runs above though? | 20:53 |
clarkb | though I don't think you should need the force or the modules listing? | 20:54 |
clarkb | ya --force means "install even if problems are detected" we probably want to evaluate those if it finds them | 20:55 |
clarkb | and modules just preloads stuff but the default is to have all the modules be available based on the man page | 20:55 |
fungi | mmm, yeah do i want it installed into the partition or into the mbr? | 20:56 |
fungi | i went ahead and did both but i suspect it'll be the mbr that matters | 20:56 |
clarkb | INSTALL_DEVICE must be system device filename. grub-install copies GRUB images into boot/grub. On some platforms, it may also install GRUB into the boot sector. | 20:57 |
clarkb | I think it is the mbr that matters | 20:57 |
clarkb | but it figures it out sounds like | 20:57 |
fungi | the menu.lst looks like it contains the kernels which are installed in the chroot, so hopefully we're all set | 20:58 |
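Pulling the rescue-mode steps above together, a hedged sketch; device names follow what was reported in this log, and the exact commands fungi ran may have differed:

```sh
# Hedged sketch of the rescue-mode chroot used to install grub-xen.
mount /dev/xvdb1 /mnt                                   # production rootfs
for fs in dev dev/pts proc sys; do mount --bind /$fs /mnt/$fs; done
# (remember: resolv.conf inside the chroot pointed at the local unbound and
# needed a temporary edit before apt could resolve the mirrors)
chroot /mnt apt-get install grub-xen                    # pulls in grub-xen-bin, removes grub-pc
chroot /mnt grub-install /dev/xvdb                      # MBR install is the one that matters
```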
clarkb | fingers crossed | 20:59 |
fungi | should i undo the resolv.conf change, umount /dev, /proc and /sys, umount the rootfs and then reboot into normal mode now? | 20:59 |
fungi | anything else i should check first? | 20:59 |
clarkb | I can't think of anything else, and you probably grok this stuff better than me :) | 20:59 |
clarkb | note you may have to "unrescue" the server to boot into normal mode | 21:00 |
clarkb | not sure what a straight up reboot will do, and ya I would umount the rootfs before unrescuing | 21:00 |
fungi | did all the above and now i've asked it to "exit rescue mode" which i guess is the webui's unrescue | 21:01 |
clarkb | I don't see it pinging yet, any chance the console is helpful? | 21:02 |
fungi | the webui still says it's unrescuing | 21:03 |
clarkb | ah ok | 21:03 |
clarkb | I should learn to be more patient | 21:03 |
fungi | i'll also need to switch to dinner prep mode shortly | 21:03 |
clarkb | the openstack client shows it as error state though | 21:03 |
clarkb | and the timestamp of that is from around when you unrescued | 21:03 |
fungi | ugh | 21:04 |
clarkb | (so I don't think it is stale from an hour or two ago whenever that was) | 21:04 |
clarkb | https://docs.rackspace.com/support/how-to/rebuild-a-cloud-server that does say rebuilding a server preserves its IP address. In theory we can use that to revert back to xenial | 21:05 |
clarkb | I think that might be the best option for now as it should get us back to a working state with the same IP addrs and we won't have to sort through block lists as part of recovery | 21:06 |
clarkb | er for now meaning "if we can figure it out through rescue" | 21:06 |
clarkb | *if we can't | 21:06 |
clarkb | fungi: maybe try tell grub to boot the bionic kernel? | 21:07 |
clarkb | and if that doesn't work we rebuild? | 21:07 |
clarkb | of course once we get to that state I'm not sure how we move forward from there so maybe that is for information gathering rather than a solution | 21:07 |
fungi | yeah, could give the old fallback kernel a shot | 21:08 |
clarkb | its a "oh neat bionic's kernel works for some not understood reason, but we can't keep it up to date so what do?" | 21:08 |
clarkb | also if we think this might be a sit on it and think situation I can probably live with that into tomorrow? but considering a rebuild with our snapshot should revert us pretty cleanly I also like that option | 21:09 |
fungi | tried stopping the server and then rebooting it, but still went back into error | 21:11 |
fungi | reentering rescue mode now | 21:11 |
fungi | exiting rescue mode now after removing the focal kernel entries from /boot/grub/menu.lst | 21:15 |
fungi | though if grub-xen is grub2 based shouldn't that be /boot/grub/grub.cfg instead? | 21:16 |
fungi | Linux lists 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | 21:17 |
fungi | 21:17:23 up 2 min, 1 user, load average: 0.41, 0.20, 0.08 | 21:17 |
fungi | clarkb: ^ | 21:17 |
clarkb | aroo | 21:17 |
fungi | i think it's the focal kernel which is the problem? | 21:17 |
clarkb | ok so it is the kernel? | 21:17 |
fungi | i expect so, though this is still with grub-xen installed too | 21:18 |
clarkb | linux-image-extra-virtual - Extra drivers for Virtual Linux kernel image <- is a thing | 21:18 |
clarkb | ya it could've been both things I suppose | 21:18 |
fungi | i want to try another reboot before we go further | 21:19 |
clarkb | ok | 21:19 |
fungi | the webui never indicated it was done exiting rescue mode but it seems to have been | 21:19 |
fungi | rebooting the server now | 21:19 |
clarkb | seems that extra-virtual is a largely empty package and it just depends on the generic image | 21:19 |
fungi | i think the -virtual means it's a virtual package (not an actual package, just a name reference to another package) | 21:20 |
clarkb | oh I see | 21:20 |
fungi | in debian there's the idea of "virtual packages" (packages which don't really exist, they're just convenient pointers to other package names) and "dummy packages" (basically empty packages which only exist to depend on other packages) | 21:21 |
fungi | though also sometimes people conflate the two | 21:21 |
clarkb | seems the server is still pinging did the reboot complete? | 21:21 |
fungi | 21:21:58 up 1 min, 1 user, load average: 0.45, 0.24, 0.09 | 21:22 |
fungi | yep | 21:22 |
fungi | probably okay to switch to bringing services back up | 21:22 |
clarkb | fungi: but this kernel isn't sustainable? | 21:22 |
clarkb | like what do we do if we need kernel updates? | 21:22 |
fungi | it's not sustainable, but it's probably something we can figure out once services are running again | 21:23 |
clarkb | and if we enable services now rolling back becomes potentially much more difficult | 21:23 |
fungi | https://unix.stackexchange.com/questions/583714/xen-pvgrub-with-lz4-compressed-kernels | 21:23 |
fungi | maybe it's the kernel compression after all, like you surmised | 21:24 |
fungi | "an apt hook using extract-vmlinux to decompress kernels during installation" | 21:24 |
clarkb | ya the comments say there are a couple of security concerns with the way it is written and that it doesn't work as is for focal | 21:25 |
fungi | right, though it suggests we could probably manually decompress the kernel as a temporary workaround | 21:27 |
clarkb | oh I see what you are saying. Do you want to try that now? If that works then I'm good with enabling services | 21:27 |
clarkb | fungi: do you want to do that in a root screen? I'm happy to follow along or have you just do it as well :) | 21:28 |
fungi | maybe, i'm trying to understand extracl-vmlinux first | 21:28 |
fungi | extract-vmlinux | 21:28 |
clarkb | looks like it iterates through all of the various compression algorithms that it might be and bails out when it finds the first one that is valid | 21:29 |
fungi | yep | 21:29 |
clarkb | fungi: we can also see if we can compare the current kernel to the focal kernel's compression type | 21:30 |
clarkb | file says they are both regular files though (but I think that is because the file is "regular" and the vmlinuz is in there somewhere else?) | 21:30 |
fungi | the `file` utility claims both the problem and working kernels are a bzImage | 21:31 |
fungi | you need to use sudo | 21:31 |
clarkb | ah | 21:31 |
corvus | clarkb, fungi: i see there are issues; is there a tl;dr + something i can help with? | 21:31 |
fungi | for some reason i guess ubuntu assumes user-readable kernel files are a security risk | 21:31 |
clarkb | corvus: rax xen cannot boot the focal kernel but can boot the bionic kernel | 21:31 |
clarkb | fungi: note the script in stackoverflow further checks things beyond bzimage | 21:32 |
fungi | corvus: tl;dr is that lists.o.o is a pv xen server, and the focal kernel won't boot on it, but the bionic kernel will | 21:32 |
clarkb | corvus: we suspect it may be the vmlinuz compression type | 21:32 |
fungi | corvus: yeah looks like it may be because ubuntu switched to lz4 compression for kernel files | 21:32 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir https://review.opendev.org/c/openstack/diskimage-builder/+/804000 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Support grubby and the Bootloader Spec https://review.opendev.org/c/openstack/diskimage-builder/+/804002 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: RHEL/Centos 9 does not have package grub2-efi-x64-modules https://review.opendev.org/c/openstack/diskimage-builder/+/804816 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add policycoreutils package mappings for RHEL/Centos 9 https://review.opendev.org/c/openstack/diskimage-builder/+/804817 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add DIB_YUM_REPO_PACKAGE as an alternative to DIB_YUM_REPO_CONF https://review.opendev.org/c/openstack/diskimage-builder/+/804819 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add reinstall flag to install-packages, use it in bootloader https://review.opendev.org/c/openstack/diskimage-builder/+/804818 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: WIP Add secure boot support to ubuntu. https://review.opendev.org/c/openstack/diskimage-builder/+/806998 | 21:33 |
fungi | corvus: so we're trying to decide the best way forward so we feel comfortable bringing services back up on it | 21:33 |
clarkb | corvus: currently we are thinking if we can manage to de|re compress the vmlinuz for focal such that it is bootable then we can roll forward (and use a hook to do that automatically in the future). But if not then we might be best doing a server rebuild against the snapshot we took at the beginning of this process | 21:33 |
clarkb | fungi: `grep -aqo "${LZ4_HEADER}" ${KERNEL_PATH}` is the thing to try against the bionic and focal kernels to see if they differ I guess? | 21:34 |
corvus | ok. i have no direct experience with this, so all i have to offer is another set of eyes/hands if necessary :) | 21:34 |
fungi | corvus: we've temporarily updated the grub config to boot the old bionic kernel instead of the focal one, but are concerned that isn't sustainable long term because we want to be able to update the kernel for vulnerabilities in the future | 21:35 |
clarkb | fungi: I'm thinking we do the grep against the lz4 header and try to confirm that differs between the kernels. If it does we attempt the kernel conversion and reboot. | 21:35 |
clarkb | corvus: https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux is the script that linux provides to do the conversion | 21:36 |
corvus | yeah, it sounds like get something that will work for ~1 week, then regroup / possibly schedule an outage to move to a server with new ip addrs might be a good plan? | 21:36 |
clarkb | corvus: ya that too | 21:36 |
fungi | clarkb: `sudo grep -aqo '\002!L\030' /boot/vmlinuz-5.4.0-84-generic` doesn't find anything | 21:37 |
clarkb | fungi: huh I wonder if is is xz then? | 21:37 |
fungi | aha, it may be escaping | 21:38 |
fungi | lz4match=$(printf '\002!L\030'); sudo grep -aq "$lz4match" /boot/vmlinuz-5.4.0-84-generic | 21:38 |
clarkb | fungi: oh also it is grep -q | 21:38 |
fungi | that exits 0 | 21:38 |
clarkb | you need to check the exit code ya | 21:38 |
fungi | lz4match=$(printf '\002!L\030'); sudo grep -aq "$lz4match" /boot/vmlinuz-4.15.0-156-generic | 21:38 |
clarkb | so that means it matched | 21:38 |
fungi | that exits 1 | 21:38 |
clarkb | nice we have likely tracked this down? what an adventure | 21:38 |
fungi | it was the -o which was making it not match | 21:39 |
fungi | mmm, no not the -o either | 21:40 |
fungi | yeah, have to `printf '\002!L\030'` | 21:40 |
fungi | that's it | 21:40 |
fungi | okay, so that confirms the focal kernel is indeed lz4 | 21:40 |
fungi | anyway, i thought we were very close to being able to wrap this up, but i'm really overdue to switch to dinner prep before christine tries to gnaw off my arm | 21:41 |
clarkb | fungi: reading linux's script it basically finds the positions of the compressed file for the lz4 bits then passes that through a decompression and leaves the other bits behind? I guess you only need that last nit? | 21:42 |
clarkb | s/nit/bit/ | 21:42 |
fungi | right, the offset seems to be skipping the header part | 21:42 |
clarkb | fungi: ya enjoy dinner, I can probably do the conversion while you eat then we can try rebooting after? | 21:42 |
clarkb | I will use linux's script and not the parent one to do this as a one off | 21:43 |
fungi | sure, shall i do a #status notice since we're ~45 minutes past the announced end of our window already? | 21:43 |
clarkb | ++ something along the lines of we've isolated the issue and are working to fix it now | 21:43 |
fungi | status notice The mailing list services for lists.airshipit.org, lists.opendev.org, lists.openstack.org, lists.starlingx.io, and lists.zuul-ci.org are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen | 21:45 |
fungi | like that? | 21:45 |
clarkb | lgtm | 21:45 |
fungi | #status notice The mailing list services for lists.airshipit.org, lists.opendev.org, lists.openstack.org, lists.starlingx.io, and lists.zuul-ci.org are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen | 21:45 |
opendevstatus | fungi: sending notice | 21:45 |
fungi | okay, will dinner as quickly as possible and return | 21:45 |
-opendevstatus- NOTICE: The mailing list services for lists.airshipit.org, lists.opendev.org, lists.openstack.org, lists.starlingx.io, and lists.zuul-ci.org are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen | 21:45 | |
fungi | thanks! | 21:45 |
clarkb | vmlinuz-5.4.0-84-generic is the file we care about right? | 21:46 |
clarkb | /boot/vmlinuz-5.4.0-84-generic I mean | 21:46 |
clarkb | hrm does it actually do it in place? | 21:47 |
clarkb | maybe not | 21:48 |
corvus | clarkb: looks like the result goes to stdout? | 21:49 |
corvus | i think the surgery happens on a tempfile, then the result gets cat'd | 21:49 |
clarkb | corvus: yup I've redirected into a file its in my homedir under kernel-stuff/vmlinuz-5.4.0-84-generic.extracted | 21:50 |
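For reference, a hedged sketch of the decompression step being done here, using the upstream extract-vmlinux script linked above; the output path mirrors the homedir location mentioned, and the exact invocation may have differed:

```sh
# Hedged sketch: decompress the lz4-compressed kernel image with the upstream helper.
wget https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux
chmod +x extract-vmlinux
sudo ./extract-vmlinux /boot/vmlinuz-5.4.0-84-generic \
    > kernel-stuff/vmlinuz-5.4.0-84-generic.extracted
readelf -h kernel-stuff/vmlinuz-5.4.0-84-generic.extracted   # sanity check: should show an ELF header
```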
clarkb | it isn't clear to me how fungi told grub to default to the bionic kernel | 21:51 |
clarkb | the default seems to be entry 0 which is focal? | 21:52 |
corvus | did he do that interactively? | 21:53 |
clarkb | corvus: it was done from a rescue instance | 21:54 |
clarkb | or at least I thought it was. Maybe he caught it in the console instead | 21:55 |
corvus | clarkb: grub/menu.lst entry 0 == uname | 21:55 |
fungi | i edited /boot/grub/menu.lst to remove the newer kernel entries | 21:55 |
clarkb | oh I see the thing says it is 20.04 kernel but its the 4.15 path | 21:55 |
fungi | if you rerun update-grub it should put them back | 21:55 |
corvus | ++ | 21:55 |
clarkb | ok corvus I ran readelf against my extracted kernel and it didn't error | 21:56 |
clarkb | should I copy that over the file in /boot then rerun update-grub? | 21:56 |
corvus | clarkb: file also thinks it's an elf binary :) | 21:56 |
corvus | yeah, you have a backup of the orig, right? | 21:56 |
fungi | yeah, elf is what we want | 21:56 |
clarkb | corvus: yes a backup of the original is in the same dir as the extracted file | 21:57 |
clarkb | corvus: I'll double check shas on that and the one in /boot | 21:57 |
clarkb | sha1s match | 21:57 |
clarkb | I'll do the copy now | 21:57 |
clarkb | and now running update-grub | 21:58 |
clarkb | hrm that didn't put it in menu.lst. | 21:59 |
clarkb | Is menu.lst used? | 21:59 |
clarkb | internet seems to say grub2 doesn't use menu.lst and its all grub.cfg | 22:00 |
clarkb | Should I try a reboot and either it comes back on the old kernel and works, comes back on the new kernel and works, or tries the new kernel and fails and we can recover from there? | 22:01 |
fungi | right, but i have a feeling the bootloader isn't actually being used | 22:01 |
fungi | i think in a xen pv scenario the bootloader is run outside the domu to parse the files in the filesystem | 22:01 |
fungi | so we may not actually be in control of what bootloader is run | 22:02 |
clarkb | does that mean you think I need to edit menu.lst? | 22:02 |
clarkb | or how do I convince it to use the newer kernel? | 22:02 |
fungi | when i edited menu.lst that worked | 22:02 |
corvus | i'm in favor of a menu.lst change | 22:03 |
clarkb | yup it looks like menu.lst~ has the new kernel in it. I've put backups of menu.lst and menu.lst~ in my homedir and will replace menu.lst with menu.lst~ | 22:05 |
clarkb | that is done. Shall I reboot? | 22:06 |
corvus | ++ | 22:07 |
clarkb | ok proceeding | 22:07 |
clarkb | it hasn't come back yet. Openstack API still shows it as active and not yet error'd | 22:09 |
clarkb | I can try to reboot it through the rax api | 22:11 |
clarkb | I guess I should check the console first | 22:12 |
clarkb | vnc fails to connect to the server. I'll try a reboot via the api | 22:14 |
clarkb | Now it has an error status. Hrm | 22:15 |
corvus | does that mean it needs to be rescued again? | 22:17 |
corvus | like rescue it, then edit menu.lst as before and reboot to try to recover? | 22:17 |
corvus | (i'm a little confused why it's in error state and not just sitting at a grub prompt, but maybe that goes to fungi's suggestion that we may not even really be using grub) | 22:18 |
clarkb | corvus: yes I think we need to rescue it and restore the old only use bionic menu.lst | 22:18 |
clarkb | https://wiki.xenproject.org/wiki/Booting_Overview#PVGrub explains (with almost no detail) the grub bypass I think | 22:18 |
clarkb | I'll rescue it and restore that file | 22:18 |
corvus | (i'm going to quietly restart zuul while that's going on) | 22:20 |
corvus | #status log restarted all of zuul on commit 9a27c447c159cd657735df66c87b6617c39169f6 | 22:23 |
opendevstatus | corvus: finished logging | 22:23 |
clarkb | it is back up on the bionic kernel | 22:25 |
corvus | the whole xen thing seems like enough of a black box that i'd vote for just bringing it back up now and then taking the hit of moving ips to complete the upgrade. it's not ideal, but the reputation will catch up eventually. | 22:27 |
clarkb | corvus: bringing it back up now on the bionic kernel you mean and then replace the server in the near future? | 22:29 |
corvus | clarkb: yeah, bringing the public service back up in the current kernel state | 22:30 |
clarkb | got it | 22:30 |
clarkb | I'm going to reboot this server again from within the server to double check it is consistent | 22:31 |
clarkb | https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy we might also try that? | 22:33 |
corvus | that looks promising -- i think my concern is that this server seems to be a one-off, and the only way to test any of this is to test it in production? | 22:34 |
corvus | (because we tested a clone, right? and that didn't have this problem?) | 22:35 |
clarkb | correct | 22:35 |
clarkb | and yes I completely agree | 22:35 |
clarkb | well we tested a clone of lists.katacontainers.io and on a zuul test instance. But if we booted a new image in rax it would be hvm and ya not reproduce | 22:36 |
corvus | yeah, so i think we've really just hit the end of the line on this server, and the longer we try to keep it running, the bigger hole we dig. better to cut losses and start fresh i think. | 22:36 |
clarkb | fair enough | 22:37 |
clarkb | I did edit a menu.lst to try in my homedir but sounds like we'd just prefer to enable services instead? | 22:37 |
clarkb | Let me do a reboot with the config that booted most recently (I haven't modified the server since it booted successfully) just to be sure it seems to consistently come up | 22:37 |
corvus | clarkb: my preference is weak. if you want to keep trying, no objection from me. but i feel like even success looks like we're taking on tech debt. | 22:39 |
clarkb | corvus: well I think we should replace the server either way, but thinking that using the new kernel while we sort that out is best | 22:39 |
clarkb | I am however curious enough to try the chainload. Why don't I try that really quickly | 22:40 |
clarkb | worse case I revert menu.lst again and unrescue | 22:40 |
fungi | okay, i'm back and catching up | 22:40 |
clarkb | fungi: tldr is the uncompressed vmlinuz didn't boot. reverting menu.lst in a rescue instance did get us back to booting | 22:41 |
clarkb | fungi: if you look in ~clarkb/kernel-stuff/menu.lst.chainload I've drafted a menu.lst that attempts to do what https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy describes | 22:41 |
clarkb | I'd like to try that if there aren't strong objections; if it works, great, and if it doesn't work "great" we revert, and either way we reenable services and move on and plan to replace the server | 22:42 |
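A hedged sketch of what such a chainloading menu.lst looks like, per the Xen wiki page linked above; the grub-xen image path and root stanza are assumptions, and clarkb's actual draft lives in his homedir and may differ:

```sh
# Hedged sketch of a pvgrub-legacy menu.lst that chainloads grub2 (grub-xen).
# The image path is an assumption; Ubuntu's grub-xen packaging may place it elsewhere.
cat <<'EOF' | sudo tee /boot/grub/menu.lst
default 0
timeout 2

title Chainload GRUB 2 (grub-xen)
    root (hd0,0)
    kernel /boot/xen/grub-x86_64-xen.bin
EOF
```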
fungi | yeah, that looks worth a try, though note that once we chainload to grub2 that *will* be reading our /boot/grub/grub.cfg instead of menu.lst | 22:44 |
clarkb | yup and that should point to my decompressed file | 22:45 |
fungi | skimming the grub.cfg though it also looks sane | 22:45 |
clarkb | shall I proceed with replacing menu.lst and rebooting? | 22:45 |
clarkb | (also it is possible that the chainload will handle lz4?) | 22:45 |
fungi | the /boot/vmlinuz-5.4.0-84-generic is decompressed, right? | 22:45 |
clarkb | fungi: yes you'll see it is like 6 times the size of the kernel we are booted on | 22:46 |
fungi | yeah, the chainloaded grub2 may also support lz4 | 22:46 |
fungi | :q | 22:46 |
fungi | hah, you're not a vi session | 22:46 |
clarkb | wrong window :) | 22:46 |
clarkb | let me know when you feel comfortable with this menu.lst update and reboot (or that you don't feel comfortable) | 22:46 |
clarkb | and I'll go ahead and do that | 22:47 |
fungi | we may just want regular grub2 rather than grub-xen if we're chainloading, but worth a shot, go for it | 22:47 |
clarkb | fungi: I think that file is installed by grub-xen. proceeding | 22:48 |
fungi | grub-pc being regular grub2 | 22:48 |
fungi | yeah, if grub-xen is a superset of grub-pc then it's irrelevant | 22:48 |
clarkb | menu.lst is updated. rebooting momentarily | 22:48 |
clarkb | I need to reload my keys I think they timed out but it pings now if you want to jump in | 22:50 |
clarkb | will take me a minute to load keys | 22:50 |
fungi | Linux lists 5.4.0-84-generic #94-Ubuntu SMP Thu Aug 26 20:27:37 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | 22:50 |
fungi | lgtm! | 22:50 |
fungi | i'll go ahead and close the rackspace ticket | 22:50 |
clarkb | fungi: maybe leave a note with them about what we discovered? | 22:51 |
clarkb | maybe it will help the next user that hits this | 22:51 |
fungi | oh absolutely | 22:51 |
clarkb | tldr install grub-xen and chainload and possibly decompress the vmlinuz | 22:51 |
clarkb | alright what is the order of operations for restoring services here? fungi do you want to do that? | 22:51 |
clarkb | maybe apache2 first? check archives then enable mailman and make sure it starts happily then enable exim? | 22:52 |
corvus | Sgtm | 22:54 |
fungi | yeah, rackspace ticket now closed with ample detail | 22:54 |
clarkb | fungi: did you want to do the service enabling or should I/ | 22:55 |
fungi | i can start apache up next | 22:55 |
clarkb | thanks | 22:55 |
fungi | i've started a root screen session on lists.o.o | 22:55 |
clarkb | I'll join it | 22:55 |
clarkb | oh wait | 22:56 |
fungi | oh, actually we had a couple of packaging sanity check steps first | 22:56 |
clarkb | did you want to do the other steps first for apt cleanup? | 22:56 |
fungi | i'll do those | 22:56 |
clarkb | ok | 22:56 |
clarkb | also maybe don't clean up old kernels? | 22:56 |
corvus | The chainloader may be a reasonable long-term solution. That might not add much tech debt as long as update-grub doesn't overwrite menu.lst... | 22:57 |
clarkb | corvus: I don't think it does because update-grub does all the grub.cfg stuff now | 22:57 |
fungi | worth backing up the menu.lst we have now just in case | 22:57 |
clarkb | fungi: that is already done in my kernel-stuff dir in my homedir | 22:57 |
fungi | but yes, what clarkb said | 22:57 |
fungi | ahh, right | 22:57 |
clarkb | so ya maybe we get services up and running. Sleep on it and decide what the best course of action is here | 22:57 |
fungi | i'll avoid doing the autoremove for now | 22:58 |
clarkb | ++ | 22:58 |
fungi | should i leave the main mailman.service unit disabled? | 23:00 |
clarkb | yes it isn't used | 23:01 |
fungi | i've reenabled the 5 initscripts for our current sites | 23:01 |
fungi | one more reboot? | 23:01 |
clarkb | fungi: I suggested we start services one at a time above | 23:01 |
fungi | ahh, okay can do, instead of rebooting | 23:01 |
clarkb | apache then check it, then mailman-foo, then exim4 just to be sure its all happy to start | 23:01 |
clarkb | I think we can reboot afterwards | 23:01 |
fungi | apache is up | 23:02 |
clarkb | I get errors loading the listinfo pages for opendev and openstack | 23:02 |
clarkb | Bad header: Please set MAILMAN_SITE_DIR | 23:02 |
clarkb | ok so something about the multisite setup isn't happy? | 23:03 |
fungi | yeah, but i can load the pipermail archives since they're static | 23:03 |
clarkb | SetEnv MAILMAN_SITE_DIR /srv/mailman/opendev <- is set in the apache config | 23:03 |
fungi | and we do load mod_env | 23:06 |
fungi | https://httpd.apache.org/docs/2.4/mod/mod_env.html#SetEnv doesn't mention any other applicable caveats that i can see | 23:07 |
clarkb | fungi: I wonder if it is related to the Directory we put it in. We could try setting it globally (though that might have other side effects?) | 23:08 |
fungi | i can't imagine other side effects it would cause if we made it vhost-wide | 23:08 |
clarkb | seems like it is worth a try? | 23:09 |
fungi | seems like not much apache invokes is going to care about MAILMAN_SITE_DIR besides the cgi | 23:09 |
clarkb | still getting the error. Maybe try stop start? | 23:09 |
clarkb | hrm no dice | 23:10 |
fungi | yeah, it's like that value isn't making it through /usr/lib/cgi-bin/mailman/listinfo | 23:10 |
clarkb | we are using mod_cgid not mod_cgi, but I wouldn't expect that to cause trouble | 23:13 |
clarkb | corvus: is this familiar to you at all? | 23:14 |
fungi | it's like something's sanitizing the environment before invoking mailman/mm_cfg.py | 23:14 |
corvus | no, i'm looking but don't see anything yet | 23:14 |
clarkb | and I guess do we want to start mailman and/or exim and see if those are happy? | 23:14 |
fungi | er, /etc/mailman/mm_cfg.py | 23:15 |
clarkb | fungi: ya its definitely calling that script because the error message is the one that the mailman vhosting added if the env var isn't set | 23:15 |
clarkb | actually it could be mm doing that? | 23:16 |
clarkb | since the path is apache2 -> mm cgi script -> /etc/mailman/mm_cfg.py ? | 23:17 |
fungi | right | 23:18 |
clarkb | https://bazaar.launchpad.net/~mailman-coders/mailman/2.1/view/head:/Mailman/Cgi/listinfo.py does that get compiled to what we have in our cgi dir? | 23:19 |
clarkb | if so I don't see it clearing any env vars | 23:19 |
fungi | our /usr/lib/cgi-bin/mailman/listinfo is an elf executable, so hard to know | 23:21 |
clarkb | there is a note in the apache2 docs about how you should use SetEnvIf if they are part of modrewrite or otherwise early in the handling | 23:22 |
clarkb | perhaps this is unexpectedly happening early in handling for some reason? an optimization to make cgi faster? | 23:22 |
corvus | i think it may be cleaned by the mm script... still trying to find links to share and confirm (i just downloaded the tarball from ubuntu) | 23:23 |
clarkb | corvus: interesting | 23:23 |
corvus | i don't know where to find an online linkable browser with version 2.1.29 | 23:24 |
fungi | clarkb: i don't think setenvif is relevant, that seems to be for conditionally setting envvars | 23:25 |
fungi | and yeah, the cgi scripts themselves sanitizing the environment would certainly be the simplest explanation | 23:27 |
corvus | well, anyway... if you download 2.1.29, there's a keepenvars variable in src/common.c which is the wrapper | 23:28 |
clarkb | https://launchpad.net/mailman/+milestone/2.1.19 has a tarball at least | 23:28 |
* corvus < https://matrix.org/_matrix/media/r0/download/matrix.org/MbFCWvlCwUruAnzodoEifPYm/message.txt > | 23:28 | |
fungi | any way to control keepenvars from outside? | 23:29 |
corvus | that comment is nuts "we should invert this!" "it is done!" | 23:29 |
fungi | i guess not | 23:29 |
corvus | maybe, i dunno, delete the TODO when it's done? | 23:29 |
fungi | we could probably rework it to switch on HOST? | 23:29 |
corvus | but i guess if your version control system is... nevermind. eyes on the ball. | 23:29 |
clarkb | oh 2.1.29 is the version not .19 | 23:29 |
corvus | i don't know what version we were running before; does anyone? | 23:29 |
clarkb | no wonder I don't see it | 23:29 |
clarkb | corvus: I can look it up | 23:30 |
fungi | i'll check the dpkg.log | 23:30 |
clarkb | looks like 2.1.20 | 23:30 |
corvus | i'm pretty sure this is the issue, but i'd like to confirm by seeing if... i'm gonna go out on a limb here, the logic had not been inverted yet and only the TODO was there not the done | 23:30 |
fungi | 2021-09-12 18:10:40 upgrade mailman:amd64 1:2.1.20-1ubuntu0.6 1:2.1.26-1ubuntu0.3 | 23:30 |
fungi | yep | 23:30 |
fungi | that was the first upgrade (from xenial to bionic) | 23:31 |
corvus | yep, it's inverted. that's the issue. | 23:31 |
corvus | i'm guessing the kata listserv doesn't use this system? | 23:32 |
clarkb | corvus: it does not | 23:32 |
clarkb | fungi: I agree we can use the HOST var probably | 23:32 |
clarkb | keep using MAILMAN_SITE_DIR for the non cgi stuff and switch on it with HOST otherwise? | 23:32 |
corvus | yeah, i think do the mapping based on host should work | 23:33 |
fungi | so if $MAILMAN_SITE_DIR undefined MAILMAN_SITE_DIR=$HOST | 23:33 |
clarkb | fungi: MAILMAN_SITE_DIR = somevalue based on HOST lookup | 23:33 |
clarkb | lists.opendev.org -> /srv/mailman/opendev and so on | 23:33 |
corvus | yep. and hard-code that mapping in /etc/mailman/mm_cfg.py | 23:34 |
clarkb | ya I'm working on an edit in my homedir | 23:34 |
fungi | ahh, yeah, we map them in /etc/mailman/sites | 23:34 |
corvus | and as you say, only do the HOST mapping if the site dir isn't already set (for the rest of the scripts to keep working) | 23:34 |
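A minimal sketch of the approach being described here, assuming the hard-coded mapping corvus suggests; this is not the exact edit clarkb drafted, and the directory names other than /srv/mailman/opendev are assumptions:

```python
# Hedged sketch of the mm_cfg.py fallback: if the Mailman CGI wrapper stripped
# MAILMAN_SITE_DIR from the environment, derive it from the HOST variable set
# in the Apache vhost. Directory names are assumptions except 'opendev'.
import os

SITE_DIRS = {
    'lists.opendev.org': '/srv/mailman/opendev',
    'lists.openstack.org': '/srv/mailman/openstack',
    'lists.airshipit.org': '/srv/mailman/airship',
    'lists.starlingx.io': '/srv/mailman/starlingx',
    'lists.zuul-ci.org': '/srv/mailman/zuul',
}

if 'MAILMAN_SITE_DIR' not in os.environ and 'HOST' in os.environ:
    os.environ['MAILMAN_SITE_DIR'] = SITE_DIRS[os.environ['HOST']]

VAR_PREFIX = os.environ['MAILMAN_SITE_DIR']
```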
corvus | oh yea you could use that file to do the lookup rather than hard-coding | 23:35 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/inventory/service/host_vars/lists.openstack.org.yaml#L87 | 23:36 |
corvus | fun fact, that file will parse as yaml (though that's not its intention). but we don't have pyyaml installed system-wide on that host. | 23:37 |
corvus | (i think that's intended to be an exim lookup file) | 23:38 |
fungi | i ended up using ugly shell to parse it in https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mailman/files/mk-archives-index | 23:38 |
clarkb | something like what I've got in ~clarkb/mm_cfg.py maybe? | 23:39 |
fungi | diffing now | 23:39 |
clarkb | I'm going to test that as much as I can without going through apache or mailman now | 23:39 |
clarkb | I had a syntax error missing a close ) | 23:41 |
fungi | clarkb: yeah, i think that should work | 23:41 |
fungi | assuming correct pairing of parentheses ;) | 23:41 |
corvus | ```VAR_PREFIX = os.environ['MAILMAN_SITE_DIR']``` needs updating | 23:41 |
clarkb | corvus: thanks | 23:41 |
corvus | otherwise lgtm | 23:41 |
clarkb | updated the file with that fixup | 23:42 |
corvus | ++ | 23:42 |
clarkb | will python complain about the mix of 2 and 4 spaces? | 23:42 |
clarkb | I can dedent the new code to match the old code | 23:42 |
clarkb | I'll do that | 23:42 |
clarkb | done | 23:43 |
fungi | python won't care as long as it's not inconsistent within a given nested set | 23:43 |
fungi | but consistency ftw | 23:43 |
fungi | lgtm | 23:43 |
clarkb | I put a backup of the old file in my homedir too just in case | 23:44 |
clarkb | fungi: do you want to copy that new file over or should I? | 23:44 |
fungi | i can | 23:44 |
clarkb | ok thanks | 23:44 |
clarkb | (you have been doing most of the prod updates and keeping it through a central channel helps avoid people walking over each other) | 23:44 |
fungi | malformed header from script 'listinfo': Bad header: Please set MAILMAN_SITE_DIR or | 23:44 |
clarkb | also is that how HOST is passed by apache | 23:44 |
clarkb | oh we may need to set HOST that isn't auto passed? | 23:44 |
fungi | could be we need to add that in the vhost yeah | 23:45 |
clarkb | ya SetEnv HOST lists.opendev.org ? | 23:45 |
ianw | (ps I'm just watching on, but LMN if I can help :) | 23:46 |
clarkb | fungi: does my or need to be an and? | 23:47 |
clarkb | fungi: I think that is the bug | 23:47 |
clarkb | fungi: updated the copy in my homedir | 23:47 |
fungi | yes | 23:47 |
fungi | hah | 23:47 |
fungi | oh i'm so blind. it's getting late | 23:47 |
clarkb | it works! | 23:48 |
fungi | it works! | 23:48 |
clarkb | lists.openstack.org fails so ya we have to explicitly SetEnv HOST | 23:48 |
fungi | now i wonder if the setenv host is redundant | 23:48 |
fungi | i guess not, right | 23:48 |
clarkb | I don't think it is. I think we have to set it | 23:48 |
fungi | okay, i'll patch it into the local copies real quick | 23:48 |
clarkb | thanks | 23:48 |
fungi | all of them are updated on the server now | 23:50 |
fungi | so should hopefully be working | 23:50 |
fungi | i need to take another break here shortly if folks want to check anything else before we start mailman and exim services | 23:51 |
clarkb | I just confirmed they all load and that they seem to match the right env :) | 23:51 |
clarkb | I think we should continue | 23:51 |
clarkb | corvus: ianw: any objection to starting mailman now? | 23:52 |
clarkb | then we'll start exim | 23:52 |
corvus | clarkb: lgtm (was just checking out list admin interface) | 23:53 |
clarkb | I don't expect the non cgi scripts will have problems because we do actually test creating a list etc using MAILMAN_SITE_DIR in ansible | 23:53 |
clarkb | but if they do I guess we have to go through and set HOST instead? | 23:54 |
clarkb | fungi: I think you should go for it? | 23:54 |
fungi | yeah, syatyomg mailman services next | 23:54 |
fungi | er, starting | 23:54 |
clarkb | was that opendev that was started? | 23:55 |
fungi | i started mailman-opendev and see the expected company of processes | 23:55 |
fungi | i'll do the other remaining 4 now | 23:55 |
clarkb | I see 5 sets of processes | 23:56 |
fungi | 45 processes owned by list, that's 9 per site, like the old days | 23:56 |
fungi | anything else we want to do before starting exim? | 23:56 |
clarkb | I don't think so? corvus ianw ^ ? | 23:56 |
clarkb | fungi: seems like there are no objections I think we should probably send it, then try a test email against service-discuss? | 23:59 |
fungi | yeah, i'll compose a reply to my maintenance announcement real fast | 23:59 |