Sunday, 2021-09-12

*** tosky_ is now known as tosky14:31
Clark[m]fungi: I'm just about around. I should find some food.14:32
fungicool, i have some notes in the questions section of mainly based on experiences from the lists.k.i upgrade14:33
Clark[m]fungi: one thing I thought of was we still want to disable the mailman service even though we don't use the main service on that server. That way we don't get a useless mailman set of processes running from the systemd switch14:37
fungiyeah, i was unable to work out how to add a disable symlink for the systemd unit for it in advance though14:40
clarkbfungi: oh and we should consider disabling the backup crons as well so that we aren't trying to run those alongside other work14:47
clarkband then it's possible that the borg install may need manual fixup since that is pip installed iirc14:49
clarkbpost upgrade I mean14:50
fungiyeah, i've also added the corresponding enables at the end for those, but we can determine after what testing should be done14:50
fungii do have a feeling anything pip installed system-wide will be just plain broken, at least until ansible reruns14:51
fungisince it will be for an entirely incorrect python interpreter14:51
clarkbyup and even ansible running may not be sufficient because it will see the install is present14:51
clarkbbut shouldn't be too bad to sort that out post upgrade14:51
fungimaybe not? ansible will be asking pip if it's installed and pip will be asking the newer python which will say it isn't14:52
fungi(i think?)14:52
clarkbfungi: it might be a virtualenv? but ya maybe that will error in the right way too14:52
fungiwe'll just have to see, but yeah maybe the first thing we do after taking it out of the emergency disable list is a manual ansible run?14:53
clarkbya it does an ansible pip module command in a virtualenv14:53
clarkbfungi: that seems like a reasonable thing to do14:53
clarkbI think we can move the venv aside and have it reinstall14:53
clarkbif it doesn't handle it automatically I mean14:53
fungii'm in the rackspace dashboard ready to click the clicky for imaging once the server is down14:54
fungii'll do the steps on lines 280-292 now14:54
clarkbsounds good14:54
fungistatus log should be sufficient for this maintenance, you think?14:59
fungiannounced well in advance and it's a slow time14:59
fungi#status log The mailing list services for,,,, and will be offline over the next 6 hours for server upgrades, messages will be sent to the primary discussion lists at each site once the maintenance concludes15:01
opendevstatusfungi: finished logging15:01
fungiokay, ready for me to poweroff the server?15:01
clarkbI guess so15:02
fungihere we go!15:02
fungias soon as the api reports it shutoff i'll begin the snapshot15:02
clarkbI guess also check your ssh connection has gone away?15:03
fungiif the console indicates it's done but it hasn't gone offline i'll server stop it15:04
fungiyeah, vnc can't connect, i'll issue a server stop15:05
funginow it's showing shutoff, proceeding with image creation15:05
fungiit's queued, name is lists.openstack.org_2021-09-12_pre-upgrade15:07
clarkbdoes the web dashboard give you a progress indicator for that? I think the api does if you can figure out how to work it, but also unsure how accurate it is in any case15:07
fungibasically the same. it gives you a useless progress indicator15:07
fungiright now it says "queued: preparing for snapshot..."15:08
fungithis will take a while, so i'm going to step away and just check it every few minutes until it finishes15:08
fungino point in hovering15:08
clarkbsounds good I'll do the same then. Ping me when you see it as done15:09
fungiwhere "a few minutes" is somewhere between 3-4.5 hours based on earlier testing, i think?15:09
fungii will ping you when we're done with this wait, yep15:09
clarkbya iirc it was in the range of several hours. But the server was online then. Hopefully it goes a little quicker with it offline15:10
*** diablo_rojo is now known as Guest704915:51
fungihere's hoping16:02
fungiit's at "saving" now, noting "this step duration is based on underlying virtual hard disk (vhd) size"16:04
fungiclarkb: it's active now17:35
fungii'm going to boot the server and make sure the services we disabled haven't started (in a root screen session in case you want to attach)17:36
Clark[m]Ok I'm migrating back to my desk17:37
fungiroot screen session on lists.o.o is up17:38
clarkbI'm attached to the screen17:38
fungithe services we disabled are still not running based on ps17:39
fungii mived the esm unenroll and puppet uninstall to the beginning of the upgrade, due to complications observed on the lists.k.i upgrade17:39
fungier, moved17:39
fungiare you good with that?17:39
clarkbya I noticed that and makes sense to me17:39
fungiua gets weird following a dist upgrade, in particular17:40
fungiua and puppet are cleaned up, proceeding to update the package lists though this and the dist-upgrade should no-op ideally17:42
fungilooks like detaching from ua doesn't actually remove the sources list entries17:42
fungii'm going to clean that up now17:42
fungimaybe because ansible?17:42
clarkbfungi: no I think they are always there on ubuntu17:43
fungithe esm sources?17:43
clarkbyes they are present on my local bionic fileserver for example but it isn't enrolled in esm17:43
clarkbthen some other mechanism actually turns it on I think17:43
fungiahh, okay, ignoring that, then17:43
clarkbit's not entirely clear to me how that works (and I suspect that is by design)17:43
clarkbhrm though now I'm double checking and trying to see where they are on my local machine17:44
clarkbI remember looking at my local machine when we set up the esm stuff and being confused at how much was present17:44
fungiubuntu-release-upgrader-core is already installed so no need to add it17:45
clarkbmaybe I was wrong about the sources.list entries. It is in the apt auto update by default17:45
funginote that on bridge.o.o we don't have esm sources by default17:45
clarkbwhich I guess gets ignored if you don't have the sources.list definitions17:45
clarkbso ya you can probably clean those up17:46
fungiyeah, removing /etc/apt/sources.list.d/ubuntu-esm-infra.list will solve that17:46
clarkbits easy enough to reenroll if that becomes necessary so rm'ing that file should be safe17:46
fungidoing it now17:46
fungipackage list re-updated, dist-upgrade still predictably no-op17:47
fungiready for do-release-upgrade?17:48
clarkbI guess so seems like everything looks the way we expected. The snapshot succeeded right? that would be the only other thing I can think of to check17:48
fungiyes, claims to have a 42.14 gb image in dfw named lists.openstack.org_2021-09-12_pre-upgrade with a status of "active"17:49
clarkbthen ya I think we are ready17:49
fungipushing the shiny, candy-red button17:49
fungitelling it to run the alternate ssh in case we need to fix sshd17:50
fungii'll add the iptables rule for it in a second screen window17:50
clarkbI've ssh'd in via port 1022 (but you should too :) )17:51
fungii have just now17:51
fungiyou good with the package changes?17:53
clarkbI think so. It seemed the stuff it complained about not having candidates were all version specific packages that will be replaced by other version specific packages that are installed with virtual packages a level up17:54
fungii agree17:54
clarkbneat the default site override thing we do prevents it from trying to generate all the mailman languages17:58
clarkbI expect we'll be ok with that since we don't use a lot of languages and have content already ?17:58
fungiyeah, seems so. i guess we just check everything out later and see if there's anything broken/missing17:58
fungiworst case it'll just be the webui/archive not rendering or looking strangely17:59
clarkband can copy files from if we need them quickly17:59
clarkbfungi: we might need to keep the old setting for the arp bit? the comment says xen pv guests need it18:02
fungiit wants to update /etc/sysctl.conf, and we seem to specifically set -net.ipv4.conf.eth0.arp_notify = 1 with a comment that it's needed for pv xen guests18:02
fungithe maintainers version of that file has all lines commented out anyway, so keeping ours should be fine18:03
clarkbI suspect that comes from rax and not anything we set. I say we keep our version18:03
fungii suppose this is an artifact of this being an ancient ported flavor18:03
fungiseems the conffile updates occur in a nondeterministic order, but keeping our modified login.defs18:10
clarkbetherpad says /etc/login.defs is a keep our version18:10
clarkb(we override the uid and gid ranges iirc)18:10
clarkbfungi: etherpad says this is a keep of our version?18:19
fungiand the one after it was as well18:21
clarkblooks like it knows about the languages after all. I guess the selection tool needs the site dir but then generation doesn't18:22
clarkb(that is good means we probably don't need to do anything)18:22
fungiand installing the maintainer's ntp.conf18:23
fungikeeping our sshd_config18:24
clarkbnoting that the current swap device appears to be xvdc1 so the warnings about initramfs there show it doing the right thing18:26
clarkbbut also we don't really suspend to memory except for server migrations? its probably super minor18:26
clarkbpotentially related my laptop has started to refuse to suspend to memory on linux 5.1418:27
clarkbI guess I should double check my initramfs and swap device uuids18:27
clarkbfungi: this is another keep18:27
fungiyep, keeping the unattended-upgrades config18:28
fungiit hasn't asked about iptables yet, right?18:28
fungioh, or logind.conf18:28
clarkbfungi: not that I have seen18:28
clarkbya neither have shown up18:28
clarkbI guess it won't ask about those18:31
fungiremoving obsolete packages now18:31
clarkbtime to remove obsolete packages (which means the installs are done)18:31
fungiupgrade is complete, before restarting i'll re-disable services in a second screen window18:34
fungicurrently there are no apache, mailman or exim processes still18:35
clarkbbut we will disable them anyway to ensure the systemd units get disabled?18:36
fungi"Removed /etc/systemd/system/mailman-qrunner.service."18:37
fungithat was the only one which changed18:37
clarkband we expected that18:37
fungiokay, ready for the reboot?18:37
clarkbI think so18:37
fungiit's back up and i've started the root screen session again18:38
clarkbgive me a sec to join18:38
clarkbI'm in the screen18:38
fungilooks like htcacheclean started on boot but not apache proper18:39
clarkbthat should be fine18:39
fungithe apt-get clean wants to remove a couple of old linux-headers package versions18:39
fungiagreeing to it18:39
clarkbfungi: maybe double check that isn't the kernel we are running?18:39
clarkbbut ya agreeing to it seems safe18:40
fungiLinux lists 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux18:40
clarkbcool that doesn't match what it wants to remove18:40
fungiit wants to remove linux-headers-4.4.0-212 and linux-headers-4.4.0-213 yeah18:40
fungi(it generally won't remove the kernel you're booted on, and it won't see the headers package as needing cleaned up if that version of the kernel package is installed, for future reference)18:41
clarkbgood to know18:41
clarkbbefore we start the next upgrade should we double check that swap was enabled properly as a followon from the warnings about finding the swap device for resume (I think that doesn't matter as that is just for suspending to memory though)18:42
fungiready to do-release-upgrade for focal?18:42
fungioh, yep18:42
clarkbswap looks good to me. I think we can proceed with the next do-release-upgrade18:42
fungiand underway18:43
fungistarting the alternate sshd again18:43
clarkbI'm on via the alternate18:43
fungiand added the iptables rule for it and connected, yeah18:43
fungiaccepting the package changes if that also looks right to you18:45
clarkboh I missed it while coordinating some questions around lunch. I'm sure its fine though18:45
clarkbit looks similar to what I expect from what little I see there :)18:46
fungigoing now18:46
clarkbinteresting that it is prompting for the list of services to restart rather than just a binary yes no restart things18:47
clarkbI'm ok with that list though it has exim4 and apache in it.18:48
clarkbas I suspect that is all the 'yes' selection was doing previously18:48
fungiyep, agreeing to it now18:49
fungiwill need to check that it doesn't actually restart exim4 and apache218:49
clarkbI added a note to do that in the etherpad18:50
fungiit's prompting for a different set of service restarts next18:55
clarkbfungi: I guess we accept this list too18:55
clarkbI think it didn't prompt to auto restart so we just accept the lists it gives us instead18:55
clarkband then we are equivalent to selecting yes to auto restart if it had prompted18:55
clarkbfungi: it is asking about sysctl.conf. We probably want to double check the diff then keep again?19:04
fungiyeah, looking19:05
fungisame situation as before, so keeping ours19:06
clarkbThis is a keep according to the etherpad because we configure snmpd with ansible19:07
fungiyep, keeping19:07
clarkbthis is another keep of our version19:10
fungikeeping our unattended upgrades config too19:10
clarkband another keep for sshd_config19:10
fungiyep, kept19:12
clarkbtime to remove obsolete packages19:17
fungiyep, agreeing19:19
clarkbnow we check that apache2 exim4 et al are not running and reboot?19:23
fungiyeah, looking19:23
fungijust htcacheclean19:24
fungiso i think we're safe to reboot19:24
clarkbsounds like it19:24
fungiso once this is booted again we'll be on focal!19:25
fungiit's taking longer to boot than i would expect19:26
fungivnc isn't connecting19:27
clarkbI wonder if we'll have to hard reboot it from the api?19:28
clarkbseems like vnc (and maybe even ping) should work if it was running at all19:28
fungiyeah, still failing to connect. i'll hard reboot the instance19:29
fungia hard reboot put it into error state19:30
clarkb{'message': 'Failure', 'code': 500, 'created': '2021-09-12T19:30:26Z'} doesn't give much indication to what failed19:31
fungifault | {'message': 'Failure', 'code': 500, 'created': '2021-09-12T19:30:26Z'}19:32
clarkbdo we file an issue with rax? (or a phone call?)19:33
fungii suppose i could try to stop and start the server again19:33
fungia server stop put it into shutoff state at least19:34
fungistarting it again now19:34
fungiit's staying in shutoff now19:35
clarkbya I see that too19:35
fungitried starting it again, still in shutoff19:36
clarkbfungi: are you using the web ui or the api? I wonder if the web ui for the instance shows anything useful that the api might not19:36
fungiwell, cli19:37
fungitrying from the webui now19:37
fungidoesn't seem to do anything19:38
clarkbI see it in an error state via openstackclient now fwiw19:38
clarkbwith the same failure 500 fault message19:38
fungioh, yep, it went back to error19:38
clarkbit did update the timestamp on that though so it wasn't just returning the same message19:38
fungithe webui says "The server has encountered an error. Contact support for troubleshooting."19:38
fungii guess i'll open a ticket now19:38
clarkbis there any info I can help gather? I assume you've got it under control19:44
fungiticket #210912-ord-000030319:44
fungiyeah, i don't know what else to do at this stage19:44
fungiif we get closer to the end of the maintenance window we'll probably want to status notice that things are running over, but in the meantime i guess we hope support gets back to us quickly19:45
clarkbAlso something like #status notice The server hosting mailing lists for airship, opendev, openstack, starlingx, and zuul entered an error state during its operating system upgrades. We have filed a ticket with the cloud provider to help debug the cause.19:45
fungiif they don't get back to us quickly, we'll need to think about what bringing up a replacement server from the image we made looks like19:46
clarkbAs far as other options go I think we can boot from our snapshot and essentially revert. Ideally we'd do that with a nova rebuild of the existing server to preserve the IP address but that seems risky considering the server is in an error state. We can boot it on an entirely new instance as well (and try really hard to deal with the IP address stuff?). Remember we have to recover the19:47
clarkbsnapshot to update /etc/fstab to remove the swap partition until we can boot it properly and create a new swap partition19:47
clarkbA third option would be to boot a new focal instance and move all of the lists over to it (I don't know what that involves, I suspect it might be difficult to not email all the list owners when the new lists are created)19:47
clarkboh we do have the flag to not email them though I bet we could set that in the hostvars for a new instance19:48
clarkbfungi: I'll take a break for lunch now then check back in after to see if we've got a response. Let me know if you think there is anything else I can be doing19:50
fungimy gut says this is an old pv flavor and the new focal kernel won't work on it19:51
clarkboh hrm.19:51
fungiin which case we might be able to force it to boot with the bionic kernel19:51
clarkbya or have them change the flavor for us under the hood?19:51
fungiit rebooted into bionic just fine19:51
fungiyeah, maybe, i dunno how messy that is19:52
clarkbto something that handles focal (which we do have focal nodes so it is possible)19:52
clarkbya I have no idea either19:52
clarkband yup bionic booted19:52
clarkbok eating lunch will check in in a bit19:52
clarkbfungi: I do wonder if we shouldn't consider a phone call before giving up on waiting. It's annoying to do but iirc ianw wrote down the account verification details in the typical spot for doing that19:53
clarkbanyway I'm hungry and I can smell lunch :)19:53
fungiyeah, apparently we've been continuously in-place upgrading this since precise (12.04 lts)19:54
opendevreviewClark Boylan proposed opendev/system-config master: DNM just testing mailman with bionic
clarkbfungi: ^ just to give us an idea if our ansible has any problems with bionic20:09
clarkbfungi: thinking out loud here: if we wanted to get prepped we could boot a new instance off of the snapshot and then sort out its swap situation. However, I need a bit more time away from the keyboard so can't help with that just yet20:10
clarkbbut then if we have to we can update DNS and all that to use that server while we figure out our next move20:10
fungimmm, yeah20:11
fungiticket update20:16
fungiThe server is failing to boot because the bootloader cannot load the grub/kernel files.20:17
fungimessage: xenopsd internal error: XenguestHelper.Xenctrl_dom_linux_build_failure(2, " panic: xc_dom_core.c:616: xc_dom_find_loader: no loader\\\"")20:17
fungithey recommend booting in rescue mode. i'll give it a shot now20:17
fungii think there must be a bit of a disconnect with the webui. it let me select to reboot into rescue mode, and gave me a temporary ssh password, but the server never actually rebooted20:20
Clark[m]fungi: seems that says it is common if you try to run a non pv aware kernel?20:20
fungioh, i just needed to wait a little longer20:20
fungii'm ssh'd into the rescue rootfs now20:22
Clark[m]Maybe check what kernel is installed then we see if maybe we need a different kernel for pv support?20:23
Clark[m]I should be back to the keyboard soon20:23
fungilooks like xvdb1 is our production rootfs20:24
fungifsck runs clean on it20:25
fungii've got it mounted on /mnt20:25
fungiinstalled kernel packages are linux-image-3.2.0-77-virtual linux-image-4.15.0-156-generic linux-image-4.4.0-214-generic linux-image-5.4.0-84-generic20:26
clarkbfungi: the -virtual in the really old kernel makes me wonder if we need a -virtual kernel20:27
fungiprimary grub boot entry is for 5.4.0-8420:27
* clarkb loads up ubuntu package search20:27
clarkbfungi: says the generic kernel is the virtual kernel :/20:30
clarkbthat was true for bionic as well, but bionic did boot. So ya maybe they remove paravirtualization support?20:30
fungiwe could try just switching to 4.15.0-156 in the grub config20:31
clarkbya that would at least allow us to check if it boots I guess20:31
fungithough the error really seems like it's having trouble with the bootloader not the kernel?20:31
fungithere's a grub-xen package20:32
clarkb <- says it could be the compression algorithm for the initramfs?20:33
clarkb doesn't seem to have much in that grub-xen package /me goes to look for what is in those files20:34
fungiwe presently have grub-pc installed20:34
fungibut yeah i think this must be related to the server still using a pv flavor20:35
clarkblooks like hvm instances have grub-pc installed, but I think that is expected since hvm is far more like a normal thing20:35
clarkbhrm my link is for bionic though and it was fine20:37
fungigrub-xen seems to be the grub2 successor of pvgrub20:37
clarkbfungi: neat, I guess maybe we try that then?20:38
clarkbhow do we install that properly on the recovery installation? chroot into the other image and then do it?20:38
clarkb is installed as a dep of grub-xen and that seems to pull in all the bits there20:39
fungilooks like prior to the upgrade we were running grub-pc:amd64 2.02~beta2-36ubuntu3.3220:39
fungii can try to install grub-xen within a chroot20:40
fungichroot of the production rootfs i mean20:40
fungithat can be fiddly when it comes to block device detection for grub-install20:41
clarkbfungi: that seems reasonable given the error provided by the cloud I think20:41
clarkbfungi: disk image builder does do it somehow20:41
fungibut i'll recheck the grub config20:41
clarkbI'll see if I can discern what dib does20:41
fungiit's going to uninstall grub-gfxpayload-lists and grub-pc20:42
fungioh, that's fun... "Temporary failure resolving ''"20:42
clarkb`/usr/sbin/grub-install '--modules=part_msdos part_gpt lvm biosdisk' --force /dev/loop0`20:42
clarkbI think dib manually executes the installation20:43
fungiahh, i'm going to need to mount devfs and proc et cetera20:43
fungiyeesh, systemd wants several dozen things mounted (no joke)20:45
fungihaha, though the reason dns isn't working is that we configure the server to do lookups through a local unbound20:49
fungiundoing that in /etc/resolv.conf for a bit20:49
fungibut i did also mount /dev, /proc and /sys because it's probably going to need those when doing stuff with the grub packages20:50
fungicomplained about being unable to log to /dev/pts20:51
fungibecause i didn't mount that of course20:51
fungilooks like it scanned and chose the rescue boot partition instead of the one i'm chrooted into20:52
clarkbfungi: if the package is installed I suspect that you can run something like the command dib runs above though?20:53
clarkbthough I don't think you should need the force or the modules listing?20:54
clarkbya --force means "install even if problems are detected" we probably want to evaluate those if it finds them20:55
clarkband modules just preloads stuff but the default is to have all the modules be available based on the man page20:55
fungimmm, yeah do i want it installed into the partition or into the mbr?20:56
fungii went ahead and did both but i suspect it'll be the mbr that matters20:56
clarkbINSTALL_DEVICE must be system device  filename.   grub-install  copies  GRUB  images  into boot/grub.  On some platforms, it may also install GRUB into the boot sector.20:57
clarkbI think it is the mbr that matters20:57
clarkbbut it figures it out sounds like20:57
fungithe menu.lst looks like it contains the kernels which are installed in the chroot, so hopefully we're all set20:58
clarkbfingers crossed20:59
fungishould i undo the resolv.conf change, umount /dev, /proc and /sys, umount the rootfs and then reboot into normal mode now?20:59
fungianything else i should check first?20:59
clarkbI can't think of anything else, and you probably grok this stuff better than me :)20:59
clarkbnote you may have to "unrescue" the server to boot into normal mode21:00
clarkbnot sure what a straight up reboot will do, and ya I would umount the rootfs before unrescuing21:00
fungidid all the above and now i've asked it to "exit rescue mode" which i guess is the webui's unrescue21:01
clarkbI don't see it pinging yet, any chance the console is helpful?21:02
fungithe webui still says it's unrescuing21:03
clarkbah ok21:03
clarkbI should learn to be more patient21:03
fungii'll also need to switch to dinner prep mode shortly21:03
clarkbthe openstack client shows it as error state though21:03
clarkband the timestamp of that is from around when you unrescued21:03
clarkb(so I don't think it is stale from an hour or two ago whenever that was)21:04
clarkb that does say rebuilding a server preserves its IP address. In theory we can use that to revert back to xenial21:05
clarkbI think that might be the best option for now as it should get us back to a working state with the same IP addrs and we won't have to sort through block lists as part of recovery21:06
clarkber for now meaning "if we can figure it out through rescue"21:06
clarkb*if we can't21:06
clarkbfungi: maybe try tell grub to boot the bionic kernel?21:07
clarkband if that doesn't work we rebuild?21:07
clarkbof course once we get to that state I'm not sure how we move forward from there so maybe that is for information gathering rather than a solution21:07
fungiyeah, could give the old fallback kernel a shot21:08
clarkbits a "oh neat bionic's kernel works for some not understood reason, but we can't keep it up to date so what do?"21:08
clarkbalso if we think this might be a sit on it and think situation I can probably live with that into tomorrow? but considering a rebuild with our snapshot should revert us pretty cleanly I also like that option21:09
fungitried stopping the server and then rebooting it, but still went back into error21:11
fungireentering rescue mode now21:11
fungiexiting rescue mode now after removing the focal kernel entries from /boot/grub/menu.lst21:15
fungithough if grub-xen is grub2 based shouldn't that be /boot/grub/grub.cfg instead?21:16
fungiLinux lists 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux21:17
fungi21:17:23 up 2 min,  1 user,  load average: 0.41, 0.20, 0.0821:17
fungiclarkb: ^21:17
fungii think it's the focal kernel which is the problem?21:17
clarkbok so it is the kernel?21:17
fungii expect so, though this is still with grub-xen installed too21:18
clarkblinux-image-extra-virtual - Extra drivers for Virtual Linux kernel image <- is a thing21:18
clarkbya it could've been both things I suppose21:18
fungii want to try another reboot before we go further21:19
fungithe webui never indicated it was done exiting rescue mode but it seems to have been21:19
fungirebooting the server now21:19
clarkbseems that extra-virtual is a largely empty package and it just depends on the generic image21:19
fungii think the -virtual means it's a virtual package (not an actual package, just a name reference to another package)21:20
clarkboh I see21:20
fungiin debian there's the idea of "virtual packages" (packages which don't really exist, they're just convenient pointers to other package names) and "dummy packages" (basically empty packages which only exist to depend on other packages)21:21
fungithough also sometimes people conflate the two21:21
clarkbseems the server is still pinging did the reboot complete?21:21
fungi21:21:58 up 1 min,  1 user,  load average: 0.45, 0.24, 0.0921:22
fungiprobably okay to switch to bringing services back up21:22
clarkbfungi: but this kernel isn't sustainable?21:22
clarkblike what do we do if we need kernel updates?21:22
fungiit's not sustainable, but it's probably something we can figure out once services are running again21:23
clarkband if we enable services now rolling back becomes potentially much more difficult21:23
fungimaybe it's the kernel compression after all, like you surmised21:24
fungi"an apt hook using extract-vmlinux to decompress kernels during installation"21:24
clarkbya the comments say there are a couple of security concerns with the way it is written and that it doesn't work as is for focal21:25
fungiright, though it suggests we could probably manually decompress the kernel as a temporary workaround21:27
clarkboh I see what you are saying. Do you want to try that now? If that works then I'm good with enabling services21:27
clarkbfungi: do you want to do that in a root screen? I'm happy to follow along or have you just do it as well :)21:28
fungimaybe, i'm trying to understand extract-vmlinux first21:28
clarkblooks like it iterates through all of the various compression algorithms that it might be and bails out when it finds the first one that is valid21:29
clarkbfungi: we can also see if we can compare the current kernel to the focal kernel's compression type21:30
clarkbfile says they are both regular files though (but I think that is because the file is "regular" and the vmlinuz is in there somewhere else?)21:30
fungithe `file` utility claims both the problem and working kernels are a bzImage21:31
fungiyou need to use sudo21:31
corvusclarkb, fungi: i see there are issues; is there a tl;dr + something i can help with?21:31
fungifor some reason i guess ubuntu assumes user-readable kernel files are a security risk21:31
clarkbcorvus: rax xen cannot boot the focal kernel but can boot the bionic kernel21:31
clarkbfungi: note the script in stackoverflow further checks things beyond bzimage21:32
fungicorvus: tl;dr is that lists.o.o is a pv xen server, and the focal kernel won't boot on it, but the bionic kernel will21:32
clarkbcorvus: we suspect it may be the vmlinuz compression type21:32
fungicorvus: yeah looks like it may be because ubuntu switched to lz4 compression for kernel files21:32
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Support grubby and the Bootloader Spec
opendevreviewSteve Baker proposed openstack/diskimage-builder master: RHEL/Centos 9 does not have package grub2-efi-x64-modules
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add policycoreutils package mappings for RHEL/Centos 9
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add DIB_YUM_REPO_PACKAGE as an alternative to DIB_YUM_REPO_CONF
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add reinstall flag to install-packages, use it in bootloader
opendevreviewSteve Baker proposed openstack/diskimage-builder master: WIP Add secure boot support to ubuntu.
fungicorvus: so we're trying to decide the best way forward so we feel comfortable bringing services back up on it21:33
clarkbcorvus: currently we are thinking if we can manage to de|re compress the vmlinuz for focal such that it is bootable then we can roll forward (and use a hook to do that automatically in the future). But if not then we might be best doing a server rebuild against the snapshot we took at the beginning of this process21:33
clarkbfungi: `grep -aqo "${LZ4_HEADER}" ${KERNEL_PATH}` is the thing to try against the bionic and focal kernels to see if they differ I guess?21:34
corvusok.  i have no direct experience with this, so all i have to offer is another set of eyes/hands if necessary :)21:34
fungicorvus: we've temporarily updated the grub config to boot the old bionic kernel instead of the focal one, but are concerned that isn't sustainable long term because we want to be able to update the kernel for vulnerabilities in the future21:35
clarkbfungi: I'm thinking we do the grep against the lz4 header and try to confirm that differs between the kernels. If it does we attempt the kernel conversion and reboot.21:35
clarkbcorvus: is the script that linux provides to do the conversion21:36
corvusyeah, it sounds like get something that will work for ~1 week, then regroup / possibly schedule an outage to move to a server with new ip addrs might be a good plan?21:36
clarkbcorvus: ya that too21:36
fungiclarkb: `sudo grep -aqo '\002!L\030' /boot/vmlinuz-5.4.0-84-generic` doesn't find anything21:37
clarkbfungi: huh I wonder if it is xz then?21:37
fungiaha, it may be escaping21:38
fungilz4match=$(printf '\002!L\030'); sudo grep -aq "$lz4match" /boot/vmlinuz-5.4.0-84-generic21:38
clarkbfungi: oh also it is grep -q21:38
fungithat exits 021:38
clarkbyou need to check the exit code ya21:38
fungilz4match=$(printf '\002!L\030'); sudo grep -aq "$lz4match" /boot/vmlinuz-4.15.0-156-generic21:38
clarkbso that means it matched21:38
fungithat exits 121:38
clarkbnice we have likely tracked this down? what an adventure21:38
fungiit was the -o which was making it not match21:39
fungimmm, no not the -o either21:40
fungiyeah, have to `printf '\002!L\030'`21:40
fungithat's it21:40
fungiokay, so that confirms the focal kernel is indeed lz421:40
fungianyway, i thought we were very close to being able to wrap this up, but i'm really overdue to switch to dinner prep before christine tries to gnaw off my arm21:41
clarkbfungi: reading linux's script it basically finds the positions of the compressed file for the lz4 bits then passes that through a decompression and leaves the other bits behind? I guess you only need that last bit?21:42
fungiright, the offset seems to be skipping the header part21:42
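
A minimal Python sketch of the check discussed above, assuming the same four magic bytes that `printf '\002!L\030'` produces. It reports where the magic occurs in an image and returns the data from that offset onward, which is the part a decompressor would be fed. The function names are hypothetical, and the decompression step itself is left out because the standard library has no lz4 module.

```python
# Same bytes as printf '\002!L\030': 0x02, '!', 'L', 0x18.
LZ4_MAGIC = b"\x02!L\x18"


def find_magic_offsets(data, magic=LZ4_MAGIC):
    """Return every byte offset at which `magic` occurs in `data`."""
    offsets = []
    start = data.find(magic)
    while start != -1:
        offsets.append(start)
        start = data.find(magic, start + 1)
    return offsets


def payload_from_first_match(data, magic=LZ4_MAGIC):
    """Return the bytes from the first magic match onward, or None.

    Everything before the match (the boot header) is skipped; the
    result is what would be piped into an external decompressor.
    """
    offsets = find_magic_offsets(data, magic)
    if not offsets:
        return None
    return data[offsets[0]:]
```

An empty offset list corresponds to the `grep -aq` exit status 1 seen for the bionic kernel, and a non-empty one to the exit status 0 seen for the focal kernel.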
clarkbfungi: ya enjoy dinner, I can probably do the conversion while you eat then we can try rebooting after?21:42
clarkbI will use linux's script and not the parent one to do this as a one off21:43
fungisure, shall i do a #status notice since we're ~45 minutes past the announced end of our window already?21:43
clarkb++ something along the lines of we've isolated the issue and are working to fix it now21:43
fungistatus notice The mailing list services for,,,, and are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen21:45
fungilike that?21:45
fungi#status notice The mailing list services for,,,, and are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen21:45
opendevstatusfungi: sending notice21:45
fungiokay, will dinner as quickly as possible and return21:45
-opendevstatus- NOTICE: The mailing list services for,,,, and are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen21:45
clarkbvmlinuz-5.4.0-84-generic is the file we care about right?21:46
clarkb/boot/vmlinuz-5.4.0-84-generic I mean21:46
clarkbhrm does it actually do it in place?21:47
clarkbmaybe not21:48
corvusclarkb: looks like the result goes to stdout?21:49
corvusi think the surgery happens on a tempfile, then the result gets cat'd21:49
clarkbcorvus: yup I've redirected into a file its in my homedir under kernel-stuff/vmlinuz-5.4.0-84-generic.extracted21:50
clarkbit isn't clear to me how fungi told grub to default to the bionic kernel21:51
clarkbthe default seems to be entry 0 which is focal?21:52
corvusdid he do that interactively?21:53
clarkbcorvus: it was done from a rescue instance21:54
clarkbor at least I thought it was. Maybe he caught it in the console instead21:55
corvusclarkb: grub/menu.lst entry 0 == uname21:55
fungii edited /boot/grub/menu.lst to remove the newer kernel entries21:55
clarkboh I see the thing says it is 20.04 kernel but its the 4.15 path21:55
fungiif you rerun update-grub it should put them back21:55
clarkbok corvus  I ran readelf against my extracted kernel and it didn't error21:56
clarkbshould I copy that over the file in /boot then rerun update-grub?21:56
corvusclarkb: file also thinks it's an elf binary :)21:56
corvusyeah, you have a backup of the orig, right?21:56
fungiyeah, elf is what we want21:56
clarkbcorvus: yes a backup of the original is in the same dir as the extracted file21:57
clarkbcorvus: I'll double check shas on that and the one in /boot21:57
clarkbsha1s match21:57
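
The backup check above can be sketched in Python as follows. This is an illustration only, not what was run on the server; the helper names are hypothetical.

```python
import hashlib


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same SHA-1 digest."""
    return sha1_of(path_a) == sha1_of(path_b)
```

Comparing the backup copy against the file in /boot before overwriting it confirms the backup really is the original kernel image.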
clarkbI'll do the copy now21:57
clarkband now running update-grub21:58
clarkbhrm that didn't put it in menu.lst.21:59
clarkbIs menu.lst used?21:59
clarkbinternet seems to say grub2 doesn't use menu.lst and its all grub.cfg22:00
clarkbShould I try a reboot and either it comes back on the old kernel and works, comes back on the new kernel and works, or tries the new kernel and fails and we can recover from there?22:01
fungiright, but i have a feeling the bootloader isn't actually being used22:01
fungii think in a xen pv scenario the bootloader is run outside the domu to parse the files in the filesystem22:01
fungiso we may not actually be in control of what bootloader is run22:02
clarkbdoes that mean you think I need to edit menu.lst?22:02
clarkbor how do I convince it to use the newer kernel?22:02
fungiwhen i edited menu.lst that worked22:02
corvusi'm in favor of a menu.lst change22:03
clarkbyup it looks like menu.lst~ has the new kernel in it. I've put backups of menu.lst and menu.lst~ in my homedir and will replace menu.lst with menu.lst~22:05
clarkbthat is done. Shall I reboot?22:06
clarkbok proceeding22:07
clarkbit hasn't come back yet. Openstack API still shows it as active and not yet error'd22:09
clarkbI can try to reboot it through the rax api22:11
clarkbI guess I should check the console first22:12
clarkbvnc fails to connect to the server. I'll try a reboot via the api22:14
clarkbNow it has an error status. Hrm22:15
corvusdoes that mean it needs to be rescued again?22:17
corvuslike rescue it, then edit menu.lst as before and reboot to try to recover?22:17
corvus(i'm a little confused why it's in error state and not just sitting at a grub prompt, but maybe that goes to fungi's suggestion that we may not even really be using grub)22:18
clarkbcorvus: yes I think we need to rescue it and restore the old bionic-only menu.lst22:18
clarkb explains (with almost no detail) the grub bypass I think22:18
clarkbI'll rescue it and restore that file22:18
corvus(i'm going to quietly restart zuul while that's going on)22:20
corvus#status log restarted all of zuul on commit 9a27c447c159cd657735df66c87b6617c39169f622:23
opendevstatuscorvus: finished logging22:23
clarkbit is back up on the bionic kernel22:25
corvusthe whole xen thing seems like enough of a black box that i'd vote for just bringing it back up now and then taking the hit of moving ips to complete the upgrade.  it's not ideal, but the reputation will catch up eventually.22:27
clarkbcorvus: bringing it back up now on the bionic kernel you mean and then replace the server in the near future?22:29
corvusclarkb: yeah, bringing the public service back up in the current kernel state22:30
clarkbgot it22:30
clarkbI'm going to reboot this server again from within the server to double check it is consistent22:31
clarkb we might also try that?22:33
corvusthat looks promising -- i think my concern is that this server seems to be a one-off, and the only way to test any of this is to test it in production?22:34
corvus(because we tested a clone, right? and that didn't have this problem?)22:35
clarkband yes I completely agree22:35
clarkbwell we tested a clone of and on a zuul test instance. But if we booted a new image in rax it would be hvm and ya not reproduce22:36
corvusyeah, so i think we've really just hit the end of the line on this server, and the longer we try to keep it running, the bigger hole we dig.  better to cut losses and start fresh i think.22:36
clarkbfair enough22:37
clarkbI did edit a menu.lst to try in my homedir but sounds like we'd just prefer to enable services instead?22:37
clarkbLet me do a reboot with the config that booted most recently (I haven't modified the server since it booted successfully) just to be sure it seems to consistently come up22:37
corvusclarkb: my preference is weak.  if you want to keep trying, no objection from me.  but i feel like even success looks like we're taking on tech debt.22:39
clarkbcorvus: well I think we should replace the server either way, but thinking that using the new kernel while we sort that out is best22:39
clarkbI am however curious enough to try the chainload. Why don't I try that really quickly22:40
clarkbworst case I revert menu.lst again and unrescue22:40
fungiokay, i'm back and catching up22:40
clarkbfungi: tldr is the uncompressed vmlinuz didn't boot. reverting menu.lst in a rescue instance did get us back to booting22:41
clarkbfungi: if you look in ~clarkb/kernel-stuff/menu.lst.chainload I've drafted a menu.lst that attempts to do what describes22:41
clarkbI'd like to try that if there aren't strong objections. If it works, great; if it doesn't work, "great", we revert. Either way we reenable services, move on, and plan to replace the server22:42
fungiyeah, that looks worth a try, though note that once we chainload to grub2 that *will* be reading our /boot/grub/grub.cfg instead of menu.lst22:44
clarkbyup and that should point to my decompressed file22:45
fungiskimming the grub.cfg though it also looks sane22:45
clarkbshall I proceed with replacing menu.lst and rebooting?22:45
clarkb(also it is possible that the chainload will handle lz4?)22:45
fungithe /boot/vmlinuz-5.4.0-84-generic is decompressed, right?22:45
clarkbfungi: yes you'll see it is like 6 times the size of the kernel we are booted on22:46
fungiyeah, the chainloaded grub2 may also support lz422:46
fungihah, you're not a vi session22:46
clarkbwrong window :)22:46
clarkblet me know when you feel comfortable with this menu.lst update and reboot (or that you don't feel comfortable)22:46
clarkband I'll go ahead and do that22:47
fungiwe may just want regular grub2 rather than grub-xen if we're chainloading, but worth a shot, go for it22:47
clarkbfungi: I think that file is installed by grub-xen. proceeding22:48
fungigrub-pc being regular grub222:48
fungiyeah, if grub-xen is a superset of grub-pc then it's irrelevant22:48
clarkbmenu.lst is updated. rebooting momentarily22:48
clarkbI need to reload my keys I think they timed out but it pings now if you want to jump in22:50
clarkbwill take me a minute to load keys22:50
fungiLinux lists 5.4.0-84-generic #94-Ubuntu SMP Thu Aug 26 20:27:37 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux22:50
fungii'll go ahead and close the rackspace ticket22:50
clarkbfungi: maybe leave a note with them about what we discovered?22:51
clarkbmaybe it will help the next user that hits this22:51
fungioh absolutely22:51
clarkbtldr install grub-xen and chainload and possibly decompress the vmlinuz22:51
clarkbalright what is the order of operations for restoring services here? fungi do you want to do that?22:51
clarkbmaybe apache2 first? check archives then enable mailman and make sure it starts happily then enable exim?22:52
fungiyeah, rackspace ticket now closed with ample detail22:54
clarkbfungi: did you want to do the service enabling or should I?22:55
fungii can start apache up next22:55
fungii've started a root screen session on lists.o.o22:55
clarkbI'll join it22:55
clarkboh wait22:56
fungioh, actually we had a couple of packaging sanity check steps first22:56
clarkbdid you want to do the other steps first for apt cleanup?22:56
fungii'll do those22:56
clarkbalso maybe don't clean up old kernels?22:56
corvusThe chainloader may be a reasonable long-term solution. That might not add much tech debt as long as update-grub doesn't overwrite menu.lst...22:57
clarkbcorvus: I don't think it does because update-grub does all the grub.cfg stuff now22:57
fungiworth backing up the menu.lst we have now just in case22:57
clarkbfungi: that is already done in my kernel-stuff dir in my homedir22:57
fungibut yes, what clarkb said22:57
fungiahh, right22:57
clarkbso ya maybe we get services up and running. Sleep on it and decide what the best course of action is here22:57
fungii'll avoid doing the autoremove for now22:58
fungishould i leave the main mailman.service unit disabled?23:00
clarkbyes it isn't used23:01
fungii've reenabled the 5 initscripts for our current sites23:01
fungione more reboot?23:01
clarkbfungi: I suggested we start services one at a time above23:01
fungiahh, okay can do, instead of rebooting23:01
clarkbapache then check it, then mailman-foo, then exim4 just to be sure its all happy to start23:01
clarkbI think we can reboot afterwards23:01
fungiapache is up23:02
clarkbI get errors loading the listinfo pages for opendev and openstack23:02
clarkbBad header: Please set MAILMAN_SITE_DIR23:02
clarkbok so something about the multisite setup isn't happy?23:03
fungiyeah, but i can load the pipermail archives since they're static23:03
clarkbSetEnv MAILMAN_SITE_DIR /srv/mailman/opendev <- is set in the apache config23:03
fungiand we do load mod_env23:06
fungi doesn't mention any other applicable caveats that i can see23:07
clarkbfungi: I wonder if it is related to the Directory we put it in. We could try setting it globally (though that might have other side effects?)23:08
fungii can't imagine other side effects it would cause if we made it vhost-wide23:08
clarkbseems like it is worth a try?23:09
fungiseems like not much apache invokes is going to care about MAILMAN_SITE_DIR besides the cgi23:09
clarkbstill getting the error. Maybe try stop start?23:09
clarkbhrm no dice23:10
fungiyeah, it's like that value isn't making it through /usr/lib/cgi-bin/mailman/listinfo23:10
clarkbwe are using mod_cgid not mod_cgi, but I wouldn't expect that to cause trouble23:13
clarkbcorvus: is this familiar to you at all?23:14
fungiit's like something's sanitizing the environment before invoking mailman/mm_cfg.py23:14
corvusno, i'm looking but don't see anything yet23:14
clarkband I guess do we want to start mailman and/or exim and see if those are happy?23:14
fungier, /etc/mailman/mm_cfg.py23:15
clarkbfungi: ya its definitely calling that script because the error message is the one that the mailman vhosting added if the env var isn't set23:15
clarkbactually it could be mm doing that?23:16
clarkbsince the path is apache2 -> mm cgi script -> /etc/mailman/ ?23:17
clarkb does that get compiled to what we have in our cgi dir?23:19
clarkbif so I don't see it clearing any env vars23:19
fungiour /usr/lib/cgi-bin/mailman/listinfo is an elf executable, so hard to know23:21
clarkbthere is a note in the apache2 docs about how you should use SetEnvIf if they are part of modrewrite or otherwise early in the handling23:22
clarkbperhaps this is unexpectedly happening early in handling for some reason? an optimization to make cgi faster?23:22
corvusi think it may be cleaned by the mm script... still trying to find links to share and confirm (i just downloaded the tarball from ubuntu)23:23
clarkbcorvus: interesting23:23
corvusi don't know where to find an online linkable browser with version 2.1.2923:24
fungiclarkb: i don't think setenvif is relevant, that seems to be for conditionally setting envvars23:25
fungiand yeah, the cgi scripts themselves sanitizing the environment would certainly be the simplest explanation23:27
corvuswell, anyway... if you download 2.1.29, there's a keepenvars variable in src/common.c which is the wrapper23:28
clarkb has a tarball at least23:28
* corvus < >23:28
fungiany way to control keepenvars from outside?23:29
corvusthat comment is nuts "we should invert this!" "it is done!"23:29
fungii guess not23:29
corvusmaybe, i dunno, delete the TODO when it's done?23:29
fungiwe could probably rework it to switch on HOST?23:29
corvusbut i guess if your version control system is... nevermind.  eyes on the ball.23:29
clarkboh 2.1.29 is the version not .1923:29
corvusi don't know what version we were running before; does anyone?23:29
clarkbno wonder I don't see it23:29
clarkbcorvus: I can look it up23:30
fungii'll check the dpkg.log23:30
clarkblooks like 2.1.2023:30
corvusi'm pretty sure this is the issue, but i'd like to confirm by seeing if... i'm gonna go out on a limb here, the logic had not been inverted yet and only the TODO was there not the done23:30
fungi2021-09-12 18:10:40 upgrade mailman:amd64 1:2.1.20-1ubuntu0.6 1:2.1.26-1ubuntu0.323:30
fungithat was the first upgrade (from xenial to bionic)23:31
corvusyep, it's inverted.  that's the issue.23:31
corvusi'm guessing the kata listserv doesn't use this system?23:32
clarkbcorvus: it does not23:32
clarkbfungi: I agree we can use the HOST var probably23:32
clarkbkeep using MAILMAN_SITE_DIR for the non cgi stuff and switch on it with HOST otherwise?23:32
corvusyeah, i think do the mapping based on host should work23:33
fungiso if $MAILMAN_SITE_DIR undefined MAILMAN_SITE_DIR=$HOST23:33
clarkbfungi: MAILMAN_SITE_DIR = somevalue based on HOST lookup23:33
clarkb -> /srv/mailman/opendev and so on23:33
corvusyep.  and hard-code that mapping in /etc/mailman/mm_cfg.py23:34
clarkbya I'm working on an edit in my homedir23:34
fungiahh, yeah, we map them in /etc/mailman/sites23:34
corvusand as you say, only do the HOST mapping if the site dir isn't already set (for the rest of the scripts to keep working)23:34
corvusoh yea you could use that file to do the lookup rather than hard-coding23:35
corvusfun fact, that file will parse as yaml (though that's not its intention).  but we don't have pyyaml installed system-wide on that host.23:37
corvus(i think that's intended to be an exim lookup file)23:38
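
A rough sketch of reading such a lookup file without pyyaml. The log does not show the contents of /etc/mailman/sites, so the `key: value` per-line layout here is an assumption based on corvus's remarks, and the hostname in the usage note is made up.

```python
def parse_lookup_lines(lines):
    """Parse 'key: value' lines into a dict.

    Assumed format (not confirmed by the log): one mapping per line,
    key and value separated by the first colon, blank lines and
    '#' comments ignored.
    """
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a mapping line; skip it
        mapping[key.strip()] = value.strip()
    return mapping
```

Called on an open file, for example `parse_lookup_lines(open("/etc/mailman/sites"))`, this would give a dict keyed by hostname, so a hypothetical entry `lists.example.org: /srv/mailman/example` becomes one key/value pair.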
fungii ended up using ugly shell to parse it in
clarkbsomething like what I've got in ~clarkb/ maybe?23:39
fungidiffing now23:39
clarkbI'm going to test that as much as I can without going through apache or mailman now23:39
clarkbI had a syntax error missing a close )23:41
fungiclarkb: yeah, i think that should work23:41
fungiassuming correct pairing of parentheses ;)23:41
corvus```VAR_PREFIX      = os.environ['MAILMAN_SITE_DIR']``` needs updating23:41
clarkbcorvus: thanks23:41
corvusotherwise lgtm23:41
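
For illustration, the fallback being discussed might look roughly like the sketch below. The real edit to /etc/mailman/mm_cfg.py is not shown in the log, so the function name, the hard-coded mapping, and the hostnames are all hypothetical. Only `MAILMAN_SITE_DIR`, `HOST`, and `VAR_PREFIX` come from the conversation.

```python
import os

# Hypothetical hostname -> site directory mapping; real values would
# come from the deployment (or from the sites lookup file).
SITE_DIRS = {
    "lists.example.org": "/srv/mailman/example",
    "lists.example.net": "/srv/mailman/other",
}


def resolve_site_dir(environ, site_dirs=SITE_DIRS):
    """Pick the Mailman site directory for this invocation.

    An explicit MAILMAN_SITE_DIR wins, so the non-CGI scripts keep
    working unchanged. Otherwise fall back to a lookup keyed on HOST,
    for the CGI wrapper that strips unknown environment variables.
    """
    site_dir = environ.get("MAILMAN_SITE_DIR")
    if site_dir:
        return site_dir
    host = environ.get("HOST")
    if host in site_dirs:
        return site_dirs[host]
    raise RuntimeError("Please set MAILMAN_SITE_DIR or HOST")
```

With a helper like this, the line corvus points at would read `VAR_PREFIX = resolve_site_dir(os.environ)` instead of indexing `os.environ['MAILMAN_SITE_DIR']` directly.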
clarkbupdated the file with that fixup23:42
clarkbwill python complain about the mix of 2 and 4 spaces?23:42
clarkbI can dedent the new code to match the old code23:42
clarkbI'll do that23:42
fungipython won't care as long as it's not inconsistent within a given nested set23:43
fungibut consistency ftw23:43
clarkbI put a backup of the old file in my homedir too just in case23:44
clarkbfungi: do you want to copy that new file over or should I?23:44
fungii can23:44
clarkbok thanks23:44
clarkb(you have been doing most of the prod updates and keeping it through a central channel helps avoid people walking over each other)23:44
fungimalformed header from script 'listinfo': Bad header: Please set MAILMAN_SITE_DIR or23:44
clarkbalso is that how HOST is passed by apache23:44
clarkboh we may need to set HOST that isn't auto passed?23:44
fungicould be we need to add that in the vhost yeah23:45
clarkbya SetEnv HOST ?23:45
ianw(ps I'm just watching on, but LMN if I can help :)23:46
clarkbfungi: does my or need to be an and?23:47
clarkbfungi: I think that is the bug23:47
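
A guess at the shape of this bug, since the edited file is not shown in the log: if the guard rejects the request when either variable is missing, a CGI request that only carries `HOST` still fails. Rejecting only when both are missing is what the fallback needs. Both function names are hypothetical.

```python
def rejects_with_or(has_site_dir, has_host):
    """Buggy guard: fails when EITHER variable is missing."""
    return (not has_site_dir) or (not has_host)


def rejects_with_and(has_site_dir, has_host):
    """Intended guard: fails only when BOTH variables are missing."""
    return (not has_site_dir) and (not has_host)
```

For a CGI request with `HOST` set and `MAILMAN_SITE_DIR` stripped, the first guard rejects and the second lets the lookup proceed.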
clarkbfungi: updated the copy in my homedir23:47
fungioh i'm so blind. it's getting late23:47
clarkbit works!23:48
fungiit works!23:48
clarkb fails so ya we have to explicitly SetEnv HOST23:48
funginow i wonder if the setenv host is redundant23:48
fungii guess not, right23:48
clarkbI don't think it is. I think we have to set it23:48
fungiokay, i'll patch it into the local copies real quick23:48
fungiall of them are updated on the server now23:50
fungiso should hopefully be working23:50
fungii need to take another break here shortly if folks want to check anything else before we start mailman and exim services23:51
clarkbI just confirmed the all load and that they seem to match the right env :)23:51
clarkbI think we should continue23:51
clarkbcorvus: ianw: any objection to starting mailman now?23:52
clarkbthen we'll start exim23:52
corvusclarkb: lgtm (was just checking out list admin interface)23:53
clarkbI don't expect the non cgi scripts will have problems because we do actually test creating a list etc using MAILMAN_SITE_DIR in ansible23:53
clarkbbut if they do I guess we have to go through and set HOST instead?23:54
clarkbfungi: I think you should go for it?23:54
fungiyeah, syatyomg mailman services next23:54
fungier, starting23:54
clarkbwas that opendev that was started?23:55
fungii started mailman-opendev and see the expected company of processes23:55
fungii'll do the other remaining 4 now23:55
clarkbI see 5 sets of processes23:56
fungi45 processes owned by list, that's 9 per site, like the old days23:56
fungianything else we want to do before starting exim?23:56
clarkbI don't think so? corvus ianw ^ ?23:56
clarkbfungi: seems like there are no objections I think we should probably send it, then try a test email against service-discuss?23:59
fungiyeah, i'll compose a reply to my maintenance announcement real fast23:59
