*** tosky_ is now known as tosky | 14:31 | |
Clark[m] | fungi: I'm just about around. I should find some food. | 14:32 |
fungi | cool, i have some notes in the questions section of https://etherpad.opendev.org/p/listserv-inplace-upgrade-testing-2021 mainly based on experiences from the lists.k.i upgrade | 14:33 |
Clark[m] | fungi: one thing I thought of was we still want to disable the mailman service even though we don't use the main service on that server. That way we don't get a useless mailman set of processes running from the systemd switch | 14:37 |
fungi | yeah, i was unable to work out how to add a disable symlink for the systemd unit for it in advance though | 14:40 |
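For what it's worth, masking (a symlink to /dev/null in /etc/systemd/system) is one way a unit can be neutralized before it even exists; a hedged sketch of that approach, which was not actually done here, and the unit name is an assumption:

```sh
# Hedged sketch: pre-masking a unit so the sysvinit-to-systemd migration
# cannot start it. The unit name is an assumption based on services
# mentioned later in this log.
ln -s /dev/null /etc/systemd/system/mailman.service
systemctl daemon-reload
systemctl is-enabled mailman.service   # should report "masked"
```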
clarkb | fungi: oh and we should consider disabling the backup crons as well so that we aren't trying to run those alongside other work | 14:47 |
clarkb | and then its possible that the borg install may need manual fixup since that is pip installed iirc | 14:49 |
clarkb | post upgrade I mean | 14:50 |
fungi | yeah, i've also added the corresponding enables at the end for those, but we can determine after what testing should be done | 14:50 |
fungi | i do have a feeling anything pip installed system-wide will be just plain broken, at least until ansible reruns | 14:51 |
fungi | since it will be for an entirely incorrect python interpreter | 14:51 |
clarkb | yup and even ansible running may not be sufficient because it will see the install is present | 14:51 |
clarkb | but shouldn't be too bad to sort that out post upgrade | 14:51 |
fungi | maybe not? ansible will be asking pip if it's installed and pip will be asking the newer python which will say it isn't | 14:52 |
fungi | (i think?) | 14:52 |
clarkb | fungi: it might be a virtualenv? but ya maybe that will error in the right way too | 14:52 |
fungi | we'll just have to see, but yeah maybe the first thing we do after taking it out of the emergency disable list is a manual ansible run? | 14:53 |
clarkb | ya it does an ansible pip module command in a virtualenv | 14:53 |
clarkb | fungi: that seems like a reasonable thing to do | 14:53 |
clarkb | I think we can move the venv aside and have it reinstall | 14:53 |
clarkb | if it doesn't handle it automatically I mean | 14:53 |
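A minimal sketch of the post-upgrade fixup discussed above, assuming the borg virtualenv lives under /opt; the path is an assumption, not taken from this log:

```sh
# Hedged sketch: move the stale virtualenv aside so the next Ansible run
# recreates it against the new system Python. The path is an assumption.
mv /opt/borg /opt/borg.old-python
# then remove the host from the emergency disable list and trigger a manual
# Ansible run from the bastion so the pip tasks reinstall into a fresh venv
```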
fungi | i'm in the rackspace dashboard ready to click the clicky for imaging once the server is down | 14:54 |
fungi | i'll do the steps on lines 280-292 now | 14:54 |
clarkb | sounds good | 14:54 |
fungi | status log should be sufficient for this maintenance, you think? | 14:59 |
fungi | announced well in advance and it's a slow time | 14:59 |
clarkb | ++ | 14:59 |
fungi | #status log The mailing list services for lists.airshipit.org, lists.opendev.org, lists.openstack.org, lists.starlingx.io, and lists.zuul-ci.org will be offline over the next 6 hours for server upgrades, messages will be sent to the primary discussion lists at each site once the maintenance concludes | 15:01 |
opendevstatus | fungi: finished logging | 15:01 |
fungi | okay, ready for me to poweroff the server? | 15:01 |
clarkb | I guess so | 15:02 |
fungi | here we go! | 15:02 |
fungi | as soon as the api reports it shutoff i'll begin the snapshot | 15:02 |
clarkb | I guess also check your ssh connection has gone away? | 15:03 |
fungi | yes | 15:04 |
fungi | if the console indicates it's done but it hasn't gone offline i'll server stop it | 15:04 |
fungi | yeah, vnc can't connect, i'll issue a server stop | 15:05 |
fungi | now it's showing shutoff, proceeding with image creation | 15:05 |
fungi | it's queued, name is lists.openstack.org_2021-09-12_pre-upgrade | 15:07 |
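fungi drove this through the Rackspace dashboard; for reference, a hedged openstackclient equivalent of the same snapshot flow (server name and the polling step are illustrative assumptions):

```sh
# Hedged sketch of the snapshot step via openstackclient rather than the dashboard.
openstack server stop lists.openstack.org
openstack server image create --name lists.openstack.org_2021-09-12_pre-upgrade \
    lists.openstack.org
# poll until the image status reports "active"
openstack image show lists.openstack.org_2021-09-12_pre-upgrade -f value -c status
```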
clarkb | does the web dashboard give you a progress indicator for that? I think the api does if you can figure out how to work it, but also unsure how accurate it is in any case | 15:07 |
fungi | basically the same. it gives you a useless progress indicator | 15:07 |
fungi | right now it says "queued: preparing for snapshot..." | 15:08 |
fungi | this will take a while, so i'm going to step away and just check it every few minutes until it finishes | 15:08 |
fungi | no point in hovering | 15:08 |
clarkb | sounds good I'll do the same then. Ping me when you see it as done | 15:09 |
fungi | where "a few minutes" is somewhere between 3-4.5 hours based on earlier testing, i think? | 15:09 |
fungi | i will ping you when we're done with this wait, yep | 15:09 |
clarkb | ya iirc it was in the range of several hours. But the server was online then. Hopefully it goes a little quicker with it offline | 15:10 |
*** diablo_rojo is now known as Guest7049 | 15:51 | |
fungi | here's hoping | 16:02 |
fungi | it's at "saving" now, noting "this step duration is based on underlying virtual hard disk (vhd) size" | 16:04 |
fungi | clarkb: it's active now | 17:35 |
fungi | i'm going to boot the server and make sure the services we disabled haven't started (in a root screen session in case you want to attach) | 17:36 |
Clark[m] | Ok I'm migrating back to my desk | 17:37 |
fungi | root screen session on lists.o.o is up | 17:38 |
clarkb | I'm attached to the screen | 17:38 |
fungi | the services we disabled are still not running based on ps | 17:39 |
fungi | i mived the esm unenroll and puppet uninstall to the beginning of the upgrade, due to complications observed on the lists.k.i upgrade | 17:39 |
fungi | er, moved | 17:39 |
fungi | are you good with that? | 17:39 |
clarkb | ya I noticed that and makes sense to me | 17:39 |
fungi | ua gets weird following a dist upgrade, in particular | 17:40 |
fungi | ua and puppet are cleaned up, proceeding to update the package lists, though this and the dist-upgrade should no-op ideally | 17:42 |
clarkb | yup | 17:42 |
fungi | looks like detaching from ua doesn't actually remove the sources list entries | 17:42 |
fungi | i'm going to clean that up now | 17:42 |
fungi | maybe because ansible? | 17:42 |
clarkb | fungi: no I think they are always there on ubuntu | 17:43 |
fungi | the esm sources? | 17:43 |
clarkb | yes they are prsent on my local bionic fileserver for example but it isn't enrolled in esm | 17:43 |
clarkb | then some other mechanism actually turns it on I think | 17:43 |
fungi | ahh, okay, ignoring that, then | 17:43 |
clarkb | its not entirely clear to me how that works (and I suspect that is by design) | 17:43 |
clarkb | hrm though now I'm double checking and trying to see where they are on my local machine | 17:44 |
clarkb | I remember looking at my local machine when we set up the esm stuff and being confused at how much was present | 17:44 |
fungi | ubuntu-release-upgrader-core is already installed so no need to add it | 17:45 |
clarkb | maybe I was wrong about the sources.list entries. It is in the apt auto update by default | 17:45 |
fungi | note that on bridge.o.o we don't have esm sources by default | 17:45 |
clarkb | which I guess gets ignored if you don't have the sources.list definitions | 17:45 |
clarkb | so ya you can probably clean those up | 17:46 |
fungi | yeah, removing /etc/apt/sources.list.d/ubuntu-esm-infra.list will solve that | 17:46 |
clarkb | its easy enough to reenroll if that becomes necessary so rm'ing that file should be safe | 17:46 |
fungi | doing it now | 17:46 |
fungi | package list re-updated, dist-upgrade still predictably no-io | 17:47 |
fungi | no-op | 17:47 |
fungi | ready for do-release-upgrade? | 17:48 |
clarkb | I guess so seems like everything looks the way we expected. The snapshot succeeded right? that would be the only other thing I can think of to check | 17:48 |
fungi | yes, claims to have a 42.14 gb image in dfw named lists.openstack.org_2021-09-12_pre-upgrade with a status of "active" | 17:49 |
clarkb | then ya I think we are ready | 17:49 |
fungi | pushing the shiny, candy-red button | 17:49 |
fungi | telling it to run the alternate ssh in case we need to fix sshd | 17:50 |
clarkb | yup | 17:50 |
fungi | i'll add the iptables rule for it in a second screen window | 17:50 |
clarkb | I've ssh'd in via port 1022 (but you should too :) ) | 17:51 |
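do-release-upgrade offers to run a fallback sshd on port 1022 during the upgrade; a hedged sketch of the corresponding firewall rule and connection (the exact rule fungi added may differ):

```sh
# Hedged sketch: permit the upgrade's fallback sshd on port 1022, then connect.
iptables -I INPUT -p tcp --dport 1022 -j ACCEPT
ssh -p 1022 root@lists.openstack.org
```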
fungi | i have just now | 17:51 |
fungi | continuing | 17:51 |
fungi | you good with the package changes? | 17:53 |
clarkb | I think so. It seemed the stuff it complained about not having candidates was all version-specific packages that will be replaced by other version-specific packages installed via virtual packages a level up | 17:54 |
fungi | i agree | 17:54 |
fungi | proceeding | 17:55 |
clarkb | neat the default site override thing we do prevents it from trying to generate all the mailman languages | 17:58 |
clarkb | I expect we'll be ok with that since we don't use a lot of languages and have content already ? | 17:58 |
fungi | yeah, seems so. i guess we just check everything out later and see if there's anything broken/missing | 17:58 |
clarkb | ya | 17:58 |
fungi | worst case it'll just be the webui/archive not rendering or looking strangely | 17:59 |
clarkb | and can copy files from list.kc.io if we need them quickly | 17:59 |
clarkb | fungi: we might need to keep the old setting for the arp bit? the comment says xen pv guests need it | 18:02 |
fungi | it wants to update /etc/sysctl.conf, and we seem to specifically set -net.ipv4.conf.eth0.arp_notify = 1 with a comment that it's needed for pv xen guests | 18:02 |
fungi | yeah | 18:02 |
fungi | the maintainers version of that file has all lines commented out anyway, so keeping ours should be fine | 18:03 |
clarkb | I suspect that comes from rax and not anything we set. I say we keep our version | 18:03 |
fungi | i suppose this is an artifact of this being an ancient ported flavor | 18:03 |
fungi | seems the conffile updates occur in a nondeterministic order, but keeping our modified login.defs | 18:10 |
clarkb | etherpad says /etc/login.defs is a keep our version | 18:10 |
clarkb | (we override the uid and gid ranges iirc) | 18:10 |
clarkb | fungi: etherpad says this is a keep of our version? | 18:19 |
fungi | yep | 18:21 |
fungi | and the one after it was as well | 18:21 |
clarkb | looks like it knows about the languages after all. I guess the selection tool needs the site dir but then generation doesn't | 18:22 |
clarkb | (that is good means we probably don't need to do anything) | 18:22 |
fungi | and installing the maintainer's ntp.conf | 18:23 |
fungi | keeping our sshd_config | 18:24 |
clarkb | noting that the current swap device appears to be xvdc1 so the warnings about initramfs there show it doing the right thing | 18:26 |
clarkb | but also we don't really suspend to memory except for server migrations? its probably super minor | 18:26 |
clarkb | potentially related my laptop has started to refuse to suspend to memory on linux 5.14 | 18:27 |
clarkb | I guess I should double check my initramfs and swap device uuids | 18:27 |
clarkb | fungi: this is another keep | 18:27 |
fungi | yep, keeping the unattended-upgrades config | 18:28 |
fungi | it hasn't asked about iptables yet, right? | 18:28 |
fungi | oh, or logind.conf | 18:28 |
clarkb | fungi: not that I have seen | 18:28 |
clarkb | ya neither have shown up | 18:28 |
clarkb | I guess it won't ask about those | 18:31 |
fungi | removing obsolete packages now | 18:31 |
clarkb | time to remove obsolete packages (which means the installs are done) | 18:31 |
fungi | upgrade is complete, before restarting i'll re-disable services in a second screen window | 18:34 |
clarkb | ++ | 18:34 |
fungi | currently there are no apache, mailman or exim processes still | 18:35 |
clarkb | but we will disable them anyway to ensure the systemd units get disabled? | 18:36 |
fungi | yep | 18:37 |
fungi | "Removed /etc/systemd/system/mailman-qrunner.service." | 18:37 |
fungi | that was the only one which changed | 18:37 |
clarkb | and we expected that | 18:37 |
fungi | okay, ready for the reboot? | 18:37 |
clarkb | I think so | 18:37 |
fungi | done | 18:37 |
fungi | it's back up and i've started the root screen session again | 18:38 |
clarkb | give me a sec to join | 18:38 |
clarkb | I'm in the screen | 18:38 |
fungi | looks like htcacheclean started on boot but not apache proper | 18:39 |
clarkb | that should be fine | 18:39 |
fungi | the apt-get clean wants to remove a couple of old linux-headers package versions | 18:39 |
fungi | agreeing to it | 18:39 |
clarkb | fungi: maybe double check that isn't the kernel we are running? | 18:39 |
clarkb | but ya agreeing to it seems safe | 18:40 |
fungi | Linux lists 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | 18:40 |
clarkb | cool that doesn't match what it wants to remove | 18:40 |
fungi | it wants to remove linux-headers-4.4.0-212 and linux-headers-4.4.0-213 yeah | 18:40 |
fungi | (it generally won't remove the kernel you're booted on, and it won't see the headers package as needing cleaned up if that version of the kernel package is installed, for future reference) | 18:41 |
clarkb | good to know | 18:41 |
clarkb | before we start the next upgrade should we double check that swap was enabled properly as a follow-on from the warnings about finding the swap device for resume (I think that doesn't matter as that is just for suspending to memory though) | 18:42 |
fungi | ready to do-release-upgrade for focal? | 18:42 |
fungi | oh, yep | 18:42 |
fungi | lgty? | 18:42 |
clarkb | swap looks good to me. I think we can proceed with the next do-release-upgrade | 18:42 |
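A hedged sketch of the swap sanity check mentioned above (the resume conffile path is a common Ubuntu default, not something confirmed in this log):

```sh
# Hedged sketch: confirm swap came back after the reboot and see what the
# initramfs thinks the resume (suspend-to-disk) device is.
swapon --show                                   # expect /dev/xvdc1 per the earlier observation
free -h
cat /etc/initramfs-tools/conf.d/resume 2>/dev/null
```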
fungi | and underway | 18:43 |
fungi | starting the alternate sshd again | 18:43 |
clarkb | I'm on via the alternate | 18:43 |
fungi | and added the iptables rule for it and connected, yeah | 18:43 |
fungi | continuing | 18:44 |
fungi | accepting the package changes if that also looks right to you | 18:45 |
clarkb | oh I missed it while coordinating some questions around lunch. I'm sure its fine though | 18:45 |
fungi | cool | 18:46 |
clarkb | it looks similar to what I expect from what little I see there :) | 18:46 |
fungi | going now | 18:46 |
clarkb | interesting that it is prompting for the list of services to restart rather than just a binary yes/no restart prompt | 18:47 |
clarkb | I'm ok with that list though it has exim4 and apache in it. | 18:48 |
clarkb | as I suspect that is all the 'yes' selection was doing previously | 18:48 |
fungi | yep, agreeing to it now | 18:49 |
fungi | will need to check that it doesn't actually restart exim4 and apache2 | 18:49 |
clarkb | I added a note to do that in the etherpad | 18:50 |
fungi | it's prompting for a different set of service restarts next | 18:55 |
clarkb | fungi: I guess we accept this list too | 18:55 |
fungi | yep | 18:55 |
clarkb | I think it didn't prompt to auto restart so we just accept the lists it gives us instead | 18:55 |
clarkb | and then we are equivalent to selection yes to auto restart if it had prompted | 18:55 |
clarkb | fungi: it is asking about sysctl.conf. We probably want to double check the diff then keep again? | 19:04 |
fungi | yeah, looking | 19:05 |
fungi | same situation as before, so keeping ours | 19:06 |
clarkb | ++ | 19:06 |
clarkb | This is a keep according to the etherpad because we configure snmpd with ansible | 19:07 |
fungi | yep, keeping | 19:07 |
clarkb | this is another keep of our version | 19:10 |
fungi | keeping our unattended upgrades config too | 19:10 |
clarkb | and another keep for sshd_config | 19:10 |
fungi | yep, kept | 19:12 |
clarkb | time to remove obsolete packages | 19:17 |
fungi | yep, agreeing | 19:19 |
clarkb | now we check that apache2 exim4 et al are not running and reboot? | 19:23 |
fungi | yeah, looking | 19:23 |
fungi | just htcacheclean | 19:24 |
fungi | so i think we're safe to reboot | 19:24 |
clarkb | sounds like it | 19:24 |
fungi | doing | 19:24 |
fungi | so once this is booted again we'll be on focal! | 19:25 |
fungi | it's taking longer to boot than i would expect | 19:26 |
clarkb | ya | 19:26 |
fungi | vnc isn't connecting | 19:27 |
clarkb | I wonder if we'll have to hard reboot it from the api? | 19:28 |
clarkb | seems like vnc (and maybe even ping) should work if it was running at all | 19:28 |
fungi | yeah, still failing to connect. i'll hard reboot the instance | 19:29 |
clarkb | ok | 19:29 |
fungi | a hard reboot put it into error state | 19:30 |
clarkb | hrm | 19:31 |
clarkb | {'message': 'Failure', 'code': 500, 'created': '2021-09-12T19:30:26Z'} doesn't give much indication to what failed | 19:31 |
fungi | fault | {'message': 'Failure', 'code': 500, 'created': '2021-09-12T19:30:26Z'} | 19:32 |
clarkb | do we file an issue with rax? (or a phone call?) | 19:33 |
fungi | i suppose i could try to stop and start the server again | 19:33 |
clarkb | ok | 19:33 |
fungi | a server stop put it into shutoff state at least | 19:34 |
fungi | starting it again now | 19:34 |
fungi | it's staying in shutoff now | 19:35 |
clarkb | ya I see that too | 19:35 |
fungi | tried starting it again, still in shutoff | 19:36 |
clarkb | fungi: are you using the web ui or the api? I wonder if the web ui for the instance shows anything useful that the api might not | 19:36 |
fungi | api | 19:37 |
fungi | well, cli | 19:37 |
fungi | trying from the webui now | 19:37 |
fungi | doesn't seem to do anything | 19:38 |
clarkb | I see it in an error state via openstackclient now fwiw | 19:38 |
clarkb | with the same failure 500 fault message | 19:38 |
fungi | oh, yep, it went back to error | 19:38 |
clarkb | it did update the timestamp on that though so it wasn't just returning the same message | 19:38 |
fungi | the webui says "The server has encountered an error. Contact support for troubleshooting." | 19:38 |
fungi | i guess i'll open a ticket now | 19:38 |
clarkb | ok | 19:38 |
clarkb | is there any info I can help gather? I assume you've got it under control | 19:44 |
fungi | ticket #210912-ord-0000303 | 19:44 |
fungi | yeah, i don't know what else to do at this stage | 19:44 |
fungi | if we get closer to the end of the maintenance window we'll probably want to status notice that things are running over, but in the meantime i guess we hope support gets back to us quickly | 19:45 |
clarkb | Also something like #status notice The server hosting mailing lists for airship, opendev, openstack, starlingx, and zuul entered an error state during its operating system upgrades. We have filed a ticket with the cloud provider to help debug the cause. | 19:45 |
clarkb | ya | 19:45 |
fungi | if they don't get back to us quickly, we'll need to think about what bringing up a replacement server from the image we made looks like | 19:46 |
clarkb | As far as other options go I think we can boot from our snapshot and essentially revert. Ideally we'd do that with a nova rebuild of the existing server to preserve the IP address but that seems risky considering the server is in an error state. We can boot it on an entirely new instance as well (and try really hard to deal with the IP address stuff?). Remember we have to recover the | 19:47 |
clarkb | snapshot to update /etc/fstab to remove the swap partition until we can boot it properly and create a new swap partition | 19:47 |
clarkb | A third option would be to boot a new focal instance and move all of the lists over to it (I don't know what that involves, I suspect it might be difficult to not email all the list owners when the new lists are created) | 19:47 |
clarkb | oh we do have the flag to not email them though I bet we could set that in the hostvars for a new instance | 19:48 |
clarkb | fungi: I'll take a break for lunch now then check back in after to see if we've got a response. Let me know if you think there is anything else I can be doing | 19:50 |
fungi | my gut says this is an old pv flavor and the new focal kernel won't work on it | 19:51 |
clarkb | oh hrm. | 19:51 |
fungi | in which case we might be able to force it to boot with the bionic kernel | 19:51 |
clarkb | ya or have them change the flavor for us under the hood? | 19:51 |
fungi | it rebooted into bionic just fine | 19:51 |
fungi | yeah, maybe, i dunno how messy that is | 19:52 |
clarkb | to something that handles focal (which we do have focal nodes so it is possible) | 19:52 |
clarkb | ya I have no idea either | 19:52 |
clarkb | and yup bionic booted | 19:52 |
clarkb | ok eating lunch will check in in a bit | 19:52 |
fungi | k | 19:52 |
clarkb | fungi: I do wonder if we shouldn't consider a phone call before giving up on waiting. It's annoying to do but iirc ianw wrote down the account verification details in the typical spot for doing that | 19:53 |
clarkb | anyway I'm hungry and I can smell lunch :) | 19:53 |
fungi | yeah, apparently we've been continuously in-place upgrading this since precise (12.04 lts) | 19:54 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM just testing mailman with bionic https://review.opendev.org/c/opendev/system-config/+/808569 | 20:09 |
clarkb | fungi: ^ just to give us an idea if our ansible has any problems with bionic | 20:09 |
clarkb | fungi: thinking out loud here: if we wanted to get prepped we could boot a new instance off of the snapshot and then sort out its swap situation. However, I need a bit more time away from the keyboard so can't help with that just yet | 20:10 |
clarkb | but then if we have to we can update DNS and all that to use that server while we figure out our next move | 20:10 |
fungi | mmm, yeah | 20:11 |
fungi | ticket update | 20:16 |
fungi | The server is failing to boot because the bootloader cannot load the grub/kernel files. | 20:17 |
fungi | message: xenopsd internal error: XenguestHelper.Xenctrl_dom_linux_build_failure(2, " panic: xc_dom_core.c:616: xc_dom_find_loader: no loader\\\"") | 20:17 |
fungi | they recommend booting in rescue mode. i'll give it a shot now | 20:17 |
Clark[m] | https://run.tournament.org.il/xenserver-internal-error-failure-no-loader-found/ | 20:20 |
fungi | i think there must be a bit of a disconnect with the webui. it let me select to reboot into rescue mode, and gave me a temporary ssh password, but the server never actually rebooted | 20:20 |
Clark[m] | fungi: seems that says it is common if you try to run a non pv aware kernel? | 20:20 |
fungi | oh, i just needed to wait a little longer | 20:20 |
fungi | "i'm ssh'd into the rescue rootfs now | 20:22 |
Clark[m] | Maybe check what kernel is installed then we see if maybe we need a different kernel for pv support? | 20:23 |
Clark[m] | I should be back to the keyboard soon | 20:23 |
fungi | looks like xvdb1 is our production rootfs | 20:24 |
fungi | fsck runs clean on it | 20:25 |
fungi | i've got it mounted on /mnt | 20:25 |
fungi | installed kernel packages are linux-image-3.2.0-77-virtual linux-image-4.15.0-156-generic linux-image-4.4.0-214-generic linux-image-5.4.0-84-generic | 20:26 |
clarkb | fungi: the -virtual in the really old kernel makes me wonder if we need a -virtual kernel | 20:27 |
fungi | primary grub boot entry is for 5.4.0-84 | 20:27 |
* clarkb loads up ubuntu package search | 20:27 | |
clarkb | fungi: https://packages.ubuntu.com/focal-updates/linux-image-virtual says the generic kernel is the virtual kernel :/ | 20:30 |
clarkb | that was true for bionic as well, but bionic did boot. So ya maybe they remove paravirtualization support? | 20:30 |
fungi | we could try just switching to 4.15.0-156 in the grub config | 20:31 |
clarkb | ya that would at least allow us to check if it boots I guess | 20:31 |
fungi | though the error really seems like it's having trouble with the bootloader not the kernel? | 20:31 |
fungi | there's a grub-xen package | 20:32 |
clarkb | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699381 <- says it could be the compression algorithm for the initramfs? | 20:33 |
clarkb | https://packages.ubuntu.com/focal/amd64/grub-xen/filelist doesn't seem to have much in that grub-xen package /me goes to look for what is in those files | 20:34 |
fungi | we presently have grub-pc installed | 20:34 |
fungi | but yeah i think this must be related to the server still using a pv flavor | 20:35 |
clarkb | looks like hvm instances have grub-pc installed, but I think that is expected since hvm is far more like a normal thing | 20:35 |
clarkb | https://eengstrom.github.io/musings/booting-xen-on-ubuntu-via-grub-with-uefi | 20:36 |
fungi | https://xenproject.org/2015/01/07/using-grub-2-as-a-bootloader-for-xen-pv-guests/ | 20:36 |
clarkb | hrm my link is for bionic though and it was fine | 20:37 |
fungi | grub-xen seems to be teh grub2 successor of pvgrub | 20:37 |
clarkb | fungi: neat, I guess maybe we try that then? | 20:38 |
clarkb | how do we install that properly on the recovery installation? chroot into the other image and then do it? | 20:38 |
clarkb | https://packages.ubuntu.com/focal/amd64/grub-xen-bin/filelist is installed as a dep of grub-xen and that seems to pull in all the bits there | 20:39 |
fungi | looks like prior to the upgrade we were running grub-pc:amd64 2.02~beta2-36ubuntu3.32 | 20:39 |
fungi | i can try to install grub-xen within a chroot | 20:40 |
fungi | chroot of the production rootfs i mean | 20:40 |
fungi | that can be fiddly when it comes to block device detection for grub-install | 20:41 |
clarkb | fungi: that seems reasonable given the error provided by the cloud I think | 20:41 |
clarkb | fungi: disk image builder does do it somehow | 20:41 |
fungi | but i'll recheck the grub config | 20:41 |
clarkb | I'll see if I can discern what dib does | 20:41 |
fungi | it's going to uninstall grub-gfxpayload-lists and grub-pc | 20:42 |
fungi | oh, that's fun... "Temporary failure resolving 'us.archive.ubuntu.com'" | 20:42 |
clarkb | `/usr/sbin/grub-install '--modules=part_msdos part_gpt lvm biosdisk' --force /dev/loop0` | 20:42 |
clarkb | I think dib manually executes the installation | 20:43 |
fungi | ahh, i'm going to need to mount devfs and proc et cetera | 20:43 |
fungi | yeesh, systemd wants several dozen things mounted (no joke) | 20:45 |
fungi | haha, though the reason dns isn't working is that we configure the server to do lookups through a local unbound | 20:49 |
fungi | undoing that in /etc/resolv.conf for a bit | 20:49 |
clarkb | noted | 20:50 |
fungi | but i did also mount /dev, /proc and /sys because it's probably going to need those when doing stuff with the grub packages | 20:50 |
fungi | complained about being unable to log to /dev/pts | 20:51 |
fungi | because i didn't mount that of course | 20:51 |
fungi | looks like it scanned and chose the rescue boot partition instead of the one i'm chrooted into | 20:52 |
clarkb | fungi: if the package is installed I suspect that you can run something like the command dib runs above though? | 20:53 |
clarkb | though I don't think you should need the force or the modules listing? | 20:54 |
clarkb | ya --force means "install even if problems are detected" we probably want to evaluate those if it finds them | 20:55 |
clarkb | and modules just preloads stuff but the default is to have all the modules be available based on the man page | 20:55 |
fungi | mmm, yeah do i want it installed into the partition or into the mbr? | 20:56 |
fungi | i went ahead and did both but i suspect it'll be the mbr that matters | 20:56 |
clarkb | INSTALL_DEVICE must be system device filename. grub-install copies GRUB images into boot/grub. On some platforms, it may also install GRUB into the boot sector. | 20:57 |
clarkb | I think it is the mbr that matters | 20:57 |
clarkb | but it figures it out sounds like | 20:57 |
fungi | the menu.lst looks like it contains the kernels which are installed in the chroot, so hopefully we're all set | 20:58 |
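Pulling the rescue-mode steps above together, a hedged sketch; device names follow what was reported in this log, and the exact commands fungi ran may have differed:

```sh
# Hedged sketch of the rescue-mode chroot used to install grub-xen.
mount /dev/xvdb1 /mnt                                   # production rootfs
for fs in dev dev/pts proc sys; do mount --bind /$fs /mnt/$fs; done
# (remember: resolv.conf inside the chroot pointed at the local unbound and
# needed a temporary edit before apt could resolve the mirrors)
chroot /mnt apt-get install grub-xen                    # pulls in grub-xen-bin, removes grub-pc
chroot /mnt grub-install /dev/xvdb                      # MBR install is the one that matters
```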
clarkb | fingers crossed | 20:59 |
fungi | should i undo the resolv.conf change, umount /dev, /proc and /sys, umount the rootfs and then reboot into normal mode now? | 20:59 |
fungi | anything else i should check first? | 20:59 |
clarkb | I can't think of anything else, and you probably grok this stuff better than me :) | 20:59 |
clarkb | note you may have to "unrescue" the server to boot into normal mode | 21:00 |
clarkb | not sure what a straight up reboot will do, and ya I would umount the rootfs before unrescuing | 21:00 |
fungi | did all the above and now i've asked it to "exit rescue mode" which i guess is the webui's unrescue | 21:01 |
clarkb | I don't see it pinging yet, any chance the console is helpful? | 21:02 |
fungi | the webui still says it's unrescuing | 21:03 |
clarkb | ah ok | 21:03 |
clarkb | I should learn to be more patient | 21:03 |
fungi | i'll also need to switch to dinner prep mode shortly | 21:03 |
clarkb | the openstack client shows it as error state though | 21:03 |
clarkb | and the timestamp of that is from around when you unrescued | 21:03 |
fungi | ugh | 21:04 |
clarkb | (so I don't think it is stale from an hour or two ago whenever that was) | 21:04 |
clarkb | https://docs.rackspace.com/support/how-to/rebuild-a-cloud-server that does say rebuilding a server preserves its IP address. In theory we can use that to revert back to xenial | 21:05 |
clarkb | I think that might be the best option for now as it should get us back to a working state with the same IP addrs and we won't have to sort through block lists as part of recovery | 21:06 |
clarkb | er for now meaning "if we can figure it out through rescue" | 21:06 |
clarkb | *if we can't | 21:06 |
clarkb | fungi: maybe try tell grub to boot the bionic kernel? | 21:07 |
clarkb | and if that doesn't work we rebuild? | 21:07 |
clarkb | of course once we get to that state I'm not sure how we move forward from there so maybe that is for information gathering rather than a solution | 21:07 |
fungi | yeah, could give the old fallback kernel a shot | 21:08 |
clarkb | its a "oh neat bionic's kernel works for some not understood reason, but we can't keep it up to date so what do?" | 21:08 |
clarkb | also if we think this might be a sit on it and think situation I can probably live with that into tomorrow? but considering a rebuild with our snapshot should revert us pretty cleanly I also like that option | 21:09 |
fungi | tried stopping the server and then rebooting it, but still went back into error | 21:11 |
fungi | reentering rescue mode now | 21:11 |
fungi | exiting rescue mode now after removing the focal kernel entries from /boot/grub/menu.lst | 21:15 |
fungi | though if grub-xen is grub2 based shouldn't that be /boot/grub/grub.cfg instead? | 21:16 |
fungi | Linux lists 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | 21:17 |
fungi | 21:17:23 up 2 min, 1 user, load average: 0.41, 0.20, 0.08 | 21:17 |
fungi | clarkb: ^ | 21:17 |
clarkb | aroo | 21:17 |
fungi | i think it's the focal kernel which is the problem? | 21:17 |
clarkb | ok so it is the kernel? | 21:17 |
fungi | i expect so, though this is still with grub-xen installed too | 21:18 |
clarkb | linux-image-extra-virtual - Extra drivers for Virtual Linux kernel image <- is a thing | 21:18 |
clarkb | ya it could've been both things I suppose | 21:18 |
fungi | i want to try another reboot before we go further | 21:19 |
clarkb | ok | 21:19 |
fungi | the webui never indicated it was done exiting rescue mode but it seems to have been | 21:19 |
fungi | rebooting the server now | 21:19 |
clarkb | seems that extra-virtual is a largely empty package and it just depends on the generic image | 21:19 |
fungi | i think the -virtual means it's a virtual package (not an actual package, just a name reference to another package) | 21:20 |
clarkb | oh I see | 21:20 |
fungi | in debian there's the idea of "virtual packages" (packages which don't really exist, they're just convenient pointers to other package names) and "dummy packages" (basically empty packages which only exist to depend on other packages) | 21:21 |
fungi | though also sometimes people conflate the two | 21:21 |
clarkb | seems the server is still pinging did the reboot complete? | 21:21 |
fungi | 21:21:58 up 1 min, 1 user, load average: 0.45, 0.24, 0.09 | 21:22 |
fungi | yep | 21:22 |
fungi | probably okay to switch to bringing services back up | 21:22 |
clarkb | fungi: but this kernel isn't sustainable? | 21:22 |
clarkb | like what do we do if we need kernel updates? | 21:22 |
fungi | it's not sustainable, but it's probably something we can figure out once services are running again | 21:23 |
clarkb | and if we enable services now rolling back becomes potentially much more difficult | 21:23 |
fungi | https://unix.stackexchange.com/questions/583714/xen-pvgrub-with-lz4-compressed-kernels | 21:23 |
fungi | maybe it's the kernel compression after all, like you surmised | 21:24 |
fungi | "an apt hook using extract-vmlinux to decompress kernels during installation" | 21:24 |
clarkb | ya the comments say there are a couple of security concerns with the way it is written and that it doesn't work as is for focal | 21:25 |
fungi | right, though it suggests we could probably manually decompress the kernel as a temporary workaround | 21:27 |
clarkb | oh I see what you are saying. Do you want to try that now? If that works then I'm good with enabling services | 21:27 |
clarkb | fungi: do you want to do that in a root screen? I'm happy to follow along or have you just do it as well :) | 21:28 |
fungi | maybe, i'm trying to understand extracl-vmlinux first | 21:28 |
fungi | extract-vmlinux | 21:28 |
clarkb | looks like it iterates through all of the various compression algorithms that it might be and bails out when it finds the first one that is valid | 21:29 |
fungi | yep | 21:29 |
clarkb | fungi: we can also see if we can compare the current kernel to the focal kernel's compression type | 21:30 |
clarkb | file says they are both regular files though (but I think that is because the file is "regular" and the vmlinuz is in there somewhere else?) | 21:30 |
fungi | the `file` utility claims both the problem and working kernels are a bzImage | 21:31 |
fungi | you need to use sudo | 21:31 |
clarkb | ah | 21:31 |
corvus | clarkb, fungi: i see there are issues; is there a tl;dr + something i can help with? | 21:31 |
fungi | for some reason i guess ubuntu assumes user-readable kernel files are a security risk | 21:31 |
clarkb | corvus: rax xen cannot boot the focal kernel but can boot the bionic kernel | 21:31 |
clarkb | fungi: note the script in stackoverflow further checks things beyond bzimage | 21:32 |
fungi | corvus: tl;dr is that lists.o.o is a pv xen server, and the focal kernel won't boot on it, but the bionic kernel will | 21:32 |
clarkb | corvus: we suspect it may be the vmlinuz compression type | 21:32 |
fungi | corvus: yeah looks like it may be because ubuntu switched to lz4 compression for kernel files | 21:32 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir https://review.opendev.org/c/openstack/diskimage-builder/+/804000 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Support grubby and the Bootloader Spec https://review.opendev.org/c/openstack/diskimage-builder/+/804002 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: RHEL/Centos 9 does not have package grub2-efi-x64-modules https://review.opendev.org/c/openstack/diskimage-builder/+/804816 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add policycoreutils package mappings for RHEL/Centos 9 https://review.opendev.org/c/openstack/diskimage-builder/+/804817 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add DIB_YUM_REPO_PACKAGE as an alternative to DIB_YUM_REPO_CONF https://review.opendev.org/c/openstack/diskimage-builder/+/804819 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add reinstall flag to install-packages, use it in bootloader https://review.opendev.org/c/openstack/diskimage-builder/+/804818 | 21:33 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: WIP Add secure boot support to ubuntu. https://review.opendev.org/c/openstack/diskimage-builder/+/806998 | 21:33 |
fungi | corvus: so we're trying to decide the best way forward so we feel comfortable bringing services back up on it | 21:33 |
clarkb | corvus: currently we are thinking if we can manage to de|re compress the vmlinuz for focal such that it is bootable then we can roll forward (and use a hook to do that automatically in the future). But if not then we might be best doing a server rebuild against the snapshot we took at the beginning of this process | 21:33 |
clarkb | fungi: `grep -aqo "${LZ4_HEADER}" ${KERNEL_PATH}` is the thing to try against the bionic and focal kernels to see if they differ I guess? | 21:34 |
corvus | ok. i have no direct experience with this, so all i have to offer is another set of eyes/hands if necessary :) | 21:34 |
fungi | corvus: we've temporarily updated the grub config to boot the old bionic kernel instead of the focal one, but are concerned that isn't sustainable long term because we want to be able to update the kernel for vulnerabilities in the future | 21:35 |
clarkb | fungi: I'm thinking we do the grep against the lz4 header and try to confirm that differs between the kernels. If it does we attempt the kernel conversion and reboot. | 21:35 |
clarkb | corvus: https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux is the script that linux provides to do the conversion | 21:36 |
corvus | yeah, it sounds like get something that will work for ~1 week, then regroup / possibly schedule an outage to move to a server with new ip addrs might be a good plan? | 21:36 |
clarkb | corvus: ya that too | 21:36 |
fungi | clarkb: `sudo grep -aqo '\002!L\030' /boot/vmlinuz-5.4.0-84-generic` doesn't find anything | 21:37 |
clarkb | fungi: huh I wonder if is is xz then? | 21:37 |
fungi | aha, it may be escaping | 21:38 |
fungi | lz4match=$(printf '\002!L\030'); sudo grep -aq "$lz4match" /boot/vmlinuz-5.4.0-84-generic | 21:38 |
clarkb | fungi: oh also it is grep -q | 21:38 |
fungi | that exits 0 | 21:38 |
clarkb | you need to check the exit code ya | 21:38 |
fungi | lz4match=$(printf '\002!L\030'); sudo grep -aq "$lz4match" /boot/vmlinuz-4.15.0-156-generic | 21:38 |
clarkb | so that means it matched | 21:38 |
fungi | that exits 1 | 21:38 |
clarkb | nice we have likely tracked this down? what an adventure | 21:38 |
fungi | it was the -o which was making it not match | 21:39 |
fungi | mmm, no not the -o either | 21:40 |
fungi | yeah, have to `printf '\002!L\030'` | 21:40 |
fungi | that's it | 21:40 |
fungi | okay, so that confirms the focal kernel is indeed lz4 | 21:40 |
fungi | anyway, i thought we were very close to being able to wrap this up, but i'm really overdue to switch to dinner prep before christine tries to gnaw off my arm | 21:41 |
clarkb | fungi: reading linux's script it basically finds the positions of the compressed file for the lz4 bits then passes that through a decompression and leaves the other bits behind? I guess you only need that last nit? | 21:42 |
clarkb | s/nit/bit/ | 21:42 |
fungi | right, the offset seems to be skipping the header part | 21:42 |
clarkb | fungi: ya enjoy dinner, I can probably do the conversion while you eat then we can try rebooting after? | 21:42 |
clarkb | I will use linux's script and not the parent one to do this as a one off | 21:43 |
fungi | sure, shall i do a #status notice since we're ~45 minutes past the announced end of our window already? | 21:43 |
clarkb | ++ something along the lines of we've isolated the issue and are working to fix it now | 21:43 |
fungi | status notice The mailing list services for lists.airshipit.org, lists.opendev.org, lists.openstack.org, lists.starlingx.io, and lists.zuul-ci.org are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen | 21:45 |
fungi | like that? | 21:45 |
clarkb | lgtm | 21:45 |
fungi | #status notice The mailing list services for lists.airshipit.org, lists.opendev.org, lists.openstack.org, lists.starlingx.io, and lists.zuul-ci.org are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen | 21:45 |
opendevstatus | fungi: sending notice | 21:45 |
fungi | okay, will dinner as quickly as possible and return | 21:45 |
-opendevstatus- NOTICE: The mailing list services for lists.airshipit.org, lists.opendev.org, lists.openstack.org, lists.starlingx.io, and lists.zuul-ci.org are still offline while we finish addressing an unforeseen problem booting recent Ubuntu kernels from PV Xen | 21:45 | |
fungi | thanks! | 21:45 |
clarkb | vmlinuz-5.4.0-84-generic is the file we care about right? | 21:46 |
clarkb | /boot/vmlinuz-5.4.0-84-generic I mean | 21:46 |
clarkb | hrm does it actually do it in place? | 21:47 |
clarkb | maybe not | 21:48 |
corvus | clarkb: looks like the result goes to stdout? | 21:49 |
corvus | i think the surgery happens on a tempfile, then the result gets cat'd | 21:49 |
clarkb | corvus: yup I've redirected into a file its in my homedir under kernel-stuff/vmlinuz-5.4.0-84-generic.extracted | 21:50 |
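For reference, a hedged sketch of the decompression step being done here, using the upstream extract-vmlinux script linked above; the output path mirrors the homedir location mentioned, and the exact invocation may have differed:

```sh
# Hedged sketch: decompress the lz4-compressed kernel image with the upstream helper.
wget https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux
chmod +x extract-vmlinux
sudo ./extract-vmlinux /boot/vmlinuz-5.4.0-84-generic \
    > kernel-stuff/vmlinuz-5.4.0-84-generic.extracted
readelf -h kernel-stuff/vmlinuz-5.4.0-84-generic.extracted   # sanity check: should show an ELF header
```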
clarkb | it isn't clear to me how fungi told grub to default to the bionic kernel | 21:51 |
clarkb | the default seems to be entry 0 which is focal? | 21:52 |
corvus | did he do that interactively? | 21:53 |
clarkb | corvus: it was done from a rescue instance | 21:54 |
clarkb | or at least I thought it was. Maybe he caught it in the console instead | 21:55 |
corvus | clarkb: grub/menu.lst entry 0 == uname | 21:55 |
fungi | i edited /boot/grub/menu.lst to remove the newer kernel entries | 21:55 |
clarkb | oh I see the thing says it is 20.04 kernel but its the 4.15 path | 21:55 |
fungi | if you rerun update-grub it should put them back | 21:55 |
corvus | ++ | 21:55 |
clarkb | ok corvus I ran readelf against my extracted kernel and it didn't error | 21:56 |
clarkb | should I copy that over the file in /boot then rerun update-grub? | 21:56 |
corvus | clarkb: file also thinks it's an elf binary :) | 21:56 |
corvus | yeah, you have a backup of the orig, right? | 21:56 |
fungi | yeah, elf is what we want | 21:56 |
clarkb | corvus: yes a backup of the original is in the same dir as the extracted file | 21:57 |
clarkb | corvus: I'll double check shas on that and the one in /boot | 21:57 |
clarkb | sha1s match | 21:57 |
clarkb | I'll do the copy now | 21:57 |
clarkb | and now running update-grub | 21:58 |
clarkb | hrm that didn't put it in menu.lst. | 21:59 |
clarkb | Is menu.lst used? | 21:59 |
clarkb | internet seems to say grub2 doesn't use menu.lst and its all grub.cfg | 22:00 |
clarkb | Should I try a reboot and either it comes back on the old kernel and works, comes back on the new kernel and works, or tries the new kernel and fails and we can recover from there? | 22:01 |
fungi | right, but i have a feeling the bootloader isn't actually being used | 22:01 |
fungi | i think in a xen pv scenario the bootloader is run outside the domu to parse the files in the filesystem | 22:01 |
fungi | so we may not actually be in control of what bootloader is run | 22:02 |
clarkb | does that mean you think I need to edit menu.lst? | 22:02 |
clarkb | or how do I convince it to use the newer kernel? | 22:02 |
fungi | when i edited menu.lst that worked | 22:02 |
corvus | i'm in favor of a menu.lst change | 22:03 |
clarkb | yup it looks like menu.lst~ has the new kernel in it. I've put backups of menu.lst and menu.lst~ in my homedir and will replace menu.lst with menu.lst~ | 22:05 |
clarkb | that is done. Shall I reboot? | 22:06 |
corvus | ++ | 22:07 |
clarkb | ok proceeding | 22:07 |
clarkb | it hasn't come back yet. Openstack API still shows it as active and not yet error'd | 22:09 |
clarkb | I can try to reboot it through the rax api | 22:11 |
clarkb | I guess I should check the console first | 22:12 |
clarkb | vnc fails to connect to the server. I'll try a reboot via the api | 22:14 |
clarkb | Now it has an error status. Hrm | 22:15 |
corvus | does that mean it needs to be rescued again? | 22:17 |
corvus | like rescue it, then edit menu.lst as before and reboot to try to recover? | 22:17 |
corvus | (i'm a little confused why it's in error state and not just sitting at a grub prompt, but maybe that goes to fungi's suggestion that we may not even really be using grub) | 22:18 |
clarkb | corvus: yes I think we need to rescue it and restore the old only use bionic menu.lst | 22:18 |
clarkb | https://wiki.xenproject.org/wiki/Booting_Overview#PVGrub explains (with almost no detail) the grub bypass I think | 22:18 |
clarkb | I'll rescue it and restore that file | 22:18 |
corvus | (i'm going to quietly restart zuul while that's going on) | 22:20 |
corvus | #status log restarted all of zuul on commit 9a27c447c159cd657735df66c87b6617c39169f6 | 22:23 |
opendevstatus | corvus: finished logging | 22:23 |
clarkb | it is back up on the bionic kernel | 22:25 |
corvus | the whole xen thing seems like enough of a black box that i'd vote for just bringing it back up now and then taking the hit of moving ips to complete the upgrade. it's not ideal, but the reputation will catch up eventually. | 22:27 |
clarkb | corvus: bringing it back up now on the bionic kernel you mean and then replace the server in the near future? | 22:29 |
corvus | clarkb: yeah, bringing the public service back up in the current kernel state | 22:30 |
clarkb | got it | 22:30 |
clarkb | I'm going to reboot this server again from within the server to double check it is consistent | 22:31 |
clarkb | https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy we might also try that? | 22:33 |
corvus | that looks promising -- i think my concern is that this server seems to be a one-off, and the only way to test any of this is to test it in production? | 22:34 |
corvus | (because we tested a clone, right? and that didn't have this problem?) | 22:35 |
clarkb | correct | 22:35 |
clarkb | and yes I completely agree | 22:35 |
clarkb | well we tested a clone of lists.katacontainers.io and on a zuul test instance. But if we booted a new image in rax it would be hvm and ya not reproduce | 22:36 |
corvus | yeah, so i think we've really just hit the end of the line on this server, and the longer we try to keep it running, the bigger hole we dig. better to cut losses and start fresh i think. | 22:36 |
clarkb | fair enough | 22:37 |
clarkb | I did edit a menu.lst to try in my homedir but sounds like we'd just prefer to enable services instead? | 22:37 |
clarkb | Let me do a reboot with the config that booted most recently (I haven't modified the server since it booted successfully) just to be sure it seems to consistently come up | 22:37 |
corvus | clarkb: my preference is weak. if you want to keep trying, no objection from me. but i feel like even success looks like we're taking on tech debt. | 22:39 |
clarkb | corvus: well I think we should replace the server either way, but thinking that using the new kernel while we sort that out is best | 22:39 |
clarkb | I am however curious enough to try the chainload. Why don't I try that really quickly | 22:40 |
clarkb | worse case I revert menu.lst again and unrescue | 22:40 |
fungi | okay, i'm back and catching up | 22:40 |
clarkb | fungi: tldr is the uncompressed vmlinuz didn't boot. reverting menu.lst in a rescue instance did get us back to booting | 22:41 |
clarkb | fungi: if you look in ~clarkb/kernel-stuff/menu.lst.chainload I've drafted a menu.lst that attempts to do what https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy describes | 22:41 |
clarkb | I'd like to try that if there aren't strong objections; if it works, great, and if it doesn't work "great" we revert, and either way we reenable services and move on and plan to replace the server | 22:42 |
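A hedged sketch of what such a chainloading menu.lst looks like, per the Xen wiki page linked above; the grub-xen image path and root stanza are assumptions, and clarkb's actual draft lives in his homedir and may differ:

```sh
# Hedged sketch of a pvgrub-legacy menu.lst that chainloads grub2 (grub-xen).
# The image path is an assumption; Ubuntu's grub-xen packaging may place it elsewhere.
cat <<'EOF' | sudo tee /boot/grub/menu.lst
default 0
timeout 2

title Chainload GRUB 2 (grub-xen)
    root (hd0,0)
    kernel /boot/xen/grub-x86_64-xen.bin
EOF
```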
fungi | yeah, that looks worth a try, though note that once we chainload to grub2 that *will* be reading our /boot/grub/grub.cfg instead of menu.lst | 22:44 |
clarkb | yup and that should point to my decompressed file | 22:45 |
fungi | skimming the grub.cfg though it also looks sane | 22:45 |
clarkb | shall I proceed with replacing menu.lst and rebooting? | 22:45 |
clarkb | (also it is possible that the chainload will handle lz4?) | 22:45 |
fungi | the /boot/vmlinuz-5.4.0-84-generic is decompressed, right? | 22:45 |
clarkb | fungi: yes you'll see it is like 6 times the size of the kernel we are booted on | 22:46 |
fungi | yeah, the chainloaded grub2 may also support lz4 | 22:46 |
fungi | :q | 22:46 |
fungi | hah, you're not a vi session | 22:46 |
clarkb | wrong window :) | 22:46 |
clarkb | let me know when you feel comfortable with this menu.lst update and reboot (or that you don't feel comfortable) | 22:46 |
clarkb | and I'll go ahead and do that | 22:47 |
fungi | we may just want regular grub2 rather than grub-xen if we're chainloading, but worth a shot, go for it | 22:47 |
clarkb | fungi: I think that file is installed by grub-xen. proceeding | 22:48 |
fungi | grub-pc being regular grub2 | 22:48 |
fungi | yeah, if grub-xen is a superset of grub-pc then it's irrelevant | 22:48 |
clarkb | menu.lst is updated. rebooting momentarily | 22:48 |
clarkb | I need to reload my keys I think they timed out but it pings now if you want to jump in | 22:50 |
clarkb | will take me a minute to load keys | 22:50 |
fungi | Linux lists 5.4.0-84-generic #94-Ubuntu SMP Thu Aug 26 20:27:37 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | 22:50 |
fungi | lgtm! | 22:50 |
fungi | i'll go ahead and close the rackspace ticket | 22:50 |
clarkb | fungi: maybe leave a note with them about what we discovered? | 22:51 |
clarkb | maybe it will help the next user that hits this | 22:51 |
fungi | oh absolutely | 22:51 |
clarkb | tldr install grub-xen and chainload and possibly decompress the vmlinuz | 22:51 |
clarkb | alright what is the order of operations for restoring services here? fungi do you want to do that? | 22:51 |
clarkb | maybe apache2 first? check archives then enable mailman and make sure it starts happily then enable exim? | 22:52 |
corvus | Sgtm | 22:54 |
fungi | yeah, rackspace ticket now closed with ample detail | 22:54 |
clarkb | fungi: did you want to do the service enabling or should I/ | 22:55 |
fungi | i can start apache up next | 22:55 |
clarkb | thanks | 22:55 |
fungi | i've started a root screen session on lists.o.o | 22:55 |
clarkb | I'll join it | 22:55 |
clarkb | oh wait | 22:56 |
fungi | oh, actually we had a couple of packaging sanity check steps first | 22:56 |
clarkb | did you want to do the other steps first for apt cleanup? | 22:56 |
fungi | i'll do those | 22:56 |
clarkb | ok | 22:56 |
clarkb | also maybe don't clean up old kernels? | 22:56 |
corvus | The chainloader may be a reasonable long-term solution. That might not add much tech debt as long as update-grub doesn't overwrite menu.lst... | 22:57 |
clarkb | corvus: I don't think it does because update-grub does all the grub.cfg stuff now | 22:57 |
fungi | worth backing up the menu.lst we have now just in case | 22:57 |
clarkb | fungi: that is already done in my kernel-stuff dir in my homedir | 22:57 |
fungi | but yes, what clarkb said | 22:57 |
fungi | ahh, right | 22:57 |
clarkb | so ya maybe we get services up and running. Sleep on it and decide what the best course of action is here | 22:57 |
fungi | i'll avoid doing the autoremove for now | 22:58 |
clarkb | ++ | 22:58 |
fungi | should i leave the main mailman.service unit disabled? | 23:00 |
clarkb | yes it isn't used | 23:01 |
fungi | i've reenabled the 5 initscripts for our current sites | 23:01 |
fungi | one more reboot? | 23:01 |
clarkb | fungi: I suggested we start services one at a time above | 23:01 |
fungi | ahh, okay can do, instead of rebooting | 23:01 |
clarkb | apache then check it, then mailman-foo, then exim4 just to be sure its all happy to start | 23:01 |
clarkb | I think we can reboot afterwards | 23:01 |
fungi | apache is up | 23:02 |
clarkb | I get errors loading the listinfo pages for opendev and openstack | 23:02 |
clarkb | Bad header: Please set MAILMAN_SITE_DIR | 23:02 |
clarkb | ok so something about the multisite setup isn't happy? | 23:03 |
fungi | yeah, but i can load the pipermail archives since they're static | 23:03 |
clarkb | SetEnv MAILMAN_SITE_DIR /srv/mailman/opendev <- is set in the apache config | 23:03 |
fungi | and we do load mod_env | 23:06 |
fungi | https://httpd.apache.org/docs/2.4/mod/mod_env.html#SetEnv doesn't mention any other applicable caveats that i can see | 23:07 |
clarkb | fungi: I wonder if it is related to the Directory we put it in. We could try setting it globally (though that might have other side effects?) | 23:08 |
fungi | i can't imagine other side effects it would cause if we made it vhost-wide | 23:08 |
clarkb | seems like it is worth a try? | 23:09 |
fungi | seems like not much apache invokes is going to care about MAILMAN_SITE_DIR besides the cgi | 23:09 |
clarkb | still getting the error. Maybe try stop start? | 23:09 |
clarkb | hrm no dice | 23:10 |
fungi | yeah, it's like that value isn't making it through /usr/lib/cgi-bin/mailman/listinfo | 23:10 |
clarkb | we are using mod_cgid not mod_cgi, but I wouldn't expect that to cause trouble | 23:13 |
clarkb | corvus: is this familiar to you at all? | 23:14 |
fungi | it's like something's sanitizing the environment before invoking mailman/mm_cfg.py | 23:14 |
corvus | no, i'm looking but don't see anything yet | 23:14 |
clarkb | and I guess do we want to start mailman and/or exim and see if those are happy? | 23:14 |
fungi | er, /etc/mailman/mm_cfg.py | 23:15 |
clarkb | fungi: ya its definitely calling that script because the error message is the one that the mailman vhosting added if the env var isn't set | 23:15 |
clarkb | actually it could be mm doing that? | 23:16 |
clarkb | since the path is apache2 -> mm cgi script -> /etc/mailman/mm_cfg.py ? | 23:17 |
fungi | right | 23:18 |
clarkb | https://bazaar.launchpad.net/~mailman-coders/mailman/2.1/view/head:/Mailman/Cgi/listinfo.py does that get compiled to what we have in our cgi dir? | 23:19 |
clarkb | if so I don't see it clearing any env vars | 23:19 |
fungi | our /usr/lib/cgi-bin/mailman/listinfo is an elf executable, so hard to know | 23:21 |
clarkb | there is a note in the apache2 docs about how you should use SetEnvIf if they are part of modrewrite or otherwise early in the handling | 23:22 |
clarkb | perhaps this is unexpectedly happening early in handling for some reason? an optimization to make cgi faster? | 23:22 |
corvus | i think it may be cleaned by the mm script... still trying to find links to share and confirm (i just downloaded the tarball from ubuntu) | 23:23 |
clarkb | corvus: interesting | 23:23 |
corvus | i don't know where to find an online linkable browser with version 2.1.29 | 23:24 |
fungi | clarkb: i don't think setenvif is relevant, that seems to be for conditionally setting envvars | 23:25 |
fungi | and yeah, the cgi scripts themselves sanitizing the environment would certainly be the simplest explanation | 23:27 |
corvus | well, anyway... if you download 2.1.29, there's a keepenvars variable in src/common.c which is the wrapper | 23:28 |
clarkb | https://launchpad.net/mailman/+milestone/2.1.19 has a tarball at least | 23:28 |
* corvus < https://matrix.org/_matrix/media/r0/download/matrix.org/MbFCWvlCwUruAnzodoEifPYm/message.txt > | 23:28 | |
fungi | any way to control keepenvars from outside? | 23:29 |
corvus | that comment is nuts "we should invert this!" "it is done!" | 23:29 |
fungi | i guess not | 23:29 |
corvus | maybe, i dunno, delete the TODO when it's done? | 23:29 |
fungi | we could probably rework it to switch on HOST? | 23:29 |
corvus | but i guess if your version control system is... nevermind. eyes on the ball. | 23:29 |
clarkb | oh 2.1.29 is the version not .19 | 23:29 |
corvus | i don't know what version we were running before; does anyone? | 23:29 |
clarkb | no wonder I don't see it | 23:29 |
clarkb | corvus: I can look it up | 23:30 |
fungi | i'll check the dpkg.log | 23:30 |
clarkb | looks like 2.1.20 | 23:30 |
corvus | i'm pretty sure this is the issue, but i'd like to confirm by seeing if... i'm gonna go out on a limb here, the logic had not been inverted yet and only the TODO was there not the done | 23:30 |
fungi | 2021-09-12 18:10:40 upgrade mailman:amd64 1:2.1.20-1ubuntu0.6 1:2.1.26-1ubuntu0.3 | 23:30 |
fungi | yep | 23:30 |
fungi | that was the first upgrade (from xenial to bionic) | 23:31 |
corvus | yep, it's inverted. that's the issue. | 23:31 |
corvus | i'm guessing the kata listserv doesn't use this system? | 23:32 |
clarkb | corvus: it does not | 23:32 |
clarkb | fungi: I agree we can use the HOST var probably | 23:32 |
clarkb | keep using MAILMAN_SITE_DIR for the non cgi stuff and switch on it with HOST otherwise? | 23:32 |
corvus | yeah, i think do the mapping based on host should work | 23:33 |
fungi | so if $MAILMAN_SITE_DIR undefined MAILMAN_SITE_DIR=$HOST | 23:33 |
clarkb | fungi: MAILMAN_SITE_DIR = somevalue based on HOST lookup | 23:33 |
clarkb | lists.opendev.org -> /srv/mailman/opendev and so on | 23:33 |
corvus | yep. and hard-code that mapping in /etc/mailman/mm_cfg.py | 23:34 |
clarkb | ya I'm working on an edit in my homedir | 23:34 |
fungi | ahh, yeah, we map them in /etc/mailman/sites | 23:34 |
corvus | and as you say, only do the HOST mapping if the site dir isn't already set (for the rest of the scripts to keep working) | 23:34 |
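A minimal sketch of the approach being described here, assuming the hard-coded mapping corvus suggests; this is not the exact edit clarkb drafted, and the directory names other than /srv/mailman/opendev are assumptions:

```python
# Hedged sketch of the mm_cfg.py fallback: if the Mailman CGI wrapper stripped
# MAILMAN_SITE_DIR from the environment, derive it from the HOST variable set
# in the Apache vhost. Directory names are assumptions except 'opendev'.
import os

SITE_DIRS = {
    'lists.opendev.org': '/srv/mailman/opendev',
    'lists.openstack.org': '/srv/mailman/openstack',
    'lists.airshipit.org': '/srv/mailman/airship',
    'lists.starlingx.io': '/srv/mailman/starlingx',
    'lists.zuul-ci.org': '/srv/mailman/zuul',
}

if 'MAILMAN_SITE_DIR' not in os.environ and 'HOST' in os.environ:
    os.environ['MAILMAN_SITE_DIR'] = SITE_DIRS[os.environ['HOST']]

VAR_PREFIX = os.environ['MAILMAN_SITE_DIR']
```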
corvus | oh yea you could use that file to do the lookup rather than hard-coding | 23:35 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/inventory/service/host_vars/lists.openstack.org.yaml#L87 | 23:36 |
corvus | fun fact, that file will parse as yaml (though that's not its intention). but we don't have pyyaml installed system-wide on that host. | 23:37 |
corvus | (i think that's intended to be an exim lookup file) | 23:38 |
fungi | i ended up using ugly shell to parse it in https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mailman/files/mk-archives-index | 23:38 |
clarkb | something like what I've got in ~clarkb/mm_cfg.py maybe? | 23:39 |
fungi | diffing now | 23:39 |
clarkb | I'm going to test that as much as I can without going through apache or mailman now | 23:39 |
clarkb | I had a syntax error missing a close ) | 23:41 |
fungi | clarkb: yeah, i think that should work | 23:41 |
fungi | assuming correct pairing of parentheses ;) | 23:41 |
corvus | ```VAR_PREFIX = os.environ['MAILMAN_SITE_DIR']``` needs updating | 23:41 |
clarkb | corvus: thanks | 23:41 |
corvus | otherwise lgtm | 23:41 |
clarkb | updated the file with that fixup | 23:42 |
corvus | ++ | 23:42 |
clarkb | will python complain about the mix of 2 and 4 spaces? | 23:42 |
clarkb | I can dedent the new code to match the old code | 23:42 |
clarkb | I'll do that | 23:42 |
clarkb | done | 23:43 |
fungi | python won't care as long as it's not inconsistent within a given nested set | 23:43 |
fungi | but consistency ftw | 23:43 |
fungi | lgtm | 23:43 |
clarkb | I put a backup of the old file in my homedir too just in case | 23:44 |
clarkb | fungi: do you want to copy that new file over or should I? | 23:44 |
fungi | i can | 23:44 |
clarkb | ok thanks | 23:44 |
clarkb | (you have been doing most of the prod updates and keeping it through a central channel helps avoid people walking over each other) | 23:44 |
fungi | malformed header from script 'listinfo': Bad header: Please set MAILMAN_SITE_DIR or | 23:44 |
clarkb | also is that how HOST is passed by apache | 23:44 |
clarkb | oh we may need to set HOST that isn't auto passed? | 23:44 |
fungi | could be we need to add that in the vhost yeah | 23:45 |
clarkb | ya SetEnv HOST lists.opendev.org ? | 23:45 |
ianw | (ps I'm just watching on, but LMN if I can help :) | 23:46 |
clarkb | fungi: does my or need to be an and? | 23:47 |
clarkb | fungi: I think that is the bug | 23:47 |
clarkb | fungi: updated the copy in my homedir | 23:47 |
fungi | yes | 23:47 |
fungi | hah | 23:47 |
fungi | oh i'm so blind. it's getting late | 23:47 |
clarkb | it works! | 23:48 |
fungi | it works! | 23:48 |
clarkb | lists.openstack.org fails so ya we have to explicitly SetEnv HOST | 23:48 |
fungi | now i wonder if the setenv host is redundant | 23:48 |
fungi | i guess not, right | 23:48 |
clarkb | I don't think it is. I think we have to set it | 23:48 |
fungi | okay, i'll patch it into the local copies real quick | 23:48 |
clarkb | thanks | 23:48 |
fungi | all of them are updated on the server now | 23:50 |
fungi | so should hopefully be working | 23:50 |
fungi | i need to take another break here shortly if folks want to check anything else before we start mailman and exim services | 23:51 |
clarkb | I just confirmed they all load and that they seem to match the right env :) | 23:51 |
clarkb | I think we should continue | 23:51 |
clarkb | corvus: ianw: any objection to starting mailman now? | 23:52 |
clarkb | then we'll start exim | 23:52 |
corvus | clarkb: lgtm (was just checking out list admin interface) | 23:53 |
clarkb | I don't expect the non cgi scripts will have problems because we do actually test creating a list etc using MAILMAN_SITE_DIR in ansible | 23:53 |
clarkb | but if they do I guess we have to go through and set HOST instead? | 23:54 |
clarkb | fungi: I think you should go for it? | 23:54 |
fungi | yeah, syatyomg mailman services next | 23:54 |
fungi | er, starting | 23:54 |
clarkb | was that opendev that was started? | 23:55 |
fungi | i started mailman-opendev and see the expected company of processes | 23:55 |
fungi | i'll do the other remaining 4 now | 23:55 |
clarkb | I see 5 sets of processes | 23:56 |
fungi | 45 processes owned by list, that's 9 per site, like the old days | 23:56 |
fungi | anything else we want to do before starting exim? | 23:56 |
clarkb | I don't think so? corvus ianw ^ ? | 23:56 |
clarkb | fungi: seems like there are no objections I think we should probably send it, then try a test email against service-discuss? | 23:59 |
fungi | yeah, i'll compose a reply to my maintenance announcement real fast | 23:59 |