Tuesday, 2022-11-08

clarkbmeeting time19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Nov  8 19:01:14 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
ianwo/19:01
fungiohai!19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-November/000378.html Our Agenda19:01
clarkb#topic Announcements19:01
frickler\o19:01
clarkbI won't be around Friday or Monday. I'll be back Tuesday so expect to have a meeting next week. I may send the agenda for that a bit late (Tuesday morning?)19:02
fungithanks!19:02
fungihope you have a great weekend19:02
clarkbother than that I didn't have anything to announce. We can dive in I guess19:03
clarkb#topic Topics19:03
clarkb#topic Bastion Host Updates19:03
clarkb#link https://review.opendev.org/q/topic:prod-bastion-group19:03
clarkb#link https://review.opendev.org/q/topic:bridge-ansible-venv19:03
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86356419:03
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86356819:03
clarkbit seems like we're really close to being done with the old bastion and the new one is working well19:03
clarkbianw does have a few remaining changes that need review as I've linked above. Would be great if we can review those to close that portion of the work out19:04
ianwyep thanks -- goes off on a few tangents but all related19:04
clarkbthen separately I think it would be helpful if people can look at old bridge and make note of anything missing that needs to be moved or encoded in ansible. the raxdns client venv, ssl cert stuff, and hieraedit (general yaml editor) were things I noted19:05
clarkbianw: other than reviewing changes and checking for missing content is there anything else we can do to help?19:06
ianwyep notes made on 19:06
ianw#link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-1019:06
ianwnope -- the changes you linked to have some things like adding host keys to inventory and i think we've actually fixed the blockers for parallel job running as a nice side-effect too19:07
ianwand i'll keep working on the todo to finish it off19:07
clarkbsounds good. Thanks! This was always going to be a lot of effort and I appreciate you taking it on :)19:08
clarkb#topic Bionic Server Upgrades19:08
ianw*next* time it will be easy :)19:08
clarkbNot a whole lot to add here. I stalled a bit on this because of the doom and gloom from the openssl cve. But I should look at the list and pick off something new to do19:09
clarkb#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades19:09
clarkbthe changes to remove snapd and not do phased updates did land19:09
clarkbwhich were related to things I found when booting the new jammy gitea-lb0219:09
clarkbI keep meaning to check that snapd is gone as expected but haven't managed that yet.19:10
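For context, a quick spot check along these lines (illustrative shell, not the actual ansible or the exact option names used in the change) would confirm on a given host that snapd is gone and that the phased-updates override is in place:

    # snapd should report "not installed" once the removal change has applied
    dpkg -s snapd 2>&1 | head -n1
    # and the apt configuration should show whatever phased-updates override we set
    apt-config dump | grep -i phased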
fricklerdo we have servers other than storyboard that still run on xenial?19:10
clarkbI am not aware of any current issues deploying services on jammy so new deployments should favor it19:10
clarkbfrickler: there are like 4? cacti, translate, storyboard, and something I'm forgetting I feel like19:11
clarkbfrickler: they are all puppeted services hence the difficulty of moving19:11
fricklermaybe still list them in the upgrade pad so we don't forget about them?19:11
clarkbgood idea19:12
fungiwiki is still on trusty i think19:12
clarkbOpenStack has been looking at zanata replacements though so progress on that end at least19:12
frickleroh, I can't even log into wiki from my Jammy host19:13
fricklerUnable to negotiate with 2001:4800:7813:516:be76:4eff:fe06:36e7 port 22: no matching host key type found. Their offer: ssh-rsa,ssh-dss19:13
clarkbya that's the sha1 + rsa problem. You can specifically allow certain hosts to do sha119:14
fungiyeah, ssh to wiki needs host-specific overrides for openssh19:14
clarkbthe openssh 8.8 release notes have details19:14
fricklerah, o.k.19:14
fungii think it's less about that and more that it isn't new enough to support elliptic curve keys19:15
clarkbanyway the good news is jammy is functional as far as I can tell and any replacement can prefer it at this point19:15
fungibut could be both19:15
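For reference, the negotiation failure quoted above is OpenSSH 8.8+ refusing the wiki host's RSA/SHA-1 (ssh-rsa) host key signature by default; a per-host override looks roughly like this (hostname illustrative, and the equivalent Host stanza can live in ~/.ssh/config instead):

    # re-enable the legacy algorithms for this one host only
    ssh -o HostKeyAlgorithms=+ssh-rsa -o PubkeyAcceptedAlgorithms=+ssh-rsa wiki.openstack.org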
clarkb#topic Mailman 319:15
clarkbThis feels like it has stalled out a bit largely on a decision for whether or not we should use forked images19:15
fungiwell, i wasn't around last week too19:16
clarkbthe upstream did end up merging my lynx addition change. I have also noticed they have "rolling" docker image tags19:16
fungiis switching to rolling just a matter of adjusting the dockerfile?19:16
clarkbI think that means we could at this point choose to use upstream and their rolling tags and that would be roughly equivalent to what I have proposed for our fork19:16
fungior the compose file i guess if we don't need to fork19:17
clarkbfungi: it's an edit to the docker-compose.yaml file. It isn't clear to me what sorts of upgrade guarantees they plan to make though19:17
clarkbI've also not seen any responses to my issues around specifying the domain names19:17
clarkbwhich makes me wary. It feels like they merged the easiest change I pushed and called it a day :/19:17
fungii guess we could keep forking but just run a very minimal fork unless we need to do more in the future?19:18
clarkbya I'm on the fence. Our fork now is equivalent to upstream which makes me think we should just use upstream. But I also would like better comms/info from upstream if possible. I think I lean slightly towards using a local (non)fork for now and seeing if upstream improves19:19
clarkbbasically build our own images so that we can easily make changes as we don't really expect upstream to quickly make updates right now19:20
clarkbeven though our current image isn't different19:20
fungii'm good with that. unless anyone objects, let's just go that way for now19:20
ianw++ i'm fine with that.  we tried upstream first, but it's not always going to be the solution19:21
clarkbfungi: cool. Might also be worth a rebuild and rehold just to double check current builds work as expected? But then plan to boot a new node and deploy something?19:21
fungiyes, we'll need to un-wip the changes of course19:21
clarkbyup I think I unwip'd the work19:22
clarkb*the fork19:22
fungioh, cool19:22
fungii'll do another hold and full import test this week, then boot a prod server19:22
clarkbI guess let me know if I can help with that process of spot checking things19:22
fungii guess we settled on sticking with rackspace dfw for now?19:23
clarkbjust be sure the docker images rebuild too19:23
clarkbI think that is my slight preference simply because the dns is easier to manage19:23
clarkbbut we will have to update PBL? we may have to do that with vexxhost too? May be good to ask mnaser__ if he has an opinion on it19:23
fungii expect to do a broad sweep of blocklists for the new server ip addresses regardless, and put in exclusion requests19:25
clarkb++19:25
fungiwell in advance of any maintenance to put it into production19:25
clarkbthen ya my preference is slightly towards rax as it gives us a bit more direct control over things we may need to edit/control19:26
clarkbbut I'm happy if others feel the opposite way and override me on that :)19:26
clarkbwe also have a bit more time to sort that out while we do the rechecking of things in the images19:26
clarkb#topic Python base image updates for wheel installs19:27
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86215219:27
clarkbI don't want to single core approve this change as it will affect a number of our images. ianw you spent a bit of time with the siblings stuff and it might be good for you to double check it from that angle?19:27
clarkbI do think we should land a change like this though as it should make us more resilient to pip updates in the future19:27
ianwahh yes sorry, i meant to review that.  i'll give it a proper review today19:28
clarkbthanks. It isn't super urgent but would be good to have before the next pip update :019:28
clarkber :)19:28
clarkb#topic Etherpad container log growth19:28
clarkbianw: I know you've been busy with bridge and k8s and stuff. Any chance there is a change for this yet?19:29
clarkbif you'd like I can try to take a stab at it too. Would probably be good to better understand that container logging stuff anyway19:29
ianwsorry, yes i got totally distracted in zuul-jobs.  i can do it and we can check it all works19:30
clarkbok19:30
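One common way to cap container log growth looks like the snippet below (a sketch of one option, not necessarily the change ianw has in mind; equivalent per-service logging options can go in docker-compose.yaml instead):

    # limit the json-file logging driver globally, then restart dockerd
    echo '{ "log-driver": "json-file", "log-opts": { "max-size": "10m", "max-file": "3" } }' > /etc/docker/daemon.json
    systemctl restart docker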
clarkb#topic Quo vadis Storyboard19:30
clarkbI learned me a latin yesterday19:30
fricklersorry, had that 6 years in school ;)19:31
clarkbDuring/after the PTG it became apparent that a number of projects were looking at moving off of storyboard19:31
clarkbIn an effort to help converge discussion (or attempt to) as well as better understand the usage of storyboard I started a thread on service-discuss to gather feedback on storyboard and hopefully identify people who might be willing to help maintain it19:32
clarkbThe feedback so far has been from those looking at moving away. We have not yet identified any group that has indicated a desire to keep using storyboard or help maintain it.19:33
clarkbI don't want to do a ton of discussion here in the meeting as I think it would be good to keep as much of that on the mailing list as possible. This way we don't fragment the discussion (which was already a concern for me) and it helps ensure everyone can easily stay on top of the discussion19:34
clarkbthat said I think there are a few important takeaways that have come out of this so far19:34
clarkbThe first is that the OpenStack projects looking at moving have indicated that having different tools for issue tracking across projects has been tough for them. Consistency is a desirable feature19:34
clarkbAnd second that while Storyboard was created to directly meet some of OpenStack's more unique issue tracking needs those needs have both shifted over time and storyboard isn't doing a great job of meeting them19:35
clarkbI think that means any longer term maintenance isn't just about library updates and python version upgrades but would need to also try and address the feature need gap19:36
fungithough part of why storyboard isn't doing a good job of meeting needs is that we haven't really kept up with fixing our deployment orchestration for it19:37
fungi(for example, people wanted attachments support, storyboard has it but our storyboard deployment does not)19:37
clarkbyes the two problems are interrelated. But a number of the issues raised appear to be due to a divergence in understanding of what the needs are, not just poor implementation hampered by not updating19:37
clarkbduplicate issue handling was a major one called out for example19:38
clarkbfrom a hosting perspective this puts us in an unfortunate position because the users interested in solving these problems seem disinterested in solving them through storyboard.19:39
clarkbI think if we can find (an as of yet undiscovered) group of people to work on the maintenance and feature gaps we would be happy to continue hosting storyboard. I'm worried about what we do without that since it does clearly need help19:40
clarkbBut like I said I think we need to give it a bit more time since my last email before we dig into that too much. As we're still trying to find that potential group19:40
clarkbthose were the key details I've pulled out so far. Please respond to the thread if you have thoughts or concerns and we'll continue to try and keep as much of the discussion there as possible19:42
clarkb#topic Vexxhost nova server rescue behavior19:43
clarkbI tested this. I'm glad I did. Unfortunately, server rescue on regular instances and boot from volume instances does not work19:43
ianwthis was the microversion api issues?19:44
clarkbWith boot from volume instances if you naively make the request without specifying the compute api version it straight up fails because you need nova 2.87 or newer to even support the functionality. Vexxhost supports up to 2.88 which allowed me to try it with the microversion specified. The initial request doesn't fail, but the rescue does. The server ends up in an error state19:44
clarkbThe reason for this error is some sort of permissions thing which i passed on to mnaser__19:45
clarkbIf you try to unrescue a node in this state it fails because you cannot unrescue and error state instance19:45
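For reference, the boot-from-volume attempt described above looks roughly like this with openstackclient (server and image names are placeholders):

    # rescue of a boot-from-volume server needs compute API microversion >= 2.87
    # (vexxhost advertises up to 2.88)
    openstack --os-compute-api-version 2.88 server rescue --image <rescue-image> <server>
    # once the instance lands in ERROR state, the unrescue is refused
    openstack server unrescue <server>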
clarkbbasically as far as I can tell rescuing a boot from volume instance using the rescue command is completely broken. However, I think you might be able to shutdown the node, attach the boot volume to another running instance and effectively do a rescue manually19:46
fungithat would have been rough if we had tried to go that route during the gerrit outage19:46
fungithanks for catching it19:46
clarkbI still need to test this ^ theoretical process19:46
clarkbFor regular instances things are a bit less straightforward. The rescue commands all work out of the box, but what happens is you get the rescue image's kernel running with the rescued node's root disk mounted on /19:47
clarkbI suspect the reason for this is a collision of the root label used in the kernel boot line. They all use cloudimg-rootfs by default and linux is doing what we don't want here.19:47
clarkbThe next thing I need to test on this is manually changing the root label on a test node to something other than cloudimg-rootfs and then rescuing it. In theory if the labels don't collide we'll get more reliable behavior19:48
clarkbIf that works we might want to have launch node change our root partition labels19:48
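A rough, untested sketch of what that label change could look like in launch-node follows (it assumes an ext4 root on /dev/vda1 and that the label is referenced from /etc/fstab and the grub configuration; device, label, and paths are all illustrative):

    NEW_LABEL=opendev-rootfs
    e2label /dev/vda1 "$NEW_LABEL"
    # update every reference to the old label before the next reboot
    sed -i "s/LABEL=cloudimg-rootfs/LABEL=$NEW_LABEL/g" /etc/fstab
    grep -rl cloudimg-rootfs /etc/default/grub /etc/default/grub.d/ 2>/dev/null \
      | xargs -r sed -i "s/cloudimg-rootfs/$NEW_LABEL/g"
    update-grub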
clarkball that to say the naive approach is not working. It was a great idea to test it (thank you frickler for calling out the concerns). We may have some workarounds we can employ19:49
fungiso basically we boot the rescue image's kernel/initrd but then mount the original rootfs and pivot to it during early boot?19:49
clarkbfungi: yup that appears to be what happens19:49
fungii can definitely see how that could occur19:49
clarkbfungi: and that is problematic because if you have a systemd issue you're preserving the problem19:49
fungiabsolutely. unless you can get to the kernel command line to edit it i guess19:49
clarkbI'm going to try and test those workarounds later today and if one or both of them work I'll need to push a docs update I guess19:50
ianwwe could switch to boot by uuid?19:50
clarkbianw: that won't work because the rescue image is still looking for cloudimg-rootfs which would still be present19:50
clarkbyou either need to change the label on the actual instance to avoid the collision or use special rescue images that don't look for cloudimg-rootfs19:51
ianwahh, i see yeah, you need to purge the label19:51
ianwas long as we don't make ourselves unbootable in the process of making ourselves more bootable :)19:51
clarkbI'm still hopeful there is a useable set of steps here to get what we need. I just need to do more testing19:51
clarkbianw: ya that's why doing it in launch node before the launch node reboot is a good idea I think19:52
clarkbwe'll catch any errors very early :)19:52
clarkbbut need more testing before we commit to anything as I'm not even sure this will fix it yet19:52
clarkbin the meantime please avoid rescuing instances in vexxhost unless it is our last resort. It may well make things worse rather than better19:53
clarkb#topic Open Discussion19:54
clarkbAnything else?19:54
fungitrying to fix a corner case with pypi uploads. see this and its zuul-jobs dependency:19:55
fungi#link https://review.opendev.org/864019 Skip existing remote artifacts during PyPI upload19:55
fungithe openstack release team wound up with a project where the wheel was uploaded but then pypi hit a transient error while trying to upload the corresponding sdist19:55
fungiso no way to rerun the job in its current state19:56
clarkbthe bit about being able to separate multi arch builds is a good feature too19:56
fungifixing it to no longer treat existing remote files as an error condition was less work than trying to dig out the twine api key and upload the sdist myself19:56
fungii'll reenqueue the failed tag (for openstack/metalsmith) once those changes land19:57
fungithat way we'll have confirmation it solves the situation19:57
clarkbone minor concern is that this might allow us to upload sdists after a wheel failure19:57
fricklerit also seems like wheel builds are broken again for 7 days, didn't look at logs yet19:58
clarkbwhich would be a problem for anything that has complicated build requirements and downstream users that rely on wheels19:58
fungiclarkb: i don't think this skips all failures, just ignores when pypi tells twine that the file it's trying to upload is already there19:58
clarkbfungi: perfect19:58
fungiif there's a different upload error, the role/play/job should still fail properly19:59
fungie.g., the build which hit the original problem leaving the sdist not uploaded would still have failed the same way, but this allows us to successfully reenqueue the tag and run the job again19:59
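twine itself has a flag for exactly this behavior, which is presumably what the linked changes wire into the upload role (an assumption; the reviews have the details):

    # --skip-existing makes "file already exists" responses from the index a
    # skip instead of an error, so a partially uploaded release can be retried
    twine upload --skip-existing dist/*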
ianwfrickler: thanks -- https://zuul.opendev.org/t/openstack/build/89b55250a9e4433980dbdbb7ab2cf39c looks like centos openafs errors19:59
clarkband we are at time. Thank you everyone.20:00
clarkbAs I mentioned we should be back here same time and place next week. See you then20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov  8 20:00:40 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-08-19.01.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-08-19.01.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-08-19.01.log.html20:00
fungithanks clarkb!20:01
