19:01:14 <clarkb> #startmeeting infra
19:01:14 <opendevmeet> Meeting started Tue Nov  8 19:01:14 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 <opendevmeet> The meeting name has been set to 'infra'
19:01:20 <ianw> o/
19:01:27 <fungi> ohai!
19:01:35 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-November/000378.html Our Agenda
19:01:44 <clarkb> #topic Announcements
19:01:58 <frickler> \o
19:02:17 <clarkb> I won't be around Friday or Monday. I'll be back Tuesday so expect to have a meeting next week. I may send the agenda for that a bit late (Tuesday morning?)
19:02:34 <fungi> thanks!
19:02:40 <fungi> hope you have a great weekend
19:03:02 <clarkb> other than that I didn't have anything to announce. We can dive in I guess
19:03:06 <clarkb> #topic Topics
19:03:12 <clarkb> #topic Bastion Host Updates
19:03:19 <clarkb> #link https://review.opendev.org/q/topic:prod-bastion-group
19:03:23 <clarkb> #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:03:28 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/863564
19:03:32 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/863568
19:03:50 <clarkb> it seems like we're really close to being done with the old bastion and the new one is working well
19:04:12 <clarkb> ianw does have a few remaining changes that need review as I've linked above. Would be great if we can review those to close that portion of the work out
19:04:59 <ianw> yep thanks -- goes off on a few tangents but all related
19:05:04 <clarkb> then separately I think it would be helpful if people can look at old bridge and make note of anything missing that needs to be moved or encoded in ansible. the raxdns client venv, ssl cert stuff, and hieraedit (a general yaml editor) were things I noted
19:06:00 <clarkb> ianw: other than reviewing changes and checking for missing content is there anything else we can do to help?
19:06:03 <ianw> yep notes made on
19:06:07 <ianw> #link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-10
19:07:04 <ianw> nope -- the changes you linked to have some things like adding host keys to inventory and i think we've actually fixed the blockers for parallel job running as a nice side-effect too
19:07:40 <ianw> and i'll keep working on the todo to finish it off
19:08:19 <clarkb> sounds good. Thanks! This was always going to be a lot of effort and I appreciate you taking it on :)
19:08:44 <clarkb> #topic Bionic Server Upgrades
19:08:46 <ianw> *next* time it will be easy :)
19:09:23 <clarkb> Not a whole lot to add here. I stalled a bit on this because of the doom and gloom from the openssl cve. But I should look at the list and pick off something new to do
19:09:29 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades
19:09:39 <clarkb> the changes to remove snapd and not do phased updates did land
19:09:54 <clarkb> which were related to things I found when booting the new jammy gitea-lb02
19:10:09 <clarkb> I keep meaning to check that snapd is gone as expected but haven't managed that yet.
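A quick spot-check would confirm that (illustrative shell, not something the deploy jobs run):

    # on the new jammy host, e.g. gitea-lb02
    dpkg -s snapd    # should report "package 'snapd' is not installed" once the removal has applied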
19:10:50 <frickler> do we have servers other than storyboard that still run on xenial?
19:10:56 <clarkb> I am not aware of any current issues deploying services on jammy so new deployments should favor it
19:11:17 <clarkb> frickler: there are like 4? cacti, translate, storyboard, and something I'm forgetting I feel like
19:11:28 <clarkb> frickler: they are all puppeted services hence the difficulty of moving
19:11:57 <frickler> maybe still list them in the upgrade pad so we don't forget about them?
19:12:04 <clarkb> good idea
19:12:24 <fungi> wiki is still on trusty i think
19:12:27 <clarkb> OpenStack has been looking at zanata replacements though so progress on that end at least
19:13:24 <frickler> oh, I can't even log into wiki from my Jammy host
19:13:44 <frickler> Unable to negotiate with 2001:4800:7813:516:be76:4eff:fe06:36e7 port 22: no matching host key type found. Their offer: ssh-rsa,ssh-dss
19:14:04 <clarkb> ya that's the sha1 + rsa problem. You can specifically allow certain hosts to do sha1
19:14:10 <fungi> yeah, ssh to wiki needs host-specific overrides for openssh
19:14:23 <clarkb> the openssh 8.8 release notes have details
19:14:51 <frickler> ah, o.k.
19:15:02 <fungi> i think it's less about that and more that it isn't new enough to support elliptic curve keys
19:15:05 <clarkb> anyway the good news is jammy is functional as far as I can tell and any replacement can prefer it at this point
19:15:20 <fungi> but could be both
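The host-specific override fungi mentions is roughly this shape on an OpenSSH 8.8+ client (a sketch; the wiki hostname and whether the pubkey algorithm list also needs the override are assumptions):

    # one-off from the command line
    ssh -o HostKeyAlgorithms=+ssh-rsa wiki.openstack.org
    # or persistently in ~/.ssh/config:
    #   Host wiki.openstack.org
    #       HostKeyAlgorithms +ssh-rsa
    #       PubkeyAcceptedAlgorithms +ssh-rsa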
19:15:35 <clarkb> #topic Mailman 3
19:15:52 <clarkb> This feels like it has stalled out a bit largely on a decision for whether or not we should use forked images
19:16:12 <fungi> well, i wasn't around last week either
19:16:13 <clarkb> the upstream did end up merging my lynx addition change. I have also noticed they have "rolling" docker image tags
19:16:44 <fungi> is switching to rolling just a matter of adjusting the dockerfile?
19:16:44 <clarkb> I think that means we could at this point choose to use upstream and their rolling tags and that would be roughly equivalent to what I have proposed for our fork
19:17:03 <fungi> or the compose file i guess if we don't need to fork
19:17:06 <clarkb> fungi: it's an edit to the docker-compose.yaml file. It isn't clear to me what sorts of upgrade guarantees they plan to make though
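That compose edit would be something along these lines (the image name and tag are assumptions about upstream's naming, not verified against their registry):

    # in docker-compose.yaml, switch the pinned tag, e.g.:
    #   image: maxking/mailman-core:rolling
    # then refresh the running containers
    docker-compose pull && docker-compose up -d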
19:17:21 <clarkb> I've also not seen any responses to my issues around specifying the domain names
19:17:37 <clarkb> which makes me wary. It feels like they merged the easiest change I pushed and called it a day :/
19:18:41 <fungi> i guess we could keep forking but just run a very minimal fork unless we need to do more in the future?
19:19:42 <clarkb> ya I'm on the fence. Our fork now is equivalent to upstream which makes me think we should just use upstream. But I also would like better comms/info from upstream if possible. I think I lean slightly towards using a local (non)fork for now and seeing if upstream improves
19:20:04 <clarkb> basically build our own images so that we can easily make changes as we don't really expect upstream to quickly make updates right now
19:20:13 <clarkb> even though our current image isn't different
19:20:35 <fungi> i'm good with that. unless anyone objects, let's just go that way for now
19:21:24 <ianw> ++ i'm fine with that.  we tried upstream first, but it's not always going to be the solution
19:21:43 <clarkb> fungi: cool. Might also be worth a rebuild and rehold just to double check current builds work as expected? But then plan to boot a new node and deploy something?
19:22:00 <fungi> yes, we'll need to un-wip the changes of course
19:22:07 <clarkb> yup I think I unwip'd the work
19:22:09 <clarkb> *the fork
19:22:16 <fungi> oh, cool
19:22:53 <fungi> i'll do another hold and full import test this week, then boot a prod server
19:22:54 <clarkb> I guess let me know if I can help with that process of spot checking things
19:23:05 <fungi> i guess we settled on sticking with rackspace dfw for now?
19:23:08 <clarkb> just be sure the docker images rebuild too
19:23:25 <clarkb> I think that is my slight preference simply because the dns is easier to manage
19:23:46 <clarkb> but we will have to update PBL? we may have to do that with vexxhost too? May be good to ask mnaser__ if he has an opinion on it
19:25:08 <fungi> i expect to do a broad sweep of blocklists for the new server ip addresses regardless, and put in exclusion requests
19:25:17 <clarkb> ++
19:25:26 <fungi> well in advance of any maintenance to put it into production
19:26:10 <clarkb> then ya my preference is slightly towards rax as it gives us a bit more direct control over things we may need to edit/control
19:26:22 <clarkb> but I'm happy if others feel the opposite way and override me on that :)
19:26:36 <clarkb> we also have a bit more time to sort that out while we do the rechecking of things in the images
19:27:07 <clarkb> #topic Python base image updates for wheel installs
19:27:11 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/862152
19:27:37 <clarkb> I don't want to single core approve this change as it will affect a number of our images. ianw you spent a bit of time with the siblings stuff and it might be good for you to double check it from that angle?
19:27:52 <clarkb> I do think we should land a change like this though as it should make us more resilient to pip updates in the future
19:28:15 <ianw> ahh yes sorry, i meant to review that.  i'll give it a proper review today
19:28:32 <clarkb> thanks. It isn't super urgent but would be good to have before the next pip update :0
19:28:34 <clarkb> er :)
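For context, the general two-stage "build wheels, then install from them" pattern behind these images looks roughly like this (a sketch of the approach, not the contents of 862152):

    # builder stage: build wheels for the project and its dependencies
    pip wheel -r requirements.txt --wheel-dir /output/wheels
    # final stage: install only from those prebuilt wheels, never from the network
    pip install --no-index --find-links /output/wheels -r requirements.txt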
19:28:49 <clarkb> #topic Etherpad container log growth
19:29:07 <clarkb> ianw: I know you've been busy with bridge and k8s and stuff. Any chance there is a change for this yet?
19:29:26 <clarkb> if you'd like I can try to take a stab at it too. Would probably be good to better understand that container logging stuff anyway
19:30:13 <ianw> sorry, yes i got totally distracted in zuul-jobs.  i can do it and we can check it all works
19:30:28 <clarkb> ok
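One common shape for capping docker container log growth, if that's the direction the change ends up taking (an assumption; the driver, limits, and image name are illustrative):

    # docker-run form for illustration; in compose this maps to the per-service
    # "logging" driver/options keys
    docker run --log-driver json-file --log-opt max-size=10m --log-opt max-file=5 <etherpad-image>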
19:30:43 <clarkb> #topic Quo vadis Storyboard
19:30:49 <clarkb> I learned me a latin yesterday
19:31:18 <frickler> sorry, had that 6 years in school ;)
19:31:28 <clarkb> During/after the PTG it became apparent that a number of projects were looking at moving off of storyboard
19:32:30 <clarkb> In an effort to help converge discussion (or attempt to) as well as better understand the usage of storyboard I started a thread on service-discuss to gather feedback on storyboard and hopefully identify people who might be willing to help maintain it
19:33:03 <clarkb> The feedback so far has been from those looking at moving away. We have not yet identified any group that has indicated a desire to keep using storyboard or help maintain it.
19:34:00 <clarkb> I don't want to do a ton of discussion here in the meeting as I think it would be good to keep as much of that on the mailing list as possible. This way we don't fragment the discussion (which was already a concern for me) and it helps ensure everyone can easily stay on top of the discussion
19:34:12 <clarkb> that said I think there are a few important takeaways that have come out of this so far
19:34:46 <clarkb> The first is that the OpenStack projects looking at moving have indicated that having different tools for issue tracking across projects has been tough for them. Consistency is a desirable feature
19:35:39 <clarkb> And second that while Storyboard was created to directly meet some of OpenStack's more unique issue tracking needs those needs have both shifted over time and storyboard isn't doing a great job of meeting them
19:36:26 <clarkb> I think that means any longer term maintenance isn't just about library updates and python version upgrades but would need to also try and address the feature need gap
19:37:12 <fungi> though part of why storyboard isn't doing a good job of meeting needs is that we haven't really kept up with fixing our deployment orchestration for it
19:37:40 <fungi> (for example, people wanted attachments support, storyboard has it but our storyboard deployment does not)
19:37:50 <clarkb> yes the two problems are interrelated. But a number of the issues raised appear to be due to a divergence in understanding of what the needs are not just poor implementation hampered by not updating
19:38:15 <clarkb> duplicate issue handling was a major one called out for example
19:39:10 <clarkb> from a hosting perspective this puts us in an unfortunate position because the users interested in solving these problems seem disinterested in solving them through storyboard.
19:40:10 <clarkb> I think if we can find (an as of yet undiscovered) group of people to work on the maintenance and feature gaps we would be happy to continue hosting storyboard. I'm worried about what we do without that since it does clearly need help
19:40:30 <clarkb> But like I said I think we need to give it a bit more time since my last email before we dig into that too much. As we're still trying to find that potential group
19:42:24 <clarkb> those were the key details I've pulled out so far. Please respond to the thread if you have thoughts or concerns and we'll continue to try and keep as much of the discussion there as possible
19:43:14 <clarkb> #topic Vexxhost nova server rescue behavior
19:43:46 <clarkb> I tested this. I'm glad I did. Unfortunately, server rescue on regular instances and boot from volume instances does not work
19:44:46 <ianw> this was the microversion api issues?
19:44:54 <clarkb> With boot from volume instances if you naively make the request without specifying the compute api version it straight up fails because you need nova 2.87 or newer to even support the functionality. Vexxhost supports up to 2.88 which allowed me to try it with the microversion specified. The initial request doesn't fail, but the rescue does. The server ends up in an error state
19:45:24 <clarkb> The reason for this error is some sort of permissions thing which i passed on to mnaser__
19:45:39 <clarkb> If you try to unrescue a node in this state it fails because you cannot unrescue an error state instance
19:46:30 <clarkb> basically as far as I can tell rescuing a boot from volume instance using the rescue command is completely broken. However, I think you might be able to shutdown the node, attach the boot volume to another running instance and effectively do a rescue manually
19:46:31 <fungi> that would have been rough if we had tried to go that route during the gerrit outage
19:46:37 <fungi> thanks for catching it
19:46:47 <clarkb> I still need to test this ^ theoretical process
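For the record, the boot-from-volume attempts were along these lines (server/image names are placeholders; the failure behavior is as described above):

    # naive request: rejected outright, BFV rescue needs compute API >= 2.87
    openstack server rescue --image <rescue-image> <server>
    # with the microversion pinned the request is accepted, but the rescue itself
    # fails and the server lands in ERROR
    openstack --os-compute-api-version 2.88 server rescue --image <rescue-image> <server>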
19:47:15 <clarkb> For regular instances things are a bit less straightforward. The rescue commands all work out of the box, but what happens is you get the rescue image's kernel running with the rescued node's root disk mounted on /
19:47:38 <clarkb> I suspect the reason for this is a collision of the root label used in the kernel boot line. They all use cloudimg-rootfs by default and linux is doing what we don't want here.
19:48:11 <clarkb> The next thing I need to test on this is manually changing the root label on a test node to something other than cloudimg-rootfs and then rescuing it. In theory if the labels don't collide we'll get more reliable behavior
19:48:26 <clarkb> If that works we might want to have launch node change our root partition labels
19:49:04 <clarkb> all that to say the naive approach is not working. It was a great idea to test it (thank you frickler for calling out the concerns). We may have some workarounds we can employ
19:49:05 <fungi> so basically we boot the rescue image's kernel/initrd but then mount the original rootfs and pivot to it during early boot?
19:49:14 <clarkb> fungi: yup that appears to be what happens
19:49:25 <fungi> i can definitely see how that could occur
19:49:30 <clarkb> fungi: and that is problematic because if you have a systemd issue you're preserving the problem
19:49:53 <fungi> absolutely. unless you can get to the kernel command line to edit it i guess
19:50:16 <clarkb> I'm going to try and test those workarounds later today and if one or both of them work I'll need to push a docs update I guess
19:50:20 <ianw> we could switch to boot by uuid?
19:50:51 <clarkb> ianw: that won't work because the rescue image is still looking for cloudimg-rootfs which would still be present
19:51:20 <clarkb> you either need to change the label on the actual instance to avoid the collision or use special rescue images that don't look for cloudimg-rootfs
19:51:31 <ianw> ahh, i see yeah, you need to purge the label
19:51:56 <ianw> as long as we don't make ourselves unbootable in the process of making ourselves more bootable :)
19:51:59 <clarkb> I'm still hopeful there is a usable set of steps here to get what we need. I just need to do more testing
19:52:12 <clarkb> ianw: ya that's why doing it in launch node before the launch node reboot is a good idea I think
19:52:20 <clarkb> we'll catch any errors very early :)
19:52:32 <clarkb> but need more testing before we commit to anything as I'm not even sure this will fix it yet
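The label-change workaround would look something like this on a test node (an untested sketch, as clarkb notes; the device name and the set of files referencing the old label vary by image):

    # rename the root filesystem label so it no longer collides with the rescue
    # image's cloudimg-rootfs
    e2label /dev/vda1 opendev-rootfs
    # find anything that still mounts/boots by the old label, update it, then
    # regenerate grub config and reboot to confirm the node still comes up
    grep -rl cloudimg-rootfs /etc/fstab /etc/default/grub /etc/default/grub.d /boot/grub/grub.cfg
    update-grub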
19:53:20 <clarkb> in the meantime please avoid rescuing instances in vexxhost unless it is our last resort. It may well make things worse rather than better
19:54:13 <clarkb> #topic Open Discussion
19:54:17 <clarkb> Anything else?
19:55:19 <fungi> trying to fix a corner case with pypi uploads. see this and its zuul-jobs dependency:
19:55:21 <fungi> #link https://review.opendev.org/864019 Skip existing remote artifacts during PyPI upload
19:55:57 <fungi> the openstack release team wound up with a project where the wheel was uploaded but then pypi hit a transient error while trying to upload the corresponding sdist
19:56:05 <fungi> so no way to rerun the job in its current state
19:56:28 <clarkb> the bit about being able to separate multi arch builds is a good feature too
19:56:53 <fungi> fixing it to no longer treat existing remote files as an error condition was less work than trying to dig out the twine api key and upload the sdist myself
19:57:21 <fungi> i'll reenqueue the failed tag (for openstack/metalsmith) once those changes land
19:57:37 <fungi> that way we'll have confirmation it solves the situation
19:57:46 <clarkb> one minor concern is that this might allow us to upload sdists after a wheel failure
19:58:03 <frickler> it also seems like wheel builds are broken again for 7 days, didn't look at logs yet
19:58:04 <clarkb> which would be a problem for anything that has complicated build requirements and downstream users that rely on wheels
19:58:35 <fungi> clarkb: i don't think this skips all failures, just ignores when pypi tells twine that the file it's trying to upload is already there
19:58:42 <clarkb> fungi: perfect
19:59:07 <fungi> if there's a different upload error, the role/play/job should still fail properly
19:59:53 <fungi> e.g., the build which hit the original problem leaving the sdist not uploaded would still have failed the same way, but this allows us to successfully reenqueue the tag and run the job again
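The underlying mechanism is presumably twine's existing knob for this rather than anything new, i.e. roughly:

    # skip artifacts PyPI already has instead of treating them as an upload error
    twine upload --skip-existing dist/*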
19:59:59 <ianw> frickler: thanks -- https://zuul.opendev.org/t/openstack/build/89b55250a9e4433980dbdbb7ab2cf39c looks like centos openafs errors
20:00:27 <clarkb> and we are at time. Thank you everyone.
20:00:38 <clarkb> As I mentioned we should be back here same time and place next week. See you then
20:00:40 <clarkb> #endmeeting