Tuesday, 2022-11-08

clarkbmeeting time19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Nov  8 19:01:14 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
ianwo/19:01
fungiohai!19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-November/000378.html Our Agenda19:01
clarkb#topic Announcements19:01
frickler\o19:01
clarkbI won't be around Friday or Monday. I'll be back Tuesday so expect to have a meeting next week. I may send the agenda for that a bit late (Tuesday morning?)19:02
fungithanks!19:02
fungihope you have a great weekend19:02
clarkbother than that I didn't have anything to announce. We can dive in I guess19:03
clarkb#topic Topics19:03
clarkb#topic Bastion Host Updates19:03
clarkb#link https://review.opendev.org/q/topic:prod-bastion-group19:03
clarkb#link https://review.opendev.org/q/topic:bridge-ansible-venv19:03
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86356419:03
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86356819:03
clarkbit seems like we're really close to being done with the old bastion and the new one is working well19:03
clarkbianw does have a few remaining changes that need review as I've linked above. Would be great if we can review those to close that portion of the work out19:04
ianwyep thanks -- goes off on a few tangents but all related19:04
clarkbthen separately I think it would be helpful if people can look at old bridge and make note of anything missing that needs to be moved or encoded in ansible. the raxdns client venv, ssl cert stuff, and hieraedit (general yaml editor) were things I noted19:05
clarkbianw: other than reviewing changes and checking for missing content is there anything else we can do to help?19:06
ianwyep notes made on 19:06
ianw#link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-1019:06
ianwnope -- the changes you linked to have some things like adding host keys to inventory and i think we've actually fixed the blockers for parallel job running as a nice side-effect too19:07
ianwand i'll keep working on the todo to finish it off19:07
clarkbsounds good. Thanks! This was always going to be a lot of effort and I appreciate you taking it on :)19:08
clarkb#topic Bionic Server Upgrades19:08
ianw*next* time it will be easy :)19:08
clarkbNot a whole lot to add here. I stalled a bit on this because of the doom and gloom from the openssl cve. But I should look at the list and pick off something new to do19:09
clarkb#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades19:09
clarkbthe changes to remove snapd and not do phased updates did land19:09
clarkbwhich were related to things I found when booting the new jammy gitea-lb0219:09
clarkbI keep meaning to check that snapd is gone as expected but haven't managed that yet.19:10
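For context, a quick spot check along these lines (illustrative shell, not the actual ansible or the exact option names used in the change) would confirm on a given host that snapd is gone and that the phased-updates override is in place:

    # snapd should report "not installed" once the removal change has applied
    dpkg -s snapd 2>&1 | head -n1
    # and the apt configuration should show whatever phased-updates override we set
    apt-config dump | grep -i phased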
fricklerdo we have servers other than storyboard that still run on xenial?19:10
clarkbI am not aware of any current issues deploying services on jammy so new deployments should favor it19:10
clarkbfrickler: there are like 4? cacti, translate, storyboard, and something I'm forgetting I feel like19:11
clarkbfrickler: they are all puppeted services hence the difficulty of moving19:11
fricklermaybe still list them in the upgrade pad so we don't forget about them?19:11
clarkbgood idea19:12
fungiwiki is still on trusty i think19:12
clarkbOpenStack has been looking at zanata replacements though so progress on that end at least19:12
frickleroh, I can't even log into wiki from my Jammy host19:13
fricklerUnable to negotiate with 2001:4800:7813:516:be76:4eff:fe06:36e7 port 22: no matching host key type found. Their offer: ssh-rsa,ssh-dss19:13
clarkbya that's the sha1 + rsa problem. You can specifically allow certain hosts to do sha119:14
fungiyeah, ssh to wiki needs host-specific overrides for openssh19:14
clarkbthe openssh 8.8 release notes have details19:14
fricklerah, o.k.19:14
fungii think it's less about that and more that it isn't new enough to support elliptic curve keys19:15
clarkbanyway the good news is jammy is functional as far as I can tell and any replacement can prefer it at this point19:15
fungibut could be both19:15
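For reference, the negotiation failure quoted above is OpenSSH 8.8+ refusing the wiki host's RSA/SHA-1 (ssh-rsa) host key signature by default; a per-host override looks roughly like this (hostname illustrative, and the equivalent Host stanza can live in ~/.ssh/config instead):

    # re-enable the legacy algorithms for this one host only
    ssh -o HostKeyAlgorithms=+ssh-rsa -o PubkeyAcceptedAlgorithms=+ssh-rsa wiki.openstack.org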
clarkb#topic Mailman 319:15
clarkbThis feels like it has stalled out a bit largely on a decision for whether or not we should use forked images19:15
fungiwell, i wasn't around last week too19:16
clarkbthe upstream did end up merging my lynx addition change. I have also noticed they have "rolling" docker image tags19:16
fungiis switching to rolling just a matter of adjusting the dockerfile?19:16
clarkbI think that means we could at this point choose to use upstream and their rolling tags and that would be roughly equivalent to what I have proposed for our fork19:16
fungior the compose file i guess if we don't need to fork19:17
clarkbfungi: it's an edit to the docker-compose.yaml file. It isn't clear to me what sorts of upgrade guarantees they plan to make though19:17
clarkbI've also not seen any responses to my issues around specifying the domain names19:17
clarkbwhich makes me wary. It feels like they merged the easiest change I pushed and called it a day :/19:17
fungii guess we could keep forking but just run a very minimal fork unless we need to do more in the future?19:18
clarkbya I'm on the fence. Our fork now is equivalent to upstream which makes me think we should just use upstream. But I also would like better comms/info from upstream if possible. I think I lean slightly towards using a local (non)fork for now and seeing if upstream improves19:19
clarkbbasically build our own images so that we can easily make changes as we don't really expect upstream to quickly make updates right now19:20
clarkbeven though our current image isn't different19:20
fungii'm good with that. unless anyone objects, let's just go that way for now19:20
ianw++ i'm fine with that.  we tried upstream first, but it's not always going to be the solution19:21
clarkbfungi: cool. Might also be worth a rebuild and rehold just to double check current builds work as expected? But then plan to boot a new node and deploy something?19:21
fungiyes, we'll need to un-wip the changes of course19:21
clarkbyup I think I unwip'd the work19:22
clarkb*the fork19:22
fungioh, cool19:22
fungii'll do another hold and full import test this week, then boot a prod server19:22
clarkbI guess let me know if I can help with that process of spot checking things19:22
fungii guess we settled on sticking with rackspace dfw for now?19:23
clarkbjust be sure the docker images rebuild too19:23
clarkbI think that is my slight preference simply because the dns is easier to manage19:23
clarkbbut we will have to update PBL? we may have to do that with vexxhost too? May be good to ask mnaser__ if he has an opinion on it19:23
fungii expect to do a broad sweep of blocklists for the new server ip addresses regardless, and put in exclusion requests19:25
clarkb++19:25
fungiwell in advance of any maintenance to put it into production19:25
clarkbthen ya my preference is slightly towards rax as it gives us a bit more direct control over things we may need to edit/control19:26
clarkbbut I'm happy if others feel the opposite way and override me on that :)19:26
clarkbwe also have a bit more time to sort that out while we do the rechecking of things in the images19:26
clarkb#topic Python base image updates for wheel installs19:27
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86215219:27
clarkbI don't want to single core approve this change as it will affect a number of our images. ianw you spent a bit of time with the siblings stuff and it might be good for you to double check it from that angle?19:27
clarkbI do think we should land a change like this though as it should make us more resilient to pip updates in the future19:27
ianwahh yes sorry, i meant to review that.  i'll give it a proper review today19:28
clarkbthanks. It isn't super urgent but would be good to have before the next pip update :019:28
clarkber :)19:28
clarkb#topic Etherpad container log growth19:28
clarkbianw: I know you've been busy with bridge and k8s and stuff. Any chance there is a change for this yet?19:29
clarkbif you'd like I can try to take a stab at it too. Would probably be good to better understand that container logging stuff anyway19:29
ianwsorry, yes i got totally distracted in zuul-jobs.  i can do it and we can check it all works19:30
clarkbok19:30
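One common way to cap container log growth looks like the snippet below (a sketch of one option, not necessarily the change ianw has in mind; equivalent per-service logging options can go in docker-compose.yaml instead):

    # limit the json-file logging driver globally, then restart dockerd
    echo '{ "log-driver": "json-file", "log-opts": { "max-size": "10m", "max-file": "3" } }' > /etc/docker/daemon.json
    systemctl restart docker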
clarkb#topic Quo vadis Storyboard19:30
clarkbI learned me a latin yesterday19:30
fricklersorry, had that 6 years in school ;)19:31
clarkbDuring/after the PTG it became apparent that a number of projects were looking at moving off of storyboard19:31
clarkbIn an effort to help converge discussion (or attempt to) as well as better understand the usage of storyboard I started a thread on service-discuss to gather feedback on storyboard and hopefully identify people who might be willing to help maintain it19:32
clarkbThe feedback so far has been from those looking at moving away. We have not yet identified any group that has indicated a desire to keep using storyboard or help maintain it.19:33
clarkbI don't want to do a ton of discussion here in the meeting as I think it would be good to keep as much of that on the mailing list as possible. This way we don't fragment the discussion (which was already a concern for me) and it helps ensure everyone can easily stay on top of the discussion19:34
clarkbthat said I think there are a few important takeaways that have come out of this so far19:34
clarkbThe first is that the OpenStack projects looking at moving have indicated that having different tools for issue tracking across projects has been tough for them. Consistency is a desirable feature19:34
clarkbAnd second that while Storyboard was created to directly meet some of OpenStack's more unique issue tracking needs those needs have both shifted over time and storyboard isn't doing a great job of meeting them19:35
clarkbI think that means any longer term maintenance isn't just about library updates and python version upgrades but would need to also try and address the feature need gap19:36
fungithough part of why storyboard isn't doing a good job of meeting needs is that we haven't really kept up with fixing our deployment orchestration for it19:37
fungi(for example, people wanted attachments support, storyboard has it but our storyboard deployment does not)19:37
clarkbyes the two problems are interrelated. But a number of the issues raised appear to be due to a divergence in understanding of what the needs are, not just poor implementation hampered by not updating19:37
clarkbduplicate issue handling was a major one called out for example19:38
clarkbfrom a hosting perspective this puts us in an unfortunate position because the users interested in solving these problems seem disinterested in solving them through storyboard.19:39
clarkbI think if we can find (an as of yet undiscovered) group of people to work on the maintenance and feature gaps we would be happy to continue hosting storyboard. I'm worried about what we do without that since it does clearly need help19:40
clarkbBut like I said I think we need to give it a bit more time since my last email before we dig into that too much. As we're still trying to find that potential group19:40
clarkbthose were the key details I've pulled out so far. Please respond to the thread if you have thoughts or concerns and we'll continue to try and keep as much of the discussion there as possible19:42
clarkb#topic Vexxhost nova server rescue behavior19:43
clarkbI tested this. I'm glad I did. Unfortunately, server rescue on regular instances and boot from volume instances does not work19:43
ianwthis was the microversion api issues?19:44
clarkbWith boot from volume instances if you naively make the request without specifying the compute api version it straight up fails because you need nova 2.87 or newer to even support the functionality. Vexxhost supports up to 2.88 which allowed me to try it with the microversion specified. The initial request doesn't fail, but the rescue does. The server ends up in an error state19:44
clarkbThe reason for this error is some sort of permissions thing which i passed on to mnaser__19:45
clarkbIf you try to unrescue a node in this state it fails because you cannot unrescue and error state instance19:45
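For reference, the boot-from-volume attempt described above looks roughly like this with openstackclient (server and image names are placeholders):

    # rescue of a boot-from-volume server needs compute API microversion >= 2.87
    # (vexxhost advertises up to 2.88)
    openstack --os-compute-api-version 2.88 server rescue --image <rescue-image> <server>
    # once the instance lands in ERROR state, the unrescue is refused
    openstack server unrescue <server>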
clarkbbasically as far as I can tell rescuing a boot from volume instance using the rescue command is completely broken. However, I think you might be able to shutdown the node, attach the boot volume to another running instance and effectively do a rescue manually19:46
fungithat would have been rough if we had tried to go that route during the gerrit outage19:46
fungithanks for catching it19:46
clarkbI still need to test this ^ theoretical process19:46
clarkbFor regular instances things are a bit less straightforward. The rescue commands all work out of the box, but what happens is you get the rescue image's kernel running with the rescued node's root disk mounted on /19:47
clarkbI suspect the reason for this is a collision of the root label used in the kernel boot line. They all use cloudimg-rootfs by default and linux is doing what we don't want here.19:47
clarkbThe next thing I need to test on this is manually changing the root label on a test node to something other than cloudimg-rootfs and then rescuing it. In theory if the labels don't collide we'll get more reliable behavior19:48
clarkbIf that works we might want to have launch node change our root partition labels19:48
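A rough, untested sketch of what that label change could look like in launch-node follows (it assumes an ext4 root on /dev/vda1 and that the label is referenced from /etc/fstab and the grub configuration; device, label, and paths are all illustrative):

    NEW_LABEL=opendev-rootfs
    e2label /dev/vda1 "$NEW_LABEL"
    # update every reference to the old label before the next reboot
    sed -i "s/LABEL=cloudimg-rootfs/LABEL=$NEW_LABEL/g" /etc/fstab
    grep -rl cloudimg-rootfs /etc/default/grub /etc/default/grub.d/ 2>/dev/null \
      | xargs -r sed -i "s/cloudimg-rootfs/$NEW_LABEL/g"
    update-grub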
clarkball that to say the naive approach is not working. It was a great idea to test it (thank you frickler for calling out the concerns). We may have some workarounds we can employ19:49
fungiso basically we boot the rescue image's kernel/initrd but then mount the original rootfs and pivot to it during early boot?19:49
clarkbfungi: yup that appears to be what happens19:49
fungii can definitely see how that could occur19:49
clarkbfungi: and that is problematic because if you have a systemd issue you're preserving the problem19:49
fungiabsolutely. unless you can get to the kernel command line to edit it i guess19:49
clarkbI'm going to try and test those workarounds later today and if one or both of them work I'll need to push a docs update I guess19:50
ianwwe could switch to boot by uuid?19:50
clarkbianw: that won't work because the rescue image is still looking for cloudimg-rootfs which would still be present19:50
clarkbyou either need to change the label on the actual instance to avoid the collision or use special rescue images that don't look for cloudimg-rootfs19:51
ianwahh, i see yeah, you need to purge the label19:51
ianwas long as we don't make ourselves unbootable in the process of making ourselves more bootable :)19:51
clarkbI'm still hopeful there is a useable set of steps here to get what we need. I just need to do more testing19:51
clarkbianw: ya that's why doing it in launch node before the launch node reboot is a good idea I think19:52
clarkbwe'll catch any errors very early :)19:52
clarkbbut need more testing before we commit to anything as I'm not even sure this will fix it yet19:52
clarkbin the meantime please avoid rescuing instances in vexxhost unless it is our last resort. It may well make things worse rather than better19:53
clarkb#topic Open Discussion19:54
clarkbAnything else?19:54
fungitrying to fix a corner case with pypi uploads. see this and its zuul-jobs dependency:19:55
fungi#link https://review.opendev.org/864019 Skip existing remote artifacts during PyPI upload19:55
fungithe openstack release team wound up with a project where the wheel was uploaded but then pypi hit a transient error while trying to upload the corresponding sdist19:55
fungiso no way to rerun the job in its current state19:56
clarkbthe bit about being able to separate multi arch builds is a good feature too19:56
fungifixing it to no longer treat existing remote files as an error condition was less work than trying to dig out the twine api key and upload the sdist myself19:56
fungii'll reenqueue the failed tag (for openstack/metalsmith) once those changes land19:57
fungithat way we'll have confirmation it solves the situation19:57
clarkbone minor concern is that this might allow us to upload sdists after a wheel failure19:57
fricklerit also seems like wheel builds are broken again for 7 days, didn't look at logs yet19:58
clarkbwhich would be a problem for anything that has complicated build requirements and downstream users that rely on wheels19:58
fungiclarkb: i don't think this skips all failures, just ignores when pypi tells twine that the file it's trying to upload is already there19:58
clarkbfungi: perfect19:58
fungiif there's a different upload error, the role/play/job should still fail properly19:59
fungie.g., the build which hit the original problem leaving the sdist not uploaded would still have failed the same way, but this allows us to successfully reenqueue the tag and run the job again19:59
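twine itself has a flag for exactly this behavior, which is presumably what the linked changes wire into the upload role (an assumption; the reviews have the details):

    # --skip-existing makes "file already exists" responses from the index a
    # skip instead of an error, so a partially uploaded release can be retried
    twine upload --skip-existing dist/*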
ianwfrickler: thanks -- https://zuul.opendev.org/t/openstack/build/89b55250a9e4433980dbdbb7ab2cf39c looks like centos openafs errors19:59
clarkband we are at time. Thank you everyone.20:00
clarkbAs I mentioned we should be back here same time and place next week. See you then20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov  8 20:00:40 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-08-19.01.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-08-19.01.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-08-19.01.log.html20:00
fungithanks clarkb!20:01
