19:01:14 #startmeeting infra
19:01:14 Meeting started Tue Nov 8 19:01:14 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 The meeting name has been set to 'infra'
19:01:20 o/
19:01:27 ohai!
19:01:35 #link https://lists.opendev.org/pipermail/service-discuss/2022-November/000378.html Our Agenda
19:01:44 #topic Announcements
19:01:58 \o
19:02:17 I won't be around Friday or Monday. I'll be back Tuesday so expect to have a meeting next week. I may send the agenda for that a bit late (Tuesday morning?)
19:02:34 thanks!
19:02:40 hope you have a great weekend
19:03:02 other than that I didn't have anything to announce. We can dive in I guess
19:03:06 #topic Topics
19:03:12 #topic Bastion Host Updates
19:03:19 #link https://review.opendev.org/q/topic:prod-bastion-group
19:03:23 #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:03:28 #link https://review.opendev.org/c/opendev/system-config/+/863564
19:03:32 #link https://review.opendev.org/c/opendev/system-config/+/863568
19:03:50 it seems like we're really close to being done with the old bastion and the new one is working well
19:04:12 ianw does have a few remaining changes that need review as I've linked above. Would be great if we can review those to close that portion of the work out
19:04:59 yep thanks -- goes off on a few tangents but all related
19:05:04 then separately I think it would be helpful if people can look at old bridge and make note of anything missing that needs to be moved or encoded in ansible. the raxdns client venv, ssl cert stuff, and hieraedit (generally yaml editor) were things I noted
19:06:00 ianw: other than reviewing changes and checking for missing content is there anything else we can do to help?
19:06:03 yep notes made on
19:06:07 #link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-10
19:07:04 nope -- the changes you linked to have some things like adding host keys to inventory and i think we've actually fixed the blockers for parallel job running as a nice side-effect too
19:07:40 and i'll keep working on the todo to finish it off
19:08:19 sounds good. Thanks! This was always going to be a lot of effort and I appreciate you taking it on :)
19:08:44 #topic Bionic Server Upgrades
19:08:46 *next* time it will be easy :)
19:09:23 Not a whole lot to add here. I stalled a bit on this because of the doom and gloom from the openssl cve. But I should look at the list and pick off something new to do
19:09:29 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades
19:09:39 the changes to remove snapd and not do phased updates did land
19:09:54 which were related to things I found when booting the new jammy gitea-lb02
19:10:09 I keep meaning to check that snapd is gone as expected but haven't managed that yet.
19:10:50 do we have servers other than storyboard that still run on xenial?
19:10:56 I am not aware of any current issues deploying services on jammy so new deployments should favor it
19:11:17 frickler: there are like 4? cacti, translate, storyboard, and something I'm forgetting I feel like
19:11:28 frickler: they are all puppeted services hence the difficulty of moving
19:11:57 maybe still list them in the upgrade pad so we don't forget about them?
19:12:04 good idea
19:12:24 wiki is still on trusty i think
19:12:27 OpenStack has been looking at zanata replacements though so progress on that end at least
19:13:24 oh, I can't even log into wiki from my Jammy host
19:13:44 Unable to negotiate with 2001:4800:7813:516:be76:4eff:fe06:36e7 port 22: no matching host key type found. Their offer: ssh-rsa,ssh-dss
19:14:04 ya that's the sha1 + rsa problem. You can specifically allow certain hosts to do sha1
19:14:10 yeah, ssh to wiki needs host-specific overrides for openssh
19:14:23 the openssh 8.8 release notes have details
19:14:51 ah, o.k.
19:15:02 i think it's less about that and more that it isn't new enough to support elliptic curve keys
19:15:05 anyway the good news is jammy is functional as far as I can tell and any replacement can prefer it at this point
19:15:20 but could be both
19:15:35 #topic Mailman 3
19:15:52 This feels like it has stalled out a bit largely on a decision for whether or not we should use forked images
19:16:12 well, i wasn't around last week too
19:16:13 the upstream did end up merging my lynx addition change. I have also noticed they have "rolling" docker image tags
19:16:44 is switching to rolling just a matter of adjusting the dockerfile?
19:16:44 I think that means we could at this point choose to use upstream and their rolling tags and that would be roughly equivalent to what I have proposed for our fork
19:17:03 or the compose file i guess if we don't need to fork
19:17:06 fungi: it's an edit to the docker-compose.yaml file. It isn't clear to me what sorts of upgrade guarantees they plan to make though
19:17:21 I've also not seen any responses to my issues around specifying the domain names
19:17:37 which makes me wary. It feels like they merged the easiest change I pushed and called it a day :/
19:18:41 i guess we could keep forking but just run a very minimal fork unless we need to do more in the future?
19:19:42 ya I'm on the fence. Our fork now is equivalent to upstream which makes me think we should just use upstream. But I also would like better comms/info from upstream if possible. I think I lean slightly towards using a local (non)fork for now and seeing if upstream improves
19:20:04 basically build our own images so that we can easily make changes as we don't really expect upstream to quickly make updates right now
19:20:13 even though our current image isn't different
19:20:35 i'm good with that. unless anyone objects, let's just go that way for now
19:21:24 ++ i'm fine with that. we tried upstream first, but it's not always going to be the solution
19:21:43 fungi: cool. Might also be worth a rebuild and rehold just to double check current builds work as expected? But then plan to boot a new node and deploy something?
19:22:00 yes, we'll need to un-wip the changes of course
19:22:07 yup I think I unwip'd the work
19:22:09 *the fork
19:22:16 oh, cool
19:22:53 i'll do another hold and full import test this week, then boot a prod server
19:22:54 I guess let me know if I can help with that process of spot checking things
19:23:05 i guess we settled on sticking with rackspace dfw for now?
19:23:08 just be sure the docker images rebuild too
19:23:25 I think that is my slight preference simply because the dns is easier to manage
19:23:46 but we will have to update PBL? we may have to do that with vexxhost too?
May be good to ask mnaser__ if he has an opinion on it
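
For the wiki ssh failure quoted at 19:13:44, the usual client-side workaround on OpenSSH 8.8+ (which disables SHA-1 RSA signatures by default) is a host-specific override. This is only a sketch: the hostname is an example, and the pubkey option is only needed if the client's own key is also RSA.

    ssh -o HostKeyAlgorithms=+ssh-rsa -o PubkeyAcceptedAlgorithms=+ssh-rsa wiki.openstack.org
    # the same two options can go in a Host stanza in ~/.ssh/config so the legacy
    # algorithms are re-enabled only for this one old server rather than globally
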
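On the rolling-tag question above (19:16:44), and assuming the images in play are the upstream maxking/docker-mailman ones (the log does not say), switching would be roughly an image-tag edit in docker-compose.yaml along these lines rather than a Dockerfile change:

    # docker-compose.yaml sketch -- service and image names are assumptions
    services:
      mailman-core:
        image: maxking/mailman-core:rolling
      mailman-web:
        image: maxking/mailman-web:rolling

The trade-off discussed above still applies either way: upstream's rolling tags carry no stated upgrade guarantees, while a local (non)fork keeps the ability to patch quickly.
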
19:25:08 i expect to do a broad sweep of blocklists for the new server ip addresses regardless, and put in exclusion requests
19:25:17 ++
19:25:26 well in advance of any maintenance to put it into production
19:26:10 then ya my preference is slightly towards rax as it gives us a bit more direct control over things we may need to edit/control
19:26:22 but I'm happy if others feel the opposite way and override me on that :)
19:26:36 we also have a bit more time to sort that out while we do the rechecking of things in the images
19:27:07 #topic Python base image updates for wheel installs
19:27:11 #link https://review.opendev.org/c/opendev/system-config/+/862152
19:27:37 I don't want to single core approve this change as it will affect a number of our images. ianw you spent a bit of time with the siblings stuff and it might be good for you to double check it from that angle?
19:27:52 I do think we should land a change like this though as it should make us more resilient to pip updates in the future
19:28:15 ahh yes sorry, i meant to review that. i'll give it a proper review today
19:28:32 thanks. It isn't super urgent but would be good to have before the next pip update :0
19:28:34 er :)
19:28:49 #topic Etherpad container log growth
19:29:07 ianw: I know you've been busy with bridge and k8s and stuff. Any chance there is a change for this yet?
19:29:26 if you'd like I can try to take a stab at it too. Would probably be good to better understand that container logging stuff anyway
19:30:13 sorry, yes i got totally distracted in zuul-jobs. i can do it and we can check it all works
19:30:28 ok
19:30:43 #topic Quo vadis Storyboard
19:30:49 I learned me a latin yesterday
19:31:18 sorry, had that 6 years in school ;)
19:31:28 During/after the PTG it became apparent that a number of projects were looking at moving off of storyboard
19:32:30 In an effort to help converge discussion (or attempt to) as well as better understand the usage of storyboard I started a thread on service-discuss to gather feedback on storyboard and hopefully identify people who might be willing to help maintain it
19:33:03 The feedback so far has been from those looking at moving away. We have not yet identified any group that has indicated a desire to keep using storyboard or help maintain it.
19:34:00 I don't want to do a ton of discussion here in the meeting as I think it would be good to keep as much of that on the mailing list as possible. This way we don't fragment the discussion (which was already a concern for me) and it helps ensure everyone can easily stay on top of the discussion
19:34:12 that said I think there are a few important takeaways that have come out of this so far
19:34:46 The first is that the OpenStack projects looking at moving have indicated that having different tools for issue tracking across projects has been tough for them.
Consistency is a desirable feature
19:35:39 And second that while Storyboard was created to directly meet some of OpenStack's more unique issue tracking needs those needs have both shifted over time and storyboard isn't doing a great job of meeting them
19:36:26 I think that means any longer term maintenance isn't just about library updates and python version upgrades but would need to also try and address the feature need gap
19:37:12 though part of why storyboard isn't doing a good job of meeting needs is that we haven't really kept up with fixing our deployment orchestration for it
19:37:40 (for example, people wanted attachments support, storyboard has it but our storyboard deployment does not)
19:37:50 yes the two problems are interrelated. But a number of the issues raised appear to be due to a divergence in understanding of what the needs are, not just poor implementation hampered by not updating
19:38:15 duplicate issue handling was a major one called out for example
19:39:10 from a hosting perspective this puts us in an unfortunate position because the users interested in solving these problems seem disinterested in solving them through storyboard.
19:40:10 I think if we can find (an as of yet undiscovered) group of people to work on the maintenance and feature gaps we would be happy to continue hosting storyboard. I'm worried about what we do without that since it does clearly need help
19:40:30 But like I said I think we need to give it a bit more time since my last email before we dig into that too much. As we're still trying to find that potential group
19:42:24 those were the key details I've pulled out so far. Please respond to the thread if you have thoughts or concerns and we'll continue to try and keep as much of the discussion there as possible
19:43:14 #topic Vexxhost nova server rescue behavior
19:43:46 I tested this. I'm glad I did. Unfortunately, server rescue on regular instances and boot from volume instances does not work
19:44:46 this was the microversion api issues?
19:44:54 With boot from volume instances if you naively make the request without specifying the compute api version it straight up fails because you need nova 2.87 or newer to even support the functionality. Vexxhost supports up to 2.88 which allowed me to try it with the microversion specified. The initial request doesn't fail, but the rescue does. The server ends up in an error state
19:45:24 The reason for this error is some sort of permissions thing which I passed on to mnaser__
19:45:39 If you try to unrescue a node in this state it fails because you cannot unrescue an error state instance
19:46:30 basically as far as I can tell rescuing a boot from volume instance using the rescue command is completely broken. However, I think you might be able to shutdown the node, attach the boot volume to another running instance and effectively do a rescue manually
19:46:31 that would have been rough if we had tried to go that route during the gerrit outage
19:46:37 thanks for catching it
19:46:47 I still need to test this ^ theoretical process
19:47:15 For regular instances things are a bit less straightforward. The rescue commands all work out of the box, but what happens is you get the rescue image's kernel running with the rescued node's root disk mounted on /
19:47:38 I suspect the reason for this is a collision of the root label used in the kernel boot line. They all use cloudimg-rootfs by default and linux is doing what we don't want here.
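
For reference, a sketch of the two angles discussed in this topic: requesting the rescue with an explicit compute API microversion for boot-from-volume instances, and relabeling a regular instance's root filesystem so it no longer collides with the rescue image's cloudimg-rootfs. Server names, devices, image IDs, and the new label are placeholders, and nothing here is verified against vexxhost beyond what the discussion above reports.

    # boot-from-volume rescue is only accepted with microversion 2.87 or newer
    # (per the discussion it still ends up in ERROR server-side on vexxhost)
    openstack --os-compute-api-version 2.87 server rescue --image <rescue-image-uuid> <server>
    openstack server unrescue <server>    # refused once the server is in ERROR state

    # possible workaround for regular instances: give the real root fs a non-default
    # label and point /etc/fstab plus the grub kernel command line at it (or at
    # root=UUID=...), so the rescue kernel's root=LABEL=cloudimg-rootfs no longer
    # matches the instance's own disk
    e2label /dev/vda1 rootfs-prod    # ext4 only; device and label are examples
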
19:48:11 The next thing I need to test on this is manually changing the root label on a test node to something other than cloudimg-rootfs and then rescuing it. In theory if the labels don't collide we'll get more reliable behavior
19:48:26 If that works we might want to have launch node change our root partition labels
19:49:04 all that to say the naive approach is not working. It was a great idea to test it (thank you frickler for calling out the concerns). We may have some workarounds we can employ
19:49:05 so basically we boot the rescue image's kernel/initrd but then mount the original rootfs and pivot to it during early boot?
19:49:14 fungi: yup that appears to be what happens
19:49:25 i can definitely see how that could occur
19:49:30 fungi: and that is problematic because if you have a systemd issue you're preserving the problem
19:49:53 absolutely. unless you can get to the kernel command line to edit it i guess
19:50:16 I'm going to try and test those workarounds later today and if one or both of them work I'll need to push a docs update I guess
19:50:20 we could switch to boot by uuid?
19:50:51 ianw: that won't work because the rescue image is still looking for cloudimg-rootfs which would still be present
19:51:20 you either need to change the label on the actual instance to avoid the collision or use special rescue images that don't look for cloudimg-rootfs
19:51:31 ahh, i see yeah, you need to purge the label
19:51:56 as long as we don't make ourselves unbootable in the process of making ourselves more bootable :)
19:51:59 I'm still hopeful there is a usable set of steps here to get what we need. I just need to do more testing
19:52:12 ianw: ya that's why doing it in launch node before the launch node reboot is a good idea I think
19:52:20 we'll catch any errors very early :)
19:52:32 but need more testing before we commit to anything as I'm not even sure this will fix it yet
19:53:20 in the meantime please avoid rescuing instances in vexxhost unless it is our last resort. It may well make things worse rather than better
19:54:13 #topic Open Discussion
19:54:17 Anything else?
19:55:19 trying to fix a corner case with pypi uploads.
see this and its zuul-jobs dependency:
19:55:21 #link https://review.opendev.org/864019 Skip existing remote artifacts during PyPI upload
19:55:57 the openstack release team wound up with a project where the wheel was uploaded but then pypi hit a transient error while trying to upload the corresponding sdist
19:56:05 so no way to rerun the job in its current state
19:56:28 the bit about being able to separate multi arch builds is a good feature too
19:56:53 fixing it to no longer treat existing remote files as an error condition was less work than trying to dig out the twine api key and upload the sdist myself
19:57:21 i'll reenqueue the failed tag (for openstack/metalsmith) once those changes land
19:57:37 that way we'll have confirmation it solves the situation
19:57:46 one minor concern is that this might allow us to upload sdists after a wheel failure
19:58:03 it also seems like wheel builds have been broken again for 7 days, didn't look at logs yet
19:58:04 which would be a problem for anything that has complicated build requirements and downstream users that rely on wheels
19:58:35 clarkb: i don't think this skips all failures, just ignores when pypi tells twine that the file it's trying to upload is already there
19:58:42 fungi: perfect
19:59:07 if there's a different upload error, the role/play/job should still fail properly
19:59:53 e.g., the build which hit the original problem leaving the sdist not uploaded would still have failed the same way, but this allows us to successfully reenqueue the tag and run the job again
19:59:59 frickler: thanks -- https://zuul.opendev.org/t/openstack/build/89b55250a9e4433980dbdbb7ab2cf39c looks like centos openafs errors
20:00:27 and we are at time. Thank you everyone.
20:00:38 As I mentioned we should be back here same time and place next week. See you then
20:00:40 #endmeeting
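
As a reference for the pypi corner case in open discussion: twine itself has a --skip-existing option that turns PyPI's "file already exists" response into a no-op instead of an error, which matches the behavior described above; whether the linked zuul-jobs change uses that flag or filters the artifact list another way is not stated in the log. A minimal sketch with a placeholder path:

    # re-run an upload where the wheel already reached PyPI but the sdist did not;
    # files that are already present are skipped, other upload errors still fail
    twine upload --skip-existing dist/*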