19:01:15 <clarkb> #startmeeting infra
19:01:15 <opendevmeet> Meeting started Tue Oct  4 19:01:15 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:15 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:15 <opendevmeet> The meeting name has been set to 'infra'
19:01:19 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000363.html Our Agenda
19:01:23 <clarkb> #topic Announcements
19:01:32 <clarkb> The OpenStack release is happening this week (tomorrow in fact)
19:01:57 <clarkb> fungi: I think you indicated you would try to be around early tomorrow to keep an eye on things. I'll do my best too
19:02:08 <clarkb> But I don't expect any issues
19:02:26 <fungi> yeah, though i have appointments starting around 14:00 utc
19:02:39 <fungi> so will be less available at that point
19:02:54 <fungi> extra eyes are appreciated
19:03:07 <clarkb> I can probably be around at that point and take over
19:03:15 <clarkb> The other thing to note is that the PTG is in 2 weeks
19:04:11 <clarkb> #topic Bastion Host Changes
19:04:15 <clarkb> lets dive right into the agenda
19:04:29 <clarkb> ianw has made progress on a stack of changes to shift bridge to running ansible out of a venv
19:04:34 <clarkb> #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:04:44 <clarkb> The changes lgtm but please do review them carefully since this is the bastion host
19:05:02 <ianw> yep i need to loop back on your comments, thank you, but it's close
19:05:12 <clarkb> ianw: one thing I noted on one of the changes is that launch node may need different venvs for different clouds in order to have different versions of openstacksdk
19:05:31 <clarkb> It is possible that a good followup to this will be managing launch node venvs for that purpose
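[A purely hypothetical sketch of the per-cloud venv idea discussed above; the venv paths, cloud names, and script path below are illustrative assumptions, not the actual launch-node code:]

```python
# Hypothetical: pick a per-cloud virtualenv so older clouds (e.g. rax) can
# pin an older openstacksdk while everything else uses the latest release.
import os
import subprocess

CLOUD_VENVS = {
    "openstackci-rax": "/opt/launcher-venvs/sdk-old",     # assumed path/name
    "default": "/opt/launcher-venvs/sdk-latest",          # assumed path/name
}

def run_launch_node(cloud, *args):
    venv = CLOUD_VENVS.get(cloud, CLOUD_VENVS["default"])
    python = os.path.join(venv, "bin", "python")
    # launch-node.py path is illustrative only.
    subprocess.run(
        [python, "launch/launch-node.py", "--cloud", cloud, *args],
        check=True)
```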
19:06:09 <clarkb> And then separately your change to update zuul to disable console log file generation landed in zuul and I think the most recent restart of the cluster picked it up
19:06:18 <clarkb> That means we can configure our jobs to not write those files
19:06:23 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/855472
19:06:27 <ianw> yeah; is that mostly cinder/rax?  i feel like that's been a pita before, and i saw in scrollback annoyances adding disk to nodepool builders
19:06:41 <ianw> (openstacksdk venvs)
19:06:46 <clarkb> ianw: it's now rax and networking (not sure if nova or neutron is the problem there)
19:06:48 <clarkb> but ya
19:07:00 <clarkb> ianw: re console log writing I have a note there that a second location also needs the update.
19:07:38 <fungi> though on a related note, the openstacksdk maintainers want to add a new pipeline in the openstack tenant of our zuul to ease testing of public clouds
19:08:14 <fungi> (a "post-review" pipeline and associated required label in gerrit to enable/trigger it)
19:08:17 <clarkb> and I've proposed a topic to the openstack tc ptg to discuss not forgetting the sdk is a tool for end users in addition to an internal api tool for openstack clusters
19:08:32 <ianw> ++
19:08:54 <fungi> i think i'm the only reviewer to have provided them feedback on those changes so far
19:08:59 <ianw> we don't want to have to start another project to smooth out differences in openstacksdk versions ... maybe call it "shade" :)
19:09:35 <clarkb> fungi: I thought I left a comment too
19:09:41 <fungi> ahh, cool
19:09:48 <fungi> i probably missed the update
19:09:49 <clarkb> indicating that there isn't a reason to put it in project-config I don't think
19:10:08 <clarkb> since project-config doesn't protect the secrets in the way they think it does
19:10:17 <fungi> oh, that part, yeah
19:10:42 <fungi> the pipeline creation still needs to happen in project-config though, as does the acl addition and support for the new review label in our linter
19:11:22 <clarkb> I guess I'm not up to date on why any of that is necessary. I'll have to take another look
19:11:36 <fungi> i can bring up more details when we get to open discussion
19:11:50 <clarkb> but ya infra-root please look over the ansible in venv changes and the console log file disabling change(s). And ianw don't forget the second change needed for that
19:11:58 <clarkb> Anything else to bring up on this topic?
19:12:05 <ianw> yep i'll loop back on that
19:12:12 <ianw> one minor change this revealed in zuul was
19:12:16 <ianw> #link https://review.opendev.org/c/zuul/zuul/+/860062
19:12:30 <ianw> after i messed up the node allocations.  that improves an edge-case error message
19:12:57 <ianw> i think probably the last thing i can do is switch the testing to "bridge.opendev.org"
19:13:16 <ianw> all the changes should have abstracted things such that it should work
19:13:56 <ianw> at that point, i think we're ready (modulo launching focal nodes) to do the switch.  it will still be quite a manual process getting secrets etc, but i'm happy to do that
19:13:59 <clarkb> ya and using the symlink into $PATH should make it fairly transparent to all the infra-prod job runs
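[A rough sketch of the venv-plus-symlink pattern being discussed; the real work is done by ansible roles in system-config, and the paths below are assumptions:]

```python
# Sketch only: create a dedicated ansible venv and symlink its entry points
# onto PATH so callers (infra-prod jobs, cron, interactive use) keep running
# plain "ansible-playbook" without knowing about the venv.
import subprocess
from pathlib import Path

VENV = Path("/usr/ansible-venv")     # assumed venv location
BIN_DIR = Path("/usr/local/bin")     # already on root's PATH

subprocess.run(["python3", "-m", "venv", str(VENV)], check=True)
subprocess.run([str(VENV / "bin" / "pip"), "install", "ansible"], check=True)

for tool in ("ansible", "ansible-playbook", "ansible-galaxy"):
    link = BIN_DIR / tool
    if not link.exists():
        link.symlink_to(VENV / "bin" / tool)
```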
19:15:13 <clarkb> #topic Updating Bionic Servers / Launching Jammy Servers
19:15:18 <clarkb> Thats a good lead into the next topic
19:15:49 <clarkb> corvus did try to launch the new tracing server on a jammy host but that failed because our base user role couldn't delete the ubuntu user as a process was running and owned by it
19:16:16 <clarkb> I believe what happened there is launch node logged in as the ubuntu user and used it to set up root. Then it logged back in as root and tried to delete the ubuntu user but something was left behind from the original login
19:16:22 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/860112 Update to launch node to handle jammy hosts
19:16:52 <clarkb> That is one attempt at addressing this. Basically we use userdel --force which won't care if a process is running. Then the end of launch node processing involves a reboot which should clear out any stale processes
19:17:19 <clarkb> The downside to this is that --force has some behaviors we may not want generally which is why I've limited the --force deletion to users associated with the base distro cloud images and not with our regular users
19:17:29 <clarkb> this way failures to remove regular users will bubble up and we can debug them more closely
19:18:09 <clarkb> If we don't like that I think another approach would be to have launch login as ubuntu, set up root, then reboot the host and log back in after a reboot
19:18:18 <corvus> what kind of undesirable behaviors?
19:18:19 <clarkb> the reboot should clear out any stale processes and allow userdel to run as before
19:18:48 <clarkb> corvus: "This option forces the removal of the user account, even if the user is still logged in. It also forces userdel to remove the user's home directory and mail spool, even if another user uses the same home directory or if the mail spool is not owned by the specified user."
19:19:04 <clarkb> corvus: in particular I think we want it to error if a normal user outside of the launch context is logged in or otherwise has processes running
19:19:28 <clarkb> as that is something we should address. In the launch context the ubuntu user isn't something we care about and we'll reboot in a few minutes anyway
19:19:45 <corvus> yep agree.  seems like --force is okay (even exactly what we want) for this case, and basically almost never otherwise.
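[A minimal sketch of the behaviour just described, not the actual launch-node change; the set of distro image account names is an assumption:]

```python
# Only the distro cloud-image default accounts get --force; regular users are
# removed without it so a stray login or process surfaces as an error.
import subprocess

DISTRO_IMAGE_USERS = {"ubuntu", "debian", "centos", "fedora"}  # assumed list

def remove_user(name):
    cmd = ["userdel", "--remove", name]
    if name in DISTRO_IMAGE_USERS:
        # --force removes the account even if it still owns processes; the
        # host reboots at the end of launch, which cleans those up anyway.
        cmd.insert(1, "--force")
    subprocess.run(cmd, check=True)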
19:20:51 <clarkb> anyway I expect that with that change landed we can retry a jammy node launch and see if we make more progress there
19:21:10 <clarkb> but also let me know if we want to try a different approach like the early reboot during launch idea
19:21:22 <clarkb> Did anyone else have server upgrade related items for this topic?
19:22:04 <ianw> all sounds good thanks!  hopefully we have some new nodes up soon :)  if not bridge, the arm64 bits too
19:23:03 <clarkb> #topic Mailman 3
19:23:12 <clarkb> We continue to make progress. Though things have probably slowed a bit
19:23:24 <clarkb> In particular my efforts to work upstream to improve the images seems to have stalled.
19:23:58 <clarkb> There haven't been any responses to the github issues and PRs so I sent email to the mailman3 users list and the response I got there was that maxking is basically the only person who devs on those and we need to wait for maxking
19:24:13 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/860157 Forking upstream mm3 images
19:24:23 <clarkb> because of that I finally gave in and pushed ^ to fork the images.
19:25:21 <clarkb> I think this leads us to two major questions: 1) Do we want to fork or just use the images with their existing issues? and 2) If we do want to fork how forked do we want to get? If we do a minimal fork we can more easily resync with upstream if they become active again. But then we need to continue to carry workarounds in our mm3 role and stick to their uid and gid
19:25:23 <clarkb> selections.
19:25:56 <clarkb> It is worth noting that I did look at maybe just building our own images based on our python base image stuff. The problem with that is it appears there is a lot of inside knowledge over what versions of things need to be combined together to make a working system
19:26:16 <clarkb> https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/H7YK27E4GKG3KNAUPWTV32XWRWPFEU25/ upstream even acknowledges the confusion
19:26:41 <clarkb> For that reason I think we're best off forking or working upstream if we can manage it and then hope upstream curates those lists of specific versions for us
19:27:21 <clarkb> The existing change does a "lightweight" fork fwiw. The only change I made to the images was to install lynx which is necessary for html to text conversion
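[For context on why lynx is the one addition: mailman's HTML-to-text conversion shells out to an external converter, lynx by default. A toy illustration of that kind of conversion, with a made-up input; the exact command mailman runs is configurable and may differ:]

```python
# Toy example of HTML-to-text via lynx, the conversion mailman relies on.
import subprocess

html = "<html><body><p>Hello <b>list</b> members</p></body></html>"
result = subprocess.run(
    ["lynx", "-dump", "-stdin", "-force_html"],
    input=html, capture_output=True, text=True, check=True,
)
print(result.stdout)  # roughly: "Hello list members"
```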
19:28:01 <clarkb> I don't think we need to decide on any of this right now in the meeting. But I wanted to throw the considerations out there and ask y'all to take a look. Feel free to leave your thoughts on the change and I'll do my best to followup there
19:28:17 <clarkb> with that out of the way fungi did you have anything to add on the testing side ?
19:28:21 <fungi> it seems like a reasonable path forward, and opens us up to adding other fixes
19:28:56 <fungi> i expect we'll want to hold another node with the build using the forked containers, and do another test import
19:29:14 <clarkb> ++ and probably do that after we update the prod fields that are too long for the new db?
19:29:18 <fungi> i also wanted to double-check that we're redirecting some common patterns like list description pages
19:29:51 <fungi> and i was going to fix those three lists with message templates that were too large for the columns in the db and do at least one more import test
19:30:02 <fungi> yes
19:30:24 <fungi> but otherwise we're probably close to scheduling maintenance for some initial site cut-overs
19:31:00 <clarkb> sounds good. Maybe see if we can get feedback on the image fork idea and then hold based on that
19:31:11 <fungi> right
19:31:15 <clarkb> since we may need to make changes to the images
19:31:25 <fungi> and maybe we'll hear back from the upstream image maintainer
19:31:51 <fungi> but at least we have options if not
19:32:44 <clarkb> Anything else?
19:32:52 <fungi> not on my end
19:33:33 <clarkb> #topic Gitea Connectivity Issues
19:33:49 <clarkb> At the end of last week we had several reports from users in europe that had problems with git clones to opendev.org
19:34:14 <clarkb> We were unable to reproduce this from north american isp connections and from our ovh region in france
19:34:29 <clarkb> Ultimately I think we decided it was something between the two endpoints and not something we could fix ourselves.
19:34:31 <clarkb> However
19:34:47 <clarkb> it did expose that our gitea logging no longer correlated connections from haproxy -> apache -> gitea
19:35:09 <clarkb> haproxy -> apache was working fine. The problem was apache -> gitea and that appears to be related to gitea switching http libraries from macaron to go-chi
19:35:27 <clarkb> basically go-chi doesn't handle x-forwarded-for properly to preserve port info and instead the port becomes :0
19:36:14 <clarkb> We made some changes to stop forwarding x-forwarded-for which forces everything to record the actual ports in use. This mostly works but apache -> gitea does reuse connections for multiple requests which means that it isn't a fully 1:1 mapping now but it is better than what we had on friday
19:36:38 <clarkb> I think we can also force apache to use a new connection for each request but that is probably overkill?
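[A toy illustration of the correlation this restores, with made-up log lines; the real haproxy/apache/gitea log formats differ, but the idea is matching entries on the client ip:port pair, which previously failed because gitea logged the port as :0:]

```python
# Correlate two hops' log lines by the client ip:port they record.
import re

apache_line = '10.0.0.5:54321 proxied "GET /openstack/nova HTTP/1.1" to gitea'
gitea_line = 'router: completed GET /openstack/nova for 10.0.0.5:54321'

def client_endpoint(line):
    m = re.search(r"\d+\.\d+\.\d+\.\d+:\d+", line)
    return m.group(0) if m else None

# Before the change gitea logged the client as "...:0", so this match failed.
assert client_endpoint(apache_line) == client_endpoint(gitea_line)
```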
19:36:53 <clarkb> I wanted to bring this up in case anyone had better ideas or concerns with these changes since we tried to get them in quickly last week while debugging
19:37:32 <fungi> the request pipelining is probably more efficient, yeah, i don't think i'd turn it off just to make logs easier to correlate
19:39:09 <clarkb> Sounds like no one has any immediate concerns.
19:39:16 <clarkb> #topic Open Discussion
19:39:54 <clarkb> Zuul will make its 7.0.0 release soon. The next step in the zuul release planning process is to switch opendev to ansible 6 by default to ensure that is working happily. I had asked that we do that after the openstack release. But once openstack releases I think we can make that change
19:40:10 <clarkb> I had a test devstack change up to check ansible 6 on the devstack jobs and that seemed to work happily
19:40:26 <clarkb> https://review.opendev.org/c/openstack/devstack/+/858436
19:40:44 <clarkb> Now is a good time to test things with ansible 6 if you have any concerns
19:41:09 <fungi> #link https://review.opendev.org/859977 Add post-review pipeline
19:41:22 <fungi> that's where most of the discussion i was talking about earlier took place
19:42:03 <ianw> thanks -- slightly related to ansible updates, i think ansible-lint has fixed some issues that were holding us back from upgrading in zuul-jobs, i'll take a look
19:42:24 <fungi> the openstacksdk maintainers want to take advantage of zuul's post-review pipeline flag to run some specific jobs which use secrets but limit them to changes which the core reviewers have okayed
19:42:48 <clarkb> fungi: and looks like they don't want to use gate for that because they don't want the changes to merge at that point necessarily
19:43:16 <fungi> right, the reviewers want build results after checking that it's safe to run those jobs but before approving them
19:43:56 <clarkb> it might be worth considering if "Allow-Post-Review" conveys the intent here clearly as this might be a pipeline that is adopted more widely
19:44:00 <fungi> we'd discussed this as a possibility (precisely for the case they bring up, testing with public cloud credentials), so i tried to rehash some of our earlier conversations about that
19:44:09 <clarkb> (typically I'd avoid bikeshedding stuff like that but once it is in gerrit acls it is hard to change)
19:44:29 <fungi> yeah, allow-post-review was merely my best suggestion. what they had before that was even less clear
19:44:48 <corvus> (this use case was an explicit design requirement for zuul, so something like this was anticipated and planned for)
19:45:02 <fungi> something to convey "voting +1 here means it's safe to run post-review pipeline jobs" but small enough to be a gerrit label name
19:45:13 <corvus> in design, i think we called it a "restricted check" pipeline or something like that.
19:45:33 <fungi> that's not terrible
19:46:04 <clarkb> no objections from me to move forward on this. As mentioned this was always something we anticipated might become a useful utility
19:46:31 <fungi> yeah, the previous name they had for it was the "post-check" pipeline (and a corresponding gerrit label of the same name)
19:47:01 <fungi> but i agree bikeshedding on terms at least a little is probably good just because of the cargo cult potential
19:47:16 <corvus> the "post-check" phrasing is slightly confusing to me.
19:47:39 <fungi> yeah, since we already have pipelines in that tenant called post and check
19:47:45 <clarkb> I think my initial concern with "allow-post-review" is it doesn't convey what is being allowed. Just that something is
19:48:01 <fungi> short for allow-post-review-jobs-to-run
19:48:07 <corvus> for the label name, maybe something that conveys "safety" or some level of having been "reviewed"...
19:49:08 <fungi> yes, something along those lines would be good
19:49:21 <fungi> my wordsmithing was simply not getting me all that far
19:49:29 <fungi> everything i came up with was too lengthy
19:49:32 <corvus> yeah, i'm not much help either
19:49:39 <clarkb> ya its a tough one
19:49:48 <clarkb> trigger-zuul-secrets
19:49:59 <fungi> word-soup
19:50:03 <clarkb> indeed
19:50:47 <fungi> anyway, since it's a use case we'd discussed at length, but it's been a while, i just wanted to call those changes to others' attention so they don't go unnoticed
19:51:08 <clarkb> ++ thanks
19:51:26 <fungi> especially since it's also in service of something we've had a bee in our collective bonnet over (loss of old public cloud support in openstacksdk)
19:51:47 <corvus> ++
19:52:21 <clarkb> I'll give it a couple more minutes for anything else, but then we can probably end about 5 minutes early today
19:54:50 <clarkb> sounds like that is it. Thank you everyone
19:54:53 <clarkb> We'll be back next week
19:54:56 <clarkb> same location and time
19:55:03 <clarkb> #endmeeting