19:01:15 #startmeeting infra
19:01:15 Meeting started Tue Oct 4 19:01:15 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:15 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:15 The meeting name has been set to 'infra'
19:01:19 #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000363.html Our Agenda
19:01:23 #topic Announcements
19:01:32 The OpenStack release is happening this week (tomorrow in fact)
19:01:57 fungi: I think you indicated you would try to be around early tomorrow to keep an eye on things. I'll do my best too
19:02:08 But I don't expect any issues
19:02:26 yeah, though i have appointments starting around 14:00 utc
19:02:39 so will be less available at that point
19:02:54 extra eyes are appreciated
19:03:07 I can probably be around at that point and take over
19:03:15 The other thing to note is that the PTG is in 2 weeks
19:04:11 #topic Bastion Host Changes
19:04:15 let's dive right into the agenda
19:04:29 ianw has made progress on a stack of changes to shift bridge to running ansible out of a venv
19:04:34 #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:04:44 The changes lgtm but please do review them carefully since this is the bastion host
19:05:02 yep i need to loop back on your comments thank you, but it's close
19:05:12 ianw: one thing I noted on one of the changes is that launch node may need different venvs for different clouds in order to have different versions of openstacksdk
19:05:31 It is possible that a good followup to this will be managing launch node venvs for that purpose
19:06:09 And then separately your change to update zuul to disable console log file generation landed in zuul and I think the most recent restart of the cluster picked it up
19:06:18 That means we can configure our jobs to not write those files
19:06:23 #link https://review.opendev.org/c/opendev/system-config/+/855472
19:06:27 yeah; is that mostly cinder/rax? i feel like that's been a pita before, and i saw in scrollback annoyances adding disk to nodepool builders
19:06:41 (openstacksdk venvs)
19:06:46 ianw: it's now rax and networking (not sure if nova or neutron is the problem there)
19:06:48 but ya
19:07:00 ianw: re console log writing I have a note there that a second location also needs the update.
19:07:38 though on a related note, the openstacksdk maintainers want to add a new pipeline in the openstack tenant of our zuul to ease testing of public clouds
19:08:14 (a "post-review" pipeline and associated required label in gerrit to enable/trigger it)
19:08:17 and I've proposed a topic to the openstack tc ptg to discuss not forgetting the sdk is a tool for end users in addition to an internal api tool for openstack clusters
19:08:32 ++
19:08:54 i think i'm the only reviewer to have provided them feedback on those changes so far
19:08:59 we don't want to have to start another project to smooth out differences in openstacksdk versions ... maybe call it "shade" :)
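For illustration, the per-cloud launcher venv idea discussed above could be expressed as an Ansible task using the standard ansible.builtin.pip module. The paths, cloud names, and version pins below are placeholders of my own, not anything that was actually merged:

```yaml
# Hypothetical sketch of per-cloud launch-node venvs holding different
# openstacksdk versions; venv locations and pins are illustrative only.
- name: Create per-cloud launcher venvs
  ansible.builtin.pip:
    name: "openstacksdk{{ item.pin }}"
    virtualenv: "/usr/launcher-venvs/{{ item.cloud }}"
    virtualenv_command: "python3 -m venv"
  loop:
    - { cloud: "rax", pin: "<0.99" }      # example pin for a cloud needing an older sdk
    - { cloud: "default", pin: "" }       # unpinned latest sdk for everything else
```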
maybe call it "shade" :) 19:09:35 fungi: I thought I left a comment too 19:09:41 ahh, cool 19:09:48 i probably missed the update 19:09:49 indicating that there isn't a reason to put it in project-config I don't think 19:10:08 since project-config doesn't protect the secrets in the way they think it does 19:10:17 oh, that part, yeah 19:10:42 the pipeline creation still needs to happen in project-config though, as does the acl addition and support for the new review label in our linter 19:11:22 I guess I'm not up to date on why any of that is necessary. I'll have to take another look 19:11:36 i can bring up more details when we get to open discussion 19:11:50 but ya infra-root please look over the ansible in venv changes and the console log file disabling change(s). And ianw don't forget the second change needed for that 19:11:58 Anything else to bring up on this topic? 19:12:05 yep i'll loop back on that 19:12:12 one minor change this relevaed in zuul was 19:12:16 #link https://review.opendev.org/c/zuul/zuul/+/860062 19:12:30 after i messed up the node allocations. that improves an edge-case error message 19:12:57 i think probably the last thing i can do is switch the testing to "bridge.opendev.org" 19:13:16 all the changes should have abstracted things such that it should work 19:13:56 at that point, i think we're ready (modulo launching focal nodes) to do the switch. it will still be quite a manual process getting secrets etc, but i'm happy to do that 19:13:59 ya and using the symlink into $PATH should make it fairly transparent to all the infra-prod job runs 19:15:13 #topic Updating Bionic Servers / Launching Jammy Servers 19:15:18 Thats a good lead into the next topic 19:15:49 corvus did try to launch the new tracing server on a jammy host but that failed because our base user role couldn't delete the ubuntu user as a process was running and owned by it 19:16:16 I believe what happened there is launch node logged in as the ubuntu user and used it to set up root. Then it logged back in as root and tried to delete the ubuntu user but something was left behind from the original login 19:16:22 #link https://review.opendev.org/c/opendev/system-config/+/860112 Update to launch node to handle jammy hosts 19:16:52 That is one attempt at addressing this. Basically we use userdel --force whcih won't care if a process is running. Then the end of launch node processing involves a reboot which should clear out any stale processes 19:17:19 The downside to this is that --force has some behaviors we may not want generally which is why I've limited the --force deletion to users associated with the base distro cloud images and not with our regular users 19:17:29 this way failures to remove regular users will bubble up and we can debug them more closely 19:18:09 If we don't like that I think another approach would be to have launch login as ubuntu, set up root, then reboot the host and log back in after a reboot 19:18:18 what kind of undesirable behaviors? 19:18:19 the reboot should clear out any stale processes and allow userdel to run as before 19:18:48 corvus: "This option forces the removal of the user account, even if the user is still logged in. It also forces userdel to remove the user's home directory and mail spool, even if another user uses the same home directory or if the mail spool is not owned by the specified user." 
19:20:51 anyway I expect that with that change landed we can retry a jammy node launch and see if we make more progress there
19:21:10 but also let me know if we want to try a different approach like the early reboot during launch idea
19:21:22 Did anyone else have server upgrade related items for this topic?
19:22:04 all sounds good thanks! hopefully we have some new nodes up soon :) if not bridge, the arm64 bits too
19:23:03 #topic Mailman 3
19:23:12 We continue to make progress. Though things have probably slowed a bit
19:23:24 In particular my efforts to work upstream to improve the images seem to have stalled.
19:23:58 There haven't been any responses to the github issues and PRs so I sent email to the mailman3 users list and the response I got there was that maxking is basically the only person who devs on those and we need to wait for maxking
19:24:13 #link https://review.opendev.org/c/opendev/system-config/+/860157 Forking upstream mm3 images
19:24:23 because of that I finally gave in and pushed ^ to fork the images.
19:25:21 I think this leads us to two major questions: 1) Do we want to fork or just use the images with their existing issues? and 2) If we do want to fork how forked do we want to get? If we do a minimal fork we can more easily resync with upstream if they become active again. But then we need to continue to carry workarounds in our mm3 role and stick to their uid and gid selections.
19:25:56 It is worth noting that I did look at maybe just building our own images based on our python base image stuff. The problem with that is it appears there is a lot of inside knowledge over what versions of things need to be combined together to make a working system
19:26:16 https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/H7YK27E4GKG3KNAUPWTV32XWRWPFEU25/ upstream even acknowledges the confusion
19:26:41 For that reason I think we're best off forking or working upstream if we can manage it and then hope upstream curates those lists of specific versions for us
19:27:21 The existing change does a "lightweight" fork fwiw. The only change I made to the images was to install lynx which is necessary for html to text conversion
19:28:01 I don't think we need to decide on any of this right now in the meeting. But I wanted to throw the considerations out there and ask y'all to take a look. Feel free to leave your thoughts on the change and I'll do my best to follow up there
19:28:17 with that out of the way fungi did you have anything to add on the testing side?
19:28:21 it seems like a reasonable path forward, and opens us up to adding other fixes
19:28:56 i expect we'll want to hold another node with the build using the forked containers, and do another test import
19:29:14 ++ and probably do that after we update the prod fields that are too long for the new db?
19:29:18 i also wanted to double-check that we're redirecting some common patterns like list description pages
19:29:51 and i was going to fix those three lists with message templates that were too large for the columns in the db and do at least one more import test
19:30:02 yes
19:30:24 but otherwise we're probably close to scheduling maintenance for some initial site cut-overs
19:31:00 sounds good. Maybe see if we can get feedback on the image fork idea and then hold based on that
19:31:11 right
19:31:15 since we may need to make changes to the images
19:31:25 and maybe we'll hear back from the upstream image maintainer
19:31:51 but at least we have options if not
19:32:44 Anything else?
19:32:52 not on my end
19:33:33 #topic Gitea Connectivity Issues
19:33:49 At the end of last week we had several reports from users in europe that had problems with git clones to opendev.org
19:34:14 We were unable to reproduce this from north american isp connections and from our ovh region in france
19:34:29 Ultimately I think we decided it was something between the two endpoints and not something we could fix ourselves.
19:34:31 However
19:34:47 it did expose that our gitea logging no longer correlated connections from haproxy -> apache -> gitea
19:35:09 haproxy -> apache was working fine. The problem was apache -> gitea and that appears to be related to gitea switching http libraries from macaron to go-chi
19:35:27 basically go-chi doesn't handle x-forwarded-for properly to preserve port info and instead the port becomes :0
19:36:14 We made some changes to stop forwarding x-forwarded-for which forces everything to record the actual ports in use. This mostly works but apache -> gitea does reuse connections for multiple requests which means that it isn't a fully 1:1 mapping now but it is better than what we had on friday
19:36:38 I think we can also force apache to use a new connection for each request but that is probably overkill?
19:36:53 I wanted to bring this up in case anyone had better ideas or concerns with these changes since we tried to get them in quickly last week while debugging
19:37:32 the request pipelining is probably more efficient, yeah, i don't think i'd turn it off just to make logs easier to correlate
19:39:09 Sounds like no one has any immediate concerns.
19:39:16 #topic Open Discussion
19:39:54 Zuul will make its 7.0.0 release soon. The next step in the zuul release planning process is to switch opendev to ansible 6 by default to ensure that is working happily. I had asked that we do that after the openstack release. But once openstack releases I think we can make that change
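Purely for illustration, switching a tenant's default Ansible version is a single Zuul tenant config knob, default-ansible-version. The tenant name and value below are a sketch, not the actual opendev tenant config change:

```yaml
# Sketch of the relevant tenant config setting; the real change would be
# made in the tenant configuration maintained in project-config.
- tenant:
    name: openstack
    default-ansible-version: "6"
```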
19:40:10 I had a test devstack change up to check ansible 6 on the devstack jobs and that seemed to work happily
19:40:26 https://review.opendev.org/c/openstack/devstack/+/858436
19:40:44 Now is a good time to test things with ansible 6 if you have any concerns
19:41:09 #link https://review.opendev.org/859977 Add post-review pipeline
19:41:22 that's where most of the discussion i was talking about earlier took place
19:42:03 thanks -- somewhat related to ansible updates, i think ansible-lint has fixed some issues that were holding us back from upgrading it in zuul-jobs, i'll take a look
19:42:24 the openstacksdk maintainers want to take advantage of zuul's post-review pipeline flag to run some specific jobs which use secrets but limit them to changes which the core reviewers have okayed
19:42:48 fungi: and looks like they don't want to use gate for that because they don't want the changes to merge at that point necessarily
19:43:16 right, the reviewers want build results after checking that it's safe to run those jobs but before approving them
19:43:56 it might be worth considering whether "Allow-Post-Review" conveys the intent here clearly, as this might be a pipeline that is adopted more widely
19:44:00 we'd discussed this as a possibility (precisely for the case they bring up, testing with public cloud credentials), so i tried to rehash some of our earlier conversations about that
19:44:09 (typically I'd avoid bikeshedding stuff like that but once it is in gerrit acls it is hard to change)
19:44:29 yeah, allow-post-review was merely my best suggestion. what they had before that was even less clear
19:44:48 (this use case was an explicit design requirement for zuul, so something like this was anticipated and planned for)
19:45:02 something to convey "voting +1 here means it's safe to run post-review pipeline jobs" but small enough to be a gerrit label name
19:45:13 in design, i think we called it a "restricted check" pipeline or something like that.
19:45:33 that's not terrible
19:46:04 no objections from me to move forward on this. As mentioned this was always something we anticipated might become a useful utility
19:46:31 yeah, the previous name they had for it was the "post-check" pipeline (and a corresponding gerrit label of the same name)
19:47:01 but i agree bikeshedding on terms at least a little is probably good just because of the cargo cult potential
19:47:16 the "post-check" phrasing is slightly confusing to me.
19:47:39 yeah, since we already have pipelines in that tenant called post and check
19:47:45 I think my initial concern with "allow-post-review" is it doesn't convey what is being allowed. Just that something is
19:48:01 short for allow-post-review-jobs-to-run
19:48:07 for the label name, maybe something that conveys "safety" or some level of having been "reviewed"...
19:49:08 yes, something along those lines would be good
19:49:21 my wordsmithing was simply not getting me all that far
19:49:29 everything i came up with was too lengthy
19:49:32 yeah, i'm not much help either
19:49:39 ya it's a tough one
19:49:48 trigger-zuul-secrets
19:49:59 word-soup
19:50:03 indeed
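For illustration, a label-triggered pipeline with Zuul's post-review flag can look roughly like the sketch below. The pipeline and label names are placeholders since the naming was still being debated above; see https://review.opendev.org/859977 for the actual proposal:

```yaml
# Sketch only: names here are placeholders, not the outcome of the naming
# discussion. "post-review: true" is the Zuul pipeline flag that permits
# jobs using secrets to run, and the gerrit label vote gates when that happens.
- pipeline:
    name: post-review
    description: Runs jobs that need secrets once a reviewer has okayed it.
    manager: independent
    post-review: true
    trigger:
      gerrit:
        - event: comment-added
          approval:
            - Allow-Post-Review: 1
```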
19:50:47 anyway, since it's a use case we'd discussed at length, but it's been a while, i just wanted to call those changes to others' attention so they don't go unnoticed
19:51:08 ++ thanks
19:51:26 especially since it's also in service of something we've had a bee in our collective bonnet over (loss of old public cloud support in openstacksdk)
19:51:47 ++
19:52:21 I'll give it a couple more minutes for anything else, but then we can probably end about 5 minutes early today
19:54:50 sounds like that is it. Thank you everyone
19:54:53 We'll be back next week
19:54:56 same location and time
19:55:03 #endmeeting