Tuesday, 2022-10-04

clarkbHello, it is meeting time18:59
clarkbwe'll get started in a couple of minutes18:59
fungiahoy!18:59
ianwo/19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Oct  4 19:01:15 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-October/000363.html Our Agenda19:01
clarkb#topic Announcements19:01
clarkbThe OpenStack release is happening this week (tomorrow in fact)19:01
clarkbfungi: I think you indicated you would try to be around early tomorrow to keep an eye on things. I'll do my best too19:01
clarkbBut I don't expect any issues19:02
fungiyeah, though i have appointments starting around 14:00 utc19:02
fungiso will be less available at that point19:02
fungiextra eyes are appreciated19:02
clarkbI can probably be around at that point and take over19:03
clarkbThe other thing to note is that the PTG is in 2 weeks19:03
clarkb#topic Bastion Host Changes19:04
clarkblets dive right into the agenda19:04
clarkbianw has made progress on a stack of changes to shift bridge to running ansible out of a venv19:04
clarkb#link https://review.opendev.org/q/topic:bridge-ansible-venv19:04
clarkbThe changes lgtm but please do review them carefully since this is the bastion host19:04
ianwyep i need to loop back on your comments, thank you, but it's close19:05
clarkbianw: one thing I noted on one of the changes is that launch node may need different venvs for different clouds in order to have different versions of openstacksdk19:05
clarkbIt is possible that a good followup to this will be managing launch node venvs for that purpose19:05
clarkbAnd then separately your change to update zuul to disable console log file generation landed in zuul and I think the most recent restart of the cluster picked it up19:06
clarkbThat means we can configure our jobs to not write those files19:06
clarkb#link https://review.opendev.org/c/opendev/system-config/+/85547219:06
ianwyeah; is that mostly cinder/rax?  i feel like that's been a pita before, and i saw in scrollback annoyances adding disk to nodepool builders19:06
ianw(openstacksdk venvs)19:06
clarkbianw: it's now rax and networking (not sure if nova or neutron is the problem there)19:06
clarkbbut ya19:06
clarkbianw: re console log writing I have a note there that a second location also needs the update.19:07
fungithough on a related note, the openstacksdk maintainers want to add a new pipeline in the openstack tenant of our zuul to ease testing of public clouds19:07
fungi(a "post-review" pipeline and associated required label in gerrit to enable/trigger it)19:08
clarkband I've proposed a topic to the openstack tc ptg to discuss not forgetting the sdk is a tool for end users in addition to an internal api tool for openstack clusters19:08
ianw++19:08
fungii think i'm the only reviewer to have provided them feedback on those changes so far19:08
ianwwe don't want to have to start another project to smooth out differences in openstacksdk versions ... maybe call it "shade" :)19:08
clarkbfungi: I thought I left a comment too19:09
fungiahh, cool19:09
fungii probably missed the update19:09
clarkbindicating that I don't think there's a reason to put it in project-config19:09
clarkbsince project-config doesn't protect the secrets in the way they think it does19:10
fungioh, that part, yeah19:10
fungithe pipeline creation still needs to happen in project-config though, as does the acl addition and support for the new review label in our linter19:10
clarkbI guess I'm not up to date on why any of that is necessary. I'll have to take another look19:11
fungii can bring up more details when we get to open discussion19:11
clarkbbut ya infra-root please look over the ansible in venv changes and the console log file disabling change(s). And ianw don't forget the second change needed for that19:11
clarkbAnything else to bring up on this topic?19:11
ianwyep i'll loop back on that19:12
ianwone minor change this revealed in zuul was19:12
ianw#link https://review.opendev.org/c/zuul/zuul/+/86006219:12
ianwafter i messed up the node allocations.  that improves an edge-case error message19:12
ianwi think probably the last thing i can do is switch the testing to "bridge.opendev.org"19:12
ianwall the changes should have abstracted things such that it should work19:13
ianwat that point, i think we're ready (modulo launching focal nodes) to do the switch.  it will still be quite a manual process getting secrets etc, but i'm happy to do that 19:13
clarkbya and using the symlink into $PATH should make it fairly transparent to all the infra-prod job runs19:13
clarkb#topic Updating Bionic Servers / Launching Jammy Servers19:15
clarkbThats a good lead into the next topic19:15
clarkbcorvus did try to launch the new tracing server on a jammy host but that failed because our base user role couldn't delete the ubuntu user as a process was running and owned by it19:15
clarkbI believe what happened there is launch node logged in as the ubuntu user and used it to set up root. Then it logged back in as root and tried to delete the ubuntu user but something was left behind from the original login19:16
clarkb#link https://review.opendev.org/c/opendev/system-config/+/860112 Update to launch node to handle jammy hosts19:16
clarkbThat is one attempt at addressing this. Basically we use userdel --force which won't care if a process is running. Then the end of launch node processing involves a reboot which should clear out any stale processes19:16
clarkbThe downside to this is that --force has some behaviors we may not want generally which is why I've limited the --force deletion to users associated with the base distro cloud images and not with our regular users19:17
clarkbthis way failures to remove regular users will bubble up and we can debug them more closely19:17
clarkbIf we don't like that I think another approach would be to have launch login as ubuntu, set up root, then reboot the host and log back in after a reboot19:18
corvuswhat kind of undesirable behaviors?19:18
clarkbthe reboot should clear out any stale processes and allow userdel to run as before19:18
clarkbcorvus: "This option forces the removal of the user account, even if the user is still logged in. It also forces userdel to remove the user's home directory and mail spool, even if another user uses the same home directory or if the mail spool is not owned by the specified user."19:18
clarkbcorvus: in particular I think we want it to error if a normal user outside of the launch context is logged in or otherwise has processes running19:19
clarkbas that is something we should address. In the launch context the ubuntu user isn't something we care about and we'll reboot in a few minutes anyway19:19
corvusyep agree.  seems like --force is okay (even exactly what we want) for this case, and basically almost never otherwise.19:19
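[editor's note: the conditional --force logic discussed above might look roughly like the following. This is a hypothetical sketch, not the actual launch-node code under review; the DISTRO_USERS list and function name are assumptions for illustration.]

```shell
#!/bin/sh
# Hypothetical sketch of the cleanup logic discussed above: accounts that
# ship with distro cloud images get "userdel --force" (acceptable here
# because launch-node reboots the host at the end anyway), while failures
# to remove regular users should bubble up so they can be debugged.
DISTRO_USERS="ubuntu centos debian cloud-user"

userdel_cmd() {
    user="$1"
    for d in $DISTRO_USERS; do
        if [ "$user" = "$d" ]; then
            # --force ignores processes still running as this user
            echo "userdel --force --remove $user"
            return
        fi
    done
    # regular users: let userdel fail loudly if anything is still running
    echo "userdel --remove $user"
}

userdel_cmd ubuntu   # -> userdel --force --remove ubuntu
userdel_cmd alice    # -> userdel --remove alice
```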
clarkbanyway I expect that with that change landed we can retry a jammy node launch and see if we make more progress there19:20
clarkbbut also let me know if we want to try a different approach like the early reboot during launch idea19:21
clarkbDid anyone else have server upgrade related items for this topic?19:21
ianwall sounds good thanks!  hopefully we have some new nodes up soon :)  if not bridge, the arm64 bits too19:22
clarkb#topic Mailman 319:23
clarkbWe continue to make progress. Though things have probably slowed a bit19:23
clarkbIn particular my efforts to work upstream to improve the images seems to have stalled.19:23
clarkbThere haven't been any responses to the github issues and PRs so I sent email to the mailman3 users list and the response I got there was that maxking is basically the only person who devs on those and we need to wait for maxking19:23
clarkb#link https://review.opendev.org/c/opendev/system-config/+/860157 Forking upstream mm3 images19:24
clarkbbecause of that I finally gave in and pushed ^ to fork the images.19:24
clarkbI think this leads us to two major questions: 1) Do we want to fork or just use the images with their existing issues? and 2) If we do want to fork how forked do we want to get? If we do a minimal fork we can more easily resync with upstream if they become active again. But then we need to continue to carry workarounds in our mm3 role and stick to their uid and gid19:25
clarkbselections.19:25
clarkbIt is worth noting that I did look at maybe just building our own images based on our python base image stuff. The problem with that is it appears there is a lot of inside knowledge over what versions of things need to be combined together to make a working system19:25
clarkbhttps://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/H7YK27E4GKG3KNAUPWTV32XWRWPFEU25/ upstream even acknowledges the confusion19:26
clarkbFor that reason I think we're best off forking, or working upstream if we can manage it, and then hoping upstream curates those lists of specific versions for us19:26
clarkbThe existing change does a "lightweight" fork fwiw. The only change I made to the images was to install lynx which is necessary for html to text conversion19:27
clarkbI don't think we need to decide on any of this right now in the meeting. But I wanted to throw the considerations out there and ask y'all to take a look. Feel free to leave your thoughts on the change and I'll do my best to followup there19:28
clarkbwith that out of the way fungi did you have anything to add on the testing side ?19:28
fungiit seems like a reasonable path forward, and opens us up to adding other fixes19:28
fungii expect we'll want to hold another node with the build using the forked containers, and do another test import19:28
clarkb++ and probably do that after we update the prod fields that are too long for the new db?19:29
fungii also wanted to double-check that we're redirecting some common patterns like list description pages19:29
fungiand i was going to fix those three lists with message templates that were too large for the columns in the db and do at least one more import test19:29
fungiyes19:30
fungibut otherwise we're probably close to scheduling maintenance for some initial site cut-overs19:30
clarkbsounds good. Maybe see if we can get feedback on the image fork idea and then hold based on that19:31
fungiright19:31
clarkbsince we may need to make changes to the images19:31
fungiand maybe we'll hear back from the upstream image maintainer19:31
fungibut at least we have options if not19:31
clarkbAnything else?19:32
funginot on my end19:32
clarkb#topic Gitea Connectivity Issues19:33
clarkbAt the end of last week we had several reports from users in europe that had problems with git clones to opendev.org19:33
clarkbWe were unable to reproduce this from north american isp connections and from our ovh region in france19:34
clarkbUltimately I think we decided it was something between the two endpoints and not something we could fix ourselves.19:34
clarkbHowever19:34
clarkbit did expose that our gitea logging no longer correlated connections from haproxy -> apache -> gitea19:34
clarkbhaproxy -> apache was working fine. The problem was apache -> gitea and that appears to be related to gitea switching http libraries from macaron to go-chi19:35
clarkbbasically go-chi doesn't handle x-forwarded-for properly to preserve port info and instead the port becomes :019:35
clarkbWe made some changes to stop forwarding x-forwarded-for which forces everything to record the actual ports in use. This mostly works but apache -> gitea does reuse connections for multiple requests which means that it isn't a fully 1:1 mapping now but it is better than what we had on friday19:36
clarkbI think we can also force apache to use a new connection for each request but that is probably overkill?19:36
clarkbI wanted to bring this up in case anyone had better ideas or concerns with these changes since we tried to get them in quickly last week while debugging19:36
fungithe request pipelining is probably more efficient, yeah, i don't think i'd turn it off just to make logs easier to correlate19:37
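[editor's note: for reference, if per-request backend connections were ever deemed worth the efficiency cost discussed above, mod_proxy supports disabling reuse directly. A hypothetical fragment; the backend address shown is an assumption, not the real gitea proxy config.]

```apache
# Hypothetical: disable backend connection reuse so each request opens its
# own apache -> gitea connection, giving a 1:1 log mapping at the cost of
# the pipelining efficiency fungi mentions above.
ProxyPass / http://127.0.0.1:3000/ disablereuse=On
```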
clarkbSounds like no one has any immediate concerns.19:39
clarkb#topic Open Discussion19:39
clarkbZuul will make its 7.0.0 release soon. The next step in the zuul release planning process is to switch opendev to ansible 6 by default to ensure that is working happily. I had asked that we do that after the openstack release. But once openstack releases I think we can make that change19:39
clarkbI had a test devstack change up to check ansible 6 on the devstack jobs and that seemed to work happily19:40
clarkbhttps://review.opendev.org/c/openstack/devstack/+/85843619:40
clarkbNow is a good time to test things with ansible 6 if you have any concerns19:40
fungi#link https://review.opendev.org/859977 Add post-review pipeline19:41
fungithat's where most of the discussion i was talking about earlier took place19:41
ianwthanks -- slightly related to ansible updates, i think ansible-lint has fixed some issues that were holding us back from upgrading in zuul-jobs, i'll take a look19:42
fungithe openstacksdk maintainers want to take advantage of zuul's post-review pipeline flag to run some specific jobs which use secrets but limit them to changes which the core reviewers have okayed19:42
clarkbfungi: and looks like they don't want to use gate for that because they don't want the changes to merge at that point necessarily19:42
fungiright, the reviewers want build results after checking that it's safe to run those jobs but before approving them19:43
clarkbit might be worth considering if "Allow-Post-Review" conveys the intent here clearly as this might be a pipeline that is adopted more widely19:43
fungiwe'd discussed this as a possibility (precisely for the case they bring up, testing with public cloud credentials), so i tried to rehash some of our earlier conversations about that19:44
clarkb(typically I'd avoid bikeshedding stuff like that but once it is in gerrit acls it is hard to change)19:44
fungiyeah, allow-post-review was merely my best suggestion. what they had before that was even less clear19:44
corvus(this use case was an explicit design requirement for zuul, so something like this was anticipated and planned for)19:44
fungisomething to convey "voting +1 here means it's safe to run post-review pipeline jobs" but small enough to be a gerrit label name19:45
corvusin design, i think we called it a "restricted check" pipeline or something like that.19:45
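[editor's note: a pipeline of the kind being discussed might be sketched as below. This is a hypothetical illustration only — the pipeline name, label name, and trigger details are placeholders, not the actual proposal under review in 859977.]

```yaml
# Hypothetical sketch: "post-review: true" tells Zuul that jobs using
# secrets may run in this pipeline, so triggering is gated on a Gerrit
# label that core reviewers apply after checking the change is safe.
- pipeline:
    name: post-review
    description: Runs jobs needing secrets once a reviewer okays them.
    manager: independent
    post-review: true
    trigger:
      gerrit:
        - event: comment-added
          approval:
            - Allow-Post-Review: 1
```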
fungithat's not terrible19:45
clarkbno objections from me to move forward on this. As mentioned this was always something we anticipated might become a useful utility19:46
fungiyeah, the previous name they had for it was the "post-check" pipeline (and a corresponding gerrit label of the same name)19:46
fungibut i agree bikeshedding on terms at least a little is probably good just because of the cargo cult potential19:47
corvusthe "post-check" phrasing is slightly confusing to me.19:47
fungiyeah, since we already have pipelines in that tenant called post and check19:47
clarkbI think my initial concern with "allow-post-review" is it doesn't convey what is being allowed. Just that something is19:47
fungishort for allow-post-review-jobs-to-run19:48
corvusfor the label name, maybe something that conveys "safety" or some level of having been "reviewed"...19:48
fungiyes, something along those lines would be good19:49
fungimy wordsmithing was simply not getting me all that far19:49
fungieverything i came up with was too lengthy19:49
corvusyeah, i'm not much help either19:49
clarkbya its a tough one19:49
clarkbtrigger-zuul-secrets19:49
fungiword-soup19:49
clarkbindeed19:50
fungianyway, since it's a use case we'd discussed at length, but it's been a while, i just wanted to call those changes to others' attention so they don't go unnoticed19:50
clarkb++ thanks19:51
fungiespecially since it's also in service of something we've had a bee in our collective bonnet over (loss of old public cloud support in openstacksdk)19:51
corvus++19:51
clarkbI'll give it a couple more minutes for anything else, but then we can probably end about 5 minutes early today19:52
clarkbsounds like that is it. Thank you everyone19:54
clarkbWe'll be back next week19:54
clarkbsame location and time19:54
clarkb#endmeeting19:55
opendevmeetMeeting ended Tue Oct  4 19:55:03 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:55
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-04-19.01.html19:55
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-04-19.01.txt19:55
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-04-19.01.log.html19:55
fungithanks clarkb!19:55

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!