19:01:07 <clarkb> #startmeeting infra
19:01:07 <opendevmeet> Meeting started Tue Sep 13 19:01:07 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:07 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:07 <opendevmeet> The meeting name has been set to 'infra'
19:01:14 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000359.html Our Agenda
19:01:42 <clarkb> There is an agenda with quite a number of things on it. They are mostly small things so I may go quickly to be sure we get through it all, then we can swing back around on anything that needs extra discussion
19:01:52 <clarkb> #topic Announcements
19:02:06 <clarkb> Nothing major here. Just a reminder that OpenStack is in the middle of its release process and elections
19:02:36 <clarkb> Don't forget to vote if you're eligible and take care to double check changes you are making to ensure we don't inadvertently break something the release depends on
19:03:29 <clarkb> #topic Topics
19:03:35 <clarkb> #topic Bastion Host Updates
19:04:05 <clarkb> We've taken yet another pivot after realizing we likely just never want to run the console stream daemon in these infra-prod jobs. At least not in its current form
19:04:16 <clarkb> but the command module (and its relatives like shell) write the files out regardless
19:04:30 <clarkb> ianw wrote some changes to make that optional, which I think will be helpful for us
19:04:36 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/855309/ make console stream file writing toggleable
19:04:43 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/855472 Disable file writing for infra-prod
19:04:44 <ianw> yes sorry that needs a revision from your comments
19:04:56 <clarkb> ianw: ya and did you see my note about modifying the base jobs repo in a similar manner to system-config as well?
19:05:39 <ianw> ummm, sorry no, but can do
19:05:59 <clarkb> ping me if I don't rereview those quickly enough after updates. I'd like to see those get in as they appear to be a good improvement for our use case (and probably others in a similar boat)
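For reference, once the zuul change lands the toggle should look roughly like this from the job side. A minimal sketch, assuming the knob is a job/host variable named zuul_console_disabled as in the proposed change (the final name may differ):

```yaml
# Hypothetical infra-prod job config: stop the command/shell modules
# from writing /tmp/console-*.log files on the bastion for every task
# (variable name assumed from the change under review).
- job:
    name: infra-prod-base
    vars:
      zuul_console_disabled: true
```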
19:07:29 <clarkb> #topic Upgrading Bionic Servers
19:07:34 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:07:45 <clarkb> I keep meaning to pick this up but then other things pop up and grab my attention
19:08:06 <clarkb> Help appreciated and let me know if anyone starts working on this and needs changes reviewed or help debugging issues. I'm more than happy to take a look
19:08:13 <clarkb> But no real updates on this yet
19:09:05 <clarkb> #topic Mailman 3
19:09:14 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/851248 Change to deploy mm3 server.
19:09:19 <clarkb> #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes
19:09:37 <clarkb> We (mostly fungi at this point) continue to make progress on getting to the point where this is ready
19:09:54 <clarkb> The database appears happy with the larger connection buffer settings.
19:10:17 <clarkb> fungi: ^ have you checked if mysqldump is happy with that setting too? We should check that (maybe by manually running the mysqldump?)
19:10:52 <fungi> no, i didn't check that, but can make a note in the etherpad to test it with the next hold after a full import
19:10:56 <clarkb> Other todos include retesting now that the change is creating all the lists and not just those that ansible for mm2 knew about, checking the pipermail redirects, and I think adding redirects for non-list-archive urls
19:11:03 <clarkb> fungi: thanks
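The check being discussed is essentially the following. A sketch of manually exercising the dump against a held test node, where the container name, database name, and the 64M figure are all assumptions rather than the actual settings:

```yaml
- name: Manually exercise mysqldump with the enlarged buffer setting
  # credentials omitted; intended for a held test node, not production
  ansible.builtin.shell: >
    docker exec mariadb
    mysqldump --max-allowed-packet=64M --databases mailman
    > /tmp/mailman-dump-test.sql
```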
19:11:16 <clarkb> fungi: we should probably go ahead and add a hold and recheck nowish?
19:11:25 <clarkb> fungi: I can do that after the meeting if that is helpful
19:11:28 <fungi> yeah, i just hadn't gotten to it yet
19:11:42 <clarkb> cool I'll sync up after the meeting to get that moving forward
19:12:02 <clarkb> Thanks for all the help on this. You definitely realize just how many little details go into a big migration like this when you start testing it out
19:12:15 <clarkb> #topic Jaeger Tracing Server
19:12:21 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/855983 adds deployment for jaeger
19:12:48 <corvus> my ball; will update this week.
19:12:53 <clarkb> There is a change now. CI isn't happy with it yet and I think ianw has some feedback
19:13:05 <clarkb> corvus: great, just wanted to make sure others were aware too.
19:13:22 <corvus> seems like ppl generally like it so far. just working through some technical details.
19:15:19 <clarkb> #topic Fedora 36
19:16:22 <clarkb> #link https://review.opendev.org/c/zuul/nodepool/+/853914 Remove fedora 35 testing from nodepool
19:16:53 <clarkb> ianw: everywhere else is using the fedora-latest label and will get automatically updated?
19:17:40 <ianw> devstack still has https://review.opendev.org/c/openstack/devstack/+/854334 but i need to look into that
19:18:02 <clarkb> ah they have their own labels.
19:18:19 <ianw> but other than that, yes -- so with the nodepool change one step closer to dropping f35
19:18:35 <clarkb> ianw: looks like the issue there is they are branching nodeset definitions :/
19:18:58 <clarkb> that's going to create problems for every transition that uses an alias like -latest
19:19:08 <clarkb> might make sense to move that into openstack-zuul-jobs or similar to avoid that problem
19:19:19 <ianw> we always seem to have this discussion about making sure various testing jobs don't end up on stable branches
19:19:53 <clarkb> another option is for them to use anonymous nodesets
19:20:02 <clarkb> but I don't think they should be managing aliased nodesets on branched repos
19:20:08 <clarkb> as this will be a problem every 6 months
19:20:30 <fungi> yeah, a branchless repo like ozj should fit the bill
19:20:53 <ianw> there should be no fedora on anything but master -- but i agree this could have a better home
19:21:07 <clarkb> ianw: ya the problem is they branch yoga and don't clean it up
19:21:18 <clarkb> it's better to just avoid having it on master where it can end up in a stable branch probably
19:21:27 <ianw> i can add a todo to have a look
19:21:30 <clarkb> anyway we can sort that out with the qa team separately
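The fix being floated is just to hold the alias in one branchless place. A sketch of what the definition in openstack-zuul-jobs (or a similar branchless repo) could look like, with the label being whatever currently backs fedora-latest:

```yaml
# Defined once in a branchless repo, so newly cut stable branches
# never freeze a stale copy of the alias.
- nodeset:
    name: fedora-latest
    nodes:
      - name: fedora-latest
        label: fedora-36
```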
19:21:47 <clarkb> is there anything other than reviewing the nodepool change that we can do to help
19:22:39 <ianw> i don't think so, thanks.  unless people want to start debugging devstack, which i don't think they do :)
19:23:22 <clarkb> #topic Jitsi Meet Updates
19:23:31 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/856553 Update to use colibri websockets and scale out JVBs
19:23:45 <clarkb> This is one of those changes where in theory I've done what is expected of the service
19:24:04 <clarkb> But it's kinda hard to confirm that without having a full blown service running with dns setup and being able to talk to it with our browsers
19:24:26 <fungi> but it's also fairly easy to test if we set aside a window to do so
19:24:26 <clarkb> In particular it isn't clear to me if the JVB java keystores need to have some relationship to a CA or to each other or be verifiable in some way
19:24:52 <clarkb> All of the bits I could find in the docs and forum posts about this don't indicate any sort of relationship like that, so I think they may be using this just for encryption and not verification
19:25:00 <clarkb> fungi: ya exactly
19:25:13 <clarkb> I think if people are comfortable with the change we could probably land it and test things during a quiet time (Friday?)
19:25:23 <clarkb> and if it breaks either revert or try to roll forward and fix
19:25:47 <clarkb> I do think there is a window of opportunity here where we should get it done or wait until after the ptg though. Probably the week before the ptg is not the time to land this, but before that is ok?
19:25:53 <fungi> and i think it should be reasonably safe to merge first, make sure things aren't broken, take a jvb server out of emergency, redeploy so it gets updated, stop the jvb container on the main server, test again
19:26:02 <fungi> i like friday
19:26:25 <ianw> ++
19:27:08 <clarkb> sounds good. /me makes a note on the todo list to try and get that done friday
19:27:55 <clarkb> Other than that I think we are in good shape for having the service up for the ptg. The non jvb setup seems to be working
19:28:06 <clarkb> (just a question of whether or not it can scale, but that is what the jvb change is for)
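On the keystore question, the reading above (encryption only, no CA chain) would mean each JVB just needs its own standalone self-signed keystore. A hedged sketch using keytool, where the path, alias, password, and hostname are all illustrative:

```yaml
- name: Generate a standalone self-signed keystore for a JVB
  # no CA relationship assumed: the store would only provide transport
  # encryption for the colibri websocket listener, not verification
  ansible.builtin.command: >
    keytool -genkeypair -alias jvb -keyalg RSA -keysize 2048
    -validity 3650 -keystore /etc/jitsi/jvb/jvb.jks
    -storepass changeit -dname "CN=jvb01.opendev.org"
  args:
    creates: /etc/jitsi/jvb/jvb.jks
```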
19:28:18 <clarkb> #topic Stability of the Zuul Reboot Playbook
19:28:31 <clarkb> If you didn't know this already Clouds are excellent chaos monkey drivers
19:29:01 <clarkb> Over the weekend we hit another issue with the playbook. This time it is a race between asking the container to stop in an exec and the container quitting out from under the docker exec
19:29:16 <clarkb> when the container exits before the exec is complete the docker command return code is 137 and ansible gets angry
19:29:42 <clarkb> I pushed an update to handle this as well as an unexpected behavior with docker-compose ps -q printing exited containers that frickler pointed out (docker ps -q does not do this)
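The handling described amounts to treating 137 as an expected outcome of the stop exec, and checking container state with docker ps rather than docker-compose ps. A sketch, with service names, paths, and retry counts assumed:

```yaml
- name: Ask the executor to stop gracefully
  ansible.builtin.command: docker-compose exec -T executor zuul-executor graceful
  args:
    chdir: /etc/zuul-executor
  register: graceful_stop
  # the container can exit before the exec returns; docker then
  # reports rc 137, which is an expected outcome here, not a failure
  failed_when: graceful_stop.rc not in [0, 137]

- name: Wait for the container to actually be gone
  # docker-compose ps -q also prints exited containers; docker ps -q does not
  ansible.builtin.command: docker ps -q --filter name=zuul-executor
  register: running
  until: running.stdout == ""
  retries: 120
  delay: 30
```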
19:30:00 <clarkb> I started a manual run of that yesterday and we are currently waiting for ze08 to stop
19:30:23 <clarkb> Hoping that completes today which will have what should become zuul 6.4.0 deployed in opendev for a bit before the release is made
19:30:46 <clarkb> Calling this out because I think it is a good idea for us to keep an eye on this playbook for a bit until we're satisfied it is stable
19:30:48 <corvus> the original run was dev10/dev18
19:30:57 <corvus> the new run is upgrading to dev21?
19:31:26 <corvus> was it resumed or is everything going to dev21?
19:31:48 <corvus> (i'm not sure how to read ze01 being at dev18, ze05 at dev21, and ze12 at dev18 again)
19:32:13 <clarkb> corvus: all of the ze's updated to dev18 over the weekend as the crash happened on zm05 which was after the zes
19:32:45 <clarkb> corvus: some time after my manual restart of the playbook yesterday a change or two landed to zuul and our hourly zuul playbook docker-compose pulled that so nodes after that point are upgrading to dev21
19:33:21 <clarkb> once this is done we can go and update ze01-ze04 to dev21 to match
19:33:35 <clarkb> as they should be the only ones out of sync (unless more zuul changes land in the interim)
19:33:36 <corvus> gotcha.  i was hoping to avoid that, but it looks like the changes that merged are inconsequential to the release
19:33:50 <clarkb> yes I looked at them and they didn't look to be major
19:33:56 <corvus> nah, no need, we can run with diverse versions
19:34:10 <corvus> we don't use the elasticsearch reporter :)
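For context, the drift described above comes from the hourly deploy pulling whatever tag is current while the rolling restart is still in flight. Roughly this pattern, with the path assumed:

```yaml
- name: Refresh zuul images (runs hourly, so the tag can move mid-rollout)
  ansible.builtin.command: docker-compose pull
  args:
    chdir: /etc/zuul-executor
```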
19:35:00 <clarkb> So far the updated playbook seems happy. I'll continue to monitor it
19:35:07 <corvus> \o/
19:35:34 <clarkb> #topic Python Container Image Updates
19:35:45 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/856537
19:36:09 <clarkb> This is a great time to update our python container base images as they now include a fixed glibc for the ansible issue and new python minor releases
19:36:48 <clarkb> Once we land that we can remove the zuul glibc workaround and let that change rebuild the zuul images
19:37:48 <clarkb> I wouldn't call this urgent, but it is good hygiene to update these periodically so that changes to our various images can pick up the underlying updates
19:37:49 <ianw> ok, i feel like the zuul workaround is separate though
19:38:08 <clarkb> ianw: once the base image has the fixed glibc, the zuul workaround is no longer required?
19:38:22 <clarkb> This is a necessary precondition of removing the workaround
19:38:35 <ianw> oh right, it builds on these base images.  although it might do an apt-get upgrade as part of building
19:38:41 <ianw> zuul that is?
19:38:49 <clarkb> zuul might, thats true
19:38:55 <clarkb> our base images don't
19:39:02 <ianw> anyway, yeah pulling into base images seems good
19:39:15 <corvus> #link zuul workaround: https://review.opendev.org/849795
19:39:36 <corvus> i'm not aware of an apt-get upgrade
19:40:04 <ianw> right, and https://review.opendev.org/c/zuul/zuul/+/854939 was to revert it
19:41:23 <ianw> i updated that to depends-on the system-config change; so ordering should be right now
19:41:32 <clarkb> cool
19:41:45 <corvus> sounds like a plan
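The ordering ianw set up is Gerrit's cross-repo dependency mechanism: the zuul revert's commit message carries a footer like the one below, so it cannot merge before the system-config base-image bump lands:

```
Depends-On: https://review.opendev.org/c/opendev/system-config/+/856537
```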
19:42:22 <clarkb> #topic Improving Ansible Task Runtime
19:42:40 <clarkb> This is largely meant to be informational to help people be conscious of this as they write new ansible
19:42:53 <clarkb> But I'm also happy if people end up refactoring existing ansible :)
19:43:28 <clarkb> The TL;DR is that even though zuul uses ssh control persistence and ansible pipelining, the cost to run an individual task as simple as copying a few-byte file or execing ls is often measured in seconds
19:43:45 <clarkb> The exact number of seconds seems to vary across our clouds but we've seen it as high as 6 in some :(
19:44:07 <clarkb> This becomes particularly problematic when you are running ansible tasks in a loop with a large number of loop inputs
19:44:37 <clarkb> each input creates a new task that can take 6 seconds to execute. Multiply that by 100 items in a loop and now you just spent 10 minutes doing something that probably should've taken a second or two at most
19:44:56 <clarkb> I've written a few changes at this point to pick off some low hanging fruit that suffers from this
19:45:01 <clarkb> #link https://review.opendev.org/c/zuul/zuul-jobs/+/855402
19:45:05 <clarkb> #link https://review.opendev.org/c/zuul/zuul-jobs/+/857228
19:45:16 <clarkb> in particular improve some shared library roles so that everyone can benefit
19:45:25 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/857232
19:46:07 <clarkb> this change is specific to how we run nested ansible and saves 1-3 minutes or so depending on the test node. As noted in the commit message of this change there is a downside to it (more complicated nested ansible setup) and I've asked for feedback on whether or not we think that cost is worthwhile
19:46:45 <clarkb> I've just WIP'd it to ensure we don't merge it before additional feedback is given
19:47:02 <clarkb> So ya, try to be aware of this as you write ansible, it can make a big impact on how long our jobs take to execute
19:47:23 <clarkb> sometimes it might be appropriate to move actions into a shell script rather than have ansible work through logic and iteration
19:47:32 <clarkb> sometimes we can use synchronize instead of a loop of copies, and so on
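To make the loop cost concrete, here is the shape of the refactor being suggested. A sketch, with the file names and variable invented:

```yaml
# Slow: one task (one round trip of up to several seconds) per loop
# item, so 100 files can burn many minutes
- name: Copy config files one at a time
  ansible.builtin.copy:
    src: "{{ item }}"
    dest: /etc/example/
  loop: "{{ config_files }}"

# Fast: one rsync-based task moves the whole tree in a single operation
- name: Copy the whole directory at once
  ansible.posix.synchronize:
    src: files/example/
    dest: /etc/example/
```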
19:48:55 <clarkb> And be on the look out for any particularly problematic bits that we might be able to improve. The multi node known hosts stuff could be quicker after my improvement above, for example, and maybe our infra log encryption could be sped up too
19:49:01 <clarkb> #topic Open Discussion
19:49:18 <clarkb> We got through the agenda. Anything else or anything we covered above that you'd like to go into more detail on?
19:50:42 <fungi> i've got nothing else
19:51:07 <clarkb> the debian reprepro mirror needs help
19:51:26 <fungi> yep, planning to dig into that during/after dinner, unless someone beats me to it
19:51:31 <clarkb> it somehow leaked a lock file which I cleaned up earlier today and now it complains of a corrupt db
19:51:34 <fungi> database rebuild seems to be necessary
19:51:43 <ianw> yeah, i feel like i have notes on that, i can take a look
19:51:48 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/852056
19:52:20 <ianw> is one; about reverting the pin of the grafana container.  frickler isn't a fan, i'm a bit less worried about it -- not sure what others think
19:52:33 <clarkb> ianw: are they going to keep releasing beta software to :latest?
19:52:40 <clarkb> I'm ok with deploying it if they stop doing that
19:52:46 <frickler> ah, I checked that, the dashboard page looks empty with :latest
19:53:23 <clarkb> there was talk of them doing a :stable or similar tag iirc
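For reference, the pin under discussion is the image tag in the grafana docker-compose config, along these lines. A sketch where the exact tag shown is illustrative, not the one actually pinned:

```yaml
services:
  grafana:
    # a vetted release tag instead of :latest, which has carried betas
    image: docker.io/grafana/grafana-oss:9.1.6
```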
19:53:23 <frickler> also didn't we have a patch that generates screenshots of all the individual dashboards? I didn't find that
19:53:50 <ianw> that's a point, this job doesn't run that
19:53:53 <clarkb> frickler: I think that job runs on the project-config side. We could run it here too though, and probably a good idea
19:54:37 <frickler> anyway something still seems broken with latest, so we can either try some tagged version or try to find a fix in our setup
19:54:53 <frickler> not sure if someone has time and energy for that
19:55:04 <ianw> well yeah, if there is a problem with :latest, ignoring it is only going to make it worse :)  that's kind of my point
19:55:40 <clarkb> right but it seems that they started releasing known problematic stuff to :latest
19:55:46 <clarkb> whereas before it was vetted releases
19:56:01 <clarkb> I'm ok with keeping up with their releases but don't think we should be responsible for beta testing for them
19:57:10 <ianw> well, i doubt they would say that, and really it is our model of loading via the yaml path etc. that i think we're testing, and that's not going to be something covered by upstream ci
19:57:37 <clarkb> ianw: aiui when we broke previously it was because latest was a beta release
19:57:52 <clarkb> and the issue was a known issue they were already working to fix that would never end up in the final release
19:58:38 <ianw> not really -- it was their bug -- but we reported it, and confirmed it, and helped get it fixed
19:59:32 <frickler> different topic, just to shortly mention it before time's up, there seem to be some issues with nested-kvm on ovh-gra1. I'm testing with a beta version of cirros, will apply some special cmdline option
19:59:42 <frickler> hope to have some more information tomorrow
19:59:56 <clarkb> frickler: thanks.
20:00:16 <ianw> yeah, if it's the same thing as we saw with our jammy nodes booting there, i think we'll need some help from the cloud side
20:00:26 <clarkb> checking docker hub they don't seem to have stable tags
20:00:43 <ianw> it relates to kernel messages spewing from a prctl due to cpu flags
20:00:53 <ianw> #link https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1973839
20:01:10 <fungi> though i expect their business model provides them with an incentive to leverage users of the open source version as beta testers in order to shield their paying customers from bugs
20:01:17 <clarkb> amorin was responsive at least. Suggested trying a different flavor on a one off boot to check if that was any better
20:01:36 <fungi> (grafana's business model, i mean)
20:01:58 <frickler> I'll try the added kernel option first, other flavor second, didn't get to it today
20:02:12 <clarkb> and we are at time. Thanks everyone. Feel free to continue the grafana and nested virt discussion in #opendev
20:02:16 <frickler> made updates to service-types-authority work again
20:02:19 <clarkb> We'll be back here same time and place next week.
20:02:24 <clarkb> #endmeeting