19:01:07 #startmeeting infra
19:01:07 Meeting started Tue Sep 13 19:01:07 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:07 The meeting name has been set to 'infra'
19:01:14 #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000359.html Our Agenda
19:01:42 There is an agenda with quite a number of things on it. They are mostly small things so I may go quickly to be sure we get through it all, then we can swing back around on anything that needed extra discussion
19:01:52 #topic Announcements
19:02:06 Nothing major here. Just a reminder that OpenStack is in the middle of its release process and elections
19:02:36 Don't forget to vote if you're eligible, and take care to double check changes you are making to ensure we don't inadvertently break something the release depends on
19:03:29 #topic Topics
19:03:35 #topic Bastion Host Updates
19:04:05 We've taken yet another pivot after realizing we likely just never want to run the console stream daemon in these infra prod jobs. At least not in its current form
19:04:16 but the command module (and its relatives like shell) write the files out regardless
19:04:30 ianw: wrote some changes to make that optional which I think will be helpful for us
19:04:36 #link https://review.opendev.org/c/zuul/zuul/+/855309/ make console stream file writing toggleable
19:04:43 #link https://review.opendev.org/c/opendev/system-config/+/855472 Disable file writing for infra-prod
19:04:44 yes sorry that needs a revision from your comments
19:04:56 ianw: ya and did you see my note about modifying the base jobs repo in a similar manner to system-config as well?
19:05:39 ummm, sorry no, but can do
19:05:59 ping me if I don't rereview those quickly enough after updates. I'd like to see those get in as they appear to be a good improvement for our use case (and probably others in a similar boat)
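To make the toggle above concrete, a rough sketch of what the infra-prod side could look like once the zuul change lands. The variable name here is hypothetical; the real knob is whatever 855309 and 855472 settle on.

    # Hypothetical variable name, for illustration only
    - job:
        name: infra-prod-base
        vars:
          zuul_console_write_log_files: false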
19:07:29 #topic Upgrading Bionic Servers
19:07:34 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:07:45 I keep meaning to pick this up but then other things pop up and grab my attention
19:08:06 Help appreciated and let me know if anyone starts working on this and needs changes reviewed or help debugging issues. I'm more than happy to take a look
19:08:13 But no real updates on this yet
19:09:05 #topic Mailman 3
19:09:14 #link https://review.opendev.org/c/opendev/system-config/+/851248 Change to deploy mm3 server.
19:09:19 #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes
19:09:37 We (mostly fungi at this point) continue to make progress on getting to the point where this is ready
19:09:54 The database appears happy with the larger connection buffer settings.
19:10:17 fungi: ^ have you checked if mysqldump is happy with that setting too? We should check that (maybe by manually running the mysqldump?)
19:10:52 no, i didn't check that, but can make a note in the etherpad to test it with the next hold after a full import
19:10:56 Other todos include retesting now that the change is creating all the lists and not just those that ansible for mm2 knew about, checking the pipermail redirects, and I think adding redirects for non-list archive urls
19:11:03 fungi: thanks
19:11:16 fungi: we should probably go ahead and add a hold and recheck nowish?
19:11:25 fungi: I can do that after the meeting if that is helpful
19:11:28 yeah, i just hadn't gotten to it yet
19:11:42 cool I'll sync up after the meeting to get that moving forward
19:12:02 Thanks for all the help on this. You definitely realize just how many little details go into a big migration like this when you start testing it out
19:12:15 #topic Jaeger Tracing Server
19:12:21 #link https://review.opendev.org/c/opendev/system-config/+/855983 adds deployment for jaeger
19:12:48 my ball; will update this week.
19:12:53 There is a change now. CI isn't happy with it yet and I think ianw has some feedback
19:13:05 corvus: great, just wanted to make sure others were aware too.
19:13:22 seems like ppl generally like it so far. just working through some technical details.
19:15:19 #topic Fedora 36
19:16:22 #link https://review.opendev.org/c/zuul/nodepool/+/853914 Remove fedora 35 testing from nodepool
19:16:53 ianw: everywhere else is using the fedora-latest label and will get automatically updated?
19:17:40 devstack still has https://review.opendev.org/c/openstack/devstack/+/854334 but i need to look into that
19:18:02 ah they have their own labels.
19:18:19 but other than that, yes -- so with the nodepool change one step closer to dropping f35
19:18:35 ianw: looks like the issue there is they are branching nodeset definitions :/
19:18:58 thats going to create problems for every transition that uses an alias like -latest
19:19:08 might make sense to move that into openstack-zuul-jobs or similar to avoid that problem
19:19:19 we always seem to have this discussion about making sure various testing jobs don't end up on stable branches
19:19:53 another option is for them to use anonymous nodesets
19:20:02 but I don't think they should be managing aliased nodesets on branched repos
19:20:08 as this will be a problem every 6 months
19:20:30 yeah, a branchless repo like osj should fit the bill
19:20:45 er, ozj
19:20:53 there should be no fedora on anything but master -- but i agree this could have a better home
19:21:07 ianw: ya the problem is they branch yoga and don't clean it up
19:21:18 its better to just avoid having it on master where it can end up in a stable branch probably
19:21:27 i can add a todo to have a look
19:21:30 anyway we can sort that out with the qa team separately
19:21:47 is there anything other than reviewing the nodepool change that we can do to help
19:22:39 i don't think so, thanks. unless people want to start debugging devstack, which i don't think they do :)
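To make the two options discussed above concrete, a rough sketch (the nodeset and job names are illustrative, not the actual devstack configuration): keep the shared nodeset in a branchless repo, or have the job carry an anonymous nodeset so no branchable repo ends up with a copy of the definition.

    # Option 1: the shared nodeset lives in a branchless repo such as
    # openstack-zuul-jobs, so stable branches never carry a copy of it
    - nodeset:
        name: example-single-node-fedora
        nodes:
          - name: controller
            label: fedora-latest

    # Option 2: the job (master only) uses an anonymous nodeset, so there
    # is no named nodeset definition to end up on a stable branch
    - job:
        name: example-fedora-job
        nodeset:
          nodes:
            - name: controller
              label: fedora-latest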
19:23:22 #topic Jitsi Meet Updates
19:23:31 #link https://review.opendev.org/c/opendev/system-config/+/856553 Update to use colibri websockets and scale out JVBs
19:23:45 This is one of those changes where in theory I've done what is expected of the service
19:24:04 But its kinda hard to confirm that without having a full blown service running with dns setup and being able to talk to it with our browsers
19:24:26 but it's also fairly easy to test if we set aside a window to do so
19:24:26 In particular it isn't clear to me if the JVB java keystores need to have some relationship to a CA or to each other or be verifiable in some way
19:24:52 All of the bits I could find on the docs and forum posts about this don't indicate any sort of relationship like that so I think they may be using this just for encryption and not verification
19:25:00 fungi: ya exactly
19:25:13 I think if people are comfortable with the change we could probably land it and test things during a quiet time (Friday?)
19:25:23 and if it breaks either revert or try to roll forward and fix
19:25:47 I do think there is a window of opportunity here where we should get it done or wait until after the ptg though. Probably week before ptg is not the time to land this but before that is ok?
19:25:53 and i think it should be reasonably safe to merge first, make sure things aren't broken, take a jvb server out of emergency, redeploy so it gets updated, stio the jvb container on the main server, test again
19:26:02 i like friday
19:26:20 s/stio/stop
19:26:25 ++
19:27:08 sounds good. /me makes a note on the todo list to try and get that done friday
19:27:55 Other than that I think we are in good shape for having the service up for the ptg. The non-jvb setup seems to be working
19:28:06 (just a question of whether or not it can scale, but that is what the jvb change is for)
19:28:18 #topic Stability of the Zuul Reboot Playbook
19:28:31 If you didn't know this already, clouds are excellent chaos monkey drivers
19:29:01 Over the weekend we hit another issue with the playbook. This time it is a race between asking the container to stop in an exec and the container quitting out from under the docker exec
19:29:16 when the container exits before the exec is complete the docker command return code is 137 and ansible gets angry
19:29:42 I pushed an update to handle this as well as an unexpected behavior with docker-compose ps -q printing exited containers that frickler pointed out (docker ps -q does not do this)
19:30:00 I started a manual run of that yesterday and we are currently waiting for ze08 to stop
19:30:23 Hoping that completes today which will have what should become zuul 6.4.0 deployed in opendev for a bit before the release is made
19:30:46 Calling this out because I think it is a good idea for us to keep an eye on this playbook for a bit until we're satisfied it is stable
19:30:48 the original run was dev10/dev18
19:30:57 the new run is upgrading to dev21?
19:31:26 was it resumed or is everything going to dev21?
19:31:48 (i'm not sure how to read ze01 being at dev18, ze05 at dev21, and ze12 at dev18 again)
19:32:13 corvus: all of the ze's updated to dev18 over the weekend as the crash happened on zm05 which was after the zes
19:32:45 corvus: some time after my manual restart of the playbook yesterday a change or two landed to zuul and our hourly zuul playbook docker-compose pulled that so nodes after that point are upgrading to dev21
19:33:21 once this is done we can go and update ze01-ze04 to dev21 to match
19:33:35 as they should be the only ones out of sync (unless more zuul changes land in the interim)
19:33:36 gotcha. i was hoping to avoid that, but it looks like the changes that merged are inconsequential to the release
19:33:50 yes I looked at them and they didn't look to be major
19:33:56 nah, no need, we can run with diverse versions
19:34:10 we don't use the elasticsearch reporter :)
19:35:00 So far the updated playbook seems happy. I'll continue to monitor it
19:35:07 \o/
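For context, the rc 137 fix described above amounts to treating that exit code as an expected outcome of asking the container to stop. A minimal sketch of the pattern (illustrative task, not the actual playbook change):

    # Ask the component to stop; rc 137 just means the container exited
    # out from under the docker exec, which is the outcome we wanted.
    - name: Gracefully stop the zuul component
      command: docker exec example-container zuul-executor graceful
      register: stop_result
      failed_when: stop_result.rc not in [0, 137]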
19:35:34 #topic Python Container Image Updates
19:35:45 #link https://review.opendev.org/c/opendev/system-config/+/856537
19:36:09 This is a great time to update our python container base images as they now include a fixed glibc for the ansible issue and new python minor releases
19:36:48 Once we land that we can remove the zuul glibc workaround and let that change rebuild the zuul images
19:37:48 I wouldn't call this urgent, but it is good hygiene to update these periodically so that changes to our various images can pick up the underlying updates
19:37:49 ok, i feel like the zuul workaround is separate though
19:38:08 ianw: once the base image has fixed glibc the zuul workaround is no longer required?
19:38:22 This is a necessary precondition of removing the workaround
19:38:35 oh right, it builds on these base images. although it might do an apt-get upgrade as part of building
19:38:41 zuul that is?
19:38:49 zuul might, thats true
19:38:55 our base images don't
19:39:02 anyway, yeah pulling into base images seems good
19:39:15 #link zuul workaround: https://review.opendev.org/849795
19:39:36 i'm not aware of an apt-get upgrade
19:40:04 right, and https://review.opendev.org/c/zuul/zuul/+/854939 was to revert it
19:41:23 i updated that to depends-on the system-config change; so ordering should be right now
19:41:32 cool
19:41:45 sounds like a plan
19:42:22 #topic Improving Ansible Task Runtime
19:42:40 This is largely meant to be informational to help people be conscious of this as they write new ansible
19:42:53 But I'm also happy if people end up refactoring existing ansible :)
19:43:28 The TL;DR is that even though zuul uses ssh control persistence and ansible pipelining, the cost to run an individual task as simple as copying a few-byte file or execing ls is often measured in seconds
19:43:45 The exact number of seconds seems to vary across our clouds but we've seen it as high as 6 in some :(
19:44:07 This becomes particularly problematic when you are running ansible tasks in a loop with a large number of loop inputs
19:44:37 each input creates a new task that can take 6 seconds to execute. Multiply that by 100 items in a loop and now you just spent 10 minutes doing something that probably should've taken a second or two at most
19:44:56 I've written a few changes at this point to pick off some low hanging fruit that suffer from this
19:45:01 #link https://review.opendev.org/c/zuul/zuul-jobs/+/855402
19:45:05 #link https://review.opendev.org/c/zuul/zuul-jobs/+/857228
19:45:16 in particular improve some shared library roles so that everyone can benefit
19:45:25 #link https://review.opendev.org/c/opendev/system-config/+/857232
19:46:07 this change is specific to how we run nested ansible and saves 1-3 minutes or so depending on the test node. As noted in the commit message of this change there is a downside to it (more complicated nested ansible setup) and I've asked for feedback on whether or not we think that cost is worthwhile
19:46:45 I've just WIP'd it to ensure we don't merge it before additional feedback is given
19:47:02 So ya, try to be aware of this as you write ansible, it can make a big impact on how long our jobs take to execute
19:47:23 sometimes it might be appropriate to move actions into a shell script rather than have ansible work through logic and iteration
19:47:32 sometimes we can use synchronize instead of a loop of copies, and so on
19:48:55 And be on the lookout for any particularly problematic bits that we might be able to improve. The multi-node known hosts stuff could be quicker after my improvement above for example and maybe our infra log encryption could be sped up too
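A concrete illustration of the loop overhead described above (file names and paths are made up): the first form pays the per-task cost once per item, the second pays it once for the whole set.

    # Slow: one task invocation, and its multi-second overhead, per item
    - name: Copy config fragments one at a time
      copy:
        src: "files/{{ item }}"
        dest: "/etc/example/{{ item }}"
      loop: "{{ fragment_files }}"

    # Faster: a single task transfers the whole directory
    - name: Copy all config fragments at once
      synchronize:
        src: files/
        dest: /etc/example/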
19:49:01 #topic Open Discussion
19:49:18 We got through the agenda. Anything else or anything we covered above that you'd like to go into more detail on?
19:50:42 i've got nothing else
19:51:07 the debian reprepro mirror needs help
19:51:26 yep, planning to dig into that during/after dinner, unless someone beats me to it
19:51:31 it somehow leaked a lock file which I cleaned up earlier today and now it complains of a corrupt db
19:51:34 database rebuild seems to be necessary
19:51:43 yeah, i feel like i have notes on that, i can take a look
19:51:48 #link https://review.opendev.org/c/opendev/system-config/+/852056
19:52:20 is one; about reverting the pin of the grafana container. frickler isn't a fan, i'm a bit less worried about it -- not sure what others think
19:52:33 ianw: are they going to keep releasing beta software to :latest?
19:52:40 I'm ok with deploying it if they stop doing that
19:52:46 ah, I checked that, the dashboard page looks empty with :latest
19:53:23 there was talk of them doing a :stable or similar tag iirc
19:53:23 also didn't we have a patch that generates screenshots of all the individual dashboards? I didn't find that
19:53:50 that's a point, this job doesn't run that
19:53:53 frickler: I think that job runs on the project-config side, we could run it here too though and probably a good idea
19:54:37 anyway something still seems broken with latest, so we can either try some tagged version or try to find a fix in our setup
19:54:53 not sure if someone has time and energy for that
19:55:04 well yeah, if there is a problem with :latest, ignoring it is only going to make it worse :) that's kind of my point
19:55:40 right but it seems that they started releasing known problematic stuff to :latest
19:55:46 whereas before it was vetted releases
19:56:01 I'm ok with keeping up with their releases but don't think we should be responsible for beta testing for them
19:57:10 well, i doubt they would say that, and really it is our model of loading via the yaml path etc. that i think we're testing, and that's not going to be something covered by upstream ci
19:57:37 ianw: aiui when we broke previously it was because latest was a beta release
19:57:52 and the issue was a known issue they were already working to fix that would never end up in the final release
19:58:38 not really -- it was their bug -- but we reported it, and confirmed it, and helped get it fixed
19:59:32 different topic, just to shortly mention it before time's up, there seem to be some issues with nested-kvm on ovh-gra1. I'm testing with a beta version of cirros, will apply some special cmdline option
19:59:42 hope to have some more information tomorrow
19:59:56 frickler: thanks.
20:00:16 yeah, if it's the same thing as we saw with our jammy nodes booting there, i think we'll need some help from the cloud side
20:00:26 checking docker hub they don't seem to have stable tags
20:00:43 it relates to kernel messages spewing from a prctl due to cpu flags
20:00:53 #link https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1973839
20:01:10 though i expect their business model provides them with an incentive to leverage users of the open source version as beta testers in order to shield their paying customers from bugs
20:01:17 amorin was responsive at least. Suggested trying a different flavor on a one-off boot to check if that was any better
20:01:36 (grafana's business model, i mean)
20:01:58 I'll try the added kernel option first, other flavor second, didn't get to it today
20:02:12 and we are at time. Thanks everyone. Feel free to continue the grafana and nested virt discussion in #opendev
20:02:16 made updates to service-types-authority work again
20:02:19 We'll be back here same time and place next week.
20:02:24 #endmeeting