19:01:04 <clarkb> #startmeeting infra
19:01:04 <opendevmeet> Meeting started Tue Sep  6 19:01:04 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:04 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:04 <opendevmeet> The meeting name has been set to 'infra'
19:01:06 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000358.html Our Agenda
19:01:11 <clarkb> #topic Announcements
19:01:29 <clarkb> No new announcements but keep in mind that OpenStack and StarlingX are working through their release processes right now
19:02:13 <clarkb> #topic Bastion Host Updates
19:02:50 <clarkb> Anything new on this item? I discovered a few minutes ago that newer ansible on bridge would be nice for newer apt module features. But I was able to work around that without updating ansible
19:03:04 <clarkb> In particular I think the new features also require python3.8?
19:03:15 <clarkb> And that implies upgrading the server and so on
19:04:29 <ianw> yeah, i noticed that; i'm hoping the work to move to a venv will be top of todo now
19:04:46 <clarkb> excellent I'll do my best to review those changes
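For reference, a minimal sketch of the venv-based ansible install being discussed; the path and version pin below are illustrative assumptions, not the actual bridge layout:

    # create a dedicated venv for ansible and install into it
    python3 -m venv /opt/ansible-venv
    /opt/ansible-venv/bin/pip install --upgrade pip
    /opt/ansible-venv/bin/pip install 'ansible>=6'
    # playbooks would then be run via the venv's entry points
    /opt/ansible-venv/bin/ansible-playbook --version

Keeping ansible in a venv decouples its upgrade cadence from the distro python packages on the server itself.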
19:05:38 <clarkb> #topic Bionic Server Upgrades
19:05:44 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:05:52 <clarkb> Mostly keeping this on the agenda to keep it top of mind.
19:06:04 <clarkb> I don't think any new work has happened on this, but we really should start digging in as we can
19:06:35 <clarkb> #topic Mailman 3
19:06:41 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/851248 Change to deploy mm3 server.
19:06:46 <clarkb> This change is ready for review now
19:06:53 <clarkb> #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes
19:07:05 <clarkb> Thank you fungi for doing much of the work to test the migration process on a held test node
19:07:24 <clarkb> Seems like it works as expected and gave us some good feedback on updates to make to our default list creation
19:07:56 <clarkb> One thing we found is that lynx needs to be installed for html to txt email conversion support. That isn't currently happening on the upstream images and I've opened a PR against them to fix it. Unfortunately no word from upstream yet on whether or not they want to accept it
19:08:07 <clarkb> Worst case we'll build our own images based on theirs and fix that
19:08:35 <clarkb> The next things to test are some updates to database connection settings to allow for larger email attachments (and check that mysqldump can back up the resulting database state)
19:08:42 <fungi> yeah, next phase is to retest openstack-discuss import (particularly the archive) for the db packet size limit tweaks, and also script up what a whole-site migration looks like
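The "db packet size limit" being discussed is typically MySQL's max_allowed_packet; the value below is a placeholder for illustration, not the setting actually being tested:

    [mysqld]
    # must be at least as large as the biggest expected message/attachment row
    max_allowed_packet = 64M

mysqldump has a matching --max-allowed-packet option so the backup client can read those oversized rows back out.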
19:08:44 <ianw> oh wow, haven't used lynx in a while!
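A sketch of the "build our own images" fallback mentioned above, assuming the upstream image is docker-mailman's mailman-core and that it is Alpine-based (both assumptions):

    # hypothetical local image layering lynx on top of the upstream one
    FROM maxking/mailman-core:rolling  # image name/tag assumed
    # lynx is used for html-to-plaintext conversion of incoming mail
    RUN apk add --no-cache lynx  # swap for apt-get if the base turns out to be Debian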
19:09:00 <clarkb> We also need to test the pipermail archive hosting redirects
19:09:08 <fungi> oh, yep that too
19:09:10 <clarkb> as we'll be hosting old pipermail archives to keep old links alive
19:09:21 <clarkb> (there isn't a mapping from pipermail to hyperkitty apparently)
19:09:32 <fungi> probably also time to start thinking about what other redirects we might want for list info pages and the like
19:10:28 <clarkb> I think it is a bit early to commit to any conversion of lists.opendev.org but I'm hopeful we'll continue to make enough progress that we do that in the not too distant future
19:10:32 <fungi> since the old pipermail listinfo pages had different urls
19:10:54 <fungi> maybe soon after the ptg would be a good timeframe
19:10:55 <clarkb> fungi: for those links we likely can redirect to the new archives though ya?
19:11:00 <fungi> yes
19:11:20 <clarkb> great
19:11:41 <clarkb> More testing planned tomorrow. Hopefully we'll have good things to report next week :)
19:11:44 <fungi> just that we have old links to them in many places so not having to fix all of them will be good
19:11:44 <clarkb> Anything else on this topic?
19:11:50 <clarkb> ++
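A rough sketch of the kind of Apache configuration the redirect discussion above points at; the filesystem path and the Mailman 3 URL layout are assumptions to illustrate the idea, not the final vhost config:

    # keep the static pipermail archives served at their old URLs
    Alias /pipermail /var/www/pipermail
    <Directory /var/www/pipermail>
        Require all granted
    </Directory>
    # send old Mailman 2 listinfo pages to the new Postorius list pages
    RedirectMatch permanent "^/cgi-bin/mailman/listinfo/([^/]+)$" "https://lists.opendev.org/mailman3/lists/$1.lists.opendev.org/"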
19:13:26 <clarkb> #topic Jaeger Tracing Server
19:14:05 <clarkb> corvus: Wanted to give you an opportunity to provide any updates here if there are any
19:14:14 <corvus> ah yeah i started this
19:14:28 <corvus> #link tracing server https://review.opendev.org/855983
19:14:35 <corvus> does not pass tests yet -- just pushed it up eod yesterday
19:14:55 <corvus> hopefully it's basically complete and probably i just typo'd something.
19:15:16 <corvus> if folks wanted to give it a look, i think it's not too early for that
19:15:55 <corvus> oh
19:16:12 <corvus> i'm using the "zk-ca" to generate certs for the clients (zuul) to send data to the server (jaeger)
19:16:32 <clarkb> corvus: client authentication certs?
19:16:36 <corvus> tls is optional -- we could rely on iptables only -- but i thought that was better, and seemed a reasonable scope expansion for our private ca
19:16:42 <corvus> clarkb: yes exactly
19:16:58 <clarkb> that seems reasonable. There is a similar concern re meetpad that I'll talk about soon
19:17:05 <corvus> it's basically the same way we're already using zk-ca for zookeeper, so it seemed reasonable to keep it parallel
19:17:44 <corvus> that's all i have -- that's the only interesting new thing that came up
19:17:57 <clarkb> thank you for the update. I'll try to take a look at the change today
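As an illustration of the client-cert approach described above (the zk-ca tooling does the equivalent), a sketch of issuing a client certificate from a private CA; the file names and subject are placeholders:

    # generate a key and CSR for the client (zuul), then sign it with the private CA
    openssl req -new -newkey rsa:2048 -nodes \
        -keyout zuul-client.key -out zuul-client.csr -subj "/CN=zuul-client"
    openssl x509 -req -in zuul-client.csr -CA ca.crt -CAkey ca.key \
        -CAcreateserial -out zuul-client.crt -days 365

The resulting cert lets the collector require mutual TLS from clients instead of relying on iptables alone.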
19:18:03 <clarkb> #topic Fedora 36
19:18:20 <clarkb> ianw: this is another one I feel not caught up on. I believe I saw changes happening, but I'm not sure what the current state is.
19:18:24 <clarkb> Any chance you can fill us in?
19:19:42 <ianw> in terms of general deployment in zuul-jobs, it's all gtg
19:20:09 <ianw> devstack i do not think works, i haven't looked closely yet
19:20:11 <ianw> #link https://review.opendev.org/c/openstack/devstack/+/854334
19:20:44 <clarkb> ianw: is fedora 35 still there or is that on the way out now?
19:20:52 <ianw> in terms of getting rid of f35, there are issues with openshift testing
19:21:05 <frickler> it is still there for devstack, but pretty unstable
19:21:13 <clarkb> oh right the client doesn't work on f36 yet
19:21:33 <ianw> the openshift testing has two parts too, just to make a bigger yak to shave
19:22:00 <frickler> https://zuul.openstack.org/builds?job_name=devstack-platform-fedora-latest&skip=0
19:22:16 <ianw> the client side is one thing.  the "oc" client doesn't run on fedora 36 due to go compatibility issues
19:22:32 <ianw> i just got a bug update overnight that "it was never supposed to work" or something ...
19:22:45 <ianw> #link https://issues.redhat.com/browse/OCPBUGS-559
19:23:08 <ianw> (i haven't quite parsed it)
19:24:02 <ianw> the server side of that testing is also a problem -- it runs on centos7 using a PaaS repo that no longer exists
19:25:03 <ianw> there's been discussion about converting this to openshift local, which is crc etc. etc. (i think we discussed this last week)
19:25:20 <clarkb> Yup and we brought it up in the zuul + open infra board discussion earlier today as well
19:25:33 <clarkb> Sounds like there may be some interest from at least one board member in helping if possible
19:25:48 <ianw> my concern with that is that it requires a nested-virt 9gb+ VM to run.
19:26:18 <fungi> which we can provide in opendev too, so not entirely a show stopper
19:26:19 <ianw> which we *can* provide via nodepool -- but it limits our testing range
19:26:23 <fungi> yeah
19:26:38 <fungi> not ideal, certainly
19:26:44 <ianw> and also, it seems kind of crazy to require this to run a few containers
19:27:05 <ianw> to test against.  but in general the vibe i get is that nobody else thinks that
19:27:26 <ianw> so maybe i'm just old and think 640k should be enough for everyone :/
19:27:37 <fungi> others probably do, their voices may just not be that loud (yet)
19:28:12 <fungi> or they're consigned to what they feel is unavoidable
19:28:56 <ianw> there is some talk of having this thing support "minishift" which sounds like what i described ... a few containers.  but i don't know anything concrete about that
19:29:56 <clarkb> Sounds like we've got some options there we've just got to sort out which is the best one for our needs?
19:30:04 <clarkb> I guess which is best and viable
19:30:45 <ianw> well, i guess the options both kind of suck.  one is to just drop it all, the other is to re-write a bunch of zuul-jobs openshift deployment, etc. jobs that can only run on bespoke (to opendev) nodes
19:31:17 <ianw> hence i haven't been running at it full speed :)
19:31:56 <clarkb> We did open up the idea of those more specialized labels for specific needs like this so I think it is ok to go that route from an OpenDev perspective if people want to push on that
19:32:42 <clarkb> I can understand the frustration though.
19:32:45 <clarkb> Anything else on this topic?
19:32:59 <fungi> there's an open change for a 16vcpu flavor, which would automatically have more ram as well (and we could make a nestedvirt version of that)
19:34:01 <ianw> yeah, we can commit that.  i kind of figured since it wasn't being pushed the need for it had subsided?
19:34:41 <fungi> the original desire for it was coming from some other jobs in the zuul tenant too
19:35:02 <fungi> though i forget which ones precisely
19:35:14 <corvus> i'd still try out the unit tests on it if it lands
19:35:26 <fungi> ahh, right that was it
19:35:28 <corvus> it's not as critical as when i wrote it, but it will be again in the future :)
19:35:41 <fungi> wanting more parallelization for concurrent unit tests
19:36:31 <ianw> yeah -- i mean i guess my only concern was that people in general find the nodes and give up on smaller vm's and just use that, limiting the testing environments
19:36:41 <corvus> a 2020's era vm can run them in <10m (and they're reliable)
19:36:42 <ianw> but, it's probably me who is out of touch :)
19:37:09 <clarkb> ianw: I think that is still a concern and we should impress upon people that the more people who use that label by default the more contention and slower the rtt will be
19:37:23 <fungi> ianw: conversely, we can't use the preferred flavors in vexxhost currently because they would supply too much ram for our standard flavor, so our quota there is under-utilized
19:37:23 <clarkb> and they become less fault tolerant so you should use it only when a specific need makes it necessary
19:37:29 <corvus> i'm a big fan of keeping them accessible -- though opendev's definition of accessible is literally 12 years old now, so i'm open to a small amount of moore's law inflation :)
19:37:56 <clarkb> corvus: to be fair I only just this year ended up with a laptop with more than 8GB of ram. 8GB was common for a very long time for some reason
19:38:05 <fungi> basically, vexxhost wants us to use a minimum 32gb ram flavor
19:38:24 <corvus> clarkb: (yeah, i agree -- that's super weird they stuck on that value so long)
19:38:27 <fungi> so being able to put that quota to use for something would be better than having it sit there
19:38:30 <ianw> yeah, i guess it's cracking the door on a bigger question of vm types ...
19:38:50 <ianw> if we want 8+gb AND nested virt, is vexxhost the only option?  rax is out due to xen
19:39:19 <clarkb> ianw: ovh and inmotion do nested virt too
19:39:26 <clarkb> inap does not iirc
19:39:35 <clarkb> and no nested virt on arm
19:39:47 <fungi> vexxhost is just the one where we actually don't have access to any "standard" 8gb flavors
19:40:06 <corvus> (good -- i definitely don't want to rely on only one provider for check/gate jobs)
19:40:08 <ianw> i know i can look at nodepool config but do we have these bigger vms on those providers too?
19:40:17 <ianw> or is it vexxhost pinned?
19:40:39 <clarkb> ianw: I think we had them on vexxhost and something else for a while. It may be vexxhost specific right now
19:40:45 <corvus> #link big vms https://review.opendev.org/844116
19:40:48 <fungi> ianw: we don't have the label defined yet, but the open change adds it to more providers than just vexxhost
19:40:49 <clarkb> ovh gives us specific flavors we would need to ask them about expanding them
19:40:55 <clarkb> inmotion we control the flavors on and can modify ourselves
19:40:58 <corvus> i did some research for that and tried to add 16gb everywhere i could ^
19:41:24 <clarkb> oh right I remember that
19:41:31 <fungi> but also as noted, that's not 16vcpu+nestedvirt, so we'd want another change for that addition
19:41:40 <corvus> looks like the commit msg says we need custom flavors added to some providers
19:41:59 <corvus> so 3 in that change, and potentially another 3 if we add custom flavors
19:42:11 <corvus> (and yes, that's only looking at ram -- reduce that list for nested)
19:42:31 <corvus> oh actually that's a 16 vcpu change
19:42:34 <ianw> ++ on the nested virt; as this tool does checks for /dev/kvm and won't even start without it
19:42:44 <corvus> (not ram)
19:43:05 <fungi> corvus: yeah, though i think the minimum ram for the 16vcpu flavors was at least 16gb ram (more in some providers)?
19:43:09 <corvus> so yes, i would definitely think we would want another label for that, but this is a start
19:43:20 <corvus> fungi: yes that is true
19:43:26 <corvus> (and that's annotated in the change too)
19:43:28 <ianw> it sounds like maybe to move this on -- we can do corvus' change first
19:43:39 <corvus> in that change, i started adding comments for all the flavors -- i think we should do that in nodepool in the future
19:43:42 <clarkb> Yup, sounds like we'd need to investigate further, but ya no objections from me. I think we can indicate that because they are larger they limit our capacity and thus should be used more sparingly when necessary and take it from there
19:43:53 <ianw> but i should take an action item to come up with some sort of spreadsheet-like layout of node types we could provide
19:43:55 <corvus> eg: `flavor-name: 'A1.16'  # 16vcpu, 16384ram, 320disk`
19:43:58 <fungi> i also like the comment idea, yes
19:44:19 <fungi> hopefully providers don't change what their flavors mean (seems unlikely they would though)
19:44:48 <corvus> i think they generally have not (but could), so i think it's reasonable for us to document them in our config and assume they mostly won't change
19:44:51 <ianw> it seems like we probably want to have more options of ram/cpus/nested virt
19:45:05 <corvus> (and if they do -- our docs will annotate what we thought they were supposed to be, which is also useful!)
19:46:05 <ianw> ++ and also something to point potential hosting providers at -- since currently i think we just say "we run 8gb vms"
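To illustrate the inline flavor annotation corvus described, an abridged sketch of what a nodepool provider pool entry could look like; the provider, label, and flavor names here are placeholders, not a proposed configuration:

    providers:
      - name: example-provider   # placeholder; diskimage/label definitions elided
        cloud: example
        pools:
          - name: main
            max-servers: 10
            labels:
              - name: ubuntu-jammy-16vcpu
                diskimage: ubuntu-jammy
                flavor-name: 'A1.16'  # 16vcpu, 16384ram, 320disk

Carrying the flavor's vcpu/ram/disk details as a comment next to each flavor-name is what makes the later spreadsheet-style comparison of node types straightforward.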
19:46:43 <clarkb> sounds like we've reached a good conclusion on this topic for the meeting. Let's move on to cover the last few topics before we run out of time.
19:46:46 <clarkb> #topic Meetpad and Jitsi Meet Updates
19:47:09 <clarkb> Fungi recently updated our jitsi meet configs for our meetpad service to catch up to upstream's latest happenings
19:47:27 <clarkb> The motivation behind this was to add a landing page before joining meetings so that firefox would stop automatically blocking the auto-playing audio
19:47:36 <fungi> they call it a "pre-join" page
19:47:42 <clarkb> Good news is that all seems to be working now. Along the way we discovered a few interesting things though.
19:47:56 <clarkb> First is that the upstream :latest docker hub image tag is no longer latest (they stopped updating it about 4 months ago)
19:48:11 <clarkb> we now use the :stable tag which seems to correspond most closely to what they were tagging :latest previously
19:48:40 <clarkb> The next is that JVB scale out appears to rely on this new colibri websocket system that requires each JVB to be accessible from the nginx serving the jitsi meet site via http(s)
19:49:21 <clarkb> To work around that for now we've shut down the jvbs and put them in the emergency file so that meetpad's all in one installation with localhost http connections can function and serve all the video
19:49:57 <clarkb> I wasn't comfortable running http to remote hosts without knowing what the data sent across that is. I think what we'll end up doing is having the jvb's allocate LE certs and then set it all up with ssl instead
19:50:00 <fungi> though shutting them down was probably not strictly necessary, it helps us be sure we don't accidentally try to farm a room out to one and then have it break
19:50:43 <fungi> but worth noting there, it suggests that the separate jvb servers have probably been completely broken and so unused for many months now
19:50:59 <clarkb> After some thought I don't think the change to do all that for the JVBs is that difficult. More that we'll have to test it afterwards and ensure it is working as expected. We basically need to add LE, open firewall ports and set a config value that indicates the hostname to connect to via the proxy.
19:51:21 <clarkb> fungi: yup that as well
19:51:53 <clarkb> Hopefully we can try fixing the JVBs later this week and do another round of testing so that we're all prepped and ready for the PTG in a month
19:52:00 <fungi> testing should also be trivial: stop the jvb container on meetpad.o.o and start it on one of the separate jvb servers
19:52:26 <fungi> that way we can be certain over-the-network communication is used
19:52:37 <clarkb> ++ and in the meantime there shouldn't be any issues using the deployment as is
19:52:37 <fungi> rather than the loopback-connected container
19:53:05 <fungi> yeah, we're just not scaling out well currently, but as i said we probably haven't actually been for a while anyway
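For context, a rough sketch of the colibri websocket proxying upstream jitsi-meet documents, adapted to the remote-JVB-over-TLS idea above; the server-id capture and port are assumptions and would need to match the actual JVB configuration:

    # proxy /colibri-ws/<jvb-server-id>/... through to that JVB over TLS
    location ~ ^/colibri-ws/([a-zA-Z0-9.-]+)/(.*) {
        proxy_pass https://$1:9090/colibri-ws/$1/$2$is_args$args;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

This is why each JVB has to be reachable over http(s) from the nginx serving the meet site, and why LE certs on the JVBs make the scale-out path workable.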
19:53:33 <clarkb> #topic Zuul Reboot Playbook Stability
19:53:38 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/856176
19:53:43 <fungi> we also talked about dumping one of the two dedicated jvb servers and just keeping a single one unless we find out we need more capacity
19:53:46 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/855826
19:54:11 <clarkb> Over the weekend the zuul reboot playbook crashed because it ran into the unattended upgrades apt lock on zm02
19:54:36 <fungi> i want to say that happened previously on one of the other servers as well
19:54:37 <clarkb> I restarted it and then also noticed that since services were already down on zm02 the graceful stop playbook would similarly crash once it got to zm02 trying to docker exec on those containers
19:55:04 <clarkb> The changes above address a logging thing I noticed and should address the apt lock issue as well. I'll try to address the graceful stop thing later today
19:55:35 <clarkb> Would be good to address that stuff and we should continue to check the zuul components list on monday mornings until we're happy it is running in a stable manner
19:55:55 <corvus> clarkb: you mention `latest ansible` -- what about upgrading bridge?
19:56:27 <clarkb> corvus: yes that would be required and is something that ianw has started poking at. The plan there is to move things into a virtualenv for ansible first aiui making it easier to manage the ansible install. Then use that to upgrade or redeploy
19:56:44 <corvus> kk
19:56:48 <clarkb> corvus: I think I'm fine punting on that as work is in progress to make that happen and we'll get there
19:56:57 <clarkb> just not soon enough for the next weekly run of the zuul reboot playbook
19:57:14 <corvus> ya, just thought to check in on it since i'm not up to date.  thx.
19:57:30 <clarkb> Good news is all these problems have been mechanical and not pointing out major flaws in our system
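As one hedged illustration of how the apt lock contention could be tolerated in a playbook (not necessarily what the linked changes do), assuming an ansible-core new enough for the apt module's lock_timeout option — which ties back to the bridge/ansible upgrade discussed earlier:

    - name: Install packages without racing unattended-upgrades
      ansible.builtin.apt:
        name: "{{ packages }}"   # hypothetical variable for illustration
        state: present
        lock_timeout: 300        # wait up to five minutes for the apt/dpkg lock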
19:57:45 <clarkb> #topic Open Discussion
19:57:50 <clarkb> Just a couple minutes left. Anything else?
19:58:07 <clarkb> One of our backup hosts needs pruning if anyone hasn't done this yet and would like to run the prune script
19:58:39 <clarkb> frickler: corvus: ^ It is documented and previously ianw, myself, and then fungi have run through the docs just to make sure we're comfortable with the process
19:58:51 <fungi> i've done it a couple of times already in the past, but happy to do it again unless someone else wants the honor
19:59:05 <frickler> I'll skip this time
19:59:36 <ianw> i'm happy to do it, will do this afternoon .au time
19:59:38 <clarkb> fungi: sounds like maybe you're it this time :) thanks
19:59:49 <fungi> wfm
19:59:51 <clarkb> oh I jumped the gun. Thank you ianw
19:59:58 <clarkb> by 2 seconds :)
20:00:06 <fungi> thanks ianw!
20:00:12 <clarkb> and we are at time. Thank you everyone for listening to me ramble on IRC
20:00:18 <clarkb> We'll be back next week at the same time and location
20:00:18 <fungi> thanks clarkb!
20:00:22 <clarkb> #endmeeting