19:01:04 #startmeeting infra
19:01:04 Meeting started Tue Sep 6 19:01:04 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:04 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:04 The meeting name has been set to 'infra'
19:01:06 #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000358.html Our Agenda
19:01:11 #topic Announcements
19:01:29 No new announcements, but keep in mind that OpenStack and StarlingX are working through their release processes right now
19:02:13 #topic Bastion Host Updates
19:02:50 Anything new on this item? I discovered a few minutes ago that newer ansible on bridge would be nice for newer apt module features. But I was able to work around that without updating ansible
19:03:04 In particular I think the new features also require python3.8?
19:03:15 And that implies upgrading the server and so on
19:04:29 yeah, i noticed that; i'm hoping the work to move to a venv will be top of the todo list now
19:04:46 excellent, I'll do my best to review those changes
19:05:38 #topic Bionic Server Upgrades
19:05:44 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:05:52 Mostly keeping this on the agenda to keep it top of mind.
19:06:04 I don't think any new work has happened on this, but we really should start digging in as we can
19:06:35 #topic Mailman 3
19:06:41 #link https://review.opendev.org/c/opendev/system-config/+/851248 Change to deploy mm3 server.
19:06:46 This change is ready for review now
19:06:53 #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes
19:07:05 Thank you fungi for doing much of the work to test the migration process on a held test node
19:07:24 Seems like it works as expected and gave us some good feedback on updates to make to our default list creation
19:07:56 One thing we found is that lynx needs to be installed for html to txt email conversion support. That isn't currently happening on the upstream images and I've opened a PR against them to fix it. Unfortunately no word from upstream yet on whether or not they want to accept it
19:08:07 Worst case we'll build our own images based on theirs and fix that
19:08:35 The next things to test are some updates to database connection settings to allow for larger email attachments (and check that mysqldump can back up the resulting database state)
19:08:42 yeah, next phase is to retest the openstack-discuss import (particularly the archive) for the db packet size limit tweaks, and also script up what a whole-site migration looks like
19:08:44 oh wow, haven't used lynx in a while!
19:09:00 We also need to test the pipermail archive hosting redirects
19:09:08 oh, yep that too
19:09:10 as we'll be hosting old pipermail archives to keep old links alive
19:09:21 (there isn't a mapping from pipermail to hyperkitty apparently)
19:09:32 probably also time to start thinking about what other redirects we might want for list info pages and the like
19:10:28 I think it is a bit early to commit to any conversion of lists.opendev.org, but I'm hopeful we'll continue to make enough progress that we do that in the not too distant future
19:10:32 since the old pipermail listinfo pages had different urls
19:10:54 maybe soon after the ptg would be a good timeframe
19:10:55 fungi: for those links we likely can redirect to the new archives though, ya?
19:11:00 yes
19:11:20 great
19:11:41 More testing planned tomorrow. Hopefully we'll have good things to report next week :)
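
A note on the lynx fallback mentioned above: if upstream doesn't take the change, building our own image on top of theirs should only need a thin extra layer. This is a sketch only; the image name, tag, and Alpine-style package manager are assumptions about the upstream images rather than a tested Dockerfile:

    # Sketch only: image name/tag and the apk package manager are assumptions;
    # swap in apt-get if the upstream base turns out to be Debian-based.
    FROM maxking/mailman-core:rolling
    # lynx is used for HTML-to-plain-text conversion of incoming mail
    RUN apk add --no-cache lynx
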
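On the pipermail redirects: since there is no per-message mapping from pipermail into hyperkitty, the rough shape discussed is to keep serving the old static pipermail archives at their existing URLs and only redirect the old listinfo pages to their new equivalents. A sketch of what that could look like in Apache, where the filesystem path and the new list URL layout are assumptions, not the tested configuration:

    # Sketch only: the archive path and the mailman3 URL layout are assumptions.
    # Keep old pipermail archive links working by serving them statically.
    Alias /pipermail /var/www/pipermail
    # Send old mailman2 listinfo pages to the new Postorius list pages.
    RedirectMatch permanent "^/cgi-bin/mailman/listinfo/([^/]+)$" "https://lists.opendev.org/mailman3/lists/$1.lists.opendev.org/"
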
19:11:44 just that we have old links to them in many places so not having to fix all of them will be good
19:11:44 Anything else on this topic?
19:11:50 ++
19:13:26 #topic Jaeger Tracing Server
19:14:05 corvus: Wanted to give you an opportunity to provide any updates here if there are any
19:14:14 ah yeah i started this
19:14:28 #link tracing server https://review.opendev.org/855983
19:14:35 does not pass tests yet -- just pushed it up eod yesterday
19:14:55 hopefully it's basically complete and probably i just typo'd something.
19:15:16 if folks wanted to give it a look, i think it's not too early for that
19:15:55 oh
19:16:12 i'm using the "zk-ca" to generate certs for the clients (zuul) to send data to the server (jaeger)
19:16:32 corvus: client authentication certs?
19:16:36 tls is optional -- we could rely on iptables only -- but i thought that was better, and seemed a reasonable scope expansion for our private ca
19:16:42 clarkb: yes exactly
19:16:58 that seems reasonable. There is a similar concern re meetpad that I'll talk about soon
19:17:05 it's basically the same use we're already making of zk-ca for zookeeper, so it seemed reasonable to keep it parallel
19:17:44 that's all i have -- that's the only interesting new thing that came up
19:17:57 thank you for the update. I'll try to take a look at the change today
19:18:03 #topic Fedora 36
19:18:20 ianw: this is another one I feel not caught up on. I believe I saw changes happening, but I'm not sure what the current state is.
19:18:24 Any chance you can fill us in?
19:19:42 in terms of general deployment and zuul-jobs, it's all gtg
19:20:09 devstack i do not think works, i haven't looked closely yet
19:20:11 #link https://review.opendev.org/c/openstack/devstack/+/854334
19:20:44 ianw: is fedora 35 still there or is that on the way out now?
19:20:52 in terms of getting rid of f35, there are issues with openshift testing
19:21:05 it is still there for devstack, but pretty unstable
19:21:13 oh right, the client doesn't work on f36 yet
19:21:33 the openshift testing has two parts too, just to make a bigger yak to shave
19:22:00 https://zuul.openstack.org/builds?job_name=devstack-platform-fedora-latest&skip=0
19:22:16 the client side is one thing: the "oc" client doesn't run on fedora 36 due to go compatibility issues
19:22:32 i just got a bug update overnight that "it was never supposed to work" or something ...
19:22:45 #link https://issues.redhat.com/browse/OCPBUGS-559
19:23:08 (i haven't quite parsed it)
19:24:02 the server side of that testing is also a problem -- it runs on centos7 using a PaaS repo that no longer exists
19:25:03 there's discussion that's been happening about converting this to openshift local, which is crc etc. etc. (i think we discussed this last week)
19:25:20 Yup, and we brought it up in the zuul + open infra board discussion earlier today as well
19:25:33 Sounds like there may be some interest from at least one board member in helping if possible
19:25:48 my concern with that is that it requires a nested-virt 9gb+ VM to run.
19:26:18 which we can provide in opendev too, so not entirely a show stopper
19:26:19 which we *can* provide via nodepool -- but it limits our testing range
19:26:23 yeah
19:26:38 not ideal, certainly
19:26:44 and also, it seems kind of crazy to require this to run a few containers
19:27:05 to test against. but in general the vibe i get is that nobody else thinks that
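
For context on the resource concern raised here, any job built around openshift local would effectively need a pre-flight check along these lines before it even starts. This is a sketch; the 9GB threshold and the /dev/kvm test are taken from the discussion rather than from crc's documented requirements:

    #!/bin/sh
    # Sketch only: thresholds approximate the "nested-virt 9gb+ VM" constraint
    # described in the meeting, not documented minimums.
    if [ ! -e /dev/kvm ]; then
        echo "no /dev/kvm: nested virtualization is unavailable on this node" >&2
        exit 1
    fi
    mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    if [ "$mem_kb" -lt $((9 * 1024 * 1024)) ]; then
        echo "less than ~9GB of RAM; an openshift local deployment will not start" >&2
        exit 1
    fi
    echo "node can plausibly host a nested-virt openshift local deployment"
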
19:27:26 so maybe i'm just old and think 640k should be enough for everyone :/
19:27:37 others probably do, their voices may just not be that loud (yet)
19:28:12 or they're resigned to what they feel is unavoidable
19:28:56 there is some talk of having this thing support "minishift" which sounds like what i described ... a few containers. but i don't know anything concrete about that
19:29:56 Sounds like we've got some options there, we've just got to sort out which is the best one for our needs?
19:30:04 I guess which is best and viable
19:30:45 well, i guess the options both kind of suck. one is to just drop it all, the other is to re-write a bunch of zuul-jobs openshift deployment etc. jobs that can then only run on bespoke (to opendev) nodes
19:31:17 hence i haven't been running at it full speed :)
19:31:56 We did open up the idea of those more specialized labels for specific needs like this, so I think it is ok to go that route from an OpenDev perspective if people want to push on that
19:32:42 I can understand the frustration though.
19:32:45 Anything else on this topic?
19:32:59 there's an open change for a 16vcpu flavor, which would automatically have more ram as well (and we could make a nested-virt version of that)
19:34:01 yeah, we can commit that. i kind of figured since it wasn't being pushed the need for it had subsided?
19:34:41 the original desire for it was coming from some other jobs in the zuul tenant too
19:35:02 though i forget which ones precisely
19:35:14 i'd still try out the unit tests on it if it lands
19:35:26 ahh, right that was it
19:35:28 it's not as critical as when i wrote it, but it will be again in the future :)
19:35:41 wanting more parallelization for concurrent unit tests
19:36:31 yeah -- i mean i guess my only concern was that people in general find the nodes and give up on smaller vms and just use that, limiting the testing environments
19:36:41 a 2020s-era vm can run them in <10m (and they're reliable)
19:36:42 but, it's probably me who is out of touch :)
19:37:09 ianw: I think that is still a concern and we should impress upon people that the more people who use that label by default, the more contention and the slower the rtt will be
19:37:23 ianw: conversely, we can't use the preferred flavors in vexxhost currently because they would supply too much ram for our standard flavor, so our quota there is under-utilized
19:37:23 and they become less fault tolerant, so you should use it only when a specific need makes it necessary
19:37:29 i'm a big fan of keeping them accessible -- though opendev's definition of accessible is literally 12 years old now, so i'm open to a small amount of moore's law inflation :)
19:37:56 corvus: to be fair, I only just this year ended up with a laptop with more than 8GB of ram. 8GB was common for a very long time for some reason
19:38:05 basically, vexxhost wants us to use a minimum 32gb ram flavor
19:38:24 clarkb: (yeah, i agree -- that's super weird they stuck on that value so long)
19:38:27 so being able to put that quota to use for something would be better than having it sit there
19:38:30 yeah, i guess it's cracking the door on a bigger question of vm types ...
19:38:50 if we want 8+gb AND nested virt, is vexxhost the only option? rax is out due to xen
19:39:19 ianw: ovh and inmotion do nested virt too
19:39:26 inap does not iirc
19:39:35 and no nested virt on arm
19:39:47 vexxhost is just the one where we actually don't have access to any "standard" 8gb flavors
19:40:06 (good -- i definitely don't want to rely on only one provider for check/gate jobs)
19:40:08 i know i can look at the nodepool config, but do we have these bigger vms on those providers too?
19:40:17 or is it vexxhost pinned?
19:40:39 ianw: I think we had them on vexxhost and something else for a while. It may be vexxhost specific right now
19:40:45 #link big vms https://review.opendev.org/844116
19:40:48 ianw: we don't have the label defined yet, but the open change adds it to more providers than just vexxhost
19:40:49 ovh gives us specific flavors; we would need to ask them about expanding them
19:40:55 inmotion we control the flavors on and can modify ourselves
19:40:58 i did some research for that and tried to add 16gb everywhere i could ^
19:41:24 oh right, I remember that
19:41:31 but also as noted, that's not 16vcpu+nestedvirt, so we'd want another change for that addition
19:41:40 looks like the commit msg says we need custom flavors added to some providers
19:41:59 so 3 in that change, and potentially another 3 if we add custom flavors
19:42:11 (and yes, that's only looking at ram -- reduce that list for nested)
19:42:31 oh actually that's a 16 vcpu change
19:42:34 ++ on the nested virt, as this tool checks for /dev/kvm and won't even start without it
19:42:44 (not ram)
19:43:05 corvus: yeah, though i think the minimum ram for the 16vcpu flavors was at least 16gb ram (more in some providers)?
19:43:09 so yes, i would definitely think we would want another label for that, but this is a start
19:43:20 fungi: yes that is true
19:43:26 (and that's annotated in the change too)
19:43:28 it sounds like maybe to move this on -- we can do corvus' change first
19:43:39 in that change, i started adding comments for all the flavors -- i think we should do that in nodepool in the future
19:43:42 Yup, sounds like we'd need to investigate further, but ya no objections from me. I think we can indicate that because they are larger they limit our capacity and thus should be used more sparingly, only when necessary, and take it from there
19:43:53 but i should take an action item to come up with some sort of spreadsheet-like layout of node types we could provide
19:43:55 eg: `flavor-name: 'A1.16' # 16vcpu, 16384ram, 320disk`
19:43:58 i also like the comment idea, yes
19:44:19 hopefully providers don't change what their flavors mean (seems unlikely they would though)
19:44:48 i think they generally have not (but could), so i think it's reasonable for us to document them in our config and assume they mostly won't change
19:44:51 it seems like we probably want to have more options of ram/cpus/nested virt
19:45:05 (and if they do -- our docs will annotate what we thought they were supposed to be, which is also useful!)
19:46:05 ++ and also something to point potential hosting providers at -- since currently i think we just say "we run 8gb vms"
19:46:43 sounds like we've reached a good conclusion on this topic for the meeting. Let's move on to cover the last few topics before we run out of time.
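
To make the flavor-comment idea concrete, the annotation shown at 19:43:55 would sit in the nodepool provider pool definitions, roughly like the sketch below. The provider, label, and flavor names are placeholders, not the contents of the open change:

    # Sketch only: provider, label, and flavor names are placeholders.
    providers:
      - name: example-provider
        pools:
          - name: main
            labels:
              - name: ubuntu-jammy
                diskimage: ubuntu-jammy
                flavor-name: 'A1.8'   # 8vcpu, 8192ram, 160disk
              - name: ubuntu-jammy-16vcpu
                diskimage: ubuntu-jammy
                flavor-name: 'A1.16'  # 16vcpu, 16384ram, 320disk

The comments change no behavior; they just record what each opaque flavor name was believed to provide, which is the documentation value discussed above.
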
19:46:46 #topic Meetpad and Jitsi Meet Updates
19:47:09 Fungi recently updated our jitsi meet configs for our meetpad service to catch up to upstream's latest happenings
19:47:27 The motivation behind this was to add a landing page before joining meetings so that firefox would stop auto-blocking the auto-playing audio
19:47:36 they call it a "pre-join" page
19:47:42 Good news is that all seems to be working now. Along the way we discovered a few interesting things though.
19:47:56 First is that the upstream :latest docker hub image tag is no longer latest (they stopped updating it about 4 months ago)
19:48:11 we now use the :stable tag which seems to correspond most closely to what they were tagging :latest previously
19:48:40 The next is that JVB scale-out appears to rely on this new colibri websocket system that requires each JVB to be accessible from the nginx serving the jitsi meet site via http(s)
19:49:21 To work around that for now we've shut down the jvbs and put them in the emergency file so that meetpad's all-in-one installation with localhost http connections can function and serve all the video
19:49:57 I wasn't comfortable running http to remote hosts without knowing what data is sent across it. I think what we'll end up doing is having the jvbs allocate LE certs and then set it all up with ssl instead
19:50:00 though shutting them down was probably not strictly necessary, it helps us be sure we don't accidentally try to farm a room out to one and then have it break
19:50:43 but worth noting there, it suggests that the separate jvb servers have probably been completely broken and so unused for many months now
19:50:59 After some thought I don't think the change to do all that for the JVBs is that difficult. It's more that we'll have to test it afterwards and ensure it is working as expected. We basically need to add LE, open firewall ports, and set a config value that indicates the hostname to connect to via the proxy.
19:51:21 fungi: yup, that as well
19:51:53 Hopefully we can try fixing the JVBs later this week and do another round of testing so that we're all prepped and ready for the PTG in a month
19:52:00 testing should also be trivial: stop the jvb container on meetpad.o.o and start it on one of the separate jvb servers
19:52:26 that way we can be certain over-the-network communication is used
19:52:37 ++ and in the meantime there shouldn't be any issues using the deployment as is
19:52:37 rather than the loopback-connected container
19:53:05 yeah, we're just not scaling out well currently, but as i said we probably haven't actually been for a while anyway
19:53:33 #topic Zuul Reboot Playbook Stability
19:53:38 #link https://review.opendev.org/c/opendev/system-config/+/856176
19:53:43 we also talked about dumping one of the two dedicated jvb servers and just keeping a single one unless we find out we need more capacity
19:53:46 #link https://review.opendev.org/c/opendev/system-config/+/855826
19:54:11 Over the weekend the zuul reboot playbook crashed because it ran into the unattended-upgrades apt lock on zm02
19:54:36 i want to say that happened previously on one of the other servers as well
19:54:37 I restarted it and then also noticed that since services were already down on zm02, the graceful stop playbook would similarly crash once it got to zm02 and tried to docker exec on those containers
19:55:04 The changes above address a logging thing I noticed and should address the apt lock issue as well. I'll try to address the graceful stop thing later today
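
Circling back to the meetpad topic above, the JVB plan sketched there (LE certs on the jvbs, opened firewall ports, and a hostname config value) ultimately comes down to the web frontend's nginx proxying colibri websocket traffic to each remote JVB over TLS. A rough sketch of such a proxy stanza; the hostname, port, and path pattern are assumptions, not the configuration that will actually be deployed:

    # Sketch only: jvb hostname, port, and path layout are assumptions.
    # Proxy colibri websocket traffic for a remote JVB over TLS instead of
    # the current localhost-only http connection.
    location ~ ^/colibri-ws/jvb01/(.*) {
        proxy_pass https://jvb01.opendev.org:9090/colibri-ws/jvb01/$1$is_args$args;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
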
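On the apt lock crash: newer versions of the ansible apt module can wait for the apt/dpkg lock (lock_timeout) rather than failing immediately, which may be the sort of "newer apt module feature" referenced under the bastion topic, though that is an assumption. A hedged sketch of that kind of task-level mitigation; the task name, package name, and timeout value are illustrative only, and the actual fix is whatever the linked changes implement:

    # Sketch only: lock_timeout needs a newer ansible-core than bridge
    # currently runs; the package name and 120s value are illustrative.
    - name: Apply package updates without racing unattended-upgrades
      ansible.builtin.apt:
        name: some-package
        state: latest
        update_cache: true
        lock_timeout: 120
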
19:55:35 Would be good to address that stuff, and we should continue to check the zuul components list on monday mornings until we're happy it is running in a stable manner
19:55:55 clarkb: you mention `latest ansible` -- what about upgrading bridge?
19:56:27 corvus: yes that would be required and is something that ianw has started poking at. The plan there is to move things into a virtualenv for ansible first, aiui, making it easier to manage the ansible install. Then use that to upgrade or redeploy
19:56:44 kk
19:56:48 corvus: I think I'm fine punting on that as work is in progress to make that happen and we'll get there
19:56:57 just not soon enough for the next weekly run of the zuul reboot playbook
19:57:14 ya, just thought to check in on it since i'm not up to date. thx.
19:57:30 Good news is all these problems have been mechanical and not pointing out major flaws in our system
19:57:45 #topic Open Discussion
19:57:50 Just a couple minutes left. Anything else?
19:58:07 One of our backup hosts needs pruning if anyone hasn't done this yet and would like to run the prune script
19:58:39 frickler: corvus: ^ It is documented, and previously ianw, myself, and then fungi have run through the docs just to make sure we're comfortable with the process
19:58:51 i've done it a couple of times already in the past, but happy to do it again unless someone else wants the honor
19:59:05 I'll skip this time
19:59:36 i'm happy to do it, will do this afternoon .au time
19:59:38 fungi: sounds like maybe you're it this time :) thanks
19:59:49 wfm
19:59:51 oh I jumped the gun. Thank you ianw
19:59:58 by 2 seconds :)
20:00:06 thanks ianw!
20:00:12 and we are at time. Thank you everyone for listening to me ramble on IRC
20:00:18 We'll be back next week at the same time and location
20:00:18 thanks clarkb!
20:00:22 #endmeeting