19:01:20 <clarkb> #startmeeting infra
19:01:20 <opendevmeet> Meeting started Tue Jul 5 19:01:20 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:20 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:20 <opendevmeet> The meeting name has been set to 'infra'
19:01:23 <frickler> \o
19:01:28 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000343.html Our Agenda
19:01:49 <clarkb> The agenda did go out a bit late this time. Yesterday was a holiday and I spent more time outside than anticipated
19:02:03 <clarkb> But we do have an agenda I put together today :)
19:02:08 <clarkb> #topic Announcements
19:02:19 <clarkb> I'm going to miss our next meeting (July 12)
19:03:11 <fungi> i'll be around, but also skipping an occasional meeting is fine by me
19:04:30 <ianw> maybe see if anything drastic comes up by eow, i'll be around too
19:04:37 <clarkb> works for me
19:05:05 <clarkb> Also as of today the web server for our mailman lists redirects to and should force the use of https
19:05:26 <clarkb> It seems to be working, but making note of that since it is a change
19:06:26 <clarkb> #topic Topics
19:06:34 <clarkb> #topic Improving CD throughput
19:07:02 <clarkb> The fix for the zuul auto upgrade cron appears to have worked. All of the zuul components were running the same version as of early this morning and uptime was a few days
19:07:27 <clarkb> A new manually triggered restart is in progress because zuul's docker image just updated to python3.10 and we wanted to control observation of that
19:08:38 <clarkb> That too seems to be going well (a couple of executors are already running on python3.10)
19:09:31 <fungi> yay! automate all the things
19:09:47 <clarkb> Please call out if you notice anything unexpected with these changes.
19:10:57 <clarkb> #topic Improving Grafana management tooling
19:11:26 <clarkb> ianw started a thread on this so that we can accumulate any concerns and thoughts there.
19:11:31 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html
19:12:00 <clarkb> I'm hoping that we can keep the discussion on the mailing list (and in code reviews) as much as possible so that we can run down any issues and concerns without needing to remind ourselves what they are from previous meeting discussions
19:12:10 <fungi> the dashboard preview screenshots in build results are awesome, btw
19:12:43 <clarkb> All that to say, let's try not to dive into details in the meeting today. But if there are specific things that need synchronous comms, now is a good time to bring them up. Otherwise please read the email, review the changes, and respond there with feedback
19:13:41 <clarkb> ianw: before I continue on, anything to call out?
19:14:36 <ianw> nope; all there (and end-to-end tested now :)
19:15:00 <clarkb> thank you for putting it together. I still need to read through it and look at the examples and all that myself
19:15:13 <clarkb> #topic Run a custom url shortener service
19:15:19 <clarkb> frickler: ^ anything new on this item?
19:15:40 <frickler> no, I think we should drop this from the agenda and I'll re-add it when I make progress
19:15:47 <clarkb> ok will do
19:16:09 <clarkb> #topic Zuul job POST_FAILURES
19:16:31 <clarkb> There are two changes related to this that may aid in debugging.
19:16:40 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/848014 Report POST_FAILURE job timing to graphite
19:16:59 <clarkb> This first one is probably less important, but zuul currently doesn't report job timing on POST_FAILUREs
19:17:17 <clarkb> having that info might help us better identify trends (like whether these failures are often timeout related)
19:17:36 <clarkb> The other records the swift upload target before we upload, which means we should get that info even if the task times out
19:17:45 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/848027 Add remote log store location debugging info to base jobs
19:18:25 <clarkb> This second change deserves care and in-depth review to ensure we don't disclose anything we don't want to disclose. But I think the content of the change is safe now and constructed to avoid anything unexpected. jrosser even pasted an example that emulates what the change does to show this
19:18:44 <clarkb> if we are confident in that I think we can go ahead and land the base-test update and check that it does what we want before landing it on the proper base job
19:19:07 <clarkb> I hesitate to approve that myself as I'll be popping out in a day and a half, but if someone else is able to push that over the hump and monitor it that would be great
19:19:13 <fungi> thanks, i meant to follow up on that one. will look again after the meeting
19:19:25 <clarkb> thanks.
19:19:31 <fungi> i plan to be around all evening anyway
19:19:50 <clarkb> fungi: ya my main concern is being able to shepherd the main base job through when we are ready for it.
19:20:11 <clarkb> Other than these two changes I'm not aware of any other fixes, aside from our suggestion to the projects to be more careful about what they log.
19:20:38 <clarkb> Avoid deep nesting, avoid duplicating log files (either because the same file is copied multiple times by a job or because we're copying stuff off the base OS install that is always identical), and so on
19:21:03 <fungi> yeah, it's worth reiterating that the vast majority have occurred for two projects in the openstack tenant we know do massive amounts of logging, so are already pushing this to its limits
19:21:06 <clarkb> Is anyone else aware of any changes here? I suspect the problem may have self-corrected a bit since there is less complaining about it. But we should be ready if it happens again
19:21:33 <fungi> it's already come and gone at least once
19:21:53 <fungi> so there's probably some transient environmental variable involved
19:22:26 <fungi> (api in some provider getting bogged down, network overloaded near our executors, et cetera)
19:22:50 <fungi> but having more data is the next step before we can refine our theories
19:24:02 <clarkb> ++
19:24:09 <clarkb> #topic Bastion Host Updates
19:24:25 <clarkb> I wanted to follow up on this to make sure I wasn't missing any progress towards shifting things into a venv
19:24:31 <clarkb> ianw: ^ are there changes for that to call out yet?
19:24:52 <ianw> nope, haven't pushed those yet, will soon
19:25:07 <clarkb> great, just double checking
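[Illustrative aside: the venv changes ianw mentions had not been pushed at meeting time, so nothing below is taken from them. As a rough sketch of the general pattern under discussion, with a hypothetical path and package list, moving tooling off the bastion's system Python into a dedicated venv could look roughly like this:]

```python
# Hypothetical illustration only: the actual changes were not yet pushed when
# this meeting happened. This just shows the general pattern of installing
# tooling into a dedicated venv instead of the system Python. The path and
# package names are made up for the example.
import subprocess
import venv

VENV_DIR = "/opt/bridge-venv"  # hypothetical location on the bastion

# Create an isolated environment with pip available inside it.
venv.EnvBuilder(with_pip=True).create(VENV_DIR)

# Install by invoking the venv's own interpreter so nothing lands in the
# system site-packages.
subprocess.run(
    [f"{VENV_DIR}/bin/python", "-m", "pip", "install", "ansible", "openstacksdk"],
    check=True,
)
```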
19:25:33 <clarkb> Separately it has got me thinking we should look at bionic -> focal and/or jammy upgrades for servers. In theory this is a lot more straightforward now for anything that has been containerized
19:25:54 <clarkb> On Friday I ran into a fun situation where new setuptools and virtualenv on Jammy don't actually create proper venvs
19:26:04 <clarkb> You end up with a venv that, if used, installs to the root of the system
19:26:24 <clarkb> effectively a no-op venv. Just binaries living at a different path
19:26:47 <ianw> oh there's your problem, you expected setuptools and virtualenv and venv to work :)
19:26:58 <clarkb> Calling that out as it may impact upgrades of services if they rely on virtualenvs on the root system. I think the vast majority of stuff is in containers and shouldn't be affected though
19:27:24 <clarkb> When I get back from this trip I would like to put together a todo list for these upgrades and start figuring out the scale and scope of them.
19:27:39 <fungi> sounds great
19:28:26 <clarkb> Anyway the weird jammy behavior and discussion about bridge upgrades (with a new server or in place) got me thinking about the broader need. I'll see what that looks like and put something together
19:28:31 <clarkb> #topic Open Discussion
19:28:44 <fungi> thanks!
19:28:46 <clarkb> That was it for the agenda I put together quickly today. Is there anything else to bring up?
19:29:03 <clarkb> ianw: I think you are working on adding a "proper" CA to our fake SSL certs in test jobs so that we can drop the curl --insecure flag?
19:29:26 <ianw> yes
19:29:29 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/848562
19:29:37 <ianw> pending zuul; i think that's ready for review now
19:29:43 <clarkb> was there a specific need for that, or more just cleaning up the dirty --insecure flag?
19:29:48 <ianw> (i only changed comments so i expect CI to work)
19:30:10 <ianw> it actually was a yak shaving exercise that came out of another change
19:30:30 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/845316
19:30:36 <ianw> (which is also ready for review :)
19:30:39 <fungi> for those who missed the status log addition, all our mailman sites are strictly https now. should be transparent, though clarkb remarked on a possible content cache problem returning stale contents, so keep an eye out
19:30:55 <ianw> that redirects the haproxy logs to a separate file instead of going into syslog on the load balancer
19:31:16 <frickler> I saw the xenial builds failing earlier. not sure how much effort to put into debugging and fixing, or whether it could be time to retire them
19:31:35 <ianw> however, i wanted to improve end-to-end testing of that, but i'm 99% sure that we don't pass data through the load balancer during the testing
19:31:42 <fungi> clarkb surmised it's old python stdlib not satisfying sni requirements for pypi
19:32:04 <clarkb> that's just a hunch. We should check if the builds consistently fail. But ya, pausing them and putting people on notice about that may be a good idea
19:32:06 <ianw> builds as in nodepool builds?
19:32:10 <fungi> yeah
19:32:24 <clarkb> ianw: ya the nodepool dib builds fail in the chroot trying to install os-testr and failing to find a version of pbr that is good enough
19:32:49 <clarkb> that is behavior we've seen from distutils in the past when SNI is required by pypi but the installer doesn't speak it (which I think is true for xenial but can be double-checked)
19:32:50 <ianw> ok i'll take a look. first thought is we had workarounds for that but who knows
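[Illustrative aside: clarkb's SNI hunch could be double-checked from inside the xenial chroot with something along these lines. This is a minimal sketch that assumes python3 is available in the chroot and only verifies that the interpreter can complete a TLS handshake with pypi.org; the real failure may still sit in the older installer tooling (distutils/setuptools) rather than the interpreter itself.]

```python
# Minimal sketch: check whether the chroot's Python/OpenSSL can talk TLS
# (with SNI) to pypi.org at all. A handshake failure here would support the
# SNI/TLS hunch; a successful response would point toward the older installer
# tooling instead.
import ssl
import urllib.request

URL = "https://pypi.org/simple/pbr/"

try:
    with urllib.request.urlopen(URL, timeout=30) as resp:
        print("TLS handshake ok, HTTP status:", resp.status)
except (ssl.SSLError, OSError) as exc:
    print("request to", URL, "failed:", exc)
```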
19:33:03 <frickler> you can nicely spot this by the wiggly graph in https://grafana.opendev.org/d/f3089338b3/nodepool-dib-status?orgId=1
19:33:25 <fungi> always be on the lookout for wiggly graphs, that's my motto
19:33:29 <clarkb> I do think we should be telling openstack that they should stop relying on the images at this point (we brought it up with them a while back but not sure much movement happened) and start removing our own uses of them
19:33:44 <ianw> ++
19:34:07 <clarkb> another user was windmill which i have since removed from zuul's tenant config for other reasons
19:34:08 <fungi> i'll have to take a look and see when 18.04 lts last appeared in the openstack pti
19:34:16 <clarkb> fungi: 16.04 is xenial
19:34:22 <fungi> er, right
19:34:33 <fungi> so long ago is the answer to that
19:35:00 <ianw> (one of the things i played with, with the new grafana, was a heatmap of dib builds -> https://68acf0be36b12a32a6a5-4c67d681abf13f527cb7d280eb684e4e.ssl.cf2.rackcdn.com/848212/3/check/project-config-grafana/24fe236/screenshots/nodepool-dib-status.png)
19:35:42 <fungi> the openstack pti is published all the way back to the stein release and says 18.04, so worst case it's used by stable/stein grenade jobs
19:36:19 <fungi> i'll check with elod for a second pair of eyes, but i expect we can yank it as far as openstack is officially concerned
19:36:40 <clarkb> fungi: I think frickler put together an etherpad of users
19:36:49 <fungi> oh, all the better
19:37:12 <clarkb> #link https://etherpad.opendev.org/p/ubuntu-xenial-jobs
19:37:32 <frickler> oh, I had forgotten about that :-)
19:37:45 <clarkb> I annotated it with notes at the time. We can probably push on some of those and remove the usages for things like gear
19:38:07 <clarkb> oh, that also notes it is only a master branch audit
19:38:11 <fungi> in theory, rocky was the last release to use xenial since it was released august 2018 and we'd have had bionic images long available for stein testing
19:38:13 <clarkb> so ya old openstack may hide extra bits of it
19:38:32 <frickler> master only because I used codesearch, yes
19:39:14 <clarkb> But I do think we are very close to being able to remove it without too much fallout. Getting to that point would be great
19:39:41 <fungi> and grenade jobs aren't supported for branches in extended maintenance, so we should be long clear to drop xenial images in openstack
19:39:47 <frickler> I'll look into devstack(+gate) in more detail
19:39:48 <ianw> fungi: do you want to get back to me and i'm happy to drive an email to openstack with a date and then start working towards it
19:40:23 <fungi> ianw: absolutely, once i hear something from elod (which i expect to be a thumbs-up) i'll let you know
19:40:52 <clarkb> Anything else? to cover?
19:40:53 <ianw> ++, i can then draft something, and i think the way to get this moving is to give ourselves a self-imposed deadline
19:40:55 <fungi> openstack stable/wallaby is the oldest coordinated branch under maintenance now anyway
19:40:58 <clarkb> er I fail at punctuation
19:43:56 <clarkb> Sounds like that may be it. Thank you everyone. I'll let you go to enjoy your morning/evening/$meal :)
19:44:11 <corvus> thanks clarkb !
19:44:17 <clarkb> Feel free to continue discussion in #opendev or on the mailing list if necessary
19:44:22 <clarkb> #endmeeting
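[Illustrative aside: the master-branch-only audit frickler describes was done with codesearch. A rough sketch of that kind of query follows; codesearch.opendev.org runs Hound, and the /api/v1/search endpoint, its parameters, and the response field names used here reflect Hound's usual API and are assumptions rather than anything confirmed in the meeting.]

```python
# Rough sketch of a master-branch-only audit for remaining ubuntu-xenial
# references via the Hound-based codesearch service. The endpoint and
# response field names are assumptions based on Hound's usual API.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"q": "ubuntu-xenial", "repos": "*"})
url = "https://codesearch.opendev.org/api/v1/search?" + params

with urllib.request.urlopen(url, timeout=60) as resp:
    results = json.load(resp).get("Results", {})

# Print one line per repo that still references ubuntu-xenial.
for repo, hits in sorted(results.items()):
    print(f"{repo}: {hits.get('FilesWithMatch', '?')} file(s) with matches")
```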