19:01:20 <clarkb> #startmeeting infra
19:01:20 <opendevmeet> Meeting started Tue Jul  5 19:01:20 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:20 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:20 <opendevmeet> The meeting name has been set to 'infra'
19:01:23 <frickler> \o
19:01:28 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000343.html Our Agenda
19:01:49 <clarkb> The agenda did go out a bit late this time. Yesterday was a holiday and I spent more time outside than anticipated
19:02:03 <clarkb> But we do have an agenda I put together today :)
19:02:08 <clarkb> #topic Announcements
19:02:19 <clarkb> I'm going to miss our next meeting (July 12)
19:03:11 <fungi> i'll be around, but also skipping an occasional meeting is fine by me
19:04:30 <ianw> maybe see if anything drastic comes up by eow, i'll be around too
19:04:37 <clarkb> works for me
19:05:05 <clarkb> Also, as of today the web server for our mailman lists redirects to https and should force its use
19:05:26 <clarkb> It seems to be working, but I'm making note of it here since it is a change
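[A minimal sketch, not part of the deployment, of one way to spot-check that redirect with Python; the lists.opendev.org hostname and the use of the requests library are assumptions, adjust as needed.]

    # Hedged spot check: confirm plain http requests are redirected to https.
    import requests

    resp = requests.get("http://lists.opendev.org/", allow_redirects=False, timeout=10)
    print("status:  ", resp.status_code)              # expect a 3xx redirect
    print("location:", resp.headers.get("Location"))  # expect an https:// URL
    assert str(resp.headers.get("Location", "")).startswith("https://")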
19:06:26 <clarkb> #topic Topics
19:06:34 <clarkb> #topic Improving CD throughput
19:07:02 <clarkb> The fix for the zuul auto upgrade cron appears to have worked. All of the zuul components were running the same version as of early this morning and uptime was a few days
19:07:27 <clarkb> A new manually triggered restart is in progress because zuul's docker image just updated to python3.10 and we wanted to control observation of that
19:08:38 <clarkb> That too seems to be going well (a couple of executors are already running on python3.10)
19:09:31 <fungi> yay! automate all the things
19:09:47 <clarkb> Please call out if you notice anything unexpected with these changes.
19:10:57 <clarkb> #topic Improving Grafana management tooling
19:11:26 <clarkb> ianw: started a thread on this so that we can accumulate any concerns and thoughts there.
19:11:31 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html
19:12:00 <clarkb> I'm hoping that we can keep the discussion on the mailing list (and in code reviews) as much as possible so that we can run down any issues and concerns without needing to remind ourselves what they are from previous meeting discussions
19:12:10 <fungi> the dashboard preview screenshots in build results are awesome, btw
19:12:43 <clarkb> All that to say, let's try not to dive into details in the meeting today. But if there are specific things that need synchronous comms now is a good time to bring them up. Otherwise please read the email and review the changes and respond there with feedback
19:13:41 <clarkb> ianw: before I continue on anything to call out?
19:14:36 <ianw> nope; all there (and end-to-end tested now :)
19:15:00 <clarkb> thank you for putting it together. I still need to read through it and look at the examples and all that myself
19:15:13 <clarkb> #topic Run a custom url shortener service
19:15:19 <clarkb> frickler: ^ anything new on this item?
19:15:40 <frickler> no, I think we should drop this from the agenda and I'll readd when I make progress
19:15:47 <clarkb> ok will do
19:16:09 <clarkb> #topic Zuul job POST_FAILURES
19:16:31 <clarkb> There are two changes related to this that may aid in debugging.
19:16:40 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/848014 Report POST_FAILURE job timing to graphite
19:16:59 <clarkb> This first one is probably less important, but zuul currently doesn't report job timing on POST_FAILUREs
19:17:17 <clarkb> having that info might help us better identify trends (like are these failures often timeout related)
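[As a rough illustration only, not Zuul's actual code: reporting a job duration to graphite via statsd might look something like the sketch below. The statsd host, metric name, and helper function are all made up for the example.]

    # Hypothetical sketch: emit job timing for POST_FAILURE results too, so
    # timeout-shaped failures show up alongside SUCCESS/FAILURE in graphite.
    from statsd import StatsClient

    statsd_client = StatsClient("graphite.example.org", 8125)

    def report_job_timing(tenant, job, result, duration_seconds):
        if result in ("SUCCESS", "FAILURE", "POST_FAILURE"):
            metric = "zuul.tenant.%s.job.%s.%s" % (tenant, job, result)
            statsd_client.timing(metric, duration_seconds * 1000)  # statsd timings are in ms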
19:17:36 <clarkb> The other records the swift upload target before we upload which means we should get that info even if the task times out
19:17:45 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/848027 Add remote log store location debugging info to base jobs
19:18:25 <clarkb> This second change deserves care and in-depth review to ensure we don't disclose anything we don't want to disclose. But I think the content of the change is safe now and constructed to avoid anything unexpected. jrosser even pasted an example that emulates what the change does to show this
19:18:44 <clarkb> if we are confident in that I think we can go ahead and land the base-test update and check that it does what we want before landing it on the proper base job
19:19:07 <clarkb> I hesitate to approve that myself as I'll be popping out in a day and a half, but if someone else is able to push that over the hump and monitor it that would be great
19:19:13 <fungi> thanks, i meant to follow up on that one. will look again after the meeting
19:19:25 <clarkb> thanks.
19:19:31 <fungi> i plan to be around all evening anyway
19:19:50 <clarkb> fungi: ya my main concern is being able to shepherd the main base job through when we are ready for it.
19:20:11 <clarkb> Other than these two changes I'm not aware of any other fixes beyond our suggestion to the projects to be more careful about what they log.
19:20:38 <clarkb> Avoid deep nesting, avoid duplicating log files either because the same file is copied multiple times by a job or because we're copying stuff off the base OS install that is always identical and so on
19:21:03 <fungi> yeah, it's worth reiterating that the vast majority have occurred for two projects in the openstack tenant we know do massive amounts of logging, so are already pushing this to its limits
19:21:06 <clarkb> Is anyone else aware of any changes here? I suspect the problem may have self corrected a bit since there is less complaining about it. But we should be ready if it happens again
19:21:33 <fungi> it's already come and gone at least once
19:21:53 <fungi> so there's probably some transient environmental variable involved
19:22:26 <fungi> (api in some provider getting bogged down, network overloaded near our executors, et cetera)
19:22:50 <fungi> but having more data is the next step before we can refine our theories
19:24:02 <clarkb> ++
19:24:09 <clarkb> #topic Bastion Host Updates
19:24:25 <clarkb> I wanted to follow up on this to make sure I wasn't missing any progress towards shifting things into a venv
19:24:31 <clarkb> ianw: ^ are there changes for that to call out yet?
19:24:52 <ianw> nope, haven't pushed those yet, will soon
19:25:07 <clarkb> great just double checking
19:25:33 <clarkb> Separately it has got me thinking we should look at bionic -> focal and/or jammy upgrades for servers. In theory this is a lot more straightforward now for anything that has been containerized
19:25:54 <clarkb> On Friday I ran into a fun situation with new setuptools and virtualenv on Jammy not actually creating proper venvs
19:26:04 <clarkb> You end up with a venv that if used installs to the root of the system
19:26:24 <clarkb> effectively a noop venv. Just binaries living at a different path
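[A hedged way to check for that failure mode on a given host, assuming python3 is available: in a working venv sys.prefix differs from sys.base_prefix, and pip's install target should live inside the venv directory rather than under the system paths.]

    # Quick diagnostic sketch: is this interpreter really inside an isolated venv?
    import sys
    import sysconfig

    print("in a venv:      ", sys.prefix != sys.base_prefix)
    print("sys.prefix:     ", sys.prefix)
    print("sys.base_prefix:", sys.base_prefix)
    # For a working venv this should point inside the venv, not /usr/lib/python3.x
    print("purelib:        ", sysconfig.get_paths()["purelib"])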
19:26:47 <ianw> oh there's your problem, you expected setuptools and virtualenv and venv to work :)
19:26:58 <clarkb> Calling that out as it may impact upgrades of services if they rely on virtualenvs on the root system. I think the vast majority of stuff is in containers and shouldn't be affected though
19:27:24 <clarkb> When I get back from this trip I would like to put together a todo list for these upgrades and start figuring out the scale and scope of them.
19:27:39 <fungi> sounds great
19:28:26 <clarkb> Anyway the weird jammy behavior and discussion about bridge upgrades with a new server or in place got me thinking about the broader need. I'll see what that looks like and put something together
19:28:31 <clarkb> #topic Open Discussion
19:28:44 <fungi> thanks!
19:28:46 <clarkb> That was it for the agenda I put together quickly today. Is there anything else to bring up?
19:29:03 <clarkb> ianw: I think you are working on adding a "proper" CA to our fake SSL certs in test jobs so that we can drop the curl --insecure flag?
19:29:26 <ianw> yes
19:29:29 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/848562
19:29:37 <ianw> pending zuul; i think that's ready for review now
19:29:43 <clarkb> was there a specific need for that or more just cleaning up the dirty --insecure flag
19:29:48 <ianw> (i only changed comments so i expect CI to work)
19:30:10 <ianw> it actually was a yak shaving exercise that came out of another change
19:30:30 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/845316
19:30:36 <ianw> (which is also ready for review :)
19:30:39 <fungi> for those who missed the status log addition, all our mailman sites are strictly https now. should be transparent, though clarkb remarked on a possible content cache problem returning stale contents, so keep an eye out
19:30:55 <ianw> that redirects the haproxy logs to a separate file instead of going into syslog on the loadbalancer
19:31:16 <frickler> I saw the xenial builds failing earlier. not sure how much effort to put into debugging and fixing or whether it could be time to retire them
19:31:35 <ianw> however, i wanted to improve end-to-end testing of that, but i'm 99% sure that we don't pass data through the load-balancer during the testing
19:31:42 <fungi> clarkb surmised it's old python stdlib not satisfying sni requirements for pypi
19:32:04 <clarkb> that's just a hunch. We should check if the builds consistently fail. But ya pausing them and putting people on notice about that may be a good idea
19:32:06 <ianw> builds as in nodepool builds?
19:32:10 <fungi> yeah
19:32:24 <clarkb> ianw: ya the nodepool dib builds fail in the chroot trying to install os-testr and failing to find a version of pbr that is good enough
19:32:49 <clarkb> that is behavior we've seen from distutils in the past when SNI is required by pypi but the installer doesn't speak it (which I think is true for xenial but can be double checked)
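[One hedged way to test that hunch from inside a xenial chroot or node, assuming python3 and network access to pypi.org: check whether the stdlib advertises SNI support and which TLS version it negotiates with pypi.]

    # Diagnostic sketch only; run with the image's python3 to probe the hunch above.
    import socket
    import ssl

    print("ssl.HAS_SNI:", ssl.HAS_SNI)
    print("OpenSSL:    ", ssl.OPENSSL_VERSION)

    context = ssl.create_default_context()
    with socket.create_connection(("pypi.org", 443), timeout=10) as sock:
        # server_hostname triggers SNI; old TLS stacks may fail or negotiate an old protocol
        with context.wrap_socket(sock, server_hostname="pypi.org") as tls:
            print("negotiated: ", tls.version())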
19:32:50 <ianw> ok i'll take a look.  first thought is we had workarounds for that but who knows
19:33:03 <frickler> you can nicely spot this by the wiggly graph in https://grafana.opendev.org/d/f3089338b3/nodepool-dib-status?orgId=1
19:33:25 <fungi> always be on the lookout for wiggly graphs, that's my motto
19:33:29 <clarkb> I do think we should be telling openstack that they should stop relying on the images at this point (we brought it up with them a while back but not sure much movement happened) and start removing our own uses of them
19:33:44 <ianw> ++
19:34:07 <clarkb> another user was windmill which i have since removed from zuul's tenant config for other reasons
19:34:08 <fungi> i'll have to take a look and see when 18.04 lts last appeared in the openstack pti
19:34:16 <clarkb> fungi: 16.04 is xenial
19:34:22 <fungi> er, right
19:34:33 <fungi> so long ago is the answer to that
19:35:00 <ianw> (one of the things i played with, with the new grafana was a heatmap of dib builds -> https://68acf0be36b12a32a6a5-4c67d681abf13f527cb7d280eb684e4e.ssl.cf2.rackcdn.com/848212/3/check/project-config-grafana/24fe236/screenshots/nodepool-dib-status.png)
19:35:42 <fungi> the openstack pti is published all the way back to the stein release and says 18.04, so worst case it's used by stable/stein grenade jobs
19:36:19 <fungi> i'll check with elod for a second pair of eyes, but i expect we can yank it as far as openstack is officially concerned
19:36:40 <clarkb> fungi: I think frickler put together an etherpad of users
19:36:49 <fungi> oh, all the better
19:37:12 <clarkb> #link https://etherpad.opendev.org/p/ubuntu-xenial-jobs
19:37:32 <frickler> oh, I had forgotten about that :-)
19:37:45 <clarkb> I annotated it with notes at the time. We can probably push on some of those and remove the usages for things like gear
19:38:07 <clarkb> oh that also notes it is only a master branch audit
19:38:11 <fungi> in theory, rocky was the last release to use xenial since it was released august 2018 and we'd have had bionic images long available for stein testing
19:38:13 <clarkb> so ya old openstack may hide extra bits of it
19:38:32 <frickler> master only because I used codesearch, yes
19:39:14 <clarkb> But I do think we are very close to being able to remove it without too much fallout. Getting to that point would be great
19:39:41 <fungi> and grenade jobs aren't supported for branches in extended maintenance, so we should be long clear to drop xenial images in openstack
19:39:47 <frickler> I'll look into devstack(+gate) in more detail
19:39:48 <ianw> fungi: do you want to get back to me and i'm happy to drive an email to openstack with a date and then start working towards it
19:40:23 <fungi> ianw: absolutely, once i hear something from elod (which i expect to be a thumbs-up) i'll let you know
19:40:52 <clarkb> Anything else? to cover?
19:40:53 <ianw> ++, i can then draft something, and i think the way to get this moving is to give ourselves a self-imposed deadline
19:40:55 <fungi> openstack stable/wallaby is the oldest coordinated branch under maintenance now anyway
19:40:58 <clarkb> er I fail at punctuation
19:43:56 <clarkb> Sounds like that may be it. Thank you everyone. I'll let you go to enjoy your morning/evening/$meal :)
19:44:11 <corvus> thanks clarkb !
19:44:17 <clarkb> Feel free to continue discussion in #opendev or on the mailing list if necessary
19:44:22 <clarkb> #endmeeting