19:00:15 <clarkb> #startmeeting infra
19:00:15 <opendevmeet> Meeting started Tue Jul 9 19:00:15 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:15 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:15 <opendevmeet> The meeting name has been set to 'infra'
19:00:27 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/PF7F3JFOMNXHCMB7J426QSYOJZXGQ66D/ Our Agenda
19:00:32 <clarkb> #topic Announcements
19:00:52 <clarkb> A heads up that I'll be AFK the first half of next week through Wednesday
19:01:24 <clarkb> which means I'll miss that week's meeting. I'm happy for someone else to organize and chair the meeting, but it also seems fine to skip it
19:02:25 <clarkb> Can probably make a decision on which option to choose based on how things are going early next week
19:02:30 <clarkb> not something that needs to be decided here
19:02:51 <tonyb> Sounds fair
19:03:02 <clarkb> #topic Upgrading Old Servers
19:03:23 <clarkb> tonyb: good morning! anything new to add here? I think there was some held node behavior investigation done with fungi
19:04:19 <tonyb> Not a lot of change. We did some investigation, which didn't show any problems.
19:04:50 <clarkb> should we be reviewing changes for that at this point then?
19:05:03 <tonyb> Not yet
19:05:22 <fungi> i did some preliminary testing
19:05:34 <tonyb> I'll push up a new revision soon and then we can begin the review process
19:05:51 <fungi> but haven't done things that would involve me using multiple separate accounts yet
19:06:13 <fungi> so far it all looks great though, no concerns yet
19:06:27 <clarkb> sounds good. Since we're deploying a new server alongside the old one we should be able to merge things before we're 100% certain they are ready, and then we can always delete and start over if necessary
19:06:47 <clarkb> There was also discussion about clouds lacking Ubuntu Noble images at this point.
19:07:08 <clarkb> tonyb: I did upload a noble image to openmetal if you want to start with a mirror replacement there (but then that runs into the openafs problems...)
19:07:17 <fungi> yeah, our typical approach is to get the server running on a snapshot of the production data and then re-sync its state at cut-over time
19:07:29 <tonyb> Yup, I'll work on that later this week
19:07:31 <clarkb> but ya I think we can just grab the appropriate image from ubuntu and convert as necessary then upload
19:07:38 <fungi> probably easier in this case since it'll use a new primary domain name anyway
19:08:03 <clarkb> tonyb: oh I did want to mention that I'm not sure if the two vexxhost regions have different bfv requirements (you mentioned gitea servers are not bfv but they are in one region and gerrit is in the other). Something we can test though
19:08:36 <fungi> the only real cut-over step is swapping out the current wiki.openstack.org dns to point at wiki.opendev.org's redirect
19:08:51 <tonyb> Good to know.
19:09:30 <clarkb> fungi: we also need to shut things down to move db contents
19:09:38 <clarkb> so there will be a short downtime I think as well as updating the dns
19:09:49 <tonyb> I'll re-work the wiki announcement to send out sometime soon
19:10:18 <fungi> yeah, i mentioned re-syncing data
19:10:46 <clarkb> ah yup
19:10:47 <fungi> but the point is that only needs to be done one last time when we're ready to change dns for the old domain
19:10:54 <clarkb> ++
19:11:12 <fungi> so we can test the new production server pretty thoroughly before that
19:12:43 <clarkb> anything else related to server upgrades?
19:12:54 <tonyb> Not from me.
19:13:44 <clarkb> #topic AFS Mirror Cleanups
19:14:27 <clarkb> So last week I threw out there that we might consider force merging some centos 8 stream job removals from openstack-zuul-jobs in particular. Doing so will impact projects that are still using fips jobs on centos 8 stream (glance was an example)
19:14:53 <clarkb> I wanted to see how things were looking post openstack CVE patching before committing to that. Do we think that openstack is in a reasonable place for this to happen now?
19:15:29 <frickler> yes
19:15:48 <fungi> i do think so too, yes
19:16:04 <clarkb> cool, I pushed a change up for that not realizing that frickler had already pushed a similar change.
19:16:36 <fungi> keep in mind that openstack also has stream 9 fips jobs, so projects running them on 8 are simply on old job configs and should update
19:16:39 <clarkb> oh maybe that was just for wheel jobs
19:17:02 <frickler> clarkb: which ones are you referring to?
19:17:39 <clarkb> https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922649 and https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922314
19:18:14 <clarkb> frickler: maybe I should rebase yours on top of mine and then mine can drop the wheel build stuff. Then we force merge the fips removal and then work through wheel build cleanup directly (since it's just requirements and ozj that need updates for that)
19:18:27 <clarkb> if that sounds reasonable I can do that after the meeting and lunch
19:19:33 <frickler> we can do the reqs update later, too
19:19:59 <clarkb> ya it's less urgent since it's less painful to clean up. For the painful stuff ideally we get it done sooner so that people have more time to deal with it
19:20:15 <clarkb> ok I'll do that proposed change since I don't hear any objections and then we can proceed with additional cleanups
19:20:19 <clarkb> Anything else on this topic?
19:20:32 <frickler> I mean we could also force merge both and clean up afterwards
19:20:51 <frickler> but rebasing to avoid conflicts is a good idea anyway
19:21:18 <clarkb> ack
19:21:24 <clarkb> #topic Gitea 1.22 Upgrade
19:21:33 <clarkb> There is a gitea 1.22.1 release now
19:21:35 <fungi> (can't force merge if they merge-conflict regardless)
19:21:43 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/920580/ Change implementing the upgrade
19:21:55 <tonyb> Oh yay
19:21:56 <clarkb> This change has been updated to deploy the new release. It passes testing and there is a held node
19:22:00 <clarkb> #link https://104.130.219.4:3081/opendev/system-config 1.22.1 Held node
19:22:35 <clarkb> I think our best approach here may be to go ahead and do the upgrade once we're comfortable with it (it's a big upgrade...) then worry about the db doctoring after the upgrade since we need the new version for the db doctoring tool anyway
19:22:54 <fungi> sounds great to me
19:23:01 <clarkb> with that plan it would be great if people could review the change and call out any concerns from the release notes or what I've had to update
19:23:11 <clarkb> and then once we're happy with it do the upgrade
19:23:47 <fungi> will do
19:24:05 <clarkb> #topic Etherpad 2.1.1 Upgrade
19:24:23 <clarkb> Similarly we have some recent etherpad releases we should consider updating to. The biggest change appears to be re-adding APIKEY auth to the service
19:24:28 <clarkb> There are a number of bugfixes too though
19:24:34 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/923661 Etherpad 2.1.1 Upgrade
19:24:44 <clarkb> This also passes testing and I've got a held node for it as well
19:24:55 <clarkb> 149.202.167.222 is the held node and "clarkb-test" is the pad I used there
19:25:05 <clarkb> you have to edit /etc/hosts for this one otherwise redirects send you to the prod server
19:25:41 <clarkb> also I don't bother to revert the auth method since I would expect apikey to go away at some point, just with better communication the next time around
19:26:26 <clarkb> Similar to gitea, I guess look over the change, release notes, and held node and call out any concerns; otherwise I think we're in good shape to proceed with this one too
19:26:28 <fungi> i'll test it out after dinner and approve if all looks good. can keep an eye on the deploy job and re-test prod after too
19:26:38 <clarkb> thanks. I'll be around too after lunch
19:26:56 <clarkb> #topic Testing Rackspace's New Cloud Offering
19:27:18 <clarkb> Unfortunately I never heard back after my suggestion for a meeting this week. I'll need to take a different tack
19:27:33 <clarkb> I left this on the agenda so that I could be clear that no meeting is happening yet
19:27:36 <clarkb> but that was all
19:27:48 <clarkb> #topic Drop x/* projects with config errors from zuul
19:27:58 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/923509 Proposed cleanup of inactive x/ projects
19:28:25 <clarkb> frickler has proposed that we clean up idle/inactive projects with broken zuul config from the zuul tenant config. I think this is fine to do. It is easy to add projects back if anyone complains
19:28:49 <frickler> do you want to send a mail about this first?
19:29:03 <frickler> and possibly give a last warning period?
19:29:14 <fungi> an announcement with a planned merge date wouldn't be terrible
19:29:31 <clarkb> yes, I was just going to suggest that. I think what we can do is send email to service-announce indicating we'll start cleaning up projects and point to your change as the first iteration, and encourage people to fix things up or let us know if they are still active and will fix things
19:29:32 <frickler> also I didn't actually check the "idle/inactive" part
19:29:41 <clarkb> frickler: ya I was going to check it before +2ing :)
19:30:01 <frickler> I just took the data from the config-errors list
19:30:06 <clarkb> frickler: do you want to send that email or should I? if you want me to send it what time frame do you think is a good one for merging it? Sometime next week?
19:30:35 <frickler> please do send the mail, 1 week notice is fine IMO
19:30:45 <clarkb> ok that is on my todo list
19:30:56 <fungi> most of them probably have my admin account owning their last changes from the initial x/ namespace migration
19:31:31 <clarkb> thank you for bringing this up, I think cleanups like this will go a long way in reducing the blast radius around things like image label removals from nodepool
19:31:38 <clarkb> anything else on this subject?
19:31:43 <frickler> a related question would be whether we want to do any actual repo retirements later
19:31:50 <frickler> like similar to what openstack does
19:31:54 <clarkb> to be clear I'll check activity, leave a review, then send email to service-discuss in the near future
19:32:21 <frickler> so you'll check all repos? or just the ones with config errors?
19:32:28 <clarkb> frickler: I think we've avoided doing that ourselves because 1) it's a fair bit of work and 2) it doesn't affect us much if the repos are active and the maintainers don't want to indicate things have shut down
19:32:34 <clarkb> frickler: just the ones in your change
19:32:46 <clarkb> frickler: but I guess I can try sampling others to see if we can easily add to the list
19:33:32 <clarkb> basically we aren't responsible for people managing their software projects in a way that indicates if they have all moved on to other things and I don't think we should set that expectation
19:33:56 <frickler> but we might care about projects that have silently become abandoned
19:34:46 <clarkb> I don't think we do?
19:35:03 <frickler> one relatively easy to check criterion would be no zuul jobs running (passing) in a year or so
19:35:06 <clarkb> I mean we care about their impact on the larger CI system for example. But we shouldn't be responsible for changing the git repo content to say everyone has gone away
19:35:23 <fungi> proactively identifying and weeding out abandoned projects doesn't seem tractable to me, but unconfiguring projects which are causing us problems is fine
19:35:31 <clarkb> projects may elect to do that themselves but I don't think we should be expected to do that for anyone
19:36:41 <frickler> hmm, so I'm in a minority then it seems, fine
19:37:13 <fungi> it's possible projects are "feature complete" and haven't needed any bug fixes in years so haven't run any jobs, but would get things running if they discovered a fix was needed in the future
19:38:10 <fungi> and we haven't done a good job of collecting contact information for projects, so reaching out to them may also present challenges
19:39:12 <clarkb> ya I think the harm is minimized if we disconnect things from the CI system until needed again
19:39:19 <clarkb> but we don't need to police project activity beyond that
19:39:43 <fungi> that roughly matches my position as well
19:39:48 <frickler> ok, let's move on, then
19:40:02 <clarkb> #topic Zuul DB Performance Issues
19:40:09 <clarkb> #link https://zuul.opendev.org/t/openstack/buildsets times out without showing results
19:40:15 <clarkb> #link https://zuul.opendev.org/t/openstack/buildsets?pipeline=gate takes 20-30s
19:40:46 <frickler> yes, I noticed that while shepherding the gate last week for the cve patches
19:40:46 <clarkb> my hunch here is that we're simply lacking indexes that are necessary to make this performant
19:41:11 <clarkb> since there are at least as many builds as there are buildsets and the builds queries seem to be responsive enough
19:42:49 <clarkb> corvus: are you around?
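To make the investigation approach discussed in the following exchange concrete, here is a minimal sketch of capturing the in-flight buildsets query from the database's process list and asking the server for its plan. The connection details and the zuul_buildset table name are assumptions for illustration only, not details taken from this log.

```python
# Sketch only: host/user/password and the zuul_buildset table name are
# assumptions, not something confirmed in the meeting discussion.
import pymysql

conn = pymysql.connect(host="localhost", user="query_inspector",
                       password="secret", database="zuul")

with conn.cursor() as cur:
    # Capture currently running queries while the slow dashboard request is live.
    cur.execute("SHOW FULL PROCESSLIST")
    # Columns: Id, User, Host, db, Command, Time, State, Info
    candidates = [row for row in cur.fetchall()
                  if row[7] and "zuul_buildset" in row[7] and row[5] > 5]

    for row in candidates:
        query = row[7]
        # Ask the server how it plans to execute the captured query; a full
        # table scan over the buildset table would support the missing-index hunch.
        cur.execute("EXPLAIN " + query)
        for plan_row in cur.fetchall():
            print(plan_row)

conn.close()
```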
19:43:06 <corvus> i am
19:43:13 <clarkb> fwiw the behaviors frickler pointed out do appear to reproduce today as well so not just a "zuul is busy with cve patches" behavior
19:43:29 <clarkb> corvus: cool I wanted to make sure you saw this (don't expect answers right away)
19:43:57 <corvus> yeah i don't have any insight
19:44:37 <frickler> so do we need more inspection on the opendev side or rather someone to look into the zuul implementation side first?
19:45:28 <corvus> i think the next step would be asking opendev's db to explain the query plan in this environment to try to track down the problem
19:45:29 <clarkb> maybe a little bit of both. Looking at the api backend for those queries might point out an obvious inefficiency like a missing index, but if there isn't anything obvious then checking the opendev db server's slow logs is probably the next step
19:45:36 <fungi> ideally we'd spot the culprit on our zuul-web services first, yeah
19:46:11 <corvus> it's unlikely to be reproducible outside of that environment; each rdbms does its own query plans based on its own assumptions about the current data set
19:46:14 <fungi> it's always possible problems we're observing are related to our deployment/data and not upstream defects, after all
19:46:20 <clarkb> corvus: is there a good way to translate the sqlalchemy python query expression to sql? Maybe it's in the logs on the server side?
19:46:35 <corvus> "show full processlist" while the query is running
19:46:38 <clarkb> ack
19:46:58 <fungi> with as long as these queries are taking, capturing that should be trivial
19:47:03 <tonyb> I can look at the performance of the other zuuls I have access to but they're much smaller
19:47:11 <clarkb> so that's basically: find the running query in the process list, then have the server describe the query plan for that query, then determine what if anything needs to change in zuul
19:47:25 <corvus> or on the server. yes.
19:48:14 <clarkb> cool we have one more topic and we're running out of time so let's move on
19:48:19 <frickler> fwiw I don't see any noticeable delay on my downstream zuul server, so yes, likely related to the amount of data opendev has collected
19:48:27 <clarkb> we can follow up on any investigation outside of the meeting
19:48:34 <frickler> ack
19:48:39 <corvus> it's not necessarily the size, it's the characteristics of the data
19:48:51 <clarkb> #topic Reconsider Zuul Pipeline Queue Floor for OpenStack Tenant
19:49:33 <clarkb> openstack's cve patching process exposed that things are quite flaky there at the moment and it is almost impossible that the 20th change in an openstack gate queue would merge at the same time as the first change
19:49:48 <frickler> this is also mainly based on my observations made last week
19:49:54 <clarkb> currently we configure the gate pipeline to have a floor of 20 in its windowing algorithm which means that is the minimum number of changes that will be enqueued
19:50:05 <clarkb> s/enqueued/running jobs at the same time if we have at least that many changes/
19:50:53 <clarkb> I think reducing that number to say 10 would be fine (this was frickler's suggestion in the agenda) because as you point out it is highly unlikely to merge 20 changes together right now and it also isn't that common to have a queue that deep these days
19:51:08 <clarkb> so basically 90% of the time we won't notice either way and the other 10% of the time it is likely to be helpful
19:51:09 <corvus> is there a measurable problem and outcome we would like to achieve?
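As a rough illustration of the windowing behaviour clarkb describes above, the sketch below assumes Zuul's documented dependent-pipeline defaults (the window grows slowly on a passing buildset, halves on a failing one, and never drops below the floor); the exact growth and decrease factors are assumptions, not something stated in this discussion.

```python
# Rough sketch of the sliding-window behaviour being discussed; the
# linear-increase / halve-on-failure defaults are assumptions based on
# Zuul's documented dependent-pipeline settings.
def next_window(window: int, floor: int, buildset_passed: bool) -> int:
    if buildset_passed:
        return window + 1           # grow slowly on success
    return max(floor, window // 2)  # halve on failure, but never below the floor

# With a floor of 20, a flaky gate keeps at least 20 changes' worth of jobs
# running; dropping the floor to 10 frees those nodes for check.
window = 20
for passed in [False, False, True, False]:
    window = next_window(window, floor=10, buildset_passed=passed)
    print(window)  # 10, 10, 11, 10
```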
19:51:33 <fungi> i think, from an optimization perspective, because the gate pipeline will almost always contain fewer items than the check pipeline, that having the floor somewhat overestimate a maximum probable merge window is still preferable
19:51:37 <clarkb> corvus: time to merge openstack changes in the gate, which is impacted by throwing away many test nodes?
19:52:07 <clarkb> corvus: in particular openstack also has clean check but the gate has priority for nodes. So if we're burning nodes in the gate that won't ever pass jobs then check is slowed down which slows down the time to get into the gate
19:52:19 <frickler> also time to check further changes while the gate is using a large percentage of total available resources
19:52:41 <corvus> has "reconsider clean check" been discussed?
19:52:54 <fungi> we'll waste more resources on gate pipeline results by being optimistic, but if it's not starving the gate pipeline of resources then we have options for manually dealing with resource starvation in lower-priority pipelines
19:53:07 <clarkb> not that I am aware of. I suspect that removing clean check would just lead to less stability, but maybe the impact of the lower stability would be minimized
19:53:57 <corvus> if the goal of clean check is to reduce the admission of racy tests, it may be better overall to just run every job twice in gate. :)
19:53:59 <clarkb> fungi: like dequeuing things from the gate?
19:54:20 <fungi> yes, and promoting, and directly enqueuing urgent changes from check
19:54:24 <corvus> i think originally clean check was added because people were throwing bad changes at the gate and people didn't want to wait for them to fall out
19:54:39 <clarkb> corvus: yes one of the original goals of clean check was to reduce the likelihood that a change would be rechecked and forced in before someone looked closer and went "hrm this is actually flaky and broken"
19:54:44 <corvus> (if that's no longer a problem, it may be more trouble than it's worth now)
19:54:53 <clarkb> I think that history has shown people are more likely to just recheck harder rather than investigate though
19:55:42 <clarkb> from opendev's perspective I don't think it creates any issues for us to either remove clean check or reduce the window floor (the window will still grow if things get less flaky)
19:56:02 <frickler> well openstack at least is working on trying to reduce blank rechecks
19:56:08 <clarkb> but OpenStack needs to consider what the various fallout consequences would be in those scenarios
19:56:18 <corvus> how often is the window greater than 10?
19:56:26 <corvus> sorry, the queue depth
19:56:37 <clarkb> corvus: it's very rare these days which is why I mentioned the vast majority of the time it is a noop
19:57:03 <clarkb> but it does still happen around feature freeze, requirements update windows, and security patching with a lot of backports
19:57:09 <clarkb> it's just not daily like it once was
19:57:58 <fungi> random data point, at this moment in time the openstack tenant has one item in the gate pipeline and it's not even for an openstack project ;)
19:58:41 <frickler> and it's not in the integrated queue
19:58:49 <corvus> i don't love the idea, and i especially don't love the idea without any actual metrics around the problem or solution. i think there are better things that can be done first.
19:58:54 <corvus> one of those things is early failure detection
19:59:03 <corvus> is anyone from the openstack tenant working on that?
19:59:27 <corvus> that literally gets bad changes out of the critical path gate queue faster
19:59:36 <clarkb> corvus: I guess the idea is early failure detection could help because it would release resources more quickly to be used elsewhere?
19:59:39 <clarkb> ya
20:00:07 <frickler> I think that that is working for the openstack tenant?
20:00:09 <fungi> devstack/tempest/grenade jobs in particular (tox-based jobs already get it by default i think?)
20:00:31 <corvus> fungi: the zuul project is the only one using that now afaict
20:00:34 <clarkb> frickler: it's disabled by default and not enabled for openstack iirc
20:00:46 <frickler> at least I saw tempest jobs go red before they reported failure
20:00:54 <clarkb> the bug related to that impacted projects whether or not they were enabling the feature
20:01:01 <corvus> the playbook based detection is automatic
20:01:11 <fungi> longer-running jobs (e.g. tempest) would benefit the most from early failure signalling
20:01:14 <corvus> the output based detection needs to be configured
20:01:22 <corvus> (it's regex based)
20:02:06 <clarkb> we are at time. But maybe a good first step is enabling early failure detection for tempest and unittests and seeing if that helps things first
20:02:14 <corvus> but we worked out a regex for zuul that seems to be working with testr; it would be useful to copy that to other tenants
20:02:22 <clarkb> then if that doesn't help we can move on to the next thing which may or may not be removing clean check or reducing the window floor
20:02:38 <fungi> i guess in particular, jobs that "fail" early into a long-running playbook without immediately wrapping up the playbook are the best candidates for optimization there?
20:03:15 <corvus> ++ i'm not super-strongly opposed to changing the floor; i just don't want to throw out the benefits of a deep queue when there are other options.
20:03:45 <clarkb> fungi: ya tempest fits that criterion well as there may be an hour of testing after the first failure I think
20:04:00 <corvus> clarkb: fungi exactly
20:04:16 <fungi> whereas jobs that abort a playbook on the first error are getting most of the benefits already
20:04:23 <corvus> yep
20:04:39 <corvus> imagine the whole gate pipeline turning red and kicking out a bunch of changes in 10 minutes :)
20:05:13 <corvus> (kicking them out of the main line, i mean, not literally ejecting them; not talking about fail-fast here)
20:06:25 <fungi> i do think that would make a significant improvement in throughput/resource usage
20:07:09 <corvus> i can't do that work, but i'm happy to help anyone who wants to
20:08:08 <clarkb> thanks! We're well over time now and I'm hungry. Let's end the meeting here and we can coordinate those improvements either in the regular irc channel or the mailing list
20:08:11 <clarkb> thanks everyone!
20:08:16 <clarkb> #endmeeting
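The regex the zuul tenant worked out for testr is not quoted in the log; as a purely hypothetical illustration of the regex-based output detection discussed above, a pattern along these lines would flag a failing stestr/tempest test as soon as its result line appears in the streamed console output. In a real job this would be wired into the job configuration rather than run as a standalone script.

```python
# Hypothetical illustration only: this is not the regex the zuul tenant uses,
# just an example of matching a testr/stestr failure line as it streams by.
import re

FAILURE_PATTERN = re.compile(r"^.*\.\.\. FAILED\b")

sample_console_lines = [
    "tempest.api.compute.servers.test_create_server.ServersTestJSON.test_verify_server_details [1.204s] ... ok",
    "tempest.api.compute.servers.test_create_server.ServersTestJSON.test_list_servers [0.912s] ... FAILED",
]

for line in sample_console_lines:
    if FAILURE_PATTERN.match(line):
        # An early-failure signal like this lets the build be reported as a
        # failure without waiting for the rest of the run to finish.
        print("early failure detected:", line)
```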