19:00:15 <clarkb> #startmeeting infra
19:00:15 <opendevmeet> Meeting started Tue Jul 9 19:00:15 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:15 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:15 <opendevmeet> The meeting name has been set to 'infra'
19:00:27 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/PF7F3JFOMNXHCMB7J426QSYOJZXGQ66D/ Our Agenda
19:00:32 <clarkb> #topic Announcements
19:00:52 <clarkb> A heads up that I'll be AFK the first half of next week through Wednesday
19:01:24 <clarkb> which means I'll miss that week's meeting. I'm happy for someone else to organize and chair the meeting, but it also seems fine to skip it
19:02:25 <clarkb> Can probably make a decision on which option to choose based on how things are going early next week
19:02:30 <clarkb> not something that needs to be decided here
19:02:51 <tonyb> Sounds fair
19:03:02 <clarkb> #topic Upgrading Old Servers
19:03:23 <clarkb> tonyb: good morning! anything new to add here? I think there was some held node behavior investigation done with fungi
19:04:19 <tonyb> Not a lot of change. We did some investigation, which didn't show any problems.
19:04:50 <clarkb> should we be reviewing changes for that at this point then?
19:05:03 <tonyb> Not yet
19:05:22 <fungi> i did some preliminary testing
19:05:34 <tonyb> I'll push up a new revision soon and then we can begin the review process
19:05:51 <fungi> but haven't done things that would involve me using multiple separate accounts yet
19:06:13 <fungi> so far it all looks great though, no concerns yet
19:06:27 <clarkb> sounds good. Since we're deploying a new server alongside the old one we should be able to merge things before we're 100% certain they are ready, and then we can always delete and start over if necessary
19:06:47 <clarkb> There was also discussion about clouds lacking Ubuntu Noble images at this point.
19:07:08 <clarkb> tonyb: I did upload a noble image to openmetal if you want to start with a mirror replacement there (but then that runs into the openafs problems...)
19:07:17 <fungi> yeah, our typical approach is to get the server running on a snapshot of the production data and then re-sync its state at cut-over time
19:07:29 <tonyb> Yup, I'll work on that later this week
19:07:31 <clarkb> but ya I think we can just grab the appropriate image from ubuntu and convert as necessary then upload
19:07:38 <fungi> probably easier in this case since it'll use a new primary domain name anyway
19:08:03 <clarkb> tonyb: oh I did want to mention that I'm not sure if the two vexxhost regions have different bfv requirements (you mentioned gitea servers are not bfv but they are in one region and gerrit is in the other). Something we can test though
19:08:36 <fungi> the only real cut-over step is swapping out the current wiki.openstack.org dns to point at wiki.opendev.org's redirect
19:08:51 <tonyb> Good to know.
19:09:30 <clarkb> fungi: we also need to shut things down to move db contents
19:09:38 <clarkb> so there will be a short downtime I think as well as updating the dns
19:09:49 <tonyb> I'll re-work the wiki announcement to send out sometime soon
19:10:18 <fungi> yeah, i mentioned re-syncing data
19:10:46 <clarkb> ah yup
19:10:47 <fungi> but the point is that only needs to be done one last time when we're ready to change dns for the old domain
19:10:54 <clarkb> ++
19:11:12 <fungi> so we can test the new production server pretty thoroughly before that
19:12:43 <clarkb> anything else related to server upgrades?
19:12:54 <tonyb> Not from me.
19:13:44 <clarkb> #topic AFS Mirror Cleanups
19:14:27 <clarkb> So last week I threw out there that we might consider force merging some centos 8 stream job removals from openstack-zuul-jobs in particular. Doing so will impact projects that are still using fips jobs on centos 8 stream (glance was an example)
19:14:53 <clarkb> I wanted to see how things were looking post openstack CVE patching before committing to that. Do we think that openstack is in a reasonable place for this to happen now?
19:15:29 <frickler> yes
19:15:48 <fungi> i do think so too, yes
19:16:04 <clarkb> cool, I pushed a change up for that not realizing that frickler had already pushed a similar change.
19:16:36 <fungi> keep in mind that openstack also has stream 9 fips jobs, so projects running them on 8 are simply on old job configs and should update
19:16:39 <clarkb> oh maybe that was just for wheel jobs
19:17:02 <frickler> clarkb: which ones are you referring to?
19:17:39 <clarkb> https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922649 and https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922314
19:18:14 <clarkb> frickler: maybe I should rebase yours on top of mine and then mine can drop the wheel build stuff. Then we force merge the fips removal and then work through wheel build cleanup directly (since it's just requirements and ozj that need updates for that)
19:18:27 <clarkb> if that sounds reasonable I can do that after the meeting and lunch
19:19:33 <frickler> we can do the reqs update later, too
19:19:59 <clarkb> ya it's less urgent since it's less painful to clean up. For the painful stuff ideally we get it done sooner so that people have more time to deal with it
19:20:15 <clarkb> ok I'll do that proposed change since I don't hear any objections and then we can proceed with additional cleanups
19:20:19 <clarkb> Anything else on this topic?
19:20:32 <frickler> I mean we could also force merge both and clean up afterwards
19:20:51 <frickler> but rebasing to avoid conflicts is a good idea anyway
19:21:18 <clarkb> ack
19:21:24 <clarkb> #topic Gitea 1.22 Upgrade
19:21:33 <clarkb> There is a gitea 1.22.1 release now
19:21:35 <fungi> (can't force merge if they merge-conflict regardless)
19:21:43 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/920580/ Change implementing the upgrade
19:21:55 <tonyb> Oh yay
19:21:56 <clarkb> This change has been updated to deploy the new release. It passes testing and there is a held node
19:22:00 <clarkb> #link https://104.130.219.4:3081/opendev/system-config 1.22.1 Held node
19:22:35 <clarkb> I think our best approach here may be to go ahead and do the upgrade once we're comfortable with it (it's a big upgrade...) then worry about the db doctoring after the upgrade since we need the new version for the db doctoring tool anyway
19:22:54 <fungi> sounds great to me
19:23:01 <clarkb> with that plan it would be great if people could review the change and call out any concerns from the release notes or what I've had to update
19:23:11 <clarkb> and then once we're happy with it do the upgrade
19:23:47 <fungi> will do
19:24:05 <clarkb> #topic Etherpad 2.1.1 Upgrade
19:24:23 <clarkb> Similarly we have some recent etherpad releases we should consider updating to. The biggest change appears to be re-adding APIKEY auth to the service
19:24:28 <clarkb> There are a number of bugfixes too though
19:24:34 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/923661 Etherpad 2.1.1 Upgrade
19:24:44 <clarkb> This also passes testing and I've got a held node for it as well
19:24:55 <clarkb> 149.202.167.222 is the held node and "clarkb-test" is the pad I used there
19:25:05 <clarkb> you have to edit /etc/hosts for this one otherwise redirects send you to the prod server
19:25:41 <clarkb> also I don't bother to revert the auth method since I would expect apikey to go away at some point, just with better communication the next time around
19:26:26 <clarkb> Similar to gitea, I guess look over the change, release notes, and held node and call out any concerns; otherwise I think we're in good shape to proceed with this one too
19:26:28 <fungi> i'll test it out after dinner and approve if all looks good. can keep an eye on the deploy job and re-test prod after too
19:26:38 <clarkb> thanks. I'll be around too after lunch
19:26:56 <clarkb> #topic Testing Rackspace's New Cloud Offering
19:27:18 <clarkb> Unfortunately I never heard back after my suggestion for a meeting this week. I'll need to take a different tack
19:27:33 <clarkb> I left this on the agenda so that I could be clear that no meeting is happening yet
19:27:36 <clarkb> but that was all
19:27:48 <clarkb> #topic Drop x/* projects with config errors from zuul
19:27:58 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/923509 Proposed cleanup of inactive x/ projects
19:28:25 <clarkb> frickler has proposed that we clean up idle/inactive projects with broken zuul config from the zuul tenant config. I think this is fine to do. It is easy to add projects back if anyone complains
19:28:49 <frickler> do you want to send a mail about this first?
19:29:03 <frickler> and possibly give a last warning period?
19:29:14 <fungi> an announcement with a planned merge date wouldn't be terrible
19:29:31 <clarkb> yes, I was just going to suggest that. I think what we can do is send email to service-announce indicating we'll start cleaning up projects and point to your change as the first iteration, and encourage people to fix things up or let us know if they are still active and will fix things
19:29:32 <frickler> also I didn't actually check the "idle/inactive" part
19:29:41 <clarkb> frickler: ya I was going to check it before +2ing :)
19:30:01 <frickler> I just took the data from the config-errors list
19:30:06 <clarkb> frickler: do you want to send that email or should I? if you want me to send it what time frame do you think is a good one for merging it? Sometime next week?
19:30:35 <frickler> please do send the mail, 1 week notice is fine IMO
19:30:45 <clarkb> ok that is on my todo list
19:30:56 <fungi> most of them probably have my admin account owning their last changes from the initial x/ namespace migration
19:31:31 <clarkb> thank you for bringing this up, I think cleanups like this will go a long way in reducing the blast radius around things like image label removals from nodepool
19:31:38 <clarkb> anything else on this subject?
19:31:43 <frickler> a related question would be whether we want to do any actual repo retirements later
19:31:50 <frickler> like similar to what openstack does
19:31:54 <clarkb> to be clear I'll check activity, leave a review, then send email to service-discuss in the near future
19:32:21 <frickler> so you'll check all repos? or just the ones with config errors?
19:32:28 <clarkb> frickler: I think we've avoided doing that ourselves because 1) it's a fair bit of work and 2) it doesn't affect us much if the repos are active and the maintainers don't want to indicate things have shut down
19:32:34 <clarkb> frickler: just the ones in your change
19:32:46 <clarkb> frickler: but I guess I can try sampling others to see if we can easily add to the list
19:33:32 <clarkb> basically we aren't responsible for people managing their software projects in a way that indicates if they have all moved on to other things and I don't think we should set that expectation
19:33:56 <frickler> but we might care about projects that have silently become abandoned
19:34:46 <clarkb> I don't think we do?
19:35:03 <frickler> one relatively easy to check criterion would be no zuul jobs running (passing) in a year or so
19:35:06 <clarkb> I mean we care about their impact on the larger CI system for example. But we shouldn't be responsible for changing the git repo content to say everyone has gone away
19:35:23 <fungi> proactively identifying and weeding out abandoned projects doesn't seem tractable to me, but unconfiguring projects which are causing us problems is fine
19:35:31 <clarkb> projects may elect to do that themselves but I don't think we should be expected to do that for anyone
19:36:41 <frickler> hmm, so I'm in a minority then it seems, fine
19:37:13 <fungi> it's possible projects are "feature complete" and haven't needed any bug fixes in years so haven't run any jobs, but would get things running if they discovered a fix was needed in the future
19:38:10 <fungi> and we haven't done a good job of collecting contact information for projects, so reaching out to them may also present challenges
19:39:12 <clarkb> ya I think the harm is minimized if we disconnect things from the CI system until needed again
19:39:19 <clarkb> but we don't need to police project activity beyond that
19:39:43 <fungi> that roughly matches my position as well
19:39:48 <frickler> ok, let's move on, then
19:40:02 <clarkb> #topic Zuul DB Performance Issues
19:40:09 <clarkb> #link https://zuul.opendev.org/t/openstack/buildsets times out without showing results
19:40:15 <clarkb> #link https://zuul.opendev.org/t/openstack/buildsets?pipeline=gate takes 20-30s
19:40:46 <frickler> yes, I noticed that while shepherding the gate last week for the cve patches
19:40:46 <clarkb> my hunch here is that we're simply lacking indexes that are necessary to make this performant
19:41:11 <clarkb> since there are at least as many builds as there are buildsets and the builds queries seem to be responsive enough
19:42:49 <clarkb> corvus: are you around?
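To make the investigation approach discussed in the following exchange concrete, here is a minimal sketch of capturing the in-flight buildsets query from the database's process list and asking the server for its plan. The connection details and the zuul_buildset table name are assumptions for illustration only, not details taken from this log.

```python
# Sketch only: host/user/password and the zuul_buildset table name are
# assumptions, not something confirmed in the meeting discussion.
import pymysql

conn = pymysql.connect(host="localhost", user="query_inspector",
                       password="secret", database="zuul")

with conn.cursor() as cur:
    # Capture currently running queries while the slow dashboard request is live.
    cur.execute("SHOW FULL PROCESSLIST")
    # Columns: Id, User, Host, db, Command, Time, State, Info
    candidates = [row for row in cur.fetchall()
                  if row[7] and "zuul_buildset" in row[7] and row[5] > 5]

    for row in candidates:
        query = row[7]
        # Ask the server how it plans to execute the captured query; a full
        # table scan over the buildset table would support the missing-index hunch.
        cur.execute("EXPLAIN " + query)
        for plan_row in cur.fetchall():
            print(plan_row)

conn.close()
```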
19:43:06 <corvus> i am
19:43:13 <clarkb> fwiw the behaviors frickler pointed out do appear to reproduce today as well so not just a "zuul is busy with cve patches" behavior
19:43:29 <clarkb> corvus: cool I wanted to make sure you saw this (don't expect answers right away)
19:43:57 <corvus> yeah i don't have any insight
19:44:37 <frickler> so do we need more inspection on the opendev side or rather someone to look into the zuul implementation side first?
19:45:28 <corvus> i think the next step would be asking opendev's db to explain the query plan in this environment to try to track down the problem
19:45:29 <clarkb> maybe a little bit of both. Looking at the api backend for those queries might point out an obvious inefficiency like a missing index, but if there isn't anything obvious then checking the opendev db server's slow logs is probably the next step
19:45:36 <fungi> ideally we'd spot the culprit on our zuul-web services first, yeah
19:46:11 <corvus> it's unlikely to be reproducible outside of that environment; each rdbms does its own query plans based on its own assumptions about the current data set
19:46:14 <fungi> it's always possible problems we're observing are related to our deployment/data and not upstream defects, after all
19:46:20 <clarkb> corvus: is there a good way to translate the sqlalchemy python query expression to sql? Maybe it's in the logs on the server side?
19:46:35 <corvus> "show full processlist" while the query is running
19:46:38 <clarkb> ack
19:46:58 <fungi> with as long as these queries are taking, capturing that should be trivial
19:47:03 <tonyb> I can look at the performance of the other zuuls I have access to but they're much smaller
19:47:11 <clarkb> so that's basically: find the running query in the process list, then have the server describe the query plan for that query, then determine what if anything needs to change in zuul
19:47:25 <corvus> or on the server. yes.
19:48:14 <clarkb> cool we have one more topic and we're running out of time so let's move on
19:48:19 <frickler> fwiw I don't see any noticeable delay on my downstream zuul server, so yes, likely related to the amount of data opendev has collected
19:48:27 <clarkb> we can follow up on any investigation outside of the meeting
19:48:34 <frickler> ack
19:48:39 <corvus> it's not necessarily the size, it's the characteristics of the data
19:48:51 <clarkb> #topic Reconsider Zuul Pipeline Queue Floor for OpenStack Tenant
19:49:33 <clarkb> openstack's cve patching process exposed that things are quite flaky there at the moment and it is almost impossible that the 20th change in an openstack gate queue would merge at the same time as the first change
19:49:48 <frickler> this is also mainly based on my observations made last week
19:49:54 <clarkb> currently we configure the gate pipeline to have a floor of 20 in its windowing algorithm which means that is the minimum number of changes that will be enqueued
19:50:05 <clarkb> s/enqueued/running jobs at the same time if we have at least that many changes/
19:50:53 <clarkb> I think reducing that number to say 10 would be fine (this was frickler's suggestion in the agenda) because as you point out it is highly unlikely to merge 20 changes together right now and it also isn't that common to have a queue that deep these days
19:51:08 <clarkb> so basically 90% of the time we won't notice either way and the other 10% of the time it is likely to be helpful
19:51:09 <corvus> is there a measurable problem and outcome we would like to achieve?
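As a rough illustration of the windowing behaviour clarkb describes above, the sketch below assumes Zuul's documented dependent-pipeline defaults (the window grows slowly on a passing buildset, halves on a failing one, and never drops below the floor); the exact growth and decrease factors are assumptions, not something stated in this discussion.

```python
# Rough sketch of the sliding-window behaviour being discussed; the
# linear-increase / halve-on-failure defaults are assumptions based on
# Zuul's documented dependent-pipeline settings.
def next_window(window: int, floor: int, buildset_passed: bool) -> int:
    if buildset_passed:
        return window + 1           # grow slowly on success
    return max(floor, window // 2)  # halve on failure, but never below the floor

# With a floor of 20, a flaky gate keeps at least 20 changes' worth of jobs
# running; dropping the floor to 10 frees those nodes for check.
window = 20
for passed in [False, False, True, False]:
    window = next_window(window, floor=10, buildset_passed=passed)
    print(window)  # 10, 10, 11, 10
```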
19:51:33 <fungi> i think, from an optimization perspective, because the gate pipeline will almost always contain fewer items than the check pipeline, that having the floor somewhat overestimate a maximum probable merge window is still preferable
19:51:37 <clarkb> corvus: time to merge openstack changes in the gate, which is impacted by throwing away many test nodes?
19:52:07 <clarkb> corvus: in particular openstack also has clean check but the gate has priority for nodes. So if we're burning nodes in the gate that won't ever pass jobs then check is slowed down which slows down the time to get into the gate
19:52:19 <frickler> also time to check further changes while the gate is using a large percentage of total available resources
19:52:41 <corvus> has "reconsider clean check" been discussed?
19:52:54 <fungi> we'll waste more resources on gate pipeline results by being optimistic, but if it's not starving the gate pipeline of resources then we have options for manually dealing with resource starvation in lower-priority pipelines
19:53:07 <clarkb> not that I am aware of. I suspect that removing clean check would just lead to less stability, but maybe the impact of the lower stability would be minimized
19:53:57 <corvus> if the goal of clean check is to reduce the admission of racy tests, it may be better overall to just run every job twice in gate. :)
19:53:59 <clarkb> fungi: like dequeuing things from the gate?
19:54:20 <fungi> yes, and promoting, and directly enqueuing urgent changes from check
19:54:24 <corvus> i think originally clean check was added because people were throwing bad changes at the gate and people didn't want to wait for them to fall out
19:54:39 <clarkb> corvus: yes one of the original goals of clean check was to reduce the likelihood that a change would be rechecked and forced in before someone looked closer and went "hrm this is actually flaky and broken"
19:54:44 <corvus> (if that's no longer a problem, it may be more trouble than it's worth now)
19:54:53 <clarkb> I think that history has shown people are more likely to just recheck harder rather than investigate though
19:55:42 <clarkb> from opendev's perspective I don't think it creates any issues for us to either remove clean check or reduce the window floor (the window will still grow if things get less flaky)
19:56:02 <frickler> well openstack at least is working on trying to reduce blank rechecks
19:56:08 <clarkb> but OpenStack needs to consider what the various fallout consequences would be in those scenarios
19:56:18 <corvus> how often is the window greater than 10?
19:56:26 <corvus> sorry, the queue depth
19:56:37 <clarkb> corvus: it's very rare these days which is why I mentioned the vast majority of the time it is a noop
19:57:03 <clarkb> but it does still happen around feature freeze, requirements update windows, and security patching with a lot of backports
19:57:09 <clarkb> it's just not daily like it once was
19:57:58 <fungi> random data point, at this moment in time the openstack tenant has one item in the gate pipeline and it's not even for an openstack project ;)
19:58:41 <frickler> and it's not in the integrated queue
19:58:49 <corvus> i don't love the idea, and i especially don't love the idea without any actual metrics around the problem or solution. i think there are better things that can be done first.
19:58:54 <corvus> one of those things is early failure detection
19:59:03 <corvus> is anyone from the openstack tenant working on that?
19:59:27 <corvus> that literally gets bad changes out of the critical path gate queue faster
19:59:36 <clarkb> corvus: I guess the idea is early failure detection could help because it would release resources more quickly to be used elsewhere?
19:59:39 <clarkb> ya
20:00:07 <frickler> I think that that is working for the openstack tenant?
20:00:09 <fungi> devstack/tempest/grenade jobs in particular (tox-based jobs already get it by default i think?)
20:00:31 <corvus> fungi: the zuul project is the only one using that now afaict
20:00:34 <clarkb> frickler: it's disabled by default and not enabled for openstack iirc
20:00:46 <frickler> at least I saw tempest jobs go red before they reported failure
20:00:54 <clarkb> the bug related to that impacted projects whether or not they were enabling the feature
20:01:01 <corvus> the playbook based detection is automatic
20:01:11 <fungi> longer-running jobs (e.g. tempest) would benefit the most from early failure signalling
20:01:14 <corvus> the output based detection needs to be configured
20:01:22 <corvus> (it's regex based)
20:02:06 <clarkb> we are at time. But maybe a good first step is enabling early failure detection for tempest and unittests and seeing if that helps things first
20:02:14 <corvus> but we worked out a regex for zuul that seems to be working with testr; it would be useful to copy that to other tenants
20:02:22 <clarkb> then if that doesn't help we can move on to the next thing which may or may not be removing clean check or reducing the window floor
20:02:38 <fungi> i guess in particular, jobs that "fail" early into a long-running playbook without immediately wrapping up the playbook are the best candidates for optimization there?
20:03:15 <corvus> ++ i'm not super-strongly opposed to changing the floor; i just don't want to throw out the benefits of a deep queue when there are other options.
20:03:45 <clarkb> fungi: ya tempest fits that criterion well as there may be an hour of testing after the first failure I think
20:04:00 <corvus> clarkb: fungi exactly
20:04:16 <fungi> whereas jobs that abort a playbook on the first error are getting most of the benefits already
20:04:23 <corvus> yep
20:04:39 <corvus> imagine the whole gate pipeline turning red and kicking out a bunch of changes in 10 minutes :)
20:05:13 <corvus> (kicking them out of the main line, i mean, not literally ejecting them; not talking about fail-fast here)
20:06:25 <fungi> i do think that would make a significant improvement in throughput/resource usage
20:07:09 <corvus> i can't do that work, but i'm happy to help anyone who wants to
20:08:08 <clarkb> thanks! We're well over time now and I'm hungry. Let's end the meeting here and we can coordinate those improvements either in the regular irc channel or the mailing list
20:08:11 <clarkb> thanks everyone!
20:08:16 <clarkb> #endmeeting
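The regex the zuul tenant worked out for testr is not quoted in the log; as a purely hypothetical illustration of the regex-based output detection discussed above, a pattern along these lines would flag a failing stestr/tempest test as soon as its result line appears in the streamed console output. In a real job this would be wired into the job configuration rather than run as a standalone script.

```python
# Hypothetical illustration only: this is not the regex the zuul tenant uses,
# just an example of matching a testr/stestr failure line as it streams by.
import re

FAILURE_PATTERN = re.compile(r"^.*\.\.\. FAILED\b")

sample_console_lines = [
    "tempest.api.compute.servers.test_create_server.ServersTestJSON.test_verify_server_details [1.204s] ... ok",
    "tempest.api.compute.servers.test_create_server.ServersTestJSON.test_list_servers [0.912s] ... FAILED",
]

for line in sample_console_lines:
    if FAILURE_PATTERN.match(line):
        # An early-failure signal like this lets the build be reported as a
        # failure without waiting for the rest of the run to finish.
        print("early failure detected:", line)
```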