19:00:11 <clarkb> #startmeeting infra
19:00:12 <opendevmeet> Meeting started Tue Feb 18 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:12 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:12 <opendevmeet> The meeting name has been set to 'infra'
19:00:18 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VM6AC2PI4QEAN5YLPM6UKN7RECVHAOOI/ Our Agenda
19:00:25 <clarkb> #topic Announcements
19:00:41 <clarkb> This has an explicit agenda item but the service coordinator nomination period ends today
19:00:52 <clarkb> we can discuss more later but now is maybe a good time to think about whether you'd like to run
19:01:02 <clarkb> Anything else to announce
19:01:04 <clarkb> ?
19:02:07 <fungi> foundation ml discussion
19:02:33 <fungi> though i think it was already pointed out last week
19:02:34 <clarkb> oh right
19:02:39 <clarkb> it was but it is a good reminder
19:02:43 <clarkb> #link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/thread/3B7OWPRXB4KD2DVX7SYYSHYYRNCKVV46/
19:02:57 <clarkb> the foundation is asking for feedback on a major decision
19:03:22 <clarkb> you can respond on that mailing list thread or reach out to Jonathan directly. fungi and I are happy to help point you in the right direction if you need addresses etc
19:04:04 <clarkb> #topic Zuul-launcher image builds
19:04:27 <clarkb> There are ubuntu jammy and noble nodes now and Zuul is attempting to dogfood the test nodes through the zuul launcher
19:04:46 <clarkb> this has been helpful for finding additional bugs like needing to clear out used but unneeded disk consumption on the launcher
19:04:52 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942018 Increase working disk space for zuul-launcher
19:05:03 <clarkb> I think this landed. Might be a good idea to check that the launcher restarted with the new temp dir configured
19:05:21 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/940824 Dogfood zuul launcher managed nodes in a zuul change
19:05:42 <corvus> i manually restarted it; it didn't do it automatically
19:05:44 <clarkb> this change is still hitting node failures possibly due to hitting quota limits (the lack of coordination between nodepool and zuul-launcher can cause this)
19:05:46 <clarkb> ack
19:06:09 <corvus> yeah, we can only dogfood it during quiet times due to the lack of quota error handling
19:06:14 <corvus> that's next on my list to implement
19:07:32 <clarkb> One other idea that came up during all of my apparmor struggles last week was that we might want to have ubuntu images with apparmor preinstalled to mimic the real world better, and the zuul launcher cutover might be an opportunity to figure that out
19:07:50 <clarkb> but I think any way we approach that we risk unhappiness from newly broken jobs so we probably do need to be careful
19:08:14 <clarkb> Anything else on this topic?
19:08:18 <corvus> i feel like building those images would be easy to do with niz at this point, though using them will be hard
19:08:30 <corvus> due to the quota issue
19:08:39 <clarkb> ya the build side should be straightforward. Just tell dib to add the package
19:08:49 <corvus> so if it's not urgent, then i'd say wait a bit before starting that project
19:09:03 <corvus> and if it is urgent, maybe do it with nodepool
19:09:14 <clarkb> ack I don't think it is urgent as this has been the status quo. We're just noticing it more because noble is a bit more strict about it in the upstream packaging when apparmor is installed
19:09:27 <corvus> ++
19:09:39 <clarkb> on the opendev system-config side of things we're installing apparmor explicitly in many places now which should cover things well for us specifically
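A minimal sketch of what "tell dib to add the package" could look like, assuming a diskimage-builder element is used: packages go in a package-installs.yaml file inside the element, and the element is referenced from the image definition. The element and image names below are illustrative, not the actual opendev configuration.

    # package-installs.yaml in a hypothetical "apparmor-enabled" element;
    # diskimage-builder installs each listed package into the built image.
    apparmor:
    apparmor-utils:

    # the element then gets referenced from an illustrative image definition
    diskimages:
      - name: ubuntu-noble
        elements:
          - ubuntu-minimal
          - vm
          - apparmor-enabled   # hypothetical element adding the packages above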
19:10:27 <clarkb> #topic Unpinning our Grafana deployment
19:10:33 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/940997 Update to Grafana 11
19:10:46 <clarkb> I think this got lost in all the noble apparmor fun last week, but reviews still very much welcome
19:11:01 <clarkb> I suspect that we can rip the bandaid off and just send it for this change and if something goes wrong we revert
19:11:14 <clarkb> but I'd be curious to hear what others think in review
19:11:48 <fungi> seems fine to me, yep
19:12:08 <clarkb> today is a bit of a bad day with everything else going on but maybe tomorrow we land that and see what happens then
19:12:31 <clarkb> #topic Upgrading old servers
19:12:43 <clarkb> As mentioned we ran into more apparmor problems with Noble
19:13:03 <clarkb> previously we had problems with podman kill and docker compose kill due to apparmor rules. We worked around this by using `kill` directly
19:13:14 <clarkb> upstream has since merged a pull request to address this in the apparmor rules
19:13:32 <clarkb> I haven't seen any movement downstream to backport the fix into noble so we may be stuck with our workaround for a while.
19:14:07 <clarkb> Separately we discovered that apparmor rules affect where rsyslogd can open sockets on the filesystem for syslog and containers' ability to read and write to those socket files
19:14:36 <clarkb> I am only aware of a single place where we make use of this functionality and that is with haproxy because it wants to log directly to syslog
19:14:54 <clarkb> we hacked around that with an updated rsyslogd apparmor policy that was suggested by sarnold from the ubuntu security team
19:15:01 <clarkb> I filed a bug against rsyslogd in ubuntu for this
19:15:35 <clarkb> But overall things continue to work and we have been able to find workarounds for the issues
19:16:07 <clarkb> tonyb: not sure if you are around and have anything to add to this topic
19:16:55 <tonyb> Nothing from me
19:17:05 <clarkb> #topic Sprinting to Upgrade Servers to Focal
19:17:25 <clarkb> bah I keep forgetting to fix that typo
19:17:27 <clarkb> #undo
19:17:27 <opendevmeet> Removing item from minutes: #topic Sprinting to Upgrade Servers to Focal
19:17:33 <clarkb> #topic Sprinting to Upgrade Servers to Noble
19:17:53 <clarkb> much of the previous info was discovered when trying to make headway on the backlog of server upgrades by upgrading them to Noble
19:18:20 <clarkb> I would say this was successful in finding previously unknown problems and working through them for future systems. But a bit disappointing in that I only managed to upgrade zuul-lb and codesearch servers
19:18:31 <clarkb> #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint
19:18:48 <clarkb> I was trying to keep track of things on this etherpad and may continue to do so (it needs some updates already)
19:19:06 <clarkb> and basically keep pushing forward on this as much as possible. Would be helpful if others can dive in too
19:19:47 <clarkb> Starting tomorrow I'll try to pick another one or two servers off the list and work on getting them done next
19:20:01 <clarkb> #topic Running certcheck on bridge
19:20:05 <clarkb> fungi: any updates on this item?
19:21:13 <fungi> nope :/
19:22:25 <clarkb> #topic Service Coordinator Election
19:22:44 <clarkb> Today is the last day of the nomination period which ends in ~4.5 hours at the end of day UTC time
19:23:12 <clarkb> I haven't seen any nominations. Does this mean we're happy with status quo and want me to keep sitting in the seat?
19:23:24 <clarkb> I am more than happy for someone else to volunteer and would support them however I can too fwiw
19:24:06 <fungi> seems so
19:25:24 <clarkb> ok I guess if I don't hear different by ~2300 UTC I can make it official
19:25:40 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NGS2APEFQB45OCJCQ645P5N6XCH52BXW/
19:25:49 <fungi> thanks!
19:25:50 <clarkb> a link back to the details I wrote down previously
19:25:55 <clarkb> (that isn't my nomination)
19:26:28 <clarkb> #topic Working through our TODO list
19:26:33 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:26:41 <clarkb> just a reminder we've got a rough todo list on this etherpad
19:26:57 <clarkb> if you would like to dive in feel free to take a look there and reach out if there are any questions
19:27:41 <clarkb> #topic Open Discussion
19:27:59 <clarkb> cloudnull let us know there is a second raxflex region we can start using
19:28:21 <clarkb> The rough process for that would be figuring out the clouds.yaml entry to use it, updating clouds.yaml, and enrolling the region in our cloud launcher config
19:28:31 <clarkb> then launch a new mirror, then add the region to nodepool / zuul-launcher
19:28:52 <fungi> last call for objections on the pyproject.toml series of changes for bindep. if there are none i'll merge up through the change dropping python 3.6 support
19:28:53 <clarkb> the region is dfw3
19:29:18 <fungi> bindep changes 816741, 938520 and 938568
19:29:20 <clarkb> I'm hoping we can just add that region to the region list in clouds.yaml for raxflex and be off to the races
19:29:53 <clarkb> we may also need to upload a noble cloud image I guess
19:30:08 <fungi> with raxflex-dfw3 we can discuss whether we want to stick to floating-ip vs publicnet
19:30:42 <clarkb> I think using floating ips has been valuable to address some assumptions that were made in CI but I don't think we necessarily need to care unless a cloud forces us to use floating IPs
19:30:59 <clarkb> so ya maybe just do the simplest thing which is publicnet and worry about floating ips if they ever become strictly required
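For reference, a rough sketch of the kind of clouds.yaml change being described here, adding DFW3 alongside the existing SJC3 region; the cloud name, auth details, and exact region spellings are assumptions for illustration only, not the real entries.

    clouds:
      opendevci-raxflex:                 # illustrative cloud name
        auth_type: v3applicationcredential
        auth:
          auth_url: https://keystone.example.rackspacecloud.com/v3   # placeholder
          application_credential_id: REDACTED
          application_credential_secret: REDACTED
        regions:
          - SJC3
          - DFW3                         # newly added second region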
19:31:19 <clarkb> fungi: yesterday I noticed that some of the log files on lists.o.o are quite large. Not sure if you saw that note I posted to #opendev
19:31:29 <clarkb> I think we may need to add logrotate rules for those files
19:31:37 <corvus> what's the quota situation for both rax regions?
19:31:45 <corvus> both rax flex regions
19:31:45 <clarkb> but wanted to make sure you had a chance to take a look and weigh in on it before I wrote any changes
19:31:57 <clarkb> corvus: I haven't looked at dfw3 but sjc3 is minimal
19:32:04 <clarkb> I think the nodepool config is our quota
19:32:24 <clarkb> I wanted to replace the networking setup in sjc3 before we ask to increase it
19:32:42 <fungi> i think there's something like a 50 node quota on all accounts in flex by default, but would have to check
19:33:02 <clarkb> we have max servers set to 32 in sjc3
19:33:10 <clarkb> I think that is the current limit there
19:33:13 <fungi> we might have sized max-servers to 30 for other reasons (ram? vcpu?)
19:33:57 <fungi> rather than the max instances limit
19:34:06 <clarkb> corvus: one thing that just occurred to me is we could dial back max-servers by ~5 servers in the region zuul-launcher is trying to use
19:34:20 <corvus> speaking of, https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1 looks suspiciously flat
19:34:21 <clarkb> corvus: that might help with the quota issues if you think that is useful before making zuul-launcher quota aware
19:34:22 <frickler> iirc we tested iteratively what worked without getting launch failures, not sure why launches failed
19:34:43 <corvus> clarkb: good idea
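As a sketch of the "dial back max-servers" idea, the change would just lower the pool limit in the nodepool provider configuration for the region zuul-launcher shares; the provider and pool names and the exact number below are illustrative assumptions.

    providers:
      - name: raxflex-sjc3               # illustrative provider name
        pools:
          - name: main
            max-servers: 27              # was 32; leaves ~5 instances of headroom for zuul-launcher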
19:34:46 <clarkb> corvus: ooh that looks like we're in a can't delete servers loop
19:35:04 <corvus> is that a rax problem or a nodepool problem?
19:35:52 <frickler> you'd need to check nodepool logs probably
19:36:01 <clarkb> ya I'm not sure yet. usually it is a cloud provider problem
19:36:17 <clarkb> but it could be that we fail to make valid delete requests so the db state changes to deleting but the cloud never sees it
19:36:20 <corvus> (so rough plan is: 1. add dfw3; 2. rebuild sjc3 network; 3. re-enable sjc3 ?)
19:36:39 <corvus> oh ok, just sounded like maybe it was something that happened often
19:37:05 <clarkb> corvus: yup
19:37:20 <corvus> nodepool.exceptions.LaunchStatusException: Server in error state
19:37:36 <corvus> oh wait thats ovh
19:37:54 <corvus> i guess it's possible there may be more than one unhappy public cloud simultaneously
19:38:07 <clarkb> probably not the first time that has happened
19:38:40 <corvus> The request you have made requires authentication. (HTTP 401)
19:38:51 <corvus> that is the sjc3 error.  for sure.
19:39:15 <fungi> oh joy, account credentials probably expired out of the cache there again
19:40:21 <corvus> fungi: tell me more please
19:41:02 <fungi> i think the most recent time that happened we brought it to the attention of someone in support? time before that i think it was "fixed" by logging into the webui with that account
19:41:34 <fungi> (which was when we were first setting it up and unable to figure out why the creds we'd been given didn't work)
19:42:14 <corvus> looks like zl01 saw the same error
19:42:26 <corvus> so it's not limited to a single host or session
19:42:28 <clarkb> ya so probably need to bring it up with them again
19:44:37 <fungi> i can confirm i see the same trying to openstack server list from bridge
19:45:40 <clarkb> lets followup with them after the meeting
19:46:09 <tonyb> Now that the caffeine has hit I have something for open discussion
19:46:22 <clarkb> I also wanted to point out I've pushed like 4 changes to update zuul/zuul-jobs use of container images for things like registries and test nodes to pull from quay mirrors of images instead of docker hub
19:46:32 <clarkb> all in an effort to slowly chip away at the use of docker hub
19:46:59 <clarkb> tonyb: go for it
19:47:05 <tonyb> I've mentioned it before but RDOProject would like to use OpenDev for gerrit
19:48:08 <tonyb> I think there is a reasonable overlap. How would RDO go about getting a tenant / namespace?
19:48:14 <tonyb> Is there a documented process?
19:48:55 <frickler> just propose new repo(s) with a new namespace?
19:48:59 <clarkb> its basically the existing request a project process
19:49:00 <fungi> for a separate zuul tenant it would be some configuration added to zuul and a creation of a dedicated config project for the tenant
19:49:01 <clarkb> yup that
19:49:12 <tonyb> Okay
19:49:23 <tonyb> That's easy
19:49:52 <frickler> not sure if a new tenant is really needed or populating the opendev tenant a bit more would be ok?
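If a separate tenant were created, the addition to Zuul's tenant configuration might look roughly like the sketch below; the tenant name and repository names are made up for illustration, and a real change would also need the dedicated config project bootstrapped with its own pipeline definitions.

    - tenant:
        name: rdoproject                     # hypothetical tenant name
        source:
          gerrit:
            config-projects:
              - rdo/zuul-config              # hypothetical dedicated config project
            untrusted-projects:
              - rdo/example-packaging-repo   # hypothetical project repo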
19:49:55 <fungi> worth talking about what the repos would contain too though... we've had problems in the past with packaging projects that carried forks of other projects
19:50:26 <tonyb> Can gerrit support multiple Zuuls?  I suspect there might be some desire to start with gerrit but leave the existing zuul in place
19:50:26 <fungi> and the other concern is around testing, projects in our gerrit need to use our zuul for gating
19:50:52 <clarkb> you can have third party CI but you can only have one gatekeeper
19:51:03 <frickler> tonyb: existing zuul = rdo zuul?
19:51:03 <tonyb> Okay
19:51:12 <tonyb> frickler: Yes
19:51:47 <fungi> another point would be that we could really only import the commits for the existing repos, not their gerrit data from another gerrit
19:52:14 <fungi> so old change review comments and such would be lost
19:52:24 <clarkb> oh yup. We explicitly can't handle that due to zuul iirc
19:52:34 <tonyb> So OpenDev zuul could be used for gating but the existing RDO zuul could be used as 3rd party where a -1 means no gating
19:52:43 <clarkb> once you import changes from another gerrit you open yourself to change number collisions and zuul doesn't handle those iirc
19:52:44 <tonyb> fungi: understood
19:53:13 <fungi> -1 would be advisory, it couldn't really block changes from merging i don't think? but it could be used to inform reviewers not to approve
19:53:21 <clarkb> tonyb: the existing check and gate queues don't let third party ci prevent enqueuing to gate. I think if that is desirable then you would want your own tenant and you can set the rule up that way
19:53:36 <tonyb> I think we're only talking about code not gerrit history
19:53:44 <tonyb> Okay noted.
19:53:53 <clarkb> fungi: depends on the pipeline configs. It is advisory as configured in existing tenants but a new tenant could make that a bit stronger with a stricter pipeline config
19:54:08 <fungi> the other thing that's come up with projects trying to make extensive use of third-party ci is that it's challenging to require that a third-party ci system report results before approval
19:54:10 <tonyb> That all sounds doable to me, but I admit I don't know a lot about the specifics of RDO gating
19:54:50 <clarkb> that should be solvable with a tenant specific pipeline config basically doing clean check with a third party ci
19:55:13 <clarkb> I'm not sure I would necessarily suggest these things particularly if we (opendev) start getting questions about why rdo can't merge things and it is due to an rdo zuul outage
19:55:21 <fungi> yeah, zuul would need to look for votes from other accounts besides its own, which is configurable at the pipeline level
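A sketch of what is being described here, assuming the tenant carried its own gate pipeline: Zuul's gerrit driver can require an approval vote from another account (such as a third-party CI) before a change is enqueued. The account name is invented and this is only a fragment of a full pipeline definition.

    - pipeline:
        name: gate
        manager: dependent
        require:
          gerrit:
            approval:
              - username: rdo-third-party-ci   # hypothetical third-party CI account
                Verified: [1, 2]               # require a +1 or +2 Verified vote from that account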
19:56:11 <frickler> it might be easier to just integrate rdo's node capacity into opendev
19:56:31 <corvus> (i believe it is possible to import changes into gerrit and avoid number collisions, but care must be taken)
19:56:38 <frickler> and possibly also considered more "fair" in some sense
19:56:45 <tonyb> frickler: that's an option, but one I've only recently floated internally
19:57:13 <fungi> but also this is sort of similar to the "why can't i use opendev's zuul but have my project in github?" discussions. we provide an integrated gerrit+zuul solution, it's hard for us to support external systems we don't control, and not how we intended it to be used
19:57:49 <tonyb> I also didn't want to bring that up now for fear of it seeming like we can add capacity *iff* we can do $other_stuff
19:57:56 <tonyb> that is very much not the case
19:59:58 <tonyb> Thank you all.  I think I understand better now what's needed.  I also think we can do this in a way that works well and extends OpenDev.  I'll bring this up with RDO and try to get a better conversation started
20:00:23 <fungi> on the topic of gerrit imports, it looks like review.rdoproject.org is currently running gerrit 3.7.8, so not too far behind
20:00:24 <tonyb> As I'll be in the US next month I think that works better for TZ overlap
20:00:40 <clarkb> we are at time
20:00:52 <clarkb> thank you everyone and feel free to continue discussion on the mailing list or in #opendev
20:01:01 <clarkb> I've already pinged cardoe there about the account login issue
20:01:21 <tonyb> fungi: Yeah not too far, but the lack of interest in upgrading it is part of the motivation to switch
20:01:26 <clarkb> #endmeeting