Tuesday, 2025-02-18

-opendevstatus- NOTICE: nominations for the OpenStack PTL and TC positions are closing soon, for details see https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/message/7DKEV7IEHOTHED7RVEFG7WIDVUC4MY3Z/15:57
clarkbhello it is our weekly meeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Feb 18 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VM6AC2PI4QEAN5YLPM6UKN7RECVHAOOI/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbThis has an explicit agenda item but the service coordinator nomination period ends today19:00
clarkbwe can discuss more later but now is maybe a good time to think if you'd like to run19:00
clarkbAnything else to announce19:01
clarkb?19:01
fungifoundation ml discussion19:02
fungithough i think it was already pointed out last week19:02
clarkboh right19:02
clarkbit was but it is a good reminder19:02
clarkb#link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/thread/3B7OWPRXB4KD2DVX7SYYSHYYRNCKVV46/19:02
clarkbthe foundation is asking for feedback on a major decision19:02
clarkbyou can respond on that mailing list thread or reach out to Jonathan directly. fungi and I are happy to help point you in the right direction if you need addresses etc19:03
clarkb#topic Zuul-launcher image builds19:04
clarkbThere are ubuntu jammy and noble nodes now and Zuul is attempting to dogfood the test nodes through the zuul launcher19:04
clarkbthis has been helpful for finding additional bugs like needing to clear out used but unneeded disk consumption on the launcher19:04
clarkb#link https://review.opendev.org/c/opendev/system-config/+/942018 Increase working disk space for zuul-launcher19:04
clarkbI think this landed. Might be a good idea to check that the launcher restarted with the new temp dir configured19:05
clarkb#link https://review.opendev.org/c/zuul/zuul/+/940824 Dogfood zuul launcher managed nodes in a zuul change19:05
corvusi manually restarted it; it didn't do it automatically19:05
clarkbthis change is still hitting node failures possibly due to hitting quota limits (the lack of coordination between nodepool and zuul-launcher can cause this)19:05
clarkback19:05
corvusyeah, we can only dogfood it during quiet times due to the lack of quota error handling19:06
corvusthat's next on my list to implement19:06
clarkbOne other idea that came up with all of my apparmor struggles last week was we might want to have ubuntu images with apparmor preinstalled to mimic the real world better and the zuul launcher cutover might be an opportunity to figure that out19:07
clarkbbut I think any way we approach that we risk unhappiness from newly broken jobs so probably do need to be careful19:07
clarkbAnything else on this topic?19:08
corvusi feel like building those images would be easy to do with niz at this point, though using them will be hard19:08
corvusdue to the quota issue19:08
clarkbya the build side should be straightforward. Just tell dib to add the package19:08
corvusso if it's not urgent, then i'd say wait a bit before starting that project19:08
corvusand if it is urgent, maybe do it with nodepool19:09
clarkback I don't think it is urgent as this has been the status quo. We're just noticing more because noble is a bit more strict about it in the upstream packaging when installed19:09
corvus++19:09
clarkbon the opendev system-config side of things we're installing apparmor explicitly in many places now which should cover things well for us specifically19:09
clarkb#topic Unpinning our Grafana deployment19:10
clarkb#link https://review.opendev.org/c/opendev/system-config/+/940997 Update to Grafana 1119:10
clarkbI think this got lost in all the noble apparmor fun last week, but reviews still very much welcome19:10
clarkbI suspect that we can rip the bandaid off and just send it for this change and if something goes wrong we revert19:11
clarkbbut I'd be curious to hear what others think in review19:11
fungiseems fine to me, yep19:11
clarkbtoday is a bit of a bad day with everything else going on but maybe tomorrow we land that and see what happens then19:12
clarkb#topic Upgrading old servers19:12
clarkbAs mentioned we ran into more apparmor problems with Noble19:12
clarkbpreviously we had problems with podman kill and docker compose kill due to apparmor rules. We worked around this by using `kill` directly19:13
clarkbupstream has since merged a pull request to address this in the apparmor rules19:13
clarkbI haven't seen any movement downstream to backport the fix into noble so we may be stuck with our workaround for a while.19:13
clarkbSeparately we discovered that apparmor rules affect where rsyslogd can open sockets on the filesystem for syslog and containers' ability to read and write to those socket files19:14
clarkbI am only aware of a single place where we make use of this functionality and this is with haproxy because it wants to log directly to syslog19:14
clarkbwe hacked around that with an updated rsyslogd apparmor policy that was suggested by sarnold from the ubuntu security team19:14
clarkbI filed a bug against rsyslogd in ubuntu for this19:15
clarkbBut overall things continue to work and we have been able to find workarounds for the issues19:15
clarkbtonyb: not sure if you are around and have anything to add to this topic19:16
tonybNothing from me19:16
clarkb#topic Sprinting to Upgrade Servers to Focal19:17
clarkbbah I keep forgetting to fix that typo19:17
clarkb#undo19:17
opendevmeetRemoving item from minutes: #topic Sprinting to Upgrade Servers to Focal19:17
clarkb#topic Sprinting to Upgrade Servers to Noble19:17
clarkbmuch of the previous info was discovered when trying to make headway on the backlog of server upgrades by upgrading them to Noble19:17
clarkbI would say this was successful in finding previously unknown problems and working through them for future systems. But a bit disappointing in that I only managed to upgrade zuul-lb and codesearch servers19:18
clarkb#link https://etherpad.opendev.org/p/opendev-server-replacement-sprint19:18
clarkbI was trying to keep track of things on this etherpad and may continue to do so (it needs some updates already)19:18
clarkband basically keep pushing forward on this as much as possible. Would be helpful if others can dive in too19:19
clarkbStarting tomorrow I'll try to pick another one or two servers off the list and work on getting them done next19:19
clarkb#topic Running certcheck on bridge19:20
clarkbfungi: any updates on this item?19:20
funginope :/19:21
clarkb#topic Service Coordinator Election19:22
clarkbToday is the last day of the nomination period which ends in ~4.5 hours at the end of day UTC time19:22
clarkbI haven't seen any nominations. Does this mean we're happy with status quo and want me to keep sitting in the seat?19:23
clarkbI am more than happy for someone else to volunteer and would support them however I can too fwiw19:23
fungiseems so19:24
clarkbok I guess if I don't hear different by ~2300 UTC I can make it official19:25
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NGS2APEFQB45OCJCQ645P5N6XCH52BXW/19:25
fungithanks!19:25
clarkba link back to the details I wrote down previously19:25
clarkb(that isn't my nomination)19:25
clarkb#topic Working through our TODO list19:26
clarkb#link https://etherpad.opendev.org/p/opendev-january-2025-meetup19:26
clarkbjust a reminder we've got a rough todo list on this etherpad19:26
clarkbif you would like to dive in feel free to take a look there and reach out if there are any questions19:26
clarkb#topic Open Discussion19:27
clarkbcloudnull let us know there is a second raxflex region we can start using19:27
clarkbThe rough process for that would be figuring out the clouds.yaml entry to use it. Updating clouds.yaml and enrolling the region in our cloud launcher config19:28
clarkbthen launch a new mirror, then add the region to nodepool / zuul-launcher19:28
fungilast call for objections on the pyproject.toml series of changes for bindep. if there are none i'll merge up through the change dropping python 3.6 support19:28
clarkbthe region is dfw319:28
fungibindep changes 816741, 938520 and 93856819:29
clarkbI'm hoping we can just add that region to the region list in clouds.yaml for raxflex and be off to the races19:29
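The clouds.yaml step described above could look roughly like the following. This is a minimal sketch, not the actual system-config change: it assumes the raxflex entry already carries a `regions` list, and the dictionary layout, the `raxflex` key, and the lowercase region names are illustrative stand-ins taken from the names used in this log.

```python
import copy

# Hypothetical, simplified stand-in for a parsed clouds.yaml entry.
# The real file holds auth settings and more; only the region list is modeled.
clouds = {
    "clouds": {
        "raxflex": {
            "regions": ["sjc3"],
        }
    }
}


def add_region(config, cloud, region):
    """Return a copy of config with region appended to the cloud's region list."""
    updated = copy.deepcopy(config)
    regions = updated["clouds"][cloud].setdefault("regions", [])
    # Skip the append if the region is already listed, so reruns are harmless.
    if region not in regions:
        regions.append(region)
    return updated


updated = add_region(clouds, "raxflex", "dfw3")
print(updated["clouds"]["raxflex"]["regions"])
```

The helper returns a copy instead of editing in place, so the original mapping can be compared against the result before anything is written back.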
clarkbwe may also need to upload a noble cloud image I guess19:29
fungiwith raxflex-dfw3 we can discuss whether we want to stick to floating-ip vs publicnet19:30
clarkbI think using floating ips has been valuable to address some assumptions that were made in CI but I don't think we necessarily need to care unless a cloud forces us to use floating IPs19:30
clarkbso ya maybe just do the simplest thing which is publicnet and worry about floating ips if they ever become strictly required19:30
clarkbfungi: yesterday I noticed that some of the log files on lists.o.o are quite large. Not sure if you saw that note I posted to #opendev19:31
clarkbI think we may need to add logrotate rules for those files19:31
corvuswhat's the quota situation for both rax regions?19:31
corvusboth rax flex regions19:31
clarkbbut wanted to make sure you had a chance to take a look and weigh in on it before I wrote any changes19:31
clarkbcorvus: I haven't looked at dfw3 but sjc3 is minimal19:31
clarkbI think the nodepool config is our quota19:32
clarkbI wanted to replace the networking setup in sjc3 before we ask to increase it19:32
fungii think there's something like a 50 node quota on all accounts in flex by default, but would have to check19:32
clarkbwe have max servers set to 32 in sjc319:33
clarkbI think that is the current limit there19:33
fungiwe might have sized max-servers to 30 for other reasons (ram? vcpu?)19:33
fungirather than the max instances limit19:33
clarkbcorvus: one thing that just occurred to me is we could dial back max-servers by ~5 servers in the region zuul-launcher is trying to use19:34
corvusspeaking of, https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1 looks suspiciously flat19:34
clarkbcorvus: that might help with the quota issues if you think that is useful before making zuul-launcher quota aware19:34
frickleriirc we tested iteratively what worked without getting launch failures, not sure why launches failed19:34
corvusclarkb: good idea19:34
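The headroom idea above comes down to simple arithmetic. The sketch below uses the max-servers value of 32 quoted for sjc3 and the "~5 servers" suggestion; the function name is hypothetical and the real setting lives in the nodepool configuration, not in code like this.

```python
def reduced_max_servers(current_max_servers, launcher_headroom):
    """Return a nodepool max-servers value that leaves room for zuul-launcher."""
    if launcher_headroom < 0:
        raise ValueError("headroom must not be negative")
    # Never go below zero, even if the requested headroom exceeds the limit.
    return max(current_max_servers - launcher_headroom, 0)


# 32 is the max-servers value quoted for sjc3; 5 is the suggested reduction.
print(reduced_max_servers(32, 5))  # 27
```

With those numbers nodepool would stop at 27 servers, leaving about 5 instances that zuul-launcher could use without the two systems competing for the same quota.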
clarkbcorvus: ooh that looks like we're in a can't delete servers loop19:34
corvusis that a rax problem or a nodepool problem?19:35
frickleryou'd need to check nodepool logs probably19:35
clarkbya I'm not sure yet. usually it is a cloud provider problem19:36
clarkbbut it could be that we fail to make valid delete requests so the db state changes to deleting but the cloud never sees it19:36
corvus(so rough plan is: 1. add dfw3; 2. rebuild sjc3 network; 3. re-enable sjc3 ?) 19:36
corvusoh ok, just sounded like maybe it was something that happened often19:36
clarkbcorvus: yup19:37
corvusnodepool.exceptions.LaunchStatusException: Server in error state19:37
corvusoh wait that's ovh19:37
corvusi guess it's possible there may be more than one unhappy public cloud simultaneously19:37
clarkbprobably not the first time that has happened19:38
corvusThe request you have made requires authentication. (HTTP 401)19:38
corvusthat is the sjc3 error.  for sure.19:38
fungioh joy, account credentials probably expired out of the cache there again19:39
corvusfungi: tell me more please19:40
fungii think the most recent time that happened we brought it to the attention of someone in support? time before that i think it was "fixed" by logging into the webui with that account19:41
fungi(which was when we were first setting it up and unable to figure out why the creds we'd been given didn't work)19:41
corvuslooks like zl01 saw the same error19:42
corvusso it's not limited to a single host or session19:42
clarkbya so probably need to bring it up with them again19:42
fungii can confirm i see the same trying to openstack server list from bridge19:44
clarkblet's follow up with them after the meeting19:45
tonybNow that the caffeine has hit I have something for open discussion 19:46
clarkbI also wanted to point out I've pushed like 4 changes to update zuul/zuul-jobs use of container images for things like registries and test nodes to pull from quay mirrors of images instead of docker hub19:46
clarkball in an effort to slowly chip away at the use of docker hub19:46
clarkbtonyb: go for it19:46
tonybI've mentioned it before but RDOProject would like to use OpenDev got gerrit19:47
tonybs/got/for/19:47
tonybI think there is a reasonable overlap, how would RDO go about getting a tenant / namespace?19:48
tonybIs there a documented process?19:48
fricklerjust propose new repo(s) with a new namespace?19:48
clarkbit's basically the existing request a project process19:48
fungifor a separate zuul tenant it would be some configuration added to zuul and a creation of a dedicated config project for the tenant19:49
clarkbyup that19:49
tonybOkay19:49
tonybThat's easy19:49
fricklernot sure if a new tenant is really needed or populating the opendev tenant a bit more would be ok?19:49
fungiworth talking about what the repos would contain too though... we've had problems in the past with packaging projects that carried forks of other projects19:49
tonybCan gerrit support multiple Zuuls?  I suspect there might be some desire to start with gerrit but leave the existing zuul in place19:50
fungiand the other concern is around testing, projects in our gerrit need to use our zuul for gating19:50
clarkbyou can have third party CI but you can only have one gatekeeper19:50
fricklertonyb: existing zuul = rdo zuul?19:51
tonybOkay19:51
tonybfrickler: Yes19:51
fungianother point would be that we could really only import the commits for the existing repos, not their gerrit data from another gerrit19:51
fungiso old change review comments and such would be lost19:52
clarkboh yup. We explicitly can't handle that due to zuul iirc19:52
tonybSo OpenDev zuul could be used for gating but the existing RDO zuul could be used as 3rd party where a -1 means no gating19:52
clarkbonce you import changes from another gerrit you open yourself to change number collisions and zuul doesn't handle those iirc19:52
tonybfungi: understood19:52
fungi-1 would be advisory, it couldn't really block changes from merging i don't think? but it could be used to inform reviewers not to approve19:53
clarkbtonyb: the existing check and gate queues don't let third party ci prevent enqueuing to gate. I think if that is desirable then you would want your own tenant and you can set the rule up that way19:53
tonybI think we're only talking about code not gerrit history19:53
tonybOkay noted.19:53
clarkbfungi: depends on the pipeline configs. It is advisory as configured in existing tenants but a new tenant could make that a bit stronger with a stricter pipeline config19:53
fungithe other thing that's come up with projects trying to make extensive use of third-party ci is that it's challenging to require that a third-party ci system report results before approval19:54
tonybThat all sounds doable to me, but I admit I don't know a lot about the specifics of RDO gating19:54
clarkbthat should be solvable with a tenant specific pipeline config basically doing clean check with a third party ci19:54
clarkbI'm not sure I would necessarily suggest these things particularly if we (opendev) start getting questions about why rdo can't merge things and it is due to an rdo zuul outage19:55
fungiyeah, zuul would need to look for votes from other accounts besides its own, which is configurable at the pipeline level19:55
fricklerit might be easier to just integrate rdo's node capacity into opendev19:56
corvus(i believe it is possible to import changes into gerrit and avoid number collisions, but care must be taken)19:56
fricklerand possibly also considered more "fair" in some sense19:56
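The change number collision concern raised above by clarkb and corvus can be illustrated with a small check. This is only a sketch of the idea: the change numbers are made up, and it does not reflect how Gerrit imports changes or how Zuul looks them up.

```python
def colliding_change_numbers(existing, imported):
    """Return the sorted change numbers present in both Gerrit instances."""
    return sorted(set(existing) & set(imported))


# Hypothetical change numbers from two independent Gerrit servers.
existing_changes = [101, 102, 103, 250]
imported_changes = [102, 250, 900]

print(colliding_change_numbers(existing_changes, imported_changes))  # [102, 250]
```

An empty result would mean the two sets of numbers do not overlap. Any overlap is the kind of ambiguity that importing only commits, and not Gerrit review data, avoids.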
tonybfrickler: that's an option, but one I've only recently floated internally19:56
fungibut also this is sort of similar to the "why can't i use opendev's zuul but have my project in github?" discussions. we provide an integrated gerrit+zuul solution, it's hard for us to support external systems we don't control, and not how we intended it to be used19:57
tonybI also didn't want to bring that up now for fear of it seeming like we can add capacity *iff* we can do $other_stuff19:57
tonybthat is very much not the case19:57
tonybThank you all.  I think I understand better now what's needed.  I also think we can do this in a way that works well and extends OpenDev.  I'll bring this up with RDO and try to get a better conversation started19:59
fungion the topic of gerrit imports, it looks like review.rdoproject.org is currently running gerrit 3.7.8, so not too far behind20:00
tonybAs I'll be in the US next month I think that works better for TZ overlap20:00
clarkbwe are at time20:00
clarkbthank you everyone and feel free to continue discussion on the mailing list or in #opendev20:00
clarkbI've already pinged cardoe there about the account login issue20:01
tonybfungi: Yeah not too far, but the lack of interest in upgrading it is part of the motivation to switch20:01
clarkb#endmeeting20:01
opendevmeetMeeting ended Tue Feb 18 20:01:26 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:01
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-18-19.00.html20:01
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-18-19.00.txt20:01
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-18-19.00.log.html20:01