19:00:11 #startmeeting infra
19:00:12 Meeting started Tue Feb 18 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:12 The meeting name has been set to 'infra'
19:00:18 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VM6AC2PI4QEAN5YLPM6UKN7RECVHAOOI/ Our Agenda
19:00:25 #topic Announcements
19:00:41 This has an explicit agenda item but the service coordinator nomination period ends today
19:00:52 we can discuss more later but now is maybe a good time to think if you'd like to run
19:01:02 Anything else to announce
19:01:04 ?
19:02:07 foundation ml discussion
19:02:33 though i think it was already pointed out last week
19:02:34 oh right
19:02:39 it was but it is a good reminder
19:02:43 #link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/thread/3B7OWPRXB4KD2DVX7SYYSHYYRNCKVV46/
19:02:57 the foundation is asking for feedback on a major decision
19:03:22 you can respond on that mailing list thread or reach out to Jonathan directly. fungi and I are happy to help point you in the right direction if you need addresses etc
19:04:04 #topic Zuul-launcher image builds
19:04:27 There are ubuntu jammy and noble nodes now and Zuul is attempting to dogfood the test nodes through the zuul launcher
19:04:46 this has been helpful for finding additional bugs like needing to clear out used but unneeded disk consumption on the launcher
19:04:52 #link https://review.opendev.org/c/opendev/system-config/+/942018 Increase working disk space for zuul-launcher
19:05:03 I think this landed. Might be a good idea to check that the launcher restarted with the new temp dir configured
19:05:21 #link https://review.opendev.org/c/zuul/zuul/+/940824 Dogfood zuul launcher managed nodes in a zuul change
19:05:42 i manually restarted it; it didn't do it automatically
19:05:44 this change is still hitting node failures possibly due to hitting quota limits (the lack of coordination between nodepool and zuul-launcher can cause this)
19:05:46 ack
19:06:09 yeah, we can only dogfood it during quiet times due to the lack of quota error handling
19:06:14 that's next on my list to implement
19:07:32 One other idea that came up with all of my apparmor struggles last week was we might want to have ubuntu images with apparmor preinstalled to mimic the real world better and the zuul launcher cutover might be an opportunity to figure that out
19:07:50 but I think any way we approach that we risk unhappiness from newly broken jobs so we probably do need to be careful
19:08:14 Anything else on this topic?
19:08:18 i feel like building those images would be easy to do with niz at this point, though using them will be hard
19:08:30 due to the quota issue
19:08:39 ya the build side should be straightforward. Just tell dib to add the package
19:08:49 so if it's not urgent, then i'd say wait a bit before starting that project
19:09:03 and if it is urgent, maybe do it with nodepool
19:09:14 ack I don't think it is urgent as this has been the status quo. We're just noticing more because noble is a bit more strict about it in the upstream packaging when installed
19:09:27 ++
19:09:39 on the opendev system-config side of things we're installing apparmor explicitly in many places now which should cover things well for us specifically
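For reference, an illustrative sketch of what "tell dib to add the package" could look like with diskimage-builder's extra-packages flag; the output name and element list here are assumptions, not our actual build configuration:

```shell
# Hypothetical diskimage-builder invocation: -p installs extra packages
# (here apparmor) into the built image so test nodes mimic stock Ubuntu.
DIB_RELEASE=noble disk-image-create -p apparmor \
  -o ubuntu-noble-apparmor ubuntu-minimal vm
```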
19:10:27 #topic Unpinning our Grafana deployment
19:10:33 #link https://review.opendev.org/c/opendev/system-config/+/940997 Update to Grafana 11
19:10:46 I think this got lost in all the noble apparmor fun last week, but reviews still very much welcome
19:11:01 I suspect that we can rip the bandaid off and just send it for this change and if something goes wrong we revert
19:11:14 but I'd be curious to hear what others think in review
19:11:48 seems fine to me, yep
19:12:08 today is a bit of a bad day with everything else going on but maybe tomorrow we land that and see what happens then
19:12:31 #topic Upgrading old servers
19:12:43 As mentioned we ran into more apparmor problems with Noble
19:13:03 previously we had problems with podman kill and docker compose kill due to apparmor rules. We worked around this by using `kill` directly
19:13:14 upstream has since merged a pull request to address this in the apparmor rules
19:13:32 I haven't seen any movement downstream to backport the fix into noble so we may be stuck with our workaround for a while.
19:14:07 Separately we discovered that apparmor rules affect where rsyslogd can open sockets on the filesystem for syslog and containers' ability to read and write to those socket files
19:14:36 I am only aware of a single place where we make use of this functionality and that is with haproxy because it wants to log directly to syslog
19:14:54 we hacked around that with an updated rsyslogd apparmor policy that was suggested by sarnold from the ubuntu security team
19:15:01 I filed a bug against rsyslogd in ubuntu for this
19:15:35 But overall things continue to work and we have been able to find workarounds for the issues
19:16:07 tonyb: not sure if you are around and have anything to add to this topic
19:16:55 Nothing from me
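A local apparmor override of the general shape discussed above might look like the following minimal sketch; the socket path is an illustrative guess, not the exact policy sarnold suggested:

```shell
# Hypothetical drop-in granting rsyslogd access to a socket path that is
# bind mounted into the haproxy container; Ubuntu profiles load local/
# overrides if they exist.
cat <<'EOF' | sudo tee -a /etc/apparmor.d/local/usr.sbin.rsyslogd
/var/haproxy/dev/log rw,
EOF
# Reload the profile so the new rule takes effect.
sudo apparmor_parser -r /etc/apparmor.d/usr.sbin.rsyslogd
```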
19:17:05 #topic Sprinting to Upgrade Servers to Focal
19:17:25 bah I keep forgetting to fix that typo
19:17:27 #undo
19:17:27 Removing item from minutes: #topic Sprinting to Upgrade Servers to Focal
19:17:33 #topic Sprinting to Upgrade Servers to Noble
19:17:53 much of the previous info was discovered when trying to make headway on the backlog of server upgrades by upgrading them to Noble
19:18:20 I would say this was successful in finding previously unknown problems and working through them for future systems. But a bit disappointing in that I only managed to upgrade the zuul-lb and codesearch servers
19:18:31 #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint
19:18:48 I was trying to keep track of things on this etherpad and may continue to do so (it needs some updates already)
19:19:06 and basically keep pushing forward on this as much as possible. Would be helpful if others can dive in too
19:19:47 Starting tomorrow I'll try to pick another one or two servers off the list and work on getting them done next
19:20:01 #topic Running certcheck on bridge
19:20:05 fungi: any updates on this item?
19:21:13 nope :/
19:22:25 #topic Service Coordinator Election
19:22:44 Today is the last day of the nomination period which ends in ~4.5 hours at the end of day UTC time
19:23:12 I haven't seen any nominations. Does this mean we're happy with the status quo and want me to keep sitting in the seat?
19:23:24 I am more than happy for someone else to volunteer and would support them however I can too fwiw
19:24:06 seems so
19:25:24 ok I guess if I don't hear different by ~2300 UTC I can make it official
19:25:40 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NGS2APEFQB45OCJCQ645P5N6XCH52BXW/
19:25:49 thanks!
19:25:50 a link back to the details I wrote down previously
19:25:55 (that isn't my nomination)
19:26:28 #topic Working through our TODO list
19:26:33 #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:26:41 just a reminder we've got a rough todo list on this etherpad
19:26:57 if you would like to dive in feel free to take a look there and reach out if there are any questions
19:27:41 #topic Open Discussion
19:27:59 cloudnull let us know there is a second raxflex region we can start using
19:28:21 The rough process for that would be figuring out the clouds.yaml entry to use it. Updating clouds.yaml and enrolling the region in our cloud launcher config
19:28:31 then launch a new mirror, then add the region to nodepool / zuul-launcher
19:28:52 last call for objections on the pyproject.toml series of changes for bindep. if there are none i'll merge up through the change dropping python 3.6 support
19:28:53 the region is dfw3
19:29:18 bindep changes 816741, 938520 and 938568
19:29:20 I'm hoping we can just add that region to the region list in clouds.yaml for raxflex and be off to the races
19:29:53 we may also need to upload a noble cloud image I guess
19:30:08 with raxflex-dfw3 we can discuss whether we want to stick to floating-ip vs publicnet
19:30:42 I think using floating ips has been valuable to address some assumptions that were made in CI but I don't think we necessarily need to care unless a cloud forces us to use floating IPs
19:30:59 so ya maybe just do the simplest thing which is publicnet and worry about floating ips if they ever become strictly required
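If adding dfw3 to the region list really is all it takes, a quick smoke test from bridge might look like this; the cloud name is an assumption for illustration:

```shell
# Hypothetical sanity checks that the new region authenticates and
# answers API calls once clouds.yaml lists it.
openstack --os-cloud opendevci-raxflex --os-region-name DFW3 server list
openstack --os-cloud opendevci-raxflex --os-region-name DFW3 quota show
```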
19:31:19 fungi: yesterday I noticed that some of the log files on lists.o.o are quite large. Not sure if you saw that note I posted to #opendev
19:31:29 I think we may need to add logrotate rules for those files
19:31:37 what's the quota situation for both rax regions?
19:31:45 both rax flex regions
19:31:45 but wanted to make sure you had a chance to take a look and weigh in on it before I wrote any changes
19:31:57 corvus: I haven't looked at dfw3 but sjc3 is minimal
19:32:04 I think the nodepool config is our quota
19:32:24 I wanted to replace the networking setup in sjc3 before we ask to increase it
19:32:42 i think there's something like a 50 node quota on all accounts in flex by default, but would have to check
19:33:02 we have max servers set to 32 in sjc3
19:33:10 I think that is the current limit there
19:33:13 we might have sized max-servers to 30 for other reasons (ram? vcpu?)
19:33:57 rather than the max instances limit
19:34:06 corvus: one thing that just occurred to me is we could dial back max-servers by ~5 servers in the region zuul-launcher is trying to use
19:34:20 speaking of, https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1 looks suspiciously flat
19:34:21 corvus: that might help with the quota issues if you think that is useful before making zuul-launcher quota aware
19:34:22 iirc we tested iteratively what worked without getting launch failures, not sure why launches failed
19:34:43 clarkb: good idea
19:34:46 corvus: ooh that looks like we're in a can't delete servers loop
19:35:04 is that a rax problem or a nodepool problem?
19:35:52 you'd need to check nodepool logs probably
19:36:01 ya I'm not sure yet. usually it is a cloud provider problem
19:36:17 but it could be that we fail to make valid delete requests so the db state changes to deleting but the cloud never sees it
19:36:20 (so rough plan is: 1. add dfw3; 2. rebuild sjc3 network; 3. re-enable sjc3 ?)
19:36:39 oh ok, just sounded like maybe it was something that happened often
19:37:05 corvus: yup
19:37:20 nodepool.exceptions.LaunchStatusException: Server in error state
19:37:36 oh wait that's ovh
19:37:54 i guess it's possible there may be more than one unhappy public cloud simultaneously
19:38:07 probably not the first time that has happened
19:38:40 The request you have made requires authentication. (HTTP 401)
19:38:51 that is the sjc3 error. for sure.
19:39:15 oh joy, account credentials probably expired out of the cache there again
19:40:21 fungi: tell me more please
19:41:02 i think the most recent time that happened we brought it to the attention of someone in support? time before that i think it was "fixed" by logging into the webui with that account
19:41:34 (which was when we were first setting it up and unable to figure out why the creds we'd been given didn't work)
19:42:14 looks like zl01 saw the same error
19:42:26 so it's not limited to a single host or session
19:42:28 ya so probably need to bring it up with them again
19:44:37 i can confirm i see the same trying to openstack server list from bridge
19:45:40 let's follow up with them after the meeting
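A minimal way to reproduce that auth failure from bridge, assuming the same illustrative cloud name as above:

```shell
# Hypothetical credential check against the affected region; an HTTP 401
# here confirms the credentials are being rejected before any other API
# call is attempted.
openstack --os-cloud opendevci-raxflex --os-region-name SJC3 token issue
```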
19:46:09 Now that the caffeine has hit I have something for open discussion
19:46:22 I also wanted to point out I've pushed like 4 changes to update zuul/zuul-jobs use of container images for things like registries and test nodes to pull from quay mirrors of images instead of docker hub
19:46:32 all in an effort to slowly chip away at the use of docker hub
19:46:59 tonyb: go for it
19:47:05 I've mentioned it before but RDOProject would like to use OpenDev for gerrit
19:48:08 I think there is a reasonable overlap, how would RDO go about getting a tenant / namespace?
19:48:14 Is there a documented process?
19:48:55 just propose new repo(s) with a new namespace?
19:48:59 it's basically the existing request a project process
19:49:00 for a separate zuul tenant it would be some configuration added to zuul and a creation of a dedicated config project for the tenant
19:49:01 yup that
19:49:12 Okay
19:49:23 That's easy
19:49:52 not sure if a new tenant is really needed or populating the opendev tenant a bit more would be ok?
19:49:55 worth talking about what the repos would contain too though... we've had problems in the past with packaging projects that carried forks of other projects
19:50:26 Can gerrit support multiple Zuuls? I suspect there might be some desire to start with gerrit but leave the existing zuul in place
19:50:26 and the other concern is around testing, projects in our gerrit need to use our zuul for gating
19:50:52 you can have third party CI but you can only have one gatekeeper
19:51:03 tonyb: existing zuul = rdo zuul?
19:51:03 Okay
19:51:12 frickler: Yes
19:51:47 another point would be that we could really only import the commits for the existing repos, not their gerrit data from another gerrit
19:52:14 so old change review comments and such would be lost
19:52:24 oh yup. We explicitly can't handle that due to zuul iirc
19:52:34 So OpenDev zuul could be used for gating but the existing RDO zuul could be used as 3rd party where a -1 means no gating
19:52:43 once you import changes from another gerrit you open yourself to change number collisions and zuul doesn't handle those iirc
19:52:44 fungi: understood
19:53:13 -1 would be advisory, it couldn't really block changes from merging i don't think? but it could be used to inform reviewers not to approve
19:53:21 tonyb: the existing check and gate queues don't let third party ci prevent enqueuing to gate. I think if that is desirable then you would want your own tenant and you can set the rule up that way
19:53:36 I think we're only talking about code not gerrit history
19:53:44 Okay noted.
19:53:53 fungi: depends on the pipeline configs. It is advisory as configured in existing tenants but a new tenant could make that a bit stronger with a stricter pipeline config
19:54:08 the other thing that's come up with projects trying to make extensive use of third-party ci is that it's challenging to require that a third-party ci system report results before approval
19:54:10 That all sounds doable to me, but I admit I don't know a lot about the specifics of RDO gating
19:54:50 that should be solvable with a tenant specific pipeline config basically doing clean check with a third party ci
19:55:13 I'm not sure I would necessarily suggest these things particularly if we (opendev) start getting questions about why rdo can't merge things and it is due to an rdo zuul outage
19:55:21 yeah, zuul would need to look for votes from other accounts besides its own, which is configurable at the pipeline level
19:56:11 it might be easier to just integrate rdo's node capacity into opendev
19:56:31 (i believe it is possible to import changes into gerrit and avoid number collisions, but care must be taken)
19:56:38 and possibly also considered more "fair" in some sense
19:56:45 frickler: that's an option, but one I've only recently floated internally
19:57:13 but also this is sort of similar to the "why can't i use opendev's zuul but have my project in github?" discussions. we provide an integrated gerrit+zuul solution, it's hard for us to support external systems we don't control, and not how we intended it to be used
19:57:49 I also didn't want to bring that up now for fear of it seeming like we can add capacity *iff* we can do $other_stuff
19:57:56 that is very much not the case
19:59:58 Thank you all. I think I understand better now what's needed. I also think we can do this in a way that works well and extends OpenDev. I'll bring this up with RDO and try to get a better conversation started
20:00:23 on the topic of gerrit imports, it looks like review.rdoproject.org is currently running gerrit 3.7.8, so not too far behind
20:00:24 As I'll be in the US next month I think that works better for TZ overlap
20:00:40 we are at time
20:00:52 thank you everyone and feel free to continue discussion on the mailing list or in #opendev
20:01:01 I've already pinged cardoe there about the account login issue
20:01:21 fungi: Yeah not too far, but the lack of interest in upgrading it is part of the motivation to switch
20:01:26 #endmeeting