-opendevstatus- NOTICE: nominations for the OpenStack PTL and TC positions are closing soon, for details see https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/message/7DKEV7IEHOTHED7RVEFG7WIDVUC4MY3Z/ | 15:57 | |
clarkb | hello it is our weekly meeting time | 19:00 |
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Feb 18 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VM6AC2PI4QEAN5YLPM6UKN7RECVHAOOI/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | This has an explicit agenda item but the service coordinator nomination period ends today | 19:00 |
clarkb | we can discuss more later but now is maybe a good time to think if you'd like to run | 19:00 |
clarkb | Anything else to announce | 19:01 |
clarkb | ? | 19:01 |
fungi | foundation ml discussion | 19:02 |
fungi | though i think it was already pointed out last week | 19:02 |
clarkb | oh right | 19:02 |
clarkb | it was but it is a good reminder | 19:02 |
clarkb | #link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/thread/3B7OWPRXB4KD2DVX7SYYSHYYRNCKVV46/ | 19:02 |
clarkb | the foundation is asking for feedback on a major decision | 19:02 |
clarkb | you can respond on that mailing list thread or reach out to Jonathan directly. fungi and I are happy to help point you in the right direction if you need addresses etc | 19:03 |
clarkb | #topic Zuul-launcher image builds | 19:04 |
clarkb | There are ubuntu jammy and noble nodes now and Zuul is attempting to dogfood the test nodes through the zuul launcher | 19:04 |
clarkb | this has been helpful for finding additional bugs like needing to clear out used but unneeded disk consumption on the launcher | 19:04 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/942018 Increase working disk space for zuul-launcher | 19:04 |
clarkb | I think this landed. Might be a good idea to check that the launcher restarted with the new temp dir configured | 19:05 |
clarkb | #link https://review.opendev.org/c/zuul/zuul/+/940824 Dogfood zuul launcher managed nodes in a zuul change | 19:05 |
corvus | i manually restarted it; it didn't do it automatically | 19:05 |
clarkb | this change is still hitting node failures possibly due to hitting quota limits (the lack of coordination between nodepool and zuul-launcher can cause this) | 19:05 |
clarkb | ack | 19:05 |
corvus | yeah, we can only dogfood it during quiet times due to the lack of quota error handling | 19:06 |
corvus | that's next on my list to implement | 19:06 |
clarkb | One other idea that came up with all of my apparmor struggles last week was we might want to have ubuntu images with apparmor preinstalled to mimic the real world better and the zuul launcher cutover might be an opportunity to figure that out | 19:07 |
clarkb | but I think any way we approach that we risk unhappiness from newly broken jobs so probably do need to be careful | 19:07 |
clarkb | Anything else on this topic? | 19:08 |
corvus | i feel like building those images would be easy to do with niz at this point, though using them will be hard | 19:08 |
corvus | due to the quota issue | 19:08 |
clarkb | ya the build side should be straightforward. Just tell dib to add the package | 19:08 |
corvus | so if it's not urgent, then i'd say wait a bit before starting that project | 19:08 |
corvus | and if it is urgent, maybe do it with nodepool | 19:09 |
clarkb | ack I don't think it is urgent as this has been the status quo. We're just noticing more because noble is a bit more strict about it in the upstream packaging when installed | 19:09 |
corvus | ++ | 19:09 |
clarkb | on the opendev system-config side of things we're installing apparmor explicitly in many places now which should cover things well for us specifically | 19:09 |
clarkb | #topic Unpinning our Grafana deployment | 19:10 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/940997 Update to Grafana 11 | 19:10 |
clarkb | I think this got lost in all the noble apparmor fun last week, but reviews still very much welcome | 19:10 |
clarkb | I suspect that we can rip the bandaid off and just send it for this change and if something goes wrong we revert | 19:11 |
clarkb | but I'd be curious to hear what others think in review | 19:11 |
fungi | seems fine to me, yep | 19:11 |
clarkb | today is a bit of a bad day with everything else going on but maybe tomorrow we land that and see what happens then | 19:12 |
clarkb | #topic Upgrading old servers | 19:12 |
clarkb | As mentioned we ran into more apparmor problems with Noble | 19:12 |
clarkb | previously we had problems with podman kill and docker compose kill due to apparmor rules. We worked around this by using `kill` directly | 19:13 |
clarkb | upstream has since merged a pull request to address this in the apparmor rules | 19:13 |
clarkb | I haven't seen any movement downstream to backport the fix into noble so we may be stuck with our workaround for a while. | 19:13 |
clarkb | Separately we discovered that apparmor rules affect where rsyslogd can open sockets on the filesystem for syslog and containers' ability to read and write to those socket files | 19:14 |
clarkb | I am only aware of a single place where we make use of this functionality and this is with haproxy because it wants to log directly to syslog | 19:14 |
clarkb | we hacked around that with an updated rsyslogd apparmor policy that was suggested by sarnold from the ubuntu security team | 19:14 |
clarkb | I filed a bug against rsyslogd in ubuntu for this | 19:15 |
clarkb | But overall things continue to work and we have been able to find workarounds for the issues | 19:15 |
clarkb | tonyb: not sure if you are around and have anything to add to this topic | 19:16 |
tonyb | Nothing from me | 19:16 |
clarkb | #topic Sprinting to Upgrade Servers to Focal | 19:17 |
clarkb | bah I keep forgetting to fix that typo | 19:17 |
clarkb | #undo | 19:17 |
opendevmeet | Removing item from minutes: #topic Sprinting to Upgrade Servers to Focal | 19:17 |
clarkb | #topic Sprinting to Upgrade Servers to Noble | 19:17 |
clarkb | much of the previous info was discovered when trying to make headway on the backlog of server upgrades by upgrading them to Noble | 19:17 |
clarkb | I would say this was successful in finding previously unknown problems and working through them for future systems. But a bit disappointing in that I only managed to upgrade zuul-lb and codesearch servers | 19:18 |
clarkb | #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint | 19:18 |
clarkb | I was trying to keep track of things on this etherpad and may continue to do so (it needs some updates already) | 19:18 |
clarkb | and basically keep pushing forward on this as much as possible. Would be helpful if others can dive in too | 19:19 |
clarkb | Starting tomorrow I'll try to pick another one or two servers off the list and work on getting them done next | 19:19 |
clarkb | #topic Running certcheck on bridge | 19:20 |
clarkb | fungi: any updates on this item? | 19:20 |
fungi | nope :/ | 19:21 |
clarkb | #topic Service Coordinator Election | 19:22 |
clarkb | Today is the last day of the nomination period which ends in ~4.5 hours at the end of day UTC time | 19:22 |
clarkb | I haven't seen any nominations. Does this mean we're happy with status quo and want me to keep sitting in the seat? | 19:23 |
clarkb | I am more than happy for someone else to volunteer and would support them however I can too fwiw | 19:23 |
fungi | seems so | 19:24 |
clarkb | ok I guess if I don't hear different by ~2300 UTC I can make it official | 19:25 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NGS2APEFQB45OCJCQ645P5N6XCH52BXW/ | 19:25 |
fungi | thanks! | 19:25 |
clarkb | a link back to the details I wrote down previously | 19:25 |
clarkb | (that isn't my nomination) | 19:25 |
clarkb | #topic Working through our TODO list | 19:26 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:26 |
clarkb | just a reminder we've got a rough todo list on this etherpad | 19:26 |
clarkb | if you would like to dive in feel free to take a look there and reach out if there are any questions | 19:26 |
clarkb | #topic Open Discussion | 19:27 |
clarkb | cloudnull let us know there is a second raxflex region we can start using | 19:27 |
clarkb | The rough process for that would be figuring out the clouds.yaml entry to use it. Updating clouds.yaml and enrolling the region in our cloud launcher config | 19:28 |
clarkb | then launch a new mirror, then add the region to nodepool / zuul-launcher | 19:28 |
fungi | last call for objections on the pyproject.toml series of changes for bindep. if there are none i'll merge up through the change dropping python 3.6 support | 19:28 |
clarkb | the region is dfw3 | 19:28 |
fungi | bindep changes 816741, 938520 and 938568 | 19:29 |
clarkb | I'm hoping we can just add that region to the region list in clouds.yaml for raxflex and be off to the races | 19:29 |
clarkb | we may also need to upload a noble cloud image I guess | 19:29 |
fungi | with raxflex-dfw3 we can discuss whether we want to stick to floating-ip vs publicnet | 19:30 |
clarkb | I think using floating ips has been valuable to address some assumptions that were made in CI but I don't think we necessarily need to care unless a cloud forces us to use floating IPs | 19:30 |
clarkb | so ya maybe just do the simplest thing which is publicnet and worry about floating ips if they ever become strictly required | 19:30 |
clarkb | fungi: yesterday I noticed that some of the log files on lists.o.o are quite large. Not sure if you saw that note I posted to #opendev | 19:31 |
clarkb | I think we may need to add logrotate rules for those files | 19:31 |
corvus | what's the quota situation for both rax regions? | 19:31 |
corvus | both rax flex regions | 19:31 |
clarkb | but wanted to make sure you had a chance to take a look and weigh in on it before I wrote any changes | 19:31 |
clarkb | corvus: I haven't looked at dfw3 but sjc3 is minimal | 19:31 |
clarkb | I think the nodepool config is our quota | 19:32 |
clarkb | I wanted to replace the networking setup in sjc3 before we ask to increase it | 19:32 |
fungi | i think there's something like a 50 node quota on all accounts in flex by default, but would have to check | 19:32 |
clarkb | we have max servers set to 32 in sjc3 | 19:33 |
clarkb | I think that is the current limit there | 19:33 |
fungi | we might have sized max-servers to 30 for other reasons (ram? vcpu?) | 19:33 |
fungi | rather than the max instances limit | 19:33 |
clarkb | corvus: one thing that just occurred to me is we could dial back max-servers by ~5 servers in the region zuul-launcher is trying to use | 19:34 |
corvus | speaking of, https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1 looks suspiciously flat | 19:34 |
clarkb | corvus: that might help with the quota issues if you think that is useful before making zuul-launcher quota aware | 19:34 |
frickler | iirc we tested iteratively what worked without getting launch failures, not sure why launches failed | 19:34 |
corvus | clarkb: good idea | 19:34 |
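A minimal sketch of the sizing idea discussed above, assuming max-servers is bounded by whichever quota (instances, RAM, vCPU) runs out first, with a few nodes held back for zuul-launcher. The function name and the RAM/vCPU figures are hypothetical and not taken from the log; the 50-instance figure is only the unverified default mentioned above, and the reserve of 5 is the suggested dial-back.

```python
def effective_max_servers(instance_limit, ram_limit_mb, vcpu_limit,
                          ram_per_node_mb, vcpus_per_node,
                          reserved_for_launcher=0):
    """Return how many nodes nodepool could safely be allowed to boot.

    Each quota is converted into a node count; the smallest count wins.
    reserved_for_launcher is headroom left for zuul-launcher, which is
    not yet quota aware.
    """
    by_instances = instance_limit
    by_ram = ram_limit_mb // ram_per_node_mb
    by_vcpu = vcpu_limit // vcpus_per_node
    tightest = min(by_instances, by_ram, by_vcpu)
    # Never return a negative value if the reserve exceeds the quota.
    return max(0, tightest - reserved_for_launcher)


# Hypothetical numbers: 50 instances, 256 GiB RAM, 256 vCPUs,
# with 8 GiB / 8 vCPU test nodes and 5 nodes reserved.
example = effective_max_servers(
    instance_limit=50,
    ram_limit_mb=256 * 1024,
    vcpu_limit=256,
    ram_per_node_mb=8192,
    vcpus_per_node=8,
    reserved_for_launcher=5,
)
print(example)
```

With these illustrative inputs RAM and vCPU are the binding limits rather than the instance count, which matches the guess that max-servers may have been sized for reasons other than the instance limit.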
clarkb | corvus: ooh that looks like we're in a can't delete servers loop | 19:34 |
corvus | is that a rax problem or a nodepool problem? | 19:35 |
frickler | you'd need to check nodepool logs probably | 19:35 |
clarkb | ya I'm not sure yet. usually it is a cloud provider problem | 19:36 |
clarkb | but it could be that we fail to make valid delete requests so the db state changes to deleting but the cloud never sees it | 19:36 |
corvus | (so rough plan is: 1. add dfw3; 2. rebuild sjc3 network; 3. re-enable sjc3 ?) | 19:36 |
corvus | oh ok, just sounded like maybe it was something that happened often | 19:36 |
clarkb | corvus: yup | 19:37 |
corvus | nodepool.exceptions.LaunchStatusException: Server in error state | 19:37 |
corvus | oh wait thats ovh | 19:37 |
corvus | i guess it's possible there may be more than one unhappy public cloud simultaneously | 19:37 |
clarkb | probably not the first time that has happened | 19:38 |
corvus | The request you have made requires authentication. (HTTP 401) | 19:38 |
corvus | that is the sjc3 error. for sure. | 19:38 |
fungi | oh joy, account credentials probably expired out of the cache there again | 19:39 |
corvus | fungi: tell me more please | 19:40 |
fungi | i think the most recent time that happened we brought it to the attention of someone in support? time before that i think it was "fixed" by logging into the webui with that account | 19:41 |
fungi | (which was when we were first setting it up and unable to figure out why the creds we'd been given didn't work) | 19:41 |
corvus | looks like zl01 saw the same error | 19:42 |
corvus | so it's not limited to a single host or session | 19:42 |
clarkb | ya so probably need to bring it up with them again | 19:42 |
fungi | i can confirm i see the same trying to openstack server list from bridge | 19:44 |
clarkb | lets followup with them after the meeting | 19:45 |
tonyb | Now that the caffeine has hit I have something for open discussion | 19:46 |
clarkb | I also wanted to point out I've pushed like 4 changes to update zuul/zuul-jobs use of container images for things like registries and test nodes to pull from quay mirrors of images instead of docker hub | 19:46 |
clarkb | all in an effort to slowly chip away at the use of docker hub | 19:46 |
clarkb | tonyb: go for it | 19:46 |
tonyb | I've mentioned it before but RDOProject would like to use OpenDev got gerrit | 19:47 |
tonyb | s/got/for/ | 19:47 |
tonyb | I think there is a reasonable overlap, how would RDO go about getting a tenant / namespace? | 19:48 |
tonyb | Is there a documented process? | 19:48 |
frickler | just propose new repo(s) with a new namespace? | 19:48 |
clarkb | its basically the existing request a project process | 19:48 |
fungi | for a separate zuul tenant it would be some configuration added to zuul and a creation of a dedicated config project for the tenant | 19:49 |
clarkb | yup that | 19:49 |
tonyb | Okay | 19:49 |
tonyb | That's easy | 19:49 |
frickler | not sure if a new tenant is really needed or populating the opendev tenant a bit more would be ok? | 19:49 |
fungi | worth talking about what the repos would contain too though... we've had problems in the past with packaging projects that carried forks of other projects | 19:49 |
tonyb | Can gerrit support multiple Zuuls? I suspect there might be some desire to start with gerrit but leave the existing zuul in place | 19:50 |
fungi | and the other concern is around testing, projects in our gerrit need to use our zuul for gating | 19:50 |
clarkb | you can have third party CI but you can only have one gatekeeper | 19:50 |
frickler | tonyb: existing zuul = rdo zuul? | 19:51 |
tonyb | Okay | 19:51 |
tonyb | frickler: Yes | 19:51 |
fungi | another point would be that we could really only import the commits for the existing repos, not their gerrit data from another gerrit | 19:51 |
fungi | so old change review comments and such would be lost | 19:52 |
clarkb | oh yup. We explicitly can't handle that due to zuul iirc | 19:52 |
tonyb | So OpenDev zuul could be used for gating but the existing RDO zuul could be used as 3rd party where a -1 means no gating | 19:52 |
clarkb | once you import changes from another gerrit you open yourself to change number collisions and zuul doesn't handle those iirc | 19:52 |
tonyb | fungi: understood | 19:52 |
fungi | -1 would be advisory, it couldn't really block changes from merging i don't think? but it could be used to inform reviewers not to approve | 19:53 |
clarkb | tonyb: the existing check and gate queues don't let third party ci prevent enqueuing to gate. I think if that is desirable then you would want your own tenant and you can set the rule up that way | 19:53 |
tonyb | I think we're only talking about code not gerrit history | 19:53 |
tonyb | Okay noted. | 19:53 |
clarkb | fungi: depends on the pipeline configs. It is advisory as configured in existing tenants but a new tenant could make that a bit stronger with a stricter pipeline config | 19:53 |
fungi | the other thing that's come up with projects trying to make extensive use of third-party ci is that it's challenging to require that a third-party ci system report results before approval | 19:54 |
tonyb | That all sounds doable to me, but I admit I don't know a lot about the specifics of RDO gating | 19:54 |
clarkb | that should be solvable with a tenant specific pipeline config basically doing clean check with a third party ci | 19:54 |
clarkb | I'm not sure I would necessarily suggest these things particularly if we (opendev) start getting questions about why rdo can't merge things and it is due to an rdo zuul outage | 19:55 |
fungi | yeah, zuul would need to look for votes from other accounts besides its own, which is configurable at the pipeline level | 19:55 |
frickler | it might be easier to just integrate rdo's node capacity into opendev | 19:56 |
corvus | (i believe it is possible to import changes into gerrit and avoid number collisions, but care must be taken) | 19:56 |
frickler | and possibly also considered more "fair" in some sense | 19:56 |
tonyb | frickler: that's an option, but one I've only recently floated internally | 19:56 |
fungi | but also this is sort of similar to the "why can't i use opendev's zuul but have my project in github?" discussions. we provide an integrated gerrit+zuul solution, it's hard for us to support external systems we don't control, and not how we intended it to be used | 19:57 |
tonyb | I also didn't want to bring that up now for fear of it seeming like we can add capacity *iff* we can do $other_stuff | 19:57 |
tonyb | that is very much not the case | 19:57 |
tonyb | Thank you all. I think I understand better now what's needed. I also think we can do this in a way that works well and extends OpenDev. I'll bring this up with RDO and try to get a better conversation started | 19:59 |
fungi | on the topic of gerrit imports, it looks like review.rdoproject.org is currently running gerrit 3.7.8, so not too far behind | 20:00 |
tonyb | As I'll be in the US next month I think that works better for TZ overlap | 20:00 |
clarkb | we are at time | 20:00 |
clarkb | thank you everyone and feel free to continue discussion on the mailing list or in #opendev | 20:00 |
clarkb | I've already pinged cardoe there about the account login issue | 20:01 |
tonyb | fungi: Yeah not too far, but the lack of interest in upgrading it is part of the motivation to switch | 20:01 |
clarkb | #endmeeting | 20:01 |
opendevmeet | Meeting ended Tue Feb 18 20:01:26 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:01 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-18-19.00.html | 20:01 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-18-19.00.txt | 20:01 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-18-19.00.log.html | 20:01 |