Thursday, 2025-11-13

priteauHello. I am seeing lots of CI errors while accessing mirror.dfw3.raxflex.opendev.org. It is refusing all connections.08:28
tonybpriteau: I'll look. We did have issues with that server but we thought they were fixed08:29
priteauThanks. For example here: https://9b62756387dc03e179ba-9f2ef8380661efedee2428ccb67a9720.ssl.cf2.rackcdn.com/openstack/ec48ad54ccda4860928611e9d04a946c/primary/ansible/install-pre-upgrade08:30
tonybSo that's about an hour ago right?08:31
tonybpriteau: Should be fixed08:35
tonyb#status log apache wasn't running on mirror.dfw3.raxflex.opendev.org with nothing obvious in the logs.   Restarted.08:39
opendevstatustonyb: finished logging08:39
priteauThanks!08:39
opendevreviewMerged openstack/project-config master: Grafana: add core graphs to rackspace-flex  https://review.opendev.org/c/openstack/project-config/+/96698110:59
*** rlandy_ is now known as rlandy12:01
fungiapache dying on mirror.dfw3.raxflex definitely seems odd. that's also the server i just brought back online yesterday after a multi-week outage14:30
fungiyeah, very odd. i booted mirror.dfw3.raxflex around 21:35 utc. i don't see any reference in journald for apache starting until tonyb presumably addressed it at 08:34:26 utc14:37
fungiwhen i tested i was so focused on making sure afs was working that i'm not sure i confirmed apache was actually serving things, so maybe apache didn't actually start at boot for some reason?14:38
fungimaybe /var/cache/apache2 wasn't mounted when systemd tried to start apache? though i find no evidence of that in syslog either14:40
clarkbdoes journalctl -u apache2 indicate anything useful?15:48
fungiit did not sadly, no15:56
fungithat was the first thing i checked15:56
fungishows it stopping when we tried to reboot back on the 3rd, and shows when tonyb started it manually, which is what's leading me to suspect it never started at boot15:57
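
A minimal sketch of the kind of check being discussed here, assuming a systemd-managed apache2 unit on the mirror host; the exact commands the admins ran are not shown in the log:

```python
# Hedged sketch: confirm the unit is enabled to start at boot and dump the
# journal entries for it, limited to the current boot. The unit name
# "apache2" matches the Debian/Ubuntu packaging discussed above.
import subprocess

for cmd in (
    ["systemctl", "is-enabled", "apache2"],      # prints "enabled" if it should start at boot
    ["journalctl", "-u", "apache2", "-b", "0"],  # unit messages from the current boot only
):
    subprocess.run(cmd, check=False)
```
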
clarkbfor some reason I thought I checked ping and apache but maybe it was ping and ssh15:58
clarkbapache never starting on boot would certainly explain what you discovered so is probably the most likely case15:58
fungiand yeah i merely cat'ed files from the afs ro replica path in a shell there15:58
fungiit didn't dawn on me that apache might not start15:59
fungibut it certainly will from now on15:59
clarkbI'm going to catch up on my morning but then expect to approve the two gerrit update changes before too long. Chime in if there is reason to delay16:03
fungii'll be steppout out momentarily but expect i'll be back before those deploy regardless16:03
fungier, stepping out16:03
clarkbya I don't think the real fun will start until 2100 so that tonyb can join in16:04
fungiokay, afk for a little while, probably back in an hour-ish16:15
clarkbenjoy16:16
clarkbgerrit changes have been approved17:19
clarkbI did some self re-review just to be sure I felt ready and can't come up with any reason not to proceed that isn't "gerrit restarts have been annoying recently"17:19
fungicool, i'm back and ready to help with restarts later whenever17:30
clarkbidea that just occurred to me: should we consider reducing the rax classic limits in zuul relative to the amounts we're able to increase quotas elsewhere? Mostly thinking that cloud in particular doesn't support the newer instruction sets so we can't run all images there and the disk + fs setup makes some jobs slower to bootstrap there than elsewhere (specifically devstack style17:44
clarkbjobs).17:44
clarkbI think the way launcher works it will give the rax classic provider proportionally more work because they are larger. We can tune it away from that by reducing its quota17:44
clarkbthe downside to that is we'd have less quota than we can actually field when demand is high17:45
fungior maybe we decide where the demarcation line is when we can declare that we have sufficient quota elsewhere to just stop using rax classic for job nodes17:46
fungiplatform-specific challenges aside (xen, performance, et cetera), we also run into provider-oriented problems like slow api responses, accumulation of undeletable cruft, and so on17:47
clarkbanother thought that occurred to me was adding some sort of prioritization to providers in the launcher so that classic would only be considered if everything else was full up17:48
clarkbbut I'm not sure that is a generally useful feature or if that is just something that makes sense in this one circumstance which may not be worth solving directly like that17:49
fungiit's a common use case in other cloud-oriented applications. long ago we used to refer to it as "cloudbursting" for situations where you had low-cost/sunk-cost infrastructure like a private cloud that you preferred to run your workloads on, but wanted to "burst" into a public cloud when your backlog reached a set threshold, basically paying some outside provider to avoid falling17:51
fungifarther behind17:51
fungithough if the idea is to prefer performant resources and then fall back on a slower/feature-limited alternative under load, i think that's a recipe for increasing job failure rates. the tests will become attuned to the non-overflow providers, and so grow assumptions that don't hold when they spill over into rax classic17:53
clarkbthat is a good point17:54
fungiit happens already, but is kept somewhat in check by some percentage of those jobs always running there17:54
clarkbIn my mind it's more about improving throughput. Preferring rax classic means we're choosing some of our slowest nodes more than others17:54
clarkbI don't really care too much about individual job performance as much as overall throughput in this context17:54
clarkband ya that risk is fine if we think we'll never need those resources again but if we do and we don't operate in that environment anymore it would be painful17:55
fungiif we can tune the ratio of assignments but still keep some jobs always running there, we probably mostly avoid the acclimitzation problem17:55
fungiif we have lengthy periods where no jobs run there, the risk is much higher17:56
fungiacclimatization17:56
fungiit's a hard word to type after lunch17:57
corvusnodepool has a priority feature for the provider pools.... zuul-launcher does not.  i think it would be generally useful and we can probably accommodate it.18:20
corvusan interesting question, which fungi is getting at, is do we set it up so that it's strictly ordered (fill everything else first, then rax-classic), or slightly more random (give rax-classic a 1% chance of handling a request).18:22
clarkbI think I'm convinced we'd want to make it proportional like it is today but probably with more direct control of the proportions18:23
clarkbto ensure that you're semi-constantly checking that you can still operate in each environment18:23
corvuscurrently it works by going through a sorted list of providers and finding the first one that can fulfill the request, but they are sorted like this:18:24
corvus1) unassigned nodes; 2) quota used percentage (quantized to 10% buckets); 3) random18:24
corvusso the primary sort is by unassigned nodes (so we grab those first)18:24
corvusthen after that we use the provider that's least used18:25
corvusand among providers with the same usage, the order is randomized18:25
corvus(and by "same usage" i mean "within 10%" so that the randomization is meaningful)18:25
opendevreviewMerged opendev/system-config master: Drop /opt/project-config/gerrit/projects.ini on review  https://review.opendev.org/c/opendev/system-config/+/96608318:26
opendevreviewMerged opendev/system-config master: Update Gerrit images to 3.10.9 and 3.11.7  https://review.opendev.org/c/opendev/system-config/+/96608418:26
fungiprobably the ideal algorithm would be distributing across a tunable proportion of available providers; the real question is how backlogged requests can/should factor into those decisions (for example, does a solution exist where we always run with x% distributed to that provider across available providers, but prefer to also delay satisfying some requests unless they reach a backlog18:26
fungithreshold?)18:26
corvusso if we did what clarkb suggests, one way to do that might be to tweak the step at #218:26
fungiyeah, i can imagine a couple of different tunable distribution methods, but i expect they directly conflict with one another and we'd need to decide which to support (or support multiple distribution algorithms, but that seems less like it would be worth the effort)18:28
corvusfungi: there isn't anything like that in z-l today -- eventually if all providers become full enough, it's just a random allocation among all of them18:28
corvus(except that we try to avoid processing requests that can't be fulfilled, so if things are working well and the clouds are not lying to us, we just wait for the next available provider)18:29
fungiright, algorithmically speaking, "random" could get a weighted distribution applied to it18:29
clarkbya reusing unassigned nodes first regardless of where they are makes sense as the first criterion once filtering for ability to fulfill the request18:29
corvus(usually the cloud lies to us though, and we end up bouncing over-quota requests around different providers)18:29
fungioh, good point, "already available" vs "need to be booted" is another critical efficiency factor to figure in18:30
clarkbthe naive approach of applying a preference ratio to the existing #2 criterion seems like it could starve clouds unexpectedly18:30
corvusyeah.  there's a discontinuity in the algorithm at 90% though -- we quantize to 10% buckets below 90, but above 90%, it's actual values, so we try to land at exactly max quota for every provider.  we could incorporate something like that into the proposed scaling.18:32
clarkbmanage-projects is running now for the updated docker compose file for gerrit. Manage-projects was already bind mounting the homedir location for projects.ini so this should be a noop18:32
corvusoh, sorry, the discontinuity is actually at 100%.  but still, point stands.18:33
corvus(above 100% is a launcher with a backlog)18:33
corvuss/launcher/provider/18:33
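
A sketch of that bucketing behaviour, plus one possible place a per-provider preference weight could be applied (the weight parameter is purely hypothetical, not an existing zuul-launcher option):

```python
def quantized_usage(used_pct, quantum=10, weight=1.0):
    # Hypothetical: a weight > 1.0 makes a provider look "fuller" than it is,
    # so it gets chosen proportionally less often.
    used_pct *= weight
    if used_pct <= 100:
        # at or below full, compare in coarse 10% buckets so the random
        # tiebreak spreads load across similarly-loaded providers
        return int(used_pct // quantum) * quantum
    # above 100% (a provider with a backlog) compare exact values, so the
    # least-backlogged provider is always preferred
    return used_pct
```
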
fungianyway, back to the original topic, my opinion is that we just need to decide when to stop booting job nodes in rackspace classic (by adjusting our configuration), probably based on our determination that we have sufficient quota in other providers. something more flexible could be nice, but is probably not necessary18:34
fungii don't think *we* have a strong use case for such features in zuul-launcher, though we might take advantage of them if they existed18:36
clarkbya manage projects logs lgtm. Lots of skipping due to matching shas18:36
clarkbthat implies a noop but also I don't think it would've gotten that far if there was a problem with the projects.ini config which tells it where and how to connect to gerrit18:36
fungi(and others probably do have stronger use cases, so what's optimal for them should be more informative of actual development effort, if any)18:36
clarkbsorry for splitting the discussion I just want to make sure the gerrit stuff updates as expected18:37
fungiyep, appreciated, trying to keep tabs on it myself18:37
clarkbfungi: yes agreed as far as this not being a strong need18:37
clarkbI was mostly brainstorming if we can start managing the old cloud more intelligently in response to new resources18:38
fungii do feel like we don't get a lot of complaints about being backlogged on change queues, but we do hear about performance/featureset problems with rackspace classic nodes somewhat regularly still18:39
fungiso i'd be willing to trade a bit more of the former for less of the latter18:39
clarkbboth buildsets in deploy succeeded. I'm checking the docker-compose.yaml now18:40
clarkbya projects.ini bind mount is gone from docker-compose.yaml. I realized the new images don't update docker-compose.yaml, so for that I need to look at quay18:41
fungiperfect18:42
clarkbquay.io has a banner warning of a tls cert update for their cdn occurring on the 17th. The reason for the warning is the trust chain will change. But if you open the document explaining this it's subscriber content only :/18:42
clarkbanyway I don't expect that to affect us and https://quay.io/repository/opendevorg/gerrit?tab=tags says 3.10 did update as expected18:42
fungiTheJulia: do you happen to know who at rh would make the decision to fix the fact that the tls certificate warning banner at the top of https://quay.io/ takes users to a red hat knowledgebase article that requires a red hat subscription?18:46
fungiseems like some wires got crossed wrt appropriate publication platforms for public-facing services18:46
clarkbfungi: tonyb: I think that around 2000 UTC we can start to get ready, pull and confirm the new image looks correct, prepare the suite of commands to clear caches and replication queues, then ~2100 UTC send it18:48
fungisounds good, i'll be here18:48
clarkbmaybe around 2000 we also send a status notice indicating our plan to take a short outage18:48
clarkbI'll go ahead and warn the openstack release team now18:49
fungigood idea18:49
fungithough looks like it's been about 5 hours since anyone there was approving new release requests18:49
TheJuliaI suspect it is because the "easy" path for teams is the kb which does require an account and possible subscription entitlement. I suspect the right thing to do would be to open an issue in issues.redhat.com in PROJQUAY, type bug, component set to quay.io 18:50
fungithanks! now to see if i have a "red hat account" so i can log into jira18:52
TheJuliaI take it you don't at all... hmm18:58
fungino, i do, took me a few minutes to find it since i don't use it often18:58
funginow i'm just fumbling around the interface, shouldn't take me long18:59
fungioh, that's fun. i have a preexisting account on the rh jira with the same e-mail address as my rh sso account, so it won't let me log into jira due to that conflict?19:01
fungi"SAML-Login failed: You have logged in with username [...], but you already have a Jira account with [...]. If you need assistance, email jiraconf-sd@redhat.com and provide tracker-id."19:01
fungii guess this is where i hit the "too much effort required to bother reporting something so trivial" threshold19:02
clarkbits gerrit openid all over again19:02
fungiyeah, complicated in my case because the conflict is on an e-mail address that redhat.com's mail provider rejects messages from, so if i e-mail them to try and get it resolved i'd have to do it from a different address than the one for the account in question19:10
clarkbheh I've just realized my 2000 UTC math coincides with lunch. I should be able to make that work just fine though19:11
clarkbmy mental timezone map post DST drop hasn't updated properly yet19:11
TheJuliafungi: opening an issue for you now19:12
fungiTheJulia: oh thank you!19:15
TheJuliaPROJQUAY-977219:15
fungiappreciated. we don't think it impacts us at all, but suspect it may impact other users who don't have rh accounts to read the article explaining whether or not it does19:16
TheJuliaIt is kind of not great, tbh19:17
TheJuliaand the only way for teams to sometimes know is when they triage such issues as a team19:18
clarkbhow does this look? #status notice The OpenDev team will be restarting Gerrit at approximately 2100 UTC in order to pick up the latest 3.10 bugfix release.20:00
clarkbtonyb: I'm happy to wait for your morning to start to do the on server actions so that you're able to follow along and/or be the typing driver20:02
clarkband then I guess the other question is do we want to jump on meetpad to make coordination more synchronous or are we happy with irc?20:04
corvusclarkb: msg lgtm20:07
fungiyeah, i'm good with the wording20:07
clarkboh looks like I said we'd get started at 2100 with actual restart happening at about 2130 when planning on Tuesday20:09
clarkbso maybe update that warning to 2130? but otherwise send it at 2030?20:09
fungifine by me, better stick with the prior plan yeah20:09
clarkbI then recorded that in my notes as 2100 hence the confusion.20:12
clarkbMostly I want to make sure tonyb has a chance to participate20:12
fungihah20:12
fungiyeah, it's all good with me, i'll be here regardless20:12
clarkbok sending the notice now20:30
clarkb#status notice The OpenDev team will be restarting Gerrit at approximately 2130 UTC in order to pick up the latest 3.10 bugfix release.20:30
opendevstatusclarkb: sending notice20:30
-opendevstatus- NOTICE: The OpenDev team will be restarting Gerrit at approximately 2130 UTC in order to pick up the latest 3.10 bugfix release.20:30
tonybI'll be at my laptop in a couple of minutes.20:31
fungithere's no rush20:31
clarkbya I may take this as an opportunity for a short break before we dive in. I think the next steps are deciding if we want to use meetpad or not, then starting a screen session and pulling and verifying the image20:32
clarkbthen we wait for ~2130 UTC and can proceed with restarting things at that time20:32
opendevstatusclarkb: finished sending notice20:33
fungii'm happy to jump in a meetpad room if people want that (e.g. for screen sharing), but gnu screen multi-attach works fine as long as we're all infra-root20:34
opendevreviewMerged openstack/project-config master: Set noop job for the governance-sigs repository  https://review.opendev.org/c/openstack/project-config/+/96675520:39
tonybclarkb, fungi: ready when you are 20:47
clarkbfungi: ya I was just thinking maybe synchronous voice chat would be helpful to explain what we're doing but I'll let tonyb decide if that is helpful. I don't think we should use meetpad to screenshare; instead we can use screen to share the context20:52
clarkbtonyb: want to start a root screen we can attach to? Then we can take notes on what the current image is that we're running and fetch and check the new image20:53
tonybCan do20:54
clarkbtonyb: if you do a docker ps -a you should see that the container's image matches that image there too20:58
clarkbya so the container running gerrit (gerrit-compose-gerrit-1) says its image is quay.io/opendevorg/gerrit:3.10 which is currently 16e3e6710a1221:00
clarkbthis is mostly for historical record keeping and theoretical rollback paths21:00
tonybOkay21:00
clarkbso now I think you can do `docker-compose pull` and that should only update images for us and not touch the running containers. Then we can cross check the new container image with what we have in quay.io21:01
clarkbor `docker compose pull` since the - is optional on noble nodes21:01
tonybhttps://meetpad.opendev.org/gerrit-update if y'all want21:01
clarkbtonyb: so the new image can be found at https://quay.io/repository/opendevorg/gerrit?tab=tags21:04
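
One way to make that cross-check concrete (a hedged sketch, not the exact commands typed in the screen session; the container and tag names are the ones mentioned above):

```python
import subprocess

def image_id(args):
    return subprocess.check_output(args, text=True).strip()

# image currently used by the running gerrit container
running = image_id(["docker", "inspect", "--format", "{{.Image}}",
                    "gerrit-compose-gerrit-1"])
# image id the tag points at after "docker compose pull"
pulled = image_id(["docker", "image", "inspect", "--format", "{{.Id}}",
                   "quay.io/opendevorg/gerrit:3.10"])
print("restart will pick up a new image" if running != pulled
      else "tag still points at the running image")
```
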
opendevreviewJames E. Blair proposed opendev/system-config master: Update zuul-client image location  https://review.opendev.org/c/opendev/system-config/+/96711121:22
corvusclarkb: tonyb zuul isn't super busy, but it's also not idle... if we want to be very conservative, we can pause event processing before you shut down gerrit.21:24
corvuslmk if you want to do that, i've got a command ready (which i can send to you if you want to run it)21:24
clarkbcorvus: will that pause merge attempts for changes already in the gate?21:24
corvusyes21:24
clarkbI think we should go ahead and do that21:25
corvus(that would be the main reason to do it; i'm not worried about disconnections or events or anything like that)21:25
corvusour helper script has an error, which i worked on correcting in 967111; in the meantime, i made a local copy on zuul01 that is corrected21:26
corvusso, as root on zuul01: /root/zuul-client manage-events --all-tenants --reason "Gerrit upgrade in progress" pause-result21:26
clarkbthanks!21:26
corvuswhen finished: /root/zuul-client manage-events --all-tenants normal21:26
clarkbshutdown is proceeding now21:30
clarkb[main] INFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.10.9-4-g309e8c8861-dirty ready21:34
clarkbhttps://groups.google.com/g/repo-discuss/c/TxeJ9lvW6Is/m/NGLT2D5aAQAJ I think this is related to the error we see from the plugin manager preloader21:40
clarkbhowever the specific error is different21:40
clarkbhttps://gerrit.googlesource.com/plugins/webhooks21:42
clarkbhttps://gerrit-review.googlesource.com/Documentation/cmd-index-start.html21:47
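
The online reindex command documented at that link looks roughly like the following when run against the review server's SSH API; the user, host, and port here are placeholders and the log doesn't show exactly what was run:

```python
import subprocess

# Hypothetical invocation of the documented "gerrit index start" SSH command;
# "admin" and the host/port are placeholders, not taken from the log above.
subprocess.run([
    "ssh", "-p", "29418", "admin@review.opendev.org",
    "gerrit", "index", "start", "changes",
], check=True)
```
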
opendevreviewClark Boylan proposed opendev/system-config master: DNM this is a change ot test gerrit functionality  https://review.opendev.org/c/opendev/system-config/+/96713121:49
clarkbthis didn't make it into the IRC logs so I'll add it: corvus pointed out that at least one change would've reported during the zuul pause period. If we hadn't paused, that could've broken a merge22:09
clarkbso that's a great little feature for us to remember we have now with zuul22:09
clarkbI'm going to delete my autoholds for the old gerrit 3.11 image (we just updated it to a new version) and etherpad 2.5.3 and gitea 1.25.122:17
clarkbI don't see any other autoholds in the openstack tenant22:17
clarkbusually I make note of this here so that I can delete other autoholds too but not necessary this time22:17
*** dmellado9 is now known as dmellado22:18
clarkbinfra-root gerrit reindexing has completed and reports `Reindex changes to version 86 complete` with 3 failed changes (the expected count)22:27
clarkbI have requested new autoholds for gerrit 3.10 and 3.11 jobs that will hopefully finally be used to test the upgrade and get that 3.10 to 3.11 upgrade done23:48
